153 Commits

Author SHA1 Message Date
Jingyu Zhou 05e63bc703
Fix orphaned storage server due to force recovery (#6914)
* Fix orphaned storage server due to force recovery

The force recovery can roll back the transaction that adds a storage server.
However, the storage server may now be at version B > A, the recovery version.
As a result, its peek to the buddy TLog won't return TLogPeekReply::popped to
trigger its exit, and instead gets a higher version C > B back. To the
storage server, this means the message is empty, so it does not remove itself
and keeps peeking.

The fix is to use the recovery transaction version, rather than the recovery
version, as the popped version for the SS. The recovery transaction version is
the first transaction after the recovery. Force recovery bumps this version to
a much higher version than the SS's version, so the TLog sets
TLogPeekReply::popped to trigger the storage server exit.

* Fix tlog peek to disallow returning an empty message between recoveredAt and recovery txn version

This contract is not explicitly set today, which can cause the storage server
to fail with assertion "rollbackVersion >= data->storageVersion()". This is
because if such an empty version is returned, the SS may advance its storage
version to a value larger than the rollback version set in the recovery
transaction.

The fix is to block the peek reply until the recovery transaction has been received.

* Move recoveryTxnReceived to be per LogData

This is because a shared TLog can have a first generation TLog which has already
set the promise, so later generations won't wait for the recovery version.
For the current generation, all peeks need to wait, while for older generations,
there is no need to wait (determined by checking if they are stopped).

* For initial commit, poppedVersion needs to be at least 2

To get rid of the previous unsuccessful recovery's recruited seed
storage servers.
2022-05-02 17:17:37 -07:00
sfc-gh-tclinkenbeard a71099471b Update copyright header dates 2022-03-21 13:36:23 -07:00
A.J. Beamon 250a88e682 Enforce that trace event suppression calls happen first when using trace event call chaining. Fix various instances where we weren't following this requirement. 2022-02-24 12:25:52 -08:00
Xiaoge Su abf73047ca Enforce std:: specifier rather than using namespace 2021-09-16 19:40:28 -07:00
FDB Formatster 2c788c233d apply clang-format to *.c, *.cpp, *.h, *.hpp files 2021-08-27 17:07:47 -07:00
Xiaoxi Wang d12bda94ae disable trace log 2021-08-16 16:33:20 -07:00
Xiaoxi Wang a97570bd06 solve mis-spelling, trace log and format problems 2021-08-11 18:26:00 -07:00
Xiaoxi Wang 1f6cee89ab merge master, fix conflicts 2021-08-10 10:01:45 -07:00
Steve Atherton 54c7036eaf Move role UIDs for MutationTracking TraceEvents from various inconsistent detail fields into the TraceEvent UID field. 2021-08-10 01:52:36 -07:00
Xiaoxi Wang 2263626cdc 200k test clean: enable remote Log pull from LogRouter 2021-08-07 09:53:32 -07:00
Xiaoxi Wang 80a5120df8 support LogRouter peek from TLog 2021-08-05 19:51:17 -07:00
Xiaoxi Wang 9986d2b0b6 change log severity 2021-08-02 22:33:17 -07:00
Xiaoxi Wang 3dfe7a51e0 trivial merge 2021-08-02 14:32:12 -07:00
Xiaoxi Wang fd74a16f35 format code 2021-08-02 14:24:20 -07:00
Xiaoxi Wang 2df0474fec merge master 2021-08-02 11:58:35 -07:00
Xiaoxi Wang ae2268f9f2 200k simulation: check stream sequence; delay in GetMore loop 2021-08-02 10:52:24 -07:00
Xiaoxi Wang 1c4bce17aa revert code refactor 2021-07-30 19:08:22 -07:00
Xiaoxi Wang 12d4f5c261 disable streaming peek for localities < 0 2021-07-28 14:11:25 -07:00
Xiaoxi Wang c6b0de1264 problem: OOM 2021-07-26 09:36:53 -07:00
sfc-gh-tclinkenbeard e006e4fed4 Fix -Wreorder-ctor warnings in LogSystemPeekCursor.actor.cpp and several other files 2021-07-24 00:48:13 -07:00
sfc-gh-tclinkenbeard 64dc1dc185 Fix -Wreorder-ctor warnings in NativeAPI.actor.cpp and several other files 2021-07-24 00:23:06 -07:00
sfc-gh-tclinkenbeard b9a22a61ef Fix many -Wreorder-ctor warnings 2021-07-23 17:33:18 -07:00
Xiaoxi Wang cd32478b52 memory error(Simple config) 2021-07-22 15:45:59 -07:00
Xiaoxi Wang 5046ee3b07 add stream peek to logRouter 2021-07-20 17:42:00 +00:00
Xiaoxi Wang f3667ce91a more debug logs; let tryEstablishStream wait until the connection is good 2021-07-19 18:43:51 +00:00
Xiaoxi Wang 227570357a trace log and reset changes; byteAcknownledge overflow 2021-07-15 21:30:14 +00:00
Xiaoxi Wang 066d534194 trivial changes 2021-07-14 16:19:23 +00:00
Xiaoxi Wang 6d1c12899d catch exceptions 2021-07-09 22:46:16 +00:00
Xiaoxi Wang 5a43a8c367 add returnIfBlocked in stream request 2021-07-08 19:32:58 +00:00
Xiaoxi Wang 15347773d9 fix double destruction memory bug 2021-07-07 22:55:49 +00:00
Xiaoxi Wang b6d5c8a091 implement tLogPeekStream 2021-07-06 23:14:58 +00:00
Xiaoxi Wang b50fda6b4b add simple streaming peek functions 2021-07-01 23:17:28 +00:00
Sreenath Bodagala 6275adc5a0 Address build failure
LogSystemPeekCursor.actor.cpp:
Check if "interf" is set before referencing it.
2021-05-13 21:38:07 +00:00
Sreenath Bodagala b0554b4554 Capture how fast an SS is catching up to its tLog-SS lag
Changes:
LogSystem.h, LogSystemPeekCursor.actor.cpp:
Add APIs to find the ID of the tLog from which an SS has fetched the latest
set of versions.

storageserver.actor.cpp:
Capture the number of latest set of versions fetched, the time (in seconds)
in which those versions were fetched, and the tLog from which they were
fetched. Add this information to a TraceLogEvent.

Capture how many versions an SS has fetched in the
2021-05-11 20:03:21 +00:00
FDB Formatster df90cc89de apply clang-format to *.c, *.cpp, *.h, *.hpp files 2021-03-10 10:18:07 -08:00
sfc-gh-tclinkenbeard 5020e3faa1 Make ILogSystem::IPeekCursor const-correct 2020-12-08 09:09:31 -08:00
David Youngworth d64cf8b9e3 Merge branch 6.3 into master 2020-11-17 11:22:45 -08:00
David Youngworth d0391db862 Merge branch 'release-6.2' into release-6.3 2020-11-16 10:15:23 -08:00
Markus Pilman bdd3dbfa7d remove duplicates 2020-11-10 14:01:07 -07:00
sfc-gh-tclinkenbeard 4669f837fa Add uses of makeReference 2020-11-07 22:10:18 -08:00
Vishesh Yadav 7b28de8a41 Add IDs to ConnectionReset TraceEvents 2020-11-04 14:06:49 -08:00
Vishesh Yadav 22b16302c3 Make ConnectionReset logs easier to query #3977
All TraceLogs that are related to ConnectionReset should be prefixed with
ConnectionReset. This should make it easy to query and aggregate by address and
role.
2020-11-02 15:10:51 -08:00
Evan Tschannen 12edadd059 Merge branch 'release-6.3'
# Conflicts:
#	CMakeLists.txt
#	fdbclient/Knobs.cpp
#	fdbclient/MasterProxyInterface.h
#	fdbrpc/simulator.h
#	fdbserver/MasterProxyServer.actor.cpp
#	tests/fast/CycleAndLock.txt
#	tests/fast/TxnStateStoreCycleTest.txt
#	tests/fast/VersionStamp.txt
#	tests/slow/ParallelRestoreOldBackupApiCorrectnessAtomicRestore.txt
#	tests/slow/ParallelRestoreOldBackupCorrectnessCycle.txt
#	versions.target
2020-08-31 19:33:34 -07:00
Evan Tschannen 29eec30183 Merge branch 'release-6.2' into release-6.3
# Conflicts:
#	CMakeLists.txt
#	build/Dockerfile
#	build/Dockerfile.devel
#	documentation/sphinx/source/downloads.rst
#	fdbserver/Knobs.cpp
#	fdbserver/LogSystem.h
#	fdbserver/MasterProxyServer.actor.cpp
#	fdbserver/TagPartitionedLogSystem.actor.cpp
#	fdbserver/WaitFailure.actor.cpp
#	fdbserver/fdbserver.vcxproj
#	fdbserver/fdbserver.vcxproj.filters
#	packaging/msi/FDBInstaller.wxs
2020-08-31 01:10:29 -07:00
Evan Tschannen 507c67c930 Added additional information to trace events 2020-08-26 11:42:23 -07:00
Meng Xu ef8c1060a2 Merge branch 'master' into mengxu/tmp-merge-6.3 2020-07-13 10:15:56 -07:00
A.J. Beamon b09dddc07e Merge branch 'release-6.2' into merge-release-6.2-into-release-6.3
# Conflicts:
#	cmake/ConfigureCompiler.cmake
#	documentation/sphinx/source/downloads.rst
#	fdbrpc/FlowTransport.actor.cpp
#	fdbrpc/fdbrpc.vcxproj
#	fdbserver/DataDistributionQueue.actor.cpp
#	fdbserver/Knobs.cpp
#	fdbserver/Knobs.h
#	fdbserver/LogSystemPeekCursor.actor.cpp
#	fdbserver/MasterProxyServer.actor.cpp
#	fdbserver/Status.actor.cpp
#	fdbserver/storageserver.actor.cpp
#	flow/flow.vcxproj
2020-07-10 15:06:34 -07:00
Evan Tschannen 33c9b1374a more compile fixes 2020-07-09 22:57:43 -07:00
Evan Tschannen f6163d0a79 fix compile errors 2020-07-09 22:53:02 -07:00
Evan Tschannen 717242a0ee reset WAN network connections every 5 minutes if responses take more than 500ms 2020-07-09 22:50:47 -07:00