642 Commits

Author SHA1 Message Date
Aaron Molitor 30b05b469c Revert "Refactor: ClusterController driving cluster-recovery state machine"
This reverts commit dfe9d184ff.
2021-12-24 11:25:51 -08:00
Aaron Molitor d174bb2e06 Revert "Refactor: ClusterController driving cluster-recovery state machine"
This reverts commit abd2959702.
2021-12-24 11:25:51 -08:00
Ata E Husain Bohra abd2959702 Refactor: ClusterController driving cluster-recovery state machine
diff-1: Address Jingyu's review comments

At present, the cluster recovery process consists of the following steps:
1. ClusterController clusterWatchDatabase actor recruits
   master/sequencer process.
2. Sequencer process implements the cluster recovery state machine,
   responsible for recruiting all other processes as well as restoring
   the cluster state.

This patch proposes a scheme where the cluster recovery state machine
is implemented and driven by the ClusterController process instead
of the Sequencer process.

Advantages of the scheme could be:
1. Simplified design where ClusterController recruits the "sequencer"
   process like other worker processes, compared to the current scheme
   where the "sequencer" process gets special treatment. In the newer
   scheme, the sequencer is responsible for maintaining/providing
   "committed version" (as expected).
2. ClusterController is responsible for worker process recruitment;
   the sequencer, though orchestrating the recovery state machine,
   needs to reach out to the ClusterController for recruiting worker
   processes etc.

NOTE:
This patch has moved the recovery state machine code from the
'sequencer' -> 'cluster-controller' process; however, necessary
updates were made for both functionality and performance
improvement reasons.

Next Steps:
Cluster recovery documentation will be updated in the near future.
2021-12-22 14:06:27 -08:00
Ata E Husain Bohra dfe9d184ff Refactor: ClusterController driving cluster-recovery state machine
At present, the cluster recovery process consists of the following steps:
1. ClusterController clusterWatchDatabase actor recruits
   master/sequencer process.
2. Sequencer process implements the cluster recovery state machine,
   responsible for recruiting all other processes as well as restoring
   the cluster state.

This patch proposes a scheme where the cluster recovery state machine
is implemented and driven by the ClusterController process instead
of the Sequencer process.

Advantages of the scheme could be:
1. Simplified design where ClusterController recruits the "sequencer"
   process like other worker processes, compared to the current scheme
   where the "sequencer" process gets special treatment. In the newer
   scheme, the sequencer is responsible for maintaining/providing
   "committed version" (as expected).
2. ClusterController is responsible for worker process recruitment;
   the sequencer, though orchestrating the recovery state machine,
   needs to reach out to the ClusterController for recruiting worker
   processes etc.

NOTE:
This patch has moved the recovery state machine code from the
'sequencer' -> 'cluster-controller' process; however, necessary
updates were made for both functionality and performance
improvement reasons.

Next Steps:
Cluster recovery documentation will be updated in the near future.
2021-12-22 14:06:27 -08:00
Dan Lambright 9f4ac866cd Avoid context switch between appending version list and updating dv
Port PR 6117 (Resolver saves shardChanged in recent state transactions)
2021-12-13 13:02:32 -05:00
Dan Lambright 0222d8669d fix simulation failures 2021-12-10 09:56:21 -05:00
Evan Tschannen e3819dad7c fix: If a removed tlog never attempted a queue commit, the update storage loop could get stuck waiting for queueCommittingVersion to advance 2021-11-25 09:55:01 -08:00
Evan Tschannen 964d0209ca Merge pull request #5637 from sfc-gh-ljoswiak/features/data-loss-prevention
Data loss protection when joining new cluster
2021-11-15 15:26:32 -08:00
Dan Lambright 4979ccb889 commits recovered if written to every tlog minus failure tolerance. 2021-11-12 12:10:04 -05:00
Lukas Joswiak e4c3f886da Fix recovery issue 2021-11-10 16:15:13 -08:00
Dan Lambright 0f99ad582b first cut unicast recovery 2021-11-10 12:31:16 -05:00
Sreenath Bodagala 1ec238b8b4 - Address a review comment 2021-11-09 20:46:42 +00:00
Lukas Joswiak 15e0d5b29f Add explicit transaction options when reading cluster ID 2021-11-09 12:29:49 -08:00
Lukas Joswiak 74cf64fe0f Sync cluster ID through ServerDBInfo 2021-11-09 12:29:48 -08:00
Lukas Joswiak 4640045243 Fix rare simulation failures
When partitions appear before a cluster has fully recovered, it was
possible to have different tlogs persist different cluster IDs because
they were involved in different partitions. This would affect recovery
when a quorum was eventually reached. The solution to this is to avoid
persisting the cluster ID before a cluster has fully recovered, to make
sure all nodes agree on the cluster ID.
2021-11-09 12:29:48 -08:00
Lukas Joswiak 3988b11fd6 Cleanup 2021-11-09 12:29:48 -08:00
Lukas Joswiak aa3383f0e3 Exclude when joining new cluster 2021-11-09 12:29:48 -08:00
Lukas Joswiak 3e2c65bb11 Allow tlog to join another cluster but retain its data 2021-11-09 12:29:48 -08:00
Lukas Joswiak 30867750b5 Add protection against storage and tlog data deletion when joining a new cluster 2021-11-09 12:29:47 -08:00
Sreenath Bodagala 26ac1529fa - Unblock any waiting peeks before stopping a tlog. 2021-11-09 17:22:50 +00:00
Markus Pilman 7df059570a Make sure unit tests are run often enough 2021-11-08 15:43:32 -07:00
Dan Lambright 05a1419ba0 Fix corner-case where poppedVersion races with wait on new mutations in tLog 2021-11-03 11:32:31 -04:00
Dan Lambright befe1993c4 fix conflict on rebase 2021-10-29 12:25:26 -04:00
Sreenath Bodagala 2bf54fda90 - Address review comments 2021-10-28 20:06:11 +00:00
Sreenath Bodagala 4503b0a347 - Capture metrics about empty/non-empty peeks done by storage servers 2021-10-26 14:37:46 +00:00
Evan Tschannen c615279807 Merge pull request #5720 from sfc-gh-ljoswiak/fixes/recovery-failure-fix
Fix possible recovery hang
2021-10-25 12:35:31 -07:00
Evan Tschannen f1158371a7 Merge branch 'master' of https://github.com/apple/foundationdb into feature-range-feed
# Conflicts:
#	flow/error_definitions.h
2021-10-21 00:55:12 -07:00
Lukas Joswiak 120d99e941 Fix a recovery hang that could occur when a new recovery was started during the existing recovery 2021-10-19 17:37:14 -07:00
sfc-gh-tclinkenbeard 9e06b6e6e3 Make IClosable interface const-correct 2021-10-18 13:40:47 -07:00
Dan Lambright 23062b892e Calculate tpcv on resolvers 2021-10-15 16:40:00 -04:00
Dan Lambright f099bb2574 comments on this PR's change 2021-10-15 15:08:25 -04:00
Dan Lambright 15dc5a3e41 wake waiters when data made durable 2021-10-15 10:58:48 -04:00
Evan Tschannen 5c642f706e Merge branch 'master' of https://github.com/apple/foundationdb into feature-range-feed
# Conflicts:
#	fdbcli/fdbcli.actor.cpp
2021-10-09 19:34:16 -07:00
Dan Lambright 58e1888d8e remove network hop by getting previous commit versions in GetCommitVersionRequest 2021-09-30 11:51:57 -04:00
Sreenath Bodagala 2aa3b44d4e Merge remote-tracking branch 'apple-upstream/master' into version-vector-prototype
- Conflicts:
	fdbserver/LogSystem.h
	fdbserver/LogSystemConfig.h
	fdbserver/TagPartitionedLogSystem.actor.cpp

- Files modified during merge:

modified:   fdbserver/LogSystem.cpp
modified:   fdbserver/LogSystemConfig.cpp
2021-09-17 19:36:18 +00:00
Xiaoge Su abf73047ca Enforce std:: specifier rather than using namespace 2021-09-16 19:40:28 -07:00
Xiaoge Su 067c1cc55b Extract methods in LogSystem.h to corresponding cpp file 2021-09-12 14:17:19 -07:00
Evan Tschannen ac5b580e2d Merge branch 'master' into feature-range-feed
# Conflicts:
#	fdbcli/fdbcli.actor.cpp
#	fdbclient/StorageServerInterface.cpp
#	fdbclient/StorageServerInterface.h
#	fdbserver/ApplyMetadataMutation.cpp
#	fdbserver/TLogServer.actor.cpp
#	flow/error_definitions.h
2021-09-09 23:13:22 -07:00
Dan Lambright d8d64ecc6f Add TODO 2021-09-09 12:47:00 -04:00
Dan Lambright ea748f3273 Add latency metrics for blocking peek 2021-09-08 09:50:01 -04:00
Dan Lambright 8689e1f106 merge with master 2021-08-30 15:29:08 -04:00
Steve Atherton deeb6b3404 Merge branch 'master' of https://github.com/apple/foundationdb into durability-bug-repro1
# Conflicts:
#	fdbserver/TLogServer.actor.cpp
2021-08-24 16:19:16 -07:00
Steve Atherton ec0e39b40f Bug fix: Popped versions are exclusive, so after recovery a tag for which there is no longer data should be considered popped up until the version *after* recovery, indicating that data at the recovery version itself has been popped. 2021-08-24 15:16:20 -07:00
Sreenath Bodagala 7c269b5225 - Address a bug 2021-08-17 14:40:00 +00:00
Xiaoxi Wang a97570bd06 solve mis-spelling, trace log and format problems 2021-08-11 18:26:00 -07:00
Sreenath Bodagala cec744cebf - Address the following issues:
- Sequencer should update the version vector once for a given commit
version (irrespective of the number of times that it receives and
processes the ReportRawCommittedVersionRequest message for that commit
version). Issue found by simulation tests.

- Storage server should take both its latest commit version and the
read version into account while processing a read request. This is to
address transaction_too_old error that we saw while running tests with
mako (and also in YCSB tests).

- Do not enable the tlog blocking-peek logic if ENABLE_VERSION_VECTOR
flag is set to false.
2021-08-10 19:47:18 +00:00
Xiaoxi Wang 1f6cee89ab merge master, fix conflicts 2021-08-10 10:01:45 -07:00
Steve Atherton c73e861074 Move role UIDs for MutationTracking TraceEvents from various inconsistent detail fields into the TraceEvent UID field. 2021-08-10 01:59:28 -07:00
Steve Atherton 54c7036eaf Move role UIDs for MutationTracking TraceEvents from various inconsistent detail fields into the TraceEvent UID field. 2021-08-10 01:52:36 -07:00
Evan Tschannen 208a5790ad fixed usage of durable version 2021-08-09 21:58:44 -07:00
Evan Tschannen ed28aecde0 Merge branch 'master' into feature-range-feed 2021-08-09 20:40:55 -07:00
Evan Tschannen bc9a0e1315 first attempt to add data distribution support for range feeds 2021-08-09 10:05:56 -07:00
Xiaoxi Wang 2263626cdc 200k test clean: enable remote Log pull from LogRouter 2021-08-07 09:53:32 -07:00
Sreenath Bodagala 1758c92683 - Pull changes related to tlog-peeks from the version indexer branch
Pull commits 5e37bc37a0 and
95e85aaffb from the version indexer branch.
2021-08-06 14:42:35 +00:00
Sreenath Bodagala a081c0baa5 Merge remote-tracking branch 'apple-upstream/master' into version-vector-prototype 2021-08-05 22:40:32 +00:00
Xiaoxi Wang 2df0474fec merge master 2021-08-02 11:58:35 -07:00
Xiaoxi Wang ae2268f9f2 200k simulation: check stream sequence; delay in GetMore loop 2021-08-02 10:52:24 -07:00
Xiaoxi Wang 2a88033800 clean 100k simulation test. revert changes of fdbrpc.h 2021-07-31 16:46:14 -07:00
Xiaoxi Wang 1c4bce17aa revert code refactor 2021-07-30 19:08:22 -07:00
Xiaoxi Wang 10c82b422f merge master branch 2021-07-28 14:19:46 -07:00
Xiaoxi Wang 12d4f5c261 disable streaming peek for localities < 0 2021-07-28 14:11:25 -07:00
sfc-gh-tclinkenbeard c74047c665 Merge remote-tracking branch 'origin/master' into fix-more-clang-warnings 2021-07-28 11:51:02 -07:00
Steve Atherton 507c1f11e3 Add .log() to bare TraceEvent() invocations without any .detail()s to avoid clang-tidy warning about immediate destruction of object without use. 2021-07-26 19:55:10 -07:00
Xiaoxi Wang c6b0de1264 problem: OOM 2021-07-26 09:36:53 -07:00
sfc-gh-tclinkenbeard 23558a5430 Fix -Wreorder-ctor warnings in TLogServer.actor.cpp 2021-07-24 23:15:22 -07:00
sfc-gh-tclinkenbeard b9a22a61ef Fix many -Wreorder-ctor warnings 2021-07-23 17:33:18 -07:00
Xiaoxi Wang bfebd4e812 Merge branch 'master' of https://github.com/apple/foundationdb into tlog_dev 2021-07-22 16:15:07 -07:00
Xiaoxi Wang cd32478b52 memory error(Simple config) 2021-07-22 15:45:59 -07:00
Xiaoxi Wang 1057835e8b merge with master 2021-07-20 17:09:34 -07:00
Xiaoxi Wang 5046ee3b07 add stream peek to logRouter 2021-07-20 17:42:00 +00:00
sfc-gh-tclinkenbeard 6f81155784 Merge remote-tracking branch 'origin/master' into const-serverdbinfo 2021-07-20 10:18:40 -07:00
Xiaoxi Wang f3667ce91a more debug logs; let tryEstablishStream wait until the connection is good 2021-07-19 18:43:51 +00:00
Steve Atherton f596a81073 Rename ::TRUE and ::FALSE in BooleanParams to ::True and ::False so as to not conflict with the TRUE and FALSE macros provided by the Windows and MacOS SDKs. 2021-07-17 00:11:40 -07:00
Xiaoxi Wang 227570357a trace log and reset changes; byteAcknownledge overflow 2021-07-15 21:30:14 +00:00
Sreenath Bodagala 5f504d2148 - Block a peek request on a tlog until the tlog has a commit version
that is relevant to the requester

Code extracted from https://github.com/apple/foundationdb/pull/5058
2021-07-15 19:49:20 +00:00
Xiaoxi Wang 1584ed5853 Merge branch 'master' of https://github.com/apple/foundationdb into tlog_dev 2021-07-14 16:20:19 +00:00
Xiaoxi Wang 066d534194 trivial changes 2021-07-14 16:19:23 +00:00
sfc-gh-tclinkenbeard 84f6b55e6c Prevent tLog from modifying ServerDBInfo object 2021-07-11 23:29:36 -07:00
Xiaoxi Wang 6d1c12899d catch exceptions 2021-07-09 22:46:16 +00:00
Xiaoxi Wang 5a43a8c367 add returnIfBlocked in stream request 2021-07-08 19:32:58 +00:00
sfc-gh-tclinkenbeard 020371a78f Merge remote-tracking branch 'origin/master' into add-boolean-param 2021-07-07 16:50:51 -07:00
Zhe Wang cc10c9aee2 clean up add_trace_event_to_tLog_pop 2021-07-07 14:14:59 -05:00
Zhe Wang b82a3f4276 add trace event to tLog pop 2021-07-07 12:13:49 -05:00
Xiaoxi Wang b6d5c8a091 implement tLogPeekStream 2021-07-06 23:14:58 +00:00
Xiaoxi Wang 9948b9d4ef refactor TLog Peek code 2021-07-05 00:14:27 +00:00
sfc-gh-tclinkenbeard 8cc40e3a2b Expand use of BOOLEAN_PARAM 2021-07-02 21:41:50 -07:00
sfc-gh-tclinkenbeard 79ff07a071 Added *BOOLEAN_PARAM macros to enforce documentation of boolean parameters 2021-07-02 15:04:42 -07:00
Xiaoxi Wang b50fda6b4b add simple streaming peek functions 2021-07-01 23:17:28 +00:00
Xiaoxi Wang ae3542f8ab add stream struct in Tlog 2021-06-29 17:06:09 +00:00
Evan Tschannen fcb8bd6475 Revert "Make the sim2 run loop match the behavior of the net2 run loop." 2021-06-22 14:50:01 -07:00
Evan Tschannen 154332a94b Merge branch 'master' of https://github.com/apple/foundationdb into feature-sim-time-batching
# Conflicts:
#	fdbserver/VersionedBTree.actor.cpp
2021-06-22 09:37:40 -07:00
Zhe Wang ae7b93dcce add epoch info to trace events when tLog begins 2021-06-09 19:14:36 -05:00
Evan Tschannen 801f147551 properly handle io_errors from the destructor of LogData 2021-05-20 18:23:11 -07:00
Evan Tschannen cc18022e7d small clang format finds 2021-05-20 16:45:08 -07:00
Evan Tschannen f57f0d64f4 Merge branch 'master' into feature-sim-time-batching
# Conflicts:
#	fdbserver/DataDistribution.actor.cpp
2021-05-20 09:09:35 -07:00
Evan Tschannen 907248dcd4 fixed a rare simulation bug where missingFinalCommit could be skipped by two successive logSystem changes 2021-05-19 13:26:01 -07:00
Lukas Joswiak e7d7b39f12 Merge pull request #4744 from sfc-gh-tclinkenbeard/add-rangeresult-type-alias
Create RangeResult type alias
2021-05-03 16:29:33 -07:00
sfc-gh-tclinkenbeard 5c2d7b6080 Create RangeResult type alias 2021-05-03 13:14:16 -07:00
Steve Atherton cbd77fe6f3 Added new StorageBytes member to StorageMetrics and TLogMetrics (for newest TLog version only). Moved StorageBytes detail from SpecialCounters to the traceCounters() decorator callback to avoid calling getStorageBytes(), which makes a system call, four extra times on storage servers and eight extra times on logs. 2021-04-08 01:09:47 -07:00
Evan Tschannen 0554a05fc2 typo 2021-03-19 13:19:26 -07:00
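
Note on commit ec0e39b40f (2021-08-24) above: its message says popped versions are exclusive, so a tag with no remaining data after recovery should be considered popped up to the version after the recovery version. The sketch below only illustrates that boundary rule. It is not FoundationDB code: the function names is_popped and popped_after_recovery, their parameters, and the use of max() to avoid moving the popped version backwards are all assumptions made for the example.

```python
def is_popped(version: int, popped_version: int) -> bool:
    """Exclusive bound (assumed semantics): versions strictly below
    popped_version count as popped; popped_version itself does not."""
    return version < popped_version


def popped_after_recovery(recovery_version: int, has_data: bool,
                          current_popped: int) -> int:
    """Hypothetical helper: popped version to record for a tag after recovery.

    If the tag no longer has any data, data at the recovery version itself
    is gone too, so the exclusive bound has to be recovery_version + 1.
    """
    if has_data:
        return current_popped
    # max() is an illustrative safeguard so the bound never moves backwards.
    return max(current_popped, recovery_version + 1)


if __name__ == "__main__":
    recovery_version = 100
    popped = popped_after_recovery(recovery_version, has_data=False,
                                   current_popped=0)
    # With an exclusive bound of recovery_version + 1, the recovery
    # version itself is reported as popped.
    print(popped, is_popped(recovery_version, popped))
```

Using recovery_version rather than recovery_version + 1 as the bound would leave the recovery version itself reported as not popped, which is the off-by-one the commit message describes.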