* Init TLog persistent storage for a shared TLog
Description
diff-1: Address review comments
Patch updates the code to initialize TLog persistent storage for a
shared TLog independently of initializing persistentState for
versioned TLog data. This approach allows initializing TLog persistent
storage, as well as writing the 'persistFormat' key for a shared TLog,
earlier in the TLog creation lifecycle (see the sketch below).
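A minimal sketch of the early write, assuming the shared TLog holds an
IKeyValueStore and a persistFormat KeyValueRef as in TLogServer; the
exact names and call site are illustrative:

    // Write the format key as soon as the shared TLog's store exists,
    // before any per-generation (versioned) persistentState setup.
    self->persistentData->set(persistFormat);
    wait(self->persistentData->commit());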
Testing
devRunCorrectness - 100K
* fix: persistentData->commit() was not protected by the persistentDataCommitLock, meaning it is possible for inconsistent data to be made durable on the tlog
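A hedged sketch of the guarded commit, using Flow's FlowLock as the
surrounding TLog code does; the field names follow the commit text, the
rest is illustrative:

    // Take the lock before commit() so a concurrent persistentData
    // update cannot interleave and make a half-applied state durable.
    wait(self->persistentDataCommitLock.take());
    state FlowLock::Releaser commitLockReleaser(self->persistentDataCommitLock);
    wait(self->persistentData->commit());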
* fixed a compilation error
We've observed the recovery process stuck in initPersistentState
while waiting to acquire persistentDataCommitLock. All of the
other places in that function that potentially interact with the
disk are guarded by a timeout: TLOG_MAX_CREATE_DURATION.
Since it's possible that the current holder of that lock is
stuck in persistentData->commit(), it makes sense to add
a timeout around the entire function, rather than each of the
places where it might get stuck on an I/O operation.
The end result is that after 10 seconds, this process will fail
and cluster recovery will restart.
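A minimal sketch of the change, assuming Flow's timeoutError helper and
the knob named above; the inner function and its arguments are
illustrative:

    // Wrap the whole body in one timeout instead of guarding each I/O
    // call: if any step (including persistentData->commit() under the
    // lock) stalls, the actor throws timed_out and recovery restarts.
    ACTOR Future<Void> initPersistentState(TLogData* self, Reference<LogData> logData) {
        wait(timeoutError(initPersistentStateImpl(self, logData),
                          SERVER_KNOBS->TLOG_MAX_CREATE_DURATION));
        return Void();
    }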
* proof of concept
* use code-probe instead of test
* code probe working on gcc
* code probe implemented
* renamed TestProbe to CodeProbe
* fixed refactoring typo
* support filtered output
* print probes at end of simulation
* fix missed probes print
* fix deduplication
* Fix refactoring issues
* revert bad refactor
* make sure file paths are relative
* fix more wrong refactor changes
* Add an internal C API to support memory connection records
* Track shared state in the client using a unique and immutable cluster ID from the cluster
* Add missing code to store the clusterId in the database state object
* Update some arguments to pass by const&
This PR adds support for TLog encryption through the commit proxy. Encryption is done on a per-mutation basis: as the CP writes mutations to the TLog, it inserts an encryption header alongside each encrypted mutation. The storage server (and other consumers of the TLog, such as the storage cache and backup worker) decrypts the mutations as it peeks the TLog.
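A hedged sketch of the per-mutation layout described above; the
encryptMutation helper and the EncryptedMutationRef type are
illustrative, not the exact FDB API:

    // On the commit proxy: replace each plaintext mutation with
    // (header, ciphertext) before handing it to the TLog push path.
    for (const MutationRef& m : mutations) {
        BlobCipherEncryptHeader header; // shipped alongside the ciphertext
        StringRef ciphertext = encryptMutation(m, cipherKeys, &header, arena); // illustrative helper
        toCommit.writeTypedMessage(EncryptedMutationRef(header, ciphertext));  // illustrative type
    }
    // On the storage server / storage cache / backup worker, the peek
    // path reads the header first and uses it to decrypt the mutation.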
* Fix comments
* Add simulation value for SERVER_KNOBS->SNAP_CREATE_MAX_TIMEOUT
* A working version that is correctness-clean
* Remove unnecessary comments and debugging symbols
* Only check secondary address for coordinators, same as before
* Change the trace to SevError and remove the ASSERT(false)
* Remove TLogSnapRequest handling on TlogServer, which is changed to use WorkerSnapRequest
* Add retry for network failures
* Add retry limit for network failures; still allow duplicate snapshots on processes that are both tlog and storage, to avoid a race
* Add retry limit as a knob and make backoff exponential
* Add getDatabaseConfiguration(Transaction* tr)
* revert back to sending the request for each role once
* update some comments
* Fix orphaned storage server due to force recovery
The force recovery can roll back the transaction that adds a storage server.
However, the storage server may now be at version B > A, where A is the
recovery version. As a result, its peek to the buddy TLog won't return
TLogPeekReply::popped to trigger its exit; instead it gets back a higher
version C > B. To the storage server this means the message is empty, so it
does not remove itself and keeps peeking.
The fix is, instead of using the recovery version as the popped version for
the SS, to use the recovery transaction version, which is the first
transaction after the recovery. Force recovery bumps this version to one
much higher than the SS's version, so the TLog sets TLogPeekReply::popped
and triggers the storage server's exit.
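A minimal sketch of the comparison on the TLog peek path, assuming the
recovery transaction version is tracked per generation; both
recoveryTxnVersion and the helper are hypothetical names:

    // Use the recovery transaction version, not the recovery version, as
    // the floor for popped: force recovery pushes it above the orphaned
    // SS's version B, so the SS sees popped > B and removes itself.
    Version popped = std::max(poppedVersionFor(self, req.tag),  // illustrative helper
                              logData->recoveryTxnVersion);     // hypothetical field
    if (popped > req.begin) {
        reply.popped = popped; // the orphaned storage server exits on this reply
    }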
* Fix tlog peek to disallow returning empty messages between recoveredAt and the recovery txn version
Today this contract is not explicitly enforced, and it can cause a storage
server to fail with the assertion "rollbackVersion >= data->storageVersion()".
This is because if such an empty version is returned, the SS may advance its
storage version to a value larger than the rollback version set in the
recovery transaction.
The fix is to block the peek reply until the recovery transaction has been
received.
* Move recoveryTxnReceived to be per LogData
This is because a shared TLog can have a first-generation TLog that has
already set the promise, so later generations would not wait for the recovery
version. For the current generation, all peeks need to wait, while for older
generations there is no need to wait (determined by checking whether they are
stopped).
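A hedged sketch of the per-generation gate, assuming recoveryTxnReceived
is a Promise<Void> on LogData as the commits above suggest; the
surrounding peek code is illustrative:

    struct LogData {
        Promise<Void> recoveryTxnReceived; // set once the recovery txn arrives
        bool stopped;                      // true for older, stopped generations
        // ...
    };

    // In the peek path: only the current (unstopped) generation must
    // wait; stopped generations have already passed their recovery txn,
    // and waiting on an already-set promise returns immediately.
    if (!logData->stopped) {
        wait(logData->recoveryTxnReceived.getFuture());
    }
    // ... build and send the TLogPeekReply as before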
* For initial commit, poppedVersion needs to be at least 2
This is to get rid of the seed storage servers recruited by the previous
unsuccessful recovery.
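In code form this is just a floor on the reported popped version (a
sketch; the variable name is illustrative):

    // Version 1 carries the previous (failed) recovery's seed storage
    // servers, so the initial commit must report popped >= 2 to clear them.
    poppedVersion = std::max<Version>(poppedVersion, 2);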
* Revert "Revert "Refactor: ClusterController driving cluster-recovery state machine""
Major changes include:
1. Re-revert the Sequencer refactor commits listed below (in order):
1.a. This reverts commit bb17e194d9.
1.b. This reverts commit d174bb2e06.
1.c. This reverts commit 30b05b469c.
2. Update Status.actor to use the ClusterController interface for
tracking recovery status.
3. Introduce a ServerKnob to define the "cluster recovery trace event"
prefix; for now it stays "Master", but it should allow a smooth
transition to the more appropriate "Cluster" prefix.
diff-1: Address Jingyu's review comments
At present, the cluster recovery process consists of the following steps:
1. The ClusterController's clusterWatchDatabase actor recruits the
master/sequencer process.
2. The Sequencer process implements the cluster recovery state machine,
responsible for recruiting all other processes as well as restoring the
cluster state.
The patch proposes a scheme where the cluster recovery state machine is
implemented and driven by the ClusterController process instead of the
Sequencer process (a sketch of the new flow follows below).
Advantages of the scheme:
1. Simplified design: the ClusterController recruits the "sequencer"
process like any other worker process, whereas in the current scheme
the "sequencer" process gets special treatment. In the new scheme the
sequencer is responsible for maintaining/providing the
"committed version" (as expected).
2. In the current scheme the ClusterController is responsible for worker
process recruitment, so the sequencer, though orchestrating the recovery
state machine, needs to reach out to the ClusterController to recruit
worker processes etc.; driving recovery from the ClusterController
removes this indirection.
NOTE:
The patch has moved the recovery state machine code from the
'sequencer' to the 'cluster-controller' process; however, necessary
updates were made for both functionality and performance reasons.
Next Steps:
Cluster recovery documentation will be updated in the near future.
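A hedged sketch of the proposed flow, keeping only the names that appear
above (clusterWatchDatabase, ClusterController, sequencer); the helpers
and signature details are illustrative:

    // The ClusterController recruits the sequencer like any other worker,
    // then drives the recovery state machine itself.
    ACTOR Future<Void> clusterWatchDatabase(ClusterControllerData* self) {
        loop {
            // Ordinary worker recruitment; no special-cased process.
            MasterInterface sequencer = wait(recruitSequencer(self)); // illustrative helper
            // The recovery state machine now runs in the ClusterController.
            wait(clusterRecoveryCore(self, sequencer));               // illustrative helper
        }
    }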
When partitions appeared before a cluster had fully recovered, it was
possible for different tlogs to persist different cluster IDs because
they were involved in different partitions. This would affect recovery
when a quorum was eventually reached. The solution is to avoid
persisting the cluster ID before the cluster has fully recovered, to make
sure all nodes agree on the cluster ID.
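A minimal sketch of the guard, assuming the tlog checks the recovery
state before writing a durable cluster-ID key; the key and field names
are illustrative:

    // Persist the cluster ID only after recovery fully completes, so
    // tlogs in different partitions can never durably record divergent IDs.
    if (recoveryState == RecoveryState::FULLY_RECOVERED && !self->durableClusterId.present()) {
        self->persistentData->set(
            KeyValueRef(persistClusterIdKey, // illustrative key
                        BinaryWriter::toValue(clusterId, Unversioned())));
        wait(self->persistentData->commit());
    }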
- Sequencer should update the version vector only once for a given commit
version (irrespective of the number of times it receives and processes
the ReportRawCommittedVersionRequest message for that commit version).
Issue found by simulation tests.
- Storage server should take both its latest commit version and the read
version into account while processing a read request (see the sketch
below). This is to address the transaction_too_old error that we saw
while running tests with mako (and also in YCSB tests).
- Do not enable the tlog blocking-peek logic if the ENABLE_VERSION_VECTOR
flag is set to false.
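A hedged sketch of the read-path change from the second bullet, plus the
knob guard from the third; waitForVersion mirrors the storage server's
existing helper, and latestCommitVersion is a hypothetical field:

    // Storage server read path: wait for the larger of the request's read
    // version and the SS's latest commit version before serving the read.
    Version waitVersion = std::max(req.version, data->latestCommitVersion); // hypothetical field
    wait(waitForVersion(data, waitVersion)); // mirrors the existing helper

    // TLog peek path: only use the blocking-peek logic when version
    // vectors are enabled.
    if (SERVER_KNOBS->ENABLE_VERSION_VECTOR) {
        // ... blocking peek ...
    }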