foundationdb

Commit Graph

Author	SHA1	Message	Date
A.J. Beamon	04d1217941	Track statistics about server-side request latency on each process, to include min, max, mean, and various percentiles.	2020-07-09 16:39:15 -07:00
sfc-gh-tclinkenbeard	99bf993815	Replace BOOST_NOEXCEPT with noexcept	2020-06-09 22:39:19 -07:00
sfc-gh-ngoyal	693d9e8b89	Merge branch 'master' into fdb_cache_wo_allocator	2020-06-09 15:09:58 -07:00
negoyal	cf13e00a8f	Merge remote-tracking branch 'origin/release-6.3' into fdb_cache_wo_allocator	2020-06-01 17:38:31 -07:00
Meng Xu	1c35ad884f	Merge branch 'master' into mengxu/release-6.3-conflict-PR Has conflict with master; Next commit will fix the conflicts.	2020-05-25 12:01:49 -07:00
Evan Tschannen	ee6ff80064	another compile fix	2020-05-22 17:26:22 -07:00
Evan Tschannen	ced65cd30b	finished explicitly versioning everything stored in the database	2020-05-22 17:14:21 -07:00
A.J. Beamon	7a09d016a6	Merge branch 'release-6.3' into merge-release6.3-into-master	2020-05-19 12:52:44 -07:00
Markus Pilman	c2bc75516f	Merge branch 'release-6.3' of github.com:apple/foundationdb into features/trace-roles	2020-05-14 10:34:53 -07:00
Alvin Moore	a160f9199f	Merge pull request #3171 from apple/release-6.3 Merge Release 6.3 into Master	2020-05-14 10:00:47 -07:00
Alex Miller	bf6d056095	Changing the last suggestions from review.	2020-05-13 18:48:43 -07:00
Alex Miller	ccaac162e2	Resolve performance concerns of nearly-no-op debugMutation being frequently called This introduces unhygienic macro variants that inline a `ENABLED &&` before the TraceEvent. This way, they get entirely compiled out unless enabled. Then rewrite all debugMutation uses via sed.	2020-05-13 18:44:15 -07:00
Alex Miller	27da91ab9e	Merge remote-tracking branch 'upstream/master' into mutation-debugging	2020-05-13 12:51:44 -07:00
Alex Miller	f148412a32	Make UPDATE_STORAGE_BYTE_LIMIT the reference spill variety. Which is unrelated, but a change I was supposed to do a while ago and forgot.	2020-05-12 16:59:20 -07:00
Markus Pilman	5f9b127e56	Emit traces regularly about role assignment We are currently emitting Role transition traces when a role starts and when it ends. While this is useful for debugging, it doesn't work well with tools that inject data and might potentially miss some trace lines. We do decorate each trace line with the roles assigned to that particular process, however, this is not sufficient for tools that can make use of the UID -> Role mapping	2020-05-08 16:27:57 -07:00
negoyal	dd033736ed	Merge branch 'master' into fdb_cache_subfeature2	2020-05-04 17:29:43 -07:00
Evan Tschannen	7cebe743f9	A number of bug fixes of rare correctness errors	2020-04-29 13:50:13 -07:00
Evan Tschannen	c87aa33941	Merge branch 'release-6.2' # Conflicts: # CMakeLists.txt # bindings/go/src/fdb/generated.go # documentation/sphinx/source/api-common.rst.inc # documentation/sphinx/source/api-ruby.rst # documentation/sphinx/source/release-notes.rst # fdbclient/FailureMonitorClient.actor.cpp # fdbclient/NativeAPI.actor.cpp # fdbclient/vexillographer/fdb.options # fdbrpc/FlowTransport.actor.cpp # fdbserver/OldTLogServer_6_0.actor.cpp # fdbserver/TLogServer.actor.cpp # fdbserver/fdbserver.actor.cpp # versions.target	2020-04-23 13:47:53 -07:00
Evan Tschannen	0a1b2a572f	more compile fixes	2020-04-22 14:41:17 -07:00
Evan Tschannen	68906bf3c3	fix compile errors	2020-04-22 14:36:41 -07:00
Evan Tschannen	d0cc2a1ee4	added logging for parallel peeks on TLogs	2020-04-22 14:24:45 -07:00
Alex Miller	122762cce1	Add debugMessagesAndTags, and track mutations in more places. Like: * Leaving the proxy * Entering the TLog * Leaving the TLog * Being read on a cursor All of this brought to you by TagsAndMessage! This also slides in a minor optimization as to how mutations are serialized per target log.	2020-03-27 03:31:04 -07:00
negoyal	acaf91ac47	Merge branch 'master' into fdb_cache_subfeature2	2020-03-26 13:33:08 -07:00
negoyal	8abac91033	Fixed a bug in cache server while peeking at a version lower than popped version and added some logging.	2020-03-26 12:39:07 -07:00
Evan Tschannen	e08f0201f1	merge release 6.2 into master	2020-03-17 12:51:47 -07:00
Evan Tschannen	ea98c7a40a	added additional timeout on initPersistentState	2020-03-16 11:38:14 -07:00
Evan Tschannen	d6d347f665	treat a tlog which takes a long time to create its disk queue as failed	2020-03-13 10:31:59 -07:00
Evan Tschannen	96258b9809	Merge branch 'release-6.2' # Conflicts: # documentation/sphinx/source/release-notes.rst # fdbcli/fdbcli.actor.cpp # fdbclient/ManagementAPI.actor.cpp # fdbrpc/FlowTransport.actor.cpp # fdbserver/ClusterController.actor.cpp # fdbserver/DataDistribution.actor.cpp # fdbserver/DataDistribution.actor.h # fdbserver/DataDistributionQueue.actor.cpp # fdbserver/KeyValueStoreMemory.actor.cpp # fdbserver/MasterProxyServer.actor.cpp # fdbserver/QuietDatabase.actor.cpp # fdbserver/SkipList.cpp # fdbserver/StorageMetrics.actor.h # fdbserver/TLogServer.actor.cpp # fdbserver/fdbserver.actor.cpp # fdbserver/storageserver.actor.cpp # fdbserver/workloads/KVStoreTest.actor.cpp # flow/CMakeLists.txt # flow/Knobs.cpp # flow/Knobs.h # flow/genericactors.actor.cpp # flow/serialize.h	2020-02-21 19:09:16 -08:00
Evan Tschannen	8129f74a10	Merge pull request #2698 from etschannen/feature-recruit-delay The CC waits until no new workers register before starting a bad recruitment	2020-02-20 14:42:37 -08:00
A.J. Beamon	fcbdcda490	Merge pull request #2650 from ajbeamon/fix-reverse-range-read-byte-limit-bug Fix reverse range read performance bug	2020-02-20 12:47:17 -08:00
Evan Tschannen	fbd45963d8	The cluster controller waits until no new workers register for 1.0 before starting a bad recruitment	2020-02-19 16:48:30 -08:00
A.J. Beamon	1d9140d874	Removed TLogVersion logging. Added logging of SharedTLog ID for each TLog. Switched ID logged for TLogRejoining event to the TLog instead of the SharedTLog. Made some parameters to startRole passed by reference.	2020-02-14 12:33:43 -08:00
A.J. Beamon	56053c565b	Improve TLog "Role" event by adding the worker ID, the TLog version, and under what circumstances the TLog is being started (Restored, Recruited, or Recovered). The SharedTLog role was being started and stopped twice, so remove one instance of it.	2020-02-12 15:11:38 -08:00
Markus Pilman	e71fe44ee3	Merge branch 'master' into features/icc	2020-02-08 21:33:02 -08:00
A.J. Beamon	df2b0452b4	Step 3 of fixing storage server range reads: change return type of readRange from VectorRef<KeyValueRef> to RangeResultRef.	2020-02-06 13:19:24 -08:00
mpilman	d09e07f1f5	Merge remote-tracking branch 'upstream/master' into features/icc	2020-02-04 10:26:18 -08:00
Jingyu Zhou	7544ff88d9	Comment out frequent TLogPop trace event	2020-01-31 19:29:09 -08:00
Evan Tschannen	6c0b934dda	Merge pull request #2242 from alexmiller-apple/fix-10min-stall-again Fix the 10min multi-region recovery stall again	2020-01-23 17:53:02 -08:00
Jingyu Zhou	8b67a89eed	More review comments fixed.	2020-01-22 19:42:13 -08:00
Jingyu Zhou	9d7a1a77d0	Small fixes.	2020-01-22 19:38:45 -08:00
Jingyu Zhou	85c4a4e422	Address review comments for PR #1625	2020-01-22 19:38:45 -08:00
Jingyu Zhou	73824faf65	Track pseudo tags popping for individual IDs For each log router ID, we track the popped version of each pseudo tag so that the popping only applied to the minimum of these versions. Also add more tracing for popping and epochs.	2020-01-22 19:38:45 -08:00
Jingyu Zhou	11964733b7	WIP: should be divided into smaller commits.	2020-01-22 19:38:45 -08:00
Jingyu Zhou	03a17a30ef	Refactor: check displacement in LogSystemConfig	2020-01-22 19:38:45 -08:00
Jingyu Zhou	442738b6db	Small code refactoring	2020-01-22 19:35:30 -08:00
Jingyu Zhou	8221d33eb1	Use emplace_back instead of push_back for TLogServer	2020-01-22 19:35:30 -08:00
Alex Miller	f0fe62a298	TLogs should not respond with data earlier than the begin version Parallel peek more code would prefer the begin version it was sent by the previous parallel peek over the request's begin version. This means that a merge cursor trying to advance past message versions would still get old data that it would have to filter out. A simple application of std::max fixes this.	2020-01-21 19:09:07 -08:00
Alex Miller	7798456201	Make TLogs have consistent parallel peek behavior. TLogServer and LogRouter had some leftover code from me trying to be more "correct" about parallel peek semantics, but those changes weren't reflected in the OldTLog* files. I've reverted the changes, as realistically, they are more likely to waste CPU than improve TLog behavior.	2020-01-21 18:23:16 -08:00
Alex Miller	ffc3506fff	Continuing a parallel peek after a timeout would hang.	2020-01-21 17:12:18 -08:00
Alex Miller	9c47bbe460	Remove trackerData time bump As we're in an error handling case, so this shouldn't be considered making forward progress.	2020-01-21 17:08:42 -08:00
Alex Miller	1cb311fcb8	Add an ASSERT_WE_THINK that peek cursors don't get timed_out() This should prevent us from regressing and having multi-region recoveries hang for 10min again.	2020-01-21 17:07:37 -08:00
Evan Tschannen	3f9d9d8b84	Merge branch 'release-6.2' # Conflicts: # CMakeLists.txt # cmake/FlowCommands.cmake # documentation/sphinx/source/release-notes.rst # fdbclient/StorageServerInterface.h # fdbserver/DataDistributionTracker.actor.cpp # fdbserver/MasterProxyServer.actor.cpp # fdbserver/fdbserver.actor.cpp # flow/Knobs.h # flow/Platform.cpp # versions.target	2020-01-16 18:37:47 -08:00
Evan Tschannen	827cea74b5	fix: tlogs must send a recruitment reply even when actor cancelled or the recruitment endpoint will be marked as permanently failed	2020-01-16 17:37:17 -08:00
Alex Miller	f58507c830	Rename poppedLocationForVersion -> versionForPoppedLocation	2019-12-19 10:24:31 -08:00
Alex Miller	b5d82a74c3	Update fdbserver/TLogServer.actor.cpp Co-Authored-By: Jingyu Zhou <jingyuzhou@gmail.com>	2019-12-19 10:20:52 -08:00
Alex Miller	d8cbd495af	Fix another pop + spill/dq-pop interleaving issue This fixes an issue introduced in the previous patch, where pop would immediately set `poppedLocationNeedsUpdate`, but setting the popped version was now delayed. This means that we could: 1. Run the spill loop and persist all popped versions 2. Receive a pop, and set the poppedLocationNeedsUpdate flag 3. Run the dq-pop loop, and clear the poppedLocationNeedsUpdate flag and now when we update the persistentPopped version again, we won't have the flag set for dq-pop to know that it needs to scan the spilled data again for the minLocation. We could more carefully update the flag, but instead, I've just converted it into a version that's kept in sync purely in the dq-pop loop, to remove shared state between pop and the dq-pop loop.	2019-12-17 23:15:48 -08:00
Alex Miller	b36062a509	DiskQueue should only pop based off of persisted popped tag versions This commit is to fix a bug where popping a tag between updatePersistentData and popDiskQueue can cause the TLog to recover to an incorrect understanding of what data it has available. The following series of events need to happen to trigger this bug: Tag 1:1 is popped to version 10 updatePersistentData is run... updatePersistentPopped runs and we persistentData stores 1:1 as popped to 10 A mutation is spilled for 1:1 at version 11 at location 1000 A mutation is spilled for 1:1 at version 21 at location 5000 updatePersistentData finishes and commits the btree changes Tag 1:1 is popped to version 20 popDiskQueue runs The btree is read for spilled mutations with version >=20 The minimum location required for the disk queue is found to be location 5000 The disk queue is popped to location 5000 The TLog crashes The worker restarts, and reloads the TLog files from disk restorePersistentPopped restores tag 1:1 as having been popped to version 10 Parallel peeks are received for tag 1:1 starting at version 0 The first peek is less than the popped version, so we respond with no data, and an end version of 10 The second peek starts at version 10, which is greater than the popped version The btree is read for spilled mutations, and we find that there is a mutation at version 11 at location 1000 Location 1000 is read in the DiskQueue The resulting page read at Location 1000 was popped pre-crash, and thus might either (a) be corrupt or (b) have an incorrect sequence number. The fix to this is to force popDiskQueue/updatePoppedLocation to use the popped version that was persisted to disk, and not the most recently popped version for the given tag. This bug doesn't manifest in simulation, because we don't have any code that peeks at a lower version than what has been popped.	2019-12-17 23:02:37 -08:00
Evan Tschannen	ebcb2f79ed	Merge branch 'master' of github.com:apple/foundationdb	2019-11-22 15:34:49 -08:00
Evan Tschannen	8d3ef89540	Merge branch 'release-6.2' # Conflicts: # CMakeLists.txt # documentation/sphinx/source/release-notes.rst # fdbclient/MutationList.h # fdbserver/MasterProxyServer.actor.cpp # versions.target	2019-11-14 15:49:56 -08:00
negoyal	a4a0bf18f9	Merging with Master.	2019-11-12 13:01:29 -08:00
Evan Tschannen	396dccbc98	when peeking from satellites we do not need to limit the amount of peeking on log router tags, because that is the only thing that can be peeked from a satellite log	2019-11-08 18:34:05 -08:00
Evan Tschannen	afc9713005	Merge branch 'release-6.2' # Conflicts: # CMakeLists.txt # documentation/sphinx/source/release-notes.rst # fdbclient/FDBTypes.h # fdbserver/LogSystem.h # fdbserver/LogSystemPeekCursor.actor.cpp # fdbserver/OldTLogServer_6_0.actor.cpp # fdbserver/TLogServer.actor.cpp # versions.target	2019-11-06 13:45:37 -08:00
Evan Tschannen	a8ca47beff	optimized memory allocations by using VectorRef<Tag> instead of std::vector<Tag>	2019-11-05 18:07:30 -08:00
Evan Tschannen	4de60fc437	Merge branch 'release-6.2' # Conflicts: # documentation/sphinx/source/release-notes.rst # fdbserver/TLogServer.actor.cpp	2019-11-01 15:48:04 -07:00
Evan Tschannen	85c315f684	Fix: parallelPeekMore was not enabled when peeking from log routers	2019-11-01 14:02:44 -07:00
Evan Tschannen	3325980c03	Merge branch 'release-6.2' # Conflicts: # CMakeLists.txt # documentation/sphinx/source/release-notes.rst # fdbserver/DataDistribution.actor.cpp # fdbserver/OldTLogServer_6_0.actor.cpp # fdbserver/TLogServer.actor.cpp # fdbserver/WorkerInterface.actor.h # fdbserver/worker.actor.cpp # versions.target	2019-10-24 17:38:15 -07:00
Evan Tschannen	2722c8b188	avoid starting a new startSpillingActor with every TLog recruitment	2019-10-23 11:15:54 -07:00
Evan Tschannen	e01e8371a6	Merge pull request #2256 from alexmiller-apple/spill-log-on-switch-6.2 Spill SharedTLog when there's more than one	2019-10-23 10:51:28 -07:00
Alex Miller	0c325c5351	Always check which SharedTLog is active In case it is set before we get to the onChange()	2019-10-23 01:59:36 -07:00
Alex Miller	1e5b8c74e3	Continuing a parallel peek after a timeout would hang. This is to guard against the case where 1. Peeks with sequence numbers 0-39 are submitted 2. A 15min pause happens, in which timeout removes the peek tracker data 3. Peeks with sequence numbers 40-59 are submitted, with the same peekId The second round of peeks wouldn't have the data left that it's allowed to start running peek 40 immediately, and thus would hang for 10min until it gets cleaned up. Also, guard against overflowing the sequence number.	2019-10-22 19:24:05 -07:00
Alex Miller	1eb3a70b96	Spill SharedTLog when there's more than one. When switching between spill_type or log_version, a new instance of a SharedTLog is created in the transaction log processes. If this is done in a saturated database, then doubling the amount of memory to hold mutations in memory can cause TLogs to be uncomfortably close to the 8GB OOM limit. Instead, we now thread which UID of a SharedTLog is active, and the other TLog spill out the majority of their mutations. This is a backport of #2213 (`fef89aa1`) to release-6.2	2019-10-17 01:24:50 -07:00
sramamoorthy	c9097cca18	deprecate isTLogInSameNode used by snapshot V1	2019-10-09 15:33:11 -07:00
Alex Miller	77c72de176	Comment variable and code style fix Co-Authored-By: Jingyu Zhou <jingyuzhou@gmail.com>	2019-10-07 18:08:27 -07:00
Alex Miller	71af24dff3	Fix a bug that would cause active logs to spill aggressively And add some useful logging about when things do or do not spill.	2019-10-07 18:08:27 -07:00
Alex Miller	1d8a7e5af7	Spill SharedTLog when there's more than one. When switching between spill_type or log_version, a new instance of a SharedTLog is created in the transaction log processes. If this is done in a saturated database, then doubling the amount of memory to hold mutations in memory can cause TLogs to be uncomfortably close to the 8GB OOM limit. Instead, we now thread which UID of a SharedTLog is active, and the other TLog spill out the majority of their mutations.	2019-10-07 18:08:27 -07:00
Alex Miller	5016f3fedd	Whitespace fixes no idea what happened here Co-Authored-By: Jingyu Zhou <jingyuzhou@gmail.com>	2019-10-04 13:37:59 -07:00
Alex Miller	6bcb72fa74	Fix stray Unversioned() I forgot there were two	2019-10-03 19:45:13 -07:00
Alex Miller	28f6275f94	Use AssumeVersion instead of Unversioned Which lets us revert the unversioned serialization of TLogSpillType	2019-10-03 15:59:09 -07:00
Alex Miller	9401a6941a	Code review nits const correctness and file renaming in comment. Co-Authored-By: Jingyu Zhou <jingyuzhou@gmail.com>	2019-10-03 15:53:39 -07:00
Alex Miller	6742222084	Make TLogServer able to spill by value and by reference ...and test it in simulation, but not combined yet. It turns out that because of txsTag, we basically had to support spill-by-value anyway. Thus, if we treat all tags like txsTag when spilling and peeking, then we have an easy way to bring the two spilling types back into one implementation.	2019-10-03 01:45:10 -07:00
Alex Miller	d38a96ab73	Make LogData aware of the spill type it was created to perform. The spilling type is now pulled out of the request, and then stored on LogData for later access, and persisted in the tlog metadata per tlog generation. It turns out that serializing types as Unversioned is a bit wonky.	2019-10-03 01:45:10 -07:00
Evan Tschannen	b495cc697b	Merge branch 'release-6.2' # Conflicts: # CMakeLists.txt # documentation/sphinx/source/release-notes.rst # versions.target	2019-09-13 09:25:08 -07:00
Alex Miller	53bcf41805	Fix the build.	2019-09-12 18:46:30 -07:00
Alex Miller	befa0646b3	Merge remote-tracking branch 'upstream/release-6.2' into faster-remote-dc	2019-09-12 18:46:03 -07:00
Evan Tschannen	6a7f109788	added logging on the TLog for the tag with smallest popped version	2019-09-12 16:22:01 -07:00
Alex Miller	99843bd4ba	Add parallel peek support to log routers	2019-09-12 14:26:37 -07:00
Evan Tschannen	94668c6f1f	Merge pull request #2063 from jzhou77/clang Refactor deserialization of on-wire buffer with TagsAndMessage	2019-09-09 16:34:56 -07:00
Jingyu Zhou	2d5ebebb7b	Use TagsAndMessage for deserialization in TLogServer	2019-09-05 16:53:10 -07:00
Jingyu Zhou	2723922f5f	Replace -1 as VERSION_HEADER constant for serialization	2019-09-05 12:45:39 -07:00
Meng Xu	c2355f721e	Merge branch 'master' into mengxu/performant-restore-PR	2019-09-04 17:11:42 -07:00
Jingyu Zhou	cd3f1e33d4	Refactor deserialization of TagsAndMessages Consolidate deserialization of TagsAndMessages in the structure itself and change both TLog and ServerPeekCursor to use it.	2019-09-04 14:55:05 -07:00
Evan Tschannen	24aad14f06	Merge branch 'release-6.2' # Conflicts: # CMakeLists.txt # documentation/sphinx/source/release-notes.rst # versions.target	2019-08-30 17:23:58 -07:00
Evan Tschannen	dc1d055b27	Merge pull request #2042 from senthil-ram/snap_cli_fix fix fdbcli --exec 'snapshot create.sh' failure	2019-08-30 13:40:38 -07:00
sramamoorthy	b3277f2982	Fix #2009 posix compliant args for snapshot binary	2019-08-30 12:54:09 -07:00
Andrew Noyes	6aa0ada7b1	Replace scalar root types with proper messages	2019-08-28 14:40:50 -07:00
Jingyu Zhou	4a63de16e9	Merge pull request #1945 from xumengpanda/mengxu/tLog-code-read-v2 Add comments to DiskQueue and tLog	2019-08-08 13:24:32 -07:00
Meng Xu	c9c50ceff8	Comments:Add comments to DiskQueue No functional change.	2019-08-01 15:20:01 -07:00
Meng Xu	7ccaeddf05	Merge branch 'master' into mengxu/performant-restore-PR	2019-08-01 13:23:17 -07:00
Evan Tschannen	3774ff55b0	There were still use cases where these checks are necessary	2019-07-31 17:45:21 -07:00
Evan Tschannen	854ee75664	we no longer need to special case for txs tag, because it will be initialized by createTagData	2019-07-31 17:13:15 -07:00
Evan Tschannen	ff171e293e	fix: always make sure to add txsTags to localTags for remote logs	2019-07-31 16:04:35 -07:00
Evan Tschannen	9f11f2ec53	Merge branch 'master' of github.com:apple/foundationdb	2019-07-30 16:55:56 -07:00
Evan Tschannen	aaeeb605b2	Changes to degraded can cause master recoveries, which are not supposed to happen when speedUpSimulation is true	2019-07-30 16:33:40 -07:00
Evan Tschannen	6977e7d2e8	do not return recovered version as popped for txsTags because it could cause recovery to start over optimized how buffered peek cursor discards popped data	2019-07-30 12:21:48 -07:00
Evan Tschannen	13203da199	fix: do not set the popped version of txsTag because it could be copied over at the recoveredAt version	2019-07-27 22:36:06 -07:00
Evan Tschannen	28df2c35bb	Merge pull request #1855 from alexmiller-apple/sharded-txs-safe-upgrade Make sharded txsTag upgradeable and downgradeable	2019-07-26 13:29:39 -07:00
Meng Xu	1706aaf199	Merge branch 'master' into mengxu/performant-restore-PR Fix conflict in TLogServer.actor.cpp by accepting master changes	2019-07-26 11:46:27 -07:00
sramamoorthy	9afd162e2f	remove snap v1 related code	2019-07-25 17:29:31 -07:00
Meng Xu	45083edf74	Merge branch 'master' into mengxu/performant-restore-PR Fix conflicts as well.	2019-07-25 10:46:11 -07:00
sramamoorthy	a65c9f92ed	get rid of all timeouts and other changes	2019-07-24 15:36:28 -07:00
sramamoorthy	a2f2ad96ff	code review comments and merge to master changes	2019-07-24 15:36:28 -07:00
sramamoorthy	31c010b393	few minor fixes	2019-07-24 15:36:28 -07:00
sramamoorthy	c73bdfad9f	do not pop txsTag	2019-07-24 15:36:28 -07:00
sramamoorthy	a335ed2011	includeCancelled for tLogSnapCreate	2019-07-24 15:36:28 -07:00
sramamoorthy	61cd690add	enable/disable pop req with UID mis-match to fail	2019-07-24 15:36:28 -07:00
sramamoorthy	f4e257e464	snap v2: TLog related changes	2019-07-24 15:36:28 -07:00
Evan Tschannen	6d694cc2ce	Merge pull request #1818 from alexmiller-apple/peek-cursor-timeout-bug Fix parallel peek stalling for 10min when a TLog generation is destroyed	2019-07-19 16:39:31 -07:00
Alex Miller	9863ace96c	Replace usages with initialization lists. But C++ needs a bit of help to infer through the templates.	2019-07-18 22:27:36 -07:00
Alex Miller	55258709a0	Remove an ASSERT from testing and now inaccurate comment.	2019-07-17 01:30:01 -07:00
Alex Miller	e9684a1f63	Fix issues configuring from sharded txs tag to not Which is an intermingling of what should be two commits: 1. Rely on TLogVersion instead of txsTags==0 2. Copy and index sharded txsTags between KCV and RV as txsTag when configuring log_version 4->3.	2019-07-17 01:25:09 -07:00
Alex Miller	812ce37bcd	Remove buggify and unneeded safeguards. The buggify was actually incorrect and broke an invariant, which I then fixed on the other side, but this work was actually unneeded in total. The real issue being fixed was returnIfBlock not sending an error, as well as the other error cases.	2019-07-16 15:58:02 -07:00
Alex Miller	4cc60dc9b8	Merge remote-tracking branch 'upstream/master' into peek-cursor-timeout-bug	2019-07-15 17:05:39 -07:00
Alex Miller	2cbc05fc72	Address more issues that cause peek cursors to time out. There were error cases that would cause a peek to terminate early or be cancelled without sending anything to the next peek in line. We would thus end up with the first peek in a sequence waiting on its future, and nothing that exists that would send to that future.	2019-07-15 16:03:37 -07:00
Alex Miller	c8e94e601a	Merge pull request #1729 from etschannen/feature-fast-txs-recovery Improve the recovery speed of the txnStateStore	2019-07-15 13:27:41 -07:00
Vishesh Yadav	2606794df6	Merge pull request #1812 from alexmiller-apple/improve-only-spilled Improve the behavior of parallelPeekMore+onlySpilled.	2019-07-10 17:15:19 -07:00
Evan Tschannen	d8948c8be1	Merge branch 'master' into feature-fast-txs-recovery # Conflicts: # fdbserver/TagPartitionedLogSystem.actor.cpp	2019-07-10 13:59:52 -07:00
Evan Tschannen	49121172ea	Merge pull request #1795 from alexmiller-apple/peek-from-satellites Log Routers will prefer to peek from satellite logs.	2019-07-09 17:38:57 -07:00
Alex Miller	fd769ad878	Fix parallel peek stalling for 10min when a TLog generation is destroyed. `peekTracker` was held on the Shared TLog (TLogData), whereas peeks are received and replied to as part of a TLog instance (LogData). When a peek was received on a TLog, it was registered into peekTracker along with the ReplyPromise. If the TLog was then removed as part of a no-longer-needed generation of TLogs, there is nothing left to reply to the request, but by holding onto the ReplyPromise in peekTracker, we leave the remote end with an expectation that we will reply. Then, 10min later, peekTrackerCleanup runs and finally times out the peek cursor, thus preventing FDB from being completely stuck. Now, each TLog generation has its own `peekTracker`, and when a TLog is destroyed, it times out all of the pending peek cursors that are still expecting a response. This will then trigger the client to re-issue them to the next generation of TLogs, thus removing the 10min gap to do so.	2019-07-09 17:27:36 -07:00
Alex Miller	44f11702a8	Log Routers will prefer to peek from satellite logs. Formerly, they would prefer to peek from the primary's logs. Testing of a failed region rejoining the cluster revealed that this becomes quite a strain on the primary logs when extremely large volumes of peek requests are coming from the Log Routers. It happens that we have satellites that contain the same mutations with Log Router tags, that have no other peeking load, so we can prefer to use the satellite to peek rather than the primary to distribute load across TLogs better. Unfortunately, this revealed a latent bug in how tagged mutations in the KnownCommittedVersion->RecoveryVersion gap were copied across generations when the number of log router tags were decreased. Satellite TLogs would be assigned log router tags using the team-building based logic in getPushLocations(), whereas TLogs would internally re-index tags according to tag.id%logRouterTags. This mismatch would mean that we could have: Log0 -2:0 ----- -2:0 Log 0 Log1 -2:1 \ >--- -2:1,-2:0 (-2:2 mod 2 becomes -2:0) Log 1 Log2 -2:2 / And now we have data that's tagged as -2:0 on a TLog that's not the preferred location for -2:0, and therefore a BestLocationOnly cursor would miss the mutations. This was never noticed before, as we never used a satellite as a preferred location to peek from. Merge cursors always peek from all locations, and thus a peek for -2:0 that needed data from the satellites would have gone to both TLogs and merged the results. We now take this mod-based re-indexing into account when assigning which TLogs need to recover which tags from the previous generation, to make sure that tag.id%logRouterTags always results in the assigned TLog being the preferred location. Unfortunately, previously existing will potentially have existing satellites with log router tags indexed incorrectly, so this transition needs to be gated on a `log_version` transition. Old LogSets will have an old LogVersion, and we won't prefer the satellite for peeking. Log Sets post-6.2 (opt-in) or post-6.3 (default) will be indexed correctly, and therefore we can safely offload peeking onto the satellites.	2019-07-08 22:25:01 -07:00
Alex Miller	6c8f50ca66	Improve the behavior of parallelPeekMore+onlySpilled. When onlySpilled transitions from true (don't peek memory) to false (do peek memory) as part of a parallel peek, we'll end up wasting the rest of the replies because we'll honor their onlySpilled=true setting and thus not have any additional data to return. Instead, we thread the onlySpilled back through in the same way that the ending version of the last peek overrides the requested starting version of the next peek. This simulates the same behavior that the client has, where the value of onlySpilled that we reply with comes back in the next request. I haven't actually seen it be a problem, but this should help make sure the onlySpilled transition when catching up doesn't ever cause any ill effects if a process starts riding the line between onlySpilled settings.	2019-07-08 22:13:09 -07:00
Evan Tschannen	15e894c724	Merge in master	2019-07-05 15:49:24 -07:00
Evan Tschannen	235697f688	fix: txsTags are not popped at the recovery version	2019-06-27 23:18:26 -07:00
Alex Miller	bf883d7055	Merge remote-tracking branch 'upstream/master' into flowlock-api	2019-06-25 14:26:50 -07:00
Alex Miller	7a500cd37f	A giant translation of TaskFooPriority -> TaskPriority::Foo This is so that APIs that take priorities don't take ints, which are common and easy to accidentally pass the wrong thing.	2019-06-25 02:47:35 -07:00
Evan Tschannen	1c005d5878	Merge pull request #1584 from alexmiller-apple/spilled-only-peek Save TLog resources by letting peek request only spilled data.	2019-06-20 18:22:31 -07:00
mpilman	844dd60202	FDB compiling with intel compiler	2019-06-20 09:29:01 -07:00
Evan Tschannen	e0be631414	shard the txs tag so that more transaction logs are involved in its recovery	2019-06-19 18:15:09 -07:00
mpilman	68ce9a5e75	ProtocolVersion type - second try	2019-06-18 17:55:27 -07:00
Alex Miller	51fd42a4d2	Merge remote-tracking branch 'upstream/master' into spilled-only-peek	2019-06-18 17:33:52 -07:00
mpilman	8576665a90	Revert "Revert "Make protocol version a type"" This reverts commit `455bf3b3ec`.	2019-06-18 14:49:04 -07:00
Alex Miller	455bf3b3ec	Revert "Make protocol version a type"	2019-06-18 10:59:17 -07:00
mpilman	da53a92bec	Make protocol version a type This fixes #1214 The basic idea is that ProtocolVersion is now its own type. This alone is an improvement as it makes many things more typesafe. For each version, we can now add breaking features (for example Fearless). After that, there's no need to test against actual (confusing) version numbers. Instead a developer can simply test `protocolVersion->hasFearless()` and this will return true iff the protocolVersion is newer than the newest version that didn't support fearless.	2019-06-16 09:59:15 -07:00
sramamoorthy	1190f2f33d	rebased related changes	2019-05-28 22:07:46 -07:00
sramamoorthy	b43c100e57	TLog bug fixes	2019-05-28 22:07:46 -07:00
sramamoorthy	3877f87481	comment change in tLogCommit	2019-05-28 22:07:46 -07:00
sramamoorthy	31b6c86650	ignorePopDeadline to have high limit in simulator - ignorePopDeadline to have higher limit in simulator to accommodate for the buggify delays and make snapshot succeed. - introduce a new knob for auto resetting the disabling of tlog pop	2019-05-28 22:07:46 -07:00
sramamoorthy	b1b96946af	logData->stop check right after execOpHold wait	2019-05-28 22:07:46 -07:00
sramamoorthy	5749e220bd	use FlowLock for implementing critical section Instead of using Promises and future to implement critical section use FlowLock	2019-05-28 22:07:46 -07:00
sramamoorthy	e6c0b87a4d	remove unused variable	2019-05-28 22:07:46 -07:00
sramamoorthy	f27a40f118	execProcessingHelper made synchronous tLogCommit expects no blocking between duplicate check and setting of the new version, that constraint was broken when synchronous execProcessingHelper was introduced. As a fix, execProcessingHelper was made asynchronous.	2019-05-28 22:07:46 -07:00
sramamoorthy	d3a179b6f9	Multiple bug fixes - wait for snapTLogFailKeys in a loop, otherwise in some race condition it can cause a false assert - in single region, there does not seem to be a guarantee of tagLocalityListKey for a given DC ID, avoiding that assert for now - to find the workers that are coordinators, looking up by primary address is not sufficient in some cases, hence looking by both primary and secondary address - test make files to reflect the location of the new test cases	2019-05-28 22:07:46 -07:00
sramamoorthy	dcd2d96751	make spawnProcess predictable in the simulator	2019-05-28 22:07:46 -07:00
sramamoorthy	4083af0b01	Avoid using trackLatest for TLog pop test cases	2019-05-28 22:07:46 -07:00
sramamoorthy	ec7834e2f7	code re-organization and address comments	2019-05-28 22:07:46 -07:00
sramamoorthy	b6e037ffbc	Replace fork with boost::process::child	2019-05-28 22:07:46 -07:00
sramamoorthy	e91c76834e	tlog: move snap create part to independent funcs	2019-05-28 22:07:46 -07:00
sramamoorthy	61e93a9304	Address review comments and minor fixes	2019-05-28 22:07:46 -07:00
sramamoorthy	9e3104c2d4	Fix: races in async exec leading to bad backup	2019-05-28 22:07:46 -07:00
sramamoorthy	cfdad0c5e6	tlog to snapshot exactly at exec version	2019-05-28 22:07:46 -07:00
sramamoorthy	539e65efad	Skip parsing mutations if it is tagged for TxsTag In Tlog, if a mutation is targeted for TxsTag then skip from parsing them.	2019-05-28 22:07:46 -07:00
sramamoorthy	17ecba8313	trace cleanup and other indentation changes	2019-05-28 22:07:46 -07:00
sramamoorthy	aa79480d69	changes to make fdbfork asynchronous	2019-05-28 22:07:46 -07:00
sramamoorthy	4016f16c76	Fix few compilation and bugs in rebase	2019-05-28 22:07:46 -07:00
sramamoorthy	3d5998e9dd	tlog: when pops are disabled, store them & replay In Tlogs, disable pop is done while taking snapshots. Earlier, tlogs were ignoring the pops if it got pop requests when pops were disabled. In this change, instead of ignoring the pop - it remembers the list of pops in-memory and plays them once the popping is enabled.	2019-05-28 22:07:46 -07:00
sramamoorthy	4bc4c615da	exec op to all tlog, restore change in test &other - exec operation to go to all the TLogs - minor bug fix in tlog - restore implementation for the simulator - restore snap UID to be stored in restartInfo.ini - test cases added - indentation and trace file fixes	2019-05-28 22:07:46 -07:00
sramamoorthy	72dd067173	Trace message changes and fix few FIXMEs	2019-05-28 22:07:46 -07:00
sramamoorthy	69edefe68b	Snapshot based backup and restore implementation	2019-05-28 22:07:46 -07:00
A.J. Beamon	f417e60264	Merge branch 'merge-release-6.1-into-master' into thread-safe-random-number-generation # Conflicts: # fdbserver/QuietDatabase.actor.cpp	2019-05-23 09:52:00 -07:00
A.J. Beamon	d29c7e4c9b	Merge branch 'release-6.1' into merge-release-6.1-into-master # Conflicts: # documentation/sphinx/source/release-notes.rst # fdbserver/QuietDatabase.actor.cpp # versions.target	2019-05-23 09:28:45 -07:00
Evan Tschannen	003cc6be18	fix: nothingPersistent could be incorrect when popped is equal to persistentDataVersion	2019-05-22 20:23:35 -10:00
Evan Tschannen	ee04c583fa	fix: do not pop the disk queue past the persistentDataVersion	2019-05-21 10:40:30 -07:00
Evan Tschannen	4059d68348	fix: the tlog would not pop data from the disk queue after a storage server was removed, because the tag still exists in memory on the logs fix: we could incorrectly make data durable if eraseMessagesFromMemory was in progress while running updatePersistentData the quiet database check now ensure that tlogs have no more than 30 seconds of versions unpopped from the disk queue	2019-05-20 23:58:45 -07:00
Meng Xu	9ea83e0f3c	FastRestore:Remove dbprintf	2019-05-17 17:34:42 -07:00
Alex Miller	4eb4c03ce5	Save TLog resources by letting peek request only spilled data. If a peek is entirely fulfilled from spilled data, then it's likely that the next peek will be also. It is thus wasteful for each of these peeks to call peekMessagesFromMemory, which memcpy's excessively, and then throw all that data away without using it. Now, TLogs will give a hint back to peek cursors about if the provided reply was served entirely from the spilled data, which peek cursors then feed back as the hint into their next request. At some point, a cursor will send a request for only spilled data, get an incomplete response, and then be told to send its next request as one that peeks from memory as well, and then it will fully catch up.	2019-05-14 15:38:48 -10:00
A.J. Beamon	5f55f3f613	Replace g_random and g_nondeterministic_random with functions deterministicRandom() and nondeterministicRandom() that return thread_local random number generators. Delete g_debug_random and trace_random. Allow only deterministicRandom() to be seeded, and require it to be seeded from each thread on which it is used.	2019-05-10 14:01:52 -07:00
Evan Tschannen	22499666d0	Merge branch 'release-6.1' # Conflicts: # documentation/sphinx/source/release-notes.rst # fdbserver/LogRouter.actor.cpp # flow/Trace.cpp # versions.target	2019-05-08 18:19:35 -07:00
Evan Tschannen	93eb2a9395	Merge pull request #1527 from alexmiller-apple/tstlog-6.1 Spill-by-reference knob + TLog6.0 Spilled Peek deprioritization	2019-05-03 17:19:45 -07:00
Alex Miller	c918b21137	Deprioritize spilled peeks in spill-by-value, and improve its logic. This deprioritizes before calling peekMessagesFromMemory, which should improve the memory usage of the TLog, and makes sure to keep txsTag peeks at a high priority to help recoveries stay fast.	2019-05-03 15:27:11 -07:00
Alex Miller	4052f3826a	Add a knob to limit the number of commits indexed per key. Theoretically, we could spill 20MB of 22B mutations for one key, which would generate a very long value being stored in SQLite, and very inefficiently read back. This stops that from being a problem, at the cost of some extra write calls.	2019-05-03 15:27:10 -07:00
Evan Tschannen	12088119d2	Merge pull request #1517 from alexmiller-apple/tstlog-6.1 Add a knob to limit amount of data read from sqlite for one PeekRequest.	2019-05-03 11:01:11 -07:00
Alex Miller	f4e48c3851	Add a knob to limit amount of data read from sqlite for one PeekRequest. This prevents peeking from degrading over time if there are a very large number of SpilledData entries for one particular tag.	2019-05-02 17:26:45 -07:00
Evan Tschannen	8590b710bf	added additional logging on the logs and log routers	2019-05-02 17:24:39 -07:00
Jingyu Zhou	8b5449e608	Fix review comments for PR #1473	2019-04-29 16:45:42 -07:00
Jingyu Zhou	5462f560e7	Add pseudo locality for log routers and tlogs This changes the logic of pop operations from log routers (LG): - LG pops tagLocalityLogRouterMapped from TLogs; - TLog converts tagLocalityLogRouterMapped back to tagLocalityLogRouter before popping. Later when we add more pseudo localities, the same pattern can be used.	2019-04-23 21:35:56 -07:00
Jingyu Zhou	0b1984978a	Small code refactoring.	2019-04-21 10:41:07 -07:00
Jingyu Zhou	ec1bc5cfca	Add LogSystemType enum	2019-04-21 10:41:07 -07:00
Meng Xu	529ce66b6c	Merge branch 'apple/master' into mengxu/performant-restore-PR	2019-04-18 18:02:45 -07:00
Meng Xu	4c3ccebe8a	FastRestore: Cleanup code Remove unused code and comments.	2019-04-12 13:49:55 -07:00
Evan Tschannen	6220a5ce0f	Merge pull request #1370 from jzhou77/fix-unreferenced Remove unused functions	2019-04-09 11:49:45 -07:00
mpilman	1c16f87a4e	Remove trace-calls to printable (in non-workloads)	2019-04-05 13:12:19 -07:00
Meng Xu	c4a8a80d6f	Merge branch 'apple/master' into mengxu/performant-restore-PR	2019-04-04 22:51:00 -07:00
Jingyu Zhou	47b4b82628	Merge branch 'master' into fix-unreferenced	2019-04-01 14:07:19 -07:00
Meng Xu	70d7c289f4	Merge branch 'master' into mengxu/restore/parallel-v7	2019-03-30 22:13:10 -07:00
Alex Miller	e7ad39246c	Fix typo	2019-03-29 20:16:26 -07:00
Evan Tschannen	a44ffd851e	fix: the shared tlog could fail to update a stopped tlog’s queueCommitVersion to version if a second tlog registered before it could issue the first commit for the tlog	2019-03-29 20:11:30 -07:00
Evan Tschannen	b6008558d3	renamed BinaryWriter.toStringRef() to .toValue(), because the function now returns a Standalone<StringRef>() eliminated an unnecessary copy from the proxy commit path eliminated an unnecessary copy from buffered peek cursor	2019-03-28 11:52:50 -07:00
Jingyu Zhou	a55f06e082	Remove unused functions Found with -Wunused-function flag.	2019-03-27 15:45:28 -07:00
Evan Tschannen	c705a1af74	fix: make sure recoveryLocation is always a valid page	2019-03-20 19:33:09 -07:00
Evan Tschannen	1c6ad6d307	fix: change the location where stopped is checked, because a yield could cause stopped to be set after the existing check	2019-03-20 19:33:09 -07:00
Alex Miller	b11ecb3210	Remove random bits of code that were either unneeded or leftover from debugging.	2019-03-18 15:47:20 -07:00


576 Commits