foundationdb

Commit Graph

Author	SHA1	Message	Date
Evan Tschannen	ced65cd30b	finished explicitly versioning everything stored in the database	2020-05-22 17:14:21 -07:00
Evan Tschannen	8fd926e08e	serialize old tlog entries with old protocol versions to support downgrades	2020-05-22 14:00:07 -07:00
Markus Pilman	5f9b127e56	Emit traces regularly about role assignment We are currently emitting Role transition traces when a role starts and when it ends. While this is useful for debugging, it doesn't work well with tools that inject data and might potentially miss some trace lines. We do decorate each trace line with the roles assigned to that particular process; however, this is not sufficient for tools that can make use of the UID -> Role mapping.	2020-05-08 16:27:57 -07:00
Evan Tschannen	7cebe743f9	A number of bug fixes of rare correctness errors	2020-04-29 13:50:13 -07:00
Evan Tschannen	c87aa33941	Merge branch 'release-6.2' # Conflicts: # CMakeLists.txt # bindings/go/src/fdb/generated.go # documentation/sphinx/source/api-common.rst.inc # documentation/sphinx/source/api-ruby.rst # documentation/sphinx/source/release-notes.rst # fdbclient/FailureMonitorClient.actor.cpp # fdbclient/NativeAPI.actor.cpp # fdbclient/vexillographer/fdb.options # fdbrpc/FlowTransport.actor.cpp # fdbserver/OldTLogServer_6_0.actor.cpp # fdbserver/TLogServer.actor.cpp # fdbserver/fdbserver.actor.cpp # versions.target	2020-04-23 13:47:53 -07:00
Evan Tschannen	91fba9106d	ported peek metrics to old tlog 6.0	2020-04-22 23:35:48 -07:00
Balachandar Namasivayam	a476127f5f	Merge pull request #2802 from xumengpanda/mengxu/debug-master-PR Fix correctness failure on master branch	2020-03-18 16:07:36 -07:00
Evan Tschannen	e08f0201f1	merge release 6.2 into master	2020-03-17 12:51:47 -07:00
Evan Tschannen	ea98c7a40a	added additional timeout on initPersistentState	2020-03-16 11:38:14 -07:00
Evan Tschannen	d6d347f665	treat a tlog which takes a long time to create its disk queue as failed	2020-03-13 10:31:59 -07:00
Meng Xu	bd345f85db	ConsistencyCheck: Fix failure due to address inconsistency between process and worker With TLS, a worker (or process) can have a TLS address and non-TLS address. When a process is created in simulation, the primary address is TLS by default. The non-TLS one is the TLS address port plus one. In a connection between two workers, if their primary addresses do not enable or disable TLS together, one worker will swap its primary address and secondary address so that the TLS config of the two endpoints can match. The swap can make the primary address no longer the TLS one that was created when the process is created. And the swap only happens for worker instead of process struct in simulation. This swap can cause worker->address != process->address. In checkForExtraDataStores actor, we use worker->address to check if a process is killable and use the process->address to kill the process. The inconsistency can cause simulation to kill a protected process that is not killable and leads to simulation failure.	2020-03-10 21:07:16 -07:00
Evan Tschannen	96258b9809	Merge branch 'release-6.2' # Conflicts: # documentation/sphinx/source/release-notes.rst # fdbcli/fdbcli.actor.cpp # fdbclient/ManagementAPI.actor.cpp # fdbrpc/FlowTransport.actor.cpp # fdbserver/ClusterController.actor.cpp # fdbserver/DataDistribution.actor.cpp # fdbserver/DataDistribution.actor.h # fdbserver/DataDistributionQueue.actor.cpp # fdbserver/KeyValueStoreMemory.actor.cpp # fdbserver/MasterProxyServer.actor.cpp # fdbserver/QuietDatabase.actor.cpp # fdbserver/SkipList.cpp # fdbserver/StorageMetrics.actor.h # fdbserver/TLogServer.actor.cpp # fdbserver/fdbserver.actor.cpp # fdbserver/storageserver.actor.cpp # fdbserver/workloads/KVStoreTest.actor.cpp # flow/CMakeLists.txt # flow/Knobs.cpp # flow/Knobs.h # flow/genericactors.actor.cpp # flow/serialize.h	2020-02-21 19:09:16 -08:00
Evan Tschannen	8129f74a10	Merge pull request #2698 from etschannen/feature-recruit-delay The CC waits until no new workers register before starting a bad recruitment	2020-02-20 14:42:37 -08:00
A.J. Beamon	fcbdcda490	Merge pull request #2650 from ajbeamon/fix-reverse-range-read-byte-limit-bug Fix reverse range read performance bug	2020-02-20 12:47:17 -08:00
Evan Tschannen	fbd45963d8	The cluster controller waits until no new workers register for 1.0 before starting a bad recruitment	2020-02-19 16:48:30 -08:00
A.J. Beamon	1d9140d874	Removed TLogVersion logging. Added logging of SharedTLog ID for each TLog. Switched ID logged for TLogRejoining event to the TLog instead of the SharedTLog. Made some parameters to startRole passed by reference.	2020-02-14 12:33:43 -08:00
A.J. Beamon	56053c565b	Improve TLog "Role" event by adding the worker ID, the TLog version, and under what circumstances the TLog is being started (Restored, Recruited, or Recovered). The SharedTLog role was being started and stopped twice, so remove one instance of it.	2020-02-12 15:11:38 -08:00
A.J. Beamon	df2b0452b4	Step 3 of fixing storage server range reads: change return type of readRange from VectorRef<KeyValueRef> to RangeResultRef.	2020-02-06 13:19:24 -08:00
Evan Tschannen	6c0b934dda	Merge pull request #2242 from alexmiller-apple/fix-10min-stall-again Fix the 10min multi-region recovery stall again	2020-01-23 17:53:02 -08:00
Jingyu Zhou	8b67a89eed	More review comments fixed.	2020-01-22 19:42:13 -08:00
Jingyu Zhou	9d7a1a77d0	Small fixes.	2020-01-22 19:38:45 -08:00
Jingyu Zhou	73824faf65	Track pseudo tags popping for individual IDs For each log router ID, we track the popped version of each pseudo tag so that the popping only applied to the minimum of these versions. Also add more tracing for popping and epochs.	2020-01-22 19:38:45 -08:00
Jingyu Zhou	11964733b7	WIP: should be divided into smaller commits.	2020-01-22 19:38:45 -08:00
Jingyu Zhou	03a17a30ef	Refactor: check displacement in LogSystemConfig	2020-01-22 19:38:45 -08:00
Jingyu Zhou	8221d33eb1	Use emplace_back instead of push_back for TLogServer	2020-01-22 19:35:30 -08:00
Alex Miller	f0fe62a298	TLogs should not respond with data earlier than the begin version Parallel peek more code would prefer the begin version it was sent by the previous parallel peek over the request's begin version. This means that a merge cursor trying to advance past message versions would still get old data that it would have to filter out. A simple application of std::max fixes this.	2020-01-21 19:09:07 -08:00
Alex Miller	ffc3506fff	Continuing a parallel peek after a timeout would hang.	2020-01-21 17:12:18 -08:00
Alex Miller	1cb311fcb8	Add an ASSERT_WE_THINK that peek cursors don't get timed_out() This should prevent us from regressing and having multi-region recoveries hang for 10min again.	2020-01-21 17:07:37 -08:00
Evan Tschannen	3f9d9d8b84	Merge branch 'release-6.2' # Conflicts: # CMakeLists.txt # cmake/FlowCommands.cmake # documentation/sphinx/source/release-notes.rst # fdbclient/StorageServerInterface.h # fdbserver/DataDistributionTracker.actor.cpp # fdbserver/MasterProxyServer.actor.cpp # fdbserver/fdbserver.actor.cpp # flow/Knobs.h # flow/Platform.cpp # versions.target	2020-01-16 18:37:47 -08:00
Evan Tschannen	827cea74b5	fix: tlogs must send a recruitment reply even when actor cancelled or the recruitment endpoint will be marked as permanently failed	2020-01-16 17:37:17 -08:00
Evan Tschannen	ebcb2f79ed	Merge branch 'master' of github.com:apple/foundationdb	2019-11-22 15:34:49 -08:00
Evan Tschannen	8d3ef89540	Merge branch 'release-6.2' # Conflicts: # CMakeLists.txt # documentation/sphinx/source/release-notes.rst # fdbclient/MutationList.h # fdbserver/MasterProxyServer.actor.cpp # versions.target	2019-11-14 15:49:56 -08:00
negoyal	a4a0bf18f9	Merging with Master.	2019-11-12 13:01:29 -08:00
Evan Tschannen	396dccbc98	when peeking from satellites we do not need to limit the amount of peeking on log router tags, because that is the only thing that can be peeked from a satellite log	2019-11-08 18:34:05 -08:00
Evan Tschannen	afc9713005	Merge branch 'release-6.2' # Conflicts: # CMakeLists.txt # documentation/sphinx/source/release-notes.rst # fdbclient/FDBTypes.h # fdbserver/LogSystem.h # fdbserver/LogSystemPeekCursor.actor.cpp # fdbserver/OldTLogServer_6_0.actor.cpp # fdbserver/TLogServer.actor.cpp # versions.target	2019-11-06 13:45:37 -08:00
Evan Tschannen	a8ca47beff	optimized memory allocations by using VectorRef<Tag> instead of std::vector<Tag>	2019-11-05 18:07:30 -08:00
Evan Tschannen	4de60fc437	Merge branch 'release-6.2' # Conflicts: # documentation/sphinx/source/release-notes.rst # fdbserver/TLogServer.actor.cpp	2019-11-01 15:48:04 -07:00
Evan Tschannen	85c315f684	Fix: parallelPeekMore was not enabled when peeking from log routers	2019-11-01 14:02:44 -07:00
Evan Tschannen	3325980c03	Merge branch 'release-6.2' # Conflicts: # CMakeLists.txt # documentation/sphinx/source/release-notes.rst # fdbserver/DataDistribution.actor.cpp # fdbserver/OldTLogServer_6_0.actor.cpp # fdbserver/TLogServer.actor.cpp # fdbserver/WorkerInterface.actor.h # fdbserver/worker.actor.cpp # versions.target	2019-10-24 17:38:15 -07:00
Evan Tschannen	2722c8b188	avoid starting a new startSpillingActor with every TLog recruitment	2019-10-23 11:15:54 -07:00
Evan Tschannen	e01e8371a6	Merge pull request #2256 from alexmiller-apple/spill-log-on-switch-6.2 Spill SharedTLog when there's more than one	2019-10-23 10:51:28 -07:00
Alex Miller	0c325c5351	Always check which SharedTLog is active In case it is set before we get to the onChange()	2019-10-23 01:59:36 -07:00
Alex Miller	1e5b8c74e3	Continuing a parallel peek after a timeout would hang. This is to guard against the case where 1. Peeks with sequence numbers 0-39 are submitted 2. A 15min pause happens, in which timeout removes the peek tracker data 3. Peeks with sequence numbers 40-59 are submitted, with the same peekId The second round of peeks wouldn't have the data left that it's allowed to start running peek 40 immediately, and thus would hang for 10min until it gets cleaned up. Also, guard against overflowing the sequence number.	2019-10-22 19:24:05 -07:00
Alex Miller	1eb3a70b96	Spill SharedTLog when there's more than one. When switching between spill_type or log_version, a new instance of a SharedTLog is created in the transaction log processes. If this is done in a saturated database, then doubling the amount of memory to hold mutations in memory can cause TLogs to be uncomfortably close to the 8GB OOM limit. Instead, we now thread which UID of a SharedTLog is active, and the other TLog spill out the majority of their mutations. This is a backport of #2213 (`fef89aa1`) to release-6.2	2019-10-17 01:24:50 -07:00
sramamoorthy	c9097cca18	deprecate isTLogInSameNode used by snapshot V1	2019-10-09 15:33:11 -07:00
Alex Miller	77c72de176	Comment variable and code style fix Co-Authored-By: Jingyu Zhou <jingyuzhou@gmail.com>	2019-10-07 18:08:27 -07:00
Alex Miller	1d8a7e5af7	Spill SharedTLog when there's more than one. When switching between spill_type or log_version, a new instance of a SharedTLog is created in the transaction log processes. If this is done in a saturated database, then doubling the amount of memory to hold mutations in memory can cause TLogs to be uncomfortably close to the 8GB OOM limit. Instead, we now thread which UID of a SharedTLog is active, and the other TLog spill out the majority of their mutations.	2019-10-07 18:08:27 -07:00
Evan Tschannen	b495cc697b	Merge branch 'release-6.2' # Conflicts: # CMakeLists.txt # documentation/sphinx/source/release-notes.rst # versions.target	2019-09-13 09:25:08 -07:00
Alex Miller	6ef43399ac	Also update OldTLogServer_6_0	2019-09-12 18:45:51 -07:00
Jingyu Zhou	2723922f5f	Replace -1 as VERSION_HEADER constant for serialization	2019-09-05 12:45:39 -07:00

137 Commits