foundationdb

Commit Graph

Author	SHA1	Message	Date
Meng Xu	bd345f85db	ConsistencyCheck:Fix failure due to address inconsistency between process and worker With TLS, a worker (or process) can have a TLS address and non-TLS address. When a process is created in simulation, the primary address is TLS by default. The non-TLS one is the TLS address port plus one. In a connection between two workers, if their primary addresses do not enable or disable TLS together, one worker will swap its primary address and secondary address so that the TLS config of the two endpoints can match. The swap can make the primary address no longer the TLS one that was created when the process is created. And the swap only happens for worker instead of process struct in simulation. This swap can cause worker->address != process->address. In checkForExtraDataStores actor, we use worker->address to check if a process is killable and use the process->address to kill the process. The inconsistency can cause simulation to kill a protected process that is not killable and leads to simulation failure.	2020-03-10 21:07:16 -07:00
Evan Tschannen	303df197cf	Merge branch 'release-6.2' # Conflicts: # CMakeLists.txt # bindings/c/test/mako/mako.c # documentation/sphinx/source/release-notes.rst # fdbbackup/backup.actor.cpp # fdbclient/NativeAPI.actor.cpp # fdbclient/NativeAPI.actor.h # fdbserver/DataDistributionQueue.actor.cpp # fdbserver/Knobs.cpp # fdbserver/Knobs.h # fdbserver/LogRouter.actor.cpp # fdbserver/SkipList.cpp # fdbserver/fdbserver.actor.cpp # flow/CMakeLists.txt # flow/Knobs.cpp # flow/Knobs.h # flow/flow.vcxproj # flow/flow.vcxproj.filters # versions.target	2020-03-06 18:22:46 -08:00
Evan Tschannen	1076abdee5	fixed crash when interf was not created	2020-03-05 19:09:08 -08:00
Evan Tschannen	1128666840	added additional logging on the log router	2020-03-05 18:17:06 -08:00
Evan Tschannen	96258b9809	Merge branch 'release-6.2' # Conflicts: # documentation/sphinx/source/release-notes.rst # fdbcli/fdbcli.actor.cpp # fdbclient/ManagementAPI.actor.cpp # fdbrpc/FlowTransport.actor.cpp # fdbserver/ClusterController.actor.cpp # fdbserver/DataDistribution.actor.cpp # fdbserver/DataDistribution.actor.h # fdbserver/DataDistributionQueue.actor.cpp # fdbserver/KeyValueStoreMemory.actor.cpp # fdbserver/MasterProxyServer.actor.cpp # fdbserver/QuietDatabase.actor.cpp # fdbserver/SkipList.cpp # fdbserver/StorageMetrics.actor.h # fdbserver/TLogServer.actor.cpp # fdbserver/fdbserver.actor.cpp # fdbserver/storageserver.actor.cpp # fdbserver/workloads/KVStoreTest.actor.cpp # flow/CMakeLists.txt # flow/Knobs.cpp # flow/Knobs.h # flow/genericactors.actor.cpp # flow/serialize.h	2020-02-21 19:09:16 -08:00
Evan Tschannen	cf4efca852	fix: buffered cursor should always make sure all of the sub-cursors are completely exhausted before calculating minVersion. It is not legal to advance a cursor version past an epochEnd (+100 million versions) without also returning the epochEnd mutation, or the storage servers might not be able to rollback far enough because the end of the previous epoch will be made durable	2020-02-19 15:24:32 -08:00
Alex Miller	7798456201	Make TLogs have consistent parallel peek behavior. TLogServer and LogRouter had some leftover code from me trying to be more "correct" about parallel peek semantics, but those changes weren't reflected in the OldTLog* files. I've reverted the changes, as realistically, they are more likely to waste CPU than improve TLog behavior.	2020-01-21 18:23:16 -08:00
Alex Miller	858e4e5900	Move the check to a better location. This way, we avoid some ID randomness, and also avoid the potential for resetting the randomID and sequence without clearing out the future vector.	2020-01-21 17:08:42 -08:00
Alex Miller	1cb311fcb8	Add an ASSERT_WE_THINK that peek cursors don't get timed_out() This should prevent us from regressing and having multi-region recoveries hang for 10min again.	2020-01-21 17:07:37 -08:00
Alex Miller	0662f8dba0	When switching parallel->single->parallel, reset sequence and peekId This fixes an issue where one could hang for 10min for the second parallel peek to time out, if one happened to catch the edge of a onlySpilled transition wrong.	2020-01-21 17:07:37 -08:00
Evan Tschannen	afc9713005	Merge branch 'release-6.2' # Conflicts: # CMakeLists.txt # documentation/sphinx/source/release-notes.rst # fdbclient/FDBTypes.h # fdbserver/LogSystem.h # fdbserver/LogSystemPeekCursor.actor.cpp # fdbserver/OldTLogServer_6_0.actor.cpp # fdbserver/TLogServer.actor.cpp # versions.target	2019-11-06 13:45:37 -08:00
Evan Tschannen	dbc5a2393c	combineMessages still did not serialize tags correctly	2019-11-05 18:44:30 -08:00
Evan Tschannen	1c873591be	fixed a compiler error	2019-11-05 18:32:15 -08:00
Evan Tschannen	86560fe727	fix: tempTags was not used correctly	2019-11-05 18:22:25 -08:00
Evan Tschannen	a8ca47beff	optimized memory allocations by using VectorRef<Tag> instead of std::vector<Tag>	2019-11-05 18:07:30 -08:00
Evan Tschannen	daac8a2c22	Knobified a few variables	2019-11-04 20:21:38 -08:00
Evan Tschannen	457896b80d	remote logs use bufferedCursor when peeking from log routers to improve performance bufferedCursor performance has been improved	2019-11-04 19:47:45 -08:00
Evan Tschannen	3325980c03	Merge branch 'release-6.2' # Conflicts: # CMakeLists.txt # documentation/sphinx/source/release-notes.rst # fdbserver/DataDistribution.actor.cpp # fdbserver/OldTLogServer_6_0.actor.cpp # fdbserver/TLogServer.actor.cpp # fdbserver/WorkerInterface.actor.h # fdbserver/worker.actor.cpp # versions.target	2019-10-24 17:38:15 -07:00
Evan Tschannen	a7492aab0a	fix: poppedVersion can update during a yield, so all work must be done immediately after getMore returns	2019-10-23 23:06:02 -07:00
Alex Miller	1e5b8c74e3	Continuing a parallel peek after a timeout would hang. This is to guard against the case where 1. Peeks with sequence numbers 0-39 are submitted 2. A 15min pause happens, in which timeout removes the peek tracker data 3. Peeks with sequence numbers 40-59 are submitted, with the same peekId The second round of peeks wouldn't have the data left that it's allowed to start running peek 40 immediately, and thus would hang for 10min until it gets cleaned up. Also, guard against overflowing the sequence number.	2019-10-22 19:24:05 -07:00
Alex Miller	c008e7f8b3	When switching parallel->single->parallel, reset sequence and peekId This fixes an issue where one could hang for 10min for the second parallel peek to time out, if one happened to catch the edge of a onlySpilled transition wrong.	2019-10-22 19:10:58 -07:00
Evan Tschannen	b495cc697b	Merge branch 'release-6.2' # Conflicts: # CMakeLists.txt # documentation/sphinx/source/release-notes.rst # versions.target	2019-09-13 09:25:08 -07:00
Alex Miller	324289039a	When reloading one cursor in a merge cursor, top off the other cursors as well.	2019-09-12 16:22:28 -07:00
Jingyu Zhou	2723922f5f	Replace -1 as VERSION_HEADER constant for serialization	2019-09-05 12:45:39 -07:00
Jingyu Zhou	f9357c5ad8	Fix side effect of ArenaReader ServerPeekCursor::nextMessage() should only consume the message header, because the reader() directly inherits the current position. The previous commit changes the position to the beginning of the next message, which breaks storage server code.	2019-09-05 11:07:07 -07:00
Jingyu Zhou	cd3f1e33d4	Refactor deserialization of TagsAndMessages Consolidate deserialization of TagsAndMessages in the structure itself and change both TLog and ServerPeekCursor to use it.	2019-09-04 14:55:05 -07:00
Evan Tschannen	b0480edd15	fix: messageVersion could be larger than poppedVersion, and we will discard messages that are needed	2019-08-06 16:31:05 -07:00
Evan Tschannen	7ac7eb82f2	fix: buffered cursor would start multiple bufferedGetMore actors advance all of the cursors to the poppedVersion	2019-07-30 14:42:05 -07:00
Evan Tschannen	b5cb7919b6	fix: canDiscardPopped was not reset when necessary in all cases	2019-07-30 13:44:44 -07:00
Evan Tschannen	5d79e4141f	fix: buffered cursor messageVersion should be set to the version we will be at after exhausting everything in messages	2019-07-30 12:38:44 -07:00
Evan Tschannen	6977e7d2e8	do not return recovered version as popped for txsTags because it could cause recovery to start over optimized how buffered peek cursor discards popped data	2019-07-30 12:21:48 -07:00
Evan Tschannen	7a932479dd	throw away state if we ever read popped data from the disk queue adapter	2019-07-30 10:14:39 -07:00
Evan Tschannen	45f7b41b48	fix: multi-cursor could discard popped commits after already returning data	2019-07-29 21:36:42 -07:00
Evan Tschannen	5bb322b483	implement popped on bufferedCursor	2019-07-29 21:19:47 -07:00
Evan Tschannen	d8948c8be1	Merge branch 'master' into feature-fast-txs-recovery # Conflicts: # fdbserver/TagPartitionedLogSystem.actor.cpp	2019-07-10 13:59:52 -07:00
Evan Tschannen	b27a909f3a	fix: onDisconnectOrFailure can spuriously trigger	2019-07-09 16:38:59 -07:00
Evan Tschannen	15e894c724	Merge in master	2019-07-05 15:49:24 -07:00
Evan Tschannen	cfce1e1705	fix: buffered peek cursor would advance very slowly through large ranges of empty versions	2019-06-28 15:54:08 -07:00
Alex Miller	bf883d7055	Merge remote-tracking branch 'upstream/master' into flowlock-api	2019-06-25 14:26:50 -07:00
Alex Miller	7a500cd37f	A giant translation of TaskFooPriority -> TaskPriority::Foo This is so that APIs that take priorities don't take ints, which are common and easy to accidentally pass the wrong thing.	2019-06-25 02:47:35 -07:00
Evan Tschannen	1c005d5878	Merge pull request #1584 from alexmiller-apple/spilled-only-peek Save TLog resources by letting peek request only spilled data.	2019-06-20 18:22:31 -07:00
Evan Tschannen	e0be631414	shard the txs tag so that more transaction logs are involved in its recovery	2019-06-19 18:15:09 -07:00
Alex Miller	ce24db3c53	Fully consume parallelPeekMore results before switching back.	2019-06-19 01:30:49 -07:00
Alex Miller	51fd42a4d2	Merge remote-tracking branch 'upstream/master' into spilled-only-peek	2019-06-18 17:33:52 -07:00
mpilman	8576665a90	Revert "Revert "Make protocol version a type"" This reverts commit `455bf3b3ec`.	2019-06-18 14:49:04 -07:00
Alex Miller	455bf3b3ec	Revert "Make protocol version a type"	2019-06-18 10:59:17 -07:00
mpilman	da53a92bec	Make protocol version a type This fixes #1214 The basic idea is that ProtocolVersion is now its own type. This alone is an improvement as it makes many things more typesafe. For each version, we can now add breaking features (for example Fearless). After that, there's no need to test against actual (confusing) version numbers. Instead a developer can simply test `protocolVersion->hasFearless()` and this will return true iff the protocolVersion is newer than the newest version that didn't support fearless.	2019-06-16 09:59:15 -07:00
Alex Miller	658e61b394	And now use spilledOnly as a hint to do parallel peeks. If there's some spilled data, there's probably a lot of spilled data, and now we can pull all of it faster.	2019-05-14 21:03:44 -10:00
Alex Miller	4eb4c03ce5	Save TLog resources by letting peek request only spilled data. If a peek is entirely fulfilled from spilled data, then it's likely that the next peek will be also. It is thus wasteful for each of these peeks to call peekMessagesFromMemory, which memcpy's excessively, and then throw all that data away without using it. Now, TLogs will give a hint back to peek cursors about if the provided reply was served entirely from the spilled data, which peek cursors then feed back as the hint into their next request. At some point, a cursor will send a request for only spilled data, get an incomplete response, and then be told to send its next request as one that peeks from memory as well, and then it will fully catch up.	2019-05-14 15:38:48 -10:00
A.J. Beamon	5f55f3f613	Replace g_random and g_nondeterministic_random with functions deterministicRandom() and nondeterministicRandom() that return thread_local random number generators. Delete g_debug_random and trace_random. Allow only deterministicRandom() to be seeded, and require it to be seeded from each thread on which it is used.	2019-05-10 14:01:52 -07:00

98 Commits