Commit Graph

413 Commits

Author SHA1 Message Date
Alvin Moore a160f9199f
Merge pull request #3171 from apple/release-6.3
Merge Release 6.3 into Master
2020-05-14 10:00:47 -07:00
Alex Miller bf6d056095 Changing the last suggestions from review. 2020-05-13 18:48:43 -07:00
Alex Miller ccaac162e2 Resolve performance concerns of nearly-no-op debugMutation being frequently called
This introduces unhygienic macro variants that inline an `ENABLED &&`
before the TraceEvent.  This way, they get entirely compiled out unless
enabled.

Then rewrite all debugMutation uses via sed.
2020-05-13 18:44:15 -07:00
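The macro trick described in commit ccaac162e2 can be sketched roughly as follows. This is a minimal illustration, not the code from the commit: `DemoTraceEvent`, `DEMO_TRACKING_ENABLED`, `DEMO_DEBUG_EVENT`, and `traceCommit` are hypothetical names invented for the example. The macro is "unhygienic" in the sense that it expands to an open expression, which the call site continues with `.detail(...)`. Because the flag is a compile-time `false`, the right-hand side of `&&` is never evaluated and the compiler is free to drop it.

```cpp
#include <cstdint>
#include <iostream>

// Counts how many demo events were actually constructed.
int g_eventsConstructed = 0;

// Hypothetical stand-in for a trace event (not FoundationDB's TraceEvent).
struct DemoTraceEvent {
    explicit DemoTraceEvent(const char* type) {
        ++g_eventsConstructed;
        std::cout << type;
    }
    DemoTraceEvent& detail(const char* key, std::int64_t value) {
        std::cout << ' ' << key << '=' << value;
        return *this;
    }
    ~DemoTraceEvent() { std::cout << '\n'; }
    // Lets the event appear as the right-hand operand of `&&`.
    explicit operator bool() const { return true; }
};

// Assumed compile-time switch; flip to true to enable the events.
constexpr bool DEMO_TRACKING_ENABLED = false;

// Expands to `flag && DemoTraceEvent(type)`; the caller appends `.detail(...)`.
#define DEMO_DEBUG_EVENT(type) DEMO_TRACKING_ENABLED && DemoTraceEvent(type)

// Example call site: with the flag off, no event is constructed at all.
int traceCommit(std::int64_t version) {
    DEMO_DEBUG_EVENT("DemoCommit").detail("Version", version);
    return g_eventsConstructed;
}
```

With the flag set to `false`, `traceCommit(42)` returns 0 because short-circuit evaluation skips the event entirely, including its `.detail(...)` arguments.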
Alex Miller 27da91ab9e Merge remote-tracking branch 'upstream/master' into mutation-debugging 2020-05-13 12:51:44 -07:00
Alex Miller f148412a32 Make UPDATE_STORAGE_BYTE_LIMIT the reference spill variety.
Which is unrelated, but a change I was supposed to do a while ago and
forgot.
2020-05-12 16:59:20 -07:00
Evan Tschannen 7cebe743f9 A number of fixes for rare correctness bugs 2020-04-29 13:50:13 -07:00
Evan Tschannen c87aa33941 Merge branch 'release-6.2'
# Conflicts:
#	CMakeLists.txt
#	bindings/go/src/fdb/generated.go
#	documentation/sphinx/source/api-common.rst.inc
#	documentation/sphinx/source/api-ruby.rst
#	documentation/sphinx/source/release-notes.rst
#	fdbclient/FailureMonitorClient.actor.cpp
#	fdbclient/NativeAPI.actor.cpp
#	fdbclient/vexillographer/fdb.options
#	fdbrpc/FlowTransport.actor.cpp
#	fdbserver/OldTLogServer_6_0.actor.cpp
#	fdbserver/TLogServer.actor.cpp
#	fdbserver/fdbserver.actor.cpp
#	versions.target
2020-04-23 13:47:53 -07:00
Evan Tschannen 0a1b2a572f more compile fixes 2020-04-22 14:41:17 -07:00
Evan Tschannen 68906bf3c3 fix compile errors 2020-04-22 14:36:41 -07:00
Evan Tschannen d0cc2a1ee4 added logging for parallel peeks on TLogs 2020-04-22 14:24:45 -07:00
Alex Miller 122762cce1 Add debugMessagesAndTags, and track mutations in more places.
Like:
* Leaving the proxy
* Entering the TLog
* Leaving the TLog
* Being read on a cursor

All of this brought to you by TagsAndMessage!

This also slides in a minor optimization as to how mutations are serialized per target log.
2020-03-27 03:31:04 -07:00
Evan Tschannen e08f0201f1 merge release 6.2 into master 2020-03-17 12:51:47 -07:00
Evan Tschannen ea98c7a40a added additional timeout on initPersistentState 2020-03-16 11:38:14 -07:00
Evan Tschannen d6d347f665 treat a tlog which takes a long time to create its disk queue as failed 2020-03-13 10:31:59 -07:00
Evan Tschannen 96258b9809 Merge branch 'release-6.2'
# Conflicts:
#	documentation/sphinx/source/release-notes.rst
#	fdbcli/fdbcli.actor.cpp
#	fdbclient/ManagementAPI.actor.cpp
#	fdbrpc/FlowTransport.actor.cpp
#	fdbserver/ClusterController.actor.cpp
#	fdbserver/DataDistribution.actor.cpp
#	fdbserver/DataDistribution.actor.h
#	fdbserver/DataDistributionQueue.actor.cpp
#	fdbserver/KeyValueStoreMemory.actor.cpp
#	fdbserver/MasterProxyServer.actor.cpp
#	fdbserver/QuietDatabase.actor.cpp
#	fdbserver/SkipList.cpp
#	fdbserver/StorageMetrics.actor.h
#	fdbserver/TLogServer.actor.cpp
#	fdbserver/fdbserver.actor.cpp
#	fdbserver/storageserver.actor.cpp
#	fdbserver/workloads/KVStoreTest.actor.cpp
#	flow/CMakeLists.txt
#	flow/Knobs.cpp
#	flow/Knobs.h
#	flow/genericactors.actor.cpp
#	flow/serialize.h
2020-02-21 19:09:16 -08:00
Evan Tschannen 8129f74a10
Merge pull request #2698 from etschannen/feature-recruit-delay
The CC waits until no new workers register before starting a bad recruitment
2020-02-20 14:42:37 -08:00
A.J. Beamon fcbdcda490
Merge pull request #2650 from ajbeamon/fix-reverse-range-read-byte-limit-bug
Fix reverse range read performance bug
2020-02-20 12:47:17 -08:00
Evan Tschannen fbd45963d8 The cluster controller waits until no new workers register for 1.0 before starting a bad recruitment 2020-02-19 16:48:30 -08:00
A.J. Beamon 1d9140d874 Removed TLogVersion logging.
Added logging of SharedTLog ID for each TLog.
Switched ID logged for TLogRejoining event to the TLog instead of the SharedTLog.
Made some parameters to startRole passed by reference.
2020-02-14 12:33:43 -08:00
A.J. Beamon 56053c565b Improve TLog "Role" event by adding the worker ID, the TLog version, and under what circumstances the TLog is being started (Restored, Recruited, or Recovered).
The SharedTLog role was being started and stopped twice, so remove one instance of it.
2020-02-12 15:11:38 -08:00
Markus Pilman e71fe44ee3
Merge branch 'master' into features/icc 2020-02-08 21:33:02 -08:00
A.J. Beamon df2b0452b4 Step 3 of fixing storage server range reads: change return type of readRange from VectorRef<KeyValueRef> to RangeResultRef. 2020-02-06 13:19:24 -08:00
mpilman d09e07f1f5 Merge remote-tracking branch 'upstream/master' into features/icc 2020-02-04 10:26:18 -08:00
Jingyu Zhou 7544ff88d9 Comment out frequent TLogPop trace event 2020-01-31 19:29:09 -08:00
Evan Tschannen 6c0b934dda
Merge pull request #2242 from alexmiller-apple/fix-10min-stall-again
Fix the 10min multi-region recovery stall again
2020-01-23 17:53:02 -08:00
Jingyu Zhou 8b67a89eed More review comments fixed. 2020-01-22 19:42:13 -08:00
Jingyu Zhou 9d7a1a77d0 Small fixes. 2020-01-22 19:38:45 -08:00
Jingyu Zhou 85c4a4e422 Address review comments for PR #1625 2020-01-22 19:38:45 -08:00
Jingyu Zhou 73824faf65 Track pseudo tags popping for individual IDs
For each log router ID, we track the popped version of each pseudo tag so that
the popping is only applied to the minimum of these versions.

Also add more tracing for popping and epochs.
2020-01-22 19:38:45 -08:00
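One way to picture the bookkeeping described in commit 73824faf65 is sketched below. It assumes that each log router ID keeps one popped version per pseudo tag, and that data may only be discarded up to the smallest of those versions. The class name `PseudoTagPopTracker`, the integer IDs, and the method `pop` are hypothetical and do not come from the FoundationDB source.

```cpp
#include <algorithm>
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

using Version = std::int64_t;

// Hypothetical tracker: per log router ID, one popped version per pseudo tag.
class PseudoTagPopTracker {
public:
    explicit PseudoTagPopTracker(std::vector<int> pseudoTags)
        : pseudoTags_(std::move(pseudoTags)) {}

    // Records a pop for (routerId, pseudoTag) and returns the version that is
    // safe to pop for that router: the minimum over all of its pseudo tags.
    Version pop(int routerId, int pseudoTag, Version upTo) {
        std::map<int, Version>& versions = popped_[routerId];
        if (versions.empty()) {
            // First pop for this router: every known pseudo tag starts at 0.
            for (int tag : pseudoTags_) versions[tag] = 0;
        }
        auto it = versions.find(pseudoTag);
        if (it != versions.end()) it->second = std::max(it->second, upTo);

        if (versions.empty()) return 0;  // no pseudo tags configured
        Version minVersion = versions.begin()->second;
        for (const auto& entry : versions) {
            minVersion = std::min(minVersion, entry.second);
        }
        return minVersion;
    }

private:
    std::vector<int> pseudoTags_;
    std::map<int, std::map<int, Version>> popped_;
};
```

In this sketch, a router whose two pseudo tags have popped to 100 and 60 reports 60 as the safe pop version, so the slower consumer never loses data it has not yet read.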
Jingyu Zhou 11964733b7 WIP: should be divided into smaller commits. 2020-01-22 19:38:45 -08:00
Jingyu Zhou 03a17a30ef Refactor: check displacement in LogSystemConfig 2020-01-22 19:38:45 -08:00
Jingyu Zhou 442738b6db Small code refactoring 2020-01-22 19:35:30 -08:00
Jingyu Zhou 8221d33eb1 Use emplace_back instead of push_back for TLogServer 2020-01-22 19:35:30 -08:00
Alex Miller f0fe62a298 TLogs should not respond with data earlier than the begin version
The parallel peek code would prefer the begin version it was sent by
the previous parallel peek over the request's begin version.  This means
that a merge cursor trying to advance past message versions would still
get old data that it would have to filter out.

A simple application of std::max fixes this.
2020-01-21 19:09:07 -08:00
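The `std::max` fix mentioned in commit f0fe62a298 might look roughly like the sketch below. The function `effectivePeekBegin` and its parameter names are hypothetical; the only detail taken from the commit message is that the begin version carried over from the previous parallel peek should no longer override a later begin version in the request.

```cpp
#include <algorithm>
#include <cstdint>

using Version = std::int64_t;

// Hypothetical helper: choose the version a parallel peek should start from.
// `requestBegin` is the begin version in the incoming request, and
// `previousPeekEnd` is where the previous peek in the sequence left off.
Version effectivePeekBegin(Version requestBegin, Version previousPeekEnd) {
    // Never start earlier than the request asked for, even if the previous
    // peek in the sequence ended at an older version.
    return std::max(requestBegin, previousPeekEnd);
}
```

For example, if the previous peek ended at version 20 but the cursor has since advanced and asks for version 50, the sketch starts at 50 instead of returning data the cursor would have to filter out.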
Alex Miller 7798456201 Make TLogs have consistent parallel peek behavior.
TLogServer and LogRouter had some leftover code from me trying to be
more "correct" about parallel peek semantics, but those changes weren't
reflected in the OldTLog* files.  I've reverted the changes, as
realistically, they are more likely to waste CPU than improve TLog behavior.
2020-01-21 18:23:16 -08:00
Alex Miller ffc3506fff Continuing a parallel peek after a timeout would hang. 2020-01-21 17:12:18 -08:00
Alex Miller 9c47bbe460 Remove trackerData time bump
We're in an error handling case, so this shouldn't be considered
making forward progress.
2020-01-21 17:08:42 -08:00
Alex Miller 1cb311fcb8 Add an ASSERT_WE_THINK that peek cursors don't get timed_out()
This should prevent us from regressing and having multi-region
recoveries hang for 10min again.
2020-01-21 17:07:37 -08:00
Evan Tschannen 3f9d9d8b84 Merge branch 'release-6.2'
# Conflicts:
#	CMakeLists.txt
#	cmake/FlowCommands.cmake
#	documentation/sphinx/source/release-notes.rst
#	fdbclient/StorageServerInterface.h
#	fdbserver/DataDistributionTracker.actor.cpp
#	fdbserver/MasterProxyServer.actor.cpp
#	fdbserver/fdbserver.actor.cpp
#	flow/Knobs.h
#	flow/Platform.cpp
#	versions.target
2020-01-16 18:37:47 -08:00
Evan Tschannen 827cea74b5 fix: tlogs must send a recruitment reply even when the actor is cancelled, or the recruitment endpoint will be marked as permanently failed 2020-01-16 17:37:17 -08:00
Alex Miller f58507c830 Rename poppedLocationForVersion -> versionForPoppedLocation 2019-12-19 10:24:31 -08:00
Alex Miller b5d82a74c3
Update fdbserver/TLogServer.actor.cpp
Co-Authored-By: Jingyu Zhou <jingyuzhou@gmail.com>
2019-12-19 10:20:52 -08:00
Alex Miller d8cbd495af Fix another pop + spill/dq-pop interleaving issue
This fixes an issue introduced in the previous patch, where pop would
immediately set `poppedLocationNeedsUpdate`, but setting the popped
version was now delayed.  This means that we could:

1. Run the spill loop and persist all popped versions
2. Receive a pop, and set the poppedLocationNeedsUpdate flag
3. Run the dq-pop loop, and clear the poppedLocationNeedsUpdate flag

and now when we update the persistentPopped version again, we won't have
the flag set for dq-pop to know that it needs to scan the spilled data
again for the minLocation.

We could more carefully update the flag, but instead, I've just
converted it into a version that's kept in sync purely in the dq-pop
loop, to remove shared state between pop and the dq-pop loop.
2019-12-17 23:15:48 -08:00
Alex Miller b36062a509 DiskQueue should only pop based off of persisted popped tag versions
This commit is to fix a bug where popping a tag between
updatePersistentData and popDiskQueue can cause the TLog to recover to
an incorrect understanding of what data it has available.

The following series of events need to happen to trigger this bug:

    Tag 1:1 is popped to version 10
    updatePersistentData is run...
      updatePersistentPopped runs and persistentData stores 1:1 as popped to 10
      A mutation is spilled for 1:1 at version 11 at location 1000
      A mutation is spilled for 1:1 at version 21 at location 5000
    updatePersistentData finishes and commits the btree changes
    Tag 1:1 is popped to version 20
    popDiskQueue runs
      The btree is read for spilled mutations with version >=20
      The minimum location required for the disk queue is found to be location 5000
      The disk queue is popped to location 5000

    The TLog crashes

    The worker restarts, and reloads the TLog files from disk
    restorePersistentPopped restores tag 1:1 as having been popped to version 10
    Parallel peeks are received for tag 1:1 starting at version 0
      The first peek is less than the popped version, so we respond with no data, and an end version of 10
      The second peek starts at version 10, which is greater than the popped version
      The btree is read for spilled mutations, and we find that there is a mutation at version 11 at location 1000
      Location 1000 is read in the DiskQueue

The resulting page read at Location 1000 was popped pre-crash, and thus
might either (a) be corrupt or (b) have an incorrect sequence number.

The fix to this is to force popDiskQueue/updatePoppedLocation to use the
popped version that was persisted to disk, and not the most recently
popped version for the given tag.

This bug doesn't manifest in simulation, because we don't have any code
that peeks at a lower version than what has been popped.
2019-12-17 23:02:37 -08:00
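The scenario in commit b36062a509 can be replayed with a small model. This is a sketch under simplifying assumptions: the spilled index is a plain `std::map` from version to disk queue location, and `minLocationNeeded` is a hypothetical helper, not a function from the TLog code. The point is only that the pop target depends on which popped version is fed in.

```cpp
#include <algorithm>
#include <cstdint>
#include <map>

using Version = std::int64_t;
using Location = std::int64_t;

// Hypothetical model of the spilled-mutation index for one tag:
// version of the spilled mutation -> its location in the disk queue.
using SpilledIndex = std::map<Version, Location>;

// Returns the smallest disk queue location still needed for mutations with
// version >= poppedVersion, or `noneNeeded` if no such mutation is spilled.
Location minLocationNeeded(const SpilledIndex& spilled,
                           Version poppedVersion,
                           Location noneNeeded) {
    Location result = noneNeeded;
    bool found = false;
    for (auto it = spilled.lower_bound(poppedVersion); it != spilled.end(); ++it) {
        result = found ? std::min(result, it->second) : it->second;
        found = true;
    }
    return result;
}
```

With mutations spilled at version 11 (location 1000) and version 21 (location 5000), passing the in-memory popped version 20 yields 5000, which discards location 1000 even though the persisted popped version is still 10. Passing the persisted popped version 10 yields 1000, so the page that a post-crash peek would read is kept.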
Evan Tschannen ebcb2f79ed Merge branch 'master' of github.com:apple/foundationdb 2019-11-22 15:34:49 -08:00
Evan Tschannen 8d3ef89540 Merge branch 'release-6.2'
# Conflicts:
#	CMakeLists.txt
#	documentation/sphinx/source/release-notes.rst
#	fdbclient/MutationList.h
#	fdbserver/MasterProxyServer.actor.cpp
#	versions.target
2019-11-14 15:49:56 -08:00
negoyal a4a0bf18f9 Merging with Master. 2019-11-12 13:01:29 -08:00
Evan Tschannen 396dccbc98 when peeking from satellites we do not need to limit the amount of peeking on log router tags, because that is the only thing that can be peeked from a satellite log 2019-11-08 18:34:05 -08:00
Evan Tschannen afc9713005 Merge branch 'release-6.2'
# Conflicts:
#	CMakeLists.txt
#	documentation/sphinx/source/release-notes.rst
#	fdbclient/FDBTypes.h
#	fdbserver/LogSystem.h
#	fdbserver/LogSystemPeekCursor.actor.cpp
#	fdbserver/OldTLogServer_6_0.actor.cpp
#	fdbserver/TLogServer.actor.cpp
#	versions.target
2019-11-06 13:45:37 -08:00
Evan Tschannen a8ca47beff optimized memory allocations by using VectorRef<Tag> instead of std::vector<Tag> 2019-11-05 18:07:30 -08:00