Commit Graph

3079 Commits

Author SHA1 Message Date
Andrew Noyes daeb0e9ed6 Attempt to fix Makefile 2019-10-25 10:42:22 -07:00
Andrew Noyes d4de608bb6 Fix OPEN_FOR_IDE build 2019-10-25 10:42:22 -07:00
Evan Tschannen ef14f7a718
Merge pull request #2292 from etschannen/master
Merge 6.2 into master
2019-10-25 09:18:20 -07:00
Jingyu Zhou a30e6ec147
Merge pull request #2277 from xumengpanda/mengxu/fastrestore-atomicOpTest-increaseLoadAndBugFix-PR
Performant restore [7/XX]: Add tests for transactionBatchSizeThreshold when apply mutations
2019-10-24 21:21:14 -07:00
Evan Tschannen 3325980c03 Merge branch 'release-6.2'
# Conflicts:
#	CMakeLists.txt
#	documentation/sphinx/source/release-notes.rst
#	fdbserver/DataDistribution.actor.cpp
#	fdbserver/OldTLogServer_6_0.actor.cpp
#	fdbserver/TLogServer.actor.cpp
#	fdbserver/WorkerInterface.actor.h
#	fdbserver/worker.actor.cpp
#	versions.target
2019-10-24 17:38:15 -07:00
Meng Xu 2383c29123 FastRestore:Use reference for handleInitVersionBatchRequest func 2019-10-24 13:54:44 -07:00
Meng Xu 7903b47b82 FastRestore:Remove unnecessary return 2019-10-24 13:09:24 -07:00
Meng Xu c53f817c5e FastRestore:Convert handleInitVersionBatchRequest to plain func 2019-10-24 13:06:50 -07:00
Xin Dong f70000184e Log the number of samples captured for the read bandwidth to verify the assumption. 2019-10-24 13:05:23 -07:00
Meng Xu 60d26ff5d7 FastRestore:Resolve review comments 2019-10-24 12:52:12 -07:00
Xin Dong a290e2cb2b Use 8 MiB for real 2019-10-24 11:02:17 -07:00
Jon Fu 5d7c84b803 moved shuffle outside of the conditional blocks 2019-10-24 09:45:04 -07:00
Evan Tschannen a7492aab0a fix: poppedVersion can update during a yield, so all work must be done immediately after getMore returns 2019-10-23 23:06:02 -07:00
Evan Tschannen f8e44d2f71 fix: If a storage server was offline, it would not be checked for being in an undesired dc 2019-10-23 23:04:39 -07:00
Meng Xu b1881a7c1c FastRestore:Apply clang-format 2019-10-23 20:49:14 -07:00
Meng Xu 1ae02dd1df FastRestore:AtomicOp test:Add sanity check for setup step 2019-10-23 17:28:21 -07:00
Meng Xu bae0c907a6 FastRestore:Convert unnecessary actor function to plain function 2019-10-23 15:10:34 -07:00
Jon Fu ab262e5e4d use StringRef over std::string for workload params 2019-10-23 14:55:28 -07:00
Jon Fu 103cc37a35 added datahall kill and option to target a specific datahall/dc/machine id 2019-10-23 14:19:17 -07:00
Meng Xu ba7e499efe FastRestore:AtomicOpTest:Limit 1 actor per client 2019-10-23 14:04:14 -07:00
Evan Tschannen eb910b850b fixed a Windows build error 2019-10-23 13:48:24 -07:00
Meng Xu 41f0cd624b FastRestore:Applier:Use shouldCommit to replace the duplicate code 2019-10-23 13:36:19 -07:00
Xin Dong fe54a4bde1 - Changed SHARD_MAX_BYTES_READ_PRE_KEYSEC to be equivalent to 8MiB/s, which, multiplied by the sample expire interval (120 seconds), yields 960MiB. A shard whose read rate exceeds this threshold will be marked as read-hot. The number 960MiB was chosen to be roughly twice the max allowed shard size to avoid wrongly marking a shard as read-hot when doing a table scan on it.
- Also tuned down the empty key sampling percentage to be 5%.
2019-10-23 12:00:19 -07:00
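The threshold arithmetic in the commit message above can be checked with a short sketch. This is illustrative Python, not FoundationDB code: the constant and function names are hypothetical, and only the 8 MiB/s rate and the 120-second sample expire interval come from the message.

```python
MIB = 1024 * 1024

# Values taken from the commit message; the names are hypothetical.
MAX_READ_RATE_BYTES_PER_SEC = 8 * MIB
SAMPLE_EXPIRE_INTERVAL_SEC = 120

# Bytes a shard may read within one sampling window before it counts as read-hot.
READ_HOT_THRESHOLD_BYTES = MAX_READ_RATE_BYTES_PER_SEC * SAMPLE_EXPIRE_INTERVAL_SEC


def is_read_hot(bytes_read_in_window: int) -> bool:
    """Return True when the sampled read volume exceeds the threshold."""
    return bytes_read_in_window > READ_HOT_THRESHOLD_BYTES
```

Under these assumptions the threshold works out to 960 MiB per window, so a shard that read 500 MiB in the window would not be flagged.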
Jon Fu d97ff75638 added mode to specifically kill all workers with same machineId 2019-10-23 11:30:16 -07:00
Jon Fu 47dc0ee25c removed coordinator check and added pre-processing of workers rather than checking each cycle 2019-10-23 11:19:27 -07:00
Evan Tschannen 2722c8b188 avoid starting a new startSpillingActor with every TLog recruitment 2019-10-23 11:15:54 -07:00
Evan Tschannen ae3f8132a7
Merge pull request #2280 from satherton/feature-redwood
Update redwood
2019-10-23 10:57:38 -07:00
Evan Tschannen 9197b03122
Merge pull request #2279 from ajbeamon/latency-band-ignore-batch
Ignore batch priority GRVs for latency band tracking
2019-10-23 10:52:44 -07:00
Evan Tschannen e01e8371a6
Merge pull request #2256 from alexmiller-apple/spill-log-on-switch-6.2
Spill SharedTLog when there's more than one
2019-10-23 10:51:28 -07:00
Evan Tschannen c1731e3b8d
Merge pull request #2276 from alexmiller-apple/fix-10min-stall-again-6.2
More fixes to prevent 10min stalls in recovering secondaries
2019-10-23 10:45:55 -07:00
A.J. Beamon a1bed51d34 Ignore batch priority GRVs for latency band tracking 2019-10-23 10:29:58 -07:00
Stephen Atherton 0e51a248b4 Merge branch 'release-6.2' of github.com:apple/foundationdb into feature-redwood 2019-10-23 10:12:54 -07:00
Jon Fu 6583c499f8 Merge branch 'master' of https://github.com/apple/foundationdb into modify-attrition 2019-10-23 09:42:14 -07:00
Stephen Atherton 613bbaecc4 Bug fix in queue page footprint tracking. Added VersionedBTree::destroyAndCheckSanity() which clears the tree, processes the entire lazy delete queue, and then verifies some pager usage statistics. This check is currently disabled because it appears to find a bug where the final state has a few more pages in use than expected. StorageBytes now includes the delayed free list pages as free space since they will be reusable soon. 2019-10-23 09:31:06 -07:00
Alex Miller 0c325c5351 Always check which SharedTLog is active
In case it is set before we get to the onChange()
2019-10-23 01:59:36 -07:00
Meng Xu e676348710
Merge pull request #1955 from fzhjon/mark-ss-failed
Add fdbcli and API command to mark storage servers as permanently failed
2019-10-22 23:36:30 -07:00
Meng Xu 96d463bab6 FastRestore:Fix bug in applying mutations and increase atomicOp test workload
When Applier applies mutations to the destination cluster, it advances the
mutation cursor twice when it should only advance it once.
This makes restore miss some mutations when the applying txn includes
more than one mutation.
2019-10-22 23:24:23 -07:00
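The class of bug described in the commit above can be illustrated with a minimal sketch. This is hypothetical Python, not the Applier's actual code: the function name and the `advance_twice` flag exist only to show how an extra cursor advance skips every other mutation.

```python
def apply_mutations(mutations, advance_twice=False):
    """Walk a list of mutations with a cursor and collect the ones applied.

    advance_twice=True reproduces the faulty pattern: the cursor moves
    two steps per iteration, so every second mutation is never applied.
    """
    applied = []
    cursor = 0
    while cursor < len(mutations):
        applied.append(mutations[cursor])
        cursor += 1
        if advance_twice:
            cursor += 1  # extra advance: the next mutation is skipped
    return applied
```

With a single mutation both variants apply it, which matches the message's note that mutations are only missed when a transaction holds more than one.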
Alex Miller 1e5b8c74e3 Continuing a parallel peek after a timeout would hang.
This is to guard against the case where

1. Peeks with sequence numbers 0-39 are submitted
2. A 15min pause happens, in which timeout removes the peek tracker data
3. Peeks with sequence numbers 40-59 are submitted, with the same peekId

The second round of peeks would no longer have the data indicating that
it is allowed to start running peek 40 immediately, and thus would hang
for 10min until it gets cleaned up.

Also, guard against overflowing the sequence number.
2019-10-22 19:24:05 -07:00
Evan Tschannen f65f0cd37a
Merge pull request #2274 from etschannen/feature-cleanup-destuidlookup
Automatically cleanup backup and DR sharing metadata
2019-10-22 19:11:23 -07:00
Alex Miller c008e7f8b3 When switching parallel->single->parallel, reset sequence and peekId
This fixes an issue where one could hang for 10min for the second
parallel peek to time out, if one happened to catch the edge of an
onlySpilled transition wrong.
2019-10-22 19:10:58 -07:00
Stephen Atherton 6a57fab431 Bug fixes in lazy subtree deletion, queue pushFront(), queue flush(), and advancing the oldest pager version. CommitSubtree no longer forces page rewrites due to boundary changes. IPager2 and IVersionedStore now have explicit async init() functions to avoid returning futures from some frequently used functions. 2019-10-22 17:17:29 -07:00
Evan Tschannen 35ac0071a8 fixed a compiler error 2019-10-22 17:06:54 -07:00
Evan Tschannen 2d74288d16 Added a comment to clarify why cleanup work is done in status 2019-10-22 16:33:44 -07:00
Xin Dong af72d15566
Update fdbserver/Knobs.cpp
From AJ: to match typical aligned format used on other variables.

Co-Authored-By: A.J. Beamon <ajbeamon@users.noreply.github.com>
2019-10-22 13:53:28 -07:00
Xin Dong e6f5748791 Use a large value for read sampling size threshold. Also at sampling site, don't round up small values to avoid sampling every key. 2019-10-22 13:47:58 -07:00
Evan Tschannen 3478652d06
Apply suggestions from code review
Co-Authored-By: A.J. Beamon <ajbeamon@users.noreply.github.com>
2019-10-22 13:32:09 -07:00
Evan Tschannen d5c2147c0c
Update fdbserver/Status.actor.cpp
Co-Authored-By: A.J. Beamon <ajbeamon@users.noreply.github.com>
2019-10-22 13:27:52 -07:00
Evan Tschannen 2caad04d9c Keys in the destUIDLookupPrefix can be cleaned up automatically if they do not have an associated entry in the logRangesRange keyspace 2019-10-22 11:58:40 -07:00
Jon Fu e39d0dde9b Merge branch 'master' of https://github.com/apple/foundationdb into modify-attrition 2019-10-22 11:51:08 -07:00
A.J. Beamon 29a0014b41 Fix "bandwith" typo 2019-10-22 09:51:59 -07:00
Evan Tschannen 12c517ab16 limit the number of committed version updates in progress simultaneously to prevent running out of memory 2019-10-21 16:01:45 -07:00
Meng Xu 2dbbce55a8 FastRestore:Applier:Mute debug trace 2019-10-21 14:36:07 -07:00
Meng Xu 4af69fd94f Merge branch 'master' into mengxu/fastrestore-multifiles-has-sameversion-mutations-PR-testPR 2019-10-21 14:35:04 -07:00
Xin Dong fca9aab17a
Merge pull request #2046 from dongxinEric/feature/hot-read-key-detection
Added metrics for read hot key detection
2019-10-21 14:31:48 -07:00
Meng Xu f08ad48b7b FastRestore:Applier:handleSendMutationVectorRequest:Add comment 2019-10-21 14:31:21 -07:00
Meng Xu 4efddc9b89 FastRestore:Applier:Reduce LoC
When a key does not exist in a map, it is created by default when it is accessed by []
2019-10-21 14:31:21 -07:00
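The behavior the commit above relies on matches C++ `std::map::operator[]`, which inserts a value-initialized entry when the key is missing. As a rough analogy only (Python rather than C++, with hypothetical names and data), `collections.defaultdict` removes the same kind of explicit existence check:

```python
from collections import defaultdict


def group_by_version(mutations):
    """Group (version, mutation) pairs by version.

    The defaultdict creates an empty list on first access to a missing key,
    so no "if version not in by_version" branch is needed.
    """
    by_version = defaultdict(list)
    for version, mutation in mutations:
        by_version[version].append(mutation)
    return dict(by_version)
```

One caveat applies in both languages: merely reading a missing key through `[]` creates the entry, so lookups that must not modify the map need a different accessor.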
Meng Xu 6f1ecd1b11 FastRestore:handleSendMutationVectorRequest:Receive mutations in order of versions 2019-10-21 14:31:21 -07:00
Jon Fu d2b6626d5c Merge branch 'master' of https://github.com/apple/foundationdb into mark-ss-failed 2019-10-21 13:47:06 -07:00
Evan Tschannen 688940b685 merge 6.2 into master 2019-10-21 11:43:46 -07:00
Xin Dong 9a81948843
Accept review suggestions.
Co-Authored-By: A.J. Beamon <ajbeamon@users.noreply.github.com>
2019-10-21 10:08:43 -07:00
Meng Xu e9a48cb63b FastRestore:Fix bug in handleInitVersionBatchRequest
We should unconditionally resetPerVersionBatch()
2019-10-19 17:40:50 -07:00
Meng Xu ab946eb24f FastRestore:Applier:Turn on debug 2019-10-19 17:07:31 -07:00
tclinken bb0ae31002 Removed dead code. 2019-10-18 17:06:48 -07:00
Meng Xu 6d0c9e9198 FastRestore:AtomicOpTestCase:Add the test case
Also add trace events for AtomicOps.actor.cpp
2019-10-18 16:58:45 -07:00
Xin Dong 6a40ef25e5 Credit to Evan for pointing out the missing line, which cost me weeks of debugging some weird behaviors. 2019-10-18 16:46:19 -07:00
Jon Fu b1fd6b4443 addressed review comments 2019-10-18 09:43:25 -07:00
Stephen Atherton 44175e0921 COWPager will no longer expire read Snapshots that are still in use. 2019-10-18 01:27:00 -07:00
Stephen Atherton 0e9d082805 Bug fixes in FIFOQueue concurrent nested reads and writes caused by the pager/freelist circular dependencies. 2019-10-17 21:34:17 -07:00
Meng Xu 0fe0a14987 FastRestore:handleSendMutationVectorRequest:Receive mutations in order of versions 2019-10-17 17:21:00 -07:00
Evan Tschannen 43e99ef6a4 fix: better master exists must check if fitness is better for proxies or resolvers before looking at the count of either of them 2019-10-17 13:18:31 -07:00
Meng Xu ab4a375b95 FastRestore:RestoreLoader:Define SerializedMutationPartMap type 2019-10-17 10:12:38 -07:00
Alex Miller 1eb3a70b96 Spill SharedTLog when there's more than one.
When switching between spill_type or log_version, a new instance of a
SharedTLog is created in the transaction log processes.  If this is done
in a saturated database, then doubling the amount of memory to hold
mutations in memory can cause TLogs to be uncomfortably close to the 8GB
OOM limit.

Instead, we now thread which UID of a SharedTLog is active, and the
other TLogs spill out the majority of their mutations.

This is a backport of #2213 (fef89aa1) to release-6.2
2019-10-17 01:24:50 -07:00
Meng Xu 78b1ebc7c2 FastRestore:Loader:Handle multiple mutations at same versions in multiple files 2019-10-16 20:57:16 -07:00
Evan Tschannen 42b7acf7b7
Merge pull request #2202 from etschannen/feature-share-mutations
Backup and DR would not share mutations if started on different versions of FDB
2019-10-16 20:28:39 -07:00
Evan Tschannen a81ff63147
Merge pull request #2250 from etschannen/feature-fix-proxy-slow-task
added a yield on the proxy to remove a slow task when processing large transactions
2019-10-16 20:22:05 -07:00
Evan Tschannen 587cbefe7f duplicate mutation stream checker did not have a timeout
duplicate mutation stream did not work properly when multiple ranges exist with the same begin key
2019-10-16 20:17:09 -07:00
Meng Xu 27db9c326b FastRestore:unlockDatabase should always succeed 2019-10-16 16:59:01 -07:00
Evan Tschannen 5be773f145
Update fdbserver/Status.actor.cpp
Co-Authored-By: A.J. Beamon <ajbeamon@users.noreply.github.com>
2019-10-16 16:35:24 -07:00
Evan Tschannen 2facfc090b
Update fdbserver/Status.actor.cpp
Co-Authored-By: A.J. Beamon <ajbeamon@users.noreply.github.com>
2019-10-16 16:35:12 -07:00
Evan Tschannen a85f69c62f
Merge pull request #2241 from etschannen/feature-recruitment-cleanup
Fixed a few small issues with recruitment logic on the cluster controller
2019-10-16 16:25:42 -07:00
Meng Xu cc556d77b6 FastRestore:RestoreMaster:Remove the extra lockDatabase in RestoreMaster 2019-10-16 16:08:53 -07:00
Evan Tschannen 552eb44bf8
Merge pull request #2230 from ajbeamon/fix-fault-tolerance-reporting-with-remote-regions
Fix: status would fail to account for remote regions when...
2019-10-16 14:51:48 -07:00
Evan Tschannen 8b09cd16b2 Merge branch 'release-6.2' of github.com:apple/foundationdb into feature-share-mutations 2019-10-16 14:50:37 -07:00
Evan Tschannen ac28e96bbf added a yield on the proxy to remove a slow task when processing large transactions 2019-10-16 14:31:59 -07:00
Meng Xu cc85da4876 FastRestore:resetPerVersionBatch:fix compile error 2019-10-16 13:06:42 -07:00
Meng Xu 408af31275 FastRestore:Add fileIndex to RestoreFileFR struct and bug fix
Fix bugs where RestoreMaster cannot properly lock or unlock the DB when
an exception occurs;
Fix bug in ordering backup files
2019-10-16 11:45:35 -07:00
Jon Fu f4237ebfff Merge branch 'master' of https://github.com/apple/foundationdb into mark-ss-failed 2019-10-16 11:32:16 -07:00
Jon Fu 896701006f addressed code review changes 2019-10-16 11:30:20 -07:00
Jon Fu fa654d9da7 updated to not kill majority of coordinators 2019-10-16 10:00:16 -07:00
Stephen Atherton 6b7317da9b Bug and clarity fixes to tracking FIFOQueue page and item count. 2019-10-15 03:36:22 -07:00
Stephen Atherton c3e2bde987 Deferred subtree clears and expiring/reusing old pages is complete. Many bug fixes involving scheduled page freeing, page list queue flushing, and expiring old snapshots (this was mostly written but not used yet). Rewrote most of FIFOQueue (again) to more cleanly handle queue cyclical dependencies caused by having queues that use a pager which in turn uses the same queues for managing page freeing and allocation. Many debug output improvements, including making BTreePageIDs and LogicalPageIDs stringify the same way everywhere to make following a PageID easier. 2019-10-15 03:10:50 -07:00
chaoguang e7b97c393d added zipfian distribution to mako workload 2019-10-15 01:14:21 -07:00
Evan Tschannen 298b815109 one proxy or resolver with best fitness no longer prevents more proxies or resolvers from being recruited with good fitness 2019-10-14 18:32:17 -07:00
Evan Tschannen 5064d91b75 fix: the cluster controller would not change to a new set of satellite tlogs when they become available in a better satellite location 2019-10-14 18:31:23 -07:00
Evan Tschannen 35e816e9ad added the ability to configure satellite_logs by satellite location; this will override the region configuration if both are present 2019-10-14 18:30:15 -07:00
Meng Xu af8047e79b FastRestore:ApplyToDB:Change state variable to variable 2019-10-14 16:38:01 -07:00
Meng Xu 0c8de91932 FastRestore:applyToDB:Add functions to DBApplyProgress for encapsulation 2019-10-14 16:24:36 -07:00
Jon Fu 373ac3026f update check for dcId 2019-10-14 15:03:04 -07:00
Meng Xu f89b5586df FastRestore:applyToDB:Record applyToDB progress in DBApplyProgress struct
This avoids repetitive code
2019-10-14 14:57:17 -07:00
Jon Fu 0489f81c10 Initial commit to modify machine attrition to work outside simulation 2019-10-14 13:11:49 -07:00