Andrew Noyes
daeb0e9ed6
Attempt to fix Makefile
2019-10-25 10:42:22 -07:00
Andrew Noyes
d4de608bb6
Fix OPEN_FOR_IDE build
2019-10-25 10:42:22 -07:00
Evan Tschannen
ef14f7a718
Merge pull request #2292 from etschannen/master
...
Merge 6.2 into master
2019-10-25 09:18:20 -07:00
Jingyu Zhou
a30e6ec147
Merge pull request #2277 from xumengpanda/mengxu/fastrestore-atomicOpTest-increaseLoadAndBugFix-PR
...
Performant restore [7/XX]: Add tests for transactionBatchSizeThreshold when apply mutations
2019-10-24 21:21:14 -07:00
Evan Tschannen
3325980c03
Merge branch 'release-6.2'
...
# Conflicts:
# CMakeLists.txt
# documentation/sphinx/source/release-notes.rst
# fdbserver/DataDistribution.actor.cpp
# fdbserver/OldTLogServer_6_0.actor.cpp
# fdbserver/TLogServer.actor.cpp
# fdbserver/WorkerInterface.actor.h
# fdbserver/worker.actor.cpp
# versions.target
2019-10-24 17:38:15 -07:00
Meng Xu
2383c29123
FastRestore:Use reference for handleInitVersionBatchRequest func
2019-10-24 13:54:44 -07:00
Meng Xu
7903b47b82
FastRestore:Remove unnecessary return
2019-10-24 13:09:24 -07:00
Meng Xu
c53f817c5e
FastRestore:Convert handleInitVersionBatchRequest to plain func
2019-10-24 13:06:50 -07:00
Xin Dong
f70000184e
Log the number of samples captured for the read bandwidth to verify the assumption.
2019-10-24 13:05:23 -07:00
Meng Xu
60d26ff5d7
FastRestore:Resolve review comments
2019-10-24 12:52:12 -07:00
Xin Dong
a290e2cb2b
Use 8 MiB for real
2019-10-24 11:02:17 -07:00
Jon Fu
5d7c84b803
moved shuffle outside of the conditional blocks
2019-10-24 09:45:04 -07:00
Evan Tschannen
a7492aab0a
fix: poppedVersion can update during a yield, so all work must be done immediately after getMore returns
2019-10-23 23:06:02 -07:00
Evan Tschannen
f8e44d2f71
fix: If a storage server was offline, it would not be checked for being in an undesired dc
2019-10-23 23:04:39 -07:00
Meng Xu
b1881a7c1c
FastRestore:Apply clang-format
2019-10-23 20:49:14 -07:00
Meng Xu
1ae02dd1df
FastRestore:AtomicOp test:Add sanity check for setup step
2019-10-23 17:28:21 -07:00
Meng Xu
bae0c907a6
FastRestore:Convert unnecessary actor function to plain function
2019-10-23 15:10:34 -07:00
Jon Fu
ab262e5e4d
use StringRef over std::string for workload params
2019-10-23 14:55:28 -07:00
Jon Fu
103cc37a35
added datahall kill and option to target a specific datahall/dc/machine id
2019-10-23 14:19:17 -07:00
Meng Xu
ba7e499efe
FastRestore:AtomicOpTest:Limit 1 actor per client
2019-10-23 14:04:14 -07:00
Evan Tschannen
eb910b850b
fixed a Windows build error
2019-10-23 13:48:24 -07:00
Meng Xu
41f0cd624b
FastRestore:Applier:Use shouldCommit to replace the duplicate code
2019-10-23 13:36:19 -07:00
Xin Dong
fe54a4bde1
- Changed SHARD_MAX_BYTES_READ_PRE_KEYSEC to be equivalent to 8MiB/s, which, when multiplied by the sample expire interval (120 seconds), yields 960MiB. A shard whose reads exceed that amount within the interval will be marked as read-hot. The number 960MiB was chosen to be roughly twice the max allowed shard size to avoid wrongly marking a shard as read-hot when doing a table scan on it.
...
- Also tuned down the empty key sampling percentage to be 5%.
2019-10-23 12:00:19 -07:00
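The arithmetic in commit fe54a4bde1 can be checked with a small sketch. The constant and function names below are hypothetical and are not the knobs used in FoundationDB; only the values (8 MiB/s, 120 seconds) come from the commit message. Note that a rate multiplied by an interval gives a byte count, so the product is 960 MiB.

```cpp
#include <cstdint>

// Hypothetical names for illustration; only the values come from the commit message.
constexpr std::int64_t kMiB = 1024 * 1024;
constexpr std::int64_t kReadRateBytesPerSec = 8 * kMiB; // 8 MiB/s
constexpr std::int64_t kSampleExpireSeconds = 120;      // sample expire interval

// Bytes read over one sample interval at the threshold rate: 8 MiB/s * 120 s.
constexpr std::int64_t kReadHotThresholdBytes = kReadRateBytesPerSec * kSampleExpireSeconds;

static_assert(kReadHotThresholdBytes == 960 * kMiB, "8 MiB/s over 120 s is 960 MiB");

// A shard counts as read-hot only if its sampled reads exceed the threshold.
constexpr bool isReadHot(std::int64_t bytesReadInWindow) {
    return bytesReadInWindow > kReadHotThresholdBytes;
}
```

With these assumed names, isReadHot(961 * kMiB) is true and isReadHot(960 * kMiB) is false.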
Jon Fu
d97ff75638
added mode to specifically kill all workers with same machineId
2019-10-23 11:30:16 -07:00
Jon Fu
47dc0ee25c
removed coordinator check and added pre-processing of workers rather than checking each cycle
2019-10-23 11:19:27 -07:00
Evan Tschannen
2722c8b188
avoid starting a new startSpillingActor with every TLog recruitment
2019-10-23 11:15:54 -07:00
Evan Tschannen
ae3f8132a7
Merge pull request #2280 from satherton/feature-redwood
...
Update redwood
2019-10-23 10:57:38 -07:00
Evan Tschannen
9197b03122
Merge pull request #2279 from ajbeamon/latency-band-ignore-batch
...
Ignore batch priority GRVs for latency band tracking
2019-10-23 10:52:44 -07:00
Evan Tschannen
e01e8371a6
Merge pull request #2256 from alexmiller-apple/spill-log-on-switch-6.2
...
Spill SharedTLog when there's more than one
2019-10-23 10:51:28 -07:00
Evan Tschannen
c1731e3b8d
Merge pull request #2276 from alexmiller-apple/fix-10min-stall-again-6.2
...
More fixes to prevent 10min stalls in recovering secondaries
2019-10-23 10:45:55 -07:00
A.J. Beamon
a1bed51d34
Ignore batch priority GRVs for latency band tracking
2019-10-23 10:29:58 -07:00
Stephen Atherton
0e51a248b4
Merge branch 'release-6.2' of github.com:apple/foundationdb into feature-redwood
2019-10-23 10:12:54 -07:00
Jon Fu
6583c499f8
Merge branch 'master' of https://github.com/apple/foundationdb into modify-attrition
2019-10-23 09:42:14 -07:00
Stephen Atherton
613bbaecc4
Bug fix in queue page footprint tracking. Added VersionedBTree::destroyAndCheckSanity() which clears the tree, processes the entire lazy delete queue, and then verifies some pager usage statistics. This check is currently disabled because it appears to find a bug where the final state has a few more pages in use than expected. StorageBytes now includes the delayed free list pages as free space since they will be reusable soon.
2019-10-23 09:31:06 -07:00
Alex Miller
0c325c5351
Always check which SharedTLog is active
...
In case it is set before we get to the onChange()
2019-10-23 01:59:36 -07:00
Meng Xu
e676348710
Merge pull request #1955 from fzhjon/mark-ss-failed
...
Add fdbcli and API command to mark storage servers as permanently failed
2019-10-22 23:36:30 -07:00
Meng Xu
96d463bab6
FastRestore:Fix bug in applying mutations and increase atomicOp test workload
...
When Applier applies mutations to the destination cluster, it advances the
mutation cursor twice when it should only advance it once.
This makes restore miss some mutations when the applying txn includes
more than one mutation.
2019-10-22 23:24:23 -07:00
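The double-advance bug described in commit 96d463bab6 can be illustrated with a minimal sketch. This is not the Applier code: the Mutation struct and both functions are hypothetical stand-ins, and the "cursor" is a plain index, chosen only to show why advancing twice per step skips every other mutation once a batch holds more than one.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a mutation; not FoundationDB's type.
struct Mutation {
    int id;
};

// Buggy pattern: the cursor is advanced in the loop header and again in the body.
std::vector<int> applyBuggy(const std::vector<Mutation>& batch) {
    std::vector<int> applied;
    for (std::size_t i = 0; i < batch.size(); ++i) {
        applied.push_back(batch[i].id);
        ++i; // second advance: the next mutation is skipped
    }
    return applied;
}

// Fixed pattern: the cursor is advanced exactly once per applied mutation.
std::vector<int> applyFixed(const std::vector<Mutation>& batch) {
    std::vector<int> applied;
    for (std::size_t i = 0; i < batch.size(); ++i) {
        applied.push_back(batch[i].id);
    }
    assert(applied.size() == batch.size()); // nothing was skipped
    return applied;
}
```

A batch with a single mutation behaves the same in both versions, which matches the commit's note that mutations are only missed when a transaction includes more than one.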
Alex Miller
1e5b8c74e3
Continuing a parallel peek after a timeout would hang.
...
This is to guard against the case where
1. Peeks with sequence numbers 0-39 are submitted
2. A 15min pause happens, in which timeout removes the peek tracker data
3. Peeks with sequence numbers 40-59 are submitted, with the same peekId
The second round of peeks wouldn't have the data left that it's allowed
to start running peek 40 immediately, and thus would hang for 10min
until it gets cleaned up.
Also, guard against overflowing the sequence number.
2019-10-22 19:24:05 -07:00
Evan Tschannen
f65f0cd37a
Merge pull request #2274 from etschannen/feature-cleanup-destuidlookup
...
Automatically cleanup backup and DR sharing metadata
2019-10-22 19:11:23 -07:00
Alex Miller
c008e7f8b3
When switching parallel->single->parallel, reset sequence and peekId
...
This fixes an issue where one could hang for 10min for the second
parallel peek to time out, if one happened to catch the edge of an
onlySpilled transition wrong.
2019-10-22 19:10:58 -07:00
Stephen Atherton
6a57fab431
Bug fixes in lazy subtree deletion, queue pushFront(), queue flush(), and advancing the oldest pager version. CommitSubtree no longer forces page rewrites due to boundary changes. IPager2 and IVersionedStore now have explicit async init() functions to avoid returning futures from some frequently used functions.
2019-10-22 17:17:29 -07:00
Evan Tschannen
35ac0071a8
fixed a compiler error
2019-10-22 17:06:54 -07:00
Evan Tschannen
2d74288d16
Added a comment to clarify why cleanup work is done in status
2019-10-22 16:33:44 -07:00
Xin Dong
af72d15566
Update fdbserver/Knobs.cpp
...
From AJ: to match typical aligned format used on other variables.
Co-Authored-By: A.J. Beamon <ajbeamon@users.noreply.github.com>
2019-10-22 13:53:28 -07:00
Xin Dong
e6f5748791
Use a large value for read sampling size threshold. Also at sampling site, don't round up small values to avoid sampling every key.
2019-10-22 13:47:58 -07:00
Evan Tschannen
3478652d06
Apply suggestions from code review
...
Co-Authored-By: A.J. Beamon <ajbeamon@users.noreply.github.com>
2019-10-22 13:32:09 -07:00
Evan Tschannen
d5c2147c0c
Update fdbserver/Status.actor.cpp
...
Co-Authored-By: A.J. Beamon <ajbeamon@users.noreply.github.com>
2019-10-22 13:27:52 -07:00
Evan Tschannen
2caad04d9c
Keys in the destUIDLookupPrefix can be cleaned up automatically if they do not have an associated entry in the logRangesRange keyspace
2019-10-22 11:58:40 -07:00
Jon Fu
e39d0dde9b
Merge branch 'master' of https://github.com/apple/foundationdb into modify-attrition
2019-10-22 11:51:08 -07:00
A.J. Beamon
29a0014b41
Fix "bandwith" typo
2019-10-22 09:51:59 -07:00
Evan Tschannen
12c517ab16
limit the number of committed version updates in progress simultaneously to prevent running out of memory
2019-10-21 16:01:45 -07:00
Meng Xu
2dbbce55a8
FastRestore:Applier:Mute debug trace
2019-10-21 14:36:07 -07:00
Meng Xu
4af69fd94f
Merge branch 'master' into mengxu/fastrestore-multifiles-has-sameversion-mutations-PR-testPR
2019-10-21 14:35:04 -07:00
Xin Dong
fca9aab17a
Merge pull request #2046 from dongxinEric/feature/hot-read-key-detection
...
Added metrics for read hot key detection
2019-10-21 14:31:48 -07:00
Meng Xu
f08ad48b7b
FastRestore:Applier:handleSendMutationVectorRequest:Add comment
2019-10-21 14:31:21 -07:00
Meng Xu
4efddc9b89
FastRestore:Applier:Reduce LoC
...
When a key does not exist in a map, it is created by default when it is accessed by []
2019-10-21 14:31:21 -07:00
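The simplification mentioned in commit 4efddc9b89 relies on standard std::map behaviour: operator[] value-initializes a missing entry before returning a reference to it. The sketch below uses hypothetical function names and a string-to-int map purely for illustration; the container and types in the Applier may differ.

```cpp
#include <map>
#include <string>

// Verbose form: explicitly create the entry before using it.
int countVerbose(std::map<std::string, int>& counts, const std::string& key) {
    if (counts.find(key) == counts.end()) {
        counts[key] = 0;
    }
    counts[key] += 1;
    return counts[key];
}

// Concise form: operator[] inserts a value-initialized int (0) when the key is missing.
int countConcise(std::map<std::string, int>& counts, const std::string& key) {
    return ++counts[key];
}
```

Both functions return the same result for the same sequence of calls; the concise form drops the explicit existence check.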
Meng Xu
6f1ecd1b11
FastRestore:handleSendMutationVectorRequest:Receive mutations in order of versions
2019-10-21 14:31:21 -07:00
Jon Fu
d2b6626d5c
Merge branch 'master' of https://github.com/apple/foundationdb into mark-ss-failed
2019-10-21 13:47:06 -07:00
Evan Tschannen
688940b685
merge 6.2 into master
2019-10-21 11:43:46 -07:00
Xin Dong
9a81948843
Accept review suggestions.
...
Co-Authored-By: A.J. Beamon <ajbeamon@users.noreply.github.com>
2019-10-21 10:08:43 -07:00
Meng Xu
e9a48cb63b
FastRestore:Fix bug in handleInitVersionBatchRequest
...
We should unconditionally resetPerVersionBatch()
2019-10-19 17:40:50 -07:00
Meng Xu
ab946eb24f
FastRestore:Applier:Turn on debug
2019-10-19 17:07:31 -07:00
tclinken
bb0ae31002
Removed dead code.
2019-10-18 17:06:48 -07:00
Meng Xu
6d0c9e9198
FastRestore:AtomicOpTestCase:Add the test case
...
Also add trace events for AtomicOps.actor.cpp
2019-10-18 16:58:45 -07:00
Xin Dong
6a40ef25e5
Credit to Evan for pointing out the missing line which costs me weeks debugging some weird behaviors.
2019-10-18 16:46:19 -07:00
Jon Fu
b1fd6b4443
addressed review comments
2019-10-18 09:43:25 -07:00
Stephen Atherton
44175e0921
COWPager will no longer expire read Snapshots that are still in use.
2019-10-18 01:27:00 -07:00
Stephen Atherton
0e9d082805
Bug fixes in FIFOQueue concurrent nested reads and writes caused by the pager/freelist circular dependencies.
2019-10-17 21:34:17 -07:00
Meng Xu
0fe0a14987
FastRestore:handleSendMutationVectorRequest:Receive mutations in order of versions
2019-10-17 17:21:00 -07:00
Evan Tschannen
43e99ef6a4
fix: better master exists must check if fitness is better for proxies or resolvers before looking at the count of either of them
2019-10-17 13:18:31 -07:00
Meng Xu
ab4a375b95
FastRestore:RestoreLoader:Define SerializedMutationPartMap type
2019-10-17 10:12:38 -07:00
Alex Miller
1eb3a70b96
Spill SharedTLog when there's more than one.
...
When switching between spill_type or log_version, a new instance of a
SharedTLog is created in the transaction log processes. If this is done
in a saturated database, then doubling the amount of memory to hold
mutations in memory can cause TLogs to be uncomfortably close to the 8GB
OOM limit.
Instead, we now thread which UID of a SharedTLog is active, and the
other TLog spill out the majority of their mutations.
This is a backport of #2213 (fef89aa1) to release-6.2
2019-10-17 01:24:50 -07:00
Meng Xu
78b1ebc7c2
FastRestore:Loader:Handle multiple mutations at same versions in multiple files
2019-10-16 20:57:16 -07:00
Evan Tschannen
42b7acf7b7
Merge pull request #2202 from etschannen/feature-share-mutations
...
Backup and DR would not share mutations if started on different versions of FDB
2019-10-16 20:28:39 -07:00
Evan Tschannen
a81ff63147
Merge pull request #2250 from etschannen/feature-fix-proxy-slow-task
...
added a yield on the proxy to remove a slow task when processing large transactions
2019-10-16 20:22:05 -07:00
Evan Tschannen
587cbefe7f
duplicate mutation stream checker did not have a timeout
...
duplicate mutation stream did not work properly when multiple ranges exist with the same begin key
2019-10-16 20:17:09 -07:00
Meng Xu
27db9c326b
FastRestore:unlockDatabase should always succeed
2019-10-16 16:59:01 -07:00
Evan Tschannen
5be773f145
Update fdbserver/Status.actor.cpp
...
Co-Authored-By: A.J. Beamon <ajbeamon@users.noreply.github.com>
2019-10-16 16:35:24 -07:00
Evan Tschannen
2facfc090b
Update fdbserver/Status.actor.cpp
...
Co-Authored-By: A.J. Beamon <ajbeamon@users.noreply.github.com>
2019-10-16 16:35:12 -07:00
Evan Tschannen
a85f69c62f
Merge pull request #2241 from etschannen/feature-recruitment-cleanup
...
Fixed a few small issues with recruitment logic on the cluster controller
2019-10-16 16:25:42 -07:00
Meng Xu
cc556d77b6
FastRestore:RestoreMaster:Remove the extra lockDatabase in RestoreMaster
2019-10-16 16:08:53 -07:00
Evan Tschannen
552eb44bf8
Merge pull request #2230 from ajbeamon/fix-fault-tolerance-reporting-with-remote-regions
...
Fix: status would fail to account for remote regions when...
2019-10-16 14:51:48 -07:00
Evan Tschannen
8b09cd16b2
Merge branch 'release-6.2' of github.com:apple/foundationdb into feature-share-mutations
2019-10-16 14:50:37 -07:00
Evan Tschannen
ac28e96bbf
added a yield on the proxy to remove a slow task when processing large transactions
2019-10-16 14:31:59 -07:00
Meng Xu
cc85da4876
FastRestore:resetPerVersionBatch:fix compile error
2019-10-16 13:06:42 -07:00
Meng Xu
408af31275
FastRestore:Add fileIndex to RestoreFileFR struct and bug fix
...
Fix bugs in RestoreMaster that cannot properly lock or unlock DB when
exception occurs;
Fix bug in ordering backup files
2019-10-16 11:45:35 -07:00
Jon Fu
f4237ebfff
Merge branch 'master' of https://github.com/apple/foundationdb into mark-ss-failed
2019-10-16 11:32:16 -07:00
Jon Fu
896701006f
addressed code review changes
2019-10-16 11:30:20 -07:00
Jon Fu
fa654d9da7
updated to not kill majority of coordinators
2019-10-16 10:00:16 -07:00
Stephen Atherton
6b7317da9b
Bug and clarity fixes to tracking FIFOQueue page and item count.
2019-10-15 03:36:22 -07:00
Stephen Atherton
c3e2bde987
Deferred subtree clears and expiring/reusing old pages is complete. Many bug fixes involving scheduled page freeing, page list queue flushing, and expiring old snapshots (this was mostly written but not used yet). Rewrote most of FIFOQueue (again) to more cleanly handle queue cyclical dependencies caused by having queues that use a pager which in turn uses the same queues for managing page freeing and allocation. Many debug output improvements, including making BTreePageIDs and LogicalPageIDs stringify the same way everywhere to make following a PageID easier.
2019-10-15 03:10:50 -07:00
chaoguang
e7b97c393d
added zipfian distribution to mako workload
2019-10-15 01:14:21 -07:00
Evan Tschannen
298b815109
one proxy or resolver with best fitness no longer prevents more proxies or resolvers from being recruited with good fitness
2019-10-14 18:32:17 -07:00
Evan Tschannen
5064d91b75
fix: the cluster controller would not change to a new set of satellite tlogs when they become available in a better satellite location
2019-10-14 18:31:23 -07:00
Evan Tschannen
35e816e9ad
added the ability to configure satellite_logs by satellite location; this will override the region configuration if both are present
2019-10-14 18:30:15 -07:00
Meng Xu
af8047e79b
FastRestore:ApplyToDB:Change state variable to variable
2019-10-14 16:38:01 -07:00
Meng Xu
0c8de91932
FastRestore:applyToDB:Add functions to DBApplyProgress for encapsulation
2019-10-14 16:24:36 -07:00
Jon Fu
373ac3026f
update check for dcId
2019-10-14 15:03:04 -07:00
Meng Xu
f89b5586df
FastRestore:applyToDB:Record applyToDB progress in DBApplyProgress struct
...
This avoids repetitive code
2019-10-14 14:57:17 -07:00
Jon Fu
0489f81c10
Initial commit to modify machine attrition to work outside simulation
2019-10-14 13:11:49 -07:00