Commit Graph

184 Commits

Author SHA1 Message Date
Evan Tschannen a9d3c9f9b3
Added throttling when a blob worker falls behind (#7751)
* throttle the cluster when blob workers fall behind

* do not throttle on blob workers if they are not enabled

* remove an unnecessary actor

* fixed a compile error

* fetch blob worker metrics at the same interval as the rate is updated, avoid fetching the complete blob worker list too frequently

* fixed another compilation bug

* added a 5 second delay before bw throttling to prevent false positives caused by the 100e6 version jump during recovery. Lower the throttling thresholds to react much quicker to bw lag.

* fixed a number of problems

* changed the minBlobVersionRequest to look at storage server versions since this will be a lot more efficient

* fix: do not let desired go backwards

* fix: track the version of notAtLatest changefeeds for throttling

* ratekeeper now throttles blob workers by estimating the transactions-per-second throughput of the blob workers

* added metrics for blob worker change feeds

* added a knob to disable bw throttling

* fixed the transaction options in blob manager
2022-08-12 13:15:56 -07:00
He Liu bc5bfaffda
Shard based move (#6981)
* Shard based move.

* Clean up.

* Clear results on retry in getInitialDataDistribution.

* Remove assertion on SHARD_ENCODE_LOCATION_METADATA for compatibility.

* Resolved comments.

Co-authored-by: He Liu <heliu@apple.com>
2022-07-07 20:49:16 -07:00
Bharadwaj V.R 71705bf930 Increase timeout for QuietDatabase when buggify is on 2022-06-27 23:03:00 -07:00
Bharadwaj V.R 990c789a5c
Increase quiet-database timeout when buggify is on; data-movements in simulation take longer than the timeout allows, and waiting for quiet-database does succeed when given some more time (#7290) 2022-06-06 13:13:11 -07:00
A.J. Beamon 917b271a37
Merge pull request #6996 from sfc-gh-mpilman/features/fail-quietdatabase-before-timeout
Make QuietDatabase more human friendly
2022-05-10 08:36:25 -07:00
Markus Pilman e0cbe74d94 Only fail DD early in simulation 2022-04-28 11:32:35 -06:00
Markus Pilman eb22ac1c1f Address review comments 2022-04-28 10:09:06 -06:00
Markus Pilman f959e84b85 fix comparison 2022-04-28 09:46:28 -06:00
Markus Pilman 74abca44d8 Make QuietDatabase more human friendly
QuietDatabase will now fail by itself after 1000 seconds
instead of relying on the general simulation timeout.
Additionally it will print a more human friendly error.
2022-04-28 09:15:20 -06:00
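
The behaviour described in this commit message (give up after 1000 seconds and report a readable error instead of waiting for the general simulation timeout) can be sketched roughly as below. This is an illustrative Python sketch, not the FoundationDB implementation: `wait_for_quiet`, `QuietDatabaseTimeout`, the polling interval, and the injectable `clock`/`sleep` parameters are all hypothetical names introduced here. Only the 1000-second figure comes from the commit message.

```python
import time
from typing import Callable

# 1000 seconds is the figure quoted in the commit message above.
QUIET_DATABASE_TIMEOUT_SECONDS = 1000.0


class QuietDatabaseTimeout(Exception):
    """Hypothetical error type carrying a human-readable explanation."""


def wait_for_quiet(
    is_quiet: Callable[[], bool],
    timeout: float = QUIET_DATABASE_TIMEOUT_SECONDS,
    poll_interval: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Poll is_quiet() until it returns True; return the number of checks made.

    Raises QuietDatabaseTimeout once `timeout` seconds have elapsed, so the
    caller fails on its own schedule rather than on an outer, generic timeout.
    """
    start = clock()
    attempts = 0
    while True:
        attempts += 1
        if is_quiet():
            return attempts
        if clock() - start >= timeout:
            raise QuietDatabaseTimeout(
                f"database did not become quiet within {timeout:.0f} seconds "
                f"({attempts} checks)"
            )
        sleep(poll_interval)
```

Passing `clock` and `sleep` as parameters is a design choice made for this sketch so the timeout path can be exercised without actually waiting.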
Renxuan Wang c69a07a858
Check in the new Hostname logic. (#6926)
* Revert #6655.

20220407-031010-renxuan-c101052c21da8346           compressed=True data_size=31004844 duration=4310801 ended=100000 fail_fast=10 max_runs=100000 pass=100000 priority=100 remaining=0 runtime=1:04:15 sanity=False started=100047 stopped=20220407-041425 submitted=20220407-031010 timeout=5400 username=renxuan

* Revert #6271.

20220407-051532-renxuan-470f0fe6aac1c217           compressed=True data_size=30982370 duration=3491067 ended=100002 fail_fast=10 max_runs=100000 pass=100002 priority=100 remaining=0 runtime=0:59:57 sanity=False started=100141 stopped=20220407-061529 submitted=20220407-051532 timeout=5400 username=renxuan

* Revert #6266.

Remove resolving-related functionality from the connection string. The connection string will be used for storage only and will be immutable.

20220407-175119-renxuan-55d30ee1a4b42c2f           compressed=True data_size=30970443 duration=5437659 ended=100000 fail_fast=10 max_runs=100000 pass=100000 priority=100 remaining=0 runtime=0:59:31 sanity=False started=100154 stopped=20220407-185050 submitted=20220407-175119 timeout=5400 username=renxuan

* Add hostname to coordinator interfaces.

* Turn on the new hostname logic.

* Add the corresponding change in config txns.

The most notable change is that, before calling basicLoadBalance(), we need to call tryInitializeRequestStream() to initialize the request streams.

Passed correctness tests.

* Return error when hostnames cannot be resolved in coordinators command.

* Minor fixes.
2022-04-27 21:54:13 -07:00
Trevor Clinkenbeard ba8fbca038
Merge pull request #6752 from sfc-gh-tclinkenbeard/improve-snapshot-fault-tolerance
Improve fault tolerance of snapshots
2022-04-08 12:46:50 -07:00
Lukas Joswiak 73a7c32982
Add fdbcli command to read/write version epoch (#6480)
* Initialize cluster version at wall-clock time

Previously, new clusters would begin at version 0. After this change,
clusters will initialize at a version matching wall-clock time. Instead
of using the Unix epoch (or Windows epoch), FDB clusters will use a new
epoch, defaulting to January 1, 2010, 01:00:00+00:00. In the future,
this base epoch will be modifiable through fdbcli, allowing
administrators to advance the cluster version.

Basing the version off of time allows different FDB clusters to share
data without running into version issues.

* Send version epoch to master

* Cleanup

* Update fdbserver/storageserver.actor.cpp

Co-authored-by: A.J. Beamon <aj.beamon@snowflake.com>

* Jump directly to expected version if possible

* Fix initial version issue on storage servers

* Add random recovery offset to start version in simulation

* Type fixes

* Disable reference time by default

Enable on a cluster using the fdbcli command `versionepoch add 0`.

* Use correct recoveryTransactionVersion when recovering

* Allow version epoch to be adjusted forwards (to decrease the version)

* Set version epoch in simulation

* Add quiet database check to ensure small version offset

* Fix initial version issue on storage servers

* Disable reference time by default

Enable on a cluster using the fdbcli command `versionepoch add 0`.

* Add fdbcli command to read/write version epoch

* Cause recovery when version epoch is set

* Handle optional version epoch key

* Add ability to clear the version epoch

This causes version advancement to revert to the old methodology, in which
versions attempt to advance by about a million versions per second
instead of trying to match the clock.

* Update transaction access

* Modify version epoch to use microseconds instead of seconds

* Modify fdbcli version target API

Move commands from `versionepoch` to `targetversion` top level command.

* Add fdbcli tests for

* Temporarily disable targetversion cli tests

* Fix version epoch fetch issue

* Fix Arena issue

* Reduce max version jump in simulation to 1,000,000

* Rework fdbcli API

It now requires two commands to fully switch a cluster to using the
version epoch. First, enable the version epoch with `versionepoch
enable` or `versionepoch set <versionepoch>`. At this point, versions
will be given out at a faster or slower rate in an attempt to reach the
expected version. Then, run `versionepoch commit` to perform a one time
jump to the expected version. This is essentially irreversible.

* Temporarily disable old targetversion tests

* Cleanup

* Move version epoch buggify to sequencer

This will cause some issues with the QuietDatabase check for the version
offset - namely, it won't do anything, since the version epoch is not
being written to the txnStateStore in simulation. This will get fixed in
the future.

Co-authored-by: A.J. Beamon <aj.beamon@snowflake.com>
2022-04-08 12:33:19 -07:00
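
The version-epoch commit above describes cluster versions that track wall-clock time relative to an epoch defaulting to January 1, 2010, 01:00:00+00:00, with the epoch measured in microseconds. A minimal Python sketch of that arithmetic follows. It assumes one version per microsecond (suggested by the "about a million versions per second" figure, but not stated outright in the message), and the names `expected_version`, `DEFAULT_EPOCH`, and `VERSIONS_PER_SECOND` are hypothetical, not FoundationDB identifiers.

```python
from datetime import datetime, timezone

# Default epoch as quoted in the commit message.
DEFAULT_EPOCH = datetime(2010, 1, 1, 1, 0, 0, tzinfo=timezone.utc)

# Assumption for this sketch: roughly one million versions per second,
# i.e. one version per microsecond of wall-clock time.
VERSIONS_PER_SECOND = 1_000_000


def expected_version(now: datetime, epoch: datetime = DEFAULT_EPOCH) -> int:
    """Return the version a clock-matched cluster would target at `now`.

    Both datetimes must be timezone-aware. Integer arithmetic is used so
    that large microsecond counts are not rounded by floating point.
    """
    delta = now - epoch
    micros = (delta.days * 86_400 + delta.seconds) * 1_000_000 + delta.microseconds
    return micros * VERSIONS_PER_SECOND // 1_000_000
```

Under these assumptions, a timestamp exactly one second after the default epoch maps to version 1,000,000, and moving the epoch forwards lowers the target version for the same wall-clock time, which matches the "adjusted forwards (to decrease the version)" note in the message.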
sfc-gh-tclinkenbeard e3acbd1388 Fix bug in getStorageWorkers 2022-04-08 11:21:29 -07:00
sfc-gh-tclinkenbeard 91930b8040 Remove getMinReplicasRemaining PromiseStream.
Instead, in order to enforce the maximum fault tolerance for snapshots,
update getStorageWorkers to return the number of unavailable storage
servers (instead of throwing an error when unavailable storage servers
exist).
2022-04-07 23:23:23 -07:00
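
The change described above (return the number of unavailable storage servers instead of throwing when any exist) can be illustrated with a small Python sketch. Everything here is hypothetical and only mirrors the shape of the idea: the `StorageServer` type, the `get_storage_workers` signature, and the `snapshot_allowed` policy are assumptions, not the actual C++ code.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class StorageServer:
    """Hypothetical stand-in for a storage server record."""
    name: str
    available: bool


def get_storage_workers(
    servers: Sequence[StorageServer],
) -> Tuple[List[StorageServer], int]:
    """Return the available servers plus a count of unavailable ones.

    Reporting the count, rather than raising as soon as one server is
    unavailable, leaves the tolerance decision to the caller.
    """
    available = [s for s in servers if s.available]
    return available, len(servers) - len(available)


def snapshot_allowed(unavailable: int, max_fault_tolerance: int) -> bool:
    """Assumed policy: proceed only if failures stay within the tolerance."""
    return unavailable <= max_fault_tolerance
```

With this shape, a caller can decide per operation how many unavailable servers are acceptable instead of treating any unavailability as an error.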
Josh Slocum f27475e2f4 Merge branch 'main' into blob_integration 2022-03-22 11:41:58 -05:00
sfc-gh-tclinkenbeard a71099471b Update copyright header dates 2022-03-21 13:36:23 -07:00
Josh Slocum 37e7c80f26 Merge branch 'main' into blob_integration 2022-03-17 18:45:42 -05:00
A.J. Beamon 2a21126028 Don't apply read prefixes on the client. Cache tenant data locally. 2022-03-15 09:23:30 -07:00
Josh Slocum e71b3533f9 Merge branch 'main' into blob_integration 2022-03-09 08:59:56 -06:00
A.J. Beamon 250a88e682 Enforce that trace event suppression calls happen first when using trace event call chaining. Fix various instances where we weren't following this requirement. 2022-02-24 12:25:52 -08:00
Renxuan Wang 622d89b552 Rebase on main.
Since we changed ClusterConnectionString's status flag from boolean to enum in #6422, we need to update this PR correspondingly.
2022-02-22 16:29:59 -08:00
Renxuan Wang 481587a8c6 Turn on hostname logic. 2022-02-22 16:29:59 -08:00
Suraj Gupta 99606482ea initial thoughts 2021-10-26 16:16:00 -04:00
Josh Slocum 0ff8ddc2b6 Merge branch 'master' into blob_full_clean 2021-10-25 13:38:48 -05:00
Trevor Clinkenbeard c69364d5aa
Verify that cluster is fully recovered in quietDatabase check (#5807)
* Verify that cluster is fully recovered in quietDatabase check

* Add trace event to waitForQuietDatabase
2021-10-21 09:01:52 -07:00
Josh Slocum 5f0ec0612a Merge branch 'feature-range-feed' into blob_full 2021-10-13 15:44:35 -05:00
Suraj Gupta 282f9d35cd Cleanup comments and debugging code. 2021-10-04 11:07:08 -04:00
Suraj Gupta 4d54669ccd Recruit the blob workers via blob manager.
In this PR, the blob manager now recruits blob workers
(via communication with the cluster controller). Blob workers
are onboarded as blob worker processes enter the cluster.
2021-10-04 11:07:08 -04:00
Xiaoxi Wang 1730d75f73 change configure test
add store type check
add test file
2021-09-21 18:11:04 -07:00
Chaoguang Lin 65956ae6b7 Refactor configure command; refactor changeConfig to template code to reuse existing tests 2021-09-21 10:06:04 -07:00
Xiaoge Su abf73047ca Enforce std:: specifier rather than using namespace 2021-09-16 19:40:28 -07:00
Xiaoxi Wang 10c82b422f merge master branch 2021-07-28 14:19:46 -07:00
Xiaoxi Wang 12d4f5c261 disable streaming peek for localities < 0 2021-07-28 14:11:25 -07:00
Steve Atherton 507c1f11e3 Add .log() to bare TraceEvent() invocations without any .detail()s to avoid clang-tidy warning about immediate destruction of object without use. 2021-07-26 19:55:10 -07:00
Xiaoxi Wang bfebd4e812 Merge branch 'master' of https://github.com/apple/foundationdb into tlog_dev 2021-07-22 16:15:07 -07:00
Xiaoxi Wang cd32478b52 memory error(Simple config) 2021-07-22 15:45:59 -07:00
Xiaoxi Wang 1057835e8b merge with master 2021-07-20 17:09:34 -07:00
Xiaoxi Wang 5046ee3b07 add stream peek to logRouter 2021-07-20 17:42:00 +00:00
sfc-gh-tclinkenbeard 6f81155784 Merge remote-tracking branch 'origin/master' into const-serverdbinfo 2021-07-20 10:18:40 -07:00
Steve Atherton f596a81073 Rename ::TRUE and ::FALSE in BooleanParams to ::True and ::False so as to not conflict with the TRUE and FALSE macros provided by the Windows and MacOS SDKs. 2021-07-17 00:11:40 -07:00
sfc-gh-tclinkenbeard 8a212862f0 Prevent dataDistributor from modifying ServerDBInfo object 2021-07-11 22:04:54 -07:00
sfc-gh-tclinkenbeard 8cc40e3a2b Expand use of BOOLEAN_PARAM 2021-07-02 21:41:50 -07:00
Josh Slocum d1d2ca9285 Don't inject TSS faults if speedUpSimulation is set 2021-06-18 12:41:48 -05:00
Markus Pilman 05aea49d16
Merge pull request #4986 from sfc-gh-mpilman/bugfixes/double-ss
Bugfixes/double ss
2021-06-16 14:43:32 -06:00
Markus Pilman 56eaf1bc83 added comments 2021-06-15 16:49:27 -06:00
Markus Pilman b2271f2176 additional tracing for quietDatabase 2021-06-15 16:00:28 -06:00
Trevor Clinkenbeard 866f536983
Merge pull request #4888 from sfc-gh-tclinkenbeard/remove-fdbserver-includes
Remove fdbserver includes from fdbclient
2021-06-07 10:22:13 -07:00
Xiaoxi Wang 838d847d4e
Merge pull request #4860 from sfc-gh-xwang/ppwtest
implement perpetual storage wiggling feature
2021-06-04 16:18:39 -07:00
Xiaoxi Wang e0981d6732 add code coverage mark 2021-06-03 19:58:28 +00:00
Xiaoxi Wang 351325b3af comment modification; wait perpetual wiggling close 2021-06-03 05:13:20 +00:00