Commit Graph

280 Commits

Author SHA1 Message Date
Alvin Moore 3bf971ba8b Merge branch 'release-6.2' of github.com:apple/foundationdb into release_6.2_merge
# Conflicts:
#	documentation/sphinx/source/release-notes.rst
#	fdbserver/storageserver.actor.cpp
2019-12-12 07:13:12 -08:00
Andrew Noyes 6bde67f2b3 Fix UBSAN report
/home/anoyes/workspace/foundationdb/fdbserver/Ratekeeper.actor.cpp:86:8: runtime error: load of value 1231493777, which is not a valid value for type 'limitReason_t'
    #0 0x310e961 in StorageQueueInfo::StorageQueueInfo(StorageQueueInfo&&) /home/anoyes/workspace/foundationdb/fdbserver/Ratekeeper.actor.cpp:86
    #1 0x310eacd in MapPair<UID, StorageQueueInfo>::MapPair<UID, StorageQueueInfo>(UID&&, StorageQueueInfo&&) /home/anoyes/workspace/foundationdb/flow/IndexedSet.h:242
    #2 0x310b35e in MapPair<std::decay<UID>::type, std::decay<StorageQueueInfo>::type> mapPair<UID, StorageQueueInfo>(UID&&, StorageQueueInfo&&) /home/anoyes/workspace/foundationdb/flow/IndexedSet.h:258
    #3 0x30a8b79 in a_body1 /home/anoyes/workspace/foundationdb/fdbserver/Ratekeeper.actor.cpp:195
    #4 0x309b529 in TrackStorageServerQueueInfoActor /home/anoyes/build/foundationdb/fdbserver/Ratekeeper.actor.g.cpp:495
    #5 0x309b9be in trackStorageServerQueueInfo(RatekeeperData* const&, StorageServerInterface const&) /home/anoyes/workspace/foundationdb/fdbserver/Ratekeeper.actor.cpp:194
    #6 0x30cff63 in a_body1loopBody1when1cont1 /home/anoyes/workspace/foundationdb/fdbserver/Ratekeeper.actor.cpp:303
    #7 0x30cd9da in a_body1loopBody1when1when1 /home/anoyes/build/foundationdb/fdbserver/Ratekeeper.actor.g.cpp:1170
    #8 0x30ed4dd in a_callback_fire /home/anoyes/build/foundationdb/fdbserver/Ratekeeper.actor.g.cpp:1185
    #9 0x30e6d81 in fire /home/anoyes/workspace/foundationdb/flow/flow.h:998
    #10 0x4df0dc in void SAV<Void>::send<Void>(Void&&) /home/anoyes/workspace/foundationdb/flow/flow.h:447
    #11 0x959891 in void Promise<Void>::send<Void>(Void&&) const /home/anoyes/workspace/foundationdb/flow/flow.h:778
    #12 0x7b4b018 in Sim2::execTask(Sim2::Task&) (/home/anoyes/build/foundationdb/bin/fdbserver+0x7b4b018)
    #13 0x7bf9168 in Sim2::RunLoopActorState<Sim2::RunLoopActor>::a_body1loopBody1cont1(Void const&, int) /home/anoyes/workspace/foundationdb/fdbrpc/sim2.actor.cpp:979
    #14 0x7be7b68 in Sim2::RunLoopActorState<Sim2::RunLoopActor>::a_body1loopBody1when1(Void const&, int) /home/anoyes/build/foundationdb/fdbrpc/sim2.actor.g.cpp:5391
    #15 0x7c329ff in Sim2::RunLoopActorState<Sim2::RunLoopActor>::a_callback_fire(ActorCallback<Sim2::RunLoopActor, 0, Void>*, Void) /home/anoyes/build/foundationdb/fdbrpc/sim2.actor.g.cpp:5406
    #16 0x7c1fc73 in ActorCallback<Sim2::RunLoopActor, 0, Void>::fire(Void const&) /home/anoyes/workspace/foundationdb/flow/flow.h:998
    #17 0x4df0dc in void SAV<Void>::send<Void>(Void&&) /home/anoyes/workspace/foundationdb/flow/flow.h:447
    #18 0x959891 in void Promise<Void>::send<Void>(Void&&) const /home/anoyes/workspace/foundationdb/flow/flow.h:778
    #19 0x7fe74a4 in N2::PromiseTask::operator()() /home/anoyes/workspace/foundationdb/flow/Net2.actor.cpp:481
    #20 0x7fb6ff7 in N2::Net2::run() /home/anoyes/workspace/foundationdb/flow/Net2.actor.cpp:657
    #21 0x7b71bd3 in Sim2::_runActorState<Sim2::_runActor>::a_body1(int) /home/anoyes/workspace/foundationdb/fdbrpc/sim2.actor.cpp:989
    #22 0x7b2ee51 in Sim2::_runActor::_runActor(Sim2* const&) /home/anoyes/build/foundationdb/fdbrpc/sim2.actor.g.cpp:5608
    #23 0x7b2f268 in Sim2::_run(Sim2* const&) /home/anoyes/workspace/foundationdb/fdbrpc/sim2.actor.cpp:987
    #24 0x7b2f2c8 in Sim2::run() /home/anoyes/workspace/foundationdb/fdbrpc/sim2.actor.cpp:996
    #25 0x21040a6 in main /home/anoyes/workspace/foundationdb/fdbserver/fdbserver.actor.cpp:1793
    #26 0x7f03492ba504 in __libc_start_main (/lib64/libc.so.6+0x22504)
    #27 0x464914  (/home/anoyes/build/foundationdb/bin/fdbserver+0x464914)
2019-12-03 12:49:12 -08:00
Andrew Noyes e0bf7c4d65 Fix signed integer overflow
Not sure if this is the right fix or not

fdbserver/Ratekeeper.actor.cpp:557:40: runtime error: signed integer overflow: -9223372036854775808 - 9223372036854775807 cannot be represented in type 'long long'
2019-12-02 12:51:33 -08:00
Jon Fu d96a7b2c69 Merge branch 'master' of https://github.com/apple/foundationdb into mark-ss-failed 2019-10-03 09:47:45 -07:00
Evan Tschannen 3cc5d484a5 the include and exclude commands do not need to set the moveKeysLockOwnerKey, which will kill the data distribution algorithm 2019-09-27 18:33:56 -07:00
Evan Tschannen 1f2499c74f
Merge pull request #2012 from ajbeamon/rk-durability-lag-considers-mvcc-window
Ratekeeper ignores intentionally non-durable versions on the SS for durability lag computations
2019-08-19 14:24:21 -07:00
Evan Tschannen 2bd59d1055
Merge pull request #2003 from ajbeamon/add-rk-durability-lag-to-status
Add ratekeeper's durability lag statistics to status
2019-08-19 14:19:59 -07:00
A.J. Beamon ac2f310104 Ratekeeper ignores intentionally non-durable versions on the SS for durability lag computations. 2019-08-16 14:46:44 -07:00
A.J. Beamon 6581161dd3 Add ratekeeper's durability lag statistics to status 2019-08-15 11:07:04 -07:00
A.J. Beamon f6ba8509ae Remove unused local rate limit variables in ratekeeper. 2019-08-15 10:08:28 -07:00
Balachandar Namasivayam 14e54f44b3 Address review comments. 2019-07-18 12:32:35 -07:00
Balachandar Namasivayam 406bcebdc4 Ratekeeper to throttle tpsLimit to 1 if it is not able to fetch storage server list for some configurable amount of time. 2019-07-17 18:08:17 -07:00
Evan Tschannen db5b4a6331 avoid going to unlimited immediately after going below the durabilityLagTargetVersion 2019-07-12 18:50:56 -07:00
Evan Tschannen 6e34e16699 durable version needs more smoothing because it will be updated in bursts 2019-07-12 18:50:56 -07:00
Evan Tschannen b2b2e25324 the durabilityLagLimit needs to be tracked separately for batch priority and normal priority 2019-07-12 18:50:56 -07:00
Evan Tschannen fef58e13a4 adding logging for durability lag in ratekeeper 2019-07-12 18:50:56 -07:00
Evan Tschannen 1a18c859c7 knobified the durability lag rate controls 2019-07-12 18:50:56 -07:00
Evan Tschannen c5fb5494f5 a better attempt a ratekeeper control on durability lag 2019-07-12 18:50:56 -07:00
Evan Tschannen dc171b3eae fixed compiler error 2019-07-12 18:50:56 -07:00
Evan Tschannen e85c05c906 experimental slow control on durability lag 2019-07-12 18:50:56 -07:00
Jingyu Zhou 50e7593c5b
Merge pull request #1796 from ajbeamon/remove-trace-event-underscores
Remove trace event underscores
2019-07-05 21:45:55 -07:00
A.J. Beamon 9f4b6fd770 Remove additional underscores 2019-07-05 08:12:25 -07:00
Alex Miller 8e1ab6e7db Merge remote-tracking branch 'upstream/master' into flowlock-api 2019-06-28 17:32:54 -07:00
Evan Tschannen 5041ff38b1 removed unneeded description 2019-06-28 16:54:22 -07:00
Evan Tschannen a124fc6e8a fixed compiler error 2019-06-28 16:54:22 -07:00
Evan Tschannen b9a6271375 local ratekeeper no longer globally limits 2019-06-28 16:54:22 -07:00
Evan Tschannen f539b5f09a fix: a large targetRateRatio means limiting more 2019-06-28 16:54:22 -07:00
Evan Tschannen db413c37f7 restored the STORAGE_DURABILITY_LAG_SOFT_MAX knob and made the rk target slightly smaller than the soft limit, to avoid inaccuracies in ratekeeper control causing behavior changes on the storage servers 2019-06-28 16:54:22 -07:00
Evan Tschannen a97940a10b fixed compiler error 2019-06-28 16:54:22 -07:00
Evan Tschannen 92b32855ca ratekeeper’s control algorithm would oscillate when limited by local ratekeeper 2019-06-28 16:54:22 -07:00
Alex Miller 7a500cd37f A giant translation of TaskFooPriority -> TaskPriority::Foo
This is so that APIs that take priorities don't take ints, which are
common and easy to accidentally pass the wrong thing.
2019-06-25 02:47:35 -07:00
Evan Tschannen dccb9bc26d fixed a number of correctness problems 2019-06-12 19:40:50 -07:00
Trevor Clinkenbeard 8144882d7b Merge branch 'apple-master' into features/local-rk 2019-06-10 19:40:25 -07:00
A.J. Beamon 5f55f3f613 Replace g_random and g_nondeterministic_random with functions deterministicRandom() and nondeterministicRandom() that return thread_local random number generators. Delete g_debug_random and trace_random. Allow only deterministicRandom() to be seeded, and require it to be seeded from each thread on which it is used. 2019-05-10 14:01:52 -07:00
mpilman bdba8e22eb Added test and bugfixes 2019-04-08 11:05:29 -07:00
mpilman 207049e852 fixed serialization 2019-04-08 11:04:44 -07:00
mpilman 32393ec4c9 Prototype of local ratekeeper 2019-04-08 11:04:44 -07:00
A.J. Beamon 91014d4529 Add file changes that I accidentally failed to commit; fix naming issue in worker. 2019-03-27 08:41:19 -07:00
Evan Tschannen 36ab852bb1 Merge branch 'master' into ratekeeper
# Conflicts:
#	fdbserver/ClusterController.actor.cpp
2019-03-22 18:41:00 -07:00
Evan Tschannen 3ced178348 maxVersionDifference is a copy of a knob which is a double 2019-03-21 12:58:48 -07:00
Jingyu Zhou 99d521ef4f Monitor Ratekeeper and DataDistributor to use stateless processes
Since Ratekeeper and DataDistributor are no longer running with Master, they
might be running with stateful processes before a new Master becomes alive,
which is undesirable.

This PR adds a monitoring of both Ratekeeper and DataDistributor at Cluster
Controller -- if Master runs on a stateless class and RK/DD runs at a worse
class, then RK/DD will be killed. I.e., RK/DD should be running at their own
classes or on the same stateless process as Master. After restart, RK/DD should
be running at a better process class.
2019-03-14 15:00:57 -07:00
Jingyu Zhou 2b0139670e Fix review comment for PR 1176 2019-03-12 12:02:30 -07:00
Jingyu Zhou cdfe906c30 Data distributor pulls batch limited info from proxy
Add a flag in HealthMetrics to indicate that batch priority is rate limited.
Data distributor pulls this flag from proxy to know roughly when rate limiting
happens.

DD uses this information to determine when to do the rebalance in the background,
i.e., moving data from heavily loaded servers to lighter ones. If the cluster is
currently rate limited for batch commits, then the rebalance will use longer
time intervals, otherwise use shorter intervals. See BgDDMountainChopper() and
BgDDValleyFiller() in DataDistributionQueue.actor.cpp.
2019-03-07 13:16:20 -08:00
Jingyu Zhou f43277e819 Format Ratekeeper.actor.cpp code 2019-03-07 13:16:20 -08:00
Jingyu Zhou dc129207a9 Minor fix after rebase. 2019-03-07 13:16:20 -08:00
Jingyu Zhou 517966fce2 Remove lastLimited from rate keeper
Refactor code to make IDE happy.
2019-03-07 13:16:20 -08:00
Jingyu Zhou b2ee41ba33 Remove lastLimited from data distribution
Fix a serialization bug in ServerDBInfo, which causes test failures.
2019-03-07 13:16:20 -08:00
Jingyu Zhou e6ac3f7fe8 Minor fix on ratekeeper work registration. 2019-03-07 13:16:20 -08:00
Jingyu Zhou 3c86643822 Separate Ratekeeper from data distribution.
Add a new role for ratekeeper.

Remove StorageServerChanges from data distribution.
Ratekeeper monitors storage servers, which borrows the idea from
DataDistribution.
2019-03-07 13:16:20 -08:00
Trevor Clinkenbeard 39f612d132 Merge branch 'master' of https://github.com/apple/foundationdb into add-health-metrics 2019-03-02 17:07:00 -08:00
Trevor Clinkenbeard 2940b8d5fd Update all per-process storage server health metrics at once
Ratekeeper updates all storage servers' health metrics in updateRate
with only a single map lookup
2019-03-02 16:08:28 -08:00
A.J. Beamon 655c9d82c7 Various cleanup from review 2019-03-01 14:06:47 -08:00
A.J. Beamon 93f7849261 Fix typo 2019-02-28 12:02:47 -08:00
A.J. Beamon 3e6a6a6569 Update status schema for correctness. Send the count of batch transactions started back to ratekeeper so that it can be logged with other ratekeeper metrics. 2019-02-28 12:00:58 -08:00
A.J. Beamon eb629d87a5 Add information about batch ratekeeper to status. Make it possible to track latencies in the ReadWrite workload for concurrently run instances separately. 2019-02-28 09:53:16 -08:00
Trevor Clinkenbeard 3f59f82670 Calculate durabilityLag instead of NDV for health metrics 2019-02-27 16:30:01 -08:00
A.J. Beamon a051055caf Initial implementation of adding separate limits for batch priority in ratekeeper 2019-02-27 10:31:56 -08:00
Trevor Clinkenbeard abfe057805 Merge branch 'master' of https://github.com/apple/foundationdb into add-health-metrics 2019-02-25 13:47:16 -08:00
Trevor Clinkenbeard 07f800eeee Got rid of detailed field in GetRateInfoReply message 2019-02-23 17:52:11 -08:00
Trevor Clinkenbeard f3a73963b4 Got rid of detailedLeaseDuration in GetRateInfoReply message 2019-02-23 16:42:11 -08:00
Trevor Clinkenbeard a20f5482bc Created StorageStats struct to combine health metrics for storage servers 2019-02-20 11:57:41 -08:00
Trevor Clinkenbeard 7594606ee2 Use DETAILED_METRIC_UPDATE_RATE knob to determine GetRateInfoReply lease duration 2019-02-20 11:40:17 -08:00
Evan Tschannen 3a572b010f fix: a forced recovery needed to force the data distributor to restart 2019-02-19 16:04:52 -08:00
Trevor Clinkenbeard 80cf5e057f Compute worstStorageNDV for Ratekeeper health metrics 2019-02-02 21:03:02 -08:00
Trevor Clinkenbeard 5822bd65bf Track health metrics in Ratekeeper and send these metrics to proxies in GetRateInfoReply messages 2019-01-31 12:56:58 -08:00
Trevor Clinkenbeard d7930af2cb Storage server periodically calculates cpuUsage and diskUsage metrics. These metrics (as well as all other metrics necessary for health metrics calculation) are sent in the StorageQueuingMetricsReply message. 2019-01-31 12:23:04 -08:00
Robert Escriva 268093a96d Adjust all includes to be relative to the root.
Remove the use of relative paths.  A header at foo/bar.h could be included by
files under foo/ with "bar.h", but would be included everywhere else as
"foo/bar.h".  Adjust so that every include references such a header with the
latter form.

Signed-off-by: Robert Escriva <rescriva@dropbox.com>
2018-10-19 17:35:33 +00:00
A.J. Beamon 2a97139d5d This is the first step in eliminating the usage of database names in our code. The C API remains the same, but underneath that all usage of database names is eliminated. 2018-08-16 10:24:12 -07:00
Alex Miller fb31a6999f Rewrite all files to have #include actorcompiler.h as the last include. 2018-08-14 15:50:26 -07:00
Alex Miller 535b5701e5 Rewrite all `Void _ = wait(...)` -> `wait(...)`.
This takes advantage of the new actorcompiler functionality to avoid
having duplicate definitions of `Void _` when trying to feed the
un-actorompiled source through clang.
2018-08-14 15:50:26 -07:00
Evan Tschannen 372ed67497 Merge branch 'master' into feature-remote-logs
# Conflicts:
#	fdbserver/DataDistribution.actor.cpp
#	fdbserver/MasterProxyServer.actor.cpp
#	fdbserver/TLogServer.actor.cpp
#	fdbserver/TagPartitionedLogSystem.actor.cpp
2018-06-11 11:34:10 -07:00
A.J. Beamon e5488419cc Attempt to normalize trace events:
* Detail names now all start with an uppercase character and contain no underscores. Ideally these should be head-first camel case, though that was harder to check.
* Type names have the same rules, except they allow one underscore (to support a usage pattern Context_Type). The first character after the underscore is also uppercase.
* Use seconds instead of milliseconds in details.

Added a check when events are logged in simulation that logs a message to stderr if the first two rules above aren't followed.

This probably doesn't address every instance of the above problems, but all of the events I was able to hit in simulation pass the check.
2018-06-08 11:11:08 -07:00
Evan Tschannen 0e699a3c23 fix: ratekeeper should only control on local logs 2018-05-29 10:51:23 -07:00
Evan Tschannen 1f6c6a886b Merge branch 'release-5.1' into release-5.2 2018-05-08 13:08:11 -07:00
Alec Grieser 752deb07a1
fix fdbmonitor help message output ; fix spelling error Ratekeeper.actor.cpp 2018-05-07 16:19:50 -07:00
A.J. Beamon 432a295bc2 Add read bytes and read keys info to status. Collect this information directly from StorageMetrics rather than through ratekeeper. 2018-05-04 12:01:40 -07:00
Alec Grieser 0bae9880f1 remove trailing whitespace from our copyright headers ; fixed formatting of python setup.py 2018-02-21 10:25:11 -08:00
A.J. Beamon 35b91bfb55 Add back (in different form) some ratekeeper trace events when a storage server or log doesn't respond. Add actualTPS (named TPSBasis) to RkUpdate. 2018-01-18 14:51:38 -08:00
A.J. Beamon bb1297c686 Remove RkServerQueueInfo and RkTLogQueueInfo trace events, since this information is more or less already logged on the storage servers and tlogs. Update the quiet database check and magnesium to use the information from the logs and storage servers. 2017-11-14 12:59:42 -08:00
FDB Dev Team a674cb4ef4 Initial repository commit 2017-05-25 13:48:44 -07:00