83 Commits

Author SHA1 Message Date
Evan Tschannen 1128666840 added additional logging on the log router 2020-03-05 18:17:06 -08:00
Evan Tschannen 08914a2acd Once available space ratio falls below 0.3 avoid moving data to teams with less free space than the median team 2020-02-21 15:14:32 -08:00
Evan Tschannen 819c55556c More aggressively attempt to find teams that do not have low disk space 2020-02-20 16:47:50 -08:00
Evan Tschannen 855f03a41f ratekeeper needed to check remoteDC in another location
the storage server scoped a transaction incorrectly
2020-01-10 15:58:36 -08:00
Evan Tschannen 7898f4425f fix: ratekeeper could limit based on remote storage servers 2020-01-10 12:21:08 -08:00
Andrew Noyes 6bde67f2b3 Fix UBSAN report
/home/anoyes/workspace/foundationdb/fdbserver/Ratekeeper.actor.cpp:86:8: runtime error: load of value 1231493777, which is not a valid value for type 'limitReason_t'
    #0 0x310e961 in StorageQueueInfo::StorageQueueInfo(StorageQueueInfo&&) /home/anoyes/workspace/foundationdb/fdbserver/Ratekeeper.actor.cpp:86
    #1 0x310eacd in MapPair<UID, StorageQueueInfo>::MapPair<UID, StorageQueueInfo>(UID&&, StorageQueueInfo&&) /home/anoyes/workspace/foundationdb/flow/IndexedSet.h:242
    #2 0x310b35e in MapPair<std::decay<UID>::type, std::decay<StorageQueueInfo>::type> mapPair<UID, StorageQueueInfo>(UID&&, StorageQueueInfo&&) /home/anoyes/workspace/foundationdb/flow/IndexedSet.h:258
    #3 0x30a8b79 in a_body1 /home/anoyes/workspace/foundationdb/fdbserver/Ratekeeper.actor.cpp:195
    #4 0x309b529 in TrackStorageServerQueueInfoActor /home/anoyes/build/foundationdb/fdbserver/Ratekeeper.actor.g.cpp:495
    #5 0x309b9be in trackStorageServerQueueInfo(RatekeeperData* const&, StorageServerInterface const&) /home/anoyes/workspace/foundationdb/fdbserver/Ratekeeper.actor.cpp:194
    #6 0x30cff63 in a_body1loopBody1when1cont1 /home/anoyes/workspace/foundationdb/fdbserver/Ratekeeper.actor.cpp:303
    #7 0x30cd9da in a_body1loopBody1when1when1 /home/anoyes/build/foundationdb/fdbserver/Ratekeeper.actor.g.cpp:1170
    #8 0x30ed4dd in a_callback_fire /home/anoyes/build/foundationdb/fdbserver/Ratekeeper.actor.g.cpp:1185
    #9 0x30e6d81 in fire /home/anoyes/workspace/foundationdb/flow/flow.h:998
    #10 0x4df0dc in void SAV<Void>::send<Void>(Void&&) /home/anoyes/workspace/foundationdb/flow/flow.h:447
    #11 0x959891 in void Promise<Void>::send<Void>(Void&&) const /home/anoyes/workspace/foundationdb/flow/flow.h:778
    #12 0x7b4b018 in Sim2::execTask(Sim2::Task&) (/home/anoyes/build/foundationdb/bin/fdbserver+0x7b4b018)
    #13 0x7bf9168 in Sim2::RunLoopActorState<Sim2::RunLoopActor>::a_body1loopBody1cont1(Void const&, int) /home/anoyes/workspace/foundationdb/fdbrpc/sim2.actor.cpp:979
    #14 0x7be7b68 in Sim2::RunLoopActorState<Sim2::RunLoopActor>::a_body1loopBody1when1(Void const&, int) /home/anoyes/build/foundationdb/fdbrpc/sim2.actor.g.cpp:5391
    #15 0x7c329ff in Sim2::RunLoopActorState<Sim2::RunLoopActor>::a_callback_fire(ActorCallback<Sim2::RunLoopActor, 0, Void>*, Void) /home/anoyes/build/foundationdb/fdbrpc/sim2.actor.g.cpp:5406
    #16 0x7c1fc73 in ActorCallback<Sim2::RunLoopActor, 0, Void>::fire(Void const&) /home/anoyes/workspace/foundationdb/flow/flow.h:998
    #17 0x4df0dc in void SAV<Void>::send<Void>(Void&&) /home/anoyes/workspace/foundationdb/flow/flow.h:447
    #18 0x959891 in void Promise<Void>::send<Void>(Void&&) const /home/anoyes/workspace/foundationdb/flow/flow.h:778
    #19 0x7fe74a4 in N2::PromiseTask::operator()() /home/anoyes/workspace/foundationdb/flow/Net2.actor.cpp:481
    #20 0x7fb6ff7 in N2::Net2::run() /home/anoyes/workspace/foundationdb/flow/Net2.actor.cpp:657
    #21 0x7b71bd3 in Sim2::_runActorState<Sim2::_runActor>::a_body1(int) /home/anoyes/workspace/foundationdb/fdbrpc/sim2.actor.cpp:989
    #22 0x7b2ee51 in Sim2::_runActor::_runActor(Sim2* const&) /home/anoyes/build/foundationdb/fdbrpc/sim2.actor.g.cpp:5608
    #23 0x7b2f268 in Sim2::_run(Sim2* const&) /home/anoyes/workspace/foundationdb/fdbrpc/sim2.actor.cpp:987
    #24 0x7b2f2c8 in Sim2::run() /home/anoyes/workspace/foundationdb/fdbrpc/sim2.actor.cpp:996
    #25 0x21040a6 in main /home/anoyes/workspace/foundationdb/fdbserver/fdbserver.actor.cpp:1793
    #26 0x7f03492ba504 in __libc_start_main (/lib64/libc.so.6+0x22504)
    #27 0x464914  (/home/anoyes/build/foundationdb/bin/fdbserver+0x464914)
2019-12-03 12:49:12 -08:00
Andrew Noyes e0bf7c4d65 Fix signed integer overflow
Not sure if this is the right fix or not

fdbserver/Ratekeeper.actor.cpp:557:40: runtime error: signed integer overflow: -9223372036854775808 - 9223372036854775807 cannot be represented in type 'long long'
2019-12-02 12:51:33 -08:00
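The report in the entry above comes from subtracting the largest 64-bit signed value from the smallest, a result that does not fit in `long long`. The commit message does not say how Ratekeeper.actor.cpp was changed, so the following is only a sketch of one common way to avoid that kind of overflow: clamp the subtraction to the representable range. `saturatingSub` is a hypothetical helper name, not a FoundationDB function.

```cpp
#include <cstdint>
#include <limits>

// Hypothetical helper (not from the FoundationDB source): subtracts b from a,
// clamping to the int64_t range instead of overflowing, which is undefined
// behavior for signed integers.
int64_t saturatingSub(int64_t a, int64_t b) {
    const int64_t lo = std::numeric_limits<int64_t>::min();
    const int64_t hi = std::numeric_limits<int64_t>::max();
    // a - b would drop below lo; lo + b cannot overflow because b > 0.
    if (b > 0 && a < lo + b) return lo;
    // a - b would rise above hi; hi + b cannot overflow because b < 0.
    if (b < 0 && a > hi + b) return hi;
    return a - b;
}
```

Whether clamping is appropriate depends on what the difference is used for; the commit author's own note ("Not sure if this is the right fix or not") suggests the real change may differ.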
Evan Tschannen 3cc5d484a5 the include and exclude commands do not need to set the moveKeysLockOwnerKey, which will kill the data distribution algorithm 2019-09-27 18:33:56 -07:00
Evan Tschannen 1f2499c74f
Merge pull request #2012 from ajbeamon/rk-durability-lag-considers-mvcc-window
Ratekeeper ignores intentionally non-durable versions on the SS for durability lag computations
2019-08-19 14:24:21 -07:00
Evan Tschannen 2bd59d1055
Merge pull request #2003 from ajbeamon/add-rk-durability-lag-to-status
Add ratekeeper's durability lag statistics to status
2019-08-19 14:19:59 -07:00
A.J. Beamon ac2f310104 Ratekeeper ignores intentionally non-durable versions on the SS for durability lag computations. 2019-08-16 14:46:44 -07:00
A.J. Beamon 6581161dd3 Add ratekeeper's durability lag statistics to status 2019-08-15 11:07:04 -07:00
A.J. Beamon f6ba8509ae Remove unused local rate limit variables in ratekeeper. 2019-08-15 10:08:28 -07:00
Balachandar Namasivayam 14e54f44b3 Address review comments. 2019-07-18 12:32:35 -07:00
Balachandar Namasivayam 406bcebdc4 Ratekeeper to throttle tpsLimit to 1 if it is not able to fetch storage server list for some configurable amount of time. 2019-07-17 18:08:17 -07:00
Evan Tschannen db5b4a6331 avoid going to unlimited immediately after going below the durabilityLagTargetVersion 2019-07-12 18:50:56 -07:00
Evan Tschannen 6e34e16699 durable version needs more smoothing because it will be updated in bursts 2019-07-12 18:50:56 -07:00
Evan Tschannen b2b2e25324 the durabilityLagLimit needs to be tracked separately for batch priority and normal priority 2019-07-12 18:50:56 -07:00
Evan Tschannen fef58e13a4 adding logging for durability lag in ratekeeper 2019-07-12 18:50:56 -07:00
Evan Tschannen 1a18c859c7 knobified the durability lag rate controls 2019-07-12 18:50:56 -07:00
Evan Tschannen c5fb5494f5 a better attempt at ratekeeper control on durability lag 2019-07-12 18:50:56 -07:00
Evan Tschannen dc171b3eae fixed compiler error 2019-07-12 18:50:56 -07:00
Evan Tschannen e85c05c906 experimental slow control on durability lag 2019-07-12 18:50:56 -07:00
Jingyu Zhou 50e7593c5b
Merge pull request #1796 from ajbeamon/remove-trace-event-underscores
Remove trace event underscores
2019-07-05 21:45:55 -07:00
A.J. Beamon 9f4b6fd770 Remove additional underscores 2019-07-05 08:12:25 -07:00
Alex Miller 8e1ab6e7db Merge remote-tracking branch 'upstream/master' into flowlock-api 2019-06-28 17:32:54 -07:00
Evan Tschannen 5041ff38b1 removed unneeded description 2019-06-28 16:54:22 -07:00
Evan Tschannen a124fc6e8a fixed compiler error 2019-06-28 16:54:22 -07:00
Evan Tschannen b9a6271375 local ratekeeper no longer globally limits 2019-06-28 16:54:22 -07:00
Evan Tschannen f539b5f09a fix: a large targetRateRatio means limiting more 2019-06-28 16:54:22 -07:00
Evan Tschannen db413c37f7 restored the STORAGE_DURABILITY_LAG_SOFT_MAX knob and made the rk target slightly smaller than the soft limit, to avoid inaccuracies in ratekeeper control causing behavior changes on the storage servers 2019-06-28 16:54:22 -07:00
Evan Tschannen a97940a10b fixed compiler error 2019-06-28 16:54:22 -07:00
Evan Tschannen 92b32855ca ratekeeper’s control algorithm would oscillate when limited by local ratekeeper 2019-06-28 16:54:22 -07:00
Alex Miller 7a500cd37f A giant translation of TaskFooPriority -> TaskPriority::Foo
This is so that APIs that take priorities don't take ints, which are
common and make it easy to accidentally pass the wrong thing.
2019-06-25 02:47:35 -07:00
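The motivation in the entry above can be illustrated with a scoped enum. This is a minimal sketch, not the actual FoundationDB definition: the enumerator names, their numeric values, and the `describe` function are all made up for illustration. The point is that a function taking `TaskPriority` rejects a bare `int` at compile time.

```cpp
#include <string>
#include <type_traits>

// Hypothetical enumerators and values; the real TaskPriority list differs.
enum class TaskPriority : int { Low = 1000, Default = 7000, High = 9000 };

// A scoped enum has no implicit conversion from int, so an unrelated
// integer cannot be passed where a priority is expected.
static_assert(!std::is_convertible<int, TaskPriority>::value,
              "plain ints must not convert to TaskPriority");

// Hypothetical API that accepts a priority rather than an int.
std::string describe(TaskPriority p) {
    switch (p) {
        case TaskPriority::Low:     return "low";
        case TaskPriority::Default: return "default";
        case TaskPriority::High:    return "high";
    }
    return "unknown";
}

// describe(7000);                  // would not compile: int is not a TaskPriority
// describe(TaskPriority::Default); // compiles
```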
Evan Tschannen dccb9bc26d fixed a number of correctness problems 2019-06-12 19:40:50 -07:00
Trevor Clinkenbeard 8144882d7b Merge branch 'apple-master' into features/local-rk 2019-06-10 19:40:25 -07:00
A.J. Beamon 5f55f3f613 Replace g_random and g_nondeterministic_random with functions deterministicRandom() and nondeterministicRandom() that return thread_local random number generators. Delete g_debug_random and trace_random. Allow only deterministicRandom() to be seeded, and require it to be seeded from each thread on which it is used. 2019-05-10 14:01:52 -07:00
mpilman bdba8e22eb Added test and bugfixes 2019-04-08 11:05:29 -07:00
mpilman 207049e852 fixed serialization 2019-04-08 11:04:44 -07:00
mpilman 32393ec4c9 Prototype of local ratekeeper 2019-04-08 11:04:44 -07:00
A.J. Beamon 91014d4529 Add file changes that I accidentally failed to commit; fix naming issue in worker. 2019-03-27 08:41:19 -07:00
Evan Tschannen 36ab852bb1 Merge branch 'master' into ratekeeper
# Conflicts:
#	fdbserver/ClusterController.actor.cpp
2019-03-22 18:41:00 -07:00
Evan Tschannen 3ced178348 maxVersionDifference is a copy of a knob which is a double 2019-03-21 12:58:48 -07:00
Jingyu Zhou 99d521ef4f Monitor Ratekeeper and DataDistributor to use stateless processes
Since Ratekeeper and DataDistributor are no longer running with Master, they
might be running with stateful processes before a new Master becomes alive,
which is undesirable.

This PR adds a monitoring of both Ratekeeper and DataDistributor at Cluster
Controller -- if Master runs on a stateless class and RK/DD runs at a worse
class, then RK/DD will be killed. I.e., RK/DD should be running at their own
classes or on the same stateless process as Master. After restart, RK/DD should
be running at a better process class.
2019-03-14 15:00:57 -07:00
Jingyu Zhou 2b0139670e Fix review comment for PR 1176 2019-03-12 12:02:30 -07:00
Jingyu Zhou cdfe906c30 Data distributor pulls batch limited info from proxy
Add a flag in HealthMetrics to indicate that batch priority is rate limited.
Data distributor pulls this flag from proxy to know roughly when rate limiting
happens.

DD uses this information to determine when to do the rebalance in the background,
i.e., moving data from heavily loaded servers to lighter ones. If the cluster is
currently rate limited for batch commits, then the rebalance will use longer
time intervals, otherwise use shorter intervals. See BgDDMountainChopper() and
BgDDValleyFiller() in DataDistributionQueue.actor.cpp.
2019-03-07 13:16:20 -08:00
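The decision described in the entry above (rebalance less often while batch-priority traffic is rate limited) can be sketched as follows. This is an assumption-laden illustration rather than the code in DataDistributionQueue.actor.cpp: `HealthMetricsSketch`, `rebalanceIntervalSeconds`, and the interval parameters are hypothetical names, and the real implementation reads its intervals from knobs.

```cpp
// Hypothetical stand-in for the flag the commit adds to HealthMetrics.
struct HealthMetricsSketch {
    bool batchLimited = false; // true while batch priority is rate limited
};

// Picks the delay between background rebalance attempts: the longer
// interval while the cluster is batch limited, the shorter one otherwise.
double rebalanceIntervalSeconds(const HealthMetricsSketch& metrics,
                                double shortIntervalSeconds,
                                double longIntervalSeconds) {
    return metrics.batchLimited ? longIntervalSeconds : shortIntervalSeconds;
}
```

Because the data distributor polls the proxy for the flag, it sees rate limiting only roughly when it happens, as the commit message notes.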
Jingyu Zhou f43277e819 Format Ratekeeper.actor.cpp code 2019-03-07 13:16:20 -08:00
Jingyu Zhou dc129207a9 Minor fix after rebase. 2019-03-07 13:16:20 -08:00
Jingyu Zhou 517966fce2 Remove lastLimited from rate keeper
Refactor code to make IDE happy.
2019-03-07 13:16:20 -08:00
Jingyu Zhou b2ee41ba33 Remove lastLimited from data distribution
Fix a serialization bug in ServerDBInfo, which causes test failures.
2019-03-07 13:16:20 -08:00