224 Commits

Author SHA1 Message Date
Evan Tschannen 6220a5ce0f
Merge pull request #1370 from jzhou77/fix-unreferenced
Remove unused functions
2019-04-09 11:49:45 -07:00
mpilman 1c16f87a4e Remove trace-calls to printable (in non-workloads) 2019-04-05 13:12:19 -07:00
Evan Tschannen a38c396283 made all maintenance transactions lock aware 2019-04-02 14:27:48 -07:00
Evan Tschannen 628fec8c8b updated status with information about ongoing maintenance
clear the maintenance zone if a different storage server is detected failed
2019-04-02 14:15:51 -07:00
Evan Tschannen 781cf9b5a0 added the ability to make a zoneId for maintenance in fdbcli 2019-04-01 17:55:13 -07:00
Jingyu Zhou f7f8ddd894 Fix warnings on unused variables
Found by -Wunused-variable flag.
2019-04-01 14:00:20 -07:00
Jingyu Zhou 9f6fe5f649 Merge remote-tracking branch 'apple/master' into ratekeeper 2019-03-15 11:30:04 -07:00
Jingyu Zhou 99d521ef4f Monitor Ratekeeper and DataDistributor to use stateless processes
Since Ratekeeper and DataDistributor are no longer running with Master, they
might be running with stateful processes before a new Master becomes alive,
which is undesirable.

This PR adds a monitoring of both Ratekeeper and DataDistributor at Cluster
Controller -- if Master runs on a stateless class and RK/DD runs at a worse
class, then RK/DD will be killed. I.e., RK/DD should be running at their own
classes or on the same stateless process as Master. After restart, RK/DD should
be running at a better process class.
2019-03-14 15:00:57 -07:00
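
The rule in the commit message above can be sketched as a single predicate. This is a minimal illustration, not the Cluster Controller code: the `Fitness` enum, its values, and the function `shouldKillRole` are hypothetical names, and reading "a worse class" as "worse than the Master's" is an assumption.

```cpp
// Hypothetical fitness ranking for this sketch only: a lower value means a
// better fit for running stateless roles.
enum class Fitness { Best = 0, Good = 1, Worst = 2 };

// Returns true when Ratekeeper or DataDistributor (the "role") should be
// killed so that it is re-recruited on a better process class.
// Assumption: "worse" is measured against the Master's own fitness.
constexpr bool shouldKillRole(bool masterOnStatelessClass,
                              Fitness masterFitness,
                              Fitness roleFitness) {
    return masterOnStatelessClass &&
           static_cast<int>(roleFitness) > static_cast<int>(masterFitness);
}
```

Under these assumptions, a role that shares the Master's process, and so has the same fitness, is never killed, which matches the statement that RK/DD may run on the same stateless process as Master.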
Evan Tschannen a2108047aa removed LocalitySetRef and IRepPolicyRef typedefs, because for clarity the Ref suffix is reserved for arena allocated objects instead of reference counted objects. 2019-03-13 13:14:39 -07:00
Jingyu Zhou 2b0139670e Fix review comment for PR 1176 2019-03-12 12:02:30 -07:00
Jingyu Zhou cdfe906c30 Data distributor pulls batch limited info from proxy
Add a flag in HealthMetrics to indicate that batch priority is rate limited.
Data distributor pulls this flag from proxy to know roughly when rate limiting
happens.

DD uses this information to determine when to do the rebalance in the background,
i.e., moving data from heavily loaded servers to lighter ones. If the cluster is
currently rate limited for batch commits, then the rebalance will use longer
time intervals, otherwise use shorter intervals. See BgDDMountainChopper() and
BgDDValleyFiller() in DataDistributionQueue.actor.cpp.
2019-03-07 13:16:20 -08:00
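
The interval selection described in the commit message above might look roughly like the sketch below. The constant names, their values, and the function `rebalanceInterval` are assumptions made for illustration; the real knobs used by BgDDMountainChopper() and BgDDValleyFiller() are not given in the message.

```cpp
// Hypothetical interval values in seconds; the actual knob names and
// values are not stated in the commit message.
constexpr double kFastRebalanceIntervalSec = 1.0;
constexpr double kSlowRebalanceIntervalSec = 10.0;

// Picks the delay between background rebalance attempts.
// batchPriorityLimited mirrors the flag that the data distributor pulls
// from the proxy: true means batch-priority commits are rate limited.
constexpr double rebalanceInterval(bool batchPriorityLimited) {
    return batchPriorityLimited ? kSlowRebalanceIntervalSec
                                : kFastRebalanceIntervalSec;
}
```

The only property the sketch relies on is that the rate-limited interval is longer than the unlimited one, so background data movement backs off while the cluster is under load.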
Jingyu Zhou 835cc278c3 Fix rebase conflicts. 2019-03-07 13:16:20 -08:00
Jingyu Zhou d52ff738c0 Fix merge conflicts during rebase. 2019-03-07 13:16:20 -08:00
Jingyu Zhou b2ee41ba33 Remove lastLimited from data distribution
Fix a serialization bug in ServerDBInfo, which causes test failures.
2019-03-07 13:16:20 -08:00
Jingyu Zhou 36a51a7b57 Fix a segfault bug due to uncopied ratekeeper interface 2019-03-07 13:16:20 -08:00
Jingyu Zhou e6ac3f7fe8 Minor fix on ratekeeper work registration. 2019-03-07 13:16:20 -08:00
Jingyu Zhou 3c86643822 Separate Ratekeeper from data distribution.
Add a new role for ratekeeper.

Remove StorageServerChanges from data distribution.
Ratekeeper monitors storage servers, which borrows the idea from
DataDistribution.
2019-03-07 13:16:20 -08:00
anoyes 981426bac9 More ide fixes 2019-03-05 18:03:57 -08:00
Evan Tschannen b8910ba7cd Merge branch 'master' into feature-fix-force-recovery
# Conflicts:
#	fdbclient/ManagementAPI.actor.h
#	fdbserver/DataDistribution.actor.cpp
#	fdbserver/storageserver.actor.cpp
#	fdbserver/workloads/KillRegion.actor.cpp
2019-02-22 14:38:13 -08:00
Evan Tschannen d008de576e
Merge pull request #1139 from xumengpanda/mengxu/machine-team-upgrade-PR
Add background actor to remove redundant teams
2019-02-22 14:22:07 -08:00
Meng Xu 9445ac0b0c Status: Use new data distributor worker to publish status
After we add a new data distributor role, we publish the data
related to data distributor and rate keeper through the new
role (and new worker).

So the status needs to contact the data distributor, instead of master,
to get the status information.
2019-02-21 18:05:50 -08:00
Meng Xu 3e703dc2d1 TeamRemover: Fix bug that may not remove all teams needed 2019-02-21 15:54:16 -08:00
Meng Xu 7cca439e00 TeamRemover: Add status to show redundant team removing
Distinguish the removal of unhealthy team and redundant team.
Change status report to include redundant team removal report.
2019-02-21 14:16:46 -08:00
Meng Xu 0ac7014142 TeamRemover: Resolve minor comments from code review 2019-02-21 13:18:11 -08:00
Evan Tschannen 329ab766f1 factored out a duplicate code block
attempted to fix a compiler error
2019-02-20 18:20:10 -08:00
Meng Xu d86ba0e811 TeamRemover: Change it to run periodically
This simplifies the problem of when we should invoke the teamRemover
2019-02-20 16:08:34 -08:00
Evan Tschannen 27e3617548 fix: remove bad teams needed to use dd_stall_check delay, because in simulation the buggified delay time could make us remove bad teams before they submit their ranges to the queue 2019-02-20 14:18:36 -08:00
Evan Tschannen 3a572b010f fix: a forced recovery needed to force the data distributor to restart 2019-02-19 16:04:52 -08:00
mpilman 27a3153719 Use ACTOR forward declarations in MoveKeys
Also MoveKeys.h -> MoveKeys.actor.h
2019-02-19 15:16:59 -08:00
mpilman 3a0f9839b9 Fix minor IDE build errors 2019-02-19 15:16:59 -08:00
mpilman 3cb2391b58 use proper fwd declarations in ManagementAPI
Also ManagementAPI.h -> ManagementAPI.actor.h
2019-02-19 15:16:59 -08:00
Meng Xu 111ab2eccc TeamRemover: Check redundant team flag before satisfiesPolicy
In addTeam(), to determine the team is badTeam or not, we should check
redundantTeam before check satisfiesPolicy. Because if a team is
redundantTeam, it has been removed from the system before we call addTeam().
The only reason we call addTeam() for a removed redundantTeam is to
kick off the badTeam cleanup logic.
2019-02-19 14:46:47 -08:00
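
The check order described in the commit message above can be illustrated with a short-circuit expression. This is a sketch under stated assumptions: `TeamInfo`, its fields, and both functions are hypothetical stand-ins, not the types or signatures used by addTeam().

```cpp
// Hypothetical stand-in for a server team record.
struct TeamInfo {
    bool redundant;        // team was already removed as redundant
    bool policySatisfied;  // what a replication-policy check would report
};

// Hypothetical policy check; the real check evaluates the replication
// policy against the team's members.
constexpr bool satisfiesPolicy(const TeamInfo& team) {
    return team.policySatisfied;
}

// A team is treated as bad if it is redundant OR fails the policy.
// The redundant flag is tested first, so the policy check is skipped
// entirely for a team that has already been removed.
constexpr bool isBadTeam(const TeamInfo& team) {
    return team.redundant || !satisfiesPolicy(team);
}
```

With this ordering, a redundant team is classified as bad even if it would still satisfy the policy, which is what lets the badTeam cleanup logic run for it.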
Meng Xu e256d9a9ac TeamRemover: Change ASSERT in teamRemover function
When we remove a machine team in teamRemover function,
we should always find the machine team in the global machineTeams.
Change the ASSERT to the above invariant.
2019-02-19 08:13:10 -08:00
Meng Xu 3c1ed2eba9 TeamRemover: Confident no duplicate machine teams
In removeMachineTeam, we are confident that there is no duplicate
machine team when remove a machine team from a machines vector of
machineTeams
2019-02-19 08:13:10 -08:00
Meng Xu ed1d4635bc TeamRemover: Format cleaning
Use clang-format and remove debug messages for the code
that fixes bugs in merging the PR of adding a
DataDistributor role
2019-02-19 08:13:10 -08:00
Meng Xu 211036ee22 TeamRemover: Fix bugs introduced in the previous commit 2019-02-19 08:13:10 -08:00
Meng Xu a7810d9594 TeamRemover: Fix ASSERT condition in teamRemover 2019-02-19 08:13:10 -08:00
Meng Xu 06b6a1d2ad TeamRemover: Bug fix in teamRemover and add teamRemover invocation point 2019-02-19 08:13:10 -08:00
Meng Xu a6d3a5a3d6 TeamRemover: Change machineTeamNumber to healthyMachineTeamNumber
Always use healthy machine team number as the condition of
if redundant teams exist
2019-02-19 08:13:10 -08:00
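
As a rough illustration of the condition named in the commit message above, the redundancy test can be written as a comparison against a target count. The function name and the `desiredMachineTeams` parameter are hypothetical; how the desired number is computed is not stated in the message and is left to the caller here.

```cpp
// Returns true when there are more healthy machine teams than desired.
// Only healthy teams are counted, so unhealthy teams awaiting cleanup do
// not trigger removal of redundant ones.
// desiredMachineTeams is an assumed input; its derivation is not shown.
constexpr bool hasRedundantMachineTeams(int healthyMachineTeams,
                                        int desiredMachineTeams) {
    return healthyMachineTeams > desiredMachineTeams;
}
```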
Meng Xu b35631365f TeamRemover: Solve conflict when merge with PR 1061
The previous commit merge with the master, which just merges
the pull request #1062 from jzhou77/PR that adds a new DataDistribution role.

The merge causes conflicts and errors in simulation tests.

This commit resolves the code conflicts and
tries to fix the new errors after incorporating the new DataDistribution role
2019-02-19 08:13:10 -08:00
Evan Tschannen 065a45e05f Merge branch 'master' into feature-fix-force-recovery
# Conflicts:
#	fdbclient/ManagementAPI.actor.cpp
#	fdbserver/ClusterController.actor.cpp
#	fdbserver/workloads/KillRegion.actor.cpp
2019-02-18 17:09:06 -08:00
Evan Tschannen d492395f84 fix: simulation could buggify a delay such that data distribution incorrectly thinks the queue is not processing unhealthy relocations 2019-02-18 14:57:07 -08:00
Vishesh Yadav e05b53d755 Merge remote-tracking branch 'apple/master' into task/tls-upgrade 2019-02-15 20:37:07 -08:00
Meng Xu 6d09ac483c Merge with master 2019-02-15 17:03:40 -08:00
Meng Xu 5ca074d86f TeamRemover: No order of removing team and machine team
We do NOT enforce the removing order of removing a machine team
and the server teams on the machine team.

This is for the benefit of clear code logic.

When a storage server locality changes, we first remove the server
and its machine if needed, before we handle the server team removal
and addition.
2019-02-15 10:54:29 -08:00
Meng Xu cfd323dafe TeamRemover: Check when a server team is removed
We do not actively remove a machine team when it has no server team on it.
But since adding a server team may add a machine team, we need to be
careful that the machine team number is not larger than the desired number
due to server team creation.

So whenever a server team is removed, we should check if the teamRemover
should be kicked in.
2019-02-15 09:35:31 -08:00
Meng Xu e803eef906 TeamRemover: Must be called when machine number changes
When the machine number changes due to machine remove event,
the desired machine team number changes. Then we need to
make sure the teamRemover actor is running to clean up the
redundant teams.
2019-02-14 20:53:26 -08:00
Meng Xu 1e55e8fea6 TeamRemover: Do not call teamRemover in getTeam
getTeam is called very frequently and does not create a new team.
So no need to call teamRemover in getTeam.

teamRemover should be called only when a new team may be added.
2019-02-14 17:37:20 -08:00
Jingyu Zhou 5e6577cc82 Final cleanup per review comments
Make distributor interface optional in ServerDBInfo and many other small
changes.
2019-02-14 16:37:17 -08:00
Jingyu Zhou bf6da81bf9 Remove recovery version from data distribution queue
This parameter is no longer used/needed.
2019-02-14 16:37:16 -08:00