Commit Graph

79 Commits

Author SHA1 Message Date
Jingyu Zhou 254c78053c Fix a segfault error
After wait, ServerDBInfo may have changed. Using the old copy is wrong.
2019-03-15 22:11:13 -07:00
Jingyu Zhou 12ddd56698 Fix Ratekeeper and DataDistributor placement
Make sure both RateKeeper and DataDistributor are placed in the same data
center as the Master. Make sure only one RateKeeper is live in the cluster as
well.
2019-03-15 17:09:28 -07:00
Jingyu Zhou bb5686eb75 Fix monitoring of DD and RK 2019-03-15 16:02:17 -07:00
Jingyu Zhou 40860e0093 Attempt to fix. 2019-03-15 11:29:04 -07:00
Jingyu Zhou 9e59c9c253 Check DataDistributor and RateKeeper fitness
Fail the test if they are not put in the best fitness.
2019-03-14 16:14:57 -07:00
Evan Tschannen 044b6b4f8a Merge branch 'master' into feature-degraded-tlog
# Conflicts:
#	fdbserver/ClusterController.actor.cpp
2019-03-08 22:50:41 -05:00
Evan Tschannen 710a64dc4e replaced std::pair<WorkerInterface,ProcessClass> with a struct named WorkerDetails 2019-03-08 11:25:07 -05:00
anoyes 981426bac9 More ide fixes 2019-03-05 18:03:57 -08:00
Evan Tschannen fafb66b0a8
Merge pull request #1126 from bnamasivayam/ratelimit-consistencycheck
Dynamically rate limit consistency check.
2019-02-27 14:43:09 -08:00
Balachandar Namasivayam ab99497695 Addressed review comments. 2019-02-25 18:29:30 -08:00
Evan Tschannen d008de576e
Merge pull request #1139 from xumengpanda/mengxu/machine-team-upgrade-PR
Add background actor to remove redundant teams
2019-02-22 14:22:07 -08:00
mpilman 999ea09bfd Use correct fwd decls in TesterInterface
Also TesterInterface.h -> TesterInterface.actor.h
2019-02-19 15:16:59 -08:00
mpilman 699216f713 Use fwd decls in workloads
Also workloads.h -> workloads.actor.h
2019-02-19 15:16:59 -08:00
mpilman 0bb60e5a3b Use proper fwd decl in NativeAPI
Also NativeAPI.h -> NativeAPI.actor.h
2019-02-19 15:16:59 -08:00
mpilman 3cb2391b58 use proper fwd declarations in ManagementAPI
Also ManagementAPI.h -> ManagementAPI.actor.h
2019-02-19 15:16:59 -08:00
Meng Xu b35631365f TeamRemover: Solve confict when merge with PR 1061
The previous commit merge with the master, which just merges
the pull request #1062 from jzhou77/PR that adds a new DataDistribution role.

The merge causes conflicts and errors in simulation tests.

This commit resolves the code conflicts and
tries to fix the new errors after incorporating the new DataDistribution role
2019-02-19 08:13:10 -08:00
Meng Xu 6d09ac483c Merge with master 2019-02-15 17:03:40 -08:00
Meng Xu 5481851e82 TeamCollection: Add knobs for team remover
Added three knobs to control team remover

bool TR_FLAG_DISABLE_TEAM_REMOVER:
	Disable the teamRemover actor
double TR_REMOVE_MACHINE_TEAM_DELAY:
	Wait for the specified time before try to remove next machine team
double TR_WAIT_FOR_ALL_MACHINES_HEALTHY_DELAY:
	Wait before checking if all machines are healthy
2019-02-13 15:11:56 -08:00
Meng Xu 3ae8767ee8 TeamCollection: Apply clang-format 2019-02-12 13:41:18 -08:00
Balachandar Namasivayam f44f26c232 Dynamically rate limit consistency check. 2019-02-07 16:08:39 -08:00
Meng Xu 2b73c89e98 TeamCollection: Test the number of teams
Call the traceTeamCollectionInfo function to record the team numbers
when we add a team directly from the shard information, instead of
using addTeamsBestOf logic.
2019-02-05 15:58:16 -08:00
Meng Xu f5171d1b57 TeamCollection: Test the number of teams
The current simulator does not validate if the number of teams in
the system is larger than the maximum desired number of teams.
This validation should be added because we do NOT want too many teams
in the system, which may impede the systems availability when
multiple fault zones (e.g., machines) crashes at the same time.

This commit adds the test at the consistency check in simulation.
Since the current code does not handle the upgrading situation
when we enforce the machine teams, the test is expected to fail.

The later commit will handle the upgrading situation which gracefully
remove the surplus teams.
2019-02-04 18:14:36 -08:00
Evan Tschannen 699f8dd617 fix: coordinators auto could put two coordinators in the same zone
simulation now tests two machines in the same zone
2019-01-18 15:42:48 -08:00
Balachandar Namasivayam a8e2e75cd5 Re-enable CheckDesiredClasses after making necessary changes for multi-region setup.
Fixed a couple of bugs
1) A rare race condition where a worker is being roles even after it died.
2) Fix how RoleFitness is calculated for TLog and LogRouter. Only worst fitness is compared to see if a better fit is available.
2019-01-10 10:28:32 -08:00
Meng Xu 8de031f9a6 TeamCollection: clang-format
Format the changes with git clang-format.
No functional changes.

Signed-off-by: Meng Xu <meng_xu@apple.com>
2018-11-21 11:18:26 -08:00
Meng Xu f7a7e069f0 TeamCollection: Remove unnecessary comments
Pass 41806 tests with no failure

Signed-off-by: Meng Xu <meng_xu@apple.com>
2018-11-16 15:56:35 -08:00
Meng Xu 5051b35c61 TeamCollection: Use machine team to create server team
Current server team collection logic does not consider
the fact that multipe storage servers can run on the same machine.
When multiple machines fail, all servers on the machines will fail, and
the possibility of having one process team fail and lose data is very high.

To reduce the possibility of losing data when multiple machine fails,
we first create machine teams which span across different fault zones;
we then create server teams based on machine teams by
first picking 1 machine team, and then
picking 1 server from each machine in the machine team.

Signed-off-by: Meng Xu <meng_xu@apple.com>
2018-11-16 15:53:22 -08:00
Robert Escriva 268093a96d Adjust all includes to be relative to the root.
Remove the use of relative paths.  A header at foo/bar.h could be included by
files under foo/ with "bar.h", but would be included everywhere else as
"foo/bar.h".  Adjust so that every include references such a header with the
latter form.

Signed-off-by: Robert Escriva <rescriva@dropbox.com>
2018-10-19 17:35:33 +00:00
Evan Tschannen 90301f497f Merge branch 'release-6.0'
# Conflicts:
#	fdbclient/ManagementAPI.actor.cpp
#	fdbrpc/FlowTransport.actor.cpp
#	fdbrpc/TLSConnection.actor.cpp
#	fdbserver/DataDistribution.actor.cpp
#	fdbserver/Status.actor.cpp
#	fdbserver/storageserver.actor.cpp
#	fdbserver/workloads/StatusWorkload.actor.cpp
#	versions.target
2018-09-05 16:06:33 -07:00
Evan Tschannen 21f5cf9ce9 suppress spammy trace events 2018-09-04 17:12:26 -07:00
Alex Miller fb31a6999f Rewrite all files to have #include actorcompiler.h as the last include. 2018-08-14 15:50:26 -07:00
Alex Miller 535b5701e5 Rewrite all `Void _ = wait(...)` -> `wait(...)`.
This takes advantage of the new actorcompiler functionality to avoid
having duplicate definitions of `Void _` when trying to feed the
un-actorompiled source through clang.
2018-08-14 15:50:26 -07:00
Evan Tschannen 1c29275672 call all methods which could disable a trace event before it is initialized. In practice this means calling .error first, then .suppressFor, then all your details. 2018-08-01 14:30:57 -07:00
Evan Tschannen 2820b6e0bb data inconsistency is always an error when detected by the consistency check 2018-07-09 22:26:13 -07:00
Evan Tschannen 89a4b2cd68 fix: consistency check could loop too long 2018-07-02 12:08:02 -04:00
Evan Tschannen 4a3247da69 fixed a few problems with the consistency check 2018-06-30 10:39:28 -07:00
Evan Tschannen 02f616eb68 fix: consistency check was broken when the key server key space is sharded 2018-06-28 23:16:32 -07:00
Evan Tschannen 45cf0067e4 fix: consistency check was not checking for data inconsistencies 2018-06-28 11:08:16 -07:00
Evan Tschannen 0913368651 added usable_regions to specify if we will replicate into a remote region
remote replication defaults to the primary replication
removed remote_logs, because they should be specified as an override in the regions object
2018-06-17 19:31:15 -07:00
A.J. Beamon e5488419cc Attempt to normalize trace events:
* Detail names now all start with an uppercase character and contain no underscores. Ideally these should be head-first camel case, though that was harder to check.
* Type names have the same rules, except they allow one underscore (to support a usage pattern Context_Type). The first character after the underscore is also uppercase.
* Use seconds instead of milliseconds in details.

Added a check when events are logged in simulation that logs a message to stderr if the first two rules above aren't followed.

This probably doesn't address every instance of the above problems, but all of the events I was able to hit in simulation pass the check.
2018-06-08 11:11:08 -07:00
Evan Tschannen 19762b847d Merge branch 'release-5.2'
# Conflicts:
#	fdbserver/DatabaseConfiguration.cpp
#	fdbserver/SimulatedCluster.actor.cpp
2018-04-10 17:02:43 -07:00
Evan Tschannen b95e68eb5a fix: getDatabaseSize is really inefficient and causes slow tasks in the real world. Outside of simulation just assume the database is really large, because we only need the InvalidShardSize check in simulation 2018-03-26 17:35:11 -07:00
Evan Tschannen 65b532658f added support for single region configurations 2018-03-15 10:59:30 -07:00
Evan Tschannen 3abf4d7fdf Merge branch 'master' into feature-remote-logs 2018-03-09 14:50:04 -08:00
Evan Tschannen 91bb8faa45 Merge commit 'f773b9460d31d31b7d421860fc647936f31aa1fa'
# Conflicts:
#	tests/fast/SidebandWithStatus.txt
#	tests/rare/LargeApiCorrectnessStatus.txt
#	tests/slow/DDBalanceAndRemoveStatus.txt
2018-03-09 14:47:03 -08:00
Evan Tschannen cf6dd1437b suppress spammy trace events 2018-03-09 10:16:34 -08:00
Balachandar Namasivayam e7309a3535 Add trace events to print the ranges in ConsistencyCheck. 2018-03-08 13:53:59 -08:00
Balachandar Namasivayam 4f58bca66a Simple refactor of code... 2018-03-08 11:34:25 -08:00
Balachandar Namasivayam 1c1a497ea2 Refactor getKeyServers to be more readable.
Fix possible memory corruption by returning KeyRange instead of KeyRangeRef in getKeyServers.
Simplify getMasterProxies on DatabaseContext class.
2018-03-08 11:34:18 -08:00
Balachandar Namasivayam 03a40354e3 Having 1000 as the limit for Limit for GetKeyServerLocationsRequest sometimes generate large packet warnings. Reduce it to 100.
Fix the bug where some of the key server shards may not be fetched.
2018-03-08 11:34:11 -08:00