Commit Graph

153 Commits

Author SHA1 Message Date
Meng Xu 5051b35c61 TeamCollection: Use machine team to create server team
The current server team collection logic does not consider
the fact that multiple storage servers can run on the same machine.
When multiple machines fail, all servers on those machines fail, and
the probability that a process team fails and loses data is very high.

To reduce the possibility of losing data when multiple machines fail,
we first create machine teams which span across different fault zones;
we then create server teams based on machine teams by
first picking 1 machine team, and then
picking 1 server from each machine in the machine team.

Signed-off-by: Meng Xu <meng_xu@apple.com>
2018-11-16 15:53:22 -08:00
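The two-step selection described in the commit message above (pick one machine team, then pick one server from each machine in it) can be sketched as follows. This is a minimal illustration, not the FoundationDB implementation: the function name `build_server_team`, the data layout, and the uniform random choice are all assumptions made for the example.

```python
import random

def build_server_team(machine_teams, servers_by_machine, rng=random):
    """Pick one machine team, then one server from each machine in that team.

    machine_teams: list of tuples of machine ids (assumed to already span
                   different fault zones).
    servers_by_machine: dict mapping machine id -> list of server ids.
    Both structures are hypothetical, chosen only for this sketch.
    """
    machine_team = rng.choice(machine_teams)
    return [rng.choice(servers_by_machine[m]) for m in machine_team]

# Hypothetical layout: three machines, two storage server processes each.
servers_by_machine = {
    "m1": ["s1a", "s1b"],
    "m2": ["s2a", "s2b"],
    "m3": ["s3a", "s3b"],
}
machine_teams = [("m1", "m2", "m3")]

team = build_server_team(machine_teams, servers_by_machine, random.Random(0))
print(team)
```

Because every member of the resulting server team comes from a different machine, losing one machine removes at most one server from the team.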
Robert Escriva 268093a96d Adjust all includes to be relative to the root.
Remove the use of relative paths.  A header at foo/bar.h could be included by
files under foo/ with "bar.h", but would be included everywhere else as
"foo/bar.h".  Adjust so that every include references such a header with the
latter form.

Signed-off-by: Robert Escriva <rescriva@dropbox.com>
2018-10-19 17:35:33 +00:00
Evan Tschannen 90301f497f Merge branch 'release-6.0'
# Conflicts:
#	fdbclient/ManagementAPI.actor.cpp
#	fdbrpc/FlowTransport.actor.cpp
#	fdbrpc/TLSConnection.actor.cpp
#	fdbserver/DataDistribution.actor.cpp
#	fdbserver/Status.actor.cpp
#	fdbserver/storageserver.actor.cpp
#	fdbserver/workloads/StatusWorkload.actor.cpp
#	versions.target
2018-09-05 16:06:33 -07:00
Evan Tschannen 21f5cf9ce9 suppress spammy trace events 2018-09-04 17:12:26 -07:00
Alex Miller fb31a6999f Rewrite all files to have #include actorcompiler.h as the last include. 2018-08-14 15:50:26 -07:00
Alex Miller 535b5701e5 Rewrite all `Void _ = wait(...)` -> `wait(...)`.
This takes advantage of the new actorcompiler functionality to avoid
having duplicate definitions of `Void _` when trying to feed the
un-actorcompiled source through clang.
2018-08-14 15:50:26 -07:00
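A source rewrite of the kind described above, `Void _ = wait(...)` to `wait(...)`, could be sketched with a regular expression. This is an assumption about how such a rewrite might be done, not the tool actually used for the commit; the pattern and the `rewrite` helper are hypothetical.

```python
import re

# Matches the "Void _ = wait(" prefix, tolerating variable whitespace.
PATTERN = re.compile(r"\bVoid\s+_\s*=\s*wait\s*\(")

def rewrite(line: str) -> str:
    """Replace `Void _ = wait(` with `wait(`, leaving the rest of the line alone."""
    return PATTERN.sub("wait(", line)

print(rewrite("\tVoid _ = wait( delay(1.0) );"))
```

Lines that do not contain the pattern pass through unchanged, so the helper can be applied to every line of a file.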
Evan Tschannen 1c29275672 call all methods which could disable a trace event before it is initialized. In practice this means calling .error first, then .suppressFor, then all your details. 2018-08-01 14:30:57 -07:00
Evan Tschannen 2820b6e0bb data inconsistency is always an error when detected by the consistency check 2018-07-09 22:26:13 -07:00
Evan Tschannen 89a4b2cd68 fix: consistency check could loop too long 2018-07-02 12:08:02 -04:00
Evan Tschannen 4a3247da69 fixed a few problems with the consistency check 2018-06-30 10:39:28 -07:00
Evan Tschannen 02f616eb68 fix: consistency check was broken when the key server key space is sharded 2018-06-28 23:16:32 -07:00
Evan Tschannen 45cf0067e4 fix: consistency check was not checking for data inconsistencies 2018-06-28 11:08:16 -07:00
Evan Tschannen 0913368651 added usable_regions to specify if we will replicate into a remote region
remote replication defaults to the primary replication
removed remote_logs, because they should be specified as an override in the regions object
2018-06-17 19:31:15 -07:00
A.J. Beamon e5488419cc Attempt to normalize trace events:
* Detail names now all start with an uppercase character and contain no underscores. Ideally these should be head-first camel case, though that was harder to check.
* Type names have the same rules, except they allow one underscore (to support a usage pattern Context_Type). The first character after the underscore is also uppercase.
* Use seconds instead of milliseconds in details.

Added a check when events are logged in simulation that logs a message to stderr if the first two rules above aren't followed.

This probably doesn't address every instance of the above problems, but all of the events I was able to hit in simulation pass the check.
2018-06-08 11:11:08 -07:00
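The first two naming rules in the commit message above can be expressed as a small check. This is a sketch of the rules as stated, not the check that was added to simulation; the function names are hypothetical, and the camel-case preference is not verified (the message notes that it was harder to check).

```python
def valid_detail_name(name: str) -> bool:
    """Detail names start with an uppercase character and contain no underscores."""
    return bool(name) and name[0].isupper() and "_" not in name

def valid_type_name(name: str) -> bool:
    """Type names follow the same rules but allow one underscore (Context_Type);
    the first character after the underscore must also be uppercase."""
    parts = name.split("_")
    if len(parts) > 2:
        return False
    return all(p and p[0].isupper() for p in parts)

for candidate in ["ShardSize", "shard_size", "Context_Type", "Context_type"]:
    print(candidate, valid_detail_name(candidate), valid_type_name(candidate))
```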
Evan Tschannen 19762b847d Merge branch 'release-5.2'
# Conflicts:
#	fdbserver/DatabaseConfiguration.cpp
#	fdbserver/SimulatedCluster.actor.cpp
2018-04-10 17:02:43 -07:00
Evan Tschannen b95e68eb5a fix: getDatabaseSize is really inefficient and causes slow tasks in the real world. Outside of simulation just assume the database is really large, because we only need the InvalidShardSize check in simulation 2018-03-26 17:35:11 -07:00
Evan Tschannen 65b532658f added support for single region configurations 2018-03-15 10:59:30 -07:00
Evan Tschannen 3abf4d7fdf Merge branch 'master' into feature-remote-logs 2018-03-09 14:50:04 -08:00
Evan Tschannen 91bb8faa45 Merge commit 'f773b9460d31d31b7d421860fc647936f31aa1fa'
# Conflicts:
#	tests/fast/SidebandWithStatus.txt
#	tests/rare/LargeApiCorrectnessStatus.txt
#	tests/slow/DDBalanceAndRemoveStatus.txt
2018-03-09 14:47:03 -08:00
Evan Tschannen cf6dd1437b suppress spammy trace events 2018-03-09 10:16:34 -08:00
Balachandar Namasivayam e7309a3535 Add trace events to print the ranges in ConsistencyCheck. 2018-03-08 13:53:59 -08:00
Balachandar Namasivayam 4f58bca66a Simple refactor of code... 2018-03-08 11:34:25 -08:00
Balachandar Namasivayam 1c1a497ea2 Refactor getKeyServers to be more readable.
Fix possible memory corruption by returning KeyRange instead of KeyRangeRef in getKeyServers.
Simplify getMasterProxies on DatabaseContext class.
2018-03-08 11:34:18 -08:00
Balachandar Namasivayam 03a40354e3 Having 1000 as the limit for GetKeyServerLocationsRequest sometimes generates large packet warnings. Reduce it to 100.
Fix the bug where some of the key server shards may not be fetched.
2018-03-08 11:34:11 -08:00
Evan Tschannen 1194e3a361 added region-based configuration to support a large variety of fearless setups. Currently only 1 primary 1 remote setups are allowed. 2018-03-05 19:27:46 -08:00
Evan Tschannen 470f5c01f3 changed remoteDcId to a vector of ids, to support future configurations where there are multiple remote databases 2018-02-26 17:09:09 -08:00
Evan Tschannen 37a6a81634 Merge commit '7f6fc3e039c911cd84b8540f7f799fc38a1c1822' into feature-remote-logs
# Conflicts:
#	fdbserver/workloads/RestartRecovery.actor.cpp
2018-02-23 12:33:28 -08:00
Evan Tschannen 719bb5bd0c
Merge pull request #4 from bnamasivayam/getKeyServers-refactor
Having 1000 as the limit for Limit for GetKeyServerLocationsRequest s…
2018-02-22 12:39:48 -08:00
Balachandar Namasivayam 2fe2b522d5 Simple refactor of code... 2018-02-22 12:38:14 -08:00
Balachandar Namasivayam e2030db5a8 Refactor getKeyServers to be more readable.
Fix possible memory corruption by returning KeyRange instead of KeyRangeRef in getKeyServers.
Simplify getMasterProxies on DatabaseContext class.
2018-02-21 17:11:50 -08:00
Alec Grieser 0bae9880f1 remove trailing whitespace from our copyright headers ; fixed formatting of python setup.py 2018-02-21 10:25:11 -08:00
Balachandar Namasivayam 6218934c7b Having 1000 as the limit for GetKeyServerLocationsRequest sometimes generates large packet warnings. Reduce it to 100.
Fix the bug where some of the key server shards may not be fetched.
2018-02-20 17:41:34 -08:00
Evan Tschannen 1fedcba890 fix: do not use log router tags when configured without remote logs
fix: data distribution tracks undesired storage servers
re-enabled consistency check
2018-02-13 17:01:34 -08:00
Evan Tschannen 1dc9eceb6d optimize GetKeyLocationRequests on the proxy so they only require a single map lookup, instead of doing 3 + (3* [number of ranges]) lookups 2017-12-15 20:13:44 -08:00
Evan Tschannen 73a0a07eac clients ask for key location information directly from the proxy, instead of reading it from the database 2017-12-09 16:10:22 -08:00
Yichi Chiang 8ba0eaebff Check cluster controller using desired process class in consistency check 2017-11-29 15:09:23 -08:00
Evan Tschannen 57aba0b3bc fix: excluded servers were the same fitness as storage servers for the master role
fix: better master exists did not consider exclusion for master fitness
2017-11-03 17:09:14 -07:00
Yichi Chiang c2a117fe07 Merge pull request #189 from cie/enable-check-desired-class
Enable checkUsingDesiredClasses() in consistency check
2017-10-24 15:18:21 -07:00
Yichi Chiang 3865c5ae0e Enable checkUsingDesiredClasses() in consistency check 2017-10-24 12:58:54 -07:00
Evan Tschannen ef41b07bb3 renamed past_version to transaction_too_old
implemented read_lock_aware option
2017-09-28 16:35:08 -07:00
Evan Tschannen 7b60e26660 Merge pull request #160 from cie/use-error-descriptions
Add the ability to access name and description in Error. Update error…
2017-09-28 16:00:39 -07:00
Evan Tschannen 73fca75239 added the ability to disable timeKeeper; disabled timeKeeper before consistency check in simulation 2017-09-28 13:13:24 -07:00
A.J. Beamon d30c730f75 Add the ability to access name and description in Error. Update error descriptions. 2017-09-28 12:35:03 -07:00
Evan Tschannen d61be4c760 Merge branch 'release-5.0' 2017-08-30 12:59:24 -07:00
Evan Tschannen 963e1c3f31 fix: we need to reboot the process even if it will result in too many files, because the check will not succeed without it 2017-08-30 12:58:46 -07:00
Alvin Moore 6020d70863 Added trace event to track reboots initiated by ConsistencyCheck workload in simulation 2017-08-29 11:41:27 -07:00
Alvin Moore c95a1be5ec Add trace event for rebooting process during simulation for consistency check 2017-08-29 11:00:44 -07:00
Alec Grieser 300b5a17ed Merge branch 'release-5.0' 2017-08-25 18:55:33 -07:00
Evan Tschannen 272b4b984c fix: fixed a rare bug where we do not wait for a file in the process of being deleted to shutdown before rebooting a machine 2017-08-25 10:12:58 -07:00
Alec Grieser ca7437ecf6 Merge branch 'release-5.0' 2017-08-02 22:07:01 -07:00
John King d0fbc41338 set LOCK_AWARE on several transactions used for getting cluster info for the consistency check 2017-07-28 18:50:32 -07:00
Yichi Chiang 53e1ae9f60 shard system keyspace 2017-07-26 13:47:31 -07:00
FDB Dev Team a674cb4ef4 Initial repository commit 2017-05-25 13:48:44 -07:00