foundationdb

Commit Graph

Author	SHA1	Message	Date
Meng Xu	cc6a0e9bcd	TeamCollectionTest:Do not enforce minServerTeamOnServer larger than 0 In ConfigureTest, one server may be left with 0 server teams, even if we call buildTeams in the storageServerTracker.	2019-06-27 11:27:29 -07:00
Meng Xu	02cdcc0b0c	TeamCollectionTest: Only ensure each server and machine have a team	2019-06-27 11:27:29 -07:00
Meng Xu	21664742a6	TeamCollection:Desired team number may be larger than the max possible team number For example, we have 3 servers for replica factor 3. We can have only 1 team but the desired team number is 3 times 5 equal to 15. Instead of sanity checking the absolute team number per server, we check the difference between the minServerTeamOnServer and maxServerTeamOnServer.	2019-06-27 11:15:06 -07:00
Meng Xu	08f28e99f9	TeamCollection:Test no server or machine has incorrect team number Add test for simulation test which make sure the server team number per server will be no less than the desired_teams_per_server defined in knobs and no larger than the max_teams_per_server. Add similar test for machine teams number per machine as well.	2019-06-27 11:15:06 -07:00
A.J. Beamon	f417e60264	Merge branch 'merge-release-6.1-into-master' into thread-safe-random-number-generation # Conflicts: # fdbserver/QuietDatabase.actor.cpp	2019-05-23 09:52:00 -07:00
A.J. Beamon	d29c7e4c9b	Merge branch 'release-6.1' into merge-release-6.1-into-master # Conflicts: # documentation/sphinx/source/release-notes.rst # fdbserver/QuietDatabase.actor.cpp # versions.target	2019-05-23 09:28:45 -07:00
Evan Tschannen	f4b18f2c4f	fixed whitespace	2019-05-21 11:31:34 -07:00
Evan Tschannen	23091a7d96	fixed review comments	2019-05-21 10:53:36 -07:00
Evan Tschannen	4059d68348	fix: the tlog would not pop data from the disk queue after a storage server was removed, because the tag still exists in memory on the logs fix: we could incorrectly make data durable if eraseMessagesFromMemory was in progress while running updatePersistentData the quiet database check now ensure that tlogs have no more than 30 seconds of versions unpopped from the disk queue	2019-05-20 23:58:45 -07:00
A.J. Beamon	5f55f3f613	Replace g_random and g_nondeterministic_random with functions deterministicRandom() and nondeterministicRandom() that return thread_local random number generators. Delete g_debug_random and trace_random. Allow only deterministicRandom() to be seeded, and require it to be seeded from each thread on which it is used.	2019-05-10 14:01:52 -07:00
Austin Seipp	bf378952cb	fdbserver: fix some print/scan format warnings Signed-off-by: Austin Seipp <aseipp@pobox.com>	2019-05-06 13:35:29 -07:00
Evan Tschannen	710a64dc4e	replaced std::pair<WorkerInterface,ProcessClass> with a struct named WorkerDetails	2019-03-08 11:25:07 -05:00
Evan Tschannen	d008de576e	Merge pull request #1139 from xumengpanda/mengxu/machine-team-upgrade-PR Add background actor to remove redundant teams	2019-02-22 14:22:07 -08:00
mpilman	999ea09bfd	Use correct fwd decls in TesterInterface Also TesterInterface.h -> TesterInterface.actor.h	2019-02-19 15:16:59 -08:00
mpilman	3f0fd2a20c	Use fwd decls in WorkerInterface Also WorkerInterface.h -> WorkerInterface.actor.h	2019-02-19 15:16:59 -08:00
mpilman	0bb60e5a3b	Use proper fwd decl in NativeAPI Also NativeAPI.h -> NativeAPI.actor.h	2019-02-19 15:16:59 -08:00
mpilman	3cb2391b58	use proper fwd declarations in ManagementAPI Also ManagementAPI.h -> ManagementAPI.actor.h	2019-02-19 15:16:59 -08:00
Meng Xu	ed1d4635bc	TeamRemover: Format cleaning Use clang-format and remove debug messages for the code that fixes bugs in merging the PR of adding a DataDistributor role	2019-02-19 08:13:10 -08:00
Meng Xu	b35631365f	TeamRemover: Solve conflict when merging with PR 1061 The previous commit merges with the master, which just merges the pull request #1062 from jzhou77/PR that adds a new DataDistribution role. The merge causes conflicts and errors in simulation tests. This commit resolves the code conflicts and tries to fix the new errors after incorporating the new DataDistribution role	2019-02-19 08:13:10 -08:00
Meng Xu	6d09ac483c	Merge with master	2019-02-15 17:03:40 -08:00
Jingyu Zhou	5e6577cc82	Final cleanup per review comments Make distributor interface optional in ServerDBInfo and many other small changes.	2019-02-14 16:37:17 -08:00
Jingyu Zhou	07dab56133	Fix a data movement stuck bug When moving keys to a team, if one of the servers in the target team died, then the move can become stuck. This is because the DDTeamCollection waits for all the data movement of the failed server to be completed. However, in this case, because the movement has not finished yet, checking the database tells us there is no key associated with this server and it is safe to go ahead. In reality, only the in-memory structure knows there is pending movement, i.e., unfinished move causes some keys to be attributed to the failed server. Thus, the server can't be removed yet. Fix by adding a check with in-memory structure in waitForAllDataRemoved(). Use const& to optimize a few function parameters.	2019-02-14 16:37:16 -08:00
Jingyu Zhou	b3d1633114	Fix bugs of missing request The quiet database can fail to send out requests and report timeout. This seems to be caused by reusing a request that uses the same ReplyPromise. Another bug is Proxy can wait for unneeded time for a database change, while the distributor is already known to itself.	2019-02-14 16:37:16 -08:00
Jingyu Zhou	3135f1d84b	Cluster controller ignores distributor rejoin After controller starts one, it will wait for that one and ignore any rejoins received later. Add remoteRecovered() to data distribution for remote team collection.	2019-02-14 16:37:16 -08:00
Jingyu Zhou	ef868f599c	Add DataDistributorInterface to ServerDBInfo Also change the Proxy and QuietDatabase to use the DataDistributorInterface.	2019-02-14 16:37:16 -08:00
Jingyu Zhou	0490160714	Fix according to Evan's comments Use getRateInfo's endpoint as the ID for the DataDistributorInterface. For now, added a "rejoined" flag for ClusterControllerData and Proxy. TODO: move DataDistributorInterface into ServerDBInfo.	2019-02-14 16:30:13 -08:00
Jingyu Zhou	886e7ab2ba	Add a new DataDistributor role. Let cluster controller to start a new data distributor role by sending a message to a chosen worker. Change MasterInterface usage in DataDistribution to masterId Add DataDistributor rejoin handling. This allows the data distributor to tell the new cluster controller of its existence so that the controller doesn't spawn a new one. I.e., there should be only ONE data distributor in the cluster. If DataDistributor (DD) doesn't join in a while, then ClusterController (CC) tries to recruit one as DD. CC also monitors DD and restarts one if it failed. The Proxy is also monitoring the DD. If DD failed, the Proxy will ask CC for the new DD. Add GetRecoveryInfo RPC to master server, which is called by data distributor to obtain the recovery Transaction version from the master server.	2019-02-14 16:30:13 -08:00
Andrew Noyes	067a445e06	Replace unused _ variables with wait(success(...))	2019-02-12 17:30:30 -08:00
Meng Xu	3ae8767ee8	TeamCollection: Apply clang-format	2019-02-12 13:41:18 -08:00
Meng Xu	7cfe6de27e	TeamCollection: Server team number must match machine team number DESIRED_TEAMS_PER_MACHINE must equal DESIRED_TEAMS_PER_SERVER. Otherwise, we may have too few machine teams to create enough server teams. Note that BUGGIFY macro value is based on a random number generator. When you have two BUGGIFY, one may be true and the other is false. Also fix a bug in getting the number of healthy machine teams.	2019-02-07 13:53:55 -08:00
Meng Xu	76d022f71c	TeamCollection: Remove redundant teams When the total number of teams is larger than the desired number, we should gracefully remove the redundant teams so that the number of teams is kept to a low number and the possibility of losing data is guaranteed to be extremely low even when multiple racks fail at the same time.	2019-02-07 11:24:51 -08:00
Meng Xu	455024b3fe	SimulationTest: Test the number of teams Magnify the possibility that the number of created machine teams is larger than the number of desired machine teams if we do NOT try to remove the surplus machine teams. This helps test the upgrade to machine team in FDB 6.1	2019-02-06 11:04:41 -08:00
Meng Xu	2b73c89e98	TeamCollection: Test the number of teams Call the traceTeamCollectionInfo function to record the team numbers when we add a team directly from the shard information, instead of using addTeamsBestOf logic.	2019-02-05 15:58:16 -08:00
Meng Xu	f5171d1b57	TeamCollection: Test the number of teams The current simulator does not validate if the number of teams in the system is larger than the maximum desired number of teams. This validation should be added because we do NOT want too many teams in the system, which may impede the system's availability when multiple fault zones (e.g., machines) crash at the same time. This commit adds the test at the consistency check in simulation. Since the current code does not handle the upgrading situation when we enforce the machine teams, the test is expected to fail. The later commit will handle the upgrading situation, which gracefully removes the surplus teams.	2019-02-04 18:14:36 -08:00
Evan Tschannen	4b5d0b4e2c	Merge branch 'release-6.0' # Conflicts: # documentation/sphinx/source/release-notes.rst # fdbclient/AsyncFileBlobStore.actor.cpp # fdbclient/AsyncFileBlobStore.actor.h # fdbclient/BlobStore.actor.cpp # fdbclient/BlobStore.h # fdbclient/HTTP.actor.cpp # fdbclient/ManagementAPI.actor.cpp # fdbclient/NativeAPI.actor.cpp # fdbrpc/LoadBalance.actor.h # fdbrpc/batcher.actor.h # fdbrpc/fdbrpc.vcxproj # fdbrpc/sim2.actor.cpp # fdbserver/DataDistribution.actor.cpp # fdbserver/DataDistributionTracker.actor.cpp # fdbserver/SimulatedCluster.actor.cpp # fdbserver/TLogServer.actor.cpp # fdbserver/masterserver.actor.cpp	2018-11-10 13:04:24 -08:00
Evan Tschannen	3e2484baf7	fix: a team tracker could downgrade the priority of a relocation issued by the team tracker for the other region	2018-11-09 10:07:55 -08:00
Evan Tschannen	c02690471d	added protection against configuration changes which cannot be immediately reverted the configure database workload tests region configurations	2018-11-04 19:53:55 -08:00
Robert Escriva	268093a96d	Adjust all includes to be relative to the root. Remove the use of relative paths. A header at foo/bar.h could be included by files under foo/ with "bar.h", but would be included everywhere else as "foo/bar.h". Adjust so that every include references such a header with the latter form. Signed-off-by: Robert Escriva <rescriva@dropbox.com>	2018-10-19 17:35:33 +00:00
Evan Tschannen	1314bcec9e	Merge branch 'release-6.0' # Conflicts: # documentation/sphinx/source/release-notes.rst	2018-10-05 12:54:00 -07:00
Evan Tschannen	daed31708b	fix: we can only repair dead DCs if we have a fearless configuration	2018-10-05 12:35:37 -07:00
A.J. Beamon	2a97139d5d	This is the first step in eliminating the usage of database names in our code. The C API remains the same, but underneath that all usage of database names is eliminated.	2018-08-16 10:24:12 -07:00
Alex Miller	fb31a6999f	Rewrite all files to have #include actorcompiler.h as the last include.	2018-08-14 15:50:26 -07:00
Alex Miller	535b5701e5	Rewrite all `Void _ = wait(...)` -> `wait(...)`. This takes advantage of the new actorcompiler functionality to avoid having duplicate definitions of `Void _` when trying to feed the un-actorcompiled source through clang.	2018-08-14 15:50:26 -07:00
Evan Tschannen	9c918a28f6	fix: status was reporting no replicas remaining when the remote datacenter was initially configured with usable_regions=2	2018-08-09 13:16:09 -07:00
Evan Tschannen	1c29275672	call all methods which could disable a trace event before it is initialized. In practice this means calling .error first, then .suppressFor, then all your details.	2018-08-01 14:30:57 -07:00
Evan Tschannen	f72a9f60c0	only disable fearless if a datacenter has actually been killed fix: we must prevent recovery into the dead datacenter while reducing usable_regions	2018-07-16 10:06:57 -07:00
Evan Tschannen	d42c9914d2	fix: future quiet databases need to be able to continue the reconfigure if the first one completes the repopulate but is cancelled before changing usable_regions	2018-07-08 19:56:55 -07:00
Evan Tschannen	ce6b0d4952	fix: consistency check must also configure usable regions to 1, because the remote log set might not be able to copy data	2018-07-08 18:25:01 -07:00
Evan Tschannen	cd4fb9285a	waitForExlusion requires both regions to be healthy, which is only possible if we do not kill all logs in a region	2018-07-05 14:04:42 -07:00
Evan Tschannen	507b3bacb0	fix: kill all tlogs in one region prevents the remote logs from recovering in that region, do not allow that to prevent us from configuring usable_regions=1. added more recovery states.	2018-07-05 00:08:51 -07:00
Evan Tschannen	e17dfea3b6	fix: desiredTLogCount was used instead of getDesiredLogs(), which caused problems with recruitment when desiredTLogCount was -1. canKillProcess logic was wrong. We still need to configure usable_regions because if datacenterVersionDifference is too large we cannot complete data movement.	2018-07-04 16:22:32 -04:00
Evan Tschannen	ea3365dc38	fix: quiet database only needs to use repopulate_anti_quorum instead of reducing usable_regions	2018-07-04 02:52:00 -04:00
A.J. Beamon	9f545ce002	Merge commit '892727e358c0b3f075564c60c2b7cedb64306f83' into trace-log-refactor	2018-06-26 11:37:23 -07:00
Evan Tschannen	0913368651	added usable_regions to specify if we will replicate into a remote region remote replication defaults to the primary replication removed remote_logs, because they should be specified as an override in the regions object	2018-06-17 19:31:15 -07:00
A.J. Beamon	0ca51989bb	Merge branch 'master' into trace-log-refactor # Conflicts: # fdbserver/QuietDatabase.actor.cpp # fdbserver/Status.actor.cpp # flow/Trace.cpp	2018-06-08 13:24:30 -07:00
A.J. Beamon	e5488419cc	Attempt to normalize trace events: * Detail names now all start with an uppercase character and contain no underscores. Ideally these should be head-first camel case, though that was harder to check. * Type names have the same rules, except they allow one underscore (to support a usage pattern Context_Type). The first character after the underscore is also uppercase. * Use seconds instead of milliseconds in details. Added a check when events are logged in simulation that logs a message to stderr if the first two rules above aren't followed. This probably doesn't address every instance of the above problems, but all of the events I was able to hit in simulation pass the check.	2018-06-08 11:11:08 -07:00
A.J. Beamon	78839b20fd	Merge branch 'master' into trace-log-refactor # Conflicts: # flow/Trace.cpp	2018-05-31 10:46:20 -07:00
A.J. Beamon	ce0c991e78	Refactor trace events to store a vector of fields that aren't encoded until write time. Better support for pre-network trace events. Rework how trace events are queried. Some initial work towards pluggable formatting of logs.	2018-05-02 10:44:38 -07:00
Evan Tschannen	656a817e74	fix: only reconfigure during the quiet database check, because excluding at the same time as reconfiguring causes the master to indefinitely restart recovery	2018-05-01 15:31:49 -07:00
Alec Grieser	551ea9c7f8	Merge remote-tracking branch 'upstream/release-5.2' into master-release-5.2-merge	2018-03-19 12:34:50 -07:00
Alec Grieser	70a05c1a9b	fix some compiler whinges	2018-03-13 15:00:16 -07:00
A.J. Beamon	f2c804e14f	Reverting changes from merge of master into release-5.2 (`b25810711c`). Note that we never intend to release master into release-5.2, but if we did we would need to revert this commit.	2018-03-06 10:15:04 -08:00
Evan Tschannen	37a6a81634	Merge commit '7f6fc3e039c911cd84b8540f7f799fc38a1c1822' into feature-remote-logs # Conflicts: # fdbserver/workloads/RestartRecovery.actor.cpp	2018-02-23 12:33:28 -08:00
Alec Grieser	0bae9880f1	remove trailing whitespace from our copyright headers ; fixed formatting of python setup.py	2018-02-21 10:25:11 -08:00
Evan Tschannen	1b5628d2c5	testing a single configured fearless setup in simulated cluster consolidated simulation connection disablers into one call in the tester automatically reconfigure from a fearless setup in simulation	2018-02-18 12:59:43 -08:00
Evan Tschannen	645dc5ead6	warmRange needs to get a read version occasionally to prevent it from overwhelming the proxy quietDatabase waits for all data distribution to be completely finished so that databases are cached in a cleaner state	2018-01-14 12:50:52 -08:00
A.J. Beamon	bb1297c686	Remove RkServerQueueInfo and RkTLogQueueInfo trace events, since this information is more or less already logged on the storage servers and tlogs. Update the quiet database check and magnesium to use the information from the logs and storage servers.	2017-11-14 12:59:42 -08:00
Yichi Chiang	3865c5ae0e	Enable checkUsingDesiredClasses() in consistency check	2017-10-24 12:58:54 -07:00
Evan Tschannen	e8b895c878	added the ability to disable connection failures for a period of time after one happens	2017-09-18 12:46:29 -07:00
John King	d0fbc41338	set LOCK_AWARE on several transactions used for getting cluster info for the consistency check	2017-07-28 18:50:32 -07:00
FDB Dev Team	a674cb4ef4	Initial repository commit	2017-05-25 13:48:44 -07:00


121 Commits