We do NOT enforce an ordering between removing a machine team and removing
the server teams on that machine team.
This keeps the code logic clear.
When a storage server's locality changes, we first remove the server (and its
machine if needed) before handling the server team removal and addition.
We do not actively remove a machine team when it has no server team on it.
But since adding a server team may also add a machine team, we need to be
careful that server team creation does not push the number of machine teams
above the desired number.
So whenever a server team is removed, we should check whether the teamRemover
should be kicked in.
When the number of machines changes due to a machine removal event, the
desired number of machine teams changes. We then need to make sure the
teamRemover actor is running to clean up the redundant teams.
getTeam is called very frequently and does not create a new team, so there is
no need to call teamRemover in getTeam.
teamRemover should be called only when a new team may be added.
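A minimal plain-C++ sketch of that trigger condition; the struct and names
below are illustrative stand-ins, not the actual DDTeamCollection members:

    #include <cstddef>

    // Illustrative only: decide whether the teamRemover should be kicked off.
    struct TeamCounts {
        size_t machineTeams;        // machine teams currently built
        size_t desiredMachineTeams; // derived from the number of healthy machines
    };

    // Called after an event that may have added a team or changed the desired
    // count (e.g., a machine was removed); getTeam() never calls this because
    // it never creates a new team.
    bool shouldRunTeamRemover(const TeamCounts& c) {
        return c.machineTeams > c.desiredMachineTeams;
    }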
buildTeam() can create teams with undesired storage servers, which are
considered unhealthy. As a result, data movement can become stuck.
Fix this by adding an ACTOR, monitorHealthyTeams, that builds teams every
second whenever there are no healthy teams.
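A rough plain-C++ sketch of the monitorHealthyTeams idea; the real version is
a flow ACTOR inside data distribution, and the callables here are
hypothetical stand-ins:

    #include <chrono>
    #include <functional>
    #include <thread>

    // While there are no healthy teams, retry buildTeam roughly once per second.
    void monitorHealthyTeams(const std::function<bool()>& hasHealthyTeam,
                             const std::function<void()>& buildTeam,
                             const std::function<bool()>& stopped) {
        while (!stopped()) {
            if (!hasHealthyTeam()) {
                buildTeam(); // may create teams from the currently healthy servers
            }
            std::this_thread::sleep_for(std::chrono::seconds(1));
        }
    }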
Clean up storageServerTracker() interface.
When zeroHealthyTeams signals and a storage server becomes healthy, we could
attempt buildTeam before the ServerStatusMap is updated. As a result, the
healthy server is not available for use. Fix by delaying buildTeam until
after the status map is updated.
This bug was introduced in cee23ee3. During a configuration change, the data
distributor is restarted, which destroys the previous DDTeamCollection and
cancels all previous teamTracker() actors. In this case, even though the
healthy team count reaches 0, there is no need to rebuild teams. The bug is
triggered when we try to rebuild teams after the DDTeamCollection has already
been destroyed.
When moving keys to a team, if one of the servers in the target team dies,
the move can become stuck. This is because the DDTeamCollection waits for all
data movement off the failed server to complete. However, because the
movement has not finished yet, checking the database tells us there are no
keys associated with this server and that it is safe to go ahead. In reality,
only the in-memory structure knows there is pending movement, i.e., the
unfinished move still attributes some keys to the failed server. Thus, the
server can't be removed yet. Fix by also checking the in-memory structure in
waitForAllDataRemoved().
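A sketch of the combined check, using illustrative names; the real fix
consults the in-memory shard map inside waitForAllDataRemoved():

    #include <cstdint>

    // A failed server may only be removed when BOTH the database and the
    // in-memory shard tracking agree that it owns no data; checking the
    // database alone misses moves that are still in flight.
    bool safeToRemoveServer(int64_t keysInDatabaseForServer,
                            int64_t inMemoryShardsAssignedToServer) {
        return keysInDatabaseForServer == 0 && inMemoryShardsAssignedToServer == 0;
    }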
Use const& to optimize a few function parameters.
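A generic illustration of the pattern (not one of the actual signatures
touched by this change):

    #include <string>
    #include <vector>

    // Pass-by-value would copy the vector on every call; a const reference
    // avoids the copy for a read-only parameter.
    int countLongNames(const std::vector<std::string>& names) {
        int n = 0;
        for (const auto& s : names)
            if (s.size() > 8) ++n;
        return n;
    }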
The quiet database check can fail to send out requests and report a timeout.
This seems to be caused by reusing a request that uses the same ReplyPromise.
Another bug is that the proxy can wait unnecessarily long for a database
change even though it already knows the distributor.
setDistributor() sets an AsyncVar and then runs waitFailureClient. This
ordering is wrong because AsyncVar::set triggers the other loop to run first,
which then waits on Never(). The correct code should wait on the Future
returned by waitFailureClient.
This fixes a bug found by the upgrade test, where the configuration monitor
of the data distributor was monitoring excludedServersVersionKey, which
doesn't change in the ChangeConfig workload. As a result, the data
distributor was not aware of configuration changes.
Add this new key and make sure it is updated on configuration changes so that
the monitor can detect them.
After the controller starts one, it waits for that one and ignores any
rejoins received later.
Add remoteRecovered() to data distribution for remote team collection.
Found in tests: a move key conflict exception was not handled because the
Future object was not waited on by anyone. As a result, the data distributor
did not die, and the database check couldn't get the metric and kept trying
until timeout.
Use getRateInfo's endpoint as the ID for the DataDistributorInterface.
For now, added a "rejoined" flag for ClusterControllerData and Proxy.
TODO: move DataDistributorInterface into ServerDBInfo.
Let the cluster controller start a new data distributor role by sending a
message to a chosen worker.
Change MasterInterface usage in DataDistribution to masterId
Add DataDistributor rejoin handling.
This allows the data distributor to tell the new cluster controller of its
existence so that the controller doesn't spawn a new one. I.e., there should
be only ONE data distributor in the cluster.
If the DataDistributor (DD) doesn't join for a while, the ClusterController
(CC) tries to recruit one as DD. The CC also monitors the DD and recruits a
new one if it fails.
The Proxy also monitors the DD; if the DD fails, the Proxy asks the CC for
the new DD.
Add GetRecoveryInfo RPC to master server, which is called by data distributor
to obtain the recovery Transaction version from the master server.
Added three knobs to control the team remover:
bool TR_FLAG_DISABLE_TEAM_REMOVER:
Disable the teamRemover actor
double TR_REMOVE_MACHINE_TEAM_DELAY:
Wait for the specified time before trying to remove the next machine team
double TR_WAIT_FOR_ALL_MACHINES_HEALTHY_DELAY:
Wait before checking if all machines are healthy
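A sketch of how the knobs could be grouped; the default values shown here are
placeholders, not the committed ones:

    // Illustrative defaults only; the real knobs live with the server knobs.
    struct TeamRemoverKnobs {
        bool   TR_FLAG_DISABLE_TEAM_REMOVER = false;          // turn the actor off entirely
        double TR_REMOVE_MACHINE_TEAM_DELAY = 60.0;           // seconds between removal attempts
        double TR_WAIT_FOR_ALL_MACHINES_HEALTHY_DELAY = 60.0; // settle time before the health check
    };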
When we remove a server due to server failure, we need to
remove the related server teams AND remove the server team from
the machine team.
In the previous commit, we forgot to remove the server team from
the machine team.
1) Reduce the frequency of checking if we need to call teamRemover
2) Improve code efficiency in finding the machine team to remove
3) Remove unused code
4) Add sanity check
DESIRED_TEAMS_PER_MACHINE must equal DESIRED_TEAMS_PER_SERVER.
Otherwise, we may have too few machine teams to create enough server teams.
Note that the BUGGIFY macro value is based on a random number generator:
when you have two BUGGIFYs, one may be true while the other is false.
Also fix a bug in getting the number of healthy machine teams.
When the total number of teams is larger than the desired number, we should
gracefully remove the redundant teams so that the number of teams stays low
and the possibility of losing data remains extremely low even when multiple
racks fail at the same time.
Magnify the possibility that the number of created machine teams is larger
than the number of desired machine teams if we do NOT try to remove the
surplus machine teams.
This helps test the upgrade to machine teams in FDB 6.1.
Call the traceTeamCollectionInfo function to record the team numbers
when we add a team directly from the shard information, instead of
using addTeamsBestOf logic.
The current simulator does not validate whether the number of teams in the
system is larger than the maximum desired number of teams. This validation
should be added because we do NOT want too many teams in the system, which
may impede the system's availability when multiple fault zones (e.g.,
machines) crash at the same time.
This commit adds the check to the consistency check in simulation.
Since the current code does not handle the upgrade situation when we enforce
machine teams, the test is expected to fail. A later commit will handle the
upgrade situation by gracefully removing the surplus teams.
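A sketch of the kind of bound the consistency check can assert; the real
check reads the knobs and the live team counts:

    #include <cstddef>

    // With serverCount storage servers and a per-server cap, the total number
    // of server teams should never exceed serverCount * maxTeamsPerServer.
    bool teamCountWithinLimit(size_t serverCount,
                              size_t serverTeamCount,
                              size_t maxTeamsPerServer) {
        return serverTeamCount <= serverCount * maxTeamsPerServer;
    }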
- This patch makes FDB listen to multiple addresses given via the command
line. Although we'll still use the first address in most places, this patch
starts using vector<NetworkAddress> in Endpoint in some basic places.
- When sending packets to an endpoint, pick a random network address from
the endpoint's address list.
- Renames Endpoint::address to Endpoint::addresses since it
now holds a vector of addresses.
Extend the `Endpoint` class to take multiple NetworkAddresses instead of
just one. Hence, to talk to an endpoint we'll have multiple IP:PORT pairs
instead of one IP:PORT.
This patch simply adds the field and makes the changes needed to compile the
codebase. The first element of the `address` field is used everywhere, so
the way we talk to an endpoint remains the same with this patch.
NOTE:
Directly accessing the first member of Endpoint::address is unsafe, as
Endpoint() doesn't enforce a non-empty address list. However, since the
correctness tests pass for now and we are anyway replacing all those unsafe
accesses with ones that consider the whole vector, this patch does not yet
access them in a safe way.
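A simplified sketch (not the real classes) of the shape of the change: a
vector of addresses, a random pick when sending, and the still-unchecked
access to the first element:

    #include <cassert>
    #include <cstdlib>
    #include <string>
    #include <vector>

    struct NetworkAddress { std::string ip; int port; }; // simplified stand-in

    struct Endpoint {
        // Was a single NetworkAddress; now a list of addresses for the endpoint.
        std::vector<NetworkAddress> addresses;

        // Most call sites still talk to the first address.  Nothing enforces a
        // non-empty list yet, so this access is unsafe until callers are fixed.
        const NetworkAddress& primaryAddress() const { return addresses[0]; }

        // When sending a packet, pick one of the addresses at random.
        const NetworkAddress& randomAddress() const {
            assert(!addresses.empty());
            return addresses[std::rand() % addresses.size()];
        }
    };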
Further improve code efficiency by:
1) Avoiding rebuilding the machine locality map when machine locality
changes. This may leave the global machine locality map stale, which is OK
as long as we do not use the global map to validate that a machine team
follows the locality policy.
2) Using ASSERT_WE_THINK instead of ASSERT to avoid runtime overhead.
ASSERT_WE_THINK only validates the condition in simulation mode.
Signed-off-by: Meng Xu <meng_xu@apple.com>
Make sure the link between server and machine is updated on both the server
and the machine.
Rename the function to better reflect its functionality.
Signed-off-by: Meng Xu <meng_xu@apple.com>
A server's locality may change from one machine to another.
This affects the old machine and machine team the server was on, and the new
machine the server moves to.
Signed-off-by: Meng Xu <meng_xu@apple.com>
We only create machine teams of the correct size.
When the configuration (e.g., team size) is changed, the DDTeamCollection is
destroyed and rebuilt so that this invariant is not violated.
Based on the invariant, we can count the number of machine teams more
quickly.
Signed-off-by: Meng Xu <meng_xu@apple.com>
The addAllTeams function can be replaced with the new addTeamsBestOf
function by passing a large enough number of teams to build.
Remove the addAllTeams function and update the related unit tests.
Signed-off-by: Meng Xu <meng_xu@apple.com>
The buggify option may set the knob parameters (DESIRED_TEAMS_PER_SERVER and
MAX_TEAMS_PER_SERVER) to 1. When this happens, the number of machine teams
to build will be less than what we want, which prevents us from building
enough server teams.
To avoid this problem, we build machine teams before we call addTeamsBestOf
to build server teams.
We also add an ASSERT to ensure we build enough machine teams and server
teams in the test case.
Signed-off-by: Meng Xu <meng_xu@apple.com>
Improve code efficiency with the following changes:
1) Change an always-true if-statement to an ASSERT;
2) Return when we are confident we will not find more machine teams.
No functionality change.
Signed-off-by: Meng Xu <meng_xu@apple.com>
Due to the randomness in choosing a server, we cannot guarantee finding all
teams. The NotEnoughServers test case may create false-positive bug reports
in the correctness test.
Try addTeamsBestOf() multiple times when we cannot find an available team
due to the pure randomness in choosing the server teams.
The changes to the unit test reduce false positives in the simulation test
results.
Relax the assert condition in the random unit test.
Due to the randomness in choosing the machine team and the server team from
the machine team, it is possible that we may not find the last few (e.g., 1
or 2) available teams. For example, if there are at most 10 teams available
and we have already found 9, the chance of finding the last one is low with
pure random selection.
It is OK not to find every available team because
1) In reality, we only create a small fraction of the available teams, and
2) In a practical system, this situation only happens when most servers are
*temporarily* unhealthy. When it happens, we abandon all existing teams and
restart team building from scratch.
In simulation, this situation happens 100 times out of 128613 test cases
when we run RandomUnitTests.txt only.
Signed-off-by: Meng Xu <meng_xu@apple.com>
Calculate the number of machine teams in the same way as we calculate the
number of server teams.
Only count the machine teams that have the correct size and are healthy.
Simplify code by removing an unnecessary check.
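A sketch of the counting rule with illustrative stand-in types:

    #include <cstddef>
    #include <vector>

    struct MachineTeamInfo {
        size_t machineCount; // machines in this team
        bool   healthy;      // all member machines currently healthy
    };

    // Count only machine teams of the configured size whose machines are all
    // healthy, mirroring how healthy server teams are counted.
    size_t countHealthyMachineTeams(const std::vector<MachineTeamInfo>& teams,
                                    size_t configuredTeamSize) {
        size_t n = 0;
        for (const auto& t : teams)
            if (t.machineCount == configuredTeamSize && t.healthy) ++n;
        return n;
    }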
Signed-off-by: Meng Xu <meng_xu@apple.com>
Pick the server team purely randomly instead of picking the least used one.
This avoids creating correlation in the server teams we pick when new
machines are added.
The logic (sketched below) is:
First pick one random least-used server as the chosen server;
then pick a machine team that contains that server;
then pick a server on each machine in the machine team,
making sure the chosen server is picked.
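A simplified sketch of that selection, using stand-in types and random
choices where the text leaves the choice open:

    #include <cstdlib>
    #include <string>
    #include <vector>

    struct Machine     { std::string id; std::vector<std::string> servers; };
    struct MachineTeam { std::vector<Machine> machines; };

    // The chosen (least used) server is fixed by the caller; a machine team
    // containing it is picked, and one server is picked at random from every
    // other machine in that team.
    std::vector<std::string> pickServerTeam(const std::string& chosenServer,
                                            const std::vector<MachineTeam>& teamsWithChosenServer) {
        std::vector<std::string> serverTeam;
        if (teamsWithChosenServer.empty()) return serverTeam;
        const MachineTeam& mt =
            teamsWithChosenServer[std::rand() % teamsWithChosenServer.size()];
        for (const Machine& m : mt.machines) {
            if (m.servers.empty()) continue;
            bool hasChosen = false;
            for (const auto& s : m.servers)
                if (s == chosenServer) hasChosen = true;
            // Always keep the chosen server; otherwise pick purely at random.
            serverTeam.push_back(hasChosen ? chosenServer
                                           : m.servers[std::rand() % m.servers.size()]);
        }
        return serverTeam;
    }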
Signed-off-by: Meng Xu <meng_xu@apple.com>
Before we build server teams, we build the desired number of machine teams.
Then we pick the least used server, from which we pick the least used
machine team.
Then we pick the least used server on each machine in that machine team to
get the server team.
Note: the logic for building machine teams should be independent of server
teams.
Signed-off-by: Meng Xu <meng_xu@apple.com>
When handling GetTeam, the data distribution actor may have zero teams in
rare situations in the ConfigureTest.txt test.
We should return an empty team in this situation instead of triggering an
error.
Signed-off-by: Meng Xu <meng_xu@apple.com>
Resolve code review comments:
1) Improve code efficiency by avoiding unnecessary map searches and
unnecessary checks
2) Remove or comment out trace events when they can be spammy
3) Improve coding style
Tested for 1 hour and no error was found.
The KillRegionCycle.txt test was excluded because the existing code cannot
pass that test either.
Signed-off-by: Meng Xu <meng_xu@apple.com>
The current server team collection logic does not consider the fact that
multiple storage servers can run on the same machine.
When multiple machines fail, all servers on those machines fail, and the
possibility of having one process team fail and lose data is very high.
To reduce the possibility of losing data when multiple machines fail, we
first create machine teams that span different fault zones; we then create
server teams based on machine teams by first picking 1 machine team, and
then picking 1 server from each machine in the machine team.
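A sketch of the machine-team invariant with stand-in types; the real code
selects machines via the locality policy, and this only checks the resulting
property:

    #include <cstddef>
    #include <set>
    #include <string>
    #include <vector>

    struct MachineInfo { std::string machineId; std::string faultZone; };

    // A machine team of the requested size must draw every machine from a
    // different fault zone (e.g., a different rack).
    bool isValidMachineTeam(const std::vector<MachineInfo>& team, size_t teamSize) {
        if (team.size() != teamSize) return false;
        std::set<std::string> zones;
        for (const auto& m : team) zones.insert(m.faultZone);
        return zones.size() == team.size(); // all fault zones distinct
    }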
Signed-off-by: Meng Xu <meng_xu@apple.com>
fix: data distribution would not stop tracking bad teams after all their data was moved to other teams
fix: data distribution did not properly handle a server changing locality such that the teams it used to be on no longer satisfy the policy
Remove the use of relative paths. A header at foo/bar.h could be included by
files under foo/ with "bar.h", but would be included everywhere else as
"foo/bar.h". Adjust so that every include references such a header with the
latter form.
Signed-off-by: Robert Escriva <rescriva@dropbox.com>
optimized teamTracker to check if it satisfies the policy more efficiently
added yields to initialization to avoid slow tasks when adding lots of teams
There's never any reason to save the value of a Void return, and it's
the easiest source of redefined variable bugs that will creep back in
over time. So just `wait(...)`, it's cleaner that way.
This takes advantage of the new actorcompiler functionality to avoid
having duplicate definitions of `Void _` when trying to feed the
un-actorcompiled source through clang.
Self-moves are frowned upon in C++, and in our code they generally come from
calls to swap as part of implementing an "unordered erase" via
swap-to-the-end-and-pop_back. For convenience, a swapAndPop() function is now
offered that performs this while disallowing self-moves.
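A sketch of the idea (not necessarily the exact signature added here):

    #include <cstddef>
    #include <utility>
    #include <vector>

    // Erase element i in O(1) by moving the last element into its slot,
    // avoiding a self-move when i already is the last element.
    template <class T>
    void swapAndPop(std::vector<T>& v, size_t i) {
        if (i + 1 != v.size())
            v[i] = std::move(v.back()); // no self-move when i is the last index
        v.pop_back();
    }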
* Detail names now all start with an uppercase character and contain no underscores. Ideally these should be head-first camel case, though that was harder to check.
* Type names have the same rules, except they allow one underscore (to support a usage pattern Context_Type). The first character after the underscore is also uppercase.
* Use seconds instead of milliseconds in details.
Added a check when events are logged in simulation that logs a message to stderr if the first two rules above aren't followed.
This probably doesn't address every instance of the above problems, but all of the events I was able to hit in simulation pass the check.
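A sketch of the two naming rules as standalone checks (illustrative, not the
actual check wired into simulation):

    #include <cctype>
    #include <cstddef>
    #include <string>

    // Detail names: start with an uppercase letter, no underscores.
    bool isValidDetailName(const std::string& name) {
        if (name.empty() || !std::isupper(static_cast<unsigned char>(name[0]))) return false;
        return name.find('_') == std::string::npos;
    }

    // Event type names: same rule, but one underscore is allowed and the
    // character after it must also be uppercase (e.g., Context_Type).
    bool isValidEventTypeName(const std::string& name) {
        if (name.empty() || !std::isupper(static_cast<unsigned char>(name[0]))) return false;
        std::size_t us = name.find('_');
        if (us == std::string::npos) return true;
        if (name.find('_', us + 1) != std::string::npos) return false; // at most one
        return us + 1 < name.size() &&
               std::isupper(static_cast<unsigned char>(name[us + 1]));
    }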
fix: the master did not monitor for the failure of remote logs
stop merge attempts when a data center is failed
fixed a variety of other problems with data distribution when a data center is failed
added configuration options for the remote data center and satellite data centers
updated cluster controller recruitment logic
refactors how master writes core state
updated log recovery, and log system peeking