Meng Xu
54a4d6b308
TeamCollection: Improve code efficiency
...
Improve code efficiency with the following changes:
1) Change always-true if-statement to ASSERT;
2) Return when we are confident we will not find more machine teams.
No functionality change.
Signed-off-by: Meng Xu <meng_xu@apple.com>
2018-12-01 17:10:50 -08:00
Meng Xu
8d6c6e000b
DataDistribution: Mute the NotEnoughServers test
...
Due to the randomness in choosing a server, we cannot guarantee that we
find all teams. The NotEnoughServers test case may create false-positive
bug reports in the correctness test.
Signed-off-by: Meng Xu <meng_xu@apple.com>
2018-12-01 13:29:45 -08:00
Meng Xu
68dcec2240
DataDistribution: Change a unit test
...
Call addTeamsBestOf() multiple times when we cannot find an available team
due to the pure randomness in choosing the server teams.
The changes to the unit test reduce the false positives in the simulation test results.
Signed-off-by: Meng Xu <meng_xu@apple.com>
2018-12-01 13:12:55 -08:00
Meng Xu
a43f579f66
TeamCollection: Change 1 unit test
...
Relax the assert condition on the random unit test.
Due to the randomness in choosing the machine team and
the server team from the machine team, it is possible that
we may not find the last few (e.g., 1 or 2) available teams.
For example, if there are at most 10 teams available and we have found
9 of them, the chance of finding the last one is low
when we do pure random selection.
It is ok to not find every available team because
1) In reality, we only create a small fraction of available teams, and
2) In a practical system, this situation only happens when most of the servers
are *temporarily* unhealthy. When this situation happens, we will
abandon all existing teams and restart team building from scratch.
In the simulation test, the situation happens 100 times out of 128613 test cases
when we run RandomUnitTests.txt only.
Signed-off-by: Meng Xu <meng_xu@apple.com>
2018-12-01 13:11:19 -08:00
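The "9 out of 10 teams found" example in the commit above can be made concrete with a small calculation. This is only a sketch under a simplifying assumption that is not stated in the commit: each attempt draws one candidate team uniformly and independently from all possible teams. The function name probMissLastTeam is hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Probability that `attempts` independent, uniform draws over `totalTeams`
// candidate teams never hit the one specific team that is still missing.
// Assumes uniform sampling, which simplifies the real selection logic.
double probMissLastTeam(int totalTeams, int attempts) {
    assert(totalTeams > 0 && attempts >= 0);
    return std::pow(1.0 - 1.0 / totalTeams, attempts);
}
```

Under this model, with 10 possible teams, ten attempts still miss the last team about 35% of the time, which is consistent with relaxing the assert rather than requiring every team to be found.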
Meng Xu
f311455c45
TeamCollection: Cleanup code and add checks
...
Remove unnecessary sanity checks and remove the dead code.
Add some necessary sanity checks.
Signed-off-by: Meng Xu <meng_xu@apple.com>
2018-11-30 17:40:21 -08:00
Meng Xu
ea3bd1502d
TeamCollection: Calculate machine team number
...
Calculate the number of machine teams in the same way
as we calculate the number of server teams.
Only count the machine teams that have the correct size and are healthy.
Simplify code by removing unnecessary check.
Signed-off-by: Meng Xu <meng_xu@apple.com>
2018-11-29 15:38:23 -08:00
Meng Xu
2b41ad5e57
TeamCollection: Pick server team randomly
...
Pick server team purely randomly instead of picking the least used one.
This is to avoid creating correlation in the server teams we pick when
new machines are added.
The logic is:
First pick one random least used server as the chosen server;
Then pick a machine team that has the server;
Then pick a server on each machine in the machine team.
We make sure the chosen server is picked.
Signed-off-by: Meng Xu <meng_xu@apple.com>
2018-11-28 15:57:53 -08:00
Meng Xu
e4c9d4cbae
TeamCollection: Build all machine teams first
...
Before we build server teams, we build the desired number of machine teams.
Then we pick the least used server, from which we pick the least used machine team.
Then we pick the least used server on each machine in the least used machine team to get the server team.
Note: The logic of building machine teams should be independent from server teams.
Signed-off-by: Meng Xu <meng_xu@apple.com>
2018-11-27 18:06:36 -08:00
Meng Xu
4c2c65c1b3
TeamCollection: Replace TraceEvent with ASSERT
...
Replace one TraceEvent that never happens in correctness test with an ASSERT.
Change format in one comment.
Signed-off-by: Meng xu <meng_xu@apple.com>
2018-11-27 09:48:24 -08:00
Meng Xu
5cbff740ca
TeamCollection: Add ASSERT
...
Remove sanity check code for performance benefit.
Replace TraceEvent(SevError) with ASSERT.
Signed-off-by: Meng Xu <meng_xu@apple.com>
2018-11-21 13:16:52 -08:00
Meng Xu
8de031f9a6
TeamCollection: clang-format
...
Format the changes with git clang-format.
No functional changes.
Signed-off-by: Meng Xu <meng_xu@apple.com>
2018-11-21 11:18:26 -08:00
Meng Xu
12c3bec968
TeamCollection: Misc changes to resolve review comments
...
No functional change.
Report error in TraceEvent when invariant is violated.
Signed-off-by: Meng Xu <meng_xu@apple.com>
2018-11-19 20:44:52 -08:00
Meng Xu
52c6a66601
TeamCollection: Fix a bug introduced in code review
...
When we GetTeam, the data distribution actor may have zero teams in
rare situations in the ConfigureTest.txt test.
We should return an empty team in this situation instead of triggering an error.
Signed-off-by: Meng Xu <meng_xu@apple.com>
2018-11-16 16:34:38 -08:00
Meng Xu
f7a7e069f0
TeamCollection: Remove unnecessary comments
...
Pass 41806 tests with no failure
Signed-off-by: Meng Xu <meng_xu@apple.com>
2018-11-16 15:56:35 -08:00
Meng Xu
73c58852f0
TeamCollection: Resolve code review comments
...
Resolve code review comments:
1) Improve the code efficiency by avoiding unnecessary map search
and avoiding unnecessary checking
2) Remove or comment out trace events when they can be spammy
3) Improve coding style
Tested for 1 hour and no error was found.
KillRegionCycle.txt test was excluded from the test because
existing code cannot pass that test either
Signed-off-by: Meng Xu <meng_xu@apple.com>
2018-11-16 15:55:33 -08:00
Meng Xu
5051b35c61
TeamCollection: Use machine team to create server team
...
The current server team collection logic does not consider
the fact that multiple storage servers can run on the same machine.
When multiple machines fail, all servers on the machines will fail, and
the possibility of having one process team fail and lose data is very high.
To reduce the possibility of losing data when multiple machines fail,
we first create machine teams which span across different fault zones;
we then create server teams based on machine teams by
first picking 1 machine team, and then
picking 1 server from each machine in the machine team.
Signed-off-by: Meng Xu <meng_xu@apple.com>
2018-11-16 15:53:22 -08:00
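The two-step selection described in the commit above (pick 1 machine team, then 1 server from each machine in it) can be sketched as follows. This is a minimal illustration, not the actual TeamCollection code: the Machine, MachineTeam, and ServerTeam types and the buildServerTeam function are hypothetical, and the uniform random choice is an assumption made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <string>
#include <vector>

// Hypothetical, simplified types; not the real TeamCollection classes.
struct Machine {
    std::string zoneId;                 // fault zone the machine lives in
    std::vector<std::string> serverIds; // storage servers running on this machine
};

using MachineTeam = std::vector<const Machine*>; // machines from different fault zones
using ServerTeam = std::vector<std::string>;     // one server per machine

// Pick one machine team at random, then one server from each of its machines.
ServerTeam buildServerTeam(const std::vector<MachineTeam>& machineTeams, std::mt19937& rng) {
    assert(!machineTeams.empty());
    std::uniform_int_distribution<std::size_t> pickTeam(0, machineTeams.size() - 1);
    const MachineTeam& machineTeam = machineTeams[pickTeam(rng)];

    ServerTeam serverTeam;
    for (const Machine* machine : machineTeam) {
        assert(!machine->serverIds.empty());
        std::uniform_int_distribution<std::size_t> pickServer(0, machine->serverIds.size() - 1);
        serverTeam.push_back(machine->serverIds[pickServer(rng)]);
    }
    return serverTeam;
}
```

Because every server team is drawn from a machine team, no two servers in a team share a machine, so a single machine failure cannot take out more than one replica of that team.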
Evan Tschannen
4e54690005
Merge branch 'release-6.0'
...
# Conflicts:
# fdbserver/DataDistribution.actor.cpp
# fdbserver/MoveKeys.actor.cpp
2018-11-12 20:26:58 -08:00
Evan Tschannen
26c49f21be
fix: we do not know a region is fully replicated until all the initial storage servers have either been heard from or have been removed
2018-11-12 17:39:40 -08:00
Evan Tschannen
cd188a351e
fix: if a destination team became unhealthy and then healthy again, it would lower the priority of a move even though the source servers we are moving from are still unhealthy
...
fix: badTeams were not accounted for when checking priorities
2018-11-11 12:33:31 -08:00
Evan Tschannen
4b5d0b4e2c
Merge branch 'release-6.0'
...
# Conflicts:
# documentation/sphinx/source/release-notes.rst
# fdbclient/AsyncFileBlobStore.actor.cpp
# fdbclient/AsyncFileBlobStore.actor.h
# fdbclient/BlobStore.actor.cpp
# fdbclient/BlobStore.h
# fdbclient/HTTP.actor.cpp
# fdbclient/ManagementAPI.actor.cpp
# fdbclient/NativeAPI.actor.cpp
# fdbrpc/LoadBalance.actor.h
# fdbrpc/batcher.actor.h
# fdbrpc/fdbrpc.vcxproj
# fdbrpc/sim2.actor.cpp
# fdbserver/DataDistribution.actor.cpp
# fdbserver/DataDistributionTracker.actor.cpp
# fdbserver/SimulatedCluster.actor.cpp
# fdbserver/TLogServer.actor.cpp
# fdbserver/masterserver.actor.cpp
2018-11-10 13:04:24 -08:00
Evan Tschannen
7c23b68501
fix: we need to build teams if a server becomes healthy and it is not already on any teams
2018-11-09 18:06:00 -08:00
Evan Tschannen
3e2484baf7
fix: a team tracker could downgrade the priority of a relocation issued by the team tracker for the other region
2018-11-09 10:07:55 -08:00
Evan Tschannen
19ae063b66
fix: storage servers need to be rebooted when increasing replication so that clients become aware that new options are available
2018-11-08 15:44:03 -08:00
Evan Tschannen
599cc6260e
fix: data distribution would not always add all subsets of emergency teams
...
fix: data distribution would not stop tracking bad teams after all their data was moved to other teams
fix: data distribution did not properly handle a server changing locality such that the teams it used to be on no longer satisfy the policy
2018-11-07 21:05:31 -08:00
Evan Tschannen
87d0b4c294
fix: the remote region does not have a full replica if usable_regions==1
2018-11-04 22:05:37 -08:00
Evan Tschannen
ad98acf795
fix: if the team started unhealthy and initialFailureReactionDelay was ready, we would not send relocations to the queue
...
print wrong shard size team messages in simulation
2018-11-02 13:00:15 -07:00
Evan Tschannen
1d591acd0a
removed the countHealthyTeams check, because it was incorrect if it triggered during the wait(yield()) at the top of team tracker
2018-11-02 12:58:16 -07:00
Robert Escriva
268093a96d
Adjust all includes to be relative to the root.
...
Remove the use of relative paths. A header at foo/bar.h could be included by
files under foo/ with "bar.h", but would be included everywhere else as
"foo/bar.h". Adjust so that every include references such a header with the
latter form.
Signed-off-by: Robert Escriva <rescriva@dropbox.com>
2018-10-19 17:35:33 +00:00
Evan Tschannen
db71b60d72
Merge pull request #819 from satherton/feature-redwood
...
Redwood storage engine, initial/experimental version
2018-10-18 18:38:11 -07:00
Evan Tschannen
ed7036139a
Merge branch 'release-6.0'
...
# Conflicts:
# documentation/sphinx/source/release-notes.rst
# fdbserver/DataDistribution.actor.cpp
# fdbserver/storageserver.actor.cpp
2018-10-18 17:00:52 -07:00
Evan Tschannen
e36b7cd417
Only log teamTracker trace events if sizes are not wrong, to avoid spammy messages when dropping a fearless configuration
...
wrongSize previous was unneeded
2018-10-17 11:45:47 -07:00
Stephen Atherton
22f8a4efa9
Normalized all unit test names to begin with "/" if they should be included in random unit testing.
2018-10-05 22:09:58 -07:00
Evan Tschannen
3922e477a5
Merge branch 'release-6.0'
...
# Conflicts:
# documentation/sphinx/source/release-notes.rst
# fdbclient/ManagementAPI.actor.cpp
# fdbserver/ClusterController.actor.cpp
# fdbserver/DataDistribution.actor.cpp
# fdbserver/LogSystemDiskQueueAdapter.actor.cpp
# fdbserver/SimulatedCluster.actor.cpp
# fdbserver/TLogServer.actor.cpp
2018-10-03 16:57:18 -07:00
Evan Tschannen
a92fc911ac
do not spin on a failed storage server recruitment
2018-10-02 17:31:07 -07:00
Evan Tschannen
e64c55dce0
fix: data distribution would sometimes use the wrong priority when fixing an incomplete movement; this led to the cluster thinking the data was replicated in all regions before it actually was
2018-09-28 12:15:23 -07:00
A.J. Beamon
92990d6aef
Merge release-6.0 into master
2018-09-21 16:14:39 -07:00
Evan Tschannen
861c8aa675
consider server health when building subsets of emergency teams
2018-09-19 17:57:01 -07:00
Evan Tschannen
702d018882
fix: we cannot use count on an async map, because someone waiting onChange for an item will cause it to exist in the map before it is set
2018-09-19 16:11:57 -07:00
Evan Tschannen
6d18193b3a
fix: team->setHealthy was not being called correctly on initially unhealthy teams
2018-09-19 14:48:07 -07:00
A.J. Beamon
4d39509034
Merge pull request #774 from apple/release-6.0
...
Merge release-6.0 into master
2018-09-18 15:01:26 -07:00
Balachandar Namasivayam
d622cb1f6e
When the cluster is configured from a fearless setup to usable_regions=1, the master goes into a loop changing team priority. Fix this issue.
2018-09-12 18:29:49 -07:00
Evan Tschannen
90301f497f
Merge branch 'release-6.0'
...
# Conflicts:
# fdbclient/ManagementAPI.actor.cpp
# fdbrpc/FlowTransport.actor.cpp
# fdbrpc/TLSConnection.actor.cpp
# fdbserver/DataDistribution.actor.cpp
# fdbserver/Status.actor.cpp
# fdbserver/storageserver.actor.cpp
# fdbserver/workloads/StatusWorkload.actor.cpp
# versions.target
2018-09-05 16:06:33 -07:00
Evan Tschannen
d9906d7d6a
code cleanup
2018-09-05 13:42:10 -07:00
Evan Tschannen
4eaff42e4f
Merge pull request #712 from ajbeamon/remove-database-name-internal
...
Eliminate use of database names (phase 1)
2018-09-05 10:35:00 -07:00
Evan Tschannen
65eabedb6c
fix: addSubsetOfEmergencyTeams could add unhealthy teams
...
optimized teamTracker to check if it satisfies the policy more efficiently
added yields to initialization to avoid slow tasks when adding lots of teams
2018-08-31 17:54:55 -07:00
Evan Tschannen
72c86e909e
fix: tracking of the number of unhealthy servers was incorrect
...
fix: locality equality was only checking zoneId
2018-08-31 17:40:27 -07:00
Evan Tschannen
717c43a69f
merge 6.0 into master
2018-08-22 00:28:04 -07:00
Evan Tschannen
a694364a39
fix: teams larger than the storageTeamSize can never become healthy, so we do not need to track them in our data structures. After configuring from usable_regions=2 to usable_regions=1 we will have a lot of these types of teams, leading to performance issues
2018-08-21 21:08:15 -07:00
A.J. Beamon
2a97139d5d
This is the first step in eliminating the usage of database names in our code. The C API remains the same, but underneath that all usage of database names is eliminated.
2018-08-16 10:24:12 -07:00
Alex Miller
c9a07937a2
Fix the last `Void _ = wait(...)` before merging.
2018-08-14 16:00:31 -07:00
Alex Miller
63b1e85338
Ban `Void _ = wait(...)` constructions, and require just `wait(...)`.
...
There's never any reason to save the value of a Void return, and it's
the easiest source of redefined variable bugs that will creep back in
over time. So just `wait(...)`, it's cleaner that way.
2018-08-14 15:50:26 -07:00
Alex Miller
fb31a6999f
Rewrite all files to have #include actorcompiler.h as the last include.
2018-08-14 15:50:26 -07:00
Alex Miller
535b5701e5
Rewrite all `Void _ = wait(...)` -> `wait(...)`.
...
This takes advantage of the new actorcompiler functionality to avoid
having duplicate definitions of `Void _` when trying to feed the
un-actorcompiled source through clang.
2018-08-14 15:50:26 -07:00
Evan Tschannen
cdcf056aef
Merge branch 'release-6.0'
2018-08-14 09:43:51 -07:00
Evan Tschannen
883050d12f
moved the creation of the yieldPromiseStream to properly yield moves from initialDataDistribution
2018-08-13 22:29:55 -07:00
Evan Tschannen
2341e5d8ad
fix: we must yield when updating shardsAffectedByTeamFailure with the initial shards. A test with 1 million shards caused a 22 second slow task
2018-08-13 19:46:47 -07:00
A.J. Beamon
574c5576a2
Merge branch 'release-6.0' of github.com:apple/foundationdb
...
# Conflicts:
# fdbrpc/TLSConnection.actor.cpp
# versions.target
2018-08-10 14:31:58 -07:00
Evan Tschannen
9c918a28f6
fix: status was reporting no replicas remaining when the remote datacenter was initially configured with usable_regions=2
2018-08-09 13:16:09 -07:00
Evan Tschannen
6f02ea843a
prevented a slow task when too many shards were sent to the data distribution queue after switching to a fearless deployment
2018-08-09 12:37:46 -07:00
Alex Miller
1a7cda4149
Stop performing self-moves. (e.g. a = std::move(a))
...
self-moves are frowned upon in C++, and in our code this generally happens from
calls to swap as part of trying to implement an "unordered erase" function via
swap-to-the-end-and-pop_back. For convenience, a swapAndPop() function is now
offered that performs this, while disallowing self-moves.
2018-08-01 18:09:54 -07:00
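The swap-to-the-end-and-pop_back idiom from the commit above can be sketched like this. The commit does not show the signature of swapAndPop(), so the name swapAndPopSketch and its parameters are hypothetical; the sketch only illustrates how an unordered erase can avoid a self-move when the erased element is already the last one.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical sketch of an unordered erase; the real swapAndPop() may differ.
// Removes vec[index] in O(1) without preserving the order of the elements.
template <class T>
void swapAndPopSketch(std::vector<T>& vec, std::size_t index) {
    assert(index < vec.size());
    // Only move when the element is not already last, so we never self-move.
    if (index != vec.size() - 1) {
        vec[index] = std::move(vec.back());
    }
    vec.pop_back();
}
```

The guard on the index is the whole point: without it, erasing the last element would execute the equivalent of a = std::move(a).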
Evan Tschannen
392c73affb
fixed a few slow tasks
2018-07-12 14:06:59 -07:00
Evan Tschannen
d12dac60ec
fix: the same team was being added multiple times to primaryTeams
2018-07-12 12:10:18 -07:00
Evan Tschannen
9edbb8d6dd
fix: do not consider a storage server failed until the full failure reaction time has elapsed. This was being short-circuited when the endpoint was permanently failed (the storage server has been rebooted)
2018-07-11 15:45:32 -07:00
Evan Tschannen
380b2895f7
fix: we need to wait for the yield in the team tracker not just after the initial failure reaction delay, but also after zeroHealthyTeams changes
2018-07-08 17:44:19 -07:00
Evan Tschannen
d6c6e7d306
fix: do not attempt data movement to an unhealthy destination team
...
allow building more teams than desired if all teams are unhealthy
bestTeamStuck is an error in simulation again
2018-07-07 16:51:16 -07:00
Stephen Atherton
5a84b5e1ef
Renamed ShardInfo to avoid a name conflict which sometimes causes the wrong destructor to be used at link time.
2018-06-30 18:44:46 -07:00
Evan Tschannen
0bdd25df23
ratekeeper does not control on remote storage servers
2018-06-18 17:23:55 -07:00
Evan Tschannen
1ccfb3a0f4
fix: log_anti_quorum was always 0 in simulation
...
removed durableStorageQuorum, because it is no longer a useful configuration parameter
2018-06-18 10:24:57 -07:00
Evan Tschannen
0913368651
added usable_regions to specify if we will replicate into a remote region
...
remote replication defaults to the primary replication
removed remote_logs, because they should be specified as an override in the regions object
2018-06-17 19:31:15 -07:00
Evan Tschannen
e28769b98e
fixed trace event name
2018-06-11 12:43:08 -07:00
Evan Tschannen
372ed67497
Merge branch 'master' into feature-remote-logs
...
# Conflicts:
# fdbserver/DataDistribution.actor.cpp
# fdbserver/MasterProxyServer.actor.cpp
# fdbserver/TLogServer.actor.cpp
# fdbserver/TagPartitionedLogSystem.actor.cpp
2018-06-11 11:34:10 -07:00
Evan Tschannen
134b5d6f65
fix: only consider data distribution started when remote has recovered so quiet database works correctly
2018-06-10 20:25:15 -07:00
Evan Tschannen
4903df5ce9
fix: give time to detect failed servers before building teams
2018-06-10 20:21:39 -07:00
Evan Tschannen
6e48d93d39
backed out the healthy team check because it was unnecessary
2018-06-10 12:43:32 -07:00
A.J. Beamon
e5488419cc
Attempt to normalize trace events:
...
* Detail names now all start with an uppercase character and contain no underscores. Ideally these should be head-first camel case, though that was harder to check.
* Type names have the same rules, except they allow one underscore (to support a usage pattern Context_Type). The first character after the underscore is also uppercase.
* Use seconds instead of milliseconds in details.
Added a check when events are logged in simulation that logs a message to stderr if the first two rules above aren't followed.
This probably doesn't address every instance of the above problems, but all of the events I was able to hit in simulation pass the check.
2018-06-08 11:11:08 -07:00
Evan Tschannen
e4d5817679
fix: we must serve getTeam requests before readyToStart is set because we cannot complete relocateShard requests without getTeam responses from both team collections
2018-06-07 16:14:40 -07:00
Evan Tschannen
9f0c16f062
do not build teams which contain failed servers
2018-06-07 14:05:53 -07:00
Evan Tschannen
b423d73b42
fix: do not finish a shard relocation until all of the storage servers have made the current recovery version durable. This is to prevent dropping a needed storage server as a source for a shard after dropping a remote configuration
2018-06-07 12:29:25 -07:00
Evan Tschannen
be06938d9d
fix: dropping the remote replication will cause all remote storage servers to die. Make sure we are not restoring redundancy before doing this to prevent data loss in simulation.
2018-06-04 18:46:09 -07:00
Evan Tschannen
6cf9508aae
finished a comment
2018-06-03 19:38:51 -07:00
Evan Tschannen
b1935f1738
fix: do not allow a storage server to be removed within 5 million versions of it being added, because if a storage server is added and removed within the known committed version and recovery version, the storage server will need to see either the add or remove when it peeks
2018-05-05 18:16:28 -07:00
Evan Tschannen
440e2ae609
fix: data distribution logic was incorrect for finding a complete source team in a failed DC
2018-05-01 23:08:31 -07:00
Evan Tschannen
10d25927cd
Merge branch 'master' into feature-remote-logs
...
# Conflicts:
# fdbserver/DataDistribution.actor.cpp
2018-04-30 22:15:39 -07:00
Evan Tschannen
9cdabfed0e
added useful trace events
2018-04-29 18:54:47 -07:00
Evan Tschannen
73597f190e
fix: new tlogs are initialized with exactly the tags which existed at the recovery version
2018-04-22 20:28:01 -07:00
Bruce Mitchener
9cdf25eda3
Fix some typos.
2018-04-20 00:49:22 +07:00
Evan Tschannen
a8662f8737
fix: remote recovered does not need to wait for old logs to be removed
2018-04-16 10:14:39 -07:00
Evan Tschannen
e53f17a83a
fix: the newest log router needs to start where the last old one ends
2018-04-15 14:54:22 -07:00
Evan Tschannen
0496bee1ef
fix: suppress expected errors in data distribution
2018-04-15 11:30:22 -07:00
Evan Tschannen
7af892f50b
first working version of non-copying recovery working with fearless configurations
2018-04-08 21:24:05 -07:00
Evan Tschannen
579ba58930
pop old tags only looks at recovered tags, and checks if they are still being used
2018-03-30 19:08:01 -07:00
Evan Tschannen
82fb6424ec
fix: storage recruitment could get stuck in a spin loop
2018-03-15 11:00:44 -07:00
Evan Tschannen
3abf4d7fdf
Merge branch 'master' into feature-remote-logs
2018-03-09 14:50:04 -08:00
Evan Tschannen
91bb8faa45
Merge commit 'f773b9460d31d31b7d421860fc647936f31aa1fa'
...
# Conflicts:
# tests/fast/SidebandWithStatus.txt
# tests/rare/LargeApiCorrectnessStatus.txt
# tests/slow/DDBalanceAndRemoveStatus.txt
2018-03-09 14:47:03 -08:00
Evan Tschannen
cf6dd1437b
suppress spammy trace events
2018-03-09 10:16:34 -08:00
Evan Tschannen
5390af8be4
suppress spammy logs
2018-03-09 09:40:36 -08:00
Evan Tschannen
fa7eaea7cf
fix: shards affected by team failure did not properly handle separate teams for the remote and primary data centers
2018-03-08 10:50:05 -08:00
Evan Tschannen
470f5c01f3
changed remoteDcId to a vector of ids, to support future configurations where there are multiple remote databases
2018-02-26 17:09:09 -08:00
Evan Tschannen
37a6a81634
Merge commit '7f6fc3e039c911cd84b8540f7f799fc38a1c1822' into feature-remote-logs
...
# Conflicts:
# fdbserver/workloads/RestartRecovery.actor.cpp
2018-02-23 12:33:28 -08:00
Alec Grieser
e1162e9238
Merge remote-tracking branch 'upstream/release-5.1'
2018-02-22 11:16:12 -08:00
Alec Grieser
0bae9880f1
remove trailing whitespace from our copyright headers; fixed formatting of python setup.py
2018-02-21 10:25:11 -08:00
Evan Tschannen
dc93759e15
suppressed trace events that are spammy
2018-02-16 16:01:19 -08:00
Evan Tschannen
d2b0c07558
storage servers continue to attempt to pop old tags after the log system updates
2018-02-13 18:34:13 -08:00
Evan Tschannen
1fedcba890
fix: do not use log router tags when configured without remote logs
...
fix: data distribution tracks undesired storage servers
re-enabled consistency check
2018-02-13 17:01:34 -08:00
Evan Tschannen
63a9f2aed6
fix: history tags were being incorrectly popped
...
fix: history tags were not cleared when a storage server was removed
2018-02-03 12:20:18 -08:00
Evan Tschannen
ebd94bb654
removed a separately configurable storage team size for the remote data center, because it did not make sense
...
fix: the master did not monitor for the failure of remote logs
stop merge attempts when a data center is failed
fixed a variety of other problems with data distribution when a data center is failed
2018-02-02 11:46:04 -08:00
Evan Tschannen
b48d8ce96d
getTeam will return an unhealthy exact match if all teams are unhealthy. Resubmit relocation requests once healthy teams are available
2018-01-30 17:00:51 -08:00
Evan Tschannen
29c5d4ad3d
upgrades from 5.X mostly supported, still some remaining correctness problems
2018-01-28 11:52:54 -08:00
Evan Tschannen
b5eba4f13a
fix: do not check for desired data centers if they have not been set
2018-01-20 10:28:59 -08:00
Evan Tschannen
2e46ee3dba
fix: getTeam works when there are no teams
2018-01-17 17:49:13 -08:00
Evan Tschannen
264dc44dfa
fixed many more bugs associated with running without remote logs
2018-01-17 17:03:17 -08:00
Evan Tschannen
21482a45e1
Merge branch 'master' into feature-remote-logs
...
# Conflicts:
# fdbserver/DBCoreState.h
# fdbserver/LogSystem.h
# fdbserver/LogSystemPeekCursor.actor.cpp
# fdbserver/TLogServer.actor.cpp
2018-01-14 13:40:24 -08:00
Evan Tschannen
3915d6825c
we need to check the server list at a higher priority, because if we do not notice a storage server interface change for a long period of time, we will mark it as failed
2018-01-12 12:51:07 -08:00
Evan Tschannen
9630deba3a
fixed a number of bugs related to running fearless without remote logs
2018-01-08 12:04:19 -08:00
Evan Tschannen
3d2103075d
data distribution tracks teams for each data center separately
2017-10-10 10:36:33 -07:00
Evan Tschannen
76e7988663
Merge branch 'master' into feature-remote-logs
...
# Conflicts:
# fdbserver/ClusterController.actor.cpp
# fdbserver/DataDistribution.actor.cpp
# fdbserver/OldTLogServer.actor.cpp
# fdbserver/TLogServer.actor.cpp
# fdbserver/WorkerInterface.h
# flow/Net2.actor.cpp
2017-09-11 15:15:56 -07:00
Evan Tschannen
ea26bc1c43
passed first tests which kill entire datacenters
...
added configuration options for the remote data center and satellite data centers
updated cluster controller recruitment logic
refactors how master writes core state
updated log recovery, and log system peeking
2017-09-07 15:32:08 -07:00
Yichi Chiang
bd1c7e7295
Use addTeamsBestOf() instead of addAllTeams() when team size is greater than 3
2017-09-07 12:31:01 -07:00
Evan Tschannen
c22708b6d6
added tag localities
...
fix: remote logs need to stop the master when they are stopped
2017-08-03 16:16:36 -07:00
Yichi Chiang
6a8a5c41b0
Add a switch to turn off data distribution in CLI
2017-07-28 18:14:55 -07:00
Alec Grieser
f75b6f333b
Merge branch 'release-5.0'
2017-07-13 11:21:18 -07:00
Evan Tschannen
aa1c903b52
fix: do not log that data distribution is initialized until readyToStart is ready
2017-06-30 16:21:59 -07:00
Evan Tschannen
9fd5955e92
Merge branch 'master' into removing-old-dc-code
2017-06-26 16:27:10 -07:00
FDB Dev Team
a674cb4ef4
Initial repository commit
2017-05-25 13:48:44 -07:00