Commit Graph

87 Commits

Author SHA1 Message Date
Evan Tschannen 065a45e05f Merge branch 'master' into feature-fix-force-recovery
# Conflicts:
#	fdbclient/ManagementAPI.actor.cpp
#	fdbserver/ClusterController.actor.cpp
#	fdbserver/workloads/KillRegion.actor.cpp
2019-02-18 17:09:06 -08:00
Evan Tschannen d492395f84 fix: simulation could buggify a delay such that data distribution incorrectly thinks the queue is not processing unhealthy relocations 2019-02-18 14:57:07 -08:00
Jingyu Zhou fc3a784963 Fix another build team bug
The buildTeam() can create teams with undesired storage servers, which are
considered unhealthy. As a result, the data movement can become stuck.

Fix this by adding an ACTOR monitorHealthyTeams that builds team every one
second whenever there is no healthy teams.

Clean up storageServerTracker() interface.
2019-02-14 16:37:16 -08:00
Jingyu Zhou 816f8b1ae1 Per review comments
Add a knob for starting distributor delay.
Move distributor failed variable to a local loop.
2019-02-14 16:37:16 -08:00
Jingyu Zhou e0a7162cf8 Add a failure timeout knob for data distributor.
Set default time to 1.0s.
2019-02-14 16:37:16 -08:00
Trevor Clinkenbeard 5b89db811a Throttle status requests with MAX_STATUS_REQUESTS_PER_SECOND knob, whenever status batching is used. 2019-01-28 15:37:30 -08:00
Evan Tschannen 684a22a52b Merge branch 'release-6.0'
# Conflicts:
#	documentation/sphinx/source/release-notes.rst
#	fdbbackup/backup.actor.cpp
#	fdbclient/BackupContainer.actor.cpp
#	fdbclient/HTTP.actor.cpp
#	fdbserver/storageserver.actor.cpp
#	fdbserver/workloads/BackupCorrectness.actor.cpp
#	versions.target
2019-01-09 16:14:46 -08:00
Evan Tschannen 57293a2db0 byte sample recovery did not use limits for its range reads, leading to slow tasks 2019-01-04 10:32:31 -08:00
Markus Pilman dbe9baff1f Several small compilation fixes for new versions of gcc
There are several missing includes for cmath in the code, I added those.

Next, Coro returns a reference to a stack variable and this causes a
warning. As this is probably ok for Coro, I disabled the warning in
that file for GCC. I want to have this warning in the build system as
it is generally a very useful warning to have.

Another change is that major and minor are deprecated for a while now.
I replaced those with gnu_dev_major and gnu_dev_minor.

ErrorOr currently implements operators ==, !=, and <. These do not
compile because Error does not implement ==. This compiles on older
versions of gcc and clang because ErrorOr<T>::operator== is not used
anywhere. It is still wrong though and newer gcc versions complain.
I simply removed these methods.

The most interesting fix is that TraceEvent::~TraceEvent is currently
throwing exceptions. This is illegal behavior in C++11 and a idea in
older versions of C++. For now I simply removed the throw, but this
might need some more thought.
2019-01-03 12:44:19 -08:00
Evan Tschannen 4e54690005 Merge branch 'release-6.0'
# Conflicts:
#	fdbserver/DataDistribution.actor.cpp
#	fdbserver/MoveKeys.actor.cpp
2018-11-12 20:26:58 -08:00
Evan Tschannen 3f3a562f75 updated resolution balancing knobs to be a little more aggressive 2018-11-12 19:11:28 -08:00
Evan Tschannen 3f39024640 buggify resolution balancing so that it still happens in simulation 2018-11-12 00:03:07 -08:00
Evan Tschannen 536ee826da tuned resolver balancing to keep the resolvers within 5MB per second of each other 2018-11-11 23:42:45 -08:00
Robert Escriva 268093a96d Adjust all includes to be relative to the root.
Remove the use of relative paths.  A header at foo/bar.h could be included by
files under foo/ with "bar.h", but would be included everywhere else as
"foo/bar.h".  Adjust so that every include references such a header with the
latter form.

Signed-off-by: Robert Escriva <rescriva@dropbox.com>
2018-10-19 17:35:33 +00:00
Evan Tschannen db71b60d72
Merge pull request #819 from satherton/feature-redwood
Redwood storage engine, initial/experimental version
2018-10-18 18:38:11 -07:00
Evan Tschannen 0613a34845 The storage server would block the main thread when processing a single version with a large amount of data 2018-10-18 13:37:31 -07:00
Stephen Atherton 7c1dc305cb Merge commit 'a72c8f5cb2e79a673abc0ed3d27ef1c51028fb13' into feature-redwood 2018-10-05 10:15:10 -07:00
Evan Tschannen 636420abee fix: if the disk queue adapter peek hangs for a while, switch to a peek from a different locality 2018-10-03 13:58:55 -07:00
Evan Tschannen 28545e0f8d multi cursors start a get more for the first 10 cursors to hide latency 2018-10-03 13:57:45 -07:00
Evan Tschannen a92fc911ac do not spin on a failed storage server recruitment 2018-10-02 17:31:07 -07:00
Stephen Atherton 2fc86c5ff3 Merge branch 'master' of github.com:apple/foundationdb into feature-redwood
# Conflicts:
#	fdbrpc/AsyncFileCached.actor.h
#	fdbserver/IKeyValueStore.h
#	fdbserver/KeyValueStoreMemory.actor.cpp
#	fdbserver/workloads/StatusWorkload.actor.cpp
#	tests/fast/SidebandWithStatus.txt
#	tests/rare/LargeApiCorrectnessStatus.txt
#	tests/slow/DDBalanceAndRemoveStatus.txt
2018-09-20 03:39:55 -07:00
Evan Tschannen 200e65fe61 added a workload which tests killing an entire region, and recovering from the failure with data loss.
fix: we cannot pop the txs tag from remote logs until they have a full copy of the txnStateStore
fix: we have to modify all of history, we cannot stop after finding a local remote
2018-09-17 18:32:39 -07:00
Evan Tschannen 6496a6d9c8 fix: start move keys will only move destination servers to become source servers if less than destination servers are healthy and the total number of sources is less than 2x the number of destinations 2018-08-31 12:43:14 -07:00
Evan Tschannen d7c01f0419 added a separate knob for tlog’s recoverMemoryLimit 2018-08-21 21:11:23 -07:00
Evan Tschannen 7f7755165c slowly send notifications to clients to clear the list of dead clients 2018-08-08 17:29:32 -07:00
Evan Tschannen be1a4d74c7 tlogs serve reads to log routers at a low priority, to prevent them from using all their resources catching up a remote dc that has been down for a long time
increase the amount of memory ratekeeper budgets for tlogs so that there is a gap after the spill threshold to prevent temporarily overshooting the budget
2018-08-04 10:31:30 -07:00
Stephen Atherton 40762d9f9b Merge branch 'master' of github.com:apple/foundationdb into feature-redwood 2018-07-25 17:58:52 -07:00
Evan Tschannen 4fedd05506 added more yields to avoid slow tasks 2018-07-12 17:47:35 -07:00
Evan Tschannen 392c73affb fixed a few slow tasks 2018-07-12 14:06:59 -07:00
Evan Tschannen cd63c7a7cc added a buffered cursor, which efficiently merges lots of peek cursors 2018-07-12 12:09:48 -07:00
Stephen Atherton 96389c74cd Merge branch 'master' of github.com:apple/foundationdb into feature-redwood
# Conflicts:
#	tests/fast/SidebandWithStatus.txt
#	tests/rare/LargeApiCorrectnessStatus.txt
#	tests/slow/DDBalanceAndRemoveStatus.txt
2018-07-10 16:42:34 -07:00
Stephen Atherton 1bc95862b7 Merge branch 'release-6.0' of github.com:apple/foundationdb into feature-redwood
# Conflicts:
#	tests/fast/SidebandWithStatus.txt
#	tests/rare/LargeApiCorrectnessStatus.txt
#	tests/slow/DDBalanceAndRemoveStatus.txt
2018-07-10 04:16:02 -07:00
Evan Tschannen ed7c744a8e fix: we cannot lower max_read_transaction_life_versions too low when configured with remote logs, because it causes the log routers to fall too far behind 2018-07-09 16:55:33 -07:00
Evan Tschannen 5a2cb3037b merge 5.2 into 6.0 2018-07-08 20:14:06 -07:00
Evan Tschannen 4dd18afb84 fix: we cannot make MAX_RECOVERY_VERSIONS lower than MAX_VERSIONS_IN_FLIGHT because we can mark a recovery as stalled before finishing the recovery, leading to an infinite loop of recoveries 2018-07-07 17:41:20 -07:00
Evan Tschannen d6c6e7d306 fix: do not attempt data movement to an unhealthy destination team
allow building more teams than desired if all teams are unhealthy
bestTeamStuck is an error in simulation again
2018-07-07 16:51:16 -07:00
Evan Tschannen cdafd542ee fix: fixed a memory leak where leaderInfo notifications are not cleared out 2018-07-06 17:40:29 -07:00
Stephen Atherton 2925b9b984 Merge branch 'master' of github.com:apple/foundationdb into feature-redwood 2018-07-03 23:03:56 -07:00
Evan Tschannen 7a12d3e130 added the (untested) ability to force a recovery to the remote datacenter, even if that results in data loss. If the DR lag is more than 1 week there could be potential data corruption if any primary storage servers are still alive. 2018-07-01 09:39:04 -04:00
Stephen Atherton b95a2bd6c1 Merge commit 'b17c8359ec22892ed4daeaa569f2f5e105477251' into feature-redwood
# Conflicts:
#	flow/Trace.cpp
2018-06-30 23:18:29 -07:00
Evan Tschannen 1f02bdee0a do not buggify future version delay, because remote storage servers will be delayed getting data so they need additional time 2018-06-29 11:29:22 -07:00
Evan Tschannen 7e68bee692 update better machine classes first to give them a higher chance of becoming the next cluster controller 2018-06-29 01:11:59 -07:00
Evan Tschannen 58c2f67ff6 checking outstanding requests can be CPU intensive, so rate limit checking requests 2018-06-27 23:02:08 -07:00
Evan Tschannen 5fc8199abc Swapped OkayFit and UnsetFit, because generally if machine classes are set on one machine they are set everywhere and it helps with wait_for_good_recruitment logic
wait_for_good_recruitment now requires that you have the desired count of each roll
remote recruitment is given a much longer wait_for_good_recruitment time interval, which does not start until enough remote machines have registered
2018-06-22 10:15:24 -07:00
Evan Tschannen 678b4494f4 added logging for the datacenter version difference 2018-06-21 16:31:52 -07:00
Stephen Atherton e5c48d453a Merge branch 'master' of github.com:apple/foundationdb into feature-redwood 2018-06-18 22:45:27 -07:00
Evan Tschannen b79feaddd3 added a hard memory limit to the TLog to prevent it from running out of memory. Because remote logs are not ratekeeper controlled this is their only protection 2018-06-18 17:22:40 -07:00
Stephen Atherton 90c8288c68 Merge branch 'master' of github.com:apple/foundationdb into feature-redwood 2018-06-17 14:55:05 -07:00
Evan Tschannen 889889323e The master will tell the cluster controller if it is going to take a long time to recruit new logs in its DC; the cluster controller can determine if the other DC would be better and recruit there.
The cluster controller will not switch to the other data center if remote logs are too far behind.
We will not recruit in DCs with negative priority.
2018-06-13 18:14:14 -07:00
Stephen Atherton 2878f30f29 Merge branch 'master' of github.com:apple/foundationdb into feature-redwood
# Conflicts:
#	fdbserver/IKeyValueStore.h
#	fdbserver/KeyValueStoreMemory.actor.cpp
#	fdbserver/storageserver.actor.cpp
2018-06-13 15:56:06 -07:00