48 Commits

Author SHA1 Message Date
Meng Xu e676348710
Merge pull request #1955 from fzhjon/mark-ss-failed
Add fdbcli and API command to mark storage servers as permanently failed
2019-10-22 23:36:30 -07:00
A.J. Beamon 29a0014b41 Fix "bandwith" typo 2019-10-22 09:51:59 -07:00
Xin Dong fca9aab17a
Merge pull request #2046 from dongxinEric/feature/hot-read-key-detection
Added metrics for read hot key detection
2019-10-21 14:31:48 -07:00
Jon Fu d2b6626d5c Merge branch 'master' of https://github.com/apple/foundationdb into mark-ss-failed 2019-10-21 13:47:06 -07:00
Xin Dong 9a81948843
Accept review suggestions.
Co-Authored-By: A.J. Beamon <ajbeamon@users.noreply.github.com>
2019-10-21 10:08:43 -07:00
Xin Dong 6a40ef25e5 Credit to Evan for pointing out the missing line which costs me weeks debugging some weird behaviors. 2019-10-18 16:46:19 -07:00
Jon Fu b1fd6b4443 addressed review comments 2019-10-18 09:43:25 -07:00
Evan Tschannen 86bcb84b45 Raised the data distribution priority of splitting shards above restoring fault tolerance to avoid hot write shards 2019-10-11 17:50:43 -07:00
Xin Dong 41aae9cbd9 Fix compiler errors 2019-10-10 13:08:59 -07:00
Xin Dong 795ce59fbb Resolved conflict with master 2019-10-09 16:45:11 -07:00
Xin Dong 62ffdd54a3 Updated some comments to reflect the correct knob value and also used a more appropriate value for read bandwidth. Set the default value for read bandwidth in some cases. 2019-10-09 16:42:42 -07:00
Xin Dong cd4757b06c Address review comments 2019-10-09 16:42:42 -07:00
Xin Dong 6b0f771cc0 Fixed a typo in knobs. Addressed some review comments. Added code for actual metric collecting. 2019-10-09 16:42:42 -07:00
Xin Dong 12293d5497 Added metrics for read hot key detection 2019-10-09 16:42:42 -07:00
A.J. Beamon 909855bcec Fix: the keys argument to changeSizes was passed as a reference, but when used after the first wait(), it may no longer be valid. 2019-10-09 14:07:48 -07:00
Jon Fu d96a7b2c69 Merge branch 'master' of https://github.com/apple/foundationdb into mark-ss-failed 2019-10-03 09:47:45 -07:00
Evan Tschannen 045175bd0e added tracking for the size of the system keyspace 2019-09-27 22:39:19 -07:00
Evan Tschannen 3bb62e008c lowered the priority of some delays in data distribution so that the process will prefer other work 2019-09-27 18:33:13 -07:00
Jon Fu 00c2025d4b fixed removeKeys impl, adjusted test workload, and introduced extra safety checks to NativeAPI and proxy 2019-08-27 14:39:44 -07:00
Jon Fu 66bba51988 Implemented direct removal of failed storage server from system keyspace 2019-08-27 14:39:43 -07:00
Meng Xu b7478f5dd3 DD:Add comments to help understand code
Add comments to explain the functionalities of some code.
2019-07-22 11:23:16 -07:00
Alex Miller 7a500cd37f A giant translation of TaskFooPriority -> TaskPriority::Foo
This is so that APIs that take priorities don't take ints, which are
common and easy to accidentally pass the wrong thing.
2019-06-25 02:47:35 -07:00
A.J. Beamon 5f55f3f613 Replace g_random and g_nondeterministic_random with functions deterministicRandom() and nondeterministicRandom() that return thread_local random number generators. Delete g_debug_random and trace_random. Allow only deterministicRandom() to be seeded, and require it to be seeded from each thread on which it is used. 2019-05-10 14:01:52 -07:00
mpilman d01cbf3455 Addressed code review comments 2019-04-05 13:12:20 -07:00
mpilman 1c16f87a4e Remove trace-calls to printable (in non-workloads) 2019-04-05 13:12:19 -07:00
anoyes 981426bac9 More ide fixes 2019-03-05 18:03:57 -08:00
Jingyu Zhou c38b2a8c38 Change masterId to distributorId in tracker.
This reflects the change of moving data distribution out of master server.
2019-02-14 16:37:16 -08:00
Evan Tschannen 4e54690005 Merge branch 'release-6.0'
# Conflicts:
#	fdbserver/DataDistribution.actor.cpp
#	fdbserver/MoveKeys.actor.cpp
2018-11-12 20:26:58 -08:00
Evan Tschannen cd188a351e fix: if a destination team became unhealthy and then healthy again, it would lower the priority of a move even though the source servers we are moving from are still unhealthy
fix: badTeams were not accounted for when checking priorities
2018-11-11 12:33:31 -08:00
Evan Tschannen 4b5d0b4e2c Merge branch 'release-6.0'
# Conflicts:
#	documentation/sphinx/source/release-notes.rst
#	fdbclient/AsyncFileBlobStore.actor.cpp
#	fdbclient/AsyncFileBlobStore.actor.h
#	fdbclient/BlobStore.actor.cpp
#	fdbclient/BlobStore.h
#	fdbclient/HTTP.actor.cpp
#	fdbclient/ManagementAPI.actor.cpp
#	fdbclient/NativeAPI.actor.cpp
#	fdbrpc/LoadBalance.actor.h
#	fdbrpc/batcher.actor.h
#	fdbrpc/fdbrpc.vcxproj
#	fdbrpc/sim2.actor.cpp
#	fdbserver/DataDistribution.actor.cpp
#	fdbserver/DataDistributionTracker.actor.cpp
#	fdbserver/SimulatedCluster.actor.cpp
#	fdbserver/TLogServer.actor.cpp
#	fdbserver/masterserver.actor.cpp
2018-11-10 13:04:24 -08:00
Evan Tschannen e68c07ae35 fix: trackShardBytes was called with the incorrect range, resulting in incorrect shard sizes
reduced the size of shard tracker actors by removing unnecessary state variable. Because we have a large number of these actors these extra state variables add up to a lot of memory
2018-11-02 13:03:01 -07:00
Robert Escriva 268093a96d Adjust all includes to be relative to the root.
Remove the use of relative paths.  A header at foo/bar.h could be included by
files under foo/ with "bar.h", but would be included everywhere else as
"foo/bar.h".  Adjust so that every include references such a header with the
latter form.

Signed-off-by: Robert Escriva <rescriva@dropbox.com>
2018-10-19 17:35:33 +00:00
A.J. Beamon 2a97139d5d This is the first step in eliminating the usage of database names in our code. The C API remains the same, but underneath that all usage of database names is eliminated. 2018-08-16 10:24:12 -07:00
Alex Miller fb31a6999f Rewrite all files to have #include actorcompiler.h as the last include. 2018-08-14 15:50:26 -07:00
Alex Miller 535b5701e5 Rewrite all `Void _ = wait(...)` -> `wait(...)`.
This takes advantage of the new actorcompiler functionality to avoid
having duplicate definitions of `Void _` when trying to feed the
un-actorcompiled source through clang.
2018-08-14 15:50:26 -07:00
Evan Tschannen 6f02ea843a prevented a slow task when too many shards were sent to the data distribution queue after switching to a fearless deployment 2018-08-09 12:37:46 -07:00
Evan Tschannen 1c29275672 call all methods which could disable a trace event before it is initialized. In practice this means calling .error first, then .suppressFor, then all your details. 2018-08-01 14:30:57 -07:00
Evan Tschannen 392c73affb fixed a few slow tasks 2018-07-12 14:06:59 -07:00
A.J. Beamon e5488419cc Attempt to normalize trace events:
* Detail names now all start with an uppercase character and contain no underscores. Ideally these should be head-first camel case, though that was harder to check.
* Type names have the same rules, except they allow one underscore (to support a usage pattern Context_Type). The first character after the underscore is also uppercase.
* Use seconds instead of milliseconds in details.

Added a check when events are logged in simulation that logs a message to stderr if the first two rules above aren't followed.

This probably doesn't address every instance of the above problems, but all of the events I was able to hit in simulation pass the check.
2018-06-08 11:11:08 -07:00
Evan Tschannen fa7eaea7cf fix: shards affected by team failure did not properly handle separate teams for the remote and primary data centers 2018-03-08 10:50:05 -08:00
Evan Tschannen 37a6a81634 Merge commit '7f6fc3e039c911cd84b8540f7f799fc38a1c1822' into feature-remote-logs
# Conflicts:
#	fdbserver/workloads/RestartRecovery.actor.cpp
2018-02-23 12:33:28 -08:00
Alec Grieser 0bae9880f1 remove trailing whitespace from our copyright headers ; fixed formatting of python setup.py 2018-02-21 10:25:11 -08:00
Evan Tschannen ebd94bb654 removed a separately configurable storage team size for the remote data center, because it did not make sense
fix: the master did not monitor for the failure of remote logs
stop merge attempts when a data center is failed
fixed a variety of other problems with data distribution when a data center is failed
2018-02-02 11:46:04 -08:00
Evan Tschannen c3918d892a do not use bandwidth splitting on the keyServer shard, lots of sets and clears to this shard generally means you do not want to create additional data distribution work 2017-11-30 18:28:16 -08:00
Evan Tschannen aa0c2ae317 only increase the max shard size if the shard begins in the keyServer keyspace, do not increase the minimum shard size 2017-10-27 14:22:26 -07:00
Evan Tschannen 3a4078bdda the keyservers shards are always a fixed large size 2017-10-27 11:52:11 -07:00
Yichi Chiang 53e1ae9f60 shard system keyspace 2017-07-26 13:47:31 -07:00
FDB Dev Team a674cb4ef4 Initial repository commit 2017-05-25 13:48:44 -07:00