Evan Tschannen
1818aab205
Apply suggestions from code review
...
Co-Authored-By: jzhou77 <jingyuzhou@gmail.com>
2019-02-14 16:30:13 -08:00
Jingyu Zhou
886e7ab2ba
Add a new DataDistributor role.
...
Let the cluster controller start a new data distributor role by sending a
message to a chosen worker.
Change MasterInterface usage in DataDistribution to masterId
Add DataDistributor rejoin handling.
This allows the data distributor to tell the new cluster controller of its
existence so that the controller doesn't spawn a new one. I.e., there should
be only ONE data distributor in the cluster.
If the DataDistributor (DD) doesn't join within a while, the ClusterController (CC)
tries to recruit one as DD. The CC also monitors the DD and restarts one if it has failed.
The Proxy also monitors the DD. If the DD fails, the Proxy asks the CC for
the new DD.
Add GetRecoveryInfo RPC to master server, which is called by the data distributor
to obtain the recovery transaction version from the master server.
2019-02-14 16:30:13 -08:00
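The rejoin and recruitment rules described in commit 886e7ab2ba can be sketched as a small piece of bookkeeping. This is a minimal illustration only: the class `DistributorTracker`, its methods, the string ids, and the explicit `now` parameter are all assumptions made for the example and are not taken from the FoundationDB sources.

```cpp
#include <string>

// Hypothetical bookkeeping a controller could keep so that at most one
// data distributor exists. All names are illustrative assumptions.
class DistributorTracker {
public:
    // `rejoinDeadline` is the time until which an existing distributor
    // is given the chance to announce itself to a new controller.
    explicit DistributorTracker(double rejoinDeadline) : rejoinDeadline_(rejoinDeadline) {}

    // An existing distributor reports itself. Returns false if a different
    // distributor is already known, so the caller can reject the duplicate.
    bool onRejoin(const std::string& id) {
        if (!currentId_.empty() && currentId_ != id) return false;
        currentId_ = id;
        return true;
    }

    // The monitored distributor failed; a replacement may be recruited at once.
    void onFailure(const std::string& id) {
        if (currentId_ == id) {
            currentId_.clear();
            rejoinDeadline_ = 0.0;
        }
    }

    // Recruit only when no distributor is known and the rejoin window has passed.
    bool shouldRecruit(double now) const {
        return currentId_.empty() && now >= rejoinDeadline_;
    }

    // Record the distributor that was just recruited.
    void onRecruited(const std::string& id) { currentId_ = id; }

    const std::string& current() const { return currentId_; }

private:
    std::string currentId_;   // empty means "no distributor known"
    double rejoinDeadline_;
};
```

Passing the time in explicitly keeps the sketch deterministic; a real controller would more likely drive these transitions from its own timers and failure monitor.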
A.J. Beamon
b435d51061
Merge branch 'master' into track-server-request-latencies
2019-02-14 08:07:32 -08:00
Evan Tschannen
a678f778fa
Increase severity to SevWarnAlways for TooManyStatusRequests trace
...
Co-Authored-By: tclinken <trevorclinkenbeard@gmail.com>
2019-01-28 17:50:50 -08:00
Trevor Clinkenbeard
5b89db811a
Throttle status requests with MAX_STATUS_REQUESTS_PER_SECOND knob, whenever status batching is used.
2019-01-28 15:37:30 -08:00
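Commit 5b89db811a does not say how the throttle is implemented. As a rough illustration of capping requests per second, here is a generic fixed-window limiter. The class name `StatusRequestThrottle`, the `allow` method, and the caller-supplied clock are assumptions for this sketch; only the idea of a per-second maximum comes from the commit message.

```cpp
// Hypothetical fixed-window limiter; not the FoundationDB implementation.
class StatusRequestThrottle {
public:
    // `maxPerSecond` plays the role of a knob such as MAX_STATUS_REQUESTS_PER_SECOND.
    explicit StatusRequestThrottle(int maxPerSecond) : maxPerSecond_(maxPerSecond) {}

    // Returns true if a request arriving at `nowSeconds` may proceed.
    // The caller supplies the time so the sketch stays deterministic.
    bool allow(double nowSeconds) {
        if (nowSeconds - windowStart_ >= 1.0) {
            // A full second has elapsed: start a new window.
            windowStart_ = nowSeconds;
            count_ = 0;
        }
        if (count_ >= maxPerSecond_) return false;
        ++count_;
        return true;
    }

private:
    int maxPerSecond_;
    double windowStart_ = 0.0;
    int count_ = 0;
};
```

A fixed window is the simplest choice; a token bucket would smooth out bursts at window boundaries at the cost of a little more state.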
A.J. Beamon
2198d24ce1
Merge commit '3b2700d25334c53d13496ca16682642aac951beb' into track-server-request-latencies
...
# Conflicts:
# fdbclient/MasterProxyInterface.h
# fdbserver/ClusterController.actor.cpp
# fdbserver/MasterProxyServer.actor.cpp
# fdbserver/ServerDBInfo.h
# fdbserver/Status.actor.cpp
# fdbserver/fdbserver.vcxproj
# fdbserver/storageserver.actor.cpp
2019-01-24 11:43:26 -08:00
A.J. Beamon
8e05e95045
Added the ability to configure the latency band settings by setting a special key in \xff keyspace.
2019-01-18 16:18:34 -08:00
Evan Tschannen
7dbf06162e
Update fdbserver/ClusterController.actor.cpp
...
Co-Authored-By: bnamasivayam <36455962+bnamasivayam@users.noreply.github.com>
2019-01-14 16:57:00 -08:00
Balachandar Namasivayam
ff661bca22
Fix a minor bug in the RoleFitness Class.
2019-01-14 14:54:54 -08:00
Balachandar Namasivayam
a8e2e75cd5
Re-enable CheckDesiredClasses after making necessary changes for multi-region setup.
...
Fixed a couple of bugs
1) A rare race condition where a worker is being assigned roles even after it died.
2) Fix how RoleFitness is calculated for TLog and LogRouter. Only worst fitness is compared to see if a better fit is available.
2019-01-10 10:28:32 -08:00
Evan Tschannen
4b5d0b4e2c
Merge branch 'release-6.0'
...
# Conflicts:
# documentation/sphinx/source/release-notes.rst
# fdbclient/AsyncFileBlobStore.actor.cpp
# fdbclient/AsyncFileBlobStore.actor.h
# fdbclient/BlobStore.actor.cpp
# fdbclient/BlobStore.h
# fdbclient/HTTP.actor.cpp
# fdbclient/ManagementAPI.actor.cpp
# fdbclient/NativeAPI.actor.cpp
# fdbrpc/LoadBalance.actor.h
# fdbrpc/batcher.actor.h
# fdbrpc/fdbrpc.vcxproj
# fdbrpc/sim2.actor.cpp
# fdbserver/DataDistribution.actor.cpp
# fdbserver/DataDistributionTracker.actor.cpp
# fdbserver/SimulatedCluster.actor.cpp
# fdbserver/TLogServer.actor.cpp
# fdbserver/masterserver.actor.cpp
2018-11-10 13:04:24 -08:00
Evan Tschannen
04fa2a7202
fix: we could recover in a region with priority < 0
2018-11-05 10:14:26 -08:00
Evan Tschannen
87295cc263
suppressed spammy trace events, and avoid reporting a long master recovery duration when the cluster is first created
2018-11-04 23:07:56 -08:00
Evan Tschannen
c1bd279a4e
addressed review comments
2018-11-04 20:26:23 -08:00
Evan Tschannen
accba4fa1d
keep track of the last time a process became available to set a better starting value for remoteStartTime
2018-11-04 14:33:03 -08:00
Evan Tschannen
30fbc29af1
Renamed TimeKeeperStarted to TimeKeeperCommit
2018-11-02 12:57:03 -07:00
Evan Tschannen
278dbd5096
call debug transaction on timekeeper
2018-11-02 12:56:29 -07:00
Robert Escriva
268093a96d
Adjust all includes to be relative to the root.
...
Remove the use of relative paths. A header at foo/bar.h could be included by
files under foo/ with "bar.h", but would be included everywhere else as
"foo/bar.h". Adjust so that every include references such a header with the
latter form.
Signed-off-by: Robert Escriva <rescriva@dropbox.com>
2018-10-19 17:35:33 +00:00
Evan Tschannen
3922e477a5
Merge branch 'release-6.0'
...
# Conflicts:
# documentation/sphinx/source/release-notes.rst
# fdbclient/ManagementAPI.actor.cpp
# fdbserver/ClusterController.actor.cpp
# fdbserver/DataDistribution.actor.cpp
# fdbserver/LogSystemDiskQueueAdapter.actor.cpp
# fdbserver/SimulatedCluster.actor.cpp
# fdbserver/TLogServer.actor.cpp
2018-10-03 16:57:18 -07:00
Evan Tschannen
3fdf72c626
fix: we need to force recovery if the master is still attempting to read the txs tag
2018-09-28 13:33:33 -07:00
Evan Tschannen
22e6afbb18
fix: the cluster controller did not pass in its own locality when creating its database object, therefore it was not using locality aware load balancing
2018-09-28 12:12:06 -07:00
A.J. Beamon
92990d6aef
Merge release-6.0 into master
2018-09-21 16:14:39 -07:00
Evan Tschannen
6b6d7a087d
The cluster controller should never consider itself as failed (that will be handled by the coordinators)
...
Simplified the check that the cluster controller is excluded
2018-09-20 17:01:11 -07:00
Evan Tschannen
03728db99b
do not trigger better master exists if the cluster controller is excluded, since the master will change anyways once the cluster controller is moved
2018-09-19 18:28:24 -07:00
Evan Tschannen
4dd2dda0a3
Merge branch 'release-6.0'
...
# Conflicts:
# fdbserver/worker.actor.cpp
2018-09-05 16:11:06 -07:00
Evan Tschannen
df406a340e
Merge pull request #742 from ajbeamon/roles-in-trace-events
...
Add the roles running on a process as a field on trace events in the …
2018-09-05 16:08:12 -07:00
Evan Tschannen
90301f497f
Merge branch 'release-6.0'
...
# Conflicts:
# fdbclient/ManagementAPI.actor.cpp
# fdbrpc/FlowTransport.actor.cpp
# fdbrpc/TLSConnection.actor.cpp
# fdbserver/DataDistribution.actor.cpp
# fdbserver/Status.actor.cpp
# fdbserver/storageserver.actor.cpp
# fdbserver/workloads/StatusWorkload.actor.cpp
# versions.target
2018-09-05 16:06:33 -07:00
A.J. Beamon
2de0b5d6d7
Add the roles running on a process as a field on trace events in the form of a comma delimited string of role abbreviations.
2018-09-05 15:06:14 -07:00
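The "comma delimited string of role abbreviations" from commit 2de0b5d6d7 could be built along these lines. The function name `joinRoles` and the use of `std::set` are assumptions for the sketch, and the abbreviations in the usage note are made up for illustration.

```cpp
#include <set>
#include <string>

// Hypothetical helper: joins role abbreviations into one comma-delimited field.
// A std::set keeps the abbreviations unique and in a stable (sorted) order.
std::string joinRoles(const std::set<std::string>& roles) {
    std::string out;
    for (const auto& role : roles) {
        if (!out.empty()) out += ',';  // separator only between entries
        out += role;
    }
    return out;
}
```

For example, a process holding the hypothetical abbreviations `SS`, `CC`, and `TL` would produce the field value `CC,SS,TL`, and a process with no roles would produce an empty string.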
Evan Tschannen
4eaff42e4f
Merge pull request #712 from ajbeamon/remove-database-name-internal
...
Eliminate use of database names (phase 1)
2018-09-05 10:35:00 -07:00
Evan Tschannen
e60c668853
The cluster controller will increase its failure monitoring delay after there have been many unfinishedRecoveries
2018-08-31 10:51:55 -07:00
Evan Tschannen
a9987202d6
fixed merge problem
2018-08-22 08:47:47 -07:00
Evan Tschannen
717c43a69f
merge 6.0 into master
2018-08-22 00:28:04 -07:00
A.J. Beamon
2a97139d5d
This is the first step in eliminating the usage of database names in our code. The C API remains the same, but underneath that all usage of database names is eliminated.
2018-08-16 10:24:12 -07:00
Evan Tschannen
e770629229
fix: json_spirit::write_string is very CPU intensive, especially for large JSON documents. The cluster controller would call this function for each status reply it needed to send, resulting in a slow task.
2018-08-15 19:39:06 -07:00
Alex Miller
86dbe1f0e9
Fix more instances of actorcompiler.h being in the wrong place.
2018-08-14 15:50:26 -07:00
Alex Miller
fb31a6999f
Rewrite all files to have #include actorcompiler.h as the last include.
2018-08-14 15:50:26 -07:00
Alex Miller
535b5701e5
Rewrite all `Void _ = wait(...)` -> `wait(...)`.
...
This takes advantage of the new actorcompiler functionality to avoid
having duplicate definitions of `Void _` when trying to feed the
un-actorcompiled source through clang.
2018-08-14 15:50:26 -07:00
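Commit 535b5701e5 does not state how the rewrite was carried out. A mechanical version of the `Void _ = wait(...)` to `wait(...)` transformation could look like the sketch below, which uses `std::regex` on one line of source at a time. The function name `dropVoidUnderscore` and the exact pattern are assumptions.

```cpp
#include <regex>
#include <string>

// Hypothetical line rewriter: turns "Void _ = wait(" into "wait(" and
// leaves every other line untouched.
std::string dropVoidUnderscore(const std::string& line) {
    static const std::regex pattern(R"(Void\s+_\s*=\s*wait\s*\()");
    return std::regex_replace(line, pattern, "wait(");
}
```

Only the prefix up to the opening parenthesis is matched, so the arguments of `wait` never need to be parsed.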
A.J. Beamon
574c5576a2
Merge branch 'release-6.0' of github.com:apple/foundationdb
...
# Conflicts:
# fdbrpc/TLSConnection.actor.cpp
# versions.target
2018-08-10 14:31:58 -07:00
A.J. Beamon
3535ddad80
Merge pull request #674 from alexmiller-apple/glibcxx-debug-fixes
...
Fix bugs uncovered by -D_GLIBCXX_DEBUG
2018-08-09 08:18:51 -07:00
Evan Tschannen
be1a4d74c7
tlogs serve reads to log routers at a low priority, to prevent them from using all their resources catching up a remote dc that has been down for a long time
...
increase the amount of memory ratekeeper budgets for tlogs so that there is a gap after the spill threshold to prevent temporarily overshooting the budget
2018-08-04 10:31:30 -07:00
Alex Miller
1a7cda4149
Stop performing self-moves. (e.g. a = std::move(a))
...
Self-moves are frowned upon in C++, and in our code this generally happens from
calls to swap as part of trying to implement an "unordered erase" function via
swap-to-the-end-and-pop_back. For convenience, a swapAndPop() function is now
offered that performs this, while disallowing self-moves.
2018-08-01 18:09:54 -07:00
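The swap-to-the-end-and-pop_back idea from commit 1a7cda4149 can be sketched as follows. This is an illustration under assumptions rather than the helper added by the commit: the name `swapAndPopSketch`, the `std::vector` parameter, and the bounds assertion are choices made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical sketch of an O(1) "unordered erase": the element at `index`
// is replaced by the last element, then the vector shrinks by one.
template <class T>
void swapAndPopSketch(std::vector<T>& v, std::size_t index) {
    assert(index < v.size());
    // Skip the swap when the target is already last; swapping an element
    // with itself is exactly the self-move this helper is meant to avoid.
    if (index != v.size() - 1) {
        using std::swap;
        swap(v[index], v.back());
    }
    v.pop_back();
}
```

Erasing index 1 from `{10, 20, 30, 40}` leaves `{10, 40, 30}`: the order of the remaining elements is not preserved, which is the trade-off for avoiding the linear-time shift of `erase`.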
Evan Tschannen
1c29275672
call all methods which could disable a trace event before it is initialized. In practice this means calling .error first, then .suppressFor, then all your details.
2018-08-01 14:30:57 -07:00
Evan Tschannen
30b2f85020
fix: it is not safe to drop logs supporting the current primary datacenter, because configuring usable_regions down will drop the storage servers in the remote region, leaving you with no remaining logs
2018-07-14 16:26:45 -07:00
Evan Tschannen
28c0d96c90
fix: treat the local region as best when version difference is too large
...
re-check requests when the version difference becomes small
2018-07-06 14:44:11 -07:00
Evan Tschannen
21347df254
fix: getting metrics did not handle broken_promise errors
2018-07-05 12:30:11 -07:00
Evan Tschannen
507b3bacb0
fix: killing all tlogs in one region prevents the remote logs from recovering in that region; do not allow that to prevent us from configuring usable_regions=1.
...
added more recovery states.
2018-07-05 00:08:51 -07:00
Evan Tschannen
e17dfea3b6
fix: desiredTLogCount was used instead of getDesiredLogs(), which caused problems with recruitment when desiredTLogCount was -1.
...
canKillProcess logic was wrong.
We still need to configure usable_regions because if datacenterVersionDifference is too large we cannot complete data movement.
2018-07-04 16:22:32 -04:00
Evan Tschannen
f2ec80f10d
added trace events for cluster controller changing datacenters
2018-07-02 13:06:54 -04:00
Evan Tschannen
334a433238
spend less time before using satellite fallback, because the database will be unavailable during this waiting time
2018-07-02 12:50:52 -04:00
Evan Tschannen
7a12d3e130
added the (untested) ability to force a recovery to the remote datacenter, even if that results in data loss. If the DR lag is more than 1 week there could be potential data corruption if any primary storage servers are still alive.
2018-07-01 09:39:04 -04:00