foundationdb

Commit Graph

Author	SHA1	Message	Date
Evan Tschannen	1818aab205	Apply suggestions from code review Co-Authored-By: jzhou77 <jingyuzhou@gmail.com>	2019-02-14 16:30:13 -08:00
Jingyu Zhou	886e7ab2ba	Add a new DataDistributor role. Let the cluster controller start a new data distributor role by sending a message to a chosen worker. Change MasterInterface usage in DataDistribution to masterId. Add DataDistributor rejoin handling. This allows the data distributor to tell the new cluster controller of its existence so that the controller doesn't spawn a new one, i.e., there should be only ONE data distributor in the cluster. If the DataDistributor (DD) doesn't join within a while, the ClusterController (CC) tries to recruit one as DD. CC also monitors DD and restarts one if it has failed. The Proxy also monitors the DD. If DD has failed, the Proxy will ask CC for the new DD. Add GetRecoveryInfo RPC to master server, which is called by the data distributor to obtain the recovery Transaction version from the master server.	2019-02-14 16:30:13 -08:00
A.J. Beamon	b435d51061	Merge branch 'master' into track-server-request-latencies	2019-02-14 08:07:32 -08:00
Evan Tschannen	a678f778fa	Increase severity to SevWarnAlways for TooManyStatusRequests trace Co-Authored-By: tclinken <trevorclinkenbeard@gmail.com>	2019-01-28 17:50:50 -08:00
Trevor Clinkenbeard	5b89db811a	Throttle status requests with MAX_STATUS_REQUESTS_PER_SECOND knob, whenever status batching is used.	2019-01-28 15:37:30 -08:00
A.J. Beamon	2198d24ce1	Merge commit '3b2700d25334c53d13496ca16682642aac951beb' into track-server-request-latencies # Conflicts: # fdbclient/MasterProxyInterface.h # fdbserver/ClusterController.actor.cpp # fdbserver/MasterProxyServer.actor.cpp # fdbserver/ServerDBInfo.h # fdbserver/Status.actor.cpp # fdbserver/fdbserver.vcxproj # fdbserver/storageserver.actor.cpp	2019-01-24 11:43:26 -08:00
A.J. Beamon	8e05e95045	Added the ability to configure the latency band settings by setting a special key in \xff keyspace.	2019-01-18 16:18:34 -08:00
Evan Tschannen	7dbf06162e	Update fdbserver/ClusterController.actor.cpp Co-Authored-By: bnamasivayam <36455962+bnamasivayam@users.noreply.github.com>	2019-01-14 16:57:00 -08:00
Balachandar Namasivayam	ff661bca22	Fix a minor bug in the RoleFitness Class.	2019-01-14 14:54:54 -08:00
Balachandar Namasivayam	a8e2e75cd5	Re-enable CheckDesiredClasses after making necessary changes for multi-region setup. Fixed a couple of bugs: 1) A rare race condition where a worker is still being assigned roles even after it died. 2) Fix how RoleFitness is calculated for TLog and LogRouter. Only worst fitness is compared to see if a better fit is available.	2019-01-10 10:28:32 -08:00
Evan Tschannen	4b5d0b4e2c	Merge branch 'release-6.0' # Conflicts: # documentation/sphinx/source/release-notes.rst # fdbclient/AsyncFileBlobStore.actor.cpp # fdbclient/AsyncFileBlobStore.actor.h # fdbclient/BlobStore.actor.cpp # fdbclient/BlobStore.h # fdbclient/HTTP.actor.cpp # fdbclient/ManagementAPI.actor.cpp # fdbclient/NativeAPI.actor.cpp # fdbrpc/LoadBalance.actor.h # fdbrpc/batcher.actor.h # fdbrpc/fdbrpc.vcxproj # fdbrpc/sim2.actor.cpp # fdbserver/DataDistribution.actor.cpp # fdbserver/DataDistributionTracker.actor.cpp # fdbserver/SimulatedCluster.actor.cpp # fdbserver/TLogServer.actor.cpp # fdbserver/masterserver.actor.cpp	2018-11-10 13:04:24 -08:00
Evan Tschannen	04fa2a7202	fix: we could recover in a region with priority < 0	2018-11-05 10:14:26 -08:00
Evan Tschannen	87295cc263	suppressed spammy trace events, and avoid reporting a long master recovery duration when the cluster is first created	2018-11-04 23:07:56 -08:00
Evan Tschannen	c1bd279a4e	addressed review comments	2018-11-04 20:26:23 -08:00
Evan Tschannen	accba4fa1d	keep track of the last time a process became available to set a better starting value for remoteStartTime	2018-11-04 14:33:03 -08:00
Evan Tschannen	30fbc29af1	Renamed TimeKeeperStarted to TimeKeeperCommit	2018-11-02 12:57:03 -07:00
Evan Tschannen	278dbd5096	call debug transaction on timekeeper	2018-11-02 12:56:29 -07:00
Robert Escriva	268093a96d	Adjust all includes to be relative to the root. Remove the use of relative paths. A header at foo/bar.h could be included by files under foo/ with "bar.h", but would be included everywhere else as "foo/bar.h". Adjust so that every include references such a header with the latter form. Signed-off-by: Robert Escriva <rescriva@dropbox.com>	2018-10-19 17:35:33 +00:00
Evan Tschannen	3922e477a5	Merge branch 'release-6.0' # Conflicts: # documentation/sphinx/source/release-notes.rst # fdbclient/ManagementAPI.actor.cpp # fdbserver/ClusterController.actor.cpp # fdbserver/DataDistribution.actor.cpp # fdbserver/LogSystemDiskQueueAdapter.actor.cpp # fdbserver/SimulatedCluster.actor.cpp # fdbserver/TLogServer.actor.cpp	2018-10-03 16:57:18 -07:00
Evan Tschannen	3fdf72c626	fix: we need to force recovery if the master is still attempting to read the txs tag	2018-09-28 13:33:33 -07:00
Evan Tschannen	22e6afbb18	fix: the cluster controller did not pass in its own locality when creating its database object, therefore it was not using locality aware load balancing	2018-09-28 12:12:06 -07:00
A.J. Beamon	92990d6aef	Merge release-6.0 into master	2018-09-21 16:14:39 -07:00
Evan Tschannen	6b6d7a087d	The cluster controller should never consider itself as failed (that will be handled by the coordinators) Simplified the check that the cluster controller is excluded	2018-09-20 17:01:11 -07:00
Evan Tschannen	03728db99b	do not trigger better master exists if the cluster controller is excluded, since the master will change anyways once the cluster controller is moved	2018-09-19 18:28:24 -07:00
Evan Tschannen	4dd2dda0a3	Merge branch 'release-6.0' # Conflicts: # fdbserver/worker.actor.cpp	2018-09-05 16:11:06 -07:00
Evan Tschannen	df406a340e	Merge pull request #742 from ajbeamon/roles-in-trace-events Add the roles running on a process as a field on trace events in the …	2018-09-05 16:08:12 -07:00
Evan Tschannen	90301f497f	Merge branch 'release-6.0' # Conflicts: # fdbclient/ManagementAPI.actor.cpp # fdbrpc/FlowTransport.actor.cpp # fdbrpc/TLSConnection.actor.cpp # fdbserver/DataDistribution.actor.cpp # fdbserver/Status.actor.cpp # fdbserver/storageserver.actor.cpp # fdbserver/workloads/StatusWorkload.actor.cpp # versions.target	2018-09-05 16:06:33 -07:00
A.J. Beamon	2de0b5d6d7	Add the roles running on a process as a field on trace events in the form of a comma delimited string of role abbreviations.	2018-09-05 15:06:14 -07:00
Evan Tschannen	4eaff42e4f	Merge pull request #712 from ajbeamon/remove-database-name-internal Eliminate use of database names (phase 1)	2018-09-05 10:35:00 -07:00
Evan Tschannen	e60c668853	The cluster controller will increase its failure monitoring delay after there have been many unfinishedRecoveries	2018-08-31 10:51:55 -07:00
Evan Tschannen	a9987202d6	fixed merge problem	2018-08-22 08:47:47 -07:00
Evan Tschannen	717c43a69f	merge 6.0 into master	2018-08-22 00:28:04 -07:00
A.J. Beamon	2a97139d5d	This is the first step in eliminating the usage of database names in our code. The C API remains the same, but underneath that all usage of database names is eliminated.	2018-08-16 10:24:12 -07:00
Evan Tschannen	e770629229	fix: json_spirit::write_string is very CPU intensive, especially for large JSON documents. The cluster controller would call this function for each status reply it needed to send, resulting in a slow task.	2018-08-15 19:39:06 -07:00
Alex Miller	86dbe1f0e9	Fix more instances of actorcompiler.h being in the wrong place.	2018-08-14 15:50:26 -07:00
Alex Miller	fb31a6999f	Rewrite all files to have #include actorcompiler.h as the last include.	2018-08-14 15:50:26 -07:00
Alex Miller	535b5701e5	Rewrite all `Void _ = wait(...)` -> `wait(...)`. This takes advantage of the new actorcompiler functionality to avoid having duplicate definitions of `Void _` when trying to feed the un-actorcompiled source through clang.	2018-08-14 15:50:26 -07:00
A.J. Beamon	574c5576a2	Merge branch 'release-6.0' of github.com:apple/foundationdb # Conflicts: # fdbrpc/TLSConnection.actor.cpp # versions.target	2018-08-10 14:31:58 -07:00
A.J. Beamon	3535ddad80	Merge pull request #674 from alexmiller-apple/glibcxx-debug-fixes Fix bugs uncovered by -D_GLIBCXX_DEBUG	2018-08-09 08:18:51 -07:00
Evan Tschannen	be1a4d74c7	tlogs serve reads to log routers at a low priority, to prevent them from using all their resources catching up a remote dc that has been down for a long time increase the amount of memory ratekeeper budgets for tlogs so that there is a gap after the spill threshold to prevent temporarily overshooting the budget	2018-08-04 10:31:30 -07:00
Alex Miller	1a7cda4149	Stop performing self-moves. (e.g. a = std::move(a)) self-moves are frowned upon in C++, and in our code this generally happens from calls to swap as part of trying to implement a "unordered erase" function via swap-to-the-end-and-pop_back. For convenience, a swapAndPop() function is now offered that performs this, while disallowing self-moves.	2018-08-01 18:09:54 -07:00
Evan Tschannen	1c29275672	call all methods which could disable a trace event before it is initialized. In practice this means calling .error first, then .suppressFor, then all your details.	2018-08-01 14:30:57 -07:00
Evan Tschannen	30b2f85020	fix: it is not safe to drop logs supporting the current primary datacenter, because configuring usable_regions down will drop the storage servers in the remote region, leaving you with no remaining logs	2018-07-14 16:26:45 -07:00
Evan Tschannen	28c0d96c90	fix: treat the local region as best when version difference is too large re-check requests when the version difference becomes small	2018-07-06 14:44:11 -07:00
Evan Tschannen	21347df254	fix: getting metrics did not handle broken_promise errors	2018-07-05 12:30:11 -07:00
Evan Tschannen	507b3bacb0	fix: kill all tlogs in one region prevents the remote logs from recovering in that region, do not allow that to prevent us from configuring usable_regions=1. added more recovery states.	2018-07-05 00:08:51 -07:00
Evan Tschannen	e17dfea3b6	fix: desiredTLogCount was used instead of getDesiredLogs(), which caused problems with recruitment when desiredTLogCount was -1. canKillProcess logic was wrong. We still need to configure usable_regions because if datacenterVersionDifference is too large we cannot complete data movement.	2018-07-04 16:22:32 -04:00
Evan Tschannen	f2ec80f10d	added trace events for cluster controller changing datacenters	2018-07-02 13:06:54 -04:00
Evan Tschannen	334a433238	spend less time before using satellite fallback, because the database will be unavailable during this waiting time	2018-07-02 12:50:52 -04:00
Evan Tschannen	7a12d3e130	added the (untested) ability to force a recovery to the remote datacenter, even if that results in data loss. If the DR lag is more than 1 week there could be potential data corruption if any primary storage servers are still alive.	2018-07-01 09:39:04 -04:00
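
Commit 1a7cda4149 above describes an "unordered erase" done by swapping an element to the end and calling pop_back, with a swapAndPop() helper that disallows self-moves. The following is only a minimal sketch of that idea, not the FoundationDB implementation: the signature, the use of std::vector, and the bounds assertion are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical sketch of an unordered erase; the real helper in the
// FoundationDB tree may differ in name, signature, and container type.
template <class T>
void swapAndPop(std::vector<T>& vec, std::size_t index) {
    assert(index < vec.size());
    // Skip the move when the target is already the last element,
    // so we never execute a self-move (a = std::move(a)).
    if (index != vec.size() - 1) {
        vec[index] = std::move(vec.back());
    }
    vec.pop_back();
}
```

Under these assumptions, calling swapAndPop(v, 1) on a vector holding 1, 2, 3, 4 leaves 1, 4, 3: removal is O(1), at the cost of not preserving element order.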

142 Commits