foundationdb

Commit Graph

Author	SHA1	Message	Date
Young Liu	525f10e30c	Merge master branch	2020-07-22 16:08:49 -07:00
Young Liu	302cf5c45f	Remove debug trace events	2020-07-22 12:20:22 -07:00
Young Liu	2703cedac5	Fixed known bugs	2020-07-17 22:24:52 -07:00
Young Liu	21c1998cca	Fix MaxTLogQueueSize Bug	2020-07-16 15:56:04 -07:00
Young Liu	5b06d69d25	Pass watches test	2020-07-15 00:37:41 -07:00
Andrew Noyes	f470ba8316	Remove using namespace std::rel_ops This causes the following to not compile anymore \#include <utility> \#include <vector> using namespace std::rel_ops; int main() { std::vector<int> xs; return xs.rbegin() != xs.rend(); } See https://godbolt.org/z/s1977n	2020-07-10 22:58:15 +00:00
Meng Xu	9668f32df5	Merge pull request #3388 from apple/release-6.3 Merge Release 6.3 into master	2020-06-18 08:50:25 -07:00
Vishesh Yadav	3068a37e1b	refactor: Remove dead failureDetectionServer code	2020-06-17 15:40:21 -07:00
sfc-gh-tclinkenbeard	99bf993815	Replace BOOST_NOEXCEPT with noexcept	2020-06-09 22:39:19 -07:00
negoyal	cf13e00a8f	Merge remote-tracking branch 'origin/release-6.3' into fdb_cache_wo_allocator	2020-06-01 17:38:31 -07:00
Markus Pilman	c2bc75516f	Merge branch 'release-6.3' of github.com:apple/foundationdb into features/trace-roles	2020-05-14 10:34:53 -07:00
Evan Tschannen	f17f00fdd5	Merge branch 'release-6.2' # Conflicts: # documentation/sphinx/source/release-notes.rst	2020-05-10 22:33:38 -07:00
Evan Tschannen	3eaa9d6397	fix: do not report datacenter version difference before both datacenters report a correct version	2020-05-10 17:49:09 -07:00
Markus Pilman	5f9b127e56	Emit traces regularly about role assignment We are currently emitting Role transition traces when a role starts and when it ends. While this is useful for debugging, it doesn't work well with tools that inject data and might potentially miss some trace lines. We do decorate each trace lines with the roles assigned to that particular process, however, this is not sufficient for tools that can make use of the UID -> Role mapping	2020-05-08 16:27:57 -07:00
negoyal	dd033736ed	Merge branch 'master' into fdb_cache_subfeature2	2020-05-04 17:29:43 -07:00
Evan Tschannen	9e5037291d	fix compiler errors	2020-05-01 14:30:50 -07:00
Evan Tschannen	a442565e13	more work towards shrinking locality	2020-04-18 21:29:38 -07:00
Evan Tschannen	b04478704e	fixed improper use of std::set erase	2020-04-17 16:45:22 -07:00
Evan Tschannen	33efb9ec97	code cleanup based on review comments	2020-04-17 15:05:01 -07:00
Evan Tschannen	b667d5442f	fix: not all removed endpoints were actually removed	2020-04-17 13:47:54 -07:00
Evan Tschannen	9b5130194d	avoid updating the same endpoint multiple times	2020-04-11 21:05:30 -07:00
Evan Tschannen	1476057996	properly cache serialization of serverDBInfo	2020-04-11 19:30:05 -07:00
Evan Tschannen	07cc0a8d74	code cleanup	2020-04-10 17:02:11 -07:00
Evan Tschannen	ce4493f679	many bug fixes	2020-04-10 13:45:16 -07:00
Evan Tschannen	a51c92854a	Merge branch 'master' into feature-tree-broadcast # Conflicts: # fdbserver/WorkerInterface.actor.h # fdbserver/worker.actor.cpp	2020-04-06 21:09:44 -07:00
Evan Tschannen	2a1bd97120	fix compilation errors	2020-04-06 20:58:43 -07:00
Evan Tschannen	477d66b46d	implemented a tree broadcast for txn state message for proxies, and serverDBInfo for workers	2020-04-05 23:09:36 -07:00
negoyal	acaf91ac47	Merge branch 'master' into fdb_cache_subfeature2	2020-03-26 13:33:08 -07:00
Jingyu Zhou	5b36dcaad5	Fix oldest backup epoch for backup workers The oldest backup epoch is piggybacked in LogSystemConfig from master to cluster controller and then to all workers. Previously, this epoch is set to the current master epoch, which is wrong.	2020-03-20 20:15:09 -07:00
Evan Tschannen	e08f0201f1	merge release 6.2 into master	2020-03-17 12:51:47 -07:00
Evan Tschannen	2038a56ff4	Merge pull request #2819 from etschannen/feature-first-proxy A "proxy" class process would not be preferred as the "first proxy" for restore and DR purposes	2020-03-16 13:53:28 -07:00
Evan Tschannen	012344e297	refactor getWorkersForRoleInDatacenter	2020-03-16 11:50:17 -07:00
Evan Tschannen	79d5511149	A "proxy" class process would not be preferred as the "first proxy" for restore and DR purposes	2020-03-13 17:49:02 -07:00
Evan Tschannen	4640edf5d6	do not recruit satellite tlogs when usable regions=1	2020-03-13 10:24:52 -07:00
Evan Tschannen	303df197cf	Merge branch 'release-6.2' # Conflicts: # CMakeLists.txt # bindings/c/test/mako/mako.c # documentation/sphinx/source/release-notes.rst # fdbbackup/backup.actor.cpp # fdbclient/NativeAPI.actor.cpp # fdbclient/NativeAPI.actor.h # fdbserver/DataDistributionQueue.actor.cpp # fdbserver/Knobs.cpp # fdbserver/Knobs.h # fdbserver/LogRouter.actor.cpp # fdbserver/SkipList.cpp # fdbserver/fdbserver.actor.cpp # flow/CMakeLists.txt # flow/Knobs.cpp # flow/Knobs.h # flow/flow.vcxproj # flow/flow.vcxproj.filters # versions.target	2020-03-06 18:22:46 -08:00
Evan Tschannen	f3ac2c9180	renamed a variable	2020-03-04 18:49:21 -08:00
Evan Tschannen	b3ea9d5896	Do not allow the cluster controller to mark any process as failed within 30 seconds of startup	2020-03-04 18:45:26 -08:00
negoyal	cd949eca71	Merge branch 'master' into fdb_cache_subfeature2	2020-02-26 11:22:08 -08:00
Evan Tschannen	96258b9809	Merge branch 'release-6.2' # Conflicts: # documentation/sphinx/source/release-notes.rst # fdbcli/fdbcli.actor.cpp # fdbclient/ManagementAPI.actor.cpp # fdbrpc/FlowTransport.actor.cpp # fdbserver/ClusterController.actor.cpp # fdbserver/DataDistribution.actor.cpp # fdbserver/DataDistribution.actor.h # fdbserver/DataDistributionQueue.actor.cpp # fdbserver/KeyValueStoreMemory.actor.cpp # fdbserver/MasterProxyServer.actor.cpp # fdbserver/QuietDatabase.actor.cpp # fdbserver/SkipList.cpp # fdbserver/StorageMetrics.actor.h # fdbserver/TLogServer.actor.cpp # fdbserver/fdbserver.actor.cpp # fdbserver/storageserver.actor.cpp # fdbserver/workloads/KVStoreTest.actor.cpp # flow/CMakeLists.txt # flow/Knobs.cpp # flow/Knobs.h # flow/genericactors.actor.cpp # flow/serialize.h	2020-02-21 19:09:16 -08:00
Evan Tschannen	8b768e66df	Merge pull request #2694 from dongxinEric/feature/2663/specialize-policy-for-zoneid-in-cc Added a specialized algorithm for PolicyOne and PolicyAcross(,'zoneId…	2020-02-20 14:46:23 -08:00
Evan Tschannen	574e88ba8e	updateGoodRemoteRecruitmentTime was unnecessary because the only way findRemoteWorkers would return would be after a new server has joined which already resets goodRemoteRecruitmentTime	2020-02-20 13:46:22 -08:00
Xin Dong	99095c9224	Again make Clang happy.	2020-02-20 09:50:22 -08:00
Xin Dong	298d6cb3d7	Address review comments.	2020-02-20 09:34:01 -08:00
Evan Tschannen	fbd45963d8	The cluster controller waits until no new workers register for 1.0 before starting a bad recruitment	2020-02-19 16:48:30 -08:00
Xin Dong	89fcbb2055	Make clang happy	2020-02-19 09:44:15 -08:00
Xin Dong	efc0d7f9d5	Added a specialized algorithm for PolicyOne and PoilcyAcross(,'zoneId',PolicyOne()) to find a set of TLog servers which will be able to fulfill the policy later.	2020-02-19 09:25:57 -08:00
negoyal	85cc35e81e	Merge branch 'master' into HEAD	2020-02-05 14:59:55 -08:00
Evan Tschannen	844c8511c4	Merge pull request #2588 from jzhou77/backup-worker Integrate new backup worker with existing backup command	2020-02-05 14:14:43 -08:00
Jingyu Zhou	52c6737411	Rename backupLoggingEnabled as backupWorkerEnabled To highlight the changes for 7.0 backup changes. By default, backup_worker_enabled flag is set for 7.0 version.	2020-02-04 10:09:16 -08:00
Jingyu Zhou	0db03f1d3c	Use backup_logging_enabled flag The default is to enable new backup workers. Users can disable this flag to turn off the backup worker feature.	2020-02-03 20:03:22 -08:00
Evan Tschannen	4524831456	Merge pull request #2518 from vishesh/task/failmon-remove-server FailureMonitoring: Server processes no longer need to talk to ClusterController	2020-02-03 17:22:50 -08:00
Jingyu Zhou	38aa1903fd	Add a DB configuration option for backup workers Right now, the default is to keep the old backup behavior, i.e., do NOT use backup workers. Specifically, if BackupType is not set (or is set to default), the master will not recruit backup workers and will not add pseudo locality for backup workers. The StartFullBackupTaskFunc is updated to check if backup worker is enabled. Only when it is not enabled, starting a backup will wait on all backup workers to be started.	2020-01-31 19:29:09 -08:00
Jingyu Zhou	6ddf73e26a	Remove code introduced when resolving merge conflicts	2020-01-22 21:23:38 -08:00
Jingyu Zhou	c6c39ca99d	Update better master exist with backup workers During recruitment, if there is no desired log router count, use tlog size instead, because the number of backup workers has to be larger than 0.	2020-01-22 19:43:40 -08:00
Jingyu Zhou	56a2c37071	Recruit backup workers for single region Enable log router tags for single region, which are popped by backup workers. Need to add noop for backup workers if there is no active backups.	2020-01-22 19:42:13 -08:00
Jingyu Zhou	19d6a889ff	Recruit backup workers for old epochs If there are unfinished ranges in the old epochs, the new master will recruit backup workers responsible for finishing these ranges. These workers remains in the cluster until the next epoch, when it will remove itself.	2020-01-22 19:38:45 -08:00
Jingyu Zhou	7da9f47f26	Enable pop from backup workers This is still WIP as some edge cases can trigger test failure, most likely due to not popping mutations by backup workers when epoch ends.	2020-01-22 19:38:45 -08:00
Jingyu Zhou	ece3cadf8e	Recruit backup worker during master recovery Right now recruit the same number as TLogs. The backup worker does nothing.	2020-01-22 19:37:48 -08:00
Jingyu Zhou	de8d953865	Add backup role, class, and worker skeleton	2020-01-22 19:35:30 -08:00
Vishesh Yadav	daef5f011a	Merge remote-tracking branch 'apple/master' into task/failmon-remove-server	2020-01-21 13:20:15 -08:00
Evan Tschannen	3f9d9d8b84	Merge branch 'release-6.2' # Conflicts: # CMakeLists.txt # cmake/FlowCommands.cmake # documentation/sphinx/source/release-notes.rst # fdbclient/StorageServerInterface.h # fdbserver/DataDistributionTracker.actor.cpp # fdbserver/MasterProxyServer.actor.cpp # fdbserver/fdbserver.actor.cpp # flow/Knobs.h # flow/Platform.cpp # versions.target	2020-01-16 18:37:47 -08:00
Evan Tschannen	d55e56993d	fix: the cluster controller would not recruit more remote logs before the database became fully_recovered	2020-01-10 12:21:48 -08:00
Alvin Moore	7628d04fb9	Merge branch 'release-6.2' of github.com:apple/foundationdb into release_6.2_merge # Conflicts: # documentation/sphinx/source/release-notes.rst	2020-01-09 07:21:16 -08:00
mpilman	d3d6016c90	Merge remote-tracking branch 'negoyal/fdb_cache_subfeature2' into features/cache-initialization	2020-01-07 19:53:09 -08:00
Vishesh Yadav	6e6cfaff16	Cleanup old Failure Monitoring code	2020-01-07 15:53:32 -08:00
negoyal	29b77863f0	Cache warmup and Consistency check workload changes.	2020-01-07 13:06:58 -08:00
Evan Tschannen	3eae401886	fix: we were recruiting one too few oldLogRouters code cleanup	2020-01-02 15:05:44 -08:00
Evan Tschannen	5e5e618da0	during recovery, only send the full serverDBInfo to processes that are part of the new generation	2019-12-09 13:17:49 -08:00
Evan Tschannen	bcce5968a4	recruit oldLogRouters on TLogs, do not recruit oldLogRouters on the cluster controller if possible	2019-12-09 13:12:13 -08:00
mpilman	821edcb207	Register caches through keyspace This also removes the old mechanism that registers them through the serverDBInfo. Caches do now self-recruit at startup	2019-12-06 13:28:44 -08:00
negoyal	cf2563f1c7	Mix of various things, a lot of which will change.	2019-12-05 17:10:32 -08:00
Evan Tschannen	3c769fcf60	Merge branch 'release-6.2' # Conflicts: # CMakeLists.txt # documentation/sphinx/source/release-notes.rst # fdbserver/ClusterController.actor.cpp # fdbserver/MasterProxyServer.actor.cpp # versions.target	2019-11-22 15:39:19 -08:00
Evan Tschannen	ebcb2f79ed	Merge branch 'master' of github.com:apple/foundationdb	2019-11-22 15:34:49 -08:00
A.J. Beamon	7c801513e2	Fix cases where latency band config could be discarded during recovery or process start.	2019-11-20 11:44:18 -08:00
Evan Tschannen	8d3ef89540	Merge branch 'release-6.2' # Conflicts: # CMakeLists.txt # documentation/sphinx/source/release-notes.rst # fdbclient/MutationList.h # fdbserver/MasterProxyServer.actor.cpp # versions.target	2019-11-14 15:49:56 -08:00
Evan Tschannen	ffc89d1182	fix: dd test recruitment should prefer the location of ratekeeper over other used processes	2019-11-13 12:58:55 -08:00
Balachandar Namasivayam	2e41497580	This commit tries to distribute RK and DD among other empty available processes.	2019-11-12 17:52:42 -08:00
Balachandar Namasivayam	f5282f2c7e	Fix bug where DD or RK could be halted and re-recruited in a loop for certain valid process class configurations. Specifically, recruitment of DD or RK takes into account that master process is preferred over proxy, resolver or cc. But check for better DD only looks for better machine class ignoring that the new recruit could share a proxy or resolver or CC. Also try to balance the distribution of the DD and RK role if there are enough processes to do so.	2019-11-12 14:22:36 -08:00
negoyal	a4a0bf18f9	Merging with Master.	2019-11-12 13:01:29 -08:00
Evan Tschannen	688940b685	merge 6.2 into master	2019-10-21 11:43:46 -07:00
Evan Tschannen	43e99ef6a4	fix: better master exists must check if fitness is better for proxies or resolvers before looking at the count of either of them	2019-10-17 13:18:31 -07:00
Evan Tschannen	298b815109	one proxy or resolver with best fitness no longer prevents more proxies or resolvers from being recruited with good fitness	2019-10-14 18:32:17 -07:00
Evan Tschannen	5064d91b75	fix: the cluster controller would not change to a new set of satellite tlogs when they become available in a better satellite location	2019-10-14 18:31:23 -07:00
Evan Tschannen	35e816e9ad	added the ability to configure satellite_logs by satellite location, this will overwrite the region configure if both are present	2019-10-14 18:30:15 -07:00
A.J. Beamon	31ce56eddf	Add cluster controller metrics	2019-10-03 15:29:11 -07:00
Evan Tschannen	b495cc697b	Merge branch 'release-6.2' # Conflicts: # CMakeLists.txt # documentation/sphinx/source/release-notes.rst # versions.target	2019-09-13 09:25:08 -07:00
Evan Tschannen	a62862c105	add yieldedFutures to prevent slow tasks	2019-09-11 16:26:48 -07:00
Evan Tschannen	945cff1e5b	the cluster controller caches the serialization of serverDBInfo, to avoid regenerating it many times	2019-09-10 14:27:22 -07:00
Meng Xu	39680fa515	StorageEngineSwitch:Clean up unnecessary trace And do not trigger storage recruitment unnecessarily.	2019-08-19 14:11:57 -07:00
Meng Xu	4ab322f52c	Merge branch 'master' into mengxu/storage-engine-switch-PR-v2	2019-08-19 13:06:32 -07:00
Meng Xu	3034a5e0c5	StorageRecruitment:Suppress outstanding req errors When too many outstanding requests cannot find a worker for storage server role, many same errors will be put into trace log. Only one error is enough to alert the problem. Too many same errors cause false positive in nightly test and thus should be suppressed.	2019-08-14 11:31:06 -07:00
Meng Xu	a588710376	StorageEngineSwitch:Graceful switch When fdbcli change storeType for storage engines, we switch the store type of storage servers one by one gracefully. This avoids recruiting multiple storage servers on the same process, which can cause OOM error.	2019-08-12 17:37:52 -07:00
Evan Tschannen	90e3b50213	Merge branch 'master' into feature-coordinator-connection # Conflicts: # fdbclient/DatabaseContext.h # fdbclient/NativeAPI.actor.cpp # fdbclient/NativeAPI.actor.h # fdbserver/workloads/KillRegion.actor.cpp	2019-07-26 15:05:02 -07:00
Evan Tschannen	be5d144b8b	added status information on connected clients	2019-07-25 17:15:31 -07:00
Jingyu Zhou	bbeaf0ebbb	Add a monitorServerInfoConfig() call back This was deleted during a code refactor in `ef868f5`. Because no tests were complaining, we didn't find this until now.	2019-07-25 15:17:26 -07:00
Evan Tschannen	4a866290b7	Clients keep a persistent connection open with coordinators to get updates to the list of proxies Status still needs to be updated with client information with information from the coordinators	2019-07-23 19:22:44 -07:00
Jingyu Zhou	50e7593c5b	Merge pull request #1796 from ajbeamon/remove-trace-event-underscores Remove trace event underscores	2019-07-05 21:45:55 -07:00
A.J. Beamon	9f4b6fd770	Remove additional underscores	2019-07-05 08:12:25 -07:00
Alex Miller	7a500cd37f	A giant translation of TaskFooPriority -> TaskPriority::Foo This is so that APIs that take priorities don't take ints, which are common and easy to accidentally pass the wrong thing.	2019-06-25 02:47:35 -07:00
Vishesh Yadav	a8e408e268	run clang-format on changes	2019-06-10 14:10:24 -07:00
Vishesh Yadav	6fa7081a21	net: Don't make FailureMonitoring requests from client This patch removes the need for clients to continuously contact cluster coordinator for failure monitoring information. Instead, it uses the FlowTransport to monitor the statuses of peers and update FailureMonitor accordingly.	2019-06-09 00:43:38 -07:00
Evan Tschannen	29b96414e2	Merge branch 'release-6.1' # Conflicts: # documentation/sphinx/source/release-notes.rst # fdbclient/NativeAPI.actor.cpp # fdbserver/Coordination.actor.cpp # flow/Arena.h # versions.target	2019-06-03 18:49:35 -07:00
Evan Tschannen	7c333dbc16	If a process receives a message in its clusterControllerInterface before becoming the cluster controller, if the process does not become the cluster controller in the next minute it should destroy the interface to prevent a memory leak.	2019-05-29 16:57:13 -07:00
A.J. Beamon	5f55f3f613	Replace g_random and g_nondeterministic_random with functions deterministicRandom() and nondeterministicRandom() that return thread_local random number generators. Delete g_debug_random and trace_random. Allow only deterministicRandom() to be seeded, and require it to be seeded from each thread on which it is used.	2019-05-10 14:01:52 -07:00
Andrew Noyes	6207d724f8	Fix all -Wunused-variable warnings	2019-04-15 18:13:00 -07:00
mpilman	1c16f87a4e	Remove trace-calls to printable (in non-workloads)	2019-04-05 13:12:19 -07:00
mpilman	c008e16c81	Defer formatting in traces to make them cheaper This is the first part of making `TraceEvent` cheaper. The main idea is to defer calls to any code that formats string. These are the main changes: - TraceEvent::detail now takes a c-string instead of std::string for literals. This prevents unnecessary allocations if the trace is not going to be printed in the first place (for example for SevDebug). Before that `detail` expected a `std::string` as key, which mean that any string literal would be copied on each call. - Templates Traceable and SpecialTraceMetricType. These templates can be specialized for any type that needs to be printed. The actual formatting will be deferred to after the `enabled` check. This provides two benefits: (1) if a TraceEvent is disabled, we don't pay for the formatting and (2) TraceEvent can trace types that it doesn't know about. - TraceEvent::enabled will be set in the constructor if the Severity is passed. This will make sure that `TraceEvent::init` is not called. - `TraceEvent::detail` will be inlined. So for disabled TraceEvent calls, a call to detail will only introduce a if-branch which is much cheaper than a function call.	2019-04-05 13:12:19 -07:00
Evan Tschannen	8ebf771392	cleanup cluster controller trace events	2019-03-30 14:17:18 -07:00
A.J. Beamon	71e2fdafb8	Changes to ratekeeper camel case	2019-03-27 08:24:25 -07:00
Evan Tschannen	5e03e178de	Merge pull request #1345 from ajbeamon/support-multiple-client-or-worker-issues Add support for a client or worker having multiple issues.	2019-03-24 17:27:50 -07:00
Evan Tschannen	d45159ebf7	Merge pull request #1307 from jzhou77/ratekeeper Monitor placement of Ratekeeper and DataDistributor	2019-03-24 17:26:07 -07:00
Evan Tschannen	d6ad027d37	ratekeeper needs to be recruited for proxies to make progress, so if one has not registered with the cluster controller by the time we are accepting commits, recruit a new one	2019-03-24 16:48:24 -07:00
Evan Tschannen	f426d732ea	fix: forgot to remove one location where id_used was incremented for distributor and ratekeeper	2019-03-24 16:04:59 -07:00
Evan Tschannen	e8948726e8	once we recruit a ratekeeper, do not allow any other ratekeepers to register	2019-03-24 11:04:39 -07:00
Jingyu Zhou	40eec20252	Restore master PID in worker registration This fix is lost during merge.	2019-03-23 21:02:11 -07:00
Jingyu Zhou	3ef26e6be3	Fix fitness assignment statements Found by MacOS build.	2019-03-23 19:16:04 -07:00
Evan Tschannen	1fc6937802	changed NetworkAddressList to at most two addresses for performance	2019-03-23 17:54:46 -07:00
Evan Tschannen	b51a24453e	the data distributor and ratekeeper are not included in id_used, but when comparing equally good options we prefer to avoid sharing with those roles excluded data distributor and ratekeeper were improperly killed when the best option was also excluded	2019-03-23 13:25:36 -07:00
Jingyu Zhou	fdc5b5ddbf	Fix: spurious ratekeeper registration A rare race condition: -r simulation -f ./foundationdb/tests/slow/WriteDuringReadAtomicRestore.txt -s 114256311 -b on - A is the ratekeeper. - CC recruit B and B starts - CC halts ratekeeper A and A is halted - A registers back with CC, which then halts B. CC sets A to be the ratekeeper. CC starts recruiting and finds A is the best machine. But skips recruiting because CC thinks A is already used. Now the cluster is left with no ratekeeper. Fix by disallowing ratekeeper registration with previous ID.	2019-03-23 11:03:51 -07:00
Jingyu Zhou	6523cd4931	Fix: recruit ratekeeper is not triggerred	2019-03-23 09:20:54 -07:00
Evan Tschannen	2da46e3172	fix: halt if datacenters are different	2019-03-22 23:53:21 -07:00
Evan Tschannen	d34c56c9a5	ensure that the processId exists in id_worker before accessing it	2019-03-22 18:54:39 -07:00
Evan Tschannen	36ab852bb1	Merge branch 'master' into ratekeeper # Conflicts: # fdbserver/ClusterController.actor.cpp	2019-03-22 18:41:00 -07:00
Evan Tschannen	ddb6058770	simplified ratekeeper monitoring loop	2019-03-22 18:22:45 -07:00
Jingyu Zhou	12917d8c7d	Add actors to store halt request futures Address best fitness in checking better DD or RK.	2019-03-22 18:06:38 -07:00
Jingyu Zhou	e8977aeb98	Remove clusterControllerDcId check This is no longer needed since it'll be set in the ctor.	2019-03-22 18:01:54 -07:00
Evan Tschannen	82bc447e29	startRatekeeper is responsible for updating serverDBInfo	2019-03-22 17:56:16 -07:00
Evan Tschannen	82c80c225d	make sure id_worker is updated before setting ratekeeper or data distribution	2019-03-22 17:08:54 -07:00
Evan Tschannen	6a9c9d79cc	Update fdbserver/ClusterController.actor.cpp	2019-03-22 17:00:58 -07:00
Evan Tschannen	70b1c88cdd	Update fdbserver/ClusterController.actor.cpp	2019-03-22 17:00:52 -07:00
Jingyu Zhou	16f54577ee	Restore master PID in cluster controller worker registration CC may think master failed and clear the master PID, which can block both data distributor and ratekeeper recruitment. Fix by restoring it during worker registration.	2019-03-22 14:53:05 -07:00
A.J. Beamon	4eb5715689	Add support for a client or worker having multiple issues.	2019-03-22 08:29:41 -07:00
Jingyu Zhou	da338c3ad6	Avoid unnecessary recuriting of DD or RK While waiting for recruting data distributor or ratekeeper, a previous one could already joined. So we can skip this unnecessary recruiting. Revert the change of worker.actor.cpp for ratekeeper. Instead, recruiting ratekeeper should avoid the process with an existing one. This fixes a bug where the ratekeeper interface became zombie, killing other healthy ratekeeper but doing no useful work. Found by: -r simulation --crash -f tests/fast/WriteDuringRead.txt -s 31858110 -b on	2019-03-21 22:40:07 -07:00
Evan Tschannen	fe4464e786	fix: processClassFitness could be wrong if the client changed their class while rebooting	2019-03-21 17:56:04 -07:00
Jingyu Zhou	299961aecb	Move ratekeeper or data distributor from excluded servers	2019-03-21 17:17:33 -07:00
Jingyu Zhou	48324ad4be	Fix a race during ratekeeper registration When a ratekeeper registers, the monitorRatekeeper wakes up and recruits a new ratekeeper. Adding a 0s delay to avoid this. If a ratekeeper is recruited on an existing machine, update the interface so that the cluster controller can clear the ratekeeperID.	2019-03-21 12:56:56 -07:00
Evan Tschannen	e692f0f70f	fix: degraded is only used for tlog recruitment, so we should not use it in the fitness calculation for other roles	2019-03-21 11:23:49 -07:00
Jingyu Zhou	8edefda193	Fix test stuck due to invalid worker in cluster controller Test case: -r simulation --crash -f ./tests/rare/CloggedCycleWithKills.txt -s 688927581 -b off	2019-03-20 22:24:01 -07:00
Jingyu Zhou	937b6dde31	Fix a race of DD, RK, Master failure If all DD, RK, Master run on the same process and failed. Recruiting of new DD or RK could try to use the old master worker interface, which is an invalid one and causes recruitment to be stuck. Fix by adding a delay and checking master is valid before recruitment.	2019-03-20 16:19:20 -07:00
Jingyu Zhou	ce5c6d18d2	Fix ratekeeper recruitment bug	2019-03-20 14:22:22 -07:00
Jingyu Zhou	86b687981b	Fix ratekeeper and data distributor recruiting bug Avoid multiple concurrent recuriting of ratekeepers with a recruiting flag. Fix endless recruiting when the chosen worker is a proxy or a resolver -- prefer master in this case.	2019-03-20 10:00:31 -07:00
Jingyu Zhou	474abd81bd	Move placement monitoring inside doCheckOutstandingRequests	2019-03-19 22:48:21 -07:00
Balachandar Namasivayam	f9560e1abd	Addressed Review Comments	2019-03-19 15:23:14 -07:00
Jingyu Zhou	bc6fdaea3e	Recruit a new ratekeeper before halting the old	2019-03-19 15:21:46 -07:00
Jingyu Zhou	0fb6a03c07	First round of review comment fixes for PR#1307	2019-03-19 11:29:19 -07:00
Jingyu Zhou	8d609eb51d	Protect ratekeeper registration race during recruitment This is similar one to DataDistributor.	2019-03-18 13:53:50 -07:00
Balachandar Namasivayam	5471725db5	Support config where the primary and remote DC's can be used as satellites.	2019-03-18 12:17:59 -07:00
Jingyu Zhou	2b41a97a6e	Fix the issue of slow dying Data Distributor Test with: -r simulation -f ./foundationdb/tests/slow/CommitBug.txt -s 67828576 -b on The test has the following event sequence: - Time 113.3s, CC noticed DD failure, cleard DD interface. - 1s later, DD rejoined and registered with CC. - Time 131.7s, DD actor cancelled. This old DD raced to register with CC and the failure monitor is not installed because monitorDataDistributor is stalled waiting for new DD. - Time 161.4s, new DD running. New DD recruting was delayed due to no servers in the period. Fix by disabling DD registration during the recruting process.	2019-03-17 22:19:23 -07:00
Jingyu Zhou	254c78053c	Fix a segfault error After wait, ServerDBInfo may have changed. Using the old copy is wrong.	2019-03-15 22:11:13 -07:00
Jingyu Zhou	12ddd56698	Fix Ratekeeper and DataDistributor placement Make sure both RateKeeper and DataDistributor are placed in the same data center as the Master. Make sure only one RateKeeper is live in the cluster as well.	2019-03-15 17:09:28 -07:00
Jingyu Zhou	bb5686eb75	Fix monitoring of DD and RK	2019-03-15 16:02:17 -07:00
Jingyu Zhou	9f6fe5f649	Merge remote-tracking branch 'apple/master' into ratekeeper	2019-03-15 11:30:04 -07:00
Jingyu Zhou	40860e0093	Attempt to fix.	2019-03-15 11:29:04 -07:00
Jingyu Zhou	99d521ef4f	Monitor Ratekeeper and DataDistributor to use stateless processes Since Ratekeeper and DataDistributor are no longer running with Master, they might be running with stateful processes before a new Master becomes alive, which is undesirable. This PR adds a monitoring of both Ratekeeper and DataDistributor at Cluster Controller -- if Master runs on a stateless class and RK/DD runs at a worse class, then RK/DD will be killed. I.e., RK/DD should be running at their own classes or on the same stateless process as Master. After restart, RK/DD should be running at a better process class.	2019-03-14 15:00:57 -07:00
Meng Xu	5a10bf5dfc	Merge branch 'master' into mengxu/tls-switch-status-PR	2019-03-14 10:35:12 -07:00
Evan Tschannen	a2108047aa	removed LocalitySetRef and IRepPolicyRef typedefs, because for clarity the Ref suffix is reserved for arena allocated objects instead of reference counted objects.	2019-03-13 13:14:39 -07:00
Evan Tschannen	e068c478b5	merge master	2019-03-12 18:31:25 -07:00
Evan Tschannen	5392742902	fixed review comments	2019-03-12 14:38:54 -07:00
Jingyu Zhou	2b0139670e	Fix review comment for PR 1176	2019-03-12 12:02:30 -07:00
Meng Xu	46f4b02807	TLS Status: Resolve review comments Use connectedCoordinatorsNumDelayed to reduce the load on cluster controller; Set connectedCoordinatorsNum to null by default for monitorLeader()	2019-03-11 17:10:08 -07:00
Evan Tschannen	1be9ae5ce3	fixed merge conflict	2019-03-08 22:51:06 -05:00
Evan Tschannen	044b6b4f8a	Merge branch 'master' into feature-degraded-tlog # Conflicts: # fdbserver/ClusterController.actor.cpp	2019-03-08 22:50:41 -05:00
Evan Tschannen	45fe6b369b	tlog recruitment will prefer non-degraded processes, however it will not choose less than desired number of tlogs to avoid degraded processes better master exists will switch the master to avoid degraded processes	2019-03-08 14:40:00 -05:00
Evan Tschannen	710a64dc4e	replaced std::pair<WorkerInterface,ProcessClass> with a struct named WorkerDetails	2019-03-08 11:25:07 -05:00
Jingyu Zhou	517966fce2	Remove lastLimited from rate keeper Refactor code to make IDE happy.	2019-03-07 13:16:20 -08:00
Jingyu Zhou	36a51a7b57	Fix a segfault bug due to uncopied ratekeeper interface	2019-03-07 13:16:20 -08:00
Jingyu Zhou	e6ac3f7fe8	Minor fix on ratekeeper work registration.	2019-03-07 13:16:20 -08:00
Jingyu Zhou	3c86643822	Separate Ratekeeper from data distribution. Add a new role for ratekeeper. Remove StorageServerChanges from data distribution. Ratekeeper monitors storage servers, which borrows the idea from DataDistribution.	2019-03-07 13:16:20 -08:00
Meng Xu	04880e3d4d	Merge branch 'master' into mengxu/tls-switch-status-PR	2019-03-06 13:41:16 -08:00
Alex Miller	c6a65389ae	Remove noexcept macro and replace with BOOST_NOEXCEPT. BOOST_NOEXCEPT does what the noexcept macro was supposed to do, but in a way that is correctly maintained over time.	2019-03-05 22:06:12 -08:00
Meng Xu	820548223a	Status: connected_coordinators misc minor changes Change the rst document file; Change the coding style to be consistent with the nearby code; Ensure we always initilize the connectedCoordinatesNum to 0 even when the variable is not used.	2019-03-05 21:45:18 -08:00
Meng Xu	b7a52e81e2	Status: Count connected coordinators per client A client will always try to connect all coordinators. This commit let Status track the number of connected coordinators for each client. This allows us to do canary in coordinators. For example, when we switch from non-TLS to TLS, we can switch 1 coordinator from non-TLS to TLS. This can help check if a client has the ability to connect through TLS. We can make the non-TLS to TLS switch for each coordinators one by one. This avoid the risk of losing connection in the switch.	2019-03-05 21:21:23 -08:00
Meng Xu	c0535c49bb	Status: TLS client status Use ClientStatusInfo structure for each network address (client), instead of passing each status info as a parameter.	2019-03-04 16:35:10 -08:00
Meng Xu	94385447bc	Status: Get if client configured TLS To understand if all clients have configured TLS, we check the tlsoption when a client tries to open database. This is similar to how we track the versions of multi-version clients.	2019-03-01 15:17:01 -08:00
Evan Tschannen	b8910ba7cd	Merge branch 'master' into feature-fix-force-recovery # Conflicts: # fdbclient/ManagementAPI.actor.h # fdbserver/DataDistribution.actor.cpp # fdbserver/storageserver.actor.cpp # fdbserver/workloads/KillRegion.actor.cpp	2019-02-22 14:38:13 -08:00
Evan Tschannen	d4737fac0f	knobify force recovery recovery check delay	2019-02-19 16:05:20 -08:00
mpilman	3f0fd2a20c	Use fwd decls in WorkerInterface Also WorkerInterface.h -> WorkerInterface.actor.h	2019-02-19 15:16:59 -08:00
mpilman	27a3153719	Use ACTOR forward declarations in MoveKeys Also MoveKeys.h -> MoveKeys.actor.h	2019-02-19 15:16:59 -08:00
mpilman	3a0f9839b9	Fix minor IDE build errors	2019-02-19 15:16:59 -08:00
mpilman	0bb60e5a3b	Use proper fwd decl in NativeAPI Also NativeAPI.h -> NativeAPI.actor.h	2019-02-19 15:16:59 -08:00
Evan Tschannen	ed9e20ce17	forgot to fix merge conflicts	2019-02-18 17:09:55 -08:00
Evan Tschannen	065a45e05f	Merge branch 'master' into feature-fix-force-recovery # Conflicts: # fdbclient/ManagementAPI.actor.cpp # fdbserver/ClusterController.actor.cpp # fdbserver/workloads/KillRegion.actor.cpp	2019-02-18 17:09:06 -08:00
Evan Tschannen	8f2af8bed1	fix: forced recoveries now require a target dcid which will become the new primary location. During the forced recovery, the configuration will be changed to make that location primary, and usable_regions will be set to 1. If the target dcid is already the primary location, the forced recovery will do nothing. This makes forced recoveries idempotent, so it is safe to the client to re-send forced recovery commands to the cluster controller. fix: the cluster controller attempts to do a commit to determine if the cluster is alive, since its own internal recoveryState might not be up-to-date. fix: forceMasterFailure on the cluster controller did not always cause the current master to be re-recruited	2019-02-18 14:54:28 -08:00
Vishesh Yadav	e05b53d755	Merge remote-tracking branch 'apple/master' into task/tls-upgrade	2019-02-15 20:37:07 -08:00
Jingyu Zhou	5e6577cc82	Final cleanup per review comments Make distributor interface optional in ServerDBInfo and many other small changes.	2019-02-14 16:37:17 -08:00
Evan Tschannen	171a69c810	Update fdbserver/ClusterController.actor.cpp Co-Authored-By: jzhou77 <jingyuzhou@gmail.com>	2019-02-14 16:37:16 -08:00
Evan Tschannen	a4b2c9ef88	Update fdbserver/ClusterController.actor.cpp Co-Authored-By: jzhou77 <jingyuzhou@gmail.com>	2019-02-14 16:37:16 -08:00
Jingyu Zhou	0e47912192	Fix an out-of-memory error	2019-02-14 16:37:16 -08:00
Jingyu Zhou	62c67a50e5	Fix segfault error The usedIds is updated by master registration request, which populates the usedIds map. However, this request may contain processes that cluster controller is not aware, i.e., not in id_worker map. This is ok until I added tracing the usedIds, which silently insert an empty entry into id_worker map for the unknown process. This new entry can cause crashing failure when trying to access its LocalityData. Remove AsyncTrigger for usedIds, and change to serverInfo->onChange. Use const & to avoid unnecessary copies in WorkerInterface's LocalityData and getExtraTLogEligibleMachines().	2019-02-14 16:37:16 -08:00
Jingyu Zhou	21066b013a	Remove DataDistributorRejoinRequest This is no longer needed, since worker registration piggybacks distributor interface now.	2019-02-14 16:37:16 -08:00
Jingyu Zhou	816f8b1ae1	Per review comments Add a knob for starting distributor delay. Move distributor failed variable to a local loop.	2019-02-14 16:37:16 -08:00
Jingyu Zhou	578473a974	Various review comments fixes	2019-02-14 16:37:16 -08:00
Jingyu Zhou	b3d1633114	Fix bugs of missing request The quite database can fail to send out requests and report timeout. This seems to be caused by reusing a request that uses the same ReplyPromise. Another bug is Proxy can wait for unneeded time for a dabase change, while the distributor is already known to itself.	2019-02-14 16:37:16 -08:00
Evan Tschannen	5fb48083cd	Update fdbserver/ClusterController.actor.cpp Co-Authored-By: jzhou77 <jingyuzhou@gmail.com>	2019-02-14 16:37:16 -08:00
Jingyu Zhou	8c61de318f	Fix segfault and no_more_servers errors	2019-02-14 16:37:16 -08:00
Jingyu Zhou	7897616164	Fix wait failure bug on cluster controller The setDistributor() sets an AsyncVar and then runs waitFailureClient. This ordering is wrong because the AsyncVar::set triggers the other loop to run first, which will wait on Never(). The correct code should wait on the Future returned by the waitFailureClient.	2019-02-14 16:37:16 -08:00
Jingyu Zhou	00f2253229	Piggyback data distributor interface in worker registration This allows cluster controller to know data distributor during worker registration phase, thus avoiding recruiting a new data distributor after starting. Also change the worker to skip creating a new data distributor if there is already one running on the worker, which can trigger operation timeout in tests.	2019-02-14 16:37:16 -08:00
Jingyu Zhou	39e4a59154	Add used worker IDs to cluster controller This "usedIds" is updated when receiving a master registration message, so that when recruiting new data distributor, existing assignment is known.	2019-02-14 16:37:16 -08:00
Jingyu Zhou	6a655143e8	A follow-on fix for config key usage And some trace event cleanups.	2019-02-14 16:37:16 -08:00
Jingyu Zhou	be5c962bb7	Add a new configuration version key \xff/conf/version This fixed a bug found by upgrade test, where the configuration monitor of the data distributor was monitoring excludedServersVersionKey, which doesn't change in ChangeConfig workload. As a result, data distributor was not aware of configuration changes. Adding this new key and make sure this key is updated in configuration changes so that the monitor can detect configuration changes.	2019-02-14 16:37:16 -08:00

... 2 3 4 5 6 ...

503 Commits