foundationdb

Commit Graph

Author	SHA1	Message	Date
Evan Tschannen	b8910ba7cd	Merge branch 'master' into feature-fix-force-recovery # Conflicts: # fdbclient/ManagementAPI.actor.h # fdbserver/DataDistribution.actor.cpp # fdbserver/storageserver.actor.cpp # fdbserver/workloads/KillRegion.actor.cpp	2019-02-22 14:38:13 -08:00
Evan Tschannen	d4737fac0f	knobify force recovery recovery check delay	2019-02-19 16:05:20 -08:00
mpilman	3f0fd2a20c	Use fwd decls in WorkerInterface Also WorkerInterface.h -> WorkerInterface.actor.h	2019-02-19 15:16:59 -08:00
mpilman	27a3153719	Use ACTOR forward declarations in MoveKeys Also MoveKeys.h -> MoveKeys.actor.h	2019-02-19 15:16:59 -08:00
mpilman	3a0f9839b9	Fix minor IDE build errors	2019-02-19 15:16:59 -08:00
mpilman	0bb60e5a3b	Use proper fwd decl in NativeAPI Also NativeAPI.h -> NativeAPI.actor.h	2019-02-19 15:16:59 -08:00
Evan Tschannen	ed9e20ce17	forgot to fix merge conflicts	2019-02-18 17:09:55 -08:00
Evan Tschannen	065a45e05f	Merge branch 'master' into feature-fix-force-recovery # Conflicts: # fdbclient/ManagementAPI.actor.cpp # fdbserver/ClusterController.actor.cpp # fdbserver/workloads/KillRegion.actor.cpp	2019-02-18 17:09:06 -08:00
Evan Tschannen	8f2af8bed1	fix: forced recoveries now require a target dcid which will become the new primary location. During the forced recovery, the configuration will be changed to make that location primary, and usable_regions will be set to 1. If the target dcid is already the primary location, the forced recovery will do nothing. This makes forced recoveries idempotent, so it is safe to the client to re-send forced recovery commands to the cluster controller. fix: the cluster controller attempts to do a commit to determine if the cluster is alive, since its own internal recoveryState might not be up-to-date. fix: forceMasterFailure on the cluster controller did not always cause the current master to be re-recruited	2019-02-18 14:54:28 -08:00
Vishesh Yadav	e05b53d755	Merge remote-tracking branch 'apple/master' into task/tls-upgrade	2019-02-15 20:37:07 -08:00
Jingyu Zhou	5e6577cc82	Final cleanup per review comments Make distributor interface optional in ServerDBInfo and many other small changes.	2019-02-14 16:37:17 -08:00
Evan Tschannen	171a69c810	Update fdbserver/ClusterController.actor.cpp Co-Authored-By: jzhou77 <jingyuzhou@gmail.com>	2019-02-14 16:37:16 -08:00
Evan Tschannen	a4b2c9ef88	Update fdbserver/ClusterController.actor.cpp Co-Authored-By: jzhou77 <jingyuzhou@gmail.com>	2019-02-14 16:37:16 -08:00
Jingyu Zhou	0e47912192	Fix an out-of-memory error	2019-02-14 16:37:16 -08:00
Jingyu Zhou	62c67a50e5	Fix segfault error The usedIds is updated by master registration request, which populates the usedIds map. However, this request may contain processes that cluster controller is not aware, i.e., not in id_worker map. This is ok until I added tracing the usedIds, which silently insert an empty entry into id_worker map for the unknown process. This new entry can cause crashing failure when trying to access its LocalityData. Remove AsyncTrigger for usedIds, and change to serverInfo->onChange. Use const & to avoid unnecessary copies in WorkerInterface's LocalityData and getExtraTLogEligibleMachines().	2019-02-14 16:37:16 -08:00
Jingyu Zhou	21066b013a	Remove DataDistributorRejoinRequest This is no longer needed, since worker registration piggybacks distributor interface now.	2019-02-14 16:37:16 -08:00
Jingyu Zhou	816f8b1ae1	Per review comments Add a knob for starting distributor delay. Move distributor failed variable to a local loop.	2019-02-14 16:37:16 -08:00
Jingyu Zhou	578473a974	Various review comments fixes	2019-02-14 16:37:16 -08:00
Jingyu Zhou	b3d1633114	Fix bugs of missing request The quiet database can fail to send out requests and report timeout. This seems to be caused by reusing a request that uses the same ReplyPromise. Another bug is Proxy can wait for unneeded time for a database change, while the distributor is already known to itself.	2019-02-14 16:37:16 -08:00
Evan Tschannen	5fb48083cd	Update fdbserver/ClusterController.actor.cpp Co-Authored-By: jzhou77 <jingyuzhou@gmail.com>	2019-02-14 16:37:16 -08:00
Jingyu Zhou	8c61de318f	Fix segfault and no_more_servers errors	2019-02-14 16:37:16 -08:00
Jingyu Zhou	7897616164	Fix wait failure bug on cluster controller The setDistributor() sets an AsyncVar and then runs waitFailureClient. This ordering is wrong because the AsyncVar::set triggers the other loop to run first, which will wait on Never(). The correct code should wait on the Future returned by the waitFailureClient.	2019-02-14 16:37:16 -08:00
Jingyu Zhou	00f2253229	Piggyback data distributor interface in worker registration This allows cluster controller to know data distributor during worker registration phase, thus avoiding recruiting a new data distributor after starting. Also change the worker to skip creating a new data distributor if there is already one running on the worker, which can trigger operation timeout in tests.	2019-02-14 16:37:16 -08:00
Jingyu Zhou	39e4a59154	Add used worker IDs to cluster controller This "usedIds" is updated when receiving a master registration message, so that when recruiting new data distributor, existing assignment is known.	2019-02-14 16:37:16 -08:00
Jingyu Zhou	6a655143e8	A follow-on fix for config key usage And some trace event cleanups.	2019-02-14 16:37:16 -08:00
Jingyu Zhou	be5c962bb7	Add a new configuration version key \xff/conf/version This fixed a bug found by upgrade test, where the configuration monitor of the data distributor was monitoring excludedServersVersionKey, which doesn't change in ChangeConfig workload. As a result, data distributor was not aware of configuration changes. Adding this new key and make sure this key is updated in configuration changes so that the monitor can detect configuration changes.	2019-02-14 16:37:16 -08:00
Jingyu Zhou	3135f1d84b	Cluster controller ignores distributor rejoin After controller starts one, it will wait for that one and ignore any rejoins received later. Add remoteRecovered() to data distribution for remote team collection.	2019-02-14 16:37:16 -08:00
Jingyu Zhou	e0a7162cf8	Add a failure timeout knob for data distributor. Set default time to 1.0s.	2019-02-14 16:37:16 -08:00
Jingyu Zhou	efd000dd11	Remove distributor interface from ClusterControllerData This information is now kept in ServerDBInfo.	2019-02-14 16:37:16 -08:00
Jingyu Zhou	3f7bbc68aa	Remove getDistributorInterface from cluster controller	2019-02-14 16:37:16 -08:00
Jingyu Zhou	ef868f599c	Add DataDistributorInterface to ServerDBInfo Also change the Proxy and QuietDatabase to use the DataDistributorInterface.	2019-02-14 16:37:16 -08:00
Jingyu Zhou	0490160714	Fix according to Evan's comments Use getRateInfo's endpoint as the ID for the DataDistributorInterface. For now, added a "rejoined" flag for ClusterControllerData and Proxy. TODO: move DataDistributorInterface into ServerDBInfo.	2019-02-14 16:30:13 -08:00
Evan Tschannen	1818aab205	Apply suggestions from code review Co-Authored-By: jzhou77 <jingyuzhou@gmail.com>	2019-02-14 16:30:13 -08:00
Jingyu Zhou	886e7ab2ba	Add a new DataDistributor role. Let cluster controller to start a new data distributor role by sending a message to a chosen worker. Change MasterInterface usage in DataDistribution to masterId Add DataDistributor rejoin handling. This allows the data distributor to tell the new cluster controller of its existence so that the controller doesn't spawn a new one. I.e., there should be only ONE data distributor in the cluster. If DataDistributor (DD) doesn't join in a while, then ClusterController (CC) tries to recruit one as DD. CC also monitors DD and restarts one if it failed. The Proxy is also monitoring the DD. If DD failed, the Proxy will ask CC for the new DD. Add GetRecoveryInfo RPC to master server, which is called by data distributor to obtain the recovery Transaction version from the master server.	2019-02-14 16:30:13 -08:00
Vishesh Yadav	907446d0ce	Merge remote-tracking branch 'apple/master' into task/tls-upgrade	2019-02-14 11:37:38 -08:00
A.J. Beamon	b435d51061	Merge branch 'master' into track-server-request-latencies	2019-02-14 08:07:32 -08:00
Evan Tschannen	e9ddd94e27	The failure monitor is given a list of all IP addresses associated with a process The connect packet includes the correct remote address Did a lot of code cleanup Simulation test mixed TLS and non-TLS listeners on the same process	2019-01-31 18:20:14 -08:00
Evan Tschannen	a678f778fa	Increase severity to SevWarnAlways for TooManyStatusRequests trace Co-Authored-By: tclinken <trevorclinkenbeard@gmail.com>	2019-01-28 17:50:50 -08:00
Trevor Clinkenbeard	5b89db811a	Throttle status requests with MAX_STATUS_REQUESTS_PER_SECOND knob, whenever status batching is used.	2019-01-28 15:37:30 -08:00
Evan Tschannen	1d7fec3074	Merge commit '048bfc5c368063d9e009513078dab88be0cbd5b0' into task/tls-upgrade-2 # Conflicts: # .gitignore	2019-01-24 17:43:06 -08:00
A.J. Beamon	2198d24ce1	Merge commit '3b2700d25334c53d13496ca16682642aac951beb' into track-server-request-latencies # Conflicts: # fdbclient/MasterProxyInterface.h # fdbserver/ClusterController.actor.cpp # fdbserver/MasterProxyServer.actor.cpp # fdbserver/ServerDBInfo.h # fdbserver/Status.actor.cpp # fdbserver/fdbserver.vcxproj # fdbserver/storageserver.actor.cpp	2019-01-24 11:43:26 -08:00
A.J. Beamon	8e05e95045	Added the ability to configure the latency band settings by setting a special key in \xff keyspace.	2019-01-18 16:18:34 -08:00
Evan Tschannen	7dbf06162e	Update fdbserver/ClusterController.actor.cpp Co-Authored-By: bnamasivayam <36455962+bnamasivayam@users.noreply.github.com>	2019-01-14 16:57:00 -08:00
Balachandar Namasivayam	ff661bca22	Fix a minor bug in the RoleFitness Class.	2019-01-14 14:54:54 -08:00
Balachandar Namasivayam	a8e2e75cd5	Re-enable CheckDesiredClasses after making necessary changes for multi-region setup. Fixed a couple of bugs 1) A rare race condition where a worker is being roles even after it died. 2) Fix how RoleFitness is calculated for TLog and LogRouter. Only worst fitness is compared to see if a better fit is available.	2019-01-10 10:28:32 -08:00
Vishesh Yadav	3eb9b23024	Listen to multiple addresses and start using vector<NetworkAddress> in Endpoint - This patch will make FDB listen to multiple addresses given via command line. Although, we'll still use first address in most places, this patch starts using vector<NetworkAddress> in Endpoint at some basic places. - When sending packets to an endpoint, pick a random network address in endpoints - Renames Endpoint::address to Endpoint::addresses since it now holds a vector of addresses.	2018-12-13 13:36:52 -08:00
Vishesh Yadav	43e5a46f9b	Change Endpoint::address(NetworkAddress) to vector<NetworkAddress> Extend `Endpoint` class to take multiple NetworkAddresses instead of just one. Hence, to talk to an endpoint instead of one IP:PORT, we'll have multiple IP:PORT pairs. This patch simply adds the field and makes changes to compile the codebase. The first element of `address` field is used everywhere. Hence the way we talk to remains same with this patch. NOTE: Directly accessing the first member of Endpoint::address is unsafe as Endpoint() doesn't enforce non-empty address list. However, since the correctness test pass for now and are anyway replacing all those unsafe accesses with ones considering the whole vector, this patch ignores to access them in safe way.	2018-12-13 13:36:52 -08:00
Evan Tschannen	4b5d0b4e2c	Merge branch 'release-6.0' # Conflicts: # documentation/sphinx/source/release-notes.rst # fdbclient/AsyncFileBlobStore.actor.cpp # fdbclient/AsyncFileBlobStore.actor.h # fdbclient/BlobStore.actor.cpp # fdbclient/BlobStore.h # fdbclient/HTTP.actor.cpp # fdbclient/ManagementAPI.actor.cpp # fdbclient/NativeAPI.actor.cpp # fdbrpc/LoadBalance.actor.h # fdbrpc/batcher.actor.h # fdbrpc/fdbrpc.vcxproj # fdbrpc/sim2.actor.cpp # fdbserver/DataDistribution.actor.cpp # fdbserver/DataDistributionTracker.actor.cpp # fdbserver/SimulatedCluster.actor.cpp # fdbserver/TLogServer.actor.cpp # fdbserver/masterserver.actor.cpp	2018-11-10 13:04:24 -08:00
Evan Tschannen	04fa2a7202	fix: we could recover in a region with priority < 0	2018-11-05 10:14:26 -08:00
Evan Tschannen	87295cc263	suppressed spammy trace events, and avoid reporting a long master recovery duration when the cluster is first created	2018-11-04 23:07:56 -08:00


179 Commits