foundationdb

Commit Graph

Author	SHA1	Message	Date
sfc-gh-tclinkenbeard	5c2d7b6080	Create RangeResult type alias	2021-05-03 13:14:16 -07:00
sfc-gh-tclinkenbeard	f9ede75b42	Remove unused variable in ClusterController.actor.cpp	2021-05-03 11:10:43 -07:00
Markus Pilman	54919d4f3b	Merge remote-tracking branch 'sfc/features/actor-lineage' into features/actor-lineage	2021-04-28 09:22:14 -06:00
Evan Tschannen	1f98dec1df	cleaned up default constructed maps	2021-04-26 19:26:25 -07:00
sfc-gh-tclinkenbeard	dc577b6608	Fix some bugs in distribution of configBroadcaster interface	2021-04-26 18:46:22 -07:00
sfc-gh-tclinkenbeard	7211d838cf	Remove broadcastConfigDatabase actor	2021-04-26 15:54:08 -07:00
Evan Tschannen	451609e6be	code cleanup	2021-04-26 10:16:18 -07:00
Evan Tschannen	50bb9b51b4	simulation does recruitment twice and compares the results to ensure recruitment is deterministic	2021-04-26 10:13:59 -07:00
Evan Tschannen	49ca48f82e	fix: tlog recruitment could select more than the desired amount of tlogs; fix: tlog recruitment did not attempt to avoid longLivedStateless processes	2021-04-26 10:09:44 -07:00
Evan Tschannen	7503964ee9	recruitment tries to avoid degraded processes altogether, rather than just the worst one. Since this is a behavior change from the backup recruitment, we cannot compare degraded between the two recruitments	2021-04-26 10:01:54 -07:00
Evan Tschannen	ccfc77f6fb	changed preferredSharing to be ordered, so that recruitment will always share with the same other role when everything else is equal	2021-04-26 09:57:46 -07:00
sfc-gh-tclinkenbeard	9bed1f7aa5	Run SimpleConfigBroadcaster on cluster controller	2021-04-25 17:20:02 -07:00
Evan Tschannen	b61a911685	removed an ASSERT that was for debugging purposes, and increased the max commit latency, because it can be spuriously triggered by dummy transactions that take 5+ seconds each	2021-04-21 14:30:06 -07:00
Evan Tschannen	e18c9961b4	rewrote tlog recruitment logic so that it is deterministic, to prevent better master exists from triggering spuriously	2021-04-21 00:22:33 -07:00
Lukas Joswiak	c81e1e9519	Add sampling profiler frequency to global config	2021-04-19 22:46:57 -07:00
RenxuanW	4bf7218e8f	Merge pull request #4635 from RenxuanW/priority_logging Log a warning when remote dc is disabled (priority < 0)	2021-04-15 17:00:41 -07:00
Lukas Joswiak	7de23918c0	Add comments, fix erase bug, make optimizations	2021-04-14 10:56:33 -07:00
Lukas Joswiak	c38ddf5eb7	Add comments	2021-04-14 10:56:33 -07:00
Lukas Joswiak	7ba7257cd2	Store global config data on heap	2021-04-14 10:56:33 -07:00
Lukas Joswiak	1c60653c2a	Add fix to conditionally set global config history	2021-04-14 10:56:33 -07:00
Lukas Joswiak	6de28dd916	clang-format	2021-04-14 10:56:33 -07:00
Lukas Joswiak	1260385965	Use object to wrap global configuration history	2021-04-14 10:56:32 -07:00
Lukas Joswiak	fb9a929780	Fix issue with freed memory being accessed	2021-04-14 10:56:32 -07:00
Lukas Joswiak	c3f68831af	Move existing ClientDBInfo variables to global configuration	2021-04-14 10:56:32 -07:00
Lukas Joswiak	7bb0b3d899	Use commit version for global configuration updates FIXME: There is a memory issue where the underlying data for values set in the `data` field of GlobalConfig will be freed shortly after being set.	2021-04-14 10:56:32 -07:00
Lukas Joswiak	f1415412f1	Add global configuration framework implementation	2021-04-14 10:56:32 -07:00
Evan Tschannen	bd6db9ca7c	Update fdbserver/ClusterController.actor.cpp Co-authored-by: Markus Pilman <markus.pilman@snowflake.com>	2021-04-13 15:13:45 -07:00
RenxuanW	7be8dab045	Change DcPriorityNegative to CCDcPriorityNegative	2021-04-08 16:00:37 -07:00
RenxuanW	738e7402f7	Log a warning when remote dc is disabled (priority < 0)	2021-04-08 15:36:52 -07:00
RenxuanW	f3d5fa4750	Revert "Log a warning when remote dc's priority doesn't match the original primary." This reverts commit `1d701e8bcf`.	2021-04-08 15:19:43 -07:00
RenxuanW	1d701e8bcf	Log a warning when remote dc's priority doesn't match the original primary.	2021-04-08 14:38:37 -07:00
Evan Tschannen	a90c26f1d0	The master, proxies, and resolver all need to have the same machine class fitness function besides best fit to ensure recruitment is deterministic. If the first GRV proxy or resolver is forced to share a process, it should prefer to share with the commit proxy so that the commit proxy has more potential options it can share with	2021-04-08 14:29:12 -07:00
Evan Tschannen	5695a1816f	fix: requiredFitness was being set to one higher than the actual requirement	2021-04-07 21:31:14 -07:00
Evan Tschannen	1b1f73ea16	added comments	2021-04-07 20:40:42 -07:00
Evan Tschannen	4d8dd0b0a0	fix: desired must be greater than or equal to required	2021-04-07 20:32:45 -07:00
Evan Tschannen	14213b0151	code cleanup	2021-04-07 20:06:30 -07:00
Evan Tschannen	15e8b43961	rewrote getWorkersForTLogs to do a much better job of avoiding degraded processes and processes in the same DC as the cluster controller	2021-04-07 19:57:24 -07:00
Evan Tschannen	c27d82cecd	tlog recruitment used a degraded LogClass process over a non-degraded TransactionClass process; tlog recruitment would not use TransactionClass processes if it fulfilled the required amount with LogClass processes; Better master exists did not account for how many times a process had been used when comparing recruitments; Better master exists did not account for the fact that tlogs prefer to be in a different dc than the cluster controller; RoleFitness comparison did not properly order count before degraded or bestFit; betterCount was returning worstFit when worstIsDegraded did not match; backupWorker recruitment did not attempt to avoid sharing processes with other roles; If any of the commit_proxy, grv_proxy, or resolver are forced to share a process, allow the recruitment for all of them to share to an equal degree. This change allows BetterMasterExists to be refactored as a tuple comparison	2021-04-07 16:04:08 -07:00
Markus Pilman	50342b5082	fix a second low-latency bug	2021-03-29 13:31:26 -06:00
Markus Pilman	8555723b98	removing testing case	2021-03-26 15:46:54 -06:00
Markus Pilman	43bed1d9dd	Fix bug where betterMasterExist and recruitment disagree	2021-03-26 15:06:59 -06:00
Evan Tschannen	10b6b5d710	If the current configuration does not have a satellite fallback policy we do not care if the old configuration is in fallback mode	2021-03-23 13:02:31 -07:00
A.J. Beamon	99f3bb6d7d	Merge pull request #4509 from sfc-gh-etschannen/feature-bme-count Do not trigger BetterMasterExists if it lowers the number of processes	2021-03-22 13:43:24 -07:00
Zhe Wu	15f3699e22	Add targeting DC ids in the tlog recruitment event trace.	2021-03-19 14:10:38 -07:00
Meng Xu	0cedef123b	Merge pull request #4518 from halfprice/zhewu/log-tlog-recruitment-failure-reason Logging more detailed information during Tlog recruitment	2021-03-19 11:36:05 -07:00
Zhe Wu	58d9f47782	log fitness for excluded workers as well	2021-03-19 11:04:53 -07:00
Zhe Wu	4c00361f1c	Add comment for 'getWorkersForTlogs' method, and addressed TraceEvent formatting comments.	2021-03-18 21:33:43 -07:00
Zhe Wu	9419387295	Update logging field.	2021-03-18 14:53:43 -07:00
Evan Tschannen	2ff63f544e	Update fdbserver/ClusterController.actor.cpp Co-authored-by: Lukas Joswiak <lukas.joswiak@snowflake.com>	2021-03-18 13:45:51 -07:00
Zhe Wu	451b14af09	Log detailed information when a worker is considered as unavailable by the cluster controller for TLog recruitment.	2021-03-18 12:18:03 -07:00
Zhe Wu	6468c5aed6	Fix string join	2021-03-17 23:46:11 -07:00
Zhe Wu	1205650a69	Log the dcid during TLog recruitment, so that we can tell in which DC the recruitment is happening	2021-03-17 23:22:42 -07:00
Evan Tschannen	9aeb69ca1c	added a comment	2021-03-16 14:19:23 -07:00
Evan Tschannen	d0f134c20e	added a comment	2021-03-16 13:17:56 -07:00
Evan Tschannen	2a272e525f	fix compile error	2021-03-16 12:21:21 -07:00
Evan Tschannen	10fd094920	Better master exists should not trigger if it will lower the total number of processes being recruited	2021-03-16 12:14:19 -07:00
FDB Formatster	df90cc89de	apply clang-format to .c, .cpp, .h, .hpp files	2021-03-10 10:18:07 -08:00
Evan Tschannen	346a4e3ecd	Merge branch 'release-6.3' # Conflicts: # fdbcli/fdbcli.actor.cpp # fdbrpc/LoadBalance.actor.h # fdbrpc/MultiInterface.h # fdbserver/ClusterController.actor.cpp # fdbserver/MasterProxyServer.actor.cpp # fdbserver/masterserver.actor.cpp	2021-03-01 18:52:06 -08:00
Meng Xu	33eb1de00e	Add some comment to log system and resolve review comment by deleting my questions.	2021-02-19 21:44:13 -08:00
Meng Xu	9122be4d81	Add comments to HA code and loadBalance code	2021-02-10 13:51:36 -08:00
Richard Chen	c77d9e4abe	merge conflicts	2020-12-02 21:53:19 +00:00
Markus Pilman	bdd3dbfa7d	remove duplicates	2020-11-10 14:01:07 -07:00
sfc-gh-tclinkenbeard	4669f837fa	Add uses of makeReference	2020-11-07 22:10:18 -08:00
Xin Dong	99d31391ca	Fixed a crash found by nightly correctness.	2020-11-03 09:28:04 -08:00
Richard Chen	bbf5bdf6da	fix stable interfaces test and corresponding changes in simulator	2020-10-12 18:25:12 +00:00
Richard Chen	5488ff1d81	draft diff protocol	2020-10-12 18:24:03 +00:00
Richard Chen	41843f07e6	add simulator support for different process versions and ProtocolVersion test	2020-10-12 18:19:31 +00:00
Xin Dong	175d52312a	Prevent segmentation fault.	2020-10-08 13:36:15 -07:00
Young Liu	cc5bc16bd8	Rename more places from proxy to commit proxy	2020-09-15 22:29:49 -07:00
Young Liu	35bef73a1c	Rename proxy to commit proxy	2020-09-10 17:44:15 -07:00
Young Liu	87693cae81	merge master branch and resolve conflicts	2020-09-02 13:44:33 -07:00
Evan Tschannen	12edadd059	Merge branch 'release-6.3' # Conflicts: # CMakeLists.txt # fdbclient/Knobs.cpp # fdbclient/MasterProxyInterface.h # fdbrpc/simulator.h # fdbserver/MasterProxyServer.actor.cpp # tests/fast/CycleAndLock.txt # tests/fast/TxnStateStoreCycleTest.txt # tests/fast/VersionStamp.txt # tests/slow/ParallelRestoreOldBackupApiCorrectnessAtomicRestore.txt # tests/slow/ParallelRestoreOldBackupCorrectnessCycle.txt # versions.target	2020-08-31 19:33:34 -07:00
Evan Tschannen	d42a6b6ea7	remove spammy trace event	2020-08-31 10:37:00 -07:00
Young Liu	19df032aec	Change some formatting issues	2020-08-13 15:30:21 -07:00
Young Liu	4a30492186	Remove debug trace	2020-08-13 14:42:00 -07:00
Young Liu	79ce16650d	merge master branch	2020-08-11 19:22:10 -07:00
Young Liu	ba803a5ea3	Fixed formatting issues and removed GRV related code in MasterProxy	2020-08-11 18:54:54 -07:00
Young Liu	104bac3cbd	Add trace to debug	2020-08-07 13:02:41 -07:00
Young Liu	56cc15ee71	Add trace to debug	2020-08-07 01:02:07 -07:00
Young Liu	d6a23a4d6b	Resolve comments to make GRV proxy a separate process class	2020-08-06 00:01:57 -07:00
Young Liu	30ea639666	Remove debug traces	2020-07-29 07:55:05 -07:00
Young Liu	f7b76a92af	pass joshua	2020-07-29 07:26:55 -07:00
Meng Xu	a2089b354a	RemoveServersSafely: Safety check toKill1 to avoid cluster getting stuck. toKill1 and toKill2 are random subsets of all processes. If we simply kill all processes in toKill1 or toKill2, we may kill so many processes that the cluster becomes unavailable and gets stuck. toKill2 is already modified when killing it would make the cluster unavailable; we should do the same thing for toKill1	2020-07-28 21:07:31 -07:00
Young Liu	1826ac75d5	Add some trace events to debug	2020-07-25 18:16:08 -07:00
Young Liu	0fc681cc3c	Remove some code comments	2020-07-23 22:29:51 -07:00
Young Liu	618414a416	Fix bugs related to getting proxies workers	2020-07-23 18:32:47 -07:00
Young Liu	229ab0d5f1	Fix some conflicts and remove debugging trace events	2020-07-22 23:35:46 -07:00
Young Liu	525f10e30c	Merge master branch	2020-07-22 16:08:49 -07:00
Young Liu	302cf5c45f	Remove debug trace events	2020-07-22 12:20:22 -07:00
Young Liu	2703cedac5	Fixed known bugs	2020-07-17 22:24:52 -07:00
Young Liu	21c1998cca	Fix MaxTLogQueueSize Bug	2020-07-16 15:56:04 -07:00
Young Liu	5b06d69d25	Pass watches test	2020-07-15 00:37:41 -07:00
Andrew Noyes	f470ba8316	Remove using namespace std::rel_ops. This causes the following to not compile anymore: #include <utility> #include <vector> using namespace std::rel_ops; int main() { std::vector<int> xs; return xs.rbegin() != xs.rend(); } See https://godbolt.org/z/s1977n	2020-07-10 22:58:15 +00:00
Meng Xu	9668f32df5	Merge pull request #3388 from apple/release-6.3 Merge Release 6.3 into master	2020-06-18 08:50:25 -07:00
Vishesh Yadav	3068a37e1b	refactor: Remove dead failureDetectionServer code	2020-06-17 15:40:21 -07:00
sfc-gh-tclinkenbeard	99bf993815	Replace BOOST_NOEXCEPT with noexcept	2020-06-09 22:39:19 -07:00
negoyal	cf13e00a8f	Merge remote-tracking branch 'origin/release-6.3' into fdb_cache_wo_allocator	2020-06-01 17:38:31 -07:00
Markus Pilman	c2bc75516f	Merge branch 'release-6.3' of github.com:apple/foundationdb into features/trace-roles	2020-05-14 10:34:53 -07:00
Evan Tschannen	f17f00fdd5	Merge branch 'release-6.2' # Conflicts: # documentation/sphinx/source/release-notes.rst	2020-05-10 22:33:38 -07:00
Evan Tschannen	3eaa9d6397	fix: do not report datacenter version difference before both datacenters report a correct version	2020-05-10 17:49:09 -07:00
Markus Pilman	5f9b127e56	Emit traces regularly about role assignment We are currently emitting Role transition traces when a role starts and when it ends. While this is useful for debugging, it doesn't work well with tools that inject data and might potentially miss some trace lines. We do decorate each trace lines with the roles assigned to that particular process, however, this is not sufficient for tools that can make use of the UID -> Role mapping	2020-05-08 16:27:57 -07:00
negoyal	dd033736ed	Merge branch 'master' into fdb_cache_subfeature2	2020-05-04 17:29:43 -07:00
Evan Tschannen	9e5037291d	fix compiler errors	2020-05-01 14:30:50 -07:00
Evan Tschannen	a442565e13	more work towards shrinking locality	2020-04-18 21:29:38 -07:00
Evan Tschannen	b04478704e	fixed improper use of std::set erase	2020-04-17 16:45:22 -07:00
Evan Tschannen	33efb9ec97	code cleanup based on review comments	2020-04-17 15:05:01 -07:00
Evan Tschannen	b667d5442f	fix: not all removed endpoints were actually removed	2020-04-17 13:47:54 -07:00
Evan Tschannen	9b5130194d	avoid updating the same endpoint multiple times	2020-04-11 21:05:30 -07:00
Evan Tschannen	1476057996	properly cache serialization of serverDBInfo	2020-04-11 19:30:05 -07:00
Evan Tschannen	07cc0a8d74	code cleanup	2020-04-10 17:02:11 -07:00
Evan Tschannen	ce4493f679	many bug fixes	2020-04-10 13:45:16 -07:00
Evan Tschannen	a51c92854a	Merge branch 'master' into feature-tree-broadcast # Conflicts: # fdbserver/WorkerInterface.actor.h # fdbserver/worker.actor.cpp	2020-04-06 21:09:44 -07:00
Evan Tschannen	2a1bd97120	fix compilation errors	2020-04-06 20:58:43 -07:00
Evan Tschannen	477d66b46d	implemented a tree broadcast for txn state message for proxies, and serverDBInfo for workers	2020-04-05 23:09:36 -07:00
negoyal	acaf91ac47	Merge branch 'master' into fdb_cache_subfeature2	2020-03-26 13:33:08 -07:00
Jingyu Zhou	5b36dcaad5	Fix oldest backup epoch for backup workers The oldest backup epoch is piggybacked in LogSystemConfig from master to cluster controller and then to all workers. Previously, this epoch was set to the current master epoch, which is wrong.	2020-03-20 20:15:09 -07:00
Evan Tschannen	e08f0201f1	merge release 6.2 into master	2020-03-17 12:51:47 -07:00
Evan Tschannen	2038a56ff4	Merge pull request #2819 from etschannen/feature-first-proxy A "proxy" class process would not be preferred as the "first proxy" for restore and DR purposes	2020-03-16 13:53:28 -07:00
Evan Tschannen	012344e297	refactor getWorkersForRoleInDatacenter	2020-03-16 11:50:17 -07:00
Evan Tschannen	79d5511149	A "proxy" class process would not be preferred as the "first proxy" for restore and DR purposes	2020-03-13 17:49:02 -07:00
Evan Tschannen	4640edf5d6	do not recruit satellite tlogs when usable regions=1	2020-03-13 10:24:52 -07:00
Evan Tschannen	303df197cf	Merge branch 'release-6.2' # Conflicts: # CMakeLists.txt # bindings/c/test/mako/mako.c # documentation/sphinx/source/release-notes.rst # fdbbackup/backup.actor.cpp # fdbclient/NativeAPI.actor.cpp # fdbclient/NativeAPI.actor.h # fdbserver/DataDistributionQueue.actor.cpp # fdbserver/Knobs.cpp # fdbserver/Knobs.h # fdbserver/LogRouter.actor.cpp # fdbserver/SkipList.cpp # fdbserver/fdbserver.actor.cpp # flow/CMakeLists.txt # flow/Knobs.cpp # flow/Knobs.h # flow/flow.vcxproj # flow/flow.vcxproj.filters # versions.target	2020-03-06 18:22:46 -08:00
Evan Tschannen	f3ac2c9180	renamed a variable	2020-03-04 18:49:21 -08:00
Evan Tschannen	b3ea9d5896	Do not allow the cluster controller to mark any process as failed within 30 seconds of startup	2020-03-04 18:45:26 -08:00
negoyal	cd949eca71	Merge branch 'master' into fdb_cache_subfeature2	2020-02-26 11:22:08 -08:00
Evan Tschannen	96258b9809	Merge branch 'release-6.2' # Conflicts: # documentation/sphinx/source/release-notes.rst # fdbcli/fdbcli.actor.cpp # fdbclient/ManagementAPI.actor.cpp # fdbrpc/FlowTransport.actor.cpp # fdbserver/ClusterController.actor.cpp # fdbserver/DataDistribution.actor.cpp # fdbserver/DataDistribution.actor.h # fdbserver/DataDistributionQueue.actor.cpp # fdbserver/KeyValueStoreMemory.actor.cpp # fdbserver/MasterProxyServer.actor.cpp # fdbserver/QuietDatabase.actor.cpp # fdbserver/SkipList.cpp # fdbserver/StorageMetrics.actor.h # fdbserver/TLogServer.actor.cpp # fdbserver/fdbserver.actor.cpp # fdbserver/storageserver.actor.cpp # fdbserver/workloads/KVStoreTest.actor.cpp # flow/CMakeLists.txt # flow/Knobs.cpp # flow/Knobs.h # flow/genericactors.actor.cpp # flow/serialize.h	2020-02-21 19:09:16 -08:00
Evan Tschannen	8b768e66df	Merge pull request #2694 from dongxinEric/feature/2663/specialize-policy-for-zoneid-in-cc Added a specialized algorithm for PolicyOne and PolicyAcross(,'zoneId…	2020-02-20 14:46:23 -08:00
Evan Tschannen	574e88ba8e	updateGoodRemoteRecruitmentTime was unnecessary because the only way findRemoteWorkers would return would be after a new server has joined which already resets goodRemoteRecruitmentTime	2020-02-20 13:46:22 -08:00
Xin Dong	99095c9224	Again make Clang happy.	2020-02-20 09:50:22 -08:00
Xin Dong	298d6cb3d7	Address review comments.	2020-02-20 09:34:01 -08:00
Evan Tschannen	fbd45963d8	The cluster controller waits until no new workers register for 1.0 before starting a bad recruitment	2020-02-19 16:48:30 -08:00
Xin Dong	89fcbb2055	Make clang happy	2020-02-19 09:44:15 -08:00
Xin Dong	efc0d7f9d5	Added a specialized algorithm for PolicyOne and PolicyAcross(,'zoneId',PolicyOne()) to find a set of TLog servers which will be able to fulfill the policy later.	2020-02-19 09:25:57 -08:00
negoyal	85cc35e81e	Merge branch 'master' into HEAD	2020-02-05 14:59:55 -08:00
Evan Tschannen	844c8511c4	Merge pull request #2588 from jzhou77/backup-worker Integrate new backup worker with existing backup command	2020-02-05 14:14:43 -08:00
Jingyu Zhou	52c6737411	Rename backupLoggingEnabled as backupWorkerEnabled To highlight the changes for 7.0 backup changes. By default, backup_worker_enabled flag is set for 7.0 version.	2020-02-04 10:09:16 -08:00
Jingyu Zhou	0db03f1d3c	Use backup_logging_enabled flag The default is to enable new backup workers. Users can disable this flag to turn off the backup worker feature.	2020-02-03 20:03:22 -08:00
Evan Tschannen	4524831456	Merge pull request #2518 from vishesh/task/failmon-remove-server FailureMonitoring: Server processes no longer need to talk to ClusterController	2020-02-03 17:22:50 -08:00
Jingyu Zhou	38aa1903fd	Add a DB configuration option for backup workers Right now, the default is to keep the old backup behavior, i.e., do NOT use backup workers. Specifically, if BackupType is not set (or is set to default), the master will not recruit backup workers and will not add pseudo locality for backup workers. The StartFullBackupTaskFunc is updated to check if backup worker is enabled. Only when it is not enabled, starting a backup will wait on all backup workers to be started.	2020-01-31 19:29:09 -08:00
Jingyu Zhou	6ddf73e26a	Remove code introduced when resolving merge conflicts	2020-01-22 21:23:38 -08:00
Jingyu Zhou	c6c39ca99d	Update better master exist with backup workers During recruitment, if there is no desired log router count, use tlog size instead, because the number of backup workers has to be larger than 0.	2020-01-22 19:43:40 -08:00
Jingyu Zhou	56a2c37071	Recruit backup workers for single region Enable log router tags for single region, which are popped by backup workers. Need to add noop for backup workers if there is no active backups.	2020-01-22 19:42:13 -08:00
Jingyu Zhou	19d6a889ff	Recruit backup workers for old epochs If there are unfinished ranges in the old epochs, the new master will recruit backup workers responsible for finishing these ranges. These workers remain in the cluster until the next epoch, when they will remove themselves.	2020-01-22 19:38:45 -08:00
Jingyu Zhou	7da9f47f26	Enable pop from backup workers This is still WIP as some edge cases can trigger test failure, most likely due to not popping mutations by backup workers when epoch ends.	2020-01-22 19:38:45 -08:00
Jingyu Zhou	ece3cadf8e	Recruit backup worker during master recovery Right now recruit the same number as TLogs. The backup worker does nothing.	2020-01-22 19:37:48 -08:00
Jingyu Zhou	de8d953865	Add backup role, class, and worker skeleton	2020-01-22 19:35:30 -08:00
Vishesh Yadav	daef5f011a	Merge remote-tracking branch 'apple/master' into task/failmon-remove-server	2020-01-21 13:20:15 -08:00
Evan Tschannen	3f9d9d8b84	Merge branch 'release-6.2' # Conflicts: # CMakeLists.txt # cmake/FlowCommands.cmake # documentation/sphinx/source/release-notes.rst # fdbclient/StorageServerInterface.h # fdbserver/DataDistributionTracker.actor.cpp # fdbserver/MasterProxyServer.actor.cpp # fdbserver/fdbserver.actor.cpp # flow/Knobs.h # flow/Platform.cpp # versions.target	2020-01-16 18:37:47 -08:00
Evan Tschannen	d55e56993d	fix: the cluster controller would not recruit more remote logs before the database became fully_recovered	2020-01-10 12:21:48 -08:00
Alvin Moore	7628d04fb9	Merge branch 'release-6.2' of github.com:apple/foundationdb into release_6.2_merge # Conflicts: # documentation/sphinx/source/release-notes.rst	2020-01-09 07:21:16 -08:00
mpilman	d3d6016c90	Merge remote-tracking branch 'negoyal/fdb_cache_subfeature2' into features/cache-initialization	2020-01-07 19:53:09 -08:00
Vishesh Yadav	6e6cfaff16	Cleanup old Failure Monitoring code	2020-01-07 15:53:32 -08:00
negoyal	29b77863f0	Cache warmup and Consistency check workload changes.	2020-01-07 13:06:58 -08:00
Evan Tschannen	3eae401886	fix: we were recruiting one too few oldLogRouters code cleanup	2020-01-02 15:05:44 -08:00
Evan Tschannen	5e5e618da0	during recovery, only send the full serverDBInfo to processes that are part of the new generation	2019-12-09 13:17:49 -08:00
Evan Tschannen	bcce5968a4	recruit oldLogRouters on TLogs, do not recruit oldLogRouters on the cluster controller if possible	2019-12-09 13:12:13 -08:00
mpilman	821edcb207	Register caches through keyspace This also removes the old mechanism that registers them through the serverDBInfo. Caches do now self-recruit at startup	2019-12-06 13:28:44 -08:00
negoyal	cf2563f1c7	Mix of various things, a lot of which will change.	2019-12-05 17:10:32 -08:00
Evan Tschannen	3c769fcf60	Merge branch 'release-6.2' # Conflicts: # CMakeLists.txt # documentation/sphinx/source/release-notes.rst # fdbserver/ClusterController.actor.cpp # fdbserver/MasterProxyServer.actor.cpp # versions.target	2019-11-22 15:39:19 -08:00
Evan Tschannen	ebcb2f79ed	Merge branch 'master' of github.com:apple/foundationdb	2019-11-22 15:34:49 -08:00
A.J. Beamon	7c801513e2	Fix cases where latency band config could be discarded during recovery or process start.	2019-11-20 11:44:18 -08:00
Evan Tschannen	8d3ef89540	Merge branch 'release-6.2' # Conflicts: # CMakeLists.txt # documentation/sphinx/source/release-notes.rst # fdbclient/MutationList.h # fdbserver/MasterProxyServer.actor.cpp # versions.target	2019-11-14 15:49:56 -08:00
Evan Tschannen	ffc89d1182	fix: dd test recruitment should prefer the location of ratekeeper over other used processes	2019-11-13 12:58:55 -08:00
Balachandar Namasivayam	2e41497580	This commit tries to distribute RK and DD among other empty available processes.	2019-11-12 17:52:42 -08:00
Balachandar Namasivayam	f5282f2c7e	Fix bug where DD or RK could be halted and re-recruited in a loop for certain valid process class configurations. Specifically, recruitment of DD or RK takes into account that master process is preferred over proxy, resolver or cc. But check for better DD only looks for better machine class ignoring that the new recruit could share a proxy or resolver or CC. Also try to balance the distribution of the DD and RK role if there are enough processes to do so.	2019-11-12 14:22:36 -08:00
negoyal	a4a0bf18f9	Merging with Master.	2019-11-12 13:01:29 -08:00
Evan Tschannen	688940b685	merge 6.2 into master	2019-10-21 11:43:46 -07:00
Evan Tschannen	43e99ef6a4	fix: better master exists must check if fitness is better for proxies or resolvers before looking at the count of either of them	2019-10-17 13:18:31 -07:00
Evan Tschannen	298b815109	one proxy or resolver with best fitness no longer prevents more proxies or resolvers from being recruited with good fitness	2019-10-14 18:32:17 -07:00
Evan Tschannen	5064d91b75	fix: the cluster controller would not change to a new set of satellite tlogs when they become available in a better satellite location	2019-10-14 18:31:23 -07:00
Evan Tschannen	35e816e9ad	added the ability to configure satellite_logs by satellite location, this will overwrite the region configure if both are present	2019-10-14 18:30:15 -07:00
A.J. Beamon	31ce56eddf	Add cluster controller metrics	2019-10-03 15:29:11 -07:00
Evan Tschannen	b495cc697b	Merge branch 'release-6.2' # Conflicts: # CMakeLists.txt # documentation/sphinx/source/release-notes.rst # versions.target	2019-09-13 09:25:08 -07:00
Evan Tschannen	a62862c105	add yieldedFutures to prevent slow tasks	2019-09-11 16:26:48 -07:00
Evan Tschannen	945cff1e5b	the cluster controller caches the serialization of serverDBInfo, to avoid regenerating it many times	2019-09-10 14:27:22 -07:00
Meng Xu	39680fa515	StorageEngineSwitch:Clean up unnecessary trace And do not trigger storage recruitment unnecessarily.	2019-08-19 14:11:57 -07:00
Meng Xu	4ab322f52c	Merge branch 'master' into mengxu/storage-engine-switch-PR-v2	2019-08-19 13:06:32 -07:00
Meng Xu	3034a5e0c5	StorageRecruitment:Suppress outstanding req errors When too many outstanding requests cannot find a worker for storage server role, many same errors will be put into trace log. Only one error is enough to alert the problem. Too many same errors cause false positive in nightly test and thus should be suppressed.	2019-08-14 11:31:06 -07:00
Meng Xu	a588710376	StorageEngineSwitch:Graceful switch When fdbcli changes storeType for storage engines, we switch the store type of storage servers one by one gracefully. This avoids recruiting multiple storage servers on the same process, which can cause OOM error.	2019-08-12 17:37:52 -07:00
Evan Tschannen	90e3b50213	Merge branch 'master' into feature-coordinator-connection # Conflicts: # fdbclient/DatabaseContext.h # fdbclient/NativeAPI.actor.cpp # fdbclient/NativeAPI.actor.h # fdbserver/workloads/KillRegion.actor.cpp	2019-07-26 15:05:02 -07:00
Evan Tschannen	be5d144b8b	added status information on connected clients	2019-07-25 17:15:31 -07:00
Jingyu Zhou	bbeaf0ebbb	Add a monitorServerInfoConfig() call back This was deleted during a code refactor in `ef868f5`. Because no tests were complaining, we didn't find this until now.	2019-07-25 15:17:26 -07:00
Evan Tschannen	4a866290b7	Clients keep a persistent connection open with coordinators to get updates to the list of proxies Status still needs to be updated with client information with information from the coordinators	2019-07-23 19:22:44 -07:00
Jingyu Zhou	50e7593c5b	Merge pull request #1796 from ajbeamon/remove-trace-event-underscores Remove trace event underscores	2019-07-05 21:45:55 -07:00
A.J. Beamon	9f4b6fd770	Remove additional underscores	2019-07-05 08:12:25 -07:00
Alex Miller	7a500cd37f	A giant translation of TaskFooPriority -> TaskPriority::Foo This is so that APIs that take priorities don't take ints, which are common and easy to accidentally pass the wrong thing.	2019-06-25 02:47:35 -07:00
Vishesh Yadav	a8e408e268	run clang-format on changes	2019-06-10 14:10:24 -07:00
Vishesh Yadav	6fa7081a21	net: Don't make FailureMonitoring requests from client This patch removes the need for clients to continuously contact cluster coordinator for failure monitoring information. Instead, it uses the FlowTransport to monitor the statuses of peers and update FailureMonitor accordingly.	2019-06-09 00:43:38 -07:00
Evan Tschannen	29b96414e2	Merge branch 'release-6.1' # Conflicts: # documentation/sphinx/source/release-notes.rst # fdbclient/NativeAPI.actor.cpp # fdbserver/Coordination.actor.cpp # flow/Arena.h # versions.target	2019-06-03 18:49:35 -07:00
Evan Tschannen	7c333dbc16	If a process receives a message in its clusterControllerInterface before becoming the cluster controller, if the process does not become the cluster controller in the next minute it should destroy the interface to prevent a memory leak.	2019-05-29 16:57:13 -07:00
A.J. Beamon	5f55f3f613	Replace g_random and g_nondeterministic_random with functions deterministicRandom() and nondeterministicRandom() that return thread_local random number generators. Delete g_debug_random and trace_random. Allow only deterministicRandom() to be seeded, and require it to be seeded from each thread on which it is used.	2019-05-10 14:01:52 -07:00
Andrew Noyes	6207d724f8	Fix all -Wunused-variable warnings	2019-04-15 18:13:00 -07:00
mpilman	1c16f87a4e	Remove trace-calls to printable (in non-workloads)	2019-04-05 13:12:19 -07:00
mpilman	c008e16c81	Defer formatting in traces to make them cheaper This is the first part of making `TraceEvent` cheaper. The main idea is to defer calls to any code that formats strings. These are the main changes: - TraceEvent::detail now takes a c-string instead of std::string for literals. This prevents unnecessary allocations if the trace is not going to be printed in the first place (for example for SevDebug). Before that `detail` expected a `std::string` as key, which meant that any string literal would be copied on each call. - Templates Traceable and SpecialTraceMetricType. These templates can be specialized for any type that needs to be printed. The actual formatting will be deferred to after the `enabled` check. This provides two benefits: (1) if a TraceEvent is disabled, we don't pay for the formatting and (2) TraceEvent can trace types that it doesn't know about. - TraceEvent::enabled will be set in the constructor if the Severity is passed. This will make sure that `TraceEvent::init` is not called. - `TraceEvent::detail` will be inlined. So for disabled TraceEvent calls, a call to detail will only introduce an if-branch which is much cheaper than a function call.	2019-04-05 13:12:19 -07:00
Evan Tschannen	8ebf771392	cleanup cluster controller trace events	2019-03-30 14:17:18 -07:00
A.J. Beamon	71e2fdafb8	Changes to ratekeeper camel case	2019-03-27 08:24:25 -07:00
Evan Tschannen	5e03e178de	Merge pull request #1345 from ajbeamon/support-multiple-client-or-worker-issues Add support for a client or worker having multiple issues.	2019-03-24 17:27:50 -07:00
Evan Tschannen	d45159ebf7	Merge pull request #1307 from jzhou77/ratekeeper Monitor placement of Ratekeeper and DataDistributor	2019-03-24 17:26:07 -07:00
Evan Tschannen	d6ad027d37	ratekeeper needs to be recruited for proxies to make progress, so if one has not registered with the cluster controller by the time we are accepting commits, recruit a new one	2019-03-24 16:48:24 -07:00
Evan Tschannen	f426d732ea	fix: forgot to remove one location where id_used was incremented for distributor and ratekeeper	2019-03-24 16:04:59 -07:00
Evan Tschannen	e8948726e8	once we recruit a ratekeeper, do not allow any other ratekeepers to register	2019-03-24 11:04:39 -07:00
Jingyu Zhou	40eec20252	Restore master PID in worker registration This fix is lost during merge.	2019-03-23 21:02:11 -07:00
Jingyu Zhou	3ef26e6be3	Fix fitness assignment statements Found by MacOS build.	2019-03-23 19:16:04 -07:00
Evan Tschannen	1fc6937802	changed NetworkAddressList to at most two addresses for performance	2019-03-23 17:54:46 -07:00
Evan Tschannen	b51a24453e	the data distributor and ratekeeper are not included in id_used, but when comparing equally good options we prefer to avoid sharing with those roles excluded data distributor and ratekeeper were improperly killed when the best option was also excluded	2019-03-23 13:25:36 -07:00
Jingyu Zhou	fdc5b5ddbf	Fix: spurious ratekeeper registration A rare race condition: -r simulation -f ./foundationdb/tests/slow/WriteDuringReadAtomicRestore.txt -s 114256311 -b on - A is the ratekeeper. - CC recruit B and B starts - CC halts ratekeeper A and A is halted - A registers back with CC, which then halts B. CC sets A to be the ratekeeper. CC starts recruiting and finds A is the best machine. But skips recruiting because CC thinks A is already used. Now the cluster is left with no ratekeeper. Fix by disallowing ratekeeper registration with previous ID.	2019-03-23 11:03:51 -07:00
Jingyu Zhou	6523cd4931	Fix: recruit ratekeeper is not triggered	2019-03-23 09:20:54 -07:00
Evan Tschannen	2da46e3172	fix: halt if datacenters are different	2019-03-22 23:53:21 -07:00
Evan Tschannen	d34c56c9a5	ensure that the processId exists in id_worker before accessing it	2019-03-22 18:54:39 -07:00
Evan Tschannen	36ab852bb1	Merge branch 'master' into ratekeeper # Conflicts: # fdbserver/ClusterController.actor.cpp	2019-03-22 18:41:00 -07:00
Evan Tschannen	ddb6058770	simplified ratekeeper monitoring loop	2019-03-22 18:22:45 -07:00
Jingyu Zhou	12917d8c7d	Add actors to store halt request futures Address best fitness in checking better DD or RK.	2019-03-22 18:06:38 -07:00
Jingyu Zhou	e8977aeb98	Remove clusterControllerDcId check This is no longer needed since it'll be set in the ctor.	2019-03-22 18:01:54 -07:00
Evan Tschannen	82bc447e29	startRatekeeper is responsible for updating serverDBInfo	2019-03-22 17:56:16 -07:00
Evan Tschannen	82c80c225d	make sure id_worker is updated before setting ratekeeper or data distribution	2019-03-22 17:08:54 -07:00
Evan Tschannen	6a9c9d79cc	Update fdbserver/ClusterController.actor.cpp	2019-03-22 17:00:58 -07:00
Evan Tschannen	70b1c88cdd	Update fdbserver/ClusterController.actor.cpp	2019-03-22 17:00:52 -07:00
Jingyu Zhou	16f54577ee	Restore master PID in cluster controller worker registration CC may think master failed and clear the master PID, which can block both data distributor and ratekeeper recruitment. Fix by restoring it during worker registration.	2019-03-22 14:53:05 -07:00
A.J. Beamon	4eb5715689	Add support for a client or worker having multiple issues.	2019-03-22 08:29:41 -07:00
Jingyu Zhou	da338c3ad6	Avoid unnecessary recruiting of DD or RK While waiting for recruiting data distributor or ratekeeper, a previous one could have already joined. So we can skip this unnecessary recruiting. Revert the change of worker.actor.cpp for ratekeeper. Instead, recruiting ratekeeper should avoid the process with an existing one. This fixes a bug where the ratekeeper interface became zombie, killing other healthy ratekeeper but doing no useful work. Found by: -r simulation --crash -f tests/fast/WriteDuringRead.txt -s 31858110 -b on	2019-03-21 22:40:07 -07:00
Evan Tschannen	fe4464e786	fix: processClassFitness could be wrong if the client changed their class while rebooting	2019-03-21 17:56:04 -07:00
Jingyu Zhou	299961aecb	Move ratekeeper or data distributor from excluded servers	2019-03-21 17:17:33 -07:00
Jingyu Zhou	48324ad4be	Fix a race during ratekeeper registration When a ratekeeper registers, the monitorRatekeeper wakes up and recruits a new ratekeeper. Adding a 0s delay to avoid this. If a ratekeeper is recruited on an existing machine, update the interface so that the cluster controller can clear the ratekeeperID.	2019-03-21 12:56:56 -07:00
Evan Tschannen	e692f0f70f	fix: degraded is only used for tlog recruitment, so we should not use it in the fitness calculation for other roles	2019-03-21 11:23:49 -07:00
Jingyu Zhou	8edefda193	Fix test stuck due to invalid worker in cluster controller Test case: -r simulation --crash -f ./tests/rare/CloggedCycleWithKills.txt -s 688927581 -b off	2019-03-20 22:24:01 -07:00
Jingyu Zhou	937b6dde31	Fix a race of DD, RK, Master failure If all DD, RK, Master run on the same process and failed. Recruiting of new DD or RK could try to use the old master worker interface, which is an invalid one and causes recruitment to be stuck. Fix by adding a delay and checking master is valid before recruitment.	2019-03-20 16:19:20 -07:00
Jingyu Zhou	ce5c6d18d2	Fix ratekeeper recruitment bug	2019-03-20 14:22:22 -07:00
Jingyu Zhou	86b687981b	Fix ratekeeper and data distributor recruiting bug Avoid multiple concurrent recruiting of ratekeepers with a recruiting flag. Fix endless recruiting when the chosen worker is a proxy or a resolver -- prefer master in this case.	2019-03-20 10:00:31 -07:00
Jingyu Zhou	474abd81bd	Move placement monitoring inside doCheckOutstandingRequests	2019-03-19 22:48:21 -07:00
Balachandar Namasivayam	f9560e1abd	Addressed Review Comments	2019-03-19 15:23:14 -07:00
Jingyu Zhou	bc6fdaea3e	Recruit a new ratekeeper before halting the old	2019-03-19 15:21:46 -07:00
Jingyu Zhou	0fb6a03c07	First round of review comment fixes for PR#1307	2019-03-19 11:29:19 -07:00
Jingyu Zhou	8d609eb51d	Protect ratekeeper registration race during recruitment This is similar one to DataDistributor.	2019-03-18 13:53:50 -07:00
Balachandar Namasivayam	5471725db5	Support config where the primary and remote DC's can be used as satellites.	2019-03-18 12:17:59 -07:00
Jingyu Zhou	2b41a97a6e	Fix the issue of slow dying Data Distributor Test with: -r simulation -f ./foundationdb/tests/slow/CommitBug.txt -s 67828576 -b on The test has the following event sequence: - Time 113.3s, CC noticed DD failure, cleared DD interface. - 1s later, DD rejoined and registered with CC. - Time 131.7s, DD actor cancelled. This old DD raced to register with CC and the failure monitor is not installed because monitorDataDistributor is stalled waiting for new DD. - Time 161.4s, new DD running. New DD recruiting was delayed due to no servers in the period. Fix by disabling DD registration during the recruiting process.	2019-03-17 22:19:23 -07:00
Jingyu Zhou	254c78053c	Fix a segfault error After wait, ServerDBInfo may have changed. Using the old copy is wrong.	2019-03-15 22:11:13 -07:00
Jingyu Zhou	12ddd56698	Fix Ratekeeper and DataDistributor placement Make sure both RateKeeper and DataDistributor are placed in the same data center as the Master. Make sure only one RateKeeper is live in the cluster as well.	2019-03-15 17:09:28 -07:00
Jingyu Zhou	bb5686eb75	Fix monitoring of DD and RK	2019-03-15 16:02:17 -07:00
Jingyu Zhou	9f6fe5f649	Merge remote-tracking branch 'apple/master' into ratekeeper	2019-03-15 11:30:04 -07:00
Jingyu Zhou	40860e0093	Attempt to fix.	2019-03-15 11:29:04 -07:00
Jingyu Zhou	99d521ef4f	Monitor Ratekeeper and DataDistributor to use stateless processes Since Ratekeeper and DataDistributor are no longer running with Master, they might be running with stateful processes before a new Master becomes alive, which is undesirable. This PR adds a monitoring of both Ratekeeper and DataDistributor at Cluster Controller -- if Master runs on a stateless class and RK/DD runs at a worse class, then RK/DD will be killed. I.e., RK/DD should be running at their own classes or on the same stateless process as Master. After restart, RK/DD should be running at a better process class.	2019-03-14 15:00:57 -07:00
Meng Xu	5a10bf5dfc	Merge branch 'master' into mengxu/tls-switch-status-PR	2019-03-14 10:35:12 -07:00
Evan Tschannen	a2108047aa	removed LocalitySetRef and IRepPolicyRef typedefs, because for clarity the Ref suffix is reserved for arena allocated objects instead of reference counted objects.	2019-03-13 13:14:39 -07:00
Evan Tschannen	e068c478b5	merge master	2019-03-12 18:31:25 -07:00
Evan Tschannen	5392742902	fixed review comments	2019-03-12 14:38:54 -07:00
Jingyu Zhou	2b0139670e	Fix review comment for PR 1176	2019-03-12 12:02:30 -07:00
Meng Xu	46f4b02807	TLS Status: Resolve review comments Use connectedCoordinatorsNumDelayed to reduce the load on cluster controller; Set connectedCoordinatorsNum to null by default for monitorLeader()	2019-03-11 17:10:08 -07:00
Evan Tschannen	1be9ae5ce3	fixed merge conflict	2019-03-08 22:51:06 -05:00
Evan Tschannen	044b6b4f8a	Merge branch 'master' into feature-degraded-tlog # Conflicts: # fdbserver/ClusterController.actor.cpp	2019-03-08 22:50:41 -05:00
Evan Tschannen	45fe6b369b	tlog recruitment will prefer non-degraded processes, however it will not choose less than desired number of tlogs to avoid degraded processes better master exists will switch the master to avoid degraded processes	2019-03-08 14:40:00 -05:00

640 Commits