Previously, we allowed holes in version ranges in partitioned mutation logs.
This has been changed so that restore can easily figure out whether the
database is restorable. A specific problem was that if a backup worker didn't
find any mutations for an old epoch, the worker could just exit without
generating a log file, thus leaving holes in version ranges.
Another contract change is that if a backup key is set, then we must store
all mutations for that key, especially in the worker for the old epoch. As a
result, the worker must first check the backup key before pulling mutations
and uploading logs. Otherwise, we may lose mutations.
Finally, when a backup key is removed, mutations should be saved up to the
current version so that the backup worker doesn't exit too early, i.e., we
avoid the case where the saved mutation versions are less than the version of
the snapshot taken.
The NOOP pop causes some mutation ranges to be dropped by backup workers. As a
result, the backup is incomplete. Specifically, the BACKUP_NOOP_POP_DELAY wait
blocks the actor that monitors the backup key.
For partitioned logs, compute the continuous log end version from the minimum
begin version across logs. The old backup test keeps using describeBackup()
so that it stays correctness clean. Rename partitioned log files so that the
last number is the block size.
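As an illustration, here is a minimal sketch (plain C++ with illustrative
names, not the actual FoundationDB code) of computing the continuous log end
version: sort the partitioned log files, start from the minimum begin version,
and extend the covered range until the first hole.

    #include <algorithm>
    #include <cstdint>
    #include <iostream>
    #include <vector>

    using Version = int64_t;

    struct LogFile {
        Version beginVersion; // inclusive
        Version endVersion;   // exclusive
    };

    // Returns the end of the continuous version range covered by the files,
    // starting from the minimum begin version; -1 if there are no files.
    Version continuousLogEnd(std::vector<LogFile> files) {
        if (files.empty()) return -1;
        std::sort(files.begin(), files.end(),
                  [](const LogFile& a, const LogFile& b) {
                      return a.beginVersion < b.beginVersion;
                  });
        Version end = files.front().beginVersion; // min begin version
        for (const LogFile& f : files) {
            if (f.beginVersion > end) break; // hole: [end, f.beginVersion) missing
            end = std::max(end, f.endVersion);
        }
        return end;
    }

    int main() {
        // [100,200) and [200,350) are continuous; [400,500) is past a hole.
        std::vector<LogFile> files = { {100, 200}, {200, 350}, {400, 500} };
        std::cout << continuousLogEnd(files) << "\n"; // prints 350
    }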
For partitioned logs, mutations of the same version may be sent to an applier
out of order. If one loader advances to the next version, an applier may
receive later-version mutations from different loaders. So dropping early
mutations is wrong.
The subsequence number is needed so that mutations with the same commit
version number, but from different partitioned logs, can be correctly
reassembled in order.
For old backup files, the sub number is always 0. For partitioned mutation
logs, the actual sub number is used. For range files, the sub number is always
0.
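To make the ordering concrete, here is a minimal sketch (illustrative types,
not FoundationDB's API) of reassembling mutations from multiple partitioned
logs by sorting on the (commit version, subsequence) pair:

    #include <algorithm>
    #include <cstdint>
    #include <string>
    #include <tuple>
    #include <vector>

    using Version = int64_t;

    struct VersionedMutation {
        Version version;      // commit version
        uint32_t sub;         // subsequence; always 0 for old-format and range files
        std::string mutation;
    };

    // Merge mutations collected from different partitioned logs into the
    // total order in which they were committed.
    void reassemble(std::vector<VersionedMutation>& ms) {
        std::sort(ms.begin(), ms.end(),
                  [](const VersionedMutation& a, const VersionedMutation& b) {
                      return std::tie(a.version, a.sub) < std::tie(b.version, b.sub);
                  });
    }

    int main() {
        std::vector<VersionedMutation> ms = {
            { 700, 2, "clear b" }, { 700, 1, "set a=1" }, { 650, 0, "set c=2" }
        };
        reassemble(ms); // order: (650,0), (700,1), (700,2)
    }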
In parallel restore, use the new getPartitionedRestoreSet() to get a set
containing partitioned mutation logs. The loader uses a new parser to extract
mutations from partitioned logs.
TODO: fix "unable to restore" errors.
If a backup worker is on an old epoch, it could exit early if either of the
following is true:
- there are no backups
- all backups start at a version >= the endVersion
If this flag is set, the backup worker exits without doing any work, which
signals the master to update the oldest backup epoch.
The oldest backup epoch is piggybacked in LogSystemConfig from the master to
the cluster controller and then to all workers. Previously, this epoch was
set to the current master epoch, which was wrong.
Because each log version contains a commit version and a subsequence number,
each key can only have one mutation per log version. This simplifies
StagingKey::add() a lot.
This optimization reduces the number of messages sent from loaders to
appliers; the extra messages were unintentionally introduced when subsequence
numbers were added for mutations.
For reasons that are not yet understood, duplicate mutations can be added to
a StagingKey and need to be filtered out. Otherwise, atomic operations can
result in corrupted data in the database.
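A minimal sketch of the two StagingKey points above (hypothetical types, not
the actual StagingKey): keying the staged mutations by (commit version,
subsequence) both guarantees at most one mutation per log version and filters
out duplicates.

    #include <cstdint>
    #include <map>
    #include <string>
    #include <utility>

    using LogVersion = std::pair<int64_t, uint32_t>; // (commit version, subsequence)

    struct StagingKeySketch {
        std::map<LogVersion, std::string> mutations; // one mutation per log version

        void add(const LogVersion& v, const std::string& m) {
            // emplace is a no-op if this (version, sub) already exists, so a
            // duplicated mutation is ignored rather than applied twice, which
            // would corrupt data for atomic operations.
            mutations.emplace(v, m);
        }
    };

    int main() {
        StagingKeySketch k;
        k.add({ 700, 1 }, "ATOMIC_ADD +1");
        k.add({ 700, 1 }, "ATOMIC_ADD +1"); // duplicate: filtered out
    }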
If workers from previous epochs are still ongoing, we may end up with a
container that misses mutations from previous epochs. So the update only
happens after only the current epoch's backup workers remain.
A backup worker starts by checking whether there are backup keys and then runs
the monitorBackupKeyOrPullData() loop, which performs the check again. The
second check can be delayed, which causes the loop to perform NOOP pops. The
fix removes this second check and uses the result of the first check to decide
what to do in the loop.
Otherwise, when there are no mutations for the unfinished range, the empty
file may not be created when the worker is displaced, thus leaving holes in
version ranges.
The start version of a tlog set can be smaller than the last epoch's end
version. In this case, set the backup worker's start version to the last
epoch's end version to avoid overlapping version ranges among backup workers.
If backup workers are enabled, the current epoch's worker for tag (-2,0) is
responsible for monitoring the backup progress of all workers and updating
the BackupConfig with the latest saved log version, which is the minimum
version across all tags.
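For example, the monitor's computation might look like the following sketch
(illustrative types, not the actual code): the latest saved log version is
the minimum saved version across all tags, since every version below it has
been saved by every worker.

    #include <algorithm>
    #include <cstdint>
    #include <limits>
    #include <map>
    #include <tuple>

    using Version = int64_t;

    struct Tag { // illustrative stand-in for a backup worker tag
        int locality;
        int id;
        bool operator<(const Tag& o) const {
            return std::tie(locality, id) < std::tie(o.locality, o.id);
        }
    };

    // Latest saved log version = minimum progress across all tags.
    Version latestSavedLogVersion(const std::map<Tag, Version>& savedByTag) {
        Version v = std::numeric_limits<Version>::max();
        for (const auto& entry : savedByTag)
            v = std::min(v, entry.second);
        return v;
    }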
This change has been incorporated into getLatestRestorableVersion() so that
it is transparent to clients.
This delay ensures that the old epoch's backup workers can save their progress
in the database. Otherwise, the new master could attempt to recruit backup
workers for the old epoch on version ranges that have already been popped. As
a result, the logs would lose data.
When the pull has finished and the message queue is empty, we should use the
end version as the popVersion for backup files. Otherwise, there might be a
version gap between the last message and the end version.
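In sketch form (variable names are illustrative), the rule is just:

    #include <cstdint>

    using Version = int64_t;

    // Choose the pop version for saved backup files: once the pull has
    // finished and the queue is drained, pop all the way to endVersion so no
    // gap is left between the last message and the end version.
    Version choosePopVersion(bool pullFinished, bool queueEmpty,
                             Version lastMessageVersion, Version endVersion) {
        return (pullFinished && queueEmpty) ? endVersion : lastMessageVersion;
    }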
A partial recovery can result in an empty epoch that copies the previous
epoch's version range. In this case, getOldEpochTagsVersionsInfo() will not
return the previous epoch's information. To correctly compute the start
version for a backup worker, we need to check the previous epoch's saved
versions. If they are larger than this epoch's begin version, use the
previously saved version as the start version.
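A minimal sketch of the resulting rule (hypothetical helper, not the actual
recruitment code): the worker resumes from whichever is larger, this epoch's
begin version or the version the previous epoch's worker already saved.

    #include <algorithm>
    #include <cstdint>

    using Version = int64_t;

    Version backupWorkerStartVersion(Version epochBeginVersion,
                                     Version previousEpochSavedVersion) {
        // If the previous epoch's worker already saved past this epoch's
        // begin version, start there to avoid re-saving a version range.
        return std::max(epochBeginVersion, previousEpochSavedVersion);
    }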
With TLS, a worker (or process) can have both a TLS address and a non-TLS
address. When a process is created in simulation, the primary address is TLS
by default. The non-TLS address's port is the TLS address's port plus one.
In a connection between two workers, if their primary addresses do not both
enable or both disable TLS, one worker will swap its primary and secondary
addresses so that the TLS configurations of the two endpoints match.
The swap can make the primary address no longer the TLS one created when the
process was created, and the swap happens only for the worker, not for the
process struct in simulation. This swap can therefore cause
worker->address != process->address.
In the checkForExtraDataStores actor, we use worker->address to check whether
a process is killable and use process->address to kill the process. The
inconsistency can cause simulation to kill a protected process that is not
killable, which leads to simulation failure.
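A minimal sketch of the swap (simplified types; the real logic lives in the
networking and simulation code) and why it breaks the address invariant:

    #include <cstdint>
    #include <utility>

    struct NetworkAddress {
        uint32_t ip;
        uint16_t port;
        bool tls;
    };

    struct WorkerAddresses {
        NetworkAddress primary;   // TLS by default in simulation
        NetworkAddress secondary; // non-TLS, port = TLS port + 1
    };

    void matchTLS(WorkerAddresses& self, const NetworkAddress& peerPrimary) {
        if (self.primary.tls != peerPrimary.tls) {
            // Swap so both endpoints agree on TLS. Only the worker's copy is
            // swapped, not the simulated process struct, which is how
            // worker->address can diverge from process->address.
            std::swap(self.primary, self.secondary);
        }
    }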
The idea is that we keep around a TLSConfig holding the configuration the
user has provided; when we want to initialize an SSL context, we ask the
TLSConfig to load all certificates and return a LoadedTLSConfig that is a
concrete set of certificate bytes in memory. initTLS now just takes the
in-memory bytes and applies them to the SSL context.
This is a large refactor leading up to certificate refreshing, where we will
periodically check for changes to the certificates, and then re-load them and
apply them to a new SSL context.
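The shape of the refactor, as a sketch (signatures are assumptions, not the
exact FoundationDB API):

    #include <fstream>
    #include <sstream>
    #include <string>

    // Concrete certificate bytes in memory, ready for an SSL context.
    struct LoadedTLSConfig {
        std::string certBytes;
        std::string keyBytes;
        std::string caBytes;
    };

    // What the user configured: here just file paths, loaded on demand.
    struct TLSConfig {
        std::string certPath, keyPath, caPath;

        LoadedTLSConfig loadAll() const {
            auto slurp = [](const std::string& path) {
                std::ifstream in(path, std::ios::binary);
                std::ostringstream out;
                out << in.rdbuf();
                return out.str();
            };
            return LoadedTLSConfig{ slurp(certPath), slurp(keyPath), slurp(caPath) };
        }
    };

    // initTLS only ever sees in-memory bytes and never touches disk, so
    // refreshing is just: re-run loadAll(), then apply to a new SSL context
    // (elided here; in practice e.g. a boost::asio::ssl::context).
    void initTLS(const LoadedTLSConfig& loaded) {
        (void)loaded; // apply certBytes/keyBytes/caBytes to the context
    }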