Commit Graph

3960 Commits

Author SHA1 Message Date
Jingyu Zhou 1b159a3785 Fix: backup worker ignores deleted container 2020-03-20 20:14:36 -07:00
Jingyu Zhou 00350dd3d8 Fix pulledVersion of backup worker
Not sure why, the cursor's version can be smaller than before.
2020-03-20 20:14:35 -07:00
Jingyu Zhou 672ad7a8ea Fix: backup worker savedVersion init to begin version
Choosing invalidVersion is wrong, as the worker starts at beginVersion.
2020-03-20 20:14:35 -07:00
Jingyu Zhou c300a5c1b7 Fix contract changes: backup worker generate continuous versions
Previously, we allowed holes in version ranges in partitioned mutation logs. This
has been changed so that restore can easily figure out whether the database is
restorable. A specific problem is that if the backup worker didn't find any
mutations for an old epoch, the worker can just exit without generating a
log file, thus leaving holes in version ranges.

Another contract change is that if a backup key is set, then we must store
all mutations for that key, especially for the worker for the old epoch. As a
result, the worker must first check backup key, before pulling mutations and
uploading logs. Otherwise, we may lose mutations.

Finally, when a backup key is removed, the saving of mutations should be up to
the current version so that backup worker doesn't exit too early. I.e., avoid
the case where saved mutation versions are less than the snapshot version taken.
2020-03-20 20:14:35 -07:00
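
The continuity requirement described in c300a5c1b7 can be illustrated with a small Python sketch that reports holes between log-file version ranges. The half-open `[begin, end)` convention and the helper name `find_version_gaps` are assumptions made for this example; they are not taken from the FoundationDB code.

```python
from typing import List, Tuple

# Assumption for this sketch: each log file covers a half-open range [begin, end).
VersionRange = Tuple[int, int]


def find_version_gaps(ranges: List[VersionRange]) -> List[VersionRange]:
    """Return the holes between version ranges (hypothetical helper)."""
    gaps: List[VersionRange] = []
    covered_until = None
    for begin, end in sorted(ranges):
        # A range that starts past everything covered so far leaves a hole.
        if covered_until is not None and begin > covered_until:
            gaps.append((covered_until, begin))
        covered_until = end if covered_until is None else max(covered_until, end)
    return gaps
```

Under these assumptions, an empty result means the ranges are continuous, which is the property restore needs in order to decide whether the database is restorable.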
Jingyu Zhou 86edc1c9c8 Fix backup worker does NOOP pop before getting backup key
The NOOP pop causes some mutation ranges to be dropped by backup workers. As a
result, the backup is incomplete. Specifically, the wait of BACKUP_NOOP_POP_DELAY
blocks the monitoring of backup key actor.
2020-03-20 20:13:38 -07:00
Jingyu Zhou 1f95cba53e Add describePartitionedBackup() for parallel restore
For partitioned logs, computing continuous log end version from min logs begin
version. Old backup test keeps using describeBackup() to be correctness clean.

Rename partitioned log file so that the last number is block size.
2020-03-20 20:13:38 -07:00
Jingyu Zhou 2eac17b553 StagingKey can add out-of-order mutations
For partitioned logs, mutations of the same version may be sent to applier
out-of-order. If one loader advances to the next version, an applier may
receive later version mutations for different loaders. So, dropping of early
mutations is wrong.
2020-03-20 20:13:38 -07:00
Jingyu Zhou ab0b59b0c3 Add subsequence number to restore loader & applier
The subsequence number is needed so that mutations of the same commit version
number, but from different partitioned logs can be correctly reassembled in
order.

For old backup files, the sub number is always 0. For partitioned mutation
logs, the actual sub number is used. For range files, the sub number is always
0.
2020-03-20 20:13:38 -07:00
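
A minimal Python sketch of the ordering idea in ab0b59b0c3: mutations read from several partitioned logs are reassembled by sorting on the pair (commit version, subsequence number). The `LoggedMutation` type and `merge_partitioned_logs` function are hypothetical names for illustration only, not the restore loader's actual interface.

```python
from typing import List, NamedTuple


class LoggedMutation(NamedTuple):
    version: int  # commit version
    sub: int      # subsequence number within that commit version
    payload: str  # stand-in for the mutation itself (illustrative)


def merge_partitioned_logs(logs: List[List[LoggedMutation]]) -> List[LoggedMutation]:
    """Reassemble mutations from several logs in (version, sub) order."""
    all_mutations = [m for log in logs for m in log]
    return sorted(all_mutations, key=lambda m: (m.version, m.sub))
```

For example, if one log holds subsequence numbers 1 and 3 of version 10 and another holds subsequence number 2, the merged list interleaves them as 1, 2, 3 before moving on to version 11.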
Jingyu Zhou fda6c08640 Include a total number of tags in partition log file names
This is needed for BackupContainer to check partitioned mutation logs are
continuous, i.e., restorable to a version.
2020-03-20 20:13:38 -07:00
Jingyu Zhou 940bea102a Add a knob to switch mutation logs for parallel restore
Knob FASTRESTORE_USE_PARTITIONED_LOGS, default is true to enable partitioned
mutation logs. Otherwise, old mutation logs are used.
2020-03-20 20:13:38 -07:00
Jingyu Zhou 6b9b93314e Check block padding is \0xff for new mutation logs 2020-03-20 20:13:38 -07:00
Jingyu Zhou 35aafefb89 Consolidate StringRefReader classes
Fix a compiler error of unused variable too.
2020-03-20 20:13:38 -07:00
Jingyu Zhou 88ad28e576 Integrate parallel restore with partitioned logs
In parallel restore, use new getPartitionedRestoreSet() to get a set containing
partitioned mutation logs. The loader uses a new parser to extract mutations
from partitioned logs.

TODO: fix unable to restore errors.
2020-03-20 20:13:38 -07:00
Jingyu Zhou e15015ee6c Add mutation log version names
I.e., BACKUP_AGENT_MLOG_VERSION for 2001 and PARTITIONED_MLOG_VERSION for 4110.
2020-03-20 20:13:38 -07:00
Meng Xu d3071409c5 FastRestore:Add comment for integrating with new backup format 2020-03-20 20:13:38 -07:00
Jingyu Zhou 3801e50288 Backup worker: enable 50% of time in simulation
Make this randomization a separate one.
2020-03-20 20:13:38 -07:00
Meng Xu 980037f3a8
Merge pull request #2835 from bnamasivayam/revert-report-conflicting-keys
Revert report conflicting keys
2020-03-20 10:33:26 -07:00
Jingyu Zhou 34415f82b3
Merge pull request #2832 from xumengpanda/mengxu/backup-code-review-PR
Buggify upload delay when backup worker upload data to blob
2020-03-19 21:42:28 -07:00
Balachandar Namasivayam 804fe1b22e Revert "Merge pull request #2257 from zjuLcg/report-conflicting-key"
This reverts commit 648dc4a933, reversing
changes made to 487d131b38.
2020-03-19 21:34:28 -07:00
Meng Xu dfea2c2e55 BackupWorker:Remove assert in pop 2020-03-19 20:14:52 -07:00
Meng Xu a323b80439 BackupWorker:Improve code comments 2020-03-19 15:58:22 -07:00
Jingyu Zhou 8bdda0fe04 Backup Worker: Give a chance of saving progress before displaced
Move the exit loop after the saving of progress so that when doneTrigger is
active, we won't exit the loop immediately.
2020-03-19 14:59:38 -07:00
Jingyu Zhou 5bf62c8f85 Reduce a call to getLogSystemConfig() 2020-03-19 10:08:19 -07:00
Meng Xu 94276076de BackupWorker:Buggify upload delay
Add questions to code as well.
2020-03-18 19:04:45 -07:00
Jingyu Zhou 9a91bb2b9e Add target version as the limit for version batches
If using partitioned logs, the mutations after the target version can be
included if this limit is not considered.
2020-03-18 16:44:17 -07:00
Jingyu Zhou 61f8cd2529 Add an exitEarly flag for backup worker
If a backup worker is on an old epoch, it could exit early if either of the
following is true:
- there are no backups
- all backups start at a version >= the endVersion

If this flag is set, the backup worker exits without doing any work, which
signals the master to update oldest backup epoch.
2020-03-18 16:44:17 -07:00
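
The two exit conditions listed in 61f8cd2529 can be expressed as a single predicate. The sketch below is an illustration in Python, assuming the worker is already known to be on an old epoch; the function name and parameters are hypothetical.

```python
from typing import Iterable


def can_exit_early(backup_start_versions: Iterable[int], end_version: int) -> bool:
    """Decide whether an old-epoch backup worker has nothing to do (sketch).

    all() over an empty iterable is True, which covers the "no backups" case;
    otherwise every backup must start at a version >= end_version.
    """
    return all(start >= end_version for start in backup_start_versions)
```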
Jingyu Zhou 19f6394dc9 Fix oldest backup epoch for backup workers
The oldest backup epoch is piggybacked in LogSystemConfig from master to
cluster controller and then to all workers. Previously, this epoch was set
to the current master epoch, which is wrong.
2020-03-18 16:44:17 -07:00
Jingyu Zhou 3513bbefe6 StagingKey uses mutation instead of a vector of mutations for each log version
Because each log version contains commit version and subsequence number, each
key can only have one mutation for its log version. This simplifies
StagingKey::add() a lot.
2020-03-18 16:44:17 -07:00
Jingyu Zhou 9b11bd8ee4 Batch sending all mutations of a version from RestoreLoader
This optimization is to reduce the number of messages sent from loader to
applier, which was unintentionally done when introducing sub sequence numbers
for mutations.
2020-03-18 16:42:53 -07:00
Jingyu Zhou b697e46b19 Fix duplicated mutation in StagingKey
For a reason that is not yet understood, duplicated mutations can be added to
StagingKey; these need to be filtered out. Otherwise, atomic operations can
result in corrupted data in database.
2020-03-18 16:41:35 -07:00
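
To show why duplicates matter for atomic operations, here is a hedged Python sketch that applies atomic additions while skipping repeated entries. Identifying a duplicate by its (commit version, subsequence number) pair is an assumption borrowed from the description in 3513bbefe6; the commit b697e46b19 does not state how the filter is implemented, and `apply_atomic_adds` is a hypothetical name.

```python
from typing import Iterable, Set, Tuple

# (commit version, subsequence number, amount to add); illustrative shape only.
AddMutation = Tuple[int, int, int]


def apply_atomic_adds(initial: int, mutations: Iterable[AddMutation]) -> int:
    """Apply each atomic add at most once per (version, sub) pair (sketch)."""
    seen: Set[Tuple[int, int]] = set()
    value = initial
    for version, sub, delta in mutations:
        if (version, sub) in seen:
            continue  # duplicate: applying it again would corrupt the value
        seen.add((version, sub))
        value += delta
    return value
```

Without the `seen` check, a repeated add of 5 would be counted twice, whereas a repeated plain set would be harmless; that asymmetry is why the problem surfaces with atomic operations.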
Jingyu Zhou 0fb9e943f2 Small code refactor 2020-03-18 16:41:35 -07:00
Jingyu Zhou d1ef6f1225 Fix missing mutations in splitMutation
When a range mutation is larger than the last split point, this mutation can
become missing in the RestoreLoader, which is fixed in this commit.
2020-03-18 16:41:35 -07:00
Jingyu Zhou c3dd593113 Updates latest backup worker progress after all previous epochs are done
If workers for previous epochs are still ongoing, we may end up with a
container that misses mutations in previous epochs. So the update only happens
after there are only current epoch's backup workers.
2020-03-18 16:41:35 -07:00
Jingyu Zhou d5250084bd Fix a time gap for monitoring backup keys
Backup worker starts by checking if there are backup keys and then runs
monitorBackupKeyOrPullData() loop, which does the check again. The second check
can be delayed, which causes the loop to perform NOOP pops. The fix removes
this second check and uses the result of the first check to decide what to do
in the loop.
2020-03-18 16:41:35 -07:00
Jingyu Zhou ceb56cf49d Add done trigger so that backup progress can be set
Otherwise, when there are no mutations for the unfinished range, the empty file
may not be created when the worker is displaced, thus leaving holes in version
ranges.
2020-03-18 16:41:35 -07:00
Jingyu Zhou 03fd5cf3fa Give maximum subsequence number for snapshot mutations
This is needed so that mutations in partitioned logs are applied first and
snapshot mutations are applied later for the same commit version.
2020-03-18 16:41:35 -07:00
Jingyu Zhou 472849e45c Fix MacOS compiling error
clang doesn't allow capture references, so use copy for lambda's capture list.
2020-03-18 16:41:35 -07:00
Jingyu Zhou dbb05faa24 Fix asset end version if request.targetVersion is -1 2020-03-18 16:41:35 -07:00
Jingyu Zhou 7d1538a9fc Fix wrong end version for restore loader
The restore cannot exceed the target version of the restore request. Otherwise,
the version restored is larger than the requested version.
2020-03-18 16:41:35 -07:00
Jingyu Zhou 6a302e6605 Add total number of tags to WorkerBackupStatus
This allows the backup worker to check the number of tags.
2020-03-18 16:41:35 -07:00
Jingyu Zhou ce2595821a Refactor to use std::find_if for more concise code 2020-03-18 16:41:35 -07:00
Jingyu Zhou 89d8f13038 Fix backup worker start version when logset start version is lower
The start version of tlog set can be smaller than the last epoch's end version.
In this case, set backup worker's start version as last epoch's end version to
avoid overlapping of version ranges among backup workers.
2020-03-18 16:41:35 -07:00
Jingyu Zhou 524b275a94 Add a flag to submitBackup for partitioned log
This is to distinguish with old workloads so that they can work in simulation.
2020-03-18 16:41:35 -07:00
Jingyu Zhou be1d36bed3 Backup worker updates latest log versions in BackupConfig
If backup worker is enabled, the current epoch's worker of tag (-2,0) will be
responsible for monitoring the backup progress of all workers and updating the
BackupConfig with the latest saved log version, which is the minimum version
of all tags.

This change has been incorporated in the getLatestRestorableVersion() so that
it is transparent to clients.
2020-03-18 16:41:35 -07:00
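
The rule in be1d36bed3 that the latest saved log version is the minimum across all tags can be sketched as follows. The dictionary keyed by tag name and the function name are assumptions made for the example.

```python
from typing import Dict


def latest_saved_log_version(saved_version_by_tag: Dict[str, int]) -> int:
    """Return the version up to which every tag has saved its logs (sketch)."""
    if not saved_version_by_tag:
        raise ValueError("no backup worker progress reported")
    # The slowest tag bounds what is safely saved for the whole backup.
    return min(saved_version_by_tag.values())
```

Taking the minimum is the conservative choice: a version is only covered once every tag's worker has saved its logs up to that version.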
Jingyu Zhou 15437ffb53 Add delay for master to recruit backup workers
This delay is to ensure old epoch's backup workers can save their progress in
the database. Otherwise, the new master could attempt to recruit backup
workers for the old epoch on version ranges that have already been popped. As
a result, the logs will lose data.
2020-03-18 16:41:35 -07:00
Jingyu Zhou b8c362cf44 Some correctness fixes 2020-03-18 16:41:35 -07:00
Jingyu Zhou cade657682 Give a chance for backup worker to finish writing files
If a backup worker is cancelled, wait until it finishes writing files so that
we don't need to create these files in the next epoch.
2020-03-18 16:41:35 -07:00
Jingyu Zhou a015277e49 Fix compiling error of reverse iterators
MacOS and Windows compilers don't like the use of "!=" operator of
std::map::reverse_iterator.
2020-03-18 16:41:35 -07:00
Jingyu Zhou a0fb8ad5fc Fix version gap in old epoch's backup
When pull finished and message queue is empty, we should use end version as the
popVersion for backup files. Otherwise, there might be a version gap between
last message and end version.
2020-03-18 16:41:35 -07:00
Jingyu Zhou 70487cee1b Handle partial recovery in BackupProgress
A partial recovery can result in empty epoch that copies previous epoch's
version range. In this case, getOldEpochTagsVersionsInfo() will not return
previous epoch's information. To correctly compute the start version for a
backup worker, we need to check previous epoch's saved version. If they are
larger than this epoch's begin version, use previously saved version as the
start version.
2020-03-18 16:41:35 -07:00
Jingyu Zhou 96eab2f3ec Consider previously pulled version for pulling version
Saving files only happens if we are not pulling, i.e., not in NOOP mode.
2020-03-18 16:41:35 -07:00
Jingyu Zhou de9362748e Fix: backup worker ignores deleted container 2020-03-18 16:41:35 -07:00
Jingyu Zhou ce3f0c6dfc Fix pulledVersion of backup worker
Not sure why, the cursor's version can be smaller than before.
2020-03-18 16:41:35 -07:00
Jingyu Zhou 8f57c46bc9 Fix: backup worker savedVersion init to begin version
Choosing invalidVersion is wrong, as the worker starts at beginVersion.
2020-03-18 16:41:35 -07:00
Jingyu Zhou 07f1dcb5c9 Fix contract changes: backup worker generate continuous versions
Previously, we allowed holes in version ranges in partitioned mutation logs. This
has been changed so that restore can easily figure out whether the database is
restorable. A specific problem is that if the backup worker didn't find any
mutations for an old epoch, the worker can just exit without generating a
log file, thus leaving holes in version ranges.

Another contract change is that if a backup key is set, then we must store
all mutations for that key, especially for the worker for the old epoch. As a
result, the worker must first check backup key, before pulling mutations and
uploading logs. Otherwise, we may lose mutations.

Finally, when a backup key is removed, the saving of mutations should be up to
the current version so that backup worker doesn't exit too early. I.e., avoid
the case where saved mutation versions are less than the snapshot version taken.
2020-03-18 16:41:35 -07:00
Jingyu Zhou a20236a74d Fix backup worker does NOOP pop before getting backup key
The NOOP pop causes some mutation ranges to be dropped by backup workers. As a
result, the backup is incomplete. Specifically, the wait of BACKUP_NOOP_POP_DELAY
blocks the monitoring of backup key actor.
2020-03-18 16:41:35 -07:00
Jingyu Zhou f697ccd1b9 Add describePartitionedBackup() for parallel restore
For partitioned logs, computing continuous log end version from min logs begin
version. Old backup test keeps using describeBackup() to be correctness clean.

Rename partitioned log file so that the last number is block size.
2020-03-18 16:41:35 -07:00
Jingyu Zhou af967210ee StagingKey can add out-of-order mutations
For partitioned logs, mutations of the same version may be sent to applier
out-of-order. If one loader advances to the next version, an applier may
receive later version mutations for different loaders. So, dropping of early
mutations is wrong.
2020-03-18 16:41:35 -07:00
Jingyu Zhou ace409b49a Add subsequence number to restore loader & applier
The subsequence number is needed so that mutations of the same commit version
number, but from different partitioned logs can be correctly reassembled in
order.

For old backup files, the sub number is always 0. For partitioned mutation
logs, the actual sub number is used. For range files, the sub number is always
0.
2020-03-18 16:41:34 -07:00
Jingyu Zhou d8c6bf585d Include a total number of tags in partition log file names
This is needed for BackupContainer to check partitioned mutation logs are
continuous, i.e., restorable to a version.
2020-03-18 16:39:40 -07:00
Jingyu Zhou 55005952f2 Add a knob to switch mutation logs for parallel restore
Knob FASTRESTORE_USE_PARTITIONED_LOGS, default is true to enable partitioned
mutation logs. Otherwise, old mutation logs are used.
2020-03-18 16:39:40 -07:00
Jingyu Zhou f6c27ca0d0 Check block padding is \0xff for new mutation logs 2020-03-18 16:37:02 -07:00
Jingyu Zhou 3664c6948b Consolidate StringRefReader classes
Fix a compiler error of unused variable too.
2020-03-18 16:37:02 -07:00
Jingyu Zhou 3c088b2352 Integrate parallel restore with partitioned logs
In parallel restore, use new getPartitionedRestoreSet() to get a set containing
partitioned mutation logs. The loader uses a new parser to extract mutations
from partitioned logs.

TODO: fix unable to restore errors.
2020-03-18 16:33:58 -07:00
Jingyu Zhou 21feb78f8a Add mutation log version names
I.e., BACKUP_AGENT_MLOG_VERSION for 2001 and PARTITIONED_MLOG_VERSION for 4110.
2020-03-18 16:33:58 -07:00
Meng Xu b4ab78764c FastRestore:Add comment for integrating with new backup format 2020-03-18 16:33:58 -07:00
Jingyu Zhou 7bcc0e15f2 Backup worker: enable 50% of time in simulation
Make this randomization a separate one.
2020-03-18 16:33:58 -07:00
Balachandar Namasivayam 58a9bfa78b
Merge pull request #2820 from dongxinEric/fix/1977/add-back-trace-event-flush-failure-report
Fix/1977/add back trace event flush failure report
2020-03-18 16:11:44 -07:00
Balachandar Namasivayam a476127f5f
Merge pull request #2802 from xumengpanda/mengxu/debug-master-PR
Fix correctness failure on master branch
2020-03-18 16:07:36 -07:00
Evan Tschannen 648dc4a933
Merge pull request #2257 from zjuLcg/report-conflicting-key
Report conflicting keys
2020-03-18 13:39:42 -07:00
Balachandar Namasivayam 747434a13d Increase QuietDatabase time to 90 seconds for real world cases. 2020-03-17 14:36:07 -07:00
Jingyu Zhou 5385d94fbd
Merge pull request #2827 from zjuLcg/temp-branch
Delete unnecessary parameters in MakoWorkload
2020-03-17 14:27:14 -07:00
chaoguang 16cd68d3d9 Delete unnecessary parameters 2020-03-17 13:41:14 -07:00
Evan Tschannen e08f0201f1 merge release 6.2 into master 2020-03-17 12:51:47 -07:00
Evan Tschannen 04052226df reverting a change which causes data inconsistency between the primary and secondary 2020-03-17 09:41:44 -07:00
Meng Xu 7f559bc712 Cleanup code and apply clang-format
Self code review
2020-03-16 15:08:32 -07:00
Evan Tschannen ed4d02a3e4
Merge pull request #2812 from etschannen/feature-proxy-mem-limit
Limit the amount of requests the proxy can queue up in memory
2020-03-16 14:56:56 -07:00
Meng Xu 1513df22f3 AutoQuorumChange:Exclude unreliable node from coordinator in simulation 2020-03-16 14:39:25 -07:00
Evan Tschannen 2038a56ff4
Merge pull request #2819 from etschannen/feature-first-proxy
A "proxy" class process would not be preferred as the "first proxy" for restore and DR purposes
2020-03-16 13:53:28 -07:00
Xin Dong 89861c661e Fix the random crash. Use a thread safe 'ThreadReturnPromise' instead of the ThreadFuture. 2020-03-16 13:36:55 -07:00
A.J. Beamon ee3cde0b0d
Merge pull request #2815 from etschannen/feature-timeout-tlog-create
Treat a tlog which takes a long time to create its disk queue as failed
2020-03-16 12:49:33 -07:00
Evan Tschannen a068d4063f renamed ProxyGetConsistentReadVersion 2020-03-16 12:11:32 -07:00
Evan Tschannen 7adc916e18
Merge pull request #2806 from ajbeamon/improve-team-request-performance
Improve performance of get team requests.
2020-03-16 11:56:45 -07:00
A.J. Beamon fe19f30999
Merge pull request #2813 from etschannen/feature-satellite-usable-regions
do not recruit satellite tlogs when usable regions=1
2020-03-16 11:54:42 -07:00
Evan Tschannen 012344e297 refactor getWorkersForRoleInDatacenter 2020-03-16 11:50:17 -07:00
A.J. Beamon f2defc3a3a
Merge pull request #2814 from etschannen/feature-delay-recovery
Prevent coordinated state from filling up with too many old generations
2020-03-16 11:45:17 -07:00
Evan Tschannen ea98c7a40a added additional timeout on initPersistentState 2020-03-16 11:38:14 -07:00
A.J. Beamon 682b9faa1a
Merge pull request #2817 from etschannen/feature-fix-0-left
fix: do not use priority 0 left when calculating priorities for empty teams
2020-03-16 11:15:12 -07:00
Evan Tschannen 56dee89e6e active generations should include the current one 2020-03-16 11:09:42 -07:00
Evan Tschannen e5d53c863b report in status the number of active generations 2020-03-16 10:29:17 -07:00
Meng Xu 15c48b9e19 Add event for getDesired coordinators 2020-03-16 09:40:35 -07:00
Evan Tschannen 818537ed2d
Update fdbserver/masterserver.actor.cpp
Co-Authored-By: A.J. Beamon <ajbeamon@users.noreply.github.com>
2020-03-14 15:04:46 -07:00
Evan Tschannen 0ca89547a5 make sure the number of logRouterTags is larger than the number of satelliteTLogs to avoid having satellites with no data. 2020-03-14 15:02:19 -07:00
Evan Tschannen 04b752b40a Added additional logging related to memory errors (including in status) 2020-03-13 18:31:22 -07:00
Evan Tschannen a71e61f57b fixed compiler issue 2020-03-13 18:22:38 -07:00
Evan Tschannen ebbf4490b3 use a Deque for each priority instead of a priority queue to improve CPU with large numbers of outstanding requests 2020-03-13 18:07:48 -07:00
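
The data-structure change named in ebbf4490b3 (one deque per priority level instead of a single priority queue) can be sketched in Python as below. The class name, the fixed number of priority levels, and the convention that a larger index means higher priority are all assumptions for this illustration.

```python
from collections import deque
from typing import Any, Deque, List, Optional


class PriorityDeques:
    """FIFO queue per priority level (hypothetical sketch)."""

    def __init__(self, num_priorities: int) -> None:
        self._queues: List[Deque[Any]] = [deque() for _ in range(num_priorities)]

    def push(self, priority: int, request: Any) -> None:
        # O(1) append, independent of how many requests are outstanding.
        self._queues[priority].append(request)

    def pop(self) -> Optional[Any]:
        # Scan levels from highest to lowest; cost depends on the number of
        # levels, not on the number of queued requests.
        for queue in reversed(self._queues):
            if queue:
                return queue.popleft()
        return None
```

With a binary-heap priority queue, each push and pop costs O(log n) in the number of outstanding requests; with a small fixed set of priorities, the per-level deques avoid that cost while keeping FIFO order within a level.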
Evan Tschannen 79d5511149 A "proxy" class process would not be preferred as the "first proxy" for restore and DR purposes 2020-03-13 17:49:02 -07:00
Evan Tschannen 2f2f56020f
Update fdbserver/masterserver.actor.cpp
Co-Authored-By: A.J. Beamon <ajbeamon@users.noreply.github.com>
2020-03-13 15:54:13 -07:00
chaoguang 9dc441c65a clang-format 2020-03-13 15:43:01 -07:00
chaoguang 39a37531db Fix issues according to Andrew's comments 2020-03-13 15:42:15 -07:00
A.J. Beamon 700b13e5f8 Remember the best team from team requests, which will likely be the best again and can save us some computation. 2020-03-13 15:21:33 -07:00
Evan Tschannen 12f2b32770 added additional logging in data distribution 2020-03-13 15:19:33 -07:00
Evan Tschannen 9e99a00c8f fix: do not use priority 0 left when calculating priorities for empty teams 2020-03-13 13:56:46 -07:00
chaoguang c246f79d72 Update comments 2020-03-13 13:18:18 -07:00
chaoguang a3b0dce3cd Rename vars, update comments 2020-03-13 13:12:22 -07:00
chaoguang 8ee4fea3d3 clang-format 2020-03-13 12:54:12 -07:00
chaoguang aef9b515de Change the workload to a more controlled test like ConflictRange test 2020-03-13 12:42:28 -07:00
Evan Tschannen d6d347f665 treat a tlog which takes a long time to create its disk queue as failed 2020-03-13 10:31:59 -07:00
Evan Tschannen a39effa57d delay recoveries after 70 outstanding generations, and stop recoveries after 100 outstanding generations to prevent a death spiral from filling up the coordinated state 2020-03-13 10:28:32 -07:00
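
The two thresholds in a39effa57d can be read as a simple three-way decision. The Python sketch below uses the 70 and 100 values from the commit message; the function and enum names are hypothetical, and treating the thresholds as inclusive is an assumption, since the message only says "after".

```python
from enum import Enum


class RecoveryAction(Enum):
    PROCEED = "proceed"
    DELAY = "delay"
    STOP = "stop"


def recovery_action(outstanding_generations: int,
                    delay_threshold: int = 70,
                    stop_threshold: int = 100) -> RecoveryAction:
    """Pick a recovery policy from the number of outstanding generations."""
    # Assumption: thresholds are inclusive; the source does not specify.
    if outstanding_generations >= stop_threshold:
        return RecoveryAction.STOP
    if outstanding_generations >= delay_threshold:
        return RecoveryAction.DELAY
    return RecoveryAction.PROCEED
```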
Evan Tschannen 4640edf5d6 do not recruit satellite tlogs when usable regions=1 2020-03-13 10:24:52 -07:00
Evan Tschannen 243c268d9d Limit the amount of requests the proxy can queue up in memory 2020-03-13 10:17:49 -07:00
Alex Miller d86a601b84 Add cluster.processes.id.network.tls_policy.hz to status.
This allows monitoring of TLS policy failures, but one has to go scrape
for TLSPolicyFailure trace events to figure out why they're happening.
2020-03-13 02:46:10 -07:00
chaoguang 7118759dfa Delete code to test resolvers' performance, simplify the workload to only test correctness 2020-03-13 01:48:55 -07:00
Xin Dong 5967ef5eab Added back the changes that report trace log flush failures and fix the random crash 2020-03-12 14:34:19 -07:00
Meng Xu 0ef09539a9 addressMap[normalizedAddress]->address may not equal to normalizedAddress 2020-03-12 13:01:25 -07:00
chaoguang 6f90228a0b change to krmSetRangeCoalescing 2020-03-12 11:31:36 -07:00
A.J. Beamon 555db50cd1 Avoid calling into SABTF so frequently. Use a cheaper call that only checks that shards exist. 2020-03-12 11:22:03 -07:00
Meng Xu 1759d5c8c4 Apply clang-format 2020-03-12 10:18:53 -07:00
Meng Xu a9136f3f72 Add waitForUnreliableExtraStoreReboot to wait for extra store to reboot 2020-03-12 10:18:31 -07:00
chaoguang c2f0c41c52 use krmSetRange 2020-03-11 23:12:38 -07:00
chaoguang 02ee4f4c46 Update comments 2020-03-11 22:22:51 -07:00
chaoguang 1a5b41157e add test for native transaction object 2020-03-11 11:17:34 -07:00
Meng Xu d87ed92f78 checkForExtraDataStores:Fix compilation error 2020-03-11 09:59:11 -07:00
Meng Xu e0d2eca7a8 checkForExtraDataStores:Add coordinators into stateful process list 2020-03-10 23:38:30 -07:00
Meng Xu bd345f85db ConsistencyCheck:Fix failure due to address inconsistency between process and worker
With TLS, a worker (or process) can have a TLS address and non-TLS address.
When a process is created in simulation, the primary address is TLS by default.
The non-TLS one is the TLS address port plus one.

In a connection between two workers, if their primary addresses do not enable
or disable TLS together, one worker will swap its primary address and secondary address
so that the TLS config of the two endpoints can match.

The swap can make the primary address no longer the TLS one that was created
when the process is created. And the swap only happens for worker instead of
process struct in simulation.

This swap can cause worker->address != process->address.
In checkForExtraDataStores actor, we use worker->address to check if a process
is killable and use the process->address to kill the process. The inconsistency
can cause simulation to kill a protected process that is not killable and leads
to simulation failure.
2020-03-10 21:07:16 -07:00
chaoguang 698198a09e Merge remote-tracking branch 'upstream/master' into report-conflicting-key 2020-03-09 10:50:33 -07:00
Evan Tschannen 303df197cf Merge branch 'release-6.2'
# Conflicts:
#	CMakeLists.txt
#	bindings/c/test/mako/mako.c
#	documentation/sphinx/source/release-notes.rst
#	fdbbackup/backup.actor.cpp
#	fdbclient/NativeAPI.actor.cpp
#	fdbclient/NativeAPI.actor.h
#	fdbserver/DataDistributionQueue.actor.cpp
#	fdbserver/Knobs.cpp
#	fdbserver/Knobs.h
#	fdbserver/LogRouter.actor.cpp
#	fdbserver/SkipList.cpp
#	fdbserver/fdbserver.actor.cpp
#	flow/CMakeLists.txt
#	flow/Knobs.cpp
#	flow/Knobs.h
#	flow/flow.vcxproj
#	flow/flow.vcxproj.filters
#	versions.target
2020-03-06 18:22:46 -08:00
Evan Tschannen dbfc0cbcc0
Merge pull request #2781 from alexmiller-apple/certificate-refresh
Refresh certificates used for handshaking when they change on disk
2020-03-06 11:12:04 -08:00
Evan Tschannen 98647a61fc
Merge pull request #2784 from ajbeamon/add-resolver-metrics
Add ResolverMetrics trace event
2020-03-06 09:38:30 -08:00
A.J. Beamon faf9101ad4
Update fdbserver/Resolver.actor.cpp
Co-Authored-By: Evan Tschannen <36455792+etschannen@users.noreply.github.com>
2020-03-06 09:20:38 -08:00
Evan Tschannen 1076abdee5 fixed crash when interf was not created 2020-03-05 19:09:08 -08:00
Evan Tschannen 1128666840 added additional logging on the log router 2020-03-05 18:17:06 -08:00
A.J. Beamon 7fb8c3c080 Remove unused variable. 2020-03-05 11:38:30 -08:00
A.J. Beamon effb6d2d49 Add ResolverMetrics trace event 2020-03-05 10:49:21 -08:00
Alex Miller 595dd77ed1 Merge remote-tracking branch 'upstream/release-6.2' into certificate-refresh 2020-03-04 20:25:42 -08:00
Alex Miller 9b5ef3416e Refactor TLSParams into TLSConfig + LoadedTLSConfig
The idea is that we keep around a TLSConfig that holds the configuration
that the user has provided, and then when we want to initialize an SSL
context, we ask the TLSConfig to load all certificates and return us a
LoadedTLSConfig that is a concrete set of certificate bytes in memory.

initTLS now just takes the in-memory bytes and applies them to the ssl
context.

This is a large refactor to lead up to certificate refreshing, where we
will periodically check for changes to the certificates, and then
re-load them and apply them to a new SSL context.
2020-03-04 20:14:47 -08:00
Evan Tschannen f3ac2c9180 renamed a variable 2020-03-04 18:49:21 -08:00
Evan Tschannen b3ea9d5896 Do not allow the cluster controller to mark any process as failed within 30 seconds of startup 2020-03-04 18:45:26 -08:00
Evan Tschannen e219c1671f Merge branch 'release-6.2' into feature-dd-region-queue
# Conflicts:
#	fdbserver/Knobs.h
2020-03-04 16:25:38 -08:00
Evan Tschannen 6d6f184e2f added a knob which reverts the new queue behavior 2020-03-04 16:23:49 -08:00
Evan Tschannen b7834b2995
Merge pull request #2774 from etschannen/feature-dd-repopulate-priority
Make the DD priority of populating a region lower than machine failures
2020-03-04 16:15:18 -08:00
Xin Dong 39610d15f8 Revert this change since it somehow introduced a random crash detected on circus 2020-03-04 16:14:38 -08:00
A.J. Beamon 58e621eca1 Invalid knobs or knob values are treated as warnings rather than errors. Apply this change to backup as well. 2020-03-04 15:50:04 -08:00
Evan Tschannen 125bd13198 fix: in multi-region configurations, the data distribution queue could start too much work, expecting that the remote region would contribute to the read workload 2020-03-04 14:17:17 -08:00
Evan Tschannen 6296465e07 Make the DD priority associated with populating a remote region lower than machine failures 2020-03-04 14:07:32 -08:00
chaoguang c63909c18c clang-format 2020-03-04 11:44:14 -08:00
chaoguang 3a98c691b6 Update comments 2020-03-04 11:42:19 -08:00
chaoguang 7a76e9556d Merge remote-tracking branch 'upstream/master' into report-conflicting-key 2020-03-04 11:24:39 -08:00
Andrew Noyes c3b67c0c63 Fix OPEN_FOR_IDE build 2020-03-03 11:32:43 -08:00
Meng Xu e6457ba0d5 FastRestore:Correct type for imcompleteStagingKeys 2020-03-02 11:33:07 -08:00