Commit Graph

1637 Commits

Author SHA1 Message Date
Meng Xu 94d799552e FastRestore:Apply clang-format against master 2020-02-18 16:41:59 -08:00
Meng Xu 132f5aa9ba FastRestore:Improve trace name and cosmetic change 2020-02-18 16:41:19 -08:00
Meng Xu 31a6ec34b7 Merge branch 'master' into mengxu/fast-restore-agent-PR 2020-02-18 16:17:59 -08:00
Meng Xu a12a161fb3 Merge branch 'master' into mengxu/fast-restore-pipeline-PR 2020-02-18 14:49:52 -08:00
Meng Xu c603b20e7e FastRestore:Resolve review comments 2020-02-18 14:08:27 -08:00
A.J. Beamon 649fc6ba94
Merge pull request #2329 from davisp/trace-clock-source-network-option
Add network option for the trace clock source
2020-02-15 10:43:00 -08:00
Paul J. Davis 32e285a761 Add network option for the trace clock source
This option allows clients to select the clock source for trace events
similar to the `--traceclock` command line parameter for `fdbserver`.
Using the `realtime` clock sources makes loading event data into
OpenTracing systems like Jaeger more useful.
2020-02-15 11:30:43 -06:00
Alex Miller 94e7f790d8
Merge pull request #2667 from atn34/atn34/remove-flatbuffers-knob
Remove USE_OBJECT_SERIALIZER knob
2020-02-14 15:44:38 -08:00
Xin Dong 1849939bc3 Added a delay to avoid getting stuck in a loop: the request is not versioned, so a storage server that is behind might not know it has been assigned a shard range that a proxy thinks it has. 2020-02-12 15:01:26 -08:00
Xin Dong 2e1d03cbe7 Addressed AJ's review comments 2020-02-12 14:57:40 -08:00
Xin Dong 03287a0214 Fix build error. 2020-02-12 14:57:40 -08:00
Xin Dong 57f0c11712 Address Evan's review comments 2020-02-12 14:57:40 -08:00
Xin Dong d20ce99774 Resolved the review comment and renamed the functions 2020-02-12 14:57:40 -08:00
Xin Dong d934aed1d7 When the user issues 'getStorageByteSample' on a large key range, which can be as large as the whole DB, we need to change the behavior of 'waitStorageMetricsMultipleLocation' to avoid the case where a target key range gets moved or split by DD, so that the call to 'waitMetrics' on the corresponding storage server returns a 'wrong_shard_server' error and the whole 'waitStorageMetricsMultipleLocation' is retried on the large key range. What we want is to retry only the key range that caused the error. 2020-02-12 14:57:40 -08:00
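The retry behaviour described in commit d934aed1d7 could be sketched roughly as follows. This is a minimal Python illustration, not the FoundationDB implementation: `WrongShardServer`, `sum_metrics`, `fetch`, `locate`, and `max_attempts` are all hypothetical names introduced here, and the real code is C++ actor code.

```python
class WrongShardServer(Exception):
    """Hypothetical stand-in for the 'wrong_shard_server' error."""


def sum_metrics(ranges, fetch, locate, max_attempts=5):
    """Sum a metric over key ranges, retrying only the ranges that fail.

    fetch(r)  -> number, or raises WrongShardServer if r has moved or split.
    locate(r) -> list of current sub-ranges covering the stale range r.
    Both callables are assumptions made for this sketch.
    """
    total = 0
    pending = [(r, 0) for r in ranges]
    while pending:
        r, attempts = pending.pop()
        try:
            total += fetch(r)
        except WrongShardServer:
            if attempts + 1 >= max_attempts:
                raise
            # Re-resolve only the failing range; results already gathered
            # for the other ranges are kept and not fetched again.
            pending.extend((sub, attempts + 1) for sub in locate(r))
    return total
```

The point of the design is that a single moved shard costs one extra lookup for that shard, instead of restarting the request over the whole key range.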
Xin Dong 807204e676 Update fdbclient/MultiVersionTransaction.actor.cpp
Apply A.J's suggestion.

Co-Authored-By: A.J. Beamon <ajbeamon@users.noreply.github.com>
2020-02-12 14:57:40 -08:00
Xin Dong d5c3f821e2 Added missing pieces. 2020-02-12 14:57:40 -08:00
Xin Dong 70f89042fd Remove comment that does not apply anymore 2020-02-12 14:57:40 -08:00
Xin Dong 0c16d43c2f Added the necessary plumbing to expose the byte sample collected by storage servers to the fdb_c library 2020-02-12 14:57:40 -08:00
Andrew Noyes 1248d2b8b4 Remove USE_OBJECT_SERIALIZER knob 2020-02-12 10:41:52 -08:00
Meng Xu cda8fc189e FastRestore:AtomicOp:Intro weighted size for atomicOp
atomicOp has an amplified performance overhead on the cluster:
for example, an ADD operation can be small, but the SS has to load
the value to do the operation, and the value can be large.
2020-02-11 12:48:05 -08:00
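One way to picture the weighted size from commit cda8fc189e is a batching routine that charges atomic operations more than their raw byte count. The sketch below is an assumption-laden illustration: `ATOMIC_OP_WEIGHT`, `Mutation`, `weighted_size`, and `batch_by_weight` are hypothetical names, and the multiplier is not a value taken from the source.

```python
from dataclasses import dataclass

# Hypothetical multiplier; the actual weight used by FastRestore is not
# stated in the commit message.
ATOMIC_OP_WEIGHT = 100


@dataclass
class Mutation:
    kind: str    # "set", "clear", or an atomic op such as "add"
    key: bytes
    param: bytes


def weighted_size(m: Mutation) -> int:
    """Raw byte size, amplified for atomic ops that force a value read."""
    raw = len(m.key) + len(m.param)
    is_atomic = m.kind not in ("set", "clear")
    return raw * ATOMIC_OP_WEIGHT if is_atomic else raw


def batch_by_weight(mutations, limit):
    """Group mutations so each batch's weighted size stays within limit.

    A single mutation heavier than the limit still gets its own batch.
    """
    batches, current, used = [], [], 0
    for m in mutations:
        w = weighted_size(m)
        if current and used + w > limit:
            batches.append(current)
            current, used = [], 0
        current.append(m)
        used += w
    if current:
        batches.append(current)
    return batches
```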
mpilman 5a9d420cb7 Merge remote-tracking branch 'upstream/release-6.2' into release-merges/20200210 2020-02-10 10:02:05 -08:00
A.J. Beamon ff44bd2b33
Merge pull request #2639 from atn34/atn34/include-port-in-address-default
Enable include_port_in_address by default for api version 700
2020-02-10 09:50:59 -08:00
Markus Pilman e71fe44ee3
Merge branch 'master' into features/icc 2020-02-08 21:33:02 -08:00
Evan Tschannen 844c8511c4
Merge pull request #2588 from jzhou77/backup-worker
Integrate new backup worker with existing backup command
2020-02-05 14:14:43 -08:00
Jingyu Zhou d5849af5c0 Address review comments 2020-02-05 10:33:51 -08:00
Meng Xu 08443ed18d FastRestore:Remove debug trace for debugging connection errors 2020-02-04 17:06:02 -08:00
Evan Tschannen 8449badb3e
Merge pull request #1868 from dongxinEric/fix/1827/error_instead_of_timeout
Send error back before put the GRV request with PRIORITY_BATCH into t…
2020-02-04 14:32:47 -08:00
mpilman 100402aadf Don't call operator explicitly 2020-02-04 11:03:43 -08:00
mpilman 52ca752dd3 Merge remote-tracking branch 'origin/features/icc' into features/icc 2020-02-04 10:29:49 -08:00
mpilman d09e07f1f5 Merge remote-tracking branch 'upstream/master' into features/icc 2020-02-04 10:26:18 -08:00
Jingyu Zhou 52c6737411 Rename backupLoggingEnabled as backupWorkerEnabled
To highlight the 7.0 backup changes. By default, the
backup_worker_enabled flag is set for version 7.0.
2020-02-04 10:09:16 -08:00
Jingyu Zhou 7c10683c77 Backup workers save logs into right containers
The mutation logs of backup workers are saved into "mlogs" directory under the
container directory. The backup worker has been restructured to handle multiple
backups, where each one is stored in a separate backup container.

In the backup worker, mutations pulled from TLogs are buffered in a message
queue. When writing out to different containers, their corresponding mutation
ranges are used to check if a mutation should be written. When a new backup
is submitted by the client, "backupStartedKey" is updated. The worker monitors
this key and updates its internal map of backups; the next pull from TLog
then waits for the new backup to be ready. This ensures that when
worker 0 marks the backup as started, all workers are already logging
mutations for the backup.
2020-02-03 20:27:14 -08:00
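The per-container range check described in commit 7c10683c77 might look roughly like the sketch below. It assumes half-open `[begin, end)` key ranges and represents each queued mutation as a `(version, key, value)` tuple; `in_range` and `route_mutations` are hypothetical names, not functions from the backup worker.

```python
def in_range(key, begin, end):
    """True if key falls in the half-open range [begin, end)."""
    return begin <= key < end


def route_mutations(queue, backups):
    """Decide which buffered mutations each backup container should receive.

    queue:   iterable of (version, key, value) tuples pulled from the TLogs.
    backups: dict mapping a backup id to its list of (begin, end) key ranges.
    Returns a dict mapping each backup id to the mutations it should write.
    """
    out = {uid: [] for uid in backups}
    for version, key, value in queue:
        for uid, ranges in backups.items():
            if any(in_range(key, b, e) for b, e in ranges):
                out[uid].append((version, key, value))
    return out
```

A mutation whose key lies in the ranges of several backups is written once per matching container, and one that matches none is dropped.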
Jingyu Zhou 0db03f1d3c Use backup_logging_enabled flag
The default is to enable new backup workers. Users can disable this flag to
turn off the backup worker feature.
2020-02-03 20:03:22 -08:00
Meng Xu 3b57bf1781 Merge branch 'master' into mengxu/fast-restore-agent-PR 2020-02-03 17:23:54 -08:00
Evan Tschannen 4524831456
Merge pull request #2518 from vishesh/task/failmon-remove-server
FailureMonitoring: Server processes no longer need to talk to ClusterController
2020-02-03 17:22:50 -08:00
Meng Xu ca3b6135d0 FastRestore:Add debug to see why restore role is not connected
Reason: the restore worker is an fdbserver process that does not register with the CC.
The new failure monitor changes how connections work for clients and servers.
A client does not connect to the CC to get connected.
A server has to connect to the CC to get connected.
The restore worker becomes a special role that behaves like a client but is a server.
2020-02-03 17:19:52 -08:00
Andrew Noyes 2ce887012c Respect api version for include_port_in_address 2020-02-03 15:25:30 -08:00
Andrew Noyes 07a3051f0e Enable include_port_in_address by default for api version 700
Resolves #2607
2020-02-03 15:10:00 -08:00
Meng Xu 9c2046b11b FastRestore:Mimic fdbd to monitor coordinators
Before we start an fdb restore process.
2020-02-03 14:48:31 -08:00
Jingyu Zhou 297f22726c Add backup_type database configuration option
Update simulation tests to randomly set backup types to be one of: old backup
(default), new backup (tagged), or both (default+tagged).
2020-01-31 19:29:09 -08:00
Jingyu Zhou 38aa1903fd Add a DB configuration option for backup workers
Right now, the default is to keep the old backup behavior, i.e., do NOT use
backup workers. Specifically, if BackupType is not set (or is set to default),
the master will not recruit backup workers and will not add pseudo locality for
backup workers.

The StartFullBackupTaskFunc is updated to check if backup worker is enabled.
Only when it is enabled will starting a backup wait on all backup workers
to be started.
2020-01-31 19:29:09 -08:00
Jingyu Zhou f7956cfbfc Clear backup UID from backupStartedKey when finish/abort backups
Clearing this key signals backup workers that backup is no longer needed. When
no backup is going on, the backup workers switch to the NOOP state.
2020-01-31 19:29:09 -08:00
Jingyu Zhou 19ef7f6bdb Skip watch of backup task's started key if it's already set
The backup task may be restarted multiple times so the started key for the
backup task may already be set. In this case, the wait on watch should be
skipped.
2020-01-31 19:29:09 -08:00
Jingyu Zhou f8342f0884 Add keepRunning for start backup transaction
TaskBucket::keepRunning() needs to be called in backup transactions to be sure
that the task has not been cancelled. If it has, the task stops. Without this
check, the task can continue to run, causing multiple runs of the same task.

Another subtle issue is that the beginVersion is persisted on backupStartedKey.
So while reading it back from that key, we should set task's beginVersion with
the value persisted earlier.
2020-01-31 19:29:09 -08:00
Jingyu Zhou 5a602f58e8 Start backup with a wait on all backup workers running
This wait is to make sure that backup workers are already saving mutations so
that no mutations are missed. The idea is that the CLI sets a "backupStartedKey"
in the database and waits for allWorkerStarted() key of the backup to be set.

Backup workers monitor the changes to the "backupStartedKey" and start logging
mutations. Additionally, the backup worker for Tag(-2,0) checks that all other
workers have started (their saved progress version is larger than the backup's
start version), and then sets the allWorkerStarted() key for the backup.
2020-01-31 19:29:09 -08:00
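The readiness check in commit 5a602f58e8 (each worker's saved progress version must be larger than the backup's start version) can be illustrated with a small predicate. This is only a sketch: `all_workers_started` and the dictionary shape are hypothetical, and treating an empty worker set as "not started" is an assumption made here, not something stated in the source.

```python
def all_workers_started(saved_versions, start_version):
    """Return True once every worker has progressed past start_version.

    saved_versions: dict mapping a worker id to its saved progress version.
    An empty dict is treated as "not started" (an assumption of this sketch).
    """
    return bool(saved_versions) and all(
        v > start_version for v in saved_versions.values()
    )
```

Only when this predicate holds would the designated worker set the key that the start-backup call is waiting on.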
Jingyu Zhou e9c7ad82cc Comment out pseudo tag pop trace event 2020-01-31 19:29:09 -08:00
Xin Dong 7016f7903b Fixed another build error. Do not use timeReplyIgnoreError since we do not want the logging inside that function, so it is no longer necessary. Change to use ready(), which basically ignores the error. 2020-01-31 15:48:29 -08:00
Xin Dong c1f992667b Fix build failure 2020-01-31 14:27:47 -08:00
Xin Dong 8d28c2a7f0 Added two new counters for the transaction throttled error and removed the verbose trace event logging. Also changed a chain of 'if' statements into 'if-else' statements since they are mutually exclusive 2020-01-31 14:16:39 -08:00
Alex Miller ee6490c9d1
Merge pull request #2314 from mengranwo/memory-engine
New Radix-Tree based Memory Storage Engine
2020-01-30 16:20:13 -08:00