foundationdb

Commit Graph

Author	SHA1	Message	Date
A.J. Beamon	d128252e90	Merge release-6.3 into master	2020-05-22 09:25:32 -07:00
Jingyu Zhou	9e23166cf8	Fix a super rare missing mutation bug in backup workers When a backup worker stops pulling for an old epoch, we cannot clear mutations. This is because these mutations are needed for saving.	2020-05-21 12:19:57 -07:00
Jingyu Zhou	cdeabc4de6	Fix memory accounting error due to growing Arena After an Arena object is counted, it can grow larger later. So we can't reduce the amount of memory of arena size later. Instead, we use the arena size when inserting mutations.	2020-05-20 13:26:57 -07:00
Jingyu Zhou	9fbbec1033	Fix duplicated counting of memory usage For each message from LogPeekCursor, check it's using different arena from the previous one. Otherwise, the arena's memory could be counted twice.	2020-05-16 18:50:09 -07:00
Jingyu Zhou	a2e5050492	Fix duplicated mutation This seems to be related to how actor compiler generates code. The message can be inserted twice with original code ordering.	2020-05-16 10:52:11 -07:00
Jingyu Zhou	caca31d05a	Filter out mutations before the true-up version When a mutation log's begin version is true-uped, we must filter out mutations less than such a version.	2020-05-15 20:06:47 -07:00
Jingyu Zhou	01eff0fc03	Fix memory bytes accounting Avoid duplicated counting of arena memory since messages from peek cursor can share arena.	2020-05-14 19:59:54 -07:00
Jingyu Zhou	17915e13b0	Limit memory usage of backup workers	2020-05-14 13:24:56 -07:00
Jingyu Zhou	1a35efe43c	Add an assertion: mutation version >= log's begin version This is to check that no version's data are split into two files.	2020-05-14 12:06:13 -07:00
Alex Miller	ccaac162e2	Resolve performance concerns of nearly-no-op debugMutation being frequently called This introduces unhygienic macro variants that inline an `ENABLED &&` before the TraceEvent. This way, they get entirely compiled out unless enabled. Then rewrite all debugMutation uses via sed.	2020-05-13 18:44:15 -07:00
Alex Miller	27da91ab9e	Merge remote-tracking branch 'upstream/master' into mutation-debugging	2020-05-13 12:51:44 -07:00
A.J. Beamon	36454bb3b8	Merge branch 'master' into transaction-tagging # Conflicts: # fdbclient/MasterProxyInterface.h # fdbclient/NativeAPI.actor.cpp	2020-05-04 10:23:25 -07:00
Evan Tschannen	bd699f435c	fixed compiler errors	2020-05-01 11:01:09 -07:00
A.J. Beamon	66228343f1	Merge branch 'master' into transaction-tagging	2020-04-30 08:12:03 -07:00
Jingyu Zhou	7d59e53349	Consolidate makePadding()	2020-04-28 15:39:23 -07:00
A.J. Beamon	41c517a5dd	Merge branch 'master' into transaction-tagging # Conflicts: # fdbclient/NativeAPI.actor.cpp	2020-04-27 13:05:24 -07:00
A.J. Beamon	239876351b	Add some initial auto-throttling. Move the definition of the priority enum to a more global place and use it for all transaction priorities (except in ClientLogEvents, because of serialization incompatibilities).	2020-04-24 11:31:16 -07:00
Jingyu Zhou	0ae0a81edf	Ensure mutation logs save complete version's data I.e., do not allow the same version's mutations saved in different files. Otherwise, we may have a file only contain a version's partial data, causing continuity analysis of mutation logs to fail. This could also cause restore failures, if the target version's mutations are stored in two files. In the above description, all mutation logs refer to the same tag's logs.	2020-04-20 20:41:30 -07:00
Jingyu Zhou	61f0f44ab3	Fix comments on startVersion in BackupWorker	2020-04-20 17:07:50 -07:00
Jingyu Zhou	5f43e18906	Backup worker pops max of savedVersion or NOOP's popVersion	2020-04-20 11:43:09 -07:00
Jingyu Zhou	70221a25d7	True-up a backup's begin version For the first mutation log of a backup, we need to true-up its begin version to the exact version of the first mutation. This is needed to ensure the strict less than relationship between two mutation logs, if one's version range is within the other. A problematic scenario is as follows: Epoch 1: a mutation log A [200, 900] is saved, but its progress is NOT saved. Epoch 2: master recruits a worker for [1, 1000], 1000 is epoch 1's end version. New worker saves a mutation log B [100, 1000] A's range is strict within B's range, but A's size is larger than B. This happens because B's start version is true-up to the backup's begin version, which is not the actual version of the first mutation. After B's begin version is true-up to 300, we won't have this issue.	2020-04-20 11:06:46 -07:00
Jingyu Zhou	8245f12091	Backup worker doesn't save progress in NOOP mode This fixes the consistency check failure, where saving progress commits new transactions. Pop is performed by the NOOP loop in monitorBackupKeyOrPullData.	2020-04-20 11:06:46 -07:00
Jingyu Zhou	cdc911a6ae	Fix inadvertent savedVersion update	2020-04-20 11:06:46 -07:00
Jingyu Zhou	76d90ac6d7	Limit the version range for old epochs When the Master recruits a backup worker for previous epochs, the Master may set the begin version to a very low number, because the backup progress for that epoch is not saved. This can cause problem for the log file, since these low versions have been popped. The fix here is to advance savedVersion to the minimum of backup's starting version if it is higher than the begin version set by the Master. This is safe because these versions are not popped. If they are popped, their progress should already be recorded and Master would use a higher version than the backup's starting version.	2020-04-20 11:06:46 -07:00
Jingyu Zhou	4e128328f7	Stop backup workers before clearing DB in parallel restore workload This is because the clearing of DB can be picked up by backup workers and be applied during restore, causing restore failures.	2020-04-11 10:26:08 -07:00
Jingyu Zhou	60407bdee3	Use LiteralStringRef for backup paused key	2020-04-07 16:02:25 -07:00
Jingyu Zhou	9fb3fb9d82	Add pause/resume for new backups To pause/resume the backup workers, the fdbbackup command will write to the backupPausedKey. Then backup workers noticed the value of the key has been changed and stops/resumes pulling from TLog.	2020-04-06 14:29:46 -07:00
Jingyu Zhou	411b4c28ac	Update mutation bytes written for new backups This will make the log bytes written available to backup status and describe backup calls.	2020-03-29 21:23:34 -07:00
Alex Miller	40d10aa990	Fix debugMutation uses that were concurrently added in new backup code	2020-03-27 04:01:18 -07:00
Jingyu Zhou	00fb4c1a35	Fix an off by one error Backup worker's saved version should start from its startVersion - 1, i.e., the startVersion is not saved yet. Otherwise, if the version range is just the startVersion itself and there is no data, then the range [startVersion, startVersion + 1) will be missing. This causes non-continuous partitioned logs.	2020-03-24 23:40:36 -07:00
Jingyu Zhou	669916467e	Add missing transaction reset call	2020-03-24 20:14:37 -07:00
Jingyu Zhou	edcbeb8992	Address review comments Move transaction object outside of the loop and rename trace events.	2020-03-24 18:22:20 -07:00
Jingyu Zhou	a3058e7d96	Fix incorrectly marking a backup job as stopped This causes missing version ranges for mutation logs.	2020-03-23 22:05:58 -07:00
Jingyu Zhou	82a1790776	Fix backup worker crash due to aborted backup job If a backup job is aborted, the "startedBackupWorkers" key can be cleared, thus triggering the assertion failure.	2020-03-23 21:11:25 -07:00
Jingyu Zhou	f1d7fbafb4	Stop actors for displaced backup workers If the worker is displaced, it should not update backup containers.	2020-03-23 18:48:06 -07:00
Jingyu Zhou	fd7643c322	Remove a variable	2020-03-23 13:45:48 -07:00
Jingyu Zhou	90b40e1d75	Merge branch 'mengxu/new-backup-format-PR-delta' of github.com:xumengpanda/foundationdb into backup-worker-bak Resolve Conflicts: fdbclient/BackupAgent.actor.h fdbserver/BackupWorker.actor.cpp fdbserver/RestoreMaster.actor.cpp fdbserver/masterserver.actor.cpp	2020-03-23 13:35:33 -07:00
Meng Xu	be67ab4d6a	Correct comment based on review	2020-03-23 12:53:40 -07:00
Meng Xu	3f31ebf659	New backup:Revise event name and explain code	2020-03-23 10:55:44 -07:00
Jingyu Zhou	a8c2acdba0	Count the unique number of tags in startedBackupWorkers	2020-03-23 10:44:26 -07:00
Jingyu Zhou	1552653f1c	Backup Worker: Cancel the actor when container is stopped	2020-03-22 21:08:11 -07:00
Jingyu Zhou	33ea027f84	Make sure only current epoch's backup workers update all workers So that backup workers from old epochs don't mess with the list of all workers.	2020-03-22 18:28:22 -07:00
Jingyu Zhou	44c1996950	Change all worker started to be set after all workers updated a key Previously, all worker started is set to be when saved log versions are higher. However, saving the versions can be wrong, as the worker is not guaranteed to write to the right container. For instance, if the watch is triggered later, then mutation logs are written to previous containers. So we need to ensure the right container is ready -- all workers have acknowledged seeing the container.	2020-03-22 16:40:12 -07:00
Jingyu Zhou	0fe2810425	Fix repeated backup progress checking in backup worker The delay is not used, which caused repeated progress checking in worker 0.	2020-03-20 20:16:43 -07:00
Jingyu Zhou	4a499a3c97	Remove backup worker's first and last pop The first pop of current epoch can pop old epoch's data before they are saved. The last pop of a stopped backup worker should be skipped so that after recovery, the data is still accessible in case the last epoch's progress saving transaction is delayed.	2020-03-20 20:16:43 -07:00
Jingyu Zhou	9d6de758a7	Backup Worker: Give a chance of saving progress before displaced Move the exit loop after the saving of progress so that when doneTrigger is active, we won't exit the loop immediately.	2020-03-20 20:16:10 -07:00
Jingyu Zhou	08173951bc	Add an exitEarly flag for backup worker If a backup worker is on an old epoch, it could exit early if either of the following is true: - there is no backups - all backups starts a version >= the endVersion If this flag is set, the backup worker exit without doing any work, which signals the master to update oldest backup epoch.	2020-03-20 20:15:09 -07:00
Jingyu Zhou	5b36dcaad5	Fix oldest backup epoch for backup workers The oldest backup epoch is piggybacked in LogSystemConfig from master to cluster controller and then to all workers. Previously, this epoch is set to the current master epoch, which is wrong.	2020-03-20 20:15:09 -07:00
Jingyu Zhou	9ea549ba7d	Updates latest backup worker progress after all previous epochs are done If workers for previous epochs are still ongoing, we may end up with a container that miss mutations in previous epochs. So the update only happens after there are only current epoch's backup workers.	2020-03-20 20:15:09 -07:00
Jingyu Zhou	1a1f572f29	Fix a time gap for monitoring backup keys Backup worker starts by check if there are backup keys and then runs monitorBackupKeyOrPullData() loop, which does the check again. The second check can be delayed, which causes the loop to perform NOOP pops. The fix removes this second check and uses the result of the first check to decide what to do in the loop.	2020-03-20 20:15:09 -07:00
Jingyu Zhou	fa7c8d8bb3	Add done trigger so that backup progress can be set Otherwise, when there is no mutations for the unfinished range, the empty file may not be created when the worker is displaced, thus leaving holes in version ranges.	2020-03-20 20:15:09 -07:00
Jingyu Zhou	c59b0844a9	Add total number of tags to WorkerBackupStatus This allows the backup worker to check the number of tags.	2020-03-20 20:15:08 -07:00
Jingyu Zhou	e9287407d6	Backup worker updates latest log versions in BackupConfig If backup worker is enabled, the current epoch's worker of tag (-2,0) will be responsible for monitoring the backup progress of all workers and update the BackupConfig with the latest saved log version, which is the minimum version of all tags. This change has been incorporated in the getLatestRestorableVersion() so that it is transparent to clients.	2020-03-20 20:15:08 -07:00
Jingyu Zhou	80d3fa1222	Add delay for master to recruit backup workers This delay is to ensure old epoch's backup workers can save their progress in the database. Otherwise, the new master could attempts to recruit backup workers for the old epoch on version ranges that have already been popped. As a result, the logs will lose data.	2020-03-20 20:15:08 -07:00
Jingyu Zhou	fe6b4a4398	Some correctness fixes	2020-03-20 20:15:08 -07:00
Jingyu Zhou	5afc23a0e1	Give a chance for backup worker to finish writing files If a backup worker is cancelled, wait until it finishes writing files so that we don't need to create these files in the next epoch.	2020-03-20 20:15:08 -07:00
Jingyu Zhou	b792d76d62	Fix version gap in old epoch's backup When pull finished and message queue is empty, we should use end version as the popVersion for backup files. Otherwise, there might be a version gap between last message and end version.	2020-03-20 20:15:08 -07:00
Jingyu Zhou	e3eb3beaaf	Consider previously pulled version for pulling version Saving files only happens if we are not pulling, i.e., not in NOOP mode.	2020-03-20 20:15:08 -07:00
Jingyu Zhou	1b159a3785	Fix: backup worker ignores deleted container	2020-03-20 20:14:36 -07:00
Jingyu Zhou	00350dd3d8	Fix pulledVersion of backup worker Not sure why, the cursor's version can be smaller than before.	2020-03-20 20:14:35 -07:00
Jingyu Zhou	672ad7a8ea	Fix: backup worker savedVersion init to begin version Choosing invalidVersion is wrong, as the worker starts at beginVersion.	2020-03-20 20:14:35 -07:00
Jingyu Zhou	c300a5c1b7	Fix contract changes: backup worker generate continuous versions Before we allow holes in version ranges in partitioned mutation logs. This has been changed so that restore can easily figure out if database is restorable. A specific problem is that if the backup worker didn't find any mutations for an old epoch, the worker can just exit without generating a log file, thus leaving holes in version ranges. Another contract change is that if a backup key is set, then we must store all mutations for that key, especially for the worker for the old epoch. As a result, the worker must first check backup key, before pulling mutations and uploading logs. Otherwise, we may lose mutations. Finally, when a backup key is removed, the saving of mutations should be up to the current version so that backup worker doesn't exit too early. I.e., avoid the case saved mutation versions are less than the snapshot version taken.	2020-03-20 20:14:35 -07:00
Jingyu Zhou	86edc1c9c8	Fix backup worker does NOOP pop before getting backup key The NOOP pop causes some mutation ranges being dropped by backup workers. As a result, the backup is incomplete. Specifically, the wait of BACKUP_NOOP_POP_DELAY blocks the monitoring of backup key actor.	2020-03-20 20:13:38 -07:00
Jingyu Zhou	fda6c08640	Include a total number of tags in partition log file names This is needed for BackupContainer to check partitioned mutation logs are continuous, i.e., restorable to a version.	2020-03-20 20:13:38 -07:00
Jingyu Zhou	e15015ee6c	Add mutation log version names I.e., BACKUP_AGENT_MLOG_VERSION for 2001 and PARTITIONED_MLOG_VERSION for 4110.	2020-03-20 20:13:38 -07:00
Meng Xu	dfea2c2e55	BackupWorker:Remove assert in pop	2020-03-19 20:14:52 -07:00
Meng Xu	a323b80439	BackupWorker:Improve code comments	2020-03-19 15:58:22 -07:00
Jingyu Zhou	8bdda0fe04	Backup Worker: Give a chance of saving progress before displaced Move the exit loop after the saving of progress so that when doneTrigger is active, we won't exit the loop immediately.	2020-03-19 14:59:38 -07:00
Meng Xu	94276076de	BackupWorker:Buggify upload delay Add questions to code as well.	2020-03-18 19:04:45 -07:00
Jingyu Zhou	61f8cd2529	Add an exitEarly flag for backup worker If a backup worker is on an old epoch, it could exit early if either of the following is true: - there is no backups - all backups starts a version >= the endVersion If this flag is set, the backup worker exit without doing any work, which signals the master to update oldest backup epoch.	2020-03-18 16:44:17 -07:00
Jingyu Zhou	19f6394dc9	Fix oldest backup epoch for backup workers The oldest backup epoch is piggybacked in LogSystemConfig from master to cluster controller and then to all workers. Previously, this epoch is set to the current master epoch, which is wrong.	2020-03-18 16:44:17 -07:00
Jingyu Zhou	c3dd593113	Updates latest backup worker progress after all previous epochs are done If workers for previous epochs are still ongoing, we may end up with a container that miss mutations in previous epochs. So the update only happens after there are only current epoch's backup workers.	2020-03-18 16:41:35 -07:00
Jingyu Zhou	d5250084bd	Fix a time gap for monitoring backup keys Backup worker starts by check if there are backup keys and then runs monitorBackupKeyOrPullData() loop, which does the check again. The second check can be delayed, which causes the loop to perform NOOP pops. The fix removes this second check and uses the result of the first check to decide what to do in the loop.	2020-03-18 16:41:35 -07:00
Jingyu Zhou	ceb56cf49d	Add done trigger so that backup progress can be set Otherwise, when there is no mutations for the unfinished range, the empty file may not be created when the worker is displaced, thus leaving holes in version ranges.	2020-03-18 16:41:35 -07:00
Jingyu Zhou	6a302e6605	Add total number of tags to WorkerBackupStatus This allows the backup worker to check the number of tags.	2020-03-18 16:41:35 -07:00
Jingyu Zhou	be1d36bed3	Backup worker updates latest log versions in BackupConfig If backup worker is enabled, the current epoch's worker of tag (-2,0) will be responsible for monitoring the backup progress of all workers and update the BackupConfig with the latest saved log version, which is the minimum version of all tags. This change has been incorporated in the getLatestRestorableVersion() so that it is transparent to clients.	2020-03-18 16:41:35 -07:00
Jingyu Zhou	15437ffb53	Add delay for master to recruit backup workers This delay is to ensure old epoch's backup workers can save their progress in the database. Otherwise, the new master could attempts to recruit backup workers for the old epoch on version ranges that have already been popped. As a result, the logs will lose data.	2020-03-18 16:41:35 -07:00
Jingyu Zhou	b8c362cf44	Some correctness fixes	2020-03-18 16:41:35 -07:00
Jingyu Zhou	cade657682	Give a chance for backup worker to finish writing files If a backup worker is cancelled, wait until it finishes writing files so that we don't need to create these files in the next epoch.	2020-03-18 16:41:35 -07:00
Jingyu Zhou	a0fb8ad5fc	Fix version gap in old epoch's backup When pull finished and message queue is empty, we should use end version as the popVersion for backup files. Otherwise, there might be a version gap between last message and end version.	2020-03-18 16:41:35 -07:00
Jingyu Zhou	96eab2f3ec	Consider previously pulled version for pulling version Saving files only happens if we are not pulling, i.e., not in NOOP mode.	2020-03-18 16:41:35 -07:00
Jingyu Zhou	de9362748e	Fix: backup worker ignores deleted container	2020-03-18 16:41:35 -07:00
Jingyu Zhou	ce3f0c6dfc	Fix pulledVersion of backup worker Not sure why, the cursor's version can be smaller than before.	2020-03-18 16:41:35 -07:00
Jingyu Zhou	8f57c46bc9	Fix: backup worker savedVersion init to begin version Choosing invalidVersion is wrong, as the worker starts at beginVersion.	2020-03-18 16:41:35 -07:00
Jingyu Zhou	07f1dcb5c9	Fix contract changes: backup worker generate continuous versions Before we allow holes in version ranges in partitioned mutation logs. This has been changed so that restore can easily figure out if database is restorable. A specific problem is that if the backup worker didn't find any mutations for an old epoch, the worker can just exit without generating a log file, thus leaving holes in version ranges. Another contract change is that if a backup key is set, then we must store all mutations for that key, especially for the worker for the old epoch. As a result, the worker must first check backup key, before pulling mutations and uploading logs. Otherwise, we may lose mutations. Finally, when a backup key is removed, the saving of mutations should be up to the current version so that backup worker doesn't exit too early. I.e., avoid the case saved mutation versions are less than the snapshot version taken.	2020-03-18 16:41:35 -07:00
Jingyu Zhou	a20236a74d	Fix backup worker does NOOP pop before getting backup key The NOOP pop causes some mutation ranges being dropped by backup workers. As a result, the backup is incomplete. Specifically, the wait of BACKUP_NOOP_POP_DELAY blocks the monitoring of backup key actor.	2020-03-18 16:41:35 -07:00
Jingyu Zhou	d8c6bf585d	Include a total number of tags in partition log file names This is needed for BackupContainer to check partitioned mutation logs are continuous, i.e., restorable to a version.	2020-03-18 16:39:40 -07:00
Jingyu Zhou	21feb78f8a	Add mutation log version names I.e., BACKUP_AGENT_MLOG_VERSION for 2001 and PARTITIONED_MLOG_VERSION for 4110.	2020-03-18 16:33:58 -07:00
Jingyu Zhou	9e4668d656	Fix key_not_found error due to deleted BackupConfig Since backup worker doesn't catch this error, a deleted BackupConfig can cause the backup worker to get key_not_found error. Fix by adding a check if the container can be found.	2020-02-14 19:37:58 -08:00
Jingyu Zhou	471d903862	Fix valgrind error: change from BinaryReader to ArenaReader	2020-02-12 16:57:56 -08:00
Jingyu Zhou	b4aa36b651	Really fix valgrind error: erase messages after saved to files	2020-02-12 11:43:14 -08:00
Jingyu Zhou	237f0c35cd	Add mutations state variable to hold on to memory	2020-02-12 10:02:27 -08:00
Jingyu Zhou	a13d4e9bb6	Attempt to fix: remove dead code and add a unit test	2020-02-10 15:40:19 -08:00
Jingyu Zhou	c43ac4c38f	Backup worker: Construct range map on-demand This is to reduce the number of map lookups in the original code.	2020-02-05 11:47:05 -08:00
Jingyu Zhou	d5849af5c0	Address review comments	2020-02-05 10:33:51 -08:00
Jingyu Zhou	e32750931b	Backup worker: Remove stopped backups and fix block ends	2020-02-04 16:01:27 -08:00
Jingyu Zhou	c95d52cd18	Make mutation log file continuous w.r.t. versions Each file is encoded with [startVersion, endVersion) range so that we can easily detect missing ranges (i.e., files) in a backup container.	2020-02-04 14:30:32 -08:00
Jingyu Zhou	28349e2b03	Backup worker checks backup ranges Mutations are only logged when they are within the backup ranges, which means a range clear mutation has to calculate the intersection ranges and divide a clear into potentially multiple clear mutations. This part of code is modeled after how proxy handles backup mutations.	2020-02-03 20:27:31 -08:00
Jingyu Zhou	7c10683c77	Backup workers save logs into right containers The mutation logs of backup workers are saved into "mlogs" directory under the container directory. The backup worker has been restructured to handle multiple backups, where each one is stored in a separate backup container. In the backup worker, mutations pulled from TLogs are buffered in a message queue. When writing out to different containers, their corresponding mutation ranges are used to check if a mutation should be written. When a new backup is submitted by the client, "backupStartedKey" is updated. The worker monitors this key, updates its internal map of backups, and then next pull from TLog needs to wait for the readiness of the new backup. This is to ensure when worker 0 sets the backup is started, all workers have already been logging mutations for the backup.	2020-02-03 20:27:14 -08:00
Jingyu Zhou	7cf2881fe8	Fix backup worker ID and remove some fields The backup worker ID was changed to be the interface ID, not the request ID. The lastSeenVersion is replaced with minKnownCommittedVersion, which is incremented even when there is no mutations. This also avoid the problem of setting popVersion to be higher than the actual committed version (no harm here though, as there are no mutations). The "backupStartedKey" handling is also fixed to correctly handle cases when we wait for the stopping of backups.	2020-01-31 19:29:09 -08:00


163 Commits