8262 Commits

Author SHA1 Message Date
Jingyu Zhou 297da14aba Fix backup worker not popping up to end version
Previously, the pop version was the min of minKnownCommittedVersion and
endVersion. For a backup worker of a previous epoch, the endVersion
should be used instead.
2020-01-22 19:38:46 -08:00
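
The change described in this message can be sketched roughly as follows. This is only an illustration, not the actual patch: the function names and the servesPreviousEpoch flag are hypothetical, and Version is a stand-in alias for the real type.

```cpp
#include <algorithm>
#include <cstdint>

using Version = int64_t; // stand-in alias; assumption for this sketch

// Behavior before the fix, as described in the message:
// always pop up to the smaller of the two versions.
Version popVersionBefore(Version minKnownCommittedVersion, Version endVersion) {
    return std::min(minKnownCommittedVersion, endVersion);
}

// Behavior after the fix (hypothetical shape): a worker that serves a
// previous epoch pops all the way to that epoch's end version.
Version popVersionAfter(Version minKnownCommittedVersion, Version endVersion,
                        bool servesPreviousEpoch) {
    if (servesPreviousEpoch) {
        return endVersion;
    }
    return std::min(minKnownCommittedVersion, endVersion);
}
```

With minKnownCommittedVersion = 90 and endVersion = 100, the old rule stops at 90 while the new rule lets a previous-epoch worker pop up to 100.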
Jingyu Zhou 40436a4e78 Filter out non-backup related mutations 2020-01-22 19:38:45 -08:00
Jingyu Zhou ff512b0c93 Fix memory corruption due to invalid Arena
For an ILogPeekCursor, the arena becomes invalid if hasMessage() is false.
So the backup worker needs to keep a reference to the arena so that the message
refers to a memory area that is still valid.
2020-01-22 19:38:45 -08:00
Jingyu Zhou 12e91240cc Fix a typo 2020-01-22 19:38:45 -08:00
Jingyu Zhou 9abdd16cc5 Add logic to skip non-backup related mutations
If a mutation has txsTag, then it is a change to the in-memory key-value store,
i.e., the transaction state store, and should be ignored by the backup worker.
The only exception is the "metadataVersionKey", which needs to be stored in
the backup.
2020-01-22 19:38:45 -08:00
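
A minimal sketch of the filtering rule described above, under stated assumptions: the Mutation struct, the kTxsTag value, and the kMetadataVersionKey string are all placeholders invented for illustration and do not reflect the real types or constants.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Placeholder values; the real tag and key representations differ.
const int kTxsTag = -1;
const std::string kMetadataVersionKey = "metadataVersionKey-placeholder";

// Hypothetical, simplified mutation: a key plus the tags attached to it.
struct Mutation {
    std::string key;
    std::vector<int> tags;
};

// Returns true if the backup worker should save this mutation.
bool shouldBackup(const Mutation& m) {
    const bool hasTxsTag =
        std::find(m.tags.begin(), m.tags.end(), kTxsTag) != m.tags.end();
    if (!hasTxsTag) {
        return true; // ordinary mutation, always part of the backup
    }
    // Transaction state store change: skip it, except for the
    // metadata version key, which the backup must contain.
    return m.key == kMetadataVersionKey;
}
```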
Jingyu Zhou 485d3d0feb Use Version instead of int64_t 2020-01-22 19:38:45 -08:00
Jingyu Zhou 31a1106286 Save mutations to backup files in simulation
This is the first step in the new backup's data pipeline. Verification of file
content is needed in future commits. Clear documentation of the file format is
a work in progress.
2020-01-22 19:38:45 -08:00
Jingyu Zhou dafcaee844 Fix compiler errors. 2020-01-22 19:38:45 -08:00
Jingyu Zhou c7f51782b8 Use override for virtual functions. 2020-01-22 19:38:45 -08:00
Jingyu Zhou b745373163 Backup workers only save committed mutations. 2020-01-22 19:38:45 -08:00
Jingyu Zhou 23985da6a0 Use backup worker failed error code during recovery
And use override instead of virtual in TagPartitionedLogSystem.
2020-01-22 19:38:45 -08:00
Jingyu Zhou 840e74d696 Allow storage server queue in consistency check
The backup worker needs to update its progress even during consistency check by
committing transactions to the database. Thus we can't really achieve zero
storage server queue. So add a limit of 10,000 to pass the consistency check.
2020-01-22 19:38:45 -08:00
Jingyu Zhou 8585d78bfd Refactor to remove a trigger from backup worker 2020-01-22 19:38:45 -08:00
Jingyu Zhou 9d7a1a77d0 Small fixes. 2020-01-22 19:38:45 -08:00
Jingyu Zhou 9567bf730d Fix a crash due to null log system
When a master starts, backup workers from old epochs may send
BackupWorkerDoneRequest to it. The master can safely ignore it, since the
checkRemoved logic of the backup worker lets the worker exit on its own.
2020-01-22 19:38:45 -08:00
Jingyu Zhou 0c08161d8e Remove old backup workers when done
For backup workers working on old epochs, once their work is done, they will
notify the master. Then the master removes them from the log system and
acknowledges back to the backup workers so that they can gracefully shut down.

The popping of a backup worker is stalled if there are workers from older
epochs still working. Otherwise, workers from old epochs will lose data.

However, allowing a newer epoch to start backup can cause holes in version
ranges. The restore process must verify the backup progress to make sure there
are no holes, otherwise it has to wait.
2020-01-22 19:38:45 -08:00
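
The hole check mentioned in this message could look something like the sketch below. It is an assumption about how such a check might be written, not the restore code itself: coversWithoutHoles and the half-open [begin, end) range convention are hypothetical.

```cpp
#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

using Version = int64_t;                          // stand-in alias
using VersionRange = std::pair<Version, Version>; // half-open [begin, end)

// Returns true if the saved ranges cover [begin, end) with no gap.
bool coversWithoutHoles(std::vector<VersionRange> ranges, Version begin, Version end) {
    std::sort(ranges.begin(), ranges.end());
    Version covered = begin; // everything in [begin, covered) is known to be saved
    for (const auto& r : ranges) {
        if (r.first > covered) {
            break; // gap between covered and the next range: a hole
        }
        covered = std::max(covered, r.second);
        if (covered >= end) {
            return true;
        }
    }
    return covered >= end;
}
```

If workers from two epochs saved [0, 100) and [150, 200), the check for [0, 200) fails and the restore would have to wait until [100, 150) is filled in.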
Jingyu Zhou 85c4a4e422 Address review comments for PR #1625 2020-01-22 19:38:45 -08:00
Jingyu Zhou 116608a0a7 Set backup workers w.r.t. the correct epoch
For backup workers created for a previous epoch, we need to associate them with
the correct epoch so that later peekLogRouter can get the correct peek cursor.
Otherwise, the workers can never peek the missing range of mutations.
2020-01-22 19:38:45 -08:00
Jingyu Zhou 22f4bef589 Fix a race that backup workers may not be registered
After the backup worker recruitment is done, we need to force trigger the
registration with cluster controller. Otherwise, the log system may not have
the backup workers, which can stall backup workers from obtaining a cursor and
result in mutations being kept in TLogs.
2020-01-22 19:38:45 -08:00
Jingyu Zhou d3f14699c4 Backup worker should aggressively advance versions
Separate popping logic into an actor with a shorter interval than the upload
interval. More critically, even if there are no mutations (e.g., during a quiet
database period), the popped version should still be advanced.
2020-01-22 19:38:45 -08:00
Jingyu Zhou 6c6a553dcc Fix hang due to distributor death in QuietDatabase
It's possible that after obtaining the data distributor, the distributor then
dies and a new one is recruited. Because the tester is still contacting the old
one, it becomes stuck.
2020-01-22 19:38:45 -08:00
Jingyu Zhou 73824faf65 Track pseudo tags popping for individual IDs
For each log router ID, we track the popped version of each pseudo tag so that
the popping is only applied to the minimum of these versions.

Also add more tracing for popping and epochs.
2020-01-22 19:38:45 -08:00
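
One way to picture the per-ID tracking described above is the following sketch. The PseudoTagPopTracker type, its method names, and the use of strings and ints for router IDs and pseudo tags are all hypothetical simplifications.

```cpp
#include <algorithm>
#include <cstdint>
#include <map>
#include <string>

using Version = int64_t; // stand-in alias

// Hypothetical tracker: router ID -> (pseudo tag -> popped version).
struct PseudoTagPopTracker {
    std::map<std::string, std::map<int, Version>> popped;

    // Declare a pseudo tag for a router; it starts with nothing popped.
    void registerTag(const std::string& routerId, int pseudoTag) {
        popped[routerId].insert({pseudoTag, 0});
    }

    // Record a pop request and return the version that is safe to pop
    // for this router: the minimum over all of its pseudo tags.
    Version recordPop(const std::string& routerId, int pseudoTag, Version version) {
        std::map<int, Version>& perTag = popped[routerId];
        Version& current = perTag[pseudoTag];
        current = std::max(current, version); // popped versions never go backwards
        Version safe = current;
        for (const auto& kv : perTag) {
            safe = std::min(safe, kv.second);
        }
        return safe;
    }
};
```

Keeping the map per router ID means a slow consumer of one router's pseudo tags only holds back popping for that router, not for the others.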
Jingyu Zhou 3509209d3f Fix not setting epoch for old log system 2020-01-22 19:38:45 -08:00
Jingyu Zhou a1095c8250 Remove epoch from DBCoreState
Use existing recoveryCount if needed.
2020-01-22 19:38:45 -08:00
Jingyu Zhou 580151e1d4 Refactor code using C++ 17 iterator 2020-01-22 19:38:45 -08:00
Jingyu Zhou d5a92e1805 Fix pseudo locality usage bug
Somehow pseudo localities are not saved to LogSystemConfig and getPseudoPopTag()
should translate LogRouter tag to pseudo tags.
2020-01-22 19:38:45 -08:00
Jingyu Zhou c2b8ee3b53 Small improvement 2020-01-22 19:38:45 -08:00
Jingyu Zhou 19d6a889ff Recruit backup workers for old epochs
If there are unfinished ranges in the old epochs, the new master will recruit
backup workers responsible for finishing these ranges. These workers remain in
the cluster until the next epoch, when they remove themselves.
2020-01-22 19:38:45 -08:00
Jingyu Zhou ac851619bb Fix merge errors with master 2020-01-22 19:38:45 -08:00
Jingyu Zhou 11964733b7 WIP: should be divided into smaller commits. 2020-01-22 19:38:45 -08:00
Jingyu Zhou 17002740bb Add epoch and backup workers to DBCoreState
This enables backup workers to know the end version of the epoch. Additionally,
the master recovery only needs to deal with crashed backup workers by
recruiting new workers to back up the unfinished version range.
2020-01-22 19:38:45 -08:00
Jingyu Zhou 41f0cf2bb5 Add decode function for backup progress 2020-01-22 19:38:45 -08:00
Jingyu Zhou f245084bf3 Refactor LogRouter with hasLogRouter() 2020-01-22 19:38:45 -08:00
Jingyu Zhou 03a17a30ef Refactor: check displacement in LogSystemConfig 2020-01-22 19:38:45 -08:00
Jingyu Zhou 7da9f47f26 Enable pop from backup workers
This is still WIP as some edge cases can trigger test failure, most likely due
to not popping mutations by backup workers when epoch ends.
2020-01-22 19:38:45 -08:00
Jingyu Zhou a797958af6 Update peekLogRouter for backup workers to peek 2020-01-22 19:37:48 -08:00
Jingyu Zhou c3e5a9550f Fix too many files error caused by many recoveries 2020-01-22 19:37:48 -08:00
Jingyu Zhou 443c4995a2 Add file identifier in interfaces for flatbuffer 2020-01-22 19:37:48 -08:00
Jingyu Zhou a4d6ebe79e Recruit backup worker in newEpoch 2020-01-22 19:37:48 -08:00
Jingyu Zhou ece3cadf8e Recruit backup worker during master recovery
Right now recruit the same number as TLogs. The backup worker does nothing.
2020-01-22 19:37:48 -08:00
Jingyu Zhou eac49bca04 Add backup worker recruitment in master. 2020-01-22 19:35:30 -08:00
Jingyu Zhou acc4ad276d Add std:: namespace for vector 2020-01-22 19:35:30 -08:00
Jingyu Zhou 442738b6db Small code refactoring 2020-01-22 19:35:30 -08:00
Jingyu Zhou de8d953865 Add backup role, class, and worker skeleton 2020-01-22 19:35:30 -08:00
Jingyu Zhou 8221d33eb1 Use emplace_back instead of push_back for TLogServer 2020-01-22 19:35:30 -08:00
Evan Tschannen 38569e46c1
Merge pull request #2584 from etschannen/master
updated old release notes
2020-01-22 15:59:15 -08:00
Evan Tschannen 6a26ca09b1 updated old release notes 2020-01-22 15:58:29 -08:00
mpilman bb88458830 Merge branch 'features/documentation-server' of github.com:mpilman/foundationdb into features/documentation-server 2020-01-22 14:25:00 -08:00
mpilman 6b11a3ee21 handle cases where no username is set 2020-01-22 14:24:01 -08:00
Markus Pilman ff11c29258
Typo
Co-Authored-By: A.J. Beamon <ajbeamon@users.noreply.github.com>
2020-01-22 14:14:08 -08:00