This introduces unhygenic macro variants that inline a `ENABLED &&`
before the TraceEvent. This way, they get entirely compiled out unless
enabled.
Then rewrite all debugMutation uses via sed.
We are currently emitting Role transition traces when a role starts and
when it ends. While this is useful for debugging, it doesn't work well
with tools that inject data and might potentially miss some trace lines.
We do decorate each trace lines with the roles assigned to that
particular process, however, this is not sufficient for tools that can
make use of the UID -> Role mapping
Like:
* Leaving the proxy
* Entering the TLog
* Leaving the TLog
* Being read on a cursor
All of this brought to you by TagsAndMessage!
This also slides in a minor optimization as to how mutations are serialized per target log.
Added logging of SharedTLog ID for each TLog.
Switched ID logged for TLogRejoining event to the TLog instead of the SharedTLog.
Made some parameters to startRole passed by reference.
For each log router ID, we track the popped version of each pseudo tag so that
the popping only applied to the minimum of these versions.
Also add more tracing for popping and epochs.
Parallel peek more code would prefer the begin version it was sent by
the previous parallel peek over the request's begin version. This means
that a merge cursor trying to advance past message versions would still
get old data that it would have to filter out.
A simple application of std::max fixes this.
TLogServer and LogRouter had some leftover code from me trying to be
more "correct" about parallel peek semantics, but those changes weren't
reflected in the OldTLog* files. I've reverted the changes, as
realistically, they are more likely to waste CPU than improve TLog behavior.
This fixes an issue introduced in the previous patch, where pop would
immediately set `poppedLocationNeedsUpdate`, but setting the popped
version was now delayed. This means that we could:
1. Run the spill loop and persist all popped versions
2. Receive a pop, and set the poppedLocationNeedsUpdate flag
3. Run the dq-pop loop, and clear the poppedLocationNeedsUpdate flag
and now when we update the persistentPopped version again, we won't have
the flag set for dq-pop to know that it needs to scan the spilled data
again for the minLocation.
We could more carefully update the flag, but instead, I've just
converted it into a version that's kept in sync purely in the dq-pop
loop, to remove shared state between pop and the dq-pop loop.
This commit is to fix a bug where popping a tag between
updatePersistentData and popDiskQueue can cause the TLog to recover to
an incorrect understanding of what data it has available.
The following series of events need to happen to trigger this bug:
Tag 1:1 is popped to version 10
updatePersistentData is run...
updatePersistentPopped runs and we persistentData stores 1:1 as popped to 10
A mutation is spilled for 1:1 at version 11 at location 1000
A mutation is spilled for 1:1 at version 21 at location 5000
updatePersistentData finishes and commits the btree changes
Tag 1:1 is popped to version 20
popDiskQueue runs
The btree is read for spilled mutations with version >=20
The minimum location required for the disk queue is found to be location 5000
The disk queue is popped to location 5000
The TLog crashes
The worker restarts, and reloads the TLog files from disk
restorePersistentPopped restores tag 1:1 as having been popped to version 10
Parallel peeks are received for tag 1:1 starting at version 0
The first peek is less than the popped version, so we respond with no data, and an end version of 10
The second peek starts at version 10, which is greater than the popped version
The btree is read for spilled mutations, and we find that there is a mutation at version 11 at location 1000
Location 1000 is read in the DiskQueue
The resulting page read at Location 1000 was popped pre-crash, and thus
might either (a) be corrupt or (b) have an incorrect sequence number.
The fix to this is to force popDiskQueue/updatePoppedLocation to use the
popped version that was persisted to disk, and not the most recently
popped version for the given tag.
This bug doesn't manifest in simulation, because we don't have any code
that peeks at a lower version than what has been popped.
This is to guard against the case where
1. Peeks with sequence numbers 0-39 are submitted
2. A 15min pause happens, in which timeout removes the peek tracker data
3. Peeks with sequence numbers 40-59 are submitted, with the same peekId
The second round of peeks wouldn't have the data left that it's allowed
to start running peek 40 immediately, and thus would hang for 10min
until it gets cleaned up.
Also, guard against overflowing the sequence number.
When switching between spill_type or log_version, a new instance of a
SharedTLog is created in the transaction log processes. If this is done
in a saturated database, then doubling the amount of memory to hold
mutations in memory can cause TLogs to be uncomfortably close to the 8GB
OOM limit.
Instead, we now thread which UID of a SharedTLog is active, and the
other TLog spill out the majority of their mutations.
This is a backport of #2213 (fef89aa1) to release-6.2
When switching between spill_type or log_version, a new instance of a
SharedTLog is created in the transaction log processes. If this is done
in a saturated database, then doubling the amount of memory to hold
mutations in memory can cause TLogs to be uncomfortably close to the 8GB
OOM limit.
Instead, we now thread which UID of a SharedTLog is active, and the
other TLog spill out the majority of their mutations.
...and test it in simulation, but not combined yet.
It turns out that because of txsTag, we basically had to support
spill-by-value anyway. Thus, if we treat all tags like txsTag when
spilling and peeking, then we have an easy way to bring the two spilling
types back into one implementation.
The spilling type is now pulled out of the request, and then stored on
LogData for later access, and persisted in the tlog metadata per tlog
generation.
It turns out that serializing types as Unversioned is a bit wonky.
Which is an intermingling of what should be two commits:
1. Rely on TLogVersion instead of txsTags==0
2. Copy and index sharded txsTags between KCV and RV as txsTag when
configuring log_version 4->3.
The buggify was actually incorrect and broke an invariant, which I then
fixed on the other side, but this work was actually unneeded in total.
The real issue being fixed was returnIfBlock not sending an error, as
well as the other error cases.
There were error cases that would cause a peek to terminate early or be
cancelled without sending anything to the next peek in line. We would
thus end up with the first peek in a sequence waiting on its future, and
nothing that exists that would send to that future.
`peekTracker` was held on the Shared TLog (TLogData), whereas peeks are
received and replied to as part of a TLog instance (LogData). When a
peek was received on a TLog, it was registered into peekTracker along
with the ReplyPromise. If the TLog was then removed as part of a
no-longer-needed generation of TLogs, there is nothing left to reply to
the request, but by holding onto the ReplyPromise in peekTracker, we
leave the remote end with an expectation that we will reply. Then,
10min later, peekTrackerCleanup runs and finally times out the peek
cursor, thus preventing FDB from being completely stuck.
Now, each TLog generation has its own `peekTracker`, and when a TLog is
destroyed, it times out all of the pending peek curors that are still
expecting a response. This will then trigger the client to re-issue
them to the next generation of TLogs, thus removing the 10min gap to do
so.
Formerly, they would prefer to peek from the primary's logs. Testing of
a failed region rejoining the cluster revealed that this becomes quite a
strain on the primary logs when extremely large volumes of peek requests
are coming from the Log Routers. It happens that we have satellites
that contain the same mutations with Log Router tags, that have no other
peeking load, so we can prefer to use the satellite to peek rather than
the primary to distribute load across TLogs better.
Unfortunately, this revealed a latent bug in how tagged mutations in the
KnownCommittedVersion->RecoveryVersion gap were copied across
generations when the number of log router tags were decreased.
Satellite TLogs would be assigned log router tags using the
team-building based logic in getPushLocations(), whereas TLogs would
internally re-index tags according to tag.id%logRouterTags. This
mismatch would mean that we could have:
Log0 -2:0 ----- -2:0 Log 0
Log1 -2:1 \
>--- -2:1,-2:0 (-2:2 mod 2 becomes -2:0) Log 1
Log2 -2:2 /
And now we have data that's tagged as -2:0 on a TLog that's not the
preferred location for -2:0, and therefore a BestLocationOnly cursor
would miss the mutations.
This was never noticed before, as we never
used a satellite as a preferred location to peek from. Merge cursors
always peek from all locations, and thus a peek for -2:0 that needed
data from the satellites would have gone to both TLogs and merged the
results.
We now take this mod-based re-indexing into account when assigning which
TLogs need to recover which tags from the previous generation, to make
sure that tag.id%logRouterTags always results in the assigned TLog being
the preferred location.
Unfortunately, previously existing will potentially have existing
satellites with log router tags indexed incorrectly, so this transition
needs to be gated on a `log_version` transition. Old LogSets will have
an old LogVersion, and we won't prefer the sattelite for peeking. Log
Sets post-6.2 (opt-in) or post-6.3 (default) will be indexed correctly,
and therefore we can safely offload peeking onto the satellites.
When onlySpilled transitions from true (don't peek memory) to false (do
peek memory) as part of a parallel peek, we'll end up wasting the rest
of the replies because we'll honor their onlySpilled=true setting and
thus not have any additional data to return.
Instead, we thread the onlySpilled back through in the same way that the
ending version of the last peek is used overrides the requested starting
version of the next peek. This simulated the same behavior that the
client has, where the value of onlySpilled that we reply with comes back
in the next request.
I haven't actually seen it be a problem, but this should help make sure
the onlySpilled transition when catching up doesn't ever cause any ill
effects if a process starts riding the line between onlySpilled settings.
This fixes#1214
The basic idea is that ProtocolVersion is now its own type. This
alone is an improvement as it makes many things more typesafe. For
each version, we can now add breaking features (for example Fearless).
After that, there's no need to test against actual (confusing) version
numbers. Instead a developer can simply test
`protocolVersion->hasFearless()` and this will return true iff the
protocolVersion is newer than the newest version that didn't support
fearless.
- ignorePopDeadline to have highier limit in simulator
to accommdate for the buggify delays and make snapshot succeed.
- introduce a new knob for auto resetting the disabling of tlog pop
tLogCommit exects no blocking between duplicate check and
setting of the new version, that constraint was broken
when synchronous execProcessingHelper was introduced.
As a fix, execProcessingHelper was made asynchronous.
- wait for snapTLogFailKeys in a loop, otherwise in some race
condition it can cause a false assert
- in single region, there does not seem to be a guarantee of
tagLocalityListKey for a given DC ID, avoiding that assert for now
- to find the workers that are coordinators, looking up by primary
address is not sufficient in some cases, hence looking by both
primary and secondary address
- test make files to reflect the location of the new test cases
In Tlogs, disable pop is done whlie taking snapshots. Earlier, tlogs
were ignoring the pops if it got pop requests when pops were
disabled. In this change, instead of ignoring the pop - it remembers
the list of pops in-memory and plays them once the popping is
enabled.
- exec operation to go to all the TLogs
- minor bug fix in tlog
- restore implementation for the simulator
- restore snap UID to be stored in restartInfo.ini
- test cases added
- indentation and trace file fixes
fix: we could incorrectly make data durable if eraseMessagesFromMemory was in progress while running updatePersistentData
the quiet database check now ensure that tlogs have no more than 30 seconds of versions unpopped from the disk queue
If a peek is entirely fulfilled from spilled data, then it's likely that
the next peek will be also. It is thus wasteful for each of these peeks
to call peekMessagesFromMemory, which memcpy's excessively, and then
throw all that data away without using it.
Now, TLogs will give a hint back to peek cursors about if the provided
reply was served entirely from the spilled data, which peek curors then
feed back as the hint into their next request.
At some point, a cursor will send a request for only spilled data, get
an incomplete response, and then be told to send its next request as one
that peeks from memory as well, and then it will fully catch up.
This deprioritizes before calling peekMessagesFromMemory, which should
improve the memory usage of the TLog, and makes sure to keep txsTag
peeks at a high priority to help recoveries stay fast.
Theoretically, we could spill 20MB of 22B mutations for one key, which
would generate a very long value being stored in SQLite, and very
inefficiently read back. This stops that from being a problem, at the
cost of some extra write calls.
This changes the logic of pop operations from log routers (LG):
- LG pops tagLocalityLogRouterMapped from TLogs;
- TLog converts tagLocalityLogRouterMapped back to tagLocalityLogRouter before
popping.
Later when we add more psuedo localities, the same pattern can be used.
This time, track what location in the DiskQueue has been spilled in
persistent state, and then feed it back into the disk queue before
recovery.
This also introduces an ASSERT that recovery only reads exactly the
bytes that it needs to have in memory.
Popping the disk queue now requires potentially recovering the location
to which we can pop from the spilled data itself, and for each tag we
must maintain the first location with relevant data.
The previous queue we had to represent the ordering, queueOrder, was
used by spilling, and popped when a TLog had been spilled. This means
that as soon as a TLog has been fully spilled, we have no idea how it
relates in order to other fully spilled TLogs.
Instead, use queueOrder to keep track of all the TLog UIDs until they're
removed, and use spillOrder to keep track of the order only for
spilling.
If a server has its data spilled, then it's behind the 5s window.
Feeding it data is less important than committing, so we can hide the
extra CPU usage from checksumming the read amplified disk queue pages.
Though this format is being deprecated in favor of an eventual plumbing
through of TLogVersion, we should probably bump it anyway.
And also remove the fallback to OldTLogServer code. It should never be
executed, as OldTLogServer_6_0 is entirely relied upon to execute
OldTLogServer_4_6.
There's not a tremendous reason as to why this matters now, but I feel
like I might regret sometime later not keeping the same schema under the
same key.
* log_version in the database (`/conf/log_version`) is now a hint that gets
rounded to the nearest supported version.
* fdbcli and FDB enforce that only a valid log_version can be configured to
* TLogVersion is persisted in CoreTLogSet (and LogSet and TLogSet)
* Some comments here and there
* Add an assert on filename length to make sure KV-pairs in filename
don't exceed a maximum length.