Evan Tschannen
ff171e293e
fix: always make sure to add txsTags to localTags for remote logs
2019-07-31 16:04:35 -07:00
Evan Tschannen
9f11f2ec53
Merge branch 'master' of github.com:apple/foundationdb
2019-07-30 16:55:56 -07:00
Evan Tschannen
aaeeb605b2
Changes to degraded can cause master recoveries, which are not supposed to happen when speedUpSimulation is true
2019-07-30 16:33:40 -07:00
Evan Tschannen
6977e7d2e8
do not return recovered version as popped for txsTags because it could cause recovery to start over
...
optimized how buffered peek cursor discards popped data
2019-07-30 12:21:48 -07:00
Evan Tschannen
13203da199
fix: do not set the popped version of txsTag because it could be copied over at the recoveredAt version
2019-07-27 22:36:06 -07:00
Evan Tschannen
28df2c35bb
Merge pull request #1855 from alexmiller-apple/sharded-txs-safe-upgrade
...
Make sharded txsTag upgradeable and downgradeable
2019-07-26 13:29:39 -07:00
Meng Xu
1706aaf199
Merge branch 'master' into mengxu/performant-restore-PR
...
Fix conflict in TlogServer.actor.cpp by accepting master changes
2019-07-26 11:46:27 -07:00
sramamoorthy
9afd162e2f
remove snap v1 related code
2019-07-25 17:29:31 -07:00
Meng Xu
45083edf74
Merge branch 'master' into mengxu/performant-restore-PR
...
Fix conflicts as well.
2019-07-25 10:46:11 -07:00
sramamoorthy
a65c9f92ed
get rid of all timeouts and other changes
2019-07-24 15:36:28 -07:00
sramamoorthy
a2f2ad96ff
code review comments and merge to master changes
2019-07-24 15:36:28 -07:00
sramamoorthy
31c010b393
few minor fixes
2019-07-24 15:36:28 -07:00
sramamoorthy
c73bdfad9f
do not pop txsTag
2019-07-24 15:36:28 -07:00
sramamoorthy
a335ed2011
includeCancelled for tLogSnapCreate
2019-07-24 15:36:28 -07:00
sramamoorthy
61cd690add
enable/disable pop req with UID mis-match to fail
2019-07-24 15:36:28 -07:00
sramamoorthy
f4e257e464
snap v2: TLog related changes
2019-07-24 15:36:28 -07:00
Evan Tschannen
6d694cc2ce
Merge pull request #1818 from alexmiller-apple/peek-cursor-timeout-bug
...
Fix parallel peek stalling for 10min when a TLog generation is destroyed
2019-07-19 16:39:31 -07:00
Alex Miller
9863ace96c
Replace usages with intialization lists.
...
But C++ needs a bit of help to inference though the templates.
2019-07-18 22:27:36 -07:00
Alex Miller
55258709a0
Remove an ASSERT from testing and now inaccurate comment.
2019-07-17 01:30:01 -07:00
Alex Miller
e9684a1f63
Fix issues configuring from sharded txs tag to not
...
Which is an intermingling of what should be two commits:
1. Rely on TLogVersion instead of txsTags==0
2. Copy and index sharded txsTags between KCV and RV as txsTag when
configuring log_version 4->3.
2019-07-17 01:25:09 -07:00
Alex Miller
812ce37bcd
Remove buggify and unneeded safeguards.
...
The buggify was actually incorrect and broke an invariant, which I then
fixed on the other side, but this work was actually unneeded in total.
The real issue being fixed was returnIfBlock not sending an error, as
well as the other error cases.
2019-07-16 15:58:02 -07:00
Alex Miller
4cc60dc9b8
Merge remote-tracking branch 'upstream/master' into peek-cursor-timeout-bug
2019-07-15 17:05:39 -07:00
Alex Miller
2cbc05fc72
Address more issues that cause peek cursors to time out.
...
There were error cases that would cause a peek to terminate early or be
cancelled without sending anything to the next peek in line. We would
thus end up with the first peek in a sequence waiting on its future, and
nothing that exists that would send to that future.
2019-07-15 16:03:37 -07:00
Alex Miller
c8e94e601a
Merge pull request #1729 from etschannen/feature-fast-txs-recovery
...
Improve the recovery speed of the txnStateStore
2019-07-15 13:27:41 -07:00
Vishesh Yadav
2606794df6
Merge pull request #1812 from alexmiller-apple/improve-only-spilled
...
Improve the behavior of parallelPeekMore+onlySpilled.
2019-07-10 17:15:19 -07:00
Evan Tschannen
d8948c8be1
Merge branch 'master' into feature-fast-txs-recovery
...
# Conflicts:
# fdbserver/TagPartitionedLogSystem.actor.cpp
2019-07-10 13:59:52 -07:00
Evan Tschannen
49121172ea
Merge pull request #1795 from alexmiller-apple/peek-from-satellites
...
Log Routers will prefer to peek from satellite logs.
2019-07-09 17:38:57 -07:00
Alex Miller
fd769ad878
Fix parallel peek stalling for 10min when a TLog generation is destroyed.
...
`peekTracker` was held on the Shared TLog (TLogData), whereas peeks are
received and replied to as part of a TLog instance (LogData). When a
peek was received on a TLog, it was registered into peekTracker along
with the ReplyPromise. If the TLog was then removed as part of a
no-longer-needed generation of TLogs, there is nothing left to reply to
the request, but by holding onto the ReplyPromise in peekTracker, we
leave the remote end with an expectation that we will reply. Then,
10min later, peekTrackerCleanup runs and finally times out the peek
cursor, thus preventing FDB from being completely stuck.
Now, each TLog generation has its own `peekTracker`, and when a TLog is
destroyed, it times out all of the pending peek curors that are still
expecting a response. This will then trigger the client to re-issue
them to the next generation of TLogs, thus removing the 10min gap to do
so.
2019-07-09 17:27:36 -07:00
Alex Miller
44f11702a8
Log Routers will prefer to peek from satellite logs.
...
Formerly, they would prefer to peek from the primary's logs. Testing of
a failed region rejoining the cluster revealed that this becomes quite a
strain on the primary logs when extremely large volumes of peek requests
are coming from the Log Routers. It happens that we have satellites
that contain the same mutations with Log Router tags, that have no other
peeking load, so we can prefer to use the satellite to peek rather than
the primary to distribute load across TLogs better.
Unfortunately, this revealed a latent bug in how tagged mutations in the
KnownCommittedVersion->RecoveryVersion gap were copied across
generations when the number of log router tags were decreased.
Satellite TLogs would be assigned log router tags using the
team-building based logic in getPushLocations(), whereas TLogs would
internally re-index tags according to tag.id%logRouterTags. This
mismatch would mean that we could have:
Log0 -2:0 ----- -2:0 Log 0
Log1 -2:1 \
>--- -2:1,-2:0 (-2:2 mod 2 becomes -2:0) Log 1
Log2 -2:2 /
And now we have data that's tagged as -2:0 on a TLog that's not the
preferred location for -2:0, and therefore a BestLocationOnly cursor
would miss the mutations.
This was never noticed before, as we never
used a satellite as a preferred location to peek from. Merge cursors
always peek from all locations, and thus a peek for -2:0 that needed
data from the satellites would have gone to both TLogs and merged the
results.
We now take this mod-based re-indexing into account when assigning which
TLogs need to recover which tags from the previous generation, to make
sure that tag.id%logRouterTags always results in the assigned TLog being
the preferred location.
Unfortunately, previously existing will potentially have existing
satellites with log router tags indexed incorrectly, so this transition
needs to be gated on a `log_version` transition. Old LogSets will have
an old LogVersion, and we won't prefer the sattelite for peeking. Log
Sets post-6.2 (opt-in) or post-6.3 (default) will be indexed correctly,
and therefore we can safely offload peeking onto the satellites.
2019-07-08 22:25:01 -07:00
Alex Miller
6c8f50ca66
Improve the behavior of parallelPeekMore+onlySpilled.
...
When onlySpilled transitions from true (don't peek memory) to false (do
peek memory) as part of a parallel peek, we'll end up wasting the rest
of the replies because we'll honor their onlySpilled=true setting and
thus not have any additional data to return.
Instead, we thread the onlySpilled back through in the same way that the
ending version of the last peek is used overrides the requested starting
version of the next peek. This simulated the same behavior that the
client has, where the value of onlySpilled that we reply with comes back
in the next request.
I haven't actually seen it be a problem, but this should help make sure
the onlySpilled transition when catching up doesn't ever cause any ill
effects if a process starts riding the line between onlySpilled settings.
2019-07-08 22:13:09 -07:00
Evan Tschannen
15e894c724
Merge in master
2019-07-05 15:49:24 -07:00
Evan Tschannen
235697f688
fix: txsTags are not popped at the recovery version
2019-06-27 23:18:26 -07:00
Alex Miller
bf883d7055
Merge remote-tracking branch 'upstream/master' into flowlock-api
2019-06-25 14:26:50 -07:00
Alex Miller
7a500cd37f
A giant translation of TaskFooPriority -> TaskPriority::Foo
...
This is so that APIs that take priorities don't take ints, which are
common and easy to accidentally pass the wrong thing.
2019-06-25 02:47:35 -07:00
Evan Tschannen
1c005d5878
Merge pull request #1584 from alexmiller-apple/spilled-only-peek
...
Save TLog resources by letting peek request only spilled data.
2019-06-20 18:22:31 -07:00
mpilman
844dd60202
FDB compiling with intel compiler
2019-06-20 09:29:01 -07:00
Evan Tschannen
e0be631414
shard the txs tag so that more transaction logs are involved in its recovery
2019-06-19 18:15:09 -07:00
mpilman
68ce9a5e75
ProtocolVersion type - second try
2019-06-18 17:55:27 -07:00
Alex Miller
51fd42a4d2
Merge remote-tracking branch 'upstream/master' into spilled-only-peek
2019-06-18 17:33:52 -07:00
mpilman
8576665a90
Revert "Revert "Make protocol version a type""
...
This reverts commit 455bf3b3ec
.
2019-06-18 14:49:04 -07:00
Alex Miller
455bf3b3ec
Revert "Make protocol version a type"
2019-06-18 10:59:17 -07:00
mpilman
da53a92bec
Make protocol version a type
...
This fixes #1214
The basic idea is that ProtocolVersion is now its own type. This
alone is an improvement as it makes many things more typesafe. For
each version, we can now add breaking features (for example Fearless).
After that, there's no need to test against actual (confusing) version
numbers. Instead a developer can simply test
`protocolVersion->hasFearless()` and this will return true iff the
protocolVersion is newer than the newest version that didn't support
fearless.
2019-06-16 09:59:15 -07:00
sramamoorthy
1190f2f33d
rebased related changes
2019-05-28 22:07:46 -07:00
sramamoorthy
b43c100e57
TLog bug fixes
2019-05-28 22:07:46 -07:00
sramamoorthy
3877f87481
comment change in tLogCommit
2019-05-28 22:07:46 -07:00
sramamoorthy
31b6c86650
ignorePopDeadline to have high limit in simulator
...
- ignorePopDeadline to have highier limit in simulator
to accommdate for the buggify delays and make snapshot succeed.
- introduce a new knob for auto resetting the disabling of tlog pop
2019-05-28 22:07:46 -07:00
sramamoorthy
b1b96946af
logData->stop check right after execOpHold wait
2019-05-28 22:07:46 -07:00
sramamoorthy
5749e220bd
use FlowLock for implementing critical section
...
Instead of using Promises and future to implement
critcal section use FlowLock
2019-05-28 22:07:46 -07:00
sramamoorthy
e6c0b87a4d
remove unused variable
2019-05-28 22:07:46 -07:00
sramamoorthy
f27a40f118
execProcessingHelper made synchronous
...
tLogCommit exects no blocking between duplicate check and
setting of the new version, that constraint was broken
when synchronous execProcessingHelper was introduced.
As a fix, execProcessingHelper was made asynchronous.
2019-05-28 22:07:46 -07:00
sramamoorthy
d3a179b6f9
Multiple bug fixes
...
- wait for snapTLogFailKeys in a loop, otherwise in some race
condition it can cause a false assert
- in single region, there does not seem to be a guarantee of
tagLocalityListKey for a given DC ID, avoiding that assert for now
- to find the workers that are coordinators, looking up by primary
address is not sufficient in some cases, hence looking by both
primary and secondary address
- test make files to reflect the location of the new test cases
2019-05-28 22:07:46 -07:00
sramamoorthy
dcd2d96751
make spawnProcess predictable in the simulator
2019-05-28 22:07:46 -07:00
sramamoorthy
4083af0b01
Avoid using trackLatest for TLog pop test cases
2019-05-28 22:07:46 -07:00
sramamoorthy
ec7834e2f7
code re-orgnaization and address comments
2019-05-28 22:07:46 -07:00
sramamoorthy
b6e037ffbc
Replace fork with boost::process::child
2019-05-28 22:07:46 -07:00
sramamoorthy
e91c76834e
tlog: move snap create part to indepdendent funcs
2019-05-28 22:07:46 -07:00
sramamoorthy
61e93a9304
Address review comments and minor fixes
2019-05-28 22:07:46 -07:00
sramamoorthy
9e3104c2d4
Fix: races in async exec leading to bad backup
2019-05-28 22:07:46 -07:00
sramamoorthy
cfdad0c5e6
tlog to snapshot exactly at exec version
2019-05-28 22:07:46 -07:00
sramamoorthy
539e65efad
Skip parsing mutations if it is tagged for TxsTag
...
In Tlog, if a mutation is targetted for TxsTag then skip from
parsing them.
2019-05-28 22:07:46 -07:00
sramamoorthy
17ecba8313
trace cleanup and other indentation changes
2019-05-28 22:07:46 -07:00
sramamoorthy
aa79480d69
changes to make fdbfork asynchronous
2019-05-28 22:07:46 -07:00
sramamoorthy
4016f16c76
Fix few compilation and bugs in rebase
2019-05-28 22:07:46 -07:00
sramamoorthy
3d5998e9dd
tlog: when pops are disabled, store them & replay
...
In Tlogs, disable pop is done whlie taking snapshots. Earlier, tlogs
were ignoring the pops if it got pop requests when pops were
disabled. In this change, instead of ignoring the pop - it remembers
the list of pops in-memory and plays them once the popping is
enabled.
2019-05-28 22:07:46 -07:00
sramamoorthy
4bc4c615da
exec op to all tlog, restore change in test &other
...
- exec operation to go to all the TLogs
- minor bug fix in tlog
- restore implementation for the simulator
- restore snap UID to be stored in restartInfo.ini
- test cases added
- indentation and trace file fixes
2019-05-28 22:07:46 -07:00
sramamoorthy
72dd067173
Trace message changes and fix few FIXMEs
2019-05-28 22:07:46 -07:00
sramamoorthy
69edefe68b
Snapshot based backup and resotre implementation
2019-05-28 22:07:46 -07:00
A.J. Beamon
f417e60264
Merge branch 'merge-release-6.1-into-master' into thread-safe-random-number-generation
...
# Conflicts:
# fdbserver/QuietDatabase.actor.cpp
2019-05-23 09:52:00 -07:00
A.J. Beamon
d29c7e4c9b
Merge branch 'release-6.1' into merge-release-6.1-into-master
...
# Conflicts:
# documentation/sphinx/source/release-notes.rst
# fdbserver/QuietDatabase.actor.cpp
# versions.target
2019-05-23 09:28:45 -07:00
Evan Tschannen
003cc6be18
fix: nothingPersistent could be incorrect when popped is equal to persistentDataVersion
2019-05-22 20:23:35 -10:00
Evan Tschannen
ee04c583fa
fix: do not pop the disk queue past the persistentDataVersion
2019-05-21 10:40:30 -07:00
Evan Tschannen
4059d68348
fix: the tlog would not pop data from the disk queue after a storage server was removed, because the tag still exists in memory on the logs
...
fix: we could incorrectly make data durable if eraseMessagesFromMemory was in progress while running updatePersistentData
the quiet database check now ensure that tlogs have no more than 30 seconds of versions unpopped from the disk queue
2019-05-20 23:58:45 -07:00
Meng Xu
9ea83e0f3c
FastRestore:Remove dbprintf
2019-05-17 17:34:42 -07:00
Alex Miller
4eb4c03ce5
Save TLog resources by letting peek request only spilled data.
...
If a peek is entirely fulfilled from spilled data, then it's likely that
the next peek will be also. It is thus wasteful for each of these peeks
to call peekMessagesFromMemory, which memcpy's excessively, and then
throw all that data away without using it.
Now, TLogs will give a hint back to peek cursors about if the provided
reply was served entirely from the spilled data, which peek curors then
feed back as the hint into their next request.
At some point, a cursor will send a request for only spilled data, get
an incomplete response, and then be told to send its next request as one
that peeks from memory as well, and then it will fully catch up.
2019-05-14 15:38:48 -10:00
A.J. Beamon
5f55f3f613
Replace g_random and g_nondeterministic_random with functions deterministicRandom() and nondeterministicRandom() that return thread_local random number generators. Delete g_debug_random and trace_random. Allow only deterministicRandom() to be seeded, and require it to be seeded from each thread on which it is used.
2019-05-10 14:01:52 -07:00
Evan Tschannen
22499666d0
Merge branch 'release-6.1'
...
# Conflicts:
# documentation/sphinx/source/release-notes.rst
# fdbserver/LogRouter.actor.cpp
# flow/Trace.cpp
# versions.target
2019-05-08 18:19:35 -07:00
Evan Tschannen
93eb2a9395
Merge pull request #1527 from alexmiller-apple/tstlog-6.1
...
Spill-by-reference knob + TLog6.0 Spilled Peek deprioritization
2019-05-03 17:19:45 -07:00
Alex Miller
c918b21137
Deprioritize spilled peeks in spill-by-value, and improve its logic.
...
This deprioritizes before calling peekMessagesFromMemory, which should
improve the memory usage of the TLog, and makes sure to keep txsTag
peeks at a high priority to help recoveries stay fast.
2019-05-03 15:27:11 -07:00
Alex Miller
4052f3826a
Add a knob to limit the number of commits indexed per key.
...
Theoretically, we could spill 20MB of 22B mutations for one key, which
would generate a very long value being stored in SQLite, and very
inefficiently read back. This stops that from being a problem, at the
cost of some extra write calls.
2019-05-03 15:27:10 -07:00
Evan Tschannen
12088119d2
Merge pull request #1517 from alexmiller-apple/tstlog-6.1
...
Add a knob to limit amount of data read from sqlite for one PeekRequest.
2019-05-03 11:01:11 -07:00
Alex Miller
f4e48c3851
Add a knob to limit amount of data read from sqlite for one PeekRequest.
...
This prevents peeking from degrading over time if there are a very large
number of SpilledData entries for one particular tag.
2019-05-02 17:26:45 -07:00
Evan Tschannen
8590b710bf
added additional logging on the logs and log routers
2019-05-02 17:24:39 -07:00
Jingyu Zhou
8b5449e608
Fix review comments for PR #1473
2019-04-29 16:45:42 -07:00
Jingyu Zhou
5462f560e7
Add pseudo locality for log routers and tlogs
...
This changes the logic of pop operations from log routers (LG):
- LG pops tagLocalityLogRouterMapped from TLogs;
- TLog converts tagLocalityLogRouterMapped back to tagLocalityLogRouter before
popping.
Later when we add more psuedo localities, the same pattern can be used.
2019-04-23 21:35:56 -07:00
Jingyu Zhou
0b1984978a
Small code refactoring.
2019-04-21 10:41:07 -07:00
Jingyu Zhou
ec1bc5cfca
Add LogSystemType enum
2019-04-21 10:41:07 -07:00
Meng Xu
529ce66b6c
Merge branch 'apple/master' into mengxu/performant-restore-PR
2019-04-18 18:02:45 -07:00
Meng Xu
4c3ccebe8a
FastRestore: Cleanup code
...
Remove unused code and comments.
2019-04-12 13:49:55 -07:00
Evan Tschannen
6220a5ce0f
Merge pull request #1370 from jzhou77/fix-unreferenced
...
Remove unused functions
2019-04-09 11:49:45 -07:00
mpilman
1c16f87a4e
Remove trace-calls to printable (in non-workloads)
2019-04-05 13:12:19 -07:00
Meng Xu
c4a8a80d6f
Merge branch 'apple/master' into mengxu/performant-restore-PR
2019-04-04 22:51:00 -07:00
Jingyu Zhou
47b4b82628
Merge branch 'master' into fix-unreferenced
2019-04-01 14:07:19 -07:00
Meng Xu
70d7c289f4
Merge branch 'master' into mengxu/restore/parallel-v7
2019-03-30 22:13:10 -07:00
Alex Miller
e7ad39246c
Fix typo
2019-03-29 20:16:26 -07:00
Evan Tschannen
a44ffd851e
fix: the shared tlog could fail to update a stopped tlog’s queueCommitVersion to version if a second tlog registered before it could issue the first commit for the tlog
2019-03-29 20:11:30 -07:00
Evan Tschannen
b6008558d3
renamed BinaryWriter.toStringRef() to .toValue(), because the function now returns a Standalone<StringRef>()
...
eliminated an unnecessary copy from the proxy commit path
eliminated an unnecessary copy from buffered peek cursor
2019-03-28 11:52:50 -07:00
Jingyu Zhou
a55f06e082
Remove unused functions
...
Found with -Wunused-function flag.
2019-03-27 15:45:28 -07:00
Evan Tschannen
c705a1af74
fix: make sure recoveryLocation is always a valid page
2019-03-20 19:33:09 -07:00
Evan Tschannen
1c6ad6d307
fix: change the location where stopped is checked, because a yield could cause cause stopped to be set after the existing check
2019-03-20 19:33:09 -07:00
Alex Miller
b11ecb3210
Remove random bits of code that were either unneeded or leftover from debugging.
2019-03-18 15:47:20 -07:00
Alex Miller
37ea71b117
Implement limiting how many bytes recovery will read.
...
This time, track what location in the DiskQueue has been spilled in
persistent state, and then feed it back into the disk queue before
recovery.
This also introduces an ASSERT that recovery only reads exactly the
bytes that it needs to have in memory.
2019-03-18 15:09:43 -07:00
Alex Miller
29ab7370cd
Clear versionLocation when spilling, and pop DQ separately.
...
Popping the disk queue now requires potentially recovering the location
to which we can pop from the spilled data itself, and for each tag we
must maintain the first location with relevant data.
The previous queue we had to represent the ordering, queueOrder, was
used by spilling, and popped when a TLog had been spilled. This means
that as soon as a TLog has been fully spilled, we have no idea how it
relates in order to other fully spilled TLogs.
Instead, use queueOrder to keep track of all the TLog UIDs until they're
removed, and use spillOrder to keep track of the order only for
spilling.
2019-03-18 15:09:22 -07:00
Alex Miller
7f5bc2981f
Checksum DiskQueue pages on read, but at a lower priority.
...
If a server has its data spilled, then it's behind the 5s window.
Feeding it data is less important than committing, so we can hide the
extra CPU usage from checksumming the read amplified disk queue pages.
2019-03-15 21:01:19 -07:00
Alex Miller
ee4721a63f
Make checking or ignoring checksums part of the IDiskQueue::read API.
2019-03-15 21:01:18 -07:00
Alex Miller
81c59e88a8
Persist the protocol version of a TLog instance when it is created.
...
This allows us to do easy upgrades of SpilledData in the future, if the
need arises, because we then have a protocol version to compare against.
2019-03-15 21:01:17 -07:00
Alex Miller
686b097397
Remove verification code from DiskQueue and TLogServer.
2019-03-15 21:01:15 -07:00
Alex Miller
77f596743f
Bump persistFormat in TLogServer to differ from OldTLogServer*
...
Though this format is being deprecated in favor of an eventual plumbing
through of TLogVersion, we should probably bump it anyway.
And also remove the fallback to OldTLogServer code. It should never be
executed, as OldTLogServer_6_0 is entirely relied upon to execute
OldTLogServer_4_6.
2019-03-15 21:01:13 -07:00
Alex Miller
4f98634f59
Add LogId to all TLog TraceEvents that have it.
2019-03-15 21:01:12 -07:00
Evan Tschannen
5873705228
tlog commits very rarely take an additional 6 seconds
2019-03-11 12:11:17 -07:00
Evan Tschannen
80c3f2f8e2
added status fields detailing which processes are degraded, and also the total number of degraded processes
2019-03-10 22:58:15 -07:00
Evan Tschannen
044b6b4f8a
Merge branch 'master' into feature-degraded-tlog
...
# Conflicts:
# fdbserver/ClusterController.actor.cpp
2019-03-08 22:50:41 -05:00
Evan Tschannen
53f16b5347
when a tlog queue commit takes longer than 5 seconds, its process is marked as degraded
2019-03-08 11:46:34 -05:00
Alex Miller
c6a65389ae
Remove noexcept macro and replace with BOOST_NOEXCEPT.
...
BOOST_NOEXCEPT does what the noexcept macro was supposed to do, but in a
way that is correctly maintained over time.
2019-03-05 22:06:12 -08:00
Alex Miller
244903a9de
Spill txsTag by value under TagMsg/ and not TagMsgRef/
...
There's not a tremendous reason as to why this matters now, but I feel
like I might regret sometime later not keeping the same schema under the
same key.
2019-03-04 01:42:39 -08:00
Alex Miller
72c2cf11ab
Replace ResourceLimiter with FlowLock.
2019-03-04 01:42:38 -08:00
Alex Miller
aff9ebe21a
Spill (start,length) instead of (begin,end) to save a few bytes.
2019-03-04 01:42:38 -08:00
Alex Miller
9ef283d4e7
Implement hard limiting of memory used to serve peek requests.
2019-03-04 01:42:38 -08:00
Alex Miller
e3506ad9af
Add a yield to parseMessagesForTag
2019-03-04 01:42:38 -08:00
Alex Miller
742f6e1847
Solve overreading via pre-calculating tag bytes per commit
2019-03-04 01:42:38 -08:00
Alex Miller
e7d8520c63
Batch more when spilling data.
2019-03-04 01:42:38 -08:00
Alex Miller
04e1170c88
Spill txsTag by value
2019-03-04 01:42:38 -08:00
Alex Miller
ba31f8f1f9
Remove all code related to writing and cleaning up old-style spilling.
2019-02-26 18:13:49 -08:00
Alex Miller
0539c1df00
Peek from new spilled data and not old spilled data.
2019-02-26 18:13:49 -08:00
Alex Miller
d687a4bb85
Implement proper cleanup of disk queue when spilling refs.
2019-02-26 18:00:55 -08:00
Alex Miller
84dc41c206
Add a comment.
2019-02-26 18:00:55 -08:00
Alex Miller
cade914645
Temporarily add verification of spilled tag data.
2019-02-26 18:00:55 -08:00
Alex Miller
f659825575
Persist start,end of message when spilling tags.
2019-02-26 18:00:55 -08:00
Alex Miller
8d76cbed02
Track both start and end versions in versionLocation.
2019-02-26 18:00:55 -08:00
Evan Tschannen
8afb7fbb9d
Merge pull request #1160 from alexmiller-apple/tstlog-fork
...
Spill-By-Reference TLog Part 2: New and Old TLogServers co-exist harmoniously
2019-02-26 18:00:04 -08:00
Alex Miller
2dc57568cb
Change many things about log_version.
...
* log_version in the database (`/conf/log_version`) is now a hint that gets
rounded to the nearest supported version.
* fdbcli and FDB enforce that only a valid log_version can be configured to
* TLogVersion is persisted in CoreTLogSet (and LogSet and TLogSet)
* Some comments here and there
* Add an assert on filename length to make sure KV-pairs in filename
don't exceed a maximum length.
2019-02-26 16:47:04 -08:00
Evan Tschannen
b8910ba7cd
Merge branch 'master' into feature-fix-force-recovery
...
# Conflicts:
# fdbclient/ManagementAPI.actor.h
# fdbserver/DataDistribution.actor.cpp
# fdbserver/storageserver.actor.cpp
# fdbserver/workloads/KillRegion.actor.cpp
2019-02-22 14:38:13 -08:00
Alex Miller
91e05575a2
Rename OldTLogServer -> OldTLogServer_4_6
2019-02-19 22:18:10 -08:00
mpilman
3f0fd2a20c
Use fwd decls in WorkerInterface
...
Also WorkerInterface.h -> WorkerInterface.actor.h
2019-02-19 15:16:59 -08:00
mpilman
0bb60e5a3b
Use proper fwd decl in NativeAPI
...
Also NativeAPI.h -> NativeAPI.actor.h
2019-02-19 15:16:59 -08:00
Evan Tschannen
065a45e05f
Merge branch 'master' into feature-fix-force-recovery
...
# Conflicts:
# fdbclient/ManagementAPI.actor.cpp
# fdbserver/ClusterController.actor.cpp
# fdbserver/workloads/KillRegion.actor.cpp
2019-02-18 17:09:06 -08:00
Evan Tschannen
3247d59498
partially restored an optimization on remote storage servers where a behind storage server will keep less data in memory. This optimization was fully maintained on the primary storage servers, but remote storage servers can only use a version which is known to be durable on all remote transaction logs
2019-02-18 16:47:38 -08:00
Vishesh Yadav
e05b53d755
Merge remote-tracking branch 'apple/master' into task/tls-upgrade
2019-02-15 20:37:07 -08:00
Jingyu Zhou
7897616164
Fix wait failure bug on cluster controller
...
The setDistributor() sets an AsyncVar and then runs waitFailureClient. This
ordering is wrong because the AsyncVar::set triggers the other loop to run
first, which will wait on Never(). The correct code should wait on the Future
returned by the waitFailureClient.
2019-02-14 16:37:16 -08:00
Meng Xu
550f2e2682
Merge with master to use the latest backup container
...
In fdb 6.0.15, backup container is changed on how to organize the backup data.
The backup made by fdb >6.0.15 has to be restored with fdb > 6.0.15.
Merge with master so that the fast restore uses fdb > 6.0.15
2019-01-30 12:05:15 -08:00
Evan Tschannen
1d7fec3074
Merge commit '048bfc5c368063d9e009513078dab88be0cbd5b0' into task/tls-upgrade-2
...
# Conflicts:
# .gitignore
2019-01-24 17:43:06 -08:00
Meng Xu
5f406d864c
WiP:Has bug: assign roles and distribute files
...
The global variables for a process in real mode are separate across processes in real mode, BUT
they are NOT separate in simulation mode.
The workers in different processes can still access the global variables,
which makes one worker overwrite another work's global variable value.
To solve this problem, which happens in other FDB code as well,
we allocate the global variable in a heap and pass the pointers across functions,
as what DataDistribution.actor does.
The other approach is to create two code path: one for real mode and the other for simulator,
as the g_simulator does to create the process context for each simulated process
2019-01-04 22:23:02 -08:00
anoyes
6a4d87802b
Replace & operator with variadic function
2018-12-28 11:33:42 -08:00
Vishesh Yadav
3eb9b23024
Listen to multiple addresses and start using vector<NetworkAdddress> in Endpoint
...
- This patch will make FDB listen to multiple addresses given via
command line. Although, we'll still use first address in most places,
this patch starts using vector<NetworkAddress> in Endpoint at some basic
places.
- When sending packets to an endpoint, pick a random network address in
endpoints
- Renames Endpoint::address to Endpoint::addresses since it
now holds a vector of addresses.
2018-12-13 13:36:52 -08:00
Vishesh Yadav
43e5a46f9b
Change Endpoint::address(NetworkAddress) to vector<NetworkAddress>
...
Extend `Endpoint` class to take multiple NetworkAddresses instead of
just one. Hence, to talk to an endpoint instead of one IP:PORT, we'll
have multiple IP:PORT pairs.
This patch simply adds the field and makes changes to compile the
codebase. The first element of of `address` field is used everywhere.
Hence the way we talk to remains same with this patch.
NOTE:
Directly accessing the first memeber of Endpoint::address is unsafe
as Endpoint() doesn't enforces non-empty address list. However, since
the correctness test pass for now and are anyway replacing all those
unsafe accesses with ones considering the whole vector, this patch
ignores to access them in safe way.
2018-12-13 13:36:52 -08:00
Evan Tschannen
4b5d0b4e2c
Merge branch 'release-6.0'
...
# Conflicts:
# documentation/sphinx/source/release-notes.rst
# fdbclient/AsyncFileBlobStore.actor.cpp
# fdbclient/AsyncFileBlobStore.actor.h
# fdbclient/BlobStore.actor.cpp
# fdbclient/BlobStore.h
# fdbclient/HTTP.actor.cpp
# fdbclient/ManagementAPI.actor.cpp
# fdbclient/NativeAPI.actor.cpp
# fdbrpc/LoadBalance.actor.h
# fdbrpc/batcher.actor.h
# fdbrpc/fdbrpc.vcxproj
# fdbrpc/sim2.actor.cpp
# fdbserver/DataDistribution.actor.cpp
# fdbserver/DataDistributionTracker.actor.cpp
# fdbserver/SimulatedCluster.actor.cpp
# fdbserver/TLogServer.actor.cpp
# fdbserver/masterserver.actor.cpp
2018-11-10 13:04:24 -08:00
Evan Tschannen
f045c041eb
fix: if a storage server already exists in a remote region after converting to fearless, it did not receive mutations between the known committed version and the recovery version
2018-11-02 14:11:39 -07:00
Evan Tschannen
1b5d28386a
fix: the Tlog would not update the durable version properly when version_sizes was empty
2018-11-02 13:05:54 -07:00
Robert Escriva
268093a96d
Adjust all includes to be relative to the root.
...
Remove the use of relative paths. A header at foo/bar.h could be included by
files under foo/ with "bar.h", but would be included everywhere else as
"foo/bar.h". Adjust so that every include references such a header with the
latter form.
Signed-off-by: Robert Escriva <rescriva@dropbox.com>
2018-10-19 17:35:33 +00:00
Evan Tschannen
db71b60d72
Merge pull request #819 from satherton/feature-redwood
...
Redwood storage engine, initial/experimental version
2018-10-18 18:38:11 -07:00
Evan Tschannen
0217aed74c
Merge branch 'release-6.0'
...
# Conflicts:
# bindings/go/README.md
# documentation/sphinx/source/release-notes.rst
# fdbserver/MasterProxyServer.actor.cpp
# versions.target
2018-10-15 18:38:51 -07:00
Evan Tschannen
ecddeab2ae
fixed review comments; demote killRegionCycle test for now
2018-10-08 10:39:39 -07:00
Stephen Atherton
22f8a4efa9
Normalized all unit test names to begin with "/" if they should be included in random unit testing.
2018-10-05 22:09:58 -07:00
Stephen Atherton
7c1dc305cb
Merge commit 'a72c8f5cb2e79a673abc0ed3d27ef1c51028fb13' into feature-redwood
2018-10-05 10:15:10 -07:00
Evan Tschannen
3922e477a5
Merge branch 'release-6.0'
...
# Conflicts:
# documentation/sphinx/source/release-notes.rst
# fdbclient/ManagementAPI.actor.cpp
# fdbserver/ClusterController.actor.cpp
# fdbserver/DataDistribution.actor.cpp
# fdbserver/LogSystemDiskQueueAdapter.actor.cpp
# fdbserver/SimulatedCluster.actor.cpp
# fdbserver/TLogServer.actor.cpp
2018-10-03 16:57:18 -07:00
Evan Tschannen
15ce215c1b
fix: parallel peek requests leaked memory
2018-10-02 17:28:39 -07:00
Evan Tschannen
a24eadd73a
fix: for remote logs, their known committed version cannot be set to 1, because they can be used when their durable version is 0, leading to a known committed version being greater than a queue committed version
2018-09-28 12:17:21 -07:00
A.J. Beamon
92990d6aef
Merge release-6.0 into master
2018-09-21 16:14:39 -07:00
Evan Tschannen
77e2fb787e
Merge branch 'release-6.0' into feature-fix-forced-recovery
2018-09-21 14:55:37 -07:00
Evan Tschannen
31d0b0315f
fix: tlog spill policy would spill everything when it wanted to spill nothing
...
use a flow lock to protect updatePersistData and initPersistentState from committing simultaneously
2018-09-20 15:33:38 -07:00
Stephen Atherton
2fc86c5ff3
Merge branch 'master' of github.com:apple/foundationdb into feature-redwood
...
# Conflicts:
# fdbrpc/AsyncFileCached.actor.h
# fdbserver/IKeyValueStore.h
# fdbserver/KeyValueStoreMemory.actor.cpp
# fdbserver/workloads/StatusWorkload.actor.cpp
# tests/fast/SidebandWithStatus.txt
# tests/rare/LargeApiCorrectnessStatus.txt
# tests/slow/DDBalanceAndRemoveStatus.txt
2018-09-20 03:39:55 -07:00
Evan Tschannen
270b1b24a6
fix: we have to use durableKnownCommittedVersion, because the is the true lower bound on the recovery version of the remote logs
...
fixed a compiler error
2018-09-18 16:29:03 -07:00
Evan Tschannen
200e65fe61
added a workload which tests killing an entire region, and recovering from the failure with data loss.
...
fix: we cannot pop the txs tag from remote logs until they have a full copy of the txnStateStore
fix: we have to modify all of history, we cannot stop after finding a local remote
2018-09-17 18:32:39 -07:00
Evan Tschannen
4dd2dda0a3
Merge branch 'release-6.0'
...
# Conflicts:
# fdbserver/worker.actor.cpp
2018-09-05 16:11:06 -07:00
Evan Tschannen
df406a340e
Merge pull request #742 from ajbeamon/roles-in-trace-events
...
Add the roles running on a process as a field on trace events in the …
2018-09-05 16:08:12 -07:00
Evan Tschannen
90301f497f
Merge branch 'release-6.0'
...
# Conflicts:
# fdbclient/ManagementAPI.actor.cpp
# fdbrpc/FlowTransport.actor.cpp
# fdbrpc/TLSConnection.actor.cpp
# fdbserver/DataDistribution.actor.cpp
# fdbserver/Status.actor.cpp
# fdbserver/storageserver.actor.cpp
# fdbserver/workloads/StatusWorkload.actor.cpp
# versions.target
2018-09-05 16:06:33 -07:00
A.J. Beamon
2de0b5d6d7
Add the roles running on a process as a field on trace events in the form of a comma delimited string of role abbreviations.
2018-09-05 15:06:14 -07:00
Evan Tschannen
1022e0a5c6
added yields to the log router and tlogs after processing a version
2018-09-04 17:16:44 -07:00
Evan Tschannen
717c43a69f
merge 6.0 into master
2018-08-22 00:28:04 -07:00
Evan Tschannen
84e1f7b2b5
added overhead bytes durable to complement overhead bytes input
2018-08-21 22:35:04 -07:00
Evan Tschannen
74f7412975
added separate logging for overhead bytes
2018-08-21 22:18:38 -07:00
Evan Tschannen
d7c01f0419
added a separate knob for tlog’s recoverMemoryLimit
2018-08-21 21:11:23 -07:00
Alex Miller
74a9d2f836
Remove a couple more `Void _ = wait`s that crept in from rebase.
2018-08-14 15:50:26 -07:00
Alex Miller
fb31a6999f
Rewrite all files to have #include actorcompiler.h as the last include.
2018-08-14 15:50:26 -07:00
Alex Miller
69aa33eed5
Fix a case of a bool being used as an integer.
2018-08-14 15:50:26 -07:00
Alex Miller
535b5701e5
Rewrite all `Void _ = wait(...)` -> `wait(...)`.
...
This takes advantage of the new actorcompiler functionality to avoid
having duplicate definitions of `Void _` when trying to feed the
un-actorompiled source through clang.
2018-08-14 15:50:26 -07:00
Evan Tschannen
7c5d414f7b
fix: during destruction logData could attempt to dereference tLogData after it has been deleted
2018-08-09 12:38:35 -07:00
Evan Tschannen
c757c68bfa
fix: nextVersion needs to be set to logData->version if version_sizes is empty
2018-08-04 23:53:37 -07:00
Evan Tschannen
fec285146c
significant cpu optimization in update storage
2018-08-04 12:36:48 -07:00
Evan Tschannen
be1a4d74c7
tlogs serve reads to log routers at a low priority, to prevent them from using all their resources catching up a remote dc that has been down for a long time
...
increase the amount of memory ratekeeper budgets for tlogs so that there is a gap after the spill threshold to prevent temporarily overshooting the budget
2018-08-04 10:31:30 -07:00
Evan Tschannen
71f89f372f
changed a trace event name to avoid scope type mismatch on the tag field
2018-08-03 15:53:38 -07:00
Evan Tschannen
2619234477
Merge branch 'release-5.2' into release-6.0
...
# Conflicts:
# documentation/sphinx/source/release-notes.rst
2018-08-03 11:40:24 -07:00
Evan Tschannen
501033c5af
fix: tlog spilling on a stopped log was only making one version durable at a time
2018-08-03 11:38:12 -07:00
Evan Tschannen
1c29275672
call all methods which could disable a trace event before it is initialized. In practice this means calling .error first, then .suppressFor, then all your details.
2018-08-01 14:30:57 -07:00
Stephen Atherton
40762d9f9b
Merge branch 'master' of github.com:apple/foundationdb into feature-redwood
2018-07-25 17:58:52 -07:00
Evan Tschannen
0f59dc4086
fix: do not write to the persistent queue when we are terminated, which could happen if shutdown was caused by setting a promise in the asyncPullData loop
2018-07-13 17:01:31 -07:00
Evan Tschannen
cd63c7a7cc
added a buffered cursor, which efficiently merges lots of peek cursors
2018-07-12 12:09:48 -07:00
Stephen Atherton
96389c74cd
Merge branch 'master' of github.com:apple/foundationdb into feature-redwood
...
# Conflicts:
# tests/fast/SidebandWithStatus.txt
# tests/rare/LargeApiCorrectnessStatus.txt
# tests/slow/DDBalanceAndRemoveStatus.txt
2018-07-10 16:42:34 -07:00
Stephen Atherton
1bc95862b7
Merge branch 'release-6.0' of github.com:apple/foundationdb into feature-redwood
...
# Conflicts:
# tests/fast/SidebandWithStatus.txt
# tests/rare/LargeApiCorrectnessStatus.txt
# tests/slow/DDBalanceAndRemoveStatus.txt
2018-07-10 04:16:02 -07:00
Evan Tschannen
6b40f2764d
fix: off by one error on popping missing tags
2018-07-09 15:43:22 -07:00
Evan Tschannen
2718176927
fix: remote logs did not pop all of the data for removed logs on recovery because data for the missing tag was not recorded yet at the time of recovery
2018-07-06 16:10:41 -07:00
Evan Tschannen
507b3bacb0
fix: kill all tlogs in one region prevents the remote logs from recovering in that region, do not allow that to prevent us from configuring usable_regions=1.
...
added more recovery states.
2018-07-05 00:08:51 -07:00
Stephen Atherton
b95a2bd6c1
Merge commit 'b17c8359ec22892ed4daeaa569f2f5e105477251' into feature-redwood
...
# Conflicts:
# flow/Trace.cpp
2018-06-30 23:18:29 -07:00
Evan Tschannen
2987f85177
fix: known committed version must be updated before creating the tlogQueueEntryRef
2018-06-26 23:21:30 -07:00
Evan Tschannen
6e19622872
fix: durableKnownCommittedVersion was incorrect because we could update knownCommittedVersion before pushing a TLogQueueEntry with that known committed version
2018-06-26 18:02:55 -07:00
Evan Tschannen
2ec8744ab3
fix: parallel get more needs to verify the begin version matches the end of the previous request, because when a peek cursor expires we lose all history, so the same sequence number could start at different versions
2018-06-25 11:15:49 -07:00
Evan Tschannen
68ac3bdc4c
log routers now calculate a precise version to pop for their log router tag
2018-06-21 15:29:46 -07:00
Evan Tschannen
f755961c42
use parallelGetMore during log recovery
2018-06-20 17:04:06 -07:00
Stephen Atherton
e5c48d453a
Merge branch 'master' of github.com:apple/foundationdb into feature-redwood
2018-06-18 22:45:27 -07:00
Evan Tschannen
df4c445e25
fix: we need to return from commit when stopped
2018-06-18 22:12:46 -07:00
Evan Tschannen
403fb5a2e9
removed unnecessary suppressFor
2018-06-18 17:59:29 -07:00
Evan Tschannen
b79feaddd3
added a hard memory limit to the TLog to prevent it from running out of memory. Because remote logs are not ratekeeper controlled this is their only protection
2018-06-18 17:22:40 -07:00
Stephen Atherton
90c8288c68
Merge branch 'master' of github.com:apple/foundationdb into feature-redwood
2018-06-17 14:55:05 -07:00
Evan Tschannen
7aef5ec6f1
got rid of persistUnrecoveredBefore
...
added persistLocality
2018-06-17 14:44:33 -07:00
Evan Tschannen
f637c680f1
fix: populateSatelliteTagLocations was broken
...
fix: satellites do not index the upgraded locality
2018-06-17 13:29:17 -07:00
Evan Tschannen
0d87186821
use a specific locality for satellites
2018-06-15 11:06:38 -07:00
Evan Tschannen
feb8578c06
fix: only flush and exit in simulation
2018-06-14 13:48:30 -07:00
Stephen Atherton
1eae9d621b
Merge branch 'master' of github.com:apple/foundationdb into feature-redwood
2018-06-13 15:58:21 -07:00
Stephen Atherton
2878f30f29
Merge branch 'master' of github.com:apple/foundationdb into feature-redwood
...
# Conflicts:
# fdbserver/IKeyValueStore.h
# fdbserver/KeyValueStoreMemory.actor.cpp
# fdbserver/storageserver.actor.cpp
2018-06-13 15:56:06 -07:00
Alex Miller
fcfa00928b
Make RecoveryState an enum class.
...
This means that all the == 7 or != 0 checks go away, and explicit names must be used.
2018-06-12 16:50:25 -07:00
Evan Tschannen
372ed67497
Merge branch 'master' into feature-remote-logs
...
# Conflicts:
# fdbserver/DataDistribution.actor.cpp
# fdbserver/MasterProxyServer.actor.cpp
# fdbserver/TLogServer.actor.cpp
# fdbserver/TagPartitionedLogSystem.actor.cpp
2018-06-11 11:34:10 -07:00
Evan Tschannen
588eaf4b36
fix: previous delay 0 could still cause us to recruit a tlog before processing disk errors
2018-06-11 11:26:30 -07:00
Evan Tschannen
a5c2a8ee8a
fix: allow disk errors to cancel the actor before recruiting logs
2018-06-10 20:27:19 -07:00
A.J. Beamon
e5488419cc
Attempt to normalize trace events:
...
* Detail names now all start with an uppercase character and contain no underscores. Ideally these should be head-first camel case, though that was harder to check.
* Type names have the same rules, except they allow one underscore (to support a usage pattern Context_Type). The first character after the underscore is also uppercase.
* Use seconds instead of milliseconds in details.
Added a check when events are logged in simulation that logs a message to stderr if the first two rules above aren't followed.
This probably doesn't address every instance of the above problems, but all of the events I was able to hit in simulation pass the check.
2018-06-08 11:11:08 -07:00
Evan Tschannen
bf65e745a9
tlogs do not index tags for other localities
2018-06-01 22:51:08 -07:00
Evan Tschannen
c519339adb
avoid peeking from logs that do not match the tag’s locality
2018-06-01 18:42:48 -07:00
A.J. Beamon
1f0b519a73
Rename several variables in TLogServer.actor.cpp to follow our normal camel case conventions. I didn't rename every variable here, because some appear to be data structures (like a map) following the pattern keydesc_valuedesc, and I wasn't sure that the straightforward keydescValuedesc rename made sense. I did rename a couple of instances of these where it seemed reasonable, though.
2018-06-01 10:18:07 -07:00
A.J. Beamon
d9c702a9e3
Merge release-5.1 into release-5.2
2018-05-30 09:09:55 -07:00
Evan Tschannen
529bd34cf9
fix: when a tlog is stopped by another recruitment it no longer has the opportunity for commtingQueue to be set
2018-05-06 20:37:44 -07:00
Evan Tschannen
81c7bddaf8
fix: must check for log router errors while waiting on satellite replies because the recruitmentID will not be updated if it threw an error
2018-05-06 18:15:12 -07:00
Evan Tschannen
b4bd03e67e
fix: we cannot set queueCommitEnd until we have popped the log system to prevent the popped version from going backwards
2018-05-01 22:20:25 -07:00
Evan Tschannen
12ef63b698
knobify replace contents bytes
2018-05-01 19:43:35 -07:00
Evan Tschannen
10d25927cd
Merge branch 'master' into feature-remote-logs
...
# Conflicts:
# fdbserver/DataDistribution.actor.cpp
2018-04-30 22:15:39 -07:00
Evan Tschannen
5143871fed
passed debug ids into all versions of peek() to assist debugging
2018-04-30 13:36:35 -07:00
Evan Tschannen
883f2318a0
test fearless configurations
2018-04-30 13:17:29 -07:00
Evan Tschannen
92b134eb98
fix: errors from removed were not handled properly
2018-04-29 23:05:08 -07:00
Evan Tschannen
6f318dbff2
fix: do not reply to recruitment until we are sure the log commits to the queue
2018-04-29 22:08:24 -07:00
Evan Tschannen
9cdabfed0e
added useful trace events
2018-04-29 18:54:47 -07:00
Evan Tschannen
f77c1ec14e
fix: fixed rare bug where a log stopped by a different recruitment would still response successfully to the recruitment message
2018-04-28 13:34:06 -07:00
Evan Tschannen
af63dac5dd
fix: remote logs need to wait until the durable known committed version is greater than the recovery version before completing recovery to ensure we will not pick a start version that we do not have
2018-04-27 12:18:42 -07:00
Stephen Atherton
af61d3596d
Merge branch 'public-master' into feature-redwood
...
# Conflicts:
# fdbserver/DatabaseConfiguration.cpp
# fdbserver/OldTLogServer.actor.cpp
# fdbserver/fdbserver.vcxproj
# fdbserver/fdbserver.vcxproj.filters
2018-04-24 17:22:21 -07:00
Evan Tschannen
ae1de575f1
fix: remote logs are not considered fully recovered until they are at recoveredAt
2018-04-23 17:49:46 -07:00
Evan Tschannen
3ec09ce9f6
fix: only peekSingle needs to throw worker_removed, because tlogs have other ways to get notified they are no longer needed
...
fix: we need to wait until tags are popped past recoveredAt instead of unrecovered before
2018-04-23 16:43:08 -07:00
Evan Tschannen
126fc53d10
fix: the start version for peek cursors that merge with multiple log sets is the maximum of the individual start versions
2018-04-23 12:42:51 -07:00
Dennis Schafroth
290122637b
Using ASSERT_ABORT in destructors
2018-04-23 14:05:10 +02:00
Evan Tschannen
73597f190e
fix: new tlogs are initialized with exactly the tags which existed at the recovery version
2018-04-22 20:28:01 -07:00
Evan Tschannen
a520d03397
fix: if we cannot find a tag, it must have been popped at the recovery version.
2018-04-22 15:08:38 -07:00
Evan Tschannen
1d1e2cd367
fix: initialize the known committed version on the tlog
2018-04-21 00:41:15 -07:00
Evan Tschannen
8d350ceb5f
fix: persist the known committed version on the tlogs
2018-04-20 17:55:46 -07:00
Evan Tschannen
a6d9e889f0
a cleaner solution to preventing tlogs from peeking log routers
2018-04-20 13:25:22 -07:00
Evan Tschannen
f5c3417905
fix: prevent tlogs from peeking the wrong log routers
2018-04-20 00:30:37 -07:00
Evan Tschannen
5da452db8e
fix: pop the log routers again after the log system updates
2018-04-19 14:33:31 -07:00
Evan Tschannen
22526ef996
fix: do not tell storage servers about large sections of empty versions, because it can lead them to make mutations durable which have not been committed
2018-04-18 16:06:44 -07:00
Evan Tschannen
447c7bd15b
fix: log routers use durable known committed version at the time of the pop to determine what is safe to pop from their logs
...
fix: storage server does not advance its version across large version increase until it has data associated with the version
2018-04-18 12:07:29 -07:00
Evan Tschannen
760bc8bc99
fix: log router version needs to be fetched before it is available
...
fix: tlog did not fetch known committed version if start version was exactly equal to it
2018-04-17 11:16:48 -07:00
Evan Tschannen
3e40505f4a
Revert "fix: remote logs should reply until they have recovered through recoverAt"
...
This reverts commit 3c0c03c004
.
2018-04-16 23:17:16 -07:00
Evan Tschannen
3c0c03c004
fix: remote logs should reply until they have recovered through recoverAt
2018-04-16 17:25:49 -07:00
Evan Tschannen
3018a7b1b3
fix: the known committed version of a newly initialized log is 1, since by definition the first commit must have succeeded
2018-04-16 10:42:48 -07:00
Evan Tschannen
5533016f1e
fix: tlogs are now initialized immediately, instead of when starting the core, this must be done to pop the log routers during recovery
...
fix: log router start version must be the same as remote log start version
2018-04-15 14:33:07 -07:00
Evan Tschannen
65e69620a7
fix: unrecoveredBefore on a new log is at minimum 1
2018-04-13 10:41:30 -07:00
Evan Tschannen
1af5ac0d9d
fix: a number of different problems prevented tlogs from using log routers during recovery
2018-04-12 15:20:54 -07:00
Alex Miller
20082e3228
Clang fixes.
2018-04-12 11:10:53 -07:00
Evan Tschannen
a738c4bec1
fix: if the known committed version is equal to the recovery version we do not need to copy any data
2018-04-09 20:48:55 -07:00
Evan Tschannen
419951f601
fix: need to initialize tlog versions to less than the startVersion
2018-04-09 17:17:11 -07:00
Evan Tschannen
4c89f721cd
fix: do not include logRouter tags in lock results
2018-04-09 10:48:57 -07:00
Evan Tschannen
7af892f50b
first working version of non-copying recovery working with fearless configurations
2018-04-08 21:24:05 -07:00
Stephen Atherton
2752a28611
Merge branch 'release-5.2' of github.com:apple/foundationdb into feature-redwood
2018-04-06 16:29:37 -07:00
Evan Tschannen
331e707684
fix: pop all tags that did not have data at the recovery version because fully popped tags may come back when pullAsyncData re-indexes the mutations
2018-03-31 16:47:56 -07:00
Evan Tschannen
96fffe2cea
fix: do not update version if the log has been stopped
2018-03-30 22:11:42 -07:00
Evan Tschannen
1a4ded1c99
support upgrades by merging tags associated with the different peek requests
2018-03-29 17:54:08 -07:00
Evan Tschannen
b36e08f08f
first version of non-copying recovery. Upgrades are broken, and it has not been tested using fearless configurations yet
2018-03-29 15:12:38 -07:00
Evan Tschannen
0746fe4d56
optimized tag lookups on the tlog by removing one level of vectors
2018-03-20 10:41:42 -07:00
Evan Tschannen
d8e064d8bb
fix: when a new log is recruited on a shared log, all outstanding commits need to be notified that they are stopped, because there is no longer a guarantee that their queueCommittedVersion will advance
2018-03-19 17:48:28 -07:00
Evan Tschannen
54be14000d
do not deserialize tags
2018-03-17 11:24:18 -07:00
Evan Tschannen
9c8cb445d6
optimized the tlog to use a vector for tags instead of a map
2018-03-17 10:36:19 -07:00
Evan Tschannen
fecfea0f7d
fix: messages vector was not cleared
2018-03-17 10:24:44 -07:00
Evan Tschannen
ccd70fd005
The tlog uses the tags embedded in the message instead of a separate vector of locations
...
optimized remote tlog committing to avoid re-serializing the message
2018-03-16 16:47:05 -07:00
Evan Tschannen
820382ea68
optimized the log router commit path to avoid re-serializing the data
2018-03-16 11:40:21 -07:00
Evan Tschannen
f6a22c1035
fix: the recovery actor was holding a copy of the tlogInterface after the tlog was removed
2018-03-12 16:56:34 -07:00
A.J. Beamon
f2c804e14f
Reverting changes from merge of master into release-5.2 ( b25810711c
). Note that we never intend to release master into release-5.2, but if we did we would need to revert this commit.
2018-03-06 10:15:04 -08:00
A.J. Beamon
b25810711c
Merge branch 'master' into release-5.2
2018-03-05 10:32:57 -08:00
Balachandar Namasivayam
8ae640c062
Addressed review comments.
2018-03-02 17:56:49 -08:00
Balachandar Namasivayam
11df1aeabf
Add new api to get shared tlogs id and address
2018-03-02 16:50:30 -08:00
Evan Tschannen
e3c6b66240
fix: do not commit more data after being stopped
...
fix: prioritize dc locality above exclusion to prevent being stuck after excluding all machines in a data center
2018-02-26 13:13:37 -08:00
Evan Tschannen
37a6a81634
Merge commit '7f6fc3e039c911cd84b8540f7f799fc38a1c1822' into feature-remote-logs
...
# Conflicts:
# fdbserver/workloads/RestartRecovery.actor.cpp
2018-02-23 12:33:28 -08:00
Evan Tschannen
ddb484143c
fix: do not peek from remote logs if they are not fully recovered
2018-02-21 14:06:44 -08:00
Alec Grieser
0bae9880f1
remove trailing whitespace from our copyright headers ; fixed formatting of python setup.py
2018-02-21 10:25:11 -08:00
Evan Tschannen
1dc6a8d4bd
fix: the tlog can peek from log systems that have been recovered even if it does not match its recoverFrom set
2018-02-20 14:50:13 -08:00
Evan Tschannen
31b89a638f
added satellite_none and remote_none options to unconfigure from a fearless setup
...
fix: log_router configuration was broken
2018-02-17 13:51:17 -08:00
Evan Tschannen
1fedcba890
fix: do not use log router tags when configured without remote logs
...
fix: data distribution tracks undesired storage servers
re-enabled consistency check
2018-02-13 17:01:34 -08:00
Stephen Atherton
0a35f167e4
Merge branch 'master' into feature-redwood
...
# Conflicts:
# fdbserver/DiskQueue.actor.cpp
# fdbserver/IDiskQueue.h
# fdbserver/Knobs.cpp
# fdbserver/Knobs.h
# fdbserver/fdbserver.vcxproj
# fdbserver/fdbserver.vcxproj.filters
# fdbserver/worker.actor.cpp
2018-02-12 01:30:02 -08:00
Evan Tschannen
6b54d56ca7
gracefully exit if attempting to upgrade from 4.X versions
2018-01-30 17:10:50 -08:00
Evan Tschannen
29c5d4ad3d
upgrades from 5.X mostly supported, still some remaining correctness problems
2018-01-28 11:52:54 -08:00
Evan Tschannen
66b2218989
added tlog support for upgrading from 5.X clusters. Does not support upgrading from 4.X or earlier. Untested, storage servers still need the ability to change their tag.
2018-01-21 12:21:46 -08:00
Evan Tschannen
21482a45e1
Merge branch 'master' into feature-remote-logs
...
# Conflicts:
# fdbserver/DBCoreState.h
# fdbserver/LogSystem.h
# fdbserver/LogSystemPeekCursor.actor.cpp
# fdbserver/TLogServer.actor.cpp
2018-01-14 13:40:24 -08:00
Evan Tschannen
be643d6937
fix: the tlog did not cancel recovery properly when stopped
2018-01-12 17:18:14 -08:00
Evan Tschannen
de119f192d
fixed a priority inversion where the tlog would prefer to copy data from the previous generation rather than make data durable (leading to being ratekeeper controlled)
2018-01-11 16:09:49 -08:00
Evan Tschannen
30710f7493
syncLogId was not necessary
2018-01-06 14:52:39 -08:00
Evan Tschannen
10c3fc165e
fix: after recovering from disk, only allow peeking data the was fully recovered
2018-01-06 13:49:13 -08:00
Evan Tschannen
63751fb0e2
fix: remote logs are not in the log system until the recovery is complete so they cannot be used to determine if this is the correct log system to recover from
2018-01-05 14:15:25 -08:00
Evan Tschannen
5ac4f73978
Merge branch 'release-5.1' into feature-remote-logs
...
# Conflicts:
# fdbclient/NativeAPI.actor.cpp
# fdbrpc/Locality.h
# fdbrpc/simulator.h
# fdbserver/ApplyMetadataMutation.h
# fdbserver/ClusterController.actor.cpp
# fdbserver/LogSystemPeekCursor.actor.cpp
# fdbserver/MasterProxyServer.actor.cpp
# fdbserver/SimulatedCluster.actor.cpp
# fdbserver/TLogServer.actor.cpp
# fdbserver/TagPartitionedLogSystem.actor.cpp
# fdbserver/WorkerInterface.h
# fdbserver/masterserver.actor.cpp
# flow/Net2.actor.cpp
# tests/fast/SidebandWithStatus.txt
# tests/rare/LargeApiCorrectnessStatus.txt
# tests/slow/DDBalanceAndRemoveStatus.txt
2018-01-05 11:33:42 -08:00
Evan Tschannen
f2c4beed9f
fix: tlogFitness did not consider it better to have one tlog of a better fitness
...
fix: checkStable was not used in all places in better master exists
fix: we need to call checkOutstanding on worker registration in all cases
fix: in case persistentData is keyValueStoreMemory, we need to make sure it is fully recovered before writing to it
2018-01-04 11:33:02 -08:00
Alex Miller
c7dbd31a1e
Refactoring: Create a common prefixRange and do UID->Key once in backup.
2017-12-19 17:17:50 -08:00
Evan Tschannen
8c51bc4ac4
fixed low latency tests in a way that gives us better test coverage
2017-11-28 18:20:29 -08:00
Evan Tschannen
dc624a54dc
fix: avoid flushing large queues in simulation when checking latency
2017-11-27 17:23:20 -08:00
Evan Tschannen
df74e2a373
re-added support for non-copying tlog recovery
2017-10-24 15:09:31 -07:00
Evan Tschannen
15962cf079
Merge branch 'master' into feature-remote-logs
...
# Conflicts:
# fdbrpc/Locality.cpp
# fdbrpc/Locality.h
# fdbserver/ClusterController.actor.cpp
# fdbserver/ClusterRecruitmentInterface.h
# fdbserver/TLogServer.actor.cpp
# fdbserver/TagPartitionedLogSystem.actor.cpp
# fdbserver/WorkerInterface.h
# fdbserver/fdbserver.vcxproj.filters
# fdbserver/masterserver.actor.cpp
# fdbserver/worker.actor.cpp
# flow/error_definitions.h
2017-10-05 17:09:44 -07:00
Balachandar Namasivayam
0e153cdd35
Throttle Spammy logs. Three knobs are added.
...
Trace Events are sampled and cached with an expiration set. Every TraceEvent above SevDebug is checked against this cache to see if it exceeded a set threshold. If yes, then throttle the TraceEvent.
If a TraceEvent is throttled, a warning msg is logged.
2017-10-02 18:43:11 -07:00
Stephen Atherton
248dab79b6
Created “redwood” storage engine option and many changes to support that including IKeyValueStore::init() and custom DiskQueue file extensions.
2017-09-21 23:51:55 -07:00
Evan Tschannen
f75dfc3153
do not register with the master until recovery of the queue is complete, to avoid having the master wait a long time for a peek response
2017-09-18 17:39:12 -07:00
Evan Tschannen
36c98f18e9
do not register a worker with the cluster controller until it has finished recovering all files from disk
2017-09-15 10:57:58 -07:00