Jingyu Zhou
fda6c08640
Include a total number of tags in partition log file names
...
This is needed for BackupContainer to check partitioned mutation logs are
continuous, i.e., restorable to a version.
2020-03-20 20:13:38 -07:00
Balachandar Namasivayam
58a9bfa78b
Merge pull request #2820 from dongxinEric/fix/1977/add-back-trace-event-flush-failure-report
...
Fix/1977/add back trace event flush failure report
2020-03-18 16:11:44 -07:00
Xin Dong
5967ef5eab
Added back the changes that report trace log flush failures and fix the random crash
2020-03-12 14:34:19 -07:00
Meng Xu
e0d2eca7a8
checkForExtraDataStores:Add coordinators into stateful process list
2020-03-10 23:38:30 -07:00
Xin Dong
39610d15f8
Revert this change since it somehow introduced a random crash detected on circus
2020-03-04 16:14:38 -08:00
Xin Dong
f20619c9fb
Resolve review comments. Changed how issues got cleared
2020-02-25 15:39:51 -08:00
Xin Dong
090c89e90a
Addressed review comments. Fix the bug where issues on a worker may be wrongly cleared by subsequent GetDBinfo request.
2020-02-25 15:39:38 -08:00
Xin Dong
6325c40336
Apply suggestions from code review
...
Co-Authored-By: A.J. Beamon <ajbeamon@users.noreply.github.com>
2020-02-25 15:39:09 -08:00
Xin Dong
f4f860bfa8
Changed issue reporting to be thread safe. Also changed the liveness ping to be thread safe.
2020-02-25 15:38:14 -08:00
Xin Dong
0b0414fb94
Addressded review comments. Change the issue reporting from 'ITraceLogWriter' to be a more generic way.
2020-02-25 15:37:53 -08:00
Xin Dong
034dfe5e42
Now the inability to flush trace logs will be reported to both 'stderr' and also the status json object.
...
- Since the first flush failure, if the accumulated consecutive failure count exceeds the value defined in knobs, it will trigger the current worker process to report this issue via the 'GetServerDBInfo' interface of the cluster controler
- A successful flush will reset the accumulated counter.
Notice that the current solution does not take the time into consideration. The assumption is that flush failures tend to only happen in a clustered manner. The intermittent, but short, periods of flush failures are not considered as a problem since the memory pressure built by them should be negligible.
2020-02-25 15:37:32 -08:00
Evan Tschannen
96258b9809
Merge branch 'release-6.2'
...
# Conflicts:
# documentation/sphinx/source/release-notes.rst
# fdbcli/fdbcli.actor.cpp
# fdbclient/ManagementAPI.actor.cpp
# fdbrpc/FlowTransport.actor.cpp
# fdbserver/ClusterController.actor.cpp
# fdbserver/DataDistribution.actor.cpp
# fdbserver/DataDistribution.actor.h
# fdbserver/DataDistributionQueue.actor.cpp
# fdbserver/KeyValueStoreMemory.actor.cpp
# fdbserver/MasterProxyServer.actor.cpp
# fdbserver/QuietDatabase.actor.cpp
# fdbserver/SkipList.cpp
# fdbserver/StorageMetrics.actor.h
# fdbserver/TLogServer.actor.cpp
# fdbserver/fdbserver.actor.cpp
# fdbserver/storageserver.actor.cpp
# fdbserver/workloads/KVStoreTest.actor.cpp
# flow/CMakeLists.txt
# flow/Knobs.cpp
# flow/Knobs.h
# flow/genericactors.actor.cpp
# flow/serialize.h
2020-02-21 19:09:16 -08:00
A.J. Beamon
1d9140d874
Removed TLogVersion logging.
...
Added logging of SharedTLog ID for each TLog.
Switched ID logged for TLogRejoining event to the TLog instead of the SharedTLog.
Made some parameters to startRole passed by reference.
2020-02-14 12:33:43 -08:00
A.J. Beamon
56053c565b
Improve TLog "Role" event by adding the worker ID, the TLog version, and under what circumstances the TLog is being started (Restored, Recruited, or Recovered).
...
The SharedTLog role was being started and stopped twice, so remove one instance of it.
2020-02-12 15:11:38 -08:00
Jingyu Zhou
1eaea91cb3
Address review comments
2020-01-22 19:42:13 -08:00
Jingyu Zhou
116608a0a7
Set backup workers w.r.t. the correct epoch
...
For backup workers created for previous epoch, we need to associate them with
the correct epoch so that later peekLogRouter can get the correct peek cursor.
Otherwise, the workers can never peek the missing range of mutations.
2020-01-22 19:38:45 -08:00
Jingyu Zhou
19d6a889ff
Recruit backup workers for old epochs
...
If there are unfinished ranges in the old epochs, the new master will recruit
backup workers responsible for finishing these ranges. These workers remains in
the cluster until the next epoch, when it will remove itself.
2020-01-22 19:38:45 -08:00
Jingyu Zhou
17002740bb
Add epoch and backup workers to DBCoreState
...
This enables backup workers to know the end version of the epoch. Additionally,
the master recovery only needs to deal with crashed backup workers by
recruiting new workers to backup the unfinished version range.
2020-01-22 19:38:45 -08:00
Jingyu Zhou
7da9f47f26
Enable pop from backup workers
...
This is still WIP as some edge cases can trigger test failure, most likely due
to not popping mutations by backup workers when epoch ends.
2020-01-22 19:38:45 -08:00
Jingyu Zhou
443c4995a2
Add file identifier in interfaces for flatbuffer
2020-01-22 19:37:48 -08:00
Jingyu Zhou
ece3cadf8e
Recruit backup worker during master recovery
...
Right now recruit the same number as TLogs. The backup worker does nothing.
2020-01-22 19:37:48 -08:00
Jingyu Zhou
de8d953865
Add backup role, class, and worker skeleton
2020-01-22 19:35:30 -08:00
Evan Tschannen
ebcb2f79ed
Merge branch 'master' of github.com:apple/foundationdb
2019-11-22 15:34:49 -08:00
Evan Tschannen
8d3ef89540
Merge branch 'release-6.2'
...
# Conflicts:
# CMakeLists.txt
# documentation/sphinx/source/release-notes.rst
# fdbclient/MutationList.h
# fdbserver/MasterProxyServer.actor.cpp
# versions.target
2019-11-14 15:49:56 -08:00
negoyal
a4a0bf18f9
Merging with Master.
2019-11-12 13:01:29 -08:00
Evan Tschannen
1e5677b55a
increase the priority of reboot and recruitment requests
2019-11-11 15:17:11 -08:00
Alex Miller
1eb3a70b96
Spill SharedTLog when there's more than one.
...
When switching between spill_type or log_version, a new instance of a
SharedTLog is created in the transaction log processes. If this is done
in a saturated database, then doubling the amount of memory to hold
mutations in memory can cause TLogs to be uncomfortably close to the 8GB
OOM limit.
Instead, we now thread which UID of a SharedTLog is active, and the
other TLog spill out the majority of their mutations.
This is a backport of #2213 (fef89aa1
) to release-6.2
2019-10-17 01:24:50 -07:00
Alex Miller
b3fd4f62a7
Fix whitespace.
2019-10-07 18:08:27 -07:00
Alex Miller
1d8a7e5af7
Spill SharedTLog when there's more than one.
...
When switching between spill_type or log_version, a new instance of a
SharedTLog is created in the transaction log processes. If this is done
in a saturated database, then doubling the amount of memory to hold
mutations in memory can cause TLogs to be uncomfortably close to the 8GB
OOM limit.
Instead, we now thread which UID of a SharedTLog is active, and the
other TLog spill out the majority of their mutations.
2019-10-07 18:08:27 -07:00
Alex Miller
60fb04ca68
Fork TLogServer into TLogServer_6_2
...
This prepares us for incoming modifications to the TLog that can't
easily coexist with our current on-disk state.
2019-10-03 01:41:25 -07:00
Evan Tschannen
b495cc697b
Merge branch 'release-6.2'
...
# Conflicts:
# CMakeLists.txt
# documentation/sphinx/source/release-notes.rst
# versions.target
2019-09-13 09:25:08 -07:00
Evan Tschannen
945cff1e5b
the cluster controller caches the serialization of serverDBInfo, to avoid regenerating it many times
2019-09-10 14:27:22 -07:00
Meng Xu
d160810662
FastRestore:Resolve review comments
2019-09-04 16:48:43 -07:00
sramamoorthy
a65c9f92ed
get rid of all timeouts and other changes
2019-07-24 15:36:28 -07:00
sramamoorthy
8f1f0c0435
snap v2: worker and other helper related changes
2019-07-24 15:36:28 -07:00
Evan Tschannen
15e894c724
Merge in master
2019-07-05 15:49:24 -07:00
Alex Miller
ea6898144d
Merge remote-tracking branch 'upstream/master' into flowlock-api
2019-07-03 20:44:15 -07:00
A.J. Beamon
8c10d832a1
Add coordinator role in trace events
2019-07-03 11:09:36 -07:00
mengranwo
6b61b0e030
fix syntax error, pass compile
2019-07-01 16:09:51 -07:00
mengranwo
0b9cd18fb4
checking cluster is healthy or not during recovery process(for storage engine), if healthy, delete data files and join as new
2019-07-01 16:09:51 -07:00
Alex Miller
7a500cd37f
A giant translation of TaskFooPriority -> TaskPriority::Foo
...
This is so that APIs that take priorities don't take ints, which are
common and easy to accidentally pass the wrong thing.
2019-06-25 02:47:35 -07:00
Evan Tschannen
e0be631414
shard the txs tag so that more transaction logs are involved in its recovery
2019-06-19 18:15:09 -07:00
mpilman
68ce9a5e75
ProtocolVersion type - second try
2019-06-18 17:55:27 -07:00
sramamoorthy
2a68b28590
rebase related changes
2019-05-28 22:07:46 -07:00
sramamoorthy
ec7834e2f7
code re-orgnaization and address comments
2019-05-28 22:07:46 -07:00
sramamoorthy
b6e037ffbc
Replace fork with boost::process::child
2019-05-28 22:07:46 -07:00
sramamoorthy
61e93a9304
Address review comments and minor fixes
2019-05-28 22:07:46 -07:00
sramamoorthy
9e3104c2d4
Fix: races in async exec leading to bad backup
2019-05-28 22:07:46 -07:00
sramamoorthy
cfdad0c5e6
tlog to snapshot exactly at exec version
2019-05-28 22:07:46 -07:00
sramamoorthy
898bed66c1
Allow only whitelisted binary path for exec op
2019-05-28 22:07:46 -07:00