foundationdb

Commit Graph

Author	SHA1	Message	Date
Jingyu Zhou	fda6c08640	Include a total number of tags in partition log file names This is needed for BackupContainer to check partitioned mutation logs are continuous, i.e., restorable to a version.	2020-03-20 20:13:38 -07:00
Balachandar Namasivayam	58a9bfa78b	Merge pull request #2820 from dongxinEric/fix/1977/add-back-trace-event-flush-failure-report Fix/1977/add back trace event flush failure report	2020-03-18 16:11:44 -07:00
Xin Dong	5967ef5eab	Added back the changes that report trace log flush failures and fix the random crash	2020-03-12 14:34:19 -07:00
Meng Xu	e0d2eca7a8	checkForExtraDataStores:Add coordinators into stateful process list	2020-03-10 23:38:30 -07:00
Xin Dong	39610d15f8	Revert this change since it somehow introduced a random crash detected on circus	2020-03-04 16:14:38 -08:00
Xin Dong	f20619c9fb	Resolve review comments. Changed how issues got cleared	2020-02-25 15:39:51 -08:00
Xin Dong	090c89e90a	Addressed review comments. Fix the bug where issues on a worker may be wrongly cleared by subsequent GetDBinfo request.	2020-02-25 15:39:38 -08:00
Xin Dong	6325c40336	Apply suggestions from code review Co-Authored-By: A.J. Beamon <ajbeamon@users.noreply.github.com>	2020-02-25 15:39:09 -08:00
Xin Dong	f4f860bfa8	Changed issue reporting to be thread safe. Also changed the liveness ping to be thread safe.	2020-02-25 15:38:14 -08:00
Xin Dong	0b0414fb94	Addressed review comments. Change the issue reporting from 'ITraceLogWriter' to be a more generic way.	2020-02-25 15:37:53 -08:00
Xin Dong	034dfe5e42	Now the inability to flush trace logs will be reported to both 'stderr' and also the status json object. - Since the first flush failure, if the accumulated consecutive failure count exceeds the value defined in knobs, it will trigger the current worker process to report this issue via the 'GetServerDBInfo' interface of the cluster controller - A successful flush will reset the accumulated counter. Notice that the current solution does not take the time into consideration. The assumption is that flush failures tend to only happen in a clustered manner. The intermittent, but short, periods of flush failures are not considered as a problem since the memory pressure built by them should be negligible.	2020-02-25 15:37:32 -08:00
Evan Tschannen	96258b9809	Merge branch 'release-6.2' # Conflicts: # documentation/sphinx/source/release-notes.rst # fdbcli/fdbcli.actor.cpp # fdbclient/ManagementAPI.actor.cpp # fdbrpc/FlowTransport.actor.cpp # fdbserver/ClusterController.actor.cpp # fdbserver/DataDistribution.actor.cpp # fdbserver/DataDistribution.actor.h # fdbserver/DataDistributionQueue.actor.cpp # fdbserver/KeyValueStoreMemory.actor.cpp # fdbserver/MasterProxyServer.actor.cpp # fdbserver/QuietDatabase.actor.cpp # fdbserver/SkipList.cpp # fdbserver/StorageMetrics.actor.h # fdbserver/TLogServer.actor.cpp # fdbserver/fdbserver.actor.cpp # fdbserver/storageserver.actor.cpp # fdbserver/workloads/KVStoreTest.actor.cpp # flow/CMakeLists.txt # flow/Knobs.cpp # flow/Knobs.h # flow/genericactors.actor.cpp # flow/serialize.h	2020-02-21 19:09:16 -08:00
A.J. Beamon	1d9140d874	Removed TLogVersion logging. Added logging of SharedTLog ID for each TLog. Switched ID logged for TLogRejoining event to the TLog instead of the SharedTLog. Made some parameters to startRole passed by reference.	2020-02-14 12:33:43 -08:00
A.J. Beamon	56053c565b	Improve TLog "Role" event by adding the worker ID, the TLog version, and under what circumstances the TLog is being started (Restored, Recruited, or Recovered). The SharedTLog role was being started and stopped twice, so remove one instance of it.	2020-02-12 15:11:38 -08:00
Jingyu Zhou	1eaea91cb3	Address review comments	2020-01-22 19:42:13 -08:00
Jingyu Zhou	116608a0a7	Set backup workers w.r.t. the correct epoch For backup workers created for previous epoch, we need to associate them with the correct epoch so that later peekLogRouter can get the correct peek cursor. Otherwise, the workers can never peek the missing range of mutations.	2020-01-22 19:38:45 -08:00
Jingyu Zhou	19d6a889ff	Recruit backup workers for old epochs If there are unfinished ranges in the old epochs, the new master will recruit backup workers responsible for finishing these ranges. These workers remain in the cluster until the next epoch, when they will remove themselves.	2020-01-22 19:38:45 -08:00
Jingyu Zhou	17002740bb	Add epoch and backup workers to DBCoreState This enables backup workers to know the end version of the epoch. Additionally, the master recovery only needs to deal with crashed backup workers by recruiting new workers to backup the unfinished version range.	2020-01-22 19:38:45 -08:00
Jingyu Zhou	7da9f47f26	Enable pop from backup workers This is still WIP as some edge cases can trigger test failure, most likely due to not popping mutations by backup workers when epoch ends.	2020-01-22 19:38:45 -08:00
Jingyu Zhou	443c4995a2	Add file identifier in interfaces for flatbuffer	2020-01-22 19:37:48 -08:00
Jingyu Zhou	ece3cadf8e	Recruit backup worker during master recovery Right now recruit the same number as TLogs. The backup worker does nothing.	2020-01-22 19:37:48 -08:00
Jingyu Zhou	de8d953865	Add backup role, class, and worker skeleton	2020-01-22 19:35:30 -08:00
Evan Tschannen	ebcb2f79ed	Merge branch 'master' of github.com:apple/foundationdb	2019-11-22 15:34:49 -08:00
Evan Tschannen	8d3ef89540	Merge branch 'release-6.2' # Conflicts: # CMakeLists.txt # documentation/sphinx/source/release-notes.rst # fdbclient/MutationList.h # fdbserver/MasterProxyServer.actor.cpp # versions.target	2019-11-14 15:49:56 -08:00
negoyal	a4a0bf18f9	Merging with Master.	2019-11-12 13:01:29 -08:00
Evan Tschannen	1e5677b55a	increase the priority of reboot and recruitment requests	2019-11-11 15:17:11 -08:00
Alex Miller	1eb3a70b96	Spill SharedTLog when there's more than one. When switching between spill_type or log_version, a new instance of a SharedTLog is created in the transaction log processes. If this is done in a saturated database, then doubling the amount of memory to hold mutations in memory can cause TLogs to be uncomfortably close to the 8GB OOM limit. Instead, we now thread which UID of a SharedTLog is active, and the other TLog spill out the majority of their mutations. This is a backport of #2213 (`fef89aa1`) to release-6.2	2019-10-17 01:24:50 -07:00
Alex Miller	b3fd4f62a7	Fix whitespace.	2019-10-07 18:08:27 -07:00
Alex Miller	1d8a7e5af7	Spill SharedTLog when there's more than one. When switching between spill_type or log_version, a new instance of a SharedTLog is created in the transaction log processes. If this is done in a saturated database, then doubling the amount of memory to hold mutations in memory can cause TLogs to be uncomfortably close to the 8GB OOM limit. Instead, we now thread which UID of a SharedTLog is active, and the other TLog spill out the majority of their mutations.	2019-10-07 18:08:27 -07:00
Alex Miller	60fb04ca68	Fork TLogServer into TLogServer_6_2 This prepares us for incoming modifications to the TLog that can't easily coexist with our current on-disk state.	2019-10-03 01:41:25 -07:00
Evan Tschannen	b495cc697b	Merge branch 'release-6.2' # Conflicts: # CMakeLists.txt # documentation/sphinx/source/release-notes.rst # versions.target	2019-09-13 09:25:08 -07:00
Evan Tschannen	945cff1e5b	the cluster controller caches the serialization of serverDBInfo, to avoid regenerating it many times	2019-09-10 14:27:22 -07:00
Meng Xu	d160810662	FastRestore:Resolve review comments	2019-09-04 16:48:43 -07:00
sramamoorthy	a65c9f92ed	get rid of all timeouts and other changes	2019-07-24 15:36:28 -07:00
sramamoorthy	8f1f0c0435	snap v2: worker and other helper related changes	2019-07-24 15:36:28 -07:00
Evan Tschannen	15e894c724	Merge in master	2019-07-05 15:49:24 -07:00
Alex Miller	ea6898144d	Merge remote-tracking branch 'upstream/master' into flowlock-api	2019-07-03 20:44:15 -07:00
A.J. Beamon	8c10d832a1	Add coordinator role in trace events	2019-07-03 11:09:36 -07:00
mengranwo	6b61b0e030	fix syntax error, pass compile	2019-07-01 16:09:51 -07:00
mengranwo	0b9cd18fb4	checking cluster is healthy or not during recovery process(for storage engine), if healthy, delete data files and join as new	2019-07-01 16:09:51 -07:00
Alex Miller	7a500cd37f	A giant translation of TaskFooPriority -> TaskPriority::Foo This is so that APIs that take priorities don't take ints, which are common and easy to accidentally pass the wrong thing.	2019-06-25 02:47:35 -07:00
Evan Tschannen	e0be631414	shard the txs tag so that more transaction logs are involved in its recovery	2019-06-19 18:15:09 -07:00
mpilman	68ce9a5e75	ProtocolVersion type - second try	2019-06-18 17:55:27 -07:00
sramamoorthy	2a68b28590	rebase related changes	2019-05-28 22:07:46 -07:00
sramamoorthy	ec7834e2f7	code reorganization and address comments	2019-05-28 22:07:46 -07:00
sramamoorthy	b6e037ffbc	Replace fork with boost::process::child	2019-05-28 22:07:46 -07:00
sramamoorthy	61e93a9304	Address review comments and minor fixes	2019-05-28 22:07:46 -07:00
sramamoorthy	9e3104c2d4	Fix: races in async exec leading to bad backup	2019-05-28 22:07:46 -07:00
sramamoorthy	cfdad0c5e6	tlog to snapshot exactly at exec version	2019-05-28 22:07:46 -07:00
sramamoorthy	898bed66c1	Allow only whitelisted binary path for exec op	2019-05-28 22:07:46 -07:00
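
The flush-failure reporting described in commit 034dfe5e42 (consecutive failures counted against a knob value, counter reset on a successful flush) can be sketched roughly as follows. This is an illustrative Python sketch only, not FoundationDB's C++ implementation: the class name `FlushFailureTracker`, the `max_consecutive_failures` parameter (standing in for the knob), the `report_issue` callback, and the issue string are all hypothetical.

```python
from typing import Callable


class FlushFailureTracker:
    """Hypothetical sketch: counts consecutive trace-log flush failures."""

    def __init__(self, max_consecutive_failures: int,
                 report_issue: Callable[[str], None]) -> None:
        # Stand-in for the knob value mentioned in the commit message.
        self.max_consecutive_failures = max_consecutive_failures
        # Stand-in for whatever surfaces the issue (stderr, status json).
        self.report_issue = report_issue
        self.consecutive_failures = 0

    def record_flush(self, succeeded: bool) -> None:
        if succeeded:
            # A successful flush resets the accumulated counter.
            self.consecutive_failures = 0
            return
        self.consecutive_failures += 1
        # Report only once the count exceeds the threshold; elapsed time
        # is deliberately not considered, as the commit message notes.
        if self.consecutive_failures > self.max_consecutive_failures:
            self.report_issue("trace_log_flush_failure")  # hypothetical name
```

In this sketch the issue is reported on every failed flush after the threshold is exceeded; whether the real code reports once or repeatedly is not stated in the commit message.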


76 Commits