foundationdb

Commit Graph

Author	SHA1	Message	Date
Aaron Molitor	bb17e194d9	Revert "Refactor: ClusterController driving cluster-recovery state machine" This reverts commit `1520390bc5`.	2021-12-24 11:25:51 -08:00
Ata E Husain Bohra	1520390bc5	Refactor: ClusterController driving cluster-recovery state machine diff-1: Address Jingyu's review comments diff-2: Introduce ClusterRecovery actor to seperate out cluster recovery code At present, cluster recovery process consists of following steps: 1. ClusterController clusterWatchDatabase actor recruits master/sequencer process. 2. Sequencer process implements the cluster recovery state machine, responsible to recruit all other processes as well restore the cluster state. Patch proposes a scheme where the cluster recovery state machine is implemented and driven by the ClusterController process instead of the Sequencer process. Advantages of the scheme could be: 1. Simplified design where ClusterController recruits "sequencer" process like other worker processes compared to current scheme where "sequencer" process gets special treatment. In newer scheme sequencer is responsible for maintaining/providing "committed version" (as expected). 2. ClusterController is responsible for worker processes recruitment, the sequencer though orchestrating the recovery state machine, it need to reachout to the ClusterController for recruiting worker processes etc. NOTE: Patch has moved the recovery state machine code from 'sequencer' -> 'cluster-controller' process, however, necessary updates were done for both functionality as well as performance improvement reasons. Next Steps: Cluster recovery documentation will be updated in near future.	2021-12-22 14:06:27 -08:00
Ata E Husain Bohra	abd2959702	Refactor: ClusterController driving cluster-recovery state machine diff-1: Address Jingyu's review comments At present, cluster recovery process consists of following steps: 1. ClusterController clusterWatchDatabase actor recruits master/sequencer process. 2. Sequencer process implements the cluster recovery state machine, responsible to recruit all other processes as well restore the cluster state. Patch proposes a scheme where the cluster recovery state machine is implemented and driven by the ClusterController process instead of the Sequencer process. Advantages of the scheme could be: 1. Simplified design where ClusterController recruits "sequencer" process like other worker processes compared to current scheme where "sequencer" process gets special treatment. In newer scheme sequencer is responsible for maintaining/providing "committed version" (as expected). 2. ClusterController is responsible for worker processes recruitment, the sequencer though orchestrating the recovery state machine, it need to reachout to the ClusterController for recruiting worker processes etc. NOTE: Patch has moved the recovery state machine code from 'sequencer' -> 'cluster-controller' process, however, necessary updates were done for both functionality as well as performance improvement reasons. Next Steps: Cluster recovery documentation will be updated in near future.	2021-12-22 14:06:27 -08:00
Ata E Husain Bohra	dfe9d184ff	Refactor: ClusterController driving cluster-recovery state machine At present, cluster recovery process consists of following steps: 1. ClusterController clusterWatchDatabase actor recruits master/sequencer process. 2. Sequencer process implements the cluster recovery state machine, responsible to recruit all other processes as well restore the cluster state. Patch proposes a scheme where the cluster recovery state machine is implemented and driven by the ClusterController process instead of the Sequencer process. Advantages of the scheme could be: 1. Simplified design where ClusterController recruits "sequencer" process like other worker processes compared to current scheme where "sequencer" process gets special treatment. In newer scheme sequencer is responsible for maintaining/providing "committed version" (as expected). 2. ClusterController is responsible for worker processes recruitment, the sequencer though orchestrating the recovery state machine, it need to reachout to the ClusterController for recruiting worker processes etc. NOTE: Patch has moved the recovery state machine code from 'sequencer' -> 'cluster-controller' process, however, necessary updates were done for both functionality as well as performance improvement reasons. Next Steps: Cluster recovery documentation will be updated in near future.	2021-12-22 14:06:27 -08:00
Neethu Haneesha Bingi	1f30368e71	KeyValueStoreRocksDB histograms to track latencies	2021-12-21 23:09:46 -08:00
Andrew Noyes	fd33d31ff5	Enable -Wdelete-non-virtual-dtor for clang build We had been disabling -Wdelete-non-virtual-dtor, because this seems to be done intentionally in the generated code of the actor compiler. I spent some time trying to rewrite it in a way that doesn't literally delete/destroy through a pointer to a base class without a virtual destructor, but I was unable to come up with something that passes correctness. My best guess is that we do this so that we can destroy actor state classes, call callbacks registered on the actor SAV, and then destroy the SAV. Anyway now we'll detect new usages of deleting through a pointer to a base class without a virtual destructor.	2021-12-20 16:19:31 -08:00
A.J. Beamon	d8e161f89e	Refactor NativeAPI transactions to create and pass around a reference counted state object. Watches no longer use the tranasction info object but instead use their own state.	2021-12-17 11:57:39 -08:00
A.J. Beamon	ff1cb58174	Convert hyphens to underscores for all prefix-based arguments (e.g. --knob-, --locality-)	2021-12-14 12:01:44 -08:00
A.J. Beamon	f24adc7b6a	Fix a bunch of places where we used old-style arguments. Allow hyphens for profiler args.	2021-12-14 09:59:14 -08:00
A.J. Beamon	f29f487823	Unify flags (#25 ) * Unify flags implementation and change help text in backup.actor.cpp Description Testing * Keep LOG_GROUP unchanged Description Testing * Transfer the hyphens to underscores for internal options and user's input, EXCEPT leading hyphens Description Testing * Use a deep copy of the user's input flag to do the match Description Testing * Convert the _ to - in Option arrays of backup.actor.cpp Description Testing * Transter _ to - for files: TLSConfig.actor.h, fdbcli.actor.cpp, fdbserver.actor.cpp, FileConverter.h, FileConverter.cpp Description Testing * Change another way to unify flag: using SO_O_ICASE_HYPHEN_AND_UNDERSCORE to determine whether we do the conversion in function IsEqual Description Testing * Change the config command's name from SO_O_ICASE_HYPHEN_AND_UNDERSCORE to SO_O_HYPHEN_TO_UNDERSCORE Description Testing * Update the comment for the SO_O_HYPHEN_TO_UNDERSCORE Description Testing * Fix left underscore in SOption arrays Description Testing * Convert _ to - in several files for commands Description Testing * Make the FDBService and fdbmonitor backward compatible Description Testing * Fix bugs about pointers Description Testing * Check underscore and hyphen at the same time for --knob_, --localily_ and --test_ And fix bugs in fdbmonitor and FDBService Description Testing * Simplify the function in fdbmonitor and FDBService about retrieving arguments. And fix some documents in masterserver.actor.cpp Description Testing * Convert _ to - for knob in the setKnob functions Description Testing * Convert - to _ in the setKnob functions Description Since key in the knob related maps only contain _ Testing * Rename varialbe name in the fdbmonitor and FDBService for clarification Description Testing Co-authored-by: Chang Liu <chang.liu@snowflake.com>	2021-12-14 08:44:39 -08:00
Renxuan Wang	5b079acd66	Coordinator should reply clientInfo when it changes. This bug is introduced in #5231.	2021-12-10 16:37:48 -08:00
Renxuan Wang	6978486e85	Coordinator should only reply client data if it's valid. Because when a coordinator restarts or newly joins a cluster, a client trying to connect to it may already have client data, while the coordinator doesn't. In this case, the coordinator should not reply empty client data.	2021-12-10 16:37:48 -08:00
Josh Slocum	f6ea67120e	Adding explicit empty version back to change feeds for now	2021-12-10 10:04:05 -06:00
Neethu Haneesha Bingi	d23b8645f8	Enabling rocksdb metrics logger in simulation.	2021-12-07 15:18:29 -08:00
Evan Tschannen	20ee921986	Merge pull request #5923 from sfc-gh-sgwydir/minicycletest Add MiniCycle Test	2021-12-07 11:30:16 -08:00
Sam Gwydir	31c0eef69c	Add Minicycle Workload	2021-12-06 15:46:40 -08:00
Evan Tschannen	fd2b27d7c4	Merge pull request #6103 from sfc-gh-ajbeamon/fix-dd-merge-too-soon Fix: Merge too soon bug	2021-12-06 15:01:47 -08:00
Andrew Noyes	def41697bf	Merge pull request #6083 from sfc-gh-tclinkenbeard/remove-temporaries Avoid creating unnecessary temporary objects	2021-12-06 13:24:56 -08:00
Zhe Wu	dc88d3fa37	use CC_ENABLE_WORKER_HEALTH_MONITOR knob to guard remoteDCIsHealthy logic	2021-12-06 09:33:45 -08:00
A.J. Beamon	ce2d1e1648	Fix: the shard tracker state could become inaccurate if there is an ABA type update to the bandwith state of a shard.	2021-12-03 18:42:55 -08:00
Evan Tschannen	6e9b1f18fe	Merge pull request #6095 from sfc-gh-ngoyal/fix-exclude-unreliable-from-protected Don't include an unreliable process in the protected list.	2021-12-03 14:49:59 -08:00
Tao Lin	9b0a9c4503	Return error when getRangeAndFlatMap has more & Improve simulation tests (#6029 )	2021-12-03 12:50:07 -08:00
negoyal	2725183b26	Don't include an unreliable process in the protected list.	2021-12-03 09:44:52 -08:00
Evan Tschannen	b11ae4dae8	Merge pull request #5910 from sfc-gh-jslocum/bg_bindings Blob Granule C bindings	2021-12-02 11:40:26 -08:00
sfc-gh-tclinkenbeard	464d9488ef	Merge remote-tracking branch 'origin/master' into fix-unused-warnings	2021-12-01 23:52:09 -08:00
sfc-gh-tclinkenbeard	6b45ef98ca	Merge remote-tracking branch 'origin/master' into remove-temporaries	2021-12-01 23:50:29 -08:00
sfc-gh-tclinkenbeard	d01a363e29	Avoid creating unnecessary temporary objects	2021-12-01 23:48:34 -08:00
sfc-gh-tclinkenbeard	90ced244eb	Fix -Wunused-but-set-variable warnings	2021-12-01 18:15:53 -08:00
sfc-gh-tclinkenbeard	3b5d23ef88	Remove unnecessary temporary objects when appending to vector of pairs	2021-12-01 18:15:16 -08:00
Josh Slocum	a82845af43	Merge branch 'master' into bg_bindings	2021-12-01 16:55:28 -06:00
Evan Tschannen	404bef2f03	Merge pull request #6076 from sfc-gh-ajbeamon/fix-cstate-registration-race Fix a race condition with updating the coordinated state and updating the master registration	2021-12-01 12:56:12 -08:00
sfc-gh-tclinkenbeard	ec64890ac1	Remove some usages of PRId64 by using fmt library	2021-11-30 23:35:36 -08:00
A.J. Beamon	72c5fb183d	Fix: avoid updating the master registration while the cstate is written but we are not accepting commits.	2021-11-30 15:44:04 -08:00
Steve Atherton	43b3e05fd5	Merge branch 'master' of github.com:apple/foundationdb into delay-prioritized-eviction # Conflicts: # fdbserver/VersionedBTree.actor.cpp	2021-11-29 16:14:43 -08:00
sfc-gh-ngoyal	ef248f3dbd	Merge pull request #6060 from sfc-gh-satherton/remove-try-evict Remove unused / unwanted code in Redwood.	2021-11-29 13:26:05 -08:00
Andrew Noyes	b6fd402a3c	Add option to use boost or libcoro By default, use boost everywhere except windows and linux x86 (for performance reasons)	2021-11-29 13:14:15 -08:00
Josh Slocum	1870e07ff4	Fixed pause racing with waitUntilHealthy	2021-11-29 14:19:15 -06:00
Steve Atherton	8216ad1e4e	When a cached page is hit, if it is in the prioritized eviction order don't move it to the normal order.	2021-11-28 21:04:35 -08:00
Steve Atherton	bed25f9571	Delay prioritized eviction of updated pages until after commit completes.	2021-11-28 21:03:44 -08:00
Steve Atherton	0f5535fce1	Remove explicit tryEvict(x) as it is unused and it is functionally replaced by prioritizeEviction(x).	2021-11-27 03:26:34 -08:00
Evan Tschannen	e3819dad7c	fix: If a removed tlog never attempted a queue commit, the update storage loop could get stuck waiting for queueCommittingVersion to advance	2021-11-25 09:55:01 -08:00
Evan Tschannen	80014d3247	Merge branch 'master' of https://github.com/apple/foundationdb into fix-best-team # Conflicts: # fdbclient/ServerKnobs.cpp	2021-11-24 11:56:23 -08:00
Evan Tschannen	1afef44f19	fix: self->shardsAffectedByTeamFailure->moveShard must be called without any waits after getting the destination team or we could miss failure notifications for the storage servers in the destination team	2021-11-23 15:26:38 -08:00
Evan Tschannen	37c9a1320c	added --print_sim_time to print simulated time to stdout	2021-11-23 15:01:44 -08:00
Evan Tschannen	b0aad44831	change feed requests with an explicit end need to get empty versions at the end boundary	2021-11-23 15:01:44 -08:00
Evan Tschannen	3360000a91	fix: an empty mutation vector did not mean that all of the data was returned	2021-11-23 15:01:44 -08:00
Evan Tschannen	e97d337ffe	fix: when a feed transitions from atLatest to not atLatest stall all updates at the blocked version	2021-11-23 15:01:44 -08:00
Evan Tschannen	10a925b7e9	do not return mutations larger than the dequeVersion	2021-11-23 15:01:44 -08:00
Evan Tschannen	e19277160e	improved the change feed workload to test whenAtLeast	2021-11-23 15:01:44 -08:00
Evan Tschannen	f3bb1d8f51	properly handle an active change feed stream when removed	2021-11-23 15:01:44 -08:00
sfc-gh-ngoyal	12e68da635	Merge pull request #6036 from sfc-gh-satherton/redwood-debug-sim-fixes Test-only bug fixes in Redwood along with debug logging detail improvements.	2021-11-22 11:13:28 -08:00
sfc-gh-ngoyal	e43ddb09ac	Merge pull request #6034 from sfc-gh-satherton/remap-window-fix Bug fix: incorrect remap cleanup window size in KeyValueStoreRedwood	2021-11-22 09:12:59 -08:00
Steve Atherton	d2149631d5	Bug fix, the remap cleanup window was being initialized incorrectly in the Redwood KVS wrapper.	2021-11-19 21:40:09 -08:00
Steve Atherton	3901d60548	Test-only bug fixes in Redwood along with debug logging detail improvements. Added clearRemapQueue() to Pager to more cleanly and reliably expire all old data and process the remap queue, fixing a bug where with certain configuration parameters and a lot of data the DestructiveSanityCheck would fail because it would not run cleanup long enough. Added more parameters to performance/set unit test.	2021-11-19 01:00:14 -08:00
sfc-gh-tclinkenbeard	c745002607	Merge remote-tracking branch 'origin/master' into add-format-warning	2021-11-18 11:13:03 -08:00
Evan Tschannen	5879d611f5	Merge pull request #5899 from sfc-gh-jfu/jfu-fix-ss-segfault Fix rare segfault that can occur when SS terminates while an actor that uses the StorageServer is still on the stack	2021-11-18 09:57:11 -08:00
sfc-gh-tclinkenbeard	2613ec7561	Expand use of fmt to get rid of %ld usage	2021-11-17 17:03:32 -08:00
sfc-gh-tclinkenbeard	07349869d9	Use fmt to address -Wformat warnings	2021-11-17 14:45:48 -08:00
sfc-gh-tclinkenbeard	766a05d33c	Merge remote-tracking branch 'origin/master' into add-format-warning	2021-11-17 12:14:01 -08:00
Steve Atherton	3caca74ac2	Merge commit 'fd707c6d7ee80de6d9fda5796da2d0add10abd79' into bit-flipping-workload	2021-11-16 21:54:27 -08:00
negoyal	867ad8be46	Merge branch 'bit-flipping-workload' of github.com:sfc-gh-ngoyal/foundationdb into bit-flipping-workload	2021-11-16 18:12:18 -08:00
Steve Atherton	6e43dde613	Fixed bad merge resolution.	2021-11-16 18:04:37 -08:00
Renxuan Wang	725d31e264	Fix parameters passes to simulatedMachine().	2021-11-16 17:50:12 -08:00
negoyal	ce112e1f23	Enhance trace event.	2021-11-16 17:39:59 -08:00
Jon Fu	8f6934c4d0	Merge branch 'master' of github.com:apple/foundationdb into jfu-fix-ss-segfault	2021-11-16 18:03:32 -05:00
Evan Tschannen	557186ed17	Merge pull request #5909 from sfc-gh-jfu/jfu-cc-request-dbinfo Change dbinfo broadcast to be explicitly requested by the worker registration message	2021-11-16 15:01:42 -08:00
Steve Atherton	035e0d6e52	Merge branch 'master' into bit-flipping-workload	2021-11-16 14:42:22 -08:00
Jingyu Zhou	8d6cfcb630	Merge pull request #6003 from sfc-gh-etschannen/fix-forwarding fix: coordinators would process forwarding requests before making them durable	2021-11-16 14:24:21 -08:00
Evan Tschannen	35bce4cd36	added a comment	2021-11-16 13:07:35 -08:00
Evan Tschannen	fd635432c4	fix: coordinators would process forwarding requests before making them durable	2021-11-16 12:21:26 -08:00
Steve Atherton	21c3c585ca	Make file name parameter more user friendly in unit tests.	2021-11-16 04:03:11 -08:00
Steve Atherton	867999a41a	Rename wrong_format_version to unsupported_format_version.	2021-11-16 03:25:54 -08:00
Steve Atherton	7b29804a5e	Fix typo.	2021-11-16 02:30:57 -08:00
Steve Atherton	c53f5aa110	Renamed redwood to redwood-1-experimental and file extension to .redwood-v1.	2021-11-16 02:15:22 -08:00
Evan Tschannen	964d0209ca	Merge pull request #5637 from sfc-gh-ljoswiak/features/data-loss-prevention Data loss protection when joining new cluster	2021-11-15 15:26:32 -08:00
Jingyu Zhou	3d26d2372b	Merge pull request #5932 from sfc-gh-ahusain/ahusain-improveTesterLogging Improve tester actor logging to track workload run & check status	2021-11-15 13:28:55 -08:00
Jingyu Zhou	02d0c43bc2	Merge pull request #5982 from sfc-gh-tclinkenbeard/improve-error-descriptions Make snapshot errors more descriptive	2021-11-15 13:18:19 -08:00
Evan Tschannen	6e81b83924	fix: cleanup change feeds which have been completely removed from a storage server	2021-11-15 11:47:42 -08:00
Evan Tschannen	94a51e57a5	Merge branch 'master' into feature-changefeed-empty-versions # Conflicts: # fdbclient/StorageServerInterface.h	2021-11-14 19:13:05 -08:00
Evan Tschannen	6909754b21	changefeeds now have a whenAtLeast function for efficiently learning when the version has updated but no mutations have been committed	2021-11-14 19:08:46 -08:00
sfc-gh-tclinkenbeard	0ba77ea79b	Fix proxySnapCreate trace typo	2021-11-14 16:12:28 -08:00
Tao Lin	9422b8e5f2	Restricted getRangeAndFlatMap to snapshot	2021-11-12 15:12:37 -08:00
Steve Atherton	508429f30d	Redwood chunked file growth and low priority IO starvation prevention (#5936 ) * Redwood files now growth in large page chunks controlled by a knob to reduce truncate() calls for expansion. PriorityMultiLock has limit on consecutive same-priority lock release. Increased Redwood max priority level to 3 for more separation at higher BTree levels. * Simulation fix, don't mark certain IO timeout errors as injected unless the simulated process has been set to have an unreliable disk. * Pager writes now truncate gradually upward, one chunk at a time, in response to writes, which wait on only the necessary truncate operations. Increased buggified chunk size because truncate can be very slow in simulation. * In simulation, ioTimeoutError() and ioDegradedOrTimeoutError() will wait until at least the target timeout interval past the point when simulation is sped up. * PriorityMultiLock::toString() prints more info and is now public. * Added queued time to PriorityMultiLock. * Bug fix to handle when speedUpSimulation changes later than the configured time. * Refactored mutation application in leaf nodes to do fewer comparisons and do in place value updates if the new value is the same size as the old value. * Renamed updatingInPlace to updatingDeltaTree for clarity. Inlined switchToLinearMerge() since it is only used in one place. * Updated extendToCover to be more clear by passing in the old extension future as a parameter. Fixed initialization warning.	2021-11-12 13:47:07 -08:00
sfc-gh-tclinkenbeard	62efeb6812	Merge remote-tracking branch 'origin/master' into add-format-warning	2021-11-12 11:50:36 -08:00
Tao Lin	08fd69e787	Merge pull request #5967 from nblintao/flatmap-exception	2021-11-12 10:19:44 -08:00
Daniel Smith	019fd50f46	Merge pull request #5971 from Daniel-B-Smith/rocksdb-error-logging Clean up RocksDB error logging	2021-11-12 13:17:10 -05:00
Ata E Husain Bohra	82c3e8bf79	Trigger buildTeam operation if server transition from unhealthy -> healthy (#5930 ) * Trigger buildTeam operation if server transition from unhealthy -> healthy DataDistribution actor helps in building teams as server count changes (add/removal), however, it is possible that total_healthy_server count is insufficient to allow team formation. If happens, even healthy server count recover, the buildTeam operation will not be triggered. Patch proposal is to trigger `checkBuildTeam` operation if server transitions from unhealthy -> healthy state. Incase system already has created enough teams (desiredTeamCount/maxTeamCount), the operation incurs a very minimal cost.	2021-11-12 09:41:01 -08:00
Tao Lin	2d5f924278	GetKeyValuesAndFlatMap should return error if not retriable	2021-11-12 09:35:28 -08:00
Daniel Smith	9dccb0131e	Clean up RocksDB error logging	2021-11-12 12:14:12 -05:00
He Liu	d73d2144fd	Adjust distributorSplitRange order.	2021-11-11 20:28:55 -08:00
He Liu	984bc0fbea	Added Endpoints.	2021-11-11 16:56:04 -08:00
Josh Slocum	329091e14f	Merge branch 'master' into bg_bindings	2021-11-11 10:13:37 -06:00
Josh Slocum	b8ac4213a1	Switched BG APIs to transaction instead of database	2021-11-11 08:59:06 -06:00
Markus Pilman	0dfb72176e	Merge pull request #5857 from sfc-gh-vgasiunas/notify-client-lib-changes A mechanism to notify MVC about relevant client library status changes on the cluster	2021-11-11 07:43:20 -07:00
Lukas Joswiak	e4c3f886da	Fix recovery issue	2021-11-10 16:15:13 -08:00
Lukas Joswiak	28b72550f3	Remove additional unused tracing	2021-11-10 13:33:49 -08:00
Lukas Joswiak	c93052121f	Fix issue where transaction spans would not be recorded	2021-11-10 13:33:49 -08:00
Daniel Smith	481cf9bb55	Merge pull request #5788 from Daniel-B-Smith/rocks-throttle Throttle the number of concurrent reads to RocksDB	2021-11-10 15:20:30 -05:00
Daniel Smith	499dbcdb18	Don't fail fetchKeys when server overloaded is returned	2021-11-10 14:15:42 -05:00
Andrew Noyes	db3c08c7cd	Merge pull request #5928 from sfc-gh-anoyes/anoyes/fix-heap-use-after-free Fix a heap use after free	2021-11-10 10:21:05 -08:00

1 2 3 4 5 ...

8060 Commits