foundationdb

Commit Graph

Author	SHA1	Message	Date
Meng Xu	599fcb2e6d	Add serverTeamRemover to remove redundant server teams	2019-07-02 17:40:37 -07:00
Evan Tschannen	b9a6271375	local ratekeeper no longer globally limits	2019-06-28 16:54:22 -07:00
Evan Tschannen	18d5fbf1e0	Avoid jumping from rejecting 0% of requests directly to 20% of requests	2019-06-28 16:54:22 -07:00
Evan Tschannen	db413c37f7	restored the STORAGE_DURABILITY_LAG_SOFT_MAX knob and made the rk target slightly smaller than the soft limit, to avoid inaccuracies in ratekeeper control causing behavior changes on the storage servers	2019-06-28 16:54:22 -07:00
Evan Tschannen	92b32855ca	ratekeeper’s control algorithm would oscillate when limited by local ratekeeper	2019-06-28 16:54:22 -07:00
A.J. Beamon	35b6277a50	Fix knob copy paste error	2019-06-27 12:55:39 -07:00
Alex Miller	61901effed	Increase how long FDB will wait before starting DD to repair data loss. 10s is a bit short for starting data distribution, which is rather expensive. 60s is a bit more reasonable.	2019-06-19 13:40:21 -07:00
Evan Tschannen	20e3edeb0a	Merge branch 'release-6.1' # Conflicts: # documentation/sphinx/source/release-notes.rst # fdbserver/storageserver.actor.cpp # versions.target	2019-06-14 12:42:59 -07:00
Evan Tschannen	924f92e5aa	Prevent the byte sample recovery from interfering with storage server recovery	2019-06-13 15:55:25 -07:00
Evan Tschannen	054d775343	increase the delay between idle commits to reduce the rate idle clusters fsync	2019-06-13 14:55:37 -07:00
Trevor Clinkenbeard	8144882d7b	Merge branch 'apple-master' into features/local-rk	2019-06-10 19:40:25 -07:00
Evan Tschannen	29b96414e2	Merge branch 'release-6.1' # Conflicts: # documentation/sphinx/source/release-notes.rst # fdbclient/NativeAPI.actor.cpp # fdbserver/Coordination.actor.cpp # flow/Arena.h # versions.target	2019-06-03 18:49:35 -07:00
Evan Tschannen	7c333dbc16	If a process receives a message in its clusterControllerInterface before becoming the cluster controller, if the process does not become the cluster controller in the next minute it should destroy the interface to prevent a memory leak.	2019-05-29 16:57:13 -07:00
sramamoorthy	31b6c86650	ignorePopDeadline to have high limit in simulator - ignorePopDeadline to have highier limit in simulator to accommdate for the buggify delays and make snapshot succeed. - introduce a new knob for auto resetting the disabling of tlog pop	2019-05-28 22:07:46 -07:00
A.J. Beamon	603721e125	Merge branch 'master' into thread-safe-random-number-generation # Conflicts: # fdbclient/ManagementAPI.actor.cpp # fdbrpc/AsyncFileCached.actor.h # fdbrpc/genericactors.actor.cpp # fdbrpc/sim2.actor.cpp # fdbserver/DiskQueue.actor.cpp # fdbserver/workloads/BulkSetup.actor.h # flow/ActorCollection.actor.cpp # flow/Net2.actor.cpp # flow/Trace.cpp # flow/flow.cpp	2019-05-23 08:35:47 -07:00
Evan Tschannen	8c3516951a	Merge branch 'release-6.1' # Conflicts: # documentation/sphinx/source/release-notes.rst # versions.target	2019-05-12 20:13:49 -07:00
Alex Miller	ea12a54946	Rename DISK_QUEUE_MAX_TRUNCATE_EXTENTS -> ..._BYTES So as to not make filesystem assumptions. This knob did technically appear in (only the) 6.1.5 release, but this feature was broken 6.1.5, so thus impossible to use anyway.	2019-05-10 18:26:22 -10:00
A.J. Beamon	5f55f3f613	Replace g_random and g_nondeterministic_random with functions deterministicRandom() and nondeterministicRandom() that return thread_local random number generators. Delete g_debug_random and trace_random. Allow only deterministicRandom() to be seeded, and require it to be seeded from each thread on which it is used.	2019-05-10 14:01:52 -07:00
Evan Tschannen	22499666d0	Merge branch 'release-6.1' # Conflicts: # documentation/sphinx/source/release-notes.rst # fdbserver/LogRouter.actor.cpp # flow/Trace.cpp # versions.target	2019-05-08 18:19:35 -07:00
Alex Miller	0685e6c1c7	Avoid large truncates in the DiskQueue. And instead create a new file while incrementally truncating the old one down. This avoids queueing up a massive number of filesystem metadata operations in one call, thus flooding the disk with requests and stalling out all other filesystem operations. This sets the knobs so that a truncate of >10GB causes us to create a new file rather than trying to truncate the old one.	2019-05-08 12:33:31 -10:00
Alex Miller	4052f3826a	Add a knob to limit the number of commits indexed per key. Theoretically, we could spill 20MB of 22B mutations for one key, which would generate a very long value being stored in SQLite, and very inefficiently read back. This stops that from being a problem, at the cost of some extra write calls.	2019-05-03 15:27:10 -07:00
Alex Miller	f4e48c3851	Add a knob to limit amount of data read from sqlite for one PeekRequest. This prevents peeking from degrading over time if there are a very large number of SpilledData entries for one particular tag.	2019-05-02 17:26:45 -07:00
Evan Tschannen	2d5043c665	Merge branch 'release-6.1' # Conflicts: # documentation/sphinx/source/release-notes.rst # versions.target	2019-04-30 18:27:04 -07:00
Evan Tschannen	1a4c1759a4	Merge pull request #1429 from jzhou77/pprof Dump heap profiler when memory usage is high	2019-04-29 16:31:44 -07:00
Evan Tschannen	cacd82758e	Reduced data distribution speeds	2019-04-26 13:54:49 -07:00
Evan Tschannen	9ff8aca1da	Increased the SQLITE_CHUNK_SIZE to 100MB (left at 4MB for simulation)	2019-04-26 13:53:56 -07:00
A.J. Beamon	253d2400ef	Merge branch 'release-6.1' into speed-up-and-parameterize-spring-cleaning # Conflicts: # documentation/sphinx/source/release-notes.rst	2019-04-23 14:38:52 -07:00
A.J. Beamon	4ad0496b39	Increase the frequency that lazy deletes are run. Add more parameters for better control over the spring cleaning process.	2019-04-23 14:01:51 -07:00
Stephen Atherton	83db547306	Implemented the chunk size and db size hint fileControl options in our SQLite VFS implementation. KeyValueStoreSQLite now sets file chunk size based on a new knob, SQLITE_CHUNK_SIZE_PAGES.	2019-04-23 04:50:58 -07:00
Jingyu Zhou	6870e132b2	Merge branch 'master' into pprof	2019-04-19 14:06:44 -07:00
Andrew Noyes	d1e86779a6	Address review comments	2019-04-18 08:48:27 -07:00
mpilman	32393ec4c9	Prototype of local ratekeeper	2019-04-08 11:04:44 -07:00
Evan Tschannen	05869a8383	do not log a degraded reset message if the previous reset was more than a week ago	2019-04-07 23:00:58 -07:00
Jingyu Zhou	4b08042a88	Change memory profiling threshold to a flag	2019-04-05 16:33:51 -07:00
Jingyu Zhou	09b2c35d11	Dump heap profiler when memory usage is high Set the threshold of dump to 2GB.	2019-04-05 16:12:23 -07:00
Evan Tschannen	390ab9cfed	A process will mark itself as degraded if it continually disconnects from a different process which the failure monitor thinks is healthy	2019-04-04 14:11:12 -07:00
A.J. Beamon	71e2fdafb8	Changes to ratekeeper camel case	2019-03-27 08:24:25 -07:00
Evan Tschannen	6254a1a8e4	fix: restarting the provisional proxy causes all tlog peeks to restart, so if tlog peeks take longer than 1 second this could end in an infinite loop	2019-03-22 18:37:39 -07:00
A.J. Beamon	2d7b48dadc	Merge pull request #1311 from etschannen/feature-increase-grv-batch Increased the GRV client batch size	2019-03-19 08:23:05 -07:00
Evan Tschannen	2554fed965	reduce max transaction to start	2019-03-18 16:16:03 -07:00
Evan Tschannen	87e2a1a029	The proxy budget is implemented to let one request over its limit through, and then pay back what was over the limit in the next update	2019-03-18 16:09:57 -07:00
Alex Miller	29ab7370cd	Clear versionLocation when spilling, and pop DQ separately. Popping the disk queue now requires potentially recovering the location to which we can pop from the spilled data itself, and for each tag we must maintain the first location with relevant data. The previous queue we had to represent the ordering, queueOrder, was used by spilling, and popped when a TLog had been spilled. This means that as soon as a TLog has been fully spilled, we have no idea how it relates in order to other fully spilled TLogs. Instead, use queueOrder to keep track of all the TLog UIDs until they're removed, and use spillOrder to keep track of the order only for spilling.	2019-03-18 15:09:22 -07:00
Evan Tschannen	ec6c843124	increased the GRV client batch size, similarly increased the proxy limits related to the number of transactions started in a batch	2019-03-16 16:18:58 -07:00
Evan Tschannen	e068c478b5	merge master	2019-03-12 18:31:25 -07:00
Evan Tschannen	c6e94293bf	reset a process to not be degraded after 2 days	2019-03-10 22:39:21 -07:00
Evan Tschannen	53f16b5347	when a tlog queue commit takes longer than 5 seconds, its process is marked as degraded	2019-03-08 11:46:34 -05:00
Jingyu Zhou	3c86643822	Separate Ratekeeper from data distribution. Add a new role for ratekeeper. Remove StorageServerChanges from data distribution. Ratekeeper monitors storage servers, which borrows the idea from DataDistribution.	2019-03-07 13:16:20 -08:00
Alex Miller	94bf75cb00	Allow the disk queue to shrink if it has unneeded slack space.	2019-03-04 01:42:38 -08:00
Alex Miller	9ef283d4e7	Implement hard limiting of memory used to serve peek requests.	2019-03-04 01:42:38 -08:00
Alex Miller	e7d8520c63	Batch more when spilling data.	2019-03-04 01:42:38 -08:00
Trevor Clinkenbeard	39f612d132	Merge branch 'master' of https://github.com/apple/foundationdb into add-health-metrics	2019-03-02 17:07:00 -08:00
A.J. Beamon	a051055caf	Initial implementation of adding separate limits for batch priority in ratekeeper	2019-02-27 10:31:56 -08:00
Trevor Clinkenbeard	abfe057805	Merge branch 'master' of https://github.com/apple/foundationdb into add-health-metrics	2019-02-25 13:47:16 -08:00
Evan Tschannen	b8910ba7cd	Merge branch 'master' into feature-fix-force-recovery # Conflicts: # fdbclient/ManagementAPI.actor.h # fdbserver/DataDistribution.actor.cpp # fdbserver/storageserver.actor.cpp # fdbserver/workloads/KillRegion.actor.cpp	2019-02-22 14:38:13 -08:00
Meng Xu	9445ac0b0c	Status: Use new data distributor worker to publish status After we add a new data distributor role, we publish the data related to data distributor and rate keeper through the new role (and new worker). So the status needs to contact the data distributor, instead of master, to get the status information.	2019-02-21 18:05:50 -08:00
Meng Xu	7cca439e00	TeamRemover: Add status to show redundant team removing Distinguish the removal of unhealthy team and redundant team. Change status report to include redundant team removal report.	2019-02-21 14:16:46 -08:00
Trevor Clinkenbeard	fa96b8dd33	Merge branch 'master' of https://github.com/apple/foundationdb into add-health-metrics	2019-02-20 16:56:16 -08:00
Meng Xu	d86ba0e811	TeamRemover: Change it to run periodically This simplifies the problem of when we should invoke the teamRemover	2019-02-20 16:08:34 -08:00
Evan Tschannen	27e3617548	fix: remove bad teams needed to use dd_stall_check delay, because in simulation the buggified delay time could make us remove bad teams before they submit their ranges to the queue	2019-02-20 14:18:36 -08:00
Evan Tschannen	d4737fac0f	knobify force recovery recovery check delay	2019-02-19 16:05:20 -08:00
Evan Tschannen	065a45e05f	Merge branch 'master' into feature-fix-force-recovery # Conflicts: # fdbclient/ManagementAPI.actor.cpp # fdbserver/ClusterController.actor.cpp # fdbserver/workloads/KillRegion.actor.cpp	2019-02-18 17:09:06 -08:00
Evan Tschannen	d492395f84	fix: simulation could buggify a delay such that data distribution incorrectly thinks the queue is not processing unhealthy relocations	2019-02-18 14:57:07 -08:00
Meng Xu	6d09ac483c	Merge with master	2019-02-15 17:03:40 -08:00
Jingyu Zhou	fc3a784963	Fix another build team bug The buildTeam() can create teams with undesired storage servers, which are considered unhealthy. As a result, the data movement can become stuck. Fix this by adding an ACTOR monitorHealthyTeams that builds team every one second whenever there is no healthy teams. Clean up storageServerTracker() interface.	2019-02-14 16:37:16 -08:00
Jingyu Zhou	816f8b1ae1	Per review comments Add a knob for starting distributor delay. Move distributor failed variable to a local loop.	2019-02-14 16:37:16 -08:00
Jingyu Zhou	e0a7162cf8	Add a failure timeout knob for data distributor. Set default time to 1.0s.	2019-02-14 16:37:16 -08:00
Meng Xu	5481851e82	TeamCollection: Add knobs for team remover Added three knobs to control team remover bool TR_FLAG_DISABLE_TEAM_REMOVER: Disable the teamRemover actor double TR_REMOVE_MACHINE_TEAM_DELAY: Wait for the specified time before try to remove next machine team double TR_WAIT_FOR_ALL_MACHINES_HEALTHY_DELAY: Wait before checking if all machines are healthy	2019-02-13 15:11:56 -08:00
Meng Xu	214a72fba3	TeamCollection: Resolve review comments 1) Reduce the frequency of checking if we need to call teamRemover 2) Improve code efficiency in finding the machine team to remove 3) Remove unused code 4) Add sanity check	2019-02-12 10:59:57 -08:00
Meng Xu	3b8ae0fe95	TeamCollection: Add into 6.1 release note	2019-02-08 13:50:27 -08:00
Meng Xu	7cfe6de27e	TeamCollection: Server team number must match machine team number DESIRED_TEAMS_PER_MACHINE must equal to DESIRED_TEAMS_PER_SERVER. Otherwise, we may have to few machine teams to create enough server teams. Note that BUGGIFY macro value is based on a random number generator. When you have two BUGGIFY, one may be true and the other is false. Also fix a bug in get the number of healthy machine teams.	2019-02-07 13:53:55 -08:00
Meng Xu	455024b3fe	SimulationTest: Test the number of teams Magnify the possibility that the number of created machine teams is larger than the number of desired machine teams if we do NOT try to remove the surplus machine teams. This help test the upgrade to machine team in FDB 6.1	2019-02-06 11:04:41 -08:00
Trevor Clinkenbeard	5822bd65bf	Track health metrics in Ratekeeper and send these metrics to proxies in GetRateInfoReply messages	2019-01-31 12:56:58 -08:00
Trevor Clinkenbeard	d7930af2cb	Storage server periodically calculates cpuUsage and diskUsage metrics. These metrics (as well as all other metrics necessary for health metrics calculation) are sent in the StorageQueuingMetricsReply message.	2019-01-31 12:23:04 -08:00
Trevor Clinkenbeard	5b89db811a	Throttle status requests with MAX_STATUS_REQUESTS_PER_SECOND knob, whenever status batching is used.	2019-01-28 15:37:30 -08:00
Evan Tschannen	684a22a52b	Merge branch 'release-6.0' # Conflicts: # documentation/sphinx/source/release-notes.rst # fdbbackup/backup.actor.cpp # fdbclient/BackupContainer.actor.cpp # fdbclient/HTTP.actor.cpp # fdbserver/storageserver.actor.cpp # fdbserver/workloads/BackupCorrectness.actor.cpp # versions.target	2019-01-09 16:14:46 -08:00
Evan Tschannen	57293a2db0	byte sample recovery did not use limits for its range reads, leading to slow tasks	2019-01-04 10:32:31 -08:00
Markus Pilman	dbe9baff1f	Several small compilation fixes for new versions of gcc There are several missing includes for cmath in the code, I added those. Next, Coro returns a reference to a stack variable and this causes a warning. As this is probably ok for Coro, I disabled the warning in that file for GCC. I want to have this warning in the build system as it is generally a very useful warning to have. Another change is that major and minor are deprecated for a while now. I replaced those with gnu_dev_major and gnu_dev_minor. ErrorOr currently implements operators ==, !=, and <. These do not compile because Error does not implement ==. This compiles on older versions of gcc and clang because ErrorOr<T>::operator== is not used anywhere. It is still wrong though and newer gcc versions complain. I simply removed these methods. The most interesting fix is that TraceEvent::~TraceEvent is currently throwing exceptions. This is illegal behavior in C++11 and a idea in older versions of C++. For now I simply removed the throw, but this might need some more thought.	2019-01-03 12:44:19 -08:00
Evan Tschannen	4e54690005	Merge branch 'release-6.0' # Conflicts: # fdbserver/DataDistribution.actor.cpp # fdbserver/MoveKeys.actor.cpp	2018-11-12 20:26:58 -08:00
Evan Tschannen	3f3a562f75	updated resolution balancing knobs to be a little more aggressive	2018-11-12 19:11:28 -08:00
Evan Tschannen	3f39024640	buggify resolution balancing so that it still happens in simulation	2018-11-12 00:03:07 -08:00
Evan Tschannen	536ee826da	tuned resolver balancing to keep the resolvers within 5MB per second of each other	2018-11-11 23:42:45 -08:00
Robert Escriva	268093a96d	Adjust all includes to be relative to the root. Remove the use of relative paths. A header at foo/bar.h could be included by files under foo/ with "bar.h", but would be included everywhere else as "foo/bar.h". Adjust so that every include references such a header with the latter form. Signed-off-by: Robert Escriva <rescriva@dropbox.com>	2018-10-19 17:35:33 +00:00
Evan Tschannen	db71b60d72	Merge pull request #819 from satherton/feature-redwood Redwood storage engine, initial/experimental version	2018-10-18 18:38:11 -07:00
Evan Tschannen	0613a34845	The storage server would block the main thread when processing a single version with a large amount of data	2018-10-18 13:37:31 -07:00
Stephen Atherton	7c1dc305cb	Merge commit 'a72c8f5cb2e79a673abc0ed3d27ef1c51028fb13' into feature-redwood	2018-10-05 10:15:10 -07:00
Evan Tschannen	636420abee	fix: if the disk queue adapter peek hangs for a while, switch to a peek from a different locality	2018-10-03 13:58:55 -07:00
Evan Tschannen	28545e0f8d	multi cursors start a get more for the first 10 cursors to hide latency	2018-10-03 13:57:45 -07:00
Evan Tschannen	a92fc911ac	do not spin on a failed storage server recruitment	2018-10-02 17:31:07 -07:00
Stephen Atherton	2fc86c5ff3	Merge branch 'master' of github.com:apple/foundationdb into feature-redwood # Conflicts: # fdbrpc/AsyncFileCached.actor.h # fdbserver/IKeyValueStore.h # fdbserver/KeyValueStoreMemory.actor.cpp # fdbserver/workloads/StatusWorkload.actor.cpp # tests/fast/SidebandWithStatus.txt # tests/rare/LargeApiCorrectnessStatus.txt # tests/slow/DDBalanceAndRemoveStatus.txt	2018-09-20 03:39:55 -07:00
Evan Tschannen	200e65fe61	added a workload which tests killing an entire region, and recovering from the failure with data loss. fix: we cannot pop the txs tag from remote logs until they have a full copy of the txnStateStore fix: we have to modify all of history, we cannot stop after finding a local remote	2018-09-17 18:32:39 -07:00
Evan Tschannen	6496a6d9c8	fix: start move keys will only move destination servers to become source servers if less than destination servers are healthy and the total number of sources is less than 2x the number of destinations	2018-08-31 12:43:14 -07:00
Evan Tschannen	d7c01f0419	added a separate knob for tlog’s recoverMemoryLimit	2018-08-21 21:11:23 -07:00
Evan Tschannen	7f7755165c	slowly send notifications to clients to clear the list of dead clients	2018-08-08 17:29:32 -07:00
Evan Tschannen	be1a4d74c7	tlogs serve reads to log routers at a low priority, to prevent them from using all their resources catching up a remote dc that has been down for a long time increase the amount of memory ratekeeper budgets for tlogs so that there is a gap after the spill threshold to prevent temporarily overshooting the budget	2018-08-04 10:31:30 -07:00
Stephen Atherton	40762d9f9b	Merge branch 'master' of github.com:apple/foundationdb into feature-redwood	2018-07-25 17:58:52 -07:00
Evan Tschannen	4fedd05506	added more yields to avoid slow tasks	2018-07-12 17:47:35 -07:00
Evan Tschannen	392c73affb	fixed a few slow tasks	2018-07-12 14:06:59 -07:00
Evan Tschannen	cd63c7a7cc	added a buffered cursor, which efficiently merges lots of peek cursors	2018-07-12 12:09:48 -07:00
Stephen Atherton	96389c74cd	Merge branch 'master' of github.com:apple/foundationdb into feature-redwood # Conflicts: # tests/fast/SidebandWithStatus.txt # tests/rare/LargeApiCorrectnessStatus.txt # tests/slow/DDBalanceAndRemoveStatus.txt	2018-07-10 16:42:34 -07:00
Stephen Atherton	1bc95862b7	Merge branch 'release-6.0' of github.com:apple/foundationdb into feature-redwood # Conflicts: # tests/fast/SidebandWithStatus.txt # tests/rare/LargeApiCorrectnessStatus.txt # tests/slow/DDBalanceAndRemoveStatus.txt	2018-07-10 04:16:02 -07:00

1 2 3 4 5

205 Commits