Commit Graph

145 Commits

Author SHA1 Message Date
Trevor Clinkenbeard 8144882d7b Merge branch 'apple-master' into features/local-rk 2019-06-10 19:40:25 -07:00
Evan Tschannen 29b96414e2 Merge branch 'release-6.1'
# Conflicts:
#	documentation/sphinx/source/release-notes.rst
#	fdbclient/NativeAPI.actor.cpp
#	fdbserver/Coordination.actor.cpp
#	flow/Arena.h
#	versions.target
2019-06-03 18:49:35 -07:00
Evan Tschannen 7c333dbc16 If a process receives a message in its clusterControllerInterface before becoming the cluster controller and does not become the cluster controller within the next minute, it should destroy the interface to prevent a memory leak. 2019-05-29 16:57:13 -07:00
sramamoorthy 31b6c86650 ignorePopDeadline to have high limit in simulator
- ignorePopDeadline to have higher limit in simulator
to accommodate the buggify delays and make snapshot succeed.

- introduce a new knob for auto resetting the disabling of tlog pop
2019-05-28 22:07:46 -07:00
A.J. Beamon 603721e125 Merge branch 'master' into thread-safe-random-number-generation
# Conflicts:
#	fdbclient/ManagementAPI.actor.cpp
#	fdbrpc/AsyncFileCached.actor.h
#	fdbrpc/genericactors.actor.cpp
#	fdbrpc/sim2.actor.cpp
#	fdbserver/DiskQueue.actor.cpp
#	fdbserver/workloads/BulkSetup.actor.h
#	flow/ActorCollection.actor.cpp
#	flow/Net2.actor.cpp
#	flow/Trace.cpp
#	flow/flow.cpp
2019-05-23 08:35:47 -07:00
Evan Tschannen 8c3516951a Merge branch 'release-6.1'
# Conflicts:
#	documentation/sphinx/source/release-notes.rst
#	versions.target
2019-05-12 20:13:49 -07:00
Alex Miller ea12a54946 Rename DISK_QUEUE_MAX_TRUNCATE_EXTENTS -> ..._BYTES
So as to not make filesystem assumptions.  This knob did technically
appear in (only the) 6.1.5 release, but this feature was broken in 6.1.5,
and thus impossible to use anyway.
2019-05-10 18:26:22 -10:00
A.J. Beamon 5f55f3f613 Replace g_random and g_nondeterministic_random with functions deterministicRandom() and nondeterministicRandom() that return thread_local random number generators. Delete g_debug_random and trace_random. Allow only deterministicRandom() to be seeded, and require it to be seeded from each thread on which it is used. 2019-05-10 14:01:52 -07:00
Evan Tschannen 22499666d0 Merge branch 'release-6.1'
# Conflicts:
#	documentation/sphinx/source/release-notes.rst
#	fdbserver/LogRouter.actor.cpp
#	flow/Trace.cpp
#	versions.target
2019-05-08 18:19:35 -07:00
Alex Miller 0685e6c1c7 Avoid large truncates in the DiskQueue.
And instead create a new file while incrementally truncating the old one
down.  This avoids queueing up a massive number of filesystem metadata
operations in one call, thus flooding the disk with requests and
stalling out all other filesystem operations.

This sets the knobs so that a truncate of >10GB causes us to create a
new file rather than trying to truncate the old one.
2019-05-08 12:33:31 -10:00
Alex Miller 4052f3826a Add a knob to limit the number of commits indexed per key.
Theoretically, we could spill 20MB of 22B mutations for one key, which
would generate a very long value being stored in SQLite, and very
inefficiently read back.  This stops that from being a problem, at the
cost of some extra write calls.
2019-05-03 15:27:10 -07:00
Alex Miller f4e48c3851 Add a knob to limit amount of data read from sqlite for one PeekRequest.
This prevents peeking from degrading over time if there are a very large
number of SpilledData entries for one particular tag.
2019-05-02 17:26:45 -07:00
Evan Tschannen 2d5043c665 Merge branch 'release-6.1'
# Conflicts:
#	documentation/sphinx/source/release-notes.rst
#	versions.target
2019-04-30 18:27:04 -07:00
Evan Tschannen 1a4c1759a4 Merge pull request #1429 from jzhou77/pprof
Dump heap profiler when memory usage is high
2019-04-29 16:31:44 -07:00
Evan Tschannen cacd82758e Reduced data distribution speeds 2019-04-26 13:54:49 -07:00
Evan Tschannen 9ff8aca1da Increased the SQLITE_CHUNK_SIZE to 100MB (left at 4MB for simulation) 2019-04-26 13:53:56 -07:00
A.J. Beamon 253d2400ef Merge branch 'release-6.1' into speed-up-and-parameterize-spring-cleaning
# Conflicts:
#	documentation/sphinx/source/release-notes.rst
2019-04-23 14:38:52 -07:00
A.J. Beamon 4ad0496b39 Increase the frequency at which lazy deletes are run. Add more parameters for better control over the spring cleaning process. 2019-04-23 14:01:51 -07:00
Stephen Atherton 83db547306 Implemented the chunk size and db size hint fileControl options in our SQLite VFS implementation. KeyValueStoreSQLite now sets file chunk size based on a new knob, SQLITE_CHUNK_SIZE_PAGES. 2019-04-23 04:50:58 -07:00
Jingyu Zhou 6870e132b2 Merge branch 'master' into pprof 2019-04-19 14:06:44 -07:00
Andrew Noyes d1e86779a6 Address review comments 2019-04-18 08:48:27 -07:00
mpilman 32393ec4c9 Prototype of local ratekeeper 2019-04-08 11:04:44 -07:00
Evan Tschannen 05869a8383 do not log a degraded reset message if the previous reset was more than a week ago 2019-04-07 23:00:58 -07:00
Jingyu Zhou 4b08042a88 Change memory profiling threshold to a flag 2019-04-05 16:33:51 -07:00
Jingyu Zhou 09b2c35d11 Dump heap profiler when memory usage is high
Set the threshold of dump to 2GB.
2019-04-05 16:12:23 -07:00
Evan Tschannen 390ab9cfed A process will mark itself as degraded if it continually disconnects from a different process which the failure monitor thinks is healthy 2019-04-04 14:11:12 -07:00
A.J. Beamon 71e2fdafb8 Changes to ratekeeper camel case 2019-03-27 08:24:25 -07:00
Evan Tschannen 6254a1a8e4 fix: restarting the provisional proxy causes all tlog peeks to restart, so if tlog peeks take longer than 1 second this could end in an infinite loop 2019-03-22 18:37:39 -07:00
A.J. Beamon 2d7b48dadc Merge pull request #1311 from etschannen/feature-increase-grv-batch
Increased the GRV client batch size
2019-03-19 08:23:05 -07:00
Evan Tschannen 2554fed965 reduce max transaction to start 2019-03-18 16:16:03 -07:00
Evan Tschannen 87e2a1a029 The proxy budget is implemented to let one request over its limit through, and then pay back what was over the limit in the next update 2019-03-18 16:09:57 -07:00
Alex Miller 29ab7370cd Clear versionLocation when spilling, and pop DQ separately.
Popping the disk queue now requires potentially recovering the location
to which we can pop from the spilled data itself, and for each tag we
must maintain the first location with relevant data.

The previous queue we had to represent the ordering, queueOrder, was
used by spilling, and popped when a TLog had been spilled.  This means
that as soon as a TLog has been fully spilled, we have no idea how it
relates in order to other fully spilled TLogs.

Instead, use queueOrder to keep track of all the TLog UIDs until they're
removed, and use spillOrder to keep track of the order only for
spilling.
2019-03-18 15:09:22 -07:00
Evan Tschannen ec6c843124 increased the GRV client batch size, similarly increased the proxy limits related to the number of transactions started in a batch 2019-03-16 16:18:58 -07:00
Evan Tschannen e068c478b5 merge master 2019-03-12 18:31:25 -07:00
Evan Tschannen c6e94293bf reset a process to not be degraded after 2 days 2019-03-10 22:39:21 -07:00
Evan Tschannen 53f16b5347 when a tlog queue commit takes longer than 5 seconds, its process is marked as degraded 2019-03-08 11:46:34 -05:00
Jingyu Zhou 3c86643822 Separate Ratekeeper from data distribution.
Add a new role for ratekeeper.

Remove StorageServerChanges from data distribution.
Ratekeeper monitors storage servers, which borrows the idea from
DataDistribution.
2019-03-07 13:16:20 -08:00
Alex Miller 94bf75cb00 Allow the disk queue to shrink if it has unneeded slack space. 2019-03-04 01:42:38 -08:00
Alex Miller 9ef283d4e7 Implement hard limiting of memory used to serve peek requests. 2019-03-04 01:42:38 -08:00
Alex Miller e7d8520c63 Batch more when spilling data. 2019-03-04 01:42:38 -08:00
Trevor Clinkenbeard 39f612d132 Merge branch 'master' of https://github.com/apple/foundationdb into add-health-metrics 2019-03-02 17:07:00 -08:00
A.J. Beamon a051055caf Initial implementation of adding separate limits for batch priority in ratekeeper 2019-02-27 10:31:56 -08:00
Trevor Clinkenbeard abfe057805 Merge branch 'master' of https://github.com/apple/foundationdb into add-health-metrics 2019-02-25 13:47:16 -08:00
Evan Tschannen b8910ba7cd Merge branch 'master' into feature-fix-force-recovery
# Conflicts:
#	fdbclient/ManagementAPI.actor.h
#	fdbserver/DataDistribution.actor.cpp
#	fdbserver/storageserver.actor.cpp
#	fdbserver/workloads/KillRegion.actor.cpp
2019-02-22 14:38:13 -08:00
Meng Xu 9445ac0b0c Status: Use new data distributor worker to publish status
After we add a new data distributor role, we publish the data
related to data distributor and rate keeper through the new
role (and new worker).

So the status needs to contact the data distributor, instead of master,
to get the status information.
2019-02-21 18:05:50 -08:00
Meng Xu 7cca439e00 TeamRemover: Add status to show redundant team removing
Distinguish the removal of unhealthy teams from the removal of redundant teams.
Change status report to include redundant team removal report.
2019-02-21 14:16:46 -08:00
Trevor Clinkenbeard fa96b8dd33 Merge branch 'master' of https://github.com/apple/foundationdb into add-health-metrics 2019-02-20 16:56:16 -08:00
Meng Xu d86ba0e811 TeamRemover: Change it to run periodically
This simplifies the problem of when we should invoke the teamRemover
2019-02-20 16:08:34 -08:00
Evan Tschannen 27e3617548 fix: remove bad teams needed to use dd_stall_check delay, because in simulation the buggified delay time could make us remove bad teams before they submit their ranges to the queue 2019-02-20 14:18:36 -08:00
Evan Tschannen d4737fac0f knobify force recovery recovery check delay 2019-02-19 16:05:20 -08:00