Commit Graph

291 Commits

Author SHA1 Message Date
A.J. Beamon 603721e125 Merge branch 'master' into thread-safe-random-number-generation
# Conflicts:
#	fdbclient/ManagementAPI.actor.cpp
#	fdbrpc/AsyncFileCached.actor.h
#	fdbrpc/genericactors.actor.cpp
#	fdbrpc/sim2.actor.cpp
#	fdbserver/DiskQueue.actor.cpp
#	fdbserver/workloads/BulkSetup.actor.h
#	flow/ActorCollection.actor.cpp
#	flow/Net2.actor.cpp
#	flow/Trace.cpp
#	flow/flow.cpp
2019-05-23 08:35:47 -07:00
Evan Tschannen 8c3516951a Merge branch 'release-6.1'
# Conflicts:
#	documentation/sphinx/source/release-notes.rst
#	versions.target
2019-05-12 20:13:49 -07:00
Alex Miller ea12a54946 Rename DISK_QUEUE_MAX_TRUNCATE_EXTENTS -> ..._BYTES
So as to not make filesystem assumptions.  This knob did technically
appear in (only the) 6.1.5 release, but this feature was broken 6.1.5,
so thus impossible to use anyway.
2019-05-10 18:26:22 -10:00
A.J. Beamon 5f55f3f613 Replace g_random and g_nondeterministic_random with functions deterministicRandom() and nondeterministicRandom() that return thread_local random number generators. Delete g_debug_random and trace_random. Allow only deterministicRandom() to be seeded, and require it to be seeded from each thread on which it is used. 2019-05-10 14:01:52 -07:00
Evan Tschannen 22499666d0 Merge branch 'release-6.1'
# Conflicts:
#	documentation/sphinx/source/release-notes.rst
#	fdbserver/LogRouter.actor.cpp
#	flow/Trace.cpp
#	versions.target
2019-05-08 18:19:35 -07:00
Alex Miller 0685e6c1c7 Avoid large truncates in the DiskQueue.
And instead create a new file while incrementally truncating the old one
down.  This avoids queueing up a massive number of filesystem metadata
operations in one call, thus flooding the disk with requests and
stalling out all other filesystem operations.

This sets the knobs so that a truncate of >10GB causes us to create a
new file rather than trying to truncate the old one.
2019-05-08 12:33:31 -10:00
Alex Miller 4052f3826a Add a knob to limit the number of commits indexed per key.
Theoretically, we could spill 20MB of 22B mutations for one key, which
would generate a very long value being stored in SQLite, and very
inefficiently read back.  This stops that from being a problem, at the
cost of some extra write calls.
2019-05-03 15:27:10 -07:00
Alex Miller f4e48c3851 Add a knob to limit amount of data read from sqlite for one PeekRequest.
This prevents peeking from degrading over time if there are a very large
number of SpilledData entries for one particular tag.
2019-05-02 17:26:45 -07:00
Evan Tschannen 2d5043c665 Merge branch 'release-6.1'
# Conflicts:
#	documentation/sphinx/source/release-notes.rst
#	versions.target
2019-04-30 18:27:04 -07:00
Evan Tschannen 1a4c1759a4
Merge pull request #1429 from jzhou77/pprof
Dump heap profiler when memory usage is high
2019-04-29 16:31:44 -07:00
Evan Tschannen cacd82758e Reduced data distribution speeds 2019-04-26 13:54:49 -07:00
Evan Tschannen 9ff8aca1da Increased the SQLITE_CHUNK_SIZE to 100MB (left at 4MB for simulation) 2019-04-26 13:53:56 -07:00
A.J. Beamon 253d2400ef Merge branch 'release-6.1' into speed-up-and-parameterize-spring-cleaning
# Conflicts:
#	documentation/sphinx/source/release-notes.rst
2019-04-23 14:38:52 -07:00
A.J. Beamon 4ad0496b39 Increase the frequency that lazy deletes are run. Add more parameters for better control over the spring cleaning process. 2019-04-23 14:01:51 -07:00
Stephen Atherton 83db547306 Implemented the chunk size and db size hint fileControl options in our SQLite VFS implementation. KeyValueStoreSQLite now sets file chunk size based on a new knob, SQLITE_CHUNK_SIZE_PAGES. 2019-04-23 04:50:58 -07:00
Jingyu Zhou 6870e132b2
Merge branch 'master' into pprof 2019-04-19 14:06:44 -07:00
Andrew Noyes d1e86779a6 Address review comments 2019-04-18 08:48:27 -07:00
mpilman 32393ec4c9 Prototype of local ratekeeper 2019-04-08 11:04:44 -07:00
Evan Tschannen 05869a8383 do not log a degraded reset message if the previous reset was more than a week ago 2019-04-07 23:00:58 -07:00
Jingyu Zhou 4b08042a88 Change memory profiling threshold to a flag 2019-04-05 16:33:51 -07:00
Jingyu Zhou 09b2c35d11 Dump heap profiler when memory usage is high
Set the threshold of dump to 2GB.
2019-04-05 16:12:23 -07:00
Evan Tschannen 390ab9cfed A process will mark itself as degraded if it continually disconnects from a different process which the failure monitor thinks is healthy 2019-04-04 14:11:12 -07:00
A.J. Beamon 71e2fdafb8 Changes to ratekeeper camel case 2019-03-27 08:24:25 -07:00
Evan Tschannen 6254a1a8e4 fix: restarting the provisional proxy causes all tlog peeks to restart, so if tlog peeks take longer than 1 second this could end in an infinite loop 2019-03-22 18:37:39 -07:00
A.J. Beamon 2d7b48dadc
Merge pull request #1311 from etschannen/feature-increase-grv-batch
Increased the GRV client batch size
2019-03-19 08:23:05 -07:00
Evan Tschannen 2554fed965 reduce max transaction to start 2019-03-18 16:16:03 -07:00
Evan Tschannen 87e2a1a029 The proxy budget is implemented to let one request over its limit through, and then pay back what was over the limit in the next update 2019-03-18 16:09:57 -07:00
Alex Miller 29ab7370cd Clear versionLocation when spilling, and pop DQ separately.
Popping the disk queue now requires potentially recovering the location
to which we can pop from the spilled data itself, and for each tag we
must maintain the first location with relevant data.

The previous queue we had to represent the ordering, queueOrder, was
used by spilling, and popped when a TLog had been spilled.  This means
that as soon as a TLog has been fully spilled, we have no idea how it
relates in order to other fully spilled TLogs.

Instead, use queueOrder to keep track of all the TLog UIDs until they're
removed, and use spillOrder to keep track of the order only for
spilling.
2019-03-18 15:09:22 -07:00
Evan Tschannen ec6c843124 increased the GRV client batch size, similarly increased the proxy limits related to the number of transactions started in a batch 2019-03-16 16:18:58 -07:00
Evan Tschannen e068c478b5 merge master 2019-03-12 18:31:25 -07:00
Evan Tschannen c6e94293bf reset a process to not be degraded after 2 days 2019-03-10 22:39:21 -07:00
Evan Tschannen 53f16b5347 when a tlog queue commit takes longer than 5 seconds, its process is marked as degraded 2019-03-08 11:46:34 -05:00
Jingyu Zhou 3c86643822 Separate Ratekeeper from data distribution.
Add a new role for ratekeeper.

Remove StorageServerChanges from data distribution.
Ratekeeper monitors storage servers, which borrows the idea from
DataDistribution.
2019-03-07 13:16:20 -08:00
Alex Miller 94bf75cb00 Allow the disk queue to shrink if it has unneeded slack space. 2019-03-04 01:42:38 -08:00
Alex Miller 9ef283d4e7 Implement hard limiting of memory used to serve peek requests. 2019-03-04 01:42:38 -08:00
Alex Miller e7d8520c63 Batch more when spilling data. 2019-03-04 01:42:38 -08:00
Trevor Clinkenbeard 39f612d132 Merge branch 'master' of https://github.com/apple/foundationdb into add-health-metrics 2019-03-02 17:07:00 -08:00
A.J. Beamon a051055caf Initial implementation of adding separate limits for batch priority in ratekeeper 2019-02-27 10:31:56 -08:00
Trevor Clinkenbeard abfe057805 Merge branch 'master' of https://github.com/apple/foundationdb into add-health-metrics 2019-02-25 13:47:16 -08:00
Evan Tschannen b8910ba7cd Merge branch 'master' into feature-fix-force-recovery
# Conflicts:
#	fdbclient/ManagementAPI.actor.h
#	fdbserver/DataDistribution.actor.cpp
#	fdbserver/storageserver.actor.cpp
#	fdbserver/workloads/KillRegion.actor.cpp
2019-02-22 14:38:13 -08:00
Meng Xu 9445ac0b0c Status: Use new data distributor worker to publish status
After we add a new data distributor role, we publish the data
related to data distributor and rate keeper through the new
role (and new worker).

So the status needs to contact the data distributor, instead of master,
to get the status information.
2019-02-21 18:05:50 -08:00
Meng Xu 7cca439e00 TeamRemover: Add status to show redundant team removing
Distinguish the removal of unhealthy team and redundant team.
Change status report to include redundant team removal report.
2019-02-21 14:16:46 -08:00
Trevor Clinkenbeard fa96b8dd33 Merge branch 'master' of https://github.com/apple/foundationdb into add-health-metrics 2019-02-20 16:56:16 -08:00
Meng Xu d86ba0e811 TeamRemover: Change it to run periodically
This simplifies the problem of when we should invoke the teamRemover
2019-02-20 16:08:34 -08:00
Evan Tschannen 27e3617548 fix: remove bad teams needed to use dd_stall_check delay, because in simulation the buggified delay time could make us remove bad teams before they submit their ranges to the queue 2019-02-20 14:18:36 -08:00
Evan Tschannen d4737fac0f knobify force recovery recovery check delay 2019-02-19 16:05:20 -08:00
Evan Tschannen 065a45e05f Merge branch 'master' into feature-fix-force-recovery
# Conflicts:
#	fdbclient/ManagementAPI.actor.cpp
#	fdbserver/ClusterController.actor.cpp
#	fdbserver/workloads/KillRegion.actor.cpp
2019-02-18 17:09:06 -08:00
Evan Tschannen d492395f84 fix: simulation could buggify a delay such that data distribution incorrectly thinks the queue is not processing unhealthy relocations 2019-02-18 14:57:07 -08:00
Meng Xu 6d09ac483c Merge with master 2019-02-15 17:03:40 -08:00
Jingyu Zhou fc3a784963 Fix another build team bug
The buildTeam() can create teams with undesired storage servers, which are
considered unhealthy. As a result, the data movement can become stuck.

Fix this by adding an ACTOR monitorHealthyTeams that builds team every one
second whenever there is no healthy teams.

Clean up storageServerTracker() interface.
2019-02-14 16:37:16 -08:00
Jingyu Zhou 816f8b1ae1 Per review comments
Add a knob for starting distributor delay.
Move distributor failed variable to a local loop.
2019-02-14 16:37:16 -08:00
Jingyu Zhou e0a7162cf8 Add a failure timeout knob for data distributor.
Set default time to 1.0s.
2019-02-14 16:37:16 -08:00
Meng Xu 5481851e82 TeamCollection: Add knobs for team remover
Added three knobs to control team remover

bool TR_FLAG_DISABLE_TEAM_REMOVER:
	Disable the teamRemover actor
double TR_REMOVE_MACHINE_TEAM_DELAY:
	Wait for the specified time before try to remove next machine team
double TR_WAIT_FOR_ALL_MACHINES_HEALTHY_DELAY:
	Wait before checking if all machines are healthy
2019-02-13 15:11:56 -08:00
Meng Xu 214a72fba3 TeamCollection: Resolve review comments
1) Reduce the frequency of checking if we need to call teamRemover
2) Improve code efficiency in finding the machine team to remove
3) Remove unused code
4) Add sanity check
2019-02-12 10:59:57 -08:00
Meng Xu 3b8ae0fe95 TeamCollection: Add into 6.1 release note 2019-02-08 13:50:27 -08:00
Meng Xu 7cfe6de27e TeamCollection: Server team number must match machine team number
DESIRED_TEAMS_PER_MACHINE must equal to DESIRED_TEAMS_PER_SERVER.
Otherwise, we may have to few machine teams to create enough server teams.

Note that BUGGIFY macro value is based on a random number generator.
When you have two BUGGIFY, one may be true and the other is false.

Also fix a bug in get the number of healthy machine teams.
2019-02-07 13:53:55 -08:00
Meng Xu 455024b3fe SimulationTest: Test the number of teams
Magnify the possibility that the number of created machine teams is
larger than the number of desired machine teams if we do NOT try to remove the surplus machine teams.
This help test the upgrade to machine team in FDB 6.1
2019-02-06 11:04:41 -08:00
Trevor Clinkenbeard 5822bd65bf Track health metrics in Ratekeeper and send these metrics to proxies in GetRateInfoReply messages 2019-01-31 12:56:58 -08:00
Trevor Clinkenbeard d7930af2cb Storage server periodically calculates cpuUsage and diskUsage metrics. These metrics (as well as all other metrics necessary for health metrics calculation) are sent in the StorageQueuingMetricsReply message. 2019-01-31 12:23:04 -08:00
Trevor Clinkenbeard 5b89db811a Throttle status requests with MAX_STATUS_REQUESTS_PER_SECOND knob, whenever status batching is used. 2019-01-28 15:37:30 -08:00
Evan Tschannen 684a22a52b Merge branch 'release-6.0'
# Conflicts:
#	documentation/sphinx/source/release-notes.rst
#	fdbbackup/backup.actor.cpp
#	fdbclient/BackupContainer.actor.cpp
#	fdbclient/HTTP.actor.cpp
#	fdbserver/storageserver.actor.cpp
#	fdbserver/workloads/BackupCorrectness.actor.cpp
#	versions.target
2019-01-09 16:14:46 -08:00
Evan Tschannen 57293a2db0 byte sample recovery did not use limits for its range reads, leading to slow tasks 2019-01-04 10:32:31 -08:00
Markus Pilman dbe9baff1f Several small compilation fixes for new versions of gcc
There are several missing includes for cmath in the code, I added those.

Next, Coro returns a reference to a stack variable and this causes a
warning. As this is probably ok for Coro, I disabled the warning in
that file for GCC. I want to have this warning in the build system as
it is generally a very useful warning to have.

Another change is that major and minor are deprecated for a while now.
I replaced those with gnu_dev_major and gnu_dev_minor.

ErrorOr currently implements operators ==, !=, and <. These do not
compile because Error does not implement ==. This compiles on older
versions of gcc and clang because ErrorOr<T>::operator== is not used
anywhere. It is still wrong though and newer gcc versions complain.
I simply removed these methods.

The most interesting fix is that TraceEvent::~TraceEvent is currently
throwing exceptions. This is illegal behavior in C++11 and a idea in
older versions of C++. For now I simply removed the throw, but this
might need some more thought.
2019-01-03 12:44:19 -08:00
Evan Tschannen 4e54690005 Merge branch 'release-6.0'
# Conflicts:
#	fdbserver/DataDistribution.actor.cpp
#	fdbserver/MoveKeys.actor.cpp
2018-11-12 20:26:58 -08:00
Evan Tschannen 3f3a562f75 updated resolution balancing knobs to be a little more aggressive 2018-11-12 19:11:28 -08:00
Evan Tschannen 3f39024640 buggify resolution balancing so that it still happens in simulation 2018-11-12 00:03:07 -08:00
Evan Tschannen 536ee826da tuned resolver balancing to keep the resolvers within 5MB per second of each other 2018-11-11 23:42:45 -08:00
Robert Escriva 268093a96d Adjust all includes to be relative to the root.
Remove the use of relative paths.  A header at foo/bar.h could be included by
files under foo/ with "bar.h", but would be included everywhere else as
"foo/bar.h".  Adjust so that every include references such a header with the
latter form.

Signed-off-by: Robert Escriva <rescriva@dropbox.com>
2018-10-19 17:35:33 +00:00
Evan Tschannen db71b60d72
Merge pull request #819 from satherton/feature-redwood
Redwood storage engine, initial/experimental version
2018-10-18 18:38:11 -07:00
Evan Tschannen 0613a34845 The storage server would block the main thread when processing a single version with a large amount of data 2018-10-18 13:37:31 -07:00
Stephen Atherton 7c1dc305cb Merge commit 'a72c8f5cb2e79a673abc0ed3d27ef1c51028fb13' into feature-redwood 2018-10-05 10:15:10 -07:00
Evan Tschannen 636420abee fix: if the disk queue adapter peek hangs for a while, switch to a peek from a different locality 2018-10-03 13:58:55 -07:00
Evan Tschannen 28545e0f8d multi cursors start a get more for the first 10 cursors to hide latency 2018-10-03 13:57:45 -07:00
Evan Tschannen a92fc911ac do not spin on a failed storage server recruitment 2018-10-02 17:31:07 -07:00
Stephen Atherton 2fc86c5ff3 Merge branch 'master' of github.com:apple/foundationdb into feature-redwood
# Conflicts:
#	fdbrpc/AsyncFileCached.actor.h
#	fdbserver/IKeyValueStore.h
#	fdbserver/KeyValueStoreMemory.actor.cpp
#	fdbserver/workloads/StatusWorkload.actor.cpp
#	tests/fast/SidebandWithStatus.txt
#	tests/rare/LargeApiCorrectnessStatus.txt
#	tests/slow/DDBalanceAndRemoveStatus.txt
2018-09-20 03:39:55 -07:00
Evan Tschannen 200e65fe61 added a workload which tests killing an entire region, and recovering from the failure with data loss.
fix: we cannot pop the txs tag from remote logs until they have a full copy of the txnStateStore
fix: we have to modify all of history, we cannot stop after finding a local remote
2018-09-17 18:32:39 -07:00
Evan Tschannen 6496a6d9c8 fix: start move keys will only move destination servers to become source servers if less than destination servers are healthy and the total number of sources is less than 2x the number of destinations 2018-08-31 12:43:14 -07:00
Evan Tschannen d7c01f0419 added a separate knob for tlog’s recoverMemoryLimit 2018-08-21 21:11:23 -07:00
Evan Tschannen 7f7755165c slowly send notifications to clients to clear the list of dead clients 2018-08-08 17:29:32 -07:00
Evan Tschannen be1a4d74c7 tlogs serve reads to log routers at a low priority, to prevent them from using all their resources catching up a remote dc that has been down for a long time
increase the amount of memory ratekeeper budgets for tlogs so that there is a gap after the spill threshold to prevent temporarily overshooting the budget
2018-08-04 10:31:30 -07:00
Stephen Atherton 40762d9f9b Merge branch 'master' of github.com:apple/foundationdb into feature-redwood 2018-07-25 17:58:52 -07:00
Evan Tschannen 4fedd05506 added more yields to avoid slow tasks 2018-07-12 17:47:35 -07:00
Evan Tschannen 392c73affb fixed a few slow tasks 2018-07-12 14:06:59 -07:00
Evan Tschannen cd63c7a7cc added a buffered cursor, which efficiently merges lots of peek cursors 2018-07-12 12:09:48 -07:00
Stephen Atherton 96389c74cd Merge branch 'master' of github.com:apple/foundationdb into feature-redwood
# Conflicts:
#	tests/fast/SidebandWithStatus.txt
#	tests/rare/LargeApiCorrectnessStatus.txt
#	tests/slow/DDBalanceAndRemoveStatus.txt
2018-07-10 16:42:34 -07:00
Stephen Atherton 1bc95862b7 Merge branch 'release-6.0' of github.com:apple/foundationdb into feature-redwood
# Conflicts:
#	tests/fast/SidebandWithStatus.txt
#	tests/rare/LargeApiCorrectnessStatus.txt
#	tests/slow/DDBalanceAndRemoveStatus.txt
2018-07-10 04:16:02 -07:00
Evan Tschannen ed7c744a8e fix: we cannot lower max_read_transaction_life_versions too low when configured with remote logs, because it causes the log routers to fall too far behind 2018-07-09 16:55:33 -07:00
Evan Tschannen 5a2cb3037b merge 5.2 into 6.0 2018-07-08 20:14:06 -07:00
Evan Tschannen 4dd18afb84 fix: we cannot make MAX_RECOVERY_VERSIONS lower than MAX_VERSIONS_IN_FLIGHT because we can mark a recovery as stalled before finishing the recovery, leading to an infinite loop of recoveries 2018-07-07 17:41:20 -07:00
Evan Tschannen d6c6e7d306 fix: do not attempt data movement to an unhealthy destination team
allow building more teams than desired if all teams are unhealthy
bestTeamStuck is an error in simulation again
2018-07-07 16:51:16 -07:00
Evan Tschannen cdafd542ee fix: fixed a memory leak where leaderInfo notifications are not cleared out 2018-07-06 17:40:29 -07:00
Stephen Atherton 2925b9b984 Merge branch 'master' of github.com:apple/foundationdb into feature-redwood 2018-07-03 23:03:56 -07:00
Evan Tschannen 7a12d3e130 added the (untested) ability to force a recovery to the remote datacenter, even if that results in data loss. If the DR lag is more than 1 week there could be potential data corruption if any primary storage servers are still alive. 2018-07-01 09:39:04 -04:00
Stephen Atherton b95a2bd6c1 Merge commit 'b17c8359ec22892ed4daeaa569f2f5e105477251' into feature-redwood
# Conflicts:
#	flow/Trace.cpp
2018-06-30 23:18:29 -07:00
Evan Tschannen 1f02bdee0a do not buggify future version delay, because remote storage servers will be delayed getting data so they need additional time 2018-06-29 11:29:22 -07:00
Evan Tschannen 7e68bee692 update better machine classes first to give them a higher chance of becoming the next cluster controller 2018-06-29 01:11:59 -07:00
Evan Tschannen 58c2f67ff6 checking outstanding requests can be CPU intensive, so rate limit checking requests 2018-06-27 23:02:08 -07:00
Evan Tschannen 5fc8199abc Swapped OkayFit and UnsetFit, because generally if machine classes are set on one machine they are set everywhere and it helps with wait_for_good_recruitment logic
wait_for_good_recruitment now requires that you have the desired count of each roll
remote recruitment is given a much longer wait_for_good_recruitment time interval, which does not start until enough remote machines have registered
2018-06-22 10:15:24 -07:00
Evan Tschannen 678b4494f4 added logging for the datacenter version difference 2018-06-21 16:31:52 -07:00
Stephen Atherton e5c48d453a Merge branch 'master' of github.com:apple/foundationdb into feature-redwood 2018-06-18 22:45:27 -07:00
Evan Tschannen b79feaddd3 added a hard memory limit to the TLog to prevent it from running out of memory. Because remote logs are not ratekeeper controlled this is their only protection 2018-06-18 17:22:40 -07:00
Stephen Atherton 90c8288c68 Merge branch 'master' of github.com:apple/foundationdb into feature-redwood 2018-06-17 14:55:05 -07:00
Evan Tschannen 889889323e The master will tell the cluster controller if it is going to take a long time to recruit new logs in its DC; the cluster controller can determine if the other DC would be better and recruit there.
The cluster controller will not switch to the other data center if remote logs are too far behind.
We will not recruit in DCs with negative priority.
2018-06-13 18:14:14 -07:00
Stephen Atherton 2878f30f29 Merge branch 'master' of github.com:apple/foundationdb into feature-redwood
# Conflicts:
#	fdbserver/IKeyValueStore.h
#	fdbserver/KeyValueStoreMemory.actor.cpp
#	fdbserver/storageserver.actor.cpp
2018-06-13 15:56:06 -07:00
Evan Tschannen 372ed67497 Merge branch 'master' into feature-remote-logs
# Conflicts:
#	fdbserver/DataDistribution.actor.cpp
#	fdbserver/MasterProxyServer.actor.cpp
#	fdbserver/TLogServer.actor.cpp
#	fdbserver/TagPartitionedLogSystem.actor.cpp
2018-06-11 11:34:10 -07:00
Evan Tschannen 4903df5ce9 fix: give time to detect failed servers before building teams 2018-06-10 20:21:39 -07:00
Stephen Atherton 69b713918b VersionedBTree now uses PrefixTree based pages (with bugs). This required significant changes to both classes because the interface and semantics for building, seeking in, and iterating through pages is very different from the previous trivial approach which was based on serialized vectors. PrefixTree node format rewritten to support optional values without increasing overhead for common node scenarios. PrefixTree::Cursor rewritten to reuse path prefix memory instead of allocating new memory on each movement which is then 'leaked' until destruction. PrefixTree::Cursor movement modified to work better with VersionedBTree::InternalCursor, which was also heavily modified. Added knobs related to key arrangement in PrefixTree nodes. Added StringRef::toHexString() as an alternative to printable() to make reading raw PrefixTree data easier. PrefixTree performance is temporarily worse with this update and VersionedBtree fails its unit test. 2018-06-08 03:32:34 -07:00
Balachandar Namasivayam 529d0497f1 Proxy going OOM when applying high volumes of writes to a proxy, particular in a sudden fashion before ratekeeper can control the workload.
Address this issue by proactively monitoring the memory used by commit batches and dropping requests if a certain memory limit is exceeded.
2018-06-01 15:21:40 -07:00
Evan Tschannen e8f6ad88f0 fix: tripled the smallStorageTarget to prevent simulations which do a lot of work from timing out 2018-05-07 17:26:44 -07:00
Evan Tschannen 8cb8198250 fix: the e-brake should be buggified with ratekeeper storage limits to prevent simulation from running full blast into the e-brake resulting in simulation taking forever to complete (joshua timeouts) 2018-05-06 12:33:25 -07:00
Evan Tschannen 12ef63b698 knobify replace contents bytes 2018-05-01 19:43:35 -07:00
Stephen Atherton af61d3596d Merge branch 'public-master' into feature-redwood
# Conflicts:
#	fdbserver/DatabaseConfiguration.cpp
#	fdbserver/OldTLogServer.actor.cpp
#	fdbserver/fdbserver.vcxproj
#	fdbserver/fdbserver.vcxproj.filters
2018-04-24 17:22:21 -07:00
Stephen Atherton 2752a28611 Merge branch 'release-5.2' of github.com:apple/foundationdb into feature-redwood 2018-04-06 16:29:37 -07:00
Evan Tschannen 9c8cb445d6 optimized the tlog to use a vector for tags instead of a map 2018-03-17 10:36:19 -07:00
Evan Tschannen ccd70fd005 The tlog uses the tags embedded in the message instead of a separate vector of locations
optimized remote tlog committing to avoid re-serializing the message
2018-03-16 16:47:05 -07:00
A.J. Beamon f2c804e14f Reverting changes from merge of master into release-5.2 (b25810711c). Note that we never intend to release master into release-5.2, but if we did we would need to revert this commit. 2018-03-06 10:15:04 -08:00
Evan Tschannen 37a6a81634 Merge commit '7f6fc3e039c911cd84b8540f7f799fc38a1c1822' into feature-remote-logs
# Conflicts:
#	fdbserver/workloads/RestartRecovery.actor.cpp
2018-02-23 12:33:28 -08:00
Alec Grieser 0bae9880f1 remove trailing whitespace from our copyright headers ; fixed formatting of python setup.py 2018-02-21 10:25:11 -08:00
Stephen Atherton 0a35f167e4 Merge branch 'master' into feature-redwood
# Conflicts:
#	fdbserver/DiskQueue.actor.cpp
#	fdbserver/IDiskQueue.h
#	fdbserver/Knobs.cpp
#	fdbserver/Knobs.h
#	fdbserver/fdbserver.vcxproj
#	fdbserver/fdbserver.vcxproj.filters
#	fdbserver/worker.actor.cpp
2018-02-12 01:30:02 -08:00
Evan Tschannen 42405c78a5 Merge commit '4038bd2fd968d88861f2cebd442ce511724816cb' into feature-remote-logs
# Conflicts:
#	fdbserver/ClusterController.actor.cpp
#	fdbserver/Knobs.cpp
2018-02-10 12:08:52 -08:00
Evan Tschannen c7b3be5b19 re-enabled better master exists
the cluster controller can choose a better data center for itself and let the workers know where the next cluster controller should be recruited
2018-02-09 16:48:55 -08:00
Evan Tschannen d0caffd339 fix: knob was set to incorrect value 2018-02-06 18:11:45 -08:00
Evan Tschannen b7dde88029 fix: the cluster controller did not consider the master sharing the same process as the cluster controller as bad in all needed locations
waited too long for good recruitment locations, which would add too much time to recoveries of clusters that do not use machine classes
2018-02-06 11:30:05 -08:00
Evan Tschannen 21482a45e1 Merge branch 'master' into feature-remote-logs
# Conflicts:
#	fdbserver/DBCoreState.h
#	fdbserver/LogSystem.h
#	fdbserver/LogSystemPeekCursor.actor.cpp
#	fdbserver/TLogServer.actor.cpp
2018-01-14 13:40:24 -08:00
Evan Tschannen 022df3b91b backup and restore sometimes took too long in simulation 2018-01-09 17:26:42 -08:00
Evan Tschannen 5ac4f73978 Merge branch 'release-5.1' into feature-remote-logs
# Conflicts:
#	fdbclient/NativeAPI.actor.cpp
#	fdbrpc/Locality.h
#	fdbrpc/simulator.h
#	fdbserver/ApplyMetadataMutation.h
#	fdbserver/ClusterController.actor.cpp
#	fdbserver/LogSystemPeekCursor.actor.cpp
#	fdbserver/MasterProxyServer.actor.cpp
#	fdbserver/SimulatedCluster.actor.cpp
#	fdbserver/TLogServer.actor.cpp
#	fdbserver/TagPartitionedLogSystem.actor.cpp
#	fdbserver/WorkerInterface.h
#	fdbserver/masterserver.actor.cpp
#	flow/Net2.actor.cpp
#	tests/fast/SidebandWithStatus.txt
#	tests/rare/LargeApiCorrectnessStatus.txt
#	tests/slow/DDBalanceAndRemoveStatus.txt
2018-01-05 11:33:42 -08:00
A.J. Beamon bb1297c686 Remove RkServerQueueInfo and RkTLogQueueInfo trace events, since this information is more or less already logged on the storage servers and tlogs. Update the quiet database check and magnesium to use the information from the logs and storage servers. 2017-11-14 12:59:42 -08:00
Evan Tschannen 3a4078bdda the keyservers shards are always a fixed large size 2017-10-27 11:52:11 -07:00
Alex Miller f997cb9038 Add a string knob to hold the Log directory, and write profiles to it.
This is the combination of two small changes.

1. Add support for a string knob type.
2. Change profiles to be written to the log directory instead of the working
   directory.

We have three options of where to write files: the working directory, the data
directory, and the log directory.

The working directory may be set to a non-writable location, and likely
contains the fdb binaries.  Allowing these files to be overwritten would likely
not be a wise idea.

The data directory hosts our sqlite b-trees.  It would also be very unfortunate
if these were ever overwritten by an unfortunate profile name.

The log directory contains logs.  Out of the three, these matter the least if
they disappear or become corrupted.

Thus, we write to the log directory.
2017-10-16 16:05:02 -07:00
Evan Tschannen 15962cf079 Merge branch 'master' into feature-remote-logs
# Conflicts:
#	fdbrpc/Locality.cpp
#	fdbrpc/Locality.h
#	fdbserver/ClusterController.actor.cpp
#	fdbserver/ClusterRecruitmentInterface.h
#	fdbserver/TLogServer.actor.cpp
#	fdbserver/TagPartitionedLogSystem.actor.cpp
#	fdbserver/WorkerInterface.h
#	fdbserver/fdbserver.vcxproj.filters
#	fdbserver/masterserver.actor.cpp
#	fdbserver/worker.actor.cpp
#	flow/error_definitions.h
2017-10-05 17:09:44 -07:00
Bhaskar Muppana 0f8ff26029 Merge pull request #158 from bmuppana/master
<rdar://problem/34557380> Need a way to map real time to version
2017-09-27 17:56:42 -07:00
Bhaskar Muppana 6a0b1d6808 Fixing PR comments
<rdar://problem/34557380> Need a way to map real time to version
2017-09-27 17:56:01 -07:00
Bhaskar Muppana 0bf5bdb23a <rdar://problem/34557380> Need a way to map real time to version 2017-09-25 12:51:37 -07:00
Evan Tschannen 489332533c all timeouts longer than two minutes have been can be lowered to 60.0 with buggification
added a workload that tries for a 50 second maximum latency in the presence of one failure with both buggification and connection failures
2017-09-18 11:04:51 -07:00
Stephen Atherton 5d49f2c710 Merge branch 'master' into feature-redwood
# Conflicts:
#	fdbserver/fdbserver.vcxproj
2017-08-28 17:45:50 -07:00
Evan Tschannen c22708b6d6 added tag localities
fix: remote logs need to stop the master when they are stopped
2017-08-03 16:16:36 -07:00
Evan Tschannen 0906250e78 merged everything from feature-remote-logs besides the tlog and tagpartitionedlogsystem
re-included tags in messages to the tlog
previously never committed the LogRouter
2017-06-29 15:50:19 -07:00
Evan Tschannen 69c862ed6e updated release notes for 5.0.1 2017-06-28 16:52:45 -07:00
Evan Tschannen 4bdcd8fc12 Merge branch 'release-4.6' into release-5.0
# Conflicts:
#	bindings/bindingtester/run_binding_tester.sh
#	fdbrpc/AsyncFileKAIO.actor.h
2017-06-14 16:43:53 -07:00
Stephen Atherton b65ad3563c Merge branch 'master' into feature-redwood
# Conflicts:
#	fdbserver/fdbserver.vcxproj
#	fdbserver/fdbserver.vcxproj.filters
2017-06-09 14:56:41 -07:00
FDB Dev Team a674cb4ef4 Initial repository commit 2017-05-25 13:48:44 -07:00