Evan Tschannen
636420abee
fix: if the disk queue adapter peek hangs for a while, switch to a peek from a different locality
2018-10-03 13:58:55 -07:00
Evan Tschannen
28545e0f8d
multi cursors start a get more for the first 10 cursors to hide latency
2018-10-03 13:57:45 -07:00
Evan Tschannen
200e65fe61
added a workload which tests killing an entire region, and recovering from the failure with data loss.
...
fix: we cannot pop the txs tag from remote logs until they have a full copy of the txnStateStore
fix: we have to modify all of history, we cannot stop after finding a local remote
2018-09-17 18:32:39 -07:00
Evan Tschannen
6496a6d9c8
fix: start move keys will only move destination servers to become source servers if less than destination servers are healthy and the total number of sources is less than 2x the number of destinations
2018-08-31 12:43:14 -07:00
Evan Tschannen
d7c01f0419
added a separate knob for tlog’s recoverMemoryLimit
2018-08-21 21:11:23 -07:00
Evan Tschannen
7f7755165c
slowly send notifications to clients to clear the list of dead clients
2018-08-08 17:29:32 -07:00
Evan Tschannen
be1a4d74c7
tlogs serve reads to log routers at a low priority, to prevent them from using all their resources catching up a remote dc that has been down for a long time
...
increase the amount of memory ratekeeper budgets for tlogs so that there is a gap after the spill threshold to prevent temporarily overshooting the budget
2018-08-04 10:31:30 -07:00
Evan Tschannen
cd63c7a7cc
added a buffered cursor, which efficiently merges lots of peek cursors
2018-07-12 12:09:48 -07:00
Evan Tschannen
5a2cb3037b
merge 5.2 into 6.0
2018-07-08 20:14:06 -07:00
Evan Tschannen
d6c6e7d306
fix: do not attempt data movement to an unhealthy destination team
...
allow building more teams than desired if all teams are unhealthy
bestTeamStuck is an error in simulation again
2018-07-07 16:51:16 -07:00
Evan Tschannen
cdafd542ee
fix: fixed a memory leak where leaderInfo notifications are not cleared out
2018-07-06 17:40:29 -07:00
Evan Tschannen
7a12d3e130
added the (untested) ability to force a recovery to the remote datacenter, even if that results in data loss. If the DR lag is more than 1 week there could be potential data corruption if any primary storage servers are still alive.
2018-07-01 09:39:04 -04:00
Evan Tschannen
7e68bee692
update better machine classes first to give them a higher chance of becoming the next cluster controller
2018-06-29 01:11:59 -07:00
Evan Tschannen
58c2f67ff6
checking outstanding requests can be CPU intensive, so rate limit checking requests
2018-06-27 23:02:08 -07:00
Evan Tschannen
5fc8199abc
Swapped OkayFit and UnsetFit, because generally if machine classes are set on one machine they are set everywhere and it helps with wait_for_good_recruitment logic
...
wait_for_good_recruitment now requires that you have the desired count of each roll
remote recruitment is given a much longer wait_for_good_recruitment time interval, which does not start until enough remote machines have registered
2018-06-22 10:15:24 -07:00
Evan Tschannen
678b4494f4
added logging for the datacenter version difference
2018-06-21 16:31:52 -07:00
Evan Tschannen
b79feaddd3
added a hard memory limit to the TLog to prevent it from running out of memory. Because remote logs are not ratekeeper controlled this is their only protection
2018-06-18 17:22:40 -07:00
Evan Tschannen
889889323e
The master will tell the cluster controller if it is going to take a long time to recruit new logs in its DC; the cluster controller can determine if the other DC would be better and recruit there.
...
The cluster controller will not switch to the other data center if remote logs are too far behind.
We will not recruit in DCs with negative priority.
2018-06-13 18:14:14 -07:00
Evan Tschannen
372ed67497
Merge branch 'master' into feature-remote-logs
...
# Conflicts:
# fdbserver/DataDistribution.actor.cpp
# fdbserver/MasterProxyServer.actor.cpp
# fdbserver/TLogServer.actor.cpp
# fdbserver/TagPartitionedLogSystem.actor.cpp
2018-06-11 11:34:10 -07:00
Evan Tschannen
4903df5ce9
fix: give time to detect failed servers before building teams
2018-06-10 20:21:39 -07:00
Balachandar Namasivayam
529d0497f1
Proxy going OOM when applying high volumes of writes to a proxy, particular in a sudden fashion before ratekeeper can control the workload.
...
Address this issue by proactively monitoring the memory used by commit batches and dropping requests if a certain memory limit is exceeded.
2018-06-01 15:21:40 -07:00
Evan Tschannen
12ef63b698
knobify replace contents bytes
2018-05-01 19:43:35 -07:00
Evan Tschannen
9c8cb445d6
optimized the tlog to use a vector for tags instead of a map
2018-03-17 10:36:19 -07:00
Evan Tschannen
ccd70fd005
The tlog uses the tags embedded in the message instead of a separate vector of locations
...
optimized remote tlog committing to avoid re-serializing the message
2018-03-16 16:47:05 -07:00
A.J. Beamon
f2c804e14f
Reverting changes from merge of master into release-5.2 ( b25810711c
). Note that we never intend to release master into release-5.2, but if we did we would need to revert this commit.
2018-03-06 10:15:04 -08:00
Evan Tschannen
37a6a81634
Merge commit '7f6fc3e039c911cd84b8540f7f799fc38a1c1822' into feature-remote-logs
...
# Conflicts:
# fdbserver/workloads/RestartRecovery.actor.cpp
2018-02-23 12:33:28 -08:00
Alec Grieser
0bae9880f1
remove trailing whitespace from our copyright headers ; fixed formatting of python setup.py
2018-02-21 10:25:11 -08:00
Evan Tschannen
c7b3be5b19
re-enabled better master exists
...
the cluster controller can choose a better data center for itself and let the workers know where the next cluster controller should be recruited
2018-02-09 16:48:55 -08:00
A.J. Beamon
bb1297c686
Remove RkServerQueueInfo and RkTLogQueueInfo trace events, since this information is more or less already logged on the storage servers and tlogs. Update the quiet database check and magnesium to use the information from the logs and storage servers.
2017-11-14 12:59:42 -08:00
Evan Tschannen
3a4078bdda
the keyservers shards are always a fixed large size
2017-10-27 11:52:11 -07:00
Alex Miller
f997cb9038
Add a string knob to hold the Log directory, and write profiles to it.
...
This is the combination of two small changes.
1. Add support for a string knob type.
2. Change profiles to be written to the log directory instead of the working
directory.
We have three options of where to write files: the working directory, the data
directory, and the log directory.
The working directory may be set to a non-writable location, and likely
contains the fdb binaries. Allowing these files to be overwritten would likely
not be a wise idea.
The data directory hosts our sqlite b-trees. It would also be very unfortunate
if these were ever overwritten by an unfortunate profile name.
The log directory contains logs. Out of the three, these matter the least if
they disappear or become corrupted.
Thus, we write to the log directory.
2017-10-16 16:05:02 -07:00
Bhaskar Muppana
0bf5bdb23a
<rdar://problem/34557380> Need a way to map real time to version
2017-09-25 12:51:37 -07:00
Evan Tschannen
4bdcd8fc12
Merge branch 'release-4.6' into release-5.0
...
# Conflicts:
# bindings/bindingtester/run_binding_tester.sh
# fdbrpc/AsyncFileKAIO.actor.h
2017-06-14 16:43:53 -07:00
FDB Dev Team
a674cb4ef4
Initial repository commit
2017-05-25 13:48:44 -07:00