Commit Graph

205 Commits

Author SHA1 Message Date
Meng Xu d2fd1f4931 DD:MisconfiguredLocality:Fix review comments 2019-09-17 13:04:21 -07:00
Meng Xu 37d2318eed DD:Handle worker with incorrect locality
When a worker has incorrect locality, the worker will be excluded from
storage recruitment.
When the worker has its locality corrected by system operators,
the worker will be reincluded for storage recruitment.
2019-09-14 12:12:56 -07:00
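A minimal sketch of the exclusion logic this commit describes, with hypothetical names (Locality, eligibleForStorageRecruitment); the real check lives in data distribution and is guarded by the DD_VALIDATE_LOCALITY knob from the commit below:

```cpp
#include <map>
#include <set>
#include <string>

// Hypothetical sketch: a worker is eligible for storage recruitment only if
// its locality supplies every key the replication policy requires.
struct Locality {
    std::map<std::string, std::string> data; // e.g. {"zoneid": "z1", "dcid": "dc1"}
    bool isValid(const std::set<std::string>& requiredKeys) const {
        for (const auto& key : requiredKeys)
            if (data.find(key) == data.end())
                return false; // missing locality entry => misconfigured
        return true;
    }
};

bool eligibleForStorageRecruitment(const Locality& workerLocality,
                                   const std::set<std::string>& policyKeys) {
    // Re-evaluated on each recruitment pass, so a worker whose locality is
    // corrected by an operator is automatically re-included.
    return workerLocality.isValid(policyKeys);
}
```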
Meng Xu 78b8e48cef DD:ValidLocality:Resolve review comment 2019-09-13 15:35:16 -07:00
Meng Xu 3ad7e3adb3 DD:DD_VALIDATE_LOCALITY:Guard the checking of locality validity 2019-09-13 13:19:35 -07:00
Evan Tschannen 00424a5108 changed the rate at which the coordinators register with the cluster controller and the clients register with the coordinator so that the connected client number in status will be much more accurate 2019-08-21 15:02:09 -07:00
Evan Tschannen 41b908752e increased move keys parallelism so that it is less of a decrease, just in case lowering it could affect normal data distribution
raised target durability lag versions to give more time for batch limiting to come into play before this limit is hit
changed max_bad_options to better reflect the name
2019-08-21 14:55:21 -07:00
Evan Tschannen 37e2fc86de Increase the target durability lag versions to be larger than the soft max, so that storage servers will respond with a penalty to clients before ratekeeper controls on the lag 2019-08-19 14:03:42 -07:00
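A sketch of the threshold ordering this commit relies on; the function names and the penalty value are assumptions for illustration, not the server's actual code:

```cpp
// Hypothetical sketch: because the durability lag target is set above the
// soft max, storage servers start penalizing clients (slowing them via load
// balancing) before ratekeeper begins limiting on the same lag.
double storageServerPenalty(long long durabilityLag, long long softMaxVersions) {
    return durabilityLag > softMaxVersions ? 2.0 : 1.0; // penalty multiplier, assumed
}

bool ratekeeperLimitsOnLag(long long durabilityLag, long long targetVersions) {
    return durabilityLag > targetVersions; // targetVersions > softMaxVersions
}
```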
Evan Tschannen 9318b494ad reduce the DD move keys parallelism to avoid a hot read shard when transitioning from triple replication to double replication 2019-08-19 14:02:18 -07:00
Evan Tschannen 9382a58390 fix: after a forced recovery it is possible to not have logs from all generations, so only wait at most a second for getting a popped txs version 2019-08-06 16:32:28 -07:00
Evan Tschannen 7d7aa27c2d
Merge pull request #1814 from dongxinEric/feature/1508/finer-grained-dd-controls
Added finer grained controls to DataDistribution in fdbcli.
2019-07-31 17:36:20 -07:00
Evan Tschannen a0b29ff82f updated knobs to allow more batch priority traffic 2019-07-31 17:19:41 -07:00
Evan Tschannen 4308ff86f7 increased the MAX_TEAMS_PER_SERVER 2019-07-31 16:08:18 -07:00
Xin Dong b653ddb30d Final clean ups after rebasing master 2019-07-30 22:35:34 -07:00
Xin Dong cda70700cc Address review comments. Ran 50K correctness tests with no failures. 2019-07-30 22:24:30 -07:00
Evan Tschannen 6dbaddd0a7 Added a knob to always use CAUSAL_READ_RISKY for GRV 2019-07-30 18:21:46 -07:00
Evan Tschannen 5dd9043fd3 addressed review comments 2019-07-30 17:04:41 -07:00
A.J. Beamon 41605735f5
Merge pull request #1916 from ajbeamon/merge-onto-new-servers
Add knob to control whether merges request new servers or not.
2019-07-30 15:04:37 -07:00
A.J. Beamon bc536757df Add knob to control whether merges request new servers or not. Set the default to request new servers in \xff but not in main key space. 2019-07-29 15:47:34 -07:00
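A sketch of the described default; the tri-state knob encoding (0 = never, 1 = always, 2 = system keyspace only) is an assumption for illustration, not the actual knob:

```cpp
#include <string>

// Hypothetical sketch: when two shards merge, request a fresh team of servers
// only for system keyspace shards (keys under "\xff"), per the default above.
bool mergeRequestsNewServers(const std::string& shardBeginKey, int mergeOntoNewTeamKnob) {
    bool systemKeyspace =
        !shardBeginKey.empty() && (unsigned char)shardBeginKey[0] == 0xff;
    if (mergeOntoNewTeamKnob == 1) return true;           // always
    if (mergeOntoNewTeamKnob == 2) return systemKeyspace; // \xff keyspace only
    return false;                                         // never
}
```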
Evan Tschannen d8b14fe372 we cannot buggify replace content bytes because it takes too long to recover when the txnStateStore is too large 2019-07-28 19:34:17 -07:00
Evan Tschannen 5c98dcce6d revert the proxy forwarding path, because it is no longer necessary as clients keep a persistent connection open with coordinators 2019-07-27 16:46:22 -07:00
Evan Tschannen b509a441e7 Merge branch 'master' into feature-skip-confirm
# Conflicts:
#	bindings/flow/tester/Tester.actor.cpp
#	bindings/go/src/_stacktester/stacktester.go
#	bindings/java/src/test/com/apple/foundationdb/test/AsyncStackTester.java
#	bindings/java/src/test/com/apple/foundationdb/test/StackTester.java
#	bindings/python/tests/tester.py
#	bindings/ruby/tests/tester.rb
#	documentation/sphinx/source/api-c.rst
#	documentation/sphinx/source/api-python.rst
#	documentation/sphinx/source/api-ruby.rst
#	documentation/sphinx/source/data-modeling.rst
#	documentation/sphinx/source/developer-guide.rst
#	fdbclient/vexillographer/fdb.options
#	fdbserver/MasterProxyServer.actor.cpp
2019-07-27 15:08:13 -07:00
Evan Tschannen ee94e8a062 removed a trace event which was causing valgrind errors 2019-07-27 13:51:59 -07:00
Evan Tschannen 90e3b50213 Merge branch 'master' into feature-coordinator-connection
# Conflicts:
#	fdbclient/DatabaseContext.h
#	fdbclient/NativeAPI.actor.cpp
#	fdbclient/NativeAPI.actor.h
#	fdbserver/workloads/KillRegion.actor.cpp
2019-07-26 15:05:02 -07:00
Evan Tschannen ee92f0574f fix: lastRequestTime was not updated
fix: COORDINATOR_REGISTER_INTERVAL was not set
fixed review comments
2019-07-26 13:23:56 -07:00
sramamoorthy a65c9f92ed get rid of all timeouts and other changes 2019-07-24 15:36:28 -07:00
sramamoorthy 7e04e3c8be snap v2: knobs for max snap create timeout 2019-07-24 15:36:28 -07:00
Evan Tschannen c70e762f0e
Merge pull request #1785 from xumengpanda/mengxu/server-team-remover-PR
Remove redundant server teams
2019-07-19 17:44:16 -07:00
Meng Xu b001a9ebe8 ServerTeamRemover runs after machineTeamRemover finishes
If serverTeamRemover removes a team before machineTeamRemover brings
the machine team number down to the desired number, DD may create a new
team (due to teams removed by serverTeamRemover), which may be removed
later by machineTeamRemover. This causes unnecessary extra data movement.
2019-07-19 16:48:52 -07:00
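A toy model of the ordering constraint described above; the types and counts are invented for illustration, and the real code uses flow actors rather than plain functions:

```cpp
#include <cstdio>

struct Cluster {
    int machineTeams = 12, desiredMachineTeams = 8;
    int serverTeams = 40, desiredServerTeams = 30;
};

void removeExtraMachineTeams(Cluster& c) {
    while (c.machineTeams > c.desiredMachineTeams) --c.machineTeams;
}
void removeExtraServerTeams(Cluster& c) {
    while (c.serverTeams > c.desiredServerTeams) --c.serverTeams;
}

int main() {
    Cluster c;
    // Strict ordering: trimming server teams before the machine team count
    // converges lets DD rebuild teams that the machine-team remover then
    // deletes again, causing needless data movement.
    removeExtraMachineTeams(c);
    removeExtraServerTeams(c);
    std::printf("machine teams: %d, server teams: %d\n", c.machineTeams, c.serverTeams);
}
```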
Evan Tschannen 846038b0e6
Merge pull request #1858 from bnamasivayam/rk-ssfetch-throttle
Ratekeeper throttling aggressively when unable to fetch storage server list
2019-07-19 16:41:58 -07:00
Alex Miller c3a8ae4752
Merge pull request #1791 from fzhjon/fetch-keys-requests-priority
Introduce priority to fetchKeys requests from data distribution
2019-07-19 14:54:51 -07:00
Balachandar Namasivayam ecb3de3b49 Fixed space issue. 2019-07-17 18:10:05 -07:00
Balachandar Namasivayam 406bcebdc4 Ratekeeper to throttle tpsLimit to 1 if it is not able to fetch the storage server list for some configurable amount of time. 2019-07-17 18:08:17 -07:00
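A sketch of the fallback described in this commit, with assumed names for the staleness knob and timestamps:

```cpp
#include <algorithm>

// Hypothetical sketch: if ratekeeper has not been able to refresh the storage
// server list for longer than a configurable window, clamp tpsLimit to 1
// rather than running open-loop on stale information.
double applyStorageListStaleness(double tpsLimit,
                                 double now,
                                 double lastServerListFetch,
                                 double maxStaleness /* knob, assumed */) {
    if (now - lastServerListFetch > maxStaleness)
        return std::min(tpsLimit, 1.0);
    return tpsLimit;
}
```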
Meng Xu 20f067e794 Merge with master:Resolve conflict with PR#1797 2019-07-16 10:52:28 -07:00
Meng Xu 415622f465 MachineTeamRemover:Change to remove MT with most teams
Change to remove machine team with most machine teams, using the same
logic as the serverTeamRemover.

The feature is guarded by the TR_FLAG_REMOVE_MT_WITH_MOST_TEAMS knob.
2019-07-15 14:29:49 -07:00
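A sketch of the selection heuristic, assuming a hypothetical MachineTeam struct; in the real code the count would be derived from team membership rather than stored:

```cpp
#include <algorithm>
#include <vector>

// Hypothetical sketch: among machine teams, remove the one whose member
// machines belong to the most machine teams overall, mirroring the
// serverTeamRemover's heuristic.
struct MachineTeam {
    int id;
    int memberTeamCount; // how many machine teams its members belong to, summed
};

int pickMachineTeamToRemove(const std::vector<MachineTeam>& teams) {
    auto it = std::max_element(teams.begin(), teams.end(),
                               [](const MachineTeam& a, const MachineTeam& b) {
                                   return a.memberTeamCount < b.memberTeamCount;
                               });
    return it == teams.end() ? -1 : it->id;
}
```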
Evan Tschannen db5b4a6331 avoid going to unlimited immediately after going below the durabilityLagTargetVersion 2019-07-12 18:50:56 -07:00
Evan Tschannen 1a18c859c7 knobified the durability lag rate controls 2019-07-12 18:50:56 -07:00
Evan Tschannen 02de53160d only skip confirm epoch live if CAUSAL_READ_RISKY is enabled
time checked on the proxy should be less than the time waited by the master to account for clock speed differences
setting REQUIRED_MIN_RECOVERY_DURATION and ENFORCED_MIN_RECOVERY_DURATION to 0 will go back to the old behavior
2019-07-12 17:58:16 -07:00
Evan Tschannen a63969afb3 enforce a minimum recovery duration, which allows a proxy to avoid checking if the epoch is alive as long as its last commit occurred less than MINIMUM_RECOVERY_DURATION ago 2019-07-12 13:10:21 -07:00
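A sketch combining the two commits above: the proxy-side check, where the proxy's window is deliberately shorter than the duration the master is forced to wait so that clock speed differences stay safe. The names mirror the knobs in the messages, but the function itself is illustrative:

```cpp
// Hypothetical sketch: a proxy may skip the "confirm epoch live" round trip
// (only when CAUSAL_READ_RISKY is enabled) if its last successful commit is
// recent enough. The proxy compares against REQUIRED_MIN_RECOVERY_DURATION,
// which must be less than the master's ENFORCED_MIN_RECOVERY_DURATION.
bool canSkipConfirmEpochLive(bool causalReadRisky,
                             double now,
                             double lastCommitTime,
                             double requiredMinRecoveryDuration /* proxy side */) {
    return causalReadRisky &&
           (now - lastCommitTime) < requiredMinRecoveryDuration;
}
```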
Jon Fu f12a3909f3 renamed workloads and made code style adjustments 2019-07-11 09:56:58 -07:00
Jon Fu 1e9d31597c removed extra parameter from getRange, added knob to guard new changes, and adjusted style/formatting in several places 2019-07-11 09:56:58 -07:00
Evan Tschannen 7e919e361c
Merge pull request #1817 from etschannen/feature-proxy-forward
Proxies will forward clients to the next generation
2019-07-10 13:53:12 -07:00
Evan Tschannen 49121172ea
Merge pull request #1795 from alexmiller-apple/peek-from-satellites
Log Routers will prefer to peek from satellite logs.
2019-07-09 17:38:57 -07:00
Evan Tschannen 001abec29d fixed a compiler error, buggified a new knob 2019-07-09 16:50:59 -07:00
Evan Tschannen 64aee73c4f we only need to hold the ReplyPromise for messages that we are going to forward to new proxies 2019-07-09 16:47:56 -07:00
Alex Miller 44f11702a8 Log Routers will prefer to peek from satellite logs.
Formerly, they would prefer to peek from the primary's logs.  Testing of
a failed region rejoining the cluster revealed that this becomes quite a
strain on the primary logs when extremely large volumes of peek requests
are coming from the Log Routers.  It happens that we have satellites
that contain the same mutations with Log Router tags, that have no other
peeking load, so we can prefer to use the satellite to peek rather than
the primary to distribute load across TLogs better.

Unfortunately, this revealed a latent bug in how tagged mutations in the
KnownCommittedVersion->RecoveryVersion gap were copied across
generations when the number of log router tags was decreased.
Satellite TLogs would be assigned log router tags using the
team-building based logic in getPushLocations(), whereas TLogs would
internally re-index tags according to tag.id%logRouterTags.  This
mismatch would mean that we could have:

    Log0 -2:0 ----- -2:0  Log 0

    Log1 -2:1 \
               >--- -2:1,-2:0 (-2:2 mod 2 becomes -2:0)  Log 1
    Log2 -2:2 /

And now we have data that's tagged as -2:0 on a TLog that's not the
preferred location for -2:0, and therefore a BestLocationOnly cursor
would miss the mutations.

This was never noticed before, as we never
used a satellite as a preferred location to peek from.  Merge cursors
always peek from all locations, and thus a peek for -2:0 that needed
data from the satellites would have gone to both TLogs and merged the
results.

We now take this mod-based re-indexing into account when assigning which
TLogs need to recover which tags from the previous generation, to make
sure that tag.id%logRouterTags always results in the assigned TLog being
the preferred location.

Unfortunately, previously existing clusters will potentially have
satellites with log router tags indexed incorrectly, so this transition
needs to be gated on a `log_version` transition.  Old LogSets will have
an old LogVersion, and we won't prefer the satellite for peeking.  Log
Sets post-6.2 (opt-in) or post-6.3 (default) will be indexed correctly,
and therefore we can safely offload peeking onto the satellites.
2019-07-08 22:25:01 -07:00
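A small demonstration of the mod-based re-indexing the commit describes, reproducing the diagram's outcome when logRouterTags shrinks from 3 to 2 (the function name is illustrative):

```cpp
#include <cstdio>

// Hypothetical sketch: under the fix, the TLog assigned to recover a tag from
// the previous generation is chosen so that tag.id % logRouterTags is the
// assigned TLog, i.e. the preferred location a single-location peek will ask.
int preferredLogIndex(int tagId, int logRouterTags) {
    return tagId % logRouterTags;
}

int main() {
    const int oldTags = 3, newTags = 2;
    for (int id = 0; id < oldTags; ++id)
        std::printf("old tag -2:%d -> recover onto log %d\n",
                    id, preferredLogIndex(id, newTags));
    // Old tag -2:2 lands on log 0 (2 % 2 == 0), the preferred location for
    // -2:0, so a BestLocationOnly cursor will find its mutations.
}
```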
Meng Xu 3b9618fe11 ServerTeamRemover:Speedup removing teams in simulation
Otherwise, simulation may time out when team remover needs to
remove hundreds of teams.
2019-07-08 18:17:21 -07:00
Meng Xu 08a721b320 Merge branch 'master' into mengxu/server-team-remover-PR 2019-07-08 16:30:32 -07:00
Evan Tschannen c348b3da51 After a proxy dies, it will remain alive for an additional 10 seconds to forward clients to the new proxies 2019-07-08 12:53:40 -07:00
Evan Tschannen 310a5fe9a3 fix: we cannot reject 100% of requests, because a storage server which is stuck needs to get a future version error to trigger an all alternatives failed message from load balance so that clients will re-grab storage server interfaces from the proxy 2019-07-05 17:28:22 -07:00
Evan Tschannen e7c0ecf729 fix: we cannot reject 100% of requests, because a storage server which is stuck needs to get a future version error to trigger an all alternatives failed message from load balance so that clients will re-grab storage server interfaces from the proxy 2019-07-05 15:46:16 -07:00
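A sketch of the capped rejection rate these two commits describe; the 0.99 cap and the names are assumptions for illustration:

```cpp
#include <algorithm>
#include <cstdlib>

// Hypothetical sketch: cap the rejection probability below 1.0 so that even a
// stuck storage server still answers an occasional request with a
// future_version error. That error is what makes load balancing report "all
// alternatives failed" and drives clients to re-grab storage server
// interfaces from the proxy.
bool shouldRejectRequest(double desiredRejectRate) {
    const double kMaxRejectRate = 0.99; // never reject 100% of requests
    double rate = std::min(desiredRejectRate, kMaxRejectRate);
    return (std::rand() / (double)RAND_MAX) < rate;
}
```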