Commit Graph

189 Commits

Author SHA1 Message Date
A.J. Beamon 41605735f5
Merge pull request #1916 from ajbeamon/merge-onto-new-servers
Add knob to control whether merges request new servers or not.
2019-07-30 15:04:37 -07:00
A.J. Beamon bc536757df Add knob to control whether merges request new servers or not. Set the default to request new servers in \xff but not in main key space. 2019-07-29 15:47:34 -07:00
Evan Tschannen d8b14fe372 we cannot buggify replace content bytes because it takes too long to recovery when the txnStateStore is too large 2019-07-28 19:34:17 -07:00
Evan Tschannen 5c98dcce6d revert the proxy forwarding path, because it is no longer necessary as clients keep a persistent connection open with coordinators 2019-07-27 16:46:22 -07:00
Evan Tschannen b509a441e7 Merge branch 'master' into feature-skip-confirm
# Conflicts:
#	bindings/flow/tester/Tester.actor.cpp
#	bindings/go/src/_stacktester/stacktester.go
#	bindings/java/src/test/com/apple/foundationdb/test/AsyncStackTester.java
#	bindings/java/src/test/com/apple/foundationdb/test/StackTester.java
#	bindings/python/tests/tester.py
#	bindings/ruby/tests/tester.rb
#	documentation/sphinx/source/api-c.rst
#	documentation/sphinx/source/api-python.rst
#	documentation/sphinx/source/api-ruby.rst
#	documentation/sphinx/source/data-modeling.rst
#	documentation/sphinx/source/developer-guide.rst
#	fdbclient/vexillographer/fdb.options
#	fdbserver/MasterProxyServer.actor.cpp
2019-07-27 15:08:13 -07:00
Evan Tschannen ee94e8a062 removed a trace event which was causing valgrind errors 2019-07-27 13:51:59 -07:00
Evan Tschannen 90e3b50213 Merge branch 'master' into feature-coordinator-connection
# Conflicts:
#	fdbclient/DatabaseContext.h
#	fdbclient/NativeAPI.actor.cpp
#	fdbclient/NativeAPI.actor.h
#	fdbserver/workloads/KillRegion.actor.cpp
2019-07-26 15:05:02 -07:00
Evan Tschannen ee92f0574f fix: lastRequestTime was not updated
fix: COORDINATOR_REGISTER_INTERVAL was not set
fixed review comments
2019-07-26 13:23:56 -07:00
sramamoorthy a65c9f92ed get rid of all timeouts and other changes 2019-07-24 15:36:28 -07:00
sramamoorthy 7e04e3c8be snap v2: knobs for max snap create timeout 2019-07-24 15:36:28 -07:00
Evan Tschannen c70e762f0e
Merge pull request #1785 from xumengpanda/mengxu/server-team-remover-PR
Remove redundant server teams
2019-07-19 17:44:16 -07:00
Meng Xu b001a9ebe8 ServerTeamRemover runs after machineTeamRemover finishes
If serverTeamRemover removes a team before machineTeamRemover brings
the machine team number down to the desired number, DD may create a new
team (due to teams removed by serverTeamRemover), which may be removed
later by machineTeamRemover. This causes unnnecessary extra data movement.
2019-07-19 16:48:52 -07:00
Evan Tschannen 846038b0e6
Merge pull request #1858 from bnamasivayam/rk-ssfetch-throttle
Ratekeeper throttling aggressively when unable to fetch storage server list
2019-07-19 16:41:58 -07:00
Alex Miller c3a8ae4752
Merge pull request #1791 from fzhjon/fetch-keys-requests-priority
Introduce priority to fetchKeys requests from data distribution
2019-07-19 14:54:51 -07:00
Balachandar Namasivayam ecb3de3b49 Fixed space issue. 2019-07-17 18:10:05 -07:00
Balachandar Namasivayam 406bcebdc4 Ratekeeper to throttle tpsLimit to 1 if it is not able to fetch storage server list for some configurable amount of time. 2019-07-17 18:08:17 -07:00
Meng Xu 20f067e794 Merge with master:Resolve conflict with PR#1797 2019-07-16 10:52:28 -07:00
Meng Xu 415622f465 MachineTeamRemover:Change to remove MT with most teams
Change to remove machine team with most machine teams, using the same
logic as the serverTeamRemover.

The featue is guarded by TR_FLAG_REMOVE_MT_WITH_MOST_TEAMS knob.
2019-07-15 14:29:49 -07:00
Evan Tschannen db5b4a6331 avoid going to unlimited immediately after going below the durabilityLagTargetVersion 2019-07-12 18:50:56 -07:00
Evan Tschannen 1a18c859c7 knobified the durability lag rate controls 2019-07-12 18:50:56 -07:00
Evan Tschannen 02de53160d only skip confirm epoch live if CAUSAL_READ_RISKY is enabled
time checked on the proxy should be less than the time waited by the master to account for clock speed differences
setting REQUIRED_MIN_RECOVERY_DURATION and ENFORCED_MIN_RECOVERY_DURATION to 0 will go back to the old behavior
2019-07-12 17:58:16 -07:00
Evan Tschannen a63969afb3 enforce a minimum recovery duration, which allows proxies to avoid checking if the epoch is alive as long as its last commit has been less than MINIMUM_RECOVERY_DURATION ago 2019-07-12 13:10:21 -07:00
Jon Fu f12a3909f3 renamed workloads and made code style adjustments 2019-07-11 09:56:58 -07:00
Jon Fu 1e9d31597c removed extra parameter from getRange, added knob to guard new changes, and adjusted style/formatting in several places 2019-07-11 09:56:58 -07:00
Evan Tschannen 7e919e361c
Merge pull request #1817 from etschannen/feature-proxy-forward
Proxies will forward clients to the next generation
2019-07-10 13:53:12 -07:00
Evan Tschannen 49121172ea
Merge pull request #1795 from alexmiller-apple/peek-from-satellites
Log Routers will prefer to peek from satellite logs.
2019-07-09 17:38:57 -07:00
Evan Tschannen 001abec29d fixed a compiler error, buggified a new knob 2019-07-09 16:50:59 -07:00
Evan Tschannen 64aee73c4f we only need to hold the ReplyPromise for messages that we are going to forward to new proxies 2019-07-09 16:47:56 -07:00
Alex Miller 44f11702a8 Log Routers will prefer to peek from satellite logs.
Formerly, they would prefer to peek from the primary's logs.  Testing of
a failed region rejoining the cluster revealed that this becomes quite a
strain on the primary logs when extremely large volumes of peek requests
are coming from the Log Routers.  It happens that we have satellites
that contain the same mutations with Log Router tags, that have no other
peeking load, so we can prefer to use the satellite to peek rather than
the primary to distribute load across TLogs better.

Unfortunately, this revealed a latent bug in how tagged mutations in the
KnownCommittedVersion->RecoveryVersion gap were copied across
generations when the number of log router tags were decreased.
Satellite TLogs would be assigned log router tags using the
team-building based logic in getPushLocations(), whereas TLogs would
internally re-index tags according to tag.id%logRouterTags.  This
mismatch would mean that we could have:

    Log0 -2:0 ----- -2:0  Log 0

    Log1 -2:1 \
               >--- -2:1,-2:0 (-2:2 mod 2 becomes -2:0)  Log 1
    Log2 -2:2 /

And now we have data that's tagged as -2:0 on a TLog that's not the
preferred location for -2:0, and therefore a BestLocationOnly cursor
would miss the mutations.

This was never noticed before, as we never
used a satellite as a preferred location to peek from.  Merge cursors
always peek from all locations, and thus a peek for -2:0 that needed
data from the satellites would have gone to both TLogs and merged the
results.

We now take this mod-based re-indexing into account when assigning which
TLogs need to recover which tags from the previous generation, to make
sure that tag.id%logRouterTags always results in the assigned TLog being
the preferred location.

Unfortunately, previously existing will potentially have existing
satellites with log router tags indexed incorrectly, so this transition
needs to be gated on a `log_version` transition.  Old LogSets will have
an old LogVersion, and we won't prefer the sattelite for peeking.  Log
Sets post-6.2 (opt-in) or post-6.3 (default) will be indexed correctly,
and therefore we can safely offload peeking onto the satellites.
2019-07-08 22:25:01 -07:00
Meng Xu 3b9618fe11 ServerTeamRemover:Speedup removing teams in simulation
Otherwise, simulation may time out when team remover needs to
remove hundreds of teams.
2019-07-08 18:17:21 -07:00
Meng Xu 08a721b320 Merge branch 'master' into mengxu/server-team-remover-PR 2019-07-08 16:30:32 -07:00
Evan Tschannen c348b3da51 After a proxy dies, it will remain alive for an additional 10 seconds to forward clients to the new proxies 2019-07-08 12:53:40 -07:00
Evan Tschannen 310a5fe9a3 fix: we cannot reject 100% of requests, because a storage server which is stuck needs to get a future version error to trigger an all alternatives failed message from load balance so that clients will re-grab storage server interfaces from the proxy 2019-07-05 17:28:22 -07:00
Evan Tschannen e7c0ecf729 fix: we cannot reject 100% of requests, because a storage server which is stuck needs to get a future version error to trigger an all alternatives failed message from load balance so that clients will re-grab storage server interfaces from the proxy 2019-07-05 15:46:16 -07:00
Meng Xu 599fcb2e6d Add serverTeamRemover to remove redundant server teams 2019-07-02 17:40:37 -07:00
Evan Tschannen b9a6271375 local ratekeeper no longer globally limits 2019-06-28 16:54:22 -07:00
Evan Tschannen 18d5fbf1e0 Avoid jumping from rejecting 0% of requests directly to 20% of requests 2019-06-28 16:54:22 -07:00
Evan Tschannen db413c37f7 restored the STORAGE_DURABILITY_LAG_SOFT_MAX knob and made the rk target slightly smaller than the soft limit, to avoid inaccuracies in ratekeeper control causing behavior changes on the storage servers 2019-06-28 16:54:22 -07:00
Evan Tschannen 92b32855ca ratekeeper’s control algorithm would oscillate when limited by local ratekeeper 2019-06-28 16:54:22 -07:00
A.J. Beamon 35b6277a50 Fix knob copy paste error 2019-06-27 12:55:39 -07:00
Alex Miller 61901effed Increase how long FDB will wait before starting DD to repair data loss.
10s is a bit short for starting data distribution, which is rather
expensive.  60s is a bit more reasonable.
2019-06-19 13:40:21 -07:00
Evan Tschannen 20e3edeb0a Merge branch 'release-6.1'
# Conflicts:
#	documentation/sphinx/source/release-notes.rst
#	fdbserver/storageserver.actor.cpp
#	versions.target
2019-06-14 12:42:59 -07:00
Evan Tschannen 924f92e5aa Prevent the byte sample recovery from interfering with storage server recovery 2019-06-13 15:55:25 -07:00
Evan Tschannen 054d775343 increase the delay between idle commits to reduce the rate idle clusters fsync 2019-06-13 14:55:37 -07:00
Trevor Clinkenbeard 8144882d7b Merge branch 'apple-master' into features/local-rk 2019-06-10 19:40:25 -07:00
Evan Tschannen 29b96414e2 Merge branch 'release-6.1'
# Conflicts:
#	documentation/sphinx/source/release-notes.rst
#	fdbclient/NativeAPI.actor.cpp
#	fdbserver/Coordination.actor.cpp
#	flow/Arena.h
#	versions.target
2019-06-03 18:49:35 -07:00
Evan Tschannen 7c333dbc16 If a process receives a message in its clusterControllerInterface before becoming the cluster controller, if the process does not become the cluster controller in the next minute it should destroy the interface to prevent a memory leak. 2019-05-29 16:57:13 -07:00
sramamoorthy 31b6c86650 ignorePopDeadline to have high limit in simulator
- ignorePopDeadline to have highier limit in simulator
to accommdate for the buggify delays and make snapshot succeed.

- introduce a new knob for auto resetting the disabling of tlog pop
2019-05-28 22:07:46 -07:00
A.J. Beamon 603721e125 Merge branch 'master' into thread-safe-random-number-generation
# Conflicts:
#	fdbclient/ManagementAPI.actor.cpp
#	fdbrpc/AsyncFileCached.actor.h
#	fdbrpc/genericactors.actor.cpp
#	fdbrpc/sim2.actor.cpp
#	fdbserver/DiskQueue.actor.cpp
#	fdbserver/workloads/BulkSetup.actor.h
#	flow/ActorCollection.actor.cpp
#	flow/Net2.actor.cpp
#	flow/Trace.cpp
#	flow/flow.cpp
2019-05-23 08:35:47 -07:00
Evan Tschannen 8c3516951a Merge branch 'release-6.1'
# Conflicts:
#	documentation/sphinx/source/release-notes.rst
#	versions.target
2019-05-12 20:13:49 -07:00