Commit Graph

2123 Commits

Author SHA1 Message Date
Evan Tschannen be5d144b8b added status information on connected clients 2019-07-25 17:15:31 -07:00
Evan Tschannen 8b73a1c998 removed verbose trace messages 2019-07-24 15:07:41 -07:00
Evan Tschannen 2434d06726 fix: The coordinators did not properly track hasConnectedClients 2019-07-24 14:41:12 -07:00
Evan Tschannen b303ab4e6c fix: DR agents need to be clients because their failure monitoring information needs to come from two different cluster controllers 2019-07-23 19:24:07 -07:00
Evan Tschannen 4a866290b7 Clients keep a persistent connection open with coordinators to get updates to the list of proxies
Status still needs to be updated with client information from the coordinators
2019-07-23 19:22:44 -07:00
A.J. Beamon f31884c749 Merge branch 'master' into add-priority-starts-to-status
# Conflicts:
#	documentation/sphinx/source/release-notes.rst
2019-07-11 15:26:52 -07:00
Steve Atherton 1700d492cf
Merge pull request #1823 from ajbeamon/cache-hit-rate-in-status
Tweak cache hit calculations and add cache hit rate to status
2019-07-11 14:06:06 -07:00
A.J. Beamon 97609ad991 Add information about transaction starts at different priorities to status. 2019-07-11 13:54:44 -07:00
A.J. Beamon d10d9c6557
Merge pull request #1826 from etschannen/master
fix: do not access optionInfo unless the option already exists in the map
2019-07-10 19:27:22 -07:00
Evan Tschannen bbef631872 fix: do not access optionInfo unless the option already exists in the map 2019-07-10 18:48:54 -07:00
Vishesh Yadav 2606794df6
Merge pull request #1812 from alexmiller-apple/improve-only-spilled
Improve the behavior of parallelPeekMore+onlySpilled.
2019-07-10 17:15:19 -07:00
A.J. Beamon b4dbc6d7fa Change the way cache hits and misses are tracked to avoid counting blind page writes as misses and count the results of partial page writes. Report cache hit rate in status. 2019-07-10 14:43:20 -07:00
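A minimal sketch of the accounting change described in b4dbc6d7fa, under stated assumptions. The class `PageCacheStats` and its methods are hypothetical and are not FoundationDB code. The sketch assumes a blind write replaces a whole page without looking it up, while a partial write must first load the existing page and therefore counts as a hit or a miss.

```python
class PageCacheStats:
    """Toy page cache counter (hypothetical, not the FoundationDB implementation)."""

    def __init__(self):
        self.cached = set()
        self.hits = 0
        self.misses = 0

    def _lookup(self, page):
        # A lookup is the only operation that counts toward hits or misses.
        if page in self.cached:
            self.hits += 1
        else:
            self.misses += 1
            self.cached.add(page)

    def read(self, page):
        self._lookup(page)

    def write(self, page, partial):
        if partial:
            # Partial write: the existing page contents are needed first.
            self._lookup(page)
        else:
            # Blind write: the page is replaced wholesale, so no lookup is counted.
            self.cached.add(page)

    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else None


stats = PageCacheStats()
stats.write(1, partial=False)  # blind write, not counted as a miss
stats.read(1)                  # hit
stats.write(2, partial=True)   # miss, page 2 had to be loaded
stats.read(2)                  # hit
print(stats.hits, stats.misses)  # 2 1
```

Returning `None` when no lookups have happened is a choice made for this sketch; the commit does not say what status reports in that case.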
Evan Tschannen 7e919e361c
Merge pull request #1817 from etschannen/feature-proxy-forward
Proxies will forward clients to the next generation
2019-07-10 13:53:12 -07:00
A.J. Beamon 69d7c4f79c Merge branch 'master' into track-run-loop-busyness
# Conflicts:
#	documentation/sphinx/source/release-notes.rst
#	flow/Net2.actor.cpp
#	flow/network.h
2019-07-09 18:39:23 -07:00
Evan Tschannen 49121172ea
Merge pull request #1795 from alexmiller-apple/peek-from-satellites
Log Routers will prefer to peek from satellite logs.
2019-07-09 17:38:57 -07:00
Evan Tschannen c8d86516f0
Merge pull request #1800 from ajbeamon/rename-datacenter-version-difference
Rename datacenter_version_difference to datacenter_lag and include bo…
2019-07-09 17:29:27 -07:00
Evan Tschannen 7ad0d1a12b Merge branch 'master' into feature-proxy-forward
# Conflicts:
#	fdbclient/NativeAPI.actor.cpp
2019-07-09 17:26:15 -07:00
Evan Tschannen 001abec29d fixed a compiler error, buggified a new knob 2019-07-09 16:50:59 -07:00
Evan Tschannen 64aee73c4f we only need to hold the ReplyPromise for messages that we are going to forward to new proxies 2019-07-09 16:47:56 -07:00
Meng Xu cce00bb413
Merge pull request #1808 from ajbeamon/improved-transaction-metrics
Improve TransactionMetrics
2019-07-09 16:46:17 -07:00
Evan Tschannen b27a909f3a fix: onDisconnectOrFailure can spuriously trigger 2019-07-09 16:38:59 -07:00
Evan Tschannen d032d7fcf9 fix: if we get a broken_promise from the actor, wait to get the real error from the store 2019-07-09 16:37:54 -07:00
Vishesh Yadav 2f29b2c3d1 simulator: Just do a wait() in setupAndRun to avoid destruction
It gets us out of the ACTOR, never clearing the systemActors, and lets
the simulator call exit().
2019-07-09 14:55:20 -07:00
Vishesh Yadav 78a1b2defc simulator: Destroy each process individually in its context
When simulation ends, all the actors are cancelled, and the
destructors which rely on `globals` may not have access to the right
globals (they see the default simulator process globals instead). This
patch calls destroy on each process individually after we context
switch to that process, so that the globals accessed in a destructor
are its own.

This issue arose when trying to get `Peer::peerReferences` in
NetNotifiedQueue, resulting in decrementing the reference count of
peers in the FlowTransport object of '0.0.0.0'.
2019-07-09 14:24:16 -07:00
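The problem described in 78a1b2defc can be illustrated with a toy model. This is a sketch under assumptions, not simulator code: `ToySim`, `release_peer`, `destroy_naive`, and `destroy_in_context` are hypothetical names, and each process's "globals" are reduced to a single reference count keyed by address.

```python
DEFAULT = "0.0.0.0"  # address of the default simulator process, as in the commit message


class ToySim:
    """Toy simulator: `current` selects whose globals cleanup code sees (hypothetical)."""

    def __init__(self, addresses):
        self.peer_refs = {addr: 1 for addr in addresses}
        self.peer_refs[DEFAULT] = 0
        self.current = DEFAULT

    def release_peer(self):
        # Destructor-like cleanup: it touches whichever globals are current.
        self.peer_refs[self.current] -= 1


def destroy_naive(sim, addresses):
    # No context switch: every release lands on the default process.
    for _ in addresses:
        sim.release_peer()


def destroy_in_context(sim, addresses):
    # Switch to each process before destroying it, then restore the default.
    for addr in addresses:
        sim.current = addr
        sim.release_peer()
    sim.current = DEFAULT


addrs = ["1.1.1.1", "2.2.2.2"]
naive = ToySim(addrs)
destroy_naive(naive, addrs)
print(naive.peer_refs[DEFAULT])  # -2

fixed = ToySim(addrs)
destroy_in_context(fixed, addrs)
print(fixed.peer_refs[DEFAULT])  # 0
```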
Vishesh Yadav eabc610daa
Merge pull request #1813 from alexmiller-apple/log-version-4
Add a TLogVersion::V4
2019-07-09 08:42:20 -07:00
Alex Miller 44f11702a8 Log Routers will prefer to peek from satellite logs.
Formerly, they would prefer to peek from the primary's logs.  Testing of
a failed region rejoining the cluster revealed that this becomes quite a
strain on the primary logs when extremely large volumes of peek requests
are coming from the Log Routers.  It happens that we have satellites
that contain the same mutations with Log Router tags, that have no other
peeking load, so we can prefer to use the satellite to peek rather than
the primary to distribute load across TLogs better.

Unfortunately, this revealed a latent bug in how tagged mutations in the
KnownCommittedVersion->RecoveryVersion gap were copied across
generations when the number of log router tags was decreased.
Satellite TLogs would be assigned log router tags using the
team-building based logic in getPushLocations(), whereas TLogs would
internally re-index tags according to tag.id%logRouterTags.  This
mismatch would mean that we could have:

    Log0 -2:0 ----- -2:0  Log 0

    Log1 -2:1 \
               >--- -2:1,-2:0 (-2:2 mod 2 becomes -2:0)  Log 1
    Log2 -2:2 /

And now we have data that's tagged as -2:0 on a TLog that's not the
preferred location for -2:0, and therefore a BestLocationOnly cursor
would miss the mutations.

This was never noticed before, as we never
used a satellite as a preferred location to peek from.  Merge cursors
always peek from all locations, and thus a peek for -2:0 that needed
data from the satellites would have gone to both TLogs and merged the
results.

We now take this mod-based re-indexing into account when assigning which
TLogs need to recover which tags from the previous generation, to make
sure that tag.id%logRouterTags always results in the assigned TLog being
the preferred location.

Unfortunately, previously existing clusters will potentially have
satellites with log router tags indexed incorrectly, so this transition
needs to be gated on a `log_version` transition.  Old LogSets will have
an old LogVersion, and we won't prefer the satellite for peeking.  Log
Sets post-6.2 (opt-in) or post-6.3 (default) will be indexed correctly,
and therefore we can safely offload peeking onto the satellites.
2019-07-08 22:25:01 -07:00
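The tag mismatch in the diagram of 44f11702a8 can be reproduced with a small model. This is a sketch under simplifying assumptions, not the actual recovery code: the helper names are hypothetical, and the preferred location of re-indexed tag `i` is assumed to be the new log with index `i`. Only the rule `tag.id % logRouterTags` comes from the commit message.

```python
from collections import defaultdict


def reindex(tag_id, log_router_tags):
    # Mod-based re-indexing from the commit message: tag.id % logRouterTags.
    return tag_id % log_router_tags


def misplaced_tags(recovery_plan, log_router_tags):
    """recovery_plan maps a new log index to the old tag ids it recovers.

    Returns, per new log, the old tag ids whose re-indexed tag does not
    match that log's index (data a BestLocationOnly cursor would miss).
    """
    out = defaultdict(list)
    for log, old_tags in recovery_plan.items():
        for tag_id in old_tags:
            if reindex(tag_id, log_router_tags) != log:
                out[log].append(tag_id)
    return dict(out)


def mod_aware_plan(old_tag_ids, log_router_tags):
    # Assign each old tag to the log that its re-indexed tag prefers.
    plan = defaultdict(list)
    for tag_id in old_tag_ids:
        plan[reindex(tag_id, log_router_tags)].append(tag_id)
    return dict(plan)


# The situation in the diagram: three old tags, two log router tags afterwards.
old_plan = {0: [0], 1: [1, 2]}
print(misplaced_tags(old_plan, 2))   # {1: [2]}

new_plan = mod_aware_plan([0, 1, 2], 2)
print(new_plan)                      # {0: [0, 2], 1: [1]}
print(misplaced_tags(new_plan, 2))   # {}
```

In the old plan, tag -2:2 becomes -2:0 but lives on Log 1. The mod-aware plan sends it to Log 0 instead.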
Alex Miller d2ef84a8f9 Add a TLogVersion::V4
And refactor some code to make adding more TLogVersions easier.
2019-07-08 22:22:45 -07:00
Alex Miller 6c8f50ca66 Improve the behavior of parallelPeekMore+onlySpilled.
When onlySpilled transitions from true (don't peek memory) to false (do
peek memory) as part of a parallel peek, we'll end up wasting the rest
of the replies because we'll honor their onlySpilled=true setting and
thus not have any additional data to return.

Instead, we thread the onlySpilled back through in the same way that the
ending version of the last peek overrides the requested starting
version of the next peek.  This simulates the same behavior that the
client has, where the value of onlySpilled that we reply with comes back
in the next request.

I haven't actually seen it be a problem, but this should help make sure
the onlySpilled transition when catching up doesn't ever cause any ill
effects if a process starts riding the line between onlySpilled settings.
2019-07-08 22:13:09 -07:00
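The effect described in 6c8f50ca66 can be shown with a toy model of a parallel peek. This is a sketch under assumptions, not TLog code: `toy_peek`, `parallel_peek`, `spilled_below`, and the batch limit are hypothetical, and "spilled" is modelled as "version below a threshold".

```python
def toy_peek(versions, begin, only_spilled, spilled_below, limit=2):
    """Return (batch, end_version, only_spilled_for_next_request)."""
    if only_spilled:
        candidates = [v for v in versions if begin <= v < spilled_below]
    else:
        candidates = [v for v in versions if v >= begin]
    batch = candidates[:limit]
    end = batch[-1] + 1 if batch else begin
    # Reply with whether spilled data still remains past `end`.
    still_spilled = any(end <= v < spilled_below for v in versions)
    return batch, end, still_spilled


def parallel_peek(versions, requests, spilled_below, thread_only_spilled):
    results = []
    prev_end = None
    prev_only_spilled = None
    for begin, only_spilled in requests:
        if prev_end is not None:
            # The ending version of the last peek overrides the requested begin.
            begin = prev_end
            if thread_only_spilled:
                # Thread onlySpilled through from the last reply as well.
                only_spilled = prev_only_spilled
        batch, prev_end, prev_only_spilled = toy_peek(
            versions, begin, only_spilled, spilled_below
        )
        results.append(batch)
    return results


versions = [1, 2, 3, 4, 5, 6]
requests = [(1, True)] * 3  # three requests queued up front, all onlySpilled=true

print(parallel_peek(versions, requests, 3, thread_only_spilled=False))
# [[1, 2], [], []]
print(parallel_peek(versions, requests, 3, thread_only_spilled=True))
# [[1, 2], [3, 4], [5, 6]]
```

Without threading, the second and third replies are wasted because they still honor `onlySpilled=true` after the spilled data has run out.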
A.J. Beamon a5a6f8431c Add a random UID to TransactionMetrics in case a client opens multiple connections and also a field to indicate whether the connection is internal. Convert some of the metrics to our Counter object instead of running totals. 2019-07-08 14:01:04 -07:00
Evan Tschannen c348b3da51 After a proxy dies, it will remain alive for an additional 10 seconds to forward clients to the new proxies 2019-07-08 12:53:40 -07:00
Evan Tschannen ec11ef024b
Merge pull request #1798 from ajbeamon/merge-release-6.1-into-master
Merge release 6.1 into master
2019-07-08 09:02:56 -07:00
A.J. Beamon dd85edb08c
Merge pull request #1802 from xumengpanda/mengxu/DD-ensure-redundant-team-priority-as700-PR
TeamTracker:Set redundant team priority as PRIORITY_TEAM_REDUNDANT
2019-07-08 08:47:28 -07:00
Vishesh Yadav 8d3a826c63
Merge pull request #1804 from alexmiller-apple/cycle-verify-only
Add a checkOnly parameter to Cycle workload.
2019-07-05 21:59:52 -07:00
Jingyu Zhou 50e7593c5b
Merge pull request #1796 from ajbeamon/remove-trace-event-underscores
Remove trace event underscores
2019-07-05 21:45:55 -07:00
Alex Miller 14e5dd74fe Add a checkOnly parameter to Cycle workload.
So that it can be used in the real world for consistency checking of
backup and DR.
2019-07-05 19:09:09 -07:00
Evan Tschannen 310a5fe9a3 fix: we cannot reject 100% of requests, because a storage server which is stuck needs to get a future version error to trigger an all alternatives failed message from load balance so that clients will re-grab storage server interfaces from the proxy 2019-07-05 17:28:22 -07:00
Meng Xu e8fb7564f5 Merge branch 'master' into mengxu/DD-ensure-redundant-team-priority-as700-PR 2019-07-05 17:28:12 -07:00
Meng Xu 46d28a3b79 TeamTracker:Set redundant team priority as redundant
The redundant team removed by teamRemover will not exist
in the global teams data structure. So we will not find
the redundant team from shard-to-team mapping in the system key.

Before this change, teamTracker marks such a team as PRIORITY_TEAM_UNHEALTHY.
With this change, it marks it as PRIORITY_TEAM_REDUNDANT.
2019-07-05 15:24:00 -07:00
A.J. Beamon 4be08d9b2d Rename datacenter_version_difference to datacenter_lag and include both seconds and versions. 2019-07-05 14:36:18 -07:00
A.J. Beamon 2a56e011ea Merge branch 'release-6.1' into merge-release-6.1-into-master
# Conflicts:
#	documentation/sphinx/source/release-notes.rst
#	fdbserver/DataDistribution.actor.cpp
2019-07-05 13:52:29 -07:00
A.J. Beamon 9f4b6fd770 Remove additional underscores 2019-07-05 08:12:25 -07:00
A.J. Beamon a3ac9c7eea Remove underscores from some trace event names 2019-07-05 08:08:29 -07:00
Alex Miller ea6898144d Merge remote-tracking branch 'upstream/master' into flowlock-api 2019-07-03 20:44:15 -07:00
Evan Tschannen 23ecc17075
Merge pull request #1755 from senthil-ram/recoveryFix
sev40 if knownCommittedVersion > recoveryVersion
2019-07-03 16:39:16 -07:00
Evan Tschannen e153571a50
Merge pull request #1775 from alexmiller-apple/crc32c-memory-storage
Memory storage engine to use crc32c DiskQueue by default (in 6.2).
2019-07-03 16:37:42 -07:00
A.J. Beamon 8c10d832a1 Add coordinator role in trace events 2019-07-03 11:09:36 -07:00
Evan Tschannen 8afab93e29
Merge pull request #1782 from etschannen/master
revert storage server priority changes
2019-07-02 17:25:31 -07:00
Evan Tschannen 3fb0999e10 revert storage server priority changes 2019-07-02 16:54:47 -07:00
Evan Tschannen 86b0224347 Merge branch 'release-6.1' of github.com:apple/foundationdb into release-6.1 2019-07-02 16:27:31 -07:00
Evan Tschannen 64e33bb4f9 added logging for maintenance mode 2019-07-02 16:25:29 -07:00