A.J. Beamon
7c801513e2
Fix cases where latency band config could be discarded during recovery or process start.
2019-11-20 11:44:18 -08:00
Evan Tschannen
ffc89d1182
fix: dd test recruitment should prefer the location of ratekeeper over other used processes
2019-11-13 12:58:55 -08:00
Balachandar Namasivayam
2e41497580
This commit tries to distribute RK and DD among other available, empty processes.
2019-11-12 17:52:42 -08:00
Balachandar Namasivayam
f5282f2c7e
Fix a bug where DD or RK could be halted and re-recruited in a loop for certain valid process class configurations. Specifically, recruitment of DD or RK takes into account that the master process is preferred over a proxy, resolver, or CC.
...
But the check for a better DD only looks at machine class, ignoring that the new recruit could share a process with a proxy, resolver, or CC. Also try to balance the distribution of the DD and RK roles when there are enough processes to do so.
2019-11-12 14:22:36 -08:00
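A hedged sketch of the comparison this fix implies; the names (Candidate, sharedCriticalRoles, isBetterDDorRKCandidate) are invented for illustration, not the actual ClusterController code:

    #include <tuple>

    // Lower fitness values are better, mirroring how process classes rank roles.
    enum class Fitness { Best = 0, Good = 1, Okay = 2, Worst = 3 };

    struct Candidate {
        Fitness machineClassFitness; // how well the machine class suits DD/RK
        int sharedCriticalRoles;     // proxies/resolvers/CC already on the process
    };

    // A recruit is only "better" if it wins on machine-class fitness and, on a
    // tie, shares a process with fewer proxy/resolver/CC roles. Comparing
    // machine class alone is what allowed the halt/re-recruit loop.
    bool isBetterDDorRKCandidate(const Candidate& a, const Candidate& b) {
        return std::tie(a.machineClassFitness, a.sharedCriticalRoles) <
               std::tie(b.machineClassFitness, b.sharedCriticalRoles);
    }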
Evan Tschannen
43e99ef6a4
fix: betterMasterExists must check whether proxy or resolver fitness is better before comparing the count of either
2019-10-17 13:18:31 -07:00
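A minimal sketch of the corrected ordering, with illustrative names: fitness must dominate, and counts only break a fitness tie.

    enum class Fitness { Best = 0, Good = 1, Okay = 2, Worst = 3 };

    // Returns true if the new proxy/resolver set beats the current one.
    // Comparing counts before fitness was the bug: a larger but worse-fitness
    // set could be judged better.
    bool betterRoleSet(Fitness newFit, int newCount, Fitness oldFit, int oldCount) {
        if (newFit != oldFit)
            return newFit < oldFit; // fitness dominates
        return newCount > oldCount; // count only matters on a fitness tie
    }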
Evan Tschannen
298b815109
a single proxy or resolver with best fitness no longer prevents additional proxies or resolvers with good fitness from being recruited
2019-10-14 18:32:17 -07:00
Evan Tschannen
5064d91b75
fix: the cluster controller would not change to a new set of satellite tlogs when they became available in a better satellite location
2019-10-14 18:31:23 -07:00
Evan Tschannen
35e816e9ad
added the ability to configure satellite_logs by satellite location; this overrides the region configuration if both are present
2019-10-14 18:30:15 -07:00
A.J. Beamon
31ce56eddf
Add cluster controller metrics
2019-10-03 15:29:11 -07:00
Evan Tschannen
a62862c105
add yieldedFutures to prevent slow tasks
2019-09-11 16:26:48 -07:00
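The pattern, sketched in Flow-style actor code (this requires FoundationDB's actor compiler; Item and handle() are stand-ins, not names from the commit): long-running work periodically waits on yield(), which completes immediately while the time slice lasts and otherwise defers the rest of the loop.

    ACTOR Future<Void> processAll(std::vector<Item> items) {
        state int i = 0;
        for (; i < (int)items.size(); ++i) {
            handle(items[i]); // stand-in for the per-item work
            wait(yield());    // cheap if the slice isn't exhausted; defers otherwise
        }
        return Void();
    }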
Evan Tschannen
945cff1e5b
the cluster controller caches the serialization of serverDBInfo to avoid regenerating it many times
2019-09-10 14:27:22 -07:00
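A rough sketch of the caching idea, assuming an invented generation counter and serializer; the real serverDBInfo type and serializer live in fdbserver and flow:

    #include <cstdint>
    #include <string>

    struct ServerDBInfo {
        uint64_t infoGeneration = 0; // assume this is bumped on every change
    };

    // Stand-in for the real serializer.
    std::string serializeServerDBInfo(const ServerDBInfo& info) {
        return "serialized@" + std::to_string(info.infoGeneration);
    }

    // Re-serialize only when the info changed, instead of once per reader.
    struct CachedSerialization {
        uint64_t generation = UINT64_MAX;
        std::string bytes;

        const std::string& get(const ServerDBInfo& info) {
            if (info.infoGeneration != generation) {
                bytes = serializeServerDBInfo(info); // pay the cost once per change
                generation = info.infoGeneration;
            }
            return bytes; // shared by every subsequent reader
        }
    };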
Evan Tschannen
90e3b50213
Merge branch 'master' into feature-coordinator-connection
...
# Conflicts:
# fdbclient/DatabaseContext.h
# fdbclient/NativeAPI.actor.cpp
# fdbclient/NativeAPI.actor.h
# fdbserver/workloads/KillRegion.actor.cpp
2019-07-26 15:05:02 -07:00
Evan Tschannen
be5d144b8b
added status information on connected clients
2019-07-25 17:15:31 -07:00
Jingyu Zhou
bbeaf0ebbb
Add a monitorServerInfoConfig() callback
...
This was deleted during a code refactor in ef868f5. Because no tests
were complaining, we didn't find this until now.
2019-07-25 15:17:26 -07:00
Evan Tschannen
4a866290b7
Clients keep a persistent connection open with coordinators to get updates to the list of proxies
...
Status still needs to be updated with client information from the coordinators
2019-07-23 19:22:44 -07:00
Jingyu Zhou
50e7593c5b
Merge pull request #1796 from ajbeamon/remove-trace-event-underscores
...
Remove trace event underscores
2019-07-05 21:45:55 -07:00
A.J. Beamon
9f4b6fd770
Remove additional underscores
2019-07-05 08:12:25 -07:00
Alex Miller
7a500cd37f
A giant translation of TaskFooPriority -> TaskPriority::Foo
...
This is so that APIs that take priorities don't take plain ints, which
are common and make it easy to accidentally pass the wrong thing.
2019-06-25 02:47:35 -07:00
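The point of the rename in a few lines (priority values here are made up): a scoped enum has no implicit conversion from int, so the compiler rejects a bare number where a priority is expected.

    enum class TaskPriority : int {
        DiskRead = 5000,     // illustrative values only
        DefaultYield = 7000,
        Coordination = 8000,
    };

    void schedule(TaskPriority priority) { (void)priority; /* enqueue at priority */ }

    int main() {
        schedule(TaskPriority::DiskRead); // OK: intent is explicit
        // schedule(5000);                // no longer compiles: no int conversion
    }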
Vishesh Yadav
a8e408e268
run clang-format on changes
2019-06-10 14:10:24 -07:00
Vishesh Yadav
6fa7081a21
net: Don't make FailureMonitoring requests from client
...
This patch removes the need for clients to continuously contact the
cluster coordinator for failure monitoring information. Instead, it
uses FlowTransport to monitor the statuses of peers and updates
FailureMonitor accordingly.
2019-06-09 00:43:38 -07:00
Evan Tschannen
29b96414e2
Merge branch 'release-6.1'
...
# Conflicts:
# documentation/sphinx/source/release-notes.rst
# fdbclient/NativeAPI.actor.cpp
# fdbserver/Coordination.actor.cpp
# flow/Arena.h
# versions.target
2019-06-03 18:49:35 -07:00
Evan Tschannen
7c333dbc16
If a process receives a message on its clusterControllerInterface before becoming the cluster controller, and it does not become the cluster controller within the next minute, it should destroy the interface to prevent a memory leak.
2019-05-29 16:57:13 -07:00
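A Flow-style sketch of that timeout (actor-compiler syntax; the helper name and wiring are assumptions, not the commit's code): race the election against a one-minute delay and clear the interface if the delay wins.

    ACTOR Future<Void> expireUnusedInterface(ClusterControllerFullInterface* cci,
                                             Future<Void> becameClusterController) {
        choose {
            when(wait(becameClusterController)) {
                // Won the election; the interface stays alive.
            }
            when(wait(delay(60.0))) {
                // Never became the CC: reset the interface so messages queued
                // against it cannot pin its memory indefinitely.
                *cci = ClusterControllerFullInterface();
            }
        }
        return Void();
    }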
A.J. Beamon
5f55f3f613
Replace g_random and g_nondeterministic_random with functions deterministicRandom() and nondeterministicRandom() that return thread_local random number generators. Delete g_debug_random and trace_random. Allow only deterministicRandom() to be seeded, and require it to be seeded from each thread on which it is used.
2019-05-10 14:01:52 -07:00
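A sketch of the thread_local accessor pattern described above, using std::mt19937_64 as a stand-in for flow's real random generator class:

    #include <cstdint>
    #include <random>

    class Random {
        std::mt19937_64 gen;
    public:
        void seed(uint64_t s) { gen.seed(s); }
        double random01() {
            return std::uniform_real_distribution<double>(0.0, 1.0)(gen);
        }
    };

    // One generator per thread; the deterministic one starts unseeded, so each
    // thread that uses it must seed it explicitly, as the commit requires.
    Random* deterministicRandom() {
        static thread_local Random r;
        return &r;
    }

    Random* nondeterministicRandom() {
        static thread_local Random r = [] {
            Random rr;
            rr.seed(std::random_device{}()); // seeded from real entropy
            return rr;
        }();
        return &r;
    }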
Andrew Noyes
6207d724f8
Fix all -Wunused-variable warnings
2019-04-15 18:13:00 -07:00
mpilman
1c16f87a4e
Remove calls to printable() in trace events (outside of workloads)
2019-04-05 13:12:19 -07:00
mpilman
c008e16c81
Defer formatting in traces to make them cheaper
...
This is the first part of making `TraceEvent` cheaper. The main idea is
to defer calls to any code that formats strings. These are the main
changes:
- TraceEvent::detail now takes a C string instead of std::string for
literals. This prevents unnecessary allocations if the trace is not
going to be printed in the first place (for example for SevDebug).
Before, `detail` expected a `std::string` as key, which meant that
any string literal was copied on each call.
- Templates Traceable and SpecialTraceMetricType. These templates can be
specialized for any type that needs to be printed. The actual
formatting is deferred until after the `enabled` check. This
provides two benefits: (1) if a TraceEvent is disabled, we don't pay
for the formatting, and (2) TraceEvent can trace types that it doesn't
know about.
- TraceEvent::enabled is set in the constructor if the severity is
passed. This makes sure that `TraceEvent::init` is not called.
- `TraceEvent::detail` is inlined, so for disabled TraceEvent
calls, a call to detail only introduces an if-branch, which is much
cheaper than a function call.
2019-04-05 13:12:19 -07:00
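A stripped-down sketch of the deferral (not the real TraceEvent, which has far more machinery): the key stays a C string and the value is formatted only behind the enabled check.

    #include <sstream>

    enum Severity { SevDebug = 5, SevInfo = 10 };
    constexpr Severity minSeverity = SevInfo; // events below this are disabled

    struct TraceEvent {
        bool enabled;
        std::ostringstream out;

        explicit TraceEvent(Severity sev) : enabled(sev >= minSeverity) {}

        // C-string key: a disabled event allocates no std::string, and the
        // value is formatted only if the event will actually be printed.
        template <class T>
        TraceEvent& detail(const char* key, const T& value) {
            if (enabled)
                out << key << "=" << value << " ";
            return *this;
        }

        ~TraceEvent() {
            if (enabled) { /* write out.str() to the log file */ }
        }
    };

    // TraceEvent(SevDebug).detail("Key", expensiveToFormat) now formats nothing.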
Evan Tschannen
8ebf771392
cleanup cluster controller trace events
2019-03-30 14:17:18 -07:00
A.J. Beamon
71e2fdafb8
Change ratekeeper names to camel case
2019-03-27 08:24:25 -07:00
Evan Tschannen
5e03e178de
Merge pull request #1345 from ajbeamon/support-multiple-client-or-worker-issues
...
Add support for a client or worker having multiple issues.
2019-03-24 17:27:50 -07:00
Evan Tschannen
d45159ebf7
Merge pull request #1307 from jzhou77/ratekeeper
...
Monitor placement of Ratekeeper and DataDistributor
2019-03-24 17:26:07 -07:00
Evan Tschannen
d6ad027d37
ratekeeper needs to be recruited for proxies to make progress, so if one has not registered with the cluster controller by the time we are accepting commits, recruit a new one
2019-03-24 16:48:24 -07:00
Evan Tschannen
f426d732ea
fix: forgot to remove one location where id_used was incremented for distributor and ratekeeper
2019-03-24 16:04:59 -07:00
Evan Tschannen
e8948726e8
once we recruit a ratekeeper, do not allow any other ratekeepers to register
2019-03-24 11:04:39 -07:00
Jingyu Zhou
40eec20252
Restore master PID in worker registration
...
This fix was lost during a merge.
2019-03-23 21:02:11 -07:00
Jingyu Zhou
3ef26e6be3
Fix fitness assignment statements
...
Found by the macOS build.
2019-03-23 19:16:04 -07:00
Evan Tschannen
1fc6937802
changed NetworkAddressList to hold at most two addresses, for performance
2019-03-23 17:54:46 -07:00
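The shape change, roughly (field names are assumptions): a fixed primary-plus-optional-secondary pair replaces a general list, avoiding heap allocation on hot paths.

    #include <optional>

    struct NetworkAddress { /* ip, port, flags... */ };

    struct NetworkAddressList {
        NetworkAddress address;                         // always present
        std::optional<NetworkAddress> secondaryAddress; // at most one more
    };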
Evan Tschannen
b51a24453e
the data distributor and ratekeeper are not included in id_used, but when comparing equally good options we prefer to avoid sharing with those roles
...
an excluded data distributor or ratekeeper was improperly killed when the best option was also excluded
2019-03-23 13:25:36 -07:00
Jingyu Zhou
fdc5b5ddbf
Fix: spurious ratekeeper registration
...
A rare race condition, reproducible with:
-r simulation -f ./foundationdb/tests/slow/WriteDuringReadAtomicRestore.txt -s 114256311 -b on
- A is the ratekeeper.
- CC recruits B, and B starts.
- CC halts ratekeeper A, and A is halted.
- A registers back with the CC, which then halts B. The CC sets A to be the ratekeeper.
The CC starts recruiting and finds A is the best machine, but skips recruiting
because it thinks A is already used. Now the cluster is left with no ratekeeper.
Fix by disallowing ratekeeper registration with a previous ID.
2019-03-23 11:03:51 -07:00
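An invented sketch of the guard this fix describes: a registration still carrying the ID of a ratekeeper the CC already halted is rejected, so a stale worker (A in the race above) cannot displace the new recruit (B).

    #include <cstdint>

    using UID = uint64_t; // stand-in for FoundationDB's UID type

    struct RatekeeperRegistrationRequest {
        UID ratekeeperId;
    };

    struct ClusterControllerData {
        UID haltedRatekeeperId = 0; // ID of the ratekeeper we last halted

        bool acceptRegistration(const RatekeeperRegistrationRequest& req) {
            // Reject stale registrations from an instance we already halted.
            return req.ratekeeperId != haltedRatekeeperId;
        }
    };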
Jingyu Zhou
6523cd4931
Fix: ratekeeper recruitment is not triggered
2019-03-23 09:20:54 -07:00
Evan Tschannen
2da46e3172
fix: halt if datacenters are different
2019-03-22 23:53:21 -07:00
Evan Tschannen
d34c56c9a5
ensure that the processId exists in id_worker before accessing it
2019-03-22 18:54:39 -07:00
Evan Tschannen
36ab852bb1
Merge branch 'master' into ratekeeper
...
# Conflicts:
# fdbserver/ClusterController.actor.cpp
2019-03-22 18:41:00 -07:00
Evan Tschannen
ddb6058770
simplified ratekeeper monitoring loop
2019-03-22 18:22:45 -07:00
Jingyu Zhou
12917d8c7d
Add actors to store halt request futures
...
Take best fitness into account when checking for a better DD or RK.
2019-03-22 18:06:38 -07:00
Jingyu Zhou
e8977aeb98
Remove clusterControllerDcId check
...
This is no longer needed since it'll be set in the ctor.
2019-03-22 18:01:54 -07:00
Evan Tschannen
82bc447e29
startRatekeeper is responsible for updating serverDBInfo
2019-03-22 17:56:16 -07:00
Evan Tschannen
82c80c225d
make sure id_worker is updated before setting the ratekeeper or data distributor
2019-03-22 17:08:54 -07:00
Evan Tschannen
6a9c9d79cc
Update fdbserver/ClusterController.actor.cpp
2019-03-22 17:00:58 -07:00
Evan Tschannen
70b1c88cdd
Update fdbserver/ClusterController.actor.cpp
2019-03-22 17:00:52 -07:00
Jingyu Zhou
16f54577ee
Restore master PID in cluster controller worker registration
...
CC may think the master failed and clear the master PID, which can block both
data distributor and ratekeeper recruitment. Fix by restoring it during worker
registration.
2019-03-22 14:53:05 -07:00