272 Commits

Author SHA1 Message Date
A.J. Beamon 7c801513e2 Fix cases where latency band config could be discarded during recovery or process start. 2019-11-20 11:44:18 -08:00
Evan Tschannen ffc89d1182 fix: dd test recruitment should prefer the location of ratekeeper over other used processes 2019-11-13 12:58:55 -08:00
Balachandar Namasivayam 2e41497580 This commit tries to distribute RK and DD among other empty available processes. 2019-11-12 17:52:42 -08:00
Balachandar Namasivayam f5282f2c7e Fix bug where DD or RK could be halted and re-recruited in a loop for certain valid process class configurations. Specifically, recruitment of DD or RK takes into account that master process is preferred over proxy, resolver or cc.
But check for better DD only looks for better machine class ignoring that the new recruit could share a proxy or resolver or CC. Also try to balance the distribution of the DD and RK role if there are enough processes to do so.
2019-11-12 14:22:36 -08:00
Evan Tschannen 43e99ef6a4 fix: better master exists must check if fitness is better for proxies or resolvers before looking at the count of either of them 2019-10-17 13:18:31 -07:00
Evan Tschannen 298b815109 one proxy or resolver with best fitness no longer prevents more proxies or resolvers from being recruited with good fitness 2019-10-14 18:32:17 -07:00
Evan Tschannen 5064d91b75 fix: the cluster controller would not change to a new set of satellite tlogs when they become available in a better satellite location 2019-10-14 18:31:23 -07:00
Evan Tschannen 35e816e9ad added the ability to configure satellite_logs by satellite location; this will override the region configuration if both are present 2019-10-14 18:30:15 -07:00
A.J. Beamon 31ce56eddf Add cluster controller metrics 2019-10-03 15:29:11 -07:00
Evan Tschannen a62862c105 add yieldedFutures to prevent slow tasks 2019-09-11 16:26:48 -07:00
Evan Tschannen 945cff1e5b the cluster controller caches the serialization of serverDBInfo, to avoid regenerating it many times 2019-09-10 14:27:22 -07:00
Evan Tschannen 90e3b50213 Merge branch 'master' into feature-coordinator-connection
# Conflicts:
#	fdbclient/DatabaseContext.h
#	fdbclient/NativeAPI.actor.cpp
#	fdbclient/NativeAPI.actor.h
#	fdbserver/workloads/KillRegion.actor.cpp
2019-07-26 15:05:02 -07:00
Evan Tschannen be5d144b8b added status information on connected clients 2019-07-25 17:15:31 -07:00
Jingyu Zhou bbeaf0ebbb Add a monitorServerInfoConfig() call back
This was deleted during a code refactor in ef868f5. Because no tests were
complaining, we didn't find this until now.
2019-07-25 15:17:26 -07:00
Evan Tschannen 4a866290b7 Clients keep a persistent connection open with coordinators to get updates to the list of proxies
Status still needs to be updated with client information from the coordinators
2019-07-23 19:22:44 -07:00
Jingyu Zhou 50e7593c5b
Merge pull request #1796 from ajbeamon/remove-trace-event-underscores
Remove trace event underscores
2019-07-05 21:45:55 -07:00
A.J. Beamon 9f4b6fd770 Remove additional underscores 2019-07-05 08:12:25 -07:00
Alex Miller 7a500cd37f A giant translation of TaskFooPriority -> TaskPriority::Foo
This is so that APIs that take priorities don't take ints, which are
common and make it easy to accidentally pass the wrong value.
2019-06-25 02:47:35 -07:00
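The motivation in the commit above (priorities should not be plain ints) can be illustrated with a scoped enum. This is a minimal sketch, not the FoundationDB definition: the enumerator names, their values, and the `delayBucket` helper are hypothetical and exist only for illustration.

```cpp
#include <cassert>
#include <type_traits>

// Hypothetical priority values, for illustration only.
// These are NOT FoundationDB's actual TaskPriority enumerators.
enum class TaskPriority : int { Low = 1000, Default = 5000, High = 9000 };

// Hypothetical API that takes a priority. Because the parameter is a scoped
// enum, a call such as delayBucket(9000) is rejected by the compiler.
inline int delayBucket(TaskPriority p) {
    return static_cast<int>(p) / 1000;
}

// A scoped enum has no implicit conversion from int, which is the property
// the commit message relies on.
static_assert(!std::is_convertible<int, TaskPriority>::value,
              "int must not convert implicitly to TaskPriority");
```

With this shape, `delayBucket(TaskPriority::High)` compiles while `delayBucket(9000)` does not, so an unrelated integer cannot be passed where a priority is expected.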
Vishesh Yadav a8e408e268 run clang-format on changes 2019-06-10 14:10:24 -07:00
Vishesh Yadav 6fa7081a21 net: Don't make FailureMonitoring requests from client
This patch removes the need for clients to continuously contact
cluster coordinator for failure monitoring information. Instead, it
uses the FlowTransport to monitor the statuses of peers and update
FailureMonitor accordingly.
2019-06-09 00:43:38 -07:00
Evan Tschannen 29b96414e2 Merge branch 'release-6.1'
# Conflicts:
#	documentation/sphinx/source/release-notes.rst
#	fdbclient/NativeAPI.actor.cpp
#	fdbserver/Coordination.actor.cpp
#	flow/Arena.h
#	versions.target
2019-06-03 18:49:35 -07:00
Evan Tschannen 7c333dbc16 If a process receives a message in its clusterControllerInterface before becoming the cluster controller and does not become the cluster controller within the next minute, it should destroy the interface to prevent a memory leak. 2019-05-29 16:57:13 -07:00
A.J. Beamon 5f55f3f613 Replace g_random and g_nondeterministic_random with functions deterministicRandom() and nondeterministicRandom() that return thread_local random number generators. Delete g_debug_random and trace_random. Allow only deterministicRandom() to be seeded, and require it to be seeded from each thread on which it is used. 2019-05-10 14:01:52 -07:00
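The idea in the commit above, a function that returns a thread_local generator which must be seeded on each thread before use, might look roughly like the following. This is a sketch under stated assumptions: the `sketch` namespace, the `seedDeterministicRandom` helper, and the use of `std::mt19937_64` are hypothetical and are not taken from the FoundationDB source.

```cpp
#include <cassert>
#include <cstdint>
#include <random>
#include <stdexcept>

namespace sketch {

// Per-thread generator state. Each thread gets its own copy.
struct SeededRng {
    std::mt19937_64 engine;
    bool seeded = false;
};

inline SeededRng& state() {
    thread_local SeededRng s;  // one instance per thread
    return s;
}

// Hypothetical seeding entry point; must be called on every thread that
// uses deterministicRandom().
inline void seedDeterministicRandom(std::uint64_t seed) {
    state().engine.seed(seed);
    state().seeded = true;
}

// Returns this thread's generator and refuses to run unseeded, so that
// results stay reproducible for a given seed.
inline std::mt19937_64& deterministicRandom() {
    if (!state().seeded)
        throw std::logic_error("deterministicRandom() used before seeding on this thread");
    return state().engine;
}

}  // namespace sketch
```

Seeding the same thread twice with the same value restarts the same sequence, which is the property a deterministic simulation depends on.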
Andrew Noyes 6207d724f8 Fix all -Wunused-variable warnings 2019-04-15 18:13:00 -07:00
mpilman 1c16f87a4e Remove trace-calls to printable (in non-workloads) 2019-04-05 13:12:19 -07:00
mpilman c008e16c81 Defer formatting in traces to make them cheaper
This is the first part of making `TraceEvent` cheaper. The main idea is
to defer calls to any code that formats strings. These are the main
changes:

- TraceEvent::detail now takes a c-string instead of std::string for
  literals. This prevents unnecessary allocations if the trace is not
  going to be printed in the first place (for example for SevDebug).
  Before that, `detail` expected a `std::string` as key, which meant that
  any string literal would be copied on each call.
- Templates Traceable and SpecialTraceMetricType. These templates can be
  specialized for any type that needs to be printed. The actual
  formatting will be deferred to after the `enabled` check. This
  provides two benefits: (1) if a TraceEvent is disabled, we don't pay
  for the formatting and (2) TraceEvent can trace types that it doesn't
  know about.
- TraceEvent::enabled will be set in the constructor if the Severity is
  passed. This will make sure that `TraceEvent::init` is not called.
- `TraceEvent::detail` will be inlined. So for disabled TraceEvent
  calls, a call to detail will only introduce an if-branch, which is much
  cheaper than a function call.
2019-04-05 13:12:19 -07:00
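The deferral described in the commit above can be sketched as follows: the key is a C-string, and the value is only formatted after the enabled check. This is an illustrative sketch, not the real `TraceEvent`. The `Event` class, the `Point` type, the `toString` member, and the `formatCalls` counter are hypothetical; only the name `Traceable` is borrowed from the commit message, and its real interface may differ.

```cpp
#include <cassert>
#include <string>

// Counts how often formatting actually runs, so the deferral is observable.
static int formatCalls = 0;

struct Point { int x, y; };

// Hypothetical stand-in for a Traceable-style customization point:
// specialize it for any type that should be printable in an event.
template <class T> struct Traceable;

template <> struct Traceable<Point> {
    static std::string toString(const Point& p) {
        ++formatCalls;
        return "(" + std::to_string(p.x) + "," + std::to_string(p.y) + ")";
    }
};

class Event {
public:
    explicit Event(bool enabled) : enabled_(enabled) {}

    // The key is a C-string, so a literal is never copied into a std::string
    // unless the event is enabled. Formatting happens only inside the branch.
    template <class T>
    Event& detail(const char* key, const T& value) {
        if (enabled_) {
            out_ += key;
            out_ += "=";
            out_ += Traceable<T>::toString(value);
            out_ += " ";
        }
        return *this;
    }

    const std::string& text() const { return out_; }

private:
    bool enabled_;
    std::string out_;
};
```

For a disabled event, `detail` reduces to a single branch and `Traceable<Point>::toString` is never invoked; for an enabled event the formatted pair is appended to the output.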
Evan Tschannen 8ebf771392 cleanup cluster controller trace events 2019-03-30 14:17:18 -07:00
A.J. Beamon 71e2fdafb8 Changes to ratekeeper camel case 2019-03-27 08:24:25 -07:00
Evan Tschannen 5e03e178de
Merge pull request #1345 from ajbeamon/support-multiple-client-or-worker-issues
Add support for a client or worker having multiple issues.
2019-03-24 17:27:50 -07:00
Evan Tschannen d45159ebf7
Merge pull request #1307 from jzhou77/ratekeeper
Monitor placement of Ratekeeper and DataDistributor
2019-03-24 17:26:07 -07:00
Evan Tschannen d6ad027d37 ratekeeper needs to be recruited for proxies to make progress, so if one has not registered with the cluster controller by the time we are accepting commits, recruit a new one 2019-03-24 16:48:24 -07:00
Evan Tschannen f426d732ea fix: forgot to remove one location where id_used was incremented for distributor and ratekeeper 2019-03-24 16:04:59 -07:00
Evan Tschannen e8948726e8 once we recruit a ratekeeper, do not allow any other ratekeepers to register 2019-03-24 11:04:39 -07:00
Jingyu Zhou 40eec20252 Restore master PID in worker registration
This fix was lost during a merge.
2019-03-23 21:02:11 -07:00
Jingyu Zhou 3ef26e6be3 Fix fitness assignment statements
Found by MacOS build.
2019-03-23 19:16:04 -07:00
Evan Tschannen 1fc6937802 changed NetworkAddressList to at most two addresses for performance 2019-03-23 17:54:46 -07:00
Evan Tschannen b51a24453e the data distributor and ratekeeper are not included in id_used, but when comparing equally good options we prefer to avoid sharing with those roles
excluded data distributor and ratekeeper were improperly killed when the best option was also excluded
2019-03-23 13:25:36 -07:00
Jingyu Zhou fdc5b5ddbf Fix: spurious ratekeeper registration
A rare race condition:
-r simulation -f ./foundationdb/tests/slow/WriteDuringReadAtomicRestore.txt -s 114256311 -b on

- A is the ratekeeper.
- CC recruits B and B starts
- CC halts ratekeeper A and A is halted
- A registers back with CC, which then halts B. CC sets A to be the ratekeeper.

CC starts recruiting and finds A is the best machine, but skips recruiting
because it thinks A is already used. Now the cluster is left with no ratekeeper.

Fix by disallowing ratekeeper registration with previous ID.
2019-03-23 11:03:51 -07:00
Jingyu Zhou 6523cd4931 Fix: recruit ratekeeper is not triggered 2019-03-23 09:20:54 -07:00
Evan Tschannen 2da46e3172 fix: halt if datacenters are different 2019-03-22 23:53:21 -07:00
Evan Tschannen d34c56c9a5 ensure that the processId exists in id_worker before accessing it 2019-03-22 18:54:39 -07:00
Evan Tschannen 36ab852bb1 Merge branch 'master' into ratekeeper
# Conflicts:
#	fdbserver/ClusterController.actor.cpp
2019-03-22 18:41:00 -07:00
Evan Tschannen ddb6058770 simplified ratekeeper monitoring loop 2019-03-22 18:22:45 -07:00
Jingyu Zhou 12917d8c7d Add actors to store halt request futures
Address best fitness in checking better DD or RK.
2019-03-22 18:06:38 -07:00
Jingyu Zhou e8977aeb98 Remove clusterControllerDcId check
This is no longer needed since it'll be set in the ctor.
2019-03-22 18:01:54 -07:00
Evan Tschannen 82bc447e29 startRatekeeper is responsible for updating serverDBInfo 2019-03-22 17:56:16 -07:00
Evan Tschannen 82c80c225d make sure id_worker is updated before setting ratekeeper or data distribution 2019-03-22 17:08:54 -07:00
Evan Tschannen 6a9c9d79cc
Update fdbserver/ClusterController.actor.cpp 2019-03-22 17:00:58 -07:00
Evan Tschannen 70b1c88cdd
Update fdbserver/ClusterController.actor.cpp 2019-03-22 17:00:52 -07:00
Jingyu Zhou 16f54577ee Restore master PID in cluster controller worker registration
CC may think master failed and clear the master PID, which can block both data
distributor and ratekeeper recruitment. Fix by restoring it during worker
registration.
2019-03-22 14:53:05 -07:00