Commit Graph

640 Commits

Author SHA1 Message Date
sfc-gh-tclinkenbeard 5c2d7b6080 Create RangeResult type alias 2021-05-03 13:14:16 -07:00
sfc-gh-tclinkenbeard f9ede75b42 Remove unused variable in ClusterController.actor.cpp 2021-05-03 11:10:43 -07:00
Markus Pilman 54919d4f3b Merge remote-tracking branch 'sfc/features/actor-lineage' into features/actor-lineage 2021-04-28 09:22:14 -06:00
Evan Tschannen 1f98dec1df cleaned up default constructed maps 2021-04-26 19:26:25 -07:00
sfc-gh-tclinkenbeard dc577b6608 Fix some bugs in distribution of configBroadcaster interface 2021-04-26 18:46:22 -07:00
sfc-gh-tclinkenbeard 7211d838cf Remove broadcastConfigDatabase actor 2021-04-26 15:54:08 -07:00
Evan Tschannen 451609e6be code cleanup 2021-04-26 10:16:18 -07:00
Evan Tschannen 50bb9b51b4 simulation does recruitment twice and compares the results to ensure recruitment is deterministic 2021-04-26 10:13:59 -07:00
Evan Tschannen 49ca48f82e fix: tlog recruitment could select more than the desired amount of tlogs
fix: tlog recruitment did not attempt to avoid longLivedStateless processes
2021-04-26 10:09:44 -07:00
Evan Tschannen 7503964ee9 recruitment tries to avoid degraded processes altogether, rather than just the worst one. Since this is a behavior change from the backup recruitment, we cannot compare degraded between the two recruitments 2021-04-26 10:01:54 -07:00
Evan Tschannen ccfc77f6fb changed preferredSharing to be ordered, so that recruitment will always share with the same other role when everything else is equal 2021-04-26 09:57:46 -07:00
sfc-gh-tclinkenbeard 9bed1f7aa5 Run SimpleConfigBroadcaster on cluster controller 2021-04-25 17:20:02 -07:00
Evan Tschannen b61a911685 removed an ASSERT that was for debugging purposes, and increased the max commit latency, because it can be spuriously triggered by dummy transactions that take 5+ seconds each 2021-04-21 14:30:06 -07:00
Evan Tschannen e18c9961b4 rewrote tlog recruitment logic so that it is deterministic, to prevent better master exists from triggering spuriously 2021-04-21 00:22:33 -07:00
Lukas Joswiak c81e1e9519 Add sampling profiler frequency to global config 2021-04-19 22:46:57 -07:00
RenxuanW 4bf7218e8f
Merge pull request #4635 from RenxuanW/priority_logging
Log a warning when remote dc is disabled (priority < 0)
2021-04-15 17:00:41 -07:00
Lukas Joswiak 7de23918c0 Add comments, fix erase bug, make optimizations 2021-04-14 10:56:33 -07:00
Lukas Joswiak c38ddf5eb7 Add comments 2021-04-14 10:56:33 -07:00
Lukas Joswiak 7ba7257cd2 Store global config data on heap 2021-04-14 10:56:33 -07:00
Lukas Joswiak 1c60653c2a Add fix to conditionally set global config history 2021-04-14 10:56:33 -07:00
Lukas Joswiak 6de28dd916 clang-format 2021-04-14 10:56:33 -07:00
Lukas Joswiak 1260385965 Use object to wrap global configuration history 2021-04-14 10:56:32 -07:00
Lukas Joswiak fb9a929780 Fix issue with freed memory being accessed 2021-04-14 10:56:32 -07:00
Lukas Joswiak c3f68831af Move existing ClientDBInfo variables to global configuration 2021-04-14 10:56:32 -07:00
Lukas Joswiak 7bb0b3d899 Use commit version for global configuration updates
FIXME: There is a memory issue where the underlying data for values set
in the `data` field of GlobalConfig will be freed shortly after being
set.
2021-04-14 10:56:32 -07:00
Lukas Joswiak f1415412f1 Add global configuration framework implementation 2021-04-14 10:56:32 -07:00
Evan Tschannen bd6db9ca7c
Update fdbserver/ClusterController.actor.cpp
Co-authored-by: Markus Pilman <markus.pilman@snowflake.com>
2021-04-13 15:13:45 -07:00
RenxuanW 7be8dab045 Change DcPriorityNegative to CCDcPriorityNegative 2021-04-08 16:00:37 -07:00
RenxuanW 738e7402f7 Log a warning when remote dc is disabled (priority < 0) 2021-04-08 15:36:52 -07:00
RenxuanW f3d5fa4750 Revert "Log a warning when remote dc's priority doesn't match the original primary."
This reverts commit 1d701e8bcf.
2021-04-08 15:19:43 -07:00
RenxuanW 1d701e8bcf Log a warning when remote dc's priority doesn't match the original primary. 2021-04-08 14:38:37 -07:00
Evan Tschannen a90c26f1d0 The master, proxies, and resolver all need to have the same machine class fitness function besides best fit to ensure recruitment is deterministic
if the first GRV proxy or resolver is forced to share a process, it should prefer to share with the commit proxy so that the commit proxy has more potential options it can share with
2021-04-08 14:29:12 -07:00
Evan Tschannen 5695a1816f fix: requiredFitness was being set to one higher than the actual requirement 2021-04-07 21:31:14 -07:00
Evan Tschannen 1b1f73ea16 added comments 2021-04-07 20:40:42 -07:00
Evan Tschannen 4d8dd0b0a0 fix: desired must be greater than or equal to required 2021-04-07 20:32:45 -07:00
Evan Tschannen 14213b0151 code cleanup 2021-04-07 20:06:30 -07:00
Evan Tschannen 15e8b43961 rewrote getWorkersForTLogs to do a much better job of avoiding degraded processes and processes in the same DC as the cluster controller 2021-04-07 19:57:24 -07:00
Evan Tschannen c27d82cecd tlog recruitment used a degraded LogClass process over a non-degraded TransactionClass process
tlog recruitment would not use TransactionClass processes if it fulfilled the required amount with LogClass processes
Better master exists did not account for how many times a process had been used when comparing recruitments
Better master exists did not account for the fact that tlogs prefer to be in a different dc than the cluster controller
RoleFitness comparison did not properly order count before degraded or bestFit
betterCount was returning worstFit when worstIsDegraded did not match
backupWorker recruitment did not attempt to avoid sharing processes with other roles
If any of the commit_proxy, grv_proxy, or resolver are forced to share a process, allow the recruitment for all of them to share to an equal degree; this change allows BetterMasterExists to be refactored as a tuple comparison
2021-04-07 16:04:08 -07:00
Markus Pilman 50342b5082 fix a second low-latency bug 2021-03-29 13:31:26 -06:00
Markus Pilman 8555723b98 removing testing case 2021-03-26 15:46:54 -06:00
Markus Pilman 43bed1d9dd Fix bug where betterMasterExist and recruitment disagree 2021-03-26 15:06:59 -06:00
Evan Tschannen 10b6b5d710 If the current configuration does not have a satellite fallback policy we do not care if the old configuration is in fallback mode 2021-03-23 13:02:31 -07:00
A.J. Beamon 99f3bb6d7d
Merge pull request #4509 from sfc-gh-etschannen/feature-bme-count
Do not trigger BetterMasterExists if it lowers the number of processes
2021-03-22 13:43:24 -07:00
Zhe Wu 15f3699e22 Add targeting DC ids in the tlog recruitment event trace. 2021-03-19 14:10:38 -07:00
Meng Xu 0cedef123b
Merge pull request #4518 from halfprice/zhewu/log-tlog-recruitment-failure-reason
Logging more detailed information during Tlog recruitment
2021-03-19 11:36:05 -07:00
Zhe Wu 58d9f47782 log fitness for excluded workers as well 2021-03-19 11:04:53 -07:00
Zhe Wu 4c00361f1c Add comment for 'getWorkersForTlogs' method, and addressed TraceEvent formatting comments. 2021-03-18 21:33:43 -07:00
Zhe Wu 9419387295 Update logging field. 2021-03-18 14:53:43 -07:00
Evan Tschannen 2ff63f544e
Update fdbserver/ClusterController.actor.cpp
Co-authored-by: Lukas Joswiak <lukas.joswiak@snowflake.com>
2021-03-18 13:45:51 -07:00
Zhe Wu 451b14af09 Log detailed information when a worker is considered as unavailable by the cluster controller for TLog recruitment. 2021-03-18 12:18:03 -07:00
Zhe Wu 6468c5aed6 Fix string join 2021-03-17 23:46:11 -07:00
Zhe Wu 1205650a69 Log the dcid during TLog recruitment, so that we can tell in which DC the recruitment is happening 2021-03-17 23:22:42 -07:00
Evan Tschannen 9aeb69ca1c added a comment 2021-03-16 14:19:23 -07:00
Evan Tschannen d0f134c20e added a comment 2021-03-16 13:17:56 -07:00
Evan Tschannen 2a272e525f fix compile error 2021-03-16 12:21:21 -07:00
Evan Tschannen 10fd094920 Better master exists should not trigger if it will lower the total number of processes being recruited 2021-03-16 12:14:19 -07:00
FDB Formatster df90cc89de apply clang-format to *.c, *.cpp, *.h, *.hpp files 2021-03-10 10:18:07 -08:00
Evan Tschannen 346a4e3ecd Merge branch 'release-6.3'
# Conflicts:
#	fdbcli/fdbcli.actor.cpp
#	fdbrpc/LoadBalance.actor.h
#	fdbrpc/MultiInterface.h
#	fdbserver/ClusterController.actor.cpp
#	fdbserver/MasterProxyServer.actor.cpp
#	fdbserver/masterserver.actor.cpp
2021-03-01 18:52:06 -08:00
Meng Xu 33eb1de00e Add some comment to log system
and resolve review comment by deleting my questions.
2021-02-19 21:44:13 -08:00
Meng Xu 9122be4d81 Add comments to HA code and loadBalance code 2021-02-10 13:51:36 -08:00
Richard Chen c77d9e4abe merge conflicts 2020-12-02 21:53:19 +00:00
Markus Pilman bdd3dbfa7d remove duplicates 2020-11-10 14:01:07 -07:00
sfc-gh-tclinkenbeard 4669f837fa Add uses of makeReference 2020-11-07 22:10:18 -08:00
Xin Dong 99d31391ca Fixed a crash found by nightly correctness. 2020-11-03 09:28:04 -08:00
Richard Chen bbf5bdf6da fix stable interfaces test and corresponding changes in simulator 2020-10-12 18:25:12 +00:00
Richard Chen 5488ff1d81 draft diff protocol 2020-10-12 18:24:03 +00:00
Richard Chen 41843f07e6 add simulator support for different process versions and ProtocolVersion test 2020-10-12 18:19:31 +00:00
Xin Dong 175d52312a Prevent segmentation fault. 2020-10-08 13:36:15 -07:00
Young Liu cc5bc16bd8 Rename more places from proxy to commit proxy 2020-09-15 22:29:49 -07:00
Young Liu 35bef73a1c Rename proxy to commit proxy 2020-09-10 17:44:15 -07:00
Young Liu 87693cae81 merge master branch and resolve conflicts 2020-09-02 13:44:33 -07:00
Evan Tschannen 12edadd059 Merge branch 'release-6.3'
# Conflicts:
#	CMakeLists.txt
#	fdbclient/Knobs.cpp
#	fdbclient/MasterProxyInterface.h
#	fdbrpc/simulator.h
#	fdbserver/MasterProxyServer.actor.cpp
#	tests/fast/CycleAndLock.txt
#	tests/fast/TxnStateStoreCycleTest.txt
#	tests/fast/VersionStamp.txt
#	tests/slow/ParallelRestoreOldBackupApiCorrectnessAtomicRestore.txt
#	tests/slow/ParallelRestoreOldBackupCorrectnessCycle.txt
#	versions.target
2020-08-31 19:33:34 -07:00
Evan Tschannen d42a6b6ea7 remove spammy trace event 2020-08-31 10:37:00 -07:00
Young Liu 19df032aec Change some formatting issues 2020-08-13 15:30:21 -07:00
Young Liu 4a30492186 Remove debug trace 2020-08-13 14:42:00 -07:00
Young Liu 79ce16650d merge master branch 2020-08-11 19:22:10 -07:00
Young Liu ba803a5ea3 Fixed formatting issues and removed GRV related code in MasterProxy 2020-08-11 18:54:54 -07:00
Young Liu 104bac3cbd Add trace to debug 2020-08-07 13:02:41 -07:00
Young Liu 56cc15ee71 Add trace to debug 2020-08-07 01:02:07 -07:00
Young Liu d6a23a4d6b Resolve comments to make GRV proxy a separate process class 2020-08-06 00:01:57 -07:00
Young Liu 30ea639666 Remove debug traces 2020-07-29 07:55:05 -07:00
Young Liu f7b76a92af pass joshua 2020-07-29 07:26:55 -07:00
Meng Xu a2089b354a RemoveServersSafely:Safety check toKill1 to avoid cluster getting stuck
toKill1 and toKill2 are random subsets of all processes. If we simply kill all processes in toKill1 or toKill2,
we may kill so many processes that the cluster becomes unavailable and gets stuck.

Just as toKill2 is modified when killing it would make the cluster unavailable,
we should do the same thing for toKill1
2020-07-28 21:07:31 -07:00
Young Liu 1826ac75d5 Add some trace events to debug 2020-07-25 18:16:08 -07:00
Young Liu 0fc681cc3c Remove some code comments 2020-07-23 22:29:51 -07:00
Young Liu 618414a416 Fix bugs related to getting proxies workers 2020-07-23 18:32:47 -07:00
Young Liu 229ab0d5f1 Fix some conflicts and remove debugging trace events 2020-07-22 23:35:46 -07:00
Young Liu 525f10e30c Merge master branch 2020-07-22 16:08:49 -07:00
Young Liu 302cf5c45f Remove debug trace events 2020-07-22 12:20:22 -07:00
Young Liu 2703cedac5 Fixed known bugs 2020-07-17 22:24:52 -07:00
Young Liu 21c1998cca Fix MaxTLogQueueSize Bug 2020-07-16 15:56:04 -07:00
Young Liu 5b06d69d25 Pass watches test 2020-07-15 00:37:41 -07:00
Andrew Noyes f470ba8316 Remove using namespace std::rel_ops
This causes the following to not compile anymore

#include <utility>
#include <vector>

using namespace std::rel_ops;

int main() {
    std::vector<int> xs;
    return xs.rbegin() != xs.rend();
}

See https://godbolt.org/z/s1977n
2020-07-10 22:58:15 +00:00
Meng Xu 9668f32df5
Merge pull request #3388 from apple/release-6.3
Merge Release 6.3 into master
2020-06-18 08:50:25 -07:00
Vishesh Yadav 3068a37e1b refactor: Remove dead failureDetectionServer code 2020-06-17 15:40:21 -07:00
sfc-gh-tclinkenbeard 99bf993815 Replace BOOST_NOEXCEPT with noexcept 2020-06-09 22:39:19 -07:00
negoyal cf13e00a8f Merge remote-tracking branch 'origin/release-6.3' into fdb_cache_wo_allocator 2020-06-01 17:38:31 -07:00
Markus Pilman c2bc75516f Merge branch 'release-6.3' of github.com:apple/foundationdb into features/trace-roles 2020-05-14 10:34:53 -07:00
Evan Tschannen f17f00fdd5 Merge branch 'release-6.2'
# Conflicts:
#	documentation/sphinx/source/release-notes.rst
2020-05-10 22:33:38 -07:00
Evan Tschannen 3eaa9d6397 fix: do not report datacenter version difference before both datacenters report a correct version 2020-05-10 17:49:09 -07:00
Markus Pilman 5f9b127e56 Emit traces regularly about role assignment
We are currently emitting Role transition traces when a role starts and
when it ends. While this is useful for debugging, it doesn't work well
with tools that inject data and might potentially miss some trace lines.

We do decorate each trace line with the roles assigned to that
particular process, however, this is not sufficient for tools that can
make use of the UID -> Role mapping
2020-05-08 16:27:57 -07:00
negoyal dd033736ed Merge branch 'master' into fdb_cache_subfeature2 2020-05-04 17:29:43 -07:00
Evan Tschannen 9e5037291d fix compiler errors 2020-05-01 14:30:50 -07:00
Evan Tschannen a442565e13 more work towards shrinking locality 2020-04-18 21:29:38 -07:00
Evan Tschannen b04478704e fixed improper use of std::set erase 2020-04-17 16:45:22 -07:00
Evan Tschannen 33efb9ec97 code cleanup based on review comments 2020-04-17 15:05:01 -07:00
Evan Tschannen b667d5442f fix: not all removed endpoints were actually removed 2020-04-17 13:47:54 -07:00
Evan Tschannen 9b5130194d avoid updating the same endpoint multiple times 2020-04-11 21:05:30 -07:00
Evan Tschannen 1476057996 properly cache serialization of serverDBInfo 2020-04-11 19:30:05 -07:00
Evan Tschannen 07cc0a8d74 code cleanup 2020-04-10 17:02:11 -07:00
Evan Tschannen ce4493f679 many bug fixes 2020-04-10 13:45:16 -07:00
Evan Tschannen a51c92854a Merge branch 'master' into feature-tree-broadcast
# Conflicts:
#	fdbserver/WorkerInterface.actor.h
#	fdbserver/worker.actor.cpp
2020-04-06 21:09:44 -07:00
Evan Tschannen 2a1bd97120 fix compilation errors 2020-04-06 20:58:43 -07:00
Evan Tschannen 477d66b46d implemented a tree broadcast for txn state message for proxies, and serverDBInfo for workers 2020-04-05 23:09:36 -07:00
negoyal acaf91ac47 Merge branch 'master' into fdb_cache_subfeature2 2020-03-26 13:33:08 -07:00
Jingyu Zhou 5b36dcaad5 Fix oldest backup epoch for backup workers
The oldest backup epoch is piggybacked in LogSystemConfig from master to
cluster controller and then to all workers. Previously, this epoch was set
to the current master epoch, which is wrong.
2020-03-20 20:15:09 -07:00
Evan Tschannen e08f0201f1 merge release 6.2 into master 2020-03-17 12:51:47 -07:00
Evan Tschannen 2038a56ff4
Merge pull request #2819 from etschannen/feature-first-proxy
A "proxy" class process would not be preferred as the "first proxy" for restore and DR purposes
2020-03-16 13:53:28 -07:00
Evan Tschannen 012344e297 refactor getWorkersForRoleInDatacenter 2020-03-16 11:50:17 -07:00
Evan Tschannen 79d5511149 A "proxy" class process would not be preferred as the "first proxy" for restore and DR purposes 2020-03-13 17:49:02 -07:00
Evan Tschannen 4640edf5d6 do not recruit satellite tlogs when usable regions=1 2020-03-13 10:24:52 -07:00
Evan Tschannen 303df197cf Merge branch 'release-6.2'
# Conflicts:
#	CMakeLists.txt
#	bindings/c/test/mako/mako.c
#	documentation/sphinx/source/release-notes.rst
#	fdbbackup/backup.actor.cpp
#	fdbclient/NativeAPI.actor.cpp
#	fdbclient/NativeAPI.actor.h
#	fdbserver/DataDistributionQueue.actor.cpp
#	fdbserver/Knobs.cpp
#	fdbserver/Knobs.h
#	fdbserver/LogRouter.actor.cpp
#	fdbserver/SkipList.cpp
#	fdbserver/fdbserver.actor.cpp
#	flow/CMakeLists.txt
#	flow/Knobs.cpp
#	flow/Knobs.h
#	flow/flow.vcxproj
#	flow/flow.vcxproj.filters
#	versions.target
2020-03-06 18:22:46 -08:00
Evan Tschannen f3ac2c9180 renamed a variable 2020-03-04 18:49:21 -08:00
Evan Tschannen b3ea9d5896 Do not allow the cluster controller to mark any process as failed within 30 seconds of startup 2020-03-04 18:45:26 -08:00
negoyal cd949eca71 Merge branch 'master' into fdb_cache_subfeature2 2020-02-26 11:22:08 -08:00
Evan Tschannen 96258b9809 Merge branch 'release-6.2'
# Conflicts:
#	documentation/sphinx/source/release-notes.rst
#	fdbcli/fdbcli.actor.cpp
#	fdbclient/ManagementAPI.actor.cpp
#	fdbrpc/FlowTransport.actor.cpp
#	fdbserver/ClusterController.actor.cpp
#	fdbserver/DataDistribution.actor.cpp
#	fdbserver/DataDistribution.actor.h
#	fdbserver/DataDistributionQueue.actor.cpp
#	fdbserver/KeyValueStoreMemory.actor.cpp
#	fdbserver/MasterProxyServer.actor.cpp
#	fdbserver/QuietDatabase.actor.cpp
#	fdbserver/SkipList.cpp
#	fdbserver/StorageMetrics.actor.h
#	fdbserver/TLogServer.actor.cpp
#	fdbserver/fdbserver.actor.cpp
#	fdbserver/storageserver.actor.cpp
#	fdbserver/workloads/KVStoreTest.actor.cpp
#	flow/CMakeLists.txt
#	flow/Knobs.cpp
#	flow/Knobs.h
#	flow/genericactors.actor.cpp
#	flow/serialize.h
2020-02-21 19:09:16 -08:00
Evan Tschannen 8b768e66df
Merge pull request #2694 from dongxinEric/feature/2663/specialize-policy-for-zoneid-in-cc
Added a specialized algorithm for PolicyOne and PolicyAcross(,'zoneId…
2020-02-20 14:46:23 -08:00
Evan Tschannen 574e88ba8e updateGoodRemoteRecruitmentTime was unnecessary because the only way findRemoteWorkers would return would be after a new server has joined which already resets goodRemoteRecruitmentTime 2020-02-20 13:46:22 -08:00
Xin Dong 99095c9224 Again make Clang happy. 2020-02-20 09:50:22 -08:00
Xin Dong 298d6cb3d7 Address review comments. 2020-02-20 09:34:01 -08:00
Evan Tschannen fbd45963d8 The cluster controller waits until no new workers register for 1.0 before starting a bad recruitment 2020-02-19 16:48:30 -08:00
Xin Dong 89fcbb2055 Make clang happy 2020-02-19 09:44:15 -08:00
Xin Dong efc0d7f9d5 Added a specialized algorithm for PolicyOne and PolicyAcross(,'zoneId',PolicyOne()) to find a set of TLog servers which will be able to fulfill the policy later. 2020-02-19 09:25:57 -08:00
negoyal 85cc35e81e Merge branch 'master' into HEAD 2020-02-05 14:59:55 -08:00
Evan Tschannen 844c8511c4
Merge pull request #2588 from jzhou77/backup-worker
Integrate new backup worker with existing backup command
2020-02-05 14:14:43 -08:00
Jingyu Zhou 52c6737411 Rename backupLoggingEnabled as backupWorkerEnabled
To highlight the changes for 7.0 backup changes. By default,
backup_worker_enabled flag is set for 7.0 version.
2020-02-04 10:09:16 -08:00
Jingyu Zhou 0db03f1d3c Use backup_logging_enabled flag
The default is to enable new backup workers. Users can disable this flag to
turn off the backup worker feature.
2020-02-03 20:03:22 -08:00
Evan Tschannen 4524831456
Merge pull request #2518 from vishesh/task/failmon-remove-server
FailureMonitoring: Server processes no longer need to talk to ClusterController
2020-02-03 17:22:50 -08:00
Jingyu Zhou 38aa1903fd Add a DB configuration option for backup workers
Right now, the default is to keep the old backup behavior, i.e., do NOT use
backup workers. Specifically, if BackupType is not set (or is set to default),
the master will not recruit backup workers and will not add pseudo locality for
backup workers.

The StartFullBackupTaskFunc is updated to check if backup worker is enabled.
Only when it is not enabled, starting a backup will wait on all backup workers
to be started.
2020-01-31 19:29:09 -08:00
Jingyu Zhou 6ddf73e26a Remove code introduced when resolving merge conflicts 2020-01-22 21:23:38 -08:00
Jingyu Zhou c6c39ca99d Update better master exist with backup workers
During recruitment, if there is no desired log router count, use tlog size
instead, because the number of backup workers has to be larger than 0.
2020-01-22 19:43:40 -08:00
Jingyu Zhou 56a2c37071 Recruit backup workers for single region
Enable log router tags for single region, which are popped by backup workers.
Need to add noop for backup workers if there are no active backups.
2020-01-22 19:42:13 -08:00
Jingyu Zhou 19d6a889ff Recruit backup workers for old epochs
If there are unfinished ranges in the old epochs, the new master will recruit
backup workers responsible for finishing these ranges. These workers remain in
the cluster until the next epoch, when they will remove themselves.
2020-01-22 19:38:45 -08:00
Jingyu Zhou 7da9f47f26 Enable pop from backup workers
This is still WIP as some edge cases can trigger test failure, most likely due
to not popping mutations by backup workers when epoch ends.
2020-01-22 19:38:45 -08:00
Jingyu Zhou ece3cadf8e Recruit backup worker during master recovery
Right now recruit the same number as TLogs. The backup worker does nothing.
2020-01-22 19:37:48 -08:00
Jingyu Zhou de8d953865 Add backup role, class, and worker skeleton 2020-01-22 19:35:30 -08:00
Vishesh Yadav daef5f011a Merge remote-tracking branch 'apple/master' into task/failmon-remove-server 2020-01-21 13:20:15 -08:00
Evan Tschannen 3f9d9d8b84 Merge branch 'release-6.2'
# Conflicts:
#	CMakeLists.txt
#	cmake/FlowCommands.cmake
#	documentation/sphinx/source/release-notes.rst
#	fdbclient/StorageServerInterface.h
#	fdbserver/DataDistributionTracker.actor.cpp
#	fdbserver/MasterProxyServer.actor.cpp
#	fdbserver/fdbserver.actor.cpp
#	flow/Knobs.h
#	flow/Platform.cpp
#	versions.target
2020-01-16 18:37:47 -08:00
Evan Tschannen d55e56993d fix: the cluster controller would not recruit more remote logs before the database became fully_recovered 2020-01-10 12:21:48 -08:00
Alvin Moore 7628d04fb9 Merge branch 'release-6.2' of github.com:apple/foundationdb into release_6.2_merge
# Conflicts:
#	documentation/sphinx/source/release-notes.rst
2020-01-09 07:21:16 -08:00
mpilman d3d6016c90 Merge remote-tracking branch 'negoyal/fdb_cache_subfeature2' into features/cache-initialization 2020-01-07 19:53:09 -08:00
Vishesh Yadav 6e6cfaff16 Cleanup old Failure Monitoring code 2020-01-07 15:53:32 -08:00
negoyal 29b77863f0 Cache warmup and Consistency check workload changes. 2020-01-07 13:06:58 -08:00
Evan Tschannen 3eae401886 fix: we were recruiting one too few oldLogRouters
code cleanup
2020-01-02 15:05:44 -08:00
Evan Tschannen 5e5e618da0 during recovery, only send the full serverDBInfo to processes that are part of the new generation 2019-12-09 13:17:49 -08:00
Evan Tschannen bcce5968a4 recruit oldLogRouters on TLogs, do not recruit oldLogRouters on the cluster controller if possible 2019-12-09 13:12:13 -08:00
mpilman 821edcb207 Register caches through keyspace
This also removes the old mechanism that registers them
through the serverDBInfo.

Caches do now self-recruit at startup
2019-12-06 13:28:44 -08:00
negoyal cf2563f1c7 Mix of various things, a lot of which will change. 2019-12-05 17:10:32 -08:00
Evan Tschannen 3c769fcf60 Merge branch 'release-6.2'
# Conflicts:
#	CMakeLists.txt
#	documentation/sphinx/source/release-notes.rst
#	fdbserver/ClusterController.actor.cpp
#	fdbserver/MasterProxyServer.actor.cpp
#	versions.target
2019-11-22 15:39:19 -08:00
Evan Tschannen ebcb2f79ed Merge branch 'master' of github.com:apple/foundationdb 2019-11-22 15:34:49 -08:00
A.J. Beamon 7c801513e2 Fix cases where latency band config could be discarded during recovery or process start. 2019-11-20 11:44:18 -08:00
Evan Tschannen 8d3ef89540 Merge branch 'release-6.2'
# Conflicts:
#	CMakeLists.txt
#	documentation/sphinx/source/release-notes.rst
#	fdbclient/MutationList.h
#	fdbserver/MasterProxyServer.actor.cpp
#	versions.target
2019-11-14 15:49:56 -08:00
Evan Tschannen ffc89d1182 fix: dd test recruitment should prefer the location of ratekeeper over other used processes 2019-11-13 12:58:55 -08:00
Balachandar Namasivayam 2e41497580 This commit tries to distribute RK and DD among other empty available processes. 2019-11-12 17:52:42 -08:00
Balachandar Namasivayam f5282f2c7e Fix bug where DD or RK could be halted and re-recruited in a loop for certain valid process class configurations. Specifically, recruitment of DD or RK takes into account that master process is preferred over proxy, resolver or cc.
But check for better DD only looks for better machine class ignoring that the new recruit could share a proxy or resolver or CC. Also try to balance the distribution of the DD and RK role if there are enough processes to do so.
2019-11-12 14:22:36 -08:00
negoyal a4a0bf18f9 Merging with Master. 2019-11-12 13:01:29 -08:00
Evan Tschannen 688940b685 merge 6.2 into master 2019-10-21 11:43:46 -07:00
Evan Tschannen 43e99ef6a4 fix: better master exists must check if fitness is better for proxies or resolvers before looking at the count of either of them 2019-10-17 13:18:31 -07:00
Evan Tschannen 298b815109 one proxy or resolver with best fitness no longer prevents more proxies or resolvers from being recruited with good fitness 2019-10-14 18:32:17 -07:00
Evan Tschannen 5064d91b75 fix: the cluster controller would not change to a new set of satellite tlogs when they become available in a better satellite location 2019-10-14 18:31:23 -07:00
Evan Tschannen 35e816e9ad added the ability to configure satellite_logs by satellite location, this will overwrite the region configuration if both are present 2019-10-14 18:30:15 -07:00
A.J. Beamon 31ce56eddf Add cluster controller metrics 2019-10-03 15:29:11 -07:00
Evan Tschannen b495cc697b Merge branch 'release-6.2'
# Conflicts:
#	CMakeLists.txt
#	documentation/sphinx/source/release-notes.rst
#	versions.target
2019-09-13 09:25:08 -07:00
Evan Tschannen a62862c105 add yieldedFutures to prevent slow tasks 2019-09-11 16:26:48 -07:00
Evan Tschannen 945cff1e5b the cluster controller caches the serialization of serverDBInfo, to avoid regenerating it many times 2019-09-10 14:27:22 -07:00
Meng Xu 39680fa515 StorageEngineSwitch:Clean up unnecessary trace
And do not trigger storage recruitment unnecessarily.
2019-08-19 14:11:57 -07:00
Meng Xu 4ab322f52c Merge branch 'master' into mengxu/storage-engine-switch-PR-v2 2019-08-19 13:06:32 -07:00
Meng Xu 3034a5e0c5 StorageRecruitment:Suppress outstanding req errors
When too many outstanding requests cannot find a worker for storage server
role, many same errors will be put into trace log. Only one error is enough
to alert the problem.

Too many same errors cause false positive in nightly test and thus should be suppressed.
2019-08-14 11:31:06 -07:00
Meng Xu a588710376 StorageEngineSwitch:Graceful switch
When fdbcli changes storeType for storage engines,
we switch the store type of storage servers one by one gracefully.
This avoids recruiting multiple storage servers on the same process,
which can cause OOM error.
2019-08-12 17:37:52 -07:00
Evan Tschannen 90e3b50213 Merge branch 'master' into feature-coordinator-connection
# Conflicts:
#	fdbclient/DatabaseContext.h
#	fdbclient/NativeAPI.actor.cpp
#	fdbclient/NativeAPI.actor.h
#	fdbserver/workloads/KillRegion.actor.cpp
2019-07-26 15:05:02 -07:00
Evan Tschannen be5d144b8b added status information on connected clients 2019-07-25 17:15:31 -07:00
Jingyu Zhou bbeaf0ebbb Add a monitorServerInfoConfig() call back
This was deleted during a code refactor in ef868f5. Because no tests were
complaining, we didn't find this until now.
2019-07-25 15:17:26 -07:00
Evan Tschannen 4a866290b7 Clients keep a persistent connection open with coordinators to get updates to the list of proxies
Status still needs to be updated with client information from the coordinators
2019-07-23 19:22:44 -07:00
Jingyu Zhou 50e7593c5b
Merge pull request #1796 from ajbeamon/remove-trace-event-underscores
Remove trace event underscores
2019-07-05 21:45:55 -07:00
A.J. Beamon 9f4b6fd770 Remove additional underscores 2019-07-05 08:12:25 -07:00
Alex Miller 7a500cd37f A giant translation of TaskFooPriority -> TaskPriority::Foo
This is so that APIs that take priorities don't take ints, which are
common and easy to accidentally pass the wrong thing.
2019-06-25 02:47:35 -07:00
Vishesh Yadav a8e408e268 run clang-format on changes 2019-06-10 14:10:24 -07:00
Vishesh Yadav 6fa7081a21 net: Don't make FailureMonitoring requests from client
This patch removes the need for clients to continuously contact
cluster coordinator for failure monitoring information. Instead, it
uses the FlowTransport to monitor the statuses of peers and update
FailureMonitor accordingly.
2019-06-09 00:43:38 -07:00
Evan Tschannen 29b96414e2 Merge branch 'release-6.1'
# Conflicts:
#	documentation/sphinx/source/release-notes.rst
#	fdbclient/NativeAPI.actor.cpp
#	fdbserver/Coordination.actor.cpp
#	flow/Arena.h
#	versions.target
2019-06-03 18:49:35 -07:00
Evan Tschannen 7c333dbc16 If a process receives a message in its clusterControllerInterface before becoming the cluster controller, if the process does not become the cluster controller in the next minute it should destroy the interface to prevent a memory leak. 2019-05-29 16:57:13 -07:00
A.J. Beamon 5f55f3f613 Replace g_random and g_nondeterministic_random with functions deterministicRandom() and nondeterministicRandom() that return thread_local random number generators. Delete g_debug_random and trace_random. Allow only deterministicRandom() to be seeded, and require it to be seeded from each thread on which it is used. 2019-05-10 14:01:52 -07:00
Andrew Noyes 6207d724f8 Fix all -Wunused-variable warnings 2019-04-15 18:13:00 -07:00
mpilman 1c16f87a4e Remove trace-calls to printable (in non-workloads) 2019-04-05 13:12:19 -07:00
mpilman c008e16c81 Defer formatting in traces to make them cheaper
This is the first part of making `TraceEvent` cheaper. The main idea is
to defer calls to any code that formats strings. These are the main
changes:

- TraceEvent::detail now takes a c-string instead of std::string for
  literals. This prevents unnecessary allocations if the trace is not
  going to be printed in the first place (for example for SevDebug).
  Before that `detail` expected a `std::string` as key, which meant that
  any string literal would be copied on each call.
- Templates Traceable and SpecialTraceMetricType. These templates can be
  specialized for any type that needs to be printed. The actual
  formatting will be deferred to after the `enabled` check. This
  provides two benefits: (1) if a TraceEvent is disabled, we don't pay
  for the formatting and (2) TraceEvent can trace types that it doesn't
  know about.
- TraceEvent::enabled will be set in the constructor if the Severity is
  passed. This will make sure that `TraceEvent::init` is not called.
- `TraceEvent::detail` will be inlined. So for disabled TraceEvent
  calls, a call to detail will only introduce an if-branch which is much
  cheaper than a function call.
2019-04-05 13:12:19 -07:00
Evan Tschannen 8ebf771392 cleanup cluster controller trace events 2019-03-30 14:17:18 -07:00
A.J. Beamon 71e2fdafb8 Changes to ratekeeper camel case 2019-03-27 08:24:25 -07:00
Evan Tschannen 5e03e178de
Merge pull request #1345 from ajbeamon/support-multiple-client-or-worker-issues
Add support for a client or worker having multiple issues.
2019-03-24 17:27:50 -07:00
Evan Tschannen d45159ebf7
Merge pull request #1307 from jzhou77/ratekeeper
Monitor placement of Ratekeeper and DataDistributor
2019-03-24 17:26:07 -07:00
Evan Tschannen d6ad027d37 ratekeeper needs to be recruited for proxies to make progress, so if one has not registered with the cluster controller by the time we are accepting commits, recruit a new one 2019-03-24 16:48:24 -07:00
Evan Tschannen f426d732ea fix: forgot to remove one location where id_used was incremented for distributor and ratekeeper 2019-03-24 16:04:59 -07:00
Evan Tschannen e8948726e8 once we recruit a ratekeeper, do not allow any other ratekeepers to register 2019-03-24 11:04:39 -07:00
Jingyu Zhou 40eec20252 Restore master PID in worker registration
This fix was lost during a merge.
2019-03-23 21:02:11 -07:00
Jingyu Zhou 3ef26e6be3 Fix fitness assignment statements
Found by MacOS build.
2019-03-23 19:16:04 -07:00
Evan Tschannen 1fc6937802 changed NetworkAddressList to at most two addresses for performance 2019-03-23 17:54:46 -07:00
Evan Tschannen b51a24453e the data distributor and ratekeeper are not included in id_used, but when comparing equally good options we prefer to avoid sharing with those roles
excluded data distributor and ratekeeper were improperly killed when the best option was also excluded
2019-03-23 13:25:36 -07:00
Jingyu Zhou fdc5b5ddbf Fix: spurious ratekeeper registration
A rare race condition:
-r simulation -f ./foundationdb/tests/slow/WriteDuringReadAtomicRestore.txt -s 114256311 -b on

- A is the ratekeeper.
- CC recruits B and B starts
- CC halts ratekeeper A and A is halted
- A registers back with CC, which then halts B. CC sets A to be the ratekeeper.

CC starts recruiting and finds that A is the best machine, but skips recruiting
because CC thinks A is already used. Now the cluster is left with no ratekeeper.

Fix by disallowing ratekeeper registration with previous ID.
2019-03-23 11:03:51 -07:00
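The race and its fix can be modeled with a small sketch. This is a simplified, hypothetical model (the `SketchController` type, its fields, and the string IDs are assumptions for illustration, not the cluster controller's real data structures). It shows only the rule stated above: a registration that carries a previously retired ratekeeper ID is rejected, so the newly recruited ratekeeper is not displaced.

```cpp
#include <cassert>
#include <set>
#include <string>

// Hypothetical model of the cluster controller's ratekeeper bookkeeping.
struct SketchController {
    std::string ratekeeperId;          // empty means "no ratekeeper"
    std::set<std::string> retiredIds;  // IDs that were halted or replaced

    // Recruitment picked a new ratekeeper; the previous one is being halted.
    void recruit(const std::string& newId) {
        if (!ratekeeperId.empty() && ratekeeperId != newId) {
            retiredIds.insert(ratekeeperId);
        }
        ratekeeperId = newId;
    }

    // A worker reports that it is running a ratekeeper.
    // Returns true if the registration is accepted.
    bool registerRatekeeper(const std::string& id) {
        if (retiredIds.count(id) != 0) {
            return false;  // stale registration with a previous ID
        }
        if (!ratekeeperId.empty() && ratekeeperId != id) {
            return false;  // a different ratekeeper is already in place
        }
        ratekeeperId = id;
        return true;
    }
};
```

Replaying the sequence from the commit message against this model: after `recruit("A")` and `recruit("B")`, a late `registerRatekeeper("A")` returns false and `B` remains the ratekeeper.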
Jingyu Zhou 6523cd4931 Fix: recruit ratekeeper is not triggered 2019-03-23 09:20:54 -07:00
Evan Tschannen 2da46e3172 fix: halt if datacenters are different 2019-03-22 23:53:21 -07:00
Evan Tschannen d34c56c9a5 ensure that the processId exists in id_worker before accessing it 2019-03-22 18:54:39 -07:00
Evan Tschannen 36ab852bb1 Merge branch 'master' into ratekeeper
# Conflicts:
#	fdbserver/ClusterController.actor.cpp
2019-03-22 18:41:00 -07:00
Evan Tschannen ddb6058770 simplified ratekeeper monitoring loop 2019-03-22 18:22:45 -07:00
Jingyu Zhou 12917d8c7d Add actors to store halt request futures
Address best fitness in checking better DD or RK.
2019-03-22 18:06:38 -07:00
Jingyu Zhou e8977aeb98 Remove clusterControllerDcId check
This is no longer needed since it'll be set in the ctor.
2019-03-22 18:01:54 -07:00
Evan Tschannen 82bc447e29 startRatekeeper is responsible for updating serverDBInfo 2019-03-22 17:56:16 -07:00
Evan Tschannen 82c80c225d make sure id_worker is updated before setting ratekeeper or data distribution 2019-03-22 17:08:54 -07:00
Evan Tschannen 6a9c9d79cc
Update fdbserver/ClusterController.actor.cpp 2019-03-22 17:00:58 -07:00
Evan Tschannen 70b1c88cdd
Update fdbserver/ClusterController.actor.cpp 2019-03-22 17:00:52 -07:00
Jingyu Zhou 16f54577ee Restore master PID in cluster controller worker registration
CC may think master failed and clear the master PID, which can block both data
distributor and ratekeeper recruitment. Fix by restoring it during worker
registration.
2019-03-22 14:53:05 -07:00
A.J. Beamon 4eb5715689 Add support for a client or worker having multiple issues. 2019-03-22 08:29:41 -07:00
Jingyu Zhou da338c3ad6 Avoid unnecessary recruiting of DD or RK
While waiting to recruit a data distributor or ratekeeper, a previous one
could have already joined, so we can skip this unnecessary recruiting.

Revert the change to worker.actor.cpp for ratekeeper. Instead, ratekeeper
recruitment should avoid a process that already has one. This fixes a bug
where the ratekeeper interface became a zombie, killing other healthy
ratekeepers but doing no useful work. Found by:

-r simulation --crash -f tests/fast/WriteDuringRead.txt -s 31858110 -b on
2019-03-21 22:40:07 -07:00
Evan Tschannen fe4464e786 fix: processClassFitness could be wrong if the client changed their class while rebooting 2019-03-21 17:56:04 -07:00
Jingyu Zhou 299961aecb Move ratekeeper or data distributor from excluded servers 2019-03-21 17:17:33 -07:00
Jingyu Zhou 48324ad4be Fix a race during ratekeeper registration
When a ratekeeper registers, the monitorRatekeeper wakes up and recruits a new
ratekeeper. Adding a 0s delay to avoid this.

If a ratekeeper is recruited on an existing machine, update the interface so
that the cluster controller can clear the ratekeeperID.
2019-03-21 12:56:56 -07:00
Evan Tschannen e692f0f70f fix: degraded is only used for tlog recruitment, so we should not use it in the fitness calculation for other roles 2019-03-21 11:23:49 -07:00
Jingyu Zhou 8edefda193 Fix test stuck due to invalid worker in cluster controller
Test case:
-r simulation --crash -f ./tests/rare/CloggedCycleWithKills.txt -s 688927581 -b off
2019-03-20 22:24:01 -07:00
Jingyu Zhou 937b6dde31 Fix a race of DD, RK, Master failure
If DD, RK, and Master all run on the same process and fail, recruiting a new
DD or RK could try to use the old master worker interface, which is invalid
and causes recruitment to get stuck.

Fix by adding a delay and checking that the master is valid before recruitment.
2019-03-20 16:19:20 -07:00
Jingyu Zhou ce5c6d18d2 Fix ratekeeper recruitment bug 2019-03-20 14:22:22 -07:00
Jingyu Zhou 86b687981b Fix ratekeeper and data distributor recruiting bug
Avoid multiple concurrent recruiting of ratekeepers with a recruiting flag.
Fix endless recruiting when the chosen worker is a proxy or a resolver --
prefer master in this case.
2019-03-20 10:00:31 -07:00
Jingyu Zhou 474abd81bd Move placement monitoring inside doCheckOutstandingRequests 2019-03-19 22:48:21 -07:00
Balachandar Namasivayam f9560e1abd Addressed Review Comments 2019-03-19 15:23:14 -07:00
Jingyu Zhou bc6fdaea3e Recruit a new ratekeeper before halting the old 2019-03-19 15:21:46 -07:00
Jingyu Zhou 0fb6a03c07 First round of review comment fixes for PR#1307 2019-03-19 11:29:19 -07:00
Jingyu Zhou 8d609eb51d Protect ratekeeper registration race during recruitment
This is similar to the one for DataDistributor.
2019-03-18 13:53:50 -07:00
Balachandar Namasivayam 5471725db5 Support config where the primary and remote DC's can be used as satellites. 2019-03-18 12:17:59 -07:00
Jingyu Zhou 2b41a97a6e Fix the issue of slow dying Data Distributor
Test with:
-r simulation -f ./foundationdb/tests/slow/CommitBug.txt -s 67828576 -b on

The test has the following event sequence:
- Time 113.3s, CC noticed DD failure, cleared DD interface.
- 1s later, DD rejoined and registered with CC.
- Time 131.7s, DD actor cancelled. This old DD raced to register with CC and
the failure monitor is not installed because monitorDataDistributor is stalled
waiting for new DD.
- Time 161.4s, new DD running. New DD recruiting was delayed due to no servers
in the period.

Fix by disabling DD registration during the recruiting process.
2019-03-17 22:19:23 -07:00
Jingyu Zhou 254c78053c Fix a segfault error
After wait, ServerDBInfo may have changed. Using the old copy is wrong.
2019-03-15 22:11:13 -07:00
Jingyu Zhou 12ddd56698 Fix Ratekeeper and DataDistributor placement
Make sure both RateKeeper and DataDistributor are placed in the same data
center as the Master. Make sure only one RateKeeper is live in the cluster as
well.
2019-03-15 17:09:28 -07:00
Jingyu Zhou bb5686eb75 Fix monitoring of DD and RK 2019-03-15 16:02:17 -07:00
Jingyu Zhou 9f6fe5f649 Merge remote-tracking branch 'apple/master' into ratekeeper 2019-03-15 11:30:04 -07:00
Jingyu Zhou 40860e0093 Attempt to fix. 2019-03-15 11:29:04 -07:00
Jingyu Zhou 99d521ef4f Monitor Ratekeeper and DataDistributor to use stateless processes
Since Ratekeeper and DataDistributor are no longer running with Master, they
might be running with stateful processes before a new Master becomes alive,
which is undesirable.

This PR adds monitoring of both Ratekeeper and DataDistributor at Cluster
Controller -- if Master runs on a stateless class and RK/DD runs on a worse
class, then RK/DD will be killed. I.e., RK/DD should be running on their own
classes or on the same stateless process as Master. After restart, RK/DD should
be running on a better process class.
2019-03-14 15:00:57 -07:00
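The placement rule in this commit message can be written down as a small predicate. This is a sketch under stated assumptions: the `Fitness` levels, the `Placement` struct, and `shouldHaltRole` are hypothetical names for illustration and do not reflect the actual fitness calculation in ClusterController.actor.cpp. It encodes only what the message says: when Master is on a stateless class, an RK/DD running on a worse class is halted, unless it shares the Master's process.

```cpp
#include <cassert>
#include <string>

// Hypothetical fitness ordering: a smaller value means a better fit.
enum class Fitness { Best = 0, Good = 1, Okay = 2, Worst = 3 };

// Hypothetical description of where a role currently runs.
struct Placement {
    std::string processId;
    Fitness fitness;      // fitness of the process for the role it hosts
    bool statelessClass;  // whether the process has a stateless class
};

// Returns true if the role (Ratekeeper or DataDistributor) should be halted
// so that it can be re-recruited on a better process.
bool shouldHaltRole(const Placement& master, const Placement& role) {
    if (!master.statelessClass) {
        return false;  // the rule applies only when Master is on a stateless class
    }
    if (role.processId == master.processId) {
        return false;  // sharing the Master's stateless process is acceptable
    }
    return role.fitness > master.fitness;  // worse class than Master: halt
}
```

For example, with Master on a `Good` stateless process, a Ratekeeper on a different process with `Worst` fitness would be halted, while one with `Best` fitness would be left alone.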
Meng Xu 5a10bf5dfc Merge branch 'master' into mengxu/tls-switch-status-PR 2019-03-14 10:35:12 -07:00
Evan Tschannen a2108047aa removed LocalitySetRef and IRepPolicyRef typedefs, because for clarity the Ref suffix is reserved for arena allocated objects instead of reference counted objects. 2019-03-13 13:14:39 -07:00
Evan Tschannen e068c478b5 merge master 2019-03-12 18:31:25 -07:00
Evan Tschannen 5392742902 fixed review comments 2019-03-12 14:38:54 -07:00
Jingyu Zhou 2b0139670e Fix review comment for PR 1176 2019-03-12 12:02:30 -07:00
Meng Xu 46f4b02807 TLS Status: Resolve review comments
Use connectedCoordinatorsNumDelayed to reduce the load on cluster controller;
Set connectedCoordinatorsNum to null by default for monitorLeader()
2019-03-11 17:10:08 -07:00
Evan Tschannen 1be9ae5ce3 fixed merge conflict 2019-03-08 22:51:06 -05:00
Evan Tschannen 044b6b4f8a Merge branch 'master' into feature-degraded-tlog
# Conflicts:
#	fdbserver/ClusterController.actor.cpp
2019-03-08 22:50:41 -05:00
Evan Tschannen 45fe6b369b tlog recruitment will prefer non-degraded processes, however it will not choose fewer than the desired number of tlogs to avoid degraded processes
better master exists will switch the master to avoid degraded processes
2019-03-08 14:40:00 -05:00