sfc-gh-tclinkenbeard
5c2d7b6080
Create RangeResult type alias
2021-05-03 13:14:16 -07:00
sfc-gh-tclinkenbeard
f9ede75b42
Remove unused variable in ClusterController.actor.cpp
2021-05-03 11:10:43 -07:00
Markus Pilman
54919d4f3b
Merge remote-tracking branch 'sfc/features/actor-lineage' into features/actor-lineage
2021-04-28 09:22:14 -06:00
Evan Tschannen
1f98dec1df
cleaned up default constructed maps
2021-04-26 19:26:25 -07:00
sfc-gh-tclinkenbeard
dc577b6608
Fix some bugs in distribution of configBroadcaster interface
2021-04-26 18:46:22 -07:00
sfc-gh-tclinkenbeard
7211d838cf
Remove broadcastConfigDatabase actor
2021-04-26 15:54:08 -07:00
Evan Tschannen
451609e6be
code cleanup
2021-04-26 10:16:18 -07:00
Evan Tschannen
50bb9b51b4
simulation does recruitment twice and compares the results to ensure recruitment is deterministic
2021-04-26 10:13:59 -07:00
Evan Tschannen
49ca48f82e
fix: tlog recruitment could select more than the desired amount of tlogs
...
fix: tlog recruitment did not attempt to avoid longLivedStateless processes
2021-04-26 10:09:44 -07:00
Evan Tschannen
7503964ee9
recruitment tries to avoid degraded processes altogether, rather than just the worst one. Since this is a behavior change from the backup recruitment, we cannot compare degraded between the two recruitments
2021-04-26 10:01:54 -07:00
Evan Tschannen
ccfc77f6fb
changed preferredSharing to be ordered, so that recruitment will always share with the same other role when everything else is equal
2021-04-26 09:57:46 -07:00
sfc-gh-tclinkenbeard
9bed1f7aa5
Run SimpleConfigBroadcaster on cluster controller
2021-04-25 17:20:02 -07:00
Evan Tschannen
b61a911685
removed an ASSERT that was for debugging purposes, and increased the max commit latency, because it can be spuriously triggered by dummy transactions that take 5+ seconds each
2021-04-21 14:30:06 -07:00
Evan Tschannen
e18c9961b4
rewrote tlog recruitment logic so that it is deterministic, to prevent better master exists from triggering spuriously
2021-04-21 00:22:33 -07:00
Lukas Joswiak
c81e1e9519
Add sampling profiler frequency to global config
2021-04-19 22:46:57 -07:00
RenxuanW
4bf7218e8f
Merge pull request #4635 from RenxuanW/priority_logging
...
Log a warning when remote dc is disabled (priority < 0)
2021-04-15 17:00:41 -07:00
Lukas Joswiak
7de23918c0
Add comments, fix erase bug, make optimizations
2021-04-14 10:56:33 -07:00
Lukas Joswiak
c38ddf5eb7
Add comments
2021-04-14 10:56:33 -07:00
Lukas Joswiak
7ba7257cd2
Store global config data on heap
2021-04-14 10:56:33 -07:00
Lukas Joswiak
1c60653c2a
Add fix to conditionally set global config history
2021-04-14 10:56:33 -07:00
Lukas Joswiak
6de28dd916
clang-format
2021-04-14 10:56:33 -07:00
Lukas Joswiak
1260385965
Use object to wrap global configuration history
2021-04-14 10:56:32 -07:00
Lukas Joswiak
fb9a929780
Fix issue with freed memory being accessed
2021-04-14 10:56:32 -07:00
Lukas Joswiak
c3f68831af
Move existing ClientDBInfo variables to global configuration
2021-04-14 10:56:32 -07:00
Lukas Joswiak
7bb0b3d899
Use commit version for global configuration updates
...
FIXME: There is a memory issue where the underlying data for values set
in the `data` field of GlobalConfig will be freed shortly after being
set.
2021-04-14 10:56:32 -07:00
Lukas Joswiak
f1415412f1
Add global configuration framework implementation
2021-04-14 10:56:32 -07:00
Evan Tschannen
bd6db9ca7c
Update fdbserver/ClusterController.actor.cpp
...
Co-authored-by: Markus Pilman <markus.pilman@snowflake.com>
2021-04-13 15:13:45 -07:00
RenxuanW
7be8dab045
Change DcPriorityNegative to CCDcPriorityNegative
2021-04-08 16:00:37 -07:00
RenxuanW
738e7402f7
Log a warning when remote dc is disabled (priority < 0)
2021-04-08 15:36:52 -07:00
RenxuanW
f3d5fa4750
Revert "Log a warning when remote dc's priority doesn't match the original primary."
...
This reverts commit 1d701e8bcf.
2021-04-08 15:19:43 -07:00
RenxuanW
1d701e8bcf
Log a warning when remote dc's priority doesn't match the original primary.
2021-04-08 14:38:37 -07:00
Evan Tschannen
a90c26f1d0
The master, proxies, and resolver all need to have the same machine class fitness function besides best fit to ensure recruitment is deterministic
...
if the first GRV proxy or resolver is forced to share a process, it should prefer to share with the commit proxy so that the commit proxy has more potential options it can share with
2021-04-08 14:29:12 -07:00
Evan Tschannen
5695a1816f
fix: requiredFitness was being set to one higher than the actual requirement
2021-04-07 21:31:14 -07:00
Evan Tschannen
1b1f73ea16
added comments
2021-04-07 20:40:42 -07:00
Evan Tschannen
4d8dd0b0a0
fix: desired must be greater than or equal to required
2021-04-07 20:32:45 -07:00
Evan Tschannen
14213b0151
code cleanup
2021-04-07 20:06:30 -07:00
Evan Tschannen
15e8b43961
rewrote getWorkersForTLogs to do a much better job of avoiding degraded processes and processes in the same DC as the cluster controller
2021-04-07 19:57:24 -07:00
Evan Tschannen
c27d82cecd
tlog recruitment used a degraded LogClass process over a non-degraded TransactionClass process
...
tlog recruitment would not use TransactionClass processes if it fulfilled the required amount with LogClass processes
Better master exists did not account for how many times a process had been used when comparing recruitments
Better master exists did not account for the fact that tlogs prefer to be in a different dc than the cluster controller
RoleFitness comparison did not properly order count before degraded or bestFit
betterCount was returning worstFit when worstIsDegraded did not match
backupWorker recruitment did not attempt to avoid sharing processes with other roles
If any of the commit_proxy, grv_proxy, or resolver are forced to share a process, allow the recruitment for all of them to share to an equal degree. This change allows BetterMasterExists to be refactored as a tuple comparison
2021-04-07 16:04:08 -07:00
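The "tuple comparison" mentioned in the commit above can be pictured with a small C++ sketch. Everything here is an assumption made for illustration: `RecruitmentSummary`, its fields, and the ordering of the fields are hypothetical and are not taken from ClusterController.actor.cpp.

```cpp
#include <cassert>
#include <tuple>

// Hypothetical summary of one recruitment. Field names and their ordering
// are illustrative only, not FoundationDB's actual RoleFitness.
struct RecruitmentSummary {
    int worstFitness; // lower is better
    int count;        // more recruited processes is better
    bool degraded;    // false is better
};

// Returns true if `a` is strictly better than `b`.
// The fields are compared lexicographically: fitness first, then count,
// then the degraded flag.
inline bool isBetter(const RecruitmentSummary& a, const RecruitmentSummary& b) {
    // Negate count so that a "smaller" tuple always means a "better" recruitment.
    return std::make_tuple(a.worstFitness, -a.count, a.degraded) <
           std::make_tuple(b.worstFitness, -b.count, b.degraded);
}
```

Expressing the comparison as one tuple keeps the priority order of the criteria in a single place, which is harder to get wrong than a chain of nested if statements.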
Markus Pilman
50342b5082
fix a second low-latency bug
2021-03-29 13:31:26 -06:00
Markus Pilman
8555723b98
removing testing case
2021-03-26 15:46:54 -06:00
Markus Pilman
43bed1d9dd
Fix bug where betterMasterExist and recruitment disagree
2021-03-26 15:06:59 -06:00
Evan Tschannen
10b6b5d710
If the current configuration does not have a satellite fallback policy we do not care if the old configuration is in fallback mode
2021-03-23 13:02:31 -07:00
A.J. Beamon
99f3bb6d7d
Merge pull request #4509 from sfc-gh-etschannen/feature-bme-count
...
Do not trigger BetterMasterExists if it lowers the number of processes
2021-03-22 13:43:24 -07:00
Zhe Wu
15f3699e22
Add targeting DC ids in the tlog recruitment event trace.
2021-03-19 14:10:38 -07:00
Meng Xu
0cedef123b
Merge pull request #4518 from halfprice/zhewu/log-tlog-recruitment-failure-reason
...
Logging more detailed information during Tlog recruitment
2021-03-19 11:36:05 -07:00
Zhe Wu
58d9f47782
log fitness for excluded workers as well
2021-03-19 11:04:53 -07:00
Zhe Wu
4c00361f1c
Add comment for 'getWorkersForTlogs' method, and addressed TraceEvent formatting comments.
2021-03-18 21:33:43 -07:00
Zhe Wu
9419387295
Update logging field.
2021-03-18 14:53:43 -07:00
Evan Tschannen
2ff63f544e
Update fdbserver/ClusterController.actor.cpp
...
Co-authored-by: Lukas Joswiak <lukas.joswiak@snowflake.com>
2021-03-18 13:45:51 -07:00
Zhe Wu
451b14af09
Log detailed information when a worker is considered as unavailable by the cluster controller for TLog recruitment.
2021-03-18 12:18:03 -07:00
Zhe Wu
6468c5aed6
Fix string join
2021-03-17 23:46:11 -07:00
Zhe Wu
1205650a69
Log the dcid during TLog recruitment, so that we can tell in which DC the recruitment is happening
2021-03-17 23:22:42 -07:00
Evan Tschannen
9aeb69ca1c
added a comment
2021-03-16 14:19:23 -07:00
Evan Tschannen
d0f134c20e
added a comment
2021-03-16 13:17:56 -07:00
Evan Tschannen
2a272e525f
fix compile error
2021-03-16 12:21:21 -07:00
Evan Tschannen
10fd094920
Better master exists should not trigger if it will lower the total number of processes being recruited
2021-03-16 12:14:19 -07:00
FDB Formatster
df90cc89de
apply clang-format to *.c, *.cpp, *.h, *.hpp files
2021-03-10 10:18:07 -08:00
Evan Tschannen
346a4e3ecd
Merge branch 'release-6.3'
...
# Conflicts:
# fdbcli/fdbcli.actor.cpp
# fdbrpc/LoadBalance.actor.h
# fdbrpc/MultiInterface.h
# fdbserver/ClusterController.actor.cpp
# fdbserver/MasterProxyServer.actor.cpp
# fdbserver/masterserver.actor.cpp
2021-03-01 18:52:06 -08:00
Meng Xu
33eb1de00e
Add some comment to log system
...
and resolve review comment by deleting my questions.
2021-02-19 21:44:13 -08:00
Meng Xu
9122be4d81
Add comments to HA code and loadBalance code
2021-02-10 13:51:36 -08:00
Richard Chen
c77d9e4abe
merge conflicts
2020-12-02 21:53:19 +00:00
Markus Pilman
bdd3dbfa7d
remove duplicates
2020-11-10 14:01:07 -07:00
sfc-gh-tclinkenbeard
4669f837fa
Add uses of makeReference
2020-11-07 22:10:18 -08:00
Xin Dong
99d31391ca
Fixed a crash found by nightly correctness.
2020-11-03 09:28:04 -08:00
Richard Chen
bbf5bdf6da
fix stable interfaces test and corresponding changes in simulator
2020-10-12 18:25:12 +00:00
Richard Chen
5488ff1d81
draft diff protocol
2020-10-12 18:24:03 +00:00
Richard Chen
41843f07e6
add simulator support for different process versions and ProtocolVersion test
2020-10-12 18:19:31 +00:00
Xin Dong
175d52312a
Prevent segmentation fault.
2020-10-08 13:36:15 -07:00
Young Liu
cc5bc16bd8
Rename more places from proxy to commit proxy
2020-09-15 22:29:49 -07:00
Young Liu
35bef73a1c
Rename proxy to commit proxy
2020-09-10 17:44:15 -07:00
Young Liu
87693cae81
merge master branch and resolve conflicts
2020-09-02 13:44:33 -07:00
Evan Tschannen
12edadd059
Merge branch 'release-6.3'
...
# Conflicts:
# CMakeLists.txt
# fdbclient/Knobs.cpp
# fdbclient/MasterProxyInterface.h
# fdbrpc/simulator.h
# fdbserver/MasterProxyServer.actor.cpp
# tests/fast/CycleAndLock.txt
# tests/fast/TxnStateStoreCycleTest.txt
# tests/fast/VersionStamp.txt
# tests/slow/ParallelRestoreOldBackupApiCorrectnessAtomicRestore.txt
# tests/slow/ParallelRestoreOldBackupCorrectnessCycle.txt
# versions.target
2020-08-31 19:33:34 -07:00
Evan Tschannen
d42a6b6ea7
remove spammy trace event
2020-08-31 10:37:00 -07:00
Young Liu
19df032aec
Change some formatting issues
2020-08-13 15:30:21 -07:00
Young Liu
4a30492186
Remove debug trace
2020-08-13 14:42:00 -07:00
Young Liu
79ce16650d
merge master branch
2020-08-11 19:22:10 -07:00
Young Liu
ba803a5ea3
Fixed formatting issues and removed GRV related code in MasterProxy
2020-08-11 18:54:54 -07:00
Young Liu
104bac3cbd
Add trace to debug
2020-08-07 13:02:41 -07:00
Young Liu
56cc15ee71
Add trace to debug
2020-08-07 01:02:07 -07:00
Young Liu
d6a23a4d6b
Resolve comments to make GRV proxy a separate process class
2020-08-06 00:01:57 -07:00
Young Liu
30ea639666
Remove debug traces
2020-07-29 07:55:05 -07:00
Young Liu
f7b76a92af
pass joshua
2020-07-29 07:26:55 -07:00
Meng Xu
a2089b354a
RemoveServersSafely:Safety check toKill1 to avoid cluster getting stuck
...
toKill1 and toKill2 are random subsets of all processes. If we simply kill all processes in toKill1 or toKill2,
we may kill so many processes that the cluster becomes unavailable and gets stuck.
toKill2 is already modified when killing it would make the cluster unavailable;
we should do the same thing for toKill1.
2020-07-28 21:07:31 -07:00
Young Liu
1826ac75d5
Add some trace events to debug
2020-07-25 18:16:08 -07:00
Young Liu
0fc681cc3c
Remove some code comments
2020-07-23 22:29:51 -07:00
Young Liu
618414a416
Fix bugs related to getting proxies workers
2020-07-23 18:32:47 -07:00
Young Liu
229ab0d5f1
Fix some conflicts and remove debugging trace events
2020-07-22 23:35:46 -07:00
Young Liu
525f10e30c
Merge master branch
2020-07-22 16:08:49 -07:00
Young Liu
302cf5c45f
Remove debug trace events
2020-07-22 12:20:22 -07:00
Young Liu
2703cedac5
Fixed known bugs
2020-07-17 22:24:52 -07:00
Young Liu
21c1998cca
Fix MaxTLogQueueSize Bug
2020-07-16 15:56:04 -07:00
Young Liu
5b06d69d25
Pass watches test
2020-07-15 00:37:41 -07:00
Andrew Noyes
f470ba8316
Remove using namespace std::rel_ops
...
This causes the following to not compile anymore
    #include <utility>
    #include <vector>
    using namespace std::rel_ops;
    int main() {
        std::vector<int> xs;
        return xs.rbegin() != xs.rend();
    }
See https://godbolt.org/z/s1977n
2020-07-10 22:58:15 +00:00
Meng Xu
9668f32df5
Merge pull request #3388 from apple/release-6.3
...
Merge Release 6.3 into master
2020-06-18 08:50:25 -07:00
Vishesh Yadav
3068a37e1b
refactor: Remove dead failureDetectionServer code
2020-06-17 15:40:21 -07:00
sfc-gh-tclinkenbeard
99bf993815
Replace BOOST_NOEXCEPT with noexcept
2020-06-09 22:39:19 -07:00
negoyal
cf13e00a8f
Merge remote-tracking branch 'origin/release-6.3' into fdb_cache_wo_allocator
2020-06-01 17:38:31 -07:00
Markus Pilman
c2bc75516f
Merge branch 'release-6.3' of github.com:apple/foundationdb into features/trace-roles
2020-05-14 10:34:53 -07:00
Evan Tschannen
f17f00fdd5
Merge branch 'release-6.2'
...
# Conflicts:
# documentation/sphinx/source/release-notes.rst
2020-05-10 22:33:38 -07:00
Evan Tschannen
3eaa9d6397
fix: do not report datacenter version difference before both datacenters report a correct version
2020-05-10 17:49:09 -07:00
Markus Pilman
5f9b127e56
Emit traces regularly about role assignment
...
We are currently emitting Role transition traces when a role starts and
when it ends. While this is useful for debugging, it doesn't work well
with tools that inject data and might potentially miss some trace lines.
We do decorate each trace line with the roles assigned to that
particular process, however, this is not sufficient for tools that can
make use of the UID -> Role mapping
2020-05-08 16:27:57 -07:00
negoyal
dd033736ed
Merge branch 'master' into fdb_cache_subfeature2
2020-05-04 17:29:43 -07:00
Evan Tschannen
9e5037291d
fix compiler errors
2020-05-01 14:30:50 -07:00
Evan Tschannen
a442565e13
more work towards shrinking locality
2020-04-18 21:29:38 -07:00
Evan Tschannen
b04478704e
fixed improper use of std::set erase
2020-04-17 16:45:22 -07:00
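The log does not say what the improper use of std::set erase was. One common pitfall, shown here purely as an assumed example, is erasing an element and then continuing to use the invalidated iterator. A minimal sketch of the safe pattern, with the hypothetical helper `eraseEven`:

```cpp
#include <cassert>
#include <set>

// Removes every even value from `values` while iterating over the set.
// Hypothetical helper for illustration; not code from the repository.
inline void eraseEven(std::set<int>& values) {
    for (auto it = values.begin(); it != values.end();) {
        if (*it % 2 == 0) {
            // erase() invalidates `it` but returns the iterator that
            // follows the removed element, so continue from there.
            it = values.erase(it);
        } else {
            ++it;
        }
    }
}
```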
Evan Tschannen
33efb9ec97
code cleanup based on review comments
2020-04-17 15:05:01 -07:00
Evan Tschannen
b667d5442f
fix: not all removed endpoints were actually removed
2020-04-17 13:47:54 -07:00
Evan Tschannen
9b5130194d
avoid updating the same endpoint multiple times
2020-04-11 21:05:30 -07:00
Evan Tschannen
1476057996
properly cache serialization of serverDBInfo
2020-04-11 19:30:05 -07:00
Evan Tschannen
07cc0a8d74
code cleanup
2020-04-10 17:02:11 -07:00
Evan Tschannen
ce4493f679
many bug fixes
2020-04-10 13:45:16 -07:00
Evan Tschannen
a51c92854a
Merge branch 'master' into feature-tree-broadcast
...
# Conflicts:
# fdbserver/WorkerInterface.actor.h
# fdbserver/worker.actor.cpp
2020-04-06 21:09:44 -07:00
Evan Tschannen
2a1bd97120
fix compilation errors
2020-04-06 20:58:43 -07:00
Evan Tschannen
477d66b46d
implemented a tree broadcast for txn state message for proxies, and serverDBInfo for workers
2020-04-05 23:09:36 -07:00
negoyal
acaf91ac47
Merge branch 'master' into fdb_cache_subfeature2
2020-03-26 13:33:08 -07:00
Jingyu Zhou
5b36dcaad5
Fix oldest backup epoch for backup workers
...
The oldest backup epoch is piggybacked in LogSystemConfig from master to
cluster controller and then to all workers. Previously, this epoch was set
to the current master epoch, which is wrong.
2020-03-20 20:15:09 -07:00
Evan Tschannen
e08f0201f1
merge release 6.2 into master
2020-03-17 12:51:47 -07:00
Evan Tschannen
2038a56ff4
Merge pull request #2819 from etschannen/feature-first-proxy
...
A "proxy" class process would not be preferred as the "first proxy" for restore and DR purposes
2020-03-16 13:53:28 -07:00
Evan Tschannen
012344e297
refactor getWorkersForRoleInDatacenter
2020-03-16 11:50:17 -07:00
Evan Tschannen
79d5511149
A "proxy" class process would not be preferred as the "first proxy" for restore and DR purposes
2020-03-13 17:49:02 -07:00
Evan Tschannen
4640edf5d6
do not recruit satellite tlogs when usable regions=1
2020-03-13 10:24:52 -07:00
Evan Tschannen
303df197cf
Merge branch 'release-6.2'
...
# Conflicts:
# CMakeLists.txt
# bindings/c/test/mako/mako.c
# documentation/sphinx/source/release-notes.rst
# fdbbackup/backup.actor.cpp
# fdbclient/NativeAPI.actor.cpp
# fdbclient/NativeAPI.actor.h
# fdbserver/DataDistributionQueue.actor.cpp
# fdbserver/Knobs.cpp
# fdbserver/Knobs.h
# fdbserver/LogRouter.actor.cpp
# fdbserver/SkipList.cpp
# fdbserver/fdbserver.actor.cpp
# flow/CMakeLists.txt
# flow/Knobs.cpp
# flow/Knobs.h
# flow/flow.vcxproj
# flow/flow.vcxproj.filters
# versions.target
2020-03-06 18:22:46 -08:00
Evan Tschannen
f3ac2c9180
renamed a variable
2020-03-04 18:49:21 -08:00
Evan Tschannen
b3ea9d5896
Do not allow the cluster controller to mark any process as failed within 30 seconds of startup
2020-03-04 18:45:26 -08:00
negoyal
cd949eca71
Merge branch 'master' into fdb_cache_subfeature2
2020-02-26 11:22:08 -08:00
Evan Tschannen
96258b9809
Merge branch 'release-6.2'
...
# Conflicts:
# documentation/sphinx/source/release-notes.rst
# fdbcli/fdbcli.actor.cpp
# fdbclient/ManagementAPI.actor.cpp
# fdbrpc/FlowTransport.actor.cpp
# fdbserver/ClusterController.actor.cpp
# fdbserver/DataDistribution.actor.cpp
# fdbserver/DataDistribution.actor.h
# fdbserver/DataDistributionQueue.actor.cpp
# fdbserver/KeyValueStoreMemory.actor.cpp
# fdbserver/MasterProxyServer.actor.cpp
# fdbserver/QuietDatabase.actor.cpp
# fdbserver/SkipList.cpp
# fdbserver/StorageMetrics.actor.h
# fdbserver/TLogServer.actor.cpp
# fdbserver/fdbserver.actor.cpp
# fdbserver/storageserver.actor.cpp
# fdbserver/workloads/KVStoreTest.actor.cpp
# flow/CMakeLists.txt
# flow/Knobs.cpp
# flow/Knobs.h
# flow/genericactors.actor.cpp
# flow/serialize.h
2020-02-21 19:09:16 -08:00
Evan Tschannen
8b768e66df
Merge pull request #2694 from dongxinEric/feature/2663/specialize-policy-for-zoneid-in-cc
...
Added a specialized algorithm for PolicyOne and PolicyAcross(,'zoneId…
2020-02-20 14:46:23 -08:00
Evan Tschannen
574e88ba8e
updateGoodRemoteRecruitmentTime was unnecessary because the only way findRemoteWorkers would return would be after a new server has joined which already resets goodRemoteRecruitmentTime
2020-02-20 13:46:22 -08:00
Xin Dong
99095c9224
Again make Clang happy.
2020-02-20 09:50:22 -08:00
Xin Dong
298d6cb3d7
Address review comments.
2020-02-20 09:34:01 -08:00
Evan Tschannen
fbd45963d8
The cluster controller waits until no new workers register for 1.0 before starting a bad recruitment
2020-02-19 16:48:30 -08:00
Xin Dong
89fcbb2055
Make clang happy
2020-02-19 09:44:15 -08:00
Xin Dong
efc0d7f9d5
Added a specialized algorithm for PolicyOne and PolicyAcross(,'zoneId',PolicyOne()) to find a set of TLog servers which will be able to fulfill the policy later.
2020-02-19 09:25:57 -08:00
negoyal
85cc35e81e
Merge branch 'master' into HEAD
2020-02-05 14:59:55 -08:00
Evan Tschannen
844c8511c4
Merge pull request #2588 from jzhou77/backup-worker
...
Integrate new backup worker with existing backup command
2020-02-05 14:14:43 -08:00
Jingyu Zhou
52c6737411
Rename backupLoggingEnabled as backupWorkerEnabled
...
To highlight the backup changes for 7.0. By default, the
backup_worker_enabled flag is set for version 7.0.
2020-02-04 10:09:16 -08:00
Jingyu Zhou
0db03f1d3c
Use backup_logging_enabled flag
...
The default is to enable new backup workers. Users can disable this flag to
turn off the backup worker feature.
2020-02-03 20:03:22 -08:00
Evan Tschannen
4524831456
Merge pull request #2518 from vishesh/task/failmon-remove-server
...
FailureMonitoring: Server processes no longer need to talk to ClusterController
2020-02-03 17:22:50 -08:00
Jingyu Zhou
38aa1903fd
Add a DB configuration option for backup workers
...
Right now, the default is to keep the old backup behavior, i.e., do NOT use
backup workers. Specifically, if BackupType is not set (or is set to default),
the master will not recruit backup workers and will not add pseudo locality for
backup workers.
The StartFullBackupTaskFunc is updated to check if backup worker is enabled.
Only when it is not enabled, starting a backup will wait on all backup workers
to be started.
2020-01-31 19:29:09 -08:00
Jingyu Zhou
6ddf73e26a
Remove code introduced when resolving merge conflicts
2020-01-22 21:23:38 -08:00
Jingyu Zhou
c6c39ca99d
Update better master exist with backup workers
...
During recruitment, if there is no desired log router count, use tlog size
instead, because the number of backup workers has to be larger than 0.
2020-01-22 19:43:40 -08:00
Jingyu Zhou
56a2c37071
Recruit backup workers for single region
...
Enable log router tags for single region, which are popped by backup workers.
Need to add noop for backup workers if there is no active backups.
2020-01-22 19:42:13 -08:00
Jingyu Zhou
19d6a889ff
Recruit backup workers for old epochs
...
If there are unfinished ranges in the old epochs, the new master will recruit
backup workers responsible for finishing these ranges. These workers remain in
the cluster until the next epoch, when they remove themselves.
2020-01-22 19:38:45 -08:00
Jingyu Zhou
7da9f47f26
Enable pop from backup workers
...
This is still WIP as some edge cases can trigger test failure, most likely due
to not popping mutations by backup workers when epoch ends.
2020-01-22 19:38:45 -08:00
Jingyu Zhou
ece3cadf8e
Recruit backup worker during master recovery
...
Right now recruit the same number as TLogs. The backup worker does nothing.
2020-01-22 19:37:48 -08:00
Jingyu Zhou
de8d953865
Add backup role, class, and worker skeleton
2020-01-22 19:35:30 -08:00
Vishesh Yadav
daef5f011a
Merge remote-tracking branch 'apple/master' into task/failmon-remove-server
2020-01-21 13:20:15 -08:00
Evan Tschannen
3f9d9d8b84
Merge branch 'release-6.2'
...
# Conflicts:
# CMakeLists.txt
# cmake/FlowCommands.cmake
# documentation/sphinx/source/release-notes.rst
# fdbclient/StorageServerInterface.h
# fdbserver/DataDistributionTracker.actor.cpp
# fdbserver/MasterProxyServer.actor.cpp
# fdbserver/fdbserver.actor.cpp
# flow/Knobs.h
# flow/Platform.cpp
# versions.target
2020-01-16 18:37:47 -08:00
Evan Tschannen
d55e56993d
fix: the cluster controller would not recruit more remote logs before the database became fully_recovered
2020-01-10 12:21:48 -08:00
Alvin Moore
7628d04fb9
Merge branch 'release-6.2' of github.com:apple/foundationdb into release_6.2_merge
...
# Conflicts:
# documentation/sphinx/source/release-notes.rst
2020-01-09 07:21:16 -08:00
mpilman
d3d6016c90
Merge remote-tracking branch 'negoyal/fdb_cache_subfeature2' into features/cache-initialization
2020-01-07 19:53:09 -08:00
Vishesh Yadav
6e6cfaff16
Cleanup old Failure Monitoring code
2020-01-07 15:53:32 -08:00
negoyal
29b77863f0
Cache warmup and Consistency check workload changes.
2020-01-07 13:06:58 -08:00
Evan Tschannen
3eae401886
fix: we were recruiting one too few oldLogRouters
...
code cleanup
2020-01-02 15:05:44 -08:00
Evan Tschannen
5e5e618da0
during recovery, only send the full serverDBInfo to processes that are part of the new generation
2019-12-09 13:17:49 -08:00
Evan Tschannen
bcce5968a4
recruit oldLogRouters on TLogs, do not recruit oldLogRouters on the cluster controller if possible
2019-12-09 13:12:13 -08:00
mpilman
821edcb207
Register caches through keyspace
...
This also removes the old mechanism that registers them
through the serverDBInfo.
Caches do now self-recruit at startup
2019-12-06 13:28:44 -08:00
negoyal
cf2563f1c7
Mix of various things, a lot of which will change.
2019-12-05 17:10:32 -08:00
Evan Tschannen
3c769fcf60
Merge branch 'release-6.2'
...
# Conflicts:
# CMakeLists.txt
# documentation/sphinx/source/release-notes.rst
# fdbserver/ClusterController.actor.cpp
# fdbserver/MasterProxyServer.actor.cpp
# versions.target
2019-11-22 15:39:19 -08:00
Evan Tschannen
ebcb2f79ed
Merge branch 'master' of github.com:apple/foundationdb
2019-11-22 15:34:49 -08:00
A.J. Beamon
7c801513e2
Fix cases where latency band config could be discarded during recovery or process start.
2019-11-20 11:44:18 -08:00
Evan Tschannen
8d3ef89540
Merge branch 'release-6.2'
...
# Conflicts:
# CMakeLists.txt
# documentation/sphinx/source/release-notes.rst
# fdbclient/MutationList.h
# fdbserver/MasterProxyServer.actor.cpp
# versions.target
2019-11-14 15:49:56 -08:00
Evan Tschannen
ffc89d1182
fix: dd test recruitment should prefer the location of ratekeeper over other used processes
2019-11-13 12:58:55 -08:00
Balachandar Namasivayam
2e41497580
This commit tries to distribute RK and DD among other empty available processes.
2019-11-12 17:52:42 -08:00
Balachandar Namasivayam
f5282f2c7e
Fix bug where DD or RK could be halted and re-recruited in a loop for certain valid process class configurations. Specifically, recruitment of DD or RK takes into account that master process is preferred over proxy, resolver or cc.
...
But check for better DD only looks for better machine class ignoring that the new recruit could share a proxy or resolver or CC. Also try to balance the distribution of the DD and RK role if there are enough processes to do so.
2019-11-12 14:22:36 -08:00
negoyal
a4a0bf18f9
Merging with Master.
2019-11-12 13:01:29 -08:00
Evan Tschannen
688940b685
merge 6.2 into master
2019-10-21 11:43:46 -07:00
Evan Tschannen
43e99ef6a4
fix: better master exists must check if fitness is better for proxies or resolvers before looking at the count of either of them
2019-10-17 13:18:31 -07:00
Evan Tschannen
298b815109
one proxy or resolver with best fitness no longer prevents more proxies or resolvers from being recruited with good fitness
2019-10-14 18:32:17 -07:00
Evan Tschannen
5064d91b75
fix: the cluster controller would not change to a new set of satellite tlogs when they become available in a better satellite location
2019-10-14 18:31:23 -07:00
Evan Tschannen
35e816e9ad
added the ability to configure satellite_logs by satellite location; this will override the region configuration if both are present
2019-10-14 18:30:15 -07:00
A.J. Beamon
31ce56eddf
Add cluster controller metrics
2019-10-03 15:29:11 -07:00
Evan Tschannen
b495cc697b
Merge branch 'release-6.2'
...
# Conflicts:
# CMakeLists.txt
# documentation/sphinx/source/release-notes.rst
# versions.target
2019-09-13 09:25:08 -07:00
Evan Tschannen
a62862c105
add yieldedFutures to prevent slow tasks
2019-09-11 16:26:48 -07:00
Evan Tschannen
945cff1e5b
the cluster controller caches the serialization of serverDBInfo, to avoid regenerating it many times
2019-09-10 14:27:22 -07:00
Meng Xu
39680fa515
StorageEngineSwitch:Clean up unnecessary trace
...
And do not trigger storage recruitment unnecessarily.
2019-08-19 14:11:57 -07:00
Meng Xu
4ab322f52c
Merge branch 'master' into mengxu/storage-engine-switch-PR-v2
2019-08-19 13:06:32 -07:00
Meng Xu
3034a5e0c5
StorageRecruitment:Suppress outstanding req errors
...
When too many outstanding requests cannot find a worker for the storage server
role, many identical errors are written to the trace log. One error is enough
to flag the problem.
The duplicate errors cause false positives in the nightly tests and should therefore be suppressed.
2019-08-14 11:31:06 -07:00
Meng Xu
a588710376
StorageEngineSwitch:Graceful switch
...
When fdbcli changes storeType for storage engines,
we switch the store type of storage servers one by one gracefully.
This avoids recruiting multiple storage servers on the same process,
which can cause OOM error.
2019-08-12 17:37:52 -07:00
Evan Tschannen
90e3b50213
Merge branch 'master' into feature-coordinator-connection
...
# Conflicts:
# fdbclient/DatabaseContext.h
# fdbclient/NativeAPI.actor.cpp
# fdbclient/NativeAPI.actor.h
# fdbserver/workloads/KillRegion.actor.cpp
2019-07-26 15:05:02 -07:00
Evan Tschannen
be5d144b8b
added status information on connected clients
2019-07-25 17:15:31 -07:00
Jingyu Zhou
bbeaf0ebbb
Add a monitorServerInfoConfig() call back
...
This was deleted during a code refactor in ef868f5. Because no tests were
complaining, we didn't find this until now.
2019-07-25 15:17:26 -07:00
Evan Tschannen
4a866290b7
Clients keep a persistent connection open with coordinators to get updates to the list of proxies
...
Status still needs to be updated with client information from the coordinators
2019-07-23 19:22:44 -07:00
Jingyu Zhou
50e7593c5b
Merge pull request #1796 from ajbeamon/remove-trace-event-underscores
...
Remove trace event underscores
2019-07-05 21:45:55 -07:00
A.J. Beamon
9f4b6fd770
Remove additional underscores
2019-07-05 08:12:25 -07:00
Alex Miller
7a500cd37f
A giant translation of TaskFooPriority -> TaskPriority::Foo
...
This is so that APIs that take priorities don't take ints, which are
common and easy to accidentally pass the wrong thing.
2019-06-25 02:47:35 -07:00
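The motivation in the commit above, that a scoped enum stops callers from passing an arbitrary int as a priority, can be sketched as follows. The enumerator names, their values, and the `schedule` function are hypothetical and chosen only for illustration.

```cpp
#include <cassert>
#include <type_traits>

// Hypothetical priorities; names and values are illustrative,
// not FoundationDB's actual TaskPriority definition.
enum class TaskPriority { Low = 1000, Default = 7500, High = 9000 };

// Hypothetical API that accepts only a TaskPriority.
// A call such as schedule(9000) does not compile.
inline int schedule(TaskPriority priority) {
    return static_cast<int>(priority);
}

// A plain int does not implicitly convert to the scoped enum.
static_assert(!std::is_convertible<int, TaskPriority>::value,
              "int must not convert implicitly to TaskPriority");
```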
Vishesh Yadav
a8e408e268
run clang-format on changes
2019-06-10 14:10:24 -07:00
Vishesh Yadav
6fa7081a21
net: Don't make FailureMonitoring requests from client
...
This patch removes the need for clients to continuously contact
cluster coordinator for failure monitoring information. Instead, it
uses the FlowTransport to monitor the statuses of peers and update
FailureMonitor accordingly.
2019-06-09 00:43:38 -07:00
Evan Tschannen
29b96414e2
Merge branch 'release-6.1'
...
# Conflicts:
# documentation/sphinx/source/release-notes.rst
# fdbclient/NativeAPI.actor.cpp
# fdbserver/Coordination.actor.cpp
# flow/Arena.h
# versions.target
2019-06-03 18:49:35 -07:00
Evan Tschannen
7c333dbc16
If a process receives a message in its clusterControllerInterface before becoming the cluster controller, if the process does not become the cluster controller in the next minute it should destroy the interface to prevent a memory leak.
2019-05-29 16:57:13 -07:00
A.J. Beamon
5f55f3f613
Replace g_random and g_nondeterministic_random with functions deterministicRandom() and nondeterministicRandom() that return thread_local random number generators. Delete g_debug_random and trace_random. Allow only deterministicRandom() to be seeded, and require it to be seeded from each thread on which it is used.
2019-05-10 14:01:52 -07:00
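The idea of replacing a global generator with a function that returns a thread_local one can be sketched with the standard library. This is an assumed simplification: `sketchDeterministicRandom` and `seedSketchRandom` are hypothetical names, and std::mt19937 stands in for whatever generator type the real deterministicRandom() returns.

```cpp
#include <cassert>
#include <random>

// Each thread gets its own generator instance, created on first use.
// Hypothetical stand-in for the function described in the commit above.
inline std::mt19937& sketchDeterministicRandom() {
    thread_local std::mt19937 rng;
    return rng;
}

// Seeds the calling thread's generator. Every thread that uses the
// generator has to call this itself, because the state is per thread.
inline void seedSketchRandom(unsigned seed) {
    sketchDeterministicRandom().seed(seed);
}
```

Reseeding with the same value on the same thread reproduces the same sequence, which is the property a deterministic simulation depends on.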
Andrew Noyes
6207d724f8
Fix all -Wunused-variable warnings
2019-04-15 18:13:00 -07:00
mpilman
1c16f87a4e
Remove trace-calls to printable (in non-workloads)
2019-04-05 13:12:19 -07:00
mpilman
c008e16c81
Defer formatting in traces to make them cheaper
...
This is the first part of making `TraceEvent` cheaper. The main idea is
to defer calls to any code that formats strings. These are the main
changes:
- TraceEvent::detail now takes a c-string instead of std::string for
literals. This prevents unnecessary allocations if the trace is not
going to be printed in the first place (for example for SevDebug).
Before that `detail` expected a `std::string` as key, which meant that
any string literal would be copied on each call.
- Templates Traceable and SpecialTraceMetricType. These templates can be
specialized for any type that needs to be printed. The actual
formatting will be deferred to after the `enabled` check. This
provides two benefits: (1) if a TraceEvent is disabled, we don't pay
for the formatting and (2) TraceEvent can trace types that it doesn't
know about.
- TraceEvent::enabled will be set in the constructor if the Severity is
passed. This will make sure that `TraceEvent::init` is not called.
- `TraceEvent::detail` will be inlined. So for disabled TraceEvent
calls, a call to detail will only introduce an if-branch which is much
cheaper than a function call.
2019-04-05 13:12:19 -07:00
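The deferred-formatting idea described in the commit above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the real TraceEvent: `MiniTraceEvent`, `Point`, and `formatCalls` are hypothetical, and the `Traceable` trait here only loosely mirrors the template named in the commit message.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Trait that knows how to format a type. Specialize it per type.
template <class T>
struct Traceable;

// Hypothetical user type, plus a counter so the cost of formatting is visible.
struct Point { int x; int y; };
static int formatCalls = 0;

template <>
struct Traceable<Point> {
    static std::string toString(const Point& p) {
        ++formatCalls; // counts how often formatting work is actually done
        return "(" + std::to_string(p.x) + "," + std::to_string(p.y) + ")";
    }
};

// Simplified event: formatting happens only after the enabled check.
class MiniTraceEvent {
    bool enabled;
    std::vector<std::pair<const char*, std::string>> fields;

public:
    explicit MiniTraceEvent(bool isEnabled) : enabled(isEnabled) {}

    // The key is a C string, so a string literal is never copied.
    template <class T>
    MiniTraceEvent& detail(const char* key, const T& value) {
        if (enabled) { // a disabled event pays only for this branch
            fields.emplace_back(key, Traceable<T>::toString(value));
        }
        return *this;
    }

    std::size_t fieldCount() const { return fields.size(); }
};
```

Because `detail` is a template that dispatches through the trait, the event class does not need to know about `Point` in advance, and a disabled event never calls `toString`.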
Evan Tschannen
8ebf771392
cleanup cluster controller trace events
2019-03-30 14:17:18 -07:00
A.J. Beamon
71e2fdafb8
Changes to ratekeeper camel case
2019-03-27 08:24:25 -07:00
Evan Tschannen
5e03e178de
Merge pull request #1345 from ajbeamon/support-multiple-client-or-worker-issues
...
Add support for a client or worker having multiple issues.
2019-03-24 17:27:50 -07:00
Evan Tschannen
d45159ebf7
Merge pull request #1307 from jzhou77/ratekeeper
...
Monitor placement of Ratekeeper and DataDistributor
2019-03-24 17:26:07 -07:00
Evan Tschannen
d6ad027d37
ratekeeper needs to be recruited for proxies to make progress, so if one has not registered with the cluster controller by the time we are accepting commits, recruit a new one
2019-03-24 16:48:24 -07:00
Evan Tschannen
f426d732ea
fix: forgot to remove one location where id_used was incremented for distributor and ratekeeper
2019-03-24 16:04:59 -07:00
Evan Tschannen
e8948726e8
once we recruit a ratekeeper, do not allow any other ratekeepers to register
2019-03-24 11:04:39 -07:00
Jingyu Zhou
40eec20252
Restore master PID in worker registration
...
This fix was lost during merge.
2019-03-23 21:02:11 -07:00
Jingyu Zhou
3ef26e6be3
Fix fitness assignment statements
...
Found by MacOS build.
2019-03-23 19:16:04 -07:00
Evan Tschannen
1fc6937802
changed NetworkAddressList to at most two addresses for performance
2019-03-23 17:54:46 -07:00
Evan Tschannen
b51a24453e
the data distributor and ratekeeper are not included in id_used, but when comparing equally good options we prefer to avoid sharing with those roles
...
excluded data distributor and ratekeeper were improperly killed when the best option was also excluded
2019-03-23 13:25:36 -07:00
Jingyu Zhou
fdc5b5ddbf
Fix: spurious ratekeeper registration
...
A rare race condition:
-r simulation -f ./foundationdb/tests/slow/WriteDuringReadAtomicRestore.txt -s 114256311 -b on
- A is the ratekeeper.
- CC recruits B and B starts
- CC halts ratekeeper A and A is halted
- A registers back with CC, which then halts B. CC sets A to be the ratekeeper.
CC starts recruiting and finds A is the best machine. But skips recruiting
because CC thinks A is already used. Now the cluster is left with no ratekeeper.
Fix by disallowing ratekeeper registration with previous ID.
2019-03-23 11:03:51 -07:00
Jingyu Zhou
6523cd4931
Fix: recruit ratekeeper is not triggered
2019-03-23 09:20:54 -07:00
Evan Tschannen
2da46e3172
fix: halt if datacenters are different
2019-03-22 23:53:21 -07:00
Evan Tschannen
d34c56c9a5
ensure that the processId exists in id_worker before accessing it
2019-03-22 18:54:39 -07:00
Evan Tschannen
36ab852bb1
Merge branch 'master' into ratekeeper
...
# Conflicts:
# fdbserver/ClusterController.actor.cpp
2019-03-22 18:41:00 -07:00
Evan Tschannen
ddb6058770
simplified ratekeeper monitoring loop
2019-03-22 18:22:45 -07:00
Jingyu Zhou
12917d8c7d
Add actors to store halt request futures
...
Address best fitness in checking better DD or RK.
2019-03-22 18:06:38 -07:00
Jingyu Zhou
e8977aeb98
Remove clusterControllerDcId check
...
This is no longer needed since it'll be set in the ctor.
2019-03-22 18:01:54 -07:00
Evan Tschannen
82bc447e29
startRatekeeper is responsible for updating serverDBInfo
2019-03-22 17:56:16 -07:00
Evan Tschannen
82c80c225d
make sure id_worker is updated before setting ratekeeper or data distribution
2019-03-22 17:08:54 -07:00
Evan Tschannen
6a9c9d79cc
Update fdbserver/ClusterController.actor.cpp
2019-03-22 17:00:58 -07:00
Evan Tschannen
70b1c88cdd
Update fdbserver/ClusterController.actor.cpp
2019-03-22 17:00:52 -07:00
Jingyu Zhou
16f54577ee
Restore master PID in cluster controller worker registration
...
CC may think master failed and clear the master PID, which can block both data
distributor and ratekeeper recruitment. Fix by restoring it during worker
registration.
2019-03-22 14:53:05 -07:00
A.J. Beamon
4eb5715689
Add support for a client or worker having multiple issues.
2019-03-22 08:29:41 -07:00
Jingyu Zhou
da338c3ad6
Avoid unnecessary recruiting of DD or RK
...
While waiting to recruit a data distributor or ratekeeper, a previous one
could have already joined, so we can skip this unnecessary recruiting.
Revert the change of worker.actor.cpp for ratekeeper. Instead, recruiting
ratekeeper should avoid the process with an existing one. This fixes a bug
where the ratekeeper interface became a zombie, killing the other healthy ratekeeper
but doing no useful work. Found by:
-r simulation --crash -f tests/fast/WriteDuringRead.txt -s 31858110 -b on
2019-03-21 22:40:07 -07:00
Evan Tschannen
fe4464e786
fix: processClassFitness could be wrong if the client changed their class while rebooting
2019-03-21 17:56:04 -07:00
Jingyu Zhou
299961aecb
Move ratekeeper or data distributor from excluded servers
2019-03-21 17:17:33 -07:00
Jingyu Zhou
48324ad4be
Fix a race during ratekeeper registration
...
When a ratekeeper registers, the monitorRatekeeper wakes up and recruits a new
ratekeeper. Adding a 0s delay to avoid this.
If a ratekeeper is recruited on an existing machine, update the interface so
that the cluster controller can clear the ratekeeperID.
2019-03-21 12:56:56 -07:00
Evan Tschannen
e692f0f70f
fix: degraded is only used for tlog recruitment, so we should not use it in the fitness calculation for other roles
2019-03-21 11:23:49 -07:00
Jingyu Zhou
8edefda193
Fix test stuck due to invalid worker in cluster controller
...
Test case:
-r simulation --crash -f ./tests/rare/CloggedCycleWithKills.txt -s 688927581 -b off
2019-03-20 22:24:01 -07:00
Jingyu Zhou
937b6dde31
Fix a race of DD, RK, Master failure
...
If DD, RK, and Master all run on the same process and fail, recruiting a new
DD or RK could try to use the old master worker interface, which is invalid
and causes recruitment to be stuck.
Fix by adding a delay and checking master is valid before recruitment.
2019-03-20 16:19:20 -07:00
Jingyu Zhou
ce5c6d18d2
Fix ratekeeper recruitment bug
2019-03-20 14:22:22 -07:00
Jingyu Zhou
86b687981b
Fix ratekeeper and data distributor recruiting bug
...
Avoid multiple concurrent recruiting of ratekeepers with a recruiting flag.
Fix endless recruiting when the chosen worker is a proxy or a resolver --
prefer master in this case.
2019-03-20 10:00:31 -07:00
Jingyu Zhou
474abd81bd
Move placement monitoring inside doCheckOutstandingRequests
2019-03-19 22:48:21 -07:00
Balachandar Namasivayam
f9560e1abd
Addressed Review Comments
2019-03-19 15:23:14 -07:00
Jingyu Zhou
bc6fdaea3e
Recruit a new ratekeeper before halting the old
2019-03-19 15:21:46 -07:00
Jingyu Zhou
0fb6a03c07
First round of review comment fixes for PR#1307
2019-03-19 11:29:19 -07:00
Jingyu Zhou
8d609eb51d
Protect ratekeeper registration race during recruitment
...
This is similar to the one for DataDistributor.
2019-03-18 13:53:50 -07:00
Balachandar Namasivayam
5471725db5
Support config where the primary and remote DC's can be used as satellites.
2019-03-18 12:17:59 -07:00
Jingyu Zhou
2b41a97a6e
Fix the issue of slow dying Data Distributor
...
Test with:
-r simulation -f ./foundationdb/tests/slow/CommitBug.txt -s 67828576 -b on
The test has the following event sequence:
- Time 113.3s, CC noticed DD failure, cleared DD interface.
- 1s later, DD rejoined and registered with CC.
- Time 131.7s, DD actor cancelled. This old DD raced to register with CC and
the failure monitor is not installed because monitorDataDistributor is stalled
waiting for new DD.
- Time 161.4s, new DD running. New DD recruiting was delayed due to no servers
in the period.
Fix by disabling DD registration during the recruiting process.
2019-03-17 22:19:23 -07:00
Jingyu Zhou
254c78053c
Fix a segfault error
...
After wait, ServerDBInfo may have changed. Using the old copy is wrong.
2019-03-15 22:11:13 -07:00
Jingyu Zhou
12ddd56698
Fix Ratekeeper and DataDistributor placement
...
Make sure both RateKeeper and DataDistributor are placed in the same data
center as the Master. Make sure only one RateKeeper is live in the cluster as
well.
2019-03-15 17:09:28 -07:00
Jingyu Zhou
bb5686eb75
Fix monitoring of DD and RK
2019-03-15 16:02:17 -07:00
Jingyu Zhou
9f6fe5f649
Merge remote-tracking branch 'apple/master' into ratekeeper
2019-03-15 11:30:04 -07:00
Jingyu Zhou
40860e0093
Attempt to fix.
2019-03-15 11:29:04 -07:00
Jingyu Zhou
99d521ef4f
Monitor Ratekeeper and DataDistributor to use stateless processes
...
Since Ratekeeper and DataDistributor are no longer running with Master, they
might be running with stateful processes before a new Master becomes alive,
which is undesirable.
This PR adds a monitoring of both Ratekeeper and DataDistributor at Cluster
Controller -- if Master runs on a stateless class and RK/DD runs at a worse
class, then RK/DD will be killed. I.e., RK/DD should be running at their own
classes or on the same stateless process as Master. After restart, RK/DD should
be running at a better process class.
2019-03-14 15:00:57 -07:00
Meng Xu
5a10bf5dfc
Merge branch 'master' into mengxu/tls-switch-status-PR
2019-03-14 10:35:12 -07:00
Evan Tschannen
a2108047aa
removed LocalitySetRef and IRepPolicyRef typedefs, because for clarity the Ref suffix is reserved for arena allocated objects instead of reference counted objects.
2019-03-13 13:14:39 -07:00
Evan Tschannen
e068c478b5
merge master
2019-03-12 18:31:25 -07:00
Evan Tschannen
5392742902
fixed review comments
2019-03-12 14:38:54 -07:00
Jingyu Zhou
2b0139670e
Fix review comment for PR 1176
2019-03-12 12:02:30 -07:00
Meng Xu
46f4b02807
TLS Status: Resolve review comments
...
Use connectedCoordinatorsNumDelayed to reduce the load on cluster controller;
Set connectedCoordinatorsNum to null by default for monitorLeader()
2019-03-11 17:10:08 -07:00
Evan Tschannen
1be9ae5ce3
fixed merge conflict
2019-03-08 22:51:06 -05:00
Evan Tschannen
044b6b4f8a
Merge branch 'master' into feature-degraded-tlog
...
# Conflicts:
# fdbserver/ClusterController.actor.cpp
2019-03-08 22:50:41 -05:00
Evan Tschannen
45fe6b369b
tlog recruitment will prefer non-degraded processes, however it will not choose fewer than the desired number of tlogs to avoid degraded processes
...
better master exists will switch the master to avoid degraded processes
2019-03-08 14:40:00 -05:00