Commit Graph

171 Commits

Author SHA1 Message Date
Xin Dong 4363dd0f25 This resolves issue #3739 by exposing time since last full recovery. 2020-09-08 14:26:01 -07:00
Young Liu 23e1ff694c Report missing old tlogs in recovery between accepting commits and storage recovered 2020-09-08 13:35:42 -07:00
XiaoxiWang ecf2c0109c more concise status json 2020-09-04 18:40:45 +00:00
XiaoxiWang fb758bf937 update Schemas.cpp 2020-09-04 16:34:05 +00:00
Young Liu 9564171463 Merge branch 'master' into grv-proxy 2020-08-24 22:45:01 -07:00
XiaoxiWang 4e627691a9 add throttle objects into Schemas.cpp 2020-08-24 23:37:58 +00:00
Young Liu 79ce16650d merge master branch 2020-08-11 19:22:10 -07:00
Young Liu d6a23a4d6b Resolve comments to make GRV proxy a separate process class 2020-08-06 00:01:57 -07:00
Chaoguang Lin 3ba940a63d Change json string to snake_case 2020-07-28 11:42:03 -07:00
Chaoguang Lin 52178f9eae Add json schema for the management api error message in special key space 2020-07-27 17:37:19 -07:00
A.J. Beamon b09dddc07e Merge branch 'release-6.2' into merge-release-6.2-into-release-6.3
# Conflicts:
#	cmake/ConfigureCompiler.cmake
#	documentation/sphinx/source/downloads.rst
#	fdbrpc/FlowTransport.actor.cpp
#	fdbrpc/fdbrpc.vcxproj
#	fdbserver/DataDistributionQueue.actor.cpp
#	fdbserver/Knobs.cpp
#	fdbserver/Knobs.h
#	fdbserver/LogSystemPeekCursor.actor.cpp
#	fdbserver/MasterProxyServer.actor.cpp
#	fdbserver/Status.actor.cpp
#	fdbserver/storageserver.actor.cpp
#	flow/flow.vcxproj
2020-07-10 15:06:34 -07:00
Evan Tschannen 8befb0829d
Merge pull request #3481 from ajbeamon/fix-dc-timeout-message
Add missing messages to schema and rename one to match later versions
2020-07-10 10:30:21 -07:00
A.J. Beamon b51beead53 The backport of a change in later versions didn't include some updates to the schema and a change to the name of one of the messages. 2020-07-09 16:58:13 -07:00
A.J. Beamon 693595f4e5 Fix make build, fix GRV schema 2020-07-09 16:50:08 -07:00
A.J. Beamon 04d1217941 Track statistics about server-side request latency on each process, to include min, max, mean, and various percentiles. 2020-07-09 16:39:15 -07:00
Andrew Noyes 409ccf3be2 Use snake_case to match status json convention 2020-06-30 17:05:29 +00:00
Andrew Noyes e1dfa410c1 Remove "Storages" field from data_distribution_stats
It seems like a cool idea, but it feels rushed to me. Reverting for now.
2020-06-30 15:24:16 +00:00
Andrew Noyes 403274bba8 Add schemas, and check dataDistributionStatsSchema 2020-06-30 15:24:16 +00:00
Daniel Smith acbfe2e4c9
Revert "Revert "Initial RocksDB"" 2020-06-15 12:45:36 -04:00
Jingyu Zhou 9cd1614c82
Revert "Initial RocksDB" 2020-06-11 15:29:46 -07:00
A.J. Beamon e10704fd76 Cherry-pick region related status changes from 6.3 2020-06-09 14:56:21 -07:00
Daniel Smith 5d361fe532 Copy/paste rebase onto 6.3 2020-05-22 15:02:51 +00:00
A.J. Beamon d636194d0d Remove deprecated fields in status: worst_version_lag_storage_server and limiting_version_lag_storage_server 2020-05-19 13:12:10 -07:00
A.J. Beamon d0c66d7282 Fix typo 2020-05-12 18:38:20 -07:00
A.J. Beamon e0526e0095 Add busiest read tags to storage server status 2020-05-12 15:49:40 -07:00
A.J. Beamon 36da61dd9c Merge branch 'master' into transaction-tagging
# Conflicts:
#	fdbclient/NativeAPI.actor.cpp
#	fdbclient/vexillographer/fdb.options
2020-04-07 21:12:14 -07:00
A.J. Beamon 2309e9f156 Consistently use timeout instead of timedout in status messages. 2020-04-07 08:43:23 -07:00
Xin Dong 587419f984 Fix the missing schema field which caused a lot noise in nightly 2020-04-06 10:18:58 -07:00
A.J. Beamon 2336f073ad Checkpointing a bunch of work on throttles. Rudimentary implementation of auto-throttling. Support for manual throttling via fdbcli. Throttles are stored in the system keyspace. 2020-04-03 15:24:14 -07:00
Xin Dong e755583c07 Address review comments. 2020-04-01 15:13:04 -07:00
Xin Dong a7e8bfad82 Fix the test failure, which was introduced by a typo 2020-03-30 15:24:08 -07:00
Xin Dong 012d41548e Address review comments 2020-03-30 13:55:59 -07:00
Balachandar Namasivayam 58a9bfa78b
Merge pull request #2820 from dongxinEric/fix/1977/add-back-trace-event-flush-failure-report
Fix/1977/add back trace event flush failure report
2020-03-18 16:11:44 -07:00
Evan Tschannen e08f0201f1 merge release 6.2 into master 2020-03-17 12:51:47 -07:00
Evan Tschannen ed4d02a3e4
Merge pull request #2812 from etschannen/feature-proxy-mem-limit
Limit the amount of requests the proxy can queue up in memory
2020-03-16 14:56:56 -07:00
A.J. Beamon f2defc3a3a
Merge pull request #2814 from etschannen/feature-delay-recovery
Prevent coordinated state from filling up with too many old generations
2020-03-16 11:45:17 -07:00
Evan Tschannen 76db8343c0 update status schema 2020-03-16 11:00:51 -07:00
Evan Tschannen 04b752b40a Added additional logging related to memory errors (including in status) 2020-03-13 18:31:22 -07:00
Alex Miller 5be7fa52bc Remove comma, and add schema change to documentation 2020-03-13 14:51:56 -07:00
Alex Miller 04498cbc0e Make policy failures be reported as per 1s and not over 5s. 2020-03-13 02:49:06 -07:00
Alex Miller d86a601b84 Add cluster.processes.id.network.tls_policy.hz to status.
This allows monitoring of TLS policy failures, but one has to go scrape
for TLSPolicyFailure trace events to figure out why they're happening.
2020-03-13 02:46:10 -07:00
Xin Dong 5967ef5eab Added back the changes that report trace log flush failures and fix the random crash 2020-03-12 14:34:19 -07:00
Evan Tschannen 303df197cf Merge branch 'release-6.2'
# Conflicts:
#	CMakeLists.txt
#	bindings/c/test/mako/mako.c
#	documentation/sphinx/source/release-notes.rst
#	fdbbackup/backup.actor.cpp
#	fdbclient/NativeAPI.actor.cpp
#	fdbclient/NativeAPI.actor.h
#	fdbserver/DataDistributionQueue.actor.cpp
#	fdbserver/Knobs.cpp
#	fdbserver/Knobs.h
#	fdbserver/LogRouter.actor.cpp
#	fdbserver/SkipList.cpp
#	fdbserver/fdbserver.actor.cpp
#	flow/CMakeLists.txt
#	flow/Knobs.cpp
#	flow/Knobs.h
#	flow/flow.vcxproj
#	flow/flow.vcxproj.filters
#	versions.target
2020-03-06 18:22:46 -08:00
Evan Tschannen 15f1a75d4f updated documentation for 6.2.18 2020-03-06 11:16:10 -08:00
Xin Dong 39610d15f8 Revert this change since it somehow introduced a random crash detected on circus 2020-03-04 16:14:38 -08:00
Xin Dong 090c89e90a Addressed review comments. Fix the bug where issues on a worker may be wrongly cleared by subsequent GetDBinfo request. 2020-02-25 15:39:38 -08:00
Xin Dong 1c346fcfb0 Added the new issues into Status Schema. Remove the issue reporting in lastError since:
- If the issue string contains the error number, status schema needs to be super verbose to include all possible issue strings
- If the issue string does not contain the error number, the generic issue string can be pretty useless.

Thus now specific issues are being reported before calling lastError
2020-02-25 15:38:14 -08:00
Jingyu Zhou 52c6737411 Rename backupLoggingEnabled as backupWorkerEnabled
To highlight the changes for 7.0 backup changes. By default,
backup_worker_enabled flag is set for 7.0 version.
2020-02-04 10:09:16 -08:00
Jingyu Zhou 0db03f1d3c Use backup_logging_enabled flag
The default is to enable new backup workers. Users can disable this flag to
turn off the backup worker feature.
2020-02-03 20:03:22 -08:00
Jingyu Zhou 297f22726c Add backup_type database configuration option
Update simulation tests to randomly set backup types to be one of: old backup
(default), new backup (tagged), or both (default+tagged).
2020-01-31 19:29:09 -08:00
mengranwo 227edd4248 change memory storage engine name from memory-radixtree to memory-radixtree-beta 2020-01-15 13:49:45 -08:00
mengranwo f597aa7e18 WIP : deployable/stable version since Nov 3. Start rebase to master branch 2020-01-15 13:49:45 -08:00
negoyal a4a0bf18f9 Merging with Master. 2019-11-12 13:01:29 -08:00
Evan Tschannen 688940b685 merge 6.2 into master 2019-10-21 11:43:46 -07:00
Evan Tschannen ef0890c23a updated status schema 2019-10-16 22:37:57 -07:00
Evan Tschannen 35e816e9ad added the ability to configure satellite_logs by satellite location, this will overwrite the region configure if both are present 2019-10-14 18:30:15 -07:00
Meng Xu d0147e5e5d Merge branch 'release-6.2' into mengxu/merge-release620-to-master-v3
Resolved Conflicts:
	documentation/sphinx/source/release-notes.rst
	fdbserver/DataDistribution.actor.cpp
	versions.target
2019-10-02 13:22:56 -07:00
Evan Tschannen 045175bd0e added tracking for the size of the system keyspace 2019-09-27 22:39:19 -07:00
A.J. Beamon 6100d3274d
Merge pull request #2058 from tclinken/expose-lock-status
Added lockUID to status output if database is locked
2019-09-11 08:47:35 -07:00
Trevor Clinkenbeard 8c31a839be s/lockUID/lock_uid in status 2019-09-06 22:20:55 -07:00
Trevor Clinkenbeard 2d216f7ae5 Added database_lock_state to statusSchema 2019-09-06 22:20:50 -07:00
A.J. Beamon 3f9e392668
Merge pull request #2014 from etschannen/feature-fdbcli-sleep
Added a sleep command to fdbcli
2019-08-30 11:22:13 -07:00
Evan Tschannen f3bc7e0abd do not duplicate data distribution disabled fields in status
fixed a few bugs related to the existing data distribution disabled fields in status
2019-08-29 18:41:34 -07:00
A.J. Beamon 2b80d836f4 Merge branch 'release-6.2' into add-coordinator-to-status-roles-list
# Conflicts:
#	documentation/sphinx/source/release-notes.rst
2019-08-19 15:03:59 -07:00
A.J. Beamon b8e57f37d7 Add 'coordinator' to the list of roles that a process can have in status. 2019-08-15 14:42:49 -07:00
A.J. Beamon bb72cdd36a Report lag with the usual "seconds" and "versions" fields. Rename and deprecate the qos.*version_lag_storage_server fields. 2019-08-15 13:42:39 -07:00
A.J. Beamon 6581161dd3 Add ratekeeper's durability lag statistics to status 2019-08-15 11:07:04 -07:00
A.J. Beamon 438bc636d5 Rename max_machine_failures_without_losing_X to max_zone_failures_without_losing_X in status. 2019-07-30 14:02:31 -07:00
Evan Tschannen 90e3b50213 Merge branch 'master' into feature-coordinator-connection
# Conflicts:
#	fdbclient/DatabaseContext.h
#	fdbclient/NativeAPI.actor.cpp
#	fdbclient/NativeAPI.actor.h
#	fdbserver/workloads/KillRegion.actor.cpp
2019-07-26 15:05:02 -07:00
Evan Tschannen be5d144b8b added status information on connected clients 2019-07-25 17:15:31 -07:00
Evan Tschannen 846038b0e6
Merge pull request #1858 from bnamasivayam/rk-ssfetch-throttle
Ratekeeper throttling aggressively when unable to fetch storage server list
2019-07-19 16:41:58 -07:00
Evan Tschannen 94c66f8d58
Merge pull request #1738 from bnamasivayam/consistency-check-disable
Disable/Re-enable consistency check through a database key.
2019-07-18 10:56:02 -07:00
Balachandar Namasivayam 406bcebdc4 Ratekeeper to throttle tpsLimit to 1 if it is not able to fetch storage server list for some configurable amount of time. 2019-07-17 18:08:17 -07:00
A.J. Beamon 2cd05e9ac9
Merge pull request #1712 from tclinken/add-local-rk-to-status
Track the local ratekeeper rate in status
2019-07-15 15:17:11 -07:00
Balachandar Namasivayam 9169232fa9 Add the new messages to Schema. 2019-07-15 13:47:27 -07:00
Evan Tschannen 1a18c859c7 knobified the durability lag rate controls 2019-07-12 18:50:56 -07:00
A.J. Beamon f31884c749 Merge branch 'master' into add-priority-starts-to-status
# Conflicts:
#	documentation/sphinx/source/release-notes.rst
2019-07-11 15:26:52 -07:00
A.J. Beamon 97609ad991 Add information about transaction starts at different priorities to status. 2019-07-11 13:54:44 -07:00
A.J. Beamon b4dbc6d7fa Change the way cache hits and misses are tracked to avoid counting blind page writes as misses and count the results of partial page writes. Report cache hit rate in status. 2019-07-10 14:43:20 -07:00
A.J. Beamon 69d7c4f79c Merge branch 'master' into track-run-loop-busyness
# Conflicts:
#	documentation/sphinx/source/release-notes.rst
#	flow/Net2.actor.cpp
#	flow/network.h
2019-07-09 18:39:23 -07:00
Trevor Clinkenbeard 1bac04509e Track the local ratekeeper rate as a percentage
This value is reported in status for each storage server.
2019-07-09 12:46:53 -07:00
A.J. Beamon 4be08d9b2d Rename datacenter_version_difference to datacenter_lag and include both seconds and versions. 2019-07-05 14:36:18 -07:00
Evan Tschannen b9a6271375 local ratekeeper no longer globally limits 2019-06-28 16:54:22 -07:00
Evan Tschannen 92b32855ca ratekeeper’s control algorithm would oscillate when limited by local ratekeeper 2019-06-28 16:54:22 -07:00
A.J. Beamon 7f23814841 Track run loop busyness and report it in status. 2019-06-26 14:03:02 -07:00
Evan Tschannen dccb9bc26d fixed a number of correctness problems 2019-06-12 19:40:50 -07:00
Evan Tschannen 8590b710bf added additional logging on the logs and log routers 2019-05-02 17:24:39 -07:00
Evan Tschannen 3356ac27bf added three_data_hall_fallback configuration 2019-04-07 22:58:18 -07:00
Evan Tschannen 628fec8c8b updated status with information about ongoing maintenance
clear the maintenance zone if a different storage server is detected failed
2019-04-02 14:15:51 -07:00
Jingyu Zhou b81de9831f Fix SchemaMismatch error
Add data_distributor and ratekeeper roles to schema.
2019-03-27 09:54:01 -07:00
Evan Tschannen eb54a700ba changed the old memory configuration to memory-1 2019-03-18 15:10:04 -07:00
Evan Tschannen a372c7cf18 configure memory now selects the ssd engine for transaction log spilling. Transaction log spilling is only used when the transaction logs cannot keep all of the unpopped mutations it has in memory. If we are only using this data structure because we do not have enough memory, it is much less safe to use the memory storage engine for this purpose. 2019-03-16 22:48:24 -07:00
Meng Xu 5a10bf5dfc Merge branch 'master' into mengxu/tls-switch-status-PR 2019-03-14 10:35:12 -07:00
Evan Tschannen e068c478b5 merge master 2019-03-12 18:31:25 -07:00
Meng Xu 435e515985 Merge branch 'master' into mengxu/tls-switch-status-PR 2019-03-11 11:17:40 -07:00
Evan Tschannen 80c3f2f8e2 added status fields detailing which processes are degraded, and also the total number of degraded processes 2019-03-10 22:58:15 -07:00
Jingyu Zhou 7340998261 Fix status message for ratekeeper 2019-03-07 13:16:20 -08:00
Meng Xu 845f8fdcbc Status:healthy: Add optimizing_team_collections
Change removing_redundant_teams status name to
optimizing_team_collections.
The new name is more general and can be applied in the future
when we switch storage engines.
2019-03-06 15:05:23 -08:00
Meng Xu 04880e3d4d Merge branch 'master' into mengxu/tls-switch-status-PR 2019-03-06 13:41:16 -08:00
Meng Xu b7a52e81e2 Status: Count connected coordinators per client
A client will always try to connect all coordinators.
This commit let Status track the number of connected coordinators
for each client.

This allows us to do canary in coordinators. For example,
when we switch from non-TLS to TLS, we can switch 1 coordinator
from non-TLS to TLS. This can help check if a client has the ability
to connect through TLS.
We can make the non-TLS to TLS switch for each coordinators
one by one. This avoid the risk of losing connection in the switch.
2019-03-05 21:21:23 -08:00