149 Commits

Author SHA1 Message Date
Xiaoxi Wang 6c11fc74ba add debug traces 2022-05-18 15:20:23 -07:00
Jingyu Zhou 05e63bc703 Fix orphaned storage server due to force recovery (#6914)
* Fix orphaned storage server due to force recovery

The force recovery can roll back the transaction that adds a storage server.
However, the storage server may now be at version B > A, the recovery version.
As a result, its peek to the buddy TLog won't return TLogPeekReply::popped to
trigger its exit; instead, it gets a higher version C > B back. To the
storage server, this means the message is empty, so it does not remove itself
and keeps peeking.

The fix is to use the recovery transaction version, which is the version of the
first transaction after the recovery, as the popped version for the SS instead
of the recovery version. Force recovery bumps this version to a much higher
version than the SS's version, so the TLog sets TLogPeekReply::popped to trigger
the storage server exit.

* Fix tlog peek to disallow returning an empty message between recoveredAt and recovery txn version

This contract is not explicitly set today, and violating it can cause a storage
server to fail with the assertion "rollbackVersion >= data->storageVersion()".
This is because if such an empty version is returned, the SS may advance its
storage version to a value larger than the rollback version set in the recovery
transaction.

The fix is to block the peek reply until the recovery transaction has been received.

* Move recoveryTxnReceived to be per LogData

This is because a shared TLog can have a first-generation TLog which has already
set the promise, so later generations won't wait for the recovery version.
For the current generation, all peeks need to wait, while for older generations,
there is no need to wait (checked by whether they are stopped).

* For initial commit, poppedVersion needs to be at least 2

To get rid of the previous unsuccessful recovery's recruited seed
storage servers.
2022-05-02 17:17:37 -07:00
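The reasoning in the commit message above can be illustrated with a toy model. This is a minimal sketch, not the FoundationDB implementation: `PeekReplySketch`, `peekSketch`, `orphanedStorageServerExits`, and `mayReplyToPeek` are hypothetical names introduced only for illustration, and the model assumes a peek that begins below the popped version is answered with `popped` set, while any other peek gets an empty reply that just advances the reader.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

using Version = std::int64_t;

// Hypothetical, simplified stand-in for a peek reply (not the real TLogPeekReply).
struct PeekReplySketch {
    std::optional<Version> popped; // set when the requested range was already popped
    Version end;                   // version the reader should continue from
};

// Toy TLog peek: if the reader starts below poppedVersion, report "popped";
// otherwise return an empty reply that only advances the reader to logEnd.
PeekReplySketch peekSketch(Version begin, Version poppedVersion, Version logEnd) {
    assert(begin >= 0 && poppedVersion >= 0);
    if (begin < poppedVersion) {
        return { poppedVersion, poppedVersion };
    }
    return { std::nullopt, logEnd > begin ? logEnd : begin };
}

// An orphaned storage server at ssVersion exits only if its peek reports "popped".
bool orphanedStorageServerExits(Version ssVersion, Version poppedVersion, Version logEnd) {
    return peekSketch(ssVersion, poppedVersion, logEnd).popped.has_value();
}

// Assumed per-generation gate: stopped (older) generations may reply at once,
// the current generation waits until the recovery transaction has been received.
bool mayReplyToPeek(bool generationStopped, bool recoveryTxnReceived) {
    return generationStopped || recoveryTxnReceived;
}
```

With a recovery version A = 100 and a storage server at B = 150, `orphanedStorageServerExits(150, 100, 200)` is false in this model, which mirrors the orphaned server that keeps peeking. Using a much higher recovery transaction version as the popped version, for example `orphanedStorageServerExits(150, 1000000, 1000000)`, yields true.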
Jingyu Zhou 0a03b190da Fix multiple PeekStream requests to log routers
There is a bug in how a log router handles streaming reads:
* Log router has a `logRouterPeekStream` actor A running.
* Remote tlog detects some problem and starts another streaming connection (maybe just reuse the connection?)
* Log router now has a new `logRouterPeekStream` actor B running.
* B runs and finds that popped version > reqBegin, so it reports `LogRouterPeekPopped`. This is because A is still running and changed the popped version.
* A ends with `TLogPeekStreamEnd operation_obsolete`
* B becomes stuck at `wait(req.reply.onReady() && store(reply.rep, future))`, because the future was set to `Never()`.

As a result, the remote tlog can no longer retrieve data from this log router.

Fix by killing the `logRouterPeekStream` B.
2022-04-15 14:11:52 -07:00
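The sequence described in the commit above can be modeled as a single decision in the stream handler. This is only a sketch under stated assumptions: `StreamOutcome` and `peekStreamStep` are hypothetical names, and the flag `endStreamWhenPopped` stands in for the fix (ending stream B) without claiming how the real actor does it.

```cpp
#include <cassert>
#include <cstdint>

using Version = std::int64_t;

// Possible results of one step of a (hypothetical) peek-stream handler.
enum class StreamOutcome { SendReply, StuckForever, StreamEnded };

// Toy model: when another stream has already advanced the popped version past
// reqBegin, the handler either waits on a future that never becomes ready
// (the bug) or ends the stream so the remote TLog can start over (the fix).
StreamOutcome peekStreamStep(Version reqBegin, Version poppedVersion, bool endStreamWhenPopped) {
    assert(reqBegin >= 0 && poppedVersion >= 0);
    if (poppedVersion > reqBegin) {
        return endStreamWhenPopped ? StreamOutcome::StreamEnded : StreamOutcome::StuckForever;
    }
    return StreamOutcome::SendReply;
}
```

In this model, `peekStreamStep(100, 250, false)` reproduces the stuck stream B, while `peekStreamStep(100, 250, true)` ends it instead of leaving the remote TLog waiting.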
sfc-gh-tclinkenbeard a71099471b Update copyright header dates 2022-03-21 13:36:23 -07:00
A.J. Beamon 250a88e682 Enforce that trace event suppression calls happen first when using trace event call chaining. Fix various instances where we weren't following this requirement. 2022-02-24 12:25:52 -08:00
Zhe Wu e07ae6fdb9 Address comments 2022-02-16 15:28:56 -08:00
Zhe Wu 9da735c38e Batch empty peek reply 2022-02-16 15:28:56 -08:00
Xiaoge Su 067c1cc55b Extract methods in LogSystem.h to corresponding cpp file 2021-09-12 14:17:19 -07:00
Xiaoxi Wang df7a801945 remove FIXME 2021-08-12 14:10:34 -07:00
Xiaoxi Wang a97570bd06 solve mis-spelling, trace log and format problems 2021-08-11 18:26:00 -07:00
Xiaoxi Wang 2263626cdc 200k test clean: enable remote Log pull from LogRouter 2021-08-07 09:53:32 -07:00
Xiaoxi Wang fd74a16f35 format code 2021-08-02 14:24:20 -07:00
Xiaoxi Wang 2df0474fec merge master 2021-08-02 11:58:35 -07:00
Xiaoxi Wang 1c4bce17aa revert code refactor 2021-07-30 19:08:22 -07:00
Xiaoxi Wang 12d4f5c261 disable streaming peek for localities < 0 2021-07-28 14:11:25 -07:00
Xiaoxi Wang c6b0de1264 problem: OOM 2021-07-26 09:36:53 -07:00
sfc-gh-tclinkenbeard b9a22a61ef Fix many -Wreorder-ctor warnings 2021-07-23 17:33:18 -07:00
Xiaoxi Wang bfebd4e812 Merge branch 'master' of https://github.com/apple/foundationdb into tlog_dev 2021-07-22 16:15:07 -07:00
Xiaoxi Wang cd32478b52 memory error(Simple config) 2021-07-22 15:45:59 -07:00
Xiaoxi Wang 974bb4b344 add stream peek function to oldTLogServer_x_x.actor.cpp and LogRouter 2021-07-20 17:01:37 -07:00
Xiaoxi Wang 5046ee3b07 add stream peek to logRouter 2021-07-20 17:42:00 +00:00
sfc-gh-tclinkenbeard a106d40012 Prevent logRouter from modifying ServerDBInfo object 2021-07-11 22:05:26 -07:00
Andrew Noyes ce25a99000 Disallow conversion from float in specialCounter 2021-06-04 12:09:13 -07:00
FDB Formatster df90cc89de apply clang-format to *.c, *.cpp, *.h, *.hpp files 2021-03-10 10:18:07 -08:00
Evan Tschannen 346a4e3ecd Merge branch 'release-6.3'
# Conflicts:
#	fdbcli/fdbcli.actor.cpp
#	fdbrpc/LoadBalance.actor.h
#	fdbrpc/MultiInterface.h
#	fdbserver/ClusterController.actor.cpp
#	fdbserver/MasterProxyServer.actor.cpp
#	fdbserver/masterserver.actor.cpp
2021-03-01 18:52:06 -08:00
Meng Xu ef0bf2728e Merge branch 'release-6.3' into mengxu/ha-code
Resolve Conflicts:
	fdbserver/LogRouter.actor.cpp: Only conflicts at comments
2021-02-19 21:47:09 -08:00
Meng Xu 33eb1de00e Add some comment to log system
and resolve review comment by deleting my questions.
2021-02-19 21:44:13 -08:00
sfc-gh-tclinkenbeard 5b2e88b187 Use structured bindings in for loops 2020-12-27 01:46:20 -04:00
Andrew Noyes 877997632d Merge branch 'release-6.3' into anoyes/merge-release-6.3-master
Include conflict markers for review purposes
2020-12-04 01:38:07 +00:00
Andrew Noyes dc2bac5670 Resolve conflicts 2020-11-24 19:09:42 +00:00
Andrew Noyes 1f541f02be Merge branch 'anoyes/merge-6.2-to-6.3' into anoyes/release-6.3-merge
Merge, leaving conflict markers for now
2020-11-24 16:55:34 +00:00
David Youngworth d64cf8b9e3 Merge branch 6.3 into master 2020-11-17 11:22:45 -08:00
David Youngworth d0391db862 Merge branch 'release-6.2' into release-6.3 2020-11-16 10:15:23 -08:00
Meng Xu 4b0fba6ea8 Explain why waitForVersion waits for version minus MAX_READ_TRANSACTION_LIFE_VERSIONS 2020-11-13 22:14:01 -08:00
Meng Xu 222da17558 Merge branch 'release-6.2' into mengxu/ha-code-read 2020-11-12 13:39:27 -08:00
Young Liu bc688a23c5 Use histogram new API and change group name 2020-11-09 18:54:21 -08:00
Young Liu c6768c4004 log router peek latency metrics 2020-11-09 15:04:37 -08:00
sfc-gh-tclinkenbeard 4669f837fa Add uses of makeReference 2020-11-07 22:10:18 -08:00
Meng Xu 4788544a6f Revise comments based on review suggestions
Ack. Jingyu and Xin for their suggestions.
2020-11-06 08:51:13 -08:00
Xin Dong 5d7ec6555a Update fdbserver/LogRouter.actor.cpp 2020-11-04 16:34:32 -08:00
Xin Dong 44cdc4dfa6 Update fdbserver/LogRouter.actor.cpp 2020-11-04 09:44:28 -08:00
Meng Xu 1664e2ff7f Add more comments and questions to LR tLog and loadbalance 2020-11-01 21:22:23 -08:00
Xin Dong 46150d22c3 Attach generation(recovery count) to TLog metrics and LogRouter metrics. 2020-11-01 11:24:23 -08:00
Xin Dong d302f60925 Fix build error. 2020-10-30 17:06:22 -07:00
Xin Dong 566365accd Fix typo. 2020-10-30 16:28:05 -07:00
Xin Dong af7e65110f Allow the caller to decorate role metrics trace event with more details. 2020-10-30 16:20:08 -07:00
Xin Dong f2a6a6101e Fix build error. 2020-10-30 13:43:39 -07:00
Meng Xu 063700e4d6 Add comments and questions to HA and tLog code reading
The comments' correctness needs to be confirmed by reviewers.
2020-10-30 12:14:57 -07:00
Xin Dong eead86f006 Add primary peek location, aka pairing TLog ID, to LogRouterMetrics. 2020-10-30 11:42:09 -07:00
Meng Xu ef8c1060a2 Merge branch 'master' into mengxu/tmp-merge-6.3 2020-07-13 10:15:56 -07:00