Commit Graph

11088 Commits

Author SHA1 Message Date
Jon Fu 8ef0411b32 address code review comments and introduce offset parameter 2022-11-03 11:39:39 -07:00
Jon Fu d95eb4dd71 Merge branch 'main' of github.com:apple/foundationdb into tenant-list-filter 2022-10-28 10:00:42 -07:00
Andrew Noyes 0a15f081a1
Proactively clean up idempotency ids for successful commits (#8578)
* Proactively clean up idempotency ids for successful commits

This change also includes some minor changes from my branch working on
an idempotency ids cleaner, that I'd like to get merged sooner rather
than later.

- Adding a timestamp to idempotency values
- Making IdempotencyId an actor file
- Adding commit_unknown_result_fatal
- Checking idempotencyIdsExpiredVersion in determineCommitStatus
- Some testing QOL changes

* Factor out decodeIdempotencyKey logic

* Fix formatting

* Update flow/include/flow/error_definitions.h

Co-authored-by: A.J. Beamon <aj.beamon@snowflake.com>

* Use KeyBackedObjectProperty for idempotencyIdsExpiredVersion

* Add IDEMPOTENCY_ID_IN_MEMORY_LIFETIME knob

* Rename ExpireIdempotencyKeyValuePairRequest

Also add a code probe for the case where an ExpireIdempotencyIdRequest is
received before the count is known, and add an assert

* Fix formatting and add TODO for nwijetunga

Co-authored-by: A.J. Beamon <aj.beamon@snowflake.com>
2022-10-28 09:07:54 -07:00
Lukas Joswiak 91146a03f0 Write cluster ID to `ClientDBInfo`
This enables clients to receive the cluster ID.
2022-10-27 13:56:13 -07:00
Lukas Joswiak 28540e5962 Format 2022-10-27 13:56:13 -07:00
Lukas Joswiak 02bc5edbf8 Avoid blocking in choose when 2022-10-27 13:56:13 -07:00
Lukas Joswiak 9d3c3b1efe Remove cluster ID logic from individual roles
The logic to determine the validity of a process joining a cluster now
belongs on the worker and the cluster controller. It is no longer
restricted to tlogs and storages, but instead applies to all processes
(even stateless ones).
2022-10-27 13:56:13 -07:00
Lukas Joswiak 1fca3b7ddc Modify how cluster ID tests are run in simulation 2022-10-27 13:56:13 -07:00
Lukas Joswiak bba05b7c9b Move cluster ID from txnStateStore to the database
The cluster ID is now stored in the database instead of in the
txnStateStore. The cluster controller will read it on boot and send it
to all processes to persist.
2022-10-27 13:56:13 -07:00
Lukas Joswiak 5ca2b89bdf Fix simulation issue where process switch was ignored
The simulator tracks only active processes. Rebooted or killed processes
are removed from the list of processes, and only get added back when the
process is rebooted and starts up again. This causes a problem for the
`RebootProcessAndSwitch` kill type, which wants to simultaneously reboot
all machines in a cluster and change their cluster file. If a machine is
currently being rebooted, it will miss the reboot process and switch
command.

The fix is to add a check when a process is being started in simulation.
If the process has had its cluster file changed and the cluster is in a
state where all processes should have had their cluster files reverted
to the original value, the simulator will now send a
`RebootProcessAndSwitch` signal right when the process is started. This
will cause an extra reboot, but should correctly switch the process back
to its original, correct cluster file, allowing all clusters to fully
recover.

Note that the above issue should only affect simulation, due to how the
simulator tracks processes and handles kill signals.

This commit also adds a field to each process struct to determine
whether the process is being run in a DR cluster in the simulation run.
This is needed because simulation does not differentiate between
processes in different clusters (other than by the IP), and some
processes needed to switch clusters and some simply needed to be
rebooted.
2022-10-27 13:56:13 -07:00
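
The start-up check described in 5ca2b89bdf can be sketched roughly as follows. This is a minimal illustration, not the simulator's code: `SimProcess`, `SimState`, `KillType::None`, and `signalOnStart` are hypothetical names, and only `RebootProcessAndSwitch` comes from the commit message.

```cpp
#include <cassert>

// Hypothetical signal set; only RebootProcessAndSwitch is named in the commit.
enum class KillType { None, RebootProcessAndSwitch };

// Hypothetical per-process state tracked by a simulator.
struct SimProcess {
    bool clusterFileChanged = false; // process still points at a switched cluster file
};

// Hypothetical global simulation state.
struct SimState {
    bool clusterFilesReverted = false; // all processes were told to switch back
};

// Decides which signal, if any, to send to a process as it starts up.
// A process that was mid-reboot when the revert was broadcast missed it,
// so it is told to reboot and switch as soon as it comes back.
inline KillType signalOnStart(const SimProcess& p, const SimState& s) {
    if (p.clusterFileChanged && s.clusterFilesReverted) {
        return KillType::RebootProcessAndSwitch;
    }
    return KillType::None;
}
```

The cost is one extra reboot for the affected process, which matches the trade-off the commit message accepts.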
Lukas Joswiak f43011e4b7 Notify processes joining the wrong cluster
And have these processes enter a "zombie" state where they cancel all
their actors and then wait forever, refusing to do any additional work
until they are manually handled by the operator.
2022-10-27 13:56:13 -07:00
Lukas Joswiak 72a97afcd6 Avoid recruiting workers with different cluster ID 2022-10-27 13:56:13 -07:00
Lukas Joswiak a72066be33 Add simulation support for changing the cluster file 2022-10-27 13:56:13 -07:00
Xiaoge Su ab0f827058 configurationMonitor does not need to check watch reference count 2022-10-27 12:42:05 -07:00
Xiaoge Su 639afbe62c Cancel watch when the key is not being waited
Currently, there is a cyclic reference situation in

    DatabaseContext -> WatchMetadata -> watchStorageServerResp ->
    DatabaseContext

If a watch is created in the DatabaseContext, then even if the
corresponding wait ACTOR is cancelled, the WatchMetadata still holds
a reference to the watchStorageServerResp ACTOR, which holds a reference
to the DatabaseContext.

In this situation, any DatabaseContext that holds a watch will not be
automatically destructed, since its reference count never drops to 0
until the watched value changes. Every time the cluster recovers,
several watches are created, and when the cluster restarts, the
DatabaseContext that is no longer in use cannot be destructed because
of these watches.

With this patch, each wait on the watch is counted. When the watch is
either triggered or cancelled, the corresponding count is decremented.
If nothing is waiting on a watch, the watch is cancelled, which
releases its reference to the DatabaseContext. This will hopefully fix
the issue mentioned above.

The code was tested by 1) manually changing the number of logs of a
local cluster and observing the cluster recover and the previous
DatabaseContext being destructed; 2) a 100K joshua run, with 1 failure;
the same test also fails on the current git main branch.
2022-10-27 12:42:05 -07:00
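
The reference cycle and the waiter-counting fix described in 639afbe62c can be illustrated with a small standalone sketch. It uses `std::shared_ptr` in place of Flow's reference counting, and `Context`, `Watch`, `addWaiter`, and `removeWaiter` are hypothetical stand-ins, not the FoundationDB types or functions.

```cpp
#include <cassert>
#include <memory>
#include <utility>

class Watch;

// Hypothetical stand-in for DatabaseContext: it owns its watch.
struct Context {
    std::shared_ptr<Watch> watch;
};

// Hypothetical stand-in for WatchMetadata plus its background actor.
// The background work keeps a strong reference to the Context, which
// closes the cycle Context -> Watch -> Context.
class Watch {
public:
    explicit Watch(std::shared_ptr<Context> ctx) : background_(std::move(ctx)) {}

    // Called when a caller starts waiting on the watch.
    void addWaiter() { ++waiters_; }

    // Called when a waiter is triggered or cancelled. Once nothing is
    // waiting, cancel the background work so it drops its Context reference.
    void removeWaiter() {
        assert(waiters_ > 0);
        if (--waiters_ == 0) {
            cancel();
        }
    }

    bool cancelled() const { return background_ == nullptr; }

private:
    void cancel() { background_.reset(); }

    std::shared_ptr<Context> background_; // reference held by the background work
    int waiters_ = 0;
};
```

In this sketch, dropping the last external reference to a `Context` does not destroy it while a watch is outstanding. Only when the final waiter is removed does the watch release the `Context`.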
Nim Wijetunga bf01d9b879
Bulk Setup Workload Improvements (#8573)
* bulk setup workload improvements

* fix workload

* modify
2022-10-27 11:10:14 -07:00
Josh Slocum 4d3553481f
Blob connection provider test (#8478)
* Refactoring test blob metadata creation

* Implementing BlobConnectionProviderTest

* createRandomTestBlobMetadata supports blobstore and works outside simulation
2022-10-27 10:44:06 -05:00
Jon Fu 886c286297 Merge branch 'main' of github.com:apple/foundationdb into tenant-list-filter 2022-10-26 15:01:46 -07:00
Jon Fu b17c3fecbb add invalid tenant state and assertion in metacluster consistency 2022-10-26 14:37:00 -07:00
Dennis Zhou deeedfc3f8
Merge pull request #8537 from sfc-gh-dzhou/unblob
blob: allow purge ranges to begin and end in unblobbified regions
2022-10-26 11:11:09 -07:00
Josh Slocum 623e6ef761
adding delay in bw forced shutdown to prevent crash races (#8552) 2022-10-26 12:22:41 -05:00
Nim Wijetunga 6f37f55917
Restore System Keys First in Backup/Restore Workloads (#8475)
* system key restore ordering

* restore system keys before regular data

* atomic restore backup fix

* change testing

* fix compile error

* fix compile issue

* fix compile issues

* Trigger Build

* only split restore if encryption is enabled

* revert knob changes

* Update fdbserver/workloads/AtomicSwitchover.actor.cpp

Co-authored-by: A.J. Beamon <aj.beamon@snowflake.com>

* Update fdbserver/workloads/AtomicSwitchover.actor.cpp

Co-authored-by: A.J. Beamon <aj.beamon@snowflake.com>

* Update fdbserver/workloads/BackupCorrectness.actor.cpp

Co-authored-by: A.J. Beamon <aj.beamon@snowflake.com>

* Update fdbserver/workloads/AtomicRestore.actor.cpp

Co-authored-by: A.J. Beamon <aj.beamon@snowflake.com>

* add todo

* strengthen check

* separate system restore for atomic restore

* address pr comments

* address pr comments

Co-authored-by: A.J. Beamon <aj.beamon@snowflake.com>
2022-10-26 09:38:27 -07:00
Josh Slocum ab6953be7d
Blob Granule read-driven compaction (#8572) 2022-10-26 09:02:50 -07:00
Marian Dvorsky 3c5d3f7a94
Fix SpanContext for GP:getLiveCommittedVersion (#8565)
* Fix SpanContext for GP:getLiveCommittedVersion
2022-10-26 16:29:28 +02:00
Xiaoxi Wang bb0236433c
Merge pull request #8540 from sfc-gh-xwang/feature/main/storageMetrics
Make MockStorageServer serve StorageMetrics related request
2022-10-25 17:29:21 -07:00
Xiaoxi Wang 0a5e596758 fix network failure check in unit test 2022-10-25 16:43:00 -07:00
Xiaoxi Wang 36d9de9072 change UNREACHABLE to ASSERT(false); change function name 2022-10-25 15:43:24 -07:00
Trevor Clinkenbeard 0f4fddfa17
Merge pull request #8480 from sfc-gh-tclinkenbeard/reject-tag-throttled-txns
Reject transactions that have been tag throttled too long
2022-10-25 15:34:07 -07:00
Jingyu Zhou 744c391608
Merge pull request #8539 from vishesh/cc-fail-later
Don't fail ConsistencyCheck on first mismatch
2022-10-25 15:33:11 -07:00
sfc-gh-tclinkenbeard e8e7c873d8 Merge remote-tracking branch 'origin/main' into reject-tag-throttled-txns 2022-10-25 14:28:55 -07:00
Trevor Clinkenbeard 25f3a99b3d
Merge pull request #8568 from sfc-gh-tclinkenbeard/make-tracecounters-method
Encapsulate `CounterCollection`
2022-10-25 14:27:56 -07:00
sfc-gh-tclinkenbeard f339819758 Merge remote-tracking branch 'origin/main' into reject-tag-throttled-txns 2022-10-25 11:59:35 -07:00
Xiaoxi Wang 5a8adca1f7 solve review comments: mark const; add comments; template abbreviation 2022-10-25 10:56:24 -07:00
sfc-gh-tclinkenbeard 74212eeacf Encapsulate CounterCollection 2022-10-25 10:17:15 -07:00
Jingyu Zhou 0ae568a872
Merge pull request #8556 from jzhou77/fix
Fix stack overflows
2022-10-24 16:46:35 -07:00
Ankita Kejriwal ce733cd1a1
Merge pull request #8538 from sfc-gh-akejriwal/monitorusage
Add functionality to get tenants over storage quota and improve the relevant monitors
2022-10-24 16:28:07 -07:00
Hui Liu e2dc50d220
Merge pull request #8508 from sfc-gh-huliu/storageinterf
Implement StorageServerInterface for BlobMigrator
2022-10-24 16:10:31 -07:00
Zhe Wu 0140991d15 Rename NewPhysicalShardReason to RetryFindDstReason 2022-10-24 15:18:20 -07:00
Zhe Wu fc9295ab66 Address comments 2022-10-24 15:18:20 -07:00
Zhe Wu 22047385c4 Count the detailed reason for new physical shard creation during data move 2022-10-24 15:18:20 -07:00
Hui Liu f2289ced27 Add StorageServerInterface for BlobMigrator 2022-10-24 13:12:07 -07:00
Xiaoxi Wang db72a29c06 fix compile error after rebase 2022-10-24 11:16:23 -07:00
Jingyu Zhou a8f821e152 Fix stack overflows
The loop is transformed by the actor compiler into recursion that may cause
stack overflows. Thus, I added yield() to unwind the stack and refactored the
parsing code so that subsequent files are blocked until previous ones have
finished.
2022-10-24 11:13:11 -07:00
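
Why yielding bounds the stack depth, as described in a8f821e152, can be shown with a toy run loop. This is only a sketch of the general technique under stated assumptions: `RunLoop`, `Parser`, and `yieldEvery` are hypothetical, and posting a task to the queue stands in for Flow's `yield()`. It is not the code generated by the actor compiler.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <utility>

// Hypothetical single-threaded run loop; not the Flow scheduler.
class RunLoop {
public:
    void post(std::function<void()> task) { queue_.push_back(std::move(task)); }

    void run() {
        while (!queue_.empty()) {
            auto task = std::move(queue_.front());
            queue_.pop_front();
            task();
        }
    }

private:
    std::deque<std::function<void()>> queue_;
};

struct Parser {
    explicit Parser(RunLoop& l) : loop(l) {}

    RunLoop& loop;
    std::size_t depth = 0;     // current nesting of step() calls
    std::size_t maxDepth = 0;  // deepest nesting observed
    std::size_t processed = 0; // number of items handled

    // Processes items [i, n). With yieldEvery == 0 every step calls the next
    // one directly, so nesting grows with n. Otherwise every yieldEvery-th
    // step is posted to the run loop, letting the stack unwind first.
    void step(std::size_t i, std::size_t n, std::size_t yieldEvery) {
        assert(i <= n);
        if (i == n) return;
        ++depth;
        maxDepth = std::max(maxDepth, depth);
        ++processed;
        if (yieldEvery != 0 && (i + 1) % yieldEvery == 0) {
            loop.post([this, i, n, yieldEvery] { step(i + 1, n, yieldEvery); });
        } else {
            step(i + 1, n, yieldEvery);
        }
        --depth;
    }
};
```

Without yielding, the nesting depth equals the number of items. With a yield every 10 items, it never exceeds 10, regardless of input size.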
Dennis Zhou 136a325fdc blob/testing: randomly purge the whole range instead of just active 2022-10-24 11:08:04 -07:00
Dennis Zhou 070e4c133e blob/testing: remove setRange() and call (un)blobbifyRange() directly
This also fixes a few wrong setRange(true/false).
2022-10-24 11:08:04 -07:00
Xiaoxi Wang 918018d492 format code 2022-10-24 10:50:46 -07:00
Xiaoxi Wang 0d4b4d05e2 implement MSS as IStorageMetricsService and pass the unit test 2022-10-24 09:58:41 -07:00
Xiaoxi Wang 3c67b7df39 extract serveStorageMetricsRequests template function 2022-10-24 09:58:41 -07:00
Xiaoxi Wang c14ee5395f define IStorageMetricsService 2022-10-24 09:58:41 -07:00
Xiaoxi Wang e07a50573a splitStorageMetrics finish implementation (no unit test yet but 100k
test pass)
2022-10-24 09:58:41 -07:00