23462 Commits

Author SHA1 Message Date
Jingyu Zhou dc60f63f9b Revert "Cancel watch when the key is not being waited"
This reverts commit 639afbe62c.
2022-10-27 19:46:05 -07:00
Jingyu Zhou fbe9802be5 Revert "configurationMonitor does not need to check watch reference count"
This reverts commit ab0f827058.
2022-10-27 19:46:05 -07:00
Jingyu Zhou 634bd529e7 Revert "Record the version of each watch"
This reverts commit 4bd24e4d64.
2022-10-27 19:46:05 -07:00
Jingyu Zhou 19ae4e7eb7 Revert "Reformat source"
This reverts commit ec47c261bf.
2022-10-27 19:46:05 -07:00
Jingyu Zhou e460933b52 Revert "Remove debugging output"
This reverts commit 41d1d6404d.
2022-10-27 19:46:05 -07:00
Jingyu Zhou e7fd3eda00 Revert "Update fdbclient/NativeAPI.actor.cpp"
This reverts commit 812243bafa.
2022-10-27 19:46:05 -07:00
Lukas Joswiak 9625efd5b9 Add comment about configuration database 2022-10-27 13:56:13 -07:00
Lukas Joswiak 8e76621653 Disable shared state updates on configuration database 2022-10-27 13:56:13 -07:00
Lukas Joswiak 91146a03f0 Write cluster ID to `ClientDBInfo`
This enables clients to receive the cluster ID.
2022-10-27 13:56:13 -07:00
Lukas Joswiak 28540e5962 Format 2022-10-27 13:56:13 -07:00
Lukas Joswiak a8f8757f77 Rename cluster ID key
In FDB 7.1, this key was stored in the txnStateStore. In 7.2, it has
been moved to the database. This was causing protocol compatibility
issues during upgrades, so we need to rename the key.
2022-10-27 13:56:13 -07:00
Lukas Joswiak 02bc5edbf8 Avoid blocking in choose when 2022-10-27 13:56:13 -07:00
Lukas Joswiak 9d3c3b1efe Remove cluster ID logic from individual roles
The logic to determine the validity of a process joining a cluster now
belongs on the worker and the cluster controller. It is no longer
restricted to tlogs and storages, but instead applies to all processes
(even stateless ones).
2022-10-27 13:56:13 -07:00
Lukas Joswiak 1fca3b7ddc Modify how cluster ID tests are run in simulation 2022-10-27 13:56:13 -07:00
Lukas Joswiak bba05b7c9b Move cluster ID from txnStateStore to the database
The cluster ID is now stored in the database instead of in the
txnStateStore. The cluster controller will read it on boot and send it
to all processes to persist.
2022-10-27 13:56:13 -07:00
Lukas Joswiak 5ca2b89bdf Fix simulation issue where process switch was ignored
The simulator tracks only active processes. Rebooted or killed processes
are removed from the list of processes, and only get added back when the
process is rebooted and starts up again. This causes a problem for the
`RebootProcessAndSwitch` kill type, which wants to simultaneously reboot
all machines in a cluster and change their cluster file. If a machine is
currently being rebooted, it will miss the reboot process and switch
command.

The fix is to add a check when a process is being started in simulation.
If the process has had its cluster file changed and the cluster is in a
state where all processes should have had their cluster files reverted
to the original value, the simulator will now send a
`RebootProcessAndSwitch` signal right when the process is started. This
will cause an extra reboot, but should correctly switch the process back
to its original cluster file, allowing all clusters to fully
recover.

Note that the above issue should only affect simulation, due to how the
simulator tracks processes and handles kill signals.

This commit also adds a field to each process struct to determine
whether the process is being run in a DR cluster in the simulation run.
This is needed because simulation does not differentiate between
processes in different clusters (other than by the IP), and some
processes needed to switch clusters and some simply needed to be
rebooted.
2022-10-27 13:56:13 -07:00
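
The start-up check described in the commit above can be illustrated with a small model. This is only a sketch under stated assumptions: `MiniSimulator`, `SimProcess`, and every member name are hypothetical and do not correspond to the real simulator code. The model keeps only active processes, so a process that is mid-reboot misses the broadcast and is corrected when it starts again.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical, heavily simplified model; not the FoundationDB simulator types.
struct SimProcess {
    bool clusterFileSwitched = false; // true while pointing at the other cluster file
    int reboots = 0;                  // how many reboot-and-switch signals were applied
};

class MiniSimulator {
public:
    // Called when a process starts (or starts again after a reboot).
    void start(const std::string& name, SimProcess p) {
        // The added check: the process still has a switched cluster file, but the
        // cluster expects every process to be back on the original file.
        if (p.clusterFileSwitched && !expectSwitched_) {
            rebootAndSwitch(p);
        }
        active_[name] = p;
    }

    // A process being rebooted is removed from the active set until it starts again.
    SimProcess beginReboot(const std::string& name) {
        SimProcess p = active_.at(name);
        active_.erase(name);
        return p;
    }

    // Models a reboot-and-switch broadcast: only active processes receive it.
    void rebootAllAndSwitch() {
        expectSwitched_ = !expectSwitched_;
        for (auto& kv : active_) {
            rebootAndSwitch(kv.second);
        }
    }

    const SimProcess& get(const std::string& name) const { return active_.at(name); }

private:
    static void rebootAndSwitch(SimProcess& p) {
        p.clusterFileSwitched = !p.clusterFileSwitched;
        ++p.reboots;
    }

    std::map<std::string, SimProcess> active_;
    bool expectSwitched_ = false; // what the cluster as a whole expects
};
```

In this model, a process removed with `beginReboot` between two calls to `rebootAllAndSwitch` misses the second broadcast. Passing it back to `start` applies the extra reboot and returns it to the original cluster file.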
Lukas Joswiak f43011e4b7 Notify processes joining the wrong cluster
And have these processes enter a "zombie" state where they cancel all
their actors and then wait forever, refusing to do any additional work
until they are manually handled by the operator.
2022-10-27 13:56:13 -07:00
Lukas Joswiak 72a97afcd6 Avoid recruiting workers with different cluster ID 2022-10-27 13:56:13 -07:00
Lukas Joswiak a72066be33 Add simulation support for changing the cluster file 2022-10-27 13:56:13 -07:00
Jingyu Zhou 6e0835f8a8
Merge pull request #8599 from technmsg/main
updated copyright year on web site
2022-10-27 13:36:56 -07:00
Xiaoge Su 812243bafa Update fdbclient/NativeAPI.actor.cpp
Co-authored-by: Jingyu Zhou <jingyuzhou@gmail.com>
2022-10-27 12:42:05 -07:00
Xiaoge Su 41d1d6404d Remove debugging output 2022-10-27 12:42:05 -07:00
Xiaoge Su ec47c261bf Reformat source 2022-10-27 12:42:05 -07:00
Xiaoge Su 4bd24e4d64 Record the version of each watch
In the following case:
    1. A watch on key A is set, and the watchValueMap ACTOR, denoted X, starts waiting.
    2. All watches are cleared due to a connection string change.
    3. The watch on key A is restarted with watchValueMap ACTOR Y.
    4. X receives the cancel exception and tries to dereference the counter. This causes Y to be cancelled.

the reference count causes the watch to terminate prematurely. Recording
the version of each watch helps prevent this issue.
2022-10-27 12:42:05 -07:00
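
One way to picture the idea of recording a version per watch is sketched below. It assumes a simple per-key table, and `WatchTable`, `WatchEntry`, and all method names are hypothetical rather than the actual `NativeAPI` code. Each registration remembers the version of the watch it joined, and a release that carries a stale version is ignored, so the old waiter X cannot cancel the restarted watch Y.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// Hypothetical stand-in for per-key watch bookkeeping; not the FDB types.
struct WatchEntry {
    std::uint64_t version; // generation of the watch currently registered for the key
    int waiters;           // number of waiters attached to this generation
};

class WatchTable {
public:
    // Registers a waiter on `key` and returns the version the caller must remember.
    std::uint64_t acquire(const std::string& key) {
        auto it = entries_.find(key);
        if (it == entries_.end()) {
            it = entries_.emplace(key, WatchEntry{ nextVersion_++, 0 }).first;
        }
        ++it->second.waiters;
        return it->second.version;
    }

    // Drops every watch, for example after a connection string change.
    void clearAll() { entries_.clear(); }

    // Releases one waiter. A release from an older generation is ignored.
    // Returns true if the release was applied to the current watch.
    bool release(const std::string& key, std::uint64_t version) {
        auto it = entries_.find(key);
        if (it == entries_.end() || it->second.version != version) {
            return false; // stale waiter: leave the newer watch untouched
        }
        assert(it->second.waiters > 0);
        if (--it->second.waiters == 0) {
            entries_.erase(it); // last waiter gone, the watch can be cancelled
        }
        return true;
    }

    bool isWatched(const std::string& key) const { return entries_.count(key) != 0; }

private:
    std::map<std::string, WatchEntry> entries_;
    std::uint64_t nextVersion_ = 1;
};
```

Following the four steps above: X calls `acquire("A")`, `clearAll()` runs, Y calls `acquire("A")` and receives a different version, and X's late `release` returns `false` without affecting Y.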
Xiaoge Su ab0f827058 configurationMonitor does not need to check watch reference count 2022-10-27 12:42:05 -07:00
Xiaoge Su 639afbe62c Cancel watch when the key is not being waited
Currently, there is a cyclic reference situation in

    DatabaseContext -> WatchMetadata -> watchStorageServerResp ->
    DatabaseContext

If a watch is created in the DatabaseContext, then even if the
corresponding wait ACTOR is cancelled, the WatchMetadata still holds
a reference to the watchStorageServerResp ACTOR, which holds a reference to
the DatabaseContext.

In this situation, any DatabaseContext that holds a watch will not be
automatically destructed, since its reference count will never drop to
0 until the watched value changes. Every time the cluster recovers,
several watches are created, and when the cluster restarts, the
DatabaseContext that is no longer in use cannot be destructed because of
these watches.

With this patch, each wait on the watch is counted. When the
watch is either triggered or cancelled, the corresponding count is
reduced. If nothing is waiting on a watch, the watch is cancelled,
effectively reducing the reference count of the DatabaseContext. This will
hopefully fix the issue mentioned above.

The code was tested by 1) manually changing the number of logs of a local
cluster and observing the cluster recovery and the previous DatabaseContext
being destructed; 2) a 100K joshua run, with 1 failure; the same test also
fails on the current git main branch.
2022-10-27 12:42:05 -07:00
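
The reference cycle and the wait-counting fix described in the commit above can be modelled with standard smart pointers. This is a sketch under assumptions, not the FDB implementation: `Context`, `Watch`, and `PendingResponse` are hypothetical stand-ins for DatabaseContext, WatchMetadata, and the watchStorageServerResp ACTOR, and `std::shared_ptr` stands in for FDB's own reference counting.

```cpp
#include <cassert>
#include <memory>

// Simplified, hypothetical stand-ins; not the FoundationDB classes.
struct Context;

// Models the in-flight storage server request, which keeps the context alive.
struct PendingResponse {
    std::shared_ptr<Context> owner;
};

struct Watch {
    std::unique_ptr<PendingResponse> response;
    int waiters = 0;

    void addWaiter() { ++waiters; }

    // Called when a waiter is either triggered or cancelled.
    void removeWaiter() {
        assert(waiters > 0);
        if (--waiters == 0) {
            // Nobody is waiting any more: cancel the request, which drops its
            // reference to the context and breaks the cycle.
            response.reset();
        }
    }
};

struct Context {
    explicit Context(bool* destroyedFlag) : destroyed(destroyedFlag) {}
    ~Context() { *destroyed = true; }

    std::shared_ptr<Watch> watch;
    bool* destroyed; // lets the caller observe destruction
};

// Builds the cycle Context -> Watch -> PendingResponse -> Context with one
// waiter attached, then returns only the watch. The local context handle is
// dropped on return, yet the context stays alive because of the cycle.
std::shared_ptr<Watch> makeWatchedContext(bool* destroyedFlag) {
    auto ctx = std::make_shared<Context>(destroyedFlag);
    ctx->watch = std::make_shared<Watch>();
    ctx->watch->response = std::make_unique<PendingResponse>();
    ctx->watch->response->owner = ctx;
    ctx->watch->addWaiter();
    return ctx->watch;
}
```

After `makeWatchedContext` returns, the context is unreachable from the caller but has not been destroyed. Calling `removeWaiter()` on the returned watch brings the count to zero, cancels the pending response, and lets the context destructor run.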
Xiaoge Su 03b102d86a Clean up unused comment in flow.h 2022-10-27 12:42:05 -07:00
Alex Moundalexis 67049518b9
updated copyright year on web site 2022-10-27 15:05:52 -04:00
Nim Wijetunga bf01d9b879
Bulk Setup Workload Improvements (#8573)
* bulk setup workload improvements

* fix workload

* modify
2022-10-27 11:10:14 -07:00
Jingyu Zhou fe66c026b4
Merge pull request #8598 from jzhou77/fix
Fix restarting restore test failure
2022-10-27 10:44:17 -07:00
Josh Slocum 4d3553481f
Blob connection provider test (#8478)
* Refactoring test blob metadata creation

* Implementing BlobConnectionProviderTest

* createRandomTestBlobMetadata supports blobstore and works outside simulation
2022-10-27 10:44:06 -05:00
Jingyu Zhou 6c0f890f78 Fix restarting restore test failure
An old fdbserver may not set the "enableSnapshotBackupEncryption" key, so we
should allow the key to be absent.
2022-10-27 08:43:55 -07:00
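
Tolerating a missing key, as the commit above describes, might look like the following sketch. Several things here are assumptions for illustration: the parameters are held in a plain string map, the flag is encoded as the text "true" or "false", and `readFlag` and `flagAcceptable` are hypothetical helpers, not functions from the test or from fdbserver.

```cpp
#include <cassert>
#include <map>
#include <optional>
#include <string>

using Params = std::map<std::string, std::string>;

// Returns the flag value if the key is present, or std::nullopt if it is absent.
// Assumption: the value is encoded as the text "true" or "false".
std::optional<bool> readFlag(const Params& params, const std::string& key) {
    auto it = params.find(key);
    if (it == params.end()) {
        return std::nullopt; // e.g. written by an older binary that never set the key
    }
    return it->second == "true";
}

// Accepts the parameters if the key is absent, or present with the expected value.
bool flagAcceptable(const Params& params, const std::string& key, bool expected) {
    std::optional<bool> value = readFlag(params, key);
    return !value.has_value() || *value == expected;
}
```

With this shape, a check only fails when the key is present and disagrees with the expected value. A parameter set that lacks the key entirely is accepted.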
Vaidas Gasiunas c6adb3a98c
Building fdb_c_shim to a shared library (#8586) 2022-10-27 12:37:20 +02:00
Markus Pilman 2bf9c2f448
Merge pull request #8588 from sfc-gh-mpilman/bugfixes/fix-build-dependencies
Fix AWS SDK build and removed check for old build system
2022-10-26 12:36:08 -06:00
Dennis Zhou deeedfc3f8
Merge pull request #8537 from sfc-gh-dzhou/unblob
blob: allow purge ranges to begin and end in unblobbified regions
2022-10-26 11:11:09 -07:00
Markus Pilman 989731f7f4 Fix AWS SDK build and removed check for old build system 2022-10-26 11:48:10 -06:00
Aaron Molitor f620f391f5 make same change to Dockerfile.eks (from #8583) 2022-10-26 12:24:37 -05:00
Josh Slocum 623e6ef761
adding delay in bw forced shutdown to prevent crash races (#8552) 2022-10-26 12:22:41 -05:00
Nim Wijetunga 6f37f55917
Restore System Keys First in Backup/Restore Workloads (#8475)
* system key restore ordering

* restore system keys before regular data

* atomic restore backup fix

* change testing

* fix compile error

* fix compile issue

* fix compile issues

* Trigger Build

* only split restore if encryption is enabled

* revert knob changes

* Update fdbserver/workloads/AtomicSwitchover.actor.cpp

Co-authored-by: A.J. Beamon <aj.beamon@snowflake.com>

* Update fdbserver/workloads/AtomicSwitchover.actor.cpp

Co-authored-by: A.J. Beamon <aj.beamon@snowflake.com>

* Update fdbserver/workloads/BackupCorrectness.actor.cpp

Co-authored-by: A.J. Beamon <aj.beamon@snowflake.com>

* Update fdbserver/workloads/AtomicRestore.actor.cpp

Co-authored-by: A.J. Beamon <aj.beamon@snowflake.com>

* add todo

* strengthen check

* separate system restore for atomic restore

* address pr comments

* address pr comments

Co-authored-by: A.J. Beamon <aj.beamon@snowflake.com>
2022-10-26 09:38:27 -07:00
Josh Slocum ab6953be7d
Blob Granule read-driven compaction (#8572) 2022-10-26 09:02:50 -07:00
Aaron Molitor b8b7b46d8f update kubectl and awscli 2022-10-26 10:52:05 -05:00
Marian Dvorsky 3c5d3f7a94
Fix SpanContext for GP:getLiveCommittedVersion (#8565)
* Fix SpanContext for GP:getLiveCommittedVersion
2022-10-26 16:29:28 +02:00
Junhyun Shim 32099bfce5
Merge pull request #8564 from sfc-gh-jshim/enable-authz-benchmark-in-mako
Enable authz/TLS-enabled benchmark in mako
2022-10-26 14:55:53 +02:00
Junhyun Shim 2917598dc4 Merge remote-tracking branch 'origin/main' into enable-authz-benchmark-in-mako 2022-10-26 12:49:13 +02:00
Aaron Molitor e4116f8aee cleanup shell script, remove set -x, add more detailed logging 2022-10-25 23:23:22 -05:00
Xiaoxi Wang bb0236433c
Merge pull request #8540 from sfc-gh-xwang/feature/main/storageMetrics
Make MockStorageServer serve StorageMetrics related request
2022-10-25 17:29:21 -07:00
Xiaoxi Wang 0a5e596758 fix network failure check in unit test 2022-10-25 16:43:00 -07:00
Xiaoxi Wang 36d9de9072 change UNREACHABLE to ASSERT(false); change function name 2022-10-25 15:43:24 -07:00
Trevor Clinkenbeard 0f4fddfa17
Merge pull request #8480 from sfc-gh-tclinkenbeard/reject-tag-throttled-txns
Reject transactions that have been tag throttled too long
2022-10-25 15:34:07 -07:00
Jingyu Zhou 744c391608
Merge pull request #8539 from vishesh/cc-fail-later
Don't fail ConsistencyCheck on first mismatch
2022-10-25 15:33:11 -07:00