687 Commits

Author SHA1 Message Date
Jingyu Zhou fecffc93e4 Fix a segfault when tlog encounters platform_error
During destruction, rejoinClusterController actor should be cancelled to avoid
accessing TLogData object.
2024-05-17 11:40:22 -07:00
Dimitris Apostolou a88114c222
Fix typos 2024-02-07 01:16:00 +02:00
Hao Fu 9b17dd8caf
Fix backup workers stability issues (#11044)
This PR includes a few stability fixes for Backup Worker

* Fixed memory bookkeeping issue in Backup Worker. Previously
it didn't release flow lock correctly when erasing messages.

* Added TLogServer fix to return 0 from poppedVersion() for
unrecognized log router tags.
2023-11-13 15:55:25 -08:00
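The poppedVersion() change in 9b17dd8caf can be illustrated with a minimal sketch. This is not the TLogServer code: `Tag`, `PoppedTracker`, and `recordPop` are hypothetical stand-ins, and the only behavior taken from the commit message is that an unrecognized tag yields 0.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <map>
#include <utility>

using Version = int64_t;
// Hypothetical stand-in for a tag: (locality, id).
using Tag = std::pair<int, int>;

// Hypothetical tracker of the highest popped version per tag.
struct PoppedTracker {
    std::map<Tag, Version> popped;

    // Record a pop; popped versions never move backwards.
    void recordPop(const Tag& tag, Version upTo) {
        assert(upTo >= 0);
        Version& current = popped[tag]; // value-initialized to 0 on first use
        current = std::max(current, upTo);
    }

    // Return the popped version for a known tag, or 0 for a tag
    // this tracker has never seen.
    Version poppedVersion(const Tag& tag) const {
        auto it = popped.find(tag);
        return it == popped.end() ? 0 : it->second;
    }
};
```

Returning 0 for an unknown tag means a caller treats that tag as having popped nothing.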
Zhe Wu 83992d61ec Add a knob to guard the gray failure detection during TLog recovery 2023-08-29 14:49:39 -07:00
Zhe Wu 2078a0055a Add documentation 2023-07-26 14:16:16 -07:00
Zhe Wu f9c3ac3704 Remove debugging logging 2023-07-26 12:35:19 -07:00
Zhe Wu 00cdf09966 Cluster controller monitors recovery stats and triggers recovery if current recovery contains degraded servers 2023-07-17 11:14:22 -07:00
Zhe Wu 9670ed1cd8 Make TLog explicitly monitor connectivity issues during [start version, recover version] recovery 2023-07-17 11:12:28 -07:00
Evan Tschannen ef682d304e fix IKeyValueStore include 2023-06-16 13:28:40 -07:00
Evan Tschannen 359e178dcd Merge branch 'main' into feature-durable-change-feed
# Conflicts:
#	fdbclient/ClientKnobs.cpp
#	fdbserver/BlobManager.actor.cpp
#	fdbserver/worker.actor.cpp
2023-06-11 13:58:35 -07:00
Zhe Wu 1c290d3bc8 Make TLog server handle empty oldGenerationRecoverAtVersions 2023-05-16 15:16:42 -07:00
Zhe Wu 1eae833ae2 test record_recover_at_in_cstate and track_tlog_recovery in restart test 2023-05-16 13:37:42 -07:00
Zhe Wu a956979c32 Replace oldestGenerationStartVersion with oldestGenerationRecoverAtVersion 2023-05-16 13:09:34 -07:00
Evan Tschannen 3dd86d6c22 move IKeyValueStore.h to the client 2023-05-10 15:41:47 -07:00
Dan Adkins aaa4860f76
Fix commit location accounting in DiskQueue. (#10075)
We encountered a situation in simulation where the disk queue was in the following state

    +------------+------------+
    | page 1     | page 2     |
    +------------+------------+
    |rec |.......|rec |.......|
    +------------+------------+
    0..85        4096..4181
    ^.   ^__             ^
    popped. committed    pushed

and we attempted to pop up to 4096, i.e. everything before page 2. This triggered
one of the assertions in the disk queue code which was meant to catch tlog logic
bugs where we pop too much.

The issue, though, is the accounting of the commit location in the disk queue.
While we only pushed records through position 85, we committed the entire page.
Attempts to pop everything before page 2 should have succeeded since we're not
attempting to pop any uncommitted data.

The solution is to fix the commit location accounting in the disk queue to round
up to the next page, to reflect the reality that we only commit entire pages.

This bug was discovered in the first place by introducing a delay into the commit
queue loop during simulation testing. That delay is included in this change.

We also noticed that getNextCommitLocation() was incorrect. Since there are no
users of that function, we've removed it entirely.
2023-05-02 12:33:29 -04:00
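The page rounding described in aaa4860f76 can be sketched as follows. This is an illustrative model, not the DiskQueue implementation: `kPageSize`, `roundUpToPage`, and `QueueAccounting` are hypothetical names, and the 4096-byte page size is taken from the offsets in the diagram.

```cpp
#include <cassert>
#include <cstdint>

// Assumed page size, matching the 4096-byte pages in the diagram.
constexpr int64_t kPageSize = 4096;

// Round a byte offset up to the next page boundary. Offsets already on
// a boundary are returned unchanged.
constexpr int64_t roundUpToPage(int64_t offset) {
    return (offset + kPageSize - 1) / kPageSize * kPageSize;
}

// Hypothetical accounting of the committed location.
struct QueueAccounting {
    int64_t committed = 0;

    // Commits write whole pages, so the committed location is the end of
    // the page containing the last committed record, not the record's end.
    void commit(int64_t recordEnd) {
        assert(recordEnd >= 0);
        committed = roundUpToPage(recordEnd);
    }

    // A pop is valid only if it does not reach into uncommitted data.
    bool canPop(int64_t upTo) const { return upTo <= committed; }
};
```

With the diagram's numbers, committing the record that ends at 85 moves the committed location to 4096, so popping everything before page 2 is allowed, while popping past 4096 is still rejected.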
Zhe Wu 33736ff9af Cleanup GcGeneration test and function documents 2023-03-27 12:31:44 -07:00
Zhe Wu d576d9a66a Remove debug TraceEvent 2023-03-27 11:47:11 -07:00
Zhe Wu 40dc54223c Add GC generation test, and make all simulation tests pass 2023-03-27 11:46:13 -07:00
Zhe Wu 78bef8110b Track tlog recovery: tlog side implementation 2023-03-27 11:42:27 -07:00
Dan Adkins b8c9c8b0f4 Add metric for tlog commit time minus time spent waiting in the queue. 2023-02-27 15:40:22 -08:00
Dan Adkins 37b6804f88 Add metric for queue wait time in tlog. 2023-02-27 15:40:22 -08:00
Dan Adkins e3a61b9b22
Add metrics to understand tail commit latency (#9435)
* Add server-side latency metrics for Resolver requests.

* Add separate resolver latency metrics for queue wait and compute time.

* Add histogram for queue depth observed on resolver (during metrics interval).

* Fix tlog latency measurement to use timer() instead of now().
2023-02-24 14:13:12 -05:00
Yi Wu eac757d186
EaR: cleanup encryption knobs (#9386)
Changes:
* Cleanup all encryption knobs 
* Update simulated cluster to randomly enable encryption with higher probability
2023-02-18 13:18:20 -08:00
Zhe Wu 359b3a11e7 Change TLog pull async data warning timeout 2023-01-11 19:36:00 -08:00
Zhe Wu 087d37d10b Add event for txn server initialization and a warning for TLog slow catching up 2023-01-11 10:02:06 -08:00
Xiaoxi Wang 8266f52dea
Merge pull request #9012 from sfc-gh-xwang/feature/main/wiggleDelay
Persist accumulated wiggle delay
2023-01-04 16:14:09 -08:00
Xiaoxi Wang bbcb3cc018 extract KeyBackedConfig, StorageWiggleData class; solve template resolution problem; solve MV txn and native api conflict by splitting RunTransaction file 2023-01-02 23:34:39 -08:00
Jingyu Zhou 667d58ac35 Add back samples for (non)empty peeks stats
These were lost, likely due to refactoring. Now TLogMetrics have meaningful
data like:

TLogMetrics ID=59ec9c67b4d07433 Elapsed=5 BytesInput=0 -1 17048 BytesDurable=47.4 225.405 17048 BlockingPeeks=0 -1 0 BlockingPeekTimeouts=0 -1 0 EmptyPeeks=1.6 2.79237 236 NonEmptyPeeks=0 -1 32 ...
2022-12-21 11:18:28 -08:00
FoundationDB CI 86d6106dc1
format source code after switch to clang 15 2022-12-08 17:26:45 +00:00
Daniel Luan 4be05e5a5d Upgrade C++ Standard to 20 2022-12-06 14:19:06 -08:00
Yi Wu 551fd0b9bb
EAR: Cleanup Redwood tenant map usage (#8902)
A recent redesign no longer requires passing the tenant name to get an encryption key, and no longer allows optional tenant mode for tenant-aware encryption. This PR cleans up Redwood code to remove tenant map usage and updates various checks accordingly.

Changes:
* Cleanup TenantPrefixIndex in TenantAwareEncryptionKeyProvider and related logic in storage server and Redwood for passing the map around.
* Cleanup and update DecodeBoundaryVerifier to reflect the new design.
* A minor fix to writePages() that avoids a page encrypted with the default domain having a lower bound key belonging to a non-default domain.
* Fix TenantAwareEncryptionKeyProvider::getEncryptionDomain() returning wrong prefix length for system domain.
* A minor change to add a context string to IoTimeoutError.
2022-11-23 09:41:40 -08:00
sfc-gh-tclinkenbeard 3c97f43138 Change Histogram::Unit::microseconds to milliseconds 2022-11-21 08:03:56 -08:00
Jingyu Zhou f285a91f6c Add more debug events 2022-11-17 11:54:21 -08:00
Steve Atherton 8e8c4b4489
Merge pull request #8170 from sfc-gh-sgwydir/ddsketch
Use DDSketch for sample data
2022-11-17 10:38:12 -08:00
Sam Gwydir 214db4d17e formatting 2022-11-15 13:38:55 -08:00
sfc-gh-tclinkenbeard c03f60c618 Update rare code probe annotations 2022-11-15 13:21:25 -08:00
Sam Gwydir 7f33b0fa70 clang-format 2022-11-12 14:09:31 -08:00
Sam Gwydir 23706c957b Use DDSketch for Sample Data. 2022-11-12 13:45:46 -08:00
Zhe Wu ae4d66c0d7 Update TLogServer in main 2022-11-04 15:05:37 -07:00
Zhe Wu fc35ed9d0a Change ASSERT_WE_THINK to ASSERT when checking that peek reply start version must be greater than latest pop version 2022-11-04 15:05:37 -07:00
Zhe Wu 32bc9b6ebb Fix a race condition between batched peek and pop, where the server removal pop may be lost 2022-11-04 15:05:37 -07:00
Ata E Husain Bohra a7d123643d
Extend Tlog persistentStorage to persist encryption state (#8344)
* Extend Tlog persistentStorage to persist encryption state

Description

 diff-3: Address review comment.
 diff-2: Extend ClusterController endpoints to allow querying
         the cluster's encryptionAtRest status
         Update Tlog recovery to ensure on-disk encryption
         status matches the encryptionAtRest mode persisted
         in the cluster's cstate
 diff-1: Store encryptionAtRestMode state in Coordinators

Major changes proposed are:
1. Extend TLog persistentStorage to persist encryption state
2. The persisted encryption state is derived from the corresponding
db-config and relevant SERVER_KNOBS. In the near future, the knobs
shall be removed.
3. On TLog startup, the persisted encryption state is compared
against the cluster configuration; on a mismatch, the TLog is killed
and not allowed to rejoin the cluster.

Testing

devRunCorrectness - 100K
2022-11-03 11:16:50 -07:00
Lukas Joswiak 28540e5962 Format 2022-10-27 13:56:13 -07:00
Lukas Joswiak 9d3c3b1efe Remove cluster ID logic from individual roles
The logic to determine the validity of a process joining a cluster now
belongs on the worker and the cluster controller. It is no longer
restricted to tlogs and storages, but instead applies to all processes
(even stateless ones).
2022-10-27 13:56:13 -07:00
Lukas Joswiak 72a97afcd6 Avoid recruiting workers with different cluster ID 2022-10-27 13:56:13 -07:00
sfc-gh-tclinkenbeard 74212eeacf Encapsulate CounterCollection 2022-10-25 10:17:15 -07:00
Ankita Kejriwal 854212fe94 Incorporate code review suggestions 2022-10-13 17:41:31 -07:00
Ankita Kejriwal 59686fa2e5 Increase the IO timeout to avoid flakiness in simulation tests. 2022-10-13 13:40:40 -07:00
Steve Atherton 8ccdc91b5e Merge commit '7c89cd705faee52d5d78e6c77665cb7cc4502f58' into redwood-commit-overlap 2022-10-07 15:59:45 -07:00
Markus Pilman ea1325a552
Merge pull request #8319 from sfc-gh-tclinkenbeard/add-rare-code-probe-annotation
Add `rare` code probe decoration
2022-10-07 09:39:00 -06:00