foundationdb

Commit Graph

Author	SHA1	Message	Date
Jingyu Zhou	fecffc93e4	Fix a segfault when tlog encounters platform_error During destruction, rejoinClusterController actor should be cancelled to avoid accessing TLogData object.	2024-05-17 11:40:22 -07:00
Dimitris Apostolou	a88114c222	Fix typos	2024-02-07 01:16:00 +02:00
Hao Fu	9b17dd8caf	Fix backup workers stability issues (#11044) This PR includes a few stability fixes for Backup Worker * Fixed memory bookkeeping issue in Backup Worker. Previously it didn't release flow lock correctly when erasing messages. * Added TLogServer fix to return 0 from poppedVersion() for unrecognized log router tags.	2023-11-13 15:55:25 -08:00
Zhe Wu	83992d61ec	Add a knob to guard the gray failure detection during TLog recovery	2023-08-29 14:49:39 -07:00
Zhe Wu	2078a0055a	Add documentation	2023-07-26 14:16:16 -07:00
Zhe Wu	f9c3ac3704	Remove debugging logging	2023-07-26 12:35:19 -07:00
Zhe Wu	00cdf09966	Cluster controller monitors recovery stats and trigger recovery if current recovery contains degraded servers	2023-07-17 11:14:22 -07:00
Zhe Wu	9670ed1cd8	Make TLog explicitly monitor connectivity issue during [start version, recover version] recovery	2023-07-17 11:12:28 -07:00
Evan Tschannen	ef682d304e	fix IKeyValueStore include	2023-06-16 13:28:40 -07:00
Evan Tschannen	359e178dcd	Merge branch 'main' into feature-durable-change-feed # Conflicts: # fdbclient/ClientKnobs.cpp # fdbserver/BlobManager.actor.cpp # fdbserver/worker.actor.cpp	2023-06-11 13:58:35 -07:00
Zhe Wu	1c290d3bc8	Make TLog server to handle empty oldGenerationRecoverAtVersions	2023-05-16 15:16:42 -07:00
Zhe Wu	1eae833ae2	test record_recover_at_in_cstate and track_tlog_recovery in restart test	2023-05-16 13:37:42 -07:00
Zhe Wu	a956979c32	Replace oldestGenerationStartVersion with oldestGenerationRecoverAtVersion	2023-05-16 13:09:34 -07:00
Evan Tschannen	3dd86d6c22	move IKeyValueStore.h to the client	2023-05-10 15:41:47 -07:00
Dan Adkins	aaa4860f76	Fix commit location accounting in DiskQueue. (#10075 ) We encountered a situation in simulation where the disk queue was in the following state +------------+------------+ \| page 1 \| page 2 \| +------------+------------+ \|rec \|.......\|rec \|.......\| +------------+------------+ 0..85 4096..4181 ^. ^__ ^ popped. committed pushed and we attempted to pop up to 4096, i.e. everything before page 2. This triggered one of the assertions in the disk queue code which was meant to catch tlog logic bugs where we pop too much. The issue, though, is the accounting of the commit location in the disk queue. While we only pushed records through position 85, we committed the entire page. Attempts to pop everything before page 2 should have succeeded since we're not attempting to pop any uncommitted data. The solution is to fix the commit location accounting in the disk queue to round up to the next page, to reflect the reality that we only commit entire pages. This bug was discovered in the first place by introducing a delay into the commit queue loop during simulation testing. That delay is included in this change. We also noticed that getNextCommitLocation() was incorrect. Since there are no users of that function, we've removed it entirely.	2023-05-02 12:33:29 -04:00
Zhe Wu	33736ff9af	Cleanup GcGeneration test and function documents	2023-03-27 12:31:44 -07:00
Zhe Wu	d576d9a66a	Remote debug TraceEvent	2023-03-27 11:47:11 -07:00
Zhe Wu	40dc54223c	Add GC generation test, and make all simulation test passing	2023-03-27 11:46:13 -07:00
Zhe Wu	78bef8110b	Track tlog recovery: tlog side implementation	2023-03-27 11:42:27 -07:00
Dan Adkins	b8c9c8b0f4	Add metric for tlog commit time minus time spent waiting in the queue.	2023-02-27 15:40:22 -08:00
Dan Adkins	37b6804f88	Add metric for queue wait time in tlog.	2023-02-27 15:40:22 -08:00
Dan Adkins	e3a61b9b22	Add metrics to understand tail commit latency (#9435) * Add server-side latency metrics for Resolver requests. * Add separate resolver latency metrics for queue wait and compute time. * Add histogram for queue depth observed on resolver (during metrics interval). * Fix tlog latency measurement to use timer() instead of now().	2023-02-24 14:13:12 -05:00
Yi Wu	eac757d186	EaR: cleanup encryption knobs (#9386) Changes: * Cleanup all encryption knobs * Update simulated cluster to randomly enable encryption with higher probability	2023-02-18 13:18:20 -08:00
Zhe Wu	359b3a11e7	Change TLog pull async data warning timeout	2023-01-11 19:36:00 -08:00
Zhe Wu	087d37d10b	Add event for txn server initialization and a warning for TLog slow catching up	2023-01-11 10:02:06 -08:00
Xiaoxi Wang	8266f52dea	Merge pull request #9012 from sfc-gh-xwang/feature/main/wiggleDelay Persist accumulated wiggle delay	2023-01-04 16:14:09 -08:00
Xiaoxi Wang	bbcb3cc018	extract KeyBackedConfig, StorageWiggleData class; solve template resolution problem; solve MV txn and native api conflict by splitting RunTransaction file	2023-01-02 23:34:39 -08:00
Jingyu Zhou	667d58ac35	Add back samples for (non)empty peeks stats These were lost, likely due to refactoring. Now TLogMetrics have meaningful data like: TLogMetrics ID=59ec9c67b4d07433 Elapsed=5 BytesInput=0 -1 17048 BytesDurable=47.4 225.405 17048 BlockingPeeks=0 -1 0 BlockingPeekTimeouts=0 -1 0 EmptyPeeks=1.6 2.79237 236 NonEmptyPeeks=0 -1 32 ...	2022-12-21 11:18:28 -08:00
FoundationDB CI	86d6106dc1	format source code after switch to clang 15	2022-12-08 17:26:45 +00:00
Daniel Luan	4be05e5a5d	Upgrade C++ Standard to 20	2022-12-06 14:19:06 -08:00
Yi Wu	551fd0b9bb	EAR: Cleanup Redwood tenant map usage (#8902 ) We have a recent redesign that no longer required to pass tenant name to get encryption key, and also not allowing optional tenant mode for tenant-aware encryption. This PR clean up Redwood code to remove tenant map usage, and update various checks accordingly. Changes: * Cleanup TenantPrefixIndex in TenantAwareEncryptionKeyProvider and related logic in storage server and Redwood for passing the map around. * Cleanup and update DecodeBoundaryVerifier the reflect the new design. * A minor fix to writePages() that avoid a page that's default domain encrypted having a lower bound key belonging to a non-default domain. * Fix TenantAwareEncryptionKeyProvider::getEncryptionDomain() returning wrong prefix long for system domain. * A minor change to add a context string to IoTimeoutError.	2022-11-23 09:41:40 -08:00
sfc-gh-tclinkenbeard	3c97f43138	Change Histogram::Unit::microseconds to milliseconds	2022-11-21 08:03:56 -08:00
Jingyu Zhou	f285a91f6c	Add more debug events	2022-11-17 11:54:21 -08:00
Steve Atherton	8e8c4b4489	Merge pull request #8170 from sfc-gh-sgwydir/ddsketch Use DDSketch for sample data	2022-11-17 10:38:12 -08:00
Sam Gwydir	214db4d17e	formatting	2022-11-15 13:38:55 -08:00
sfc-gh-tclinkenbeard	c03f60c618	Update rare code probe annotations	2022-11-15 13:21:25 -08:00
Sam Gwydir	7f33b0fa70	clang-format	2022-11-12 14:09:31 -08:00
Sam Gwydir	23706c957b	Use DDSketch for Sample Data.	2022-11-12 13:45:46 -08:00
Zhe Wu	ae4d66c0d7	Update TLogServer in main	2022-11-04 15:05:37 -07:00
Zhe Wu	fc35ed9d0a	Change ASSERT_WE_THINK to ASSERT when checking that peek reply start version must be greater than latest pop version	2022-11-04 15:05:37 -07:00
Zhe Wu	32bc9b6ebb	Fix a race condition between batched peek and pop, where the server removal pop may be lost	2022-11-04 15:05:37 -07:00
Ata E Husain Bohra	a7d123643d	Extend Tlog persistentStorage to persist encryption state (#8344 ) * Extend Tlog persistentStorage to persist encryption state Description diff-3: Address review comment. diff-2: Extend ClusterController endpoints to allow query cluster's encryptionAtRest status Update Tlog recovery to ensure on-disk encryption status matches with cluster's cstate persisted encryptionAtRest diff-1: Store encryptionAtRestMode state in Coordinators Major changes proposed are: 1. Extend TLog persistentStorage to persist encryption state 2. Encryption state persisted is derived from corresponding db-config and relevant SERVER_KNOBS. In near future, knobs shall be removed. 3. On TLog startup, the persisted encryption state is compared against cluster configuration, if mismatch, the TLog is killed and not allowed to rejoin the cluster. Testing devRunCorrectness - 100K	2022-11-03 11:16:50 -07:00
Lukas Joswiak	28540e5962	Format	2022-10-27 13:56:13 -07:00
Lukas Joswiak	9d3c3b1efe	Remove cluster ID logic from individual roles The logic to determine the validity of a process joining a cluster now belongs on the worker and the cluster controller. It is no longer restricted to tlogs and storages, but instead applies to all processes (even stateless ones).	2022-10-27 13:56:13 -07:00
Lukas Joswiak	72a97afcd6	Avoid recruiting workers with different cluster ID	2022-10-27 13:56:13 -07:00
sfc-gh-tclinkenbeard	74212eeacf	Encapsulate CounterCollection	2022-10-25 10:17:15 -07:00
Ankita Kejriwal	854212fe94	Incorporate code review suggestions	2022-10-13 17:41:31 -07:00
Ankita Kejriwal	59686fa2e5	Increase the IO timeout to avoid flakiness in simulation tests.	2022-10-13 13:40:40 -07:00
Steve Atherton	8ccdc91b5e	Merge commit '7c89cd705faee52d5d78e6c77665cb7cc4502f58' into redwood-commit-overlap	2022-10-07 15:59:45 -07:00
Markus Pilman	ea1325a552	Merge pull request #8319 from sfc-gh-tclinkenbeard/add-rare-code-probe-annotation Add `rare` code probe decoration	2022-10-07 09:39:00 -06:00

687 Commits