Commit Graph

12377 Commits

Author SHA1 Message Date
Xiaoxi Wang e48fd10d8d add perpetual wiggle to .team_tracker field 2023-03-20 09:46:36 -07:00
Xiaoxi Wang 4615926d3f merge upstream/main and resolve conflicts 2023-03-17 21:27:19 -07:00
Xiaoxi Wang 5f4e3a95c5 change run function to use a reference parameter rather than a pointer 2023-03-17 20:59:18 -07:00
Steve Atherton 216d0be2cf
Add processID, networkAddress, and locality to layer status JSON for Backup Agents. (#9736)
* Add processID, networkAddress, and locality to layer status JSON for Backup Agents.

* Backup/DR agent determines the network address to report in Layer Status only once, when the status updater loop begins, since this is a blocking call which connects to the cluster. Also includes a lot of code cleanup.
2023-03-17 18:07:03 -07:00
Evan Tschannen 4d757fbcd3 format tester 2023-03-17 15:09:31 -07:00
Evan Tschannen a7e1616de8 code cleanup 2023-03-17 14:57:07 -07:00
A.J. Beamon fe5d0928f3 Remove doEmptyCommit function 2023-03-17 12:58:41 -07:00
A.J. Beamon dc2bd78aa7 The consistency check should retry if it couldn't find all the commit proxies when getting key server locations 2023-03-17 12:00:47 -07:00
Evan Tschannen 6882f21c10 fix: only consider large team creation successful if the team has the correct size 2023-03-17 10:52:53 -07:00
Evan Tschannen 73767501d4 Merge branch 'main' into feature-custom-dd
# Conflicts:
#	fdbserver/tester.actor.cpp
2023-03-17 10:33:38 -07:00
Evan Tschannen de7f40d2f4 made a variety of improvements that came from code review 2023-03-17 10:30:34 -07:00
Ata E Husain Bohra c492f83bf4
EaR: Avoid appending `tls` to the URL (#9734)
Description

Patch proposes two changes:

1. Avoid appending tls as part of URI for secure connections
2. RefreshEKs recurring task can be skipped if there are no keys to be refreshed

Testing

EncryptionOps.toml
EncryptKeyProxyTest.toml
devRunCorrectness 
devRunCorrectnessFiltered 'Encrypt*'
2023-03-16 22:52:51 -07:00
He Liu 0f5e75b34b
Added newDataMoveId(). (#9647)
* Added newDataMoveId().

* Added `ENABLE_DD_PHYSICAL_SHARD_MOVE`

* fmt.

* Replace `teamId` with `shardId`.
2023-03-16 18:06:06 -07:00
A.J. Beamon aeaedb147f
Merge pull request #9727 from sfc-gh-ajbeamon/fix-shared-remote-region-kills
Avoid killing too many machines if one region is being shared between the remote primary and a satellite
2023-03-16 17:46:12 -07:00
Josh Slocum 3c1ac344f1
buggify blob granule compression per-file (#9670) 2023-03-16 17:46:18 -05:00
A.J. Beamon 6818ce950c Fix check that excludes satellites from consideration to consider satelliteTLogReplicationFactor and satelliteTLogUsableDcs. Update trace event with more info about the updated policy. 2023-03-16 14:27:30 -07:00
Steve Atherton 5c795c3abe Rewrite corrupt block number calculations to be more clear. 2023-03-16 13:02:15 -07:00
Markus Pilman df5b15e56c
Merge pull request #9634 from sfc-gh-mpilman/features/negative-simulation
Framework to write negative tests
2023-03-16 12:47:02 -07:00
A.J. Beamon 735327f1cf
Merge pull request #9718 from sfc-gh-ajbeamon/decrease-duration-of-automatic-idempotency-workload
Decrease number of transactions in automatic idempotency workload
2023-03-16 12:31:24 -07:00
A.J. Beamon 75b8148e91 If one region is being shared between the remote primary and a satellite, the simulator could kill too many machines 2023-03-16 11:24:12 -07:00
A.J. Beamon f8255fe7a1
Merge pull request #9724 from sfc-gh-ajbeamon/fix-disk-corruption-check
Fix possible off-by-one in the simulation upper bound check for page corruption
2023-03-16 11:05:25 -07:00
Josh Slocum c7c41bc9db
adding implementation and check for blob worker exclusion (#9700) 2023-03-16 12:09:43 -05:00
Evan Tschannen ac54962533 code cleanup 2023-03-16 09:47:21 -07:00
A.J. Beamon 99f75a9bb1 Fix possible off-by-one in the simulation upper bound check for page corruption 2023-03-16 09:31:11 -07:00
Jingyu Zhou adda32db46
Merge pull request #9691 from sfc-gh-dadkins/sfc-gh-dadkins/commit-proxy-unavailable
Replace 10-second delay with explicit wait for cluster recovery in checkExtraDataStores
2023-03-16 09:21:59 -07:00
A.J. Beamon 4b8311d932 The automatic idempotency workload has a long runtime and can occasionally log too many events, etc. This decreases the number of transactions it runs significantly to avoid that issue. 2023-03-15 18:45:26 -07:00
A.J. Beamon 436a187171 Merge branch 'main' into fix-storage-quota-enables-tenant-aware-dd 2023-03-15 17:59:01 -07:00
A.J. Beamon a6202253a4
When a storage server fails to register (e.g. due to worker_removed), we need to throw that error to terminate the SS. (#9712) 2023-03-15 17:46:21 -07:00
A.J. Beamon 3f9d51db4e The DD_TENANT_AWARENESS_ENABLED knob was indirectly disabling the feature by not initializing a dd tenant cache, but this could be bypassed by enabling storage quotas. This makes the knob more explicitly control the feature. 2023-03-15 15:56:24 -07:00
Josh Slocum b4eb665f1d
fixing copy constructor error and adding test for it (#9711) 2023-03-15 15:33:16 -07:00
Ata E Husain Bohra dbcab0b1bd
Revert "Refactor GetEncryptCipherKeys (#9600)" (#9708)
This reverts commit 2702665e35.
2023-03-15 12:10:08 -07:00
Evan Tschannen aaf7b9b32b Added the ability to manually create a shard and also increase its replication factor 2023-03-15 11:26:15 -07:00
Markus Pilman 303b833d7b Adding data corruption test to verify consistency check 2023-03-15 11:22:25 -07:00
Markus Pilman 79447c6e06 First successful negative run 2023-03-15 11:22:25 -07:00
Markus Pilman 3894d5069e fix compiler error 2023-03-15 11:22:25 -07:00
Markus Pilman 7a108a2768 Add framework for writing negative simulation tests 2023-03-15 11:22:25 -07:00
Markus Pilman aa09baadab
Merge pull request #9635 from sfc-gh-etschannen/fix-consistency-check
Fix: the consistency check did not properly report failed tests
2023-03-15 11:21:44 -07:00
Evan Tschannen 6c1d02a14f
Merge pull request #9703 from sfc-gh-jslocum/bg_file_logical_size
adding blob granule logical size
2023-03-15 09:59:57 -07:00
Evan Tschannen 2f96627d43 merge in main 2023-03-15 09:26:22 -07:00
Jingyu Zhou bc380c9a5d
Merge pull request #9699 from sfc-gh-xwang/fix/main/tcTest
fix unit test failure because of implicit uint16_t conversion to int
2023-03-15 09:18:10 -07:00
Evan Tschannen 0a8435b742
Merge pull request #9702 from sfc-gh-jslocum/dbg_bg_ctest_timeout
fixing 2 bugs related to high delta file waitCommitted latency
2023-03-15 08:52:35 -07:00
Josh Slocum a5b4212990 adding blob granule logical size 2023-03-15 08:54:49 -05:00
Josh Slocum 52c0dc56cc fixing 2 bugs related to high delta file waitCommitted latency 2023-03-15 08:39:42 -05:00
Josh Slocum 03818e94f3
add exclusion tracker utility and use it in DD (#9669) 2023-03-15 08:21:28 -05:00
Xiaoxi Wang 213263b5d2 fix unit test failure because of implicit uint16_t conversion to int 2023-03-14 22:23:20 -07:00
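The commit message above does not say which expression was affected, but the general pitfall can be sketched as follows. In C++, operands narrower than `int` are promoted before arithmetic, so adding two `uint16_t` values yields an `int` that does not wrap at 65535. The helper names are hypothetical, and the `static_assert` assumes a platform where `int` is wider than 16 bits.

```cpp
#include <cassert>
#include <cstdint>
#include <type_traits>

// On platforms where int is wider than 16 bits (an assumption of this sketch),
// uint16_t operands are promoted to int before the addition takes place.
static_assert(std::is_same<decltype(std::uint16_t{1} + std::uint16_t{1}), int>::value,
              "uint16_t + uint16_t has type int after integral promotion");

// Hypothetical helper: returns the promoted result, which can exceed 65535.
inline int promotedSum(std::uint16_t a, std::uint16_t b) {
    return a + b;
}

// Hypothetical helper: casting back to uint16_t restores modulo-65536 wraparound.
inline std::uint16_t wrappedSum(std::uint16_t a, std::uint16_t b) {
    return static_cast<std::uint16_t>(a + b);
}
```

A comparison written against the promoted value can therefore disagree with one written against the explicitly narrowed value, which is the kind of mismatch a unit test may trip over.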
Evan Tschannen c435e8336a no message 2023-03-14 16:40:50 -07:00
He Liu a0a3f4bff3
Fetch byte sample file (#9657) 2023-03-14 16:24:08 -07:00
Dan Adkins 6c796fa0d1 Get read version after setting transaction options. 2023-03-14 15:55:10 -07:00
Dan Adkins 4757545396 Replace 10-second delay with explicit wait for cluster recovery in checkExtraDataStores.
CheckExtraDataStores reboots or kills storage servers with extra data stores.
Since this occurs during a consistency check, the expectation is that the database
is quiet and not in the midst of recovery. This was done with a 10-second delay,
but it's possible during simulation tests that it takes longer than 10 seconds
to recruit a new master, so this assumption is invalid and can cause a test failure
when the consistency checks proceed.

Instead of a delay, we run an empty transaction through the system and explicitly
wait for the cluster to return to a fully-recovered state.
2023-03-14 12:46:13 -07:00
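The idea in the commit above, waiting for a condition rather than sleeping for a fixed time, can be sketched in plain C++. This is not the Flow actor code used in FoundationDB: `waitUntil`, its parameters, and the polling approach are all assumptions made for illustration, and the actual change waits on an empty transaction and cluster recovery state rather than polling.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical helper: polls `ready` until it returns true or `timeout` elapses.
// Returns true if the condition was met, false if the deadline passed first.
inline bool waitUntil(const std::function<bool()>& ready,
                      std::chrono::milliseconds timeout,
                      std::chrono::milliseconds pollInterval = std::chrono::milliseconds(5)) {
    const auto deadline = std::chrono::steady_clock::now() + timeout;
    while (!ready()) {
        if (std::chrono::steady_clock::now() >= deadline) {
            return false; // condition never became true within the timeout
        }
        std::this_thread::sleep_for(pollInterval);
    }
    return true;
}
```

Unlike a fixed delay, this returns as soon as the condition holds and reports a timeout explicitly instead of proceeding on an assumption that may no longer be valid.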
Yanqin Jin 37b0b0852c Merge remote-tracking branch 'origin/main' into deflake-test-1 2023-03-14 09:12:01 -07:00
Hui Liu 499a4cab93
Add correctness test for point-in-time restore (#9185) 2023-03-14 08:56:34 -07:00
A.J. Beamon d39cda610a Merge branch 'main' into metacluster-improvements
# Conflicts:
#	fdbcli/TenantCommands.actor.cpp
2023-03-13 15:58:39 -07:00
A.J. Beamon 45056370b8 Merge branch 'main' into metacluster-improvements 2023-03-13 13:14:09 -07:00
A.J. Beamon 18cf523f49
Merge pull request #9660 from sfc-gh-ajbeamon/tenant-id-restore-safety
Disallow repopulating a management cluster from a data cluster with matching tenant ID prefix
2023-03-13 13:12:30 -07:00
Ata E Husain Bohra ea796eb3ec
EaR: REST kms misc fixes (#9664)
* EaR: REST kms misc fixes

Description

Patch addresses the following issues:
1. Fix the "return connection" routine; this fixes a regression introduced by
an earlier fix.
2. Change RESTConnectionPool::connectionPoolMap to an "unordered_map"
for O(1) lookups
3. Improve logging
4. Make RESTUrl parsing handle extra '/' for 'resource'

Testing

Standalone fdbserver connecting to external KMS and database create
2023-03-13 13:11:05 -07:00
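The commit above does not spell out how extra '/' characters in the 'resource' part are handled. One plausible reading, sketched here purely as an assumption, is to collapse runs of consecutive slashes. `collapseSlashes` is a hypothetical helper and not the RESTUrl code itself.

```cpp
#include <cassert>
#include <string>
#include <string_view>

// Hypothetical helper: collapses runs of '/' in a resource path into a single '/'.
// Intended for the path portion only; applying it to a full URL would also
// alter the "//" that follows the scheme.
inline std::string collapseSlashes(std::string_view resource) {
    std::string out;
    out.reserve(resource.size());
    for (char c : resource) {
        if (c == '/' && !out.empty() && out.back() == '/') {
            continue; // skip a slash that directly follows another slash
        }
        out.push_back(c);
    }
    return out;
}
```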
Josh Slocum 4a0ceca75e swallowing errors in redwood dispose 2023-03-10 17:49:56 -06:00
A.J. Beamon cbc330697c Disallow repopulating a management cluster from a data cluster with matching tenant ID prefix unless forced. Remember the largest used tenant ID on the data cluster and use it to update the management cluster tenant ID when force repopulating the same ID. 2023-03-10 15:36:37 -08:00
Yanqin Jin 86682668ca Merge remote-tracking branch 'origin/main' into deflake-test-1 2023-03-10 14:57:59 -08:00
Jingyu Zhou b13e496986
Merge pull request #9645 from sfc-gh-huliu/fixasan
Fix asan error caused by StringRef parameter of updateRestoreState
2023-03-09 17:35:56 -08:00
Jingyu Zhou b755e668bf
Merge pull request #9601 from jzhou77/fix-head
Allow log router to detect slow peeks and to switch DC for peeking
2023-03-09 15:34:24 -08:00
Yanqin Jin effda73ef4 Rename a variable 2023-03-09 14:42:18 -08:00
Jingyu Zhou eb4e122787 Reduce running time for DcLag
The switch can happen more quickly than the workload detection time, so the
detection time needs to be lower than LOG_ROUTER_PEEK_SWITCH_DC_TIME.
2023-03-09 14:34:23 -08:00
Yanqin Jin 2feb60cd63
Update fdbserver/workloads/FastTriggeredWatches.actor.cpp
Change `StringRef()` to `""_sr`.

Co-authored-by: A.J. Beamon <aj.beamon@snowflake.com>
2023-03-09 14:28:04 -08:00
Jingyu Zhou abe6b40fc9 Address comments 2023-03-09 13:58:37 -08:00
Yanqin Jin 8e478b7dcc Merge remote-tracking branch 'origin/main' into deflake-test-1 2023-03-09 13:29:28 -08:00
Yanqin Jin 88d9c1f610 address comments 2023-03-09 13:29:07 -08:00
Hui Liu 6fca5b4e13 Fix asan error caused by StringRef parameter of updateRestoreState 2023-03-09 13:04:45 -08:00
Yanqin Jin 4822af9417 Revert "Deflake test FastTriggeredWatches"
This reverts commit c2939ba51e.
2023-03-09 12:55:48 -08:00
sfc-gh-tclinkenbeard 1c7076c9a4 Fix argument type in ProxyCommitData::updateSSTagCost 2023-03-09 11:01:13 -08:00
Jingyu Zhou 5c97fb2c20 Use a constant for connectionFailuresDisableDuration 2023-03-09 09:50:24 -08:00
Jingyu Zhou e18ed14278 Refactor to address comments 2023-03-09 09:39:27 -08:00
Josh Slocum 0c718f7a04
Fixing double-completing of destination servers in busy map (#9629) 2023-03-09 09:43:45 -06:00
Josh Slocum dd032d7a16
fixing bytesInNewDeltaFiles calculation when a snapshot file is rolled back (#9609) 2023-03-09 09:43:27 -06:00
Ata E Husain Bohra b227007ab0
EaR: Fix knob name (#9630)
Description

Knob 'REST_KMS_ALLOW_NOT_SECURE_CONNECTION' was renamed in a recent
patch; however, there are other places that need an update too.

Testing

devRunCorrectness - 100K
RESTUtilUnits.toml
RESTKmsConnectorUnits.toml
2023-03-08 17:37:39 -08:00
Nim Wijetunga 2702665e35
Refactor GetEncryptCipherKeys (#9600)
* initial commit

* address pr comments
2023-03-08 17:05:03 -08:00
Evan Tschannen 4a17ed363a Fix: the consistency check did not properly report failed tests 2023-03-08 16:56:23 -08:00
Jingyu Zhou 38c1e3f603 Revert disableSimSpeedup 2023-03-08 15:54:21 -08:00
Jingyu Zhou 493e81f31d Limit connection failures to be within tests
In particular, disable connection failures when initializing the database
during the startup phase, i.e., before running with test specs.
2023-03-08 15:36:58 -08:00
Jingyu Zhou 9913f5b5e1 Simplify DcLag code 2023-03-08 15:23:34 -08:00
Yanqin Jin c2939ba51e Deflake test FastTriggeredWatches
In FastTriggeredWatchesWorkload, if the randomized new value for the given
`setKey` happens to be the same as the current value, the following will hold

- `first` is true, and
- `getDuration` is 0, and
- assertion `lastReadVersion - ver >= SERVER_KNOBS->MAX_VERSIONS_IN_FLIGHT || lastReadVersion - ver < SERVER_KNOBS->VERSIONS_PER_SECOND * (25 + getDuration)` will fail

To fix this, change the assertion to
```
assert(first || lastReadVersion - ver >= SERVER_KNOBS->MAX_VERSIONS_IN_FLIGHT ||
       lastReadVersion - ver < SERVER_KNOBS->VERSIONS_PER_SECOND * (25 + getDuration));
```

Test plan:
Apply the fix on top of the commit reported by the Joshua test, then rerun the command to make sure it passes.
2023-03-08 14:07:57 -08:00
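The effect of the added `first ||` term in the commit above can be checked in isolation. The sketch below is a standalone model, not the workload code: the two constants are placeholder values standing in for the `SERVER_KNOBS` entries, the function names are hypothetical, and `getDuration` is assumed to be a `double`.

```cpp
#include <cassert>
#include <cstdint>

// Placeholder values for illustration only; the real values come from SERVER_KNOBS.
constexpr std::int64_t kVersionsPerSecond = 1'000'000;
constexpr std::int64_t kMaxVersionsInFlight = 100 * kVersionsPerSecond;

// Original condition: the version gap is either very large or within the time window.
inline bool oldCheck(std::int64_t lastReadVersion, std::int64_t ver, double getDuration) {
    const std::int64_t gap = lastReadVersion - ver;
    return gap >= kMaxVersionsInFlight || gap < kVersionsPerSecond * (25 + getDuration);
}

// Changed condition: the check is skipped entirely while `first` is true.
inline bool newCheck(bool first, std::int64_t lastReadVersion, std::int64_t ver, double getDuration) {
    return first || oldCheck(lastReadVersion, ver, getDuration);
}
```

With `getDuration` at 0, a gap between 25 and 100 seconds' worth of versions fails the original condition but passes the changed one when `first` is true.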
Xiaoge Su 4373f111fb Let FDB use Findjemalloc.cmake 2023-03-08 13:09:13 -08:00
Ata E Husain Bohra d0eec9d0ba
EaR: REST KMS fixes - encryption integration testing (#9598)
* EaR: REST KMS fixes - encryption integration testing

Description

Major changes:
1. Multiple fixes for issues observed while performing end-to-end
integration testing of the Encryption at-rest feature.
2. Improve REST module logging. Introduced FLOW_KNOBS->REST_LOG_LEVEL
to give more granular control of feature logging, independent of
the cluster log level.

Testing

Integration testbed:
1. Run fdbserver standalone
2. Run external KMS http-server to serve encryption key fetch requests
2023-03-08 09:49:43 -08:00
Markus Pilman 6937524594
Merge pull request #9599
Allow code to act on test timeouts
2023-03-08 09:26:55 -07:00
Hui Liu c43f8b3fdc
Refactor - introduce BlobRestoreController for APIs to manage restore state (#9616) 2023-03-08 07:50:30 -08:00
Xiaoxi Wang a92669f232
Merge pull request #9571 from sfc-gh-xwang/fix/main/sampleCopy
IndexSet and TransientStorageMetricSample optimization - change parameter type from KeyRef to Key to avoid extra copy in sampler
2023-03-07 20:50:31 -08:00
Jingyu Zhou 2b73a0c5c1
Fix ClogTlog valgrind error (#9588)
* Fix ClogTlog valgrind error

addr is the one we want to keep. gcc build seems to push_back a copy of it into
the vector.

* Use changeConfig that takes a string

* Remove an unused variable
2023-03-07 20:30:58 -08:00
A.J. Beamon de5f2c0fee Disallow cluster names that start with the `\xff` byte 2023-03-07 11:46:34 -08:00
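The rule in the commit above amounts to a check on the first byte of the name. A minimal sketch, assuming a hypothetical `clusterNameAllowed` helper that covers only this one rule and none of the other naming constraints:

```cpp
#include <cassert>
#include <string_view>

// Hypothetical helper: rejects names whose first byte is 0xff.
// Only this single rule is modelled; an empty name is not rejected here.
inline bool clusterNameAllowed(std::string_view name) {
    // Cast through unsigned char so the comparison works whether char is signed or not.
    return name.empty() || static_cast<unsigned char>(name.front()) != 0xff;
}
```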
Steve Atherton 5ff0bc3f87
Merge pull request #9576 from sfc-gh-satherton/storage-configure-refactor
Storage and log engine configuration support / refactor a few things.
2023-03-07 02:10:14 -08:00
Steve Atherton 77f626194d Remove duplicate if block. 2023-03-07 00:02:42 -08:00
Xiaoxi Wang 41d01629f1 change IndexSet::addMetrics return type to pair<metrics, iterator> to reduce another find call 2023-03-06 22:23:55 -08:00
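The pattern named in the commit above, returning an iterator along with the result so the caller does not need a second lookup, can be sketched with a standard container. This uses `std::map` as a stand-in; the real `IndexSet` is a different data structure, and the `MetricMap` alias and the signature shown here are assumptions.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <utility>

using MetricMap = std::map<std::string, std::int64_t>;

// Adds `delta` to the entry for `key`, creating it at zero if it is absent.
// Returns the updated value together with an iterator to the entry,
// so the caller can keep working with the entry without calling find() again.
inline std::pair<std::int64_t, MetricMap::iterator> addMetrics(MetricMap& metrics,
                                                               const std::string& key,
                                                               std::int64_t delta) {
    auto it = metrics.try_emplace(key, 0).first; // single lookup: inserts or finds
    it->second += delta;
    return { it->second, it };
}
```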
Jingyu Zhou 7c54cc823b Require at least 2 regions and having satellites 2023-03-06 20:31:28 -08:00
Jingyu Zhou 31a051b1f1 Enable DcLag test 2023-03-06 19:04:24 -08:00
Steve Atherton 3faff52266 Fix comment. 2023-03-06 18:44:58 -08:00
Jingyu Zhou 0259a243ae Switch DC if log router peek becomes stuck
Try a different DC if this happens.
2023-03-06 17:41:56 -08:00
Markus Pilman 7eaf999644 reverting testing code 2023-03-06 17:38:41 -07:00
Markus Pilman 0838bfcfa2 Allow workloads to log errors when test times out 2023-03-06 17:36:26 -07:00
Ata E Husain Bohra a45de70003
EaR: RESTClient HTTP compliance, fix json request content type (#9544)
* EaR: RESTClient HTTP compliance, fix json request content type

Description

  diff-1: Address review comments

RESTClient is responsible for handling FDB <-> KMS communication
for Encryption and other use cases. By design, it only supports
"secure connection", i.e. "https"; however, it seems there is a
need to expand the module to support "http" connections,
for instance in test and dev deployments.

However, given that RESTClient is involved in handling highly
sensitive content, such as a plaintext "encryption cipher
from a KMS", the feature is guarded using
CLIENT_KNOB->REST_KMS_ENABLE_NOT_SECURE_CONNECTION, which is
settable using the FDBServer command line argument
"--kms-rest-enable_not_secure_connection" (boolean)

Testing

Deployed a standalone fdbserver and communicated with a
simple "http" server
2023-03-06 16:06:03 -08:00
Jingyu Zhou 1cb4070252 Refactor LogRouter's pullAsyncData 2023-03-06 15:45:46 -08:00
Jingyu Zhou c5d123ee66 Ignore the DcLag test 2023-03-06 15:43:06 -08:00
Jingyu Zhou 64876234d5 Add disableSimSpeedup to clog network longer 2023-03-06 15:40:47 -08:00