Commit Graph

1656 Commits

Author SHA1 Message Date
Vaidas Gasiunas b55863d958
Merge pull request #7132 from sfc-gh-vgasiunas/vgasiunas-fix-version-check-for-downgrades
Fixing the problem with client getting stuck on server downgrades
2022-05-13 10:53:18 +02:00
Andrew Noyes a92ef37d44 Log a backtrace before throwing serialization_failed 2022-05-12 20:08:36 -07:00
Renxuan Wang 30e124c09b Remove HostnameStatus and resolve trigger.
They are no longer needed since we have coordinators DNS cache; and they are introducing complex crashes.
2022-05-12 10:13:55 -07:00
Junhyun Shim 0dbbadfd77
Merge pull request #7088 from sfc-gh-jshim/mtls-test-helpers
mTLS test helpers
2022-05-12 14:35:23 +02:00
Vaidas Gasiunas 2f67637701 Fixing the problem with client getting stuck on server downgrades:
- Restoring the original check for strict protocol compatibility before sending packets
 - Resetting the compatibility flag when connection is closed, so that the protocol compatibility is checked again for a new connection
2022-05-11 19:12:38 +02:00
Junhyun Shim 6459e840dc Merge remote-tracking branch 'remotes/origin/main' into mtls-test-helpers 2022-05-11 16:00:17 +02:00
Junhyun Shim 8789232df4 Add ScopeExit to flow and remove scattered impls 2022-05-11 11:51:11 +02:00
A.J. Beamon 0597ae7513 When an incompatible connection is closed, clear the state that prevents us from sending messages to it 2022-05-10 15:29:36 -07:00
Sam Gwydir b1ce3fc15a WolfSSL fix for TokenSign 2022-05-09 13:57:03 -07:00
Ata E Husain Bohra 33ae398268
REST KmsConnector implementation (#6994)
* REST KmsConnector implementation

Description
  diff-1: Address review comments.
          Add utility interface to Platform namespace to
          create and operate on tmpfile
 diff-2: Address review comments
         Link Boost::filesystem to CMake build process

Major changes includes:
1. Implement REST based KmsConnector implementation.
2. Salient features of the connector:
 2.1. Two required configuration are:
   a. Discovery KMS URLs - enable KMS discovery on bootstrap
   b. Endpoint path configuration to construct URI to fetch/refresh
      encryption keys
   c. Configuration to provide "validationTokens" to connect with
      external KMS. Patch implements file-based token validation scheme.
 2.2. On startup, RESTKmsConnector discovers KMS Urls and caches
      them in-memory. Extracts "validationTokens" based on input config.
 2.3. Expose endpoints to allow fetch/refresh of encryption keys.
 2.4. Defines JSON format to interact with external KMS - request &
      response payload format.
3. Extend Platform namespace with an interface to create and operate on
   tmp files.
4. Update Platform 'readFileBytes' and 'writeFileBytes' to leverage
   fstream supported implementation.

NOTE: KMS URLs fetched after initial discovery will be persisted using
      DynamicKnobs. It is TODO at the moment and shall be completed
      once DynamicKnobs is feature complete

Testing

Unit test to validation following:
1. Parsing on "validation tokens" logic.
2. Construction and parsing of REST JSON request and response strings.
2022-05-07 13:18:35 -07:00
sfc-gh-tclinkenbeard 8ea68154bf Remove WITH_TLS CMake variable 2022-05-02 22:45:00 -07:00
sfc-gh-tclinkenbeard 475d66084d Remove ENCRYPTION_ENABLED macro 2022-05-02 22:26:31 -07:00
Ray Jenkins dc9e782ccc
OpenTelemetry Tracing Perf Fixes (#6990) 2022-05-02 14:56:51 -05:00
Junhyun Shim 41d1c73b9c Fix TokenSign copying and using uninitialized arena
TokenSign was copying unused Arena held by Standalone instead of refering to it.
An Arena has to be used at least once before it holds a valid, copyable reference.
Otherwise the lifecycle of the copied Arena would be its own and not be shared with the original.
Thus, when the copied arena went out of scope,
the memory supposed to be held by returned Standalone also got released.

Fix: instead of copying, refer to Standalone's arena.
2022-05-02 09:48:43 +02:00
Jingyu Zhou 0ca9761088 Fix IDE build warnings and errors 2022-05-01 16:20:57 -07:00
Steve Atherton 1ab5c21967
Merge pull request #6979 from sfc-gh-jslocum/speedup_tail_latency
Don't do huge tail latencies for network requests when speed up simul…
2022-04-29 22:31:35 -07:00
Josh Slocum e5840d3a38 Merge branch 'main' into speedup_tail_latency 2022-04-29 16:05:12 -05:00
Steve Atherton 338d2304d7
Bug fix: Killing a machine process would not wait for AsyncFileNonDurable close operations to finish, causing a later reopen of the same file in a new process to hang forever. Renamed AsyncFileNonDurable::deleteFile to closeFile for clarity. Renamed Machine deletingFiles to deletingOrClosingFiles for clarity. (#7007) 2022-04-29 14:01:18 -07:00
Renxuan Wang c69a07a858
Check in the new Hostname logic. (#6926)
* Revert #6655.

20220407-031010-renxuan-c101052c21da8346           compressed=True data_size=31004844 duration=4310801 ended=100000 fail_fast=10 max_runs=100000 pass=100000 priority=100 remaining=0 runtime=1:04:15 sanity=False started=100047 stopped=20220407-041425 submitted=20220407-031010 timeout=5400 username=renxuan

* Revert #6271.

20220407-051532-renxuan-470f0fe6aac1c217           compressed=True data_size=30982370 duration=3491067 ended=100002 fail_fast=10 max_runs=100000 pass=100002 priority=100 remaining=0 runtime=0:59:57 sanity=False started=100141 stopped=20220407-061529 submitted=20220407-051532 timeout=5400 username=renxuan

* Revert #6266.

Remove resolving-related functionalities in connection string. Connection string will be used for storing purpose only, and non-mutable.

20220407-175119-renxuan-55d30ee1a4b42c2f           compressed=True data_size=30970443 duration=5437659 ended=100000 fail_fast=10 max_runs=100000 pass=100000 priority=100 remaining=0 runtime=0:59:31 sanity=False started=100154 stopped=20220407-185050 submitted=20220407-175119 timeout=5400 username=renxuan

* Add hostname to coordinator interfaces.

* Turn on the new hostname logic.

* Add the corresponding change in config txns.

The most notable change is before calling basicLoadBalance(), we need to call tryInitializeRequestStream() to initialize request streams first.

Passed correctness tests.

* Return error when hostnames cannot be resolved in coordinators command.

* Minor fixes.
2022-04-27 21:54:13 -07:00
Josh Slocum 1e567c12e8 Don't do huge tail latencies for network requests when speed up simulation is set 2022-04-27 15:04:17 -05:00
Ata E Husain Bohra 333aadb903
Interface to enable clients to send/receive REST requests/responses (#6866)
* Interface to enable clients to send/receive REST requests/responses

Description

Major changes:
1. Add RESTClient interface enabling client to send/receive REST HTTP
   requests. Support REST APIs are: get, head, put, post, delete, trace
2. Add RESTUtil file introducing below interfaces:
 2.1. RESTUrl - Extract URI information: host, service, request-parameters.
 2.2. RESTConnectionPool-
      Connection establishment, life-cycle management, connection-pool (TTL)
 2.3. RESTClientKnobs - supports REST Knob parameter management and updates

Testing

Unit test - fdbrpc/RESTClient, fdbrpc/RESTUtils
2022-04-27 12:17:52 -07:00
Renxuan Wang e40cc8722c
A few hostname improvements. (#6825)
* Add tryResolveHostnames() in connection string.

* Add missing hostname to related interfaces.

* Do not pass RequestStream into *GetReplyFromHostname() functions.

Because we are using new RequestStream for each request anyways. Also, the passed in pointer could be nullptr, which results in seg faults.

* Add dynamic hostname resolve and reconnect intervals.

* Address comments.
2022-04-20 13:42:46 -07:00
Binglin Chang 408c0cf1c9
Fix compile errors on ubuntu 20.04 (#4931) 2022-04-20 10:00:46 -07:00
Junhyun Shim edc659d339 Use camelCase & move error code to 6xxx 2022-04-13 21:11:52 +02:00
Junhyun Shim b6a0c0f942 Merge remote-tracking branch 'upstream/main' into tenant-token-sign 2022-04-13 19:55:37 +02:00
Markus Pilman 099385928c Address review comments 2022-04-11 09:17:10 -06:00
Markus Pilman 64ac66c1d0 fix merge conflict 2022-04-10 14:16:21 -06:00
Markus Pilman 16467262f0 Merge remote-tracking branch 'origin/main' into features/private-request-streams 2022-04-10 14:12:37 -06:00
Markus Pilman d8a0b57b6c clients have to listen on a port in simulation 2022-04-10 14:09:15 -06:00
Renxuan Wang 938e8ed996 Do not throw lookup_failed when resolving fails.
Instead, return an empty Optional<NetworkAddress>. For resolveWithRetry(), still return NetworkAddress because it retries until succeed.
2022-04-08 14:21:49 -07:00
Renxuan Wang 70c49b1b87 DNS cache entry should also be removed in tryGetReplyFromHostname(). 2022-04-08 14:21:49 -07:00
Renxuan Wang f3a8ac21be Change resolve functions in hostname to return network address. 2022-04-08 14:21:49 -07:00
Renxuan Wang 3290c81f1d Fix tryGetReplyFromHostname(). 2022-04-08 14:21:49 -07:00
Josh Slocum 6276cebad9
Blob integration (#6808)
* Fixing leaked stream with explicit notify failed before destructor

* better logic to prevent races in change feed fetching

* Found new race that makes assert incorrect

* handle server overloaded in initial read from fdb

* Handling more blob error types in granule retry

* Fixing rollback metadata problem, added better debugging

* Fixing version race when fetching change feed metadata

* Better racing split request handling

* fixing assert

* Handle change feed popped check in the blob worker

* fix: do not use a RYW transaction for a versionstamp because of randomize API version (#6768)

* more merge conflict issues

* Change feed destroy fixes

* Fixing change feed destroy and move race

* Check error condition in BG file req

* Using relative endpoints for blob worker interface

* Fixing bug in previous fix

* More destroy and move race fixes

* Don't update empty version on destroy in case it gets rolled back. moved() and removing will take care of ensuring it is not read

* Bug fix (#6796)

* fix: do not use a RYW transaction for a versionstamp because of randomize API version

* fix: if the initialSnapshotVersion was pruned, granule history was incorrect

* added a way to compress null bytes in printable()

* Fixing durability issue with moving and destroying change feeds

* Adding fix for not fully deleting files for a granule that child granules need to re-snapshot

* More destroy and move races

* Fixing change feed destroy and pop races

* Renaming bg prune to purge, and adding a C api and unit test for it

* more cleanup

* review comments

* Observability for granule purging

* better handling for change feed not registered

* Fixed purging bugs (#6815)

* fix: do not use a RYW transaction for a versionstamp because of randomize API version

* fix: if the initialSnapshotVersion was pruned, granule history was incorrect

* added a way to compress null bytes in printable()

* fixed a few purging bugs

Co-authored-by: Evan Tschannen <evan.tschannen@snowflake.com>
2022-04-08 14:15:25 -07:00
Markus Pilman 7631d299bf Merge remote-tracking branch 'origin/main' into features/private-request-streams 2022-04-08 09:58:56 -06:00
Zhe Wu ebb60f690b Fix valgrind error in HealthMonitor 2022-04-07 15:48:06 -07:00
Markus Pilman bf956f5630 Merge remote-tracking branch 'origin/main' into features/private-request-streams 2022-04-07 13:29:27 -06:00
Markus Pilman e2d7d4075d multiple bug fixes 2022-04-07 11:08:07 -06:00
Zhe Wu 5fd494a57b Allow worker health monitor to report recent destroyed peers who currently have roles in transaction systems 2022-04-06 13:33:50 -07:00
Renxuan Wang 267c4deaee
Add tryGetReplyFromHostname() and retryGetReplyFromHostname(). (#6761)
* Add hostname to coordination interfaces.

* Add tryGetReplyFromHostname() and retryGetReplyFromHostname().

* Change tryGetReplyFromHostname() to call hostname.resolve().

* Add throw for actor_cancelled.
2022-04-06 10:47:00 -07:00
Renxuan Wang e548c0d604 Add DNS cache. 2022-04-04 15:08:17 -07:00
Renxuan Wang ff934ca2ad Change MockDNS to DNSCache. 2022-04-04 15:08:17 -07:00
Renxuan Wang ebe928e7e1 Throw lookup_failed() when hostname resolving fails. 2022-04-04 15:08:17 -07:00
Chaoguang Lin 7d365bd1bb
Remote ikvs debugging (#6465)
* initial structure for remote IKVS server

* moved struct to .h file, added new files to CMakeList

* happy path implementation, connection error when testing

* saved minor local change

* changed tracing to debug

* fixed onClosed and getError being called before init is finished

* fix spawn process bug, now use absolute path

* added server knob to set ikvs process port number

* added server knob for remote/local kv store

* implement simulator remote process spawning

* fixed bug for simulator timeout

* commit all changes

* removed print lines in trace

* added FlowProcess implementation by Markus

* initial debug of FlowProcess, stuck at parent sending OpenKVStoreRequest to child

* temporary fix for process factory throwing segfault on create

* specify public address in command

* change remote kv store knob to false for jenkins build

* made port 0 open random unused port

* change remote store knob to true for benchmark

* set listening port to randomly opened port

* added print lines for jenkins run open kv store timeout debug

* removed most tracing and print lines

* removed tutorial changes

* update handleIOErrors error handling to handle remote-ikvs cases

* Push all debugging changes

* A version where worker bug exists

* A version where restarting tests fail

* Use both the name and the port to determine the child process

* Remove unnecessary update on local address

* Disable remote-kvs for DiskFailureCycle test

* A version where restarting stuck

* A version where most restarting tests green

* Reset connection with child process explicitly

* Remove change on unnecessary files

* Unify flags from _ to -

* fix merging unexpected changes

* fix trac.error to .errorUnsuppressed

* Add license header

* Remove unnecessary header in FlowProcess.actor.cpp

* Fix Windows build

* Fix Windows build, add missing ;

* Fix a stupid bug caused by code dropped by code merging

* Disable remote kvs by default

* Pass the conn_file path to the flow process, though not needed, but the buildNetwork is difficult to tune

* serialization change on readrange

* Update traces

* Refactor the RemoteIKVS interface

* Format files

* Update sim2 interface to not clog connections between parent and child processes in simulation

* Update comments; remove debugging symbols; Add error handling for remote_kvs_cancelled

* Add comments, format files

* Change method name from isBuggifyDisabled to isStableConnection; Decrease(0.1x) latency for stable connections

* Commit the IConnection interface change, forgot in previous commit

* Fix the issue that onClosed request is cancelled by ActorCollection

* Enable the remote kv store knob

* Remove FlowProcess.actor.cpp and move functions to RemoteIKeyValueStore.actor.cpp; Add remote kv store delay to avoid race; Bind the child process to die with parent process

* Fix the bug where one process starts storage server more than once

* Add a please_reboot_remote_kv_store error to restart the storage server worker if remote kvs died abnormally

* Remove unreachable code path and add comments

* Clang format the code

* Fix a simple wait error

* Clang format after merging the main branch

* Testing mixed mode in simulation if remote_kvs knob is enabled, setting the default to false

* Disable remote kvs for PhysicalShardMove which is for RocksDB

* Cleanup #include orders, remove debugging traces

* Revert the reorder in fdbserver.actor.cpp, which fails the gcc build

Co-authored-by: “Lincoln <“lincoln.xiao@snowflake.com”>
2022-03-31 17:08:59 -07:00
Yi Wu acfee48894 AsyncFileKAIO: add latency histograms 2022-03-30 10:45:04 -07:00
Markus Pilman b595d4462f Throw error on unauthorized access 2022-03-29 14:58:43 -06:00
sfc-gh-tclinkenbeard 77786f4fc6 Merge remote-tracking branch 'origin/main' into change-data-hall 2022-03-27 12:44:05 -07:00
Steve Atherton 478ff1eb76
Merge pull request #6652 from sfc-gh-anoyes/anoyes/fast-alloc
Move most usage of FastAlloc to malloc
2022-03-26 15:46:05 -07:00
Junhyun Shim 410a422bd7 Use Standalone instead of embedded Arenas
- Repeat TokenSign test 100 times per run instead of 1
- Test for verify fail case
2022-03-25 14:03:37 +01:00
Junhyun Shim 9f3fa5ba9b Fix clang format 2022-03-24 19:12:34 +01:00