Commit Graph

50 Commits

Author SHA1 Message Date
Lukas Joswiak 5ca2b89bdf Fix simulation issue where process switch was ignored
The simulator tracks only active processes. Rebooted or killed processes
are removed from the list of processes, and only get added back when the
process is rebooted and starts up again. This causes a problem for the
`RebootProcessAndSwitch` kill type, which wants to simultaneously reboot
all machines in a cluster and change their cluster file. If a machine is
currently being rebooted, it will miss the reboot process and switch
command.

The fix is to add a check when a process is being started in simulation.
If the process has had its cluster file changed and the cluster is in a
state where all processes should have had their cluster files reverted
to the original value, the simulator will now send a
`RebootProcessAndSwitch` signal right when the process is started. This
will cause an extra reboot, but should correctly switch the process back
to its original, correct cluster file, allowing the cluster to fully
recover all clusters.

Note that the above issue should only affect simulation, due to how the
simulator tracks processes and handles kill signals.

This commit also adds a field to each process struct to determine
whether the process is being run in a DR cluster in the simulation run.
This is needed because simulation does not differentiate between
processes in different clusters (other than by the IP), and some
processes needed to switch clusters and some simply needed to be
rebooted.
2022-10-27 13:56:13 -07:00
Jingyu Zhou a8391caf23 Revert "Data loss protection v2" 2022-10-20 18:09:58 -05:00
Lukas Joswiak 7f889c87e3 Fix simulation issue where process switch was ignored
The simulator tracks only active processes. Rebooted or killed processes
are removed from the list of processes, and only get added back when the
process is rebooted and starts up again. This causes a problem for the
`RebootProcessAndSwitch` kill type, which wants to simultaneously reboot
all machines in a cluster and change their cluster file. If a machine is
currently being rebooted, it will miss the reboot process and switch
command.

The fix is to add a check when a process is being started in simulation.
If the process has had its cluster file changed and the cluster is in a
state where all processes should have had their cluster files reverted
to the original value, the simulator will now send a
`RebootProcessAndSwitch` signal right when the process is started. This
will cause an extra reboot, but should correctly switch the process back
to its original, correct cluster file, allowing the cluster to fully
recover all clusters.

Note that the above issue should only affect simulation, due to how the
simulator tracks processes and handles kill signals.

This commit also adds a field to each process struct to determine
whether the process is being run in a DR cluster in the simulation run.
This is needed because simulation does not differentiate between
processes in different clusters (other than by the IP), and some
processes needed to switch clusters and some simply needed to be
rebooted.
2022-10-18 21:37:42 -07:00
Dmitry Wagin 0cc3e6875f
Merge branch 'apple:main' into dwagin/freebsd 2022-09-21 17:16:28 +03:00
A.J. Beamon 4fd64630e8 Convert literal string ref instances to use _sr suffix 2022-09-19 11:35:58 -07:00
sfc-gh-tclinkenbeard 82adc1e856 Make g_simulator a pointer 2022-09-15 09:00:33 -07:00
Dmitry Wagin 0e86d32a87 Fixed FreeBSD build N2 2022-09-03 23:27:40 +03:00
Chaoguang Lin 29f98f3654
Avoid duplicate snapshot on one process if it serves as multiple roles (#7294)
* Fix comments

* Add simulation value for SERVER_KNOBS->SNAP_CREATE_MAX_TIMEOUT

* A work version with correctness clean

* Remove unnecessay comments; debugging symbols

* Only check secondary address for coordinators, same as before

* Change the trace to SevError and remove the ASSERT(false)

* Remove TLogSnapRequest handling on TlogServer, which is changed to use WorkerSnapRequest

* Add retry for network failures

* Add retry limit for network failures; still allow duplicate snapshots on processes are both tlog and storage to avoid race

* Add retry limit as a knob and make backoff exponentail

* Add getDatabaseConfiguration(Transaction* tr)

* revert back to send request for each role once

* update some comments
2022-06-29 11:23:07 -07:00
Markus Pilman ffaf15c12a moved wellknownendpoints and fixed some includes 2022-06-23 17:03:53 -06:00
Josh Slocum a43c98519d
Switching char* to std::string for ProcessInfo to have it own memory (valgrind errors) (#7176) 2022-05-17 12:41:04 -07:00
Andrew Noyes 9f628df278 outputBuffer may not be null-terminated, so don't convert to std::string
This fixes a heap buffer overflow caught by ASAN, and importantly not
valgrind, since valgrind currently doesn't run on the first binary in a
restarting test. The next commit will change that so we always run
valgrind on the binary under test.
2022-05-10 14:45:00 -07:00
Binglin Chang 408c0cf1c9
Fix compile errors on ubuntu 20.04 (#4931) 2022-04-20 10:00:46 -07:00
Chaoguang Lin 7d365bd1bb
Remote ikvs debugging (#6465)
* initial structure for remote IKVS server

* moved struct to .h file, added new files to CMakeList

* happy path implementation, connection error when testing

* saved minor local change

* changed tracing to debug

* fixed onClosed and getError being called before init is finished

* fix spawn process bug, now use absolute path

* added server knob to set ikvs process port number

* added server knob for remote/local kv store

* implement simulator remote process spawning

* fixed bug for simulator timeout

* commit all changes

* removed print lines in trace

* added FlowProcess implementation by Markus

* initial debug of FlowProcess, stuck at parent sending OpenKVStoreRequest to child

* temporary fix for process factory throwing segfault on create

* specify public address in command

* change remote kv store knob to false for jenkins build

* made port 0 open random unused port

* change remote store knob to true for benchmark

* set listening port to randomly opened port

* added print lines for jenkins run open kv store timeout debug

* removed most tracing and print lines

* removed tutorial changes

* update handleIOErrors error handling to handle remote-ikvs cases

* Push all debugging changes

* A version where worker bug exists

* A version where restarting tests fail

* Use both the name and the port to determine the child process

* Remove unnecessary update on local address

* Disable remote-kvs for DiskFailureCycle test

* A version where restarting stuck

* A version where most restarting tests green

* Reset connection with child process explicitly

* Remove change on unnecessary files

* Unify flags from _ to -

* fix merging unexpected changes

* fix trac.error to .errorUnsuppressed

* Add license header

* Remove unnecessary header in FlowProcess.actor.cpp

* Fix Windows build

* Fix Windows build, add missing ;

* Fix a stupid bug caused by code dropped by code merging

* Disable remote kvs by default

* Pass the conn_file path to the flow process, though not needed, but the buildNetwork is difficult to tune

* serialization change on readrange

* Update traces

* Refactor the RemoteIKVS interface

* Format files

* Update sim2 interface to not clog connections between parent and child processes in simulation

* Update comments; remove debugging symbols; Add error handling for remote_kvs_cancelled

* Add comments, format files

* Change method name from isBuggifyDisabled to isStableConnection; Decrease(0.1x) latency for stable connections

* Commit the IConnection interface change, forgot in previous commit

* Fix the issue that onClosed request is cancelled by ActorCollection

* Enable the remote kv store knob

* Remove FlowProcess.actor.cpp and move functions to RemoteIKeyValueStore.actor.cpp; Add remote kv store delay to avoid race; Bind the child process to die with parent process

* Fix the bug where one process starts storage server more than once

* Add a please_reboot_remote_kv_store error to restart the storage server worker if remote kvs died abnormally

* Remove unreachable code path and add comments

* Clang format the code

* Fix a simple wait error

* Clang format after merging the main branch

* Testing mixed mode in simulation if remote_kvs knob is enabled, setting the default to false

* Disable remote kvs for PhysicalShardMove which is for RocksDB

* Cleanup #include orders, remove debugging traces

* Revert the reorder in fdbserver.actor.cpp, which fails the gcc build

Co-authored-by: “Lincoln <“lincoln.xiao@snowflake.com”>
2022-03-31 17:08:59 -07:00
sfc-gh-tclinkenbeard 66d71e107d Move actorcompiler.h include to the end of includes 2022-03-16 00:09:16 -07:00
“Lincoln cbcf0fa400 forgot to save before commit 2021-09-25 16:35:54 -06:00
“Lincoln 112d7fe632 make sure that bytes only get added to bytesRead when bytes > 0 2021-09-25 16:35:54 -06:00
“Lincoln cd25d42d66 minor fix 2021-09-25 16:35:54 -06:00
“Lincoln 8d30766129 modified file descriptor to non-blocking access before performing read, also throw any error that occurs while reading 2021-09-25 16:35:54 -06:00
Josh Slocum c31196ab01
Update fdbserver/FDBExecHelper.actor.cpp
Co-authored-by: A.J. Beamon <aj.beamon@snowflake.com>
2021-05-25 10:24:00 -07:00
Josh Slocum 07edc1db9a Removing spaces in SevWarn trace event names 2021-05-25 17:06:48 +00:00
FDB Formatster df90cc89de apply clang-format to *.c, *.cpp, *.h, *.hpp files 2021-03-10 10:18:07 -08:00
Andrew Noyes 79cec09255 Apply clang-tidy's performance-inefficient-vector-operation fix
I ran this command in my build directory after compiling with
OPEN_FOR_IDE. It took a few small tweaks to get it to compile, which is
outside the scope of this commit.

    $ python run-clang-tidy.py -j $(nproc) -checks='-*,performance-inefficient-vector-operation' -fix
2021-03-04 03:58:25 +00:00
sfc-gh-tclinkenbeard f9aba7064d Use consistent return method for fork_child 2021-01-29 16:27:48 -08:00
sfc-gh-tclinkenbeard acac02587d Trace output from forked processes 2021-01-29 01:31:26 -08:00
sfc-gh-tclinkenbeard 8dc39f4d8f Make ExecCmdValueString const-correct 2020-12-27 14:15:22 -04:00
A.J. Beamon d128252e90 Merge release-6.3 into master 2020-05-22 09:25:32 -07:00
sramamoorthy 096afe40be enhance spawnProcess 2020-05-07 14:24:35 -07:00
sramamoorthy 789975e191 fixes in spawnProcess 2020-05-07 14:24:34 -07:00
sramamoorthy 697a9422f5 replace boost::process with execv to spawn snapCreate process 2020-05-07 14:24:34 -07:00
Markus Pilman e4611e8ae4 fix versions.h stupidity 2020-04-06 10:28:55 -07:00
Markus Pilman 8b5780c36c don't include source and binary dir
This forces users to use include paths from the sources root.

So `#include "Arena.h"` won't work anymore, only
`#include "flow/Arena.h"` will.
2020-04-06 10:13:49 -07:00
mpilman d09e07f1f5 Merge remote-tracking branch 'upstream/master' into features/icc 2020-02-04 10:26:18 -08:00
Andrew Noyes 1827e77f2e Update fdbserver/FDBExecHelper.actor.cpp
Co-Authored-By: Jingyu Zhou <jingyuzhou@gmail.com>
2019-10-25 10:42:22 -07:00
Andrew Noyes d4de608bb6 Fix OPEN_FOR_IDE build 2019-10-25 10:42:22 -07:00
sramamoorthy c9097cca18 deprecate isTLogInSameNode used by snapshot V1 2019-10-09 15:33:11 -07:00
Evan Tschannen dc1d055b27
Merge pull request #2042 from senthil-ram/snap_cli_fix
fix fdbcli --exec 'snapshot create.sh' failure
2019-08-30 13:40:38 -07:00
sramamoorthy b3277f2982 Fix #2009 posix compliant args for snapshot binary 2019-08-30 12:54:09 -07:00
sramamoorthy 64000eafb2 Fixes #2020 - snap binpath not to be passed as arg 2019-08-27 11:49:12 -07:00
sramamoorthy 9afd162e2f remove snap v1 related code 2019-07-25 17:29:31 -07:00
sramamoorthy 869f77aef1 Few cosmetic edits and fixes 2019-07-24 15:36:28 -07:00
sramamoorthy 8f1f0c0435 snap v2: worker and other helper related changes 2019-07-24 15:36:28 -07:00
mpilman ab019fbe41 More minor fixes, removed snapshots 2019-06-20 14:28:31 -07:00
sramamoorthy 1d1d42c8af disable boost::process code for windows and mac 2019-06-13 15:43:03 -07:00
A.J. Beamon 3dd2479193 Try avoiding use of boost in FDBExecHelper 2019-06-13 13:09:29 -07:00
sramamoorthy 4bcb590f12 g_random -> deterministicRandom() 2019-05-28 22:07:46 -07:00
sramamoorthy c906da1f62 simulator: spawnProcess to wait for long duration
spawnProcess was waiting for 3 seconds and terminating
the child process for synchronous calls, but in the
simulator, this can lead to non-determinism, because
some cases the command can run in <3 or >3 seconds.
The fix is to increase the wait for duration to be
very long that it has to synchronously wait and get
the results or the test will timeout.
2019-05-28 22:07:46 -07:00
sramamoorthy b56d8e648f bp::child->wait_for does not give correct err code
boost::process::child->wait_for does not give the error code
from the process being run. Re-arrange the code to work-around
it.
2019-05-28 22:07:46 -07:00
sramamoorthy dcd2d96751 make spawnProcess predictable in the simulator 2019-05-28 22:07:46 -07:00
sramamoorthy 936ffc2dde rebase related changes 2019-05-28 22:07:46 -07:00
sramamoorthy ec7834e2f7 code re-orgnaization and address comments 2019-05-28 22:07:46 -07:00