Commit Graph

75 Commits

Author SHA1 Message Date
Meng Xu 6f2e12be42 Minor improvement on comments 2020-07-12 18:32:47 -07:00
sfc-gh-tclinkenbeard a59925dd73 Added more const-correctness improvements 2020-06-20 22:15:19 -07:00
tclinken 3ff2fa9c2a Removed uses of outdated cpuTicks and errorCounts 2020-05-13 23:27:44 -07:00
Meng Xu bd345f85db ConsistencyCheck:Fix failue due to address inconsistency between process and worker
With TLS, a worker (or process) can have a TLS address and non-TLS address.
When a process is created in simulation, the primary address is TLS by default.
The non-TLS one is the TLS address port plus one.

In a connection between two workers, if their primary addresses do not enable
or disable TLS together, one worker will swap its primary address and secondary address
so that the TLS config of the two endpoints can match.

The swap can make the primary address no longer the TLS one that was created
when the process is created. And the swap only happens for worker instead of
process struct in simulation.

This swap can cause worker->address != process->address.
In checkForExtraDataStores actor, we use worker->address to check if a process
is killable and use the process->address to kill the process. The inconsistency
can cause simulation to kill a protected process that is not killable and leads
to simulation failure.
2020-03-10 21:07:16 -07:00
Evan Tschannen 96258b9809 Merge branch 'release-6.2'
# Conflicts:
#	documentation/sphinx/source/release-notes.rst
#	fdbcli/fdbcli.actor.cpp
#	fdbclient/ManagementAPI.actor.cpp
#	fdbrpc/FlowTransport.actor.cpp
#	fdbserver/ClusterController.actor.cpp
#	fdbserver/DataDistribution.actor.cpp
#	fdbserver/DataDistribution.actor.h
#	fdbserver/DataDistributionQueue.actor.cpp
#	fdbserver/KeyValueStoreMemory.actor.cpp
#	fdbserver/MasterProxyServer.actor.cpp
#	fdbserver/QuietDatabase.actor.cpp
#	fdbserver/SkipList.cpp
#	fdbserver/StorageMetrics.actor.h
#	fdbserver/TLogServer.actor.cpp
#	fdbserver/fdbserver.actor.cpp
#	fdbserver/storageserver.actor.cpp
#	fdbserver/workloads/KVStoreTest.actor.cpp
#	flow/CMakeLists.txt
#	flow/Knobs.cpp
#	flow/Knobs.h
#	flow/genericactors.actor.cpp
#	flow/serialize.h
2020-02-21 19:09:16 -08:00
Evan Tschannen dcbce3593e fixed TLS in simulation 2020-02-10 14:00:21 -08:00
Jingyu Zhou de8d953865 Add backup role, class, and worker skeleton 2020-01-22 19:35:30 -08:00
negoyal a4a0bf18f9 Merging with Master. 2019-11-12 13:01:29 -08:00
mpilman 370ba8b841 Remove --object-serializer flag from executables 2019-08-06 09:25:40 -07:00
Alex Miller 7a500cd37f A giant translation of TaskFooPriority -> TaskPriority::Foo
This is so that APIs that take priorities don't take ints, which are
common and easy to accidentally pass the wrong thing.
2019-06-25 02:47:35 -07:00
Evan Tschannen 55f7e7d372 fix: The delay inside the disabledMap was causing the storage server updateStorage actor to run on the client process 2019-06-13 14:28:30 -07:00
Trevor Clinkenbeard 8144882d7b Merge branch 'apple-master' into features/local-rk 2019-06-10 19:40:25 -07:00
mpilman 20c3f7f264 remove mixed-mode support 2019-05-13 14:15:23 -07:00
mpilman 69fa3d3903 fixed compilation issues after rebase 2019-05-13 14:15:23 -07:00
mpilman 9eeb48c43d Allow to turn on object serializer
This commit includes functionality to turn on
the object serializer for network communication.
This is done the following way:

- On incoming connections, a process will detect
  whether the client supports the object serializer
  and will only serialize responses with it, if it does
- On outgoing connections, the command line flag is used
  to determine whether the object serializer should be used
  to send data.

This way, a cluster can run in mixed mode. To upgrade one
can upgrade one process at a time and set the flag one process
at a time.

This is how this is tested on the simulator:
- The command line flag can take three options: on, off,
  and random.
- For off, the object serializer will never we used.
- For on, the object serializer will be always used.
- For random, the simulator will flip a coin for each
  process it starts up.
2019-05-13 14:15:22 -07:00
mpilman bdba8e22eb Added test and bugfixes 2019-04-08 11:05:29 -07:00
Markus Pilman 101a05ae77
Merge branch 'master' into features/client-simulator 2019-04-03 10:03:56 -08:00
mpilman e23e63c6ac Implemented JavaWorkload
This change allows a user to write a workload in Java.

The way this is implemented is by creating a JVM within the
simulator and calling the corresponding workload class. A
workload can then run in the simulator or on a testing cluster.

If the workload is executed within the simulator, the resulting
test will not be deterministic anymore as it will execute in a
different thread (and even without that it is not clear, whether
we could get determinism as the JVM does a lot of stuff that are
not deterministic).

This is intendet to get better testing of the Java client and
layer authors can use the simulator to test their layers on a single
machine but they can still simulate failing machines etc.
2019-03-31 17:57:43 -07:00
A.J. Beamon 71e2fdafb8 Changes to ratekeeper camel case 2019-03-27 08:24:25 -07:00
Evan Tschannen 1fc6937802 changed NetworkAddressList to at most two addresses for performance 2019-03-23 17:54:46 -07:00
Evan Tschannen a2108047aa removed LocalitySetRef and IRepPolicyRef typedefs, because for clarity the Ref suffix is reserved for arena allocated objects instead of reference counted objects. 2019-03-13 13:14:39 -07:00
Jingyu Zhou 3c86643822 Separate Ratekeeper from data distribution.
Add a new role for ratekeeper.

Remove StorageServerChanges from data distribution.
Ratekeeper monitors storage servers, which borrows the idea from
DataDistribution.
2019-03-07 13:16:20 -08:00
Vishesh Yadav 592e224155 net: add/use formatIpPort to format IP:PORT pairs #963 2019-03-04 14:12:45 -08:00
Vishesh Yadav cc9ad0e202 net: Use IPv6 in simulation testing #963
25% times we will use IPv6 addresses
2019-03-04 14:12:45 -08:00
Vishesh Yadav 57832e625d net: Support IPv6 #963
- NetworkAddress now contains IPAddress object which can be either
IPv4 or IPv6 address. 128bits are used even for IPv4 addresses,
however only 32bits are used when using/serializing IPv4 address.

- ConnectPacket is updated to store IPv6 address. Backward compatible
with old format since the first 32bits of IP address field is used
for serialization of IPv4.

- Mainly updates rest of the code to use IPAddress structure instead
of plain uint32_t.

- IPv6 address/pair ports should be represented as `[ip]:port` as per
convention. This applies to both cluster files and command line
arguments.
2019-03-04 14:12:41 -08:00
Vishesh Yadav e05b53d755 Merge remote-tracking branch 'apple/master' into task/tls-upgrade 2019-02-15 20:37:07 -08:00
Jingyu Zhou 886e7ab2ba Add a new DataDistributor role.
Let cluster controller to start a new data distributor role by sending a
message to a chosen worker.
Change MasterInterface usage in DataDistribution to masterId

Add DataDistributor rejoin handling.

This allows the data distributor to tell the new cluster controller of its
existence so that the controller doesn't spawn a new one. I.e., there should
be only ONE data distributor in the cluster.

If DataDistributor (DD) doesn't join in a while, then ClusterController (CC) tries
to recruit one as DD. CC also monitors DD and restarts one if it failed.

The Proxy is also monitoring the DD. If DD failed, the Proxy will ask CC for
the new DD.

Add GetRecoveryInfo RPC to master server, which is called by data distributor
to obtain the recovery Transaction version from the master server.
2019-02-14 16:30:13 -08:00
Vishesh Yadav 907446d0ce Merge remote-tracking branch 'apple/master' into task/tls-upgrade 2019-02-14 11:37:38 -08:00
Evan Tschannen e9ddd94e27 The failure monitor is given a list of all IP addresses associated with a process
The connect packet includes the correct remote address
Did a lot of code cleanup
Simulation test mixed TLS and non-TLS listeners on the same process
2019-01-31 18:20:14 -08:00
Evan Tschannen 699f8dd617 fix: coordinators auto could put two coordinators in the same zone
simulation now tests two machines in the same zone
2019-01-18 15:42:48 -08:00
Vishesh Yadav 31c4ac07ac WIP: FailureMonitoring use endpointAddressList (create individual endpoints for each address) WIP: g_currentDeliveryPeerAddress WIP: FlowTransport endpoint map WIP: Add peerReference to addressToEndpointMap 2019-01-09 07:46:01 -08:00
Vishesh Yadav 51b89ae083 WIP 2019-01-09 07:41:02 -08:00
Vishesh Yadav e04abf25f7 simulator: Support multiple listeners on single process
Sim2Listener can now take the network address to listen on. This is
used to listen to multiple ports in simulator and test the patch
which added multiple network addresses to single endpoint.
2018-12-13 13:36:52 -08:00
Vishesh Yadav 3eb9b23024 Listen to multiple addresses and start using vector<NetworkAdddress> in Endpoint
- This patch will make FDB listen to multiple addresses given via
  command line. Although, we'll still use first address in most places,
  this patch starts using vector<NetworkAddress> in Endpoint at some basic
  places.
- When sending packets to an endpoint, pick a random network address in
  endpoints
- Renames Endpoint::address to Endpoint::addresses since it
  now holds a vector of addresses.
2018-12-13 13:36:52 -08:00
Vishesh Yadav 43e5a46f9b Change Endpoint::address(NetworkAddress) to vector<NetworkAddress>
Extend `Endpoint` class to take multiple NetworkAddresses instead of
just one. Hence, to talk to an endpoint instead of one IP:PORT, we'll
have multiple IP:PORT pairs.

This patch simply adds the field and makes changes to compile the
codebase. The first element of of `address` field is used everywhere.
Hence the way we talk to remains same with this patch.

NOTE:

Directly accessing the first memeber of Endpoint::address is unsafe
as Endpoint() doesn't enforces non-empty address list. However, since
the correctness test pass for now and are anyway replacing all those
unsafe accesses with ones considering the whole vector, this patch
ignores to access them in safe way.
2018-12-13 13:36:52 -08:00
Evan Tschannen 4b5d0b4e2c Merge branch 'release-6.0'
# Conflicts:
#	documentation/sphinx/source/release-notes.rst
#	fdbclient/AsyncFileBlobStore.actor.cpp
#	fdbclient/AsyncFileBlobStore.actor.h
#	fdbclient/BlobStore.actor.cpp
#	fdbclient/BlobStore.h
#	fdbclient/HTTP.actor.cpp
#	fdbclient/ManagementAPI.actor.cpp
#	fdbclient/NativeAPI.actor.cpp
#	fdbrpc/LoadBalance.actor.h
#	fdbrpc/batcher.actor.h
#	fdbrpc/fdbrpc.vcxproj
#	fdbrpc/sim2.actor.cpp
#	fdbserver/DataDistribution.actor.cpp
#	fdbserver/DataDistributionTracker.actor.cpp
#	fdbserver/SimulatedCluster.actor.cpp
#	fdbserver/TLogServer.actor.cpp
#	fdbserver/masterserver.actor.cpp
2018-11-10 13:04:24 -08:00
Evan Tschannen 56c51c1bb3 fix: usableRegions was uninitialized 2018-11-09 10:17:35 -08:00
Robert Escriva 268093a96d Adjust all includes to be relative to the root.
Remove the use of relative paths.  A header at foo/bar.h could be included by
files under foo/ with "bar.h", but would be included everywhere else as
"foo/bar.h".  Adjust so that every include references such a header with the
latter form.

Signed-off-by: Robert Escriva <rescriva@dropbox.com>
2018-10-19 17:35:33 +00:00
Evan Tschannen 200e65fe61 added a workload which tests killing an entire region, and recovering from the failure with data loss.
fix: we cannot pop the txs tag from remote logs until they have a full copy of the txnStateStore
fix: we have to modify all of history, we cannot stop after finding a local remote
2018-09-17 18:32:39 -07:00
A.J. Beamon 2de0b5d6d7 Add the roles running on a process as a field on trace events in the form of a comma delimited string of role abbreviations. 2018-09-05 15:06:14 -07:00
Steve Atherton fb46385a39 Merge pull request #628 from alexmiller-apple/reloadcertificates
Reload certificates if changed.

This is a cherry-pick of #628 back to release-6.0
2018-08-06 18:04:04 -07:00
Evan Tschannen f72a9f60c0 only disable fearless if a datacenter has actually been killed
fix: we must prevent recovery into the dead datacenter while reducing usable_regions
2018-07-16 10:06:57 -07:00
Evan Tschannen 82cc30be62 added testing for two_satellite_fast and two_satellite_safe 2018-07-09 22:01:46 -07:00
Evan Tschannen 6f4ca2eba2 fix: get all processes did not include rebooting processes 2018-07-05 21:13:56 -07:00
Evan Tschannen cd4fb9285a waitForExlusion requires both regions to be healthy, which is only possible if we do not kill all logs in a region 2018-07-05 14:04:42 -07:00
Evan Tschannen 7315e5da55 fix: isExcluded and isCleared were exactly wrong
fix: isCleared should mean the process is dead
2018-07-05 02:22:22 -07:00
Evan Tschannen e17dfea3b6 fix: desiredTLogCount was used instead of getDesiredLogs(), which caused problems with recruitment when desiredTLogCount was -1.
canKillProcess logic was wrong.
We still need to configure usable_regions because if datacenterVersionDifference is too large we cannot complete data movement.
2018-07-04 16:22:32 -04:00
Evan Tschannen 0913368651 added usable_regions to specify if we will replicate into a remote region
remote replication defaults to the primary replication
removed remote_logs, because they should be specified as an override in the regions object
2018-06-17 19:31:15 -07:00
Yichi Chiang 26b93ff920 Share log mutations between backups and DRs which have the same backup range 2018-03-16 18:09:23 -07:00
Evan Tschannen 37a6a81634 Merge commit '7f6fc3e039c911cd84b8540f7f799fc38a1c1822' into feature-remote-logs
# Conflicts:
#	fdbserver/workloads/RestartRecovery.actor.cpp
2018-02-23 12:33:28 -08:00