Commit Graph

1332 Commits

Author SHA1 Message Date
mpilman f79a9594c1 Several bugfixes to make fdb build on non-ide 2019-02-19 15:16:59 -08:00
mpilman 999ea09bfd Use correct fwd decls in TesterInterface
Also TesterInterface.h -> TesterInterface.actor.h
2019-02-19 15:16:59 -08:00
mpilman 699216f713 Use fwd decls in workloads
Also workloads.h -> workloads.actor.h
2019-02-19 15:16:59 -08:00
mpilman 3f0fd2a20c Use fwd decls in WorkerInterface
Also WorkerInterface.h -> WorkerInterface.actor.h
2019-02-19 15:16:59 -08:00
mpilman 27a3153719 Use ACTOR forward declarations in MoveKeys
Also MoveKeys.h -> MoveKeys.actor.h
2019-02-19 15:16:59 -08:00
mpilman 3a0f9839b9 Fix minor IDE build errors 2019-02-19 15:16:59 -08:00
mpilman 9b14aeb156 Tell cmake not to link/install on ide build 2019-02-19 15:16:59 -08:00
mpilman 0bb60e5a3b Use proper fwd decl in NativeAPI
Also NativeAPI.h -> NativeAPI.actor.h
2019-02-19 15:16:59 -08:00
mpilman 78dd80ea8a Proper fwd decl in BackupAgent
Also BackupAgent.h -> BackupAgent.actor.h
2019-02-19 15:16:59 -08:00
mpilman 3cb2391b58 use proper fwd declarations in ManagementAPI
Also ManagementAPI.h -> ManagementAPI.actor.h
2019-02-19 15:16:59 -08:00
Vishesh Yadav 124a277a65 Remove coordinator printf in SimulatedCluster 2019-02-19 13:53:17 -08:00
Vishesh Yadav 0898686c9b Remove old TODO 2019-02-18 15:43:27 -08:00
Vishesh Yadav e05b53d755 Merge remote-tracking branch 'apple/master' into task/tls-upgrade 2019-02-15 20:37:07 -08:00
Vishesh Yadav d34a658357 Add Restore role previously removed by mistake 2019-02-15 20:25:03 -08:00
Vishesh Yadav c03de6c7b6 Update CLI to take addresses with mixed TLS states 2019-02-15 20:23:07 -08:00
Evan Tschannen 83060c6e56 Merge pull request #1062 from jzhou77/PR
Add a new DataDistributor role.
2019-02-15 13:51:27 -08:00
Evan Tschannen 8099a0b665 Merge pull request #1129 from alexmiller-apple/tstlog-1
Spill-by-reference TLog Part 1: IDiskQueue::read()
2019-02-15 11:48:14 -08:00
mpilman 75f692b931 simplify actorcompiler and target to compile coveragetool 2019-02-15 00:01:42 -08:00
Jingyu Zhou 5e6577cc82 Final cleanup per review comments
Make distributor interface optional in ServerDBInfo and many other small
changes.
2019-02-14 16:37:17 -08:00
Jingyu Zhou bf6da81bf9 Remove recovery version from data distribution queue
This parameter is no longer used/needed.
2019-02-14 16:37:16 -08:00
Evan Tschannen 038144adb1 Update fdbserver/DataDistribution.actor.cpp
Co-Authored-By: jzhou77 <jingyuzhou@gmail.com>
2019-02-14 16:37:16 -08:00
Evan Tschannen 171a69c810 Update fdbserver/ClusterController.actor.cpp
Co-Authored-By: jzhou77 <jingyuzhou@gmail.com>
2019-02-14 16:37:16 -08:00
Evan Tschannen a4b2c9ef88 Update fdbserver/ClusterController.actor.cpp
Co-Authored-By: jzhou77 <jingyuzhou@gmail.com>
2019-02-14 16:37:16 -08:00
Jingyu Zhou fc3a784963 Fix another build team bug
The buildTeam() can create teams with undesired storage servers, which are
considered unhealthy. As a result, the data movement can become stuck.

Fix this by adding an ACTOR monitorHealthyTeams that builds teams every
second whenever there are no healthy teams.

Clean up storageServerTracker() interface.
2019-02-14 16:37:16 -08:00
Jingyu Zhou 8afe84d31b Fix an ordering bug for buildTeam
When zeroHealthyTeams signals and the storage server becomes healthy, we could
attempt buildTeam before the ServerStatusMap is updated. As a result, the
healthy server is not available for use. Fix by delaying buildTeam until after
the status map is updated.
2019-02-14 16:37:16 -08:00
Jingyu Zhou a7d1111a10 Make servers and serverIDs private for TCTeamInfo
Make both accessible through public member functions instead.
2019-02-14 16:37:16 -08:00
Jingyu Zhou 0e47912192 Fix an out-of-memory error 2019-02-14 16:37:16 -08:00
Jingyu Zhou 8b1235533e Fix segfault during configuration change
This bug was introduced in cee23ee3. During a configuration change, the data
distributor was restarted, which destroys previous DDTeamCollection and cancels
all previous teamTracker(). In this case, even though the healthy team count
reaches 0, there is no need to try to rebuild teams. The bug is triggered when
a rebuild is attempted after DDTeamCollection has already been destroyed.
2019-02-14 16:37:16 -08:00
Jingyu Zhou 07dab56133 Fix a data movement stuck bug
When moving keys to a team, if one of the servers in the target team dies, then
the move can become stuck. This is because the DDTeamCollection waits for all
the data movement of the failed server to be completed. However, in this case,
because the movement has not finished yet, checking the database tells us there
is no key associated with this server and it is safe to go ahead. In reality,
only the in-memory structure knows there is pending movement, i.e., the unfinished
move causes some keys to be attributed to the failed server. Thus, the server
can't be removed yet. Fix by adding a check against the in-memory structure in
waitForAllDataRemoved().

Use const& to optimize a few function parameters.
2019-02-14 16:37:16 -08:00
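
The removal check described in commit 07dab56133 can be pictured with a minimal sketch. Everything here is hypothetical: `ServerView`, its fields, and both functions are illustrative stand-ins, not the actual `waitForAllDataRemoved()` code. It only shows why a database-only check can report "safe to remove" while an unfinished move still attributes keys to the server.

```cpp
#include <cassert>

// Hypothetical snapshot of what is known about one storage server.
struct ServerView {
    int keyRangesInDatabase;   // key ranges a database scan attributes to the server
    int pendingMovesInMemory;  // unfinished moves tracked only in memory
};

// Check as described before the fix: consults the database only.
bool canRemoveDatabaseOnly(const ServerView& s) {
    return s.keyRangesInDatabase == 0;
}

// Check as described after the fix: also requires no pending in-memory moves.
bool canRemoveWithMemoryCheck(const ServerView& s) {
    assert(s.keyRangesInDatabase >= 0 && s.pendingMovesInMemory >= 0);
    return s.keyRangesInDatabase == 0 && s.pendingMovesInMemory == 0;
}
```

With a view of `{0, 1}` (nothing in the database yet, one move still in flight), the first function approves removal and the second does not.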
Jingyu Zhou 961d71538e A follow-on fix to ensure build team for zero teams 2019-02-14 16:37:16 -08:00
Jingyu Zhou 5deeec29e3 Fix a bug where team is not rebuilt after storage failure
When two failures happen to a team and one of the servers recovers, the current
logic skips building a new team, which is wrong.
2019-02-14 16:37:16 -08:00
Jingyu Zhou 62c67a50e5 Fix segfault error
The usedIds map is populated by the master registration request. However, this
request may contain processes that the cluster controller is not aware of,
i.e., processes not in the id_worker map.

This was ok until I added tracing of usedIds, which silently inserts an empty
entry into the id_worker map for the unknown process. This new entry can cause
a crash when trying to access its LocalityData.

Remove AsyncTrigger for usedIds, and change to serverInfo->onChange.

Use const & to avoid unnecessary copies in WorkerInterface's LocalityData
and getExtraTLogEligibleMachines().
2019-02-14 16:37:16 -08:00
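
The silent insertion described in commit 62c67a50e5 matches a well-known property of `std::map::operator[]`: looking up a missing key default-constructs a value and inserts it. The sketch below assumes that is the mechanism involved. `WorkerInfo`, `WorkerMap`, and the function names are hypothetical and do not come from the FoundationDB source.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>

// Hypothetical stand-in for the per-worker record; not the real type.
struct WorkerInfo {
    std::string locality;
};

using WorkerMap = std::map<std::string, WorkerInfo>;

// Risky pattern: operator[] inserts a default-constructed entry when the
// key is absent, so a "read-only" trace call mutates the map.
std::string localityForTraceUnsafe(WorkerMap& workers, const std::string& id) {
    return workers[id].locality;
}

// Safer pattern: find() never inserts; unknown ids yield nullptr.
const WorkerInfo* findWorker(const WorkerMap& workers, const std::string& id) {
    auto it = workers.find(id);
    return it == workers.end() ? nullptr : &it->second;
}

// Map size after looking up an unknown id with operator[].
std::size_t sizeAfterUnsafeLookup() {
    WorkerMap workers{{"known", WorkerInfo{"dc1"}}};
    localityForTraceUnsafe(workers, "unknown");
    return workers.size();
}

// Map size after looking up an unknown id with find().
std::size_t sizeAfterSafeLookup() {
    WorkerMap workers{{"known", WorkerInfo{"dc1"}}};
    const WorkerInfo* w = findWorker(workers, "unknown");
    assert(w == nullptr);  // the unknown id is reported, not inserted
    (void)w;
    return workers.size();
}
```

Taking the map by `const&` in the lookup helper makes accidental insertion a compile error, since `operator[]` is not available on a const `std::map`.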
Jingyu Zhou 21066b013a Remove DataDistributorRejoinRequest
This is no longer needed, since worker registration piggybacks distributor
interface now.
2019-02-14 16:37:16 -08:00
Jingyu Zhou 816f8b1ae1 Per review comments
Add a knob for starting distributor delay.
Move distributor failed variable to a local loop.
2019-02-14 16:37:16 -08:00
Jingyu Zhou 578473a974 Various review comments fixes 2019-02-14 16:37:16 -08:00
Jingyu Zhou b3d1633114 Fix bugs of missing request
The quiet database check can fail to send out requests and report a timeout. This
seems to be caused by reusing a request that uses the same ReplyPromise. Another
bug is that the Proxy can wait an unneeded amount of time for a database change
while the distributor is already known to it.
2019-02-14 16:37:16 -08:00
Evan Tschannen 5fb48083cd Update fdbserver/ClusterController.actor.cpp
Co-Authored-By: jzhou77 <jingyuzhou@gmail.com>
2019-02-14 16:37:16 -08:00
Evan Tschannen 2db31d70a5 Update fdbserver/DataDistributorInterface.h
Co-Authored-By: jzhou77 <jingyuzhou@gmail.com>
2019-02-14 16:37:16 -08:00
Jingyu Zhou 8c61de318f Fix segfault and no_more_servers errors 2019-02-14 16:37:16 -08:00
Jingyu Zhou 7897616164 Fix wait failure bug on cluster controller
The setDistributor() sets an AsyncVar and then runs waitFailureClient. This
ordering is wrong because the AsyncVar::set triggers the other loop to run
first, which will wait on Never(). The correct code should wait on the Future
returned by the waitFailureClient.
2019-02-14 16:37:16 -08:00
Jingyu Zhou 00f2253229 Piggyback data distributor interface in worker registration
This allows cluster controller to know data distributor during worker
registration phase, thus avoiding recruiting a new data distributor after
starting.

Also change the worker to skip creating a new data distributor if there is
already one running on the worker, which can trigger operation timeout in tests.
2019-02-14 16:37:16 -08:00
Jingyu Zhou 39e4a59154 Add used worker IDs to cluster controller
This "usedIds" is updated when receiving a master registration message, so that
when recruiting new data distributor, existing assignment is known.
2019-02-14 16:37:16 -08:00
Jingyu Zhou 6a655143e8 A follow-on fix for config key usage
And some trace event cleanups.
2019-02-14 16:37:16 -08:00
Jingyu Zhou be5c962bb7 Add a new configuration version key \xff/conf/version
This fixed a bug found by upgrade test, where the configuration monitor of the
data distributor was monitoring excludedServersVersionKey, which doesn't
change in ChangeConfig workload. As a result, data distributor was not aware of
configuration changes.

Add this new key and make sure it is updated on configuration changes
so that the monitor can detect them.
2019-02-14 16:37:16 -08:00
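
The reasoning in commit be5c962bb7 can be sketched with a toy key-value store: a monitor only notices changes that touch the key it watches, so every configuration change has to bump a dedicated version key. This is a sketch under stated assumptions. `KeyValueStore`, `KeyMonitor`, `changeConfig`, and the plain-text key names are all hypothetical; the real key named in the commit is `\xff/conf/version`.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>

// Toy in-memory store; stands in for the database in this sketch.
using KeyValueStore = std::map<std::string, long>;

// Illustrative key names only; not the real system keys.
const std::string kExcludedVersionKey = "excluded/version";
const std::string kConfigVersionKey = "conf/version";

// Remembers the last value seen under one watched key.
class KeyMonitor {
public:
    KeyMonitor(const KeyValueStore& store, std::string key)
        : key_(std::move(key)), lastSeen_(read(store)) {}

    // Returns true if the watched key changed since the previous call.
    bool changed(const KeyValueStore& store) {
        long now = read(store);
        bool differs = (now != lastSeen_);
        lastSeen_ = now;
        return differs;
    }

private:
    long read(const KeyValueStore& store) const {
        auto it = store.find(key_);
        return it == store.end() ? 0 : it->second;
    }

    std::string key_;  // declared first so read() can use it during construction
    long lastSeen_;
};

// A configuration change writes the setting and bumps the version key.
void changeConfig(KeyValueStore& store, const std::string& setting, long value) {
    assert(setting != kConfigVersionKey);  // the version key is managed here
    store[setting] = value;
    ++store[kConfigVersionKey];
}
```

A monitor built on `kExcludedVersionKey` reports no change after `changeConfig`, while one built on `kConfigVersionKey` does.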
Jingyu Zhou 3135f1d84b Cluster controller ignores distributor rejoin
After controller starts one, it will wait for that one and ignore any rejoins
received later.

Add remoteRecovered() to data distribution for remote team collection.
2019-02-14 16:37:16 -08:00
Jingyu Zhou 99e109d6c5 Fix timeout error due to lost exception
Found in tests: a move key conflict exception was not handled because nothing
waited on the Future object. As a result, the data distributor did not die,
and database checking couldn't get the metric and kept retrying until
timeout.
2019-02-14 16:37:16 -08:00
Jingyu Zhou c38b2a8c38 Change masterId to distributorId in tracker.
This reflects the change of moving data distribution out of master server.
2019-02-14 16:37:16 -08:00
Jingyu Zhou aea602d9c7 Remove getRecoveryInfo from master interface. 2019-02-14 16:37:16 -08:00
Jingyu Zhou f5242bda7c Update data distributor to use configuration monitor
This enables removal of GetRecoveryInfoRequest from master interface.
Remove recoveryTransactionVersion from dataDistribution().
2019-02-14 16:37:16 -08:00
Jingyu Zhou e0a7162cf8 Add a failure timeout knob for data distributor.
Set default time to 1.0s.
2019-02-14 16:37:16 -08:00