104 Commits

Author SHA1 Message Date
A.J. Beamon 31caac67dc Rename supported_versions[x].clients to supported_versions[x].connected_clients 2017-11-01 10:41:30 -07:00
Yichi Chiang d4f75630de Support log group field in status json 2017-09-28 16:31:29 -07:00
Evan Tschannen acb7e66d01 fix: failed logs do not count even if they have returned a result 2017-09-25 18:14:40 -07:00
Evan Tschannen 2bf042a559 fix: file_corrupt was not checking for fault injection
latency threshold was too long
2017-09-25 17:22:41 -07:00
Evan Tschannen cce4eeb52d fix: the master was sending the cluster controller uninitialized configurations 2017-09-22 16:59:24 -07:00
Evan Tschannen 180438d41e fix: use the number of present logServers rather than the total size of the vector 2017-09-22 16:19:16 -07:00
Evan Tschannen 738ae21c3a fix: an optimization in buggified locking can cause recovery to break because it would not restart if a locked process was killed when the remaining logs cannot obtain a quorum 2017-09-22 15:07:57 -07:00
Alex Miller 585c9bf68f Quick fix to reduce CPU usage of ensureEpochLive.
It is suspected that policy recomputations are driving proxy CPU usage up, and
thus latency and throughput down.  To quickly confirm this theory, we're
forcing ensureEpochLive to wait until it has RF responses, which means we'll
probably only validate the policy once per call.
2017-09-21 18:22:24 -07:00
Evan Tschannen fbd67ea547 fix: excluded servers are worst fit for master rather than never assigned (so that we can recover if every process has been excluded)
fix: better master exists did not use exclusions because the configuration was reset
2017-09-20 11:48:26 -07:00
Evan Tschannen cb43563b2d fix: toMap properly lists the redundancy mode of the cluster 2017-09-19 16:35:42 -07:00
Evan Tschannen f75dfc3153 do not register with the master until recovery of the queue is complete, to avoid having the master wait a long time for a peek response 2017-09-18 17:39:12 -07:00
Alex Miller 567d663afd Fix SimulationConfig never generating a custom config.
A 0 was changed to a 1 when rewriting code, and `case 0:` was never being hit. :(
Thankfully, it looks like nothing was broken by this in the meantime.
2017-09-18 17:29:36 -07:00
Evan Tschannen e8b895c878 added the ability to disable connection failures for a period of time after one happens 2017-09-18 12:46:29 -07:00
Evan Tschannen 489332533c all timeouts longer than two minutes can be lowered to 60.0 with buggification
added a workload that tries for a 50 second maximum latency in the presence of one failure with both buggification and connection failures
2017-09-18 11:04:51 -07:00
Evan Tschannen 34f987f56d added a test in simulation which ensures that a recovery after a single failure takes less than 15 seconds 2017-09-15 17:55:01 -07:00
Evan Tschannen d9b64899c5 fix: we need to wait for log server failures if we have not locked all of the logs 2017-09-15 13:11:21 -07:00
Evan Tschannen 36c98f18e9 do not register a worker with the cluster controller until it has finished recovering all files from disk 2017-09-15 10:57:58 -07:00
Evan Tschannen f3b7aa615d fix: seed storage servers are recruited based on the storage policy 2017-09-14 17:06:00 -07:00
Alvin Moore 9404d226d0 Merge branch 'release-5.0' 2017-09-13 16:49:00 -07:00
Alvin Moore cb92194772 Fixed problem with master being recruited on excluded servers 2017-09-13 16:48:27 -07:00
Alex Miller 5e14f19875 Merge pull request #147 from cie/alexmiller/grvtlogs
Only verify a quorum of TLogs are unlocked for a GRV request
2017-09-13 16:07:25 -07:00
Alex Miller d6b3be98fe Fix whitespace. 2017-09-13 15:49:39 -07:00
Alex Miller 06a9c7a772 Remove unnecessary policy recomputations in confirmEpochLive.
Watching for interface changes on readied servers was done as a workaround for
a case where all futures could be ready, but the policy verification would
never succeed.  This turns out to be because stopping a tlog causes an error to
be returned.  However, if a TLog is stopped, then we know that we can't do any
more commits, so we can just immediately stop trying and never mark our future
as ready.
2017-09-13 15:45:09 -07:00
Evan Tschannen 8cb53fd608 Merge pull request #149 from cie/choose-leader-on-stateless-processes
choose leader on the preferred process class
2017-09-13 13:58:49 -07:00
A.J. Beamon 4fa2415553 Merge branch 'release-5.0' 2017-09-08 17:28:12 -07:00
A.J. Beamon bb8a245bdb circus: throughput test scales latency error by the target latency 2017-09-08 17:27:54 -07:00
Yichi Chiang bd1c7e7295 Use addTeamsBestOf() instead of addAllTeams() when team size is greater than 3 2017-09-07 12:31:01 -07:00
Evan Tschannen dc1f7ca6b7 testers now use client locality load balancing 2017-09-01 12:53:01 -07:00
A.J. Beamon cc24072a5d Add the multi version API to the list of APIs to choose in the APICorrectness tester. Support for the multi-version client already existed. 2017-08-31 16:23:55 -07:00
Evan Tschannen d61be4c760 Merge branch 'release-5.0' 2017-08-30 12:59:24 -07:00
Evan Tschannen 963e1c3f31 fix: we need to reboot the process even if it will result in too many files, because the check will not succeed without it 2017-08-30 12:58:46 -07:00
Alex Miller 8d97a15c3f BUGGIFY recovery to lock only the minimum number of TLogs required to prevent a quorum.
This is to test the quorum logic introduced in the previous patch, and should
flush out any other bugs that rely on TLog locking during recovery.
2017-08-29 14:43:40 -07:00
Alex Miller f8486d1368 Only ensure a quorum of TLogs are unlocked to confirm the epoch hasn't ended.
Currently, GRV will wait to hear back from (almost) all TLogs to confirm that
they're unlocked and that the current epoch hasn't ended.  This confirms that
there isn't a new set of proxies and using the commit version from the old set
of proxies would violate causal consistency.

However, during recovery, we ensure that no quorum of TLogs exists before
starting a new epoch and allowing new commits on the new TLogs.  Thus, we only
need to wait until we have a quorum of TLogs that are unlocked.

This should be a significant improvement in latency particularly for the cases
when we start running >10 TLogs.
2017-08-29 14:43:40 -07:00
Alex Miller 4c1d61cd08 Assorted minor changes.
In which we:
* Clarify some math in a comment
* Remove misleading debugging information
* Add a useful trace event
2017-08-29 14:43:40 -07:00
Alex Miller dbfa94f735 LF -> CRLF
It appears a previous patch left parts of this file ending with LF, and the
majority of the file ends in CRLF.  I see no reason to keep this inconsistency,
but these line ending wars are going to drive me insane.
2017-08-29 14:43:40 -07:00
Alvin Moore 6020d70863 Added trace event to track reboots initiated by ConsistencyCheck workload in simulation 2017-08-29 11:41:27 -07:00
Alvin Moore c95a1be5ec Add trace event for rebooting process during simulation for consistency check 2017-08-29 11:00:44 -07:00
A.J. Beamon 86774f6e42 Merge branch 'release-5.0' 2017-08-28 17:17:00 -07:00
A.J. Beamon 03478561b9 fix: Set lock aware at the transaction level for latency probe to avoid having to fill the shard cache every time. 2017-08-28 17:16:46 -07:00
A.J. Beamon 9a0a3b6329 Merge commit '66528becb82d826e81fa644bb378212584ab580e' 2017-08-28 16:47:59 -07:00
Yichi Chiang 9fe927127f choose leader on the preferred process class 2017-08-28 14:41:04 -07:00
Alvin Moore 44e0df78c5 Added support for tracking roles for simulation workers
Fixed the exclusion and inclusion address simulation API and integration within workloads
Added more information within trace events for simulation
2017-08-28 11:25:37 -07:00
Alvin Moore 581bd6c8ed Added option to delay the displaying of the simulation workers 2017-08-28 10:53:56 -07:00
Alec Grieser 300b5a17ed Merge branch 'release-5.0' 2017-08-25 18:55:33 -07:00
Evan Tschannen 272b4b984c fix: fixed a rare bug where we do not wait for a file in the process of being deleted to shut down before rebooting a machine 2017-08-25 10:12:58 -07:00
Evan Tschannen 26a5b5e422 rollback workload now clogs the communication between one of the proxies and the tlogs, since that is what will cause a rollback 2017-08-23 16:08:13 -07:00
A.J. Beamon 4c706d33e9 Merge branch 'release-5.0' 2017-08-23 14:59:43 -07:00
Evan Tschannen be941b4bd1 sending void to committed could cause self to be deleted, so call cleanup before sending 2017-08-23 13:56:18 -07:00
Alvin Moore 7729f663e9 Ensured that the circus id is always lowercase 2017-08-23 13:45:00 -07:00
Evan Tschannen f9308b8fa6 Merge pull request #145 from cie/alexmiller/simrefactor
Refactor simulation to pull all configuration parameters into one struct.
2017-08-23 12:54:21 -07:00