Commit Graph

2270 Commits

Author SHA1 Message Date
Meng Xu 08a721b320 Merge branch 'master' into mengxu/server-team-remover-PR 2019-07-08 16:30:32 -07:00
A.J. Beamon 0a5c7608df Remove "Number" suffix from newly added events (and variables that feed the events). 2019-07-08 15:45:28 -07:00
A.J. Beamon f52c239ef8 Merge branch 'master' into trace-event-rename
# Conflicts:
#	fdbserver/DataDistribution.actor.cpp
#	fdbserver/QuietDatabase.actor.cpp
2019-07-08 15:37:00 -07:00
A.J. Beamon a5a6f8431c Add a random UID to TransactionMetrics in case a client opens multiple connections and also a field to indicate whether the connection is internal. Convert some of the metrics to our Counter object instead of running totals. 2019-07-08 14:01:04 -07:00
Evan Tschannen c348b3da51 After a proxy dies, it will remain alive for an additional 10 seconds to forward clients to the new proxies 2019-07-08 12:53:40 -07:00
Evan Tschannen ec11ef024b
Merge pull request from ajbeamon/merge-release-6.1-into-master
Merge release 6.1 into master
2019-07-08 09:02:56 -07:00
A.J. Beamon dd85edb08c
Merge pull request from xumengpanda/mengxu/DD-ensure-redundant-team-priority-as700-PR
TeamTracker:Set redundant team priority as PRIORITY_TEAM_REDUNDANT
2019-07-08 08:47:28 -07:00
Vishesh Yadav 8d3a826c63
Merge pull request from alexmiller-apple/cycle-verify-only
Add a checkOnly parameter to Cycle workload.
2019-07-05 21:59:52 -07:00
Jingyu Zhou 50e7593c5b
Merge pull request from ajbeamon/remove-trace-event-underscores
Remove trace event underscores
2019-07-05 21:45:55 -07:00
Alex Miller 14e5dd74fe Add a checkOnly parameter to Cycle workload.
So that it can be used in the real world for consistency checking of
backup and DR.
2019-07-05 19:09:09 -07:00
Evan Tschannen 310a5fe9a3 fix: we cannot reject 100% of requests, because a storage server which is stuck needs to get a future version error to trigger an all alternatives failed message from load balance so that clients will re-grab storage server interfaces from the proxy 2019-07-05 17:28:22 -07:00
Meng Xu e8fb7564f5 Merge branch 'master' into mengxu/DD-ensure-redundant-team-priority-as700-PR 2019-07-05 17:28:12 -07:00
Meng Xu c7a996267c TeamRemover: Remove unused declaration
Also change state variable to variable.
2019-07-05 16:54:06 -07:00
Evan Tschannen 15e894c724 Merge in master 2019-07-05 15:49:24 -07:00
Evan Tschannen e7c0ecf729 fix: we cannot reject 100% of requests, because a storage server which is stuck needs to get a future version error to trigger an all alternatives failed message from load balance so that clients will re-grab storage server interfaces from the proxy 2019-07-05 15:46:16 -07:00
Meng Xu 46d28a3b79 TeamTracker:Set redundant team priority as redundant
The redundant team removed by teamRemover will not exist
in the global teams data structure. So we will not find
the redundant team from shard-to-team mapping in the system key.

Before this change, teamTracker marks such team as PRIORITY_TEAM_UNHEALTHY.
With this change, it marks it as PRIORITY_TEAM_REDUNDANT
2019-07-05 15:24:00 -07:00
A.J. Beamon 4be08d9b2d Rename datacenter_version_difference to datacenter_lag and include both seconds and versions. 2019-07-05 14:36:18 -07:00
A.J. Beamon 2a56e011ea Merge branch 'release-6.1' into merge-release-6.1-into-master
# Conflicts:
#	documentation/sphinx/source/release-notes.rst
#	fdbserver/DataDistribution.actor.cpp
2019-07-05 13:52:29 -07:00
Meng Xu 7ba6cd2d9d ServerTeamRemover:Reduce the overshot server team number to build
Each server has the maximum of DESIRED_TEAMS_PER_SERVER and
(DESIRED_TEAMS_PER_SERVER * storageTeamSize) / 2
2019-07-05 11:01:50 -07:00
A.J. Beamon 2a709ee5d0 Rename event details that use the suffix "Number" to indicate a count, as number could also imply an index. Rename a few other trace events and details that e.g. needed to be pluralized. 2019-07-05 08:54:21 -07:00
A.J. Beamon 9f4b6fd770 Remove additional underscores 2019-07-05 08:12:25 -07:00
A.J. Beamon a3ac9c7eea Remove underscores from some trace event names 2019-07-05 08:08:29 -07:00
Alex Miller ea6898144d Merge remote-tracking branch 'upstream/master' into flowlock-api 2019-07-03 20:44:15 -07:00
Evan Tschannen 23ecc17075
Merge pull request from senthil-ram/recoveryFix
sev40 if knownCommittedVersion > recoveryVersion
2019-07-03 16:39:16 -07:00
Evan Tschannen e153571a50
Merge pull request from alexmiller-apple/crc32c-memory-storage
Memory storage engine to use crc32c DiskQueue by default (in 6.2).
2019-07-03 16:37:42 -07:00
Evan Tschannen 79a90d33a7 fix: the push location for txs tags needs to be based on what the tag will become after changing the number of txs tags 2019-07-03 16:06:54 -07:00
A.J. Beamon 8c10d832a1 Add coordinator role in trace events 2019-07-03 11:09:36 -07:00
Meng Xu 2782d432ac ServerTeamRemover:Update the desired number and pick unhealthy teams first 2019-07-02 22:17:53 -07:00
Meng Xu 599fcb2e6d Add serverTeamRemover to remove redundant server teams 2019-07-02 17:40:37 -07:00
Meng Xu 716494ed9f ConsistencyCheck:Check serverTeamNumber larger than desired number 2019-07-02 17:40:37 -07:00
Meng Xu 7461c87ae6 AddTeamsBestOf: Build more teams than desired
We build more teams than we ultimately want so that the serverTeamRemover() actor can remove the teams
whose members belong to too many teams. This gives us a more balanced number of teams per server.
2019-07-02 17:40:37 -07:00
Evan Tschannen 8afab93e29
Merge pull request from etschannen/master
revert storage server priority changes
2019-07-02 17:25:31 -07:00
Evan Tschannen 3fb0999e10 revert storage server priority changes 2019-07-02 16:54:47 -07:00
Evan Tschannen 86b0224347 Merge branch 'release-6.1' of github.com:apple/foundationdb into release-6.1 2019-07-02 16:27:31 -07:00
Evan Tschannen 64e33bb4f9 added logging for maintenance mode 2019-07-02 16:25:29 -07:00
Stephen Atherton 71ba490cf8 Removed use of the C "struct hack" as it is not valid C++. Replaced zero-length members with functions returning a pointer for arrays or a reference for single members. 2019-07-02 16:02:58 -07:00
dyoungworth 817fce080b Fix minor bug in External Workload 2019-07-02 15:57:26 -07:00
Meng Xu 7afbd10a10 Change teamRemover to machineTeamRemover 2019-07-02 15:16:34 -07:00
Meng Xu d2d6022ed4 StorageServerTracker:Do not always set doBuildTeams
When interface changes, we set doBuildTeams to true only when
the interface location changes.
2019-07-02 14:24:26 -07:00
Meng Xu de5bcaf588 minTeamNumber for server and machine cannot be uint64_t
Because the consistency check will try to convert the value to int64_t.
If no server exists, the variable is never updated and therefore overflows
when it is converted to int64_t.
2019-07-01 21:39:18 -07:00
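The overflow described in de5bcaf588 can be illustrated with a minimal C++ sketch. The helpers `minTeamCount` and `asReported` are hypothetical stand-ins, not the actual data distribution or consistency check code. The sketch assumes the minimum starts at the type's largest value and is only lowered when a server is seen. On two's-complement platforms (and by definition from C++20 onward) an untouched unsigned sentinel turns negative when converted to int64_t.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical helper (not FoundationDB code): smallest team count across
// servers. With no servers the loop never runs and the sentinel is returned.
template <typename Count>
Count minTeamCount(const std::vector<Count>& teamsPerServer) {
    Count result = std::numeric_limits<Count>::max();
    for (Count teams : teamsPerServer) {
        if (teams < result) result = teams;
    }
    return result;
}

// Mimics a consumer that stores the reported value as int64_t.
template <typename Count>
int64_t asReported(Count value) {
    return static_cast<int64_t>(value);
}

// Shows the difference between an unsigned and a signed sentinel.
inline void demonstrateSentinelOverflow() {
    const std::vector<uint64_t> noServersUnsigned;
    const std::vector<int64_t> noServersSigned;
    // The unsigned sentinel (2^64 - 1) becomes negative once stored as int64_t.
    assert(asReported(minTeamCount(noServersUnsigned)) < 0);
    // The signed sentinel survives the conversion unchanged.
    assert(asReported(minTeamCount(noServersSigned)) ==
           std::numeric_limits<int64_t>::max());
}
```

Calling `demonstrateSentinelOverflow()` from a test or from `main()` exercises both cases; with at least one server present, both variants report the same minimum.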
Evan Tschannen 841e61ac25 fixed a broken promise in localRatekeeper 2019-07-01 16:56:35 -07:00
Meng Xu 347a7ecdff MachineTeams:Make traceTeamCollectionInfo not an actor 2019-07-01 16:50:53 -07:00
mengranwo e54eedf0e2 Address pr comments, remove wait(tr.commit()) for read-only txn 2019-07-01 16:09:51 -07:00
mengranwo 0ad151e70a style formatting 2019-07-01 16:09:51 -07:00
mengranwo 819b6e3d6d fix compiling error 2019-07-01 16:09:51 -07:00
mengranwo c7148bbb14 address cr comments: 2019-07-01 16:09:51 -07:00
mengranwo d96cdacdd5 fix format issue 2019-07-01 16:09:51 -07:00
mengranwo 11161746f8 add try catch block around tx.onerror() 2019-07-01 16:09:51 -07:00
mengranwo 6b61b0e030 fix syntax error, pass compile 2019-07-01 16:09:51 -07:00
mengranwo 0b9cd18fb4 checking cluster is healthy or not during recovery process(for storage engine), if healthy, delete data files and join as new 2019-07-01 16:09:51 -07:00
Jingyu Zhou b69d7adabc Remove unused remoteRecovered from master server 2019-07-01 15:41:35 -07:00
Alex Miller 23de5b64ad Memory storage engine to use crc32c DiskQueue by default (in 6.2). 2019-07-01 13:38:06 -07:00
Meng Xu b8cb883040 AddBestMachineTeams:Fix input must be non-negative value 2019-06-28 22:46:16 -07:00
Evan Tschannen 4e45a58750 fix: forced recovery did not copy the number of txsTags properly 2019-06-28 20:51:16 -07:00
Evan Tschannen 2c40c818cf fix: txsTags was not copied into oldLogData 2019-06-28 17:51:16 -07:00
Alex Miller 8e1ab6e7db Merge remote-tracking branch 'upstream/master' into flowlock-api 2019-06-28 17:32:54 -07:00
Evan Tschannen 5041ff38b1 removed unneeded description 2019-06-28 16:54:22 -07:00
Evan Tschannen a124fc6e8a fixed compiler error 2019-06-28 16:54:22 -07:00
Evan Tschannen b9a6271375 local ratekeeper no longer globally limits 2019-06-28 16:54:22 -07:00
Evan Tschannen 4cef1d3937 Experimental change of storage write priority 2019-06-28 16:54:22 -07:00
Evan Tschannen f539b5f09a fix: a large targetRateRatio means limiting more 2019-06-28 16:54:22 -07:00
Evan Tschannen 18d5fbf1e0 Avoid jumping from rejecting 0% of requests directly to 20% of requests 2019-06-28 16:54:22 -07:00
Evan Tschannen db413c37f7 restored the STORAGE_DURABILITY_LAG_SOFT_MAX knob and made the rk target slightly smaller than the soft limit, to avoid inaccuracies in ratekeeper control causing behavior changes on the storage servers 2019-06-28 16:54:22 -07:00
Evan Tschannen ec16688db1 fixed the local ratekeeper workload to match the logic on the storage server 2019-06-28 16:54:22 -07:00
Evan Tschannen a97940a10b fixed compiler error 2019-06-28 16:54:22 -07:00
Evan Tschannen 92b32855ca ratekeeper’s control algorithm would oscillate when limited by local ratekeeper 2019-06-28 16:54:22 -07:00
Evan Tschannen 1b939d5208
Merge pull request from satherton/feature-redwood
Update redwood storage engine to latest correctness-passing version
2019-06-28 16:22:06 -07:00
Meng Xu 63c42533eb TraceTeamCollectionInfo:Remove delay 2019-06-28 16:19:58 -07:00
Meng Xu 875cb877ac TeamCollection: Apply clang-format 2019-06-28 16:01:05 -07:00
Meng Xu 0baae134f6 TeamCollectionInfo: Resolve review comments 2019-06-28 15:59:47 -07:00
Evan Tschannen cfce1e1705 fix: buffered peek cursor would advance very slowly through large ranges of empty versions 2019-06-28 15:54:08 -07:00
Evan Tschannen 7f4586ad49 the number of txsTags needs to be tracked separately from the number of transaction logs because of forced recoveries 2019-06-28 12:33:24 -07:00
Meng Xu cb681693df TeamCollection:Do NOT consider healthiness in counting team number
If a team is removed from DD, it will be marked as failed and eventually removed from the
global teams data structure.
Team healthiness is likely to be a temporary state which can change rather quickly.
2019-06-28 09:50:43 -07:00
Evan Tschannen 2113d6d01e fix: peek all possible txsTags which could have been used by old log sets 2019-06-27 23:39:19 -07:00
Evan Tschannen 235697f688 fix: txsTags are not popped at the recovery version 2019-06-27 23:18:26 -07:00
Meng Xu 4da345f7d2 TeamCollectionTest:Remove test on minTeamOnServer 2019-06-27 19:05:10 -07:00
Meng Xu ce7eb10cac TeamCollectionInfo: Only count team number for healthy server and machine 2019-06-27 19:04:22 -07:00
Meng Xu f889843332 Change traceTeamCollectionInfo to actor
There are cases where traceTeamCollectionInfo was called within the same execution block, i.e.,
no wait between the two traceTeamCollectionInfo calls.
Because simulation uses the same time for all execution instructions in the same execution block,
having more than one traceTeamCollectionInfo at the same time will mess up the trackLatest semantics.
When one of them is always chosen by simulator, simulation test will report false positive error.

Changing this function to actor and adding a small delay inside the function can solve this problem.
2019-06-27 18:24:20 -07:00
Meng Xu 4fe3c7f749 TeamCollectionInfo:Revert to original version where it is 2019-06-27 17:09:21 -07:00
Meng Xu 42620e4831 TeamCollectionTest:GetTeamCollectionValid wait until values are correct 2019-06-27 16:52:36 -07:00
Meng Xu ee41311a54 TeamCollection:Call addTeamsBestOf when remainingTeamBudget is not 0 2019-06-27 15:29:26 -07:00
Evan Tschannen 52efcfd136 fix: properly create the right number for txsTags when changing between different numbers of logs 2019-06-27 15:15:05 -07:00
Meng Xu 8d5e848808 QuietDatabase test: Check each server has at least 1 team 2019-06-27 14:22:41 -07:00
Meng Xu 2993a96de8 TeamCollectionInfo: Remove debug trace and apply clang format 2019-06-27 14:15:51 -07:00
Meng Xu 5f5c404291 BugFix:ReplicationPolicy always fails when teamSize is 1
Be careful whenever you use the selectReplicas function, because it may have bugs!
The bug here is that it always returns false (not able to find candidates)
when the storage team size is 1. This is wrong because when storage team size
is 1, selectReplicas should return an empty result.
2019-06-27 13:47:49 -07:00
A.J. Beamon 35b6277a50 Fix knob copy paste error 2019-06-27 12:55:39 -07:00
mpilman 7bfda1faaa Fixed three more Windows issues
This is now compiling on my Windows machine
2019-06-27 11:39:36 -07:00
Meng Xu 90c158984c TeamCollection:Add extra trace events 2019-06-27 11:27:29 -07:00
Meng Xu aaf97542e9 TeamCollectionTest: Update unit test 2019-06-27 11:27:29 -07:00
Meng Xu 53324e4db7 TeamCollectionInfo: clang format 2019-06-27 11:27:29 -07:00
Meng Xu cc6a0e9bcd TeamCollectionTest:Do not enforce minServerTeamOnServer larger than 0
In ConfigureTest, one server may be left with 0 server teams, even if
we call buildTeams in the storageServerTracker.
2019-06-27 11:27:29 -07:00
Meng Xu c23d89c98a TeamCollection:Only count healthy teams for a server
When team collection adds new server teams, it picks a team with
the least number of teams. We should only consider the healthy teams
because the unhealthy ones will not be useful.
2019-06-27 11:27:29 -07:00
Meng Xu 02cdcc0b0c TeamCollectionTest: Only ensure each server and machine have a team 2019-06-27 11:27:29 -07:00
Meng Xu e1d459075a TeamCollection:Count healthy machine teams only
Team collection should prioritize building machine teams for the machine
that has the least number of healthy machine teams, instead of just
machine teams, because an unhealthy machine team will not be able to
produce more server teams.
2019-06-27 11:27:29 -07:00
Meng Xu ee916b337d TeamCollection:Change the target team number to build
When team collection (TC) builds server teams and machine teams,
it needs to build enough teams such that each server and machine has
DESIRED_TEAMS_PER_SERVER server teams and machine teams.

This change calculates the number of teams (server teams and machine teams)
needed to reach that count for each server and machine.
2019-06-27 11:16:44 -07:00
Meng Xu 21664742a6 TeamCollection:Desired team number may be larger than the max possible team number
For example, we have 3 servers for replica factor 3. We can have only 1 team
but the desired team number is 3 times 5 equal to 15.

Instead of sanity checking the absolute team number per server, we check
the difference between the minServerTeamOnServer and maxServerTeamOnServer.
2019-06-27 11:15:06 -07:00
Meng Xu 08f28e99f9 TeamCollection:Test no server or machine has incorrect team number
Add a check to the simulation tests which makes sure the server team number
per server is no less than the desired_teams_per_server defined
in knobs and no larger than the max_teams_per_server.

Add a similar check for the number of machine teams per machine as well.
2019-06-27 11:15:06 -07:00
A.J. Beamon 7f23814841 Track run loop busyness and report it in status. 2019-06-26 14:03:02 -07:00
Alex Miller 83fae6cc15 Fix ExternalWorkload not being a part of the old build/test system. 2019-06-25 21:42:35 -07:00
Alex Miller b5af601a8a Fix ExternalWorkload not being a part of the old build/test system. 2019-06-25 21:41:43 -07:00
sramamoorthy 0a94f96dee sev40 if knownCommittedVersion > recoveryVersion 2019-06-25 16:17:45 -07:00
Alex Miller bf883d7055 Merge remote-tracking branch 'upstream/master' into flowlock-api 2019-06-25 14:26:50 -07:00
Evan Tschannen 0fe6edc254
Merge pull request from mpilman/features/external-workload
Features/external workload
2019-06-25 13:53:19 -07:00
Evan Tschannen c913aafc1c
Merge pull request from bnamasivayam/address-comma-separate-list
Make public address and listen address a comma separated list
2019-06-25 13:52:16 -07:00
Alex Miller 7a500cd37f A giant translation of TaskFooPriority -> TaskPriority::Foo
This is so that APIs that take priorities don't take ints, which are
common and make it easy to accidentally pass the wrong thing.
2019-06-25 02:47:35 -07:00
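The motivation given in 7a500cd37f can be sketched in a few lines of C++. The enumerator names and values below, and the `delayBucket` function, are hypothetical and chosen only for illustration; they are not the real FoundationDB priorities. The point is that a scoped enum does not convert implicitly from `int`, so a call site that passes a bare number stops compiling.

```cpp
#include <cassert>
#include <type_traits>

// Hypothetical priorities and values, for illustration only.
enum class TaskPriority { Low = 1000, DefaultEndpoint = 5000, High = 9000 };

// An API that takes the scoped enum instead of a plain int.
inline int delayBucket(TaskPriority priority) {
    return static_cast<int>(priority) / 1000;
}

// A plain int no longer converts implicitly, so delayBucket(5000)
// would be rejected by the compiler.
static_assert(!std::is_convertible<int, TaskPriority>::value,
              "ints must be cast explicitly");

inline void demonstrateTypedPriority() {
    assert(delayBucket(TaskPriority::High) == 9);
    assert(delayBucket(TaskPriority::Low) <
           delayBucket(TaskPriority::DefaultEndpoint));
}
```

Callers that genuinely need the numeric value can still obtain it with an explicit `static_cast<int>`, which keeps such conversions visible in review.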
Stephen Atherton f1f1081202 Merge branch 'master' of github.com:apple/foundationdb into feature-redwood
# Conflicts:
#	fdbserver/VersionedBTree.actor.cpp
2019-06-24 20:17:49 -07:00
Evan Tschannen 76ba4e60b7 fixed a stack overflow bug 2019-06-24 13:03:35 -07:00
sramamoorthy 212136d024 SnapTest to handle retries for exec txns 2019-06-24 10:22:42 -07:00
Stephen Atherton 112b0918c9 Refactored set() speed test to produce random sets of consecutive records with random prefixes that will often share common bytes. 2019-06-24 01:05:16 -07:00
Alec Grieser e8c75505d3
Merge pull request from jzhou77/db-option
Add transaction size option
2019-06-21 08:25:34 -07:00
Balachandar Namasivayam 5ce45a8a2d Addressed review comments. 2019-06-20 23:03:49 -07:00
Balachandar Namasivayam 7489f83a7f Disable/Re-enable consistency check through a database key.
fdbcli has a new command 'consistencycheck' to disable/re-enable consistency check.
cluster_healthy metric in status becomes false if consistencycheck is disabled.
2019-06-20 21:38:45 -07:00
Evan Tschannen 1c005d5878
Merge pull request from alexmiller-apple/spilled-only-peek
Save TLog resources by letting peek request only spilled data.
2019-06-20 18:22:31 -07:00
Alex Miller 26343f557a Update getMore() contract.
MultiCursor already did this.
2019-06-20 17:48:24 -07:00
Evan Tschannen 37c1df2491
Merge pull request from bnamasivayam/suspend-process
Extend RebootRequest API to include time to suspend the process befor…
2019-06-20 17:36:25 -07:00
Evan Tschannen 460af91913
Merge pull request from alexmiller-apple/dd-failure-time
Increase how long FDB will wait before starting DD to repair data loss.
2019-06-20 17:33:16 -07:00
Jingyu Zhou 357c9ba0fb Refactor code 2019-06-19 20:41:53 -07:00
Evan Tschannen e0be631414 shard the txs tag so that more transaction logs are involved in its recovery 2019-06-19 18:15:09 -07:00
Alex Miller df0baa0066
Merge pull request from mpilman/features/protocol-version
Make protocol version a type
2019-06-19 13:46:35 -07:00
Alex Miller 61901effed Increase how long FDB will wait before starting DD to repair data loss.
10s is a bit short for starting data distribution, which is rather
expensive.  60s is a bit more reasonable.
2019-06-19 13:40:21 -07:00
mpilman ab7562160c Made JavaWorkload an external workload 2019-06-19 13:03:41 -07:00
mpilman 2eff2b7e21 First simple test is working (but very buggy) 2019-06-19 13:03:41 -07:00
mpilman 1707f068e0 started implementation first c workload 2019-06-19 13:03:41 -07:00
mpilman c8957d93f8 Implementation code complete 2019-06-19 13:03:41 -07:00
Alex Miller ce24db3c53 Fully consume parallelPeekMore results before switching back. 2019-06-19 01:30:49 -07:00
Balachandar Namasivayam 4832404c85 Make public address and listen address a comma separated list 2019-06-18 18:15:15 -07:00
mpilman 68ce9a5e75 ProtocolVersion type - second try 2019-06-18 17:55:27 -07:00
Alex Miller 51fd42a4d2 Merge remote-tracking branch 'upstream/master' into spilled-only-peek 2019-06-18 17:33:52 -07:00
Alex Miller 4fa5dc0502 Merge remote-tracking branch 'upstream/master' into cloexec 2019-06-18 16:35:18 -07:00
mpilman 8576665a90 Revert "Revert "Make protocol version a type""
This reverts commit 455bf3b3ec.
2019-06-18 14:49:04 -07:00
Alex Miller 455bf3b3ec Revert "Make protocol version a type" 2019-06-18 10:59:17 -07:00
A.J. Beamon c3aa5819f2
Merge pull request from mpilman/features/client-buggify
Overall framework and first buggify entries
2019-06-18 09:10:11 -07:00
Stephen Atherton d4b7f9b606 Fixed some cmake, compile, and IDE warnings. 2019-06-17 18:55:49 -07:00
Steve Atherton ba52623637
Merge pull request from tclinken/features/sqlite-crc32c
Use crc32 for sqlite page checksums
2019-06-17 14:20:41 -07:00
mpilman da53a92bec Make protocol version a type
This fixes 

The basic idea is that ProtocolVersion is now its own type. This
alone is an improvement as it makes many things more typesafe. For
each version, we can now add breaking features (for example Fearless).
After that, there's no need to test against actual (confusing) version
numbers. Instead a developer can simply test
`protocolVersion->hasFearless()` and this will return true iff the
protocolVersion is newer than the newest version that didn't support
fearless.
2019-06-16 09:59:15 -07:00
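The idea described in da53a92bec can be sketched as follows. This is a minimal illustration, not FoundationDB's actual ProtocolVersion class: the constant `kLastVersionWithoutFearless` and its value are made up, and only the `hasFearless()` name comes from the commit message. The feature check compares against the newest version that lacked the feature, so callers never see the raw number.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical value: the newest protocol version WITHOUT the feature.
constexpr uint64_t kLastVersionWithoutFearless = 100;

// Sketch only; wraps the raw number so it cannot be mixed up with other integers.
class ProtocolVersion {
public:
    constexpr explicit ProtocolVersion(uint64_t version) : version_(version) {}

    constexpr uint64_t version() const { return version_; }

    // True iff this version is newer than the newest version lacking the feature.
    constexpr bool hasFearless() const {
        return version_ > kLastVersionWithoutFearless;
    }

private:
    uint64_t version_;
};

inline void demonstrateFeatureCheck() {
    assert(!ProtocolVersion(100).hasFearless());
    assert(ProtocolVersion(101).hasFearless());
}
```

Because the constructor is `explicit`, a bare `uint64_t` cannot be passed where a `ProtocolVersion` is expected, which is one way to get the type safety the commit message mentions.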
mpilman 6ea75713cb Overall framework and first buggify entries 2019-06-16 09:09:09 -07:00
Evan Tschannen 20e3edeb0a Merge branch 'release-6.1'
# Conflicts:
#	documentation/sphinx/source/release-notes.rst
#	fdbserver/storageserver.actor.cpp
#	versions.target
2019-06-14 12:42:59 -07:00
Balachandar Namasivayam 5eb833759e Extend RebootRequest API to include time to suspend the process before reboot. This is intended to be used for testing purposes to simulate failures. 2019-06-14 11:35:38 -07:00
Evan Tschannen 6ececa94ce
Merge pull request from vishesh/task/client-failmon
Clients will no longer get failure monitoring info from cluster controller
2019-06-13 17:31:17 -07:00
A.J. Beamon fddcf3486c
Merge pull request from etschannen/increase_idle_delay
Increase idle delay
2019-06-13 16:34:22 -07:00
A.J. Beamon aad79aae49
Merge pull request from senthil-ram/boostwindowsmac
disable boost::process code for windows and mac
2019-06-13 16:12:40 -07:00
Evan Tschannen 924f92e5aa Prevent the byte sample recovery from interfering with storage server recovery 2019-06-13 15:55:25 -07:00
sramamoorthy 1d1d42c8af disable boost::process code for windows and mac 2019-06-13 15:43:03 -07:00
Evan Tschannen b2a5d4fd0d Merge branch 'master' into increase_idle_delay 2019-06-13 15:23:18 -07:00
A.J. Beamon e45c13358e
Merge pull request from etschannen/master
Fixed a number of correctness problems
2019-06-13 15:11:16 -07:00
Evan Tschannen 054d775343 increase the delay between idle commits to reduce the rate idle clusters fsync 2019-06-13 14:55:37 -07:00
Evan Tschannen 55f7e7d372 fix: The delay inside the disabledMap was causing the storage server updateStorage actor to run on the client process 2019-06-13 14:28:30 -07:00
A.J. Beamon 3dd2479193 Try avoiding use of boost in FDBExecHelper 2019-06-13 13:09:29 -07:00
Evan Tschannen dccb9bc26d fixed a number of correctness problems 2019-06-12 19:40:50 -07:00
Trevor Clinkenbeard 1e8f7e5b82 Refactor NextFastAllocatedSize to be constexpr function 2019-06-11 15:55:23 -07:00
Trevor Clinkenbeard cb420ea4bd Only construct waitDescription in simulator 2019-06-11 12:43:39 -07:00
Trevor Clinkenbeard 8144882d7b Merge branch 'apple-master' into features/local-rk 2019-06-10 19:40:25 -07:00
Trevor Clinkenbeard 46b77819aa Fixed LocalRatekeeper test 2019-06-10 18:25:58 -07:00
Vishesh Yadav a8e408e268 run clang-format on changes 2019-06-10 14:10:24 -07:00
Vishesh Yadav 6fa7081a21 net: Don't make FailureMonitoring requests from client
This patch removes the need for clients to continuously contact
cluster coordinator for failure monitoring information. Instead, it
uses the FlowTransport to monitor the statuses of peers and update
FailureMonitor accordingly.
2019-06-09 00:43:38 -07:00
Vishesh Yadav 6b4d30c3ae failmon: Identify client vs server when starting failure monitoring client 2019-06-09 00:43:12 -07:00
Evan Tschannen 5bdf5aaeb6
Merge pull request from etschannen/master
Merge 6.1 into master
2019-06-06 13:57:34 -07:00
Stephen Atherton 100789b354 More bug fixes in handling upperBound changes in modified pages and worst-case delta size calculation. Normalized some formatting in debug statements. Fixed compile error on linux. Updated test specs. 2019-06-05 20:58:47 -07:00
Trevor Clinkenbeard 8dbb231f33 Don't reject read requests until the storage server durability lag gets large enough 2019-06-05 15:42:58 -07:00
Trevor Clinkenbeard d1d98f298a Changed storage server getPenalty calculation.
Penalty should always be >= 1.0
2019-06-05 14:14:40 -07:00
chaoguang 877a59fab9 add in fdbserver.vcxproj.filters 2019-06-04 15:58:17 -07:00
Stephen Atherton 6aad34620d Bug fix in upper boundary selection in commitSubtree(). More debug output. 2019-06-04 04:55:09 -07:00
Stephen Atherton 653440d54c Changes and bug fixes in how boundary keys are modified during clears in internal pages by rewriting how internal pages are modified, making edge cases much easier to handle. Several debug output improvements. Page numbers stored on disk are now big endian. 2019-06-04 04:03:52 -07:00
Evan Tschannen 29b96414e2 Merge branch 'release-6.1'
# Conflicts:
#	documentation/sphinx/source/release-notes.rst
#	fdbclient/NativeAPI.actor.cpp
#	fdbserver/Coordination.actor.cpp
#	flow/Arena.h
#	versions.target
2019-06-03 18:49:35 -07:00
chaoguang 66811b7bd2 update to latest version 2019-06-03 16:49:19 -07:00
chaoguang 3055376b45 remove static keyword to make variables not in binary 2019-06-03 16:40:34 -07:00
Parallels 773f52d0a1 Merge remote-tracking branch 'upstream/master' into cloexec 2019-06-03 15:43:32 -07:00
A.J. Beamon bb22ee7d37
Merge pull request from etschannen/feature-coordinator-bug
The coordinators did not always converge on the same leader
2019-06-03 15:04:25 -07:00
A.J. Beamon 773bce9e32
Merge pull request from etschannen/feature-cc-mem-leak
Fixed a memory leak on the cluster controller
2019-06-03 15:02:36 -07:00
Meng Xu dc59f63d0e TraceEvent:First letter must be capitalized 2019-06-03 13:27:18 -07:00
chaoguang ac2c0f38b7 remove inheritance from KVWorkload 2019-06-02 23:16:39 -07:00
chaoguang d07c46e3f3 fix issues by comments 2019-05-31 00:44:07 -07:00
chaoguang 66d25cef21 fix issues by comments 2019-05-31 00:27:30 -07:00
Evan Tschannen b830fa4c84 fix: A minority of coordinators could continue choosing a candidate which was not the leader 2019-05-30 17:25:20 -07:00
Stephen Atherton 9f064ad7cf Added back minimal btree internal page boundaries using RedwoodRecordRef. 2019-05-30 02:10:07 -07:00
Stephen Atherton 098ac46af9 RedwoodRecordRef::deltaSize() now calculates actual delta size instead of a conservative estimate. 2019-05-29 18:06:11 -07:00
Stephen Atherton 3e155a2563 Bug fixes. 2019-05-29 17:38:55 -07:00
Evan Tschannen 7c333dbc16 If a process receives a message in its clusterControllerInterface before becoming the cluster controller, if the process does not become the cluster controller in the next minute it should destroy the interface to prevent a memory leak. 2019-05-29 16:57:13 -07:00
Stephen Atherton cedcfcddd0 Bug fix in RedwoodRecordRef::Delta var int writer, new tests. 2019-05-29 16:47:53 -07:00
Stephen Atherton 1e5b9faa11 Bug fixes in RedwoodRecordRef::Delta. 2019-05-29 16:26:58 -07:00
Evan Tschannen 362c2bf1e6 improved the cpu efficiency of printable 2019-05-29 14:55:45 -07:00
Stephen Atherton 02882dbf00 Checkpointing progress, RedwoodRecordRef and DeltaTree tests pass but BTree test does not. RedwoodRecordRef::Delta rewritten to actually do prefix compression on key and integer fields. Added related unit tests and benchmarks. Some improvements to DeltaTree and requirements on its T and Delta types to avoid repeated common prefix discovery. 2019-05-29 06:23:32 -07:00
sramamoorthy 1190f2f33d rebased related changes 2019-05-28 22:07:46 -07:00
sramamoorthy 4bcb590f12 g_random -> deterministicRandom() 2019-05-28 22:07:46 -07:00
sramamoorthy b43c100e57 TLog bug fixes 2019-05-28 22:07:46 -07:00
sramamoorthy 42c551a996 handle isRestoring & BackupFailed not being set
restartInfo.in->BackupFailed and isRestoring may not be
set in all cases, handle the absence of them.
2019-05-28 22:07:46 -07:00
sramamoorthy 3877f87481 comment change in tLogCommit 2019-05-28 22:07:46 -07:00
sramamoorthy 2a68b28590 rebase related changes 2019-05-28 22:07:46 -07:00
sramamoorthy b17ad85497 exec op not supported when log_anti_quorum > 0 2019-05-28 22:07:46 -07:00
sramamoorthy 3aa848b8af minor bug in whitelist binary path testing 2019-05-28 22:07:46 -07:00
sramamoorthy c906da1f62 simulator: spawnProcess to wait for long duration
spawnProcess was waiting for 3 seconds and terminating
the child process for synchronous calls, but in the
simulator, this can lead to non-determinism, because
in some cases the command can run in <3 or >3 seconds.
The fix is to make the wait duration so long that the call
has to wait synchronously and either get the results
or let the test time out.
2019-05-28 22:07:46 -07:00
sramamoorthy 31b6c86650 ignorePopDeadline to have high limit in simulator
- ignorePopDeadline to have higher limit in simulator
to accommodate the buggify delays and make snapshot succeed.

- introduce a new knob for auto resetting the disabling of tlog pop
2019-05-28 22:07:46 -07:00
sramamoorthy 40358e1dd6 limit of getRange in snapTest reduced
With CLIENT_KNOBS->TOO_MANY in snapTest, by the time getRange
gathers all the results, the storage server's oldest version has
gone past the req->version and hence the transaction fails with
transaction_too_old
2019-05-28 22:07:46 -07:00
sramamoorthy b1b96946af logData->stop check right after execOpHold wait 2019-05-28 22:07:46 -07:00
sramamoorthy 5749e220bd use FlowLock for implementing critical section
Instead of using Promises and Futures to implement
a critical section, use FlowLock
2019-05-28 22:07:46 -07:00
sramamoorthy e6c0b87a4d remove unused variable 2019-05-28 22:07:46 -07:00
sramamoorthy b56d8e648f bp::child->wait_for does not give correct err code
boost::process::child->wait_for does not give the error code
from the process being run. Re-arrange the code to work-around
it.
2019-05-28 22:07:46 -07:00
sramamoorthy f27a40f118 execProcessingHelper made synchronous
tLogCommit expects no blocking between duplicate check and
setting of the new version, that constraint was broken
when synchronous execProcessingHelper was introduced.
As a fix, execProcessingHelper was made asynchronous.
2019-05-28 22:07:46 -07:00
sramamoorthy ceac68c990 restore - remove empty snapdir, snap loop retry fix
- remove partially snapped directories to avoid no cluster file assert
- snap create to retry max 3 times for not_fully_recovered and keep
  retrying for the other failures
2019-05-28 22:07:46 -07:00
sramamoorthy d3a179b6f9 Multiple bug fixes
- wait for snapTLogFailKeys in a loop, otherwise in some race
  condition it can cause a false assert
- in single region, there does not seem to be a guarantee of
  tagLocalityListKey for a given DC ID, avoiding that assert for now
- to find the workers that are coordinators, looking up by primary
  address is not sufficient in some cases, hence looking by both
  primary and secondary address
- test make files to reflect the location of the new test cases
2019-05-28 22:07:46 -07:00