Jingyu Zhou
b69d7adabc
Remove unused remoteRecovered from master server
2019-07-01 15:41:35 -07:00
Alex Miller
23de5b64ad
Memory storage engine to use crc32c DiskQueue by default (in 6.2).
2019-07-01 13:38:06 -07:00
Meng Xu
b8cb883040
AddBestMachineTeams:Fix input must be non-negative value
2019-06-28 22:46:16 -07:00
Evan Tschannen
4e45a58750
fix: forced recovery did not copy the number of txsTags properly
2019-06-28 20:51:16 -07:00
Evan Tschannen
2c40c818cf
fix: txsTags was not copied into oldLogData
2019-06-28 17:51:16 -07:00
Alex Miller
8e1ab6e7db
Merge remote-tracking branch 'upstream/master' into flowlock-api
2019-06-28 17:32:54 -07:00
Evan Tschannen
5041ff38b1
removed unneeded description
2019-06-28 16:54:22 -07:00
Evan Tschannen
a124fc6e8a
fixed compiler error
2019-06-28 16:54:22 -07:00
Evan Tschannen
b9a6271375
local ratekeeper no longer globally limits
2019-06-28 16:54:22 -07:00
Evan Tschannen
4cef1d3937
Experimental change of storage write priority
2019-06-28 16:54:22 -07:00
Evan Tschannen
f539b5f09a
fix: a large targetRateRatio means limiting more
2019-06-28 16:54:22 -07:00
Evan Tschannen
18d5fbf1e0
Avoid jumping from rejecting 0% of requests directly to 20% of requests
2019-06-28 16:54:22 -07:00
Evan Tschannen
db413c37f7
restored the STORAGE_DURABILITY_LAG_SOFT_MAX knob and made the rk target slightly smaller than the soft limit, to avoid inaccuracies in ratekeeper control causing behavior changes on the storage servers
2019-06-28 16:54:22 -07:00
Evan Tschannen
ec16688db1
fixed the local ratekeeper workload to match the logic on the storage server
2019-06-28 16:54:22 -07:00
Evan Tschannen
a97940a10b
fixed compiler error
2019-06-28 16:54:22 -07:00
Evan Tschannen
92b32855ca
ratekeeper’s control algorithm would oscillate when limited by local ratekeeper
2019-06-28 16:54:22 -07:00
Evan Tschannen
1b939d5208
Merge pull request #1749 from satherton/feature-redwood
...
Update redwood storage engine to latest correctness-passing version
2019-06-28 16:22:06 -07:00
Meng Xu
63c42533eb
TraceTeamCollectionInfo:Remove delay
2019-06-28 16:19:58 -07:00
Meng Xu
875cb877ac
TeamCollection: Apply clang-format
2019-06-28 16:01:05 -07:00
Meng Xu
0baae134f6
TeamCollectionInfo: Resolve review comments
2019-06-28 15:59:47 -07:00
Evan Tschannen
cfce1e1705
fix: buffered peek cursor would advance very slowly through large ranges of empty versions
2019-06-28 15:54:08 -07:00
Evan Tschannen
7f4586ad49
the number of txsTags needs to be tracked separately from the number of transaction logs because of forced recoveries
2019-06-28 12:33:24 -07:00
Meng Xu
cb681693df
TeamCollection:Do NOT consider healthiness in counting team number
...
If a team is removed from DD, it will be marked as failed and eventually removed from the
global teams data structure.
Team healthiness is likely to be a temporary state which can change rather quickly.
2019-06-28 09:50:43 -07:00
Evan Tschannen
2113d6d01e
fix: peek all possible txsTags which could have been used by old log sets
2019-06-27 23:39:19 -07:00
Evan Tschannen
235697f688
fix: txsTags are not popped at the recovery version
2019-06-27 23:18:26 -07:00
Meng Xu
4da345f7d2
TeamCollectionTest:Remove test on minTeamOnServer
2019-06-27 19:05:10 -07:00
Meng Xu
ce7eb10cac
TeamCollectionInfo: Only count team number for healthy server and machine
2019-06-27 19:04:22 -07:00
Meng Xu
f889843332
Change traceTeamCollectionInfo to actor
...
There are cases where traceTeamCollectionInfo was called twice within the same execution block, i.e.,
with no wait between the two traceTeamCollectionInfo calls.
Because simulation uses the same time for all instructions in the same execution block,
having more than one traceTeamCollectionInfo at the same time breaks the trackLatest semantics.
When the simulator always chooses one of them, the simulation test reports a false-positive error.
Changing this function to an actor and adding a small delay inside the function solves this problem.
2019-06-27 18:24:20 -07:00
Meng Xu
4fe3c7f749
TeamCollectionInfo:Revert to original version where it is
2019-06-27 17:09:21 -07:00
Meng Xu
42620e4831
TeamCollectionTest:GetTeamCollectionValid wait until values are correct
2019-06-27 16:52:36 -07:00
Meng Xu
ee41311a54
TeamCollection:Call addTeamsBestOf when remainingTeamBudget is not 0
2019-06-27 15:29:26 -07:00
Evan Tschannen
52efcfd136
fix: properly create the right number for txsTags when changing between different numbers of logs
2019-06-27 15:15:05 -07:00
Meng Xu
8d5e848808
QuietDatabase test: Check each server has at least 1 team
2019-06-27 14:22:41 -07:00
Meng Xu
2993a96de8
TeamCollectionInfo: Remove debug trace and apply clang format
2019-06-27 14:15:51 -07:00
Meng Xu
5f5c404291
BugFix:ReplicationPolicy always fails when teamSize is 1
...
Be careful when using the selectReplicas function; it may have bugs!
The bug here is that it always returns false (not able to find candidates)
when the storage team size is 1. This is wrong because when the storage team size
is 1, selectReplicas should return an empty result.
2019-06-27 13:47:49 -07:00
A.J. Beamon
35b6277a50
Fix knob copy paste error
2019-06-27 12:55:39 -07:00
mpilman
7bfda1faaa
Fixed three more Windows issues
...
This is now compiling on my Windows machine
2019-06-27 11:39:36 -07:00
Meng Xu
90c158984c
TeamCollection:Add extra trace events
2019-06-27 11:27:29 -07:00
Meng Xu
aaf97542e9
TeamCollectionTest: Update unit test
2019-06-27 11:27:29 -07:00
Meng Xu
53324e4db7
TeamCollectionInfo: clang format
2019-06-27 11:27:29 -07:00
Meng Xu
cc6a0e9bcd
TeamCollectionTest:Do not enforce minServerTeamOnServer larger than 0
...
In ConfigureTest, one server may be left with 0 server teams, even if
we call buildTeams in the storageServerTracker.
2019-06-27 11:27:29 -07:00
Meng Xu
c23d89c98a
TeamCollection:Only count healthy teams for a server
...
When team collection adds new server teams, it picks the server with
the least number of teams. We should only consider the healthy teams
because the unhealthy ones will not be useful.
2019-06-27 11:27:29 -07:00
Meng Xu
02cdcc0b0c
TeamCollectionTest: Only ensure each server and machine have a team
2019-06-27 11:27:29 -07:00
Meng Xu
e1d459075a
TeamCollection:Count healthy machine teams only
...
Team collection should prioritize building machine teams for a machine
that has the least number of healthy machine teams, instead of just
machine teams, because an unhealthy machine team will not be able to
produce more server teams.
2019-06-27 11:27:29 -07:00
Meng Xu
ee916b337d
TeamCollection:Change the target team number to build
...
When team collection (TC) builds server teams and machine teams,
it needs to build enough teams such that each server and machine has
DESIRED_TEAMS_PER_SERVER server teams and machine teams.
This change calculates the number of teams (server teams and machine teams)
needed for each server and machine to reach that count.
2019-06-27 11:16:44 -07:00
Meng Xu
21664742a6
TeamCollection:Desired team number may be larger than the max possible team number
...
For example, we have 3 servers for replica factor 3. We can have only 1 team
but the desired team number is 3 times 5 equal to 15.
Instead of sanity checking the absolute team number per server, we check
the difference between the minServerTeamOnServer and maxServerTeamOnServer.
2019-06-27 11:15:06 -07:00
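The arithmetic in the example above can be sketched as follows. This is only an illustration: the helper names `maxPossibleTeams` and `desiredTeams` are hypothetical, and the upper bound assumes any combination of servers may form a team, ignoring machine and locality constraints.

```cpp
#include <cstdint>

// Upper bound on distinct teams: choose teamSize servers out of serverCount, C(n, k).
// Assumption: every combination is a valid team (no machine/locality restrictions).
inline uint64_t maxPossibleTeams(uint64_t serverCount, uint64_t teamSize) {
    if (teamSize > serverCount) return 0;
    uint64_t result = 1;
    for (uint64_t i = 1; i <= teamSize; ++i) {
        // Multiply before dividing; the intermediate product is always divisible by i.
        result = result * (serverCount - teamSize + i) / i;
    }
    return result;
}

// Desired team number as described in the commit: servers times teams per server.
inline uint64_t desiredTeams(uint64_t serverCount, uint64_t desiredTeamsPerServer) {
    return serverCount * desiredTeamsPerServer;
}
```

With 3 servers and replica factor 3, `maxPossibleTeams(3, 3)` is 1 while `desiredTeams(3, 5)` is 15, which is why an absolute per-server check cannot pass in that configuration.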
Meng Xu
08f28e99f9
TeamCollection:Test no server or machine has incorrect team number
...
Add a check to the simulation tests which makes sure the server team number
per server is no less than the desired_teams_per_server defined
in knobs and no larger than the max_teams_per_server.
Add a similar check for the machine team number per machine as well.
2019-06-27 11:15:06 -07:00
A.J. Beamon
7f23814841
Track run loop busyness and report it in status.
2019-06-26 14:03:02 -07:00
Alex Miller
83fae6cc15
Fix ExternalWorkload not being a part of the old build/test system.
2019-06-25 21:42:35 -07:00
Alex Miller
b5af601a8a
Fix ExternalWorkload not being a part of the old build/test system.
2019-06-25 21:41:43 -07:00
sramamoorthy
0a94f96dee
sev40 if knownCommittedVersion > recoveryVersion
2019-06-25 16:17:45 -07:00
Alex Miller
bf883d7055
Merge remote-tracking branch 'upstream/master' into flowlock-api
2019-06-25 14:26:50 -07:00
Evan Tschannen
0fe6edc254
Merge pull request #1678 from mpilman/features/external-workload
...
Features/external workload
2019-06-25 13:53:19 -07:00
Evan Tschannen
c913aafc1c
Merge pull request #1721 from bnamasivayam/address-comma-separate-list
...
Make public address and listen address a comma separated list
2019-06-25 13:52:16 -07:00
Alex Miller
7a500cd37f
A giant translation of TaskFooPriority -> TaskPriority::Foo
...
This is so that APIs that take priorities don't take ints, which are
common and easy to accidentally pass the wrong thing.
2019-06-25 02:47:35 -07:00
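A minimal sketch of why a scoped enum helps here. The priority names, numeric values, and the `scheduleAt` function are hypothetical and are not FoundationDB's actual definitions; the only point is that an `enum class` does not convert implicitly to or from `int`.

```cpp
#include <type_traits>

// Hypothetical priorities; the real names and values live in the FoundationDB source.
enum class TaskPriority : int {
    Low = 1000,
    DefaultDelay = 7010,
    High = 9000,
};

// Hypothetical API: it takes a TaskPriority, so passing a stray int fails to compile.
inline int scheduleAt(double seconds, TaskPriority priority) {
    (void)seconds;
    // Converting back to an int has to be spelled out explicitly.
    return static_cast<int>(priority);
}

// A plain int does not implicitly convert to the scoped enum...
static_assert(!std::is_convertible<int, TaskPriority>::value, "int must not convert implicitly");
// ...and the enum does not silently decay to an int either.
static_assert(!std::is_convertible<TaskPriority, int>::value, "enum must not decay to int");
```

A call such as `scheduleAt(0.5, 9000)` would be rejected by the compiler, whereas `scheduleAt(0.5, TaskPriority::High)` is accepted.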
Stephen Atherton
f1f1081202
Merge branch 'master' of github.com:apple/foundationdb into feature-redwood
...
# Conflicts:
# fdbserver/VersionedBTree.actor.cpp
2019-06-24 20:17:49 -07:00
Evan Tschannen
76ba4e60b7
fixed a stack overflow bug
2019-06-24 13:03:35 -07:00
sramamoorthy
212136d024
SnapTest to handle retries for exec txns
2019-06-24 10:22:42 -07:00
Stephen Atherton
112b0918c9
Refactored set() speed test to produce random sets of consecutive records with random prefixes that will often share common bytes.
2019-06-24 01:05:16 -07:00
Alec Grieser
e8c75505d3
Merge pull request #1725 from jzhou77/db-option
...
Add transaction size option
2019-06-21 08:25:34 -07:00
Balachandar Namasivayam
5ce45a8a2d
Addressed review comments.
2019-06-20 23:03:49 -07:00
Balachandar Namasivayam
7489f83a7f
Disable/Re-enable consistency check through a database key.
...
fdbcli has a new command 'consistencycheck' to disable/re-enable consistency check.
cluster_healthy metric in status becomes false if consistencycheck is disabled.
2019-06-20 21:38:45 -07:00
Evan Tschannen
1c005d5878
Merge pull request #1584 from alexmiller-apple/spilled-only-peek
...
Save TLog resources by letting peek request only spilled data.
2019-06-20 18:22:31 -07:00
Alex Miller
26343f557a
Update getMore() contract.
...
MultiCursor already did this.
2019-06-20 17:48:24 -07:00
Evan Tschannen
37c1df2491
Merge pull request #1705 from bnamasivayam/suspend-process
...
Extend RebootRequest API to include time to suspend the process befor…
2019-06-20 17:36:25 -07:00
Evan Tschannen
460af91913
Merge pull request #1727 from alexmiller-apple/dd-failure-time
...
Increase how long FDB will wait before starting DD to repair data loss.
2019-06-20 17:33:16 -07:00
Jingyu Zhou
357c9ba0fb
Refactor code
2019-06-19 20:41:53 -07:00
Evan Tschannen
e0be631414
shard the txs tag so that more transaction logs are involved in its recovery
2019-06-19 18:15:09 -07:00
Alex Miller
df0baa0066
Merge pull request #1720 from mpilman/features/protocol-version
...
Make protocol version a type
2019-06-19 13:46:35 -07:00
Alex Miller
61901effed
Increase how long FDB will wait before starting DD to repair data loss.
...
10s is a bit short for starting data distribution, which is rather
expensive. 60s is a bit more reasonable.
2019-06-19 13:40:21 -07:00
mpilman
ab7562160c
Made JavaWorkload an external workload
2019-06-19 13:03:41 -07:00
mpilman
2eff2b7e21
First simple test is working (but very buggy)
2019-06-19 13:03:41 -07:00
mpilman
1707f068e0
started implementation first c workload
2019-06-19 13:03:41 -07:00
mpilman
c8957d93f8
Implementation code complete
2019-06-19 13:03:41 -07:00
Alex Miller
ce24db3c53
Fully consume parallelPeekMore results before switching back.
2019-06-19 01:30:49 -07:00
Balachandar Namasivayam
4832404c85
Make public address and listen address a comma separated list
2019-06-18 18:15:15 -07:00
mpilman
68ce9a5e75
ProtocolVersion type - second try
2019-06-18 17:55:27 -07:00
Alex Miller
51fd42a4d2
Merge remote-tracking branch 'upstream/master' into spilled-only-peek
2019-06-18 17:33:52 -07:00
Alex Miller
4fa5dc0502
Merge remote-tracking branch 'upstream/master' into cloexec
2019-06-18 16:35:18 -07:00
mpilman
8576665a90
Revert "Revert "Make protocol version a type""
...
This reverts commit 455bf3b3ec.
2019-06-18 14:49:04 -07:00
Alex Miller
455bf3b3ec
Revert "Make protocol version a type"
2019-06-18 10:59:17 -07:00
A.J. Beamon
c3aa5819f2
Merge pull request #1417 from mpilman/features/client-buggify
...
Overall framework and first buggify entries
2019-06-18 09:10:11 -07:00
Stephen Atherton
d4b7f9b606
Fixed some cmake, compile, and IDE warnings.
2019-06-17 18:55:49 -07:00
Steve Atherton
ba52623637
Merge pull request #1582 from tclinken/features/sqlite-crc32c
...
Use crc32 for sqlite page checksums
2019-06-17 14:20:41 -07:00
mpilman
da53a92bec
Make protocol version a type
...
This fixes #1214
The basic idea is that ProtocolVersion is now its own type. This
alone is an improvement as it makes many things more typesafe. For
each version, we can now add breaking features (for example Fearless).
After that, there's no need to test against actual (confusing) version
numbers. Instead a developer can simply test
`protocolVersion->hasFearless()` and this will return true iff the
protocolVersion is newer than the newest version that didn't support
fearless.
2019-06-16 09:59:15 -07:00
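The idea of feature checks on a version type might look roughly like the sketch below. The constant `kLastVersionWithoutFearless` and its value are made up for illustration, and the class layout is an assumption; only the `hasFearless()` name comes from the commit message.

```cpp
#include <cstdint>

// Made-up value: the newest protocol version that did NOT support the feature.
constexpr uint64_t kLastVersionWithoutFearless = 100;

// Sketch of a dedicated version type instead of a bare integer.
class ProtocolVersion {
    uint64_t version_;

public:
    constexpr explicit ProtocolVersion(uint64_t v) : version_(v) {}
    constexpr uint64_t version() const { return version_; }

    // True iff this version is newer than the newest version without the feature.
    constexpr bool hasFearless() const { return version_ > kLastVersionWithoutFearless; }
};
```

Callers then ask `v.hasFearless()` rather than comparing against a raw version number, and the `explicit` constructor keeps arbitrary integers from being passed where a version is expected.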
mpilman
6ea75713cb
Overall framework and first buggify entries
2019-06-16 09:09:09 -07:00
Evan Tschannen
20e3edeb0a
Merge branch 'release-6.1'
...
# Conflicts:
# documentation/sphinx/source/release-notes.rst
# fdbserver/storageserver.actor.cpp
# versions.target
2019-06-14 12:42:59 -07:00
Balachandar Namasivayam
5eb833759e
Extend RebootRequest API to include time to suspend the process before reboot. This is intended to be used for testing purposes to simulate failures.
2019-06-14 11:35:38 -07:00
Evan Tschannen
6ececa94ce
Merge pull request #1640 from vishesh/task/client-failmon
...
Clients will no longer get failure monitoring info from cluster controller
2019-06-13 17:31:17 -07:00
A.J. Beamon
fddcf3486c
Merge pull request #1697 from etschannen/increase_idle_delay
...
Increase idle delay
2019-06-13 16:34:22 -07:00
A.J. Beamon
aad79aae49
Merge pull request #1699 from senthil-ram/boostwindowsmac
...
disable boost::process code for windows and mac
2019-06-13 16:12:40 -07:00
Evan Tschannen
924f92e5aa
Prevent the byte sample recovery from interfering with storage server recovery
2019-06-13 15:55:25 -07:00
sramamoorthy
1d1d42c8af
disable boost::process code for windows and mac
2019-06-13 15:43:03 -07:00
Evan Tschannen
b2a5d4fd0d
Merge branch 'master' into increase_idle_delay
2019-06-13 15:23:18 -07:00
A.J. Beamon
e45c13358e
Merge pull request #1691 from etschannen/master
...
Fixed a number of correctness problems
2019-06-13 15:11:16 -07:00
Evan Tschannen
054d775343
increase the delay between idle commits to reduce the rate idle clusters fsync
2019-06-13 14:55:37 -07:00
Evan Tschannen
55f7e7d372
fix: The delay inside the disabledMap was causing the storage server updateStorage actor to run on the client process
2019-06-13 14:28:30 -07:00
A.J. Beamon
3dd2479193
Try avoiding use of boost in FDBExecHelper
2019-06-13 13:09:29 -07:00
Evan Tschannen
dccb9bc26d
fixed a number of correctness problems
2019-06-12 19:40:50 -07:00
Trevor Clinkenbeard
1e8f7e5b82
Refactor NextFastAllocatedSize to be constexpr function
2019-06-11 15:55:23 -07:00
Trevor Clinkenbeard
cb420ea4bd
Only construct waitDescription in simulator
2019-06-11 12:43:39 -07:00
Trevor Clinkenbeard
8144882d7b
Merge branch 'apple-master' into features/local-rk
2019-06-10 19:40:25 -07:00
Trevor Clinkenbeard
46b77819aa
Fixed LocalRatekeeper test
2019-06-10 18:25:58 -07:00
Vishesh Yadav
a8e408e268
run clang-format on changes
2019-06-10 14:10:24 -07:00
Vishesh Yadav
6fa7081a21
net: Don't make FailureMonitoring requests from client
...
This patch removes the need for clients to continuously contact
cluster coordinator for failure monitoring information. Instead, it
uses the FlowTransport to monitor the statuses of peers and update
FailureMonitor accordingly.
2019-06-09 00:43:38 -07:00
Vishesh Yadav
6b4d30c3ae
failmon: Identify client vs server when starting failure monitoring client
2019-06-09 00:43:12 -07:00
Evan Tschannen
5bdf5aaeb6
Merge pull request #1662 from etschannen/master
...
Merge 6.1 into master
2019-06-06 13:57:34 -07:00
Stephen Atherton
100789b354
More bug fixes in handling upperBound changes in modified pages and worst-case delta size calculation. Normalized some formatting in debug statements. Fixed compile error on linux. Updated test specs.
2019-06-05 20:58:47 -07:00
Trevor Clinkenbeard
8dbb231f33
Don't reject read requests until the storage server durability lag gets large enough
2019-06-05 15:42:58 -07:00
Trevor Clinkenbeard
d1d98f298a
Changed storage server getPenalty calculation.
...
Penalty should always be >= 1.0
2019-06-05 14:14:40 -07:00
chaoguang
877a59fab9
add in fdbserver.vcxproj.filters
2019-06-04 15:58:17 -07:00
Stephen Atherton
6aad34620d
Bug fix in upper boundary selection in commitSubtree(). More debug output.
2019-06-04 04:55:09 -07:00
Stephen Atherton
653440d54c
Changes and bug fixes in how boundary keys are modified during clears in internal pages by rewriting how internal pages are modified, making edge cases much easier to handle. Several debug output improvements. Page numbers stored on disk are now big endian.
2019-06-04 04:03:52 -07:00
Evan Tschannen
29b96414e2
Merge branch 'release-6.1'
...
# Conflicts:
# documentation/sphinx/source/release-notes.rst
# fdbclient/NativeAPI.actor.cpp
# fdbserver/Coordination.actor.cpp
# flow/Arena.h
# versions.target
2019-06-03 18:49:35 -07:00
chaoguang
66811b7bd2
update to latest version
2019-06-03 16:49:19 -07:00
chaoguang
3055376b45
remove static keyword to make variables not in binary
2019-06-03 16:40:34 -07:00
Parallels
773f52d0a1
Merge remote-tracking branch 'upstream/master' into cloexec
2019-06-03 15:43:32 -07:00
A.J. Beamon
bb22ee7d37
Merge pull request #1649 from etschannen/feature-coordinator-bug
...
The coordinators did not always converge on the same leader
2019-06-03 15:04:25 -07:00
A.J. Beamon
773bce9e32
Merge pull request #1643 from etschannen/feature-cc-mem-leak
...
Fixed a memory leak on the cluster controller
2019-06-03 15:02:36 -07:00
Meng Xu
dc59f63d0e
TraceEvent:First letter must be capitalized
2019-06-03 13:27:18 -07:00
chaoguang
ac2c0f38b7
remove inheritance from KVWorkload
2019-06-02 23:16:39 -07:00
chaoguang
d07c46e3f3
fix issues by comments
2019-05-31 00:44:07 -07:00
chaoguang
66d25cef21
fix issues by comments
2019-05-31 00:27:30 -07:00
Evan Tschannen
b830fa4c84
fix: A minority of coordinators could continue choosing a candidate which was not the leader
2019-05-30 17:25:20 -07:00
Stephen Atherton
9f064ad7cf
Added back minimal btree internal page boundaries using RedwoodRecordRef.
2019-05-30 02:10:07 -07:00
Stephen Atherton
098ac46af9
RedwoodRecordRef::deltaSize() now calculates actual delta size instead of a conservative estimate.
2019-05-29 18:06:11 -07:00
Stephen Atherton
3e155a2563
Bug fixes.
2019-05-29 17:38:55 -07:00
Evan Tschannen
7c333dbc16
If a process receives a message in its clusterControllerInterface before becoming the cluster controller, if the process does not become the cluster controller in the next minute it should destroy the interface to prevent a memory leak.
2019-05-29 16:57:13 -07:00
Stephen Atherton
cedcfcddd0
Bug fix in RedwoodRecordRef::Delta var int writer, new tests.
2019-05-29 16:47:53 -07:00
Stephen Atherton
1e5b9faa11
Bug fixes in RedwoodRecordRef::Delta.
2019-05-29 16:26:58 -07:00
Evan Tschannen
362c2bf1e6
improved the cpu efficiency of printable
2019-05-29 14:55:45 -07:00
Stephen Atherton
02882dbf00
Checkpointing progress, RedwoodRecordRef and DeltaTree tests pass but BTree test does not. RedwoodRecordRef::Delta rewritten to actually do prefix compression on key and integer fields. Added related unit tests and benchmarks. Some improvements to DeltaTree and requirements on its T and Delta types to avoid repeated common prefix discovery.
2019-05-29 06:23:32 -07:00
sramamoorthy
1190f2f33d
rebased related changes
2019-05-28 22:07:46 -07:00
sramamoorthy
4bcb590f12
g_random -> deterministicRandom()
2019-05-28 22:07:46 -07:00
sramamoorthy
b43c100e57
TLog bug fixes
2019-05-28 22:07:46 -07:00
sramamoorthy
42c551a996
handle isRestoring & BackupFailed not being set
...
restartInfo.ini->BackupFailed and isRestoring may not be
set in all cases; handle their absence.
2019-05-28 22:07:46 -07:00
sramamoorthy
3877f87481
comment change in tLogCommit
2019-05-28 22:07:46 -07:00
sramamoorthy
2a68b28590
rebase related changes
2019-05-28 22:07:46 -07:00
sramamoorthy
b17ad85497
exec op not supported when log_anti_quorum > 0
2019-05-28 22:07:46 -07:00
sramamoorthy
3aa848b8af
minor bug in whitelist binary path testing
2019-05-28 22:07:46 -07:00
sramamoorthy
c906da1f62
simulator: spawnProcess to wait for long duration
...
spawnProcess was waiting for 3 seconds and terminating
the child process for synchronous calls, but in the
simulator, this can lead to non-determinism, because
in some cases the command can run in <3 or >3 seconds.
The fix is to make the wait duration long enough
that the call has to synchronously wait and get
the results, or the test will time out.
2019-05-28 22:07:46 -07:00
sramamoorthy
31b6c86650
ignorePopDeadline to have high limit in simulator
...
- ignorePopDeadline to have higher limit in simulator
to accommodate the buggify delays and make snapshot succeed.
- introduce a new knob for auto resetting the disabling of tlog pop
2019-05-28 22:07:46 -07:00
sramamoorthy
40358e1dd6
limit of getRange in snapTest reduced
...
With CLIENT_KNOBS->TOO_MANY in snapTest, by the time getRange
gathers all the results, the storage server's oldest version has
gone past the req->version and hence the transaction fails with
transaction_too_old
2019-05-28 22:07:46 -07:00
sramamoorthy
b1b96946af
logData->stop check right after execOpHold wait
2019-05-28 22:07:46 -07:00
sramamoorthy
5749e220bd
use FlowLock for implementing critical section
...
Instead of using Promises and Futures to implement
a critical section, use FlowLock
2019-05-28 22:07:46 -07:00
sramamoorthy
e6c0b87a4d
remove unused variable
2019-05-28 22:07:46 -07:00
sramamoorthy
b56d8e648f
bp::child->wait_for does not give correct err code
...
boost::process::child->wait_for does not give the error code
from the process being run. Re-arrange the code to work-around
it.
2019-05-28 22:07:46 -07:00
sramamoorthy
f27a40f118
execProcessingHelper made synchronous
...
tLogCommit expects no blocking between the duplicate check and
the setting of the new version; that constraint was broken
when the asynchronous execProcessingHelper was introduced.
As a fix, execProcessingHelper was made synchronous.
2019-05-28 22:07:46 -07:00
sramamoorthy
ceac68c990
restore - remove empty snapdir,snap loop retry fix
...
- remove partially snapped directories to avoid no cluster file assert
- snap create to retry max 3 times for not_fully_recovered and keep
retrying for the other failures
2019-05-28 22:07:46 -07:00
sramamoorthy
d3a179b6f9
Multiple bug fixes
...
- wait for snapTLogFailKeys in a loop, otherwise in some race
condition it can cause a false assert
- in single region, there does not seem to be a guarantee of
tagLocalityListKey for a given DC ID, avoiding that assert for now
- to find the workers that are coordinators, looking up by primary
address is not sufficient in some cases, hence looking by both
primary and secondary address
- test make files to reflect the location of the new test cases
2019-05-28 22:07:46 -07:00
sramamoorthy
bb474dc323
if recovery < fully_recovered then fail the exec
...
Will do more cleanup, pushing it for a test run in CI
2019-05-28 22:07:46 -07:00
sramamoorthy
925499954b
New status cluster_not_fully_recovered
2019-05-28 22:07:46 -07:00
sramamoorthy
591ff96b93
increase retry and use eat instead of parsing
2019-05-28 22:07:46 -07:00
sramamoorthy
6f42337c09
TransactionNotPermitted instead of conflict error
...
When the cluster has not recovered completely, return op not
permitted instead of conflict error
2019-05-28 22:07:46 -07:00
sramamoorthy
dcd2d96751
make spawnProcess predictable in the simulator
2019-05-28 22:07:46 -07:00
sramamoorthy
4083af0b01
Avoid using trackLatest for TLog pop test cases
2019-05-28 22:07:46 -07:00
sramamoorthy
936ffc2dde
rebase related changes
2019-05-28 22:07:46 -07:00
sramamoorthy
ec7834e2f7
code re-organization and address comments
2019-05-28 22:07:46 -07:00
sramamoorthy
b6e037ffbc
Replace fork with boost::process::child
2019-05-28 22:07:46 -07:00
sramamoorthy
c76cc84ded
execute coordinators code reorganized
2019-05-28 22:07:46 -07:00
sramamoorthy
e91c76834e
tlog: move snap create part to independent funcs
2019-05-28 22:07:46 -07:00
sramamoorthy
61e93a9304
Address review comments and minor fixes
2019-05-28 22:07:46 -07:00
sramamoorthy
9e3104c2d4
Fix: races in async exec leading to bad backup
2019-05-28 22:07:46 -07:00
sramamoorthy
858604b51d
minor cleanups to SnapTest
2019-05-28 22:07:46 -07:00
sramamoorthy
00ccee8a6c
workaround for log giving remote log and others
...
logSystemConfig.allLocalLogs() sometimes returns remote TLog interface
and a workaround is implemented here. Other minor cleanup.
2019-05-28 22:07:46 -07:00
sramamoorthy
090bb53034
ShardInfo::addMutation to handle exec mutation
2019-05-28 22:07:46 -07:00
sramamoorthy
cfdad0c5e6
tlog to snapshot exactly at exec version
2019-05-28 22:07:46 -07:00
sramamoorthy
89b7a052f5
Bug fixes for snapping coordinators
2019-05-28 22:07:46 -07:00
sramamoorthy
539e65efad
Skip parsing mutations if it is tagged for TxsTag
...
In the TLog, if a mutation is targeted at TxsTag, skip
parsing it.
2019-05-28 22:07:46 -07:00
sramamoorthy
17ecba8313
trace cleanup and other indentation changes
2019-05-28 22:07:46 -07:00
sramamoorthy
898bed66c1
Allow only whitelisted binary path for exec op
2019-05-28 22:07:46 -07:00
sramamoorthy
aa79480d69
changes to make fdbfork asynchronous
2019-05-28 22:07:46 -07:00
sramamoorthy
c4d27ac9d2
bug fixes in SnapTest
...
Earlier the test was checking for the following condition:
durable version of storage > min version of tlog, but the
check has been modified to:
durable version of storage >= min version of tlog - 1.
Ensure that the pre-snap validate keys are exactly 1000 in
the case of commit retries.
2019-05-28 22:07:46 -07:00
sramamoorthy
d282016f93
Exec op to tag only local storage nodes
2019-05-28 22:07:46 -07:00
sramamoorthy
a60145b9a1
Restore the cluster in single region configuration
2019-05-28 22:07:46 -07:00
sramamoorthy
382b246930
trace change and retain fitness file after restore
2019-05-28 22:07:46 -07:00
sramamoorthy
281c785f94
'--restoring' cmd line arg removed for fdbserver
...
'--restoring' command line option was introduced to indicate
simulated fdbserver to restore from snapshot and restart the cluster.
As part of this change that option is removed and restore
information is stored in the restartInfo.ini.
2019-05-28 22:07:46 -07:00
sramamoorthy
6431513ad0
Fail exec req until the cluster is fully_recovered
2019-05-28 22:07:46 -07:00
sramamoorthy
4016f16c76
Fix a few compilation errors and bugs in rebase
2019-05-28 22:07:46 -07:00
sramamoorthy
3d5998e9dd
tlog: when pops are disabled, store them & replay
...
In TLogs, pops are disabled while taking snapshots. Earlier, TLogs
ignored any pop requests received while pops were
disabled. With this change, instead of ignoring a pop, the TLog remembers
the list of pops in memory and replays them once popping is
enabled.
2019-05-28 22:07:46 -07:00
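The store-and-replay behavior described in this commit can be sketched as below. `PopTracker` and all of its members are hypothetical stand-ins, not the TLog's actual types, and as a simplification the sketch keeps only the highest deferred version per tag rather than a full list of pops.

```cpp
#include <cstdint>
#include <map>
#include <string>

// Hypothetical stand-in for a TLog's pop bookkeeping.
class PopTracker {
    bool popsDisabled_ = false;
    std::map<std::string, int64_t> deferred_; // tag -> highest pop version seen while disabled
    std::map<std::string, int64_t> popped_;   // tag -> version actually popped

    // Pops only move forward; an older version never undoes a newer one.
    void applyPop(const std::string& tag, int64_t version) {
        int64_t& current = popped_[tag];
        if (version > current) current = version;
    }

public:
    // Called when a snapshot starts.
    void disablePops() { popsDisabled_ = true; }

    void pop(const std::string& tag, int64_t version) {
        if (popsDisabled_) {
            // Remember the request instead of dropping it.
            int64_t& pending = deferred_[tag];
            if (version > pending) pending = version;
            return;
        }
        applyPop(tag, version);
    }

    // Called when the snapshot finishes: replay everything that was deferred.
    void enablePops() {
        popsDisabled_ = false;
        for (const auto& entry : deferred_) applyPop(entry.first, entry.second);
        deferred_.clear();
    }

    // Returns 0 for a tag that has never been popped.
    int64_t poppedVersion(const std::string& tag) const {
        auto it = popped_.find(tag);
        return it == popped_.end() ? 0 : it->second;
    }
};
```

A pop issued between `disablePops()` and `enablePops()` leaves `poppedVersion` unchanged until popping is re-enabled, at which point the deferred version takes effect.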
sramamoorthy
4bc4c615da
exec op to all tlog, restore change in test &other
...
- exec operation to go to all the TLogs
- minor bug fix in tlog
- restore implementation for the simulator
- restore snap UID to be stored in restartInfo.ini
- test cases added
- indentation and trace file fixes
2019-05-28 22:07:46 -07:00
sramamoorthy
72dd067173
Trace message changes and fix few FIXMEs
2019-05-28 22:07:46 -07:00
sramamoorthy
69edefe68b
Snapshot based backup and restore implementation
2019-05-28 22:07:46 -07:00
chaoguang
5350c2777a
change g_random to deterministicRandom()
2019-05-28 18:37:55 -07:00
chaoguang
a7920ef311
Merge branch 'master' of https://github.com/apple/foundationdb into MakoWorkload
2019-05-28 18:21:02 -07:00
chaoguang
7329466182
update comments, parameter names and descriptions
2019-05-28 15:43:41 -07:00
Trevor Clinkenbeard
53f8ba499c
Merge branch 'master' into features/sqlite-crc32c
2019-05-24 16:46:32 -07:00
A.J. Beamon
20d83d61db
Merge branch 'master' into thread-safe-random-number-generation
2019-05-23 11:07:08 -07:00
Evan Tschannen
b451c2cd56
Merge pull request #1497 from alexmiller-apple/fastrecovery
...
Add an \xff keyrange that is backed by the txnStateStore.
2019-05-23 10:52:35 -07:00
A.J. Beamon
f417e60264
Merge branch 'merge-release-6.1-into-master' into thread-safe-random-number-generation
...
# Conflicts:
# fdbserver/QuietDatabase.actor.cpp
2019-05-23 09:52:00 -07:00
A.J. Beamon
d29c7e4c9b
Merge branch 'release-6.1' into merge-release-6.1-into-master
...
# Conflicts:
# documentation/sphinx/source/release-notes.rst
# fdbserver/QuietDatabase.actor.cpp
# versions.target
2019-05-23 09:28:45 -07:00
A.J. Beamon
e5381e0612
Fix some new usages of g_random
2019-05-23 09:23:27 -07:00
A.J. Beamon
603721e125
Merge branch 'master' into thread-safe-random-number-generation
...
# Conflicts:
# fdbclient/ManagementAPI.actor.cpp
# fdbrpc/AsyncFileCached.actor.h
# fdbrpc/genericactors.actor.cpp
# fdbrpc/sim2.actor.cpp
# fdbserver/DiskQueue.actor.cpp
# fdbserver/workloads/BulkSetup.actor.h
# flow/ActorCollection.actor.cpp
# flow/Net2.actor.cpp
# flow/Trace.cpp
# flow/flow.cpp
2019-05-23 08:35:47 -07:00
Evan Tschannen
003cc6be18
fix: nothingPersistent could be incorrect when popped is equal to persistentDataVersion
2019-05-22 20:23:35 -10:00
chaoguang
c527b1a6b1
renaming function, add comments, fix bugs.
2019-05-22 17:39:36 -07:00
Evan Tschannen
4e12721227
fix: nothingPersistent could be incorrect when popped is equal to persistentDataVersion
2019-05-22 11:23:21 -07:00
Stephen Atherton
0fb8612ef5
debug_printf_noop() was incorrectly defined as a function, which still has a runtime cost of argument evaluation.
2019-05-22 03:40:18 -07:00
Stephen Atherton
f99c36aad2
Fixed merge mistake.
2019-05-22 00:23:31 -07:00
Stephen Atherton
ebc96a7e0e
Merge branch 'master' of github.com:apple/foundationdb into feature-redwood
...
# Conflicts:
# fdbserver/VersionedBTree.actor.cpp
2019-05-21 23:49:27 -07:00
Stephen Atherton
e9197a8f70
Added time limit.
2019-05-21 22:19:14 -07:00
Stephen Atherton
3f8fce0296
Checkpointing progress on single-version mode in VersionedBTree. Subtree clears now work, preserving internal page boundary keys when necessary. Multi-version mode is unfortunately now broken, in addition to being incomplete. Added serial and simple btree unit test options.
2019-05-21 19:16:32 -07:00
chaoguang
57968d9df7
Merge branch 'master' of https://github.com/apple/foundationdb into MakoWorkload
2019-05-21 16:24:11 -07:00
chaoguang
0bbcc75e4b
fix bug
2019-05-21 16:22:02 -07:00
Evan Tschannen
a686402671
Merge branch 'feature-pop-diskqueue' into feature-slow-storage-failure
2019-05-21 15:19:06 -07:00
Evan Tschannen
9604452e50
mistakenly changed a quiet database parameter
2019-05-21 15:17:46 -07:00
Evan Tschannen
90fe085696
fix: the healthyZone needs to be checked again once the timeout is expected to have elapsed
2019-05-21 13:49:16 -07:00
Evan Tschannen
a8e8be5aac
added a wait failure client which always waits the full failure reaction time, even if it knows the interface is never coming back
...
use this new wait failure client in data distribution, to give time for a storage server to rejoin the cluster after its interface fails
2019-05-21 11:54:17 -07:00
Evan Tschannen
f4b18f2c4f
fixed whitespace
2019-05-21 11:31:34 -07:00
Evan Tschannen
23091a7d96
fixed review comments
2019-05-21 10:53:36 -07:00
Evan Tschannen
ee04c583fa
fix: do not pop the disk queue past the persistentDataVersion
2019-05-21 10:40:30 -07:00
Evan Tschannen
4059d68348
fix: the tlog would not pop data from the disk queue after a storage server was removed, because the tag still exists in memory on the logs
...
fix: we could incorrectly make data durable if eraseMessagesFromMemory was in progress while running updatePersistentData
the quiet database check now ensures that tlogs have no more than 30 seconds of versions unpopped from the disk queue
2019-05-20 23:58:45 -07:00
chaoguang
12a51b2d39
fix bugs, update naming and comments, refine functions
2019-05-20 18:26:30 -07:00
Evan Tschannen
f4fbaac6b0
Merge branch 'release-6.1'
...
# Conflicts:
# documentation/sphinx/source/release-notes.rst
# versions.target
2019-05-19 10:27:59 -07:00
A.J. Beamon
a8b9d8e34b
Merge pull request #1336 from tclinken/fast-allocate-ptree-nodes
...
Create 96-byte fast allocator for storage queue PTree nodes
2019-05-17 14:22:46 -07:00
Steve Atherton
5a8c97480a
Merge pull request #1506 from nikolas-ioannou/feature-pagecache-lru
...
AsyncFileCached: switch from a random to an LRU cache eviction policy
2019-05-17 13:42:21 -07:00
Jingyu Zhou
b8e7fc1b84
Refactor: add std:: qualifier and use emplace_back
2019-05-17 09:38:50 -10:00
Trevor Clinkenbeard
12ff747e6a
Avoid tracing in PageChecksumCodec::checksum if silent flag is set
2019-05-17 10:49:53 -07:00
Trevor Clinkenbeard
3fac380b90
Avoid tracing in PageChecksumCodec::checksum if silent flag is set
2019-05-17 10:43:28 -07:00
Alvin Moore
22fa0fa1d4
Merge pull request #1599 from AlvinMooreSr/winproject-update
...
Upgraded Windows Tools within projects to 2017
2019-05-17 03:07:39 -07:00
Trevor Clinkenbeard
20e93c67ea
Allow sqlite pages to be checked for CRC32 checksum
...
Future versions of FDB will write sqlite pages with CRC32 checksums. In
order to roll back to this version from a version that writes CRC32
checksums, this version must be able to verify those checksums.
2019-05-17 01:05:06 -07:00
Alvin Moore
3acaa7343e
Enabled C++17 for all Windows projects
...
Set Visual Studio version to 2017 (first version to support C++17)
2019-05-16 17:44:13 -07:00
Paul J. Davis
53b97fe506
Extend support for parentpid
...
This adds support for the `--parentpid` option to non-Windows platforms.
This option is intended for testing layer implementations. When running
higher-level CI chains it's useful to ensure that any ephemeral instances
of fdbserver are automatically reaped.
2019-05-16 14:24:11 -10:00
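One way a process on a POSIX platform could notice that a watched process has gone away, as the `--parentpid` commit 53b97fe506 requires, is to probe the PID with signal 0. This is a sketch under assumptions: the function name `processExists` is hypothetical, and the commit message does not say which mechanism fdbserver actually uses.

```cpp
#include <cassert>
#include <cerrno>
#include <signal.h>
#include <sys/types.h>
#include <unistd.h>

// Hypothetical helper: reports whether a process with the given PID currently exists.
// kill(pid, 0) performs the existence and permission checks without sending a signal.
inline bool processExists(pid_t pid) {
    if (kill(pid, 0) == 0) {
        return true;
    }
    // EPERM means the process exists but we are not allowed to signal it.
    return errno == EPERM;
}
```

A server could call such a check periodically with the PID it was given and shut down once it returns false. Because PIDs can be reused by the operating system, a probe like this is only a heuristic.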
Trevor Clinkenbeard
d7bcbe1210
Refactored PageChecksumCodec::checksum
2019-05-16 16:07:35 -07:00
Trevor Clinkenbeard
90d886df95
Trace both hashlittle2 and crc32 checksums for SQLitePageChecksumFailure
2019-05-16 15:51:21 -07:00
Alvin Moore
94aed513c7
Switched Windows tools within projects to 2017
2019-05-16 15:05:11 -07:00
Trevor Clinkenbeard
04a72bdad6
Eliminate duplicate code in PageChecksumCodec::checksum
2019-05-16 11:09:37 -07:00
Trevor Clinkenbeard
aca90cd4e2
Don't use memcpy in PageChecksumCodec::checksum
2019-05-16 07:25:58 -07:00
chaoguang
6788c8eb7d
update cleanup process
2019-05-15 16:17:01 -07:00
chaoguang
106bb7677d
update
2019-05-15 12:58:12 -07:00
Alex Miller
658e61b394
And now use spilledOnly as a hint to do parallel peeks.
...
If there's some spilled data, there's probably a lot of spilled data,
and now we can pull all of it faster.
2019-05-14 21:03:44 -10:00
Alex Miller
69fb852ee0
Add more CLOEXEC-like things.
...
From missed call sites found during/after code review.
2019-05-14 20:30:58 -10:00
Alex Miller
4eb4c03ce5
Save TLog resources by letting peek request only spilled data.
...
If a peek is entirely fulfilled from spilled data, then it's likely that
the next peek will be also. It is thus wasteful for each of these peeks
to call peekMessagesFromMemory, which memcpy's excessively, and then
throw all that data away without using it.
Now, TLogs will give a hint back to peek cursors about if the provided
reply was served entirely from the spilled data, which peek cursors then
feed back as the hint into their next request.
At some point, a cursor will send a request for only spilled data, get
an incomplete response, and then be told to send its next request as one
that peeks from memory as well, and then it will fully catch up.
2019-05-14 15:38:48 -10:00
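The hint feedback loop described in commit 4eb4c03ce5 can be sketched with a toy model. All types and names here (`PeekRequest`, `PeekReply`, `ToyTLog`, `drainLog`) are hypothetical simplifications, not the real TLog interfaces; versions below `spilledThrough` are treated as spilled, and versions up to `memoryThrough` as held in memory.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical, simplified request and reply types.
struct PeekRequest {
    int64_t begin;     // first version wanted
    bool onlySpilled;  // hint: read only spilled data, skip the in-memory copy
};

struct PeekReply {
    int64_t end;       // next version the cursor should request
    bool onlySpilled;  // true if this reply was served entirely from spilled data
};

// Toy log: versions [0, spilledThrough) are spilled, [spilledThrough, memoryThrough) are in memory.
struct ToyTLog {
    int64_t spilledThrough;
    int64_t memoryThrough;
    int64_t batch;    // maximum number of versions returned per peek
    int memoryPeeks;  // how many peeks paid for the in-memory copy

    ToyTLog(int64_t spilled, int64_t memory, int64_t batchSize)
      : spilledThrough(spilled), memoryThrough(memory), batch(batchSize), memoryPeeks(0) {
        assert(spilled <= memory && batchSize > 0);
    }

    PeekReply peek(const PeekRequest& req) {
        if (!req.onlySpilled) {
            ++memoryPeeks;  // models the cost of copying messages out of memory
        }
        const int64_t available = req.onlySpilled ? spilledThrough : memoryThrough;
        const int64_t end = std::min(req.begin + batch, std::max(available, req.begin));
        // Keep hinting "spilled only" while full batches come purely from spilled data.
        const bool fullBatch = (end - req.begin) == batch;
        return PeekReply{ end, fullBatch && end <= spilledThrough };
    }
};

// Cursor loop: feed each reply's hint back into the next request. Returns the number of peeks.
inline int drainLog(ToyTLog& log) {
    PeekRequest req{ 0, false };
    int peeks = 0;
    while (req.begin < log.memoryThrough) {
        const PeekReply reply = log.peek(req);
        ++peeks;
        req.begin = reply.end;
        req.onlySpilled = reply.onlySpilled;
    }
    return peeks;
}
```

In this model the cursor pays for the in-memory copy on its first peek and again only after a spilled-only request comes back incomplete, which mirrors the catch-up behavior the commit message describes.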
Trevor Clinkenbeard
601c38ad82
Use crc32 for sqlite page checksums
2019-05-14 13:43:55 -07:00
chaoguang
4c9cc44c73
add paras
2019-05-14 10:13:13 -07:00
mpilman
46e7a0ca56
address reviews and make compile with `-Wunused-variable`
2019-05-13 14:15:23 -07:00
mpilman
57912b33a5
fixed merge error
2019-05-13 14:15:23 -07:00
mpilman
96aaa31a6c
Compiling on clang again
2019-05-13 14:15:23 -07:00
mpilman
20c3f7f264
remove mixed-mode support
2019-05-13 14:15:23 -07:00
mpilman
42385c2f81
Fixed issues introduced during rebase
2019-05-13 14:15:23 -07:00
mpilman
f6fbad5061
Fix memory bug
2019-05-13 14:15:23 -07:00
mpilman
44db3450ec
Several flatbuffers bug fixes
2019-05-13 14:15:23 -07:00
mpilman
9c02354255
pass NDEBUG to sqlite to enable debug mode
2019-05-13 14:15:23 -07:00
mpilman
69fa3d3903
fixed compilation issues after rebase
2019-05-13 14:15:23 -07:00
mpilman
642a96807b
Fixed compilation issues after rebase
2019-05-13 14:15:22 -07:00
mpilman
6afce01744
Implementation complete (not yet working)
2019-05-13 14:15:22 -07:00
mpilman
92bad76479
Wrap ClusterClientInterface into its own type
...
When a process joins a cluster it fetches the cluster
interface. However, not the whole interface is exposed
to the client. This mechanism relies on the fact that
the serializer keeps the field ordering and doesn't
verify the message before parsing it.
To make this work, we provide a client type with one
member (the ClusterInterface which is exposed to the
client and the server). This client interface has the
same FileIdentifier as the ClusterControllerFullInterface
which has the same first member. This works because
FlatBuffers allows for members to be missing.
2019-05-13 14:15:22 -07:00
mpilman
9eeb48c43d
Allow to turn on object serializer
...
This commit includes functionality to turn on
the object serializer for network communication.
This is done the following way:
- On incoming connections, a process will detect
whether the client supports the object serializer
and will only serialize responses with it, if it does
- On outgoing connections, the command line flag is used
to determine whether the object serializer should be used
to send data.
This way, a cluster can run in mixed mode. To upgrade one
can upgrade one process at a time and set the flag one process
at a time.
This is how this is tested on the simulator:
- The command line flag can take three options: on, off,
and random.
- For off, the object serializer will never be used.
- For on, the object serializer will always be used.
- For random, the simulator will flip a coin for each
process it starts up.
2019-05-13 14:15:22 -07:00
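The per-connection decision described in commit 9eeb48c43d can be summarized in a short sketch. The enum and function names are hypothetical, and `coinFlip` stands in for whatever random source the simulator uses; none of them come from the FoundationDB source.

```cpp
#include <cassert>

// Hypothetical model of the three command line settings.
enum class SerializerFlag { Off, On, Random };

// Resolve the flag once per process. For Random, the caller supplies the
// outcome of the per-process coin flip.
inline bool resolveSerializerFlag(SerializerFlag flag, bool coinFlip) {
    switch (flag) {
    case SerializerFlag::On:
        return true;
    case SerializerFlag::Off:
        return false;
    case SerializerFlag::Random:
        return coinFlip;
    }
    return false;
}

// Outgoing connections follow the local process setting.
inline bool useObjectSerializerOutgoing(bool localSetting) {
    return localSetting;
}

// Replies on incoming connections use the object serializer only if the
// peer was detected to support it, regardless of the local setting.
inline bool useObjectSerializerForReply(bool peerSupportsIt) {
    return peerSupportsIt;
}
```

Keeping the incoming and outgoing decisions separate is what lets processes with different settings talk to each other during a rolling upgrade.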
mpilman
ba83c458a6
types implemented
2019-05-13 14:15:22 -07:00
Nikolas Ioannou
067cdf9cde
Simplified cache eviction policy knob arg check.
2019-05-13 08:50:04 +02:00
Evan Tschannen
8c3516951a
Merge branch 'release-6.1'
...
# Conflicts:
# documentation/sphinx/source/release-notes.rst
# versions.target
2019-05-12 20:13:49 -07:00