Commit Graph

391 Commits

Author SHA1 Message Date
Daniel Smith 9c2937d4d0 Only check for files/directories when needed 2020-06-29 16:25:36 +00:00
Daniel Smith b53faa1695 Actually check directory suffix 2020-06-18 17:21:14 +00:00
Daniel Smith 73091b212c Allow detection of storage engines by presence of directory. 2020-06-17 21:50:06 +00:00
Young Liu 4dfb903a3a tmp merge 2020-06-16 20:32:07 -07:00
Meng Xu 96206a8032
Merge pull request #3368 from apple/release-6.3
Merge Release 6.3 to master
2020-06-15 20:15:22 -07:00
Evan Tschannen 4c7d43271a merge 6.3 into 7.0 2020-06-15 11:14:11 -07:00
Daniel Smith a959c6eb23 Fix copy/paste error 2020-06-15 16:48:19 +00:00
Daniel Smith acbfe2e4c9
Revert "Revert "Initial RocksDB"" 2020-06-15 12:45:36 -04:00
Evan Tschannen beab24de76 Merge branch 'release-6.3' of github.com:apple/foundationdb into release-6.3 2020-06-14 22:38:37 -07:00
Evan Tschannen c56d97cc9f randomize the coordinator a storage worker connects to 2020-06-14 22:26:06 -07:00
Young Liu f211a54593 Merged from upstream master 2020-06-13 16:47:12 -07:00
Young Liu f8c457d74d Minor fix against Meng's comments 2020-06-13 16:27:08 -07:00
Meng Xu 8595813b7d
Merge pull request #3355 from apple/release-6.3
Merge Release 6.3 into master branch
2020-06-12 20:08:47 -07:00
Jingyu Zhou 9cd1614c82
Revert "Initial RocksDB" 2020-06-11 15:29:46 -07:00
Daniel Smith a4dbb5dd01 Merge branch 'trace-batch-thread-hostile' into rocksdb-6.3 2020-06-11 15:53:57 +00:00
Young Liu a47806a966 Fixed locked and metadataVersion in GetReadVersion 2020-06-10 15:55:23 -07:00
A.J. Beamon 739767b838 Delay cluster controller candidacy for all worst fit processes, not just storage servers. 2020-06-10 09:59:56 -07:00
Young Liu 3a37e0af75 Serve GetReadVersion through master instead of peer proxies 2020-06-09 20:47:34 -07:00
negoyal 23a565ec63 Few bug fixes. 2020-06-05 16:27:04 -07:00
Evan Tschannen 30bfd606c0 Merge branch 'release-6.2' into release-6.3
# Conflicts:
#	CMakeLists.txt
#	documentation/sphinx/source/downloads.rst
#	documentation/sphinx/source/release-notes.rst
#	fdbserver/worker.actor.cpp
#	packaging/msi/FDBInstaller.wxs
#	versions.target
2020-06-04 19:21:32 -07:00
A.J. Beamon 9edc872041 Don't attempt to become a cluster controller on any process with a class that has NeverAssign fitness. 2020-06-03 16:05:21 -07:00
negoyal cf13e00a8f Merge remote-tracking branch 'origin/release-6.3' into fdb_cache_wo_allocator 2020-06-01 17:38:31 -07:00
A.J. Beamon 8329a242d2 Merge branch 'release-6.2' into merge-release-6.2-into-release-6.3
# Conflicts:
#	documentation/sphinx/source/downloads.rst
#	documentation/sphinx/source/release-notes.rst
2020-05-29 15:51:56 -07:00
Evan Tschannen e938d741e3 kill the process when a shared tlog throws an io_error 2020-05-29 09:02:55 -07:00
Daniel Smith 8731700d80 Merge remote-tracking branch 'upstream/release-6.3' into rocksdb-6.3 2020-05-27 20:02:25 +00:00
Evan Tschannen 4b6e1d8a57 fix compile problem 2020-05-22 17:16:59 -07:00
Evan Tschannen ced65cd30b finished explicitly versioning everything stored in the database 2020-05-22 17:14:21 -07:00
Daniel Smith 5d361fe532 Copy/paste rebase onto 6.3 2020-05-22 15:02:51 +00:00
Markus Pilman eaaceab845 fixed compiler issues 2020-05-14 13:48:19 -07:00
Markus Pilman c2bc75516f Merge branch 'release-6.3' of github.com:apple/foundationdb into features/trace-roles 2020-05-14 10:34:53 -07:00
Evan Tschannen 07111f0e41 add a large random delay on failure detection so that not all storage servers need to attempt to become the cluster controller 2020-05-10 17:09:33 -07:00
Evan Tschannen 048201717c Fixed a number of problems with monitorLeaderRemotely 2020-05-10 14:20:50 -07:00
Evan Tschannen 6fca885b9d revert storage class monitor leader because of correctness issues 2020-05-09 18:03:59 -07:00
Evan Tschannen f9518c3441
Merge pull request #3069 from alexmiller-apple/tls-connection-count
YOLO at reducing TLS connection count via doing monitorLeader on coordinators
2020-05-09 17:12:27 -07:00
Markus Pilman 025f27f389 control trace interval with a knob 2020-05-08 17:14:42 -07:00
Evan Tschannen f0f52fb2be Merge branch 'master' into feature-small-endpoint
# Conflicts:
#	fdbclient/StorageServerInterface.h
2020-05-08 16:37:35 -07:00
Markus Pilman 5f9b127e56 Emit traces regularly about role assignment
We are currently emitting Role transition traces when a role starts and
when it ends. While this is useful for debugging, it doesn't work well
with tools that inject data and might potentially miss some trace lines.

We do decorate each trace line with the roles assigned to that
particular process, however, this is not sufficient for tools that can
make use of the UID -> Role mapping
2020-05-08 16:27:57 -07:00
negoyal 749fcd13b0 Merge branch 'master' into fdb_cache_wo_allocator 2020-05-08 16:23:29 -07:00
Alex Miller 383099aef3 Bug fixes to get it actually doing the right thing:
* Initialize electionResult when constructing with NetworkAddress.
* Return after sending a reply.
* Reset the reply promise on each new request.
2020-05-08 01:00:18 -07:00
Evan Tschannen 51d3aaf4ae fixed a few rare correctness bugs 2020-05-06 23:24:58 -07:00
Alex Miller 8a6e177950 Merge remote-tracking branch 'upstream/master' into tls-connection-count 2020-05-05 16:49:36 -07:00
A.J. Beamon 0b4c93bb1b More aggressively cleanup a bad process ID file in simulation 2020-05-05 15:59:02 -07:00
Evan Tschannen f329164fb4
Merge pull request #2532 from dongxinEric/feature/hot-read-key-detection-part-2
Feature/hot read key detection part 2
2020-05-05 14:33:34 -07:00
Alex Miller 1117eae2b5 Rework to make ElectionResult code similar to OpenDatabase code.
And also restore and fix the delayed cluster controller code.
2020-05-05 01:00:17 -07:00
negoyal dd033736ed Merge branch 'master' into fdb_cache_subfeature2 2020-05-04 17:29:43 -07:00
Evan Tschannen ca92a39f5d reduced the size of proxy and tlog interfaces 2020-05-01 16:41:20 -07:00
Alex Miller 43a63452d8 YOLO at reducing TLS connection count via doing monitorLeader on coordinators 2020-05-01 14:40:21 -07:00
Evan Tschannen 4d131bdd4a Merge branch 'master' into feature-small-endpoint 2020-05-01 13:16:15 -07:00
Dave Cottlehuber 98639645b1 fdbserver: update headers 2020-04-30 18:11:23 +00:00
Evan Tschannen a442565e13 more work towards shrinking locality 2020-04-18 21:29:38 -07:00
Evan Tschannen 4c51e0a05b
Update fdbserver/worker.actor.cpp
Co-Authored-By: A.J. Beamon <ajbeamon@users.noreply.github.com>
2020-04-17 14:44:58 -07:00
Xin Dong 7dd7406c59
Merge branch 'master' into feature/hot-read-key-detection-part-2 2020-04-16 14:54:05 -07:00
Evan Tschannen 2eec3bb9b1 fixed logic for skipping broadcast 2020-04-13 13:09:21 -07:00
Evan Tschannen 8f78912483 knobified parameter 2020-04-11 20:54:17 -07:00
Evan Tschannen e5ec7f2800 do not broadcast obsolete serverDBInfo 2020-04-11 20:05:03 -07:00
Evan Tschannen 1476057996 properly cache serialization of serverDBInfo 2020-04-11 19:30:05 -07:00
Evan Tschannen 07cc0a8d74 code cleanup 2020-04-10 17:02:11 -07:00
Evan Tschannen ce4493f679 many bug fixes 2020-04-10 13:45:16 -07:00
Evan Tschannen a51c92854a Merge branch 'master' into feature-tree-broadcast
# Conflicts:
#	fdbserver/WorkerInterface.actor.h
#	fdbserver/worker.actor.cpp
2020-04-06 21:09:44 -07:00
Evan Tschannen 2a1bd97120 fix compilation errors 2020-04-06 20:58:43 -07:00
Evan Tschannen 477d66b46d implemented a tree broadcast for txn state message for proxies, and serverDBInfo for workers 2020-04-05 23:09:36 -07:00
negoyal acaf91ac47 Merge branch 'master' into fdb_cache_subfeature2 2020-03-26 13:33:08 -07:00
Jingyu Zhou f0f4e42a4c Add removal for backupWorkerCache 2020-03-23 12:47:42 -07:00
Jingyu Zhou 658504bc66 Add a cache to handle repeated delivery of backup recruitment messages 2020-03-23 10:22:24 -07:00
Balachandar Namasivayam 58a9bfa78b
Merge pull request #2820 from dongxinEric/fix/1977/add-back-trace-event-flush-failure-report
Fix/1977/add back trace event flush failure report
2020-03-18 16:11:44 -07:00
Xin Dong 89861c661e Fix the random crash. Use a thread safe 'ThreadReturnPromise' instead of the ThreadFuture. 2020-03-16 13:36:55 -07:00
Xin Dong 5967ef5eab Added back the changes that report trace log flush failures and fix the random crash 2020-03-12 14:34:19 -07:00
Meng Xu a9136f3f72 Add waitForUnreliableExtraStoreReboot to wait for extra store to reboot 2020-03-12 10:18:31 -07:00
Evan Tschannen 303df197cf Merge branch 'release-6.2'
# Conflicts:
#	CMakeLists.txt
#	bindings/c/test/mako/mako.c
#	documentation/sphinx/source/release-notes.rst
#	fdbbackup/backup.actor.cpp
#	fdbclient/NativeAPI.actor.cpp
#	fdbclient/NativeAPI.actor.h
#	fdbserver/DataDistributionQueue.actor.cpp
#	fdbserver/Knobs.cpp
#	fdbserver/Knobs.h
#	fdbserver/LogRouter.actor.cpp
#	fdbserver/SkipList.cpp
#	fdbserver/fdbserver.actor.cpp
#	flow/CMakeLists.txt
#	flow/Knobs.cpp
#	flow/Knobs.h
#	flow/flow.vcxproj
#	flow/flow.vcxproj.filters
#	versions.target
2020-03-06 18:22:46 -08:00
Evan Tschannen 1128666840 added additional logging on the log router 2020-03-05 18:17:06 -08:00
Xin Dong 39610d15f8 Revert this change since it somehow introduced a random crash detected on circus 2020-03-04 16:14:38 -08:00
negoyal 3acd3ad3af Some bugfixes and cleanup. 2020-03-02 17:11:23 -08:00
negoyal cd949eca71 Merge branch 'master' into fdb_cache_subfeature2 2020-02-26 11:22:08 -08:00
Xin Dong f20619c9fb Resolve review comments. Changed how issues got cleared 2020-02-25 15:39:51 -08:00
Xin Dong fce71e4516 Added a TODO for the usage of 'issues' in 'monitorServerDBInfo' 2020-02-25 15:39:38 -08:00
Xin Dong 090c89e90a Addressed review comments. Fix the bug where issues on a worker may be wrongly cleared by subsequent GetDBinfo request. 2020-02-25 15:39:38 -08:00
Xin Dong 288e95c7e1 Reallocate the issues set after each get. Changed an issues name to be accurate 2020-02-25 15:39:09 -08:00
Xin Dong f4f860bfa8 Changed issue reporting to be thread safe. Also changed the liveness ping to be thread safe. 2020-02-25 15:38:14 -08:00
Xin Dong a6580dc15f Added the ability to ping a trace log writer thread and the monitoring in worker.actor.cpp. The current solution is simply a loose check. We can change this to be an accurate check by using 'pthread_kill(writer_thread, 0)' 2020-02-25 15:37:53 -08:00
Xin Dong 0b0414fb94 Addressed review comments. Change the issue reporting from 'ITraceLogWriter' to be a more generic way. 2020-02-25 15:37:53 -08:00
Xin Dong 034dfe5e42 Now the inability to flush trace logs will be reported to both 'stderr' and also the status json object.
- Since the first flush failure, if the accumulated consecutive failure count exceeds the value defined in knobs, it will trigger the current worker process to report this issue via the 'GetServerDBInfo' interface of the cluster controller
    - A successful flush will reset the accumulated counter.
    Notice that the current solution does not take the time into consideration. The assumption is that flush failures tend to only happen in a clustered manner. The intermittent, but short, periods of flush failures are not considered as a problem since the memory pressure built by them should be negligible.
2020-02-25 15:37:32 -08:00
negoyal 308e088bca Minor fixes. 2020-02-25 15:00:18 -08:00
Evan Tschannen 96258b9809 Merge branch 'release-6.2'
# Conflicts:
#	documentation/sphinx/source/release-notes.rst
#	fdbcli/fdbcli.actor.cpp
#	fdbclient/ManagementAPI.actor.cpp
#	fdbrpc/FlowTransport.actor.cpp
#	fdbserver/ClusterController.actor.cpp
#	fdbserver/DataDistribution.actor.cpp
#	fdbserver/DataDistribution.actor.h
#	fdbserver/DataDistributionQueue.actor.cpp
#	fdbserver/KeyValueStoreMemory.actor.cpp
#	fdbserver/MasterProxyServer.actor.cpp
#	fdbserver/QuietDatabase.actor.cpp
#	fdbserver/SkipList.cpp
#	fdbserver/StorageMetrics.actor.h
#	fdbserver/TLogServer.actor.cpp
#	fdbserver/fdbserver.actor.cpp
#	fdbserver/storageserver.actor.cpp
#	fdbserver/workloads/KVStoreTest.actor.cpp
#	flow/CMakeLists.txt
#	flow/Knobs.cpp
#	flow/Knobs.h
#	flow/genericactors.actor.cpp
#	flow/serialize.h
2020-02-21 19:09:16 -08:00
Evan Tschannen 59ff782927 fix: only delete the processId file on binaryReader errors 2020-02-20 23:04:39 -08:00
A.J. Beamon 5586e6f6d8
Merge pull request #2697 from etschannen/feature-correctness-fixes
A variety of correctness fixes
2020-02-20 13:32:18 -08:00
Evan Tschannen 9b3254d5f4 A corrupted processId file should be deleted in simulation, as that is the manual operation that would fix the problem in the real world 2020-02-19 15:21:42 -08:00
Meng Xu 94d799552e FastRestore:Apply clang-format against master 2020-02-18 16:41:59 -08:00
Meng Xu 132f5aa9ba FastRestore:Improve trace name and cosmetic change 2020-02-18 16:41:19 -08:00
Meng Xu 31a6ec34b7 Merge branch 'master' into mengxu/fast-restore-agent-PR 2020-02-18 16:17:59 -08:00
Balachandar Namasivayam 1be6915a38 Fix an incorrect if else check. 2020-02-17 17:31:41 -08:00
A.J. Beamon 1d9140d874 Removed TLogVersion logging.
Added logging of SharedTLog ID for each TLog.
Switched ID logged for TLogRejoining event to the TLog instead of the SharedTLog.
Made some parameters to startRole passed by reference.
2020-02-14 12:33:43 -08:00
Balachandar Namasivayam 32165c506f Fix a one line bug where the if check comparison was wrong. 2020-02-12 16:55:33 -08:00
A.J. Beamon 56053c565b Improve TLog "Role" event by adding the worker ID, the TLog version, and under what circumstances the TLog is being started (Restored, Recruited, or Recovered).
The SharedTLog role was being started and stopped twice, so remove one instance of it.
2020-02-12 15:11:38 -08:00
negoyal 85cc35e81e Merge branch 'master' into HEAD 2020-02-05 14:59:55 -08:00
Meng Xu 3b57bf1781 Merge branch 'master' into mengxu/fast-restore-agent-PR 2020-02-03 17:23:54 -08:00
Evan Tschannen 4524831456
Merge pull request #2518 from vishesh/task/failmon-remove-server
FailureMonitoring: Server processes no longer need to talk to ClusterController
2020-02-03 17:22:50 -08:00
Meng Xu ca3b6135d0 FastRestore:Add debug to see why restore role is not connected
Reason: restore is a fdbserver who does not register with CC.
The new failure monitor changes how connection works for client and server.
For client, it does not connect to CC to get connected.
For server, it has to connect to CC to get connected.
Restore worker becomes the special role that behaves like a client but is a server.
2020-02-03 17:19:52 -08:00
Meng Xu 9c2046b11b FastRestore:Mimic fdbd to monitor coordinators
Before we start a fdb restore process.
2020-02-03 14:48:31 -08:00
Meng Xu 559b95c61a FastRestore:RestoreRole:Mimic how fdbd starts 2020-02-01 10:23:48 -08:00
Alex Miller ee6490c9d1
Merge pull request #2314 from mengranwo/memory-engine
New Radix-Tree based Memory Storage Engine
2020-01-30 16:20:13 -08:00