Xin Dong
5967ef5eab
Added back the changes that report trace log flush failures and fix the random crash
2020-03-12 14:34:19 -07:00
A.J. Beamon
2466749648
Don't disallow allocation tracking when a trace event is open because we now have state trace events. Instead, only block allocation tracking while we are in the middle of allocation tracking already to prevent recursion.
2020-03-12 11:17:49 -07:00
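The recursion guard described in the commit above can be illustrated with a minimal Python sketch. This is not the FoundationDB C++ code; the class `AllocationTracker`, its `on_allocate` method, and the nested 16-byte allocation are hypothetical names and values chosen only to show the idea of skipping tracking while tracking is already in progress.

```python
class AllocationTracker:
    """Hypothetical tracker that ignores allocations made while it is tracking."""

    def __init__(self):
        self._in_tracking = False  # True only while a sample is being recorded
        self.tracked_sizes = []

    def on_allocate(self, size):
        if self._in_tracking:
            # Already inside tracking: skip, otherwise we would recurse forever.
            return
        self._in_tracking = True
        try:
            self.tracked_sizes.append(size)
            # Recording a sample may itself allocate; model that with a nested call.
            self.on_allocate(16)
        finally:
            self._in_tracking = False


tracker = AllocationTracker()
tracker.on_allocate(128)
tracker.on_allocate(256)
print(tracker.tracked_sizes)
```

Only the two outer allocations are recorded; the nested allocation made during tracking is dropped by the guard.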
A.J. Beamon
8cdf918316
Add logging when file identifiers don't match
2020-03-12 11:06:53 -07:00
Andrew Noyes
770ef6e726
Add test
2020-03-10 10:42:57 -07:00
Andrew Noyes
027029cc9b
Remove offending overload?
2020-03-10 10:18:14 -07:00
Evan Tschannen
303df197cf
Merge branch 'release-6.2'
...
# Conflicts:
# CMakeLists.txt
# bindings/c/test/mako/mako.c
# documentation/sphinx/source/release-notes.rst
# fdbbackup/backup.actor.cpp
# fdbclient/NativeAPI.actor.cpp
# fdbclient/NativeAPI.actor.h
# fdbserver/DataDistributionQueue.actor.cpp
# fdbserver/Knobs.cpp
# fdbserver/Knobs.h
# fdbserver/LogRouter.actor.cpp
# fdbserver/SkipList.cpp
# fdbserver/fdbserver.actor.cpp
# flow/CMakeLists.txt
# flow/Knobs.cpp
# flow/Knobs.h
# flow/flow.vcxproj
# flow/flow.vcxproj.filters
# versions.target
2020-03-06 18:22:46 -08:00
tclinken
2017daf7d4
Ignore createDirectory error if directory already exists
2020-03-06 16:48:23 -08:00
Evan Tschannen
dbfc0cbcc0
Merge pull request #2781 from alexmiller-apple/certificate-refresh
...
Refresh certificates used for handshaking when they change on disk
2020-03-06 11:12:04 -08:00
Alex Miller
f9969a853c
Merge remote-tracking branch 'origin/certificate-refresh' into certificate-refresh
2020-03-06 11:10:05 -08:00
Alex Miller
188d9b8239
Don't swallow actor cancellation in certificate refreshing.
2020-03-06 11:09:17 -08:00
Alex Miller
9b760fae2d
Rewrite all Errors into tls_errors if they happen as part of initializing TLS.
2020-03-06 11:06:19 -08:00
Alex Miller
1f56bf8933
Fix the build with success()
...
Co-Authored-By: A.J. Beamon <ajbeamon@users.noreply.github.com>
2020-03-06 10:15:04 -08:00
Alex Miller
ac52b6b474
Rework a bit of error and exception handling.
...
I went back and dug through all of the "what functions can throw what
types", and made sane decisions about them. boost errors are
aggressively translated into FDB ones, which might result in multiple
lines of logging about errors, but this is in infrequently run code, so
it should be fine.
2020-03-06 02:33:16 -08:00
Evan Tschannen
39050308ff
lower accept batch size just to be conservative with the change
2020-03-05 18:17:49 -08:00
Evan Tschannen
1128666840
added additional logging on the log router
2020-03-05 18:17:06 -08:00
Alex Miller
ccef3f7d05
Attempt to fix TLS_DISABLED compiles.
2020-03-05 17:32:10 -08:00
Alex Miller
2d95a1e64d
Implement certificate refreshing
2020-03-05 17:25:33 -08:00
Alex Miller
595dd77ed1
Merge remote-tracking branch 'upstream/release-6.2' into certificate-refresh
2020-03-04 20:25:42 -08:00
Alex Miller
9b5ef3416e
Refactor TLSParams into TLSConfig + LoadedTLSConfig
...
The idea being that we keep around a TLSConfig that holds the configuration
that the user has provided, and then when we want to initialize an SSL
context, we ask the TLSConfig to load all certificates and return us a
LoadedTLSConfig that is a concrete set of certificate bytes in memory.
initTLS now just takes the in-memory bytes and applies them to the ssl
context.
This is a large refactor to lead up to certificate refreshing, where we
will periodically check for changes to the certificates, and then
re-load them and apply them to a new SSL context.
2020-03-04 20:14:47 -08:00
Xin Dong
39610d15f8
Revert this change since it somehow introduced a random crash detected on circus
2020-03-04 16:14:38 -08:00
Evan Tschannen
2a877bce9a
Merge pull request #2777 from etschannen/feature-accept-batch
...
Accept connections in batches of 20 to improve performance
2020-03-04 16:14:24 -08:00
Evan Tschannen
c73cae0feb
Merge pull request #2760 from ajbeamon/client-version-fixes
...
Improvements to client version reporting
2020-03-04 15:52:49 -08:00
A.J. Beamon
b3c3f8aa5f
Update flow/genericactors.actor.h
...
Pass by reference
2020-03-04 15:35:51 -08:00
Evan Tschannen
7cbabca124
remove printing to stderr from initTLS because that could cause problems on clients
2020-03-04 15:06:22 -08:00
Evan Tschannen
35a1ac6482
prepare net2 for new versions of boost
2020-03-04 14:26:01 -08:00
Evan Tschannen
da579faf62
add missing task priority
2020-03-04 14:25:30 -08:00
Evan Tschannen
820957025f
accept connections in batches of 20 to improve performance
2020-03-04 14:24:57 -08:00
Andrew Noyes
24bbf5a8f0
Avoid invalid read on invalid Void msg
2020-03-02 12:11:43 -08:00
Andrew Noyes
cdbe3117d7
Fix typo
2020-03-02 12:11:43 -08:00
Andrew Noyes
7119b46eb2
Add unit test
2020-03-02 12:11:43 -08:00
Evan Tschannen
c11c24b79d
removed the fdbrpc version of platform.h
2020-02-28 14:56:10 -08:00
Andrew Noyes
e6d36a0aa5
Fix Makefile build
2020-02-28 13:16:58 -08:00
Andrew Noyes
f29d6c3f67
Move implementation of ArenaBlock members to Arena.cpp
2020-02-28 12:33:57 -08:00
Evan Tschannen
6054c05963
Merge branch 'release-6.2'
...
# Conflicts:
# CMakeLists.txt
# documentation/sphinx/source/release-notes.rst
# fdbserver/fdbserver.actor.cpp
# versions.target
2020-02-28 12:11:05 -08:00
A.J. Beamon
d1e1fea42d
Our binaries that act like clients (fdbcli, backup and DR binaries) were reporting an unknown client version. Clients did not react if the list of supported versions changed.
2020-02-28 09:35:21 -08:00
Xin Dong
13e72f7b3b
Merge pull request #2605 from dongxinEric/fix/1977/report-inability-to-flush-trace-log
...
Report inability to flush trace logs.
2020-02-27 12:36:55 -08:00
Xin Dong
16575ae94d
Address review comments
2020-02-27 11:54:15 -08:00
Xin Dong
4ac7b36e44
Added back the mutex holder that was removed accidentally
2020-02-27 10:19:17 -08:00
Evan Tschannen
707fc1ddea
only capture the policy to match prior code
2020-02-26 19:04:49 -08:00
Evan Tschannen
c3299b8ebe
if tls cannot be initialized, throw an error from createDatabase
2020-02-26 18:53:06 -08:00
Evan Tschannen
bf5a95e6df
Merge commit 'dc39bdfbbf94a7f470386f439df08c044d08d90c' into feature-tls-environment-vars
...
# Conflicts:
# flow/Net2.actor.cpp
2020-02-26 18:02:56 -08:00
Evan Tschannen
f035bed870
defer initializing TLS to avoid throwing errors from a constructor and so that errors can be logged to the trace file
2020-02-26 17:50:07 -08:00
A.J. Beamon
4bbac9d996
Change a special case return to -1. Update comments to clarify and correct some things.
2020-02-26 16:39:13 -08:00
Evan Tschannen
f85af10a18
fixed a few problems with tls setup
2020-02-26 16:06:45 -08:00
Evan Tschannen
d1598e7c99
set_verify_peers throws an error instead of returning a value
2020-02-26 16:06:16 -08:00
Evan Tschannen
2586bade68
re-added support for configuring TLS options with environment variables
2020-02-26 15:33:48 -08:00
A.J. Beamon
0f5c999d4b
Better containment of boost errors related to TLS.
2020-02-26 12:26:43 -08:00
Steve Atherton
087c6fa33d
Merge branch 'master' into feature-redwood
2020-02-26 12:25:04 -08:00
Xin Dong
74c929d98d
Fix windows build, again
2020-02-26 10:01:08 -08:00
Evan Tschannen
924d335aa7
Merge branch 'release-6.2'
...
# Conflicts:
# documentation/sphinx/source/release-notes.rst
# flow/Knobs.cpp
# flow/Knobs.h
2020-02-25 18:25:19 -08:00
Xin Dong
7b51ab6b63
Rebased with master
2020-02-25 15:43:33 -08:00
Xin Dong
f20619c9fb
Resolve review comments. Changed how issues got cleared
2020-02-25 15:39:51 -08:00
Xin Dong
3f24ae93f2
Remove the unused variable
2020-02-25 15:39:38 -08:00
Xin Dong
090c89e90a
Addressed review comments. Fix the bug where issues on a worker may be wrongly cleared by subsequent GetDBinfo request.
2020-02-25 15:39:38 -08:00
Xin Dong
aaa63331b6
Fix windows build
2020-02-25 15:39:09 -08:00
Xin Dong
288e95c7e1
Reallocate the issues set after each get. Changed an issue's name to be accurate
2020-02-25 15:39:09 -08:00
Xin Dong
1c346fcfb0
Added the new issues into Status Schema. Remove the issue reporting in lastError since:
...
- If the issue string contains the error number, status schema needs to be super verbose to include all possible issue strings
- If the issue string does not contain the error number, the generic issue string can be pretty useless.
Thus now specific issues are being reported before calling lastError
2020-02-25 15:38:14 -08:00
Xin Dong
39c92c9cce
Update flow/FileTraceLogWriter.cpp
...
Co-Authored-By: A.J. Beamon <ajbeamon@users.noreply.github.com>
2020-02-25 15:38:14 -08:00
Xin Dong
f4f860bfa8
Changed issue reporting to be thread safe. Also changed the liveness ping to be thread safe.
2020-02-25 15:38:14 -08:00
Xin Dong
a6580dc15f
Added the ability to ping a trace log writer thread and the monitoring in worker.actor.cpp. The current solution is simply a loose check. We can change this to an accurate check by using 'pthread_kill(writer_thread, 0)'
2020-02-25 15:37:53 -08:00
Xin Dong
0b0414fb94
Addressed review comments. Change the issue reporting from 'ITraceLogWriter' to be a more generic way.
2020-02-25 15:37:53 -08:00
Xin Dong
034dfe5e42
Now the inability to flush trace logs will be reported to both 'stderr' and also the status json object.
...
- After the first flush failure, if the accumulated consecutive failure count exceeds the value defined in knobs, it will trigger the current worker process to report this issue via the 'GetServerDBInfo' interface of the cluster controller
- A successful flush will reset the accumulated counter.
Notice that the current solution does not take the time into consideration. The assumption is that flush failures tend to only happen in a clustered manner. The intermittent, but short, periods of flush failures are not considered as a problem since the memory pressure built by them should be negligible.
2020-02-25 15:37:32 -08:00
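The counting rule described in the commit above (report once consecutive failures exceed a knob, reset on any successful flush) can be sketched as follows. This is an illustrative Python sketch, not the FoundationDB implementation; `FlushFailureTracker`, `record_flush`, and `max_consecutive_failures` are hypothetical names, and the threshold of 2 is an arbitrary value standing in for the knob.

```python
class FlushFailureTracker:
    """Counts consecutive flush failures and flags an issue past a threshold."""

    def __init__(self, max_consecutive_failures):
        # Hypothetical stand-in for the knob mentioned in the commit message.
        self.max_consecutive_failures = max_consecutive_failures
        self.consecutive_failures = 0

    def record_flush(self, succeeded):
        """Record one flush attempt; return True if the issue should be reported."""
        if succeeded:
            # A successful flush resets the accumulated counter.
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
        return self.consecutive_failures > self.max_consecutive_failures


tracker = FlushFailureTracker(max_consecutive_failures=2)
results = [tracker.record_flush(ok) for ok in (False, False, False, True, False)]
print(results)
```

As in the commit message, elapsed time plays no role here: only the run length of consecutive failures matters, so short intermittent failures never trigger a report.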
A.J. Beamon
0f7656e52e
Document roughness. Remove an unexplained factor of 2 and handle window edges better. Subtract 1 from roughness to correspond better to variance.
2020-02-25 08:45:51 -08:00
A.J. Beamon
1c6aef76b5
When one of the sqlite reader or writer thread pools fail, fail the other with the same error.
2020-02-24 12:39:04 -08:00
Alvin Moore
9585cd10f1
Removed duplicate CMake link request
2020-02-24 00:19:43 -08:00
Alvin Moore
0f64505d0b
Merge branch 'release-6.2' of github.com:apple/foundationdb
...
Needed to pull in changes to build docker
2020-02-23 23:27:53 -08:00
Steve Atherton
712aa27896
Merge branch 'release-6.2' of github.com:apple/foundationdb into feature-redwood
2020-02-23 00:30:27 -08:00
Evan Tschannen
65fbe0d0bc
revert AcceptSocket priority change because of bad performance results
2020-02-21 19:22:14 -08:00
Evan Tschannen
96258b9809
Merge branch 'release-6.2'
...
# Conflicts:
# documentation/sphinx/source/release-notes.rst
# fdbcli/fdbcli.actor.cpp
# fdbclient/ManagementAPI.actor.cpp
# fdbrpc/FlowTransport.actor.cpp
# fdbserver/ClusterController.actor.cpp
# fdbserver/DataDistribution.actor.cpp
# fdbserver/DataDistribution.actor.h
# fdbserver/DataDistributionQueue.actor.cpp
# fdbserver/KeyValueStoreMemory.actor.cpp
# fdbserver/MasterProxyServer.actor.cpp
# fdbserver/QuietDatabase.actor.cpp
# fdbserver/SkipList.cpp
# fdbserver/StorageMetrics.actor.h
# fdbserver/TLogServer.actor.cpp
# fdbserver/fdbserver.actor.cpp
# fdbserver/storageserver.actor.cpp
# fdbserver/workloads/KVStoreTest.actor.cpp
# flow/CMakeLists.txt
# flow/Knobs.cpp
# flow/Knobs.h
# flow/genericactors.actor.cpp
# flow/serialize.h
2020-02-21 19:09:16 -08:00
Steve Atherton
f1ec780b31
Merge branch 'release-6.2' of github.com:apple/foundationdb into feature-redwood
2020-02-21 17:43:11 -08:00
A.J. Beamon
4c696d5bf2
Merge branch 'release-6.2' into dd-better-rebalance-logging
...
# Conflicts:
# fdbserver/DataDistributionQueue.actor.cpp
2020-02-21 17:41:00 -08:00
A.J. Beamon
dfa5f76c01
Remove unused parameter. Don't put check for g_network presence in ASSERT_WE_THINK.
2020-02-21 16:28:03 -08:00
A.J. Beamon
2431d4d788
Always compute the time for a trace event when it is being logged rather than when it is being created. Usually these are the same, but if they aren't, doing the opposite can lead to out of order trace events.
2020-02-21 13:57:04 -08:00
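The ordering problem described in the commit above can be shown with a small Python sketch, assuming a hypothetical `TraceEvent` class and a fake monotonically increasing clock. It is not the FoundationDB trace code; it only demonstrates that stamping events at log time keeps the output ordered even when events are logged in a different order than they were created.

```python
import itertools


class TraceEvent:
    """Hypothetical event whose timestamp is taken when it is logged."""

    def __init__(self, name, clock):
        self.name = name
        self._clock = clock
        self.time = None  # deliberately not computed at creation

    def log(self, sink):
        # Compute the time at the moment of logging, not at construction.
        self.time = self._clock()
        sink.append((self.time, self.name))


ticks = itertools.count(1)  # fake clock: 1, 2, 3, ...


def clock():
    return next(ticks)


sink = []
first = TraceEvent("CreatedFirst", clock)
second = TraceEvent("CreatedSecond", clock)
second.log(sink)  # logged before the event that was created earlier
first.log(sink)
print(sink)
```

Had the timestamps been taken in the constructor, the second line written to the sink would carry an earlier time than the first.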
A.J. Beamon
6810a03283
Add more logging to valley filler and mountain chopper
2020-02-21 10:55:14 -08:00
Alvin Moore
90b4050eca
Added required include for stringstream
2020-02-21 09:59:11 -08:00
Alvin Moore
d02d84a577
Added required include for std::set, which is for some reason only missing within the Windows build
2020-02-21 09:36:24 -08:00
Alvin Moore
9042cab7bc
Changed ordering of link libraries
2020-02-21 08:56:52 -08:00
Evan Tschannen
dc3826e2fd
fix: tls throttling would re-insert the failure into the map
2020-02-20 18:17:39 -08:00
Evan Tschannen
f04e311a1e
Merge commit 'b46d6e25e24993ab5a5f04091fd3235050b7cd09' into feature-boost-ssl
...
# Conflicts:
# fdbserver/SimulatedCluster.actor.cpp
# flow/Net2.actor.cpp
2020-02-20 17:36:38 -08:00
Alex Miller
927cff3317
Report errors on TLS misconfigurations ... or at least try to.
2020-02-20 16:57:29 -08:00
Evan Tschannen
d7c841a28a
Merge pull request #2589 from etschannen/feature-proxy-delay
...
Improve version pipelining on the proxy
2020-02-20 15:23:30 -08:00
Evan Tschannen
8129f74a10
Merge pull request #2698 from etschannen/feature-recruit-delay
...
The CC waits until no new workers register before starting a bad recruitment
2020-02-20 14:42:37 -08:00
Evan Tschannen
7d54acf4ca
removed an unnecessary yield
2020-02-20 14:41:49 -08:00
A.J. Beamon
5586e6f6d8
Merge pull request #2697 from etschannen/feature-correctness-fixes
...
A variety of correctness fixes
2020-02-20 13:32:18 -08:00
Evan Tschannen
08c318d28a
re-added the connect lock in the fdbcli so that the timeout is not spent before a connection has been initiated (because of the handshake lock)
2020-02-20 10:43:34 -08:00
Evan Tschannen
69b5a1fbe3
more priority improvements
2020-02-20 10:11:43 -08:00
Evan Tschannen
fd8a58b035
re-added support for the TLS_DISABLED flag
2020-02-19 18:37:47 -08:00
Evan Tschannen
761da5a059
code cleanup
2020-02-19 17:59:45 -08:00
Evan Tschannen
fbd45963d8
The cluster controller waits until no new workers register for 1.0 before starting a bad recruitment
2020-02-19 16:48:30 -08:00
Evan Tschannen
9b3254d5f4
A corrupted processId file should be deleted in simulation, as that is the manual operation that would fix the problem in the real world
2020-02-19 15:21:42 -08:00
Alex Miller
fe78524bbc
Merge pull request #2678 from sears/networktest_perf
...
Add some tuning knobs to networktestclient; also, measure latency directly
2020-02-19 14:38:09 -08:00
Russell Sears
956a3efa80
Pull request comments
2020-02-19 10:55:05 -08:00
Alex Miller
88d36af9c7
Fix --tls_password and add better error logging
...
This refactors all tls settings into a TLSParams object so that we can
set the password before loading any certificates.
It turns out that the FDBLibTLS code did really nice things with error
logging, but I just didn't understand openssl enough before to realize
what pieces I should be copying.
2020-02-19 00:57:05 -08:00
Meng Xu
31a6ec34b7
Merge branch 'master' into mengxu/fast-restore-agent-PR
2020-02-18 16:17:59 -08:00
Alex Miller
9d88356468
Merge pull request #2686 from mpilman/features/avoid-unnecessary-template-instanciations
...
Removed dead code
2020-02-17 14:46:39 -08:00
Alex Miller
9144c3e8ca
Merge pull request #2087 from atn34/issue-1226
...
Allow member actors access to private variables
2020-02-17 14:39:31 -08:00
mpilman
aac94a766b
Removed dead code
2020-02-15 21:56:48 -08:00
A.J. Beamon
649fc6ba94
Merge pull request #2329 from davisp/trace-clock-source-network-option
...
Add network option for the trace clock source
2020-02-15 10:43:00 -08:00
Paul J. Davis
32e285a761
Add network option for the trace clock source
...
This option allows clients to select the clock source for trace events
similar to the `--traceclock` command line parameter for `fdbserver`.
Using the `realtime` clock source makes loading event data into
OpenTracing systems like Jaeger more useful.
2020-02-15 11:30:43 -06:00
Markus Pilman
ccf590e193
Merge branch 'master' of github.com:apple/foundationdb into features/boost70
2020-02-14 22:05:51 -08:00
mpilman
579444419a
remove call to `get_io_service`
2020-02-14 21:22:14 -08:00
mpilman
3a1e878a9b
Upgrade to boost 1.72
2020-02-14 18:10:13 -08:00
Evan Tschannen
693e469003
Changed the handshake lock to a BoundedFlowLock, which will enforce that old handshakes complete before starting to initiate new handshakes
2020-02-14 16:49:52 -08:00
Evan Tschannen
321dded7dd
rely on preverified to verify the certificate
2020-02-14 16:45:04 -08:00
Alex Miller
94e7f790d8
Merge pull request #2667 from atn34/atn34/remove-flatbuffers-knob
...
Remove USE_OBJECT_SERIALIZER knob
2020-02-14 15:44:38 -08:00
Alex Miller
723a70b357
Call X509_verify_cert once and implement time checking by hand
2020-02-13 21:31:36 -08:00
Alex Miller
d716c50000
Find OpenSSL or LibreSSL in CMake
2020-02-13 21:31:36 -08:00
Alex Miller
8298fb3cb5
Remove spammy traceevent from testing
2020-02-13 21:31:36 -08:00
Russell Sears
7724c644e5
Add some tuning knobs to networktestclient; also, measure latency directly.
2020-02-13 13:11:54 -08:00
Steve Atherton
0c7c815396
Merge branch 'release-6.2' of github.com:apple/foundationdb into feature-redwood
...
# Conflicts:
# tests/CMakeLists.txt
2020-02-12 16:12:57 -08:00
Andrew Noyes
1248d2b8b4
Remove USE_OBJECT_SERIALIZER knob
2020-02-12 10:41:52 -08:00
Steve Atherton
93e3e36d52
Changed RedwoodRecordRef::compare() to include value and updated VersionedBTree to adapt to this change. This fixes an (uncommitted) bug where DeltaTree inserts of a record matching a deleted record except for value would simply unhide the deleted record. For DeltaTree delete/insert sequences to work correctly compare() must only return 0 when the records are fully equivalent.
2020-02-12 01:18:35 -08:00
Meng Xu
e76b6d824a
FastRestore:Assign priority to actors to prioritize vb work
...
When we pipeline multiple version batches, we should prevent a later
version batch from blocking the earlier version batch by consuming
CPU resources.
To achieve the above, we should assign higher priority to actors
in later phases in a version batch.
Because restore master will not invoke an actor at a later phase unless
the actors at the earlier phases have finished, this priority assignment
will not cause deadlock.
2020-02-10 20:29:23 -08:00
Evan Tschannen
dcbce3593e
fixed TLS in simulation
2020-02-10 14:00:21 -08:00
mpilman
5a9d420cb7
Merge remote-tracking branch 'upstream/release-6.2' into release-merges/20200210
2020-02-10 10:02:05 -08:00
A.J. Beamon
ff44bd2b33
Merge pull request #2639 from atn34/atn34/include-port-in-address-default
...
Enable include_port_in_address by default for api version 700
2020-02-10 09:50:59 -08:00
Markus Pilman
e71fe44ee3
Merge branch 'master' into features/icc
2020-02-08 21:33:02 -08:00
A.J. Beamon
abb75f7eb7
Add logging to indicate the time spent at each priority that exceeds some minimum busyness threshold
2020-02-07 14:34:24 -08:00
A.J. Beamon
6010d835fb
Reorganize the interaction between slow task checking and check_yield
2020-02-07 10:35:09 -08:00
Alex Miller
2a2bf945ef
Also remove FDBLibTLS from CMake
2020-02-06 21:55:13 -08:00
Alex Miller
e390dbd36c
Add a non-FDBLibTLS verify peers framework to new TLS impl
2020-02-06 21:06:52 -08:00
Evan Tschannen
38d8d0d675
fixed simulation
2020-02-06 19:29:31 -08:00
Evan Tschannen
69de430057
separate handshaking from connection to improve pipelining
2020-02-06 16:45:54 -08:00
A.J. Beamon
df2b0452b4
Step 3 of fixing storage server range reads: change return type of readRange from VectorRef<KeyValueRef> to RangeResultRef.
2020-02-06 13:19:24 -08:00
Evan Tschannen
53d0867a17
limit the number of connections a process can attempt to establish in parallel
2020-02-04 18:15:10 -08:00
Evan Tschannen
c9738ab133
do not destroy an ssl connection until async_handshake has returned
2020-02-04 17:54:03 -08:00
Andrew Noyes
fcefb4bf6d
Merge branch 'master' into issue-1226
2020-02-04 17:46:36 -08:00
Evan Tschannen
84853dd1fd
switched SSL implementation to use boost ssl
2020-02-04 14:56:40 -08:00
Evan Tschannen
8449badb3e
Merge pull request #1868 from dongxinEric/fix/1827/error_instead_of_timeout
...
Send error back before put the GRV request with PRIORITY_BATCH into t…
2020-02-04 14:32:47 -08:00
mpilman
35d4aef8ee
Reverted `crc32`
2020-02-04 10:53:42 -08:00
mpilman
52ca752dd3
Merge remote-tracking branch 'origin/features/icc' into features/icc
2020-02-04 10:29:49 -08:00
mpilman
d09e07f1f5
Merge remote-tracking branch 'upstream/master' into features/icc
2020-02-04 10:26:18 -08:00
Evan Tschannen
4524831456
Merge pull request #2518 from vishesh/task/failmon-remove-server
...
FailureMonitoring: Server processes no longer need to talk to ClusterController
2020-02-03 17:22:50 -08:00
Andrew Noyes
2ce887012c
Respect api version for include_port_in_address
2020-02-03 15:25:30 -08:00
Xin Dong
7016f7903b
Fixed another build error. Do not use timeReplyIgnoreError since we do not want the logging inside that function, and thus it is no longer needed. Changed to use ready(), which basically ignores the error.
2020-01-31 15:48:29 -08:00
Alex Miller
ee6490c9d1
Merge pull request #2314 from mengranwo/memory-engine
...
New Radix-Tree based Memory Storage Engine
2020-01-30 16:20:13 -08:00
Xin Dong
7216961e46
Do not time the error.
2020-01-30 14:13:56 -08:00
Xin Dong
e21426d12a
Send error back to the GRV requests with batch priority when the cluster is saturated, instead of blindly enqueuing the requests and letting the client time out.
2020-01-30 14:13:56 -08:00
A.J. Beamon
cdeb0ee35b
Merge branch 'master' into slow-task-and-priority-tracking-improvements
2020-01-30 09:54:50 -08:00
A.J. Beamon
809586ec31
When logging boost errors, include message in addition to the error code
2020-01-30 08:55:41 -08:00
A.J. Beamon
182dac7cd5
Convert the slow task profiler into a run loop profiler that also logs when the run loop is 100% busy for a knob-configurable duration.
2020-01-28 12:09:37 -08:00
Evan Tschannen
9a620f3e6c
fixed bug in timer()
2020-01-27 17:43:59 -08:00
Evan Tschannen
231d7830a0
more accurate calculation on the amount of time that proxy should wait before getting a version from the master
2020-01-26 19:47:12 -08:00
Alex Miller
2bc5b2cf8a
Merge pull request #2585 from Ma27/fix-glibc230-build
...
Fix build with glibc 2.30
2020-01-23 20:21:32 -08:00
Evan Tschannen
76e192d490
Merge pull request #2538 from alexmiller-apple/hashlittle2-to-crc32c
...
Convert more hashlittle{,2} uses to crc32c_append
2020-01-23 17:54:38 -08:00
Evan Tschannen
6c0b934dda
Merge pull request #2242 from alexmiller-apple/fix-10min-stall-again
...
Fix the 10min multi-region recovery stall again
2020-01-23 17:53:02 -08:00
Maximilian Bosch
e133cb974b
Fix build with glibc 2.30
...
The `gettid()` function is part of glibc 2.30[1]. I decided to keep the
`gettid` implementation here under a different name to remain compatible
with older glibc versions.
[1] https://sourceware.org/ml/libc-alpha/2019-08/msg00029.html
2020-01-23 09:28:18 +01:00
Jingyu Zhou
17002740bb
Add epoch and backup workers to DBCoreState
...
This enables backup workers to know the end version of the epoch. Additionally,
the master recovery only needs to deal with crashed backup workers by
recruiting new workers to back up the unfinished version range.
2020-01-22 19:38:45 -08:00
Jingyu Zhou
a4d6ebe79e
Recruit backup worker in newEpoch
2020-01-22 19:37:48 -08:00
Evan Tschannen
78adbea834
Merge branch 'release-6.2'
...
# Conflicts:
# CMakeLists.txt
# documentation/sphinx/source/release-notes.rst
# flow/Knobs.h
# versions.target
2020-01-21 21:38:19 -08:00
Evan Tschannen
afd3ec13ff
added knobs
2020-01-21 18:58:34 -08:00
Alex Miller
1cb311fcb8
Add an ASSERT_WE_THINK that peek cursors don't get timed_out()
...
This should prevent us from regressing and having multi-region
recoveries hang for 10min again.
2020-01-21 17:07:37 -08:00
Evan Tschannen
7a4b459f07
wait for a tls handshake to complete before returning a connection
...
wait for multiple tls errors before throttling
2020-01-21 16:45:15 -08:00
Vishesh Yadav
daef5f011a
Merge remote-tracking branch 'apple/master' into task/failmon-remove-server
2020-01-21 13:20:15 -08:00
Evan Tschannen
3f9d9d8b84
Merge branch 'release-6.2'
...
# Conflicts:
# CMakeLists.txt
# cmake/FlowCommands.cmake
# documentation/sphinx/source/release-notes.rst
# fdbclient/StorageServerInterface.h
# fdbserver/DataDistributionTracker.actor.cpp
# fdbserver/MasterProxyServer.actor.cpp
# fdbserver/fdbserver.actor.cpp
# flow/Knobs.h
# flow/Platform.cpp
# versions.target
2020-01-16 18:37:47 -08:00
A.J. Beamon
f3d2a1990e
Revert unintended change
2020-01-15 15:53:48 -08:00
A.J. Beamon
173802f27e
Add some tracking for time spent in the run loop to the priority tracker. Add more frequent measurements for priority changes. Fix slow tasks not being measured in some cases.
2020-01-15 15:51:15 -08:00
mengranwo
115ff8bf65
change getKey() interface, pass in uint8_t * only
2020-01-15 13:49:45 -08:00
mengranwo
6836e370a7
revert changes inside KVStoreTest, ready for code review
2020-01-15 13:49:45 -08:00
mengranwo
f597aa7e18
WIP : deployable/stable version since Nov 3. Start rebase to master branch
2020-01-15 13:49:45 -08:00
Evan Tschannen
e65760eb46
Merge pull request #2536 from etschannen/feature-commit-latency
...
Improved commit latency in large clusters
2020-01-13 19:12:02 -08:00
Alex Miller
31fbf84ac5
Make FastAlloc use crc32c instead of hashlittle2
2020-01-13 18:23:12 -08:00
Alex Miller
da73164eda
Move crc32c from fdbrpc to flow
...
So that we can use it from a piece of flow code without breaking module
boundaries.
Also rename generated-constants to crc32c-generated-constants so that
it's more apparent that they're related files.
2020-01-13 18:19:30 -08:00
Evan Tschannen
0e916fdbed
throttle client TLS errors longer than server errors so that when both happen simultaneously the server throttling will be disabled when the client makes its next attempt
2020-01-12 22:12:18 -08:00
Evan Tschannen
1f7eb1f738
throttle outgoing tls connections before establishing a network connection
...
store serverTLSConnectionThrottler map inside of g_network, so that it works properly with simulation
2020-01-12 16:44:30 -08:00
Evan Tschannen
ef5dfb87dc
Merge pull request #2529 from bnamasivayam/tls-throtlling
...
Establishing TLS connection through the handshake process is expensiv…
2020-01-12 14:56:21 -08:00
Balachandar Namasivayam
741aa523e6
Establishing a TLS connection through the handshake process is expensive, and the fdbserver process can easily get saturated doing repeated TLS handshakes when only a few hundred clients have a bad certificate. Hence throttle the number of handshakes done on the server per client IP if it has a bad certificate.
2020-01-10 16:19:41 -08:00
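A per-client-IP throttle of the kind described in the commit above might look like the following Python sketch. It is an illustration under stated assumptions, not the FoundationDB implementation: `HandshakeThrottler`, `max_failures`, and `throttle_seconds` are hypothetical names, and the caller passes the current time explicitly so the behavior is deterministic.

```python
class HandshakeThrottler:
    """Hypothetical per-IP throttle for clients whose TLS handshakes keep failing."""

    def __init__(self, max_failures, throttle_seconds):
        self.max_failures = max_failures          # failures tolerated before rejecting
        self.throttle_seconds = throttle_seconds  # how long a failure record stays active
        self._failures = {}                       # ip -> (failure count, time of last failure)

    def record_failure(self, ip, now):
        count, _ = self._failures.get(ip, (0, now))
        self._failures[ip] = (count + 1, now)

    def record_success(self, ip):
        # A good handshake clears the record for that client.
        self._failures.pop(ip, None)

    def should_reject(self, ip, now):
        entry = self._failures.get(ip)
        if entry is None:
            return False
        count, last_failure = entry
        if now - last_failure >= self.throttle_seconds:
            # The throttle window has expired; forget the old failures.
            del self._failures[ip]
            return False
        return count >= self.max_failures


throttler = HandshakeThrottler(max_failures=2, throttle_seconds=5.0)
throttler.record_failure("10.0.0.1", now=0.0)
after_one = throttler.should_reject("10.0.0.1", now=1.0)
throttler.record_failure("10.0.0.1", now=1.0)
after_two = throttler.should_reject("10.0.0.1", now=2.0)
other_ip = throttler.should_reject("10.0.0.2", now=2.0)
after_expiry = throttler.should_reject("10.0.0.1", now=6.0)
print(after_one, after_two, other_ip, after_expiry)
```

Rejecting before the handshake starts is what saves the CPU cost; clients with good certificates are unaffected because they never accumulate a failure record.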
Evan Tschannen
2e20c12200
Merge pull request #2475 from ajbeamon/priority-busy-fixes
...
Fix PriorityBusy calculation and add PriorityMaxBusy
2020-01-10 12:47:17 -08:00
Evan Tschannen
176a1b6319
Merge pull request #2515 from ajbeamon/remove-timer-in-slowtask-profiler
...
Fix slow task profiler crash
2020-01-10 12:41:57 -08:00
Evan Tschannen
a5f544818c
Merge pull request #2420 from ajbeamon/trace-clock-source-fix
...
Revert change to make g_trace_clock thread_local, ...
2020-01-10 12:36:38 -08:00
Alvin Moore
7628d04fb9
Merge branch 'release-6.2' of github.com:apple/foundationdb into release_6.2_merge
...
# Conflicts:
# documentation/sphinx/source/release-notes.rst
2020-01-09 07:21:16 -08:00
Vishesh Yadav
598b2eaeb0
fdbrpc: Add warning when peer is unavailable for long time
2020-01-08 13:55:13 -08:00
Evan Tschannen
83ad9caf54
implemented a load balancing algorithm which evens out the number of requests processed by each proxy
2020-01-08 01:59:01 -08:00
A.J. Beamon
de5a591b15
Attempt a minor pointless change to fix the build
2020-01-06 15:17:13 -08:00
A.J. Beamon
6cf38790d6
Reorganize declaration of variable and add release note.
2020-01-06 12:27:56 -08:00
A.J. Beamon
4a52864023
Remove call of timer() from the slow task profiling signal handler, as it can lead to crashes if called at the wrong time.
2020-01-06 12:19:45 -08:00
Evan Tschannen
16b5af067c
changed trace event name
2020-01-03 16:03:29 -08:00
Evan Tschannen
deb032745a
fix: do not set logged until the end of the function
2020-01-03 12:45:23 -08:00
Evan Tschannen
1867d30017
added asserts to protect against future actions on a trace event that has been logged
2020-01-03 12:31:06 -08:00
Evan Tschannen
7152469cc3
log the base trace event before the endpoint messages
2020-01-03 12:15:38 -08:00
Evan Tschannen
6e473c3a83
Merge branch 'release-6.2' into feature-addpeer-fix
2020-01-02 17:37:23 -08:00
Evan Tschannen
032797ca5c
Merge pull request #2430 from etschannen/release-6.2
...
Reduce recovery times caused by saturating the cluster controller
2020-01-02 17:35:59 -08:00
A.J. Beamon
3dd3ac3cfd
Merge branch 'release-6.2' into trace-clock-source-fix
...
# Conflicts:
# documentation/sphinx/source/release-notes.rst
2020-01-02 15:14:12 -08:00
A.J. Beamon
ca01593067
Cap busyness to 1.0 at logging time to cover all cases where it could be measured above.
2020-01-02 15:10:42 -08:00
Evan Tschannen
9e137d3b49
fix: addPeerReference only marks a connection as healthy if it is the first peerReference
...
added additional logging to long LoadBalance calls, and when the failure monitor state changes for an address
2019-12-19 18:26:29 -08:00
A.J. Beamon
3b28c7f103
Throw the correct error in deleteFile
2019-12-19 14:13:09 -08:00
A.J. Beamon
414be7a0e4
Merge pull request #2479 from AlvinMooreSr/release_6.2_merge
...
Release 6.2 merge
2019-12-19 09:36:05 -08:00
Evan Tschannen
d8c3c2fda4
Improved prioritization of commit path on the proxies
2019-12-18 16:56:35 -08:00
Vishesh Yadav
902029bbec
Merge pull request #2448 from mpilman/issues/2446
...
Changed failure monitor ping delay to 1 second
2019-12-18 13:26:34 -08:00
Alvin Moore
21390c493a
Merge branch 'release-6.2' of github.com:apple/foundationdb into release_6.2_merge
...
Resolved merge by keeping new test file from master branch: SampleNoSimAttrition.txt, and adding new constraint from Release branch about existing test file: SimpleExternalTest.txt
# Conflicts:
# tests/CMakeLists.txt
2019-12-18 09:05:08 -08:00
A.J. Beamon
a093021855
Fix priority time calculation. Track max priority busy rather than seconds squared.
2019-12-17 09:14:54 -08:00
Alvin Moore
5080b8293a
Added Windows required library for function GetProcessMemoryInfo
2019-12-16 08:09:10 -08:00
Alvin Moore
3bf971ba8b
Merge branch 'release-6.2' of github.com:apple/foundationdb into release_6.2_merge
...
# Conflicts:
# documentation/sphinx/source/release-notes.rst
# fdbserver/storageserver.actor.cpp
2019-12-12 07:13:12 -08:00
mpilman
7a62d3b526
Changed failure monitor ping delay to 1 second
2019-12-11 11:23:24 -08:00
Evan Tschannen
3c30215662
Merge branch 'release-6.2' of github.com:apple/foundationdb into release-6.2
2019-12-09 13:18:07 -08:00
A.J. Beamon
20eacdb434
Add missing include
2019-12-06 15:18:17 -08:00
A.J. Beamon
9866d1ce27
Revert change to make g_trace_clock thread_local, instead checking we are on the correct thread when getting the time.
2019-12-06 10:15:49 -08:00
Andrew Noyes
78b202f3a4
Apply A.J.'s suggestion to randomInt as well
2019-12-05 11:01:41 -08:00
Andrew Noyes
cf5cdc4e93
Update flow/DeterministicRandom.cpp
...
Include equality now that we've adjusted the value by 1.
Co-Authored-By: A.J. Beamon <ajbeamon@users.noreply.github.com>
2019-12-05 11:01:03 -08:00
Andrew Noyes
b09f0b334b
Take A.J.'s suggestion, which fixes A.J.'s counterexample
2019-12-05 10:27:35 -08:00
Andrew Noyes
96e71bb109
Fix old Make build
2019-12-04 17:06:49 -08:00
Andrew Noyes
89a093e035
Accept UBSAN's suggestion
...
/home/anoyes/workspace/foundationdb/flow/DeterministicRandom.cpp:72:29: runtime error: negation of -9223372036854775808 cannot be represented in type 'long int'; cast to an unsigned type to negate this value to itself
2019-12-04 16:45:45 -08:00
Andrew Noyes
4f943be21d
Move DeterministicRandom impl to its own translation unit
...
This will allow me to recompile faster after making changes, and should
(slightly) speed up overall compilation.
I manually verified that the unseed matched for one test before and
after this change, so I probably didn't screw up the refactor
2019-12-04 16:45:32 -08:00
Evan Tschannen
5a6bc2aa71
increase the priority of cluster controller recruitment to prefer recruitment over sending serverDBInfo
2019-12-04 16:28:41 -08:00
Evan Tschannen
5f1ef53f62
increase the priority at which the cluster controller registers workers to avoid having a saturated cluster controller recruit a master without all available workers
2019-12-04 16:17:41 -08:00
Andrew Noyes
46d10dc7dc
Fix "null passed as argument declared not null"
...
Fix several such reports from ubsan
E.g.
/Users/anoyes/workspace/foundationdb/flow/Arena.h:794:16: runtime error: null pointer passed as argument 1, which is declared to never be null
2019-12-03 14:46:53 -08:00
Andrew Noyes
4022ca1381
Add USE_UBSAN cmake option
2019-12-02 16:20:06 -08:00
Andrew Noyes
7f263a2614
Require callers to meet alignment requirements for aligned_alloc
2019-12-02 15:30:54 -08:00
Evan Tschannen
07331ab5fd
Merge pull request #2362 from etschannen/master
...
Merge 6.2 into master
2019-12-02 15:04:27 -08:00
Andrew Noyes
ff8758b1fd
Request alignment that's at least sizeof(void*)
...
According to https://en.cppreference.com/w/c/memory/aligned_alloc#Notes ,
aligned_alloc may return nullptr if it doesn't like the requested
alignment. Let's also detect if nullptr is returned.
2019-12-02 12:51:33 -08:00
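Note on ff8758b1fd: the two points in the message (request an alignment of at least sizeof(void*), and detect a nullptr result) can be sketched as below. This assumes a C++17 standard library that provides std::aligned_alloc; the wrapper name alignedAllocOrThrow, the size rounding, and the choice to throw std::bad_alloc are illustrative assumptions, not the actual FoundationDB change.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdlib>
#include <new>

// Hypothetical wrapper (an illustration only). The caller must pass a
// power-of-two alignment; the result must be released with std::free.
inline void* alignedAllocOrThrow(std::size_t alignment, std::size_t size) {
    // Raise small alignments to at least sizeof(void*).
    alignment = std::max(alignment, sizeof(void*));
    // Round the size up to a non-zero multiple of the alignment, which
    // some implementations of aligned_alloc require.
    std::size_t rounded = (size + alignment - 1) / alignment * alignment;
    if (rounded == 0) rounded = alignment;
    void* p = std::aligned_alloc(alignment, rounded);
    // aligned_alloc may return nullptr for an alignment it rejects or
    // when memory is exhausted, so check before handing the pointer out.
    if (p == nullptr) throw std::bad_alloc();
    return p;
}
```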
Meng Xu
a2c84d932d
Merge pull request #2396 from atn34/atn34/fix-invalid-vptr
...
Fix invalid vptr
2019-11-27 21:16:49 -08:00
Meng Xu
7eaf76bacf
Merge pull request #2389 from atn34/atn34/default-init-flatbuffers
...
Default initialize absent flatbuffers members
2019-11-27 21:16:27 -08:00
Andrew Noyes
8fc74e3182
Fix UBSAN error
...
Since QuorumCallback<T> is a non-trivial type, we need to construct it
before we interact with it.
This change fixes the following UBSAN message:
/Users/anoyes/workspace/foundationdb/flow/genericactors.actor.h:930:18: runtime error: member access within address 0x0001243f63d0 which does not point to an object of type 'Callback<Standalone<StringRef> >'
0x0001243f63d0: note: object has invalid vptr
2019-11-27 13:16:48 -08:00
Andrew Noyes
3a9fd29d3c
Initialize memory for Optional and ErrorOr
...
This does _not_ fix any potential uses of uninitialized memory. Without
this change, gcc issues false-positive -Wuninitialized warnings
I'm hoping this does not have a noticeable impact on performance
2019-11-26 11:31:30 -08:00
Andrew Noyes
c4e01301b0
Fix a potential UB instance
...
Writing a value which is not 0 or 1 to the underlying memory of a bool
is undefined behavior. Conformant flatbuffers implementations must
accept bytes that are not 0 or 1 as booleans [1]. (Conformant
implementations are only allowed to write the byte 0 or 1 as a boolean
[1].)
So this protects us from undefined behavior if we ever read a
flatbuffers message written by an almost-conformant implementation.
[1]: https://github.com/dvidelabs/flatcc/blob/master/doc/binary-format.md#boolean
2019-11-26 11:18:17 -08:00
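Note on c4e01301b0: one way to avoid the undefined behavior described in the message is to read the serialized byte as an integer and derive the bool by comparison, rather than copying the byte into a bool's storage. The function name loadBool below is hypothetical; this is a sketch of the general technique, not the code from the commit.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical reader (an illustration only): decodes a serialized
// boolean from a single byte.
inline bool loadBool(const uint8_t* src) {
    uint8_t raw;
    // Copy into an integer type, for which every bit pattern is valid.
    std::memcpy(&raw, src, sizeof(raw));
    // Any non-zero byte is treated as true, so the bool object itself
    // only ever holds a valid value.
    return raw != 0;
}
```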
Andrew Noyes
17ab2f8e00
Default initialize absent flatbuffers members
2019-11-26 10:58:29 -08:00
Evan Tschannen
3c769fcf60
Merge branch 'release-6.2'
...
# Conflicts:
# CMakeLists.txt
# documentation/sphinx/source/release-notes.rst
# fdbserver/ClusterController.actor.cpp
# fdbserver/MasterProxyServer.actor.cpp
# versions.target
2019-11-22 15:39:19 -08:00
Evan Tschannen
ebcb2f79ed
Merge branch 'master' of github.com:apple/foundationdb
2019-11-22 15:34:49 -08:00
Meng Xu
cb77c7dd47
Merge pull request #2355 from atn34/more-no-discard
...
Refactor Notified
2019-11-22 12:09:10 -08:00
Evan Tschannen
27cb299d84
simulation can sometimes randomly hang or throw connection_failed, instead of always doing one or the other
2019-11-21 16:24:18 -08:00
Evan Tschannen
2727b91c46
simulation tests network connections failing due to errors instead of just hanging
2019-11-21 12:33:07 -08:00
Evan Tschannen
dbfa3dc217
Merge pull request #2200 from negoyal/storage-cache-subfeature1
...
Storage cache subfeature1
2019-11-20 13:59:06 -08:00
A.J. Beamon
b5a450b4c6
Fix capitalization error
2019-11-15 12:41:08 -08:00
A.J. Beamon
ed8d3f163c
Rename hgVersion to sourceVersion.
2019-11-15 12:26:51 -08:00
Evan Tschannen
8d3ef89540
Merge branch 'release-6.2'
...
# Conflicts:
# CMakeLists.txt
# documentation/sphinx/source/release-notes.rst
# fdbclient/MutationList.h
# fdbserver/MasterProxyServer.actor.cpp
# versions.target
2019-11-14 15:49:56 -08:00
Andrew Noyes
b4aa72303f
Add [[nodiscard]] for whenAtLeast, and make Notified generic
2019-11-13 13:30:34 -08:00
negoyal
a4a0bf18f9
Merging with Master.
2019-11-12 13:01:29 -08:00
Evan Tschannen
1e5677b55a
increase the priority of reboot and recruitment requests
2019-11-11 15:17:11 -08:00
Meng Xu
e7210fe842
Trace:Resolve review comments and add SevVerbose level
2019-11-05 09:42:29 -08:00
Meng Xu
c4d1e6e1a9
Trace:Severity:Include SevNoInfo to mute trace
...
Define SevFRMutationInfo to trace mutations in restore.
2019-11-04 16:18:40 -08:00
Jingyu Zhou
00b3c8f48a
Revert "Clean up some memory after network thread exits"
2019-11-01 11:05:31 -07:00
Jingyu Zhou
6c28da9093
Clean up some memory after network thread exits
2019-10-29 13:59:55 -07:00
Andrew Noyes
b7b5d2ead3
Remove several nonsensical const uses
...
These seem to be all the ones that clang's -Wignored-qualifiers
complains about
2019-10-26 14:30:34 -07:00
Andrew Noyes
e4acd2e318
Disable TLS temporarily for OPEN_FOR_IDE build
2019-10-25 10:42:22 -07:00
Evan Tschannen
3325980c03
Merge branch 'release-6.2'
...
# Conflicts:
# CMakeLists.txt
# documentation/sphinx/source/release-notes.rst
# fdbserver/DataDistribution.actor.cpp
# fdbserver/OldTLogServer_6_0.actor.cpp
# fdbserver/TLogServer.actor.cpp
# fdbserver/WorkerInterface.actor.h
# fdbserver/worker.actor.cpp
# versions.target
2019-10-24 17:38:15 -07:00
mpilman
325a8e4213
remove confusing USE_ODIRECT knob
2019-10-24 11:44:03 -07:00
mpilman
f23392ec5a
Don't use O_DIRECT in EIO by default
2019-10-24 11:39:55 -07:00
mpilman
7ad0e20e48
Added knob to disable O_DIRECT
2019-10-24 11:20:14 -07:00
mpilman
f41f19b5f6
Introduced knob to set eio parallelism
2019-10-24 11:20:14 -07:00
Stephen Atherton
0e51a248b4
Merge branch 'release-6.2' of github.com:apple/foundationdb into feature-redwood
2019-10-23 10:12:54 -07:00
Evan Tschannen
688940b685
merge 6.2 into master
2019-10-21 11:43:46 -07:00
mpilman
a79757a788
Fix compiler errors on Catalina
...
Fixes #2263
2019-10-21 11:15:37 -07:00
Evan Tschannen
42b7acf7b7
Merge pull request #2202 from etschannen/feature-share-mutations
...
Backup and DR would not share mutations if started on different versions of FDB
2019-10-16 20:28:39 -07:00
Evan Tschannen
35ef9b32de
fix: if establishing a TLS connection took longer than 10ms, we could spend all our CPU establishing new connections instead of pinging to maintain existing connections, leading to an infinite loop
2019-10-16 17:26:01 -07:00
Andrew Noyes
b149aee260
Include hgVersion.h in FLOW_SRCS
...
This way if we rebuild after reconfiguring, the binaries will pick up
the new hgVersion.h
2019-10-16 02:13:20 -07:00
A.J. Beamon
3ba8fd95b5
Add script to parse output from enabling ALLOC_INSTRUMENTATION_STDOUT
2019-10-08 15:50:47 -07:00
Jingyu Zhou
396b10caca
Add memory profiling for FastAlloc when gperftool is used
...
FastAlloc is the major memory use case in FDB, yet we can't profile its usage.
This commit replaces FastAlloc memory allocation with malloc so that we may
track its memory usage when gperftool is used.
2019-10-07 19:27:06 -07:00
Evan Tschannen
1b946d588f
Merge pull request #2208 from alexmiller-apple/faster-txstag-recovery
...
Recover Txs Faster [0/?]: Combine spill-by-value and spill-by-reference into one file/SharedTLog
2019-10-07 11:15:56 -07:00
Alex Miller
d38a96ab73
Make LogData aware of the spill type it was created to perform.
...
The spilling type is now pulled out of the request, and then stored on
LogData for later access, and persisted in the tlog metadata per tlog
generation.
It turns out that serializing types as Unversioned is a bit wonky.
2019-10-03 01:45:10 -07:00
Meng Xu
d0147e5e5d
Merge branch 'release-6.2' into mengxu/merge-release620-to-master-v3
...
Resolved Conflicts:
documentation/sphinx/source/release-notes.rst
fdbserver/DataDistribution.actor.cpp
versions.target
2019-10-02 13:22:56 -07:00