Memory profiling a FoundationDB layer implemented in Go shows high
memory pressure and increased GC times when performing highly
concurrent multi-key transactions against the database. Further digging
shows that the memory pressure originates when packing the keys for the
transaction into byte slices: the most salient issue is that memory
during the packing process is allocated based on the number of elements
to pack, rather than on the total size of the resulting byte slice.
This commit attempts to reduce the amount of memory allocated when
calling `Tuple.Pack` for most (all?) usage patterns, both in the number
of allocations and in the total allocated size.
The following optimizations have been implemented:
- Remove `bytes.Buffer` usage in `encodeTuple`: the `Buffer` struct is
quite expensive for the key sizes we're looking to generate, both
allocation- and performance-wise. A `packer` struct has been
implemented that builds the keys "naively" by calling `append` on a
slice; slice growth through `append` is amortized just like in
`bytes.Buffer`. (A sketch of this approach follows the list.)
- Do not use `bytes.Replace` in `encodeBytes`: this function is
particularly expensive because it always allocates a copy of the byte
slice, even when it contains no nil bytes. Instead, the replacement
step has been implemented manually in `packer.putbytesNil`, where it
writes the escaped bytes directly into the output slice without
allocating intermediate memory. Having this local function also lets
the compiler avoid duplicating input `string`s when converting them to
`[]byte`; previously, a copy of every string to pack was always
allocated because the compiler couldn't prove that `bytes.Replace`
wouldn't modify the slice.
- Use stack space in `encode[Float|Double|Int]`: all the numerical
packing functions were allocating large amounts of memory because of
temporary `bytes.Buffer` objects and `binary.Write` calls. The sizes of
the packed data are always known up front (either 4 or 8 bytes
depending on the type), so the big-endian packing can be performed
directly on the stack with `binary.BigEndian.PutUint[32|64]`, which
avoids the `interface{}` conversion required by the `binary.Write` API
and on x64 compiles to a `mov` + `bswap` instruction pair. (Also
sketched after the list.)
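For illustration, here is a minimal sketch of the append-based `packer`
and the manual nil-byte escaping described above. The method names,
helper shapes, and the example in `main` are assumptions made for this
sketch, not the literal code in the commit (the real encoder handles
many more element types, type codes, and nesting):

```go
package main

import (
	"bytes"
	"fmt"
)

// packer builds a key by appending to a single byte slice. Growth via
// append is amortized just like bytes.Buffer's, without the Buffer
// bookkeeping overhead.
type packer struct {
	buf []byte
}

func (p *packer) putByte(b byte)    { p.buf = append(p.buf, b) }
func (p *packer) putBytes(b []byte) { p.buf = append(p.buf, b...) }

// putBytesNil writes b while escaping every 0x00 as {0x00, 0xFF},
// directly into the output buffer. This replaces bytes.Replace, which
// always allocated a fresh copy of the input. i is the index of the
// first nil byte, already located by the caller.
func (p *packer) putBytesNil(b []byte, i int) {
	for i >= 0 {
		p.putBytes(b[:i+1]) // up to and including the nil byte
		p.putByte(0xFF)     // escape marker
		b = b[i+1:]
		i = bytes.IndexByte(b, 0x00)
	}
	p.putBytes(b) // remainder contains no nil bytes
}

// encodeBytes shows the fast path: when the input contains no nil
// bytes, it is appended directly with no intermediate allocation.
func (p *packer) encodeBytes(code byte, b []byte) {
	p.putByte(code)
	if i := bytes.IndexByte(b, 0x00); i >= 0 {
		p.putBytesNil(b, i)
	} else {
		p.putBytes(b)
	}
	p.putByte(0x00) // terminator
}

func main() {
	p := &packer{buf: make([]byte, 0, 64)}
	p.encodeBytes(0x01, []byte("foo\x00bar"))
	fmt.Printf("% x\n", p.buf) // 01 66 6f 6f 00 ff 62 61 72 00
}
```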
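Likewise, a sketch of the stack-based numeric packing. The function
shape and names are illustrative; the bit flips are the tuple layer's
standard IEEE 754 ordering transform, and `0x21` is the double type
code:

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math"
)

// encodeDouble packs a float64 without a temporary bytes.Buffer or a
// binary.Write call: the 8-byte scratch array lives on the stack, and
// binary.BigEndian.PutUint64 avoids the interface{} conversion that
// binary.Write requires.
func encodeDouble(out []byte, d float64) []byte {
	var scratch [8]byte
	binary.BigEndian.PutUint64(scratch[:], math.Float64bits(d))
	if scratch[0]&0x80 != 0 {
		// negative: flip all bits so larger magnitudes sort lower
		for i := range scratch {
			scratch[i] ^= 0xFF
		}
	} else {
		// positive: set the sign bit so positives sort above negatives
		scratch[0] ^= 0x80
	}
	out = append(out, 0x21) // double type code
	return append(out, scratch[:]...)
}

func main() {
	fmt.Printf("% x\n", encodeDouble(nil, 3.14))
	// 21 c0 09 1e b8 51 eb 85 1f
}
```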
As a result of these optimizations, the "average" case of key packing
can now create a key with a single allocation. More complex key packing
operations, even those involving strings or byte slices that contain
nil bytes, now allocate in an amortized fashion: the number of
allocations is driven by the growth of the output buffer, not by the
number of Tuple elements to pack.
Additionally, the reduced memory allocation and better use of the
`binary` APIs yield a very significant runtime improvement for key
packing: all packing operations are between 2x and 6x faster.
Before/after benchmarks are as follows:
benchmark                                 old ns/op   new ns/op   delta
BenchmarkTuplePacking/Simple-4            310         76.4        -75.35%
BenchmarkTuplePacking/Namespaces-4        495         137         -72.32%
BenchmarkTuplePacking/ManyStrings-4       960         255         -73.44%
BenchmarkTuplePacking/ManyStringsNil-4    1090        392         -64.04%
BenchmarkTuplePacking/ManyBytes-4         1409        399         -71.68%
BenchmarkTuplePacking/ManyBytesNil-4      1364        533         -60.92%
BenchmarkTuplePacking/LargeBytes-4        319         107         -66.46%
BenchmarkTuplePacking/LargeBytesNil-4     638         306         -52.04%
BenchmarkTuplePacking/Integers-4          2764        455         -83.54%
BenchmarkTuplePacking/Floats-4            3478        482         -86.14%
BenchmarkTuplePacking/Doubles-4           3654        575         -84.26%
BenchmarkTuplePacking/UUIDs-4             366         211         -42.35%

benchmark                                 old allocs  new allocs  delta
BenchmarkTuplePacking/Simple-4            6           1           -83.33%
BenchmarkTuplePacking/Namespaces-4        11          1           -90.91%
BenchmarkTuplePacking/ManyStrings-4       18          2           -88.89%
BenchmarkTuplePacking/ManyStringsNil-4    18          2           -88.89%
BenchmarkTuplePacking/ManyBytes-4         23          3           -86.96%
BenchmarkTuplePacking/ManyBytesNil-4      22          2           -90.91%
BenchmarkTuplePacking/LargeBytes-4        3           2           -33.33%
BenchmarkTuplePacking/LargeBytesNil-4     3           2           -33.33%
BenchmarkTuplePacking/Integers-4          63          3           -95.24%
BenchmarkTuplePacking/Floats-4            62          2           -96.77%
BenchmarkTuplePacking/Doubles-4           63          3           -95.24%
BenchmarkTuplePacking/UUIDs-4             2           2           +0.00%

benchmark                                 old bytes   new bytes   delta
BenchmarkTuplePacking/Simple-4            272         64          -76.47%
BenchmarkTuplePacking/Namespaces-4        208         64          -69.23%
BenchmarkTuplePacking/ManyStrings-4       512         192         -62.50%
BenchmarkTuplePacking/ManyStringsNil-4    512         192         -62.50%
BenchmarkTuplePacking/ManyBytes-4         864         448         -48.15%
BenchmarkTuplePacking/ManyBytesNil-4      336         192         -42.86%
BenchmarkTuplePacking/LargeBytes-4        400         192         -52.00%
BenchmarkTuplePacking/LargeBytesNil-4     400         192         -52.00%
BenchmarkTuplePacking/Integers-4          3104        448         -85.57%
BenchmarkTuplePacking/Floats-4            2656        192         -92.77%
BenchmarkTuplePacking/Doubles-4           3104        448         -85.57%
BenchmarkTuplePacking/UUIDs-4             256         192         -25.00%
Although the Go bindings to FoundationDB are thoroughly tested as part
of the `bindingtester` operation, this commit implements a more-or-less
complete test suite that checks the serialized output of `Tuple.Pack`
against golden files. This will make future optimizations and
refactorings of the packing code much simpler to verify.
The same test cases used to verify correctness are also used as a
benchmark suite to measure the amount of memory allocated in the
different operations.
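A rough sketch of what such a golden-file suite can look like; the case
names, the `testdata` layout, and the fixture files here are
hypothetical, not the commit's actual ones:

```go
package tuple_test

import (
	"bytes"
	"io/ioutil"
	"path/filepath"
	"testing"

	"github.com/apple/foundationdb/bindings/go/src/fdb/tuple"
)

// packCases maps a case name to the Tuple it packs; each name has a
// matching serialized file under testdata/. Names and layout are
// hypothetical.
var packCases = map[string]tuple.Tuple{
	"Simple":   {int64(1), "hello"},
	"Integers": {int64(1), int64(-42), int64(1) << 33},
}

func TestTuplePackGolden(t *testing.T) {
	for name, tup := range packCases {
		want, err := ioutil.ReadFile(filepath.Join("testdata", name+".golden"))
		if err != nil {
			t.Fatal(err)
		}
		if got := tup.Pack(); !bytes.Equal(got, want) {
			t.Errorf("%s: got % x, want % x", name, got, want)
		}
	}
}

// The same cases double as the benchmark suite; running with
// -benchmem reports ns/op, allocs/op and bytes/op per pattern.
func BenchmarkTuplePacking(b *testing.B) {
	for name, tup := range packCases {
		tup := tup
		b.Run(name, func(b *testing.B) {
			b.ReportAllocs()
			for i := 0; i < b.N; i++ {
				_ = tup.Pack()
			}
		})
	}
}
```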
The `available_classes` function uses `Subspace.unpack` to obtain the
tuple, not `fdb.tuple`, so the description has been updated to reflect
this.
The limited-seat tutorial section was missing the code and text
describing the update to the `drop` function.
The `drop` function description was moved up, adjacent to the other
functions, to improve document flow; this made this section redundant,
hence its removal.
* The initial plaintext file converts nicely - this just converts
operations to small headers and uses `&lt;` and `&gt;` to make sure
text in `<>` shows up
* Use inline code in a few places where it makes sense