This patch keeps a pool of Java `DirectBuffers`, which can be shared with the
JNI/C world. This means:
1. No need for the JNI wrapper to make several JNI calls to allocate and
   convert Java objects to bytes.
2. We already made PR #3582 to reduce the 3 JNI calls made for each
   `getRange()`, i.e. to fetch the summary and then the results. As mentioned
   in that PR, this patch makes a similar decision: the `getDirectRange()` JNI
   call is synchronous, and we instead schedule it asynchronously on a Java
   executor.
3. No need for JNI to dynamically allocate buffers to store KVs.
4. Use one big `DirectBuffer` to make the request and get the response.
   `DirectBuffers` give direct access to the memory, and are much faster than
   the regular non-direct buffers we use.
5. We keep a pool of reasonably big `DirectBuffers`, which are borrowed and
   returned by `getRange()` requests (see the sketch after this list).
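To make the pooling concrete, here is a minimal sketch of what such a pool
could look like. The class name and sizing constants (`DirectBufferPool`,
`POOL_SIZE`, `BUFFER_SIZE`) are illustrative, not the actual identifiers in
this patch:

```java
import java.nio.ByteBuffer;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Illustrative sketch of a DirectBuffer pool; names and sizes are
// assumptions, not the patch's actual code.
final class DirectBufferPool {
    private static final int POOL_SIZE = 16;            // few, since direct buffers are expensive
    private static final int BUFFER_SIZE = 1024 * 1024; // one reasonably big buffer per request

    private final BlockingQueue<ByteBuffer> pool = new ArrayBlockingQueue<>(POOL_SIZE);

    DirectBufferPool() {
        for (int i = 0; i < POOL_SIZE; i++) {
            // allocateDirect returns memory that JNI code can reach via
            // GetDirectBufferAddress, without copying through the JVM heap
            pool.add(ByteBuffer.allocateDirect(BUFFER_SIZE));
        }
    }

    // Borrow a buffer; blocks when all buffers are in use, which is what
    // bounds the number of outstanding getRange() requests.
    ByteBuffer borrow() throws InterruptedException {
        return pool.take();
    }

    // Return a cleared buffer so the next request starts fresh.
    void give(ByteBuffer buffer) {
        buffer.clear();
        pool.add(buffer);
    }
}
```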
The downsides to this are:
1. We have to manually and "carefully" serialize and deserialize the
   request/response in both the C and Java worlds; they are no longer
   high-level Java objects (a sketch of what this involves follows this list).
2. Because `DirectBuffers` are expensive, we can only keep a few of them, so
   the number of outstanding `getRange()` requests is limited by the pool
   size.
3. Each request currently uses 2 buffers: one for the current chunk and one
   for the outstanding request.
4. The performance bump seems to be excellent for bigger key-values. We didn't
   observe a significant difference for smaller KV sizes (I can't say whether
   it's better or worse, as from a quick glance the difference didn't look
   statistically significant to me).
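As an illustration of downside 1, here is a sketch of the kind of hand-rolled
layout involved; the actual wire format used by the patch may differ, and
`RangeRequestCodec` is a hypothetical name:

```java
import java.nio.ByteBuffer;

// Hypothetical sketch of manual request/response framing; the patch's real
// wire format may differ.
final class RangeRequestCodec {
    // Write a getRange request as raw bytes: [beginLen][begin][endLen][end][limit]
    static void serializeRequest(ByteBuffer buf, byte[] begin, byte[] end, int limit) {
        buf.clear();
        buf.putInt(begin.length).put(begin);
        buf.putInt(end.length).put(end);
        buf.putInt(limit);
    }

    // Read one key-value pair back out of the response buffer. Both sides
    // must agree on this layout exactly; a mismatch silently corrupts data
    // rather than failing with a type error, which is the "carefully" part.
    static byte[][] deserializeKeyValue(ByteBuffer buf) {
        byte[] key = new byte[buf.getInt()];
        buf.get(key);
        byte[] value = new byte[buf.getInt()];
        buf.get(value);
        return new byte[][] { key, value };
    }
}
```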
Performance is currently measured using `PerformanceTester.java`, which measures
throughput for several operations. The results are:
```
1. Using Key = 16bytes, Value = 100bytes
=========================================
** Without this PR ==>
Count Avg Min Q1 Median Q3 Max
------------------------------------------------------- ------- ------ ----- ------ -------- ------ ------
get_range throughput (local client) [keys/s] 30 349363 73590 316218 342523 406445 540731
get_single_key_range throughput (local client) [keys/s] 30 7685 6455 6981 7744 8129 9773
** With this PR ==>
Count Avg Min Q1 Median Q3 Max
------------------------------------------------------- ------- ------ ----- ------ -------- ------ ------
get_range throughput (local client) [keys/s] 30 383404 70181 338810 396950 437335 502886
get_single_key_range throughput (local client) [keys/s] 30 7029 5515 6635 7090 7353 8219
=======================================
2. Using Key = 256bytes, Value = 512bytes
========================================
** Without this PR ==>
Count Avg Min Q1 Median Q3 Max
------------------------------------------------------- ------- ------ ------ ------ -------- ------ ------
get_range throughput (local client) [keys/s] 90 132787 102036 122650 130204 138269 202790
get_single_key_range throughput (local client) [keys/s] 90 5833 4894 5396 5690 6061 8986
** With this PR ==>
Count Avg Min Q1 Median Q3 Max
------------------------------------------------------- ------- ------ ------ ------ -------- ------ ------
get_range throughput (local client) [keys/s] 90 359302 196676 310931 344029 407232 494259
get_single_key_range throughput (local client) [keys/s] 90 7227 5573 6771 7177 7477 10108
=======================================
3. Using Key = 128bytes, Value = 512bytes
========================================
** Without this PR ==>
Count Avg Min Q1 Median Q3 Max
------------------------------------------------------- ------- ------ ------ ------ -------- ------ ------
get_range throughput (local client) [keys/s] 30 235661 148963 213670 229090 256825 317050
get_single_key_range throughput (local client) [keys/s] 30 10441 6302 10586 10873 10960 11065
** With this PR ==>
Count Avg Min Q1 Median Q3 Max
------------------------------------------------------- ------- ------ ------ ------ -------- ------ ------
get_range throughput (local client) [keys/s] 30 350612 185698 320868 348998 406750 459101
get_single_key_range throughput (local client) [keys/s] 30 10338 6570 10664 10847 10901 11040
=======================================
```
NOTE: These tests were run on a shared VM. The benchmarks in each group were
run serially, and the groups themselves were run at different times. There may
therefore be some skew based on load, but the difference is compelling enough
to show that there is a performance benefit for larger KVs.
RangeQuery makes getSummary() and getResult() JNI calls, which are redundant in
nature. This patch combines them into a single call, reducing 3 JNI calls to 2.
The next logical step is to remove the 2nd JNI call, i.e. getResults() after
getRange(), which is slightly more convoluted because the C API doesn't provide
primitives to compose new Futures.
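Since the remaining native call is synchronous, the asynchronous API can be
preserved by running it on a Java executor, as mentioned in point 2 above. A
minimal sketch of that pattern; all names here (`NativeCall`,
`getDirectRange`) are illustrative, not the actual bindings code:

```java
import java.nio.ByteBuffer;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;

// Sketch: a blocking JNI call runs on a Java executor and completes a
// CompletableFuture, so callers still see an async API. Names are
// assumptions for illustration only.
final class DirectRangeExample {
    interface NativeCall {
        // stands in for the synchronous JNI entry point that fills the buffer
        void fill(ByteBuffer buffer);
    }

    static CompletableFuture<ByteBuffer> getDirectRange(NativeCall nativeCall,
                                                        ByteBuffer directBuffer,
                                                        Executor executor) {
        CompletableFuture<ByteBuffer> result = new CompletableFuture<>();
        executor.execute(() -> {
            try {
                nativeCall.fill(directBuffer); // one synchronous JNI crossing
                result.complete(directBuffer); // caller deserializes summary + KVs
            } catch (Throwable t) {
                result.completeExceptionally(t);
            }
        });
        return result;
    }
}
```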