Shared and exclusive locks are of different types in STL so we can't
easily return one or the other from poolLock() as per the write
argument. Just convert all poolLock() calls sites to name their
lock type locally.
On strings that are not \0-terminated (which are a big reason for the
existence of this function), the while-loop would try to compare the
first character beyond the specified buffer for '\0' before realizing
we're already beyond the end when checking n. Should be mostly harmless
in practise as the check for n would still terminate it, but not right.
In particular this trips up address sanitizer with the bdb backend where
some of the returned strings are not \0-terminated.
Test for string length first, and move the decrementing side-effect into
the loop for better readability.
The shared string pool is in a very central role in several operations,
it's kinda embarrasing that we haven't had any thread protection
on it. Not that anybody has asked either, prior to coming up as part
of #226 (to enable threaded package creation).
Test-suite and couple of smoke-tests with the #226 pass, but only
lightly tested. Then again, it's relatively straightforward. As a general
rule, locks are taken on all exported interfaces on entry and released
on exit, internal callers never lock anything. In rpm usage at least,
performance hit seems negligible.
- This simplifies things a bit as we dont need to worry about the
id storage and the starting location of the next string in advance.
- Also make it clearer the string is copied into the current chunk,
to which pool->offs only points to. Make pool->offs const to
enforce the strings are never written through it.
- Assign newly alloc'ed chunks to pool->chunks, pool->offs just
contains pointers into the chunks. This doesn't change actual
behavior at all, just (IMO) clarifies the code a bit.
- As pointed out by Michael Schroeder in
http://lists.rpm.org/pipermail/rpm-maint/2013-September/003605.html,
the dummy entries used for optimizing rpmstrPoolStrlen() are
problematic in number of ways:
- Walking the id's in a pool is unreliable, and rehashing can cause
bogus empty strings to be added to a pool where they otherwise
do not exist
- rpmstrPoolNumStr() is not accurate when more than one chunk is in use
- Unfortunately this means giving up the rpmstrPoolStrlen() optimization,
for now at least.
- Use multiple chunks that get allocated as old ones get filled up
instead of reallocating, store direct pointers to the strings in
the id array.
- This prevents nasty surprises when previously retrieved pointer
to a pool string goes invalid underneath us due to somebody
adding a new string, and restores former rpm API behavior:
string pointers retrieved from eg rpmds and rpmfi are valid for
the entire lifetime of those objects.
- Allow looking up and inserting partial key strings, this is useful
in various cases where previously a local copy was needed for
\0-terminating the key in the caller.
- Take advantage of rstrlenhash() in rpmstrPoolId(), previously the
length was only interesting when adding so we wasted a strlen()
on every call when the string was already in the pool.
- String hashing needs to walk the entire string anyhow, might as well
take advantage of this and have it return the string length to avoid
having to separately call strlen() in the cases where this matters.
- Move the implementation into rpmstrpool.c for inlining possibilities,
rstrhash() is now just a wrapper to rstrlenhash(). The generic
hash implementation could not take advantage of this anyway really.
- Before this, the slen argument was only good for avoiding an extra
strlen() but being able to handle shove and lookup partial strings
without local copy+modify in callers is handy, this is one of
the prerequisites for that.
- The pool itself can address its contents by id alone, storing
pointers to the strings only hurts as reallocation moving the
data blob requires rehashing the whole thing needlessly.
- We now store just the key id in the hash buckets, and lookup the
actual string for comparison from the pool. This avoids the
need to rehash on realloc and saves memory too, and this is one of
the biggest reasons for wanting a separate hash implementation for
the string pool. Incidentally, this is how libsolv does it too.
- Individual bucket allocation becomes rather wasteful now: a bucket
stores a single integer, and a single pointer to the next bucket,
a pointer which can be twice the size of the key data it holds.
Further tuning and cleaning up after the marriage of these two
datatypes left after the honeymoon is over...
- The only data associated with a pool key is a single id, we dont need
an array for that
- Change poolHash get-entry return the id directly instead of pointer array
- The string pool is more specialized a data structure to be efficiently
handled with the generic hash table implementation in rpmhash.[CH]
and really requires quite a different approach.
- For starters, import a private copy generated roughly with:
gcc -E -DHASHTYPE=poolHash \
-DHTKEYTYPE="const char *" -DHTDATATYPE=rpmsid rpmhash.C
...and clean it up a bit: eliminate unused functions (except for
stats which we'll want to keep for debug purposes), make remaining
functions static and overall tidy up from the mess 'gcc -E' created.
Lots of redundant fluff here still, to be cleaned up gradually...
- This doesn't change anything at all, but opens up the playground
for tuning the pool hash implementation in ways the generic version
could not (at least sanely) be.
- The previous size hint would actually cause us to shrink the hash
bucket allocation, requiring the hash to resize itself immediately
afterwards. As if the rehashes weren't expensive enough already...
- As a special case, two strings (ids) from the same pool can be tested for
equality in constant time (integer comparison). If the pools differ,
a regular string comparison is needed.
- realloc() might not need to actually move the data, and when it
doesn't we dont need to do the very expensive rehash either.
Unsurprisingly makes things a whole lot faster.
- Pool id -> string always works with a frozen pool, but in some cases
we'll need to go the other way, allow caller to specify whether
string -> id lookups should be possible on frozen pool.
- On glibc, realloc() to smaller size doesn't move the data but on
other platforms (including valgrind) it can and does move, which
would require a full rehash. For now, just leave all the data
alone unless we're also freeing the hash, the memory savings
isn't much for a global pool (which is where this matters)
- String pool offset resize was off by one, oops
- String pool data-area resize requires rehashing all the strings,
as the key pointers change. Ouch. Should be avoidable by extending
rpmhash to allow passing the pool itself around in comparisons as "self"
and using offsets as keys, but for now working counts more than speed.
- The unfreeze-sizehint calculation could be negative. Turn the initial
size into constant and use that as a minimum, otherwise rehashing
uses (more or less arbitrary) heuristics to come up with some number.
Lots of fine-tuning ahead...
- The pool stores "arbitrary" number of strings in a space-efficient
manner, with near constant (hashed) string -> id lookup/store and
constant time id -> string and id -> string length lookups.
- Credits for the idea go to the Suse developers working on libsolv,
the basic concept is directly lifted from there but details
differ due to using rpm's own hash table implementation etc.
Another minor difference is using size_t for offsets to permit over
4GB total data size on 64bit systems, the total number of id's in
the pool is limited to uint32 max however (like in libsolv).
- Any (re)implementation bugs by yours truly, this is almost certainly
going to need further tuning and tweaking, API and otherwise.