44 Commits

Author SHA1 Message Date
Panu Matilainen a162cf10df Take advantage of C++ native mutex facilities for string pool
Shared and exclusive locks are of different types in STL so we can't
easily return one or the other from poolLock() as per the write
argument. Just convert all poolLock() call sites to name their
lock type locally.
2024-09-24 10:08:38 +02:00
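To illustrate why a single poolLock() cannot return either kind of lock: in the C++ standard library, `std::shared_lock` and `std::unique_lock` are unrelated types, so each call site has to name the one it wants. The following is a minimal sketch only; `MiniPool`, `miniPoolCount()` and `miniPoolAdd()` are hypothetical names, not rpm's actual code.

```cpp
#include <cstddef>
#include <mutex>
#include <shared_mutex>
#include <string>
#include <vector>

// Hypothetical miniature pool, for illustration only.
struct MiniPool {
    std::shared_mutex mutex;
    std::vector<std::string> strings;
};

// Reader: names a shared lock locally; several readers may hold it at once.
std::size_t miniPoolCount(MiniPool &pool)
{
    std::shared_lock<std::shared_mutex> lock(pool.mutex);
    return pool.strings.size();
}

// Writer: names an exclusive lock locally; it excludes readers and writers.
std::size_t miniPoolAdd(MiniPool &pool, const std::string &s)
{
    std::unique_lock<std::shared_mutex> lock(pool.mutex);
    pool.strings.push_back(s);
    return pool.strings.size();
}
```

Both locks are released automatically when `lock` goes out of scope, so no explicit unlock call is needed on any return path.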
Panu Matilainen 62840a3cdf Add casts that C++ requires but C doesn't across librpmio
In other words, a whole lot of "yes, really".
2024-04-09 11:00:00 +03:00
Panu Matilainen 747b7119ae Fix possible read beyond buffer in rstrnlenhash()
On strings that are not \0-terminated (which are a big reason for the
existence of this function), the while-loop would try to compare the
first character beyond the specified buffer for '\0' before realizing
we're already beyond the end when checking n. Should be mostly harmless
in practice as the check for n would still terminate it, but not right.
In particular this trips up address sanitizer with the bdb backend where
some of the returned strings are not \0-terminated.

Test for string length first, and move the decrementing side-effect into
the loop for better readability.
2020-09-04 13:05:38 +03:00
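The ordering described above can be sketched as follows. This is not the actual rstrnlenhash() body; `boundedLen()` is a hypothetical helper that only computes the length, to show the remaining count being tested before the byte is read.

```cpp
#include <cstddef>

// Hypothetical helper: length of str, examining at most n bytes.
std::size_t boundedLen(const char *str, std::size_t n)
{
    const char *s = str;
    // Test the remaining count first, so *s is never read past the buffer.
    while (n > 0 && *s != '\0') {
        s++;
        n--;    // decrement in the loop body rather than in the condition
    }
    return static_cast<std::size_t>(s - str);
}
```

With the operands in the opposite order (`*s != '\0' && n > 0`), the final iteration would read one byte past an unterminated buffer before noticing that n had reached zero.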
Panu Matilainen 7ffc4d17ff Implement thread protection locking on the string pool
The shared string pool is in a very central role in several operations,
it's kinda embarrassing that we haven't had any thread protection
on it. Not that anybody has asked either, prior to coming up as part
of #226 (to enable threaded package creation).

The test suite and a couple of smoke tests with #226 pass, but this is only
lightly tested. Then again, it's relatively straightforward. As a general
rule, locks are taken on all exported interfaces on entry and released
on exit; internal callers never lock anything. In rpm usage at least,
the performance hit seems negligible.
2018-09-28 15:32:55 +03:00
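The locking rule above (exported functions lock, internal helpers never do) might look roughly like this. It is a sketch under assumptions: `LockedPool`, `lookupUnlocked()`, `lockedPoolStr()` and `lockedPoolStrlen()` are hypothetical names, and a plain `std::mutex` stands in for whatever primitive the real code uses.

```cpp
#include <cstddef>
#include <cstring>
#include <mutex>
#include <string>
#include <vector>

// Hypothetical pool; strings[id - 1] holds the string for a given id.
struct LockedPool {
    std::mutex mutex;
    std::vector<std::string> strings;
};

// Internal helper: never locks, assumes the caller holds pool.mutex.
static const char *lookupUnlocked(const LockedPool &pool, std::size_t id)
{
    if (id == 0 || id > pool.strings.size())
        return nullptr;
    return pool.strings[id - 1].c_str();
}

// Exported: lock on entry, release on exit (via the guard's destructor).
const char *lockedPoolStr(LockedPool &pool, std::size_t id)
{
    std::lock_guard<std::mutex> lock(pool.mutex);
    return lookupUnlocked(pool, id);
}

// Exported: uses the internal helper instead of calling lockedPoolStr(),
// because taking a non-recursive mutex twice on one thread would deadlock.
std::size_t lockedPoolStrlen(LockedPool &pool, std::size_t id)
{
    std::lock_guard<std::mutex> lock(pool.mutex);
    const char *s = lookupUnlocked(pool, id);
    return s ? std::strlen(s) : 0;
}
```

Keeping the lock at the outer boundary is what makes the preceding refactoring commits (internal helpers, NULL checks on the outer callers) useful groundwork.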
Panu Matilainen 613842841f Un-oneline rpmstrPoolNumStr() for next step 2018-09-28 14:24:33 +03:00
Panu Matilainen 8e90ea931c Use helper variables for pool streq comparison
No functional changes, but will be necessary for things to come.
2018-09-28 13:49:36 +03:00
Panu Matilainen 0a6cfc17a1 Add + use an internal helper for id -> string retrieval
Ensure non-NULL pool on the outer call, internal callers don't need that.
No functional changes, just refactoring for things to come
2018-09-28 13:47:58 +03:00
Panu Matilainen 4d67755941 Check for NULL pool on the outer callers
No functional change, just minor refactor for things to come
2018-09-28 13:02:39 +03:00
Panu Matilainen f2d5e7ecd7 Clarify a couple of comments 2013-12-02 13:13:00 +02:00
Panu Matilainen 25406a133f Track chunk usage in the pool struct directly
- This simplifies things a bit as we don't need to worry about the
  id storage and the starting location of the next string in advance.
- Also make it clearer that the string is copied into the current chunk,
  which pool->offs only points into. Make pool->offs const to
  enforce that the strings are never written through it.
2013-12-02 12:47:23 +02:00
Panu Matilainen 938b86b8bd Clarify pool chunk allocation
- Assign newly alloc'ed chunks to pool->chunks, pool->offs just
  contains pointers into the chunks. This doesn't change actual
  behavior at all, just (IMO) clarifies the code a bit.
2013-12-02 12:29:21 +02:00
Panu Matilainen c24930219a Fix a harmless off-by-one in rpmstrPoolPut()
- ssize already has the trailing \0 accounted for
2013-12-02 10:54:18 +02:00
Panu Matilainen cfe99e08ad Drop the end-of-chunk dummy entries from string pool
- As pointed out by Michael Schroeder in
  http://lists.rpm.org/pipermail/rpm-maint/2013-September/003605.html,
  the dummy entries used for optimizing rpmstrPoolStrlen() are
  problematic in a number of ways:
  - Walking the id's in a pool is unreliable, and rehashing can cause
    bogus empty strings to be added to a pool where they otherwise
    do not exist
  - rpmstrPoolNumStr() is not accurate when more than one chunk is in use
- Unfortunately this means giving up the rpmstrPoolStrlen() optimization,
  for now at least.
2013-12-02 10:45:33 +02:00
Michael Schroeder 41a01d2563 Fix off-by-one in rpmstrPoolRehash()
- pool->offs_size is the last used id, thus it should be "<=" instead of "<"

Signed-off-by: Panu Matilainen <pmatilai@redhat.com>
2013-11-29 10:42:36 +02:00
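The boundary condition in this fix can be shown with a small sketch. It assumes, as the message states, that the stored value is the last used id, and that id 0 is reserved; `countVisited()` and its parameters are hypothetical names.

```cpp
#include <cstddef>
#include <vector>

// offs[0] is unused because id 0 is reserved; lastId is the highest id in use.
std::size_t countVisited(const std::vector<const char *> &offs, std::size_t lastId)
{
    std::size_t visited = 0;
    // Inclusive upper bound: with "<" the string at lastId would be skipped.
    for (std::size_t id = 1; id <= lastId; id++) {
        if (offs[id] != nullptr)
            visited++;
    }
    return visited;
}
```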
Ville Skyttä 8002b3f985 Spelling fixes.
Signed-off-by: Panu Matilainen <pmatilai@redhat.com>
2013-02-19 21:35:40 +02:00
Panu Matilainen e3ed69591f Missing include in string pool
- When compiled without selinux support, stdlib.h doesn't get
  included here. Wtf?
2012-10-11 15:14:48 +03:00
Florian Festi bdb966b4df Make string pool strings static in memory
- Use multiple chunks that get allocated as old ones get filled up
  instead of reallocating, store direct pointers to the strings in
  the id array.
- This prevents nasty surprises when previously retrieved pointer
  to a pool string goes invalid underneath us due to somebody
  adding a new string, and restores former rpm API behavior:
  string pointers retrieved from eg rpmds and rpmfi are valid for
  the entire lifetime of those objects.
2012-09-28 10:37:05 +03:00
Panu Matilainen 1bbf25b78f Add function to get number of unique strings in the pool 2012-09-26 08:34:40 +03:00
Florian Festi 971a2887f8 Change poolHash to use internal collision resolution 2012-09-19 13:31:13 +02:00
Panu Matilainen 3619df6ebb Aargh, stupid thinko in rpmstrPoolStrlen() last id special case
- At the largest id, the end boundary is data, not offset size... doh
2012-09-19 10:49:16 +03:00
Panu Matilainen 4c75ab28b8 Make pool string->id operations properly length-aware
- Allow looking up and inserting partial key strings, this is useful
  in various cases where previously a local copy was needed for
  \0-terminating the key in the caller.
- Take advantage of rstrlenhash() in rpmstrPoolId(), previously the
  length was only interesting when adding so we wasted a strlen()
  on every call when the string was already in the pool.
2012-09-18 06:11:37 +03:00
Panu Matilainen 0927ab855e Add length aware variant(s) of string hashing
- Being able to hash partial strings is needed for allowing string pool
  to operate on partial strings...
2012-09-18 04:47:01 +03:00
Panu Matilainen 76a699701c Enhanced string hash to permit calculating string length on the same call
- String hashing needs to walk the entire string anyhow, might as well
  take advantage of this and have it return the string length to avoid
  having to separately call strlen() in the cases where this matters.
- Move the implementation into rpmstrpool.c for inlining possibilities,
  rstrhash() is now just a wrapper to rstrlenhash(). The generic
  hash implementation could not take advantage of this anyway really.
2012-09-18 04:40:20 +03:00
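Computing the hash and the length in one pass could look like the sketch below. It uses the well-known Jenkins one-at-a-time mixing steps with a zero seed purely for illustration; the names `hashWithLen()` and `hashOnly()` are hypothetical, and the real rstrlenhash() may differ in seed and details.

```cpp
#include <cstddef>
#include <cstdint>

// Hash a '\0'-terminated string and optionally report its length,
// walking the string only once.
std::uint32_t hashWithLen(const char *str, std::size_t *len)
{
    std::uint32_t hash = 0;     // illustrative seed, an assumption
    const char *s = str;

    while (*s != '\0') {
        hash += static_cast<unsigned char>(*s);
        hash += hash << 10;
        hash ^= hash >> 6;
        s++;
    }
    hash += hash << 3;
    hash ^= hash >> 11;
    hash += hash << 15;

    if (len != nullptr)
        *len = static_cast<std::size_t>(s - str);
    return hash;
}

// Plain hashing becomes a thin wrapper that ignores the length.
std::uint32_t hashOnly(const char *str)
{
    return hashWithLen(str, nullptr);
}
```

A caller that needs both values avoids a separate strlen() call, which is the saving the message describes.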
Panu Matilainen bef4be688d Don't assume \0 terminated strings in rpmstrPoolPut()
- Before this, the slen argument was only good for avoiding an extra
  strlen(), but being able to insert and look up partial strings
  without local copy+modify in callers is handy, and this is one of
  the prerequisites for that.
2012-09-18 04:15:56 +03:00
Panu Matilainen 1abd80f9c2 Use pool id's for hash table key, lookup strings from pool as needed
- The pool itself can address its contents by id alone, storing
  pointers to the strings only hurts as reallocation moving the
  data blob requires rehashing the whole thing needlessly.
- We now store just the key id in the hash buckets, and lookup the
  actual string for comparison from the pool. This avoids the
  need to rehash on realloc and saves memory too, and this is one of
  the biggest reasons for wanting a separate hash implementation for
  the string pool. Incidentally, this is how libsolv does it too.
- Individual bucket allocation becomes rather wasteful now: a bucket
  stores a single integer, and a single pointer to the next bucket,
  a pointer which can be twice the size of the key data it holds.
  Further tuning and cleaning up after the marriage of these two
  datatypes is left for after the honeymoon is over...
2012-09-17 15:52:59 +03:00
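The id-keyed hash described above can be sketched like this. Everything here is an assumption made for illustration: `IdPool` is a hypothetical type, `std::hash` stands in for the pool's own hash function, and per-bucket vectors stand in for the bucket chains.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

struct IdPool {
    std::vector<std::string> strings;                  // strings[id - 1] belongs to id
    std::vector<std::vector<std::uint32_t>> buckets;   // buckets hold ids only

    IdPool() : buckets(64) {}

    std::size_t slot(const std::string &s) const
    {
        return std::hash<std::string>{}(s) % buckets.size();
    }

    // Returns the id of s, or 0 when s is not in the pool.
    std::uint32_t find(const std::string &s) const
    {
        for (std::uint32_t id : buckets[slot(s)]) {
            // Resolve the stored id back to its string for the comparison.
            if (strings[id - 1] == s)
                return id;
        }
        return 0;
    }

    // Returns the existing id of s, or stores s and returns its new id.
    std::uint32_t put(const std::string &s)
    {
        std::uint32_t id = find(s);
        if (id != 0)
            return id;
        strings.push_back(s);
        id = static_cast<std::uint32_t>(strings.size());
        buckets[slot(s)].push_back(id);
        return id;
    }
};
```

Because the buckets never hold pointers into the string storage, that storage can grow or move without invalidating the hash table.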
Panu Matilainen 7cb0a71a11 Move the string pool struct definition earlier so we can reference it... 2012-09-17 15:32:57 +03:00
Panu Matilainen 77392704f3 Inline poolHashfindEntry() into GetEntry(), nothing else needs it 2012-09-17 15:18:01 +03:00
Panu Matilainen 533106ccfe Eliminate key comparison and hash function vectors from poolHash
- As the pool is hardwired to a single hash type, these don't make
  any sense here and the extra indirection will only hurt performance.
2012-09-17 15:14:08 +03:00
Panu Matilainen 38fe7e3b47 More poolHash multiple data-value cleanups
- The only data associated with a pool key is a single id, we don't need
  an array for that
- Change poolHash get-entry to return the id directly instead of a pointer array
2012-09-17 14:48:21 +03:00
Panu Matilainen 46b664b11b Eliminate redundant data counting from poolHash
- There's a strict 1:1 relation between keys and data in the string
  pool, so keeping count of data is pointless.
2012-09-17 14:43:43 +03:00
Panu Matilainen 95794632be Eliminate unnecessary key and data free-functionality from poolHash 2012-09-17 14:30:55 +03:00
Panu Matilainen d9d9fecaef Pull a private hash-implementation copy to string pool
- The string pool is too specialized a data structure to be handled
  efficiently with the generic hash table implementation in rpmhash.[CH]
  and really requires quite a different approach.
- For starters, import a private copy generated roughly with:
      gcc -E -DHASHTYPE=poolHash \
             -DHTKEYTYPE="const char *" -DHTDATATYPE=rpmsid rpmhash.C
  ...and clean it up a bit: eliminate unused functions (except for
  stats which we'll want to keep for debug purposes), make remaining
  functions static and overall tidy up from the mess 'gcc -E' created.
  Lots of redundant fluff here still, to be cleaned up gradually...
- This doesn't change anything at all, but opens up the playground
  for tuning the pool hash implementation in ways the generic version
  could not (at least sanely) be.
2012-09-17 14:27:01 +03:00
Panu Matilainen 72d0735b90 Rename string pool hash type to poolHash
- No changes other than a rename for next steps...
2012-09-17 13:33:42 +03:00
Panu Matilainen 241fc3c143 Lift string pool rehash into a separate helper function
- This way we have exactly one place for controlling hash (re)creation
  size strategies etc.
2012-09-15 13:01:53 +03:00
Panu Matilainen 95329e10be Use a saner pool hash resize hint
- The previous size hint would actually cause us to shrink the hash
  bucket allocation, requiring the hash to resize itself immediately
  afterwards. As if the rehashes weren't expensive enough already...
2012-09-15 12:49:15 +03:00
Panu Matilainen 1e2c2fece2 Add a string equality check function to string pool API
- As a special case, two strings (ids) from the same pool can be tested for
  equality in constant time (integer comparison). If the pools differ,
  a regular string comparison is needed.
2012-09-13 09:01:30 +03:00
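The two cases of the equality check can be sketched as follows, assuming each pool stores every string at most once. `SimplePool`, `simpleStr()` and `simpleStreq()` are hypothetical names, and treating invalid ids as unequal across pools is a choice made for this sketch, not a statement about rpm's behavior.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical pool: strings[id - 1] belongs to id, each string stored once.
struct SimplePool {
    std::vector<std::string> strings;
};

const std::string *simpleStr(const SimplePool &pool, std::uint32_t id)
{
    if (id == 0 || id > pool.strings.size())
        return nullptr;
    return &pool.strings[id - 1];
}

bool simpleStreq(const SimplePool &poolA, std::uint32_t idA,
                 const SimplePool &poolB, std::uint32_t idB)
{
    // Same pool: equal strings share one id, so compare integers.
    if (&poolA == &poolB)
        return idA == idB;

    // Different pools: ids are unrelated, fall back to comparing contents.
    const std::string *a = simpleStr(poolA, idA);
    const std::string *b = simpleStr(poolB, idB);
    return a != nullptr && b != nullptr && *a == *b;
}
```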
Panu Matilainen 2ea2a0961f Only rehash the pool on insert if the data area actually moved
- realloc() might not need to actually move the data, and when it
  doesn't we don't need to do the very expensive rehash either.
  Unsurprisingly makes things a whole lot faster.
2012-09-12 19:29:28 +03:00
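Detecting whether realloc() moved the block might be sketched as below. `DataArea` and `growData()` are hypothetical names; the old address is saved as an integer before the call so that no stale pointer value is inspected afterwards, and newSize is assumed to be greater than zero.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>

struct DataArea {
    char *data = nullptr;
    std::size_t size = 0;
};

// Resize the area to newSize bytes (newSize > 0 assumed). Returns false
// on allocation failure, leaving the area untouched. On success, moved
// reports whether the block now lives at a different address, which is
// the only case where pointers into it (and a pointer-keyed hash) go stale.
bool growData(DataArea &area, std::size_t newSize, bool &moved)
{
    std::uintptr_t before = reinterpret_cast<std::uintptr_t>(area.data);
    void *p = std::realloc(area.data, newSize);
    if (p == nullptr)
        return false;

    area.data = static_cast<char *>(p);
    area.size = newSize;
    moved = reinterpret_cast<std::uintptr_t>(p) != before;
    return true;
}
```

A caller would run the expensive rehash only when `moved` comes back true, and must eventually release `area.data` with `std::free()`.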
Panu Matilainen 0654685493 Allow keeping hash table around on pool freeze, adjust callers
- Pool id -> string always works with a frozen pool, but in some cases
  we'll need to go the other way, so allow the caller to specify whether
  string -> id lookups should be possible on a frozen pool.
- On glibc, realloc() to smaller size doesn't move the data but on
  other platforms (including valgrind) it can and does move, which
  would require a full rehash. For now, just leave all the data
  alone unless we're also freeing the hash, the memory savings
  aren't much for a global pool (which is where this matters)
2012-09-12 19:17:20 +03:00
Panu Matilainen 3226c2073a String pool id 0 equals NULL
- Pool id 0 is a special case for "not found". Return an actual NULL
  instead of an empty string.
2012-09-12 13:33:22 +03:00
Panu Matilainen bed3880ef1 Avoid doing anything if pool is already frozen 2012-09-12 13:30:50 +03:00
Panu Matilainen 51f1cff50d Fix segfault on rpmstrPoolId() on frozen pool
- String -> id lookups need the hash table in place even if we're not
  adding. We could do a linear search in such a case but...
2012-09-11 10:22:18 +03:00
Panu Matilainen 00deac224c Make rpmstrPoolUnfreeze() safe to call on unfrozen pool 2012-09-11 09:01:49 +03:00
Panu Matilainen 09373ec03a And now, on to the embarrassing string-pool reimplementation bugs, take I
- String pool offset resize was off by one, oops
- String pool data-area resize requires rehashing all the strings,
  as the key pointers change. Ouch. Should be avoidable by extending
  rpmhash to allow passing the pool itself around in comparisons as "self"
  and using offsets as keys, but for now working counts more than speed.
- The unfreeze-sizehint calculation could be negative. Turn the initial
  size into constant and use that as a minimum, otherwise rehashing
  uses (more or less arbitrary) heuristics to come up with some number.
  Lots of fine-tuning ahead...
2012-09-09 13:04:55 +03:00
Panu Matilainen 9e47043b2d First cut of a libsolv-style string <-> id pool API
- The pool stores an "arbitrary" number of strings in a space-efficient
  manner, with near constant (hashed) string -> id lookup/store and
  constant time id -> string and id -> string length lookups.
- Credits for the idea go to the Suse developers working on libsolv,
  the basic concept is directly lifted from there but details
  differ due to using rpm's own hash table implementation etc.
  Another minor difference is using size_t for offsets to permit over
  4GB total data size on 64bit systems, the total number of id's in
  the pool is limited to uint32 max however (like in libsolv).
- Any (re)implementation bugs by yours truly, this is almost certainly
  going to need further tuning and tweaking, API and otherwise.
2012-09-07 13:34:27 +03:00