Summary:
Reuse the old -use-dex-index experiment flag for this.
To avoid breaking the tests, make Dex deduplicate symbols, addressing an old FIXME.
Reviewers: hokein
Subscribers: ilya-biryukov, ioeric, MaskRay, jkorous, arphaman, kadircet, cfe-commits
Differential Revision: https://reviews.llvm.org/D53288
llvm-svn: 344594
Summary:
The bug being fixed: when a posting list doesn't exist in the index, it
was previously just dropped from the query rather than being treated as
empty. Now that we have the FALSE iterator, we can use it instead.
The query tree logic previously had a bunch of special cases to detect whether
subtrees are empty. Now we just naively build the whole tree, and rely
on the query optimizations to drop the trivial parts.
Finally, there was a bug in trigram generation: the empty query would
generate a single trigram "$$$" instead of no trigrams.
This had no effect (there was no posting list, so the other bug
cancelled it out). But we now have to fix this bug too.
Reviewers: ilya-biryukov
Subscribers: ioeric, MaskRay, jkorous, arphaman, kadircet, cfe-commits
Differential Revision: https://reviews.llvm.org/D52796
llvm-svn: 343802
Summary:
The FALSE iterator will be used in a followup patch to fix a logic bug in Dex
(currently, tokens that don't have posting lists in the index are simply dropped
from the query, changing semantics).
It can usually be optimized away, so added the following opmitizations:
- simplify booleans inside AND/OR
- replace effectively-empty AND/OR with booleans
- flatten nested AND/ORs
While working on this, found a bug in the AND iterator: its constructor sync()
assumes that ReachedEnd is set if applicable, but the constructor never sets it.
This crashes if a non-first iterator is nonempty.
Reviewers: ilya-biryukov
Subscribers: ioeric, MaskRay, jkorous, arphaman, kadircet, cfe-commits
Differential Revision: https://reviews.llvm.org/D52789
llvm-svn: 343801
Summary:
Currently queries like "ab" can match identifiers like a_yellow_bee.
The value of allowing this for exactly one segment but no more seems dubious.
It costs ~3% of overall ram (~9% of posting list ram) and some quality.
Reviewers: ilya-biryukov, ioeric
Subscribers: MaskRay, jkorous, arphaman, kadircet, cfe-commits
Differential Revision: https://reviews.llvm.org/D52885
llvm-svn: 343777
Summary:
1) Instead of x$$ for a short-query trigram, just use x
2) Make rules more coherent: prefixes of length 1-2, and first char + next head
3) Fix Dex::fuzzyFind to mark results as incomplete, because
short-trigram rules only yield a subset of results.
Reviewers: ioeric
Subscribers: ilya-biryukov, jkorous, mgrang, arphaman, kadircet, cfe-commits
Differential Revision: https://reviews.llvm.org/D52808
llvm-svn: 343775
Summary:
The FALSE iterator will be used in a followup patch to fix a logic bug in Dex
(currently, tokens that don't have posting lists in the index are simply dropped
from the query, changing semantics).
It can usually be optimized away, so added the following opmitizations:
- simplify booleans inside AND/OR
- replace effectively-empty AND/OR with booleans
- flatten nested AND/ORs
While working on this, found a bug in the AND iterator: its constructor sync()
assumes that ReachedEnd is set if applicable, but the constructor never sets it.
This crashes if a non-first iterator is nonempty.
Reviewers: ilya-biryukov
Subscribers: ioeric, MaskRay, jkorous, arphaman, kadircet, cfe-commits
Differential Revision: https://reviews.llvm.org/D52789
llvm-svn: 343774
Declaring a field with the same name as a type causes GCC to error out:
Dex.h:104:10: error: declaration of 'clang::clangd::dex::Corpus clang::clangd::dex::Dex::Corpus' [-fpermissive]
Corpus Corpus;
^
Iterator.h:127:7: error: changes meaning of 'Corpus' from 'class clang::clangd::dex::Corpus' [-fpermissive]
class Corpus {
llvm-svn: 343610
Summary:
- Corpus avoids having to pass size to the true iterator, and (soon) any
iterator that might optimize down to true.
- Shorten names of factory functions now they're scoped to the Corpus.
intersect() and unionOf() rather than createAnd() or createOr() as this
seems to read better to me, and fits with other short names. Opinion wanted!
- DEFAULT_BOOST_SCORE --> 1. This is a multiplier, don't obfuscate identity.
- Simplify variadic templates in Iterator.h
Reviewers: ioeric
Subscribers: ilya-biryukov, MaskRay, jkorous, arphaman, kadircet, cfe-commits
Differential Revision: https://reviews.llvm.org/D52711
llvm-svn: 343589
Summary:
This makes it suitable for logging (which immediately found a bug, to
be fixed in the next patch...)
Reviewers: ioeric
Subscribers: ilya-biryukov, MaskRay, jkorous, arphaman, kadircet, cfe-commits
Differential Revision: https://reviews.llvm.org/D52715
llvm-svn: 343580
Summary:
When no scope qualifier is specified, allow completing index symbols
from any scope and insert proper automatically. This is still experimental and
hidden behind a flag.
Things missing:
- Scope proximity based scoring.
- FuzzyFind supports weighted scopes.
Reviewers: sammccall
Reviewed By: sammccall
Subscribers: kbobyrev, ilya-biryukov, MaskRay, jkorous, arphaman, kadircet, cfe-commits
Differential Revision: https://reviews.llvm.org/D52364
llvm-svn: 343248
* With the current implementation, `sizeof(std::vector<Chunk>)` is added
twice to the `Dex` memory estimate which is incorrect
* `Dex` logs memory usage estimation before `BackingDataSize` is set and
hence the log report excludes size of the external `SymbolSlab` which is
coupled with `Dex` instance
Reviewed By: ioeric
Differential Revision: https://reviews.llvm.org/D52503
llvm-svn: 343117
Because `PostingList` objects are compressed, it is now impossible to
see elements other than the current one and the documentation doesn't
match implementation anymore.
Reviewed By: ioeric
Differential Revision: https://reviews.llvm.org/D52545
llvm-svn: 343116
Summary:
Interface is in one file, implementation in two as they have little in common.
A couple of ad-hoc YAML functions left exposed:
- symbol -> YAML I expect to keep for tools like dexp
- YAML -> symbol is used for the MR-style indexer, I think we can eliminate
this (merge-on-the-fly, else use a different serialization)
Reviewers: kbobyrev
Subscribers: mgorny, ilya-biryukov, ioeric, MaskRay, jkorous, arphaman, kadircet, cfe-commits
Differential Revision: https://reviews.llvm.org/D52453
llvm-svn: 342999
For consistency, functional-style code pieces are replaced with their
simple counterparts to improve readability.
Also, file headers are fixed to comply with LLVM Coding Standards.
`static` member of anonymous namespace is not marked `static` anymore,
because it is redundant.
Reviewed By: sammccall
Differential Revision: https://reviews.llvm.org/D52466
llvm-svn: 342974
This patch implements Variable-length Byte compression of `PostingList`s
to sacrifice some performance for lower memory consumption.
`PostingList` compression and decompression was extensively tested using
fuzzer for multiple hours and runnning significant number of realistic
`FuzzyFindRequests`. AddressSanitizer and UndefinedBehaviorSanitizer
were used to ensure the correct behaviour.
Performance evaluation was conducted with recent LLVM symbol index (292k
symbols) and the collection of user-recorded queries (7751
`FuzzyFindRequest` JSON dumps):
| Metrics | Before| After | Change (%)
| ----- | ----- | ----- | -----
| Memory consumption (posting lists only), MB | 54.4 | 23.5 | -60%
| Time to process queries, sec | 7.70 | 9.4 | +25%
Reviewers: sammccall, ioeric
Reviewed By: sammccall
Differential Revision: https://reviews.llvm.org/D52300
llvm-svn: 342965
`Dex` should utilize `FuzzyFindRequest.RestrictForCodeCompletion` flags
and omit symbols not meant for code completion when asked for it.
The measurements below were conducted with setting
`FuzzyFindRequest.RestrictForCodeCompletion` to `true` (so that it's
more realistic). Sadly, the average latency goes down, I suspect that is
mostly because of the empty queries where the number of posting lists is
critical.
| Metrics | Before | After | Relative difference
| ----- | ----- | ----- | -----
| Cumulative query latency (7000 `FuzzyFindRequest`s over LLVM static index) | 6182735043 ns | 7202442053 ns | +16%
| Whole Index size | 81.24 MB | 81.79 MB | +0.6%
Out of 292252 symbols collected from LLVM codebase 136926 appear to be
restricted for code completion.
Reviewers: ioeric
Differential Revision: https://reviews.llvm.org/D52357
llvm-svn: 342866
Summary:
We can use cl::ResetCommandLineParser() to support different types of
command-lines, as long as we're careful about option lifetimes.
(I tried using subcommands, but the error messages were bad)
I found a mostly-reasonable pattern to isolate the fiddly parts.
Added -scope and -limit flags to the `find` command to demonstrate.
(Note that scope support seems to be broken in dex?)
Fixed symbol lookup to parse symbol IDs.
Caveats:
- with command help (e.g. `find -help`), you also get some spam
about required arguments. This is a bug in llvm::cl, which prints
these to errs() rather than the designated stream.
Reviewers: kbobyrev
Subscribers: ilya-biryukov, ioeric, MaskRay, jkorous, arphaman, kadircet, cfe-commits
Differential Revision: https://reviews.llvm.org/D51989
llvm-svn: 342456
This patch abstracts `PostingList` interface and reuses existing
implementation. It will be used later to test different `PostingList`
representations.
No functionality change is introduced, this patch is mostly refactoring
so that the following patches could focus on functionality while not
being too hard to review.
Reviewed By: sammccall, ioeric
Differential Revision: https://reviews.llvm.org/D51982
llvm-svn: 342155
As discussed during D51860 review, it is better to use `llvm::Optional`
here as it has clear semantics which reflect intended behavior.
Reviewed By: sammccall
Differential Revision: https://reviews.llvm.org/D52028
llvm-svn: 342138
Currently, `SymbolIndex::estimateMemoryUsage()` returns the "overhead"
estimate, i.e. the estimate of the Index data structure excluding
backing data (such as Symbol Slab and Reference Slab). This patch
propagates information about paired data size where necessary.
Reviewed By: ioeric, sammccall
Differential Revision: https://reviews.llvm.org/D51539
llvm-svn: 341800
If the current element is already beyond advanceTo()'s DocID, just
return instead of doing binary search. This simple optimization saves up
to 6-7% performance,
Reviewed By: ilya-biryukov
Differential Revision: https://reviews.llvm.org/D51802
llvm-svn: 341781
This patch sets URI schemes of Dex to SymbolCollector's default schemes
in case callers tried to pass empty list of schemes. This was the case
for initialization in Clangd main and was a reason of incorrect
behavior.
Also, it fixes a bug with missed `continue;` after spotting invalid URI
scheme conversion.
llvm-svn: 341552
This patch introduces `PathURI` Search Token kind and utilizes it to
uprank symbols which are defined in files with small distance to the
directory where the fuzzy find request is coming from (e.g. files user
is editing).
Reviewed By: ioeric
Reviewers: ioeric, sammccall
Differential Revision: https://reviews.llvm.org/D51481
llvm-svn: 341542
Summary:
A few things that I noticed while merging the SwapIndex patch:
- SymbolOccurrences and particularly SymbolOccurrenceSlab are unwieldy names,
and these names appear *a lot*. Ref, RefSlab, etc seem clear enough
and read/format much better.
- The asymmetry between SymbolSlab and RefSlab (build() vs freeze()) is
confusing and irritating, and doesn't even save much code.
Avoiding RefSlab::Builder was my idea, but it was a bad one; add it.
- DenseMap<SymbolID, ArrayRef<Ref>> seems like a reasonable compromise for
constructing MemIndex - and means many less wasted allocations than the
current DenseMap<SymbolID, vector<Ref*>> for FileIndex, and none for
slabs.
- RefSlab::find() is not actually used for anything, so we can throw
away the DenseMap and keep the representation much more compact.
- A few naming/consistency fixes: e.g. Slabs,Refs -> Symbols,Refs.
Reviewers: ioeric
Subscribers: ilya-biryukov, MaskRay, jkorous, mgrang, arphaman, kadircet, cfe-commits
Differential Revision: https://reviews.llvm.org/D51605
llvm-svn: 341368
Summary:
This is now handled by a wrapper class SwapIndex, so MemIndex/DexIndex can be
immutable and focus on their job.
Old and busted:
I have a MemIndex, which holds a shared_ptr<vector<Symbol*>>, which keeps the
symbol slab alive. I update by calling build(shared_ptr<vector<Symbol*>>).
New hotness: I have a SwapIndex, which holds a unique_ptr<SymbolIndex>, which
holds a MemIndex, which holds a shared_ptr<void>, which keeps backing
data alive.
I update by building a new MemIndex and calling SwapIndex::reset().
Reviewers: kbobyrev, ioeric
Subscribers: ilya-biryukov, ioeric, MaskRay, jkorous, mgrang, arphaman, kadircet, cfe-commits
Differential Revision: https://reviews.llvm.org/D51422
llvm-svn: 341318
* Use consistent assertion messages in iterators implementations
* Silence a bunch of clang-tidy warnings: use `emplace_back` instead of
`push_back` where possible, make sure arguments have the same name in
header and implementation file, use for loop over ranges where possible
Reviewed by: ioeric
Differential Revision: https://reviews.llvm.org/D51528
llvm-svn: 341190
This patch introduces iterator cost concept to improve the performance
of Dex query iterators (mainly, AND iterator). Benchmarks show that the
queries become ~10% faster.
Before
```
-------------------------------------------------------
Benchmark Time CPU Iteration
-------------------------------------------------------
DexAdHocQueries 5883074 ns 5883018 ns 117
DexRealQ 959904457 ns 959898507 ns 1
```
After
```
-------------------------------------------------------
Benchmark Time CPU Iteration
-------------------------------------------------------
DexAdHocQueries 5238403 ns 5238361 ns 130
DexRealQ 873275207 ns 873269453 ns 1
```
Reviewed by: sammccall
Differential Revision: https://reviews.llvm.org/D51310
llvm-svn: 341057
Stop using `$$$` (empty) trigram and generating a posting list with all
items. Since TRUE iterator is already implemented and correctly inserted
when there are no real trigram posting lists, this is a valid
transformation.
Benchmarks show that this simple change allows ~30% speedup on dataset
of real completion queries.
Before
```
-------------------------------------------------------
Benchmark Time CPU Iterations
-------------------------------------------------------
DexAdHocQueries 5640321 ns 5640265 ns 120
DexRealQ 939835603 ns 939830296 ns 1
```
After
```
-------------------------------------------------------
Benchmark Time CPU Iterations
-------------------------------------------------------
DexAdHocQueries 3452014 ns 3451987 ns 203
DexRealQ 667455912 ns 667455750 ns 1
```
Reviewed by: ilya-biryukov
Differential Revision: https://reviews.llvm.org/D51287
llvm-svn: 340729
This patch introduces LIMIT iterator, which is very important for
improving the quality of search query. LIMIT iterators can be applied on
top of BOOST iterators to prevent populating query request with a huge
number of low-quality symbols.
Reviewed by: sammccall
Differential Revision: https://reviews.llvm.org/D51029
llvm-svn: 340605
This patch prints information about built index size estimation to
verbose logs. This is useful for optimizing memory usage of DexIndex and
comparisons with MemIndex.
Reviewed by: sammccall
Differential Revision: https://reviews.llvm.org/D51154
llvm-svn: 340601