[clangd] Define a compact binary serialization fomat for symbol slab/index.
Summary:
This is intended to replace the current YAML format for general use.
It's ~10x more compact than YAML, and ~40% more compact than gzipped YAML:
llvmidx.riff = 20M, llvmidx.yaml = 272M, llvmidx.yaml.gz = 32M
It's also simpler/faster to read and write.
The format is a RIFF container (chunks of (type, size, data)) with:
- a compressed string table
- simple binary encoding of symbols (with varints for compactness)
It can be extended to include occurrences, Dex posting lists, etc.
There's no rich backwards-compatibility scheme, but a version number is included
so we can detect incompatible files and do ad-hoc back-compat.
Alternatives considered:
- compressed YAML or JSON: bulky and slow to load
- llvm bitstream: confusing model and libraries are hard to use. My attempt
produced slightly larger files, and the code was longer and slower.
- protobuf or similar: would be really nice (esp for back-compat) but the
dependency is a big hassle
- ad-hoc binary format without a container: it seems clear we're going
to add posting lists and occurrences here, and that they will benefit
from sharing a string table. The container makes it easy to debug
these pieces in isolation, and make them optional.
Reviewers: ioeric
Subscribers: mgorny, ilya-biryukov, MaskRay, jkorous, mgrang, arphaman, kadircet, cfe-commits
Differential Revision: https://reviews.llvm.org/D51585
llvm-svn: 341375
2018-09-05 00:16:50 +08:00
|
|
|
//===-- Serialization.cpp - Binary serialization of index data ------------===//
|
|
|
|
//
|
|
|
|
// The LLVM Compiler Infrastructure
|
|
|
|
//
|
|
|
|
// This file is distributed under the University of Illinois Open Source
|
|
|
|
// License. See LICENSE.TXT for details.
|
|
|
|
//
|
|
|
|
//===----------------------------------------------------------------------===//
|
2018-09-26 23:06:23 +08:00
|
|
|
|
[clangd] Define a compact binary serialization fomat for symbol slab/index.
Summary:
This is intended to replace the current YAML format for general use.
It's ~10x more compact than YAML, and ~40% more compact than gzipped YAML:
llvmidx.riff = 20M, llvmidx.yaml = 272M, llvmidx.yaml.gz = 32M
It's also simpler/faster to read and write.
The format is a RIFF container (chunks of (type, size, data)) with:
- a compressed string table
- simple binary encoding of symbols (with varints for compactness)
It can be extended to include occurrences, Dex posting lists, etc.
There's no rich backwards-compatibility scheme, but a version number is included
so we can detect incompatible files and do ad-hoc back-compat.
Alternatives considered:
- compressed YAML or JSON: bulky and slow to load
- llvm bitstream: confusing model and libraries are hard to use. My attempt
produced slightly larger files, and the code was longer and slower.
- protobuf or similar: would be really nice (esp for back-compat) but the
dependency is a big hassle
- ad-hoc binary format without a container: it seems clear we're going
to add posting lists and occurrences here, and that they will benefit
from sharing a string table. The container makes it easy to debug
these pieces in isolation, and make them optional.
Reviewers: ioeric
Subscribers: mgorny, ilya-biryukov, MaskRay, jkorous, mgrang, arphaman, kadircet, cfe-commits
Differential Revision: https://reviews.llvm.org/D51585
llvm-svn: 341375
2018-09-05 00:16:50 +08:00
|
|
|
#include "Serialization.h"
|
2018-09-07 02:52:26 +08:00
|
|
|
#include "Index.h"
|
2018-09-26 23:06:23 +08:00
|
|
|
#include "Logger.h"
|
2018-09-07 17:40:36 +08:00
|
|
|
#include "RIFF.h"
|
[clangd] Merge binary + YAML serialization behind a (mostly) common interface.
Summary:
Interface is in one file, implementation in two as they have little in common.
A couple of ad-hoc YAML functions left exposed:
- symbol -> YAML I expect to keep for tools like dexp
- YAML -> symbol is used for the MR-style indexer, I think we can eliminate
this (merge-on-the-fly, else use a different serialization)
Reviewers: kbobyrev
Subscribers: mgorny, ilya-biryukov, ioeric, MaskRay, jkorous, arphaman, kadircet, cfe-commits
Differential Revision: https://reviews.llvm.org/D52453
llvm-svn: 342999
2018-09-26 02:06:43 +08:00
|
|
|
#include "Trace.h"
|
|
|
|
#include "dex/Dex.h"
|
[clangd] Define a compact binary serialization fomat for symbol slab/index.
Summary:
This is intended to replace the current YAML format for general use.
It's ~10x more compact than YAML, and ~40% more compact than gzipped YAML:
llvmidx.riff = 20M, llvmidx.yaml = 272M, llvmidx.yaml.gz = 32M
It's also simpler/faster to read and write.
The format is a RIFF container (chunks of (type, size, data)) with:
- a compressed string table
- simple binary encoding of symbols (with varints for compactness)
It can be extended to include occurrences, Dex posting lists, etc.
There's no rich backwards-compatibility scheme, but a version number is included
so we can detect incompatible files and do ad-hoc back-compat.
Alternatives considered:
- compressed YAML or JSON: bulky and slow to load
- llvm bitstream: confusing model and libraries are hard to use. My attempt
produced slightly larger files, and the code was longer and slower.
- protobuf or similar: would be really nice (esp for back-compat) but the
dependency is a big hassle
- ad-hoc binary format without a container: it seems clear we're going
to add posting lists and occurrences here, and that they will benefit
from sharing a string table. The container makes it easy to debug
these pieces in isolation, and make them optional.
Reviewers: ioeric
Subscribers: mgorny, ilya-biryukov, MaskRay, jkorous, mgrang, arphaman, kadircet, cfe-commits
Differential Revision: https://reviews.llvm.org/D51585
llvm-svn: 341375
2018-09-05 00:16:50 +08:00
|
|
|
#include "llvm/Support/Compression.h"
|
|
|
|
#include "llvm/Support/Endian.h"
|
|
|
|
#include "llvm/Support/Error.h"
|
|
|
|
|
|
|
|
using namespace llvm;
|
|
|
|
namespace clang {
|
|
|
|
namespace clangd {
|
|
|
|
namespace {
|
|
|
|
Error makeError(const Twine &Msg) {
|
|
|
|
return make_error<StringError>(Msg, inconvertibleErrorCode());
|
|
|
|
}
|
|
|
|
|
|
|
|
// IO PRIMITIVES
|
|
|
|
// We use little-endian 32 bit ints, sometimes with variable-length encoding.
|
2018-09-24 22:51:15 +08:00
|
|
|
//
|
|
|
|
// Variable-length int encoding (varint) uses the bottom 7 bits of each byte
|
|
|
|
// to encode the number, and the top bit to indicate whether more bytes follow.
|
|
|
|
// e.g. 9a 2f means [0x1a and keep reading, 0x2f and stop].
|
|
|
|
// This represents 0x1a | 0x2f<<7 = 6042.
|
|
|
|
// A 32-bit integer takes 1-5 bytes to encode; small numbers are more compact.
|
[clangd] Define a compact binary serialization fomat for symbol slab/index.
Summary:
This is intended to replace the current YAML format for general use.
It's ~10x more compact than YAML, and ~40% more compact than gzipped YAML:
llvmidx.riff = 20M, llvmidx.yaml = 272M, llvmidx.yaml.gz = 32M
It's also simpler/faster to read and write.
The format is a RIFF container (chunks of (type, size, data)) with:
- a compressed string table
- simple binary encoding of symbols (with varints for compactness)
It can be extended to include occurrences, Dex posting lists, etc.
There's no rich backwards-compatibility scheme, but a version number is included
so we can detect incompatible files and do ad-hoc back-compat.
Alternatives considered:
- compressed YAML or JSON: bulky and slow to load
- llvm bitstream: confusing model and libraries are hard to use. My attempt
produced slightly larger files, and the code was longer and slower.
- protobuf or similar: would be really nice (esp for back-compat) but the
dependency is a big hassle
- ad-hoc binary format without a container: it seems clear we're going
to add posting lists and occurrences here, and that they will benefit
from sharing a string table. The container makes it easy to debug
these pieces in isolation, and make them optional.
Reviewers: ioeric
Subscribers: mgorny, ilya-biryukov, MaskRay, jkorous, mgrang, arphaman, kadircet, cfe-commits
Differential Revision: https://reviews.llvm.org/D51585
llvm-svn: 341375
2018-09-05 00:16:50 +08:00
|
|
|
|
2018-09-24 22:51:15 +08:00
|
|
|
// Reads binary data from a StringRef, and keeps track of position.
|
|
|
|
class Reader {
|
|
|
|
const char *Begin, *End;
|
2018-09-25 00:52:48 +08:00
|
|
|
bool Err = false;
|
[clangd] Define a compact binary serialization fomat for symbol slab/index.
Summary:
This is intended to replace the current YAML format for general use.
It's ~10x more compact than YAML, and ~40% more compact than gzipped YAML:
llvmidx.riff = 20M, llvmidx.yaml = 272M, llvmidx.yaml.gz = 32M
It's also simpler/faster to read and write.
The format is a RIFF container (chunks of (type, size, data)) with:
- a compressed string table
- simple binary encoding of symbols (with varints for compactness)
It can be extended to include occurrences, Dex posting lists, etc.
There's no rich backwards-compatibility scheme, but a version number is included
so we can detect incompatible files and do ad-hoc back-compat.
Alternatives considered:
- compressed YAML or JSON: bulky and slow to load
- llvm bitstream: confusing model and libraries are hard to use. My attempt
produced slightly larger files, and the code was longer and slower.
- protobuf or similar: would be really nice (esp for back-compat) but the
dependency is a big hassle
- ad-hoc binary format without a container: it seems clear we're going
to add posting lists and occurrences here, and that they will benefit
from sharing a string table. The container makes it easy to debug
these pieces in isolation, and make them optional.
Reviewers: ioeric
Subscribers: mgorny, ilya-biryukov, MaskRay, jkorous, mgrang, arphaman, kadircet, cfe-commits
Differential Revision: https://reviews.llvm.org/D51585
llvm-svn: 341375
2018-09-05 00:16:50 +08:00
|
|
|
|
2018-09-24 22:51:15 +08:00
|
|
|
public:
|
|
|
|
Reader(StringRef Data) : Begin(Data.begin()), End(Data.end()) {}
|
|
|
|
// The "error" bit is set by reading past EOF or reading invalid data.
|
|
|
|
// When in an error state, reads may return zero values: callers should check.
|
|
|
|
bool err() const { return Err; }
|
|
|
|
// Did we read all the data, or encounter an error?
|
|
|
|
bool eof() const { return Begin == End || Err; }
|
|
|
|
// All the data we didn't read yet.
|
|
|
|
StringRef rest() const { return StringRef(Begin, End - Begin); }
|
|
|
|
|
|
|
|
uint8_t consume8() {
|
|
|
|
if (LLVM_UNLIKELY(Begin == End)) {
|
|
|
|
Err = true;
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
return *Begin++;
|
|
|
|
}
|
[clangd] Define a compact binary serialization fomat for symbol slab/index.
Summary:
This is intended to replace the current YAML format for general use.
It's ~10x more compact than YAML, and ~40% more compact than gzipped YAML:
llvmidx.riff = 20M, llvmidx.yaml = 272M, llvmidx.yaml.gz = 32M
It's also simpler/faster to read and write.
The format is a RIFF container (chunks of (type, size, data)) with:
- a compressed string table
- simple binary encoding of symbols (with varints for compactness)
It can be extended to include occurrences, Dex posting lists, etc.
There's no rich backwards-compatibility scheme, but a version number is included
so we can detect incompatible files and do ad-hoc back-compat.
Alternatives considered:
- compressed YAML or JSON: bulky and slow to load
- llvm bitstream: confusing model and libraries are hard to use. My attempt
produced slightly larger files, and the code was longer and slower.
- protobuf or similar: would be really nice (esp for back-compat) but the
dependency is a big hassle
- ad-hoc binary format without a container: it seems clear we're going
to add posting lists and occurrences here, and that they will benefit
from sharing a string table. The container makes it easy to debug
these pieces in isolation, and make them optional.
Reviewers: ioeric
Subscribers: mgorny, ilya-biryukov, MaskRay, jkorous, mgrang, arphaman, kadircet, cfe-commits
Differential Revision: https://reviews.llvm.org/D51585
llvm-svn: 341375
2018-09-05 00:16:50 +08:00
|
|
|
|
2018-09-24 22:51:15 +08:00
|
|
|
uint32_t consume32() {
|
|
|
|
if (LLVM_UNLIKELY(Begin + 4 > End)) {
|
|
|
|
Err = true;
|
|
|
|
return 0;
|
|
|
|
}
|
|
|
|
auto Ret = support::endian::read32le(Begin);
|
|
|
|
Begin += 4;
|
|
|
|
return Ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
StringRef consume(int N) {
|
|
|
|
if (LLVM_UNLIKELY(Begin + N > End)) {
|
|
|
|
Err = true;
|
|
|
|
return StringRef();
|
|
|
|
}
|
|
|
|
StringRef Ret(Begin, N);
|
|
|
|
Begin += N;
|
|
|
|
return Ret;
|
|
|
|
}
|
|
|
|
|
|
|
|
uint32_t consumeVar() {
|
|
|
|
constexpr static uint8_t More = 1 << 7;
|
|
|
|
uint8_t B = consume8();
|
|
|
|
if (LLVM_LIKELY(!(B & More)))
|
|
|
|
return B;
|
|
|
|
uint32_t Val = B & ~More;
|
|
|
|
for (int Shift = 7; B & More && Shift < 32; Shift += 7) {
|
|
|
|
B = consume8();
|
|
|
|
Val |= (B & ~More) << Shift;
|
|
|
|
}
|
|
|
|
return Val;
|
|
|
|
}
|
|
|
|
|
|
|
|
StringRef consumeString(ArrayRef<StringRef> Strings) {
|
|
|
|
auto StringIndex = consumeVar();
|
|
|
|
if (LLVM_UNLIKELY(StringIndex >= Strings.size())) {
|
|
|
|
Err = true;
|
|
|
|
return StringRef();
|
|
|
|
}
|
|
|
|
return Strings[StringIndex];
|
|
|
|
}
|
|
|
|
|
|
|
|
SymbolID consumeID() {
|
|
|
|
StringRef Raw = consume(SymbolID::RawSize); // short if truncated.
|
|
|
|
return LLVM_UNLIKELY(err()) ? SymbolID() : SymbolID::fromRaw(Raw);
|
|
|
|
}
|
|
|
|
};
|
[clangd] Define a compact binary serialization fomat for symbol slab/index.
Summary:
This is intended to replace the current YAML format for general use.
It's ~10x more compact than YAML, and ~40% more compact than gzipped YAML:
llvmidx.riff = 20M, llvmidx.yaml = 272M, llvmidx.yaml.gz = 32M
It's also simpler/faster to read and write.
The format is a RIFF container (chunks of (type, size, data)) with:
- a compressed string table
- simple binary encoding of symbols (with varints for compactness)
It can be extended to include occurrences, Dex posting lists, etc.
There's no rich backwards-compatibility scheme, but a version number is included
so we can detect incompatible files and do ad-hoc back-compat.
Alternatives considered:
- compressed YAML or JSON: bulky and slow to load
- llvm bitstream: confusing model and libraries are hard to use. My attempt
produced slightly larger files, and the code was longer and slower.
- protobuf or similar: would be really nice (esp for back-compat) but the
dependency is a big hassle
- ad-hoc binary format without a container: it seems clear we're going
to add posting lists and occurrences here, and that they will benefit
from sharing a string table. The container makes it easy to debug
these pieces in isolation, and make them optional.
Reviewers: ioeric
Subscribers: mgorny, ilya-biryukov, MaskRay, jkorous, mgrang, arphaman, kadircet, cfe-commits
Differential Revision: https://reviews.llvm.org/D51585
llvm-svn: 341375
2018-09-05 00:16:50 +08:00
|
|
|
|
|
|
|
void write32(uint32_t I, raw_ostream &OS) {
|
|
|
|
char buf[4];
|
|
|
|
support::endian::write32le(buf, I);
|
|
|
|
OS.write(buf, sizeof(buf));
|
|
|
|
}
|
|
|
|
|
|
|
|
void writeVar(uint32_t I, raw_ostream &OS) {
|
|
|
|
constexpr static uint8_t More = 1 << 7;
|
|
|
|
if (LLVM_LIKELY(I < 1 << 7)) {
|
|
|
|
OS.write(I);
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
for (;;) {
|
|
|
|
OS.write(I | More);
|
|
|
|
I >>= 7;
|
|
|
|
if (I < 1 << 7) {
|
|
|
|
OS.write(I);
|
|
|
|
return;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
// STRING TABLE ENCODING
|
|
|
|
// Index data has many string fields, and many strings are identical.
|
|
|
|
// We store each string once, and refer to them by index.
|
|
|
|
//
|
|
|
|
// The string table's format is:
|
2018-09-05 21:17:47 +08:00
|
|
|
// - UncompressedSize : uint32 (or 0 for no compression)
|
[clangd] Define a compact binary serialization fomat for symbol slab/index.
Summary:
This is intended to replace the current YAML format for general use.
It's ~10x more compact than YAML, and ~40% more compact than gzipped YAML:
llvmidx.riff = 20M, llvmidx.yaml = 272M, llvmidx.yaml.gz = 32M
It's also simpler/faster to read and write.
The format is a RIFF container (chunks of (type, size, data)) with:
- a compressed string table
- simple binary encoding of symbols (with varints for compactness)
It can be extended to include occurrences, Dex posting lists, etc.
There's no rich backwards-compatibility scheme, but a version number is included
so we can detect incompatible files and do ad-hoc back-compat.
Alternatives considered:
- compressed YAML or JSON: bulky and slow to load
- llvm bitstream: confusing model and libraries are hard to use. My attempt
produced slightly larger files, and the code was longer and slower.
- protobuf or similar: would be really nice (esp for back-compat) but the
dependency is a big hassle
- ad-hoc binary format without a container: it seems clear we're going
to add posting lists and occurrences here, and that they will benefit
from sharing a string table. The container makes it easy to debug
these pieces in isolation, and make them optional.
Reviewers: ioeric
Subscribers: mgorny, ilya-biryukov, MaskRay, jkorous, mgrang, arphaman, kadircet, cfe-commits
Differential Revision: https://reviews.llvm.org/D51585
llvm-svn: 341375
2018-09-05 00:16:50 +08:00
|
|
|
// - CompressedData : byte[CompressedSize]
|
|
|
|
//
|
|
|
|
// CompressedData is a zlib-compressed byte[UncompressedSize].
|
|
|
|
// It contains a sequence of null-terminated strings, e.g. "foo\0bar\0".
|
|
|
|
// These are sorted to improve compression.
|
|
|
|
|
|
|
|
// Maps each string to a canonical representation.
|
|
|
|
// Strings remain owned externally (e.g. by SymbolSlab).
|
|
|
|
class StringTableOut {
|
|
|
|
DenseSet<StringRef> Unique;
|
|
|
|
std::vector<StringRef> Sorted;
|
|
|
|
// Since strings are interned, look up can be by pointer.
|
|
|
|
DenseMap<std::pair<const char *, size_t>, unsigned> Index;
|
|
|
|
|
|
|
|
public:
|
2018-09-05 21:17:47 +08:00
|
|
|
StringTableOut() {
|
|
|
|
// Ensure there's at least one string in the table.
|
|
|
|
// Table size zero is reserved to indicate no compression.
|
|
|
|
Unique.insert("");
|
|
|
|
}
|
[clangd] Define a compact binary serialization fomat for symbol slab/index.
Summary:
This is intended to replace the current YAML format for general use.
It's ~10x more compact than YAML, and ~40% more compact than gzipped YAML:
llvmidx.riff = 20M, llvmidx.yaml = 272M, llvmidx.yaml.gz = 32M
It's also simpler/faster to read and write.
The format is a RIFF container (chunks of (type, size, data)) with:
- a compressed string table
- simple binary encoding of symbols (with varints for compactness)
It can be extended to include occurrences, Dex posting lists, etc.
There's no rich backwards-compatibility scheme, but a version number is included
so we can detect incompatible files and do ad-hoc back-compat.
Alternatives considered:
- compressed YAML or JSON: bulky and slow to load
- llvm bitstream: confusing model and libraries are hard to use. My attempt
produced slightly larger files, and the code was longer and slower.
- protobuf or similar: would be really nice (esp for back-compat) but the
dependency is a big hassle
- ad-hoc binary format without a container: it seems clear we're going
to add posting lists and occurrences here, and that they will benefit
from sharing a string table. The container makes it easy to debug
these pieces in isolation, and make them optional.
Reviewers: ioeric
Subscribers: mgorny, ilya-biryukov, MaskRay, jkorous, mgrang, arphaman, kadircet, cfe-commits
Differential Revision: https://reviews.llvm.org/D51585
llvm-svn: 341375
2018-09-05 00:16:50 +08:00
|
|
|
// Add a string to the table. Overwrites S if an identical string exists.
|
|
|
|
void intern(StringRef &S) { S = *Unique.insert(S).first; };
|
|
|
|
// Finalize the table and write it to OS. No more strings may be added.
|
|
|
|
void finalize(raw_ostream &OS) {
|
|
|
|
Sorted = {Unique.begin(), Unique.end()};
|
2018-10-07 22:49:41 +08:00
|
|
|
llvm::sort(Sorted);
|
[clangd] Define a compact binary serialization fomat for symbol slab/index.
Summary:
This is intended to replace the current YAML format for general use.
It's ~10x more compact than YAML, and ~40% more compact than gzipped YAML:
llvmidx.riff = 20M, llvmidx.yaml = 272M, llvmidx.yaml.gz = 32M
It's also simpler/faster to read and write.
The format is a RIFF container (chunks of (type, size, data)) with:
- a compressed string table
- simple binary encoding of symbols (with varints for compactness)
It can be extended to include occurrences, Dex posting lists, etc.
There's no rich backwards-compatibility scheme, but a version number is included
so we can detect incompatible files and do ad-hoc back-compat.
Alternatives considered:
- compressed YAML or JSON: bulky and slow to load
- llvm bitstream: confusing model and libraries are hard to use. My attempt
produced slightly larger files, and the code was longer and slower.
- protobuf or similar: would be really nice (esp for back-compat) but the
dependency is a big hassle
- ad-hoc binary format without a container: it seems clear we're going
to add posting lists and occurrences here, and that they will benefit
from sharing a string table. The container makes it easy to debug
these pieces in isolation, and make them optional.
Reviewers: ioeric
Subscribers: mgorny, ilya-biryukov, MaskRay, jkorous, mgrang, arphaman, kadircet, cfe-commits
Differential Revision: https://reviews.llvm.org/D51585
llvm-svn: 341375
2018-09-05 00:16:50 +08:00
|
|
|
for (unsigned I = 0; I < Sorted.size(); ++I)
|
|
|
|
Index.try_emplace({Sorted[I].data(), Sorted[I].size()}, I);
|
|
|
|
|
|
|
|
std::string RawTable;
|
|
|
|
for (StringRef S : Sorted) {
|
|
|
|
RawTable.append(S);
|
|
|
|
RawTable.push_back(0);
|
|
|
|
}
|
2018-09-05 21:17:47 +08:00
|
|
|
if (zlib::isAvailable()) {
|
|
|
|
SmallString<1> Compressed;
|
|
|
|
cantFail(zlib::compress(RawTable, Compressed));
|
|
|
|
write32(RawTable.size(), OS);
|
|
|
|
OS << Compressed;
|
|
|
|
} else {
|
|
|
|
write32(0, OS); // No compression.
|
|
|
|
OS << RawTable;
|
|
|
|
}
|
[clangd] Define a compact binary serialization fomat for symbol slab/index.
Summary:
This is intended to replace the current YAML format for general use.
It's ~10x more compact than YAML, and ~40% more compact than gzipped YAML:
llvmidx.riff = 20M, llvmidx.yaml = 272M, llvmidx.yaml.gz = 32M
It's also simpler/faster to read and write.
The format is a RIFF container (chunks of (type, size, data)) with:
- a compressed string table
- simple binary encoding of symbols (with varints for compactness)
It can be extended to include occurrences, Dex posting lists, etc.
There's no rich backwards-compatibility scheme, but a version number is included
so we can detect incompatible files and do ad-hoc back-compat.
Alternatives considered:
- compressed YAML or JSON: bulky and slow to load
- llvm bitstream: confusing model and libraries are hard to use. My attempt
produced slightly larger files, and the code was longer and slower.
- protobuf or similar: would be really nice (esp for back-compat) but the
dependency is a big hassle
- ad-hoc binary format without a container: it seems clear we're going
to add posting lists and occurrences here, and that they will benefit
from sharing a string table. The container makes it easy to debug
these pieces in isolation, and make them optional.
Reviewers: ioeric
Subscribers: mgorny, ilya-biryukov, MaskRay, jkorous, mgrang, arphaman, kadircet, cfe-commits
Differential Revision: https://reviews.llvm.org/D51585
llvm-svn: 341375
2018-09-05 00:16:50 +08:00
|
|
|
}
|
|
|
|
// Get the ID of an string, which must be interned. Table must be finalized.
|
|
|
|
unsigned index(StringRef S) const {
|
|
|
|
assert(!Sorted.empty() && "table not finalized");
|
|
|
|
assert(Index.count({S.data(), S.size()}) && "string not interned");
|
|
|
|
return Index.find({S.data(), S.size()})->second;
|
|
|
|
}
|
|
|
|
};
|
|
|
|
|
|
|
|
struct StringTableIn {
|
|
|
|
BumpPtrAllocator Arena;
|
|
|
|
std::vector<StringRef> Strings;
|
|
|
|
};
|
|
|
|
|
|
|
|
Expected<StringTableIn> readStringTable(StringRef Data) {
|
2018-09-24 22:51:15 +08:00
|
|
|
Reader R(Data);
|
|
|
|
size_t UncompressedSize = R.consume32();
|
|
|
|
if (R.err())
|
|
|
|
return makeError("Truncated string table");
|
2018-09-05 21:17:47 +08:00
|
|
|
|
|
|
|
StringRef Uncompressed;
|
|
|
|
SmallString<1> UncompressedStorage;
|
|
|
|
if (UncompressedSize == 0) // No compression
|
2018-09-24 22:51:15 +08:00
|
|
|
Uncompressed = R.rest();
|
2018-09-05 21:17:47 +08:00
|
|
|
else {
|
2018-09-24 22:51:15 +08:00
|
|
|
if (Error E = llvm::zlib::uncompress(R.rest(), UncompressedStorage,
|
|
|
|
UncompressedSize))
|
2018-09-05 21:17:47 +08:00
|
|
|
return std::move(E);
|
|
|
|
Uncompressed = UncompressedStorage;
|
|
|
|
}
|
[clangd] Define a compact binary serialization fomat for symbol slab/index.
Summary:
This is intended to replace the current YAML format for general use.
It's ~10x more compact than YAML, and ~40% more compact than gzipped YAML:
llvmidx.riff = 20M, llvmidx.yaml = 272M, llvmidx.yaml.gz = 32M
It's also simpler/faster to read and write.
The format is a RIFF container (chunks of (type, size, data)) with:
- a compressed string table
- simple binary encoding of symbols (with varints for compactness)
It can be extended to include occurrences, Dex posting lists, etc.
There's no rich backwards-compatibility scheme, but a version number is included
so we can detect incompatible files and do ad-hoc back-compat.
Alternatives considered:
- compressed YAML or JSON: bulky and slow to load
- llvm bitstream: confusing model and libraries are hard to use. My attempt
produced slightly larger files, and the code was longer and slower.
- protobuf or similar: would be really nice (esp for back-compat) but the
dependency is a big hassle
- ad-hoc binary format without a container: it seems clear we're going
to add posting lists and occurrences here, and that they will benefit
from sharing a string table. The container makes it easy to debug
these pieces in isolation, and make them optional.
Reviewers: ioeric
Subscribers: mgorny, ilya-biryukov, MaskRay, jkorous, mgrang, arphaman, kadircet, cfe-commits
Differential Revision: https://reviews.llvm.org/D51585
llvm-svn: 341375
2018-09-05 00:16:50 +08:00
|
|
|
|
|
|
|
StringTableIn Table;
|
|
|
|
StringSaver Saver(Table.Arena);
|
2018-09-24 22:51:15 +08:00
|
|
|
R = Reader(Uncompressed);
|
|
|
|
for (Reader R(Uncompressed); !R.eof();) {
|
|
|
|
auto Len = R.rest().find(0);
|
[clangd] Define a compact binary serialization fomat for symbol slab/index.
Summary:
This is intended to replace the current YAML format for general use.
It's ~10x more compact than YAML, and ~40% more compact than gzipped YAML:
llvmidx.riff = 20M, llvmidx.yaml = 272M, llvmidx.yaml.gz = 32M
It's also simpler/faster to read and write.
The format is a RIFF container (chunks of (type, size, data)) with:
- a compressed string table
- simple binary encoding of symbols (with varints for compactness)
It can be extended to include occurrences, Dex posting lists, etc.
There's no rich backwards-compatibility scheme, but a version number is included
so we can detect incompatible files and do ad-hoc back-compat.
Alternatives considered:
- compressed YAML or JSON: bulky and slow to load
- llvm bitstream: confusing model and libraries are hard to use. My attempt
produced slightly larger files, and the code was longer and slower.
- protobuf or similar: would be really nice (esp for back-compat) but the
dependency is a big hassle
- ad-hoc binary format without a container: it seems clear we're going
to add posting lists and occurrences here, and that they will benefit
from sharing a string table. The container makes it easy to debug
these pieces in isolation, and make them optional.
Reviewers: ioeric
Subscribers: mgorny, ilya-biryukov, MaskRay, jkorous, mgrang, arphaman, kadircet, cfe-commits
Differential Revision: https://reviews.llvm.org/D51585
llvm-svn: 341375
2018-09-05 00:16:50 +08:00
|
|
|
if (Len == StringRef::npos)
|
|
|
|
return makeError("Bad string table: not null terminated");
|
2018-09-24 22:51:15 +08:00
|
|
|
Table.Strings.push_back(Saver.save(R.consume(Len)));
|
|
|
|
R.consume8();
|
[clangd] Define a compact binary serialization fomat for symbol slab/index.
Summary:
This is intended to replace the current YAML format for general use.
It's ~10x more compact than YAML, and ~40% more compact than gzipped YAML:
llvmidx.riff = 20M, llvmidx.yaml = 272M, llvmidx.yaml.gz = 32M
It's also simpler/faster to read and write.
The format is a RIFF container (chunks of (type, size, data)) with:
- a compressed string table
- simple binary encoding of symbols (with varints for compactness)
It can be extended to include occurrences, Dex posting lists, etc.
There's no rich backwards-compatibility scheme, but a version number is included
so we can detect incompatible files and do ad-hoc back-compat.
Alternatives considered:
- compressed YAML or JSON: bulky and slow to load
- llvm bitstream: confusing model and libraries are hard to use. My attempt
produced slightly larger files, and the code was longer and slower.
- protobuf or similar: would be really nice (esp for back-compat) but the
dependency is a big hassle
- ad-hoc binary format without a container: it seems clear we're going
to add posting lists and occurrences here, and that they will benefit
from sharing a string table. The container makes it easy to debug
these pieces in isolation, and make them optional.
Reviewers: ioeric
Subscribers: mgorny, ilya-biryukov, MaskRay, jkorous, mgrang, arphaman, kadircet, cfe-commits
Differential Revision: https://reviews.llvm.org/D51585
llvm-svn: 341375
2018-09-05 00:16:50 +08:00
|
|
|
}
|
2018-09-24 22:51:15 +08:00
|
|
|
if (R.err())
|
|
|
|
return makeError("Truncated string table");
|
2018-09-05 15:52:49 +08:00
|
|
|
return std::move(Table);
|
[clangd] Define a compact binary serialization fomat for symbol slab/index.
Summary:
This is intended to replace the current YAML format for general use.
It's ~10x more compact than YAML, and ~40% more compact than gzipped YAML:
llvmidx.riff = 20M, llvmidx.yaml = 272M, llvmidx.yaml.gz = 32M
It's also simpler/faster to read and write.
The format is a RIFF container (chunks of (type, size, data)) with:
- a compressed string table
- simple binary encoding of symbols (with varints for compactness)
It can be extended to include occurrences, Dex posting lists, etc.
There's no rich backwards-compatibility scheme, but a version number is included
so we can detect incompatible files and do ad-hoc back-compat.
Alternatives considered:
- compressed YAML or JSON: bulky and slow to load
- llvm bitstream: confusing model and libraries are hard to use. My attempt
produced slightly larger files, and the code was longer and slower.
- protobuf or similar: would be really nice (esp for back-compat) but the
dependency is a big hassle
- ad-hoc binary format without a container: it seems clear we're going
to add posting lists and occurrences here, and that they will benefit
from sharing a string table. The container makes it easy to debug
these pieces in isolation, and make them optional.
Reviewers: ioeric
Subscribers: mgorny, ilya-biryukov, MaskRay, jkorous, mgrang, arphaman, kadircet, cfe-commits
Differential Revision: https://reviews.llvm.org/D51585
llvm-svn: 341375
2018-09-05 00:16:50 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
// SYMBOL ENCODING
|
|
|
|
// Each field of clangd::Symbol is encoded in turn (see implementation).
|
|
|
|
// - StringRef fields encode as varint (index into the string table)
|
|
|
|
// - enums encode as the underlying type
|
|
|
|
// - most numbers encode as varint
|
|
|
|
|
2018-09-24 22:51:15 +08:00
|
|
|
void writeLocation(const SymbolLocation &Loc, const StringTableOut &Strings,
|
|
|
|
raw_ostream &OS) {
|
|
|
|
writeVar(Strings.index(Loc.FileURI), OS);
|
|
|
|
for (const auto &Endpoint : {Loc.Start, Loc.End}) {
|
[clangd] Encode Line/Column as a 32-bits integer.
Summary:
This would buy us more memory. Using a 32-bits integer is enough for
most human-readable source code (up to 4M lines and 4K columns).
Previsouly, we used 8 bytes for a position, now 4 bytes, it would save
us 8 bytes for each Ref and each Symbol instance.
For LLVM-project binary index file, we save ~13% memory.
| Before | After |
| 412MB | 355MB |
Reviewers: sammccall
Subscribers: ilya-biryukov, ioeric, MaskRay, jkorous, arphaman, kadircet, cfe-commits
Differential Revision: https://reviews.llvm.org/D53363
llvm-svn: 344735
2018-10-18 18:43:50 +08:00
|
|
|
writeVar(Endpoint.line(), OS);
|
|
|
|
writeVar(Endpoint.column(), OS);
|
2018-09-24 22:51:15 +08:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
SymbolLocation readLocation(Reader &Data, ArrayRef<StringRef> Strings) {
|
|
|
|
SymbolLocation Loc;
|
2018-11-14 19:55:45 +08:00
|
|
|
Loc.FileURI = Data.consumeString(Strings).data();
|
2018-09-24 22:51:15 +08:00
|
|
|
for (auto *Endpoint : {&Loc.Start, &Loc.End}) {
|
[clangd] Encode Line/Column as a 32-bits integer.
Summary:
This would buy us more memory. Using a 32-bits integer is enough for
most human-readable source code (up to 4M lines and 4K columns).
Previsouly, we used 8 bytes for a position, now 4 bytes, it would save
us 8 bytes for each Ref and each Symbol instance.
For LLVM-project binary index file, we save ~13% memory.
| Before | After |
| 412MB | 355MB |
Reviewers: sammccall
Subscribers: ilya-biryukov, ioeric, MaskRay, jkorous, arphaman, kadircet, cfe-commits
Differential Revision: https://reviews.llvm.org/D53363
llvm-svn: 344735
2018-10-18 18:43:50 +08:00
|
|
|
Endpoint->setLine(Data.consumeVar());
|
|
|
|
Endpoint->setColumn(Data.consumeVar());
|
2018-09-24 22:51:15 +08:00
|
|
|
}
|
|
|
|
return Loc;
|
|
|
|
}
|
[clangd] Define a compact binary serialization fomat for symbol slab/index.
Summary:
This is intended to replace the current YAML format for general use.
It's ~10x more compact than YAML, and ~40% more compact than gzipped YAML:
llvmidx.riff = 20M, llvmidx.yaml = 272M, llvmidx.yaml.gz = 32M
It's also simpler/faster to read and write.
The format is a RIFF container (chunks of (type, size, data)) with:
- a compressed string table
- simple binary encoding of symbols (with varints for compactness)
It can be extended to include occurrences, Dex posting lists, etc.
There's no rich backwards-compatibility scheme, but a version number is included
so we can detect incompatible files and do ad-hoc back-compat.
Alternatives considered:
- compressed YAML or JSON: bulky and slow to load
- llvm bitstream: confusing model and libraries are hard to use. My attempt
produced slightly larger files, and the code was longer and slower.
- protobuf or similar: would be really nice (esp for back-compat) but the
dependency is a big hassle
- ad-hoc binary format without a container: it seems clear we're going
to add posting lists and occurrences here, and that they will benefit
from sharing a string table. The container makes it easy to debug
these pieces in isolation, and make them optional.
Reviewers: ioeric
Subscribers: mgorny, ilya-biryukov, MaskRay, jkorous, mgrang, arphaman, kadircet, cfe-commits
Differential Revision: https://reviews.llvm.org/D51585
llvm-svn: 341375
2018-09-05 00:16:50 +08:00
|
|
|
|
|
|
|
void writeSymbol(const Symbol &Sym, const StringTableOut &Strings,
|
|
|
|
raw_ostream &OS) {
|
|
|
|
OS << Sym.ID.raw(); // TODO: once we start writing xrefs and posting lists,
|
|
|
|
// symbol IDs should probably be in a string table.
|
|
|
|
OS.write(static_cast<uint8_t>(Sym.SymInfo.Kind));
|
|
|
|
OS.write(static_cast<uint8_t>(Sym.SymInfo.Lang));
|
|
|
|
writeVar(Strings.index(Sym.Name), OS);
|
|
|
|
writeVar(Strings.index(Sym.Scope), OS);
|
2018-09-24 22:51:15 +08:00
|
|
|
writeLocation(Sym.Definition, Strings, OS);
|
|
|
|
writeLocation(Sym.CanonicalDeclaration, Strings, OS);
|
[clangd] Define a compact binary serialization fomat for symbol slab/index.
Summary:
This is intended to replace the current YAML format for general use.
It's ~10x more compact than YAML, and ~40% more compact than gzipped YAML:
llvmidx.riff = 20M, llvmidx.yaml = 272M, llvmidx.yaml.gz = 32M
It's also simpler/faster to read and write.
The format is a RIFF container (chunks of (type, size, data)) with:
- a compressed string table
- simple binary encoding of symbols (with varints for compactness)
It can be extended to include occurrences, Dex posting lists, etc.
There's no rich backwards-compatibility scheme, but a version number is included
so we can detect incompatible files and do ad-hoc back-compat.
Alternatives considered:
- compressed YAML or JSON: bulky and slow to load
- llvm bitstream: confusing model and libraries are hard to use. My attempt
produced slightly larger files, and the code was longer and slower.
- protobuf or similar: would be really nice (esp for back-compat) but the
dependency is a big hassle
- ad-hoc binary format without a container: it seems clear we're going
to add posting lists and occurrences here, and that they will benefit
from sharing a string table. The container makes it easy to debug
these pieces in isolation, and make them optional.
Reviewers: ioeric
Subscribers: mgorny, ilya-biryukov, MaskRay, jkorous, mgrang, arphaman, kadircet, cfe-commits
Differential Revision: https://reviews.llvm.org/D51585
llvm-svn: 341375
2018-09-05 00:16:50 +08:00
|
|
|
writeVar(Sym.References, OS);
|
2018-09-07 02:52:26 +08:00
|
|
|
OS.write(static_cast<uint8_t>(Sym.Flags));
|
[clangd] Define a compact binary serialization fomat for symbol slab/index.
Summary:
This is intended to replace the current YAML format for general use.
It's ~10x more compact than YAML, and ~40% more compact than gzipped YAML:
llvmidx.riff = 20M, llvmidx.yaml = 272M, llvmidx.yaml.gz = 32M
It's also simpler/faster to read and write.
The format is a RIFF container (chunks of (type, size, data)) with:
- a compressed string table
- simple binary encoding of symbols (with varints for compactness)
It can be extended to include occurrences, Dex posting lists, etc.
There's no rich backwards-compatibility scheme, but a version number is included
so we can detect incompatible files and do ad-hoc back-compat.
Alternatives considered:
- compressed YAML or JSON: bulky and slow to load
- llvm bitstream: confusing model and libraries are hard to use. My attempt
produced slightly larger files, and the code was longer and slower.
- protobuf or similar: would be really nice (esp for back-compat) but the
dependency is a big hassle
- ad-hoc binary format without a container: it seems clear we're going
to add posting lists and occurrences here, and that they will benefit
from sharing a string table. The container makes it easy to debug
these pieces in isolation, and make them optional.
Reviewers: ioeric
Subscribers: mgorny, ilya-biryukov, MaskRay, jkorous, mgrang, arphaman, kadircet, cfe-commits
Differential Revision: https://reviews.llvm.org/D51585
llvm-svn: 341375
2018-09-05 00:16:50 +08:00
|
|
|
OS.write(static_cast<uint8_t>(Sym.Origin));
|
|
|
|
writeVar(Strings.index(Sym.Signature), OS);
|
|
|
|
writeVar(Strings.index(Sym.CompletionSnippetSuffix), OS);
|
|
|
|
writeVar(Strings.index(Sym.Documentation), OS);
|
|
|
|
writeVar(Strings.index(Sym.ReturnType), OS);
|
|
|
|
|
|
|
|
auto WriteInclude = [&](const Symbol::IncludeHeaderWithReferences &Include) {
|
|
|
|
writeVar(Strings.index(Include.IncludeHeader), OS);
|
|
|
|
writeVar(Include.References, OS);
|
|
|
|
};
|
2018-09-24 22:51:15 +08:00
|
|
|
writeVar(Sym.IncludeHeaders.size(), OS);
|
|
|
|
for (const auto &Include : Sym.IncludeHeaders)
|
|
|
|
WriteInclude(Include);
|
[clangd] Define a compact binary serialization fomat for symbol slab/index.
Summary:
This is intended to replace the current YAML format for general use.
It's ~10x more compact than YAML, and ~40% more compact than gzipped YAML:
llvmidx.riff = 20M, llvmidx.yaml = 272M, llvmidx.yaml.gz = 32M
It's also simpler/faster to read and write.
The format is a RIFF container (chunks of (type, size, data)) with:
- a compressed string table
- simple binary encoding of symbols (with varints for compactness)
It can be extended to include occurrences, Dex posting lists, etc.
There's no rich backwards-compatibility scheme, but a version number is included
so we can detect incompatible files and do ad-hoc back-compat.
Alternatives considered:
- compressed YAML or JSON: bulky and slow to load
- llvm bitstream: confusing model and libraries are hard to use. My attempt
produced slightly larger files, and the code was longer and slower.
- protobuf or similar: would be really nice (esp for back-compat) but the
dependency is a big hassle
- ad-hoc binary format without a container: it seems clear we're going
to add posting lists and occurrences here, and that they will benefit
from sharing a string table. The container makes it easy to debug
these pieces in isolation, and make them optional.
Reviewers: ioeric
Subscribers: mgorny, ilya-biryukov, MaskRay, jkorous, mgrang, arphaman, kadircet, cfe-commits
Differential Revision: https://reviews.llvm.org/D51585
llvm-svn: 341375
2018-09-05 00:16:50 +08:00
|
|
|
}
|
|
|
|
|
2018-09-24 22:51:15 +08:00
|
|
|
Symbol readSymbol(Reader &Data, ArrayRef<StringRef> Strings) {
|
[clangd] Define a compact binary serialization fomat for symbol slab/index.
Summary:
This is intended to replace the current YAML format for general use.
It's ~10x more compact than YAML, and ~40% more compact than gzipped YAML:
llvmidx.riff = 20M, llvmidx.yaml = 272M, llvmidx.yaml.gz = 32M
It's also simpler/faster to read and write.
The format is a RIFF container (chunks of (type, size, data)) with:
- a compressed string table
- simple binary encoding of symbols (with varints for compactness)
It can be extended to include occurrences, Dex posting lists, etc.
There's no rich backwards-compatibility scheme, but a version number is included
so we can detect incompatible files and do ad-hoc back-compat.
Alternatives considered:
- compressed YAML or JSON: bulky and slow to load
- llvm bitstream: confusing model and libraries are hard to use. My attempt
produced slightly larger files, and the code was longer and slower.
- protobuf or similar: would be really nice (esp for back-compat) but the
dependency is a big hassle
- ad-hoc binary format without a container: it seems clear we're going
to add posting lists and occurrences here, and that they will benefit
from sharing a string table. The container makes it easy to debug
these pieces in isolation, and make them optional.
Reviewers: ioeric
Subscribers: mgorny, ilya-biryukov, MaskRay, jkorous, mgrang, arphaman, kadircet, cfe-commits
Differential Revision: https://reviews.llvm.org/D51585
llvm-svn: 341375
2018-09-05 00:16:50 +08:00
|
|
|
Symbol Sym;
|
2018-09-24 22:51:15 +08:00
|
|
|
Sym.ID = Data.consumeID();
|
|
|
|
Sym.SymInfo.Kind = static_cast<index::SymbolKind>(Data.consume8());
|
|
|
|
Sym.SymInfo.Lang = static_cast<index::SymbolLanguage>(Data.consume8());
|
|
|
|
Sym.Name = Data.consumeString(Strings);
|
|
|
|
Sym.Scope = Data.consumeString(Strings);
|
|
|
|
Sym.Definition = readLocation(Data, Strings);
|
|
|
|
Sym.CanonicalDeclaration = readLocation(Data, Strings);
|
|
|
|
Sym.References = Data.consumeVar();
|
|
|
|
Sym.Flags = static_cast<Symbol::SymbolFlag>(Data.consumeVar());
|
|
|
|
Sym.Origin = static_cast<SymbolOrigin>(Data.consumeVar());
|
|
|
|
Sym.Signature = Data.consumeString(Strings);
|
|
|
|
Sym.CompletionSnippetSuffix = Data.consumeString(Strings);
|
|
|
|
Sym.Documentation = Data.consumeString(Strings);
|
|
|
|
Sym.ReturnType = Data.consumeString(Strings);
|
|
|
|
Sym.IncludeHeaders.resize(Data.consumeVar());
|
[clangd] Define a compact binary serialization fomat for symbol slab/index.
Summary:
This is intended to replace the current YAML format for general use.
It's ~10x more compact than YAML, and ~40% more compact than gzipped YAML:
llvmidx.riff = 20M, llvmidx.yaml = 272M, llvmidx.yaml.gz = 32M
It's also simpler/faster to read and write.
The format is a RIFF container (chunks of (type, size, data)) with:
- a compressed string table
- simple binary encoding of symbols (with varints for compactness)
It can be extended to include occurrences, Dex posting lists, etc.
There's no rich backwards-compatibility scheme, but a version number is included
so we can detect incompatible files and do ad-hoc back-compat.
Alternatives considered:
- compressed YAML or JSON: bulky and slow to load
- llvm bitstream: confusing model and libraries are hard to use. My attempt
produced slightly larger files, and the code was longer and slower.
- protobuf or similar: would be really nice (esp for back-compat) but the
dependency is a big hassle
- ad-hoc binary format without a container: it seems clear we're going
to add posting lists and occurrences here, and that they will benefit
from sharing a string table. The container makes it easy to debug
these pieces in isolation, and make them optional.
Reviewers: ioeric
Subscribers: mgorny, ilya-biryukov, MaskRay, jkorous, mgrang, arphaman, kadircet, cfe-commits
Differential Revision: https://reviews.llvm.org/D51585
llvm-svn: 341375
2018-09-05 00:16:50 +08:00
|
|
|
for (auto &I : Sym.IncludeHeaders) {
|
2018-09-24 22:51:15 +08:00
|
|
|
I.IncludeHeader = Data.consumeString(Strings);
|
|
|
|
I.References = Data.consumeVar();
|
[clangd] Define a compact binary serialization fomat for symbol slab/index.
Summary:
This is intended to replace the current YAML format for general use.
It's ~10x more compact than YAML, and ~40% more compact than gzipped YAML:
llvmidx.riff = 20M, llvmidx.yaml = 272M, llvmidx.yaml.gz = 32M
It's also simpler/faster to read and write.
The format is a RIFF container (chunks of (type, size, data)) with:
- a compressed string table
- simple binary encoding of symbols (with varints for compactness)
It can be extended to include occurrences, Dex posting lists, etc.
There's no rich backwards-compatibility scheme, but a version number is included
so we can detect incompatible files and do ad-hoc back-compat.
Alternatives considered:
- compressed YAML or JSON: bulky and slow to load
- llvm bitstream: confusing model and libraries are hard to use. My attempt
produced slightly larger files, and the code was longer and slower.
- protobuf or similar: would be really nice (esp for back-compat) but the
dependency is a big hassle
- ad-hoc binary format without a container: it seems clear we're going
to add posting lists and occurrences here, and that they will benefit
from sharing a string table. The container makes it easy to debug
these pieces in isolation, and make them optional.
Reviewers: ioeric
Subscribers: mgorny, ilya-biryukov, MaskRay, jkorous, mgrang, arphaman, kadircet, cfe-commits
Differential Revision: https://reviews.llvm.org/D51585
llvm-svn: 341375
2018-09-05 00:16:50 +08:00
|
|
|
}
|
2018-09-24 22:51:15 +08:00
|
|
|
return Sym;
|
[clangd] Define a compact binary serialization fomat for symbol slab/index.
Summary:
This is intended to replace the current YAML format for general use.
It's ~10x more compact than YAML, and ~40% more compact than gzipped YAML:
llvmidx.riff = 20M, llvmidx.yaml = 272M, llvmidx.yaml.gz = 32M
It's also simpler/faster to read and write.
The format is a RIFF container (chunks of (type, size, data)) with:
- a compressed string table
- simple binary encoding of symbols (with varints for compactness)
It can be extended to include occurrences, Dex posting lists, etc.
There's no rich backwards-compatibility scheme, but a version number is included
so we can detect incompatible files and do ad-hoc back-compat.
Alternatives considered:
- compressed YAML or JSON: bulky and slow to load
- llvm bitstream: confusing model and libraries are hard to use. My attempt
produced slightly larger files, and the code was longer and slower.
- protobuf or similar: would be really nice (esp for back-compat) but the
dependency is a big hassle
- ad-hoc binary format without a container: it seems clear we're going
to add posting lists and occurrences here, and that they will benefit
from sharing a string table. The container makes it easy to debug
these pieces in isolation, and make them optional.
Reviewers: ioeric
Subscribers: mgorny, ilya-biryukov, MaskRay, jkorous, mgrang, arphaman, kadircet, cfe-commits
Differential Revision: https://reviews.llvm.org/D51585
llvm-svn: 341375
2018-09-05 00:16:50 +08:00
|
|
|
}
|
|
|
|
|
2018-10-04 22:09:55 +08:00
|
|
|
// REFS ENCODING
|
|
|
|
// A refs section has data grouped by Symbol. Each symbol has:
|
2018-11-16 18:58:40 +08:00
|
|
|
// - SymbolID: 8 bytes
|
2018-10-04 22:09:55 +08:00
|
|
|
// - NumRefs: varint
|
|
|
|
// - Ref[NumRefs]
|
|
|
|
// Fields of Ref are encoded in turn, see implementation.
|
|
|
|
|
|
|
|
void writeRefs(const SymbolID &ID, ArrayRef<Ref> Refs,
|
|
|
|
const StringTableOut &Strings, raw_ostream &OS) {
|
|
|
|
OS << ID.raw();
|
|
|
|
writeVar(Refs.size(), OS);
|
|
|
|
for (const auto &Ref : Refs) {
|
|
|
|
OS.write(static_cast<unsigned char>(Ref.Kind));
|
|
|
|
writeLocation(Ref.Location, Strings, OS);
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
|
|
|
std::pair<SymbolID, std::vector<Ref>> readRefs(Reader &Data,
|
|
|
|
ArrayRef<StringRef> Strings) {
|
|
|
|
std::pair<SymbolID, std::vector<Ref>> Result;
|
|
|
|
Result.first = Data.consumeID();
|
|
|
|
Result.second.resize(Data.consumeVar());
|
|
|
|
for (auto &Ref : Result.second) {
|
|
|
|
Ref.Kind = static_cast<RefKind>(Data.consume8());
|
|
|
|
Ref.Location = readLocation(Data, Strings);
|
|
|
|
}
|
|
|
|
return Result;
|
|
|
|
}
|
|
|
|
|
[clangd] Define a compact binary serialization fomat for symbol slab/index.
Summary:
This is intended to replace the current YAML format for general use.
It's ~10x more compact than YAML, and ~40% more compact than gzipped YAML:
llvmidx.riff = 20M, llvmidx.yaml = 272M, llvmidx.yaml.gz = 32M
It's also simpler/faster to read and write.
The format is a RIFF container (chunks of (type, size, data)) with:
- a compressed string table
- simple binary encoding of symbols (with varints for compactness)
It can be extended to include occurrences, Dex posting lists, etc.
There's no rich backwards-compatibility scheme, but a version number is included
so we can detect incompatible files and do ad-hoc back-compat.
Alternatives considered:
- compressed YAML or JSON: bulky and slow to load
- llvm bitstream: confusing model and libraries are hard to use. My attempt
produced slightly larger files, and the code was longer and slower.
- protobuf or similar: would be really nice (esp for back-compat) but the
dependency is a big hassle
- ad-hoc binary format without a container: it seems clear we're going
to add posting lists and occurrences here, and that they will benefit
from sharing a string table. The container makes it easy to debug
these pieces in isolation, and make them optional.
Reviewers: ioeric
Subscribers: mgorny, ilya-biryukov, MaskRay, jkorous, mgrang, arphaman, kadircet, cfe-commits
Differential Revision: https://reviews.llvm.org/D51585
llvm-svn: 341375
2018-09-05 00:16:50 +08:00
|
|
|
// FILE ENCODING
|
|
|
|
// A file is a RIFF chunk with type 'CdIx'.
|
|
|
|
// It contains the sections:
|
|
|
|
// - meta: version number
|
2018-11-20 02:06:36 +08:00
|
|
|
// - srcs: checksum of the source file
|
[clangd] Define a compact binary serialization fomat for symbol slab/index.
Summary:
This is intended to replace the current YAML format for general use.
It's ~10x more compact than YAML, and ~40% more compact than gzipped YAML:
llvmidx.riff = 20M, llvmidx.yaml = 272M, llvmidx.yaml.gz = 32M
It's also simpler/faster to read and write.
The format is a RIFF container (chunks of (type, size, data)) with:
- a compressed string table
- simple binary encoding of symbols (with varints for compactness)
It can be extended to include occurrences, Dex posting lists, etc.
There's no rich backwards-compatibility scheme, but a version number is included
so we can detect incompatible files and do ad-hoc back-compat.
Alternatives considered:
- compressed YAML or JSON: bulky and slow to load
- llvm bitstream: confusing model and libraries are hard to use. My attempt
produced slightly larger files, and the code was longer and slower.
- protobuf or similar: would be really nice (esp for back-compat) but the
dependency is a big hassle
- ad-hoc binary format without a container: it seems clear we're going
to add posting lists and occurrences here, and that they will benefit
from sharing a string table. The container makes it easy to debug
these pieces in isolation, and make them optional.
Reviewers: ioeric
Subscribers: mgorny, ilya-biryukov, MaskRay, jkorous, mgrang, arphaman, kadircet, cfe-commits
Differential Revision: https://reviews.llvm.org/D51585
llvm-svn: 341375
2018-09-05 00:16:50 +08:00
|
|
|
// - stri: string table
|
|
|
|
// - symb: symbols
|
2018-10-04 22:09:55 +08:00
|
|
|
// - refs: references to symbols
|
[clangd] Define a compact binary serialization fomat for symbol slab/index.
Summary:
This is intended to replace the current YAML format for general use.
It's ~10x more compact than YAML, and ~40% more compact than gzipped YAML:
llvmidx.riff = 20M, llvmidx.yaml = 272M, llvmidx.yaml.gz = 32M
It's also simpler/faster to read and write.
The format is a RIFF container (chunks of (type, size, data)) with:
- a compressed string table
- simple binary encoding of symbols (with varints for compactness)
It can be extended to include occurrences, Dex posting lists, etc.
There's no rich backwards-compatibility scheme, but a version number is included
so we can detect incompatible files and do ad-hoc back-compat.
Alternatives considered:
- compressed YAML or JSON: bulky and slow to load
- llvm bitstream: confusing model and libraries are hard to use. My attempt
produced slightly larger files, and the code was longer and slower.
- protobuf or similar: would be really nice (esp for back-compat) but the
dependency is a big hassle
- ad-hoc binary format without a container: it seems clear we're going
to add posting lists and occurrences here, and that they will benefit
from sharing a string table. The container makes it easy to debug
these pieces in isolation, and make them optional.
Reviewers: ioeric
Subscribers: mgorny, ilya-biryukov, MaskRay, jkorous, mgrang, arphaman, kadircet, cfe-commits
Differential Revision: https://reviews.llvm.org/D51585
llvm-svn: 341375
2018-09-05 00:16:50 +08:00
|
|
|
|
|
|
|
// The current versioning scheme is simple - non-current versions are rejected.
|
2018-09-05 21:17:47 +08:00
|
|
|
// If you make a breaking change, bump this version number to invalidate stored
|
|
|
|
// data. Later we may want to support some backward compatibility.
|
2018-11-16 18:58:40 +08:00
|
|
|
constexpr static uint32_t Version = 7;
|
[clangd] Define a compact binary serialization fomat for symbol slab/index.
Summary:
This is intended to replace the current YAML format for general use.
It's ~10x more compact than YAML, and ~40% more compact than gzipped YAML:
llvmidx.riff = 20M, llvmidx.yaml = 272M, llvmidx.yaml.gz = 32M
It's also simpler/faster to read and write.
The format is a RIFF container (chunks of (type, size, data)) with:
- a compressed string table
- simple binary encoding of symbols (with varints for compactness)
It can be extended to include occurrences, Dex posting lists, etc.
There's no rich backwards-compatibility scheme, but a version number is included
so we can detect incompatible files and do ad-hoc back-compat.
Alternatives considered:
- compressed YAML or JSON: bulky and slow to load
- llvm bitstream: confusing model and libraries are hard to use. My attempt
produced slightly larger files, and the code was longer and slower.
- protobuf or similar: would be really nice (esp for back-compat) but the
dependency is a big hassle
- ad-hoc binary format without a container: it seems clear we're going
to add posting lists and occurrences here, and that they will benefit
from sharing a string table. The container makes it easy to debug
these pieces in isolation, and make them optional.
Reviewers: ioeric
Subscribers: mgorny, ilya-biryukov, MaskRay, jkorous, mgrang, arphaman, kadircet, cfe-commits
Differential Revision: https://reviews.llvm.org/D51585
llvm-svn: 341375
2018-09-05 00:16:50 +08:00
|
|
|
|
[clangd] Merge binary + YAML serialization behind a (mostly) common interface.
Summary:
Interface is in one file, implementation in two as they have little in common.
A couple of ad-hoc YAML functions left exposed:
- symbol -> YAML I expect to keep for tools like dexp
- YAML -> symbol is used for the MR-style indexer, I think we can eliminate
this (merge-on-the-fly, else use a different serialization)
Reviewers: kbobyrev
Subscribers: mgorny, ilya-biryukov, ioeric, MaskRay, jkorous, arphaman, kadircet, cfe-commits
Differential Revision: https://reviews.llvm.org/D52453
llvm-svn: 342999
2018-09-26 02:06:43 +08:00
|
|
|
Expected<IndexFileIn> readRIFF(StringRef Data) {
|
[clangd] Define a compact binary serialization fomat for symbol slab/index.
Summary:
This is intended to replace the current YAML format for general use.
It's ~10x more compact than YAML, and ~40% more compact than gzipped YAML:
llvmidx.riff = 20M, llvmidx.yaml = 272M, llvmidx.yaml.gz = 32M
It's also simpler/faster to read and write.
The format is a RIFF container (chunks of (type, size, data)) with:
- a compressed string table
- simple binary encoding of symbols (with varints for compactness)
It can be extended to include occurrences, Dex posting lists, etc.
There's no rich backwards-compatibility scheme, but a version number is included
so we can detect incompatible files and do ad-hoc back-compat.
Alternatives considered:
- compressed YAML or JSON: bulky and slow to load
- llvm bitstream: confusing model and libraries are hard to use. My attempt
produced slightly larger files, and the code was longer and slower.
- protobuf or similar: would be really nice (esp for back-compat) but the
dependency is a big hassle
- ad-hoc binary format without a container: it seems clear we're going
to add posting lists and occurrences here, and that they will benefit
from sharing a string table. The container makes it easy to debug
these pieces in isolation, and make them optional.
Reviewers: ioeric
Subscribers: mgorny, ilya-biryukov, MaskRay, jkorous, mgrang, arphaman, kadircet, cfe-commits
Differential Revision: https://reviews.llvm.org/D51585
llvm-svn: 341375
2018-09-05 00:16:50 +08:00
|
|
|
auto RIFF = riff::readFile(Data);
|
|
|
|
if (!RIFF)
|
|
|
|
return RIFF.takeError();
|
|
|
|
if (RIFF->Type != riff::fourCC("CdIx"))
|
|
|
|
return makeError("wrong RIFF type");
|
|
|
|
StringMap<StringRef> Chunks;
|
|
|
|
for (const auto &Chunk : RIFF->Chunks)
|
|
|
|
Chunks.try_emplace(StringRef(Chunk.ID.data(), Chunk.ID.size()), Chunk.Data);
|
|
|
|
|
|
|
|
for (StringRef RequiredChunk : {"meta", "stri"})
|
|
|
|
if (!Chunks.count(RequiredChunk))
|
|
|
|
return makeError("missing required chunk " + RequiredChunk);
|
|
|
|
|
2018-09-24 22:51:15 +08:00
|
|
|
Reader Meta(Chunks.lookup("meta"));
|
|
|
|
if (Meta.consume32() != Version)
|
[clangd] Define a compact binary serialization fomat for symbol slab/index.
Summary:
This is intended to replace the current YAML format for general use.
It's ~10x more compact than YAML, and ~40% more compact than gzipped YAML:
llvmidx.riff = 20M, llvmidx.yaml = 272M, llvmidx.yaml.gz = 32M
It's also simpler/faster to read and write.
The format is a RIFF container (chunks of (type, size, data)) with:
- a compressed string table
- simple binary encoding of symbols (with varints for compactness)
It can be extended to include occurrences, Dex posting lists, etc.
There's no rich backwards-compatibility scheme, but a version number is included
so we can detect incompatible files and do ad-hoc back-compat.
Alternatives considered:
- compressed YAML or JSON: bulky and slow to load
- llvm bitstream: confusing model and libraries are hard to use. My attempt
produced slightly larger files, and the code was longer and slower.
- protobuf or similar: would be really nice (esp for back-compat) but the
dependency is a big hassle
- ad-hoc binary format without a container: it seems clear we're going
to add posting lists and occurrences here, and that they will benefit
from sharing a string table. The container makes it easy to debug
these pieces in isolation, and make them optional.
Reviewers: ioeric
Subscribers: mgorny, ilya-biryukov, MaskRay, jkorous, mgrang, arphaman, kadircet, cfe-commits
Differential Revision: https://reviews.llvm.org/D51585
llvm-svn: 341375
2018-09-05 00:16:50 +08:00
|
|
|
return makeError("wrong version");
|
|
|
|
|
|
|
|
auto Strings = readStringTable(Chunks.lookup("stri"));
|
|
|
|
if (!Strings)
|
|
|
|
return Strings.takeError();
|
|
|
|
|
|
|
|
IndexFileIn Result;
|
2018-11-20 02:06:36 +08:00
|
|
|
if (Chunks.count("srcs")) {
|
|
|
|
Reader Hash(Chunks.lookup("srcs"));
|
2018-11-20 02:06:29 +08:00
|
|
|
Result.Digest.emplace();
|
2018-11-20 02:06:33 +08:00
|
|
|
llvm::StringRef Digest = Hash.consume(Result.Digest->size());
|
2018-11-20 02:06:29 +08:00
|
|
|
std::copy(Digest.bytes_begin(), Digest.bytes_end(), Result.Digest->begin());
|
|
|
|
}
|
|
|
|
|
[clangd] Define a compact binary serialization fomat for symbol slab/index.
Summary:
This is intended to replace the current YAML format for general use.
It's ~10x more compact than YAML, and ~40% more compact than gzipped YAML:
llvmidx.riff = 20M, llvmidx.yaml = 272M, llvmidx.yaml.gz = 32M
It's also simpler/faster to read and write.
The format is a RIFF container (chunks of (type, size, data)) with:
- a compressed string table
- simple binary encoding of symbols (with varints for compactness)
It can be extended to include occurrences, Dex posting lists, etc.
There's no rich backwards-compatibility scheme, but a version number is included
so we can detect incompatible files and do ad-hoc back-compat.
Alternatives considered:
- compressed YAML or JSON: bulky and slow to load
- llvm bitstream: confusing model and libraries are hard to use. My attempt
produced slightly larger files, and the code was longer and slower.
- protobuf or similar: would be really nice (esp for back-compat) but the
dependency is a big hassle
- ad-hoc binary format without a container: it seems clear we're going
to add posting lists and occurrences here, and that they will benefit
from sharing a string table. The container makes it easy to debug
these pieces in isolation, and make them optional.
Reviewers: ioeric
Subscribers: mgorny, ilya-biryukov, MaskRay, jkorous, mgrang, arphaman, kadircet, cfe-commits
Differential Revision: https://reviews.llvm.org/D51585
llvm-svn: 341375
2018-09-05 00:16:50 +08:00
|
|
|
if (Chunks.count("symb")) {
|
2018-09-24 22:51:15 +08:00
|
|
|
Reader SymbolReader(Chunks.lookup("symb"));
|
[clangd] Define a compact binary serialization fomat for symbol slab/index.
Summary:
This is intended to replace the current YAML format for general use.
It's ~10x more compact than YAML, and ~40% more compact than gzipped YAML:
llvmidx.riff = 20M, llvmidx.yaml = 272M, llvmidx.yaml.gz = 32M
It's also simpler/faster to read and write.
The format is a RIFF container (chunks of (type, size, data)) with:
- a compressed string table
- simple binary encoding of symbols (with varints for compactness)
It can be extended to include occurrences, Dex posting lists, etc.
There's no rich backwards-compatibility scheme, but a version number is included
so we can detect incompatible files and do ad-hoc back-compat.
Alternatives considered:
- compressed YAML or JSON: bulky and slow to load
- llvm bitstream: confusing model and libraries are hard to use. My attempt
produced slightly larger files, and the code was longer and slower.
- protobuf or similar: would be really nice (esp for back-compat) but the
dependency is a big hassle
- ad-hoc binary format without a container: it seems clear we're going
to add posting lists and occurrences here, and that they will benefit
from sharing a string table. The container makes it easy to debug
these pieces in isolation, and make them optional.
Reviewers: ioeric
Subscribers: mgorny, ilya-biryukov, MaskRay, jkorous, mgrang, arphaman, kadircet, cfe-commits
Differential Revision: https://reviews.llvm.org/D51585
llvm-svn: 341375
2018-09-05 00:16:50 +08:00
|
|
|
SymbolSlab::Builder Symbols;
|
2018-09-24 22:51:15 +08:00
|
|
|
while (!SymbolReader.eof())
|
|
|
|
Symbols.insert(readSymbol(SymbolReader, Strings->Strings));
|
|
|
|
if (SymbolReader.err())
|
|
|
|
return makeError("malformed or truncated symbol");
|
[clangd] Define a compact binary serialization fomat for symbol slab/index.
Summary:
This is intended to replace the current YAML format for general use.
It's ~10x more compact than YAML, and ~40% more compact than gzipped YAML:
llvmidx.riff = 20M, llvmidx.yaml = 272M, llvmidx.yaml.gz = 32M
It's also simpler/faster to read and write.
The format is a RIFF container (chunks of (type, size, data)) with:
- a compressed string table
- simple binary encoding of symbols (with varints for compactness)
It can be extended to include occurrences, Dex posting lists, etc.
There's no rich backwards-compatibility scheme, but a version number is included
so we can detect incompatible files and do ad-hoc back-compat.
Alternatives considered:
- compressed YAML or JSON: bulky and slow to load
- llvm bitstream: confusing model and libraries are hard to use. My attempt
produced slightly larger files, and the code was longer and slower.
- protobuf or similar: would be really nice (esp for back-compat) but the
dependency is a big hassle
- ad-hoc binary format without a container: it seems clear we're going
to add posting lists and occurrences here, and that they will benefit
from sharing a string table. The container makes it easy to debug
these pieces in isolation, and make them optional.
Reviewers: ioeric
Subscribers: mgorny, ilya-biryukov, MaskRay, jkorous, mgrang, arphaman, kadircet, cfe-commits
Differential Revision: https://reviews.llvm.org/D51585
llvm-svn: 341375
2018-09-05 00:16:50 +08:00
|
|
|
Result.Symbols = std::move(Symbols).build();
|
|
|
|
}
|
2018-10-04 22:09:55 +08:00
|
|
|
if (Chunks.count("refs")) {
|
|
|
|
Reader RefsReader(Chunks.lookup("refs"));
|
|
|
|
RefSlab::Builder Refs;
|
|
|
|
while (!RefsReader.eof()) {
|
|
|
|
auto RefsBundle = readRefs(RefsReader, Strings->Strings);
|
|
|
|
for (const auto &Ref : RefsBundle.second) // FIXME: bulk insert?
|
|
|
|
Refs.insert(RefsBundle.first, Ref);
|
|
|
|
}
|
|
|
|
if (RefsReader.err())
|
|
|
|
return makeError("malformed or truncated refs");
|
|
|
|
Result.Refs = std::move(Refs).build();
|
|
|
|
}
|
2018-09-05 15:52:49 +08:00
|
|
|
return std::move(Result);
|
[clangd] Define a compact binary serialization fomat for symbol slab/index.
Summary:
This is intended to replace the current YAML format for general use.
It's ~10x more compact than YAML, and ~40% more compact than gzipped YAML:
llvmidx.riff = 20M, llvmidx.yaml = 272M, llvmidx.yaml.gz = 32M
It's also simpler/faster to read and write.
The format is a RIFF container (chunks of (type, size, data)) with:
- a compressed string table
- simple binary encoding of symbols (with varints for compactness)
It can be extended to include occurrences, Dex posting lists, etc.
There's no rich backwards-compatibility scheme, but a version number is included
so we can detect incompatible files and do ad-hoc back-compat.
Alternatives considered:
- compressed YAML or JSON: bulky and slow to load
- llvm bitstream: confusing model and libraries are hard to use. My attempt
produced slightly larger files, and the code was longer and slower.
- protobuf or similar: would be really nice (esp for back-compat) but the
dependency is a big hassle
- ad-hoc binary format without a container: it seems clear we're going
to add posting lists and occurrences here, and that they will benefit
from sharing a string table. The container makes it easy to debug
these pieces in isolation, and make them optional.
Reviewers: ioeric
Subscribers: mgorny, ilya-biryukov, MaskRay, jkorous, mgrang, arphaman, kadircet, cfe-commits
Differential Revision: https://reviews.llvm.org/D51585
llvm-svn: 341375
2018-09-05 00:16:50 +08:00
|
|
|
}
|
|
|
|
|
[clangd] Merge binary + YAML serialization behind a (mostly) common interface.
Summary:
Interface is in one file, implementation in two as they have little in common.
A couple of ad-hoc YAML functions left exposed:
- symbol -> YAML I expect to keep for tools like dexp
- YAML -> symbol is used for the MR-style indexer, I think we can eliminate
this (merge-on-the-fly, else use a different serialization)
Reviewers: kbobyrev
Subscribers: mgorny, ilya-biryukov, ioeric, MaskRay, jkorous, arphaman, kadircet, cfe-commits
Differential Revision: https://reviews.llvm.org/D52453
llvm-svn: 342999
2018-09-26 02:06:43 +08:00
|
|
|
void writeRIFF(const IndexFileOut &Data, raw_ostream &OS) {
|
[clangd] Define a compact binary serialization fomat for symbol slab/index.
Summary:
This is intended to replace the current YAML format for general use.
It's ~10x more compact than YAML, and ~40% more compact than gzipped YAML:
llvmidx.riff = 20M, llvmidx.yaml = 272M, llvmidx.yaml.gz = 32M
It's also simpler/faster to read and write.
The format is a RIFF container (chunks of (type, size, data)) with:
- a compressed string table
- simple binary encoding of symbols (with varints for compactness)
It can be extended to include occurrences, Dex posting lists, etc.
There's no rich backwards-compatibility scheme, but a version number is included
so we can detect incompatible files and do ad-hoc back-compat.
Alternatives considered:
- compressed YAML or JSON: bulky and slow to load
- llvm bitstream: confusing model and libraries are hard to use. My attempt
produced slightly larger files, and the code was longer and slower.
- protobuf or similar: would be really nice (esp for back-compat) but the
dependency is a big hassle
- ad-hoc binary format without a container: it seems clear we're going
to add posting lists and occurrences here, and that they will benefit
from sharing a string table. The container makes it easy to debug
these pieces in isolation, and make them optional.
Reviewers: ioeric
Subscribers: mgorny, ilya-biryukov, MaskRay, jkorous, mgrang, arphaman, kadircet, cfe-commits
Differential Revision: https://reviews.llvm.org/D51585
llvm-svn: 341375
2018-09-05 00:16:50 +08:00
|
|
|
assert(Data.Symbols && "An index file without symbols makes no sense!");
|
|
|
|
riff::File RIFF;
|
|
|
|
RIFF.Type = riff::fourCC("CdIx");
|
|
|
|
|
|
|
|
SmallString<4> Meta;
|
|
|
|
{
|
|
|
|
raw_svector_ostream MetaOS(Meta);
|
|
|
|
write32(Version, MetaOS);
|
|
|
|
}
|
|
|
|
RIFF.Chunks.push_back({riff::fourCC("meta"), Meta});
|
|
|
|
|
2018-11-20 02:06:29 +08:00
|
|
|
if (Data.Digest) {
|
|
|
|
llvm::StringRef Hash(reinterpret_cast<const char *>(Data.Digest->data()),
|
|
|
|
Data.Digest->size());
|
2018-11-20 02:06:36 +08:00
|
|
|
RIFF.Chunks.push_back({riff::fourCC("srcs"), Hash});
|
2018-11-20 02:06:29 +08:00
|
|
|
}
|
|
|
|
|
[clangd] Define a compact binary serialization fomat for symbol slab/index.
Summary:
This is intended to replace the current YAML format for general use.
It's ~10x more compact than YAML, and ~40% more compact than gzipped YAML:
llvmidx.riff = 20M, llvmidx.yaml = 272M, llvmidx.yaml.gz = 32M
It's also simpler/faster to read and write.
The format is a RIFF container (chunks of (type, size, data)) with:
- a compressed string table
- simple binary encoding of symbols (with varints for compactness)
It can be extended to include occurrences, Dex posting lists, etc.
There's no rich backwards-compatibility scheme, but a version number is included
so we can detect incompatible files and do ad-hoc back-compat.
Alternatives considered:
- compressed YAML or JSON: bulky and slow to load
- llvm bitstream: confusing model and libraries are hard to use. My attempt
produced slightly larger files, and the code was longer and slower.
- protobuf or similar: would be really nice (esp for back-compat) but the
dependency is a big hassle
- ad-hoc binary format without a container: it seems clear we're going
to add posting lists and occurrences here, and that they will benefit
from sharing a string table. The container makes it easy to debug
these pieces in isolation, and make them optional.
Reviewers: ioeric
Subscribers: mgorny, ilya-biryukov, MaskRay, jkorous, mgrang, arphaman, kadircet, cfe-commits
Differential Revision: https://reviews.llvm.org/D51585
llvm-svn: 341375
2018-09-05 00:16:50 +08:00
|
|
|
StringTableOut Strings;
|
|
|
|
std::vector<Symbol> Symbols;
|
|
|
|
for (const auto &Sym : *Data.Symbols) {
|
|
|
|
Symbols.emplace_back(Sym);
|
|
|
|
visitStrings(Symbols.back(), [&](StringRef &S) { Strings.intern(S); });
|
|
|
|
}
|
2018-10-04 22:09:55 +08:00
|
|
|
std::vector<std::pair<SymbolID, std::vector<Ref>>> Refs;
|
|
|
|
if (Data.Refs) {
|
|
|
|
for (const auto &Sym : *Data.Refs) {
|
|
|
|
Refs.emplace_back(Sym);
|
2018-11-14 19:55:45 +08:00
|
|
|
for (auto &Ref : Refs.back().second) {
|
|
|
|
StringRef File = Ref.Location.FileURI;
|
|
|
|
Strings.intern(File);
|
|
|
|
Ref.Location.FileURI = File.data();
|
|
|
|
}
|
2018-10-04 22:09:55 +08:00
|
|
|
}
|
|
|
|
}
|
[clangd] Define a compact binary serialization fomat for symbol slab/index.
Summary:
This is intended to replace the current YAML format for general use.
It's ~10x more compact than YAML, and ~40% more compact than gzipped YAML:
llvmidx.riff = 20M, llvmidx.yaml = 272M, llvmidx.yaml.gz = 32M
It's also simpler/faster to read and write.
The format is a RIFF container (chunks of (type, size, data)) with:
- a compressed string table
- simple binary encoding of symbols (with varints for compactness)
It can be extended to include occurrences, Dex posting lists, etc.
There's no rich backwards-compatibility scheme, but a version number is included
so we can detect incompatible files and do ad-hoc back-compat.
Alternatives considered:
- compressed YAML or JSON: bulky and slow to load
- llvm bitstream: confusing model and libraries are hard to use. My attempt
produced slightly larger files, and the code was longer and slower.
- protobuf or similar: would be really nice (esp for back-compat) but the
dependency is a big hassle
- ad-hoc binary format without a container: it seems clear we're going
to add posting lists and occurrences here, and that they will benefit
from sharing a string table. The container makes it easy to debug
these pieces in isolation, and make them optional.
Reviewers: ioeric
Subscribers: mgorny, ilya-biryukov, MaskRay, jkorous, mgrang, arphaman, kadircet, cfe-commits
Differential Revision: https://reviews.llvm.org/D51585
llvm-svn: 341375
2018-09-05 00:16:50 +08:00
|
|
|
|
|
|
|
std::string StringSection;
|
|
|
|
{
|
|
|
|
raw_string_ostream StringOS(StringSection);
|
|
|
|
Strings.finalize(StringOS);
|
|
|
|
}
|
|
|
|
RIFF.Chunks.push_back({riff::fourCC("stri"), StringSection});
|
|
|
|
|
|
|
|
std::string SymbolSection;
|
|
|
|
{
|
|
|
|
raw_string_ostream SymbolOS(SymbolSection);
|
|
|
|
for (const auto &Sym : Symbols)
|
|
|
|
writeSymbol(Sym, Strings, SymbolOS);
|
|
|
|
}
|
|
|
|
RIFF.Chunks.push_back({riff::fourCC("symb"), SymbolSection});
|
|
|
|
|
2018-10-04 22:09:55 +08:00
|
|
|
std::string RefsSection;
|
|
|
|
if (Data.Refs) {
|
|
|
|
{
|
|
|
|
raw_string_ostream RefsOS(RefsSection);
|
|
|
|
for (const auto &Sym : Refs)
|
|
|
|
writeRefs(Sym.first, Sym.second, Strings, RefsOS);
|
|
|
|
}
|
|
|
|
RIFF.Chunks.push_back({riff::fourCC("refs"), RefsSection});
|
|
|
|
}
|
|
|
|
|
[clangd] Merge binary + YAML serialization behind a (mostly) common interface.
Summary:
Interface is in one file, implementation in two as they have little in common.
A couple of ad-hoc YAML functions left exposed:
- symbol -> YAML I expect to keep for tools like dexp
- YAML -> symbol is used for the MR-style indexer, I think we can eliminate
this (merge-on-the-fly, else use a different serialization)
Reviewers: kbobyrev
Subscribers: mgorny, ilya-biryukov, ioeric, MaskRay, jkorous, arphaman, kadircet, cfe-commits
Differential Revision: https://reviews.llvm.org/D52453
llvm-svn: 342999
2018-09-26 02:06:43 +08:00
|
|
|
OS << RIFF;
|
|
|
|
}
|
|
|
|
|
|
|
|
} // namespace
|
|
|
|
|
|
|
|
// Defined in YAMLSerialization.cpp.
|
|
|
|
void writeYAML(const IndexFileOut &, raw_ostream &);
|
|
|
|
Expected<IndexFileIn> readYAML(StringRef);
|
|
|
|
|
2018-10-20 23:30:37 +08:00
|
|
|
raw_ostream &operator<<(raw_ostream &OS, const IndexFileOut &O) {
|
[clangd] Merge binary + YAML serialization behind a (mostly) common interface.
Summary:
Interface is in one file, implementation in two as they have little in common.
A couple of ad-hoc YAML functions left exposed:
- symbol -> YAML I expect to keep for tools like dexp
- YAML -> symbol is used for the MR-style indexer, I think we can eliminate
this (merge-on-the-fly, else use a different serialization)
Reviewers: kbobyrev
Subscribers: mgorny, ilya-biryukov, ioeric, MaskRay, jkorous, arphaman, kadircet, cfe-commits
Differential Revision: https://reviews.llvm.org/D52453
llvm-svn: 342999
2018-09-26 02:06:43 +08:00
|
|
|
switch (O.Format) {
|
|
|
|
case IndexFileFormat::RIFF:
|
2018-09-26 03:53:33 +08:00
|
|
|
writeRIFF(O, OS);
|
[clangd] Merge binary + YAML serialization behind a (mostly) common interface.
Summary:
Interface is in one file, implementation in two as they have little in common.
A couple of ad-hoc YAML functions left exposed:
- symbol -> YAML I expect to keep for tools like dexp
- YAML -> symbol is used for the MR-style indexer, I think we can eliminate
this (merge-on-the-fly, else use a different serialization)
Reviewers: kbobyrev
Subscribers: mgorny, ilya-biryukov, ioeric, MaskRay, jkorous, arphaman, kadircet, cfe-commits
Differential Revision: https://reviews.llvm.org/D52453
llvm-svn: 342999
2018-09-26 02:06:43 +08:00
|
|
|
break;
|
|
|
|
case IndexFileFormat::YAML:
|
2018-09-26 03:53:33 +08:00
|
|
|
writeYAML(O, OS);
|
[clangd] Merge binary + YAML serialization behind a (mostly) common interface.
Summary:
Interface is in one file, implementation in two as they have little in common.
A couple of ad-hoc YAML functions left exposed:
- symbol -> YAML I expect to keep for tools like dexp
- YAML -> symbol is used for the MR-style indexer, I think we can eliminate
this (merge-on-the-fly, else use a different serialization)
Reviewers: kbobyrev
Subscribers: mgorny, ilya-biryukov, ioeric, MaskRay, jkorous, arphaman, kadircet, cfe-commits
Differential Revision: https://reviews.llvm.org/D52453
llvm-svn: 342999
2018-09-26 02:06:43 +08:00
|
|
|
break;
|
|
|
|
}
|
|
|
|
return OS;
|
|
|
|
}
|
|
|
|
|
|
|
|
Expected<IndexFileIn> readIndexFile(StringRef Data) {
|
|
|
|
if (Data.startswith("RIFF")) {
|
|
|
|
return readRIFF(Data);
|
|
|
|
} else if (auto YAMLContents = readYAML(Data)) {
|
|
|
|
return std::move(*YAMLContents);
|
|
|
|
} else {
|
|
|
|
return makeError("Not a RIFF file and failed to parse as YAML: " +
|
2018-10-20 23:30:37 +08:00
|
|
|
toString(YAMLContents.takeError()));
|
[clangd] Merge binary + YAML serialization behind a (mostly) common interface.
Summary:
Interface is in one file, implementation in two as they have little in common.
A couple of ad-hoc YAML functions left exposed:
- symbol -> YAML I expect to keep for tools like dexp
- YAML -> symbol is used for the MR-style indexer, I think we can eliminate
this (merge-on-the-fly, else use a different serialization)
Reviewers: kbobyrev
Subscribers: mgorny, ilya-biryukov, ioeric, MaskRay, jkorous, arphaman, kadircet, cfe-commits
Differential Revision: https://reviews.llvm.org/D52453
llvm-svn: 342999
2018-09-26 02:06:43 +08:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
[clangd] Cleanup: stop passing around list of supported URI schemes.
Summary:
Instead of passing around a list of supported URI schemes in clangd, we
expose an interface to convert a path to URI using any compatible scheme
that has been registered. It favors customized schemes and falls
back to "file" when no other scheme works.
Changes in this patch are:
- URI::create(AbsPath, URISchemes) -> URI::create(AbsPath). The new API finds a
compatible scheme from the registry.
- Remove URISchemes option everywhere (ClangdServer, SymbolCollecter, FileIndex etc).
- Unit tests will use "unittest" by default.
- Move "test" scheme from ClangdLSPServer to ClangdMain.cpp, and only
register the test scheme when lit-test or enable-lit-scheme is set.
(The new flag is added to make lit protocol.test work; I wonder if there
is alternative here.)
Reviewers: sammccall
Reviewed By: sammccall
Subscribers: ilya-biryukov, MaskRay, jkorous, arphaman, kadircet, cfe-commits
Differential Revision: https://reviews.llvm.org/D54800
llvm-svn: 347467
2018-11-22 23:02:05 +08:00
|
|
|
std::unique_ptr<SymbolIndex> loadIndex(StringRef SymbolFilename, bool UseDex) {
|
[clangd] Merge binary + YAML serialization behind a (mostly) common interface.
Summary:
Interface is in one file, implementation in two as they have little in common.
A couple of ad-hoc YAML functions left exposed:
- symbol -> YAML I expect to keep for tools like dexp
- YAML -> symbol is used for the MR-style indexer, I think we can eliminate
this (merge-on-the-fly, else use a different serialization)
Reviewers: kbobyrev
Subscribers: mgorny, ilya-biryukov, ioeric, MaskRay, jkorous, arphaman, kadircet, cfe-commits
Differential Revision: https://reviews.llvm.org/D52453
llvm-svn: 342999
2018-09-26 02:06:43 +08:00
|
|
|
trace::Span OverallTracer("LoadIndex");
|
|
|
|
auto Buffer = MemoryBuffer::getFile(SymbolFilename);
|
|
|
|
if (!Buffer) {
|
2018-10-20 23:30:37 +08:00
|
|
|
errs() << "Can't open " << SymbolFilename << "\n";
|
[clangd] Merge binary + YAML serialization behind a (mostly) common interface.
Summary:
Interface is in one file, implementation in two as they have little in common.
A couple of ad-hoc YAML functions left exposed:
- symbol -> YAML I expect to keep for tools like dexp
- YAML -> symbol is used for the MR-style indexer, I think we can eliminate
this (merge-on-the-fly, else use a different serialization)
Reviewers: kbobyrev
Subscribers: mgorny, ilya-biryukov, ioeric, MaskRay, jkorous, arphaman, kadircet, cfe-commits
Differential Revision: https://reviews.llvm.org/D52453
llvm-svn: 342999
2018-09-26 02:06:43 +08:00
|
|
|
return nullptr;
|
|
|
|
}
|
|
|
|
|
|
|
|
SymbolSlab Symbols;
|
|
|
|
RefSlab Refs;
|
|
|
|
{
|
|
|
|
trace::Span Tracer("ParseIndex");
|
|
|
|
if (auto I = readIndexFile(Buffer->get()->getBuffer())) {
|
|
|
|
if (I->Symbols)
|
|
|
|
Symbols = std::move(*I->Symbols);
|
2018-10-04 22:09:55 +08:00
|
|
|
if (I->Refs)
|
|
|
|
Refs = std::move(*I->Refs);
|
[clangd] Merge binary + YAML serialization behind a (mostly) common interface.
Summary:
Interface is in one file, implementation in two as they have little in common.
A couple of ad-hoc YAML functions left exposed:
- symbol -> YAML I expect to keep for tools like dexp
- YAML -> symbol is used for the MR-style indexer, I think we can eliminate
this (merge-on-the-fly, else use a different serialization)
Reviewers: kbobyrev
Subscribers: mgorny, ilya-biryukov, ioeric, MaskRay, jkorous, arphaman, kadircet, cfe-commits
Differential Revision: https://reviews.llvm.org/D52453
llvm-svn: 342999
2018-09-26 02:06:43 +08:00
|
|
|
} else {
|
2018-10-20 23:30:37 +08:00
|
|
|
errs() << "Bad Index: " << toString(I.takeError()) << "\n";
|
[clangd] Merge binary + YAML serialization behind a (mostly) common interface.
Summary:
Interface is in one file, implementation in two as they have little in common.
A couple of ad-hoc YAML functions left exposed:
- symbol -> YAML I expect to keep for tools like dexp
- YAML -> symbol is used for the MR-style indexer, I think we can eliminate
this (merge-on-the-fly, else use a different serialization)
Reviewers: kbobyrev
Subscribers: mgorny, ilya-biryukov, ioeric, MaskRay, jkorous, arphaman, kadircet, cfe-commits
Differential Revision: https://reviews.llvm.org/D52453
llvm-svn: 342999
2018-09-26 02:06:43 +08:00
|
|
|
return nullptr;
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2018-10-18 23:33:20 +08:00
|
|
|
size_t NumSym = Symbols.size();
|
|
|
|
size_t NumRefs = Refs.numRefs();
|
|
|
|
|
[clangd] Merge binary + YAML serialization behind a (mostly) common interface.
Summary:
Interface is in one file, implementation in two as they have little in common.
A couple of ad-hoc YAML functions left exposed:
- symbol -> YAML I expect to keep for tools like dexp
- YAML -> symbol is used for the MR-style indexer, I think we can eliminate
this (merge-on-the-fly, else use a different serialization)
Reviewers: kbobyrev
Subscribers: mgorny, ilya-biryukov, ioeric, MaskRay, jkorous, arphaman, kadircet, cfe-commits
Differential Revision: https://reviews.llvm.org/D52453
llvm-svn: 342999
2018-09-26 02:06:43 +08:00
|
|
|
trace::Span Tracer("BuildIndex");
|
[clangd] Cleanup: stop passing around list of supported URI schemes.
Summary:
Instead of passing around a list of supported URI schemes in clangd, we
expose an interface to convert a path to URI using any compatible scheme
that has been registered. It favors customized schemes and falls
back to "file" when no other scheme works.
Changes in this patch are:
- URI::create(AbsPath, URISchemes) -> URI::create(AbsPath). The new API finds a
compatible scheme from the registry.
- Remove URISchemes option everywhere (ClangdServer, SymbolCollecter, FileIndex etc).
- Unit tests will use "unittest" by default.
- Move "test" scheme from ClangdLSPServer to ClangdMain.cpp, and only
register the test scheme when lit-test or enable-lit-scheme is set.
(The new flag is added to make lit protocol.test work; I wonder if there
is alternative here.)
Reviewers: sammccall
Reviewed By: sammccall
Subscribers: ilya-biryukov, MaskRay, jkorous, arphaman, kadircet, cfe-commits
Differential Revision: https://reviews.llvm.org/D54800
llvm-svn: 347467
2018-11-22 23:02:05 +08:00
|
|
|
auto Index = UseDex ? dex::Dex::build(std::move(Symbols), std::move(Refs))
|
|
|
|
: MemIndex::build(std::move(Symbols), std::move(Refs));
|
2018-10-17 16:48:04 +08:00
|
|
|
vlog("Loaded {0} from {1} with estimated memory usage {2} bytes\n"
|
2018-10-18 23:33:20 +08:00
|
|
|
" - number of symbols: {3}\n"
|
2018-10-17 16:48:04 +08:00
|
|
|
" - number of refs: {4}\n",
|
2018-09-26 23:06:23 +08:00
|
|
|
UseDex ? "Dex" : "MemIndex", SymbolFilename,
|
2018-10-18 23:33:20 +08:00
|
|
|
Index->estimateMemoryUsage(), NumSym, NumRefs);
|
2018-09-26 23:06:23 +08:00
|
|
|
return Index;
|
[clangd] Define a compact binary serialization fomat for symbol slab/index.
Summary:
This is intended to replace the current YAML format for general use.
It's ~10x more compact than YAML, and ~40% more compact than gzipped YAML:
llvmidx.riff = 20M, llvmidx.yaml = 272M, llvmidx.yaml.gz = 32M
It's also simpler/faster to read and write.
The format is a RIFF container (chunks of (type, size, data)) with:
- a compressed string table
- simple binary encoding of symbols (with varints for compactness)
It can be extended to include occurrences, Dex posting lists, etc.
There's no rich backwards-compatibility scheme, but a version number is included
so we can detect incompatible files and do ad-hoc back-compat.
Alternatives considered:
- compressed YAML or JSON: bulky and slow to load
- llvm bitstream: confusing model and libraries are hard to use. My attempt
produced slightly larger files, and the code was longer and slower.
- protobuf or similar: would be really nice (esp for back-compat) but the
dependency is a big hassle
- ad-hoc binary format without a container: it seems clear we're going
to add posting lists and occurrences here, and that they will benefit
from sharing a string table. The container makes it easy to debug
these pieces in isolation, and make them optional.
Reviewers: ioeric
Subscribers: mgorny, ilya-biryukov, MaskRay, jkorous, mgrang, arphaman, kadircet, cfe-commits
Differential Revision: https://reviews.llvm.org/D51585
llvm-svn: 341375
2018-09-05 00:16:50 +08:00
|
|
|
}
|
|
|
|
|
|
|
|
} // namespace clangd
|
|
|
|
} // namespace clang
|