[clangd] Define a compact binary serialization format for symbol slab/index.
Summary:
This is intended to replace the current YAML format for general use.
It's ~10x more compact than YAML, and ~40% more compact than gzipped YAML:
llvmidx.riff = 20M, llvmidx.yaml = 272M, llvmidx.yaml.gz = 32M
It's also simpler/faster to read and write.
The format is a RIFF container (chunks of (type, size, data)) with:
- a compressed string table
- simple binary encoding of symbols (with varints for compactness)
It can be extended to include occurrences, Dex posting lists, etc.
There's no rich backwards-compatibility scheme, but a version number is included
so we can detect incompatible files and do ad-hoc back-compat.
Alternatives considered:
- compressed YAML or JSON: bulky and slow to load
- llvm bitstream: confusing model and libraries are hard to use. My attempt
produced slightly larger files, and the code was longer and slower.
- protobuf or similar: would be really nice (esp for back-compat) but the
dependency is a big hassle
- ad-hoc binary format without a container: it seems clear we're going
to add posting lists and occurrences here, and that they will benefit
from sharing a string table. The container makes it easy to debug
these pieces in isolation, and make them optional.
Reviewers: ioeric
Subscribers: mgorny, ilya-biryukov, MaskRay, jkorous, mgrang, arphaman, kadircet, cfe-commits
Differential Revision: https://reviews.llvm.org/D51585
llvm-svn: 341375
2018-09-05 00:16:50 +08:00
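To make the "chunks of (type, size, data)" layout concrete, here is a minimal sketch of how such a container could be serialized. It is not the code from this commit: `appendLE32`, `appendChunk`, and `encodeFile` are hypothetical names, it uses only the standard library rather than LLVM types, and it assumes the little-endian length encoding that RIFF conventionally uses (the commit message does not state the byte order). The pad byte follows the `Length % 2` rule given in the RIFF.h header comment.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for illustration; mirrors the FourCC alias in RIFF.h.
using FourCC = std::array<char, 4>;

// Appends a 32-bit value in little-endian byte order (assumed, see lead-in).
inline void appendLE32(std::string &Out, uint32_t V) {
  for (int I = 0; I < 4; ++I)
    Out.push_back(static_cast<char>((V >> (8 * I)) & 0xff));
}

// Appends one chunk: ID, Length, Data, then one zero pad byte if Length is odd.
inline void appendChunk(std::string &Out, const FourCC &ID,
                        const std::string &Data) {
  assert(Data.size() <= UINT32_MAX && "chunk too large for a uint32 length");
  Out.append(ID.data(), ID.size());
  appendLE32(Out, static_cast<uint32_t>(Data.size()));
  Out += Data;
  if (Data.size() % 2)
    Out.push_back('\0');
}

// Wraps a file type and its chunks in a single outer "RIFF" chunk.
inline std::string
encodeFile(const FourCC &Type,
           const std::vector<std::pair<FourCC, std::string>> &Chunks) {
  std::string Body(Type.data(), Type.size());
  for (const auto &C : Chunks)
    appendChunk(Body, C.first, C.second);
  std::string Out;
  appendChunk(Out, FourCC{{'R', 'I', 'F', 'F'}}, Body);
  return Out;
}
```

For a file of type "test" holding one chunk "abcd" with the 3-byte payload "xyz", this yields 24 bytes: "RIFF", the length 16 (file length minus 8), "test", then the chunk with one pad byte.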
//===--- RIFF.h - Binary container file format -------------------*- C++-*-===//
//
// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
// See https://llvm.org/LICENSE.txt for license information.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
//
//===----------------------------------------------------------------------===//
//
// Tools for reading and writing data in RIFF containers.
//
// A chunk consists of:
//   - ID      : char[4]
//   - Length  : uint32
//   - Data    : byte[Length]
//   - Padding : byte[Length % 2]
// The semantics of a chunk's Data are determined by its ID.
// The format makes it easy to skip over uninteresting or unknown chunks.
//
// A RIFF file is a single chunk with ID "RIFF". Its Data is:
//   - Type    : char[4]
//   - Chunks  : chunk[]
//
// This means that a RIFF file consists of:
//   - "RIFF"          : char[4]
//   - File length - 8 : uint32
//   - File type       : char[4]
//   - Chunks          : chunk[]
//
//===----------------------------------------------------------------------===//
#ifndef LLVM_CLANG_TOOLS_EXTRA_CLANGD_RIFF_H
#define LLVM_CLANG_TOOLS_EXTRA_CLANGD_RIFF_H
#include "llvm/ADT/StringRef.h"
#include "llvm/Support/Error.h"
#include "llvm/Support/ScopedPrinter.h"
#include <array>
#include <tuple>  // std::tie, used by the operator== overloads
#include <vector> // std::vector, used by File::Chunks

namespace clang {
namespace clangd {
namespace riff {

// A FourCC identifies a chunk in a file, or the type of file itself.
using FourCC = std::array<char, 4>;
// Get a FourCC from a string literal, e.g. fourCC("RIFF").
inline constexpr FourCC fourCC(const char (&Literal)[5]) {
  return FourCC{{Literal[0], Literal[1], Literal[2], Literal[3]}};
}
// View a FourCC as a 4-character string. The result points into Data and is
// not null-terminated.
inline constexpr llvm::StringRef fourCCStr(const FourCC &Data) {
  return llvm::StringRef(&Data[0], Data.size());
}
// A chunk is a section in a RIFF container.
struct Chunk {
  FourCC ID;
  llvm::StringRef Data;
};
inline bool operator==(const Chunk &L, const Chunk &R) {
  return std::tie(L.ID, L.Data) == std::tie(R.ID, R.Data);
}
// A File is a RIFF container, which is a typed chunk sequence.
struct File {
  FourCC Type;
  std::vector<Chunk> Chunks;
};
inline bool operator==(const File &L, const File &R) {
  return std::tie(L.Type, L.Chunks) == std::tie(R.Type, R.Chunks);
}

// Reads a single chunk from the start of Stream.
// Stream is updated to exclude the consumed chunk.
llvm::Expected<Chunk> readChunk(llvm::StringRef &Stream);

// Serialize a single chunk to OS.
llvm::raw_ostream &operator<<(llvm::raw_ostream &OS, const Chunk &);

// Parses a RIFF file consisting of a single RIFF chunk.
llvm::Expected<File> readFile(llvm::StringRef Stream);

// Serialize a RIFF file (i.e. a single RIFF chunk) to OS.
llvm::raw_ostream &operator<<(llvm::raw_ostream &OS, const File &);

} // namespace riff
} // namespace clangd
} // namespace clang

#endif
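
The header only declares `readChunk`; its implementation lives elsewhere. As a rough illustration of the documented contract (read one chunk from the front of the stream, then advance the stream past it, including any pad byte), here is a standard-library-only sketch. `readChunkSketch` and the `Chunk` stand-in below are hypothetical: the sketch returns `std::optional` instead of `llvm::Expected`, uses `std::string_view` instead of `llvm::StringRef`, and assumes a little-endian length field, which the header comment does not state explicitly.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string_view>

// Hypothetical stand-ins for the types declared in RIFF.h.
using FourCC = std::array<char, 4>;
struct Chunk {
  FourCC ID;
  std::string_view Data;
};

// Reads one chunk from the front of Stream and advances Stream past it.
// Returns std::nullopt and leaves Stream unchanged if the input is truncated.
inline std::optional<Chunk> readChunkSketch(std::string_view &Stream) {
  if (Stream.size() < 8) // need ID (4 bytes) + Length (4 bytes)
    return std::nullopt;
  Chunk C;
  for (std::size_t I = 0; I < 4; ++I)
    C.ID[I] = Stream[I];
  // Decode the length as little-endian (assumed, see lead-in).
  uint32_t Len = 0;
  for (std::size_t I = 0; I < 4; ++I)
    Len |= static_cast<uint32_t>(static_cast<unsigned char>(Stream[4 + I]))
           << (8 * I);
  std::string_view Rest = Stream.substr(8);
  if (Rest.size() < Len)
    return std::nullopt;
  C.Data = Rest.substr(0, Len);
  Rest.remove_prefix(Len);
  if (Len % 2) { // odd-length data is followed by one pad byte
    if (Rest.empty())
      return std::nullopt;
    Rest.remove_prefix(1);
  }
  Stream = Rest;
  return C;
}
```

Because each chunk carries its own length, a caller can loop over a stream and ignore IDs it does not recognize, which is the "easy to skip over uninteresting or unknown chunks" property noted in the header comment.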