After this, NUMERIC_CONSTANT and strings should parse only one way.
There are 8 types of literals, and 24 valid (literal, TokenKind) pairs.
This means adding 8 new named guards (or 24, if we want to assert the token).
It seems fairly clear to me at this point that the guard names are unneccesary
indirection: the guards are in fact coupled to the rule signature.
(Also add the zero guard I forgot in the previous patch.)
Differential Revision: https://reviews.llvm.org/D130066
- Extend the GLR parser to allow conditional reduction based on the
guard functions;
- Implement two simple guards (contextual-override/final) for cxx.bnf;
- layering: clangPseudoCXX depends on clangPseudo (as the guard function need
to access the TokenStream);
Differential Revision: https://reviews.llvm.org/D127448
- define a common data structure Language which is a compiled result of the
bnf grammar. It is defined in Language.h;
- creates a clangPseudoCLI lib which defines a grammar commandline flag and
expose a function to get the Language. It supports --grammar=cxx,
--grammmar=/path/to/file.bnf;
- use the clangPseudoCLI in clang-pseudo, fuzzer, and benchmark tools (
simplify the code and use the prebuilt cxx grammar);
Split out from https://reviews.llvm.org/D127448.
Differential Revision: https://reviews.llvm.org/D128679
Most GSS nodes have short effective lifetimes, keeping them around until the
end of the parse is wasteful. Mark and sweep them every 20 tokens.
When parsing clangd/AST.cpp, this reduces the GSS memory from 1MB to 20kB.
We pay ~5% performance for this according to the glrParse benchmark.
(Parsing more tokens between GCs doesn't seem to improve this further).
Compared to the refcounting approach in https://reviews.llvm.org/D126337, this
is simpler (at least the complexity is better isolated) and has >2x less
overhead. It doesn't provide death handlers (for error-handling) but we have
an alternative solution in mind.
Differential Revision: https://reviews.llvm.org/D126723
There are more states than symbols.
This means first partioning the action list by state leaves us with a smaller
range to binary search over. This improves find() a lot and glrParse() by 7%.
The tradeoff is storing more smaller ranges increases the size of the offsets
array, overall grammar memory is +1% (337->340KB).
Before:
glrParse 188795975 ns 188778003 ns 77 bytes_per_second=1.98068M/s
After:
glrParse 175936203 ns 175916873 ns 81 bytes_per_second=2.12548M/s
Differential Revision: https://reviews.llvm.org/D127006
Error-tolerant bracket matching enables our error-tolerant parsing strategies.
The implementation here is *not* yet error tolerant: this patch sets up the APIs
and plumbing, and describes the planned approach.
Differential Revision: https://reviews.llvm.org/D125911
With this patch, we're able to parse smaller chunks of C++ code (statement,
declaration), rather than translation-unit.
The start symbol is listed in the grammar in a form of `_ :=
statement`, each start symbol has a dedicated state (`_ := • statement`).
We create and track all these separate states in the LRTable. When we
start parsing, we lookup the corresponding state to start the parser.
LR pasing table changes with this patch:
- number of states: 1467 -> 1471
- number of actions: 82891 -> 83578
- size of the table (bytes): 334248 -> 336996
Differential Revision: https://reviews.llvm.org/D125006
This includes only the taken branch of conditional sections.
The API allows for producing a stream for a particular PP branch, which
will be used later for the secondary GLR parses of not-taken branches.
Differential Revision: https://reviews.llvm.org/D123243
It turns out clang::expandUCNs only works on tokens that contain valid UCNs
and no other random escapes, and clang only uses it on raw_identifiers.
Currently we can hit an assertion by creating tokens with stray non-valid-UCN
backslashes in them.
Fortunately, expanding UCNs in raw_identifiers is actually all we need.
Most tokens (keywords, punctuation) can't have them. UCNs in literals can be
treated as escape sequences like \n even this isn't the standard's
interpretation. This more or less matches how clang works.
(See https://isocpp.org/files/papers/P2194R0.pdf which points out that the
standard's description of how UCNs work is misaligned with real implementations)
Differential Revision: https://reviews.llvm.org/D125049
This patch implements a standard GLR parsing algorithm, the
core piece of the pseudoparser.
- it parses preprocessed C++ code, currently it supports correct code
only and parse them as a translation-unit;
- it produces a forest which stores all possible trees in an efficient
manner (only a single node being build for per (SymbolID, Token Range));
no disambiguation yet;
Reland with a fix for g++'s -fpermissive error on previous declaration `GSS& GSS;`.
Differential Revision: https://reviews.llvm.org/D121150
This patch implements a standard GLR parsing algorithm, the
core piece of the pseudoparser.
- it parses preprocessed C++ code, currently it supports correct code
only and parse them as a translation-unit;
- it produces a forest which stores all possible trees in an efficient
manner (only a single node being build for per (SymbolID, Token Range));
no disambiguation yet;
Differential Revision: https://reviews.llvm.org/D121150
In files where different preprocessing paths are possible, our goal is to
choose a preprocessed token sequence which we can parse that pins down as much
of the grammatical structure as possible.
This forms the "primary parse", and the not-taken branches get parsed later,
and are constrained to be compatible with the primary parse.
Concretely:
int x =
#ifdef // TAKEN
2 + 2 + 2 // determined during primary parse to be an expression
#else
2 // constrained to be an expression during a secondary parse
#endif
;
Differential Revision: https://reviews.llvm.org/D121165
This reverts commit 049f4e4eab.
The problem was a stray dependency in CLANG_TEST_DEPS which caused cmake
to fail if clang-pseudo wasn't built. This is now removed.
This should make clearer that:
- it's not part of clang proper
- there's no expectation to update it along with clang (beyond green tests)
- clang should not depend on it
This is intended to be expose a library, so unlike other tools has a split
between include/ and lib/.
The main renames are:
clang/lib/Tooling/Syntax/Pseudo/* => clang-tools-extra/pseudo/lib/*
clang/include/clang/Tooling/Syntax/Pseudo/* => clang-tools-extra/pseudo/include/clang-pseudo/*
clang/tools/clang/pseudo/* => clang-tools-extra/pseudo/tool/*
clang/test/Syntax/* => clang-tools-extra/pseudo/test/*
clang/unittests/Tooling/Syntax/Pseudo/* => clang-tools-extra/pseudo/unittests/*
#include "clang/Tooling/Syntax/Pseudo/*" => #include "clang-pseudo/*"
namespace clang::syntax::pseudo => namespace clang::pseudo
check-clang => check-clang-pseudo
clangToolingSyntaxPseudo => clangPseudo
The clang-pseudo and ClangPseudoTests binaries are not renamed.
See discussion around:
https://discourse.llvm.org/t/rfc-a-c-pseudo-parser-for-tooling/59217/50
Differential Revision: https://reviews.llvm.org/D121233