llvm-project

Commit Graph

Author	SHA1	Message	Date
Sam McCall	d9d554a3f4	[pseudo] Add ambiguity & unparseability metrics to -print-statistics These can be used to quantify parsing improvements from a change. Differential Revision: https://reviews.llvm.org/D130199	2022-07-22 10:35:06 +02:00
Haojian Wu	18cee95919	[pseudo] Tweak the cli option messages, NFC.	2022-07-22 08:53:24 +02:00
Sam McCall	3132e9cd7c	[pseudo] Key guards by RuleID, add guards to literals (and 0). After this, NUMERIC_CONSTANT and strings should parse only one way. There are 8 types of literals, and 24 valid (literal, TokenKind) pairs. This means adding 8 new named guards (or 24, if we want to assert the token). It seems fairly clear to me at this point that the guard names are unnecessary indirection: the guards are in fact coupled to the rule signature. (Also add the zero guard I forgot in the previous patch.) Differential Revision: https://reviews.llvm.org/D130066	2022-07-21 22:42:31 +02:00
Sam McCall	c91ce94144	[pseudo] Add `clang-pseudo -html-forest=<output.html>`, an HTML forest browser It generates a standalone HTML file with all needed JS/CSS embedded. This allows navigating the tree both with a tree widget and in the code, inspecting nodes, and selecting ambiguous alternatives. Demo: https://htmlpreview.github.io/?https://gist.githubusercontent.com/sam-mccall/03882f7499d293196594e8a50599a503/raw/ASTSignals.cpp.html Differential Revision: https://reviews.llvm.org/D130004	2022-07-19 22:32:11 +02:00
Haojian Wu	098488e09a	[pseudo] Be more precise when printing the error message, NFC	2022-07-18 13:23:18 +02:00
Haojian Wu	9ab67cc8bf	[pseudo] Implement guard extension. - Extend the GLR parser to allow conditional reduction based on the guard functions; - Implement two simple guards (contextual-override/final) for cxx.bnf; - layering: clangPseudoCXX depends on clangPseudo (as the guard functions need to access the TokenStream); Differential Revision: https://reviews.llvm.org/D127448	2022-07-05 15:55:15 +02:00
Haojian Wu	5f0a054f89	[pseudo] Remove duplicated code in ClangPseudo.cpp The code was added accidentally during the rebase when landing `fe66aebd`.	2022-07-04 11:32:56 +02:00
Haojian Wu	fe66aebd75	[pseudo] Define a clangPseudoCLI library. - define a common data structure Language which is a compiled result of the bnf grammar. It is defined in Language.h; - create a clangPseudoCLI lib which defines a grammar commandline flag and exposes a function to get the Language. It supports --grammar=cxx, --grammar=/path/to/file.bnf; - use the clangPseudoCLI in clang-pseudo, fuzzer, and benchmark tools (simplify the code and use the prebuilt cxx grammar); Split out from https://reviews.llvm.org/D127448. Differential Revision: https://reviews.llvm.org/D128679	2022-07-01 08:31:34 +02:00
Sam McCall	9b6bb12b85	[pseudo] Add ForestNode descendants iterator, print ambiguous/opaque node stats. Differential Revision: https://reviews.llvm.org/D128930	2022-06-30 21:20:55 +02:00
Sam McCall	3f028c02ba	[pseudo] Grammar::parseBNF returns Grammar not unique_ptr. NFC	2022-06-28 16:34:21 +02:00
Haojian Wu	c70aeaad2b	[pseudo] Move grammar-related headers to a separate dir, NFC. We did that for .cpp, but forgot the headers. Differential Revision: https://reviews.llvm.org/D127388	2022-06-09 14:58:05 +02:00
Sam McCall	94b2ca18c1	[pseudo] GC GSS nodes, reuse them with a freelist Most GSS nodes have short effective lifetimes, keeping them around until the end of the parse is wasteful. Mark and sweep them every 20 tokens. When parsing clangd/AST.cpp, this reduces the GSS memory from 1MB to 20kB. We pay ~5% performance for this according to the glrParse benchmark. (Parsing more tokens between GCs doesn't seem to improve this further). Compared to the refcounting approach in https://reviews.llvm.org/D126337, this is simpler (at least the complexity is better isolated) and has >2x less overhead. It doesn't provide death handlers (for error-handling) but we have an alternative solution in mind. Differential Revision: https://reviews.llvm.org/D126723	2022-06-08 23:39:59 +02:00
Sam McCall	93bcff8aa8	[pseudo] Invert rows/columns of LRTable storage for speedup. NFC There are more states than symbols. This means first partitioning the action list by state leaves us with a smaller range to binary search over. This improves find() a lot and glrParse() by 7%. The tradeoff is storing more smaller ranges increases the size of the offsets array, overall grammar memory is +1% (337->340KB). Before: glrParse 188795975 ns 188778003 ns 77 bytes_per_second=1.98068M/s After: glrParse 175936203 ns 175916873 ns 81 bytes_per_second=2.12548M/s Differential Revision: https://reviews.llvm.org/D127006	2022-06-08 23:35:14 +02:00
Haojian Wu	f1df6515e3	[pseudo] Add missing dependency, fix shared library build.	2022-05-25 12:38:23 +02:00
Sam McCall	0360b9f159	[pseudo] (trivial) bracket-matching Error-tolerant bracket matching enables our error-tolerant parsing strategies. The implementation here is not yet error tolerant: this patch sets up the APIs and plumbing, and describes the planned approach. Differential Revision: https://reviews.llvm.org/D125911	2022-05-24 15:13:36 +02:00
Haojian Wu	1a65c491be	[pseudo] Support parsing variant target symbols. With this patch, we're able to parse smaller chunks of C++ code (statement, declaration), rather than translation-unit. The start symbol is listed in the grammar in a form of `_ := statement`, each start symbol has a dedicated state (`_ := • statement`). We create and track all these separate states in the LRTable. When we start parsing, we look up the corresponding state to start the parser. LR parsing table changes with this patch: - number of states: 1467 -> 1471 - number of actions: 82891 -> 83578 - size of the table (bytes): 334248 -> 336996 Differential Revision: https://reviews.llvm.org/D125006	2022-05-16 10:38:16 +02:00
Sam McCall	7dc3c6190e	[pseudo] Strip directives from a token stream This includes only the taken branch of conditional sections. The API allows for producing a stream for a particular PP branch, which will be used later for the secondary GLR parses of not-taken branches. Differential Revision: https://reviews.llvm.org/D123243	2022-05-06 12:15:08 +02:00
Sam McCall	232cc446ff	[pseudo] Only expand UCNs for raw_identifiers It turns out clang::expandUCNs only works on tokens that contain valid UCNs and no other random escapes, and clang only uses it on raw_identifiers. Currently we can hit an assertion by creating tokens with stray non-valid-UCN backslashes in them. Fortunately, expanding UCNs in raw_identifiers is actually all we need. Most tokens (keywords, punctuation) can't have them. UCNs in literals can be treated as escape sequences like \n even though this isn't the standard's interpretation. This more or less matches how clang works. (See https://isocpp.org/files/papers/P2194R0.pdf which points out that the standard's description of how UCNs work is misaligned with real implementations) Differential Revision: https://reviews.llvm.org/D125049	2022-05-06 08:53:31 +02:00
Haojian Wu	c4546091ed	[pseudo] Use a real language option in the parser. Differential Revision: https://reviews.llvm.org/D124831	2022-05-03 22:24:56 +02:00
Haojian Wu	9f38da258e	[pseudo] Implement the GLR parsing algorithm. This patch implements a standard GLR parsing algorithm, the core piece of the pseudoparser. - it parses preprocessed C++ code; currently it supports correct code only and parses it as a translation-unit; - it produces a forest which stores all possible trees in an efficient manner (only a single node is built per (SymbolID, Token Range)); no disambiguation yet; Reland with a fix for g++'s -fpermissive error on previous declaration `GSS& GSS;`. Differential Revision: https://reviews.llvm.org/D121150	2022-05-03 20:25:23 +02:00
Haojian Wu	860eabb395	Revert "[pseudo] Implement the GLR parsing algorithm." This breaks some buildbots (on the declaration GSS& GSS), will fix it later. This reverts commit `eac22d0754`.	2022-05-03 15:54:10 +02:00
Sam McCall	eac22d0754	[pseudo] Implement the GLR parsing algorithm. This patch implements a standard GLR parsing algorithm, the core piece of the pseudoparser. - it parses preprocessed C++ code; currently it supports correct code only and parses it as a translation-unit; - it produces a forest which stores all possible trees in an efficient manner (only a single node is built per (SymbolID, Token Range)); no disambiguation yet; Differential Revision: https://reviews.llvm.org/D121150	2022-05-03 15:42:07 +02:00
Sam McCall	c03d6257c5	[pseudo] Rename DirectiveMap -> DirectiveTree. NFC Addressing comment from previous review https://reviews.llvm.org/D121165?id=413636#inline-1160757	2022-04-06 21:36:57 +02:00
Sam McCall	af89e4792d	[pseudo] Add crude heuristics to choose taken preprocessor branches. In files where different preprocessing paths are possible, our goal is to choose a preprocessed token sequence which we can parse that pins down as much of the grammatical structure as possible. This forms the "primary parse", and the not-taken branches get parsed later, and are constrained to be compatible with the primary parse. Concretely: int x = #ifdef // TAKEN 2 + 2 + 2 // determined during primary parse to be an expression #else 2 // constrained to be an expression during a secondary parse #endif ; Differential Revision: https://reviews.llvm.org/D121165	2022-04-06 17:22:35 +02:00
Haojian Wu	30de15e100	[pseudo] Tweak some docs, NFC Consistently use the "nonterminal", "pseudoparser" terms.	2022-03-17 13:58:42 +01:00
Sam McCall	89cd86bbc5	Reapply "[pseudo] Move pseudoparser from clang to clang-tools-extra" This reverts commit `049f4e4eab`. The problem was a stray dependency in CLANG_TEST_DEPS which caused cmake to fail if clang-pseudo wasn't built. This is now removed.	2022-03-16 01:10:55 +01:00
Sam McCall	049f4e4eab	Revert "[pseudo] Move pseudoparser from clang to clang-tools-extra" This reverts commit `b97856c4cf`. Breaks a bunch of bots: https://lab.llvm.org/buildbot/#/builders/193/builds/8513	2022-03-16 01:06:24 +01:00
Sam McCall	b97856c4cf	[pseudo] Move pseudoparser from clang to clang-tools-extra This should make clearer that: - it's not part of clang proper - there's no expectation to update it along with clang (beyond green tests) - clang should not depend on it This is intended to expose a library, so unlike other tools has a split between include/ and lib/. The main renames are: clang/lib/Tooling/Syntax/Pseudo/* => clang-tools-extra/pseudo/lib/* clang/include/clang/Tooling/Syntax/Pseudo/* => clang-tools-extra/pseudo/include/clang-pseudo/* clang/tools/clang/pseudo/* => clang-tools-extra/pseudo/tool/* clang/test/Syntax/* => clang-tools-extra/pseudo/test/* clang/unittests/Tooling/Syntax/Pseudo/* => clang-tools-extra/pseudo/unittests/* #include "clang/Tooling/Syntax/Pseudo/" => #include "clang-pseudo/" namespace clang::syntax::pseudo => namespace clang::pseudo check-clang => check-clang-pseudo clangToolingSyntaxPseudo => clangPseudo The clang-pseudo and ClangPseudoTests binaries are not renamed. See discussion around: https://discourse.llvm.org/t/rfc-a-c-pseudo-parser-for-tooling/59217/50 Differential Revision: https://reviews.llvm.org/D121233	2022-03-16 00:14:11 +01:00

28 Commits