llvm-project

Commit Graph

Author	SHA1	Message	Date
Haojian Wu	f7dc91ad56	[pseudo] Eliminate a false parse of structured binding declaration. Using the guard to implement part of the rule https://eel.is/c++draft/dcl.pre#6. ``` void foo() { // can be parsed as // - structured-binding declaration (a false parse) // - assignment expression array[index] = value; } ``` Differential Revision: https://reviews.llvm.org/D132260	2022-08-23 15:25:52 +02:00
Sam McCall	bd5cc6575b	[pseudo] Start rules are `_ := start-symbol EOF`, improve recovery. Previously we were calling glrRecover() ad-hoc at the end of input. Two main problems with this: - glrRecover() on two separate code paths is inelegant - We may have to recover several times in succession (e.g. to exit from nested scopes), so we need a loop at end-of-file Having an actual shift action for an EOF terminal allows us to handle both concerns in the main shift/recover/reduce loop. This revealed a recovery design bug where recovery could enter a loop by repeatedly choosing the same parent to identically recover from. Addressed this by allowing each node to be used as a recovery base once. Differential Revision: https://reviews.llvm.org/D130550	2022-08-19 16:49:37 +02:00
Haojian Wu	6a9f79e102	[pseudo] Eliminate the type-name identifier ambiguities in the grammar. See https://reviews.llvm.org/D130626 for motivation. Identifiers in the grammar have different categories (type-name, template-name, namespace-name), which require semantic information to resolve. This patch eliminates the "local" ambiguities in type-name and namespace-name, which gives the parser a performance boost: - eliminate all different type rules (class-name, enum-name, typedef-name), and fold them into a unified type-name, this removes the #1 type-name ambiguity, and gives us a big performance boost; - remove the namespace-alias rules, as they're hard and uninteresting; Note that we could eliminate more and gain more performance (like folding template-name, type-name, namespace together), but at the current stage, we'd like to keep all existing categories of the identifier (as they might assist in correlated disambiguation & keep the representation of important concepts uniform). \| file \|ambiguous nodes \| forest size \| glrParse performance \| \|SemaCodeComplete.cpp\| 11k -> 5.7K \| 10.4MB -> 7.9MB \| 7.1MB/s -> 9.98MB/s \| \| AST.cpp \| 1.3k -> 0.73K \| 0.99MB -> 0.77MB \| 6.7MB/s -> 8.4MB/s \| Differential Revision: https://reviews.llvm.org/D130747	2022-08-17 14:30:53 +02:00
Haojian Wu	1828c75d5f	[pseudo] Apply the function-declarator to member functions. A followup patch of `d489b3807f`, but for member functions, this will eliminate a false parse of member declaration. Differential Revision: https://reviews.llvm.org/D131720	2022-08-12 13:49:01 +02:00
Haojian Wu	a1a1a78ac8	[pseudo] Eliminate an ambiguity for the empty member declaration. We happened to introduce a `member-declaration := ;` rule when inlining the `member-declaration := decl-specifier-seq_opt member-declarator-list_opt ;`. And with the `member-declaration := empty-declaration` rule, we had two parses of `;`. This patch is to restrict the grammar to eliminate the `member-declaration := ;` rule. Differential Revision: https://reviews.llvm.org/D131724	2022-08-12 13:46:26 +02:00
Haojian Wu	6f6c40a875	[pseudo] Eliminate the false `::` nested-name-specifier ambiguity. The solution is to favor the longest possible nested-name-specifier, and drop other alternatives by using the guard, per C++ [basic.lookup.qual.general]. Motivating cases: ``` Foo::Foo() {}; // the constructor can be parsed as: // - Foo ::Foo(); // where the first Foo is return-type, and ::Foo is the function declarator // + Foo::Foo(); // where Foo::Foo is the function declarator ``` ``` void test() { // a very slow parsing case when there are many qualifiers! X::Y::Z; // The statement can be parsed as: // - X ::Y::Z; // ::Y::Z is the declarator // - X::Y ::Z; // ::Z is the declarator // + X::Y::Z; // a declaration without declarator (X::Y::Z is decl-specifier-seq) // + X::Y::Z; // a qualified-id expression } ``` Differential Revision: https://reviews.llvm.org/D130511	2022-07-28 11:01:15 +02:00
Sam McCall	b2b993a6ae	[pseudo] Eliminate multiple-specified-types ambiguities using guards Motivating case: `foo bar;` is not a declaration of nothing with `foo` and `bar` both types. This is a common and critical ambiguity, clangd/AST.cpp has 20% fewer ambiguous nodes (1674->1332) after this change. Differential Revision: https://reviews.llvm.org/D130337	2022-07-25 12:57:07 +02:00
Sam McCall	d9d554a3f4	[pseudo] Add ambiguity & unparseability metrics to -print-statistics These can be used to quantify parsing improvements from a change. Differential Revision: https://reviews.llvm.org/D130199	2022-07-22 10:35:06 +02:00
Haojian Wu	2a88fb2ecb	[pseudo] Eliminate the dangling-else syntax ambiguity. - the grammar ambiguity is eliminated by a guard; - modify the guard function signatures, now all parameters are folded into a single object to avoid a long parameter list (as we will add more parameters in the near future); Reviewed By: sammccall Differential Revision: https://reviews.llvm.org/D130160	2022-07-22 09:13:09 +02:00
Sam McCall	3132e9cd7c	[pseudo] Key guards by RuleID, add guards to literals (and 0). After this, NUMERIC_CONSTANT and strings should parse only one way. There are 8 types of literals, and 24 valid (literal, TokenKind) pairs. This means adding 8 new named guards (or 24, if we want to assert the token). It seems fairly clear to me at this point that the guard names are unnecessary indirection: the guards are in fact coupled to the rule signature. (Also add the zero guard I forgot in the previous patch.) Differential Revision: https://reviews.llvm.org/D130066	2022-07-21 22:42:31 +02:00
Sam McCall	c91ce94144	[pseudo] Add `clang-pseudo -html-forest=<output.html>`, an HTML forest browser It generates a standalone HTML file with all needed JS/CSS embedded. This allows navigating the tree both with a tree widget and in the code, inspecting nodes, and selecting ambiguous alternatives. Demo: https://htmlpreview.github.io/?https://gist.githubusercontent.com/sam-mccall/03882f7499d293196594e8a50599a503/raw/ASTSignals.cpp.html Differential Revision: https://reviews.llvm.org/D130004	2022-07-19 22:32:11 +02:00
Haojian Wu	d489b3807f	[pseudo] Implement a guard to determine function declarator. This eliminates some simple-declaration/function-definition false parses. - implement a function to determine whether a declarator ForestNode is a function declarator; - extend the standard declarator to two guarded function-declarator and non-function-declarator nonterminals; Differential Revision: https://reviews.llvm.org/D129222	2022-07-19 09:44:45 +02:00
Haojian Wu	b94ea8b3eb	[pseudo] Add bracket recovery for function parameters.	2022-07-18 10:23:15 +02:00
Sam McCall	7d8e2742d9	[pseudo] Define recovery strategy as grammar extension. Differential Revision: https://reviews.llvm.org/D129158	2022-07-06 15:03:38 +02:00
Sam McCall	3121167488	[pseudo] Add error-recovery framework & brace-based recovery The idea is: - a parse failure is detected when all heads die when trying to shift the next token - we can recover by choosing a nonterminal we're partway through parsing, and determining where it ends through nonlocal means (e.g. matching brackets) - we can find candidates by walking up the stack from the (ex-)heads - the token range is defined using heuristics attached to grammar rules - the unparsed region is represented in the forest by an Opaque node This patch has the core GLR functionality. It does not allow recovery heuristics to be attached as extensions to the grammar, but rather infers a brace-based heuristic. Expected followups: - make recovery heuristics grammar extensions (depends on D127448) - add recovery to our grammar for bracketed constructs and sequence nodes - change the structure of our augmented `_ := start` rules to eliminate some special-cases in glrParse. - (if I can work out how): avoid some spurious recovery cases described in comments (Previously mistakenly committed as `a0f4c10ae2`) Differential Revision: https://reviews.llvm.org/D128486	2022-07-05 20:49:41 +02:00
Haojian Wu	9ab67cc8bf	[pseudo] Implement guard extension. - Extend the GLR parser to allow conditional reduction based on the guard functions; - Implement two simple guards (contextual-override/final) for cxx.bnf; - layering: clangPseudoCXX depends on clangPseudo (as the guard function need to access the TokenStream); Differential Revision: https://reviews.llvm.org/D127448	2022-07-05 15:55:15 +02:00
Haojian Wu	70c0d92930	[pseudo] Use the prebuilt cxx grammar for the lit tests, NFC. Differential Revision: https://reviews.llvm.org/D129074	2022-07-05 15:17:18 +02:00
Sam McCall	9b6bb12b85	[pseudo] Add ForestNode descendants iterator, print ambiguous/opaque node stats. Differential Revision: https://reviews.llvm.org/D128930	2022-06-30 21:20:55 +02:00
Sam McCall	bc5e7ced1c	[pseudo] Fix bugs/inconsistencies in forest dump. - when printing a shared node for the second time, don't print its children (This keeps output proportional to the size of the structure) - when printing a shared node for the second time, print its type only, not rule (for consistency with above: don't dump details of nodes twice) - don't abbreviate shared nodes, to ensure we can prune the tree there Differential Revision: https://reviews.llvm.org/D128805	2022-06-29 22:56:26 +02:00
Haojian Wu	1ba7f5218c	[pseudo] Update the cxx.bnf path in comments to reflect the new location, NFC	2022-06-29 15:10:39 +02:00
Sam McCall	743971faaf	Revert "[pseudo] Add error-recovery framework & brace-based recovery" This reverts commit `a0f4c10ae2`. This commit hadn't been reviewed yet, and was unintentionally included on another branch.	2022-06-28 21:11:09 +02:00
Sam McCall	a0f4c10ae2	[pseudo] Add error-recovery framework & brace-based recovery The idea is: - a parse failure is detected when all heads die when trying to shift the next token - we can recover by choosing a nonterminal we're partway through parsing, and determining where it ends through nonlocal means (e.g. matching brackets) - we can find candidates by walking up the stack from the (ex-)heads - the token range is defined using heuristics attached to grammar rules - the unparsed region is represented in the forest by an Opaque node This patch has the core GLR functionality. It does not allow recovery heuristics to be attached as extensions to the grammar, but rather infers a brace-based heuristic. Expected followups: - make recovery heuristics grammar extensions (depends on D127448) - add recovery to our grammar for bracketed constructs and sequence nodes - change the structure of our augmented `_ := start` rules to eliminate some special-cases in glrParse. - (if I can work out how): avoid some spurious recovery cases described in comments - grammar changes to eliminate the hard distinction between init-list and designated-init-list shown in the recovery-init-list.cpp testcase Differential Revision: https://reviews.llvm.org/D128486	2022-06-28 21:08:43 +02:00
Sam McCall	241557fb06	[pseudo] Move cxx grammar into the cxx/ directory. NFC	2022-06-28 16:02:10 +02:00
Sam McCall	aacefc817d	[pseudo] Simplify/loosen the grammar around lambda captures. Treat captures as a uniform list, rather than default-captures being special snowflakes that may only appear at the start. This accepts a larger set of (incorrect) code, and simplifies error-handling by making this fit into the usual homogeneous-list pattern. Differential Revision: https://reviews.llvm.org/D128708	2022-06-28 15:56:12 +02:00
Sam McCall	8cf28585a4	[pseudo] Allow mixed designated/undesignated init lists. This isn't allowed by the standard grammar but is allowed in C, and clang/GCC permit it as an extension. It avoids the need to determine which type of list we have in error-recovery. While here, also support array index designators `{ [4]=1 }` which are also legal in C, and common extensions in C++. Differential Revision: https://reviews.llvm.org/D128687	2022-06-28 15:45:41 +02:00
Sam McCall	85eaecbe8e	[pseudo] Check follow-sets instead of tying reduce actions to lookahead tokens. Previously, the action table stores a reduce action for each lookahead token it should allow. These tokens are the followSet(action.rule.target). In practice, the follow sets are large, so we spend a bunch of time binary searching around all these essentially-duplicates to check whether our lookahead token is there. However the number of reduces for a given state is very small, so we're much better off linear scanning over them and performing a fast check for each. D128318 was an attempt at this, storing a bitmap for each reduce. However it's even more compact just to use the follow sets directly, as there are fewer nonterminals than (state, rule) pairs. It's also faster. This specialized approach means unbundling Reduce from other actions in LRTable, so it's no longer useful to support it in Action. I suspect Action will soon go away, as we store each kind of action separately. This improves glrParse speed by 42% (3.30 -> 4.69 MB/s). It also reduces LR table size by 59% (343 -> 142kB). Differential Revision: https://reviews.llvm.org/D128472	2022-06-28 00:36:16 +02:00
Sam McCall	6b187fdf3b	[pseudo] Add xfail tests for a simple-declaration/function-definition ambiguity. I expect to eliminate this ambiguity at the grammar level by use of guards, because it interferes with brace-based error recovery. Differential Revision: https://reviews.llvm.org/D127400	2022-06-23 15:52:22 +02:00
Sam McCall	18f0b7092d	[pseudo] Don't clang-format test inputs. NFC	2022-06-09 14:18:30 +02:00
Haojian Wu	7a05942dd0	[pseudo] Remove the explicit Accept actions. As pointed out in the previous review section, having a dedicated accept action doesn't seem to be necessary. This patch implements the same behavior without an accept action, which will save some code complexity. Reviewed By: sammccall Differential Revision: https://reviews.llvm.org/D125677	2022-06-09 11:19:07 +02:00
Haojian Wu	28eeea1e27	[pseudo] Pull out the operator< test, NFC. Fix the review comment in https://reviews.llvm.org/D125479.	2022-06-07 11:00:08 +02:00
Haojian Wu	cf88150c48	[pseudo] Fix the incorrect parameters-and-qualifiers rule. The parenthesized body should be parameter-declaration-clause, rather than parameter-declaration-list. Reviewed By: sammccall Differential Revision: https://reviews.llvm.org/D125479	2022-06-07 10:47:07 +02:00
Haojian Wu	ecd7ff53b5	[pseudo] Fix the type-parameter rule. The IDENTIFIER should be optional. Differential Revision: https://reviews.llvm.org/D126998	2022-06-07 10:36:45 +02:00
Haojian Wu	90dab0473e	[pseudo] Handle the language predefined identifier __func__ The clang lexer lexes it as a dedicated token kind (rather than identifier), we extend the grammar to handle it. Differential Revision: https://reviews.llvm.org/D126996	2022-06-07 10:34:37 +02:00
Haojian Wu	58b33bc8c4	[pseudo] Fix noptr-abstract-declarator rule. The const-expression in the [] can be empty. Differential Revision: https://reviews.llvm.org/D126992	2022-06-07 10:22:23 +02:00
Haojian Wu	0a6a17a4f9	[pseudo] Fix the member-specification grammar rule. The grammar rule is not right, doesn't match the standard one. Differential Revision: https://reviews.llvm.org/D126991	2022-06-07 10:18:18 +02:00
Haojian Wu	1a65c491be	[pseudo] Support parsing variant target symbols. With this patch, we're able to parse smaller chunks of C++ code (statement, declaration), rather than translation-unit. The start symbol is listed in the grammar in the form of `_ := statement`, each start symbol has a dedicated state (`_ := • statement`). We create and track all these separate states in the LRTable. When we start parsing, we look up the corresponding state to start the parser. LR parsing table changes with this patch: - number of states: 1467 -> 1471 - number of actions: 82891 -> 83578 - size of the table (bytes): 334248 -> 336996 Differential Revision: https://reviews.llvm.org/D125006	2022-05-16 10:38:16 +02:00
Sam McCall	7dc3c6190e	[pseudo] Strip directives from a token stream This includes only the taken branch of conditional sections. The API allows for producing a stream for a particular PP branch, which will be used later for the secondary GLR parses of not-taken branches. Differential Revision: https://reviews.llvm.org/D123243	2022-05-06 12:15:08 +02:00
Sam McCall	1616bd9ef4	[pseudo] Add fuzzer for the pseudoparser. As confirmation, running this locally found 2 crashes: - trivial: crashes on file with no tokens - lexer: hits an assertion failure on bytes: 0x5c,0xa,0x5c,0x1,0x65,0x5c,0xa Differential Revision: https://reviews.llvm.org/D125037	2022-05-06 09:22:28 +02:00
Sam McCall	232cc446ff	[pseudo] Only expand UCNs for raw_identifiers. It turns out clang::expandUCNs only works on tokens that contain valid UCNs and no other random escapes, and clang only uses it on raw_identifiers. Currently we can hit an assertion by creating tokens with stray non-valid-UCN backslashes in them. Fortunately, expanding UCNs in raw_identifiers is actually all we need. Most tokens (keywords, punctuation) can't have them. UCNs in literals can be treated as escape sequences like \n even though this isn't the standard's interpretation. This more or less matches how clang works. (See https://isocpp.org/files/papers/P2194R0.pdf which points out that the standard's description of how UCNs work is misaligned with real implementations) Differential Revision: https://reviews.llvm.org/D125049	2022-05-06 08:53:31 +02:00
Haojian Wu	c4546091ed	[pseudo] Use a real language option in the parser. Differential Revision: https://reviews.llvm.org/D124831	2022-05-03 22:24:56 +02:00
Haojian Wu	9f38da258e	[pseudo] Implement the GLR parsing algorithm. This patch implements a standard GLR parsing algorithm, the core piece of the pseudoparser. - it parses preprocessed C++ code, currently it supports correct code only and parses it as a translation-unit; - it produces a forest which stores all possible trees in an efficient manner (only a single node is built per (SymbolID, Token Range)); no disambiguation yet; Reland with a fix for g++'s -fpermissive error on previous declaration `GSS& GSS;`. Differential Revision: https://reviews.llvm.org/D121150	2022-05-03 20:25:23 +02:00
Haojian Wu	860eabb395	Revert "[pseudo] Implement the GLR parsing algorithm." This breaks some buildbots (on the declaration GSS& GSS), will fix it later. This reverts commit `eac22d0754`.	2022-05-03 15:54:10 +02:00
Sam McCall	eac22d0754	[pseudo] Implement the GLR parsing algorithm. This patch implements a standard GLR parsing algorithm, the core piece of the pseudoparser. - it parses preprocessed C++ code, currently it supports correct code only and parses it as a translation-unit; - it produces a forest which stores all possible trees in an efficient manner (only a single node is built per (SymbolID, Token Range)); no disambiguation yet; Differential Revision: https://reviews.llvm.org/D121150	2022-05-03 15:42:07 +02:00
Sam McCall	5749a261c5	[pseudo] Include missing `count` in test deps. We don't use this for testing, but one of the lit python modules requires it :-\ After this, check-clang-pseudo passes with a clean build tree.	2022-04-07 00:15:18 +02:00
Sam McCall	c03d6257c5	[pseudo] Rename DirectiveMap -> DirectiveTree. NFC Addressing comment from previous review https://reviews.llvm.org/D121165?id=413636#inline-1160757	2022-04-06 21:36:57 +02:00
Sam McCall	af89e4792d	[pseudo] Add crude heuristics to choose taken preprocessor branches. In files where different preprocessing paths are possible, our goal is to choose a preprocessed token sequence which we can parse that pins down as much of the grammatical structure as possible. This forms the "primary parse", and the not-taken branches get parsed later, and are constrained to be compatible with the primary parse. Concretely: int x = #ifdef // TAKEN 2 + 2 + 2 // determined during primary parse to be an expression #else 2 // constrained to be an expression during a secondary parse #endif ; Differential Revision: https://reviews.llvm.org/D121165	2022-04-06 17:22:35 +02:00
Sam McCall	57ee624d79	[cmake] Provide CURRENT_TOOLS_DIR centrally, replacing CLANG_TOOLS_DIR. CLANG_TOOLS_DIR holds the current bin/ directory, maybe with a %(build_mode) placeholder. It is used to add the just-built binaries to $PATH for lit tests. In most cases it equals LLVM_TOOLS_DIR, which is used for the same purpose. But for a standalone build of clang, CLANG_TOOLS_DIR points at the build tree and LLVM_TOOLS_DIR points at the provided LLVM binaries. Currently CLANG_TOOLS_DIR is set in clang/test/, clang-tools-extra/test/, and other things always built with clang. This is a few cryptic lines of CMake in each place. Meanwhile LLVM_TOOLS_DIR is provided by configure_site_lit_cfg(). This patch moves CLANG_TOOLS_DIR to configure_site_lit_cfg() and renames it: - there's nothing clang-specific about the value - it will also replace LLD_TOOLS_DIR, LLDB_TOOLS_DIR etc (not in this patch) It also defines CURRENT_LIBS_DIR. While I removed the last usage of CLANG_LIBS_DIR in `e4cab4e24d`, there are LLD_LIBS_DIR usages etc that may be live, and I'd like to mechanically update them in a followup patch. Differential Revision: https://reviews.llvm.org/D121763	2022-03-25 20:22:01 +01:00
Haojian Wu	f383b88d82	[pseudo] Sort nonterminals based on their reduction order. Reductions need to be performed in a careful order in the GLR parser, to make sure we gather all alternatives before creating an ambiguous forest node. This patch encodes the nonterminal order into the rule id, so that we can efficiently determine the ordering of reductions in the GLR parser. This patch also abstracts to a TestGrammar, which is shared among tests. This is a part of the GLR parser, https://reviews.llvm.org/D121368, https://reviews.llvm.org/D121150 Differential Revision: https://reviews.llvm.org/D122303	2022-03-24 14:30:12 +01:00
Sam McCall	1f92f44ec9	[pseudo] fix typo'd test assertions	2022-03-21 14:05:21 +01:00
Sam McCall	89cd86bbc5	Reapply "[pseudo] Move pseudoparser from clang to clang-tools-extra" This reverts commit `049f4e4eab`. The problem was a stray dependency in CLANG_TEST_DEPS which caused cmake to fail if clang-pseudo wasn't built. This is now removed.	2022-03-16 01:10:55 +01:00


52 Commits