Mostly mechanics here. Interesting decisions:
- apply disambiguation in-place instead of copying the forest
debatable, but even the final tree size is significant
- split decide/apply into different functions - this allows the hard part
(decide) to be tested non-destructively and combined with HTML forest easily
- add non-const accessors to forest to enable apply
- unit tests but no lit tests: my plan is to test actual C++ disambiguation
heuristics with lit, generic disambiguation mechanics without the C++ grammar
Differential Revision: https://reviews.llvm.org/D132487
Using the guard to implement part of the rule https://eel.is/c++draft/dcl.pre#6.
```
void foo() {
// can be parsed as
// - structured-binding declaration (a false parse)
// - assignment expression
array[index] = value;
}
```
Differential Revision: https://reviews.llvm.org/D132260
The bug was that if we recover from the token 0, we will make the
Heads empty (Line646), which results no recovery being applied.
Reviewed By: sammccall
Differential Revision: https://reviews.llvm.org/D132388
Previously we were calling glrRecover() ad-hoc at the end of input.
Two main problems with this:
- glrRecover() on two separate code paths is inelegant
- We may have to recover several times in succession (e.g. to exit from
nested scopes), so we need a loop at end-of-file
Having an actual shift action for an EOF terminal allows us to handle
both concerns in the main shift/recover/reduce loop.
This revealed a recovery design bug where recovery could enter a loop by
repeatedly choosing the same parent to identically recover from.
Addressed this by allowing each node to be used as a recovery base once.
Differential Revision: https://reviews.llvm.org/D130550
Our GLR uses lookahead: only perform reductions that might be consumed by the
shift immediately following. However when shift fails and so reduce is followed
by recovery instead, this restriction is incorrect and leads to missing heads.
In turn this means certain recovery strategies can't be made to work. e.g.
```
ns := NAMESPACE { namespace-body } [recover=Skip]
ns-body := namespace_opt
```
When `namespace { namespace {` is parsed, we can recover the inner `ns` (using
the `Skip` strategy to ignore the missing `}`). However this `namespace` will
not be reduced to a `namespace-body` as EOF is not in the follow-set, and so we
are unable to recover the outer `ns`.
This patch fixes this by tracking which heads were produced by constrained
reduce, and discarding and rebuilding them before performing recovery.
This is a prerequisite for the `Skip` strategy mentioned above, though there are
some other limitations we need to address too.
Reviewed By: hokein
Differential Revision: https://reviews.llvm.org/D130523
See https://reviews.llvm.org/D130626 for motivation.
Identifier in the grammar has different categories (type-name, template-name,
namespace-name), they requires semantic information to resolve. This patch is
to eliminate the "local" ambiguities in type-name, and namespace-name, which
gives us a performance boost of the parser:
- eliminate all different type rules (class-name, enum-name, typedef-name), and
fold them into a unified type-name, this removes the #1 type-name ambiguity, and
gives us a big performance boost;
- remove the namespace-alis rules, as they're hard and uninteresting;
Note that we could eliminate more and gain more performance (like fold template-name,
type-name, namespace together), but at current stage, we'd like keep all existing
categories of the identifier (as they might assist in correlated disambiguation &
keep the representation of important concepts uniform).
| file |ambiguous nodes | forest size | glrParse performance |
|SemaCodeComplete.cpp| 11k -> 5.7K | 10.4MB -> 7.9MB | 7.1MB/s -> 9.98MB/s |
| AST.cpp | 1.3k -> 0.73K | 0.99MB -> 0.77MB | 6.7MB/s -> 8.4MB/s |
Differential Revision: https://reviews.llvm.org/D130747
A followup patch of d489b3807f, but for
member functions, this will eliminate a false parse of member
declaration.
Differential Revision: https://reviews.llvm.org/D131720
We happened to introduce a `member-declaration := ;` rule
when inlining the `member-declaration := decl-specifier-seq_opt
member-declarator-list_opt ;`.
And with the `member-declaration := empty-declaration` rule, we had two parses of `;`.
This patch is to restrict the grammar to eliminate the
`member-declaration := ;` rule.
Differential Revision: https://reviews.llvm.org/D131724
Now that we've updated to C++17 MSVC gives very verbose warnings about not creating classes that inherit from std::iterator - use llvm::iterator_facade_base instead
Fixes#57005
I went over the output of the following mess of a command:
`(ulimit -m 2000000; ulimit -v 2000000; git ls-files -z | parallel --xargs -0 cat | aspell list --mode=none --ignore-case | grep -E '^[A-Za-z][a-z]*$' | sort | uniq -c | sort -n | grep -vE '.{25}' | aspell pipe -W3 | grep : | cut -d' ' -f2 | less)`
and proceeded to spend a few days looking at it to find probable typos
and fixed a few hundred of them in all of the llvm project (note, the
ones I found are not anywhere near all of them, but it seems like a
good start).
Reviewed By: kadircet
Differential Revision: https://reviews.llvm.org/D130826
The solution is to favor the longest possible nest-name-specifier, and
drop other alternatives by using the guard, per per C++ [basic.lookup.qual.general].
Motivated cases:
```
Foo::Foo() {};
// the constructor can be parsed as:
// - Foo ::Foo(); // where the first Foo is return-type, and ::Foo is the function declarator
// + Foo::Foo(); // where Foo::Foo is the function declarator
```
```
void test() {
// a very slow parsing case when there are many qualifers!
X::Y::Z;
// The statement can be parsed as:
// - X ::Y::Z; // ::Y::Z is the declarator
// - X::Y ::Z; // ::Z is the declarator
// + X::Y::Z; // a declaration without declarator (X::Y::Z is decl-specifier-seq)
// + X::Y::Z; // a qualifed-id expression
}
```
Differential Revision: https://reviews.llvm.org/D130511
- Place rules under rule::lhs::rhs__rhs__rhs
- Change mangling of keywords to ALL_CAPS (needed to turn keywords that appear
alone on RHS into valid identifiers)
- Make enums implicitly convertible to underlying type (though still scoped,
using alias tricks)
In principle this lets us exhaustively write a switch over all rules of a NT:
switch ((rule::declarator)N->rule()) {
case rule::declarator::noptr_declarator:
...
}
In practice we don't do this anywhere yet as we're often switching over multiple
nonterminal kinds at once.
Differential Revision: https://reviews.llvm.org/D130414
This allows incomplete code such as `namespace foo {` to be modeled as a
normal sequence with the missing } represented by an empty opaque node.
Differential Revision: https://reviews.llvm.org/D130551
Motivating case: `foo bar;` is not a declaration of nothing with `foo` and `bar`
both types.
This is a common and critical ambiguity, clangd/AST.cpp has 20% fewer
ambiguous nodes (1674->1332) after this change.
Differential Revision: https://reviews.llvm.org/D130337
llvm::sort is beneficial even when we use the iterator-based overload,
since it can optionally shuffle the elements (to detect
non-determinism). However llvm::sort is not usable everywhere, for
example, in compiler-rt.
Reviewed By: nhaehnle
Differential Revision: https://reviews.llvm.org/D130406
- the grammar ambiguity is eliminated by a guard;
- modify the guard function signatures, now all parameters are folded in
to a single object, avoid a long parameter list (as we will add more
parameters in the near future);
Reviewed By: sammccall
Differential Revision: https://reviews.llvm.org/D130160
After this, NUMERIC_CONSTANT and strings should parse only one way.
There are 8 types of literals, and 24 valid (literal, TokenKind) pairs.
This means adding 8 new named guards (or 24, if we want to assert the token).
It seems fairly clear to me at this point that the guard names are unneccesary
indirection: the guards are in fact coupled to the rule signature.
(Also add the zero guard I forgot in the previous patch.)
Differential Revision: https://reviews.llvm.org/D130066
This eliminates some simple-declaration/function-definition false
parses.
- implement a function to determine whether a declarator ForestNode is a
function declarator;
- extend the standard declarator to two guarded function-declarator and
non-function-declarator nonterminals;
Differential Revision: https://reviews.llvm.org/D129222
This first version only uses bracket matching. We plan to extend this to
use DirectiveTree as well.
Also includes changes to Token to allow retrieving corresponding token
in token stream of original source file.
Differential Revision: https://reviews.llvm.org/D129648
`stripComments(cook(...))` is a common pattern being written.
Without this patch, this has a use-after-free issue (cook returns a temporary
TokenStream object which has its own payload, but the payload is not
shared with the one returned by stripComments).
Reviewed By: sammccall
Differential Revision: https://reviews.llvm.org/D125311
The idea is:
- a parse failure is detected when all heads die when trying to shift the next token
- we can recover by choosing a nonterminal we're partway through parsing, and
determining where it ends through nonlocal means (e.g. matching brackets)
- we can find candidates by walking up the stack from the (ex-)heads
- the token range is defined using heuristics attached to grammar rules
- the unparsed region is represented in the forest by an Opaque node
This patch has the core GLR functionality.
It does not allow recovery heuristics to be attached as extensions to
the grammar, but rather infers a brace-based heuristic.
Expected followups:
- make recovery heuristics grammar extensions (depends on D127448)
- add recovery to our grammar for bracketed constructs and sequence nodes
- change the structure of our augmented `_ := start` rules to eliminate some
special-cases in glrParse.
- (if I can work out how): avoid some spurious recovery cases described in comments
(Previously mistakenly committed as a0f4c10ae2)
Differential Revision: https://reviews.llvm.org/D128486
- Extend the GLR parser to allow conditional reduction based on the
guard functions;
- Implement two simple guards (contextual-override/final) for cxx.bnf;
- layering: clangPseudoCXX depends on clangPseudo (as the guard function need
to access the TokenStream);
Differential Revision: https://reviews.llvm.org/D127448