Commit Graph

125 Commits

Author SHA1 Message Date
Sam McCall 3132e9cd7c [pseudo] Key guards by RuleID, add guards to literals (and 0).
After this, NUMERIC_CONSTANT and strings should parse only one way.

There are 8 types of literals, and 24 valid (literal, TokenKind) pairs.
This means adding 8 new named guards (or 24, if we want to assert the token).

It seems fairly clear to me at this point that the guard names are unnecessary
indirection: the guards are in fact coupled to the rule signature.

(Also add the zero guard I forgot in the previous patch.)

Differential Revision: https://reviews.llvm.org/D130066
2022-07-21 22:42:31 +02:00
Haojian Wu 65c8e24622 [pseudo] Fix an invalid assertion on recoveryBrackets.
`Begin` is not the index of the left bracket; `Begin-1` is.
Otherwise the assertion is triggered on cases like `Foo().call();`.
2022-07-21 14:02:11 +02:00
Haojian Wu 2955192df8 [pseudo] Make sure we rebuild pseudo_gen tool. 2022-07-21 10:09:21 +02:00
Sam McCall c91ce94144 [pseudo] Add `clang-pseudo -html-forest=<output.html>`, an HTML forest browser
It generates a standalone HTML file with all needed JS/CSS embedded.
This allows navigating the tree both with a tree widget and in the code,
inspecting nodes, and selecting ambiguous alternatives.

Demo: https://htmlpreview.github.io/?https://gist.githubusercontent.com/sam-mccall/03882f7499d293196594e8a50599a503/raw/ASTSignals.cpp.html

Differential Revision: https://reviews.llvm.org/D130004
2022-07-19 22:32:11 +02:00
Haojian Wu d489b3807f [pseudo] Implement a guard to determine function declarator.
This eliminates some simple-declaration/function-definition false
parses.

- implement a function to determine whether a declarator ForestNode is a
  function declarator;
- extend the standard declarator to two guarded function-declarator and
  non-function-declarator nonterminals;

Differential Revision: https://reviews.llvm.org/D129222
2022-07-19 09:44:45 +02:00
Sam McCall fa0c7639e9 [pseudo] Add guards for module contextual keywords 2022-07-18 22:38:41 +02:00
Haojian Wu 098488e09a [pseudo] Be more precise when printing the error message, NFC 2022-07-18 13:23:18 +02:00
Utkarsh Saxena 70914aa631 Use pseudo parser for folding ranges
This first version only uses bracket matching. We plan to extend this to
use DirectiveTree as well.

Also includes changes to Token to allow retrieving corresponding token
in token stream of original source file.

Differential Revision: https://reviews.llvm.org/D129648
2022-07-18 11:35:34 +02:00
Haojian Wu b94ea8b3eb [pseudo] Add bracket recovery for function parameters. 2022-07-18 10:23:15 +02:00
Haojian Wu 76910d4a56 [pseudo] Share the underlying payload when stripping comments for a token stream
`stripComments(cook(...))` is a common pattern being written.
Without this patch, this has a use-after-free issue (cook returns a temporary
TokenStream object which has its own payload, but the payload is not
shared with the one returned by stripComments).

Reviewed By: sammccall

Differential Revision: https://reviews.llvm.org/D125311
2022-07-15 15:20:48 +02:00
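The payload-sharing fix in 76910d4a56 can be illustrated with a minimal sketch. This is not the clang-pseudo implementation: `Token`, `TokenStream`, `cook` and `stripComments` below are simplified, hypothetical stand-ins (whitespace-separated tokens, `/*...*/` words as comments) that only show why the stripped stream has to share ownership of the text its tokens point into.

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, simplified token: a view into text owned by the stream payload.
struct Token {
  const char *Data;
  std::size_t Length;
  std::string text() const { return std::string(Data, Length); }
  bool isComment() const {
    return Length >= 2 && Data[0] == '/' && Data[1] == '*';
  }
};

// Hypothetical stream: tokens plus shared ownership of the backing storage.
class TokenStream {
public:
  explicit TokenStream(std::shared_ptr<void> P) : Payload(std::move(P)) {}
  void push(Token T) { Tokens.push_back(T); }
  const std::vector<Token> &tokens() const { return Tokens; }
  const std::shared_ptr<void> &payload() const { return Payload; }

private:
  std::shared_ptr<void> Payload;
  std::vector<Token> Tokens;
};

// Copies the text into a fresh payload and splits it on spaces.
TokenStream cook(const std::string &Code) {
  auto Text = std::make_shared<std::string>(Code);
  TokenStream Result(Text);
  std::size_t I = 0;
  while (I < Text->size()) {
    if ((*Text)[I] == ' ') {
      ++I;
      continue;
    }
    std::size_t End = Text->find(' ', I);
    if (End == std::string::npos)
      End = Text->size();
    Result.push(Token{Text->data() + I, End - I});
    I = End;
  }
  return Result;
}

// The output shares the input's payload, so its tokens stay valid even after
// the input stream (possibly a temporary) has been destroyed.
TokenStream stripComments(const TokenStream &In) {
  TokenStream Out(In.payload());
  for (const Token &T : In.tokens())
    if (!T.isComment())
      Out.push(T);
  return Out;
}
```

With this layout, `stripComments(cook("int /*c*/ x"))` is safe: the temporary returned by `cook` dies at the end of the expression, but the result still holds a reference to the same payload.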
Haojian Wu 2315358906 [pseudo] Generate an enum type for identifying grammar rules.
The Rule enum type enables us to identify a grammar rule within C++'s
type system.

Differential Revision: https://reviews.llvm.org/D129359
2022-07-15 15:09:31 +02:00
Kazu Hirata 53daa177f8 [clang, clang-tools-extra] Use has_value instead of hasValue (NFC) 2022-07-12 22:47:41 -07:00
Haojian Wu cd3aa338c7 [pseudo] NFC, fix the header guard for Language.h 2022-07-07 14:42:26 +02:00
Sam McCall 7d8e2742d9 [pseudo] Define recovery strategy as grammar extension.
Differential Revision: https://reviews.llvm.org/D129158
2022-07-06 15:03:38 +02:00
Sam McCall 3121167488 [pseudo] Add error-recovery framework & brace-based recovery
The idea is:

- a parse failure is detected when all heads die when trying to shift the next token
- we can recover by choosing a nonterminal we're partway through parsing, and
  determining where it ends through nonlocal means (e.g. matching brackets)
- we can find candidates by walking up the stack from the (ex-)heads
- the token range is defined using heuristics attached to grammar rules
- the unparsed region is represented in the forest by an Opaque node

This patch has the core GLR functionality.
It does not allow recovery heuristics to be attached as extensions to
the grammar, but rather infers a brace-based heuristic.

Expected followups:

- make recovery heuristics grammar extensions (depends on D127448)
- add recovery to our grammar for bracketed constructs and sequence nodes
- change the structure of our augmented `_ := start` rules to eliminate some
  special-cases in glrParse.
- (if I can work out how): avoid some spurious recovery cases described in comments

(Previously mistakenly committed as a0f4c10ae2)

Differential Revision: https://reviews.llvm.org/D128486
2022-07-05 20:49:41 +02:00
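The "nonlocal means (e.g. matching brackets)" step in 3121167488 can be sketched as follows. This is an illustration under assumptions, not the patch itself: tokens are modelled as plain strings, and `findMatchingBracket` is a hypothetical helper that reports where a bracketed region ends so the tokens in between could be covered by an opaque node.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Returns the closing bracket expected for an opener, or "" for other tokens.
static std::string closerFor(const std::string &Tok) {
  if (Tok == "(") return ")";
  if (Tok == "[") return "]";
  if (Tok == "{") return "}";
  return "";
}

static bool isCloser(const std::string &Tok) {
  return Tok == ")" || Tok == "]" || Tok == "}";
}

// Returns the index of the bracket that closes the opener at index Open.
// Returns Tokens.size() if Open is not an opener or the brackets are unbalanced.
std::size_t findMatchingBracket(const std::vector<std::string> &Tokens,
                                std::size_t Open) {
  if (Open >= Tokens.size() || closerFor(Tokens[Open]).empty())
    return Tokens.size();
  std::vector<std::string> Expected; // stack of closers still owed
  for (std::size_t I = Open; I < Tokens.size(); ++I) {
    const std::string &Tok = Tokens[I];
    std::string Closer = closerFor(Tok);
    if (!Closer.empty()) {
      Expected.push_back(Closer);
    } else if (isCloser(Tok)) {
      if (Expected.back() != Tok)
        return Tokens.size(); // mismatched nesting, give up
      Expected.pop_back();
      if (Expected.empty())
        return I;
    }
  }
  return Tokens.size(); // ran out of tokens
}
```

For the token sequence `Foo ( ) . call ( a [ 1 ] ) ;`, the opener at index 5 is closed at index 10, so a recovery that starts just after the bracket would skip the tokens at indices 6 through 9.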
Haojian Wu 9ab67cc8bf [pseudo] Implement guard extension.
- Extend the GLR parser to allow conditional reduction based on the
  guard functions;
- Implement two simple guards (contextual-override/final) for cxx.bnf;
- layering: clangPseudoCXX depends on clangPseudo (as the guard functions need
  to access the TokenStream);

Differential Revision: https://reviews.llvm.org/D127448
2022-07-05 15:55:15 +02:00
Haojian Wu d263447311 [pseudo] Fix the build for the benchmark tool. 2022-07-05 15:42:41 +02:00
Haojian Wu 70c0d92930 [pseudo] Use the prebuilt cxx grammar for the lit tests, NFC.
Differential Revision: https://reviews.llvm.org/D129074
2022-07-05 15:17:18 +02:00
Sam McCall 9fbf1107cc [pseudo] Eliminate LRTable::Action. NFC
The last remaining uses are in tests/test builders.
Replace with a builder struct.

Differential Revision: https://reviews.llvm.org/D129093
2022-07-05 14:35:41 +02:00
Sam McCall b37dafd5dc [pseudo] Store shift and goto actions in a compact structure with faster lookup.
The actions table is very compact but the binary search to find the
correct action is relatively expensive.
A hashtable is faster but pretty large (64 bits per value, plus empty
slots, and lookup is constant time but not trivial due to collisions).

The structure in this patch uses 1.25 bits per entry (whether present or absent)
plus the size of the values, and lookup is trivial.

The Shift table is 119KB = 27KB values + 92KB keys.
The Goto table is 86KB = 30KB values + 57KB keys.
(Goto has a smaller keyspace as #nonterminals < #terminals, and more entries).

This patch improves glrParse speed by 28%: 4.69 => 5.99 MB/s
Overall the table grows by 60%: 142 => 228KB.

By comparison, DenseMap<unsigned, StateID> is "only" 16% faster (5.43 MB/s),
and results in a 285% larger table (547 KB) vs the baseline.

Differential Revision: https://reviews.llvm.org/D128485
2022-07-04 19:40:04 +02:00
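One layout that reaches the 1.25 bits per key mentioned in b37dafd5dc is a presence bitmap (64 keys per word) plus a 16-bit running count per word: (64 + 16) / 64 = 1.25. The sketch below assumes that layout; the class name `SparseTable` and its interface are hypothetical and are not taken from LRTable. Lookup is one bit test, one popcount and one array index, with no searching or collision handling.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

// Portable population count (number of set bits).
inline unsigned popcount64(std::uint64_t X) {
  unsigned N = 0;
  for (; X; X &= X - 1)
    ++N;
  return N;
}

// Hypothetical sparse map from a dense key space to 16-bit values.
// Assumes fewer than 65536 entries so the 16-bit checkpoints cannot overflow.
class SparseTable {
public:
  SparseTable(std::uint32_t KeySpace,
              const std::map<std::uint32_t, std::uint16_t> &Entries)
      : HasValue((KeySpace + 63) / 64, 0), Checkpoints(HasValue.size(), 0) {
    // std::map iterates in key order, so Values ends up sorted by key.
    for (const auto &E : Entries) {
      assert(E.first < KeySpace);
      HasValue[E.first / 64] |= std::uint64_t(1) << (E.first % 64);
      Values.push_back(E.second);
    }
    // Checkpoints[W] = number of present keys in all words before W.
    std::uint16_t Count = 0;
    for (std::size_t W = 0; W < HasValue.size(); ++W) {
      Checkpoints[W] = Count;
      Count = static_cast<std::uint16_t>(Count + popcount64(HasValue[W]));
    }
  }

  // Returns a pointer to the value for Key, or nullptr if the key is absent.
  const std::uint16_t *find(std::uint32_t Key) const {
    std::size_t W = Key / 64;
    if (W >= HasValue.size())
      return nullptr;
    std::uint64_t Bit = std::uint64_t(1) << (Key % 64);
    if (!(HasValue[W] & Bit))
      return nullptr;
    // Rank of this key = present keys before this word + present keys below
    // this bit within the word.
    unsigned Rank = Checkpoints[W] + popcount64(HasValue[W] & (Bit - 1));
    return &Values[Rank];
  }

private:
  std::vector<std::uint64_t> HasValue;    // 1 bit per key
  std::vector<std::uint16_t> Checkpoints; // 16 bits per 64 keys
  std::vector<std::uint16_t> Values;      // one value per present key
};
```

In an LR table the key would presumably combine a state and a symbol (for example `State * NumSymbols + Symbol`), which is why a smaller symbol space gives a smaller key array.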
Haojian Wu 5f0a054f89 [pseudo] Remove duplicated code in ClangPseudo.cpp
The code was added accidentally during the rebase when landing fe66aebd.
2022-07-04 11:32:56 +02:00
Haojian Wu bbcd8e5271 [pseudo] NFC, polish the fix of c998273499 2022-07-01 21:25:46 +02:00
Haojian Wu c998273499 [pseudo] Fix an out-of-bound issue in getReduceRules. 2022-07-01 20:16:06 +02:00
Sam McCall a322c104cb [pseudo] temporary fix for missing generated header after fe66aebd75
Better fix to be added by Haojian later!
2022-07-01 16:45:22 +02:00
Haojian Wu fe66aebd75 [pseudo] Define a clangPseudoCLI library.
- define a common data structure Language which is a compiled result of the
  bnf grammar. It is defined in Language.h;
- creates a clangPseudoCLI lib which defines a grammar commandline flag and
  exposes a function to get the Language. It supports --grammar=cxx,
  --grammar=/path/to/file.bnf;
- use the clangPseudoCLI in clang-pseudo, fuzzer, and benchmark tools (
  simplify the code and use the prebuilt cxx grammar);

Split out from https://reviews.llvm.org/D127448.

Differential Revision: https://reviews.llvm.org/D128679
2022-07-01 08:31:34 +02:00
Sam McCall 9b6bb12b85 [pseudo] Add ForestNode descendants iterator, print ambiguous/opaque node stats.
Differential Revision: https://reviews.llvm.org/D128930
2022-06-30 21:20:55 +02:00
Sam McCall 8b04c331b5 [pseudo] Forest dump ascii art isn't broken by large indices 2022-06-30 16:53:51 +02:00
Sam McCall bc5e7ced1c [pseudo] Fix bugs/inconsistencies in forest dump.
- when printing a shared node for the second time, don't print its children
  (This keeps output proportional to the size of the structure)
- when printing a shared node for the second time, print its type only, not rule
  (for consistency with above: don't dump details of nodes twice)
- don't abbreviate shared nodes, to ensure we can prune the tree there

Differential Revision: https://reviews.llvm.org/D128805
2022-06-29 22:56:26 +02:00
Haojian Wu 1ba7f5218c [pseudo] Update the cxx.bnf path in comments to reflect the new
location, NFC
2022-06-29 15:10:39 +02:00
Sam McCall 743971faaf Revert "[pseudo] Add error-recovery framework & brace-based recovery"
This reverts commit a0f4c10ae2.
This commit hadn't been reviewed yet, and was unintentionally included
on another branch.
2022-06-28 21:11:09 +02:00
Sam McCall d25361c3af [pseudo] Move ellipsis into initializer-list-item. NFC
This makes the list formation a bit simpler.
2022-06-28 21:08:43 +02:00
Sam McCall a0f4c10ae2 [pseudo] Add error-recovery framework & brace-based recovery
The idea is:
 - a parse failure is detected when all heads die when trying to shift
   the next token
 - we can recover by choosing a nonterminal we're partway through parsing,
   and determining where it ends through nonlocal means (e.g. matching brackets)
 - we can find candidates by walking up the stack from the (ex-)heads
 - the token range is defined using heuristics attached to grammar rules
 - the unparsed region is represented in the forest by an Opaque node

This patch has the core GLR functionality.
It does not allow recovery heuristics to be attached as extensions to
the grammar, but rather infers a brace-based heuristic.

Expected followups:
 - make recovery heuristics grammar extensions (depends on D127448)
 - add recovery to our grammar for bracketed constructs and sequence nodes
 - change the structure of our augmented `_ := start` rules to eliminate
   some special-cases in glrParse.
 - (if I can work out how): avoid some spurious recovery cases described
   in comments
 - grammar changes to eliminate the hard distinction between init-list
   and designated-init-list shown in the recovery-init-list.cpp testcase

Differential Revision: https://reviews.llvm.org/D128486
2022-06-28 21:08:43 +02:00
Sam McCall 3f028c02ba [pseudo] Grammar::parseBNF returns Grammar not unique_ptr. NFC 2022-06-28 16:34:21 +02:00
Sam McCall 241557fb06 [pseudo] Move cxx grammar into the cxx/ directory. NFC 2022-06-28 16:02:10 +02:00
Sam McCall aacefc817d [pseudo] Simplify/loosen the grammar around lambda captures.
Treat captures as a uniform list, rather than default-captures being special
snowflakes that may only appear at the start.

This accepts a larger set of (incorrect) code, and simplifies error-handling
by making this fit into the usual homogeneous-list pattern.

Differential Revision: https://reviews.llvm.org/D128708
2022-06-28 15:56:12 +02:00
Sam McCall 8cf28585a4 [pseudo] Allow mixed designated/undesignated init lists.
This isn't allowed by the standard grammar but is allowed in C, and clang/GCC
permit it as an extension.
It avoids the need to determine which type of list we have in error-recovery.

While here, also support array index designators `{ [4]=1 }` which are
also legal in C, and common extensions in C++.

Differential Revision: https://reviews.llvm.org/D128687
2022-06-28 15:45:41 +02:00
Sam McCall 85eaecbe8e [pseudo] Check follow-sets instead of tying reduce actions to lookahead tokens.
Previously, the action table stores a reduce action for each lookahead
token it should allow. These tokens are the followSet(action.rule.target).

In practice, the follow sets are large, so we spend a bunch of time binary
searching around all these essentially-duplicates to check whether our lookahead
token is there.
However the number of reduces for a given state is very small, so we're
much better off linear scanning over them and performing a fast check for each.

D128318 was an attempt at this, storing a bitmap for each reduce.
However it's even more compact just to use the follow sets directly, as
there are fewer nonterminals than (state, rule) pairs. It's also faster.

This specialized approach means unbundling Reduce from other actions in
LRTable, so it's no longer useful to support it in Action. I suspect
Action will soon go away, as we store each kind of action separately.

This improves glrParse speed by 42% (3.30 -> 4.69 MB/s).
It also reduces LR table size by 59% (343 -> 142kB).

Differential Revision: https://reviews.llvm.org/D128472
2022-06-28 00:36:16 +02:00
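The lookup described in 85eaecbe8e (a short linear scan over the reduces of a state, with a fast follow-set test for each) might look roughly like this. The types and names (`Tables`, `applicableReduces`, `NumTokenKinds`) are hypothetical and the sizes are arbitrary; only the shape of the check is taken from the commit message.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using SymbolID = std::uint16_t;
using RuleID = std::uint16_t;

// Hypothetical upper bound on the number of token kinds.
constexpr std::size_t NumTokenKinds = 512;

struct Rule {
  SymbolID Target; // the nonterminal this rule produces
};

// Hypothetical table layout: one follow set per nonterminal, and a short
// list of candidate reduce rules per state.
struct Tables {
  std::vector<Rule> Rules;
  std::vector<std::bitset<NumTokenKinds>> FollowSets; // indexed by nonterminal
  std::vector<std::vector<RuleID>> ReducesByState;    // indexed by state
};

// Returns the rules that may be reduced in State given the lookahead token.
// The per-state list is short, so a linear scan with an O(1) bit test per
// rule replaces a binary search over (state, token) pairs.
std::vector<RuleID> applicableReduces(const Tables &T, unsigned State,
                                      unsigned Lookahead) {
  assert(State < T.ReducesByState.size() && Lookahead < NumTokenKinds);
  std::vector<RuleID> Result;
  for (RuleID R : T.ReducesByState[State])
    if (T.FollowSets[T.Rules[R].Target].test(Lookahead))
      Result.push_back(R);
  return Result;
}
```

Storage is one bitset per nonterminal rather than one per (state, rule) pair, which matches the commit's argument for why using the follow sets directly is more compact.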
Kazu Hirata 94460f5136 Don't use Optional::hasValue (NFC)
This patch replaces x.hasValue() with x where x is contextually
convertible to bool.
2022-06-26 19:54:41 -07:00
Kazu Hirata 3b7c3a654c Revert "Don't use Optional::hasValue (NFC)"
This reverts commit aa8feeefd3.
2022-06-25 11:56:50 -07:00
Kazu Hirata aa8feeefd3 Don't use Optional::hasValue (NFC) 2022-06-25 11:55:57 -07:00
Sam McCall 768216cac0 [pseudo] Handle no-reductions-available on the fastpath. NFC
This is a ~2% speedup.
2022-06-23 20:34:11 +02:00
Sam McCall 466eae6aa3 [pseudo] Store last node popped in the queue, not its parent(s). NFC
We have to walk up to the last node to find the start token, but no need
to go even one node further.

This is one node fewer to store, but more importantly if the last node
happens to have multiple parents we avoid storing the sequence multiple times.

This saves ~5% on glrParse.
Based on a comment by hokein@ on https://reviews.llvm.org/D128307
2022-06-23 20:10:20 +02:00
Sam McCall 7aff663b2a [pseudo] Store reduction sequences by pointer in heaps, instead of by value.
Copying sequences around as the heap resized is significantly expensive.

This speeds up glrParse by ~35% (2.4 => 3.25 MB/s)

Differential Revision: https://reviews.llvm.org/D128307
2022-06-23 19:41:11 +02:00
Sam McCall 3e610f2cdc [pseudo] Turn glrReduce into a class, reuse storage across calls.
This is a ~5% speedup, we no longer have to allocate the priority queues and
other collections for each reduction step where we use them.

It's also IMO easier to understand the structure of a class with methods vs a
function with nested lambdas.

Differential Revision: https://reviews.llvm.org/D128301
2022-06-23 19:27:47 +02:00
Sam McCall f9710d1908 [pseudo] Add a fast-path to GLR reduce when both pop and push are trivial
In general we split a reduce into pop/push, so concurrently-available reductions
can run in the correct order. The data structures for this are expensive.

When only one reduction is possible at a time, we need not do this: we can pop
and immediately push instead.
Strictly this is correct whenever we yield one concurrent PushSpec.

This patch recognizes a trivial but common subset of these cases:
 - there must be no pending pushes and only one head available to pop
 - the head must have only one reduction rule
 - the reduction path must be a straight line (no multiple parents)

On my machine this speeds up by 2.12 -> 2.30 MB/s = 8%

Differential Revision: https://reviews.llvm.org/D128299
2022-06-23 18:21:59 +02:00
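The three fast-path conditions listed in f9710d1908 can be written as a single check. This is a sketch under assumptions: `GSSNode` is reduced to a state and a parent list, and `trivialReduceBase` is a hypothetical helper, not the function added by the patch. It returns the node to push onto when the trivial pop succeeds, and `nullptr` when the general pop/push machinery is needed.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical, simplified graph-structured-stack node.
struct GSSNode {
  unsigned State;
  std::vector<const GSSNode *> Parents;
};

// Heads:          heads currently available to pop
// PendingPushes:  number of pushes already queued
// ReduceLengths:  right-hand-side length of each rule reducible at the head
const GSSNode *trivialReduceBase(const std::vector<const GSSNode *> &Heads,
                                 std::size_t PendingPushes,
                                 const std::vector<unsigned> &ReduceLengths) {
  // No pending pushes and only one head available to pop.
  if (PendingPushes != 0 || Heads.size() != 1)
    return nullptr;
  // The head must have only one reduction rule.
  if (ReduceLengths.size() != 1)
    return nullptr;
  // The reduction path must be a straight line: every popped node has
  // exactly one parent.
  const GSSNode *Base = Heads.front();
  for (unsigned I = 0; I < ReduceLengths.front(); ++I) {
    if (Base->Parents.size() != 1)
      return nullptr;
    Base = Base->Parents.front();
  }
  return Base;
}
```

When the function returns a node, the caller can pop and push immediately; in every other case it falls back to the ordered pop/push queues.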
Sam McCall b70ee9d984 Reland "[pseudo] Track heads as GSS nodes, rather than as "pending actions"."
This reverts commit 2c80b53198.

Fixes LRTable::buildForTest to create states that are referenced but
have no actions.
2022-06-23 18:21:44 +02:00
Sam McCall 2c80b53198 Revert "[pseudo] Track heads as GSS nodes, rather than as "pending actions"."
This reverts commit e3ec054dfd.

Tests fail in asserts mode: https://lab.llvm.org/buildbot/#/builders/109/builds/41217
2022-06-23 18:16:38 +02:00
Sam McCall e3ec054dfd [pseudo] Track heads as GSS nodes, rather than as "pending actions".
IMO this model is simpler to understand (borrowed from the LR0 patch D127357).
It also makes error recovery easier to implement, as we have a simple list of
head nodes lying around to recover from when needed.
(It's not quite as nice as LR0 in this respect though).

It's slightly slower (2.24 -> 2.12 MB/s on my machine = 5%) but nothing close
to as bad as LR0.
However
 - I think we'd have to eat a little performance loss otherwise to implement
   error recovery.
 - this frees up some complexity budget for optimizations like fastpath push/pop
   (this + fastpath is already faster than head)
 - I haven't changed the data structure here and it's now pretty dumb, we can
   make it faster

Differential Revision: https://reviews.llvm.org/D128297
2022-06-23 17:26:42 +02:00
Sam McCall 6b187fdf3b [pseudo] Add xfail tests for a simple-declaration/function-definition ambiguity
I expect to eliminate this ambiguity at the grammar level by use of guards,
because it interferes with brace-based error recovery.

Differential Revision: https://reviews.llvm.org/D127400
2022-06-23 15:52:22 +02:00
Kazu Hirata 5413bf1bac Don't use Optional::hasValue (NFC) 2022-06-20 11:33:56 -07:00