23 Commits

Author SHA1 Message Date
Haojian Wu 2315358906 [pseudo] Generate an enum type for identifying grammar rules.
The Rule enum type enables us to identify a grammar rule within C++'s
type system.
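
A minimal sketch of what such an enum could look like, for illustration only. The rule names, the numeric values, and the `RuleID` alias below are hypothetical and are not taken from the generated output.

```cpp
#include <cstdint>

// Hypothetical stand-in for the numeric rule identifier used by the grammar.
using RuleID = uint16_t;

// Illustrative only: these enumerators are made up, not generated ones.
enum class Rule : RuleID {
  expression_0identifier = 0,
  statement_0expression_1semi = 1,
};

// With a dedicated enum type, a switch over rules is checked by the
// compiler, and a plain integer cannot be passed by accident.
inline const char *ruleName(Rule R) {
  switch (R) {
  case Rule::expression_0identifier:
    return "expression := IDENTIFIER";
  case Rule::statement_0expression_1semi:
    return "statement := expression ;";
  }
  return "<unknown>";
}
```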

Differential Revision: https://reviews.llvm.org/D129359
2022-07-15 15:09:31 +02:00
Sam McCall 7d8e2742d9 [pseudo] Define recovery strategy as grammar extension.
Differential Revision: https://reviews.llvm.org/D129158
2022-07-06 15:03:38 +02:00
Sam McCall 3121167488 [pseudo] Add error-recovery framework & brace-based recovery
The idea is:

- a parse failure is detected when all heads die when trying to shift the next token
- we can recover by choosing a nonterminal we're partway through parsing, and
  determining where it ends through nonlocal means (e.g. matching brackets)
- we can find candidates by walking up the stack from the (ex-)heads
- the token range is defined using heuristics attached to grammar rules
- the unparsed region is represented in the forest by an Opaque node

This patch has the core GLR functionality.
It does not allow recovery heuristics to be attached as extensions to
the grammar, but rather infers a brace-based heuristic.
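
The "matching brackets" idea can be sketched as follows. This is an assumption-laden illustration, not the patch's code: tokens are plain strings, and `findMatchingBrace` is a hypothetical helper name.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Returns the index of the "}" that matches the "{" at Tokens[Open],
// or Tokens.size() if the brace is never closed.
inline std::size_t findMatchingBrace(const std::vector<std::string> &Tokens,
                                     std::size_t Open) {
  assert(Open < Tokens.size() && Tokens[Open] == "{");
  int Depth = 0;
  for (std::size_t I = Open; I < Tokens.size(); ++I) {
    if (Tokens[I] == "{")
      ++Depth; // entering a nested block
    else if (Tokens[I] == "}" && --Depth == 0)
      return I; // closed the block we started in
  }
  return Tokens.size();
}
```

A recovering parser could treat the tokens strictly between the two indices as the unparsed region and resume after the closing brace.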

Expected followups:

- make recovery heuristics grammar extensions (depends on D127448)
- add recovery to our grammar for bracketed constructs and sequence nodes
- change the structure of our augmented `_ := start` rules to eliminate some
  special-cases in glrParse.
- (if I can work out how): avoid some spurious recovery cases described in comments

(Previously mistakenly committed as a0f4c10ae2)

Differential Revision: https://reviews.llvm.org/D128486
2022-07-05 20:49:41 +02:00
Sam McCall 9fbf1107cc [pseudo] Eliminate LRTable::Action. NFC
The last remaining uses are in tests/test builders.
Replace with a builder struct.

Differential Revision: https://reviews.llvm.org/D129093
2022-07-05 14:35:41 +02:00
Sam McCall b37dafd5dc [pseudo] Store shift and goto actions in a compact structure with faster lookup.
The actions table is very compact but the binary search to find the
correct action is relatively expensive.
A hashtable is faster but pretty large (64 bits per value, plus empty
slots, and lookup is constant time but not trivial due to collisions).

The structure in this patch uses 1.25 bits per entry (whether present or absent)
plus the size of the values, and lookup is trivial.

The Shift table is 119KB = 27KB values + 92KB keys.
The Goto table is 86KB = 30KB values + 57KB keys.
(Goto has a smaller keyspace as #nonterminals < #terminals, and more entries).

This patch improves glrParse speed by 28%: 4.69 => 5.99 MB/s
Overall the table grows by 60%: 142 => 228KB.

By comparison, DenseMap<unsigned, StateID> is "only" 16% faster (5.43 MB/s),
and results in a 285% larger table (547 KB) vs the baseline.
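
One layout that yields 1.25 bits per key is a presence bitmap (1 bit per key) plus a 16-bit running count per 64-bit word (16/64 = 0.25 bits per key). The sketch below assumes that layout; `SparseTable` and its member names are hypothetical and only illustrate the idea.

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Sparse map from a dense key space [0, NumKeys) to 16-bit values.
class SparseTable {
  std::vector<uint64_t> HasValue;    // one presence bit per key
  std::vector<uint16_t> Checkpoints; // set bits before each 64-bit word
  std::vector<uint16_t> Values;      // values of present keys, in key order

public:
  // Entries must be sorted by key, with unique keys below NumKeys.
  SparseTable(unsigned NumKeys,
              const std::vector<std::pair<unsigned, uint16_t>> &Entries)
      : HasValue((NumKeys + 63) / 64, 0), Checkpoints(HasValue.size(), 0) {
    for (const auto &E : Entries) {
      assert(E.first < NumKeys);
      HasValue[E.first / 64] |= uint64_t(1) << (E.first % 64);
      Values.push_back(E.second);
    }
    unsigned Count = 0;
    for (std::size_t W = 0; W < HasValue.size(); ++W) {
      Checkpoints[W] = static_cast<uint16_t>(Count);
      Count += static_cast<unsigned>(std::bitset<64>(HasValue[W]).count());
    }
  }

  // Returns true and sets Out if Key is present. No search, no collisions.
  bool get(unsigned Key, uint16_t &Out) const {
    uint64_t Word = HasValue[Key / 64];
    uint64_t Bit = uint64_t(1) << (Key % 64);
    if (!(Word & Bit))
      return false;
    // Index into Values = checkpoint + set bits below Key within its word.
    std::size_t Rank =
        Checkpoints[Key / 64] + std::bitset<64>(Word & (Bit - 1)).count();
    Out = Values[Rank];
    return true;
  }
};
```

Lookup is two array reads and a popcount, which is why it can beat both binary search and a hashtable while keeping absent keys cheap.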

Differential Revision: https://reviews.llvm.org/D128485
2022-07-04 19:40:04 +02:00
Sam McCall 743971faaf Revert "[pseudo] Add error-recovery framework & brace-based recovery"
This reverts commit a0f4c10ae2.
This commit hadn't been reviewed yet, and was unintentionally included
on another branch.
2022-06-28 21:11:09 +02:00
Sam McCall a0f4c10ae2 [pseudo] Add error-recovery framework & brace-based recovery
The idea is:
 - a parse failure is detected when all heads die when trying to shift
   the next token
 - we can recover by choosing a nonterminal we're partway through parsing,
   and determining where it ends through nonlocal means (e.g. matching brackets)
 - we can find candidates by walking up the stack from the (ex-)heads
 - the token range is defined using heuristics attached to grammar rules
 - the unparsed region is represented in the forest by an Opaque node

This patch has the core GLR functionality.
It does not allow recovery heuristics to be attached as extensions to
the grammar, but rather infers a brace-based heuristic.

Expected followups:
 - make recovery heuristics grammar extensions (depends on D127448)
 - add recovery to our grammar for bracketed constructs and sequence nodes
 - change the structure of our augmented `_ := start` rules to eliminate
   some special-cases in glrParse.
 - (if I can work out how): avoid some spurious recovery cases described
   in comments
 - grammar changes to eliminate the hard distinction between init-list
   and designated-init-list shown in the recovery-init-list.cpp testcase

Differential Revision: https://reviews.llvm.org/D128486
2022-06-28 21:08:43 +02:00
Sam McCall 3f028c02ba [pseudo] Grammar::parseBNF returns Grammar not unique_ptr. NFC 2022-06-28 16:34:21 +02:00
Sam McCall 85eaecbe8e [pseudo] Check follow-sets instead of tying reduce actions to lookahead tokens.
Previously, the action table stored a reduce action for each lookahead
token it should allow. These tokens are the followSet(action.rule.target).

In practice, the follow sets are large, so we spend a bunch of time binary
searching around all these essentially-duplicates to check whether our lookahead
token is there.
However the number of reduces for a given state is very small, so we're
much better off linear scanning over them and performing a fast check for each.

D128318 was an attempt at this, storing a bitmap for each reduce.
However it's even more compact just to use the follow sets directly, as
there are fewer nonterminals than (state, rule) pairs. It's also faster.

This specialized approach means unbundling Reduce from other actions in
LRTable, so it's no longer useful to support it in Action. I suspect
Action will soon go away, as we store each kind of action separately.

This improves glrParse speed by 42% (3.30 -> 4.69 MB/s).
It also reduces LR table size by 59% (343 -> 142kB).
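
The scan-and-check approach can be sketched as below. The types and names (`Tables`, `allowedReduces`, the `std::vector<bool>` follow sets) are hypothetical simplifications chosen for illustration, not the LRTable API.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

using SymbolID = uint16_t; // hypothetical ID types
using RuleID = uint16_t;

struct Rule {
  SymbolID Target; // nonterminal produced by the rule
};

struct Tables {
  std::vector<Rule> Rules;
  // FollowSets[N][T] is true if terminal T may follow nonterminal N.
  std::vector<std::vector<bool>> FollowSets;
};

// Reduces holds the (few) rules reducible in the current state.
// Returns those whose target's follow set contains the lookahead token.
inline std::vector<RuleID> allowedReduces(const Tables &T,
                                          const std::vector<RuleID> &Reduces,
                                          SymbolID Lookahead) {
  std::vector<RuleID> Result;
  for (RuleID R : Reduces) { // linear scan: the list is short
    assert(R < T.Rules.size());
    const std::vector<bool> &Follow = T.FollowSets[T.Rules[R].Target];
    if (Lookahead < Follow.size() && Follow[Lookahead])
      Result.push_back(R);
  }
  return Result;
}
```

Follow sets are stored once per nonterminal rather than once per (state, rule) pair, which is where the size reduction comes from.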

Differential Revision: https://reviews.llvm.org/D128472
2022-06-28 00:36:16 +02:00
Sam McCall b70ee9d984 Reland "[pseudo] Track heads as GSS nodes, rather than as "pending actions"."
This reverts commit 2c80b53198.

Fixes LRTable::buildForTest to create states that are referenced but
have no actions.
2022-06-23 18:21:44 +02:00
Sam McCall 2c80b53198 Revert "[pseudo] Track heads as GSS nodes, rather than as "pending actions"."
This reverts commit e3ec054dfd.

Tests fail in asserts mode: https://lab.llvm.org/buildbot/#/builders/109/builds/41217
2022-06-23 18:16:38 +02:00
Sam McCall e3ec054dfd [pseudo] Track heads as GSS nodes, rather than as "pending actions".
IMO this model is simpler to understand (borrowed from the LR0 patch D127357).
It also makes error recovery easier to implement, as we have a simple list of
head nodes lying around to recover from when needed.
(It's not quite as nice as LR0 in this respect though).

It's slightly slower (2.24 -> 2.12 MB/s on my machine = 5%) but nothing close
to as bad as LR0.
However
 - I think we'd have to eat a little performance loss otherwise to implement
   error recovery.
 - this frees up some complexity budget for optimizations like fastpath push/pop
   (this + fastpath is already faster than head)
 - I haven't changed the data structure here and it's now pretty dumb, we can
   make it faster

Differential Revision: https://reviews.llvm.org/D128297
2022-06-23 17:26:42 +02:00
Haojian Wu c70aeaad2b [pseudo] Move grammar-related headers to a separate dir, NFC.
We did that for .cpp, but forgot the headers.

Differential Revision: https://reviews.llvm.org/D127388
2022-06-09 14:58:05 +02:00
Haojian Wu 9ce232fba9 [pseudo] Fix the missing-field-initializers warning from f1ac00c9b0, NFC 2022-06-09 14:10:36 +02:00
Haojian Wu f1ac00c9b0 [pseudo] Add grammar annotations support.
Add annotation handling ([key=value]) in the BNF grammar parser, which
will be used for conditional reduction and error recovery.
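
A minimal sketch of parsing one `[key=value]` annotation, assuming it arrives as a standalone string. `Annotation` and `parseAnnotation` are hypothetical names, and the real parser may accept a different or richer syntax.

```cpp
#include <string>

struct Annotation {
  std::string Key;
  std::string Value;
};

// Parses text of the form "[key=value]".
// Returns false, leaving Out untouched, if the text is malformed.
inline bool parseAnnotation(const std::string &Text, Annotation &Out) {
  if (Text.size() < 2 || Text.front() != '[' || Text.back() != ']')
    return false;
  std::string Body = Text.substr(1, Text.size() - 2); // strip the brackets
  std::string::size_type Eq = Body.find('=');
  if (Eq == std::string::npos || Eq == 0)
    return false; // no '=' or empty key
  Out.Key = Body.substr(0, Eq);
  Out.Value = Body.substr(Eq + 1);
  return true;
}
```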

Reviewed By: sammccall

Differential Revision: https://reviews.llvm.org/D126536
2022-06-09 12:06:22 +02:00
Haojian Wu 7a05942dd0 [pseudo] Remove the explicit Accept actions.
As pointed out in the previous review section, having a dedicated accept
action doesn't seem to be necessary. This patch implements the same behavior
without an accept action, which will save some code complexity.

Reviewed By: sammccall

Differential Revision: https://reviews.llvm.org/D125677
2022-06-09 11:19:07 +02:00
Haojian Wu 075449da80 [pseudo] Fix a sign-compare warning in debug build, NFC. 2022-06-09 11:18:03 +02:00
Sam McCall bbc58c5e9b [pseudo] Restore accidentally removed debug print 2022-06-08 23:39:34 +02:00
Sam McCall 93bcff8aa8 [pseudo] Invert rows/columns of LRTable storage for speedup. NFC
There are more states than symbols.
This means first partitioning the action list by state leaves us with a smaller
range to binary search over. This improves find() a lot and glrParse() by 7%.
The tradeoff is that storing more, smaller ranges increases the size of the
offsets array; overall grammar memory is +1% (337->340KB).

Before:
glrParse    188795975 ns    188778003 ns           77 bytes_per_second=1.98068M/s
After:
glrParse    175936203 ns    175916873 ns           81 bytes_per_second=2.12548M/s
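
The state-major layout described above can be sketched like this. `StateMajorTable`, `Entry`, and the field names are hypothetical and chosen for illustration; they are not the LRTable types.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Entry {
  uint16_t Symbol;
  uint16_t Action;
};

// Entries are grouped by state and sorted by symbol within each state.
// Entries[StateOffsets[S] .. StateOffsets[S + 1]) belongs to state S.
struct StateMajorTable {
  std::vector<uint32_t> StateOffsets; // size = number of states + 1
  std::vector<Entry> Entries;

  // Returns the entry for (State, Symbol), or nullptr if there is none.
  const Entry *find(uint16_t State, uint16_t Symbol) const {
    assert(static_cast<std::size_t>(State) + 1 < StateOffsets.size());
    auto Begin = Entries.begin() + StateOffsets[State];
    auto End = Entries.begin() + StateOffsets[State + 1];
    // Binary search only within this state's (small) range.
    auto It = std::lower_bound(
        Begin, End, Symbol,
        [](const Entry &E, uint16_t S) { return E.Symbol < S; });
    return (It != End && It->Symbol == Symbol) ? &*It : nullptr;
  }
};
```

Because there are more states than symbols, each per-state range is shorter than a per-symbol range would be, at the cost of a longer offsets array.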

Differential Revision: https://reviews.llvm.org/D127006
2022-06-08 23:35:14 +02:00
Fangrui Song 47ec8b5574 [pseudo] Fix leaks after D126731
Array operator new cookies help lsan find allocations, while std::array
can't.
2022-06-03 18:43:16 -07:00
Sam McCall dc63ad8878 [pseudo] Eliminate dependencies from clang-pseudo-gen. NFC
ClangBasic dependency eliminated by replacing our usage of
tok::getPunctuatorSpelling etc with direct use of the *.def file.

Implicit dependencies on clang-tablegen-targets removed as we manage to avoid
any transitive tablegen deps.

After these changes, `ninja clean; ninja pseudo-gen` runs 169 actions only
(basically Support and Demangle).

Differential Revision: https://reviews.llvm.org/D126731
2022-06-03 20:42:38 +02:00
Haojian Wu a5ddd4a238 [pseudo] Remove an unnecessary nullable check diagnostic in the bnf grammar, NFC.

This diagnostic has been handled in eliminateOptional.
2022-05-30 09:04:47 +02:00
Haojian Wu cd2292ef82 [pseudo] A basic implementation of compiling cxx grammar at build time.
The main idea is to compile the cxx grammar at build time, and construct
the core pieces (Grammar, LRTable) of the pseudoparser based on the compiled
data sources.

This is a tiny implementation, which is good for a start:

- defines what the public API should look like;
- integrates the cxx grammar compilation workflow with the cmake system;
- only nonterminal symbols of the C++ grammar are compiled; everything
  else still does the real compilation work at runtime, and we can opt in more
  bits in the future;
- splits the monolithic clangPseudo library for better layering.

Reviewed By: sammccall

Differential Revision: https://reviews.llvm.org/D125667
2022-05-25 11:26:06 +02:00