llvm-project/clang-tools-extra/pseudo
Haojian Wu 6a9f79e102 [pseudo] Eliminate the type-name identifier ambiguities in the grammar.
See https://reviews.llvm.org/D130626 for motivation.

Identifiers in the grammar fall into different categories (type-name, template-name,
namespace-name), which require semantic information to resolve. This patch
eliminates the "local" ambiguities in type-name and namespace-name, which
gives the parser a performance boost:

  - eliminate the separate type rules (class-name, enum-name, typedef-name) and
    fold them into a unified type-name; this removes the #1 type-name ambiguity and
    gives us a big performance boost;
  - remove the namespace-alias rules, as they're hard and uninteresting.

Note that we could eliminate more and gain more performance (e.g. by folding template-name,
type-name, and namespace-name together), but at the current stage we'd like to keep all existing
categories of identifier (as they might assist in correlated disambiguation and
keep the representation of important concepts uniform).

| file                 | ambiguous nodes | forest size      | glrParse performance |
|----------------------|-----------------|------------------|----------------------|
| SemaCodeComplete.cpp | 11k -> 5.7k     | 10.4MB -> 7.9MB  | 7.1MB/s -> 9.98MB/s  |
| AST.cpp              | 1.3k -> 0.73k   | 0.99MB -> 0.77MB | 6.7MB/s -> 8.4MB/s   |

Differential Revision: https://reviews.llvm.org/D130747
2022-08-17 14:30:53 +02:00
benchmarks [pseudo] Implement guard extension. 2022-07-05 15:55:15 +02:00
fuzzer [pseudo] Implement guard extension. 2022-07-05 15:55:15 +02:00
gen [clang-tools-extra] Fixed a number of typos 2022-08-01 15:32:25 +02:00
include Remove unused forward declarations (NFC) 2022-08-13 12:55:47 -07:00
lib [pseudo] Eliminate the type-name identifier ambiguities in the grammar. 2022-08-17 14:30:53 +02:00
test [pseudo] Eliminate the type-name identifier ambiguities in the grammar. 2022-08-17 14:30:53 +02:00
tool [pseudo] Add ambiguity & unparseability metrics to -print-statistics 2022-07-22 10:35:06 +02:00
unittests [pseudo] Use C++17 variant to simplify the DirectiveTree::Chunk class, NFC. 2022-08-11 14:27:38 +02:00
CMakeLists.txt
DesignNotes.md
README.md

README.md

clang pseudoparser

This directory implements an approximate heuristic parser for C++, based on the clang lexer, the C++ grammar, and the GLR parsing algorithm.

It parses a file in isolation, without reading its included headers. The result is a strict syntactic tree whose structure follows the C++ grammar. There is no semantic analysis, apart from guesses to disambiguate the parse. Disambiguation can optionally be guided by an AST or a symbol index.
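
To illustrate why guesses are needed, the sketch below shows the classic case where the same tokens, `a * b`, form a declaration or an expression depending on what `a` names. This is a hypothetical example written for this README, not code from the pseudoparser; the namespace and function names are assumptions chosen for illustration. A parser that cannot see the declaration of `a` (for instance because it lives in an unread header) has to keep both readings or guess between them.

```cpp
#include <cassert>

// Hypothetical illustration, not taken from the pseudoparser sources.

namespace a_is_type {
struct a {};
// Here `a` names a type, so `a * b;` declares b as a pointer to a.
inline bool declares_pointer() {
  a * b;
  b = nullptr;
  return b == nullptr;
}
}  // namespace a_is_type

namespace a_is_value {
// Here `a` names a variable, so `a * b` is a multiplication.
inline int multiplies() {
  int a = 6;
  int b = 7;
  return a * b;
}
}  // namespace a_is_value

// Exercises both readings of the same token sequence.
inline void check_examples() {
  assert(a_is_type::declares_pointer());
  assert(a_is_value::multiplies() == 42);
}
```

Calling `check_examples()` from a `main` function exercises both readings. A compiler resolves them through name lookup; a parser working on one file in isolation often lacks the declarations that lookup needs.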

For now, the best reference on intended scope is the design proposal, with further discussion on the RFC.

Dependencies between pseudoparser and clang

Dependencies are limited, partly because most of them would not make sense, and partly to avoid placing a burden on clang maintainers.

The pseudoparser reuses the clang lexer (clangLex and clangBasic libraries) but not the higher-level libraries (Parse, Sema, AST, Frontend...).

When the pseudoparser needs to be used together with an AST (e.g. to guide disambiguation), that integration belongs in a separate "bridge" library that depends on both.

Clang does not depend on the pseudoparser at all. If such a dependency seems useful in the future, it should be discussed in an RFC.

Parity between pseudoparser and clang

The pseudoparser aims to understand real-world code, and particularly the languages and extensions supported by Clang.

However, we don't try to keep these in lockstep: there's no expectation that Clang parser changes are accompanied by pseudoparser changes or vice versa.