forked from OSchip/llvm-project
167 lines
8.2 KiB
Markdown
167 lines
8.2 KiB
Markdown
<!--===- docs/ParserCombinators.md
|
|
|
|
Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
|
|
See https://llvm.org/LICENSE.txt for license information.
|
|
SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
|
|
|
|
-->
|
|
|
|
## Concept
|
|
The Fortran language recognizer here can be classified as an LL recursive
|
|
descent parser. It is composed from a *parser combinator* library that
|
|
defines a few fundamental parsers and a few ways to compose them into more
|
|
powerful parsers.
|
|
|
|
For our purposes here, a *parser* is any object that attempts to recognize
|
|
an instance of some syntax from an input stream. It may succeed or fail.
|
|
On success, it may return some semantic value to its caller.
|
|
|
|
In C++ terms, a parser is any instance of a class that
|
|
1. has a `constexpr` default constructor,
|
|
1. defines a type named `resultType`, and
|
|
1. provides a function (`const` member or `static`) that accepts a reference to a
|
|
`ParseState` as its argument and returns a `std::optional<resultType>` as a
|
|
result, with the presence or absence of a value in the `std::optional<>`
|
|
signifying success or failure, respectively.
|
|
```
|
|
std::optional<resultType> Parse(ParseState &) const;
|
|
```
|
|
The `resultType` of a parser is typically the class type of some particular
|
|
node type in the parse tree.
|
|
|
|
`ParseState` is a class that encapsulates a position in the source stream,
|
|
collects messages, and holds a few state flags that determine tokenization
|
|
(e.g., are we in a character literal?). Instances of `ParseState` are
|
|
independent and complete -- they are cheap to duplicate whenever necessary to
|
|
implement backtracking.
|
|
|
|
The `constexpr` default constructor of a parser is important. The functions
|
|
(below) that operate on instances of parsers are themselves all `constexpr`.
|
|
This use of compile-time expressions allows the entirety of a recursive
|
|
descent parser for a language to be constructed at compilation time through
|
|
the use of templates.
|
|
|
|
### Fundamental Predefined Parsers
|
|
These objects and functions are (or return) the fundamental parsers:
|
|
|
|
* `ok` is a trivial parser that always succeeds without advancing.
|
|
* `pure(x)` returns a trivial parser that always succeeds without advancing,
|
|
returning some value `x`.
|
|
* `pure<T>()` is `pure(T{})` but does not require that T be copy-constructible.
|
|
* `fail<T>(msg)` denotes a trivial parser that always fails, emitting the
|
|
given message as a side effect. The template parameter is the type of
|
|
the value that the parser never returns.
|
|
* `nextCh` consumes the next character and returns its location,
|
|
and fails at EOF.
|
|
* `"xyz"_ch` succeeds if the next character consumed matches any of those
|
|
in the string and returns its location. Be advised that the source
|
|
will have been normalized to lower case (miniscule) letters outside
|
|
character and Hollerith literals and edit descriptors before parsing.
|
|
|
|
### Combinators
|
|
These functions and operators combine existing parsers to generate new parsers.
|
|
They are `constexpr`, so they should be viewed as type-safe macros.
|
|
|
|
* `!p` succeeds if p fails, and fails if p succeeds.
|
|
* `p >> q` fails if p does, otherwise running q and returning its value when
|
|
it succeeds.
|
|
* `p / q` fails if p does, otherwise running q and returning p's value
|
|
if q succeeds.
|
|
* `p || q` succeeds if p does, otherwise running q. The two parsers must
|
|
have the same type, and the value returned by the first succeeding parser
|
|
is the value of the combination.
|
|
* `first(p1, p2, ...)` returns the value of the first parser that succeeds.
|
|
All of the parsers in the list must return the same type.
|
|
It is essentially the same as `p1 || p2 || ...` but has a slightly
|
|
faster implementation and may be easier to format in your code.
|
|
* `lookAhead(p)` succeeds if p does, but doesn't modify any state.
|
|
* `attempt(p)` succeeds if p does, safely preserving state on failure.
|
|
* `many(p)` recognizes a greedy sequence of zero or more nonempty successes
|
|
of p, and returns `std::list<>` of their values. It always succeeds.
|
|
* `some(p)` recognized a greedy sequence of one or more successes of p.
|
|
It fails if p immediately fails.
|
|
* `skipMany(p)` is the same as `many(p)`, but it discards the results.
|
|
* `maybe(p)` tries to match p, returning an `std::optional<T>` value.
|
|
It always succeeds.
|
|
* `defaulted(p)` matches p, and when p fails it returns a
|
|
default-constructed instance of p's resultType. It always succeeds.
|
|
* `nonemptySeparated(p, q)` repeatedly matches "p q p q p q ... p",
|
|
returning a `std::list<>` of only the values of the p's. It fails if
|
|
p immediately fails.
|
|
* `extension(p)` parses p if strict standard compliance is disabled,
|
|
or with a warning if nonstandard usage warnings are enabled.
|
|
* `deprecated(p)` parses p if strict standard compliance is disabled,
|
|
with a warning if deprecated usage warnings are enabled.
|
|
* `inContext(msg, p)` runs p within an error message context; any
|
|
message that `p` generates will be tagged with `msg` as its
|
|
context. Contexts may nest.
|
|
* `withMessage(msg, p)` succeeds if `p` does, and if it does not,
|
|
it discards the messages from `p` and fails with the specified message.
|
|
* `recovery(p, q)` is equivalent to `p || q`, except that error messages
|
|
generated from the first parser are retained, and a flag is set in
|
|
the ParseState to remember that error recovery was necessary.
|
|
* `localRecovery(msg, p, q)` is equivalent to
|
|
`recovery(withMessage(msg, p), q >> pure<A>())` where `A` is the
|
|
result type of 'p'.
|
|
It is useful for targeted error recovery situations within statements.
|
|
|
|
Note that
|
|
```
|
|
a >> b >> c / d / e
|
|
```
|
|
matches a sequence of five parsers, but returns only the result that was
|
|
obtained by matching `c`.
|
|
|
|
### Applicatives
|
|
The following *applicative* combinators combine parsers and modify or
|
|
collect the values that they return.
|
|
|
|
* `construct<T>(p1, p2, ...)` matches zero or more parsers in succession,
|
|
collecting their results and then passing them with move semantics to a
|
|
constructor for the type T if they all succeed.
|
|
If there is a single parser as the argument and it returns no usable
|
|
value but only success or failure (_e.g.,_ `"IF"_tok`), the default
|
|
nullary constructor of the type `T` is called.
|
|
* `sourced(p)` matches p, and fills in its `source` data member with the
|
|
locations of the cooked character stream that it consumed
|
|
* `applyFunction(f, p1, p2, ...)` matches one or more parsers in succession,
|
|
collecting their results and passing them as rvalue reference arguments to
|
|
some function, returning its result.
|
|
* `applyLambda([](&&x){}, p1, p2, ...)` is the same thing, but for lambdas
|
|
and other function objects.
|
|
* `applyMem(mf, p1, p2, ...)` is the same thing, but invokes a member
|
|
function of the result of the first parser for updates in place.
|
|
|
|
### Token Parsers
|
|
Last, we have these basic parsers on which the actual grammar of the Fortran
|
|
is built. All of the following parsers consume characters acquired from
|
|
`nextCh`.
|
|
|
|
* `space` always succeeds after consuming any spaces
|
|
* `spaceCheck` always succeeds after consuming any spaces, and can emit
|
|
a warning if there was no space in free form code before a character
|
|
that could continue a name or keyword
|
|
* `digit` matches one cooked decimal digit (0-9)
|
|
* `letter` matches one cooked letter (A-Z)
|
|
* `"..."_tok` match the content of the string, skipping spaces before and
|
|
after. Internal spaces are optional matches. The `_tok` suffix is
|
|
optional when the parser appears before the combinator `>>` or after
|
|
the combinator `/`.
|
|
* `"..."_sptok` is a string match in which the spaces are required in
|
|
free form source.
|
|
* `"..."_id` is a string match for a complete identifier (not a prefix of
|
|
a longer identifier or keyword).
|
|
* `parenthesized(p)` is shorthand for `"(" >> p / ")"`.
|
|
* `bracketed(p)` is shorthand for `"[" >> p / "]"`.
|
|
* `nonEmptyList(p)` matches a comma-separated list of one or more
|
|
instances of p.
|
|
* `nonEmptyList(errorMessage, p)` is equivalent to
|
|
`withMessage(errorMessage, nonemptyList(p))`, which allows one to supply
|
|
a meaningful error message in the event of an empty list.
|
|
* `optionalList(p)` is the same thing, but can be empty, and always succeeds.
|
|
|
|
### Debugging Parser
|
|
Last, a string literal `"..."_debug` denotes a parser that emits the string to
|
|
`llvm::errs` and succeeds. It is useful for tracing while debugging a parser but should
|
|
obviously not be committed for production code.
|