forked from OSchip/llvm-project
219 lines
10 KiB
Markdown
219 lines
10 KiB
Markdown
<!--===- docs/Parsing.md
|
|
|
|
Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
|
|
See https://llvm.org/LICENSE.txt for license information.
|
|
SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
|
|
|
|
-->
|
|
|
|
# The F18 Parser
|
|
|
|
```eval_rst
|
|
.. contents::
|
|
:local:
|
|
```
|
|
|
|
This program source code implements a parser for the Fortran programming
|
|
language.
|
|
|
|
The draft ISO standard for Fortran 2018 dated July 2017 was used as the
|
|
primary definition of the language. The parser also accepts many features
|
|
from previous versions of the standard that are no longer part of the Fortran
|
|
2018 language.
|
|
|
|
It also accepts many features that have never been part of any version
|
|
of the standard Fortran language but have been supported by previous
|
|
implementations and are known or suspected to remain in use. As a
|
|
general principle, we want to recognize and implement any such feature
|
|
so long as it does not conflict with requirements of the current standard
|
|
for Fortran.
|
|
|
|
The parser is implemented in standard ISO C++ and requires the 2017
|
|
edition of the language and library. The parser constitutes a reentrant
|
|
library with no mutable or constructed static data. Best modern C++
|
|
programming practices are observed to ensure that the ownership of
|
|
dynamic memory is clear, that value rather than object semantics are
|
|
defined for the data structures, that most functions are free from
|
|
invisible side effects, and that the strictest available type checking
|
|
is enforced by the C++ compiler when the Fortran parser is built.
|
|
Class inheritance is rare and dynamic polymorphism is avoided in favor
|
|
of modern discriminated unions. To the furthest reasonable extent, the
|
|
parser has been implemented in a declarative fashion that corresponds
|
|
closely to the text of the Fortran language standard.
|
|
|
|
The several major modules of the Fortran parser are composed into a
|
|
top-level Parsing class, by means of which one may drive the parsing of a
|
|
source file and receive its parse tree and error messages. The interfaces
|
|
of the Parsing class correspond to the two major passes of the parser,
|
|
which are described below.
|
|
|
|
## Prescanning and Preprocessing
|
|
|
|
The first pass is performed by an instance of the Prescanner class,
|
|
with help from an instance of Preprocessor.
|
|
|
|
The prescanner generates the "cooked character stream", implemented
|
|
by a CookedSource class instance, in which:
|
|
* line ends have been normalized to single ASCII LF characters (UNIX newlines)
|
|
* all `INCLUDE` files have been expanded
|
|
* all continued Fortran source lines have been unified
|
|
* all comments and insignificant spaces have been removed
|
|
* fixed form right margins have been clipped
|
|
* extra blank card columns have been inserted into character literals
|
|
and Hollerith constants
|
|
* preprocessing directives have been implemented
|
|
* preprocessing macro invocations have been expanded
|
|
* legacy `D` lines in fixed form source have been omitted or included
|
|
* except for the payload in character literals, Hollerith constants,
|
|
and character and Hollerith edit descriptors, all letters have been
|
|
normalized to lower case
|
|
* all original non-ASCII characters in Hollerith constants have been
|
|
decoded and re-encoded into UTF-8
|
|
|
|
Lines in the cooked character stream can be of arbitrary length.
|
|
|
|
The purpose of the cooked character stream is to enable the implementation
|
|
of a parser whose sole concern is the recognition of the Fortran language
|
|
from productions that closely correspond to the grammar that is presented
|
|
in the Fortran standard, without having to deal with the complexity of
|
|
all of the source-level concerns in the preceding list.
|
|
|
|
The implementation of the preprocessor interacts with the prescanner by
|
|
means of _token sequences_. These are partitionings of input lines into
|
|
contiguous virtual blocks of characters, and are the only place in this
|
|
Fortran compiler in which we have reified a tokenization of the program
|
|
source; the parser proper does not have a tokenizer. The prescanner
|
|
builds these token sequences out of source lines and supplies them
|
|
to the preprocessor, which interprets directives and expands macro
|
|
invocations. The token sequences returned by the preprocessor are then
|
|
marshaled to constitute the cooked character stream that is the output of
|
|
the prescanner.
|
|
|
|
The preprocessor and prescanner can both instantiate new temporary
|
|
instances of the Prescanner class to locate, open, and process any
|
|
include files.
|
|
|
|
The tight interaction and mutual design of the prescanner and preprocessor
|
|
enable a principled implementation of preprocessing for the Fortran
|
|
language that implements a reasonable facsimile of the C language
|
|
preprocessor that is fully aware of Fortran's source forms, line
|
|
continuation mechanisms, case insensitivity, token syntax, &c.
|
|
|
|
The preprocessor always runs. There's no good reason for it not to.
|
|
|
|
The content of the cooked character stream is available and useful
|
|
for debugging, being as it is a simple value forwarded from the first major
|
|
pass of the compiler to the second.
|
|
|
|
## Source Provenance
|
|
|
|
The prescanner constructs a chronicle of every file that is read by the
|
|
parser, viz. the original source file and all others that it directly
|
|
or indirectly includes. One copy of the content of each of these files
|
|
is mapped or read into the address space of the parser. Memory mapping
|
|
is used initially, but files with DOS line breaks or a missing terminal
|
|
newline are immediately normalized in a buffer when necessary.
|
|
|
|
The virtual input stream, which marshals every appearance of every file
|
|
and every expansion of every macro invocation, is not materialized as
|
|
an actual stream of bytes. There is, however, a mapping from each byte
|
|
position in this virtual input stream back to whence it came (maintained
|
|
by an instance of the AllSources class). Offsets into this virtual input
|
|
stream constitute values of the Provenance class. Provenance values,
|
|
and contiguous ranges thereof, are used to describe and delimit source
|
|
positions for messaging.
|
|
|
|
Further, every byte in the cooked character stream supplied by the
|
|
prescanner to the parser can be inexpensively mapped to its provenance.
|
|
Simple `const char *` pointers to characters in the cooked character
|
|
stream, or to contiguous ranges thereof, are used as source position
|
|
indicators within the parser and in the parse tree.
|
|
|
|
## Messages
|
|
|
|
Message texts, and snprintf-like formatting strings for constructing
|
|
messages, are instantiated in the various components of the parser with
|
|
C++ user defined character literals tagged with `_err_en_US` and `_en_US`
|
|
(signifying fatality and language, with the default being the dialect of
|
|
English used in the United States) so that they may be easily identified
|
|
for localization. As described above, messages are associated with
|
|
source code positions by means of provenance values.
|
|
|
|
## The Parse Tree
|
|
|
|
Each of the ca. 450 numbered requirement productions in the standard
|
|
Fortran language grammar, as well as the productions implied by legacy
|
|
extensions and preserved obsolescent features, maps to a distinct class
|
|
in the parse tree so as to maximize the efficacy of static type checking
|
|
by the C++ compiler.
|
|
|
|
A transcription of the Fortran grammar appears with production requirement
|
|
numbers in the commentary before these class definitions, so that one
|
|
may easily refer to the standard (or to the parse tree definitions while
|
|
reading that document).
|
|
|
|
Three paradigms collectively implement most of the parse tree classes:
|
|
* *wrappers*, in which a single data member `v` has been encapsulated
|
|
in a new type
|
|
* *tuples* (or product types), in which several values of arbitrary
|
|
types have been encapsulated in a single data member `t` whose type
|
|
is an instance of `std::tuple<>`
|
|
* *discriminated unions* (or sum types), in which one value whose type is
|
|
a dynamic selection from a set of distinct types is saved in a data
|
|
member `u` whose type is an instance of `std::variant<>`
|
|
|
|
The use of these patterns is a design convenience, and exceptions to them
|
|
are not uncommon wherever it made better sense to write custom definitions.
|
|
|
|
Parse tree entities should be viewed as values, not objects; their
|
|
addresses should not be abused for purposes of identification. They are
|
|
assembled with C++ move semantics during parse tree construction.
|
|
Their default and copy constructors are deliberately deleted in their
|
|
declarations.
|
|
|
|
The std::list<> data type is used in the parse tree to reliably store pointers
|
|
to other relevant entries in the tree. Since the tree lists are moved and
|
|
spliced at certain points std::list<> provides the necessary guarantee of the
|
|
stability of pointers into these lists.
|
|
|
|
There is a general purpose library by means of which parse trees may
|
|
be traversed.
|
|
|
|
## Parsing
|
|
|
|
This compiler attempts to recognize the entire cooked character stream
|
|
(see above) as a Fortran program. It records the reductions made during
|
|
a successful recognition as a parse tree value. The recognized grammar
|
|
is that of a whole source file, not just of its possible statements,
|
|
so the parser has no global state that tracks the subprogram hierarchy
|
|
or the structure of their nested block constructs. The parser performs
|
|
no semantic analysis along the way, deferring all of that work to the
|
|
next pass of the compiler.
|
|
|
|
The resulting parse tree therefore necessarily contains ambiguous parses
|
|
that cannot be resolved without recourse to a symbol table. Most notably,
|
|
leading assignments to array elements can be misrecognized as statement
|
|
function definitions, and array element references can be misrecognized
|
|
as function calls. The semantic analysis phase of the compiler performs
|
|
local rewrites of the parse tree once it can be disambiguated by symbols
|
|
and types.
|
|
|
|
Formally speaking, this parser is based on recursive descent with
|
|
localized backtracking (specifically, it will not backtrack into a
|
|
successful reduction to try its other alternatives). It is not generated
|
|
as a table or code from a specification of the Fortran grammar; rather, it
|
|
_is_ the grammar, as declaratively respecified in C++ constant expressions
|
|
using a small collection of basic token recognition objects and a library
|
|
of "parser combinator" template functions that compose them to form more
|
|
complicated recognizers and their correspondences to the construction
|
|
of parse tree values.
|
|
|
|
## Unparsing
|
|
|
|
Parse trees can be converted back into free form Fortran source code.
|
|
This formatter is not really a classical "pretty printer", but is
|
|
more of a data structure dump whose output is suitable for compilation
|
|
by another compiler. It is also used for testing the parser, since a
|
|
reparse of an unparsed parse tree should be an identity function apart from
|
|
source provenance.
|