llvm-project/flang/docs/Parsing.md

10 KiB

The F18 Parser

This program source code implements a parser for the Fortran programming language.

The draft ISO standard for Fortran 2018 dated July 2017 was used as the primary definition of the language. The parser also accepts many features from previous versions of the standard that are no longer part of the Fortran 2018 language.

It also accepts many features that have never been part of any version of the standard Fortran language but have been supported by previous implementations and are known or suspected to remain in use. As a general principle, we want to recognize and implement any such feature so long as it does not conflict with requirements of the current standard for Fortran.

The parser is implemented in standard ISO C++ and requires the 2017 edition of the language and library. The parser constitutes a reentrant library with no mutable or constructed static data. Best modern C++ programming practices are observed to ensure that the ownership of dynamic memory is clear, that value rather than object semantics are defined for the data structures, that most functions are free from invisible side effects, and that the strictest available type checking is enforced by the C++ compiler when the Fortran parser is built. Class inheritance is rare and dynamic polymorphism is avoided in favor of modern discriminated unions. To the furthest reasonable extent, the parser has been implemented in a declarative fashion that corresponds closely to the text of the Fortran language standard.

The several major modules of the Fortran parser are composed into a top-level Parsing class, by means of which one may drive the parsing of a source file and receive its parse tree and error messages. The interfaces of the Parsing class correspond to the two major passes of the parser, which are described below.

Prescanning and Preprocessing

The first pass is performed by an instance of the Prescanner class, with help from an instance of Preprocessor.

The prescanner generates the "cooked character stream", implemented by a CookedSource class instance, in which:

  • line ends have been normalized to single ASCII LF characters (UNIX newlines)
  • all INCLUDE files have been expanded
  • all continued Fortran source lines have been unified
  • all comments and insignificant spaces have been removed
  • fixed form right margins have been clipped
  • extra blank card columns have been inserted into character literals and Hollerith constants
  • preprocessing directives have been implemented
  • preprocessing macro invocations have been expanded
  • legacy D lines in fixed form source have been omitted or included
  • except for the payload in character literals, Hollerith constants, and character and Hollerith edit descriptors, all letters have been normalized to lower case
  • all original non-ASCII characters in Hollerith constants have been decoded and re-encoded into UTF-8

Lines in the cooked character stream can be of arbitrary length.

The purpose of the cooked character stream is to enable the implementation of a parser whose sole concern is the recognition of the Fortran language from productions that closely correspond to the grammar that is presented in the Fortran standard, without having to deal with the complexity of all of the source-level concerns in the preceding list.

The implementation of the preprocessor interacts with the prescanner by means of token sequences. These are partitionings of input lines into contiguous virtual blocks of characters, and are the only place in this Fortran compiler in which we have reified a tokenization of the program source; the parser proper does not have a tokenizer. The prescanner builds these token sequences out of source lines and supplies them to the preprocessor, which interprets directives and expands macro invocations. The token sequences returned by the preprocessor are then marshaled to constitute the cooked character stream that is the output of the prescanner.

The preprocessor and prescanner can both instantiate new temporary instances of the Prescanner class to locate, open, and process any include files.

The tight interaction and mutual design of the prescanner and preprocessor enable a principled implementation of preprocessing for the Fortran language that implements a reasonable facsimile of the C language preprocessor that is fully aware of Fortran's source forms, line continuation mechanisms, case insensitivity, token syntax, &c.

The preprocessor always runs. There's no good reason for it not to.

The content of the cooked character stream is available and useful for debugging, being as it is a simple value forwarded from the first major pass of the compiler to the second.

Source Provenance

The prescanner constructs a chronicle of every file that is read by the parser, viz. the original source file and all others that it directly or indirectly includes. One copy of the content of each of these files is mapped or read into the address space of the parser. Memory mapping is used initially, but files with DOS line breaks or a missing terminal newline are immediately normalized in a buffer when necessary.

The virtual input stream, which marshals every appearance of every file and every expansion of every macro invocation, is not materialized as an actual stream of bytes. There is, however, a mapping from each byte position in this virtual input stream back to whence it came (maintained by an instance of the AllSources class). Offsets into this virtual input stream constitute values of the Provenance class. Provenance values, and contiguous ranges thereof, are used to describe and delimit source positions for messaging.

Further, every byte in the cooked character stream supplied by the prescanner to the parser can be inexpensively mapped to its provenance. Simple const char * pointers to characters in the cooked character stream, or to contiguous ranges thereof, are used as source position indicators within the parser and in the parse tree.

Messages

Message texts, and snprintf-like formatting strings for constructing messages, are instantiated in the various components of the parser with C++ user defined character literals tagged with _err_en_US and _en_US (signifying fatality and language, with the default being the dialect of English used in the United States) so that they may be easily identified for localization. As described above, messages are associated with source code positions by means of provenance values.

The Parse Tree

Each of the ca. 450 numbered requirement productions in the standard Fortran language grammar, as well as the productions implied by legacy extensions and preserved obsolescent features, maps to a distinct class in the parse tree so as to maximize the efficacy of static type checking by the C++ compiler.

A transcription of the Fortran grammar appears with production requirement numbers in the commentary before these class definitions, so that one may easily refer to the standard (or to the parse tree definitions while reading that document).

Three paradigms collectively implement most of the parse tree classes:

  • wrappers, in which a single data member v has been encapsulated in a new type
  • tuples (or product types), in which several values of arbitrary types have been encapsulated in a single data member t whose type is an instance of std::tuple<>
  • discriminated unions (or sum types), in which one value whose type is a dynamic selection from a set of distinct types is saved in a data member u whose type is an instance of std::variant<>

The use of these patterns is a design convenience, and exceptions to them are not uncommon wherever it made better sense to write custom definitions.

Parse tree entities should be viewed as values, not objects; their addresses should not be abused for purposes of identification. They are assembled with C++ move semantics during parse tree construction. Their default and copy constructors are deliberately deleted in their declarations.

The std::list<> data type is used in the parse tree to reliably store pointers to other relevant entries in the tree. Since the tree lists are moved and spliced at certain points std::list<> provides the necessary guarantee of the stability of pointers into these lists.

There is a general purpose library by means of which parse trees may be traversed.

Parsing

This compiler attempts to recognize the entire cooked character stream (see above) as a Fortran program. It records the reductions made during a successful recognition as a parse tree value. The recognized grammar is that of a whole source file, not just of its possible statements, so the parser has no global state that tracks the subprogram hierarchy or the structure of their nested block constructs. The parser performs no semantic analysis along the way, deferring all of that work to the next pass of the compiler.

The resulting parse tree therefore necessarily contains ambiguous parses that cannot be resolved without recourse to a symbol table. Most notably, leading assignments to array elements can be misrecognized as statement function definitions, and array element references can be misrecognized as function calls. The semantic analysis phase of the compiler performs local rewrites of the parse tree once it can be disambiguated by symbols and types.

Formally speaking, this parser is based on recursive descent with localized backtracking (specifically, it will not backtrack into a successful reduction to try its other alternatives). It is not generated as a table or code from a specification of the Fortran grammar; rather, it is the grammar, as declaratively respecified in C++ constant expressions using a small collection of basic token recognition objects and a library of "parser combinator" template functions that compose them to form more complicated recognizers and their correspondences to the construction of parse tree values.

Unparsing

Parse trees can be converted back into free form Fortran source code. This formatter is not really a classical "pretty printer", but is more of a data structure dump whose output is suitable for compilation by another compiler. It is also used for testing the parser, since a reparse of an unparsed parse tree should be an identity function apart from source provenance.