==========================
Clang Transformer Tutorial
==========================

A tutorial on how to write a source-to-source translation tool using Clang Transformer.

.. contents::
   :local:

What is Clang Transformer?
--------------------------
Clang Transformer is a framework for writing C++ diagnostics and program
transformations. It is built on the clang toolchain and the LibTooling library,
but aims to hide much of the complexity of clang's native, low-level libraries.
The core abstraction of Transformer is the *rewrite rule*, which specifies how
to change a given program pattern into a new form. Here are some examples of
tasks you can achieve with Transformer:

* warn against using the name ``MkX`` for a declared function,
* change ``MkX`` to ``MakeX``, where ``MkX`` is the name of a declared function,
* change ``s.size()`` to ``Size(s)``, where ``s`` is a ``string``,
* collapse ``e.child().m()`` to ``e.m()``, for any expression ``e`` and method named
  ``m``.

All of the examples have a common form: they identify a pattern that is the
target of the transformation, they specify an *edit* to the code identified by
the pattern, and their pattern and edit refer to common variables, like ``s``,
``e``, and ``m``, that range over code fragments. Our second and third examples also
specify constraints on the pattern that aren't apparent from the syntax alone,
like "``s`` is a ``string``." Even the first example ("warn ...") shares this form,
even though it doesn't change any of the code -- its "edit" is simply a no-op.
Transformer helps users succinctly specify rules of this sort and easily execute
them locally over a collection of files, apply them to selected portions of
a codebase, or even bundle them as a clang-tidy check for ongoing application.
Who is Clang Transformer for?
-----------------------------
Clang Transformer is for developers who want to write clang-tidy checks or write
tools to modify a large number of C++ files in (roughly) the same way. What
qualifies as "large" really depends on the nature of the change and your
patience for repetitive editing. In our experience, automated solutions become
worthwhile somewhere between 100 and 500 files.
Getting Started
---------------
Patterns in Transformer are expressed with :doc:`clang's AST matchers <LibASTMatchers>`.
Matchers are a language of combinators for describing portions of a clang
Abstract Syntax Tree (AST). Since clang's AST includes complete type information
(within the limits of a single `Translation Unit (TU)`_),
these patterns can even encode rich constraints on the type properties of AST
nodes.

.. _`Translation Unit (TU)`: https://en.wikipedia.org/wiki/Translation_unit_\(programming\)

We assume a familiarity with the clang AST and the corresponding AST matchers
for the purpose of this tutorial. Users who are unfamiliar with either are
encouraged to start with the recommended references in `Related Reading`_.
Example: style-checking names
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Assume you have a style-guide rule which forbids functions from being named
"MkX" and you want to write a check that catches any violations of this rule. We
can express this as a Transformer rewrite rule:

.. code-block:: c++

    makeRule(functionDecl(hasName("MkX")).bind("fun"),
             noopEdit(node("fun")),
             cat("The name ``MkX`` is not allowed for functions; please rename"));

``makeRule`` is our go-to function for generating rewrite rules. It takes three
arguments: the pattern, the edit, and (optionally) an explanatory note. In our
example, the pattern (``functionDecl(...)``) identifies the declaration of the
function ``MkX``. Since we're just diagnosing the problem, but not suggesting a
fix, our edit is a no-op. But, it contains an *anchor* for the diagnostic
message: ``node("fun")`` says to associate the message with the source range of
the AST node bound to "fun"; in this case, the ill-named function declaration.
Finally, we use ``cat`` to build a message that explains the change. Regarding the
name ``cat`` -- we'll discuss it in more detail below, but suffice it to say that
it can also take multiple arguments and concatenate their results.
Note that the result of ``makeRule`` is a value of type
``clang::transformer::RewriteRule``, but most users don't need to care about the
details of this type.
Example: renaming a function
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Now, let's extend this example to a *transformation*; specifically, the second
example above:

.. code-block:: c++

    makeRule(declRefExpr(to(functionDecl(hasName("MkX")))),
             changeTo(cat("MakeX")),
             cat("MkX has been renamed MakeX"));

In this example, the pattern (``declRefExpr(...)``) identifies any *reference* to
the function ``MkX``, rather than the declaration itself, as in our previous
example. Our edit (``changeTo(...)``) says to *change* the code matched by the
pattern *to* the text "MakeX". Finally, we use ``cat`` again to build a message
that explains the change.
Here are some example changes that this rule would make:

+--------------------------+----------------------------+
| Original                 | Result                     |
+==========================+============================+
| ``X x = MkX(3);``        | ``X x = MakeX(3);``        |
+--------------------------+----------------------------+
| ``CallFactory(MkX, 3);`` | ``CallFactory(MakeX, 3);`` |
+--------------------------+----------------------------+
| ``auto f = MkX;``        | ``auto f = MakeX;``        |
+--------------------------+----------------------------+

Example: method to function
^^^^^^^^^^^^^^^^^^^^^^^^^^^
Next, let's write a rule to replace a method call with a (free) function call,
applied to the original method call's target object. Specifically, "change
``s.size()`` to ``Size(s)``, where ``s`` is a ``string``." We start with a simpler
change that ignores the type of ``s``. That is, it will modify *any* method call
where the method is named "size":
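
.. code-block:: c++

    // The id is a std::string so that it converts to both the StringRef that
    // bind() takes and the std::string that node() takes.
    std::string s = "str";
    makeRule(
        cxxMemberCallExpr(
            on(expr().bind(s)),
            callee(cxxMethodDecl(hasName("size")))),
        changeTo(cat("Size(", node(s), ")")),
        cat("Method ``size`` is deprecated in favor of free function ``Size``"));
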
We express the pattern with the given AST matcher, which binds the method call's
target to ``s`` [#f1]_. For the edit, we again use ``changeTo``, but this
time we construct the term from multiple parts, which we compose with ``cat``. The
second part of our term is ``node(s)``, which selects the source code
corresponding to the AST node ``s`` that was bound when a match was found in the
AST for our rule's pattern. ``node(s)`` constructs a ``RangeSelector``, which, when
used in ``cat``, indicates that the selected source should be inserted in the
output at that point.
Now, we probably don't want to rewrite *all* invocations of "size" methods, just
those on ``std::string``\ s. We can achieve this change simply by refining our
matcher. The rest of the rule remains unchanged:

.. code-block:: c++

    std::string s = "str";
    makeRule(
        cxxMemberCallExpr(
            on(expr(hasType(namedDecl(hasName("std::string"))))
                   .bind(s)),
            callee(cxxMethodDecl(hasName("size")))),
        changeTo(cat("Size(", node(s), ")")),
        cat("Method ``size`` is deprecated in favor of free function ``Size``"));

Example: rewriting method calls
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
In this example, we delete an "intermediary" method call in a string of
invocations. This scenario can arise, for example, if you want to collapse a
substructure into its parent.

.. code-block:: c++

    std::string e = "expr", m = "member";
    auto child_call = cxxMemberCallExpr(on(expr().bind(e)),
                                        callee(cxxMethodDecl(hasName("child"))));
    makeRule(cxxMemberCallExpr(on(child_call), callee(memberExpr().bind(m))),
             // node(e) copies the source of the bound object; a bare id would
             // be emitted as literal text.
             changeTo(cat(node(e), ".", member(m), "()")),
             cat("``child`` accessor is being removed; call ",
                 member(m), " directly on parent"));

This rule isn't quite what we want: it will rewrite ``my_object.child().foo()`` to
``my_object.foo()``, but it will also rewrite ``my_ptr->child().foo()`` to
``my_ptr.foo()``, which is not what we intend. We could fix this by restricting
the pattern with ``unless(isArrow())`` in the definition of ``child_call``. Yet, we
*want* to rewrite calls through pointers.
To capture this idiom, we provide the ``access`` combinator to intelligently
construct a field/method access. In our example, the member access is expressed
as:

.. code-block:: c++

    access(e, cat(member(m)))

The first argument specifies the object being accessed and the second, a
description of the field/method name. In this case, we specify that the method
name should be copied from the source -- specifically, the source range of ``m``'s
member. To construct the method call, we would use this expression in ``cat``:

.. code-block:: c++

    cat(access(e, cat(member(m))), "()")

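
Putting the pieces together, the corrected rule might look like the following
sketch. It reuses the ids and the ``child_call`` matcher from the example above;
the function name ``collapseChildAccessor`` and the ``auto`` return type are
choices made for this illustration only, not part of Transformer.

.. code-block:: c++

    #include "clang/ASTMatchers/ASTMatchers.h"
    #include "clang/Tooling/Transformer/RangeSelector.h"
    #include "clang/Tooling/Transformer/RewriteRule.h"
    #include "clang/Tooling/Transformer/Stencil.h"

    using namespace clang::ast_matchers;
    using namespace clang::transformer;

    // Hypothetical helper that packages the rule for reuse.
    auto collapseChildAccessor() {
      auto child_call = cxxMemberCallExpr(on(expr().bind("expr")),
                                          callee(cxxMethodDecl(hasName("child"))));
      return makeRule(
          cxxMemberCallExpr(on(child_call), callee(memberExpr().bind("member"))),
          // access() picks "." or "->" based on the type of the bound object.
          changeTo(cat(access("expr", cat(member("member"))), "()")),
          cat("``child`` accessor is being removed; call ", member("member"),
              " directly on parent"));
    }

With this version, ``my_ptr->child().foo()`` should be rewritten to
``my_ptr->foo()`` rather than ``my_ptr.foo()``.
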
Reference: ranges, stencils, edits, rules
-----------------------------------------
The above examples demonstrate just the basics of rewrite rules. Every element
we touched on has more available constructors: range selectors, stencils, edits
and rules. In this section, we'll briefly review each in turn, with references
to the source headers for up-to-date information. First, though, we clarify what
rewrite rules are actually rewriting.
Rewriting ASTs to... Text?
^^^^^^^^^^^^^^^^^^^^^^^^^^
The astute reader may have noticed that we've been somewhat vague in our
explanation of what the rewrite rules are actually rewriting. We've referred to
"code", but code can be represented both as raw source text and as an abstract
syntax tree. So, which one is it?
Ideally, we'd be rewriting the input AST to a new AST, but clang's AST is not
terribly amenable to this kind of transformation. So, we compromise: we express
our patterns and the names that they bind in terms of the AST, but our changes
in terms of source code text. We've designed Transformer's language to bridge
the gap between the two representations, in an attempt to minimize the user's
need to reason about source code locations and other, low-level syntactic
details.
Range Selectors
^^^^^^^^^^^^^^^
Transformer provides a small API for describing source ranges: the
``RangeSelector`` combinators. These ranges are most commonly used to specify the
source code affected by an edit and to extract source code in constructing new
text.
Roughly, there are two kinds of range combinators: ones that select a source
range based on the AST, and others that combine existing ranges into new ranges.
For example, ``node`` selects the range of source spanned by a particular AST
node, as we've seen, while ``after`` selects the (empty) range located immediately
after its argument range. So, ``after(node("id"))`` is the empty range immediately
following the AST node bound to ``id``.
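
As a small illustration of combining selectors, the sketch below replaces the
empty range after a call with a comment, which amounts to an insertion. The
function ``LegacyInit``, the id ``"call"`` and the helper name are hypothetical
and exist only for this example.

.. code-block:: c++

    #include "clang/ASTMatchers/ASTMatchers.h"
    #include "clang/Tooling/Transformer/RangeSelector.h"
    #include "clang/Tooling/Transformer/RewriteRule.h"
    #include "clang/Tooling/Transformer/Stencil.h"

    using namespace clang::ast_matchers;
    using namespace clang::transformer;

    // Hypothetical rule: tag every call to LegacyInit with a TODO comment.
    auto markLegacyCalls() {
      return makeRule(
          callExpr(callee(functionDecl(hasName("LegacyInit")))).bind("call"),
          // node("call") spans the call expression; after(...) is the empty
          // range immediately following it.
          changeTo(after(node("call")), cat(" /* TODO: migrate */")),
          cat("call to LegacyInit should be migrated"));
    }
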
For the full collection of ``RangeSelector``\ s, see the header,
`clang/Tooling/Transformer/RangeSelector.h <https://github.com/llvm/llvm-project/blob/main/clang/include/clang/Tooling/Transformer/RangeSelector.h>`_.
Stencils
^^^^^^^^
Transformer offers a large and growing collection of combinators for
constructing output. Above, we demonstrated ``cat``, the core function for
constructing stencils. It takes a series of arguments, of three possible kinds:

#. Raw text, to be copied directly to the output.
#. Selector: specified with a ``RangeSelector``, indicates a range of source text
   to copy to the output.
#. Builder: an operation that constructs a code snippet from its arguments. For
   example, the ``access`` function we saw above.

Data of these different types are all represented (generically) by a ``Stencil``.
``cat`` takes text and ``RangeSelector``\ s directly as arguments, rather than
requiring that they be constructed with a builder; other builders are
constructed explicitly.
In general, ``Stencil``\ s produce text from a match result. So, they are not
limited to generating source code, but can also be used to generate diagnostic
messages that reference (named) elements of the matched code, like we saw in the
example of rewriting method calls.
Further details of the ``Stencil`` type are documented in the header file
`clang/Tooling/Transformer/Stencil.h <https://github.com/llvm/llvm-project/blob/main/clang/include/clang/Tooling/Transformer/Stencil.h>`_.
Edits
^^^^^
Transformer supports additional forms of edits. First, in a ``changeTo``, we can
specify the particular portion of code to be replaced, using the same
``RangeSelector`` we saw earlier. For example, we could change the function name
in a function declaration with:

.. code-block:: c++

    std::string f = "fun";
    makeRule(functionDecl(hasName("bad")).bind(f),
             changeTo(name(f), cat("good")),
             cat("bad is now good"));

We also provide simpler editing primitives for insertion and deletion:
``insertBefore``, ``insertAfter`` and ``remove``. These can all be found in the header
file
`clang/Tooling/Transformer/RewriteRule.h <https://github.com/llvm/llvm-project/blob/main/clang/include/clang/Tooling/Transformer/RewriteRule.h>`_.
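
For instance, the two sketches below use ``insertBefore`` and ``remove``. The
function ``DebugDump``, the ids and the helper names are hypothetical. The second
rule assumes the call stands alone as a statement inside a block, which the
``hasParent(compoundStmt())`` constraint is meant to ensure.

.. code-block:: c++

    #include "clang/ASTMatchers/ASTMatchers.h"
    #include "clang/Tooling/Transformer/RangeSelector.h"
    #include "clang/Tooling/Transformer/RewriteRule.h"
    #include "clang/Tooling/Transformer/Stencil.h"

    using namespace clang::ast_matchers;
    using namespace clang::transformer;

    // Insert an attribute in front of each declaration of MkX.
    auto deprecateMkX() {
      return makeRule(functionDecl(hasName("MkX")).bind("fun"),
                      insertBefore(node("fun"), cat("[[deprecated]] ")),
                      cat("MkX is deprecated; prefer MakeX"));
    }

    // Delete stand-alone calls to DebugDump, including the trailing semicolon.
    auto removeDebugDump() {
      return makeRule(
          callExpr(callee(functionDecl(hasName("DebugDump"))),
                   hasParent(compoundStmt()))
              .bind("call"),
          remove(statement("call")),
          cat("DebugDump calls are no longer needed"));
    }
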
We are not limited to one edit per match found. Some situations require making
multiple edits for each match. For example, suppose we wanted to swap two
arguments of a function call.
For this, we provide an overload of ``makeRule`` that takes a list of edits,
rather than just a single one. Our example might look like:

.. code-block:: c++

    makeRule(callExpr(...),
             {changeTo(node(arg0), cat(node(arg2))),
              changeTo(node(arg2), cat(node(arg0)))},
             cat("swap the first and third arguments of the call"));

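
A fuller version of this rule might look like the sketch below. The elided
matcher is filled in here with one possible pattern, purely as an assumption for
illustration: calls to a hypothetical function ``f`` whose first and third
arguments are bound with ``hasArgument``.

.. code-block:: c++

    #include "clang/ASTMatchers/ASTMatchers.h"
    #include "clang/Tooling/Transformer/RangeSelector.h"
    #include "clang/Tooling/Transformer/RewriteRule.h"
    #include "clang/Tooling/Transformer/Stencil.h"
    #include <string>

    using namespace clang::ast_matchers;
    using namespace clang::transformer;

    // Hypothetical helper; "f" stands in for whatever function is targeted.
    auto swapFirstAndThirdArguments() {
      const std::string arg0 = "arg0", arg2 = "arg2";
      return makeRule(
          callExpr(callee(functionDecl(hasName("f"))),
                   hasArgument(0, expr().bind(arg0)),
                   hasArgument(2, expr().bind(arg2))),
          // Each edit replaces one argument with the source of the other.
          {changeTo(node(arg0), cat(node(arg2))),
           changeTo(node(arg2), cat(node(arg0)))},
          cat("swap the first and third arguments of the call"));
    }
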
``EditGenerator``\ s (Advanced)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The particular edits we've seen so far are all instances of the ``ASTEdit`` class,
or a list of such. But, not all edits can be expressed as ``ASTEdit``\ s. So, we
also support a very general signature for edit generators:

.. code-block:: c++

    using EditGenerator = MatchConsumer<llvm::SmallVector<Edit, 1>>;

That is, an ``EditGenerator`` is a function that maps a ``MatchResult`` to a set
of edits, or fails. This signature supports a very general form of computation
over match results. Transformer provides a number of functions for working with
``EditGenerator``\ s, most notably
`flatten <https://github.com/llvm/llvm-project/blob/1fabe6e51917bcd7a1242294069c682fe6dffa45/clang/include/clang/Tooling/Transformer/RewriteRule.h#L165-L167>`_,
which combines the results of several ``EditGenerator``\ s into one, like list
flattening. For the full list, see the header file
`clang/Tooling/Transformer/RewriteRule.h <https://github.com/llvm/llvm-project/blob/main/clang/include/clang/Tooling/Transformer/RewriteRule.h>`_.
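
As a rough sketch of how ``flatten`` can be used, the example below builds a
single ``EditGenerator`` from two edits and passes it to ``makeRule``. The helper
name, the id ``"fun"`` and the inserted comment are hypothetical; check the exact
signatures against the header for your LLVM version.

.. code-block:: c++

    #include "clang/ASTMatchers/ASTMatchers.h"
    #include "clang/Tooling/Transformer/RangeSelector.h"
    #include "clang/Tooling/Transformer/RewriteRule.h"
    #include "clang/Tooling/Transformer/Stencil.h"

    using namespace clang::ast_matchers;
    using namespace clang::transformer;

    // Hypothetical rule: rename the declaration of MkX and leave a marker.
    auto renameAndAnnotate() {
      // flatten() concatenates the edits produced by each of its arguments.
      EditGenerator edits =
          flatten(changeTo(name("fun"), cat("MakeX")),
                  insertBefore(node("fun"), cat("/* renamed from MkX */ ")));
      return makeRule(functionDecl(hasName("MkX")).bind("fun"), edits,
                      cat("MkX has been renamed MakeX"));
    }
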
Rules
^^^^^
We can also compose multiple *rules*, rather than just edits within a rule,
using ``applyFirst``: it composes a list of rules as an ordered choice, where
Transformer applies the first rule whose pattern matches, ignoring others in the
list that follow. If the matchers are independent then order doesn't matter. In
that case, ``applyFirst`` is simply joining the set of rules into one.
The benefit of ``applyFirst`` is that, for some problems, it allows the user to
more concisely formulate later rules in the list, since their patterns need not
explicitly exclude the earlier patterns of the list. For example, consider a set
of rules that rewrite compound statements, where one rule handles the case of an
empty compound statement and the other handles non-empty compound statements.
With ``applyFirst``, these rules can be expressed compactly as:

.. code-block:: c++

    applyFirst({
      makeRule(compoundStmt(statementCountIs(0)).bind("empty"), ...),
      makeRule(compoundStmt().bind("non-empty"), ...)
    })

The second rule does not need to explicitly specify that the compound statement
is non-empty -- it follows from the rule's position in ``applyFirst``. For more
complicated examples, this can lead to substantially more readable code.
Sometimes, a modification to the code might require the inclusion of a
particular header file. To this end, users can modify rules to specify include
directives with ``addInclude``.
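
For example, the ``size`` rule from earlier could request the header that
declares ``Size``, as in the sketch below. The header path ``path/to/size.h``
and the helper name are placeholders invented for this illustration.

.. code-block:: c++

    #include "clang/ASTMatchers/ASTMatchers.h"
    #include "clang/Tooling/Transformer/RangeSelector.h"
    #include "clang/Tooling/Transformer/RewriteRule.h"
    #include "clang/Tooling/Transformer/Stencil.h"

    using namespace clang::ast_matchers;
    using namespace clang::transformer;

    // Hypothetical helper that builds the rule and attaches the include.
    auto sizeMethodToFunction() {
      auto Rule = makeRule(
          cxxMemberCallExpr(
              on(expr(hasType(namedDecl(hasName("std::string")))).bind("str")),
              callee(cxxMethodDecl(hasName("size")))),
          changeTo(cat("Size(", node("str"), ")")),
          cat("Method ``size`` is deprecated in favor of free function ``Size``"));
      // Ask for the (placeholder) header to be included in every file the
      // rule changes; quoted include format is the default.
      addInclude(Rule, "path/to/size.h");
      return Rule;
    }
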
For additional documentation on these functions, see the header file
`clang/Tooling/Transformer/RewriteRule.h <https://github.com/llvm/llvm-project/blob/main/clang/include/clang/Tooling/Transformer/RewriteRule.h>`_.
Using a RewriteRule as a clang-tidy check
-----------------------------------------
Transformer supports executing a rewrite rule as a
`clang-tidy <https://clang.llvm.org/extra/clang-tidy/>`_ check, with the class
``clang::tidy::utils::TransformerClangTidyCheck``. It is designed to require
minimal code in the definition. For example, given a rule
``MyCheckAsRewriteRule``, one can define a tidy check as follows:

.. code-block:: c++

    class MyCheck : public TransformerClangTidyCheck {
     public:
      MyCheck(StringRef Name, ClangTidyContext *Context)
          : TransformerClangTidyCheck(MyCheckAsRewriteRule, Name, Context) {}
    };

``TransformerClangTidyCheck`` implements the virtual ``registerMatchers`` and
``check`` methods based on your rule specification, so you don't need to implement
them yourself. If the rule needs to be configured based on the language options
and/or the clang-tidy configuration, it can be expressed as a function taking
these as parameters and (optionally) returning a ``RewriteRule``. This would be
useful, for example, for our function-renaming rule, which is parameterized by the
original name and the target. For details, see
`clang-tools-extra/clang-tidy/utils/TransformerClangTidyCheck.h <https://github.com/llvm/llvm-project/blob/main/clang-tools-extra/clang-tidy/utils/TransformerClangTidyCheck.h>`_.
Related Reading
---------------
A good place to start understanding the clang AST and its matchers is with the
introductions on clang's site:

* :doc:`Introduction to the Clang AST <IntroductionToTheClangAST>`
* :doc:`Matching the Clang AST <LibASTMatchers>`
* `AST Matcher Reference <https://clang.llvm.org/docs/LibASTMatchersReference.html>`_


.. rubric:: Footnotes

.. [#f1] Technically, it binds it to the string "str", to which our
   variable ``s`` is bound. But, the choice of that id string is
   irrelevant, so we elide the difference.