forked from OSchip/llvm-project
1161 lines
40 KiB
ReStructuredText
1161 lines
40 KiB
ReStructuredText
========================================
|
|
Kaleidoscope: Code generation to LLVM IR
|
|
========================================
|
|
|
|
.. contents::
|
|
:local:
|
|
|
|
Chapter 3 Introduction
|
|
======================
|
|
|
|
Welcome to Chapter 3 of the "`Implementing a language with
|
|
LLVM <index.html>`_" tutorial. This chapter shows you how to transform
|
|
the `Abstract Syntax Tree <LangImpl2.html>`_, built in Chapter 2, into
|
|
LLVM IR. This will teach you a little bit about how LLVM does things, as
|
|
well as demonstrate how easy it is to use. It's much more work to build
|
|
a lexer and parser than it is to generate LLVM IR code. :)
|
|
|
|
**Please note**: the code in this chapter and later require LLVM 2.2 or
|
|
later. LLVM 2.1 and before will not work with it. Also note that you
|
|
need to use a version of this tutorial that matches your LLVM release:
|
|
If you are using an official LLVM release, use the version of the
|
|
documentation included with your release or on the `llvm.org releases
|
|
page <http://llvm.org/releases/>`_.
|
|
|
|
Code Generation Setup
|
|
=====================
|
|
|
|
In order to generate LLVM IR, we want some simple setup to get started.
|
|
First we define virtual code generation (codegen) methods in each AST
|
|
class:
|
|
|
|
.. code-block:: c++
|
|
|
|
/// ExprAST - Base class for all expression nodes.
|
|
class ExprAST {
|
|
public:
|
|
virtual ~ExprAST() {}
|
|
virtual Value *Codegen() = 0;
|
|
};
|
|
|
|
/// NumberExprAST - Expression class for numeric literals like "1.0".
|
|
class NumberExprAST : public ExprAST {
|
|
double Val;
|
|
public:
|
|
NumberExprAST(double val) : Val(val) {}
|
|
virtual Value *Codegen();
|
|
};
|
|
...
|
|
|
|
The Codegen() method says to emit IR for that AST node along with all
|
|
the things it depends on, and they all return an LLVM Value object.
|
|
"Value" is the class used to represent a "`Static Single Assignment
|
|
(SSA) <http://en.wikipedia.org/wiki/Static_single_assignment_form>`_
|
|
register" or "SSA value" in LLVM. The most distinct aspect of SSA values
|
|
is that their value is computed as the related instruction executes, and
|
|
it does not get a new value until (and if) the instruction re-executes.
|
|
In other words, there is no way to "change" an SSA value. For more
|
|
information, please read up on `Static Single
|
|
Assignment <http://en.wikipedia.org/wiki/Static_single_assignment_form>`_
|
|
- the concepts are really quite natural once you grok them.
|
|
|
|
Note that instead of adding virtual methods to the ExprAST class
|
|
hierarchy, it could also make sense to use a `visitor
|
|
pattern <http://en.wikipedia.org/wiki/Visitor_pattern>`_ or some other
|
|
way to model this. Again, this tutorial won't dwell on good software
|
|
engineering practices: for our purposes, adding a virtual method is
|
|
simplest.
|
|
|
|
The second thing we want is an "Error" method like we used for the
|
|
parser, which will be used to report errors found during code generation
|
|
(for example, use of an undeclared parameter):
|
|
|
|
.. code-block:: c++
|
|
|
|
Value *ErrorV(const char *Str) { Error(Str); return 0; }
|
|
|
|
static Module *TheModule;
|
|
static IRBuilder<> Builder(getGlobalContext());
|
|
static std::map<std::string, Value*> NamedValues;
|
|
|
|
The static variables will be used during code generation. ``TheModule``
|
|
is the LLVM construct that contains all of the functions and global
|
|
variables in a chunk of code. In many ways, it is the top-level
|
|
structure that the LLVM IR uses to contain code.
|
|
|
|
The ``Builder`` object is a helper object that makes it easy to generate
|
|
LLVM instructions. Instances of the
|
|
```IRBuilder`` <http://llvm.org/doxygen/IRBuilder_8h-source.html>`_
|
|
class template keep track of the current place to insert instructions
|
|
and has methods to create new instructions.
|
|
|
|
The ``NamedValues`` map keeps track of which values are defined in the
|
|
current scope and what their LLVM representation is. (In other words, it
|
|
is a symbol table for the code). In this form of Kaleidoscope, the only
|
|
things that can be referenced are function parameters. As such, function
|
|
parameters will be in this map when generating code for their function
|
|
body.
|
|
|
|
With these basics in place, we can start talking about how to generate
|
|
code for each expression. Note that this assumes that the ``Builder``
|
|
has been set up to generate code *into* something. For now, we'll assume
|
|
that this has already been done, and we'll just use it to emit code.
|
|
|
|
Expression Code Generation
|
|
==========================
|
|
|
|
Generating LLVM code for expression nodes is very straightforward: less
|
|
than 45 lines of commented code for all four of our expression nodes.
|
|
First we'll do numeric literals:
|
|
|
|
.. code-block:: c++
|
|
|
|
Value *NumberExprAST::Codegen() {
|
|
return ConstantFP::get(getGlobalContext(), APFloat(Val));
|
|
}
|
|
|
|
In the LLVM IR, numeric constants are represented with the
|
|
``ConstantFP`` class, which holds the numeric value in an ``APFloat``
|
|
internally (``APFloat`` has the capability of holding floating point
|
|
constants of Arbitrary Precision). This code basically just creates
|
|
and returns a ``ConstantFP``. Note that in the LLVM IR that constants
|
|
are all uniqued together and shared. For this reason, the API uses the
|
|
"foo::get(...)" idiom instead of "new foo(..)" or "foo::Create(..)".
|
|
|
|
.. code-block:: c++
|
|
|
|
Value *VariableExprAST::Codegen() {
|
|
// Look this variable up in the function.
|
|
Value *V = NamedValues[Name];
|
|
return V ? V : ErrorV("Unknown variable name");
|
|
}
|
|
|
|
References to variables are also quite simple using LLVM. In the simple
|
|
version of Kaleidoscope, we assume that the variable has already been
|
|
emitted somewhere and its value is available. In practice, the only
|
|
values that can be in the ``NamedValues`` map are function arguments.
|
|
This code simply checks to see that the specified name is in the map (if
|
|
not, an unknown variable is being referenced) and returns the value for
|
|
it. In future chapters, we'll add support for `loop induction
|
|
variables <LangImpl5.html#for>`_ in the symbol table, and for `local
|
|
variables <LangImpl7.html#localvars>`_.
|
|
|
|
.. code-block:: c++
|
|
|
|
Value *BinaryExprAST::Codegen() {
|
|
Value *L = LHS->Codegen();
|
|
Value *R = RHS->Codegen();
|
|
if (L == 0 || R == 0) return 0;
|
|
|
|
switch (Op) {
|
|
case '+': return Builder.CreateFAdd(L, R, "addtmp");
|
|
case '-': return Builder.CreateFSub(L, R, "subtmp");
|
|
case '*': return Builder.CreateFMul(L, R, "multmp");
|
|
case '<':
|
|
L = Builder.CreateFCmpULT(L, R, "cmptmp");
|
|
// Convert bool 0/1 to double 0.0 or 1.0
|
|
return Builder.CreateUIToFP(L, Type::getDoubleTy(getGlobalContext()),
|
|
"booltmp");
|
|
default: return ErrorV("invalid binary operator");
|
|
}
|
|
}
|
|
|
|
Binary operators start to get more interesting. The basic idea here is
|
|
that we recursively emit code for the left-hand side of the expression,
|
|
then the right-hand side, then we compute the result of the binary
|
|
expression. In this code, we do a simple switch on the opcode to create
|
|
the right LLVM instruction.
|
|
|
|
In the example above, the LLVM builder class is starting to show its
|
|
value. IRBuilder knows where to insert the newly created instruction,
|
|
all you have to do is specify what instruction to create (e.g. with
|
|
``CreateFAdd``), which operands to use (``L`` and ``R`` here) and
|
|
optionally provide a name for the generated instruction.
|
|
|
|
One nice thing about LLVM is that the name is just a hint. For instance,
|
|
if the code above emits multiple "addtmp" variables, LLVM will
|
|
automatically provide each one with an increasing, unique numeric
|
|
suffix. Local value names for instructions are purely optional, but it
|
|
makes it much easier to read the IR dumps.
|
|
|
|
`LLVM instructions <../LangRef.html#instref>`_ are constrained by strict
|
|
rules: for example, the Left and Right operators of an `add
|
|
instruction <../LangRef.html#i_add>`_ must have the same type, and the
|
|
result type of the add must match the operand types. Because all values
|
|
in Kaleidoscope are doubles, this makes for very simple code for add,
|
|
sub and mul.
|
|
|
|
On the other hand, LLVM specifies that the `fcmp
|
|
instruction <../LangRef.html#i_fcmp>`_ always returns an 'i1' value (a
|
|
one bit integer). The problem with this is that Kaleidoscope wants the
|
|
value to be a 0.0 or 1.0 value. In order to get these semantics, we
|
|
combine the fcmp instruction with a `uitofp
|
|
instruction <../LangRef.html#i_uitofp>`_. This instruction converts its
|
|
input integer into a floating point value by treating the input as an
|
|
unsigned value. In contrast, if we used the `sitofp
|
|
instruction <../LangRef.html#i_sitofp>`_, the Kaleidoscope '<' operator
|
|
would return 0.0 and -1.0, depending on the input value.
|
|
|
|
.. code-block:: c++
|
|
|
|
Value *CallExprAST::Codegen() {
|
|
// Look up the name in the global module table.
|
|
Function *CalleeF = TheModule->getFunction(Callee);
|
|
if (CalleeF == 0)
|
|
return ErrorV("Unknown function referenced");
|
|
|
|
// If argument mismatch error.
|
|
if (CalleeF->arg_size() != Args.size())
|
|
return ErrorV("Incorrect # arguments passed");
|
|
|
|
std::vector<Value*> ArgsV;
|
|
for (unsigned i = 0, e = Args.size(); i != e; ++i) {
|
|
ArgsV.push_back(Args[i]->Codegen());
|
|
if (ArgsV.back() == 0) return 0;
|
|
}
|
|
|
|
return Builder.CreateCall(CalleeF, ArgsV, "calltmp");
|
|
}
|
|
|
|
Code generation for function calls is quite straightforward with LLVM.
|
|
The code above initially does a function name lookup in the LLVM
|
|
Module's symbol table. Recall that the LLVM Module is the container that
|
|
holds all of the functions we are JIT'ing. By giving each function the
|
|
same name as what the user specifies, we can use the LLVM symbol table
|
|
to resolve function names for us.
|
|
|
|
Once we have the function to call, we recursively codegen each argument
|
|
that is to be passed in, and create an LLVM `call
|
|
instruction <../LangRef.html#i_call>`_. Note that LLVM uses the native C
|
|
calling conventions by default, allowing these calls to also call into
|
|
standard library functions like "sin" and "cos", with no additional
|
|
effort.
|
|
|
|
This wraps up our handling of the four basic expressions that we have so
|
|
far in Kaleidoscope. Feel free to go in and add some more. For example,
|
|
by browsing the `LLVM language reference <../LangRef.html>`_ you'll find
|
|
several other interesting instructions that are really easy to plug into
|
|
our basic framework.
|
|
|
|
Function Code Generation
|
|
========================
|
|
|
|
Code generation for prototypes and functions must handle a number of
|
|
details, which make their code less beautiful than expression code
|
|
generation, but allows us to illustrate some important points. First,
|
|
lets talk about code generation for prototypes: they are used both for
|
|
function bodies and external function declarations. The code starts
|
|
with:
|
|
|
|
.. code-block:: c++
|
|
|
|
Function *PrototypeAST::Codegen() {
|
|
// Make the function type: double(double,double) etc.
|
|
std::vector<Type*> Doubles(Args.size(),
|
|
Type::getDoubleTy(getGlobalContext()));
|
|
FunctionType *FT = FunctionType::get(Type::getDoubleTy(getGlobalContext()),
|
|
Doubles, false);
|
|
|
|
Function *F = Function::Create(FT, Function::ExternalLinkage, Name, TheModule);
|
|
|
|
This code packs a lot of power into a few lines. Note first that this
|
|
function returns a "Function\*" instead of a "Value\*". Because a
|
|
"prototype" really talks about the external interface for a function
|
|
(not the value computed by an expression), it makes sense for it to
|
|
return the LLVM Function it corresponds to when codegen'd.
|
|
|
|
The call to ``FunctionType::get`` creates the ``FunctionType`` that
|
|
should be used for a given Prototype. Since all function arguments in
|
|
Kaleidoscope are of type double, the first line creates a vector of "N"
|
|
LLVM double types. It then uses the ``Functiontype::get`` method to
|
|
create a function type that takes "N" doubles as arguments, returns one
|
|
double as a result, and that is not vararg (the false parameter
|
|
indicates this). Note that Types in LLVM are uniqued just like Constants
|
|
are, so you don't "new" a type, you "get" it.
|
|
|
|
The final line above actually creates the function that the prototype
|
|
will correspond to. This indicates the type, linkage and name to use, as
|
|
well as which module to insert into. "`external
|
|
linkage <../LangRef.html#linkage>`_" means that the function may be
|
|
defined outside the current module and/or that it is callable by
|
|
functions outside the module. The Name passed in is the name the user
|
|
specified: since "``TheModule``" is specified, this name is registered
|
|
in "``TheModule``"s symbol table, which is used by the function call
|
|
code above.
|
|
|
|
.. code-block:: c++
|
|
|
|
// If F conflicted, there was already something named 'Name'. If it has a
|
|
// body, don't allow redefinition or reextern.
|
|
if (F->getName() != Name) {
|
|
// Delete the one we just made and get the existing one.
|
|
F->eraseFromParent();
|
|
F = TheModule->getFunction(Name);
|
|
|
|
The Module symbol table works just like the Function symbol table when
|
|
it comes to name conflicts: if a new function is created with a name
|
|
that was previously added to the symbol table, the new function will get
|
|
implicitly renamed when added to the Module. The code above exploits
|
|
this fact to determine if there was a previous definition of this
|
|
function.
|
|
|
|
In Kaleidoscope, I choose to allow redefinitions of functions in two
|
|
cases: first, we want to allow 'extern'ing a function more than once, as
|
|
long as the prototypes for the externs match (since all arguments have
|
|
the same type, we just have to check that the number of arguments
|
|
match). Second, we want to allow 'extern'ing a function and then
|
|
defining a body for it. This is useful when defining mutually recursive
|
|
functions.
|
|
|
|
In order to implement this, the code above first checks to see if there
|
|
is a collision on the name of the function. If so, it deletes the
|
|
function we just created (by calling ``eraseFromParent``) and then
|
|
calling ``getFunction`` to get the existing function with the specified
|
|
name. Note that many APIs in LLVM have "erase" forms and "remove" forms.
|
|
The "remove" form unlinks the object from its parent (e.g. a Function
|
|
from a Module) and returns it. The "erase" form unlinks the object and
|
|
then deletes it.
|
|
|
|
.. code-block:: c++
|
|
|
|
// If F already has a body, reject this.
|
|
if (!F->empty()) {
|
|
ErrorF("redefinition of function");
|
|
return 0;
|
|
}
|
|
|
|
// If F took a different number of args, reject.
|
|
if (F->arg_size() != Args.size()) {
|
|
ErrorF("redefinition of function with different # args");
|
|
return 0;
|
|
}
|
|
}
|
|
|
|
In order to verify the logic above, we first check to see if the
|
|
pre-existing function is "empty". In this case, empty means that it has
|
|
no basic blocks in it, which means it has no body. If it has no body, it
|
|
is a forward declaration. Since we don't allow anything after a full
|
|
definition of the function, the code rejects this case. If the previous
|
|
reference to a function was an 'extern', we simply verify that the
|
|
number of arguments for that definition and this one match up. If not,
|
|
we emit an error.
|
|
|
|
.. code-block:: c++
|
|
|
|
// Set names for all arguments.
|
|
unsigned Idx = 0;
|
|
for (Function::arg_iterator AI = F->arg_begin(); Idx != Args.size();
|
|
++AI, ++Idx) {
|
|
AI->setName(Args[Idx]);
|
|
|
|
// Add arguments to variable symbol table.
|
|
NamedValues[Args[Idx]] = AI;
|
|
}
|
|
return F;
|
|
}
|
|
|
|
The last bit of code for prototypes loops over all of the arguments in
|
|
the function, setting the name of the LLVM Argument objects to match,
|
|
and registering the arguments in the ``NamedValues`` map for future use
|
|
by the ``VariableExprAST`` AST node. Once this is set up, it returns the
|
|
Function object to the caller. Note that we don't check for conflicting
|
|
argument names here (e.g. "extern foo(a b a)"). Doing so would be very
|
|
straight-forward with the mechanics we have already used above.
|
|
|
|
.. code-block:: c++
|
|
|
|
Function *FunctionAST::Codegen() {
|
|
NamedValues.clear();
|
|
|
|
Function *TheFunction = Proto->Codegen();
|
|
if (TheFunction == 0)
|
|
return 0;
|
|
|
|
Code generation for function definitions starts out simply enough: we
|
|
just codegen the prototype (Proto) and verify that it is ok. We then
|
|
clear out the ``NamedValues`` map to make sure that there isn't anything
|
|
in it from the last function we compiled. Code generation of the
|
|
prototype ensures that there is an LLVM Function object that is ready to
|
|
go for us.
|
|
|
|
.. code-block:: c++
|
|
|
|
// Create a new basic block to start insertion into.
|
|
BasicBlock *BB = BasicBlock::Create(getGlobalContext(), "entry", TheFunction);
|
|
Builder.SetInsertPoint(BB);
|
|
|
|
if (Value *RetVal = Body->Codegen()) {
|
|
|
|
Now we get to the point where the ``Builder`` is set up. The first line
|
|
creates a new `basic block <http://en.wikipedia.org/wiki/Basic_block>`_
|
|
(named "entry"), which is inserted into ``TheFunction``. The second line
|
|
then tells the builder that new instructions should be inserted into the
|
|
end of the new basic block. Basic blocks in LLVM are an important part
|
|
of functions that define the `Control Flow
|
|
Graph <http://en.wikipedia.org/wiki/Control_flow_graph>`_. Since we
|
|
don't have any control flow, our functions will only contain one block
|
|
at this point. We'll fix this in `Chapter 5 <LangImpl5.html>`_ :).
|
|
|
|
.. code-block:: c++
|
|
|
|
if (Value *RetVal = Body->Codegen()) {
|
|
// Finish off the function.
|
|
Builder.CreateRet(RetVal);
|
|
|
|
// Validate the generated code, checking for consistency.
|
|
verifyFunction(*TheFunction);
|
|
|
|
return TheFunction;
|
|
}
|
|
|
|
Once the insertion point is set up, we call the ``CodeGen()`` method for
|
|
the root expression of the function. If no error happens, this emits
|
|
code to compute the expression into the entry block and returns the
|
|
value that was computed. Assuming no error, we then create an LLVM `ret
|
|
instruction <../LangRef.html#i_ret>`_, which completes the function.
|
|
Once the function is built, we call ``verifyFunction``, which is
|
|
provided by LLVM. This function does a variety of consistency checks on
|
|
the generated code, to determine if our compiler is doing everything
|
|
right. Using this is important: it can catch a lot of bugs. Once the
|
|
function is finished and validated, we return it.
|
|
|
|
.. code-block:: c++
|
|
|
|
// Error reading body, remove function.
|
|
TheFunction->eraseFromParent();
|
|
return 0;
|
|
}
|
|
|
|
The only piece left here is handling of the error case. For simplicity,
|
|
we handle this by merely deleting the function we produced with the
|
|
``eraseFromParent`` method. This allows the user to redefine a function
|
|
that they incorrectly typed in before: if we didn't delete it, it would
|
|
live in the symbol table, with a body, preventing future redefinition.
|
|
|
|
This code does have a bug, though. Since the ``PrototypeAST::Codegen``
|
|
can return a previously defined forward declaration, our code can
|
|
actually delete a forward declaration. There are a number of ways to fix
|
|
this bug, see what you can come up with! Here is a testcase:
|
|
|
|
::
|
|
|
|
extern foo(a b); # ok, defines foo.
|
|
def foo(a b) c; # error, 'c' is invalid.
|
|
def bar() foo(1, 2); # error, unknown function "foo"
|
|
|
|
Driver Changes and Closing Thoughts
|
|
===================================
|
|
|
|
For now, code generation to LLVM doesn't really get us much, except that
|
|
we can look at the pretty IR calls. The sample code inserts calls to
|
|
Codegen into the "``HandleDefinition``", "``HandleExtern``" etc
|
|
functions, and then dumps out the LLVM IR. This gives a nice way to look
|
|
at the LLVM IR for simple functions. For example:
|
|
|
|
::
|
|
|
|
ready> 4+5;
|
|
Read top-level expression:
|
|
define double @0() {
|
|
entry:
|
|
ret double 9.000000e+00
|
|
}
|
|
|
|
Note how the parser turns the top-level expression into anonymous
|
|
functions for us. This will be handy when we add `JIT
|
|
support <LangImpl4.html#jit>`_ in the next chapter. Also note that the
|
|
code is very literally transcribed, no optimizations are being performed
|
|
except simple constant folding done by IRBuilder. We will `add
|
|
optimizations <LangImpl4.html#trivialconstfold>`_ explicitly in the next
|
|
chapter.
|
|
|
|
::
|
|
|
|
ready> def foo(a b) a*a + 2*a*b + b*b;
|
|
Read function definition:
|
|
define double @foo(double %a, double %b) {
|
|
entry:
|
|
%multmp = fmul double %a, %a
|
|
%multmp1 = fmul double 2.000000e+00, %a
|
|
%multmp2 = fmul double %multmp1, %b
|
|
%addtmp = fadd double %multmp, %multmp2
|
|
%multmp3 = fmul double %b, %b
|
|
%addtmp4 = fadd double %addtmp, %multmp3
|
|
ret double %addtmp4
|
|
}
|
|
|
|
This shows some simple arithmetic. Notice the striking similarity to the
|
|
LLVM builder calls that we use to create the instructions.
|
|
|
|
::
|
|
|
|
ready> def bar(a) foo(a, 4.0) + bar(31337);
|
|
Read function definition:
|
|
define double @bar(double %a) {
|
|
entry:
|
|
%calltmp = call double @foo(double %a, double 4.000000e+00)
|
|
%calltmp1 = call double @bar(double 3.133700e+04)
|
|
%addtmp = fadd double %calltmp, %calltmp1
|
|
ret double %addtmp
|
|
}
|
|
|
|
This shows some function calls. Note that this function will take a long
|
|
time to execute if you call it. In the future we'll add conditional
|
|
control flow to actually make recursion useful :).
|
|
|
|
::
|
|
|
|
ready> extern cos(x);
|
|
Read extern:
|
|
declare double @cos(double)
|
|
|
|
ready> cos(1.234);
|
|
Read top-level expression:
|
|
define double @1() {
|
|
entry:
|
|
%calltmp = call double @cos(double 1.234000e+00)
|
|
ret double %calltmp
|
|
}
|
|
|
|
This shows an extern for the libm "cos" function, and a call to it.
|
|
|
|
.. TODO:: Abandon Pygments' horrible `llvm` lexer. It just totally gives up
|
|
on highlighting this due to the first line.
|
|
|
|
::
|
|
|
|
ready> ^D
|
|
; ModuleID = 'my cool jit'
|
|
|
|
define double @0() {
|
|
entry:
|
|
%addtmp = fadd double 4.000000e+00, 5.000000e+00
|
|
ret double %addtmp
|
|
}
|
|
|
|
define double @foo(double %a, double %b) {
|
|
entry:
|
|
%multmp = fmul double %a, %a
|
|
%multmp1 = fmul double 2.000000e+00, %a
|
|
%multmp2 = fmul double %multmp1, %b
|
|
%addtmp = fadd double %multmp, %multmp2
|
|
%multmp3 = fmul double %b, %b
|
|
%addtmp4 = fadd double %addtmp, %multmp3
|
|
ret double %addtmp4
|
|
}
|
|
|
|
define double @bar(double %a) {
|
|
entry:
|
|
%calltmp = call double @foo(double %a, double 4.000000e+00)
|
|
%calltmp1 = call double @bar(double 3.133700e+04)
|
|
%addtmp = fadd double %calltmp, %calltmp1
|
|
ret double %addtmp
|
|
}
|
|
|
|
declare double @cos(double)
|
|
|
|
define double @1() {
|
|
entry:
|
|
%calltmp = call double @cos(double 1.234000e+00)
|
|
ret double %calltmp
|
|
}
|
|
|
|
When you quit the current demo, it dumps out the IR for the entire
|
|
module generated. Here you can see the big picture with all the
|
|
functions referencing each other.
|
|
|
|
This wraps up the third chapter of the Kaleidoscope tutorial. Up next,
|
|
we'll describe how to `add JIT codegen and optimizer
|
|
support <LangImpl4.html>`_ to this so we can actually start running
|
|
code!
|
|
|
|
Full Code Listing
|
|
=================
|
|
|
|
Here is the complete code listing for our running example, enhanced with
|
|
the LLVM code generator. Because this uses the LLVM libraries, we need
|
|
to link them in. To do this, we use the
|
|
`llvm-config <http://llvm.org/cmds/llvm-config.html>`_ tool to inform
|
|
our makefile/command line about which options to use:
|
|
|
|
.. code-block:: bash
|
|
|
|
# Compile
|
|
clang++ -g -O3 toy.cpp `llvm-config --cppflags --ldflags --libs core` -o toy
|
|
# Run
|
|
./toy
|
|
|
|
Here is the code:
|
|
|
|
.. code-block:: c++
|
|
|
|
// To build this:
|
|
// See example below.
|
|
|
|
#include "llvm/DerivedTypes.h"
|
|
#include "llvm/IRBuilder.h"
|
|
#include "llvm/LLVMContext.h"
|
|
#include "llvm/Module.h"
|
|
#include "llvm/Analysis/Verifier.h"
|
|
#include <cstdio>
|
|
#include <string>
|
|
#include <map>
|
|
#include <vector>
|
|
using namespace llvm;
|
|
|
|
//===----------------------------------------------------------------------===//
|
|
// Lexer
|
|
//===----------------------------------------------------------------------===//
|
|
|
|
// The lexer returns tokens [0-255] if it is an unknown character, otherwise one
|
|
// of these for known things.
|
|
enum Token {
|
|
tok_eof = -1,
|
|
|
|
// commands
|
|
tok_def = -2, tok_extern = -3,
|
|
|
|
// primary
|
|
tok_identifier = -4, tok_number = -5
|
|
};
|
|
|
|
static std::string IdentifierStr; // Filled in if tok_identifier
|
|
static double NumVal; // Filled in if tok_number
|
|
|
|
/// gettok - Return the next token from standard input.
|
|
static int gettok() {
|
|
static int LastChar = ' ';
|
|
|
|
// Skip any whitespace.
|
|
while (isspace(LastChar))
|
|
LastChar = getchar();
|
|
|
|
if (isalpha(LastChar)) { // identifier: [a-zA-Z][a-zA-Z0-9]*
|
|
IdentifierStr = LastChar;
|
|
while (isalnum((LastChar = getchar())))
|
|
IdentifierStr += LastChar;
|
|
|
|
if (IdentifierStr == "def") return tok_def;
|
|
if (IdentifierStr == "extern") return tok_extern;
|
|
return tok_identifier;
|
|
}
|
|
|
|
if (isdigit(LastChar) || LastChar == '.') { // Number: [0-9.]+
|
|
std::string NumStr;
|
|
do {
|
|
NumStr += LastChar;
|
|
LastChar = getchar();
|
|
} while (isdigit(LastChar) || LastChar == '.');
|
|
|
|
NumVal = strtod(NumStr.c_str(), 0);
|
|
return tok_number;
|
|
}
|
|
|
|
if (LastChar == '#') {
|
|
// Comment until end of line.
|
|
do LastChar = getchar();
|
|
while (LastChar != EOF && LastChar != '\n' && LastChar != '\r');
|
|
|
|
if (LastChar != EOF)
|
|
return gettok();
|
|
}
|
|
|
|
// Check for end of file. Don't eat the EOF.
|
|
if (LastChar == EOF)
|
|
return tok_eof;
|
|
|
|
// Otherwise, just return the character as its ascii value.
|
|
int ThisChar = LastChar;
|
|
LastChar = getchar();
|
|
return ThisChar;
|
|
}
|
|
|
|
//===----------------------------------------------------------------------===//
|
|
// Abstract Syntax Tree (aka Parse Tree)
|
|
//===----------------------------------------------------------------------===//
|
|
|
|
/// ExprAST - Base class for all expression nodes.
|
|
class ExprAST {
|
|
public:
|
|
virtual ~ExprAST() {}
|
|
virtual Value *Codegen() = 0;
|
|
};
|
|
|
|
/// NumberExprAST - Expression class for numeric literals like "1.0".
|
|
class NumberExprAST : public ExprAST {
|
|
double Val;
|
|
public:
|
|
NumberExprAST(double val) : Val(val) {}
|
|
virtual Value *Codegen();
|
|
};
|
|
|
|
/// VariableExprAST - Expression class for referencing a variable, like "a".
|
|
class VariableExprAST : public ExprAST {
|
|
std::string Name;
|
|
public:
|
|
VariableExprAST(const std::string &name) : Name(name) {}
|
|
virtual Value *Codegen();
|
|
};
|
|
|
|
/// BinaryExprAST - Expression class for a binary operator.
|
|
class BinaryExprAST : public ExprAST {
|
|
char Op;
|
|
ExprAST *LHS, *RHS;
|
|
public:
|
|
BinaryExprAST(char op, ExprAST *lhs, ExprAST *rhs)
|
|
: Op(op), LHS(lhs), RHS(rhs) {}
|
|
virtual Value *Codegen();
|
|
};
|
|
|
|
/// CallExprAST - Expression class for function calls.
|
|
class CallExprAST : public ExprAST {
|
|
std::string Callee;
|
|
std::vector<ExprAST*> Args;
|
|
public:
|
|
CallExprAST(const std::string &callee, std::vector<ExprAST*> &args)
|
|
: Callee(callee), Args(args) {}
|
|
virtual Value *Codegen();
|
|
};
|
|
|
|
/// PrototypeAST - This class represents the "prototype" for a function,
|
|
/// which captures its name, and its argument names (thus implicitly the number
|
|
/// of arguments the function takes).
|
|
class PrototypeAST {
|
|
std::string Name;
|
|
std::vector<std::string> Args;
|
|
public:
|
|
PrototypeAST(const std::string &name, const std::vector<std::string> &args)
|
|
: Name(name), Args(args) {}
|
|
|
|
Function *Codegen();
|
|
};
|
|
|
|
/// FunctionAST - This class represents a function definition itself.
|
|
class FunctionAST {
|
|
PrototypeAST *Proto;
|
|
ExprAST *Body;
|
|
public:
|
|
FunctionAST(PrototypeAST *proto, ExprAST *body)
|
|
: Proto(proto), Body(body) {}
|
|
|
|
Function *Codegen();
|
|
};
|
|
|
|
//===----------------------------------------------------------------------===//
|
|
// Parser
|
|
//===----------------------------------------------------------------------===//
|
|
|
|
/// CurTok/getNextToken - Provide a simple token buffer. CurTok is the current
|
|
/// token the parser is looking at. getNextToken reads another token from the
|
|
/// lexer and updates CurTok with its results.
|
|
static int CurTok;
|
|
static int getNextToken() {
|
|
return CurTok = gettok();
|
|
}
|
|
|
|
/// BinopPrecedence - This holds the precedence for each binary operator that is
|
|
/// defined.
|
|
static std::map<char, int> BinopPrecedence;
|
|
|
|
/// GetTokPrecedence - Get the precedence of the pending binary operator token.
|
|
static int GetTokPrecedence() {
|
|
if (!isascii(CurTok))
|
|
return -1;
|
|
|
|
// Make sure it's a declared binop.
|
|
int TokPrec = BinopPrecedence[CurTok];
|
|
if (TokPrec <= 0) return -1;
|
|
return TokPrec;
|
|
}
|
|
|
|
/// Error* - These are little helper functions for error handling.
|
|
ExprAST *Error(const char *Str) { fprintf(stderr, "Error: %s\n", Str);return 0;}
|
|
PrototypeAST *ErrorP(const char *Str) { Error(Str); return 0; }
|
|
FunctionAST *ErrorF(const char *Str) { Error(Str); return 0; }
|
|
|
|
static ExprAST *ParseExpression();
|
|
|
|
/// identifierexpr
|
|
/// ::= identifier
|
|
/// ::= identifier '(' expression* ')'
|
|
static ExprAST *ParseIdentifierExpr() {
|
|
std::string IdName = IdentifierStr;
|
|
|
|
getNextToken(); // eat identifier.
|
|
|
|
if (CurTok != '(') // Simple variable ref.
|
|
return new VariableExprAST(IdName);
|
|
|
|
// Call.
|
|
getNextToken(); // eat (
|
|
std::vector<ExprAST*> Args;
|
|
if (CurTok != ')') {
|
|
while (1) {
|
|
ExprAST *Arg = ParseExpression();
|
|
if (!Arg) return 0;
|
|
Args.push_back(Arg);
|
|
|
|
if (CurTok == ')') break;
|
|
|
|
if (CurTok != ',')
|
|
return Error("Expected ')' or ',' in argument list");
|
|
getNextToken();
|
|
}
|
|
}
|
|
|
|
// Eat the ')'.
|
|
getNextToken();
|
|
|
|
return new CallExprAST(IdName, Args);
|
|
}
|
|
|
|
/// numberexpr ::= number
|
|
static ExprAST *ParseNumberExpr() {
|
|
ExprAST *Result = new NumberExprAST(NumVal);
|
|
getNextToken(); // consume the number
|
|
return Result;
|
|
}
|
|
|
|
/// parenexpr ::= '(' expression ')'
|
|
static ExprAST *ParseParenExpr() {
|
|
getNextToken(); // eat (.
|
|
ExprAST *V = ParseExpression();
|
|
if (!V) return 0;
|
|
|
|
if (CurTok != ')')
|
|
return Error("expected ')'");
|
|
getNextToken(); // eat ).
|
|
return V;
|
|
}
|
|
|
|
/// primary
|
|
/// ::= identifierexpr
|
|
/// ::= numberexpr
|
|
/// ::= parenexpr
|
|
static ExprAST *ParsePrimary() {
|
|
switch (CurTok) {
|
|
default: return Error("unknown token when expecting an expression");
|
|
case tok_identifier: return ParseIdentifierExpr();
|
|
case tok_number: return ParseNumberExpr();
|
|
case '(': return ParseParenExpr();
|
|
}
|
|
}
|
|
|
|
/// binoprhs
|
|
/// ::= ('+' primary)*
|
|
static ExprAST *ParseBinOpRHS(int ExprPrec, ExprAST *LHS) {
|
|
// If this is a binop, find its precedence.
|
|
while (1) {
|
|
int TokPrec = GetTokPrecedence();
|
|
|
|
// If this is a binop that binds at least as tightly as the current binop,
|
|
// consume it, otherwise we are done.
|
|
if (TokPrec < ExprPrec)
|
|
return LHS;
|
|
|
|
// Okay, we know this is a binop.
|
|
int BinOp = CurTok;
|
|
getNextToken(); // eat binop
|
|
|
|
// Parse the primary expression after the binary operator.
|
|
ExprAST *RHS = ParsePrimary();
|
|
if (!RHS) return 0;
|
|
|
|
// If BinOp binds less tightly with RHS than the operator after RHS, let
|
|
// the pending operator take RHS as its LHS.
|
|
int NextPrec = GetTokPrecedence();
|
|
if (TokPrec < NextPrec) {
|
|
RHS = ParseBinOpRHS(TokPrec+1, RHS);
|
|
if (RHS == 0) return 0;
|
|
}
|
|
|
|
// Merge LHS/RHS.
|
|
LHS = new BinaryExprAST(BinOp, LHS, RHS);
|
|
}
|
|
}
|
|
|
|
/// expression
|
|
/// ::= primary binoprhs
|
|
///
|
|
static ExprAST *ParseExpression() {
|
|
ExprAST *LHS = ParsePrimary();
|
|
if (!LHS) return 0;
|
|
|
|
return ParseBinOpRHS(0, LHS);
|
|
}
|
|
|
|
/// prototype
|
|
/// ::= id '(' id* ')'
|
|
static PrototypeAST *ParsePrototype() {
|
|
if (CurTok != tok_identifier)
|
|
return ErrorP("Expected function name in prototype");
|
|
|
|
std::string FnName = IdentifierStr;
|
|
getNextToken();
|
|
|
|
if (CurTok != '(')
|
|
return ErrorP("Expected '(' in prototype");
|
|
|
|
std::vector<std::string> ArgNames;
|
|
while (getNextToken() == tok_identifier)
|
|
ArgNames.push_back(IdentifierStr);
|
|
if (CurTok != ')')
|
|
return ErrorP("Expected ')' in prototype");
|
|
|
|
// success.
|
|
getNextToken(); // eat ')'.
|
|
|
|
return new PrototypeAST(FnName, ArgNames);
|
|
}
|
|
|
|
/// definition ::= 'def' prototype expression
|
|
static FunctionAST *ParseDefinition() {
|
|
getNextToken(); // eat def.
|
|
PrototypeAST *Proto = ParsePrototype();
|
|
if (Proto == 0) return 0;
|
|
|
|
if (ExprAST *E = ParseExpression())
|
|
return new FunctionAST(Proto, E);
|
|
return 0;
|
|
}
|
|
|
|
/// toplevelexpr ::= expression
|
|
static FunctionAST *ParseTopLevelExpr() {
|
|
if (ExprAST *E = ParseExpression()) {
|
|
// Make an anonymous proto.
|
|
PrototypeAST *Proto = new PrototypeAST("", std::vector<std::string>());
|
|
return new FunctionAST(Proto, E);
|
|
}
|
|
return 0;
|
|
}
|
|
|
|
/// external ::= 'extern' prototype
|
|
static PrototypeAST *ParseExtern() {
|
|
getNextToken(); // eat extern.
|
|
return ParsePrototype();
|
|
}
|
|
|
|
//===----------------------------------------------------------------------===//
|
|
// Code Generation
|
|
//===----------------------------------------------------------------------===//
|
|
|
|
static Module *TheModule;
|
|
static IRBuilder<> Builder(getGlobalContext());
|
|
static std::map<std::string, Value*> NamedValues;
|
|
|
|
Value *ErrorV(const char *Str) { Error(Str); return 0; }
|
|
|
|
Value *NumberExprAST::Codegen() {
|
|
return ConstantFP::get(getGlobalContext(), APFloat(Val));
|
|
}
|
|
|
|
Value *VariableExprAST::Codegen() {
|
|
// Look this variable up in the function.
|
|
Value *V = NamedValues[Name];
|
|
return V ? V : ErrorV("Unknown variable name");
|
|
}
|
|
|
|
Value *BinaryExprAST::Codegen() {
|
|
Value *L = LHS->Codegen();
|
|
Value *R = RHS->Codegen();
|
|
if (L == 0 || R == 0) return 0;
|
|
|
|
switch (Op) {
|
|
case '+': return Builder.CreateFAdd(L, R, "addtmp");
|
|
case '-': return Builder.CreateFSub(L, R, "subtmp");
|
|
case '*': return Builder.CreateFMul(L, R, "multmp");
|
|
case '<':
|
|
L = Builder.CreateFCmpULT(L, R, "cmptmp");
|
|
// Convert bool 0/1 to double 0.0 or 1.0
|
|
return Builder.CreateUIToFP(L, Type::getDoubleTy(getGlobalContext()),
|
|
"booltmp");
|
|
default: return ErrorV("invalid binary operator");
|
|
}
|
|
}
|
|
|
|
Value *CallExprAST::Codegen() {
|
|
// Look up the name in the global module table.
|
|
Function *CalleeF = TheModule->getFunction(Callee);
|
|
if (CalleeF == 0)
|
|
return ErrorV("Unknown function referenced");
|
|
|
|
// If argument mismatch error.
|
|
if (CalleeF->arg_size() != Args.size())
|
|
return ErrorV("Incorrect # arguments passed");
|
|
|
|
std::vector<Value*> ArgsV;
|
|
for (unsigned i = 0, e = Args.size(); i != e; ++i) {
|
|
ArgsV.push_back(Args[i]->Codegen());
|
|
if (ArgsV.back() == 0) return 0;
|
|
}
|
|
|
|
return Builder.CreateCall(CalleeF, ArgsV, "calltmp");
|
|
}
|
|
|
|
Function *PrototypeAST::Codegen() {
|
|
// Make the function type: double(double,double) etc.
|
|
std::vector<Type*> Doubles(Args.size(),
|
|
Type::getDoubleTy(getGlobalContext()));
|
|
FunctionType *FT = FunctionType::get(Type::getDoubleTy(getGlobalContext()),
|
|
Doubles, false);
|
|
|
|
Function *F = Function::Create(FT, Function::ExternalLinkage, Name, TheModule);
|
|
|
|
// If F conflicted, there was already something named 'Name'. If it has a
|
|
// body, don't allow redefinition or reextern.
|
|
if (F->getName() != Name) {
|
|
// Delete the one we just made and get the existing one.
|
|
F->eraseFromParent();
|
|
F = TheModule->getFunction(Name);
|
|
|
|
// If F already has a body, reject this.
|
|
if (!F->empty()) {
|
|
ErrorF("redefinition of function");
|
|
return 0;
|
|
}
|
|
|
|
// If F took a different number of args, reject.
|
|
if (F->arg_size() != Args.size()) {
|
|
ErrorF("redefinition of function with different # args");
|
|
return 0;
|
|
}
|
|
}
|
|
|
|
// Set names for all arguments.
|
|
unsigned Idx = 0;
|
|
for (Function::arg_iterator AI = F->arg_begin(); Idx != Args.size();
|
|
++AI, ++Idx) {
|
|
AI->setName(Args[Idx]);
|
|
|
|
// Add arguments to variable symbol table.
|
|
NamedValues[Args[Idx]] = AI;
|
|
}
|
|
|
|
return F;
|
|
}
|
|
|
|
Function *FunctionAST::Codegen() {
|
|
NamedValues.clear();
|
|
|
|
Function *TheFunction = Proto->Codegen();
|
|
if (TheFunction == 0)
|
|
return 0;
|
|
|
|
// Create a new basic block to start insertion into.
|
|
BasicBlock *BB = BasicBlock::Create(getGlobalContext(), "entry", TheFunction);
|
|
Builder.SetInsertPoint(BB);
|
|
|
|
if (Value *RetVal = Body->Codegen()) {
|
|
// Finish off the function.
|
|
Builder.CreateRet(RetVal);
|
|
|
|
// Validate the generated code, checking for consistency.
|
|
verifyFunction(*TheFunction);
|
|
|
|
return TheFunction;
|
|
}
|
|
|
|
// Error reading body, remove function.
|
|
TheFunction->eraseFromParent();
|
|
return 0;
|
|
}
|
|
|
|
//===----------------------------------------------------------------------===//
|
|
// Top-Level parsing and JIT Driver
|
|
//===----------------------------------------------------------------------===//
|
|
|
|
static void HandleDefinition() {
|
|
if (FunctionAST *F = ParseDefinition()) {
|
|
if (Function *LF = F->Codegen()) {
|
|
fprintf(stderr, "Read function definition:");
|
|
LF->dump();
|
|
}
|
|
} else {
|
|
// Skip token for error recovery.
|
|
getNextToken();
|
|
}
|
|
}
|
|
|
|
static void HandleExtern() {
|
|
if (PrototypeAST *P = ParseExtern()) {
|
|
if (Function *F = P->Codegen()) {
|
|
fprintf(stderr, "Read extern: ");
|
|
F->dump();
|
|
}
|
|
} else {
|
|
// Skip token for error recovery.
|
|
getNextToken();
|
|
}
|
|
}
|
|
|
|
static void HandleTopLevelExpression() {
|
|
// Evaluate a top-level expression into an anonymous function.
|
|
if (FunctionAST *F = ParseTopLevelExpr()) {
|
|
if (Function *LF = F->Codegen()) {
|
|
fprintf(stderr, "Read top-level expression:");
|
|
LF->dump();
|
|
}
|
|
} else {
|
|
// Skip token for error recovery.
|
|
getNextToken();
|
|
}
|
|
}
|
|
|
|
/// top ::= definition | external | expression | ';'
|
|
static void MainLoop() {
|
|
while (1) {
|
|
fprintf(stderr, "ready> ");
|
|
switch (CurTok) {
|
|
case tok_eof: return;
|
|
case ';': getNextToken(); break; // ignore top-level semicolons.
|
|
case tok_def: HandleDefinition(); break;
|
|
case tok_extern: HandleExtern(); break;
|
|
default: HandleTopLevelExpression(); break;
|
|
}
|
|
}
|
|
}
|
|
|
|
//===----------------------------------------------------------------------===//
|
|
// "Library" functions that can be "extern'd" from user code.
|
|
//===----------------------------------------------------------------------===//
|
|
|
|
/// putchard - putchar that takes a double and returns 0.
|
|
extern "C"
|
|
double putchard(double X) {
|
|
putchar((char)X);
|
|
return 0;
|
|
}
|
|
|
|
//===----------------------------------------------------------------------===//
|
|
// Main driver code.
|
|
//===----------------------------------------------------------------------===//
|
|
|
|
int main() {
|
|
LLVMContext &Context = getGlobalContext();
|
|
|
|
// Install standard binary operators.
|
|
// 1 is lowest precedence.
|
|
BinopPrecedence['<'] = 10;
|
|
BinopPrecedence['+'] = 20;
|
|
BinopPrecedence['-'] = 20;
|
|
BinopPrecedence['*'] = 40; // highest.
|
|
|
|
// Prime the first token.
|
|
fprintf(stderr, "ready> ");
|
|
getNextToken();
|
|
|
|
// Make the module, which holds all the code.
|
|
TheModule = new Module("my cool jit", Context);
|
|
|
|
// Run the main "interpreter loop" now.
|
|
MainLoop();
|
|
|
|
// Print out all of the generated code.
|
|
TheModule->dump();
|
|
|
|
return 0;
|
|
}
|
|
|
|
`Next: Adding JIT and Optimizer Support <LangImpl4.html>`_
|
|
|