From 31715d3a427b9dab1b97258338dda897142e826e Mon Sep 17 00:00:00 2001 From: Brian Gaeke Date: Mon, 24 Nov 2003 17:03:38 +0000 Subject: [PATCH] Apply doc patch from PR136. llvm-svn: 10198 --- llvm/docs/Stacker.html | 406 +++++++++++++++++++++++++++++++++++------ 1 file changed, 347 insertions(+), 59 deletions(-) diff --git a/llvm/docs/Stacker.html b/llvm/docs/Stacker.html index 81ad60e8fd60..eabccdf6cf10 100644 --- a/llvm/docs/Stacker.html +++ b/llvm/docs/Stacker.html @@ -6,9 +6,21 @@
Stacker: An Example Of Using LLVM
  1. Abstract
  2. Introduction
  3. +
  4. Lessons I Learned About LLVM +
    1. Everything's a Value!
    2. +
    3. Terminate Those Blocks!
    4. +
    5. Concrete Blocks
    6. +
    7. push_back Is Your Friend
    8. +
    9. The Wily GetElementPtrInst
    10. +
    11. Getting Linkage Types Right
    12. +
    13. Constants Are Easier Than That!
    14. +
  5. The Stacker Lexicon
    1. The Stack @@ -18,12 +30,24 @@
    2. Built-Ins
  6. -
  7. The Directory Structure +
  8. Prime: A Complete Example
  9. +
  10. Internal Code Details +
    1. The Directory Structure
    2. +
    3. The Lexer
    4. +
    5. The Parser
    6. +
    7. The Compiler
    8. +
    9. The Runtime
    10. +
    11. Compiler Driver
    12. +
    13. Test Programs
    14. +

Written by Reid Spencer

@@ -80,31 +104,266 @@ written Stacker definitions have that characteristic.

Exercise for the reader: how could you make this a one line program?

Lessons Learned About LLVM
Lessons I Learned About LLVM

Stacker was written for two purposes: (a) to get the author over the learning curve and (b) to provide a simple example of how to write a compiler using LLVM. During the development of Stacker, many lessons about LLVM were learned. Those lessons are described in the following subsections.

+ +
Everything's a Value!

Although I knew that LLVM uses a Static Single Assignment (SSA) format, +it wasn't obvious to me how prevalent this idea was in LLVM until I really +started using it. Reading the Programmer's Manual and Language Reference, I +noted that most of the important LLVM IR (Intermediate Representation) C++ +classes were derived from the Value class. I only fully understood the power +of that simple design once I started constructing executable +expressions for Stacker.


This really makes your programming go faster. Think about compiling code +for the following C/C++ expression: (a|b)*((x+1)/(y+1)). You could write a +function using LLVM that does exactly that, this way:


+Value*
+expression(BasicBlock* bb, Value* a, Value* b, Value* x, Value* y )
+{
+    // Insert every new instruction just before the block's terminator
+    Instruction* tail = bb->getTerminator();
+    ConstantSInt* one = ConstantSInt::get( Type::IntTy, 1);
+    // create() is a static factory method, so no "new" is needed
+    BinaryOperator* or1 =
+	BinaryOperator::create( Instruction::Or, a, b, "", tail );
+    BinaryOperator* add1 =
+	BinaryOperator::create( Instruction::Add, x, one, "", tail );
+    BinaryOperator* add2 =
+	BinaryOperator::create( Instruction::Add, y, one, "", tail );
+    BinaryOperator* div1 =
+	BinaryOperator::create( Instruction::Div, add1, add2, "", tail);
+    BinaryOperator* mult1 =
+	BinaryOperator::create( Instruction::Mul, or1, div1, "", tail );
+    return mult1;
+}

"Okay, big deal," you say. It is a big deal. Here's why. Note that I didn't +have to tell this function which kinds of Values are being passed in. They could be +instructions, Constants, Global Variables, etc. Furthermore, if you specify Values +that are incorrect for this sequence of operations, LLVM will either notice right +away (at compilation time) or the LLVM Verifier will pick up the inconsistency +when the compiler runs. In no case will you make a type error that gets passed +through to the generated program. This really helps you write a compiler +that always generates correct code!


The second point is that we don't have to worry about branching, registers, +stack variables, saving partial results, etc. The instructions we create +are the values we use. Note that all that was created in the above +code is a Constant value and five operators. Each of the instructions is +the resulting value of that instruction.


The lesson is this: SSA form is very powerful: there is no difference + between a value and the instruction that created it. This is fully +enforced by the LLVM IR. Use it to your best advantage.
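To see why "the instruction is the value" is so convenient, here is a small self-contained sketch of the same idea. It is a toy model only: the Value, Constant, BinaryOp and Arena types below are hypothetical stand-ins written for this illustration, not LLVM's classes, and the sketch evaluates the expression directly instead of generating code.

```cpp
#include <cassert>
#include <memory>
#include <utility>
#include <vector>

// Toy stand-ins for the idea only; these are NOT LLVM's classes.
struct Value {
    virtual ~Value() = default;
    virtual long eval() const = 0;
};

struct Constant : Value {
    long v;
    explicit Constant(long v) : v(v) {}
    long eval() const override { return v; }
};

enum class Op { Or, Add, Div, Mul };

// An instruction *is* a Value, so it can be used directly as an operand.
struct BinaryOp : Value {
    Op op;
    const Value* lhs;
    const Value* rhs;
    BinaryOp(Op op, const Value* l, const Value* r) : op(op), lhs(l), rhs(r) {}
    long eval() const override {
        long a = lhs->eval(), b = rhs->eval();
        switch (op) {
            case Op::Or:  return a | b;
            case Op::Add: return a + b;
            case Op::Div: return a / b;
            case Op::Mul: return a * b;
        }
        return 0;
    }
};

// Owns every node created while building an expression.
struct Arena {
    std::vector<std::unique_ptr<Value>> nodes;
    template <class T, class... Args>
    const T* make(Args... args) {
        auto node = std::make_unique<T>(args...);
        const T* raw = node.get();
        nodes.push_back(std::move(node));
        return raw;
    }
};

// Builds (a|b)*((x+1)/(y+1)) from any kind of Value operands.
const Value* expression(Arena& arena, const Value* a, const Value* b,
                        const Value* x, const Value* y) {
    const Value* one  = arena.make<Constant>(1);
    const Value* or1  = arena.make<BinaryOp>(Op::Or,  a, b);
    const Value* add1 = arena.make<BinaryOp>(Op::Add, x, one);
    const Value* add2 = arena.make<BinaryOp>(Op::Add, y, one);
    const Value* div1 = arena.make<BinaryOp>(Op::Div, add1, add2);
    return arena.make<BinaryOp>(Op::Mul, or1, div1);
}
```

Because expression() only asks for Values, its result can be fed straight back in as an operand of another call, with no temporaries or registers to manage.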

+ +
Terminate Those Blocks!

I had to learn about terminating blocks the hard way: using the debugger +to figure out what the LLVM verifier was trying to tell me and begging for +help on the LLVMdev mailing list. I hope you avoid this experience.


Emblazon this rule in your mind:

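+ Every BasicBlock in your compiler must be terminated with a terminating instruction (branch, return, etc.). +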

Terminating instructions are a semantic requirement of the LLVM IR. There +is no facility for implicitly chaining together blocks placed into a function +in the order they occur. Indeed, in the general case, blocks will not be +added to the function in the order of execution because of the recursive +way compilers are written.


Furthermore, if you don't terminate your blocks, your compiler code will +compile just fine. You won't find out about the problem until you're running +the compiler and the module you just created fails on the LLVM Verifier.
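The kind of check that catches this can be sketched in a few lines. The following is a toy model, assuming a function is just a list of named blocks and an instruction is just an enum tag; Kind, Block and unterminated() are hypothetical names made up for this illustration and have nothing to do with the real LLVM Verifier's interface.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy model only: these types are hypothetical, not LLVM's.
enum class Kind { Add, Mul, Branch, Return };

struct Block {
    std::string name;
    std::vector<Kind> insts;
};

bool isTerminator(Kind k) { return k == Kind::Branch || k == Kind::Return; }

// Returns the names of blocks that do not end in a terminator.
// An empty block counts as unterminated.
std::vector<std::string> unterminated(const std::vector<Block>& fn) {
    std::vector<std::string> bad;
    for (const Block& b : fn) {
        if (b.insts.empty() || !isTerminator(b.insts.back()))
            bad.push_back(b.name);
    }
    return bad;
}
```

Running a check like this over your own data structures while you build them is one way to find a forgotten terminator before the verifier does.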

+ +
Concrete Blocks

After a little initial fumbling around, I quickly caught on to how blocks +should be constructed. The use of the Standard Template Library really helps +simplify the interface. In general, here's what I learned: +

  1. Create your blocks early. While writing your compiler, you + will encounter several situations where you know a priori that you will + need several blocks. For example, if-then-else, switch, while and for + statements in C/C++ all need multiple blocks for expression in LLVM. + The rule is, create them early.
  2. +
  3. Terminate your blocks early. This just reduces the chances + that you forget to terminate your blocks, which is required (see + Terminate Those Blocks! for more). +
  4. Use getTerminator() for instruction insertion. I noticed early on + that many of the constructors for the Instruction classes take an optional + insert_before argument. At first, I thought this was a mistake + because clearly the normal mode of inserting instructions would be one at + a time after some other instruction, not before. However, + if you hold on to your terminating instruction (or use the handy dandy + getTerminator() method on a BasicBlock), it can + always be used as the insert_before argument to your instruction + constructors. This causes the instruction to automatically be inserted in + the RightPlace™, just before the terminating instruction. The + nice thing about this design is that you can pass blocks around and insert + new instructions into them without ever knowing what instructions came + before. This makes for some very clean compiler design.
  5. +

The foregoing is such an important principle, it's worth making an idiom:

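+// 1. Create the block
+BasicBlock* bb = new BasicBlock();
+// 2. Terminate it right away
+bb->getInstList().push_back( new BranchInst( ... ) );
+// 3. Insert new instructions before the terminator
+new Instruction(..., bb->getTerminator() );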

To make this clear, consider the typical if-then-else statement +(see StackerCompiler::handle_if() method). We can set this up +in a single function using LLVM in the following way:

+using namespace llvm;
+BasicBlock*
+MyCompiler::handle_if( BasicBlock* bb, SetCondInst* condition )
+{
+    // Create the blocks to contain code in the structure of if/then/else
+    // ("else" is a C++ keyword, so the variables carry a _bb suffix)
+    BasicBlock* then_bb = new BasicBlock();
+    BasicBlock* else_bb = new BasicBlock();
+    BasicBlock* exit_bb = new BasicBlock();
+    // Insert the branch instruction for the "if"
+    bb->getInstList().push_back( new BranchInst( then_bb, else_bb, condition ) );
+    // Set up the terminating instructions
+    then_bb->getInstList().push_back( new BranchInst( exit_bb ) );
+    else_bb->getInstList().push_back( new BranchInst( exit_bb ) );
+    // Fill in the then part .. details excised for brevity
+    this->fill_in( then_bb );
+    // Fill in the else part .. details excised for brevity
+    this->fill_in( else_bb );
+    // Return a block to the caller that can be filled in with the code
+    // that follows the if/then/else construct.
+    return exit_bb;
+}

Presumably in the foregoing, the calls to the "fill_in" method would add +the instructions for the "then" and "else" parts. They would use the third part +of the idiom almost exclusively (inserting new instructions before the +terminator). Furthermore, they could even recurse back to handle_if +should they encounter another if/then/else statement and it will all "just work". +


Note how cleanly this all works out. In particular, look at the push_back methods on +the BasicBlock's instruction list. These are lists of type +Instruction which also happen to be Values. To create +the "if" branch we merely instantiate a BranchInst that takes as +arguments the blocks to branch to and the condition to branch on. The blocks +act like branch labels! This new BranchInst terminates +the BasicBlock provided as an argument. To give the caller a way +to keep inserting after calling handle_if we create an "exit" block +which is returned to the caller. Note that the "exit" block is the branch +target of the terminators of both the "then" and the "else" blocks. This guarantees that no +matter what else "handle_if" or "fill_in" does, they end up at the "exit" block. +
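The "terminate early, then insert before the terminator" idiom can be tried out without LLVM at all. The sketch below is a toy, assuming a block is nothing more than a std::list of instruction names; ToyBlock and its two methods are hypothetical and exist only for this illustration.

```cpp
#include <cassert>
#include <iterator>
#include <list>
#include <string>

// Toy model: a block is just a list of instruction names (hypothetical).
struct ToyBlock {
    std::list<std::string> insts;

    // Step 2 of the idiom: terminate the block right away.
    void terminate(const std::string& term) { insts.push_back(term); }

    // Step 3: new instructions always go just before the terminator.
    // Assumes terminate() has already been called on this block.
    void insertBeforeTerminator(const std::string& inst) {
        insts.insert(std::prev(insts.end()), inst);
    }
};
```

However many instructions are inserted this way, they stay in insertion order and the terminator stays last, which is why a block can be handed to other code without that code knowing what is already in it.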

+ +
push_back Is Your Friend

+One of the first things I noticed is the frequent use of the "push_back" +method on the various lists. This is so common that it is worth mentioning. +The "push_back" inserts a value into an STL list, vector, array, etc. at the +end. The method might have also been named "insert_tail" or "append". +Although I've used STL quite frequently, my use of push_back wasn't very +high in other programs. In LLVM, you'll use it all the time. +
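For readers who have not used it much either, here is a minimal sketch of push_back on two standard containers. The helper name appendCounting is made up for this example.

```cpp
#include <cassert>
#include <list>
#include <vector>

// Appends the values 1..n to any container that offers push_back.
template <class Container>
void appendCounting(Container& c, int n) {
    for (int i = 1; i <= n; ++i)
        c.push_back(i);
}
```

The same helper works unchanged for std::vector<int> and std::list<int>, because both provide push_back with the same "add at the end" meaning.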

+ +
The Wily GetElementPtrInst

+It took a little getting used to and several rounds of postings to the LLVM +mailing list to wrap my head around this instruction correctly. Even though I had +read the Language Reference and Programmer's Manual a couple of times each, I still +missed a few very key points: +

+ +

This means that when you look up an element in the global variable (assuming +it's a struct or array), you must dereference the pointer first! For many +things, this leads to the idiom: +


+std::vector<Value*> index_vector;
+// The first index (0) steps through the pointer itself
+index_vector.push_back( ConstantSInt::get( Type::LongTy, 0 ) );
+// ... push other indices ...
+GetElementPtrInst* gep = new GetElementPtrInst( ptr, index_vector );

For example, suppose we have a global variable whose type is [24 x int]. The +variable itself represents a pointer to that array. To subscript the +array, we need two indices, not just one. The first index (0) dereferences the +pointer. The second index subscripts the array. If you're a C programmer, this +will run against your grain because you'll naturally think of the global array +variable and the address of its first element as the same. That tripped me up +for a while until I realized that they really do differ ... by type. +Remember that LLVM is a strongly typed language itself. Absolutely everything +has a type. The "type" of the global variable is [24 x int]*. That is, it's +a pointer to an array of 24 ints. When you dereference that global variable with +a single index, you now have a "[24 x int]" type; the pointer is gone. Although +the pointer value of the dereferenced global and the address of the zero'th element +in the array will be the same, they differ in their type. The zero'th element has +type "int" while the pointer value has type "[24 x int]".


Get this one aspect of LLVM right in your head and you'll save yourself +a lot of compiler writing headaches down the road.
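The "same address, different type" point has a close analogue in plain C++, which may help it sink in. The sketch below is only an analogy and does not use LLVM: table, Array24, whole() and element() are names invented for this example, with a pointer to the whole array standing in for [24 x int]*.

```cpp
#include <cassert>
#include <type_traits>

int table[24];  // plays the role of the [24 x int] global

using Array24 = int[24];

// Pointer to the whole array: the analogue of type [24 x int]*.
Array24* whole() { return &table; }

// Pointer to one element. The "*" plays the role of the first index (0),
// which steps through the pointer; [i] is the second index, which
// subscripts the array.
int* element(int i) { return &(*whole())[i]; }

// The two pointers hold the same address for i == 0 but have different types.
static_assert(!std::is_same<decltype(whole()), decltype(element(0))>::value,
              "pointer-to-array and pointer-to-element are distinct types");
```

Comparing whole() and element(0) as untyped addresses shows they are equal, yet the compiler still refuses to treat one as the other, which is the distinction the two GEP indices make explicit.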

Getting Linkage Types Right

To be completed.

Everything's a Value!

To be completed.

The Wily GetElementPtrInst

To be completed.

Constants Are Easier Than That!

To be completed.

Terminate Those Blocks!

To be completed.

new,get,create .. Its All The Same

To be completed.

Utility Functions To The Rescue

To be completed.

push_back Is Your Friend

To be completed.

Block Heads Come First

To be completed.


Linkage types in LLVM can be a little confusing, especially if your compiler +writing mind has affixed very hard concepts to particular words like "weak", +"external", "global", "linkonce", etc. LLVM does not use the precise +definitions of say ELF or GCC even though they share common terms. To be fair, +the concepts are related and similar but not precisely the same. This can lead +you to think you know what a linkage type represents but in fact it is slightly +different. I recommend you read the + Language Reference on this topic very +carefully.


Here are some handy tips that I discovered along the way:

+ +
+ +
Constants Are Easier Than That!

+Constants in LLVM took a little getting used to until I discovered a few utility +functions in the LLVM IR that make things easier. Here's what I learned:

+ +
The Stacker Lexicon
The Stack
@@ -184,7 +443,7 @@ depending on what they do. The groups are as follows:

their operands.
The words are: ABS NEG + - * / MOD */ ++ -- MIN MAX
  • StackThese words manipulate the stack directly by moving its elements around.
  • -
  • Memory>These words allocate, free and manipulate memory +
  • MemoryThese words allocate, free and manipulate memory areas outside the stack.
    The words are: MALLOC FREE GET PUT
  • ControlThese words alter the normal left to right flow of execution.
  • @@ -696,39 +955,19 @@ using the following construction:

    Directory Structure
    Prime: A Complete Example

    The source code, test programs, and sample programs can all be found -under the LLVM "projects" directory. You will need to obtain the LLVM sources -to find it (either via anonymous CVS or a tarball. See the -Getting Started document).


    Under the "projects" directory there is a directory named "stacker". That -directory contains everything, as follows:

    - -
    Prime: A Complete Example

    The following fully documented program highlights many of features of both -the Stacker language and what is possible with LLVM. The program simply -prints out the prime numbers until it reaches +

    The following fully documented program highlights many features of both +the Stacker language and what is possible with LLVM. The program has two modes +of operation. If you provide numeric arguments to the program, it checks to see +if those arguments are prime numbers and prints out the results. Without any +arguments, the program prints out any prime numbers it finds between 1 and one +million (there's a lot of them!). The source code comments below tell the +remainder of the story.


    - ################################################################################ # # Brute force prime number generator @@ -964,19 +1203,68 @@ prints out the prime numbers until it reaches ENDIF 0 ( push return code ) ; -]]> -

    - -

    To be completed.

    The Lexer
    The Parser
    The Compiler
    The Stack
    Definitions Are Functions
    Words Are BasicBlocks
    + +

    This section is under construction. +

    In the meantime, you can always read the code! It has comments!

    + + +

    The source code, test programs, and sample programs can all be found +under the LLVM "projects" directory. You will need to obtain the LLVM sources +to find it (either via anonymous CVS or a tarball; see the +Getting Started document).


    Under the "projects" directory there is a directory named "stacker". That +directory contains everything, as follows:

    • lib - contains most of the source code +
      • lib/compiler - contains the compiler library +
      • lib/runtime - contains the runtime library +
    • +
    • test - contains the test programs
    • +
    • tools - contains the Stacker compiler main program, stkrc +
      • tools/stkrc - contains the Stacker compiler main program + +
      • sample - contains the sample programs
      • +
    + +
    The Lexer

    See projects/Stacker/lib/compiler/Lexer.l


    + +
    The Parser

    See projects/Stacker/lib/compiler/StackerParser.y


    + +
    The Compiler

    See projects/Stacker/lib/compiler/StackerCompiler.cpp


    + +
    The Runtime

    See projects/Stacker/lib/runtime/stacker_rt.c


    + +
    Compiler Driver

    See projects/Stacker/tools/stkrc/stkrc.cpp


    + +
    Test Programs

    See projects/Stacker/test/*.st
