内容简介:Inpart 2, we completed the user facing part of the compiler with semantic analysis. Now we're free to focus on the backend, generating LLVM. To do so, we'll use theSome things to keep in mind when writing LLVM generating code:The following is a simplified
Inpart 2, we completed the user facing part of the compiler with semantic analysis. Now we're free to focus on the backend, generating LLVM. To do so, we'll use the llvm-hs-pure library, which embeds LLVM into a Haskell ADT, allowing us to manipulate it without having to set up any FFI. We will then pretty print the generated bytecode with llvm-hs-pretty and call clang
on it to generate machine code for our preferred target.
Some things to keep in mind when writing LLVM generating code:
-
There is very little type safety. The usual haskell guarantees of "if it typechecks it probably works" do not apply here at all.After all, we are writing basically writing assembly, so even if we get our code through the LLVM type checker, there's still a good chance that it won't do what we want it to.This is where having a comprehensive test suite is invaluable.
-
Documentation for LLVM in general is quite scant. The official language reference manual is comprehensive, but can be quite terse and provides little in the way of examples, although this gets better all the time and is not nearly as bad now as it was a few years ago. Documentation for llvm-hs specifically is even scarcer: basically just some tiny examples and the haskell port of thekaleidoscope tutorial, which is by now very outdated. This post will hopefully help matters somewhat but it leaves many parts of the LLVM IR and ecosystem unexplored.
-
Let
clang
help you. You can run it with-emit-llvm
on a C file to see what it generates. This is helpful even when you aren't writing a C-like language as you can see how certain high-level language concepts map onto LLVM.
LLVM Basics
The following is a simplified version of how LLVM works in order to create a basic mental model. There are many omissions.
LLVM programs are split into modules containing toplevel definitions like functions and global variables. For our purposes, we are concerned with emitting a single module containing all of our functions, global variables, and typedefs for our structs. Within functions , code is contained in basic blocks. A basic block is a list of sequential instructions in SSA (Single Static Assignment) form that is terminated in a conditional or unconditional branch to another block or in a return instruction .
LLVM, unlike many forms of assembly, is typed . Its type system resembles C's in many ways, but there are some notable differences. Instead of C's confusing and target dependent integer types short
, long
, long long
, etc., LLVM has arbitrary N bit width integers denoted iN
. Also, void*
is banned in LLVM, so char*
or i8*
is used instead. LLVM also has native vector types for SIMD instructions , but those won't factor into our discussion. There are other exotic features in the type system that help optimization like poison values but we won't discuss them here.
Codegen.hs
With that out of the way, we can start writing code. This module has quite a few imports.
{-# LANGUAGE RecursiveDo #-} {-# LANGUAGE TupleSections #-} {-# LANGUAGE FlexibleContexts #-} -- We need these to write a ConvertibleStrings instance for -- ShortByteString {-# LANGUAGE MultiParamTypeClasses #-} {-# OPTIONS_GHC -fno-warn-orphans #-} module Microc.Codegen ( codegenProgram ) where import qualified LLVM.AST.IntegerPredicate as IP import qualified LLVM.AST.FloatingPointPredicate as FP import LLVM.AST ( Operand ) import qualified LLVM.AST as AST import qualified LLVM.AST.Type as AST import qualified LLVM.AST.Constant as C import LLVM.AST.Name import LLVM.AST.Typed ( typeOf ) import qualified LLVM.IRBuilder.Module as L import qualified LLVM.IRBuilder.Monad as L import qualified LLVM.IRBuilder.Instruction as L import qualified LLVM.IRBuilder.Constant as L import LLVM.Prelude ( ShortByteString ) import qualified Data.Map as M import Control.Monad.State import Data.String ( fromString ) import Microc.Sast import Microc.Ast ( Type(..) , Op(..) , Uop(..) , Bind(..) , Struct(..) ) import Data.String.Conversions import qualified Data.Text as T import Data.Text ( Text ) import Data.Word ( Word32 ) import Data.List ( find )
In llvm-hs
, values that can be passed as arguments to LLVM instructions have type Operand
. This includes basically all variables as well as declared functions. Just as in Semant
, we define an Env
type to hold information about our generated operands and how they correspond to the names of their corresponding source variables. Besides the operands, we need to keep track of all struct declarations and all string literals in the source, as we emit a unique global variable for each unique string literal.
data Env = Env { operands :: M.Map Text Operand , structs :: [ Struct ] , strings :: M.Map Text Operand } deriving (Eq, Show) registerOperand :: MonadState Env m => Text -> Operand -> m () registerOperand name op = modify $ \env -> env { operands = M.insert name op (operands env) }
Utilities
Working with LLVM bindings in other languages usually involves passing mutable builder and module context objects to all instruction-emitting functions in order to ensure that all variables have unique names and to maintain the integrity of the module. This is important to ensure that code remains in SSA form.
Of course, since we're not working in other languages, this approach of passing around mutable objects would be severely un-ergonomic, at best. Fortunately, llvm-hs
provides us with monads that emulate this behavior, ModuleBuilderT
for the module context, and IRBuilderT
for the builder object. We'll establish two type synonyms, LLVM
for generating top level entities and Codegen
for generating basic blocks.
type LLVM = L.ModuleBuilderT (State Env) type Codegen = L.IRBuilderT LLVM
We'll also write some utilities to query struct fields defined in our Env
, to convert from MicroC types to LLVM types, and to calculate the sizes of MicroC types. For structs, we emit packed fields , which is pretty bad for performance, but makes calculating sizes very easy. Note that by this phase of the compiler, we no longer report errors to the user, so if anything goes wrong, we'll just crash.
getFields :: MonadState Env m => Text -> m [Bind] getFields name = do ss <- gets structs case find (\s -> structName s == name) ss of Nothing -> error "Internal error - struct not found" Just (Struct _ binds) -> pure binds charStar :: AST.Type charStar = AST.ptr AST.i8 -- llvm-hs uses ShortByteString for names, but we want -- easy conversion to Text with cs from Data.String.Conversions instance ConvertibleStrings Text ShortByteString where convertString = fromString . T.unpack ltypeOfTyp :: MonadState Env m => Type -> m AST.Type ltypeOfTyp = \case TyVoid -> pure AST.void TyInt -> pure AST.i32 TyChar -> pure AST.i8 TyFloat -> pure AST.double TyBool -> pure AST.i1 -- (void *) is invalid LLVM Pointer TyVoid -> pure $ charStar -- special case to handle recursively defined structures -- TODO: add real cycle checking so that improperly defined -- recursive types case the compiler to hang forever Pointer (TyStruct n) -> pure $ AST.ptr (AST.NamedTypeReference (mkName $ cs ("struct." <> n))) Pointer t -> fmap AST.ptr (ltypeOfTyp t) TyStruct n -> do fields <- getFields n typs <- mapM (ltypeOfTyp . bindType) fields -- Packed structs aren't great for performance but very easy to code for now pure $ AST.StructureType { AST.isPacked = True, AST.elementTypes = typs } sizeof :: MonadState Env m => Type -> m Word32 sizeof = \case TyBool -> pure 1 TyChar -> pure 1 TyInt -> pure 4 TyFloat -> pure 8 TyVoid -> pure 0 Pointer _ -> pure 8 TyStruct n -> do fields <- getFields n sizes <- mapM (sizeof . bindType) fields pure (sum sizes)
Expressions
LVals
Now, we're ready to generate code for expressions. First, LVal
s. When generating an LVal
, we generate an Operand
corresponding to the address of the value. That way, we can use it as an argument to the store instruction . For variables, we simply look up the variable name in the Env
.
codegenLVal :: LValue -> Codegen Operand codegenLVal (SId name) = gets ((M.! name) . operands)
Since we are generating addresses, dereferencing is essentially the inverse of this, so we just generate code for the underlying expression.
codegenLVal (SDeref e ) = codegenSexpr e
For struct access, we get to use the fascinating (read: confusing) getelementptr instruction . The instruction only calculates addresses, it doesn't load memory, so it's a perfect fit for the semantics of codegenLVal
. We generate the address of the left hand side of the access and then have to pass two arguments to gep
, a zero to access the memory pointed to by the address we just calculated, then the offset of the struct field we want to access, which we calculated in semant. Note that getelementptr
handles calculating alignment, so we don't need to do it ourselves.
codegenLVal (SAccess e i) = do e' <- codegenLVal e let zero = L.int32 0 offset = L.int32 (fromIntegral i) L.gep e' [zero, offset]
Literals
Most literals, as usual, are straightforward.
codegenSexpr :: SExpr -> Codegen Operand codegenSexpr (TyInt , SLiteral i ) = pure $ L.int32 (fromIntegral i) codegenSexpr (TyFloat, SFliteral f) = pure $ L.double f codegenSexpr (TyBool , SBoolLit b ) = pure $ L.bit (if b then 1 else 0) codegenSexpr (TyChar , SCharLit c ) = pure $ L.int8 (fromIntegral c)
Strings, however, are not. We look up the string literal in the Env
to see if we've generated a global variable for it before. If so, we just return that. Otherwise, we use globalStringPtr
to generate a pointer to a global string variable. We name each variable "0.str", "1.str" etc., since mkName
crashes with non-ASCII input, which we haven't explicitly forbidden in our string literals. Note that globalStringPtr
returns a Constant
which is distinct from an Operand
, so we need to promote it with AST.ConstantOperand
.
codegenSexpr (Pointer TyChar, SStrLit s ) = do -- Generate a new unique global variable for every string literal we see strs <- gets strings case M.lookup s strs of Nothing -> do let nm = mkName (show (M.size strs) <> ".str") op <- L.globalStringPtr (cs s) nm modify $ \env -> env { strings = M.insert s (AST.ConstantOperand op) strs } pure (AST.ConstantOperand op) Just op -> pure op
Null pointers are generated with inttoptr
.
codegenSexpr (t, SNull) = L.inttoptr (L.int64 0) =<< ltypeOfTyp t
Sizeof
is calculated with the sizeof
function we wrote earlier.
codegenSexpr (TyInt, SSizeof t) = L.int32 . fromIntegral <$> sizeof t
The &
operator finds the address of an LVal
, which is already taken care of by codegenLVal
.
codegenSexpr (_, SAddr e) = codegenLVal e
Binary operators
For assignment, we calculate the address of the left hand side, the value of the right hand side, and then store said value at the address, returning the value.
codegenSexpr (_, SAssign lhs rhs) = do rhs' <- codegenSexpr rhs lhs' <- codegenLVal lhs L.store lhs' 0 rhs' return rhs'
For the Binop
constructor, we begin by generating code for the left and right and sides.
codegenSexpr (t, SBinop op lhs rhs) = do lhs' <- codegenSexpr lhs rhs' <- codegenSexpr rhs case op of
For addition on int
s and float
s, we simply generate the corresponding machine instruction. For pointer addition, getElementPtr
takes care of calculating the offset for each pointer type so we don't have to worry about it.
Add -> case (fst lhs, fst rhs) of (Pointer _, TyInt ) -> L.gep lhs' [rhs'] (TyInt , Pointer _) -> L.gep rhs' [lhs'] (TyInt , TyInt ) -> L.add lhs' rhs' (TyFloat , TyFloat ) -> L.fadd lhs' rhs' _ -> error "Internal error - semant failed"
For pointer subtraction, we do actually have to calculate the pointer width ourselves and divide the difference in addresses by it.
Sub -> let zero = L.int64 0 in case (fst lhs, fst rhs) of (Pointer typ, Pointer typ') -> if typ' /= typ then error "Internal error - semant failed" else do lhs'' <- L.ptrtoint lhs' AST.i64 rhs'' <- L.ptrtoint rhs' AST.i64 diff <- L.sub lhs'' rhs'' width <- L.int64 . fromIntegral <$> sizeof typ L.sdiv diff width
Subtracting int
s from pointers is similar to adding them, except that we negate the int
before passing it to getElementPtr
.
(Pointer _, TyInt) -> do rhs'' <- L.sub zero rhs' L.gep lhs' [rhs'']
For int
s and float
s, we again dispatch to the corresponding machine instruction.
(TyInt , TyInt ) -> L.sub lhs' rhs' (TyFloat, TyFloat) -> L.fsub lhs' rhs' _ -> error "Internal error - semant failed"
Multiplication and division are easy.
Mult -> case t of TyInt -> L.mul lhs' rhs' TyFloat -> L.fmul lhs' rhs' _ -> error "Internal error - semant failed" Div -> case t of TyInt -> L.sdiv lhs' rhs' TyFloat -> L.fdiv lhs' rhs' _ -> error "Internal error - semant failed"
For the exponentiation operator, all remaining cases are raising int
s to int
s. We can take this opportunity to write some non-trivial LLVM and implement exponentiation as repeated multiplication directly in the IR. In haskell, the algorithm would be
-- We can obviously be more terse but this form maps better onto LLVM raise lhs rhs = go 1 rhs where go acc expt = if expt == 0 then acc else let nextAcc = lhs * acc nextExpt = expt - 1 in go nextAcc nextExpt
In order to marry SSA with conditionals, LLVM uses phi nodes. Phi nodes must all appear at the very beginning of a basic block. There cannot be any non-phi instructions preceding them. The phi instruction takes a list of pairs. The first element of each pair is a value and the second element is the label of a basic block which has an outgoing branch to the block with phi nodes.
First, we need to get the label of the enclosing block so that we can start our new block. We then set acc
and expt
to phi nodes, such that if control flow proceeds into the loop_pow
block from the enclosing scope, they are initialized to 1 and rhs
, respectively, and if control flow is from continue
, they are set to nextAcc
and nextExpt
. The if
clause is handled by issuing a condBr
if the expt
has reached 0, at which point we either return the acc
or jump back to loop_pow
.
Note that we use mdo
, courtesy of {-# LANGUAGE RecursiveDo #-}
, instead of do
, as we need to forward-reference the doneBlock
and continueBlock
s in our branch instruction. We can't define our blocks with L.block
and then branch to them because calling L.block
ends the current block and starts a new one. When using other LLVM bindings, one usually has to create all of the blocks and then manually position the builder at the correct location before emitting instructions. However, haskell's laziness allows us to avoid this inelegance and write branching code much more naturally.
Power -> mdo enclosing <- L.currentBlock L.br loop loop <- L.block `L.named` "loop_pow" acc <- L.phi [(L.int32 1, enclosing), (nextAcc, continueBlock)] `L.named` "acc" expt <- L.phi [(rhs', enclosing), (nextExpt, continueBlock)] `L.named` "expt" done <- L.icmp IP.EQ expt (L.int32 0) L.condBr done doneBlock continueBlock continueBlock <- L.block `L.named` "continue" nextAcc <- L.mul acc lhs' `L.named` "next_acc" nextExpt <- L.sub expt (L.int32 1) `L.named` "next_expt" L.br loop doneBlock <- L.block `L.named` "done" pure acc
(It is left as an exercise for the reader to implement amore efficientexponentiation algorithm in LLVM.)
The remaining binary operators all map directly onto their LLVM counterparts.
Equal -> case fst lhs of TyInt -> L.icmp IP.EQ lhs' rhs' TyBool -> L.icmp IP.EQ lhs' rhs' TyChar -> L.icmp IP.EQ lhs' rhs' Pointer _ -> L.icmp IP.EQ lhs' rhs' TyFloat -> L.fcmp FP.OEQ lhs' rhs' _ -> error "Internal error - semant failed" Neq -> case fst lhs of TyInt -> L.icmp IP.NE lhs' rhs' TyBool -> L.icmp IP.NE lhs' rhs' TyChar -> L.icmp IP.NE lhs' rhs' Pointer _ -> L.icmp IP.NE lhs' rhs' TyFloat -> L.fcmp FP.ONE lhs' rhs' _ -> error "Internal error - semant failed" Less -> case fst lhs of TyInt -> L.icmp IP.SLT lhs' rhs' TyBool -> L.icmp IP.SLT lhs' rhs' TyChar -> L.icmp IP.ULT lhs' rhs' TyFloat -> L.fcmp FP.OLT lhs' rhs' _ -> error "Internal error - semant failed" Leq -> case fst lhs of TyInt -> L.icmp IP.SLE lhs' rhs' TyBool -> L.icmp IP.SLE lhs' rhs' TyChar -> L.icmp IP.ULE lhs' rhs' TyFloat -> L.fcmp FP.OLE lhs' rhs' _ -> error "Internal error - semant failed" Greater -> case fst lhs of TyInt -> L.icmp IP.SGT lhs' rhs' TyBool -> L.icmp IP.SGT lhs' rhs' TyChar -> L.icmp IP.UGT lhs' rhs' TyFloat -> L.fcmp FP.OGT lhs' rhs' _ -> error "Internal error - semant failed" Geq -> case fst lhs of TyInt -> L.icmp IP.SGE lhs' rhs' TyBool -> L.icmp IP.SGE lhs' rhs' TyChar -> L.icmp IP.UGE lhs' rhs' TyFloat -> L.fcmp FP.OGE lhs' rhs' _ -> error "Internal error - semant failed" And -> L.and lhs' rhs' Or -> L.or lhs' rhs' BitAnd -> L.and lhs' rhs' BitOr -> L.or lhs' rhs'
Unary operators
There aren't any negation intrinsics in LLVM, but it's easy enough to implement them ourselves.
codegenSexpr (t, SUnop op e) = do e' <- codegenSexpr e case op of Neg -> case t of TyInt -> L.sub (L.int32 0) e' TyFloat -> L.fsub (L.double 0) e' _ -> error "Internal error - semant failed" Not -> case t of TyBool -> L.xor e' (L.bit 1) _ -> error "Internal error - semant failed"
Function calls
For function calls, we generate code for each argument, look up the function in our Env
, and then emit the call
instruction. Note that we add an empty list to each argument. LLVM allows us to emit parameter attributes attached to each argument, which we don't really care about.
codegenSexpr (_, SCall fun es) = do es' <- mapM (fmap (, []) . codegenSexpr) es f <- gets ((M.! fun) . operands) L.call f es'
Casts
For casts from type t
to t'
, we simply use the corresponding instruction.
codegenSexpr (_, SCast t' e@(t, _)) = do e' <- codegenSexpr e llvmType <- ltypeOfTyp t' case (t', t) of (Pointer _, Pointer _) -> L.bitcast e' llvmType (Pointer _, TyInt ) -> L.inttoptr e' llvmType (TyInt , Pointer _) -> L.ptrtoint e' llvmType -- Signed Int to Floating Point (TyFloat , TyInt ) -> L.sitofp e' llvmType _ -> error "Internal error - semant failed"
Finally, for SNoexpr
we just generate a 0, and if something got by the semantic checker, we crash.
codegenSexpr (_, SNoexpr) = pure $ L.int32 0 -- Final catchall codegenSexpr sx = error $ "Internal error - semant failed. Invalid sexpr " <> show sx
Statements
Codegen for statements isn't too bad. In the case of naked expressions, returns, and blocks, we simply reuse the work from codegenSexpr
.
codegenStatement :: SStatement -> Codegen () codegenStatement (SExpr e) = void $ codegenSexpr e codegenStatement (SReturn e) = case e of (TyVoid, SNoexpr) -> L.retVoid _ -> L.ret =<< codegenSexpr e codegenStatement (SBlock ss) = mapM_ codegenStatement ss
For conditionals, we follow a similar strategy as we did in implementing integer exponentiation. We generate the condition, branch on it, generate statements for each alternative in the correct block, and then issue an unconditional branch to a merge
block. One subtlety that we have to keep in mind is the possibility of a return
inside of the if statement. LLVM only allows one kind of terminator in a block, so we can use the mkTerminator
helper to check if that is the case and if so, do nothing.
codegenStatement (SIf p cons alt) = mdo bool <- codegenSexpr p L.condBr bool thenBlock elseBlock thenBlock <- L.block `L.named` "then" codegenStatement cons mkTerminator $ L.br mergeBlock elseBlock <- L.block `L.named` "else" codegenStatement alt mkTerminator $ L.br mergeBlock mergeBlock <- L.block `L.named` "merge" return ()
For do while loops, we immediately branch into the while
block, generate the code for the condition and the body, then conditionally branch into either the while
or merge
blocks.
codegenStatement (SDoWhile p body) = mdo L.br whileBlock whileBlock <- L.block `L.named` "while_body" codegenStatement body continue <- codegenSexpr p mkTerminator $ L.condBr continue whileBlock mergeBlock mergeBlock <- L.block `L.named` "merge" return ()
Functions
To generate function code, we use the function
function (who said haskellers were bad at naming?!). We actually need to insert it into the Env
before generating code for it in case it calls itself recursively. Fortunately, we have our trusty mdo
. After generating the body, we have to re-insert all the strings we encountered back into the global Env
so that they can be reused across functions (this is ugly and should be refactored.)
codegenFunc :: SFunction -> LLVM () codegenFunc f = mdo registerOperand (sname f) function (function, strs) <- locally $ do retty <- ltypeOfTyp (styp f) params <- mapM mkParam (sformals f) fun <- L.function name params retty genBody strings' <- gets strings pure (fun, strings') modify $ \e -> e { strings = strs }
The L.function
call merits further discussion. It has type
:: MonadModuleBuilder m => Name -> [(Type, ParameterName)] -> Type -> ([Operand] -> IRBuilderT m () -> m Operand)
which specializes to
:: Name -> [(Type, ParameterName)] -> Type -> ([Operand] -> Codegen ()) -> LLVM Operand
(Now we understand why Codegen
and LLVM
are defined the way they are.)
The name
is easy.
where name = mkName (cs $ sname f)
To make parameters, we just find the corresponding LLVM type and suggest the name as it appears in the source file.
mkParam (Bind t n) = (,) <$> ltypeOfTyp t <*> pure (L.ParameterName (cs n))
To generate the body, we first create an entry
block.
genBody :: [Operand] -> Codegen () genBody ops = do _entry <- L.block `L.named` "entry"
Then, for each of the Operand
s that the function takes, we allocate space on the stack with alloca
, store that Operand
in that memory, and register the memory in our Env
.
forM_ (zip ops (sformals f)) $ \(op, Bind _ n) -> do -- typeOf is defined in LLVM.AST.Typed addr <- L.alloca (typeOf op) Nothing 0 L.store addr 0 op registerOperand n addr
Local variables are treated similarly, except that we can leave them as uninitialized memory.
forM_ (slocals f) $ \(Bind t n) -> do ltype <- ltypeOfTyp t addr <- L.alloca ltype Nothing 0 registerOperand n addr
Finally, we generate the body of the function.
codegenStatement (sbody f)
For built in functions, we can use extern
to indicate to the linker to insert them into the final program.
emitBuiltIn :: (String, [AST.Type], AST.Type) -> LLVM () emitBuiltIn (name, argtys, retty) = do func <- L.extern (mkName name) argtys retty registerOperand (cs name) func -- Printf has varargs so we treat it separately builtIns :: [(String, [AST.Type], AST.Type)] builtIns = [ ("printbig" , [AST.i32] , AST.void) , ("llvm.pow.f64" , [AST.double, AST.double], AST.double) , ("llvm.powi.i32", [AST.double, AST.i32] , AST.double) , ("malloc" , [AST.i32] , AST.ptr AST.i8) , ("free" , [AST.ptr AST.i8] , AST.void) ]
Globals
For global variables, we simply call global
with a dummy 0 initial value and add the variable to our Env
.
codegenGlobal :: Bind -> LLVM () codegenGlobal (Bind t n) = do let name = mkName $ cs n initVal = C.Int 0 0 typ <- ltypeOfTyp t var <- L.global name typ initVal registerOperand n var
For toplevel structs, we register typedef
s at the module level with L.typedef
.
emitTypeDef :: Struct -> LLVM AST.Type emitTypeDef (Struct name _) = do typ <- ltypeOfTyp (TyStruct name) L.typedef (mkName (cs ("struct." <> name))) (Just typ)
Finally, we generate the entire SProgram
by emitting all of the builtin functions, toplevel structs, global variables, and functions sequentially.
codegenProgram :: SProgram -> AST.Module codegenProgram (structs, globals, funcs) = flip evalState (Env { operands = M.empty, structs = structs, strings = M.empty }) $ L.buildModuleT "microc" $ do printf <- L.externVarArgs (mkName "printf") [charStar] AST.i32 registerOperand "printf" printf mapM_ emitBuiltIn builtIns mapM_ emitTypeDef structs mapM_ codegenGlobal globals mapM_ codegenFunc funcs
(Full source for Codegen.hs
here .)
The runtime
There's really very little runtime to speak of, but for demonstration purposes there's a runtime.c
file that we link with all of our executables to provide the builtin functions.
#include <stdio.h> #include <stdlib.h> void printbig(int c) { // elided... }
(Full source for runtime.c
here ).
Linking
Our "linker" is just a thin wrapper around clang
. We'll create a Microc.Toplevel
module to handle the details.
module Microc.Toplevel where import LLVM.AST import LLVM.Pretty import Data.String.Conversions import Data.Text ( Text ) import qualified Data.Text.IO as T import System.IO import System.Directory import System.Process import System.Posix.Temp import Control.Exception ( bracket )
The compile
function, given a Module
, generates an executable at the supplied path. We call ppllvm
from llvm-hs-pretty
to dump the textual representation of the Module
into a file so that we can call clang
on it. We use bracket
to make sure that the build artifacts get cleaned up properly.
compile :: Module -> FilePath -> IO () compile llvmModule outfile = bracket (mkdtemp "build") removePathForcibly $ \buildDir -> withCurrentDirectory buildDir $ do -- create temporary file for "output.ll" (llvm, llvmHandle) <- mkstemps "output" ".ll" let runtime = "../src/runtime.c" -- write the llvmModule to a file T.hPutStrLn llvmHandle (cs $ ppllvm llvmModule) hClose llvmHandle -- link the runtime with the assembly callProcess "clang" ["-Wno-override-module", "-lm", llvm, runtime, "-o", "../" <> outfile]
We also provide a run
function that simply generates an executable, reads its output, then deletes it.
run :: Module -> IO Text run llvmModule = do compile llvmModule "./a.out" result <- cs <$> readProcess "./a.out" [] [] removePathForcibly "./a.out" return result
(Full source for Toplevel.hs
here .)
Now we can finally finish writing runOpts
in Main.hs
.
runOpts :: Options -> IO () runOpts (Options action infile ptype) = do program <- T.readFile infile let parseTree = case ptype of Combinator -> runParser programP infile program Generator -> Right $ parse . alexScanTokens $ T.unpack program case parseTree of Left err -> putStrLn $ errorBundlePretty err Right ast -> case action of Ast -> putDoc $ pretty ast <> "\n" _ -> case checkProgram ast of Left err -> putDoc $ pretty err <> "\n" Right sast -> let llvm = codegenProgram sast in case action of Sast -> pPrint sast LLVM -> T.putStrLn . cs . ppllvm $ llvm Compile outfile -> compile llvm outfile Run -> run llvm >>= T.putStr Ast -> error "unreachable"
(Full source for Main.hs
here .)
Testing
For testing, we write a similar runner that takes a filepath and returns what would be the result of calling mcc <path>
on it.
runFile :: FilePath -> IO Text runFile infile = do program <- T.readFile infile let parseTree = runParser programP (cs infile) program case parseTree of Left e -> return . cs $ errorBundlePretty e Right ast -> case checkProgram ast of Left err -> return . renderStrict $ layoutPretty defaultLayoutOptions (pretty err) Right sast -> run (codegenProgram sast)
We now have enough code to write passing tests for our compiler. They look very similar to the failing tests from part 2.
passing :: IO TestTree passing = do mcFiles <- findByExtension [".mc"] "tests/pass" return $ testGroup "pass" [ goldenVsString (takeBaseName mcFile) outfile (cs <$> runFile mcFile) | mcFile <- mcFiles , let outfile = replaceExtension mcFile ".golden" ] main :: IO () main = defaultMain =<< goldenTests goldenTests :: IO TestTree goldenTests = testGroup "all" <$> sequence [passing, failing, parsing]
(Full source for Testall.hs
here .)
Conclusion and Acknowledgments
The compiler is finished! In just 1500 or so lines of haskell, we've implemented a significant amount of the C programming language! Thanks to everyone who's been reading along. I've had fun writing and revisiting my old code. In particular, I'd like to thank
-
Théophile Choutriwho's been sponsoring the series
-
Moritz Kiefer for maintaining llvm-hs and approving my PR's to the library
-
Stephen EdwardsandRichard Townsend, whose compiler class I took in college; my team's project from that semester, a small, poorly-named functional array language that compiles to LLVM, is on github
以上所述就是小编给大家介绍的《Micro C, Part 3: Generating LLVM》,希望对大家有所帮助,如果大家有任何疑问请给我留言,小编会及时回复大家的。在此也非常感谢大家对 码农网 的支持!
猜你喜欢:本站部分资源来源于网络,本站转载出于传递更多信息之目的,版权归原作者或者来源机构所有,如转载稿涉及版权问题,请联系我们。
Node.js开发指南
郭家寶(BYVoid) / 人民邮电出版社 / 2012-7 / 45.00元
Node.js是一种方兴未艾的新技术,诞生于2009年。经过两年的快速变化,Node.js生态圈已经逐渐走向稳定。Node.js采用了以往类似语言和框架中非常罕见的技术,总结为关键词就是:非阻塞式控制流、异步I/O、单线程消息循环。不少开发者在入门时总要经历一个痛苦的思维转变过程,给学习带来巨大的障碍。 而本书的目的就是帮助读者扫清这些障碍,学会使用Node.js进行Web后端开发,同时掌握事件驱......一起来看看 《Node.js开发指南》 这本书的介绍吧!