summaryrefslogtreecommitdiff
path: root/sys/doc/compiler.ms
diff options
context:
space:
mode:
authoraiju <aiju@phicode.de>2011-07-18 11:01:22 +0200
committeraiju <aiju@phicode.de>2011-07-18 11:01:22 +0200
commit8c4c1f39f4e369d7c590c9d119f1150a2215e56d (patch)
treecd430740860183fc01de1bc1ddb216ceff1f7173 /sys/doc/compiler.ms
parent11bf57fb2ceb999e314cfbe27a4e123bf846d2c8 (diff)
added /sys/doc
Diffstat (limited to 'sys/doc/compiler.ms')
-rw-r--r--sys/doc/compiler.ms1157
1 files changed, 1157 insertions, 0 deletions
diff --git a/sys/doc/compiler.ms b/sys/doc/compiler.ms
new file mode 100644
index 000000000..205de2d71
--- /dev/null
+++ b/sys/doc/compiler.ms
@@ -0,0 +1,1157 @@
+.HTML "Plan 9 C Compilers
+.TL
+Plan 9 C Compilers
+.AU
+Ken Thompson
+ken@plan9.bell-labs.com
+.AB
+.FS
+Originally appeared, in a different form, in
+.I
+Proceedings of the Summer 1990 UKUUG Conference,
+.R
+pp. 41-51,
+London, 1990.
+.FE
+This paper describes the overall structure and function of the Plan 9 C compilers.
+A more detailed implementation document
+for any one of the compilers
+is yet to be written.
+.AE
+.NH
+Introduction
+.LP
+There are many compilers in the series.
+Six of the compilers (MIPS 3000, SPARC, Intel 386, Power PC, DEC Alpha, and Motorola 68020)
+are considered active and are used to compile
+current versions of Plan 9.
+Several others (Motorola 68000, Intel 960, ARM 7500, AMD 29000) have had only limited use, such as
+to program peripherals or experimental devices.
+.NH
+Structure
+.LP
+The compiler is a single program that produces an
+object file.
+Combined in the compiler are the traditional
+roles of preprocessor, lexical analyzer, parser, code generator,
+local optimizer,
+and first half of the assembler.
+The object files are binary forms of assembly
+language,
+similar to what might be passed between
+the first and second passes of an assembler.
+.LP
+Object files and libraries
+are combined by a loader
+program to produce the executable binary.
+The loader combines the roles of second half
+of the assembler, global optimizer, and loader.
+The names of the compliers, loaders, and assemblers
+are as follows:
+.DS
+.ta 1.5i
+.de Ta
+\\$1 \f(CW\\$2\fP \f(CW\\$3\fP \f(CW\\$4\fP
+..
+.Ta SPARC kc kl ka
+.Ta Power\ PC qc ql qa
+.Ta MIPS vc vl va
+.Ta Motorola\ 68000 1c 1l 1a
+.Ta Motorola\ 68020 2c 2l 2a
+.Ta ARM\ 7500 5c 5l 5a
+.Ta Intel\ 960 6c 6l 6a
+.Ta DEC\ Alpha 7c 7l 7a
+.Ta Intel\ 386 8c 8l 8a
+.Ta AMD\ 29000 9c 9l 9a
+.DE
+There is a further breakdown
+in the source of the compilers into
+object-independent and
+object-dependent
+parts.
+All of the object-independent parts
+are combined into source files in the
+directory
+.CW /sys/src/cmd/cc .
+The object-dependent parts are collected
+in a separate directory for each compiler,
+for example
+.CW /sys/src/cmd/vc .
+All of the code,
+both object-independent and
+object-dependent,
+is machine-independent
+and may be cross-compiled and executed on any
+of the architectures.
+.NH
+The Language
+.LP
+The compiler implements ANSI C with some
+restrictions and extensions
+[ANSI90].
+Most of the restrictions are due to
+personal preference, while
+most of the extensions were to help in
+the implementation of Plan 9.
+There are other departures from the standard,
+particularly in the libraries,
+that are beyond the scope of this
+paper.
+.NH 2
+Register, volatile, const
+.LP
+The keyword
+.CW register
+is recognized syntactically
+but is semantically ignored.
+Thus taking the address of a
+.CW register
+variable is not diagnosed.
+The keyword
+.CW volatile
+disables all optimizations, in particular registerization, of the corresponding variable.
+The keyword
+.CW const
+generates warnings (if warnings are enabled by the compiler's
+.CW -w
+option) of non-constant use of the variable,
+but does not affect the generated code.
+.NH 2
+The preprocessor
+.LP
+The C preprocessor is probably the
+biggest departure from the ANSI standard.
+.LP
+The preprocessor built into the Plan 9 compilers does not support
+.CW #if ,
+although it does handle
+.CW #ifdef
+and
+.CW #include .
+If it is necessary to be more standard,
+the source text can first be run through the separate ANSI C
+preprocessor,
+.CW cpp .
+.NH 2
+Unnamed substructures
+.LP
+The most important and most heavily used of the
+extensions is the declaration of an
+unnamed substructure or subunion.
+For example:
+.DS
+.CW
+.ta .1i .6i 1.1i 1.6i
+ typedef
+ struct lock
+ {
+ int locked;
+ } Lock;
+
+ typedef
+ struct node
+ {
+ int type;
+ union
+ {
+ double dval;
+ float fval;
+ long lval;
+ };
+ Lock;
+ } Node;
+
+ Lock* lock;
+ Node* node;
+.DE
+The declaration of
+.CW Node
+has an unnamed substructure of type
+.CW Lock
+and an unnamed subunion.
+One use of this feature allows references to elements of the
+subunit to be accessed as if they were in
+the outer structure.
+Thus
+.CW node->dval
+and
+.CW node->locked
+are legitimate references.
+.LP
+When an outer structure is used
+in a context that is only legal for
+an unnamed substructure,
+the compiler promotes the reference to the
+unnamed substructure.
+This is true for references to structures and
+to references to pointers to structures.
+This happens in assignment statements and
+in argument passing where prototypes have been
+declared.
+Thus, continuing with the example,
+.DS
+.CW
+.ta .1i .6i 1.1i 1.6i
+ lock = node;
+.DE
+would assign a pointer to the unnamed
+.CW Lock
+in
+the
+.CW Node
+to the variable
+.CW lock .
+Another example,
+.DS
+.CW
+.ta .1i .6i 1.1i 1.6i
+ extern void lock(Lock*);
+ func(...)
+ {
+ ...
+ lock(node);
+ ...
+ }
+.DE
+will pass a pointer to the
+.CW Lock
+substructure.
+.LP
+Finally, in places where context is insufficient to identify the unnamed structure,
+the type name (it must be a
+.CW typedef )
+of the unnamed structure can be used as an identifier.
+In our example,
+.CW &node->Lock
+gives the address of the anonymous
+.CW Lock
+structure.
+.NH 2
+Structure displays
+.LP
+A structure cast followed by a list of expressions in braces is
+an expression with the type of the structure and elements assigned from
+the corresponding list.
+Structures are now almost first-class citizens of the language.
+It is common to see code like this:
+.DS
+.CW
+.ta .1i
+ r = (Rectangle){point1, (Point){x,y+2}};
+.DE
+.NH 2
+Initialization indexes
+.LP
+In initializers of arrays,
+one may place a constant expression
+in square brackets before an initializer.
+This causes the next initializer to assign
+the indicated element.
+For example:
+.DS
+.CW
+.ta .1i .6i 1.6i
+ enum errors
+ {
+ Etoobig,
+ Ealarm,
+ Egreg
+ };
+ char* errstrings[] =
+ {
+ [Ealarm] "Alarm call",
+ [Egreg] "Panic: out of mbufs",
+ [Etoobig] "Arg list too long",
+ };
+.DE
+In the same way,
+individual structures members may
+be initialized in any order by preceding the initialization with
+.CW .tagname .
+Both forms allow an optional
+.CW = ,
+to be compatible with a proposed
+extension to ANSI C.
+.NH 2
+External register
+.LP
+The declaration
+.CW extern
+.CW register
+will dedicate a register to
+a variable on a global basis.
+It can be used only under special circumstances.
+External register variables must be identically
+declared in all modules and
+libraries.
+The feature is not intended for efficiency,
+although it can produce efficient code;
+rather it represents a unique storage class that
+would be hard to get any other way.
+On a shared-memory multi-processor,
+an external register is
+one-per-processor and neither one-per-procedure (automatic)
+or one-per-system (external).
+It is used for two variables in the Plan 9 kernel,
+.CW u
+and
+.CW m .
+.CW U
+is a pointer to the structure representing the currently running process
+and
+.CW m
+is a pointer to the per-machine data structure.
+.NH 2
+Long long
+.LP
+The compilers accept
+.CW long
+.CW long
+as a basic type meaning 64-bit integer.
+On all of the machines
+this type is synthesized from 32-bit instructions.
+.NH 2
+Pragma
+.LP
+The compilers accept
+.CW #pragma
+.CW lib
+.I libname
+and pass the
+library name string uninterpreted
+to the loader.
+The loader uses the library name to
+find libraries to load.
+If the name contains
+.CW %O ,
+it is replaced with
+the single character object type of the compiler
+(e.g.,
+.CW v
+for the MIPS).
+If the name contains
+.CW %M ,
+it is replaced with
+the architecture type for the compiler
+(e.g.,
+.CW mips
+for the MIPS).
+If the name starts with
+.CW /
+it is an absolute pathname;
+if it starts with
+.CW .
+then it is searched for in the loader's current directory.
+Otherwise, the name is searched from
+.CW /%M/lib .
+Such
+.CW #pragma
+statements in header files guarantee that the correct
+libraries are always linked with a program without the
+need to specify them explicitly at link time.
+.LP
+They also accept
+.CW #pragma
+.CW hjdicks
+.CW on
+(or
+.CW yes
+or
+.CW 1 )
+to cause subsequently declared data, until
+.CW #pragma
+.CW hjdicks
+.CW off
+(or
+.CW no
+or
+.CW 0 ),
+to be laid out in memory tightly packed in successive bytes, disregarding
+the usual alignment rules.
+Accessing such data can cause faults.
+.LP
+Similarly,
+.CW #pragma
+.CW profile
+.CW off
+(or
+.CW no
+or
+.CW 0 )
+causes subsequently declared functions, until
+.CW #pragma
+.CW profile
+.CW on
+(or
+.CW yes
+or
+.CW 1 ),
+to be marked as unprofiled.
+Such functions will not be profiled when
+profiling is enabled for the rest of the program.
+.LP
+Two
+.CW #pragma
+statements allow type-checking of
+.CW print -like
+functions.
+The first, of the form
+.P1
+#pragma varargck argpos error 2
+.P2
+tells the compiler that the second argument to
+.CW error
+is a
+.CW print
+format string (see the manual page
+.I print (2))
+that specifies how to format
+.CW error 's
+subsequent arguments.
+The second, of the form
+.P1
+#pragma varargck type "s" char*
+.P2
+says that the
+.CW print
+format verb
+.CW s
+processes an argument of
+type
+.CW char* .
+If the compiler's
+.CW -F
+option is enabled, the compiler will use this information
+to report type violations in the arguments to
+.CW print ,
+.CW error ,
+and similar routines.
+.NH
+Object module conventions
+.LP
+The overall conventions of the runtime environment
+are important
+to runtime efficiency.
+In this section,
+several of these conventions are discussed.
+.NH 2
+Register saving
+.LP
+In the Plan 9 compilers,
+the caller of a procedure saves the registers.
+With caller-saves,
+the leaf procedures can use all the
+registers and never save them.
+If you spend a lot of time at the leaves,
+this seems preferable.
+With callee-saves,
+the saving of the registers is done
+in the single point of entry and return.
+If you are interested in space,
+this seems preferable.
+In both,
+there is a degree of uncertainty
+about what registers need to be saved.
+Callee-saved registers make it difficult to
+find variables in registers in debuggers.
+Callee-saved registers also complicate
+the implementation of
+.CW longjmp .
+The convincing argument is
+that with caller-saves,
+the decision to registerize a variable
+can include the cost of saving the register
+across calls.
+For a further discussion of caller- vs. callee-saves,
+see the paper by Davidson and Whalley [Dav91].
+.LP
+In the Plan 9 operating system,
+calls to the kernel look like normal procedure
+calls, which means
+the caller
+has saved the registers and the system
+entry does not have to.
+This makes system calls considerably faster.
+Since this is a potential security hole,
+and can lead to non-determinism,
+the system may eventually save the registers
+on entry,
+or more likely clear the registers on return.
+.NH 2
+Calling convention
+.LP
+Older C compilers maintain a frame pointer, which is at a known constant
+offset from the stack pointer within each function.
+For machines where the stack grows towards zero,
+the argument pointer is at a known constant offset
+from the frame pointer.
+Since the stack grows down in Plan 9,
+the Plan 9 compilers
+keep neither an
+explicit frame pointer nor
+an explicit argument pointer;
+instead they generate addresses relative to the stack pointer.
+.LP
+On some architectures, the first argument to a subroutine is passed in a register.
+.NH 2
+Functions returning structures
+.LP
+Structures longer than one word are awkward to implement
+since they do not fit in registers and must
+be passed around in memory.
+Functions that return structures
+are particularly clumsy.
+The Plan 9 compilers pass the return address of
+a structure as the first argument of a
+function that has a structure return value.
+Thus
+.DS
+.CW
+.ta .1i .6i 1.1i 1.6i
+ x = f(...)
+.DE
+is rewritten as
+.DS
+.CW
+.ta .1i .6i 1.1i 1.6i
+ f(&x, ...)\f1.
+.DE
+This saves a copy and makes the compilation
+much less clumsy.
+A disadvantage is that if you call this
+function without an assignment,
+a dummy location must be invented.
+.LP
+There is also a danger of calling a function
+that returns a structure without declaring
+it as such.
+With ANSI C function prototypes,
+this error need never occur.
+.NH
+Implementation
+.LP
+The compiler is divided internally into
+four machine-independent passes,
+four machine-dependent passes,
+and an output pass.
+The next nine sections describe each pass in order.
+.NH 2
+Parsing
+.LP
+The first pass is a YACC-based parser
+[Joh79].
+Declarations are interpreted immediately,
+building a block structured symbol table.
+Executable statements are put into a parse tree
+and collected,
+without interpretation.
+At the end of each procedure,
+the parse tree for the function is
+examined by the other passes of the compiler.
+.LP
+The input stream of the parser is
+a pushdown list of input activations.
+The preprocessor
+expansions of
+macros
+and
+.CW #include
+are implemented as pushdowns.
+Thus there is no separate
+pass for preprocessing.
+.NH 2
+Typing
+.LP
+The next pass distributes typing information
+to every node of the tree.
+Implicit operations on the tree are added,
+such as type promotions and taking the
+address of arrays and functions.
+.NH 2
+Machine-independent optimization
+.LP
+The next pass performs optimizations
+and transformations of the tree, such as converting
+.CW &*x
+and
+.CW *&x
+into
+.CW x .
+Constant expressions are converted to constants in this pass.
+.NH 2
+Arithmetic rewrites
+.LP
+This is another machine-independent optimization.
+Subtrees of add, subtract, and multiply of integers are
+rewritten for easier compilation.
+The major transformation is factoring:
+.CW 4+8*a+16*b+5
+is transformed into
+.CW 9+8*(a+2*b) .
+Such expressions arise from address
+manipulation and array indexing.
+.NH 2
+Addressability
+.LP
+This is the first of the machine-dependent passes.
+The addressability of a processor is defined as the set of
+expressions that is legal in the address field
+of a machine language instruction.
+The addressability of different processors varies widely.
+At one end of the spectrum are the 68020 and VAX,
+which allow a complex mix of incrementing,
+decrementing,
+indexing, and relative addressing.
+At the other end is the MIPS,
+which allows only registers and constant offsets from the
+contents of a register.
+The addressability can be different for different instructions
+within the same processor.
+.LP
+It is important to the code generator to know when a
+subtree represents an address of a particular type.
+This is done with a bottom-up walk of the tree.
+In this pass, the leaves are labeled with small integers.
+When an internal node is encountered,
+it is labeled by consulting a table indexed by the
+labels on the left and right subtrees.
+For example,
+on the 68020 processor,
+it is possible to address an
+offset from a named location.
+In C, this is represented by the expression
+.CW *(&name+constant) .
+This is marked addressable by the following table.
+In the table,
+a node represented by the left column is marked
+with a small integer from the right column.
+Marks of the form
+.CW A\s-2\di\u\s0
+are addressable while
+marks of the form
+.CW N\s-2\di\u\s0
+are not addressable.
+.DS
+.B
+.ta .1i 1.1i
+ Node Marked
+.CW
+ name A\s-2\d1\u\s0
+ const A\s-2\d2\u\s0
+ &A\s-2\d1\u\s0 A\s-2\d3\u\s0
+ A\s-2\d3\u\s0+A\s-2\d1\u\s0 N\s-2\d1\u\s0 \fR(note that this is not addressable)\fP
+ *N\s-2\d1\u\s0 A\s-2\d4\u\s0
+.DE
+Here there is a distinction between
+a node marked
+.CW A\s-2\d1\u\s0
+and a node marked
+.CW A\s-2\d4\u\s0
+because the address operator of an
+.CW A\s-2\d4\u\s0
+node is not addressable.
+So to extend the table:
+.DS
+.B
+.ta .1i 1.1i
+ Node Marked
+.CW
+ &A\s-2\d4\u\s0 N\s-2\d2\u\s0
+ N\s-2\d2\u\s0+N\s-2\d1\u\s0 N\s-2\d1\u\s0
+.DE
+The full addressability of the 68020 is expressed
+in 18 rules like this,
+while the addressability of the MIPS is expressed
+in 11 rules.
+When one ports the compiler,
+this table is usually initialized
+so that leaves are labeled as addressable and nothing else.
+The code produced is poor,
+but porting is easy.
+The table can be extended later.
+.LP
+This pass also rewrites some complex operators
+into procedure calls.
+Examples include 64-bit multiply and divide.
+.LP
+In the same bottom-up pass of the tree,
+the nodes are labeled with a Sethi-Ullman complexity
+[Set70].
+This number is roughly the number of registers required
+to compile the tree on an ideal machine.
+An addressable node is marked 0.
+A function call is marked infinite.
+A unary operator is marked as the
+maximum of 1 and the mark of its subtree.
+A binary operator with equal marks on its subtrees is
+marked with a subtree mark plus 1.
+A binary operator with unequal marks on its subtrees is
+marked with the maximum mark of its subtrees.
+The actual values of the marks are not too important,
+but the relative values are.
+The goal is to compile the harder
+(larger mark)
+subtree first.
+.NH 2
+Code generation
+.LP
+Code is generated by recursive
+descent.
+The Sethi-Ullman complexity completely guides the
+order.
+The addressability defines the leaves.
+The only difficult part is compiling a tree
+that has two infinite (function call)
+subtrees.
+In this case,
+one subtree is compiled into the return register
+(usually the most convenient place for a function call)
+and then stored on the stack.
+The other subtree is compiled into the return register
+and then the operation is compiled with
+operands from the stack and the return register.
+.LP
+There is a separate boolean code generator that compiles
+conditional expressions.
+This is fundamentally different from compiling an arithmetic expression.
+The result of the boolean code generator is the
+position of the program counter and not an expression.
+The boolean code generator makes extensive use of De Morgan's rule.
+The boolean code generator is an expanded version of that described
+in chapter 8 of Aho, Sethi, and Ullman
+[Aho87].
+.LP
+There is a considerable amount of talk in the literature
+about automating this part of a compiler with a machine
+description.
+Since this code generator is so small
+(less than 500 lines of C)
+and easy,
+it hardly seems worth the effort.
+.NH 2
+Registerization
+.LP
+Up to now,
+the compiler has operated on syntax trees
+that are roughly equivalent to the original source language.
+The previous pass has produced machine language in an internal
+format.
+The next two passes operate on the internal machine language
+structures.
+The purpose of the next pass is to reintroduce
+registers for heavily used variables.
+.LP
+All of the variables that can be
+potentially registerized within a procedure are
+placed in a table.
+(Suitable variables are any automatic or external
+scalars that do not have their addresses extracted.
+Some constants that are hard to reference are also
+considered for registerization.)
+Four separate data flow equations are evaluated
+over the procedure on all of these variables.
+Two of the equations are the normal set-behind
+and used-ahead
+bits that define the life of a variable.
+The two new bits tell if a variable life
+crosses a function call ahead or behind.
+By examining a variable over its lifetime,
+it is possible to get a cost
+for registerizing.
+Loops are detected and the costs are multiplied
+by three for every level of loop nesting.
+Costs are sorted and the variables
+are replaced by available registers on a greedy basis.
+.LP
+The 68020 has two different
+types of registers.
+For the 68020,
+two different costs are calculated for
+each variable life and the register type that
+affords the better cost is used.
+Ties are broken by counting the number of available
+registers of each type.
+.LP
+Note that externals are registerized together with automatics.
+This is done by evaluating the semantics of a ``call'' instruction
+differently for externals and automatics.
+Since a call goes outside the local procedure,
+it is assumed that a call references all externals.
+Similarly,
+externals are assumed to be set before an ``entry'' instruction
+and assumed to be referenced after a ``return'' instruction.
+This makes sure that externals are in memory across calls.
+.LP
+The overall results are satisfactory.
+It would be nice to be able to do this processing in
+a machine-independent way,
+but it is impossible to get all of the costs and
+side effects of different choices by examining the parse tree.
+.LP
+Most of the code in the registerization pass is machine-independent.
+The major machine-dependency is in
+examining a machine instruction to ask if it sets or references
+a variable.
+.NH 2
+Machine code optimization
+.LP
+The next pass walks the machine code
+for opportunistic optimizations.
+For the most part,
+this is highly specific to a particular
+processor.
+One optimization that is performed
+on all of the processors is the
+removal of unnecessary ``move''
+instructions.
+Ironically,
+most of these instructions were inserted by
+the previous pass.
+There are two patterns that are repetitively
+matched and replaced until no more matches are
+found.
+The first tries to remove ``move'' instructions
+by relabeling variables.
+.LP
+When a ``move'' instruction is encountered,
+if the destination variable is set before the
+source variable is referenced,
+then all of the references to the destination
+variable can be renamed to the source and the ``move''
+can be deleted.
+This transformation uses the reverse data flow
+set up in the previous pass.
+.LP
+An example of this pattern is depicted in the following
+table.
+The pattern is in the left column and the
+replacement action is in the right column.
+.DS
+.CW
+.ta .1i .6i 1.6i 2.1i 2.6i
+ MOVE a->b \fR(remove)\fP
+.R
+ (sequence with no mention of \f(CWa\fP)
+.CW
+ USE b USE a
+.R
+ (sequence with no mention of \f(CWa\fP)
+.CW
+ SET b SET b
+.DE
+.LP
+Experiments have shown that it is marginally
+worthwhile to rename uses of the destination variable
+with uses of the source variable up to
+the first use of the source variable.
+.LP
+The second transform will do relabeling
+without deleting instructions.
+When a ``move'' instruction is encountered,
+if the source variable has been set prior
+to the use of the destination variable
+then all of the references to the source
+variable are replaced by the destination and
+the ``move'' is inverted.
+Typically,
+this transformation will alter two ``move''
+instructions and allow the first transformation
+another chance to remove code.
+This transformation uses the forward data flow
+set up in the previous pass.
+.LP
+Again,
+the following is a depiction of the transformation where
+the pattern is in the left column and the
+rewrite is in the right column.
+.DS
+.CW
+.ta .1i .6i 1.6i 2.1i 2.6i
+ SET a SET b
+.R
+ (sequence with no use of \f(CWb\fP)
+.CW
+ USE a USE b
+.R
+ (sequence with no use of \f(CWb\fP)
+.CW
+ MOVE a->b MOVE b->a
+.DE
+Iterating these transformations
+will usually get rid of all redundant ``move'' instructions.
+.LP
+A problem with this organization is that the costs
+of registerization calculated in the previous pass
+must depend on how well this pass can detect and remove
+redundant instructions.
+Often,
+a fine candidate for registerization is rejected
+because of the cost of instructions that are later
+removed.
+.NH 2
+Writing the object file
+.LP
+The last pass walks the internal assembly language
+and writes the object file.
+The object file is reduced in size by about a factor
+of three with simple compression
+techniques.
+The most important aspect of the object file
+format is that it is independent of the compiling machine.
+All integer and floating numbers in the object
+code are converted to known formats and byte
+orders.
+.NH
+The loader
+.LP
+The loader is a multiple pass program that
+reads object files and libraries and produces
+an executable binary.
+The loader also does some minimal
+optimizations and code rewriting.
+Many of the operations performed by the
+loader are machine-dependent.
+.LP
+The first pass of the loader reads the
+object modules into an internal data
+structure that looks like binary assembly language.
+As the instructions are read,
+code is reordered to remove
+unconditional branch instructions.
+Conditional branch instructions are inverted
+to prevent the insertion of unconditional branches.
+The loader will also make a copy of a few instructions
+to remove an unconditional branch.
+.LP
+The next pass allocates addresses for
+all external data.
+Typical of processors is the MIPS,
+which can reference ±32K bytes from a
+register.
+The loader allocates the register
+.CW R30
+as the static pointer.
+The value placed in
+.CW R30
+is the base of the data segment plus 32K.
+It is then cheap to reference all data in the
+first 64K of the data segment.
+External variables are allocated to
+the data segment
+with the smallest variables allocated first.
+If all of the data cannot fit into the first
+64K of the data segment,
+then usually only a few large arrays
+need more expensive addressing modes.
+.LP
+For the MIPS processor,
+the loader makes a pass over the internal
+structures,
+exchanging instructions to try
+to fill ``delay slots'' with useful work.
+If a useful instruction cannot be found
+to fill a delay slot,
+the loader will insert
+``noop''
+instructions.
+This pass is very expensive and does not
+do a good job.
+About 40% of all instructions are in
+delay slots.
+About 65% of these are useful instructions and
+35% are ``noops.''
+The vendor-supplied assembler does this job
+more effectively,
+filling about 80%
+of the delay slots with useful instructions.
+.LP
+On the 68020 processor,
+branch instructions come in a variety of
+sizes depending on the relative distance
+of the branch.
+Thus the size of branch instructions
+can be mutually dependent.
+The loader uses a multiple pass algorithm
+to resolve the branch lengths
+[Szy78].
+Initially, all branches are assumed minimal length.
+On each subsequent pass,
+the branches are reassessed
+and expanded if necessary.
+When no more expansions occur,
+the locations of the instructions in
+the text segment are known.
+.LP
+On the MIPS processor,
+all instructions are one size.
+A single pass over the instructions will
+determine the locations of all addresses
+in the text segment.
+.LP
+The last pass of the loader produces the
+executable binary.
+A symbol table and other tables are
+produced to help the debugger to
+interpret the binary symbolically.
+.LP
+The loader places absolute source line numbers in the symbol table.
+The name and absolute line number of all
+.CW #include
+files is also placed in the
+symbol table so that the debuggers can
+associate object code to source files.
+.NH
+Performance
+.LP
+The following is a table of the source size of the MIPS
+compiler.
+.DS
+.ta .1i .6i
+ lines module
+ \0509 machine-independent headers
+ 1070 machine-independent YACC source
+ 6090 machine-independent C source
+
+ \0545 machine-dependent headers
+ 6532 machine-dependent C source
+
+ \0298 loader headers
+ 5215 loader C source
+.DE
+.LP
+The following table shows timing
+of a test program
+that plays checkers, running on a MIPS R4000.
+The test program is 26 files totaling 12600 lines of C.
+The execution time does not significantly
+depend on library implementation.
+Since no other compiler runs on Plan 9,
+the Plan 9 tests were done with the Plan 9 operating system;
+the other tests were done on the vendor's operating system.
+The hardware was identical in both cases.
+The optimizer in the vendor's compiler
+is reputed to be extremely good.
+.DS
+.ta .1i .9i
+ \0\04.49s Plan 9 \f(CWvc\fP \f(CW-N\fP compile time (opposite of \f(CW-O\fP)
+ \0\01.72s Plan 9 \f(CWvc\fP \f(CW-N\fP load time
+ 148.69s Plan 9 \f(CWvc\fP \f(CW-N\fP run time
+
+ \015.07s Plan 9 \f(CWvc\fP compile time (\f(CW-O\fP implicit)
+ \0\01.66s Plan 9 \f(CWvc\fP load time
+ \089.96s Plan 9 \f(CWvc\fP run time
+
+ \014.83s vendor \f(CWcc\fP compile time
+ \0\00.38s vendor \f(CWcc\fP load time
+ 104.75s vendor \f(CWcc\fP run time
+
+ \043.59s vendor \f(CWcc\fP \f(CW-O\fP compile time
+ \0\00.38s vendor \f(CWcc\fP \f(CW-O\fP load time
+ \076.19s vendor \f(CWcc\fP \f(CW-O\fP run time
+
+ \0\08.19s vendor \f(CWcc\fP \f(CW-O3\fP compile time
+ \035.97s vendor \f(CWcc\fP \f(CW-O3\fP load time
+ \071.16s vendor \f(CWcc\fP \f(CW-O3\fP run time
+.DE
+.LP
+To compare the Intel compiler,
+a program that is about 40% bit manipulation and
+about 60% single precision floating point was
+run on the same 33 MHz 486, once under Windows
+compiled with the Watcom compiler, version 10.0,
+in 16-bit mode and once under
+Plan 9 in 32-bit mode.
+The Plan 9 execution time was 27 sec while the Windows
+execution time was 31 sec.
+.NH
+Conclusions
+.LP
+The new compilers compile
+quickly,
+load slowly,
+and produce
+medium quality
+object code.
+The compilers are relatively
+portable,
+requiring but a couple of weeks' work to
+produce a compiler for a different computer.
+For Plan 9,
+where we needed several compilers
+with specialized features and
+our own object formats,
+this project was indispensable.
+It is also necessary for us to
+be able to freely distribute our compilers
+with the Plan 9 distribution.
+.LP
+Two problems have come up in retrospect.
+The first has to do with the
+division of labor between compiler and loader.
+Plan 9 runs on multi-processors and as such
+compilations are often done in parallel.
+Unfortunately,
+all compilations must be complete before loading
+can begin.
+The load is then single-threaded.
+With this model,
+any shift of work from compile to load
+results in a significant increase in real time.
+The same is true of libraries that are compiled
+infrequently and loaded often.
+In the future,
+we may try to put some of the loader work
+back into the compiler.
+.LP
+The second problem comes from
+the various optimizations performed over several
+passes.
+Often optimizations in different passes depend
+on each other.
+Iterating the passes could compromise efficiency,
+or even loop.
+We see no real solution to this problem.
+.NH
+References
+.LP
+[Aho87] A. V. Aho, R. Sethi, and J. D. Ullman,
+.I
+Compilers \- Principles, Techniques, and Tools,
+.R
+Addison Wesley,
+Reading, MA,
+1987.
+.LP
+[ANSI90] \f2American National Standard for Information Systems \-
+Programming Language C\f1, American National Standards Institute, Inc.,
+New York, 1990.
+.LP
+[Dav91] J. W. Davidson and D. B. Whalley,
+``Methods for Saving and Restoring Register Values across Function Calls'',
+.I
+Software\-Practice and Experience,
+.R
+Vol 21(2), pp. 149-165, February 1991.
+.LP
+[Joh79] S. C. Johnson,
+``YACC \- Yet Another Compiler Compiler'',
+.I
+UNIX Programmer's Manual, Seventh Ed., Vol. 2A,
+.R
+AT&T Bell Laboratories,
+Murray Hill, NJ,
+1979.
+.LP
+[Set70] R. Sethi and J. D. Ullman,
+``The Generation of Optimal Code for Arithmetic Expressions'',
+.I
+Journal of the ACM,
+.R
+Vol 17(4), pp. 715-728, 1970.
+.LP
+[Szy78] T. G. Szymanski,
+``Assembling Code for Machines with Span-dependent Instructions'',
+.I
+Communications of the ACM,
+.R
+Vol 21(4), pp. 300-308, 1978.