summaryrefslogtreecommitdiff
path: root/sys/doc/compiler.ms
blob: 205de2d71d3470135d9adb0bceec6e7fc486324a (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
1001
1002
1003
1004
1005
1006
1007
1008
1009
1010
1011
1012
1013
1014
1015
1016
1017
1018
1019
1020
1021
1022
1023
1024
1025
1026
1027
1028
1029
1030
1031
1032
1033
1034
1035
1036
1037
1038
1039
1040
1041
1042
1043
1044
1045
1046
1047
1048
1049
1050
1051
1052
1053
1054
1055
1056
1057
1058
1059
1060
1061
1062
1063
1064
1065
1066
1067
1068
1069
1070
1071
1072
1073
1074
1075
1076
1077
1078
1079
1080
1081
1082
1083
1084
1085
1086
1087
1088
1089
1090
1091
1092
1093
1094
1095
1096
1097
1098
1099
1100
1101
1102
1103
1104
1105
1106
1107
1108
1109
1110
1111
1112
1113
1114
1115
1116
1117
1118
1119
1120
1121
1122
1123
1124
1125
1126
1127
1128
1129
1130
1131
1132
1133
1134
1135
1136
1137
1138
1139
1140
1141
1142
1143
1144
1145
1146
1147
1148
1149
1150
1151
1152
1153
1154
1155
1156
1157
.HTML "Plan 9 C Compilers
.TL
Plan 9 C Compilers
.AU
Ken Thompson
ken@plan9.bell-labs.com
.AB
.FS
Originally appeared, in a different form, in
.I
Proceedings of the Summer 1990 UKUUG Conference,
.R
pp. 41-51,
London, 1990.
.FE
This paper describes the overall structure and function of the Plan 9 C compilers.
A more detailed implementation document
for any one of the compilers
is yet to be written.
.AE
.NH
Introduction
.LP
There are many compilers in the series.
Six of the compilers (MIPS 3000, SPARC, Intel 386, Power PC, DEC Alpha, and Motorola 68020)
are considered active and are used to compile
current versions of Plan 9.
Several others (Motorola 68000, Intel 960, ARM 7500, AMD 29000) have had only limited use, such as
to program peripherals or experimental devices.
.NH
Structure
.LP
The compiler is a single program that produces an
object file.
Combined in the compiler are the traditional
roles of preprocessor, lexical analyzer, parser, code generator,
local optimizer,
and first half of the assembler.
The object files are binary forms of assembly
language,
similar to what might be passed between
the first and second passes of an assembler.
.LP
Object files and libraries
are combined by a loader
program to produce the executable binary.
The loader combines the roles of second half
of the assembler, global optimizer, and loader.
The names of the compliers, loaders, and assemblers
are as follows:
.DS
.ta 1.5i
.de Ta
\\$1	\f(CW\\$2\fP  \f(CW\\$3\fP  \f(CW\\$4\fP
..
.Ta SPARC kc kl ka
.Ta Power\ PC qc ql qa
.Ta MIPS vc vl va
.Ta Motorola\ 68000 1c 1l 1a
.Ta Motorola\ 68020 2c 2l 2a
.Ta ARM\ 7500 5c 5l 5a
.Ta Intel\ 960 6c 6l 6a
.Ta DEC\ Alpha 7c 7l 7a
.Ta Intel\ 386 8c 8l 8a
.Ta AMD\ 29000 9c 9l 9a
.DE
There is a further breakdown
in the source of the compilers into
object-independent and
object-dependent
parts.
All of the object-independent parts
are combined into source files in the
directory
.CW /sys/src/cmd/cc .
The object-dependent parts are collected
in a separate directory for each compiler,
for example
.CW /sys/src/cmd/vc .
All of the code,
both object-independent and
object-dependent,
is machine-independent
and may be cross-compiled and executed on any
of the architectures.
.NH
The Language
.LP
The compiler implements ANSI C with some
restrictions and extensions
[ANSI90].
Most of the restrictions are due to
personal preference, while
most of the extensions were to help in
the implementation of Plan 9.
There are other departures from the standard,
particularly in the libraries,
that are beyond the scope of this
paper.
.NH 2
Register, volatile, const
.LP
The keyword
.CW register
is recognized syntactically
but is semantically ignored.
Thus taking the address of a
.CW register
variable is not diagnosed.
The keyword
.CW volatile
disables all optimizations, in particular registerization, of the corresponding variable.
The keyword
.CW const
generates warnings (if warnings are enabled by the compiler's
.CW -w
option) of non-constant use of the variable,
but does not affect the generated code.
.NH 2
The preprocessor
.LP
The C preprocessor is probably the
biggest departure from the ANSI standard.
.LP
The preprocessor built into the Plan 9 compilers does not support
.CW #if ,
although it does handle
.CW #ifdef
and
.CW #include .
If it is necessary to be more standard,
the source text can first be run through the separate ANSI C
preprocessor,
.CW cpp .
.NH 2
Unnamed substructures
.LP
The most important and most heavily used of the
extensions is the declaration of an
unnamed substructure or subunion.
For example:
.DS
.CW
.ta .1i .6i 1.1i 1.6i
	typedef
	struct	lock
	{
		int    locked;
	} Lock;

	typedef
	struct	node
	{
		int	type;
		union
		{
			double dval;
			float  fval;
			long   lval;
		};
		Lock;
	} Node;

	Lock*	lock;
	Node*	node;
.DE
The declaration of
.CW Node
has an unnamed substructure of type
.CW Lock
and an unnamed subunion.
One use of this feature allows references to elements of the
subunit to be accessed as if they were in
the outer structure.
Thus
.CW node->dval
and
.CW node->locked
are legitimate references.
.LP
When an outer structure is used
in a context that is only legal for
an unnamed substructure,
the compiler promotes the reference to the
unnamed substructure.
This is true for references to structures and
to references to pointers to structures.
This happens in assignment statements and
in argument passing where prototypes have been
declared.
Thus, continuing with the example,
.DS
.CW
.ta .1i .6i 1.1i 1.6i
	lock = node;
.DE
would assign a pointer to the unnamed
.CW Lock
in
the
.CW Node
to the variable
.CW lock .
Another example,
.DS
.CW
.ta .1i .6i 1.1i 1.6i
	extern void lock(Lock*);
	func(...)
	{
		...
		lock(node);
		...
	}
.DE
will pass a pointer to the
.CW Lock
substructure.
.LP
Finally, in places where context is insufficient to identify the unnamed structure,
the type name (it must be a
.CW typedef )
of the unnamed structure can be used as an identifier.
In our example,
.CW &node->Lock
gives the address of the anonymous
.CW Lock
structure.
.NH 2
Structure displays
.LP
A structure cast followed by a list of expressions in braces is
an expression with the type of the structure and elements assigned from
the corresponding list.
Structures are now almost first-class citizens of the language.
It is common to see code like this:
.DS
.CW
.ta .1i
	r = (Rectangle){point1, (Point){x,y+2}};
.DE
.NH 2
Initialization indexes
.LP
In initializers of arrays,
one may place a constant expression
in square brackets before an initializer.
This causes the next initializer to assign
the indicated element.
For example:
.DS
.CW
.ta .1i .6i 1.6i
	enum	errors
	{
		Etoobig,
		Ealarm,
		Egreg
	};
	char* errstrings[] =
	{
		[Ealarm]	"Alarm call",
		[Egreg]	"Panic: out of mbufs",
		[Etoobig]	"Arg list too long",
	};
.DE
In the same way,
individual structures members may
be initialized in any order by preceding the initialization with
.CW .tagname .
Both forms allow an optional
.CW = ,
to be compatible with a proposed
extension to ANSI C.
.NH 2
External register
.LP
The declaration
.CW extern
.CW register
will dedicate a register to
a variable on a global basis.
It can be used only under special circumstances.
External register variables must be identically
declared in all modules and
libraries.
The feature is not intended for efficiency,
although it can produce efficient code;
rather it represents a unique storage class that
would be hard to get any other way.
On a shared-memory multi-processor,
an external register is
one-per-processor and neither one-per-procedure (automatic)
or one-per-system (external).
It is used for two variables in the Plan 9 kernel,
.CW u
and
.CW m .
.CW U
is a pointer to the structure representing the currently running process
and
.CW m
is a pointer to the per-machine data structure.
.NH 2
Long long
.LP
The compilers accept
.CW long
.CW long
as a basic type meaning 64-bit integer.
On all of the machines
this type is synthesized from 32-bit instructions.
.NH 2
Pragma
.LP
The compilers accept
.CW #pragma
.CW lib
.I libname
and pass the
library name string uninterpreted
to the loader.
The loader uses the library name to
find libraries to load.
If the name contains
.CW %O ,
it is replaced with
the single character object type of the compiler
(e.g.,
.CW v
for the MIPS).
If the name contains
.CW %M ,
it is replaced with
the architecture type for the compiler
(e.g.,
.CW mips
for the MIPS).
If the name starts with
.CW /
it is an absolute pathname;
if it starts with
.CW .
then it is searched for in the loader's current directory.
Otherwise, the name is searched from
.CW /%M/lib .
Such
.CW #pragma
statements in header files guarantee that the correct
libraries are always linked with a program without the
need to specify them explicitly at link time.
.LP
They also accept
.CW #pragma
.CW hjdicks
.CW on
(or
.CW yes
or
.CW 1 )
to cause subsequently declared data, until
.CW #pragma
.CW hjdicks
.CW off
(or
.CW no
or
.CW 0 ),
to be laid out in memory tightly packed in successive bytes, disregarding
the usual alignment rules.
Accessing such data can cause faults.
.LP
Similarly, 
.CW #pragma
.CW profile
.CW off
(or
.CW no
or
.CW 0 )
causes subsequently declared functions, until
.CW #pragma
.CW profile
.CW on
(or
.CW yes
or
.CW 1 ),
to be marked as unprofiled.
Such functions will not be profiled when 
profiling is enabled for the rest of the program.
.LP
Two
.CW #pragma
statements allow type-checking of
.CW print -like
functions.
The first, of the form
.P1
#pragma varargck argpos error 2
.P2
tells the compiler that the second argument to
.CW error
is a
.CW print
format string (see the manual page
.I print (2))
that specifies how to format
.CW error 's
subsequent arguments.
The second, of the form
.P1
#pragma varargck type "s" char*
.P2
says that the
.CW print
format verb
.CW s
processes an argument of
type
.CW char* .
If the compiler's
.CW -F
option is enabled, the compiler will use this information
to report type violations in the arguments to
.CW print ,
.CW error ,
and similar routines.
.NH
Object module conventions
.LP
The overall conventions of the runtime environment
are important
to runtime efficiency.
In this section,
several of these conventions are discussed.
.NH 2
Register saving
.LP
In the Plan 9 compilers,
the caller of a procedure saves the registers.
With caller-saves,
the leaf procedures can use all the
registers and never save them.
If you spend a lot of time at the leaves,
this seems preferable.
With callee-saves,
the saving of the registers is done
in the single point of entry and return.
If you are interested in space,
this seems preferable.
In both,
there is a degree of uncertainty
about what registers need to be saved.
Callee-saved registers make it difficult to
find variables in registers in debuggers.
Callee-saved registers also complicate
the implementation of
.CW longjmp .
The convincing argument is
that with caller-saves,
the decision to registerize a variable
can include the cost of saving the register
across calls.
For a further discussion of caller- vs. callee-saves,
see the paper by Davidson and Whalley [Dav91].
.LP
In the Plan 9 operating system,
calls to the kernel look like normal procedure
calls, which means
the caller
has saved the registers and the system
entry does not have to.
This makes system calls considerably faster.
Since this is a potential security hole,
and can lead to non-determinism,
the system may eventually save the registers
on entry,
or more likely clear the registers on return.
.NH 2
Calling convention
.LP
Older C compilers maintain a frame pointer, which is at a known constant
offset from the stack pointer within each function.
For machines where the stack grows towards zero,
the argument pointer is at a known constant offset
from the frame pointer.
Since the stack grows down in Plan 9,
the Plan 9 compilers
keep neither an
explicit frame pointer nor
an explicit argument pointer;
instead they generate addresses relative to the stack pointer.
.LP
On some architectures, the first argument to a subroutine is passed in a register.
.NH 2
Functions returning structures
.LP
Structures longer than one word are awkward to implement
since they do not fit in registers and must
be passed around in memory.
Functions that return structures
are particularly clumsy.
The Plan 9 compilers pass the return address of
a structure as the first argument of a
function that has a structure return value.
Thus
.DS
.CW
.ta .1i .6i 1.1i 1.6i
	x = f(...)
.DE
is rewritten as
.DS
.CW
.ta .1i .6i 1.1i 1.6i
	f(&x, ...)\f1.
.DE
This saves a copy and makes the compilation
much less clumsy.
A disadvantage is that if you call this
function without an assignment,
a dummy location must be invented.
.LP
There is also a danger of calling a function
that returns a structure without declaring
it as such.
With ANSI C function prototypes,
this error need never occur.
.NH
Implementation
.LP
The compiler is divided internally into
four machine-independent passes,
four machine-dependent passes,
and an output pass.
The next nine sections describe each pass in order.
.NH 2
Parsing
.LP
The first pass is a YACC-based parser
[Joh79].
Declarations are interpreted immediately,
building a block structured symbol table.
Executable statements are put into a parse tree
and collected,
without interpretation.
At the end of each procedure,
the parse tree for the function is
examined by the other passes of the compiler.
.LP
The input stream of the parser is
a pushdown list of input activations.
The preprocessor
expansions of
macros
and
.CW #include
are implemented as pushdowns.
Thus there is no separate
pass for preprocessing.
.NH 2
Typing
.LP
The next pass distributes typing information
to every node of the tree.
Implicit operations on the tree are added,
such as type promotions and taking the
address of arrays and functions.
.NH 2
Machine-independent optimization
.LP
The next pass performs optimizations
and transformations of the tree, such as converting
.CW &*x
and
.CW *&x
into
.CW x .
Constant expressions are converted to constants in this pass.
.NH 2
Arithmetic rewrites
.LP
This is another machine-independent optimization.
Subtrees of add, subtract, and multiply of integers are
rewritten for easier compilation.
The major transformation is factoring:
.CW 4+8*a+16*b+5
is transformed into
.CW 9+8*(a+2*b) .
Such expressions arise from address
manipulation and array indexing.
.NH 2
Addressability
.LP
This is the first of the machine-dependent passes.
The addressability of a processor is defined as the set of
expressions that is legal in the address field
of a machine language instruction.
The addressability of different processors varies widely.
At one end of the spectrum are the 68020 and VAX,
which allow a complex mix of incrementing,
decrementing,
indexing, and relative addressing.
At the other end is the MIPS,
which allows only registers and constant offsets from the
contents of a register.
The addressability can be different for different instructions
within the same processor.
.LP
It is important to the code generator to know when a
subtree represents an address of a particular type.
This is done with a bottom-up walk of the tree.
In this pass, the leaves are labeled with small integers.
When an internal node is encountered,
it is labeled by consulting a table indexed by the
labels on the left and right subtrees.
For example,
on the 68020 processor,
it is possible to address an
offset from a named location.
In C, this is represented by the expression
.CW *(&name+constant) .
This is marked addressable by the following table.
In the table,
a node represented by the left column is marked
with a small integer from the right column.
Marks of the form
.CW A\s-2\di\u\s0
are addressable while
marks of the form
.CW N\s-2\di\u\s0
are not addressable.
.DS
.B
.ta .1i 1.1i
	Node	Marked
.CW
	name	A\s-2\d1\u\s0
	const	A\s-2\d2\u\s0
	&A\s-2\d1\u\s0	A\s-2\d3\u\s0
	A\s-2\d3\u\s0+A\s-2\d1\u\s0	N\s-2\d1\u\s0 \fR(note that this is not addressable)\fP
	*N\s-2\d1\u\s0	A\s-2\d4\u\s0
.DE
Here there is a distinction between
a node marked
.CW A\s-2\d1\u\s0
and a node marked
.CW A\s-2\d4\u\s0
because the address operator of an
.CW A\s-2\d4\u\s0
node is not addressable.
So to extend the table:
.DS
.B
.ta .1i 1.1i
	Node	Marked
.CW
	&A\s-2\d4\u\s0	N\s-2\d2\u\s0
	N\s-2\d2\u\s0+N\s-2\d1\u\s0	N\s-2\d1\u\s0
.DE
The full addressability of the 68020 is expressed
in 18 rules like this,
while the addressability of the MIPS is expressed
in 11 rules.
When one ports the compiler,
this table is usually initialized
so that leaves are labeled as addressable and nothing else.
The code produced is poor,
but porting is easy.
The table can be extended later.
.LP
This pass also rewrites some complex operators
into procedure calls.
Examples include 64-bit multiply and divide.
.LP
In the same bottom-up pass of the tree,
the nodes are labeled with a Sethi-Ullman complexity
[Set70].
This number is roughly the number of registers required
to compile the tree on an ideal machine.
An addressable node is marked 0.
A function call is marked infinite.
A unary operator is marked as the
maximum of 1 and the mark of its subtree.
A binary operator with equal marks on its subtrees is
marked with a subtree mark plus 1.
A binary operator with unequal marks on its subtrees is
marked with the maximum mark of its subtrees.
The actual values of the marks are not too important,
but the relative values are.
The goal is to compile the harder
(larger mark)
subtree first.
.NH 2
Code generation
.LP
Code is generated by recursive
descent.
The Sethi-Ullman complexity completely guides the
order.
The addressability defines the leaves.
The only difficult part is compiling a tree
that has two infinite (function call)
subtrees.
In this case,
one subtree is compiled into the return register
(usually the most convenient place for a function call)
and then stored on the stack.
The other subtree is compiled into the return register
and then the operation is compiled with
operands from the stack and the return register.
.LP
There is a separate boolean code generator that compiles
conditional expressions.
This is fundamentally different from compiling an arithmetic expression.
The result of the boolean code generator is the
position of the program counter and not an expression.
The boolean code generator makes extensive use of De Morgan's rule.
The boolean code generator is an expanded version of that described
in chapter 8 of Aho, Sethi, and Ullman
[Aho87].
.LP
There is a considerable amount of talk in the literature
about automating this part of a compiler with a machine
description.
Since this code generator is so small
(less than 500 lines of C)
and easy,
it hardly seems worth the effort.
.NH 2
Registerization
.LP
Up to now,
the compiler has operated on syntax trees
that are roughly equivalent to the original source language.
The previous pass has produced machine language in an internal
format.
The next two passes operate on the internal machine language
structures.
The purpose of the next pass is to reintroduce
registers for heavily used variables.
.LP
All of the variables that can be
potentially registerized within a procedure are
placed in a table.
(Suitable variables are any automatic or external
scalars that do not have their addresses extracted.
Some constants that are hard to reference are also
considered for registerization.)
Four separate data flow equations are evaluated
over the procedure on all of these variables.
Two of the equations are the normal set-behind
and used-ahead
bits that define the life of a variable.
The two new bits tell if a variable life
crosses a function call ahead or behind.
By examining a variable over its lifetime,
it is possible to get a cost
for registerizing.
Loops are detected and the costs are multiplied
by three for every level of loop nesting.
Costs are sorted and the variables
are replaced by available registers on a greedy basis.
.LP
The 68020 has two different
types of registers.
For the 68020,
two different costs are calculated for
each variable life and the register type that
affords the better cost is used.
Ties are broken by counting the number of available
registers of each type.
.LP
Note that externals are registerized together with automatics.
This is done by evaluating the semantics of a ``call'' instruction
differently for externals and automatics.
Since a call goes outside the local procedure,
it is assumed that a call references all externals.
Similarly,
externals are assumed to be set before an ``entry'' instruction
and assumed to be referenced after a ``return'' instruction.
This makes sure that externals are in memory across calls.
.LP
The overall results are satisfactory.
It would be nice to be able to do this processing in
a machine-independent way,
but it is impossible to get all of the costs and
side effects of different choices by examining the parse tree.
.LP
Most of the code in the registerization pass is machine-independent.
The major machine-dependency is in
examining a machine instruction to ask if it sets or references
a variable.
.NH 2
Machine code optimization
.LP
The next pass walks the machine code
for opportunistic optimizations.
For the most part,
this is highly specific to a particular
processor.
One optimization that is performed
on all of the processors is the
removal of unnecessary ``move''
instructions.
Ironically,
most of these instructions were inserted by
the previous pass.
There are two patterns that are repetitively
matched and replaced until no more matches are
found.
The first tries to remove ``move'' instructions
by relabeling variables.
.LP
When a ``move'' instruction is encountered,
if the destination variable is set before the
source variable is referenced,
then all of the references to the destination
variable can be renamed to the source and the ``move''
can be deleted.
This transformation uses the reverse data flow
set up in the previous pass.
.LP
An example of this pattern is depicted in the following
table.
The pattern is in the left column and the
replacement action is in the right column.
.DS
.CW
.ta .1i .6i 1.6i 2.1i 2.6i
	MOVE	a->b		\fR(remove)\fP
.R
	(sequence with no mention of \f(CWa\fP)
.CW
	USE	b		USE	a
.R
	(sequence with no mention of \f(CWa\fP)
.CW
	SET	b		SET	b
.DE
.LP
Experiments have shown that it is marginally
worthwhile to rename uses of the destination variable
with uses of the source variable up to
the first use of the source variable.
.LP
The second transform will do relabeling
without deleting instructions.
When a ``move'' instruction is encountered,
if the source variable has been set prior
to the use of the destination variable
then all of the references to the source
variable are replaced by the destination and
the ``move'' is inverted.
Typically,
this transformation will alter two ``move''
instructions and allow the first transformation
another chance to remove code.
This transformation uses the forward data flow
set up in the previous pass.
.LP
Again,
the following is a depiction of the transformation where
the pattern is in the left column and the
rewrite is in the right column.
.DS
.CW
.ta .1i .6i 1.6i 2.1i 2.6i
	SET	a		SET	b
.R
	(sequence with no use of \f(CWb\fP)
.CW
	USE	a		USE	b
.R
	(sequence with no use of \f(CWb\fP)
.CW
	MOVE	a->b		MOVE	b->a
.DE
Iterating these transformations
will usually get rid of all redundant ``move'' instructions.
.LP
A problem with this organization is that the costs
of registerization calculated in the previous pass
must depend on how well this pass can detect and remove
redundant instructions.
Often,
a fine candidate for registerization is rejected
because of the cost of instructions that are later
removed.
.NH 2
Writing the object file
.LP
The last pass walks the internal assembly language
and writes the object file.
The object file is reduced in size by about a factor
of three with simple compression
techniques.
The most important aspect of the object file
format is that it is independent of the compiling machine.
All integer and floating numbers in the object
code are converted to known formats and byte
orders.
.NH
The loader
.LP
The loader is a multiple pass program that
reads object files and libraries and produces
an executable binary.
The loader also does some minimal
optimizations and code rewriting.
Many of the operations performed by the
loader are machine-dependent.
.LP
The first pass of the loader reads the
object modules into an internal data
structure that looks like binary assembly language.
As the instructions are read,
code is reordered to remove
unconditional branch instructions.
Conditional branch instructions are inverted
to prevent the insertion of unconditional branches.
The loader will also make a copy of a few instructions
to remove an unconditional branch.
.LP
The next pass allocates addresses for
all external data.
Typical of processors is the MIPS,
which can reference ±32K bytes from a
register.
The loader allocates the register
.CW R30
as the static pointer.
The value placed in
.CW R30
is the base of the data segment plus 32K.
It is then cheap to reference all data in the
first 64K of the data segment.
External variables are allocated to
the data segment
with the smallest variables allocated first.
If all of the data cannot fit into the first
64K of the data segment,
then usually only a few large arrays
need more expensive addressing modes.
.LP
For the MIPS processor,
the loader makes a pass over the internal
structures,
exchanging instructions to try
to fill ``delay slots'' with useful work.
If a useful instruction cannot be found
to fill a delay slot,
the loader will insert
``noop''
instructions.
This pass is very expensive and does not
do a good job.
About 40% of all instructions are in
delay slots.
About 65% of these are useful instructions and
35% are ``noops.''
The vendor-supplied assembler does this job
more effectively,
filling about 80%
of the delay slots with useful instructions.
.LP
On the 68020 processor,
branch instructions come in a variety of
sizes depending on the relative distance
of the branch.
Thus the size of branch instructions
can be mutually dependent.
The loader uses a multiple pass algorithm
to resolve the branch lengths
[Szy78].
Initially, all branches are assumed minimal length.
On each subsequent pass,
the branches are reassessed
and expanded if necessary.
When no more expansions occur,
the locations of the instructions in
the text segment are known.
.LP
On the MIPS processor,
all instructions are one size.
A single pass over the instructions will
determine the locations of all addresses
in the text segment.
.LP
The last pass of the loader produces the
executable binary.
A symbol table and other tables are
produced to help the debugger to
interpret the binary symbolically.
.LP
The loader places absolute source line numbers in the symbol table.
The name and absolute line number of all
.CW #include
files is also placed in the
symbol table so that the debuggers can
associate object code to source files.
.NH
Performance
.LP
The following is a table of the source size of the MIPS
compiler.
.DS
.ta .1i .6i
	lines	module
	\0509	machine-independent headers
	1070	machine-independent YACC source
	6090	machine-independent C source

	\0545	machine-dependent headers
	6532	machine-dependent C source

	\0298	loader headers
	5215	loader C source
.DE
.LP
The following table shows timing
of a test program
that plays checkers, running on a MIPS R4000.
The test program is 26 files totaling 12600 lines of C.
The execution time does not significantly
depend on library implementation.
Since no other compiler runs on Plan 9,
the Plan 9 tests were done with the Plan 9 operating system;
the other tests were done on the vendor's operating system.
The hardware was identical in both cases.
The optimizer in the vendor's compiler
is reputed to be extremely good.
.DS
.ta .1i .9i
	\0\04.49s	Plan 9 \f(CWvc\fP \f(CW-N\fP compile time (opposite of \f(CW-O\fP)
	\0\01.72s	Plan 9 \f(CWvc\fP \f(CW-N\fP load time
	148.69s	Plan 9 \f(CWvc\fP \f(CW-N\fP run time

	\015.07s	Plan 9 \f(CWvc\fP compile time (\f(CW-O\fP implicit)
	\0\01.66s	Plan 9 \f(CWvc\fP load time
	\089.96s	Plan 9 \f(CWvc\fP run time

	\014.83s	vendor \f(CWcc\fP compile time
	\0\00.38s	vendor \f(CWcc\fP load time
	104.75s	vendor \f(CWcc\fP run time

	\043.59s	vendor \f(CWcc\fP \f(CW-O\fP compile time
	\0\00.38s	vendor \f(CWcc\fP \f(CW-O\fP load time
	\076.19s	vendor \f(CWcc\fP \f(CW-O\fP run time

	\0\08.19s	vendor \f(CWcc\fP \f(CW-O3\fP compile time
	\035.97s	vendor \f(CWcc\fP \f(CW-O3\fP load time
	\071.16s	vendor \f(CWcc\fP \f(CW-O3\fP run time
.DE
.LP
To compare the Intel compiler,
a program that is about 40% bit manipulation and
about 60% single precision floating point was
run on the same 33 MHz 486, once under Windows
compiled with the Watcom compiler, version 10.0,
in 16-bit mode and once under
Plan 9 in 32-bit mode.
The Plan 9 execution time was 27 sec while the Windows
execution time was 31 sec.
.NH
Conclusions
.LP
The new compilers compile
quickly,
load slowly,
and produce
medium quality
object code.
The compilers are relatively
portable,
requiring but a couple of weeks' work to
produce a compiler for a different computer.
For Plan 9,
where we needed several compilers
with specialized features and
our own object formats,
this project was indispensable.
It is also necessary for us to
be able to freely distribute our compilers
with the Plan 9 distribution.
.LP
Two problems have come up in retrospect.
The first has to do with the
division of labor between compiler and loader.
Plan 9 runs on multi-processors and as such
compilations are often done in parallel.
Unfortunately,
all compilations must be complete before loading
can begin.
The load is then single-threaded.
With this model,
any shift of work from compile to load
results in a significant increase in real time.
The same is true of libraries that are compiled
infrequently and loaded often.
In the future,
we may try to put some of the loader work
back into the compiler.
.LP
The second problem comes from
the various optimizations performed over several
passes.
Often optimizations in different passes depend
on each other.
Iterating the passes could compromise efficiency,
or even loop.
We see no real solution to this problem.
.NH
References
.LP
[Aho87] A. V. Aho, R. Sethi, and J. D. Ullman,
.I
Compilers \- Principles, Techniques, and Tools,
.R
Addison Wesley,
Reading, MA,
1987.
.LP
[ANSI90] \f2American National Standard for Information Systems \-
Programming Language C\f1, American National Standards Institute, Inc.,
New York, 1990.
.LP
[Dav91] J. W. Davidson and D. B. Whalley,
``Methods for Saving and Restoring Register Values across Function Calls'',
.I
Software\-Practice and Experience,
.R
Vol 21(2), pp. 149-165, February 1991.
.LP
[Joh79] S. C. Johnson,
``YACC \- Yet Another Compiler Compiler'',
.I
UNIX Programmer's Manual, Seventh Ed., Vol. 2A,
.R
AT&T Bell Laboratories,
Murray Hill, NJ,
1979.
.LP
[Set70] R. Sethi and J. D. Ullman,
``The Generation of Optimal Code for Arithmetic Expressions'',
.I
Journal of the ACM,
.R
Vol 17(4), pp. 715-728, 1970.
.LP
[Szy78] T. G. Szymanski,
``Assembling Code for Machines with Span-dependent Instructions'',
.I
Communications of the ACM,
.R
Vol 21(4), pp. 300-308, 1978.