System Programming


Chapter 5 

Compiler 


Basic Compiler Functions 

Grammars 

Lexical Analysis 

Syntactic Analysis 

Code Generation 


Terminology 

• Statement

–Declaration, or assignment containing an expression

• Grammar

–A set of rules that specify the form of legal statements

• Syntax vs. semantics

–Example: assume I, J, K: integer and X, Y: float

–I := J + K vs. X := I + Y

• Compilation

–Matching statements (written by programmers) to structures (defined by the grammar) and generating the appropriate object code


Basic Compiler 

•Lexical analysis - scanner

–Scanning the source statement, recognizing and classifying the various tokens

•Syntactic analysis - parser

–Recognizing the statement as some language construct

–Constructing a parse tree (syntax tree)

•Code generation - code generator

–Generating assembly language code

–Generating machine code (object code)


High-Level Programming Language 

• A high-level programming language is described in terms of a 

grammar, which specifies the syntax of legal statements. 

– An assignment statement: 

• a variable name + an assignment operator + an expression 


Grammars 

•A grammar for a language is a formal description of the syntax.

–The grammar does not describe the semantics (meaning) of the various statements.

•Example: I, J, K: integer and X, Y: float

–I := J + K vs. I := X + Y

–Identical syntax

–Different semantics

• Integer arithmetic operation

• Floating-point addition

–Very different sequences of machine instructions

• The difference is recognized during code generation


BNF (Backus-Naur Form) 

• A simple and widely used notation for writing grammars, introduced by John Backus and Peter Naur in about 1960.

• A BNF grammar consists of a set of rules, each of which defines the syntax of some construct in the programming language.

• Meta-symbols of BNF:

– ::= "is defined as"

– | "or"

– < > angle brackets used to surround nonterminal symbols

• Entries not enclosed in angle brackets are terminal symbols of the grammar (i.e., tokens).

• A BNF rule defining a nonterminal has the form:

– <nonterminal> ::= sequence_of_alternatives, where each alternative is a string of terminals (tokens) or nonterminals and the alternatives are separated by the meta-symbol |


Simplified Pascal Grammar 

Recursive rule 


Parse Tree 

(Syntax Tree) 

Examples: READ(VALUE) and VARIANCE := SUMSQ DIV 100 - MEAN * MEAN

In the parse tree, multiplication and division precede addition and subtraction.


Parse Tree 

•If there is more than one possible parse tree for a 

given statement, the grammar is said to be 

ambiguous. 


Scanner 

•Recognize keywords, operators, integers, 

floating-point numbers, character strings and 

identifiers. 

•The exact set of tokens to be recognized depends on the programming language being compiled.



•Function 

–Scanning the program to be compiled and 

recognizing the tokens that make up the source 

statements. 

•Tokens 

–Tokens can be keywords, operators, identifiers, 

integers, floating-point numbers, character strings, etc. 

–Each token is usually represented by some fixed-length code, such as an integer, rather than as a variable-length character string (see Figure 5.5)

–Token type, Token specifier (value) (see Figure 5.6) 



• Tokens might be defined by grammar rules 

to be recognized by the parser: 

• For better efficiency, a scanner can be used 

instead to recognize and output the tokens in 

a sequence represented by fixed-length 

codes (such as integers) and the associated 

token specifiers. 


Token Specifier 

•The scanner is designed to enter identifiers directly into a symbol table when they are first recognized.

•A token specifier for an identifier is a pointer to the corresponding symbol-table entry (e.g., ^SUM for an identifier, #100 for an integer).

–This avoids much of the need for table searching during the rest of the compilation process.
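A minimal sketch of a scanner that emits (token code, token specifier) pairs and enters identifiers into a symbol table on first sight. The token codes, the `scan` function, and the use of a dictionary index as the "pointer" are assumptions made for illustration; the actual coding scheme is the one in Figure 5.5.

```python
import re

# Hypothetical token codes; the real coding scheme is the one in Figure 5.5.
FIXED_TOKENS = {"READ": 1, "(": 2, ")": 3, ":=": 4, "+": 5}
ID_CODE, INT_CODE = 6, 7

# One token per match: an operator, an identifier or keyword, or an integer.
TOKEN_PATTERN = re.compile(r"\s*(:=|[()+]|[A-Za-z][A-Za-z0-9]*|[0-9]+)")


def scan(source, symbol_table):
    """Return a list of (token code, token specifier) pairs."""
    tokens = []
    source = source.rstrip()
    pos = 0
    while pos < len(source):
        match = TOKEN_PATTERN.match(source, pos)
        if match is None:
            raise ValueError(f"unrecognized input at position {pos}")
        text = match.group(1)
        pos = match.end()
        if text in FIXED_TOKENS:
            # Keyword or operator: the code alone identifies the token.
            tokens.append((FIXED_TOKENS[text], None))
        elif text.isdigit():
            # Integer: the specifier is the value itself.
            tokens.append((INT_CODE, int(text)))
        else:
            # Identifier: enter it in the symbol table when first recognized,
            # and use the table index as the specifier.
            if text not in symbol_table:
                symbol_table[text] = len(symbol_table)
            tokens.append((ID_CODE, symbol_table[text]))
    return tokens
```

Because the specifier already points at the symbol-table entry, later phases can use it directly instead of searching for the identifier's name again.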


Scanner Output 

•Token specifier 

–Identifier name, integer value, (type) 

•Token coding scheme 

–Figure 5.5 


Lexical 

Scan 


Parser vs. Scanner 

•The scanner operates as a procedure that is called by the parser when it needs another token.

•Each call to the scanner produces the coding for the next token in the source program.

•The parser is responsible for saving any tokens that it might require for later analysis.


Languages 

•In FORTRAN 

–DO 10 I = 1, 100 

• DO: keyword 

• 10: a statement number 

• I: identifier 

–DO 10 I = 1 

• DO10I: identifier 


Special Statement of FORTRAN 

IF (THEN .EQ. ELSE) THEN 

IF = THEN 

ELSE 

THEN = IF 

ENDIF 


Token Recognizer 

•By grammar 

<ident> ::= <letter> | <ident> <letter> | <ident> <digit>

<letter> ::= A | B | C | D | … | Z

<digit> ::= 0 | 1 | 2 | 3 | … | 9

•By scanner - modeling as finite automata 

–Figure 5.8 (a) 


Modeling Scanners as Finite 

Automata 

• Tokens can often be recognized by a finite automaton, 

which consists of 

–A finite set of states (including a starting state and one or 

more final states) 

–A set of transitions from one state to another 


Finite Automata for Scanner 

•If the automaton stops in a final state, we say that it recognizes (or accepts) the string being scanned.

•If it stops in a nonfinal state, it fails to recognize (or rejects) the string.
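As a sketch of the idea, the following table-driven automaton accepts identifiers that start with a letter and continue with letters or digits. The state numbers, the `char_class` helper, and the `accepts` function are illustrative assumptions, not a transcription of Figure 5.8.

```python
START, IN_IDENT = 1, 2          # hypothetical state numbers
FINAL_STATES = {IN_IDENT}

# (current state, character class) -> next state; a missing entry means "stop".
TRANSITIONS = {
    (START, "letter"): IN_IDENT,
    (IN_IDENT, "letter"): IN_IDENT,
    (IN_IDENT, "digit"): IN_IDENT,
}


def char_class(ch):
    """Classify one character as 'letter', 'digit', or 'other'."""
    if ch.isascii() and ch.isalpha():
        return "letter"
    if ch.isascii() and ch.isdigit():
        return "digit"
    return "other"


def accepts(string):
    """Run the automaton over the string; accept only if it ends in a final state."""
    state = START
    for ch in string:
        state = TRANSITIONS.get((state, char_class(ch)))
        if state is None:       # no transition defined: the string is rejected
            return False
    return state in FINAL_STATES
```

Keeping the transitions in a table rather than in nested conditionals makes it straightforward to add states for other token types.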


Finite Automata for Typical Tokens 

The finite automata can recognize 

all of the tokens in Figure 5. 

Underscore character 

The notation A-Z specifies any character from A to Z 


Token Recognition Algorithm

A typical algorithm to recognize identifiers, which may contain underscores.



Operator-Precedence Parsing 

Recursive-Descent Parsing 



•Syntactic analysis: building the parse tree for the 

statements being translated 

•Parse tree 

–Root: goal grammar rule 

–Leaves: terminal symbols 

•Methods: 

–Bottom-up: operator-precedence parsing 

–Top-down: recursive-descent parsing 



• Recognize source statements as language constructs or 

build the parse tree for the statements. 

–Bottom-up 

• Operator-precedence parsing 

• Shift-reduce parsing 

• LR(0) parsing 

• LR(1) parsing 

• SLR(1) parsing 

• LALR(1) parsing 

–Top-down 

• Recursive-descent parsing 

• LL(1) parsing 


Precedence 

•A + B * C - D

•Multiplication and division have higher precedence than addition and subtraction.

–+ has lower precedence than *, written + < *

•In terms of the parse tree, this means that the * operation appears at a lower level than does either + or -.

• < : the previous token has lower precedence than the later one

• > : the previous token has higher precedence than the later one

• = : the two tokens have equal precedence



• The operator-precedence method uses the precedence relations between consecutive operators to guide the parsing process.

A + B * C - D, with + < * and * > -

• Subexpression B*C is to be computed first because * has higher precedence than the surrounding operators. This means that * appears at a lower level than does + or - in the parse tree.

• Precedence relations: <, =, >


Precedence Matrix 

later 

previous 


Precedence 

•; > END and END > ;

–When ; is followed by END, the ; has higher precedence.

–When END is followed by ;, the END has higher precedence.

•An empty entry means that these two tokens cannot appear together in any legal statement.

•Neither ; < BEGIN nor ; > BEGIN can exist.



•The parser has identified the portion of the statement delimited by the precedence relations < and > as something to be interpreted in terms of the grammar.

•An operator-precedence parser generally uses a stack to save tokens that have been scanned but not yet parsed, so it can re-examine them.
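The stack-based process can be sketched for a small expression subset. This is a simplification under stated assumptions: only the operators +, -, *, and DIV are handled, the relation between two operators is derived from numeric precedence levels rather than looked up in the full precedence matrix, and the names `relation`, `parse_expression`, and the `N1`, `N2`, ... placeholders are invented for the example.

```python
# Hypothetical precedence levels standing in for the full precedence matrix.
PREC = {"+": 1, "-": 1, "*": 2, "DIV": 2}
END = "$"                       # marks the end of the input


def relation(previous, later):
    """Return '<' or '>' for two consecutive operators (left-associative)."""
    if later == END:
        return ">"
    return "<" if PREC[later] > PREC[previous] else ">"


def parse_expression(tokens):
    """Return the reductions in the order performed, as (operator, left, right)."""
    operators = [END]           # operators scanned but not yet reduced
    operands = []               # operands and placeholders for reduced subtrees
    reductions = []
    for token in tokens + [END]:
        if token != END and token not in PREC:
            operands.append(token)
            continue
        # Reduce while the operator on the stack has higher precedence (>).
        while operators[-1] != END and relation(operators[-1], token) == ">":
            operator = operators.pop()
            right = operands.pop()
            left = operands.pop()
            reductions.append((operator, left, right))
            operands.append(f"N{len(reductions)}")
        operators.append(token)  # otherwise shift the operator onto the stack
    return reductions
```

For A + B * C - D the first reduction is B * C, which matches the observation that * sits at a lower level of the parse tree than + or -.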

Example: READ ( VALUE ) 


Example: VARIANCE := SUMSQ DIV 100 - MEAN * MEAN

Bottom-up Parsing 

•The parse tree is constructed from the terminal nodes (the leaves) up toward the root.


Operator Precedence vs. 

Shift-Reduce Parsing 

•The ideas behind the operator-precedence technique were developed into shift-reduce parsing.


Shift-Reduce Parsing 

• Operator-precedence parsing can deal with operator grammars, which have the property that no production right side has two adjacent nonterminals.

• Shift-reduce parsing

–It makes use of a stack to store tokens that have not yet been recognized in terms of the grammar.

–Actions:

• Shift: push the current token onto the stack

–Shift roughly corresponds to the action taken by an operator-precedence parser when it encounters the relations < and =.

• Reduce: recognize symbols on top of the stack according to a grammar rule.

–Reduce roughly corresponds to the action taken by an operator-precedence parser when it encounters the relation >.

• The most powerful shift-reduce parsing technique is called LR(k).

Recursive-Descent Parser 

•A recursive-descent parser is made up of a procedure for each nonterminal symbol in the grammar.

•When a procedure is called, it attempts to find a substring of the input, beginning with the current token, that can be interpreted as the nonterminal with which the procedure is associated.


Left Recursion 

• <dec-list> ::= <dec> | <dec-list> ; <dec>

–If the procedure decides to try the second alternative (<dec-list> ; <dec>), it would immediately call itself recursively to find a <dec-list>.

–This results in an unending chain of calls.

•Modification

– <dec-list> ::= <dec> { ; <dec> }


Recursive-Descent Parsing 

• A recursive-descent parser is made up of a procedure 

for each nonterminal symbol in the grammar. 

–The procedure attempts to find a substring of the input that 

can be interpreted as the nonterminal. 

–The procedure may call other procedures, or even itself 

recursively, to search for other nonterminals. 

–The procedure must decide which alternative in the 

grammar rule to use by examining the next input token. 

• Top-down parsers cannot be used directly with a grammar that contains immediate left recursion.

–The result would be an unending chain of calls.

• Two grammars for an identifier list:

– <id-list> ::= id | <id-list> , id

– <id-list> ::= id { , id }


Extension to BNF 

•id {, id } 

–The terms between { and } may be omitted, or 

repeated one or more times. 

–With the revised definition, the procedure simply 

looks first for an id, and then keeps scanning the 

input as long as the next two tokens are a comma (,) 

and id. 
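The revised rule id { , id } translates almost directly into a loop. The sketch below assumes tokens are plain strings and treats any Python-style identifier as an id; `is_id` and `parse_id_list` are names invented for the example.

```python
def is_id(token):
    """Treat any identifier-shaped string as an id token (an assumption)."""
    return token.isidentifier()


def parse_id_list(tokens, pos=0):
    """Parse id { , id } starting at tokens[pos].

    Returns the position just past the list, or None if no id is found.
    """
    if pos >= len(tokens) or not is_id(tokens[pos]):
        return None
    pos += 1
    # Keep scanning as long as the next two tokens are a comma and an id.
    while pos + 1 < len(tokens) and tokens[pos] == "," and is_id(tokens[pos + 1]):
        pos += 2
    return pos
```

No recursive call is needed, so the unending chain caused by the left-recursive rule cannot occur.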


Modified Grammar without Left Recursion 

The modified grammar is still recursive, but a chain of calls always consumes at least one token.


Recursive-Descent Parsing of READ

Recursive-Descent Parsing of IDLIST

/* A READ statement: READ ( <id-list> ) */
check_read()
{
    if( get_token()=='READ' &&
        get_token()=='(' &&
        check_id_list()==true &&
        get_token()==')' )
        return(true);
    else
        return(false);
}

/* A program: PROGRAM <prog-name> VAR <dec-list> BEGIN <stmt-list> END. */
check_prog()
{
    if( get_token()=='PROGRAM' &&
        check_prog_name()==true &&
        get_token()=='VAR' &&
        check_dec_list()==true &&
        get_token()=='BEGIN' &&
        check_stmt_list()==true &&
        get_token()=='END.' )
        return(true);
    else
        return(false);
}

/* A FOR statement: FOR <index-exp> DO <body> */
check_for()
{
    if( get_token()=='FOR' &&
        check_index_exp()==true &&
        get_token()=='DO' &&
        check_body()==true )
        return(true);
    else
        return(false);
}

check_stmt()
{
    /* Resolve alternatives by look-ahead */
    if( next_token()==id )
        return check_assign();
    if( next_token()=='READ' )
        return check_read();
    if( next_token()=='WRITE' )
        return check_write();
    if( next_token()=='FOR' )
        return check_for();
    return(false);    /* no alternative matches */
}


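Left Recursion

• 3 <dec-list> ::= <dec> | <dec-list> ; <dec>

• 3a <dec-list> ::= <dec> { ; <dec> }

check_dec_list()
{
    flag=true;
    if(check_dec()==false)
        flag=false;
    /* Each ';' must be followed by another <dec> */
    while(next_token()==';')
    {
        get_token();
        if(check_dec()==false)
            flag=false;
    }
    return flag;
}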


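• 10 <exp> ::= <term> | <exp> + <term> | <exp> - <term>

• 10a <exp> ::= <term> { + <term> | - <term> }

check_exp()
{
    flag=true;
    if(check_term()==false)
        flag=false;
    /* Each '+' or '-' must be followed by another <term> */
    while(next_token()=='+' || next_token()=='-')
    {
        get_token();
        if(check_term()==false)
            flag=false;
    }
    return flag;
}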

Recursive-Descent Procedure for Assign

<assign> ::= id := <exp>

Recursive-Descent Procedure for EXP

<exp> ::= <term> { + <term> | - <term> }

Recursive-Descent Procedure for TERM

<term> ::= <factor> { * <factor> | DIV <factor> }

Recursive-Descent Procedure for FACTOR

<factor> ::= id | int | ( <exp> )
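The four procedures named above can be sketched as methods of one small class, with one method per nonterminal. This is an illustrative sketch, not the procedures from the figures: it assumes the rules are <assign> ::= id := <exp>, <exp> ::= <term> { + <term> | - <term> }, <term> ::= <factor> { * <factor> | DIV <factor> }, and <factor> ::= id | int | ( <exp> ), that tokens are plain strings, and it builds a nested-tuple parse tree. The `Parser` class and its helper names are hypothetical.

```python
class Parser:
    """Recursive-descent parser with one method per nonterminal."""

    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        """Look at the next token without consuming it (None at end of input)."""
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def advance(self):
        token = self.peek()
        self.pos += 1
        return token

    def expect(self, expected):
        if self.peek() != expected:
            raise SyntaxError(f"expected {expected!r}, found {self.peek()!r}")
        return self.advance()

    @staticmethod
    def is_id(token):
        return token is not None and token.isidentifier() and token != "DIV"

    def assign(self):
        target = self.advance()
        if not self.is_id(target):
            raise SyntaxError(f"expected an identifier, found {target!r}")
        self.expect(":=")
        return (":=", target, self.exp())

    def exp(self):
        node = self.term()
        while self.peek() in ("+", "-"):    # the { + term | - term } part
            operator = self.advance()
            node = (operator, node, self.term())
        return node

    def term(self):
        node = self.factor()
        while self.peek() in ("*", "DIV"):  # the { * factor | DIV factor } part
            operator = self.advance()
            node = (operator, node, self.factor())
        return node

    def factor(self):
        token = self.advance()
        if token == "(":
            node = self.exp()
            self.expect(")")
            return node
        if self.is_id(token) or (token is not None and token.isdigit()):
            return token
        raise SyntaxError(f"unexpected token {token!r}")
```

Each loop decides whether to continue by examining only the next token, which is the look-ahead decision that recursive descent relies on.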

Recursive-Descent Parsing

id1 := SUMSQ DIV 100 - MEAN * MEAN

Code Generation 

• When the parser recognizes a portion of the source 

program according to some rule of the grammar, the 

corresponding semantic routine (code generation 

routine) is executed. 

• As an example, symbolic representation of the object 

code for a SIC/XE machine is generated. 

• Two data structures are used for working storage: 

–A list (associated with a variable LISTCOUNT) 

–A stack 


• SUM,SUMQ,I,VALUE,MEAN,VARIANCE:INTEGER; 

– SUM WORD 0 

– SUMQ WORD 0 

– I WORD 0 

– VALUE WORD 0 

– MEAN WORD 0 

– VARIANCE WORD 0 

• SUM:=0; 

– LDA #0 

– STA SUM 

• SUM:=SUM+VALUE; 

– LDA SUM 

– ADD VALUE 

– STA SUM 


• VARIANCE := SUMQ DIV 100 - MEAN * MEAN;

One translation, using the temporaries TEMP1, TEMP2, and TEMP3:

– TEMP1 WORD 0
– LDA SUMQ
– DIV #100
– STA TEMP1
– LDA MEAN
– MUL MEAN
– STA TEMP2
– LDA TEMP1
– SUB TEMP2
– STA TEMP3
– LDA TEMP3
– STA VARIANCE

A shorter translation, using the single temporary TEMP:

– TEMP WORD 0
– LDA MEAN
– MUL MEAN
– STA TEMP
– LDA SUMQ
– DIV #100
– SUB TEMP
– STA VARIANCE



Argument passing 

placed in register L 


Terminology 

• Token specifier S(id) is the name of the identifier, or 

pointer to the symbol-table entry. 

• S(int) is the value of the integer. 

• The node specifier S() is set to rA, indicating that 

the result of the computation is in register A. 

• The variable REGA is used to indicate the highest-level 

node of the parse tree whose value is left in register A 

by the code generated so far. 

• Procedure GETA generates an LDA instruction to load a value into register A.

Other Code-Generation Routines

Compiler 

•Basic Compiler Functions 

•Machine-Dependent Compiler Features (5.2) 

•Machine-Independent Compiler Features (5.3) 


Intermediate Form 

• The syntax and semantics of the source statements have been completely analyzed, but the actual translation into machine code has not yet been performed.

• It is much easier to analyze and manipulate the 

intermediate form of a program than the machine code. 

• Operation, op1, op2, result 

–Operation: some function to be performed by the object code 

–op1 and op2 are the operands for the operation 

–Result: where the resulting value is to be placed 
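A minimal sketch of how quadruples of this form might be produced from a parse tree. It assumes the tree is given as nested tuples of the form (operator, left, right), numbers intermediate results i1, i2, ..., and uses None for an unused operand; the function name `generate_quadruples` is hypothetical.

```python
def generate_quadruples(assignment):
    """Translate (':=', target, expression) into (operation, op1, op2, result) tuples."""
    quadruples = []

    def walk(node):
        # A leaf (variable name or literal) is its own operand.
        if isinstance(node, str):
            return node
        operation, left, right = node
        op1 = walk(left)
        op2 = walk(right)
        result = f"i{len(quadruples) + 1}"  # next intermediate result
        quadruples.append((operation, op1, op2, result))
        return result

    _, target, expression = assignment
    value = walk(expression)
    quadruples.append((":=", value, None, target))
    return quadruples
```

Applied to VARIANCE := SUMSQ DIV 100 - MEAN * MEAN, this yields a DIV, a multiplication, a subtraction, and a final assignment, in that order.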


Example 


Code Optimization 


Potential Improvement 


Intermediate Form of the Program

• Representation of the executable instructions with a 

sequence of quadruples: 

operation, op1, op2, result 

• For example: 

Intermediate Code

Quadruple Analysis for Code Optimization 

•Intermediate results can be assigned to registers 

or to temporary variables to make their use as 

efficient as possible. 

•Quadruples can be rearranged to eliminate 

redundant load and store operations. 


Assignment and Use of Registers as Instruction Operands

• We would prefer to keep in registers all variables and intermediate results that will be used later in the program.

• Consider "VALUE" in quadruples 7 and 9, and "MEAN" in quadruples 16 and 18.

• Register selection for replacement:

– Scan the program for the next point at which each register value would be used.

– Select the register whose value will not be needed for the longest time.

– Save the value of the selected register to a temporary variable if necessary.

• Be careful about the control flow of the program when assigning and using registers:

– Consider "SUM" in quadruples 1 and 7.
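The replacement rule can be sketched as follows for straight-line code. The sketch assumes quadruples are (operation, op1, op2, result) tuples, that `registers` maps a register name to the variable it currently holds, and that control flow can be ignored; `next_use` and `choose_register` are hypothetical helper names.

```python
def next_use(quadruples, start, name):
    """Index of the first quadruple at or after start that reads name, or None."""
    for index in range(start, len(quadruples)):
        _, op1, op2, _ = quadruples[index]
        if name in (op1, op2):
            return index
    return None


def choose_register(registers, quadruples, current):
    """Pick the register whose value will not be needed for the longest time."""

    def distance(register):
        use = next_use(quadruples, current, registers[register])
        # A value that is never read again is the best candidate of all.
        return float("inf") if use is None else use

    return max(registers, key=distance)
```

Scanning forward like this is only safe inside a region with no jumps, which is why the control-flow caveat above matters.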


Basic Blocks 

•One way to deal with the control flow is to divide 

the program into basic blocks. 

•A basic block is a sequence of quadruples with 

one entry point (beginning of the block), one exit 

point (end of the block), and no jumps within the 

block. 

•Assignment and use of registers within a basic 

block can follow the method previously 

described. 
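One common way to find basic blocks is to mark "leaders": the first quadruple, every jump target, and every quadruple that follows a jump. The sketch below assumes quadruples are (operation, op1, op2, result) tuples, that a jump stores the 0-based index of its target in the result field, and that every target is a valid index; the jump operation names are invented for the example.

```python
JUMP_OPERATIONS = {"J", "JGT", "JLT", "JEQ"}    # hypothetical jump operation names


def basic_blocks(quadruples):
    """Split a list of quadruples into basic blocks, returned as lists of indices."""
    if not quadruples:
        return []
    leaders = {0}                               # the first quadruple starts a block
    for index, (operation, _, _, result) in enumerate(quadruples):
        if operation in JUMP_OPERATIONS:
            leaders.add(result)                 # a jump target starts a block
            if index + 1 < len(quadruples):
                leaders.add(index + 1)          # so does the quadruple after a jump
    starts = sorted(leaders)
    ends = starts[1:] + [len(quadruples)]
    return [list(range(start, end)) for start, end in zip(starts, ends)]
```

Each resulting block has a single entry at its first quadruple and a single exit at its last, so register assignment can proceed within it without regard to jumps.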


Basic Blocks

[Figure: basic blocks A, B, C, D, and E]

Rearrangement of Quadruples 

Original quadruples:

DIV SUMSQ #100 i1
* MEAN MEAN i2
- i1 i2 i3
:= i3 VARIANCE

Rearranged quadruples:

* MEAN MEAN i2
DIV SUMSQ #100 i1
- i1 i2 i3
:= i3 VARIANCE

Code for the original quadruples:

LDA SUMSQ
DIV #100
STA T1
LDA MEAN
MUL MEAN
STA T2
LDA T1
SUB T2
STA VARIANCE

Code for the rearranged quadruples:

LDA MEAN
MUL MEAN
STA T1
LDA SUMSQ
DIV #100
SUB T1
STA VARIANCE

Compiler 




•Compiler Design Options (5.4) 


Structured Variables 

•Array 

•Record 

•String 

•Set 


Array 

•A: ARRAY[1..10] OF INTEGER 

•If each INTEGER variable occupies one word of 

memory, we must allocate ten words to store this 

array. 

•General case 

–ARRAY[l..u] of INTEGER 

–Allocate u-l+1 words of storage for this array 


Multi-dimensional Array 

•B: ARRAY[0..3, 1..6] 

–4*6=24 words 

•General case 

–ARRAY[l1..u1, l2..u2] OF INTEGER

• The number of words to be allocated is (u1 - l1 + 1) * (u2 - l2 + 1)

Row-Major vs. Column-Major 

•Row-major

–All array elements that have the same value of the first subscript are stored in contiguous locations

0,1 0,2 0,3 0,4 | 1,1 1,2 1,3 1,4 | 2,1 2,2 2,3 2,4 | 3,1 3,2 3,3 3,4 | 4,1 4,2 4,3 4,4

Row 0 | Row 1 | Row 2 | Row 3 | Row 4

•Column-major

–All array elements that have the same value of the second subscript are stored in contiguous locations

0,1 1,1 2,1 3,1 4,1 | 0,2 1,2 2,2 3,2 4,2 | 0,3 1,3 2,3 3,3 4,3 | 0,4 1,4 2,4 3,4 4,4

Column 1 | Column 2 | Column 3 | Column 4

Array Reference 

•How to calculate the address of the referenced element relative to the base address of the array

•A: ARRAY[1..10] OF INTEGER

–A[6]: the address relative to the starting address of the array is 5*3 = 15 (each element occupies one 3-byte word).

•General case:

–ARRAY[l..u] OF INTEGER and each array element occupies w bytes of storage

–A[s]: the relative address of A[s] is w*(s-l)

Two-Dimensional Array Reference

•B: ARRAY[0..3, 1..6]

–B[2, 5]

• 2 * 6 + 4 = 16 elements precede B[2, 5] in row-major order

•B: ARRAY[l1..u1, l2..u2] OF INTEGER

–The relative address of B[s1, s2] is w * [(s1 - l1) * (u2 - l2 + 1) + (s2 - l2)]
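The one- and two-dimensional formulas are instances of the same row-major computation, which can be sketched for any number of dimensions. The function name `element_offset` and the representation of bounds as (lower, upper) pairs are assumptions made for the example.

```python
def element_offset(bounds, subscripts, width):
    """Row-major byte offset of an element relative to the start of the array.

    bounds is a list of (lower, upper) pairs, one per dimension;
    width is the number of bytes occupied by each element.
    """
    position = 0
    for (lower, upper), subscript in zip(bounds, subscripts):
        if not lower <= subscript <= upper:
            raise IndexError(f"subscript {subscript} outside {lower}..{upper}")
        # Skip whole slices of this dimension, then step to the subscript.
        position = position * (upper - lower + 1) + (subscript - lower)
    return width * position
```

With 3-byte words, `element_offset([(1, 10)], [6], 3)` reproduces the 5*3 = 15 of the one-dimensional example, and B[2, 5] lies 16 elements, or 48 bytes, from the start of B.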


Code Generation for Array References (1/2)

A: ARRAY[1..10] of INTEGER 

… 

A[I] := 5 

(1) - I #1 i1 

(2) * i1 #3 i2 

(3) := #5 A[i2] 

Code Generation for Array References (2/2)

B: ARRAY[0..3, 1..6] of INTEGER 

… 

B[I, J] := 5 

(1) * I #6 i1 

(2) - J #1 i2 

(3) + i1 i2 i3 

(4) * i3 #3 i4 

(5) := #5 B[i4] 


Machine-Independent 

• Common subexpression 


–These are subexpressions that appear at more than one point in 

the program and that compute the same value. 

–Common subexpressions are usually detected through the 

analysis of the intermediate form of the program. 

• Loop invariants 

–These are subexpressions within a loop whose values do not 

change from one iteration of the loop to the next. 

–Their values can be computed once before the loop is entered, 

rather than being recalculated for each iteration. 

• Reduction in strength of an operation 
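As an illustration of the first technique, here is a sketch of common-subexpression elimination within one straight-line sequence of quadruples. It assumes quadruples are (operation, op1, op2, result) tuples and that each intermediate result (i1, i2, ...) is assigned only once; the function name is hypothetical, and a production optimizer would also have to consider control flow.

```python
def eliminate_common_subexpressions(quadruples):
    """Drop quadruples that recompute a value that is still available."""
    available = {}      # (operation, op1, op2) -> name that holds the value
    replaced = {}       # eliminated intermediate result -> surviving one
    output = []
    for operation, op1, op2, result in quadruples:
        op1 = replaced.get(op1, op1)
        op2 = replaced.get(op2, op2)
        key = (operation, op1, op2)
        if operation != ":=" and key in available:
            replaced[result] = available[key]   # reuse the earlier result
            continue
        output.append((operation, op1, op2, result))
        # Assigning to result invalidates every expression that reads it.
        available = {
            k: v for k, v in available.items()
            if result not in (k[1], k[2]) and v != result
        }
        if operation != ":=" and result not in (op1, op2):
            available[key] = result
    return output
```

The second computation of 2 * J below is removed, but it would be kept if J were assigned a new value between the two occurrences.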

Common Subexpression Elimination

Loop Invariant Elimination

Reduction in Strength of Operations


• Some optimization can be obtained by rewriting the 

source program, e.g., 

T1 := 2 * J;
T2 := T1 - 1;
FOR I := 1 TO 10 DO
    X[I, T2] := Y[I, T1]

• However, this would achieve only a part of the 

benefits of code optimization. 

• An optimizing compiler should allow the programmer 

to write source code that is clear and easy to read, and 

it should compile such a program into machine code 

that is efficient to execute. 


Static Storage Allocation 

[Figure: memory layout under static storage allocation. Labels: System, Main, SUB, Call SUB, RETADR]

Dynamic Storage Allocation 



[Figure: stack of activation records under dynamic storage allocation, for Main, Main calling SUB, and SUB calling SUB. Labels: B, Variables for Main, Variables for SUB, RETADR, NEXT, PREV (0 in the record for Main), Stack]

Block-Structured Languages

PROCEDURE A;
    VAR X, Y, Z: INTEGER;
    PROCEDURE B;
        VAR W, X, Y: REAL;
        PROCEDURE C;
            VAR V, W: INTEGER;
        END {C};
    END {B};
    PROCEDURE D;
        VAR X, Z: CHAR;
    END {D};
END {A};

Block Name   Block Number   Block Level   Surrounding Block
A            1              1             -
B            2              2             1
C            3              3             2
D            4              2             1
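A sketch of how the "surrounding block" column can be used to resolve a name: start in the block where the reference occurs and move outward until a declaration is found. The dictionaries below transcribe the procedures and block table of this example; the function name `resolve` and the data layout are assumptions.

```python
# Block number -> (block name, block level, surrounding block)
BLOCKS = {
    1: ("A", 1, None),
    2: ("B", 2, 1),
    3: ("C", 3, 2),
    4: ("D", 2, 1),
}

# Block number -> names declared in that block and their types
DECLARATIONS = {
    1: {"X": "INTEGER", "Y": "INTEGER", "Z": "INTEGER"},
    2: {"W": "REAL", "X": "REAL", "Y": "REAL"},
    3: {"V": "INTEGER", "W": "INTEGER"},
    4: {"X": "CHAR", "Z": "CHAR"},
}


def resolve(name, block):
    """Return (declaring block number, type) for a name referenced inside block."""
    while block is not None:
        if name in DECLARATIONS[block]:
            return block, DECLARATIONS[block][name]
        block = BLOCKS[block][2]    # continue in the surrounding block
    raise NameError(f"{name} is not declared in any surrounding block")
```

A reference to X inside C therefore finds the REAL declared in B, while the same name inside D finds the CHAR declared in D itself.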



Compiler Design Options 

•Division into passes 

•Interpreter 

•P-Code compiler 

•Compiler-Compilers 


Division into Passes 

•In some languages, the declaration of an identifier may appear after it has been used in the program (forward reference).


Interpreters 

•The interpreters execute a version of the source 

program directly, instead of translating it into 

machine code. 

•The advantage is in the debugging facilities. 


P-Code Compilers 

•The main advantage is portability. 

Compile: Source Program -> P-code Compiler -> Object Program (P-code)

Execute: Object Program (P-code) -> P-code Interpreter

Compiler-Compilers 

[Figure: lexical rules, grammar, and semantic routines -> compiler-compiler -> compiler (scanner, parser, code generator)]

Summary 


•Machine-Dependent Compiler Features 

•Machine-Independent Compiler Features 

•Compiler Design Options 


Another Example 


Three-Address Code 


Flow Graph 


Local Common Subexpression Elimination

Global Common Subexpression Elimination

Copy Propagation 

• Improve the code in B5 by eliminating x: 

x := t3 

a[t2] := t5 

a[t4] := t3 

goto B2 

• The idea is to use g for f, wherever possible after the 

copy statement 

f:=g 

• This may not appear to be an improvement, but it gives 

us the opportunity to eliminate the assignment to x. 


Dead-Code Elimination 

•A variable is live at a point in a program if its 

value can be used subsequently; otherwise, it is 

dead (or useless) at that point. 

•Copy propagation followed by dead-code 

elimination removes the assignment to x: 

a[t2] := t5 

a[t4] := t3 

goto B2 
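The two steps can be sketched on a single basic block. The statement encoding is invented for the example: ("copy", f, g) stands for f := g, ("store", a, i, v) stands for a[i] := v, and anything else is passed through. The sketch also assumes that the original statement in B5 was a[t4] := x and that x is not live on exit from the block.

```python
def propagate_copies(block):
    """After a copy f := g, use g in place of f wherever possible."""
    copies = {}                                  # f -> g for copies still valid
    output = []
    for statement in block:
        if statement[0] == "copy":
            _, destination, source = statement
            source = copies.get(source, source)
            # The destination gets a new value: forget copies that mention it.
            copies = {f: g for f, g in copies.items() if destination not in (f, g)}
            if destination != source:
                copies[destination] = source
            output.append(("copy", destination, source))
        elif statement[0] == "store":
            _, array, index, value = statement
            output.append(("store", array, copies.get(index, index), copies.get(value, value)))
        else:
            output.append(statement)
    return output


def remove_dead_copies(block, live_on_exit):
    """Drop copies whose destination is never used afterwards."""
    live = set(live_on_exit)
    kept = []
    for statement in reversed(block):            # liveness is computed backwards
        if statement[0] == "copy":
            _, destination, source = statement
            if destination not in live:
                continue                         # dead: nothing reads it later
            live.discard(destination)
            live.add(source)
        elif statement[0] == "store":
            live.update(statement[2:])           # the index and the stored value
        kept.append(statement)
    return kept[::-1]
```

Copy propagation alone leaves the block the same length, but it removes the last use of x, which is what allows dead-code elimination to delete the assignment.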


Loop Optimizations 

• The running time of a program may be improved if we 

decrease the number of instructions in an inner loop. 

• Three techniques are important for loop optimization:

–Code motion

• Moves code outside a loop

–Reduction in strength

• Replaces an expensive operation by a cheaper one

–Induction-variable elimination

• Eliminates variables from the inner loop
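Reduction in strength can be illustrated at the source level with an array-offset computation. The sketch assumes a loop over I = 1..n that needs the byte offset width * (I - 1) on each iteration; the two function names are hypothetical, and a compiler would apply the same transformation to the intermediate code rather than to source text.

```python
def offsets_with_multiplication(n, width):
    """Compute width * (i - 1) from scratch on every iteration."""
    return [width * (i - 1) for i in range(1, n + 1)]


def offsets_with_addition(n, width):
    """Same values, but each one is obtained from the previous by an addition."""
    offsets = []
    offset = 0                      # value of width * (i - 1) when i = 1
    for _ in range(1, n + 1):
        offsets.append(offset)
        offset += width             # replaces the multiplication
    return offsets
```

Both functions produce the same sequence; the second simply trades one multiplication per iteration for one addition.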


Strength Reduction 


Induction-Variable Elimination 

induction variables 


Code Optimization Result 
