Slides02 - Computer Science and Engineering

CS308 Compiler Principles 

Lexical Analyzer 

Fan Wu 

Department of Computer Science and Engineering 

Shanghai Jiao Tong University

Lexical Analyzer 

• The lexical analyzer reads the source program character by character to produce tokens.

– strips out comments and whitespace

– returns a token when the parser asks for one

– correlates error messages with the source program

Compiler Principles

Token 

• A token is a pair of a token name and an optional 

attribute value. 

– Token name specifies the pattern of the token 

– Attribute stores the lexeme of the token 

• Tokens 

– Keyword: “begin”, “if”, “else”, … 

– Identifier: string of letters or digits, starting with a letter 

– Integer: a non-empty string of digits 

– Punctuation symbol: “,”, “;”, “(”, “)”, … 

• Regular expressions are widely used to specify 

patterns of the tokens. 
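A minimal sketch of how the token classes above could be specified with regular expressions, using Python's `re` module. The token names (`KEYWORD`, `ID`, `INT`, `PUNCT`, `SKIP`) and the function `tokenize` are illustrative assumptions, not part of the slides. Note that Python alternation picks the first alternative that matches rather than the longest one, so the keyword pattern is listed before the identifier pattern and guarded with word boundaries.

```python
import re

# Hypothetical token names; the patterns follow the informal definitions on the slide.
TOKEN_SPEC = [
    ("KEYWORD", r"\b(?:begin|if|else)\b"),
    ("ID", r"[A-Za-z][A-Za-z0-9]*"),   # letters or digits, starting with a letter
    ("INT", r"[0-9]+"),                # non-empty string of digits
    ("PUNCT", r"[,;()]"),
    ("SKIP", r"\s+"),                  # whitespace is stripped, not returned
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))


def tokenize(text):
    """Return a list of (token name, lexeme) pairs for the whole input."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise SyntaxError(f"unexpected character {text[pos]!r} at offset {pos}")
        if m.lastgroup != "SKIP":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens
```

For example, `tokenize("if (x1) y; else 42")` yields a keyword, punctuation, identifier and integer tokens in source order, each paired with its lexeme.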


Token Example 


Terminology of Languages 

• Alphabet: a finite set of symbols 

– ASCII 

– Unicode 

• String: a finite sequence of symbols on an alphabet 

– ε is the empty string

– |s| is the length of string s

– Concatenation: xy represents x followed by y

– Exponentiation: s^n = s s … s (n times), s^0 = ε

• Language: a set of strings over some fixed alphabet 

– the empty set is a language 

– The set of well-formed C programs is a language 


Operations on Languages 

• Union: L1 ∪ L2 = { s | s ∈ L1 or s ∈ L2 }

• Concatenation: L1L2 = { s1s2 | s1 ∈ L1 and s2 ∈ L2 }

• (Kleene) Closure: L* = ⋃_{i=0}^{∞} L^i

• Positive Closure: L+ = ⋃_{i=1}^{∞} L^i


Example 

• L1 = {a,b,c,d}, L2 = {1,2}

• L1 ∪ L2 = {a,b,c,d,1,2}

• L1L2 = {a1,a2,b1,b2,c1,c2,d1,d2}

• L1* = all strings using letters a,b,c,d, including the empty string

• L1+ = all strings using letters a,b,c,d, without the empty string
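The operations in this example can be checked with Python sets. This is only a sketch: the helper names `concat`, `power` and `closure_up_to` are made up for illustration, and because L* is infinite the closure is truncated at a chosen maximum exponent.

```python
from itertools import product


def concat(l1, l2):
    """Concatenation L1L2 = { s1s2 | s1 in L1 and s2 in L2 }."""
    return {x + y for x, y in product(l1, l2)}


def power(lang, i):
    """L^i, with L^0 = {ε} (the empty string is written "" here)."""
    result = {""}
    for _ in range(i):
        result = concat(result, lang)
    return result


def closure_up_to(lang, n, positive=False):
    """Union of L^i for i = 0..n (Kleene) or i = 1..n (positive); a finite cut of L* or L+."""
    out = set()
    for i in range(1 if positive else 0, n + 1):
        out |= power(lang, i)
    return out


L1 = {"a", "b", "c", "d"}
L2 = {"1", "2"}
UNION = L1 | L2           # {a,b,c,d,1,2}
CONCAT = concat(L1, L2)   # {a1,a2,b1,b2,c1,c2,d1,d2}
```

The only difference between the two closures is whether the exponent 0, and hence the empty string, is included.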


Regular Expressions 

• Regular expression is a representation of a 

language that can be built from the operators 

applied to the symbols of some alphabet. 

• A regular expression is built up of smaller 

regular expressions (using defining rules). 

• Each regular expression r denotes a 

language L(r). 

• A language denoted by a regular expression is called a regular set.


Regular Expressions (Rules) 

Regular expressions over alphabet Σ

Reg. Expr           Language it denotes
ε                   L(ε) = {ε}
a ∈ Σ               L(a) = {a}
(r1) | (r2)         L(r1) ∪ L(r2)
(r1) (r2)           L(r1) L(r2)
(r)*                (L(r))*
(r)                 L(r)

Extension
(r)+ = (r)(r)*      (L(r))+
(r)? = (r) | ε      L(r) ∪ {ε}          zero or one instance
[a1-an]             L(a1|a2|…|an)       character class


Regular Expressions (cont.) 

• We may remove parentheses by using 

precedence rules: 

– * highest 

– concatenation second highest 

– | lowest 

• (a(b)*)|(c) ≡ ab*|c

• Example:

– Σ = {0,1}

– 0|1 => {0,1}

– (0|1)(0|1) => {00,01,10,11}

– 0* => {ε,0,00,000,0000,....}

– (0|1)* => all strings of 0s and 1s, including the empty string
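The examples above can be checked by brute force. The sketch below assumes Python's `re` syntax, which agrees with the slide notation for `|`, `*`, concatenation and parentheses; the helper `language_up_to` is a made-up name that lists every string up to a given length that a pattern matches in full.

```python
import re
from itertools import product


def language_up_to(pattern, alphabet, max_len):
    """All strings over `alphabet` of length <= max_len that the pattern matches entirely."""
    compiled = re.compile(pattern)
    matches = set()
    for n in range(max_len + 1):
        for chars in product(alphabet, repeat=n):
            s = "".join(chars)
            if compiled.fullmatch(s):
                matches.add(s)
    return matches


PAIRS = language_up_to("(0|1)(0|1)", "01", 3)
ZEROS = language_up_to("0*", "01", 3)
# Precedence: the fully parenthesized form and the short form denote the same language.
LONG_FORM = language_up_to("(a(b)*)|(c)", "abc", 3)
SHORT_FORM = language_up_to("ab*|c", "abc", 3)
```

Comparing `LONG_FORM` with `SHORT_FORM` shows the precedence rules at work: `*` binds tighter than concatenation, which binds tighter than `|`.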


Regular Definitions 

• We can give names to regular expressions, and use these names as symbols to define other regular expressions.

• A regular definition is a sequence of definitions of the form:

d1 → r1
d2 → r2
…
dn → rn

where each di is a new symbol and each ri is a regular expression over the symbols in Σ ∪ {d1, d2, ..., di-1}, that is, the alphabet plus the previously defined symbols.


Regular Definitions Example 

• Example: Identifiers in Pascal 

letter → A | B | ... | Z | a | b | ... | z

digit → 0 | 1 | ... | 9

id → letter (letter | digit)*

– If we try to write the regular expression representing identifiers without using regular definitions, that regular expression will be complex:

(A|...|Z|a|...|z) ( (A|...|Z|a|...|z) | (0|...|9) )*
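A regular definition can be mirrored in code by building larger patterns from named smaller ones. This is a sketch using Python's `re` module; the constant names `LETTER`, `DIGIT`, `ID` and the function `is_identifier` are chosen here for illustration.

```python
import re

# Each name is defined only in terms of the alphabet and earlier names.
LETTER = "[A-Za-z]"
DIGIT = "[0-9]"
ID = f"{LETTER}(?:{LETTER}|{DIGIT})*"   # id -> letter (letter | digit)*

ID_RE = re.compile(ID)


def is_identifier(s):
    """True if the whole string s is an identifier under the definition above."""
    return ID_RE.fullmatch(s) is not None
```

Printing `ID` shows the expanded expression, which is the hard-to-read form that regular definitions let us avoid writing by hand.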


Grammar 

Regular Definitions 


Transition Diagram 

• State: represents a condition that could 

occur during scanning 

– start/initial state: 

– accepting/final state: lexeme found 

– intermediate state: 

• Edge: directs from one state to another, 

labeled with one or a set of symbols 


Transition Diagram for relop 

Transition diagram for relop → < | > | <= | >= | = | <>


Transition-Diagram-Based Lexical Analyzer 


Implementation of relop transition diagram 
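Since the figure is not reproduced here, the following is a sketch of how a relop transition diagram is typically coded by hand: one branch per state, and a "retract" whenever the character that ended the lexeme does not belong to it. The state numbers, the attribute names (`LT`, `LE`, `NE`, `EQ`, `GT`, `GE`), the `<>` operator and the function name `get_relop` are assumptions in the style of the standard textbook treatment, not taken from the slide.

```python
def get_relop(text, pos=0):
    """Return (attribute, next position) for the relop starting at text[pos], or None."""
    state = 0
    i = pos
    while True:
        c = text[i] if i < len(text) else ""   # "" plays the role of eof
        if state == 0:
            if c == "<":
                state = 1
            elif c == "=":
                return ("EQ", i + 1)
            elif c == ">":
                state = 6
            else:
                return None                    # not a relop
        elif state == 1:                       # seen "<"
            if c == "=":
                return ("LE", i + 1)
            if c == ">":
                return ("NE", i + 1)
            return ("LT", i)                   # retract: c is not part of the lexeme
        elif state == 6:                       # seen ">"
            if c == "=":
                return ("GE", i + 1)
            return ("GT", i)                   # retract
        i += 1
```

The returned position tells the caller where scanning resumes, so `get_relop("<x")` reports `LT` and leaves `x` for the next token.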


Transition Diagram for Others 

A transition diagram for id's 


A transition diagram for unsigned numbers 


Practice 

• Draw the transition diagram for recognizing the following regular expression:

a(a|b)*a

[Figure: two transition diagrams over states 1, 2, 3 for a(a|b)*a, one nondeterministic and one deterministic]


Finite Automata 

• A finite automaton is a recognizer that takes a 

string, and answers “yes” if the string matches a 

pattern of a specified language, and “no” 

otherwise. 

• Two kinds: 

– Nondeterministic finite automaton (NFA) 

• no restriction on the labels of their edges 

– Deterministic finite automaton (DFA) 

• for each state and each input symbol, exactly one edge labeled with that symbol leaves the state

• NFAs and DFAs have the same capability: both recognize exactly the regular languages

• We may use either an NFA or a DFA as a lexical analyzer


Nondeterministic Finite Automaton (NFA) 

• An NFA consists of:

– S: a set of states 

– Σ: a set of input symbols (alphabet) 

– A transition function: maps state-symbol pairs to sets of 

states 

– s 0 : a start (initial) state 

– F: a set of accepting states (final states) 

• NFA can be represented by a transition graph 

• An NFA accepts a string x if and only if there is a path from the start state to one of the accepting states such that the edge labels along this path spell out x.

• Remarks 

– The same symbol can label edges from one state to 

several different states 

– An edge may be labeled by ε, the empty string 


NFA Example (1) 

The language recognized by this NFA is (a|b)*ab


NFA Example (2) 

NFA accepting aa*|bb*


Implementing an NFA 

S ← ε-closure({s0})              { set of all states accessible from s0 by ε-transitions }
c ← nextchar()
while (c != eof) do
begin
    S ← ε-closure(move(S,c))     { move(S,c): set of all states accessible from a state in S by a transition on c }
    c ← nextchar()
end
if (S ∩ F != ∅) then             { if S contains an accepting state }
    return “yes”
else
    return “no”

Subset Construction 
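The NFA simulation can be sketched in Python as follows. The representation is an assumption made for illustration: the transition function is a dict from (state, label) to a set of states, with the empty string standing for ε. The sample automaton is one possible NFA for aa*|bb*; its state numbering is hypothetical, since the slide's figure is not available.

```python
EPS = ""   # label used for ε-transitions (an encoding chosen for this sketch)


def eps_closure(states, delta):
    """All states reachable from `states` using ε-transitions only."""
    stack = list(states)
    closure = set(states)
    while stack:
        s = stack.pop()
        for t in delta.get((s, EPS), ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return closure


def move(states, c, delta):
    """All states reachable from a state in `states` by one transition on c."""
    return {t for s in states for t in delta.get((s, c), ())}


def nfa_accepts(text, delta, start, accepting):
    S = eps_closure({start}, delta)
    for c in text:
        S = eps_closure(move(S, c, delta), delta)
    return bool(S & accepting)          # S ∩ F != ∅


# Hypothetical NFA for aa*|bb*: state 0 branches by ε into an "a" part and a "b" part.
DELTA = {
    (0, EPS): {1, 3},
    (1, "a"): {2}, (2, "a"): {2},
    (3, "b"): {4}, (4, "b"): {4},
}
ACCEPTING = {2, 4}
```

Each step keeps the whole set of states the NFA could be in, which is why the running time grows with the number of NFA states as well as the input length.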


Deterministic Finite Automaton (DFA) 

• A Deterministic Finite Automaton (DFA) is a special form of an NFA.

– No state has ε- transition 

– For each symbol a and state s, there is at 

most one a labeled edge leaving s. 

[Figure: transition graph of the DFA]

The language recognized by this DFA is also (a|b)*ab


Implementing a DFA 

s ← s0                       { start from the initial state }
c ← nextchar()               { get the next character from the input string }
while (c != eof) do          { do until the end of the string }
begin
    s ← move(s,c)            { transition function }
    c ← nextchar()
end
if (s in F) then             { if s is an accepting state }
    return “yes”
else
    return “no”
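The same loop in Python, as a sketch. The table `DTRAN` below is one possible DFA for (a|b)*ab with hypothetical state numbers 0, 1, 2; a missing table entry is treated as a dead state, which is a choice made here rather than something the slide specifies.

```python
def dfa_accepts(text, dtran, start, accepting):
    """Run the DFA over text; one table lookup per input character."""
    s = start
    for c in text:
        s = dtran.get((s, c))
        if s is None:          # no edge on c: treat as a dead state
            return False
    return s in accepting


# Hypothetical DFA for (a|b)*ab:
#   0 = no useful suffix seen, 1 = input ends in "a", 2 = input ends in "ab"
DTRAN = {
    (0, "a"): 1, (0, "b"): 0,
    (1, "a"): 1, (1, "b"): 2,
    (2, "a"): 1, (2, "b"): 0,
}
```

Unlike the NFA simulation, the work per character is constant, which is the speed advantage of a DFA.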


NFA vs. DFA 

        Compactness    Readability    Speed
NFA     Good           Good           Slow
DFA     Bad            Bad            Fast

• DFAs are widely used to build lexical analyzers. 

[Figure: an NFA and a DFA that both recognize (a|b)*ab]

Pop Quiz 

1) What are the languages represented by the two FAs?

(a) [Figure: finite automaton with states 1 to 9 and edges labeled 0 and 1]

Solution: strings of 0s and 1s with length 4, except 0110

(b) [Figure: finite automaton with states 1 to 5 and edges labeled a]

Solution: a(aaaaa)*

Pop Quiz 

2) For a language only accepting characters from {0,1}, please design a DFA which accepts all strings containing three ‘0’s.

Solution:

[Figure: DFA with edges labeled 0 and 1]


Regular Expression → NFA

• McNaughton-Yamada-Thompson (MYT) 

construction 

– Simple and systematic 

– Construction starts from the simplest parts 

(alphabet symbols). 

– For a complex regular expression, subexpressions 

are combined to create its NFA. 

– Guarantees the resulting NFA will have 

exactly one final state, and one start state. 


MYT Construction 

• Basic rules: for subexpressions with no operators

– For expression ε

start --> i --ε--> f

– For a symbol a in the alphabet Σ

start --> i --a--> f


MYT Construction Cont’d 

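• Inductive rules: for constructing larger NFAs from the NFAs of subexpressions (let N(r1) and N(r2) denote the NFAs for regular expressions r1 and r2, respectively)

– For regular expression r1 | r2

[Figure: a new start state i with ε-edges to the start states of N(r1) and N(r2); the accepting states of N(r1) and N(r2) have ε-edges to a new accepting state f]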


MYT Construction Cont’d 

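– For regular expression r1 r2

[Figure: N(r1) followed by N(r2); i is the start state of N(r1) and f is the accepting state of N(r2)]

– For regular expression r*

[Figure: a new start state i and a new accepting state f around N(r), connected by four ε-edges: i to the start of N(r), the end of N(r) to f, i to f, and the end of N(r) back to its start]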
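The basic and inductive rules can be sketched as a small recursive builder. Everything about the representation is an assumption for illustration: the regular expression is given as a hand-built syntax tree of tuples (`"sym"`, `"eps"`, `"alt"`, `"cat"`, `"star"`), edges are stored in a dict from (state, label) to a set of states, and concatenation reuses the accepting state of the first NFA as the start state of the second, as in the usual textbook form of the construction.

```python
from itertools import count

EPS = ""   # label for ε-edges (an encoding chosen for this sketch)


class Thompson:
    def __init__(self):
        self.edges = {}        # (state, label) -> set of target states
        self._ids = count()

    def new_state(self):
        return next(self._ids)

    def add_edge(self, src, label, dst):
        self.edges.setdefault((src, label), set()).add(dst)

    def build(self, node, start=None):
        """Build N(node) and return (start, final); `start` lets "cat" reuse a state."""
        i = self.new_state() if start is None else start
        kind = node[0]
        if kind == "eps":
            f = self.new_state()
            self.add_edge(i, EPS, f)
        elif kind == "sym":
            f = self.new_state()
            self.add_edge(i, node[1], f)
        elif kind == "alt":
            s1, f1 = self.build(node[1])
            s2, f2 = self.build(node[2])
            f = self.new_state()
            for s in (s1, s2):
                self.add_edge(i, EPS, s)
            for g in (f1, f2):
                self.add_edge(g, EPS, f)
        elif kind == "cat":
            _, mid = self.build(node[1], start=i)
            _, f = self.build(node[2], start=mid)
        elif kind == "star":
            s1, f1 = self.build(node[1])
            f = self.new_state()
            self.add_edge(i, EPS, s1)    # enter N(r)
            self.add_edge(i, EPS, f)     # skip N(r): zero repetitions
            self.add_edge(f1, EPS, s1)   # repeat N(r)
            self.add_edge(f1, EPS, f)    # leave N(r)
        else:
            raise ValueError(f"unknown node kind: {kind}")
        return i, f


def accepts(nfa, start, final, text):
    """Simulate the constructed NFA on text."""
    def closure(states):
        stack, seen = list(states), set(states)
        while stack:
            s = stack.pop()
            for t in nfa.edges.get((s, EPS), ()):
                if t not in seen:
                    seen.add(t)
                    stack.append(t)
        return seen

    current = closure({start})
    for c in text:
        current = closure({t for s in current for t in nfa.edges.get((s, c), ())})
    return final in current


# (a|b)*a as a hand-built syntax tree
TREE = ("cat", ("star", ("alt", ("sym", "a"), ("sym", "b"))), ("sym", "a"))
NFA = Thompson()
START, FINAL = NFA.build(TREE)
```

Every rule adds at most two new states, and the result always has a single start state and a single accepting state.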

Example: (a|b)*a

[Figure: step-by-step MYT construction of the NFAs for a, b, (a|b), (a|b)* and (a|b)*a]

Properties of the Constructed NFA 

1. N(r) has at most twice as many states as there 

are operators and operands in r. 

– This bound follows from the fact that each step of 

the algorithm creates at most two new states. 

2. N(r) has one start state and one accepting 

state. The accepting state has no outgoing 

transitions, and the start state has no incoming 

transitions. 

3. Each state of N(r) other than the accepting state has either one outgoing transition on a symbol in Σ ∪ {ε} or two outgoing transitions, both on ε.


Conversion of an NFA to a DFA 

• Approach: Subset Construction 

– each state of the constructed DFA corresponds to 

a set / combination of NFA states 

• Details 

1 Create the transition table Dtran for the DFA

2 Insert ε-closure(s0) into Dstates as the initial state

3 Pick an unvisited state T in Dstates

4 For each symbol a, create the state ε-closure(move(T, a)), and add it to Dstates and Dtran

5 Repeat steps (3) and (4) until all states in Dstates are visited
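The steps above can be sketched in Python. The NFA encoding (a dict from (state, label) to a set of states, with "" for ε) is an assumption of this sketch, and the sample NFA for (a|b)*abb uses the state numbering 0 to 10 of the usual textbook figure, which may differ from the figure on the slides. Empty target sets are skipped instead of being added as an explicit dead state.

```python
EPS = ""   # label for ε-transitions (an encoding chosen for this sketch)


def eps_closure(states, delta):
    stack, closure = list(states), set(states)
    while stack:
        s = stack.pop()
        for t in delta.get((s, EPS), ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return closure


def subset_construction(delta, start, alphabet):
    """Return (Dstates, Dtran); each DFA state is a frozenset of NFA states."""
    start_set = frozenset(eps_closure({start}, delta))
    dstates = [start_set]
    dtran = {}
    unvisited = [start_set]
    while unvisited:
        T = unvisited.pop()                       # pick an unvisited state
        for a in alphabet:
            moved = {t for s in T for t in delta.get((s, a), ())}
            U = frozenset(eps_closure(moved, delta))
            if not U:
                continue                          # no edge rather than a dead state
            if U not in dstates:
                dstates.append(U)
                unvisited.append(U)
            dtran[(T, a)] = U
    return dstates, dtran


# Assumed NFA for (a|b)*abb, accepting state 10.
NFA_DELTA = {
    (0, EPS): {1, 7}, (1, EPS): {2, 4},
    (2, "a"): {3}, (4, "b"): {5},
    (3, EPS): {6}, (5, EPS): {6}, (6, EPS): {1, 7},
    (7, "a"): {8}, (8, "b"): {9}, (9, "b"): {10},
}
DSTATES, DTRAN = subset_construction(NFA_DELTA, 0, "ab")
ACCEPTING = [S for S in DSTATES if 10 in S]
```

A DFA state is accepting whenever the set of NFA states it stands for contains an accepting NFA state.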


The Subset Construction 


NFA to DFA Example 

NFA for (a|b) * abb 

Transition table for DFA 

Equivalent DFA 


Regular Expression → DFA

• First, augment the given regular expression 

by concatenating a special symbol # 

r → r# (augmented regular expression)

• Second, create a syntax tree for the 

augmented regular expression. 

– All leaves are alphabet symbols (plus # and the 

empty string) 

– All inner nodes are operators 

• Third, number each alphabet symbol (plus #) 

(position numbers) 


Regular Expression → DFA Cont’d

(a|b)*a → (a|b)*a# (augmented regular expression)

[Figure: syntax tree of (a|b)*a#, with leaves a, b, a, # at positions 1, 2, 3, 4]

• each symbol is at a leaf

• each symbol is numbered (positions)

• inner nodes are operators


followpos 

Then we define the function followpos for the positions (positions 

assigned to leaves). 

followpos(i) -- the set of positions which can follow 

the position i in the strings generated by 

the augmented regular expression. 

Example: (a|b)*a#, with positions a = 1, b = 2, a = 3, # = 4

followpos(1) = {1,2,3} 

followpos(2) = {1,2,3} 

followpos(3) = {4} 

followpos(4) = {} 

followpos is defined only for leaves (positions), not for inner nodes.


firstpos, lastpos, nullable 

• To compute followpos, we need three more 

functions defined for the nodes (not just for 

leaves) of the syntax tree. 

– firstpos(n) -- the set of the positions of the first 

symbols of strings generated by the subexpression 

rooted by n. 

– lastpos(n) -- the set of the positions of the last 

symbols of strings generated by the subexpression 

rooted by n. 

– nullable(n) -- true if the empty string is a 

member of strings generated by the subexpression 

rooted by n; false otherwise 


Usage of the Functions 

(a|b)*a → (a|b)*a# (augmented regular expression)

[Figure: syntax tree of (a|b)*a#; m is the star node and n is the concatenation node for (a|b)*a]

nullable(n) = false

nullable(m) = true

firstpos(n) = {1, 2, 3}

lastpos(n) = {3}


Computing nullable, firstpos, lastpos 

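Rules for nullable(n), firstpos(n) and lastpos(n), by kind of node n:

n is a leaf labeled ε
    nullable(n) = true
    firstpos(n) = ∅
    lastpos(n) = ∅

n is a leaf labeled with position i
    nullable(n) = false
    firstpos(n) = {i}
    lastpos(n) = {i}

n is an or-node c1 | c2
    nullable(n) = nullable(c1) or nullable(c2)
    firstpos(n) = firstpos(c1) ∪ firstpos(c2)
    lastpos(n) = lastpos(c1) ∪ lastpos(c2)

n is a concatenation-node c1 c2
    nullable(n) = nullable(c1) and nullable(c2)
    firstpos(n) = if (nullable(c1)) firstpos(c1) ∪ firstpos(c2) else firstpos(c1)
    lastpos(n) = if (nullable(c2)) lastpos(c1) ∪ lastpos(c2) else lastpos(c2)

n is a star-node c1*
    nullable(n) = true
    firstpos(n) = firstpos(c1)
    lastpos(n) = lastpos(c1)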


How to evaluate followpos 

• Two rules define the function followpos:

1. If n is concatenation-node with left child c 1 and 

right child c 2 , and i is a position in lastpos(c 1 ), 

then all positions in firstpos(c 2 ) are in 

followpos(i). 

2. If n is a star-node, and i is a position in 

lastpos(n), then all positions in firstpos(n) are 

in followpos(i). 

• If firstpos and lastpos have been computed 

for each node, followpos of each position 

can be computed by making one depth-first 

traversal of the syntax tree. 
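A sketch of that single depth-first traversal, computing nullable, firstpos and lastpos bottom-up and filling followpos along the way. The syntax tree encoding (tuples tagged `"leaf"`, `"eps"`, `"alt"`, `"cat"`, `"star"`) and the function name `analyze` are assumptions made for this illustration.

```python
def analyze(node, followpos):
    """Return (nullable, firstpos, lastpos) for node; fill the followpos dict in place."""
    kind = node[0]
    if kind == "eps":
        return True, set(), set()
    if kind == "leaf":                       # ("leaf", symbol, position)
        pos = node[2]
        followpos.setdefault(pos, set())
        return False, {pos}, {pos}
    if kind == "alt":
        n1, f1, l1 = analyze(node[1], followpos)
        n2, f2, l2 = analyze(node[2], followpos)
        return n1 or n2, f1 | f2, l1 | l2
    if kind == "cat":
        n1, f1, l1 = analyze(node[1], followpos)
        n2, f2, l2 = analyze(node[2], followpos)
        for i in l1:                         # rule 1: lastpos(c1) is followed by firstpos(c2)
            followpos[i] |= f2
        first = f1 | f2 if n1 else f1
        last = l1 | l2 if n2 else l2
        return n1 and n2, first, last
    if kind == "star":
        _, f1, l1 = analyze(node[1], followpos)
        for i in l1:                         # rule 2: lastpos(n) is followed by firstpos(n)
            followpos[i] |= f1
        return True, f1, l1
    raise ValueError(f"unknown node kind: {kind}")


# Syntax tree of (a|b)*a# with positions a = 1, b = 2, a = 3, # = 4
TREE = ("cat",
        ("cat",
         ("star", ("alt", ("leaf", "a", 1), ("leaf", "b", 2))),
         ("leaf", "a", 3)),
        ("leaf", "#", 4))
FOLLOWPOS = {}
NULLABLE, FIRSTPOS, LASTPOS = analyze(TREE, FOLLOWPOS)
```

For the star node, firstpos and lastpos are those of its child, so rule 2 can be applied with the child's sets.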


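Example -- (a|b)*a#

[Figure: syntax tree of (a|b)*a# annotated with firstpos (red) and lastpos (blue)]

node          firstpos    lastpos
a (pos 1)     {1}         {1}
b (pos 2)     {2}         {2}
a|b           {1,2}       {1,2}
(a|b)*        {1,2}       {1,2}
a (pos 3)     {3}         {3}
(a|b)*a       {1,2,3}     {3}
# (pos 4)     {4}         {4}
(a|b)*a#      {1,2,3}     {4}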

Then we can calculate followpos 

followpos(1) = {1,2,3} 

followpos(2) = {1,2,3} 

followpos(3) = {4} 

followpos(4) = {} 

• After we calculate follow positions, we are ready to create 

DFA for the regular expression. 


Algorithm (RE → DFA)

1. Create the syntax tree of (r)#

2. Calculate nullable, firstpos, lastpos, followpos

3. Put firstpos(root) into the states of DFA as an unmarked state.

4. while (there is an unmarked state S in the states of DFA) do

– mark S

– for each input symbol a do

• let s1,...,sn be the positions in S whose symbol is a

• S’ ← followpos(s1) ∪ ... ∪ followpos(sn)

• Dtran[S,a] ← S’

• if (S’ is not in the states of DFA)

– put S’ into the states of DFA as an unmarked state.

• the start state of DFA is firstpos(root)

• the accepting states of DFA are all states containing the position of #
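Step 4 can be sketched in Python once followpos is known. The inputs below are the values for (a|b)*a# with positions a = 1, b = 2, a = 3, # = 4; the function name `build_dfa` and the `symbol_at` table are made up for this sketch, and an empty S’ is skipped rather than recorded as a dead state.

```python
def build_dfa(first_root, followpos, symbol_at, alphabet):
    """Return (states, Dtran); each DFA state is a frozenset of positions."""
    start = frozenset(first_root)
    states = [start]
    unmarked = [start]
    dtran = {}
    while unmarked:
        S = unmarked.pop()                               # mark S
        for a in alphabet:
            # union of followpos(s) over the positions s in S whose symbol is a
            target = frozenset(
                p for s in S if symbol_at[s] == a for p in followpos[s]
            )
            if not target:
                continue                                 # no transition on a
            dtran[(S, a)] = target
            if target not in states:
                states.append(target)
                unmarked.append(target)
    return states, dtran


FOLLOWPOS = {1: {1, 2, 3}, 2: {1, 2, 3}, 3: {4}, 4: set()}
SYMBOL_AT = {1: "a", 2: "b", 3: "a", 4: "#"}
STATES, DTRAN = build_dfa({1, 2, 3}, FOLLOWPOS, SYMBOL_AT, "ab")
ACCEPTING = [S for S in STATES if 4 in S]                # states containing the position of #
```

The result has two states, {1,2,3} and {1,2,3,4}, and only the second contains the position of #.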


Example -- (a|b)*a#, with positions a = 1, b = 2, a = 3, # = 4

followpos(1)={1,2,3} followpos(2)={1,2,3}

followpos(3)={4} followpos(4)={}

S1=firstpos(root)={1,2,3}

mark S1

a: followpos(1) ∪ followpos(3)={1,2,3,4}=S2    Dtran[S1,a]=S2

b: followpos(2)={1,2,3}=S1    Dtran[S1,b]=S1

mark S2

a: followpos(1) ∪ followpos(3)={1,2,3,4}=S2    Dtran[S2,a]=S2

b: followpos(2)={1,2,3}=S1    Dtran[S2,b]=S1

start state: S1

accepting states: {S2}

[Figure: resulting DFA with S1 --a--> S2, S1 --b--> S1, S2 --a--> S2, S2 --b--> S1]

Example -- (a|ε)bc*#, with positions a = 1, b = 2, c = 3, # = 4

followpos(1)={2} followpos(2)={3,4} followpos(3)={3,4}

followpos(4)={}

S1=firstpos(root)={1,2}

mark S1

a: followpos(1)={2}=S2    Dtran[S1,a]=S2

b: followpos(2)={3,4}=S3    Dtran[S1,b]=S3

mark S2

b: followpos(2)={3,4}=S3    Dtran[S2,b]=S3

mark S3

c: followpos(3)={3,4}=S3    Dtran[S3,c]=S3

start state: S1

accepting states: {S3}

[Figure: resulting DFA with S1 --a--> S2, S1 --b--> S3, S2 --b--> S3, S3 --c--> S3]


Minimizing Number of DFA States 

• For any regular language, there is always a unique 

minimum state DFA, which can be constructed from 

any DFA of the language. 

• Algorithm: 

– Partition the set of states into two groups: 

• G 1 : set of accepting states 

• G 2 : set of non-accepting states 

– For each new group G

• partition G into subgroups such that states s1 and s2 are in the same subgroup iff, for all input symbols a, states s1 and s2 have transitions on a to states in the same group.

• repeat until no group can be partitioned further.

– Start state of the minimized DFA is the group containing 

the start state of the original DFA. 

– Accepting states of the minimized DFA are the groups 

containing the accepting states of the original DFA. 
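The partitioning loop can be sketched as below. The function name `minimize` and the table encoding are assumptions of this sketch. The sample table uses the transitions a: 1→2, 2→2, 3→4 and b: 1→3, 2→3, 3→3 with accepting state 4; the outgoing edges of state 4 are left out because they are not needed here (4 is alone in its group), and a missing edge is treated as going to no group.

```python
def minimize(states, alphabet, dtran, accepting):
    """Return the final partition of `states` as a list of sets."""
    # Initial partition: accepting states and non-accepting states.
    partition = [g for g in (set(accepting), set(states) - set(accepting)) if g]
    while True:
        def group_of(s):
            """Index of the current group containing s, or None for a missing edge."""
            for idx, g in enumerate(partition):
                if s in g:
                    return idx
            return None

        new_partition = []
        for g in partition:
            buckets = {}
            for s in g:
                # States stay together iff every symbol leads them to the same group.
                key = tuple(group_of(dtran.get((s, a))) for a in alphabet)
                buckets.setdefault(key, set()).add(s)
            new_partition.extend(buckets.values())
        if len(new_partition) == len(partition):
            return new_partition                 # no more partitioning
        partition = new_partition


DTRAN = {
    (1, "a"): 2, (1, "b"): 3,
    (2, "a"): 2, (2, "b"): 3,
    (3, "a"): 4, (3, "b"): 3,
}
GROUPS = minimize({1, 2, 3, 4}, "ab", DTRAN, {4})
```

Each resulting group becomes one state of the minimized DFA, with transitions taken from any representative of the group.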


Minimizing DFA – Example (1) 

[Figure: DFA with states 1, 2, 3 and edges labeled a and b]

G1 = {2}

G2 = {1,3}

G2 cannot be partitioned because

Dtran[1,a]=2

Dtran[3,a]=2

Dtran[1,b]=3

Dtran[3,b]=3

So, the minimized DFA (with minimum states) is

[Figure: minimized DFA with states 1 and 2 and edges labeled a and b]


Minimizing DFA – Example (2) 

[Figure: DFA with states 1, 2, 3, 4 and edges labeled a and b]

Groups: {1,2,3} {4}

{1,2,3} is split into {1,2} {3}

no more partitioning

       a       b
1      1->2    1->3
2      2->2    2->3
3      3->4    3->3

[Figure: minimized DFA with states 1, 2, 3 and edges labeled a and b]

Architecture of A Lexical Analyzer 


An NFA for Lex program 

• Create an NFA for each regular expression

• Combine all the NFAs into one

• Introduce a new start state

• Connect it with ε-transitions to the start states of the NFAs


Pattern Matching with NFA 

1 The lexical analyzer reads the input and calculates the set of states it is in at each symbol.

2 Eventually, it reaches a point with no next state.

3 It looks backwards in the sequence of sets of states, until it finds a set including one or more accepting states.

4 It picks the accepting state associated with the earliest pattern in the list from the Lex program.

5 It performs the action associated with that pattern.


Pattern Matching with NFA -- Example 

Input: aaba

Report pattern: a*b+

Pattern Matching with DFA 

1 Convert the NFA for all the patterns into an equivalent DFA. For each DFA state containing more than one accepting NFA state, choose the pattern that is defined earliest as the output of the DFA state.

2 Simulate the DFA until there is no next state.

3 Trace back to the nearest accepting DFA state, and perform the associated action.

Input: abba

DFA states visited: 0137 → 247 → 58 → 68

Report pattern abb
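A sketch of the longest-match simulation on a DFA. The table below is an assumption: it is the DFA of the standard textbook example for the three patterns a, abb and a*b+ (listed in that order), which is consistent with the state names 0137, 247, 58, 68 shown above; states 7 and 8 and the function name `next_token` are not taken from the slide.

```python
# Assumed DFA for the patterns a, abb, a*b+ (in that priority order).
DTRAN = {
    ("0137", "a"): "247", ("0137", "b"): "8",
    ("247", "a"): "7", ("247", "b"): "58",
    ("7", "a"): "7", ("7", "b"): "8",
    ("8", "b"): "8",
    ("58", "b"): "68",
    ("68", "b"): "8",
}
# Output pattern of each accepting DFA state; "68" contains accepting NFA states
# for both abb and a*b+, and abb is defined earlier.
ACCEPT = {"247": "a", "8": "a*b+", "58": "a*b+", "68": "abb"}


def next_token(text, pos=0, start="0137"):
    """Return (pattern, lexeme) for the longest match starting at pos, or None."""
    state = start
    last_accept = None
    i = pos
    while i < len(text) and (state, text[i]) in DTRAN:
        state = DTRAN[(state, text[i])]
        i += 1
        if state in ACCEPT:
            # Remembering the most recent accepting state replaces tracing back.
            last_accept = (ACCEPT[state], text[pos:i])
    return last_accept
```

On the input abba the simulation stops at state 68 with no move on a, so the reported pattern is abb; on aaba it reports a*b+ with lexeme aab.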


Summary 

• How lexical analyzers work 

– Convert REs to NFA 

– Convert NFA to DFA 

– Minimize DFA 

– Use the minimized DFA to recognize tokens 

in the input 

– Use priorities, longest matching rule 


Homework

• Exercise 3.7.1 (c)

• Exercise 3.7.3 (c)

• Exercise 3.9.4 (a)

• Due date: Oct. 9, 2014 (Monday)
