Compiler
Outline
Role of lexical analyzer
Specification of tokens
Recognition of tokens
Lexical analyzer generator
Finite automata
Design of lexical analyzer generator
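[Figure: the lexical analyzer reads characters from the source program, putting a character back when it has read too far, and passes each token and its attribute value on. Identifiers (id) are entered in the symbol table. The entire program may be read into memory first.]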
Compiler Construction
[Figure: the parser calls getNextToken to get the next token from the lexical analyzer. Both use the symbol table, and the parser's output goes to semantic analysis.]
Other tasks
Stripping out comments and whitespace from the source program
The lexical analyzer can be divided into scanning (simple operations) and lexical analysis proper (complex)
Example
Token        Informal description                     Sample lexemes
if           Characters i, f                          if
else         Characters e, l, s, e                    else
comparison   < or > or <= or >= or == or !=           <=, !=
id           Letter followed by letters and digits    pi, score, D2
number       Any numeric constant                     3.14159, 0, 6.02e23
literal      Anything but ", surrounded by "s         "core dumped"
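Example: E = C1 * 10
Lexeme   Token   Attribute
E        ID      pointer to symbol-table entry for E
=        =
C1       ID      pointer to symbol-table entry for C1
*        *
10       NUM     10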
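The token stream for E = C1 * 10 can be reproduced with a small hand-written scanner. The following is a minimal sketch, not the method used by any particular compiler: the names TokenKind, Token and next_token are hypothetical, and only the four token classes needed for this one statement are handled.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical token kinds for this sketch; not taken from the slides. */
typedef enum { TOK_ID, TOK_NUM, TOK_ASSIGN, TOK_MUL, TOK_EOF, TOK_ERROR } TokenKind;

typedef struct {
    TokenKind kind;
    char lexeme[32];   /* copy of the matched characters */
} Token;

/* Scan one token starting at *src and advance *src past it.
   Lexemes longer than 31 characters are split in this sketch. */
Token next_token(const char **src) {
    Token tok = { TOK_EOF, "" };
    const char *p = *src;
    size_t n = 0;

    /* strip whitespace before the token */
    while (*p == ' ' || *p == '\t' || *p == '\n') p++;

    if (*p == '\0') {
        tok.kind = TOK_EOF;
    } else if (isalpha((unsigned char)*p)) {
        /* id: a letter followed by letters and digits */
        while (isalnum((unsigned char)*p) && n < sizeof tok.lexeme - 1)
            tok.lexeme[n++] = *p++;
        tok.kind = TOK_ID;
    } else if (isdigit((unsigned char)*p)) {
        /* number: one or more digits (no fraction or exponent here) */
        while (isdigit((unsigned char)*p) && n < sizeof tok.lexeme - 1)
            tok.lexeme[n++] = *p++;
        tok.kind = TOK_NUM;
    } else {
        /* single-character operators; anything else is an error token */
        tok.kind = (*p == '=') ? TOK_ASSIGN : (*p == '*') ? TOK_MUL : TOK_ERROR;
        tok.lexeme[n++] = *p++;
    }
    tok.lexeme[n] = '\0';
    *src = p;
    return tok;
}
```

Calling next_token repeatedly on a pointer to "E = C1 * 10" yields ID, =, ID, *, NUM and then the end-of-input token. A real lexical analyzer would store a symbol-table pointer as the attribute of ID instead of copying the lexeme.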
Input buffering
Sometimes the lexical analyzer needs to look ahead some characters before it can decide which token to return.
Specification of Tokens
Regular expressions are an important notation for specifying
lexeme patterns. While they cannot express all possible
patterns, they are very effective in specifying those types
of patterns that we actually need for tokens.
Specification of tokens
In the theory of compilation, regular expressions are used to formalize the specification of tokens.
Example: letter(letter | digit)*
Regular expressions
ε is a regular expression, L(ε) = {ε}
If a is a symbol in Σ, then a is a regular expression, L(a) = {a}
(r) | (s) is a regular expression denoting the language L(r) ∪ L(s)
(r)(s) is a regular expression denoting the language L(r)L(s)
(r)* is a regular expression denoting (L(r))*
R* = R concatenated with itself 0 or more times
   = {ε} ∪ R ∪ RR ∪ RRR ∪ ...
Extensions
One or more instances: (r)+
Zero or one instance: (r)?
Character classes: [abc]
Example:
letter_ -> [A-Za-z_]
digit   -> [0-9]
id      -> letter_(letter_ | digit)*
Regular definitions
d1 -> r1
d2 -> r2
...
dn -> rn
Example:
letter -> A | B | ... | Z | a | b | ... | z | _
digit  -> 0 | 1 | ... | 9
id     -> letter(letter | digit)*
Operations on Languages
Example:
Let L be the set of letters {A, B, ..., Z, a, b, ..., z} and let D
be the set of digits {0, 1, ..., 9}. L and D are, respectively, the
alphabets of uppercase and lowercase letters and of digits.
Other languages can be constructed from L and D using the
operators union, concatenation and closure:
1. L ∪ D is the set of letters and digits; strictly speaking, the
language with 62 (52 + 10) strings of length one, each of which
is either one letter or one digit.
2. LD is the set of 520 (52 × 10) strings of length two, each consisting of
one letter followed by one digit. Ex: A1, a1, B0, etc.
3. L^4 is the set of all 4-letter strings. Ex: aaba, bcef
4. L* is the set of all strings of letters, including ε, the empty
string.
5. L(L ∪ D)* is the set of all strings of letters and digits beginning
with a letter.
6. D+ is the set of all strings of one or more digits.
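The counts and membership claims in the list above can be checked mechanically. This is a small sketch under the assumption that letters and digits are the ASCII ones; the function names count_LD, in_L_LuD_star and in_D_plus are made up for illustration.

```c
#include <ctype.h>
#include <stdbool.h>

/* Count the strings in LD by enumerating every letter-digit pair. */
int count_LD(void) {
    const char *letters = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";
    const char *digits = "0123456789";
    int count = 0;
    for (const char *l = letters; *l; l++)
        for (const char *d = digits; *d; d++)
            count++;               /* the string made of *l followed by *d */
    return count;
}

/* Is s in L(L U D)*, i.e. letters and digits beginning with a letter? */
bool in_L_LuD_star(const char *s) {
    if (!isalpha((unsigned char)*s)) return false;
    for (s++; *s; s++)
        if (!isalnum((unsigned char)*s)) return false;
    return true;
}

/* Is s in D+, i.e. one or more digits? */
bool in_D_plus(const char *s) {
    if (*s == '\0') return false;  /* the empty string is not in D+ */
    for (; *s; s++)
        if (!isdigit((unsigned char)*s)) return false;
    return true;
}
```

Here count_LD() returns 520, in_L_LuD_star accepts D2 but rejects 2D, and in_D_plus rejects the empty string because D+ requires at least one digit.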
Specification of Tokens
Algebraic laws of regular expressions
1) r|s = s|r
2) r|(s|t) = (r|s)|t      r(st) = (rs)t
3) r(s|t) = rs|rt         (s|t)r = sr|tr
4) εr = rε = r
5) (r*)* = r*
6) r* = r+ | ε            r r* = r* r
7) (r|s)* = (r*|s*)* = (r* s*)*
Recognition of Tokens
Task of recognition of token in a lexical analyzer
Regular expression   Token   Attribute value
if                   if      -
id                   id      Pointer to table entry
<                    relop   LT
Recognition of Tokens
Methods for recognizing tokens
The starting point is the language grammar, from which the tokens are identified.
The next step is to formalize the patterns:
digit  -> [0-9]
digits -> digit+
number -> digits (. digits)? (E [+-]? digits)?
letter -> [A-Za-z_]
id     -> letter (letter | digit)*
if     -> if
then   -> then
else   -> else
relop  -> < | > | <= | >= | = | <>
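To make the number pattern concrete, the sketch below matches digits (. digits)? (E [+-]? digits)? by hand and returns the length of the longest matching prefix. The function names match_digits and match_number are illustrative assumptions, not part of the slides.

```c
#include <ctype.h>
#include <stddef.h>

/* Length of the run of digits at the start of s (0 if there is none). */
static size_t match_digits(const char *s) {
    size_t n = 0;
    while (isdigit((unsigned char)s[n])) n++;
    return n;
}

/* Length of the longest prefix of s that matches
   digits (. digits)? (E [+-]? digits)?, or 0 if s does not start with a number. */
size_t match_number(const char *s) {
    size_t n = match_digits(s);
    if (n == 0) return 0;

    /* optional fraction: taken only if at least one digit follows the dot */
    if (s[n] == '.') {
        size_t frac = match_digits(s + n + 1);
        if (frac > 0) n += 1 + frac;
    }

    /* optional exponent: E, optional sign, then at least one digit */
    if (s[n] == 'E') {
        size_t k = n + 1;
        if (s[k] == '+' || s[k] == '-') k++;
        size_t exp = match_digits(s + k);
        if (exp > 0) n = k + exp;
    }
    return n;
}
```

For example, match_number("6.02E23") consumes all 7 characters, while match_number("12.x") stops after 12 because the optional fraction needs a digit after the dot.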
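Transition diagram for relop (# indicates input retraction):
state 0 (start): <  -> state 1,  =  -> state 5,  >  -> state 6
state 1:         =  -> state 2,  >  -> state 3,  other -> state 4
state 2:         return(relop,LE)
state 3:         return(relop,NE)
state 4:       # return(relop,LT)
state 5:         return(relop,EQ)
state 6:         =  -> state 7,  other -> state 8
state 7:         return(relop,GE)
state 8:       # return(relop,GT)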
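The relop diagram can be simulated directly with a state variable and a switch. The sketch below assumes the usual numbering of the diagram (states 0 to 8, with retraction in the two "other" states); the Relop type, the NOT_RELOP value and the function match_relop are hypothetical names introduced for the example.

```c
/* Attribute values from the slides, plus a hypothetical NOT_RELOP for failure. */
typedef enum { LT, LE, EQ, NE, GT, GE, NOT_RELOP } Relop;

/* Simulate the relop transition diagram on s.
   *len receives the length of the lexeme (0 if s does not start with a relop). */
Relop match_relop(const char *s, int *len) {
    int state = 0;
    int i = 0;          /* number of characters read so far */
    char c;
    for (;;) {
        switch (state) {
        case 0:
            c = s[i++];
            if (c == '<') state = 1;
            else if (c == '=') state = 5;
            else if (c == '>') state = 6;
            else { *len = 0; return NOT_RELOP; }
            break;
        case 1:
            c = s[i++];
            if (c == '=') state = 2;
            else if (c == '>') state = 3;
            else state = 4;
            break;
        case 2: *len = i; return LE;
        case 3: *len = i; return NE;
        case 4: *len = i - 1; return LT;   /* retract one character */
        case 5: *len = i; return EQ;
        case 6:
            c = s[i++];
            state = (c == '=') ? 7 : 8;
            break;
        case 7: *len = i; return GE;
        case 8: *len = i - 1; return GT;   /* retract one character */
        }
    }
}
```

On the input "<a" the function reads two characters, reaches state 4, retracts one, and reports LT with a lexeme of length 1.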
Ex2:
id = letter (letter | digit)*
Transition diagram:
state 9 (start): letter -> state 10
state 10:        letter or digit -> state 10,  other -> state 11
state 11:        return(id)
Transition diagrams
(cont.)
Transition diagram for whitespace
Code for the id transition diagram (states 9 to 11):
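switch (state) {
    case 9:                       /* start state of the id diagram */
        c = nextchar();
        if (isletter(c)) state = 10;
        else state = failure();   /* try the next transition diagram */
        break;
    case 10:                      /* inside an identifier */
        c = nextchar();
        if (isletter(c) || isdigit(c)) state = 10;
        else state = 11;
        break;
    case 11:                      /* accept id; the last character is not part of it */
        retract(1);
        insert(id);
        return;
    case 8:                       /* accept relop GT (from the relop diagram) */
        retract(1);
        retToken.attribute = GT;
        return(retToken);
}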
Lexical Analyzer Generator - Lex
Lex source program (lex.l)  ->  Lex compiler  ->  lex.yy.c
lex.yy.c                    ->  C compiler    ->  a.out
Input stream                ->  a.out         ->  Sequence of tokens
Structure of Lex programs
declarations
%%
translation rules
%%
auxiliary functions

Each translation rule has the form:
Pattern { Action }
Example
%{
    /* definitions of manifest constants
       LT, LE, EQ, NE, GT, GE,
       IF, THEN, ELSE, ID, NUMBER, RELOP */
%}

/* regular definitions */
delim     [ \t\n]
ws        {delim}+
letter    [A-Za-z]
digit     [0-9]
id        {letter}({letter}|{digit})*
number    {digit}+(\.{digit}+)?(E[+-]?{digit}+)?

%%

{ws}      {/* no action and no return */}
if        {return(IF);}
then      {return(THEN);}
else      {return(ELSE);}
{id}      {yylval = (int) installID(); return(ID);}
{number}  {yylval = (int) installNum(); return(NUMBER);}
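The actions above call installID(), which belongs in the auxiliary functions section and is not shown. One plausible shape for it is sketched below as plain C: it looks the lexeme up in a small table and adds it on first sight. The table layout, the size limits and the name install_id are assumptions made for this example. In a real Lex program the function would take the lexeme from yytext and its length from yyleng rather than from parameters.

```c
#include <stddef.h>
#include <string.h>

enum { MAX_SYMBOLS = 64, MAX_NAME = 32 };   /* arbitrary limits for the sketch */

static char symtab[MAX_SYMBOLS][MAX_NAME];  /* one stored name per entry */
static int symcount = 0;

/* Return the table index of the lexeme, adding it if it is not there yet.
   Returns -1 if the lexeme is too long or the table is full. */
int install_id(const char *lexeme, size_t len) {
    if (len >= MAX_NAME) return -1;
    for (int i = 0; i < symcount; i++)
        if (strlen(symtab[i]) == len && memcmp(symtab[i], lexeme, len) == 0)
            return i;                       /* already installed */
    if (symcount == MAX_SYMBOLS) return -1;
    memcpy(symtab[symcount], lexeme, len);
    symtab[symcount][len] = '\0';
    return symcount++;
}
```

Installing pi, then score, then pi again returns 0, 1 and 0, so every occurrence of the same identifier gets the same attribute value.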
Finite Automata
Regular expressions = specification
Finite automata = implementation
A finite automaton consists of
An input alphabet Σ
A set of states S
A start state n
A set of accepting states F ⊆ S
A set of transitions: state --input--> state
Finite Automata
Transition
s1 --a--> s2
Is read: in state s1 on input a, go to state s2
A Simple Example
A finite automaton that accepts only "1"
[Figure: state diagram with a single transition on 1 into the accepting state]
Another Simple Example
A finite automaton accepting any number of 1s followed by a single 0
Alphabet: {0,1}
[Figure: state diagram with a loop on 1 at the start state and a transition on 0 into the accepting state]
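The automaton for any number of 1s followed by a single 0 is small enough to code by hand. The sketch below uses an explicit state variable; the state numbering and the extra dead state are choices made for this example, since the diagram itself is not reproduced here.

```c
#include <stdbool.h>

/* States: 0 = start (still reading 1s), 1 = accepting (saw the single 0),
   2 = dead (no transition was possible, so the input is rejected). */
bool accepts_ones_then_zero(const char *input) {
    int state = 0;
    for (; *input; input++) {
        if (state == 0 && *input == '1') state = 0;       /* loop on 1 */
        else if (state == 0 && *input == '0') state = 1;  /* the single 0 */
        else state = 2;                                   /* anything else */
    }
    return state == 1;   /* accept only if input ends in the accepting state */
}
```

The strings 0 and 110 are accepted. The strings 1, 100 and the empty string are rejected.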
And Another Example
Alphabet {0,1}
What language does this recognize?
[Figure: state diagram with transitions on 0 and 1; not recoverable]
Epsilon Moves
Another kind of transition: ε-moves
Deterministic and Nondeterministic Automata
Deterministic Finite Automata (DFA)
  One transition per input per state
  No ε-moves
Nondeterministic Finite Automata (NFA)
  Can have multiple transitions for one input in a given state
  Can have ε-moves
Finite automata have finite memory
  Need only to encode the current state
Execution of Finite Automata
A DFA can take only one path through the state graph
Completely determined by input
Acceptance of NFAs
An NFA can get into multiple states
[Figure: NFA with transitions on 0 and 1 and a sample input; not recoverable]
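Because an NFA can be in several states at once, a simulation tracks the whole set of current states and accepts if that set contains an accepting state when the input ends. The sketch below stores each set as a bitmask. The three-state NFA is a hypothetical example (it accepts strings over {0,1} that end in 10), not the automaton from the slide, and it has no ε-moves.

```c
#include <stdbool.h>

enum { NUM_STATES = 3 };

/* delta[s][a] = bitmask of the states reachable from state s on symbol a. */
static const unsigned delta[NUM_STATES][2] = {
    { 1u << 0, (1u << 0) | (1u << 1) },  /* state 0: stay on 0; on 1 stay or move to 1 */
    { 1u << 2, 0 },                      /* state 1: on 0 move to state 2 */
    { 0, 0 }                             /* state 2: accepting, no moves */
};
static const unsigned accepting = 1u << 2;

bool nfa_accepts(const char *input) {
    unsigned current = 1u << 0;          /* the set holding only the start state */
    for (; *input; input++) {
        if (*input != '0' && *input != '1') return false;  /* not in the alphabet */
        int a = *input - '0';
        unsigned next = 0;
        for (int s = 0; s < NUM_STATES; s++)
            if (current & (1u << s))
                next |= delta[s][a];     /* union of the moves of every current state */
        current = next;
    }
    return (current & accepting) != 0;   /* some path ended in an accepting state */
}
```

On the input 0110 the set of current states goes from {0} to {0}, {0,1}, {0,1} and finally {0,2}, which contains the accepting state 2.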
[Figure: an NFA and a DFA, each with transitions on 0 and 1; state diagrams not recoverable]
Regular Expressions to Finite Automata
High-level sketch:
Lexical Specification -> Regular expressions -> NFA -> DFA -> Table-driven Implementation of DFA
Regular Expressions to NFA (1)
For each kind of rexp, define an NFA
Notation: NFA for rexp A
For ε
For input a
[Figures: NFA fragments; not recoverable]
Regular Expressions to NFA (2)
For AB
For A | B
[Figures: NFA fragments; not recoverable]
Regular Expressions to NFA (3)
For A*
[Figures: NFA fragment for A* and an example NFA with transitions on 0 and 1; not recoverable]
Next step in the pipeline: NFA -> DFA
[Figure: NFA to DFA example. The DFA states are the sets of NFA states ABCDHI, FGABCDHI and EJGABCDHI, with transitions on 0 and 1.]
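Each DFA state in such a conversion is a set of NFA states, as in the state names above. A compact way to sketch this subset construction is to store each set as a bitmask. Everything in the block below (the Nfa and Dfa structs, the function names, the fixed sizes) is an assumption for illustration. It is sized for a three-state NFA over {0,1}, not for the larger NFA of the slide.

```c
#include <stdbool.h>

enum { NFA_STATES = 3, SYMBOLS = 2, MAX_DFA = 1 << NFA_STATES };

typedef struct {
    unsigned move[NFA_STATES][SYMBOLS];  /* bitmask of targets per state and symbol */
    unsigned eps[NFA_STATES];            /* bitmask of ε-move targets per state */
    unsigned accepting;                  /* bitmask of accepting states */
    int start;
} Nfa;

typedef struct {
    unsigned set[MAX_DFA];               /* NFA state set behind each DFA state */
    int next[MAX_DFA][SYMBOLS];          /* DFA transition table */
    bool accepting[MAX_DFA];
    int count;
} Dfa;

/* All states reachable from `set` using only ε-moves. */
static unsigned eps_closure(const Nfa *n, unsigned set) {
    bool changed = true;
    while (changed) {
        changed = false;
        for (int s = 0; s < NFA_STATES; s++)
            if ((set & (1u << s)) && (n->eps[s] & ~set)) {
                set |= n->eps[s];
                changed = true;
            }
    }
    return set;
}

/* Index of the DFA state for `set`, creating it if it is new. */
static int dfa_state(Dfa *d, const Nfa *n, unsigned set) {
    for (int i = 0; i < d->count; i++)
        if (d->set[i] == set) return i;
    d->set[d->count] = set;
    d->accepting[d->count] = (set & n->accepting) != 0;
    return d->count++;
}

void subset_construction(const Nfa *n, Dfa *d) {
    d->count = 0;
    dfa_state(d, n, eps_closure(n, 1u << n->start));   /* DFA start state */
    for (int i = 0; i < d->count; i++) {               /* count grows as sets are found */
        for (int a = 0; a < SYMBOLS; a++) {
            unsigned target = 0;
            for (int s = 0; s < NFA_STATES; s++)
                if (d->set[i] & (1u << s))
                    target |= n->move[s][a];
            d->next[i][a] = dfa_state(d, n, eps_closure(n, target));
        }
    }
}
```

At most 2^3 = 8 distinct sets exist for three NFA states, so the fixed-size arrays cannot overflow. An empty target set becomes an ordinary dead state of the DFA.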
Implementation
A DFA can be implemented by a 2D table T
One dimension is states
Other dimension is input symbols
For every transition Si --a--> Sk define T[i,a] = k
DFA execution
If in state Si and input a, read T[i,a] = k and skip to state Sk
Very efficient
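The table scheme above translates almost literally into C. The sketch below uses a hypothetical two-state DFA over {0,1} that accepts strings ending in 1. The table contents and the names T, accepting and dfa_accepts are assumptions for the example.

```c
#include <stdbool.h>

enum { NUM_STATES = 2, NUM_SYMBOLS = 2 };

/* T[i][a] = k encodes the transition S_i --a--> S_k. */
static const int T[NUM_STATES][NUM_SYMBOLS] = {
    /* on 0  on 1 */
    {   0,    1 },   /* S0: the last symbol read was not 1 */
    {   0,    1 }    /* S1: the last symbol read was 1 (accepting) */
};
static const bool accepting[NUM_STATES] = { false, true };

bool dfa_accepts(const char *input) {
    int state = 0;                         /* start in S0 */
    for (; *input; input++) {
        if (*input != '0' && *input != '1') return false;  /* not in the alphabet */
        state = T[state][*input - '0'];    /* one table lookup per input symbol */
    }
    return accepting[state];
}
```

Each input character costs a single array lookup, which is why the table-driven form is efficient. The price is a table with one entry per state and symbol.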
Table Implementation of a DFA
[Figure: a DFA and its transition table T; not recoverable]
Implementation (Cont.)
NFA -> DFA conversion is at the heart of lexical analyzer generators such as Lex
Readings
Chapter 3 of the book