Compiler

The document discusses lexical analysis in compiler construction. It defines the role of a lexical analyzer as separating a source program into tokens, describes how regular expressions are used to specify token patterns, and shows how finite automata implement the recognition of tokens. Key points:
- A lexical analyzer breaks a source program into tokens by recognizing patterns specified by regular expressions.
- Finite automata implement the recognition of tokens by modeling the regular-expression patterns as state-transition diagrams.
- Lexical analyzer generators such as Lex use regular expressions to specify tokens and generate a lexical analyzer implemented as a finite state machine.


Lexical Analysis

Outline
Role of lexical analyzer
Specification of tokens
Recognition of tokens
Lexical analyzer generator
Finite automata
Design of lexical analyzer generator

Figure: the lexical analyzer reads characters from the source program (one design reads the entire program into memory; a single character can be put back), passes a token and attribute value to the parser on each "get next" request, and enters identifiers into the symbol table.

Compiler Construction
The role of lexical analyzer

Figure: source program -> Lexical Analyzer -> token -> Parser -> to semantic analysis. The parser requests tokens via getNextToken, and both the lexical analyzer and the parser consult the symbol table.
Other tasks
Stripping out comments and white space (blank, tab, and newline characters) from the source program.
Correlating error messages with the source program (e.g., the line number of an error).
The lexical analyzer is sometimes divided into two phases: scanning (the simple operations) and lexical analysis proper (the more complex operations).

Why separate lexical analysis and parsing?
1. Simplicity of design
White space and comment lines are eliminated before parsing.
2. Improved compiler efficiency
A large amount of time is spent reading the source program and generating tokens; specialized buffering techniques can speed this up.
3. Enhanced compiler portability
The representation of special or non-standard symbols can be isolated in the lexical analyzer.

Tokens, Patterns and Lexemes
Pattern: a rule that describes the set of strings that can form a particular token
Token: a token name, together with an optional attribute value
Lexeme: a sequence of characters in the source program that matches the pattern for a token

Example

Token       Informal description                    Sample lexemes
if          characters i, f                         if
else        characters e, l, s, e                   else
comparison  < or > or <= or >= or == or !=          <=, !=
id          letter followed by letters and digits   pi, score, D2
number      any numeric constant                    3.14159, 0, 6.02e23
literal     anything surrounded by " "              "core dumped"

For the statement E = C1 * 10, the lexical analyzer produces:

Token   Attribute
ID      index of symbol table entry for E
=       -
ID      index of symbol table entry for C1
*       -
NUM     10
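The tokenization above can be sketched as a tiny hand-written scanner. A minimal sketch in C, assuming hypothetical token codes (T_ID, T_NUM, ...) and handling only the identifiers, numbers, and operators that appear in E = C1 * 10:

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Token codes for this sketch (hypothetical; a real compiler defines its own). */
enum { T_ID = 1, T_NUM, T_ASSIGN, T_MUL, T_EOF };

/* Scan one token from s starting at *pos; copy its lexeme into lex.
   Returns the token code and advances *pos past the lexeme. */
static int next_token(const char *s, int *pos, char *lex) {
    int i = *pos, j = 0;
    while (s[i] == ' ') i++;                 /* skip blanks */
    if (s[i] == '\0') { lex[0] = '\0'; *pos = i; return T_EOF; }
    if (isalpha((unsigned char)s[i])) {      /* id: letter (letter|digit)* */
        while (isalnum((unsigned char)s[i])) lex[j++] = s[i++];
        lex[j] = '\0'; *pos = i; return T_ID;
    }
    if (isdigit((unsigned char)s[i])) {      /* number: digit+ */
        while (isdigit((unsigned char)s[i])) lex[j++] = s[i++];
        lex[j] = '\0'; *pos = i; return T_NUM;
    }
    lex[0] = s[i]; lex[1] = '\0'; *pos = i + 1;
    return s[i] == '=' ? T_ASSIGN : T_MUL;   /* only = and * occur here */
}
```

Calling next_token repeatedly on "E = C1 * 10" yields exactly the ID, =, ID, *, NUM sequence of the table above.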

Lexical Errors and Recovery
Error detection
Error reporting
Error recovery:
- Delete the current character and restart scanning at the next character.
- Delete the first character read by the scanner and resume scanning at the character following it.

Input buffering
Sometimes the lexical analyzer needs to look ahead several symbols to decide which token to return.
In the C language: we need to look past -, = or < to decide what token to return.
We introduce a two-buffer scheme to handle large lookaheads safely.

E = M * C * * 2 eof
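The two-buffer scheme can be sketched as follows: each buffer half ends in a sentinel character, so the hot loop tests for the sentinel instead of doing a bounds check on every read. The names (fill_half, advance) and the use of '\0' as the sentinel are illustrative simplifications, not the layout of any particular compiler:

```c
#include <assert.h>
#include <string.h>

#define N 8                  /* size of each buffer half (tiny, for illustration) */
#define EOF_SENTINEL '\0'

struct buffers {
    char buf[2 * (N + 1)];   /* two halves, each with a trailing sentinel slot */
    int forward;             /* lookahead pointer */
    const char *src;         /* remaining input (stands in for a file) */
};

/* Refill one half from src and terminate it with the sentinel. */
static void fill_half(struct buffers *b, int half) {
    int n = (int)strlen(b->src);
    if (n > N) n = N;
    memcpy(b->buf + half * (N + 1), b->src, (size_t)n);
    b->buf[half * (N + 1) + n] = EOF_SENTINEL;
    b->src += n;
}

/* Advance forward one character, switching halves when a sentinel is hit.
   A sentinel that is not at the end of a half means real end of input. */
static char advance(struct buffers *b) {
    char c = b->buf[b->forward++];
    if (c == EOF_SENTINEL) {
        if (b->forward - 1 == N) {                 /* end of first half */
            fill_half(b, 1);
            b->forward = N + 1;
            c = b->buf[b->forward++];
        } else if (b->forward - 1 == 2 * N + 1) {  /* end of second half */
            fill_half(b, 0);
            b->forward = 0;
            c = b->buf[b->forward++];
        }
    }
    return c;
}
```

With N = 8, scanning "E = M * C**2" crosses a half boundary once, and the caller never sees the switch.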

Specification of Tokens
Regular expressions are an important notation for specifying
lexeme patterns. While they cannot express all possible
patterns, they are very effective in specifying those types
of patterns that we actually need for tokens.


Specification of tokens
In the theory of compilation, regular expressions are used to formalize the specification of tokens.
Regular expressions are a means of specifying regular languages.
Example: letter(letter | digit)*
Each regular expression is a pattern specifying the form of strings.

Regular expressions
ε is a regular expression, L(ε) = {ε}
If a is a symbol in Σ then a is a regular expression, L(a) = {a}
(r) | (s) is a regular expression denoting the language L(r) ∪ L(s)
(r)(s) is a regular expression denoting the language L(r)L(s)
(r)* is a regular expression denoting (L(r))*
R* is R concatenated with itself 0 or more times:
R* = {ε} ∪ R ∪ RR ∪ RRR ∪ ...

Extensions
One or more instances: (r)+
Zero or one instance: r?
Character classes: [abc]
Example:
letter_ -> [A-Za-z_]
digit -> [0-9]
id -> letter_(letter_ | digit)*

Regular definitions
d1 -> r1
d2 -> r2
...
dn -> rn

Example:
letter -> A | B | ... | Z | a | b | ... | z | _
digit -> 0 | 1 | ... | 9
id -> letter(letter | digit)*

Operations on Languages

Example:
Let L be the set of letters {A, B, ..., Z, a, b, ..., z} and let D be the set of digits {0, 1, ..., 9}. L and D are, respectively, the alphabet of uppercase and lowercase letters and the alphabet of digits. Other languages can be constructed from L and D using the operators illustrated above.

Operations on Languages
1. L ∪ D is the set of letters and digits - strictly speaking, the language with 62 (52 + 10) strings of length one, each of which is either one letter or one digit.
2. LD is the set of 520 (52 × 10) strings of length two, each consisting of one letter followed by one digit. Ex: A1, a1, B0, etc.
3. L^4 is the set of all 4-letter strings (ex: aaba, bcef).
4. L* is the set of all strings of letters, including ε, the empty string.
5. L(L ∪ D)* is the set of all strings of letters and digits beginning with a letter.
6. D+ is the set of all strings of one or more digits.

Terms for Parts of Strings

Figure: table of terms for parts of strings (prefix, suffix, substring, subsequence).

Specification of Tokens
Algebraic laws of regular expressions
1) r | s = s | r
2) r | (s | t) = (r | s) | t    (rs)t = r(st)
3) r(s | t) = rs | rt    (s | t)r = sr | tr
4) εr = rε = r
5) (r*)* = r*
6) r* = (r | ε)*    ε* = ε
7) (r | s)* = (r* | s*)* = (r* s*)*

Recognition of Tokens
Tasks of token recognition in a lexical analyzer:
- Isolate the lexeme for the next token in the input buffer.
- Produce as output a pair consisting of the appropriate token and attribute value, such as <id, pointer to table entry>, using the translation table given in the figure on the next page.

Recognition of Tokens

Regular expression   Token   Attribute value
if                   if      -
id                   id      pointer to table entry
<                    relop   LT

Recognition of Tokens
Method for recognizing tokens: use transition diagrams.

Recognition of tokens
Starting point is the language grammar to

understand the tokens:


Grammar for branching statement
stmt -> if expr then stmt
| if expr then stmt else stmt
| ε
expr -> term relop term
| term
term -> id
| number

Recognition of tokens (cont.)
The next step is to formalize the patterns:
digit -> [0-9]
digits -> digit+
number -> digits(.digits)? (E[+-]? digits)?
letter -> [A-Za-z_]
id -> letter(letter | digit)*
if -> if
then -> then
else -> else
relop -> < | > | <= | >= | = | <>

We also need to handle whitespace:
ws -> (blank | tab | newline)+

Ex: RELOP = < | <= | = | <> | > | >=

Transition diagram for relop (start state 0):
- on '<' go to state 1; then '=' gives return(relop, LE); '>' gives return(relop, NE); any other character gives return(relop, LT) #
- on '=' give return(relop, EQ)
- on '>' go to state 6; then '=' gives return(relop, GE); any other character gives return(relop, GT) #

# indicates input retraction
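The relop transition diagram can be sketched directly in C: each switch case mirrors a diagram state, and input retraction shows up as consuming one character instead of two. The function name and the *len out-parameter are illustrative:

```c
#include <assert.h>

/* Attribute values for relop, matching the transition diagram. */
enum relop { LT, LE, EQ, NE, GT, GE, NOT_RELOP };

/* Run the relop transition diagram on s. On success, *len is the number
   of characters consumed after any retraction. Sketch of states 0-8. */
static enum relop scan_relop(const char *s, int *len) {
    switch (s[0]) {                              /* state 0 */
    case '<':                                    /* state 1 */
        if (s[1] == '=') { *len = 2; return LE; }    /* state 2 */
        if (s[1] == '>') { *len = 2; return NE; }    /* state 3 */
        *len = 1; return LT;                     /* state 4: retract one char */
    case '=':                                    /* state 5 */
        *len = 1; return EQ;
    case '>':                                    /* state 6 */
        if (s[1] == '=') { *len = 2; return GE; }    /* state 7 */
        *len = 1; return GT;                     /* state 8: retract one char */
    default:
        *len = 0; return NOT_RELOP;
    }
}
```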

Ex2: ID = letter(letter | digit)*

Transition diagram (start state 9):
- on letter go to state 10
- state 10: on letter or digit stay in state 10; on any other character go to state 11 and return(id) #

# indicates input retraction

Transition diagrams (cont.)
Sketch of the code for the id transition diagram:

switch (state) {
case 9:
    c = nextchar();
    if (isletter(c)) state = 10;
    else state = failure();
    break;
case 10:
    c = nextchar();
    if (isletter(c) || isdigit(c)) state = 10;
    else state = 11;
    break;
case 11:
    retract(1); insert(id); return;
}

Architecture of a transition-diagram-based lexical analyzer

TOKEN getRelop()
{
    TOKEN retToken = new(RELOP);
    while (1) {
        /* repeat character processing until a
           return or failure occurs */
        switch (state) {
        case 0:
            c = nextchar();
            if (c == '<') state = 1;
            else if (c == '=') state = 5;
            else if (c == '>') state = 6;
            else fail(); /* lexeme is not a relop */
            break;
        case 1:
            ...
        case 8:
            retract();
            retToken.attribute = GT;
            return (retToken);
        }
    }
}

Lexical Analyzer Generator - Lex

Lex source program (lex.l) -> Lex compiler -> lex.yy.c
lex.yy.c -> C compiler -> a.out
input stream -> a.out -> sequence of tokens

Structure of Lex programs

declarations
%%
translation rules
%%
auxiliary functions

Each translation rule has the form:
Pattern   {Action}

Example

%{
/* definitions of manifest constants
   LT, LE, EQ, NE, GT, GE,
   IF, THEN, ELSE, ID, NUMBER, RELOP */
%}

/* regular definitions */
delim    [ \t\n]
ws       {delim}+
letter   [A-Za-z]
digit    [0-9]
id       {letter}({letter}|{digit})*
number   {digit}+(\.{digit}+)?(E[+-]?{digit}+)?

%%

{ws}      {/* no action and no return */}
if        {return(IF);}
then      {return(THEN);}
else      {return(ELSE);}
{id}      {yylval = (int) installID(); return(ID);}
{number}  {yylval = (int) installNum(); return(NUMBER);}

%%

int installID() {
    /* function to install the lexeme, whose first character is
       pointed to by yytext and whose length is yyleng, into the
       symbol table, and return a pointer thereto */
}

int installNum() {
    /* similar to installID, but puts numerical
       constants into a separate table */
}

Finite Automata
Regular expressions = specification
Finite automata = implementation
A finite automaton consists of:
- An input alphabet Σ
- A set of states S
- A start state n
- A set of accepting states F ⊆ S
- A set of transitions: state -(input)-> state

Finite Automata
A transition
s1 -(a)-> s2
is read: in state s1, on input a, go to state s2.
At end of input: if in an accepting state => accept, otherwise => reject.
If no transition is possible => reject.

Finite Automata State Graphs

Figure: graphical conventions - a state (circle), the start state (circle with an incoming arrow), an accepting state (double circle), a transition (labeled arrow).

A Simple Example
A finite automaton that accepts only "1".
Figure: start state with a transition on 1 to an accepting state.
A finite automaton accepts a string if we can follow transitions labeled with the characters in the string from the start state to some accepting state.

Another Simple Example
A finite automaton accepting any number of 1s followed by a single 0.
Alphabet: {0, 1}
Figure: start state with a self-loop on 1 and a transition on 0 to an accepting state.
Check that 1110 is accepted but 110... is not.
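This automaton can be sketched as a small C function that applies the acceptance rules: follow transitions, reject when no transition is possible, and accept at end of input only in the accepting state. The function name is illustrative:

```c
#include <assert.h>

/* The automaton from this slide: any number of 1s followed by a single 0.
   State 0 = start (loops on '1'); state 1 = accepting (reached on '0'). */
static int accepts_ones_then_zero(const char *input) {
    int state = 0;
    for (const char *p = input; *p; p++) {
        if (state == 0 && *p == '1') state = 0;       /* stay: another 1 */
        else if (state == 0 && *p == '0') state = 1;  /* the single 0 */
        else return 0;       /* no transition possible => reject */
    }
    return state == 1;       /* accept only if we end in the accepting state */
}
```

1110 is accepted; any input continuing past the 0 (such as 1101) is rejected, because the accepting state has no outgoing transitions.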

And Another Example
Alphabet: {0, 1}
What language does this recognize?
Figure: an automaton over {0, 1} with transitions on 0 and 1.

And Another Example
Alphabet still {0, 1}
Figure: a state with two different transitions on input 1.
The operation of the automaton is not completely defined by the input: on input 11 the automaton could be in either state.

Epsilon Moves
Another kind of transition: ε-moves.
A -(ε)-> B
The machine can move from state A to state B without reading input.

Deterministic and Nondeterministic Automata
Deterministic Finite Automata (DFA):
- One transition per input per state
- No ε-moves
Nondeterministic Finite Automata (NFA):
- Can have multiple transitions for one input in a given state
- Can have ε-moves
Finite automata have finite memory: we need only encode the current state.

Execution of Finite Automata
A DFA can take only one path through the state graph, completely determined by the input.
NFAs can choose:
- Whether to make ε-moves
- Which of multiple transitions for a single input to take

Acceptance of NFAs
An NFA can get into multiple states.
Figure: an NFA over {0, 1} run on a sample input.
Rule: an NFA accepts if it can get into a final state.

NFA vs. DFA (1)
NFAs and DFAs recognize the same set of languages (the regular languages).
DFAs are easier to implement: there are no choices to consider.

NFA vs. DFA (2)
For a given language the NFA can be simpler than the DFA.
Figure: an NFA and an equivalent, larger DFA for the same language.
The DFA can be exponentially larger than the NFA.

Regular Expressions to Finite Automata
High-level sketch:
Lexical Specification -> Regular expressions -> NFA -> DFA -> Table-driven implementation of DFA

Regular Expressions to NFA (1)
For each kind of rexp, define an NFA.
Notation: NFA for rexp A.
- For ε: a single ε-transition from the start state to an accepting state.
- For input a: a single transition on a from the start state to an accepting state.

Regular Expressions to NFA (2)
- For AB: the NFA for A connected to the NFA for B.
- For A | B: a new start state with ε-moves into the NFAs for A and B, whose accepting states lead by ε-moves to a new accepting state.

Regular Expressions to NFA (3)
- For A*: ε-moves that allow the NFA for A to be skipped entirely or repeated any number of times.

Example of RegExp -> NFA conversion
Consider the regular expression (1 | 0)*1.
Figure: the resulting NFA (states A-J), with ε-moves implementing the alternation and the Kleene star, and a final transition on 1 into the accepting state.

Next
Lexical Specification -> Regular expressions -> NFA -> DFA -> Table-driven implementation of DFA

NFA to DFA. The Trick
Simulate the NFA.
Each state of the resulting DFA = a non-empty subset of the states of the NFA.
Start state = the set of NFA states reachable through ε-moves from the NFA start state.
Add a transition S -(a)-> S' to the DFA iff S' is the set of NFA states reachable from the states in S after seeing the input a, considering ε-moves as well.
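The subset idea can be sketched by simulating the NFA directly, tracking the set of possible states as a bitmask; each distinct bitmask value corresponds to one DFA state. The sketch below hard-codes a small ε-free NFA for (1 | 0)*1 (state 0 loops on 0 and 1, and on 1 can also jump to accepting state 1); a full construction would add ε-closure:

```c
#include <assert.h>

/* Simulate the subset construction on the fly for a hand-coded NFA:
   state 0: on '0' -> {0}; on '1' -> {0, 1};  state 1: accepting, no moves. */
static int nfa_accepts(const char *input) {
    unsigned states = 1u << 0;               /* start set = {0} */
    for (const char *p = input; *p; p++) {
        unsigned next = 0;
        if (states & (1u << 0)) {
            next |= 1u << 0;                 /* 0 -0-> 0 and 0 -1-> 0 */
            if (*p == '1') next |= 1u << 1;  /* 0 -1-> 1 (second choice on 1) */
        }
        states = next;                       /* the new "DFA state" */
    }
    return (states & (1u << 1)) != 0;        /* accept iff set contains state 1 */
}
```

The language is all strings over {0, 1} ending in 1, matching (1 | 0)*1.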

NFA -> DFA Example
Figure: the NFA for (1 | 0)*1 and the resulting DFA. The DFA start state is {A,B,C,D,H,I}; on 0 it goes to {F,G,A,B,C,D,H,I}, and on 1 to {E,J,G,A,B,C,D,H,I}.

NFA to DFA. Remark
An NFA may be in many states at any time.
How many different states?
If there are N states, the NFA must be in some subset of those N states.
How many non-empty subsets are there?
2^N - 1, i.e., finitely many, but exponentially many.

Implementation
A DFA can be implemented by a 2D table T:
- One dimension is states
- The other dimension is input symbols
- For every transition Si -(a)-> Sk, define T[i, a] = k
DFA execution: if in state Si on input a, read T[i, a] = k and skip to state Sk.
Very efficient.
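The table-driven execution loop can be sketched as follows, using as an example a machine that accepts any number of 1s followed by a single 0; the use of -1 to mark a missing transition is an illustrative encoding:

```c
#include <assert.h>

enum { NSTATES = 2 };

/* T[i][a] = k encodes the transition S_i -(a)-> S_k; -1 = no transition. */
static const int T[NSTATES][2] = {
    { 1, 0 },    /* state 0: on '0' -> state 1, on '1' -> state 0 */
    { -1, -1 },  /* state 1: accepting, no outgoing transitions */
};
static const int is_final[NSTATES] = { 0, 1 };

/* The execution loop from the slide: look up T[i, a] and skip to state k. */
static int run_dfa(const char *input) {
    int i = 0;                               /* current state */
    for (const char *p = input; *p; p++) {
        int a = *p - '0';                    /* input symbol index */
        if (a < 0 || a > 1 || T[i][a] < 0) return 0;   /* reject */
        i = T[i][a];
    }
    return is_final[i];
}
```

The loop does one table lookup per input character, which is why table-driven DFAs are so efficient.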

Table Implementation of a DFA
Figure: the transition table T for a DFA over {0, 1}.

Implementation (Cont.)
NFA -> DFA conversion is at the heart of tools such as flex or jflex.
But DFAs can be huge.
In practice, flex-like tools trade off speed for space in the choice of NFA and DFA representations.

Readings
Chapter 3 of the book
