Chapter 3 - Syntax Analysis
Basic Topics of Chapter Three
Syntax analysis creates the syntactic structure of the given source program.
Parser: a program that takes tokens and a grammar (CFG) as input and produces a parse tree as output.
The syntax analyzer (parser) checks whether the given source program satisfies the rules implied by a context-free grammar.
[Figure: position of the parser in the compiler model. The lexical analyzer reads the source program and returns a token on each getNextToken request from the parser; the parser produces a parse tree for the rest of the front end, which emits the intermediate representation. Both components interact with the symbol table.]
The Main Responsibility of Syntax Analysis
The parser obtains a stream of tokens from the lexical analyzer and verifies that the stream of token names can be generated by the grammar for the source language.
We expect the parser to report any syntax error and to recover from commonly occurring errors so that it can continue processing the rest of the program.
The parser constructs the parse tree and passes it to the rest of the front end.
Syntax Error Handling
Planning the error handling right from the start can both
o simplify the structure of a compiler and improve its handling of errors.
Common error-recovery strategies:
a. Panic-mode recovery
b. Phrase-level recovery
c. Error productions, and
d. Global correction.
i. Panic-Mode Recovery
Once an error is found, the parser discards input symbols one at a time until it finds one of a designated set of synchronizing tokens.
Synchronizing tokens are delimiters, such as a semicolon or }, whose role in the source program is clear.
When the parser finds an error in a statement, it ignores the rest of the statement by not processing the remaining input.
Basically, there are a number of types of grammar, but for compiler design the context-free grammar (CFG) is used.
Just as an English grammar validates a sentence such as "I am going.", a CFG validates the sentences of a programming language.
In a programming language, suppose we write a sentence of this form:
int a, b, c;
i.e., a data-type keyword, then variable names separated by commas, terminated by a semicolon.
Therefore, a CFG is used to check the syntax of a programming language.
Formal Definition of a CFG
A grammar is a set of rules that validates the correctness of the sentences in a language, i.e., the grammar defines the rules.
A context-free grammar (grammar for short) consists of a four-tuple:
G = (T, N, S, P), where
i. T is a finite set of terminals (in our case, this will be the set of tokens)
ii. N is a finite set of non-terminals (syntactic-variables)
iii. S is a start symbol (one of the non-terminal symbol)
iv. P is a finite set of production rules of the form:
A → α, where A is a non-terminal and α is a string of terminals and non-terminals (including the empty string)
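The four-tuple can be written down directly as data. A minimal sketch in Python (the encoding and names are illustrative, not from the text):

```python
# A CFG G = (T, N, S, P) for simple arithmetic expressions.
T = {"+", "*", "(", ")", "id"}      # terminals
N = {"E"}                           # non-terminals
S = "E"                             # start symbol
# Productions: each head (a non-terminal) maps to a list of bodies;
# a body is a tuple of terminals and non-terminals.
P = {
    "E": [("E", "+", "E"), ("E", "*", "E"), ("(", "E", ")"), ("id",)],
}

# Sanity checks: the start symbol is a non-terminal, every head is a
# non-terminal, and every body symbol is a known terminal or non-terminal.
assert S in N
for head, bodies in P.items():
    assert head in N
    for body in bodies:
        assert all(sym in T or sym in N for sym in body)
```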
i. Terminals
Terminals are the basic symbols from which strings are formed.
The term "token name" is a synonym for "terminal" and frequently we will use the word
"token" for terminal when it is clear that we are talking about just the token name.
We assume that the terminals are the first components of the tokens output by the lexical
analyzer.
Notation Conventions
• Terminals:
– Lowercase letters
– Operator symbols
– Punctuation symbols, such as parentheses and commas
– The digits
– Boldface strings, such as id or if
ii. Non-terminals
Nonterminals are syntactic variables that denote sets of strings.
The sets of strings denoted by non-terminals help define the language generated by the
grammar.
Non-terminals impose a hierarchical structure on the language that is key to syntax
analysis and translation.
Non-terminals:
– Uppercase letters in the alphabet
– The letter S which, when it appears, is usually the start symbol
– Lowercase, italic names such as expr or stmt
For example, non-terminals for expressions, terms, and factors are often represented
by E, T, and F, respectively.
iii. Start Symbol
In a grammar, one non-terminal is distinguished as the start symbol, where derivations begin; the set of strings it denotes is the language generated by the grammar.
Conventionally, the productions for the start symbol are listed first.
iv. Productions
The productions of a grammar specify the manner in which the terminals and non-
terminals can be combined to form strings.
Each Production Consists of:
a. A non-terminal called the head or left side of the production;
this production defines some of the strings denoted by the head.
b. A body or right side consisting of zero or more terminals and non-terminals.
Example #1: Simple Arithmetic Expressions
G: E → E+E | E-E | E*E | E/E | (E) | id
Valid sentence: id+id*id
(this sentence can be derived from the above grammar, hence it is valid)
Invalid sentence: id++id*id
(this sentence cannot be generated using the grammar above, hence it is invalid)
NB: the job of a grammar is to validate the correctness of a sentence or to locate the error in it.
Example #4: CFG ("if-else" Grammar)
• S → if expression then Statement
G: E → E + T | T
T→T*F|F
F → (E) | id
The expression grammar above belongs to the class of LR grammars, which are suitable for bottom-up parsing.
This grammar can be adapted to handle additional operators and additional levels of precedence.
However, it cannot be used for top-down parsing because it is left-recursive.
Cont’d
The following non-left-recursive variant of the expression grammar below will be
used for top-down parsing:
E → T E'
E' → + T E' | ε
T → F T'
T' → * F T' | ε
F → (E) | id
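This non-left-recursive grammar is exactly what a hand-written top-down (recursive-descent) parser needs: one function per non-terminal, choosing a production by the next token. A sketch in Python (token handling is simplified; "id" stands for any identifier token):

```python
# Predictive recursive-descent parser for:
#   E -> T E'     E' -> + T E' | ε
#   T -> F T'     T' -> * F T' | ε
#   F -> ( E ) | id

def parse(tokens):
    toks = list(tokens) + ["$"]   # $ marks the end of input
    pos = 0

    def peek():
        return toks[pos]

    def eat(t):
        nonlocal pos
        if toks[pos] != t:
            raise SyntaxError(f"expected {t!r}, got {toks[pos]!r}")
        pos += 1

    def E():
        T(); Eprime()

    def Eprime():
        if peek() == "+":
            eat("+"); T(); Eprime()   # E' -> + T E'
        # otherwise E' -> ε: consume nothing

    def T():
        F(); Tprime()

    def Tprime():
        if peek() == "*":
            eat("*"); F(); Tprime()   # T' -> * F T'
        # otherwise T' -> ε

    def F():
        if peek() == "(":
            eat("("); E(); eat(")")
        else:
            eat("id")

    E()
    eat("$")
    return True

assert parse(["id", "+", "id", "*", "id"])
```

Because the grammar is non-left-recursive and each alternative is chosen by one token of look-ahead, no backtracking is needed.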
Cont’d
The following grammar treats + and * alike, so it is useful for illustrating techniques for handling ambiguities during parsing:
E → E + E | E * E | (E) | id
Derivations
Derivation is a sequence of production rules.
It is used to get the input string through these productions rules.
We have to decide:
• which non-terminal to replace
• Production rule by which the Non-terminals will be replaced
If there is a production A → α, then we say that A derives α, written A ⇒ α.
αAβ ⇒ αγβ if A → γ is a production.
If α1 ⇒ α2 ⇒ … ⇒ αn, then α1 ⇒* αn.
Given a grammar G and a string w of terminals in L(G), we can write S ⇒* w.
If S ⇒* α, where α is a string of terminals and non-terminals of G, then we say that α is a sentential form of G.
There are two options for Derivation
a. Left-Most Derivation (LMD) and b. Right-Most Derivation (RMD)
a. Left-Most Derivations (LMD)
• If we always choose the left-most non-terminal in each derivation step, this
derivation is called as left-most derivation.
• In LMD, the input string is scanned and replaced with production rules from left to right.
• If, in each sentential form, only the leftmost non-terminal is replaced, the derivation is a leftmost derivation.
LMD: Example #1
G: E → E+E | id    Input string: id+id
E+E derives from E, so we can replace E by E+E:
E ⇒ E+E ⇒ id+E ⇒ id+id

RMD: Example #2
G: E → E+E | (E) | -E | id    Input string: -(id+id)
Right-most derivation:
E ⇒ -E ⇒ -(E) ⇒ -(E+E) ⇒ -(E+id) ⇒ -(id+id)
Parse Tree
A parse tree shows graphically how the start symbol of a grammar derives a string in the language.

Parse Tree: Example #2
Construct the parse tree for the given grammar:
G: list → list + digit | list - digit | digit    Input string: 9-5+2
[Parse tree figure omitted.]
Exercise #2
1. G: T → T+T
      | T*T
      | a | b | c    Input string: a*b+c
2. G: S → XYZ    Input string: abd
   X → a
   Y → b
   Z → c | d
3.7. Ambiguity
A grammar that produces more than one parse tree for some sentence is said to be ambiguous. Alternatively, an ambiguous grammar is one that produces:
o more than one leftmost derivation (LMD), or
o more than one rightmost derivation (RMD), for the same sentence.
Drawback of Ambiguity:
o Parsing complexity
o Affects other phases
Ambiguity : Example #1
Consider grammar:
G: E E+E|E*E|id Input string: id+id*id
Two distinct leftmost derivations (and hence two parse trees):
1. E ⇒ E+E ⇒ id+E ⇒ id+E*E ⇒ id+id*E ⇒ id+id*id
2. E ⇒ E*E ⇒ E+E*E ⇒ id+E*E ⇒ id+id*E ⇒ id+id*id
[Parse tree figures omitted: in the first tree * is nested under +; in the second + is nested under *.]
Ambiguity : Example #2
string → string + string
       | string - string
       | 0 | 1 | … | 9
• String 9-5+2 has two parse trees
Ambiguity (cont.)
For most parsers, the grammar must be unambiguous.
An unambiguous grammar fixes a unique structure: the higher-precedence operator gets its operands before operators with lower precedence.
In all programming languages with if-then-else statements of this form, the first parse tree is preferred.
Hence the general rule is: match each else with the closest previous unmatched then.
This disambiguating rule can be incorporated directly into a grammar by using the
following observations.
Eliminating Ambiguity(cont ’d)
A statement appearing between a then and an else must be matched (otherwise there will be an ambiguity).
Thus statements are split into two kinds: matched and unmatched.
A matched statement is
either an if-then-else statement containing no unmatched statements,
or any statement which is not an if-then-else statement and not an if-then statement.
This yields the unambiguous grammar:
stmt → matched | unmatched
matched → if expr then matched else matched | other
unmatched → if expr then stmt | if expr then matched else unmatched
Eliminating Left Recursion
A grammar is left-recursive if it has a production of the form A → Aα | β.
– We may replace it with
A → β A'
A' → α A' | ε
where A' is a new non-terminal.
Eliminate left recursion: Example #1
Consider the following grammar, which generates arithmetic expressions:
E →E + T | T
T→T*F | F
F→(E) | id
has two left-recursive productions. Applying the above transformation leads to
E → T E'
E' → + T E' | ε
T → F T'
T' → * F T' | ε
F → (E) | id
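The A → βA' transformation above is mechanical enough to sketch as a small function. A Python sketch (the grammar encoding is my own; only immediate left recursion is handled, with "ε" marking the empty body):

```python
# Eliminate immediate left recursion:
#   A -> A a1 | ... | A am | b1 | ... | bn
# becomes
#   A  -> b1 A' | ... | bn A'
#   A' -> a1 A' | ... | am A' | ε

def eliminate_immediate_left_recursion(head, bodies):
    # Split the alternatives into left-recursive (A α) and the rest (β).
    alphas = [b[1:] for b in bodies if b and b[0] == head]
    betas = [b for b in bodies if not b or b[0] != head]
    if not alphas:
        return {head: list(bodies)}   # nothing to do
    new = head + "'"
    return {
        head: [beta + (new,) for beta in betas],
        new: [alpha + (new,) for alpha in alphas] + [("ε",)],
    }

g = eliminate_immediate_left_recursion("E", [("E", "+", "T"), ("T",)])
assert g["E"] == [("T", "E'")]
assert g["E'"] == [("+", "T", "E'"), ("ε",)]
```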
Elimination of left recursion(cont ’d)
The Case of Several Left-Recursive A-productions.
Assume that the set of all A-productions has the form
A → Aα1 | Aα2 | … | Aαm | β1 | β2 | … | βn
Replace these with
A → β1A' | β2A' | … | βnA'
A' → α1A' | α2A' | … | αmA' | ε
Elimination of left recursion: Example #2
Indirect left recursion
Let us consider the following grammar.
– S → Aa | b
– A → Ac | Sd | ɛ
The non-terminal S is left-recursive since S ⇒ Aa ⇒ Sda, but the recursion is indirect.
Elimination of left recursion: Example #3
Recursion which is neither left recursion nor right recursion is called general recursion.
Example: S → aSb | ε
Left factoring
Left factoring is a process by which the grammar with common prefixes is
transformed to make it useful for Top down parsers.
If the RHS of more than one production starts with the same symbol, then such a
grammar is called as grammar with common prefixes.
• Ex: A → αβ1 | αβ2 | αβ3 (grammar with common prefixes)
This kind of grammar creates a problematic situation for Top down parsers.
Top down parsers can not decide which production must be chosen to parse the string in
hand.
In left factoring, we make one production for the common prefix, and the rest of each alternative is moved into new productions.
The grammar obtained after the process of left factoring is called a left-factored grammar.
Example: A → αβ1 | αβ2 | αβ3 becomes
A → αA'
A' → β1 | β2 | β3
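The left-factoring transformation can also be sketched in code. A simplified Python sketch (my own encoding; it factors only the single longest prefix common to all alternatives, whereas full left factoring groups alternatives by shared prefix):

```python
# Left-factor A -> α β1 | α β2 | ...  into  A -> α A' ; A' -> β1 | β2 | ...

def common_prefix(bodies):
    """Longest sequence of symbols shared by the front of every body."""
    prefix = []
    for column in zip(*bodies):
        if len(set(column)) == 1:
            prefix.append(column[0])
        else:
            break
    return tuple(prefix)

def left_factor(head, bodies):
    prefix = common_prefix(bodies)
    if not prefix:
        return {head: list(bodies)}   # no common prefix, grammar unchanged
    new = head + "'"
    # What remains of each body after the prefix; ε if nothing remains.
    suffixes = [b[len(prefix):] or ("ε",) for b in bodies]
    return {head: [prefix + (new,)], new: suffixes}

g = left_factor("A", [("a", "b1"), ("a", "b2"), ("a", "b3")])
assert g["A"] == [("a", "A'")]
assert g["A'"] == [("b1",), ("b2",), ("b3",)]
```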
A top-down parser tries to create a parse tree from the root towards the leaves, scanning the input from left to right.
It can also be viewed as finding a leftmost derivation for an input string.
Example:
S → cAd
A → ab | a
For the input cad, backtracking is needed: the parser first tries A → ab, fails to match d, and then backs up and retries with A → a.
Recursive descent parsing: Example #2
It is a top-down parser.
FIRST(A) is the set of terminals that begin the strings derived from A.
FOLLOW(A), for a non-terminal A, is the set of terminals a that can appear immediately to the right of A in some sentential form.
Calculating FIRST(Y1Y2Y3): add FIRST(Y1) − {ε}; if Y1 ⇒* ε, also add FIRST(Y2) − {ε}; if Y2 ⇒* ε as well, add FIRST(Y3); if all of them derive ε, add ε.
Example: #2
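The FIRST rules can be run as a fixed-point computation: keep applying them until no set grows. A Python sketch (the grammar encoding and names are my own; "ε" marks the empty body):

```python
# Compute FIRST for every symbol of a grammar given as
# {head: [body, ...]} with bodies as tuples of symbols.

def compute_first(grammar):
    first = {}

    def F(sym):
        # A terminal's FIRST set is just itself; non-terminals start empty.
        return first.setdefault(sym, set() if sym in grammar else {sym})

    changed = True
    while changed:
        changed = False
        for head, bodies in grammar.items():
            for body in bodies:
                before = len(F(head))
                for sym in body:
                    if sym == "ε":
                        F(head).add("ε")
                        break
                    F(head).update(F(sym) - {"ε"})
                    if "ε" not in F(sym):
                        break            # sym is not nullable: stop here
                else:
                    F(head).add("ε")     # every symbol was nullable
                if len(F(head)) != before:
                    changed = True
    return first

g = {
    "E":  [("T", "E'")],
    "E'": [("+", "T", "E'"), ("ε",)],
    "T":  [("F", "T'")],
    "T'": [("*", "F", "T'"), ("ε",)],
    "F":  [("(", "E", ")"), ("id",)],
}
first = compute_first(g)
assert first["E"] == {"(", "id"}
assert first["E'"] == {"+", "ε"}
```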
Rules of Computing FOLLOW
Rules in computing FOLLOW(X), where X is a non-terminal:
1) If X is immediately followed by a terminal in some production body, that terminal is in FOLLOW(X). For example:
A → Xa; then FOLLOW(X) = { a }
2) If X is the start symbol for a grammar, for example:
X → AB
A→a
B → b;
then add $ to FOLLOW (X); FOLLOW(X)= { $ }
Rules of Computing FOLLOW(cont ’d)
3) If X in a production body is followed by a non-terminal, add the FIRST set of that succeeding non-terminal (excluding ε) to FOLLOW(X).
Ex: A → XD, D → aB; then FOLLOW(X) ⊇ FIRST(D) = { a }
• Before calculating the first and follow functions, eliminate Left Recursion
from the grammar, if present.
• We calculate the follow function of a non-terminal by looking where it is
present on the RHS of a production rule.
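The FOLLOW rules above can likewise be run to a fixed point: scan every occurrence of a non-terminal on the RHS of a production and grow its set. A Python sketch (FIRST sets are supplied precomputed here to keep the example short; encoding is my own):

```python
# Compute FOLLOW for the non-left-recursive expression grammar.
g = {
    "E":  [("T", "E'")],
    "E'": [("+", "T", "E'"), ("ε",)],
    "T":  [("F", "T'")],
    "T'": [("*", "F", "T'"), ("ε",)],
    "F":  [("(", "E", ")"), ("id",)],
}
FIRST = {"E": {"(", "id"}, "E'": {"+", "ε"}, "T": {"(", "id"},
         "T'": {"*", "ε"}, "F": {"(", "id"}}

def first_of_seq(seq):
    """FIRST of a sequence of symbols; contains ε iff all are nullable."""
    out = set()
    for sym in seq:
        f = {"ε"} if sym == "ε" else FIRST.get(sym, {sym})
        out |= f - {"ε"}
        if "ε" not in f:
            return out
    out.add("ε")
    return out

def compute_follow(grammar, start):
    follow = {A: set() for A in grammar}
    follow[start].add("$")               # rule: $ goes into FOLLOW(start)
    changed = True
    while changed:
        changed = False
        for head, bodies in grammar.items():
            for body in bodies:
                for i, sym in enumerate(body):
                    if sym not in grammar:
                        continue          # terminals have no FOLLOW set
                    tail = first_of_seq(body[i + 1:])
                    before = len(follow[sym])
                    follow[sym] |= tail - {"ε"}
                    if "ε" in tail:       # tail nullable: FOLLOW(head) too
                        follow[sym] |= follow[head]
                    if len(follow[sym]) != before:
                        changed = True
    return follow

follow = compute_follow(g, "E")
assert follow["E"] == {")", "$"}
assert follow["T"] == {"+", ")", "$"}
```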
Computing FOLLOW: Example 1&2
Computing FIRST and FOLLOW: Exercise #1
Consider the following grammar G:
S→ABCDE
A → a | ε
B → b | ε
C → c
D → d | ε
E → e | ε
A grammar G is LL(1) if and only if whenever A→α|β are two distinct productions of
G, the following conditions hold:
– For no terminal a do α and β both derive strings beginning with a.
– At most one of α and β can derive the empty string.
– If α ⇒* ε, then β does not derive any string beginning with a terminal in FOLLOW(A).
How LL(1) Parser works?…
input buffer
– our string to be parsed.
– We will assume that its end is marked with a special symbol $.
output
– a production rule representing a step of the derivation sequence (left-most derivation) of the
string in the input buffer.
stack
– contains the grammar symbols
– at the bottom of the stack, there is a special end marker symbol $.
– initially the stack contains only the symbol $ and the starting symbol S.
$S is the initial stack.
– When the stack is emptied (i.e., only $ is left on the stack), parsing is complete.
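The stack discipline described above can be sketched as a table-driven loop. A Python sketch (the LL(1) table for the non-left-recursive expression grammar is written out by hand; empty tuples stand for ε-productions; all names are my own):

```python
# Table-driven LL(1) parsing: stack starts as [$, S]; at each step the top
# of the stack is matched against the lookahead (if a terminal) or expanded
# via the table M[non-terminal, terminal].

g_table = {
    ("E", "id"): ("T", "E'"), ("E", "("): ("T", "E'"),
    ("E'", "+"): ("+", "T", "E'"), ("E'", ")"): (), ("E'", "$"): (),
    ("T", "id"): ("F", "T'"), ("T", "("): ("F", "T'"),
    ("T'", "*"): ("*", "F", "T'"),
    ("T'", "+"): (), ("T'", ")"): (), ("T'", "$"): (),
    ("F", "id"): ("id",), ("F", "("): ("(", "E", ")"),
}
NONTERMS = {"E", "E'", "T", "T'", "F"}

def ll1_parse(tokens, table, start="E"):
    stack = ["$", start]
    toks = list(tokens) + ["$"]
    i = 0
    while stack:
        top = stack.pop()
        a = toks[i]
        if top in NONTERMS:
            if (top, a) not in table:
                raise SyntaxError(f"no rule for ({top}, {a})")
            stack.extend(reversed(table[(top, a)]))  # push body right-to-left
        elif top == a:
            i += 1                # match a terminal (or the end marker $)
        else:
            raise SyntaxError(f"expected {top!r}, got {a!r}")
    return i == len(toks)

assert ll1_parse(["id", "+", "id", "*", "id"], g_table)
```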
Predictive Parsing Tables Construction
The general idea is to use the FIRST AND FOLLOW to construct the parsing tables.
For each production A → α, the production is entered in the table at M[A, a] for every terminal a in FIRST(α).
If FIRST(α) contains ε, the production is also entered at M[A, b] for every terminal b in FOLLOW(A).
• We start from a sentence and apply production rules in reverse in order to reach the start symbol.
Bottom-up parsing is the process of "reducing" a token string to the start symbol of the grammar.
At each reduction, the token string matching the RHS of a production is replaced by the non-terminal at its head.
The key decisions during bottom-up parsing are when to reduce and which production to apply.
• After entering this state or configuration, the parser halts and announces the successful
completion of parsing.
• Error: This is the situation in which the parser can neither perform shift action nor reduce
action and not even accept action.
Solution 1
Shift-reduce parsing: Example #1
Grammar: E → E+E | E*E | id    Input: id*id+id

Stack	Input string	Action
$	id*id+id$	shift id
$id	*id+id$	reduce by E → id
$E	*id+id$	shift *
$E*	id+id$	shift id
$E*id	+id$	reduce by E → id
$E*E	+id$	reduce by E → E*E
$E	+id$	shift +
$E+	id$	shift id
$E+id	$	reduce by E → id
$E+E	$	reduce by E → E+E
$E	$	accept
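The shift-reduce idea can be sketched with a deliberately naive strategy: shift one token, then reduce greedily whenever the top of the stack matches some production body. A Python sketch (my own simplification; it ignores precedence, which is exactly the decision a real parsing table settles):

```python
# Naive shift-reduce sketch for the grammar E -> E+E | E*E | id.
PRODS = [("E", ("E", "+", "E")), ("E", ("E", "*", "E")), ("E", ("id",))]

def shift_reduce(tokens):
    stack, trace = [], []
    for tok in list(tokens) + ["$"]:
        # Reduce as many times as possible before the next shift.
        reduced = True
        while reduced:
            reduced = False
            for head, body in PRODS:
                n = len(body)
                if tuple(stack[-n:]) == body:   # body matches top of stack
                    del stack[-n:]
                    stack.append(head)
                    trace.append(f"reduce by {head} -> {' '.join(body)}")
                    reduced = True
                    break
        if tok == "$":
            break
        stack.append(tok)
        trace.append(f"shift {tok}")
    # Accept iff the whole input reduced to the start symbol.
    return stack == ["E"], trace

ok, trace = shift_reduce(["id", "*", "id", "+", "id"])
assert ok
```

On this input the greedy strategy happens to reduce E*E before shifting +, mirroring the trace above; in general, deciding between shifting and reducing is the hard part that LR tables solve.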
Shift reduce parsing: Exercise #1
E → E + E | E * E | id is an operator grammar.
PRECEDENCE TABLE
Then the input string id+id*id with the precedence relations inserted will be: $ ⋖ id ⋗ + ⋖ id ⋗ * ⋖ id ⋗ $
Basic principle
Scan the input string from left to right and try to detect ⋗; put a pointer on its location.
$ ⋖ id+id*id$ Shift
$E ⋖ +id*id$ Shift
$E+E*id ⋗ $ Reduce by E → id
$E+E*E ⋗ $ Reduce by E → E * E
$E+E ⋗ $ Reduce by E → E + E
$E	$	Accept
Example #2
Q1. Construct operator precedence parsing table
Grammar:
E → E+E| E*E | id with input string: id+id+id $
Q2. Consider the following grammar and construct the operator precedence parser.
E → EAE | id
A → + | *
Then parse the following string: id+id*id
Advantages and Disadvantages of Operator Precedence
Parsing
Advantages:
simple
powerful enough for expressions in programming languages
Disadvantages:
It cannot handle the unary minus (the lexical analyzer should handle the
unary minus).
Operator precedence parsers use precedence functions that map terminal symbols to
integers.
1. Create functions fa and ga for each grammar terminal a and for the end-of-string symbol.
2. Partition the symbols into groups so that fa and gb are in the same group if a ≐ b (there can be symbols in the same group even if they are not connected by this relation).
3. Create a directed graph whose nodes are the groups; then for each pair of symbols a and b: place an edge from the group of gb to the group of fa if a ⋖ b, and an edge from the group of fa to the group of gb if a ⋗ b.
4. If the constructed graph has a cycle, then no precedence functions exist.
5. When there are no cycles, let fa and gb be the lengths of the longest paths starting from the groups of fa and gb, respectively.
Consider the following table:
We can make the look-ahead parameter explicit and discuss LR(k) parsers, where k is the number of input symbols of look-ahead used in making parsing decisions.
– Table driven
– Can be constructed to recognize all programming language constructs
for which CFG can be written
– Most general non-backtracking shift-reduce parsing method
– Can detect a syntactic error as soon as it is possible to do so
– Class of grammars for which we can construct LR parsers are superset
of those which we can construct LL parsers.
LR(k) Parsers
LR(k): we are mostly interested in parsers with k ≤ 1.
LR(k) parsers are of interest in that they are the most powerful class of deterministic bottom-up parsers using at most k look-ahead tokens.
Deterministic parsers must uniquely determine the correct parsing action at each step; they cannot back up or retry parsing actions.
LL vs. LR
LL	LR
Does a leftmost derivation.	Does a rightmost derivation in reverse.
Starts with the root nonterminal on the stack.	Ends with the root nonterminal on the stack.
Ends when the stack is empty.	Starts with an empty stack.
Builds the parse tree top-down.	Builds the parse tree bottom-up.
Continuously pops a nonterminal off the stack, and pushes the corresponding right-hand side.	Tries to recognize a right-hand side on the stack, pops it, and pushes the corresponding nonterminal.
Expands the non-terminals.	Reduces the non-terminals.
Reads the terminals when it pops one off the stack.	Reads the terminals while it pushes them on the stack.
Pre-order traversal of the parse tree.	Post-order traversal of the parse tree.
Model of an LR parser
How LR parser works?
Model of an LR parser…
At each step, the parser performs one of four actions:
1) shift (S),
2) reduce (R),
3) accept (A) the source code, or
4) signal a syntactic error (E).
LR Parsers (Cont.)
An LR parser makes shift-reduce decisions by maintaining states to keep track of
where we are in a parse.
States represent sets of items.
LR(k) Parsers:
4 types of LR(k) parsers:
i. LR(0)
ii. SLR(1) –Simple LR
iii. LALR(1) – Look Ahead LR and
iv. CLR(1) – Canonical LR
LR Parsers (Cont.)
To construct the parsing tables of LR(0) and SLR(1), we use the canonical collection of LR(0) items.
To construct the parsing tables of LALR(1) and CLR(1), we use the canonical collection of LR(1) items.
i. LR(0) Item
LR(0) parsing, and all other LR-style parsing, is based on the idea of an item of the form:
A → X1…Xi . Xi+1…Xj
The dot symbol . in an item may appear anywhere in the right-hand side of a
production.
It marks how much of the production has already been matched.
An LR(0) item (item for short) of a grammar G is a production of G with a dot at
some position of the RHS.
The production A → XYZ yields the four items:
A → .XYZ this means at RHS we have not seen anything
A → X . YZ this means at RHS we have seen X
A → XY . Z this means at RHS we have seen X and Y
A → XYZ . this means at RHS we have seen everything
The production A → λ generates only one item, A → .
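Generating the items of a production is mechanical: slide a dot through the body, one position at a time. A Python sketch (encoding is my own; the dot is represented as the string "."):

```python
# All LR(0) items of a production, obtained by placing the dot at every
# position of the body (len(body) + 1 positions in total).

def items_of(head, body):
    return [(head, body[:i] + (".",) + body[i:]) for i in range(len(body) + 1)]

its = items_of("A", ("X", "Y", "Z"))
assert its[0] == ("A", (".", "X", "Y", "Z"))   # nothing matched yet
assert its[-1] == ("A", ("X", "Y", "Z", "."))  # everything matched
assert len(its) == 4
# A -> λ (empty body) yields exactly one item: A -> .
assert items_of("A", ()) == [("A", (".",))]
```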
Constructing Canonical LR(0) item sets
• Augmented Grammar
• If G is a grammar with start symbol S, then G' (the augmented grammar for G) is G with a new start symbol S' and the production S' → S.
– The purpose of this new starting production is to indicate to the parser when it should stop
parsing and announce acceptance of input.
– Let a grammar be
E→BB
B→ cB | d
Closure of a state
Closure of a state adds items for all productions whose LHS occurs in an item
• Closure operation
– Let I be a set of items for a grammar G. CLOSURE(I) is computed by two rules:
1. Initially, add every item in I to CLOSURE(I).
2. If A → α.Bβ is in CLOSURE(I) and B → γ is a production, add the item B → .γ if it is not already there; repeat until no more new items can be added.
• Example
• If I is { E' → E. , E → E. + T }, then goto(I, +) is
E → E + .T
T → .T * F
T →.F
F →.(E)
F → .id
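CLOSURE and goto can be sketched directly from the definitions. A Python sketch for the expression grammar (an item is a (head, body, dot-position) triple; the encoding is my own), reproducing the goto(I, +) example above:

```python
# Grammar: E' -> E, E -> E+T | T, T -> T*F | F, F -> (E) | id
GRAMMAR = {
    "E'": [("E",)],
    "E":  [("E", "+", "T"), ("T",)],
    "T":  [("T", "*", "F"), ("F",)],
    "F":  [("(", "E", ")"), ("id",)],
}

def closure(items):
    result = set(items)
    changed = True
    while changed:
        changed = False
        for head, body, dot in list(result):
            # If a non-terminal follows the dot, add its productions
            # with the dot at the left end.
            if dot < len(body) and body[dot] in GRAMMAR:
                for prod in GRAMMAR[body[dot]]:
                    item = (body[dot], prod, 0)
                    if item not in result:
                        result.add(item)
                        changed = True
    return result

def goto(items, X):
    # Advance the dot past X in every item where X follows the dot,
    # then close the resulting set.
    moved = {(h, b, d + 1) for h, b, d in items if d < len(b) and b[d] == X}
    return closure(moved)

I = {("E'", ("E",), 1), ("E", ("E", "+", "T"), 1)}   # { E'->E. , E->E.+T }
J = goto(I, "+")
assert ("E", ("E", "+", "T"), 2) in J    # E -> E + . T
assert ("T", ("T", "*", "F"), 0) in J    # T -> . T * F
assert ("F", ("id",), 0) in J            # F -> . id
assert len(J) == 5
```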
Steps to construct LR(0) items
Step 1. Augment the given grammar.
Step 2. Draw the canonical collection of LR(0) items (apply closure and goto).
Step 3. Number the productions.

Example: E → BB ; B → cB | d    Input string: ccdd$
Step 1. Augment the given grammar:
E' → E
E → BB
B → cB | d
Step 2. Draw the canonical collection of LR(0) items.
Step 3. Number the productions.
Find CLOSURE(I).
Constructing canonical LR(0) item sets: Example #2 (cont.)
I0 = CLOSURE({ E' → .E })
First, E' → .E is put in CLOSURE(I) by rule 1.
Then, since there is an E immediately to the right of a dot, we add the E-productions with dots at the left end: E → .E + T and E → .T.
Now there is a T immediately to the right of a dot in E → .T, so we add T → .T * F and T → .F.
Next, T → .F forces us to add F → .(E) and F → .id.
Result: I0 = { E' → .E, E → .E+T, E → .T, T → .T*F, T → .F, F → .(E), F → .id }
Goto Next State
Given an item set (state) s, we can compute its next state s' under a symbol X: take every item in s with the dot immediately before X, advance the dot past X, and apply CLOSURE to the result.
For the augmented expression grammar:
E’->E
E -> E + T | T
T -> T * F | F
F -> (E) | id
SLR(1) Parsing Table
ii. Simple LR(1), SLR(1), Parsing
Few number of states, hence very small table.
Simple and fast construction.
Works on smallest class of grammar.
SLR(1) parsers can parse a larger number of grammars than LR(0).
SLR(1) has the same Transition Diagram and Goto table as LR(0)
BUT with different Action table because it looks ahead 1 token.
SLR(1) Look-ahead
SLR(1) parsers are built by first constructing:
• the transition diagram,
• then computing FOLLOW sets as the SLR(1) look-aheads.
Exercise:
S → (L) | a
L → L, S | S
Parse the input string (a,(a,a)) using shift-reduce parsing.