Chapter 3 - Syntax Analysis Part One
❖ Discuss the different forms of derivation: leftmost derivations, rightmost derivations, and
derivations that are neither leftmost nor rightmost.
❖ Discuss ambiguous grammars and how to remove ambiguity from CFGs.
❖ Conceptually, for well-formed programs, the parser constructs a parse tree and passes it to the
rest of the compiler for further processing. In fact, the parse tree need not be constructed
explicitly, since checking and translation actions can be interspersed with parsing. Thus, the
parser and the rest of the front end could well be implemented by a single module.
❖ The parser therefore performs context-free syntax analysis, guides context-sensitive analysis, constructs
an intermediate representation, produces meaningful error messages, and attempts error correction.
Compiled by: Dawit K. 1
Compiler Design
❖ The parser obtains a string of tokens from the lexical analyzer, as shown in the figure above, and verifies
that the string of token names can be generated by the grammar for the source language.
❖ A grammar gives a precise, yet easy-to-understand, syntactic specification of a programming language.
• From certain classes of grammars, we can construct automatically an efficient parser that
determines the syntactic structure of a source program.
• As a side benefit, the parser-construction process can reveal syntactic ambiguities and trouble spots
that might have slipped through the initial design phase of a language.
• The structure imparted to a language by a properly designed grammar is useful for translating
source programs into correct object code and for detecting errors.
❖ A grammar allows a language to be evolved or developed iteratively, by adding new constructs to perform
new tasks.
❖ These new constructs can be integrated more easily into an implementation that follows the grammatical
structure of the language.
❖ There are three general types of parsers for grammars: universal, top-down, and bottom-up.
• Universal parsing methods such as the Cocke-Younger-Kasami algorithm and Earley's algorithm
can parse any grammar (Read more on these).
❖ These general methods are, however, too inefficient to use in production compilers.
❖ The methods commonly used in compilers can be classified as being either top-down or bottom-up.
• Top-Down Methods: - As implied by their names, top-down methods build parse trees from the top
(root) to the bottom (leaves).
• Bottom-up methods: - start from the leaves and work their way up to the root to build the parse
tree.
• In either case, the input to the parser is scanned from left to right, one symbol at a time.
❖ The most efficient top-down and bottom-up methods work only for sub-classes of grammars, but several
of these classes, particularly, LL and LR grammars, are expressive enough to describe most of the
syntactic constructs in modern programming languages.
Error Handling
Common programming errors include:
• Lexical errors, syntactic errors, semantic errors, and logical errors. The errors
handled in this phase of compilation are syntactic errors.
The parser can recover from such errors using one of the following strategies:
1. Panic-mode recovery: - Discard input symbols one at a time until one of a designated set of synchronizing
tokens is found.
2. Phrase-level recovery: - Replace a prefix of the remaining input with some string that allows the parser to
continue.
3. Error productions: - Augment the grammar with productions that generate the erroneous constructs.
4. Global correction: - Choose a minimal sequence of changes to obtain a globally least-cost correction.
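Panic-mode recovery can be sketched in a few lines. The example below is only an illustration: the token list, the choice of `;` and `}` as synchronizing tokens, and the function name `panic_mode_skip` are assumptions made for this sketch, not part of any particular compiler.

```python
# Hypothetical synchronizing tokens; a real parser chooses them per construct.
SYNC_TOKENS = {";", "}"}


def panic_mode_skip(tokens, pos, sync_tokens=SYNC_TOKENS):
    """Discard tokens from position pos until a synchronizing token is found.

    Returns (new_pos, skipped): new_pos is the index of the synchronizing
    token, or len(tokens) if no synchronizing token remains.
    """
    skipped = []
    while pos < len(tokens) and tokens[pos] not in sync_tokens:
        skipped.append(tokens[pos])
        pos += 1
    return pos, skipped


tokens = ["print", "(", "id", "+", "+", ")", ";", "print", "(", "id", ")", ";"]
# Suppose the parser detects a syntax error at index 4 (the second "+").
new_pos, skipped = panic_mode_skip(tokens, 4)
print(new_pos, skipped)
```

After the call, parsing would resume at the `;` found at `new_pos`, so the second `print` statement can still be checked.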
2. A set of non-terminals N
o Non-terminals are syntactic variables that denote sets of strings.
o The sets of strings denoted by non-terminals help define the language generated by the grammar.
o Non-terminals impose a hierarchical structure on the language that is key to syntax analysis and
translation.
4. A special non-terminal S ∈ N, which is the start symbol. The productions for the start symbol are listed
first.
❖ Just as a regular expression generates strings of characters, a CFG generates strings of tokens.
3. Uppercase letters late in the alphabet, such as X, Y, Z, represent grammar symbols; that is, either non-
terminals or terminals.
4. Lowercase letters late in the alphabet, chiefly u, v, ..., z, represent (possibly empty) strings of terminals.
5. Lowercase Greek letters, α, β, γ for example, represent (possibly empty) strings of grammar symbols.
❖ Thus, a generic production can be written as A → α, where A is the head and α the body.
6. A set of productions A → α1, A → α2, A → α3, …, A → αk with a common head A (call them
A-productions), may be written A → α1 | α2 | α3 | ... | αk. Call α1, α2, α3, ..., αk the alternatives for A.
7. Unless stated otherwise, the head of the first production is the start symbol.
Example 2: - Using these conventions, the grammar of Example 1 can be rewritten concisely as:
E→ E+T|E-T|T
T→ T*F|T/F|F
F → ( E ) | id
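One possible way to hold this grammar in a program is a dictionary from each head to its list of alternatives. The sketch below assumes that representation; the names `GRAMMAR`, `terminals`, and `show` are chosen here for illustration only.

```python
# Each head maps to its alternatives; each alternative is a list of symbols.
GRAMMAR = {
    "E": [["E", "+", "T"], ["E", "-", "T"], ["T"]],
    "T": [["T", "*", "F"], ["T", "/", "F"], ["F"]],
    "F": [["(", "E", ")"], ["id"]],
}
START = "E"  # by convention, the head of the first production


def terminals(grammar):
    """Every symbol that appears in a body but is not a head is a terminal."""
    return {
        sym
        for bodies in grammar.values()
        for body in bodies
        for sym in body
        if sym not in grammar
    }


def show(grammar):
    """Render each set of A-productions in the A → α1 | α2 | ... shorthand."""
    return [
        f"{head} → " + " | ".join(" ".join(body) for body in bodies)
        for head, bodies in grammar.items()
    ]


print(sorted(terminals(GRAMMAR)))
for line in show(GRAMMAR):
    print(line)
```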
Derivations
❖ A derivation is a description of how a string is generated from the start symbol of a grammar.
❖ The construction of a parse tree can be made precise by taking a derivational view, in which
productions are treated as rewriting rules. Beginning with the start symbol, each rewriting step
replaces a nonterminal by the body of one of its productions.
❖ For a general definition of derivation, consider a nonterminal A in the middle of a sequence of
grammar symbols, as in αAβ, where α and β are arbitrary strings of grammar symbols.
o Suppose A → γ is a production. Then, we write αAβ ⇒ αγβ.
o The symbol ⇒ means "derives in one step."
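The single step αAβ ⇒ αγβ can be written as a small function on lists of symbols. This is a sketch that reuses the grammar of Example 2; the function name `derive_step` and the list representation are assumptions made for illustration.

```python
GRAMMAR = {
    "E": [["E", "+", "T"], ["E", "-", "T"], ["T"]],
    "T": [["T", "*", "F"], ["T", "/", "F"], ["F"]],
    "F": [["(", "E", ")"], ["id"]],
}


def derive_step(form, index, body):
    """Rewrite the non-terminal A at form[index] using the production A → body.

    form[:index] plays the role of α, form[index] is A, form[index + 1:] is β,
    and body is γ.
    """
    head = form[index]
    if body not in GRAMMAR.get(head, []):
        raise ValueError(f"{head} → {' '.join(body)} is not a production")
    return form[:index] + body + form[index + 1:]


# α = E +, A = T, β is empty, and the production used is T → T * F.
print(derive_step(["E", "+", "T"], 2, ["T", "*", "F"]))
```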
❖ Example 3: Use the CFG below to perform the derivations in example 4 & 5.
→ while(id>num) do print(E);
→ while(id>num) do print(id);
Rightmost Derivations
A rightmost derivation replaces the rightmost non-terminal at every step.
Example 5: Generate while (num > num) do print(id); from the CFG in Example 3
S ⇒ while(B) do S
⇒ while(B) do print(E);
⇒ while(B) do print(id);
⇒ while(E>E) do print(id);
⇒ while(E>num) do print(id);
⇒ while(num>num) do print(id);
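The derivation in Example 5 can be replayed mechanically. The sketch below assumes the rules listed in Example 6, written as space-separated tokens; the function name `rightmost_step` is hypothetical.

```python
GRAMMAR = {
    "S": [["print", "(", "E", ")", ";"],
          ["while", "(", "B", ")", "do", "S"],
          ["{", "L", "}"]],
    "E": [["id"], ["num"]],
    "B": [["E", ">", "E"]],
    "L": [["S"], ["S", "L"]],
}


def rightmost_step(form, body):
    """Replace the rightmost non-terminal of form with body."""
    for i in range(len(form) - 1, -1, -1):
        if form[i] in GRAMMAR:
            if body not in GRAMMAR[form[i]]:
                raise ValueError(f"{form[i]} has no production with this body")
            return form[:i] + body + form[i + 1:]
    raise ValueError("no non-terminal left to replace")


# Bodies chosen at each step of Example 5, in order.
steps = [
    ["while", "(", "B", ")", "do", "S"],
    ["print", "(", "E", ")", ";"],
    ["id"],
    ["E", ">", "E"],
    ["num"],
    ["num"],
]

form = ["S"]
for body in steps:
    form = rightmost_step(form, body)
    print("⇒", " ".join(form))
```

A leftmost version would differ only in scanning the sentential form from index 0 upward.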
Non-Leftmost, Non-Rightmost Derivations
Some derivations are neither leftmost nor rightmost, such as:
S ⇒ while(B) do S
⇒ while(E>E) do S
⇒ while(E>E) do print(E);
⇒ while(E>id) do print(E);
⇒ while(num>id) do print(E);
⇒ while(num>id) do print(num);
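Whether a derivation is leftmost, rightmost, or neither can be checked step by step. The sketch below assumes the rules of Example 6 written as space-separated tokens; `expanded_index` and `classify` are hypothetical helper names, and `expanded_index` simply returns the first position that explains a step.

```python
GRAMMAR = {
    "S": [["print", "(", "E", ")", ";"],
          ["while", "(", "B", ")", "do", "S"],
          ["{", "L", "}"]],
    "E": [["id"], ["num"]],
    "B": [["E", ">", "E"]],
    "L": [["S"], ["S", "L"]],
}


def expanded_index(form, nxt):
    """Find the position whose non-terminal was rewritten to turn form into nxt."""
    for i, sym in enumerate(form):
        for body in GRAMMAR.get(sym, []):
            if form[:i] + body + form[i + 1:] == nxt:
                return i
    raise ValueError("not a single derivation step")


def classify(forms):
    """Return 'both', 'leftmost', 'rightmost' or 'neither' for a derivation."""
    leftmost = rightmost = True
    for form, nxt in zip(forms, forms[1:]):
        i = expanded_index(form, nxt)
        positions = [k for k, sym in enumerate(form) if sym in GRAMMAR]
        leftmost = leftmost and i == positions[0]
        rightmost = rightmost and i == positions[-1]
    if leftmost and rightmost:
        return "both"
    if leftmost:
        return "leftmost"
    if rightmost:
        return "rightmost"
    return "neither"


derivation = [line.split() for line in [
    "S",
    "while ( B ) do S",
    "while ( E > E ) do S",
    "while ( E > E ) do print ( E ) ;",
    "while ( E > id ) do print ( E ) ;",
    "while ( num > id ) do print ( E ) ;",
    "while ( num > id ) do print ( num ) ;",
]]
print(classify(derivation))
```

For the derivation above, the second step expands B while S is still to its right, and the third step expands S while two Es are still to its left, so the result is `neither`.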
CFG Shorthand
We can combine two rules of the form S → α and S → β to get the single rule S → α│β
Example 6: The CFG in Example 3 can be shortened as follows
Terminals = {id, num, if, then, else, print, =, {,}, ;, (, ) }
Non-Terminals = { S, E, B, L }
Rules = S → print(E); | while (B) do S | { L }
        E → id | num
        B → E > E
        L → S | SL
Start Symbol = S
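The shorthand is purely notational, so it can be expanded and recombined mechanically. The sketch below treats rules as plain strings and assumes that `|` is never itself a terminal of the grammar; the function names `expand_shorthand` and `combine` are hypothetical.

```python
def expand_shorthand(rule):
    """Split 'S → α | β' into the separate rules 'S → α' and 'S → β'."""
    head, _, bodies = rule.partition("→")
    return [f"{head.strip()} → {body.strip()}" for body in bodies.split("|")]


def combine(rules):
    """Merge rules that share a head into one shorthand rule per head."""
    grouped = {}
    for rule in rules:
        head, _, body = rule.partition("→")
        grouped.setdefault(head.strip(), []).append(body.strip())
    return [f"{head} → " + " | ".join(bodies) for head, bodies in grouped.items()]


for rule in expand_shorthand("S → print(E); | while (B) do S | { L }"):
    print(rule)
print(combine(["E → id", "E → num"]))
```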
Parse Trees
➢ A parse tree is a graphical representation of a derivation that filters out the order in which
productions are applied to replace non-terminals.
Each interior node of a parse tree represents the application of a production.
The interior node is labeled with the nonterminal A in the head of the production; the
children of the node are labeled, from left to right, by the symbols in the body of the
production by which this A was replaced during the derivation.
➢ We start with the initial symbol S of the grammar as the root of the tree
The children of the root are the symbols that were used to rewrite the initial symbol in
the derivation.
The internal nodes of the parse tree are non-terminals
The children of each internal node N are the symbols on the right-hand side of a rule
that has N as the left-hand side (e.g. B → E > E where E > E is the right-hand side and
B is the left-hand side of the rule)
➢ Terminals are leaves of the tree.
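A parse tree maps naturally onto a small recursive data structure. The sketch below is one possible representation, using the rule B → E > E mentioned above together with E → id and E → num; the names `Node`, `frontier`, and `productions_used` are chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    symbol: str                                    # grammar symbol labelling the node
    children: list = field(default_factory=list)   # empty for leaves


def frontier(node):
    """Read the leaves from left to right: the string the tree derives."""
    if not node.children:
        return [node.symbol]
    return [sym for child in node.children for sym in frontier(child)]


def productions_used(node):
    """List (head, body) for every interior node, visiting the tree in preorder."""
    if not node.children:
        return []
    used = [(node.symbol, [child.symbol for child in node.children])]
    for child in node.children:
        used.extend(productions_used(child))
    return used


# Tree for id > num: B → E > E, then E → id and E → num.
tree = Node("B", [Node("E", [Node("id")]), Node(">"), Node("E", [Node("num")])])
print(frontier(tree))
print(productions_used(tree))
```

The tree records which productions were applied but not the order of the two E rewrites, which is exactly the information a parse tree filters out.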
Example 7: -( id + id )
E ⇒ -E ⇒ -( E ) ⇒ -( E + E ) ⇒ -( id + E ) ⇒ -( id + id )
Example 8: id + id * id
E ⇒ E + E ⇒ E + E * E ⇒ E + id * E ⇒ E + id * id ⇒ id + id * id
[Figure: two parse trees, a) and b), for id + id * id]
Ambiguous Grammars
A grammar is ambiguous if there is at least one string derivable from the grammar that has
more than one different parse tree, or more than one leftmost derivation, or more than one
rightmost derivation
The string in Example 8 above has two parse trees (parse trees a and b), so the grammar that
generates it is ambiguous.
Ambiguous grammars are bad, because the parse trees don’t tell us the exact meaning of the
string.
For example, if we see the example 8 again, in figure a, the string means id + (id * id),
but in fig. b, the string means (id + id) * id. This is why we call it “ambiguous”.
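The two readings can be enumerated by brute force. The sketch below assumes the ambiguous rules E → E + E | E * E | id, which are inferred from the derivation in Example 8 rather than listed in these notes, and it substitutes the stand-in numbers 2, 3, and 4 for the three ids so that each tree can be evaluated.

```python
def parse_trees(tokens):
    """Return every parse tree of tokens under E → E + E | E * E | id.

    A tree is either an operand or a tuple (operator, left_tree, right_tree).
    """
    if len(tokens) == 1:
        return [tokens[0]] if tokens[0] not in ("+", "*") else []
    trees = []
    for i in range(1, len(tokens) - 1):
        if tokens[i] in ("+", "*"):
            # Let the operator at position i be the root of the tree.
            for left in parse_trees(tokens[:i]):
                for right in parse_trees(tokens[i + 1:]):
                    trees.append((tokens[i], left, right))
    return trees


def evaluate(tree):
    """Compute the value a parse tree assigns to the expression."""
    if not isinstance(tree, tuple):
        return tree
    op, left, right = tree
    if op == "+":
        return evaluate(left) + evaluate(right)
    return evaluate(left) * evaluate(right)


trees = parse_trees([2, "+", 3, "*", 4])
for tree in trees:
    print(tree, "=", evaluate(tree))
```

The same string gets two trees with different values (14 and 20), which is why a compiler cannot rely on an ambiguous grammar to fix the meaning of an expression.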
We need to change the grammar to fix this problem. How? We may rewrite the grammar as follows:
Terminals = {id, +, -, *, /, (, )}
Non-Terminals = {E, T, F }
Start Symbol = E
Rules = E → E + T
        E → E - T
        E → T
        T → T * F
        T → T / F
        T → F
        F → id
        F → (E)
A parse tree for id * (id + id)
We need to make sure that all additions appear higher in the tree than multiplications (Why?)
How can we do this?
➢ Once we replace an E with E*E (rule 4), we don't want to rewrite either of the Es
we've just created using the addition rule (rule 2), since that would place an addition (+) lower in
the tree than a multiplication (*)
➢ Let’s create a new non-terminal T for multiplication and division.
➢ T will generate strings of id’s multiplied or divided together, with no additions or subtractions.
➢ Then we can modify E to generate strings of T’s added together
➢ This modified grammar is shown above.
➢ However, at this stage the grammar is still ambiguous, even though it is now impossible to generate
a parse tree from this CFG that has * higher than + in the tree
➢ Consider the string id+id+id, which has two parse trees, as shown at example 2 of slide 5
id+id+id = (id+id)+id
         = id+(id+id)     both groupings give the same result
id-id-id = (id-id)-id
         ≠ id-(id-id)     so only the left grouping is correct
➢ We would like addition and subtraction to associate to the left, as above.
➢ In other words, we need to make sure that the right sub-tree of an addition or subtraction is
not another addition or subtraction
➢ Applying both changes to the grammar of Example 8 yields the unambiguous CFG and the parse tree
shown above on this page.
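The effect of the unambiguous grammar can be seen in a small hand-written parser. The sketch below follows E → E + T | E - T | T, T → T * F | T / F | F, F → (E) | id, as in Example 2. Because a recursive-descent parser cannot call itself on a left-recursive rule, the left recursion is replaced by loops here; that is a standard technique brought in for this sketch, not something these notes describe. The function names and the tuple representation of trees are assumptions.

```python
def parse(tokens):
    """Parse a token list and return a tree of nested (operator, left, right) tuples."""
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def factor():                      # F → ( E ) | id
        tok = peek()
        if tok == "(":
            advance()
            node = expr()
            if peek() != ")":
                raise SyntaxError("expected ')'")
            advance()
            return node
        if tok is None or tok in ("+", "-", "*", "/", ")"):
            raise SyntaxError(f"unexpected token {tok!r}")
        return advance()               # any other token is treated as an operand

    def term():                        # T → T * F | T / F | F, written as a loop
        node = factor()
        while peek() in ("*", "/"):
            op = advance()
            node = (op, node, factor())    # earlier result becomes the left child
        return node

    def expr():                        # E → E + T | E - T | T, written as a loop
        node = term()
        while peek() in ("+", "-"):
            op = advance()
            node = (op, node, term())
        return node

    tree = expr()
    if pos != len(tokens):
        raise SyntaxError(f"unexpected token {tokens[pos]!r}")
    return tree


print(parse("id - id - id".split()))
print(parse("id + id * id".split()))
print(parse("id * ( id + id )".split()))
```

Each string now has exactly one tree: subtraction groups to the left, and * sits below + unless parentheses say otherwise.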
Review Exercises
Note: attempt all questions individually.
Submit your answer on [email protected] Due date: April 5, 2024 G.C.
Which of the following strings are derivable from the grammar? Give the parse tree for each
derivable string.
i. ab
ii. aabbb
iii. aba
iv. aaabb
v. aaaabb
vi. aabb
3. Show that the following CFGs are ambiguous by giving two parse trees for the same string.
3.1) Terminals = { a, b }
Non-Terminals = {S, T}
Start Symbol = S
Rules = S → STS
        S → b
        T → aT
        T → ε
3.2) Terminals = { if, then, else, print, id }
Non-Terminals = {S, T}
Start Symbol = S
Rules = S → if id then S T
        S → print id
        T → else S
        T → ε