Chapter 3 - Syntax Analysis Part One

Compiler Design
Chapter Three: Syntactic Analyzer, Part One

The objectives of this chapter are as follows:
❖ Explain the basic roles of the parser (syntactic analyzer).
❖ Describe context-free grammars (CFGs) and their representation format.
❖ Discuss the different kinds of derivation: leftmost derivations, rightmost derivations, and
derivations that are neither leftmost nor rightmost.
❖ Be familiar with CFG shorthand techniques.
❖ Describe the parse tree and its structure.
❖ Discuss ambiguous grammars and how to remove ambiguity from CFGs.
The Role of the Parser

❖ The parser obtains a string of tokens from the lexical analyzer, as shown in the figure
below, and verifies that the string of token names can be generated by the grammar for
the source language.
❖ The parser is expected to report any syntax errors and to recover from commonly occurring
errors to continue processing the remainder of the program.
❖ Conceptually, for well-formed programs, the parser constructs a parse tree and passes it to the
rest of the compiler for further processing. In fact, the parse tree need not be constructed
explicitly, since checking and translation actions can be interspersed with parsing. Thus, the
parser and the rest of the front end could well be implemented by a single module.
Fig. Position of parser in compiler model
❖ The parser therefore performs context-free syntax analysis, guides context-sensitive analysis, constructs
an intermediate representation, produces meaningful error messages, and attempts error correction.
Compiled by: Dawit K. 1
❖ A grammar gives a precise, yet easy-to-understand, syntactic specification of a programming language.
• From certain classes of grammars, we can automatically construct an efficient parser that
determines the syntactic structure of a source program.
• As a side benefit, the parser-construction process can reveal syntactic ambiguities and trouble spots
that might have slipped through the initial design phase of a language.
• The structure imparted to a language by a properly designed grammar is useful for translating
source programs into correct object code and for detecting errors.
❖ A grammar allows a language to be evolved or developed iteratively, by adding new constructs to perform
new tasks.
❖ These new constructs can be integrated more easily into an implementation that follows the grammatical
structure of the language.
❖ There are three general types of parsers for grammars: universal, top-down, and bottom-up.
• Universal parsing methods, such as the Cocke-Younger-Kasami (CYK) algorithm and Earley's algorithm,
can parse any grammar (further reading on these is recommended).
❖ These general methods are, however, too inefficient to use in production compilers.
❖ The methods commonly used in compilers can be classified as being either top-down or bottom-up.
• Top-down methods: as implied by the name, these build parse trees from the top
(root) to the bottom (leaves).
• Bottom-up methods: these start from the leaves and work their way up to the root to build the parse
tree.
• In either case, the input to the parser is scanned from left to right, one symbol at a time.
❖ The most efficient top-down and bottom-up methods work only for sub-classes of grammars, but several
of these classes, particularly, LL and LR grammars, are expressive enough to describe most of the
syntactic constructs in modern programming languages.
Error Handling
Common programming errors include:
• Lexical errors, syntactic errors, semantic errors, and logical errors. The errors
handled in this phase of compilation are syntactic errors.
Error handler goals

• Report the presence of errors clearly and accurately
• Recover from each error quickly enough to detect subsequent errors
• Add minimal overhead to the processing of correct programs

Common error-recovery strategies include:
1. Panic-mode recovery: discard input symbols one at a time until one of a designated set of synchronizing
tokens is found.
2. Phrase-level recovery: replace a prefix of the remaining input by some string that allows the parser to
continue.
3. Error productions: augment the grammar with productions that generate the erroneous constructs.
4. Global correction: choose a minimal sequence of changes to obtain a globally least-cost correction.
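Panic-mode recovery can be illustrated with a minimal sketch. It assumes tokens are plain strings and that the synchronizing set is {";", "}"}; both choices, and the function name panic_mode_skip, are assumptions made for this illustration only.

```python
def panic_mode_skip(tokens, pos, sync_tokens):
    """Discard tokens one at a time, starting at pos, until a synchronizing token is found.

    Returns (new_pos, discarded). new_pos is the index of the synchronizing
    token, or len(tokens) if the input ends before one is found.
    """
    discarded = []
    while pos < len(tokens) and tokens[pos] not in sync_tokens:
        discarded.append(tokens[pos])
        pos += 1
    return pos, discarded


# Hypothetical input: the parser is assumed to detect an error at index 4 (the second "+").
tokens = ["print", "(", "id", "+", "+", ";", "print", "(", "id", ")", ";"]
new_pos, skipped = panic_mode_skip(tokens, 4, {";", "}"})
print("resume at index", new_pos, "after discarding", skipped)
```

After skipping, a parser would resume at the synchronizing token and continue with the remainder of the program, which is what lets it report later errors as well.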
Context-Free Grammars (CFGs)

❖ A CFG is used as a tool to describe the syntax of a programming language.
❖ A CFG includes 4 components:
1. A set of terminals T, which are the tokens of the language

o Terminals are the basic symbols from which strings are formed.
o The term "token name" is a synonym for "terminal".
2. A set of non-terminals N
o Non-terminals are syntactic variables that denote sets of strings.
o The sets of strings denoted by non-terminals help define the language generated by the grammar.
o Non-terminals impose a hierarchical structure on the language that is key to syntax analysis and
translation.
3. A set of rewriting rules R.

o The left-hand side (head) of each rewriting rule is a single non-terminal.
o The right-hand side (body) of each rewriting rule is a string of terminals and/or non-terminals
4. A special non-terminal S ∈ N, which is the start symbol. The productions for the start symbol are listed
first.
❖ Just as a regular expression generates strings of characters, a CFG generates strings of tokens.
❖ A string of tokens is generated by a CFG in the following way:

1. The initial string is the start symbol S.
2. While there are non-terminals left in the string:
a. Pick any non-terminal A in the string.
b. Replace a single occurrence of A in the string with the right-hand side of any rule that has A as
the left-hand side.
c. Repeat a and b until all elements in the string are terminals.
Example 1: A grammar that defines simple arithmetic expressions:

Terminals = {id, +, -, *, /, (, )}
Non-Terminals = {expression, term, factor}
Start Symbol = expression
Rules = expression → expression + term
                   → expression - term
                   → term
        term → term * factor
             → term / factor
             → factor
        factor → ( expression )
               → id
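The generation procedure described before Example 1 can be sketched on this grammar as follows. This is only an illustration: the dictionary layout, the function name generate, and the step budget are assumptions. The budget is needed because purely random rule choices on a recursive grammar may never terminate, so once it is used up the sketch always picks the last alternative of each non-terminal (expression → term, term → factor, factor → id), which is guaranteed to finish.

```python
import random

# Example 1 grammar: each non-terminal maps to its list of right-hand sides.
GRAMMAR = {
    "expression": [["expression", "+", "term"], ["expression", "-", "term"], ["term"]],
    "term": [["term", "*", "factor"], ["term", "/", "factor"], ["factor"]],
    "factor": [["(", "expression", ")"], ["id"]],
}


def generate(grammar, start, rng, budget=10):
    """Rewrite non-terminals until only terminals remain; return the token list."""
    string = [start]                      # step 1: begin with the start symbol
    steps = 0
    while True:
        positions = [i for i, sym in enumerate(string) if sym in grammar]
        if not positions:                 # no non-terminals left: done
            return string
        i = rng.choice(positions)         # step a: pick any non-terminal
        alternatives = grammar[string[i]]
        # step b: pick any rule; after the budget, force the terminating one
        body = rng.choice(alternatives) if steps < budget else alternatives[-1]
        string[i:i + 1] = body
        steps += 1


print(" ".join(generate(GRAMMAR, "expression", random.Random(0))))
```

Different seeds give different strings, but every result consists only of terminals of the grammar.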
Notational Conventions
1. These symbols are terminals:
A. Lowercase letters early in the alphabet, such as a, b, c.
B. Operator symbols such as +, *, and so on.
C. Punctuation symbols such as parentheses, comma, and so on.
D. The digits 0, 1, ... ,9.
E. Boldface strings such as id or if, each of which represents a single terminal symbol.
2. These symbols are non-terminals:

A. Uppercase letters early in the alphabet, such as A, B, C.
B. The letter S, which, when it appears, is usually the start symbol.
C. Lowercase, italic names such as expr or stmt.
D. When discussing programming constructs, uppercase letters may be used to represent the non-terminals for the constructs.
For example, non-terminals for expressions, terms, and factors are often represented by E, T, and F, respectively.
3. Uppercase letters late in the alphabet, such as X, Y, Z, represent grammar symbols; that is, either non-
terminals or terminals.
4. Lowercase letters late in the alphabet, chiefly u, v, ..., z, represent (possibly empty) strings of terminals.
5. Lowercase Greek letters, α, β, γ for example, represent (possibly empty) strings of grammar symbols.
❖ Thus, a generic production can be written as A → α, where A is the head and α the body.
6. A set of productions A → α1, A → α2, A → α3, ..., A → αk with a common head A (call them
A-productions) may be written A → α1 | α2 | α3 | ... | αk. Call α1, α2, α3, ..., αk the alternatives for A.
7. Unless stated otherwise, the head of the first production is the start symbol.

Example 2: - Using these conventions, the grammar of Example 1 can be rewritten concisely as:
E → E + T | E - T | T
T → T * F | T / F | F
F → ( E ) | id
Derivations
❖ A derivation is a description of how a string is generated from the start symbol of a grammar.
❖ The construction of a parse tree can be made precise by taking a derivational view, in which
productions are treated as rewriting rules. Beginning with the start symbol, each rewriting step
replaces a nonterminal by the body of one of its productions.
❖ For a general definition of derivation, consider a non-terminal A in the middle of a sequence of
grammar symbols, as in αAβ, where α and β are arbitrary strings of grammar symbols.
o Suppose A → γ is a production. Then we write αAβ ⇒ αγβ.
o The symbol ⇒ means "derives in one step."
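A single derivation step αAβ ⇒ αγβ can be sketched as a small function. The representation of a sentential form as a list of symbols and the name derive_step are assumptions made for illustration.

```python
def derive_step(form, position, head, body):
    """Apply the production head → body to the symbol at `position` of a sentential form."""
    if form[position] != head:
        raise ValueError(
            f"symbol at position {position} is {form[position]!r}, not {head!r}"
        )
    # form = α A β   becomes   α γ β
    return form[:position] + list(body) + form[position + 1:]


# Here α = ( id +, A = E, β = ), and the production is E → id.
print(derive_step(["(", "id", "+", "E", ")"], 3, "E", ["id"]))
```

The function returns a new list and leaves the original form untouched, so a whole derivation can be recorded as a sequence of forms.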
❖ Example 3: Use the CFG below to perform the derivations in Examples 4 and 5.
Terminals = {id, num, if, then, else, print, while, do, =, >, {, }, ;, (, )}

Non-Terminals = { S, E, B, L }
Rules = (1) S → print(E);
(2) S → while (B) do S
(3) S → { L }
(4) E → id
(5) E → num
(6) B → E > E
(7) L → S
(8) L → SL
Start Symbol = S
Leftmost Derivations
• A string of terminals and non-terminals α that can be derived from the start symbol of the
grammar is called a sentential form.
• Thus the strings "{ SL }", "while(id>E) do S", and "print(E);" from the above example
are all sentential forms.
• A derivation is leftmost if, at each step in the derivation, the leftmost non-terminal is
the one selected for replacement.
• A sentential form that occurs in a leftmost derivation is called a left-sentential form.
Example 4: We can use a leftmost derivation to generate while(id>num) do print(id); from
the above CFG (Example 3) as follows:
S ⇒ while(B) do S
⇒ while(E>E) do S
⇒ while(id>E) do S
⇒ while(id>num) do S
⇒ while(id>num) do print(E);
⇒ while(id>num) do print(id);
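The leftmost derivation of Example 4 can be replayed mechanically. The sketch below assumes the rule numbering of Example 3 and represents sentential forms as token lists; the function name leftmost_derivation and the encoding of a derivation as a list of rule numbers are illustrative assumptions.

```python
# Rules of Example 3, keyed by their rule number.
RULES = {
    1: ("S", ["print", "(", "E", ")", ";"]),
    2: ("S", ["while", "(", "B", ")", "do", "S"]),
    3: ("S", ["{", "L", "}"]),
    4: ("E", ["id"]),
    5: ("E", ["num"]),
    6: ("B", ["E", ">", "E"]),
    7: ("L", ["S"]),
    8: ("L", ["S", "L"]),
}
NON_TERMINALS = {"S", "E", "B", "L"}


def leftmost_derivation(rule_numbers, start="S"):
    """Apply each rule to the leftmost non-terminal; return every sentential form."""
    form = [start]
    forms = [form]
    for n in rule_numbers:
        head, body = RULES[n]
        # Index of the leftmost non-terminal in the current form.
        i = next(k for k, sym in enumerate(form) if sym in NON_TERMINALS)
        if form[i] != head:
            raise ValueError(f"rule {n} rewrites {head}, but the leftmost non-terminal is {form[i]}")
        form = form[:i] + body + form[i + 1:]
        forms.append(form)
    return forms


# Example 4 uses rules 2, 6, 4, 5, 1, 4 in that order.
for form in leftmost_derivation([2, 6, 4, 5, 1, 4]):
    print(" ".join(form))
```

Every line printed is a left-sentential form, and the last one is the generated string.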
Rightmost Derivations
• A derivation is rightmost if, at each step, the rightmost non-terminal is the one selected for replacement.
Example 5: Generate while(num>num) do print(id); from the CFG in Example 3.
S ⇒ while(B) do S
⇒ while(B) do print(E);
⇒ while(B) do print(id);
⇒ while(E>E) do print(id);
⇒ while(E>num) do print(id);
⇒ while(num>num) do print(id);
Non-Leftmost, Non-Rightmost Derivations
• Some derivations are neither leftmost nor rightmost, such as:
S ⇒ while(B) do S
⇒ while(E>E) do S
⇒ while(E>E) do print(E);
⇒ while(E>id) do print(E);
⇒ while(num>id) do print(E);
⇒ while(num>id) do print(num);
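Whether a derivation is leftmost, rightmost, or neither can be checked by replaying it and recording which non-terminal was rewritten at each step. In this sketch a derivation is given as (position, rule number) pairs, where position is the index of the rewritten symbol in the current token list; that encoding and the function name classify_derivation are assumptions for illustration, and the rule numbers follow Example 3.

```python
RULES = {
    1: ("S", ["print", "(", "E", ")", ";"]),
    2: ("S", ["while", "(", "B", ")", "do", "S"]),
    3: ("S", ["{", "L", "}"]),
    4: ("E", ["id"]),
    5: ("E", ["num"]),
    6: ("B", ["E", ">", "E"]),
    7: ("L", ["S"]),
    8: ("L", ["S", "L"]),
}
NON_TERMINALS = {"S", "E", "B", "L"}


def classify_derivation(steps, start="S"):
    """Replay (position, rule) steps; return (kind, final token list)."""
    form = [start]
    leftmost = rightmost = True
    for position, rule in steps:
        head, body = RULES[rule]
        if form[position] != head:
            raise ValueError(f"rule {rule} cannot rewrite {form[position]!r}")
        nts = [i for i, sym in enumerate(form) if sym in NON_TERMINALS]
        leftmost = leftmost and position == nts[0]
        rightmost = rightmost and position == nts[-1]
        form = form[:position] + body + form[position + 1:]
    if leftmost and rightmost:
        kind = "leftmost and rightmost"
    elif leftmost:
        kind = "leftmost"
    elif rightmost:
        kind = "rightmost"
    else:
        kind = "neither"
    return kind, form


# The derivation of while(num>id) do print(num); shown above.
mixed = [(0, 2), (2, 6), (7, 1), (4, 4), (2, 5), (9, 5)]
kind, final = classify_derivation(mixed)
print(kind, ":", " ".join(final))
```

The same function reports "leftmost" for the steps of Example 4, which in this encoding are (0, 2), (2, 6), (2, 4), (4, 5), (7, 1), (9, 4).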
CFG Shorthand
• We can combine two rules of the form S → α and S → β to get the single rule S → α | β.
Example 6: The CFG in Example 3 can be shortened as follows:
Terminals = {id, num, if, then, else, print, while, do, =, >, {, }, ;, (, )}
Non-Terminals = {S, E, B, L}
Rules = S → print(E); | while (B) do S | { L }
E → id | num
B → E > E
L → S | SL
Start Symbol = S
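Going the other way, a shorthand rule can be expanded back into individual productions. The sketch below assumes that symbols are separated by spaces and that "→" and "|" are never terminals of the grammar; the function name expand_shorthand is hypothetical.

```python
def expand_shorthand(rule):
    """Split 'A → α | β | ...' into a list of (head, body) productions."""
    head, bodies = rule.split("→")
    return [(head.strip(), body.split()) for body in bodies.split("|")]


shorthand = [
    "S → print ( E ) ; | while ( B ) do S | { L }",
    "E → id | num",
    "B → E > E",
    "L → S | S L",
]
for line in shorthand:
    for head, body in expand_shorthand(line):
        print(head, "→", " ".join(body))
```

Applied to the four shorthand lines, this recovers the eight numbered rules of Example 3.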
Parse Trees
➢ A parse tree is a graphical representation of a derivation that filters out the order in which
productions are applied to replace non-terminals.
• Each interior node of a parse tree represents the application of a production.
• The interior node is labeled with the non-terminal A in the head of the production; the
children of the node are labeled, from left to right, by the symbols in the body of the
production by which this A was replaced during the derivation.
➢ We start with the start symbol S of the grammar as the root of the tree.
• The children of the root are the symbols that were used to rewrite the start symbol in
the derivation.
• The internal nodes of the parse tree are non-terminals.
• The children of each internal node N are the symbols on the right-hand side of a rule
that has N as the left-hand side (e.g. B → E > E, where E > E is the right-hand side and
B is the left-hand side of the rule).
➢ Terminals are leaves of the tree.
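Examples 7 and 8 use the productions that appear in their derivations: E → E + E | E * E | - E | ( E ) | id.
Example 7: -(id + id)
E ⇒ -E ⇒ -(E) ⇒ -(E + E) ⇒ -(id + E) ⇒ -(id + id)
Example 8: id + id * id
E ⇒ E + E ⇒ E + E * E ⇒ E + id * E ⇒ E + id * id ⇒ id + id * id
Fig. Parse trees a) and b) for id + id * id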
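A parse tree can be represented as nodes holding a label and a list of children. The sketch below builds the tree for the derivation of -(id + id) in Example 7 by hand and reads its leaves from left to right, which gives back the derived string. The Node class and the function name frontier are assumptions for illustration, and the sketch ignores ε-productions.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str                                   # a terminal or a non-terminal
    children: list = field(default_factory=list)


def frontier(node):
    """Return the leaf labels from left to right (the yield of the tree)."""
    if not node.children:
        return [node.label]
    leaves = []
    for child in node.children:
        leaves.extend(frontier(child))
    return leaves


def leaf(symbol):
    return Node(symbol)


# Each interior node corresponds to one production applied in Example 7.
inner_sum = Node("E", [Node("E", [leaf("id")]), leaf("+"), Node("E", [leaf("id")])])  # E → E + E
parenthesized = Node("E", [leaf("("), inner_sum, leaf(")")])                          # E → ( E )
tree = Node("E", [leaf("-"), parenthesized])                                          # E → - E

print(" ".join(frontier(tree)))
```

The tree records which productions were applied but not the order in which the two inner E nodes were rewritten, which is exactly the information a parse tree filters out.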
Ambiguous Grammars
• A grammar is ambiguous if there is at least one string derivable from the grammar that has
more than one parse tree, or equivalently more than one leftmost derivation or more than one
rightmost derivation.
• The string in Example 8 above has two parse trees (a and b), so the grammar used there is
ambiguous.
• Ambiguous grammars are undesirable because the parse trees do not tell us the exact meaning of the
string.
• For example, looking at Example 8 again, in figure a the string means id + (id * id),
but in figure b the string means (id + id) * id. This is why we call the grammar "ambiguous".
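The ambiguity can be checked by counting parse trees. The sketch below is written specifically for the grammar E → E + E | E * E | - E | ( E ) | id used in Examples 7 and 8; it is not a general parser, and the function name count_parse_trees is hypothetical. It counts, for each span of the token list, the number of distinct parse trees that derive that span from E.

```python
from functools import lru_cache


def count_parse_trees(tokens):
    """Count parse trees for tokens under E → E + E | E * E | - E | ( E ) | id."""
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def count(i, j):
        # Number of parse trees deriving tokens[i:j] from E.
        if j - i <= 0:
            return 0
        if j - i == 1:
            return 1 if tokens[i] == "id" else 0          # E → id
        total = 0
        if tokens[i] == "-":                              # E → - E
            total += count(i + 1, j)
        if tokens[i] == "(" and tokens[j - 1] == ")":     # E → ( E )
            total += count(i + 1, j - 1)
        for k in range(i + 1, j - 1):                     # E → E + E | E * E
            if tokens[k] in ("+", "*"):
                total += count(i, k) * count(k + 1, j)
        return total

    return count(0, len(tokens))


print(count_parse_trees("id + id * id".split()))
print(count_parse_trees("- ( id + id )".split()))
```

A count greater than one for any string is enough to show that the grammar is ambiguous.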

We need to change the grammar to fix this problem. How? We may rewrite the grammar as follows:
Terminals = {id, +, -, *, /, (, )}
Non-Terminals = {E, T, F}
Start Symbol = E
Rules = E → E + T
E → E - T
E → T
T → T * F
T → T / F
T → F
F → id
F → ( E )
Fig. A parse tree for id * (id + id)
We need to make sure that all additions appear higher in the tree than multiplications, because
operators lower in the tree are applied first, so that * binds more tightly than +.
How can we do this?
➢ Once we replace an E with E * E, we do not want to rewrite either of the Es
we have just created using an addition rule, since that would place an addition (+) lower in the tree than a
multiplication (*).
➢ Let's create a new non-terminal T for multiplication and division.
➢ T will generate strings of id's multiplied or divided together, with no additions or subtractions
outside parentheses.
➢ Then we can modify E to generate strings of T's added or subtracted together.
➢ With this change it is impossible to generate a parse tree that has * higher than + in the tree,
unless the addition is parenthesized.
➢ However, fixing precedence alone is not enough. If the rules are written symmetrically, for example
E → E + E | E - E | T, the grammar is still ambiguous: the string id+id+id has two parse trees.
id+id+id = (id+id)+id or
         = id+(id+id)     both groupings give the same value
id-id-id = (id-id)-id
        != id-(id-id)     the second grouping is wrong
➢ We would like addition and subtraction to associate to the left, as above.
➢ In other words, we need to make sure that the right sub-tree of an addition or subtraction is
not another addition or subtraction. This is why the rules are E → E + T and E → E - T, whose
right operand is a T rather than an E.
➢ The grammar shown above applies both changes, replacing the ambiguous grammar of Example 8
with an unambiguous CFG.
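The effect of the rewritten grammar on precedence and associativity can be seen in a small parser. This is a sketch under assumptions: it uses recursive descent and replaces the left-recursive rules E → E + T and T → T * F with loops, a standard implementation technique that this chapter does not cover, and it returns nested tuples instead of a full parse tree. The name parse_expression is hypothetical.

```python
def parse_expression(tokens):
    """Parse tokens with E → E+T | E-T | T, T → T*F | T/F | F, F → id | (E)."""
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        token = tokens[pos]
        pos += 1
        return token

    def factor():                       # F → id | ( E )
        if peek() == "id":
            return advance()
        if peek() == "(":
            advance()
            inner = expression()
            if peek() != ")":
                raise SyntaxError("expected ')'")
            advance()
            return inner
        raise SyntaxError(f"unexpected token {peek()!r}")

    def term():                         # T → T * F | T / F | F, as a loop
        left = factor()
        while peek() in ("*", "/"):
            op = advance()
            left = (left, op, factor())
        return left

    def expression():                   # E → E + T | E - T | T, as a loop
        left = term()
        while peek() in ("+", "-"):
            op = advance()
            left = (left, op, term())   # the right operand is always a T
        return left

    tree = expression()
    if pos != len(tokens):
        raise SyntaxError(f"unexpected token {peek()!r}")
    return tree


print(parse_expression("id - id - id".split()))
print(parse_expression("id + id * id".split()))
```

In the first result the subtraction groups to the left, and in the second the multiplication sits below the addition, matching the structure the grammar enforces.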

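Extended Backus-Naur Form (EBNF)
• EBNF extends the CFG notation by allowing regular-expression operators, such as + for "one or more", on the
right-hand side of CFG rules.

• For example, consider the following CFG, which describes simple Java statement blocks and
stylized simple Java print statements: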
1. S → { B }
2. S → print(id)
3. B → S ; C
4. C → S ; C
5. C → ε
Rules 3, 4, and 5 in the above grammar describe a series of one or more statements S, each terminated
by a semicolon.
We could express the same language using EBNF as follows:
1. S → { B }
2. S → print "(" id ")"
3. B → (S;)+
Note
▪ In Rule 2, when we want a parenthesis to appear as a terminal in EBNF, we need to surround it with
quotation marks.
▪ In Rule 3, the pair of parentheses groups S; for the + operator and does not belong to the language.
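The EBNF rule B → (S;)+ maps naturally onto a loop that must run at least once. The recognizer below is a minimal sketch for the three EBNF rules above, assuming the input is already a list of tokens; the function names parse_statement, parse_block, and accepts are hypothetical.

```python
def parse_statement(tokens, pos):
    """S → { B } | print "(" id ")". Return the position just after one S."""
    if tokens[pos:pos + 4] == ["print", "(", "id", ")"]:
        return pos + 4
    if pos < len(tokens) and tokens[pos] == "{":
        pos = parse_block(tokens, pos + 1)
        if pos < len(tokens) and tokens[pos] == "}":
            return pos + 1
        raise SyntaxError("expected '}'")
    raise SyntaxError("expected a statement")


def parse_block(tokens, pos):
    """B → (S;)+ : one or more statements, each followed by a semicolon."""
    while True:
        pos = parse_statement(tokens, pos)
        if pos >= len(tokens) or tokens[pos] != ";":
            raise SyntaxError("expected ';'")
        pos += 1
        # Repeat only if another statement starts here.
        if pos >= len(tokens) or tokens[pos] not in ("print", "{"):
            return pos


def accepts(tokens):
    """True if the whole token list is exactly one statement S."""
    try:
        return parse_statement(tokens, 0) == len(tokens)
    except SyntaxError:
        return False


print(accepts("{ print ( id ) ; print ( id ) ; }".split()))
print(accepts("{ }".split()))
```

The empty block { } is rejected because the + operator requires at least one S; in the original CFG the same requirement comes from rule 3, B → S ; C.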
Review Exercises
Note: attempt all questions individually.
Submit your answer on [email protected] Due date: April 5, 2024 G.C.
1. Consider the context-free grammar: S → S S + | S S * | a and the string aa + a*.

a) Give a leftmost derivation for the string.
b) Give a rightmost derivation for the string.
c) Give a parse tree for the string.
d) Is the grammar ambiguous or unambiguous? Justify your answer.
e) Describe the language generated by this grammar.
2. Consider the following grammar

Terminals = { a, b }
Non-Terminals = {S, T, F }
Start Symbol = S
Rules = S→ TF
T→ T T T
T→ a
F→ aFb
F→ b
Which of the following strings are derivable from the grammar? Give the parse tree for each
derivable string.
i. ab
ii. aabbb
iii. aba
iv. aaabb
v. aaaabb
vi. aabb
3. Show that the following CFGs are ambiguous by giving two parse trees for the same string.
3.1) Terminals = {a, b}
Non-Terminals = {S, T}
Start Symbol = S
Rules = S → STS
S → b
T → aT
T → ε
3.2) Terminals = {if, then, else, print, id}
Non-Terminals = {S, T}
Start Symbol = S
Rules = S → if id then S T
S → print id
T → else S
T → ε
4. Construct a CFG for each of the following:

a. All integers with sign (Example: +3, -3)
b. The set of all strings over { (, ), [, ] } which form balanced parentheses. That is, (), ()(),
((()())()), [()()] and ([()[]()]) are in the language but )( , ][ , (() and ([ are not.
c. The set of all strings over {num, +, -, *, /} which are legal binary postfix expressions.
Thus num num +, num num num + *, and num num - num * are all in the language, while
num *, num * num and num num num - are not in the language.
d. Are your CFGs in a, b and c ambiguous?
