
18CSC304J

COMPILER DESIGN

UNIT 2
SESSIONS 1 & 2
Topics that will be covered in this Session

• Syntax Analysis Definition


• Role of Parser
• Context Free Grammar
• Lexical versus Syntactic Analysis
• Syntax Error Handling
SYNTAX ANALYSIS
DEFINITION
Syntax Analysis

• Syntax Analysis is the second phase of the compiler design process

• It analyzes the syntactic structure of the source program

• It checks if the given input is in the correct syntax of the programming language or not

• Every programming language has rules that prescribe the syntactic structure of well-formed
programs

• In C, for example, a program is made up of functions, a function out of declarations and statements, a statement out of expressions, and so on

• The syntax of programming language constructs can be specified by context-free grammars or BNF (Backus-Naur Form) notation

• Grammars offer significant benefits for both language designers and compiler writers
Benefits of Grammar
• A grammar gives a precise, yet easy-to-understand, syntactic specification of a programming
language

• From certain classes of grammars, we can construct automatically an efficient parser that
determines the syntactic structure of a source program

• The parser construction process can reveal syntactic ambiguities and trouble spots that might
have slipped through the initial design phase of a language

• The structure imparted to a language by a properly designed grammar is useful for translating
source programs into correct object code and for detecting errors

• A grammar allows a language to be evolved or developed iteratively, by adding new constructs to perform new tasks

• These new constructs can be integrated more easily into an implementation that follows the
grammatical structure of the language
ROLE OF THE PARSER
The Role of Parser
• Input : Stream of tokens
• Output : Some representation of the parse tree

• The parser obtains a string of tokens from the lexical analyzer and verifies that the string can be generated by the grammar for the source language
• The parser should also report syntax errors in an intelligible fashion
• It should also recover from commonly occurring errors

Parsing Methods:
• Universal Parsing
  • Uses the Cocke-Younger-Kasami algorithm and Earley's algorithm
  • Can parse any grammar
  • Too inefficient to use in compilers
• Top-down Parsing and Bottom-up Parsing
  • Commonly used in compilers
The Role of the Parser – cont..
Top-down Parsing

• Top-down parsers build parse trees from the top (root) to the bottom (leaves)

Bottom-up Parsing

• Bottom-up parsers start from the leaves and work up to the root

Note

• The input to the parser is always scanned from left to right, one symbol at a time

• The most efficient top-down and bottom-up methods work only for sub-classes of grammars

• LL and LR grammars are expressive enough to describe most of the syntactic constructs in modern programming languages

• Parsers implemented by hand often use LL grammars (e.g., the predictive parsing approach; see the sketch below)

• Parsers for the larger class of LR grammars are usually constructed using automated tools
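A minimal sketch of a hand-written predictive (recursive-descent) parser, assuming a simplified, non-left-recursive expression grammar and the token names and '$' end marker shown (all illustrative, not from the slides):

# Recursive-descent (predictive) parser sketch for the grammar
#   E -> T { '+' T }        T -> 'id' | '(' E ')'
def parse(tokens):
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else '$'

    def match(expected):
        nonlocal pos
        if peek() != expected:
            raise SyntaxError(f"expected {expected!r}, found {peek()!r}")
        pos += 1

    def E():                 # E -> T { '+' T }
        T()
        while peek() == '+':
            match('+')
            T()

    def T():                 # T -> 'id' | '(' E ')'
        if peek() == 'id':
            match('id')
        elif peek() == '(':
            match('(')
            E()
            match(')')
        else:
            raise SyntaxError(f"unexpected token {peek()!r}")

    E()
    match('$')               # the whole input must be consumed

parse(['id', '+', '(', 'id', '+', 'id', ')', '$'])   # accepts silently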
The Role of the Parser – cont..
Tasks that may be conducted during parsing

• Collecting information about various tokens into the symbol table

• Performing type checking and other kinds of semantic analysis

• Generating intermediate code

These activities are lumped into the “Rest of the front end” box in the picture
CONTEXT FREE
GRAMMAR
Grammar & its Types
• Grammar denotes syntactical rules in languages

• Noam Chomsky gave a mathematical model for grammar. According to him there are 4 types of
grammars

Grammar Type | Grammar                                 | Language Accepted                | Automaton
Type 0       | Unrestricted Grammar                    | Recursively Enumerable Language  | Turing Machine
Type 1       | Context Sensitive Grammar               | Context Sensitive Language       | Linear Bounded Automata
Type 2       | Context Free Grammar                    | Context Free Language            | Pushdown Automata
Type 3       | Regular Grammar (or) Regular Expression | Regular Language                 | Finite Automata
Context Free Grammar
• A Context-Free Grammar is used to systematically describe the syntax of programming
language constructs like expressions and statements

• A context-free grammar (grammar, for short) consists of terminals, non-terminals, a start symbol and productions

1. Terminals are the basic symbols from which strings are formed

2. Non-terminals are syntactic variables that denote sets of strings

3. In a grammar, one non-terminal is distinguished as the start symbol, and the set of strings
it denotes is the language generated by the grammar. Conventionally, the productions for
the start symbol are listed first

4. The productions of a grammar specify the manner in which the terminals and non-
terminals can be combined to form strings.
Context Free Grammar : Example
The same grammar in three equivalent notations:

expression → expression + term
expression → expression – term
expression → term
term → term * factor
term → term / factor
term → factor
factor → ( expression )
factor → id

Non-terminals: expression, term, factor
Start Symbol: expression

OR

E → E + T
E → E – T
E → T
T → T * F
T → T / F
T → F
F → ( E )
F → id

Non-terminals: E, T, F
Start Symbol: E

OR

E → E + T | E – T | T
T → T * F | T / F | F
F → ( E ) | id

Non-terminals: E, T, F
Start Symbol: E
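As an aside, the abbreviated grammar above can be written down directly as data; a minimal sketch, where the dict layout is just one possible encoding, not a standard:

# Each non-terminal maps to its list of alternatives (productions).
grammar = {
    'E': [['E', '+', 'T'], ['E', '-', 'T'], ['T']],
    'T': [['T', '*', 'F'], ['T', '/', 'F'], ['F']],
    'F': [['(', 'E', ')'], ['id']],
}
start_symbol = 'E'
non_terminals = set(grammar)                       # {'E', 'T', 'F'}
terminals = {sym for alts in grammar.values()
             for alt in alts for sym in alt} - non_terminals
print(sorted(terminals))    # ['(', ')', '*', '+', '-', '/', 'id']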


Notational Conventions

1. These symbols are terminals:
• Lowercase letters early in the alphabet, such as a, b, c
• Operator symbols such as +, *, and so on
• Punctuation symbols such as parentheses, comma, and so on
• The digits 0, 1, 2, …, 9
• Boldface strings such as id or if, each of which represents a single terminal symbol

2. These symbols are non-terminals:
• Uppercase letters early in the alphabet, such as A, B, C
• The letter S, which, when it appears, is usually the start symbol
• Lowercase, italic names such as expr or stmt
Notational Conventions – cont..

3. Uppercase letters late in the alphabet, such as X, Y, Z, represent grammar symbols, that is, either non-terminals or terminals

4. Lowercase letters late in the alphabet, such as u, v, …, z, represent strings of terminals

5. Lowercase Greek letters, such as α, β, γ, represent strings of grammar symbols. A generic production can be written as A → α

6. A set of productions A → α1, A → α2, A → α3 with a common head A (call them A-productions) may be written as A → α1 | α2 | α3, where α1, α2, α3 are the alternatives of A

7. Unless stated otherwise, the head of the first production is the start symbol
Derivations from a Grammar

• A derivation of a string from a grammar applies a sequence of productions that transforms the start symbol into the string

• A derivation proves that a string belongs to the language defined by a grammar

• A parse tree can be constructed with the help of a derivation

• A parse tree is a graphical representation of a derivation

• If at each step in a derivation, a production is applied to the leftmost non-terminal, then the derivation is called the leftmost derivation

• A derivation in which the rightmost non-terminal is replaced at each step is called the
rightmost derivation
Derivations – Example 1
Consider the grammar
S → ABC
A → aA | a
B → bB | b
C → cC | c

Derive the string aabbbcccc
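One possible leftmost derivation (each step expands the leftmost non-terminal):

S ⇒ ABC ⇒ aABC ⇒ aaBC ⇒ aabBC ⇒ aabbBC ⇒ aabbbC ⇒ aabbbcC ⇒ aabbbccC ⇒ aabbbcccC ⇒ aabbbcccc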


Derivations – Example 2
Consider the grammar

S → aB | bA

A → a | aS | bAA

B → b | bS | aBB

Write leftmost & rightmost derivations and draw the parse tree for the string aababb

Leftmost Derivation:
S ⇒ aB ⇒ aaBB ⇒ aabB ⇒ aabaBB ⇒ aababB ⇒ aababb

Rightmost Derivation:
S ⇒ aB ⇒ aaBB ⇒ aaBaBB ⇒ aaBaBb ⇒ aaBabb ⇒ aababb

[Parse tree figure omitted]
Derivations – cont..
We can derive the same string from the given grammar in another way

S → aB | bA

A → a | aS | bAA

B → b | bS | aBB

String : aababb

Leftmost Derivation:
S ⇒ aB ⇒ aaBB ⇒ aabSB ⇒ aabaBB ⇒ aababB ⇒ aababb

Rightmost Derivation:
S ⇒ aB ⇒ aaBB ⇒ aaBb ⇒ aabSb ⇒ aabaBb ⇒ aababb

[Parse tree figure omitted]
Ambiguous Grammar

• Every parse tree has associated with it a unique leftmost and rightmost derivation

• A grammar that produces more than one parse tree for some sentence is said to be ambiguous

• Put another way, an ambiguous grammar is one that produces more than one leftmost
derivation or more than one rightmost derivation for the same sentence
Consider the grammar E → E + E | E * E | ( E ) | id
Check whether the grammar is ambiguous or not
Let us consider the string id + id * id
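The string has two distinct leftmost derivations, each corresponding to a different parse tree:

E ⇒ E + E ⇒ id + E ⇒ id + E * E ⇒ id + id * E ⇒ id + id * id

E ⇒ E * E ⇒ E + E * E ⇒ id + E * E ⇒ id + id * E ⇒ id + id * id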

As we are able to draw two parse trees for the given string, the grammar is ambiguous
LEXICAL vs SYNTAX
ANALYSIS
Context Free Grammars Vs Regular Expressions

• Grammars are more powerful notation than regular expressions

• Every construct that can be described by a regular expression can be described by a grammar,
but not vice-versa

• Every regular language is a context-free language, but not vice versa


Lexical Vs Syntax Analysis
Everything that can be described by a regular expression can also be described by a grammar. We
may therefore ask “Why use regular expressions to define the lexical syntax of a language?”
There are several reasons
• Separating the syntactic structure of a language into lexical and non-lexical parts provides a
convenient way of modularizing the front end of a compiler into two manageable-sized
components
• The lexical rules of a language are frequently quite simple, and to describe them we do not
need a notation as powerful as grammars
• Regular expressions generally provide a more concise and easier-to-understand notation for
tokens than grammars
• More efficient lexical analyzers can be constructed automatically from regular expressions than
from arbitrary grammars
• Regular expressions are most useful for describing the structure of constructs such as identifiers,
constants, keywords, and white space. Grammars, on the other hand, are most useful for
describing nested structures such as balanced parentheses, matching begin-end’s, corresponding
if-then-else’s, and so on. These nested structures cannot be described by regular expressions
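A minimal sketch of this division of labour (the function names and token shapes are illustrative assumptions): a regular expression comfortably describes identifier tokens, while balanced parentheses need the recursive power of a grammar.

import re

# Lexical level: a regular expression easily describes flat token
# shapes such as identifiers.
IDENT = re.compile(r'[A-Za-z_][A-Za-z0-9_]*')
print(bool(IDENT.fullmatch('count_1')))   # True

# Syntactic level: balanced parentheses follow the grammar
#   S -> ( S ) S | epsilon
# a nested structure no regular expression can describe.
def balanced(s):
    def S(i):                 # returns index after a match of S, or None
        if i < len(s) and s[i] == '(':
            j = S(i + 1)      # the nested S
            if j is None or j >= len(s) or s[j] != ')':
                return None   # unmatched '('
            return S(j + 1)   # the trailing S
        return i              # epsilon
    return S(0) == len(s)

print(balanced('(()())'))     # True
print(balanced('(()'))        # False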
Lexical Analysis vs Syntax Analysis
• The lexical analyzer reads the stream of characters and produces the stream of tokens that the syntax analyzer consumes to build a parse tree
• Lexical analysis handles the flat structure of tokens (identifiers, constants, keywords, white space), described by regular expressions and recognized efficiently by finite automata
• Syntax analysis handles nested structures (balanced parentheses, matching begin-end's, corresponding if-then-else's), described by context-free grammars and recognized by pushdown automata
SYNTAX ERROR
HANDLING
Syntax Error Handling

• If a compiler had to process only correct programs, its design and implementation would be
simplified greatly

• However, a compiler is expected to assist the programmer in locating and tracking down errors
that inevitably creep into programs, despite the programmer’s best efforts

• Most programming language specifications do not describe how a compiler should respond to
errors; error handling is left to the compiler designer

• Planning the error handling right from the start can both simplify the structure of a compiler
and improve its handling of errors
Common Programming Errors

Common programming errors can occur at many different levels

• Lexical errors include misspellings of identifiers, keywords, or operators, and missing quotes
around text intended as a string

• Syntactic errors include misplaced semicolons or extra or missing braces, that is, { or }

Another example in C is the appearance of a case statement without an enclosing switch

• Semantic errors include type mismatches between operators and operands, e.g., the return of
a value from a function in C with return type void

• Logical errors can be anything from incorrect reasoning on the part of the programmer to the
use in a C program of the assignment operator = instead of the comparison operator ==. The
program containing = may be well formed; however, it may not reflect the programmer’s intent
Error Recovery during Parsing / Syntax Analysis

• The precision of parsing methods allows syntactic errors to be detected very efficiently

• Several parsing methods, such as the LL and LR methods, detect an error as soon as possible;
that is, when the stream of tokens from the lexical analyzer cannot be parsed further according
to the grammar for the language.

• They have the viable-prefix property, meaning that they detect that an error has occurred as
soon as they see a prefix of the input that cannot be completed to form a string in the language

• Error recovery is emphasized during parsing because many errors appear syntactic and are
exposed when parsing cannot continue. A few semantic errors such as type mismatches, can
also be detected efficiently; however, accurate detection of semantic and logical errors at
compile time is in general a difficult task
Goals of an Error Handler

The goals of an error handler in a parser are simple to state, but challenging to realize:

• Report the presence of errors clearly and accurately

• Recover from each error quickly enough to detect subsequent errors

• Add minimal overhead to the processing of correct programs

How should an error handler report the presence of an error?

• It must report the place in the source program where an error is detected, because there is a
good chance that the actual error occurred within the previous few tokens

• A common strategy is to print the offending line with a pointer to the position at which an error
is detected
Error Recovery Strategies

Once an error is detected, how should the parser recover?

• There is no universally acceptable strategy, but a few methods have broad applicability

• The simplest approach is for the parser to quit with an informative error message when it
detects the first error

• Additional errors are often uncovered if the parser can restore itself to a state where processing
of the input can continue with reasonable hopes that further processing will provide meaningful
diagnostic information

• If errors pile up, it is better for the compiler to give up after exceeding some error limit than to
produce an annoying avalanche of “spurious” errors
Error Recovery Strategies – cont..

Recovery strategies

• Panic-Mode Recovery

• Phrase-Level Recovery

• Error Productions

• Global Correction
Error Recovery Strategies – cont..
Panic-Mode Recovery

• On discovering an error, the parser discards input symbols one at a time until one of a
designated set of synchronizing tokens is found

• The synchronizing tokens are usually delimiters, such as semicolon or }, whose role in the
source program is clear and unambiguous

• The compiler designer must select the synchronizing tokens appropriate for the source language

• Panic-mode recovery often skips a considerable amount of input without checking it for
additional errors.

Advantages:

• Simplicity

• It is guaranteed not to go into an infinite loop
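A minimal sketch of the skipping step (the token list and the synchronizing set are illustrative assumptions):

# Panic-mode recovery sketch: on an error, discard tokens until a
# synchronizing token (here ';' or '}') is found, then resume parsing.
SYNC_TOKENS = {';', '}'}

def recover(tokens, pos):
    """Skip input from position pos up to and including the next
    synchronizing token; return the position where parsing resumes."""
    while pos < len(tokens) and tokens[pos] not in SYNC_TOKENS:
        pos += 1                       # discard one offending token
    return min(pos + 1, len(tokens))   # step past the synchronizing token

tokens = ['x', '=', '@', '@', ';', 'y', '=', '1', ';']
print(recover(tokens, 2))              # 5 -> parsing resumes at 'y'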


Error Recovery Strategies – cont..
Phrase-Level Recovery
• On discovering an error, the parser may perform local correction on the remaining input
• It may replace a prefix of the remaining input by some string that allows the parser to continue
• Examples for local correction
• Replace a comma by a semicolon
• Delete an extraneous semicolon
• Insert a missing semicolon
• The choice of the local correction is left to the compiler designer
• We must be careful to choose replacements that do not lead to infinite loops
• It is used in several error-repairing compilers, as it can correct any input string

Drawback
• Difficulty in coping with situations in which the actual error has occurred before the point of
detection
Error Recovery Strategies – cont..

Error Productions

• By anticipating common errors that might be encountered, we can augment the grammar for
the language at hand with productions that generate the erroneous constructs

• A parser constructed from a grammar augmented by these error productions detects the
anticipated errors when an error production is used during parsing

• The parser can generate appropriate error diagnostics about the erroneous construct that has
been recognized in the input
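A minimal sketch of the idea (the grammar, token names, and the diagnostic message are assumptions of this illustration): the expression grammar is augmented with an alternative that matches the anticipated error of a missing operator, and the parser emits a diagnostic whenever that alternative is used.

# Error-production sketch: the grammar E -> T { '+' T } is augmented with
# the anticipated erroneous construct "T T" (a missing operator), and a
# diagnostic is attached to that production.
def parse_expr(tokens):
    pos = 0
    diagnostics = []

    def peek():
        return tokens[pos] if pos < len(tokens) else '$'

    def term():                         # T -> 'id'
        nonlocal pos
        if peek() != 'id':
            raise SyntaxError(f"expected 'id', found {peek()!r}")
        pos += 1

    def expr():                         # E -> T { '+' T | T }
        nonlocal pos
        term()
        while True:
            if peek() == '+':
                pos += 1
                term()
            elif peek() == 'id':        # error production used
                diagnostics.append(f"missing operator before token {pos}")
                term()
            else:
                break

    expr()
    return diagnostics

print(parse_expr(['id', 'id', '+', 'id']))
# ['missing operator before token 1']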
Error Recovery Strategies – cont..
Global Correction

• Ideally, we would like a compiler to make as few changes as possible in processing an incorrect
input string

• There are algorithms for choosing a minimal sequence of changes to obtain a globally least-cost
correction

• Given an incorrect input string x and grammar G, these algorithms will find a parse tree for a
related string y, such that the number of insertions, deletions, and changes of tokens required
to transform x into y is as small as possible

• Unfortunately, these methods are in general too costly to implement in terms of time and space

• So these techniques are currently only of theoretical interest

• Note: A closest correct program may not be what the programmer had in mind
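The cost being minimized is, in essence, a token-level edit distance. A minimal sketch of that cost measure, using the standard dynamic-programming formulation (the token sequences are made-up examples):

# Token-level edit distance: the minimum number of insertions,
# deletions, and changes needed to turn one token sequence into another.
def edit_distance(x, y):
    d = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i in range(len(x) + 1):
        d[i][0] = i                     # delete all of x[:i]
    for j in range(len(y) + 1):
        d[0][j] = j                     # insert all of y[:j]
    for i in range(1, len(x) + 1):
        for j in range(1, len(y) + 1):
            cost = 0 if x[i-1] == y[j-1] else 1   # change (or match)
            d[i][j] = min(d[i-1][j] + 1,          # deletion
                          d[i][j-1] + 1,          # insertion
                          d[i-1][j-1] + cost)     # change / match
    return d[len(x)][len(y)]

# Incorrect input x vs. a related correct string y:
x = ['id', '+', '+', 'id']
y = ['id', '+', 'id']
print(edit_distance(x, y))              # 1 (delete one extra '+')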
