
UNIT - III

GRAMMAR FORMALISM

• In the previous chapters we discussed finite automata (DFAs, NFAs and NFA-Λs).
We will later look at other “machines” that model computation. Before we
do that, we examine two methods of generating languages: regular
expressions and grammars. We have already discussed regular expressions;
now we examine several kinds of grammars.
• We are all familiar with natural-language grammars. For example, a
sentence with a transitive verb has the “creation” rule:
<sentence> → <subject> <predicate>
Then there are other rules that we apply before getting to the actual
words:
<subject> → <article> <adjective> <noun>
<predicate> → <verb> <object>
<object> → <article> <noun>
• In grammar theory, we call <sentence>, <subject>, <predicate>, <verb>,
<object>, <article>, <adjective>, <noun>, etc., variables or non-terminals.
The variable <sentence> is special: it is the “start variable”, i.e. the point
from which we start constructing a sentence.
• Further rules allow us to replace variables by actual
dictionary words, which we call terminals:
<noun> → dog, <noun> → cat, <verb> → chased, etc.
We sometimes diagram sentences using these substitution rules, and
thus build a parse tree, with the start variable at the root.

Definition of Grammar:
A phrase-structure grammar (or simply a grammar) is a 4-tuple (V, T, P, S), where
(i) V is a finite nonempty set whose elements are called variables,
(ii) T is a finite nonempty set whose elements are called terminals,
Computer Science & Engineering Formal Languages and Automata Theory

(iii) S is a special variable (i.e. an element of V) called the start symbol,
and
(iv) P is a finite set whose elements are of the form α → β, where α and β are
strings over V ∪ T and α has at least one symbol from V. The elements of P are
called productions, production rules or rewriting rules.
1. Regular Grammar:
• A grammar G = (V, T, P, S) is said to be a regular grammar if it is either a
right-linear grammar or a left-linear grammar, i.e. all of its productions have
the form A → xB | x (right-linear) or all of them have the form A → Bx | x
(left-linear).
• There are two types of regular grammars:
i. Right-Linear Grammar (RLG)
ii. Left-Linear Grammar (LLG)
1.1 Right-Linear Grammar:
• A grammar G = (V, T, P, S) is said to be right-linear if all productions are of the
form
A → xB or A → x,
where A, B ∈ V and x ∈ T*.

1.2 Left-Linear Grammar:

• A grammar G = (V, T, P, S) is said to be left-linear if all productions
are of the form
A → Bx or A → x,
where A, B ∈ V and x ∈ T*.

Example:
1. Construct a regular grammar for the regular expression r = 0(10)*.
Sol: Right-Linear Grammar:
S → 0A
A → 10A | ε

Left-Linear Grammar:
S → S10 | 0
2. Context-Free Grammar:
A context-free grammar (CFG, or just grammar) is defined formally as
G = (V, T, P, S)
where
V: a finite set of variables (“non-terminals”), e.g., A, B, C, …
T: a finite set of symbols (“terminals”), e.g., a, b, c, …
P: a set of production rules of the form A → α, where A ∈ V and α ∈ (V ∪ T)*
S: a start non-terminal; S ∈ V

That is, a context-free grammar consists of a set of productions of the form A → α,
where A is a single non-terminal symbol and α is a potentially mixed
sequence of terminal and non-terminal symbols.

Eg: E → E+E
E → E*E
E → (E)
E → id
In the above example, the components of the grammar are:
G = ({E}, {+, *, (, ), id}, {E → E+E, E → E*E, E → (E), E → id}, E).
In this chapter we use the following conventions regarding grammars.
1) The capital letters A, B, C, D, E and S denote variables; S is the start
symbol unless otherwise stated.
2) The lower-case letters a, b, c, d, e, digits, special symbols and boldface strings
are terminals.
3) The capital letters X, Y and Z denote symbols that may be either
terminals or variables.
4) The lower-case letters u, v, w, x, y and z denote strings of terminals.
5) The lower-case Greek letters α, β and γ denote strings of variables and
terminals.


Generally we specify a grammar by listing its productions.

If A → α1, A → α2, A → α3, …, A → αk are the productions for A, then we may
express them by
A → α1 | α2 | α3 | … | αk

Context Free Language:

• If G is a CFG, then L(G), the language of G, is {w ∈ T* | S ⇒* w}.

Note: w must be a terminal string, and S is the start symbol.

Examples:
1. Construct a CFG to generate the set of palindromes over the alphabet {a,b}.

Solution: The productions of a grammar that generates the even-length palindromes over {a,b} are

S → aSa | bSb | ε

Hence S ⇒ aSa ⇒ abSba ⇒ abba

This is an even palindrome.

The productions that generate the odd-length palindromes are

S → aSa | bSb | a | b

Hence S ⇒ aSa ⇒ abSba ⇒ ababa

This is an odd palindrome. Combining the two sets of productions,
S → aSa | bSb | a | b | ε generates all palindromes over {a,b}.

2. Design a CFG for the language L(G) = { a^i b^i | i ≥ 0 }.

Solution: L = {ε, ab, aabb, aaabbb, …}

S → aSb | ε

3. Design a CFG for the language L(G) = { a^i b^i | i > 0 }.

Solution: L = {ab, aabb, aaabbb, …}

S → aSb | ab

4. Design a CFG for the language L(G) = { w w^R | w is a binary string }, where w^R is the reverse of w.

Solution: L = {ε, 00, 11, 0110, 1001, 010010, …}

S → 0S0 | 1S1 | ε

5. Design a CFG for the language L(G) = { w # w^R | w is a binary string }.

Solution: L = {#, 0#0, 1#1, 01#10, 10#01, 010#010, …}

S → 0S0 | 1S1 | #

6. Design a CFG for the regular expression r = (a+b)*.

Solution: L = {ε, a, b, aa, ab, ba, bb, aaa, aab, bbb, bba, …}

S → aS | bS | ε

7. Give a CFG for the set of all well-formed parentheses.

Solution: S → SS | (S) | ( )
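The grammars in these examples can be checked mechanically by enumerating the short strings they generate. The sketch below does this with a breadth-first search over sentential forms, always rewriting the leftmost variable. The function name `language`, the single-character symbol encoding and the `slack` bound on the length of sentential forms are assumptions of this illustration; the bound is sufficient for the three grammars shown but is not a general-purpose choice.

```python
from collections import deque

def language(productions, start, max_len, slack=2):
    """Terminal strings of length <= max_len derivable from start.

    productions maps a variable (one character) to a list of bodies;
    "" stands for ε. Sentential forms longer than max_len + slack are
    discarded (an assumption that suffices for the grammars below).
    """
    seen, words = {start}, set()
    queue = deque([start])
    while queue:
        form = queue.popleft()
        pos = next((i for i, s in enumerate(form) if s in productions), None)
        if pos is None:                      # no variable left: a terminal string
            if len(form) <= max_len:
                words.add(form)
            continue
        for body in productions[form[pos]]:  # rewrite the leftmost variable
            new = form[:pos] + body + form[pos + 1:]
            if len(new) <= max_len + slack and new not in seen:
                seen.add(new)
                queue.append(new)
    return words

anbn = language({"S": ["aSb", ""]}, "S", 6)                   # Example 2
parens = language({"S": ["SS", "(S)", "()"]}, "S", 4)         # Example 7
pals = language({"S": ["aSa", "bSb", "a", "b", ""]}, "S", 3)  # Example 1

print(sorted(anbn, key=len))    # ['', 'ab', 'aabb', 'aaabbb']
print(sorted(parens))           # ['(())', '()', '()()']
```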

2.1 DERIVATION

• We derive strings in the language of a CFG by starting with the start
symbol, and repeatedly replacing some variable A by the right side of one
of its productions.
• That is, the “productions for A” are those that have A on the left side of
the →.
• αAβ ⇒ αγβ whenever there is a production A → γ.
• Subscript the ⇒ with the name of the grammar, e.g. ⇒G, if necessary.
Example (with the productions S → AS | ε and A → 0A1 | A1 | 01): 011AS ⇒ 0110A1S

• α ⇒* β means that the string α can become β in zero or more derivation steps.
Example: 011AS ⇒* 011AS (zero steps);
011AS ⇒* 0110A1S (one step);
011AS ⇒* 0110011 (three steps).
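The one-step relation ⇒ and the step counts in the example can be reproduced with a few lines of code. This is a minimal sketch assuming the productions S → AS | ε and A → 0A1 | A1 | 01 and a one-character-per-symbol encoding; the names `one_step` and `derives_in` are introduced here for illustration.

```python
PRODUCTIONS = {"S": ["AS", ""], "A": ["0A1", "A1", "01"]}

def one_step(form):
    """All sentential forms reachable from `form` in exactly one step."""
    result = set()
    for i, sym in enumerate(form):
        for body in PRODUCTIONS.get(sym, []):
            # Replace the variable at position i by the body of one production.
            result.add(form[:i] + body + form[i + 1:])
    return result

def derives_in(form, target, steps):
    """True if form derives target using exactly `steps` derivation steps."""
    if steps == 0:
        return form == target
    return any(derives_in(nxt, target, steps - 1) for nxt in one_step(form))

print("0110A1S" in one_step("011AS"))        # True  (one step, using A -> 0A1)
print(derives_in("011AS", "0110011", 3))     # True  (three steps)
```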
Sentential Forms:
• Any string of variables and/or terminals derived from the start symbol is
called a sentential form.
• Formally, α is a sentential form iff S ⇒* α.

Types of Derivations:

We have a choice of variable to replace at each step.

o Derivations may appear different only because we make the
same replacements in a different order.
o To avoid such differences, we may restrict the choice.
1. Left Most Derivation (LMD): If at each step in a derivation a production
is applied to the leftmost variable, then the derivation is called a leftmost
derivation.
2. Right Most Derivation (RMD): If at each step in a derivation a
production is applied to the rightmost variable, then the derivation is
called a rightmost derivation.

The symbol ⇒ with subscript lm or rm is used to indicate that a derivation is leftmost or rightmost, respectively.

Derivation/Parse Trees:

Let G = (V, T, P, S) be a grammar with variables V, terminal symbols T, set of
productions P and start symbol S ∈ V.
A derivation tree for G is a tree in which
1) each vertex is labelled by a variable, a terminal or ε;
2) the root vertex is labelled S;
3) interior vertices are labelled by variables from V, and leaf vertices by
terminals from T or by ε;
4) an interior vertex A has children, in order from left to right, X1, X2, …, Xk
only when there is a production in P of the form A → X1 X2 … Xk;
5) a leaf can be ε only when there is a production A → ε,
and in that case the leaf is the only child of its parent.


Example 1: Construct the parse tree for the input string 0110011 using the following CFG.
S → AS | ε
A → 0A1 | A1 | 01

Sol: Before constructing the parse tree, first derive the given input string from the
CFG.

LMD: S ⇒ AS ⇒ A1S ⇒ 011S ⇒ 011AS ⇒ 0110A1S ⇒ 0110011S ⇒ 0110011.

RMD: S ⇒ AS ⇒ AAS ⇒ AA ⇒ A0A1 ⇒ A0011 ⇒ A10011 ⇒ 0110011.

Parse tree (each vertex is followed by its children in parentheses):
S( A( A(0 1) 1 ) S( A( 0 A(0 1) 1 ) S(ε) ) )
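The parse tree that corresponds to the derivations of 0110011 can also be written down as nested data and checked against the definition of a derivation tree. In this sketch a node is a pair (symbol, children); the names `leaf`, `TREE`, `tree_yield` and `is_valid` are illustrative assumptions, and the grammar is S → AS | ε, A → 0A1 | A1 | 01.

```python
PRODUCTIONS = {
    "S": [["A", "S"], ["ε"]],
    "A": [["0", "A", "1"], ["A", "1"], ["0", "1"]],
}

def leaf(sym):
    """A leaf is a node without children."""
    return (sym, [])

TREE = ("S", [
    ("A", [("A", [leaf("0"), leaf("1")]), leaf("1")]),   # A -> A1, then A -> 01
    ("S", [
        ("A", [leaf("0"), ("A", [leaf("0"), leaf("1")]), leaf("1")]),  # A -> 0A1
        ("S", [leaf("ε")]),                                            # S -> ε
    ]),
])

def tree_yield(node):
    """Concatenate the leaf labels from left to right, dropping ε."""
    sym, children = node
    if not children:
        return "" if sym == "ε" else sym
    return "".join(tree_yield(child) for child in children)

def is_valid(node):
    """Every interior vertex with its children must match a production."""
    sym, children = node
    if not children:
        return sym not in PRODUCTIONS        # leaves are terminals or ε
    labels = [child[0] for child in children]
    return labels in PRODUCTIONS.get(sym, []) and all(is_valid(c) for c in children)

print(tree_yield(TREE))   # 0110011
print(is_valid(TREE))     # True
```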

2.2 Ambiguous Grammars:

A CFG is ambiguous if at least one terminal string has more than one leftmost
derivation from the start symbol or, equivalently, more than one rightmost
derivation or more than one parse tree.

Example 1: Consider the following grammar:

S → AS | ε
A → 0A1 | A1 | 01

In this CFG, the string 00111 has the following two leftmost derivations
from S.

Sol:

LMD 1: S ⇒ AS ⇒ 0A1S ⇒ 0A11S ⇒ 00111S ⇒ 00111

LMD 2: S ⇒ AS ⇒ A1S ⇒ 0A11S ⇒ 00111S ⇒ 00111

Intuitively, we can use A → A1 first or second to generate the extra 1.
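The claim that 00111 has exactly two leftmost derivations can be checked by exhaustive search. The sketch below counts leftmost derivations for the grammar S → AS | ε, A → 0A1 | A1 | 01; the table `MIN_LEN` (shortest terminal string derivable from each variable) is worked out by hand for this grammar and is an assumption used only to prune the search.

```python
PRODUCTIONS = {"S": ["AS", ""], "A": ["0A1", "A1", "01"]}
MIN_LEN = {"S": 0, "A": 2}   # shortest terminal string each variable derives

def count_leftmost(form, target):
    """Number of leftmost derivations of `target` starting from `form`."""
    pos = next((i for i, s in enumerate(form) if s in PRODUCTIONS), None)
    if pos is None:                              # only terminals are left
        return 1 if form == target else 0
    # Prune: the terminal prefix must match the target ...
    if not target.startswith(form[:pos]):
        return 0
    # ... and the form must still be able to fit into the target length.
    if sum(MIN_LEN.get(s, 1) for s in form) > len(target):
        return 0
    return sum(count_leftmost(form[:pos] + body + form[pos + 1:], target)
               for body in PRODUCTIONS[form[pos]])

print(count_leftmost("S", "00111"))   # 2, so the grammar is ambiguous
print(count_leftmost("S", "01"))      # 1
```

A count greater than 1 for any single string is enough to show ambiguity; a count of 1 for a particular string says nothing about the grammar as a whole.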


Example 2:

Consider the following grammar:

S → SS
S → aSb
S → bSa
S → ε
and the string w = aabb. We can draw two different parse trees with the same
yield w = aabb, so the grammar is ambiguous.
In general, if we can find two leftmost derivations, two rightmost derivations or
two different derivation trees for the same string, the grammar is ambiguous.

Tree 1 (root production S → SS): S ⇒ SS ⇒ aSbS ⇒ aaSbbS ⇒ aabbS ⇒ aabb
Tree 2 (root production S → aSb): S ⇒ aSb ⇒ aaSbb ⇒ aabb

2.3 Inherently Ambiguous context free language:

• A context free language for which every grammar is ambiguous, i.e. for which
we cannot construct an unambiguous grammar, is called an inherently ambiguous CFL.

Example:

L = {a^n b^n c^m d^m | n ≥ 1, m ≥ 1} ∪ {a^n b^m c^m d^n | n ≥ 1, m ≥ 1}

• An operator grammar is a CFG with no ε-productions such that no two
consecutive symbols on the right sides of productions are variables.

• Every CFL without ε has an operator grammar.

• If all productions of a CFG are of the form A → xB or A → x, where x is a
terminal string, then L(G) is a regular set.

3. SIMPLIFICATION OF CFG

• In a CFG we may not use all the symbols for deriving a sentence. So we
eliminate the symbols and productions in G that are not useful.

• We can “simplify” grammars to a great extent. Some of the things we can
do are (in this order of simplification):

1. Elimination of ε-productions: those of the form variable → ε.
2. Elimination of unit productions: those of the form variable → variable.
3. Elimination of useless symbols: those that do not participate in any
derivation of a terminal string.

3.1 Eliminating ε-productions:

• A variable A is nullable if A ⇒* ε. Find the nullable variables by a recursive algorithm.
Basis: If A → ε is a production, then A is nullable.
Induction: If A is the head of a production whose body consists of only
nullable symbols, then A is nullable.
• Once we have the nullable symbols, we can add additional productions
and then throw away the productions of the form A → ε for any A. The
resulting grammar generates L(G) − {ε}.
• If A → X1 X2 … Xk is a production, add all productions that can be
formed by eliminating some or all of those Xi's that are nullable.
o But don't eliminate all k if they are all nullable.
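The basis and induction steps for nullable variables amount to a fixed-point computation. The following is a minimal sketch, assuming a grammar is given as a dictionary from variables to lists of bodies, where each body is a list of symbols and the empty list stands for ε; the name `nullable_variables` is chosen for this illustration.

```python
def nullable_variables(productions):
    """Return the set of variables A with A =>* ε."""
    nullable = set()
    changed = True
    while changed:                      # repeat until no new variable is found
        changed = False
        for head, bodies in productions.items():
            if head in nullable:
                continue
            # all() over an empty body is True, which covers the basis A -> ε;
            # otherwise every symbol of the body must already be nullable.
            if any(all(sym in nullable for sym in body) for body in bodies):
                nullable.add(head)
                changed = True
    return nullable

# S -> AaB | aaB, A -> ε, B -> bbA | ε
G = {
    "S": [["A", "a", "B"], ["a", "a", "B"]],
    "A": [[]],
    "B": [["b", "b", "A"], []],
}
print(sorted(nullable_variables(G)))   # ['A', 'B']
```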


Examples:

1. Eliminate the ε-productions from the grammar G:
S → aS | bA
A → aA | ε
Solution:
The only nullable variable is A.
S → aS is unchanged; S → bA gives S → bA and S → b.
A → aA gives A → aA and A → a.
After elimination of ε-productions, the final grammar is
S → aS | bA | b
A → aA | a
2. Eliminate the ε-productions from the grammar G below and
then eliminate useless symbols:
S → AaB | aaB
A → ε
B → bbA | ε
Solution:
The given grammar is

S → AaB | aaB ..... (1)
A → ε ..... (2)
B → bbA | ε ..... (3)
Step 1: Elimination of ε-productions
The given grammar contains two ε-productions, i.e. A → ε and B → ε, so A and B are nullable.
(1) gives S → AaB | aaB | aB | Aa | a | aa {since A ⇒* ε and B ⇒* ε}
(3) gives B → bbA | bb
The grammar is
S → AaB | aaB | aB | Aa | a | aa
B → bbA | bb
Step 2: Elimination of useless symbols from the above grammar
S → AaB | aaB | aB | Aa | a | aa
B → bbA | bb
The variable A appears in this grammar, but it no longer has any productions,
so it cannot derive a terminal string. Every production containing A can
therefore be eliminated.
The remaining productions are

S → aaB | aB | a | aa
B → bb
In this grammar no symbol is useless, so the final productions are
S → aaB | aB | a | aa
B → bb
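Step 1 of both examples can be reproduced with a short routine. This is a sketch under the assumption that every symbol is a single character, upper-case letters are variables and "" stands for ε; the function name `eliminate_epsilon` is introduced here and is not part of the original notes.

```python
from itertools import product

def eliminate_epsilon(productions):
    """Return an ε-free grammar (variable -> set of bodies)."""
    # Nullable variables by fixed-point iteration.
    nullable, changed = set(), True
    while changed:
        changed = False
        for head, bodies in productions.items():
            if head not in nullable and any(
                    all(s in nullable for s in body) for body in bodies):
                nullable.add(head)
                changed = True
    result = {}
    for head, bodies in productions.items():
        new_bodies = set()
        for body in bodies:
            # A nullable symbol may be kept or dropped; others are always kept.
            choices = [(s, "") if s in nullable else (s,) for s in body]
            for pick in product(*choices):
                candidate = "".join(pick)
                if candidate:               # never add an ε-production
                    new_bodies.add(candidate)
        result[head] = new_bodies
    return result

ex1 = eliminate_epsilon({"S": ["aS", "bA"], "A": ["aA", ""]})
ex2 = eliminate_epsilon({"S": ["AaB", "aaB"], "A": [""], "B": ["bbA", ""]})
print(sorted(ex1["S"]), sorted(ex1["A"]))   # ['aS', 'b', 'bA'] ['a', 'aA']
```

In the second example the variable A ends up with an empty set of bodies, which matches the observation in Step 2 that A no longer produces anything.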

3.2 Eliminating Unit productions:

• A production of the form A → B, where A, B ∈ V, is called a unit
production.
• First eliminate useless symbols and ε-productions.
• Discover those pairs of variables (A, B) such that A ⇒* B.
o Because there are no ε-productions, this derivation can only use
unit productions.
o Thus, we can find the pairs by computing reachability in a graph
where nodes = variables, and arcs = unit productions.
• Replace each combination where A ⇒* B ⇒ α and α is other than a single
variable by A → α.
o i.e., “short circuit” sequences of unit productions, which must
eventually be followed by some other kind of production. Then remove
all unit productions.
Note: Consider the grammar G with productions S → A, A → B, B → C, C → d.

Here S → A, A → B and B → C form a chain of unit productions. The resultant
grammar is S → d. This is called the chain rule.


Example:
1. Eliminate unit productions from the following grammar.
S → A | bb
A → B | b
B → S | a
Solution:
In the given grammar, the unit productions are S → A, A → B and B → S.
S → A gives S → b.
S → A → B gives S → B, which gives S → a.
A → B gives A → a.
A → B → S gives A → S, which gives A → bb.
B → S gives B → bb.
B → S → A gives B → A, which gives B → b.
The new productions are
S → bb | b | a
A → b | a | bb
B → a | bb | b
This grammar has no unit productions. To get the reduced CFG, we have to
eliminate the useless symbols: A and B are no longer reachable from S, so we
can eliminate the A and B productions.
The resultant grammar is S → bb | b | a.
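The reachability idea behind unit-production elimination can be sketched in code. The encoding below (bodies as strings, a unit production being a body that is exactly one variable name) and the name `eliminate_unit` are assumptions of this illustration; the grammar is assumed to have no ε-productions.

```python
def eliminate_unit(productions):
    """Return variable -> set of non-unit bodies after short-circuiting."""
    variables = set(productions)
    # reach[A] = all B with A =>* B using unit productions only.
    reach = {a: {a} for a in variables}
    changed = True
    while changed:
        changed = False
        for a in variables:
            for b in list(reach[a]):
                for body in productions[b]:
                    if body in variables and body not in reach[a]:
                        reach[a].add(body)
                        changed = True
    # A gets every non-unit body of every variable it can reach.
    return {a: {body for b in reach[a] for body in productions[b]
                if body not in variables}
            for a in variables}

result = eliminate_unit({"S": ["A", "bb"], "A": ["B", "b"], "B": ["S", "a"]})
print(sorted(result["S"]))   # ['a', 'b', 'bb']

chain = eliminate_unit({"S": ["A"], "A": ["B"], "B": ["C"], "C": ["d"]})
print(chain["S"])            # {'d'}
```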

3.3 Eliminating Useless symbols:

• In order for a symbol X to be useful, it must:
1. Derive some terminal string (possibly X is a terminal).
2. Be reachable from the start symbol; i.e., S ⇒* αXβ for some strings α and β.

• Note that X wouldn't really be useful if α or β included a symbol that
didn't satisfy (1), so it is important that (1) be tested first, and that symbols
which don't derive terminal strings be eliminated before testing (2).


Examples:
1. Eliminate useless symbols from the grammar
S → AB | a
A → a
Solution:
By rule 1, no terminal string is derivable from B, so B and the
production S → AB are eliminated.
The remaining productions are
S → a
A → a
By rule 2, A is not reachable from the start symbol S,
so we can eliminate A → a.
The final production is
S → a
2. Eliminate useless symbols from the grammar
S → aS | A | C
A → a
B → aa
C → aCb
Solution:
By rule 1, C does not derive any terminal string, so we can eliminate the
productions S → C and C → aCb.
The remaining productions are
S → aS | A
A → a
B → aa
By rule 2, B is not reachable from the start symbol S, so we can
eliminate B → aa.
The final productions are

S → aS | A
A → a

6. NORMAL FORMS

 In a Context Free Grammar, the right hand side of the production can be
any string of variables and terminals. When productions in G satisfy certain
restrictions, then G is said to be in a Normal Form.
 There are two widely useful Normal forms of CFG. They are
i. Chomsky Normal Form (CNF)
ii. Greibach Normal Form ( GNF )

6.1 Chomsky Normal Form (CNF):

Definition: A context-free grammar G is in Chomsky normal form if every
production is of the form

A → BC or
A → a
where a is a terminal, A, B and C are non-terminals, and B and C may not be the start
variable (the axiom).

Note:
1. In CNF the number of symbols on the right side of a production is strictly
limited: exactly two variables or exactly one terminal.
2. The rule S → ε, where S is the start variable, is not excluded from a CFG
in Chomsky normal form.
Conversion to Chomsky normal form:
Theorem: For every CFG, there is an equivalent grammar G in Chomsky
Normal Form.
Proof:
Construction of a grammar in CNF.

Step 1:
Eliminate null productions and unit productions.
Step 2:
Eliminate terminals on the right-hand side of productions as follows.
i. All the productions in P of the form A → a and A → BC are
included unchanged.
ii. Consider A → w1w2…wn with some terminal on the right-hand side. If wi
is a terminal, say ai, add a new variable C_ai and the production C_ai → ai
to P, and replace wi by C_ai. Repeat the same for all terminals.
Step 3:
Restrict the number of variables on the RHS as follows:
i. All the productions in P that are already in the required
form are kept.
ii. Consider A → A1A2A3 … Am with m ≥ 3. We introduce new variables
C1, …, Cm-2 and replace the production by
A → A1C1
C1 → A2C2
C2 → A3C3
…
Cm-2 → Am-1Am

Example:
Convert the following CFG to Chomsky Normal Form (CNF):
S → aX | Yb
X → S | ε
Y → bY | b
Solution:
Step 1 - Eliminate all ε-productions:
By inspection, the only nullable non-terminal is X.
Delete all ε-productions and add new productions, with all possible
combinations of the nullable X removed.
The new CFG, without ε-productions, is:
S → aX | a | Yb
X → S
Y → bY | b
Step 2 - Eliminate all unit productions:
The only unit production is X → S, where the S can be replaced with all of S’s
non-unit productions (i.e. aX, a, and Yb).
The new CFG, without unit productions, is:
S → aX | a | Yb
X → aX | a | Yb
Y → bY | b
Step 3 - Replace all mixed strings with strings consisting solely of non-terminals.
Create extra productions that produce one terminal when doing the
replacement.
The new CFG, in which every RHS consists of only non-terminals or of one
terminal, is:
S → AX | YB | a
X → AX | YB | a
Y → BY | b
A → a
B → b
Step 4 - Shorten the strings of non-terminals to length 2.
All non-terminal strings on the RHS in the above CFG are already of the required
length, so the CFG is in CNF.
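The last two steps of the conversion (replacing terminals in long bodies and splitting long bodies into binary rules) can be sketched as below. The input is assumed to be free of ε-productions and unit productions already. The new variable names `C_<terminal>` and `D<number>` are hypothetical choices of this sketch; in the worked example they correspond to A, B and to the chain variables C1, C2, …

```python
def to_cnf_bodies(productions):
    """Apply the terminal-replacement and shortening steps.

    Bodies are lists of symbols; variables are the keys of `productions`.
    Assumes there are no ε-productions and no unit productions.
    """
    variables = set(productions)
    result = {head: [] for head in productions}
    term_var = {}       # terminal -> hypothetical new variable "C_<terminal>"
    counter = 0
    for head, bodies in productions.items():
        for body in bodies:
            if len(body) == 1:              # A -> a is already in CNF
                result[head].append(list(body))
                continue
            # Replace every terminal in a long body by its own variable.
            symbols = []
            for s in body:
                if s not in variables:
                    if s not in term_var:
                        term_var[s] = "C_" + s
                        result[term_var[s]] = [[s]]
                    s = term_var[s]
                symbols.append(s)
            # Split A -> A1 A2 ... Am into a chain of binary rules.
            current = head
            while len(symbols) > 2:
                counter += 1
                fresh = "D" + str(counter)
                result.setdefault(current, []).append([symbols[0], fresh])
                current, symbols = fresh, symbols[1:]
            result.setdefault(current, []).append(symbols)
    return result

G = {
    "S": [["a", "X"], ["a"], ["Y", "b"]],
    "X": [["a", "X"], ["a"], ["Y", "b"]],
    "Y": [["b", "Y"], ["b"]],
}
cnf = to_cnf_bodies(G)
print(cnf["S"])   # [['C_a', 'X'], ['a'], ['Y', 'C_b']]
```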

6.2 Greibach Normal Form (GNF):

A CFG G = (V, T, P, S) is said to be in GNF if every production is of the form
A → aα, where a ∈ T and α ∈ V*, i.e., α is a string of zero or more variables.
Definition: A production in P is said to be left recursive if it is of the form
A → Aα for some A ∈ V.


Left recursion in P can be eliminated by the following scheme:

• If A → Aα1 | Aα2 | … | Aαr | β1 | β2 | … | βs, where no βi begins with A, then
introduce a new variable Z and replace the above rules by
(i) Z → αi | αiZ, 1 ≤ i ≤ r, and
(ii) A → βi | βiZ, 1 ≤ i ≤ s.
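The left-recursion scheme translates directly into code. This is a small sketch assuming bodies are lists of symbols; the function name `remove_left_recursion` and the default name "Z" for the new variable are illustrative choices.

```python
def remove_left_recursion(head, bodies, new_var="Z"):
    """Return (bodies for head, bodies for new_var) after the transformation."""
    alphas = [b[1:] for b in bodies if b and b[0] == head]   # A -> A αi
    betas = [b for b in bodies if not b or b[0] != head]     # A -> βi
    if not alphas:
        return bodies, []            # nothing to do: no left recursion
    head_bodies = betas + [b + [new_var] for b in betas]     # A -> βi | βi Z
    z_bodies = alphas + [a + [new_var] for a in alphas]      # Z -> αi | αi Z
    return head_bodies, z_bodies

# A4 -> b A3 A4 | A4 A4 A4 | b
a4, z = remove_left_recursion(
    "A4", [["b", "A3", "A4"], ["A4", "A4", "A4"], ["b"]])
print(a4)  # [['b', 'A3', 'A4'], ['b'], ['b', 'A3', 'A4', 'Z'], ['b', 'Z']]
print(z)   # [['A4', 'A4'], ['A4', 'A4', 'Z']]
```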

• If G = (V, T, P, S) is a CFG, then we can construct another CFG G1 = (V1, T, P1, S)
in Greibach Normal Form (GNF) such that L(G1) = L(G) − {ε}.
The stepwise algorithm is as follows:
1. Eliminate null productions, unit productions and useless symbols from
the grammar G and then construct a grammar G’ = (V’, T, P’, S) in Chomsky
Normal Form (CNF) generating the language L(G’) = L(G) − {ε}.
2. Rename the variables as A1, A2, …, An, starting with S = A1.
3. Modify the rules in P’ so that if Ai → Ajγ ∈ P’ then j > i.
4. Starting with A1 and proceeding to An, this is done as follows:
(a) Assume that the productions have been modified so that for 1 ≤ i < k,
Ai → Ajγ ∈ P’ only if j > i.
(b) If Ak → Ajγ is a production with j < k, generate a new set of
productions by substituting for Aj the body of each Aj
production.
(c) Repeating (b) at most k − 1 times, we obtain rules of the form
Ak → Apγ with p ≥ k.
(d) Replace the rules Ak → Akγ by removing left recursion as stated
above.
5. Modify the rules Ai → Ajγ for i = n−1, n−2, …, 1 into the desired form, and at
the same time change the Z production rules.

Example: Convert the following grammar G into Greibach Normal Form (GNF).
S → XA | BB
B → b | SB
X → b
A → a
Solution:
To convert the above grammar G into GNF, we follow these steps:
1. Rewrite G in Chomsky Normal Form (CNF).
It is already in CNF.
2. Re-label the variables:
S with A1
X with A2
A with A3
B with A4
After re-labeling, the grammar looks like:

A1 → A2A3 | A4A4
A4 → b | A1A4
A2 → b
A3 → a
3. Identify all productions that do not conform to any of the types listed
below:
Ai → Aj xk such that j > i
Zi → Aj xk such that j ≤ n
Ai → a xk such that xk ∈ V* and a ∈ T
4. A4 → A1A4 is identified.
5. A4 → A1A4 | b.
To eliminate A1 we use the substitution rule A1 → A2A3 | A4A4.
Therefore, we have A4 → A2A3A4 | A4A4A4 | b.
The first two productions still do not conform to any of the types in step
3.
Substituting for A2 → b:
A4 → bA3A4 | A4A4A4 | b
Now we have to remove the left-recursive production A4 → A4A4A4:
A4 → bA3A4 | b | bA3A4Z | bZ
Z → A4A4 | A4A4Z
6. At this stage our grammar looks like
A1 → A2A3 | A4A4
A4 → bA3A4 | b | bA3A4Z | bZ
Z → A4A4 | A4A4Z
A2 → b
A3 → a
All rules now conform to one of the types in step 3, but the grammar is
still not in Greibach Normal Form.
7. All productions for A2, A3 and A4 are in GNF.
For A1 → A2A3 | A4A4:
Substitute for A2 and for the first A4 to convert it to GNF:
A1 → bA3 | bA3A4A4 | bA4 | bA3A4ZA4 | bZA4
For Z → A4A4 | A4A4Z:
Substitute for the first A4 to convert it to GNF:
Z → bA3A4A4 | bA4 | bA3A4ZA4 | bZA4 | bA3A4A4Z | bA4Z | bA3A4ZA4Z | bZA4Z
8. Finally, the grammar in GNF is
A1 → bA3 | bA3A4A4 | bA4 | bA3A4ZA4 | bZA4
A4 → bA3A4 | b | bA3A4Z | bZ
Z → bA3A4A4 | bA4 | bA3A4ZA4 | bZA4 | bA3A4A4Z | bA4Z | bA3A4ZA4Z | bZA4Z
A2 → b
A3 → a
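Whether the final grammar really is in GNF can be verified mechanically: every body must start with a terminal and continue with variables only. The sketch below encodes the final grammar of the example with space-separated symbols; the names `GNF`, `TERMINALS` and `in_gnf` are assumptions of this illustration.

```python
GNF = {
    "A1": ["b A3", "b A3 A4 A4", "b A4", "b A3 A4 Z A4", "b Z A4"],
    "A4": ["b A3 A4", "b", "b A3 A4 Z", "b Z"],
    "Z": ["b A3 A4 A4", "b A4", "b A3 A4 Z A4", "b Z A4",
          "b A3 A4 A4 Z", "b A4 Z", "b A3 A4 Z A4 Z", "b Z A4 Z"],
    "A2": ["b"],
    "A3": ["a"],
}
TERMINALS = {"a", "b"}

def in_gnf(productions, terminals):
    """True if every production has the form A -> a α with α in V*."""
    variables = set(productions)
    for bodies in productions.values():
        for body in bodies:
            symbols = body.split()
            if not symbols or symbols[0] not in terminals:
                return False              # must start with a terminal
            if any(s not in variables for s in symbols[1:]):
                return False              # the rest must be variables
    return True

print(in_gnf(GNF, TERMINALS))   # True
# The grammar before conversion fails the check (A1 -> A2 A3 starts with a variable).
print(in_gnf({"A1": ["A2 A3"], "A2": ["b"], "A3": ["a"]}, TERMINALS))   # False
```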

7. Closure Properties of CFL's:

 The context-free languages are closed under

o substitution
o union
o concatenation
o Kleene star

o homomorphism
o reversal
o intersection with a regular set
o inverse homomorphism

8. Non-closure Properties of CFL's:

• The context-free languages are not closed under

o intersection
 L1 = {a^n b^n c^i | n, i ≥ 0} and L2 = {a^i b^n c^n | n, i ≥ 0} are CFL's. But
L = L1 ∩ L2 = {a^n b^n c^n | n ≥ 0} is not a CFL.
o complement
 Suppose comp(L) were context free whenever L is context free. Since
L1 ∩ L2 = comp(comp(L1) ∪ comp(L2)) and the CFL's are closed under union,
this would imply that the CFL's are closed under intersection, a contradiction.
o difference
 Suppose L1 − L2 were context free whenever L1 and L2 are context free.
If L is a CFL over Σ, then comp(L) = Σ* − L would be context
free (Σ* is regular and hence a CFL), contradicting non-closure under complement.

9. Pumping Lemma for CFL's:

The pumping lemma for CFL's is used to show that certain languages are not
context free. There are three forms of the pumping lemma.

1. Standard form of pumping lemma: For every infinite context-free
language L, there exists a constant n that depends on L such that every z in L
with |z| ≥ n can be written as z = uvwxy, where
1. vx ≠ ε,
2. |vwx| ≤ n, and
3. for all i ≥ 0, the string u v^i w x^i y is in L.
One important use of the pumping lemma is to prove that certain languages are not
context free.
2. Strong form of pumping lemma (Ogden’s Lemma): Let L be an infinite
CFL. Then there is a constant n such that if z is any word in L, and we mark
any n or more positions of z as “distinguished”, then we can write z = uvwxy such
that
i. v and x together have at least one distinguished position,
ii. vwx has at most n distinguished positions, and
iii. for all i ≥ 0, u v^i w x^i y is in L.
3. Weak form of pumping lemma: Let L be an infinite CFL. When we pump, the
lengths of the strings are
|uvwxy| = |uwy| + |vx|
|u v^2 w x^2 y| = |uwy| + 2|vx|
...................................
|u v^i w x^i y| = |uwy| + i|vx|.
That is, when we pump, the lengths are in arithmetic progression.

Example:
1. The language L = { a^n b^n c^n | n ≥ 0 } is not context free.
Solution:
The proof is by contradiction. Assume L is context free. Then by the
pumping lemma there is a constant n associated with L such that every z in L
with |z| ≥ n can be written as z = uvwxy such that
1. vx ≠ ε,
2. |vwx| ≤ n, and
3. for all i ≥ 0, the string u v^i w x^i y is in L.
Consider the string z = a^n b^n c^n.
From condition (2), vwx cannot contain both a's and c's, because the a's and
c's are separated by n b's.
o Two cases arise:
1. vwx has no c's. Then uwy still contains n c's but, since at least
one of v or x is nonempty, fewer than n a's or fewer than n b's.
So uwy cannot be in L.
2. vwx has no a's. Then uwy contains n a's but fewer than n b's or
fewer than n c's, so again uwy cannot be in L.
o In both cases we have a contradiction with condition (3) for i = 0, so we
must conclude that L cannot be context free.
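For a small value of n, the case analysis in the proof can be confirmed by brute force: no way of splitting z = a^n b^n c^n into uvwxy with vx ≠ ε and |vwx| ≤ n survives pumping. The sketch below is only an experiment for one fixed n, not a proof, and the names `in_L` and `surviving_decompositions` are introduced here for illustration.

```python
def in_L(s):
    """Membership in L = { a^n b^n c^n | n >= 0 }."""
    n = len(s) // 3
    return s == "a" * n + "b" * n + "c" * n

def surviving_decompositions(z, n):
    """Splits z = uvwxy with vx != ε and |vwx| <= n that stay in L
    when pumped with i = 0 and with i = 2."""
    found = []
    size = len(z)
    for p in range(size + 1):                           # u = z[:p]
        for q in range(p, size + 1):                    # v = z[p:q]
            for r in range(q, size + 1):                # w = z[q:r]
                for t in range(r, min(size, p + n) + 1):  # x = z[r:t], |vwx| <= n
                    u, v, w, x, y = z[:p], z[p:q], z[q:r], z[r:t], z[t:]
                    if not (v or x):
                        continue                        # condition 1: vx != ε
                    if all(in_L(u + v * i + w + x * i + y) for i in (0, 2)):
                        found.append((u, v, w, x, y))
    return found

n = 3
z = "a" * n + "b" * n + "c" * n
print(surviving_decompositions(z, n))   # []
```

An empty result means that every admissible decomposition already fails for i = 0 or i = 2, in line with the two cases of the proof.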
