Session 3
Analysis ( carried out by the front end )
o Lexical analysis
o A generic lexical analyzer - lex
o Syntax analysis
o Context-free grammar
o How to build parse trees
o A generic parser - yacc
o Semantic analysis
Synthesis ( carried out by the back end )
o Intermediate code generation
o Code optimization
o Object code generation
Continuing syntax analysis : we have already discussed derivations and their graphical representation, the parse tree. There are two types of derivations:
o Leftmost derivation : at each step, the leftmost non-terminal is replaced; e.g. E => E * E => id * E => id * id
o Rightmost derivation : at each step, the rightmost non-terminal is replaced; e.g. E => E * E => E * id => id * id
Every parse tree has a unique leftmost (and a unique rightmost) derivation. Note that a sentence can have many parse trees, but a given parse tree has a unique derivation. Evaluation of a parse tree always happens bottom-up and left to right.
Ambiguity : a grammar is ambiguous if some sentence has more than one parse tree, i.e., more than one leftmost (or rightmost) derivation of the sentence is possible. Example : under a grammar such as E -> E + E | E * E | id, the sentence id + id * id has two parse trees.
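To make the ambiguity concrete, the sketch below (an illustration that is not part of the original notes) counts the parse trees that the ambiguous grammar E -> E + E | E * E | id admits for a token string; id + id * id yields two, matching the two distinct leftmost derivations.

```python
from functools import lru_cache

# Ambiguous grammar: E -> E + E | E * E | id
# Count the distinct parse trees for a token string under this grammar.
def count_parses(tokens):
    toks = tuple(tokens)

    @lru_cache(maxsize=None)
    def count(i, j):
        if j - i == 1:                       # a single token: must be id
            return 1 if toks[i] == 'id' else 0
        total = 0
        for k in range(i + 1, j - 1):        # split at each binary operator
            if toks[k] in ('+', '*'):
                total += count(i, k) * count(k + 1, j)
        return total

    return count(0, len(toks))

print(count_parses(['id', '+', 'id', '*', 'id']))  # 2 parse trees -> ambiguous
print(count_parses(['id', '*', 'id']))             # 1 parse tree
```

A count greater than one for any sentence is exactly what "ambiguous" means.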
How to resolve ambiguity : write an unambiguous grammar. This can be achieved by defining precedence rules with extra non-terminals. Example : another ambiguous grammar:
E -> E + E | E - E | E * E | E / E | -E | id
Problem : how to convert the above ambiguous grammar into an unambiguous one.
Solution : apply precedence rules with extra non-terminals. The usual precedence order, from highest to lowest, is : - (unary minus), * | /, + | -.
Golden rule : build the grammar from lowest to highest precedence.
Goal -> Expr
Expr -> Expr + Term | Expr - Term | Term
Term -> Term * Factor | Term / Factor | Factor
Factor -> -Primary | Primary
Primary -> id
Now the leftmost derivation for - id + id * id is:
Goal => Expr => Expr + Term => Term + Term
=> Factor + Term => -Primary + Term => -id + Term => -id + Term * Factor => -id + Factor * Factor => -id + Primary * Factor => -id + id * Factor => -id + id * Primary => -id + id * id
There are three new non-terminals ( Term, Factor, Primary ). You cannot build two parse trees for the above sentence using this grammar.
Parser : a program that, given a sentence, reconstructs a derivation for that sentence ---- if done successfully, it recognizes the sentence. All parsers read their input left to right, but construct the parse tree differently.
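The precedence layers above translate directly into a recursive-descent evaluator, one function per non-terminal. This is a minimal sketch, not from the notes: the left-recursive Expr and Term rules are implemented with loops (a recursive function cannot call itself on a left-recursive rule), and numbers stand in for id.

```python
# One function per non-terminal of the precedence grammar:
#   Expr -> Expr + Term | Expr - Term | Term   (loop instead of left recursion)
#   Term -> Term * Factor | Term / Factor | Factor
#   Factor -> -Primary | Primary
#   Primary -> number (stands in for id)
def evaluate(tokens):
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def expr():                      # lowest precedence: + and -
        value = term()
        while peek() in ('+', '-'):
            value = value + term() if advance() == '+' else value - term()
        return value

    def term():                      # next level: * and /
        value = factor()
        while peek() in ('*', '/'):
            value = value * factor() if advance() == '*' else value / factor()
        return value

    def factor():                    # highest precedence: unary minus
        if peek() == '-':
            advance()
            return -primary()
        return primary()

    def primary():
        return advance()             # a number plays the role of id

    return expr()

print(evaluate(['-', 2, '+', 3, '*', 4]))  # 10: parsed as (-2) + (3 * 4)
```

Because * sits in a deeper layer than +, the evaluator groups 3 * 4 first, exactly as the unique parse tree does.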
There are two types of parsers:
a. Top-down parsers --- construct the tree from the root to the leaves
b. Bottom-up parsers --- construct the tree from the leaves to the root
Top-down parser ( LL parser ) : it attempts to derive a string matching the source string through a sequence of derivations starting with the start symbol of the grammar. In other words, it constructs the parse tree by starting at the start symbol and guessing at each derivation step, using the next input symbol from the sentence to guide the guess. For a valid input string a, a top-down parser thus determines a derivation sequence S => α1 => α2 => ... => a. In top-down parsing every derivation step is leftmost while matching the input string, and that is why this method is termed LL parsing: the first L is because the parser reads the input from left to right, and the second L stands for leftmost derivation.
There are three main concepts in top-down parsing:
1. Start symbol - selection of the start symbol ( the root of the parse tree ) is very important.
2. Guessing a derivation that can lead to a match with the input sentence. This is called 'prediction'.
3. If the guess is wrong, then one needs to revert the guess and try again. This is called 'backtracking'.
High-level flow of top-down parsing :
Step 1 : Identify the start symbol and start with it.
Step 2 : Guess a production ( which can lead to a match with the input ) and apply it -- prediction.
Step 3 : Match the input string.
Step 4 : If it matches, go to step 2 until the complete sentence is matched. Otherwise it was a wrong guess: revert the derivation and go to step 2 -- backtrack.
If the prediction matches the input string there is no backtracking; otherwise we backtrack.
Some disadvantages of top-down parsing. Two problems arise from the possibility of backtracking:
a. Semantic analysis cannot be performed while making a prediction. The action must be delayed until the prediction is known to be part of a successful parse, i.e., you do not yet know whether the prediction is correct.
b. A source string is known to be erroneous only after all predictions have failed. This makes it very inefficient.
Based on prediction and backtracking, top-down parsers can be categorized into two categories:
1. Recursive-descent parsing ( RD ) - a top-down parser with backtracking
o Backtracking is needed: if a choice of production rule does not work, we backtrack to try the other alternatives.
o It is a general parsing technique, but not widely used and not efficient. It can be used for quick-and-dirty parsing.
o At each derivation step it uses the RHS of a production from left to right. Grammars with right recursion are suitable for this and do not enter an infinite loop while making predictions.
o The parser is recursive in nature ( recursive derivations ); "descent" because it works from the top down. Example : S -> aBc, B -> bc | b ( here it tries bc for B first and then b ).
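A backtracking recursive-descent parser for this small grammar can be sketched as follows (illustrative code, not part of the notes). Parsing 'abc' first tries B -> bc, fails to match the trailing c, backtracks, and succeeds with B -> b; 'abcc' succeeds on the first alternative.

```python
# Grammar: S -> a B c ;  B -> bc | b
# parse_B yields each position where B could end, in the order the
# alternatives are tried; the caller backtracks by taking the next one.
def parse_B(s, i):
    for alt in ('bc', 'b'):                 # prediction: try bc first, then b
        if s.startswith(alt, i):
            yield i + len(alt)

def parse_S(s):
    if not s.startswith('a'):
        return False
    for j in parse_B(s, 1):                 # prediction for B
        if s[j:] == 'c':                    # the rest must be exactly 'c'
            return True                     # the guess worked
        # otherwise: backtrack and try the next alternative for B
    return False

print(parse_S('abc'))   # True  (only after backtracking from B -> bc)
print(parse_S('abcc'))  # True  (B -> bc works immediately)
print(parse_S('ab'))    # False (all predictions fail)
```

Note how the error for 'ab' is reported only after every alternative for B has been tried, which is exactly disadvantage (b) above.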
Grammars suitable for recursive-descent top-down parsing : grammars containing left recursion ( the non-terminal appears at the left end of the RHS of one of its own productions ) are not suitable for top-down parsing. Example : for the string id + id * id with
E => E + T | T
T => T * V | V
V => <id>
the first production used would be E => E + T. Now E has to be replaced, since top-down parsing uses leftmost derivation. With recursive-descent parsing, E would again be replaced by E + T, which creates an infinite loop of prediction making.
Grammars containing right recursion are suitable for top-down parsing and never enter an infinite loop. However, this method is time consuming and error-prone for large grammars. Example : the above grammar can be rewritten with right recursion as follows.
E => T + E | T
T => V * T | V
V => <id>
The first production used would be E => T + E, and T has to be replaced next ( as top-down parsing uses leftmost derivation ). Here is the complete sequence:
E => T + E => V + E => <id> + E => <id> + T => <id> + V * T => <id> + <id> * T => <id> + <id> * V => <id> + <id> * <id>
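With the right-recursive grammar, a recursive-descent parser needs no backtracking at all: after parsing a T it just looks at the next token to decide between E => T + E and E => T. A minimal recognizer (illustrative, not from the notes):

```python
# Right-recursive grammar:
#   E -> T + E | T ;  T -> V * T | V ;  V -> id
def recognize(tokens):
    pos = 0

    def match(expected):
        nonlocal pos
        if pos < len(tokens) and tokens[pos] == expected:
            pos += 1
            return True
        return False

    def E():
        if not T():
            return False
        if pos < len(tokens) and tokens[pos] == '+':   # E -> T + E
            match('+')
            return E()
        return True                                    # E -> T

    def T():
        if not match('id'):                            # V -> id
            return False
        if pos < len(tokens) and tokens[pos] == '*':   # T -> V * T
            match('*')
            return T()
        return True                                    # T -> V

    return E() and pos == len(tokens)

print(recognize(['id', '+', 'id', '*', 'id']))  # True
print(recognize(['id', '+']))                   # False
```

One token of lookahead ('+' or '*') selects the production, which is precisely what makes this grammar suitable for predictive parsing below.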
2. Predictive parsing - a top-down parser without backtracking. Efficient, but needs a special form of CFG known as an LL(k) grammar; it is possible only for LL(k) grammars.
LL(k) grammars are the context-free grammars for which there exists some positive integer k that allows a recursive-descent parser to decide which production to use by examining only the next k tokens of input. Here are some of their properties:
o Subset of the CFGs
o Permit deterministic left-to-right recognition with a lookahead of k symbols
o Build the parse tree top-down
o If a parsing table can be constructed for the grammar, it is LL(k); if it cannot, it is not LL(k)
o Each LL(k) grammar is unambiguous
o An LL(k) grammar has no left recursion. It may have right recursion, but in the case of a right-recursive production rule the same non-terminal must also have a production for epsilon. With left recursion the parser could loop forever, which makes a correct prediction from k symbols impossible.
Given a left-recursive grammar, it can be converted to an LL(k) grammar by eliminating the left recursion and, where needed, applying left factoring.
Left factoring : take the common parts of productions and form a new non-terminal. With left factoring, each production ( i.e., each non-terminal ) becomes non-recursive or right recursive. If a production is right recursive, then there is also a production for e ( epsilon ).
Example : how to convert a left-recursive grammar into an LL(k) grammar.
E => E + T | T
T => T * V | V          ----> left recursive ( not suitable for any top-down parsing )
V => <id>
          |
          v
E => T + E | T
T => V * T | V          ---> right recursive ( suitable for top-down recursive-descent parsing )
V => <id>
          |
          v
E => T E'
E' => + T E' | e (epsilon)
T => V T'               ---> left-factored LL(k) grammar ( suitable for top-down predictive parsing )
T' => * V T' | e
V => <id>
( Note that every recursive production has a derivation to e )
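The first transformation step, removing immediate left recursion (A -> Aa | b becoming A -> bA', A' -> aA' | e), is mechanical enough to sketch in code. This is an illustration, not from the notes; productions are lists of symbols and the empty list stands for epsilon.

```python
# Eliminate immediate left recursion for one non-terminal:
#   A -> A a1 | ... | A am | b1 | ... | bn
# becomes
#   A  -> b1 A' | ... | bn A'
#   A' -> a1 A' | ... | am A' | epsilon    (epsilon = empty list)
def eliminate_left_recursion(nt, productions):
    recursive = [p[1:] for p in productions if p and p[0] == nt]
    others = [p for p in productions if not p or p[0] != nt]
    if not recursive:                      # nothing to do
        return {nt: productions}
    fresh = nt + "'"                       # new non-terminal A'
    return {
        nt: [p + [fresh] for p in others],
        fresh: [p + [fresh] for p in recursive] + [[]],
    }

# E -> E + T | T   becomes   E -> T E' ;  E' -> + T E' | epsilon
print(eliminate_left_recursion('E', [['E', '+', 'T'], ['T']]))
```

Applying it to T -> T * V | V likewise yields T -> V T', T' -> * V T' | epsilon, reproducing the left-factored grammar above.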
The LL(k) grammars therefore exclude all ambiguous grammars, as well as all grammars that contain left recursion. LL(1) : a recursive-descent parser can decide which production to apply by examining only the next 1 token of input.
The predictive parser that uses an LL(1) grammar is known as an LL(1) parser.
o LL(1) means that the input is processed left-to-right, a leftmost derivation is constructed, and the method uses at most one lookahead token.
o An LL(1) parser is a table-driven parser for LL parsing.
o The '1' in LL(1) indicates that the parser uses a lookahead of one source symbol, i.e., the prediction to be made is determined by the next source symbol.
o It expects an LL(1) grammar.
There are two important concepts in LL(1) parsing:
o The parsing table and the algorithm to create the parsing table
o The algorithm for derivations
About the parsing table :
o The parsing table has a row for each non-terminal (NT) in the production rules.
o The parsing table has a column for each terminal (T) in the production rules.
o A parsing table entry PT(NT, T) indicates what prediction should be made if NT is the leftmost non-terminal in a sentential form and T is the next source symbol.
o A blank entry in the parsing table indicates an error. A multiple entry indicates a conflict, which tells us that the grammar is not LL(1): no cell may contain more than one entry.
o There is a special column for the end-of-input marker, written $.
Algorithm to create the parsing table :
o a ∈ First(alpha) ==> alpha can derive a string starting with a
o b ∈ Follow(A) ==> b can follow a string derived from A
Example : here is an example of an LL(1) grammar for arithmetic expressions and the corresponding parsing table.
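FIRST sets can be computed by a simple fixpoint iteration. The sketch below (illustrative, not from the notes) does so for the left-factored expression grammar, writing ε for the empty string and representing an ε-production as an empty list.

```python
# Compute FIRST sets by fixpoint iteration.
# grammar: non-terminal -> list of productions (each a list of symbols);
# an empty production stands for epsilon.
def first_sets(grammar, terminals):
    first = {nt: set() for nt in grammar}
    changed = True
    while changed:
        changed = False
        for nt, productions in grammar.items():
            for prod in productions:
                nullable = True            # does the whole RHS derive epsilon?
                for sym in prod:
                    add = {sym} if sym in terminals else first[sym] - {'ε'}
                    if not add <= first[nt]:
                        first[nt] |= add
                        changed = True
                    if sym in terminals or 'ε' not in first[sym]:
                        nullable = False   # this symbol cannot vanish: stop
                        break
                if nullable and 'ε' not in first[nt]:
                    first[nt].add('ε')
                    changed = True
    return first

grammar = {
    'E':  [['T', "E'"]],
    "E'": [['+', 'T', "E'"], []],
    'T':  [['V', "T'"]],
    "T'": [['*', 'V', "T'"], []],
    'V':  [['id']],
}
FIRST = first_sets(grammar, {'id', '+', '*'})
print(FIRST)  # FIRST(E) = FIRST(T) = FIRST(V) = {id}; FIRST(E') = {+, ε}; FIRST(T') = {*, ε}
```

FOLLOW sets are computed by a similar fixpoint pass; together they determine each entry PT(NT, T).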
Input string :
Parsing Table :
LL(1) parsing algorithm
{ Input : a string w and a parsing table M for grammar G. }
{ Output : if w is in L(G), a leftmost derivation of w; otherwise, an error indication. }
Initially, the parser is in a configuration in which it has $S on the stack ( S, the start symbol of G, on top ) and w$ in the input buffer. Set ip to point to the first symbol of w$.
repeat
    let X be the top stack symbol and a the symbol pointed to by ip
    if X = a then    { X is a terminal or $ }
        pop X from the stack and advance ip
    else if X is a terminal then
        error()
    else if M[X, a] = X -> Y1 Y2 ... Yk then    { X is a nonterminal }
        pop X from the stack
        push Yk, Yk-1, ..., Y1 onto the stack, with Y1 on top
        output the production X -> Y1 Y2 ... Yk
    else
        error()
    end if
until X = $
Parsing steps :
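Since the worked table was given as a figure, here is a hand-built equivalent for the left-factored grammar above together with the parsing loop from the algorithm (an illustrative sketch, not the original table): the entry for (NT, lookahead) is the RHS to predict, and an empty RHS is an ε-production.

```python
# Grammar: E -> T E' ;  E' -> + T E' | eps ;  T -> V T' ;  T' -> * V T' | eps ;  V -> id
TABLE = {
    ('E', 'id'): ['T', "E'"],
    ("E'", '+'): ['+', 'T', "E'"],
    ("E'", '$'): [],
    ('T', 'id'): ['V', "T'"],
    ("T'", '+'): [],
    ("T'", '*'): ['*', 'V', "T'"],
    ("T'", '$'): [],
    ('V', 'id'): ['id'],
}
TERMINALS = {'id', '+', '*'}

def ll1_parse(tokens):
    stack = ['$', 'E']                     # start symbol on top of $
    toks = tokens + ['$']
    i, derivation = 0, []
    while stack:
        X = stack.pop()
        a = toks[i]
        if X == '$':
            return derivation if a == '$' else None
        if X in TERMINALS:                 # match the terminal and advance
            if X != a:
                return None
            i += 1
        else:                              # predict using the table
            rhs = TABLE.get((X, a))
            if rhs is None:
                return None                # blank entry: error
            derivation.append((X, rhs))
            stack.extend(reversed(rhs))    # push the RHS with its first symbol on top
    return None

print(ll1_parse(['id', '+', 'id', '*', 'id'])[0])  # ('E', ['T', "E'"])
print(ll1_parse(['id', '+']))                      # None: not a sentence
```

The list of (non-terminal, RHS) pairs it returns is exactly the leftmost derivation the algorithm above outputs.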
Bottom-up parser : ( constructs the parse tree bottom-up --- from the leaves to the root ) As the name suggests, bottom-up parsing works in the opposite direction from top-down. A top-down parser begins with the start symbol at the top of the parse tree and works downward, applying productions in forward order until it gets to the terminal leaves. A bottom-up parser starts with the string of terminals itself and builds from the leaves upward, working back to the start symbol by applying the productions in reverse. Along the way, a bottom-up parser searches for substrings of the working string that match the right side of some production. When it finds such a substring, it reduces it, i.e., substitutes the left-side nonterminal for the matching right side. The goal is to reduce all the way up to the start symbol and report a successful parse. In general, bottom-up parsing algorithms are more powerful than top-down methods, but not surprisingly, the constructions required are also more complex. It is difficult to write a bottom-up parser by hand for anything but trivial grammars, but fortunately there are excellent parser generator tools like yacc that build a parser from an input specification.
Some features of bottom-up parsing:
o Bottom-up parsing always constructs a rightmost derivation ( in reverse )
o It attempts to build the tree upward toward the start symbol
o More complex than top-down, but efficient
Shift-reduce parser : shift-reduce parsing is the most commonly used and the most powerful of the bottom-up techniques. It takes as input a stream of tokens and develops the list of productions used to build the parse tree, but the productions are discovered in the reverse order of a top-down parser. Like a table-driven predictive parser, a bottom-up parser makes use of a stack to keep track of the position in the parse and a parsing table to determine what to do next. To illustrate stack-based shift-reduce parsing, consider this simplified expression grammar:
S -> E
E -> T | E + T
T -> id | (E)
The shift-reduce strategy divides the string we are trying to parse into two parts: an undigested part and a semi-digested part. The undigested part contains the tokens that are still to come in the input, and the semi-digested part is kept on a stack. If we are parsing the string v, it starts out completely undigested, so the input is initialized to v and the stack to empty. A shift-reduce parser proceeds by taking one of three actions at each step:
o Reduce : if we can find a rule A -> w, and the contents of the stack are qw for some q ( q may be empty ), then we can reduce the stack to qA. We are applying the production for the nonterminal A backwards. There is also one special case : reducing the entire contents of the stack to the start symbol with no remaining input means we have recognized the input as a valid sentence ( e.g., the stack contains just w, the input is empty, and we apply S -> w ). This is the last step in a successful parse. The w being reduced is referred to as a handle.
o Shift : if it is impossible to perform a reduction and there are tokens remaining in the undigested input, then we transfer a token from the input onto the stack. This is called a shift. For example, using the grammar above, suppose the stack contains ( and the input contains id+id). It is impossible to perform a reduction on ( since it does not match the entire right side of any of our productions. So, we shift the first token of the input onto the stack, giving us (id on the stack and +id) remaining in the input.
o Error : if neither of the two cases above applies, we have an error. If the sequence on the stack does not match the right-hand side of any production, we cannot reduce. And if shifting the next input token would create a sequence on the stack that cannot eventually be reduced to the start symbol, a shift action would be futile. Thus, we have hit a dead end where the next token conclusively determines that the input cannot form a valid sentence. This would happen in the above grammar on the input id+). The first id would be shifted, then reduced to T and again to E; next, + is shifted. At this point, the stack contains E+ and the next input token is ). The sequence on the stack cannot be reduced, and shifting the ) would create a sequence that is not viable, so we have an error.
The general idea is to read tokens from the input and push them onto the stack, attempting to build sequences that we recognize as the right side of a production. When we find a match, we replace that sequence with the nonterminal from the left side and continue working our way up the parse tree. This process builds the parse tree from the leaves upward, the inverse of the top-down parser. If all goes well, we will end up moving everything from the input to the stack and eventually construct a sequence on the stack that we recognize as a right-hand side for the start symbol.
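The shift-reduce loop for the little expression grammar can be sketched as follows. This is a naive illustration (not from the notes): instead of consulting a parsing table it greedily reduces whenever a stack suffix matches a right-hand side, trying longer RHSs first, which happens to work for this grammar but not in general.

```python
# Grammar: E -> E + T | T ;  T -> (E) | id     (longest RHSs listed first)
RULES = [
    ('E', ['E', '+', 'T']),
    ('T', ['(', 'E', ')']),
    ('E', ['T']),
    ('T', ['id']),
]

def shift_reduce(tokens):
    stack, inp, trace = [], list(tokens), []
    while True:
        for lhs, rhs in RULES:             # reduce if a RHS matches a stack suffix
            if stack[-len(rhs):] == rhs:
                del stack[-len(rhs):]
                stack.append(lhs)
                trace.append('reduce ' + lhs + ' -> ' + ' '.join(rhs))
                break
        else:
            if inp:                        # no handle on the stack: shift
                stack.append(inp.pop(0))
                trace.append('shift ' + stack[-1])
            else:                          # input exhausted: accept iff stack is E
                return stack == ['E'], trace

ok, trace = shift_reduce(['(', 'id', '+', 'id', ')'])
print(ok)                                  # True
print(shift_reduce(['id', '+'])[0])        # False: dead end with E+ on the stack
```

The trace it records is the sequence of shifts and reductions, i.e. the rightmost derivation in reverse; a real shift-reduce parser replaces the greedy choice with a table lookup.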
Conflicts in shift-reduce parsing : ambiguous grammars lead to parsing conflicts; conflicts can be fixed by rewriting the grammar, or by making a decision during parsing.
Shift/reduce (SR) conflicts : the parser must choose between a reduce action and a shift action, as in the dangling-else grammar
S -> if E then S | if E then S else S | ...
LR Parsing : table driven shift reduce parser LR parsers ("L" for left to right scan of input, "R" for rightmost derivation) are efficient, table-driven shift-reduce parsers. The class of grammars that can be parsed using LR methods is a proper superset of the class of grammars that can be parsed with predictive LL parsers. In fact, virtually all programming language constructs for which CFGs can be written can be parsed with LR techniques. As an added advantage, there is no need for lots of grammar rearrangement to make it acceptable for LR parsing the way that LL parsing requires. The primary disadvantage is the amount of work it takes to build the tables by hand, which makes it infeasible to hand-code an LR parser for most grammars. Fortunately, there are LR parser generators that create the parser
from an unambiguous CFG specification. The parser tool does all the tedious and complex work to build the necessary tables and can report any ambiguities or language constructs that interfere with the ability to parse using LR techniques. Rather than reading and shifting tokens onto a stack, an LR parser pushes "states" onto the stack; these states describe what is on the stack so far. An LR parser uses two tables:
1. The action table : Action[s, a] tells the parser what to do when the state on top of the stack is s and terminal a is the next input token. The possible actions are to shift a state onto the stack, to reduce the handle on top of the stack, to accept the input, or to report an error.
2. The goto table : Goto[s, X] indicates the new state to place on top of the stack after a reduction of the nonterminal X while state s is on top of the stack.
LR parser types : there are three types of LR parsers: LR(k), simple LR(k), and lookahead LR(k) ( abbreviated LR(k), SLR(k), and LALR(k) ). The k identifies the number of tokens of lookahead. We will usually concern ourselves only with 0 or 1 tokens of lookahead, but the techniques do generalize to k > 1. Here are some widely used LR parsers based on the value of k.
o LR(0) - no lookahead symbol
o SLR(1) - simple LR with one lookahead symbol
o LALR(1) - lookahead LR; not as powerful as full LR(1) but simpler to implement. yacc deals with this kind of grammar.
o LR(1) - the most general grammar class, but the most complex to implement.
LR(0) is the simplest of all the LR parsing methods. It is also the weakest, and although of theoretical importance, it is not used much in practice because of its limitations. LR(0) parses without using any lookahead at all. Adding just one token of lookahead to get LR(1) vastly increases the parsing power. Very few grammars can be parsed with LR(0), but most unambiguous CFGs can be parsed with LR(1).
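The driver loop that uses these two tables is short; below is an illustrative sketch with hand-built SLR(1) tables for the toy grammar E -> E + id | id (the tables are a reconstruction for this example, not from the notes). The stack holds states; a shift pushes a state, and a reduce pops one state per RHS symbol and then consults Goto.

```python
# Toy grammar: E -> E + id | id, with hand-built SLR(1) tables.
# ACTION[state, token] = ('s', next_state) | ('r', (lhs, rhs_len)) | ('acc', None)
ACTION = {
    (0, 'id'): ('s', 1),
    (1, '+'): ('r', ('E', 1)), (1, '$'): ('r', ('E', 1)),   # E -> id
    (2, '+'): ('s', 3),        (2, '$'): ('acc', None),
    (3, 'id'): ('s', 4),
    (4, '+'): ('r', ('E', 3)), (4, '$'): ('r', ('E', 3)),   # E -> E + id
}
GOTO = {(0, 'E'): 2}

def lr_parse(tokens):
    stack = [0]                            # stack of states
    toks = tokens + ['$']
    i, output = 0, []
    while True:
        entry = ACTION.get((stack[-1], toks[i]))
        if entry is None:
            return None                    # error entry
        kind, arg = entry
        if kind == 's':                    # shift: push the new state
            stack.append(arg)
            i += 1
        elif kind == 'r':                  # reduce: pop |rhs| states, then goto
            lhs, rhs_len = arg
            del stack[len(stack) - rhs_len:]
            stack.append(GOTO[(stack[-1], lhs)])
            output.append(lhs)
        else:                              # accept
            return output

print(lr_parse(['id', '+', 'id']))         # ['E', 'E']: E -> id, then E -> E + id
print(lr_parse(['+', 'id']))               # None
```

The reductions come out in the order E -> id then E -> E + id, i.e. the rightmost derivation in reverse; a generator like yacc produces exactly these two tables from the grammar.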
The drawback of adding the lookahead is that the algorithm becomes somewhat more complex and the parsing table gets much, much bigger. The full LR(1) parsing table for a typical programming language has many thousands of states compared to the few hundred needed for LR(0). A compromise in the middle is
found in the two variants SLR(1) and LALR(1), which also use one token of lookahead but employ techniques to keep the table as small as LR(0)'s. SLR(k) is an improvement over LR(0) but much weaker than full LR(k) in terms of the number of grammars for which it is applicable. LALR(k) parses a larger set of languages than SLR(k) but not quite as many as LR(k). LALR(1) is the method used by the yacc parser generator.
Precedence parsers:
o Simple precedence parser
o Operator-precedence parser
o Extended precedence parser
Bootstrapping : a term used in computer science to describe the techniques involved in writing a compiler ( or assembler ) in the target programming language which it is intended to compile.
o Improvements to the compiler's back end improve not only general-purpose programs but also the compiler itself.
o It is a comprehensive consistency check, as the compiler should be able to reproduce its own object code.
Symbol table : an essential function of a compiler is to record the identifiers and the relevant information about their attributes: type, scope, and, in the case of a procedure or function, its name, arguments, and return type. A symbol table is a table containing a record for each identifier, with fields for the attributes of the identifier. This table is used by all phases of the compiler to access the data as well as to report errors.
Exercises : generate parse trees for the following sentences based on the standard arithmetic CFG.
a - b * c
a + b * c - d / (e * f)
a + b * c - d + e - f / (g + h)
a + b * c / d + e - f
a / b + c * d + e - f
9 * 7 + 5 - 2
Use the following grammar if one is not given:
E => E + T | E - T | T
T => T * V | T / V | V
V => <id> | (E)
a - b * c
First of all, find a starting prediction. ( Always start from the lowest-precedence operator in the input string. )
E => E - T => T - T => V - T => <id> - T => <id> - T * V => <id> - V * V => <id> - <id> * V => <id> - <id> * <id>
a + b * c - d / (e * f)
( Two operators are at the lowest precedence, + and -; choose the one at the rightmost side, i.e. - )
E => E - T
  => E + T - T
  => T + T - T
  => V + T - T
  => <id> + T - T
  => <id> + T * V - T
  => <id> + V * V - T
  => <id> + <id> * V - T
  => <id> + <id> * <id> - T
  => <id> + <id> * <id> - T / V
  => <id> + <id> * <id> - V / V
  => <id> + <id> * <id> - <id> / V
  => <id> + <id> * <id> - <id> / (E)
  => <id> + <id> * <id> - <id> / (T)
  => <id> + <id> * <id> - <id> / (T * V)
  => <id> + <id> * <id> - <id> / (V * V)
  => <id> + <id> * <id> - <id> / (<id> * V)
  => <id> + <id> * <id> - <id> / (<id> * <id>)
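Returning to the symbol table described above: it can be sketched as a stack of dictionaries (an illustration, not from the notes). Entering a scope pushes a table, declarations go into the innermost one, and lookup walks outward so the innermost declaration wins.

```python
# A symbol table as a stack of scopes (innermost scope last).
class SymbolTable:
    def __init__(self):
        self.scopes = [{}]                 # the global scope

    def enter_scope(self):
        self.scopes.append({})

    def exit_scope(self):
        self.scopes.pop()

    def declare(self, name, **attributes): # e.g. type, kind, return type
        self.scopes[-1][name] = attributes

    def lookup(self, name):                # innermost declaration wins
        for scope in reversed(self.scopes):
            if name in scope:
                return scope[name]
        return None                        # undeclared identifier

table = SymbolTable()
table.declare('x', type='int')
table.enter_scope()                        # e.g. entering a function body
table.declare('x', type='float')
print(table.lookup('x'))                   # {'type': 'float'}
table.exit_scope()
print(table.lookup('x'))                   # {'type': 'int'}
```

All phases of the compiler would share one such table: the semantic analyzer consults lookup for types, and an error is reported when lookup returns nothing.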