The Different Phases of A Compiler

The different phases of a compiler are lexical analysis, syntax analysis, intermediate code generation, code optimization, and code generation, which transform the source program from one representation to another. Lexical analysis breaks the source code into tokens, syntax analysis imposes structure on tokens through syntax trees, and subsequent phases generate optimized intermediate and target codes. Context-free grammars using BNF and its extensions EBNF and syntax diagrams are described for specifying programming language syntax.

Uploaded by

Priyanka Parihar
Copyright
© Attribution Non-Commercial (BY-NC)

The different phases of a compiler:

Conceptually, a compiler operates in phases, each of which transforms the source program from one representation to another.

Fig. 2. Phases of a compiler

The program that performs lexical analysis is called a lexical analyzer, scanner, or tokenizer. Its purpose is to break a sequence of characters into subsequences called tokens. The syntax analysis phase, performed by the parser, reads tokens and validates them in accordance with a grammar. The vocabulary, i.e., the set of predefined tokens, is composed of word symbols (reserved words), names (identifiers), numerals (constants), and special symbols (operators). During compilation, a compiler will find errors such as lexical, syntax, semantic, and logical errors. If a token is found that does not belong to the vocabulary, it is a lexical error. A grammar dictates the syntax of a language.

The first three phases form the bulk of the analysis portion of a compiler. Symbol table management and error handling are shown interacting with the six phases.

The Analysis Phases

As translation progresses, the compiler's internal representation of the source program changes. Consider the statement:

position := initial + rate * 10

The lexical analysis phase reads the characters in the source program and groups them into a stream of tokens, in which each token represents a logically cohesive sequence of characters, such as an identifier or a keyword. The character sequence forming a token is called the lexeme for the token. Certain tokens will be augmented by a lexical value. For example, for any identifier the lexical analyzer generates not only the token id but also enters the lexeme into the symbol table, if it is not already present there. The lexical value associated with this occurrence of id points to the symbol table entry for this lexeme. The representation of the statement given above after lexical analysis would be:

id1 := id2 + id3 * 10
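This grouping step can be sketched in a few lines of Python. The token names and the regular expressions below are illustrative assumptions, not the specification of any particular compiler:

```python
import re

# Illustrative token specification; order matters (most specific first).
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),
    ("ID",     r"[A-Za-z_]\w*"),
    ("ASSIGN", r":="),
    ("OP",     r"[+\-*/]"),
    ("SKIP",   r"\s+"),
]

def tokenize(source):
    """Group the character stream into (token, lexeme) pairs."""
    pattern = "|".join(f"(?P<{name}>{regex})" for name, regex in TOKEN_SPEC)
    tokens = []
    for match in re.finditer(pattern, source):
        kind = match.lastgroup
        if kind != "SKIP":          # whitespace only separates tokens
            tokens.append((kind, match.group()))
    return tokens

print(tokenize("position := initial + rate * 10"))
```

Running this on the statement above yields seven tokens, with `position`, `initial`, and `rate` all classified as identifiers; a real lexical analyzer would additionally enter each identifier's lexeme into the symbol table.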

Syntax analysis imposes a hierarchical structure on the token stream, which is shown by syntax trees (fig.).

Intermediate Code Generation

After syntax and semantic analysis, some compilers generate an explicit intermediate representation of the source program. This intermediate representation can have a variety of forms. In three-address code, the source program might look like this:

temp1 := inttoreal(10)
temp2 := id3 * temp1
temp3 := id2 + temp2
id1 := temp3

Code Optimisation

The code optimization phase attempts to improve the intermediate code, so that faster-running machine code will result. Some optimizations are trivial. There is great variation in the amount of code optimization different compilers perform. In those that do the most, called optimising compilers, a significant fraction of the compiler's time is spent on this phase.
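One of the trivial optimizations mentioned above is constant folding. A minimal sketch in Python, assuming a made-up tuple format (dest, op, arg1, arg2) for three-address instructions:

```python
def fold_constants(code):
    """Replace instructions whose operands are all compile-time constants.

    Each instruction is a tuple (dest, op, arg1, arg2); constants are ints.
    Returns the remaining instructions with known values substituted in.
    """
    known = {}   # names whose value is a compile-time constant
    out = []
    for dest, op, a1, a2 in code:
        a1 = known.get(a1, a1)           # substitute already-folded values
        a2 = known.get(a2, a2)
        if isinstance(a1, int) and isinstance(a2, int):
            known[dest] = {"+": a1 + a2, "*": a1 * a2}[op]
        else:
            out.append((dest, op, a1, a2))
    return out

# temp1 := 2 * 10 folds away; temp2 := id3 + temp1 becomes id3 + 20
optimized = fold_constants([
    ("temp1", "*", 2, 10),
    ("temp2", "+", "id3", "temp1"),
])
print(optimized)   # [('temp2', '+', 'id3', 20)]
```

The multiplication is performed once at compile time instead of every time the program runs, which is exactly the kind of small improvement the optimization phase accumulates.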

Code Generation

The final phase of the compiler is the generation of target code, consisting normally of relocatable machine code or assembly code. Memory locations are selected for each of the variables used by the program. Then, intermediate instructions are each translated into a sequence of machine instructions that perform the same task. A crucial aspect is the assignment of variables to registers.

Context-free grammar
In formal language theory, a context-free grammar (CFG) is a formal grammar in which every production rule is of the form

V → w

where V is a single nonterminal symbol, and w is a string of terminals and/or nonterminals (w can be empty). The languages generated by context-free grammars are known as the context-free languages. Context-free grammars are important in linguistics for describing the structure of sentences and words in natural language, and in computer science for describing the structure of programming languages and other artificial languages.

Notations for context-free grammars

BNF
Syntax Diagrams (or Charts or Graphs)
EBNF

BNF

BNF = Backus Normal Form or Backus Naur Form. The first published version looked like:

<number> ::= <digit> | <number> <digit>
<digit> ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

which can be read as something like:

"a number is a digit, or any number followed by an extra digit" (which is a contorted but precise way of saying that a number consists of one or more digits)
"a digit is any one of the characters 0, 1, 2, ... 9"

There are many variants, all equivalent to the original definition, such as:

<number> = <digit> | <number> <digit> .
<digit> = 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 .

number = digit .
number = number digit .
digit = '0' .
digit = '1' .
(etc.)
digit = '9' .

number = digit | number digit .
digit = '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9' .

number : digit | number digit ;
digit : '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9' ;

number = digit | number digit
digit = '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9'

<number> = <digit> | <number> <digit> .
<digit> = '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9' .
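The <number> grammar above can be checked directly. Because the left-recursive rule is just a roundabout way of saying "one or more digits", a sketch recognizer in Python (the function name is illustrative) needs only a loop rather than recursion:

```python
DIGITS = "0123456789"

def is_number(s):
    """Recognize the language of <number> ::= <digit> | <number> <digit>.

    The left recursion unfolds to 'one or more digits', so a scan suffices.
    """
    return len(s) > 0 and all(ch in DIGITS for ch in s)

print(is_number("2024"))   # True
print(is_number(""))       # False: a number needs at least one digit
print(is_number("12a"))    # False: 'a' is not a <digit>
```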

Every variant has to distinguish between the names of grammar rules, such as number, and actual characters that can appear in the text of a program, such as 0123456789. Usually, one or the other is quoted, e.g. <number> or '0123456789' (or "0123456789") so the other can be left unquoted. Sometimes both are quoted, as in the last example above. However, if we decide to quote the names of rules and leave the actual characters unquoted (as in the first example above), a problem can arise if the metacharacters (i.e. the characters such as ' < > . | used to punctuate the rules) can appear as actual characters in the programming language. Therefore, the most widely used variants of BNF usually quote the actual characters, e.g.:

logical_expression = expression '|' expression | expression '&' expression .
real_number = number '.' number .

e.g. ANSI C syntax from K&R using a BNF (the "%token" line lists things that are assumed to be simple enough to leave out, such as names and numbers. The form of BNF used is essentially that accepted by yacc/bison.)

Syntax Diagrams (or Charts or Graphs)

Unlike BNF, this kind of notation does not seem to have a commonly agreed-on name. "Syntax diagrams" are also known as "Railway Tracks" or "Railroad Diagrams". Whatever they are called, they do not allow us to write anything that can't be written in BNF; they just make the grammar easier to understand. e.g. (this is a crude attempt to give you an idea of what the Pascal version looks like - the real thing looks a lot better!):

[ASCII syntax diagram for IDENTIFIER: a LETTER box, followed by a loop that passes back through either a LETTER box or a DIGIT box]

which can be read as something like: "an identifier consists of a letter, possibly followed by any number of letters and/or digits"

The names of diagrams/rules, such as letter and digit, appear in rectangles, and actual characters appear in circles or in boxes with rounded ends. You can see many (much better-drawn) examples of this particular style at The BNF Web Club e.g. (gif) and in the documentation for Ebnf2ps: Peter's Syntax Diagram Drawing Tool.

EBNF

EBNF = Extended BNF. Like syntax diagrams, EBNF does not allow us to write anything that can't be written in BNF; it just makes the grammar easier to understand. Almost all variants of EBNF use brackets ( ) for the usual mathematical meaning of grouping items together. There are (at least) three main styles of EBNF:

Derived from regular expressions (e.g. * is the Kleene star): a trailing ? means an optional item; trailing + and * mean repeat 1 or more times, or 0 or more times, respectively. e.g.

number = digit+ .
identifier = letter (letter | digit)* .
functioncall = functionname "(" parameterlist? ")" .

Based on Wirth's definition: [ ] means an optional item, { } means repeat 0 or more times. e.g.

number = digit {digit} .
identifier = letter {letter | digit} .
functioncall = functionname "(" [parameterlist] ")" .
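The regular-expression style of EBNF maps directly onto actual regular expressions. A sketch of the identifier rule in Python, assuming letter means an ASCII letter and digit a decimal digit:

```python
import re

# identifier = letter (letter | digit)*   -- the EBNF rule, regex style
IDENTIFIER = re.compile(r"[A-Za-z][A-Za-z0-9]*\Z")

def is_identifier(s):
    """True if s matches identifier = letter (letter | digit)* in full."""
    return IDENTIFIER.match(s) is not None

print(is_identifier("rate2"))   # True
print(is_identifier("2rate"))   # False: must start with a letter
```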

Bootstrapping

Bootstrapping or booting refers to a group of metaphors that share a common meaning: a self-sustaining process that proceeds without external help. The term is often attributed to Rudolf Erich Raspe's story The Surprising Adventures of Baron Munchausen, where the main character pulls himself out of a swamp by his hair (specifically, his pigtail), but the Baron does not, in fact, pull himself out by his bootstraps. Instead, the phrase appears to have originated in the early 19th century United States (particularly in the sense "pull oneself over a fence by one's bootstraps"), in the sense of being an absurdly impossible feat.

Applications

Computing: The computer term bootstrap began as a metaphor in the 1950s. In computers, pressing a bootstrap button caused a hardwired program to read a bootstrap program from an input unit. The computer would then execute the bootstrap program, which caused it to read more program instructions. It became a self-sustaining process that proceeded without external help from manually entered instructions. As a computing term, bootstrap has been used since at least 1953.

Business: Bootstrapping in business means starting a business without external help or capital. Such startups fund the development of their company through internal cash flow and are cautious with their expenses. Generally, at the start of a venture, a small amount of money will be set aside for the bootstrap process.

Biology: Richard Dawkins in his book River Out of Eden used the computer bootstrapping concept to explain how biological cells differentiate: "Different cells receive different combinations of chemicals, which switch on different combinations of genes, and some genes work to switch other genes on or off. And so the bootstrapping continues, until we have the full repertoire of different kinds of cells."

Phylogenetics: Bootstrapping analysis gives a way to judge the strength of support for nodes on phylogenetic trees.
A number is shown at each node, reflecting the percentage of bootstrap trees that also resolve that clade.

Cross compiler

A cross compiler is a compiler capable of creating executable code for a platform other than the one on which the compiler is run. Cross compiler tools are used to generate executables for embedded systems or multiple platforms. A cross compiler is used to compile for a platform on which it is not feasible to do the compiling, such as microcontrollers that don't support an operating system. It has become more common to use cross compilers for paravirtualization, where a system may have one or more platforms in use.

Symbol table management and Error Detection

Symbol table management

An essential function of a compiler is to record the identifiers used in the source program and collect information about various attributes of each identifier. A symbol table is a data structure containing a record for each identifier, with fields for the attributes of the identifier. The data structure allows us to find the record for each identifier quickly and to store or retrieve data from that record quickly. When an identifier in the source program is detected by the lexical analyzer, the identifier is entered into the symbol table.

Error Detection and Reporting

Each phase can encounter errors. A compiler that stops when it finds the first error is not as helpful as it could be. The syntax and semantic analysis phases usually handle a large fraction of the errors detectable by the compiler. The lexical phase can detect errors where the characters remaining in the input do not form any token of the language. Errors when the token stream violates the syntax of the language are determined by the syntax analysis phase. During semantic analysis the compiler tries to detect constructs that have the right syntactic structure but no meaning to the operation involved.

FIRST AND FOLLOW

FIRST(A) is (defined to be) the set of terminals that can appear in the first position of any string derived from A. FOLLOW(A) is the set of terminals that can appear immediately to the right of A in some sentential form derivable from the start symbol.

Rules for FIRST sets:
1. If X is a terminal then FIRST(X) is just {X}.
2. If there is a production X → ε then add ε to FIRST(X).
3. If there is a production X → Y1Y2..Yk then add FIRST(Y1Y2..Yk) to FIRST(X).
4. FIRST(Y1Y2..Yk) is either:
   1. FIRST(Y1), if FIRST(Y1) doesn't contain ε;
   2. or, if FIRST(Y1) does contain ε, then FIRST(Y1Y2..Yk) is everything in FIRST(Y1) except for ε, as well as everything in FIRST(Y2..Yk);
   3. if FIRST(Y1), FIRST(Y2), .., FIRST(Yk) all contain ε, then add ε to FIRST(Y1Y2..Yk) as well.
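The FIRST rules above can be implemented as a fixed-point iteration. A sketch in Python, assuming a grammar encoded as a dict from nonterminal to a list of right-hand sides (tuples of symbols, with the empty tuple for an ε-production, and "" standing for ε):

```python
def compute_first(grammar):
    """Compute FIRST sets for {nonterminal: [right-hand sides]}.

    Symbols that are not keys of the grammar are terminals.
    """
    first = {nt: set() for nt in grammar}

    def first_of(symbols):
        """FIRST of a string Y1 Y2 .. Yk, per rule 4."""
        result = set()
        for sym in symbols:
            if sym not in grammar:        # rule 1: terminal
                result.add(sym)
                return result
            result |= first[sym] - {""}
            if "" not in first[sym]:      # Y_i cannot vanish: stop here
                return result
        result.add("")                    # all of Y1..Yk can derive epsilon
        return result

    changed = True
    while changed:                        # iterate rules 2-3 to a fixed point
        changed = False
        for nt, productions in grammar.items():
            for prod in productions:
                new = first_of(prod)
                if not new <= first[nt]:
                    first[nt] |= new
                    changed = True
    return first

# E -> T E';  E' -> + T E' | epsilon;  T -> id
grammar = {
    "E":  [("T", "E'")],
    "E'": [("+", "T", "E'"), ()],
    "T":  [("id",)],
}
print(compute_first(grammar))
```

For this grammar the iteration yields FIRST(T) = {id}, FIRST(E') = {+, ε}, and FIRST(E) = {id}.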

Rules for FOLLOW sets:
1. First put $ (the end-of-input marker) in FOLLOW(S), where S is the start symbol.
2. If there is a production A → aBb (where a and b can be whole strings), then everything in FIRST(b) except for ε is placed in FOLLOW(B).
3. If there is a production A → aB, then everything in FOLLOW(A) is in FOLLOW(B).
4. If there is a production A → aBb, where FIRST(b) contains ε, then everything in FOLLOW(A) is in FOLLOW(B).
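The FOLLOW rules can be sketched the same way, assuming the same illustrative grammar encoding and taking the already-computed FIRST sets as input ("" again stands for ε):

```python
def compute_follow(grammar, first, start):
    """Compute FOLLOW sets given FIRST sets and the start symbol."""
    follow = {nt: set() for nt in grammar}
    follow[start].add("$")                # rule 1: $ follows the start symbol

    def first_of(symbols):
        """FIRST of a string of symbols, using the given FIRST sets."""
        result = set()
        for sym in symbols:
            if sym not in grammar:        # terminal
                result.add(sym)
                return result
            result |= first[sym] - {""}
            if "" not in first[sym]:
                return result
        result.add("")
        return result

    changed = True
    while changed:                        # iterate rules 2-4 to a fixed point
        changed = False
        for a, productions in grammar.items():
            for prod in productions:
                for i, b in enumerate(prod):
                    if b not in grammar:
                        continue          # FOLLOW is only for nonterminals
                    tail = first_of(prod[i + 1:])
                    new = tail - {""}     # rule 2: FIRST of what follows B
                    if "" in tail:        # rules 3 and 4: the tail can vanish
                        new |= follow[a]
                    if not new <= follow[b]:
                        follow[b] |= new
                        changed = True
    return follow

grammar = {
    "E":  [("T", "E'")],
    "E'": [("+", "T", "E'"), ()],
    "T":  [("id",)],
}
first = {"E": {"id"}, "E'": {"+", ""}, "T": {"id"}}
print(compute_follow(grammar, first, "E"))
```

For this grammar the result is FOLLOW(E) = {$}, FOLLOW(E') = {$}, and FOLLOW(T) = {+, $}: T can be followed by + (via rule 2 applied to E' → + T E') or by whatever follows E (via rules 3 and 4, since E' can derive ε).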
