Program Compilation Lec 7


Program compilation

A translator or language processor is a program that translates an input program written in a programming language into an equivalent program in another language. A compiler is a type of translator that takes a program written in a high-level programming language as input and translates it into an equivalent program in a low-level language such as machine language or assembly language.
The program written in a high-level language is known as a source
program, and the program converted into a low-level language is known
as an object (or target) program. Without compilation, no program
written in a high-level language can be executed. For every programming
language, we have a different compiler; however, the basic tasks
performed by every compiler are the same. The process of translating the
source code into machine code involves several stages, including lexical
analysis, syntax analysis, semantic analysis, code generation, and
optimization.

Stages of Compiler Design


1. Lexical Analysis: The first stage of compiler design is lexical
analysis, also known as scanning. In this stage, the compiler
reads the source code character by character and breaks it down
into a series of tokens, such as keywords, identifiers, and
operators. These tokens are then passed on to the next stage of
the compilation process.
2. Syntax Analysis: The second stage of compiler design is syntax
analysis, also known as parsing. In this stage, the compiler
checks the syntax of the source code to ensure that it conforms
to the rules of the programming language. The compiler builds a
parse tree, which is a hierarchical representation of the
program’s structure, and uses it to check for syntax errors.
3. Semantic Analysis: The third stage of compiler design
is semantic analysis. In this stage, the compiler checks the
meaning of the source code to ensure that it makes sense. The
compiler performs type checking, which ensures that variables
are used correctly and that operations are performed on
compatible data types. The compiler also checks for other
semantic errors, such as undeclared variables and incorrect
function calls.
4. Code Generation: The fourth stage of compiler design is code
generation. In this stage, the compiler translates the parse tree
into machine code that can be executed by the computer. The
code generated by the compiler must be efficient and optimized
for the target platform.
5. Optimization: The final stage of compiler design is
optimization. In this stage, the compiler analyzes the generated
code and makes optimizations to improve its performance.

Lexical Analysis
Lexical Analysis is the first phase of the compiler, also known as scanning. It converts the high-level input program into a sequence of tokens.
 Lexical analysis can be implemented with a deterministic finite automaton (DFA); a minimal sketch of such an automaton is shown below.
 The output is a sequence of tokens that is sent to the parser for syntax analysis.
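
For instance, a two-state DFA that recognizes C-style identifiers can be written directly in C. The sketch below is only an illustration; the function name is_identifier and the two-state encoding are assumptions made for this example, not part of any real compiler:

#include <ctype.h>
#include <stdio.h>

/* A two-state DFA that accepts C-style identifiers:                      */
/*   state 0 (start):  letter or '_' -> state 1, anything else -> reject  */
/*   state 1 (accept): letter, digit or '_' -> state 1, else -> reject    */
static int is_identifier(const char *s) {
    int state = 0;
    for (; *s != '\0'; s++) {
        unsigned char c = (unsigned char)*s;
        if (state == 0) {
            if (isalpha(c) || c == '_') state = 1;
            else return 0;
        } else {
            if (!isalnum(c) && c != '_') return 0;
        }
    }
    return state == 1;   /* accept only if we ended in the accepting state */
}

int main(void) {
    printf("%d %d %d\n", is_identifier("abs_zero_Kelvin"),   /* 1: accepted */
                         is_identifier("273"),               /* 0: rejected */
                         is_identifier("_tmp1"));            /* 1: accepted */
    return 0;
}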

What is a token? A lexical token is a sequence of characters that can be treated as a unit in the grammar of a programming language.

Example of tokens:
 Keywords: for, while, if, etc.
 Identifiers: variable names, function names, etc.
 Operators: '+', '++', '-', etc.
 Separators: ',', ';', etc.
Example of Non-Tokens:
 Comments, preprocessor directive, blanks, tabs, newline, etc.
Lexeme: The sequence of characters matched by a pattern to form the corresponding token, or a sequence of input characters that comprises a single token, is called a lexeme, e.g. “float”, “abs_zero_Kelvin”, “=”, “-”, “273”, “;”.
How the lexical analyzer works:
1. Input preprocessing: This stage involves cleaning up the input
text and preparing it for lexical analysis. This may include
removing comments, whitespace, and other non-essential
characters from the input text.
2. Tokenization: This is the process of breaking the input text into
a sequence of tokens. This is usually done by matching the
characters in the input text against a set of patterns or regular
expressions that define the different types of tokens.
3. Token classification: In this stage, the lexer determines the type
of each token. For example, in a programming language, the
lexer might classify keywords, identifiers, operators, and
punctuation symbols as separate token types.
4. Token validation: In this stage, the lexer checks that each token
is valid according to the rules of the programming language. For
example, it might check that a variable name is a valid identifier,
or that an operator has the correct syntax.
5. Output generation: In this final stage, the lexer generates the
output of the lexical analysis process, which is typically a list of
tokens. This list of tokens can then be passed to the next stage of
compilation or interpretation.
 The lexical analyzer identifies errors with the help of the automaton and the grammar of the given language on which it is based (such as C or C++), and reports the row number and column number of the error.
Suppose we pass the statement a = b + c; through the lexical analyzer.
It will generate a token sequence like this: id = id + id;
where each id refers to its variable in the symbol table, which references all of its details. For example, consider the program
int main()
{
// 2 variables
int a, b;
a = 10;
return 0;
}
All the valid tokens are:
'int' 'main' '(' ')' '{' 'int' 'a' ',' 'b' ';'
'a' '=' '10' ';' 'return' '0' ';' '}'
Above are the valid tokens. You can observe that we have omitted comments. As another example, consider a simple printf statement such as:
printf("hello");
There are 5 valid tokens in this printf statement: printf, (, the string literal "hello", ), and ;.


Exercise 1: Count number of tokens:
int main()
{
int a = 10, b = 20;
printf("sum is:%d",a+b);
return 0;
}
Answer: Total number of tokens: 27.
Exercise 2: Count number of tokens: int max(int i);
 The lexical analyzer first reads int, finds it to be valid and accepts it as a token.
 max is read and found to be a valid function name after reading (.
 int is also a token, then i is another token, and finally ;.
Answer: Total number of tokens: 7:
int, max, (, int, i, ), ;
 For example, for the statement while (a >= b) a = a - 2; we can represent the lexemes and tokens as under:

Lexemes    Tokens          Lexemes    Tokens
while      WHILE           a          IDENTIFIER
(          LPAREN          =          ASSIGNMENT
a          IDENTIFIER      a          IDENTIFIER
>=         COMPARISON      -          ARITHMETIC
b          IDENTIFIER      2          INTEGER
)          RPAREN          ;          SEMICOLON
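
The tokenization and classification steps described above can be sketched as a small hand-written scanner. The C program below is a simplified illustration under several assumptions: it recognizes only identifiers, integer constants, single-character operators and separators, it does not separate keywords from identifiers, and it ignores string literals and comments:

#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Simplified token categories, assumed only for this illustration. */
typedef enum { TOK_IDENTIFIER, TOK_INTEGER, TOK_OPERATOR, TOK_SEPARATOR } TokenType;

static const char *type_name(TokenType t) {
    switch (t) {
    case TOK_IDENTIFIER: return "IDENTIFIER";
    case TOK_INTEGER:    return "INTEGER";
    case TOK_OPERATOR:   return "OPERATOR";
    default:             return "SEPARATOR";
    }
}

/* Scan the input string and print one classified token per line. */
static void tokenize(const char *src) {
    const char *p = src;
    while (*p != '\0') {
        if (isspace((unsigned char)*p)) {                      /* non-tokens: blanks, tabs, newlines */
            p++;
        } else if (isalpha((unsigned char)*p) || *p == '_') {  /* identifier (or keyword) */
            const char *start = p;
            while (isalnum((unsigned char)*p) || *p == '_') p++;
            printf("%-10s %.*s\n", type_name(TOK_IDENTIFIER), (int)(p - start), start);
        } else if (isdigit((unsigned char)*p)) {               /* integer constant */
            const char *start = p;
            while (isdigit((unsigned char)*p)) p++;
            printf("%-10s %.*s\n", type_name(TOK_INTEGER), (int)(p - start), start);
        } else if (strchr("+-*/=<>", *p)) {                    /* single-character operators */
            printf("%-10s %c\n", type_name(TOK_OPERATOR), *p++);
        } else {                                               /* separators: ( ) { } , ; etc. */
            printf("%-10s %c\n", type_name(TOK_SEPARATOR), *p++);
        }
    }
}

int main(void) {
    tokenize("a = b + c;");   /* prints IDENTIFIER a, OPERATOR =, IDENTIFIER b, ... */
    return 0;
}

Running it on the statement a = b + c; prints one classified token per line, mirroring the id = id + id sequence discussed earlier.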

Syntax Analysis

Syntax Analysis or Parsing is the second phase, i.e. after lexical analysis. It checks the syntactical structure of the given input, i.e. whether the given input is in the correct syntax (of the language in which the input has been written) or not. It does so by building a data structure called a parse tree or syntax tree. The parse tree is constructed by using the pre-defined grammar of the language and the input string. If the given input string can be produced with the help of the syntax tree (in the derivation process), the input string is in the correct syntax; if not, an error is reported by the syntax analyzer. The main goal of syntax analysis is to create a parse tree or abstract syntax tree (AST) of the source code, which is a hierarchical representation of the source code that reflects the grammatical structure of the program.

Features of syntax analysis:

Syntax Trees: Syntax analysis creates a syntax tree, which is a hierarchical representation of the code’s structure. The tree shows the relationship between the various parts of the code, including statements, expressions, and operators.
Context-Free Grammar: Syntax analysis uses a context-free grammar to define the syntax of the programming language. A context-free grammar is a formalism used to describe the structure of programming languages.
Error Detection: Syntax analysis is responsible for detecting syntax
errors in the code. If the code does not conform to the rules of the
programming language, the parser will report an error and halt the
compilation process.
Intermediate Code Generation: Syntax analysis generates an
intermediate representation of the code, which is used by the subsequent
phases of the compiler. The intermediate representation is usually a more
abstract form of the code, which is easier to work with than the original
source code.
Optimization: Syntax analysis can perform basic optimizations on the
code, such as removing redundant code and simplifying expressions.
A pushdown automaton (PDA) is used to design the syntax analysis phase.

The Grammar for a Language consists of Production rules.


Example: Suppose Production rules for the Grammar of a language are:
S -> cAd

A -> bc|a

And the input string is “cad”.

Now the parser attempts to construct a syntax tree from this grammar for the given input string. It uses the given production rules and applies them as needed to generate the string “cad”. The parser first expands S to cAd and matches the leading c. It then tries the production A -> bc, but this is not a suitable one to apply (the string produced would be “cbcd”, not “cad”), so the parser needs to backtrack and apply the next production rule available for A, namely A -> a, after which the trailing d is matched and the string “cad” is produced.
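
This backtracking can be sketched as a tiny recursive-descent parser for the example grammar. The following C program is only a minimal sketch; the function names parse_S, parse_A and match are illustrative, not a standard API:

#include <stdio.h>

/* Grammar from the example:  S -> c A d    A -> b c | a            */
/* 'pos' is the current position in the input string being matched. */
static const char *input;
static int pos;

static int match(char expected) {
    if (input[pos] == expected) { pos++; return 1; }
    return 0;
}

/* A -> b c | a : try the first alternative, backtrack on failure. */
static int parse_A(void) {
    int saved = pos;
    if (match('b') && match('c')) return 1;   /* A -> b c succeeded */
    pos = saved;                              /* backtrack          */
    return match('a');                        /* try A -> a         */
}

/* S -> c A d */
static int parse_S(void) {
    return match('c') && parse_A() && match('d');
}

int main(void) {
    input = "cad";
    pos = 0;
    if (parse_S() && input[pos] == '\0')
        printf("\"%s\" is in the correct syntax\n", input);
    else
        printf("syntax error\n");
    return 0;
}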

Semantic Analysis
Semantic Analysis is the third phase of the compiler. It makes sure that the declarations and statements of the program are semantically correct. It is a collection of procedures which is called by the parser as and when required by the grammar. Both the syntax tree of the previous phase and the symbol table are used to check the consistency of the given code.
Type checking is an important part of semantic analysis, where the compiler makes sure that each operator has matching operands.
Semantic Analyzer:
It uses the syntax tree and the symbol table to check whether the given program is semantically consistent with the language definition. It gathers type information and stores it in either the syntax tree or the symbol table. This type information is subsequently used by the compiler during intermediate-code generation.
Semantic Errors:
Errors recognized by semantic analyzer are as follows:
 Type mismatch
 Undeclared variables
 Reserved identifier misuse
Functions of Semantic Analysis:
1. Type Checking –
Ensures that data types are used in a way consistent with their
definition.
2. Label Checking –
Ensures that every label referenced in the program is defined.
3. Flow Control Check –
Keeps a check that control structures are used in a proper manner (for example, no break statement outside a loop).
Example:
float x = 10.1;

float y = x*30;
In the above example, the integer 30 will be type-cast to the float value 30.0 before multiplication by the semantic analyzer.
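
A semantic analyzer typically implements such promotions as type rules attached to the operator nodes of the syntax tree. The C fragment below is a simplified sketch of one such rule; the two-value type system and the function check_binary are assumptions made only for this illustration:

#include <stdio.h>

/* Toy type system used only for this illustration. */
typedef enum { TY_INT, TY_FLOAT, TY_ERROR } Type;

/* Result type of a binary arithmetic operator: if either operand is a    */
/* float, the other is implicitly promoted, mirroring the x * 30 example. */
static Type check_binary(Type left, Type right) {
    if (left == TY_ERROR || right == TY_ERROR) return TY_ERROR;
    if (left == TY_FLOAT || right == TY_FLOAT) return TY_FLOAT;  /* int promoted to float */
    return TY_INT;
}

int main(void) {
    Type x = TY_FLOAT;                 /* float x = 10.1;   */
    Type thirty = TY_INT;              /* the literal 30    */
    Type y = check_binary(x, thirty);  /* float y = x * 30; */
    printf("type of x * 30 is %s\n", y == TY_FLOAT ? "float" : "int");
    return 0;
}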

Intermediate code generator

In the analysis-synthesis model of a compiler, the front end of a compiler translates a source program into an independent intermediate code, and then the back end of the compiler uses this intermediate code to generate the target code (which can be understood by the machine). The benefits of using machine-independent intermediate code are:

 Because of the machine-independent intermediate code, portability is enhanced. For example, if a compiler translates the source language directly to its target machine language without having the option of generating intermediate code, then for each new machine a full native compiler is required. But if we have a machine-independent intermediate code, we need only one optimizer. Intermediate code can be either language-specific (e.g., bytecode for Java) or language-independent (three-address code). The following are commonly used intermediate code representations:
1. Postfix Notation: Also known as reverse Polish notation or suffix notation. The ordinary (infix) way of writing the sum of a and b places the operator in the middle: a + b. The postfix notation for the same expression places the operator at the right end: ab+. In general, if e1 and e2 are any postfix expressions and + is any binary operator, the result of applying + to the values denoted by e1 and e2 is written in postfix notation as e1 e2 +. No parentheses are needed in postfix notation because the position and arity (number of arguments) of the operators permit only one way to decode a postfix expression. In postfix notation, the operator follows the operands.
Example 1: The postfix representation of the expression (a + b) * c is: ab+c*
Example 2: The postfix representation of the expression (a – b) * (c + d) + (a – b) is: ab-cd+*ab-+

2. Syntax Tree: A syntax tree is nothing more than a condensed form of a parse tree. The operator and keyword nodes of the parse tree are moved to their parents, and a chain of single productions is replaced by a single link. In the syntax tree, the internal nodes are operators and the child nodes are operands. To form a syntax tree, put parentheses in the expression; this way it is easy to recognize which operand should come first.
Example: x = (a + b * c) / (a – b * c)
A small sketch that builds this syntax tree by hand and prints its postfix form is given below.
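
As a rough sketch of both representations, the following C program builds the syntax tree for (a + b * c) / (a - b * c) by hand and emits its postfix form with a post-order traversal; the Node structure and helper functions are assumptions made for this illustration only:

#include <stdio.h>
#include <stdlib.h>

/* Toy syntax-tree node: operators are internal nodes, operands are leaves. */
typedef struct Node {
    char value;                 /* 'a', 'b', 'c' or '+', '-', '*', '/' */
    struct Node *left, *right;  /* NULL for leaf operands */
} Node;

static Node *node(char value, Node *left, Node *right) {
    Node *n = malloc(sizeof *n);   /* no error handling or freeing in this sketch */
    n->value = value;
    n->left = left;
    n->right = right;
    return n;
}

/* A post-order traversal of the syntax tree yields the postfix form. */
static void emit_postfix(const Node *n) {
    if (n == NULL) return;
    emit_postfix(n->left);
    emit_postfix(n->right);
    putchar(n->value);
}

int main(void) {
    /* (a + b * c) / (a - b * c) */
    Node *lhs = node('+', node('a', NULL, NULL),
                          node('*', node('b', NULL, NULL), node('c', NULL, NULL)));
    Node *rhs = node('-', node('a', NULL, NULL),
                          node('*', node('b', NULL, NULL), node('c', NULL, NULL)));
    emit_postfix(node('/', lhs, rhs));   /* prints abc*+abc*-/ */
    putchar('\n');
    return 0;
}

The traversal visits both operands before their operator, which is exactly why the printed string abc*+abc*-/ needs no parentheses.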
Code Optimization

Code optimization in the synthesis phase is a program transformation technique which tries to improve the intermediate code by making it consume fewer resources (i.e. CPU, memory) so that faster-running machine code will result. The compiler's optimizing process should meet the following objectives:
 The optimization must be correct: it must not, in any way, change the meaning of the program.
 Optimization should increase the speed and performance of the
program.
 The compilation time must be kept reasonable.
 The optimization process should not delay the overall compiling
process.

When to Optimize?

Optimization of the code is often performed at the end of the development stage, since it reduces readability and adds code that exists only to increase performance.

Why Optimize?

Choosing a better algorithm is beyond the scope of the code optimization phase; instead, the program as written is transformed, which may also involve reducing the size of the code. Optimization helps to:

 Reduce the space consumed by the code and increase the execution speed of the program.
 Manually analyzing datasets involves a lot of time; hence we make use of software like Tableau for data analysis. Similarly, manually performing the optimization is tedious and is better done using a code optimizer.
 Optimized code often promotes re-usability.
Types of Code Optimization: The optimization process can be broadly
classified into two types :
1. Machine Independent Optimization: This code optimization phase attempts to improve the intermediate code to get a better target code as the output. The part of the intermediate code which is transformed here does not involve any CPU registers or absolute memory locations (a small before/after sketch follows this list).
2. Machine Dependent Optimization: Machine-dependent
optimization is done after the target code has been generated
and when the code is transformed according to the target
machine architecture. It involves CPU registers and may have
absolute memory references rather than relative references.
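
As a rough before/after illustration of a machine-independent transformation, the two C functions below show loop-invariant code motion. In a real compiler this rewrite is applied to the intermediate code rather than to the source; the function names here are made up for the example:

#include <stdio.h>

/* Before optimization: the expression limit * 2 is recomputed on every */
/* iteration even though its value never changes inside the loop.       */
int sum_before(const int *a, int n, int limit) {
    int sum = 0;
    for (int i = 0; i < n; i++) {
        if (a[i] < limit * 2)      /* loop-invariant computation */
            sum += a[i];
    }
    return sum;
}

/* After optimization: the invariant expression is hoisted out of the   */
/* loop (loop-invariant code motion); the multiplication could further  */
/* be strength-reduced to a shift.                                      */
int sum_after(const int *a, int n, int limit) {
    int sum = 0;
    int bound = limit * 2;         /* computed once, outside the loop */
    for (int i = 0; i < n; i++) {
        if (a[i] < bound)
            sum += a[i];
    }
    return sum;
}

int main(void) {
    int a[] = {1, 5, 9, 13};
    /* Both versions compute the same result; only the work done differs. */
    printf("%d %d\n", sum_before(a, 4, 5), sum_after(a, 4, 5));
    return 0;
}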

Target code generation


Target code generation in a compiler is the phase where the compiler takes
the high-level source code, performs the necessary analysis, and produces
code that can be executed on a specific target platform or conforming to
the target programming language. The generated target code is emitted as
machine code or written to a file in the target programming language. This
output can be in various forms, such as executable binaries or object files.
1. Input : Optimized Intermediate Representation.
2. Output : Target Code.
3. Task Performed : Register allocation methods and
optimization, assembly level code.
4. Method : Three popular strategies for register allocation and
optimization.
5. Implementation : Algorithms.
Target code generation deals with assembly language, converting the optimized code into a machine-understandable format. Target code can be machine-readable code or assembly code. Each line in the optimized code may map to one or more lines in the machine (or assembly) code, hence there is a 1:N mapping associated with them.

1 : N Mapping
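
A minimal sketch of this expansion, assuming a three-address intermediate instruction of the form result = left op right and the MOV/ADD/MUL style used in the examples below, could look like this (emit_binary is an illustrative helper, not a real compiler API):

#include <stdio.h>

/* Expand one three-address instruction  result = left op right  into */
/* several target instructions, illustrating the 1:N mapping.         */
static void emit_binary(const char *result, const char *left,
                        char op, const char *right) {
    const char *mnemonic = (op == '+') ? "ADD" :
                           (op == '-') ? "SUB" :
                           (op == '*') ? "MUL" : "DIV";
    printf("MOV R1, %s\n", left);        /* load left operand into a register  */
    printf("MOV R2, %s\n", right);       /* load right operand                 */
    printf("%s R1, R2\n", mnemonic);     /* perform the operation in registers */
    printf("MOV %s, R1\n", result);      /* store the result back to memory    */
}

int main(void) {
    emit_binary("t1", "c", '*', "d");    /* one intermediate line -> four target lines */
    return 0;
}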

Computations are generally assumed to be performed in high-speed memory locations known as registers. Performing operations on registers is efficient because registers are faster than cache memory. This feature is used effectively by compilers; however, registers are not available in large numbers and they are costly. Therefore we should try to use the minimum number of registers to keep the overall cost low.

Register Allocation :
Register allocation is the process of assigning program variables to registers and reducing the number of swaps in and out of the registers. Movement of variables between memory and registers is time consuming, and this is the main reason why registers are used: they are available within the processor and are the fastest accessible storage locations. A toy sketch of register assignment is given after the two examples below.
Example 1 (for the statement a = b + c * d):

Register assignment:
R1 <-- a
R2 <-- b
R3 <-- c
R4 <-- d

Generated code:
MOV R3, c
MOV R4, d
MUL R3, R4
MOV R2, b
ADD R2, R3
MOV R1, R2
MOV a, R1
Example 2 (for the statement e = f - g / h):

Register assignment:
R1 <-- e
R2 <-- f
R3 <-- g
R4 <-- h

Generated code:
MOV R3, g
MOV R4, h
DIV R3, R4
MOV R2, f
SUB R2, R3
MOV R1, R2
MOV e, R1
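
The examples above assume that every variable can be given its own register. When registers run out, the allocator has to decide which variables share a register and which are spilled to memory. The following C program is a toy, linear-scan style sketch; the live intervals and the two-register limit are made-up assumptions for illustration:

#include <stdio.h>

#define NUM_REGS 2

/* A variable with a precomputed live interval [start, end]; */
/* the intervals below are invented for this illustration.   */
typedef struct { const char *name; int start, end; int reg; } Var;

int main(void) {
    Var vars[] = {            /* sorted by the start of the live interval */
        {"a", 0, 4, -1}, {"b", 1, 2, -1}, {"c", 3, 5, -1}, {"d", 4, 6, -1},
    };
    int n = sizeof vars / sizeof vars[0];
    int free_at[NUM_REGS] = {0};   /* point at which each register becomes free again */

    /* Greedy assignment: give each variable the first register whose   */
    /* previous occupant is no longer live; otherwise spill to memory.  */
    for (int i = 0; i < n; i++) {
        for (int r = 0; r < NUM_REGS; r++) {
            if (free_at[r] <= vars[i].start) {
                vars[i].reg = r;
                free_at[r] = vars[i].end + 1;
                break;
            }
        }
        if (vars[i].reg >= 0)
            printf("%s -> R%d\n", vars[i].name, vars[i].reg + 1);
        else
            printf("%s -> spilled to memory\n", vars[i].name);
    }
    return 0;
}

With only two registers, the last variable whose live range overlaps both occupied registers has to be kept in memory, which is exactly the kind of swapping that register allocation tries to minimize.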

Advantages :
 Registers are fast, accessible storage.
 They allow computations to be performed directly in them.
 They reduce memory traffic.
 They reduce overall computation time.
Disadvantages :
 Registers are available only in small numbers, with very limited total capacity.
 Register sizes are fixed and vary from one processor to another.
 Managing registers is complicated.
 Their contents need to be saved and restored during context switches and procedure calls.
