Language Translation: Programming Tools


LANGUAGE TRANSLATION

When working with high-level languages, some method of translation is needed that will
act upon the source code to produce the desired machine code that will perform the
required operation and therefore get the job done. The general name for software that
translates ‘source code’ into code that can be run directly on a machine is a translator.
The situation can be summarized as follows:

Source code → Translation → Object code

Programming Tools
A compiler translates a program written in a high-level language into low-level
instructions before the program is executed. The commands that you write in a high-level
language are referred to as source code. The low-level instructions that result from
compiling the source code are referred to as object code. Some compilers produce
executable files that contain machine language instructions. Other compilers produce
files that contain intermediate language instructions. This process comprises three main
steps:

Source program → Lexical analysis → Syntax analysis → Code generation → Target program / object code
An intermediate language is a set of low-level instructions that can be converted easily
and quickly into machine language. For example, when the source code for a Java
program is compiled, it produces a file containing intermediate language instructions
called bytecode. This bytecode can be distributed to PC or Mac owners. The bytecode is
converted into machine language by software called a Java Virtual Machine (JVM)
when the program is run. The JVM for a PC converts the bytecode into machine language
instructions that work on a Pentium processor. The JVM for a Mac converts the bytecode
into machine language for the PowerPC processor.
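
The division of labour between compiler and virtual machine can be sketched in
miniature. The toy PUSH/ADD/MUL instruction set and the run_vm function below are
invented purely for illustration; real Java bytecode is far richer, but the principle is
the same: compile once into portable instructions, then let a machine-specific virtual
machine execute them.

# A minimal sketch of the bytecode idea (hypothetical toy instruction set).
# "Compilation" has turned the expression 2 + 3 * 4 into portable
# instructions; run_vm plays the role of a JVM on the local machine.

bytecode = [
    ("PUSH", 3),
    ("PUSH", 4),
    ("MUL", None),   # 3 * 4
    ("PUSH", 2),
    ("ADD", None),   # 2 + 12
]

def run_vm(program):
    stack = []
    for op, arg in program:
        if op == "PUSH":
            stack.append(arg)
        elif op == "ADD":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == "MUL":
            b, a = stack.pop(), stack.pop()
            stack.append(a * b)
    return stack.pop()

print(run_vm(bytecode))   # prints 14

The same bytecode list could be shipped to any machine; only run_vm would differ from
platform to platform.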

Microsoft’s .NET Framework also features an intermediate language. The .NET
Framework is a set of tools for building, deploying, and running software applications.
Programs constructed within the .NET Framework are compiled into an intermediate
language called MSIL (Microsoft Intermediate Language). MSIL is converted into
executable code by a platform-specific CLR (common language runtime) module. For
example, when you run the program on a PC, the PC version of the CLR compiles an
executable version of the program.

Interpreter vs. Compiler


An interpreter reads one high-level language or intermediate language instruction at a
time and converts it into machine language for immediate processing. After an
instruction is executed, the interpreter reads the next instruction, converts it into machine
language, and so forth. In other words, an interpreter executes each line of code as it
comes to it in the source program. In order to do this it has the following parts:
• Lexical analysis
• Syntax analysis
• Execution

The lexical analysis and syntax analysis phases are similar to those of a compiler, but the
interpreter does not generate any object code.
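
A minimal sketch of this read-and-execute cycle is shown below; the tiny two-statement
LET/PRINT language is invented for the illustration.

# Sketch of an interpreter's main loop: each source line is analysed and
# executed immediately, and no object code is ever produced.
# The LET/PRINT mini-language here is hypothetical.

variables = {}

def execute_line(line):
    tokens = line.split()               # crude lexical analysis
    if not tokens:
        return
    if tokens[0] == "LET":              # syntax: LET name = value
        variables[tokens[1]] = int(tokens[3])
    elif tokens[0] == "PRINT":          # syntax: PRINT name
        print(variables[tokens[1]])
    else:
        raise SyntaxError(f"Unknown statement: {line}")  # stop at the faulty line

program = ["LET x = 5", "PRINT x"]
for line in program:
    execute_line(line)                  # translate and run one line at a time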

Compiled programs typically run faster than interpreted programs because their
instructions have already been translated; no time is spent translating them while the
program runs.

In contrast to compiled programs, interpreted programs run more slowly and require
interpreter software on end-user computers. For example, any users who want to run a
Java program supplied as bytecode must have the Java Virtual Machine installed on their
computers so that the bytecode can be translated into machine code as the program runs.

The major advantage of an interpreter is its ability to simplify the process of testing and
debugging a program. As soon as a line containing an error is reached, program execution
stops, allowing the programmer to view the error and fix it before continuing.

Debugging a compiled program is a bit more complex. The compiler attempts to compile
the entire program and accumulates a list of errors. This list is displayed to the
programmer, who must then locate the errors in the program and correct them.

Advantages of a Compiler
• A compiled program will always execute more quickly than one that is interpreted, as
the interpreter has to understand every statement as it comes to it. This is most
noticeable when executing a loop in a program. The interpreter will have to
reinterpret each statement every time it goes through the loop.
• The target program (called the object program) can be stored on a disk and re-
executed without being recompiled.
• Programs can be distributed in machine code form. This stops the user from
modifying the program as they do not have access to the source code.

Advantages of an Interpreter
• Interpreters are useful for program development when execution speed is not
important. As the interpreter is in command of the execution process, debugging
features can be built in.
• It uses less memory than a compiler.

Compilation
It should be obvious from the above that the compiler itself must be a very complex piece
of software. Indeed, most compilers on large computer systems take many years to write.
The compilation process can be split up into several stages. These are as follows:

1. Lexical analysis
2. Syntax analysis
3. Code generation
4. Code optimisation
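
Before looking at each stage in turn, the pipeline can be pictured as four functions
chained together. The stage functions below are drastically simplified placeholders for a
toy one-line assignment language, not a real compiler API.

# The compilation pipeline as function composition. Every stage function
# here is a simplified placeholder, just enough to make the chain run.

def lexical_analysis(source):
    return source.replace("=", " = ").split()          # "x=1+2" -> tokens

def syntax_analysis(tokens):
    assert tokens[1] == "=", "expected an assignment"  # crude grammar check
    return {"target": tokens[0], "expr": tokens[2:]}   # tiny parse tree

def code_generation(tree):
    return [("STORE", tree["target"], " ".join(tree["expr"]))]

def code_optimisation(code):
    return code                                        # no-op in this sketch

def compile_program(source):
    tokens = lexical_analysis(source)   # stage 1: source text -> tokens
    tree = syntax_analysis(tokens)      # stage 2: tokens -> parse tree
    code = code_generation(tree)        # stage 3: parse tree -> target code
    return code_optimisation(code)      # stage 4: improve the code

print(compile_program("total=1+2"))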

Lexical Analysis
This is often the first stage of compilation. A lexical analyzer is the part of the compiler
program that breaks up the input presented to the compiler into chunks that are in a form
suitable to be analysed by the next stage of the compilation process.

For example, if different peripheral devices have been used to input a program into the
computer system, it is the job of the lexical analyzer to standardize the different source
codes into a form that is identical for the next stage of the compilation process. If two
identical programs were to be run on a computer, but one was stored on disk and the
other on magnetic tape, the codes that represent these programs might differ slightly.
The lexical analyzer ensures that these codes are changed into a form that is the same
whichever peripheral device is used to input source programs into the system.

When the strings of characters representing the source program are broken up into small
chunks, these chunks are often known as tokens. The source code has therefore been
tokenized. It is usual to remove all redundant parts of the source code (such as spaces
and comments) during this tokenization phase. It is also likely in many systems that key
words such as ‘ENDWHILE’ or ‘PROCEDURE’ etc. will be replaced by more efficient,
shorter tokens. It is the job of the lexical analyzer to check that all the key words used are
valid, and to group certain symbols with their neighbours so that they can form larger
units to be presented to the next stage of the compilation process. This is because the
characters of the source program are usually analysed just one at a time. Characters such
as ‘+’ or ‘x’ are terminal symbols in their own right, but other characters such as
‘P’, ‘R’, ‘I’, ‘N’ and ‘T’ may have to be grouped together to form the token for the
reserved word ‘PRINT’.

As errors can be detected during this first phase of compilation, facilities to generate an
error report must also be provided.

An overview of the functions performed by the lexical analyzer is as follows:


• Remove white space. White space is regarded as all the program code that is
superfluous (unnecessary) to the meaning of the program and includes comments,
spaces, tabs and new-line characters.
• Identify the individual words, operators, etc. (known as syntactic units) in the
program.
• Create a symbol table. The symbol table will contain details of each symbol used in
the program. A symbol table will be used by the later stages of the compiler. It will
contain the symbol name plus information about the item, e.g. an integer, a procedure,
etc.
• Load import or include files.
• Keep track of line numbers.
• Produce an output listing.
• Convert each reserved word, operator, etc. into a token to be passed to the
syntax analyzer.
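
A minimal sketch of such a lexical analyzer is shown below, using Python's re module.
The token categories, the small KEYWORDS set and the "unknown" type placeholder in the
symbol table are illustrative assumptions, not the rules of any particular language.

# A minimal lexical analyzer sketch. It discards white space and comments,
# converts reserved words to KEYWORD tokens, and makes a first symbol-table
# entry for each identifier (its type is not yet known at this stage).

import re

KEYWORDS = {"PRINT", "LET", "ENDWHILE", "PROCEDURE"}

TOKEN_PATTERN = re.compile(r"""
      (?P<NUMBER>\d+)
    | (?P<WORD>[A-Za-z_]\w*)
    | (?P<OP>[+\-*/=()])
    | (?P<SKIP>\s+|\#.*)        # white space and comments are discarded
""", re.VERBOSE)

def tokenize(source):
    tokens, symbol_table = [], {}
    for match in TOKEN_PATTERN.finditer(source):
        kind, text = match.lastgroup, match.group()
        if kind == "SKIP":
            continue                                  # remove white space
        if kind == "WORD":
            if text in KEYWORDS:
                tokens.append(("KEYWORD", text))      # reserved word token
            else:
                symbol_table.setdefault(text, {"type": "unknown"})
                tokens.append(("IDENT", text))        # entered in symbol table
        else:
            tokens.append((kind, text))
    return tokens, symbol_table

tokens, table = tokenize("LET length = 2 * (side1 - side2)  # a comment")
print(tokens)
print(table)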

Syntax Analysis
Syntax analysis is the second stage of compilation and determines whether the ‘string of
input tokens’ forms valid sentences (i.e. it checks the grammar to see if all the rules of
syntax are being obeyed). At this stage, the structure of the source program is analysed to
see if it conforms to the context-free grammar for the particular language that is being
compiled. It basically breaks up the statements into smaller identifiable chunks that can
be acted upon by the computer when it executes the final program. This includes, for
example, checking whether the correct number of brackets has been used in expressions,
and determining the priorities of the arithmetical operators used within an arithmetic
expression. This process is also called parsing, and is carried out by the part of the
compiler called the parser.

It is also at this stage that a data dictionary is generated. This is simply a list kept by the
compiler of the variables used by the program, the variable types such as ‘numerical’,
‘integer’, ‘real’, ‘complex’ or ‘logical’, and the place in memory at which these variables
can be found. All the information stored in the dictionary will be needed later when the
object program is run.

It is possible to have a statement that is syntactically correct but has no meaning. For
example, A := B; may be a correct PASCAL statement, but it is not possible to assign B to
A if A is an integer variable and B is a character variable. Semantic analysis checks that
the statements have some correct meaning.
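
The parsing part of this stage can be sketched with a small recursive-descent parser. The
grammar below (expression, term, factor) is the standard textbook one for arithmetic,
not taken from any particular compiler; it checks bracket matching and encodes operator
priority by putting ‘*’ and ‘/’ in a lower-level rule than ‘+’ and ‘-’.

# A recursive-descent parser sketch for arithmetic expressions, working on
# a list of token strings. It builds a parse tree of (op, left, right)
# tuples, checking brackets and operator priority as it goes.

def parse(tokens):
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def expect(tok):
        nonlocal pos
        if peek() != tok:
            raise SyntaxError(f"expected {tok!r}, found {peek()!r}")
        pos += 1

    def expression():               # expression ::= term {(+|-) term}
        node = term()
        while peek() in ("+", "-"):
            op = peek()
            expect(op)
            node = (op, node, term())
        return node

    def term():                     # term ::= factor {(*|/) factor}
        node = factor()
        while peek() in ("*", "/"):
            op = peek()
            expect(op)
            node = (op, node, factor())
        return node

    def factor():                   # factor ::= number | name | ( expression )
        nonlocal pos
        tok = peek()
        if tok == "(":
            expect("(")
            node = expression()
            expect(")")             # an unmatched bracket fails here
            return node
        pos += 1
        return tok

    tree = expression()
    if peek() is not None:
        raise SyntaxError(f"unexpected token {peek()!r}")
    return tree

# 2 * (side1 - side2): '*' binds tighter, brackets are checked as we go
print(parse(["2", "*", "(", "side1", "-", "side2", ")"]))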

Code Generation
Code generation is the phase of the compilation process where the code specific to the
target machine is generated. As the code is machine code, it is usual for several
machine-code instructions to be generated for each high-level language instruction.

As an example, consider the following simple line in a BASIC program:


LET length = 2*(side1 - side2) + 4*(side3 - side4)
Now the keyword LET is optional and would, therefore, have been removed during the
lexical analysis stage. The resulting statement might now be as follows:
length = 2*(side1 - side2) + 4*(side3 - side4)
The ‘side1’, ‘side2’, ‘side3’, ‘side4’ and ‘length’ variables will have been created in the
dictionary mentioned earlier during the syntax analysis stage. The dictionary will contain
the variable name, the variable type and where in memory the variable can be found, so
that the machine code program can load it. Therefore, for the above line, the dictionary
used by the compiler may have entries similar to those shown in the following table.

Variable name    Variable type    Memory location
length           Numeric          FBFF
side1            Numeric          FC06
side2            Numeric          FC0D
side3            Numeric          FC14
side4            Numeric          FC16

As well as building up dictionary tables like the above, routines from the system library
may often have to be called up. Functions such as ‘square roots’ or ‘sine’ might be
needed so often that the machine code to deal with them is stored in the system library.

If we now imagine some fictitious assembly language with mnemonics ADD,
SUBTRACT and MULTIPLY etc., then the following assembly language code might be
generated for the single line of BASIC code:

LOAD side1
SUBTRACT side2
MULTIPLY 2
STORE temp
LOAD side3
SUBTRACT side4
MULTIPLY 4
ADD temp
STORE length
In the above assembly language program, ‘temp’ is simply a location used by the
computer to store a ‘temporary answer’ during a calculation. In fact, actual machine
code would be produced by the compiler, but a list of hex or binary numbers is not as
enlightening as the above for showing the principles of how the object language might be
generated.
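
A sketch of how a compiler might emit that sequence from a parse tree is shown below.
The tree shape matches the parsing sketch given earlier, and the instruction names are
the fictitious ones above; for simplicity the temporary-storage trick is only valid for
commutative operators (ADD here), which is all this expression needs.

# A sketch of accumulator-style code generation for the fictitious
# instruction set above, working on (op, left, right) parse-tree tuples.

OPCODES = {"+": "ADD", "-": "SUBTRACT", "*": "MULTIPLY"}

def is_leaf(node):
    return not isinstance(node, tuple)

def generate(node, code):
    """Emit code that leaves the value of 'node' in the accumulator."""
    op, left, right = node
    if is_leaf(left):
        code.append(f"LOAD {left}")
    else:
        generate(left, code)
    if is_leaf(right):
        code.append(f"{OPCODES[op]} {right}")
    else:
        code.append("STORE temp")          # save the left-hand result
        generate(right, code)
        code.append(f"{OPCODES[op]} temp") # combine with the saved value

# length = 2*(side1 - side2) + 4*(side3 - side4), as a parse tree
tree = ("+",
        ("*", ("-", "side1", "side2"), "2"),
        ("*", ("-", "side3", "side4"), "4"))

code = []
generate(tree, code)
code.append("STORE length")
print("\n".join(code))   # reproduces the nine-instruction listing above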

It is possible to generate machine code that will execute on a different type of computer.
This is known as cross-compilation.

Code Optimisation
Often the code produced by such methods is not the best that can be obtained. This is a
consequence of trying to construct machine code from a high-level language. It’s often
possible to generate more efficient machine code by carrying out a process called
optimisation. However, it is still very unlikely that even the best optimisers can produce
code that would be as good as hand-optimised code. Usually it is not worth going to
these extraordinary lengths to improve the final product, but if speed or efficiency is of
paramount importance, as might be the case with a real-time system, then there is no
choice.
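
One common optimisation, constant folding, can be sketched very simply on parse trees
of the shape used earlier (with numeric leaves as Python ints here): any sub-expression
built entirely from constants is evaluated once at compile time instead of every time the
program runs. This fragment is purely illustrative, not a production optimiser.

# A sketch of constant folding: sub-expressions made entirely of constants
# are computed by the compiler, so the generated code never evaluates them.

import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def fold(node):
    if not isinstance(node, tuple):
        return node                       # a constant or a variable name
    op, left, right = node
    left, right = fold(left), fold(right)
    if isinstance(left, (int, float)) and isinstance(right, (int, float)):
        return OPS[op](left, right)       # both sides known: fold it now
    return (op, left, right)

# 2 * (3 + 4) - x  becomes  14 - x
print(fold(("-", ("*", 2, ("+", 3, 4)), "x")))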

Debugger
A debugger is often supplied with a compiler or interpreter. A debugger helps the
programmer to find logical errors in the program. The compiler or interpreter will
establish whether the program has broken any of the rules of the language but it cannot
check whether the program is performing the correct task. A debugger can offer the
following features:
• Breakpoints – stop the execution of the program at a predetermined point
• Single step – run the program one line at a time
• Watches – allow the programmer to inspect the contents of variables
• Trace – provide a history of the statements executed immediately before the program
failed
• Store dump – provide details of the contents of the computer’s memory.
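
Several of these features can be seen concretely in Python's built-in debugger, pdb; the
buggy average() function below is invented for the demonstration.

# Illustrating debugger features with Python's built-in pdb.
# The bug in average() (dividing by a hard-coded 2) is deliberate.

def average(values):
    total = sum(values)
    breakpoint()              # breakpoint: execution stops here, in pdb
    return total / 2          # the logical error a debugger helps to find

print(average([10, 20, 30]))

# At the (Pdb) prompt you can, for example:
#   n          single step to the next line
#   p total    inspect ('watch') the contents of a variable
#   c          continue running to the next breakpoint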

Linking
Programs that make up a system are normally compiled separately and each compilation
generates an object file. In order to build a system it is necessary to combine several
object files, and a linkage editor program performs this task. The result is a machine code
file that is often known as an executable file and contains all the required object files
linked together.

It is also common practice to place regularly used programs in library files. A library is a
file that contains a collection of object files. The linkage editor will manage these files
and link them to other programs as necessary. Libraries of prewritten routines are widely
available for most programming languages.

In order to link object files, the files have to be copied into memory. It is also necessary
to copy an executable file into memory before it can run; we say that the code is loaded
into memory. The program that performs this task is called a loader. The loader is usually
an integral part of the linkage editor.
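
The heart of the linkage editor's job, resolving names that one object file uses but
another defines, can be sketched as follows. The dictionary ‘object file’ format and the
symbol names are made up for the illustration.

# A toy sketch of symbol resolution, the core of what a linkage editor
# does. Each "object file" is a made-up dictionary listing the symbols it
# defines and the external symbols it still needs.

main_obj  = {"defines": {"main"}, "needs": {"sqrt", "print_result"}}
maths_lib = {"defines": {"sqrt", "sine"}, "needs": set()}
io_obj    = {"defines": {"print_result"}, "needs": set()}

def link(object_files):
    defined, needed = set(), set()
    for obj in object_files:
        defined |= obj["defines"]
        needed |= obj["needs"]
    unresolved = needed - defined
    if unresolved:
        raise NameError(f"unresolved symbols: {unresolved}")
    return defined                # all references satisfied: "executable"

print(link([main_obj, maths_lib, io_obj]))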

Backus-Naur Form
One method of specifying the grammatical rules (or syntax) of a language is Backus-
Naur Form (BNF). BNF is a special type of language with a simple set of rules. BNF is
often known as a meta-language. The common symbols used are:
<> used to enclose a syntactic category
::= an operator meaning ‘is defined by’ or ‘consisting of’
| is used to mean ‘or’
{} zero or more repetition of the contents

BNF definitions are often recursive, for example a definition of an integer that contains
one or more digits might be written:
<integer> ::= <digit> | <digit> <integer>
<digit> ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

An integer constant is defined as having an optional + or - sign followed by one or more
digits.

1. Write a BNF specification of an integer constant


<integerconstant> ::= <sign> <integer> | <integer>
<integer> ::= <digit> | <digit><integer>
<digit> ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
<sign> ::= + | -
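
This BNF translates almost line for line into a recursive recogniser, one function per
syntactic category. The sketch below follows the rules above directly.

# A recogniser that follows the BNF above rule for rule:
#   <integerconstant> ::= <sign> <integer> | <integer>
#   <integer>         ::= <digit> | <digit> <integer>

def is_digit(ch):
    return ch in "0123456789"                    # <digit> ::= 0 | 1 | ... | 9

def is_integer(s):
    if not s:
        return False
    if len(s) == 1:
        return is_digit(s)                       # <integer> ::= <digit>
    return is_digit(s[0]) and is_integer(s[1:])  # | <digit> <integer>

def is_integer_constant(s):
    if s and s[0] in "+-":
        return is_integer(s[1:])                 # <sign> <integer>
    return is_integer(s)                         # | <integer>

print(is_integer_constant("-123"))   # True
print(is_integer_constant("12a3"))   # False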
