Chapter 2: Lexical Analysis


Chapter two

Lexical Analysis/Scanning
• Lexical analysis: the process of converting a
sequence of characters (such as in a computer
program) into a sequence of tokens (strings with an
identified "meaning"). OR
• The word "lexical" in the traditional sense means "pertaining to
words". In terms of programming languages, words are objects like
variable names, numbers, and keywords. Such words are traditionally
called tokens.
 Lexical analysis takes the modified source code from
language preprocessors, written in the form of
sentences.
 The lexical analyzer breaks this text into a
series of tokens, removing any whitespace and
comments in the source code.
Figure: Scanning
As the first phase of a compiler, the main task of the lexical analyzer is
to read the input characters of the source program, group them into
lexemes, and produce as output tokens for each lexeme in the source
program. This stream of tokens is sent to the parser for syntax
analysis.
Cont…..
Languages
Figure: Languages
Cont… Language
Based on finite automata, a recognizer consists of:
A set of states S.
A set of input symbols ∑.
A start (or initial) state.
A set of states F distinguished as accepting (or final) states.
A set of transitions, each labeled by a single symbol.
Cont…..
Note: The output of lexical analysis is a sequence of tokens that is sent to
the parser for syntax analysis.
Lexical analysis can be implemented with (deterministic) finite
automata.
OR
The lexical analysis process starts with a definition of what it means to be
a token in the language with regular expressions or grammars, then this
is translated to an abstract computational model for recognising tokens (a
non-deterministic finite state automaton), which is then translated to an
implementable model for recognising the defined tokens (a deterministic
finite state automaton) to which optimisations can be made (a minimised
DFA).
Regular expressions are used to describe tokens (lexical constructs).
 The formal tools of regular expressions and finite automata allow us to
state very precisely what may appear in a given token type. Then,
automated tools can process these definitions, find errors or
ambiguities, and produce compact, high performance code.
Cont…..
We use regular expressions to describe the tokens of a programming language. A
regular expression is built up from simpler regular expressions (using defining
rules). Each regular expression denotes a language. A language denoted by a
regular expression is called a regular set.
For regular expressions over an alphabet Σ:
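The defining rules, following the standard construction, are:
1. ε is a regular expression denoting {ε}, the language containing only the empty string.
2. For each symbol a in Σ, a is a regular expression denoting the language {a}.
3. If r and s are regular expressions denoting the languages L(r) and L(s), then:
(r)|(s) is a regular expression denoting L(r) ∪ L(s),
(r)(s) is a regular expression denoting L(r)L(s),
(r)* is a regular expression denoting (L(r))*,
(r) is a regular expression denoting L(r).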
Cont…
 Regular expression → NFA → DFA → Minimized DFA
COMPILER DESIGN – FINITE AUTOMATA
A finite automaton is a state machine that takes a string of symbols as input and changes its
state accordingly. A finite automaton is a recognizer for regular expressions. When a regular
expression string is fed into a finite automaton, it changes its state for each literal. If the input
string is successfully processed and the automaton reaches a final state, the input is accepted,
i.e., the string just fed is said to be a valid token of the language in hand.
The mathematical model of a finite automaton consists of:
Finite set of states (Q)
Finite set of input symbols (Σ)
One start state (q0)
Set of final states (qf)
Transition function (δ)
The transition function (δ) maps a pair of a state and an input
symbol to a state: δ : Q × Σ → Q
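For illustration (a minimal sketch, not from the slides), δ can be stored in
code as a lookup table indexed by state and input symbol. The toy DFA below,
over the alphabet {0, 1}, tracks the parity of the 1s seen so far:

/* delta : Q x Sigma -> Q as a two-dimensional table. */
enum state { Q0, Q1 };   /* Q0: even number of 1s seen, Q1: odd */

static const enum state delta[2][2] = {
    /* input:      '0'  '1' */
    /* state Q0 */ { Q0, Q1 },
    /* state Q1 */ { Q1, Q0 },
};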
Cont…
Finite Automata Construction
Let L(r) be a regular language recognized by some finite automaton (FA).
States : States of FA are represented by circles. State names are written
inside circles.
Start state : The state from where the automata starts is known as the start
state. Start state has an arrow pointed towards it.
Intermediate states : All intermediate states have at least two arrows: one
pointing in and another pointing out.
Final state : If the input string is successfully parsed, the automata is
expected to be in this state. Final state is represented by double circles.
Transition : The transition from one state to another state happens when a
desired symbol in the input is found. Upon transition, automata can either move
to the next state or stay in the same state. Movement from one state to another
is shown as a directed arrow, where the arrows point to the destination state. If
automata stays on the same state, an arrow pointing from a state to itself is
drawn.
Example : Assume an FA that accepts any three-digit binary value ending in the digit 1.
FA = {Q(q0, qf), Σ(0, 1), q0, qf, δ}
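A minimal sketch of this FA in C (the slide names only q0 and qf; the
intermediate and dead states below are assumptions of the sketch):

#include <stdio.h>

/* States: S0 start, S1 after one digit, S2 after two digits,
 * ACCEPT after three digits ending in 1, DEAD otherwise. */
enum { S0, S1, S2, ACCEPT, DEAD, NUM_STATES };

/* delta[state][bit]: next state on input '0' (index 0) or '1' (index 1). */
static const int delta[NUM_STATES][2] = {
    /* S0     */ { S1, S1 },
    /* S1     */ { S2, S2 },
    /* S2     */ { DEAD, ACCEPT },
    /* ACCEPT */ { DEAD, DEAD },   /* a fourth digit means rejection */
    /* DEAD   */ { DEAD, DEAD },
};

static int accepts(const char *s)
{
    int state = S0;
    for (; *s; s++) {
        if (*s != '0' && *s != '1')
            return 0;                  /* not a binary digit */
        state = delta[state][*s - '0'];
    }
    return state == ACCEPT;
}

int main(void)
{
    printf("%d %d %d\n", accepts("101"), accepts("110"), accepts("1011"));
    /* prints: 1 0 0 */
    return 0;
}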
Cont…
A recognizer for a language is a program that takes a string x and answers
"yes" if x is a sentence of that language, and "no" otherwise.
We call the recognizer of the tokens a finite automaton.
A finite automaton can be: deterministic (DFA) or non-deterministic (NFA)
This means that we may use a deterministic or non-deterministic automaton
as a lexical analyzer
Both deterministic and non-deterministic finite automata recognize regular
sets.
Which one?
– deterministic: faster recognizer, but it may take more space
– non-deterministic: slower, but it may take less space
– Deterministic automata are widely used in lexical analyzers.
First, we define regular expressions for tokens; then we convert them into a
DFA to get a lexical analyzer for our tokens.
Cont…
Transition Diagram:
A transition diagram has a collection of nodes or circles, called states.
Each state represents a condition that could occur during the process of
scanning the input looking for a lexeme that matches one of several
patterns. Edges are directed from one state of the transition diagram to
another. Each edge is labeled by a symbol or set of symbols.
If we are in some state s, and the next input symbol is a, we look for an
edge out of state s labeled by a.
If we find such an edge, we advance the forward pointer and enter the
state of the transition diagram to which that edge leads.
Some important conventions about transition diagrams are:
1. Certain states are said to be accepting, or final. These states
indicate that a lexeme has been found, although the actual lexeme
may not consist of all positions between the lexemeBegin and forward
pointers. We always indicate an accepting state by a double circle.
2. In addition, if it is necessary to return the forward pointer one
position, then we shall additionally place a * near that accepting state.
3. One state is designated the start state, or initial state; it is indicated by an
edge labeled "start" entering from nowhere. The transition diagram
always begins in the start state before any input symbols have been used.
Cont…
As an intermediate step in the construction of a lexical analyzer, we first
produce a stylized flowchart, called a transition diagram. Positions in a
transition diagram are drawn as circles and are called states.
Figure: Transition diagram for an identifier
The above TD is for an identifier, defined to be a
letter followed by any number of letters or digits. A
sequence of transition diagrams can be
converted into a program to look for the tokens
specified by the diagrams. Each state gets a
segment of code, as sketched below.
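The following is a minimal sketch of that idea (the function name and the
return convention are assumptions for illustration, not from the slides); the
identifier diagram becomes one segment of code per state:

#include <ctype.h>

/* Recognize an identifier: a letter followed by letters or digits.
 * Returns a pointer one past the end of the lexeme, or NULL if the
 * input does not begin with an identifier. */
const char *match_identifier(const char *forward)
{
    /* State 1 (start): expect a letter. */
    if (!isalpha((unsigned char)*forward))
        return NULL;
    forward++;

    /* State 2: loop while letters or digits arrive. */
    while (isalnum((unsigned char)*forward))
        forward++;

    /* Accepting state: the lexeme runs from lexemeBegin up to here;
     * no retraction is needed because we only peeked at the next char. */
    return forward;
}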
Cont…
Non-Deterministic Finite Automaton (NFA)
An NFA accepts a string x if and only if there is a path from the start state
to one of the accepting states such that the edge labels along this path spell out x.
An NFA is a mathematical model that consists of:
A set of states S or Q.
A set of input symbols ∑.
A transition is a move from one state to another.
A state s0 that is distinguished as the start (or initial) state.
A set of states F distinguished as accepting (or final) states.
A set of transitions, each labeled by an input symbol or ε.
ε-transitions are allowed in NFAs. In other words, we can move from one
state to another without consuming any symbol.
Cont…
The transition graph for an NFA that recognizes the language (a|b)*abb is
shown below.
Figure: NFA for (a|b)*abb
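For reference (this is the standard textbook NFA for (a|b)*abb), the
automaton can be given by the following transition table, with start state 0
and accepting state 3:

State   a        b
0       {0, 1}   {0}
1       -        {2}
2       -        {3}
3       -        -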
Cont…
Deterministic Finite Automaton (DFA): A Deterministic Finite Automaton
(DFA) is a special form of an NFA, in which:
No state has an ε-transition.
For each symbol a and state s, there is at most one edge labeled a leaving s,
i.e. the transition function maps a pair of a state and a symbol to a single
state (not to a set of states).
Example:
The DFA to recognize the language (a|b)*ab is as follows.
Figure: DFA for (a|b)*ab
Cont…

A finite automaton is called a DFA if there is only one path for a specific input
from the current state to the next state.
From state S0 for input 'a' there is only one path,
going to S2. Similarly, from S0 there is only one path
for input 'b', going to S1.
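A minimal C sketch of this DFA (the state names A, B, C are assumptions of
the sketch and may not match the figure's S0, S1, S2):

#include <stdio.h>

/* DFA for (a|b)*ab.
 *   A: start, no progress toward "ab"
 *   B: the last symbol read was 'a'
 *   C: the last two symbols read were "ab" (accepting) */
enum { A, B, C };

static int next(int state, char c)
{
    if (c == 'a') return B;          /* an 'a' always starts fresh progress */
    return (state == B) ? C : A;     /* 'b' completes "ab" only after an 'a' */
}

static int accepts(const char *s)
{
    int state = A;
    for (; *s; s++) {
        if (*s != 'a' && *s != 'b')
            return 0;
        state = next(state, *s);
    }
    return state == C;
}

int main(void)
{
    printf("%d %d %d\n", accepts("ab"), accepts("abab"), accepts("abb"));
    /* prints: 1 1 0 */
    return 0;
}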
Cont…
Application of finite state machines and regular expressions in lexical analysis:
Lexical analysis is the process of reading the source text of a program and
converting that source code into a sequence of tokens. Designing a finite
state machine from regular expressions is a useful way to generate
tokens from a given source text. Since the lexical structure of more or
less every programming language can be specified by a regular language, a
common way to implement a lexical analysis is to:
1. Specify regular expressions for all of the kinds of tokens in the language.
The disjunction of all of the regular expressions thus describes any possible
token in the language.
2. Convert the overall regular expression specifying all possible tokens into a
deterministic finite automaton (DFA).
3. Translate the DFA into a program that simulates the DFA. This program is the
lexical analyzer. To recognize identifiers, numerals, operators, etc., implement a
DFA in code: the state is an integer variable and δ is a switch statement. Upon
recognizing a lexeme, the analyzer returns the lexeme and its lexical class, and
restarts the DFA with the next character in the source code.
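A minimal sketch of that structure (the token classes and the tiny driver are
illustrative assumptions, not part of the slides); the state is an integer and
δ is a switch statement:

#include <ctype.h>
#include <stdio.h>

enum token_class { TOK_ID, TOK_NUM, TOK_OP, TOK_EOF };
enum state { START, IN_ID, IN_NUM };

static const char *src;   /* next character in the source code */

/* Return the class of the next lexeme; the DFA restarts on each call. */
enum token_class next_token(void)
{
    enum state state = START;
    for (;;) {
        char c = *src;
        switch (state) {              /* delta as a switch statement */
        case START:
            if (c == '\0') return TOK_EOF;
            if (isspace((unsigned char)c)) { src++; break; }  /* skip */
            if (isalpha((unsigned char)c)) { src++; state = IN_ID; break; }
            if (isdigit((unsigned char)c)) { src++; state = IN_NUM; break; }
            src++; return TOK_OP;     /* treat as a single-char operator */
        case IN_ID:
            if (isalnum((unsigned char)c)) { src++; break; }
            return TOK_ID;
        case IN_NUM:
            if (isdigit((unsigned char)c)) { src++; break; }
            return TOK_NUM;
        }
    }
}

int main(void)
{
    src = "sum + 42";
    for (enum token_class t; (t = next_token()) != TOK_EOF; )
        printf("token class %d\n", t);
    return 0;
}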
Most languages will have tokens in these categories:
• Keywords are words in the language structure itself, like while or class or
true. Keywords must be chosen carefully to reflect the natural structure of the
language, without interfering with the likely names of variables and other
identifiers.
• Identifiers are the names of variables, functions, classes, and other code
elements chosen by the programmer. Typically, identifiers are arbitrary
sequences of letters and possibly numbers. Some languages require identifiers
to be marked with a sentinel (like the dollar sign in Perl) to clearly distinguish
identifiers from keywords.
• Numbers could be formatted as integers, or floating point values, or
fractions, or in alternate bases such as binary, octal or hexadecimal. Each
format should be clearly distinguished, so that the programmer does not
confuse one with the other.
• Strings are literal character sequences that must be clearly distinguished
from keywords or identifiers. Strings are typically quoted with single or double
quotes, but also must have some facility for containing quotations, newlines,
and unprintable characters.
• Comments and whitespace are used to format a program to make it visually
clear, and in some cases (like Python) are significant to the structure of a
program.
Cont…
 The lexical analyzer (either generated automatically
by a tool like lex, or hand-crafted) reads in a stream
of characters, identifies the lexemes in the stream, and
categorizes them into tokens.
 This is called "tokenizing". If the lexer finds an
invalid token, it will report an error.
Front end: Terminologies
What is a token? A lexical token is a sequence of
characters that can be treated as a unit in the grammar of
a programming language. Examples of tokens:
Type token (id, number, real, . . . )
Punctuation tokens ( void, return, . . . )
Alphabetic tokens (keywords)
Keywords; examples: for, while, if, etc.
Identifiers; examples: variable names, function names, etc.
Operators; examples: '+', '++', '-', etc.
Separators; examples: ',', ';', etc.
Example of Non-Tokens:
Comments, preprocessor directive, tabs, newline, etc.
Front end: Terminologies
• In general, a token is a sequence of characters
that represents a lexical unit, which matches
a pattern, such as keywords, operators,
identifiers, special symbols, constants, etc.
• Note: In a programming language, keywords, constants, identifiers,
strings, numbers, operators, and punctuation symbols can be
considered tokens.
• For example, in the C language, the variable declaration line
int value = 100; contains
• the tokens: int (keyword), value (identifier), = (operator), 100
(constant) and ; (symbol).
• Lexeme: a lexeme is an instance of a token, i.e., a group of characters
forming a token.
• Pattern: a pattern describes the rule that the lexemes of a token take. It is
the structure that must be matched by strings.
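For instance, for the declaration above, tokens, lexemes, and patterns line
up as follows (an illustrative summary):

Token        Lexeme   Pattern
keyword      int      the characters i, n, t
identifier   value    letter followed by letters or digits
operator     =        the character =
constant     100      one or more digits
symbol       ;        the character ;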
Token and Lexeme
 Once a token is generated, the corresponding
entry is made in the symbol table.
At lexical analysis phase,
Input: stream of characters
Output: Token
Token Template:
<token-name, attribute-value>
 For example, for c = a + b * 5; the token stream is:
<id,1> <=> <id,2> <+> <id,3> <*> <5> <;>
How the Lexical Analyzer Works
Input preprocessing: This stage involves cleaning up the input text and
preparing it for lexical analysis. This may include removing comments,
whitespace, and other non-essential characters from the input text.
Tokenization: This is the process of breaking the input text into a sequence of
tokens. This is usually done by matching the characters in the input text
against a set of patterns or regular expressions that define the different types
of tokens.
Token classification: In this stage, the lexer determines the type of each token.
For example, in a programming language, the lexer might classify keywords,
identifiers, operators, and punctuation symbols as separate token types.
Token validation: In this stage, the lexer checks that each token is valid
according to the rules of the programming language. For example, it might
check that a variable name is a valid identifier, or that an operator has the
correct syntax.
Output generation: In this final stage, the lexer generates the output of the
lexical analysis process, which is typically a list of tokens. This list of tokens
can then be passed to the next stage of compilation or interpretation.
Token and Lexeme Cont’d
Cont….

Example 1: Count the number of tokens in: int max(int i);
The lexical analyzer first reads int, finds it to be valid, and accepts it as a
token.
max is read next and found to be a valid function name, after reading (
and ).
int is also a token, then again i is another token, and finally ;.
Answer: total number of tokens is 7:
int, max, (, int, i, ), ;
Cont….

Example 2: Show the token classes, or “words”, put out by the lexical
analysis phase corresponding to this Java source input:
sum = sum + unit * /* accumulate sum */ 1.2e-12 ;
Solution:
identifier (sum)
assignment (=)
identifier (sum)
operator (+)
identifier (unit)
operator (*)
numeric constant (1.2e-12)
semicolon (;)
Advantages and disadvantage of Lexical analysis
Advantages
Efficiency: Lexical analysis improves the efficiency of the parsing process
because it breaks down the input into smaller, more manageable chunks. This
allows the parser to focus on the structure of the code, rather than the individual
characters.
Flexibility: Lexical analysis allows for the use of keywords and reserved words in
programming languages. This makes it easier to create new programming
languages and to modify existing ones.
Error Detection: The lexical analyzer can detect lexical errors such as illegal
characters and malformed tokens (for example, badly formed numbers or
unterminated strings). This can save a lot of time in the debugging process.
Code Optimization: Lexical analysis can help optimize code by identifying
common patterns and replacing them with more efficient code. This can improve
the performance of the program.
Disadvantages:
Complexity: Lexical analysis can be complex and require a lot of computational
power. This can make it difficult to implement in some programming languages.
Limited Error Detection: While lexical analysis can detect certain types of
errors, it cannot detect all errors. For example, it may not be able to detect logic
errors or type errors.
Increased Code Size: The addition of keywords and reserved words can increase
the size of the code, making it more difficult to read and understand.
Reduced Flexibility: The use of keywords and reserved words can also reduce the
flexibility of a programming language. It may not be possible to use certain words
or phrases in a way that is intuitive to the programmer.
THANK YOU
