Chapter 2: Lexical Analysis
Lexical Analysis/Scanning
• Lexical analysis: the process of converting a sequence of characters (such as in a computer program) into a sequence of tokens (strings with an identified "meaning").
OR
• The word “lexical” in the traditional sense means “pertaining to
words”. In terms of programming languages, words are objects like
variable names, numbers, keywords etc. Such words are traditionally
called tokens.
Lexical analysis takes the modified source code from language preprocessors, written in the form of sentences. The lexical analyzer breaks this text into a series of tokens, removing any whitespace and comments in the source code.
Figure: Scanning
As the first phase of a compiler, the main task of the lexical analyzer is
to read the input characters of the source program, group them into
lexemes, and produce as output a token for each lexeme in the source
program. This stream of tokens is sent to the parser for syntax
analysis.
A lexical analyzer is based on finite automata. A finite automaton consists of:
A set of states S.
A set of input symbols Σ (the alphabet).
A start (or initial) state.
A set of states F distinguished as accepting (or final) states.
A set of transitions, each taken on a single input symbol.
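To make these components concrete, the following minimal Java sketch represents an automaton as a transition table; the class and method names are illustrative, not from any standard library.

    import java.util.Map;
    import java.util.Set;

    // Minimal sketch of the five components above; all names are illustrative.
    public class FiniteAutomaton {
        private final Set<Integer> states;          // S: the set of states
        private final Set<Character> alphabet;      // Σ: the input symbols
        private final int startState;               // the start (initial) state
        private final Set<Integer> finalStates;     // F: accepting (final) states
        // One transition per (state, symbol) pair makes this automaton deterministic.
        private final Map<Integer, Map<Character, Integer>> transitions;

        public FiniteAutomaton(Set<Integer> states, Set<Character> alphabet, int startState,
                               Set<Integer> finalStates,
                               Map<Integer, Map<Character, Integer>> transitions) {
            this.states = states;
            this.alphabet = alphabet;
            this.startState = startState;
            this.finalStates = finalStates;
            this.transitions = transitions;
        }

        // Run the automaton over the input; reject on any undefined transition.
        public boolean accepts(String input) {
            int state = startState;
            for (char c : input.toCharArray()) {
                Map<Character, Integer> row = transitions.get(state);
                if (row == null || !row.containsKey(c)) return false;
                state = row.get(c);
            }
            return finalStates.contains(state);
        }
    }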
Cont…..
Note: The output of lexical analysis is a sequence of tokens that is sent to the parser for syntax analysis.
Lexical analysis can be implemented with (deterministic) finite automata.
OR
The lexical analysis process starts with a definition of what it means to be a token in the language, given as regular expressions or grammars. This definition is translated to an abstract computational model for recognising tokens (a non-deterministic finite state automaton), which is then translated to an implementable model for recognising the defined tokens (a deterministic finite state automaton), to which optimisations can be made (a minimised DFA).
Regular expressions are used to describe tokens (lexical constructs).
The formal tools of regular expressions and finite automata allow us to
state very precisely what may appear in a given token type. Then,
automated tools can process these definitions, find errors or
ambiguities, and produce compact, high-performance code.
Cont…..
We use regular expressions to describe the tokens of a programming language. A regular expression is built up from simpler regular expressions (using defining rules). Each regular expression denotes a language. A language denoted by a regular expression is called a regular set.
For regular expressions over an alphabet Σ, the defining rules are:
• ε is a regular expression denoting the language {ε}.
• For each symbol a in Σ, a is a regular expression denoting the language {a}.
• If r and s are regular expressions denoting the languages L(r) and L(s), then (r)|(s), (r)(s), and (r)* are regular expressions denoting L(r) ∪ L(s), L(r)L(s), and (L(r))*, respectively.
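As a quick illustration, Java's java.util.regex supports these operators directly; the regular expression (a|b)*abb denotes the regular set of all strings of a's and b's ending in abb. A small demo sketch (the class name is illustrative):

    import java.util.regex.Pattern;

    // The regular expression (a|b)*abb denotes the regular set of all strings
    // of a's and b's that end in "abb".
    public class RegexDemo {
        public static void main(String[] args) {
            Pattern p = Pattern.compile("(a|b)*abb");
            System.out.println(p.matcher("aabb").matches()); // true
            System.out.println(p.matcher("abab").matches()); // false
        }
    }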
Cont…
Figure: Transition graph for an NFA that recognizes the language (a|b)*abb.
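The figure itself is not reproduced here, but in the standard textbook construction this NFA has states 0 through 3: state 0 loops on both a and b, with an edge from 0 to 1 on a, from 1 to 2 on b, and from 2 to 3 on b; state 3 is accepting. Assuming that construction, a minimal Java sketch can simulate the NFA by tracking the set of currently reachable states:

    import java.util.HashSet;
    import java.util.Set;

    // Simulates the NFA for (a|b)*abb by tracking the set of reachable states.
    // States follow the standard construction: 0 loops on a and b; 0 -a-> 1;
    // 1 -b-> 2; 2 -b-> 3; state 3 is the accepting state.
    public class NfaAbb {
        public static boolean accepts(String input) {
            Set<Integer> current = new HashSet<>();
            current.add(0);                        // start in state 0
            for (char c : input.toCharArray()) {
                Set<Integer> next = new HashSet<>();
                for (int s : current) {
                    if (s == 0 && (c == 'a' || c == 'b')) {
                        next.add(0);               // keep looping in state 0...
                        if (c == 'a') next.add(1); // ...or guess this 'a' starts "abb"
                    } else if (s == 1 && c == 'b') {
                        next.add(2);
                    } else if (s == 2 && c == 'b') {
                        next.add(3);
                    }
                }
                current = next;                    // all states reachable so far
            }
            return current.contains(3);            // accept if state 3 is reachable
        }

        public static void main(String[] args) {
            System.out.println(accepts("aabb")); // true
            System.out.println(accepts("abab")); // false
        }
    }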
Cont…
Cont…
The finite automaton is called a DFA if there is only one path for a specific input from the current state to the next state; that is, each state has exactly one transition per input symbol.
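For contrast with the NFA sketch above, here is a minimal sketch of the equivalent DFA for (a|b)*abb, the standard four-state result of the subset construction (the state numbering is illustrative). Every (state, symbol) pair now has exactly one successor, so no set of states is needed:

    // DFA for (a|b)*abb, as produced by the subset construction from the NFA;
    // each (state, symbol) pair has exactly one successor state.
    public class DfaAbb {
        // row = state; column 0 = next state on 'a', column 1 = next state on 'b'
        private static final int[][] TRANSITION = {
            {1, 0},   // state 0: nothing useful seen yet
            {1, 2},   // state 1: just saw "a"
            {1, 3},   // state 2: just saw "ab"
            {1, 0},   // state 3: just saw "abb" (accepting)
        };

        public static boolean accepts(String input) {
            int state = 0;
            for (char c : input.toCharArray()) {
                if (c == 'a')      state = TRANSITION[state][0];
                else if (c == 'b') state = TRANSITION[state][1];
                else return false;  // symbol outside the alphabet {a, b}
            }
            return state == 3;
        }

        public static void main(String[] args) {
            System.out.println(accepts("babb")); // true
            System.out.println(accepts("abab")); // false
        }
    }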
Front end: Terminologies
• In general, a token is a sequence of characters that represents a lexical unit and matches a pattern, such as a keyword, operator, identifier, special symbol, or constant.
• Note: In a programming language, keywords, constants, identifiers, strings, numbers, operators, and punctuation symbols can all be considered tokens.
• For example, in the C language, the variable declaration line int value = 100; contains the tokens: int (keyword), value (identifier), = (operator), 100 (constant), and ; (symbol).
• Lexeme: A lexeme is an instance of a token, i.e., the group of characters forming a token.
• Pattern: A pattern describes the rule that the lexemes of a token must follow; it is the structure that must be matched by strings. For example, the pattern for an identifier might be letter (letter | digit)*.
Token and Lexeme
Once a token is generated, the corresponding entry is made in the symbol table.
At the lexical analysis phase,
Input: stream of characters
Output: tokens
Token template:
<token-name, attribute-value>
For example, for c = a + b * 5; the output is:
<id,1> <=> <id,2> <+> <id,3> <*> <5> <;>
where the attribute values 1, 2, and 3 are the symbol-table entries for c, a, and b, respectively.
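A minimal sketch of how a lexer might produce this token stream while filling the symbol table; the class name, pattern, and output format are illustrative assumptions:

    import java.util.LinkedHashMap;
    import java.util.Map;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Sketch: tokenize "c=a+b*5;" into <token-name, attribute-value> pairs,
    // entering each new identifier into a symbol table.
    public class TokenDemo {
        public static void main(String[] args) {
            String input = "c=a+b*5;";
            Map<String, Integer> symbolTable = new LinkedHashMap<>();
            Matcher m = Pattern.compile("[A-Za-z_]\\w*|\\d+|[=+*;]").matcher(input);
            StringBuilder out = new StringBuilder();
            while (m.find()) {
                String lexeme = m.group();
                if (Character.isLetter(lexeme.charAt(0)) || lexeme.charAt(0) == '_') {
                    // Identifier: its attribute is a symbol-table index.
                    Integer index = symbolTable.get(lexeme);
                    if (index == null) {
                        index = symbolTable.size() + 1;
                        symbolTable.put(lexeme, index);
                    }
                    out.append("<id,").append(index).append(">");
                } else {
                    // Numeric constants, operators, and punctuation: the lexeme itself.
                    out.append("<").append(lexeme).append(">");
                }
            }
            System.out.println(out); // <id,1><=><id,2><+><id,3><*><5><;>
        }
    }

Running it prints the token stream above, with c, a, and b entered into the symbol table in order of first appearance.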
How the Lexical Analyzer Works
Input preprocessing: This stage involves cleaning up the input text and
preparing it for lexical analysis. This may include removing comments,
whitespace, and other non-essential characters from the input text.
Tokenization: This is the process of breaking the input text into a sequence of
tokens. This is usually done by matching the characters in the input text
against a set of patterns or regular expressions that define the different types
of tokens.
Token classification: In this stage, the lexer determines the type of each token.
For example, in a programming language, the lexer might classify keywords,
identifiers, operators, and punctuation symbols as separate token types.
Token validation: In this stage, the lexer checks that each token is valid
according to the rules of the programming language. For example, it might
check that a variable name is a valid identifier, or that an operator has the
correct syntax.
Output generation: In this final stage, the lexer generates the output of the
lexical analysis process, which is typically a list of tokens. This list of tokens
can then be passed to the next stage of compilation or interpretation.
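These stages might be sketched in Java as follows; the token categories, keyword set, and regular expressions are illustrative simplifications rather than a complete language definition:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Set;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Sketch of the five stages above. The categories and patterns are
    // illustrative, not a full language specification.
    public class SimpleLexer {
        enum Type { KEYWORD, IDENTIFIER, NUMBER, OPERATOR, PUNCTUATION }
        record Token(Type type, String lexeme) {}

        private static final Set<String> KEYWORDS =
                Set.of("int", "if", "else", "while", "return");

        // Number comes first so "1.2e-12" is matched whole rather than split apart.
        private static final Pattern TOKEN = Pattern.compile(
                  "(?<number>\\d+(\\.\\d+)?([eE][+-]?\\d+)?)"
                + "|(?<word>[A-Za-z_]\\w*)"
                + "|(?<op>[+\\-*/=<>!]=?)"
                + "|(?<punct>[;,(){}])");

        static List<Token> lex(String source) {
            // Stage 1: input preprocessing - strip /* ... */ comments.
            String cleaned = source.replaceAll("(?s)/\\*.*?\\*/", " ");
            // Stages 2-4: tokenization, classification, and validation by pattern match.
            List<Token> tokens = new ArrayList<>();
            Matcher m = TOKEN.matcher(cleaned);
            while (m.find()) {
                if (m.group("number") != null) {
                    tokens.add(new Token(Type.NUMBER, m.group()));
                } else if (m.group("word") != null) {
                    Type t = KEYWORDS.contains(m.group()) ? Type.KEYWORD : Type.IDENTIFIER;
                    tokens.add(new Token(t, m.group()));
                } else if (m.group("op") != null) {
                    tokens.add(new Token(Type.OPERATOR, m.group()));
                } else {
                    tokens.add(new Token(Type.PUNCTUATION, m.group()));
                }
            }
            return tokens; // Stage 5: output generation - the list of tokens.
        }

        public static void main(String[] args) {
            lex("sum = sum + unit * /* accumulate sum */ 1.2e-12 ;")
                    .forEach(System.out::println);
        }
    }

Running lex on the Example 2 input below strips the comment in stage 1 and then emits the identifier, operator, numeric-constant, and punctuation tokens one per match.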
Token and Lexeme Cont’d
Example 2: Show the token classes, or “words”, put out by the lexical
analysis phase corresponding to this Java source input:
sum = sum + unit * /* accumulate sum */ 1.2e-12 ;
Solution:
identifier (sum)
assignment (=)
identifier (sum)
operator (+)
identifier (unit)
operator (*)
numeric constant (1.2e-12)
semicolon (;)
Note that the comment /* accumulate sum */ is discarded and produces no token.
Advantages and Disadvantages of Lexical Analysis
Advantages
Efficiency: Lexical analysis improves the efficiency of the parsing process
because it breaks down the input into smaller, more manageable chunks. This
allows the parser to focus on the structure of the code, rather than the individual
characters.
Flexibility: Lexical analysis allows for the use of keywords and reserved words in
programming languages. This makes it easier to create new programming
languages and to modify existing ones.
Error Detection: The lexical analyzer can detect errors such as illegal characters and malformed tokens (for example, an ill-formed numeric constant); errors such as missing semicolons or undefined variables are caught in later phases. Detecting lexical errors early can save a lot of time in the debugging process.
Code Optimization: Lexical analysis can help optimize code by identifying
common patterns and replacing them with more efficient code. This can improve
the performance of the program.
Disadvantages:
Complexity: Lexical analysis can be complex and require a lot of computational
power. This can make it difficult to implement in some programming languages.
Limited Error Detection: While lexical analysis can detect certain types of
errors, it cannot detect all errors. For example, it may not be able to detect logic
errors or type errors.
Increased Code Size: The addition of keywords and reserved words can increase
the size of the code, making it more difficult to read and understand.
Reduced Flexibility: The use of keywords and reserved words can also reduce the
flexibility of a programming language. It may not be possible to use certain words
or phrases in a way that is intuitive to the programmer.
THANK YOU