
Programming Languages

Topic: Scanning

Scanning- is the process of identifying tokens from the raw text source code of a program. The primary function of scanning is to read characters from a source file and group them into tokens.

A precise definition of tokens is obviously necessary to ensure that the scanning rules, sometimes called the lexical rules, are clearly stated and properly enforced.

Most languages will have tokens in these categories, illustrated by a short code sketch after the list:

1. Keywords- are words in the language structure itself, like while or class or true. Keywords must
be chosen carefully to reflect the natural structure of the language, without interfering with the
likely names of variables and other identifiers.
2. Identifiers- are the names of variables, functions, classes, and other code elements chosen by
the programmer. Typically, identifiers are arbitrary sequences of letters and possibly numbers.
Some languages require identifiers to be marked with a sentinel (like the dollar sign in Perl) to
clearly distinguish identifiers from keywords.
3. Numbers- could be formatted as integers, or floating-point values, or fractions, or in alternate
bases such as binary, octal or hexadecimal. Each format should be clearly distinguished, so that
the programmer does not confuse one with another.
4. Strings- are literal character sequences that must be clearly distinguished from keywords or
identifiers. Strings are typically quoted with single or double quotes, but must also have some
facility for containing quotations, newlines, and unprintable characters.
5. Comments & Whitespace- are used to format a program to make it visually clear, and in some cases (like Python) are significant to the structure of a program.
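As a rough illustration, the five categories above might appear in a C scanner as a single enumeration of token types returned by the scanner (the specific names are hypothetical, not taken from any particular compiler):

    enum token_t {
        TOKEN_EOF,          /* end of input */
        TOKEN_WHILE,        /* keywords: one value per reserved word */
        TOKEN_CLASS,
        TOKEN_TRUE,
        TOKEN_IDENTIFIER,   /* programmer-chosen names */
        TOKEN_INTEGER,      /* numeric literals, one per supported format */
        TOKEN_FLOAT,
        TOKEN_STRING,       /* quoted character sequences */
        TOKEN_NOT,          /* single- and multi-character operators */
        TOKEN_NOT_EQUAL,
        TOKEN_MULTIPLY,
        TOKEN_ERROR         /* input that matches no lexical rule */
    };

Comments and whitespace normally produce no token at all; the scanner simply skips them (except in languages like Python, where layout is significant).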

When designing a new language or designing a compiler for an existing language, the first job is to state
precisely what characters are permitted in each type of token.

Formal definitions allow a language designer to anticipate design flaws (e.g., virtually all languages allow fixed decimal numbers such as 0.1 and 10.01, but should .1 or 10. be allowed?).
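For instance, a designer who requires a digit on both sides of the decimal point might state the lexical rule for fixed decimal numbers as a pattern along these lines (one possible choice, not a universal rule):

    number -> digit+ ( "." digit+ )?

Under this rule 0.1 and 10.01 are valid number tokens, while .1 and 10. are not; a different language could legitimately decide otherwise.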

Scanners are sometimes called lexical analyzers or lexers; the terms are used interchangeably.

The goal of a scanner generator is to limit the effort of building a scanner to specifying which tokens the scanner should recognize.

Programming a scanner generator is an example of declarative programming. That is, unlike ordinary programming, which we call procedural, programmers do not tell a scanner generator how to scan but simply what they want scanned. This is a higher-level approach and, in many ways, a more natural one. Much recent research in computer science is directed toward declarative programming styles.
Declarative programming is most successful in limited domains, such as scanning, where the range of
implementation decisions that must be automatically made is limited. Nonetheless, a long-standing goal
of computer scientists is to automatically generate an entire production-quality compiler from a
specification of the properties of the source language and target computer.
Though our primary focus in this text is on producing correct compilers, performance is sometimes a
real concern, especially in widely-used “production compilers”. Surprisingly, even though scanners
perform a simple task, if poorly implemented, they can be significant performance bottlenecks. The
reason is that scanners must wade through the text of a program character by character.

2.1 A Hand-made Scanner

The basic approach is to read one character at a time from the input stream (fgetc(fp)) and then classify it. Some single-character tokens are easy: if the scanner reads a * character, it immediately returns TOKEN_MULTIPLY, and the same would be true for addition, subtraction, and so forth.

However, some characters are part of multiple tokens. If the scanner encounters !, that could represent a logical-not operation by itself, or it could be the first character in the != sequence representing not-equal-to. Upon reading !, the scanner must immediately read the next character. If the next character is =, then it has matched != and returns TOKEN_NOT_EQUAL. But if the character following ! is something else, then the non-matching character needs to be put back on the input stream using ungetc, because it is not part of the current token. The scanner returns TOKEN_NOT and will consume the put-back character on the next call to scan_token. In a similar way, once a letter has been identified by isalpha(c), the scanner keeps reading letters or numbers until a non-matching character is put back, and the scanner returns TOKEN_IDENTIFIER.
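A minimal sketch of such a hand-made scan_token routine in C follows. The token names, the file pointer fp, and the restriction to just a few token types are assumptions carried over from the description above; a real scanner would also record the text of each identifier and handle many more cases.

    #include <stdio.h>
    #include <ctype.h>

    enum token_t { TOKEN_EOF, TOKEN_MULTIPLY, TOKEN_NOT, TOKEN_NOT_EQUAL,
                   TOKEN_IDENTIFIER, TOKEN_ERROR };

    /* Return the next token found on the stream fp. */
    enum token_t scan_token(FILE *fp)
    {
        int c = fgetc(fp);

        if (c == EOF)  return TOKEN_EOF;
        if (c == '*')  return TOKEN_MULTIPLY;     /* single-character token */

        if (c == '!') {                           /* could be ! by itself, or != */
            int d = fgetc(fp);
            if (d == '=') return TOKEN_NOT_EQUAL;
            ungetc(d, fp);                        /* put the non-matching character back */
            return TOKEN_NOT;
        }

        if (isalpha(c)) {                         /* start of an identifier */
            do {
                c = fgetc(fp);
            } while (isalnum(c));                 /* keep reading letters or digits */
            ungetc(c, fp);                        /* first non-matching character goes back */
            return TOKEN_IDENTIFIER;
        }

        return TOKEN_ERROR;                       /* character begins no known token */
    }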

Backtracking- comes up in every stage of the compiler: an unexpected item doesn't match the current objective, so it must be put back for later.

Hand-made scanner- more verbose; as more token types are added, the code can become quite convoluted, particularly if tokens share common sequences of characters. It can also be difficult for a developer to be certain that the scanner code corresponds to the desired definition of each token, which
can result in unexpected behavior on complex inputs. That said, for a small language with a limited
number of tokens, a hand-made scanner can be an appropriate solution.

2.2 Regular Expressions

Regular Expressions (REs)- are a language for expressing patterns. They were first described in the 1950s by Stephen Kleene as an element of his foundational work in automata theory and computability. Today, REs are found in slightly different forms in programming languages (Perl), standard libraries (PCRE), text editors (vi), command-line tools (grep), and many other places. We use regular expressions as a compact and formal way of specifying the tokens accepted by the scanner of a compiler, and then automatically translate those expressions into working code. REs can be a bit tricky to use and require some practice in order to achieve the desired results.
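As a small, concrete illustration (not part of a compiler), the POSIX regular-expression library can test whether a string matches one plausible identifier pattern, [A-Za-z][A-Za-z0-9]*:

    #include <regex.h>
    #include <stdio.h>

    int main(void)
    {
        regex_t re;
        /* ^ and $ anchor the pattern so the whole string must match it. */
        regcomp(&re, "^[A-Za-z][A-Za-z0-9]*$", REG_EXTENDED | REG_NOSUB);

        const char *samples[] = { "count", "x1", "1x" };
        for (int i = 0; i < 3; i++) {
            int ok = (regexec(&re, samples[i], 0, NULL, 0) == 0);
            printf("%-6s %s\n", samples[i], ok ? "identifier" : "rejected");
        }

        regfree(&re);
        return 0;
    }

A scanner generator does essentially this kind of matching, but it compiles the token patterns ahead of time into the finite automata described in the next section.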
2.3 Finite Automata

A finite automaton is an abstract machine that can be used to represent certain forms of computation. Graphically, an FA consists of a
number of states (represented by numbered circles) and a number of edges (represented by labeled
arrows) between those states. Each edge is labeled with one or more symbols drawn from an alphabet.

The machine begins in a start state S0. For each input symbol presented to the FA, it moves to the state indicated by the edge with the same label as the input symbol. Some states of the FA are known as accepting states and are indicated by a double circle. If the FA is in an accepting state after all input is consumed, then the FA accepts the input. Conversely, the FA rejects the input string if it ends in a non-accepting state or if there is no edge corresponding to the current input symbol.

A finite automaton (FA)- can be used to recognize the tokens specified by a regular expression. An FA is a simple, idealized computer that recognizes strings belonging to regular sets. It consists of (a small C sketch of this structure follows the list):

- A finite set of states
- A set of transitions (or moves) from one state to another, labeled with characters in V
- A special state called the start state
- A subset of the states called the accepting, or final, states
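A minimal C representation of these four components, assuming a small fixed bound on the number of states and an ASCII alphabet, might look like this:

    #define MAX_STATES 32
    #define ALPHABET  128                       /* one column per ASCII character */

    struct fa {
        int num_states;
        int start_state;
        int accepting[MAX_STATES];              /* 1 if the state is a final state */
        int next[MAX_STATES][ALPHABET];         /* next[s][c] = target state, or -1 for no edge */
    };

A deterministic automaton needs at most one target state per (state, symbol) pair, which is exactly what this table can hold; a nondeterministic automaton would need a set of target states per entry.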
2.3.1 Deterministic Finite Automata

Each of these three examples is a deterministic finite automaton (DFA). A DFA is a special case of an FA
where every state has no more than one outgoing edge for a given symbol. Put another way, a DFA has
no ambiguity: for every combination of state and input symbol, there is exactly one choice of what to do
next.

Because of this property, a DFA is very easy to implement in software or hardware. One integer (c) is needed to keep track of the current state. The transitions between states are represented by a matrix (M[s, i]) which encodes the next state, given the current state s and input symbol i. For each input symbol, we compute c = M[c, i] until all the input is consumed, or an error state is reached.
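A sketch of that loop in C, reusing struct fa from the previous sketch and assuming that -1 marks a missing edge:

    /* Run the DFA m over the NUL-terminated string input; return 1 if it accepts. */
    int dfa_accepts(const struct fa *m, const char *input)
    {
        int c = m->start_state;                 /* the single integer tracking the current state */

        for (const char *p = input; *p != '\0'; p++) {
            c = m->next[c][(unsigned char)*p];  /* c = M[c, i] */
            if (c < 0)
                return 0;                       /* no edge for this symbol: reject */
        }
        return m->accepting[c];                 /* accept iff the machine ends in a final state */
    }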
2.3.2 Nondeterministic Finite Automata

The alternative to a DFA is a nondeterministic finite automaton (NFA). An NFA is a perfectly valid FA, but
it has an ambiguity that makes it somewhat more difficult to work with. Consider the regular expression
[a-z]*ing, which represents all lowercase words ending in the suffix -ing.

One valid way to match a word such as "sing" is to use the [a-z] self-loop in state 0 for the leading letters and then follow the i, n, and g edges at the end, finishing in the accepting state. But the other, equally valid way would be to stay in state 0 the whole time, matching each letter to the [a-z] transition. Both ways obey the transition rules, but one results in acceptance, while the other results in rejection.

There are two common ways to interpret this ambiguity:

- The crystal ball interpretation suggests that the NFA somehow “knows” what the best choice is,
by some means external to the NFA itself. In the example above, the NFA would choose whether
to proceed to state zero, one, or two before consuming the first character, and it would always
make the right choice. This isn’t possible in a real implementation.
- The many-worlds interpretation suggests that the NFA exists in all allowable states simultaneously. When the input is complete, if any of those states are accepting states, then the NFA has accepted the input. This interpretation is more useful for constructing a working NFA, or converting it to a DFA.
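The many-worlds view translates directly into code: keep the set of currently active states and advance all of them on each input character. A minimal C sketch for the [a-z]*ing example follows; the state numbering 0-3 and the transition layout are assumptions made for this illustration, and epsilon transitions are omitted.

    #include <stdio.h>
    #include <string.h>

    #define NSTATES 4                    /* states 0..3; state 3 is accepting */

    /* Simulate the NFA for [a-z]*ing by tracking every reachable state at once. */
    int nfa_accepts(const char *input)
    {
        int active[NSTATES] = { 1, 0, 0, 0 };    /* start in state 0 only */

        for (const char *p = input; *p != '\0'; p++) {
            int next[NSTATES] = { 0 };
            char c = *p;

            if (active[0] && c >= 'a' && c <= 'z') next[0] = 1;  /* stay in state 0 on [a-z] */
            if (active[0] && c == 'i') next[1] = 1;              /* ...or begin matching "ing" */
            if (active[1] && c == 'n') next[2] = 1;
            if (active[2] && c == 'g') next[3] = 1;

            memcpy(active, next, sizeof(active));
        }
        return active[3];                /* accept if any surviving world is in the final state */
    }

    int main(void)
    {
        printf("%d %d %d\n", nfa_accepts("sing"), nfa_accepts("singer"), nfa_accepts("sin"));
        return 0;                        /* prints: 1 0 0 */
    }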

2.4 Conversion Algorithms

Regular expressions and finite automata
