Lexical and Syntactic Analysis: Vitaly Shmatikov
slide 1
Reading Assignment
Mitchell, Chapter 4.1
C Reference Manual, Chapters 2 and 7
slide 2
Syntax
Syntax of a programming language is a precise description of all grammatically correct programs
Precise formal syntax was first used in ALGOL 60
Lexical syntax
Basic symbols (names, values, operators, etc.)
Concrete syntax
Rules for writing expressions, statements, programs
Abstract syntax
Internal representation of expressions and statements, capturing their meaning (i.e., semantics)
slide 3
Grammars
A meta-language is a language used to define other languages
A grammar is a meta-language used to define the syntax of a language. It consists of:
Finite set of terminal symbols
Finite set of non-terminal symbols
Finite set of production rules
Start symbol
Notation: Backus-Naur Form (BNF)
Language = (possibly infinite) set of all sequences of symbols that can be derived by applying production rules starting from the start symbol
slide 4
slide 6
Leftmost Derivation
Integer ⇒ Integer Digit
        ⇒ Integer Digit Digit
        ⇒ Digit Digit Digit
        ⇒ 3 Digit Digit
        ⇒ 3 5 Digit
        ⇒ 352
At each step, the leftmost non-terminal is replaced
Production rules:
Integer → Digit | Integer Digit
Digit → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
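The derivation above can be replayed mechanically. The following is a minimal Python sketch, assuming the grammar is stored as a dictionary and the rule chosen at each step is supplied by hand; the names GRAMMAR and leftmost_derivation are illustrative and do not come from the slides.

```python
# Grammar from the slide: Integer -> Digit | Integer Digit ; Digit -> 0 | ... | 9
GRAMMAR = {
    "Integer": [["Digit"], ["Integer", "Digit"]],
    "Digit": [[str(d)] for d in range(10)],
}

def leftmost_derivation(start, choices):
    """Apply the chosen alternatives, always rewriting the leftmost non-terminal.

    `choices` gives, for each step, the index of the alternative to use.
    Returns every sentential form, beginning with [start].
    """
    form = [start]
    steps = [list(form)]
    for choice in choices:
        # Locate the leftmost symbol that is a non-terminal.
        pos = next(i for i, sym in enumerate(form) if sym in GRAMMAR)
        form = form[:pos] + GRAMMAR[form[pos]][choice] + form[pos + 1:]
        steps.append(list(form))
    return steps

# Choices that reproduce the derivation of 352.
steps = leftmost_derivation("Integer", [1, 1, 0, 3, 5, 2])
for form in steps:
    print(" ".join(form))
```

For Digit, the alternative index happens to equal the digit itself, which is why the last three choices are 3, 5, 2.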
slide 7
Chomsky Hierarchy
Regular grammars
Regular expressions, finite-state automata
Used to define lexical structure of the language
Context-free grammars
Non-deterministic pushdown automata
Used to define concrete syntax of the language
Regular Grammars
Left regular grammar
All production rules have the form A → ω or A → Bω
Here A, B are non-terminal symbols, ω is a terminal symbol
Example: grammar of decimal integers
Not a regular language: { a^n b^n | n ≥ 1 } (why?)
What about this: any sequence of integers where ( is eventually followed by )?
slide 9
Lexical Analysis
Source code = long string of ASCII characters
Lexical analyzer splits it into tokens
Token = sequence of characters (symbolic name) representing a single terminal symbol
Identifiers: myVariable
Literals: 123 5.67 true
Keywords: char sizeof
Operators: + - * /
Punctuation: ; , } {
Discards whitespace and comments
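A lexical analyzer along these lines can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the token class names, the small KEYWORDS subset, and the function tokenize are hypothetical, and the patterns cover just the examples listed above.

```python
import re

# Hypothetical keyword subset; a real C lexer would list every keyword.
KEYWORDS = {"char", "sizeof", "int", "return"}

# Order matters: comments are tried before operators so "/*" is not read as "/".
TOKEN_SPEC = [
    ("COMMENT", r"/\*.*?\*/"),            # C-style comment, discarded
    ("SPACE",   r"\s+"),                  # whitespace, discarded
    ("LITERAL", r"\d+\.\d+|\d+|true\b"),  # 5.67, 123, true
    ("NAME",    r"[A-Za-z_]\w*"),         # identifier or keyword
    ("OP",      r"[+\-*/]"),
    ("PUNCT",   r"[;,{}]"),
]
MASTER = re.compile("|".join(f"(?P<{n}>{p})" for n, p in TOKEN_SPEC), re.DOTALL)

def tokenize(source):
    """Split `source` into (kind, text) pairs, dropping whitespace and comments."""
    tokens = []
    pos = 0
    while pos < len(source):
        m = MASTER.match(source, pos)
        if m is None:
            raise SyntaxError(f"unexpected character {source[pos]!r} at {pos}")
        kind, text = m.lastgroup, m.group()
        pos = m.end()
        if kind in ("SPACE", "COMMENT"):
            continue
        if kind == "NAME":
            kind = "KEYWORD" if text in KEYWORDS else "IDENTIFIER"
        tokens.append((kind, text))
    return tokens

print(tokenize("char myVariable; /* note */ myVariable + 5.67"))
```

Keywords are matched by the same pattern as identifiers and separated afterwards by a table lookup, which is one common design; giving each keyword its own pattern is another.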
slide 10
Regular Expressions
x               character x
\x              escaped character, e.g., \n
{ name }        reference to a name
M | N           M or N
M N             M followed by N
M*              0 or more occurrences of M
M+              1 or more occurrences of M
[x1 ... xn]     one of x1 ... xn
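Most of the notation listed above carries over directly to Python's re module, so it can be tried out interactively. A small sketch follows; the helper matches and the EXAMPLES table are illustrative names, and the { name } form is specific to lex-style tools, so it is not shown here.

```python
import re

# pattern -> strings that the whole pattern should describe
EXAMPLES = {
    r"a|b":   ["a", "b"],         # M | N : M or N
    r"ab":    ["ab"],             # M N   : M followed by N
    r"a*":    ["", "a", "aaa"],   # M*    : 0 or more occurrences of M
    r"a+":    ["a", "aaa"],       # M+    : 1 or more occurrences of M
    r"[abc]": ["b"],              # [x1 ... xn] : one of the listed characters
    r"\+":    ["+"],              # \x    : escaped character
}

def matches(pattern, text):
    """True if the whole of `text` is described by `pattern`."""
    return re.fullmatch(pattern, text) is not None

for pattern, texts in EXAMPLES.items():
    for text in texts:
        print(repr(pattern), repr(text), matches(pattern, text))
```

re.fullmatch is used rather than re.search so that a pattern must account for the entire string, which is the sense in which a regular expression defines a token.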
slide 11
Examples of Tokens in C
Lexical analyzer usually represents each token by a unique integer code
"+"    { return(PLUS); }     // PLUS = 401
"-"    { return(MINUS); }    // MINUS = 402
"*"    { return(MULT); }     // MULT = 403
"/"    { return(DIV); }      // DIV = 404
slide 12
Reserved Keywords in C
auto, break, case, char, const, continue, default, do, double, else, enum, extern, float, for, goto, if, int, long, register, return, short, signed, sizeof, static, struct, switch, typedef, union, unsigned, void, volatile, while
C++ added a bunch: bool, catch, class, dynamic_cast, inline, private, protected, public, static_cast, template, this, virtual, wchar_t and others
Each keyword is mapped to its own token
slide 13
slide 14
slide 16
Traversing a DFA
Configuration = state + remaining input
Move = traversing the arc exiting the state that corresponds to the leftmost input symbol, thereby consuming it
If no such arc, then:
  If no input and state is final, then accept
  Otherwise, error
Input is accepted if, starting with the start state, the automaton consumes all the input and halts in a final state
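The traversal rules above translate almost line for line into code. Below is a minimal sketch, assuming a hypothetical two-state DFA for unsigned decimal integers; the state names and the function accepts are made up for illustration.

```python
# Hypothetical DFA: "start" --digit--> "digits" --digit--> "digits"
START = "start"
FINAL = {"digits"}
TRANSITIONS = {}
for d in "0123456789":
    TRANSITIONS[("start", d)] = "digits"
    TRANSITIONS[("digits", d)] = "digits"

def accepts(text):
    """Run the DFA on `text`; True only if all input is consumed in a final state."""
    state = START
    remaining = text                   # configuration = (state, remaining input)
    while remaining:
        arc = (state, remaining[0])    # arc for the leftmost input symbol
        if arc not in TRANSITIONS:
            return False               # input left but no such arc: error
        state = TRANSITIONS[arc]
        remaining = remaining[1:]      # the move consumes the symbol
    return state in FINAL              # no input left: accept only if final

print(accepts("352"), accepts(""), accepts("3a5"))
```

Because each (state, symbol) pair maps to at most one next state, the loop never has to backtrack; that is what makes the automaton deterministic.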
slide 17
Context-Free Grammars
Used to describe concrete syntax
Typically using BNF notation
Child nodes = RHS of this production rule
Each leaf node = terminal symbol (token) or empty
slide 18
Syntactic Correctness
Lexical analyzer produces a stream of tokens
Parser (syntactic analyzer) verifies that this token stream is syntactically correct by constructing a valid parse tree for the entire program
Unique parse tree for each language construct
Program = collection of parse trees rooted at the top by a special start symbol
Parser can be built automatically from the BNF description of the language's CFG
Example tools: yacc, Bison
slide 19
::= stands for production rule; <> are non-terminals; | represents alternatives for the right-hand side of a production rule
slide 20
Sample derivation:
slide 21
slide 22
slide 23
Shift-Reduce Parsing
Idea: build the parse tree bottom-up
Lexer supplies a token, parser finds a production rule with matching right-hand side (i.e., runs the rules in reverse)
If start symbol is reached, parsing is successful
789
7 8 <digit>
    reduce
7 8 <num>
    shift
7 <digit> <num>
    reduce
7 <num>
    shift
<digit> <num>
    reduce
<num>
Production rules:
Num → Digit | Digit Num
Digit → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
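The bottom-up idea can be sketched with an explicit stack. This is a hand-written recognizer for the Num grammar only, not what a generated parser would look like; the function name shift_reduce and the trace format are assumptions. It reads tokens left to right, so the order of its actions may differ from the trace shown above.

```python
DIGITS = "0123456789"

def shift_reduce(text):
    """Recognize `text` with Num -> Digit | Digit Num, Digit -> 0 | ... | 9.

    Returns (accepted, trace); trace records each action and the stack after it.
    """
    stack, trace = [], []
    for ch in text:
        stack.append(ch)                       # shift the next token
        trace.append(("shift", list(stack)))
        if ch not in DIGITS:
            return False, trace                # no rule has this token on its RHS
        stack[-1] = "Digit"                    # reduce by Digit -> 0 | ... | 9
        trace.append(("reduce", list(stack)))
    if stack:
        stack[-1] = "Num"                      # end of input: reduce by Num -> Digit
        trace.append(("reduce", list(stack)))
    while stack[-2:] == ["Digit", "Num"]:
        stack[-2:] = ["Num"]                   # reduce by Num -> Digit Num
        trace.append(("reduce", list(stack)))
    return stack == ["Num"], trace

ok, trace = shift_reduce("789")
for action, stack in trace:
    print(f"{action:6} {' '.join(stack)}")
print("accepted" if ok else "rejected")
```

Since the grammar is right-recursive, all digits sit on the stack before the first Num reduction; the stack depth therefore grows with the length of the input.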
slide 24
Expression Notation
Inorder traversal
(3+4)*5=35
3+(4*5)=23
When constructing expression trees, we want inorder traversal to produce correct arithmetic result based on operator precedence and associativity
Postorder traversal
3 4 + 5 * =35
3 4 5 * + =23
Easily evaluated using operand stack (example: Forth)
Leaf node: push operand value on the stack
Non-leaf binary or unary operator: pop two (resp. one) values from stack, apply operator, push result back on the stack
End of evaluation: print top of the stack
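The stack procedure just described fits in a dozen lines. A minimal sketch, assuming space-separated tokens and using the made-up token "neg" for unary minus so that arity is unambiguous:

```python
import operator

BINARY = {"+": operator.add, "-": operator.sub,
          "*": operator.mul, "/": operator.truediv}

def eval_postfix(tokens):
    """Evaluate a postfix expression given as a list of tokens."""
    stack = []
    for tok in tokens:
        if tok in BINARY:
            right = stack.pop()          # binary operator: pop two values
            left = stack.pop()
            stack.append(BINARY[tok](left, right))
        elif tok == "neg":               # hypothetical unary minus: pop one value
            stack.append(-stack.pop())
        else:
            stack.append(float(tok))     # operand: push its value
    return stack[-1]                     # result is on top of the stack

print(eval_postfix("3 4 + 5 *".split()))   # 35.0
print(eval_postfix("3 4 5 * +".split()))   # 23.0
```

No parentheses or precedence rules are needed: the order of the tokens alone determines the order of evaluation.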
slide 26
Prefix:
Need to indicate arity to distinguish between unary and binary minus
slide 27
Ternary conditional
(conditional-expr) ? (then-expr) : (else-expr);
Example: int min(int a, int b) { return (a<b) ? a : b; }
This is an expression, NOT an if-then-else command
What is the type of this expression?
slide 28
slide 29
Syntactic Ambiguity
How to parse a+b*c using this grammar?
This grammar is ambiguous
Only this tree is semantically correct (operator precedence and associativity are semantic, not syntactic rules)
Removing Ambiguity
Define a distinct non-terminal symbol for each operator precedence level
Define RHS of production rule to enforce proper associativity
Extra non-terminal for smallest subexpressions
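One way to see these three rules at work is a small recursive-descent evaluator with one function per non-terminal. The grammar assumed here is a common textbook one and is not taken from the slides: Expr handles + and -, Term handles * and /, and Factor covers numbers and parenthesized expressions. Left recursion is implemented with loops, which yields left associativity.

```python
import re

def evaluate(text):
    """Evaluate an arithmetic expression over integers, + - * / and parentheses."""
    # Simplification: characters other than digits and operators are skipped.
    tokens = re.findall(r"\d+|[-+*/()]", text)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        if pos >= len(tokens):
            raise SyntaxError("unexpected end of input")
        pos += 1
        return tokens[pos - 1]

    def expr():                          # lowest precedence level: + and -
        value = term()
        while peek() in ("+", "-"):      # loop = left associativity
            if advance() == "+":
                value += term()
            else:
                value -= term()
        return value

    def term():                          # next level: * and /
        value = factor()
        while peek() in ("*", "/"):
            if advance() == "*":
                value *= factor()
            else:
                value /= factor()
        return value

    def factor():                        # smallest subexpressions
        tok = advance()
        if tok == "(":
            value = expr()
            if advance() != ")":
                raise SyntaxError("expected ')'")
            return value
        if not tok.isdigit():
            raise SyntaxError("unexpected token " + tok)
        return int(tok)

    result = expr()
    if pos != len(tokens):
        raise SyntaxError("unexpected token " + tokens[pos])
    return result

print(evaluate("3+4*5"), evaluate("(3+4)*5"), evaluate("8-3-2"))
```

Because term is called from inside expr, every multiplication is grouped before the surrounding addition sees it, so precedence follows from the grammar's shape rather than from a separate rule.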
slide 31
slide 32
Right-recursive grammar
slide 33
slide 34
slide 35
Algol 68, Modula, Ada: use an explicit delimiter to end every conditional (e.g., if ... endif)
Java: rewrite the grammar and restrict what can appear inside a nested if statement
IfThenStmt → if ( Expr ) Stmt
IfThenElseStmt → if ( Expr ) StmtNoShortIf else Stmt
The category StmtNoShortIf includes all statements except IfThenStmt
slide 36
This grammar is ambiguous!
By default, Yacc shifts (i.e., pushes the token onto the parser's stack) and generates a warning
Equivalent to associating else with closest if (this is correct semantics!)
slide 37
Forces the parser to shift ELSE onto the stack because it has higher precedence than the dummy LOWER_THAN_ELSE token
slide 38