Parsing-Lexicalization Text Mining
PCFGs
Introduction
Christopher Manning
[Figure: parse tree diagrams; surviving node labels are VP, NP, and PP]
Lexicalization of PCFGs
The model of Charniak (1997)
Charniak (1997)
“monolexical” probabilities
“Bilexical probabilities”
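The difference between the two kinds of statistic can be illustrated with plain relative-frequency estimates. This is a minimal sketch, not Charniak's estimator: his model also conditions on syntactic categories and interpolates several backed-off estimates, both of which are left out here. The counts, the example words, and the function names `p_rule_given_head` and `p_head_given_parent_head` are all hypothetical.

```python
from collections import Counter

# Hypothetical toy counts, invented purely for illustration.
# (head word of a VP, rule used to expand that VP) -> count
rule_counts = Counter({
    ("put", "VP -> V NP PP"): 8,
    ("put", "VP -> V NP"): 2,
    ("sleep", "VP -> V"): 9,
    ("sleep", "VP -> V NP"): 1,
})

# (head word of the parent, head word of a dependent child) -> count
head_pair_counts = Counter({
    ("eat", "pizza"): 6,
    ("eat", "soup"): 3,
    ("eat", "fork"): 1,
})


def p_rule_given_head(rule, head):
    """Monolexical: P(rule | head word). Only one word is involved."""
    total = sum(c for (h, _), c in rule_counts.items() if h == head)
    return rule_counts[(head, rule)] / total if total else 0.0


def p_head_given_parent_head(child_head, parent_head):
    """Bilexical: P(child head | parent head). Two words are involved."""
    total = sum(c for (p, _), c in head_pair_counts.items() if p == parent_head)
    return head_pair_counts[(parent_head, child_head)] / total if total else 0.0


if __name__ == "__main__":
    # The verb alone sharpens the choice of rule expansion.
    print(p_rule_given_head("VP -> V NP PP", "put"))
    print(p_rule_given_head("VP -> V NP PP", "sleep"))
    # A pair of words scores how plausible a head-dependent link is.
    print(p_head_given_parent_head("pizza", "eat"))
```

The monolexical table grows with the vocabulary, while the bilexical table grows with pairs of words, which is why the second kind of statistic is much sparser in practice.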
PCFG Independence Assumptions
Non-Independence I
[Figure: bar chart of percentages; bar labels not recoverable]
Non-Independence II
• Symptoms of overly strong assumptions:
• Rewrites get used where they don’t belong
• Thesis
• Most of what you need for accurate parsing, and much of what lexicalized
PCFGs actually capture, isn’t lexical selection between content words but
just basic grammatical features, like verb form, finiteness, presence of a
verbal auxiliary, etc.
Experimental Approach
Horizontal Markovization
[Figure: two plots against Horizontal Markov Order (0, 1, 2v, 2, inf); left axis 70% to 74%, right axis 0 to 12000]
Vertical Markovization
[Figure: two plots against Vertical Markov Order (1, 2v, 2, 3v, 3); left axis 72% to 79%, right axis "Symbols", 0 to 25000]

Model     F1     Size
v=h=2v    77.8   7.5K
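Vertical Markovization amounts to annotating each phrasal label with some of its ancestors before reading off the grammar. The sketch below assumes trees are nested tuples of the form `(label, child, ...)` with `(tag, word)` preterminals, leaves part-of-speech tags unannotated, and uses a hypothetical `^` separator and function name `annotate_parents`. Order v=1 is the plain treebank grammar and v=2 adds the parent; the variable-history "2v" and "3v" settings are not implemented.

```python
def annotate_parents(tree, ancestors=(), v=2):
    """Return a copy of tree whose phrasal labels carry v-1 ancestors."""
    if isinstance(tree, str):
        return tree  # a word
    label, *children = tree
    if len(children) == 1 and isinstance(children[0], str):
        return tree  # a (tag, word) preterminal, left as-is in this sketch
    # Keep the v-1 nearest ancestors, nearest first.
    context = ancestors[-(v - 1):] if v > 1 else ()
    new_label = "^".join((label,) + tuple(reversed(context)))
    new_children = [annotate_parents(c, ancestors + (label,), v) for c in children]
    return (new_label, *new_children)


if __name__ == "__main__":
    tree = ("S",
            ("NP", ("PRP", "He")),
            ("VP", ("VBD", "ate"), ("NP", ("NN", "pizza"))))
    print(annotate_parents(tree, v=2))
```

After annotation, a subject NP (under S) and an object NP (under VP) are distinct symbols, so their expansions get separate probability estimates.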
Unary Splits
• Problem: unary rewrites are used to transmute categories so a high-probability rule can be used.
• Solution: Mark unary rewrite sites with -U

Annotation   F1     Size
Base         77.8   7.5K
UNARY        78.3   8.0K
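Marking unary rewrite sites can be sketched as a small tree transformation. The code below assumes nested-tuple trees with `(tag, word)` preterminals and treats any phrasal node with exactly one child as a unary rewrite site; that definition and the function name `mark_unary` are assumptions made for illustration.

```python
def mark_unary(tree):
    """Return a copy of tree with -U appended to phrasal nodes that have one child."""
    if isinstance(tree, str):
        return tree  # a word
    label, *children = tree
    if len(children) == 1 and isinstance(children[0], str):
        return tree  # a (tag, word) preterminal is not a phrasal rewrite
    new_children = [mark_unary(c) for c in children]
    if len(children) == 1:
        label = label + "-U"
    return (label, *new_children)


if __name__ == "__main__":
    tree = ("ROOT",
            ("S",
             ("NP", ("DT", "the"), ("NN", "dog")),
             ("VP", ("VBZ", "barks"))))
    print(mark_unary(tree))
```

Because a marked symbol such as `VP-U` can only be expanded by unary rules, the grammar can no longer use a unary rewrite to reach a category and then apply that category's high-probability multi-child rules.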
Tag Splits
• Partial Solution:
• Subdivide the IN tag.

Annotation   F1     Size
Previous     78.3   8.0K
SPLIT-IN     80.3   8.1K
Yield Splits
• Examples:
• Possessive NPs
• Finite vs. infinitival VPs
• Lexical heads!
Parser               LP     LR     F1
Magerman 95          84.9   84.6   84.7
Collins 96           86.3   85.8   86.0
Klein & Manning 03   86.9   85.7   86.3
Charniak 97          87.4   87.5   87.4
Collins 99           88.7   88.6   88.6
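The F1 column is the harmonic mean of labeled precision (LP) and labeled recall (LR). As a quick sanity check, the sketch below recomputes it from the LP and LR figures listed above; the function name `f1` and the `results` dictionary are just names chosen for this example.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (LP, LR, reported F1), copied from the table above.
results = {
    "Magerman 95": (84.9, 84.6, 84.7),
    "Collins 96": (86.3, 85.8, 86.0),
    "Klein & Manning 03": (86.9, 85.7, 86.3),
    "Charniak 97": (87.4, 87.5, 87.4),
    "Collins 99": (88.7, 88.6, 88.6),
}

if __name__ == "__main__":
    for parser, (lp, lr, reported) in results.items():
        print(f"{parser}: computed F1 = {f1(lp, lr):.1f}, reported F1 = {reported}")
```

The harmonic mean is pulled toward the lower of the two scores, so a parser cannot reach a high F1 by trading away recall for precision or the reverse.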
[Figure: Number of phrasal subcategories, by category: VP, PP, ADVP, ADJP, SBAR, QP, WHNP, PRN, NX, SINV, PRT, WHPP, SQ, CONJP, FRAG, NAC, UCP, WHADVP, INTJ, SBARQ, RRC, WHADJP, ROOT, LST]
The Latest Parsing Results… (English PTB3 WSJ train 2-21, test 23)
Parser   F1 (≤ 40 words)   F1 (all words)