cs224n 2023 Lecture04 Dep Parsing
Christopher Manning
Lecture 4: Dependency Parsing
Lecture Plan
Syntactic Structure and Dependency parsing
1. Syntactic Structure: Consistency and Dependency (30 mins)
2. Dependency Grammar and Treebanks (15 mins)
3. Transition-based dependency parsing (15 mins)
4. Neural dependency parsing (20 mins)
Key Learnings: Explicit linguistic structure and how a neural net can decide it
Reminders/comments:
• In Assignment 3, out on Tuesday, you build a neural dependency parser using PyTorch!
• Start installing and learning PyTorch (Assignment 3 is quite scaffolded)
• Come to the PyTorch tutorial, Friday 3:30pm, Gates B01
• Final project discussions – come meet with us; focus of Tuesday class in week 4
1. The linguistic structure of sentences – two views: Constituency
= phrase structure grammar = context-free grammars (CFGs)
Phrase structure organizes words into nested constituents
(Figure: building nested constituents step by step – starting from noun phrases like “the cat” and “a dog”, adding adjectives (“large”, “barking”, “cuddly”), prepositional phrases (“in a crate”, “on the table”, “by the door”), and verbs with complements (“talk to”, “walked behind”).)
Two views of linguistic structure: Dependency structure
• Dependency structure shows which words depend on (modify, attach to, or are
arguments of) which other words.
Why do we need sentence structure?
Human listeners need to work out what modifies [attaches to] what
Prepositional phrase attachment ambiguity
PP attachment ambiguities multiply
Shuttle veteran and longtime NASA executive Fred Gregory appointed to board
Coordination scope ambiguity
Adjectival/Adverbial Modifier Ambiguity
Verb Phrase (VP) attachment ambiguity
Dependency paths help extract semantic interpretation –
simple practical example: extracting protein-protein interaction
(Figure: the dependency path through the verb “demonstrated”, via its nsubj and ccomp arcs, links the interacting proteins.)
2. Dependency Grammar and Dependency Structure
Dependency syntax postulates that syntactic structure consists of relations between
lexical items, normally binary asymmetric relations (“arrows”) called dependencies
(Example dependency tree for: “Bills on ports and immigration were submitted by Senator Brownback, Republican of Kansas.”)
The arrows are commonly typed with the name of grammatical relations (subject, prepositional object, apposition, etc.); in the example above the arcs carry labels such as nsubj:pass, aux, obl, nmod, case, appos, flat, cc, and conj.
Pāṇini’s grammar (c. 5th century BCE)
Gallery: http://wellcomeimages.org/indexplus/image/L0032691.html
CC BY 4.0 File:Birch bark MS from Kashmir of the Rupavatra Wellcome L0032691.jpg
But this comes from much later – originally the grammar was oral
Dependency Grammar and Dependency Structure
• Some people draw the arrows one way; some the other way!
• Tesnière had them point from head to dependent – we follow that convention
• We usually add a fake ROOT so every word is a dependent of precisely 1 other node
The rise of annotated data & Universal Dependencies treebanks
Brown corpus (1967; PoS tagged 1979); Lancaster-IBM Treebank (starting late 1980s);
Marcus et al. 1993, The Penn Treebank, Computational Linguistics;
Universal Dependencies: http://universaldependencies.org/
The rise of annotated data
Starting off, building a treebank seems a lot slower and less useful than writing a grammar (by hand)
Dependency Conditioning Preferences
What are the straightforward sources of information for dependency parsing?
1. Bilexical affinities: the dependency [discussion → issues] is plausible
2. Dependency distance: most dependencies are between nearby words
3. Intervening material: dependencies rarely span intervening verbs or punctuation
4. Valency of heads: how many dependents on which side are usual for a head?
Basic transition-based dependency parser
A configuration consists of a stack σ (initially [root]), a buffer β (initially the whole sentence), and a set A of dependency arcs (initially empty). The parser repeatedly applies one of three transitions until β is empty and only [root] remains on σ:
• Shift: move the first word of β onto σ
• Left-Arc: add an arc from the top of σ to the second item, and remove the second item from σ
• Right-Arc: add an arc from the second item of σ to the top, and remove the top from σ
Arc-standard transition-based parser
(there are other transition schemes …)
Analysis of “I ate fish”

Start:      σ = [root]            β = I ate fish       A = { }
Shift:      σ = [root] I          β = ate fish
Shift:      σ = [root] I ate      β = fish
Left Arc:   σ = [root] ate        β = fish             A += nsubj(ate → I)
Shift:      σ = [root] ate fish   β = ∅
Right Arc:  σ = [root] ate        β = ∅                A += obj(ate → fish)
Right Arc:  σ = [root]            β = ∅                A += root([root] → ate)
Finish:     A = { nsubj(ate → I), obj(ate → fish), root([root] → ate) }

Nota bene: in this example I’ve at each step made the “correct” next transition. But a parser has to work this out – by exploring or inferring!
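The trace above can be sketched in a few lines of code. This is a minimal arc-standard transition system, not Assignment 3's parser: the function name is made up, and the "correct" transition sequence is supplied by hand rather than predicted by a classifier, which is exactly what a real parser must learn to do.

```python
# Minimal arc-standard transition system (illustrative sketch).
def parse(words, transitions):
    """Apply a given sequence of (action, label) transitions."""
    stack = ["[root]"]      # sigma: starts with the fake ROOT
    buffer = list(words)    # beta: the sentence, left to right
    arcs = set()            # A: dependency arcs built so far
    for action, label in transitions:
        if action == "shift":           # move next word onto the stack
            stack.append(buffer.pop(0))
        elif action == "left-arc":      # top of stack governs second item
            dep = stack.pop(-2)
            arcs.add((label, stack[-1], dep))
        elif action == "right-arc":     # second item governs top of stack
            dep = stack.pop()
            arcs.add((label, stack[-1], dep))
    return arcs

# The "I ate fish" example from the slide:
gold = [("shift", None), ("shift", None), ("left-arc", "nsubj"),
        ("shift", None), ("right-arc", "obj"), ("right-arc", "root")]
arcs = parse(["I", "ate", "fish"], gold)
# arcs == {("nsubj", "ate", "I"), ("obj", "ate", "fish"), ("root", "[root]", "ate")}
```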
MaltParser [Nivre and Hall 2005]
• We have yet to explain how we choose the next action 🤷
• Answer: Stand back, I know machine learning!
• Each action is predicted by a discriminative classifier (e.g., softmax classifier) over each
legal move
• Max of 3 untyped choices (max of |R| × 2 + 1 when typed)
• Features: top of stack word, POS; first in buffer word, POS; etc.
• There is NO search (in the simplest form)
• But you can profitably do a beam search if you wish (slower but better):
• You keep k good parse prefixes at each time step
• The model’s accuracy is fractionally below the state of the art in dependency parsing,
but
• It provides very fast linear time parsing, with high accuracy – great for parsing the web
Conventional Feature Representation
Features are binary and sparse: 0 0 0 1 0 0 1 0 … 0 0 1 0, with dim = 10⁶–10⁷
Feature templates: usually a combination of 1–3 elements from the configuration (indicator features)
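A sketch of what such indicator features look like in code. The template strings are illustrative, not MaltParser's actual feature set, and hashing into a fixed-size vector is an assumption for brevity (real systems typically use a feature dictionary):

```python
# Sketch: conventional one-hot indicator features for a parser configuration.
DIM = 10**6  # typical dimensionality: 10^6 - 10^7

def indicator_features(stack, buffer):
    """Return the active (value-1) indices of the sparse feature vector.
    stack/buffer hold (word, POS) pairs."""
    templates = []
    if stack:
        templates.append("s1.w=" + stack[-1][0])    # top-of-stack word
        templates.append("s1.t=" + stack[-1][1])    # top-of-stack POS
    if buffer:
        templates.append("b1.w=" + buffer[0][0])    # first-in-buffer word
        templates.append("b1.t=" + buffer[0][1])    # first-in-buffer POS
    if stack and buffer:                            # a combined template
        templates.append("s1.t+b1.t=" + stack[-1][1] + "+" + buffer[0][1])
    return {hash(t) % DIM for t in templates}  # these indices are 1; all others 0

active = indicator_features([("ate", "VBD")], [("fish", "NN")])
```

All other dimensions of the million-dimensional vector are zero, which is why these models are sparse and expensive to compute with.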
Evaluation of Dependency Parsing: (labeled) dependency accuracy
ROOT She saw the video lecture
0    1   2   3   4     5

Gold                          Parsed
1  2  She      nsubj          1  2  She      nsubj
2  0  saw      root           2  0  saw      root
3  5  the      det            3  4  the      det
4  5  video    nn             4  5  video    nsubj
5  2  lecture  obj            5  2  lecture  ccomp

UAS (head correct) = 4 / 5 = 80%
LAS (head and label correct) = 2 / 5 = 40%
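Both metrics are easy to compute. A minimal sketch (the function name is made up), using the gold and parsed analyses from the slide:

```python
# UAS/LAS from gold vs. predicted (head, label) pairs, one per word.
def attachment_scores(gold, pred):
    uas_correct = sum(g[0] == p[0] for g, p in zip(gold, pred))  # head matches
    las_correct = sum(g == p for g, p in zip(gold, pred))        # head and label match
    n = len(gold)
    return uas_correct / n, las_correct / n

# "She saw the video lecture" example:
gold = [(2, "nsubj"), (0, "root"), (5, "det"), (5, "nn"), (2, "obj")]
pred = [(2, "nsubj"), (0, "root"), (4, "det"), (5, "nsubj"), (2, "ccomp")]
uas, las = attachment_scores(gold, pred)
# uas == 0.8, las == 0.4
```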
Handling non-projectivity
• The arc-standard algorithm we just presented only builds projective dependency trees
• Possible directions to head:
1. Just declare defeat on nonprojective arcs 🤷
2. Use dependency formalism which only has projective representations
• A CFG only allows projective structures; you promote the head of projectivity violations
3. Use a postprocessor to a projective dependency parsing algorithm to identify and
resolve nonprojective links
4. Add extra transitions that can model at least most non-projective structures (e.g.,
adding an extra SWAP transition allows any non-projectivity, cf. bubble sort)
5. Move to a parsing mechanism that does not use or require any constraints on
projectivity (e.g., the graph-based MSTParser or Dozat and Manning (2017))
4. Why do we gain from a neural dependency parser?
Indicator Features Revisited
Categorical features are:
• Problem #1: sparse
• Problem #2: incomplete
• Problem #3: expensive to compute
Neural approach: learn a dense and compact feature representation, e.g.
0.1 0.9 −0.2 0.3 … −0.1 −0.5, with dim ≈ 1000
A neural dependency parser [Chen and Manning 2014]
• Results on English parsing to Stanford Dependencies:
• Unlabeled attachment score (UAS) = head
• Labeled attachment score (LAS) = head and label
• Meanwhile, part-of-speech tags (POS) and dependency labels are also represented as
d-dimensional vectors. was were
• The smaller discrete sets also exhibit many semantic similarities (e.g., the vectors for was and were end up close together).
Extracting Tokens & vector representations from configuration
We extract the word, POS tag, and dependency label (where one exists) for a set of positions in the configuration:

s1      good     JJ    ∅
s2      has      VBZ   ∅
b1      control  NN    ∅
lc(s1)  ∅        ∅     ∅
rc(s1)  ∅        ∅     ∅
lc(s2)  He       PRP   nsubj
rc(s2)  ∅        ∅     ∅

A concatenation of the vector representations of all of these is the neural representation of a configuration.
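The extract-and-concatenate step can be sketched with numpy. The vocabularies, the embedding dimension d, and the `embed` helper are all made up for illustration (Chen & Manning use d = 50 and many more extracted positions):

```python
import numpy as np

# Sketch: build the dense input x by concatenating word, POS-tag, and
# dependency-label embeddings for tokens extracted from the configuration.
d = 4                                    # toy embedding dimension
words  = ["<null>", "good", "has", "control", "He"]
tags   = ["<null>", "JJ", "VBZ", "NN", "PRP"]
labels = ["<null>", "nsubj"]

rng = np.random.default_rng(0)
E_w = rng.normal(size=(len(words), d))   # word embedding matrix
E_t = rng.normal(size=(len(tags), d))    # POS-tag embedding matrix
E_l = rng.normal(size=(len(labels), d))  # dependency-label embedding matrix

def embed(extracted):
    """extracted: (word, tag, label) triples; <null> marks an empty slot (∅)."""
    parts = []
    for w, t, l in extracted:
        parts += [E_w[words.index(w)], E_t[tags.index(t)], E_l[labels.index(l)]]
    return np.concatenate(parts)         # x: one long dense vector

config = [("good", "JJ", "<null>"), ("has", "VBZ", "<null>"),
          ("control", "NN", "<null>"), ("He", "PRP", "nsubj")]
x = embed(config)
# x has length 4 positions * 3 embeddings * d = 48
```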
Second win: Deep Learning classifiers are non-linear classifiers
• A softmax classifier assigns classes y ∈ C based on inputs x ∈ ℝᵈ via the probability:
p(y | x) = exp(W_y · x) / Σ_{c ∈ C} exp(W_c · x)
• Traditional ML classifiers (including Naïve Bayes, SVMs, logistic regression and softmax
classifier) are not very powerful classifiers: they only give linear decision boundaries
• But neural networks can use multiple layers to learn much more complex nonlinear
decision boundaries
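A tiny numeric illustration of the softmax classifier above (the weight matrix and input are made up). Because the scores are a single linear map of x, the boundaries between classes are linear:

```python
import numpy as np

# Softmax classifier: linear scores followed by normalization.
def softmax_classify(W, b, x):
    scores = W @ x + b            # one score per class: W_c . x + b_c
    scores -= scores.max()        # stabilize the exponentials
    e = np.exp(scores)
    return e / e.sum()            # p(y | x), sums to 1

W = np.array([[1.0, -1.0], [-1.0, 1.0], [0.5, 0.5]])  # 3 classes, 2-dim input
b = np.zeros(3)
p = softmax_classify(W, b, np.array([2.0, 0.0]))
# p sums to 1; class 0 wins since W_0 . x = 2 is the largest score
```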
Neural Dependency Parser Model Architecture
(A simple feed-forward neural network multi-class classifier)
Input layer x: the embeddings looked up for the extracted tokens, concatenated into one vector
Hidden layer h = ReLU(Wx + b1): re-represents the input, moving it around in an intermediate vector space so that it can be easily classified with a (linear) softmax
Output layer y = softmax(Uh + b2): softmax probabilities over the legal transitions
Log loss (cross-entropy error) is back-propagated through the network, all the way to the embeddings.
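The full forward pass is short. A sketch in plain numpy with made-up layer sizes (Assignment 3 builds the real thing in PyTorch):

```python
import numpy as np

# Forward pass of the Chen & Manning-style classifier: dense input x,
# one ReLU hidden layer, softmax over the legal transitions.
rng = np.random.default_rng(1)
n_features, n_hidden, n_classes = 48, 200, 3   # 3 = Shift / Left-Arc / Right-Arc

W, b1 = rng.normal(scale=0.1, size=(n_hidden, n_features)), np.zeros(n_hidden)
U, b2 = rng.normal(scale=0.1, size=(n_classes, n_hidden)), np.zeros(n_classes)

def forward(x):
    h = np.maximum(0.0, W @ x + b1)   # hidden layer: h = ReLU(Wx + b1)
    s = U @ h + b2                    # unnormalized class scores
    s -= s.max()                      # numerical stability
    e = np.exp(s)
    return e / e.sum()                # y = softmax(Uh + b2)

y = forward(rng.normal(size=n_features))
# y is a valid probability distribution over the three transitions
```

Training would back-propagate the cross-entropy loss through U and W and into the embedding matrices that produced x.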
Dependency parsing for sentence structure
Chen & Manning (2014) showed that neural networks can accurately
determine the structure of sentences, supporting meaning interpretation
This paper was the first simple and successful neural dependency parser
Further developments in transition-based neural dependency parsing
This work was further developed and improved by others, including in particular at Google
• Bigger, deeper networks with better tuned hyperparameters
• Beam search
• Global, conditional random field (CRF)-style inference over the decision sequence
Leading to SyntaxNet and the Parsey McParseFace model (2016):
“The World’s Most Accurate Parser”
https://research.googleblog.com/2016/05/announcing-syntaxnet-worlds-most.html
Method UAS LAS (PTB WSJ SD 3.3)
Chen & Manning 2014 92.0 89.7
Weiss et al. 2015 93.99 92.05
Andor et al. 2016 94.61 92.79
Graph-based dependency parsers
• Compute a score for every possible dependency for each word
• Doing this well requires good “contextual” representations of each word token,
which we will develop in coming lectures
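A toy sketch of the graph-based idea. The scores here are random stand-ins for what a learned scorer over contextual representations would produce, and the greedy per-word argmax is a simplification: real graph-based parsers (e.g., MSTParser, Dozat and Manning 2017) decode a well-formed tree, such as a maximum spanning tree.

```python
import numpy as np

# Graph-based parsing, toy version: a score for every possible
# (head, dependent) pair, here for "ROOT She saw the video lecture".
tokens = ["ROOT", "She", "saw", "the", "video", "lecture"]
n = len(tokens)

rng = np.random.default_rng(2)
scores = rng.normal(size=(n, n))      # scores[h, d] = score of arc h -> d
np.fill_diagonal(scores, -np.inf)     # a word cannot be its own head
scores[:, 0] = -np.inf                # ROOT takes no head

# Greedy decoding: each real word independently picks its best-scoring head.
heads = scores[:, 1:].argmax(axis=0)  # head index for words 1..5
# (A real graph-based parser decodes a tree instead of independent argmaxes.)
```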