
Natural Language Processing
DSECL ZG565
Prof. Vijayalakshmi
BITS Pilani, Pilani Campus

Session 2
Date – 6th Sep 2020
Time – 9 am to 11 am
These slides are prepared by the instructor, with grateful acknowledgement of James
Allen and many others who made their course materials freely available online.
Session #2 – N-gram Language Models

• Human word Prediction


• Language Models
 What is language model
 Why language models
• N-gram language models
 Uni-gram , bi-gram and N-gram
• Evaluation of Language models
 Perplexity
• Smoothing
 Laplace smoothing
 Interpolation and Backoff

BITS Pilani, Pilani Campus


Human word Prediction

• We have the ability to predict future words in an
  utterance.
• How?
    Based on domain knowledge
      red blood
    Based on syntactic knowledge
      the <adj/noun>
    Based on lexical knowledge
      baked <potato>

BITS Pilani, Pilani Campus


Example

BITS Pilani, Pilani Campus


Language modelling

A model that computes either of these:

   the probability of a sentence, P(W), or
   the probability of an upcoming word, P(wn | w1, w2, …, wn−1),

is called a language model.

Simply put, a language model learns to predict the probability of a
sequence of words.

BITS Pilani, Pilani Campus


Language Models

1. Machine Translation:
   A machine translation system translates text from one
   language to another, for example Chinese to English or
   German to English.

BITS Pilani, Pilani Campus


Continued..

2. Spell correction

   P(about fifteen minutes from) > P(about fifteen minuets from)

BITS Pilani, Pilani Campus


3.Speech Recognition
Speech recognition is the ability of a machine or
program to identify words and phrases in spoken
language and convert them to a machine-readable
format.

BITS Pilani, Pilani Campus


How to build a language model

• Recall the definition of conditional probability:

      P(B|A) = P(A,B) / P(A)      Rewriting: P(A,B) = P(A) P(B|A)

• More variables:
      P(A,B,C,D) = P(A) P(B|A) P(C|A,B) P(D|A,B,C)
• The Chain Rule in general:
      P(x1, x2, x3, …, xn) = P(x1) P(x2|x1) P(x3|x1,x2) … P(xn|x1, …, xn−1)

BITS Pilani, Pilani Campus


Example

In a factory there are 100 units of a certain product, 5 of which are defective. We
pick three units from the 100 units at random. What is the probability that none of
them are defective?
Let Ai be the event that the i-th picked unit is not defective, i = 1, 2, 3. By the chain rule,
P(A1, A2, A3) = P(A1) P(A2|A1) P(A3|A1, A2) = (95/100) × (94/99) × (93/98) ≈ 0.856

BITS Pilani, Pilani Campus


The Chain Rule applied to compute joint
probability of words in sentence

• P(w1 w2 … wn) = ∏i P(wi | w1 w2 … wi−1)


• For ex:
P(“its water is so transparent”) =
P(its) × P(water|its) × P(is | its water)
× P(so|its water is) × P( transparent |its water is so)
• P(most biologists and specialist believe that in fact the
mythical unicorn horns derived from the narwhal)

BITS Pilani, Pilani Campus


Markov Assumption

• Simplifying assumption (Andrei Markov):
   limit the history to a fixed number (N−1) of preceding words

      P(the | its water is so transparent that) ≈ P(the | that)

   or

      P(the | its water is so transparent that) ≈ P(the | transparent that)

BITS Pilani, Pilani Campus


Markov Assumption

BITS Pilani, Pilani Campus


N -Gram Language models

BITS Pilani, Pilani Campus


N-gram Language models

• An N-gram is a sequence of N tokens (words)

• Unigram language model (N = 1)

Example:
P(I want to eat Chinese food) ≈ P(I) P(want) P(to) P(eat) P(Chinese) P(food)

BITS Pilani, Pilani Campus


Bigram model
N=2

Example:
P(I want to eat Chinese food)≈ P(I|<start>) P(want|I) P(to|want)
P(eat|to) P(Chinese|eat) P(food|Chinese)
P(<end>|food)

BITS Pilani, Pilani Campus


N-gram models

• We can extend to trigrams, 4-grams, 5-grams


Advantages
• no human supervision, easy to extend to more data,
allows querying about open-class relations,
Disadvantage:
• In general this is an insufficient model of language
– because language has long-distance dependencies:
“The computer(s) which I had just put into the machine room
on the fifth floor is (are) crashing.”

BITS Pilani, Pilani Campus


Estimating N-gram Probabilities

BITS Pilani, Pilani Campus


Estimating bigram probabilities

• Estimating the probability as the relative frequency is
  the maximum likelihood estimate (MLE), because this
  value makes the observed data maximally likely.
• The Maximum Likelihood Estimate:

      P(wi | wi−1) = count(wi−1, wi) / count(wi−1) = c(wi−1, wi) / c(wi−1)

BITS Pilani, Pilani Campus


An example

Training corpus:
   <s> I am Sam </s>
   <s> Sam I am </s>
   <s> I do not like green eggs and ham </s>

      P(wi | wi−1) = c(wi−1, wi) / c(wi−1)

For example: P(I | <s>) = 2/3,  P(Sam | am) = 1/2,  P(do | I) = 1/3

BITS Pilani, Pilani Campus
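As a concrete illustration of these counts, here is a minimal Python sketch (my own toy code, not part of the original slides) that computes the MLE bigram estimates from the three-sentence corpus above:

```python
from collections import Counter, defaultdict

corpus = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I do not like green eggs and ham </s>",
]

unigram_counts = Counter()
bigram_counts = defaultdict(Counter)

for sentence in corpus:
    tokens = sentence.split()
    unigram_counts.update(tokens)
    for prev, cur in zip(tokens, tokens[1:]):
        bigram_counts[prev][cur] += 1

def p_mle(word, prev):
    """P(word | prev) = c(prev, word) / c(prev)."""
    return bigram_counts[prev][word] / unigram_counts[prev]

print(p_mle("I", "<s>"))   # 2/3
print(p_mle("Sam", "am"))  # 1/2
print(p_mle("do", "I"))    # 1/3
```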


Evaluation


BITS Pilani, Pilani Campus


How good is our model?

• Does our language model prefer good sentences to bad


ones?
– Assign higher probability to “real” or “frequently
observed” sentences
• Than “ungrammatical” or “rarely observed” sentences?
• We train parameters of our model on a training set.
• We test the model’s performance on data we haven’t
seen.
– A test set is an unseen dataset that is different from our
training set, totally unused.
– An evaluation metric tells us how well our model does on
the test set.

BITS Pilani, Pilani Campus


Experimental setup

BITS Pilani, Pilani Campus


Example1:

BITS Pilani, Pilani Campus


Extrinsic evaluation of N-gram
models

• Best evaluation for comparing models A and B


– Put each model in a task
• spelling corrector, speech recognizer, MT system
– Run the task, get an accuracy for A and for B
• How many misspelled words corrected properly
• How many words translated correctly
– Compare accuracy for A and B

BITS Pilani, Pilani Campus


Difficulty of extrinsic (in-vivo)
evaluation of N-gram models

• Extrinsic evaluation
– Time-consuming; can take days or weeks
– Bad approximation
• unless the test data looks just like the training data
• So generally only useful in pilot experiments

• So
– Sometimes use intrinsic evaluation: perplexity

BITS Pilani, Pilani Campus


Intuition of Perplexity

• The Shannon Game:
   – How well can we predict the next word?

      I always order pizza with cheese and ____
         mushrooms   0.1
         pepperoni   0.1
         anchovies   0.01
         …
         fried rice  0.0001
         …
         and         1e-100
      The president of India is ____
      I wrote a ____

   – Unigrams are terrible at this game. (Why?)

• A better model of a text is one which assigns a higher
  probability to the word that actually occurs.

BITS Pilani, Pilani Campus


Perplexity

The best language model is one that best predicts an unseen
test set
• Gives the highest P(sentence)

Perplexity is the inverse probability of the test set, normalized by
the number of words:

      PP(W) = P(w1 w2 … wN)^(−1/N) = (1 / P(w1 w2 … wN))^(1/N)

Chain rule:

      PP(W) = ( ∏i 1 / P(wi | w1 … wi−1) )^(1/N)

For bigrams:

      PP(W) = ( ∏i 1 / P(wi | wi−1) )^(1/N)

Minimizing perplexity is the same as maximizing probability

BITS Pilani, Pilani Campus


Perplexity as branching factor

Suppose we have a vocabulary of k words, and our model assigns probability 1/k
to each word. For a sentence consisting of N random words:

      PP(W) = ((1/k)^N)^(−1/N) = k

Let's suppose a sentence consisting of random digits.

• What is the perplexity of this sentence according to a model that assigns
  P = 1/10 to each digit?   PP = ((1/10)^N)^(−1/N) = 10

BITS Pilani, Pilani Campus
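A small sketch of the computation (assuming the per-word probabilities are already available) shows why a model that gives every digit probability 1/10 has perplexity 10, regardless of sentence length:

```python
import math

def perplexity(word_probs):
    """Perplexity = exp of the average negative log probability per word."""
    n = len(word_probs)
    return math.exp(-sum(math.log(p) for p in word_probs) / n)

print(perplexity([0.1] * 7))  # 10.0 (up to floating-point rounding), for any length
```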


Example1

BITS Pilani, Pilani Campus


Example2

BITS Pilani, Pilani Campus


Corpora

BITS Pilani, Pilani Campus


Lower perplexity = better model

• Training: 38 million words; test: 1.5 million words (WSJ)

   N-gram order:   Unigram   Bigram   Trigram
   Perplexity:       962       170      109

BITS Pilani, Pilani Campus


Generalization and Zeros

BITS Pilani, Pilani Campus


The perils of overfitting

• N-grams only work well for word prediction if the


test corpus looks like the training corpus
– In real life, it often doesn’t
– We need to train robust models that generalize!
– One kind of generalization: Zeros!
• Things that don’t ever occur in the training set
–But occur in the test set

BITS Pilani, Pilani Campus


Problems with simple MLE estimations

• Training set:                    • Test set:
   … denied the allegations          … denied the offer
   … denied the reports              … denied the loan
   … denied the claims
   … denied the request

P("offer" | denied the) = 0

BITS Pilani, Pilani Campus


Zero probability bigrams

• Bigrams with zero probability


– mean that we will assign 0 probability to the test
set!
• And hence we cannot compute perplexity
(can’t divide by 0)!

BITS Pilani, Pilani Campus


Laplace Smoothing(Add 1 smoothing)

• Pretend we saw each word one more time than we did
• Just add one to all the counts!

• MLE estimate:
      PMLE(wi | wi−1) = c(wi−1, wi) / c(wi−1)

• Add-1 estimate:
      PAdd-1(wi | wi−1) = (c(wi−1, wi) + 1) / (c(wi−1) + V)

  where V is the vocabulary size.

BITS Pilani, Pilani Campus
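Reusing the bigram and unigram count tables from the earlier counting sketch, an add-1 estimate can be written as follows (an illustrative sketch; V is the vocabulary size):

```python
def p_add1(word, prev, bigram_counts, unigram_counts, V):
    """Add-1 (Laplace) smoothed bigram: (c(prev, word) + 1) / (c(prev) + V)."""
    return (bigram_counts[prev][word] + 1) / (unigram_counts[prev] + V)

# e.g. with the toy corpus from the earlier sketch: V = len(unigram_counts)
```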


Example

BITS Pilani, Pilani Campus


BITS Pilani, Pilani Campus
After Add-1 smoothing

BITS Pilani, Pilani Campus


Berkeley Restaurant Project corpus

BITS Pilani, Pilani Campus


Raw Bigram Counts

BITS Pilani, Pilani Campus


Converting to probabilities

BITS Pilani, Pilani Campus


Contd…

BITS Pilani, Pilani Campus


Berkeley Restaurant Corpus:
Laplace smoothed bigram
counts

BITS Pilani, Pilani Campus


Laplace-smoothed bigrams

V=1446 in the Berkeley restaurant project corpus


BITS Pilani, Pilani Campus
Reconstituted counts

BITS Pilani, Pilani Campus


Compare with raw bigram counts

BITS Pilani, Pilani Campus


Add-1 estimation is a blunt
instrument

• So add-1 isn’t used for N-grams:


– We’ll see better methods
• But add-1 is used to smooth other NLP models
– For text classification
– In domains where the number of zeros isn’t so huge.

BITS Pilani, Pilani Campus


Backoff and Interpolation

• Sometimes it helps to use less context


– Condition on less context for contexts you haven’t learned much
about
• Backoff:
– use trigram if you have good evidence,
– otherwise bigram, otherwise unigram
• Interpolation:
– mix unigram, bigram, trigram

• Interpolation works better

BITS Pilani, Pilani Campus


Linear Interpolation

• Involves combining different pieces of evidence (unigram,
  bigram and trigram estimates) to derive a probability.

Simple interpolation:

      P̂(wn | wn−2, wn−1) = λ1 P(wn | wn−2, wn−1) + λ2 P(wn | wn−1) + λ3 P(wn),   with Σi λi = 1

BITS Pilani, Pilani Campus


How to set the lambdas?

• Use a held-out corpus

   [ Training Data | Held-Out Data | Test Data ]

• Choose λs to maximize the probability of the held-out data:
   – Fix the N-gram probabilities (on the training data)
   – Then search for λs that give the largest probability to the held-out set.

BITS Pilani, Pilani Campus
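A hedged sketch of simple interpolation; p_uni, p_bi and p_tri stand for whatever unigram, bigram and trigram estimators are already trained (hypothetical helper functions, not defined on the slides):

```python
def p_interp(w, w_prev2, w_prev1, lambdas, p_uni, p_bi, p_tri):
    """Linearly interpolated trigram estimate; the lambdas should sum to 1."""
    l1, l2, l3 = lambdas
    return (l1 * p_tri(w, w_prev2, w_prev1)
            + l2 * p_bi(w, w_prev1)
            + l3 * p_uni(w))
```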


Unknown words: Open versus
closed vocabulary tasks

• If we know all the words in advance


– Vocabulary V is fixed
– Closed vocabulary task
• Often we don’t know this
– Out Of Vocabulary = OOV words
– Open vocabulary task
• Instead: create an unknown word token <UNK>
– Training of <UNK> probabilities
• Create a fixed lexicon L of size V
• At text normalization phase, any training word not in L changed to <UNK>
• Now we train its probabilities like a normal word
– At decoding time
• If text input: Use UNK probabilities for any word not in training

BITS Pilani, Pilani Campus


Huge web-scale n-grams

• How to deal with, e.g., Google N-gram corpus


• Pruning
– Entropy-based pruning
– Only store N-grams with count > threshold.

• Efficiency
– Efficient data structures like tries
– Bloom filters: approximate language models
– Store words as indexes, not strings
• Use Huffman coding to fit large numbers of words into two bytes

BITS Pilani, Pilani Campus


Smoothing for Web-scale N-grams

• “Stupid backoff” (Brants et al. 2007)

      S(wi | wi−k+1 … wi−1) = count(wi−k+1 … wi) / count(wi−k+1 … wi−1)    if count(wi−k+1 … wi) > 0
                            = 0.4 · S(wi | wi−k+2 … wi−1)                   otherwise

      S(wi) = count(wi) / N

BITS Pilani, Pilani Campus
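An illustrative sketch of stupid backoff, assuming a dictionary `counts` that maps n-gram tuples to their corpus counts and `total_words` = N (these names are my own, not from the slides):

```python
def stupid_backoff(word, context, counts, total_words, alpha=0.4):
    """S(word | context); context is a tuple of the preceding words."""
    if not context:
        return counts.get((word,), 0) / total_words          # unigram case: count(w)/N
    ngram = context + (word,)
    if counts.get(ngram, 0) > 0:
        return counts[ngram] / counts[context]               # relative frequency
    return alpha * stupid_backoff(word, context[1:], counts, total_words, alpha)
```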


Exercise 1

1. <s> do ?                 (use a bigram model)
2. <s> I like Hendry ?      (use a bigram model)
3. <s> Do I like ?          (use a trigram model)
4. <s> Do I like college ?  (use a 4-gram model)

BITS Pilani, Pilani Campus


Exercise2

Which of the following sentences is better? Find
out using a bigram model.
1. <s> I like college </s>
2. <s> Do I like Hendry </s>

BITS Pilani, Pilani Campus


Thank You… 

• Q&A
• Suggestions / Feedback

BITS Pilani, Pilani Campus


Natural Language Processing
DSECL ZG565
Prof. Vijayalakshmi
BITS Pilani, Pilani Campus

Session 3: POS tagging

Date – 20th Sep 2020
Time – 9 am to 11 am
These slides are prepared by the instructor, with grateful acknowledgement of James
Allen and many others who made their course materials freely available online.
Language Models
 What is language model
 Why language models
N-gram language models
 Uni-gram , bi-gram and N-gram
Evaluation of Language models
 Perplexity
Smoothing
 Laplace smoothing
Interpolation and Back off

BITS Pilani, Pilani Campus


Session3:POS tagging
 What is Part of speech tagging
 Why POS tagging is required
 Application
 (Mostly) English Word Classes
 Tag set
• Penn tree bank
 Part-of-Speech Tagging
 Different approaches
• Rule based
• Stochastic based
 HMM Part-of-Speech Tagging
• Hybrid system
BITS Pilani, Pilani Campus
What is POS Tagging

• The process of assigning a part-of-speech


or lexical class marker to each word in a
sentence (and all sentences in a
collection).

BITS Pilani, Pilani Campus


Process

List all possible tags for each word in the
sentence.
Eg:

Choose the best suitable tag sequence.

BITS Pilani, Pilani Campus


Why POS?

First step in many applications


POS tags tell us a lot about a word
Pronunciation depends on POS
To find named entities
Stemming

BITS Pilani, Pilani Campus


Application of POS

Parsing
Recovering syntactic structures requires correct
POS tags
Eg:

BITS Pilani, Pilani Campus


Information retrieval

Word Sense Disambiguation
– the ability to determine which meaning of a word is
activated by its use in a particular
context.
Eg:
• I can hear bass sound.
• He likes to eat grilled bass.

BITS Pilani, Pilani Campus


Question answering system

automatically answer questions posed by


humans in a natural language.
Eg: Does he eat bass?

BITS Pilani, Pilani Campus


Why POS tagging is hard?

Ambiguity
Eg:
• He will race/VB the car.
• When will the race/NOUN end?
• The boat floated/VBD … down the river sank
 Average of ~2 parts of speech for each word

BITS Pilani, Pilani Campus


Parts of Speech

• 8 (ish) traditional parts of speech


– Noun, verb, adjective, preposition, adverb, article,
interjection, pronoun, conjunction, etc
– Called: parts-of-speech, lexical categories, word
classes, morphological classes, lexical tags...
– Lots of debate within linguistics about the
number, nature, and universality of these
• We’ll completely ignore this debate.

BITS Pilani, Pilani Campus


POS examples

• N noun chair, bandwidth, pacing


• V verb study, debate, munch
• ADJ adjective purple, tall, ridiculous
• ADV adverb unfortunately, slowly
• P preposition of, by, to
• PRO pronoun I, me, mine
• DET determiner the, a, that, those

BITS Pilani, Pilani Campus


POS Tagging
• Assigning label to something or someone
• The process of assigning a part-of-speech to
each word in a collection.
WORD tag
the DET
koala N
put V
the DET
keys N
on P
the DET
table N
BITS Pilani, Pilani Campus
Open and Closed Classes

• Open class: new ones can be created all the time


– English has 4: Nouns, Verbs, Adjectives, Adverbs
– Many languages have these 4, but not all!
• Closed class: a small fixed membership
– Prepositions: of, in, by, …
– Auxiliaries: may, can, will, had, been, …
– Pronouns: I, you, she, mine, his, them, …
– Usually function words (short common words which play a
role in grammar)

BITS Pilani, Pilani Campus


Closed Class Words

Examples:
– prepositions: on, under, over, …
– particles: up, down, on, off, …
– determiners: a, an, the, …
– pronouns: she, who, I, ..
– conjunctions: and, but, or, …
– auxiliary verbs: can, may should, …
– numerals: one, two, three, third, …

BITS Pilani, Pilani Campus


Tagsets

• A set of all POS tags used in a corpus is called


a tagset
• There are various standard tag sets to choose
from; some have a lot more tags than others
• The choice of tagset is based on the
application
• Accurate tagging can be done with even large
tagsets

BITS Pilani, Pilani Campus


Penn tree bank

 Background
• From the early 90s
• Developed at the University of Pennsylvania
• (Marcus, Santorini and Marcinkiewicz 1993)
 Size
• 40000 training sentences
• 2400 test sentences
• Genre
• Mostly Wall Street Journal news stories and some spoken
conversations
 Importance
• Helped launch modern automatic parsing methods.

BITS Pilani, Pilani Campus


Penn TreeBank POS Tagset

BITS Pilani, Pilani Campus


Just for Fun…

• Using Penn Treebank tags, tag the following


sentence from the Brown Corpus:

• The grand jury commented on a number of


other topics.

BITS Pilani, Pilani Campus


Just for Fun…

• Using Penn Treebank tags, tag the following


sentence from the Brown Corpus:

• The/DT grand/JJ jury/NN commented/VBD


on/IN a/DT number/NN of/IN other/JJ
topics/NNS ./.

BITS Pilani, Pilani Campus


POS Tagging-
Choosing a Tag set
 There are so many parts of speech, potential distinctions we
can draw
 To do POS tagging, we need to choose a standard set of tags
to work with
 Could pick very coarse tagsets
 N, V, Adj, Adv.
 More commonly used set is finer grained, the “Penn TreeBank
tagset”, 45 tags
 PRP$, WRB, WP$, VBG
 Even more fine-grained tagsets exist

BITS Pilani, Pilani Campus


Approaches to POS Tagging
 Rule-based Approach
• Uses handcrafted sets of rules to tag input
sentences

 Statistical approaches
• Use training corpus to compute probability of a tag
in a context-HMM tagger

 Hybrid systems (e.g. Brill’s transformation-based


learning)

BITS Pilani, Pilani Campus


Rule based POS tagging

First stage − In the first stage, it uses a


dictionary to assign each word a list of
potential parts-of-speech.
Second stage − In the second stage, it uses
large lists of hand-written disambiguation
rules to sort down the list to a single part-of-
speech for each word.
Eg: EngCG

BITS Pilani, Pilani Campus


Hidden Markov model

BITS Pilani, Pilani Campus


Markov model /Markov chain

A Markov process is a process that generates a


sequence of outcomes in such a way that the
probability of the next outcome depends only
on the current outcome and not on what
happened earlier.

BITS Pilani, Pilani Campus


MARKOV CHAIN: WEATHER
EXAMPLE

Design a Markov Chain to


predict the weather of
tomorrow using previous
information of the past days.
 Our model has only 3 states:
   S = {S1, S2, S3}
   S1 = Sunny, S2 = Rainy, S3 = Cloudy

BITS Pilani, Pilani Campus


Contd..

BITS Pilani, Pilani Campus


Contd..

State sequence notation: q1, q2, q3, q4, q5, …,
where qi ∈ {Sunny, Rainy, Cloudy}.

 Markov Property: P(qi | q1, …, qi−1) = P(qi | qi−1)

BITS Pilani, Pilani Campus


Example

Given that today is Sunny, what’s the probability


that tomorrow is Sunny and the next day Rainy?

BITS Pilani, Pilani Campus


Example2

Assume that yesterday’s weather was Rainy, and


today is Cloudy, what is the probability that
tomorrow will be Sunny?

BITS Pilani, Pilani Campus


WHAT IS A HIDDEN MARKOV
MODEL (HMM)?

A Hidden Markov Model, is a stochastic model where


the states of the model are hidden. Each state can
emit an output which is observed.
Imagine: You were locked in a room for several days
and you were asked about the weather outside. The
only piece of evidence you have is whether the
person who comes into the room bringing your daily
meal is carrying an umbrella or not.
• What is hidden? Sunny, Rainy, Cloudy
• What can you observe? Umbrella or Not
BITS Pilani, Pilani Campus
Markov chain Vs HMM

Markov chain HMM

BITS Pilani, Pilani Campus


Hidden Markov Models (Formal)

• States Q = q1, q2 … qN
• Observations O = o1, o2 … oN
• Transition probabilities
   – Transition probability matrix A = {aij}
         aij = P(qt = j | qt−1 = i),   1 ≤ i, j ≤ N
• Emission probability / output probability
   – Output probability matrix B = {bi(k)}
         bi(k) = P(Xt = ok | qt = i)
• Special initial probability vector π
         πi = P(q1 = i),   1 ≤ i ≤ N

BITS Pilani, Pilani Campus


First-Order HMM Assumptions

• Markov assumption: the probability of a state depends
  only on the state that precedes it:

      P(qi | q1 … qi−1) ≈ P(qi | qi−1)
BITS Pilani, Pilani Campus


How to build a second-order HMM?

• Second-order HMM
– Current state only depends on previous 2 states
• Example
– Trigram model over POS tags
   – P(t) = ∏_{i=1..n} P(ti | ti−1, ti−2)
   – P(w, t) = ∏_{i=1..n} P(ti | ti−1, ti−2) · P(wi | ti)

BITS Pilani, Pilani Campus


Markov Chain for Weather

• What is the probability of 4 consecutive warm


days?

• Sequence is
warm-warm-warm-warm
• And state sequence is
3-3-3-3
• P(3,3,3,3) =
   – π3 · a33 · a33 · a33 = 0.2 × (0.6)³ = 0.0432

BITS Pilani, Pilani Campus


POS Tagging as Sequence
Classification

• We are given a sentence (an “observation”


or “sequence of observations”)
– Secretariat is expected to race tomorrow
• What is the best sequence of tags that
corresponds to this sequence of
observations?
• Probabilistic view
– Consider all possible sequences of tags
– Out of this universe of sequences, choose the
tag sequence which is most probable given the
observation sequence of n words w1…wn.
BITS Pilani, Pilani Campus
Probabilistic Sequence Models

• Probabilistic sequence models allow


integrating uncertainty over multiple,
interdependent classifications and collectively
determine the most likely global assignment.
• standard model
– Hidden Markov Model (HMM)

BITS Pilani, Pilani Campus


How you predict the tags?

• Two types of information are useful


– Relations between words and tags
– Relations between tags and tags
• DT NN, DT JJ NN…

BITS Pilani, Pilani Campus


Statistical POS Tagging

• We want, out of all sequences of n tags t1…tn


the single tag sequence such that
P(t1…tn|w1…wn) is highest.

• Hat ^ means “our estimate of the best one”


• Argmaxx f(x) means “the x such that f(x) is
maximized”
BITS Pilani, Pilani Campus
Statistical POS Tagging

This equation should give us the best tag


sequence

But how to make it operational? How to


compute this value?
Intuition of Bayesian inference:
• Use Bayes rule to transform this equation into
a set of probabilities that are easier to
compute (and give the right answer)
BITS Pilani, Pilani Campus
Using Bayes Rule

= PROB(T1,…Tn | w1,…wn)

Estimating the above takes far too much data. Need to


do some reasonable approximations.

Bayes Rule:
PROB(A | B) = PROB(B | A) * PROB(A) / PROB(B)

Rewriting:
PROB(w1,…wn | T1,…Tn) * PROB(T1,…Tn) / PROB(w1,…wn)

BITS Pilani, Pilani Campus


Contd..

=PROB(w1,…wn | T1,…Tn) * PROB(T1,…Tn) /


PROB(w1,…wn)
=PROB(w1,…wn | T1,…Tn) * PROB(T1,…Tn)

BITS Pilani, Pilani Campus


Independent assumptions

So, we want to find the sequence of tags that maximizes


PROB(T1,…Tn) * PROB(w1,…wn | T1,…Tn)

 For Tags – use bigram probabilities


PROB(T1,…Tn) ≈ πi=1,n PROB(Ti | Ti-1)
PROB(ART N V N) ≈ PROB(ART | Φ) * PROB(N | ART) * PROB(V |
N) * PROB(N | V)

 For second probability: assume word tag is independent of


words around it:
PROB(w1,…wn | T1,…Tn) ≈ πi=1,n PROB(wi | Ti)

BITS Pilani, Pilani Campus


POS formula

• Find the sequence of tags that maximizes:


πi=1,n PROB(Ti | Ti-1) * PROB(wi | Ti)

BITS Pilani, Pilani Campus


POS Tagging using HMM
• States T = t1, t2 … tN
• Observations W = w1, w2 … wN
   – Each observation is a symbol from a vocabulary V = {v1, v2, … vV}
• Transition probabilities
   – Transition probability matrix A = {aij}
         aij = P(ti = j | ti−1 = i),   1 ≤ i, j ≤ N
• Observation likelihoods
   – Output probability matrix B = {bi(k)}
         bi(k) = P(wi = vk | ti = i)
• Special initial probability vector π
         πi = P(t1 = i),   1 ≤ i ≤ N
BITS Pilani, Pilani Campus
Two Kinds of Probabilities

1. State transition probabilities -- p(ti|ti-1)


– State-to-state transition probabilities

2. Observation/Emission probabilities -- p(wi|ti)


– Probabilities of observing various values at a
given state

BITS Pilani, Pilani Campus


Two Kinds of Probabilities

1. Tag transition probabilities -- p(ti|ti-1)


– Determiners likely to precede adjs and nouns
• That/DT flight/NN
• The/DT yellow/JJ hat/NN
• So we expect P(NN|DT) and P(JJ|DT) to be high
– Compute P(NN|DT) by counting in a labeled
corpus:

BITS Pilani, Pilani Campus


Two Kinds of Probabilities

2. Word likelihood/emission probabilities


p(wi|ti)
– VBZ (3sg Pres Verb) likely to be “is”
– Compute P(is|VBZ) by counting in a labeled
corpus:

BITS Pilani, Pilani Campus


Sample Probabilities

Bigram probabilities:

   Bigram (Ti, Tj)   Count(i, i+1)   Prob(Tj | Ti)
   φ, ART            213             .71  (213/300)
   φ, N               87             .29  (87/300)
   φ, V               10             .03  (10/300)
   ART, N             633            1
   N, V               358            .32
   N, N               108            .10
   N, P               366            .33
   V, N               134            .37
   V, P               150            .42
   V, ART             194            .54
   P, ART             226            .62
   P, N               140            .38
   V, V                30            .08

Tag frequencies:   Φ = 300,  ART = 633,  N = 1102,  V = 358,  P = 366
BITS Pilani, Pilani Campus
Sample Lexical Generation Probabilities

• P(an | ART) .36


• P(an | N) .001
• P(flies | N) .025
• P(flies | V) .076
• P(time | N) .063
• P(time | V) .012
• P(arrow | N) .076
• P(like | N) .012
• P(like | V) .10
BITS Pilani, Pilani Campus
Two types of stochastic POS tagging

Word frequency tagging


• based on the probability that a word occurs
with a particular tag.
Tag sequence probabilities
• calculates the probability of a given sequence
of tags occurring.

BITS Pilani, Pilani Campus


Example1-Some Data on race

• Secretariat/NNP is/VBZ expected/VBN to/TO


race/VB tomorrow/NR
• People/NNS continue/VB to/TO inquire/VB
the/DT reason/NN for/IN the/DT race/NN
for/IN outer/JJ space/NN
• How do we pick the right tag for race in new
data?

BITS Pilani, Pilani Campus
Disambiguating to race tomorrow

BITS Pilani, Pilani Campus
Look Up the Probabilities
• P(NN|TO) = .00047
• P(VB|TO) = .83
• P(race|NN) = .00057
• P(race|VB) = .00012
• P(NR|VB) = .0027
• P(NR|NN) = .0012
• P(VB|TO)P(NR|VB)P(race|VB) = .00000027
• P(NN|TO)P(NR|NN)P(race|NN)=.00000000032
• So we (correctly) choose the verb reading

BITS Pilani, Pilani Campus
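As a quick numeric check of this comparison (using only the probabilities listed above):

```python
p_verb_path = 0.83 * 0.0027 * 0.00012      # P(VB|TO) * P(NR|VB) * P(race|VB)
p_noun_path = 0.00047 * 0.0012 * 0.00057   # P(NN|TO) * P(NR|NN) * P(race|NN)
print(f"{p_verb_path:.2e}")  # about 2.7e-07
print(f"{p_noun_path:.2e}")  # about 3.2e-10 -> the verb reading wins
```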
Example2:Statistical POS tagging-
Whole tag sequence

• What is the most likely sequence of tags for


the given sequence of words w

P(DT JJ NN | a smart dog)
   = P(DT JJ NN, a smart dog) / P(a smart dog)
   ∝ P(DT JJ NN) · P(a smart dog | DT JJ NN)
BITS Pilani, Pilani Campus
Tag Transition Probability

• Joint probability: P(t, w) = P(t) · P(w | t)

• P(t) = P(t1, t2, …, tn)
       = P(t1) P(t2 | t1) P(t3 | t2, t1) … P(tn | t1 … tn−1)
       ≈ P(t1) P(t2 | t1) P(t3 | t2) … P(tn | tn−1)        (Markov assumption)
       = ∏_{i=1..n} P(ti | ti−1)

• Bigram model over POS tags!
  (similarly, we can define an n-gram model over POS tags,
  usually called a higher-order HMM)

BITS Pilani, Pilani Campus


Word likelihood /Emission
Probability

• Joint probability: P(t, w) = P(t) · P(w | t)

• Assume words depend only on their POS tag:
      P(w | t) ≈ P(w1 | t1) P(w2 | t2) … P(wn | tn)        (independence assumption)
               = ∏_{i=1..n} P(wi | ti)

  i.e., P(a smart dog | DT JJ NN)
      = P(a | DT) · P(smart | JJ) · P(dog | NN)
BITS Pilani, Pilani Campus
Put them together

• Joint probability: P(t, w) = P(t) · P(w | t)

• P(t, w) = P(t1) P(t2 | t1) P(t3 | t2) … P(tn | tn−1)
            × P(w1 | t1) P(w2 | t2) … P(wn | tn)
          = ∏_{i=1..n} P(wi | ti) · P(ti | ti−1)

  e.g., P(a smart dog, DT JJ NN)
      = P(a | DT) · P(smart | JJ) · P(dog | NN)
        × P(DT | start) · P(JJ | DT) · P(NN | JJ)

BITS Pilani, Pilani Campus
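A minimal sketch of this factorization; `trans[(prev_tag, tag)]` and `emit[(tag, word)]` are assumed lookup tables (my own names, not the slides'):

```python
def hmm_joint_prob(words, tags, trans, emit, start="<s>"):
    """P(tags, words) = product over i of P(words[i] | tags[i]) * P(tags[i] | tags[i-1])."""
    prob = 1.0
    prev = start
    for w, t in zip(words, tags):
        prob *= trans[(prev, t)] * emit[(t, w)]
        prev = t
    return prob
```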


Put them together

• Two independent assumptions


– Approximate P(t) by a bi(or N)-gram model
– Assume each word depends only on its POStag

initial probability
𝑝(𝑡1 )

BITS Pilani, Pilani Campus


Example1-HMM

Will can spot Mary


M V N N
Will Can spot Mary
N M V N

BITS Pilani, Pilani Campus


Emission probabilities

BITS Pilani, Pilani Campus


Transition probabilities

BITS Pilani, Pilani Campus


The correct tag sequence is Will/N can/M spot/V Mary/N

BITS Pilani, Pilani Campus


Thank You… 

• Q&A
• Suggestions / Feedback

BITS Pilani, Pilani Campus


Natural Language Processing
DSECL ZG565
Prof. Vijayalakshmi Anand
BITS Pilani, Pilani Campus

Session 4 – Part-of-Speech Tagging (Viterbi, Maximum Entropy)

Date – 2nd May 2021
Time – 9 am to 11 am
These slides are prepared by the instructor, with grateful acknowledgement of James Allen and
many others who made their course materials freely available online.
What is POS tagging
Application of POS tagging
Tag sets –standard tag set
Approaches of POS tagging
Introduction to HMM
How HMM is used in POS tagging

BITS Pilani, Pilani Campus


Session4-POS tagging(Viterbi
,Maximum entropy model)
• The Hidden Markov Model
• Likelihood Computation:
 The Forward Algorithm
• Decoding: The Viterbi Algorithm
• Bidirectionality
• Maximum entropy model

BITS Pilani, Pilani Campus


Hidden Markov Models
It is a sequence model.
Assigns a label or class to each unit in a sequence,
thus mapping a sequence of observations to
a sequence of labels.
Probabilistic sequence model: given a sequence of
units (e.g. words, letters, morphemes,
sentences), compute a probability distribution
over possible sequences of labels and choose
the best label sequence.
This is a kind of generative model.

BITS Pilani, Pilani Campus


Hidden Markov Model (HMM)
Oftentimes we want to know what produced the sequence
– the hidden sequence for the observed sequence.
For example,
– Inferring the words (hidden) from acoustic signal (observed) in speech
recognition
– Assigning part-of-speech tags (hidden) to a sentence (sequence of words) – POS
tagging.
– Assigning named entity categories (hidden) to a sentence (sequence of words) –
Named Entity Recognition.

BITS Pilani, Pilani Campus


Definition of HMM
States Q = q1, q2 … qN
Observations O = o1, o2 … oN
   – Each observation is a symbol from a vocabulary V = {v1, v2, … vV}
Transition probabilities
   – Transition probability matrix A = {aij}
         aij = P(qt = j | qt−1 = i),   1 ≤ i, j ≤ N
Observation likelihoods
   – Output probability matrix B = {bi(k)}
         bi(k) = P(Xt = ok | qt = i)

Special initial probability vector π
         πi = P(q1 = i),   1 ≤ i ≤ N
BITS Pilani, Pilani Campus
Three Problems
Given this framework there are 3 problems that we can
pose to an HMM
1. Given an observation sequence, what is the
probability of that sequence given a model?
2. Given an observation sequence and a model, what
is the most likely state sequence?
3. Given an observation sequence, find the best model
parameters for a partially specified model

BITS Pilani, Pilani Campus


Problem 1:
Observation Likelihood
• The probability of a observation sequence given
a model and state sequence
• Evaluation problem

BITS Pilani, Pilani Campus


Problem 2:

• Most probable state sequence given a model


and an observation sequence
• Decoding problem

BITS Pilani, Pilani Campus


Problem 3:

• Infer the best model parameters, given a partial model


and an observation sequence...
– That is, fill in the A and B tables with the right numbers --
• the numbers that make the observation sequence most likely
• This is to learn the probabilities!

BITS Pilani, Pilani Campus


Solutions
Problem 1: Forward (learn observation sequence)
Problem 2: Viterbi (learn state sequence)
Problem 3: Forward-Backward (learn probabilities)
– An instance of EM (Expectation Maximization)

BITS Pilani, Pilani Campus


Example :HMMs for Ice Cream

You are a climatologist in the year 2799 studying global


warming
You can’t find any records of the weather in Baltimore for
summer of 2007
But you find Jason Eisner’s diary which lists how many ice-
creams Jason ate every day that summer
Your job: figure out how hot it was each day

BITS Pilani, Pilani Campus


Problem 1

 1.Consider all possible 3-day weather sequences [H, H, H],


[H, H, C], [H, C, H],
2. For each 3-day weather sequence, consider the probability
of the ice cream Consumption sequences [1,2,1]
3.Add all the probability
 Not efficient
 Forward algorithm
BITS Pilani, Pilani Campus
Problem2
.

 1.Consider all possible 3-day weather sequences [H, H, H], [H, H, C], [H, C,
H],

2. For each 3-day weather sequence, consider the probability of the ice
cream Consumption sequences [1,2,1]
3. Pick out the sequence that has the highest probability from step #2.
 Not efficient
 Viterbi algorithm BITS Pilani, Pilani Campus
Problem3

 Find :
The start probabilities
The transition probabilities
Emission probabilities
 Forward –backward algorithm

BITS Pilani, Pilani Campus


Problem 2: Decoding
We want, out of all sequences of n tags t1…tn the single tag
sequence such that
P(t1…tn|w1…wn) is highest.

Hat ^ means “our estimate of the best one”


Argmaxx f(x) means “the x such that f(x) is maximized”

BITS Pilani, Pilani Campus


Getting to HMMs
This equation should give us the best tag sequence

But how to make it operational? How to compute this


value?
Intuition of Bayesian inference:
– Use Bayes rule to transform this equation into a set of probabilities that are
easier to compute (and give the right answer)

BITS Pilani, Pilani Campus


Using Bayes Rule

Know this.

BITS Pilani, Pilani Campus


Likelihood and Prior

BITS Pilani, Pilani Campus


HMM for Ice Cream
Given
– Ice Cream Observation Sequence: 1,2,3,2,2,2,3…
Produce:
– Hidden Weather Sequence:
H,C,H,H,H,C, C…

BITS Pilani, Pilani Campus


HMM for Ice Cream

the observed sequence


“1 2 3”

BITS Pilani, Pilani Campus


HMM for Ice Cream

Transition probabilities          Emission probabilities
        H     C                            1     2     3
   H    0.7   0.3                    H     0.2   0.4   0.4
   C    0.4   0.6                    C     0.5   0.4   0.1
BITS Pilani, Pilani Campus
Find the probability of the observation sequence 1 3 1

P(a, b) = P(a | b) · P(b)

For one hidden state sequence, e.g. H C H:
   P(1 3 1, H C H) = P(1|H) P(3|C) P(1|H) × P(H|start) P(C|H) P(H|C)

There are N^T = 2^3 = 8 possible hidden sequences to sum over.

BITS Pilani, Pilani Campus


Decoding
• Given an observation sequence
 313
• And an HMM
• The task of the decoder
 To find the best hidden state sequence most likely to
have produced the observed sequence
• Given the observation sequence
O=(o1o2…oT), and an HMM model Φ = (A,B),
how do we choose a corresponding state
sequence Q=(q1q2…qT) that is optimal in
some sense (i.e., best explains the
observations)
BITS Pilani, Pilani Campus
Contd..
• One possibility to find the best sequence:
   - For each hidden state sequence Q (HHH, HHC, HCH, …)
   - Compute P(O|Q)
   - Pick the highest one
• Why not?
   - There are exponentially many state sequences (N^T)
• Instead:
   - The Viterbi algorithm
   - A dynamic programming algorithm

BITS Pilani, Pilani Campus


Viterbi intuition
• We want to compute the joint probability
of the observation sequence together
with the best state sequence

BITS Pilani, Pilani Campus


Viterbi calculation for the ice-cream observation sequence 3 1 3
(initial probabilities P(H) = 0.8, P(C) = 0.2):

Step 1 (observation 3):
   v1(H) = P(H) · P(3|H) = 0.8 × 0.4 = 0.32
   v1(C) = P(C) · P(3|C) = 0.2 × 0.1 = 0.02

Step 2 (observation 1):
   v2(H) = max[ v1(H) · P(H|H), v1(C) · P(H|C) ] · P(1|H)
         = max[ 0.32 × 0.7, 0.02 × 0.4 ] × 0.2 = 0.0448
   v2(C) = max[ v1(H) · P(C|H), v1(C) · P(C|C) ] · P(1|C)
         = max[ 0.32 × 0.3, 0.02 × 0.6 ] × 0.5 = 0.048

Step 3 (observation 3):
   v3(H) = max[ v2(H) · P(H|H), v2(C) · P(H|C) ] · P(3|H)
         = max[ 0.0448 × 0.7, 0.048 × 0.4 ] × 0.4 ≈ 0.0125
   v3(C) = max[ v2(H) · P(C|H), v2(C) · P(C|C) ] · P(3|C)
         = max[ 0.0448 × 0.3, 0.048 × 0.6 ] × 0.1 = 0.00288

Backtracking from the larger final value v3(H) gives the best hidden sequence H H H.
BITS Pilani, Pilani Campus
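The recurrence above can be written compactly. The sketch below (an illustrative implementation, not the course's official code) uses the ice-cream HMM parameters from these slides and reproduces v3(H) ≈ 0.0125 with best path H H H:

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return (best_prob, best_path) for an observation sequence."""
    V = [{s: (start_p[s] * emit_p[s][obs[0]], [s]) for s in states}]
    for o in obs[1:]:
        prev = V[-1]
        cur = {}
        for s in states:
            # pick the best previous state leading into s
            p, path = max(
                ((prev[ps][0] * trans_p[ps][s] * emit_p[s][o], prev[ps][1]) for ps in states),
                key=lambda x: x[0],
            )
            cur[s] = (p, path + [s])
        V.append(cur)
    return max(V[-1].values(), key=lambda x: x[0])

states = ["H", "C"]
start_p = {"H": 0.8, "C": 0.2}
trans_p = {"H": {"H": 0.7, "C": 0.3}, "C": {"H": 0.4, "C": 0.6}}
emit_p = {"H": {1: 0.2, 2: 0.4, 3: 0.4}, "C": {1: 0.5, 2: 0.4, 3: 0.1}}

print(viterbi([3, 1, 3], states, start_p, trans_p, emit_p))
# (0.012544, ['H', 'H', 'H'])
```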
Example: Rainy and Sunny Days
Your colleague in another city either walks to work or drives every
day and his decision is usually based on the weather
Given daily emails that include whether he has walked or driven to
work, you want to guess the most likely sequence of whether the
days were rainy or sunny
Two hidden states: rainy and sunny
Two observables: walking and driving
Assume equal likelihood of the first day being rainy or sunny
Transitional probabilities
rainy given yesterday was (rainy = .7, sunny = .3)
sunny given yesterday was (rainy = .4, sunny = .6)
Output (emission) probabilities
sunny given walking = .1, driving = .9
rainy given walking = .8, driving = .2
Given that your colleague walked, drove, walked, what is the most
likely sequence of days?

BITS Pilani, Pilani Campus


Step 1 (observation: walk)
   v1(rainy) = P(walk | rainy) · P(rainy) = 0.8 × 0.5 = 0.40
   v1(sunny) = P(walk | sunny) · P(sunny) = 0.1 × 0.5 = 0.05

Step 2 (observation: drive)
   v2(rainy) = max[ v1(rainy) · P(rainy|rainy), v1(sunny) · P(rainy|sunny) ] · P(drive|rainy)
             = max[ 0.40 × 0.7, 0.05 × 0.3 ] × 0.2 = 0.28 × 0.2 = 0.056
   v2(sunny) = max[ v1(rainy) · P(sunny|rainy), v1(sunny) · P(sunny|sunny) ] · P(drive|sunny)
             = max[ 0.40 × 0.4, 0.05 × 0.6 ] × 0.9 = 0.16 × 0.9 = 0.144

BITS Pilani, Pilani Campus


Step 3 (observation: walk)
   v3(rainy) = max[ v2(rainy) · P(rainy|rainy), v2(sunny) · P(rainy|sunny) ] · P(walk|rainy)
             = max[ 0.056 × 0.7, 0.144 × 0.3 ] × 0.8 = 0.0432 × 0.8 ≈ 0.0346
   v3(sunny) = max[ v2(rainy) · P(sunny|rainy), v2(sunny) · P(sunny|sunny) ] · P(walk|sunny)
             = max[ 0.056 × 0.4, 0.144 × 0.6 ] × 0.1 = 0.0864 × 0.1 ≈ 0.0086

BITS Pilani, Pilani Campus


Summary of the Viterbi values (the trellis appears as a figure in the slides):

   v1(rainy) = 0.40    v2(rainy) = 0.056    v3(rainy) ≈ 0.0346
   v1(sunny) = 0.05    v2(sunny) = 0.144    v3(sunny) ≈ 0.0086

   Observations:        walk          drive          walk

Backtracking from v3(rainy), the best sequence = rainy sunny rainy

BITS Pilani, Pilani Campus
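Plugging the rainy/sunny parameters into the same `viterbi` sketch from the ice-cream example confirms this result:

```python
states = ["rainy", "sunny"]
start_p = {"rainy": 0.5, "sunny": 0.5}
trans_p = {"rainy": {"rainy": 0.7, "sunny": 0.4}, "sunny": {"rainy": 0.3, "sunny": 0.6}}
emit_p = {"rainy": {"walk": 0.8, "drive": 0.2}, "sunny": {"walk": 0.1, "drive": 0.9}}
print(viterbi(["walk", "drive", "walk"], states, start_p, trans_p, emit_p))
# (0.03456, ['rainy', 'sunny', 'rainy'])
```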


The Viterbi Algorithm

BITS Pilani, Pilani Campus


Viterbi Example: Ice Cream

BITS Pilani, Pilani Campus


Example3

BITS Pilani, Pilani Campus


BITS Pilani, Pilani Campus
Practice question
Suppose you want to use an HMM tagger to tag the phrase "the light book",
where we have the following probabilities:

P(the|Det)=0.3, P(the|Noun)=0.1, P(light|Noun)=0.003, P(light|Adj)=0.002,
P(light|Verb)=0.06, P(book|Noun)=0.003, P(book|Verb)=0.01

P(Verb|Det)=0.00001, P(Noun|Det)=0.5, P(Adj|Det)=0.3,
P(Noun|Noun)=0.2, P(Adj|Noun)=0.002, P(Noun|Adj)=0.2,
P(Noun|Verb)=0.3, P(Verb|Noun)=0.3, P(Verb|Adj)=0.001, P(Verb|Verb)=0.1

Assume that all the tags have the same probability at the beginning of the
sentence. Using the Viterbi algorithm, find the best tag sequence.

BITS Pilani, Pilani Campus


Forward
Efficiently computes the probability of an observed
sequence given a model
– P(sequence|model)

Nearly identical to Viterbi; replace the MAX with a SUM

BITS Pilani, Pilani Campus
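A minimal forward-algorithm sketch, identical in shape to the Viterbi sketch above but with a sum in place of the max (same assumed parameter tables):

```python
def forward(obs, states, start_p, trans_p, emit_p):
    """P(obs | model): sum over all hidden paths instead of maximizing."""
    alpha = {s: start_p[s] * emit_p[s][obs[0]] for s in states}
    for o in obs[1:]:
        alpha = {s: sum(alpha[ps] * trans_p[ps][s] for ps in states) * emit_p[s][o]
                 for s in states}
    return sum(alpha.values())

# e.g. forward([3, 1, 3], states, start_p, trans_p, emit_p) with the ice-cream tables
```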


The Forward Algorithm

BITS Pilani, Pilani Campus


Visualizing Forward

BITS Pilani, Pilani Campus


Forward Algorithm: Ice Cream

BITS Pilani, Pilani Campus


Problem 3
Infer the best model parameters, given a skeletal model
and an observation sequence...
– That is, fill in the A and B tables with the right numbers...

• The numbers that make the observation sequence


most likely
– Useful for getting an HMM without having to hire annotators...

BITS Pilani, Pilani Campus
Forward-Backward
Baum-Welch = Forward-Backward Algorithm (Baum
1972)
Is a special case of the EM or Expectation-Maximization
algorithm
The algorithm will let us learn the transition probabilities
A= {aij} and the emission probabilities B={bi(ot)} of the
HMM

BITS Pilani, Pilani Campus
Bidirectionality

• One problem with the HMM models as


presented is that they are exclusively run
left-to-right.
• Viterbi algorithm still allows present
decisions to be influenced indirectly by
future decisions,
• It would help even more if a decision about
word wi could directly use information
about future tags ti+1 and ti+2.

BITS Pilani, Pilani Campus


Bidirectionality

• Any sequence model can be turned into a bidirectional


model by using multiple passes.
• For example, the first pass would use only part-of-speech
features from already-disambiguated words on the left. In
the second pass, tags for all words, including those on the
right, can be used.
• Alternately, the tagger can be run twice, once left-to-right
and once right-to-left.
• In Viterbi decoding, the classifier chooses the higher
scoring of the two sequences (left-to-right or right-to-
left).
• Modern taggers are generally run bidirectionally.

BITS Pilani, Pilani Campus


Some limitations of HMMs

• Unknown words
• First order HMM
Eg:
Is clearly marked
He clearly marked

BITS Pilani, Pilani Campus


Maximum Entropy Markov
Model
• Turn logistic regression into a discriminative sequence
model simply by running it on successive words, using
the class assigned to the prior word as a feature in the
classification of the next word.
• When we apply logistic regression in this way, it’s called
maximum entropy Markov model or MEMM

BITS Pilani, Pilani Campus


Maximum Entropy Markov
Model
• Let the sequence of words be W = w1 … wn and the
  sequence of tags T = t1 … tn.
• In an HMM, to compute the best tag sequence that
  maximizes P(T|W), we rely on Bayes' rule and the
  likelihood P(W|T):

BITS Pilani, Pilani Campus


Maximum Entropy Markov
Model
• In an MEMM, by contrast, we compute the posterior
P(T|W) directly, training it to discriminate among the
possible tag sequences:

BITS Pilani, Pilani Campus


Maximum Entropy Markov
Model
• Consider tagging just one word. A multinomial logistic regression classifier
could compute the single probability P(ti|wi,ti−1) in a different way than an
HMM
• HMMs compute likelihood (observation word conditioned on tags) but
MEMMs compute posterior (tags conditioned on observation words).

BITS Pilani, Pilani Campus


Learning MEMM

• Learning in MEMMs relies on the same supervised


learning algorithms we presented for logistic regression.
• Given a sequence of observations, feature functions,
and corresponding hidden states, we use gradient
descent to train the weights to maximize the log-
likelihood of the training corpus.

BITS Pilani, Pilani Campus


Maximum Entropy Markov
Model
• Reason to use a discriminative sequence model is that
it’s easier to incorporate a lot of features

BITS Pilani, Pilani Campus


MEMM

• Janet/NNP will/MD back/VB the/DT bill/NN, when wi is


the word back, would generate the following features

BITS Pilani, Pilani Campus


Training MEMMs
• The most likely sequence of tags is then computed by combining these
  features of the input word wi, its neighbors within a window of l words
  (wi−l … wi+l), and the previous k tags (ti−k … ti−1), as follows (using θ to
  refer to feature weights instead of w, to avoid confusion with w meaning words):

BITS Pilani, Pilani Campus


How to decode to find this
optimal tag sequence ˆ T?
• Simplest way to turn logistic regression into a sequence
model is to build a local classifier that classifies each
word left to right, making a hard classification on the first
word in the sentence, then a hard decision on the
second word, and so on.
• This is called a greedy decoding algorithm

BITS Pilani, Pilani Campus


Issue with greedy algorithm

• The problem with the greedy algorithm is that by making


a hard decision on each word before moving on to the
next word, the classifier can’t use evidence from future
decisions.
• Although the greedy algorithm is very fast, and
occasionally has sufficient accuracy to be useful, in
general the hard decision causes too great a drop in
performance, and we don’t use it.
• Instead we decode the MEMM with the Viterbi algorithm, just as with
  the HMM: Viterbi finds the sequence of part-of-speech tags that is
  optimal for the whole sentence.

BITS Pilani, Pilani Campus


MEMM with Viterbi algorithm

• Finding the sequence of part-of-speech tags that is


optimal for the whole sentence. Viterbi value of time t for
state j

• In HMM

• In MEMM

BITS Pilani, Pilani Campus


HMM Applications
Speech Recognition including siri
Gene Prediction
Handwriting recognition
Transportation forecasting
Computational finance
And all applications which requires sequence processing…

BITS Pilani, Pilani Campus


References
• https://www.nltk.org/
• https://likegeeks.com/nlp-tutorial-using-python-nltk/
• https://www.guru99.com/pos-tagging-chunking-nltk.html
• https://medium.com/greyatom/learning-pos-tagging-chunking-in-nlp-85f7f811a8cb
• https://nlp.stanford.edu/software/tagger.shtml
• https://www.forbes.com/sites/mariyayao/2020/01/22/what-are--important-ai--machine-learning-trends-for-2020/#601ce9623239
• https://medium.com/fintechexplained/nlp-text-processing-in-data-science-projects-f083009d78fc

BITS Pilani, Pilani Campus


Maximum entropy classifiction
From: [email protected]
Har har!
Allow mes to introduce myself. Myname is Lechuck ,and I got your email from the
interwebs mail directory. I threw a dart at the directory and hit your name. You
seam like a good lad based on your name, so I here I am writing this email.
I live out here in this faraway island, and I have some moneys ($122K USD, to be exact)
that I need to send overseas. Could you do mes a favor and help me with my
moneys transfer?
1) Provide mes a bank account where this money would be transferred to.
2) Send me a picture of yourselfs so I know who to look for and thank when I sail to the
US. Click heres to my Facebook
[www.lechuck.myfacebook.com/fake.link/give_me_money] and post me your pic.
Monkeys bananas money rich
As reward, I are willing to offer you 15% of the moneys as compensation for effort
input after the successful transfer of this fund to your designate account overseas.
please feel free to contact ,me via this email address
[email protected]

BITS Pilani, Pilani Campus


Features and weights

F1=Email contains spelling/grammatical errors


F2=Email asks for money to be transferred
F3=Email mentions account holder’s name
Weights for spam
W1 =Email contains spelling/grammatical errors): 0.5
W2=Email asks for money to be transferred): 0.2
W3=Email mentions account holder’s name): -0.5
Weights for not spam
W1=(Email contains spelling/grammatical errors): -0.2
W2=(Email asks for money to be transferred): 0
W3 =(Email mentions account holder’s name): 1
BITS Pilani, Pilani Campus
Function

Formula

BITS Pilani, Pilani Campus


F1=Email contains spelling/grammatical errors -Yes
F2=Email asks for money to be transferred-Yes
F3=Email mentions account holder’s name-No

BITS Pilani, Pilani Campus


Spam score

Not spam score

BITS Pilani, Pilani Campus
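The score formulas on these slides are shown as images; the sketch below follows the standard maximum-entropy form (exponentiate each class's weighted feature sum, then normalize) using the feature values and weights listed above:

```python
import math

features = {"f1_spelling_errors": 1, "f2_asks_for_money": 1, "f3_mentions_name": 0}
weights = {
    "spam":     {"f1_spelling_errors": 0.5,  "f2_asks_for_money": 0.2, "f3_mentions_name": -0.5},
    "not_spam": {"f1_spelling_errors": -0.2, "f2_asks_for_money": 0.0, "f3_mentions_name": 1.0},
}

def maxent_probs(features, weights):
    """Softmax over the per-class weighted feature sums."""
    scores = {c: math.exp(sum(w[f] * features[f] for f in features)) for c, w in weights.items()}
    z = sum(scores.values())
    return {c: s / z for c, s in scores.items()}

print(maxent_probs(features, weights))
# {'spam': ~0.71, 'not_spam': ~0.29}
```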


Forward algorithm-example

BITS Pilani, Pilani Campus


BITS Pilani, Pilani Campus
Natural Language Processing
DSECL ZG565
Prof. Vijayalakshmi Anand
BITS Pilani, Pilani Campus

Session 5 – Parsing
Date – 9 May 2021
Time – 9 am to 11 am
These slides are prepared by the instructor, with grateful acknowledgement of James
Allen and many others who made their course materials freely available online.
 POS tagging
 Approaches used in POS tagging
 HMM
 HMM tagger
 Viterbi algorithm
 Forward algorithm
 Maximum entropy model
Session5:Parsing
• Grammars and Sentence Structure
• What Makes a Good Grammar
• Parsing
• A Top-Down Parser
• A Bottom-Up Chart Parser
• Top-Down Chart Parsing
• Finite State Models and Morphological Parsing.

BITS Pilani, Pilani Campus


What is the structure of a
sentence in natural language ?
• Sentence structure is hierarchical:
A sentence consists of words (I, eat, sushi, with, tuna)
..which form phrases: “sushi with tuna”
• Sentence structure defines dependencies between
words or phrases:

BITS Pilani, Pilani Campus


How to compute the Sentence
structure
To compute the syntactic structure ,must consider 2 things
• Grammar
A precise way to define and describe the structure of the
sentence
• Parsing
The method of analyzing a sentence to determine its
structure according to the grammar.

BITS Pilani, Pilani Campus


Why NLP needs grammars ?

• Regular languages and parts of speech describe the way
  words are arranged together, but cannot easily capture
  constituency, grammatical relations, or
  subcategorization and dependency relations.
• These can be modelled by grammars.

BITS Pilani, Pilani Campus


Applications -Machine
translation
The output of current systems is often ungrammatical:
   "Daniel Tse, a spokesman for the Executive Yuan said the
   referendum demonstrated for democracy and human
   rights, the President on behalf of the people of two. 3
   million people for the national space right, it cannot say
   on the referendum, the legitimacy of Taiwan's position
   full." (BBC Chinese news, translated by Google Chinese to English)

Correct translation requires grammatical knowledge.

BITS Pilani, Pilani Campus


Question Answering

This requires grammatical knowledge:
   John persuaded/promised Mary to leave.
   - Who left?
and inference:
   John managed/failed to leave. - Did John leave?
   John and his parents visited Prague. They went to the castle.
   - Was John in Prague?
   - Has John been to the Czech Republic?
   - Has John's dad ever seen a castle?

BITS Pilani, Pilani Campus


Contd..

Sentiment Analysis:
"I like Frozen"
"I do not like Frozen“
"I like frozen yogurt"

Relation Extraction:
"Rome is the capital of Italy and the region of Lazio".

BITS Pilani, Pilani Campus


Context free grammar

• Simple yet powerful formalism to describe the syntactic


structure of natural languages
• Developed in the mid-1950s by Noam Chomsky
• Allows one to specify rules that state how a constituent
can be segmented into smaller and smaller constituents,
up to the level of individual words

BITS Pilani, Pilani Campus


Context free grammar -
Definition

BITS Pilani, Pilani Campus


Contd..
• Terminals ∑
- words - {sleeps, saw, man, woman,
telescope, the, with, in}
• Non-Terminals -N
- The constituents in a language . Such as
noun phrases, verb phrases and sentences
- {S, NP, VP, PP, DT, Vi, Vt, NN, IN}
• Rules
- Rules are equations that consist of a single
non-terminal on the left and any number of
terminals and nonterminals on the right.
BITS Pilani, Pilani Campus
Some sample rules
lexicons
grammar

BITS Pilani, Pilani Campus
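The rule and lexicon tables here are shown as an image. As a rough stand-in (a toy grammar of my own, using words from the parsing examples later in this deck, not necessarily the slide's exact rules), a small CFG can be written and parsed with NLTK:

```python
import nltk

grammar = nltk.CFG.fromstring("""
S  -> NP VP
NP -> DT NN | DT JJ NN
VP -> Vi | Vt NP
DT -> 'the'
JJ -> 'old'
NN -> 'man' | 'dogs'
Vi -> 'cried'
Vt -> 'saw'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("the old man cried".split()):
    print(tree)
# (S (NP (DT the) (JJ old) (NN man)) (VP (Vi cried)))
```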


Categories of Phrases

Noun phrase (NP): Noun acts as the head word. They


start with an article or noun.
Verb phrase (VP): Verb acts as the head word. They start
with an verb
Adjective phrase (ADJP): Adjective as the head word.
They start with an adjective
Adverb phrase (ADVP): Adverb acts as the head word.
They usually start with an adverb.
Prepositional phrase (PP): Preposition as the head word.
They start with an preposition.

BITS Pilani, Pilani Campus


Verb phrase

English VPs consist of a head verb along with 0 or more


following constituents which we’ll call arguments.

BITS Pilani, Pilani Campus


Parsing

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Why parsing ?

• Parsing adds information about sentence


structure and constituents
• Allows us to see what constructions words enter
into
– eg, transitivity, passivization, argument structure for
verbs
• Allows us to see how words function relative to
each other
– eg, what words can modify / be modified by other
words

BITS Pilani, Pilani Campus


Parsing as a search procedure

• Parsing can be treated as a special case of a search problem as
  defined in AI.
1. Select the first state from the possibilities list (and
remove it from the list).
2. Generate the new states by trying every possible option
from the selected state (there may be none if we are on
a bad path).
3. Add the states generated in step 2 to the possibilities list.

BITS Pilani, Pilani Campus


Two types of Parsing

BITS Pilani, Pilani Campus


A Simple Top-Down Parsing Algorithm

The algorithm starts with the initial state ((S) 1) and no backup states.

1. Take the first state off the possibilities list and call it C.
   IF the list is empty, THEN the algorithm fails.
2. IF C consists of an empty symbol list and the word position is at the
   end of the sentence, THEN the algorithm succeeds.
3. OTHERWISE, generate the next possible states:
   3.1 IF the first symbol of C is a lexical symbol, AND the next word in
       the sentence can be in that class, THEN
       – create a new state by removing the first symbol
       – update the word position
       – add it to the possibilities list.
   3.2 OTHERWISE, IF the first symbol of C is a non-terminal, THEN
       – generate a new state for each rule that can rewrite that
         non-terminal symbol
       – add them all to the possibilities list.

BITS Pilani, Pilani Campus


Toy example

BITS Pilani, Pilani Campus


Contd..

BITS Pilani, Pilani Campus


Example2

“1 The 2 dogs 3 cried 4”

“1 The 2 old 3 man 4 cried 5”

BITS Pilani, Pilani Campus


Example- “The dogs cried”

Step Current State Backup States Comment


1. ((S) 1) initial position
2. ((NP VP) 1) rewrite S by rule 1
3. ((ART N VP) rewrite NP by rules 2&3
1)
((ART ADJ N VP) 1)
4. ((N VP) 2) match ART with the
((ART ADJ N VP) 1)
5. ((VP) 3) match N with dogs
((ART ADJ N VP) 1)
6. ((V) 3) rewrite VP by rules 4&5
((V NP) 3)
((ART ADJ N VP) 1)
7. (( ) 4)                                  the parse succeeds as V is matched to cried
BITS Pilani, Pilani Campus
Example-“The old man cried”

Step Current State Backup States Comment


1. ((S) 1) initial position
2. ((NP VP) 1) S rewritten to NP VP
3. ((ART N VP) 1) NP rewritten by rules 2&3
((ART ADJ N VP) 1)
4. ((N VP) 2)
((ART ADJ N VP) 1)
5. ((VP) 3)
((ART ADJ N VP) 1)
6. ((V) 3) VP rewritten by rules 4&5
((V NP) 3)
((ART ADJ N VP) 1)
7. (( ) 4)
((V NP) 3)
((ART ADJ N VP) 1)
8. ((V NP) 3) the first backup is chosen
((ART ADJ N VP) 1)

BITS Pilani, Pilani Campus


con’t

Step Current State Backup States Comment


9. ((NP) 4)
((ART ADJ N VP) 1)
10. ((ART N) 4) looking for ART fails
((ART ADJ N) 4)
((ART ADJ N VP) 1)
11. ((ART ADJ N) 4) fails again
((ART ADJ N VP) 1)
12. ((ART ADJ N VP) 1) exploring backup state
saved in step 3
13.((ADJ N VP) 2)
14. ((N VP) 3)
15. ((VP) 4)
16. ((V) 4)
((V NP) 4)
17. (( ) 5) success!

BITS Pilani, Pilani Campus


Bottom up parsing

BITS Pilani, Pilani Campus


Sample grammer

BITS Pilani, Pilani Campus


Example

BITS Pilani, Pilani Campus


Contd ….

BITS Pilani, Pilani Campus


Chart Parsing

BITS Pilani, Pilani Campus


Chart Parsing
• Assume you are parsing a sentence that starts with an ART.
• With this ART as the key, rules 2 and 3 are matched because
they start with ART.
• To record this for analyzing the next key, you need to record
that rules 2 and 3 could be continued at the point after the
ART.
• You denote this fact by writing the rule with a dot (•),
  indicating what has been seen so far. Thus you record
      2'. NP -> ART • ADJ N
      3'. NP -> ART • N
• If the next input key is an ADJ, then rule 4 may be started,
  and the modified rule 2 may be extended to give
      2''. NP -> ART ADJ • N
BITS Pilani, Pilani Campus
The Chart Parsing Algorithm

BITS Pilani, Pilani Campus


Chart parsing example

BITS Pilani, Pilani Campus


Chart parsing example

BITS Pilani, Pilani Campus


Chart parsing example

BITS Pilani, Pilani Campus


Chart parsing example

BITS Pilani, Pilani Campus


Chart parsing example

BITS Pilani, Pilani Campus


Final Chart

BITS Pilani, Pilani Campus


Example2

BITS Pilani, Pilani Campus


Contd..

BITS Pilani, Pilani Campus


Top down chart parsing

BITS Pilani, Pilani Campus


Top down chart parsing example

BITS Pilani, Pilani Campus


Top down chart parsing example

BITS Pilani, Pilani Campus


Top down chart parsing example

BITS Pilani, Pilani Campus


Top down chart parsing example

BITS Pilani, Pilani Campus


Morphological Parsing

• Not only are there a large number of words, but each word may
combine with affixes to produce additional related words.
• One way to address this problem is to preprocess the input
sentence into a sequence of morphemes.
• A word may consist of single morpheme, but often a word consists
of a root form plus an affix. .

BITS Pilani, Pilani Campus


Components of Morphological
parsing
Lexicon
Morphotactic
Orthographic rules

BITS Pilani, Pilani Campus


Finite state automation

BITS Pilani, Pilani Campus


Finite State Transducers

The simple story


Add another tape
Add extra symbols to the transitions

On one tape we read “cats”, on the


other we write “cat +N +PL”, or the
other way around.

BITS Pilani, Pilani Campus
FSTs

BITS Pilani, Pilani Campus
Example

BITS Pilani, Pilani Campus


Parsing/Generation vs. Recognition
Recognition is usually not quite what we need.
Usually if we find some string in the language
we need to find the structure in it (parsing)
Or we have some structure and we want to
produce a surface form
(production/generation)
Example
From “cats” to “cat +N +PL” and back

Morphological analysis
BITS Pilani, Pilani Campus
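As a toy stand-in for the two-tape behaviour described above, the same lookup table can be read in one direction for analysis and in the other for generation; the dictionary here is purely illustrative and not how an FST actually composes a lexicon with spelling rules.

# Reading "cats" on one tape and writing "cat +N +PL" on the other, or the
# other way around, illustrated with an assumed toy table.
ANALYSES = {
    "cats": "cat +N +PL",
    "cat":  "cat +N +SG",
}
GENERATION = {v: k for k, v in ANALYSES.items()}   # invert for the other direction

print(ANALYSES["cats"])            # cat +N +PL   (analysis / parsing)
print(GENERATION["cat +N +PL"])    # cats         (generation / production)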
Stemming in IR

Run a stemmer on the documents to be


indexed
Run a stemmer on users queries
Match
This is basically a form of hashing
Example: Computerization
ization -> -ize computerize
ize -> ε computer

Errors in Stemming:

There are mainly two errors in stemming –

• over-stemming
• under-stemming
Porter

No lexicon needed
Basically a set of staged sets of rewrite
rules that strip suffixes
Handles both inflectional and
derivational suffixes
Doesn’t guarantee that the resulting stem
is really a stem (see first bullet)
Lack of guarantee doesn’t matter for IR

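A quick way to try this suffix-stripping behaviour is the Porter stemmer shipped with NLTK (assuming NLTK is installed); note that, as stated above, the resulting stem need not be a real word, which does not matter for IR.

# Hands-on check of the Porter stemmer via NLTK.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["computerization", "organization", "flies", "cats"]:
    print(word, "->", stemmer.stem(word))
# "computerization", for example, is stripped of its -ization/-ize material,
# as in the IR example above; the stems need not be real words.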
Porter Stemmer
Errors of Omission (pairs that should share a stem but do not):
  European / Europe; analysis / analyzes; matrices / matrix; noise / noisy; explain / explanation
Errors of Commission (pairs wrongly conflated to the same stem):
  organization / organ; doing / doe; generalization / generic; numerical / numerous; university / universe
BITS Pilani, Pilani Campus
References

• Speech and Language processing: An


introduction to Natural Language
Processing, Computational Linguistics and
speech Recognition by Daniel Jurafsky and
James H. Martin
• Natural language understanding by James
Allen

BITS Pilani, Deemed to be University under Section 3 of UGC Act, 1956


Thank You… 
• Q&A
• Suggestions / Feedback
Natural Language Processing
DSECL ZG565
Prof.Vijayalakshmi
BITS Pilani BITS-Pilani
Pilani Campus
BITS Pilani
Pilani Campus

Session 6 -Probabilistic Context Free Grammar


Date – 4th Oct2020
Time – 9 am to 11am
These slides are prepared by the instructor, with grateful acknowledgement of Prof. Jurafsky and
Prof. Martin and many others who made their course materials freely available online.
Session Content
(Ref: Chapter 14 Jurafsky and Martin)

• CKY Parsing
• Probabilistic Context-Free Grammars
• PCFG for disambiguation
• Probabilistic CKY Parsing of PCFGs
• Ways to Learn PCFG Rule Probabilities
• Probabilistic Lexicalized CFGs
• Evaluating Parsers
• Problems with PCFGs
BITS Pilani, Pilani Campus
Parsing algorithm

• Top-down vs. bottom-up:


Top-down: (goal-driven): from the start symbol down.
Bottom-up: (data-driven): from the symbols up.
• Naive vs. dynamic programming:
Naive: enumerate everything.
Backtracking: try something, discard partial solutions.
Dynamic programming: save partial solutions in a table.
• Examples:
CKY: bottom-up dynamic programming.
Earley parsing: top-down dynamic programming.
Chart parsing: bottom-up chart parsing

BITS Pilani, Pilani Campus


Issues with top-down and
bottom-up parsing
• Top-down and bottom-up parsing both lead to repeated construction of the
  same substructures on different branches of the search
• Globally bad parses can construct good subtrees
  – but the overall parse will fail; no backtracking strategy by itself avoids the repeated work
• Efficient parsing techniques require storage of shared substructure
  – typically with dynamic programming
• Several implementations
  – CKY algorithm
  – Earley algorithm
  – Chart parsing algorithm

BITS Pilani, Pilani Campus


CKY parsing

Classic, bottom-up dynamic programming algorithm (Cocke-Kasami-


Younger).
Requires input grammar based on Chomsky Normal Form
– A CNF grammar is a Context-Free Grammar in which:
• Every rule LHS is a non-terminal
• Every rule RHS consists of either a single terminal or two
non-terminals.
• Examples:
– A  BC
– NP  N PP
– A a
– Noun  man
• But not:
– NP → the N
– S → VP
Chomsky Normal Form

Any CFG can be re-written in CNF, without any


loss of expressiveness.

– That is, for any CFG, there is a corresponding


CNF grammar which accepts exactly the
same set of strings as the original CFG.

BITS Pilani, Pilani Campus


Converting a CFG to CNF

To convert a CFG to CNF, we need to deal with


three issues:
1. Rules that mix terminals and non-terminals on
the RHS
• E.g. NP  the Nominal
2. Rules with a single non-terminal on the RHS
(called unit productions)
• E.g. NP  Nominal
3. Rules which have more than two items on the
RHS 8
• E.g. NP  Det Noun PP BITS Pilani, Pilani Campus
Converting a CFG to CNF
1. Rules that mix terminals and non-terminals on
the RHS
– E.g. NP  the Nominal

– Solution:
• Introduce a dummy non-terminal to
cover the original terminal
– E.g. Det  the
• Re-write the original rule: 9

BITS Pilani, Pilani Campus


Converting a CFG to CNF
2. Rules with a single non-terminal on the RHS
(called unit productions)
– E.g. NP  Nominal

– Solution:
• Find all rules that have the form Nominal  ...
– Nominal  Noun PP
– Nominal  Det Noun
• Re-write the above rule several times to eliminate
the intermediate non-terminal:
– NP  Noun PP
– NP  Det Noun
10
– Note that this makes our grammar “flatter”
BITS Pilani, Pilani Campus
Converting a CFG to CNF

3. Rules which have more than two items on the RHS


– E.g. NP  Det Noun PP

Solution:
– Introduce new non-terminals to spread the sequence on the
RHS over more than 1 rule.
• Nominal  Noun PP
• NP  Det Nominal

11

BITS Pilani, Pilani Campus
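A minimal sketch of step 3 above: a rule whose right-hand side is longer than two symbols is split into a chain of binary rules by introducing new non-terminals (handling of mixed terminals and unit productions is omitted here, and the naming scheme for the new non-terminals is an arbitrary choice).

# Binarise lhs -> X1 X2 ... Xn (n > 2) into a chain of binary CNF rules.
def binarise(lhs, rhs, rules):
    while len(rhs) > 2:
        new_nt = f"{lhs}|<{'-'.join(rhs[1:])}>"    # fresh non-terminal for the tail
        rules.append((lhs, [rhs[0], new_nt]))
        lhs, rhs = new_nt, rhs[1:]
    rules.append((lhs, rhs))
    return rules

print(binarise("NP", ["Det", "Noun", "PP"], []))
# [('NP', ['Det', 'NP|<Noun-PP>']), ('NP|<Noun-PP>', ['Noun', 'PP'])]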


CNF Grammar
If we parse a sentence with a CNF grammar, we
know that:
– Every phrase-level non-terminal (above the
part of speech level) will have exactly 2
daughters.
• NP  Det N
– Every part-of-speech level non-terminal will
have exactly 1 daughter, and that daughter is
a terminal:
• N  lady
12

BITS Pilani, Pilani Campus


Recognising strings with CKY
Example input: The flight includes a meal.

The CKY algorithm proceeds by:


1. Splitting the input into words and indexing each position.
(0) the (1) flight (2) includes (3) a (4) meal (5)

2. Setting up a table. For a sentence of length n, we need


(n+1) rows and (n+1) columns.

3. Traversing the input sentence left-to-right

4. Use the table to store constituents and their span.

13

BITS Pilani, Pilani Campus
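A compact CKY recogniser corresponding to steps 1-4 above, written for the small CNF grammar used in the example that follows; table[(i, j)] holds the set of non-terminals that can span words i..j.

# CKY recognition for the toy CNF grammar of the following slides.
from collections import defaultdict

LEX = {"the": {"Det"}, "a": {"Det"}, "flight": {"N"}, "meal": {"N"}, "includes": {"V"}}
BIN = [("S", "NP", "VP"), ("NP", "Det", "N"), ("VP", "V", "NP")]

def cky_recognise(words):
    n = len(words)
    table = defaultdict(set)
    for j in range(1, n + 1):                      # lexical step for column j
        table[(j - 1, j)] |= LEX.get(words[j - 1], set())
        for i in range(j - 2, -1, -1):             # syntactic step: fill cell [i, j]
            for k in range(i + 1, j):              # try every split point k
                for lhs, b, c in BIN:
                    if b in table[(i, k)] and c in table[(k, j)]:
                        table[(i, j)].add(lhs)
    return "S" in table[(0, n)]

print(cky_recognise("the flight includes a meal".split()))   # True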


CKY example
S  NP VP
NP  Det N
VP  V NP
V  includes
Det  the
Det  a
N  meal
N  flight

BITS Pilani, Pilani Campus


CKY example

Rule: Det  the

[0,1] for “the”

1 2 3 4 5
0 Det S
1

2
3
4

the flight includes a meal

BITS Pilani, Pilani Campus


CKY example

Rule1: Det  the


Rule 2: N  flight

[0,1] for “the” [1,2] for “flight”

1 2 3 4 5
0 Det S
1 N

2
3
4

the flight includes a meal

BITS Pilani, Pilani Campus


CKY example

[0,2] for “the Rule1: Det  the


flight” Rule 2: N  flight
Rule 3: NP  Det N

[0,1] for “the” [1,2] for “flight”

1 2 3 4 5
0 Det NP S
1 N
2
3
4

the flight includes a meal

BITS Pilani, Pilani Campus


CKY: lexical step (j = 1)
1 2 3 4 5
0 Det
1
2
3
4
5

Lexical lookup
• Matches Det  the
The flight includes a meal.

BITS Pilani, Pilani Campus


CKY: lexical step (j = 2)
1 2 3 4 5
0 Det
1 N
2
3
4
5

Lexical lookup
• Matches N  flight The flight includes a meal.

BITS Pilani, Pilani Campus


CKY: syntactic step (j = 2)
1 2 3 4 5
0 Det NP
1 N
2
3
4
5

Syntactic lookup:
• look backwards and see if there
is any rule that will cover what
we’ve done so far. The flight includes a meal.

BITS Pilani, Pilani Campus


CKY: lexical step (j = 3)
1 2 3 4 5
0 Det NP
1 N
2 V
3
4
5

Lexical lookup
• Matches V  includes
The flight includes a meal.

BITS Pilani, Pilani Campus


CKY: lexical step (j = 3)
1 2 3 4 5
0 Det NP
1 N
2 V
3
4
5

Syntactic lookup
• There are no rules in our
grammar that will cover
Det, NP, V
The flight includes a meal.

BITS Pilani, Pilani Campus


CKY: lexical step (j = 4)
1 2 3 4 5
0 Det NP
1 N

2 V
3 Det
4
5

Lexical lookup
• Matches Det  a
The flight includes a meal.

BITS Pilani, Pilani Campus


CKY: lexical step (j = 5)
1 2 3 4 5
0 Det NP
1 N

2 V
3 Det
4 N

Lexical lookup
• Matches N  meal The flight includes a meal.

BITS Pilani, Pilani Campus


CKY: syntactic step (j = 5)
1 2 3 4 5
0 Det NP
1 N

2 V
3 Det NP
4 N

Syntactic lookup
• We find that we have
NP  Det N
The flight includes a meal.

BITS Pilani, Pilani Campus


CKY: syntactic step (j = 5)
1 2 3 4 5
0 Det NP
1 N

2 V VP
3 Det NP
4 N

Syntactic lookup
• We find that we have
VP  V NP

The flight includes a meal.

BITS Pilani, Pilani Campus


CKY: syntactic step (j = 5)
1 2 3 4 5
0 Det NP S
1 N

2 V VP
3 Det NP
4 N

Syntactic lookup
• We find that we have
S  NP VP
The flight includes a meal.

27

BITS Pilani, Pilani Campus


From recognition to parsing
The procedure so far will recognise a string as a
legal sentence in English.
But we’d like to get a parse tree back!
Solution:
– We can work our way back through the table and
collect all the partial solutions into one parse tree.
– Cells will need to be augmented with “backpointers”,
i.e. With a pointer to the cells that the current cell
covers.

28

BITS Pilani, Pilani Campus


From recognition to parsing
1 2 3 4 5

0 Det NP S

1 N

2 V VP

3 Det NP

4 N

BITS Pilani, Pilani Campus


From recognition to parsing
1 2 3 4 5

0 Det NP S

1 N

2 V VP

3 Det NP

4 N

NB: This algorithm always fills the top “triangle” of the table!
30

BITS Pilani, Pilani Campus


What about ambiguity?
The algorithm does not assume that there is only
one parse tree for a sentence.
– (Our simple grammar did not admit of any ambiguity, but this isn’t realistic of
course).

There is nothing to stop it returning several parse


trees.

If there are multiple local solutions, then more than


one non-terminal will be stored in a cell of the
table. 31

BITS Pilani, Pilani Campus


Some funny examples

– Policeman to littleboy: “We are looking for a


thief with a bicycle.” Little boy: “Wouldn’t you be
better using your eyes.”
– Why is the teacher wearing sun-glasses.
Because the class is so bright.

32

BITS Pilani, Pilani Campus


Ambiguity is Explosive

33

BITS Pilani, Pilani Campus


Motivation
• Context-free grammars can be generalized to
include probabilistic information by adding it
to CFG rule
• Probabilistic Context Free Grammars (PCFGs)
are the simplest and most natural
probabilistic model for tree structures and the
algorithms for them are closely related to
those for HMMs.
• PCFG are also known as Stochastic Context-
Free Grammar (SCFG)
34
Fall 2001 EE669: Natural Language Processing
BITS Pilani, Pilani Campus
CFG definition (reminder)
A CFG is a 4-tuple: (N,Σ,R,S):

– N = a set of non-terminal symbols (e.g. NP, VP)

– Σ = a set of terminals (e.g. words)


• N and Σ are disjoint (no element of N is also an element of Σ)

– R = a set of rules of the form Aβ where:


• A is a non-terminal (a member of N)
• β is any string of terminals and non-terminals

– S = a designated start symbol (usually, “sentence”)

35

BITS Pilani, Pilani Campus


Formal Definition of a PCFG
• A PCFG consists of:
– A set of terminals, {wk}, k= 1,…,V
– A set of nonterminals, Ni, i= 1,…, n
– A designated start symbol N1
– A set of rules, {Ni → ζj}, where ζj is a sequence
of terminals and nonterminals
– A corresponding set of probabilities on rules
such that, for every i: Σj P(Ni → ζj) = 1

36
Fall 2001 EE669: Natural Language Processing 36
BITS Pilani, Pilani Campus
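One concrete way to see the sum-to-one constraint is to write down a small PCFG; this sketch uses NLTK's PCFG class (assuming NLTK is installed) with the rule probabilities of the "astronomers saw stars with ears" grammar that the probability calculations a few slides below rely on. The probabilities for NP → 'saw' and NP → 'telescopes' are the usual textbook values, filled in so that each left-hand side sums to 1.

import nltk

# Every non-terminal's rule probabilities sum to 1
# (S: 1.0; VP: 0.7 + 0.3; NP: 0.4 + 0.1 + 0.18 + 0.04 + 0.18 + 0.1).
grammar = nltk.PCFG.fromstring("""
S -> NP VP [1.0]
PP -> P NP [1.0]
VP -> V NP [0.7]
VP -> VP PP [0.3]
P -> 'with' [1.0]
V -> 'saw' [1.0]
NP -> NP PP [0.4]
NP -> 'astronomers' [0.1]
NP -> 'ears' [0.18]
NP -> 'saw' [0.04]
NP -> 'stars' [0.18]
NP -> 'telescopes' [0.1]
""")

parser = nltk.ViterbiParser(grammar)          # returns the most probable parse
for tree in parser.parse("astronomers saw stars with ears".split()):
    print(tree.prob())                        # ~0.0009072, see the calculation below
    print(tree)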
Example: Probability of a Derivation
Tree

Fall 2001 EE669: Natural Language Processing 37


BITS Pilani, Pilani Campus
Probability of a Derivation Tree and
a String
• The probability of a derivation (i.e. parse) tree:
P(T) = ∏ i=1..k p(r(i))
where r(1), …, r(k) are the rules of the CFG
used to generate the sentence w1m of which T
is a parse.
• The probability of a sentence (according to
grammar G) is given by:
P(w1m) = Σt P(w1m, t) = Σ{t: yield(t)=w1m} P(t)
where t is a parse tree of the sentence. Need
dynamic programming to make this efficient!
38
Fall 2001 EE669: Natural Language Processing
BITS Pilani, Pilani Campus
Example

• Terminals with, saw, astronomers, ears, stars, telescopes

• Nonterminals S, PP, P, NP, VP, V


• Start symbol S

BITS Pilani, Pilani Campus


astronomers saw stars with
ears

BITS Pilani, Pilani Campus


astronomers saw stars with
ears

BITS Pilani, Pilani Campus


Probabilities

P(t1) = 1.0 × 0.1 × 0.7 × 1.0 × 0.4 × 0.18 × 1.0 × 1.0 × 0.18 = 0.0009072
P(t2) = 1.0 × 0.1 × 0.3 × 0.7 × 1.0 × 0.18 × 1.0 × 1.0 × 0.18 = 0.0006804
P(w1..5) = P(t1) + P(t2) = 0.0015876
BITS Pilani, Pilani Campus


Example2

P(TL) = 1.5 × 10^-6
P(TR) = 1.7 × 10^-6
P(S) = 3.2 × 10^-6

43

BITS Pilani, Pilani Campus


Some Features of PCFGs
• A PCFG gives some idea of the plausibility of
different parses; however, the probabilities are
based on structural factors and not lexical
ones.
• PCFGs are good for grammar induction.
• PCFGs are robust.
• PCFGs give a probabilistic language model for
English.
• The predictive power of a PCFG tends to be
greater than for an HMM.
• PCFGs are not good models alone but they can
be combined with a trigram model.
44
Fall 2001 EE669: Natural Language Processing 44
BITS Pilani, Pilani Campus
Properties of PCFGs

► Assigns a probability to each left-most derivation, or


parse-tree, allowed by the underlying CFG
► Say we have a sentence s, set of derivations for that
sentence is T (s). Then a PCFG assigns a probability
p(t) to each member of T (s). i.e., we now have a
ranking in order of probability.
► The most likely parse tree for a sentence s is

arg max p(t)


t∈T (s)
45

BITS Pilani, Pilani Campus


Data for Parsing Experiments: Treebanks
► Penn WSJ Treebank = 50,000 sentences with associated trees

► Usual set-up: 40,000 training sentences, 2400 test sentences

An example tree:
[Penn Treebank parse tree shown as a figure in the original slide; the parsed sentence is: "Canadian Utilities had 1988 revenue of C$1.16 billion, mainly from its natural gas and electric utility businesses in Alberta, where the company serves about 800,000 customers."]

BITS Pilani, Pilani Campus


Example tree

BITS Pilani, Pilani Campus


Word/Tag Counts
N V ART P TOTAL
flies 21 23 0 0 44
fruit 49 5 1 0 55
like 10 30 0 21 61
a 1 0 201 0 202
the 1 0 300 2 303
flower 53 15 0 0 68
flowers 42 16 0 0 58
birds 64 1 0 0 65
others 592 210 56 284 1142
TOTAL 833 300 558 307 1998

Fall 2001 EE669: Natural Language Processing 48


BITS Pilani, Pilani Campus
Lexical Probability Estimates
The table below gives the lexical probabilities which
are needed for our example:

P(the|ART) = .54      P(a|ART) = .360
P(flies|N) = .025     P(a|N) = .001
P(flies|V) = .076     P(flower|N) = .063
P(like|V) = .1        P(flower|V) = .05
P(like|P) = .068      P(birds|N) = .076
P(like|N) = .012
BITS Pilani, Pilani Campus
The PCFG
• Below is a probabilistic CFG (PCFG) with
probabilities derived from analyzing a
parsed version of Allen's corpus.
Rule               Count for LHS   Count for Rule   PROB
1. S → NP VP            300              300          1
2. VP → V               300              116         .386
3. VP → V NP            300              118         .393
4. VP → V NP PP         300               66         .22
5. NP → NP PP          1032              241         .23
6. NP → N N            1032               92         .09
7. NP → N              1032              141         .14
8. NP → ART N          1032              558         .54
9. PP → P NP            307              307          1

50
Fall 2001 EE669: Natural Language Processing
BITS Pilani, Pilani Campus
Parsing with a PCFG
• Using the lexical probabilities, we can
derive probabilities that the constituent NP
generates a sequence like a flower. Two
rules could generate the string of words:
[Figure: two NP trees for "a flower", one via rule 8 (NP → ART N, with "a" as ART and "flower" as N) and one via rule 6 (NP → N N, with "a" as N and "flower" as N).]

51
Fall 2001 EE669: Natural Language Processing
BITS Pilani, Pilani Campus
Three Possible Trees for an S

[Figure: three parse trees for "a flower wilted", each starting from rule 1 (S → NP VP): (i) NP → ART N ("a flower") with VP → V ("wilted"); (ii) NP → N ("a") with VP → V NP ("flower" + "wilted"); (iii) NP → N N ("a flower") with VP → V ("wilted").]

52
Fall 2001 EE669: Natural Language Processing
BITS Pilani, Pilani Campus
Parsing with a PCFG
• The probability of a sentence generating A flower
wilted:
P(a flower wilted|S) = P(R1|S) × P(a flower|NP) ×
P(wilted|VP) + P(R1|S) × P(a|NP) × P(flower wilted|VP)
Using this approach, the probability that a given
sentence will be generated by the grammar can
be efficiently computed.
• It only requires some way of recording the value
of each constituent between each two possible
positions. The requirement can be filled by a
packed chart structure.
53
Fall 2001 EE669: Natural Language Processing
BITS Pilani, Pilani Campus
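As a quick check of the formula above, here is the NP term P(a flower | NP) computed from the rule and lexical probability tables given earlier (rule 8: NP → ART N with probability .54, rule 6: NP → N N with probability .09):

p_rule8, p_rule6 = 0.54, 0.09                       # NP -> ART N, NP -> N N
p_a_art, p_a_n, p_flower_n = 0.360, 0.001, 0.063    # lexical probabilities from the table
p_np_a_flower = p_rule8 * p_a_art * p_flower_n + p_rule6 * p_a_n * p_flower_n
print(p_np_a_flower)    # about 0.01225, dominated by the ART N reading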
Uses of probabilities in parsing
Disambiguation: given n legal parses of a string, which is the most
likely?
– e.g. PP-attachment ambiguity can be resolved this way

Speed: we’ve defined parsing as a search problem


– search through space of possible applicable derivations
– search space can be pruned by focusing on the most likely sub-
parses of a parse

Parser can be used as a model to determine the probability of a


sentence, given a parse
– typical use in speech recognition, where input utterance can be
“heard” as several possible sentences

54

BITS Pilani, Pilani Campus


Using PCFG probabilities
PCFG assigns a probability to every parse-tree t of a
string W
– e.g. every possible parse (derivation) of a sentence
recognised by the grammar

Notation:
– G = a PCFG
– s = a sentence
– t = a particular tree under our grammar
• t consists of several nodes n
• each node is generated by applying some rule r
55

BITS Pilani, Pilani Campus


Probability of a tree vs. a sentence

We work out the probability of a parse tree t by


multiplying the probability of every rule (node) that
gives rise to t (i.e. the derivation of t).

Note that:
– A tree can have multiple derivations
• (different sequences of rule applications could give rise
to the same tree)
– But the probability of the tree remains the same
• (it’s the same probabilities being multiplied)
– We usually speak as if a tree has only one derivation, called the canonical derivation

56
BITS Pilani, Pilani Campus
Picking the best parse in a PCFG
A sentence will usually have several parses
– we usually want them ranked, or only want the n best
parses
– we need to focus on P(t|s,G)
• probability of a parse, given our sentence and our
grammar

– definition of the best parse for s:


• The tree for which P(t|s,G) is highest

57
BITS Pilani, Pilani Campus
Probability of a sentence
Given a probabilistic context-free grammar G, we can compute the
probability of a sentence (as opposed to a tree).

Observe that:
– As far as our grammar is concerned, a sentence is only a
sentence if it can be recognised by the grammar (it is “legal”)
– There can be multiple parse trees for a sentence.
• Many trees whose yield is the sentence
– The probability of the sentence is the sum of all the probabilities
of the various trees that yield the sentence.

58

BITS Pilani, Pilani Campus


Using CKY to parse with a PCFG

The basic CKY algorithm remains unchanged.

However, rather than only keeping partial solutions


in our table cells (i.e. The rules that match some
input), we also keep their probabilities.

59

BITS Pilani, Pilani Campus


Probabilistic CKY: example PCFG

S  NP VP [.80]
NP  Det N [.30]
VP  V NP [.20]
V  includes [.05]
Det  the [.4]
Det  a [.4]
N  meal [.01]
N  flight [.02]

60

BITS Pilani, Pilani Campus


Probabilistic CYK: initialisation
1 2 3 4 5
0
1
2
3
4
5

 S  NP VP [.80]
 NP  Det N [.30]
 VP  V NP [.20]
 V  includes [.05] The flight includes a meal.
 Det  the [.4]
 Det  a [.4]
 N  meal [.01]
 N  flight [.02]
61

BITS Pilani, Pilani Campus


Probabilistic CYK: lexical step
1 2 3 4 5
0 Det
(.4)
1
2
3
4
5

 S  NP VP [.80]
 NP  Det N [.30]
 VP  V NP [.20]
 V  includes [.05] The flight includes a meal.
 Det  the [.4]
 Det  a [.4]
 N  meal [.01]
 N  flight [.02] 62

BITS Pilani, Pilani Campus


Probabilistic CYK: lexical step
1 2 3 4 5
0 Det
(.4)
1 N
.02
2
3
4
5

 S  NP VP [.80]
 NP  Det N [.30]
 VP  V NP [.20]
 V  includes [.05] The flight includes a meal.
 Det  the [.4]
 Det  a [.4]
63
 N  meal [.01]
 N  flight [.02]
BITS Pilani, Pilani Campus
Probabilistic CYK: syntactic step
1 2 3 4 5
0 Det NP
(.4) .0024
1 N
.02
2
3
4
5

 S  NP VP [.80]


NP  Det N [.30]
VP  V NP [.20]
The flight includes a meal.
 V  includes [.05]
 Det  the [.4]
 Det  a [.4] Note: probability of NP in [0,2]
 N  meal [.01] P(Det  the) * P(N  flight) * P(NP  Det N)
 N  flight [.02] 64

BITS Pilani, Pilani Campus


Probabilistic CYK: lexical step
Grammar: S → NP VP [.80]  NP → Det N [.30]  VP → V NP [.20]  V → includes [.05]  Det → the [.4]  Det → a [.4]  N → meal [.01]  N → flight [.02]
Chart so far: [0,1] Det .4 | [0,2] NP .0024 | [1,2] N .02 | [2,3] V .05

The flight includes a meal.


BITS Pilani, Pilani Campus
Probabilistic CYK: lexical step
Grammar: S → NP VP [.80]  NP → Det N [.30]  VP → V NP [.20]  V → includes [.05]  Det → the [.4]  Det → a [.4]  N → meal [.01]  N → flight [.02]
Chart so far: [0,1] Det .4 | [0,2] NP .0024 | [1,2] N .02 | [2,3] V .05 | [3,4] Det .4

The flight includes a meal.


BITS Pilani, Pilani Campus
Probabilistic CYK: syntactic step
Grammar: S → NP VP [.80]  NP → Det N [.30]  VP → V NP [.20]  V → includes [.05]  Det → the [.4]  Det → a [.4]  N → meal [.01]  N → flight [.02]
Chart so far: [0,1] Det .4 | [0,2] NP .0024 | [1,2] N .02 | [2,3] V .05 | [3,4] Det .4 | [4,5] N .01
The flight includes a meal.

BITS Pilani, Pilani Campus


Probabilistic CYK: syntactic step
Grammar: S → NP VP [.80]  NP → Det N [.30]  VP → V NP [.20]  V → includes [.05]  Det → the [.4]  Det → a [.4]  N → meal [.01]  N → flight [.02]
Chart so far: [0,1] Det .4 | [0,2] NP .0024 | [1,2] N .02 | [2,3] V .05 | [3,4] Det .4 | [4,5] N .01 | [3,5] NP .001
The flight includes a meal.

BITS Pilani, Pilani Campus


Probabilistic CYK: syntactic step
Grammar: S → NP VP [.80]  NP → Det N [.30]  VP → V NP [.20]  V → includes [.05]  Det → the [.4]  Det → a [.4]  N → meal [.01]  N → flight [.02]
Chart so far: [0,1] Det .4 | [0,2] NP .0024 | [1,2] N .02 | [2,3] V .05 | [3,4] Det .4 | [4,5] N .01 | [3,5] NP .001 | [2,5] VP .00001
The flight includes a meal.

BITS Pilani, Pilani Campus


Grammar: S → NP VP [.80]  NP → Det N [.30]  VP → V NP [.20]  V → includes [.05]  Det → the [.4]  Det → a [.4]  N → meal [.01]  N → flight [.02]
Chart so far: [0,1] Det .4 | [0,2] NP .0024 | [1,2] N .02 | [2,3] V .05 | [3,4] Det .4 | [4,5] N .01 | [3,5] NP .001 | [2,5] VP .00001 | [0,5] S .0000000192
The flight includes a meal.

BITS Pilani, Pilani Campus


Probabilistic CYK: summary
Cells in chart hold probabilities

Bottom-up procedure computes probability of a


parse incrementally.

To obtain parse trees, we traverse the table


“backwards” as before.
– Cells need to be augmented with backpointers.

71

BITS Pilani, Pilani Campus
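As a sketch of how a chart like the one above can be computed, here is a probabilistic extension of the CKY recogniser shown earlier, using the same example PCFG; each cell stores, for every non-terminal, the best probability and a backpointer to the split that produced it. With full precision the [3,5] NP cell is .0012 rather than the rounded .001 shown in the chart, so the exact top S probability comes out slightly higher than .0000000192.

# Probabilistic CKY with backpointers for the example PCFG of these slides.
LEX = {"the": [("Det", 0.4)], "a": [("Det", 0.4)], "flight": [("N", 0.02)],
       "meal": [("N", 0.01)], "includes": [("V", 0.05)]}
BIN = [("S", "NP", "VP", 0.80), ("NP", "Det", "N", 0.30), ("VP", "V", "NP", 0.20)]

def pcky(words):
    n = len(words)
    table = {}                                   # (i, j) -> {nt: (prob, backpointer)}
    for j in range(1, n + 1):
        table[(j - 1, j)] = {nt: (p, words[j - 1]) for nt, p in LEX.get(words[j - 1], [])}
        for i in range(j - 2, -1, -1):
            cell = table.setdefault((i, j), {})
            for k in range(i + 1, j):
                for lhs, b, c, p in BIN:
                    if b in table.get((i, k), {}) and c in table.get((k, j), {}):
                        prob = p * table[(i, k)][b][0] * table[(k, j)][c][0]
                        if prob > cell.get(lhs, (0.0, None))[0]:
                            cell[lhs] = (prob, (k, b, c))    # keep the best split
    return table

chart = pcky("the flight includes a meal".split())
print(chart[(0, 5)]["S"][0])   # ~2.3e-08 with full precision (chart above shows 1.92e-08 after rounding)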


Problems with PCFGs
• No Context
– (immediate prior context, speaker, …)
• No Lexicalization
– “VP NP NP” more likely if verb is “hand” or “tell”
– fail to capture lexical dependencies (n‐grams do)

72

BITS Pilani, Pilani Campus


Flaws I: Structural independence
Probability of a rule r expanding node n depends only on n.
Independent of other non-terminals

Example:
– P(NP  Pro) is independent of where the NP is in the
sentence
– but we know that NP → Pro is much more likely in
subject position
– Francis et al (1999) using the Switchboard corpus:
• 91% of subjects are pronouns;
• only 34% of objects are pronouns
73

BITS Pilani, Pilani Campus


Flaws II: lexical independence
vanilla PCFGs ignore lexical material
– e.g. P(VP  V NP PP) independent of the head of NP or
PP or lexical head V
Examples:
– prepositional phrase attachment preferences depend on
lexical items; cf:
• dump [sacks into a bin]
• dump [sacks] [into a bin] (preferred parse)
– coordination ambiguity:
• [dogs in houses] and [cats]
• [dogs] [in houses and cats]

74
BITS Pilani, Pilani Campus
Lexicalised PCFGs
Attempt to weaken the lexical independence
assumption.

Most common technique:


– mark each phrasal head (N,V, etc) with the lexical material
– this is based on the idea that the most crucial lexical dependencies are between
head and dependent
– E.g.: Charniak 1997, Collins 1999

75
BITS Pilani, Pilani Campus
Lexicalised PCFGs: Matt walks
Makes probabilities partly dependent on lexical content.
P(VP → VBD | VP) becomes:
P(VP → VBD | VP, h(VP) = walks)
NB: normally, we can't assume that all heads of a phrase of category C are equally probable.
[Tree: S(walks) → NP(Matt) VP(walks); NP(Matt) → NNP(Matt) → Matt; VP(walks) → VBD(walks) → walks]

BITS Pilani, Pilani Campus


Practical problems for lexicalised PCFGs
Data sparseness: we don’t necessarily see all heads of all
phrasal categories often enough in the training data

Flawed assumptions: lexical dependencies occur


elsewhere, not just between head and complement
• I got the easier problem of the two to solve
• of the two and to solve are very likely because of
the prehead modifier easier

BITS Pilani, Pilani Campus


References

• https://www.youtube.com/watch?v=Z6GsoBA-
09k&list=PLQiyVNMpDLKnZYBTUOlSI9mi9wAErFtFm&i
ndex=62
• https://lost-
contact.mit.edu/afs/cs.pitt.edu/projects/nltk/docs/tutorial/
pcfg/nochunks.html
• http://www.nltk.org/howto/grammar.html

BITS Pilani, Pilani Campus


Thank You… 

Q&A
Suggestions / Feedback

BITS Pilani, Pilani Campus


Thank You

BITS Pilani, Pilani Campus


Natural Language Processing
DSECL ZG565
Prof.Vijayalakshmi
BITS-Pilani
BITS Pilani
Pilani Campus
BITS Pilani
Pilani Campus

Session 7 - Dependency Parsing


Date – 11th October 2020
Time – 9 am to 11 am
These slides are prepared by the instructor, with grateful acknowledgement of Prof. Jurafsky and
Prof. Martin and many others who made their course materials freely available online.
Recap

• CKY parsing
• PCFG parsing
• Problems with PCFG
• Lexicalized PCFGS

3
BITS Pilani, Pilani Campus
Outline
 Motivation
 Two types of parsing
 Dependency parsing
 Phrase structure parsing
 Dependency structure and Dependency grammar
 Dependency Relation
 Universal Dependencies
 Method of Dependency Parsing
 Dynamic programming
 Graph algorithms
 Constraint satisfaction
 Deterministic Parsing
 Transition based dependency parsing
 Graph based dependency Parsing
 Evaluation 4
BITS Pilani, Pilani Campus
Interpreting Language is Hard!
I saw a girl with a telescope

● “Parsing” resolves structural ambiguity in a formal way

2
5
BITS Pilani, Pilani Campus
Two Types of Parsing
● Dependency: focuses on relations between words

I saw a girl with a telescope


● Phrase structure: focuses on identifying phrases and
their recursive structure
S
VP
PP
NP NP NP
PRP VBD DT NN IN DT NN
3
6
I saw a girl with a telescope BITS Pilani, Pilani Campus
Dependencies Also Resolve Ambiguity

I saw a girl with a telescope I saw a girl with a telescope

4
7
BITS Pilani, Pilani Campus
Dependency Grammar and
Dependency Structure

Dependency syntax postulates that syntactic


structure consists of lexical items linked by binary
asymmetric relations (“arrows”) called dependencies
submitted
nsubjpass auxpass prep
The arrows are Bills were by
commonly typed prep pobj
with the name of on Brownback
grammatical pobj nn appos
relations (subject, ports Senator Republican
prepositional cc conj prep
object, apposition, and immigration of
etc.) pobj
8 Kansas
BITS Pilani, Pilani Campus
Dependency Grammar and
Dependency Structure

The arrow connects a head (governor, superior, regent) with a dependent


(modifier, inferior, subordinate)

Usually, dependencies form a tree (connected, acyclic, single-head)


submitted
nsubjpass auxpass prep

Bills were by
prep pobj
on Brownback
pobj nn appos
ports Senator Republican
cc conj prep
and immigration of
pobj
9 Kansas
BITS Pilani, Pilani Campus
Relation between phrase structure and
dependency structure

• A dependency grammar has a notion of a head. Officially, CFGs don’t.


• But modern linguistic theory and all modern statistical parsers (Charniak,
Collins, Stanford, …) do, via hand-written phrasal “head rules”:
– The head of a Noun Phrase is a noun/number/adj/…
– The head of a Verb Phrase is a verb/modal/….
• The head rules can be used to extract a dependency parse from a CFG parse

• The closure of dependencies


give constituency from a
dependency tree
• But the dependents of a word
must be at the same level
(i.e., “flat”)

10
BITS Pilani, Pilani Campus
Dependency graph

11
BITS Pilani, Pilani Campus
Formal conditions on dependency graph

12
BITS Pilani, Pilani Campus
Universal dependencies

http://universaldependencies.org/
• Annotated treebanks in many languages
• Uniform annotation scheme across all languages:
• Universal POS tags
• Universal dependency relations

13
BITS Pilani, Pilani Campus
Dependency Relations

14
BITS Pilani, Pilani Campus
Example Dependency
Parse

15
BITS Pilani, Pilani Campus
Method of Dependency Parsing

16
BITS Pilani, Pilani Campus
Methods of Dependency Parsing

1. Dynamic programming (like in the CKY algorithm)


You can do it similarly to lexicalized PCFG parsing: an O(n^5) algorithm
Eisner (1996) gives a clever algorithm that reduces the complexity to O(n^3), by
producing parse items with heads at the ends rather than in the middle
2. Graph algorithms
You create a Maximum Spanning Tree for a sentence
McDonald et al.’s (2005) MSTParser scores dependencies independently using a
ML classifier (he uses MIRA, for online learning, but it could be MaxEnt)
3. Constraint Satisfaction
Edges are eliminated that don’t satisfy hard constraints. Karlsson (1990), etc.
4. Deterministic parsing
Greedy choice of attachments guided by machine learning classifiers
MaltParser (Nivre et al. 2008) – discussed in the next segment

17
BITS Pilani, Pilani Campus
Deterministic parsing

BITS Pilani, Pilani Campus


Deterministic parsing

19
BITS Pilani, Pilani Campus
Transition based systems for
Dependency parsing

 A transition system for dependency parsing is a quadruple s=(C,T, cs,


Ct), where
• C is a set of configurations
• T is a set of transitions, each a function t : C → C
• cs is an initialization function
• Ct is a set of terminal configurations.
 A transition sequence for a sentence x is a sequence of configurations
C0,m = (c0, c1, …, cm) such that
c0 = cs(x), cm ∈ Ct, and ci = t(ci-1) for some t ∈ T
• Initialization: ([ ]S, [w1, w2, …, wn]B, { })
• Termination: (S, [ ]B, A)
BITS Pilani, Pilani Campus
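The next slide shows the arc-eager transitions of the MaltParser family as a figure; below is a minimal sketch of such a transition system, where a configuration is a (stack, buffer, arcs) triple exactly as defined above. The three-word sentence, the hand-picked transition sequence and the labels are illustrative assumptions, and the oracle/classifier that picks each transition is omitted.

# Arc-eager style transitions over (stack, buffer, arcs) configurations.
def shift(stack, buffer, arcs):
    return stack + [buffer[0]], buffer[1:], arcs

def left_arc(stack, buffer, arcs, label):
    # add arc buffer-front -> stack-top, pop the stack (stack top receives a head)
    return stack[:-1], buffer, arcs | {(buffer[0], label, stack[-1])}

def right_arc(stack, buffer, arcs, label):
    # add arc stack-top -> buffer-front, push the buffer front onto the stack
    return stack + [buffer[0]], buffer[1:], arcs | {(stack[-1], label, buffer[0])}

def reduce_(stack, buffer, arcs):
    return stack[:-1], buffer, arcs

# Initialization ([ ]_S, [w1..wn]_B, {}) for the illustrative sentence "she saw it":
config = ([], [1, 2, 3], set())               # she = 1, saw = 2, it = 3
config = shift(*config)                       # stack [1], buffer [2, 3]
config = left_arc(*config, "nsubj")           # arc saw -nsubj-> she, pop "she"
config = shift(*config)                       # stack [2], buffer [3]
config = right_arc(*config, "dobj")           # arc saw -dobj-> it, push "it"
print(config)   # buffer empty, so terminal: arcs contain (2,'nsubj',1) and (2,'dobj',3)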
Arc eager parsing(Malt parser)

21
BITS Pilani, Pilani Campus
Example1

22
BITS Pilani, Pilani Campus
Example2

23
BITS Pilani, Pilani Campus
Creating an Oracle

24
BITS Pilani, Pilani Campus
How does the classifier learn?

25
BITS Pilani, Pilani Campus
Feature Models

Feature template:

26
BITS Pilani, Pilani Campus
Feature examples

27
BITS Pilani, Pilani Campus
Classifier at runtime

28
BITS Pilani, Pilani Campus
Training Data

29
BITS Pilani, Pilani Campus
Generating training data example

30
BITS Pilani, Pilani Campus
Standard Oracle for Arc Eager
parsing

31
BITS Pilani, Pilani Campus
Online learning with an oracle

32
BITS Pilani, Pilani Campus
Example

33
BITS Pilani, Pilani Campus
34
BITS Pilani, Pilani Campus
Graph-based
parsing

35
BITS Pilani, Pilani Campus
Graph concepts
refresher

36
BITS Pilani, Pilani Campus
Multi Digraph

37
BITS Pilani, Pilani Campus
Directed
Spanning Trees

38
BITS Pilani, Pilani Campus
Weighted Spanning tree

39
BITS Pilani, Pilani Campus
MST

40
BITS Pilani, Pilani Campus
Finding MST

41
BITS Pilani, Pilani Campus
Chu-Liu-Edmonds algorithm

42
BITS Pilani, Pilani Campus
Chu-Liu-Edmonds Algorithm (2/12)

• x = John saw Mary

[Figure: the complete weighted digraph Gx over {root, John, saw, Mary}; in the usual version of this example the arc scores include root→saw 10, root→John 9, root→Mary 9, saw→John 30, saw→Mary 30, John→saw 20, Mary→John 11, John→Mary 3, Mary→saw 0.]
BITS Pilani, Pilani Campus
Chu-Liu-Edmonds Example

BITS Pilani, Pilani Campus


Chu-Liu-Edmonds Example

BITS Pilani, Pilani Campus


Chu-Liu-Edmonds Example

BITS Pilani, Pilani Campus


Chu-Liu-Edmonds Example

BITS Pilani, Pilani Campus


Chu-Liu-Edmonds Example

BITS Pilani, Pilani Campus
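A sketch of the first step of Chu-Liu-Edmonds for a graph like the "John saw Mary" example above: every word greedily keeps its highest-scoring incoming arc. The arc scores below are illustrative (they follow the usual version of this example), and the cycle-contraction and re-scoring steps of the full algorithm are omitted.

# Greedy best-incoming-arc selection, the first step of Chu-Liu-Edmonds.
scores = {                      # scores[(head, dependent)] = arc score (illustrative)
    ("root", "saw"): 10, ("root", "John"): 9, ("root", "Mary"): 9,
    ("saw", "John"): 30, ("saw", "Mary"): 30,
    ("John", "saw"): 20, ("Mary", "saw"): 0, ("John", "Mary"): 3, ("Mary", "John"): 11,
}
words = ["John", "saw", "Mary"]

best_in = {}
for dep in words:
    head = max((h for h, d in scores if d == dep), key=lambda h: scores[(h, dep)])
    best_in[dep] = head

print(best_in)   # {'John': 'saw', 'saw': 'John', 'Mary': 'saw'}
# "John" and "saw" choose each other, forming a cycle; the full algorithm would
# contract this cycle, re-score its incoming/outgoing arcs, and recurse.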


MST Learning

BITS Pilani, Pilani Campus


Linear classifiers

50
BITS Pilani, Pilani Campus
Arch features

51
BITS Pilani, Pilani Campus
52
BITS Pilani, Pilani Campus
53
BITS Pilani, Pilani Campus
54
BITS Pilani, Pilani Campus
Learning the parameters

55
BITS Pilani, Pilani Campus
Inference based learning

56
BITS Pilani, Pilani Campus
57
BITS Pilani, Pilani Campus
EXAMPLE

58
BITS Pilani, Pilani Campus
Evaluation

Acc = # correct deps / # of deps

ROOT She saw the video lecture
 0    1   2   3    4      5

UAS (unlabeled attachment score) = 4 / 5 = 80%
LAS (labeled attachment score)  = 2 / 5 = 40%

Gold                        Parsed
1 2 She nsubj               1 2 She nsubj
2 0 saw root                2 0 saw root
3 5 the det                 3 4 the det
4 5 video nn                4 5 video nsubj
5 2 lecture dobj            5 2 lecture ccomp
BITS Pilani, Pilani Campus
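The UAS and LAS numbers above can be reproduced directly from the gold and parsed dependency lists: UAS counts correct heads, LAS counts correct head-label pairs.

# gold[i] / parsed[i] = (head, label) for word i, taken from the table above.
gold   = {1: (2, "nsubj"), 2: (0, "root"), 3: (5, "det"), 4: (5, "nn"),    5: (2, "dobj")}
parsed = {1: (2, "nsubj"), 2: (0, "root"), 3: (4, "det"), 4: (5, "nsubj"), 5: (2, "ccomp")}

uas = sum(parsed[i][0] == gold[i][0] for i in gold) / len(gold)   # correct heads
las = sum(parsed[i] == gold[i] for i in gold) / len(gold)         # correct heads and labels
print(f"UAS = {uas:.0%}, LAS = {las:.0%}")                        # UAS = 80%, LAS = 40%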
References

1.Speech and Language processing: An introduction to Natural


Language Processing, Computational Linguistics and speech
Recognition by Daniel Jurafsky and James H. Martin[3rd edition].
2. https://www.youtube.com/watch?v=PVShkZgXznc
3.https://www.youtube.com/watch?v=02QWRAhGc7g&list=PLJJz
I13YAXCHxbVgiFaSI88hj-mRSoMtI
4.https://www.researchgate.net/publication/328731166_Weight
ed_Machine_Learning

60
BITS Pilani, Pilani Campus
Any Questions?

61
BITS Pilani, Pilani Campus
Thank you

62
BITS Pilani, Pilani Campus
