It-3035 (NLP) - CS End May 2023
Q1. (a) Write a regular expression that matches a string that has the letter 'a' followed by
anything, ending in the letter 'b'.
Answer: a.*b$
(b) Illustrate with suitable examples how Lemmatization differs from Stemming in NLP?
Answer: Stemming is a process that removes the last few characters from a word, often
producing incorrect meanings and spellings. Stemming is used for large datasets where
performance is an issue. Lemmatization considers the context and converts the word to its
meaningful base form, called the lemma. Lemmatization is computationally expensive since
it involves look-up tables and other sophisticated tools. For instance, stemming the word
'Caring' would return 'Car' (which may cause incorrect interpretation of the word under
processing), whereas lemmatizing the same word would return 'Care'.
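The contrast above can be sketched in a few lines of Python; the suffix rules and the lemma lookup table below are toy assumptions for illustration, not a real stemmer or lemmatizer:

```python
# Toy stemmer: chops suffixes blindly, with no linguistic check.
def toy_stem(word):
    for suffix in ("ing", "ed", "s"):  # illustrative suffix list
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word

# Toy lemmatizer: consults a lookup table mapping word forms to lemmas.
LEMMA_TABLE = {"caring": "care", "better": "good", "studies": "study"}

def toy_lemmatize(word):
    return LEMMA_TABLE.get(word, word)

print(toy_stem("caring"))       # "car" -- meaning lost
print(toy_lemmatize("caring"))  # "care" -- meaningful base form
```

This also shows why lemmatization is more expensive: it needs a vocabulary (here a dictionary lookup), whereas stemming is pure string surgery.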
(f) Named Entity Recognition is a sub task of --------- that seeks to locate and classify named
entities in texts.
(i) Information Retrieval
(ii) Information Extraction
(iii) Text Classification
(iv) None of the above
Answer: (ii) Information Extraction
(i) Explain the “Feast or Famine” problem in connection with the task of Information
Retrieval.
Answer: The "Feast or Famine" problem occurs when Boolean queries are applied in the task
of Information Retrieval. Boolean queries often return either too few results (~0) or too
many (thousands or even millions). It usually takes considerable skill to formulate a query
that produces a manageable number of hits.
Q2. (a) Explain the benefits of eliminating stop words. Give examples for situations in which
elimination of stop words may be harmful.
Answer: Stop words are abundant in any human language. They often carry little meaningful
information, and since their frequencies are very high, removing them from the corpus yields
much smaller data and faster computation on text. By removing these words, we strip
low-level information from the text so that more focus falls on the important information. In
other words, for many tasks the removal of such words has no negative consequence on the
model we train.
Stop-word removal can, however, have adverse effects. Whether to remove stop words
depends strongly on the task being performed and the goal to be achieved. In some NLP
applications, such as Part-of-Speech (PoS) tagging, Named Entity Recognition (NER), and
parsing, stop words should not be removed; rather, they should be preserved, because they
carry grammatical information those applications need. Here are some examples:
Suppose we are training a model to perform sentiment analysis on movie reviews. If the
original review is "The movie was not good at all.", the text after stop-word removal would
be "movie good". The original review is clearly negative, but after the removal of stop words
it reads as positive, which misrepresents reality.
Consider another scenario, where we search for "To be or not to be" in a search application
that removes stop words. The search would completely miss the legendary work of
Shakespeare by that name.
Stop words can also help disambiguate a search query, allowing the user to get more
accurate results. If the stop word "without" is dropped from the query "notebook without
DVD drive", the results become irrelevant.
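A minimal sketch of stop-word removal, using a tiny hand-picked stop-word list (a real list would be much larger), reproduces the pitfalls above:

```python
# Illustrative stop-word list; production systems use lists of 100+ words.
STOP_WORDS = {"the", "was", "not", "at", "all", "to", "be", "or"}

def remove_stop_words(text):
    # Lowercase, strip trailing punctuation, drop stop words.
    words = [w.strip(".,!?") for w in text.lower().split()]
    return " ".join(w for w in words if w and w not in STOP_WORDS)

print(remove_stop_words("The movie was not good at all."))  # "movie good" -- negation lost
print(remove_stop_words("To be or not to be"))              # "" -- the whole query vanishes
```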
Evaluation Scheme: 1 Mark for benefits and 3 marks for harmfulness examples.
(b) List the problems associated with N-gram language model. Explain how these problems
are handled.
Answer: Problems associated with the N-gram language model:
(i) The N-gram model, like many statistical models, depends heavily on the training corpus.
(ii) The performance of the N-gram model varies with the value of N.
(iii) Since any corpus is finite, some perfectly acceptable English word sequences are bound
to be missing from it, giving zero probability to the corresponding N-grams. As a result, the
N-gram matrix for any training corpus contains a substantial number of putative
"zero-probability N-grams". This is handled by smoothing techniques such as add-one
(Laplace) or add-k smoothing, and by backoff and interpolation, which shave some
probability mass from seen N-grams and redistribute it to unseen ones.
(iv) A related problem is unknown, or out-of-vocabulary (OOV), words: words that do not
appear in the training set but appear in the test set. This is handled by introducing a special
token <UNK>, replacing rare training words with it, and estimating its probability like that of
any other word.
Evaluation Scheme: 2 Marks for problems and 2 marks for remedies to them.
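The zero-probability remedy can be illustrated with add-one (Laplace) smoothing; the tiny corpus below is a made-up assumption purely for demonstration:

```python
from collections import Counter

# Toy corpus (illustrative only).
corpus = "the cat sat on the mat the cat ran".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
V = len(unigrams)  # vocabulary size

def p_mle(w_prev, w):
    # Unsmoothed MLE: C(w_prev, w) / C(w_prev)
    return bigrams[(w_prev, w)] / unigrams[w_prev]

def p_laplace(w_prev, w):
    # Add-one smoothing: (C(w_prev, w) + 1) / (C(w_prev) + V)
    return (bigrams[(w_prev, w)] + 1) / (unigrams[w_prev] + V)

print(p_mle("cat", "jumped"))      # 0.0 -- unseen bigram gets zero probability
print(p_laplace("cat", "jumped"))  # 0.125 -- small but non-zero after smoothing
```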
Q3. (a) Explain Top Down Parsing and Bottom Up Parsing with example.
Answer:
Top-Down Parsing: A parsing strategy that starts at the highest level of the parse tree (the
start symbol) and works down the tree by applying grammar rules. Productions are used to
derive the string and check it against the input; derivation proceeds leftmost-first, and
parsing terminates when the derived string matches the input. For example, given the rules
S --> NP VP and NP --> Det Noun, a top-down parser begins at S, expands it to NP VP, then
expands NP, and so on until the words of the sentence are reached.
Bottom-Up Parsing: A parsing strategy that starts at the lowest level of the parse tree (the
words) and works up the tree by applying grammar rules. It can be defined as an attempt to
reduce the input string to the start symbol of the grammar. This technique corresponds to a
rightmost derivation in reverse. The main decision is when to apply a production rule to
reduce the string toward the start symbol. With the same rules, a bottom-up parser first
reduces "the" to Det and "dog" to Noun, then Det Noun to NP, and eventually the whole
sentence to S.
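Top-down parsing can be sketched as a recursive-descent recognizer; the toy grammar and sentence below are illustrative assumptions:

```python
# Toy grammar: nonterminals map to lists of productions; bare words are terminals.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["Det", "Noun"]],
    "VP": [["Verb"]],
    "Det": [["the"]],
    "Noun": [["dog"]],
    "Verb": [["sat"]],
}

def parse(symbol, tokens, pos):
    """Top-down: expand `symbol` starting at tokens[pos]; return new pos or None."""
    if symbol not in GRAMMAR:  # terminal symbol: must match the next token
        return pos + 1 if pos < len(tokens) and tokens[pos] == symbol else None
    for production in GRAMMAR[symbol]:  # try each rule, leftmost symbol first
        p = pos
        for sym in production:
            p = parse(sym, tokens, p)
            if p is None:
                break
        else:
            return p
    return None

tokens = "the dog sat".split()
print(parse("S", tokens, 0) == len(tokens))  # True: sentence accepted
```

A bottom-up (shift-reduce) parser would instead scan the tokens and repeatedly reduce matched right-hand sides, ending at S.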
Evaluation Scheme: 2 marks for Top Down, 2 marks for Bottom Up. In either of the above, 1
mark for explanation, 1 mark for a suitable example.
(b) Define Probabilistic Context-Free Grammar (PCFG). List down the disadvantages of PCFG.
Answer:
Definition of PCFG: A probabilistic context-free grammar G can be defined by a quintuple: G
= (M, T, R, S, P), where
• M is the set of non-terminal symbols
• T is the set of terminal symbols
• R is the set of production rules
• S is the start symbol
• P is the set of probabilities on production rules
Limitations of PCFG:
(i) PCFGs do not take lexical information into account, which makes parse-plausibility
judgments less than ideal and makes PCFGs worse than N-grams as a language model.
(ii) PCFGs have certain biases; e.g., a smaller tree receives higher probability than a larger
tree.
(iii) When two different analyses use the same set of rules, they have the same probability.
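Under a PCFG, the probability of a parse tree is the product of the probabilities of the rules it uses, which is what makes limitations (ii) and (iii) arise. The sketch below uses assumed rule probabilities:

```python
# Assumed rule probabilities (illustrative only); keys are (LHS, RHS-tuple).
RULE_PROB = {
    ("S", ("NP", "VP")): 1.0,
    ("NP", ("Det", "Noun")): 0.7,
    ("VP", ("Verb",)): 0.4,
    ("Det", ("the",)): 0.6,
    ("Noun", ("dog",)): 0.2,
    ("Verb", ("sat",)): 0.1,
}

def tree_prob(tree):
    """tree = (label, [children]) for non-terminals; a plain string is a word."""
    if isinstance(tree, str):
        return 1.0
    label, children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    p = RULE_PROB[(label, rhs)]
    for child in children:
        p *= tree_prob(child)  # product of all rule probabilities in the tree
    return p

t = ("S", [("NP", [("Det", ["the"]), ("Noun", ["dog"])]),
           ("VP", [("Verb", ["sat"])])])
print(tree_prob(t))  # 1.0 * 0.7 * 0.6 * 0.2 * 0.4 * 0.1 = 0.00336
```

Because every extra rule multiplies in a factor < 1, larger trees systematically score lower, which is exactly the bias in limitation (ii).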
Q4. (a) “Naive Bayes’ Classifier is a linear classifier in logarithmic space” --- Justify the above
statement with proper reasoning.
Answer: Naive Bayes is a probabilistic classifier: for a document d, out of all classes c ∈ C, it
returns the class ĉ that has the maximum posterior probability given the document. That is:

ĉ = argmax_{c ∈ C} P(c|d)

By Bayes' rule (the denominator P(d) is the same for every class), this reduces to:

ĉ = argmax_{c ∈ C} P(d|c) · P(c)

If the document d is represented by a set of features F = {f1, ..., fn}, then the above reduces
to:

ĉ = argmax_{c ∈ C} P(f1, ..., fn | c) · P(c)

Applying the bag-of-words assumption and the Naive Bayes (conditional independence)
assumption, the expression reduces to:

ĉ = argmax_{c ∈ C} P(c) ∏_{f ∈ F} P(f|c)

In a natural language processing task, the features are usually words:

c_NB = argmax_{c ∈ C} P(c) ∏_{i ∈ word positions} P(w_i | c)

Applying the logarithm, which is a monotonically increasing function and therefore preserves
the argmax, we have:

c_NB = argmax_{c ∈ C} [ log P(c) + Σ_{i ∈ word positions} log P(w_i | c) ]

This is a linear combination (a weighted sum) of log-probability terms, so the Naive Bayes
classifier is a linear classifier in logarithmic space.
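The log-space scoring can be sketched directly; the priors and word likelihoods below are made-up numbers for illustration:

```python
import math

# Assumed log-prior and log-likelihood tables (illustrative only).
LOG_PRIOR = {"pos": math.log(0.5), "neg": math.log(0.5)}
LOG_LIKE = {
    ("good", "pos"): math.log(0.09), ("good", "neg"): math.log(0.01),
    ("bad", "pos"): math.log(0.01),  ("bad", "neg"): math.log(0.09),
}

def nb_classify(words):
    # score(c) = log P(c) + sum_i log P(w_i | c): a linear sum in log space.
    scores = {c: LOG_PRIOR[c] + sum(LOG_LIKE[(w, c)] for w in words)
              for c in LOG_PRIOR}
    return max(scores, key=scores.get)

print(nb_classify(["good", "good", "bad"]))  # "pos"
```

Working in log space also avoids numerical underflow from multiplying many small probabilities.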
Q5. (a) Write a short note on rule-based POS tagging. Discuss the advantages and
disadvantages of the same in this context.
Answer:
Rule-based POS Tagging: One of the oldest tagging techniques is rule-based POS tagging.
Rule-based taggers use a dictionary or lexicon to obtain the possible tags for each word. If a
word has more than one possible tag, hand-written rules are used to identify the correct
one. Disambiguation is performed by analyzing the linguistic features of a word together
with its preceding and following words. For example, if the preceding word is an article, the
word in question must be a noun. As the name suggests, all such information in rule-based
POS tagging is coded in the form of rules. These rules may be either (i) context-pattern rules,
or (ii) regular expressions compiled into finite-state automata and intersected with the
lexically ambiguous sentence representation. Rule-based POS tagging follows a two-stage
architecture. In the first stage, a dictionary assigns each word a list of potential parts of
speech. In the second stage, large lists of hand-written disambiguation rules narrow the list
down to a single part of speech for each word.
Advantages: the rules are transparent and interpretable, they encode linguistic knowledge
directly, and no annotated training corpus is required.
Disadvantages: writing and maintaining the rules is labor-intensive and requires linguistic
expertise, the rules do not generalize well to unseen constructions, and porting the tagger to
a new language or domain means rewriting the rule set.
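The two-stage architecture can be sketched as follows; the lexicon and the single disambiguation rule are illustrative assumptions:

```python
# Stage-1 lexicon: each word maps to its possible tags (illustrative entries).
LEXICON = {
    "the": ["DET"],
    "can": ["NOUN", "VERB", "AUX"],  # lexically ambiguous word
    "rusted": ["VERB"],
}

def tag(tokens):
    tags = []
    for i, word in enumerate(tokens):
        candidates = LEXICON.get(word, ["NOUN"])  # stage 1: lexicon lookup
        if len(candidates) > 1:
            # Stage 2: hand-written rule -- a word right after a determiner
            # (article) is tagged as a noun, as in the example above.
            if i > 0 and tags[i - 1] == "DET" and "NOUN" in candidates:
                candidates = ["NOUN"]
        tags.append(candidates[0])
    return tags

print(tag("the can rusted".split()))  # ['DET', 'NOUN', 'VERB']
```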
Evaluation Scheme: 2 marks for Short Note, 1 mark for advantages, 1 mark for
disadvantages.
(b) Consider the following two matrices in connection with an HMM model for POS
tagging:
Transition probabilities (row = previous tag, column = next tag):

        AB      PN      PP      VB      EOS
AB      1/11    1/10    1/12    1/11    1/25
PN      1/11    1/11    1/11    1/10    1/14
PP      1/11    1/12    1/12    1/10    1/16
VB      1/13    1/11    1/12    1/14    1/18
EOS     1/11    1/10    1/10    1/13    1/15

Emission probabilities:

        she     got     up
AB      1/25    1/25    1/14
PN      1/13    1/25    1/25
PP      1/25    1/25    1/13
VB      1/25    1/14    1/19
BOS: Beginning of sentence; EOS: End of Sentence
Find the best POS tag sequence for the sentence: “she got up” following Viterbi
algorithm.
Evaluation Scheme: The POS tag BOS is missing in the table given. So, any honest
attempt to this question should be awarded full credit (4 marks).
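Although the missing BOS row prevents solving the question as stated, the Viterbi algorithm itself can be sketched on a tiny two-tag HMM; all probabilities below are made-up assumptions purely to illustrate the procedure:

```python
# Assumed tiny HMM (not the one in the question, whose BOS row is missing).
TAGS = ["PN", "VB"]
START = {"PN": 0.7, "VB": 0.3}  # assumed P(tag | BOS)
TRANS = {("PN", "PN"): 0.1, ("PN", "VB"): 0.9,
         ("VB", "PN"): 0.4, ("VB", "VB"): 0.6}
EMIT = {("PN", "she"): 0.5, ("PN", "got"): 0.01,
        ("VB", "she"): 0.01, ("VB", "got"): 0.5}

def viterbi(words):
    # v[t][tag] = probability of the best path ending in `tag` at time t.
    v = [{t: START[t] * EMIT[(t, words[0])] for t in TAGS}]
    back = [{}]
    for w in words[1:]:
        col, bp = {}, {}
        for t in TAGS:
            prev = max(TAGS, key=lambda p: v[-1][p] * TRANS[(p, t)])
            col[t] = v[-1][prev] * TRANS[(prev, t)] * EMIT[(t, w)]
            bp[t] = prev
        v.append(col)
        back.append(bp)
    best = max(TAGS, key=lambda t: v[-1][t])  # best final tag
    path = [best]
    for bp in reversed(back[1:]):             # follow back-pointers
        path.append(bp[path[-1]])
    return list(reversed(path))

print(viterbi(["she", "got"]))  # ['PN', 'VB']
```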
Q6. (a) Consider the following tiny phrase structure treebank, consisting of just two
trees. Read off all rules whose left hand sides are either NP or VP and estimate their
rule probabilities using maximum likelihood estimation.
(b) Consider a corpus of English containing approximately 560 million tokens. In this
corpus we have the counts of unigrams and bigrams as per the table below. Estimate
Prob(snow) and Prob(snow | white) using maximum likelihood estimation without
smoothing.
n-gram          count
snow            38,186
white           256,091
white snow      122
purple          11,218
purple snow     0
Answer:
P(snow) = 38,186 / (560 × 10^6) = 6.8189 × 10^-5
P(snow | white) = 122 / 256,091 = 4.7639 × 10^-4
Evaluation Scheme: 2 marks for P(snow) and 2 marks for P(snow|white).
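The two estimates can be checked directly from the counts in the table:

```python
# Counts taken from the question; no smoothing is applied.
count_snow = 38_186
count_white = 256_091
count_white_snow = 122
total_tokens = 560 * 10**6

p_snow = count_snow / total_tokens                    # C(snow) / N
p_snow_given_white = count_white_snow / count_white   # C(white snow) / C(white)

print(f"{p_snow:.4e}")              # 6.8189e-05
print(f"{p_snow_given_white:.4e}")  # 4.7639e-04
```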
Q7. (a) Assuming the grammar below construct the parse tree for the sentence: “the
big yellow dog sat under the house”
S --> NP VP ; VP --> VP PP ; VP --> Verb NP ;
VP --> Verb ; NP --> Det NOM; NOM --> Adj NOM ;
NOM --> Noun ; PP --> Prep NP ; Det --> the ;
Adj --> big ; Adj --> yellow ; Noun --> dog ;
Noun --> house ; Verb --> sat ; Prep --> under
Answer: The following is a parse tree for the given sentence, in bracketed form:
[S [NP [Det the] [NOM [Adj big] [NOM [Adj yellow] [NOM [Noun dog]]]]]
   [VP [VP [Verb sat]] [PP [Prep under] [NP [Det the] [NOM [Noun house]]]]]]
(b) Find the Minimum Edit Distance between the words “TEACHER” and
“STUDENT” following a dynamic programming based algorithm.
Answer: Following the Minimum Edit Distance (Levenshtein) dynamic-programming
algorithm with unit costs for insertion, deletion, and substitution, the minimum edit distance
between "TEACHER" and "STUDENT" is 7. (If substitution is charged 2, as some textbooks do,
the distance is 10.)
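The dynamic-programming table can be computed with a short sketch; unit costs are assumed here (a substitution cost of 2, used by some textbooks, would give a larger value):

```python
def min_edit_distance(s, t):
    m, n = len(s), len(t)
    # d[i][j] = edit distance between s[:i] and t[:j]
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                      # i deletions
    for j in range(n + 1):
        d[0][j] = j                      # j insertions
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # delete
                          d[i][j - 1] + 1,        # insert
                          d[i - 1][j - 1] + sub)  # substitute / match
    return d[m][n]

print(min_edit_distance("TEACHER", "STUDENT"))  # 7
```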
Q8. (a) Consider the following CFG and convert the same into equivalent CFG in
Chomsky Normal form:
S --> NP VP ; S --> VP ; NP --> NP PP;
NP --> PropNoun ; VP --> Verb ; VP --> Verb NP ;
VP --> VP PP ; PP --> Prep NP ; PropNoun --> DALLAS ;
PropNoun --> ALICE ; PropNoun --> BOB ;
PropNoun --> AUSTIN ; Verb --> ADORE ;
Verb --> SEE ; Prep --> IN ; Prep --> WITH
Answer: Removing the unit productions S --> VP, NP --> PropNoun, and VP --> Verb (and
propagating their expansions) yields the following equivalent grammar in Chomsky Normal
Form:
S --> NP VP ; S --> Verb NP ; S --> VP PP ; S --> ADORE ; S --> SEE ;
NP --> NP PP ; NP --> DALLAS ; NP --> ALICE ; NP --> BOB ; NP --> AUSTIN ;
VP --> Verb NP ; VP --> VP PP ; VP --> ADORE ; VP --> SEE ;
PP --> Prep NP ; Verb --> ADORE ; Verb --> SEE ; Prep --> IN ; Prep --> WITH
(b) Using the normalized grammar derived above, find the CKY parsing chart for the
sentence: “SEE BOB IN AUSTIN”
Answer: The CKY parsing chart for the sentence (cell [i, j] covers words i through j) is as
follows:
[1,1] SEE: Verb, VP, S    [1,2]: S, VP    [1,3]: —    [1,4]: S, VP
[2,2] BOB: NP             [2,3]: —        [2,4]: NP
[3,3] IN: Prep            [3,4]: PP
[4,4] AUSTIN: NP
Since S appears in cell [1,4], the sentence is accepted by the grammar.
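A CKY recognizer over a CNF version of the grammar in 8(a) can be sketched as follows; the rule tables below encode one unit-production elimination of that grammar (e.g. S and VP directly yield ADORE/SEE), so treat them as an assumption:

```python
# Binary rules: (B, C) -> set of A such that A -> B C.
BINARY = {("NP", "VP"): {"S"}, ("Verb", "NP"): {"S", "VP"},
          ("VP", "PP"): {"S", "VP"}, ("NP", "PP"): {"NP"},
          ("Prep", "NP"): {"PP"}}
# Lexical rules: word -> set of A such that A -> word.
LEXICAL = {"DALLAS": {"NP"}, "ALICE": {"NP"}, "BOB": {"NP"}, "AUSTIN": {"NP"},
           "ADORE": {"Verb", "VP", "S"}, "SEE": {"Verb", "VP", "S"},
           "IN": {"Prep"}, "WITH": {"Prep"}}

def cky(words):
    n = len(words)
    # chart[i][j] = set of non-terminals spanning words[i:j] (0-based, exclusive).
    chart = [[set() for _ in range(n + 1)] for _ in range(n)]
    for i, w in enumerate(words):
        chart[i][i + 1] = set(LEXICAL[w])
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):           # split point
                for B in chart[i][k]:
                    for C in chart[k][j]:
                        chart[i][j] |= BINARY.get((B, C), set())
    return chart

chart = cky("SEE BOB IN AUSTIN".split())
print("S" in chart[0][4])  # True: the sentence is grammatical
```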