
HEMWATI NANDAN BAHUGUNA GARHWAL UNIVERSITY

(A Central University), Srinagar Garhwal, Uttarakhand


School of Engineering and Technology
Department of Computer Science and Technology

Session 2020 - 2021

PRACTICAL FILE
FOR

NATURAL LANGUAGE PROCESSING

Submitted to:
Ms Kanchan Naithani
Department of Computer Science and Engineering

Submitted by:
Bhawini Raj
B.Tech - VIII Semester
Roll No - 17134501001

CONTENTS

Experiment No.    Experiment Name

1     Introduction to Natural Language Processing
2     Introduction to Grammars, Parsing and PoS Tags
3     Introduction to NLTK
4     Write a Python program to remove "stopwords" from a given text and generate word tokens and filtered text.
5     Write a Python program to generate "tokens" and assign "PoS tags" for a given text using the NLTK package.
6     Write a Python program to generate a "wordcloud" with maximum words used = 100, in different shapes, and save it as a .png file for a given text file.
7     Perform an experiment to learn about morphological features of a word by analyzing it.
8     Perform an experiment to generate word forms from root and suffix information.
9     Perform an experiment to understand the morphology of a word by the use of an Add-Delete table.
10    Perform an experiment to learn to calculate bigrams from a given corpus and calculate the probability of a sentence.
11    Perform an experiment to learn how to apply add-one smoothing on a sparse bigram table.
12    Perform an experiment to calculate the emission and transition matrices which will be helpful for tagging Parts of Speech using a Hidden Markov Model.
13    Perform an experiment to know the importance of context and size of training corpus in learning Parts of Speech.
14    Perform an experiment to understand the concept of chunking and get familiar with the basic chunk tagset.
15

EXPERIMENT NO 1
INTRODUCTION TO NATURAL LANGUAGE PROCESSING

1. What is NLP?

NLP is an interdisciplinary field concerned with the interactions between computers and natural
human languages (e.g., English) — speech or text. NLP-powered software helps us in our daily lives
in various ways, for example:

● Personal assistants: Siri, Cortana, and Google Assistant.


● Auto-complete: In search engines (e.g., Google, Bing).
● Spell checking: Almost everywhere, in your browser, your IDE (e.g., Visual Studio),
desktop apps (e.g., Microsoft Word).
● Machine Translation: Google Translate.

Figure 1.1 - Natural language processing

NLP is divided into two fields: Linguistics and Computer Science.


The linguistics side is concerned with language itself: its formation, syntax, meaning, the different kinds of phrases (noun or verb), and so on.

The Computer Science side is concerned with applying linguistic knowledge, by transforming it
into computer programs with the help of sub-fields such as Artificial Intelligence (Machine
Learning & Deep Learning).

2. How does Natural Language Processing Work?


NLP enables computers to understand natural language as humans do. Whether the language
is spoken or written, natural language processing uses artificial intelligence to take real-world
input, process it, and make sense of it in a way a computer can understand. Just as humans
have different sensors -- such as ears to hear and eyes to see -- computers have programs to
read and microphones to collect audio. And just as humans have a brain to process that input,
computers have a program to process their respective inputs. At some point in processing, the
input is converted to code that the computer can understand.

There are two main phases to natural language processing: data preprocessing and algorithm
development.

Data preprocessing involves preparing and "cleaning" text data so that machines are able to
analyze it. Preprocessing puts data in a workable form and highlights features in the text that
an algorithm can work with; a short NLTK sketch illustrating these steps follows the list below.
There are several ways this can be done, including:

● Tokenization. This is when text is broken down into smaller units to work with.
● Stop word removal. This is when common words are removed from text so unique
words that offer the most information about the text remain.
● Lemmatization and stemming. This is when words are reduced to their root forms to
process.
● Part-of-speech tagging. This is when words are marked based on the part of speech they
are -- such as nouns, verbs and adjectives.
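
The following is a minimal NLTK sketch of these four preprocessing steps. The sample sentence is made up for illustration, and it assumes the relevant NLTK data packages (punkt, stopwords, wordnet, averaged_perceptron_tagger) have already been downloaded.

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

text = "The dogs are barking loudly in the park"

# 1. Tokenization: break the text into word tokens.
tokens = nltk.word_tokenize(text)

# 2. Stop word removal: drop common words that carry little information.
stop_words = set(stopwords.words("english"))
filtered = [w for w in tokens if w.lower() not in stop_words]

# 3. Stemming and lemmatization: reduce words to their root forms.
stems = [PorterStemmer().stem(w) for w in filtered]
lemmas = [WordNetLemmatizer().lemmatize(w) for w in filtered]

# 4. Part-of-speech tagging: mark each token as a noun, verb, adjective, etc.
tags = nltk.pos_tag(filtered)

print(filtered)
print(stems)
print(lemmas)
print(tags)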

Once the data has been preprocessed, an algorithm is developed to process it. There are many
different natural language processing algorithms, but two main types are commonly used:

● Rules-based system. This system uses carefully designed linguistic rules. This approach
was used early on in the development of natural language processing, and is still used.
● Machine learning-based system. Machine learning algorithms use statistical methods.
They learn to perform tasks based on training data they are fed, and adjust their methods
as more data is processed. Using a combination of machine learning, deep learning and
neural networks, natural language processing algorithms hone their own rules through
repeated processing and learning.

Figure 1.2 - Steps of natural language processing

3. Phases of NLP

There are the following five phases of NLP:

Figure 1.3 - Phases of NLP

1. Lexical and Morphological Analysis

The first phase of NLP is lexical analysis. This phase scans the input text as a stream of
characters and converts it into meaningful lexemes. It divides the whole text into
paragraphs, sentences, and words.

2. Syntactic Analysis (Parsing)

Syntactic analysis is used to check grammar and word arrangement, and to show the relationships among the words.

Example: Agra goes to the Poonam

In the real world, "Agra goes to the Poonam" does not make any sense, so this sentence is rejected by the syntactic analyzer.

3. Semantic Analysis

Semantic analysis is concerned with the meaning representation. It mainly focuses on the
literal meaning of words, phrases, and sentences.

4. Discourse Integration

Discourse integration depends upon the sentences that precede it and also invokes the meaning of the sentences that follow it.

5. Pragmatic Analysis

Pragmatic analysis is the fifth and last phase of NLP. It helps you to discover the intended effect by
applying a set of rules that characterize cooperative dialogues.

For Example: "Open the door" is interpreted as a request instead of an order.

4. Why is Natural Language Processing Important?

Businesses use massive quantities of unstructured, text-heavy data and need a way to
efficiently process it. A lot of the information created online and stored in databases is
natural human language, and until recently, businesses could not effectively analyze this data.

This is where natural language processing is useful.

The advantage of natural language processing can be seen when considering the following
two statements: "Cloud computing insurance should be part of every service-level
agreement," and, "A good SLA ensures an easier night's sleep -- even in the cloud." If a user
relies on natural language processing for search, the program will recognize that cloud
computing is an entity, that cloud is an abbreviated form of cloud computing and that SLA is
an industry acronym for service-level agreement.

Figure 1.4 - Elements of natural language processing: some of the key areas in which a business can use NLP.

These are the types of vague elements that frequently appear in human language and that
machine learning algorithms have historically been bad at interpreting. Now, with
improvements in deep learning and machine learning methods, algorithms can effectively
interpret them. These improvements expand the breadth and depth of data that can be
analyzed.

5. Techniques and Methods of Natural Language Processing.

Syntax and semantic analysis are two main techniques used with natural language processing.

Semantics involves the use of and meaning behind words. Natural language processing applies
algorithms to understand the meaning and structure of sentences.

➔ Parsing. What is parsing? According to the dictionary, to parse is to “resolve a sentence into its component parts and describe their syntactic roles.”

That definition captures the idea, but it could be a little more comprehensive. Parsing refers to the
formal analysis of a sentence by a computer into its constituents, which results in a parse tree
showing their syntactic relation to one another in visual form, which can be used for further
processing and understanding.

Figure 1.5 - Parse tree for the sentence "The thief robbed the apartment." Included is a
description of the three different information types conveyed by the sentence.

The letters directly above the single words show the parts of speech for each word (noun,
verb and determiner). One level higher is some hierarchical grouping of words into phrases.
For example, "the thief" is a noun phrase, "robbed the apartment" is a verb phrase and when
put together the two phrases form a sentence, which is marked one level higher.

But what is actually meant by a noun or verb phrase? Noun phrases are one or more words
that contain a noun and maybe some descriptors, verbs or adverbs. The idea is to group
nouns with words that are in relation to them.

A parse tree also provides us with information about the grammatical relationships of the
words due to the structure of their representation. For example, we can see in the structure
that "the thief" is the subject of "robbed."

By structure, I mean that we have the verb ("robbed"), which is marked with a "V" above it
and a "VP" above that, which is linked with an "S" to the subject ("the thief"), which has an
"NP" above it. This is like a template for a subject-verb relationship and there are many
others for other types of relationships.

➔ Word segmentation. This is the act of taking a string of text and deriving word forms from
it. Example: A person scans a handwritten document into a computer. The algorithm would
be able to analyze the page and recognize that the words are divided by white spaces.
➔ Sentence breaking. This places sentence boundaries in large texts. Example: A natural
language processing algorithm is fed the text, "The dog barked. I woke up." The algorithm
can recognize the period that splits up the sentences using sentence breaking.
➔ Morphological segmentation. This divides words into smaller parts called morphemes.
Example: The word untestably would be broken into [[un[[test]able]]ly], where the algorithm
recognizes "un," "test," "able" and "ly" as morphemes. This is especially useful in machine
translation and speech recognition.
➔ Named Entity Recognition (NER). This technique is one of the most popular and
advantageous techniques in semantic analysis; semantics is something conveyed by the text.
Under this technique, the algorithm takes a phrase or paragraph as input and identifies all the
nouns or names present in that input.
➔ Tokenization. Tokenization is basically the splitting of the whole text into a list of tokens;
the tokens can be words, sentences, characters, numbers, punctuation, etc. Tokenization has
two main advantages: one is that it reduces search to a significant degree, and the second is
that it makes effective use of storage space.

➔ Stemming and Lemmatization. The amount of data and information on the web has been at an
all-time high for the past couple of years. This huge volume of data and information demands the
necessary tools and techniques to extract inferences from it with ease.

“Stemming is the process of reducing inflected (or sometimes derived) words to their word
stem, base or root form - generally a written form of the word.” In practice, stemming simply
cuts off suffixes: after applying stemming, the word “playing” becomes “play”, and “asked”
becomes “ask”.

Figure 1.6 Stemming and Lemmatization

Lemmatization usually refers to doing things properly, with the use of a vocabulary and
morphological analysis of words, normally aiming to remove inflectional endings only and to
return the base or dictionary form of a word, which is known as the lemma. In simple words,
lemmatization works with the lemma of a word: it reduces the word form after
understanding the part of speech (POS) or context of the word in the document.
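
A short NLTK comparison of the two (the word list is illustrative, and the wordnet corpus is assumed to be downloaded):

from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ["playing", "asked", "studies", "better"]:
    # Stemming crudely chops off suffixes; lemmatization looks the word up,
    # here using a verb POS hint ("v") so that "asked" maps to "ask".
    print(word, "->", stemmer.stem(word), "/", lemmatizer.lemmatize(word, pos="v"))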

➔ Bag of Words. Bag of words technique is used to pre-process text and to extract all the
features from a text document to use in Machine Learning modeling. It is also a
representation of any text that elaborates/explains the occurrence of the words within a
corpus (document). It is also called “Bag” due to its mechanism, i.e. it is only concerned with
whether known words occur in the document, not the location of the words.

Let’s take an example to understand bag-of-words in more detail. Below, we take two text documents:

“Neha was angry on Sunil and he was angry on Ramesh.”


“Neha love animals.”

Above you see two documents; we treat each document as a different entity and make a list of
all the words present in both documents, excluding punctuation:

“Neha”, “was”, “angry”, “on”, “Sunil”, “and”, “he”, “Ramesh”, “love”, “animals”

Then we turn these documents into vectors (converting text into numbers is called
vectorization in ML) for further modelling.

Against that vocabulary, “Neha was angry on Sunil and he was angry on Ramesh” has the vector
form [1,1,1,1,1,1,1,1,0,0], and “Neha love animals” has the vector form [1,0,0,0,0,0,0,0,1,1].
So, the bag-of-words technique is mainly used for feature generation from text data.
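
A tiny, self-contained Python sketch of this binary bag-of-words idea, using the vocabulary and documents from the example above:

import string

docs = ["Neha was angry on Sunil and he was angry on Ramesh.",
        "Neha love animals."]

# Build the vocabulary: every distinct word across both documents, minus punctuation.
tokenize = lambda d: d.translate(str.maketrans("", "", string.punctuation)).split()
vocab = []
for d in docs:
    for w in tokenize(d):
        if w not in vocab:
            vocab.append(w)

# Binary bag-of-words vector: 1 if the vocabulary word occurs in the document, else 0.
vectors = [[1 if w in tokenize(d) else 0 for w in vocab] for d in docs]
print(vocab)
print(vectors)   # [[1,1,1,1,1,1,1,1,0,0], [1,0,0,0,0,0,0,0,1,1]]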

➔ Natural Language Generation. Natural language generation (NLG) is a technique that converts
raw structured data into plain English (or any other language). We also call it
data storytelling. This technique is very helpful in many organizations where a large amount
of data is used, it converts structured data into natural languages for a better understanding of
patterns or detailed insights into any business.

There are many stages of any NLG;

1. Content Determination: Deciding what are the main content to be represented in text
or information provided in the text.
2. Document Clustering: Deciding the overall structure of the information to convey.
3. Aggregation: Merging of sentences to improve sentence understanding and
readability.
4. Lexical Choice: Putting appropriate words to convey the meaning of the sentence
more clearly.
5. Referring Expression Generation: Creating references to identify main objects and
regions of the text properly.
6. Realization: Creating and optimizing text that should follow all the norms of
grammar (like syntax, morphology, orthography).

➔ Sentiment Analysis It is one of the most common natural language processing techniques.
With sentiment analysis, we can understand the emotion/feeling of the written text.
Sentiment analysis is also known as Emotion AI or Opinion Mining.

The basic task of sentiment analysis is to find whether the opinions expressed in any document,
sentence, text, social media post, or film review are positive, negative, or neutral; this is also
called finding the polarity of the text.

Figure 1.7 - Analysing sentiments

For example, Twitter is filled with sentiments: users address their reactions or express their
opinions on every topic wherever possible. To access tweets of users in a real-time scenario,
there is a powerful Python library called "Tweepy".

➔ Sentence Segmentation The most fundamental task of this technique is to divide all text
into meaningful sentences or phrases. This task involves identifying sentence boundaries
between words in text documents. Almost all languages have punctuation marks that appear at
sentence boundaries, so sentence segmentation is also referred to as sentence boundary
detection, sentence boundary disambiguation or sentence boundary recognition.

There are many libraries available for sentence segmentation, such as NLTK, spaCy, and Stanford
CoreNLP, that provide specific functions to do the task; a short NLTK example follows.
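
A minimal NLTK example of sentence segmentation (it assumes the punkt sentence tokenizer has been downloaded):

from nltk.tokenize import sent_tokenize

text = "The dog barked. I woke up."
print(sent_tokenize(text))   # ['The dog barked.', 'I woke up.']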

Three tools used commonly for natural language processing include Natural Language
Toolkit (NLTK), Gensim and Intel natural language processing Architect. NLTK is an open
source Python module with data sets and tutorials. Gensim is a Python library for topic
modeling and document indexing. Intel NLP Architect is another Python library for deep
learning topologies and techniques.

6. What is Natural Language Processing Used for?

Some of the main functions that natural language processing algorithms perform are:

● Text classification. This involves assigning tags to texts to put them in categories. This
can be useful for sentiment analysis, which helps the natural language processing
algorithm determine the sentiment, or emotion behind a text. For example, when brand A
is mentioned in X number of texts, the algorithm can determine how many of those
mentions were positive and how many were negative. It can also be useful for intent
detection, which helps predict what the speaker or writer may do based on the text they
are producing.
● Text extraction. This involves automatically summarizing text and finding important
pieces of data. One example of this is keyword extraction, which pulls the most
important words from the text, which can be useful for search engine optimization. Doing
this with natural language processing requires some programming -- it is not completely
automated. However, there are plenty of simple keyword extraction tools that automate
most of the process -- the user just has to set parameters within the program. For
example, a tool might pull out the most frequently used words in the text. Another
example is named entity recognition, which extracts the names of people, places and
other entities from text.
● Machine translation. This is the process by which a computer translates text from one
language, such as English, to another language, such as French, without human
intervention.
● Natural language generation. This involves using natural language processing
algorithms to analyze unstructured data and automatically produce content based on that
data. One example of this is in language models such as GPT-3, which are able to analyze
unstructured text and then generate believable articles based on it.

7. Benefits of Natural language Processing

The main benefit of NLP is that it improves the way humans and computers communicate
with each other. The most direct way to manipulate a computer is through code -- the
computer's language. By enabling computers to understand human language, interacting with

computers becomes much more intuitive for humans.

Other benefits include:


● improved accuracy and efficiency of documentation;
● ability to automatically make a readable summary of a larger, more complex
original text;
● useful for personal assistants such as Alexa, by enabling them to understand the spoken
word;
● enables an organization to use chatbots for customer support;
● easier to perform sentiment analysis; and
● provides advanced insights from analytics that were previously unreachable due
to data volume.

8. Challenges of Natural language Processing

There are a number of challenges of natural language processing and most of them boil down to the
fact that natural language is ever-evolving and always somewhat ambiguous. They include:

● Precision. Computers traditionally require humans to "speak" to them in a programming


language that is precise, unambiguous and highly structured -- or through a limited
number of clearly enunciated voice commands. Human speech, however, is not always
precise; it is often ambiguous and the linguistic structure can depend on many complex
variables, including slang, regional dialects and social context.
● Tone of voice and inflection. Natural language processing has not yet been perfected.
For example, semantic analysis can still be a challenge. Other difficulties include the fact
that the abstract use of language is typically tricky for programs to understand. For
instance, natural language processing does not pick up sarcasm easily. These topics
usually require understanding the words being used and their context in a conversation.
As another example, a sentence can change meaning depending on which word or
syllable the speaker puts stress on. NLP algorithms may miss the subtle, but important,
tone changes in a person's voice when performing speech recognition. The tone and
inflection of speech may also vary between different accents, which can be challenging
for an algorithm to parse.

● Evolving use of language. Natural language processing is also challenged by the fact
that language -- and the way people use it -- is continually changing. Although there are
rules to language, none are written in stone, and they are subject to change over time.
Hard computational rules that work now may become obsolete as the characteristics of
real-world language change over time.

9. The Evolution of Natural Language Processing

NLP draws from a variety of disciplines, including computer science and computational linguistics
developments dating back to the mid-20th century. Its evolution included the following major
milestones:
● 1950s. Natural language processing has its roots in this decade, when Alan Turing
developed the Turing Test to determine whether or not a computer is truly intelligent. The
test involves automated interpretation and generation of natural language as a criterion
of intelligence.
● 1950s-1990s. NLP was largely rules-based, using handcrafted rules developed by
linguists to determine how computers would process language.
● 1990s. The top-down, language-first approach to natural language processing was
replaced with a more statistical approach, because advancements in computing made this
a more efficient way of developing NLP technology. Computers were becoming faster
and could be used to develop rules based on linguistic statistics without a linguist
creating all of the rules. Data-driven natural language processing became mainstream
during this decade. Natural language processing shifted from a linguist-based approach to
an engineer-based approach, drawing on a wider variety of scientific disciplines instead
of delving into linguistics.
● 2000-2020s. Natural language processing saw dramatic growth in popularity as a term.
With advances in computing power, natural language processing has also gained
numerous real-world applications. Today, approaches to NLP involve a combination of
classical linguistics and statistical methods.

Natural language processing plays a vital part in technology and the way humans interact with it. It is used
in many real-world applications in both the business and consumer spheres, including chatbots,
cybersecurity, search engines and big data analytics. Though not without its challenges, NLP is expected to
continue to be an important part of both industry and everyday life.

EXPERIMENT NO 2
INTRODUCTION TO GRAMMARS, PARSERS AND POS TAGS

1. What is Grammar?
Grammar is defined as the rules for forming well-structured sentences.
Grammar plays an essential role in describing the syntactic structure of well-formed programs. In simple
words, grammar denotes the syntactical rules that are used for conversation in natural languages.

For Example, in the ‘C’ programming language, the precise grammar rules state how functions are made
with the help of lists and statements.

Mathematically, a grammar G can be written as a 4-tuple (N, T, S, P) where,

N or VN = set of non-terminal symbols, or variables.
T or ∑ = set of terminal symbols.
S = Start symbol, where S ∈ N.
P = Production rules for terminals as well as non-terminals. Each rule has the form α → β, where α and β are strings on VN ∪ ∑ and at least one symbol of α belongs to VN.

2. Types of Grammar
A. Context Free Grammar - A context-free grammar, represented in short as CFG, is a
notation used for describing languages; it is a superset of regular grammar, as you can
see from the following diagram:

Figure 2.1 - CFG, a superset of regular grammar

A CFG consists of a finite set of grammar rules with the following four components:

● Set of Non-Terminals
● Set of Terminals
● Set of Productions
● Start Symbol

Set of Non-terminals

It is represented by V. The non-terminals are syntactic variables that denote sets of
strings, which help in defining the language that is generated by the grammar.

Set of Terminals

Terminals are also known as tokens and are represented by Σ. Strings are formed from the
basic symbols of the terminals.

Set of Productions

It is represented by P. The set gives an idea about how the terminals and nonterminals
can be combined. Every production consists of the following components:
● Non-terminals,
● Arrow,
● Terminals (the sequence of terminals).
The left side of a production is a non-terminal, while the right side is a string of
terminals and/or non-terminals.

Start Symbol

Production begins from the start symbol, which is represented by the symbol S. A non-terminal symbol is always designated as the start symbol.
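
A small illustration of these four components using NLTK's CFG class; the toy grammar and sentence below are made up purely for this sketch.

import nltk

# S is the start symbol; S, NP, VP, DT, NN, VB are non-terminals;
# the quoted words are terminals; each line below is a production rule.
grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> DT NN
VP -> VB NP
DT -> 'the'
NN -> 'thief' | 'apartment'
VB -> 'robbed'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("the thief robbed the apartment".split()):
    print(tree)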

B. Constituency Grammar

It is also known as Phrase structure grammar. It is called constituency Grammar as it


is based on the constituency relation. It is the opposite of dependency grammar.

Before deep dive into the discussion of CG, let’s see some fundamental points about
constituency grammar and constituency relation.

● All the related frameworks view the sentence structure in terms of


constituency relation.
● To derive the constituency relation, we take the help of subject-predicate
division of Latin as well as Greek grammar.
● Here we study the clause structure in terms of noun phrase NP and verb
phrase VP.

For Example,

Sentence: This tree is illustrating the constituency relation

Figure 2.2 - A tree illustrating constituency grammar

Now, Let’s deep dive into the discussion on Constituency Grammar:

In Constituency Grammar, the constituents can be any word, group of words, or


phrases and the goal of constituency grammar is to organize any sentence into its
constituents using their properties. To derive these properties we generally take the
help of:

● Part of speech tagging,


● A noun or Verb phrase identification, etc

For example, constituency grammar can organize any sentence into its three constituents: a subject, a context, and an object.

Sentence: <subject> <context> <object>

These three constituents can take different values and, as a result, they can generate
different sentences. For example, if we have the following constituents:

<subject> The horses / The dogs / They
<context> are running / are barking / are eating
<object> in the park / happily / since the morning

then example sentences that can be generated with the help of the above constituents are:

“The dogs are barking in the park”
“They are eating happily”
“The horses are running since the morning”

Now, another view of constituency grammar is to define the grammar in terms of part-of-speech
tags. Say a grammar structure contains

[determiner, noun] [adjective, verb] [preposition, determiner, noun]

which corresponds to the same sentence - “The dogs are barking in the park”.

Another view (using part-of-speech tags):

< DT NN > < JJ VB > < PRP DT NN > -------------> The dogs are barking in the park

C. Dependency Grammar

It is the opposite of constituency grammar and is based on the dependency relation.
Dependency grammar (DG) differs from constituency grammar in that it lacks phrasal nodes.

Before deep dive into the discussion of DG, let’s see some fundamental points about
Dependency grammar and Dependency relation.

● In Dependency Grammar, the words are connected to each other by directed


links.
● The verb is considered the center of the clause structure.
● Every other syntactic unit is connected to the verb in terms of directed link.
These syntactic units are called dependencies.

For Example,

Sentence: This tree is illustrating the dependency relation

Figure 2.3 - A tree illustrating a dependency relation

Now, Let’s deep dive into the discussion of Dependency Grammar:

1. Dependency Grammar states that the words of a sentence are dependent upon other words of the sentence.
For example, in the sentence we discussed under constituency grammar, "barking" was mentioned, and the
dog is modified with the help of barking, as a dependency (adjectival modifier) exists between the two.

2. It organizes the words of a sentence according to their dependencies. One of the words in a sentence
behaves as a root and all the other words except that word itself are linked directly or indirectly with the
root using their dependencies. These dependencies represent relationships among the words in a sentence
and dependency grammars are used to infer the structure and semantic dependencies between the words.

For Example, Consider the following sentence:

Sentence: Analytics Vidhya is the largest community of data

scientists and provides the best resources for understanding

data and analytics

The dependency tree of the above sentence is shown below:

In the above tree, the root word is “community” having NN as the part of speech tag
and every other word of this tree is connected to root, directly or indirectly, with the
help of dependency relation such as a direct object, direct subject, modifiers, etc.

These relationships define the roles and functions of each word in the sentence and
how multiple words are connected together.

We can represent every dependency in the form of a triplet which contains a


governor, a relation, and a dependent,

Relation : ( Governor, Relation, Dependent )

which implies that a dependent is connected to the governor with the help of relation,
or in other words, they are considered the subject, verb, and object respectively.

For Example, Consider the following same sentence again:

Sentence: Analytics Vidhya is the largest community of data

scientists

Then, we separate the sentence in the following manner:

<Analytics Vidhya> <is> <the largest community of data scientists>

Now, let’s identify different components in the above sentence:

● Subject: “Analytics Vidhya” is the subject and is playing the role of a


governor.
● Verb: “is” is the verb and is playing the role of the relation.
● Object: “the largest community of data scientists” is the dependent or the
object.
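
As a hedged sketch (not part of the original text), the same (governor, relation, dependent) triplets can be printed with spaCy, assuming spaCy and its small English model en_core_web_sm are installed:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Analytics Vidhya is the largest community of data scientists")

for token in doc:
    # token.head is the governor, token.dep_ is the relation, token is the dependent.
    print((token.head.text, token.dep_, token.text))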

Introduction to Parsers

1. Introduction to Parsing
Parsing is defined as "the analysis of an input to organize the data according to the rule of a
grammar."

There are a few ways to define parsing. However, the gist remains the same: parsing means to find
the underlying structure of the data we are given.

Figure 2.4 Parsing example

In a way, parsing can be considered the inverse of templating: identifying the structure and
extracting the data. In templating, instead, we have a structure and we fill it with data. In the case of
parsing, you have to determine the model from the raw representation, while for templating, you

have to combine the data with the model to create the raw representation. Raw representation is
usually text, but it can also be binary data.

Fundamentally, parsing is necessary because different entities need the data to be in different forms.
Parsing allows transforming data in a way that can be understood by a specific software. The
obvious example is programs — they are written by humans, but they must be executed by
computers. So, humans write them in a form that they can understand, then a software transforms
them in a way that can be used by a computer.

2. Role of Parser
In the syntax analysis phase, a compiler verifies whether or not the tokens generated by the

lexical analyzer are grouped according to the syntactic rules of the language. This is done by a
parser. The parser obtains a string of tokens from the lexical analyzer and verifies that the string can
be generated by the grammar of the source language. It detects and reports any syntax errors and produces a parse
tree from which intermediate code can be generated.

Figure 2.5 Parsing

3. Structure of Parser

We can now look at the general structure of a parser. A complete parser is usually composed of two parts: a lexer, also known as scanner
or tokenizer, and the proper parser. The parser needs the lexer because it does not work
directly on the text but on the output produced by the lexer. Not all parsers adopt this
two-step schema; some parsers do not depend on a separate lexer and they combine the two
steps. They are called scannerless parsers.

A lexer and a parser work in sequence: the lexer scans the input and produces the matching tokens;

the parser then scans the tokens and produces the parsing result.

Let’s look at the following example and imagine that we are trying to parse addition.

437 + 734

The lexer scans the text and finds 4, 3, and 7, and then a space ( ). The job of the lexer is to
recognize that the characters 437 constitute one token of type NUM. Then the lexer finds a +
symbol, which corresponds to the second token of type PLUS, and lastly, it finds another token of
type NUM.

The parser will typically combine the tokens produced by the lexer and group them.

The definitions used by lexers and parsers are called rules or productions. In our example, a lexer
rule will specify that a sequence of digits corresponds to a token of type NUM, while a parser rule
will specify that a sequence of tokens of type NUM, PLUS, NUM corresponds to a sum expression.
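
A rough Python sketch of this two-step scheme for the sum example; the token names NUM and PLUS follow the text above, and everything else is illustrative.

import re

# Lexer: turn the character stream into (type, value) tokens, discarding whitespace.
def lex(text):
    token_spec = [("NUM", r"\d+"), ("PLUS", r"\+"), ("SKIP", r"\s+")]
    pattern = "|".join(f"(?P<{name}>{regex})" for name, regex in token_spec)
    for match in re.finditer(pattern, text):
        if match.lastgroup != "SKIP":
            yield (match.lastgroup, match.group())

# Parser: group the tokens according to the rule  sum -> NUM PLUS NUM.
def parse_sum(tokens):
    num1, plus, num2 = list(tokens)
    assert num1[0] == "NUM" and plus[0] == "PLUS" and num2[0] == "NUM"
    return ("sum", int(num1[1]), int(num2[1]))

print(parse_sum(lex("437 + 734")))   # ('sum', 437, 734)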

It is now typical to find suites that can generate both a lexer and parser. In the past, it was instead
more common to combine two different tools: one to produce the lexer and one to produce the
parser. For example, this was the case of the venerable lex and yacc couple: using lex, it was possible
to generate a lexer, while using yacc, it was possible to generate a parser.

4. Lexers and Parsers

A lexer transforms a sequence of characters into a sequence of tokens.

Lexers are also known as scanners or tokenizers. Lexers play a role in parsing because they
transform the initial input into a form that is more manageable by the proper parser, which works at a
later stage. Typically, lexers are easier to write than parsers, although there are special cases when
both are quite complicated; for instance, in the case of C.

A very important part of the job of the lexer is dealing with whitespace. Most of the time, you want
the lexer to discard whitespace. That is because otherwise, the parser would have to check for the

presence of whitespace between every single token, which would quickly become annoying.

5. Parsing Tree and Abstract Syntax Tree

There are two terms that are related and sometimes they are used interchangeably: parse tree and
abstract syntax tree (AST). Technically, the parse tree could also be called a concrete syntax tree
(CST) because it should reflect more concretely the actual syntax of the input, at least compared to
the AST.

Conceptually, they are very similar. They are both trees: there is a root representing the whole
source code, and the root has child nodes that contain subtrees representing smaller and
smaller portions of code, until single tokens (terminals) appear in the tree.

The difference is in the level of abstraction. A parse tree might contain all the tokens that appeared
in the program and possibly, a set of intermediate rules. The AST, instead, is a polished version of
the parse tree, in which only the information relevant to understanding the code is maintained. We
are going to see an example of an intermediate rule in the next section.

Some information might be absent both in the AST and the parse tree. For instance, comments and
grouping symbols (i.e. parentheses) are usually not represented. Things like comments are
superfluous for a program and grouping symbols are implicitly defined by the structure of the tree.

Figure 2.6 Example of a parse tree

In the AST the indication of the specific operator has disappeared and all that remains is the
operation to be performed. The specific operator is an example of an intermediate rule.

Graphical Representation of a Tree
The output of a parser is a tree, but the tree can also be represented in graphical ways, to
allow easier understanding for the developer. Some parser generator tools can output a file in the
DOT language, a language designed to describe graphs (a tree is a particular kind of graph). Then
this file is fed to a program that can create a graphical representation starting from this textual
description.

Let’s see a DOT text based on the previous sum example.

digraph sum {
  sum -> 10;
  sum -> 21;
}

The appropriate tool can create the following graphical representation.

6. Parsing Algorithms

Overview
Let’s start with a global overview of the features and strategies of all parsers.

Two Strategies
There are two strategies for parsing: top-down parsing and bottom-up parsing. Both terms are
defined in relation to the parse tree generated by the parser. Explained in a simple way:
● A top-down parser tries to identify the root of the parse tree first, then moves down the
subtrees until it finds the leaves of the tree.
● A bottom-up parser instead starts from the lowest part of the tree, the leaves, and rises up
until it determines the root of the tree.

Let’s see an example, starting with a parse tree.

Figure 2.7 Example parse tree

The same tree would be generated in a different order by a top-down and a bottom-up parser. In the
following images, the number indicates the order in which the nodes are created.

Figure 2.8 - Top-down order of generation of the tree

Figure 2.9 Bottom-up order of generation of the tree

Tables of Parsing Algorithms
We provide a table below to offer a summary of the main information needed to understand and
implement a specific parser algorithm. The table lists:

● A formal description, to explain the theory behind the algorithm


● A more practical explanation
● One or two implementations, usually one easier and the other a professional parser.
Sometimes, though, there is no easier version or a professional one.

Figure 2.10 Table for Parsing algorithms

To understand how a parsing algorithm works, you can also look at the syntax analytic toolkit. It is
an educational parser generator that describes the steps that a generated parser takes to accomplish
its objective. It implements an LL and an LR algorithm.

The second table shows a summary of the main features of the different parsing algorithms and
what they are generally used for.

Figure 2.11 - Table of features of parsing algorithms

1. Top-Down Algorithms
The top-down strategy is the more widespread of the two strategies, and there are several successful
algorithms applying it.

LL Parser
LL (Left-to-right read of the input, Leftmost derivation) parsers are table-based parsers without
backtracking, but with lookahead. Table-based means that they rely on a parsing table to decide
which rule to apply. The parsing table uses nonterminals and terminals as its rows and columns,
respectively.

To find the correct rule to apply:

1. Firstly, the parser looks at the current token and the appropriate amount of lookahead tokens.
2. Then, it tries to apply the different rules until it finds the correct match.

The concept of the LL parser does not refer to a specific algorithm, but more to a class of parsers.
They are defined in relation to grammars. That is to say, an LL parser is one that can parse a LL
grammar. In turn, LL grammars are defined in relation to the number of lookahead tokens that are
needed to parse them. This number is indicated between parentheses next to LL, so in the form
LL(k).

An LL(k) parser uses k tokens of lookahead and thus it can parse, at most, a grammar that needs k
tokens of lookahead to be parsed. Effectively, the concept of the LL(k) grammar is more widely
employed than the corresponding parser, which means that LL(k) grammars are used as a measure
when comparing different algorithms. For instance, you would read that PEG parsers can handle
LL(*) grammars.

Earley Parser

The Earley parser is a chart parser named after its inventor, Jay Earley. The algorithm is usually
compared to CYK, another chart parser, which is simpler but also usually worse in performance and
memory. The distinguishing feature of the Earley algorithm is that, in addition to storing partial
results, it implements a prediction step to decide which rule it is going to try to match next.

The Earley parser fundamentally works by dividing a rule into segments, as in the following
example.

Figure 2.12 - Example for the Earley parser

Then, working on these segments, which can be connected at the dot (.), it tries to reach a completed state,
that is to say, one with the dot at the end.

The appeal of an Earley parser is that it is guaranteed to be able to parse all context-free languages,
while other famous algorithms (e.g., LL, LR) can parse only a subset of them. For instance, it has no
problem with left-recursive grammars. More generally, an Earley parser can also deal with

nondeterministic and ambiguous grammars.

It can do that at the risk of worse performance (O(n^3) in the worst case). However, it has linear
time performance for normal grammars. The catch is that the set of languages parsed by the more
traditional algorithms is the one we are usually interested in.

There is also a side effect of this lack of limitations: by forcing a developer to write the grammar in a
certain way, the parsing can be made more efficient. For example, building an LL(1) grammar might be harder for
the developer, but the parser can apply it very efficiently. With Earley, you do less work, so the
parser does more of it.

In short, Earley allows you to use grammars that are easier to write, but that might be suboptimal in
terms of performance.

Recursive Descent Parser

A recursive descent parser is a parser that works with a set of (mutually) recursive procedures,
usually one for each rule of the grammars. Thus, the structure of the parser mirrors the structure of
the grammar.
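
As a rough, hedged illustration (not from the original text), here is a minimal recursive descent parser in Python for the toy grammar expr -> NUM ('+' NUM)*, with one procedure per rule so that the parser mirrors the grammar:

# expr -> NUM ('+' NUM)*
def parse_expr(tokens, pos=0):
    value, pos = parse_num(tokens, pos)
    while pos < len(tokens) and tokens[pos] == "+":
        right, pos = parse_num(tokens, pos + 1)
        value += right
    return value, pos

# NUM -> a single integer token
def parse_num(tokens, pos):
    return int(tokens[pos]), pos + 1

print(parse_expr("437 + 734 + 1".split()))   # (1172, 5)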

The term predictive parser is used in a few different ways: some people mean it as a synonym for a
top-down parser, some as a recursive descent parser that never backtracks.

Typically, recursive descent parsers have problems parsing left-recursive rules because the algorithm
would end up calling the same function again and again. A possible solution to this problem is using
tail recursion. Parsers that use this method are called tail recursive parsers.

Pratt Parser

A Pratt parser is a widely unused, but much appreciated (by the few who know it), parsing algorithm
defined by Vaughan Pratt in a paper called Top Down Operator Precedence. The paper itself starts
with a polemic on BNF grammars, which the author argues have wrongly become the exclusive concern of
parsing studies. This is one of the reasons for its lack of success. In fact, the algorithm does not rely
on a grammar but works directly on tokens, which makes it unusual to parsing experts.

The second reason is that traditional top-down parsers work great if you have a meaningful prefix
that helps distinguish between different rules. For example, if you get the token FOR, you are
looking at a for statement. Since this essentially applies to all programming languages and their

statements, it is easy to understand why the Pratt parser did not change the parsing world.

Parser Combinator

A parser combinator is a higher-order function that accepts parser functions as input and returns a
new parser function as output. A parser function usually means a function that accepts a string and
outputs a parse tree.

Parser combinators are modular and easy to build, but they are also slower (they have O(n^4)
complexity in the worst case) and less sophisticated. They are typically adopted for easier parsing
tasks or for prototyping. In a sense, the user of a parser combinator builds the parser partially by
hand but relies on the hard work done by whoever created the parser combinator.

Parser combinators are often associated with functional programming, where monads are a common idiom;
the most basic example is the Maybe monad. This is a wrapper around a normal type, like integer,
that returns the value itself when the value is valid (e.g. 567), but a special value, Nothing, when it is
not (e.g. undefined or a division by zero). Thus, you can avoid using a null value and unceremoniously
crashing the program. Instead, the Nothing value is managed normally, like any other value.

2. Bottom-Up Algorithms

The bottom-up strategy's main success is the family of many different LR parsers.
The reason for their relative unpopularity is that, historically, they have been harder to build, although
LR parsers are more powerful than traditional LL(1) parsers. So, we mostly concentrate on them,
apart from a brief description of CYK parsers.
This means that we avoid talking about the more generic class of shift-reduce parser, which also
includes LR parsers.

Shift-reduce algorithms work with two steps:

1. Shift: Read one token from the input, which will become a new (momentarily isolated) node.
2. Reduce: Once the proper rule is matched, join the resulting tree with a precedent existing
subtree.

Basically, the Shift step reads the input until completion, while the Reduce step joins the subtrees
until the final parse tree is built.

CYK Parser
The Cocke-Younger-Kasami (CYK) algorithm was formulated independently by three authors. Its
notability is due to its great worst-case performance (O(n^3)), although it is hampered by
comparatively bad performance in most common scenarios.

However, the real disadvantage of the algorithm is that it requires grammars to be expressed in
Chomsky normal form.

The CYK algorithm is used mostly for specific problems; for instance, the membership problem: to
determine if a string is compatible with a certain grammar. It can also be used in natural language
processing to find the most probable parsing between many options.

LR Parser
LR (Left-to-right read of the input; Rightmost derivation) parsers are bottom-up parsers that can
handle deterministic context-free languages in linear time with lookahead and without backtracking.
The invention of LR parsers is credited to the renowned Donald Knuth.

Traditionally, they have been compared to and have competed with LL parsers. There's a similar
analysis related to the number of lookahead tokens necessary to parse a language. An LR(k) parser
can parse grammars that need k tokens of lookahead to be parsed. However, LR grammars are less
restrictive, and thus more powerful, than the corresponding LL grammars. For example, there is no
need to exclude left-recursive rules.

Technically, LR grammars are a superset of LL grammars. One consequence of this is that you need
only LR(1) grammars, so usually, the (k) is omitted.

They are also table-based, just like LL-parsers, but they need two complicated tables. In very simple
terms:

1. One table tells the parser what to do depending on the current token, the state it is in, and the
tokens that could possibly follow the current one (lookahead sets).
2. The other table (the goto table) tells the parser which state to move to after a group of tokens
has been reduced to a grammar rule.

Introduction to POS TAGS

1. Part-of-Speech Tagging

Part-of-Speech(POS) Tagging is the process of assigning different labels known as POS tags to the words in
a sentence that tells us about the part-of-speech of the word.

It is a process of converting a sentence into other forms: a list of words, or a list of tuples (where each
tuple has the form (word, tag)). The tag in this case is a part-of-speech tag and signifies whether the word
is a noun, adjective, verb, and so on.

It is a popular Natural Language Processing process which refers to categorizing words in a text (corpus) in
correspondence with a particular part of speech, depending on the definition of the word and its context.

Broadly there are two types of POS tags:

1. Universal POS Tags: These tags are used in the Universal Dependencies (UD) (latest version 2), a
project that is developing cross-linguistically consistent treebank annotation for many languages.
These tags are based on the type of words. E.g., NOUN(Common Noun), ADJ(Adjective), ADV(Adverb)

2. Detailed POS Tags: These tags are the result of the division of universal POS tags into various tags,
like NNS for common plural nouns and NN for the singular common noun compared to NOUN for
common nouns in English. These tags are language-specific.
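
A short NLTK sketch contrasting the two kinds of tagsets (the tagger and tokenizer resources are assumed to be downloaded; the tags shown in the comments are indicative):

import nltk

tokens = nltk.word_tokenize("The dogs are barking in the park")

print(nltk.pos_tag(tokens))                      # detailed tags, e.g. ('dogs', 'NNS')
print(nltk.pos_tag(tokens, tagset="universal"))  # universal tags, e.g. ('dogs', 'NOUN')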

Figure 2.13 List of Universal POS Tags

Example 1 of Part-of-speech (POS) tagged corpus

The/at-tl expense/nn and/cc time/nn involved/vbn are/ber astronomical/jj ./.

The format for a tagged corpus is word/tag: each word is paired with a tag denoting its POS. For example,
nn refers to a noun and vb to a verb.

Example 2 of Part-of-speech (POS) tagged corpus

Figure 2.14: Example of POS tagging

In Figure 2.14, we can see each word has its own lexical term written underneath; however, having to
constantly write out these full terms when we perform text analysis can very quickly become cumbersome,
especially as the size of the corpus grows. Instead, we use a short representation referred to as "tags" to
represent the categories.

As earlier mentioned, the process of assigning a specific tag to a word in our corpus is referred to as
part-of-speech tagging (POS tagging for short) since the POS tags are used to describe the lexical terms that
we have within our text.

Figure 2.15: Grid displaying different types of lexical terms, their tags, and random examples

Part-of-speech tags describe the characteristic structure of lexical terms within a sentence or text, therefore,

we can use them for making assumptions about semantics. Other applications of POS tagging include:

● Named Entity Recognition


● Co-reference Resolution
● Speech Recognition
When we perform POS tagging, it’s often the case that our tagger will encounter words that were not within
the vocabulary that was used. Consequently, augmenting your dataset to include unknown word tokens will
aid the tagger in selecting appropriate tags for those words.

Markov Chains
Taking the example text we used in Figure 2.14, “Why not tell someone?”, imagine the sentence is truncated to
“Why not tell … ” and we want to determine whether the following word in the sentence is a noun, verb,
adverb, or some other part of speech.
Now, if you are familiar with English, you’d instantly identify the verb and assume that it is more likely the
word is followed by a noun rather than another verb. Therefore, the idea as shown in this example is that the
POS tag that is assigned to the next word is dependent on the POS tag of the previous word.

Figure 2.16: Representing Likelihoods visually

By associating numbers with each arrow direction, which imply the likelihood of the next word given the
current word, we can say there is a higher likelihood that the next word in our sentence would be a noun
than a verb, given that we are currently on a verb. The image in Figure 2.16 is a great example of how a
Markov model works on a very small scale.

Given this example, we may now describe Markov models as "a stochastic model used to model randomly
changing systems. It is assumed that future states depend only on the current state, not on the events that
occurred before it (that is, it assumes the Markov property)". Therefore, to get the probability of the next
event, it needs only the state of the current event.

We can depict a Markov chain as a directed graph:

Figure 2.17: Depiction of Markov Model as Graph

The lines with arrows are an indication of the direction hence the name “directed graph”, and the circles
may be regarded as the states of the model — a state is simply the condition of the present moment.

We could use this Markov model to perform POS tagging. Considering we view a sentence as a sequence of words,
we can represent the sequence as a graph where we use the POS tags as the events that occur, which would
be illustrated by the states of our model graph.

For example, q1 in Figure 2.17 would become NN, indicating a noun, q2 would be VB, which is short for verb,
and q3 would be O, signifying all other tags that are not NN or VB. The directed lines
would be given a transition probability that defines the probability of going from one state to the next.

Figure 2.17: Example of Markov Model to perform POS tagging.

A more compact way to store the transition and state probabilities is using a table, better known as a
“transition matrix”.

Figure 2.18: Transition Matrix

Notice this model only tells us the transition probability of one state to the next when we know the previous
word. Hence, this model does not show us what to do when there is no previous word. To handle this case,
we add what is known as the “initial state”.

Figure 2.19: Adding an initial state to deal with the beginning of a sentence

You may now be wondering: how did we populate the transition matrix? We will use three sentences for our
corpus: “<s> in a station of the metro”, “<s> the apparition of these faces in the crowd”, and “<s> petals on
a wet, black bough.” Next, we will break down how to populate the matrix into steps:

1. Count occurrences of tag pairs in the training dataset

Formula 1: Counting the occurrences of tag pairs: C(t_{i-1}, t_i) = number of times tag t_{i-1} is followed by tag t_i in the training corpus

At the end of step one, our table would look something like this…

36
Figure 2.20: applying step one with our corpus.

2. Calculate the probabilities using the counts

Formula 2: Calculating the probabilities from the counts: P(t_i | t_{i-1}) = C(t_{i-1}, t_i) / Σ_j C(t_{i-1}, t_j)

Applying the above formula to the table in Figure 2.20, our new table would look as follows…

Figure 2.21: Probabilities populating the transition matrix.

You may notice that there are many 0’s in our transition matrix which would result in our model being
incapable of generalizing to other text that may contain verbs. To overcome this problem, we add
smoothing.

Adding smoothing requires that we slightly adjust the formula by adding a small value, epsilon, to each of
the counts in the numerator, and adding N * epsilon to the denominator, so that each row still sums to 1.

Formula 3: Calculating the probabilities with smoothing: P(t_i | t_{i-1}) = (C(t_{i-1}, t_i) + ε) / (Σ_j C(t_{i-1}, t_j) + N * ε)

Figure 2.22: New probabilities with smoothing added. N is the number of tags and epsilon is some
very small number.
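
A hedged Python sketch of these two steps, counting tag-pair occurrences and applying the smoothed formula; the tiny tagged corpus below is made up purely for illustration.

from collections import defaultdict

# Toy tagged corpus: each sentence is a list of (word, tag) pairs; <s> marks the start.
corpus = [[("<s>", "<s>"), ("in", "IN"), ("a", "DT"), ("station", "NN")],
          [("<s>", "<s>"), ("the", "DT"), ("apparition", "NN")]]

# Step 1: count occurrences of tag pairs C(t_{i-1}, t_i).
counts = defaultdict(lambda: defaultdict(int))
for sent in corpus:
    tags = [t for _, t in sent]
    for prev, cur in zip(tags, tags[1:]):
        counts[prev][cur] += 1

# Step 2: turn counts into smoothed probabilities
#         P(t_i | t_{i-1}) = (C(t_{i-1}, t_i) + eps) / (C(t_{i-1}) + N * eps)
tagset = sorted({t for sent in corpus for _, t in sent})
N, eps = len(tagset), 0.001
transition = {prev: {cur: (counts[prev][cur] + eps) / (sum(counts[prev].values()) + N * eps)
                     for cur in tagset}
              for prev in tagset}

for prev, row in transition.items():
    print(prev, {cur: round(p, 3) for cur, p in row.items()})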

Hidden Markov Model

A Hidden Markov Model (HMM) is a statistical Markov model in which the system being modeled is assumed
to be a Markov process with unobservable ("hidden") states. In our case, the unobservable states are the
POS tags of the words.

If we rewind back to our Markov model, we see that the model has states for parts of speech, such as VB for
a verb and NN for a noun. We may now think of these as hidden states, since they are not directly observable
from the corpus. Though a human may be capable of deciphering which POS applies to a specific word, a
machine only sees the text (which is what is observable) and is unaware of whether that word's POS tag is
noun, verb, or something else, which in turn means the tags are unobservable.
The emission probabilities describe the transitions from the hidden states in the model — remember the
hidden states are the POS tags — to the observable states — remember the observable states are the words.

Figure 2.23: Example of Hidden Markov model.

In Figure 2.23 we see that the hidden VB state has several observable states. The emission probability from
the hidden state VB to the observable word eat is 0.5, hence there is a 50% chance that the model would output

this word when the current hidden state is VB.
We can also represent the emission probabilities as a table…

Figure 2.24: Emission matrix expressed as a table — The numbers are not accurate representations,
they are just random

Similar to the transition probability matrix, the row values must sum to 1. Also, the reason all of our
emission probabilities are greater than 0 is that words can have a different POS tag depending on the
context.

To populate the emission matrix, we’d follow a procedure very similar to the way we’d populate the
transition matrix. We’d first count how often a word is tagged with a specific tag.

Figure 2.25: Calculating the counts of a word and how often it is tagged with a specific tag.

Since the process is so similar to calculating the transition matrix, I will simply provide the formula with
smoothing applied so you can see how it would be calculated.

Formula 4: Formula for calculating the emission probabilities with smoothing, i.e. P(w(i) | t(i)) = (C(t(i), w(i)) + epsilon) / (C(t(i)) + N * epsilon), where N is the number of words in the vocabulary and epsilon is
a very small number
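Below is a similarly hedged sketch of how the emission matrix could be populated with the smoothed formula above. The word-tag counts are made-up toy values, and N here is the assumed vocabulary size, not a value taken from the figures.

```python
from collections import defaultdict

# Assumed toy counts of how often each word appears with each tag (not real corpus counts).
emission_counts = {
    ("NN", "metro"): 1, ("NN", "crowd"): 1, ("NN", "bough"): 1,
    ("NNS", "faces"): 1, ("NNS", "petals"): 1,
    ("DT", "the"): 3, ("DT", "a"): 2, ("DT", "these"): 1,
}

tag_totals = defaultdict(int)
for (tag, _), count in emission_counts.items():
    tag_totals[tag] += count

vocabulary = sorted({word for _, word in emission_counts})
N = len(vocabulary)   # vocabulary size used in the smoothed denominator
epsilon = 0.001       # assumed small value

def emission_probability(tag, word):
    """P(word | tag) with add-epsilon smoothing, as in Formula 4."""
    return (emission_counts.get((tag, word), 0) + epsilon) / (tag_totals[tag] + N * epsilon)

print(round(emission_probability("DT", "the"), 3))     # a large probability
print(round(emission_probability("DT", "metro"), 6))   # small but non-zero thanks to smoothing
```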

EXPERIMENT NO 3
INTRODUCTION TO NLTK

What is the Natural Language Toolkit(NLTK) in NLP?

Natural language processing is about building applications and services that can understand human
languages. It is a field at the intersection of computers and humans, and it is mainly used for text
analysis, providing computers with a way to recognize human language.

Moreover, NLP is the technology behind the chatbots, voice assistants, predictive text and other text
applications that have unfolded in recent years. There is a wide variety of open-source NLP tools
available.

With the help of NLP tools and techniques, most NLP tasks can be performed; a few examples of NLP tasks
are speech recognition, summarization, topic segmentation, understanding what a piece of content is
about, and sentiment analysis.

Understanding NLTK

NLTK is a preeminent platform for developing Python programs that work with human language data. It is a
suite of open-source program modules, tutorials and problem sets providing ready-to-use computational
linguistics courseware. NLTK covers symbolic and statistical Natural Language Processing and is
interfaced with annotated corpora, which makes it especially useful for teachers and students.

The most significant features of NLTK include:

1. It provides easy-to-use interfaces to over 50 corpora and lexical resources, such as WordNet, along
with text processing libraries for classification and tokenization, and wrappers for
industrial-strength NLP libraries.
2. NLTK is suitable for translators, educators, researchers, and industrial applications, and is
available on Windows, Mac OS X, and Linux.
3. It comes with a hands-on guide that introduces computational linguistics and programming
fundamentals for Python, which makes it a good fit for lexicographers who do not have deep
programming knowledge.
4. NLTK is an effective combination of three factors: first, it was deliberately designed as
courseware, with pedagogical goals given primary status; second, its target audience includes both
linguists and computer specialists, and it is not only convenient but also challenging at various
levels of early computational skill; and third, it relies on an object-oriented scripting language
that supports rapid prototyping and readable programming.

Requirements of NLTK

1. Ease of use: One of the main objectives of the toolkit is to let users focus on building NLP
components and systems. The more time students must spend learning to use the toolkit, the less
useful it is.
2. Consistency: The toolkit must use consistent data structures and interfaces.
3. Extensibility: The toolkit should easily accommodate new components, whether those components
replicate or extend the toolkit's existing functionality. The toolkit should be structured so that
adding new extensions fits naturally into its existing infrastructure.
4. Documentation: The toolkit, its data structures and its implementation need to be documented
carefully. All nomenclature must be chosen carefully and applied consistently.
5. Simplicity: The toolkit should hide the complications of building NLP systems, not add to them.
Every class provided by the toolkit should be simple enough that a user could implement it by the
time they complete an introductory course in computational linguistics.
6. Modularity: The interaction between different components of the toolkit should be kept to a
minimum, using simple, well-defined interfaces. It should be possible to complete individual
projects using small parts of the toolkit, without worrying about how they interact with the rest
of it.

Uses of NLTK

1. Assignments: NLTK can be used to create assignments of varying difficulty and scope. After becoming
familiar with the toolkit, users can make small changes or extensions to an existing NLTK module.
When developing a new module, NLTK gives a few useful starting points: pre-defined interfaces and
data structures, and existing modules that implement the same interface.
2. Class demonstrations: NLTK offers graphical tools that can be used in class demonstrations to help
explain elementary NLP concepts and algorithms. Such interactive tools can display the associated
data structures and show the step-by-step execution of algorithms.
3. Advanced Projects: NLTK gives users a flexible framework for advanced projects. Typical projects
include developing entirely new functionality for a previously unsupported NLP task, or building a
complete system from existing and new modules.

Text Analysis Operations using NLTK

NLTK is a powerful Python package that provides a diverse set of natural language algorithms. It is
free, open source, easy to use, well documented, and has a large community. NLTK covers the most
common algorithms, such as tokenization, part-of-speech tagging, stemming, sentiment analysis, topic
segmentation, and named entity recognition. NLTK helps the computer analyze, preprocess, and
understand written text.

First, we install and import nltk on our system. Open a terminal (or a notebook cell) and run the
following command:

!pip install nltk

Now we can see a message similar to the following:


Requirement already satisfied: nltk in /home/northout/anaconda2/lib/python2.7/site-packages
Requirement already satisfied: six in /home/northout/anaconda2/lib/python2.7/site-packages (from nltk)
You are using pip version 9.0.1, however version 10.0.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.

Further we import the nltk using the following command and start with the operations.

#Loading NLTK
import nltk

1. Tokenization

Tokenization is the first step in text analytics. The process of breaking down a text paragraph
into smaller chunks, such as words or sentences, is called tokenization. A token is a single entity
that acts as a building block of a sentence or paragraph.

2. Sentence Tokenization

Sentence tokenizer breaks text paragraph into sentences.


from nltk.tokenize import sent_tokenize
text="""Hello Mr. Smith, how are you doing today? The weather is great, and city is
awesome.
The sky is pinkish-blue. You shouldn't eat cardboard"""
tokenized_text=sent_tokenize(text)
print(tokenized_text)

Output -

['Hello Mr. Smith, how are you doing today?', 'The weather is great, and city is
awesome.', 'The sky is pinkish-blue.', "You shouldn't eat cardboard"]

Here, the given text is tokenized into sentences.

3. Word Tokenization

Word tokenizer breaks text paragraph into words.

from nltk.tokenize import word_tokenize


tokenized_word=word_tokenize(text)
print(tokenized_word)

['Hello', 'Mr.', 'Smith', ',', 'how', 'are', 'you', 'doing', 'today', '?', 'The', 'weather', 'is', 'great', ',',
'and', 'city', 'is', 'awesome', '.', 'The', 'sky', 'is', 'pinkish-blue', '.', 'You', 'should', "n't", 'eat',
'cardboard']

4. Frequency Distribution

from nltk.probability import FreqDist


fdist = FreqDist(tokenized_word)
print(fdist)

<FreqDist with 25 samples and 30 outcomes>

fdist.most_common(2)

[('is', 3), (',', 2)]

# Frequency Distribution Plot


import matplotlib.pyplot as plt
fdist.plot(30,cumulative=False)
plt.show()

5. Stopwords

Stopwords are considered noise in the text. Text may contain stop words such as is, am, are,
this, a, an, the, etc.

To remove stopwords with NLTK, you need to build a set of stopwords and filter your list of tokens
against it.
from nltk.corpus import stopwords
stop_words=set(stopwords.words("english"))
print(stop_words)

output:-
{'their', 'then', 'not', 'ma', 'here', 'other', 'won', 'up', 'weren', 'being', 'we', 'those', 'an', 'them',
'which', 'him', 'so', 'yourselves', 'what', 'own', 'has', 'should', 'above', 'in', 'myself', 'against',
'that', 'before', 't', 'just', 'into', 'about', 'most', 'd', 'where', 'our', 'or', 'such', 'ours', 'of', 'doesn',
'further', 'needn', 'now', 'some', 'too', 'hasn', 'more', 'the', 'yours', 'her', 'below', 'same', 'how',
'very', 'is', 'did', 'you', 'his', 'when', 'few', 'does', 'down', 'yourself', 'i', 'do', 'both', 'shan', 'have',

'itself', 'shouldn', 'through', 'themselves', 'o', 'didn', 've', 'm', 'off', 'out', 'but', 'and', 'doing', 'any',
'nor', 'over', 'had', 'because', 'himself', 'theirs', 'me', 'by', 'she', 'whom', 'hers', 're', 'hadn', 'who',
'he', 'my', 'if', 'will', 'are', 'why', 'from', 'am', 'with', 'been', 'its', 'ourselves', 'ain', 'couldn', 'a',
'aren', 'under', 'll', 'on', 'y', 'can', 'they', 'than', 'after', 'wouldn', 'each', 'once', 'mightn', 'for',
'this', 'these', 's', 'only', 'haven', 'having', 'all', 'don', 'it', 'there', 'until', 'again', 'to', 'while', 'be',
'no', 'during', 'herself', 'as', 'mustn', 'between', 'was', 'at', 'your', 'were', 'isn', 'wasn'}

Removing Stopwords

tokenized_sent = word_tokenize(tokenized_text[0])   # first sentence from the earlier example
filtered_sent = []
for w in tokenized_sent:
    if w not in stop_words:
        filtered_sent.append(w)
print("Tokenized Sentence:", tokenized_sent)
print("Filtered Sentence:", filtered_sent)

output:-
Tokenized Sentence: ['Hello', 'Mr.', 'Smith', ',', 'how', 'are', 'you', 'doing', 'today', '?']
Filtered Sentence: ['Hello', 'Mr.', 'Smith', ',', 'today', '?']

6. Lexicon Normalization

Lexicon normalization addresses another type of noise in the text. For example, connection,
connected and connecting reduce to the common word "connect". It reduces derivationally
related forms of a word to a common root word.

7. Stemming

Stemming is a process of linguistic normalization, which reduces words to their root form or chops
off derivational affixes. For example, the words connection, connected and connecting reduce to the
common word "connect".
# Stemming
from nltk.stem import PorterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize

ps = PorterStemmer()

stemmed_words = []
for w in filtered_sent:
    stemmed_words.append(ps.stem(w))

print("Filtered Sentence:", filtered_sent)
print("Stemmed Sentence:", stemmed_words)

Output:-
Filtered Sentence: ['Hello', 'Mr.', 'Smith', ',', 'today', '?']
Stemmed Sentence: ['hello', 'mr.', 'smith', ',', 'today', '?']

8. Lemmatization

Lemmatization reduces words to their base word, which is a linguistically correct lemma. It
transforms a word to its root using a vocabulary and morphological analysis. Lemmatization
is usually more sophisticated than stemming: a stemmer works on an individual word without
knowledge of the context. For example, the word "better" has "good" as its lemma. Stemming
will miss this, because it requires a dictionary look-up.
#Lexicon Normalization
#performing stemming and Lemmatization

from nltk.stem.wordnet import WordNetLemmatizer


lem = WordNetLemmatizer()

from nltk.stem.porter import PorterStemmer


stem = PorterStemmer()

word = "flying"
print("Lemmatized Word:",lem.lemmatize(word,"v"))
print("Stemmed Word:",stem.stem(word))

output-
Lemmatized Word: fly
Stemmed Word: fli

9. POS Tagging

The primary target of Part-of-Speech (POS) tagging is to identify the grammatical group of a given
word (whether it is a NOUN, PRONOUN, ADJECTIVE, VERB, ADVERB, etc.) based on the context. POS
tagging looks for relationships within the sentence and assigns a corresponding tag to each word.

sent = "Albert Einstein was born in Ulm, Germany in 1879."

tokens=nltk.word_tokenize(sent)
print(tokens)
output:-
['Albert', 'Einstein', 'was', 'born', 'in', 'Ulm', ',', 'Germany', 'in', '1879', '.']

nltk.pos_tag(tokens)
Output-
[('Albert', 'NNP'),
('Einstein', 'NNP'),
('was', 'VBD'),
('born', 'VBN'),
('in', 'IN'),
('Ulm', 'NNP'),
(',', ','),
('Germany', 'NNP'),
('in', 'IN'),
('1879', 'CD'),
('.', '.')]

EXPERIMENT NO 4
WRITE A PYTHON PROGRAM TO REMOVE “STOPWORDS” FROM A GIVEN
TEXT AND GENERATE WORD TOKENS AND FILTERED TEXT

To remove stopwords with NLTK, you need to build a set of stopwords and filter your list of tokens
against it.

Code:-
import nltk
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
print(stop_words)

Output:-

Removing Stopwords

Code:-

from nltk.corpus import stopwords


from nltk.tokenize import word_tokenize
example_sent = """ A stop word is a commonly used word that a search engine
has been programmed to ignore, both when indexing entries for
searching and when retrieving them as the result of a search query."""
stop_words = set(stopwords.words('english'))
word_tokens = word_tokenize(example_sent)
filtered_sentence = [w for w in word_tokens if not w.lower() in stop_words]
filtered_sentence = []
for w in word_tokens:
if w not in stop_words:
filtered_sentence.append(w)
print("Tokenized_Sentence:",word_tokens)
print("Filtered_Sentence:",filtered_sentence)

Output

EXPERIMENT NO 5
WRITE A PYTHON PROGRAM TO GENERATE “TOKENS” AND ASSIGN “POS
TAGS” FOR A GIVEN TEXT USING NLTK PACKAGE

Code -
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize, sent_tokenize

stop_words = set(stopwords.words('english'))

text = "Tokenization is one of the least glamorous parts of NLP. How do we split our text"\
"so that we can do interesting things on it. "\
"Despite its lack of glamour, it’s super important."\
"Tokenization defines what our NLP models can express. "\
"Even though tokenization is super important, it’s not always top of mind."\
"In the rest of this article, I’d like to give you a high-level overview of tokenization, where it came
from,"\
"what forms it takes, and when and how tokenization is important "\

tokenized = sent_tokenize(text)
for i in tokenized:

wordsList = nltk.word_tokenize(i)

wordsList = [w for w in wordsList if not w in stop_words]

pos_tag= nltk.pos_tag(wordsList)

print("Pos-tags",pos_tag)

Output:-

EXPERIMENT NO 6
WRITE A PYTHON PROGRAM TO GENERATE “WORDCLOUD” WITH
MAXIMUM WORDS USED = 100, IN DIFFERENT SHAPES AND SAVE AS
A .PNG FILE FOR A GIVEN TEXT FILE.

Wordcloud 1
Code:-

from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator


import matplotlib.pyplot as plt
from PIL import Image
import numpy as np

text = open('batman.txt', 'r').read()


stopwords = set(STOPWORDS)

custom_mask = np.array(Image.open('like.png'))
wc = WordCloud(background_color = 'black',
stopwords = stopwords,
mask = custom_mask,
contour_width = 3,
contour_color = 'black')

wc.generate(text)
image_colors = ImageColorGenerator(custom_mask)
wc.recolor(color_func = image_colors)

# Save the word cloud as a .png file
wc.to_file('like_cloud.png')

The Image :-

Output :-

Wordcloud 2
Code:-
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
import matplotlib.pyplot as plt
from PIL import Image
import numpy as np

text = open('batman.txt', 'r').read()


stopwords = set(STOPWORDS)

custom_mask = np.array(Image.open('girl.png'))
wc = WordCloud(background_color = 'white',
stopwords = stopwords,
mask = custom_mask,
contour_width = 3,
contour_color = 'black')

wc.generate(text)
image_colors = ImageColorGenerator(custom_mask)
wc.recolor(color_func = image_colors)

wc.to_file('girl_cloud.png')

The Image : -

Output :-

EXPERIMENT NO 7
PERFORM AN EXPERIMENT TO LEARN ABOUT MORPHOLOGICAL
FEATURES OF A WORD BY ANALYZING IT.

Introduction : Word Analysis

A word can be simple or complex. For example, the word 'cat' is simple because one cannot decompose it
further into smaller parts. On the other hand, the word 'cats' is complex, because it is made up of two
parts: the root 'cat' and the plural suffix '-s'.

Theory

Analysis of a word into root and affix(es) is called Morphological analysis of a word. It is
mandatory to identify the root of a word for any natural language processing task. A root word can
have various forms. For example, the word 'play' in English has the following forms: 'play', 'plays',
'played' and 'playing'. Hindi shows a greater number of forms for the word 'खेल' (khela) which is
equivalent to 'play'. The forms of 'खेल'(khela) are the following:

खेल(khela), खेला(khelaa), खेली(khelii), खेलूंगा(kheluungaa), खेलूंगी(kheluungii), खेलेगा(khelegaa),
खेलेगी(khelegii), खेलते(khelate), खेलती(khelatii), खेलने(khelane), खेलकर(khelakar)

For the Telugu root ఆడడం (Adadam), the forms are the following:

Adutaanu, AdutunnAnu, Adenu, Ademu, AdevA, AdutAru, Adutunnaru, AdadAniki, Adesariki,


AdanA, Adinxi, Adutunxi, AdinxA, AdeserA, Adestunnaru

Thus we understand that the morphological richness of one language might vary from one language
to another. Indian languages are generally morphologically rich languages and therefore
morphological analysis of words becomes a very significant task for Indian languages.

Types of Morphology

Morphology is of two types,

1. Inflectional morphology

Deals with word forms of a root, where there is no change in lexical category. For example, 'played'
is an inflection of the root word 'play'. Here, both 'played' and 'play' are verbs.

2. Derivational morphology

Deals with word forms of a root, where there is a change in the lexical category. For example, the
word form 'happiness' is a derivation of the word 'happy'. Here, 'happiness' is a derived noun form of
the adjective 'happy'.

Morphological Features:

All words will have their lexical category attested during morphological analysis.

A noun and pronoun can take suffixes of the following features: gender, number, person, case

For example, morphological analysis of a few words is given below:

Language | input: word | output: analysis

Hindi | लडके (ladake) | rt=लड़का(ladakaa), cat=n, gen=m, num=sg, case=obl
Hindi | लडके (ladake) | rt=लड़का(ladakaa), cat=n, gen=m, num=pl, case=dir
Hindi | लड़कों (ladakoM) | rt=लड़का(ladakaa), cat=n, gen=m, num=pl, case=obl
English | boy | rt=boy, cat=n, gen=m, num=sg
English | boys | rt=boy, cat=n, gen=m, num=pl

A verb can take suffixes of the following features: tense, aspect, modality, gender, number, person.

'rt' stands for root. 'cat' stands for lexical category. The value of lexical category can be noun, verb,
adjective, pronoun, adverb or preposition. 'gen' stands for gender. The value of gender can be
masculine or feminine.

◆ 'num' stands for number. The value of number can be singular (sg) or plural (pl).
◆ 'per' stands for person. The value of person can be 1, 2 or 3.
◆ The value of tense can be present, past or future. This feature is applicable for verbs.
◆ The value of aspect can be perfect (pft), continuous (cont) or habitual (hab). This feature is also
applicable for verbs.
◆ 'case' can be direct or oblique. This feature is applicable for nouns. A case is oblique when a
postposition occurs after the noun; if no postposition can occur after the noun, then the case is
direct. This applies to Hindi but not to English, as English does not have postpositions. Some of
the postpositions in Hindi are: का(kaa), की(kii), के(ke), को(ko), में(meM).

Objective :- The objective of the experiment is to learn about morphological features of a word by
analysing it.

Procedure and Experiment

STEP1: Select the language.

OUTPUT: Drop down for selecting words will appear.

STEP2: Select the word.

OUTPUT: Drop down for selecting features will appear.

STEP3: Select the features.

STEP4: Click "Check" button to check your answer.

OUTPUT: Right features are marked by tick and wrong features are marked by cross.

EXPERIMENT NO 8
PERFORM AN EXPERIMENT TO GENERATE WORD FORMS FROM
ROOT AND SUFFIX INFORMATION

Introduction : Word Generation

A word can be simple or complex. For example, the word 'cat' is simple because one cannot decompose it
further into smaller parts. On the other hand, the word 'cats' is complex, because it is made up of two
parts: the root 'cat' and the plural suffix '-s'.

Theory:- Given the root and suffix information, a word can be generated. For example,

Language | input: analysis | output: word

Hindi | rt=लड़का(ladakaa), cat=n, gen=m, num=sg, case=obl | लड़के(ladake)
Hindi | rt=लड़का(ladakaa), cat=n, gen=m, num=pl, case=dir | लड़के(ladake)
English | rt=boy, cat=n, num=pl | boys
English | rt=play, cat=v, num=sg, per=3, tense=pr | plays

- Morphological analysis and generation are inverse processes.

- Analysis may involve non-determinism, since more than one analysis is possible.

- Generation is a deterministic process. If a language allows spelling variation, then to that extent
generation would also involve non-determinism.

Objective : The objective of the experiment is to generate word forms from root and suffix information

Procedure :-

STEP1: Select the language.

OUTPUT: Drop downs for selecting root and other features will appear.

STEP2: Select the root and other features.

STEP3: After selecting all the features, select the word corresponding above features selected.

STEP4: Click the check button to see whether right word is selected or not

OUTPUT: Output tells whether the word selected is right or wrong

EXPERIMENT NO 9
PERFORM AN EXPERIMENT TO UNDERSTAND THE MORPHOLOGY OF
A WORD BY THE USE OF ADD-DELETE TABLE

Introduction : Morphology

Morphology is the study of the way words are built up from smaller meaning-bearing units, i.e.,
morphemes. A morpheme is the smallest meaningful linguistic unit. For example:

● बच्चों(bachchoM) consists of two morphemes: बच्चा(bachchaa) carries the information of the root
noun "बच्चा"(bachchaa), and ओं(oM) carries the information of plural number and oblique case.
● played has two morphemes, play and -ed, carrying the information of the verb "play" and "past
tense", so the given word is the past tense form of the verb "play".

Words can be analysed morphologically if we know all variants of a given root word. We can use an
'Add-Delete' table for this analysis.

Theory :-

Morph Analyser

Definition
Morphemes are considered the smallest meaningful units of language. A morpheme can either be a root
word (play) or an affix (-ed). The combination of these morphemes is called a morphological process.
So, the word "played" is made out of 2 morphemes, "play" and "-ed". Finding all the parts of a word
(its morphemes) and thereby describing the properties of the word is called "Morphological Analysis".
For example, "played" carries the information of the verb "play" and "past tense", so the given word
is the past tense form of the verb "play".

Analysis of a word :
बच्चों (bachchoM) = बच्चा(bachchaa)(root) + ओं(oM)(suffix)

(ओं=3 plural oblique)

A linguistic paradigm is the complete set of variants of a given lexeme. These variants can be
classified according to shared inflectional categories (eg: number, case etc) and arranged into tables.

Paradigm for बच्चा

case/num | singular | plural

direct | बच्चा(bachchaa) | बच्चे(bachche)
oblique | बच्चे(bachche) | बच्चों(bachchoM)

Algorithm to get बच्चों(bachchoM) from बच्चा(bachchaa)

1. Take the root बच्चा(bachchaa) = बच्च(bachch) + आ(aa)
2. Delete आ(aa)
3. Output बच्च(bachch)
4. Add ओं(oM) to the output
5. Return बच्चों(bachchoM)

Therefore आ is deleted and ओं is added to get बच्चों.

Add-Delete table for बच्चा

Delete Add Number Case Variants

आ(aa) आ(aa) sing dr बच्चा(bachchaa)

आ(aa) ए(e) plu dr बच्चे(bachche)

आ(aa) ए(e) sing ob बच्चे(bachche)

आ(aa) ओं(oM) plu ob बच्चों(bachchoM)

Paradigm Class
Words in the same paradigm class behave similarly. For example, लड़क is in the same paradigm class
as बच्च, so लड़का behaves in the same way as बच्चा, since they share the same paradigm class.
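To make the procedure concrete, here is a minimal Python sketch that applies the add-delete table above to generate the variants of a root. Note that in the actual strings the suffixes are the dependent vowel signs (matras), so the code spells them out explicitly; the table data simply re-encodes the बच्चा paradigm shown above.

```python
# Dependent vowel signs used as suffixes in the actual strings (the table above shows
# the corresponding independent vowels for readability).
AA = "\u093e"            # matra for aa, as in बच्चा
E = "\u0947"             # matra for e, as in बच्चे
OM = "\u094b\u0902"      # matra for o plus anusvara, as in बच्चों

# Add-delete table for the बच्चा paradigm: (delete, add, number, case)
add_delete_table = [
    (AA, AA, "sing", "dr"),
    (AA, E,  "plu",  "dr"),
    (AA, E,  "sing", "ob"),
    (AA, OM, "plu",  "ob"),
]

def generate_variants(root, table):
    """Apply each (delete, add) rule of the paradigm to the root word."""
    variants = []
    for delete, add, number, case in table:
        if root.endswith(delete):
            stem = root[: len(root) - len(delete)]        # delete the suffix
            variants.append((stem + add, number, case))   # add the new suffix
    return variants

for form, number, case in generate_variants("बच्चा", add_delete_table):
    print(form, number, case)

# लड़का is in the same paradigm class, so the same table generates its forms too.
print([form for form, _, _ in generate_variants("लड़का", add_delete_table)])
```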

Objective :- Understanding the morphology of a word by the use of Add-Delete table

Procedure :-

STEP1: Select a word root.

STEP2: Fill the add-delete table and submit.

STEP3: If wrong, see the correct answer or repeat STEP1.

Wrong output:-

Right output:-

EXPERIMENT NO 10
PERFORM AN EXPERIMENT TO LEARN TO CALCULATE BIGRAMS
FROM A GIVEN CORPUS AND CALCULATE PROBABILITY OF A
SENTENCE.

Introduction :- N - Grams
The probability of a sentence can be calculated from the probability of the sequence of words occurring
in it. We can use the Markov assumption, that the probability of a word in a sentence depends only on the
word occurring just before it. Such a model is called a first-order Markov model or the bigram model:

P(W1, W2, ..., Wn) ≈ P(W1) * P(W2|W1) * P(W3|W2) * ... * P(Wn|Wn-1)

Here, Wn refers to the word token corresponding to the nth word in the sequence.

Theory

A combination of words forms a sentence. However, such a formation is meaningful only when the
words are arranged in some order.

Eg: Sit I car in the

Such a sentence is not grammatically acceptable. However some perfectly grammatical sentences
can be nonsensical too!

Eg: Colorless green ideas sleep furiously

One easy way to handle such unacceptable sentences is by assigning probabilities to strings of
words, i.e., how likely the sentence is.

Probability of a sentence

If we consider each word occurring in its correct location as an independent event, the probability of
the sentence is: P(w(1), w(2), ..., w(n-1), w(n))

Using the chain rule:

P(w(1), w(2), ..., w(n)) = P(w(1)) * P(w(2) | w(1)) * P(w(3) | w(1)w(2)) * ... * P(w(n) | w(1)w(2)...w(n-1))

Bigrams

We can avoid this very long calculation by approximating that the probability of a given word depends
only on its previous word. This assumption is called the Markov assumption, and such a model is called a
Markov model (bigrams). Bigrams can be generalized to the n-gram, which looks at (n-1) words in the past.
A bigram is a first-order Markov model.

Therefore ,

P(w(1), w(2)..., w(n-1), w(n))= P(w(2)|w(1)) P(w(3)|w(2)) …. P(w(n)|w(n-1))

We use (eos) tag to mark the beginning and end of a sentence

A bigram table for a given corpus can be generated and used as a lookup table for calculating
probability of sentences.

Eg: Corpus – (eos) You book a flight (eos) I read a book (eos) You read (eos)

Bigram Table:

         (eos)   you    book   a      flight  I      read
(eos)    0       0.67   0      0      0       0.33   0
you      0       0      0.5    0      0       0      0.5
book     0.5     0      0      0.5    0       0      0
a        0       0      0.5    0      0.5     0      0
flight   1       0      0      0      0       0      0
I        0       0      0      0      0       0      1
read     0.5     0      0      0.5    0       0      0

P((eos) you read a book (eos))

= P(you|eos) * P(read|you) * P(a|read) * P(book|a) * P(eos|book)

= (2/3) * 0.5 * 0.5 * 0.5 * 0.5

≈ 0.0417
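The same computation can be sketched in a few lines of Python. This is a minimal illustration that builds the bigram counts directly from the toy corpus above and multiplies the conditional probabilities for the example sentence; the corpus strings are simply lower-cased copies of the one used in the table, and the variable names are assumptions for the example.

```python
from collections import defaultdict

corpus = [
    "(eos) you book a flight (eos)".split(),
    "(eos) i read a book (eos)".split(),
    "(eos) you read (eos)".split(),
]

bigram_counts = defaultdict(int)
history_counts = defaultdict(int)
for sentence in corpus:
    for w1, w2 in zip(sentence, sentence[1:]):
        bigram_counts[(w1, w2)] += 1
        history_counts[w1] += 1   # count of w1 appearing as a bigram history

def bigram_prob(w1, w2):
    """Maximum-likelihood estimate P(w2 | w1) = C(w1 w2) / C(w1)."""
    return bigram_counts[(w1, w2)] / history_counts[w1]

sentence = "(eos) you read a book (eos)".split()
probability = 1.0
for w1, w2 in zip(sentence, sentence[1:]):
    probability *= bigram_prob(w1, w2)
print(round(probability, 4))   # roughly 0.0417 for this toy corpus
```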

Objective :- The objective of this experiment is to learn to calculate bigrams from a given corpus and
calculate probability of a sentence.

Procedure:-
STEP1: Select a corpus and click on

Generate bigram table

STEP2: Fill up the table that is generated and hit

Submit

STEP3: If incorrect (red), see the correct answer by clicking on show answer or repeat Step 2.

STEP4: If correct (green), click on take a quiz and fill the correct answer

EXPERIMENT NO 11
PERFORM AN EXPERIMENT TO LEARN HOW TO APPLY ADD-ONE
SMOOTHING ON SPARSE BIGRAM TABLE.

Introduction : - N-Grams Smoothing


One major problem with standard N-gram models is that they must be trained from some corpus, and
because any particular training corpus is finite, some perfectly acceptable N-grams are bound to be
missing from it. The bigram matrix for any given training corpus is therefore sparse: there are a
large number of bigrams with zero probability that should really have some non-zero probability. Such
a model tends to underestimate the probability of strings that happen not to occur in its training
corpus.

There are techniques that can be used to assign a non-zero probability to these 'zero probability
bigrams'. This task of re-evaluating the zero-probability and low-probability N-grams and assigning
them non-zero values is called smoothing.

Theory :-
The standard N-gram models are trained from some corpus. The finiteness of the training corpus leads
to the absence of some perfectly acceptable N-grams, which results in sparse bigram matrices. Such
models tend to underestimate the probability of strings that do not occur in their training corpus.

There are techniques that can be used to assign a non-zero probability to these 'zero probability
bigrams'. This task of re-evaluating the zero-probability and low-probability N-grams and assigning
them non-zero values is called smoothing. Some of the techniques are: Add-One Smoothing, Witten-Bell
Discounting, and Good-Turing Discounting.

Add-One Smoothing
In add-one smoothing, we add one to all the bigram counts before normalizing them into probabilities.

Application on unigrams
The unsmoothed maximum likelihood estimate of the unigram probability is computed by dividing the
count of the word by the total number of word tokens N:

P(wx) = c(wx) / Σi c(wi) = c(wx) / N

Let there be an adjusted count c*:

ci* = (ci + 1) * N / (N + V)

where V is the total number of word types in the language. Now, probabilities can be calculated by
normalizing the adjusted counts by N:

pi* = (ci + 1) / (N + V)

Application on bigrams
Normal bigram probabilities are computed by normalizing each row of counts by the unigram count:

P(wn | wn-1) = C(wn-1 wn) / C(wn-1)

For add-one smoothed bigram counts we need to augment the unigram count by the total number of word
types in the vocabulary, V:

P*(wn | wn-1) = (C(wn-1 wn) + 1) / (C(wn-1) + V)
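As an illustration of the bigram formula above, here is a minimal sketch that applies add-one smoothing to the counts of the toy corpus from the previous experiment. The corpus and variable names are assumptions for the example, not part of the lab interface.

```python
from collections import defaultdict

corpus = [
    "(eos) you book a flight (eos)".split(),
    "(eos) i read a book (eos)".split(),
    "(eos) you read (eos)".split(),
]

bigram_counts = defaultdict(int)
unigram_counts = defaultdict(int)
vocabulary = set()
for sentence in corpus:
    for word in sentence:
        unigram_counts[word] += 1
        vocabulary.add(word)
    for w1, w2 in zip(sentence, sentence[1:]):
        bigram_counts[(w1, w2)] += 1

V = len(vocabulary)   # number of word types

def smoothed_bigram_probability(w1, w2):
    """Add-one smoothed estimate P*(w2 | w1) = (C(w1 w2) + 1) / (C(w1) + V)."""
    return (bigram_counts[(w1, w2)] + 1) / (unigram_counts[w1] + V)

# A bigram that never occurs in the corpus now gets a small non-zero probability:
print(round(smoothed_bigram_probability("book", "flight"), 3))
# A bigram that does occur is discounted slightly compared to its unsmoothed estimate:
print(round(smoothed_bigram_probability("you", "book"), 3))
```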

Objective:- The objective of this experiment is to learn how to apply add-one smoothing on sparse bigram
table.

Procedure :-
STEP1: Select a corpus

STEP2: Apply add one smoothing and calculate bigram probabilities using the given bigram
counts,N and V. Fill the table and hit

Submit

STEP3: If incorrect (red), see the correct answer by clicking on show answer or repeat Step 2

EXPERIMENT NO 12
PERFORM AN EXPERIMENT TO CALCULATE EMISSION AND TRANSITION
MATRIX WHICH WILL BE HELPFUL FOR TAGGING PARTS OF SPEECH
USING HIDDEN MARKOV MODEL.

Introduction:-
POS TAGGING - Hidden Markov Model

POS tagging or part-of-speech tagging is the procedure of assigning a grammatical category like
noun, verb, adjective etc. to a word. In this process both the lexical information and the context play
an important role as the same lexical form can behave differently in a different context.

For example the word "Park" can have two different lexical categories based on the context.

1. The boy is playing in the park. ('Park' is Noun)


2. Park the car. ('Park' is Verb)

Assigning part of speech to words by hand is a common exercise one can find in an elementary
grammar class. But here we wish to build an automated tool which can assign the appropriate
part-of-speech tag to the words of a given sentence. One can think of creating hand-crafted rules by
observing patterns in the language, but this would limit the system's performance to the quality and
number of patterns identified by the rule crafter. Thus, this approach is not practically adopted for
building a POS tagger. Instead, a large corpus annotated with the correct POS tag for each word is given
to the computer, and algorithms then learn the patterns automatically from the data and store them in
the form of a trained model. Later this model can be used to POS tag new sentences.

In this experiment we will explore how such a model can be learned from the data.

Theory : -

A Hidden Markov Model (HMM) is a statistical Markov model in which the system being modeled
is assumed to be a Markov process with unobserved (hidden) states. In a regular Markov model
(Ref: http://en.wikipedia.org/wiki/Markov_model), the state is directly visible to
the observer, and therefore the state transition probabilities are the only parameters. In a hidden
Markov model, the state is not directly visible, but the output, which depends on the state, is visible.

A Hidden Markov Model has two important components:

1) Transition probabilities: the one-step transition probability is the probability of transitioning from
one state to another in a single step.

2) Emission probabilities: the output probabilities for an observation from a state. Emission
probabilities B = { bi,k = bi(ok) = P(ok | qi) }, where ok is an observation. Informally, B is the
probability that the output is ok given that the current state is qi.

For POS tagging, it is assumed that the POS tags are generated by a random process, and each process
randomly generates a word. Hence, the transition matrix denotes the transition probability from one POS
to another, and the emission matrix denotes the probability that a given word has a particular POS.
The words act as the observations. Some of the basic assumptions are:

1. First-order (bigram) Markov assumptions:

a. Limited Horizon: Tag depends only on previous tag

P(ti+1 = tk | t1=tj1,…,ti=tji) = P(ti+1 = tk | ti = tj)

b. Time invariance: No change over time

P(ti+1 = tk | ti = tj) = P(t2 = tk | t1 = tj) = P(tj -> tk)

2. Output probabilities:

Probability of getting word wk for tag tj: P(wk | tj) is independent of other tags or words!

Calculating the Probabilities

Consider the given toy corpus

EOS/eos They/pronoun cut/verb the/determiner paper/noun EOS/eos
He/pronoun asked/verb for/preposition his/pronoun cut/noun EOS/eos
Put/verb the/determiner paper/noun in/preposition the/determiner cut/noun EOS/eos

Calculating Emission Probability Matrix

Count the no. of times a specific word occurs with a specific POS tag in the corpus.

Here, say for "cut"

count(cut,verb)=1

count(cut,noun)=2

count(cut,determiner)=0

... and so on zero for other tags too.

count(cut) = total count of cut = 3

Now, calculating the probability

Probability to be filled in the matrix cell at the intersection of cut and verb

P(cut/verb)=count(cut,verb)/count(cut)=1/3=0.33

Similarly,

Probability to be filled in the cell at the intersection of cut and determiner

P(cut/determiner)=count(cut,determiner)/count(cut)=0/3=0

Repeat the same for all word-tag combinations and fill the emission matrix.

Calculating Transition Probability Matrix

Count the no. of times a specific tag comes after other POS tags in the corpus.

Here, say for "determiner"

count(verb,determiner)=2

count(preposition,determiner)=1

count(determiner,determiner)=0

count(eos,determiner)=0

count(noun,determiner)=0

... and so on zero for other tags too.

count(determiner) = total count of tag 'determiner' = 3

Now, calculating the probability

Probability to be filled in the cell at the intersection of determiner (in the column) and verb (in the
row):

P(determiner/verb)=count(verb,determiner)/count(determiner)=2/3=0.66

Similarly,

Probability to be filled in the cell at the intersection of determiner (in the column) and noun (in the
row):

P(determiner/noun)=count(noun,determiner)/count(determiner)=0/3=0

Repeat the same for all tag pairs to fill the transition matrix.

Note: EOS/eos is a special marker which represents End Of Sentence.
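The counting described above can be reproduced with a short Python sketch. It follows the same conventions as the worked example (emission entries divided by the word count, transition entries divided by the count of the following tag); the tuple encoding of the corpus is simply a convenient assumption for this illustration.

```python
from collections import defaultdict

# The toy corpus above, written as (word, tag) pairs.
corpus = [
    ("EOS", "eos"), ("They", "pronoun"), ("cut", "verb"), ("the", "determiner"),
    ("paper", "noun"), ("EOS", "eos"), ("He", "pronoun"), ("asked", "verb"),
    ("for", "preposition"), ("his", "pronoun"), ("cut", "noun"), ("EOS", "eos"),
    ("Put", "verb"), ("the", "determiner"), ("paper", "noun"), ("in", "preposition"),
    ("the", "determiner"), ("cut", "noun"), ("EOS", "eos"),
]

word_tag_counts = defaultdict(int)
word_counts = defaultdict(int)
tag_counts = defaultdict(int)
tag_pair_counts = defaultdict(int)

for word, tag in corpus:
    word_tag_counts[(word.lower(), tag)] += 1
    word_counts[word.lower()] += 1
    tag_counts[tag] += 1

for (_, prev_tag), (_, tag) in zip(corpus, corpus[1:]):
    tag_pair_counts[(prev_tag, tag)] += 1

# Emission entry, following the convention above: P(cut/verb) = count(cut, verb) / count(cut)
print(round(word_tag_counts[("cut", "verb")] / word_counts["cut"], 2))               # 1/3 = 0.33

# Transition entry: P(determiner/verb) = count(verb, determiner) / count(determiner)
print(round(tag_pair_counts[("verb", "determiner")] / tag_counts["determiner"], 2))  # 2/3
```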

Objective - The objective of the experiment is to calculate emission and transition matrix which will be
helpful for tagging Parts of Speech using Hidden Markov Model.

Procedure :-

STEP1: Select the corpus.

STEP2: For the given corpus fill the emission and transition matrix. Answers are rounded to 2 decimal
digits.

STEP3: Press Check to check your answer.

Wrong answers are indicated by the red cell.

EXPERIMENT NO 13
PERFORM AN EXPERIMENT TO KNOW THE IMPORTANCE OF CONTEXT
AND SIZE OF TRAINING CORPUS IN LEARNING PARTS OF SPEECH

Introduction-
Building POS Tagger

In corpus linguistics, part-of-speech tagging (POS tagging or POST), also called grammatical tagging or
word-category disambiguation, is the process of marking up a word in a text (corpus) as corresponding to a
particular part of speech, based on both its definition and its context, i.e. its relationship with adjacent
and related words in a phrase, sentence, or paragraph. A simplified form of this is the identification of words as
nouns, verbs, adjectives, adverbs, etc. Once performed by hand, POS tagging is now done in the context of
computational linguistics, using algorithms which associate discrete terms, as well as hidden parts of
speech, in accordance with a set of descriptive tags. POS-tagging algorithms fall into two distinctive groups:
rule-based and stochastic.

Theory:-

Hidden Markov Model

In the mid 1980s, researchers in Europe began to use Hidden Markov models (HMMs) to disambiguate
parts of speech. HMMs involve counting cases, and making a table of the probabilities of certain sequences.
For example, once you've seen an article such as 'the', perhaps the next word is a noun 40% of the time, an
adjective 40%, and a number 20%. Knowing this, a program can decide that "can" in "the can" is far more
likely to be a noun than a verb or a modal. The same method can of course be used to benefit from
knowledge about the following words.

More advanced ("higher order") HMMs learn the probabilities not only of pairs, but triples or even larger
sequences. So, for example, if you've just seen an article and a verb, the next item may be very likely a

83
preposition, article, or noun, but much less likely another verb.

When several ambiguous words occur together, the possibilities multiply. However, it is easy to enumerate
every combination and to assign a relative probability to each one, by multiplying together the probabilities
of each choice in turn.

It is worth remembering, as Eugene Charniak points out in Statistical techniques for natural language
parsing, that merely assigning the most common tag to each known word and the tag "proper noun" to all
unknowns, will approach 90% accuracy because many words are unambiguous.

HMMs underlie the functioning of stochastic taggers and are used in various algorithms. Accuracies for one
such algorithm (TnT) on various training data is shown here.

Conditional Random Field

Conditional random fields (CRFs) are a class of statistical modelling methods often applied in machine
learning, where they are used for structured prediction. Whereas an ordinary classifier predicts a label for a
single sample without regard to "neighboring" samples, a CRF can take context into account. Because it can
consider context, a CRF can be used in Natural Language Processing, and hence Parts of Speech tagging is
also possible. It predicts the POS using the lexical items as the context.

If only one neighbour is considered as context, it is called a bigram; similarly, two neighbours as context
is called a trigram. In this experiment, the size of the training corpus and the context are varied to see
their importance.
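The lab interface reports the accuracies directly, but a rough offline analogue can be put together with NLTK's n-gram taggers: train on differently sized slices of a tagged corpus, with and without the previous tag as context, and compare accuracies on held-out data. The corpus choice, slice sizes and the use of UnigramTagger/BigramTagger (rather than the lab's HMM and CRF implementations) are assumptions for illustration, and the exact numbers will vary.

```python
import nltk
from nltk.corpus import treebank

nltk.download("treebank", quiet=True)

tagged_sents = list(treebank.tagged_sents())
test_sents = tagged_sents[-500:]                 # held-out evaluation data

for train_size in (500, 1000, 3000):             # varying the size of the training corpus
    train_sents = tagged_sents[:train_size]
    unigram = nltk.UnigramTagger(train_sents)                  # no context: the word alone
    bigram = nltk.BigramTagger(train_sents, backoff=unigram)   # previous tag as extra context
    # evaluate() is named accuracy() in newer NLTK releases
    print(train_size,
          round(unigram.evaluate(test_sents), 3),
          round(bigram.evaluate(test_sents), 3))
```

In general, accuracy improves as the training corpus grows, and the extra context helps once enough data is available to estimate the additional parameters.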

Objective - The objective of the experiment is to know the importance of context and size of training
corpus in learning Parts of Speech

Procedure :-

STEP1: Select the language.

OUTPUT: Drop down to select size of corpus, algorithm and features will appear.

STEP2: Select corpus size.

STEP3: Select algorithm "CRF" or "HMM".

STEP4:

Select feature "bigram" or "trigram".

OUTPUT: Corresponding accuracy will be shown.

EXPERIMENT NO 14
PERFORM AN EXPERIMENT TO UNDERSTAND THE CONCEPT OF
CHUNKING AND GET FAMILIAR WITH THE BASIC CHUNK TAGSET.

Introduction : - Chunking
Chunking of text involves dividing a text into groups of syntactically correlated words. For example, the
sentence 'He ate an apple.' can be divided as follows: [NP He] [VP ate] [NP an apple].

Each chunk has an open boundary and a close boundary that delimit the word groups as a minimal
non-recursive unit. This can be formally expressed by using IOB prefixes.

Theory : -

Chunking of text involves dividing a text into groups of syntactically correlated words.

Eg: He ate an apple to satiate his hunger.
[NP He] [VP ate] [NP an apple] [VP to satiate] [NP his hunger]

Eg: दरवाज़ा खुल गया
[NP दरवाज़ा] [VP खुल गया]

Chunk Types

The chunk types are based on the syntactic category of the chunk head. Besides the head, a chunk also
contains modifiers (like determiners, adjectives, and postpositions in NPs).

The basic types of chunks in English are:

Chunk Type | Tag Name
1. Noun | NP
2. Verb | VP
3. Adverb | ADVP
4. Adjectival | ADJP
5. Prepositional | PP

The basic Chunk Tag Set for Indian Languages

Sl. No | Chunk Type | Tag Name

1 | Noun Chunk | NP
2.1 | Finite Verb Chunk | VGF
2.2 | Non-finite Verb Chunk | VGNF
2.3 | Verb Chunk (Gerund) | VGNN
3 | Adjectival Chunk | JJP
4 | Adverb Chunk | RBP

NP Noun Chunks

Noun chunks will be given the tag NP and include non-recursive noun phrases and the postposition for
Indian languages (the preposition for English). Determiners, adjectives and other modifiers will be
part of the noun chunk.

Eg:

(इस/DEM किताब/NN में/PSP)NP
'this' 'book' 'in'

((in/IN the/DT big/ADJ room/NN))NP

Verb Chunks

The verb chunks are marked as VP for English; however, they would be of several types for Indian
languages. A verb group will include the main verb and its auxiliaries, if any.

For English:

I (will/MD be/VB loved/VBD)VP

The types of verb chunks and their tags are described below.

1. VGF Finite Verb Chunk

The auxiliaries in the verb group mark the finiteness of the verb at the chunk level. Thus, any verb
group which is finite will be tagged as VGF. For example,

Eg: मैंने घर पर (खाया/VM)VGF
'I (erg)' 'home' 'at' 'ate'

2. VGNF Non-finite Verb Chunk

A non-finite verb chunk will be tagged as VGNF.

Eg: सेब (खाता/VM हुआ/VAUX)VGNF लड़का जा रहा है
'apple' 'eating' 'PROG' 'boy' 'go' 'PROG' 'is'

3. VGNN Gerunds

A verb chunk having a gerund will be annotated as VGNN.

Eg: शराब (पीना/VM)VGNN सेहत के लिए हानिकारक है
'liquor' 'drinking' 'health' 'for' 'harmful' 'is'

JJP/ADJP Adjectival Chunk

An adjectival chunk will be tagged as ADJP for English and JJP for Indian languages. This chunk will
consist of all adjectival chunks, including predicative adjectives.

Eg:

वह लड़की है (सुन्दर/JJ)JJP

The fruit is (ripe/JJ)ADJP

Note: Adjectives appearing before a noun will be grouped
together within the noun chunk.

RBP/ADVP Adverb Chunk

This chunk will include all pure adverbial phrases.

Eg:

वह (धीरे-धीरे/RB)RBP चल रहा था
'he' 'slowly' 'walk' 'PROG' 'was'

He walks (slowly/ADV)ADVP

PP Prepositional Chunk

This chunk type is present only for English and not for Indian languages. It consists of only the
preposition and not the NP argument.

Eg:

(with/IN)PP a pen

IOB prefixes

Each chunk has an open boundary and a close boundary that delimit the word groups as a minimal
non-recursive unit. This can be formally expressed by using IOB prefixes: B-CHUNK for the first word of
the chunk and I-CHUNK for each other word in the chunk. Here is an example of the file format:

Token    POS    Chunk-Tag

He       PRP    B-NP
ate      VBD    B-VP
an       DT     B-NP
apple    NN     I-NP
to       TO     B-VP
satiate  VB     I-VP
his      PRP$   B-NP
hunger   NN     I-NP
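For reference, chunking and the IOB representation can be reproduced in NLTK with a regular-expression chunker. The grammar below is a small illustrative one (it happens to recover the NP/VP chunks of the example sentence, but it is not the official chunker behind the lab), and the nltk.download calls assume the standard NLTK data packages are available.

```python
import nltk
from nltk.chunk import RegexpParser, tree2conlltags

nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

sentence = "He ate an apple to satiate his hunger"
tagged = nltk.pos_tag(nltk.word_tokenize(sentence))

# Illustrative chunk grammar: NP = optional determiner/possessive + adjectives + noun/pronoun,
# VP = optional TO/modal followed by one or more verbs.
grammar = r"""
  NP: {<DT|PRP\$>?<JJ>*<NN.*|PRP>+}
  VP: {<TO>?<MD>?<VB.*>+}
"""
chunker = RegexpParser(grammar)
tree = chunker.parse(tagged)
print(tree)   # bracketed chunks, e.g. (NP He/PRP) (VP ate/VBD) ...

# IOB (B-/I-/O) representation of the same chunks:
for word, pos, chunk_tag in tree2conlltags(tree):
    print(word, pos, chunk_tag)
```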

Objective : - The objective of this experiment is to understand the concept of chunking and get familiar
with the basic chunk tagset.

Procedure : -

STEP1: Select a language

STEP2: Select a sentence

STEP3: Select the corresponding chunk-tag for each word in the sentence and click the

Submit button.

OUTPUT1: The submitted answer will be checked.

Click on the Get Answer button for the correct answer.

EXPERIMENT NO 15
THE OBJECTIVE OF THIS EXPERIMENT IS TO FIND POS TAGS OF
WORDS IN A SENTENCE USING VITERBI DECODING.

Introduction- POS Tagging - Viterbi Decoding


In this experiment the transition and emission matrices will be used to find the POS tag sequence for a
given sentence. When we have an emission and transition matrix, various algorithms can be applied to find
the POS tags for the words. Some of the possible algorithms are the backward algorithm, the forward
algorithm and the Viterbi algorithm. In this experiment, you can get familiar with Viterbi decoding.

Theory - Viterbi decoding is based on dynamic programming. The algorithm takes the emission and
transition matrices as input. The emission matrix gives the probability of a POS tag for a given word,
and the transition matrix gives the probability of transitioning from one POS tag to another. The
algorithm observes the sequence of words and returns the sequence of POS tag states along with its
probability.

Here "s" denotes words and "t" denotes tags; "a" is the transition matrix and "b" is the emission matrix.

Using the above algorithm, we fill the Viterbi table column by column.
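A minimal Python sketch of the algorithm is given below. The toy transition matrix a and emission matrix b are made-up values (with "eos" used as the start state), not the matrices shown in the lab interface; the point is only to show how the table is filled column by column and then backtracked.

```python
# a: transition probabilities P(tag | previous tag); b: emission probabilities for words given a tag.
# The numbers below are illustrative assumptions, not the lab's actual matrices.
tags = ["noun", "verb"]
a = {
    "eos":  {"noun": 0.6, "verb": 0.4},
    "noun": {"noun": 0.3, "verb": 0.7},
    "verb": {"noun": 0.8, "verb": 0.2},
}
b = {
    "noun": {"they": 0.1, "cut": 0.6, "paper": 0.9},
    "verb": {"they": 0.0, "cut": 0.4, "paper": 0.1},
}

def viterbi(words):
    # viterbi_prob[i][t]: probability of the best tag sequence for words[:i+1] ending in tag t
    viterbi_prob = [{}]
    backpointer = [{}]
    for t in tags:  # first column: transition from the start state "eos"
        viterbi_prob[0][t] = a["eos"][t] * b[t].get(words[0], 0.0)
        backpointer[0][t] = None
    for i in range(1, len(words)):  # fill the table column by column
        viterbi_prob.append({})
        backpointer.append({})
        for t in tags:
            best_prev = max(tags, key=lambda p: viterbi_prob[i - 1][p] * a[p][t])
            viterbi_prob[i][t] = viterbi_prob[i - 1][best_prev] * a[best_prev][t] * b[t].get(words[i], 0.0)
            backpointer[i][t] = best_prev
    # backtrack from the best final tag
    best = max(tags, key=lambda t: viterbi_prob[-1][t])
    sequence = [best]
    for i in range(len(words) - 1, 0, -1):
        sequence.insert(0, backpointer[i][sequence[0]])
    return sequence

print(viterbi(["they", "cut", "paper"]))
```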

Objective :- The objective of this experiment is to find POS tags of words in a sentence using Viterbi
decoding.

Procedure : -

STEP1: Select the corpus.

OUTPUT: The emission and transition matrices will appear.

STEP2: Fill the column with the probability of the possible POS tags given the word (i.e. form the Viterbi
matrix by filling a column for each observation). Answers submitted are rounded off to 3 digits after the
decimal and are then checked.

STEP3: Check the column.

Wrong answers are indicated by a red background in the cell.

If the answers are right, then go to STEP 2.

STEP4: Repeat steps 2 and 3 until all words of the sentence are covered.

STEP5: At last, check the POS tag for each word obtained from backtracking.
