
UNIT 1

NLP (Natural Language Processing) is a subfield of computer science and artificial intelligence concerned with the interactions between computers and human (natural) languages. It is used to apply machine learning algorithms to text and speech.

For example, we can use NLP to create systems like speech recognition, document
summarization, machine translation, spam detection, named entity recognition,
question answering, autocomplete, predictive typing and so on.

Introduction to the NLTK library for Python


NLTK (Natural Language Toolkit) is a leading platform for building Python programs
to work with human language data. It provides easy-to-use interfaces to many
corpora and lexical resources. Also, it contains a suite of text processing
libraries for classification, tokenization, stemming, tagging, parsing, and
semantic reasoning. Best of all, NLTK is a free, open source, community-driven
project.
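To follow along, you can install NLTK with pip and download the resources used in the sketches below. This is a minimal setup sketch; the resource names shown are the ones commonly needed for tokenization, stop word lists, and lemmatization.

# Install NLTK first if needed: pip install nltk
import nltk

# Download the resources used in the examples below.
nltk.download('punkt')      # tokenizer models for sentence and word tokenization
nltk.download('stopwords')  # stop word lists for many languages
nltk.download('wordnet')    # lexical database used by the WordNet lemmatizer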
The Basics of NLP for Text
In this article, we’ll cover the following topics:

Sentence Tokenization
Word Tokenization
Text Lemmatization and Stemming
Stop Words
Regex
Bag-of-Words
TF-IDF

1. Sentence Tokenization
Sentence tokenization (also called sentence segmentation) is the problem of dividing a string of written language into its component sentences. The idea looks very simple: in English and some other languages, we can split the sentences apart whenever we see sentence-ending punctuation.

However, even in English, this problem is not trivial because the full stop character is also used in abbreviations. When processing plain text, tables of abbreviations that contain periods can help us prevent the incorrect assignment of sentence boundaries. In many cases, we use libraries to do that job for us, so don't worry too much about the details for now.
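Here is a minimal sketch using NLTK's sent_tokenize function (the example text is made up for illustration):

from nltk.tokenize import sent_tokenize

text = "Backgammon is one of the oldest known board games. Its history can be traced back nearly 5,000 years."

# Split the text into its component sentences.
sentences = sent_tokenize(text)
for sentence in sentences:
    print(sentence)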

2. Word Tokenization
Word tokenization (also called word segmentation) is the problem of dividing a string of written language into its component words. In English and many other languages that use some form of the Latin alphabet, the space is a good approximation of a word divider.

However, we can still have problems if we only split by spaces. Some English compound nouns are written variably and sometimes contain a space. In most cases, we use a library to achieve the wanted results, so again don't worry too much about the details.

3. Text Lemmatization and Stemming
For grammatical reasons, documents can contain different forms of a word, such as drive, drives, and driving. Also, we sometimes have related words with a similar meaning, such as nation, national, and nationality.

The goal of both stemming and lemmatization is to reduce inflectional forms and
sometimes derivationally related forms of a word to a common base form.

Stemming usually refers to a crude heuristic process that chops off the ends of
words in the hope of achieving this goal correctly most of the time, and often
includes the removal of derivational affixes.

Lemmatization usually refers to doing things properly with the use of a vocabulary
and morphological analysis of words, normally aiming to remove inflectional endings
only and to return the base or dictionary form of a word, which is known as the
lemma.
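A minimal sketch with NLTK's PorterStemmer and WordNetLemmatizer (the word list is chosen for illustration):

from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

words = ["drive", "drives", "driving", "driven"]

for word in words:
    # stem() chops off endings with heuristic rules;
    # lemmatize() returns the dictionary form (pos="v" treats the word as a verb).
    print(word, stemmer.stem(word), lemmatizer.lemmatize(word, pos="v"))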

4. Stop Words
Stop words are words which are filtered out before or after the processing of text. They usually refer to the most common words in a language, such as "the", "a", and "is". These words often carry little information for tasks such as text classification, so we usually remove them. Libraries such as NLTK provide ready-made stop word lists for many languages.

5. Regex
A regular expression, regex, or regexp is a sequence of characters that define a search pattern. Let's see some basics.

. - match any character except newline
\w - match a word character
\d - match a digit
\s - match a whitespace character
\W - match a non-word character
\D - match a non-digit character
\S - match a non-whitespace character
[abc] - match any of a, b, or c
[^abc] - match any character that is not a, b, or c
[a-g] - match a character in the range a to g

We can use regex to apply additional filtering to our text. For example, we can remove all the non-word characters. In many cases, we don't need the punctuation marks, and it's easy to remove them with regex.
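A minimal sketch with Python's built-in re module (the sentence is made up for illustration):

import re

text = "The development of snowboarding was inspired by skateboarding, surfing and skiing!"

# Replace every character that is not a word character or whitespace with nothing.
cleaned = re.sub(r"[^\w\s]", "", text)
print(cleaned)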

6. Bag-of-Words
The bag-of-words model is a popular and simple feature extraction technique used when we work with text. It describes the occurrence of each word within a document.

To use this model, we need to score the words in each document. The task here is to convert each raw text into a vector of numbers. After that, we can use these vectors as input for a machine learning model. The simplest scoring method is to mark the presence of words: 1 for present and 0 for absent.

Now, let's see how we can create a bag-of-words model using the CountVectorizer class from the sklearn library.
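Here is a minimal sketch using scikit-learn's CountVectorizer with binary scoring (the example documents are made up for illustration):

from sklearn.feature_extraction.text import CountVectorizer

documents = ["I like this movie, it's funny.",
             "I hate this movie.",
             "This was awesome! I like it.",
             "Nice one. I love it."]

# binary=True marks the presence (1) or absence (0) of each word instead of counting it.
vectorizer = CountVectorizer(binary=True)
bag_of_words = vectorizer.fit_transform(documents)

print(vectorizer.get_feature_names_out())  # the learned vocabulary
print(bag_of_words.toarray())              # one row per document, one column per known word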

The complexity of the bag-of-words model comes in deciding how to design the
vocabulary of known words (tokens) and how to score the presence of known words.

Designing the Vocabulary


When the vocabulary size increases, the vector representation of the documents also grows. In the example above, the length of the document vector is equal to the number of known words.

In some cases, we can have a huge amount of data, and in these cases the length of the vector that represents a document might be thousands or millions of elements. Furthermore, each document may contain only a few of the known words in the vocabulary.

Therefore, the vector representations will have a lot of zeros. Vectors with a lot of zeros are called sparse vectors, and they require more memory and computational resources.

We can decrease the number of known words when using a bag-of-words model to reduce the required memory and computational resources. We can use the text cleaning techniques we have already seen in this article before we create our bag-of-words model (a combined sketch follows the list below):

Ignoring the case of the words
Ignoring punctuation
Removing the stop words from our documents
Reducing the words to their base form (Text Lemmatization and Stemming)
Fixing misspelled words
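As referenced above, here is a minimal sketch that combines several of these cleaning steps using NLTK and the re module (the sentence is illustrative, and the 'stopwords' and 'punkt' resources must be downloaded first, as shown earlier):

import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

text = "The first known documents about snowboarding were written in the 1960s!"

stop_words = set(stopwords.words("english"))
stemmer = PorterStemmer()

text = text.lower()                    # ignore the case of the words
text = re.sub(r"[^\w\s]", "", text)    # ignore punctuation
words = word_tokenize(text)
words = [w for w in words if w not in stop_words]   # remove the stop words
words = [stemmer.stem(w) for w in words]            # reduce the words to their base form
print(words)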
Another more complex way to create a vocabulary is to use grouped words. This
changes the scope of the vocabulary and allows the bag-of-words model to get more
details about the document. This approach is called n-grams.

An n-gram is a sequence of a number of items (words, letters, numbers, digits, etc.). In the context of text corpora, n-grams typically refer to a sequence of words. A unigram is one word, a bigram is a sequence of two words, a trigram is a sequence of three words, etc. The "n" in "n-gram" refers to the number of grouped words. Only the n-grams that appear in the corpus are modeled, not all possible n-grams.

Example
Let's look at all the bigrams for the following sentence:
The office building is open today

All the bigrams are:

the office
office building
building is
is open
open today

The bag-of-bigrams is more powerful than the simple bag-of-words approach, because it captures more details about the document.
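A minimal sketch of extracting bigrams with scikit-learn's CountVectorizer, where ngram_range=(2, 2) keeps bigrams only:

from sklearn.feature_extraction.text import CountVectorizer

sentences = ["The office building is open today"]

# ngram_range=(2, 2) builds the vocabulary from bigrams only.
vectorizer = CountVectorizer(ngram_range=(2, 2))
bigrams = vectorizer.fit_transform(sentences)

print(vectorizer.get_feature_names_out())
print(bigrams.toarray())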
Scoring Words
Once we have created our vocabulary of known words, we need to score the occurrence of the words in our data. We saw one very simple approach - the binary approach (1 for presence, 0 for absence).

Some additional scoring methods are:

Counts. Count the number of times each word appears in a document.
Frequencies. Calculate the frequency with which each word appears in a document, out of all the words in the document.
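A minimal sketch of both scorings with Python's collections.Counter (the tokenized document is illustrative):

from collections import Counter

words = ["the", "office", "building", "is", "open", "today", "the", "office"]

counts = Counter(words)                  # how many times each word appears
total = sum(counts.values())
frequencies = {word: count / total for word, count in counts.items()}  # counts normalized by document length

print(counts)
print(frequencies)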
7. TF-IDF
One problem with scoring word frequency is that the most frequent words in a document start to have the highest scores. These frequent words may not carry as much "informational gain" for the model as some rarer, domain-specific words. One approach to fix that problem is to penalize words that are frequent across all the documents. This approach is called TF-IDF.

TF-IDF, short for term frequency-inverse document frequency, is a statistical measure used to evaluate the importance of a word to a document in a collection or corpus.

The TF-IDF scoring value increases proportionally to the number of times a word
appears in the document, but it is offset by the number of documents in the corpus
that contain the word.
Term Frequency (TF): a scoring of the frequency of the word in the current document.

TF(t, d) = (number of times term t appears in document d) / (total number of terms in document d)

Inverse Document Frequency (IDF): a scoring of how rare the word is across documents.

IDF(t) = log(total number of documents / number of documents containing term t)

Finally, we can use the previous formulas to calculate the TF-IDF score for a given term like this:

TF-IDF(t, d) = TF(t, d) * IDF(t)
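As a quick illustration (the numbers are made up): suppose a term appears 3 times in a 100-word document, so TF = 3 / 100 = 0.03, and it appears in 10 out of 1,000 documents, so IDF = log(1000 / 10), which is roughly 4.6 with the natural logarithm. The TF-IDF score is then about 0.03 * 4.6 = 0.14. Note that libraries such as scikit-learn use slightly different, smoothed variants of these formulas.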
Example
In Python, we can use the TfidfVectorizer class from the sklearn library to
calculate the TF-IDF scores for given documents. Let’s use the same sentences that
we have used with the bag-of-words example.
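A minimal sketch with scikit-learn's TfidfVectorizer, reusing the illustrative documents from the bag-of-words sketch above:

from sklearn.feature_extraction.text import TfidfVectorizer

documents = ["I like this movie, it's funny.",
             "I hate this movie.",
             "This was awesome! I like it.",
             "Nice one. I love it."]

# fit_transform learns the vocabulary and IDF weights, then returns the TF-IDF matrix.
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(documents)

print(vectorizer.get_feature_names_out())
print(tfidf.toarray())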

Summary
In this blog post, you learned the basics of NLP for text. More specifically, you have learned the following concepts:

NLP is used to apply machine learning algorithms to text and speech.
NLTK (Natural Language Toolkit) is a leading platform for building Python programs to work with human language data.
Sentence tokenization is the problem of dividing a string of written language into its component sentences.
Word tokenization is the problem of dividing a string of written language into its component words.
The goal of both stemming and lemmatization is to reduce inflectional forms and sometimes derivationally related forms of a word to a common base form.
Stop words are words which are filtered out before or after the processing of text. They usually refer to the most common words in a language.
A regular expression is a sequence of characters that define a search pattern.
The bag-of-words model is a popular and simple feature extraction technique used when we work with text. It describes the occurrence of each word within a document.
TF-IDF is a statistical measure used to evaluate the importance of a word to a document in a collection or corpus.

Awesome! Now we know the basics of how to extract features from text. We can use these features as input for machine learning algorithms.
