Unit 1
For example, we can use NLP to build systems for speech recognition, document
summarization, machine translation, spam detection, named entity recognition,
question answering, autocomplete, predictive typing, and so on.
Sentence Tokenization
Word Tokenization
Text Lemmatization and Stemming
Stop Words
Regex
Bag-of-Words
TF-IDF
1. Sentence Tokenization
Sentence tokenization (also called sentence segmentation) is the problem of
dividing a string of written language into its component sentences. The idea looks
very simple: in English and some other languages, we can split the sentences apart
whenever we see a sentence-terminating punctuation mark.
However, even in English, this problem is not trivial due to the use of the full
stop character in abbreviations. When processing plain text, tables of
abbreviations that contain periods can help us prevent incorrect assignment of
sentence boundaries. In many cases, we use libraries to do that job for us, so
don't worry too much about the details for now.
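For instance, here is a minimal sketch using NLTK's sent_tokenize, which ships with a pretrained model that knows about common abbreviations; the sample text is made up for illustration.

    import nltk

    # One-time download of the pretrained Punkt sentence tokenizer
    # (newer NLTK releases may need "punkt_tab" instead).
    nltk.download("punkt", quiet=True)

    from nltk.tokenize import sent_tokenize

    text = "Mr. Smith bought a car for $20,000. He plans to drive it to St. Louis."
    for sentence in sent_tokenize(text):
        print(sentence)

    # The periods in "Mr." and "St." should not trigger sentence breaks,
    # so this prints two sentences.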
2. Word Tokenization
Word tokenization (also called word segmentation) is the problem of dividing a
string of written language into its component words. In English, words are usually
separated by spaces, but we can still have problems if we only split by space to
achieve the wanted results. Some English compound nouns are variably written and
sometimes contain a space (for example, "ice cream"). In most cases, we use a
library to achieve the wanted results, so again, don't worry too much about the
details.
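A minimal sketch with NLTK's word_tokenize (the sample sentence is made up); note that a compound noun like "New York" still comes out as two separate tokens:

    import nltk
    nltk.download("punkt", quiet=True)  # models used by word_tokenize

    from nltk.tokenize import word_tokenize

    print(word_tokenize("We'll visit New York next week."))
    # e.g. ['We', "'ll", 'visit', 'New', 'York', 'next', 'week', '.']

Notice that the tokenizer also splits the contraction "We'll" into two tokens, something a plain split-by-space would miss.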
3. Text Lemmatization and Stemming
For grammatical reasons, documents can contain different forms of a word, such as
drive, drives, and driving. Also, we sometimes have related words with a similar
meaning, such as nation, national, and nationality.
The goal of both stemming and lemmatization is to reduce inflectional forms and
sometimes derivationally related forms of a word to a common base form.
Stemming usually refers to a crude heuristic process that chops off the ends of
words in the hope of achieving this goal correctly most of the time, and often
includes the removal of derivational affixes.
Lemmatization usually refers to doing things properly with the use of a vocabulary
and morphological analysis of words, normally aiming to remove inflectional endings
only and to return the base or dictionary form of a word, which is known as the
lemma.
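A minimal sketch contrasting the two approaches with NLTK's PorterStemmer and WordNetLemmatizer, using the word forms mentioned above:

    import nltk
    nltk.download("wordnet", quiet=True)  # data used by the WordNet lemmatizer

    from nltk.stem import PorterStemmer, WordNetLemmatizer

    stemmer = PorterStemmer()
    lemmatizer = WordNetLemmatizer()

    # Inflected forms of the same verb reduce to the same base form.
    for word in ["drive", "drives", "driving"]:
        print(word, "->", stemmer.stem(word), "/", lemmatizer.lemmatize(word, pos="v"))

    # Derivationally related words: the crude stemmer chops them down,
    # while the lemmatizer keeps the dictionary form.
    print(stemmer.stem("nationality"))          # e.g. 'nation'
    print(lemmatizer.lemmatize("nationality"))  # 'nationality'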
4. Regex
We can use regular expressions (regex) to apply additional filtering to our text.
For example, we can remove all non-word characters. In many cases, we don't need
the punctuation marks, and they are easy to remove with regex.
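A minimal sketch with Python's built-in re module (the sample string is made up):

    import re

    text = "Hello!!! This sentence is, say, 95% noise-free... right?"

    # Keep word characters and whitespace; remove everything else.
    cleaned = re.sub(r"[^\w\s]", "", text)
    print(cleaned)  # Hello This sentence is say 95 noisefree right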
5. Bag-of-Words
A bag-of-words model represents a text by the words it contains, ignoring grammar
and word order. First, we design a vocabulary of known words; next, we need to
score the words in each document. The task here is to convert each raw text into a
vector of numbers, which we can then use as input for a machine learning model. The
simplest scoring method is to mark the presence of words: 1 for present and 0 for
absent.
Now, let's see how we can create a bag-of-words model using scikit-learn's
CountVectorizer class.
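A minimal sketch, assuming scikit-learn is installed (the two sample documents are made up, and get_feature_names_out requires scikit-learn 1.0 or newer):

    from sklearn.feature_extraction.text import CountVectorizer

    corpus = [
        "The office building is open today",
        "The office is closed",
    ]

    # Passing binary=True would score presence/absence (1/0) instead of counts.
    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(corpus)

    print(vectorizer.get_feature_names_out())
    # ['building' 'closed' 'is' 'office' 'open' 'the' 'today']
    print(X.toarray())
    # [[1 0 1 1 1 1 1]
    #  [0 1 1 1 0 1 0]]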
The complexity of the bag-of-words model comes in deciding how to design the
vocabulary of known words (tokens) and how to score the presence of known words.
In some cases, we have a huge amount of data, and the length of the vector that
represents a document might then reach thousands or millions of elements.
Furthermore, each document may contain only a few of the known words in the
vocabulary, so the vector representations will contain a lot of zeros. Vectors with
many zeros are called sparse vectors, and they require more memory and
computational resources.
We can decrease the number of known words when using a bag-of-words model to reduce
the required memory and computational resources. Before we create the model, we can
apply the text cleaning techniques we've already seen in this article, such as
lowercasing the text, removing punctuation, filtering out stop words, and reducing
words to their stems or lemmas, as in the sketch below.
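Here is a minimal cleaning pipeline combining those steps (the sample sentence is made up):

    import re
    import nltk
    nltk.download("stopwords", quiet=True)  # NLTK's list of English stop words

    from nltk.corpus import stopwords
    from nltk.stem import PorterStemmer

    stemmer = PorterStemmer()
    stop_words = set(stopwords.words("english"))

    def clean(text):
        text = text.lower()                                       # ignore case
        text = re.sub(r"[^\w\s]", "", text)                       # drop punctuation
        words = [w for w in text.split() if w not in stop_words]  # drop stop words
        return " ".join(stemmer.stem(w) for w in words)           # reduce to stems

    print(clean("The drivers were driving carefully in the rain!"))
    # e.g. 'driver drive care rain'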
A more sophisticated approach is to create a vocabulary of grouped words. An n-gram
is a sequence of n consecutive tokens; a 2-gram, more often called a bigram, is a
pair of consecutive words.
Example
Let's look at all the bigrams for the following sentence:
The office building is open today
the office
office building
building is
is open
open today
The bag-of-bigrams approach is more powerful than plain bag-of-words, because it
captures some of the word order and local context.
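A minimal sketch of a bag-of-bigrams vocabulary, again assuming scikit-learn; ngram_range=(2, 2) tells CountVectorizer to build the vocabulary from bigrams only:

    from sklearn.feature_extraction.text import CountVectorizer

    vectorizer = CountVectorizer(ngram_range=(2, 2))
    vectorizer.fit(["The office building is open today"])

    print(vectorizer.get_feature_names_out())
    # ['building is' 'is open' 'office building' 'open today' 'the office']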
Scoring Words
Once we have created our vocabulary of known words, we need to score the
occurrences of the words in our data. We have already seen one very simple
approach: the binary approach (1 for presence, 0 for absence).
6. TF-IDF
A more refined scoring method is TF-IDF (Term Frequency - Inverse Document
Frequency). The TF-IDF score of a word increases proportionally to the number of
times the word appears in the document, but it is offset by the number of documents
in the corpus that contain the word. It combines two components:
Term Frequency (TF): a scoring of how frequently the word appears in the current
document.
Inverse Document Frequency (IDF): a scoring of how rare the word is across the
documents in the corpus.
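A minimal sketch with scikit-learn's TfidfVectorizer (the sample corpus is made up):

    from sklearn.feature_extraction.text import TfidfVectorizer

    corpus = [
        "the office is open",
        "the office is closed",
        "the building is open today",
    ]

    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(corpus)

    print(vectorizer.get_feature_names_out())
    # ['building' 'closed' 'is' 'office' 'open' 'the' 'today']
    # Within each row, 'is' and 'the' (present in every document) get the
    # lowest weights, while words unique to one document get the highest.
    print(X.toarray().round(2))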
Summary
In this blog post, you learned the basics of NLP for text. More specifically, you
have learned the following concepts: sentence tokenization, word tokenization,
lemmatization and stemming, stop words, regex, bag-of-words, and TF-IDF.