
Data Science with Python

Natural Language Processing (NLP) with Scikit-Learn
Learning Objectives

By the end of this lesson, you will be able to:

Define natural language processing

Explain the importance of natural language processing

List the applications that use natural language processing

Outline the modules to load content and category

Apply feature extraction techniques

Implement the approaches of natural language processing


Introduction to Natural Language Processing
Natural Language Processing (NLP)

Natural language processing is an automated way to understand and analyze natural human languages and to extract information from such data by applying machine learning algorithms.

The overall flow: data from various sources, analyze human languages, apply machine learning algorithms and translations (mathematics and statistics), and extract information.
Natural Language Processing

NLP is also described as the field of computer science, or AI, concerned with extracting linguistic information from the underlying data.


Why Natural Language Processing

The world is now globally connected due to the advancement of technology and devices, which brings several challenges:

Analyzing tons of data

Identifying various languages

Applying quantitative analysis

Handling ambiguities
Why Natural Language Processing

NLP can achieve full automation by using modern software libraries, modules, and packages.

Full automation and intelligent processing build on knowledge about languages and the world, modern software libraries, and machine models.
NLP Terminology

Word boundaries: determines where one word ends and the next one begins

Tokenization: splits text into words, phrases, and idioms

Stemming: maps a word to its valid root word

Topic models: discover topics in a collection of documents

Disambiguation: determines the meaning and sense of words (context vs. intent)

Tf-idf: represents term frequency and inverse document frequency

Semantic analytics: compares words, phrases, and idioms in a set of documents to extract meaning
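
The sketch below is illustrative and not part of the original slides: it shows tokenization and stemming in practice using NLTK, one of the libraries introduced later in this lesson. It assumes nltk is installed and the required tokenizer resources have been downloaded with nltk.download().

# Illustrative sketch of tokenization and stemming with NLTK.
# Assumes: nltk is installed and nltk.download('punkt') has been run.
import nltk
from nltk.stem import PorterStemmer

text = "NLP models map words to their root forms and split text into tokens."

tokens = nltk.word_tokenize(text)            # tokenization: split text into word tokens
stemmer = PorterStemmer()
stems = [stemmer.stem(t) for t in tokens]    # stemming: map each token to a root form

print(tokens)
print(stems)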
NLP Approach for Text Data

Let us look at the Natural Language Processing approaches to analyze text data.

Conduct basic text processing

Categorize and tag words

Classify text

Extract information

Analyze sentence structure

Build feature-based structure

Analyze the meaning
NLP Environmental Setup

Problem Statement: Demonstrate the installation of the NLP environment

Access: Click on the Practice Labs tab on the left side panel of the LMS. Copy or note the
username and password that is generated. Click on the Launch Lab button. On the page that
appears, enter the username and password in the respective fields, and click Login.
Sentence Analysis

Problem Statement: Demonstrate how to perform sentence analysis

Access: Click on the Practice Labs tab on the left side panel of the LMS. Copy or note the
username and password that is generated. Click on the Launch Lab button. On the page that
appears, enter the username and password in the respective fields, and click Login.
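
For readers working outside the lab environment, the sketch below shows one way to perform a basic sentence analysis with NLTK. This is an assumption about the exercise, not the lab's own solution, and it requires NLTK's tokenizer and tagger resources to be downloaded.

# Rough sketch of sentence analysis with NLTK (assumes the 'punkt' and
# 'averaged_perceptron_tagger' resources have been downloaded).
import nltk

paragraph = ("Natural language processing analyzes human language. "
             "It extracts information from text data.")

for sentence in nltk.sent_tokenize(paragraph):   # split the paragraph into sentences
    words = nltk.word_tokenize(sentence)         # split each sentence into word tokens
    print(nltk.pos_tag(words))                   # tag each token with a part of speech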
Applications of NLP

Machine Translation
Machine translation is used to translate text from one language into another. Google Translate is an example; it uses NLP to translate input data from one language to another.

Speech Recognition
The speech recognition application understands human speech and uses it as input information. It is useful for applications like Siri, Google Now, and Microsoft Cortana.

Sentiment Analysis
Sentiment analysis is achieved by processing large amounts of data received from different interfaces and sources. For example, NLP uses social media activity to find the popular or important topics of discussion.
Major NLP Libraries

NLTK

Scikit-learn

TextBlob

spaCy
The Scikit-Learn Approach

It is a very powerful library with a set of modules to process and analyze natural language data, such as text and
images, and extract information using machine learning algorithms.

Built-in modules: contains built-in modules to load a dataset's content and categories.

Feature extraction: a way to extract information from data, which can be text or images.

Model training: analyzes the content based on particular categories and then trains a model accordingly.
Pipeline building mechanism: a technique to streamline the NLP process into stages.

Stages of the pipeline:
1. Vectorization
2. Transformation
3. Model training and application
Performance optimization: the stage in which the models are trained to optimize the overall process.

Grid search for finding good parameters: a powerful way to search the parameters that affect the outcome of model training.
Modules to Load Content and Category

Scikit-learn has many built-in datasets. There are several methods to load these datasets with the help of a data
load object.

A container folder holds the dataset, with each category (Category 1, Category 2, ...) as a subfolder; a data load object reads the content and categories.


Modules to Load Content and Category

The text files are loaded with the categories taken from the subfolder names. Features are then extracted from the loaded content into a NumPy array or a SciPy sparse matrix.
Modules to Load Content and Category

In [ ]: # Build a feature extraction transformer
        # (CountVectorizer or TfidfVectorizer are typical choices)
        from sklearn.feature_extraction.text import CountVectorizer
Modules to Load Content and Category

The attributes of a data load object are:

Bunch: the returned object contains fields that can be accessed as dict keys or as object attributes

target_names: holds the list of requested categories

data: refers to the loaded content held in memory


Modules to Load Content and Category

The example shows how a dataset can be loaded using Scikit-learn:

Import the dataset

Load dataset

Describe the dataset


Modules to Load Content and Category

Let us see how the type() function and the .data and .target attributes help in analyzing a dataset.

View type of dataset

View data

View target
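
Because the original screenshots are not reproduced here, the sketch below assumes the 20 newsgroups dataset, a common built-in scikit-learn text dataset, to illustrate loading a dataset and inspecting its type, .data, and .target attributes.

# Hedged sketch: load a built-in scikit-learn text dataset and inspect it.
# The 20 newsgroups dataset is assumed; it is downloaded on first use.
from sklearn.datasets import fetch_20newsgroups

categories = ['sci.space', 'rec.autos']          # requested categories
train = fetch_20newsgroups(subset='train', categories=categories)

print(type(train))            # a Bunch object
print(train.target_names)     # list of the requested categories
print(len(train.data))        # number of loaded documents
print(train.data[0][:200])    # first 200 characters of the first document
print(train.target[:10])      # integer category labels of the first 10 documents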
Feature Extraction

Feature extraction is a technique to convert content into numerical vectors so that machine learning can be performed.

Text feature extraction (for example: large datasets or document collections)

Image feature extraction (for example: patch extraction, hierarchical clustering)
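
As a small, hedged illustration of the image side (the slides only name patch extraction), scikit-learn's extract_patches_2d can sample fixed-size patches from an image array; the 8x8 array below is synthetic.

# Sketch of image feature extraction via patch extraction.
import numpy as np
from sklearn.feature_extraction.image import extract_patches_2d

image = np.arange(64, dtype=float).reshape(8, 8)   # synthetic 8x8 "image"
patches = extract_patches_2d(image, patch_size=(2, 2), max_patches=5, random_state=0)

print(patches.shape)    # (5, 2, 2): five 2x2 patches sampled from the image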


Bag of Words

Bag of words is used to convert text data into numerical feature vectors with a fixed size.

1. Tokenizing: assign a fixed integer id to each word
2. Counting: count the number of occurrences of each word
3. Storing: store the count value as the feature

The result is a document-term matrix over the corpus of documents (a sketch follows below), for example:

              Token 1   Token 2   Token 3   Token 4
Document 1         42        32       119         3
Document 2       1118         0         0        89
Document 3          0         0         0        55
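
A minimal sketch of these three steps with scikit-learn's CountVectorizer, on a tiny made-up corpus:

# Bag of words: tokenize, count, and store occurrence counts as features.
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs",
]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(corpus)   # sparse document-term matrix

print(vectorizer.vocabulary_)               # token -> fixed integer id
print(counts.toarray())                     # occurrence counts per document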
CountVectorizer Class Signature

class sklearn.feature_extraction.text.CountVectorizer(input='content', encoding='utf-8',
    decode_error='strict', strip_accents=None, lowercase=True, preprocessor=None,
    tokenizer=None, stop_words=None, token_pattern='(?u)\b\w\w+\b', ngram_range=(1, 1),
    analyzer='word', max_df=1.0, min_df=1, max_features=None, vocabulary=None,
    binary=False, dtype=<class 'numpy.int64'>)

input: file name or sequence of strings
encoding: encoding used to decode the input
strip_accents: removes accents
tokenizer: overrides the string tokenizer
stop_words: built-in stop words list
max_df: maximum document-frequency threshold
min_df: minimum document-frequency threshold
max_features: specifies the number of features (terms) to keep
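
The short sketch below shows how a few of these parameters change the extracted vocabulary; the corpus and parameter values are purely illustrative.

# Illustrative use of CountVectorizer parameters.
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "Natural language processing with Python",
    "Processing natural language data with scikit-learn",
]

vectorizer = CountVectorizer(
    lowercase=True,           # normalize case before tokenizing
    stop_words='english',     # use the built-in English stop words list
    ngram_range=(1, 2),       # extract unigrams and bigrams
    min_df=1,                 # minimum document-frequency threshold
    max_features=10,          # keep at most 10 terms
)
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out())   # get_feature_names() in older scikit-learn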
Bag of Words

Problem Statement: Demonstrate the Bag of Words technique

Access: Click on the Practice Labs tab on the left side panel of the LMS. Copy or note the
username and password that is generated. Click on the Launch Lab button. On the page that
appears, enter the username and password in the respective fields, and click Login.
Text Feature Extraction Considerations

Sparse: This utility deals with sparse matrices while storing them in memory. Sparse data is commonly encountered when extracting feature values, especially for large document datasets.

Vectorizer: It implements tokenization and occurrence counting. Words with a minimum of two letters get tokenized. We can use the analyzer function to vectorize the text data.

Tf-idf: It is a term-weighting utility for term frequency and inverse document frequency. Term frequency indicates how often a particular term occurs in a document. Inverse document frequency is a factor that diminishes the weight of terms that occur frequently across documents.

Decoding: This utility can decode text files if their encoding is specified.
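
A minimal sketch of the tf-idf re-weighting described above, applied on top of raw counts:

# Re-weight raw token counts with term frequency-inverse document frequency.
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

corpus = [
    "spam spam spam offer",
    "meeting agenda for monday",
    "offer expires monday",
]

counts = CountVectorizer().fit_transform(corpus)
tfidf = TfidfTransformer().fit_transform(counts)   # frequent-everywhere terms are down-weighted

print(tfidf.toarray().round(2))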
Model Training

An important task in model training is to identify the right model for the given dataset. The choice of model depends entirely on the type of dataset.

Supervised: Models predict the outcome of new observations and datasets, and classify documents based on the features and responses of a given dataset.
Example: Naïve Bayes, SVM, linear regression, k-nearest neighbors (k-NN)

Unsupervised: Models identify patterns in the data and extract its structure. They are also used to group documents using clustering algorithms (see the sketch below).
Example: K-means
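
As a hedged sketch of the unsupervised case, K-means can group documents by their tf-idf features; the documents and cluster count below are illustrative, and the cluster ids are arbitrary labels rather than named categories.

# Group documents with K-means clustering on tf-idf features.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["stock markets fell today", "shares and bonds rallied",
        "the team won the final match", "a late goal decided the game"]

X = TfidfVectorizer().fit_transform(docs)
km = KMeans(n_clusters=2, random_state=0, n_init=10)
print(km.fit_predict(X))    # e.g. [0 0 1 1]: two discovered groups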
Naïve Bayes Classifier

It is the most basic technique for the classification of text.

Advantages:
• It is efficient, as it uses limited CPU and memory.
• It is fast, as model training takes less time.

Uses:
• Naïve Bayes is used for sentiment analysis, email spam detection, categorization of documents, and language detection.
• Multinomial Naïve Bayes is used when multiple occurrences of the words matter.
Naïve Bayes Classifier

Let us take a look at the signature of the multinomial Naïve Bayes classifier:

class sklearn.naive_bayes.MultinomialNB(alpha=1.0, fit_prior=True, class_prior=None)

alpha: smoothing parameter (0 for no smoothing)
fit_prior: whether to learn class prior probabilities
class_prior: prior probabilities of the classes
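
A minimal sketch of training and using a multinomial Naïve Bayes classifier on bag-of-words counts; the labelled texts are made up for illustration.

# Train a multinomial Naive Bayes classifier on bag-of-words features.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = ["win a free prize now", "cheap offer click now",
         "project meeting at noon", "lunch with the team"]
labels = [1, 1, 0, 0]                    # 1 = spam, 0 = not spam (illustrative)

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)

clf = MultinomialNB(alpha=1.0)           # alpha is the smoothing parameter
clf.fit(X, labels)

print(clf.predict(vectorizer.transform(["free offer now"])))   # expected: [1]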
Grid Search and Multiple Parameters

Document classifiers can have many parameters. A grid search approach helps find the best parameters for model training and for predicting the outcome accurately.

A document classifier takes the extracted features of a document and assigns the document to a category (Category 1, Category 2, ...).
Grid Search and Multiple Parameters

The grid searcher evaluates the document classifier over the candidate parameters and returns the best parameter combination.
Grid Search and Multiple Parameters

In the grid search mechanism, the candidate parameter values (Parameter 1, Parameter 2, Parameter 3, ...) are laid out as a grid, and the search can be run over the entire grid or over a combination of parameters; the grid searcher returns the best parameter.
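
A hedged sketch of such a search with scikit-learn's GridSearchCV; the parameter values and the tiny labelled corpus are purely illustrative.

# Search candidate parameter values for a document classifier.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB

texts = ["good product", "bad service", "great value", "poor quality",
         "excellent support", "terrible experience"]
labels = [1, 0, 1, 0, 1, 0]

X = CountVectorizer().fit_transform(texts)

param_grid = {"alpha": [0.1, 0.5, 1.0]}                # candidate smoothing values
search = GridSearchCV(MultinomialNB(), param_grid,
                      cv=2, n_jobs=-1)                 # n_jobs=-1: use all CPU cores
search.fit(X, labels)

print(search.best_params_)                             # best parameter found on the grid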
Pipeline

A pipeline is a combination of vectorizers, transformers, and model training.

Vectorizer: converts a collection of text documents into a numerical feature vector and extracts features around the word of interest.

Transformer (tf-idf): re-weights the raw counts using term frequency and inverse document frequency.

Model training (document classifiers): helps the model predict accurately.
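
A minimal sketch of such a pipeline, chaining a vectorizer, a tf-idf transformer, and a document classifier; the labelled texts are made up for illustration.

# Pipeline: vectorizer -> tf-idf transformer -> document classifier.
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

texts = ["free prize offer", "meeting agenda attached",
         "cheap deal click now", "see you at the team lunch"]
labels = [1, 0, 1, 0]                         # illustrative spam / not-spam labels

text_clf = Pipeline([
    ("vect", CountVectorizer()),              # text -> token counts
    ("tfidf", TfidfTransformer()),            # counts -> tf-idf weights
    ("clf", MultinomialNB()),                 # weighted features -> predicted class
])

text_clf.fit(texts, labels)
print(text_clf.predict(["free deal now", "agenda for the meeting"]))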
Pipeline and Grid Search

Problem Statement: Demonstrate the Pipeline and Grid Search technique.

Access: Click on the Practice Labs tab on the left side panel of the LMS. Copy or note the
username and password that is generated. Click on the Launch Lab button. On the page that
appears, enter the username and password in the respective fields, and click Login.
Analyzing the Spam Collection Dataset

Problem Statement:

Analyze the given Spam Collection dataset to:


1. View information on the spam data
2. View the length of the messages
3. Define a function to eliminate stop words
4. Apply Bag of Words
5. Apply the tf-idf transformer
6. Detect spam with the Naïve Bayes model
Analyzing the Spam Collection Dataset

Instructions on performing the assignment:


• Download the Spam Collection dataset from the “Resource” tab. Upload it using the right syntax to use and analyze it.

Common instructions:
• If you are new to Python, download the “Anaconda Installation Instructions” document from the “Resources” tab to view the steps for installing Anaconda and the Jupyter notebook.
• Download the “Assignment 01” notebook and upload it on the Jupyter notebook to access it.
•Follow the provided cues to complete the assignment.
Analyzing the Sentiment Dataset using NLP

Problem Statement:

Analyze the Sentiment dataset using NLP to:


1. View the observations
2. Verify the length of the messages and add it as a new column
3. Apply a transformer and fit the data in the bag of words
4. Print the shape for the transformer
5. Check the model for predicted and expected values
Analyzing the Sentiment Dataset using NLP

Instructions on performing the assignment:


• Download the Sentiment dataset from the “Resource” tab. Upload it to your Jupyter
notebook to work on it.

Common instructions:
• If you are new to Python, download the “Anaconda Installation Instructions” document
from the “Resources” tab to view the steps for installing Anaconda and the Jupyter
notebook.
• Download the “Assignment 02” notebook and upload it on the Jupyter notebook to
access it.
• Follow the provided cues to complete the assignment.
Key Takeaways

You are now able to:

Define natural language processing

Explain the importance of natural language processing

List the applications that use natural language processing

Outline the modules to load content and category

Apply feature extraction techniques

Implement the approaches of natural language processing


Knowledge Check

1. In NLP, tokenization is a way to _______________________.

a. Find the grammar of the text

b. Analyze the sentence structure

c. Find ambiguities

d. Split text data into words, phrases, and idioms


Knowledge Check

1. In NLP, tokenization is a way to _______________________.

a. Find the grammar of the text

b. Analyze the sentence structure

c. Find ambiguities

d. Split text data into words, phrases, and idioms

The correct answer is d.

Splitting text data into words, phrases, and idioms is known as tokenization, and each individual word is known as a token.
Knowledge Check

2. What is the tf-idf value in a document?

a. Directly proportional to the number of times a word appears

b. Inversely proportional to the number of times a word appears

c. Offset by frequency of the words in corpus

d. Increase with frequency of the words in corpus


Knowledge Check

2. What is the tf-idf value in a document?

a. Directly proportional to the number of times a word appears

b. Inversely proportional to the number of times a word appears

c. Offset by frequency of the words in corpus

d. Increase with frequency of the words in corpus

The correct answers are a and c.

The tf-idf value reflects how important a word is to a document. It is directly proportional to the number of times the word appears in the document and is offset by the frequency of the word in the corpus.
Knowledge Check

3. In grid search, if n_jobs = -1, then which of the following is correct?

a. Uses only 1 CPU core

b. Detects all installed cores and uses them all

c. Searches for only one parameter

d. All parameters will be searched on a given grid


Knowledge Check

3. In grid search, if n_jobs = -1, then which of the following is correct?

a. Uses only 1 CPU core

b. Detects all installed cores and uses them all

c. Searches for only one parameter

d. All parameters will be searched on a given grid

The correct answer is b.

Setting n_jobs = -1 detects all installed cores on the machine and uses all of them.
Knowledge Check

4. Identify the correct example of Topic Modeling from the following options:

a. Machine translation

b. Speech recognition

c. News aggregators

d. Sentiment analysis
Knowledge Check

4. Identify the correct example of Topic Modeling from the following options:

a. Machine translation

b. Speech recognition

c. News aggregators

d. Sentiment analysis

The correct answer is c.

A topic model is a statistical model used to find latent groupings in documents based on their words. News aggregators are an example.
Knowledge Check

5. How do we save memory while operating on Bag of Words, which typically contains high-dimensional sparse datasets?

a. Distribute datasets in several blocks or chunks

b. Store only non-zero parts of the feature vectors

c. Flatten the dataset

d. Decode them
Knowledge Check

5. How do we save memory while operating on Bag of Words, which typically contains high-dimensional sparse datasets?

a. Distribute datasets in several blocks or chunks

b. Store only non-zero parts of the feature vectors

c. Flatten the dataset

d. Decode them

The correct answer is b.

In the feature vectors, many values will be zero. The best way to save memory is to store only the non-zero parts of the feature vectors.
Knowledge Check

6. What is the function of the sub-module feature_extraction.text.CountVectorizer?

a. Convert a collection of text documents to a matrix of token counts

b. Convert a collection of text documents to a matrix of token occurrences

c. Transform a count matrix to a normalized form

d. Convert a collection of raw documents to a matrix of TF-IDF features


Knowledge Check

6. What is the function of the sub-module feature_extraction.text.CountVectorizer?

a. Convert a collection of text documents to a matrix of token counts

b. Convert a collection of text documents to a matrix of token occurrences

c. Transform a count matrix to a normalized form

d. Convert a collection of raw documents to a matrix of TF-IDF features

The correct answer is a.

The function of the sub-module feature_extraction.text.CountVectorizer is to convert a collection of text documents to a matrix of token counts.
Thank You
