Data Science With Python - Lesson 09 - Data Science With Python - NLP PDF
Natural language processing (NLP) is an automated way to understand and analyze natural human languages and extract information from such data by applying machine learning algorithms.
Why Natural Language Processing

The world is now connected globally due to the advancement of technology and devices. NLP helps to extract information and handle ambiguities in natural language, and it can achieve full automation and intelligent processing by using modern software libraries, modules, and packages.
Key concepts in NLP:

• Word boundaries: determine where one word ends and the other begins.
• Tokenization: splits words, phrases, and idioms.
• Disambiguation (semantic analytics): determines the meaning and sense of words (context vs. intent).
• Tf-idf: represents term frequency and inverse document frequency.
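As a minimal sketch of tokenization, the snippet below splits text into word tokens using the same default token pattern scikit-learn's vectorizers use (words of two or more alphanumeric characters); the example sentence is hypothetical:

```python
import re

def tokenize(text):
    # Default token pattern used by scikit-learn's CountVectorizer:
    # keep runs of two or more word characters, drop punctuation.
    return re.findall(r"(?u)\b\w\w+\b", text.lower())

tokens = tokenize("NLP splits words, phrases, and idioms into tokens.")
print(tokens)
```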
Let us look at the Natural Language Processing approaches to analyze text data: analyzing sentence structure and extracting information.
NLP Environment Setup
Access: Click on the Practice Labs tab on the left side panel of the LMS. Copy or note the
username and password that is generated. Click on the Launch Lab button. On the page that
appears, enter the username and password in the respective fields, and click Login.
Sentence Analysis
Applications of NLP

• Machine Translation: Machine translation is used to translate one language into another. Google Translate is an example. It uses NLP to translate the input data from one language to another.
• Speech Recognition: The speech recognition application understands human speech and uses it as input information. It is useful for applications like Siri, Google Now, and Microsoft Cortana.
• Sentiment Analysis
Major NLP Libraries

• NLTK
• Scikit-learn
• TextBlob
• spaCy
The Scikit-Learn Approach

Scikit-learn is a very powerful library with a set of modules to process and analyze natural language data, such as texts and images, and extract information using machine learning algorithms. It also provides a pipeline-building mechanism.
Scikit-learn has many built-in datasets. There are several methods to load these datasets with the help of a data load object. Text datasets are typically organized as a container folder with one subfolder per category, and the loaded data is represented as NumPy arrays or SciPy matrices.
Modules to Load Content and Category

The data load object exposes the requested categories through its target names. Let us see how functions like type, .data, and .target help in analyzing a dataset: load the dataset, view the data, and view the target.
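A minimal sketch of loading a category-per-folder text dataset with scikit-learn's load_files; the container folder, category names, and documents below are hypothetical and created on the fly so the example is self-contained:

```python
import os
import tempfile
from sklearn.datasets import load_files

# Build a tiny container folder: one subfolder per category,
# one text file per document (hypothetical example data).
root = tempfile.mkdtemp()
docs = {
    "sports": ["The team won the match.", "A great goal in the final."],
    "tech": ["The new CPU is fast.", "This library parses text."],
}
for category, texts in docs.items():
    os.makedirs(os.path.join(root, category))
    for i, text in enumerate(texts):
        with open(os.path.join(root, category, f"{i}.txt"), "w") as f:
            f.write(text)

dataset = load_files(root, encoding="utf-8")
print(type(dataset))         # a Bunch (dict-like) data load object
print(dataset.target_names)  # the category names
print(len(dataset.data))     # number of raw documents
print(dataset.target)        # one integer label per document
```

The same .data / .target / .target_names attributes appear on scikit-learn's built-in text datasets such as the one returned by fetch_20newsgroups.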
Feature Extraction
Feature extraction is a technique to convert content into numerical vectors in order to perform machine learning. The bag-of-words model is used to convert text data into numerical feature vectors with a fixed size.
The CountVectorizer class signature:

class sklearn.feature_extraction.text.CountVectorizer(input='content', encoding='utf-8', decode_error='strict', strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, stop_words=None, ...)

• input: a file name or a sequence of strings
• encoding: encoding used to decode the input
• strip_accents: removes accents
• tokenizer: overrides the string tokenizer
• stop_words: built-in stop words list
Text Feature Extraction Considerations

• Sparse: This utility stores matrices in memory as sparse matrices. Sparse data is commonly noticed when extracting feature values, especially for large document datasets.
• Decoding: This utility can decode text files if their encoding is specified.
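To illustrate why sparse storage saves memory, the sketch below builds a small, mostly-zero count matrix (hypothetical values) and shows that SciPy's CSR format keeps only the non-zero entries:

```python
import numpy as np
from scipy.sparse import csr_matrix

# A mostly-zero count matrix, like those produced by CountVectorizer.
dense = np.array([[0, 0, 3, 0, 0],
                  [1, 0, 0, 0, 2]])
sparse = csr_matrix(dense)

# Only the non-zero values (and their positions) are stored.
print(sparse.nnz)   # 3 stored values out of 10 cells
print(sparse.data)  # the stored non-zero values
```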
Model Training
An important task in model training is to identify the right model for the given dataset. The choice of model
completely depends on the type of dataset.
• Supervised: Models predict the outcome of new observations and datasets, and classify documents based on the features and response of a given dataset. Examples: Naïve Bayes, SVM, linear regression, k-nearest neighbors.
• Unsupervised: Models identify patterns in the data and extract its structure. They are also used to group documents using clustering algorithms. Example: K-means.
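As a minimal sketch of the unsupervised case, K-means can group tf-idf vectors of documents into clusters; the four example documents are hypothetical:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = [
    "the car engine needs repair",
    "the new car has a powerful engine",
    "the rocket launched into orbit",
    "astronauts reached the space station",
]
X = TfidfVectorizer().fit_transform(docs)  # documents -> tf-idf vectors

# Group the documents into two clusters.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)  # one cluster id per document
```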
Naïve Bayes Classifier
Advantages:
• It is efficient as it uses limited CPU and memory.
• It is fast as the model training takes less time.

Uses:
• Naïve Bayes is used for sentiment analysis, email spam detection, categorization of documents, and language detection.
• Multinomial Naïve Bayes is used when multiple occurrences of the words matter.
Let us take a look at the signature of the multinomial Naïve Bayes classifier:

sklearn.naive_bayes.MultinomialNB(alpha=1.0, fit_prior=True, class_prior=None)

• alpha: smoothing parameter (0 for no smoothing)
• class_prior: prior probabilities of the classes
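A minimal sketch of training and using MultinomialNB for spam detection; the training documents and labels are hypothetical:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train_docs = ["free prize money now", "win cash prize",
              "meeting at noon", "project schedule update"]
labels = ["spam", "spam", "ham", "ham"]

# Convert the documents to count vectors.
vec = CountVectorizer()
X = vec.fit_transform(train_docs)

# alpha=1.0 applies Laplace smoothing (0 would mean no smoothing).
clf = MultinomialNB(alpha=1.0)
clf.fit(X, labels)

pred = clf.predict(vec.transform(["claim your free prize"]))
print(pred)
```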
Grid Search and Multiple Parameters
Document classifiers can have many parameters. A grid search approach helps to find the best parameters for model training and for predicting the outcome accurately.
Grid Search and Multiple Parameters

In the grid search mechanism, the parameter space is divided into multiple grids, and a search can be run on the entire grid or a combination of grids to find the best parameter.
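A short sketch of GridSearchCV searching over the Naïve Bayes smoothing parameter; the documents, labels, and alpha grid are hypothetical, and n_jobs=-1 uses every available CPU core:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import GridSearchCV

docs = ["free prize money", "win cash now", "cheap prize offer",
        "meeting at noon", "project schedule", "lunch with team"]
labels = [1, 1, 1, 0, 0, 0]
X = CountVectorizer().fit_transform(docs)

# Try each alpha value with 2-fold cross-validation;
# n_jobs=-1 runs the fits on all installed cores.
grid = GridSearchCV(MultinomialNB(),
                    param_grid={"alpha": [0.1, 0.5, 1.0]},
                    cv=2, n_jobs=-1)
grid.fit(X, labels)
print(grid.best_params_)  # the best alpha found
```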
Pipeline

A pipeline chains the processing steps together: it converts a collection of text documents into a numerical feature vector, extracts features around the word of interest, and helps the model predict accurately.
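The steps above can be sketched as a scikit-learn Pipeline that chains vectorizer, tf-idf weighting, and classifier into one estimator; the documents and labels are hypothetical:

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

# Chain vectorizer -> tf-idf -> classifier into a single estimator.
text_clf = Pipeline([
    ("vect", CountVectorizer()),    # text -> term-count matrix
    ("tfidf", TfidfTransformer()),  # counts -> tf-idf weights
    ("clf", MultinomialNB()),       # the classifier
])

docs = ["free prize money now", "win cash prize",
        "meeting at noon", "project schedule update"]
labels = ["spam", "spam", "ham", "ham"]

# fit() and predict() run every step in order, on raw text.
text_clf.fit(docs, labels)
pred = text_clf.predict(["win free money"])
print(pred)
```

A pipeline like this can also be passed directly to GridSearchCV, so vectorizer and classifier parameters are searched together.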
Pipeline and Grid Search
Analyzing the Spam Collection Dataset
Problem Statement:
Common instructions:
• If you are new to Python, download the “Anaconda Installation Instructions” document from the “Resources” tab to view the steps for installing Anaconda and the Jupyter notebook.
• Download the “Assignment 01” notebook and upload it on the Jupyter notebook to access it.
• Follow the provided cues to complete the assignment.
Analyzing the Sentiment Dataset using NLP
Problem Statement:
Common instructions:
• If you are new to Python, download the “Anaconda Installation Instructions” document from the “Resources” tab to view the steps for installing Anaconda and the Jupyter notebook.
• Download the “Assignment 02” notebook and upload it on the Jupyter notebook to access it.
• Follow the provided cues to complete the assignment.
Key Takeaways
c. Find ambiguities
Splitting text data into words, phrases, and idioms is known as tokenization, and each individual word is known as a token.
Knowledge Check 2

What is the tf-idf value in a document?

The tf-idf value reflects how important a word is to a document. It is directly proportional to the number of times the word appears in the document and is offset by the frequency of the word in the corpus.
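That offsetting effect can be seen directly in TfidfVectorizer's learned idf weights; the three-document corpus below is hypothetical:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["the cat sat", "the dog sat", "the cat ran"]
vec = TfidfVectorizer()
vec.fit(corpus)

# 'the' appears in every document, so its idf weight is the lowest;
# a rarer word like 'dog' gets a higher weight.
vocab = vec.vocabulary_
print(vec.idf_[vocab["the"]])
print(vec.idf_[vocab["dog"]])
```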
Knowledge Check 3

In grid search, if n_jobs = -1, then which of the following is correct?

It detects all installed cores on the machine and uses all of them.
Knowledge Check 4

Identify the correct example of topic modeling from the following options:

a. Machine translation
b. Speech recognition
c. News aggregators
d. Sentiment analysis

Answer: c. News aggregators. A topic model is a statistical model used to find latent groupings in documents based on the words they contain; news aggregators are an example.
Knowledge Check 5

How do we save memory while operating on bag of words, which typically contains high-dimensional sparse datasets?

d. Decode them

In a feature vector, there will be several values that are zeros. The best way to save memory is to store only the non-zero parts of the feature vectors.
Knowledge Check 6

What is the function of the sub-module feature_extraction.text.CountVectorizer?