Machine Learning: Suprit Shrestha (19BCE2584)

DA 1
Machine Learning
Suprit Shrestha (19BCE2584)
CONTENTS
PART 1 (NMF)
Import the Python Libraries you will need
Select Data and Import
Isolate Data to Topic Model
Clean Data
Create Function to Pre-process Data
Create Document Term Matrix ‘V’
Create Function to Display Topics
Run NMF on Document Term Matrix ‘V’
Iterate until you find useful Topics
PART 2 (Rest Work)
Data Set
Data Preprocessing
Data Sanitation
Calculation of TF-IDF (Term Frequency-Inverse Document Frequency)
Running LDA using Bag of Words
Running LDA using TF-IDF
Accuracy
Precision
Recall
F1-score
Splitting the dataset
Using different models
LDA
LSA
PLSA
NNMF
LDA2VEC
Hybrid model (LDA + PLSA)
Justification
PART 1 (NMF)
IMPORT THE PYTHON LIBRARIES YOU WILL NEED
You'll need the following items to manipulate and export data frames:
• Pandas
For modelling, you'll need the following items from Scikit-Learn:
• TfidfVectorizer
• NMF
• text (the sklearn.feature_extraction.text module, which provides the built-in English stop word list)
You may require the following from nltk for text processing:
• stopwords
• word_tokenize
• pos_tag
You may require the following items to clean your text:
• regular expressions
• string
SELECT DATA AND IMPORT
In this step we select and import our data. In this case I have chosen abcnews-date-text.csv.
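A minimal sketch of the import step with pandas. The column names publish_date and headline_text are an assumption about the layout of abcnews-date-text.csv, and the three rows are invented stand-ins so that the snippet runs without the file.

```python
import io

import pandas as pd

# Invented stand-in rows; the column names are assumed, not taken from the report.
sample_csv = io.StringIO(
    "publish_date,headline_text\n"
    "20030219,council chief backs water plan\n"
    "20030220,police investigate storm damage\n"
    "20030221,farmers welcome rain after drought\n"
)

# With the real file this would be: df = pd.read_csv("abcnews-date-text.csv")
df = pd.read_csv(sample_csv)

print(df.shape)
print(df.head())
```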
ISOLATE DATA TO TOPIC MODEL

Next, we load the dataset into a data frame and isolate the first term. This ensures that the
documents are of similar length.
CLEAN DATA
Now we make the text as comparable as possible by removing punctuation, capitalization, numbers, and
strange characters. These steps are wrapped in a function, which is then applied to the text.
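One possible shape for such a cleaning function, using only the standard library. The name clean_text and the exact order of the steps are illustrative assumptions, not the function used in this assignment.

```python
import re
import string


def clean_text(text: str) -> str:
    """Lower-case the text and strip digits, punctuation, and non-letter characters."""
    text = text.lower()
    # Drop any token that contains a digit (e.g. "2", "1990s").
    text = re.sub(r"\w*\d\w*", " ", text)
    # Replace ASCII punctuation with spaces.
    text = re.sub(f"[{re.escape(string.punctuation)}]", " ", text)
    # Replace anything that is not a lower-case ASCII letter or whitespace.
    text = re.sub(r"[^a-z\s]", " ", text)
    # Collapse the runs of whitespace left behind by the substitutions.
    return re.sub(r"\s+", " ", text).strip()


print(clean_text("Police probe 2 Sydney break-ins!"))
```

In a data frame, the function could be applied to a text column with the column's apply method.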
CREATE FUNCTION TO PRE-PROCESS DATA
All of the words in the text are lemmatized here, meaning that different forms of the same word
are reduced to a base form, such as plural nouns becoming singular and all verb tenses becoming
present. This shrinks the vocabulary by treating minor variations of a word as a single
word. Stemming can be used instead of lemmatizing. We also restrict the text to certain parts of
speech at this step.
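A small sketch of the stemming alternative mentioned above, using NLTK's PorterStemmer. Stemming is shown instead of lemmatization only because it needs no corpus download; a lemmatizer such as NLTK's WordNetLemmatizer would require the WordNet data to be installed first. The helper name stem_tokens is hypothetical.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()


def stem_tokens(tokens):
    """Reduce each token to its Porter stem."""
    return [stemmer.stem(token) for token in tokens]


# Plain whitespace split keeps the sketch free of tokenizer downloads.
tokens = "cats running talked".split()
print(stem_tokens(tokens))
```

Unlike a lemma, a stem is not guaranteed to be a dictionary word, which is the main trade-off between the two approaches.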
CREATE DOCUMENT TERM MATRIX ‘V’

Here we can add extra stop words to the stop word list so that uninformative words do not appear in
the topics. We can also use a TF-IDF vectorizer rather than a simple count vectorizer, which gives
more weight to the more distinctive terms.
CREATE FUNCTION TO DISPLAY TOPICS
To judge how useful the NMF topics are and what they represent, we create a function that displays
the top words activated for each topic.
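Such a display function could look like the sketch below. The name top_words, the toy topic-term matrix, and the vocabulary are assumptions made up for the example; with a fitted model, the matrix would be the model's components_ attribute and the names would come from the vectorizer.

```python
import numpy as np


def top_words(H, feature_names, n_top_words=3):
    """Return, for each topic (row of H), its highest-weighted terms."""
    topics = []
    for row in np.asarray(H):
        # Indices of the largest weights, in descending order of weight.
        top_idx = np.argsort(row)[::-1][:n_top_words]
        topics.append([feature_names[i] for i in top_idx])
    return topics


# Toy topic-term weights: 2 topics over a 4-word vocabulary.
H_demo = np.array([[0.9, 0.1, 0.5, 0.0],
                   [0.0, 0.7, 0.2, 0.8]])
names = ["rain", "police", "drought", "court"]

for k, words in enumerate(top_words(H_demo, names, n_top_words=2)):
    print(f"Topic {k}: {' '.join(words)}")
```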
RUN NMF ON DOCUMENT TERM MATRIX ‘V’

Run NMF on the document term matrix ‘V’ created above.
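A minimal sketch of this step with scikit-learn's NMF, which factorizes V into a document-topic matrix W and a topic-term matrix H. The documents, the number of topics, and the initialization settings are illustrative assumptions rather than the values used in this assignment.

```python
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented example documents standing in for the pre-processed headlines.
docs = [
    "police investigate storm damage",
    "storm warning issued for coast",
    "farmers welcome rain after drought",
    "drought relief for farmers",
]

V = TfidfVectorizer().fit_transform(docs)

# n_components is the number of topics; 2 is an arbitrary choice for the demo.
nmf = NMF(n_components=2, init="nndsvd", random_state=0, max_iter=500)
W = nmf.fit_transform(V)  # documents x topics
H = nmf.components_       # topics x terms

print(W.shape, H.shape)
```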
ITERATE UNTIL YOU FIND USEFUL TOPICS

This is where most of the work happens. As a baseline, the text is usually passed through a count
vectorizer and then sent to NMF to get a first idea of the topics. The settings are then adjusted
and the model is rerun until the topics are useful.
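One way to organize the iteration is to refit the baseline for several topic counts and compare the results. The sketch below records scikit-learn's reconstruction error for each count; the documents and the candidate counts are invented, and in practice the decision still rests on reading the top words of each topic.

```python
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import CountVectorizer

# Invented example documents.
docs = [
    "police investigate storm damage",
    "storm warning issued for coast",
    "farmers welcome rain after drought",
    "drought relief for farmers",
    "court hears police evidence",
    "council approves coast development",
]

V = CountVectorizer().fit_transform(docs)

errors = {}
for k in (2, 3, 4):  # candidate numbers of topics (arbitrary for the demo)
    model = NMF(n_components=k, init="nndsvd", random_state=0, max_iter=500)
    model.fit(V)
    errors[k] = model.reconstruction_err_

for k, err in errors.items():
    print(f"{k} topics: reconstruction error {err:.3f}")
```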
PART 2 (REST WORK)
DATA SET
The same dataset as in Part 1 (abcnews-date-text.csv).
DATA PREPROCESSING
Lemmatize example
Stemmer Example
DATA SANITATION
CALCULATION OF TF-IDF (TERM FREQUENCY-INVERSE DOCUMENT FREQUENCY)
RUNNING LDA USING BAG OF WORDS
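A hedged sketch of LDA on a bag-of-words matrix, written with scikit-learn's LatentDirichletAllocation. This is an assumption about tooling: the report does not say which library was used, and the documents and topic count below are invented.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Invented example documents.
docs = [
    "police investigate storm damage",
    "storm warning issued for coast",
    "farmers welcome rain after drought",
    "drought relief for farmers",
]

# Bag of words: raw term counts per document.
counts = CountVectorizer().fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)  # each row is a topic distribution

print(doc_topics.round(3))
```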

RUNNING LDA USING TF-IDF
ACCURACY
PRECISION
RECALL
F1-SCORE
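The four metrics named in the headings above can be computed from the counts of true and false positives and negatives. The sketch below assumes binary ground-truth labels are available; the label vectors are invented and are not results from this assignment.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Invented labels for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```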
SPLITTING THE DATASET
I split the dataset in the following ratio:
• Training data: 70%
• Testing data: 30%
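A 70/30 split like the one above can be sketched with scikit-learn's train_test_split. The placeholder headlines and the random_state value are assumptions; the report does not state how the split was implemented.

```python
from sklearn.model_selection import train_test_split

# Placeholder documents standing in for the real headlines.
headlines = [f"headline {i}" for i in range(10)]

# test_size=0.3 keeps 70% of the documents for training.
train, test = train_test_split(headlines, test_size=0.3, random_state=42)

print(len(train), len(test))
```

Fixing random_state makes the split reproducible, so every model is compared on the same training and testing documents.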
USING DIFFERENT MODELS
LDA
LSA
PLSA
NNMF
LDA2VEC
HYBRID MODEL (LDA + PLSA)
JUSTIFICATION
My hybrid model improves on the other models and is advantageous for the following reasons:
• High accuracy
• High precision
• No outliers
• High speed
So, if the hybrid model is not used, LSA can be the best option for this dataset.
