Machine Learning: Suprit Shrestha (19BCE2584)

This document discusses various topic modeling techniques including NMF, LDA, LSA, PLSA, NNMF, and LDA2VEC. It first covers data preprocessing such as cleaning, lemmatization and stemming. Then it explains calculating TF-IDF and running LDA using bag of words and TF-IDF. Evaluation metrics like accuracy, precision, recall and F1-score are discussed. Finally, it proposes using a hybrid LDA+PLSA model for its high performance.

DA 1

Machine Learning

Suprit Shrestha (19BCE2584)
CONTENTS
Part 1 (NMF)
Import the Python Libraries You Will Need
Select Data and Import
Isolate Data to Topic Model
Clean Data
Create Function to Pre-process Data
Create Document Term Matrix ‘V’
Create Function to Display Topics
Run NMF on Document Term Matrix ‘V’
Iterate Until You Find Useful Topics
Part 2 (Rest Work)
Data Set
Data Preprocessing
Data Sanitation
Calculation of TF-IDF (Term Frequency-Inverse Document Frequency)
Running LDA Using Bag of Words
Running LDA Using TF-IDF
Accuracy
Precision
Recall
F1-Score
Splitting the Dataset
Using Different Models
LDA
LSA
PLSA
NNMF
LDA2VEC
Hybrid Model (LDA + PLSA)
Justification
PART 1(NMF)

IMPORT THE PYTHON LIBRARIES YOU WILL NEED

You'll need the following items to manipulate and export data frames:

• Pandas

For modelling, you'll need the following items from Scikit-Learn:

• TfidfVectorizer
• NMF
• text (the sklearn.feature_extraction.text module)
You may require the following from nltk for text processing:

• stopwords
• word_tokenize
• pos_tag
You may require the following items to clean your text:

• regular expressions
• string
SELECT DATA AND IMPORT
In this step we select and import the data. In this case I have chosen abcnews-date-text.csv.

ISOLATE DATA TO TOPIC MODEL


Next we load the dataset into a data frame and isolate the first term. This keeps the documents at a
similar length.
CLEAN DATA
Now we make the text as comparable as possible by removing punctuation, capitalization, numbers, and
strange characters. This is wrapped in a function, which is then called.
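The cleaning step described above can be sketched as follows. This is a minimal illustration using only the standard library, not the exact function used in this assignment; the name `clean_text` and the choice to drop every non-ASCII character are assumptions.

```python
import re
import string

def clean_text(text):
    """Lowercase the text and strip digits, punctuation, and non-ASCII characters."""
    text = text.lower()
    # Replace every run of digits with a space.
    text = re.sub(r"\d+", " ", text)
    # Map each punctuation character to a space so joined words stay separated.
    text = text.translate(str.maketrans(string.punctuation, " " * len(string.punctuation)))
    # Drop "strange" characters, taken here to mean anything outside ASCII (an assumption).
    text = text.encode("ascii", errors="ignore").decode("ascii")
    # Collapse repeated whitespace left behind by the substitutions.
    return re.sub(r"\s+", " ", text).strip()

cleaned = clean_text("Aussie qualifier STOSUR wastes 4 match-points!")
print(cleaned)
```

In a pandas workflow, a function like this would typically be applied to the text column with `Series.apply`.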
CREATE FUNCTION TO PRE-PROCESS DATA
All of the words in the speeches are lemmatized here, meaning that different forms of the same word
are reduced to a base form, such as noun plurals becoming singular and all verb tenses becoming
present. This shrinks the vocabulary by treating minor variations of a word as a single word.
Stemming can be used instead of lemmatization. Another thing we're doing here is isolating the voice
text to a certain segment.

CREATE DOCUMENT TERM MATRIX ‘V’


Here we can add some stop words to the stop-word list so that words that are not meaningful are excluded.
We can also use a TF-IDF vectorizer rather than a simple count vectorizer, which gives more weight to
terms that are distinctive for a document.
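To make the idea concrete, here is a dependency-free sketch of building a TF-IDF document-term matrix with an extendable stop-word list. It is not scikit-learn's TfidfVectorizer, which applies a smoothed IDF and row normalization by default; the function name `build_tfidf_matrix`, the tiny stop-word set, and the sample headlines are all hypothetical.

```python
import math
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not the list used in the assignment).
STOP_WORDS = {"the", "a", "of", "to", "in", "for", "on"}

def build_tfidf_matrix(docs, extra_stop_words=()):
    """Return (V, vocab): V[i][j] is the TF-IDF weight of vocab[j] in docs[i]."""
    stop = STOP_WORDS | set(extra_stop_words)
    tokenized = [[w for w in d.split() if w not in stop] for d in docs]
    vocab = sorted({w for toks in tokenized for w in toks})
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(w for toks in tokenized for w in set(toks))
    V = []
    for toks in tokenized:
        counts = Counter(toks)
        total = len(toks) or 1  # avoid division by zero for empty documents
        V.append([(counts[w] / total) * math.log(n_docs / df[w]) for w in vocab])
    return V, vocab

docs = ["police probe fire", "fire crews battle fire", "council says plan"]
V, vocab = build_tfidf_matrix(docs, extra_stop_words=["says"])
print(vocab)
```

Terms that occur in fewer documents (such as "police" here) receive a higher weight than terms shared across documents (such as "fire"), which is the effect the TF-IDF vectorizer is chosen for.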
CREATE FUNCTION TO DISPLAY TOPICS
To judge how useful the NMF topics are and what they represent, we create a function that displays the
top words activated for each topic.
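A function of this kind might look like the sketch below. It assumes the topic-term matrix `H` is a list of rows (one per topic) and `feature_names` is the vocabulary in column order; both names and the toy values are illustrative.

```python
def top_words_per_topic(H, feature_names, n_top_words=3):
    """For each topic row in H, return the n_top_words highest-weighted terms."""
    topics = []
    for weights in H:
        # Column indices sorted from highest to lowest weight.
        ranked = sorted(range(len(weights)), key=lambda j: weights[j], reverse=True)
        topics.append([feature_names[j] for j in ranked[:n_top_words]])
    return topics

def display_topics(H, feature_names, n_top_words=3):
    """Print one line per topic with its top words."""
    for k, words in enumerate(top_words_per_topic(H, feature_names, n_top_words)):
        print(f"Topic {k}: {' '.join(words)}")

# Toy topic-term matrix: 2 topics over a 4-word vocabulary.
H_demo = [[0.9, 0.1, 0.0, 0.5],
          [0.0, 0.8, 0.7, 0.1]]
names_demo = ["fire", "court", "police", "crews"]
display_topics(H_demo, names_demo, n_top_words=2)
```

With scikit-learn, the rows of a fitted model's `components_` attribute play the role of `H`.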

RUN NMF ON DOCUMENT TERM MATRIX ‘V’


Run NMF on the document-term matrix ‘V’ created above.
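To show what this step computes, the sketch below factorizes a small non-negative matrix V into W (documents × topics) and H (topics × terms) using the classic multiplicative update rules for the Frobenius objective. It is a pure-Python illustration under assumed names (`nmf`, `frobenius_error`) and toy data; the assignment itself lists scikit-learn's NMF class, which uses a different solver by default.

```python
import random

def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def frobenius_error(V, W, H):
    """Frobenius norm of V - WH."""
    WH = matmul(W, H)
    return sum((v - wh) ** 2 for rv, rwh in zip(V, WH) for v, wh in zip(rv, rwh)) ** 0.5

def nmf(V, n_topics, n_iter=200, seed=0, eps=1e-9):
    """Approximate V ≈ WH with non-negative W and H via multiplicative updates."""
    rng = random.Random(seed)
    n_docs, n_terms = len(V), len(V[0])
    W = [[rng.random() + eps for _ in range(n_topics)] for _ in range(n_docs)]
    H = [[rng.random() + eps for _ in range(n_terms)] for _ in range(n_topics)]
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H), element-wise
        Wt = transpose(W)
        num, den = matmul(Wt, V), matmul(matmul(Wt, W), H)
        H = [[h * n / (d + eps) for h, n, d in zip(rh, rn, rd)]
             for rh, rn, rd in zip(H, num, den)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        Ht = transpose(H)
        num, den = matmul(V, Ht), matmul(W, matmul(H, Ht))
        W = [[w * n / (d + eps) for w, n, d in zip(rw, rn, rd)]
             for rw, rn, rd in zip(W, num, den)]
    return W, H

# Toy document-term matrix with two obvious groups of terms.
V_demo = [[1, 1, 0, 0], [2, 2, 0, 0], [0, 0, 3, 3], [0, 0, 1, 1]]
W_demo, H_demo_nmf = nmf(V_demo, n_topics=2)
print(frobenius_error(V_demo, W_demo, H_demo_nmf))
```

The number of topics is the main parameter to vary when iterating; each row of H can then be inspected for its top words.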

ITERATE UNTIL YOU FIND USEFUL TOPICS


This is where most of the work happens. As a basic model, the text is usually passed through a count
vectorizer and then sent to NMF to get a first idea of the topics we are looking for.
PART 2(REST WORK)

DATA SET
The same dataset as in Part 1.

DATA PREPROCESSING

Lemmatize example
Stemmer Example
DATA SANITATION
CALCULATION OF TF-IDF (TERM FREQUENCY-INVERSE DOCUMENT FREQUENCY)
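For reference, one common textbook definition of TF-IDF is given below. Here f_{t,d} is the count of term t in document d, N is the number of documents, and n_t is the number of documents containing t. Libraries often use variants (for example, smoothed IDF or normalized rows), so the exact numbers produced by a given tool may differ from this form.

```latex
\mathrm{tf}(t,d) = \frac{f_{t,d}}{\sum_{t'} f_{t',d}}, \qquad
\mathrm{idf}(t) = \log\frac{N}{n_t}, \qquad
\text{tf-idf}(t,d) = \mathrm{tf}(t,d)\cdot\mathrm{idf}(t)
```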

RUNNING LDA USING BAG OF WORDS


RUNNING LDA USING TF-IDF

ACCURACY
PRECISION
RECALL

F1-SCORE
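The four metrics named in the headings above can be computed from the counts of true/false positives and negatives. The sketch below assumes a binary labelling with 1 as the positive class; the function name and the sample labels are hypothetical and are not results from this assignment. For more than two classes, the per-class values are usually averaged.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall, and F1 for a binary labelling."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or never present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Made-up labels purely for illustration.
y_true_demo = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred_demo = [1, 0, 1, 0, 1, 0, 1, 0]
metrics_demo = classification_metrics(y_true_demo, y_pred_demo)
print(metrics_demo)
```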
SPLITTING THE DATASET
I split the dataset in the following ratio:

• Training data: 70%


• Testing data: 30%
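A 70/30 split like this can be sketched with the standard library as shown below. The function name, the fixed seed, and the stand-in data are assumptions; scikit-learn's `train_test_split` with `test_size=0.3` is a common alternative.

```python
import random

def split_70_30(rows, train_fraction=0.7, seed=42):
    """Shuffle a copy of rows and split it into (train, test) parts."""
    shuffled = list(rows)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]

rows_demo = list(range(10))  # stand-in for the dataset rows
train_demo, test_demo = split_70_30(rows_demo)
print(len(train_demo), len(test_demo))
```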

USING DIFFERENT MODELS

LDA
LSA

PLSA
NNMF

LDA2VEC

HYBRID MODEL (LDA + PLSA)

JUSTIFICATION
My hybrid model improves on the others and is advantageous for the following reasons:

• High accuracy
• High precision
• No outliers
• High speed

So, if the hybrid model is not used, LSA can be the best option for this dataset.
