Machine Learning: Suprit Shrestha (19BCE2584)
CONTENTS
Part 1 (NMF)
Import the Python Libraries You Will Need
Select Data and Import
Isolate Data to Topic Model
Clean Data
Create Function to Pre-process Data
Create Document Term Matrix ‘V’
Create Function to Display Topics
Run NMF on Document Term Matrix ‘V’
Iterate Until You Find Useful Topics
Part 2 (Rest Work)
Data Set
Data Preprocessing
Data Sanitation
Calculation of TF-IDF (Term Frequency-Inverse Document Frequency)
Running LDA Using Bag of Words
Running LDA Using TF-IDF
Accuracy
Precision
Recall
F1-score
Splitting the Dataset
Using Different Models
LDA
LSA
PLSA
NNMF
LDA2VEC
Hybrid Model (LDA + PLSA)
Justification
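PART 1 (NMF)
You will need the following to manipulate data frames, build the document-term matrix, and run the factorization:
• pandas
• TfidfVectorizer (from sklearn.feature_extraction.text)
• NMF (from sklearn.decomposition)
• text (the sklearn.feature_extraction.text module)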
You may require the following from nltk for text processing:
• stopwords
• word_tokenize
• pos_tag
You may require the following items to clean your text:
• regular expressions
• string
SELECT DATA AND IMPORT
In this step we select and import the data. I have chosen abcnews-date-text.csv.
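A minimal sketch of the import step with pandas. The column names `publish_date` and `headline_text` are an assumption about the layout of the file, `load_headlines` is a hypothetical helper, and the two sample rows are stand-ins rather than real data.

```python
import io

import pandas as pd


def load_headlines(source):
    """Read the CSV, keep dates as strings, and drop rows with no headline."""
    df = pd.read_csv(source, dtype={"publish_date": str})
    # Assumed column name: headline_text.
    return df.dropna(subset=["headline_text"]).reset_index(drop=True)


# In-memory stand-in so the sketch runs without the real file.
sample = io.StringIO(
    "publish_date,headline_text\n"
    "20200101,council approves new budget\n"
    "20200102,rain eases after floods\n"
)
data = load_headlines(sample)
```

For the real data, the call would be `load_headlines("abcnews-date-text.csv")`.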
DATA SET
The same dataset as in Part 1 (abcnews-date-text.csv).
DATA PREPROCESSING
Lemmatize example
Stemmer Example
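A minimal sketch of stemming with nltk's `PorterStemmer`, which needs no extra corpus download. The word list is an assumption chosen for illustration. Lemmatization with `WordNetLemmatizer` works similarly but requires the WordNet corpus to be downloaded first.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Illustrative words, not taken from the dataset.
words = ["running", "caresses", "ponies", "flies"]

# Stemming strips suffixes by rule, so a stem need not be a dictionary word.
stems = [stemmer.stem(w) for w in words]
stem_by_word = dict(zip(words, stems))
```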
DATA SANITATION
CALCULATION OF TF-IDF (TERM FREQUENCY-INVERSE DOCUMENT FREQUENCY)
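A minimal sketch of the textbook weighting, tf(t, d) × log(N / df(t)), in plain Python. Library implementations generally apply smoothing and normalization variants, so their numbers will differ. The function name `tf_idf` and the three tokenized documents are assumptions for illustration.

```python
import math
from collections import Counter


def tf_idf(tokenized_docs):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    n_docs = len(tokenized_docs)

    # Document frequency: the number of documents that contain each term.
    df = Counter()
    for doc in tokenized_docs:
        df.update(set(doc))

    weights = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Illustrative tokenized headlines, not rows from the real dataset.
docs = [
    ["rain", "floods", "town"],
    ["rain", "eases", "drought"],
    ["council", "approves", "budget"],
]
scores = tf_idf(docs)
```

A term that appears in fewer documents, such as "floods", gets a higher weight than one shared across documents, such as "rain".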
ACCURACY
PRECISION
RECALL
F1-SCORE
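The four metrics above can all be derived from the same confusion counts. A minimal sketch, assuming binary labels with 1 as the positive class; the function name `classification_metrics` and the example labels are assumptions for illustration, not results from this report.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 from two label sequences."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1

    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up labels: 3 true positives, 1 false positive, 1 false negative, 3 true negatives.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0, 1, 0]
metrics = classification_metrics(y_true, y_pred)
```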
SPLITTING THE DATASET
I split the dataset in the following ratio:
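A minimal sketch of a shuffled train/test split with the training fraction left as a parameter. The helper `split_dataset` is hypothetical, and the value 0.8 used below is only a placeholder, not the ratio used in this report.

```python
import random


def split_dataset(items, train_fraction, seed=0):
    """Shuffle a copy of the items and cut it into train and held-out parts."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be strictly between 0 and 1")
    shuffled = list(items)
    # A fixed seed keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Stand-in documents; 0.8 is a placeholder fraction.
headlines = [f"headline {i}" for i in range(10)]
train, held_out = split_dataset(headlines, train_fraction=0.8)
```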
LDA
LSA
PLSA
NNMF
LDA2VEC
JUSTIFICATION
My hybrid model (LDA + PLSA) improves on the other models for the following reasons:
• High accuracy
• High precision
• No outliers
• High speed
If the hybrid model is not used, LSA can be the best option for this dataset.