ML Project Report
MACHINE LEARNING
ON
SCALING UP THE ACCURACY OF NAIVE-BAYES CLASSIFIER AND DECISION TREE
Submitted by
Section – 23
Batch No: 2
KLEF
DEPARTMENT OF COMPUTER SCIENCE ENGINEERING
(DST-FIST Sponsored Department)
CERTIFICATE
This is to certify that the course-based project entitled “SCALING UP THE ACCURACY
OF NAIVE-BAYES CLASSIFIER AND DECISION TREE” is a bona fide work done by
G. NAGA TEJITH (160030411), G. PRIYANKA (160030459), and R. KAMAL (160030559) in
partial fulfilment of the requirement for the award of the degree of BACHELOR OF
TECHNOLOGY in Computer Science Engineering during the academic year 2018-2019.
DEPARTMENT OF COMPUTER SCIENCE ENGINEERING
(DST-FIST Sponsored Department)
DECLARATION
We hereby declare that this project-based lab report entitled “SCALING UP THE
ACCURACY OF NAIVE-BAYES CLASSIFIER AND DECISION TREE” has been
prepared by us in partial fulfilment of the requirement for the award of the degree
“BACHELOR OF TECHNOLOGY in COMPUTER SCIENCE ENGINEERING” during the
academic year 2018-2019.
We also declare that this project-based lab report is our own effort and that it has not
been submitted to any other university for the award of any degree.
Date:
Place: Vaddeswaram
SUBMITTED BY:
ACKNOWLEDGMENTS
We express our sincere thanks to Dr. SWARNA and the teachers in the lab for their
outstanding support.
We express our gratitude to HARI KIRAN VEGE, Head of the Department of Computer
Science and Engineering, for providing us with adequate facilities, ways and means by which
we were able to complete this project.
We would like to place on record our deep sense of gratitude to the honourable Vice
Chancellor, K L University, for providing the necessary facilities to carry out the project.
Last but not the least, we thank all the Teaching and Non-Teaching Staff of our department,
and especially our classmates and friends, for their support in the completion of our project.
PROJECT
ASSOCIATES
TABLE OF CONTENTS
1. ACKNOWLEDGMENTS
2. INTRODUCTION
3. LITERATURE REVIEW
4. METHODOLOGY
5. RESULTS AND DISCUSSION
6. CONCLUSION
7. REFERENCES
Scaling up the accuracy of Naive-Bayes classifier and Decision tree
Introduction:
The classification process is divided into two main steps. The first is the training step, in
which the classification model is built. The second is the classification itself, in which the
trained model is applied to assign an unknown data object to one of a given set of class
labels. This report surveys classification techniques that are most commonly used. A
comparative study between two algorithms (Naive-Bayes classification and decision trees)
is used to show the strength and accuracy of each classification algorithm in terms of
performance efficiency and time complexity. Such a comparison brings out the advantages
and disadvantages of one method over the other, and provides a guideline for interesting
research issues, which in turn helps other researchers develop innovative algorithms for
applications or requirements that are not yet addressed.
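As a minimal sketch of these two steps (our own illustration, using R's built-in iris data rather than the Adult dataset):

# Step 1: build (train) a classification model on labelled data.
# Step 2: apply it to assign class labels to unseen objects.
library(rpart)
set.seed(1)
idx <- sample(seq_len(nrow(iris)), 0.7 * nrow(iris))
train_set <- iris[idx, ]                                  # labelled training tuples
test_set  <- iris[-idx, ]                                 # "unknown" objects
fit  <- rpart(Species ~ ., data = train_set)              # training step
pred <- predict(fit, newdata = test_set, type = "class")  # classification step
mean(pred == test_set$Species)                            # fraction labelled correctly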
Literature Review:
Many attempts have been made to extend Naive-Bayes or to restrict the learning of
general Bayesian networks. Approaches based on feature subset selection may help,
but they cannot increase the representational power as was done here; thus we do not
review them.
Methodology:
Using the Adult dataset, we need to predict income, which is considered the target
attribute of the dataset.
ALGORITHMS:
Decision tree induction is the learning of decision trees from class-labelled training
tuples. A decision tree is a flowchart-like tree structure, where each internal node
denotes a test on an attribute, each branch represents an outcome of the test, and each
leaf node holds a class label. The topmost node in a tree is the root node. Internal
nodes are denoted by rectangles, and leaf nodes are denoted by ovals. Some decision
tree algorithms produce only binary trees, whereas others can produce non-binary
trees. The construction of decision tree classifiers does not require any domain
knowledge or parameter setting, and is therefore appropriate for exploratory
knowledge discovery. Decision trees can handle high-dimensional data and have good
accuracy. The decision tree induction algorithm applied to the dataset in this study is
Random Forest. A random forest is an ensemble classifier that consists of many
decision trees and outputs the class that is the mode of the classes output by the
individual trees. It runs efficiently on large databases and can handle thousands of
input variables without variable deletion. Generated forests can be saved for future
use on other data.
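As a hedged illustration of how such a forest could be grown in R (assuming the randomForest package and a cleaned data frame adult with a factor target income; the report's own code below uses caret and rpart instead):

# Sketch only: grow a random forest on the cleaned Adult data.
# Assumes a data frame `adult` whose target column `income` is a factor.
install.packages("randomForest")
library(randomForest)
set.seed(123)
rf <- randomForest(income ~ ., data = adult,
                   ntree = 500,        # number of trees in the ensemble
                   importance = TRUE)  # record variable importance
print(rf)       # out-of-bag error estimate and confusion matrix
importance(rf)  # attributes the forest found most useful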
Dataset:
The Adult dataset is available on the UCI Machine Learning Repository and has a size
of 3,755 KB. It consists of 32,561 records and 15 attributes.
Data Cleaning
Data Integration
Data Transformation
Data Reduction
These data preprocessing techniques are not mutually exclusive; they may work
together. Data preprocessing techniques, when applied before mining, can
substantially improve the overall quality of the patterns mined and reduce the time
required for the actual mining. They improve the quality of the data and the accuracy
and efficiency of the mining process.
In order to improve the quality of the data and the accuracy and efficiency of the
mining process, the Adult dataset undergoes a preprocessing step. The less sensitive
attributes, such as final weight, capital gain, capital loss, and hours per week, are
removed, since they are not considered relevant attributes for privacy preservation in
data mining. The number of attributes is thus reduced to 10.
The first 100 instances of the dataset are taken, and then the instances with missing
values are removed, resulting in a dataset of 91 instances, as sketched below.
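A sketch of this preprocessing in R (the column names and the "?" missing-value marker are our assumptions about the file, not taken from the report's own code):

# Read the data, treating the UCI "?" marker as missing.
adult <- read.csv("adult.csv", na.strings = c("?", " ?"))
# Drop the less sensitive attributes named above (assumed column names).
drop_cols <- c("fnlwgt", "capital.gain", "capital.loss", "hours.per.week")
adult <- adult[, !(names(adult) %in% drop_cols)]
# Keep the first 100 instances, then remove those with missing values.
adult <- adult[1:100, ]
adult <- na.omit(adult)   # per the report, 91 complete instances remain
dim(adult)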
DECISION TREE CODE:
# Load the required packages (install once if missing).
install.packages("caret")
library(caret)
install.packages("rpart.plot")
library(rpart.plot)

# Read the Adult data; the file has no header row, so the columns are
# named V1..V15 and V15 is the income class label.
setwd("C:\\Users\\USER\\Documents\\3-2\\skilling")
adult <- read.csv("adults.csv", sep = ",", header = FALSE)
str(adult)
head(adult)

# Split 70/30 into training and testing sets.
set.seed(3033)
intrain  <- createDataPartition(y = adult$V15, p = 0.7, list = FALSE)
training <- adult[intrain, ]
testing  <- adult[-intrain, ]   # note the comma: drop rows, keep all columns
dim(training)
dim(testing)
anyNA(adult)
summary(adult)

# Fit an rpart decision tree with the Gini split criterion, tuned by
# 10-fold cross-validation repeated 3 times.
trctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
set.seed(3333)
dtree_fit_gini <- train(V15 ~ ., data = training, method = "rpart",
                        parms = list(split = "gini"),
                        trControl = trctrl, tuneLength = 10)
dtree_fit_gini
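Once the tree is fitted, its accuracy on the held-out testing split can be checked and the final tree plotted (a sketch; these lines are not in the original listing, and V15 is the income column because the file was read without headers):

# Evaluate the fitted tree on the held-out split.
test_pred <- predict(dtree_fit_gini, newdata = testing)
confusionMatrix(table(test_pred, testing$V15))  # caret's accuracy summary
# Visualise the final rpart tree (this is why rpart.plot was loaded).
prp(dtree_fit_gini$finalModel)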
NAIVE-BAYES CODE:
setwd("C:\\Users\\USER\\Documents\\3-2\\skilling")
mydata <- read.csv(file = "adult.csv")
str(mydata)
dim(mydata)

# 80/20 train/test split; the original used `tindex` without defining it,
# so it is created here with a simple random sample.
set.seed(42)
tindex    <- sample(seq_len(nrow(mydata)), 0.8 * nrow(mydata))
mtraining <- mydata[tindex, ]
mtesting  <- mydata[-tindex, ]

# Alternative split with caTools, kept from the original listing:
# install.packages("caTools")
# library(caTools)
# msplit    <- sample.split(mydata$income, SplitRatio = 0.8)
# mtraining <- subset(mydata, msplit == TRUE)
# mtesting  <- subset(mydata, msplit == FALSE)

# Fit the Naive Bayes classifier from e1071 with income as the class.
install.packages("e1071")
library(e1071)
NB <- naiveBayes(income ~ ., data = mtraining)
print(NB)
summary(NB)

# Classify the test set and tabulate predictions against true labels.
predNB1 <- predict(NB, mtesting, type = "class")
summary(predNB1)
table(mtesting$income, predNB1)
plot(predNB1)

# The original also began a caret-based fit but left `model` undefined;
# completed here assuming caret's "nb" method was intended (it requires
# the klaR package).
install.packages("caret")
install.packages("klaR")
library(caret)
x <- mtraining[, names(mtraining) != "income"]  # predictors: drop the target
                                                # (the original's [,-4] likely
                                                # dropped the wrong column)
y <- mtraining$income                           # class labels
model <- train(x, y, method = "nb",
               trControl = trainControl(method = "cv", number = 10))
model
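The test-set accuracy of the Naive Bayes model can then be read off the confusion table (a sketch added here; not part of the original listing):

# Accuracy = correctly classified instances / all test instances.
tab <- table(mtesting$income, predNB1)
sum(diag(tab)) / sum(tab)
# caret's confusionMatrix(tab) would also report accuracy and kappa.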
Results and Discussion:
1) Naive Bayes:
Accuracy: [model output figure]
2) Decision Tree:
Accuracy: [model output figure]
Conclusion:
We have implemented the classification techniques and found the accuracy for a
dataset with just 91 instances. This study can be carried forward by implementing the
same algorithms on larger datasets.
Future Scope:
Our work can be extended to other data mining techniques such as clustering and
association rule mining. It can also be extended to other classification algorithms.
References:
https://rd.springer.com/content/pdf/10.1007%2Fs00500-012-0947-9.pdf
https://www.ijert.org/research/a-comparative-study-of-classification-techniques-on-adult-data-set-IJERTV1IS8243.pdf
Ron Kohavi, “Scaling Up the Accuracy of Naive-Bayes Classifiers: a Decision-Tree Hybrid”, Proceedings of KDD-96 (KDD96-033).