
KLEF

Department of Computer Science Engineering

Course Code: 15CS4171

MACHINE LEARNING

III B.Tech – 2nd Semester

Academic Year 2018-2019

Project Based Lab
on
SCALING UP THE ACCURACY OF NAIVE BAYES CLASSIFIER AND DECISION TREE

Submitted by
Section – 23
Batch No: 2

Student ID    Student Name

160030411     G. NAGA TEJITH
160030459     G. PRIYANKA
160030559     R. KAMAL

KLEF
DEPARTMENT OF COMPUTER SCIENCE ENGINEERING
(DST-FIST Sponsored Department)

CERTIFICATE

This is to certify that the course based project entitled “SCALING UP THE ACCURACY OF NAIVE BAYES CLASSIFIER AND DECISION TREE” is a bona fide work done by G. NAGA TEJITH (160030411), G. PRIYANKA (160030459) and R. KAMAL (160030559) in partial fulfilment of the requirements for the award of the degree of BACHELOR OF TECHNOLOGY in Computer Science Engineering during the academic year 2018-2019.

Faculty In-Charge                              Head of the Department

Dr. Swarna

DEPARTMENT OF COMPUTER SCIENCE ENGINEERING
(DST-FIST Sponsored Department)

DECLARATION

We hereby declare that this project based lab report entitled “SCALING UP THE ACCURACY OF NAIVE BAYES CLASSIFIER AND DECISION TREE” has been prepared by us in partial fulfilment of the requirements for the award of the degree “BACHELOR OF TECHNOLOGY in COMPUTER SCIENCE ENGINEERING” during the academic year 2018-2019.

We also declare that this project based lab report is the result of our own effort and that it has not been submitted to any other university for the award of any degree.

Date:

Place: Vaddeswaram

SUBMITTED BY:

Student ID Student Name


160030411 G.NAGA TEJITH
160030459 G.PRIYANKA
160030559 R.KAMAL

ACKNOWLEDGMENTS

Our sincere thanks to Dr. Swarna and the teachers in the lab for their outstanding support throughout the project and for the successful completion of the work.

We express our gratitude to Hari Kiran Vege, Head of the Department of Computer Science and Engineering, for providing us with adequate facilities, ways and means by which we were able to complete this project.

We would like to place on record our deep sense of gratitude to the honourable Vice Chancellor, K L University, for providing the facilities necessary to carry out the project.

Last but not least, we thank all the teaching and non-teaching staff of our department, and especially our classmates and friends, for their support in the completion of our project.

PROJECT ASSOCIATES

Student ID Student Name


160030411 G.NAGA TEJITH
160030459 G.PRIYANKA
160030559 R.KAMAL

TABLE OF CONTENTS

1. ACKNOWLEDGMENTS

2. INTRODUCTION

3. LITERATURE REVIEW

4. METHODOLOGY

5. RESULTS AND DISCUSSION

6. CONCLUSION AND FUTURE SCOPE

7. REFERENCES

Scaling Up the Accuracy of Naive Bayes Classifier and Decision Tree
Introduction:

Classification:

Data classification is the process of organizing data into categories or groups in such a way that data objects in the same group are similar to one another, while data objects from different groups are dissimilar. A classification algorithm assigns each instance to a particular class so that the classification error is minimal. It is used to extract models that accurately describe the important data classes within a given dataset, such as the adult dataset used here.

Classification techniques can handle large volumes of data. They predict categorical class labels, classify data based on a model built from a training set and its associated class labels, and can then be used to classify newly available test data. Classification is therefore an integral part of data analysis and is gaining popularity. Classification uses a supervised learning approach: in supervised learning, a training dataset of records with associated class labels is available.

The classification process is divided into two main steps. The first is the training step, in which the classification model is built. The second is classification itself, in which the trained model is applied to assign an unknown data object to one of a given set of class labels. This report surveys classification techniques that are commonly used. A comparative study between two algorithms (Naive Bayes classification and decision trees) is used to show the strength and accuracy of each algorithm in terms of performance efficiency and time complexity. Such a study brings out the advantages and disadvantages of one method over the other, and provides a guideline for interesting research issues, which in turn can help other researchers develop innovative algorithms for applications or requirements that are not yet served.
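
As a minimal sketch of these two steps in R (illustrative only: it uses the built-in iris data rather than the adult dataset analysed later), the model is first built on labeled training records and then applied to held-out records:

# Step 1 (training): fit a classifier on labeled records
# Step 2 (classification): apply the trained model to unseen records
library(e1071)                                      # provides naiveBayes()
set.seed(1)
train_idx <- sample(nrow(iris), 0.7 * nrow(iris))
fit <- naiveBayes(Species ~ ., data = iris[train_idx, ])
preds <- predict(fit, iris[-train_idx, ])
mean(preds == iris$Species[-train_idx])             # proportion classified correctly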

Literature Review:

Many attempts have been made to extend Naive Bayes or to restrict the learning of general Bayesian networks. Approaches based on feature subset selection may help, but they cannot increase the representation power as was done here, so we do not review them.

A Naive Bayes Style Possibilistic Classifier (NBSPC) was proposed by Borgelt and Gebhardt (1999) to deal with imprecise training sets. For this classifier, imprecision concerns only the attribute values of instances (the class attribute and the testing set are assumed to be perfect). Given the class attribute, possibility distributions for attributes are estimated by computing the maximum-based projection (Borgelt and Kruse 1988) over the set S of precise instances (S is included in the extended dataset) that contain both the target value of the considered attribute and the class.

A naive possibilistic network classifier, proposed by Haouari et al. (2009), presents a building procedure that deals with imperfect dataset attributes and classes, and a classification procedure used to classify unseen examples which may have imperfect attribute values. This imperfection is modeled through a possibility distribution given by an expert who expresses partial ignorance due to a lack of a priori knowledge. There are some similarities between the proposed approach and that of Haouari et al. (2009). In particular, both are based on the idea that an attribute value is all the more possible if there is an example in the training set with the same attribute value (in the discrete case in Haouari et al. 2009) or a very close attribute value (in terms of similarity, in the numerical case). However, the approach of Haouari et al. (2009) does not require any conditional distribution over attributes to be defined in the certain case, whereas the main focus of the approach considered here is how to estimate such a possibility distribution for numerical data in the certain case.

Methodology:

We briefly review methods for the induction of decision trees and Naive Bayes.

Decision trees (Quinlan 1993; Breiman et al. 1984) are commonly built by recursive partitioning. A univariate (single attribute) split is chosen for the root of the tree using some criterion (e.g., mutual information, gain ratio, Gini index). The data is then divided according to the test, and the process repeats recursively for each child. After a full tree is built, a pruning step is executed which reduces the tree size. In the experiments, we compared our results with the C4.5 decision-tree induction algorithm (Quinlan 1993), which is a state-of-the-art algorithm. Naive Bayes (Good 1965; Langley, Iba, & Thompson 1992) uses Bayes' rule to compute the probability of each class given the instance, assuming the attributes are conditionally independent given the label. The version of Naive Bayes used in the experiments was implemented in MLC++ (Kohavi et al. 1994). The data is pre-discretized using an entropy-based algorithm (Fayyad & Irani 1993; Dougherty, Kohavi, & Sahami 1995). The probabilities are estimated directly from counts in the data (without any corrections, such as Laplace or m-estimates).

Accuracy Scale-Up: A Naive Bayes classifier requires estimation of the conditional probabilities for each attribute value given the label. For discrete data, because only a few parameters need to be estimated, the estimates tend to stabilize quickly, and more data does not change the underlying model much. With continuous attributes, the discretization is likely to form more intervals as more data becomes available, thus increasing the representation power. However, even with continuous data, the discretization is global and cannot take attribute interactions into account. Decision trees are non-parametric estimators and can approximate any “reasonable” function as the database size grows (Gordon & Olshen 1984). This theoretical result, however, may not be very comforting if the database size required to reach the asymptotic performance is larger than the number of atoms in the universe, as is sometimes the case. In practice, some parametric estimators, such as Naive Bayes, may perform better.
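
As a hedged sketch of this count-based estimation (this is not the MLC++ implementation; 'age' and 'income' are assumed column names, and cut() is a simple equal-width stand-in for the entropy-based discretization cited above):

# Estimate P(attribute interval | class) directly from counts, with no
# Laplace or m-estimate correction, as described in the text
adult <- read.csv("adult.csv")                   # assumes a header row with 'age' and 'income'
adult$age_bin <- cut(adult$age, breaks = 4)      # equal-width bins (illustrative, not entropy-based)
counts <- table(adult$age_bin, adult$income)     # joint counts of bin and class
prop.table(counts, margin = 2)                   # each column sums to 1: P(bin | class)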

Using the adult dataset, we need to predict the income attribute, which is considered the target attribute of the dataset.

ALGORITHMS:

Naive Bayes Classification:

Bayesian classifiers are statistical classifiers. They can predict class membership probabilities, such as the probability that a given tuple belongs to a particular class. Bayesian classification is based on Bayes' theorem. Bayesian classifiers exhibit high accuracy and speed when applied to large databases. The family includes Naïve Bayesian classifiers and Bayesian belief networks. Naïve Bayesian classifiers assume that the effect of an attribute value on a given class is independent of the values of the other attributes, while Bayesian belief networks are graphical methods that allow the representation of dependencies among subsets of attributes. In this report, for the comparative study of classification algorithms, we have taken Naïve Bayesian classification. Naïve Bayesian classification is a simple and well-known method for supervised learning of a classification problem. It makes the assumption of class-conditional independence, i.e., given the class label of a tuple, the values of the attributes are assumed to be conditionally independent of one another.
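
To make the class-conditional independence assumption concrete, the following small R sketch uses illustrative numbers (not estimated from the adult dataset): the posterior is proportional to the class prior multiplied by the per-attribute conditional probabilities.

# P(class | a1, a2) is proportional to P(class) * P(a1 | class) * P(a2 | class)
prior <- c(le50K = 0.76, gt50K = 0.24)    # illustrative class priors
p_a1  <- c(le50K = 0.30, gt50K = 0.10)    # illustrative P(a1 value | class)
p_a2  <- c(le50K = 0.20, gt50K = 0.50)    # illustrative P(a2 value | class)
score <- prior * p_a1 * p_a2              # unnormalized posterior scores
score / sum(score)                        # normalized posterior over the two classes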

Decision Tree Induction:

Decision tree induction is the learning of decision trees from class-labeled training tuples. A decision tree is a flowchart-like tree structure, where each internal node denotes a test on an attribute, each branch represents an outcome of the test, and each leaf node holds a class label. The topmost node in a tree is the root node. Internal nodes are denoted by rectangles, and leaf nodes are denoted by ovals. Some decision tree algorithms produce only binary trees, whereas others can produce non-binary trees. The construction of decision tree classifiers does not require any domain knowledge or parameter setting, and is therefore appropriate for exploratory knowledge discovery. Decision trees can handle high-dimensional data and have good accuracy. The decision tree induction algorithm applied to the dataset in this study is random forest. A random forest is an ensemble classifier that consists of many decision trees and outputs the class that is the mode of the classes output by the individual trees. It runs efficiently on large databases and can handle thousands of input variables without variable deletion. Generated forests can be saved for future use on other data.
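
A hedged sketch of random forest induction with the randomForest package follows; the report does not show this call, and the data frame and column names reuse those defined in the source code section below:

# Fit an ensemble of trees; the predicted class is the majority vote of the forest
install.packages("randomForest")
library(randomForest)
rf <- randomForest(as.factor(income) ~ ., data = mtraining, ntree = 500)
pred_rf <- predict(rf, mtesting)
table(mtesting$income, pred_rf)    # actual class vs majority-vote prediction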

 Dataset:

The adult dataset is available in the UCI Machine Learning Repository and has a size of 3,755 KB. It consists of 32,561 records and 15 attributes.

 Data Pre-Processing:

Data preprocessing is processing applied to raw data to make it easier and more effective to process further. It is an important step in the data mining process. The product of data preprocessing is the final training set. Kotsiantis et al. (2006) present a well-known algorithm for each step of data preprocessing.

The data preprocessing techniques are:

 Data Cleaning
 Data Integration
 Data Transformation
 Data Reduction

These data preprocessing techniques are not mutually exclusive; they may work together. Data preprocessing techniques, when applied before mining, can substantially improve the overall quality of the patterns mined and reduce the time required for actual mining. They can improve the quality of the data and the accuracy and efficiency of the mining process.

 Preprocessing of the adult dataset:

In order to improve the quality of the data and the accuracy and efficiency of the mining process, the adult dataset undergoes a preprocessing step. The less sensitive attributes final weight, capital gain, capital loss and hours per week are removed, since they are not considered relevant attributes for privacy preservation in data mining. The number of attributes is thus reduced to 10.

The first 100 instances of the dataset are taken, and the instances with missing values are then removed, resulting in a dataset of 91 instances.
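
A hedged R sketch of these preprocessing steps follows (column names are assumed to match the adult dataset header, in which '?' marks missing values):

# Drop the four less sensitive attributes, keep the first 100 instances,
# and remove rows with missing values (91 instances remain)
adult <- read.csv("adult.csv", na.strings = c("?", " ?"))
drop_cols <- c("fnlwgt", "capital.gain", "capital.loss", "hours.per.week")
adult <- adult[, !(names(adult) %in% drop_cols)]
adult <- na.omit(adult[1:100, ])
dim(adult)    # expected: 91 rows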

SOURCE CODE:

DECISION TREE CODE:

# Install and load the packages used for model training and tree plotting
install.packages("caret")
library(caret)
install.packages("rpart.plot")
library(rpart.plot)

# Read the adult dataset; header = FALSE, so columns are named V1..V15
setwd("C:\\Users\\USER\\Documents\\3-2\\skilling")
adult <- read.csv("adults.csv", sep = ",", header = FALSE)
str(adult)
head(adult)

# Split the data: 70% training, 30% testing (V15 is the income target)
set.seed(3033)
intrain <- createDataPartition(y = adult$V15, p = 0.7, list = FALSE)
training <- adult[intrain, ]
testing <- adult[-intrain, ]   # note the comma: row subsetting (the original dropped it)
dim(training)
dim(testing)
anyNA(adult)
summary(adult)

# 10-fold cross-validation repeated 3 times
trctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)

# Fit an rpart decision tree using the Gini index as the split criterion
set.seed(3333)
dtree_fit_gini <- train(V15 ~ ., data = training, method = "rpart",
                        parms = list(split = "gini"),
                        trControl = trctrl, tuneLength = 10)
dtree_fit_gini

# Plot the final pruned tree
prp(dtree_fit_gini$finalModel, box.palette = "Blues", tweak = 1.2)
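
The report does not show the evaluation of this model; a plausible sketch for obtaining the decision-tree accuracy on the held-out partition (reusing the objects defined above) is:

# Predict the income class of the test rows and summarize the accuracy
test_pred <- predict(dtree_fit_gini, newdata = testing)
confusionMatrix(test_pred, as.factor(testing$V15))    # caret: accuracy and per-class statistics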

NAIVE BAYES CODE:

# Read the adult dataset (this file has a header row, including 'income')
setwd("C:\\Users\\USER\\Documents\\3-2\\skilling")
mydata <- read.csv(file = "adult.csv")
str(mydata)
dim(mydata)

# Random 70/30 train/test split by row index
tindex <- sort(sample(nrow(mydata), nrow(mydata) * 0.7))
mtraining <- mydata[tindex, ]
mtesting <- mydata[-tindex, ]

# Alternative split with caTools (kept from the original, commented out)
#install.packages("caTools")
#library(caTools)
#msplit <- sample.split(mydata, SplitRatio = 0.8)
#mtraining <- subset(mydata, msplit == "TRUE")
#mtesting <- subset(mydata, msplit == "FALSE")

# Fit a Naive Bayes classifier with e1071, income as the target
install.packages("e1071")
library(e1071)
NB <- naiveBayes(income ~ ., data = mtraining)
print(NB)
summary(NB)

# Predict the income class of the test instances and tabulate the results
predNB1 <- predict(NB, mtesting, type = "class")
summary(predNB1)
table(mtesting$income, predNB1)   # confusion matrix: actual vs predicted
plot(predNB1)

# Re-train with caret using 10-fold cross-validation; the predictors must
# exclude the target column (the original dropped column 4, which is not
# the 'income' column of the adult dataset)
install.packages("caret")
library(caret)
x <- mtraining[, names(mtraining) != "income"]
y <- mtraining$income
model <- train(x, y, 'nb', trControl = trainControl(method = 'cv', number = 10))
model
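
Similarly, the cross-validated caret model can be assessed on the held-out partition (a sketch; this step is not in the original code):

# Accuracy of the caret-trained Naive Bayes model on unseen data
predNB2 <- predict(model, newdata = mtesting)
confusionMatrix(predNB2, as.factor(mtesting$income))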

Results and Discussion:

1) Naive Bayes implemented on the adult dataset:

[Figure: plot of the predicted income classes]

Accuracy: 97% [Figure: model accuracy output]

2) Decision tree:

Accuracy: 85% [Figure: model accuracy output]
Conclusion:

The above experimentation with classification algorithms on the adult dataset shows that Naïve Bayes performs best, followed by ZeroR and the decision tree. The Naive Bayes classifier is simple and fast, and it also exhibits a higher accuracy rate than the other algorithms discussed above.

The accuracy of Naive Bayes classification is 97%.

The accuracy of the decision tree is 85%.

The highest accuracy is thus achieved by the Naive Bayes classifier.

Future Scope:

Our work can be extended to other data mining techniques such as clustering and association rule mining, and to other classification algorithms. We have implemented the classification techniques and found the accuracy for a dataset with just 91 instances. This study can be carried forward by implementing the same algorithms on larger datasets.

References:

1. https://rd.springer.com/content/pdf/10.1007%2Fs00500-012-0947-9.pdf

2. A Comparative Study of Classification Techniques on Adult Data Set: https://www.ijert.org/research/a-comparative-study-of-classification-techniques-on-adult-data-set-IJERTV1IS8243.pdf

3. Kohavi, R. (1996). Scaling Up the Accuracy of Naive-Bayes Classifiers: a Decision-Tree Hybrid. Proceedings of KDD-96 (KDD96-033.pdf).

