Manisha 3001 Week 12
Task 4 Detail
For this task, you are required to use Weka software and a text editor such as WordPad or
Notepad++ on Windows, or TextEdit on Mac.
You can download Weka from https://www.cs.waikato.ac.nz/ml/weka/downloading.html.
Task 1: Create and explore a Weka data file of type ARFF
Download a text file called data.csv from the subject site (Canvas) and open it using a text editor
such as WordPad or Notepad++ on Windows, or TextEdit on Mac. You need to explore this file
and convert it into an ARFF file for Weka. The text file contains a sample of real-life data related
to customers. The data.csv file is not entirely formatted as a Weka (ARFF) file: it has some
formatting errors, and your task is to find these errors and fix them so that you have a valid
ARFF file. Save the valid file as data.arff.
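For reference, a valid ARFF file starts with a @relation declaration, followed by one @attribute declaration per column and a @data section listing the instances. The sketch below shows roughly what the corrected header might look like; the attribute names follow the list given in the answers below, but the relation name, the attribute types, and the nominal value sets are assumptions that must be checked against the actual data.csv contents:

% A rough sketch of a valid ARFF header for this dataset (types and values are assumed)
@relation bank-data

@attribute id string
@attribute age numeric
@attribute sex {MALE,FEMALE}
@attribute region {INNER_CITY,TOWN,RURAL,SUBURBAN}
@attribute income numeric
@attribute married {YES,NO}
@attribute children numeric
@attribute car {YES,NO}
@attribute save_act {YES,NO}
@attribute current_act {YES,NO}
@attribute mortgage {YES,NO}
@attribute pep {YES,NO}

@data
ID12101,48,FEMALE,INNER_CITY,17546.0,NO,1,NO,NO,NO,NO,YES

Typical formatting errors to look for include a missing or misspelled @relation, @attribute, or @data keyword, attribute declarations that do not match the columns in the data rows, and nominal values in the data that are not listed in the corresponding @attribute declaration.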
Explore the data.arff dataset using Weka Explorer and answer the following questions.
Make sure to include screenshots of the visualisations to support your answers.
1. Take a screenshot of your corrected ARFF file.
2. Which attribute in the dataset do you think is useless and does not provide useful information
for prediction?
The given dataset has twelve attributes: id, age, sex, region, income, married, children, car,
save_act, current_act, mortgage and pep. Among these, id, age, sex, save_act, current_act, and
children are the attributes that do not provide useful information for prediction; id in particular
is a unique identifier and carries no predictive value.
7. What proportion of customers have a mortgage and live in the inner city?
More than 144 of the inner-city customers have a mortgage.
8. What proportion of customers have a mortgage and an income between $8,000 and
$29,000?
62% of the customers with a mortgage have an income between $8,000 and $29,000.
9. How many customers are married and have no mortgage?
More than 120 of the 172 married customers have no mortgage.
10. How many customers do not own a car but have a mortgage?
156 customers have a mortgage but do not own a car, which is just under a third of the 500
customers in total.
Task 2: Practical Analysis
Use the dataset from Task 1 to perform the data mining tasks for Task 2 and compare the
performance of the following classification algorithms on this dataset:
Classification Algorithms
• Naive Bayes
Naïve Bayes algorithms are widely used in data mining and machine learning solutions. Naïve Bayes is
not a single algorithm but a family of classifiers based on Bayes' theorem, which gives the probability
that an event occurs given that another event has already occurred: P(A|B) = P(B|A) · P(A) / P(B). The
"naïve" part of the name comes from the simplifying assumption that the features are independent of one
another given the class, so every feature contributes to the classification independently. Although this
independence assumption rarely holds in real-world data, the algorithm is efficient and its accuracy is
often found to be high in practice. However, the traditional approach to attribute selection requires
complex computation; this computation can be reduced if only some of the attributes are chosen, forming
a selective naïve Bayes model. [1] A naïve Bayes classifier can be implemented in a number of ways; the
Gaussian, multinomial, and Bernoulli naïve Bayes classifiers are some examples. Even though the naïve
Bayes classifier relies on an assumption that may not be realistic, it is of great use in machine learning
and data science; spam email filtering and document classification are typical use cases. [2]
When the naïve Bayes classifier was tested in WEKA on the given dataset with the cross-validation
folds set to 10, it took almost no time to build the model. The weka.classifiers.bayes.NaiveBayes
classifier scheme was used to build the model for the given dataset. The dataset has a total of 500
instances, and the classifier summary shows that 61.8% of the instances (309) were classified
correctly, whereas 38.2% (191 instances) were classified incorrectly. The kappa statistic was 0.2198
and the mean absolute error was 0.4346; similarly, the root mean squared error, relative absolute
error, and root relative squared error were 0.4784, 87.602%, and 96.0465% respectively.
[1] https://www.sciencedirect.com/science/article/abs/pii/S0950705119306185
[2] https://www.infona.pl/resource/bwmeta1.element.baztech-abc8af42-7c69-453b-b291-4a67ddfba8a0
As the kappa statistic is only 0.2198, there is little agreement between the predicted and the actual
classes beyond what would be expected by chance. The summary also shows that the model's error
measures are high, which is a point of concern as it reduces the reliability of the predictions.
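The run above can be reproduced outside the Explorer with a short Java program against the Weka API. The following is a minimal sketch, assuming data.arff sits in the working directory and pep (the last attribute) is the class; the random seed, and therefore the exact fold split and figures, may differ from the Explorer's:

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class NaiveBayesRun {
    public static void main(String[] args) throws Exception {
        // Load the corrected ARFF file and mark the last attribute (pep) as the class
        Instances data = new DataSource("data.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        // 10-fold cross-validation, matching the Explorer setting used above
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(new NaiveBayes(), data, 10, new Random(1));

        System.out.println(eval.toSummaryString());       // accuracy, kappa, error measures
        System.out.println(eval.toMatrixString());        // confusion matrix
        System.out.println(eval.toClassDetailsString());  // per-class TP/FP rates, F-measure
    }
}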
The detailed accuracy by class table (figure-2) gives a clearer picture of the analysis.
Figure-2: Detailed accuracy by class
The true positive rate for class "YES" is 0.5 and for class "No" it is 0.717, which indicates that half
of the "YES" instances and roughly two thirds of the "No" instances were correctly identified by the
model. The weighted average of the true positive rate is 0.618. Similarly, the false positive rate for
class "YES" is 0.283, and its value for class "No" is 0.5; the weighted average of the false positive
rate is 0.401. The other measurements for both classes can be seen in the table shown in figure-2.
The confusion matrix of the created model gives an idea of the correctly and incorrectly classified
instances. The main diagonal of the confusion matrix shows the correctly classified instances,
whereas the off-diagonal entries show the incorrectly classified instances; the labels a and b stand
for the "Yes" and "No" classes respectively.
Figure-3 plots the given pep values against the predicted pep values, i.e. the classifier errors.
Similarly, figure-4 shows the margin curve.
Figure-3: Visualization of the classifier errors
Figure-4: Visualization of the margin curve
• HoeffdingTree
The Hoeffding tree is an incremental, anytime decision tree algorithm capable of learning from
huge data streams; the only assumption is that the distribution generating the examples does not
change over time. The Hoeffding tree algorithm is a learning algorithm widely used in data science
and machine learning, in particular for stream data classification. Like other decision trees, it tries
to predict the most probable outcome for each instance; for example, it was initially used to predict
which web site has a higher chance of being accessed than others, a task also known as tracing web
clickstreams. In a nutshell, it runs in sublinear time and yields a decision tree almost identical to
the one a traditional batch learner would produce. [3]
When the Hoeffding tree classifier was tested in WEKA on the given dataset with the cross-validation
folds set to 10, it took almost no time to build the model. The
weka.classifiers.trees.HoeffdingTree -L 2 -S 1 -E 1.0E-7 -H 0.05 -M 0.01 -G 200.0 -N 0.0
classifier scheme was used to build the model for the given dataset.
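Assuming the same loading and evaluation code as in the naïve Bayes sketch above, the only change needed to reproduce this scheme in Java is to configure the classifier from the option string (the options below are copied from the scheme line):

import weka.classifiers.trees.HoeffdingTree;
import weka.core.Utils;

// Configure the classifier exactly as in the Explorer's scheme line
HoeffdingTree ht = new HoeffdingTree();
ht.setOptions(Utils.splitOptions(
        "-L 2 -S 1 -E 1.0E-7 -H 0.05 -M 0.01 -G 200.0 -N 0.0"));
// pass ht to Evaluation.crossValidateModel(...) in place of new NaiveBayes()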
The model was generated for the given dataset, which has 500 instances in total, and took almost
no time to build; the summary of the build is shown in figure-5.
As figure-5 indicates, 62.6% of the instances were classified correctly, whereas 37.4% were classified
incorrectly. Overall the accuracy is decent, but it is not reliable. Here the kappa statistic is 0.2356,
which again means there is little agreement between the predicted and the actual classes. Similarly, the
mean absolute error is 0.432, which is quite large: it suggests we can expect roughly 43.2% error in the
model's classifications. The root mean squared error is 0.4769, and the relative absolute error and root
relative squared error are 87.0726% and 95.7448% respectively. Since these error values are high, the
model cannot be considered reliable or trusted.
[3] https://weka.sourceforge.io/doc.dev/weka/classifiers/trees/HoeffdingTree.html
Figure-6 shows the detailed accuracy by class. The true positive rate for class "YES" is 0.504 and for
class "No" it is 0.728, which indicates that about half of the "YES" instances and almost three quarters
of the "No" instances were correctly identified by the model. The weighted average of the true positive
rate is 0.626. Similarly, the false positive rate for class "YES" is 0.272, and its value for class "No"
is 0.496; the weighted average of the false positive rate is 0.394. The other measurements for both
classes can be seen in the table shown in figure-6.
As with the naïve Bayes model, the confusion matrix shows the correctly classified instances on the
main diagonal and the incorrectly classified instances off the diagonal, with a and b again standing
for the "Yes" and "No" classes.
Figure-8 plots the given pep values against the predicted pep values, i.e. the classifier errors.
Similarly, figure-9 shows the margin curve.
Figure-8: Visualization of the classifier errors
Figure-9: Visualization of the margin curve
• SMO
SMO stands for sequential minimal optimization, an algorithm used to train a support vector
classifier. WEKA's SMO implementation is used for data mining and analysis purposes. It replaces
all the missing values, and the nominal attributes present in the dataset are converted into binary
ones. On our dataset (data.arff) the SMO algorithm also normalizes all the attributes by default, so
the output and the coefficients presented in it are based on this normalized data rather than on the
original data. The quadratic programming problems that arise from training support vector
machines are solved by the sequential minimal optimization algorithm. SMO is an iterative
algorithm in which the large problem is divided into a series of the smallest possible sub-problems,
which are then solved analytically. This algorithm is found in the functions section of the WEKA
Explorer. The scheme used in our case was:
Scheme: weka.classifiers.functions.SMO -C 1.0 -L 0.001 -P 1.0E-12 -N 0 -V -1 -W 1 -K
"weka.classifiers.functions.supportVector.PolyKernel -E 1.0 -C 250007" -calibrator
"weka.classifiers.functions.Logistic -R 1.0E-8 -M -1 -num-decimal-places 4"
The test of the dataset was done with 10-fold cross-validation, and the classifier model used the
full training set. There are 500 instances in total with 12 attributes: id, age, sex, region, income,
married, children, car, save_act, current_act, mortgage and pep.
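As with the other classifiers, this scheme can be reproduced in Java by configuring SMO from the option string. A minimal fragment, assuming the same loading and evaluation code as in the naïve Bayes sketch (the -calibrator option is omitted here, since the Logistic calibrator shown in the scheme line is the default):

import weka.classifiers.functions.SMO;
import weka.core.Utils;

// Same options as the Explorer's scheme line, including the polynomial kernel
SMO smo = new SMO();
smo.setOptions(Utils.splitOptions(
        "-C 1.0 -L 0.001 -P 1.0E-12 -N 0 -V -1 -W 1 "
        + "-K \"weka.classifiers.functions.supportVector.PolyKernel -E 1.0 -C 250007\""));
// pass smo to Evaluation.crossValidateModel(...) as before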
We can observe the normalization of the data in the following figure:
Fig.: Normalization of the original data by the SMO algorithm
The summary of the classifier output can be observed in the figure below:
The detailed accuracy by class for the data.arff dataset is shown in the figure below:
The margin curve for the given dataset under the SMO algorithm was observed as follows:
• J48
J48 is an algorithm that is preferred for dataset classification, and one of the best algorithms for
examining data both categorically and continuously. The algorithm generates the decision tree via
C4.5 (an extension of the ID3 algorithm). It ignores missing values completely while building the
tree; the value of an ignored item can still be predicted from the attribute values of the other
records. The main idea of the J48 algorithm is the division of the dataset into ranges based on the
values of the attributes present in the training sample. When the dataset was passed through the
J48 classifier, the scheme used was:
Scheme: weka.classifiers.trees.J48 -C 0.25 -M 2
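A minimal fragment for reproducing this scheme in Java, again assuming the loading and evaluation code from the naïve Bayes sketch; printing the built classifier yields the pruned tree text discussed below:

import weka.classifiers.trees.J48;
import weka.core.Utils;

// -C 0.25 is the pruning confidence factor, -M 2 the minimum instances per leaf
J48 j48 = new J48();
j48.setOptions(Utils.splitOptions("-C 0.25 -M 2"));
j48.buildClassifier(data);  // data loaded from data.arff as before
System.out.println(j48);    // prints the pruned tree
// pass j48 to Evaluation.crossValidateModel(...) for the 10-fold results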
The J48 pruned tree obtained in the classifier output is as follows:
The decision tree obtained had 23 leaves and a tree size of 45. The internal nodes represent the
attributes, the branches represent the values those attributes can take in the observed samples,
and the terminal nodes represent the final value (i.e. the classification) of the dependent variable.
The decision tree visualized in our case is as follows:
The summary of the classifier output for the J48 algorithm is presented as follows:
Fig.: Summary of the J48 algorithm classifier output
A total of 439 instances (87.8%) were correctly classified, and 61 instances were incorrectly
classified. The kappa statistic for the J48 algorithm on our dataset data.arff was 0.7533, which
indicates substantial agreement. Other statistics such as the mean absolute error and root mean
squared error are much lower, at 0.1747 and 0.3265 respectively. The detailed accuracy by class
in the classifier output was observed as follows:
The true positive rate was observed to be 0.878 on weighted average, i.e. the rate at which our
model correctly predicted each class, and the false positive rate was 0.127 on weighted average.
The confusion matrix observed for the J48 classifier is presented below.
This confusion matrix confirms that 193 + 246 = 439 instances were correctly classified, and
35 + 26 = 61 instances were incorrectly classified.
Comparing the percentage of correctly classified instances, the J48 classification algorithm was
the clear winner, while the other three algorithms were so close to one another that their
performances could not be distinguished.
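The comparison reported here comes from WEKA's test output; a rough Java sketch of the same idea, running all four classifiers through the same 10-fold cross-validation on data.arff and printing the percentage of correctly classified instances for each (a single cross-validation run, not the Experimenter's repeated, significance-tested comparison):

import java.util.Random;
import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.functions.SMO;
import weka.classifiers.trees.HoeffdingTree;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CompareClassifiers {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("data.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1); // pep is the class

        Classifier[] classifiers = {
            new NaiveBayes(), new HoeffdingTree(), new SMO(), new J48()
        };
        for (Classifier c : classifiers) {
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(c, data, 10, new Random(1));
            System.out.printf("%-15s %.2f%% correct%n",
                    c.getClass().getSimpleName(), eval.pctCorrect());
        }
    }
}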
Fig.: Test Output for the comparison field: Percent_correct in WEKA Software
Therefore, the J48 algorithm came last in the percent_incorrect comparison field, i.e. it had the
fewest misclassifications.
Comparing the kappa statistics, the J48 tree algorithm was once again the winner with a value of
about 0.76, while the other three algorithms remained indistinguishable.
Comparing in terms of the F-measure, the J48 classifier is also the best, with a value of 0.87; the
F-measure for naïve Bayes and HoeffdingTree is 0.55, and 0.53 for the SMO algorithm. The
F-measure is the harmonic mean of precision and recall, F = 2 · (precision · recall) / (precision +
recall), and the higher it is, the better.
So, based on the results of the test output, we can conclude that the J48 classification algorithm
is the best of the four compared here.