Manisha 3001 Week 12
Task 4 Detail
For this task, you are required to use Weka software and a text editor such as WordPad or
Notepad++ on Windows, or TextEdit on Mac.
You can download Weka from https://www.cs.waikato.ac.nz/ml/weka/downloading.html.
Task 1: Create and explore a Weka data file of type ARFF
Download a text file called data.csv from the subject site (Canvas) and open it using a text editor
such as WordPad or Notepad++ on Windows, or TextEdit on Mac. You need to explore this file
and convert it into an ARFF file for Weka. The text file contains a sample of real-life data related
to customers. The data.csv file is not entirely formatted as a Weka (ARFF) file: it has some
formatting errors, and your task is to find these errors and fix them so that you have a valid
ARFF file. Save the valid file as data.arff.
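For reference, a valid ARFF file starts with a @relation declaration, followed by one @attribute declaration per column and a @data section listing the instances. The sketch below shows roughly what the corrected header might look like; the attribute names follow the list given in the answers below, but the relation name, the attribute types, and the nominal value sets are assumptions that must be checked against the actual data.csv contents:

% A rough sketch of a valid ARFF header for this dataset (types and values are assumed)
@relation bank-data

@attribute id string
@attribute age numeric
@attribute sex {MALE,FEMALE}
@attribute region {INNER_CITY,TOWN,RURAL,SUBURBAN}
@attribute income numeric
@attribute married {YES,NO}
@attribute children numeric
@attribute car {YES,NO}
@attribute save_act {YES,NO}
@attribute current_act {YES,NO}
@attribute mortgage {YES,NO}
@attribute pep {YES,NO}

@data
ID12101,48,FEMALE,INNER_CITY,17546.0,NO,1,NO,NO,NO,NO,YES

Typical formatting errors to look for include a missing or misspelled @relation, @attribute, or @data keyword, attribute declarations that do not match the columns in the data rows, and nominal values in the data that are not listed in the corresponding @attribute declaration.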
Explore the data.arff dataset using Weka Explorer and answer the following questions.
Make sure to include screenshots of the visualisations to support your answers.
1. Take a screenshot of your corrected ARFF file.
2. Which attribute in the dataset do you think is useless and does not provide useful information
for prediction?
The given dataset has twelve attributes: id, age, sex, region, income, married, children, car,
save_act, current_act, mortgage and pep. Among these, id, age, sex, save_act, current_act, and
children are the attributes that do not provide useful information for prediction; id in particular
is a unique identifier and carries no predictive value.
7. What proportion of customers have a mortgage and live in the inner city?
More than 144 of the inner-city customers have a mortgage.
8. What proportion of customers have a mortgage and an income between $8,000 and
$29,000?
62% of the customers with a mortgage have an income between $8,000 and $29,000.
9. How many customers are married and have no mortgage?
More than 120 of the 172 married customers have no mortgage.
10. How many customers do not own a car but have a mortgage?
156 customers have a mortgage but do not own a car, which is just under a third of the 500
customers in total.
Task 2: Practical Analysis
Use the dataset from Task 1 to perform the data mining tasks for Task 2 and compare the
performance of the following classification algorithms on this dataset:
Classification Algorithms
• Naive Bayes
Naïve Bayes algorithms are widely used in data mining and machine learning solutions. Naïve Bayes is
not a single algorithm but a family of classifiers based on Bayes' theorem, which gives the probability
that an event occurs given that another event has already occurred: P(A|B) = P(B|A) · P(A) / P(B). The
"naïve" part of the name comes from the simplifying assumption that the features are independent of one
another given the class, so every feature contributes to the classification independently. Although this
independence assumption rarely holds in real-world data, the algorithm is efficient and its accuracy is
often found to be high in practice. However, the traditional approach to attribute selection requires
complex computation; this computation can be reduced if only some of the attributes are chosen, forming
a selective naïve Bayes model. [1] A naïve Bayes classifier can be implemented in a number of ways; the
Gaussian, multinomial, and Bernoulli naïve Bayes classifiers are some examples. Even though the naïve
Bayes classifier relies on an assumption that may not be realistic, it is of great use in machine learning
and data science; spam email filtering and document classification are typical use cases. [2]
When the naïve Bayes classifier was tested in WEKA on the given dataset with the cross-validation
folds set to 10, it took almost no time to build the model. The weka.classifiers.bayes.NaiveBayes
classifier scheme was used to build the model for the given dataset. The dataset has a total of 500
instances, and the classifier summary shows that 61.8% of the instances (309) were classified
correctly, whereas 38.2% (191 instances) were classified incorrectly. The kappa statistic was 0.2198
and the mean absolute error was 0.4346; similarly, the root mean squared error, relative absolute
error, and root relative squared error were 0.4784, 87.602%, and 96.0465% respectively.
[1] https://www.sciencedirect.com/science/article/abs/pii/S0950705119306185
[2] https://www.infona.pl/resource/bwmeta1.element.baztech-abc8af42-7c69-453b-b291-4a67ddfba8a0
As the kappa statistic is only 0.2198, there is little agreement between the predicted and the actual
classes beyond what would be expected by chance. The summary also shows that the model's error
measures are high, which is a point of concern as it reduces the reliability of the predictions.
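The run above can be reproduced outside the Explorer with a short Java program against the Weka API. The following is a minimal sketch, assuming data.arff sits in the working directory and pep (the last attribute) is the class; the random seed, and therefore the exact fold split and figures, may differ from the Explorer's:

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class NaiveBayesRun {
    public static void main(String[] args) throws Exception {
        // Load the corrected ARFF file and mark the last attribute (pep) as the class
        Instances data = new DataSource("data.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        // 10-fold cross-validation, matching the Explorer setting used above
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(new NaiveBayes(), data, 10, new Random(1));

        System.out.println(eval.toSummaryString());       // accuracy, kappa, error measures
        System.out.println(eval.toMatrixString());        // confusion matrix
        System.out.println(eval.toClassDetailsString());  // per-class TP/FP rates, F-measure
    }
}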
The detailed accuracy by class table (figure-2) gives a clearer picture of the analysis.
Figure-2: Detailed accuracy by class
The true positive rate for class "YES" is 0.5 and for class "No" it is 0.717, which indicates that half
of the "YES" instances and roughly two thirds of the "No" instances were correctly identified by the
model. The weighted average of the true positive rate is 0.618. Similarly, the false positive rate for
class "YES" is 0.283, and its value for class "No" is 0.5; the weighted average of the false positive
rate is 0.401. The other measurements for both classes can be seen in the table shown in figure-2.
The confusion matrix of the created model gives an idea of the correctly and incorrectly classified
instances. The main diagonal of the confusion matrix shows the correctly classified instances,
whereas the off-diagonal entries show the incorrectly classified instances; the labels a and b stand
for the "Yes" and "No" classes respectively.
Figure-3 plots the given pep values against the predicted pep values, i.e. the classifier errors.
Similarly, figure-4 shows the margin curve.
Figure-3: Visualization of the classifier errors
Figure-4: Visualization of the margin curve
• HoeffdingTree
The Hoeffding tree is an incremental, anytime decision tree algorithm capable of learning from
huge data streams; the only assumption is that the distribution generating the examples does not
change over time. The Hoeffding tree algorithm is a learning algorithm widely used in data science
and machine learning, in particular for stream data classification. Like other decision trees, it tries
to predict the most probable outcome for each instance; for example, it was initially used to predict
which web site has a higher chance of being accessed than others, a task also known as tracing web
clickstreams. In a nutshell, it runs in sublinear time and yields a decision tree almost identical to
the one a traditional batch learner would produce. [3]
When the Hoeffding tree classifier was tested in WEKA on the given dataset with the cross-validation
folds set to 10, it took almost no time to build the model. The
weka.classifiers.trees.HoeffdingTree -L 2 -S 1 -E 1.0E-7 -H 0.05 -M 0.01 -G 200.0 -N 0.0
classifier scheme was used to build the model for the given dataset.
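Assuming the same loading and evaluation code as in the naïve Bayes sketch above, the only change needed to reproduce this scheme in Java is to configure the classifier from the option string (the options below are copied from the scheme line):

import weka.classifiers.trees.HoeffdingTree;
import weka.core.Utils;

// Configure the classifier exactly as in the Explorer's scheme line
HoeffdingTree ht = new HoeffdingTree();
ht.setOptions(Utils.splitOptions(
        "-L 2 -S 1 -E 1.0E-7 -H 0.05 -M 0.01 -G 200.0 -N 0.0"));
// pass ht to Evaluation.crossValidateModel(...) in place of new NaiveBayes()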
The model was generated for the given dataset, which has 500 instances in total, and took almost
no time to build; the summary of the build is shown in figure-5.
As figure-5 indicates, 62.6% of the instances were classified correctly, whereas 37.4% were classified
incorrectly. Overall the accuracy is decent, but it is not reliable. Here the kappa statistic is 0.2356,
which again means there is little agreement between the predicted and the actual classes. Similarly, the
mean absolute error is 0.432, which is quite large: it suggests we can expect roughly 43.2% error in the
model's classifications. The root mean squared error is 0.4769, and the relative absolute error and root
relative squared error are 87.0726% and 95.7448% respectively. Since these error values are high, the
model cannot be considered reliable or trusted.
[3] https://weka.sourceforge.io/doc.dev/weka/classifiers/trees/HoeffdingTree.html
Figure-6 shows the detailed accuracy by class. The true positive rate for class "YES" is 0.504 and for
class "No" it is 0.728, which indicates that about half of the "YES" instances and almost three quarters
of the "No" instances were correctly identified by the model. The weighted average of the true positive
rate is 0.626. Similarly, the false positive rate for class "YES" is 0.272, and its value for class "No"
is 0.496; the weighted average of the false positive rate is 0.394. The other measurements for both
classes can be seen in the table shown in figure-6.
As with the naïve Bayes model, the confusion matrix shows the correctly classified instances on the
main diagonal and the incorrectly classified instances off the diagonal, with a and b again standing
for the "Yes" and "No" classes.
Figure-8 plots the given pep values against the predicted pep values, i.e. the classifier errors.
Similarly, figure-9 shows the margin curve.
Figure-8: Visualization of the classifier errors
Figure-9: Visualization of the margin curve
• SMO
SMO stands for sequential minimal optimization, an algorithm used to train a support vector
classifier. WEKA's SMO implementation is used for data mining and analysis purposes. It replaces
all the missing values, and the nominal attributes present in the dataset are converted into binary
ones. On our dataset (data.arff) the SMO algorithm also normalizes all the attributes by default, so
the output and the coefficients presented in it are based on this normalized data rather than on the
original data. The quadratic programming problems that arise from training support vector
machines are solved by the sequential minimal optimization algorithm. SMO is an iterative
algorithm in which the large problem is divided into a series of the smallest possible sub-problems,
which are then solved analytically. This algorithm is found in the functions section of the WEKA
Explorer. The scheme used in our case was:
Scheme: weka.classifiers.functions.SMO -C 1.0 -L 0.001 -P 1.0E-12 -N 0 -V -1 -W 1 -K
"weka.classifiers.functions.supportVector.PolyKernel -E 1.0 -C 250007" -calibrator
"weka.classifiers.functions.Logistic -R 1.0E-8 -M -1 -num-decimal-places 4"
The test of the dataset was done with 10-fold cross-validation, and the classifier model used the
full training set. There are 500 instances in total with 12 attributes: id, age, sex, region, income,
married, children, car, save_act, current_act, mortgage and pep.
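As with the other classifiers, this scheme can be reproduced in Java by configuring SMO from the option string. A minimal fragment, assuming the same loading and evaluation code as in the naïve Bayes sketch (the -calibrator option is omitted here, since the Logistic calibrator shown in the scheme line is the default):

import weka.classifiers.functions.SMO;
import weka.core.Utils;

// Same options as the Explorer's scheme line, including the polynomial kernel
SMO smo = new SMO();
smo.setOptions(Utils.splitOptions(
        "-C 1.0 -L 0.001 -P 1.0E-12 -N 0 -V -1 -W 1 "
        + "-K \"weka.classifiers.functions.supportVector.PolyKernel -E 1.0 -C 250007\""));
// pass smo to Evaluation.crossValidateModel(...) as before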
We can observe the normalization of the data in the following figure:
Fig.: Normalization of the original data by the SMO algorithm
The summary of the classifier output can be observed in the figure below:
The detailed accuracy by class for the data.arff dataset is shown in the figure below:
The margin curve for the given dataset under the SMO algorithm was observed as follows:
• J48
J48 is an algorithm that is preferred for dataset classification, and one of the best algorithms for
examining data both categorically and continuously. The algorithm generates the decision tree via
C4.5 (an extension of the ID3 algorithm). It ignores missing values completely while building the
tree; the value of an ignored item can still be predicted from the attribute values of the other
records. The main idea of the J48 algorithm is the division of the dataset into ranges based on the
values of the attributes present in the training sample. When the dataset was passed through the
J48 classifier, the scheme used was:
Scheme: weka.classifiers.trees.J48 -C 0.25 -M 2
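A minimal fragment for reproducing this scheme in Java, again assuming the loading and evaluation code from the naïve Bayes sketch; printing the built classifier yields the pruned tree text discussed below:

import weka.classifiers.trees.J48;
import weka.core.Utils;

// -C 0.25 is the pruning confidence factor, -M 2 the minimum instances per leaf
J48 j48 = new J48();
j48.setOptions(Utils.splitOptions("-C 0.25 -M 2"));
j48.buildClassifier(data);  // data loaded from data.arff as before
System.out.println(j48);    // prints the pruned tree
// pass j48 to Evaluation.crossValidateModel(...) for the 10-fold results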
The J48 pruned tree obtained in the classifier output is as follows:
The decision tree obtained had 23 leaves and a tree size of 45. The internal nodes represent the
attributes, the branches represent the values those attributes can take in the observed samples,
and the terminal nodes represent the final value (i.e. the classification) of the dependent variable.
The decision tree visualized in our case is as follows:
The summary of the classifier output for the J48 algorithm is presented as follows:
Fig.: Summary of the J48 algorithm classifier output
A total of 439 instances (87.8%) were correctly classified, and 61 instances were incorrectly
classified. The kappa statistic for the J48 algorithm on our dataset data.arff was 0.7533, which
indicates substantial agreement. Other statistics such as the mean absolute error and root mean
squared error are much lower, at 0.1747 and 0.3265 respectively. The detailed accuracy by class
in the classifier output was observed as follows:
The true positive rate was observed to be 0.878 on weighted average, i.e. the rate at which our
model correctly predicted each class, and the false positive rate was 0.127 on weighted average.
The confusion matrix observed for the J48 classifier is presented below.
This confusion matrix confirms that 193 + 246 = 439 instances were correctly classified, and
35 + 26 = 61 instances were incorrectly classified.
Comparing the percentage of correctly classified instances, the J48 classification algorithm was
the clear winner, while the other three algorithms were so close to one another that their
performances could not be distinguished.
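The comparison reported here comes from WEKA's test output; a rough Java sketch of the same idea, running all four classifiers through the same 10-fold cross-validation on data.arff and printing the percentage of correctly classified instances for each (a single cross-validation run, not the Experimenter's repeated, significance-tested comparison):

import java.util.Random;
import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.functions.SMO;
import weka.classifiers.trees.HoeffdingTree;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CompareClassifiers {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("data.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1); // pep is the class

        Classifier[] classifiers = {
            new NaiveBayes(), new HoeffdingTree(), new SMO(), new J48()
        };
        for (Classifier c : classifiers) {
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(c, data, 10, new Random(1));
            System.out.printf("%-15s %.2f%% correct%n",
                    c.getClass().getSimpleName(), eval.pctCorrect());
        }
    }
}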
Fig.: Test Output for the comparison field: Percent_correct in WEKA Software
Therefore, the J48 algorithm came last in the percent_incorrect comparison field, i.e. it had the
fewest misclassifications.
Comparing the kappa statistics, the J48 tree algorithm was once again the winner with a value of
about 0.76, while the other three algorithms remained indistinguishable.
Comparing in terms of the F-measure, the J48 classifier is also the best, with a value of 0.87; the
F-measure for naïve Bayes and HoeffdingTree is 0.55, and 0.53 for the SMO algorithm. The
F-measure is the harmonic mean of precision and recall, F = 2 · (precision · recall) / (precision +
recall), and the higher it is, the better.
So, based on the results of the test output, we can conclude that the J48 classification algorithm
is the best of the four compared here.