Statistical Tests For Comparing Machine Learning Algorithms
https://doi.org/10.22214/ijraset.2022.47955
International Journal for Research in Applied Science & Engineering Technology (IJRASET)
ISSN: 2321-9653; IC Value: 45.98; SJ Impact Factor: 7.538
Volume 10 Issue XII Dec 2022- Available at www.ijraset.com
Abstract: The average performance of machine learning models is typically estimated with naive 10-fold cross-validation and then compared using statistical tests such as the paired Student's t-test and McNemar's test. The algorithm with the best average performance is expected to outperform the others, but what if the difference in average results is only a statistical fluke? A statistical hypothesis test is used to determine whether the differences in mean results between the three algorithms are real. Using statistical hypothesis testing, this study shows how to compare machine learning algorithms. When choosing a model, the output of several machine learning algorithms or modelling pipelines is compared, and the model that performs best on the chosen performance measure becomes the final model used to predict new data. This applies to classification and regression prediction models built with traditional machine learning as well as deep learning methods. Without such a test, it is difficult to determine whether the difference between two models is genuine.
Keywords: Machine Learning, Classifiers, Data Mining Techniques, Data Analysis, Learning Algorithms, Supervised Machine
Learning.
I. INTRODUCTION
Data can be understood by assuming a particular structure for a result and then using statistical methods to confirm or reject that assumption. The assumption is called a hypothesis, and the validation carried out with statistical tests is called statistical hypothesis testing. Whenever we want to make a claim about the distribution of data, or about whether grouped results differ, in applied machine learning, a statistical hypothesis test must be used [1][2]. The data itself is not what is interesting; what matters is how the data is interpreted. When it comes time to query the data and interpret the findings, statistical methods are used to attach a degree of certainty or probability to the answers. This class of procedure is called significance testing or hypothesis testing [3]. The term hypothesis may evoke scientific investigations in which a conjecture is tested, and that is a good step in the right direction. A statistical hypothesis test calculates a quantity for a given assumption, and the result of the test allows the researcher to judge whether the assumption holds or has been falsified. Two examples that are widely used in machine learning are: a test of whether a sample of data follows a normal distribution, and a test of whether two samples are drawn from the same population distribution. The default assumption of a statistical test, commonly known as the "nothing has changed" assumption, is called the null hypothesis, or hypothesis 0 (H0 for short). The alternative hypothesis, often known as hypothesis 1 or H1, is the violation of the test's assumption that is accepted when the available evidence indicates that H0 can be rejected; H1 is really just shorthand for "some other hypothesis". Hypothesis 0 (H0): the assumption of the test holds and fails to be rejected at some level of significance. Hypothesis 1 (H1): the assumption of the test does not hold and is rejected at some level of significance. The result of the test must be interpreted before the researcher can reject, or fail to reject, the null hypothesis [3]. Regardless of the chosen significance level, the result of a hypothesis test may still contain errors (Type I and Type II errors). In short, statistical hypothesis tests are required whenever there is a need to examine the distribution of data or to decide whether grouped results differ [2-4].
Interpreting the p-value: What is the significance of the p-value? The p-value is used to determine whether a result is statistically significant. For example, a normality test may be performed on a sample of data and find that it is unlikely the sample deviates from a Gaussian distribution, thereby failing to reject the null hypothesis [5]. A statistical hypothesis test returns a p-value as its result. This is a quantity that can be used to interpret or quantify the outcome of the test and decide whether to reject the null hypothesis. This is done by comparing the p-value with a significance level, which is a value chosen in advance [5]. The significance level is commonly denoted by alpha, and the most commonly used value is 0.05. A smaller alpha value, such as 0.01, demands stronger evidence before the null hypothesis is rejected [6].
The p-value is compared with the pre-chosen alpha value. A result is statistically significant when the p-value is less than alpha: something has changed from the default assumption, and the null hypothesis is rejected [7][8]. If the p-value is greater than alpha, the null hypothesis is not rejected and the result is not significant. If the p-value equals alpha, the null hypothesis is, by convention, still rejected and the result is treated as significant [9]. For example, if a test was run to check whether a sample of data was normal and the resulting p-value was 0.07, one might state: the test failed to reject the null hypothesis at the 0.05 significance level, suggesting that the sample of data is likely Gaussian. A confidence level for the hypothesis, given the observed sample data, can be obtained by subtracting the significance level from one (e.g., 1 − 0.05 = 95%) [10].
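To make the decision rule concrete, the following minimal Python sketch runs a normality test on a synthetic sample and compares the resulting p-value with alpha; the use of SciPy's Shapiro-Wilk test and the synthetic data are illustrative assumptions, not part of the original study.

# Minimal sketch: interpreting a p-value from a normality test (illustrative only).
import numpy as np
from scipy.stats import shapiro

rng = np.random.default_rng(seed=1)
sample = rng.normal(loc=50, scale=5, size=100)   # hypothetical sample data

alpha = 0.05                                     # pre-chosen significance level
stat, p_value = shapiro(sample)                  # H0: the sample is Gaussian

print(f"statistic={stat:.3f}, p-value={p_value:.3f}")
if p_value > alpha:
    print("Fail to reject H0: the sample looks Gaussian at the 0.05 level.")
else:
    print("Reject H0: the sample does not look Gaussian at the 0.05 level.")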
III. METHODOLOGY/EXPERIMENTAL
A. McNemar's Test
McNemar's test can be used to compare the performance of two classifiers when we have matched pairs of predictions. The test works well when the two classifiers A and B disagree on many predictions, which in turn generally requires a reasonable amount of data. Using this test, we can compare the performance of two classifiers on the N items of a single test set, rather than over repeated train/test splits as in a paired t-test.
1) Assumptions
a) Random Sample
b) Independence
c) Mutually exclusive groups
2) Contingency Table
A contingency table is a tabulation or count of two categorical variables. In the case of McNemar's test, we are interested in binary variables, correct/incorrect or yes/no, for a control and a treatment, or for two cases. This is called a 2×2 contingency table.
The contingency table may not be intuitive at first glance, so let us make it concrete with a worked example.
Consider two trained classifiers, each of which makes a binary class prediction for every one of the 10 examples in a test dataset. The predictions are evaluated and marked as correct or incorrect, and the counts of examples on which the two classifiers agree or disagree fill the four cells of the table.
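A minimal sketch of how such a 2×2 table could be built and tested in Python is shown below; the counts and the use of the statsmodels library are illustrative assumptions rather than results from the study.

# Minimal sketch: McNemar's test on a 2x2 contingency table (counts are assumed).
from statsmodels.stats.contingency_tables import mcnemar

# Rows: classifier A correct / incorrect; columns: classifier B correct / incorrect.
# The off-diagonal disagreement cells are what the test actually uses.
table = [[4, 2],
         [1, 3]]   # 10 test examples in total, matching the worked example

result = mcnemar(table, exact=True)   # exact binomial test, suitable for small counts
print(f"statistic={result.statistic:.3f}, p-value={result.pvalue:.3f}")

alpha = 0.05
if result.pvalue > alpha:
    print("Fail to reject H0: no significant difference in the classifiers' error rates.")
else:
    print("Reject H0: the classifiers have significantly different error rates.")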
B. Paired t-test
1) Take a sample of N observations (obtained from k-fold cv). These results are assumed to come from a normal distribution with
a fixed mean and variance.
2) Calculate the sample mean and sample variance for these observations.
3) Calculate the t-statistic.
4) Use a t-distribution with N-1 degrees of freedom to estimate how likely it is that the true mean is within a given range.
5) Reject the null hypothesis at the p significance level if the t-statistic does not lie in the following interval:
Fig. 1 Rejection interval: with N−1 degrees of freedom and significance level p, the null hypothesis is rejected when the t-statistic falls outside the interval (−t(1−p/2, N−1), +t(1−p/2, N−1)).
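A minimal Python sketch of this procedure is given below; it applies a paired t-test to per-fold accuracy scores obtained from the same 10-fold split for two classifiers. The dataset, the particular models, and the SciPy/scikit-learn calls are assumptions made for illustration.

# Minimal sketch: paired t-test on per-fold scores from one 10-fold CV split.
from scipy.stats import ttest_rel
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=1)
cv = KFold(n_splits=10, shuffle=True, random_state=1)   # same folds for both models

scores_a = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
scores_b = cross_val_score(KNeighborsClassifier(n_neighbors=5), X, y, cv=cv)

t_stat, p_value = ttest_rel(scores_a, scores_b)          # N - 1 = 9 degrees of freedom
print(f"t={t_stat:.3f}, p={p_value:.3f}")
if p_value <= 0.05:
    print("Reject H0: the mean fold scores differ significantly.")
else:
    print("Fail to reject H0: no significant difference in mean fold scores.")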
2) K-nearest neighbors: K-nearest neighbors (KNN) is a non-parametric method used for classification and regression. It is one of the simplest ML techniques in use and is a lazy learning model that builds a local approximation. Basic theory: the basic logic of KNN is to examine the neighbourhood of a test data point, assume the point is similar to its neighbours, and infer the output from them. In KNN we look for the nearest neighbours and derive a prediction: in KNN classification, majority voting over the k closest data points is used, while in KNN regression, the average of the k closest data points is taken as the output. As a general rule, an odd value of k is chosen so that the vote cannot tie. KNN is a lazy learning model in which computation happens only at prediction time.
3) Support Vector Machine: Support Vector Machine or SVM is one of the most popular supervised learning algorithms used for
both classification and regression problems. However, it is primarily used for classification problems in machine learning. The
goal of the SVM algorithm is to create the best line or decision boundary that can segregate the n-dimensional space into
classes so that we can easily place a new data point into the correct category in the future. This best decision boundary is called
the hyperplane. SVM selects extreme points/vectors that help in creating the hyperplane. These extreme cases are called support
vectors, and thus the algorithm is called a support vector machine.
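As with KNN, a short scikit-learn sketch (an assumed implementation choice, not code from the study) illustrates fitting an SVM classifier and inspecting the support vectors that define its hyperplane.

# Minimal sketch: SVM classification; the support vectors determine the hyperplane.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

svm = SVC(kernel="linear", C=1.0)
svm.fit(X_train, y_train)
print(f"support vectors per class: {svm.n_support_}")
print(f"test accuracy: {svm.score(X_test, y_test):.3f}")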
[Figure: experimental workflow — input data are pre-processed, the ML models (linear regression, K-nearest neighbors, and support vector machine) are evaluated with k-fold cross-validation, and the resulting scores are passed to the statistical tests.]
B. LR vs KNN
KNN is a non-parametric model, while LR is a parametric model. KNN is slow at prediction time because it has to store all the training data and search for the nearest neighbours, while LR can compute its output directly from the tuned θ coefficients.
D. KNN vs SVM
SVM handles outliers better than KNN. When the number of training examples is much larger than the number of features (m >> n), KNN tends to perform better than SVM. SVM outperforms KNN when there are many features and relatively little training data.
V. CONCLUSION
Hypothesis testing provides a reliable framework for making decisions about the population of interest on the basis of sample data. It helps the researcher to extrapolate from a sample to a larger population with known confidence. Comparing results obtained from a single run of different models in order to choose the best one is never a good method; statistical tests allow us to state objectively whether one model works better than another.
This study demonstrates the use of statistical hypothesis tests to evaluate machine learning algorithms. In addition, it cautions researchers that selecting models on the basis of average performance alone can be misleading. A suitable methodology for comparing machine learning algorithms is five repetitions of two-fold cross-validation (5×2cv) combined with a modified paired Student's t-test, and algorithms can be compared using the MLxtend machine learning library together with statistical hypothesis testing.
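As a hedged illustration of this recommendation, the sketch below compares two classifiers with MLxtend's 5×2cv paired t-test; the synthetic data and the chosen models are assumptions made for demonstration only.

# Minimal sketch: 5x2cv paired t-test (Dietterich, 1998) via the MLxtend library.
from mlxtend.evaluate import paired_ttest_5x2cv
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=10, random_state=1)

t_stat, p_value = paired_ttest_5x2cv(
    estimator1=KNeighborsClassifier(n_neighbors=5),
    estimator2=SVC(kernel="linear"),
    X=X, y=y,
    random_seed=1,
)
print(f"t={t_stat:.3f}, p={p_value:.3f}")
if p_value <= 0.05:
    print("Reject H0: the classifiers' performance differs significantly.")
else:
    print("Fail to reject H0: no significant performance difference detected.")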
REFERENCES
[1] Komisi Penyiaran Indonesia, "Survey of the Television Broadcast Program Quality Index, Period 5, 2016" ("Survei Indeks Kualitas Program Siaran Televisi Periode 5 tahun 2016"), 2016.
[2] A. Abdallah, N. P. Rana, Y. K. Dwivedi, and R. Algharabat, "Social media in marketing: A review and analysis of the existing literature," Telemat. Informatics, vol. 34, no. 7, pp. 1177–1190, 2017. https://doi.org/10.1016/j.tele.2017.05.008
[3] W. G. Mangold and D. J. Faulds, “Social media: The new hybrid element of the promotion mix,” Business Horizons., vol. 52, no. 4, pp. 357–365, 2009.
https://doi.org/10.1016/j.bushor.2009.03.002
[4] M. H. Saragih and A. S. Girsang, “Sentiment Analysis of Customer Engagement on Social Media in Transport Online,” in 2017 International
Conference on Sustainable Information Engineering and Technology (SIET), 2017. https://doi.org/10.1109/SIET.2017.8304103
[5] J. H. Kietzmann, K. Hermkens, I. P. McCarthy, and B. S. Silvestre, “Social media? Get serious! Understanding the functional building blocks of social media,”
Business Horizons., vol. 54, no. 3, pp. 241–251, 2011. https://doi.org/10.1016/j.bushor.2011.01.005
[6] J. A. Morente-Molinera, G. Kou, C. Pang, F. J. Cabrerizo, and E. Herrera-Viedma, “An automatic procedure to create fuzzy ontologies from users’
opinions using sentiment analysis procedures and multi-granular fuzzy linguistic modelling methods,” Information Sciences, vol. 476, pp. 222–238, 2019.
https://doi.org/10.1016/j.ins.2018.10.022
[7] Y. Lu, F. Wang, and R. Maciejewski, "Business Intelligence from Social Media: A Study from the VAST Box Office Challenge," IEEE Comput. Graph. Appl., vol. 34, no. 5, pp. 58–69, 2014. https://doi.org/10.1109/MCG.2014.61
[8] M. Yulianto, A. S. Girsang, and R. Y. Rumagit, “Business Intelligence for Social Media Interaction In The Travel Industry In Indonesia,” J. Intell. Stud. Bus.,
vol. 8, no. 2, pp. 72–79, 2018.
[9] H. H. Do, P. W. C. Prasad, A. Maag, and A. Alsadoon, “Deep Learning for Aspect-Based Sentiment Analysis: A Comparative Review,” Expert Syst. Appl.,
vol. 118, pp. 272–299, 2019. https://doi.org/10.1016/j.eswa.2018.10.003
[10] Y. Fang, H. Tan, and J. Zhang, “Multi-strategy sentiment analysis of consumer reviews based on semantic fuzziness,” IEEE Access, vol. 6, no. c, pp.
20625–20631, 2018. https://doi.org/10.1109/ACCESS.2018.2820025
[11] H. Isah, P. Trundle, and D. Neagu, “Social Media Analysis for Product Safety using Text Mining and Sentiment Analysis,” in 2014 14th UK
Workshop on Computational Intelligence (UKCI), 2014. https://doi.org/10.1109/UKCI.2014.6930158
[12] P. Ducange, M. Fazzolari, M. Petrocchi, and M. Vecchio, “An effective Decision Support System for social media listening based on cross-source
sentiment analysis models,” Eng. Appl. Artif. Intell., vol. 78, no. October 2018, pp. 71–85, 2019. https://doi.org/10.1016/j.engappai.2018.10.014
[13] M. S. Omar, A. Njeru, S. Paracha, M. Wannous, and S. Yi, “Mining Tweets for Education Reforms,” in 2017 International Conference on Applied System
Innovation (ICASI), 2017. https://doi.org/10.1109/ICASI.2017.7988441
[14] P. F. Kurnia and Suharjito, "Business Intelligence Model to Analyze Social Media Information," Procedia Comput. Sci., vol. 135, pp. 5–14, 2018. https://doi.org/10.1016/j.procs.2018.08.144
[15] P. F. Kurnia, "Design and implementation of business intelligence for a social media monitoring and analysis system (case study at PT Dynamo Media Network)" ("Perancangan dan implementasi bisnis intelligence pada sistem social media monitoring and analysis"), Bina Nusantara University, 2017.
[16] M. Al-Amin, M. S. Islam, and S. Das Uzzal, "Sentiment Analysis of Bengali Comments With Word2Vec and Sentiment Information of Words," in 2017 International Conference on Electrical, Computer and Communication Engineering (ECCE), 2017. https://doi.org/10.1109/ECACE.2017.7912903
[17] R. Kimball and M. Ross, The Kimball Group Reader: Relentlessly Practical Tools for Data Warehousing and Business Intelligence. 2010.
[18] C. Vercellis, Business Intelligence, Data Mining and Optimization for Decision Making. John Wiley & Sons, Ltd, 2009.