Shsconf Glob2021 09001
1051/shsconf/202112909001
Globalization and its Socio-Economic Consequences 2021
Abstract
Research background: In this era of globalization, the growth of data in
research and educational communities has increased analysis accuracy,
which benefits dropout detection, academic status prediction, and trend
analysis. However, analysis accuracy is low when educational data are
incomplete. Moreover, current approaches to dropout prediction cannot
make use of all available sources.
Purpose of the article: This article aims to develop a model for predicting
student dropout using machine learning techniques.
Methods: The study used machine learning methods to identify students at
risk of dropping out early in their studies. The performance of the different
machine learning methods was evaluated using accuracy, precision,
support, and F-score. The algorithm that performed best on the dataset
under these measures was used to build the final prediction model.
Findings & value added: This study contributes to tackling the current
global challenge of student dropout. The developed learning analytics
prediction model allows higher education institutions to identify students
who are likely to drop out and to intervene in time to improve retention
rates and the quality of education. It can also help institutions plan
resources in advance for the coming academic semester and allocate them
appropriately.
© The Authors, published by EDP Sciences. This is an open access article distributed under the terms of the Creative
Commons Attribution License 4.0 (http://creativecommons.org/licenses/by/4.0/).
SHS Web of Conferences 129, 09001 (2021) https://doi.org/10.1051/shsconf/202112909001
1 Introduction
The growth of data in research and education communities has increased analysis accuracy,
which benefits dropout detection, academic status prediction, and trend analysis.
However, analysis accuracy is low when educational data are incomplete
(Chen et al., 2017).
Higher education institutions face the challenge of low student retention rates and an
increasing number of dropouts (Zhang and Rangwala, 2018). Therefore, they need to
develop learning analytics systems that identify students at risk of failing as early as
possible and provide timely intervention. The main principle of learning analytics is to
identify at-risk students and give timely intervention based on an analysis of student
behavior (Huang et al., 2020). Early prediction of students' academic status makes it
possible to intervene early and improve learning outcomes. It helps increase graduation
rates by supporting students appropriately, informing higher education policymakers,
monitoring the efficiency and effectiveness of teaching and learning activities, giving
critical feedback to students and teachers, and adapting learning activities (Ofori et
al., 2020).
An effective prediction algorithm achieves high accuracy in predicting students'
achievement and identifies low-performing students at the beginning of the learning process.
However, to achieve these objectives, a large volume of student data must be analyzed
using various machine learning models. Online learning environments such as
Moodle and Student Information Systems (SIS) support the learning
analytics paradigm by providing datasets for further analysis and reporting. Data from
online learning systems can support decision-making in students' learning process and
enable timely intervention for students who are likely to drop out, so as to improve
their performance.
Zhang and Rangwala (2018) identified key features for detecting students at risk of
dropping out using traditional statistical methods. Machine learning offers an advantage
over traditional forms of statistical analysis in that it emphasizes predictive performance
over provable theoretical properties. Machine learning methods are used to develop
prediction models and reveal patterns in available data, which is helpful in
decision-making (Hussain et al., 2018). Machine learning is a software modeling technique
for self-learning systems that draw meaningful inferences from data or experience using
mathematical and statistical operations (Alpaydin, 2020). Effective prediction of
student dropout at an early stage allows course instructors to intervene in time.
This helps to reduce the underlying problem through rapid and consistent intervention
mechanisms.
The first section of this paper introduces prediction models in global higher education
institutions. The second section discusses related studies on students' performance
prediction models and associated techniques, and compares selected previous works.
The third section describes the methodology used for this study, including the
experimental procedures, the data preprocessing methods, and the steps involved in
developing the proposed predictive model. The fourth section presents the study results
and discusses the performance measurements used to identify the winning prediction model.
2 Literature Review
This study aims to predict student academic status using Random Forest, Naïve Bayes
Classifier, Logistic Regression, and Decision Tree machine learning methods. Recent
research by Ofori et al. (2020) suggested that several machine learning models, such as
clustering, classification, and association rule mining, can be adapted to analyze the
data, depending on the suitability of the collected data and the aims of the data
analytics process.
Student dropout is one of the most complicated and challenging problems that students
and institutions face worldwide. Therefore, effectively predicting student dropout
could help alleviate social and economic costs (Chacha et al., 2019).
Recent studies have confirmed that machine learning methods are used to predict students
at risk of failing and dropout rates in order to improve student performance during their
studies (Albreiki et al., 2021).
Several machine learning models have been created to predict student dropout based on
algorithms such as decision trees, neural networks, random forests, support vector
machines, and logistic regression. Machine learning offers an advantage over traditional
statistical analysis by stressing predictive performance over verifiable theories
(Ofori et al., 2020). Machine learning methods are now commonly used to predict students'
academic achievement (Emirtekin et al., 2020). However, the effectiveness of these models
varies, mainly due to the type and size of the datasets used, feature selection strategies,
performance measurement criteria, and experimental procedures. In addition, different
studies used different techniques and selected different predictive variables. The
following table summarizes recent studies related to students' academic achievement,
dropout rate prediction, identification of at-risk students, and the evaluation criteria
that measure the effectiveness of the prediction model.
3 Methodology
The following are the main methods used in this study to develop the students' early
dropout prediction model:
In addition, data on students' assignment submission rate, the number of hours
spent in the SIS, and their active engagement in discussion forums were collected.
However, due to the complexity of these datasets, they could not be incorporated for
prediction purposes during the model development process.
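As a rough illustration of the kind of classifier evaluated in this study, the sketch
below trains a logistic regression model with plain batch gradient descent. It is not the
authors' implementation: the two standardized features (a grade score and an activity
score), the toy records, the learning rate, the number of epochs, and all function names
are assumptions made for the example. Label 1 stands for "Pass" and 0 for "Fail".

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def train_logistic_regression(X, y, learning_rate=0.5, epochs=500):
    """Fit weights and bias by batch gradient descent on the log-loss."""
    n_samples = len(X)
    n_features = len(X[0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            err = p - yi  # derivative of the log-loss with respect to z
            for j in range(n_features):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - learning_rate * gj / n_samples for wj, gj in zip(w, grad_w)]
        b -= learning_rate * grad_b / n_samples
    return w, b


def predict(w, b, x, threshold=0.5):
    """Return 1 ("Pass") if the predicted probability reaches the threshold."""
    z = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1 if sigmoid(z) >= threshold else 0


# Hypothetical standardized features: [grade score, activity score].
X = [[1.0, 0.8], [0.6, 1.2], [0.9, 0.5], [1.2, 1.0],
     [-1.0, -0.8], [-0.6, -1.2], [-0.9, -0.5], [-1.2, -1.0]]
y = [1, 1, 1, 1, 0, 0, 0, 0]

w, b = train_logistic_regression(X, y)
predictions = [predict(w, b, x) for x in X]
print("weights:", w, "bias:", b)
print("predictions:", predictions)
```

On real student records the features would first need the preprocessing described in this
section, and the model would be evaluated on held-out test data rather than on the
training records.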
The information indicated above (Table 3) contains metrics related to the dropout
prediction model. The metrics are described as follows:
• True Positive (TP): The number of students correctly classified as "Pass."
• True Negative (TN): The number of dropout students correctly classified as "Fail."
• False Positive (FP): The number of passed students incorrectly classified as "Fail."
• False Negative (FN): The number of failed students incorrectly classified as
"Pass."
• Recall
Recall is the proportion of actual positives that are predicted as positive. A high recall
ensures that the predictive model does not overlook students in the class being detected.
Recall is used to evaluate the actual success rate, that is, the students who passed
successfully. It is determined as follows:
$$\text{Recall} = \frac{TP}{TP + FN}$$
• Precision
Precision is the proportion of predicted positives that are actually positive, that is, the
fraction of true positives among all true positives and false positives predicted.
Precision is used to assess the prediction of student dropout, and it is determined as
follows:
$$\text{Precision} = \frac{TP}{TP + FP}$$
• F1 score
The F1 score is the harmonic mean of the recall and precision of a predictive model.
It expresses the balance between these two classification measures and is widely used
as a trade-off measure for imbalanced datasets, which makes it suitable for
classification problems where the target labels are imbalanced. It is determined as
follows:
$$F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$
The classification error rate is the proportion of misclassified instances over the
entire set of cases. Its value can be calculated as follows:
$$\text{Error rate} = \frac{FP + FN}{TP + TN + FP + FN}$$
Performance     Prediction Model
Measures        Decision Tree   Naïve Bayes   Random Forest   Logistic Regression
Accuracy        0.93            0.88          0.93            0.94
Precision       0.91            0.89          0.96            0.95
Recall          0.92            0.97          0.96            0.98
F1 Score        0.94            0.96          0.96            0.97
The logistic regression model achieved an accuracy of 94.44%, a precision of 95%, a
recall of 98%, and an F1 score of 97% in predicting student dropout. Out of the 90
records used for testing, the model correctly predicted 71 students as successful
non-dropouts, while four students who actually passed were incorrectly classified as
failed.
In addition, the model correctly predicted 14 students as dropouts. One student was
a dropout but was incorrectly classified by the model as passed. The proposed
predictive model is therefore highly reliable in predicting the data, with an error
rate of just 0.056 (5 of the 90 test records misclassified).
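The reported figures can be checked against the four counts given above. The sketch below
is a worked example rather than the authors' code: the function name is an assumption,
and the counts are labelled as in the metric definitions of this paper (FP = 4 passed
students classified as "Fail", FN = 1 dropout classified as "Pass"). Under the more
common convention with "Pass" as the positive class, the FP and FN counts would swap,
which exchanges precision and recall and leaves accuracy, F1 score, and error rate
unchanged.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute the evaluation measures used in this study from raw counts."""
    total = tp + tn + fp + fn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "error_rate": (fp + fn) / total,
    }


# Logistic regression results on the 90 test records, as reported in the text.
metrics = classification_metrics(tp=71, tn=14, fp=4, fn=1)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
# accuracy: 0.9444
# precision: 0.9467
# recall: 0.9861
# f1: 0.9660
# error_rate: 0.0556
```

These values are close to the reported 94.44% accuracy, 95% precision, 98% recall,
97% F1 score, and 0.056 error rate.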
The Random Forest predictions were very close to, but slightly below, those of the winning
model. It was identified as the second-best model, with 93% accuracy, 96% precision,
98% recall, and a 97% F1 score.
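The comparison behind the choice of the winning model can be sketched as follows. The
scores are transcribed from the results table, assuming the column order Decision Tree,
Naïve Bayes, Random Forest, Logistic Regression, which is the reading that matches the
logistic regression figures quoted in the text. The paper does not state its exact
selection rule, so ranking by the mean of the four measures is an assumption made purely
for illustration.

```python
# Scores as they appear in the results table (column order is an assumption).
results = {
    "Decision Tree": {"accuracy": 0.93, "precision": 0.91, "recall": 0.92, "f1": 0.94},
    "Naive Bayes": {"accuracy": 0.88, "precision": 0.89, "recall": 0.97, "f1": 0.96},
    "Random Forest": {"accuracy": 0.93, "precision": 0.96, "recall": 0.96, "f1": 0.96},
    "Logistic Regression": {"accuracy": 0.94, "precision": 0.95, "recall": 0.98, "f1": 0.97},
}


def mean_score(scores):
    """Average of the four performance measures for one model."""
    return sum(scores.values()) / len(scores)


# Rank the models from best to worst by their mean score.
ranking = sorted(results, key=lambda model: mean_score(results[model]), reverse=True)

# Also report which model scores highest on each individual measure.
measures = ["accuracy", "precision", "recall", "f1"]
best_per_measure = {
    measure: max(results, key=lambda model: results[model][measure])
    for measure in measures
}

print("best overall:", ranking[0])
print("runner-up:", ranking[1])
print("best per measure:", best_per_measure)
```

Under this rule Logistic Regression ranks first and Random Forest second, in line with
the discussion above, although Random Forest has the highest precision in the table.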
5 Conclusion
Early prediction of student dropout can help academic institutions to intervene in time
and to apply appropriate planning and training to improve students' success rate.
This study focused on predicting students' academic dropout using different machine
learning techniques. Decision Tree, Random Forest, Logistic Regression, and Naïve Bayes
were used for training and testing the model. The proposed prediction method helps
course instructors, institutes, and the university to make decisions about students'
performance and to apply appropriate interventions in advance to improve students'
academic results. The study found that the Logistic Regression model outperformed the
other models used in this study in predicting students' early dropout.
A re-examination of the proposed model using more data, possibly extracted from
academic big data sources, may be needed to achieve improved accuracy.
Future research should include unstructured data from students' online activity, such
as clickstreams, discussion forums, campus activities, and library use. In addition,
evaluating predictions made with deep learning methods would be essential to assess the
effectiveness of the forecast. A combination of machine learning techniques for early
dropout prediction should be used for feature selection, extraction of dropout factors,
and calculation of dropout rates. Moreover, it is not enough to evaluate the
effectiveness of a machine learning model by accuracy metrics alone.
According to Wu and Flach (2005), the AUC criterion, which expresses the area under the
ROC curve, could be used to evaluate machine learning models in addition to the
evaluation metrics used in this study.
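To make the AUC criterion concrete, the sketch below computes it from its rank-based
definition: the probability that a randomly chosen positive example receives a higher
score than a randomly chosen negative one, with ties counted as one half. The labels and
predicted scores are hypothetical and do not come from this study, and the function name
is an assumption.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores."""
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(positives) * len(negatives))


# Hypothetical predicted probabilities of "Pass" (label 1) for eight students.
labels = [1, 1, 1, 0, 1, 0, 0, 0]
scores = [0.95, 0.85, 0.70, 0.65, 0.60, 0.40, 0.30, 0.10]

auc = roc_auc(labels, scores)
print(f"AUC = {auc:.4f}")  # AUC = 0.9375
```

Unlike accuracy, this measure does not depend on a single classification threshold,
which is why it complements the metrics reported above.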
Acknowledgments
The work reported in this paper was conducted with the kind support of the University
of Pardubice, grant No. SGS_2021_008 of the Student Grant Competition. The outcome of
this study is part of the ongoing CRP project under the University of Pardubice.
References
1. Chen, M., Hao, Y., Hwang, K., Wang, L., & Wang, L. (2017). Disease prediction by
machine learning over big data from healthcare communities. IEEE Access, 5,
8869–8879.
2. Zhang, L., & Rangwala, H. (2018). Early identification of at-risk students using
iterative logistic regression. In International Conference on Artificial Intelligence
in Education (pp. 613–626). Springer, Cham.
3. Huang, A. Y., Lu, O. H., Huang, J. C., Yin, C. J., & Yang, S. J. (2020). Predicting
students' academic performance by using educational big data and learning
analytics: evaluation of classification methods and learning logs. Interactive
Learning Environments, 28(2), 206–230.
4. Ofori, F., Maina, E., & Gitonga, R. (2020). Using machine learning algorithms to
predict students performance and improve learning outcome: A literature based
review. Journal of Information and Technology, 4(1), 33–55.
5. Hussain, M., Zhu, W., Zhang, W., & Abidi, S. M. R. (2018). Student engagement
predictions in an e-learning system and their impact on student course assessment
scores. Computational Intelligence and Neuroscience, 2018.