Analysis of COVID-19 hospitalization and mortality in South Africa: An application of Machine Learning methods

Capita Selecta of Computational Biology (3772-2223)


2022-2023

Master of Statistics and Data Science


Hasselt University

Group members:

Amber Huybrechts (1953107)


Joachim Webers (1849120)
Farida Iddy (2159270)
Melvin Estolano (2159122)
Mirriam Dianah Lucheveleli (2159277)

Submission: December 23, 2022

Lecturer:

Prof. Dr. Samuel Manda


Capita Selecta and Computational Biology 2022/2023

Introduction
This study uses health insurance data from Discovery Health (PTY) Ltd on 188,292 individuals in South Africa
who tested positive for COVID-19 between March 2020 and 28 February 2021, together with the hospitalization
data for these members up to 30 June 2021 [1]. Machine learning methods, namely k-nearest neighbors and
decision trees, were compared with logistic regression. The methods were evaluated in terms of accuracy,
sensitivity, specificity, positive predictive value, and negative predictive value, as well as ROC curves.
They were applied to predict two endpoints: hospitalization and death. The predictors were gender (male or
female), age group (<18, 18-40, 40-65, >65 years), number of comorbidities, pandemic wave (Pre-wave 1,
Wave 1, Post-wave 1, Wave 2), plan type (four levels of insurance cover), and province (the nine South
African provinces) of the patient. Hospital code and length of stay were added as predictors for the death
outcome. For all analyses, 70% of the data formed the training set and 30% the test set.

1 Methodology
1.1 Data description
Positive and negative cases are imbalanced: only 18.84% of cases were admitted and only 3.32% died. To
address this, the number of negative cases in the training set was reduced, and the training set was
constructed so that it contains 70% of the positive cases. Dealing with this imbalance is important because
the algorithms would otherwise recognize negative cases very easily but have difficulty identifying positive
cases [2].
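
A minimal sketch of one way such a split could be built, written in Python rather than the R used for the analyses. The function name, the 1:1 ratio of negatives to positives in the training set, the 70/30 split of the negatives, and the seed are assumptions made for illustration; the report does not state the exact undersampling scheme.

```python
import random

def split_with_undersampling(labels, train_frac=0.7, neg_per_pos=1.0, seed=0):
    """Return (train_idx, test_idx) for binary labels (1 = positive).

    train_frac of the positives go to the training set. Negatives are split
    in the same proportion, then the training negatives are undersampled to
    neg_per_pos negatives per training positive (an assumed ratio).
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    rng.shuffle(pos)
    rng.shuffle(neg)

    n_pos_train = round(train_frac * len(pos))
    n_neg_train = round(train_frac * len(neg))
    train_pos, test_pos = pos[:n_pos_train], pos[n_pos_train:]
    train_neg, test_neg = neg[:n_neg_train], neg[n_neg_train:]

    # Undersample the training negatives; the dropped negatives are discarded.
    n_keep = min(len(train_neg), round(neg_per_pos * len(train_pos)))
    train_neg = train_neg[:n_keep]

    return sorted(train_pos + train_neg), sorted(test_pos + test_neg)

# Toy data: 10% positives, loosely mimicking an imbalanced outcome.
labels = [1] * 100 + [0] * 900
train_idx, test_idx = split_with_undersampling(labels)
print(len(train_idx), sum(labels[i] for i in train_idx))
```

With the toy labels above, the training set holds 70 positives and 70 negatives, while the test set keeps the original class proportions.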

1.2 Logistic regression


Logistic regression is used to model binary outcomes. It consists of three components: a random component,
a systematic component, and a link function. The random component is the outcome variable. The systematic
component is the linear combination of parameters and covariates, e.g. β0 + β1 Age. The logit link connects
the random and systematic components, so that coefficients can be interpreted as the log odds of being in
one group compared to another [3]. In this study, the optimal cut-point for prediction was obtained using
Youden's J statistic.
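
Youden's J statistic is J = sensitivity + specificity - 1, and the optimal cut-point is the threshold that maximizes it. The sketch below illustrates the idea in Python (not the R code used in the study); the function name and the toy probabilities are hypothetical, and predicting positive when the probability is at or above the cut-point is an assumed convention.

```python
def youden_cutpoint(probs, labels):
    """Return (cutpoint, J) maximizing J = sensitivity + specificity - 1.

    A case is predicted positive when its probability is >= cutpoint.
    Every distinct predicted probability is tried as a candidate.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best_cut, best_j = None, -1.0
    for cut in sorted(set(probs)):
        tp = sum(1 for p, y in zip(probs, labels) if p >= cut and y == 1)
        tn = sum(1 for p, y in zip(probs, labels) if p < cut and y == 0)
        j = tp / n_pos + tn / n_neg - 1
        if j > best_j:
            best_cut, best_j = cut, j
    return best_cut, best_j

# Hypothetical predicted probabilities and observed outcomes.
probs = [0.10, 0.20, 0.30, 0.35, 0.60, 0.80, 0.90]
labels = [0, 0, 1, 0, 1, 1, 1]
cut, j = youden_cutpoint(probs, labels)
print(cut, j)
```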

1.3 K-Nearest Neighbors


K-nearest neighbors (k-NN) is a supervised machine learning method. It classifies each new observation by
assigning it to the majority class among its k nearest neighbors, i.e. the k closest training points in
Euclidean distance [2]. The number of neighbors k is set to the square root of the number of rows in the
training data set. The variable admission length of stay, used for predicting the death outcome, is
normalized because it is measured on a different scale than the other variables. Categorical variables
such as province and hospital code are converted to dummy variables [4].
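
The preprocessing and classification steps described here can be sketched as follows, in plain Python rather than the R used in the study. All data values are invented, min-max scaling is assumed as the normalization, and a full one-hot encoding stands in for dummy variables (which often drop a reference level).

```python
import math
from collections import Counter

def min_max(values):
    """Rescale numeric values to [0, 1] (assumed normalization)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) if hi > lo else 0.0 for v in values]

def one_hot(values):
    """Encode a categorical variable as 0/1 indicator columns."""
    levels = sorted(set(values))
    return [[1.0 if v == level else 0.0 for level in levels] for v in values]

def knn_predict(train_x, train_y, query, k):
    """Majority class among the k training points closest in Euclidean distance."""
    dists = sorted((math.dist(x, query), y) for x, y in zip(train_x, train_y))
    votes = Counter(y for _, y in dists[:k])
    return votes.most_common(1)[0][0]

# Hypothetical training data: length of stay (days), province code, death (1/0).
stay = [1, 2, 3, 20, 25, 30, 2, 28, 4]
province = ["A", "A", "B", "B", "A", "B", "B", "A", "A"]
died = [0, 0, 0, 1, 1, 1, 0, 1, 0]

features = [[s] + d for s, d in zip(min_max(stay), one_hot(province))]
k = round(math.sqrt(len(features)))  # square root of the number of training rows

# New patient: 27 days in province "B", scaled with the training min and max.
query = [(27 - min(stay)) / (max(stay) - min(stay)), 0.0, 1.0]
pred = knn_predict(features, died, query, k)
print(k, pred)
```

With nine training rows the rule gives k = 3, and the three nearest neighbors of the new patient decide the predicted class.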

1.4 Decision tree


Decision trees are a supervised machine learning method. The hierarchical structure of a decision tree
leads to the final outcome by traversing the nodes of the tree from the root to the leaves. Each internal
node tests a feature and splits into further nodes as we move down the tree. In this paper, the rpart
package in R is used to build the tree [5].
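
To illustrate the root-to-leaf traversal, here is a small Python sketch. The tree, its features, and its thresholds are invented for illustration and are not the tree fitted with rpart in this report.

```python
# Hypothetical tree: splits and leaf labels are invented for illustration
# and are NOT the tree fitted in the report.
tree = {
    "feature": "comorbidities", "threshold": 2,
    "left": {"label": "Not Admitted"},        # comorbidities < 2
    "right": {                                # comorbidities >= 2
        "feature": "age_over_65", "threshold": 1,
        "left": {"label": "Not Admitted"},    # age_over_65 < 1
        "right": {"label": "Admitted"},       # age_over_65 >= 1
    },
}

def predict(node, record):
    """Walk from the root to a leaf, going left when the value is below the threshold."""
    while "label" not in node:
        branch = "left" if record[node["feature"]] < node["threshold"] else "right"
        node = node[branch]
    return node["label"]

print(predict(tree, {"comorbidities": 3, "age_over_65": 1}))
print(predict(tree, {"comorbidities": 0, "age_over_65": 1}))
```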


1.5 Model comparison


For all models, confusion matrices (Appendix Table 2) were constructed to obtain the performance measures:
accuracy, sensitivity, specificity, positive predictive value, and negative predictive value. Finally, ROC
curves were plotted to give a graphical representation of each method's performance, with sensitivity on
the y-axis and 1 - specificity on the x-axis. The larger the area under the curve, the better the method.
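
As a worked example, the five measures can be computed directly from the four cells of a confusion matrix. The Python sketch below uses the logistic regression counts for the admission endpoint from Appendix Table 2, assuming that rows hold the predicted class and columns the observed class; under that reading the results match the reported values. The function name is illustrative.

```python
def classification_metrics(tp, fp, fn, tn):
    """Performance measures from the four cells of a 2x2 confusion matrix."""
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "sensitivity": tp / (tp + fn),   # true positive rate
        "specificity": tn / (tn + fp),   # true negative rate
        "ppv": tp / (tp + fp),           # positive predictive value
        "npv": tn / (tn + fn),           # negative predictive value
    }

# Logistic regression, admission endpoint (counts from Appendix Table 2),
# assuming rows are predicted classes and columns are observed classes.
m = classification_metrics(tp=6826, fp=3436, fn=3814, tn=7565)
for name, value in m.items():
    print(f"{name}: {value:.2%}")
```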

2 Results and discussion

Figure 1: ROC curves: (a) Logistic Regression for Admission endpoint (b) k-NN for Admission endpoint
(c) Decision tree for Admission endpoint (d) Logistic Regression for Death endpoint (e) k-NN for Death
endpoint (f) Decision tree for Death endpoint

Table 1: Comparison of accuracy measures for logistic regression, k-NN, and decision tree
                     Logistic             k-NN                 Decision tree
                     Admission  Death     Admission  Death     Admission  Death
Accuracy             66.50%     76.30%    62.01%     84.26%    64.34%     87.11%
Sensitivity          64.15%     71.63%    33.87%     71.52%    41.51%     80.80%
Specificity          68.77%     79.18%    89.22%     92.10%    86.41%     90.98%
PPV                  66.52%     67.90%    75.24%     84.77%    74.71%     84.64%
NPV                  66.48%     81.95%    58.25%     84.03%    60.44%     88.52%
Optimal cut-point    0.3265     0.4384

In summary, the choice of the best method depends on which measure is prioritized. Looking across all
accuracy measures, the decision tree provides fair assessments, especially for the death endpoint.


3 Conclusion
In this study, two endpoints were considered in analyzing COVID-19 data from South Africa. Three methods
(logistic regression, k-NN, and decision tree) were applied and assessed through accuracy measures. An
important point in this analysis is the challenge of building an accurate and efficient classifier. As
seen in the results, one method can be preferred over another depending on the accuracy measure
considered. Considering the practical applications of our findings, for the first outcome (admission) we
focus on the efficient use of hospital resources by choosing a low false positive rate and thus a high
specificity. For the second outcome (death), on the other hand, we focus on the human aspect by choosing
a low false negative rate and thus a high sensitivity. Therefore, in this study the k-NN method is
preferred for predicting the admission outcome and the decision tree method for predicting the death outcome.
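
The selection rule stated here can be checked against the values in Table 1 with a few lines of Python. This is only an illustration of the reasoning; the dictionary names are arbitrary.

```python
# Values copied from Table 1 (percent).
specificity_admission = {"Logistic": 68.77, "k-NN": 89.22, "Decision tree": 86.41}
sensitivity_death = {"Logistic": 71.63, "k-NN": 71.52, "Decision tree": 80.80}

# Admission: prefer the highest specificity (fewest false positives).
best_admission = max(specificity_admission, key=specificity_admission.get)
# Death: prefer the highest sensitivity (fewest false negatives).
best_death = max(sensitivity_death, key=sensitivity_death.get)

print(best_admission, best_death)
```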


References
[1] Geetesh Solanki, Thomas Wilkinson, Shailav Bansal, Joshila Shiba, Samuel Manda, and Tanya Doherty.
COVID-19 hospitalization and mortality and hospitalization-related utilization and expenditure: Analysis
of a South African private health insured population. PLoS ONE, 17(5):e0268025, 2022.

[2] Davide Chicco. Ten quick tips for machine learning in computational biology. BioData Mining, 10(1):1–17,
2017.

[3] Alan Agresti. Foundations of linear and generalized linear models. John Wiley & Sons, 2015.

[4] k-nearest neighbor: An introductory example. (n.d.).

[5] Trevor Hastie, Robert Tibshirani, and Jerome H Friedman. The elements of statistical
learning: data mining, inference, and prediction, volume 2. Springer, 2009.


Appendix
Decision tree

(a) Death (b) Hospitalization

Figure 2: Pruning decision tree

Confusion Matrix

Table 2: Confusion matrices for logistic regression, k-NN, and decision tree

                     Hospitalization                       Death
                     Admitted  Not Admitted  Class         Dead  Alive  Class
Logistic Regression  6826      3436          Admitted      1343  635    Dead
                     3814      7565          Not Admitted  532   2415   Alive
k-NN                 3604      1186          Admitted      1341  241    Dead
                     7036      9815          Not Admitted  534   2809   Alive
Decision tree        4417      1495          Admitted      1515  275    Dead
                     6223      9506          Not Admitted  360   2775   Alive
