CSCB HW 1
Group members:
Lecturer:
Introduction
This study uses health insurance data from Discovery Health (PTY) Ltd on 188,292 individuals in South Africa
who tested positive for COVID-19 between March 2020 and 28 February 2021, together with the hospitalization
data for these members up to 30 June 2021 [1]. Two machine learning methods, k-nearest neighbors (k-NN)
and decision trees, were compared with logistic regression. The methods were evaluated on accuracy,
sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV), as well as on
their ROC curves. Each method was applied to predict two endpoints: hospitalization and death. The
predictors were gender (male or female), age group (<18, 18-40, 40-65, >65 years), number of comorbidities,
pandemic wave (Pre-wave 1, Wave 1, Post-wave 1, Wave 2), plan type (4 levels of insurance cover), and
province (the 9 South African provinces) of the patient. For the death endpoint, hospital code and length of
stay were added as predictors. In all analyses, 70% of the data was used for training and 30% for testing.
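Because most of these predictors are categorical, they must be coded numerically before a method such as logistic regression can use them. The following is a minimal sketch of dummy coding, not the preprocessing actually used in this study: the field names (`gender`, `age_group`, `n_comorbidities`, `wave`) are hypothetical, the first level of each factor is assumed to be the reference level, and plan type and province are omitted because their level names are not given here.

```python
# Levels as listed in the introduction; the first level acts as the reference.
AGE_GROUPS = ["<18", "18-40", "40-65", ">65"]
WAVES = ["Pre-wave 1", "Wave 1", "Post-wave 1", "Wave 2"]


def one_hot(value, levels):
    """Dummy-code `value` against `levels`, dropping the first (reference) level."""
    if value not in levels:
        raise ValueError(f"unknown level: {value!r}")
    return [1 if value == level else 0 for level in levels[1:]]


def encode(record):
    """Turn one patient record (hypothetical field names) into a numeric row."""
    return (
        [1 if record["gender"] == "male" else 0]
        + one_hot(record["age_group"], AGE_GROUPS)
        + [record["n_comorbidities"]]
        + one_hot(record["wave"], WAVES)
    )


row = encode(
    {"gender": "female", "age_group": "40-65", "n_comorbidities": 2, "wave": "Wave 2"}
)
print(row)  # [0, 0, 1, 0, 2, 0, 0, 1]
```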
1 Methodology
1.1 Data description
The positive and negative cases are imbalanced: only 18.84% of cases were admitted and only 3.32% of cases
died. To account for this, the number of negative cases in the training set was reduced, and the training set
was constructed so that it contains 70% of the positive cases. Dealing with this imbalance is important
because the algorithms would otherwise recognize negative cases very easily but have difficulty
identifying positive cases [2].
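The split described above can be sketched as follows. This is an illustration under stated assumptions rather than the exact procedure used: the function name `balanced_split` is hypothetical, and the 1:1 ratio of negative to positive cases in each partition is an assumption, since the text does not state how many negative cases were kept.

```python
import random


def balanced_split(labels, train_frac=0.7, seed=0):
    """Return (train_idx, test_idx) with undersampled negative cases.

    Assumption: each partition keeps as many negatives as it has positives.
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    rng.shuffle(pos)
    rng.shuffle(neg)

    # Place 70% of the positive cases in the training set.
    n_pos_train = round(train_frac * len(pos))
    pos_train, pos_test = pos[:n_pos_train], pos[n_pos_train:]

    # Undersample the negatives, using disjoint slices for train and test.
    neg_train = neg[: len(pos_train)]
    neg_test = neg[len(pos_train) : len(pos_train) + len(pos_test)]

    return sorted(pos_train + neg_train), sorted(pos_test + neg_test)


# Toy labels with roughly the admission prevalence (19 positives in 100 cases).
labels = [1] * 19 + [0] * 81
train_idx, test_idx = balanced_split(labels)
print(len(train_idx), len(test_idx))  # 26 12
```

The unused negative cases are simply discarded in this sketch; other remedies for imbalance, such as class weights, would keep them.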
Capita Selecta and Computational Biology 2022/2023
Figure 1: ROC curves: (a) Logistic Regression for Admission endpoint (b) k-NN for Admission endpoint
(c) Decision tree for Admission endpoint (d) Logistic Regression for Death endpoint (e) k-NN for Death
endpoint (f) Decision tree for Death endpoint
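Each ROC curve plots the true positive rate against the false positive rate as the classification threshold varies. The sketch below computes these points on toy data and picks a cut-point by maximizing Youden's J (sensitivity + specificity - 1). The choice of Youden's J is an assumption made for illustration: the criterion behind the optimal cut-points reported in this study is not stated. The function names and the toy scores are hypothetical.

```python
def roc_points(y_true, scores):
    """Return (threshold, fpr, tpr) per distinct score; predict positive if score >= threshold."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    points = []
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 0)
        points.append((t, fp / n_neg, tp / n_pos))
    return points


def youden_cut_point(y_true, scores):
    """Threshold maximizing J = tpr - fpr (an assumed criterion, see lead-in)."""
    best = max(roc_points(y_true, scores), key=lambda p: p[2] - p[1])
    return best[0]


# Toy predicted probabilities, not study data.
y_true = [0, 0, 1, 0, 1, 1, 0, 1]
scores = [0.10, 0.20, 0.35, 0.65, 0.60, 0.70, 0.30, 0.80]
cut = youden_cut_point(y_true, scores)
print(cut)  # 0.35
```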
Table 1: Comparison of accuracy measures for logistic regression, k-NN, and decision tree

                     Logistic               k-NN                   Decision tree
                     Admission  Death       Admission  Death       Admission  Death
Accuracy             66.50%     76.30%      62.01%     84.26%      64.34%     87.11%
Sensitivity          64.15%     71.63%      33.87%     71.52%      41.51%     80.80%
Specificity          68.77%     79.18%      89.22%     92.10%      86.41%     90.98%
PPV                  66.52%     67.90%      75.24%     84.77%      74.71%     84.64%
NPV                  66.48%     81.95%      58.25%     84.03%      60.44%     88.52%
Optimal cut-point    0.3265     0.4384
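Each measure in Table 1 follows from the four cells of a confusion matrix. As a minimal sketch, the function below (the name `accuracy_measures` is illustrative) recomputes the logistic regression column for the admission endpoint from the counts in the appendix. It assumes that the rows of the appendix matrices are the predicted class and the columns the observed class; this reading reproduces the reported values.

```python
def accuracy_measures(tp, fp, fn, tn):
    """Accuracy measures from the four cells of a 2x2 confusion matrix."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
        "ppv": tp / (tp + fp),          # positive predictive value
        "npv": tn / (tn + fn),          # negative predictive value
    }


# Logistic regression, admission endpoint (counts from the appendix).
measures = accuracy_measures(tp=6826, fp=3436, fn=3814, tn=7565)
for name, value in measures.items():
    print(f"{name}: {value:.2%}")
# accuracy: 66.50%
# sensitivity: 64.15%
# specificity: 68.77%
# ppv: 66.52%
# npv: 66.48%
```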
In summary, the choice of the best method depends on which accuracy measure is given priority. Taken
across all measures, the decision tree performs reasonably well, especially for the death
endpoint.
3 Conclusion
In this study, two endpoints, admission and death, were considered in analyzing COVID-19 data from South
Africa. Three methods (logistic regression, k-NN, decision tree) were fitted and assessed through accuracy
measures. An important challenge in this analysis is building an accurate and efficient classifier. As the
results show, which method is preferred depends on the accuracy measure considered. For the first outcome
(admission), we focus on the efficient use of hospital resources and therefore favor a low false positive rate,
that is, a high specificity. For the second outcome (death), we focus on the human aspect and therefore favor
a low false negative rate, that is, a high sensitivity. On these criteria, the k-NN method is preferred for
predicting admission and the decision tree method is preferred for predicting death.
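The selection rule stated above can be written out explicitly. The sketch below copies the sensitivity and specificity values from Table 1 and picks, for each endpoint, the method that maximizes the measure named in the conclusion; the variable names are illustrative only.

```python
# Sensitivity and specificity (in percent) copied from Table 1.
table1 = {
    "logistic": {
        "admission": {"sensitivity": 64.15, "specificity": 68.77},
        "death": {"sensitivity": 71.63, "specificity": 79.18},
    },
    "k-NN": {
        "admission": {"sensitivity": 33.87, "specificity": 89.22},
        "death": {"sensitivity": 71.52, "specificity": 92.10},
    },
    "decision tree": {
        "admission": {"sensitivity": 41.51, "specificity": 86.41},
        "death": {"sensitivity": 80.80, "specificity": 90.98},
    },
}

# Measure prioritized for each endpoint, as argued in the conclusion.
criterion = {"admission": "specificity", "death": "sensitivity"}

preferred = {
    endpoint: max(table1, key=lambda method: table1[method][endpoint][measure])
    for endpoint, measure in criterion.items()
}
print(preferred)  # {'admission': 'k-NN', 'death': 'decision tree'}
```

A different priority, for example sensitivity for admission, would select logistic regression instead, which is why the preferred method depends on the application.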
References
[1] Geetesh Solanki, Thomas Wilkinson, Shailav Bansal, Joshila Shiba, Samuel Manda, and Tanya Doherty.
COVID-19 hospitalization and mortality and hospitalization-related utilization and expenditure: Analysis
of a South African private health insured population. PLoS ONE, 17(5):e0268025, 2022.
[2] Davide Chicco. Ten quick tips for machine learning in computational biology. BioData Mining, 10(1):1–17,
2017.
[3] Alan Agresti. Foundations of linear and generalized linear models. John Wiley & Sons, 2015.
[5] Trevor Hastie, Robert Tibshirani, and Jerome H Friedman. The elements of statistical
learning: data mining, inference, and prediction, volume 2. Springer, 2009.
Appendix
Decision tree
Confusion Matrix

                       Hospitalization                         Death
                       Admitted  Not Admitted  Class           Dead   Alive  Class
Logistic Regression    6826      3436          Admitted        1343   635    Dead
                       3814      7565          Not Admitted    532    2415   Alive
k-NN                   3604      1186          Admitted        1341   241    Dead
                       7036      9815          Not Admitted    534    2809   Alive
Decision tree          4417      1495          Admitted        1515   275    Dead
                       6223      9506          Not Admitted    360    2775   Alive