Application of Machine Learning For Diagnostic Prediction of Root Caries
DOI: 10.1111/ger.12432
Roseman University of Health Sciences College of Dental Medicine, South Jordan, Utah
Department of Orthopaedic Surgery Operations, University of Utah, Salt Lake City, Utah
Study Design and Biostatistics Center, University of Utah, Salt Lake City, Utah
Department of Family and Preventive Medicine, University of Utah, Salt Lake City, Utah
Huntsman Cancer Institute, Salt Lake City, Utah

Correspondence
Man Hung, College of Dental Medicine, Roseman University of Health Sciences, 10894 South River Front Parkway, South Jordan, UT 84095.
Email: [email protected]

Funding information
National Center for Research Resources, Grant/Award Number: 5UL1TR001067-02; Roseman University College of Dental Medicine Clinical Outcomes Research and Education; National Center for Advancing Translational Sciences, National Institutes of Health

Abstract
Objective: This study sought to utilise machine learning methods in artificial intelligence to select the most relevant variables in classifying the presence and absence of root caries and to evaluate the model performance.
Background: Dental caries is one of the most prevalent oral health problems. Artificial intelligence can be used to develop models for identification of root caries risk and to gain valuable insights, but it has not been applied in dentistry. Accurately identifying root caries may guide treatment decisions, leading to better oral health outcomes.
Methods: Data were obtained from the 2015‐2016 National Health and Nutrition Examination Survey and were randomly divided into training and test sets. Several supervised machine learning methods were applied to construct a tool that was capable of classifying variables into the presence and absence of root caries. Accuracy, sensitivity, specificity and area under the receiver operating curve were computed.
Results: Of the machine learning algorithms developed, support vector machine demonstrated the best performance with an accuracy of 97.1%, precision of 95.1%, sensitivity of 99.6% and specificity of 94.3% for identifying root caries. The area under the curve was 0.997. Age was the feature most strongly associated with root caries.
Conclusion: The machine learning algorithms developed in this study perform well and allow for clinical implementation and utilisation by dental and nondental professionals. Clinicians are encouraged to adopt the algorithms from this study for early intervention and treatment of root caries for the ageing population of the United States, and for attaining precision dental medicine.

Keywords
artificial intelligence, dental medicine, machine learning, National Health and Nutrition Examination Survey, quality of life, root caries
aged 65 and over.4,5 If left untreated, caries can lead to tooth loss6,7 as well as reduced quality of life, ease of daily living and self‐concept regarding their oral health.9 Thus, minimising the experience and impact of caries on individuals’ general health and quality of life is an important public health issue.

Research into the prevalence and impact of root caries has demonstrated that a number of individual factors are related to poor oral health. Males show a higher prevalence of untreated caries than females.10 Individuals belonging to racial/ethnic minority groups, such as Native Americans, Blacks and Hispanics, have a higher prevalence of periodontal diseases, untreated root caries and tooth loss, and generally experience a higher incidence of oral cancer than non‐Hispanic Whites.11 Socioeconomic status components such as income, living condition, education and access to dental care10,14 are all factors that can contribute to dental disparities. With respect to access to dental care, 21% of Latino children under age 17 are without dental insurance, compared with uninsured rates of just 6% for Whites and 7% for African Americans. The issue of accessibility becomes more prominent as Medicare, for those 65 and older, does not provide for routine dental care, and older adults may experience higher rates of tooth decay than children.16 Lifestyle factors such as poor diet, nutrition and a lack of dental hygiene play key roles in disparities as well.2 Among the vulnerable, elder population, root caries tends to occur due to reduced upkeep of dental hygiene practices.17 While a large portion of the currently affected population can retain their teeth for a majority of their lifespan with simple individual‐ and population‐level interventions, such as water fluoridation and regular professional preventive dental care,18 socioeconomic components remain a large contributor to increased prevalence of poor oral health.

Among dental factors, self‐reported dry mouth,19 number of teeth at baseline20 and gingival recession21,22 have been associated with root caries. Among elders, surfaces with visible plaque, denture contact and more prominent gingival recession are areas that are likelier to be affected by root caries.23 However, generally with those older than 35 years, complexities arise in determining relationships with root caries as the presence of periodontal disease increases and becomes the primary culprit of tooth loss.18

While incredibly important, prevalence and outcome research can often be difficult to use when attempting to develop clinical interventions. Consequently, we believe there can be clinical benefits from applying artificial intelligence to the prediction of root caries. Machine learning methods in artificial intelligence have been previously applied to different areas of health care and have the ability to explore large amounts of data to reveal patterns and complex relationships between variables. They have strong potential to produce precise and individualised prediction of root caries risk.

To our knowledge, machine learning has not been used to develop models in identification of root caries risk. This study utilised machine learning techniques to identify the likelihood of a person developing root caries by selecting the most relevant variables from demographic and lifestyle factors. A potential application of this technique is to detect root caries on an individual level, enabling evidence‐based personalised dental medicine that may assist in decreasing root caries experience of ageing populations via early prevention and treatment.

2.1 | Data

This study used public data from the National Health and Nutrition Examination Survey (NHANES) 2015‐2016 cycle.27 Since these were de‐identified, public data available from the NHANES website, Institutional Review Board (IRB) approval was not required. This study was considered exempt from IRB evaluation on the basis of federal regulation 45 CFR 46.101(b) (research involving the study of secondary data recorded in such a manner that subjects cannot be identified). The NHANES is a study of the National Center for Health Statistics within the Centers for Disease Control and Prevention. It is conducted annually via both interviews and clinical examinations to assess the health status of adults and children in the United States. It includes information derived from questionnaires on demographics, socioeconomic status, dietary and health‐related topics. Additionally, the NHANES has a clinical examination component which includes medical, dental and physiological measures.

For the 2015‐2016 cycle, the NHANES included oversampling of under‐represented groups. A total of 15 327 people were invited to participate in the study, and of those invited, 9971 people completed the interview and 9544 people were examined. Interview questions were administered by trained interviewers in the participant's home, but sensitive questions regarding alcohol use and reproductive health were administered at the examination centre. Clinical examinations were conducted at a mobile examination centre at designated locations by licensed and trained medical personnel.

2.2 | Outcome variable

The oral health outcome variable of interest for this study was root caries. It was a dichotomous variable with either yes or no to indicate the presence or absence of one or more root caries based on clinical examination. Dental caries was defined as the localised destruction of susceptible dental hard tissues by acidic by‐products from bacterial fermentation of dietary carbohydrates occurring either on the crown or on the root of the tooth.3,28 This study focused on root caries as it is a more serious condition that can lead to greater oral health issues but is also highly treatable and can be prevented.29 Root caries was identified in oral examinations by licensed and trained dental professionals from NHANES during a dental caries assessment using a decayed, missing and filled surface index. The presence of root caries was defined as the presence of one or more untreated (decayed, D‐root) root caries lesions or treated (filled, F‐root) root surfaces. The outcome variable used in this study included the presence/absence of D‐root and/or F‐root.
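The outcome definition above reduces to a simple rule, sketched here in Python. The names `d_root` and `f_root` are hypothetical stand-ins for per-participant counts of decayed and filled root surfaces; they are not NHANES variable names, and the actual derivation from the examination files may differ.

```python
def has_root_caries(d_root: int, f_root: int) -> bool:
    """Return True when at least one decayed (D-root) or filled (F-root)
    root surface was recorded for a participant."""
    if d_root < 0 or f_root < 0:
        raise ValueError("surface counts cannot be negative")
    return d_root > 0 or f_root > 0


# Hypothetical participants: (decayed root surfaces, filled root surfaces)
participants = [(0, 0), (2, 0), (0, 1), (3, 4)]
outcome = [has_root_caries(d, f) for d, f in participants]
```

Under this rule a participant with only filled root surfaces is still counted as a case, which matches the inclusion of F‐root in the outcome.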
2.3 | Analytical approach

Demographic and clinical characteristics of the participants were examined in terms of mean, standard deviation, frequency and proportion where appropriate. Machine learning methods were utilised to classify the presence or absence of root caries. In machine learning, computer algorithms can be applied to a training data set. These algorithms “learn” the patterns which are present in the data and automatically generate rules that are used to conduct data mining or predict future outcome from the features (ie, variables). These predictions can then be compared against the actual values from a test data set (ie, validation data set) to assess the performance of the machine‐generated rules. Machine learning is particularly helpful when dealing with large and complex data where the relationships between variables are not obvious. It is useful for clinical decision support and can contribute to diagnosis and prognosis of oral health conditions as well as personalised or individualised dental treatment regimes.

There were a total of 9971 cases and 950 variables present in the complete data set. To prepare the data for processing, all cases with missing data for root caries as well as variables that had 50% or more missing data were excluded, resulting in 357 variables and a total sample size of 5135. To minimise bias and enhance efficiency of interpretation, variables that were unlikely to be related to root caries (eg, subject IDs) and variables providing essentially the same information (eg, age in a continuous scale and age in categorical groupings) as well as variables that were the likely results of possessing root caries (eg, recommendation for dental care) were removed. The resulting variables were subjected to independent samples t tests for continuous variables and chi‐square tests for categorical variables to examine whether there were significant differences between the root caries and no root caries groups. A total of 37 variables demonstrated statistically significant relationships with the outcome variable root caries (P < 0.001) (Appendix A). These 37 variables were entered into initial machine learning models to determine their relative importance based on their F‐scores. The F‐score is a measure that determines feature importance based on how often that feature is taken into account during the machine learning process. Variables with higher F‐scores contributed more to the prediction of root caries. In order to achieve parsimony, the top 15 most important variables were selected to construct machine learning models. The data were then randomly partitioned into training and test sets with 80% for training and 20% for testing. Since the original data were highly imbalanced (containing 4344 cases without root caries but only 791 cases with root caries), sampling with replacement, specifically oversampling, was used to create balanced data for the under‐represented class (ie, minority class). The balanced data contained 4746 cases with root caries and 4344 cases without root caries. Altogether, a total of 9090 cases were used for training and testing in machine learning, with 7272 cases (80% of 9090) randomly selected for training and 1818 cases (20% of 9090) for testing.

Imbalanced data are known to introduce a high degree of classification bias to model performance (eg, sensitivity, specificity) such that the machine learning algorithms are almost never able to predict the minority class and that the majority class almost always has inflated model performance. Thus, when using a highly imbalanced data set, we often see a large gap between sensitivity and specificity of machine learning models and a high misclassification rate for the minority class.30,31 In order to solve such issues, various strategies such as oversampling or undersampling have been proposed to reduce the inherent bias resulting from imbalanced data. Oversampling has been shown to reduce the gap between sensitivity and specificity and lower the misclassification rate for the minority class.30,31 On the other hand, if not done properly, oversampling can result in overfitting issues such as obtaining perfect accuracy and AUC when in reality they are not perfect. In this study, we strived to minimise overfitting issues by using a separate validation sample for model validation.

Several supervised machine learning methods were applied to generate the prediction of root caries for individuals. These include support vector machine (SVM), extreme gradient boosting (XGBoost), random forest (RF), k‐nearest neighbours (k‐NN) and logistic regression.32-35 Logistic regression was chosen because it was commonly used in traditional medical studies; all other methods were chosen due to their tolerance to overfitting, ability to model nonlinear relationships, ease of implementation in clinical settings or acceptability in the machine learning community. These machine learning algorithms were coded using Python 3.7.0 (Python Software Foundation) and WEKA 3.8.2 (University of Waikato, Hamilton, New Zealand). The test data set was used to compute accuracy, sensitivity, specificity and area under the curve (AUC) of the receiver operating characteristic (ROC) curve. Accuracy of the prediction was considered the most relevant metric for clinical applications in dental care.

3 | RESULTS

The sample size for this study was 5135. Males made up 48.4% of the sample, and females made up 51.6%. A total of 1629 (31.7%) identified as White or Caucasian, 1094 (21.3%) as Black or African American, 1613 (31.4%) as Hispanic or Mexican American and 611 (11.9%) as Asian. The average age of the sample was 46.6 (standard deviation = 18.1; median = 46.0) (Table 1).

Figure 1 displays a visual presentation of variable importance, reflecting the contribution of each of the thirty‐seven significant indicators of root caries to the machine learning model. The larger the F‐score of a variable, the higher the contribution it has on the identification of root caries. Age was found to be the most important variable in identifying root caries. The top fifteen features included five demographic variables (ie, age, household income, education, race/ethnicity and marital status), five oral health variables (ie, last dentist visit, flossing, mouth ache, self‐rated oral health and oral embarrassment) and five lifestyle/health variables (ie, TV watching, computer use, use of sunscreen, alcohol consumption and cholesterol prescriptions) (Figure 1 and Table 2). The F‐scores calculated by the various machine learning algorithms were slightly different, but this difference was minor, and the rankings of the variables remained relatively consistent.

Classification results for the machine learning algorithms are presented in Table 3. The top classifier was SVM with an AUC of 0.997, an accuracy of 97.1%, precision of 95.1%, sensitivity of 99.6% and specificity of 94.3% for identification of root caries. The XGBoost and RF also performed very well with an overall accuracy of 94.7% and 94.1%, respectively. The k‐NN was satisfactory at 83.2% accuracy. The commonly used logistic regression in traditional research studies performed the worst relative to the other algorithms in this study but still had a reasonable accuracy of 74.3%.

Figure 2 displays a graphical plot of the ROC curves for all of the machine learning algorithms utilised in this study. The AUC for the logistic regression was adequate at 0.818. The SVM, XGBoost and RF had an AUC of 0.997, 0.987 and 0.999, respectively, revealing exceptionally high model performance.

FIGURE 1 Variable importance

TABLE 3 Performance metrics of machine learning models using the top 15 selected features
Classifier              Accuracy  Precision  Sensitivity  Specificity  AUC
Support vector machine  0.971     0.95       0.996        0.943        0.997
XGBoost                 0.947     0.908      1.000        0.889        0.987
Random forest           0.941     0.947      1.000        0.875        0.999
k‐nearest neighbours    0.832     0.769      0.971        0.679        0.881
Logistic regression     0.742     0.742      0.771        0.711        0.818

4 | DISCUSSION

Root caries is a significant public health concern and has been increasing in prevalence. The use of machine learning to identify factors related to root caries is an opportunity to improve oral health with consequent effects on general health. This is the first study using machine learning methods in artificial intelligence to identify root caries from a large scale of data consisting of demographic, nutrition, lifestyle, laboratory and oral examination variables. We used the NHANES data and applied multiple machine learning methods to identify the best model and factors related to root caries. The best performing machine learning model was SVM, which most accurately classified the presence versus absence of root caries.

Across all methods, four variables were consistently identified as the most critical in indicating the presence of root caries. These were age, income, date of last dental visit and hours of television watching. Age was the most relevant predictive variable, consistent
with evidence that root caries increases with age due to increasing exposure of root surfaces among other things.24 Low income as a factor of socioeconomic status and as an indicator of a financial barrier10,14 to dental care access has also been associated with oral health disparities. Receiving dental care from a professional on a regular basis increases chances of early diagnosis, prevention and treatment of oral diseases. Consistent with this, previous research has shown that those who do not receive regular care have worse oral health than those who do.38 Last dental visit as a prominent oral health feature is especially consistent with the idea of reduced accessibility and increased prevalence among elders. Overall, this confirms previously identified features. Yet, the value of the machine learning approach comes with the identification of unexpected and less intuitive features, such as hours spent watching television, as important indicators of root caries risk. While lifestyle factors in general may not be directly responsible for the development of root caries, they may provide an indirect link to a person's overall health, lifestyle and likelihood of developing oral health problems.

Ultimately, most of our other top features were consistent with prior research that has identified demographic, lifestyle and oral health variables as important features of poor oral health. We found that education, marital status, race/ethnicity, gender and other demographic factors were indicative of root caries. Meanwhile, previous research has identified gender,10 ethnicity,11 socioeconomic status, living conditions and education as factors contributing to oral health disparities. We found high alcohol consumption as an indicator of root caries, and although an association has yet to be made with root caries specifically, high alcohol consumption has been found to be associated with a larger amount of caries on tooth surfaces.39 Finally, we identified four other oral health variables relevant in classifying root caries: aching in mouth, self‐rated oral health, flossing and oral embarrassment. Although literature on the effects of flossing on dental health is inconclusive,40,41 previous research has cited correlation of poor oral health with aching in mouth,42 oral embarrassment43 and self‐rated oral health.44

Sunscreen use, computer use and taking prescription medicine for cholesterol were some of the unique indicators we discovered, which have not been reported in the current literature. Similar to the case with hours spent watching television, factors like use of sunscreen and computer use, although not directly indicative, may provide insight into a patient's oral health practices and habits. Taking prescription medicine for cholesterol may suggest why prior evidence shows an association between dental health and heart disease/elevated cholesterol. Since commonly prescribed cholesterol‐lowering drugs such as anticholinergics decrease salivary gland function,47 perhaps patients with heart disease/elevated cholesterol are at an increased risk for root caries due to their medications.

This study was not without limitations. First, we utilised a large amount of data collected by NHANES from a large sample in the United States. The findings derived from this large sample are meant to be more representative of and can be generalisable to the United States’ population. Yet, individual dental clinics may have different patient demographics and may exhibit different characteristics. Second, in machine learning a large amount of data or variables is often used in search for novel insights, which can make statistical significance testing inapplicable or less meaningful. However, since a central aim of this study is discovery and exploration of actionable new insights, not statistical significance testing, applying a large number of variables is not of concern but of great benefit for building accurate models in artificial intelligence. Third, the machine learning feature selection did not account for the covariance between lifestyle factors and the oral hygiene variables. By only using F‐score to select feature importance, variables that may not have been directly correlated with root caries, but rather were associated with other variables that influence root caries, may have been selected. This may have been the case with features such as age and taking cholesterol medications, and some of the lifestyle factors. In the future, it may be beneficial to compare the current models against other machine learning models in predicting root caries that use different methods of feature selection. Fourth, this was a cross‐sectional study, so the final model showed possible indicators associated with the presence or absence of root caries. A longitudinal study is needed in the future to establish and confirm the predictive ability of the model. Additionally, onsite clinical validation has not been started, but future research can focus on such validation to improve the algorithms.

While confirming prior research regarding significant indicators of root caries at a population level, our study also developed highly accurate and precise computer algorithms to model risk for individual patients. The application of machine learning in artificial intelligence not only approximated dentists’ examination skills but also discovered novel and complex relationships not readily apparent to dentists or humans in general. The use of machine learning methods not only helped us identify risk factors for root caries but also helped us generate computer algorithms that are able to consider combinations of variables to classify the presence and absence of root caries. Discovering and incorporating such combinations of variables and their complex relationships with root caries to guide understanding and individual treatment decision can be a challenge for humans, but they can be a reality with artificial intelligence (such as Alexa, self‐driven cars, face ID for unlocking phones or other robots that we have seen and used in our daily life). Machine learning is the driver of artificial intelligence and has powerful public health implications when applied to clinical problems.

Innovations using artificial intelligence have the ability to disrupt and advance the areas of diagnosis and prognosis in oral health. In
the future, a real‐time online clinical decision support tool can be made by incorporating the machine learning algorithms developed from this study to facilitate precision medicine in oral care. This can be used as a screening tool in general medical practices, dental clinics, social service centres or placed online, providing recommendations for dental examinations for those identified at high risk. The information derived from the machine learning findings in this study also included the identification of other medical conditions or lifestyles linked to the presence of root caries, which is probably more applicable to be utilised by nondental professionals to categorise patients that might be at higher risk of developing root caries and provide referrals of those patients to oral health professionals for further evaluation and early intervention and prevention.

According to the US Census Bureau's 2017 National Population Projections, the size of the older population will expand by 2030 such that 1 in every 5 people will be at the retirement age of 65 or older.48 With an increasingly ageing population, root caries and other oral health outcomes that most commonly affect the elder population will only increase in prevalence. Therefore, the use of machine learning methods to understand root caries represents an incredible opportunity for early intervention and the improvement of oral health for the ageing population. This is the first study applying machine learning to classify root caries, and it has generated highly robust and accurate computer algorithms. The use of these algorithms may enable the development of automated and cost‐efficient tools for dental care and precision medicine and may have huge implications in intervention for those that are or could be affected by root caries and other oral health conditions.

CONFLICT OF INTEREST

The authors declare that there is no conflict of interest.

AUTHOR CONTRIBUTIONS

Man Hung: conception and design of the study, data acquisition, analysis and interpretation of data, drafting and revising the article for important intellectual content and final approval of the manuscript. Maren W. Voss: design of the study, data acquisition, interpretation of data, drafting the article and final approval of the manuscript. Megan N. Rosales: design of the study, data acquisition, analysis and interpretation of data, revising the article, and final approval of the manuscript. Wei Li: design of the study, acquisition of data, analysis of data, revising the article, and final approval of the manuscript. Weicong Su: design of the study, analysis and interpretation of data, revising the article and final approval of the manuscript. Julie Xu: design of the study, interpretation of data, drafting the article and final approval of the manuscript. Jerry Bounsanga: design of the study, acquisition of data, revising the article and final approval of the manuscript. Bianca Ruiz‐Negrón: design of the study, interpretation of data, revising the article and final approval of the manuscript. Evelyn Lauren: design of the study, interpretation of data, revising the article and final approval of the manuscript. Frank W. Licari: conception and design of the study, interpretation of data, revising the article for important intellectual content and final approval of the manuscript.
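The class balancing, 80/20 partition and test‐set metrics described in Section 2.3 can be illustrated with a minimal Python sketch. It is not the study's code: the helper names, the random seeds, the placeholder records and the confusion‐matrix counts are all assumptions made for illustration. Only the class sizes (791 and 4344) and the oversampling target (4746) come from the text.

```python
import random


def oversample(minority, target_size, seed=0):
    """Draw target_size rows from the minority class, sampling with replacement."""
    rng = random.Random(seed)
    return [rng.choice(minority) for _ in range(target_size)]


def split(rows, test_fraction=0.2, seed=0):
    """Shuffle rows and return (train, test) with the given test fraction."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def metrics(tp, fp, tn, fn):
    """Accuracy, precision, sensitivity and specificity from confusion counts."""
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": tp / (tp + fp),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


# Placeholder records (participant id, label); class sizes as reported in Section 2.3.
with_caries = [(i, 1) for i in range(791)]
without_caries = [(i, 0) for i in range(791, 791 + 4344)]

# Oversample first, then split, following the order in which the text describes the steps.
balanced = without_caries + oversample(with_caries, target_size=4746)
train, test = split(balanced)
sizes = (len(balanced), len(train), len(test))  # (9090, 7272, 1818)

# Hypothetical confusion counts, used only to show how the metrics are derived.
example = metrics(tp=90, fp=20, tn=80, fn=10)
```

Because oversampling precedes the split in this sketch, copies of the same participant can fall into both the training and the test set. A common alternative is to split first and oversample only the training portion, so that the test set contains no duplicated records.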
TABLE A1 Indicators of root caries
Variable  Description                                 N     Root caries  No root caries  P
BPQ090D   Told to take prescription for cholesterol   3828  592          3236            <0.001
DEQ034D   Use sunscreen?                              3397  439          2958            <0.001
DMDEDUC2  Education level—Adults 20+                  4873  789          4084            <0.001
DMDMARTL  Marital status                              4874  790          4084            <0.001
INDHHIN2  Annual household income                     4801  732          4069            <0.001
OHQ030    When did you last visit a dentist           5122  789          4333            <0.001
OHQ620    How often last year had aching in mouth?    3962  731          3231            <0.001
OHQ680    Last year embarrassed because of mouth      3966  731          3235            <0.001
OHQ845    Rate the health of your teeth and gums      5133  791          4342            <0.001
OHQ870    How many days use dental floss/device       3962  730          3232            <0.001
PAQ710    Hours watch TV or videos past 30 d          5124  788          4336            <0.001
PAQ715    Hours use computer past 30 d                5133  789          4344            <0.001
RIDAGEYR  Age in years at screening                   5135  791          4344            <0.001
RIDRETH1  Race/Ethnicity—Recode                       5135  791          4344            <0.001
SMQ020    Smoked at least 100 cigarettes in life      5128  789          4339            <0.001
ALQ151    Ever have 4/5 or more drinks every day?     3865  620          3245            <0.001
BPQ020    Ever told you had high blood pressure       5133  790          4343            <0.001
DIQ010    Doctor told you have diabetes               5132  790          4342            <0.001
DLQ020    Have serious difficulty seeing?             5133  790          4343            <0.001
DLQ040    Have serious difficulty concentrating?      5131  790          4341            <0.001
DLQ050    Have serious difficulty walking?            5135  791          4344            <0.001
DLQ060    Have difficulty dressing or bathing?        5134  791          4343            <0.001
DPQ080    Moving or speaking slowly or too fast       4717  733          3984            <0.001
IMQ011    Received hepatitis A vaccine                4179  678          3501            <0.001
IMQ020    Received hepatitis B 3 dose series          4331  691          3640            <0.001
INQ300    Family has savings more than $20 000        4752  733          4019            <0.001
MCQ160B   Ever told had congestive heart failure      4863  784          4079            <0.001
MCQ160F   Ever told you had a stroke                  4870  790          4080            <0.001
MCQ160O   Ever told you had COPD?                     4868  788          4080            <0.001
OHQ640    Last year had diff w/ job because of mouth  3965  731          3234            <0.001
OHQ770    Past year need dental but could not get it  5003  765          4328            <0.001
PAQ650    Vigorous recreational activities            5134  790          4344            <0.001
PFQ049    Limitations keeping you from working        4872  790          4082            <0.001
PFQ054    Need special equipment to walk              4873  789          4084            <0.001
PFQ057    Experience confusion/memory problems        4870  789          4081            <0.001
PFQ090    Require special healthcare equipment        4874  790          4084            <0.001
RIAGENDR  Gender                                      5135  791          4344            <0.001
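The P values in Table A1 come from group comparisons between participants with and without root caries. For a binary categorical variable, that comparison can be sketched as a Pearson chi‐square test on a 2×2 table, shown below in plain Python. The function name and the cell counts are hypothetical: the table reports only group totals, so the exposure split used here is invented for illustration, and only the row totals (791 and 4344) match the study. Continuous variables, which the study compared with t tests, are not covered.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square test (no continuity correction) for the table
    [[a, b], [c, d]]. Returns (statistic, p_value) with 1 degree of freedom."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column needs at least one observation")
    stat = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # For 1 degree of freedom, the chi-square tail probability is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Hypothetical split: rows = root caries (yes, no), columns = characteristic (yes, no).
stat, p = chi_square_2x2(300, 491, 900, 3444)
significant = p < 0.001  # same threshold as the screening step in Section 2.3
```

Variables with more than two categories would need the general r×c form of the test, which this sketch does not implement.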