Improving Acne Image Grading with Label Distribution Smoothing
Abstract
Acne, a prevalent skin condition, necessitates precise severity assessment for effective treatment. Acne severity grading typically involves lesion counting and global assessment. However, manual grading suffers from variability and inefficiency, highlighting the need for automated tools. Recently, label distribution learning (LDL) was proposed as an effective framework for acne image grading, but its effectiveness is hindered by severity scales that assign varying numbers of lesions to different severity grades. To address these limitations, we propose to incorporate severity scale information into lesion counting by combining LDL with label smoothing, and to decouple it from global assessment. A novel weighting scheme in our approach adjusts the degree of label smoothing based on the severity grading scale. This method helps to effectively manage label uncertainty without compromising class distinctiveness. Applied to the benchmark ACNE04 dataset, our model demonstrates improved performance in automated acne grading, showcasing its potential in enhancing acne diagnostics. The source code is publicly available at http://github.com/openface-io/acne-lds.
Index Terms— acne grading, label smoothing, label distribution learning
1 Introduction
Acne vulgaris, commonly known as acne, is a widespread skin condition that is estimated to affect over 700 million people worldwide, significantly impacting interpersonal relationships, social functioning, and mental health [1]. Accurate acne severity assessment is important for selecting the right treatment and as a clinical trial outcome [2]. However, manual severity grading by visual global assessment and lesion counting is time-consuming and susceptible to inter-observer variability [3]. Moreover, dermatologists are consistently in short supply, particularly in rural areas, and often cases are seen instead by general practitioners with lower diagnostic accuracy, while consultation costs are rising [4, 5]. Therefore, the use of automated tools for computer-aided acne severity assessment may be a promising alternative for broadening the availability of dermatology expertise [5].
Over the last two decades, multiple approaches to automated acne severity assessment from facial photos have been proposed. Initially, these solutions relied on conventional image analysis [6], but the breakthrough performance improvements of deep learning in biomedical image analysis have shifted the focus to its use for acne lesion detection [7], classification [8], counting [9], and severity grading [10].
In acne image grading, each photo is assigned a severity level, and while over 20 different grading scales have been proposed over time, the medical community has yet to agree on standardized criteria [11]. Most grading scales rely on lesion counting as a quantifiable measure informative of severity. Recognizing the connection between lesion counting and global severity grading, Wu et al. [12] introduced a unified framework that tackles both tasks simultaneously and published the benchmark dataset ACNE04 with annotated lesions. Utilizing label distribution learning (LDL) [13], their method assigns to each image two label distributions: one for quantifying lesion counts and another for classifying acne severity. The severity class labels are based on the Hayashi scale [14], which delineates acne severity into four levels based on lesion count ranges: 1–5 lesions is mild, 6–20 lesions is moderate, 21–50 lesions is severe, and more than 50 lesions is very severe.
The approach proposed by Wu et al. [12] generates the ground-truth label distribution for lesion counts independently of the severity scale, assuming the same level of grade uncertainty for all lesion counts. However, the predicted lesion count is then converted into the severity grade prediction and evaluated against the labels generated according to the Hayashi severity scale [14]. This leads to varying grade uncertainty for different lesion counts based on the severity scale: for example, lesion counts of 12 and 13 both confidently correspond to the moderate grade, while 20 and 21 lesions are assigned the moderate and severe grades, respectively.
At the same time, the global severity assessment branch directly predicts the severity grade distribution from an image. In contrast with the lesion counting task, accounting for the Hayashi scale is not beneficial here, because the scale delineates severity classes by uneven ranges of lesion counts (e.g., mild only includes 1–5 lesions, while moderate includes 6–20), making global prediction more challenging.
Here, we propose an approach that addresses these issues by incorporating severity scale information into generating label distributions for lesion counting, while simultaneously removing it from the direct severity grade classification.
Our first contribution can be viewed as a novel way to combine label smoothing [15] with LDL. We smooth a hard lesion count label with the Gaussian label distribution, such that the amount of smoothing depends on where this lesion count falls on the severity scale (Fig. 1). This is realized by introducing a parameter $\varepsilon$ that controls the amount of smoothing applied to each lesion count based on its proximity to the grade range border. For the counts at grade range boundaries, we use $\varepsilon = 1$ to generate Gaussian label distributions, enabling a soft transition between classes. This corresponds to LDL, incorporating high grade uncertainty. But for lesion counts towards the middle of the grade range, we reduce the weight of the label distribution such that the original count label remains dominant compared to its neighbors. In such cases, the grade uncertainty is lower, which allows the model to calibrate predictions accordingly. For instance, an image with a lesion count of 34, well within the range of the severe class, generates a distribution with a smaller amount of label smoothing to maintain a highly confident grade prediction (Fig. 1). This hybrid approach ensures that our model accounts for the inherent uncertainty in the counting task without diluting the distinctness of each class.
For the classification branch, we reduce the complexity of the task by breaking down the uneven Hayashi-defined grade ranges into evenly sized classes such that each class range contains exactly five lesion counts. We demonstrate that our approach improves the results of automated acne grading on the benchmark dataset, indicating its potential to improve acne diagnostics.
2 Method
Let $x_i$ be the $i$-th image out of the training set of size $N$ with the corresponding ground-truth lesion count annotation $z_i \in \{1, \dots, Z\}$, where $Z$ is the maximum lesion count, and the severity level $y_i \in \{1, \dots, C\}$, where $C$ represents the number of distinct severity grades. The overall architecture follows [12], except for the changes described below (see Fig. 2).
2.1 Gaussian label distribution generation
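Wu et al. [12] used the Gaussian function to generate the lesion count label distribution. For an image $x$ with ground-truth lesion count $z$, they defined the description degree of the count label $z_j$ as:

$$d_{x}^{z_j} = \frac{1}{\sqrt{2\pi}\,\sigma M} \exp\left(-\frac{(z_j - z)^2}{2\sigma^2}\right) \tag{1}$$

where $z_j \in \{1, \dots, Z\}$, $\sigma$ is the standard deviation, and $M$ is the normalization factor:

$$M = \frac{1}{\sqrt{2\pi}\,\sigma} \sum_{j=1}^{Z} \exp\left(-\frac{(z_j - z)^2}{2\sigma^2}\right) \tag{2}$$

such that $d_{x}^{z_j} \in [0, 1]$ and $\sum_{j=1}^{Z} d_{x}^{z_j} = 1$.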
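As a minimal sketch of the Gaussian label distribution described in eqs. (1) and (2), the snippet below builds the normalized description degrees for a single ground-truth count. The function name, the maximum count of 65, and the standard deviation of 3.0 are illustrative assumptions, not values taken from this paper.

```python
import math


def gaussian_label_distribution(z, max_count, sigma):
    """Return description degrees for counts 1..max_count given true count z.

    The constant 1 / (sqrt(2*pi) * sigma) appears in both the Gaussian and
    the normalization factor, so it cancels and is omitted here.
    """
    weights = [
        math.exp(-((z_j - z) ** 2) / (2.0 * sigma ** 2))
        for z_j in range(1, max_count + 1)
    ]
    total = sum(weights)  # plays the role of the normalization factor
    return [w / total for w in weights]


# Illustrative values only (assumptions): true count 12, counts 1..65, sigma 3.
dist = gaussian_label_distribution(z=12, max_count=65, sigma=3.0)
```

The resulting list sums to one and peaks at the ground-truth count, with neighboring counts receiving symmetric, smaller description degrees.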
2.2 Label smoothing
Label smoothing [15] was proposed to soften the hard label in the training process to prevent overconfidence and improve generalization. Consider one particular image $x$ with ground-truth label $y$ that is one-hot encoded as $q$, with $q_k = 1$ if $k = y$ and $q_k = 0$ otherwise. Then the original label can be replaced with a distribution:

$$\tilde{q} = (1 - \varepsilon)\, q + \varepsilon\, u \tag{3}$$

where $\varepsilon$ is the smoothing parameter and $u$ is usually the uniform distribution $u_k = 1/K$, where $K$ is the number of classes. As a result, the description degree of the true label is reduced, while the other classes obtain non-zero values.
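The conventional label smoothing of eq. (3) can be sketched as follows, assuming a uniform smoothing distribution. The function name and the example values (four classes, smoothing weight 0.1) are chosen purely for illustration.

```python
def smooth_label(true_class, num_classes, eps):
    """Mix a one-hot label with the uniform distribution using weight eps."""
    uniform = 1.0 / num_classes
    one_hot = [1.0 if k == true_class else 0.0 for k in range(num_classes)]
    return [(1.0 - eps) * q + eps * uniform for q in one_hot]


# Illustrative values only: class index 2 out of 4 classes, eps = 0.1.
smoothed = smooth_label(true_class=2, num_classes=4, eps=0.1)
# smoothed is approximately [0.025, 0.025, 0.925, 0.025]
```

Every class, however distant from the true one, receives the same small share of the probability mass, which is the property the scale-adaptive scheme in the next section changes.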
2.3 Scale-adaptive label distribution smoothing
To obtain more confident predictions for the mid-range counts, while maintaining higher grade uncertainty for counts near the grade border, we propose a method that combines Gaussian label distribution generation with confident labels via a label-smoothing-like weighting scheme (see Fig. 1). We achieve this in two steps. First, we replace the uniform distribution $u$ in eq. (3) with the generated label distribution $d_x$ from eq. (1). This limits the redistribution of confidence from the hard label to its surrounding neighbors, unlike traditional label smoothing, which assigns some small description degree to all labels. Second, we introduce a piecewise-linear schedule for the smoothing parameter $\varepsilon$ in order to control the weight of the label distribution based on the location of the count label in the grading scale, as illustrated in Fig. 1. We can now replace eq. (3) with the following:

$$\tilde{d}_{x}^{z_j} = (1 - \varepsilon(z))\, q^{z_j} + \varepsilon(z)\, d_{x}^{z_j} \tag{4}$$

where $q$ is the one-hot encoded ground-truth label and $\tilde{d}_{x}$ is the smoothed label distribution. Near the class border $\varepsilon = 1$, which corresponds to LDL, whereas for the mid-range labels $\varepsilon = \varepsilon_0$ ($\varepsilon_0$ is a hyperparameter), which is more similar to traditional label smoothing.
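The scale-adaptive scheme of eq. (4) can be sketched as below. The text states only that the schedule is piecewise linear, with full weight on the Gaussian distribution at grade borders and a smaller hyperparameter weight in the middle of a grade range. The exact shape used here (a linear ramp over `ramp` counts), the names `smoothing_weight`, `eps0`, and `ramp`, and all numeric values are assumptions made for illustration.

```python
import math

# Midpoints between neighboring Hayashi ranges: 1-5 | 6-20 | 21-50 | above 50.
GRADE_BORDERS = (5.5, 20.5, 50.5)


def gaussian_label_distribution(z, max_count, sigma):
    """Normalized Gaussian description degrees for counts 1..max_count."""
    weights = [
        math.exp(-((z_j - z) ** 2) / (2.0 * sigma ** 2))
        for z_j in range(1, max_count + 1)
    ]
    total = sum(weights)
    return [w / total for w in weights]


def smoothing_weight(z, eps0, ramp):
    """Assumed schedule: 1 at a grade border, decaying linearly to eps0."""
    # Distance 0 means the count sits directly next to a grade border.
    dist = min(abs(z - b) for b in GRADE_BORDERS) - 0.5
    return max(eps0, 1.0 - (1.0 - eps0) * dist / ramp)


def scale_adaptive_smoothing(z, max_count, sigma, eps0, ramp):
    """Mix the one-hot count label with its Gaussian label distribution."""
    eps = smoothing_weight(z, eps0, ramp)
    gauss = gaussian_label_distribution(z, max_count, sigma)
    return [
        (1.0 - eps) * (1.0 if z_j == z else 0.0) + eps * g
        for z_j, g in zip(range(1, max_count + 1), gauss)
    ]


# Illustrative values only (assumptions): 65 counts, sigma 3, eps0 0.5, ramp 5.
border = scale_adaptive_smoothing(20, 65, 3.0, eps0=0.5, ramp=5)
middle = scale_adaptive_smoothing(34, 65, 3.0, eps0=0.5, ramp=5)
```

With these settings, a count of 20 keeps the pure Gaussian distribution, while a count of 34 retains more than half of its mass on the original label, mirroring the example discussed in the Introduction.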
2.4 Lesion counting branch
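We replace $d_{x}^{z_j}$ with $\tilde{d}_{x}^{z_j}$ from eq. (4) in the loss function, which is the Kullback–Leibler (KL) divergence between the generated and predicted distributions:

$$L_{cnt} = \sum_{j=1}^{Z} \tilde{d}_{x}^{z_j} \ln \frac{\tilde{d}_{x}^{z_j}}{p_{j}^{(cnt)}} \tag{5}$$

where the probability of image $x$ belonging to class $z_j$ is:

$$p_{j}^{(cnt)} = \frac{\exp(\theta_j)}{\sum_{m=1}^{Z} \exp(\theta_m)} \tag{6}$$

and $\theta_j$ is the output of the counting branch for the count $z_j$.
Following [12], we also convert the count label distributions and their predictions into severity labels and predictions by summing up the corresponding probabilities according to the Hayashi scale.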
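The two operations of this section, the KL divergence between a target and a predicted distribution and the conversion of a count distribution into a severity distribution, can be sketched as follows. This is a simplified illustration rather than the training code: the function names, the `tiny` constant guarding against division by zero, and the maximum count of 65 are assumptions.

```python
import math


def hayashi_grade(z):
    """Grade index for a lesion count: 0 mild, 1 moderate, 2 severe, 3 very severe."""
    if z <= 5:
        return 0
    if z <= 20:
        return 1
    if z <= 50:
        return 2
    return 3


def count_to_grade_distribution(count_dist):
    """Sum count probabilities (index 0 is one lesion) within each Hayashi range."""
    grades = [0.0, 0.0, 0.0, 0.0]
    for z, p in enumerate(count_dist, start=1):
        grades[hayashi_grade(z)] += p
    return grades


def kl_divergence(target, predicted, tiny=1e-12):
    """KL divergence; terms with zero target mass contribute nothing."""
    return sum(
        t * math.log(t / max(p, tiny))
        for t, p in zip(target, predicted)
        if t > 0.0
    )


# Illustrative input (assumption): a uniform distribution over counts 1..65.
uniform_counts = [1.0 / 65] * 65
grade_dist = count_to_grade_distribution(uniform_counts)
```

For the uniform example, the severity distribution reflects only the widths of the Hayashi ranges (5, 15, 30, and 15 counts), which illustrates how uneven the ranges are.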
Table 1: Classification metrics of the baseline and the proposed contributions.

Metric | Wu et al. [12] | LD smoothing | New class ranges | Both |
---|---|---|---|---|
Accuracy | 83.70 ± 1.53 | 83.90 ± 1.48 | 83.63 ± 1.32 | 84.11 ± 1.94 |
Precision | 82.97 ± 1.27 | 83.38 ± 3.02 | 82.63 ± 2.27 | 83.11 ± 2.56 |
Specificity | 93.76 ± 0.63 | 93.81 ± 0.473 | 93.75 ± 0.42 | 93.99 ± 0.68 |
Sensitivity | 81.06 ± 3.46 | 81.21 ± 2.29 | 81.47 ± 2.88 | 81.53 ± 2.95 |
Youden Index | 74.83 ± 4.06 | 75.02 ± 2.75 | 75.22 ± 3.28 | 75.52 ± 3.61 |
MCC | 75.41 ± 2.35 | 75.69 ± 2.18 | 75.32 ± 1.98 | 76.16 ± 2.82 |
2.5 Severity prediction branch
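Since the severity grading branch is independent of lesion counting, we can convert the Hayashi-based severity grade labels into evenly spaced ones (see Fig. 3). The severity label distribution is generated according to the new classes instead of the Hayashi scale. The severity prediction loss function is then:

$$L_{cls} = \sum_{k=1}^{C'} \hat{d}_{x}^{k} \ln \frac{\hat{d}_{x}^{k}}{p_{k}^{(cls)}} \tag{7}$$

where $C'$ is the number of new classes and the probability of image $x$ belonging to class $k$ is:

$$p_{k}^{(cls)} = \frac{\exp(\phi_k)}{\sum_{m=1}^{C'} \exp(\phi_m)} \tag{8}$$

with $\phi_k$ denoting the output of the severity prediction branch for class $k$, and $\hat{d}_{x}^{k}$ is the new severity description degree.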
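The relabeling into evenly sized classes can be sketched as below, assuming classes of exactly five lesion counts as stated in the Introduction. Because the Hayashi borders (5, 20, and 50) are multiples of five, each new class falls entirely inside one Hayashi grade, so the mapping back to the original scale is unambiguous. The function names and the maximum count of 65 (which yields 13 new classes) are illustrative assumptions.

```python
def fine_class(z, width=5):
    """Index of the evenly sized class containing lesion count z (z >= 1)."""
    return (z - 1) // width


def hayashi_grade(z):
    """Grade index for a lesion count: 0 mild, 1 moderate, 2 severe, 3 very severe."""
    if z <= 5:
        return 0
    if z <= 20:
        return 1
    if z <= 50:
        return 2
    return 3


def fine_to_hayashi(fine_idx, width=5):
    """Hayashi grade of an evenly sized class, read off its lowest count."""
    return hayashi_grade(fine_idx * width + 1)


# Assumed maximum count of 65 gives 13 evenly sized classes.
reverse_mapping = [fine_to_hayashi(k) for k in range(13)]
# reverse_mapping == [0, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3]
```

Under this assumption the mild grade corresponds to one new class, the moderate grade to three, and the severe grade to six.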
2.6 Combined loss function
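To combine severity grade assessment from the counting branch with direct global grading using the severity prediction branch, we train the model using a multi-task loss function defined, following [12], as:

$$L = (1 - \lambda)\, L_{cnt} + \frac{\lambda}{2} \left( L_{cls} + L_{cnt2cls} \right) \tag{9}$$

where $L_{cnt}$ is the lesion counting loss of eq. (5), $L_{cls}$ is the severity prediction loss of eq. (7), $L_{cnt2cls}$ is the loss between the severity label distribution and the severity prediction converted from the counting branch (Section 2.4), and $\lambda$ is the trade-off hyperparameter.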
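At the prediction stage, the class probabilities for the new set of classes are converted back to the original Hayashi class probabilities using the reverse mapping (see Fig. 2). After that, the final predicted distribution is obtained by averaging the predicted class and counting probability distributions:

$$p = \frac{1}{2} \left( p^{(cls)} + p^{(cnt2cls)} \right) \tag{10}$$

where $p^{(cls)}$ denotes the Hayashi class probabilities obtained from the severity prediction branch and $p^{(cnt2cls)}$ denotes the Hayashi class probabilities obtained by summing the predicted count probabilities.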
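The prediction stage described above can be sketched as follows: probabilities over the evenly sized classes are summed back into the four Hayashi grades, and the result is averaged with the grade distribution derived from the counting branch. The hard-coded mapping assumes 13 classes of five counts each (maximum count of 65), and the function names are hypothetical.

```python
# Assumed reverse mapping from 13 evenly sized classes to Hayashi grades.
FINE_TO_GRADE = [0, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3]


def merge_fine_probabilities(fine_probs, num_grades=4):
    """Sum the probabilities of evenly sized classes within each Hayashi grade."""
    grades = [0.0] * num_grades
    for k, p in enumerate(fine_probs):
        grades[FINE_TO_GRADE[k]] += p
    return grades


def final_prediction(cls_probs, cnt_probs):
    """Average the two Hayashi-scale distributions element-wise."""
    return [0.5 * (a + b) for a, b in zip(cls_probs, cnt_probs)]


# Illustrative inputs only: uniform outputs from both branches.
merged = merge_fine_probabilities([1.0 / 13] * 13)
final = final_prediction(merged, [0.25, 0.25, 0.25, 0.25])
predicted_grade = final.index(max(final))
```

Since both inputs are valid probability distributions over the same four grades, their average is one as well, and the predicted grade is simply its most probable entry.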
3 Experiments and results
3.1 Dataset and evaluation details
We evaluate the proposed approach using the ACNE04 benchmark dataset [12], which contains images with bounding-box annotations of lesions. For evaluation, the dataset is split into an 80% training set and a 20% testing set.
Considering accurate acne severity grading as the ultimate goal, we focus on classification metrics to evaluate model performance. In addition to the accuracy, precision, specificity, sensitivity, and Youden Index reported by Wu et al. [12], we also report the Matthews correlation coefficient (MCC) [16], which has recently been shown to have advantages over other classification metrics [17]. During training, we use the maximum validation MCC to select the best epoch for saving the model state for further evaluation.
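For reference, one common way to compute MCC for more than two classes is the multi-class generalization that reduces to the original binary formula when there are two classes. The sketch below implements that generalization in plain Python. Whether the reported numbers were computed with this exact variant is an assumption, and the function name is hypothetical.

```python
import math


def multiclass_mcc(y_true, y_pred, num_classes):
    """Multi-class Matthews correlation coefficient from label lists."""
    total = len(y_true)
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    true_counts = [y_true.count(k) for k in range(num_classes)]
    pred_counts = [y_pred.count(k) for k in range(num_classes)]

    numerator = correct * total - sum(
        t * p for t, p in zip(true_counts, pred_counts)
    )
    denominator = math.sqrt(
        (total ** 2 - sum(p ** 2 for p in pred_counts))
        * (total ** 2 - sum(t ** 2 for t in true_counts))
    )
    # Convention: return 0 when the coefficient is undefined.
    return numerator / denominator if denominator > 0 else 0.0


# Illustrative labels only: one moderate case misclassified as severe.
score = multiclass_mcc([0, 1, 1, 2, 3], [0, 1, 2, 2, 3], num_classes=4)
```

Unlike accuracy, the coefficient drops to zero when a model predicts a single grade for every image, which makes it a reasonable criterion for selecting the best epoch on an imbalanced dataset.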
3.2 Implementation details
We were unable to exactly reproduce the results from the original paper by Wu et al. [12]. Therefore, to ensure a fair comparison, we re-trained their LDL model from scratch using the provided source code. We use exactly the same ResNet-50 [18] architecture and training schedule, including the pre-defined cross-validation folds. We start calculating evaluation metrics after the first learning rate decay event. We tuned several hyperparameters using single-fold validation, including the standard deviation $\sigma$ in eq. (1), the mid-range smoothing weight $\varepsilon_0$ in eq. (4), and the trade-off parameter $\lambda$ balancing the counting and grading tasks in eq. (9).
3.3 Results and ablations
As shown in Table 1, we compared the performance of the baseline approach with each of the proposed contributions and with their combination. Smoothing labels with generated lesion count label distributions in the scale-adaptive fashion ("LD smoothing" column) improved the mean of every metric over the baseline. While the use of evenly sized class ranges in the severity grading branch showed no obvious improvement when applied independently ("New class ranges" column), the combination of both techniques resulted in a further performance boost on all metrics except precision. This indicates that the two components are complementary. The label distribution smoothing method effectively handles the uncertainty at the class boundaries and provides a more nuanced approach to learning the relationship between lesion counts and severity grading, while the simplified class definitions offer a more straightforward image grading process for the model. Together, they balance detail-oriented and global approaches, enhancing overall performance.
4 Conclusion
In this work, we introduced an automated acne image grading method that combines smoothing lesion count labels by label distributions based on the severity grading scale and simplifying severity class definitions to enhance global acne grading. Our results demonstrate the synergy of these strategies, boosting grading accuracy and promising a step forward in automated acne diagnostics. The novel technique of smoothing hard labels by label distributions instead of the uniform distribution is general and potentially applicable beyond acne grading, for example, for grading tumor malignancy.
5 Compliance with Ethical Standards
This research study was conducted retrospectively using human subject data made available in open access [12].
6 Acknowledgments
The authors thank Natalia Martynova for valuable discussions and other support in development of this project.
References
- [1] AM Layton, D Thiboutot, and J Tan, “Reviewing the global burden of acne: how could we improve care to reduce the burden?,” British Journal of Dermatology, vol. 184, no. 2, pp. 219–225, 2021.
- [2] DM Thiboutot, AM Layton, M-M Chren, EA Eady, and J Tan, “Assessing effectiveness in acne clinical trials: steps towards a core outcome measure set,” British Journal of Dermatology, vol. 181, no. 4, pp. 700–706, 2019.
- [3] Anne W Lucky, Beth L Barber, Cynthia J Girman, Jody Williams, Joan Ratterman, and Joanne Waldstreicher, “A multirater validation study to assess the reliability of acne lesion counting,” Journal of the American Academy of Dermatology, vol. 35, no. 4, pp. 559–565, 1996.
- [4] Jack Resneck Jr and Alexa B Kimball, “The dermatology workforce shortage,” Journal of the American Academy of Dermatology, vol. 50, no. 1, pp. 50–54, 2004.
- [5] Yuan Liu, Ayush Jain, Clara Eng, David H. Way, Kang Lee, Peggy Bui, Kimberly Kanada, Guilherme de Oliveira Marinho, Jessica Gallegos, Sara Gabriele, Vishakha Gupta, Nalini Singh, Vivek Natarajan, Rainer Hofmann-Wellenhof, Greg S. Corrado, Lily H. Peng, Dale R. Webster, Dennis Ai, Susan J. Huang, Yun Liu, R. Carter Dunn, and David Coz, “A deep learning system for differential diagnosis of skin diseases,” Nature Medicine, vol. 26, no. 6, pp. 900–908, 2020.
- [6] Roshaslinie Ramli, Aamir Saeed Malik, Ahmad Fadzil Mohamad Hani, and Adawiyah Jamil, “Acne analysis, grading and computational assessment methods: An overview,” Skin Research and Technology, vol. 18, no. 1, pp. 1–14, 2012.
- [7] Thanapha Chantharaphaichi, Bunyarit Uyyanonvara, Chanjira Sinthanayothin, and Akinori Nishihara, “Automatic acne detection for medical treatment,” in IC-ICTES. 2015, IEEE.
- [8] Nasim Alamdari, Kouhyar Tavakolian, Minhal Alhashim, and Reza Fazel-Rezai, “Detection and classification of acne lesions in acne patients: A mobile application,” in EIT. 2016, IEEE.
- [9] Gabriele Maroni, Michele Ermidoro, Fabio Previdi, and Glauco Bigini, “Automated detection, extraction and counting of acne lesions for automatic evaluation and tracking of acne severity,” in SSCI. 2017, IEEE.
- [10] Sophie Seité, Amir Khammari, Michael Benzaquen, Dominique Moyal, and Brigitte Dréno, “Development and accuracy of an artificial intelligence algorithm for acne grading from smartphone photographs,” Experimental Dermatology, vol. 28, no. 11, pp. 1252–1257, 2019.
- [11] Tamara Agnew, Gareth Furber, Matthew Leach, and Leonie Segal, “A comprehensive critique and review of published measures of acne severity,” The Journal of Clinical and Aesthetic Dermatology, vol. 9, no. 7, pp. 40–52, 2016.
- [12] Xiaoping Wu, Ni Wen, Jie Liang, Yu Kun Lai, Dongyu She, Ming Ming Cheng, and Jufeng Yang, “Joint acne image grading and counting via label distribution learning,” in ICCV. 2019, pp. 10641–10650, IEEE/CVF.
- [13] Xin Geng, “Label distribution learning,” IEEE Transactions on Knowledge and Data Engineering, vol. 28, no. 7, pp. 1734–1748, 2016.
- [14] Nobukazu Hayashi, Hirohiko Akamatsu, Makoto Kawashima, and Acne Study Group, “Establishment of grading criteria for acne severity,” The Journal of Dermatology, vol. 35, no. 5, pp. 255–260, 2008.
- [15] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna, “Rethinking the Inception architecture for computer vision,” in CVPR. 2016, pp. 2818–2826, IEEE/CVF.
- [16] Brian W Matthews, “Comparison of the predicted and observed secondary structure of T4 phage lysozyme,” Biochimica et Biophysica Acta (BBA)-Protein Structure, vol. 405, no. 2, pp. 442–451, 1975.
- [17] Davide Chicco and Giuseppe Jurman, “The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation,” BMC Genomics, vol. 21, no. 1, pp. 1–13, 2020.
- [18] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, “Deep residual learning for image recognition,” in CVPR. 2016, pp. 770–778, IEEE/CVF.