Improving Acne Image Grading with Label Distribution Smoothing

Abstract

Acne, a prevalent skin condition, necessitates precise severity assessment for effective treatment. Acne severity grading typically involves lesion counting and global assessment. However, manual grading suffers from variability and inefficiency, highlighting the need for automated tools. Recently, label distribution learning (LDL) was proposed as an effective framework for acne image grading, but its effectiveness is hindered by severity scales that assign varying numbers of lesions to different severity grades. To address these limitations, we propose to incorporate severity scale information into lesion counting by combining LDL with label smoothing, and to decouple it from global assessment. A novel weighting scheme in our approach adjusts the degree of label smoothing based on the severity grading scale. This method helps to manage label uncertainty without compromising class distinctiveness. Applied to the benchmark ACNE04 dataset, our model demonstrates improved performance in automated acne grading, showing its potential for enhancing acne diagnostics. The source code is publicly available at http://github.com/openface-io/acne-lds.

Index Terms— acne grading, label smoothing, label distribution learning

1 Introduction

Acne vulgaris, commonly known as acne, is a widespread skin condition that is estimated to affect over 700 million people worldwide, significantly impacting interpersonal relationships, social functioning, and mental health [1]. Accurate acne severity assessment is important for selecting the right treatment and as a clinical trial outcome [2]. However, manual severity grading by visual global assessment and lesion counting is time-consuming and susceptible to inter-observer variability [3]. Moreover, dermatologists are consistently in short supply, particularly in rural areas, and often cases are seen instead by general practitioners with lower diagnostic accuracy, while consultation costs are rising [4, 5]. Therefore, the use of automated tools for computer-aided acne severity assessment may be a promising alternative for broadening the availability of dermatology expertise [5].

Over the last two decades, multiple approaches to automated acne severity assessment from facial photos have been proposed. Initially, these solutions relied on conventional image analysis [6], but the breakthrough performance of deep learning in biomedical image analysis has shifted the focus to its use for acne lesion detection [7], classification [8], counting [9], and severity grading [10].

In acne image grading, each photo is assigned a severity level. Although over 20 different grading scales have been proposed over time, the medical community has yet to agree on standardized criteria [11]. Most grading scales rely on lesion counting as a quantifiable measure informative of severity. Recognizing the connection between lesion counting and global severity grading, Wu et al. [12] introduced a unified framework that tackles both tasks simultaneously and published the benchmark dataset ACNE04 with annotated lesions. Utilizing label distribution learning (LDL) [13], their method assigns to each image two label distributions: one for quantifying lesion counts and another for classifying acne severity. The severity class labels are based on the Hayashi scale [14], which delineates acne severity into four levels based on lesion count ranges: 1–5 lesions is mild, 6–20 lesions is moderate, 21–50 lesions is severe, and more than 50 lesions is very severe.

The approach proposed by Wu et al. [12] generates the ground-truth label distribution for lesion counts independently of the severity scale, assuming the same level of grade uncertainty for all lesion counts. However, the predicted lesion count is then converted into a severity grade prediction and evaluated against the labels generated according to the Hayashi severity scale [14]. As a result, grade uncertainty varies with the position of the lesion count on the severity scale: for example, counts of 12 and 13 lesions both confidently correspond to the moderate grade, while 20 and 21 lesions are assigned the moderate and severe grades, respectively.

At the same time, the global severity assessment branch directly predicts the severity grade distribution from an image. In contrast with the lesion counting task, accounting for the Hayashi scale is not beneficial here, because the scale delineates severity classes by uneven ranges of lesion counts (e.g., mild only includes 1–5 lesions, while moderate includes 6–20), making global prediction more challenging.

Fig. 1: Piecewise linear weighting of the smoothing parameter $\varepsilon$, which controls how much of the label distribution is added to smooth the hard label. Near the class boundaries $\varepsilon=1$, which corresponds to LDL, and near the mid-range $\varepsilon=\varepsilon_{min}$, which preserves the dominance of the original label value. The value $\varepsilon_{min}=0.6$ is tuned using single-fold validation.

Here, we propose an approach that addresses these issues by incorporating severity scale information into the generation of label distributions for lesion counting, while simultaneously removing it from the direct severity grade classification.

Our first contribution can be viewed as a novel way to combine label smoothing [15] with LDL. We smooth a hard lesion count label with the Gaussian label distribution, such that the amount of smoothing depends on where this lesion count falls on the severity scale (Fig. 1). This is realized by introducing a parameter $\varepsilon$ that controls the amount of smoothing applied to each lesion count based on its proximity to the grade range border. For the counts at grade range boundaries, we use $\varepsilon=1$ to generate Gaussian label distributions, enabling a soft transition between classes. This corresponds to LDL and incorporates high grade uncertainty. For lesion counts towards the middle of the grade range, we reduce the weight $\varepsilon$ of the label distribution such that the original count label remains dominant compared to its neighbors. In such cases, the grade uncertainty is lower, which allows the model to calibrate predictions accordingly. For instance, an image with a lesion count of 34, well within the range of the severe class, generates a distribution with less label smoothing to maintain a highly confident grade prediction (Fig. 1). This hybrid approach ensures that our model accounts for the inherent uncertainty in the counting task without diluting the distinctness of each class.

For the classification branch, we reduce the complexity of the task by breaking down the uneven Hayashi-defined grade ranges into evenly-sized classes such that each class range contains exactly five lesion counts. We demonstrate that our approach improves the results of automated acne grading on the benchmark dataset, indicating its potential to improve acne diagnostics.

2 Method

Let $x_{i}$ be the $i$-th image in the training set of size $N$, with the corresponding ground-truth lesion count annotation $z_{i}\in\{1,2,\dots,Z\}$, where $Z$ is the maximum lesion count, and the severity level $y_{i}\in\{1,2,\dots,Y\}$, where $Y$ is the number of distinct severity grades. The overall architecture follows [12], except for the changes described below (see Fig. 2).

2.1 Gaussian label distribution generation

Wu et al. [12] used the Gaussian function to generate the lesion count label distribution. For a particular lesion count label $c_{j}$ and image $x_{i}$, they defined the description degree as:

d_{x_{i}}^{c_{j}}=d(c_{j}|x_{i})=\frac{1}{\sqrt{2\pi\sigma^{2}}M}\exp\left(-\frac{\left(c_{j}-z_{i}\right)^{2}}{2\sigma^{2}}\right),

(1)

where $j\in\{1,2,\dots,Z\}$ and $M$ is the normalization factor:

M=\frac{1}{\sqrt{2\pi\sigma^{2}}}\sum\limits_{j=1}^{Z}\exp\left(-\frac{\left(c_{j}-z_{i}\right)^{2}}{2\sigma^{2}}\right),

(2)

such that $d_{x_{i}}^{c_{j}}\in\left[0,1\right]$ and $\sum\limits_{j=1}^{Z}d_{x_{i}}^{c_{j}}=1$ .
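As an illustration of eqs. (1) and (2), the following minimal NumPy sketch generates the Gaussian description degrees for one image. The function name and the maximum count of 65 are assumptions made for the example only; $\sigma=3.0$ is the value reported in Section 3.2. Note that the factor $1/\sqrt{2\pi\sigma^{2}}$ cancels between eq. (1) and eq. (2), so only the exponentials need to be normalized.

```python
import numpy as np

def gaussian_label_distribution(z, max_count, sigma=3.0):
    """Description degrees d(c_j | x_i) over counts 1..max_count for true count z."""
    counts = np.arange(1, max_count + 1)
    # Unnormalized Gaussian centred on the ground-truth count; the constant
    # 1/sqrt(2*pi*sigma^2) cancels against the normalization factor M.
    weights = np.exp(-((counts - z) ** 2) / (2.0 * sigma ** 2))
    return weights / weights.sum()

# max_count=65 is a hypothetical value chosen for illustration.
d = gaussian_label_distribution(z=12, max_count=65)
```

The resulting vector sums to one and peaks at the ground-truth count, with most of the mass spread over a few neighbouring counts on either side.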

2.2 Label smoothing

Label smoothing [15] was proposed to soften the hard label in the training process to prevent overconfidence and improve generalization. Consider one particular image $x$ with ground-truth label $y_{gt}$ that is one-hot encoded as $q(k|x)=\delta_{k,y_{gt}}$ . Then the original label can be replaced with a distribution:

q^{\prime}(k|x)=(1-\varepsilon)\delta_{k,y_{gt}}+\varepsilon u(k|x),

(3)

where $u(k|x)$ is usually the uniform distribution $u(k|x)=\frac{1}{K}$ , where $K$ is the number of classes. As the result, the true label description degree will be reduced, while the other classes will obtain non-zero values.
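For reference, eq. (3) with a uniform $u(k|x)$ can be sketched in a few lines. The function name, the zero-based class index, and the value $\varepsilon=0.1$ are illustrative assumptions, not settings from this work.

```python
import numpy as np

def smooth_label(y_gt, num_classes, eps=0.1):
    """Classic label smoothing: mix a one-hot label with the uniform distribution."""
    one_hot = np.zeros(num_classes)
    one_hot[y_gt] = 1.0  # y_gt is a zero-based class index here
    uniform = np.full(num_classes, 1.0 / num_classes)
    return (1.0 - eps) * one_hot + eps * uniform

q = smooth_label(y_gt=2, num_classes=4, eps=0.1)
```

With four classes and $\varepsilon=0.1$, the true class keeps a description degree of 0.925 and every other class receives 0.025, regardless of how far it is from the true label.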

2.3 Scale-adaptive label distribution smoothing

To obtain more confident predictions for the mid-range counts, while maintaining higher grade uncertainty for counts near the grade border, we propose a method that combines Gaussian label distribution generation with confident labels via a label smoothing-like weighting scheme (see Fig. 1). We achieve this in two steps. First, we replace the uniform distribution in eq. (3) with the generated label distribution from eq. (1). This limits the redistribution of confidence from the hard label to its surrounding neighbors, unlike traditional label smoothing, which assigns small description degrees to all labels. Second, we introduce a piecewise-linear schedule for the smoothing parameter $\varepsilon$ in order to control the weight of the label distribution based on the location of the count label in the grading scale, as illustrated in Fig. 1. Now we can replace eq. (3) with the following:

q^{\prime}(c_{j}|x_{i})=[1-\varepsilon(c_{j})]q(c_{j}|x_{i})+\varepsilon(c_{j})d(c_{j}|x_{i}),

(4)

where $q(c_{j}|x_{i})$ is the one-hot encoded ground-truth label, $q^{\prime}(c_{j}|x_{i})$ is the smoothed label distribution. Near the class border $\varepsilon=1$ , which corresponds to LDL, whereas for the mid-range labels $\varepsilon_{min}\leq\varepsilon(c_{j})<1$ ( $\varepsilon_{min}$ is a hyperparameter), which is more similar to the traditional label smoothing.
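The sketch below shows one possible realization of eq. (4). Several details are assumptions rather than facts from the text: the schedule is taken to be linear from $\varepsilon=1$ at both ends of each Hayashi range down to $\varepsilon_{min}$ at the range midpoint, the maximum count of 65 is hypothetical, and $\varepsilon$ is evaluated once at the ground-truth count $z_{i}$ so that the smoothed distribution stays normalized. The exact shape used in Fig. 1 may differ.

```python
import numpy as np

# Hayashi ranges as inclusive (low, high) lesion counts. The upper end of the
# last range is a hypothetical maximum count, not a value from the paper.
MAX_COUNT = 65
HAYASHI_RANGES = [(1, 5), (6, 20), (21, 50), (51, MAX_COUNT)]

def epsilon_schedule(z, eps_min=0.6):
    """Assumed piecewise-linear weight: 1 at range borders, eps_min at the midpoint."""
    for low, high in HAYASHI_RANGES:
        if low <= z <= high:
            mid = (low + high) / 2.0
            half_width = (high - low) / 2.0
            return eps_min + (1.0 - eps_min) * abs(z - mid) / half_width
    raise ValueError(f"lesion count {z} is outside 1..{MAX_COUNT}")

def smoothed_count_distribution(z, sigma=3.0, eps_min=0.6):
    """Mix the one-hot count label with its Gaussian label distribution."""
    counts = np.arange(1, MAX_COUNT + 1)
    gauss = np.exp(-((counts - z) ** 2) / (2.0 * sigma ** 2))
    gauss /= gauss.sum()
    one_hot = (counts == z).astype(float)
    eps = epsilon_schedule(z, eps_min)  # evaluated at the ground-truth count
    return (1.0 - eps) * one_hot + eps * gauss

q_mid = smoothed_count_distribution(34)     # well inside the severe range
q_border = smoothed_count_distribution(20)  # last count of the moderate range
```

Under these assumptions, a count of 20 receives the pure Gaussian distribution, while a count of 34 keeps a clearly dominant peak at the original label.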

2.4 Lesion counting branch

We replace $d_{x_{i}}^{c_{j}}$ with $q^{\prime}(c_{j}|x_{i})$ from eq. (4) in the loss function, which is the Kullback–Leibler (KL) divergence between the generated and predicted distributions:

\mathcal{L}_{cnt}(x_{i},z_{i})=-\sum\limits_{j=1}^{Z}q^{\prime}(c_{j}|x_{i})\ln\frac{p_{cnt}(c_{j}|x_{i},\boldsymbol{\theta})}{q^{\prime}(c_{j}|x_{i})},

(5)

where the probability of image $x_{i}$ belonging to class $c_{j}$ is:

p_{cnt}(c_{j}|x_{i},\boldsymbol{\theta})=\exp{(\theta_{c_{j}})}/\sum\limits_{l}\exp{(\theta_{l})}.

(6)
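Eqs. (5) and (6) can be written compactly in NumPy as below. This is only a per-image sketch with hypothetical function names; an actual training loop would use a deep learning framework and batched tensors. Terms where the target has zero mass are skipped, since they contribute nothing to the sum.

```python
import numpy as np

def softmax(logits):
    """Eq. (6): convert raw network outputs into a probability distribution."""
    shifted = logits - np.max(logits)  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

def kl_divergence(target, predicted, tiny=1e-12):
    """Eq. (5): KL(target || predicted) = -sum target * ln(predicted / target)."""
    mask = target > 0  # terms with zero target mass contribute nothing
    safe_pred = np.maximum(predicted[mask], tiny)
    return float(np.sum(target[mask] * np.log(target[mask] / safe_pred)))

target = np.array([0.0, 0.7, 0.3])
uniform_pred = softmax(np.zeros(3))
loss = kl_divergence(target, uniform_pred)
```

The loss is zero only when the predicted distribution matches the target and is positive otherwise.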

Following [12], we also convert count label distributions and their predictions into severity labels and predictions by summing up corresponding probabilities by the Hayashi scale.
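The conversion from a count distribution to a severity distribution amounts to summing probabilities within each Hayashi range. A minimal sketch follows, where the maximum count of 65 and the function name are assumptions for illustration:

```python
import numpy as np

MAX_COUNT = 65  # hypothetical maximum lesion count
# Inclusive Hayashi ranges: mild, moderate, severe, very severe.
HAYASHI_RANGES = [(1, 5), (6, 20), (21, 50), (51, MAX_COUNT)]

def counts_to_severity(count_probs):
    """Sum probabilities of counts 1..MAX_COUNT within each Hayashi range."""
    # count_probs[0] holds the probability of a count of 1, hence low - 1.
    return np.array([count_probs[low - 1:high].sum() for low, high in HAYASHI_RANGES])

one_hot_20 = np.zeros(MAX_COUNT)
one_hot_20[19] = 1.0
one_hot_21 = np.zeros(MAX_COUNT)
one_hot_21[20] = 1.0
grade_20 = counts_to_severity(one_hot_20)
grade_21 = counts_to_severity(one_hot_21)
```

The example reproduces the border case discussed in the introduction: a count of 20 maps entirely to the moderate grade, while a count of 21 maps entirely to the severe grade.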

Metric	Wu et al. [12]	LD smoothing	New class ranges	Both
Accuracy	83.70 $\pm$ 1.53	83.90 $\pm$ 1.48	83.63 $\pm$ 1.32	84.11 $\pm$ 1.94
Precision	82.97 $\pm$ 1.27	83.38 $\pm$ 3.02	82.63 $\pm$ 2.27	83.11 $\pm$ 2.56
Specificity	93.76 $\pm$ 0.63	93.81 $\pm$ 0.47	93.75 $\pm$ 0.42	93.99 $\pm$ 0.68
Sensitivity	81.06 $\pm$ 3.46	81.21 $\pm$ 2.29	81.47 $\pm$ 2.88	81.53 $\pm$ 2.95
Youden Index	74.83 $\pm$ 4.06	75.02 $\pm$ 2.75	75.22 $\pm$ 3.28	75.52 $\pm$ 3.61
MCC	75.41 $\pm$ 2.35	75.69 $\pm$ 2.18	75.32 $\pm$ 1.98	76.16 $\pm$ 2.82

Table 1: Evaluation results on the ACNE04 dataset [12]

2.5 Severity prediction branch

Since the severity grading branch is independent of lesion counting, we can convert the Hayashi-based severity grade labels into evenly-spaced ones (see Fig. 3). The severity label distribution is generated according to the new classes instead of the Hayashi scale. The severity prediction loss function is then:

\mathcal{L}_{cls}(x_{i},y_{i})=-\sum\limits_{k=1}^{Y^{\prime}}d(s^{\prime}_{k}|x_{i})\ln\frac{p_{cls}(s^{\prime}_{k}|x_{i},\boldsymbol{\theta})}{d(s^{\prime}_{k}|x_{i})},

(7)

where the probability of image $x_{i}$ belonging to class $s^{\prime}_{k}$ is:

p_{cls}(s^{\prime}_{k}|x_{i},\boldsymbol{\theta})=\exp{(\theta_{s^{\prime}_{k}})}/\sum\limits_{l}\exp{(\theta_{l})},

(8)

and $d(s^{\prime}_{k}|x_{i})$ is the new severity description degree.
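One plausible way to build the new severity description degrees is to sum the Gaussian count distribution of eq. (1) over consecutive blocks of five lesion counts. The sketch below follows that reading; the aggregation rule, the function names, and the maximum count of 65 (which gives $Y^{\prime}=13$ classes) are assumptions, not details confirmed by the text.

```python
import numpy as np

MAX_COUNT = 65   # hypothetical maximum lesion count Z
CLASS_WIDTH = 5  # each new class spans five consecutive lesion counts

def fine_class(z):
    """Zero-based index of the evenly sized class containing lesion count z."""
    return (z - 1) // CLASS_WIDTH

def fine_severity_distribution(z, sigma=3.0):
    """Sum the Gaussian count distribution within each five-count class."""
    counts = np.arange(1, MAX_COUNT + 1)
    gauss = np.exp(-((counts - z) ** 2) / (2.0 * sigma ** 2))
    gauss /= gauss.sum()
    dist = np.zeros(fine_class(MAX_COUNT) + 1)
    # Accumulate each count's probability into the class it belongs to.
    np.add.at(dist, fine_class(counts), gauss)
    return dist

d_fine = fine_severity_distribution(z=12)
```

Because the Hayashi borders at 5, 20, and 50 are all multiples of five, every five-count class lies entirely inside one Hayashi grade, which is what makes the later mapping back to the original grades unambiguous.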

2.6 Combined loss function

To combine severity grade assessment from the counting branch with direct global grading using the severity prediction branch, we train the model using a multi-task loss function defined as:

\begin{split}\mathcal{L}_{i}(x_{i},y_{i},z_{i})={}&(1-\lambda)\mathcal{L}_{cnt}(x_{i},z_{i})\\ &+\frac{\lambda}{2}\left(\mathcal{L}_{cls}(x_{i},y_{i})+\mathcal{L}_{cnt2cls}(x_{i},y_{i})\right),\end{split}

(9)

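where $\lambda$ is the trade-off hyperparameter and $\mathcal{L}_{cnt2cls}$ is the loss on the severity distribution obtained from the counting branch, as described in Section 2.4.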
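In code, eq. (9) is a weighted sum of three per-image losses. The sketch below uses a hypothetical function name and takes the already computed loss values as plain numbers; $\lambda=0.3$ is the value reported in Section 3.2.

```python
def combined_loss(loss_cnt, loss_cls, loss_cnt2cls, lam=0.3):
    """Eq. (9): trade off the counting loss against the two grading losses."""
    return (1.0 - lam) * loss_cnt + 0.5 * lam * (loss_cls + loss_cnt2cls)

total = combined_loss(loss_cnt=1.0, loss_cls=2.0, loss_cnt2cls=4.0, lam=0.3)
```

With $\lambda=0.3$, the counting loss carries a weight of 0.7 and each of the two grading losses a weight of 0.15.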

At the prediction stage, class probabilities $p_{cls}(s^{\prime}_{k}|x_{i},\boldsymbol{\theta})$ for the new set of classes are converted back to the original Hayashi class probabilities $p_{cls}(s_{k}|x_{i},\boldsymbol{\theta})$ using the reverse mapping, see Fig. 2. After that, the final predicted distribution is obtained by averaging predicted class and counting probability distributions:

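p_{tot}(\mathbf{s}|x_{i},\boldsymbol{\theta})=\frac{1}{2}\left(\tilde{p}_{cls}(\mathbf{s}|x_{i},\boldsymbol{\theta})+p_{cls}(\mathbf{s}|x_{i},\boldsymbol{\theta})\right),

(10)

where $\tilde{p}_{cls}$ denotes the severity distribution obtained from the counting branch by summing the predicted count probabilities over each Hayashi range.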
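The prediction stage can be sketched as follows, assuming 13 five-count classes (a hypothetical maximum count of 65) and that the reverse mapping simply sums the probabilities of the new classes that fall inside each Hayashi grade. The names are illustrative.

```python
import numpy as np

# Hayashi grade of each five-count class: 1-5 is mild, 6-20 moderate (3 classes),
# 21-50 severe (6 classes), 51-65 very severe (3 classes, hypothetical maximum).
FINE_TO_HAYASHI = np.array([0] + [1] * 3 + [2] * 6 + [3] * 3)

def fine_to_hayashi(fine_probs):
    """Reverse mapping: sum probabilities of new classes within each Hayashi grade."""
    grades = np.zeros(4)
    np.add.at(grades, FINE_TO_HAYASHI, fine_probs)
    return grades

def final_prediction(fine_probs, severity_from_counts):
    """Eq. (10): average the grading-branch and counting-branch severity distributions."""
    return 0.5 * (fine_to_hayashi(fine_probs) + severity_from_counts)

fine_probs = np.full(13, 1.0 / 13)                    # toy grading-branch output
severity_from_counts = np.array([0.1, 0.6, 0.3, 0.0])  # toy counting-branch output
p_tot = final_prediction(fine_probs, severity_from_counts)
predicted_grade = int(p_tot.argmax())
```

Both inputs are probability distributions over the four Hayashi grades after the reverse mapping, so their average is again a valid distribution, and the predicted grade is its argmax.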

3 Experiments and results

3.1 Dataset and evaluation details

We evaluate the proposed approach using the ACNE04 benchmark dataset [12]. It contains 1,457 images with 18,983 bounding boxes of lesions. For evaluation, the dataset is split into an 80% training set and a 20% test set, containing 1,165 and 292 images, respectively.

Considering accurate acne severity grading as the ultimate goal, we focus on classification metrics to evaluate model performance. In addition to the accuracy, precision, specificity, sensitivity, and Youden Index reported by Wu et al. [12], we also report the Matthews correlation coefficient (MCC) [16], which has recently been reported to have advantages over other classification metrics [17]. During training, we use the maximum validation MCC to select the best epoch, whose model state is saved for further evaluation.
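For concreteness, the sketch below computes sensitivity, specificity, the Youden Index, and the multiclass generalization of MCC from a confusion matrix. Macro-averaging over classes is an assumption about how the per-class values are aggregated, the function name is illustrative, and the code assumes every class appears at least once among the true labels.

```python
import numpy as np

def classification_metrics(confusion):
    """Macro sensitivity/specificity, Youden Index, and multiclass MCC.

    confusion[i, j] counts samples of true class i predicted as class j.
    """
    confusion = np.asarray(confusion, dtype=float)
    total = confusion.sum()
    tp = np.diag(confusion)
    true_counts = confusion.sum(axis=1)  # samples per true class
    pred_counts = confusion.sum(axis=0)  # samples per predicted class
    fn = true_counts - tp
    fp = pred_counts - tp
    tn = total - tp - fn - fp
    sensitivity = float(np.mean(tp / (tp + fn)))
    specificity = float(np.mean(tn / (tn + fp)))
    # Multiclass MCC computed directly from the confusion matrix.
    numerator = tp.sum() * total - np.dot(pred_counts, true_counts)
    denominator = np.sqrt(
        (total ** 2 - np.dot(pred_counts, pred_counts))
        * (total ** 2 - np.dot(true_counts, true_counts))
    )
    mcc = float(numerator / denominator) if denominator > 0 else 0.0
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "youden": sensitivity + specificity - 1.0,
        "mcc": mcc,
    }

metrics = classification_metrics([[5, 1], [2, 4]])
```

The Youden Index is sensitivity plus specificity minus one, which is consistent with the values in Table 1 (for example, 81.06 + 93.76 - 100 is approximately 74.83 for the baseline).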

3.2 Implementation details

We were unable to exactly reproduce the results from the original paper by Wu et al. [12]. Therefore, we re-trained their LDL model from scratch using the provided source code to ensure a fair comparison. We use exactly the same ResNet-50 [18] architecture and training schedule, including the pre-defined 5-fold cross-validation. We start calculating evaluation metrics after the first learning rate decay event. We tuned several hyperparameters using single-fold validation, including the standard deviation $\sigma=3.0$ in eq. (1), $\varepsilon_{min}=0.6$ in eq. (4), and the trade-off parameter $\lambda=0.3$ balancing the counting and grading tasks in eq. (9).

3.3 Results and ablations

As shown in Table 1, we compared the baseline approach with each of the two proposed contributions and with their combination. Smoothing labels with generated lesion count label distributions in the scale-adaptive fashion (“LD smoothing” column) improved the mean of every metric over the baseline, although the gains are small relative to the reported $\pm$ spread. The use of evenly-sized class ranges in the severity grading branch showed no consistent improvement when applied on its own (“New class ranges” column), but the combination of both techniques gave the best result on every metric except precision. This suggests that the two components are complementary. The label distribution smoothing method handles the uncertainty at the class boundaries and provides a more nuanced approach to learning the relationship between lesion counts and severity grading, while the simplified class definitions offer a more straightforward image grading task for the model. Together, they balance detail-oriented and global approaches, enhancing overall performance.

4 Conclusion

In this work, we introduced an automated acne image grading method that combines smoothing lesion count labels by label distributions based on the severity grading scale and simplifying severity class definitions to enhance global acne grading. Our results demonstrate the synergy of these strategies, boosting grading accuracy and promising a step forward in automated acne diagnostics. The novel technique of smoothing hard labels by label distributions instead of the uniform distribution is general and potentially applicable beyond acne grading, for example, for grading tumor malignancy.

5 Compliance with Ethical Standards

This research study was conducted retrospectively using human subject data made available in open access [12].

6 Acknowledgments

The authors thank Natalia Martynova for valuable discussions and other support in development of this project.

References

[1] AM Layton, D Thiboutot, and J Tan, “Reviewing the global burden of acne: how could we improve care to reduce the burden?,” British Journal of Dermatology, vol. 184, no. 2, pp. 219–225, 2021.
[2] DM Thiboutot, AM Layton, M-M Chren, EA Eady, and J Tan, “Assessing effectiveness in acne clinical trials: steps towards a core outcome measure set,” British Journal of Dermatology, vol. 181, no. 4, pp. 700–706, 2019.
[3] Anne W Lucky, Beth L Barber, Cynthia J Girman, Jody Williams, Joan Ratterman, and Joanne Waldstreicher, “A multirater validation study to assess the reliability of acne lesion counting,” Journal of the American Academy of Dermatology, vol. 35, no. 4, pp. 559–565, 1996.
[4] Jack Resneck Jr and Alexa B Kimball, “The dermatology workforce shortage,” Journal of the American Academy of Dermatology, vol. 50, no. 1, pp. 50–54, 2004.
[5] Yuan Liu, Ayush Jain, Clara Eng, David H. Way, Kang Lee, Peggy Bui, Kimberly Kanada, Guilherme de Oliveira Marinho, Jessica Gallegos, Sara Gabriele, Vishakha Gupta, Nalini Singh, Vivek Natarajan, Rainer Hofmann-Wellenhof, Greg S. Corrado, Lily H. Peng, Dale R. Webster, Dennis Ai, Susan J. Huang, Yun Liu, R. Carter Dunn, and David Coz, “A deep learning system for differential diagnosis of skin diseases,” Nature Medicine, vol. 26, no. 6, pp. 900–908, 2020.
[6] Roshaslinie Ramli, Aamir Saeed Malik, Ahmad Fadzil Mohamad Hani, and Adawiyah Jamil, “Acne analysis, grading and computational assessment methods: An overview,” Skin Research and Technology, vol. 18, no. 1, pp. 1–14, 2012.
[7] Thanapha Chantharaphaichi, Bunyarit Uyyanonvara, Chanjira Sinthanayothin, and Akinori Nishihara, “Automatic acne detection for medical treatment,” in IC-ICTES. 2015, IEEE.
[8] Nasim Alamdari, Kouhyar Tavakolian, Minhal Alhashim, and Reza Fazel-Rezai, “Detection and classification of acne lesions in acne patients: A mobile application,” in EIT. 2016, IEEE.
[9] Gabriele Maroni, Michele Ermidoro, Fabio Previdi, and Glauco Bigini, “Automated detection, extraction and counting of acne lesions for automatic evaluation and tracking of acne severity,” in SSCI. 2017, IEEE.
[10] Sophie Seité, Amir Khammari, Michael Benzaquen, Dominique Moyal, and Brigitte Dréno, “Development and accuracy of an artificial intelligence algorithm for acne grading from smartphone photographs,” Experimental Dermatology, vol. 28, no. 11, pp. 1252–1257, 2019.
[11] Tamara Agnew, Gareth Furber, Matthew Leach, and Leonie Segal, “A comprehensive critique and review of published measures of acne severity,” The Journal of Clinical and Aesthetic Dermatology, vol. 9, no. 7, pp. 40–52, 2016.
[12] Xiaoping Wu, Ni Wen, Jie Liang, Yu Kun Lai, Dongyu She, Ming Ming Cheng, and Jufeng Yang, “Joint acne image grading and counting via label distribution learning,” in ICCV. 2019, pp. 10641–10650, IEEE/CVF.
[13] Xin Geng, “Label distribution learning,” IEEE Transactions on Knowledge and Data Engineering, vol. 28, no. 7, pp. 1734–1748, 2016.
[14] Nobukazu Hayashi, Hirohiko Akamatsu, Makoto Kawashima, and Acne Study Group, “Establishment of grading criteria for acne severity,” The Journal of Dermatology, vol. 35, no. 5, pp. 255–260, 2008.
[15] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna, “Rethinking the Inception architecture for computer vision,” in CVPR. 2016, pp. 2818–2826, IEEE/CVF.
[16] Brian W Matthews, “Comparison of the predicted and observed secondary structure of T4 phage lysozyme,” Biochimica et Biophysica Acta (BBA)-Protein Structure, vol. 405, no. 2, pp. 442–451, 1975.
[17] Davide Chicco and Giuseppe Jurman, “The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation,” BMC Genomics, vol. 21, no. 1, pp. 1–13, 2020.
[18] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, “Deep residual learning for image recognition,” in CVPR. 2016, pp. 770–778, IEEE/CVF.