EPSD: Early Pruning with Self-Distillation for Efficient Model Compression

Dong Chen¹,²*, Ning Liu²*, Yichen Zhu², Zhengping Che², Rui Ma¹, Fachao Zhang², Xiaofeng Mou², Yi Chang¹, Jian Tang²†

* These authors contributed equally. This work was done during Dong Chen’s internship at Midea Group.
† Corresponding authors.

Abstract

Neural network compression techniques, such as knowledge distillation (KD) and network pruning, have received increasing attention. Recent work ‘Prune, then Distill’ reveals that a pruned, student-friendly teacher network can benefit the performance of KD. However, the conventional teacher-student pipeline, which entails cumbersome pre-training of the teacher and complicated compression steps, makes pruning with KD less efficient. Beyond compressing models, recent compression techniques also emphasize efficiency. Early pruning demands significantly less computation than conventional pruning methods because it does not require a large pre-trained model. Likewise, a special case of KD known as self-distillation (SD) is more efficient since it requires no pre-training or student-teacher pair selection. This inspires us to combine early pruning with SD for efficient model compression. In this work, we propose a framework named Early Pruning with Self-Distillation (EPSD), which identifies and preserves distillable weights in early pruning for a given SD task. EPSD efficiently combines early pruning and self-distillation in a two-step process, maintaining the pruned network’s trainability for compression. Instead of simply combining pruning and SD, EPSD makes the pruned network favor SD by keeping more distillable weights before training, so that the pruned network can be distilled more effectively. We demonstrate that EPSD improves the training of pruned networks, supported by visual and quantitative analyses. Our evaluation covers diverse benchmarks (CIFAR-10/100, Tiny-ImageNet, full ImageNet, CUB-200-2011, and Pascal VOC), with EPSD outperforming advanced pruning and SD techniques.

1 Introduction

Figure 1: Comparison of different model compression schemes. (a) PKD (Park and No 2022) follows four steps to combine pruning and KD. (b) Our Early Pruning with SD (EPSD) needs only two steps for compression.

Resource-limited edge devices struggle to handle the computational demands of large deep neural networks (DNNs). Therefore, compressing deep models is crucial to eliminate redundancy, facilitating the effective deployment of DNNs on edge devices (Gong et al. 2019; Liu et al. 2020; Guo, Xu, and Ouyang 2023). Various compression methods have been well studied, including network pruning (Han et al. 2015; Guo, Ouyang, and Xu 2020; Huang et al. 2023), knowledge distillation (KD) (Hinton et al. 2014), parameter quantization (Hubara et al. 2017; Wei et al. 2022) and low-rank decomposition (Zhang et al. 2015). Among them, KD and pruning have received increasing attention.

The concept behind KD is to train a smaller student network to approximate a larger, pre-trained teacher network with higher accuracy (Hinton et al. 2014). The cost of pre-training and the capacity gap between teachers and students inevitably limit the usage of KD (Mirzadeh et al. 2020; Xu and Liu 2019). To overcome these limitations, self-distillation (SD) has been proposed to enable students to distill knowledge from themselves (Shen et al. 2022; Yang et al. 2019; Zhang, Bao, and Ma 2021; Zhang et al. 2019). Namely, SD allows the student network to learn from its own predictions (Mobahi, Farajtabar, and Bartlett 2020), enabling a more streamlined training procedure that requires far fewer computational resources. However, a potential risk in SD is that the student may overfit if the training process is not properly regularized (Kim et al. 2021). Therefore, regularizing the student model becomes crucial to ensure effective knowledge transfer.

Network pruning removes the redundancy inside the original network and generates a sub-network with comparable accuracy (LeCun, Denker, and Solla 1989; Liu et al. 2021a). In addition to reducing the computational requirements, pruning also helps prevent overfitting of DNNs (Han, Mao, and Dally 2016). Current pruning works focus on pruning in the early stage (Frankle and Carbin 2019; de Jorge et al. 2021; Alizadeh et al. 2022), where pruning takes place either at initialization or after a few training steps. Early pruning methods operate efficiently without the need for a pre-trained model. Recent work ‘Prune, then Distill’ (Park and No 2022) (we refer to it as PKD) explores the regularization effect of pruning on KD. The authors analyze the distillation process regularized by the pruned teacher and combine pruning and KD in four steps, as shown in Fig. 1(a): 1) pre-train a teacher network $f_{t}$, 2) prune $f_{t}$ to obtain a pruned teacher $f_{pt}$, 3) construct a student network $f_{s}$ according to $f_{pt}$, and 4) distill knowledge from $f_{pt}$ to $f_{s}$. Though PKD reveals that a pruned, student-friendly teacher can boost the performance of KD, the cumbersome pre-training of the teacher and the complicated steps make its training costly.

To simplify the compression process, we combine early pruning with SD for efficient model compression. An intuitive approach is to prune the network and then fine-tune the pruned network with SD. However, unlike in PKD, in the context of SD, pruning the teacher network also affects the student and leads to inadequate regularization if the pruned student has weak trainability. Empirically, a simple combination of pruning and SD results in severe performance degradation, especially at large sparsity ratios (as shown in Fig. 2). Therefore, the key question is: How can DNNs be pruned effectively with SD to yield performance gains?

A promising solution is to make the pruned network favorable to SD, i.e., to preserve more distillable weights to ensure the efficacy of the SD process. To this end, we propose a novel framework named EPSD that combines Early Pruning and Self-Distillation for efficient model compression. Specifically, EPSD has two main steps, as shown in Fig. 1(b): 1) Early Pruning: prune an initialized network $f_{init}$ to obtain a pruned sub-network $f_{sub}$ with distillable weights. 2) Self-Distillation: train the pruned network $f_{sub}$ by SD. Namely, given a desired sparsity level, EPSD globally ranks the weights in $f_{init}$ according to their influence (quantified by the absolute gradients) on the SD loss and removes the weights with less influence. By doing so, the trainability of the student network can be enhanced, since the sub-network maintains objective consistency with the SD loss and preserves more distillable weights. Next, EPSD applies SD to recover the accuracy of the student network with distillable weights. Our contributions are summarized as follows:

• We present EPSD, which combines early pruning with SD to compress models efficiently in only two steps. Meanwhile, EPSD preserves the trainability of the pruned network to improve performance.

• EPSD identifies distillable weights that ensure objective consistency between early pruning and SD, and we present quantitative and visualized analysis to demonstrate the efficacy of EPSD.

• Extensive results with three advanced SD methods on multiple benchmarks show that EPSD outperforms advanced pruning and SD methods while showcasing its scalability on two downstream tasks.

2 Related Works

Knowledge Distillation. Knowledge Distillation (KD) transfers various ‘knowledge’ in networks (Romero et al. 2015; Hinton et al. 2014), acting as a potent regularization method to enhance generalization by utilizing learned softened targets (Shen et al. 2022). However, the capacity gap prevents well-performing teachers from making students better (Mirzadeh et al. 2020).

Self-Distillation. To improve the efficiency of knowledge transfer, Self-Distillation (SD) leverages knowledge from the student network without involving additional teachers (Wang and Yoon 2021; Yun et al. 2020). The key to SD is creating soft targets, where the student network generates its own knowledge to guide its training (Lee, Hwang, and Shin 2020; Zhang, Bao, and Ma 2021; Yang et al. 2019; Shen et al. 2022). SD’s efficiency arises from avoiding teacher network pre-training and sidestepping teacher-student capacity gaps. Yet, the student network may overfit due to insufficient training regularization (Kim et al. 2021). Recently, PKD (Park and No 2022) revealed the positive regularizing impact of pruning teacher networks on KD, which inspires us to regularize the SD process by pruning.

Network Pruning. Network pruning aims to identify and remove unnecessary weights, reducing complexity while preserving training performance (Reed 1993; Lee et al. 2020). Traditional approaches (Han et al. 2015; Molchanov et al. 2017) typically follow a pre-train, prune, and re-train pipeline, which requires much training effort. Another paradigm named Dynamic Sparse Training (DST) (Mocanu et al. 2018; Bellec et al. 2018; Evci et al. 2020; Liu et al. 2021b) starts from a (random) sparse neural network and allows the sparse connectivity to evolve dynamically during training. DST can significantly improve the trainability of sparse DNNs without increasing the training FLOPs. Recently, early pruning (Lee, Ajanthan, and Torr 2019; Wang, Zhang, and Grosse 2020; de Jorge et al. 2021; Alizadeh et al. 2022) has been widely studied, as it identifies sparse sub-networks before training without cumbersome pre-training. Many early pruning works evaluate the importance of individual weights by their impact on the loss, i.e., the gradients of the network. Though early pruning is efficient, it is considered to underperform (Wang et al. 2022): pruning neural networks breaks the dynamical isometry (Saxe, McClelland, and Ganguli 2014) and degrades trainability (Lee et al. 2020). In this work, we empirically show that SD greatly enhances the performance of early-pruned networks and improves their trainability by ensuring alignment between the pruning and SD objectives.

3 Early Pruning with Self-Distillation

We first introduce a simple combination of early pruning and SD and show that it suffers performance degradation. To address this issue, we introduce the concept of distillable weights, along with quantitative and visualized analysis. Finally, we present the overall framework of EPSD, demonstrating the efficiency by comparing the required training efforts with other compression techniques.

The ‘Simple Combination’

A straightforward way to combine pruning and SD requires two steps: step-1 prunes the network without pre-training, and step-2 distills knowledge from the network to itself. Specifically, step-1 identifies a sub-network from the randomly initialized network by pruning, and step-2 fine-tunes the sub-network via SD. Since our goal is to compress the model efficiently, the early pruning method ProsPr (Alizadeh et al. 2022) is used as the representative method in step-1.

Step-1: Identify Redundancy Before Training. Lee et al. first proposed SNIP (Lee, Ajanthan, and Torr 2019) to prune the weights of a randomly initialized network that are least salient for the loss. They compute the gradients $\Delta$ to generate saliency scores for the initial weights $\theta_{init}$ with random samples $x_{rand}$ and remove the weights with the lowest scores. Specifically, an all-one mask $m$ is attached to the initial weights to get $\theta_{0}\leftarrow m\odot\theta_{init}$, and the saliency scores can be computed as:

$$\Delta(w_{p},x_{rand})=\frac{\partial\mathcal{L}(\theta_{0},x_{rand})}{\partial m_{p}}, \tag{1}$$

$$s_{w_{p}}=\frac{\left|\Delta(w_{p},x_{rand})\right|}{\sum_{q}\left|\Delta(w_{q},x_{rand})\right|}, \tag{2}$$

where $m$ is the binary pruning mask (initialized to all ones), $\Delta$ denotes gradients derived from labels, $w_{p}$ is the $p$-th weight in $\theta_{init}$, and $s_{w_{p}}$ is the saliency score measuring the importance of $w_{p}$. Recently, Alizadeh et al. pointed out that pruning should consider the trainability of a weight rather than only its immediate impact on the loss before training (Alizadeh et al. 2022). They measure the impact of pruning on the loss across $i$ gradient descent steps during initial training, rather than assessing changes in the loss at initialization. The saliency scores are calculated based on the updated weights $\theta_{i}$ as in Eq. (3):

$$\Delta(w_{p},x_{i})=\frac{\partial\mathcal{L}(\theta_{i},x_{i})}{\partial m_{p}}, \tag{3}$$

where $x_{i}$ denotes the $i$-th randomly sampled batch of data for computing the gradients. In classification tasks, the cross-entropy (CE) loss runs through the entire process, from pruning to training. The difference between predictions and labels is used both to evaluate the importance of weights and to optimize the pruned network.
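As an illustration of Eqs. (1) and (2), the following minimal sketch computes SNIP-style saliency scores for a single linear softmax classifier in plain Python. The function names `softmax` and `snip_saliency`, the toy weights, and the one-sample setup are assumptions made for this example, not the authors' implementation. Because the mask multiplies the weights elementwise, the chain rule gives $\partial\mathcal{L}/\partial m_{p}=(\partial\mathcal{L}/\partial w_{p})\cdot w_{p}$, which the code evaluates in closed form.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def snip_saliency(weights, x, label):
    """Normalized saliency scores (Eqs. 1-2) for a linear softmax classifier.

    weights: C x D nested list (a stand-in for theta_init)
    x:       length-D input sample
    label:   ground-truth class index
    The mask m is all ones, so dL/dm[c][d] = dL/dw[c][d] * w[c][d].
    """
    logits = [sum(w * xd for w, xd in zip(row, x)) for row in weights]
    probs = softmax(logits)
    # Cross-entropy gradient w.r.t. logit c is p_c - 1[c == label].
    dlogits = [p - (1.0 if c == label else 0.0) for c, p in enumerate(probs)]
    # |dL/dm[c][d]| = |dL/dlogit_c * x_d * w[c][d]|
    raw = [[abs(dl * xd * w) for w, xd in zip(row, x)]
           for dl, row in zip(dlogits, weights)]
    total = sum(sum(r) for r in raw)
    if total == 0.0:
        raise ValueError("all gradients are zero; scores are undefined")
    return [[v / total for v in r] for r in raw]


# Toy example: 2 classes, 3 input features (hypothetical values).
theta_init = [[0.5, -0.2, 0.1], [-0.3, 0.4, 0.0]]
scores = snip_saliency(theta_init, x=[1.0, 2.0, -1.0], label=0)
```

A weight that is exactly zero at initialization receives a zero score here, since masking it cannot change the loss.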

Step-2: Distilling Knowledge from Softened Targets. In a classification task, we denote $\mathbf{x}\in\mathcal{X}$ as the input and $y\in\mathcal{Y}=\{1,\ldots,C\}$ as its ground-truth label. Given the input $\mathbf{x}$, the predictive distribution of a softmax classifier is:

$$P(y\mid\mathbf{x};\theta,\tau)=\frac{\exp\left(l_{y}(\mathbf{x};\theta)/\tau\right)}{\sum_{i=1}^{C}\exp\left(l_{i}(\mathbf{x};\theta)/\tau\right)}, \tag{4}$$

where $l_{i}$ denotes the logit for class $i$ produced by a DNN parameterized by $\theta$, and $\tau>0$ is the temperature scaling factor. To improve the generalization ability, traditional KD (Hinton et al. 2014) transfers a pre-trained teacher’s knowledge by optimizing an additional Kullback-Leibler (KL) divergence loss between the softened outputs $\widetilde{P}$ of the teacher and the student in every mini-batch $x_{i}$:

$$\mathcal{L}_{KD}=\frac{1}{n}\sum_{i=1}^{n}\tau^{2}\cdot D_{KL}\left(\widetilde{P}(x_{i};\theta_{t})\,\|\,\widetilde{P}(x_{i};\theta_{s})\right). \tag{5}$$
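To make Eqs. (4) and (5) concrete, here is a small sketch of the temperature-scaled softmax and the $\tau^{2}$-weighted KL loss in plain Python. The helper names `log_soft_probs` and `kd_loss` are hypothetical, and logits are passed as nested lists (one row per sample) purely for illustration.

```python
import math


def log_soft_probs(logits, tau):
    """Log of the temperature-scaled softmax in Eq. (4), computed stably."""
    scaled = [v / tau for v in logits]
    peak = max(scaled)
    log_norm = peak + math.log(sum(math.exp(v - peak) for v in scaled))
    return [v - log_norm for v in scaled]


def kd_loss(teacher_logits, student_logits, tau):
    """Batch average of tau^2 * KL(teacher || student), following Eq. (5)."""
    total = 0.0
    for t_row, s_row in zip(teacher_logits, student_logits):
        log_t = log_soft_probs(t_row, tau)
        log_s = log_soft_probs(s_row, tau)
        # KL(p_t || p_s) = sum_j p_t[j] * (log p_t[j] - log p_s[j])
        kl = sum(math.exp(lt) * (lt - ls) for lt, ls in zip(log_t, log_s))
        total += tau ** 2 * kl
    return total / len(teacher_logits)


# Toy batch of two samples with three classes (hypothetical logits).
teacher = [[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]]
student = [[1.0, 1.0, 0.0], [0.0, 0.5, 2.0]]
loss = kd_loss(teacher, student, tau=4.0)
```

The loss is zero when the student reproduces the teacher's logits and positive otherwise, which is the property the distillation objective relies on.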

The original KD matches the predictions of the same inputs from two different networks, while SD replaces the teacher’s prediction with that of the student network itself. Various works (Lee, Hwang, and Shin 2020; Xu and Liu 2019; Zhang et al. 2019; Zhang, Bao, and Ma 2021; Yang et al. 2019; Shen et al. 2022) have explored enhancing the SD method in different ways. Our work focuses on the gradients of the SD loss rather than specific improvements. We further discuss the gradients of the SD loss in Sec. 3. Without loss of generality, we formulate the SD loss as follows:

$$\mathcal{L}_{SD}=\frac{1}{n}\sum_{i=1}^{n}\tau^{2}\cdot D_{KL}\left(\widetilde{P}(\overline{x}_{i};\overline{\theta}_{s})\,\|\,\widetilde{P}(x_{i};\theta_{s})\right), \tag{6}$$

where $\widetilde{P}({\overline{x}_{i};\overline{\theta}_{s}})$ represents the soft targets produced by the student networks in SD, and different SD methods have different definitions of $\overline{x}_{i}$ and $\overline{\theta}_{s}$ . We refer the readers to the Appendix for a more detailed explanation of these symbols.
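Since the method relies on gradients of the SD loss, it may help to write out the gradient of one summand of Eq. (6) with respect to a student logit $l_{k}$. The derivation below is a sketch under the assumption that the soft targets are treated as constants (no gradient flows through $\overline{\theta}_{s}$); the shorthand $p^{t}_{j}=\widetilde{P}(j\mid\overline{x}_{i};\overline{\theta}_{s})$ and $p^{s}_{j}=\widetilde{P}(j\mid x_{i};\theta_{s})$ is introduced here for brevity and does not appear in the original text.

```latex
\begin{aligned}
\frac{\partial}{\partial l_{k}}\Big[\tau^{2} D_{KL}\big(p^{t}\,\|\,p^{s}\big)\Big]
&= -\tau^{2}\sum_{j=1}^{C} p^{t}_{j}\,\frac{\partial \log p^{s}_{j}}{\partial l_{k}} \\
&= -\tau^{2}\sum_{j=1}^{C} p^{t}_{j}\cdot\frac{1}{\tau}\big(\mathbb{1}[j=k]-p^{s}_{k}\big) \\
&= \tau\,\big(p^{s}_{k}-p^{t}_{k}\big).
\end{aligned}
```

Under this assumption the gradient vanishes whenever the soft targets coincide with the student's current softened prediction, so informative SD gradients require soft targets that differ from the current output, for example through a different input view or an earlier set of weights.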

Remarks. A baseline method for combining early pruning and SD applies the two techniques sequentially (we name it the ‘Simple Combination’). This straightforward approach produces a distilled, sparse network. The preliminary study shown in Fig. 2 demonstrates that at a sparsity ratio of 95%, the accuracy of the ‘Simple Combination’ is only 62.67%, nearly 13% lower than the ‘Unpruned Baseline’ and 8% lower than ‘Pruning Only’. These anomalous results indicate that the pruned network cannot learn effectively when SD is applied directly to the early-pruned network. This raises one question: “Why does the early-pruned network lose accuracy when trained with SD?” In the pruning step of the ‘Simple Combination’, the gradient only reflects the difference between the network output and the hard labels, without considering the soft targets generated in SD. We argue that, as a result, it is difficult for the early-pruned network to learn from itself during SD when early pruning and SD are combined directly.

Identify Distillable Weights via SD

As introduced in the previous section, a simple combination of early pruning and SD does not lead to performance gains and even results in severe degradation at large sparsity. In the SD scenario, when the teacher network is pruned, the student is also affected, potentially leading to inadequate distillation if the student is weak. A desirable mitigation is to make the pruning result favorable to SD, i.e., to preserve more distillable weights so that the pruned model can be better distilled. Intuitively, obtaining distillable weights implies that the pruned network should be consistent with the optimization objective of SD.

As a result, we propose to identify distillable weights with the SD loss before training. More specifically, during pruning, we establish a knowledge transmission path that lets the model learn from its own outputs. We evaluate the importance of the weights by conducting a few SD iterations to derive the necessary gradients. Formally, the saliency score for an individual weight can be derived from Eqs. (3) and (6):

$$\widetilde{\Delta}(w_{p},x_{i})=\frac{\partial\mathcal{L}_{SD}(\theta_{i},x_{i})}{\partial m_{p}}, \tag{7}$$

$$\widetilde{s}_{w_{p}}=\frac{\left|\widetilde{\Delta}(w_{p},x_{i})\right|}{\sum_{q}\left|\widetilde{\Delta}(w_{q},x_{i})\right|}. \tag{8}$$

We remove the weights that have the least impact on the SD loss according to the desired sparsity ratio, and the weights with higher saliency scores $\widetilde{s}$ are regarded as distillable and preserved. We thoroughly assess weight importance by considering both hard-label influences and network-generated knowledge during pruning, resulting in more reliable saliency criteria driven mainly by the SD loss.
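The scoring rule in Eqs. (7) and (8) can be sketched for a single linear softmax layer as follows. This is a simplified illustration under several assumptions: the soft targets are supplied by the caller and treated as constants, only one sample and one scoring step are used (the paper evaluates the gradient after a few SD iterations), and the names `soft_probs` and `sd_saliency` are hypothetical.

```python
import math


def soft_probs(logits, tau):
    """Temperature-scaled softmax, as in Eq. (4)."""
    scaled = [v / tau for v in logits]
    peak = max(scaled)
    exps = [math.exp(v - peak) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sd_saliency(weights, x, soft_targets, tau):
    """Normalized |dL_SD/dm| scores (Eqs. 7-8) for one sample.

    weights:      C x D nested list of current weights
    x:            length-D input sample
    soft_targets: length-C distribution produced by the chosen SD method,
                  treated as a constant (an assumption of this sketch)
    """
    logits = [sum(w * xd for w, xd in zip(row, x)) for row in weights]
    p_s = soft_probs(logits, tau)
    # With constant targets, the gradient of tau^2 * KL(targets || p_s)
    # w.r.t. logit k is tau * (p_s[k] - targets[k]).
    dlogits = [tau * (ps - pt) for ps, pt in zip(p_s, soft_targets)]
    # The all-ones mask multiplies the weights, so dL/dm = dL/dw * w.
    raw = [[abs(dl * xd * w) for w, xd in zip(row, x)]
           for dl, row in zip(dlogits, weights)]
    total = sum(sum(r) for r in raw)
    if total == 0.0:
        raise ValueError("SD gradient is zero; targets equal the prediction")
    return [[v / total for v in r] for r in raw]


# Toy example with hypothetical weights and soft targets.
scores = sd_saliency([[0.5, -0.2], [0.1, 0.3]], x=[1.0, 2.0],
                     soft_targets=[0.7, 0.3], tau=2.0)
```

Compared with a label-based score, the only change is the error signal at the logits: it comes from the soft targets rather than from a one-hot label.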

To analyze the trainability of the pruned model, we leverage loss-surface visualization (Li et al. 2018) to inspect the loss landscape and assess the ease of optimization. Additionally, we use the mean Jacobian singular value (Mean-JSV) as a quantitative metric to gauge compliance with the dynamical isometry condition (Wang et al. 2021; Wang and Fu 2023). The top of Fig. 4 shows the contour plots of the loss. We observe that the loss surface of EPSD is flatter than that of the ‘Simple Combination’ and reaches a local minimum faster (minimum loss value $0.6$ vs. $1.6$ within equal training steps), implying that the model pruned by EPSD is easier to optimize (Arora et al. 2018; Dinh et al. 2017). The bottom of Fig. 4 shows Mean-JSV curves over the first 200 training steps for the pruned models obtained by EPSD and the ‘Simple Combination’, respectively. In theory, a larger Mean-JSV (closer to 1) indicates better trainability of the model. The Mean-JSV of EPSD better meets the dynamical isometry requirement than that of the ‘Simple Combination’, revealing the potential of keeping objective consistency between pruning and SD in preserving trainable weights.

Remarks. To tackle the degradation issue raised by the ‘Simple Combination’, we aim to pinpoint distillable weights preferred by SD for improved accuracy. Our visual and quantitative analysis reveals that sub-networks identified by maintaining objective consistency exhibit superior trainability compared to those identified solely through pruning.

Towards Efficient Model Compression

Fig. 3 shows the overall compression procedure of EPSD. There are two main steps, as mentioned in Sec. 3. In step-1, given randomly initialized weights $\theta_{init}$, EPSD estimates the effect of pruning on the SD loss (Eq. (6)) over $i$ steps of gradient descent. By doing so, EPSD preserves more distillable weights, which are crucial since they offer superior trainability and are more easily optimized by the SD loss, as discussed in Sec. 3. Once the pruning mask $m$ is generated from the gradients $\widetilde{\Delta}$, we apply it to the initial weights $\theta_{init}$ to get a pruned network. In step-2, we train the pruned network by SD until it converges.
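The mask-generation part of step-1 can be sketched as a global ranking over all layers followed by an elementwise product with the initial weights. The sketch below assumes that saliency scores have already been computed and flattened per layer; the function names `global_mask` and `apply_mask`, the layer names, and the tie-handling rule are illustrative choices, not details given in the text.

```python
def global_mask(scores_by_layer, sparsity):
    """Binary masks that keep the globally highest-scoring weights.

    scores_by_layer: dict mapping a layer name to a flat list of scores
    sparsity:        fraction of weights to remove, in [0, 1]
    Ties at the threshold are all kept, so slightly more weights than
    requested may survive (a simplification of this sketch).
    """
    flat = sorted((s for layer in scores_by_layer.values() for s in layer),
                  reverse=True)
    n_keep = len(flat) - int(len(flat) * sparsity)
    if n_keep <= 0:
        return {name: [0] * len(layer)
                for name, layer in scores_by_layer.items()}
    threshold = flat[n_keep - 1]
    return {name: [1 if s >= threshold else 0 for s in layer]
            for name, layer in scores_by_layer.items()}


def apply_mask(weights_by_layer, mask_by_layer):
    """Elementwise product m * theta_init, layer by layer."""
    return {name: [w * m for w, m in zip(weights_by_layer[name],
                                         mask_by_layer[name])]
            for name in weights_by_layer}


# Toy example: two layers, 50% global sparsity (hypothetical values).
scores = {"conv1": [0.30, 0.05, 0.10], "fc": [0.25, 0.02, 0.28]}
theta_init = {"conv1": [0.4, -0.7, 0.2], "fc": [-0.1, 0.9, 0.3]}
mask = global_mask(scores, sparsity=0.5)
theta_pruned = apply_mask(theta_init, mask)
```

Ranking globally rather than per layer lets the sparsity budget be distributed unevenly across layers, which matches the description of EPSD ranking the weights of the whole network. Step-2 would then train `theta_pruned` with the SD loss while keeping the mask fixed.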

We emphasize that EPSD is efficient, which is attributed to: 1) the absence of pre-training for pruning, 2) the elimination of teacher training, and 3) the pruned network’s distillable weights, which contribute to improved trainability and faster convergence during SD. In Fig. 5, we demonstrate the training efforts of EPSD and compare them against other representative compression methods. Among them, EPSD combines early pruning and SD (PR+SD), DMC uses advanced pruning (PR), ReKD is a KD method (KD), and the other two are combinations of pruning and KD (PR+KD). EPSD achieves efficient training with fewer epochs than other methods. For instance, the training time of PKD is about eight times that of EPSD (11.3 vs. 1.4 hours).

4 Experiments

We evaluate EPSD on various benchmarks, including CIFAR-10/CIFAR-100 (Krizhevsky, Hinton et al. 2009), Tiny-ImageNet, and full ImageNet (Deng et al. 2009) using diverse networks and comparing with the ‘Simple Combination’ approach, advanced pruning and SD methods. We also assess EPSD’s adaptability and scalability in two downstream tasks. More details can be found in the Appendix.

EPSD equipped with Various SD Methods

We incorporate three distinct SD algorithms (CS-KD (Yun et al. 2020), PS-KD (Kim et al. 2021), and DLB (Shen et al. 2022)) into EPSD to ensure a comprehensive evaluation. Our experiments are conducted on the CIFAR-10/100 and Tiny-ImageNet datasets across five sparsity ratios (36%, 59%, 79%, 90%, 95%). To ensure a fair comparison, we employ identical hyper-parameters for training on each dataset. For each variant of EPSD utilizing a specific SD method, we compare against 1) the unpruned network without any pruning or SD (‘Unpruned Baseline’), 2) the network trained with the respective SD method (‘SD Only’), and 3) the simple combination of pruning and the specific SD method (‘Simple Combination’). Figure 6 illustrates the comparison results.

Based on the results, we have the following observations:

• EPSD consistently outperformed the ‘Simple Combination’ over all settings. Moreover, under high sparsity conditions (e.g., $95\%$), EPSD remained competitive while the ‘Simple Combination’ declined heavily.

• On the more challenging Tiny-ImageNet, the ‘Simple Combination’ degraded more severely than EPSD for all three SD methods. For instance, with DLB and VGG-19 on Tiny-ImageNet at $90\%$ sparsity, the accuracy of the ‘Simple Combination’ is $20.80\%$ lower than the ‘Unpruned Baseline’ ($29.08\%$ vs. $49.88\%$), while EPSD achieved $53.91\%$ accuracy, an increase of $1.80\%$ and $3.93\%$ over ‘SD Only’ and the ‘Unpruned Baseline’, respectively.

• EPSD outperformed the ‘Unpruned Baseline’ and ‘SD Only’ with all three SD methods in most settings, indicating that early pruning with SD can boost the performance of SD. EPSD maintains an advantage over the ‘Simple Combination’, affirming its efficacy in preserving more distillable weights and achieving promising performance.

Comparison of Pruning Methods

To illustrate the effectiveness of EPSD, we compared EPSD with advanced pruning methods on CIFAR-10/100 (See Appendix) and ImageNet. Further, we extended EPSD with structured pruning to show the extensibility of our method.

Backbone	VGG-19				ResNet-50
Sparsity	90%		95%		90%		95%
Accuracy	top1	top5	top1	top5	top1	top5	top1	top5
Unpruned	73.1	91.3	73.1	91.3	75.6	92.8	75.6	92.8
SNIP ’19	68.5	88.8	63.8	86.0	61.5	83.9	44.3	69.6
GraSP ’20	69.5	89.2	67.0	87.4	65.4	86.7	46.2	66.0
FORCE ’21	70.2	89.5	65.8	86.8	64.9	86.5	59.0	82.3
DOP ’22	-	-	-	-	64.1	-	48.1	-
ProsPr ’22	70.7	89.9	66.1	87.2	65.9	86.9	59.6	82.8
Sim.Cmb.	17.3	25.8	15.4	23.0	9.9	16.4	8.3	15.3
EPSD	71.2	90.1	67.1	87.6	66.3	87.3	60.1	83.0

Table 1: Comparing test accuracy of various advanced early pruning methods at 90% and 95% sparsity on full ImageNet. ‘Sim.Cmb.’ refers to the ‘Simple Combination’.

CIFAR-10/100. We perform extensive comparisons with recent early pruning methods on CIFAR-10 and CIFAR-100, and we applied EPSD to two popular lightweight networks (MobileNet-v2 (Sandler et al. 2018) and MobileViT (Mehta and Rastegari 2022)), which is not a common practice in previous early pruning works. We also investigated the iterative version of EPSD. Please refer to the Appendix.

ImageNet. We evaluated EPSD on the challenging full ImageNet dataset. Table 1 compares EPSD with advanced pruning methods in terms of top-1 and top-5 accuracy at $90\%$ and $95\%$ sparsity with VGG-19 and ResNet-50. EPSD surpasses other early pruning methods and notably addresses the degradation problem of the ‘Simple Combination’ on challenging datasets. For instance, EPSD leads GraSP by $0.9\%$ and improves by $0.4\%$ over ProsPr in top-1 accuracy at 90% sparsity with ResNet-50. This highlights EPSD’s effective synergy of early pruning and SD, leading to enhanced performance.

Structured Pruning. To illustrate the extensibility of EPSD, we evaluate structured pruning, where entire channels are eliminated rather than individual weights. We compare EPSD against 3SP (van Amersfoort et al. 2020), ProsPr (Alizadeh et al. 2022), and the random structured pruning baseline reported in ProsPr. The results are summarized in Table 2, where EPSD achieves the best accuracy among the structured pruning methods.
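One simple way to move from per-weight scores to channel-level decisions is sketched below. The text does not specify how channel importance is obtained, so the aggregation rule used here (summing the scores of all weights in an output channel) is purely an assumption for illustration, as are the names `channel_scores` and `prune_channels`.

```python
def channel_scores(weight_scores):
    """Aggregate per-weight saliency into one score per output channel.

    weight_scores: nested list with one row of scores per output channel.
    Summation is an assumed aggregation rule, not taken from the paper.
    """
    return [sum(row) for row in weight_scores]


def prune_channels(weight_scores, sparsity):
    """Return the sorted indices of the output channels that are kept."""
    scores = channel_scores(weight_scores)
    n_prune = int(len(scores) * sparsity)
    # Rank channels from most to least important.
    ranked = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)
    return sorted(ranked[:len(scores) - n_prune])


# Toy layer with four output channels and two weights each (hypothetical).
layer_scores = [[0.10, 0.20], [0.01, 0.02], [0.30, 0.05], [0.00, 0.04]]
kept = prune_channels(layer_scores, sparsity=0.5)
```

Removing whole channels shrinks the dense weight tensors themselves, which is why structured sparsity translates more directly into speedups on standard hardware than unstructured masks do.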

Comparison of SD Methods

Since EPSD explores the effective combination of early pruning and SD, we also compare EPSD with SD methods. Specifically, we compare EPSD with LSR (Szegedy et al. 2016), TFKD (Yuan et al. 2020), CSKD (Yun et al. 2020), PSKD (Kim et al. 2021), and DLB (Shen et al. 2022) using various models (ResNet-32/110 and VGG-16/19) on CIFAR-10/100. In this comparison, EPSD prunes networks to 80% sparsity while the SD methods remain unpruned. Table 3 shows the results. Surprisingly, although EPSD removes most of the weights, it still achieves comparable or better performance than other advanced SD methods. Note that directly comparing early-pruned models with unpruned self-distilled models is uncommon in prior research, because models obtained through early pruning are often considered less trainable (Lee et al. 2020; Frankle et al. 2021; Wang et al. 2022). However, we demonstrate that combining early pruning with self-distillation is a viable and competitive approach.

Sparsity	Method	CIFAR-10 (%)	CIFAR-100 (%)
-	Unpruned	93.88	72.84
80%	Random	92.00	67.50
	3SP	93.40	69.90
	ProsPr	93.61	72.29
	EPSD	93.82	73.16
90%	Random	90.40	63.80
	3SP	93.10	68.30
	ProsPr	93.64	71.12
	EPSD	93.72	71.80

Table 2: Test accuracy among various structured pruning methods using VGG-19 on CIFAR-10 and CIFAR-100 under sparsity ratios 80% and 90%.

Impact of SD-based Pre-training

In previous sections, we showed that a simple combination of early pruning and SD can lead to performance degradation. To verify the key idea of EPSD, that identifying more distillable weights improves accuracy, we design another combination: 1) train the network from scratch with SD, 2) prune it, and 3) fine-tune the pruned model with SD to regain performance. We name this method ‘Simple Combination-2’ (SC-2). Compared to the ‘Simple Combination’ (SC-1), SC-2 requires more pre-training effort. To explore the potential impact of SD-based pre-training on the pruned model, we tested SC-2 on ImageNet using ResNet-50. The experiments in Table 4 indicate that SC-2 achieves accuracy comparable to EPSD (66.4% vs. 66.2%). We argue this is because SD-based pre-training in SC-2 produces distillable weights: after pruning with a standard cross-entropy (CE) loss, the remaining weights keep their distillable nature, allowing the pruned model to recover accuracy when fine-tuned with SD. In addition, building upon SC-2, we used EPSD to compress the SD-pre-trained model (instead of starting from random initialization), which further improved accuracy (66.6%); we attribute this to retaining more distillable weights by pruning with the SD loss.

Dataset	Net.	U.P.	LSR	TFKD	CSKD	PSKD	DLB	EPSD
CIFAR-10	R32	93.46	93.27	93.68	93.12	94.04	94.15	94.68
	R110	94.79	94.40	95.08	93.88	94.91	95.15	95.32
	V16	93.97	94.09	94.08	93.78	94.10	94.62	94.51
	V19	93.88	93.95	94.09	93.62	93.93	94.42	94.45
CIFAR-100	R32	71.74	71.79	73.91	70.79	72.51	74.00	74.30
	R110	76.36	76.68	72.98	76.59	77.15	78.18	78.45
	V16	73.63	74.19	74.06	74.19	74.05	76.12	76.31
	V19	74.61	73.25	72.54	73.35	73.64	75.47	76.11

Table 3: Comparing test accuracy (%) against advanced SD methods and the unpruned baseline (U.P.). The top section shows CIFAR-10 and the lower section displays CIFAR-100. We use ‘R’ for ResNet and ‘V’ for VGG. EPSD is pruned to 80% sparsity, while the other approaches remain unpruned.

Downstream Tasks

We further verify the robustness of EPSD on two downstream tasks presented below. See the Appendix for details.

Weakly Supervised Object Localization. As shown in Table 5, we report the error rates at a pruning ratio of 50%. Compared to ProsPr, EPSD achieved lower errors (Cls. Err. of 24.40% vs. 25.39%, Top-1 Loc. Err. of 41.23% vs. 48.65%). Compared to the unpruned baseline, the Top-1 localization error of EPSD rose by only 0.27%, showing improved generalization in weakly supervised scenarios.

Semantic Segmentation. As shown in Table 6, EPSD outperforms ProsPr and the ‘Simple Combination’ on both metrics. Specifically, EPSD is 1.63% higher than ProsPr in mean IoU and 2.63% higher in pixel accuracy. Compared to the ‘Simple Combination’, the improvements are even larger, with increases of 5.26% and 5.76% in the two metrics, respectively.

Method	P.T. w/ SD (#Epochs)	PR. w/ CE	PR. w/ SD	(Re-)Train w/ SD (#Epochs)	Top-1 Acc. (%)
SC-1	0	✓		100	9.9
EPSD	0		✓	100	66.2
SC-2	100	✓		100	66.4
P.T.+EPSD	100		✓	100	66.6

Table 4: Investigation of the impact of SD-based Pre-training. ‘P.T.’ means pre-training and ‘PR.’ is the pruning process with a 90% sparsity ratio.

Method	s.p.	Cls.Err. ($\downarrow$)	Top-1 Loc.Err. ($\downarrow$)	Gt-Known Loc.Err. ($\downarrow$)
Unpruned	-	23.90%	40.96%	23.97%
ProsPr	50%	25.39%	48.65%	32.69%
Sim.Cmb.	50%	27.53%	50.57%	33.33%
EPSD	50%	24.40%	41.23%	25.08%

Table 5: Results of weakly supervised object localization task on CUB-200-2011. The top-1 classification error (Cls.Err.) and localization error rates (Loc.Err.) are reported.

Method	s.p.	Mean IoU ( $\uparrow$ )	pixAcc ( $\uparrow$ )
Unpruned	-	46.46%	85.70%
ProsPr	40%	42.87%	80.34%
Sim.Cmb.	40%	39.24%	77.21%
EPSD	40%	44.50%	82.97%

Table 6: Results of semantic segmentation task on Pascal VOC 2012. The mean intersection-over-union (Mean IoU) and pixel accuracy (pixAcc) are reported.
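For readers less familiar with the two segmentation metrics, the sketch below computes mean IoU and pixel accuracy from a class confusion matrix using their standard definitions. The evaluation code used for Table 6 is not described in the text, so the function name `segmentation_metrics`, the confusion-matrix input format, and the handling of absent classes are assumptions of this example.

```python
def segmentation_metrics(confusion):
    """Return (mean IoU, pixel accuracy) from a square confusion matrix.

    confusion[i][j] counts pixels whose true class is i and whose
    predicted class is j, accumulated over the evaluation set.
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[c][c] for c in range(n))
    pix_acc = correct / total

    ious = []
    for c in range(n):
        tp = confusion[c][c]
        fn = sum(confusion[c]) - tp                        # missed pixels of c
        fp = sum(confusion[r][c] for r in range(n)) - tp   # wrongly labeled c
        union = tp + fp + fn
        if union > 0:  # skip classes absent from both labels and predictions
            ious.append(tp / union)
    return sum(ious) / len(ious), pix_acc


# Toy two-class example with 20 pixels (hypothetical counts).
mean_iou, pix_acc = segmentation_metrics([[8, 2], [1, 9]])
```

Mean IoU averages over classes and therefore penalizes errors on rare classes more than pixel accuracy does, which is one reason the two columns in Table 6 can move by different amounts.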

Discussion and Limitation

This paper explores an efficient model compression framework. By effectively combining early pruning with SD, EPSD improves the performance of pruned models without the burden of extensive training. Importantly, we address the degradation issue arising in a simple combination of early pruning and SD, shedding light on a promising research direction for combining these two techniques, which may offer useful insights to the community. However, this paper mainly addresses fundamental computer vision models. Our focus has yet to encompass the presently prevalent large-scale language or multi-modal networks, which remains a potential direction for our future research.

5 Conclusion

In this study, we introduce the Early Pruning with Self-Distillation (EPSD) framework, which identifies and retains distillable weights during pruning for a specific SD task. EPSD integrates early pruning and SD in just two steps, ensuring the trainability of pruned networks for effective model compression. We show that a straightforward combination of pruning and SD can result in performance decline, particularly at high sparsity ratios. Extensive visual and quantitative analyses show that EPSD enhances the trainability of pruned networks and outperforms advanced pruning and SD methods. We believe EPSD will inspire more follow-ups on efficient compression of other networks, such as multi-modal ones, which will accelerate the deployment of the latest deep models to edge devices.

References

Aghli and Ribeiro (2021) Aghli, N.; and Ribeiro, E. 2021. Combining weight pruning and knowledge distillation for cnn compression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 3191–3198.
Alizadeh et al. (2022) Alizadeh, M.; Tailor, S. A.; Zintgraf, L. M.; van Amersfoort, J.; Farquhar, S.; Lane, N. D.; and Gal, Y. 2022. Prospect Pruning: Finding Trainable Weights at Initialization using Meta-Gradients. International Conference on Learning Representations.
Arora et al. (2018) Arora, S.; Ge, R.; Neyshabur, B.; and Zhang, Y. 2018. Stronger generalization bounds for deep nets via a compression approach. In International Conference on Machine Learning, 254–263. PMLR.
Bellec et al. (2018) Bellec, G.; Kappel, D.; Maass, W.; and Legenstein, R. 2018. Deep rewiring: Training very sparse deep networks. International Conference on Learning Representations.
Chen et al. (2021) Chen, P.; Liu, S.; Zhao, H.; and Jia, J. 2021. Distilling knowledge via knowledge review. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 5008–5017.
de Jorge et al. (2021) de Jorge, P.; Sanyal, A.; Behl, H. S.; Torr, P. H.; Rogez, G.; and Dokania, P. K. 2021. Progressive skeletonization: Trimming more fat from a network at initialization. International Conference on Learning Representations.
Deng et al. (2009) Deng, J.; Dong, W.; Socher, R.; Li, L.-J.; Li, K.; and Fei-Fei, L. 2009. Imagenet: A large-scale hierarchical image database. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 248–255.
Dinh et al. (2017) Dinh, L.; Pascanu, R.; Bengio, S.; and Bengio, Y. 2017. Sharp minima can generalize for deep nets. In International Conference on Machine Learning, 1019–1028. PMLR.
Evci et al. (2020) Evci, U.; Gale, T.; Menick, J.; Castro, P. S.; and Elsen, E. 2020. Rigging the lottery: Making all tickets winners. In International Conference on Machine Learning, 2943–2952. PMLR.
Everingham et al. (2015) Everingham, M.; Eslami, S. A.; Van Gool, L.; Williams, C. K.; Winn, J.; and Zisserman, A. 2015. The pascal visual object classes challenge: A retrospective. International journal of computer vision, 111: 98–136.
Frankle and Carbin (2019) Frankle, J.; and Carbin, M. 2019. The lottery ticket hypothesis: Finding sparse, trainable neural networks. International Conference on Learning Representations.
Frankle et al. (2020) Frankle, J.; Dziugaite, G. K.; Roy, D.; and Carbin, M. 2020. Linear mode connectivity and the lottery ticket hypothesis. In International Conference on Machine Learning, 3259–3269. PMLR.
Frankle et al. (2021) Frankle, J.; Dziugaite, G. K.; Roy, D. M.; and Carbin, M. 2021. Pruning neural networks at initialization: Why are we missing the mark? International Conference on Learning Representations.
Gao et al. (2020) Gao, S.; Huang, F.; Pei, J.; and Huang, H. 2020. Discrete model compression with resource constraint for deep neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 1899–1908.
Geifman, Uziel, and El-Yaniv (2018) Geifman, Y.; Uziel, G.; and El-Yaniv, R. 2018. Bias-reduced uncertainty estimation for deep neural classifiers. arXiv preprint arXiv:1805.08206.
Gong et al. (2019) Gong, R.; Liu, X.; Jiang, S.; Li, T.; Hu, P.; Lin, J.; Yu, F.; and Yan, J. 2019. Differentiable soft quantization: Bridging full-precision and low-bit neural networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 4852–4861.
Guo, Ouyang, and Xu (2020) Guo, J.; Ouyang, W.; and Xu, D. 2020. Channel pruning guided by classification loss and feature importance. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, 10885–10892.
Guo, Xu, and Ouyang (2023) Guo, J.; Xu, D.; and Ouyang, W. 2023. Multidimensional Pruning and Its Extension: A Unified Framework for Model Compression. IEEE Transactions on Neural Networks and Learning Systems.
Han, Mao, and Dally (2016) Han, S.; Mao, H.; and Dally, W. J. 2016. Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. International Conference on Learning Representations.
Han et al. (2015) Han, S.; Pool, J.; Tran, J.; and Dally, W. 2015. Learning both weights and connections for efficient neural network. Advances in Neural Information Processing Systems, 28.
He et al. (2016) He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 770–778.
Hinton et al. (2014) Hinton, G.; Vinyals, O.; Dean, J.; et al. 2014. Distilling the knowledge in a neural network. Advances in Neural Information Processing Systems Workshop.
Huang et al. (2023) Huang, Y.; Liu, N.; Che, Z.; Xu, Z.; Shen, C.; Peng, Y.; Zhang, G.; Liu, X.; Feng, F.; and Tang, J. 2023. CP3: Channel Pruning Plug-In for Point-Based Networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 5302–5312.
Hubara et al. (2017) Hubara, I.; Courbariaux, M.; Soudry, D.; El-Yaniv, R.; and Bengio, Y. 2017. Quantized neural networks: Training neural networks with low precision weights and activations. The Journal of Machine Learning Research, 18(1): 6869–6898.
Jaiswal et al. (2022) Jaiswal, A. K.; Ma, H.; Chen, T.; Ding, Y.; and Wang, Z. 2022. Training your sparse neural network better with any mask. In International Conference on Machine Learning, 9833–9844. PMLR.
Kim et al. (2021) Kim, K.; Ji, B.; Yoon, D.; and Hwang, S. 2021. Self-knowledge distillation with progressive refinement of targets. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 6567–6576.
Krizhevsky, Hinton et al. (2009) Krizhevsky, A.; Hinton, G.; et al. 2009. Learning multiple layers of features from tiny images. (Technical Report).
LeCun, Denker, and Solla (1989) LeCun, Y.; Denker, J.; and Solla, S. 1989. Optimal brain damage. Advances in Neural Information Processing Systems, 2.
Lee, Hwang, and Shin (2020) Lee, H.; Hwang, S. J.; and Shin, J. 2020. Self-supervised label augmentation via input transformations. In International Conference on Machine Learning, 5714–5724. PMLR.
Lee et al. (2020) Lee, N.; Ajanthan, T.; Gould, S.; and Torr, P. H. 2020. A signal propagation perspective for pruning neural networks at initialization. International Conference on Learning Representations.
Lee, Ajanthan, and Torr (2019) Lee, N.; Ajanthan, T.; and Torr, P. H. 2019. Snip: Single-shot network pruning based on connection sensitivity. International Conference on Learning Representations.
Li et al. (2018) Li, H.; Xu, Z.; Taylor, G.; Studer, C.; and Goldstein, T. 2018. Visualizing the loss landscape of neural nets. Advances in Neural Information Processing Systems, 31.
Liu et al. (2020) Liu, N.; Ma, X.; Xu, Z.; Wang, Y.; Tang, J.; and Ye, J. 2020. Autocompress: An automatic dnn structured pruning framework for ultra-high compression rates. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, 4876–4883.
Liu et al. (2021a) Liu, N.; Yuan, G.; Che, Z.; Shen, X.; Ma, X.; Jin, Q.; Ren, J.; Tang, J.; Liu, S.; and Wang, Y. 2021a. Lottery Ticket Preserves Weight Correlation: Is It Desirable or Not? In International Conference on Machine Learning, 7011–7020. PMLR.
Liu et al. (2021b) Liu, S.; Yin, L.; Mocanu, D. C.; and Pechenizkiy, M. 2021b. Do we actually need dense over-parameterization? in-time over-parameterization in sparse training. In International Conference on Machine Learning, 6989–7000. PMLR.
Long, Shelhamer, and Darrell (2015) Long, J.; Shelhamer, E.; and Darrell, T. 2015. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 3431–3440.
Mehta and Rastegari (2022) Mehta, S.; and Rastegari, M. 2022. Mobilevit: light-weight, general-purpose, and mobile-friendly vision transformer. International Conference on Learning Representations.
Mirzadeh et al. (2020) Mirzadeh, S. I.; Farajtabar, M.; Li, A.; Levine, N.; Matsukawa, A.; and Ghasemzadeh, H. 2020. Improved knowledge distillation via teacher assistant. In Proceedings of the AAAI conference on artificial intelligence, volume 34, 5191–5198.
Mobahi, Farajtabar, and Bartlett (2020) Mobahi, H.; Farajtabar, M.; and Bartlett, P. 2020. Self-distillation amplifies regularization in hilbert space. Advances in Neural Information Processing Systems, 33: 3351–3361.
Mocanu et al. (2018) Mocanu, D. C.; Mocanu, E.; Stone, P.; Nguyen, P. H.; Gibescu, M.; and Liotta, A. 2018. Scalable training of artificial neural networks with adaptive sparse connectivity inspired by network science. Nature communications, 9(1): 1–12.
Molchanov et al. (2017) Molchanov, P.; Tyree, S.; Karras, T.; Aila, T.; and Kautz, J. 2017. Pruning convolutional neural networks for resource efficient inference. International Conference on Learning Representations.
Mostafa and Wang (2019) Mostafa, H.; and Wang, X. 2019. Parameter efficient training of deep convolutional neural networks by dynamic sparse reparameterization. In International Conference on Machine Learning, 4646–4655. PMLR.
Naeini, Cooper, and Hauskrecht (2015) Naeini, M. P.; Cooper, G.; and Hauskrecht, M. 2015. Obtaining well calibrated probabilities using bayesian binning. In Twenty-Ninth AAAI Conference on Artificial Intelligence.
Pan et al. (2021) Pan, X.; Gao, Y.; Lin, Z.; Tang, F.; Dong, W.; Yuan, H.; Huang, F.; and Xu, C. 2021. Unveiling the potential of structure preserving for weakly supervised object localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 11642–11651.
Park and No (2022) Park, J.; and No, A. 2022. Prune your model before distill it. In European Conference on Computer Vision, 120–136. Springer.
Paszke et al. (2019) Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. 2019. Pytorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems, 32.
Reed (1993) Reed, R. 1993. Pruning algorithms-a survey. IEEE Transactions on Neural Networks, 4(5): 740–747.
Renda, Frankle, and Carbin (2020) Renda, A.; Frankle, J.; and Carbin, M. 2020. Comparing rewinding and fine-tuning in neural network pruning. arXiv preprint arXiv:2003.02389.
Romero et al. (2015) Romero, A.; Ballas, N.; Kahou, S. E.; Chassang, A.; Gatta, C.; and Bengio, Y. 2015. Fitnets: Hints for thin deep nets. International Conference on Learning Representations.
Sandler et al. (2018) Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; and Chen, L.-C. 2018. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 4510–4520.
Saxe, McClelland, and Ganguli (2014) Saxe, A. M.; McClelland, J. L.; and Ganguli, S. 2014. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. International Conference on Learning Representations.
Shen et al. (2022) Shen, Y.; Xu, L.; Yang, Y.; Li, Y.; and Guo, Y. 2022. Self-Distillation from the Last Mini-Batch for Consistency Regularization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 11943–11952.
Simonyan and Zisserman (2014) Simonyan, K.; and Zisserman, A. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
Szegedy et al. (2016) Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; and Wojna, Z. 2016. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2818–2826.
Tanaka et al. (2020) Tanaka, H.; Kunin, D.; Yamins, D. L.; and Ganguli, S. 2020. Pruning neural networks without any data by iteratively conserving synaptic flow. Advances in Neural Information Processing Systems, 33: 6377–6389.
van Amersfoort et al. (2020) van Amersfoort, J.; Alizadeh, M.; Farquhar, S.; Lane, N.; and Gal, Y. 2020. Single shot structured pruning before training. arXiv preprint arXiv:2007.00389.
Wah et al. (2011) Wah, C.; Branson, S.; Welinder, P.; Perona, P.; and Belongie, S. 2011. The Caltech-UCSD Birds-200-2011 dataset. California Institute of Technology.
Wang, Zhang, and Grosse (2020) Wang, C.; Zhang, G.; and Grosse, R. 2020. Picking winning tickets before training by preserving gradient flow. International Conference on Learning Representations.
Wang and Fu (2023) Wang, H.; and Fu, Y. 2023. Trainability Preserving Neural Pruning. International Conference on Learning Representations.
Wang et al. (2021) Wang, H.; Qin, C.; Bai, Y.; and Fu, Y. 2021. Dynamical isometry: The missing ingredient for neural network pruning. arXiv preprint arXiv:2105.05916.
Wang et al. (2022) Wang, H.; Qin, C.; Bai, Y.; Zhang, Y.; and Fu, Y. 2022. Recent advances on neural network pruning at initialization. In Proceedings of the International Joint Conference on Artificial Intelligence, 23–29.
Wang and Yoon (2021) Wang, L.; and Yoon, K.-J. 2021. Knowledge distillation and student-teacher learning for visual intelligence: A review and new outlooks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(6): 3048–3068.
Wei et al. (2022) Wei, X.; Zhang, Y.; Zhang, X.; Gong, R.; Zhang, S.; Zhang, Q.; Yu, F.; and Liu, X. 2022. Outlier suppression: Pushing the limit of low-bit transformer language models. Advances in Neural Information Processing Systems, 35: 17402–17414.
Wightman (2019) Wightman, R. 2019. PyTorch Image Models. https://github.com/rwightman/pytorch-image-models.
Xu and Liu (2019) Xu, T.-B.; and Liu, C.-L. 2019. Data-distortion guided self-distillation for deep neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, 5565–5572.
Yang et al. (2019) Yang, C.; Xie, L.; Su, C.; and Yuille, A. L. 2019. Snapshot distillation: Teacher-student optimization in one generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2859–2868.
Yuan et al. (2020) Yuan, L.; Tay, F. E.; Li, G.; Wang, T.; and Feng, J. 2020. Revisiting knowledge distillation via label smoothing regularization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 3903–3911.
Yun et al. (2020) Yun, S.; Park, J.; Lee, K.; and Shin, J. 2020. Regularizing class-wise predictions via self-knowledge distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 13876–13885.
Zagoruyko and Komodakis (2016) Zagoruyko, S.; and Komodakis, N. 2016. Wide residual networks. arXiv preprint arXiv:1605.07146.
Zhang, Bao, and Ma (2021) Zhang, L.; Bao, C.; and Ma, K. 2021. Self-distillation: Towards efficient and compact neural networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(8): 4388–4403.
Zhang et al. (2019) Zhang, L.; Song, J.; Gao, A.; Chen, J.; Bao, C.; and Ma, K. 2019. Be your own teacher: Improve the performance of convolutional neural networks via self distillation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 3713–3722.
Zhang et al. (2015) Zhang, X.; Zou, J.; He, K.; and Sun, J. 2015. Accelerating very deep convolutional networks for classification and detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(10): 1943–1955.

Appendix

We provide more implementation details and additional results for the proposed Early Pruning with Self-Distillation (EPSD) in this appendix. Algorithm 1 shows the detailed compression procedure of our method.

The organization of the appendix is as follows:

• In Sec. A, we outline the datasets, networks, and other experimental setups employed in this paper.

• Sec. B provides the experimental setups for the empirical studies shown in the main manuscript, to underscore the credibility of our research.

• In Sec. C, we present additional experimental results, including the theoretical training/test FLOPs, the combined effects with traditional knowledge distillation (KD), and a brief investigation of integrating SD with a dynamic sparse training (DST) method.

• Sec. D presents comparison results of EPSD against other advanced early pruning approaches on the CIFAR-10 and CIFAR-100 datasets. We also assess the performance of EPSD using recently popular lightweight networks.

• In Sec. E, we introduce the iterative variant of EPSD, referred to as EPSD-It. We demonstrate the effectiveness of EPSD-It through a comprehensive comparison of various pruning methods (post-training pruning and early pruning) across different levels of sparsity. Moreover, we evaluate EPSD-It against EPSD, emphasizing its improved confidence estimation capabilities.

• In Sec. F, we present additional results of EPSD on full ImageNet, showcasing the robustness and generalization of our approach on challenging large-scale datasets.

• In Sec. G, we provide the implementation details and more results on EPSD equipped with three SD methods (Yun et al. 2020; Kim et al. 2021; Shen et al. 2022) as discussed in the main manuscript.

• Sec. H presents the ablation study of EPSD. We examine the efficacy of EPSD by varying the optimization objectives (standard cross-entropy loss vs. SD loss) in the two steps. Moreover, we investigate the impact of employing another early pruning method, SNIP (Lee, Ajanthan, and Torr 2019), within EPSD.

We aim for readers to develop a more comprehensive understanding of the proposed method through the provision of additional details and experimental insights.

Algorithm 1 Model Compression Procedure of EPSD

1: Input: $k,j\leftarrow 0$; random initialized network $\theta_{init}$; pruning mask $m$, target sparsity $S$; image batches $x\in\mathcal{X}$; learning rate $\alpha$; update steps $i$ during pruning; total training epochs $e$ for SD.
2: Output: Sparse, converged network $\hat{\theta}_{cmp}$
3: Initial $\theta_{0}\leftarrow m\odot\theta_{init}$  $\triangleright$ Early Pruning with SD
4: while $k<i$
5:   Forward to get predictive distribution $\widetilde{P}({x_{k};\theta_{k}})$
6:   Construct soften targets $\widetilde{P}({\overline{x}_{k}};\overline{\theta}_{k})$ by (Yun et al. 2020; Kim et al. 2021; Shen et al. 2022).
7:   Compute SD loss, $\widetilde{\Delta}$ by Eq.(6) and Eq.(7).
8:   if $k=i$ then
9:     Compute saliency scores $\widetilde{s}$ by Eq.(8).
10:     Sort and get pruning mask $m$ from $\widetilde{s}$ and $S$
11:     Apply $m$: $\hat{\theta}_{0}\leftarrow\theta_{init}\odot m$; break.
12:   end if
13:   Update: $\theta_{k+1}\leftarrow\theta_{k}-\alpha\cdot\widetilde{\Delta}$; $k\leftarrow k+1$
14: end while
15: while $j<e$  $\triangleright$ Self-Distillation
16:   Forward on full training set $\mathcal{X}$
17:   Compute $\widetilde{\Delta}$ by Eq.(4), (6) and (7).
18:   Update $\hat{\theta}_{j+1}\leftarrow\hat{\theta}_{j}-\alpha\cdot\widetilde{\Delta}$; $j\leftarrow j+1$
19: end while
20: Output compressed model $\hat{\theta}_{cmp}\leftarrow\hat{\theta}_{e}$
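The mask-construction steps of Algorithm 1 (sorting the saliency scores, keeping the top fraction, and applying the mask to $\theta_{init}$) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names `mask_from_saliency` and `apply_mask`, the convention that larger saliency means "keep", the tie-breaking rule, and the toy values are all hypothetical. In EPSD, the scores themselves would come from Eq.(8).

```python
def mask_from_saliency(scores, sparsity):
    """Return a 0/1 mask that keeps the highest-scoring (1 - sparsity) fraction.

    Assumption: larger saliency means "keep"; ties are broken by index order.
    """
    if not 0.0 <= sparsity < 1.0:
        raise ValueError("sparsity must be in [0, 1)")
    n_keep = max(1, round(len(scores) * (1.0 - sparsity)))
    # Indices sorted by descending saliency (Python's sort is stable).
    order = sorted(range(len(scores)), key=lambda idx: scores[idx], reverse=True)
    keep = set(order[:n_keep])
    return [1 if idx in keep else 0 for idx in range(len(scores))]


def apply_mask(theta_init, mask):
    """Element-wise product of the initial weights and the mask."""
    return [w * m for w, m in zip(theta_init, mask)]


# Toy values (hypothetical): five weights pruned to 60% sparsity keeps two.
theta_init = [0.5, -1.2, 0.3, 0.8, -0.1]
saliency = [0.02, 0.90, 0.10, 0.45, 0.01]
mask = mask_from_saliency(saliency, sparsity=0.6)
pruned = apply_mask(theta_init, mask)
```

Note that the mask is applied to the initial weights rather than to the weights updated during the pruning steps, which matches the assignment $\hat{\theta}_{0}\leftarrow\theta_{init}\odot m$ in the algorithm.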

Appendix A Datasets and Networks

We mainly employ three multi-class classification benchmark datasets for comprehensive classification performance evaluations. CIFAR-10/CIFAR-100 (Krizhevsky, Hinton et al. 2009) contain 60,000 RGB natural images of 32 $\times$ 32 pixels from 10/100 classes. Each class includes 5,000/500 training samples and 1,000/100 testing samples. We follow the widely used pre-processing from previous works (He et al. 2016; Zagoruyko and Komodakis 2016). Tiny-ImageNet is a subset of ILSVRC-2012, made up of 200 classes. Each class includes 500 training and 50 testing samples, scaled to 64 $\times$ 64. All training images are randomly cropped and resized to 32 $\times$ 32 after normalization. The test images are only normalized. The ImageNet (Deng et al. 2009) classification dataset comprises 1000 classes. Each class is depicted by thousands of images, which we resize to 256 $\times$ 256 pixel RGB images. The accuracy on ImageNet is computed on the validation set. We use PyTorch (Paszke et al. 2019) version 1.11.0 with Python 3.8.

CIFAR-10/100, Tiny-ImageNet: The network architectures in the main manuscript are ResNet-18, ResNet-32, ResNet-110 (He et al. 2016), VGG-16 and VGG-19 (Simonyan and Zisserman 2014). Following (Yun et al. 2020; Alizadeh et al. 2022), we use a ResNet-18 variant for 32 $\times$ 32 images whose first convolutional layer has kernel size 3 $\times$ 3, stride 1 and padding 1, instead of kernel size 7 $\times$ 7, stride 2 and padding 3. During pruning, we use data from 3 batches, each with 128 samples. This can be done in seconds on a single GPU. For a fair comparison, we use the same training hyper-parameters throughout: all CNNs are trained using SGD with a momentum of 0.9, and the learning rate is decayed by a factor of 10. For CIFAR-10, CIFAR-100, and Tiny-ImageNet, we augment training data by applying random cropping (32 $\times$ 32, padding 4) and horizontal flipping, following the previous settings (Yun et al. 2020; Kim et al. 2021; Alizadeh et al. 2022; Shen et al. 2022).
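The effect of the modified first convolutional layer can be checked with the standard output-size formula for a convolution. The sketch below is illustrative only: the helper name `conv_output_size` is hypothetical, and it assumes dilation 1 and the usual floor rounding.

```python
def conv_output_size(in_size, kernel, stride, padding):
    """Spatial output size of a convolution (floor formula, dilation 1 assumed)."""
    return (in_size + 2 * padding - kernel) // stride + 1


# Stem used for 32 x 32 inputs: kernel 3, stride 1, padding 1.
small_image_stem = conv_output_size(32, kernel=3, stride=1, padding=1)

# Original stem: kernel 7, stride 2, padding 3.
original_stem = conv_output_size(32, kernel=7, stride=2, padding=3)
```

On a 32 $\times$ 32 input, the 3 $\times$ 3 stem keeps the full 32 $\times$ 32 resolution, whereas the original 7 $\times$ 7 stem would already halve it to 16 $\times$ 16 before any residual block.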

ImageNet: To compare broadly with previous compression methods, we employ ResNet-50 (He et al. 2016) as the backbone network and report results obtained by EPSD at 90% sparsity. According to (Alizadeh et al. 2022; Wang, Zhang, and Grosse 2020), the batches must have enough samples from all classes in the dataset. Wang, Zhang, and Grosse (2020) recommend using class-balanced batches sized ten times the number of classes. Because ImageNet is larger and more complex, we require data across several pruning iterations to maintain useful distillable weights for SD. We use 512 batches with 128 samples each, following ProsPr’s approximation for gradient computation. This takes only a few minutes on a single GPU, showing high efficiency. After pruning, the model is trained for 100 epochs, starting with a learning rate of 0.1, which decreases by a factor of 10 at the 30th, 60th, and 90th epochs. We resize each image to 256 $\times$ 256 and then perform a random crop to obtain a 224 $\times$ 224 input, augmented with horizontal flipping, color jitter, and lighting. The weight decay is set to 0.0001 and the batch size is 256.
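The ImageNet learning-rate schedule described above is a plain step decay, which can be written as a small function. This is a sketch under assumptions: the function name `step_lr` is hypothetical, and treating epochs as 0-indexed (so the first decay applies from epoch index 30 onward) is our reading rather than something the text specifies.

```python
def step_lr(epoch, base_lr=0.1, milestones=(30, 60, 90), gamma=0.1):
    """Learning rate at a given epoch under step decay.

    Assumption: epochs are 0-indexed and a decay takes effect at each milestone.
    """
    passed = sum(1 for milestone in milestones if epoch >= milestone)
    return base_lr * gamma ** passed


# Learning rate for each of the 100 training epochs.
schedule = [step_lr(epoch) for epoch in range(100)]

# Number of samples seen during pruning: 512 batches of 128 samples each.
pruning_samples = 512 * 128
```

With these settings the pruning stage sees 65,536 samples, a small fraction of the data processed during the 100-epoch training stage.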

Appendix B Experimental Setup for Empirical Studies

In this section, we thoroughly explain the empirical studies and experimental setups in the main manuscript. The content is arranged in the order of the main manuscript.

Trainability Analysis, Fig. 4 in Sec. 3.2. For the visualization of the loss surface, we employed an open-source visualization tool (https://github.com/tomgoldstein/loss-landscape) to illustrate the loss surfaces of pruned models obtained through EPSD and the ‘Simple Combination’, respectively. These visualizations reflect the varying degrees of optimization difficulty for the pruned models. For the ‘Mean-JSV’ curves in Fig. 4, we followed the previous works (Wang et al. 2021; Wang and Fu 2023) to record and compute the Jacobian singular values during training. In both cases, the main manuscript showcases the results achieved using ResNet-18 on the CIFAR-100 dataset.

Method	Type	Dataset	Model	P.T.(ref.) epochs	P.T.(teacher) epochs	Fine-tune epochs	Distill. epochs	Total epochs
EPSD	PR+SD	CIFAR-10/100	ResNet32/110,VGG16/19	0	0	0	200	200
DMC	PR	CIFAR-10	VGG16	300	0	150	0	450
ReKD	KD	CIFAR-100	ResNet110 $\rightarrow$ ResNet32	0	240	0	240	480
PKD	PR+KD	CIFAR-100	VGG19	0	200	910	200	1310
CPKD	PR+KD	CIFAR-10	ResNet110	0	0	1400	85	1485

Table A1: Training epochs used by various compression techniques in Fig. 5. Note that the models and datasets used in EPSD are comparable to those in all other methods.

Training Efforts, Fig. 5 in Sec. 3.3. Table A1 lists the number of training epochs used by various compression methods, as reported in the original papers. For example, CPKD (Aghli and Ribeiro 2021) uses ResNet-110 and ResNet-164 for image classification on CIFAR-10. It begins with 6 pruning iterations for the teacher network, fine-tuning for 200 epochs after each iteration to regain the initial accuracy. The student network is then trained for 85 epochs using KD, followed by an additional 200-epoch fine-tuning phase. In total, 1485 epochs (6 $\times$ 200 + 85 + 200) are employed for comprehensive model compression. In terms of training time, we replicate PKD and ReKD using the authors’ open-source projects (https://github.com/dvlab-research/ReviewKD, https://github.com/ososos888/prune-then-distill) in an identical hardware environment (an NVIDIA A40 GPU).
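The epoch totals in Table A1 follow directly from summing the per-stage counts. The short script below recomputes them as a sanity check; the per-stage numbers are copied from Table A1, while the dictionary layout, key names, and the `relative_cost` ratio are only illustrative.

```python
# Training epochs per stage, copied from Table A1.
TABLE_A1 = {
    "EPSD": {"pt_ref": 0, "pt_teacher": 0, "fine_tune": 0, "distill": 200},
    "DMC": {"pt_ref": 300, "pt_teacher": 0, "fine_tune": 150, "distill": 0},
    "ReKD": {"pt_ref": 0, "pt_teacher": 240, "fine_tune": 0, "distill": 240},
    "PKD": {"pt_ref": 0, "pt_teacher": 200, "fine_tune": 910, "distill": 200},
    "CPKD": {"pt_ref": 0, "pt_teacher": 0, "fine_tune": 1400, "distill": 85},
}

# Total epochs per method.
totals = {method: sum(stages.values()) for method, stages in TABLE_A1.items()}

# CPKD fine-tuning: 6 pruning iterations x 200 epochs, plus a final 200 epochs.
cpkd_fine_tune = 6 * 200 + 200

# Epoch budget of each method relative to EPSD.
relative_cost = {method: total / totals["EPSD"] for method, total in totals.items()}
```

Measured purely in epochs, the other pipelines in the table use between roughly 2 and 7.5 times the budget of EPSD. Epoch counts are only a proxy for cost, since the models and datasets differ across rows.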

Results on ImageNet, Table 1 in Sec. 4.2. In Table 1 of the main manuscript, we report the classification results on full ImageNet with EPSD equipped with SD method DLB. For detailed configurations, please refer to Sec. A.

Structured Pruning, Table 2 in Sec. 4.2. We further assess EPSD within structured pruning, wherein entire channels are masked. We modified the shape of the pruning mask $m$ to encompass one entry per channel (or column of the weight matrix) as in ProsPr. In Table 2 of the main manuscript, we report the classification results of structured pruning with EPSD equipped with SD method PS-KD.
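The difference between an unstructured mask and the per-channel mask described above can be illustrated on a small weight matrix. This is a toy sketch, not the ProsPr or EPSD code: the helper name `apply_channel_mask`, the list-of-rows representation, and the example values are assumptions made for illustration.

```python
def apply_channel_mask(weight, channel_mask):
    """Zero out whole columns of a 2-D weight matrix given as a list of rows.

    The mask has one entry per column (channel), not one per weight.
    """
    if any(len(row) != len(channel_mask) for row in weight):
        raise ValueError("mask length must match the number of columns")
    return [[w * m for w, m in zip(row, channel_mask)] for row in weight]


# Toy 2 x 3 weight matrix (hypothetical values) with the middle channel pruned.
weight = [[1.0, 2.0, 3.0],
          [4.0, 5.0, 6.0]]
channel_mask = [1, 0, 1]
pruned_weight = apply_channel_mask(weight, channel_mask)

# An unstructured mask would need one entry per weight instead.
unstructured_entries = sum(len(row) for row in weight)
structured_entries = len(channel_mask)
```

Because an entire column is removed at once, the pruned channel can be dropped from the layer, which is what makes structured sparsity easier to exploit on standard hardware.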

Comparison with SD Methods, Table 3 in Sec. 4.3. We report the results of EPSD equipped with SD method PS-KD for the experiments shown in Table 3. For detailed configurations, please refer to Sec. G.

Impact of SD-based Pre-training, Table 4 in Sec. 4.4. Following the configurations in Table 1, we further assessed the SC-2 in our experiments. Specifically, we first train a randomly initialized ResNet-50 on ImageNet for 100 epochs and then prune it to obtain a pruned model. Finally, we retrain the pruned model for 100 epochs using the SD method DLB. The difference between ‘SC-2’ and ‘P.T.+EPSD’ is the loss function (standard cross-entropy or SD loss) used to obtain the saliency scores during pruning.

Weakly Supervised Object Localization. Weakly supervised object localization (WSOL) is a challenging computer vision task that involves locating objects in images using only image-level labels. Following previous work (Pan et al. 2021), we performed experiments using VGG-16 as the backbone network on the CUB-200-2011 (Wah et al. 2011) dataset. We replace the original cross-entropy loss function with the SD loss proposed in DLB and keep other optimization objectives unchanged. We also report the reproduced results with an unpruned backbone network and the results obtained using ProsPr. The pruning parameters are consistent with our settings on CIFAR-10/CIFAR-100; detailed parameters can be found in Table A9. We report the top-1 classification error, the top-1 localization error, and the gt-known localization error (which considers localization only, regardless of classification) on CUB-200-2011 in Table 5. We utilize the open-source code (https://github.com/Panxjia/SPA_CVPR2021) for the experiment.

Semantic Segmentation. Semantic segmentation is a vital computer vision task that categorizes image pixels into different classes based on their semantic meaning. We prune a pre-trained FCN32s (Long, Shelhamer, and Darrell 2015), which employs VGG-16, and then train the pruned network for 50 epochs. We also report the reproduced results with an unpruned model and the results using ProsPr. The detailed settings can be found in Table A9. We report the mean intersection-over-union (Mean IoU) and the pixel accuracy (pixAcc) on Pascal VOC 2012 (Everingham et al. 2015) in Table 6. We utilize the open-source code (https://github.com/Tramac/awesome-semantic-segmentation-pytorch) for the experiment.
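For readers unfamiliar with the two segmentation metrics, the sketch below computes pixel accuracy and mean IoU for a single flattened prediction. It is a simplified illustration: the function name `segmentation_metrics` and the toy labels are hypothetical, and benchmark implementations typically accumulate statistics over the whole dataset and handle ignore labels, which this sketch does not.

```python
def segmentation_metrics(pred, target, num_classes):
    """Return (pixel accuracy, mean IoU) for flat lists of class indices.

    Assumption: classes absent from both prediction and target are skipped
    when averaging the per-class IoU.
    """
    if len(pred) != len(target) or not target:
        raise ValueError("pred and target must be non-empty and equally long")
    pairs = list(zip(pred, target))
    pix_acc = sum(p == t for p, t in pairs) / len(pairs)
    ious = []
    for c in range(num_classes):
        inter = sum(p == c and t == c for p, t in pairs)
        union = sum(p == c or t == c for p, t in pairs)
        if union > 0:
            ious.append(inter / union)
    return pix_acc, sum(ious) / len(ious)


# Toy example (hypothetical labels): six pixels, three classes.
target = [0, 0, 1, 1, 2, 2]
pred = [0, 1, 1, 1, 2, 0]
pix_acc, mean_iou = segmentation_metrics(pred, target, num_classes=3)
```

In the toy example, four of six pixels are correct, and the per-class IoUs are 1/3, 2/3, and 1/2. Mean IoU penalizes errors on small classes more heavily than pixel accuracy does, which is why the two metrics are usually reported together.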

Appendix C Additional Experiments

Theoretical Training/Test FLOPs of EPSD

Model	Dense	Sparsity	SNIP	GraSP	FORCE	DOP	ProsPr	EPSD +PS-KD	EPSD +CS-KD	EPSD +DLB
VGG19	1 $\times$ (15.1T)	90%	0.3 $\times$	0.3 $\times$	0.3 $\times$	-	0.3 $\times$	0.3 $\times$	0.4 $\times$	0.4 $\times$
VGG19	1 $\times$ (15.1T)	95%	0.15 $\times$	0.15 $\times$	0.15 $\times$	-	0.15 $\times$	0.15 $\times$	0.2 $\times$	0.2 $\times$
ResNet50	1 $\times$ (3.2T)	90%	0.3 $\times$	0.3 $\times$	0.3 $\times$	0.3 $\times$	0.3 $\times$	0.3 $\times$	0.4 $\times$	0.4 $\times$
ResNet50	1 $\times$ (3.2T)	95%	0.15 $\times$	0.15 $\times$	0.15 $\times$	0.15 $\times$	0.15 $\times$	0.15 $\times$	0.2 $\times$	0.2 $\times$

Table A2: Theoretical training FLOPs of EPSD and other baselines in Table 1.

The baseline methods in Table 1 of the main paper require a fine-tuning process, whereas EPSD does not (EPSD directly uses SD training); thus, the training cost of EPSD does not increase significantly. We report theoretical FLOPs following RigL (Evci et al. 2020), which counts FLOPs based on the forward/backward passes in training. For all baselines in Table 1, the total training FLOPs are $3*f_{s}$, where $f_{s}$ is the forward-pass FLOPs for a given sparse NN. For EPSD, different SD strategies yield different training FLOPs ($3*f_{s}$, $4*f_{s}$, $4*f_{s}$ for PS-KD, CS-KD, and DLB, resp.), while their test FLOPs (without the backward pass) remain the same ($1*f_{s}$). EPSD equipped with PS-KD has training FLOPs similar to those of the pruning methods.
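The entries of Table A2 can be reproduced from the pass counts given above. The sketch below rests on two assumptions that are inferred from the table values rather than stated in the text: that $f_{s}$ scales linearly with the density (1 − sparsity), and that each entry is expressed relative to one dense forward pass. The dictionary and function names are hypothetical.

```python
# Multiples of f_s per training step, as stated in the text:
# baselines and PS-KD use 3 * f_s; CS-KD and DLB use 4 * f_s.
TRAINING_PASSES = {
    "baseline": 3,
    "EPSD+PS-KD": 3,
    "EPSD+CS-KD": 4,
    "EPSD+DLB": 4,
}


def relative_training_flops(method, sparsity):
    """Training FLOPs relative to one dense forward pass.

    Assumption: f_s = (1 - sparsity) * f_dense, i.e. uniform scaling with density.
    """
    density = 1.0 - sparsity
    return round(TRAINING_PASSES[method] * density, 2)


table_a2 = {
    (method, sparsity): relative_training_flops(method, sparsity)
    for method in TRAINING_PASSES
    for sparsity in (0.90, 0.95)
}
```

Under these assumptions the extra forward pass used by CS-KD and DLB accounts for the difference between the 0.3 $\times$ and 0.4 $\times$ columns at 90% sparsity.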

Distill Sparse NN with KD

Method	Teacher Model	Student Model	Sparsity	Acc. after KD/SD
SNIP+KD	VGG19 Acc.=72.92%	random initialized	95%	71.65%
SNIP+KD		pre-trained	95%	71.67%
ProsPr+KD		random initialized	95%	55.04%
ProsPr+KD		pre-trained	95%	72.75%
EPSD	-	random initialized	95%	73.81%

Table A3: The simple combination of early pruning and traditional KD.

We also studied model performance when simply combining different early pruning methods (SNIP and ProsPr) with traditional KD. We first pre-train a teacher model, then prune it, and finally use KD to distill the teacher model into student models with different initializations. More specifically, we evaluate two initializations for the student model (VGG19) in KD: 1) random initialization, and 2) initialization from pre-training. The results on CIFAR-100 indicate that 1) the pruned student model benefits from pre-training; 2) compared to SNIP, a randomly initialized model pruned by ProsPr struggles in standard KD (71.65% vs. 55.04%); and 3) without pre-training for teacher and student models, EPSD achieves better results.

DST with SD

Model	Dataset	Dense	Sparsity	RigL	RigL+EPSD
WideResNet-22-2	CIFAR-10	94.36%	90%	92.16%	92.49%
WideResNet-22-2	CIFAR-10	94.36%	95%	90.14%	90.32%

Table A4: Performance comparison of RigL and RigL+EPSD under different sparsity ratios.

We also investigated the compatibility of DST with SD. Specifically, we applied the idea of EPSD to RigL by integrating DLB into the DST process. We followed the default settings from https://github.com/nollied/rigl-torch for RigL while keeping other settings the same as EPSD. Benefiting from dynamic topology during training, RigL achieves acceptable performance compared to the dense model at high sparsity (95%), and integrating the idea of EPSD into RigL further improves its performance.

Appendix D Comparison of Early Pruning Methods on CIFAR-10/100

We compare EPSD with the early pruning methods SNIP (Lee, Ajanthan, and Torr 2019), GraSP (Wang, Zhang, and Grosse 2020), FORCE (de Jorge et al. 2021), ToST (Jaiswal et al. 2022) and ProsPr (Alizadeh et al. 2022), and the dynamic sparse training methods SET (Mocanu et al. 2018), Deep-R (Bellec et al. 2018) and DSR (Mostafa and Wang 2019) on the CIFAR-10/100 datasets under sparsity ratios of 90%, 95%, and 98%. The accuracy comparisons are shown in Fig. A1(a) and Fig. A1(b), respectively. The results in Fig. A1 illustrate that EPSD achieves strong performance at almost all sparsity ratios compared to the advanced early pruning methods. At a 90% sparsity ratio, EPSD even outperforms the unpruned baselines (94.19% vs. 93.46% with ResNet-32 on CIFAR-10 and 74.79% vs. 74.61% with VGG-19 on CIFAR-100), showing that its accuracy can surpass the unpruned baselines at high sparsity ratios. Compared with ProsPr, which is most related to our method, EPSD consistently improves the accuracy under various settings. Overall, EPSD achieves significant performance gains from SD compared to advanced pruning methods.

Method	Model	U.P. Acc. (%)	Acc. (%) at 36% sparsity	Acc. (%) at 59% sparsity
EPSD	MobileNet-v2	56.45	58.64	48.13
Sim.Cmb.	MobileNet-v2	56.45	56.32	45.70
EPSD	MobileViT	65.56	65.17	64.61
Sim.Cmb.	MobileViT	65.56	65.09	63.16

Table A5: Test accuracy (%) of EPSD and the ‘Simple Combination’ with MobileNet-v2 and MobileViT on CIFAR-100, respectively. ‘U.P.’ represents the unpruned baseline.

Additionally, we present the results of EPSD with recently prominent lightweight models (MobileNet-v2 and MobileViT) across varying sparsity in Table A5. For MobileNet-v2 (https://github.com/tonylins/pytorch-mobilenet-v2), we perform both pruning and training from scratch, using the same training settings as outlined in Table A8. For MobileViT, we use the model from timm (Wightman 2019). The results show that with these newer lightweight models, EPSD maintains higher classification accuracy than the ‘Simple Combination’ approach at both sparsity ratios. For instance, with MobileNet-v2, EPSD leads the ‘Simple Combination’ by an average of 2.38% across the two sparsity ratios. Notably, EPSD achieves even better results than the unpruned baseline (58.64% vs. 56.45%) at a sparsity of 36%.

Appendix E Iterative Version of EPSD

EPSD can also perform pruning-SD cycles iteratively. Given a target sparsity, each cycle prunes a portion of the weights and then uses SD to restore network performance; the process repeats until the target sparsity is reached. In the experiment, we use PS-KD as a case study to explore the performance of the iterative variant of EPSD.

Comparison of EPSD-It with Advanced Early Pruning Methods.

Recent work (Frankle et al. 2021) assesses the efficacy of early pruning methods under iterative sparsity ratios. They introduce a benchmark that retrains the network with the complete learning rate schedule after every pruning step. In contrast, EPSD-It recovers accuracy after each pruning iteration with far fewer epochs (30 in our experiments) rather than the entire training schedule, and conducts a full training cycle only after the final pruning step. We evaluate EPSD-It using ResNet-20 and VGG-16 on CIFAR-10 and compare it in Fig. A2 with pruning-after-training (PaT) methods (lottery ticket hypothesis after training (Renda, Frankle, and Carbin 2020), magnitude after training (Frankle et al. 2020)), early pruning methods (SNIP (Lee, Ajanthan, and Torr 2019), GraSP (Wang, Zhang, and Grosse 2020), SynFlow (Tanaka et al. 2020), ProsPr (Alizadeh et al. 2022)), and random pruning. EPSD-It outperforms the other pruning methods, including PaT and early pruning methods, which are trained with a full training schedule after pruning. It is worth mentioning that previous early pruning methods have not yet surpassed PaT pruning methods in most settings. ProsPr is the first attempt to bridge the gap with PaT methods, while EPSD-It surpasses PaT methods at most sparsity ratios, showing a significant improvement.

EPSD vs. EPSD-It

To investigate the performance gain from iteratively performing EPSD, we compare ‘one-shot EPSD’ (EPSD) with the ‘iterative EPSD’ (EPSD-It). Two metrics are adopted to investigate the performance: expected calibration error (ECE) (Naeini, Cooper, and Hauskrecht 2015) and the area under the risk-coverage curve (AURC) (Geifman, Uziel, and El-Yaniv 2018), to evaluate the quality of predictive probabilities in terms of confidence estimation, following (Kim et al. 2021; Yun et al. 2020). ECE determines whether predictions are well-calibrated, approximating the difference in expectation between classification accuracy and confidence estimates. AURC measures the area under the curve from plotting the risk (i.e., error rate) according to coverage and lower AURC implies that correct and incorrect predictions can be well-separable based on confidence estimates. The maximum class probability is used as a confidence estimator.
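As a minimal sketch of the two metrics described above, the snippet below computes ECE with equal-width confidence bins and AURC as the mean risk over all coverage levels. The function names, the choice of 10 bins, and the toy data are illustrative assumptions, not the evaluation code used for the reported results.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bin-size-weighted mean of |accuracy - mean confidence| over equal-width bins."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += len(members) / n * abs(accuracy - avg_conf)
    return ece


def aurc(confidences, correct):
    """AURC: mean error rate (risk) over all coverage levels, most confident samples first."""
    order = sorted(range(len(confidences)), key=lambda i: -confidences[i])
    errors = 0
    risks = []
    for k, i in enumerate(order, start=1):
        errors += 0 if correct[i] else 1
        risks.append(errors / k)  # risk when covering the k most confident samples
    return sum(risks) / len(risks)


# Toy data: maximum class probability and whether each prediction was right.
confidences = [0.95, 0.92, 0.81, 0.63, 0.55]
correct = [True, True, False, True, False]
print(expected_calibration_error(confidences, correct))
print(aurc(confidences, correct))
```

Lower values are better for both metrics: a low ECE means confidence tracks accuracy, and a low AURC means errors are concentrated among low-confidence predictions.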

We investigate the performance of EPSD and EPSD-It by evaluating various metrics. We experiment with ResNet-18 on CIFAR-10, and the sparsity ratio is set to 80% for both EPSD and EPSD-It. EPSD-It iteratively prunes 20% of the remaining weights seven times to reach the target sparsity. After each of the first six prunings, performance is recovered with a fixed 10 or 30 training epochs (corresponding to the two rows of Fig. A3), and after the final pruning the network is trained for 300 epochs to fully regain performance. Under this setting, EPSD-It trains for a total of 360 or 480 epochs, and we keep the learning rate decay points consistent in both methods. We observed that EPSD-It gains the benefits of iterative pruning and achieves better results than EPSD during training: 1) EPSD-It converged faster in loss and achieved lower training errors than EPSD. 2) Regarding ECE and AURC, EPSD-It eventually converged to a smaller value than EPSD, indicating that EPSD-It has an advantage in confidence estimation.
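The schedule arithmetic in this setup can be checked with a short sketch. The helper names below are hypothetical; the numbers (20% per round, seven rounds, 10 or 30 recovery epochs, 300 final epochs) are taken from the paragraph above.

```python
def remaining_fraction(prune_rate, rounds):
    """Fraction of weights left after pruning `prune_rate` of the survivors `rounds` times."""
    return (1.0 - prune_rate) ** rounds


def total_epochs(rounds, recovery_epochs, final_epochs):
    """Short recovery after every pruning round except the last, then one full run."""
    return (rounds - 1) * recovery_epochs + final_epochs


sparsity = 1.0 - remaining_fraction(prune_rate=0.2, rounds=7)
print(f"sparsity after 7 rounds: {sparsity:.1%}")
for recovery in (10, 30):
    print(recovery, total_epochs(rounds=7, recovery_epochs=recovery, final_epochs=300))
```

Seven rounds leave 0.8^7 of the weights, i.e. roughly 21%, which matches the 80% target up to rounding, and the two recovery budgets give the 360 and 480 total epochs quoted in the text.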

Appendix F More Results on Full ImageNet

Acc. (%)	Pruning Only	EPSD$_{cskd}$	EPSD$_{pskd}$	EPSD$_{dlb}$
Top-1	65.9	64.8	66.2	66.3
Top-5	86.9	86.6	87.2	87.3

Table A6: Performance of EPSD equipped with different SD methods (CS-KD, PS-KD, and DLB) on the ImageNet validation set. We utilize ResNet-50 as the baseline network and report the results with a $90\%$ sparsity ratio.

Besides the success of EPSD on CIFAR-10/100 and Tiny-ImageNet, we further extend the experiments to the large-scale ImageNet dataset (Deng et al. 2009). Table A6 compares EPSD equipped with three different SD methods at 90% sparsity. For a fair comparison, the hyperparameters are kept consistent with previous experiments (e.g., $\tau$ and $\alpha$; see Sec. A for details). Since DLB (Shen et al. 2022) does not report results on ImageNet, we evaluate DLB-equipped EPSD under the same basic settings as the other two SD methods and report the Top-1 and Top-5 test accuracy. Therefore, the results reported on ImageNet do not imply that this is the best performance achievable with DLB-equipped EPSD. Table A6 shows that EPSD with DLB (EPSD$_{dlb}$) achieves the best classification accuracy, while EPSD with PS-KD (EPSD$_{pskd}$) is about 0.07% lower in Top-1 accuracy and EPSD with CS-KD (EPSD$_{cskd}$) achieves 64.75% Top-1 accuracy. The results show that the choice of SD method affects the classification accuracy, while EPSD achieves comparable accuracy on ImageNet when equipped with PS-KD or DLB. Furthermore, when equipped with PS-KD or DLB, EPSD outperforms the pure pruning method (‘Pruning Only’) under the same training settings, suggesting that EPSD can also improve accuracy on the large-scale dataset.

Pruning	CIFAR-10	CIFAR-100	Tiny-ImageNet
Pruner	ProsPr	ProsPr	ProsPr
Optimizer	nesterov SGD (0.9)	nesterov SGD (0.9)	nesterov SGD (0.9)
Iteration steps	3	3	3
New batch for iteration	✓	✓	✓
Batch size (Iteration)	128	128	128
Learning rate (Iteration)	0.1	0.1	0.1
$\lambda_{cls}$	1	1	1
Temperature $\tau$	4	4	4
Self-Distillation	CIFAR-10	CIFAR-100	Tiny-ImageNet
SD epochs	200	200	200
SD batch size	128	128	128
SD learning rate	0.1	0.1	0.1
LR drop schedule	[100, 150]	[100, 150]	[100, 150]
Drop factor	0.1	0.1	0.1
Weight decay	0.0001	0.0001	0.0001
$\lambda_{cls}$	1	1	1
Temperature $\tau$	4	4	4

Table A7: Hyperparameters of EPSD equipped with CS-KD.

Pruning	CIFAR-10	CIFAR-100	Tiny-ImageNet
Pruner	ProsPr	ProsPr	ProsPr
Optimizer	nesterov SGD (0.9)	nesterov SGD (0.9)	nesterov SGD (0.9)
Iteration steps	3	3	3
New batch for iteration	✗	✗	✗
Batch size (iteration)	128	512	512
Learning rate (iteration)	0.1	0.1	0.1
$\alpha$ (fixed)	[0.1,0.2,0.3]	[0.1,0.2,0.3]	[0.1,0.2,0.3]
Self-Distillation	CIFAR-10	CIFAR-100	Tiny-ImageNet
SD epochs	300	300	200
SD batch size	128	128	128
SD learning rate	0.1	0.1	0.1
LR drop schedule	[150, 225]	[150, 225]	[100, 150]
Drop factor	0.1	0.1	0.1
Weight decay	0.0005	0.0005	0.0001
$\alpha$ (linear growth)	0.8	0.8	0.8

Table A8: Hyperparameters of EPSD equipped with PS-KD.

Pruning	CIFAR-10	CIFAR-100	Tiny-ImageNet
Pruner	ProsPr	ProsPr	ProsPr
Optimizer	nesterov SGD (0.9)	nesterov SGD (0.9)	nesterov SGD (0.9)
Iteration steps	3	3	3
New batch for iteration	✓	✓	✓
Batch size (Iteration)	128	128	128
Learning rate (Iteration)	0.1	0.1	0.1
$\lambda_{cls}$	1	1	1
Temperature $\tau$	3	3	3
Self-Distillation	CIFAR-10	CIFAR-100	Tiny-ImageNet
SD epochs	240	240	200
SD batch size	64	64	128
SD learning rate	0.05	0.05	0.2
LR drop schedule	[150, 180, 210]	[150, 180, 210]	[100, 150]
Drop factor	0.1	0.1	0.1
Weight decay	0.0005	0.0005	0.0001
$\lambda_{cls}$	1	1	1
Temperature $\tau$	3	3	3

Table A9: Hyperparameters of EPSD equipped with DLB.

Appendix G EPSD equipped with Various SD Methods

In this section, we provide the detailed experimental setup and more results of EPSD equipped with various SD methods. Specifically, we describe the implementation details for EPSD equipped with three SD methods: CS-KD (Yun et al. 2020), PS-KD (Kim et al. 2021), and DLB (Shen et al. 2022). We also provide more classification results on various datasets, including CIFAR-10, CIFAR-100, and Tiny-ImageNet.

EPSD equipped with CS-KD

Implementations. In the main manuscript, we have the following definition for the abstract SD loss function:

$$\mathcal{L}_{SD}=\frac{1}{n}\sum_{i=1}^{n}\tau^{2}\cdot D_{KL}\left(\widetilde{P}\left(\bar{x}_{i};\bar{\theta}_{s}\right)\|\widetilde{P}\left(x_{i};\theta_{s}\right)\right), \tag{1}$$

where $\widetilde{P}({\bar{x}_{i};\bar{\theta}_{s}})$ represents the soft targets produced by the student networks in SD. For CS-KD (Yun et al. 2020), Yun et al. propose a class-wise regularization that enforces consistent predictive distributions in the same class. The total training loss of CS-KD is defined as:

$$\mathcal{L}_{\mathrm{CS}-\mathrm{KD}}=\mathcal{L}_{\mathrm{CE}}(x_{i};\theta_{s})+\lambda_{cls}\cdot\tau^{2}\cdot\mathcal{L}_{SD}\left(x_{i},\bar{x}_{i};\theta_{s},\bar{\theta}_{s};\tau\right), \tag{2}$$

where $\bar{x}_{i}$ represents another randomly sampled input that has the same class label as $x_{i}$, and $\bar{\theta}_{s}$ is a fixed copy of the parameters $\theta_{s}$. $\mathcal{L}_{\mathrm{CE}}$ is the cross-entropy loss and $\tau$ is the temperature coefficient. A higher temperature results in a more uniform distribution, leading to a regularization effect similar to label smoothing. In EPSD, we utilized Eq. (2) in both the pruning and SD phases. We followed the training and validation settings in CS-KD, and the detailed hyperparameters can be found in Table A7.
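To make the abstract SD loss of Eq. (1) concrete, the following is a minimal pure-Python sketch of the temperature-softened KL term averaged over a batch. The function names and the example logits are hypothetical, and the sketch covers only Eq. (1); the cross-entropy term and the $\lambda_{cls}$ weighting of Eq. (2) are omitted.

```python
import math


def softened(logits, tau):
    """Temperature-scaled softmax: a larger tau gives a flatter distribution."""
    scaled = [z / tau for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """D_KL(p || q) for two discrete distributions with q > 0 everywhere."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def sd_loss(target_logits, student_logits, tau):
    """Eq. (1): batch mean of tau^2 * KL(soft target || student prediction)."""
    per_sample = [
        tau ** 2 * kl_divergence(softened(t, tau), softened(s, tau))
        for t, s in zip(target_logits, student_logits)
    ]
    return sum(per_sample) / len(per_sample)


# Hypothetical logits: for CS-KD the targets would come from the fixed parameter
# copy evaluated on other samples of the same class.
target_logits = [[2.0, 0.5, -1.0], [0.1, 1.5, 0.3]]
student_logits = [[1.5, 0.7, -0.5], [0.0, 1.0, 0.8]]
print(sd_loss(target_logits, student_logits, tau=4.0))
```

The loss is zero when the student already matches the soft targets and grows as the two softened distributions diverge.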

More Results. We provided more results of EPSD combined with CS-KD on different network structures and datasets to supplement the results in the main manuscript, as shown in Fig. A4 (1st column). It can be found that the EPSD consistently outperformed the ‘Simple Combination’ in all settings. For instance, with ResNet-18 on CIFAR-100 at sparsity 90%, the accuracy of the ‘Simple Combination’ is 12.62% lower than the ‘Unpruned Baseline’ (62.67% vs. 75.29%), while EPSD achieved 74.90%, which is comparable to the ‘SD Only’ and the ‘Unpruned Baseline’.

EPSD equipped with PS-KD

Dataset	Baseline	$\alpha=0.1$	$0.2$	$0.3$	$0.4$
CIFAR-10	VGG-16	$\underline{93.97}$	-	-	-
CIFAR-10	VGG-16	$\mathbf{94.44}$	$94.18$	$94.15$	$94.32$
CIFAR-100	ResNet-18	$\underline{75.29}$	-	-	-
CIFAR-100	ResNet-18	$79.20$	$\mathbf{79.44}$	$78.96$	$79.14$

Table A10: Ablation study of the proportion of historical knowledge ($\alpha$) when pruning with PS-KD (Kim et al. 2021). We report the results of ResNet-18 on CIFAR-100 and VGG-16 on CIFAR-10, respectively, and the sparsity is set to $80\%$. The results of the unpruned baseline network are underlined and the best accuracy is in bold.

Implementations. In PS-KD (Kim et al. 2021), the network is trained with softened targets, which are computed as a linear combination of the hard labels (ground truth) and the predictions from the previous epoch; the mix is adjusted adaptively as training proceeds through a hyperparameter $\alpha$. Therefore, $\bar{x}_{i}$ in PS-KD is the same as $x_{i}$, but $\bar{\theta}_{s}$ refers to the weights of the student network in the last epoch. The loss function of PS-KD can be formulated as follows:

$$\mathcal{L}_{\mathrm{PS}-\mathrm{KD}}=\mathcal{L}_{SD}\left(x_{i};\theta_{s},\bar{\theta}_{s};\alpha\right), \tag{3}$$

where $\bar{\theta}_{s}$ represents the weights of the network in the last epoch, and $\alpha$ is an additional coefficient that controls the proportion of historical knowledge from the previous epoch. Different from the original settings in PS-KD, we changed the epoch-level distillation to iteration-level distillation for the pruning in EPSD. Namely, EPSD uses the samples $x_{i}$ of each iteration to generate predictions and distills these predictions as soft targets in the next iteration. This modification avoids traversing the entire training dataset during pruning, making our method more efficient in the pruning phase. Besides, we set $\alpha$ to a small fixed value when pruning, considering that the network usually does not have enough knowledge about the data in the early stages of training. The detailed hyperparameters can be found in Table A8.
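The soft-target construction described above can be sketched as follows. The mixing form $(1-\alpha)\,y+\alpha\,p_{prev}$ follows the "linear combination" wording, and the linear schedule $\alpha_{t}=\alpha_{T}\cdot t/T$ is an assumed reading of the "linear growth" entry in Table A8; all function names and example values are hypothetical.

```python
def alpha_at(step, total_steps, alpha_final):
    """Linear growth of alpha toward alpha_final (assumed form of the schedule)."""
    return alpha_final * step / total_steps


def soft_target(one_hot, past_prediction, alpha):
    """Mix the hard label with the previous prediction: (1 - alpha) * y + alpha * p_prev."""
    return [(1.0 - alpha) * y + alpha * p for y, p in zip(one_hot, past_prediction)]


one_hot = [0.0, 1.0, 0.0]
past_prediction = [0.2, 0.7, 0.1]  # hypothetical output from the previous iteration

# Pruning phase: a small fixed alpha (Table A8 lists 0.1, 0.2, 0.3).
for alpha in (0.1, 0.2, 0.3):
    print(alpha, soft_target(one_hot, past_prediction, alpha))

# Training phase: alpha grows linearly toward 0.8 over the SD epochs.
print(alpha_at(step=150, total_steps=300, alpha_final=0.8))
```

With a small $\alpha$ the target stays close to the hard label, which fits the observation that an untrained network has little reliable knowledge to distill during pruning.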

Impact of Hyperparameter $\alpha$. Historical information plays an important role when EPSD is equipped with PS-KD for generating soft targets, and its proportion is controlled by the hyperparameter $\alpha$. During training, we follow the original paper’s method of gradually increasing $\alpha$ with training epochs. For pruning, we opt for a constant $\alpha$ value for simplicity and explore this fixed value in PS-KD pruning experiments. Table A10 reports two scenarios with different $\alpha$ values: VGG-16 on CIFAR-10 and ResNet-18 on CIFAR-100. $\alpha$ varies between 0.1 and 0.4, and we report the top-1 accuracy. Sparsity remains at 80% across all setups. As shown in Table A10, EPSD achieves the highest accuracy with $\alpha=0.1$ using VGG-16 on CIFAR-10, and with $\alpha=0.2$ using ResNet-18 on CIFAR-100. When $\alpha$ changes from 0.1 to 0.4, the gap between the highest and lowest accuracy stays within 0.3% (94.44%-94.15%) and 0.5% (79.44%-78.96%) for the two settings, respectively. Meanwhile, EPSD (80% sparsity) still shows better results than the two unpruned baselines in all configurations. The above analysis shows that our method is not sensitive to the hyperparameter $\alpha$ of PS-KD.

More Results. We provided more results of EPSD equipped with PS-KD to verify the results in the main manuscript, as shown in Fig. A4 (second column). It can be found that EPSD consistently outperformed the ‘Simple Combination’ over all settings, and EPSD outperformed ‘Unpruned Baseline’ and ‘SD Only’ in most settings, indicating that pruning can boost the performance of SD. For instance, with ResNet-18 on CIFAR-100 at sparsity 90%, the accuracy of EPSD is 77.33%, which is 2.04% higher than the ‘Unpruned Baseline’. With ResNet-18 on Tiny-ImageNet at sparsity 36%, the accuracy of EPSD is 0.93% higher than the ‘SD Only’ (57.76% vs. 56.83%).

EPSD equipped with DLB

Implementations. For DLB (Shen et al. 2022), Shen et al. introduce an extra last-batch consistency regularization loss. Rather than storing all predictions at the last iteration as designed in PS-KD, DLB employs a data sampler that obtains the batches $\mathcal{B}_{t}$ and $\mathcal{B}_{t-1}$ of iterations $t$ and $t-1$ simultaneously at the $(t-1)$-th iteration. The predictions from $\mathcal{B}_{t}$ are smoothed by temperature $\tau$ and then stored for regularization in the $t$-th iteration. The overall loss function is formulated as:

$$\mathcal{L}_{\mathrm{DLB}}=\mathcal{L}_{\mathrm{CE}}(x_{i};\theta_{s})+\lambda_{cls}\cdot\mathcal{L}_{SD}\left(x_{i};\theta_{s},\bar{\theta}_{s};\tau\right), \tag{4}$$

where $\lambda_{cls}$ is the coefficient that balances the two loss terms. For DLB, the symbols $\bar{x}_{i}$ and $\bar{\theta}_{s}$ in Eq. (1) denote the identical input $x_{i}$ and the weights of the student network at the last iteration, respectively. DLB divides each batch into two halves: one aligns with the previous iteration, and the other with the next iteration. The first half is distilled using the softened targets generated on the fly in the previous iteration. Specific hyperparameters are provided in Table A9.
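One plausible reading of the half-batch scheme is sketched below: each mini-batch appears in two consecutive iterations, first as the new half and then as the carried-over half. The generator name and the index lists are hypothetical, and this is not the sampler from the DLB code base.

```python
def dlb_batches(mini_batches):
    """Yield (previous_half, new_half) pairs so each mini-batch is seen twice in a row."""
    previous = None
    for current in mini_batches:
        if previous is not None:
            yield previous, current
        previous = current


# Hypothetical sample indices grouped into mini-batches B_0 .. B_3.
mini_batches = [[0, 1], [2, 3], [4, 5], [6, 7]]
for step, (prev_half, new_half) in enumerate(dlb_batches(mini_batches), start=1):
    # prev_half: distilled against soft targets stored one iteration earlier.
    # new_half: its temperature-softened predictions are stored for the next iteration.
    print(step, prev_half, new_half)
```

Under this reading, only the predictions for one half-batch need to be kept between iterations, rather than predictions for the whole dataset.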

More Results. We provide more results of EPSD combined with DLB on different network structures and datasets to supplement the results in the main manuscript, as shown in Fig. A4 (3rd column). It can be found that EPSD kept comparable accuracies under high sparsity (e.g., 95%) while the accuracies of the ‘Simple Combination’ heavily decreased. For instance, with ResNet-18 on CIFAR-100 at sparsity 95%, EPSD achieved 72.72% accuracy while the accuracy of the ‘Simple Combination’ was only 46.86%.

Appendix H Ablation Study

Step1: Pruning w/		Step2: Training w/		Test Acc. ( $\%$ )
CE Loss	SD Loss	CE Loss	SD Loss	Test Acc. ( $\%$ )
None	None	✓	-	$78.88$
✓	-	✓	-	$76.80_{(\downarrow 2.08)}$
-	✓	✓	-	$77.85_{(\downarrow 1.03)}$
✓	-	-	✓	$79.30_{(\uparrow 0.42)}$
-	✓	-	✓	$\mathbf{79.85}_{(\uparrow 0.97)}$

Table A11: Ablation study for EPSD using ResNet-18 on CIFAR-100 dataset. The hyperparameters are kept consistent for pruning and training, and the numerical subscripts indicate the percentage of performance increase or decrease relative to the unpruned baseline (first row).

We unveil the efficacy of EPSD by varying the optimization objectives (standard cross-entropy loss vs. SD loss) in two steps. Moreover, we investigate the impact of employing another early pruning method SNIP (Lee, Ajanthan, and Torr 2019) on EPSD.

Varying Objectives

To show the performance improvement of the different objectives in step-1 and step-2 mentioned in the manuscript, we use SD loss and cross-entropy (CE) loss to isolate the influence of each component in the two steps of EPSD. ResNet-18 on CIFAR-100 is adopted as the case study. Specifically, we first fix CE loss in step-2 to train the pruned network and use either SD loss or CE loss to evaluate the importance of weights during the pruning of step-1, to observe which loss better retains distillable weights. Then, we fix SD loss for optimization in step-2 and again use the two losses for pruning, to observe which loss is more conducive to retaining distillable weights.

Step-1: Early pruning with/without SD. As shown in Table A11, when step-2 is trained with CE loss (2nd and 3rd rows), the performance of the pruned network decreases compared with the unpruned baseline (78.88%), and the degradation is smaller when pruning with SD loss (-1.03%) than when pruning with CE loss (-2.08%). This shows that pruning with SD can preserve more trainable weights. On the other hand, when step-2 is trained with SD loss (4th and 5th rows), pruning with SD loss (79.85%) is still better than pruning with CE loss (79.30%).

Step-2: Training with/without SD. We further investigated the effectiveness of SD for training the pruned sub-network in EPSD. In Table A11, the fourth and fifth rows show that, when we use the SD loss instead of the CE loss, training with SD achieves better results than the unpruned baseline. Specifically, compared to training with CE, training with SD improves accuracy by about 2.5% when pruning with CE loss and by about 2% when pruning with SD loss. In addition, with the help of SD, the two pruned networks even surpass the unpruned network by 0.42% and 0.97%, respectively. The above analysis shows that both components deliver the expected improvement, which further supports the effectiveness of our proposed model compression method.
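The differences quoted in this analysis can be reproduced directly from the Table A11 entries. The snippet below is only an arithmetic check; the dictionary layout is an illustrative choice.

```python
baseline = 78.88  # unpruned ResNet-18 on CIFAR-100 (Table A11, first row)

# (pruning loss, training loss) -> test accuracy in %, copied from Table A11.
accuracy = {
    ("CE", "CE"): 76.80,
    ("SD", "CE"): 77.85,
    ("CE", "SD"): 79.30,
    ("SD", "SD"): 79.85,
}

for (prune_loss, train_loss), acc in accuracy.items():
    print(f"prune={prune_loss} train={train_loss}: {acc - baseline:+.2f} vs. baseline")

# Gain from switching the step-2 objective from CE to SD, per pruning loss.
gain_from_sd_training = {
    prune_loss: round(accuracy[(prune_loss, "SD")] - accuracy[(prune_loss, "CE")], 2)
    for prune_loss in ("CE", "SD")
}
print(gain_from_sd_training)
```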

Effect of Pruning algorithm

We applied another early pruning algorithm, SNIP (Lee, Ajanthan, and Torr 2019), to our proposed framework EPSD. SNIP is a simple pruning algorithm that only considers the immediate impact of pruning on the loss before training. We combined SNIP with PS-KD as a case study to investigate the performance. As shown in Fig. A5, our framework still consistently outperformed the ‘Simple Combination’ over all settings when using SNIP for pruning. For instance, under high sparsity conditions (e.g., 95%), the accuracy of the ‘Simple Combination’ is 1.66% lower than the ‘Unpruned Baseline’ while EPSD is comparable (75.12% vs. 75.29%). Our results indicate that EPSD’s effectiveness is independent of a particular pruning algorithm.
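For readers unfamiliar with SNIP, its selection rule can be sketched in a few lines: each connection is scored by the magnitude of gradient times weight at initialization, and only the top-scoring fraction is kept. The function name and the toy values are hypothetical; in the EPSD setting the gradients would be taken with respect to the SD loss rather than the plain cross-entropy loss.

```python
def snip_mask(weights, grads, sparsity):
    """Keep the weights with the largest |gradient * weight| connection sensitivity."""
    saliency = [abs(g * w) for g, w in zip(grads, weights)]
    n_keep = len(weights) - int(round(sparsity * len(weights)))
    # Indices of the n_keep most salient connections.
    keep = set(sorted(range(len(weights)), key=lambda i: -saliency[i])[:n_keep])
    return [1 if i in keep else 0 for i in range(len(weights))]


# Hypothetical flattened weights and loss gradients from one mini-batch at init.
weights = [0.5, -0.1, 0.8, 0.05, -0.6]
grads = [0.2, 0.9, -0.01, 0.4, -0.5]
print(snip_mask(weights, grads, sparsity=0.6))
```

Because the score depends only on a gradient computed before training, swapping the loss used for that gradient is the only change needed to plug SNIP into the framework.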