Data-free ensemble knowledge distillation for privacy-conscious multimedia model compression
Proceedings of the 29th ACM International Conference on Multimedia, 2021 (dl.acm.org)
Recent advances in deep learning bring impressive performance to multimedia applications. Hence, compressing these applications via model compression and deploying them on resource-limited edge devices becomes attractive. Knowledge distillation (KD) is one of the most popular model compression techniques. However, most well-performing KD approaches require the original dataset, which is usually unavailable due to privacy concerns, while existing data-free KD methods perform much worse than their data-dependent counterparts. In this paper, we analyze previous data-free KD methods from the data perspective and point out that relying on a single pre-trained model limits their performance. We then propose a Data-Free Ensemble knowledge Distillation (DFED) framework, which consists of a student network, a generator network, and multiple pre-trained teacher networks. During training, the student mimics the behavior of the teacher ensemble on samples synthesized by the generator, which is trained to enlarge the prediction discrepancy between the student and the teachers. A moment matching loss term assists generator training by minimizing the distance between the activations of synthesized samples and those of real samples. We evaluate DFED on three popular image classification datasets, and the results demonstrate that our method achieves significant performance improvements over previous works. We also conduct an ablation study to verify the effectiveness of each component of the proposed framework.
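To make the described min-max training concrete, below is a minimal PyTorch sketch of one DFED-style step under stated assumptions: the ensemble target is the mean of the teachers' logits, the student/teacher discrepancy is an L1 distance between outputs (as in earlier adversarial data-free distillation work), and the moment matching term is approximated by pushing generated-sample activations toward the teachers' BatchNorm running statistics, since real samples are unavailable at distillation time. All names here (train_step, bn_moment_loss, lam, z_dim) are illustrative assumptions, not the authors' code, and the paper's exact losses and weighting may differ.

```python
# Sketch of an adversarial data-free ensemble distillation step (assumed form).
# Teachers are assumed frozen: eval mode, requires_grad_(False) on their params.
import torch
import torch.nn.functional as F

def ensemble_logits(teachers, x):
    # Average the frozen teachers' logits to form the ensemble target.
    with torch.no_grad():
        return torch.stack([t(x) for t in teachers]).mean(dim=0)

def bn_moment_loss(teacher, x):
    # Assumed moment-matching surrogate: match the mean/variance of the
    # synthesized samples' activations at each BatchNorm layer to the
    # running statistics the teacher accumulated on real data.
    feats, hooks = [], []
    def hook(module, inp, out):
        feats.append((module, inp[0]))
    for m in teacher.modules():
        if isinstance(m, torch.nn.BatchNorm2d):
            hooks.append(m.register_forward_hook(hook))
    teacher(x)  # forward pass populates feats; gradients flow back to x
    loss = x.new_zeros(())
    for m, f in feats:
        mu = f.mean(dim=(0, 2, 3))
        var = f.var(dim=(0, 2, 3), unbiased=False)
        loss = loss + F.mse_loss(mu, m.running_mean) + F.mse_loss(var, m.running_var)
    for h in hooks:
        h.remove()
    return loss

def train_step(generator, student, teachers, g_opt, s_opt,
               z_dim=100, batch=64, lam=1.0):
    device = next(student.parameters()).device

    # Generator step: synthesize samples that ENLARGE the prediction
    # discrepancy between student and teacher ensemble, regularized by
    # the moment-matching term.
    z = torch.randn(batch, z_dim, device=device)
    x = generator(z)
    g_loss = -F.l1_loss(student(x), ensemble_logits(teachers, x)) \
             + lam * bn_moment_loss(teachers[0], x)
    g_opt.zero_grad()
    g_loss.backward()
    g_opt.step()

    # Student step: mimic the teacher ensemble on fresh synthesized samples.
    z = torch.randn(batch, z_dim, device=device)
    x = generator(z).detach()
    s_loss = F.l1_loss(student(x), ensemble_logits(teachers, x))
    s_opt.zero_grad()
    s_loss.backward()
    s_opt.step()
    return g_loss.item(), s_loss.item()
```

Alternating the two updates realizes the adversarial game the abstract describes: the generator searches for inputs where the student and the teacher ensemble disagree most, and the student then closes that gap, so the synthesized data keeps probing the student's weakest regions without ever touching the original dataset.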
