MCAD: Multi-teacher Cross-modal Alignment Distillation for efficient image-text retrieval
Y Lei, F He, C Chen, Y Mo, SJ Li, D Xie, H Lu
arXiv preprint arXiv:2310.19654, 2023 - arxiv.org
With the success of large-scale visual-language pretraining models and the wide industrial application of image-text retrieval, reducing model size and streamlining on-device deployment have become urgently necessary. The mainstream model structures for image-text retrieval are single-stream and dual-stream, both aiming to close the semantic gap between the visual and textual modalities. Dual-stream models excel at offline indexing and fast inference, while single-stream models achieve more accurate cross-modal alignment through richer feature fusion. We propose a Multi-teacher Cross-modal Alignment Distillation (MCAD) technique to integrate the advantages of single-stream and dual-stream models. By incorporating the fused single-stream features into the image and text features of the dual-stream model, we formulate new modified teacher features and logits. We then perform both logit and feature distillation to boost the capability of the student dual-stream model, achieving high retrieval performance without increasing inference complexity. Extensive experiments demonstrate the strong performance and high efficiency of MCAD on image-text retrieval tasks. Furthermore, we implement a mobile CLIP model on Snapdragon chips with only 93M of running memory and 30ms search latency, without apparent performance degradation relative to the original large CLIP.
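The distillation objective the abstract describes — blending fused single-stream features into the dual-stream teacher's image and text features, then applying both logit and feature distillation to the student — can be sketched as follows. This is an illustrative NumPy sketch, not the paper's exact formulation: the function names, the linear blending scheme with weight `alpha`, the temperature `tau`, and the choice of KL divergence for logits plus MSE for features are all assumptions made for clarity.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def l2_normalize(x, axis=-1, eps=1e-8):
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def mcad_losses(student_img, student_txt, teacher_img, teacher_txt,
                fused, alpha=0.5, tau=0.05):
    """Sketch of MCAD-style distillation losses.

    student_img/student_txt: dual-stream student embeddings, shape (B, D).
    teacher_img/teacher_txt: dual-stream teacher embeddings, shape (B, D).
    fused: fused single-stream teacher features, shape (B, D).
    alpha, tau: hypothetical blending weight and softmax temperature.
    """
    # Modified teacher features: blend dual-stream teacher features with the
    # fused single-stream features (a simple linear blend, assumed here).
    t_img = l2_normalize(alpha * teacher_img + (1.0 - alpha) * fused)
    t_txt = l2_normalize(alpha * teacher_txt + (1.0 - alpha) * fused)
    s_img = l2_normalize(student_img)
    s_txt = l2_normalize(student_txt)

    # Retrieval logits: temperature-scaled cosine similarities.
    s_logits = s_img @ s_txt.T / tau
    t_logits = t_img @ t_txt.T / tau

    # Logit distillation: KL(teacher || student) over image-to-text matching.
    p_t = softmax(t_logits)
    p_s = softmax(s_logits)
    logit_loss = np.mean(
        np.sum(p_t * (np.log(p_t + 1e-8) - np.log(p_s + 1e-8)), axis=-1))

    # Feature distillation: MSE between student and modified teacher features.
    feat_loss = (np.mean((s_img - t_img) ** 2)
                 + np.mean((s_txt - t_txt) ** 2))
    return logit_loss, feat_loss
```

In a real training loop both losses would be weighted and backpropagated through the student only; the teacher and fused features would come from frozen pretrained models.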