Long-Short Term Cross-Transformer in Compressed Domain for Few-Shot Video Classification

Wenyang Luo, Yufan Liu, Bing Li, Weiming Hu, Yanan Miao, Yangxi Li

Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence
Main Track. Pages 1247-1253. https://doi.org/10.24963/ijcai.2022/174

Compared with image few-shot learning, most existing few-shot video classification methods perform worse at feature matching because they fail to sufficiently exploit temporal information and relations. Specifically, frames are usually sampled uniformly, which may miss important frames. Moreover, heuristic models simply encode the equally treated frames in sequence, lacking both long-term and short-term temporal modeling and interaction. To alleviate these limitations, we take advantage of compressed-domain knowledge and propose a Long-Short Term Cross-Transformer (LSTC) for few-shot video classification. For short-term modeling, the motion vector (MV) contains temporal cues and reflects the importance of each frame. For long-term modeling, a video is natively divided into a sequence of GOPs (Groups of Pictures). Exploiting this compressed-domain knowledge yields a more accurate spatial-temporal feature space. Accordingly, LSTC comprises a long-short term selection module, a short-term module, and a long-term module. Long-short term selection picks informative compressed-domain data, while the long/short-term modules fully exploit the temporal information so that query and support videos can be well matched by cross-attention. Experimental results show the superiority of our method on various datasets.
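The abstract does not include code, so the following is a minimal, illustrative sketch (not the authors' implementation) of the two central ideas: selecting informative frames by motion-vector magnitude, and matching query and support features with cross-attention. All names (`select_frames_by_mv`, `CrossAttentionMatcher`), shapes, and hyperparameters are hypothetical assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def select_frames_by_mv(frames, motion_vectors, k):
    """Keep the k frames with the largest mean motion-vector magnitude.

    frames:         (T, C, H, W) decoded frames of one GOP
    motion_vectors: (T, 2, H', W') per-frame MV fields from the codec
    Returns the selected (k, C, H, W) frames in temporal order.
    """
    # Mean MV magnitude per frame as a crude "frame importance" score.
    scores = motion_vectors.norm(dim=1).mean(dim=(1, 2))  # (T,)
    idx = scores.topk(k).indices.sort().values            # keep temporal order
    return frames[idx]

class CrossAttentionMatcher(nn.Module):
    """Match a query clip against a support clip with cross-attention,
    then score the pair by cosine similarity of pooled features."""

    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, query_feats, support_feats):
        # query_feats:   (B, Tq, dim) frame-level features of the query video
        # support_feats: (B, Ts, dim) frame-level features of a support video
        attended, _ = self.attn(query_feats, support_feats, support_feats)
        q = attended.mean(dim=1)       # pooled query, support-conditioned
        s = support_feats.mean(dim=1)  # pooled support
        return F.cosine_similarity(q, s, dim=-1)  # (B,) matching scores

# Toy usage: score one query against 5 support classes (5-way episode).
matcher = CrossAttentionMatcher(dim=256)
query = torch.randn(5, 8, 256)    # query features repeated per class
support = torch.randn(5, 8, 256)  # one support clip per class
logits = matcher(query, support)  # higher score = better class match
```

In this sketch, frame selection stands in for the paper's long-short term selection module, and the single cross-attention layer stands in for the full LSTC; the actual model additionally separates long-term (GOP-level) and short-term (MV-level) pathways.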
Keywords:
Computer Vision: Recognition (object detection, categorization)
Computer Vision: Video analysis and understanding   
Machine Learning: Few-shot learning