TransVOD: End-to-End Video Object Detection with Spatial-Temporal Transformers

Zhou, Qianyu; Li, Xiangtai; He, Lu; Yang, Yibo; Cheng, Guangliang; Tong, Yunhai; Ma, Lizhuang; Tao, Dacheng

doi:10.1109/TPAMI.2022.3223955

arXiv:2201.05047 (cs)

[Submitted on 13 Jan 2022 (v1), last revised 22 Nov 2022 (this version, v4)]

Abstract: Detection Transformer (DETR) and Deformable DETR have been proposed to eliminate the need for many hand-designed components in object detection while performing comparably to previous complex hand-crafted detectors. However, their performance on Video Object Detection (VOD) has not been well explored. In this paper, we present TransVOD, the first end-to-end video object detection system based on spatial-temporal Transformer architectures. The first goal of this paper is to streamline the VOD pipeline, removing the need for many hand-crafted feature aggregation components such as optical flow models and relation networks. In addition, thanks to the object query design in DETR, our method does not need complicated post-processing methods such as Seq-NMS. In particular, we present a temporal Transformer to aggregate both the spatial object queries and the feature memories of each frame. Our temporal Transformer consists of two components: a Temporal Query Encoder (TQE) to fuse object queries, and a Temporal Deformable Transformer Decoder (TDTD) to obtain the detection results for the current frame. These designs boost the strong Deformable DETR baseline by a significant margin (3%-4% mAP) on the ImageNet VID dataset. We then present two improved versions of TransVOD: TransVOD++ and TransVOD Lite. The former fuses object-level information into the object queries via dynamic convolution, while the latter models entire video clips as the output to speed up inference. We give a detailed analysis of all three models in the experiments. In particular, our proposed TransVOD++ sets a new state-of-the-art record in terms of accuracy on ImageNet VID with 90.0% mAP. Our proposed TransVOD Lite also achieves the best speed and accuracy trade-off, with 83.7% mAP while running at around 30 FPS on a single V100 GPU device.
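
As a rough illustration of what "fusing object queries" across frames can mean, the sketch below lets each current-frame query attend to queries collected from reference frames and adds the attended context back as a residual. It is not the authors' TQE: it assumes plain single-head scaled dot-product attention with no learned projections, and the names `fuse_queries`, `current_queries` and `reference_queries` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_queries(current_queries, reference_queries):
    """Fuse each current-frame query with reference-frame queries.

    Assumption: single-head scaled dot-product cross-attention without
    learned projections, followed by a residual add. The real TQE is a
    trained Transformer module; this only shows the data flow.
    """
    dim = len(current_queries[0])
    scale = 1.0 / math.sqrt(dim)
    fused = []
    for q in current_queries:
        # Similarity of this query to every reference query.
        scores = [scale * sum(a * b for a, b in zip(q, k))
                  for k in reference_queries]
        weights = softmax(scores)
        # Attention-weighted average of the reference queries.
        context = [sum(w * k[i] for w, k in zip(weights, reference_queries))
                   for i in range(dim)]
        # Residual connection keeps the original query content.
        fused.append([qi + ci for qi, ci in zip(q, context)])
    return fused


# Toy data: 2 current-frame queries, 3 queries pooled from reference frames.
current = [[1.0, 0.0], [0.0, 1.0]]
reference = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused = fuse_queries(current, reference)
```

The output keeps one fused vector per current-frame query, so the number of queries passed on for decoding is unchanged; only their content is enriched with temporal context.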

Comments:	Accepted to IEEE Transactions on Pattern Analysis and Machine Intelligence (IEEE TPAMI), extended version of arXiv:2105.10920
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2201.05047 [cs.CV]
	(or arXiv:2201.05047v4 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2201.05047
Related DOI:	https://doi.org/10.1109/TPAMI.2022.3223955

Submission history

From: Qianyu Zhou
[v1] Thu, 13 Jan 2022 16:17:34 UTC (5,107 KB)
[v2] Fri, 14 Jan 2022 07:19:08 UTC (5,108 KB)
[v3] Mon, 17 Jan 2022 02:06:34 UTC (5,108 KB)
[v4] Tue, 22 Nov 2022 06:07:22 UTC (7,439 KB)