AssistSR: Task-oriented Video Segment Retrieval for Personal AI Assistant

Lei, Stan Weixian; Gao, Difei; Wang, Yuxuan; Mao, Dongxing; Liang, Zihan; Ran, Lingmin; Shou, Mike Zheng

arXiv:2111.15050 (cs)

[Submitted on 30 Nov 2021 (v1), last revised 10 Oct 2022 (this version, v4)]

Abstract: It is still a pipe dream that personal AI assistants on phones and AR glasses can help with everyday questions such as "how to adjust the date for this watch?" and "how to set its heating duration?" (asked while pointing at an oven). The queries used in conventional tasks (i.e., Video Question Answering, Video Retrieval, Moment Localization) are often factoid and based on pure text. In contrast, we present a new task called Task-oriented Question-driven Video Segment Retrieval (TQVSR). Each of our questions is an image-box-text query that focuses on the affordances of items in daily life and expects relevant answer segments to be retrieved from a corpus of instructional video-transcript segments. To support the study of TQVSR, we construct a new dataset called AssistSR, using novel guidelines designed to create high-quality samples. The dataset contains 3.2k multimodal questions on 1.6k video segments from instructional videos on diverse daily-used items. To address TQVSR, we develop a simple yet effective model called Dual Multimodal Encoders (DME) that significantly outperforms several baseline methods while still leaving large room for improvement. Moreover, we present detailed ablation analyses. Code and data are available at this https URL.
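
The retrieval setting described in the abstract (a query is embedded, candidate segments are embedded, and segments are ranked by similarity) can be illustrated with a generic dual-encoder ranking sketch. This is not the authors' DME implementation, whose details are not given here. The embeddings are hand-written toy vectors standing in for the outputs of a query encoder and a segment encoder, and the names `cosine`, `rank_segments`, `recall_at_k`, and the segment IDs are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Set


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_segments(query_vec: Sequence[float],
                  segment_vecs: Dict[str, Sequence[float]]) -> List[str]:
    """Return segment IDs sorted from most to least similar to the query."""
    return sorted(segment_vecs,
                  key=lambda sid: cosine(query_vec, segment_vecs[sid]),
                  reverse=True)


def recall_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """1.0 if any relevant segment appears in the top k results, else 0.0."""
    return 1.0 if any(sid in relevant for sid in ranked[:k]) else 0.0


if __name__ == "__main__":
    # Toy embedding of an image-box-text query (assumed output of a query encoder).
    query_vec = [1.0, 0.0, 1.0]
    # Toy embeddings of video-transcript segments (assumed output of a segment encoder).
    segment_vecs = {
        "watch_set_date": [0.9, 0.1, 0.8],
        "oven_timer": [0.0, 1.0, 0.1],
        "watch_strap": [0.5, 0.5, 0.0],
    }
    ranked = rank_segments(query_vec, segment_vecs)
    print(ranked)
    print(recall_at_k(ranked, {"watch_set_date"}, k=1))
```

In a real system the two encoders would be learned models and the segment embeddings would be precomputed, so that answering a query only requires one encoder pass and a similarity search over the corpus.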

Comments:	20 pages, 12 figures
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2111.15050 [cs.CV]
	(or arXiv:2111.15050v4 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2111.15050

Submission history

From: Weixian Lei
[v1] Tue, 30 Nov 2021 01:14:10 UTC (44,991 KB)
[v2] Mon, 6 Dec 2021 11:42:42 UTC (44,992 KB)
[v3] Sun, 13 Mar 2022 05:20:52 UTC (42,223 KB)
[v4] Mon, 10 Oct 2022 05:40:46 UTC (33,282 KB)
