Show Me the World in My Language: Establishing the First Baseline for Scene-Text to Scene-Text Translation

Vaidya, Shreyas; Sharma, Arvind Kumar; Gatti, Prajwal; Mishra, Anand

arXiv:2308.03024 (cs)

[Submitted on 6 Aug 2023 (v1), last revised 2 Sep 2024 (this version, v3)]

Abstract: In this work, we study the task of "visually" translating scene text from a source language (e.g., Hindi) to a target language (e.g., English). Visual translation involves not only recognizing and translating scene text but also generating a translated image that preserves the visual features of the source scene text, such as font, size, and background. The task poses several challenges: translating with limited context, deciding between translation and transliteration, accommodating varying text lengths within fixed spatial boundaries, and preserving the font and background styles of the source scene text in the target language. To address this problem, we make the following contributions: (i) We study visual translation as a standalone problem for the first time in the literature. (ii) We present a cascaded framework for visual translation that combines state-of-the-art modules for scene text recognition, machine translation, and scene text synthesis as a baseline for the task. (iii) We propose a set of task-specific design enhancements that yield an improved variant of the baseline. (iv) The existing related literature lacks a comprehensive performance evaluation for this novel task. To fill this gap, we introduce several automatic and user-assisted evaluation metrics designed explicitly for evaluating visual translation. Further, we evaluate the presented baselines on translating scene text between Hindi and English. Our experiments demonstrate that although we can effectively perform visual translation over a large collection of scene text images, the presented baselines only partially address the challenges posed by the visual translation task. We firmly believe that this new task and the limitations of existing models, as reported in this paper, should encourage further research in visual translation.
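The cascaded framework in contribution (ii) can be pictured as three stages chained together: recognition, translation, and synthesis. The following is a minimal sketch of that control flow only, not the authors' implementation. All names (`WordRegion`, `TranslatedRegion`, `visual_translate`, and the `toy_*` stand-ins) are hypothetical, the per-region translation is an assumption, and the "image" is a plain dictionary so that the example runs without any vision or translation model.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical types for illustration; none of these names come from the paper.
Box = Tuple[int, int, int, int]   # (x, y, width, height) of a text region
ToyImage = Dict[Box, str]         # toy stand-in for an image: box -> rendered text


@dataclass
class WordRegion:
    box: Box
    text: str                     # text recognized in the source language


@dataclass
class TranslatedRegion:
    box: Box
    source_text: str
    target_text: str              # text to render in the target language


def visual_translate(
    image: ToyImage,
    recognize: Callable[[ToyImage], List[WordRegion]],
    translate: Callable[[str], str],
    synthesize: Callable[[ToyImage, TranslatedRegion], ToyImage],
) -> Tuple[ToyImage, List[TranslatedRegion]]:
    """Chain the three stages: recognize -> translate -> synthesize."""
    regions = recognize(image)
    # Assumption: each region is translated independently (limited context).
    translated = [
        TranslatedRegion(r.box, r.text, translate(r.text)) for r in regions
    ]
    output = image
    for region in translated:
        # Each synthesis call redraws one region and returns a new image.
        output = synthesize(output, region)
    return output, translated


# Toy stand-ins so the cascade can be exercised without real models.
def toy_recognize(image: ToyImage) -> List[WordRegion]:
    return [WordRegion(box, text) for box, text in image.items()]


TOY_LEXICON = {"नमस्ते": "hello"}


def toy_translate(text: str) -> str:
    # Unknown words are passed through unchanged.
    return TOY_LEXICON.get(text, text)


def toy_synthesize(image: ToyImage, region: TranslatedRegion) -> ToyImage:
    updated = dict(image)         # copy, so the source image is left untouched
    updated[region.box] = region.target_text
    return updated
```

In a real system each callable would wrap a trained model, and the synthesis stage would also have to fit the target text into the source bounding box while preserving font and background, which is where several of the challenges listed in the abstract arise.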

Comments:	Accepted at ICPR 2024, Project Website: this https URL
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Multimedia (cs.MM)
Cite as:	arXiv:2308.03024 [cs.CV]
	(or arXiv:2308.03024v3 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2308.03024

Submission history

From: Anand Mishra [view email]
[v1] Sun, 6 Aug 2023 05:23:25 UTC (6,496 KB)
[v2] Wed, 17 Jul 2024 09:53:23 UTC (4,894 KB)
[v3] Mon, 2 Sep 2024 05:51:02 UTC (3,249 KB)
