Joint Visual Semantic Reasoning: Multi-Stage Decoder for Text Recognition

Bhunia, Ayan Kumar; Sain, Aneeshan; Kumar, Amandeep; Ghose, Shuvozit; Chowdhury, Pinaki Nath; Song, Yi-Zhe

arXiv:2107.12090 (cs)

[Submitted on 26 Jul 2021 (v1), last revised 27 Jul 2021 (this version, v2)]

Abstract: Although text recognition has evolved significantly over the years, state-of-the-art (SOTA) models still struggle in in-the-wild scenarios due to complex backgrounds, varying fonts, uncontrolled illumination, distortions and other artefacts. This is because such models depend solely on visual information for text recognition and thus lack semantic reasoning capabilities. In this paper, we argue that semantic information plays a role complementary to visual information. More specifically, we additionally utilize semantic information by proposing a multi-stage, multi-scale attentional decoder that performs joint visual-semantic reasoning. Our novelty lies in the intuition that, for text recognition, the prediction should be refined in a stage-wise manner. Our key contribution is therefore the design of a stage-wise unrolling attentional decoder in which the non-differentiability introduced by discretely predicted character labels must be bypassed for end-to-end training. While the first stage predicts using visual features, subsequent stages refine that prediction using joint visual-semantic information. Additionally, we introduce multi-scale 2D attention, along with dense and residual connections between stages, to deal with varying character sizes, yielding better performance and faster convergence during training. Experimental results show that our approach outperforms existing SOTA methods by a considerable margin.
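
The stage-wise idea in the abstract can be illustrated with a toy sketch. The abstract does not say how the non-differentiability is bypassed, so the code below assumes one common relaxation: instead of looking up the embedding of the argmax character, the next stage receives a softmax-weighted average of all character embeddings, which is a smooth function of the first stage's scores. Every name and number here (`soft_embedding`, `refine`, `WEIGHT`, the 3-character alphabet) is hypothetical and is not taken from the paper.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax; lower temperature gives a sharper output."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def soft_embedding(probs, embedding_table):
    """Probability-weighted average of character embeddings.

    Unlike an argmax lookup, this is smooth in `probs`, so in an autograd
    framework gradients could flow back to the stage that produced them.
    """
    dim = len(embedding_table[0])
    return [sum(p * row[d] for p, row in zip(probs, embedding_table))
            for d in range(dim)]

def refine(visual_logits, semantic, weight):
    """Hypothetical refinement stage: add a semantic bias to visual logits."""
    return [v + sum(w * s for w, s in zip(row, semantic))
            for v, row in zip(visual_logits, weight)]

# Toy setup (all values are assumptions chosen for illustration).
ALPHABET = ["a", "b", "c"]
EMBEDDINGS = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # one 2-d vector per character
WEIGHT = [[0.5, -0.5], [-0.5, 0.5], [0.0, 0.0]]     # embedding -> per-class bias

# Stage 1: prediction from visual evidence only.
stage1_logits = [2.0, 1.0, 0.1]
stage1_probs = softmax(stage1_logits, temperature=0.5)

# Bridge: pass a soft (relaxed) character embedding to the next stage.
semantic = soft_embedding(stage1_probs, EMBEDDINGS)

# Stage 2: refine using visual logits together with the semantic vector.
stage2_logits = refine(stage1_logits, semantic, WEIGHT)
stage2_probs = softmax(stage2_logits)
prediction = ALPHABET[max(range(len(ALPHABET)), key=stage2_probs.__getitem__)]
```

This sketch only shows the forward pass for a single character position. A real decoder would use learned attention over image features and an autograd framework, and may use a different relaxation from the one assumed here.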

Comments:	IEEE International Conference on Computer Vision (ICCV), 2021
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2107.12090 [cs.CV]
	(or arXiv:2107.12090v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2107.12090

Submission history

From: Ayan Kumar Bhunia
[v1] Mon, 26 Jul 2021 10:15:14 UTC (1,497 KB)
[v2] Tue, 27 Jul 2021 02:27:15 UTC (1,483 KB)
