Modularized Textual Grounding for Counterfactual Resilience

Fang, Zhiyuan; Kong, Shu; Fowlkes, Charless; Yang, Yezhou

Computer Science > Computer Vision and Pattern Recognition

arXiv:1904.03589 (cs)

[Submitted on 7 Apr 2019 (v1), last revised 1 Jul 2019 (this version, v2)]

Title: Modularized Textual Grounding for Counterfactual Resilience

Authors: Zhiyuan Fang, Shu Kong, Charless Fowlkes, Yezhou Yang

Abstract: Computer vision applications often require a textual grounding module that is precise, interpretable, and resilient to counterfactual inputs/queries. To achieve high grounding precision, current textual grounding methods rely heavily on large-scale training data with manual annotations at the pixel level. Such annotations are expensive to obtain and thus severely narrow the model's scope of real-world applications. Moreover, most of these methods sacrifice interpretability and generalizability, and they neglect the importance of being resilient to counterfactual inputs. To address these issues, we propose a visual grounding system which is 1) end-to-end trainable in a weakly supervised fashion with only image-level annotations, and 2) counterfactually resilient owing to its modular design. Specifically, we decompose textual descriptions into three levels (entity, semantic attribute, and color information) and perform compositional grounding progressively. We validate our model through a series of experiments and demonstrate its improvement over state-of-the-art methods. In particular, our model not only surpasses other weakly supervised and unsupervised methods and even approaches the strongly supervised ones, but is also interpretable for decision making and performs much better in the face of counterfactual classes than all the others.

Comments:	13 pages, 12 figures, IEEE Conference on Computer Vision and Pattern Recognition, 2019
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:1904.03589 [cs.CV]
	(or arXiv:1904.03589v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.1904.03589

Submission history

From: Zhiyuan Fang
[v1] Sun, 7 Apr 2019 05:59:04 UTC (3,476 KB)
[v2] Mon, 1 Jul 2019 04:42:34 UTC (3,476 KB)
