Improving Visual Reasoning by Exploiting The Knowledge in Texts

02/09/2021

This paper presents a new framework for training image-based classifiers from a combination of texts and images with very few labels. We consider a classification framework with three modules: a backbone, a relational reasoning component, and a classification component. While the backbone can be trained from unlabeled images by self-supervised learning, we can fine-tune the relational reasoning and the classification components from external sources of knowledge instead of annotated images. By proposing a transformer-based model that creates structured knowledge from textual input, we enable the utilization of the knowledge in texts. We show that, compared to the supervised baselines with 1 [...] scene graph classification, 3x in object classification, and 1.5x in predicate classification.
