Improving Visual Reasoning by Exploiting The Knowledge in Texts

02/09/2021
by   Sahand Sharifzadeh, et al.

This paper presents a new framework for training image-based classifiers from a combination of texts and images with very few labels. We consider a classification framework with three modules: a backbone, a relational reasoning component, and a classification component. While the backbone can be trained on unlabeled images by self-supervised learning, the relational reasoning and classification components can be fine-tuned from external sources of knowledge instead of annotated images. By proposing a transformer-based model that creates structured knowledge from textual input, we enable the utilization of the knowledge contained in texts. We show that, compared to supervised baselines trained with 1% of the annotated images, we achieve significantly more accurate results in scene graph classification, about 3x in object classification, and about 1.5x in predicate classification.
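To make the three-module decomposition concrete, here is a minimal sketch of such a pipeline. This is not the paper's implementation: the function names, the toy message-passing step, and the nearest-prototype classifier are all illustrative assumptions, standing in for a learned backbone, relational reasoning component, and classification head.

```python
# Hypothetical sketch of the three-module pipeline the abstract describes:
# backbone -> relational reasoning -> classification. All math here is a
# toy stand-in for the learned components in the paper.

def backbone(image_regions):
    # Stand-in for a self-supervised feature extractor: each region is
    # assumed to already be a feature vector, so this is the identity.
    return [list(v) for v in image_regions]

def relational_reasoning(features):
    # Toy relational step: refine each region's features by mixing them
    # with the mean of the other regions (a message-passing sketch).
    n = len(features)
    refined = []
    for i, f in enumerate(features):
        others = [features[j] for j in range(n) if j != i]
        if others:
            context = [sum(col) / len(others) for col in zip(*others)]
        else:
            context = f
        refined.append([0.75 * a + 0.25 * b for a, b in zip(f, context)])
    return refined

def classify(features, class_prototypes):
    # Nearest-prototype classifier: pick the class whose prototype has
    # the largest dot product with the region's refined features.
    def score(f, p):
        return sum(a * b for a, b in zip(f, p))
    return [
        max(class_prototypes, key=lambda c: score(f, class_prototypes[c]))
        for f in features
    ]

# Tiny usage example: three regions, two classes.
regions = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
prototypes = {"cat": [1.0, 0.0], "dog": [0.0, 1.0]}
labels = classify(relational_reasoning(backbone(regions)), prototypes)
# labels -> ["cat", "cat", "dog"]
```

The point of the sketch is the composition: because the three stages are separate modules, each can in principle be trained from a different source (unlabeled images for the backbone, textual knowledge for the reasoning and classification stages), which is the premise of the paper.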
