TextTopicNet - Self-Supervised Learning of Visual Features Through Embedding Images on Semantic Text Spaces

07/04/2018
by   Yash Patel, et al.
0

The immense success of deep learning based methods in computer vision heavily relies on large scale training datasets. These richly annotated datasets help the network learn discriminative visual features. Collecting and annotating such datasets requires a tremendous amount of human effort and annotations are limited to popular set of classes. As an alternative, learning visual features by designing auxiliary tasks which make use of freely available self-supervision has become increasingly popular in the computer vision community. In this paper, we put forward an idea to take advantage of multi-modal context to provide self-supervision for the training of computer vision algorithms. We show that adequate visual features can be learned efficiently by training a CNN to predict the semantic textual context in which a particular image is more probable to appear as an illustration. More specifically we use popular text embedding techniques to provide the self-supervision for the training of deep CNN. Our experiments demonstrate state-of-the-art performance in image classification, object detection, and multi-modal retrieval compared to recent self-supervised or naturally-supervised approaches.

READ FULL TEXT

page 3

page 7

page 8

page 12

page 13

page 14

research
05/24/2017

Self-supervised learning of visual features through embedding images into text topic spaces

End-to-end training from scratch of current deep architectures for new c...
research
01/31/2019

Self-Supervised Visual Representations for Cross-Modal Retrieval

Cross-modal retrieval methods have been significantly improved in last y...
research
05/22/2014

Self-tuned Visual Subclass Learning with Shared Samples An Incremental Approach

Computer vision tasks are traditionally defined and evaluated using sema...
research
11/28/2022

Pitfalls of Conditional Batch Normalization for Contextual Multi-Modal Learning

Humans have perfected the art of learning from multiple modalities throu...
research
04/14/2023

DINOv2: Learning Robust Visual Features without Supervision

The recent breakthroughs in natural language processing for model pretra...
research
03/18/2020

Self-Supervised Contextual Bandits in Computer Vision

Contextual bandits are a common problem faced by machine learning practi...
research
05/26/2023

SelfClean: A Self-Supervised Data Cleaning Strategy

Most commonly used benchmark datasets for computer vision contain irrele...

Please sign up or login with your details

Forgot password? Click here to reset