Experimenting with Self-Supervision using Rotation Prediction for Image Captioning

07/28/2021
by Ahmed Elhagry, et al.

Image captioning is a task in the field of Artificial Intelligence that combines computer vision and natural language processing. It generates captions that describe images, and has applications such as descriptions used by assistive technology or image indexing (for search engines, for instance). This makes it an important and actively researched topic in AI. Like many other tasks, however, it is trained on large sets of images labeled via human annotation, which can be very cumbersome: annotation requires manual effort, incurs both financial and temporal costs, is error-prone, and can be difficult to carry out in some cases (e.g. medical images). To mitigate the need for labels, we attempt to use self-supervised learning, a type of learning in which models use the data contained within the images themselves as labels. This is challenging, though, since the task is two-fold: the images and captions come from two different modalities and are usually handled by different types of networks. It is thus not obvious what a completely self-supervised solution would look like, and how it would achieve captioning in a way comparable to how self-supervision is applied today to image recognition tasks is still an open research question. In this project, we use an encoder-decoder architecture in which the encoder is a convolutional neural network (CNN) trained on the OpenImages dataset that learns image features in a self-supervised fashion using the rotation pretext task. The decoder is a Long Short-Term Memory (LSTM) network; it is trained, along with the image captioning model, on the MS COCO dataset and is responsible for generating captions. Our GitHub repository can be found at: https://github.com/elhagry1/SSL_ImageCaptioning_RotationPrediction
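
The rotation pretext task mentioned in the abstract can be illustrated with a small sketch. This is a minimal NumPy example, not the authors' implementation: it assumes square images and the commonly used set of four rotations (0, 90, 180 and 270 degrees), which the abstract itself does not specify, and the function name `make_rotation_batch` is hypothetical. The point it shows is that the training labels come from the rotation applied to each image, so no human annotation is needed.

```python
import numpy as np


def make_rotation_batch(images):
    """Build a self-labeled batch for rotation prediction.

    images: array of shape (N, H, W, C) with H == W (assumption: square
    images, so that all rotated copies share one shape).
    Returns (rotated, labels), where rotated has shape (4N, H, W, C) and
    labels[i] in {0, 1, 2, 3} is the number of 90-degree turns applied.
    """
    # Rotate the whole batch in the spatial plane (axes 1 and 2).
    rotated = [np.rot90(images, k=k, axes=(1, 2)) for k in range(4)]
    # The label is the rotation index, repeated once per image.
    labels = np.repeat(np.arange(4), len(images))
    return np.concatenate(rotated, axis=0), labels


# Toy data standing in for real images (assumption: 2 images of 8x8x3).
rng = np.random.default_rng(0)
images = rng.random((2, 8, 8, 3))
x, y = make_rotation_batch(images)
```

In a setup like the one described, a CNN encoder would be trained to predict `y` from `x` as a four-way classification problem, and its learned features would then be reused by the captioning decoder.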


