Text Conditional Alt-Text Generation for Twitter Images

05/24/2023
by Nikita Srivatsan, et al.

In this work we present an approach for generating alternative text (or alt-text) descriptions for images shared on social media, specifically Twitter. This task is more than a special case of image captioning, as alt-text is both more literally descriptive and more context-specific. Critically, images posted to Twitter are often accompanied by user-written text that, despite not necessarily describing the image, may provide useful context if properly leveraged; for example, the tweet may name an uncommon object in the image that the model has not previously seen. We address this with a CLIP prefix model that extracts an embedding of the image and passes it to a mapping network, which outputs a short sequence in word embedding space, or a “prefix”, to which we also concatenate the text from the tweet itself. This lets the model condition on both visual and textual information from the post. The combined multimodal prefix is then fed as a prompt to a pretrained language model, which autoregressively completes the sequence to generate the alt-text. While prior work has used similar methods for captioning, ours is, to our knowledge, the first that also incorporates textual information from the associated social media post into the prefix, and we further demonstrate through ablations that the utility of these two information sources stacks. We put forward a new dataset scraped from Twitter, evaluate on it with a variety of automated metrics as well as human evaluation, and show that our approach of conditioning on both tweet text and visual information significantly outperforms prior work.
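The prefix construction described above can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the authors' implementation: the dimensions, the single linear layer standing in for the mapping network, the tiny vocabulary, and all function names (`map_image_to_prefix`, `embed_tweet`, `build_multimodal_prefix`) are hypothetical, and random numbers stand in for a real CLIP image embedding and a pretrained language model's embedding table. The order of concatenation (image prefix first, tweet text second) is also an assumption.

```python
import random

# All sizes below are hypothetical, chosen only to keep the example small.
EMBED_DIM = 8    # width of the language model's word embeddings
IMAGE_DIM = 6    # width of the image embedding (stand-in for a CLIP embedding)
PREFIX_LEN = 3   # number of prefix vectors produced by the mapping network

rng = random.Random(0)


def random_matrix(rows, cols):
    """Return a rows x cols matrix of random weights (stand-in for trained weights)."""
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


# Stand-in for the mapping network: a single linear layer that maps the image
# embedding to PREFIX_LEN * EMBED_DIM values.
MAPPER_W = random_matrix(PREFIX_LEN * EMBED_DIM, IMAGE_DIM)

# Stand-in for the language model's tokenizer and embedding table.
VOCAB = {"my": 0, "new": 1, "axolotl": 2, "<unk>": 3}
EMBED_TABLE = random_matrix(len(VOCAB), EMBED_DIM)


def map_image_to_prefix(image_embedding):
    """Project the image embedding into PREFIX_LEN vectors in word-embedding space."""
    flat = [sum(w * x for w, x in zip(row, image_embedding)) for row in MAPPER_W]
    return [flat[i * EMBED_DIM:(i + 1) * EMBED_DIM] for i in range(PREFIX_LEN)]


def embed_tweet(text):
    """Look up one word embedding per whitespace-separated token of the tweet."""
    ids = [VOCAB.get(tok, VOCAB["<unk>"]) for tok in text.lower().split()]
    return [list(EMBED_TABLE[i]) for i in ids]


def build_multimodal_prefix(image_embedding, tweet_text):
    """Concatenate the image-derived prefix with the tweet's word embeddings."""
    return map_image_to_prefix(image_embedding) + embed_tweet(tweet_text)


image_embedding = [rng.uniform(-1, 1) for _ in range(IMAGE_DIM)]
prefix = build_multimodal_prefix(image_embedding, "my new axolotl")
print(len(prefix), len(prefix[0]))  # 3 image vectors + 3 tweet tokens, each EMBED_DIM wide
```

In a full system, the resulting sequence of vectors would be passed to the pretrained language model as its prompt, and the model would generate the alt-text token by token from there.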

