IIITD-20K: Dense captioning for Text-Image ReID

05/08/2023
by A V Subramanyam, et al.

Text-to-Image (T2I) ReID has attracted a lot of attention in the recent past. CUHK-PEDES, RSTPReid and ICFG-PEDES are the three available benchmarks for evaluating T2I ReID methods. RSTPReid and ICFG-PEDES comprise identities from MSMT17, but the limited number of unique persons restricts their diversity. CUHK-PEDES, on the other hand, comprises 13,003 identities but has relatively short text descriptions on average. Further, these datasets are captured in restricted environments with a limited number of cameras. To further diversify the identities and provide dense captions, we propose a novel dataset called IIITD-20K. IIITD-20K comprises 20,000 unique identities captured in the wild and provides a rich dataset for text-to-image ReID. Each image is densely captioned, with a minimum of 26 words per description. We further synthetically generate images and fine-grained captions using Stable Diffusion and BLIP models trained on our dataset. We perform elaborate experiments using state-of-the-art text-to-image ReID models and vision-language pre-trained models and present a comprehensive analysis of the dataset. Our experiments also reveal that synthetically generated data leads to a substantial performance improvement in both same-dataset and cross-dataset settings. Our dataset is available at https://bit.ly/3pkA3Rj.
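T2I ReID experiments like those described above are commonly scored with the Rank-k (cumulative matching) retrieval protocol used by the CUHK-PEDES, RSTPReid and ICFG-PEDES benchmarks: each text query is matched against a gallery of person images, and a query counts as a hit at rank k if any of its top-k retrieved images shares the query's identity. A minimal sketch of that metric, assuming a precomputed query-gallery similarity matrix (the function name and toy data below are illustrative, not from the paper):

```python
import numpy as np

def rank_k_accuracy(sim, query_ids, gallery_ids, ks=(1, 5, 10)):
    """Rank-k retrieval accuracy for text-to-image ReID.

    sim:         (num_queries, num_gallery) similarity matrix,
                 higher means a better text-image match.
    query_ids:   (num_queries,) person identity per text query.
    gallery_ids: (num_gallery,) person identity per gallery image.
    """
    # Sort gallery indices by descending similarity for each query.
    order = np.argsort(-sim, axis=1)
    # match[i, j] is True if the j-th ranked image for query i
    # shares that query's identity.
    match = gallery_ids[order] == query_ids[:, None]
    # A query is a hit at rank k if any of its top-k entries match.
    return {k: float(match[:, :k].any(axis=1).mean()) for k in ks}

# Toy example: 2 text queries against 4 gallery images.
sim = np.array([[0.3, 0.9, 0.1, 0.2],
                [0.2, 0.4, 0.8, 0.1]])
query_ids = np.array([0, 1])
gallery_ids = np.array([0, 1, 1, 2])
acc = rank_k_accuracy(sim, query_ids, gallery_ids, ks=(1, 2))
# Query 0's correct image is ranked second, query 1's first,
# so Rank-1 is 0.5 and Rank-2 is 1.0.
```

In practice the similarity matrix would come from the cosine similarity of text and image embeddings produced by a ReID or vision-language model; mAP is usually reported alongside Rank-k.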
