A Feature-space Multimodal Data Augmentation Technique for Text-video Retrieval

08/03/2022
by Alex Falcon, et al.

Every hour, huge amounts of visual content are posted on social media and user-generated content platforms. Text-video retrieval methods, which find relevant videos by means of a natural language query, have received increased attention over the past few years. Data augmentation techniques were introduced to increase performance on unseen test examples by creating new training samples through semantics-preserving transformations, such as color space or geometric transformations on images. Yet these techniques are usually applied to raw data, which leads to more resource-demanding solutions and also requires that the raw data be shareable; this is not always the case, e.g. because of copyright issues with clips from movies or TV series. To address this shortcoming, we propose a multimodal data augmentation technique which works in the feature space and creates new videos and captions by mixing semantically similar samples. We evaluate our solution on a large-scale public dataset, EPIC-Kitchens-100, where it achieves considerable improvements over a baseline method and improves on state-of-the-art performance; we also perform multiple ablation studies. We release code and pretrained models on GitHub at https://github.com/aranciokov/FSMMDA_VideoRetrieval.
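
To make the idea of mixing in feature space concrete, the following is a minimal sketch and not the released implementation. The abstract does not specify the mixing function or how semantic similarity is decided, so the sketch assumes a mixup-style convex combination with a coefficient drawn from a Beta distribution, and treats two samples as semantically similar when they share a class label. The names `mix_features`, `augment_pair`, `labels`, and `alpha` are hypothetical.

```python
import random
from typing import List, Sequence, Tuple

Vector = List[float]


def mix_features(a: Sequence[float], b: Sequence[float], lam: float) -> Vector:
    """Return the convex combination lam * a + (1 - lam) * b."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same dimension")
    return [lam * x + (1.0 - lam) * y for x, y in zip(a, b)]


def augment_pair(
    video_feats: Sequence[Vector],
    text_feats: Sequence[Vector],
    labels: Sequence[str],
    index: int,
    rng: random.Random,
    alpha: float = 1.0,
) -> Tuple[Vector, Vector]:
    """Build one synthetic (video, caption) feature pair for sample `index`.

    Assumptions (not taken from the paper):
    - "semantically similar" means "has the same label";
    - both modalities are mixed with the same partner and coefficient;
    - the coefficient is sampled from Beta(alpha, alpha), as in mixup.
    """
    # Candidate partners: other samples that share the label of `index`.
    candidates = [
        j for j, lab in enumerate(labels) if lab == labels[index] and j != index
    ]
    if not candidates:
        # No similar sample available: return the original features unchanged.
        return list(video_feats[index]), list(text_feats[index])

    partner = rng.choice(candidates)
    lam = rng.betavariate(alpha, alpha)  # mixing coefficient in [0, 1]

    new_video = mix_features(video_feats[index], video_feats[partner], lam)
    new_text = mix_features(text_feats[index], text_feats[partner], lam)
    return new_video, new_text


if __name__ == "__main__":
    # Toy pre-extracted features: three clips, two of which share a label.
    videos = [[0.0, 0.0], [2.0, 4.0], [10.0, 10.0]]
    captions = [[1.0], [3.0], [9.0]]
    classes = ["cut", "cut", "wash"]
    print(augment_pair(videos, captions, classes, 0, random.Random(0)))
```

Because the function only touches pre-extracted feature vectors, the raw clips never need to be loaded or shared, which is the practical advantage the abstract describes.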


