Learning the Best Pooling Strategy for Visual Semantic Embedding

11/09/2020
by Jiacheng Chen, et al.

Visual Semantic Embedding (VSE) is a dominant approach for vision-language retrieval, which aims at learning a deep embedding space such that visual data are embedded close to their semantic text labels or descriptions. Recent VSE models use complex methods to better contextualize and aggregate multi-modal features into holistic embeddings. However, we discover that surprisingly simple (but carefully selected) global pooling functions (e.g., max pooling) outperform those complex models, across different feature extractors. Despite its simplicity and effectiveness, seeking the best pooling function for each data modality and feature extractor is costly and tedious, especially when the size of the feature set varies (e.g., for text and video). Therefore, we propose a Generalized Pooling Operator (GPO), which learns to automatically adapt itself to the best pooling strategy for different features, requiring no manual tuning while remaining effective and efficient. We extend the VSE model with the proposed GPO and denote it as VSE∞. Without bells and whistles, VSE∞ outperforms previous VSE methods significantly on image-text retrieval benchmarks across popular feature extractors. With a simple adaptation, variants of VSE∞ further demonstrate its strength by achieving the new state of the art on two video-text retrieval datasets. Comprehensive experiments and visualizations confirm that GPO always discovers the best pooling strategy and can be a plug-and-play feature aggregation module for standard VSE models.
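The abstract describes the Generalized Pooling Operator (GPO) only at a high level, so a short sketch may help make the idea concrete: pool a variable-size set of feature vectors into one holistic embedding by taking a learned, position-wise weighted sum over the sorted feature values, so that uniform coefficients behave like mean pooling and coefficients concentrated on the first position behave like max pooling. The PyTorch module below is a minimal illustration, not the authors' implementation; the name GeneralizedPooling, the fixed max_set_size coefficient table, and the softmax normalization are hypothetical simplifications (the paper instead generates the coefficients with a small sequence model so that arbitrary set sizes are handled without manual tuning).

import torch
import torch.nn as nn


class GeneralizedPooling(nn.Module):
    """Sketch of a GPO-style pooling layer (hypothetical simplification).

    Each feature dimension is sorted in descending order across the set,
    and the pooled value is a learned weighted sum over the sorted
    positions. Uniform weights act like mean pooling; weight massed on
    the first position acts like max pooling.
    """

    def __init__(self, max_set_size: int = 128):
        super().__init__()
        # One learnable coefficient per sorted position (assumption:
        # a fixed-size table instead of the paper's weight generator).
        # Zero init -> uniform softmax -> starts out as mean pooling.
        self.coeffs = nn.Parameter(torch.zeros(max_set_size))

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # features: (batch, num_items, dim), e.g. region or word features
        n = features.size(1)
        # Sort every feature dimension independently, largest first.
        sorted_feats, _ = features.sort(dim=1, descending=True)
        # Normalize the first n coefficients so they sum to one.
        w = torch.softmax(self.coeffs[:n], dim=0)            # (n,)
        return (sorted_feats * w.view(1, n, 1)).sum(dim=1)   # (batch, dim)


# Usage sketch: pool 36 region features per image into one embedding.
pool = GeneralizedPooling()
image_regions = torch.randn(4, 36, 1024)
holistic = pool(image_regions)   # shape: (4, 1024)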
