Paying Attention to Multiscale Feature Maps in Multimodal Image Matching

03/20/2021
by Aviad Moreshet, et al.

We propose an attention-based approach to multimodal image patch matching in which a Transformer encoder attends to the feature maps of a multiscale Siamese CNN. The encoder is shown to efficiently aggregate multiscale image embeddings while emphasizing task-specific, appearance-invariant image cues. We also introduce an attention-residual architecture, in which a residual connection bypasses the encoder. This additional learning signal facilitates end-to-end training from scratch. Our approach is experimentally shown to achieve new state-of-the-art accuracy on both multimodal and single-modality benchmarks, illustrating its general applicability. To the best of our knowledge, this is the first successful application of the Transformer encoder architecture to the multimodal image patch matching task.
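The attention-residual idea summarized above can be illustrated with a small, dependency-free sketch. This is not the authors' implementation: the single attention head, the identity query/key/value projections, the mean pooling, and every function name below are assumptions made for illustration, and the CNN that would produce the multiscale feature-map tokens is replaced by hand-written vectors.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens: List[Vector]) -> List[Vector]:
    """Single-head scaled dot-product self-attention.

    Assumption: identity Q/K/V projections (a real encoder learns them).
    """
    d = len(tokens[0])
    attended = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights = softmax(scores)
        attended.append(
            [sum(w * v[i] for w, v in zip(weights, tokens)) for i in range(d)]
        )
    return attended


def mean_pool(tokens: List[Vector]) -> Vector:
    """Average a set of token vectors into a single vector."""
    n = len(tokens)
    return [sum(t[i] for t in tokens) / n for i in range(len(tokens[0]))]


def attention_residual_embedding(tokens: List[Vector]) -> Vector:
    """Pooled attention output plus a residual path that bypasses attention."""
    attended = mean_pool(self_attention(tokens))
    bypass = mean_pool(tokens)  # hypothetical form of the residual connection
    return [a + b for a, b in zip(attended, bypass)]


def l2_distance(u: Vector, v: Vector) -> float:
    """Euclidean distance between two patch embeddings."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Siamese usage: the same function (shared weights) embeds both patches.
# The tokens stand in for flattened feature-map entries from several scales.
patch_a = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
patch_b = [[0.9, 0.1], [0.1, 0.9], [0.5, 0.5]]
distance = l2_distance(
    attention_residual_embedding(patch_a),
    attention_residual_embedding(patch_b),
)
```

In this sketch a smaller `distance` would indicate a more likely match. The bypass term gives the embedding a direct dependence on the input tokens that does not pass through the attention weights, which is one way to read the "additional learning signal" described above.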


