Speech-Image Semantic Alignment Does Not Depend on Any Prior Classification Tasks

10/29/2020
by Masood S. Mortazavi, et al.

Semantically aligned (speech, image) datasets can be used to explore "visually grounded speech". In the majority of existing investigations, image features are extracted using neural networks pre-trained on other tasks (e.g., classification on ImageNet). In still others, pre-trained networks are used to extract audio features prior to semantic embedding. Without such transfer learning, whether through pre-trained initialization or pre-trained feature extraction, previous results have tended to show low recall rates in speech → image and image → speech queries. By choosing appropriate neural architectures for the encoders in the speech and image branches and by using large datasets, one can obtain competitive recall rates without relying on any pre-trained initialization or feature extraction. (Speech, image) semantic alignment and speech → image and image → speech retrieval are thus canonical tasks worthy of investigation in their own right, and they allow one to explore other questions; for example, the size of the audio embedder can be reduced significantly with little loss of recall in speech → image and image → speech queries.
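To make the retrieval metric concrete, here is a minimal sketch of how recall at K could be computed for speech → image and image → speech queries. It is an illustration under stated assumptions, not the authors' evaluation code: the function name `recall_at_k`, the use of a dot product as the similarity score, the convention that the i-th speech embedding is paired with the i-th image embedding, and the toy 2-dimensional embeddings are all hypothetical.

```python
from typing import Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Dot product, used here as the (assumed) similarity score."""
    return sum(a * b for a, b in zip(u, v))


def recall_at_k(
    queries: Sequence[Sequence[float]],
    candidates: Sequence[Sequence[float]],
    k: int,
) -> float:
    """Fraction of queries whose paired candidate (same index) ranks in the top k."""
    hits = 0
    for i, query in enumerate(queries):
        scores = [dot(query, cand) for cand in candidates]
        # Rank candidate indices from most to least similar to this query.
        ranked = sorted(range(len(candidates)), key=lambda j: scores[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(queries)


# Toy embeddings (hypothetical): speech[i] is assumed to describe image[i].
speech = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.9]]
image = [[1.0, 0.0], [0.0, 1.0], [0.3, 0.2]]

speech_to_image_r1 = recall_at_k(speech, image, k=1)  # third query misses at k=1
speech_to_image_r3 = recall_at_k(speech, image, k=3)
image_to_speech_r1 = recall_at_k(image, speech, k=1)
```

Swapping the `queries` and `candidates` arguments switches the query direction, which is why the two directions can yield different recall rates on the same set of pairs.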


