Composing General Audio Representation by Fusing Multilayer Features of a Pre-trained Model

05/17/2022
by Daisuke Niizumi, et al.

Many application studies rely on audio DNN models pre-trained on a large-scale dataset as essential feature extractors, and they extract features from the last layers. In this study, we focus on our finding that the middle-layer features of existing supervised pre-trained models are more effective than the late-layer features for some tasks. We propose a simple approach to compose features effective for general-purpose applications, consisting of two steps: (1) calculating feature vectors along the time frame from middle/late-layer outputs, and (2) fusing them. This approach improves the utility of frequency and channel information in downstream processes, and combines the effectiveness of middle- and late-layer features for different tasks. As a result, the feature vectors become effective for general purposes. In experiments using VGGish, PANNs' CNN14, and AST on nine downstream tasks, we first show that each layer output of these models serves different tasks. Then, we demonstrate that the proposed approach significantly improves their performance and brings it to a level comparable to that of the state-of-the-art. In particular, performance on the non-semantic speech (NOSS) tasks greatly improves, especially on Speech Commands V2 with VGGish, which gains +77.1 points.
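The two steps described in the abstract can be illustrated with a minimal NumPy sketch. This is a hypothetical simplification, not the authors' exact implementation: it assumes each layer output is a `(channels, freq, time)` feature map and that the fused layers share the same number of time frames (real models may require temporal resampling before fusion).

```python
import numpy as np

def frame_features(layer_out):
    # Step 1: build one feature vector per time frame by flattening
    # the channel and frequency axes while preserving the time axis.
    c, f, t = layer_out.shape
    return layer_out.reshape(c * f, t).T          # -> (time, c * f)

def fuse_layers(layer_outputs):
    # Step 2: fuse middle/late layer features by concatenating their
    # per-frame vectors along the feature dimension.
    # Assumes all layers have matching time frames (illustrative only).
    frames = [frame_features(x) for x in layer_outputs]
    return np.concatenate(frames, axis=1)         # -> (time, sum of c * f)

# Hypothetical example: a middle and a late layer with 10 time frames each.
mid = np.random.randn(64, 8, 10)    # middle layer: 64 ch x 8 freq x 10 frames
late = np.random.randn(128, 4, 10)  # late layer: 128 ch x 4 freq x 10 frames
fused = fuse_layers([mid, late])
print(fused.shape)                  # (10, 1024): 64*8 + 128*4 features per frame
```

A clip-level representation can then be obtained by pooling the fused per-frame vectors over time (e.g., mean pooling), which is a common choice for downstream linear evaluation.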


