Florian Metze

research

∙ 12/11/2022

Error-aware Quantization through Noise Tempering

Quantization has become a predominant approach for model compression, en...

0 Zheng Wang, et al. ∙

research

∙ 10/27/2022

Token-level Sequence Labeling for Spoken Language Understanding using Compositional End-to-End Models

End-to-end spoken language understanding (SLU) systems are gaining popul...

0 Siddhant Arora, et al. ∙

research

∙ 10/13/2022

SQuAT: Sharpness- and Quantization-Aware Training for BERT

Quantization is an effective technique to reduce memory footprint, infer...

0 Zheng Wang, et al. ∙

research

∙ 10/11/2022

CTC Alignments Improve Autoregressive Translation

Connectionist Temporal Classification (CTC) is a widely used approach fo...

0 Brian Yan, et al. ∙

research

∙ 09/06/2022

ASR2K: Speech Recognition for Around 2000 Languages without Audio

Most recent speech recognition models rely on large supervised datasets,...

0 Xinjian Li, et al. ∙

research

∙ 07/13/2022

Masked Autoencoders that Listen

This paper studies a simple extension of image-based Masked Autoencoders...

3 Po-Yao Huang, et al. ∙

research

∙ 06/07/2022

LegoNN: Building Modular Encoder-Decoder Models

State-of-the-art encoder-decoder models (e.g. for machine translation (M...

0 Siddharth Dalmia, et al. ∙

research

∙ 05/24/2022

On Advances in Text Generation from Images Beyond Captioning: A Case Study in Self-Rationalization

Integrating vision and language has gained notable attention following t...

0 Shruti Palaskar, et al. ∙

research

∙ 05/06/2022

Robustness of Neural Architectures for Audio Event Detection

Traditionally, in Audio Recognition pipeline, noise is suppressed by the...

0 Juncheng B. Li, et al. ∙

research

∙ 03/25/2022

AudioTagging Done Right: 2nd comparison of deep learning methods for environmental sound classification

After its sweeping success in vision and language tasks, pure attention-...

0 Juncheng B. Li, et al. ∙

research

∙ 03/23/2022

On Adversarial Robustness of Large-scale Audio Visual Learning

As audio-visual systems are being deployed for safety-critical tasks suc...

0 Juncheng B. Li, et al. ∙

research

∙ 10/12/2021

Speech Summarization using Restricted Self-Attention

Speech summarization is typically performed by using a cascade of speech...

0 Roshan Sharma, et al. ∙

research

∙ 09/28/2021

VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding

We present VideoCLIP, a contrastive approach to pre-train a unified mode...

0 Hu Xu, et al. ∙

research

∙ 07/24/2021

Differentiable Allophone Graphs for Language-Universal Speech Recognition

Building language-universal speech recognition systems entails producing...

0 Brian Yan, et al. ∙

research

∙ 06/29/2021

Rethinking End-to-End Evaluation of Decomposable Tasks: A Case Study on Spoken Language Understanding

Decomposable tasks are complex and comprise a hierarchy of sub-tasks....

0 Siddhant Arora, et al. ∙

research

∙ 05/20/2021

VLM: Task-agnostic Video-Language Model Pre-training for Video Understanding

We present a simplified, task-agnostic multi-modal pre-training approach...

0 Hu Xu, et al. ∙

research

∙ 05/02/2021

Searchable Hidden Intermediates for End-to-End Models of Decomposable Sequence Tasks

End-to-end approaches for sequence tasks are becoming increasingly popul...

7 Siddharth Dalmia, et al. ∙

research

∙ 04/13/2021

Self-supervised object detection from audio-visual correspondence

We tackle the problem of learning object detectors without supervision. ...

0 Triantafyllos Afouras, et al. ∙

research

∙ 03/18/2021

Space-Time Crop Attend: Improving Cross-modal Video Representation Learning

The quality of the image representations obtained from self-supervised l...

7 Mandela Patrick, et al. ∙

research

∙ 03/16/2021

Multilingual Multimodal Pre-training for Zero-Shot Cross-Lingual Transfer of Vision-Language Models

This paper studies zero-shot cross-lingual transfer of vision-language m...

7 Po-Yao Huang, et al. ∙

research

∙ 11/15/2020

Audio-Visual Event Recognition through the lens of Adversary

As audio/visual classification models are widely deployed for sensitive ...

0 Juncheng B. Li, et al. ∙

research

∙ 10/16/2020

Multimodal Speech Recognition with Unstructured Audio Masking

Visual context has been shown to be useful for automatic speech recognit...

16 Tejas Srinivasan, et al. ∙

research

∙ 10/06/2020

Support-set bottlenecks for video-text representation learning

The dominant paradigm for learning video-text representations – noise co...

1 Mandela Patrick, et al. ∙

research

∙ 10/05/2020

Fine-Grained Grounding for Multimodal Speech Recognition

Multimodal automatic speech recognition systems integrate information fr...

0 Tejas Srinivasan, et al. ∙

research

∙ 09/12/2020

Revisiting Factorizing Aggregated Posterior in Learning Disentangled Representations

In the problem of learning disentangled representations, one of the prom...

5 Ze Cheng, et al. ∙

research

∙ 08/18/2020

How2Sign: A Large-scale Multimodal Dataset for Continuous American Sign Language

Sign Language is the primary means of communication for the majority of ...

20 Amanda Duarte, et al. ∙

research

∙ 04/17/2020

AlloVera: A Multilingual Allophone Database

We introduce a new resource, AlloVera, which provides mappings from 218 ...

0 David R. Mortensen, et al. ∙

research

∙ 03/13/2020

ASR Error Correction and Domain Adaptation Using Machine Translation

Off-the-shelf pre-trained Automatic Speech Recognition (ASR) systems are...

7 Anirudh Mani, et al. ∙

research

∙ 02/26/2020

Universal Phone Recognition with a Multilingual Allophone System

Multilingual models can improve language processing, particularly for lo...

0 Xinjian Li, et al. ∙

research

∙ 02/26/2020

Towards Zero-shot Learning for Automatic Phonemic Transcription

Automatic phonemic transcription tools are useful for low-resource langu...

0 Xinjian Li, et al. ∙

research

∙ 02/13/2020

Looking Enhances Listening: Recovering Missing Speech Using Images

Speech is understood better by using visual context; for this reason, th...

0 Tejas Srinivasan, et al. ∙

research

∙ 01/29/2020

Gun Source and Muzzle Head Detection

There is a surging need across the world for protection against gun viol...

16 Zhong Zhou, et al. ∙

research

∙ 11/09/2019

Enforcing Encoder-Decoder Modularity in Sequence-to-Sequence Models

Inspired by modular software design principles of independence, intercha...

0 Siddharth Dalmia, et al. ∙

research

∙ 11/04/2019

On Compositionality in Neural Machine Translation

We investigate two specific manifestations of compositionality in Neural...

0 Vikas Raunak, et al. ∙

research

∙ 10/31/2019

Adversarial Music: Real World Audio Adversary Against Wake-word Detection System

Voice Assistants (VAs) such as Amazon Alexa or Google Assistant rely on ...

0 Juncheng B. Li, et al. ∙

research

∙ 10/27/2019

Multitask Learning For Different Subword Segmentations In Neural Machine Translation

In Neural Machine Translation (NMT) the usage of subwords and characters...

0 Tejas Srinivasan, et al. ∙

research

∙ 10/07/2019

On Leveraging the Visual Modality for Neural Machine Translation

Leveraging the visual modality effectively for Neural Machine Translatio...

0 Vikas Raunak, et al. ∙

research

∙ 10/05/2019

On Dimensional Linguistic Properties of the Word Embedding Space

Word embeddings have become a staple of several natural language process...

0 Vikas Raunak, et al. ∙

research

∙ 08/02/2019

SANTLR: Speech Annotation Toolkit for Low Resource Languages

While low resource speech recognition has attracted a lot of attention f...

0 Xinjian Li, et al. ∙

research

∙ 08/02/2019

Multilingual Speech Recognition with Corpus Relatedness Sampling

Multilingual acoustic models have been successfully applied to low-resou...

0 Xinjian Li, et al. ∙

research

∙ 07/24/2019

Cross-Attention End-to-End ASR for Two-Party Conversations

We present an end-to-end speech recognition model that learns interactio...

0 Suyoun Kim, et al. ∙

research

∙ 06/30/2019

Analyzing Utility of Visual Context in Multimodal Speech Recognition Under Noisy Conditions

Multimodal learning allows us to leverage information from multiple sour...

0 Tejas Srinivasan, et al. ∙

research

∙ 06/27/2019

Gated Embeddings in End-to-End Speech Recognition for Conversational-Context Fusion

We present a novel conversational-context aware end-to-end speech recogn...

0 Suyoun Kim, et al. ∙

research

∙ 06/19/2019

Multimodal Abstractive Summarization for How2 Videos

In this paper, we study abstractive summarization for open-domain videos...

0 Shruti Palaskar, et al. ∙

research

∙ 06/13/2019

Grounding Object Detections With Transcriptions

A vast amount of audio-visual data is available on the Internet thanks t...

0 Yasufumi Moriya, et al. ∙

research

∙ 05/21/2019

Acoustic-to-Word Models with Conversational Context Information

Conversational context information, higher-level knowledge that spans ac...

0 Suyoun Kim, et al. ∙

research

∙ 02/24/2019

The ARIEL-CMU Systems for LoReHLT18

This paper describes the ARIEL-CMU submissions to the Low Resource Human...

0 Aditi Chaudhary, et al. ∙

research

∙ 02/20/2019

Phoneme Level Language Models for Sequence Based Low Resource ASR

Building multilingual and crosslingual models helps bring different langu...

0 Siddharth Dalmia, et al. ∙

research

∙ 02/18/2019

Learned In Speech Recognition: Contextual Acoustic Word Embeddings

End-to-end acoustic-to-word speech recognition models have recently gain...

0 Shruti Palaskar, et al. ∙

research

∙ 11/21/2018

Learning from Multiview Correlations in Open-Domain Videos

An increasing number of datasets contain multiple views, such as video, ...

0 Nils Holzenberger, et al. ∙

