
-
Audio-Visual Event Recognition through the lens of Adversary
As audio/visual classification models are widely deployed for sensitive ...
-
Multimodal Speech Recognition with Unstructured Audio Masking
Visual context has been shown to be useful for automatic speech recognit...
-
Support-set bottlenecks for video-text representation learning
The dominant paradigm for learning video-text representations – noise co...
-
Fine-Grained Grounding for Multimodal Speech Recognition
Multimodal automatic speech recognition systems integrate information fr...
-
Revisiting Factorizing Aggregated Posterior in Learning Disentangled Representations
In the problem of learning disentangled representations, one of the prom...
-
How2Sign: A Large-scale Multimodal Dataset for Continuous American Sign Language
Sign Language is the primary means of communication for the majority of ...
-
AlloVera: A Multilingual Allophone Database
We introduce a new resource, AlloVera, which provides mappings from 218 ...
-
ASR Error Correction and Domain Adaptation Using Machine Translation
Off-the-shelf pre-trained Automatic Speech Recognition (ASR) systems are...
-
Universal Phone Recognition with a Multilingual Allophone System
Multilingual models can improve language processing, particularly for lo...
-
Towards Zero-shot Learning for Automatic Phonemic Transcription
Automatic phonemic transcription tools are useful for low-resource langu...
-
Looking Enhances Listening: Recovering Missing Speech Using Images
Speech is understood better by using visual context; for this reason, th...
-
Gun Source and Muzzle Head Detection
There is a surging need across the world for protection against gun viol...
-
Enforcing Encoder-Decoder Modularity in Sequence-to-Sequence Models
Inspired by modular software design principles of independence, intercha...
-
On Compositionality in Neural Machine Translation
We investigate two specific manifestations of compositionality in Neural...
-
Adversarial Music: Real World Audio Adversary Against Wake-word Detection System
Voice Assistants (VAs) such as Amazon Alexa or Google Assistant rely on ...
-
Multitask Learning For Different Subword Segmentations In Neural Machine Translation
In Neural Machine Translation (NMT) the usage of subwords and characters...
-
On Leveraging the Visual Modality for Neural Machine Translation
Leveraging the visual modality effectively for Neural Machine Translatio...
-
On Dimensional Linguistic Properties of the Word Embedding Space
Word embeddings have become a staple of several natural language process...
-
SANTLR: Speech Annotation Toolkit for Low Resource Languages
While low resource speech recognition has attracted a lot of attention f...
-
Multilingual Speech Recognition with Corpus Relatedness Sampling
Multilingual acoustic models have been successfully applied to low-resou...
-
Cross-Attention End-to-End ASR for Two-Party Conversations
We present an end-to-end speech recognition model that learns interactio...
-
Analyzing Utility of Visual Context in Multimodal Speech Recognition Under Noisy Conditions
Multimodal learning allows us to leverage information from multiple sour...
-
Gated Embeddings in End-to-End Speech Recognition for Conversational-Context Fusion
We present a novel conversational-context aware end-to-end speech recogn...
-
Multimodal Abstractive Summarization for How2 Videos
In this paper, we study abstractive summarization for open-domain videos...
-
Grounding Object Detections With Transcriptions
A vast amount of audio-visual data is available on the Internet thanks t...
-
Acoustic-to-Word Models with Conversational Context Information
Conversational context information, higher-level knowledge that spans ac...
-
The ARIEL-CMU Systems for LoReHLT18
This paper describes the ARIEL-CMU submissions to the Low Resource Human...
-
Phoneme Level Language Models for Sequence Based Low Resource ASR
Building multilingual and crosslingual models helps bring different langu...
-
Learned In Speech Recognition: Contextual Acoustic Word Embeddings
End-to-end acoustic-to-word speech recognition models have recently gain...
-
Learning from Multiview Correlations in Open-Domain Videos
An increasing number of datasets contain multiple views, such as video, ...
-
Multimodal Grounding for Sequence-to-Sequence Speech Recognition
Humans are capable of processing speech by making use of multiple sensor...
-
How2: A Large-scale Dataset for Multimodal Language Understanding
In this paper, we introduce How2, a multimodal collection of instruction...
-
Connectionist Temporal Localization for Sound Event Detection with Sequential Labeling
Research on sound event detection (SED) with weak labeling has mostly fo...
-
A comparison of five multiple instance learning pooling functions for sound event detection with weak labeling
Sound event detection (SED) entails two subtasks: recognizing what types...
-
Activity Recognition on a Large Scale in Short Videos - Moments in Time Dataset
Moments capture a huge part of our lives. Accurate recognition of these ...
-
Dialog-context aware end-to-end speech recognition
Existing speech recognition systems are typically built at the sentence ...
-
Domain Robust Feature Extraction for Rapid Low Resource ASR Development
Developing a practical speech recognizer for a low resource language is ...
-
Acoustic-to-Word Recognition with Sequence-to-Sequence Models
Acoustic-to-Word recognition provides a straightforward solution to end-...
-
Hierarchical Multi Task Learning With CTC
In Automatic Speech Recognition, it is still challenging to learn useful...
-
End-to-End Multimodal Speech Recognition
Transcription or sub-titling of open-domain videos is still a challengin...
-
Comparing the Max and Noisy-Or Pooling Functions in Multiple Instance Learning for Weakly Supervised Sequence Learning Tasks
Many sequence learning tasks require the localization of certain events ...
-
Sequence-based Multi-lingual Low Resource Speech Recognition
Techniques for multi-lingual and cross-lingual speech recognition can he...
-
Linguistic unit discovery from multi-modal inputs in unwritten languages: Summary of the "Speaking Rosetta" JSALT 2017 Workshop
We summarize the accomplishments of a multi-disciplinary workshop explor...
-
A Light-Weight Multimodal Framework for Improved Environmental Audio Tagging
The lack of strong labels has severely limited the state-of-the-art full...
-
Multiple Instance Deep Learning for Weakly Supervised Small-Footprint Audio Event Detection
State-of-the-art audio event detection (AED) systems rely on supervised ...
-
Multiple Instance Deep Learning for Weakly Supervised Audio Event Detection
State-of-the-art audio event detection (AED) systems rely on supervised ...
-
Subword and Crossword Units for CTC Acoustic Models
This paper proposes a novel approach to create a unit set for CTC based...
-
Visual Features for Context-Aware Speech Recognition
Automatic transcriptions of consumer-generated multi-media content such ...
-
Annotating High-Level Structures of Short Stories and Personal Anecdotes
Stories are a vital form of communication in human culture; they are emp...
-
Comparison of Decoding Strategies for CTC Acoustic Models
Connectionist Temporal Classification has recently attracted a lot of in...