Shinji Watanabe

research

∙ 09/20/2023

Incremental Blockwise Beam Search for Simultaneous Speech Translation with Controllable Quality-Latency Tradeoff

Blockwise self-attentional encoder models have recently emerged as one p...

0 Peter Polák, et al. ∙

research

∙ 09/19/2023

Semi-Autoregressive Streaming ASR With Label Context

Non-autoregressive (NAR) modeling has gained significant interest in spe...

0 Siddhant Arora, et al. ∙

research

∙ 09/19/2023

AV-SUPERB: A Multi-Task Evaluation Benchmark for Audio-Visual Representation Models

Audio-visual representation learning aims to develop systems with human-...

0 Yuan Tseng, et al. ∙

research

∙ 09/18/2023

Dynamic-SUPERB: Towards A Dynamic, Collaborative, and Comprehensive Instruction-Tuning Benchmark for Speech

Text language models have shown remarkable zero-shot capability in gener...

0 Chien-yu Huang, et al. ∙

research

∙ 09/16/2023

Decoder-only Architecture for Speech Recognition with CTC Prompts and Text Data Augmentation

Collecting audio-text pairs is expensive; however, it is much easier to ...

0 Emiru Tsunoo, et al. ∙

research

∙ 09/15/2023

The Multimodal Information Based Speech Processing (MISP) 2023 Challenge: Audio-Visual Target Speaker Extraction

Previous Multimodal Information based Speech Processing (MISP) challenge...

0 Shilong Wu, et al. ∙

research

∙ 09/14/2023

VoxtLM: unified decoder-only models for consolidating speech recognition/synthesis and speech/text continuation tasks

We propose a decoder-only language model, VoxtLM, that can perform four ...

0 Soumi Maiti, et al. ∙

research

∙ 08/19/2023

Bayes Risk Transducer: Transducer with Controllable Alignment Prediction

Automatic speech recognition (ASR) based on transducers is widely used. ...

0 Jinchuan Tian, et al. ∙

research

∙ 07/24/2023

Integration of Frame- and Label-synchronous Beam Search for Streaming Encoder-decoder Speech Recognition

Although frame-based models, such as CTC and transducers, have an affini...

0 Emiru Tsunoo, et al. ∙

research

∙ 07/23/2023

Exploring the Integration of Speech Separation and Recognition with Self-Supervised Learning Representation

Neural speech separation has made remarkable progress and its integratio...

0 Yoshiki Masuyama, et al. ∙

research

∙ 07/20/2023

Integrating Pretrained ASR and LM to Perform Sequence Generation for Spoken Language Understanding

There has been an increased interest in the integration of pretrained sp...

0 Siddhant Arora, et al. ∙

research

∙ 07/17/2023

BASS: Block-wise Adaptation for Speech Summarization

End-to-end speech summarization has been shown to improve performance ov...

0 Roshan Sharma, et al. ∙

research

∙ 06/23/2023

The CHiME-7 DASR Challenge: Distant Meeting Transcription with Multiple Devices in Diverse Scenarios

The CHiME challenges have played a significant role in the development a...

0 Samuele Cornell, et al. ∙

research

∙ 06/11/2023

Reducing Barriers to Self-Supervised Learning: HuBERT Pre-training with Academic Compute

Self-supervised learning (SSL) has led to great strides in speech proces...

0 William Chen, et al. ∙

research

∙ 06/01/2023

Exploration on HuBERT with Multiple Resolutions

Hidden-unit BERT (HuBERT) is a widely-used self-supervised learning (SSL...

0 Jiatong Shi, et al. ∙

research

∙ 05/31/2023

UNSSOR: Unsupervised Neural Speech Separation by Leveraging Over-determined Training Mixtures

In reverberant conditions with multiple concurrent speakers, each microp...

0 Zhong-Qiu Wang, et al. ∙

research

∙ 05/29/2023

Exploration of Efficient End-to-End ASR using Discretized Input from Self-Supervised Learning

Self-supervised learning (SSL) of speech has shown impressive results in...

0 Xuankai Chang, et al. ∙

research

∙ 05/28/2023

DPHuBERT: Joint Distillation and Pruning of Self-Supervised Speech Models

Self-supervised learning (SSL) has achieved notable success in many spee...

0 Yifan Peng, et al. ∙

research

∙ 05/18/2023

Prompting the Hidden Talent of Web-Scale Speech Models for Zero-Shot Task Generalization

We investigate the emergent abilities of the recently proposed web-scale...

0 Puyuan Peng, et al. ∙

research

∙ 05/18/2023

A Comparative Study on E-Branchformer vs Conformer in Speech Recognition, Translation, and Understanding Tasks

Conformer, a convolution-augmented Transformer variant, has become the d...

0 Yifan Peng, et al. ∙

research

∙ 05/18/2023

ML-SUPERB: Multilingual Speech Universal PERformance Benchmark

Speech processing Universal PERformance Benchmark (SUPERB) is a leaderbo...

0 Jiatong Shi, et al. ∙

research

∙ 05/12/2023

Improving Cascaded Unsupervised Speech Translation with Denoising Back-translation

Most of the speech translation models heavily rely on parallel data, whi...

0 Yu-Kuan Fu, et al. ∙

research

∙ 05/02/2023

A Study on the Integration of Pipeline and E2E SLU systems for Spoken Semantic Parsing toward STOP Quality Challenge

Recently there have been efforts to introduce new benchmark tasks for sp...

0 Siddhant Arora, et al. ∙

research

∙ 05/02/2023

The Pipeline System of ASR and NLU with MLM-based Data Augmentation toward STOP Low-resource Challenge

This paper describes our system for the low-resource domain adaptation t...

0 Hayato Futami, et al. ∙

research

∙ 05/01/2023

Joint Modelling of Spoken Language Understanding Tasks with Integrated Dialog History

Most human interactions occur in the form of spoken conversations where ...

0 Siddhant Arora, et al. ∙

research

∙ 04/25/2023

AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head

Large language models (LLMs) have exhibited remarkable capabilities acro...

7 Rongjie Huang, et al. ∙

research

∙ 04/18/2023

Neural Speech Enhancement with Very Low Algorithmic Latency and Complexity via Integrated Full- and Sub-Band Modeling

We propose FSB-LSTM, a novel long short-term memory (LSTM) based archite...

0 Zhong-Qiu Wang, et al. ∙

research

∙ 04/13/2023

Efficient Sequence Transduction by Jointly Predicting Tokens and Durations

This paper introduces a novel Token-and-Duration Transducer (TDT) archit...

0 Hainan Xu, et al. ∙

research

∙ 04/10/2023

Enhancing Speech-to-Speech Translation with Multiple TTS Targets

It has been known that direct speech-to-speech translation (S2ST) models...

0 Jiatong Shi, et al. ∙

research

∙ 04/10/2023

ESPnet-ST-v2: Multipurpose Spoken Language Translation Toolkit

ESPnet-ST-v2 is a revamp of the open-source ESPnet-ST toolkit necessitat...

0 Brian Yan, et al. ∙

research

∙ 03/14/2023

I3D: Transformer architectures with input-dependent dynamic depth for speech recognition

Transformer-based end-to-end speech recognition has achieved great succe...

0 Yifan Peng, et al. ∙

research

∙ 03/11/2023

The Multimodal Information based Speech Processing (MISP) 2022 Challenge: Audio-Visual Diarization and Recognition

The Multi-modal Information based Speech Processing (MISP) challenge aim...

0 Zhe Wang, et al. ∙

research

∙ 03/03/2023

End-to-End Speech Recognition: A Survey

In the last decade of automatic speech recognition (ASR) research, the i...

0 Rohit Prabhavalkar, et al. ∙

research

∙ 02/27/2023

Structured Pruning of Self-Supervised Pre-trained Models for Speech Recognition and Understanding

Self-supervised speech representation learning (SSL) has shown to be eff...

0 Yifan Peng, et al. ∙

research

∙ 02/24/2023

Improving Massively Multilingual ASR With Auxiliary CTC Objectives

Multilingual Automatic Speech Recognition (ASR) models have extended the...

0 William Chen, et al. ∙

research

∙ 02/16/2023

PAAPLoss: A Phonetic-Aligned Acoustic Parameter Loss for Speech Enhancement

Despite rapid advancement in recent years, current speech enhancement mo...

0 Muqiao Yang, et al. ∙

research

∙ 02/16/2023

TAPLoss: A Temporal Acoustic Parameter Loss for Speech Enhancement

Speech enhancement models have greatly progressed in recent years, but s...

0 Yunyang Zeng, et al. ∙

research

∙ 02/15/2023

Multi-Channel Target Speaker Extraction with Refinement: The WavLab Submission to the Second Clarity Enhancement Challenge

This paper describes our submission to the Second Clarity Enhancement Ch...

0 Samuele Cornell, et al. ∙

research

∙ 02/14/2023

Speaker-Independent Acoustic-to-Articulatory Speech Inversion

To build speech processing methods that can handle speech as naturally a...

0 Peter Wu, et al. ∙

research

∙ 02/08/2023

A Vector Quantized Approach for Text to Speech Synthesis on Real-World Spontaneous Speech

Recent Text-to-Speech (TTS) systems trained on reading or acted corpora ...

0 Li-Wei Chen, et al. ∙

research

∙ 01/30/2023

Learning to Speak from Text: Zero-Shot Multilingual Text-to-Speech with Unsupervised Text Pretraining

While neural text-to-speech (TTS) has achieved human-like natural synthe...

0 Takaaki Saeki, et al. ∙

research

∙ 12/21/2022

4D ASR: Joint modeling of CTC, Attention, Transducer, and Mask-Predict decoders

The network architecture of end-to-end (E2E) automatic speech recognitio...

0 Yui Sudo, et al. ∙

research

∙ 12/20/2022

SLUE Phase-2: A Benchmark Suite of Diverse Spoken Language Understanding Tasks

Spoken language understanding (SLU) tasks have been studied for many dec...

0 Suwon Shon, et al. ∙

research

∙ 12/16/2022

Context-aware Fine-tuning of Self-supervised Speech Models

Self-supervised pre-trained transformers have improved the state of the ...

0 Suwon Shon, et al. ∙

research

∙ 12/15/2022

UnitY: Two-pass Direct Speech-to-speech Translation with Discrete Units

Direct speech-to-speech translation (S2ST), in which all components can ...

2 Hirofumi Inaguma, et al. ∙

research

∙ 12/08/2022

SpeechLMScore: Evaluating speech generation using speech language model

While human evaluation is the most reliable metric for evaluating speech...

0 Soumi Maiti, et al. ∙

research

∙ 11/30/2022

EURO: ESPnet Unsupervised ASR Open-source Toolkit

This paper describes the ESPnet Unsupervised ASR Open-source Toolkit (EU...

0 Dongji Gao, et al. ∙

research

∙ 11/22/2022

TF-GridNet: Integrating Full- and Sub-Band Modeling for Speech Separation

We propose TF-GridNet for speech separation. The model is a novel multi-...

0 Zhong-Qiu Wang, et al. ∙

research

∙ 11/16/2022

Streaming Joint Speech Recognition and Disfluency Detection

Disfluency detection has mainly been solved in a pipeline approach, as p...

0 Hayato Futami, et al. ∙

research

∙ 11/12/2022

A unified one-shot prosody and speaker conversion system with self-supervised discrete speech units

We present a unified system to realize one-shot voice conversion (VC) on...

0 Li-Wei Chen, et al. ∙
