Contrastive learning, which is a powerful technique for learning image-l...
Speaker diarization has gained considerable attention within speech
proc...
In this paper, we explored how to boost speech emotion recognition (SER)...
End-to-end weakly supervised semantic segmentation aims at optimizing a
...
Training speaker-discriminative and robust speaker verification systems
...
Transformer-based pre-trained language models, such as BERT, achieve gre...
Heterogeneous federated multi-task learning (HFMTL) is a federated learn...
Disentangling uncorrelated information in speech utterances is a crucial...
The logic of the hide and seek game (LHS) was proposed to reason about sea...
Automatic open-ended long text generation poses significant challeng...
The recently proposed serialized output training (SOT) simplifies
multi-...
Speaker diarization (SD) is a classic task in speech processing and is cr...
Effective fusion of multi-scale features is crucial for improving speake...
Recently, speaker-attributed automatic speech recognition (SA-ASR) has
a...
For speech interaction, voice activity detection (VAD) is often used as ...
Prior studies diagnose the anisotropy problem in sentence representation...
Since the number of incident energies is limited, it is difficult to dir...
Transformer-based models have significantly advanced natural language
pr...
Meetings are increasingly important for collaborations. Action items in
...
Listening to long video/audio recordings from video conferencing and onl...
ICASSP2023 General Meeting Understanding and Generation Challenge (MUG)
...
Learning on massive speech corpora has led to the recent succes...
Time delay neural network (TDNN) has been proven to be efficient for spe...
Masked Language Modeling (MLM) is widely used to pretrain language model...
The performance of learning-based denoising largely depends on clean
sup...
Marketers employ various online advertising channels to reach customers,...
Multi-modal and multi-hop question answering aims to answer a question b...
The mainstream of the existing approaches for video prediction builds up...
Training robust speaker verification systems without speaker labels has ...
This paper presents a logic of preference and functional dependence (LPF...
Convolutional Neural Networks (CNNs) are widely used in fault diagnosis ...
Classification activation map (CAM), utilizing the classification struct...
Federated learning (FL), an attractive and promising distributed machine...
Speaker embedding has been a fundamental feature for speaker-related tas...
Traditional federated learning (FL) allows clients to collaboratively ...
Weakly supervised object localization (WSOL) focuses on localizing objec...
Expressive text-to-speech (TTS) has become a hot research topic recently...
Weakly supervised object localization (WSOL) relaxes the requirement of ...
To date, live-cell imaging at the nanometer scale remains challenging. E...
Transformer-based models have achieved great success in various NLP, vis...
Neighbor discovery (ND) is a key initial step of network configuration a...
We propose BeamTransformer, an efficient architecture to leverage
beamfo...
Automatically detecting software vulnerabilities in source code is an
im...
Although video forecasting has been a widely explored topic in recent yea...
Transcripts generated by automatic speech recognition (ASR) systems for
...
The concept of Federated Learning has emerged as a convergence of distri...
We propose a novel Synergistic Attention Network (SA-Net) to address the...
In the traditional cascading architecture for spoken language understand...
Punctuation prediction for automatic speech recognition (ASR) output
tra...
Adversarial transferability is an intriguing property of adversarial exa...