Equivariant Transformers such as Equiformer have demonstrated the effica...
3D-related inductive biases like translational invariance and rotational...
Are end-to-end text-to-speech (TTS) models over-parametrized? To what ex...
The Vision Transformer (ViT) demonstrates that the Transformer for natural langu...
Recent work on speech self-supervised learning (speech SSL) demonstrated...
Neural architecture search (NAS) typically consists of three main steps:...
Aiming at inferring 3D shapes from 2D images, 3D shape reconstruction ha...