Filler words like “um" or “uh" are common in spontaneous speech. It is
d...
Video understanding tasks have traditionally been modeled by two separat...
Exploring dense matching between the current frame and past frames for
l...
This paper presents OmniVL, a new foundation model to support both
image...
This paper proposes a new "decompose-and-edit" paradigm for the text-bas...
Human vision possesses a special type of visual processing systems calle...
Attention mechanism has been widely believed as the key to success of vi...
Given a piece of speech and its transcript text, text-based speech editi...
Transformers have sprung up in the field of computer vision. In this wor...
Convolutional neural networks (CNN) are the dominant deep neural network...
Advanced self-supervised visual representation learning methods rely on ...
Deep convolutional neural networks (CNN) have achieved astonishing resul...
This paper presents a self-supervised learning framework, named MGF, for...