Make-up temporal video grounding (MTVG) aims to localize the target vide...
In this paper, we present our solution to the MuSe-Personalisation sub-c...
Remote photoplethysmography (rPPG) based physiological measurement is an...
Deep-learning-based object detection methods show promise for improving ...
The video grounding (VG) task aims to locate the queried action or event...
In this paper, we present the solution of our team HFUT-VUT for the Mult...
In this paper, we briefly introduce the solution of our team HFUT-VUT fo...
Audio-Visual Video Parsing is a task to predict the events that occur in...
Visual and audio signals often coexist in natural environments, forming ...
In most E-commerce platforms, whether the displayed items trigger the us...
Graph Convolution Networks (GCNs), with their efficient ability to captu...
Distributed learning such as federated learning or collaborative learnin...
Unsupervised image captioning with no annotations is an emerging challen...
Visual dialog is a challenging task that requires the comprehension of t...