We present DiverseMotion, a new approach for synthesizing high-quality h...
Recent advancements in natural language and Large Language Models (LLMs)...
In real-world scenarios, collected and annotated data often exhibit the
...
Misalignment between the outputs of a vision-language (VL) model and tas...
This paper presents a whitening-based contrastive learning method for
se...
Pre-trained vision-language models are the de-facto foundation models fo...
In this paper, we tackle the problem of sign language translation (SLT)
...
Large-scale pre-training has brought unimodal fields such as computer vi...
Large-scale pre-trained multi-modal models (e.g., CLIP) demonstrate stro...
Temporal grounding is the task of locating a specific segment from an
un...
Video-Language Pre-training models have recently significantly improved
...
Domain adaptation methods reduce domain shift typically by learning
doma...
To build Video Question Answering (VideoQA) systems capable of assisting...
Self-supervised learning makes great progress in large model pre-trainin...
Existing 3D skeleton-based action recognition approaches reach impressiv...
Large-scale vision-language pre-training has shown impressive advances i...
Understanding human emotions is a crucial ability for intelligent robots...
3D pose estimation has recently gained substantial interests in computer...
Recently, large-scale pre-training methods like CLIP have made great pro...
As an important area in computer vision, object tracking has formed two
...
Temporal grounding in videos aims to localize one target video segment t...
To improve the generalization of detectors, for domain adaptive object
d...
Video-and-Language Inference is a recently proposed task for joint
video...
Obtaining viewer responses from videos can be useful for creators and
st...
It has been shown that deep neural networks are prone to overfitting on
...
Text-video retrieval is a challenging task that aims to search relevant ...
Few-shot object detection (FSOD) aims to strengthen the performance of n...
Anticipating actions before they are executed is crucial for a wide rang...
GloVe learns word embeddings by leveraging statistical information from ...
In this paper, we introduce ActBERT for self-supervised learning of join...
Optimal transport is a machine learning technique with applications incl...
In this paper, we tackle the problem of discovering new classes in unlab...
Deep neural networks are known to be susceptible to adversarial noise, w...
In this paper, we study an intermediate form of supervision, i.e.,
singl...
Egocentric video recognition is a natural testbed for diverse interactio...
Most state-of-the-art methods of object detection suffer from poor
gener...
In this work, we propose a generally applicable transformation unit for
...
We propose a novel framework, learning to transfer learn (L2TL), to impr...
In this report, we present the Baidu-UTS submission to the EPIC-Kitchens...
Video classification methods often divide the video into short clips, do...
Predicting future frames in videos has become a promising direction of
r...
Existing methods usually utilize pre-defined criterions, such as p-norm,...
There has been an increasing interest in 3D indoor navigation, where a r...
Prevailing deep convolutional neural networks (CNNs) for person
re-IDent...
We propose a novel attentive sequence to sequence translator (ASST) for ...
Image captioning is a challenging task where the machine automatically
d...
In this paper, we present our solution to Google YouTube-8M Video
Classi...
With the tremendous advances of Convolutional Neural Networks (ConvNets)...
Despite the recent success of neural networks in image feature learning,...