Vision and Language (VL) models offer an effective method for aligning r...
Large-scale pre-trained Vision Language (VL) models have shown remar...
Large pre-trained language models have achieved state-of-the-art results...
Large-scale Vision-Language (VL) models have shown tremendous success in...
Prompt tuning, in which a base pretrained model is adapted to each task ...
Scaling transformers has led to significant breakthroughs in many domain...
Transformers have become the de facto models of choice in machine learni...
Pre-training is an effective technique for ensuring robust performance o...
Computer vision models suffer from a phenomenon known as catastrophic fo...
Vision and Language (VL) models have demonstrated remarkable zero-shot p...
Recently, large-scale pre-trained Vision-and-Language (VL) foundation mo...
We present a new semi-supervised domain adaptation framework that combin...
Foundation Models (FMs) have demonstrated unprecedented capabilities inc...
Designing better machine translation systems by considering auxiliary in...
Pre-training models on ImageNet or other massive datasets of real images...
In this paper, we explore self-supervised audio-visual models that learn...
Selective regression allows abstention from prediction if the confidence...
Unsupervised domain adaptation, which aims to adapt models trained on a l...
Deep convolutional networks have recently achieved great success in vide...
We propose a new perspective on video understanding by casting the video...
The self-attention-based model, the transformer, has recently become the le...
Most existing works in few-shot learning rely on meta-learning the netwo...
Vision transformer (ViT) has recently shown its strong capability in ac...
Multi-modal learning, which focuses on utilizing various modalities to i...
Multimodal self-supervised learning is getting more and more attention a...
Nowadays, there is an abundance of data involving images and surrounding...
The recently developed vision transformer (ViT) has achieved promising r...
Tremendous progress has been made in visual representation learning, not...
Network quantization has rapidly become one of the most widely used meth...
Performing inference on deep learning models for videos remains a challe...
Temporal modelling is the key for efficient video action recognition. Wh...
Learning to recognize actions from only a handful of labeled videos is a...
As machine learning algorithms grow in popularity and diversify to many ...
Partial domain adaptation, which assumes that the unknown target label sp...
Neural Architecture Search (NAS) is a powerful tool to automatically des...
In recent years, a number of approaches based on 2D CNNs and 3D CNNs hav...
The emergence of the Internet of Things (IoT) brings about new security chal...
While machine learning approaches to visual recognition offer great prom...
Supervised deep learning methods are enjoying enormous success in many p...
Action recognition is an open and challenging problem in computer vision...
Most of the existing approaches for person re-identification consider a ...
Neural Architecture Search (NAS) is an open and challenging problem in m...
Most of the existing works in video synthesis focus on generating videos...
Multi-task learning is an open and challenging problem in computer visio...
Recent advances in computer vision and deep learning have led to breakth...
Most existing unsupervised person re-identification methods focus on lear...
Cross-modal retrieval between visual data and natural language descripti...
While machine learning approaches to visual emotion recognition offer gr...
For many applications with limited computation, communication, storage a...
Most video summarization approaches have focused on extracting a summary...