The recent wave of AI-generated content has witnessed the great developm...
In order to reveal the rationale behind model predictions, many works ha...
Unlike language tasks, where the output space is usually limited to a se...
Vision Transformers (ViTs) have achieved overwhelming success, yet they ...
Semi-supervised action recognition is a challenging but critical task du...
An important goal of self-supervised learning is to enable model pre-tra...
Recently, masked image modeling (MIM) has offered a new methodology of s...
Recent literature has shown design strategies from Convolutional Neural ...
This paper presents SimMIM, a simple framework for masked image modeling...
Cross-modal correlation provides an inherent supervision for video unsup...
Vision Transformer (ViT) attains state-of-the-art performance in visual ...
We are witnessing a modeling shift from CNN to Transformers in computer ...
Understanding human driving behaviors quantitatively is critical even in...
Consistent in-focus input imagery is an essential precondition for machi...
Training temporal action detection in videos requires large amounts of l...
We propose a multi-agent based computational framework for modeling deci...
Convolutional Neural Networks (CNNs) are known to rely more on local tex...
Due to the compelling efficiency in retrieval and storage, similarity-pr...
Weakly-supervised temporal action localization is a problem of learning ...
We present a self-supervised learning framework to estimate the individu...
Tremendous variation in the scale of people/head size is a critical prob...
The aim of crowd counting is to estimate the number of people in images ...
Video temporal action detection aims to temporally localize and recogniz...