Vision and Language Models (VLMs), such as CLIP, have enabled visual
rec...
Recently, large-scale pre-trained Vision and Language (VL) models have s...
Large scale Vision-Language (VL) models have shown tremendous success in...
Temporal action segmentation in untrimmed videos has gained increased
at...
We show how the inherent, but often neglected, properties of large-scale...
Existing Multiple Object Tracking (MOT) methods design complex architect...
Although action recognition systems can achieve top performance when
eva...
Test-Time-Training (TTT) is an approach to cope with out-of-distribution...
We propose MATE, the first Test-Time-Training (TTT) method designed for ...
Keyless entry systems in cars are adopting neural networks for localizin...
LiDAR 3D object detection models are inevitably biased towards their tra...
While 3D object detection in LiDAR point clouds is well-established in
a...
Although action recognition has achieved impressive results over recent
...
3D human pose estimation is fundamental to understanding human behavior....
Domain adaptation is crucial to adapt a learned model to new scenarios, ...
In the field of autonomous driving, self-training is widely applied to
m...
We introduce a simple yet effective fusion method of LiDAR and RGB data ...
Automated toll systems rely on proper classification of the passing vehi...
Learning similarity functions between image pairs with deep neural netwo...
Detection of partially occluded objects is a challenging computer vision...