Image captioning, like many tasks involving vision and language, current...
Research in Image Generation has recently made significant progress,
par...
The use of self-supervised pre-training has emerged as a promising appro...
Machine Unlearning has recently been emerging as a paradigm for selectiv...
In this work, we explore massive pre-training on synthetic word images f...
Recent advancements in diffusion models have enabled the generation of
r...
The CLIP model has been recently proven to be very effective for a varie...
The development of embodied agents that can communicate with humans in
n...
Handwritten Text Recognition (HTR) in free-layout pages is a challenging...
Handwritten Text Recognition (HTR) is an open problem at the intersectio...
Image-text matching is gaining a leading role among tasks involving the ...
Image captioning models aim at connecting Vision and Language by providi...
Embodied agents, trained to explore and navigate indoor photorealistic
e...
Embodied AI is a recent research area that aims at creating intelligent
...
Describing images in natural language is a fundamental step towards the
...
While captioning models have obtained compelling results in describing
n...
Exploration of indoor environments has recently experienced a significan...
Recurrent Neural Networks with Long Short-Term Memory (LSTM) make use of...
Connecting Vision and Language plays an essential role in Generative
Int...
Image captioning models have lately shown impressive results when applie...
The research field of Embodied AI has witnessed substantial progress in
...
As the request for deep learning solutions increases, the need for
expla...
The recently proposed action spotting task consists in finding the exact...
In this document, we report our proposal for modeling the risk of possib...
Embodied AI has been recently gaining attention as it aims to foster the...
The joint understanding of vision and language has been recently gaining...
Transformer-based architectures represent the state of the art in sequen...
Spatio-temporal action localization is a challenging yet fascinating tas...
Vision-and-Language Navigation (VLN) is a challenging task in which an a...
The ability to generate natural language explanations conditioned on the...
In Vision-and-Language Navigation (VLN), an embodied agent needs to reac...
Cloud computing data centers are growing in size and complexity to the p...
Current movie captioning architectures are not capable of mentioning
cha...
The applicability of computer vision to real paintings and artworks has ...
Current captioning approaches can describe images using black-box
archit...
Image captioning has been recently gaining a lot of attention thanks to ...
Data-driven saliency has recently gained a lot of attention thanks to th...
The use of Recurrent Neural Networks for video captioning has recently g...
This paper presents a novel approach for temporal and semantic segmentat...
This paper presents a novel deep architecture for saliency prediction.
C...
This paper presents a novel retrieval pipeline for video collections, wh...