Automatic bill payment is an important part of business operations in fi...
Vision and text have been fully explored in contemporary video-text
foun...
Building general-purpose models that can perceive diverse real-world
mod...
In this paper, we propose a Vision-Audio-Language Omni-peRception pretra...
As a combination of visual and audio signals, video is inherently
multi-...
Motion, scene and object are three primary visual components of a video....
In this paper, we propose an Omni-perception Pre-Trainer (OPT) for
cross...
In this paper, we consider the image captioning task from a new
sequence...
Depth information matters in RGB-D semantic segmentation task for provid...
Autoregressive sequence Generation models have achieved state-of-the-art...
Image captioning transforms complex visual information into abstract nat...
Most image captioning models are autoregressive, i.e. they generate each...
Self-attention (SA) network has shown profound value in image captioning...
This document describes our solution for the VATEX Captioning Challenge ...