We present a new formulation for structured information extraction (SIE)...
We propose DocFormerv2, a multi-modal transformer for Visual Document
Un...
In this work, instead of directly predicting the pixel-level segmentatio...
We present YORO - a multi-modal transformer encoder-only architecture fo...
In recent years, the dominant paradigm for text spotting is to combine t...
Text spotting end-to-end methods have recently gained attention in the
l...
We propose a novel multimodal architecture for Scene Text Visual Questio...
We present DocFormer – a multi-modal transformer based architecture for ...
This paper presents results of Document Visual Question Answering Challe...
We present a new dataset for Visual Question Answering on document image...
Deep learning usually achieves the best results with complete supervisio...
While image classification models have recently continued to advance, mo...
Scene Text Recognition (STR), the task of recognizing text against compl...
We propose a new end-to-end trainable model for lossy image compression ...
Recently, there has been much interest in deep learning techniques to do...
Several deep learned lossy compression techniques have been proposed in ...
In this age of social media, people often look at what others are wearin...
Training robust deep video representations has proven to be much more
ch...
Deep embeddings answer one simple question: How similar are two images?
...