Technical Report on Web-based Visual Corpus Construction for Visual Document Understanding

by Donghyun Kim, et al.

We present a dataset generator engine named Web-based Visual Corpus Builder (Webvicob). Webvicob can readily construct a large-scale visual corpus (i.e., images with text annotations) from a raw Wikipedia HTML dump. In this report, we validate that Webvicob-generated data covers a wide range of context and knowledge and helps practitioners build a powerful Visual Document Understanding (VDU) backbone. The proposed engine is publicly available at




