Russian Web Tables: A Public Corpus of Web Tables for Russian Language Based on Wikipedia

10/03/2022
by   Platon Fedorov, et al.

Corpora that contain tabular data, such as WebTables, are a vital resource for the academic community. Essentially, they are the backbone of modern research in information management. They are used for a variety of tasks, including data extraction, knowledge base construction, question answering, column semantic type detection, and many others. Such corpora are useful not only as a source of data, but also as a base for building test datasets. Until now, no such corpus has existed for the Russian language, which has seriously hindered research in these areas. In this paper, we present the first corpus of Web tables built specifically from Russian-language material. It was built with a dedicated toolkit that we developed to crawl the Russian Wikipedia. Both the corpus and the toolkit are open-source and publicly available. Finally, we present a short study that describes Russian Wikipedia tables and their statistics.

research
04/28/2019

OPIEC: An Open Information Extraction Corpus

Open information extraction (OIE) systems extract relations and their ar...
research
08/31/2018

The use of Charts, Pivot Tables, and Array Formulas in two Popular Spreadsheet Corpora

The use of spreadsheets in industry is widespread. Companies base decisi...
research
11/04/2018

ColNet: Embedding the Semantics of Web Tables for Column Type Prediction

Automatically annotating column types with knowledge base (KB) concepts ...
research
12/10/2019

GeBioToolkit: Automatic Extraction of Gender-Balanced Multilingual Corpus of Wikipedia Biographies

We introduce GeBioToolkit, a tool for extracting multilingual parallel c...
research
06/21/2022

WikiDoMiner: Wikipedia Domain-specific Miner

We introduce WikiDoMiner, a tool for automatically generating domain-spe...
research
03/20/2019

On Extracting Data from HTML Tables

The Web provides many data in user-friendly tabular formats that are enc...
research
11/07/2022

Technical Report on Web-based Visual Corpus Construction for Visual Document Understanding

We present a dataset generator engine named Web-based Visual Corpus Buil...
