WebSRC: A Dataset for Web-Based Structural Reading Comprehension

by Lu Chen, et al.

Web search is an essential way for humans to obtain information, but understanding the contents of web pages remains a great challenge for machines. In this paper, we introduce the task of web-based structural reading comprehension. Given a web page and a question about it, the task is to find the answer from the web page. This task requires a system to understand not only the semantics of the text but also the structure of the web page. Moreover, we propose WebSRC, a novel Web-based Structural Reading Comprehension dataset. WebSRC consists of 0.44M question-answer pairs collected from 6.5K web pages with corresponding HTML source code, screenshots, and metadata. Each question in WebSRC requires a certain structural understanding of a web page to answer, and the answer is either a text span on the web page or yes/no. We evaluate various strong baselines on our dataset to show the difficulty of our task. We also investigate the usefulness of structural information and visual features. Our dataset and task are publicly available at
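The task format described in the abstract (a web page, a question, and an answer that is either a text span or yes/no) can be illustrated with a minimal sketch. The `Example` record, its field names, and the helper functions below are hypothetical and chosen only for illustration; they are not the dataset's actual schema or the authors' baseline. The sketch flattens the HTML to text using Python's standard `html.parser` and locates a span answer by exact string match, which ignores exactly the structural information the paper argues is needed.

```python
from dataclasses import dataclass
from html.parser import HTMLParser
from typing import List, Optional, Tuple


@dataclass
class Example:
    """Hypothetical record layout (an assumption, not the real WebSRC schema)."""
    html: str
    question: str
    answer: str  # a text span from the page, or "yes" / "no"


class TextExtractor(HTMLParser):
    """Collects the non-empty text nodes of an HTML document in order."""

    def __init__(self) -> None:
        super().__init__()
        self.chunks: List[str] = []

    def handle_data(self, data: str) -> None:
        text = data.strip()
        if text:
            self.chunks.append(text)


def page_text(html: str) -> str:
    """Flatten a page to a single string, discarding all tag structure."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def locate_answer(example: Example) -> Optional[Tuple[int, int]]:
    """Return (start, end) character offsets of a span answer in the
    flattened text, or None for yes/no answers and spans not found."""
    if example.answer.lower() in {"yes", "no"}:
        return None
    text = page_text(example.html)
    start = text.find(example.answer)
    if start == -1:
        return None
    return start, start + len(example.answer)


# Illustrative page and question, invented for this sketch.
demo = Example(
    html="<table><tr><th>Model</th><th>Price</th></tr>"
         "<tr><td>Aurora</td><td>$199</td></tr></table>",
    question="What is the price of the Aurora?",
    answer="$199",
)
span = locate_answer(demo)
```

Flattening the table loses the link between the "Price" header and the "$199" cell, so a system working only on this text has to guess which value belongs to which column; this is the kind of structural understanding the task is meant to test.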


TIE: Topological Information Enhanced Structural Reading Comprehension on Web Pages

Recently, the structural reading comprehension (SRC) task on web pages h...

ReCO: A Large Scale Chinese Reading Comprehension Dataset on Opinion

This paper presents the ReCO, a human-curated Chinese Reading Comprehensi...

Inferring Missing Categorical Information in Noisy and Sparse Web Markup

Embedded markup of Web pages has seen widespread adoption throughout the...

Cartes auto-organisées pour l'analyse exploratoire de données et la visualisation

This paper shows how to use the Kohonen algorithm to represent multidime...

Handling tree-structured text: parsing directory pages

The determination of the reading sequence of text is fundamental to docu...

Repartitioning of the ComplexWebQuestions Dataset

Recently, Talmor and Berant (2018) introduced ComplexWebQuestions - a da...