WebSRC: A Dataset for Web-Based Structural Reading Comprehension

01/23/2021
by Lu Chen, et al.

Web search is an essential way for humans to obtain information, but it is still a great challenge for machines to understand the contents of web pages. In this paper, we introduce the task of web-based structural reading comprehension. Given a web page and a question about it, the task is to find the answer on the web page. This task requires a system to understand not only the semantics of the text but also the structure of the web page. Moreover, we propose WebSRC, a novel Web-based Structural Reading Comprehension dataset. WebSRC consists of 0.44M question-answer pairs collected from 6.5K web pages, each with its corresponding HTML source code, screenshot, and metadata. Each question in WebSRC requires a certain structural understanding of a web page to answer, and the answer is either a text span on the web page or yes/no. We evaluate various strong baselines on our dataset to show the difficulty of our task. We also investigate the usefulness of structural information and visual features. Our dataset and task are publicly available at https://speechlab-sjtu.github.io/WebSRC/.
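To make the task format concrete, the following is a minimal sketch of one question-answer example and a check that its answer is either yes/no or a text span found on the page. The `Example` record, its field names, and the helpers `page_text` and `is_valid_answer` are hypothetical and chosen for illustration only; they are not the released WebSRC schema or tooling, and the sample HTML is invented.

```python
from dataclasses import dataclass
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects the non-empty text nodes of an HTML document."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)


def page_text(html: str) -> str:
    """Return the page's text nodes joined by single spaces."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


@dataclass
class Example:
    # Hypothetical record layout (assumption), not the WebSRC file format.
    html: str
    question: str
    answer: str


def is_valid_answer(example: Example) -> bool:
    """An answer must be yes/no or a text span that appears on the page."""
    if example.answer.lower() in {"yes", "no"}:
        return True
    return example.answer in page_text(example.html)


# Invented sample page: answering requires linking a row to a column header.
example = Example(
    html=(
        "<table><tr><th>Model</th><th>Price</th></tr>"
        "<tr><td>A1</td><td>$199</td></tr></table>"
    ),
    question="What is the price of model A1?",
    answer="$199",
)
print(is_valid_answer(example))
```

The sample question cannot be answered from the flattened text alone without knowing that "$199" sits under the "Price" header in the "A1" row, which is the kind of structural understanding the task description refers to.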


