WebFormer: The Web-page Transformer for Structure Information Extraction

02/01/2022
by Qifan Wang, et al.

Structure information extraction refers to the task of extracting structured text fields from web pages, such as extracting a product offer from a shopping page, including the product title, description, brand and price. It is an important research topic that has been widely studied in document understanding and web search. Recent natural language models with sequence modeling have demonstrated state-of-the-art performance on web information extraction. However, effectively serializing tokens from unstructured web pages is challenging in practice due to the variety of web layout patterns, and limited work has focused on modeling the web layout for extracting the text fields. In this paper, we introduce WebFormer, a Web-page transFormer model for structure information extraction from web documents. First, we design HTML tokens for each DOM node in the HTML by embedding representations from their neighboring tokens through graph attention. Second, we construct rich attention patterns between HTML tokens and text tokens, which leverage the web layout for effective attention weight computation. We conduct an extensive set of experiments on the SWDE and Common Crawl benchmarks. Experimental results demonstrate the superior performance of the proposed approach over several state-of-the-art methods.
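
To make the idea of HTML tokens attending to their DOM neighbors more concrete, here is a minimal Python sketch. It is not the WebFormer implementation: the abstract does not define the neighborhood or the scoring function, so this sketch assumes that a node's neighbors are itself, its parent, its children and its siblings, and it uses plain dot-product scores over toy embeddings. The names `DomCollector`, `dom_neighbors` and `masked_attention` are hypothetical.

```python
import math
from html.parser import HTMLParser


class DomCollector(HTMLParser):
    """Collects element nodes and parent indices.

    Assumes well-formed HTML without void elements such as <br>.
    """

    def __init__(self):
        super().__init__()
        self.tags = []      # tag name per DOM node, in document order
        self.parents = []   # parent index per node (-1 for the root)
        self._stack = []    # indices of the currently open elements

    def handle_starttag(self, tag, attrs):
        parent = self._stack[-1] if self._stack else -1
        self.tags.append(tag)
        self.parents.append(parent)
        self._stack.append(len(self.tags) - 1)

    def handle_endtag(self, tag):
        if self._stack:
            self._stack.pop()


def dom_neighbors(parents):
    """Assumed neighborhood: the node itself, its parent, children and siblings."""
    n = len(parents)
    neighbors = [{i} for i in range(n)]
    for child, parent in enumerate(parents):
        if parent >= 0:
            neighbors[child].add(parent)
            neighbors[parent].add(child)
    for i in range(n):
        for j in range(n):
            if i != j and parents[i] >= 0 and parents[i] == parents[j]:
                neighbors[i].add(j)
    return neighbors


def masked_attention(query, keys, allowed):
    """Softmax over dot-product scores, restricted to the allowed key indices."""
    scores = {j: sum(q * k for q, k in zip(query, keys[j])) for j in allowed}
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {j: math.exp(s - top) for j, s in scores.items()}
    total = sum(exps.values())
    return {j: e / total for j, e in exps.items()}


page = "<div><h1>Title</h1><ul><li>Brand</li><li>Price</li></ul></div>"
collector = DomCollector()
collector.feed(page)
neighbors = dom_neighbors(collector.parents)

# Toy 2-d embeddings, one per DOM node (made-up values for illustration).
embeddings = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0], [1.0, 1.0], [0.2, 0.8]]

# Attention weights for the <h1> node over its DOM neighbors only.
weights = masked_attention(embeddings[1], embeddings, neighbors[1])
print(collector.tags)
print(sorted(weights))
```

In this toy page the `h1` node attends only to itself, the enclosing `div` and its sibling `ul`, while the two `li` nodes are masked out. A real model would learn the embeddings and would combine several attention patterns across HTML and text tokens, which this sketch does not attempt.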
