The Klarna Product Page Dataset: A Realistic Benchmark for Web Representation Learning

11/03/2021
by Alexandra Hotti, et al.

This paper tackles the under-explored problem of DOM tree element representation learning. We advance the field of machine learning-based web automation, and hope to spur further research in this crucial area, with two contributions. First, we adapt several popular graph-based neural network models and apply them to embed elements in website DOM trees. Second, we present a large-scale and realistic dataset of webpages. By providing this open-access resource, we lower the entry barrier to this area of research. The dataset contains 51,701 manually labeled product pages from 8,175 real e-commerce websites. The pages can be rendered entirely in a web browser and are suitable for computer vision applications. These properties make the dataset substantially richer and more diverse than other datasets proposed for element representation learning, classification and prediction on the web. Finally, using our proposed dataset, we show that the embeddings produced by a Graph Convolutional Neural Network outperform representations produced by other state-of-the-art methods in a web element prediction task.
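To make the idea of embedding DOM tree elements with a graph convolution more concrete, the following is a minimal sketch in plain Python. It is not the authors' model: it assumes the widely used propagation rule ReLU(Â H W) with a symmetrically normalized adjacency matrix, a hypothetical five-node DOM tree, one-hot tag features, and fixed illustrative weights in place of learned ones.

```python
import math

# Hypothetical toy DOM tree: node index -> (tag, parent index or None).
DOM = [
    ("html", None),
    ("body", 0),
    ("div", 1),
    ("span", 2),    # e.g. a price element
    ("button", 2),  # e.g. an add-to-cart button
]

# Vocabulary of tag names, used for one-hot node features (an assumption).
TAGS = sorted({tag for tag, _ in DOM})


def one_hot(tag):
    """Encode a tag name as a one-hot feature vector over TAGS."""
    return [1.0 if t == tag else 0.0 for t in TAGS]


def normalized_adjacency(dom):
    """Build D^{-1/2} (A + I) D^{-1/2} for the undirected parent-child graph."""
    n = len(dom)
    # Start from the identity matrix so every node has a self-loop.
    a = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for child, (_, parent) in enumerate(dom):
        if parent is not None:
            a[child][parent] = a[parent][child] = 1.0
    deg = [sum(row) for row in a]
    return [[a[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]


def matmul(x, y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(xi * yj for xi, yj in zip(row, col)) for col in zip(*y)]
            for row in x]


def gcn_layer(a_hat, h, w):
    """One propagation step: ReLU(A_hat @ H @ W)."""
    z = matmul(matmul(a_hat, h), w)
    return [[max(0.0, v) for v in row] for row in z]


features = [one_hot(tag) for tag, _ in DOM]
# Fixed illustrative weights; a trained model would learn these.
weights = [[0.1 * (i + j + 1) for j in range(2)] for i in range(len(TAGS))]
embeddings = gcn_layer(normalized_adjacency(DOM), features, weights)
```

Each row of `embeddings` is a two-dimensional vector for one DOM element that mixes the element's own tag feature with those of its parent and children. Stacking several such layers would let information flow across more distant parts of the tree.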


research
05/24/2022

PLAtE: A Large-scale Dataset for List Page Web Extraction

Recently, neural models have been leveraged to significantly improve the...
research
06/09/2021

Erratum: Leveraging Flexible Tree Matching to Repair Broken Locators in Web Automation Scripts

Web applications are constantly evolving to integrate new features and f...
research
09/26/2022

Neural-FacTOR: Neural Representation Learning for Website Fingerprinting Attack over TOR Anonymity

TOR (The Onion Router) network is a widely used open source anonymous co...
research
04/27/2018

An Element Sensitive Saliency Model with Position Prior Learning for Web Pages

Understanding human visual attention is important for multimedia applica...
research
08/01/2022

Similarity-based web element localization for robust test automation

Non-robust (fragile) test execution is a commonly reported challenge in ...
research
01/08/2022

Extraction of Product Specifications from the Web – Going Beyond Tables and Lists

E-commerce product pages on the web often present product specification ...
research
01/30/2023

WebUI: A Dataset for Enhancing Visual UI Understanding with Web Semantics

Modeling user interfaces (UIs) from visual information allows systems to...
