Extraction of Product Specifications from the Web – Going Beyond Tables and Lists

E-commerce product pages on the web often present product specification data in structured tabular blocks. Extracting these attribute-value specifications benefits applications such as product catalogue curation, search, and question answering. However, websites render these blocks with a wide variety of HTML elements (such as <table>, <ul>, <div>, <span>, and <dl>), which makes their automatic extraction a challenge. Most current research has focused on extracting product specifications from tables and lists and therefore suffers from low recall when applied in a large-scale extraction setting. In this paper, we present a product specification extraction approach that goes beyond tables and lists and generalizes across the diverse HTML elements used for rendering specification blocks. Using a combination of hand-coded features and deep-learned spatial and token features, we first identify the specification blocks on a product page. We then extract the product attribute-value pairs from these blocks following an approach inspired by wrapper induction. We created a labeled dataset of product specifications extracted from 14,111 diverse specification blocks taken from a range of different product websites. Our experiments show the efficacy of our approach compared to current specification extraction models and support our claim about its applicability to large-scale product specification extraction.
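
To make the second step more concrete, the following is a minimal sketch of tag-agnostic attribute-value pairing. It is not the paper's method: it skips the learned block-identification step entirely and assumes the specification block has already been isolated. It also replaces wrapper induction with a much simpler heuristic, namely pairing adjacent text nodes whose tag paths form the most frequent repeating pattern. The names `TextPathCollector` and `extract_pairs` are hypothetical and chosen for illustration only.

```python
from collections import Counter
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not be pushed
# onto the open-element stack.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class TextPathCollector(HTMLParser):
    """Collects (tag_path, text) for every non-empty text node."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, outermost first
        self.nodes = []   # (tag_path, text) in document order

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = " ".join(data.split())
        if text:
            self.nodes.append(("/".join(self.stack), text))


def extract_pairs(block_html):
    """Pair adjacent text nodes that follow the dominant tag-path pattern.

    Assumption: inside a specification block, the most frequent pair of
    adjacent tag paths is the (attribute, value) pattern.
    """
    collector = TextPathCollector()
    collector.feed(block_html)
    collector.close()
    nodes = collector.nodes

    pattern_counts = Counter(
        (nodes[i][0], nodes[i + 1][0]) for i in range(len(nodes) - 1)
    )
    if not pattern_counts:
        return []
    (attr_path, value_path), _ = pattern_counts.most_common(1)[0]

    pairs, i = [], 0
    while i < len(nodes) - 1:
        if nodes[i][0] == attr_path and nodes[i + 1][0] == value_path:
            pairs.append((nodes[i][1], nodes[i + 1][1]))
            i += 2  # both nodes consumed as one pair
        else:
            i += 1
    return pairs


table_block = ("<table><tr><th>Weight</th><td>1.2 kg</td></tr>"
               "<tr><th>Color</th><td>Black</td></tr></table>")
dl_block = "<dl><dt>Weight</dt><dd>1.2 kg</dd><dt>Color</dt><dd>Black</dd></dl>"
div_block = ("<div><div><span>Weight</span><span>1.2 kg</span></div>"
             "<div><span>Color</span><span>Black</span></div></div>")

for block in (table_block, dl_block, div_block):
    print(extract_pairs(block))
```

Because the pairing depends only on repeated tag-path patterns and not on specific element names, the same function handles the `<table>`, `<dl>`, and nested `<div>`/`<span>` blocks above. Real pages are far noisier (headers, footnotes, multi-value cells), which is where a learned approach such as the one described in the abstract would be needed.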


