SFTM: Fast Comparison of Web Documents using Similarity-based Flexible Tree Matching

04/27/2020
by Sacha Brisset, et al.

Tree matching techniques have been investigated in many fields, including web data mining and extraction, as a key component for analyzing the content of web documents. However, existing tree matching approaches, like Tree-Edit Distance (TED) or Flexible Tree Matching (FTM), fail to scale beyond a few hundred nodes, which is far below the average complexity of existing online web documents and applications. In this paper, we therefore propose a novel Similarity-based Flexible Tree Matching algorithm (SFTM), which is the first algorithm to enable tree matching on real-life web documents with practical computation times. In particular, we approach tree matching as an optimisation problem and we leverage node labels and local topology similarity in order to avoid any combinatorial explosion. Our practical evaluation demonstrates that our approach is qualitatively comparable to the reference implementation of TED, while improving the computation times by two orders of magnitude.
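To make the idea of matching nodes by label and local topology concrete, here is a minimal sketch in Python. It is not the SFTM algorithm from the paper: the `Node` class, the token scheme (tag, attributes, parent tag), the Jaccard score, the greedy one-to-one assignment, and the `threshold` parameter are all assumptions chosen for illustration. It also compares every pair of nodes, so it does not reproduce the scalability the abstract reports.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set, Tuple


@dataclass(eq=False)  # identity-based equality avoids recursing through parent links
class Node:
    """Hypothetical DOM-like node; not a type defined by the paper."""
    tag: str
    attrs: Dict[str, str] = field(default_factory=dict)
    children: List["Node"] = field(default_factory=list)
    parent: Optional["Node"] = None

    def add(self, child: "Node") -> "Node":
        child.parent = self
        self.children.append(child)
        return child


def flatten(root: Node) -> List[Node]:
    """Return all nodes of the tree in pre-order."""
    out, stack = [], [root]
    while stack:
        node = stack.pop()
        out.append(node)
        stack.extend(reversed(node.children))
    return out


def tokens(node: Node) -> Set[str]:
    """Label tokens (tag, attributes) plus one local-topology token (parent tag)."""
    toks = {f"tag:{node.tag}"} | {f"{k}={v}" for k, v in node.attrs.items()}
    if node.parent is not None:
        toks.add(f"parent:{node.parent.tag}")
    return toks


def jaccard(a: Set[str], b: Set[str]) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def greedy_match(src: Node, dst: Node,
                 threshold: float = 0.5) -> List[Tuple[Node, Node, float]]:
    """Pair nodes one-to-one, taking the highest similarity scores first."""
    src_nodes, dst_nodes = flatten(src), flatten(dst)
    scored = []
    for i, s in enumerate(src_nodes):
        s_toks = tokens(s)
        for j, d in enumerate(dst_nodes):
            score = jaccard(s_toks, tokens(d))
            if score >= threshold:
                scored.append((score, i, j))
    # Best scores first; ties are broken by document order.
    scored.sort(key=lambda t: (-t[0], t[1], t[2]))
    used_src, used_dst, result = set(), set(), []
    for score, i, j in scored:
        if i not in used_src and j not in used_dst:
            used_src.add(i)
            used_dst.add(j)
            result.append((src_nodes[i], dst_nodes[j], score))
    return result
```

Unlike TED, a matching of this kind does not require matched nodes to keep their ancestor ordering, which is what makes it "flexible"; the price in this simplified version is that nothing penalises a match that breaks parent-child consistency beyond the single parent-tag token.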

research
03/07/2011

Automatic Wrapper Adaptation by Tree Edit Distance Matching

Information distributed through the Web keeps growing faster day by day,...
research
01/20/2022

JEDI: These aren't the JSON documents you're looking for... (Extended Version*)

The JavaScript Object Notation (JSON) is a popular data format used in d...
research
11/05/2019

Fast Multiple Pattern Cartesian Tree Matching

Cartesian tree matching is the problem of finding all substrings in a gi...
research
03/15/2022

FastKASSIM: A Fast Tree Kernel-Based Syntactic Similarity Metric

Syntax is a fundamental component of language, yet few metrics have been...
research
01/07/2021

Simplified DOM Trees for Transferable Attribute Extraction from the Web

There has been a steady need to precisely extract structured knowledge f...
research
03/26/2021

A PSO Strategy of Finding Relevant Web Documents using a New Similarity Measure

In the world of the Internet and World Wide Web, which offers a tremendo...
research
11/07/2022

Fast Key Points Detection and Matching for Tree-Structured Images

This paper offers a new authentication algorithm based on image matching...
