JEDI: These aren't the JSON documents you're looking for... (Extended Version*)

01/20/2022
by Thomas Hütter, et al.

The JavaScript Object Notation (JSON) is a popular data format used in document stores to natively support semi-structured data. In this paper, we address the problem of JSON similarity lookup queries: given a query document and a distance threshold τ, retrieve all JSON documents that are within distance τ of the query document. Because JSON is defined recursively, JSON data are naturally represented as trees. Unlike other hierarchical formats such as XML, JSON supports both ordered and unordered sibling collections within a single document. This feature poses a new challenge to the tree model and to distance computation. We propose the JSON tree, a lossless tree representation of JSON documents, and define the JSON Edit Distance (JEDI), the first edit-based distance measure for JSON documents. We develop an algorithm, called QuickJEDI, that computes JEDI by leveraging a new technique to prune expensive sibling matchings. It outperforms a baseline algorithm by an order of magnitude in runtime. To boost the performance of JSON similarity queries, we introduce an index called JSIM and a highly effective upper bound based on tree sorting. Our algorithm for the upper bound runs in O(n τ) time and O(n + τ log n) space, which substantially improves on the previous best bound of O(n^2) time and O(n log n) space (where n is the tree size). Our experimental evaluation shows that our solution scales to databases with millions of documents and JSON trees with tens of thousands of nodes.
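To make the mix of ordered and unordered sibling collections concrete, the following is a minimal Python sketch of one possible tree encoding of a JSON document. It is not the paper's JSON tree definition, JEDI, or QuickJEDI: the `Node` class, the four node kinds (`object`, `array`, `key`, `literal`), and the helpers `to_tree`, `size`, and `canonical` are all assumptions introduced here for illustration. The sketch treats the children of an object as unordered and the children of an array as ordered, so reordering keys leaves a document unchanged while reordering array elements does not.

```python
import json
from dataclasses import dataclass, field
from typing import Any, List


@dataclass
class Node:
    # kind is one of "object", "array", "key", "literal" (hypothetical node kinds).
    kind: str
    label: str = ""
    children: List["Node"] = field(default_factory=list)


def to_tree(value: Any) -> Node:
    """Convert a parsed JSON value into a tree (illustrative encoding only)."""
    if isinstance(value, dict):
        # Each key becomes a node whose single child is the key's value.
        keys = [Node("key", k, [to_tree(v)]) for k, v in value.items()]
        return Node("object", "{}", keys)
    if isinstance(value, list):
        # Array elements keep their document order.
        return Node("array", "[]", [to_tree(v) for v in value])
    # Strings, numbers, booleans, and null become leaves.
    return Node("literal", json.dumps(value))


def size(node: Node) -> int:
    """Number of nodes in the tree (the n of the complexity bounds)."""
    return 1 + sum(size(child) for child in node.children)


def canonical(node: Node) -> tuple:
    """Nested-tuple form in which only unordered (object) children are sorted."""
    kids = [canonical(child) for child in node.children]
    if node.kind == "object":
        kids.sort()  # sibling order under an object is irrelevant
    return (node.kind, node.label, tuple(kids))


a = to_tree(json.loads('{"a": 1, "b": [1, 2]}'))
b = to_tree(json.loads('{"b": [1, 2], "a": 1}'))  # same keys, different order
c = to_tree(json.loads('{"a": 1, "b": [2, 1]}'))  # array elements swapped

same_under_key_reordering = canonical(a) == canonical(b)
same_under_array_reordering = canonical(a) == canonical(c)
```

Under this encoding, `a` and `b` have the same canonical form, while `a` and `c` do not. A distance measure for JSON therefore has to handle both kinds of sibling collections within a single tree.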

