Fast Search-By-Classification for Large-Scale Databases Using Index-Aware Decision Trees and Random Forests

06/05/2023
by Christian Lülf, et al.

The vast amounts of data collected in various domains pose great challenges to modern data exploration and analysis. To find "interesting" objects in large databases, users typically define a query using positive and negative example objects and train a classification model to identify the objects of interest in the entire data catalog. However, this approach requires scanning all the data to apply the classification model to every instance in the catalog, which makes it prohibitively expensive for large-scale databases that serve many users and queries interactively. In this work, we propose a novel framework for such search-by-classification scenarios that lets users interactively search for target objects by specifying queries through a small set of positive and negative examples. Unlike previous approaches, our framework can answer such queries rapidly and at low cost without scanning the entire database. It is based on an index-aware construction scheme for decision trees and random forests that transforms the inference phase of these classification models into a set of range queries, which can in turn be executed efficiently using multidimensional indexing structures. Our experiments show that queries over large data catalogs with hundreds of millions of objects can be processed in a few seconds on a single server, compared to the hours needed by classical scanning-based approaches.
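To make the idea of turning tree inference into range queries concrete, here is a minimal sketch. It is not the authors' index-aware construction scheme. It only illustrates that each positive leaf of a decision tree corresponds to an axis-aligned box, so the positive set can be retrieved with one range query per box instead of a full scan. The toy tree, the data points, and all function names are hypothetical, and a sorted list on one feature stands in for a real multidimensional index.

```python
import math
from bisect import bisect_right

# Hypothetical toy tree: an internal node sends x left if x[feature] <= threshold.
TREE = {
    "feature": 0, "threshold": 5.0,
    "left": {"label": 0},
    "right": {
        "feature": 1, "threshold": 2.0,
        "left": {"label": 1},
        "right": {"label": 0},
    },
}

def positive_boxes(node, box, out):
    """Collect one box per positive leaf.

    A box maps feature -> (low, high), meaning low < x[feature] <= high.
    Features missing from the box are unconstrained.
    """
    if "label" in node:
        if node["label"] == 1:
            out.append(dict(box))
        return out
    f, t = node["feature"], node["threshold"]
    lo, hi = box.get(f, (-math.inf, math.inf))
    if lo < t:   # left branch (x[f] <= t) is reachable
        positive_boxes(node["left"], {**box, f: (lo, min(hi, t))}, out)
    if t < hi:   # right branch (x[f] > t) is reachable
        positive_boxes(node["right"], {**box, f: (max(lo, t), hi)}, out)
    return out

def build_index(points, feature=0):
    """Stand-in index (an assumption): point ids sorted by a single feature."""
    order = sorted(range(len(points)), key=lambda i: points[i][feature])
    keys = [points[i][feature] for i in order]
    return keys, order

def range_query(points, index, box, feature=0):
    """Use the sorted feature to narrow candidates, then check the full box."""
    keys, order = index
    lo, hi = box.get(feature, (-math.inf, math.inf))
    start = bisect_right(keys, lo)  # first key strictly greater than lo
    stop = bisect_right(keys, hi)   # first key strictly greater than hi
    return [i for i in order[start:stop]
            if all(l < points[i][f] <= h for f, (l, h) in box.items())]

def predict(node, x):
    """Classical per-object inference, used here only as the scan baseline."""
    while "label" not in node:
        go_left = x[node["feature"]] <= node["threshold"]
        node = node["left"] if go_left else node["right"]
    return node["label"]

POINTS = [(1.0, 1.0), (6.0, 1.5), (7.0, 3.0), (5.0, 0.5), (9.0, 2.0)]
BOXES = positive_boxes(TREE, {}, [])
INDEX = build_index(POINTS)
FOUND = sorted(i for b in BOXES for i in range_query(POINTS, INDEX, b))
SCANNED = [i for i, p in enumerate(POINTS) if predict(TREE, p) == 1]
print(BOXES, FOUND, SCANNED)
```

In this sketch the range-query result matches the full scan, but only the candidates inside the indexed interval are inspected. For a random forest, one would presumably need to combine the boxes of the individual trees according to the voting rule, which this sketch does not attempt.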


