GPU Accelerated Self-join for the Distance Similarity Metric

03/12/2018
by Michael Gowanlock, et al.

The self-join finds all objects in a dataset that are within a threshold distance of each other, as defined by a similarity metric. As such, the self-join is a building block in databases and data mining, and is employed in Big Data applications. In this paper, we advance a GPU-efficient algorithm for the similarity self-join that uses the Euclidean distance metric. The search-and-refine strategy is an efficient approach for low-dimensional datasets, because index searches degrade with increasing dimension (i.e., the curse of dimensionality). Thus, we target the low-dimensional problem and compare our GPU self-join to a search-and-refine implementation and to a state-of-the-art parallel algorithm. In low dimensionality, efficiently solving the self-join problem on the GPU poses several challenges. Low-dimensional data often results in higher data densities, causing a large number of distance calculations and a large result set. As dimensionality increases, index searches become increasingly exhaustive, forming a performance bottleneck. We advance several techniques to overcome these challenges using the GPU: a GPU-efficient index that employs a bounded search, a batching scheme to accommodate large result set sizes, and a reduction in distance calculations through duplicate search removal. Our GPU self-join outperforms both the search-and-refine and the state-of-the-art algorithms.
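
To make the problem statement concrete, the following is a minimal CPU sketch of a distance self-join, not the GPU algorithm from the paper. It assumes a uniform grid with cell side equal to the threshold as a stand-in for the index, so the search is bounded to adjacent cells, and it evaluates each pair of cells only once as a loose illustration of avoiding duplicate searches. The names `grid_self_join`, `points`, and `eps` are hypothetical and do not come from the paper.

```python
from collections import defaultdict
from itertools import product
import math


def grid_self_join(points, eps):
    """Return all ordered pairs (i, j), i != j, whose Euclidean distance is <= eps."""
    # Assumption: a uniform grid with cell side eps stands in for the index.
    # Two points within eps of each other always fall in the same or adjacent cells.
    cells = defaultdict(list)
    for idx, p in enumerate(points):
        cells[tuple(math.floor(c / eps) for c in p)].append(idx)

    dims = len(points[0]) if points else 0
    offsets = list(product((-1, 0, 1), repeat=dims))

    result = set()
    for cell, members in cells.items():
        for off in offsets:
            neighbor = tuple(c + o for c, o in zip(cell, off))
            # Bounded search: only adjacent, non-empty cells are visited.
            # Each unordered pair of cells is evaluated once (neighbor >= cell).
            if neighbor < cell or neighbor not in cells:
                continue
            for i in members:
                for j in cells[neighbor]:
                    # Within a single cell, compare each pair of points once.
                    if neighbor == cell and j <= i:
                        continue
                    if math.dist(points[i], points[j]) <= eps:
                        # Record both directions from one distance calculation.
                        result.add((i, j))
                        result.add((j, i))
    return result
```

Calling `grid_self_join(points, eps)` on a list of equal-length coordinate tuples returns the set of index pairs within `eps` of each other. Because the number of adjacent cells grows as 3^d with the dimension d, this kind of grid search is only practical in low dimensionality, which matches the regime the abstract targets.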

research
09/26/2018

GPU Accelerated Similarity Self-Join for Multi-Dimensional Data

The self-join finds all objects in a dataset that are within a search di...
research
10/10/2018

Technical Report: KNN Joins Using a Hybrid Approach: Exploiting CPU/GPU Workload Characteristics

This paper studies finding the K nearest neighbors (KNN) of all points i...
research
09/22/2022

Computing Double Precision Euclidean Distances using GPU Tensor Cores

Tensor cores (TCs) are a type of Application-Specific Integrated Circuit...
research
04/25/2019

GPU-based Efficient Join Algorithms on Hadoop

The growing data has brought tremendous pressure for query processing an...
research
10/01/2018

SVS-JOIN: Efficient Spatial Visual Similarity Join over Multimedia Data

In the big data era, massive amount of multimedia data with geo-tags has...
research
06/13/2017

Preference-driven Similarity Join

Similarity join, which can find similar objects (e.g., products, names, ...
research
02/27/2020

A Data Dependent Algorithm for Querying Earth Mover's Distance with Low Doubling Dimension

In this paper, we consider the following query problem: given two weight...