Scalable Feature Matching Across Large Data Collections

01/06/2021
by David Degras, et al.

This paper is concerned with matching feature vectors in a one-to-one fashion across large collections of datasets. Formulating this task as a multidimensional assignment problem with decomposable costs (MDADC), we develop extremely fast algorithms whose time complexity is linear in the number n of datasets and whose space complexity is a small fraction of the data size. These remarkable properties hinge on using the squared Euclidean distance as the dissimilarity function, which can reduce the n² matching problems between pairs of datasets to n problems and allows assignment costs to be calculated on the fly. To our knowledge, no other method applicable to the MDADC possesses the linear scaling and low-storage properties necessary for large-scale applications. In numerical experiments, the novel algorithms outperform competing methods and show excellent computational and optimization performance. An application of feature matching to a large neuroimaging database is presented. The algorithms of this paper are implemented in the R package matchFeat, available at https://github.com/ddegras/matchFeat.
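The role of the squared Euclidean distance can be illustrated with a standard identity: for n vectors matched together, the sum of squared distances over all pairs equals n times the sum of squared distances to their centroid. The sketch below checks this numerically in plain Python. It is an illustration under that assumption only, not the matchFeat implementation, and the function names pairwise_cost and centroid_cost are hypothetical.

```python
import random


def pairwise_cost(vectors):
    """Sum of squared Euclidean distances over all pairs of vectors.

    Needs n*(n-1)/2 distance evaluations for n vectors.
    """
    n = len(vectors)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            total += sum((a - b) ** 2 for a, b in zip(vectors[i], vectors[j]))
    return total


def centroid_cost(vectors):
    """Same quantity computed as n * sum_i ||x_i - mean||^2.

    Needs only n distance evaluations once the centroid is known.
    """
    n = len(vectors)
    dim = len(vectors[0])
    mean = [sum(v[k] for v in vectors) / n for k in range(dim)]
    return n * sum(sum((a - m) ** 2 for a, m in zip(v, mean)) for v in vectors)


# Hypothetical example: one feature vector from each of 50 datasets,
# assumed to have been matched to one another.
random.seed(0)
matched = [[random.gauss(0, 1) for _ in range(3)] for _ in range(50)]

slow = pairwise_cost(matched)
fast = centroid_cost(matched)
print(abs(slow - fast) <= 1e-9 * max(1.0, slow))
```

Because the cost depends on the matched vectors only through their centroid, it can be evaluated without storing or revisiting all pairwise distances, which is consistent with the linear scaling and low storage described in the abstract.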


research
11/07/2021

Em-K Indexing for Approximate Query Matching in Large-scale ER

Accurate and efficient entity resolution (ER) is a significant challenge...
research
07/19/2023

Improved Distribution Matching for Dataset Condensation

Dataset Condensation aims to condense a large dataset into a smaller one...
research
01/05/2023

PA-GM: Position-Aware Learning of Embedding Networks for Deep Graph Matching

Graph matching can be formalized as a combinatorial optimization problem...
research
08/10/2018

Greedy Algorithms for Approximating the Diameter of Machine Learning Datasets in Multidimensional Euclidean Space

Finding the diameter of a dataset in multidimensional Euclidean space is...
research
05/05/2022

Multi-Freq-LDPy: Multiple Frequency Estimation Under Local Differential Privacy in Python

This paper introduces the Python package for multiple frequency estimat...
research
04/21/2016

LOH and behold: Web-scale visual search, recommendation and clustering using Locally Optimized Hashing

We propose a novel hashing-based matching scheme, called Locally Optimiz...
