Similarity Downselection: A Python implementation of a heuristic search algorithm for finding the set of the n most dissimilar items with an application in conformer sampling

by Felicity F. Nielson, et al.

Finding the set of the n items most dissimilar from each other in a larger population becomes increasingly difficult and computationally expensive as either n or the population size grows. The problem differs from simply sorting an array of numbers because every item has a pairwise relationship with every other item in the population. For instance, one or more items in the most dissimilar set of size n=4 might not appear in the most dissimilar set of size n=5. An exact solution requires an exhaustive search over all possible combinations of size n in the population. We present similarity downselection (SDS), open-source software written in Python and freely available on GitHub. SDS implements a heuristic algorithm for quickly finding the approximate set(s) of the n most dissimilar items. We benchmark SDS against a Monte Carlo method, which attempts to find the exact solution through repeated random sampling. For the task of finding the set of the n most dissimilar conformers, we show that SDS is not only orders of magnitude faster but also more accurate than the Monte Carlo method run for 1,000,000 iterations, when searching for sets of size n=3-7 in a population of 50,000. We also benchmark SDS against the exact solution for small example populations and show that SDS produces a solution close to the exact one in these instances.
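To make the three approaches named in the abstract concrete, here is a minimal Python sketch on a toy population. It is illustrative only and makes several assumptions that do not come from the paper: dissimilarity of a set is scored as the sum of its pairwise distances, items are random 2-D points, and the function names (`set_score`, `exact_search`, `monte_carlo_search`, `greedy_search`) are hypothetical. In particular, `greedy_search` is a generic greedy heuristic used as a stand-in. It is not the SDS algorithm, whose details the abstract does not give.

```python
import itertools
import math
import random


def set_score(dist, subset):
    """Assumed objective: sum of pairwise distances within the subset."""
    return sum(dist[i][j] for i, j in itertools.combinations(subset, 2))


def exact_search(dist, n):
    """Score every size-n combination; the count grows as C(N, n)."""
    candidates = itertools.combinations(range(len(dist)), n)
    return max(candidates, key=lambda s: set_score(dist, s))


def monte_carlo_search(dist, n, iterations, rng):
    """Keep the best of `iterations` uniformly random size-n subsets."""
    best, best_score = None, -math.inf
    for _ in range(iterations):
        candidate = tuple(sorted(rng.sample(range(len(dist)), n)))
        score = set_score(dist, candidate)
        if score > best_score:
            best, best_score = candidate, score
    return best


def greedy_search(dist, n):
    """Generic greedy stand-in (not SDS), valid for n >= 2: seed with the
    most distant pair, then repeatedly add the item whose summed distance
    to the already chosen items is largest."""
    size = len(dist)
    pairs = itertools.combinations(range(size), 2)
    chosen = list(max(pairs, key=lambda p: dist[p[0]][p[1]]))
    while len(chosen) < n:
        remaining = [k for k in range(size) if k not in chosen]
        chosen.append(max(remaining, key=lambda k: sum(dist[k][c] for c in chosen)))
    return tuple(sorted(chosen))


# Toy population: 15 random 2-D points and their pairwise distance matrix.
rng = random.Random(0)
points = [(rng.random(), rng.random()) for _ in range(15)]
dist = [[math.dist(p, q) for q in points] for p in points]

n = 4
exact = exact_search(dist, n)
mc = monte_carlo_search(dist, n, 200, rng)
greedy = greedy_search(dist, n)

print("combinations searched exactly:", math.comb(len(points), n))
print("exact  :", exact, round(set_score(dist, exact), 4))
print("random :", mc, round(set_score(dist, mc), 4))
print("greedy :", greedy, round(set_score(dist, greedy), 4))
```

With 15 items and n=4 the exact search scores only 1,365 combinations, so it finishes instantly. The same exhaustive loop over a population of 50,000 would be infeasible, which is the regime where a heuristic or a sampling method becomes necessary.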




