# Similarity Downselection: A Python implementation of a heuristic search algorithm for finding the set of the n most dissimilar items with an application in conformer sampling

Finding the set of the n items most dissimilar from each other out of a larger population becomes increasingly difficult and computationally expensive as either n or the population size grows. The problem differs from simply sorting an array of numbers because a pairwise relationship exists between each item and every other item in the population. For instance, given the most dissimilar set of n=4 items, one or more of those items might not appear in the most dissimilar set of n=5. An exact solution would have to search exhaustively through all possible combinations of size n in the population. We present open-source software called similarity downselection (SDS), written in Python and freely available on GitHub. SDS implements a heuristic algorithm for quickly finding the approximate set(s) of the n most dissimilar items. We benchmark SDS against a Monte Carlo method, which attempts to find the exact solution through repeated random sampling. We show that, when finding the set of the n most dissimilar conformers, SDS is not only orders of magnitude faster but also more accurate than the Monte Carlo method run for 1,000,000 iterations, with set sizes n=3-7 drawn from a population of 50,000. We also benchmark SDS against the exact solution for small example populations, showing that SDS produces a solution close to the exact one in these instances.
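To make the two baselines mentioned above concrete, here is a minimal sketch of an exhaustive search and a Monte Carlo search on a toy population. It is not the SDS heuristic, which the abstract does not specify. The function names, the toy data, and the scoring rule (sum of pairwise dissimilarities within a set) are assumptions chosen for illustration and may differ from what the authors use.

```python
import itertools
import random


def set_score(dist, subset):
    """Sum of pairwise dissimilarities within subset (an assumed objective)."""
    return sum(dist[i][j] for i, j in itertools.combinations(subset, 2))


def exact_most_dissimilar(dist, n):
    """Score every size-n combination; feasible only for small populations."""
    items = range(len(dist))
    return max(itertools.combinations(items, n),
               key=lambda s: set_score(dist, s))


def monte_carlo_most_dissimilar(dist, n, iterations, seed=0):
    """Keep the best of `iterations` uniformly random size-n subsets."""
    rng = random.Random(seed)
    items = range(len(dist))
    best, best_score = None, float("-inf")
    for _ in range(iterations):
        candidate = tuple(sorted(rng.sample(items, n)))
        score = set_score(dist, candidate)
        if score > best_score:
            best, best_score = candidate, score
    return best


# Hypothetical toy population: points on a line,
# with dissimilarity defined as the absolute difference.
points = [0.0, 1.0, 2.0, 9.0, 10.0]
dist = [[abs(a - b) for b in points] for a in points]

print(exact_most_dissimilar(dist, 4))
print(monte_carlo_most_dissimilar(dist, 4, iterations=200))
```

The exhaustive search evaluates every combination of size n, so its cost grows combinatorially with the population size. The Monte Carlo search caps the cost at a fixed number of iterations but offers no guarantee of reaching the best set.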
