Similarity Downselection: A Python implementation of a heuristic search algorithm for finding the set of the n most dissimilar items with an application in conformer sampling

05/06/2021
by Felicity F. Nielson, et al.

Finding the set of the n items most dissimilar from each other out of a larger population becomes increasingly difficult and computationally expensive as either n or the population size grows. The problem differs from simply sorting an array of numbers because each item has a pairwise relationship with every other item in the population. For instance, one or more items in the most dissimilar set of size n=4 might not appear in the most dissimilar set of size n=5. An exact solution would have to search all possible combinations of size n in the population exhaustively. We present similarity downselection (SDS), open-source software written in Python and freely available on GitHub. SDS implements a heuristic algorithm for quickly finding the approximate set(s) of the n most dissimilar items. We benchmark SDS against a Monte Carlo method, which attempts to find the exact solution through repeated random sampling. We show that, when finding the set of the n most dissimilar conformers, SDS is not only orders of magnitude faster but also more accurate than running the Monte Carlo method for 1,000,000 iterations, in searches for set sizes n=3-7 out of a population of 50,000. We also benchmark SDS against the exact solution for small example populations, showing that SDS produces a solution close to the exact solution in these instances.
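
To make the problem concrete, the following is a minimal sketch of the two baselines the abstract mentions: an exhaustive exact search and a Monte Carlo random-sampling search. It is not the SDS heuristic itself, which the abstract does not specify. The function names, the one-dimensional toy population, and the scoring rule (the smallest pairwise distance within a set, to be maximized) are all assumptions made for illustration; the paper may define set dissimilarity differently.

```python
import itertools
import math
import random


def set_score(values, subset):
    """Score a set by its smallest pairwise distance (assumed dissimilarity measure)."""
    return min(abs(values[i] - values[j])
               for i, j in itertools.combinations(subset, 2))


def exact_most_dissimilar(values, n):
    """Exhaustively score every size-n combination; feasible only for tiny populations."""
    return max(itertools.combinations(range(len(values)), n),
               key=lambda subset: set_score(values, subset))


def monte_carlo_most_dissimilar(values, n, iterations, seed=0):
    """Keep the best-scoring set seen over repeated uniform random draws."""
    rng = random.Random(seed)
    best, best_score = None, float("-inf")
    for _ in range(iterations):
        candidate = tuple(sorted(rng.sample(range(len(values)), n)))
        score = set_score(values, candidate)
        if score > best_score:
            best, best_score = candidate, score
    return best


# Hypothetical toy population: five items described by a single number each.
values = [0, 4, 5, 6, 10]

best_3 = exact_most_dissimilar(values, 3)  # (0, 2, 4), i.e. values 0, 5, 10
best_4 = exact_most_dissimilar(values, 4)  # (0, 1, 3, 4), i.e. values 0, 4, 6, 10
sampled_3 = monte_carlo_most_dissimilar(values, 3, iterations=50)

# Number of candidate sets an exact search would face at the benchmark scale
# quoted in the abstract (n=7 out of a population of 50,000).
n_candidates = math.comb(50_000, 7)
```

Under this assumed measure, the best set of size 3 contains the item with value 5, while the best set of size 4 drops it in favor of 4 and 6. This illustrates why the solution for one n cannot simply be extended to the next. The value of `n_candidates` is on the order of 10^29, which is why an exhaustive search is impractical at that scale.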


research
06/12/2023

Fast exact simulation of the first-passage event of a subordinator

This paper provides an exact simulation algorithm for the sampling from ...
research
06/30/2020

Sampling from a k-DPP without looking at all items

Determinantal point processes (DPPs) are a useful probabilistic model fo...
research
02/15/2016

Adversarial Top-K Ranking

We study the top-K ranking problem where the goal is to recover the set ...
research
02/15/2021

Quasi-Monte Carlo Software

Practitioners wishing to experience the efficiency gains from using low ...
research
05/11/2013

A Further Generalization of the Finite-Population Geiringer-like Theorem for POMDPs to Allow Recombination Over Arbitrary Set Covers

A popular current research trend deals with expanding the Monte-Carlo tr...
research
12/27/2013

Monte Carlo non local means: Random sampling for large-scale image filtering

We propose a randomized version of the non-local means (NLM) algorithm f...
research
11/19/2022

Heuristic Algorithm for Univariate Stratification Problem

In sampling theory, stratification corresponds to a technique used in su...
