
Medoids in almost linear time via multi-armed bandits
Computing the medoid of a large number of points in high-dimensional space is an increasingly common operation in many data-science problems. We present Meddit, an algorithm which uses O(n log n) distance evaluations to compute the medoid with high probability. Meddit is based on a connection with the multi-armed bandit problem. We evaluate the performance of Meddit empirically on the Netflix-prize and single-cell RNA-Seq datasets, containing hundreds of thousands of points living in tens of thousands of dimensions, and observe a 5-10x improvement in performance over the current state of the art. Meddit is available at https://github.com/bagavi/Meddit
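The bandit view treats each point as an arm whose unknown mean is its average distance to the rest of the dataset; confidence intervals then let most arms be eliminated after only a few sampled distances. A minimal sketch of this idea (not the authors' Meddit implementation — the batch size, confidence radius, and elimination rule here are illustrative assumptions):

```python
import numpy as np

def bandit_medoid(X, budget_per_round=20, delta=1e-3, rng=None):
    """Approximate the medoid by adaptively sampling distance evaluations.

    Each point is an 'arm'; its unknown mean is the average distance to
    the other points. We maintain confidence intervals and eliminate arms
    whose lower bound exceeds the smallest upper bound.
    """
    rng = np.random.default_rng(rng)
    n = len(X)
    active = np.arange(n)
    est = np.zeros(n)        # running mean distance estimate per point
    pulls = np.zeros(n)      # number of sampled distances per point
    sigma = np.linalg.norm(X.std(axis=0))  # crude scale for the bounds
    while len(active) > 1 and pulls[active].min() < n:
        # sample a fresh batch of reference points, shared by all active arms
        refs = rng.integers(0, n, size=budget_per_round)
        d = np.linalg.norm(X[active][:, None, :] - X[refs][None, :, :], axis=2)
        est[active] = (est[active] * pulls[active] + d.sum(axis=1)) / (
            pulls[active] + budget_per_round)
        pulls[active] += budget_per_round
        rad = sigma * np.sqrt(2 * np.log(1 / delta) / pulls[active])
        best_ucb = (est[active] + rad).min()
        active = active[est[active] - rad <= best_ucb]  # drop dominated arms
    # fall back to exact mean distances over the few survivors
    exact = [np.linalg.norm(X - X[i], axis=1).mean() for i in active]
    return active[int(np.argmin(exact))]
```

At the scales in the abstract (n in the hundreds of thousands), most arms need only O(log n) sampled distances before elimination, which is the intuition behind the O(n log n) total.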
11/02/2017 ∙ by Vivek Bagaria, et al.

Adaptive Monte-Carlo Optimization
The celebrated Monte Carlo method estimates a quantity that is expensive to compute by random sampling. We propose adaptive Monte Carlo optimization: a general framework for discrete optimization of an expensive-to-compute function by adaptive random sampling. Applications of this framework have already appeared in machine learning, but each is tied to its specific context and developed in isolation. We take a unified view and show that the framework has broad applicability by applying it to several common machine-learning problems: k-nearest neighbors, hierarchical clustering, and maximum mutual-information feature selection. On real data we show that this framework allows us to develop algorithms that confer a gain of one to two orders of magnitude over exact computation. We also characterize the performance gain theoretically under regularity assumptions on the data, which we verify on real-world data. The code is available at https://github.com/govindakamath/combinatorial_MAB.
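As one instance of the framework, nearest-neighbor search can adaptively sample coordinates instead of computing full distances, treating each candidate point as an arm. A hedged sketch for the 1-NN case (the coordinate-sampling estimator, confidence radius, and function name are illustrative assumptions, not the repository's API):

```python
import numpy as np

def adaptive_nn(X, q, batch=32, delta=1e-3, rng=None):
    """Find the nearest neighbor of q among rows of X by sampling coordinates.

    Pulling an arm samples a random batch of coordinates and updates an
    estimate of that point's mean per-coordinate squared distance to q.
    """
    rng = np.random.default_rng(rng)
    n, d = X.shape
    active = np.arange(n)
    est = np.zeros(n)       # running per-coordinate squared-distance estimate
    pulls = np.zeros(n)     # coordinates sampled so far, per point
    scale = (X.max() - X.min()) ** 2  # crude bound on a squared difference
    while len(active) > 1 and pulls[active].min() < d:
        coords = rng.integers(0, d, size=batch)  # shared coordinate batch
        s = ((X[np.ix_(active, coords)] - q[coords]) ** 2).mean(axis=1)
        est[active] = (est[active] * pulls[active] + s * batch) / (
            pulls[active] + batch)
        pulls[active] += batch
        rad = scale * np.sqrt(2 * np.log(1 / delta) / pulls[active])
        best_ucb = (est[active] + rad).min()
        active = active[est[active] - rad <= best_ucb]  # drop dominated arms
    # exact squared distances over the survivors
    exact = ((X[active] - q) ** 2).sum(axis=1)
    return active[int(np.argmin(exact))]
```

The same skeleton — adaptive sampling plus confidence-bound elimination — carries over to the hierarchical-clustering and feature-selection applications by changing what a "pull" estimates.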
05/21/2018 ∙ by Vivek Bagaria, et al.

Hidden Hamiltonian Cycle Recovery via Linear Programming
We introduce the problem of hidden Hamiltonian cycle recovery, where there is an unknown Hamiltonian cycle in an n-vertex complete graph that needs to be inferred from noisy edge measurements. The measurements are independent and distributed according to P_n for edges in the cycle and Q_n otherwise. This formulation is motivated by a problem in genome assembly, where the goal is to order a set of contigs (genome subsequences) according to their positions on the genome using long-range linking measurements between the contigs. Computing the maximum likelihood estimate in this model reduces to a Traveling Salesman Problem (TSP). Despite the NP-hardness of TSP, we show that a simple linear programming (LP) relaxation, namely the fractional 2-factor (F2F) LP, recovers the hidden Hamiltonian cycle with high probability as n →∞ provided that α_n - log n →∞, where α_n ≜ -2 log∫√(d P_n d Q_n) is the Rényi divergence of order 1/2 between P_n and Q_n. This condition is information-theoretically optimal in the sense that, under mild distributional assumptions, α_n ≥ (1+o(1)) log n is necessary for any algorithm to succeed regardless of the computational cost. Departing from the usual proof techniques based on dual witness construction, the analysis relies on the combinatorial characterization (in particular, the half-integrality) of the extreme points of the F2F polytope. Represented as bicolored multigraphs, these extreme points are further decomposed into simpler "blossom-type" structures for the large deviation analysis and counting arguments. Evaluation of the algorithm on real data shows improvements over existing approaches.
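The maximum-likelihood estimate is the max-weight Hamiltonian cycle under edge log-likelihood-ratio weights, and a tiny planted instance makes that objective concrete. The brute-force sketch below is illustrative only — the paper's actual algorithm is the F2F LP relaxation, and the Gaussian weights here merely stand in for the log-likelihood ratios of P_n versus Q_n:

```python
import itertools
import numpy as np

def ml_cycle(W):
    """Return the max-weight Hamiltonian cycle, as a set of edges,
    by brute force over all cyclic orderings (toy sizes only)."""
    n = W.shape[0]
    best, best_edges = -np.inf, None
    for perm in itertools.permutations(range(1, n)):  # fix vertex 0
        cycle = (0,) + perm
        edges = {frozenset((cycle[i], cycle[(i + 1) % n])) for i in range(n)}
        w = sum(W[min(e), max(e)] for e in edges)
        if w > best:
            best, best_edges = w, edges
    return best_edges

# planted instance: cycle edges get an elevated mean, all edges get noise
rng = np.random.default_rng(1)
n = 6
W = rng.normal(scale=0.1, size=(n, n))
W = (W + W.T) / 2
planted = {frozenset((i, (i + 1) % n)) for i in range(n)}
for e in planted:
    i, j = sorted(e)
    W[i, j] += 2.0   # signal on cycle edges (toy analogue of P_n vs Q_n)
    W[j, i] += 2.0
recovered = ml_cycle(W)
```

When the signal-to-noise gap is large the planted cycle is recovered exactly; the paper's contribution is that the polynomial-time F2F LP achieves this down to the information-theoretic threshold α_n - log n →∞.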
04/15/2018 ∙ by Vivek Bagaria, et al.

Deconstructing the Blockchain to Approach Physical Limits
Transaction throughput, confirmation latency and confirmation reliability are fundamental performance measures of any blockchain system, in addition to its security. In a decentralized setting, these measures are limited by two underlying physical network attributes: communication capacity and speed-of-light propagation delay. Existing systems operate far away from these physical limits. In this work we introduce Prism, a new proof-of-work blockchain protocol, which can achieve 1) security against up to 50% adversarial hashing power; 2) optimal throughput up to the capacity C of the network; 3) confirmation latency for honest transactions proportional to the propagation delay D, with confirmation error probability exponentially small in CD; 4) eventual total ordering of all transactions. Our approach to the design of this protocol is based on deconstructing the blockchain into its basic functionalities and systematically scaling up these functionalities to approach their physical limits.
10/18/2018 ∙ by Vivek Bagaria, et al.