# A Polynomial Time Algorithm for Log-Concave Maximum Likelihood via Locally Exponential Families

We consider the problem of computing the maximum likelihood multivariate log-concave distribution for a set of points. Specifically, we present an algorithm which, given n points in R^d and an accuracy parameter ϵ>0, runs in time poly(n,d,1/ϵ), and returns a log-concave distribution which, with high probability, has the property that the likelihood of the n points under the returned distribution is at most an additive ϵ less than the maximum likelihood that could be achieved via any log-concave distribution. This is the first computationally efficient (polynomial time) algorithm for this fundamental and practically important task. Our algorithm rests on a novel connection with exponential families: the maximum likelihood log-concave distribution belongs to a class of structured distributions which, while not an exponential family, "locally" possesses key properties of exponential families. This connection then allows the problem of computing the log-concave maximum likelihood distribution to be formulated as a convex optimization problem, and solved via an approximate first-order method. Efficiently approximating the (sub) gradients of the objective function of this optimization problem is quite delicate, and is the main technical challenge in this work.

## Authors

• 6 publications
• 64 publications
• 17 publications
• 20 publications
• 33 publications
12/13/2018

### A Polynomial Time Algorithm for Maximum Likelihood Estimation of Multivariate Log-concave Densities

We study the problem of computing the maximum likelihood estimator (MLE)...
03/10/2020

### Exact Solutions in Log-Concave Maximum Likelihood Estimation

We study probability density functions that are log-concave. Despite the...
06/26/2018

### Maximum Likelihood Estimation for Totally Positive Log-Concave Densities

We study nonparametric density estimation for two classes of multivariat...
04/06/2020

### The Bethe and Sinkhorn Permanents of Low Rank Matrices and Implications for Profile Maximum Likelihood

In this paper we consider the problem of computing the likelihood of the...
11/12/2021

### Convergence Rates for the MAP of an Exponential Family and Stochastic Mirror Descent – an Open Problem

We consider the problem of upper bounding the expected log-likelihood su...
10/17/2019

### Calculating Optimistic Likelihoods Using (Geodesically) Convex Optimization

A fundamental problem arising in many areas of machine learning is the e...
03/29/2018

### Computationally efficient likelihood inference in exponential families when the maximum likelihood estimator does not exist

In a regular full exponential family, the maximum likelihood estimator (...
##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

A distribution on

is log-concave if the logarithm of its probability density function is concave:

###### Definition 1 (Log-concave Density).

A probability density function , , is called log-concave if there exists an upper semi-continuous concave function such that for all . We will denote by the set of upper semi-continuous, log-concave densities with respect to the Lebesgue measure on .

Log-concave densities form a broad nonparametric family encompassing a wide range of fundamental distributions, including the uniform, normal, exponential, logistic, extreme value, Laplace, Weibull, Gamma, Chi and Chi-Squared, and Beta distributions (see, e.g.,

Bagnoli and Bergstrom (2005)

). Log-concave probability measures have been extensively investigated in several scientific disciplines, including economics, probability theory and statistics, computer science, and geometry (see, e.g.,

Stanley (1989); An (1995); Lovász and Vempala (2007); Walther (2009); Saumard and Wellner (2014)). The problem of density estimation

for log-concave distributions is of central importance in the area of non-parametric estimation (see, e.g.,

Walther (2009); Saumard and Wellner (2014); Samworth (2017)) and has received significant attention during the past decade in statistics Cule et al. (2010); Dumbgen and Rufibach (2009); Doss and Wellner (2016); Chen and Samworth (2013); Kim and Samworth (2016); Balabdaoui and Doss (2018); Han and Wellner (2016) and computer science Chan et al. (2013, 2014a); Acharya et al. (2017); Canonne et al. (2016); Diakonikolas et al. (2016d, 2017); Carpenter et al. (2018).

One reason the class of log-concave distributions has attracted this attention, both from the theoretical and practical communities, is that log-concavity is a very natural “shape constraint,” which places significantly fewer assumptions on the distribution in question than most parameterized classes of distributions. In extremely high-dimensional settings when the amount of available data is not too much larger than the dimensionality, fitting a multivariate Gaussian (or some other parametric distribution) to the data might be all one can hope to do. For many practical settings, however, the dimensionality is modest (e.g., 5-20) and the amount of data is significantly larger (e.g., hundreds of thousands or millions). In such settings, making a strong assumption on the parametric form of the underlying distribution is unnecessary—there is sufficient data to fit a significantly broader class of distributions, and log-concave distributions are one of the most natural such classes. From a practical perspective, even in the univariate setting, computing the log-concave density that maximizes the likelihood of the available data is a useful primitive, with the R implementation of Rufibach and Duembgen having over 39,000 downloads Dümbgen and Rufibach (2011). As we discuss below, the amount of data required to learn a log-concave distribution scales exponentially in the dimension, in contrast to most parametric classes of distributions. Nevertheless, for the many practical settings with modest dimensionality and large amounts of data, there is sufficient data to learn. The question now is computational: how does one compute the best-fit log-concave distribution? We focus on this algorithmic question:

Is there an efficient algorithm to compute the log-concave MLE for datapoints in ?

Obtaining an understanding of the above algorithmic question is of interest for a number of reasons. First, the log-concave MLE is the prototypical statistical estimator for the class, is fully automatic (in contrast to kernel-based estimators, for example), and was very recently shown to achieve the minimax optimal sample complexity for the task of learning a log-concave distribution (up to logarithmic factors) Carpenter et al. (2018); Dagan and Kur (2019). The log-concave MLE also has an intriguing geometry that is of interest from a purely theoretical standpoint Cule et al. (2010); Robeva et al. (2017). Developing an efficient algorithm for computing the log-concave MLE is of significant theoretical interest, and would also allow this general non-parametric class of distributions to be leveraged in the many practical settings where the dimensionality is moderate and the amount of data is large. We refer the reader to the recent survey Samworth (2017) for a more thorough justification for why the log-concave MLE is a desirable distribution to compute.

### 1.1 Our Results and Techniques

The main result of this paper is the first efficient algorithm to compute the multivariate log-concave MLE. For concreteness, we formally define the log-concave MLE:

###### Definition 2 (Log-concave MLE).

Let . The log-concave MLE, , is the density which maximizes the log-likelihood over .

As shown in Cule et al. (2010), the log-concave MLE exists and is unique. Our main result is the first efficient algorithm to compute it up to any desired accuracy.

###### Theorem 1 (Main Result).

Fix and . There is an algorithm that, on input any set of points in , and , , runs in time and with probability at least outputs a succinct description of a log-concave density such that

Our algorithm does not require that the input points in are i.i.d. samples from a log-concave density, i.e., it efficiently solves the MLE optimization problem for any input set of points. We also note that the succinct output description of allows for both efficient evaluation and efficient sampling. That is, we can efficiently approximate the density at a given point (within multiplicative accuracy), and efficient sample from a distribution that is close in total variation distance.

Recent work Carpenter et al. (2018); Dagan and Kur (2019) has shown that the log-concave MLE is minimax optimal, within a logarithmic factor, with respect to squared Hellinger distance. In particular, the minimax rate of convergence with samples is . Combining this sample complexity bound with our Theorem 1, we obtain the first sample near-optimal and computationally efficient proper learning algorithm for multivariate log-concave densities. See Theorem 4 in Appendix B.

##### Technical Overview

Here we provide an overview of our algorithmic approach. Notably, our algorithm does not require the assumption that the input points are samples from a log-concave distribution. It runs in on any set of input points and outputs an -accurate solution to the log-concve MLE. Our algorithm proceeds by convex optimization: We formulate the problem of computing the log-concave MLE of a set of points in as a convex optimization problem that we solve via an appropriate first-order method. It should be emphasized that one needs to overcome several non-trivial technical challenges to implement this plan.

The first difficulty lies in choosing the right (convex) formulation. Previous work Cule et al. (2010) has considered a convex formulation of the problem that inherently fails, i.e., it cannot lead to a polynomial time algorithm. Given our convex formulation, a second difficulty arises: we do not have direct access to the (sub-)gradients of the objective function and the naive algorithm to compute a subgradient at a point takes exponential time. Hence, a second challenge is how to obtain an efficient algorithm for this task. One of our main contributions is a randomized polynomial time algorithm to approximately compute a subgradient of the objective function. Our algorithm for this task leverages structural results on log-concave densities established in Carpenter et al. (2018) combined with classical algorithmic results on approximating the volume of convex bodies and uniformly sampling from convex sets Kannan et al. (1997); Lovász and Vempala (2006c, b).

We now proceed to explain our convex optimization formulation. Our starting point is a key structural property of the log-concave MLE, shown in Cule et al. (2010): The logarithm of the log-concave MLE , is a “tent” function, whose parameters are the values of the log density at the input points , and whose log-likelihoods correspond to polyhedra. Our conceptual contribution lies in observing that while tent distributions are not an exponential family, they “locally” retain many properties of exponential families (Definition 4). This high-level similarity can be leveraged to obtain a convex formulation of the log-concave MLE that is similar in spirit to the standard convex formulation of the exponential family MLE Wainwright et al. (2008). Specifically, we seek to maximize the log-likelihood of the probability density function obtained by normalizing the log-concave function whose logarithm is the convex hull of the log densities at the samples. This objective function is a concave function of the parameters, so we end up with a (non-differentiable) convex optimization problem. The crucial observation is that the subgradient of this objective at a given point is given by an expectation under the current hypothesis density at .

Given our convex formulation, we would like to use a first-order method to efficiently find an -approximate optimum. We note that the objective function is not differentiable everywhere, hence we need to work with subgradients. We show that the subgradient of the objective function is bounded in -norm at each point, i.e., the objective function is Lipschitz. Another important structural result (Lemma 2) allows us to essentially restrict the domain of our optimization problem to a compact convex set of appropriately bounded diameter . This is crucial for us, as the diameter bound implies an upper bound on the number of iterations of a first-order method. Given the above, we can in principle use a projected subgradient method to find an approximate optimum to our optimization problem, i.e., find a log-concave density whose log-likelihood is -optimal.

It remains to describe how we can efficiently compute a subgradient of our objective function. Note that the log density of our hypothesis can be considered as an unbounded convex polytope. The previous approach to calculate the subgradient in Cule et al. (2010) relied on decomposing this polytope into faces and obtaining a closed form for the underlying integral over these faces (that gives their contribution to the subgradient). However, this convex polytope is given by vertices in dimensions, and therefore the number of its faces can be . So, such an algorithm cannot run in polynomial time.

Instead, we note that we can use a linear program (see proof of Lemma

1) to evaluate a function proportional to the hypothesis density at a point in time polynomial in and

. To use this oracle for the density in order to produce samples from the hypothesis density, we use Markov Chain Monte Carlo (MCMC) methods. In particular, we use MCMC to draw samples from the uniform distribution on super-level sets and estimate their volumes. With appropriate rejection sampling, we can use these samples to obtain samples from a distribution that is close to the hypothesis density. See Lemma

3. (We note that it does not suffice to simply run a standard log-concave density sampling technique such as hit-and-run Lovász and Vempala (2006a). These random walks require a hot start which is no easier than the sampling technique we propose.)

Since the subgradient of the objective can be expressed as an expectation over this density, we can use these samples to sample from a distribution whose expectation is close to a subgradient. We then use stochastic subgradient descent to find an approximately optimal solution to the convex optimization problem. The hypothesis density this method outputs has log-likelihood close to the maximum.

### 1.2 Related Work

There are two main strands of research in density estimation. The first one concerns the learnability of high-dimensional parametric distributions, e.g., mixtures of Gaussians. The sample complexity of learning parametric families is typically polynomial in the dimension and the challenge is to design computationally efficient algorithms. The second research strand — which is the focus of this paper — considers the problem of learning a probability distribution under various non-parametric assumptions on the shape of the underlying density, typically focusing on the univariate or small constant dimensional regime. There has been a long line of work in this vein within statistics since the 1950s, dating back to the pioneering work of

Grenander (1956) who analyzed the MLE of a univariate monotone density. Since then, shape constrained density estimation has been an active research area with a rich literature in mathematical statistics and, more recently, in computer science. The reader is referred to Barlow et al. (1972) for a summary of the early work and to Groeneboom and Jongbloed (2014) for a recent book on the subject.

The standard method used in statistics for density estimation problems of this form is the MLE. See Brunk (1958); Rao (1969); Wegman (1970); Hanson and Pledger (1976); Groeneboom (1985); Birgé (1987a, b); Fougères (1997); Chan and Tong (2004); Balabdaoui and Wellner (2007); Jankowski and Wellner (2009); Dumbgen and Rufibach (2009); Balabdaoui et al. (2009); Gao and Wellner (2009); Balabdaoui and Wellner (2010); Koenker and Mizera (2010); Walther (2009); Chen and Samworth (2013); Kim and Samworth (2016); Balabdaoui and Doss (2018); Han and Wellner (2016); Carpenter et al. (2018) for a partial list of works analyzing the MLE for various distribution families. During the past decade, there has been a body of algorithmic work on shape constrained density estimation in computer science with a focus on both sample and computational efficiency Daskalakis et al. (2012a, b, 2013); Chan et al. (2013, 2014a, 2014b); Acharya et al. (2015, 2017); Diakonikolas et al. (2016a, b); Daskalakis et al. (2016); Diakonikolas et al. (2016c, 2017, 2018a). The majority of this literature has studied the univariate (one-dimensional) setting which is by now fairly well-understood for a wide range of distributions. On the other hand, the multivariate setting is significantly more challenging and wide gaps in our understanding remain even for .

For the specific problem of learning a log-concave distribution, a line of work in statistics Cule et al. (2010); Dumbgen and Rufibach (2009); Doss and Wellner (2016); Chen and Samworth (2013); Balabdaoui and Doss (2018) has characterized the global consistency properties of the log-concave multivariate MLE. Regarding finite sample bounds, Kim and Samworth (2016); Dagan and Kur (2019) gave a sample complexity lower bound of for that holds for any estimator, and Kim and Samworth (2016) gave a near-optimal sample complexity upper bound for the log-concave MLE for . Diakonikolas et al. (2017)

established the first finite sample complexity upper bound for learning multivariate log-concave densities under global loss functions. Their estimator (which is different than the MLE and seems hard to compute in multiple dimensions) learns log-concave densities on

within squared Hellinger loss with samples. Carpenter et al. (2018) showed a sample complexity upper bound of for the multivariate log-concave MLE with respect to squared Hellinger loss, thus obtaining the first finite sample complexity upper bound for this estimator in dimension . Building on their techniques, this bound was subsequently improved in  Dagan and Kur (2019) to a near-minimax optimal bound of . Alas, the computational complexity of the log-concave MLE has remained open in the multivariate case. Finally, we note that a recent work De et al. (2018) obtained a non-proper estimator for multivariate log-concave densities with sample complexity (i.e., at least quadratic in that of the MLE) and runtime .

On the empirical side, recent work Rathke and Schnörr (2018) proposed a non-convex optimization approach to the problem of computing the log-concave MLE, which seems to exhibit superior performance in practice in comparison to previous implementations (scaling to

or higher dimensions). Unfortunately, their method is of a heuristic nature, in the sense that there is no guarantee that their solution will converge to the log-concave MLE.

The present paper is a merger of two independent works Axelrod and Valiant (2018); Diakonikolas et al. (2018b), proposing essentially the same algorithm to compute the log-concave MLE. Here we provide a unified presentation of these works with an arguably conceptually cleaner analysis.

## 2 Preliminaries

Notation. We denote by the sequence of samples. We denote by the convex hull of , and by the

matrix with columns vectors

. We write for the all-ones vector of the appropriate length. For a set , denotes the indicator function for .

Tent Densities. We start by defining tent functions and tent densities:

###### Definition 3 (Tent Function).

For and a set of points in , we define the tent function as follows:

 hX,y(x)={max{z∈R such that (x,z)∈Conv({(Xi,yi)}ni=1)} if x∈Sn−∞ if x∉Sn

The points are referred to as tent poles. (See Figure 1 for the graph of an example tent function.)

Let with chosen such that integrates to one. We refer to as a tent density and the corresponding distribution as a tent distribution. Note that the support of a tent distribution must be within the convex hull of . For the remainder of the paper, we choose a scaling such that . This scaling is arbitrary, and has no significant effect on either the algorithm or its analysis.

Tent densities are notable because they contain solutions to the log-concave MLE Cule et al. (2010). The solution to the log-concave MLE over is always a tent density, because tent densities with tent poles are the minimal log-concave functions with log densities at points .

The algorithm which we present can be thought of as an optimization over tent functions. In Section 3.1, we will show that tent distributions retain important properties of exponential families which will be useful to establish the correctness of our algorithm.

Regular Subdivisions. Given a tent function with , its associated regular subdivision of is a collection of subsets of whose convex hulls are the regions of linearity of . See Figure 1 for an illustration of a tent function and its regular subdivision. We refer to these polytopes of linearity as cells. We say that is a regular triangulation of if every cell is a dimensional simplex.

It is helpful to think of regular subdivisions in the following way: Consider the hyperplane

in obtained by fixing the last coordinate. Consider the function as a polytope and project each face onto . Each cell is a projection of a face, and together the cells partition the convex hull of . Observe that regular subdivisions may vary with . Figure 2 provides one example of how changing the vector changes the regular subdivision.

For a given regular triangulation , the associated consistent neighborhood is the set of all , such that . That is, consistent neighborhoods are the sets of parameters where the regular triangulation remains fixed. Note that these neighborhoods are open and their closures cover the whole space. See Figure 2 for an example of how crossing between consistent neighborhoods results in different subdivisions. We note that for fixed , when is chosen in general position, is always a regular triangulation.

## 3 Locally Exponential Convex Programs

In this section, we lay the foundations for the algorithm presented in the next section. We present the “locally" exponential form of tent distributions and show it has the necessary properties to enable efficient computation of the log-concave MLE. Though they form a broader class of distributions, “locally" exponential distributions share some important properties of exponential families. Namely, the log-likelihood optimization is convex, and the expectation of the sufficient statistic is a subgradient. This will allows us to formulate a convex program which we will be able to solve in polynomial time.

###### Definition 4.

Let be some function (possibly parametrized by ) and let be a family of probability densities parametrized by with acting to normalize the density so it integrates to . We say that the family is locally-exponential if the following hold: (1) is convex in , and (2) .

Note that the above definition differs from an exponential family in that for exponential families may not depend on .

In this section, we derive a sufficient statistic, the polyhedral statistic, that shows that tent distributions are in fact locally exponential. More formally, we show:

###### Lemma 1.

For tent poles , there exists a function (the polyhedral statistic) such that corresponds to the family of tent-distributions such that is locally exponential. Furthermore, is computable in time .

Since we know that the log-concave MLE is a tent distribution, and all tent-distributions are log-concave, we know that the optimum of the maximum likelihood convex program in Equation (3.1) corresponds to the log-concave MLE.

 MLE of tents =maxy ∑ihX,y(Xi)−log∫exphX,y(x)dx=maxy ∑iyi−A(y) (3.1)

Combining the above with the fact that the sufficient statistic allows us to compute the stochastic subgradient suggests that Algorithm 1 can compute the log-concave MLE in polynomial time.

### 3.1 The Polyhedral Sufficient Statistic

Consider a regular triangulation corresponding to tent distribution parametrized by and . The polyhedral statistic is the function

 TX,y(x):Sn→[0,1]n,

that expresses as a convex combination of corners of the cell containing in . That is where and if is not a corner of the cell containing . The polyhedral statistic gives an alternative way of writing tent functions and tent densities:

 hX,y(x)=⟨Ty(x),y⟩          pX,y(x)=exp(⟨Ty(x),y⟩).

If we restrict such that and define , then we can see that for every consistent neighborhood we have an exponential family of the form

 exp(⟨Ty(x),θ⟩−A(y)) % for θ∈NΔ. (3.2)

While Equation (3.2) shows how subsets of tent distributions are exponential families, it also helps highlight why tent distributions are not an exponential family. The sufficient statistic depends on through the regular subdivision. This means that tent distributions do not admit the same factorized form as exponential families since the sufficient statistic depends on .

Note that we can use any ordering of to define the polyhedral sufficient statistic everywhere including on regular subdivisions that are not regular triangulations. Also note that, assuming that no , eliminating the last coordinate using the constraint makes each exponential family minimal. In other words, over regions where the regular subdivision does not change (for example the consistent neighborhoods), tent distributions are minimal exponential families. This means the set of tent distribution can be seen as the finite union of a set of minimal exponential families. We refer to Equation (3.3) as the exponential form for tent densities:

 pX,y(x)=exp(⟨TX,y(x),y⟩−A(y))1Sn(x). (3.3)

Both the polyhedral statistic and tent density queries can be computed in polynomial time with the packing linear program presented in Equation (3.4). For a point , the value of yields the log-density and the vector corresponds to polyhedral statistic.

 maxy s.t. (x,y)=∑iαi(Xi,yi),∑iαi=1,αi≥0 (3.4)

Note that the above combined with tent distributions being exponential families on consistent neighborhoods gives us that the properties from Lemma 1 hold true on consistent neighborhoods. We extend the proof to the full result below.

###### Proof.

Convexity follows by iteratively applying known operations that preserve convexity of a function. Since a sum of convex functions is convex (see, e.g., page 79 of  Boyd and Vandenberghe (2004)), it suffices to show that the function is convex. Since is a convex function of , by definition, is log-convex as a function of . Since an integral of log-convex functions is log-convex (see, e.g., page 106 of  Boyd and Vandenberghe (2004)), it follows that is log-convex. Therefore, is convex. We have therefore established that Equation (3.1) is convex, as desired.

: Note that when is in the interior of a consistent neighborhood, the polyhedral statistic LP has a unique solution and (by Fact 3). When is on the boundary the solution set to the LP corresponds to the convex hull of solutions corresponding to each adjacent consistent neighborhood. This corresponds to the convex hull of limiting gradients from each neighboring consistent neighborhood and is the set of subgradients. ∎

## 4 Algorithm and Detailed Analysis

Recall that we compute the log-concave MLE via a first-order method on the optimization formulation presented in Equation (3.1). The complete method is presented in Algorithm 1. The algorithm is based on the stochastic gradient computation presented in the previous section, a standard application of the stochastic gradient method, and a sampler to be described later in this section.

### 4.1 Analysis

We now provide the main technical ingredients used to prove Theorem 1. Specifically, we bound the rate of convergence of the stochastic subgradient method, and we provide an efficient procedure for sampling from a log-concave distribution.

Recall that algorithm 1 is simply applying the stochastic subgradient method to the following convex program with : .

We will require a slight strengthening of the following standard result, see, e.g., Theorem 3.4.11 in Duchi (2016):

###### Fact 1.

Let be a compact convex set of diameter . Suppose that the projections are efficiently computable, and there exists such that for all we have that for all stochastic subgradients. Then, after iterations of the projected stochastic subgradient method (for appropriate step sizes), with probability at least , we have that where .

We note that Fact 1 assumes that, in each iteration, we can efficiently calculate an unbiased stochastic subgradient, i.e., a vector such that . Unfortunately, this is not the case in our setting, because we can only approximately sample from log-concave densities. However, it is straightforward to verify that the conclusion of Fact 1 continues to hold if in each iteration we can compute a random vector such that , for some . This slight generalization is the basic algorithm we use in our setting.

We now return to the problem at hand. We note that since represents the coefficients of a convex combination for all , bounding by 1.

Lemma 2 will show that . This implies that if we let and run SGD for iterations, the resulting point will have objective value within of the log-concave MLE.

###### Lemma 2.

Let be a set of points in and be the corresponding log-concave MLE. Then, we have that . Converting to an norm yields a bound on the diameter of :

Let us briefly sketch the proof of Lemma 2. The main idea is to show that if were too high, then would have a lower likelihood than the uniform distribution on the convex hull of the samples . More specifically, if the maximum value of the density is large, then the volume of the set is small. For a fixed , this set contains and thus must be large compared to . Since has likelihood at least as high as the uniform distribution over , must be small compared to . Combining these two observations yields a bound on .

We now proceed with the complete proof.

###### Proof of Lemma 2.

Let be the volume of the convex hull of the sample points and be the maximum pdf value of the MLE. By basic properties of the log-concave MLE (see, e.g., Theorem 2 of Cule et al. (2010)), we have that for all and for all . Moreover, by the definition of a tent function, it follows that attains its global maximum value and its global non-zero positive value in one of the points .

We can assume without loss of generality that is not the uniform distribution on , since otherwise and the lemma follows. Under this assumption, we have that or , which implies that . The following fact bounds the volume of upper level sets of any log-concave density:

###### Fact 2 (see, e.g., Lemma 8 in  Carpenter et al. (2018)).

Let with maximum value . Then for all , we have .

By Fact 2 applied to the MLE , for , we get that . Since the pdf value of at any point in the convex hull is at least that of the smallest sample point , i.e., , it follows that is contained in . Therefore,

 V≤(lnR∞)d/M. (4.1)

On the other hand, the log-likelihood of is at least the log-likelihood of the uniform distribution on . Since at least one sample point has pdf value and the other sample points have pdf value , we have that

 ln(M/R∞)+(n−1)lnM≥ℓ(ˆfn)≥ℓ(USn)=nln(1/V),

or , and therefore . This gives that

 R1/n∞≤MV. (4.2)

Combining (4.1) and (4.2) gives

 R∞≤(lnR∞)nd. (4.3)

Since , , setting gives that or

 (lnR∞)nd<(2nd)nd⋅R1/2∞. (4.4)

By (4.3) and (4.4) we deduce that or

 R∞≤(2nd)2nd.

This completes the proof of Lemma 2. ∎

#### 4.1.2 Efficient Sampling and Log-Partition Function Evaluation

In this section, we establish the following result, which gives an efficient algorithm for sampling from the log-concave distribution computed by our algorithm.

###### Lemma 3 (Efficient Sampling).

There exist algorithms and satisfying the following: Let , let , let be a parameter of a tent-density in exponential form. Then the following conditions hold:

• On input , , , and , algorithm outputs a random vector , distributed according to some probability distribution with density , such that , in time , with probability at least .

• On input , , , and , algorithm outputs some , such that , in time , with probability at least .

The algorithm used in the proof of Lemma 3 is concerned mainly with part (1) in its statement. The pseudocode of this sampling procedure is given in Algorithm 2.

Using the notation from Algorithm 2, part (2) is easier to describe and we thus omit the pseudocode. We note that the following exposition of Algorithm 2 assumes that the input vector is bounded. In the execution of Algorithm 1, is bounded linearly by the number of SGD iterates. Thus, the dependence of the sampling runtime on increases the overall runtime by at most a polynomial.

We now present the proof of Lemma 3. The pseudocode of the sampling procedure is given in Algorithm 2. As stated in Section 4.1.2, Algorithm 2 uses subroutines for approximating the volume of a convex body given by a membership oracle, and a procedure for sampling from the uniform distribution supported on such a body. For these procedures we use the algorithms by Kannan et al. (1997), which are summarized in Theorems 2 and 3 respectively.

###### Theorem 2 (Kannan et al. (1997)).

The volume of a convex body in , given by a membership oracle, can be approximated to within a relative error of with probability using

 d5⋅poly(logd,1/δ,log(1/τ))

oracle calls.

###### Theorem 3 (Kannan et al. (1997)).

Given a convex body , with oracle access, and some , we can generate a random point that is distributed according to a distribution that is at most away from uniform in total variation distance, using

 d5⋅poly(logd,1/δ)

oracle calls.

For all , , and , we use the notation .

In order to use the algorithms in Theorems 2 and 3 in our setting, we need a membership oracle for the superlevel sets of the function . Such an oracle can clearly be implemented using the LP (3.4). We also need a separation oracle for these superlevel sets, which is given in the following lemma:

###### Lemma 4 (Efficient Separation).

There exists a time separation oracle for the superlevel sets of .

###### Proof.

To construct our separation oracle, we will rely on the covering LP that is dual to the packing LP used to evaluate a tent function. The dual to the packing LP looks for the hyperplane that is above all the that has minimal at . More specifically, it is the following LP:

 minimizeβ0+∑dj=1βjxjsubject toβ∈Rd+1,β0+∑dj=1βjXi,j≥yi,i∈[n], (4.5)

where is the -th coordinate of the vector . Now suppose that we are interested in a super level set . We can use the above LP to compute (and thus ) and check if it is in the superlevel set. Suppose that it is not, then there will be a solution whose value is below , say for some . Consider an in the halfspace which has in the interior. Since does not appear in the objective, is a feasible solution for the dual LP (4.5) with , and so , which implies that is not in the superlevel set. Therefore, is a separating hyperplane for and the level set. This completes the proof. ∎

Given all of the above ingredients, we are now ready to prove the main result of this section.

###### Proof of Lemma 3.

We first prove part (1) of the assertion. To that end we analyze the sampling procedure described in Algorithm 2. Recall that , and for any , we define the superlevel set

 Li={x∈Rd:HX,y(x)≥MHX,y⋅2−i}.

For any recall that

 GX,y(x)=MHX,y2−⌊log2(MHX,y/HX,y(x))⌋.

For any , let be the indicator function for . It is immediate that for all ,

 GX,y(x) =MHX,y∞∑i=12−iχLi(x) =MHX,ym∑i=12−iχLi(x)+2−mχLi(m) (since HX,y(x)=0 for all x∉Lm)

Let

 c=m∑i=12−ivol(Li)+2−mvol(Lm).

We have

 ∫RdGX,y(x)dx =MHX,y(m∑i=12−ivol(Li)+2−mvol(Lm))=MHX,yc. (4.6)

Let

 ˆGX,y(x)=GX,y(x)/(MHX,yc).

It follows by (4.6) that is a probability density function.

Let be the probability distribution on , where

 PrI∼D[I=i]={vol(Li)⋅2−i/c if i∈{1,…,m−1}2⋅vol(Lm)⋅2−m/c if i=m

For any , let be the uniform probability density function on . To sample from , we can first sample , and then sample .

Recall that is the probability density function obtained by normalizing ; that is, for all we have

 pX,y(x)=HX,y(x)/c′,

where

 c′=∫RdHX,y(x)dx.

Consider the following random experiment: first sample , and then accept with probability

; conditioning on accepting, the resulting random variable

is distributed according to . Note that since for all , , it follows that we always accept with probability at least . Let be the probability of accepting. Then

 α =∫RdˆGX,y(x)(HX,y(x)/GX,y(x))dx,

and thus

 ∫RdHX,y(x)dx =∫RdGX,y(x)(HX,y(x)/GX,y(x))dx =MHX,yc∫RdˆGX,y(x)(HX,y(x)/GX,y(x))dx =MHX,ycα. (4.7)

By Theorem 2, for each , we compute an estimate, , to , to within relative error , using oracle calls, with probability at least , where , for some constant to be determined; moreover, by Theorem 3, we can efficiently sample, using oracle calls, from a probability distribution with . Each of these oracle calls is a membership query in some superlevel set of . This membership query can clearly be implemented if we can compute that value at the desired query point , which can be done in time using LP (3.4). Thus, each oracle call takes time . Let

 ˜c =m∑i=12−i˜vol(Li)+2−m˜vol(Lm). (4.8)

Since for all , , it is immediate that

 c/(1+δ)≤˜c≤c(1+δ).

Recall that Algorithm 2 uses the probability distribution on , where

 PrI∼˜D[I=i]={˜vol(Li)⋅2−i/˜c if i∈{1,…,m−1}2⋅˜vol(Lm)⋅2−m/˜c if i=m

Consider the following random experiment, which corresponds to Steps 5–6 of Algorithm 2: We first sample , and then we sample . The resulting random vector is distributed according to

 ˜GX,y(x)=1˜c(m∑i=12−i˜vol(Li)˜ui(x)+2−m˜vol(Lm)˜um(x)).

Next, consider the following random experiment, which captures Steps 5–8 of Algorithm 2: We sample , and we accept with probability . Let be the resulting probability density function supported on obtained by conditioning the above random experiment on accepting. Let be the acceptance probability. We have

 ˜α=∫Rd(HX,y(x)/GX,y(x))˜G(x)dx.

We have

 ∥D