 # The algorithm by Ferson et al. is surprisingly fast: An NP-hard optimization problem solvable in almost linear time with high probability

Ferson et al. (Reliable computing 11(3), p. 207--233, 2005) introduced an algorithm for the NP-hard nonconvex quadratic programming problem called MaxVariance motivated by robust statistics. They proposed an implementation with worst-case time complexity O(n^2 · 2^ω), where ω is the largest clique in a certain intersection graph. First we show that with a careful implementation the complexity can be improved to O(n n + n· 2^ω). Then we treat input data as random variables (as it is usual in statistics) and introduce a natural probabilistic data generating model. We show that ω = O( n/ n) on average under this model. As a result we get average computing time O(n^1+ε) for ε > 0 arbitrarily small. We also prove the following tail bound on computation time: the instances, forcing the algorithm to compute in exponential time, occur rarely, with probability tending to zero faster than exponentially with n →∞.

Comments

There are no comments yet.

## Authors

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction and motivation

### 1.1 Problem formulation

Ferson, Ginzburg, Kreinovich, Longpré and Aviles  studied the pair of optimization problems

 minx∈RnV(x)s.t.x––≤x≤¯¯¯x, (1) maxx∈RnV(x)s.t.x––≤x≤¯¯¯x, (2)

where and are given input data and

 V(x)\coloneqq1nn∑i=1(xi−1nn∑j=1xj)2.

It is obvious that Eq. 1 is a convex quadratic program (CQP) solvable in polynomial time, while Eq. 2 is easily proven to be NP-hard. It is worth noting that a general CQP solver yields a weakly polynomial algorithm for Eq. 1, but Ferson et al.  introduced a strongly polynomial method.

They also introduced a method for solving Eq. 2 which works in exponential time in the worst case (not surprisingly). The method will be described in Section 2. Abbreviating the names of the authors, we will refer to their method as FGKLA algorithm.

### 1.2 Summary of results

In this text we focus on the NP-hard case Eq. 2, called MaxVariance, and the FGKLA algorithm. Our contribution is twofold.

#### Improving the worst-case complexity of the FGKLA algorithm

We show that there exists an implementation of the FGKLA algorithm working in time

 O(nlogn+n⋅2ω), (3)

where is the size of the largest clique in a certain intersection graph introduced in Definition 1. This improves the bound from the original paper, see also Remark 1.

#### Proving a “good” behaviour in a probabilistic setting

Then we treat the input data as random variables. We introduce a natural and fairly general probabilistic model (details are in Section 3), under which we show that

1. [label=()]

2. on average, the algorithm works in time

 O(n1+ε)for all ε>0, (4)
3. the probability that the algorithm computes in exponential time tends to zero faster than exponentially with . In other words, we show that the “hard” instances occur indeed rarely.

More specifically: 1 we prove that under the probabilistic model it holds

 Eω=O(lognloglogn), (5)

where stands for the expected value of . Combining Eq. 5 with Eq. 3 yields Eq. 4 as and for any (in the entire text, “” stands for the natural logarithm).

To achieve 2: from Eq. 3 it follows that the computing time is exponential when with . We prove that

 Pr[ω≥δn]≤e−nloglognfor every δ>0\ and a sufficiently large n.

This shows that the “hard” instances, forcing the FGKLA algorithm to compute in exponential time, occur rarely, with probability tending to zero faster than exponentially.

### 1.3 Motivation from statistics

Problems Eqs. 2 and 1 are studied in statistics; see e.g.  and references therein, and a pseudopolynomial method in 

. The statistical motivation is as follows: we are interested in sample variance

of a dataset . However, the data is not observable. What is available instead is a collection of intervals , , such that (for example, instead of the exact values we have rounded versions only). Then, cannot be computed exactly, but we can get tight bounds for in the form Eqs. 2 and 1. In econometrics, this phenomenon is sometimes called partial identification .

The problem is more general and is studied for various statistics in place of in Eqs. 2 and 1, see the reference books [6, 8].

### 1.4 Related work

In general, this paper contributes to the analysis of complexity of optimization problems and algorithms when input data can be assumed to be random, drawn from a particular distribution or a class of distributions. As a prominent example recall the famous average-time analysis of the Simplex Algorithm , , where the phenomenon “exponential in the worst case but fast on average” has been studied since 1980’s.

The phenomenon is particularly interesting in case of NP-hard problems since the exponential time at worst case seems to be unavoidable. From the areas related to our work, we mention average-case complexity studies of the well-known NP-complete -clique problem: Rossman  derived the bounds of average-case complexity of the -clique problem on monotone circuits. His results were subsequently followed by Fountoulakis et al.  in a study whether the “hard” instances occur frequently or rarely under a probability setup. The result is in some sense similar to our one: if the probability of edge between two vertices comes from “natural” distribution function, then the deterministic algorithms for -clique problem work in polynomial time with “high” probability, i.e. “hard” instances occur with probability smaller than any nonnegative polynomial in the number of vertices.

## 2 FGKLA Algorithm

Recall that the input instance is given by the pair and . Compact intervals will be denoted in boldface, e.g. . For define

 x∗i\coloneqq12(x––i+¯¯¯xi),xΔi\coloneqq12(¯¯¯xi−x––i),x1/ni\coloneqq[x∗i−1nxΔi,x∗i+1nxΔi].

The numbers are referred to as center and radius of , respectively, and is called a narrowed interval (i.e., shrunk by factor around its center). For we define (the mean of ).

Our version of the FGKLA algorithm is summarized as Algorithm 1. The main result of this section is Theorem 1. In particular, it improves the worst-case complexity bound from  (see also Remark 1). The proof of Theorem 1 will be given in Section 2.1.

###### Theorem 1 (properties of the FGKLA algorithm (Algorithm 1))
• The FGKLA algorithm correctly solves (2).

• Let be an undirected graph where if and only if (here, ). Let be the size of the largest clique in . Then, FGKLA algorithm works in time .

###### Definition 1

The graph from Theorem 1 is referred to as FGKLA intersection graph with data .

### 2.1 Idea of the FGKLA algorithm

Since the quadratic form is positive semidefinite, the maximum of Eq. 2 is attained in a vertex (also called extremal point) of the feasible set

 x={x | x––≤x≤¯¯¯x}.

There are vertices in total. In a vertex we have for every . FGKLA algorithm reduces the number of vertices to be examined from to . The reduction is based on Lemma 1 (see [4, section 6] for proof).

###### Lemma 1 ()

Let and let there exist such that

1. for all and

2. one of the following is satisfied:

1. , and ,

2. , and .

Then .

###### Corollary 1

Let be a maximizer and . Let

be the set of all vectors

satisfying:

• if ,

• if , and

• if .

Then contains a maximizer.

In cases (a) and (b) we say that variable (or index ) is fixable with respect to ; in case (c), variable (or index ) is free with respect to .

Algorithm 1 works as follows. It builds the set (creftype 1) containing all endpoints of the narrowed intervals , and denotes them (creftype 2). Consider the set of all possible means. The endpoints from divide the interval into at most regions

 [a0\coloneqqμ[x––],a1),(a1,a2),…,(am−1,am),(am,am+1\coloneqqμ[¯¯¯x]].

Thanks to Corollaries 1 and 1, every region contains means with the same set of free indices. For a region , we denote this set by , i.e. . The set of endpoints contains the worst possible mean values with respect to the number of free indices. More precisely: for every , , all indices from are free.

On creftype 5, we examine the vertex . The value of is computed and stored as , the maximal value of found so far. Variables and will be useful in Algorithm 2.

Then, Algorithm 1 takes means one by one. For every mean, say , it takes the set of indices of narrowed intervals beginning in and inserts it to the set of free indices with respect (creftype 7). Indices are fixable with respect to . This yields candidate extremal points that are examined by Algorithm 2, called on creftype 8.

Then, indices from the set of narrowed intervals ending in are removed from . Intervals with these indices will be fixed to the lower endpoint for every upcoming (creftype 9 of Algorithm 1). The update of and will be explained later.

Algorithm 2 consecutively traverses all extremal points (vertices of ) resulting from fixing either or for the free indices . For every such vertex, say , the variance is computed. To make these computations cheap, the traversal of is performed in a way that two successive extremal points differ in just one component. Then Lemma 2 shows how to get from in arithmetic operations. The variance is stored indirectly as variables and ; they can be easily updated when is switched to , or vice versa.

###### Lemma 2

For , we have , where and . Furthermore, if differs from in just one component, say th, then

 V(x′)=V1+1n((x′)2i−x2i)−(V2+1n(x′i−xi))2.

Algorithm 2 is an adaptation of the algorithm from [10, pg. 37] for enumeration of elements of the set for a given . The enumeration can in general start from an arbitrary element. The proof of correctness can be found therein. In our variant, the variable indicates the current extremal point. In every iteration of while cycle, some is set to . The th index is taken from (here we consider as a list rather than a set) and is switched to the other endpoint. For this new extremal point, and are updated (creftype 8) and the resulting variance is compared to the best value found so far (creftype 9).

The following property of Algorithm 2 is crucial for the correctness and complexity of FGKLA algorithm: When Algorithm 2 ends, then . The next proposition immediately follows.

###### Proposition 1

Let be the values of the global variables when Algorithm 2 starts and let be the values of when Algorithm 2 ends. Then and .

In particular, this means that when entering creftype 8 of Algorithm 1, we always start examining the free indices with for each .

Finally, creftype 9 of Algorithm 1 removes intervals ending in from . These intervals are going to be fixed to their lower endpoints in the following iterations. Since they are at the upper endpoint now, creftype 9 updates and accordingly.

###### Proof (Proof of Theorem 1)
1. [label=)]

2. Correctness. Let be a maximizer of Eq. 2. Since the maximum is attained in a vertex of the feasible set , we can assume for all . Moreover, thanks to Corollary 1, we can assume for every such that and for every such that . Put all other indices to set , i.e. . Set . Consider the set processed by Algorithm 2 in th iteration of Algorithm 1. By construction, . Hence, the maximizer is among the examined extremal points.

3. Complexity. On creftype 2, the algorithm sorts numbers with complexity . Algorithm 2 is called at most times, where . Recall that is the size of the maximal clique of the FGKLA intersection graph. In the th iteration of the for cycle on creftypeplural 6 to 10 of Algorithm 1 we have . Thus .

Algorithm 2 performs exactly iterations of the while cycle on creftypeplural 2 to 11. Inside its iteration, there is the for cycle on creftypeplural 3 to 6. The amortized time complexity of this for cycle is , because in its iteration it either sets some nonzero to or stops iterating. Since is set to a nonzero value only times, the overall time of all courses of the for cycle is .

The amount of work in the remaining steps is negligible. In particular, note that since are pairwise disjoint sets (the same holds true for ), the total number of iterations of for cycles on creftypeplural 9 and 7 is at most during the whole course of FGKLA algorithm.

The overall complexity is .

###### Remark 1

Aside of the implementation details (which are however important for the reduced time complexity bound), our formulation of the algorithm differs from the original paper  also for another reason. The original formulation can lead to complexity , for example if and if there are narrowed intervals ending in some and further narrowed intervals starting in . However, a minor modification of the original formulation would be sufficient to achieve the time .

## 3 A probabilistic model where the FGKLA algorithm works in time O(n1+ε) on average

This section is devoted to the main result: on average, FGKLA algorithm works in “almost linear time” and the cases when it computes in exponential time occur extremely rarely.

Here we use the statistical motivation of the problem as described in Section 1.3. Namely, in statistics, data are often assumed to form a random sample from a certain distribution. This is exactly our probabilistic model: we assume that both centers of the intervals and their radii form two independent random samples from fairly general classes of distributions.

###### Assumption (the probabilistic model)
1. [label=(),ref=]

2. The centers

are independent and identically distributed (“i.i.d.”) random variables with a Lipschitz continuous cumulative distribution function (“c.d.f.”)

. That is, there exists a constant such that

 Φ∗(˜z)−Φ∗(z)≤L(˜z−z)\ \ whenever\ % \ ˜z>z.
3. The radii

are i.i.d. nonnegative random variables with a finite moment of order

for some . In other words, we assume

 γ\coloneqqE[(xΔi)1+ε]<∞. (6)
4. We assume that the pair of random variables , are independent.

Define

 α\coloneqqmax{1, 8L(1+γ)ε}. (7)
###### Theorem 2

The size of the largest clique of the FGKLA intersection graph with data has the following properties:

1. [label=()]

2. for a sufficiently large ; here ,

3. for every there is an such that if .

Proof of Theorem 2 will be given in Section 3.1.

###### Corollary 2 (main result)

The average computing time is

 O(nlogn+n⋅2Eω)=O(nlogn+n⋅nO(1)/loglogn)=O(n1+ε)

for an arbitrarily small . Moreover, the exponential case, when is linear in , occurs with probability (i.e., with probability even smaller than exponential).

###### Remark 2

The assumption on the distribution of radii is very mild; indeed, we need just something a little more than existence of the expectation (we even do not need finite variance). On the other hand, Lipschitz continuity of (Item 1) is unavoidable; we will show what can happen without Lipschitz continuity in Section 4. We will also discuss there what happens when we relax the independence assumption (Item 3) and what is the cost for dependence paid by existence of higher-order moments.

###### Remark 3

In Eq. 7 we have imposed a technical condition . This is not restrictive since the interesting cases are those with . Indeed, the difficult case is when is close to zero (“radii can be large with a high probability”), (“radii are large on average”) and (“the density of centers can have high peaks”, or “many centers can be close to one another”).

We have also introduced a technical condition in Item 2. Again, this is not at all restrictive — for example, if a distribution of reader’s interest has finite high-order moments, it must also have a finite moment of low order .

### 3.1 Proof of Theorem 2

Let and let be the probability that is an edge of the FGKLA intersection graph. That is,

 pn\coloneqqPr[x1/ni∩x1/nj≠∅]=Pr[|x∗i−x∗j|≤1n(xΔi+xΔj)]=Pr[B≥An],

where

 An\coloneqqn|x∗i−x∗j|,B\coloneqqxΔi+xΔj.

Observe that does not depend on by the i.i.d. assumptions.

#### Notation

Given a random variable

, its probability density function (“p.d.f.”) is denoted by

.

###### Lemma 3

We have for every .

###### Proof

Observe that the p.d.f. of exists because the c.d.f. is Lipschitz continuous (and thus absolutely continuous). Recall that is the corresponding Lipschitz constant. The Lipschitz condition also implies for all . By symmetry, . Using independence of , the symmetry of around zero and the Convolution Theorem we get

 φ|x∗i−x∗j|(z) =2I{z≥0}⋅φx∗i−x∗j(z) =2I{z≥0}∫∞−∞φx∗i(ξ)⋅φ−x∗j(z−ξ) dξ =2I{z≥0}∫∞−∞φx∗i(ξ)⋅φx∗i(ξ−z) dξ ≤2LI{z≥0}∫∞−∞φx∗i(ξ−z) dξ =2LI{z≥0}∫∞−∞φx∗i(ξ) dξ ≤2L,

where is the 0-1 indicator of . Now .

Recall that the number has been introduced in Eq. 7.

.

###### Proof

Recall that is the value of the th moment of for some . The well-known Minkowski inequality tells us

 E[B1+ε]=E[(xΔi+xΔj)1+ε]≤21+εγ≤4γ.

By Markov’s inequality we get

 Pr[B≥z]≤4γz1+ε. (8)

Using the Law of Total Probability and independence of

and we get

 pn =Pr[B≥An] =∫∞0Pr[B≥z | An=z]⋅φAn(z) dz =∫∞0Pr[B≥z]⋅φAn(z) dz =∫10Pr[B≥z]⋅φAn(z) dz+∫∞1Pr[B≥z]⋅φAn(z) dz ≤∫10φAn(z) dz+∫∞14γz1+ε⋅φAn(z) dz ≤2Ln+8Lγn∫∞11z1+ε dz =2Ln+8Lγεn ≤8L(1+γ)εn ≤αn.

Let us introduce indicator variables ():

 Wij={1if |x∗i−x∗j|≤1n(xΔi+xΔj),0otherwise.

Obviously, almost surely (“a.s.”) if . If , then is alternatively distributed with parameter . Moreover, the variables

 Wi1,Wi2,…,Wi,i−1,Wi,i+1,…,Win (9)

are independent (this is an important point). Putting

 Ei=∑j∈{1,…,n}∖{i}Wij,

we get

 Ei∼Bi(n−1,pn). (10)

Now we can use an estimate based on (a kind of) Penrose’s method, see

, Chapter 6. The crucial observation is

 Pr[ω≥κ+1]≤Pr[E1≥κ∨⋯∨En≥κ]. (11)

Indeed, if the FGKLA graph has a clique of size containing vertex , then at least indicators from Eq. 9 are one.

###### Lemma 5 (Tail bound for the binomial distribution [9, p. 16])

Let

 H(ξ)=1−ξ+ξlogξ. (12)

If and , then

 Pr[Z≥κ]≤exp(−np⋅H(κnp)).

Let

 kn\coloneqqeαlognloglogn,

where was defined in Eq. 7 and .

###### Lemma 6

for a sufficiently large .

###### Proof

It is easy to verify that is increasing for . If (see also Remark 4)

 logloglognloglogn≤1100, (13)

we have

 H(nα(n−1)kn) ≥H(knα) =H(elognloglogn) =−elognloglogn+elognloglognlog(elognloglogn)+1 ≥−elognloglogn+elognloglognlog(elognloglogn) =(−e+eloge)lognloglogn+elognloglogn(loglogn−logloglogn) =elognloglogn(loglogn−logloglogn) =e(logn)(1−logloglognloglogn) ≥e(logn)(1−1100) ≥52logn.

Continuing with estimate Eq. 11 with , we get

 Pr[ω≥kn+1] ≤Pr[E1≥kn∨⋯∨En≥kn] (14) ≤nPr[E1≥kn] (15) ≤nPr[Z≥kn] (16) ≤1n, (17)

where

 Z:∼Bi(n−1,αn). (18)

Property Eq. 10 and Lemma 4 imply the correctness of Eq. 16 if is sufficiently large. In Eq. 15 we used the fact that are identically distributed (but not independent) and the Bonferroni inequality for any events . It remains to prove Eq. 17.

.

###### Proof

We use the tail probability bound from Lemma 5 for . Observe that if , then and the assumption of Lemma 5 is satisfied.

If , then . With Lemma 6 we get the desired estimate

 Pr[Z≥kn] ≤exp[−αn−1n⋅H(nα(n−1)⋅kn)] ≤exp[−αn−1n⋅52logn] ≤exp[−45⋅52logn] =1n2.