Quantum speedups of some general-purpose numerical optimisation algorithms

04/14/2020 ∙ by Cezar-Mihail Alexandru, et al. ∙ University of Bristol

We give quantum speedups of several general-purpose numerical optimisation methods for minimising a function $f:\mathbb{R}^n \to \mathbb{R}$. First, we show that many techniques for global optimisation under a Lipschitz constraint can be accelerated near-quadratically. Second, we show that backtracking line search, an ingredient in quasi-Newton optimisation algorithms, can be accelerated up to quadratically. Third, we show that a component of the Nelder-Mead algorithm can be accelerated by up to a multiplicative factor of $O(\sqrt{n})$. Fourth, we show that a quantum gradient computation algorithm of Gilyén et al. can be used to approximately compute gradients in the framework of stochastic gradient descent. In each case, our results are based on applying existing quantum algorithms to accelerate specific components of the classical algorithms, rather than developing new quantum techniques.

1 Introduction

Quantum computers are designed to use quantum mechanics to outperform their classical counterparts. As well as the remarkable exponential speedups that are known for specialised problems such as integer factorisation and simulation of quantum-mechanical systems, there are also quantum algorithms which speed up general-purpose classical algorithms in the domains of combinatorial search and optimisation. These algorithms may achieve relatively modest speedups, but make up for this by having very broad applications. The most famous example is Grover’s algorithm [26], which achieves a quadratic speedup of classical unstructured search, and can be used to accelerate classical algorithms for solving hard constraint satisfaction problems such as Boolean satisfiability.

Here our focus is on quantum algorithms that accelerate classical numerical optimisation algorithms: that is, algorithms that attempt to solve the problem of finding $\mathbf{x} \in \mathbb{R}^n$ such that $f(\mathbf{x})$ is minimised, for some function $f:\mathbb{R}^n \to \mathbb{R}$. (We use boldface throughout for elements of $\mathbb{R}^n$.) A vast number of optimisation algorithms are known. Some algorithms seek to find (or approximate) a global minimum of $f$, given some constraints on $f$; others only attempt to find a local minimum. Some algorithms have provable correctness and/or performance bounds, while the performance of others must be verified experimentally. Whether or not an algorithm has good theoretical properties, its performance on a given problem often can only be determined by running it. These factors have led to the development and use of many numerical optimisation algorithms based on varied techniques.

Here we consider some prominent general-purpose numerical optimisation techniques, and investigate the extent to which they can be accelerated by quantum algorithms. We stress that our goal is not to develop new quantum optimisation techniques (that perhaps would not have rigorous performance bounds), but rather to find quantum algorithms that speed up existing classical techniques, while retaining the same performance guarantees. That is, if the classical algorithm performs well in terms of solution quality or execution time on a given problem instance, the quantum algorithm should also perform well. We assume throughout that the quantum algorithm has access to an oracle that computes $f(\mathbf{x})$ exactly on particular inputs $\mathbf{x}$, implemented as a quantum circuit. (Footnote 1: As we would like to store $\mathbf{x}$ in a register of qubits, technically this is only possible if we consider inputs within a bounded region and discretised up to a certain level of precision, and assume that $f$ is also bounded. However, this is also the case for the corresponding classical algorithms that we accelerate.) That is, we assume we have access to the map $|\mathbf{x}\rangle|0\rangle \mapsto |\mathbf{x}\rangle|f(\mathbf{x})\rangle$. This contrasts with a model sometimes used elsewhere in the literature, where $\mathbf{x}$ is assumed to be provided to the quantum algorithm as a quantum state of $O(\log n)$ qubits [34, 47] stored in a quantum RAM, and the goal is to produce a quantum state corresponding to the optimised point.

Our results can be summarised as follows, where we use the notation $T_f$ (as in the rest of the paper) for an upper bound on the time required to evaluate the function $f$. See Table 1 for a summary of the speedups we obtain.

  • Section 2: We show that a number of techniques for global optimisation under a Lipschitz constraint can be accelerated near-quadratically, and also discuss some challenges associated with speeding up the related and well-known classical algorithm DIRECT [31]. In Lipschitzian optimisation, one assumes that $|f(\mathbf{x}) - f(\mathbf{y})| \le L\|\mathbf{x} - \mathbf{y}\|$ for some $L$ that is known in advance (the Lipschitz constant of $f$), where $\|\cdot\|$ is the Euclidean norm. Many techniques for Lipschitzian optimisation can be understood in the framework of branch-and-bound algorithms [28]. These algorithms are based on dividing $f$’s domain into subsets, and using a lower-bounding procedure to rule out certain subsets from consideration. This enables the use of a quantum algorithm for speeding up branch-and-bound algorithms [43]. The complexity of branch-and-bound algorithms is controlled by a parameter discussed below (the size of a truncated branch-and-bound tree); the quantum algorithm achieves a quadratic reduction in complexity in terms of this parameter. A simple representative example of an algorithm fitting into this framework is Galperin’s cubic algorithm [21]. In this case, the quantum algorithm’s complexity is then , where is the depth of the branch-and-bound tree, whereas the classical complexity is .

  • Section 3: We show that backtracking line search [45, Algorithm 3.1], a subroutine used in many quasi-Newton optimisation algorithms such as the BFGS algorithm, can be accelerated using a quantum algorithm which is a variant of Grover search [39]. Backtracking line search is based on choosing a direction and searching along that direction. If the overall algorithm makes iterations, the complexity of choosing is , and the number of search steps taken by the classical algorithm is , the complexity of one iteration of this classical routine is , while the complexity of the quantum algorithm is .

  • Section 4: We show that the Nelder-Mead algorithm [44], a widely-used derivative-free numerical optimisation algorithm, can be accelerated using quantum minimum-finding [17]. The algorithm is an iterative procedure based on maintaining a simplex. Assume that , and that the algorithm performs iterations, of which are “shrink” steps (q.v.). Then the complexity of the quantum algorithm is , as compared with the classical complexity, . So if the number of shrink steps is large with respect to , or is small, the quantum speedup can be relatively substantial (up to an $O(\sqrt{n})$ factor).

  • Section 5: Approximate computation of a gradient is a key subroutine in many optimisation algorithms, including the very widely-used gradient descent algorithm [8]. We show that the gradient of functions of the form can be computed more efficiently using a quantum algorithm of Gilyén, Arunachalam and Wiebe [22]. Given that each individual function is bounded and can be computed in time (and satisfies some technical constraints on its partial derivatives), the quantum algorithm outputs an approximation of the gradient that is accurate up to in the norm, in time , as compared with the classical complexity . (The notation hides polylogarithmic factors in , and .) However, as we will discuss, it is not clear whether this notion of approximation is sufficient to accelerate classical stochastic gradient descent algorithms.

§ Algorithm Classical Quantum Technique
2 Global opt. w/Lipschitz constraint (e.g.) Branch-and-bound [43]
3 Backtracking line search Variant of Grover’s algorithm [39]
4 Nelder-Mead Quantum minimum-finding [17]
5 Gradients of averaged functions Quantum gradient computation [22]
Table 1: Informal summary of the results obtained in this paper. Parameters for algorithms are described in the respective sections of the paper, and summarised as follows. : complexity of computing ; : size of a truncated branch-and-bound tree; : depth of a branch-and-bound tree; : complexity of computing a descent direction; : number of iterations; : worst-case number of backtracking line search steps; : number of simplex shrinking steps; : accuracy. The bounds make various assumptions about that are detailed in the text.

In each case, the quantum speedups we find are based on the use of existing quantum algorithms, rather than the development of new algorithmic techniques. We believe that there are many more quantum speedups of numerical optimisation algorithms to be discovered. We remark that, in many of the cases we consider, the extent of the quantum speedup achieved depends on the interplay of various parameters governing the optimisation algorithm’s runtime, so not every problem instance will yield a speedup.

Prior work on quantum speedups of numerical optimisation algorithms (as opposed to the analysis of new quantum algorithms such as the adiabatic algorithm [20] or quantum approximate optimisation algorithm [30, 19]) has been relatively limited. Dürr and Høyer [17] gave a quantum algorithm to find a global minimum of a function $f$ on a discrete space of size $N$, which is based on the use of Grover’s algorithm and uses $O(\sqrt{N})$ evaluations of $f$. Arunachalam [5] applied Dürr and Høyer’s algorithm to improve the generalised pattern search and mesh-adaptive direct search optimisation algorithms. A sequence of papers has found quantum speedups of linear programming and semidefinite programming algorithms [10, 3, 2, 35, 9]; quantum speedups of more general convex optimisation algorithms are also known [51, 14]. Quantum speedups are known for computing gradients [32, 22, 15], an important subroutine in many optimisation algorithms; larger (exponential) speedups could be available in gradient descent-type algorithms if the inputs to the optimisation algorithm are available in a quantum RAM (qRAM) [34, 47]. Recently, it was shown that classical algorithms based on the general technique known as branch-and-bound can be accelerated near-quadratically [43].

2 Branch-and-bound algorithms for global optimisation with a Lipschitz constraint

Finding a global minimum of an arbitrary function $f:\mathbb{R}^n \to \mathbb{R}$ can be a very challenging (or indeed impossible) task. One way to make this problem more tractable is to assume that $f$ satisfies a Lipschitz condition: $|f(\mathbf{x}) - f(\mathbf{y})| \le L\|\mathbf{x} - \mathbf{y}\|$ for some $L$ that is known in advance, where $\|\cdot\|$ is the Euclidean norm. Finding a global minimum of $f$ under this condition is known as Lipschitzian optimisation. Lipschitzian optimisation is very general and hence can be applied in many contexts. Hansen and Jaumard [28] describe a selection of applications of Lipschitzian optimisation, including solution of nonlinear equations and inequalities; parametrisation of statistical models; black box system optimisation; and location problems.

It is natural to restrict the domain of to , and to assume that is bounded such that for all . Finally, we can relax to solving the approximate optimisation problem of finding such that , for some accuracy parameter that is determined in advance. Even in the case and with these restrictions, this problem is far from trivial. One class of algorithms that can solve Lipschitzian optimisation problems are branch-and-bound algorithms. Generically, a branch-and-bound algorithm solves a minimisation problem using the following procedures:

  • A branching procedure which, given a subset $S$ of possible solutions, divides $S$ into two or more smaller subsets, or returns that $S$ should not be divided further.

  • A bounding procedure which, when given a subset $S$ produced during the branching process, returns a lower bound $\ell(S)$ such that $\ell(S) \le \min_{\mathbf{x} \in S} f(\mathbf{x})$.

Branch-and-bound algorithms can be seen as exploring a tree, whose vertices correspond to subsets $S$. The children of a subset $S$ correspond to the subsets which $S$ was divided into, and leaves are subsets that should not be divided further. For a leaf, one should additionally have that . Branch-and-bound algorithms use the additional information provided by the branch and bound procedures to explore the most promising sets early on, and to avoid exploring subsets $S$ such that the lower bound on $S$ is larger than the best solution found so far. One can show that the complexity of an optimal classical branch-and-bound algorithm based on these generic procedures is controlled by the size of the branch-and-bound tree, truncated by deleting all vertices whose corresponding lower bounds are greater than the optimal cost: if the size of this tree is $T_{\min}$, the optimal classical algorithm makes $O(T_{\min})$ calls to the branch and bound procedures [33]. It is not required to know $T_{\min}$ in order to apply this bound.

A generic framework for branch-and-bound algorithms in the context of Lipschitzian optimisation was given by Hansen and Jaumard [28, Section 3.3], and we describe it as Algorithm 1. The algorithm splits into hyperrectangles , each of which is recursively split again. Each hyperrectangle has an associated upper bound (obtained by evaluating at a discrete set of points in that hyperrectangle) and lower bound (obtained via a separate lower-bounding function), and the algorithm terminates when it finds a hyperrectangle whose upper bound is sufficiently close to its lower bound. Convergence is guaranteed if some simple criteria are satisfied, discussed in [28] (for example, the upper bound and lower bound should converge as the interval size tends to 0). Hansen and Jaumard show that many previously known algorithms for Lipschitzian optimisation can be understood as particular cases of Algorithm 1. These include Galperin’s cubic algorithm [21], which proceeds by dividing the search space into hypercubes, and algorithms of Pijavskii [46], Shubert [48] and Mladineo [40].

Choose a discrete set and set ; [Initialise upper bound] Let be a lower-bounding function of on and compute [Initialise lower bound] If , stop. Otherwise, [Initialise branch-and-bound tree] While is nonempty: Let be a subset of chosen according to a selection rule For each subproblem in : Partition into hyperrectangles according to a branching rule [Branch] For : Choose a discrete set . For all : If then , and [Update upper bound] Compute , where is a lower-bounding function on [Compute lower bound] If : then if , is an -optimal solution of problem else add to [Explore interval further] Delete from all subproblems with .

Algorithm 1: Generic branch-and-bound algorithm for Lipschitzian optimisation problems [28]

The branching procedure of Algorithm 1 fits into the standard branch-and-bound framework. Given a subset $S$, an upper bound is obtained by evaluating $f$ at a discrete set of positions in $S$, and a lower bound is obtained using the bounding function. If the two are within the accuracy parameter $\epsilon$, $S$ should not be expanded further. Otherwise, $S$ is split into subsets. Algorithm 1 has a notion of selecting the next subset to expand using a selection rule, but it is shown in [33] that the best possible selection rule in branch-and-bound procedures (in a query complexity sense) is to expand the subset whose lower bound is smallest. (Footnote 2: The proof of this is based on the intuition that the algorithm cannot rule out subsets whose lower bound is smaller than the cost of the optimal solution. In the setting of Lipschitzian optimisation, this only holds if the lower bounding rule is tight, in the sense that given a lower bound on $f(\mathbf{x})$, for $\mathbf{x} \in S$, there exists a Lipschitz function such that this lower bound is achieved.)

There is a quantum algorithm that can achieve a near-quadratic speedup of classical branch-and-bound algorithms [43]. The algorithm is based on the use of quantum procedures for estimating the size of an unknown tree [1], and searching within such a tree [6, 7, 42]. The algorithm achieves a complexity of uses of the branch and bound procedures for finding the minimum of $f$ up to accuracy $\epsilon$. In this bound is the maximal depth of the branch-and-bound tree and the notation hides polylogarithmic factors in , , and , where is the probability of failure. (We remark that the algorithm as presented in [43] assumes knowledge of an upper bound on the depth in advance, but such a bound can be found efficiently by applying the quantum tree search algorithms of [42, 6, 7] to the branch-and-bound tree obtained by truncating at a trial depth, with exponentially increasing choices of the trial depth, until one is found where the corresponding tree does not contain any internal vertices that have not been expanded.)

The quantum branch-and-bound algorithm can immediately be applied to Algorithm 1. If the time complexity of the branching and bounding rules is upper-bounded by , the cost of the quantum algorithm is , as compared with the classical complexity, which is . If , the speedup of the quantum algorithm over its classical counterpart in terms of the number of uses of the branching and bounding rules is near-quadratic. If these rules in turn are relatively simple to compute compared with (as is likely to be the case for challenging optimisation problems that occur in practice), this translates into a near-quadratic runtime speedup.

To illustrate how this approach could be applied in practice, a simple example of an algorithm fitting into this framework is Galperin’s cubic algorithm [21]. The branch and bound procedures are defined as follows, recalling that $L$ is the Lipschitz constant of $f$:

  • Branch: the subproblem corresponding to a hypercube $C$ is divided into $k^n$ equal hypercubes, for some integer $k \ge 2$, by dividing each side into $k$ equal parts.

  • Lower bounding rule: Let be an extreme point of . has side length for some integer . Then a lower bound is , maximised over extreme points of .

  • Upper bounding rule: Evaluate $f$ on the extreme points of $C$ and return the minimum value found.

Galperin’s algorithm is illustrated in Figure 2 for the case . The complexity of the branch and bounding steps is dominated by the cost of evaluating at the extreme points of each hypercube , which is . The quantum complexity is then , whereas the classical complexity is ; so we see that the speedup is largest for small , e.g. .
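As a purely classical illustration of the branch-and-bound framework, the following Python sketch applies best-first selection with Galperin-style splitting to a one-dimensional function on an interval. It is a minimal sketch under stated assumptions, not the authors' implementation: the name `lipschitz_minimise`, its parameters, and the particular lower bound (the larger endpoint value minus the Lipschitz constant times the interval width) are choices made for this example.

```python
import heapq


def lipschitz_minimise(f, lipschitz, lo=0.0, hi=1.0, eps=1e-3, k=2):
    """Best-first branch-and-bound on [lo, hi] (illustrative sketch).

    Assumes |f(x) - f(y)| <= lipschitz * |x - y| on the interval.
    Returns (x_best, f_best, evaluations) with f_best <= min f + eps.
    """
    def lower_bound(a, fa, b, fb):
        # Every point of [a, b] is within (b - a) of each endpoint, so
        # f >= f(endpoint) - L * (b - a); keep the better of the two bounds.
        return max(fa, fb) - lipschitz * (b - a)

    fa, fb = f(lo), f(hi)
    best_x, best_f = (lo, fa) if fa <= fb else (hi, fb)
    evaluations = 2
    # Heap entries: (lower bound, left end, f(left), right end, f(right)).
    heap = [(lower_bound(lo, fa, hi, fb), lo, fa, hi, fb)]
    while heap:
        bound, a, fa, b, fb = heapq.heappop(heap)
        if bound >= best_f - eps:
            break  # no open interval can improve on best_f by more than eps
        # Branch: split [a, b] into k equal parts, reusing endpoint values.
        xs = [a + (b - a) * i / k for i in range(k + 1)]
        xs[-1] = b
        fs = [fa] + [f(x) for x in xs[1:-1]] + [fb]
        evaluations += k - 1
        # Update the upper bound (best value seen so far).
        for x, fx in zip(xs, fs):
            if fx < best_f:
                best_x, best_f = x, fx
        # Bound: keep only children that might still contain a better point.
        for i in range(k):
            child = lower_bound(xs[i], fs[i], xs[i + 1], fs[i + 1])
            if child < best_f - eps:
                heapq.heappush(heap, (child, xs[i], fs[i], xs[i + 1], fs[i + 1]))
    return best_x, best_f, evaluations
```

The priority queue implements the rule of always expanding the open interval with the smallest lower bound. The quantum algorithm discussed above would replace this classical exploration of the tree, not the branching and bounding rules themselves.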

Figure 2: Galperin’s cubic algorithm applied to a one-dimensional function (plotted in blue) with a known Lipschitz bound. The result of a few steps of splitting into subintervals is shown. The centres of intervals are labelled below with the step at which they are divided into subintervals (red), and the lower bound in that interval (blue). Endpoints are labelled above with the evaluated function values, shown to two decimal places.

2.1 The DIRECT algorithm

A prominent algorithm proposed to handle Lipschitzian optimisation for $n$-variate functions where one does not know the Lipschitz constant in advance is known as DIRECT [31] (for “dividing rectangles”). The basic concept is to divide the domain into (hyper)rectangles, and at each step of the algorithm to produce a list of potentially optimal rectangles, which are those that should be expanded further; see Appendix A for more details. This is similar to the branch-and-bound algorithms of the previous section, but with the additional complication of generating the list of potentially optimal rectangles, which involves interaction across several nodes of the branch-and-bound tree. This creates a difficulty for the quantum branch-and-bound algorithm, as it can only use branch and bound procedures based on only local information from the tree. Therefore it is unclear whether a similar quadratic speedup can be obtained.

To identify the potentially optimal vertices, the DIRECT algorithm uses a 2D convex hull algorithm. It is a natural idea to speed this up via a quantum convex hull algorithm. Lanzagorta and Uhlmann [38] have described a quantum algorithm based on Grover’s algorithm for computing a convex hull of points in 2D with complexity , where is the number of points in the convex hull; they also give an algorithm based on a heuristic whose runtime may be for practically relevant problems. However, the special case of the convex hull problem that is relevant to DIRECT can be solved in time [31], so this does not lead to an overall quantum speedup.

3 Backtracking line search

Backtracking line search [45] is a line search optimisation algorithm devised by Armijo in 1966 [4]. (Footnote 3: Not to be confused with the combinatorial optimisation technique known as backtracking.) The goal of a line search method is, given a starting point $\mathbf{x}$ and a direction $\mathbf{p}$, to move to a new point in the direction $\mathbf{p}$, in order to minimise a function $f$. Backtracking line search is a particular line search technique based on the use of an exponentially decreasing step-size parameter. A generic optimisation method based on backtracking line search is described as Algorithm 3. In this section we describe a quantum speedup of this algorithm.

Choose a starting point and constants and . Set . Choose a direction such that , where is the directional derivative in direction . If no such exists (), terminate. Compute the step size: , where . As , always exists. Set . Go back to step 2 if termination condition is not met (number of iterations, threshold, etc.)

Algorithm 3: Generic line search method based on backtracking line search

Different approaches can be used to choose . These include:

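  • Steepest descent: $\mathbf{p} = -\nabla f(\mathbf{x})$.

  • Newton’s method: $\mathbf{p} = -H^{-1} \nabla f(\mathbf{x})$, where $H$ is the Hessian of $f$.

  • Quasi-Newton methods (such as BFGS): $\mathbf{p} = -B^{-1} \nabla f(\mathbf{x})$, where $B$ is some approximation of $H$.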
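The classical routine being accelerated can be sketched in a few lines of Python. This is a hedged illustration using the textbook form of the Armijo sufficient-decrease test with steepest-descent directions; the names `backtracking_step` and `steepest_descent`, and the constants `alpha0`, `rho` and `c`, are assumptions of this example and need not match the constants of Algorithm 3.

```python
def backtracking_step(f, grad, x, p, alpha0=1.0, rho=0.5, c=1e-4, max_steps=60):
    """Return (alpha, m): the first step alpha = alpha0 * rho**m passing the Armijo test."""
    fx = f(x)
    # Directional derivative of f at x along p; must be negative for descent.
    slope = sum(g * d for g, d in zip(grad(x), p))
    if slope >= 0.0:
        raise ValueError("p must be a descent direction")
    alpha = alpha0
    for m in range(max_steps):
        trial = [xi + alpha * di for xi, di in zip(x, p)]
        # Armijo condition: sufficient decrease relative to the linear model.
        if f(trial) <= fx + c * alpha * slope:
            return alpha, m
        alpha *= rho  # exponentially decreasing step size
    raise RuntimeError("no acceptable step found")


def steepest_descent(f, grad, x0, iterations=50, tol=1e-12):
    """Generic line search loop with p = -grad f(x) (illustrative only)."""
    x = list(x0)
    for _ in range(iterations):
        g = grad(x)
        if sum(gi * gi for gi in g) <= tol:
            break  # gradient (numerically) zero: no descent direction
        p = [-gi for gi in g]
        alpha, _ = backtracking_step(f, grad, x, p)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
    return x
```

The inner `for m in range(...)` loop is a linear scan for the first index at which a Boolean condition holds. That scan is the part a Grover-type search for the smallest marked index would replace.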

Let denote the complexity of choosing the direction ; note that , because just writing down requires time . Then the overall complexity of one iteration of Algorithm 3 is . We can reduce this complexity using the following result of Lin and Lin [39] (see also [36]):

Theorem 1 (Lin and Lin [39]).

Consider a function . Let , if this set is nonempty, or otherwise . Then there is a quantum algorithm that succeeds with probability at least 0.99 and outputs using evaluations of if , and otherwise outputs that in steps.

We apply this result to step 3 of the classical algorithm to achieve a square-root reduction in the dependence on . To achieve a final probability of failure bounded by a small constant, by a union bound over the iterations, it is sufficient to repeat the algorithm of Theorem 1 times to achieve failure probability at each iteration. This gives an overall complexity of the quantum algorithm which is per iteration. If the overall algorithm makes iterations, and is the largest value of for any iteration, we have an overall complexity of . In cases where (such as the steepest descent method), , and is not exponentially large in , the dominant term in this complexity bound is the second one, and we always achieve a quantum speedup. The assumption is natural if depends on all variables.

This condition that is used in step 3 is called the Armijo condition. If is Lipschitz at with Lipschitz constant (), any

satisfies the Armijo condition [25, Theorem 2.1]. If we choose such that , then since , , . Therefore, the speedup achieved by the quantum algorithm (based on this worst-case bound) will be greatest when is large (representing that could change rapidly), yet is small (representing that does not change rapidly in direction ).

Another way in which one might hope to speed up Algorithm 3 is computing the direction $\mathbf{p}$ more efficiently. For example, a quantum algorithm was presented by Gilyén, Arunachalam and Wiebe [23], based on a detailed analysis of and modifications to an earlier algorithm of Jordan [32], that approximately computes $\nabla f$ for smooth functions $f$ quadratically more efficiently than classical methods (that are based e.g. on finite differences). However, it seems challenging to prove that such an approximation can be inserted in the backtracking line search framework without affecting the performance of the overall algorithm, in the worst case. This is because even a small change in the direction $\mathbf{p}$ can significantly change the behaviour of the algorithm, as the definition of Step 3 of Algorithm 3 is such that an arbitrarily small change to the values taken by $f$ along the direction $\mathbf{p}$ can change the chosen step size substantially. See Section 5 below for a further discussion of this algorithm.

Finally, we remark that one simple way to find a direction such that is nonzero, as required for the line search procedure, is to choose such that is nonzero. Although a valid choice, in practice this could be less efficient than (for example) moving in the direction of steepest descent. The use of Grover’s algorithm would reduce the complexity of this step to , as compared with the classical .

4 Nelder-Mead algorithm

The Nelder-Mead algorithm is a direct search optimisation algorithm; that is, one which does not require information about the gradient of the objective function. It is commonly used and implemented within many computer algebra packages. However, little convergence theory exists and in practice it is ineffective in higher dimensions [37, 27]. (Footnote 4: Indeed, according to Lagarias et al. [37], “given all the known inefficiencies and failures of the Nelder-Mead algorithm… one might wonder why it is used at all, let alone why it is so extraordinarily popular.”) The Nelder-Mead algorithm uses expansion, reflection, contraction and shrink steps to update a simplex in $\mathbb{R}^n$. A number of variants of the algorithm have been proposed. The variant we will use was analysed by Lagarias et al. [37], and is presented as Algorithm 4. Algorithm 4 does not specify a termination criterion. Termination criteria that could be used include the function values at the simplex points becoming sufficiently close; the simplex points themselves becoming sufficiently close; or an iteration limit being reached.

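Let $\rho$, $\chi$, $\gamma$, $\sigma$ be parameters defined such that $\rho > 0$, $\chi > 1$, $\chi > \rho$, $0 < \gamma < 1$, $0 < \sigma < 1$. Standard choices are $\rho = 1$, $\chi = 2$, $\gamma = 1/2$, $\sigma = 1/2$.

  1. Initialise. Define an $n$-dimensional simplex with $n+1$ vertices, $\mathbf{x}_1, \dots, \mathbf{x}_{n+1}$.

  2. Sort. Order and relabel the vertices of the simplex such that $f(\mathbf{x}_1) \le f(\mathbf{x}_2) \le \dots \le f(\mathbf{x}_{n+1})$ and let $\mathbf{x}_{n+1}$ be the worst vertex, $\mathbf{x}_n$ the next-worst vertex and $\mathbf{x}_1$ the best vertex. Set $\bar{\mathbf{x}} = \frac{1}{n} \sum_{i=1}^{n} \mathbf{x}_i$.

  3. Reflection. Calculate the reflection point, $\mathbf{x}_r = \bar{\mathbf{x}} + \rho(\bar{\mathbf{x}} - \mathbf{x}_{n+1})$. If $f(\mathbf{x}_1) \le f(\mathbf{x}_r) < f(\mathbf{x}_n)$, accept reflection, replace $\mathbf{x}_{n+1}$ with $\mathbf{x}_r$ and return to step 2.

  4. Expansion. If $f(\mathbf{x}_r) < f(\mathbf{x}_1)$, calculate the expansion point $\mathbf{x}_e = \bar{\mathbf{x}} + \chi(\mathbf{x}_r - \bar{\mathbf{x}})$. If $f(\mathbf{x}_e) < f(\mathbf{x}_r)$, accept the expansion point and replace $\mathbf{x}_{n+1}$ with $\mathbf{x}_e$, otherwise accept the reflection point and replace $\mathbf{x}_{n+1}$ with $\mathbf{x}_r$. Return to step 2.

  5. Outside Contraction. If $f(\mathbf{x}_n) \le f(\mathbf{x}_r) < f(\mathbf{x}_{n+1})$, compute the outer contraction point, $\mathbf{x}_{oc} = \bar{\mathbf{x}} + \gamma(\mathbf{x}_r - \bar{\mathbf{x}})$. If $f(\mathbf{x}_{oc}) \le f(\mathbf{x}_r)$, accept the outside contraction point, replace $\mathbf{x}_{n+1}$ with $\mathbf{x}_{oc}$ and return to step 2. Else go to step 7.

  6. Inside Contraction. If $f(\mathbf{x}_r) \ge f(\mathbf{x}_{n+1})$, calculate the inside contraction point, $\mathbf{x}_{ic} = \bar{\mathbf{x}} - \gamma(\bar{\mathbf{x}} - \mathbf{x}_{n+1})$. If $f(\mathbf{x}_{ic}) < f(\mathbf{x}_{n+1})$, accept inside contraction, replace $\mathbf{x}_{n+1}$ with $\mathbf{x}_{ic}$ and return to step 2. Else go to step 7.

  7. Shrink. For all points other than the best point, replace it with its shrink point, $\mathbf{x}_i \leftarrow \mathbf{x}_1 + \sigma(\mathbf{x}_i - \mathbf{x}_1)$ for $i = 2, \dots, n+1$. Go to step 2.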

Algorithm 4: Nelder-Mead algorithm (see e.g. [37])
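For concreteness, here is a compact Python sketch of the classical Nelder-Mead iteration in the variant analysed by Lagarias et al. [37], with the standard parameter choices as defaults. It is illustrative only: the function name `nelder_mead`, the fixed iteration budget used as a termination criterion, and the full re-sort at each iteration (done for simplicity) are assumptions of this example.

```python
def nelder_mead(f, simplex, iterations=100, rho=1.0, chi=2.0, gamma=0.5, sigma=0.5):
    """Minimise f from an initial simplex of n + 1 points in n dimensions.

    Returns (best_point, best_value, number_of_shrink_steps).
    """
    pts = [(f(x), list(x)) for x in simplex]
    n = len(pts) - 1
    shrinks = 0
    for _ in range(iterations):
        # Sort: only best, next-worst and worst are actually needed.
        pts.sort(key=lambda t: t[0])
        f_best, x_best = pts[0]
        f_next_worst = pts[-2][0]
        f_worst, x_worst = pts[-1]
        # Centroid of the n best vertices (all but the worst).
        centroid = [sum(x[j] for _, x in pts[:-1]) / n for j in range(n)]

        def along(t):
            # Point centroid + t * (centroid - worst).
            return [c + t * (c - w) for c, w in zip(centroid, x_worst)]

        x_r = along(rho)  # reflection
        f_r = f(x_r)
        if f_best <= f_r < f_next_worst:
            pts[-1] = (f_r, x_r)
            continue
        if f_r < f_best:  # expansion
            x_e = along(rho * chi)
            f_e = f(x_e)
            pts[-1] = (f_e, x_e) if f_e < f_r else (f_r, x_r)
            continue
        if f_r < f_worst:  # outside contraction
            x_c = along(rho * gamma)
            f_c = f(x_c)
            accepted = f_c <= f_r
        else:  # inside contraction
            x_c = along(-gamma)
            f_c = f(x_c)
            accepted = f_c < f_worst
        if accepted:
            pts[-1] = (f_c, x_c)
        else:  # shrink every vertex except the best towards the best
            shrinks += 1
            shrunk = [[b + sigma * (xi - b) for b, xi in zip(x_best, x)]
                      for _, x in pts[1:]]
            pts = [pts[0]] + [(f(x), x) for x in shrunk]
    best_value, best_point = min(pts, key=lambda t: t[0])
    return best_point, best_value, shrinks
```

A non-shrink iteration costs one or two evaluations of `f`, whereas a shrink re-evaluates `f` at `n` points and then needs the best, worst and next-worst vertices again. That re-identification is the step to which quantum minimum-finding is applied in this section.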

In this section we describe a quantum speedup of the Nelder-Mead algorithm. We first determine the classical complexity of the algorithm, drawing on the analysis of [49]. The complexity of step 1 is to write down the points. To analyse step 2, observe that a complete ordering of the points is never required; the only information about the ordering needed is the worst vertex , the next-worst vertex , and the best vertex . Knowledge of the identities of these points is sufficient to compute the centroid , and to carry out all the updates required, including the shrink step. So the first time that step 2 is executed, its complexity is , where the comes from computing the centroid. Each time step 2 is executed subsequently, except following a shrink step, the required updates can be made in time . The complexity of each of steps 3 to 6 is ; step 7 is . So the complexity of performing iterations, of which include a shrink step, is . If , this simplifies to .

The complexity of step 2, when executed for the first time or following a shrink step, can be improved using quantum minimum-finding:

Theorem 2 (Dürr and Høyer [17]).

Given a function and , there is a quantum algorithm that outputs with probability at least using evaluations of .

Thus a quantum algorithm using Theorem 2 can find the worst, next-worst and best vertices with failure probability at each iteration in time in total. This choice of failure probability is so that, by a union bound, the total probability of failure can be bounded by an arbitrarily small constant. Further, observe that the centroid can be updated in time following a shrink step, as if denotes the updated centroid, then . This does not give a quantum speedup of step 2 in all cases; the first time that step 2 is executed, if , its complexity is dominated by the cost of computing the centroid. There also remains an cost for updating the points at each shrink step. (There may be a more efficient way of keeping track of these shrink steps; however, we do not pursue this further here.) Then the overall complexity of the quantum algorithm is , and using a union bound over the steps, the algorithm’s failure probability is bounded above by an arbitrarily small constant. If , this simplifies to . Comparing with the classical complexity, we see that the quantum speedup is largest when is large compared with .
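One way to see why the centroid is cheap to maintain across a shrink step: the shrink applies the same affine map to every vertex and leaves the best vertex fixed, so the centroid of any fixed subset of vertices is mapped in the same way. The Python sketch below checks this numerically. The helper names are hypothetical, and the identity is a restatement made for this example that may be written differently in the original.

```python
def shrink(simplex, sigma=0.5):
    """Shrink all vertices towards simplex[0], assumed to be the best vertex."""
    best = simplex[0]
    return [list(best)] + [
        [b + sigma * (xi - b) for b, xi in zip(best, x)] for x in simplex[1:]
    ]


def centroid(points):
    """Arithmetic mean of a list of points (cost proportional to n * len(points))."""
    m = len(points)
    return [sum(p[j] for p in points) / m for j in range(len(points[0]))]


def shrink_centroid(old_centroid, best, sigma=0.5):
    """Update a centroid in O(n) arithmetic operations after a shrink step."""
    return [b + sigma * (c - b) for b, c in zip(best, old_centroid)]


simplex = [[1.0, 1.0], [5.0, 1.0], [1.0, 9.0]]
before = centroid(simplex[:-1])      # centroid of the first n vertices
after = centroid(shrink(simplex)[:-1])  # recomputed from scratch after shrinking
updated = shrink_centroid(before, simplex[0])
```

Here `after` and `updated` agree. Since a shrink can change which vertex is worst, the worst, next-worst and best vertices still have to be identified again afterwards.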

However, in practice shrink steps appear to be rare; in one set of experiments, only 33 shrink steps were observed in 2.9M iterations [50], and shrink steps never occur when Nelder-Mead is applied to a strictly convex function [37]. If there are no shrink steps and , the complexity of the quantum algorithm is , while the complexity of the classical algorithm is . This is still a quantum speedup if ; on the other hand, if , the complexity is dominated by evaluating once at each iteration, and it is difficult to see how a quantum speedup could be achieved.

To be able to use quantum minimum-finding, we have assumed the ability to construct superpositions of the form , which enables us to evaluate in superposition. This is a quantum RAM [24], and quantum RAMs are often assumed to be difficult to construct; however, our requirements are very weak, because we only need the addressing to be performed in time , rather than , which can be achieved using an explicit quantum circuit.

Finally, we consider the possibility of accelerating calculation of the centroid $\bar{\mathbf{x}}$ using a quantum algorithm. If each component of each vector $\mathbf{x}_i$ is suitably bounded (e.g. ) we could use quantum mean estimation [29, 11, 41] to estimate each component of $\bar{\mathbf{x}}$ up to accuracy in time with failure probability bounded by a small constant, where the term comes from reducing the failure probability for each component to . Classical mean estimation could be used instead with an overhead of an additional factor. This would give an overall time complexity similar to that derived above, but it is not obvious what the effect of replacing the centroid with an approximate centroid would be on the overall algorithm. For example, it is argued in [18] that random perturbations to the centroid throughout the algorithm can be beneficial.

5 Stochastic gradient descent

One of the most widely-used, effective and simple methods for finding a local minimum of a function is gradient descent. Given a function $f:\mathbb{R}^n \to \mathbb{R}$ and an initial point $\mathbf{x}_0$, the algorithm moves to the point $\mathbf{x}_0 - \eta \nabla f(\mathbf{x}_0)$, where $\eta > 0$ is a step size. In application areas such as machine learning [8], one often encounters functions of the form

$$f(\mathbf{x}) = \frac{1}{N} \sum_{i=1}^{N} f_i(\mathbf{x}) \qquad (1)$$

for some “simple” functions $f_i$, where $N$ is large. (For example, $f_i(\mathbf{x})$ could be the error of a neural network parametrised by $\mathbf{x}$ on the $i$’th item of training data, and we might seek to minimise the average error.) Rather than computing the exact gradient by summing over all choices for $i$, it is natural to approximate $\nabla f$ by sampling $k$ random indices $i_1, \dots, i_k$ with replacement and outputting $\frac{1}{k} \sum_{j=1}^{k} \nabla f_{i_j}(\mathbf{x})$. (The case $k = 1$ is known as stochastic gradient descent; the sample is sometimes known as a mini-batch.) If satisfies the Lipschitz condition that , to approximate up to additive error in the norm with failure probability it is sufficient to take by a Chernoff bound argument. Let $T$ denote an upper bound on the time required to compute $f_i(\mathbf{x})$ for all $i$. If we approximate $\nabla f_i$ using the finite difference method, then each approximation to $\nabla f_i(\mathbf{x})$ can be computed in time $O(nT)$, giving a total complexity of $O(nkT)$.
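The classical baseline described above (sample indices with replacement, approximate each sampled gradient by finite differences, then average) can be sketched in Python as follows. The function names, the use of central differences and the step `h` are assumptions made for illustration, not details taken from the original.

```python
import random


def finite_difference_gradient(g, x, h=1e-5):
    """Central-difference estimate of the gradient of g at x (2n evaluations of g)."""
    grad = []
    for j in range(len(x)):
        forward = list(x)
        backward = list(x)
        forward[j] += h
        backward[j] -= h
        grad.append((g(forward) - g(backward)) / (2 * h))
    return grad


def minibatch_gradient(functions, x, batch_size, rng):
    """Average finite-difference gradients of batch_size randomly sampled functions."""
    total = [0.0] * len(x)
    for _ in range(batch_size):
        g = rng.choice(functions)  # uniform sampling with replacement
        for j, gj in enumerate(finite_difference_gradient(g, x)):
            total[j] += gj
    return [t / batch_size for t in total]


# Hypothetical example: every f_i is 0.5 * ||x - c||^2 with the same centre c,
# so the mini-batch estimate does not depend on which indices are sampled.
centre = [1.0, 2.0]
functions = [
    (lambda x, c=centre: 0.5 * sum((xi - ci) ** 2 for xi, ci in zip(x, c)))
    for _ in range(10)
]
estimate = minibatch_gradient(functions, [3.0, 5.0], batch_size=4, rng=random.Random(0))
```

Each sampled gradient costs a number of function evaluations proportional to the dimension `n`. That dependence is what the quantum gradient computation discussed in this section aims to reduce.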

The use of quantum amplitude estimation [12] would improve the dependence on quadratically. Here we observe that the dependence on can also be improved quadratically, using a result of Gilyén, Arunachalam and Wiebe [22]. We will impose the restriction (for technical reasons) that the range of each function f_i is within , where these numbers could be replaced with any constants between 0 and 1. Given the more typical constraint that (e.g. if the output of f_i represents a probability), f_i can easily be modified to satisfy this constraint by a simple linear transformation, which does not change .

The results of [22] use two somewhat nonstandard oracle models which we now define. First we will consider probability access, and define what a probability oracle is.

Definition 1 (Probability oracle).

Let , where forms an orthonormal basis of the Hilbert space , and let be an ancilla register on qubits. Then an operator is called a probability oracle for if

for some arbitrary qubit states , .

Essentially, within this model our objective function corresponds precisely to the probability of a certain outcome being observed upon measurement (in particular, the probability of seeing when measuring the final qubit). Indeed, given a classical description of the function , an oracle of this form can be constructed without a significant overhead [13]. The next access model we consider is access via a phase oracle.

Definition 2 (Phase oracle).

Given a function , and given that forms an orthonormal basis of the Hilbert space , then the corresponding phase oracle allows queries of the form

The authors of [22] showed that a probability oracle is capable of simulating a phase oracle, and vice versa, with only logarithmic overhead:

Theorem 3 (Converting between probability and phase oracles [22]).

Suppose is given by access to a probability oracle which makes use of auxiliary qubits. Then we can simulate an -approximate phase oracle using queries to ; the gate complexity is the same up to a factor of . Similarly, suppose is given by access to a phase oracle . Then we can construct an -approximate probability oracle for using queries to . The gate complexity is the same up to a factor of .

This shows that the two access models are essentially equivalent in power. Now that we have defined probability oracles, we can show that access to probability oracles for the individual functions f_i immediately gives such access for f itself.

Lemma 4.

Assume we have access to each function via a probability oracle . Then we can construct a probability oracle for with a single use of controlled- operations (in superposition) and additional operations.

Proof.

We start with the superposition , where denotes a description of the real vector in terms of binary, up to some digits of precision, leading to an orthonormal basis. If is a power of 2, this state can be constructed easily by applying Hadamard gates to each qubit in a register of qubits. If not, the state can be constructed in circuit complexity as follows: attach a register of qubits; apply Hadamard gates to produce ; compute the function “” into an ancilla qubit using an efficient comparison circuit (e.g. [16]); measure the ancilla qubit; and proceed only if the answer is 1. If not (which occurs with probability at most ), repeat this step. We then apply the controlled operation . This produces

for some sequences of normalised states , . Rearranging subsystems, we can write this as

for some unnormalised states , where as required by the definition of a probability oracle for . ∎
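The effect of the construction in this proof can be checked with a small classical calculation. The sketch below assumes that f is the uniform average of the individual functions, as in (1), and tracks only the amplitudes attached to the index register and the final flag qubit; the ancilla states are ignored, and the names `averaged_oracle_amplitudes` and `probability_of_one` and the example probabilities are hypothetical.

```python
import math

def averaged_oracle_amplitudes(probs):
    """Amplitudes after a uniform superposition over i followed by controlled oracles.

    probs[i] plays the role of the value of the i'th function at a fixed input x.
    Keys are (index, flag) pairs; ancilla states are not modelled.
    """
    N = len(probs)
    amps = {}
    for i, p in enumerate(probs):
        amps[(i, 1)] = math.sqrt(1 / N) * math.sqrt(p)
        amps[(i, 0)] = math.sqrt(1 / N) * math.sqrt(1 - p)
    return amps

def probability_of_one(amps):
    """Probability of observing 1 when the flag qubit is measured."""
    return sum(a * a for (_, flag), a in amps.items() if flag == 1)

probs = [0.2, 0.5, 0.9, 0.4]   # hypothetical values of the individual functions at x
amps = averaged_oracle_amplitudes(probs)
norm_squared = sum(a * a for a in amps.values())
p_one = probability_of_one(amps)
```

The flag qubit is 1 with probability equal to the mean of `probs`, which is what the definition of a probability oracle for the averaged function requires.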

We will use this probability oracle within the framework of the fast quantum algorithm of [22] for computing gradients. This algorithm is applicable to functions that satisfy a certain smoothness condition. Given some analytic function , let , and for any , , let

The following result shows that if each function f_i satisfies the smoothness condition required by [22], then the overall function f satisfies the same condition.

Claim 5.

Let be a real constant, and fix some . Suppose that for all the function is analytic, and that for every natural number , and , we have that

then we have that also satisfies the same condition.

Proof.

We apply the linearity of . Observe that

and we are done. ∎
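For reference, the linearity argument can be written out in the following generic form. This is a sketch under assumptions: f is taken to be the uniform average of the f_i as in (1), \partial_\alpha stands for an arbitrary k'th-order partial derivative, and B_k is a placeholder for whatever bound is assumed on the derivatives of each f_i; none of this notation is taken from the original statement.

```latex
\left|\partial_\alpha f(x)\right|
  = \left|\frac{1}{N}\sum_{i=1}^{N} \partial_\alpha f_i(x)\right|
  \le \frac{1}{N}\sum_{i=1}^{N} \left|\partial_\alpha f_i(x)\right|
  \le \frac{1}{N}\sum_{i=1}^{N} B_k
  = B_k .
```

The first step uses linearity of differentiation, the second the triangle inequality, and the third the assumed bound on each f_i.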

In fact, it is not hard to see that this claim generalises to essentially any bound on the partial derivatives. We can now state the result we will need from [22].

Theorem 6 (Gilyén, Arunachalam and Wiebe [22, Theorem 25]).

Suppose that is an analytic function such that, for all and , . Assume access to is given by a phase oracle . Then there exists an algorithm that outputs a vector such that with 99% probability, using queries to the oracle and additional time .

Note that, if the time complexity of evaluating is , this dominates the overall runtime bound. We can encapsulate the combination of these results in the following theorem.

Theorem 7.

Let be defined as in (1), and assume that each function satisfies the conditions required for Theorem 6 and can be computed in time , for some bound such that . Then there is a quantum algorithm that outputs such that with 99% probability, in time .

Proof.

Given the ability to compute each function in time , we can produce a phase oracle computing in time . By Theorem 3, and using that , we can then obtain an operation approximating a probability oracle for up to error in time . By Lemma 4, this gives a probability oracle for , at additional cost . By Theorem 3, we then obtain a phase oracle for at additional cost . This finally allows us to apply Theorem 6 to achieve the stated complexity. ∎

Despite Theorem 7 giving a more efficient quantum algorithm for approximately computing the gradient, it is not clear whether this translates into a more efficient quantum algorithm for stochastic gradient descent, or a quantum speedup of other algorithms making use of the gradient. This is because the algorithm of [22] only outputs an approximate gradient, and one which may not be an unbiased estimate of the true gradient. To prove approximate convergence of stochastic gradient descent, it is not essential for the gradient estimates to be unbiased [8], and it is plausible that an approximate estimate of the gradient should lead to an approximate minimiser for f being found. However, the technique used in [8] to show approximate convergence in this scenario requires the 2-norm of the approximate gradient to be close to that of the true gradient. The algorithm of [22] provides accuracy ε in the ∞-norm, which would only give accuracy ε√n in the 2-norm. Further, it was shown by Cornelissen [15] that if f is picked from a certain class of smooth functions, approximating the gradient up to 2-norm accuracy requires uses of a phase oracle for f in the worst case, so this is not merely a technical restriction. Nevertheless, it is possible that quantum gradient estimation may be more efficient than stochastic gradient descent in practice.
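The gap between the two norms mentioned above can be seen in a small numerical example. The sketch below uses hypothetical values for the dimension `n` and the per-component error `eps`, and considers the worst case in which every component of the gradient estimate is off by the same amount.

```python
import math

def inf_norm(v):
    """Largest absolute component of v."""
    return max(abs(x) for x in v)

def two_norm(v):
    """Euclidean norm of v."""
    return math.sqrt(sum(x * x for x in v))

n = 10000      # hypothetical dimension
eps = 0.01     # hypothetical per-component error
error = [eps] * n   # worst case: every component is off by eps

worst_inf = inf_norm(error)
worst_two = two_norm(error)
```

In this worst case the error vector has ∞-norm `eps` but 2-norm `eps * sqrt(n)`, so a guarantee in the ∞-norm alone loses a factor of up to the square root of the dimension when converted to the 2-norm.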

Acknowledgements

We would like to thank Srinivasan Arunachalam for helpful explanations of the results of [22]. We acknowledge support from the QuantERA ERA-NET Cofund in Quantum Technologies implemented within the European Union’s Horizon 2020 Programme (QuantAlgo project) and EPSRC grants EP/R043957/1 and EP/T001062/1. This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No. 817581). No new data were created during this study.

Appendix A The DIRECT algorithm

In this appendix we briefly describe the DIRECT (“dividing rectangles”) algorithm [31] for global optimisation of functions , which is presented as Algorithm 5. The algorithm is based on maintaining a partition of the hypercube into hyperrectangles using the concept of “potentially optimal” hyperrectangles:

Definition 3.

Let , let be the current best function value found, and let be the current number of hyperrectangles in the partition of . Let denote the centre of the th hyperrectangle, and let denote the distance from the centre to the vertices. Hyperrectangle is said to be potentially optimal if there exists such that ,

(2)

and

(3)

We think of in Definition 3 as a surrogate for the Lipschitz constant of (which is not assumed to be known in advance). An example of the first couple of steps of dividing into rectangles is shown in Figure 5(a). The set of potentially optimal hyperrectangles can be determined in time , where is the number of distinct interval lengths, using a convex hull technique described in [31] and illustrated in Figure 5(b). The conditions (2) and (3) are satisfied by the points that lie on the lower convex hull when is plotted against for each hyperrectangle, and we also include the point . In Figure 5(b) the red dots represent potentially optimal hyperrectangles whereas the black dots represent hyperrectangles that are not potentially optimal.
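The convex hull idea can be sketched as follows. This is a simplified illustration, not the procedure of [31]: it assumes each hyperrectangle is summarised by a pair (size, centre value), keeps only the best value for each distinct size, computes the lower convex hull with a monotone chain scan, and returns the part of the hull from the lowest value rightwards. It ignores the extra point mentioned in the text and the corresponding condition (3), and the function names and example data are hypothetical.

```python
def lower_convex_hull(points):
    """Lower convex hull of 2D points with distinct x-coordinates, left to right."""
    hull = []
    for p in sorted(points):
        while len(hull) >= 2:
            (x1, y1), (x2, y2) = hull[-2], hull[-1]
            cross = (x2 - x1) * (p[1] - y1) - (y2 - y1) * (p[0] - x1)
            if cross > 0:      # strict left turn: hull[-1] stays on the lower hull
                break
            hull.pop()         # hull[-1] lies on or above the chord, so drop it
        hull.append(p)
    return hull

def potentially_optimal_candidates(rects):
    """Simplified candidate selection from (size, centre value) pairs."""
    best = {}
    for d, value in rects:
        # For each distinct size only the lowest centre value can be on the hull.
        if d not in best or value < best[d]:
            best[d] = value
    hull = lower_convex_hull(list(best.items()))
    # Hull points to the left of the lowest value would need a negative slope.
    lowest = min(range(len(hull)), key=lambda k: hull[k][1])
    return hull[lowest:]

rects = [(0.1, 1.0), (0.2, 0.5), (0.3, 0.9), (0.4, 0.6), (0.4, 1.5), (0.5, 2.0)]
candidates = potentially_optimal_candidates(rects)
```

On this example the candidates are the points with sizes 0.2, 0.4 and 0.5; the largest hyperrectangle is always among them, which is what drives the global exploration of the search space.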

1. Let c_1 be the centre of [0,1]^n and evaluate f(c_1). Assign f_min = f(c_1), m = 1, t = 0.
2. Let S be the set of potentially optimal hyperrectangles.
3. Select any hyperrectangle j ∈ S.
4. Evaluate hyperrectangle j and decide where to divide it using the following procedure: Let I be the set of dimensions with maximal side length. Let δ be one-third of this maximal side length. Let c be the centre of hyperrectangle j. Evaluate f at the points c ± δe_i for all i ∈ I, where e_i is the i’th vector in the standard basis. Divide the hyperrectangle containing c into thirds along the dimensions i ∈ I, in ascending order of w_i = min{f(c + δe_i), f(c − δe_i)}.
5. Let Δm be the number of new points evaluated. Update m = m + Δm and the new best minimum f_min.
6. S = S \ {j}. If S ≠ ∅, go to step 3.
7. t = t + 1. If t = T, where T is the iteration limit, then stop; if not, go to step 2.

Algorithm 5: DIRECT algorithm [31] for optimisation over .
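The sampling part of the division step can be sketched as follows. This is a minimal sketch under assumptions: the ordering key is taken to be the smaller of the two new function values along each dimension, as we recall it from the standard description of DIRECT, and the function name `sample_and_order`, the test function and the starting hyperrectangle are hypothetical. The sketch stops short of actually splitting the hyperrectangle.

```python
def sample_and_order(f, centre, sides):
    """Sample f around `centre` along the longest dimensions and order those dimensions.

    Returns (delta, order, w), where delta is one third of the longest side,
    w[i] is the smaller of the two new function values along dimension i,
    and order lists the longest dimensions by ascending w[i].
    """
    longest = max(sides)
    dims = [i for i, s in enumerate(sides) if s == longest]
    delta = longest / 3
    w = {}
    for i in dims:
        plus = list(centre)
        plus[i] += delta
        minus = list(centre)
        minus[i] -= delta
        w[i] = min(f(plus), f(minus))
    order = sorted(dims, key=lambda i: w[i])
    return delta, order, w

def f(x):
    # Hypothetical test function on the unit square.
    return (x[0] - 0.9) ** 2 + 10 * (x[1] - 0.5) ** 2

delta, order, w = sample_and_order(f, centre=[0.5, 0.5], sides=[1.0, 1.0])
```

Dividing first along the dimension with the smallest value places the best new point in the largest of the resulting sub-rectangles.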

Figure 6: Illustration of aspects of the DIRECT algorithm. (a) Dividing the initial hypercube. (b) Identifying potentially optimal hyperrectangles.

References