 # Contractivity of Runge-Kutta methods for convex gradient systems

We consider the application of Runge-Kutta (RK) methods to gradient systems (d/dt)x = -∇ V(x), where, as in many optimization problems, V is convex and ∇ V (globally) Lipschitz-continuous with Lipschitz constant L. Solutions of this system behave contractively, i.e. the distance between two solutions x(t) and x(t) is a nonincreasing function of t. It is then of interest to investigate whether a similar contraction takes place, at least for suitably small step sizes h, for the discrete solution. We prove that there are RK schemes that for arbitrarily small h do not behave contractively. We derive a sufficient condition that guarantees the contractivity of a given RK method for a given h. It turns out that there are explicit RK schemes that behave contractively whenever Lh is below a scheme-dependent constant, which is some cases coincides with the one given by linear stability analysis. We show that, among all consistent, explicit RK methods, Euler's rule is optimal in this regard.

## Authors

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

 ddtx=−∇V(x), (1.1)

where is a continuously differentiable real function defined in , arise in many applications and, accordingly, have attracted the interest of numerical analysts for a long time, see e.g. [1, 2] among many others. Since , the potential (in Physics problems) or objective function (in optimization applications) decreases along solutions. Furthermore, if , then is a stationary point of , i.e., . These facts explain the well-known connections between the numerical integrators for (1.1) and algorithms for the minimization of . The simplest example is provided by the Euler integrator, that gives rise to the gradient descent optimization algorithm . In the case where possesses a global Lipschitz constant and (1.1) is integrated with an arbitrary Runge-Kutta (RK) method, Humphries and Stuart  showed that the value of decreases along the computed solution, i.e. , for positive stepsizes with , where only depends on and on the RK scheme.

In view of the important role that convex objective functions play in optimization theory, see e.g. [3, Section 2.1.2], it is certainly of interest to study numerical integrators for (1.1) in the specific case where is convex, i.e.,

 ∀x,y,⟨∇V(x)−∇V(y),x−y⟩≥0 (1.2)

( and stand throughout for the Euclidean inner product and norm in ). From a differential equations point of view, the relation (1.2) means that the one-sided Lipschitz constant , see [4, Section IV.2] or [5, Definition 112A], of the right-hand side of (1.1) vanishes. It follows that, for any two solutions , , we have the

contractivity estimate

 ∀t≥0,∥˜x(t)−x(t)∥≤∥˜x(0)−x(0)∥, (1.3)

and in particular for any solution and any stationary point (which by convexity will automatically be a minimizer)

 ∀t≥0,∥x(t)−x⋆∥≤∥x(0)−x⋆∥.

The study of linear multistep methods that, when applied to systems (gradient or otherwise) with one-sided Lipschitz constant mimic the contractive behaviour in (1.3), began with the pioneering work of Dahlquist . The corresponding developments in the Runge-Kutta (RK) field followed immediately . Those studies had the primary aim of extending the notion of A-stability  to nonlinear settings111The word implicit in the title of  is revealing in this regard. and, accordingly, their emphasis was on identifying methods that behave contractively for arbitrary choices of the step-size . This gave rise to the notions of algebraic stability/B-stability of RK methods ([4, Section IV.12] or [5, Section 357]) and G-stability of multistep methods ([4, Section V.6] or [5, Section 45]). These developments are the subjet of the monograph . Of course, algebraically stable/B-stable RK schemes and G-stable multistep methods have to be implicit and therefore are not well suited to be the basis of optimization algorithms for large problems.

In this article we focus on the application RK methods to gradient systems (1.1) where is convex and is Lipschitz continuous with Lipschitz constant

 ∀x,y,∥∇V(x)−∇V(y)∥≤L∥x−y∥,

or, in optimization terminology, where the objective function is convex and -smooth. For our purposes here, we shall say that an interval , , is an interval of contractivity of a given RK scheme if, for , any -smooth convex , and any two initial points , , the corresponding RK solutions after one time step satisfy

 ∥˜x1−x1∥≤∥˜x0−x0∥. (1.4)

By analogy with the result by Humphries and Stuart quoted above, one may perhaps expect that each (consistent) Runge-Kutta method would possess an interval of contractivity; however this is not true. We establish in Section 2 that the familiar second-order method due to Runge that for the general system takes the form

 y1=y0+hF(y0+h2F(y0)) (1.5)

possesses no interval of contractivity. The proof proceeds in two stages. We first follow the approach in [10, 11], based on ideas from robust control theory, and identify, for given and , initial points , and gradient values

 ∇V(x0),∇V(˜x0),∇V(x0−h2∇V(x0)),∇V(˜x0−h2∇V(˜x0))

that ensure that (1.4) is violated. In a second stage we provide a counterexample by constructing a suitable -smooth

by convex interpolation; this is not an easy task because multivariate convex interpolation problems with scattered data are difficult to handle

[12, 13].

In Section 3 we present a result (Theorem 3.1) that, under suitable hypotheses, ensures contractivity of a given RK integrator for a given value of . It turns out that there are explicit RK formulas that possess a contractivity interval and in fact this contractivity interval may even be as large as the standard linear stability interval. We also show (Theorem 3.3) that, among all explicit RK schemes, the contractivity interval provided by Theorem 3.1 is longest for Euler’s rule/steepest descent. In order not to stop the flow of the paper, some proofs have been postponed to the final Sections 4 and 5.

Before closing the introduction we point out that there has been much recent interest [14, 15, 16, 17] in interpreting optimization algorithms as discretizations of differential equations (not necessarily of the form (1.1)), among other things because differential equations help to gain intuition on the behaviour of discrete algorithms.

## 2 An RK scheme without contractivity interval

In this section we show that the RK scheme (1.5) has no interval of contractivity.

For the system (1.1), we write the formulas for performing one step from the initial points and in as

 x1=x0+hkh,˜x1=˜x0+h˜kh,xh=x0+h2hk0,˜xh=˜x0+h2˜k0, (2.1)

with

 k0=−∇V(x0),˜k0=−∇V(˜x0),kh=−∇V(xh),˜kh=−∇V(˜xh) (2.2)

(the subindices , , refer to the beginning of the step, , the end of the step, , and the halfway location, , respectively). Following the approach in [10, 11], we regard , , , , , , as inputs, and , , , , , as outputs222Note that , are both inputs and outputs.. The relations (2.2) provide a feedback loop that expresses the inputs , , , as values of a nonlinear function computed at the outputs , , , . The function that establishes this feedback is the negative gradient of some that is convex and -smooth. Now it is well known [3, Theorem 2.1.5] that possesses both of these properties simultaneously if and only if

 ∀x,y,1L∥∇V(x)−∇V(y)∥2≤⟨∇V(x)−∇V(y),x−y⟩. (2.3)

This implies that the vectors

, , , delivered by the feedback loop must obey the following constraints:

 1L∥˜k0−k0∥2 ≤ −⟨˜k0−k0,˜x0−x0⟩, (2.4) 1L∥˜kh−kh∥2 ≤ −⟨˜kh−kh,˜xh−xh⟩, (2.5) 1L∥kh−k0∥2 ≤ −⟨kh−k0,xh−x0⟩, (2.6) 1L∥˜kh−˜k0∥2 ≤ −⟨˜kh−˜k0,˜xh−˜x0⟩, (2.7) 1L∥˜kh−k0∥2 ≤ −⟨˜kh−k0,˜xh−x0⟩, (2.8) 1L∥kh−˜k0∥2 ≤ −⟨kh−˜k0,xh−˜x0⟩ (2.9)

(we are dealing with four gradient values and therefore (2.3) may be applied in ways). In a robust control approach, we will not assume at this stage that the vectors are values of one and the same function , on the contrary the vectors are seen as arbitrary except for the above constraints. More precisely, for fixed and , we investigate the lack of contractivity by studying the real function

 ∥˜x1−x1∥2∥˜x0−x0∥2 (2.10)

of the input variables , , , , , , subject to the constraints and (2.4)–(2.9). Here , , , are known linear combinations of the inputs given in (2.1).

Our task is made easier by the following observations. First of all, multiplication of , , , , , , , , , by the same scalar preserves the relations (2.1), the constraints (2.4)–(2.9) and the value of the quotient (2.10). Therefore we may assume at the outset that . In addition, since the problem is also invariant by translations and rotations in , we may set and . After these simplifications, we are left with the task of ascertaining if we can make larger than by choosing appropriately the vectors , , , subject to the constraints. Here is a choice in that works (see Section 4 for the origin of these vectors)

 k0 = [0,−3/h]T, (2.11) ˜k0 = [−L/2,−3/h+L/2]T, (2.12) kh = [0,−3/h+L]T, (2.13) ˜kh = [L3h2/64,−3/h+L−L2h/8]T. (2.14)

In fact, with

 x0=[0,0]T,˜x0=[1,0]T (2.15)

and (2.11)–(2.14), the relations (2.1) yield

 xh = [0,−3/2]T, (2.16) ˜xh = [1−Lh/4,−3/2+Lh/4]T, (2.17) x1 = [0,−3+Lh]T, (2.18) ˜x1 = [1+L3h3/64,−3+Lh+L2h2/8]T. (2.19)

It is a simple exercise to check that the constraints are satisfied at least for . In addition

 ˜x1−x1=[1+L3h3/64,L2h2/8]T

and, accordingly,

 ∥˜x1−x1∥2=1+132L3h3+164L4h4+14096L6h6>1=∥˜x0−x0∥2. (2.20)

(The third power in matches the size of the local error of the scheme.)

###### Remark 2.1.

The vectors (2.11)–(2.14) become longer as decreases. This is a consequence of the way we addressed the study of (2.10) where we fixed the length of for mathematical convenience. As pointed out above we could alternatively have chosen , and multiplied (2.11)–(2.14) by a factor of and that would have given a configuration with bounded gradients resulting in lack of contractivity. Figure 1: A configuration that satisfies the constraints resulting from convexity and L-smoothness and leads to lack of contractivity for L=2, h=1

To get some insight, we have depicted in Figure 1, when , , the points , , , , , along with the vectors , , , (for clarity, the vectors have been drawn after multiplying their length by ). The difference vector forms, as required by convexity, an obtuse angle with and this causes to be shorter than . Similarly the difference forms by convexity an obtuse angle with and if and were alternatively defined as and respectively we see from the Figure that we would have . (That alternative time stepping is studied in Example 4 in the next section.) However for our RK scheme (1.5) the direction of is used to displace (rather than ) to get (and similarly for the points with tilde); the vector forms an acute angle with and this makes it possible for to be longer than . For smaller values of and/or the effect is not so marked as that displayed in the figure but is nevertheless present.

While (2.20) is consistent with the scheme having no interval of contractivity, we are not yet done, because it is not obvious whether there is a convex, -smooth that realizes the relations (2.2) for the ’s and ’s we have found. Nevertheless the preceding material will provide the basis for proving in the final section the following result:

###### Theorem 2.3.

Fix . For the RK scheme (1.5) and each arbitrarily small value of , there exist an -smooth, convex and initial points and such that (1.4) is not satisfied. As a consequence the scheme does not possess an interval of contractivity.

One could perhaps say that the method has an empty interval of contractivity.

## 3 Sufficient conditions for contractivity

The application of the -stage RK method with coefficients and weights , , to the system of differential equations (1.1) results in the relations

 x1 = x0+hs∑j=1bjkj, (3.1) Xi = x0+hs∑j=1aijkj,i=1,…,s, kj = −∇V(Xj),j=1,…,s.

Here the and are the stage vectors and slopes respectively. Of course, the scheme is consistent/convergent provided that .

The symmetric matrix with entries

 mij=biaij+bjaji−bibj

that features in the next result plays a central role in the study of algebraic stability as defined by Burrage and Butcher, [4, Definition 12.5] or [5, Definition 357B] and also in symplectic integration . In fact the hypotheses 1. and 2. of the following theorem may be seen as adaptations to the present situation of the requirements for algebraic stability.

###### Theorem 3.1.

Let the scheme (3.1) be applied to the gradient system (1.1) with convex, -smooth .

Assume that:

1. The weights , , are nonnegative.

2. The symmetric matrix with entries

 ¯¯¯¯¯mij(h)=2hbiLδij+h2mij

( is Kronecker’s symbol) is positive semidefinite.

Then, if and are the RK solutions after a step of lenght starting from and respectively the contractivity estimate (1.4) holds. In particular, if is a minimizer of , then

 ∥x1−x⋆∥≤∥x0−x⋆∥.

Proof. We start from the identity [4, Theorem 12.4]

 ∥˜x1−x1∥2=∥˜x0−x0∥2+2hs∑i=1bi⟨˜ki−ki,˜Xi−Xi⟩−h2s∑i,j=1mij⟨˜ki−ki,˜kj−kj⟩,

where and respectively denote the stage vectors and slopes for the step . (This identity holds if and are replaced by any symmetric bilinear map and the associated quadratic map respectively, see [18, Lemma 2.5].) From (2.3), for ,

 ⟨˜ki−ki,˜Xi−Xi⟩≤−1L⟨˜ki−ki,˜ki−ki⟩,

which implies, in view of the nonnegativity of the weights,

 ∥˜x1−x1∥2≤∥˜x0−x0∥2−s∑i,j=1¯¯¯¯¯mij(h)⟨˜ki−ki,˜kj−kj⟩.

If is positive semidefinite the sum is and the proof is complete. In addition, if we now set , we trivially obtain .

Let us present some examples. Hereafter means that the matrix is positive semidefinite.

Example 1. For Euler’s rule, , , , we find and therefore we have contractivity for in the interval . This happens to coincide with the familiar stability interval for the linear scalar test equation , . The restriction on the step size is well known in the analysis of the gradient descent algorithm, see e.g. . Observe that the scalar test equation arises from the -smooth convex potential and that therefore no RK scheme can have an interval of contractivity longer than its linear stability interval.

Example 2. For the well-known explicit, two-stage, second-order scheme defined and , we compute

 ¯¯¯¯¯¯M(h)=⎡⎢⎣hL−h24h24h24hL−h24⎤⎥⎦.

This is for (which happens to coincide with the condition for linear stability of the method).

Example 3. The formula two-stage, second order (1.5) studied in the preceding section has , and . Thus

 ¯¯¯¯¯¯M(h)=⎡⎢⎣0h22h222hL−h2⎤⎥⎦.

There is no value of for which this matrix is , which agrees with Theorem 2.3. Recall that this scheme has a linear stability interval (in fact this scheme and the one in Example 2. give exactly the same approximations when applied to linear systems with constant coefficients).

Example 4. Explicit, two-stage, first-order scheme with and . Here

 ¯¯¯¯¯¯M(h)=⎡⎢⎣hL−h2400hL−h24,⎤⎥⎦

and we have contractivity for . This could have been concluded from Example 1, because performing one step with this method yields the same result as taking two steps of length with Euler’s rule and accordingly, for this method, ensures contractivity.

Example 5. We may generalize as follows. Consider the explicit -stage, first-order scheme with Butcher tableau

 000⋯0b100⋯0b1b20⋯0⋮⋮⋮⋮⋮b1b2b3⋯0b1b2b3⋯bs (3.2)

(i.e., whenever ) with

 s∑i=1bi=1,bi≥0,i=1,…,s.

Performing one step with this scheme is equivalent to successively performing steps with Euler’s rule with step-sizes , …, , and therefore contractivity is ensured for . This conclusion may alternatively be reached by applying the theorem; the method has given by

 diag(2hb1/L−h2b21,2hb2/L−h2b22,…,2hbs/L−h2b2s), (3.3)

a matrix that is if and only if . If we see the weights as parameters, then the least severe restriction on is attained by choosing equal weights , , leading to the condition . But then one is really time-stepping with Euler rule with stepsize .

Example 6. For the “classical” order four, four stages method of Kutta (with weights , ) the theorem leads, after some tedious algebra, to the interval of contractivity , which is shorter than the linear stability interval , .

Example 7. Since, for nonnegative weights, clearly whenever , Theorem 3.1 implies the well-known result that algebraically stable RK schemes have interval of contractivity .

In Examples 1–5 and 7 the interval of contractivity that one obtains by applying the theorem (with the convention that in Example 4 the contractivity interval has zero length) is the longest possible. However it is not known whether the conditions in the theorem above are necessary for contractivity, so that one may not rule out that for some RK methods the application of the theorem may lead to suboptimal results. Proving that algebraic stability is necessary for contractivity/B-stability for (not necessarily gradient) contractive problems is a very delicate task  that requires much mathematical machinery and we fear that, likewise, the investigation of the necessity of the conditions in Theorem 3.1 will not be easy.

We close this section by presenting some results for explicit RK methods, i.e., to methods with whenever ; accordingly all RK schemes are hereafter assumed to be explicit.

The method (1.5) (Example 3) does not satisfy the second hypothesis in Theorem 3.1 for any choice of . We shall prove next that, essentially, the same will be the case for any explicit, consistent RK algorithm with one or more zero weights. Before we state the result, we note that any RK scheme may be rewritten as one with some zero weights by the simple, dubious expedient of adding additional stages that play no role in the computation of the numerical solution . To avoid trivialities, we then assume that

 |bj|+s∑i=j+1|aij|>0,j=1,…,s. (3.4)

When this condition does not hold, we have, for some , , (i.e., the -th column of the tableau vanishes). Then the slope is never used by the algorithm, the stage may be done away with and the scheme may be rewritten as one with stages. Thus (3.4) will be fulfilled for any bona fide RK method.

Remark. RK methods that may be rewritten in a simpler way are called reducible. The property (3.4) is essentially equivalent to the notion of DJ-reducibility [4, Definition 12.15].

We then have the following result:

###### Theorem 3.2.

Assume that the explicit, consistent RK scheme (3.1) satisfies (3.4) and has nonnegative weights. If there exist such that , then all the weights are positive.

Proof. By contradiction. Assume that one of the weights vanishes and choose the largest with . From (3.4), , and then by definition of we know that for . The matrix obtained by keeping the and , for some rows and columns of is given by

(we have taken into account that because the method is explicit). For such that , the matrix in the last display is also and in particular has determinant . Since ,

 Δ(h)=−(h2bjajj0)2,

which combined with yields . Since this holds for any , we have found a contradiction with (3.4).

Here is our last result, where, in view of Theorem 3.2, we assume positive weights.

###### Theorem 3.3.

Consider an -stage, explicit, consistent RK method with weights .

1. If for some , , then .

2. If for , , then the method is necessarily given by (3.2) with , (i.e., it is the concatenation of Euler substeps of equal length ).

Proof. For the first item, we first note that, as we saw in Example 5, the result is true for the particular case where the scheme is of the form (3.2), i.e., a concatenation of Euler’s substeps. Let be the matrix associated with the scheme of the form (3.2) that possesses the same weights as the given scheme (recall that this matrix was computed in (3.3)). The first item will be proved if we show that implies , because, as we have just noted, the last condition guarantees that . Assume that . Then, its diagonal entries must be nonnegative,

 0≤¯¯¯¯¯mii(h)=2hbi/L−h2b2i,i=1,…,s,

and, in view of (3.3), this entails that , as we wanted to establish.

We now prove the second part of the theorem. If , then

 0≤¯¯¯¯¯mii(2s/L)=4sbi/L2−4s2b2i/L2,i=1,…,s,

or, after dividing by , . Since , we conclude that , , which leads to for each . A semidefinite positive matrix with vanishing diagonal elements must be the null matrix and therefore for

 0=¯¯¯¯¯mij(2s/L)=(2s/L)2(biaij−bibj)

and then . The proof is now complete.

## 4 The construction of the auxiliary gradients

The proof of Theorem 2.3 hinges on the use of the vectors (2.11)–(2.14). In this section we briefly describe how we constructed them.

Let us introduce the vectors in

 δ0=˜x0−x0,δh=˜xh−xh,δ1=˜x1−x1

and

 Δ0=˜k0−k0,Δh=˜kh−kh,

so that and . We fixed as explained in Section 2, saw and as variables in and considered the problem of maximizing under the constraints (2.4)–(2.5), i.e

 1L∥Δ0∥2≤−⟨Δ0,δ0⟩,1L∥Δh∥2≤−⟨Δh,δh⟩,

With some patience, this maximization problem may be solved analytically in closed form after introducing Lagrange multipliers. Both constraints are active at the solution. The expression of the maximizer is a complicated function of and and to simplify the subsequent algebra we expanded that expression in powers of and kept the leading terms. This results in

 Δ0=[−L/2,L/2]T,Δh=[L3h2/64,−L2h/8]T

(there is a second solution obtained by reflecting this with respect to the first coordinate axis).

Once we had found candidates for the differences , , we identified suitable candidates for and . We arbitrarily fixed the direction of by choosing it to be perpendicular to (see (2.11)). Its second component was sought in the form ( a constant) so that the distance between and behaved like as (recall that we have scaled things in such a way that and are also at a distance as ). We also took to be perpendicular to ; the second component of this vector was chosen to be of the form so as to have with a view to satisfying (2.6). After some experimentation we saw that the values , led to a set of vectors for which all six contraints (2.4)–(2.9) hold at least for .

For the sake of curiosity we also carried out numerically the maximization of (2.10) subject to the constraints. It turns out that the maximum value of the quotient is approximately for small, independently of the dimension of the problem (for the experiments suggest that the scheme is contractive). Since, in (2.20), the vectors (2.11)–(2.14) are very close to providing the combination of gradients that leads to the greatest dilation (2.10).

## 5 Proof of Theorem 2.3

The proof proceeds in two stages. We first construct an auxiliary piecewise linear, convex and then we regularize it to obtain .

### 5.1 Constructing a piecewise linear potential by convex interpolation

Let be the Lispchitz constant and set , where is a positive safety factor, independent of and , whose value will be determined later. Restrict hereafter the attention to values of with . We wish to construct a potential for which the application of the RK scheme starting from the two initial conditions (2.15) lead to the relations (2.11)–(2.14), (2.16)–(2.19) with in lieu of and therefore, as we know, to lack of contractivity.

We consider the following four (pairwise distinct) points in the plane of the variable (see (2.16)–(2.19))

 Z1 = [0,0]T, Z2 = [1,0]T, Z3 = [0,−3/2]T, Z4 = [1−L′h/4,−3/2+L′h/4]T,

and associate with them the four (pairwise distinct) vectors (see (2.11)–(2.14))

 G1 = [0,3/h]T, G2 = [L′/2,3/h−L′/2]T, G3 = [0,3/h−L′]T, G4 = [−L′3h2/64,3/h−L′+L′2h/8]T,

and four real numbers that will be determined below. We then pose the following Hermite convex interpolation problem: Find a real convex function defined in , differentiable in the neighbourhood of the , and such that

 \definecolor[named]pgfstrokecolorrgb0,0,0\pgfsys@color@gray@stroke0\pgfsys@color@gray@fill0˜V(Zi)=Fi,∇\definecolor[named]pgfstrokecolorrgb0,0,0\pgfsys@color@gray@stroke0\pgfsys@color@gray@fill0˜V(Zi)=Gi,i=1,…,4.

If the interpolation problem has a solution, then the tangent plane to at is given by the equation with

 \definecolor[named]pgfstrokecolorrgb0,0,0\pgfsys@color@gray@stroke0\pgfsys@color@gray@fill0πi(ζ) =Fi+⟨Gi,ζ−Zi⟩,i=1,…,4.

and, by convexity,

 Fi≥πj(Zi),i≠j,i,j=1,…,4. (5.1)

This is then a necessary condition for the Hermite problem to have a solution. We found the following set of values

 F1 = 0, F2 = L′4, F3 = −92h+9L′8, F4 = −92h+15L′8−L′2h4+L′3h2128,

that satisfy the relations (5.1) (in fact they satisfy all of them with strict inequality).

It is not difficult to see [12, 13], that once we have ensured (5.1), the Hermite problem is solvable. The solution is not unique and among all solutions the minimal is clearly given by the piecewise linear function

 \definecolor[named]pgfstrokecolorrgb0,