 # Spatial extreme values: variational techniques and stochastic integrals

This work employs variational techniques to revisit and expand the construction and analysis of extreme value processes. These techniques permit a novel study of spatial statistics of the location of minimizing events. We develop integral formulas for computing statistics of spatially-biased extremal events, and show that they are analogous to stochastic integrals in the setting of standard stochastic processes. We also establish an asymptotic result in the spirit of the Fisher-Tippett-Gnedenko theory for a broader class of extremal events and discuss some applications of our results.


## 1 Introduction

Extreme value theory is the branch of probability theory and statistics concerned with the study of extreme deviations from the median behavior. A fundamental problem is to characterize the asymptotic distribution of

$$M_n = a_n \min_{i=1,\dots,n} (X_i - b_n),$$

where the $X_i$ are i.i.d. random variables. A complete account of the possible limit distributions of this type of process is given by the Fisher-Tippett-Gnedenko theory (see e.g. [Res13]), which also explains the role of the normalizing sequences $a_n$ and $b_n$. For instance, if the $X_i$ are strictly positive with unit density at zero, $a_n = n$ and $b_n = 0$, then the $M_n$ converge in law to the exponential distribution with unit rate. This classic theory and some related stochastic processes will be reviewed in Subsection 1.2.

In many modern settings it is desirable to understand not only the distribution of minima, but also the index (or “spatial location”) of minimizers; this paper investigates some questions arising in such settings. Suppose, as an illustration, that a user sends a request to a group of $n$ spatially-distributed servers with distinct locations $x_1, \dots, x_n$. The user and the parties running the servers will certainly be interested in the waiting time for the first response to occur, but will likely also be interested in identifying the location of the first responding server. In order to model such a system, we will suppose that the response time of the servers is given by

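$$f_n(x) := \begin{cases} \dfrac{n}{\lambda(x_i)}\,\xi_i + g(x_i), & \text{for } x = x_i,\ i = 1, \dots, n, \\ +\infty, & \text{otherwise,} \end{cases} \tag{1.1}$$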

where $g(x_i)$ represents a deterministic communication latency (i.e. the amount of time it takes the message to reach the $i$-th server) that is not necessarily spatially uniform, the $\xi_i$ are i.i.d. exponential variables, and $\lambda(x_i)$ represents the processing speed of the server at location $x_i$ (which could be related to either hardware or workload). We have chosen the scaling for the random exponential term so that in the large $n$ limit it has comparable size to the deterministic latency. In a way that will be made precise, this scaling ensures that, for large $n$, the overall processing rate of all servers is of order one and given by the function $\lambda\rho$, with $\rho$ the density of server locations introduced below. The latency $g$ imposes a relatively weak deterministic spatial correlation structure and models a scenario where the processing power needed for the request is high and the communication overhead is low. The locations of the servers will be assumed to be distributed according to a probability density $\rho$ in a bounded domain $D \subset \mathbb{R}^d$, so that, in general, the locations will not be uniformly distributed. In this framework, it is natural to study the distribution of the first response time, along with the distribution of the location of the first server making a response.
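The server model can be explored numerically. The following is a minimal simulation sketch, not the authors' code: it reads the random term in (1.1) as $(n/\lambda(x_i))\,\xi_i$ (so that a larger $\lambda$ means a faster server), takes $D = (0,1)$ with uniformly distributed locations, and uses hypothetical choices of `lam` and `g` that do not come from the text.

```python
import math
import random

def simulate_first_response(n, lam, g, sample_location, rng):
    """Draw n servers and return (time, location) of the first response.

    The response time at x_i is (n / lam(x_i)) * xi_i + g(x_i), xi_i ~ Exp(1).
    """
    best_time, best_loc = math.inf, None
    for _ in range(n):
        x = sample_location(rng)
        t = n / lam(x) * rng.expovariate(1.0) + g(x)
        if t < best_time:
            best_time, best_loc = t, x
    return best_time, best_loc

# Hypothetical model ingredients, for illustration only.
lam = lambda x: 1.0 + x                  # processing speed
g = lambda x: 0.5 * abs(x - 0.5)         # latency grows away from a user at 0.5
uniform_location = lambda rng: rng.random()

rng = random.Random(0)
samples = [simulate_first_response(200, lam, g, uniform_location, rng)
           for _ in range(2000)]
mean_time = sum(t for t, _ in samples) / len(samples)
mean_loc = sum(x for _, x in samples) / len(samples)
print(f"mean first response time ~ {mean_time:.3f}, mean location ~ {mean_loc:.3f}")
```

Each run returns both the first response time and the location at which it occurred, which are the two quantities whose joint law is studied in the rest of the paper.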

A first contribution of this paper is to provide integral representation formulas, which to our knowledge are novel, for the distribution of mins and argmins. These formulas may be of use in various inference problems, some of which will be outlined in Section 1.3. We will be mostly interested in the limit problem as $n \to \infty$, and a second contribution is to introduce an appropriate analytical setting in which to study such a limit using modern tools from the calculus of variations [DM93]. The response time functions $f_n$ will be viewed as random variables with values in the space $S(\bar{D})$ of lower semi-continuous functions (a precise definition of this space, and the associated topology of $\Gamma$-convergence, will be given in Section 2.1). Intuitively, $S(\bar{D})$ is a natural choice of function space, as it permits the evaluation of minima and minimizers both in the discrete case with finite $n$ and in the limiting case, $n = \infty$. Moreover, the functions $f_n$ above are immediately lower semi-continuous, as they are defined to be $+\infty$ except at the $x_i$. Finally, a third contribution of this paper is to draw several analogies between extreme value processes and standard stochastic processes. We show that, if $g = 0$, the large $n$ distribution of mins and argmins of the response times is governed by the distribution of mins and argmins of a limiting extreme value process $W_\lambda$ that can be seen as an analog of Brownian motion. The process $W_\lambda$ has been studied before, but here our focus is on the spatial statistics of minimizers, which differs from much of the previous literature (see Subsection 1.2). In the case of general $g$, evaluating local minima is analogous to computing stochastic integrals of $g$, with extreme value processes playing the role of Brownian motion.

The remainder of the introduction is organized as follows. In Subsection 1.1 we state our main results. We review related work in Subsection 1.2, and applications and future lines of research in Subsection 1.3. We close with an outline of the paper in Subsection 1.4.

### 1.1 Main results

The main results of this paper are concerned with the random process $W_\lambda$, which will serve as a limit of the process $f_n$ described above in the case $g = 0$. Here and throughout $D$ denotes a bounded domain in $\mathbb{R}^d$, $\bar{D}$ its closure, and $S(\bar{D})$ the space of lower semi-continuous functions on $\bar{D}$ (see Section 2 for more details on this space).

###### Definition 1.1.

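Let $\lambda$ be a continuous, non-negative function on $\bar{D}$. We define $W_\lambda$ to be the random object with values in $S(\bar{D})$ satisfying the following properties:

1. For every closed set $C$ the random variable $\min_{x \in C} W_\lambda(x)$ is exponentially distributed with rate

 $$\lambda_C := \int_{C \cap \bar{D}} \lambda(x)\,dx.$$
2. For any finite collection of disjoint and closed sets $C_1, \dots, C_k$, the random variables

 $$\min_{x \in C_1} W_\lambda(x), \dots, \min_{x \in C_k} W_\lambda(x)$$

are independent.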

We will show in Propositions 2.9 and 2.10 that the random process $W_\lambda$ exists and is well defined. Furthermore, we will show in Proposition 2.19 that after adding a deterministic function $g$ to $W_\lambda$, the resulting random function has well-defined, unique $k$-argmins (as defined in Definition 2.14). Our first main result establishes a representation formula for the joint distribution of the first $k$-argmins.

###### Theorem 1.2.

Let $g \in S(\bar{D})$ (see Section 2.1) and let $W_\lambda$ be as in Definition 1.1. Then, (with probability 1) the random function $W_\lambda + g$ has well-defined, unique $k$-argmins for every $k \in \mathbb{N}$ (see Definition 2.14). In addition, the joint distribution of the first $k$-argmins, denoted $X^{(1)}, \dots, X^{(k)}$, is given by the density function:

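$$\frac{P\big(X^{(1)} \in dx_1, \dots, X^{(k)} \in dx_k\big)}{dx_1 \dots dx_k} = \int_{\mathbb{R}^{k-1}} \Phi\big(x_k, (g - r_{k-1})_+\big)\, \Psi(x_1, r_1, g) \times \prod_{j=2}^{k-1} \Psi\big(x_j, r_j - r_{j-1}, (g - r_{j-1})_+\big)\, dr_1 \dots dr_{k-1}, \tag{1.2}$$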

with

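$$\begin{aligned}
\Phi(z, h) &:= \int_{h(z)}^{\infty} \exp\Big(-\bar{\lambda} \int_{-\infty}^{t} H_{\lambda,h}(s)\,ds\Big)\,dt, \\
\Psi(z, t, h) &:= \chi_{t > h(z)} \exp\Big(-\bar{\lambda} \int_{-\infty}^{t} H_{\lambda,h}(s)\,ds\Big), \\
H_{\lambda,h}(s) &:= \mu_\lambda\big(\{x \in D : h(x) \le s\}\big), \\
\bar{\lambda} &:= \int_D \lambda(x)\,dx, \qquad \mu_\lambda(E) := \frac{1}{\bar{\lambda}} \int_E \lambda(x)\,dx.
\end{aligned} \tag{1.3}$$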
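The exponent $\bar{\lambda}\int_{-\infty}^{t} H_{\lambda,h}(s)\,ds$ shared by $\Phi$ and $\Psi$ can be evaluated numerically. The sketch below is an illustration under our own assumptions ($D = (0,1)$, hypothetical `lam` and `g`, a plain midpoint rule); it also compares the result with $\int_D \lambda(x)\,(t - h(x))_+\,dx$, which is what one obtains by exchanging the order of integration. That comparison is our remark, not a formula stated in the text.

```python
import math

def integrate(fn, a, b, cells=400):
    """Midpoint rule for the integral of fn over [a, b]."""
    h = (b - a) / cells
    return h * sum(fn(a + (j + 0.5) * h) for j in range(cells))

# Hypothetical data on D = (0, 1), chosen only for illustration.
lam = lambda x: 1.0 + x
g = lambda x: x
lam_bar = integrate(lam, 0.0, 1.0)          # total rate, equals 1.5 here

def H(s):
    """mu_lambda of the sublevel set {x in D : g(x) <= s}."""
    mass = integrate(lambda x: lam(x) if g(x) <= s else 0.0, 0.0, 1.0)
    return mass / lam_bar

def exponent_via_H(t):
    # H(s) = 0 for s < min g = 0, so the integral over (-inf, t] starts at 0.
    return lam_bar * integrate(H, 0.0, t) if t > 0 else 0.0

def exponent_direct(t):
    # Fubini form: integral over D of lam(x) * (t - g(x))_+ dx.
    return integrate(lambda x: lam(x) * max(t - g(x), 0.0), 0.0, 1.0)

t = 0.7
print(exponent_via_H(t), exponent_direct(t))
print("exp(-exponent) =", math.exp(-exponent_direct(t)))
```

The two evaluations agree up to quadrature error, so either form can be used when evaluating $\Phi$ and $\Psi$ in practice; the direct form avoids the nested integral.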

Our second main result establishes that $W_\lambda$ serves as an appropriate limiting process for a wide class of discrete processes, including the $f_n$ above.

###### Theorem 1.3.

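Suppose that $\{x_i\}_{i \in \mathbb{N}}$ is a sequence of points in $D$ for which the empirical measures

$$\frac{1}{n} \sum_{i=1}^{n} \delta_{x_i}$$

converge weakly, as $n \to \infty$, towards the measure $\rho(x)\,dx$, where $\rho$ is a continuous density function. Let $g$ be a continuous function and let $k \in \mathbb{N}$. Let $\xi_1, \xi_2, \dots$ be independent, non-negative random variables (not necessarily identically distributed) whose cumulative distribution functions $F_i$ satisfy

$$F_i(t) = t + O(t^2) \tag{1.4}$$

near zero (uniformly in $i$). Define the random function $f_n$ by

$$f_n(x) = \begin{cases} \dfrac{n}{\lambda(x_i)}\,\xi_i + g(x_i), & \text{if } x = x_i,\ i = 1, \dots, n, \\ +\infty, & \text{otherwise.} \end{cases}$$

Then, the joint distribution of the first $k$ argmins of $f_n$ converges, as $n \to \infty$, towards the joint distribution of the first $k$ argmins of $W_{\tilde{\lambda}} + g$, where

$$\tilde{\lambda}(x) = \lambda(x)\rho(x), \qquad x \in D.$$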

The proofs of Theorems 1.2 and 1.3 will be given in Section 3. An immediate corollary of these theorems is the following:

###### Corollary 1.4.

In the setting of Theorem 1.3, the joint distribution of the first $k$ argmins of $f_n$ converges, as $n \to \infty$, towards the distribution given by (1.2) with $\lambda$ replaced by $\tilde{\lambda}$.
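The insensitivity to the law of the $\xi_i$ can be checked by simulation. The sketch below is an illustration under our own assumptions: $D = (0,1)$, uniformly distributed points (so $\rho = 1$), $g = 0$, a hypothetical rate `lam`, and the random term read as $(n/\lambda(x_i))\,\xi_i$. For $g = 0$ the first argmin of the limit has density $\tilde{\lambda}/\int_D \tilde{\lambda}$ by the usual competing-exponentials argument (our remark), which gives an explicit target for the frequency with which the first argmin lands in $(0.5, 1)$.

```python
import random

lam = lambda x: 1.0 + x      # hypothetical rate; points are uniform, so rho = 1
g = lambda x: 0.0            # no latency, to keep the limit explicit

def first_argmin(n, draw_xi, rng):
    """Location of the smallest value of (n / lam(x_i)) * xi_i + g(x_i)."""
    best_t, best_x = float("inf"), None
    for _ in range(n):
        x = rng.random()
        t = n / lam(x) * draw_xi(rng) + g(x)
        if t < best_t:
            best_t, best_x = t, x
    return best_x

def right_half_frequency(draw_xi, trials=2000, n=100, seed=1):
    """Fraction of trials in which the first argmin falls in (0.5, 1)."""
    rng = random.Random(seed)
    hits = sum(first_argmin(n, draw_xi, rng) > 0.5 for _ in range(trials))
    return hits / trials

freq_exp = right_half_frequency(lambda rng: rng.expovariate(1.0))
freq_unif = right_half_frequency(lambda rng: rng.random())   # F(t) = t on [0, 1]
limit = (0.5 + 0.375) / 1.5   # integral of 1 + x over (0.5, 1), divided by 1.5
print(freq_exp, freq_unif, limit)
```

Uniform variables satisfy (1.4) exactly near zero, and both empirical frequencies should sit close to the same limiting value up to Monte Carlo error.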

###### Remark 1.5.

In most proofs below the random variables $\xi_i$ are assumed to be exponentially distributed with unit rate and the points $x_i$ to be uniformly distributed. This assumption can be made without loss of generality, as the rates of the $\xi_i$ and fluctuations in $\rho$ may be absorbed into $\lambda$. Moreover, the assumption that the $\xi_i$ are exponential may be relaxed in the limit as $n \to \infty$ (see e.g. Theorem 1.3). In addition, the Fisher-Tippett-Gnedenko theory provides a broader class of possible limiting distributions (analogous to $\alpha$-stable distributions in the classical central limit theory), which would not require such a specific behavior of the $F_i$ near zero. We believe that it is of interest to extend our results to that broader family, but such an extension is beyond the scope of this work.

### 1.2 Related work

The theory of extreme values and extremal processes is a classical branch of probability and statistics. While we do not attempt to give an exhaustive account of the field, we will highlight some issues that are relevant to our work. We will also draw parallels with other limit theorems and stochastic processes.

One of the first questions in probability theory is to understand the limits of combinations of independent random variables. In the context of the central limit theorem, one considers

$$\lim_{n \to \infty} n^{-1/2} \Big( \sum_{i=1}^{n} X_i - n\,\mathbb{E}(X_i) \Big).$$

The central limit theorem states that, for i.i.d. $X_i$ with finite variance, the limit is a normally-distributed random variable. In the context of extreme values, given a sequence $X_i$ of i.i.d. variables one similarly considers

$$\lim_{n \to \infty} a_n \Big( \min_{i=1,\dots,n} X_i - b_n \Big).$$

Here $a_n$ and $b_n$ are normalizing constants, analogous to $n^{-1/2}$ and $n\,\mathbb{E}(X_i)$ in the central limit theorem case. The Fisher-Tippett-Gnedenko Theorem [Gne43] completely specifies possible limits for this process. A detailed description of this theory can be found in [LLR83] or [Res13]. In this paper we focus on the case where $X_i$ is positive, and has a density function which is positive at zero.
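The positive-density-at-zero case is easy to probe numerically. The sketch below is illustrative only: it assumes $X_i \sim \mathrm{Uniform}(0,1)$ (unit density at zero) and the normalization $a_n = n$, $b_n = 0$, and compares the empirical tail of $n \min_i X_i$ with the unit-rate exponential tail.

```python
import math
import random

def scaled_min(n, rng):
    """n times the minimum of n i.i.d. Uniform(0, 1) draws (a_n = n, b_n = 0)."""
    return n * min(rng.random() for _ in range(n))

rng = random.Random(0)
n, trials = 500, 4000
draws = [scaled_min(n, rng) for _ in range(trials)]

empirical_tail = sum(d > 1.0 for d in draws) / trials
print(f"P(M_n > 1) ~ {empirical_tail:.3f}, exp(-1) = {math.exp(-1):.3f}")
```

For uniform variables the exact tail is $(1 - 1/n)^n$, so the agreement with $e^{-1}$ improves as $n$ grows.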

Subsequently, various authors [Dwa64] [Lam64] studied the distribution of the “k-records”

$$M_n^k = a_n \Big( k\text{-}\min_{i=1,\dots,n} X_i - b_n \Big),$$

where $k$-$\min$ denotes the value of the $k$-th smallest element in the collection. Studying $k$-th mins is an important branch of extreme value theory that we pursue and extend here by characterizing the spatial distribution of the $k$-th minima.

Returning to the discussion of first mins, a natural object is the rescaled process

$$M_n(t) = a_n \Big( \min_{i < nt} X_i - b_n \Big)$$

that tracks mins over time. Taking the limit $n \to \infty$ yields so-called extremal processes [Res13]. The construction of such processes is completely analogous to the construction of Brownian motion by Lévy, where sums are replaced by mins.

In standard stochastic processes, a central concept is the family of stable stochastic processes, which are invariant under linear combinations. In the context of extreme values one instead considers the family of max-stable processes, which are stochastic processes that satisfy

$$\max_{i=1,\dots,r} M^{(i)}(t) \overset{d}{=} r\,M(t),$$

where the $M^{(i)}$ are independent copies of $M$ and $\overset{d}{=}$ denotes equality in distribution. A significant literature studies these processes by means of Poisson processes on the plane [Pic71], and gives a spectral representation of these processes [dH84]. Min-stable processes, defined analogously with minima in place of maxima, were studied in [dHP84]. Again, much of this literature focuses on the distribution of minima and not on the locations at which minima occur.

The previous objects can be easily generalized by evaluating the minimum value over some Borel set $A$,

$$M_n(A) = a_n \Big( \min_{i/n \in A} X_i - b_n \Big).$$

Note that letting $A = [0, t)$ recovers the previously-defined process $M_n(t)$. Again, one may then consider the limit

$$M(A) := \lim_{n \to \infty} M_n(A).$$

The set functions $M_n$, $M$ are called inf-measures. In the context of maximization the analog is known as a sup-measure.

In this paper we work at the level of lower semi-continuous functions, that is, we study directly the functions $f_n$ as opposed to the minimum value that they take. This seems more natural here, as it allows us to study simultaneously both minimizers and minima. The approach via lower semi-continuous functions can be shown to be equivalent to that of inf-measures. Namely, given an inf-measure $M$ we can define the inf-derivative of that measure as

$$d^{\wedge} M(x) := \sup_{G : x \in G} M(G),$$

where $G$ is allowed to vary across Borel sets. The supremum is needed here: an inf-measure decreases as the set grows, so an infimum over all $G$ containing $x$ would return only the global infimum. Here $d^{\wedge} M$ will be a lower semi-continuous function. Similarly, given a lower semi-continuous function $f$ we can define the inf-integral of the function via

$$f^{\wedge}(G) := \inf_{x \in G} f(x).$$

Again, here $G$ is any Borel set. It can be shown that $f^{\wedge}$ is then an inf-measure. Hence one can develop the theory either in terms of inf-measures or in terms of lower semi-continuous functions.

A mathematically-sophisticated development of sup-measures, as well as a detailed account of connections with probability, optimization and analysis literature, is given in the excellent article [Ver97]. Unfortunately, Vervaat passed away before his work was published, and hence his work is only published somewhat obscurely.

The papers [RR91] and [RR92] extend and apply many of the ideas in [Ver97] to construct general random upper semi-continuous functions. These works are closely related to ours in that they construct a version of the process $W_\lambda$. However, they are strongly connected to the framework of choice optimization.

Various authors have addressed spatial effects in different contexts. For example, [dHL01] studies the probability that the maximum of some sequence of random functions exceeds a deterministic function. This is similar to our framework in that they permit spatially-inhomogeneous shifts, but they do not study the distribution of the location of extremal events. Statistical estimation of spatial extremes in the context of max-stable processes has been used to study various geophysical processes [Smi90][DG12]. These works focus primarily on using spectral representations of max-stable processes to tackle challenging spatial statistical problems.

#### 1.2.1 Analogue with stochastic integrals

The discussion above suggests that for most of the basic developments of classical stochastic processes there is an analogue in terms of extreme values. In all cases the main difference is that taking sums of some underlying variables is replaced by taking the minima. In this light, one can see the theory of extreme values as a version of stochastic processes where the algebraic operator “$+$” is replaced with “$\min$”. Extending this analogy, a family of algebraic operations similar to the ring $(+, \times)$ may be obtained by considering the operations $(\min, +)$. The resulting algebraic structure is known as a tropical algebra. (Technically this is a semi-ring and not a proper ring, because the operator $\min$ is not invertible. However, this will not be important for our purposes.)
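To make the $(\min, +)$ analogy concrete, here is a small sketch of tropical arithmetic. The function names `t_add`, `t_mul` and `t_matmul` are our own, and the matrix is an arbitrary illustrative example rather than anything from the paper.

```python
INF = float("inf")   # identity element for tropical addition (min)

def t_add(a, b):
    """Tropical addition: the minimum."""
    return min(a, b)

def t_mul(a, b):
    """Tropical multiplication: ordinary addition (identity element 0)."""
    return a + b

def t_matmul(A, B):
    """Matrix product with (+, *) replaced by (min, +)."""
    rows, inner, cols = len(A), len(B), len(B[0])
    C = [[INF] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            for k in range(inner):
                C[i][j] = t_add(C[i][j], t_mul(A[i][k], B[k][j]))
    return C

# Edge weights of a small directed graph (INF = no edge); illustrative numbers.
A = [[0, 2, INF],
     [INF, 0, 3],
     [1, INF, 0]]
A2 = t_matmul(A, A)   # cheapest walks using at most two edges
print(A2)
```

Multiplication still distributes over addition, since $a + \min(b, c) = \min(a + b, a + c)$, but `t_add` has no inverse, which is why the structure is only a semi-ring.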

Our discussion of extremal processes above has not made use of “multiplication” of processes. However, this is critical for the construction of stochastic integrals. Recall that with classical Brownian motion one defines

$$\int_0^t H\,dB = \lim_{n \to \infty} \sum_{i=1}^{n} H_{t_{i-1}} \big( B_{t_i} - B_{t_{i-1}} \big),$$

where the $t_i$ describe partitions of $[0, t]$.

Since in studying extreme values we are replacing sums with mins and multiplication with addition, we thus ought to define our extremal stochastic integrals via

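$$\text{“}\int_0^t g\,dW_\lambda\text{”} = \lim_{n \to \infty} \min_{i \le n} \Big( g(t_{i-1}) + \min_{x \in [t_{i-1}, t_i]} W_\lambda(x) \Big).$$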

In this paper we study these processes and give explicit formulae for the distribution of their first mins and argmins.

### 1.3 Applications and extensions

There are several extensions and special cases of interest that can be derived from our main results. In this section we briefly describe a few promising examples.

#### 1.3.1 Bayesian estimation via k-argmins

Suppose that the server problem in the introduction is repeated many times. That is, one sequentially sends requests to servers, and observes which were the first $k$ servers to respond, as well as their response times. One can then rightfully ask: can we infer the functions $\lambda$ and $g$?

This question naturally fits a non-parametric Bayesian framework in which the functions $\lambda$ and $g$ are assumed to be unknown, and to be distributed according to a prior measure over functions. The explicit formulas developed in this paper would be nothing but the likelihood of the observations given the unknown functions $\lambda$, $g$. With these formulas at hand, a Markov chain Monte Carlo algorithm could be used to estimate expectations under the posterior distribution of unknowns given observations. In this way we could make predictions about where the location of the next fastest server will be and quantify how confident we are about our prediction.

#### 1.3.2 Dependence of rates and delays on density

Suppose the points $x_i$ are distributed according to the density $\rho$. There may be applications where the rate function $\lambda$ and the latency function $g$ depend on $x$ only through the density of points around $x$. In particular, we may consider a model of the form

$$\lambda(x) := \Lambda(\rho(x)), \qquad g(x) := G(\rho(x)),$$

where $\Lambda$ and $G$ are scalar functions. For example, we may imagine a parametric model of the form

$$\Lambda(t) = t^{\alpha}, \qquad G(t) = t^{\beta},$$

where $\alpha$ and $\beta$ are real numbers. Estimating the functions $\lambda$ and $g$ then reduces to learning the parameters $\alpha$ and $\beta$. Estimating these parameters may provide valuable qualitative, as well as quantitative, information about the structure of latency and rate patterns due to high or low concentration of servers.
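As a small illustration of the parametric model, the sketch below builds $\lambda$ and $g$ from a density. The helper name `make_model`, the density `rho` and the parameter values are hypothetical choices made only for this example.

```python
def make_model(rho, alpha, beta):
    """Build lambda(x) = rho(x)**alpha and g(x) = rho(x)**beta from a density rho."""
    lam = lambda x: rho(x) ** alpha
    g = lambda x: rho(x) ** beta
    return lam, g

# Hypothetical density on D = (0, 1): rho(x) = 0.5 + x integrates to one.
rho = lambda x: 0.5 + x
lam, g = make_model(rho, alpha=1.0, beta=-1.0)

for x in (0.1, 0.5, 0.9):
    print(f"x={x}: rho={rho(x):.2f}, lambda={lam(x):.2f}, g={g(x):.2f}")
```

With $\alpha > 0$ and $\beta < 0$, as chosen here, densely populated regions have both faster servers and lower latency; other sign patterns encode congestion effects instead.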

#### 1.3.3 Extension to weakly-correlated response times

The response times in Theorem 1.3 were assumed to be independent. Nevertheless, it is possible to extend our results to weakly-correlated processing times, allowing spatially correlated perturbations. Precisely, one could consider

$$f_n(x_i) = \frac{n}{\lambda(x_i)}\,\xi_i + g(x_i),$$

where $g$ is a random field independent from the $\xi_i$ for which, with probability one,

$$\|g\|_\infty < \infty,$$

and where the $\xi_i$ are independent. In this context, to obtain the asymptotic distribution of the first $k$ argmins of $f_n$ we may use the independence lemma to first obtain the asymptotic distribution conditional on $g$ using Corollary 1.4, and then integrate with respect to the distribution of $g$. The latency function $g$ could be modeled according to a Gaussian random field.

#### 1.3.4 Extensions to network structures

In the setting of server responses, it may be more accurate to consider a graph structure as opposed to a Euclidean one. Thus, consider a very large graph. Suppose that to each node in the graph we associate a server with its own task completion time. It is natural to ask whether one can learn the latency, rate, and weak correlation structure of processing times in such a large graph.

A possible practical approach to answer this question is to use an embedding of the set of nodes into Euclidean space. For example, we can consider a spectral map constructed using the graph Laplacian and then imagine that the servers are actually located at the “geographic” locations given by the images of the nodes under this map. One could then use the Bayesian inference methods described above to estimate the desired quantities. The idea of using the spectral map is that the images of two nodes are “close” when the nodes are close to each other in the graph; in this way we translate the non-geographic information contained in the graph into geographic information which fits the set-up considered in this paper. Specific applications of these ideas are to be explored in the future.

### 1.4 Outline

The remainder of the paper is organized as follows. In Section 2 we collect some background results and we establish some properties of the process $W_\lambda$. The main results are proved in Section 3. We conclude in Section 4 with a short illustrative example.

## 2 Preliminaries

This section contains background results that will be employed in the proof of our main theorems. Subsection 2.1 describes the function space $S(\bar{D})$ of lower semi-continuous functions endowed with a suitable topology. Subsection 2.2 establishes some properties and constructions of the process $W_\lambda$.

### 2.1 The space $S(\bar{D})$ and the topology of $\Gamma$-convergence

Our goal here is to introduce the topology of $\Gamma$-convergence on the space of lower semi-continuous functions. This topology generates a notion of convergence which preserves the structure of minima and minimizers. $\Gamma$-convergence is also known as epi-convergence in the probability literature [Ver97]. This topology has found application in many fields, e.g. materials science, Ginzburg-Landau theory, and image processing. We will follow the presentation in [DM93], Chapter 10.

For any metric space $X$ we define $S(X)$ to be the family of lower semi-continuous functions on $X$ taking values in $\overline{\mathbb{R}} = [-\infty, +\infty]$. The following definition is standard:

###### Definition 2.1.

Let $X$ be a metric space. A sequence $\{f_n\} \subset S(X)$ is said to $\Gamma$-converge to $f \in S(X)$ (written $f_n \xrightarrow{\Gamma} f$) if

1. For all $x \in X$ there exists a sequence $\{x_n\}$ satisfying $x_n \to x$ such that

 $$f(x) \ge \limsup_{n \to \infty} f_n(x_n). \tag{2.5}$$
2. For all $x \in X$ and all sequences $x_n \to x$ we have that

 $$f(x) \le \liminf_{n \to \infty} f_n(x_n). \tag{2.6}$$

In essence this definition requires that the limiting object takes values below any possible limit, but that the value of $f$ is obtained by some appropriately chosen “recovery sequence”. This notion of convergence imposes essentially the weakest conditions needed to guarantee convergence of minima and minimizers. This is manifest in the following proposition, which is of high relevance for our purposes:

###### Proposition 2.2.

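Suppose that $f_n \xrightarrow{\Gamma} f$ in some metric space $X$. Then $\limsup_{n \to \infty} \inf_X f_n \le \inf_X f$. If $X$ is also compact and $x_n$ is a minimizer of $f_n$ for each $n$, then any limit point of $\{x_n\}$ will be a minimizer of $f$.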
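A standard example helps to see why $\Gamma$-convergence, rather than pointwise convergence, is the right notion for minima. The functions $f_n(x) = x^2 + \sin(nx)$ on $[-1, 1]$ do not converge pointwise, yet they $\Gamma$-converge to $f(x) = x^2 - 1$ (a classical fact for oscillating sequences, not taken from this paper). The sketch below, which uses a plain grid search of our own choosing, illustrates that minima and minimizers nevertheless converge.

```python
import math

def grid_min(fn, a=-1.0, b=1.0, points=100_001):
    """Return (min value, argmin) of fn over a uniform grid on [a, b]."""
    best_v, best_x = math.inf, None
    for j in range(points):
        x = a + (b - a) * j / (points - 1)
        v = fn(x)
        if v < best_v:
            best_v, best_x = v, x
    return best_v, best_x

def f_n(n):
    """Oscillating perturbation of the parabola."""
    return lambda x: x * x + math.sin(n * x)

f_limit = lambda x: x * x - 1.0   # Gamma-limit of f_n

results = {n: grid_min(f_n(n)) for n in (1, 10, 100, 1000)}
for n, (v, x) in results.items():
    print(f"n={n:5d}: min f_n ~ {v:.5f} at x ~ {x:.5f}")
print("limit: min, argmin =", grid_min(f_limit))
```

As $n$ grows the minimum value approaches $-1$ and the minimizer approaches $0$, the unique minimizer of the $\Gamma$-limit.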

It is natural to seek a topology which describes -convergence. To this end, we define the following topologies:

###### Definition 2.3.
1. We let $\tau^{+}$ be the topology generated by sets of the form $\{f \in S(X) : \inf_{x \in U} f(x) < t\}$, where $U$ is any open subset of $X$ and $t \in \mathbb{R}$.

2. We let $\tau^{-}$ be the topology generated by sets of the form $\{f \in S(X) : \inf_{x \in K} f(x) > t\}$, where $K$ is any compact subset of $X$ and $t \in \mathbb{R}$.

3. We let $\tau$ be the smallest topology containing $\tau^{+}$ and $\tau^{-}$.

These topologies permit the measurement of minima on open and closed sets. It turns out that the topology $\tau$ is equivalent to $\Gamma$-convergence in the following sense (c.f. Theorem 10.17 in [DM93]):

###### Proposition 2.4.

Let $X$ be a metric space. A sequence $\{f_n\} \subset S(X)$ converges in $(S(X), \tau)$ if and only if it $\Gamma$-converges.

This space of functions is somewhat different from many of the standard function spaces. For example, it possesses the following compactness property (c.f. Theorem 10.6 in [DM93]):

###### Theorem 2.5.

The topological space $(S(X), \tau)$ is a compact space.

An immediate application, which is of importance to the present investigation, is the following:

###### Corollary 2.6.

Any sequence $\{f_n\} \subset S(X)$ has a subsequence which $\Gamma$-converges.

In general the topology generated by $\Gamma$-convergence will not be metrizable (or even Hausdorff). However, in the setting where $X$ is compact we have the following as a consequence of Corollary 10.23 in [DM93]; see also Theorem 5.5 in [Ver97]:

###### Proposition 2.7.

Let $X$ be a compact metric space. Then $(S(X), \tau)$ is metrizable.

Finally, in this paper we will work with $X = \bar{D}$, where $D$ is a bounded domain in $\mathbb{R}^d$. This allows us to characterize the topology in terms of cubes. The following is a direct application of Theorem 5.3 in [Ver97]:

###### Proposition 2.8.

For $S(\bar{D})$, where $D$ is an open bounded domain, the topologies $\tau^{\pm}$ are generated by evaluating the infimum of functions over $U$ being open cubes and $K$ being closed cubes (as opposed to arbitrary open and closed sets).

### 2.2 The process $W_\lambda$

In this subsection we establish some properties of the extreme value process $W_\lambda$ introduced in Definition 1.1. We first show the existence and uniqueness of $W_\lambda$, and then the existence and uniqueness of its $k$-th argmins. We will denote by $\mathcal{P}(S(\bar{D}))$ the space of probability measures over the space of lower semi-continuous functions on $\bar{D}$.

#### 2.2.1 Existence and uniqueness of the process $W_\lambda$

We first show that the distribution of $W_\lambda$ is a well-defined object.

###### Proposition 2.9.

Any two random variables with values in $S(\bar{D})$ satisfying Definition 1.1 have the same distribution.

###### Proof.

Suppose that $P_1, P_2$ are the distributions of two random variables satisfying Definition 1.1. Let

$$\mathcal{A} = \{E \subset S(\bar{D}) \mid E \in \sigma(\tau),\ P_1(E) = P_2(E)\},$$

where $\sigma(\tau)$ denotes the Borel $\sigma$-algebra generated by $\tau$. It is straightforward to verify that $\mathcal{A}$ is a $\lambda$-system.

Next, by Property 1 in Definition 1.1, we have that any set of the form $\{f : \min_{x \in K} f(x) > t\}$ is in $\mathcal{A}$ for any closed cube $K$ and real number $t$. Similarly, by taking a limit of cubes and again using Property 1 in Definition 1.1, we have that any set of the form $\{f : \inf_{x \in U} f(x) < t\}$ is in $\mathcal{A}$ for any open cube $U$ and real number $t$. By using Property 2 in Definition 1.1 we will have that any finite intersection of sets of these forms will also be in $\mathcal{A}$. Thus $\mathcal{A}$ contains the $\pi$-system generated by sets of these forms. By the $\pi$-$\lambda$ theorem, $\mathcal{A}$ contains the sigma algebra generated by sets of these forms. By Proposition 2.8 it follows that $\sigma(\tau) \subset \mathcal{A}$. This implies that $P_1 = P_2$, which concludes the proof. ∎

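Now we will demonstrate the existence of $W_\lambda$ by constructing an appropriate approximating sequence.

###### Proposition 2.10.

Let $\{x_i\}_{i \in \mathbb{N}}$ be a countable subset of $D$ and for every $n \in \mathbb{N}$ let

$$\mu_n := \frac{1}{n} \sum_{i=1}^{n} \delta_{x_i}.$$

We assume that $\mu_n$ converges weakly towards the distribution with density proportional to $\lambda$. Let $\xi_i$ be i.i.d. exponentially-distributed random variables with rate one. Then, the random functions

$$W_n(x) := \begin{cases} \dfrac{n}{\bar{\lambda}}\,\xi_i, & \text{if } x = x_i,\ i = 1, \dots, n, \\ +\infty, & \text{otherwise,} \end{cases}$$

converge weakly (in $\mathcal{P}(S(\bar{D}))$) towards $W_\lambda$, where in the formula for $W_n$ we are using

$$\bar{\lambda} := \int_D \lambda(x)\,dx.$$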
###### Proof.

Since the space $S(\bar{D})$ is compact, any sequence in $\mathcal{P}(S(\bar{D}))$ is tight. Therefore, using that $S(\bar{D})$ is a separable, complete, metrizable space, Prokhorov’s theorem implies that $\{W_n\}$ must have a limit point (in the sense of weak convergence of measures) in $\mathcal{P}(S(\bar{D}))$.

Let $W$ be some limit point of the $W_n$. Our goal is to show that $W$ satisfies the conditions given in the definition for $W_\lambda$. To that end, first consider a closed set $C$. From the definition of $W_n$, at every point $x_i$, $i \le n$, we have an independent exponential variable, with rate $\bar{\lambda}/n$. This then implies that $\min_{C \cap \bar{D}} W_n$ is exponentially distributed with rate $\bar{\lambda}\,\mu_n(C)$. As $n \to \infty$, we have that

$$\lim_{n \to \infty} \min_{C \cap \bar{D}} W_n$$

is exponentially distributed with rate $\lambda_C$. Since taking mins over closed sets is measurable in $\tau$, $W$ must satisfy the first point in the definition of $W_\lambda$.

For the second property, we notice that the mins over a finite number of disjoint closed sets will be independent under $W_n$, as the $\xi_i$ are all independent. These events will also be in the topology $\tau$. Hence by taking limits we obtain that $W$ will satisfy the second property in the definition of $W_\lambda$. The uniqueness of $W_\lambda$ given by Proposition 2.9 then gives that the unique limit point of $W_n$ in $\mathcal{P}(S(\bar{D}))$ is precisely $W_\lambda$, which completes the proof. ∎
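The first property can be checked empirically on the approximating functions. The sketch below is an illustration under our own assumptions: $D = (0,1)$, a hypothetical rate $\lambda(x) = 1 + x$, the set $C = [0, 0.5]$, values $W_n(x_i) = (n/\bar{\lambda})\,\xi_i$, and points placed at quantiles of the density $\lambda/\bar{\lambda}$ (a convenient stand-in for a sequence whose empirical measures converge to that density).

```python
import math
import random

lam_bar = 1.5      # integral of the hypothetical lam(x) = 1 + x over (0, 1)
lam_C = 0.625      # integral of lam over C = [0, 0.5]

def inv_cdf(u):
    """Inverse CDF of the density lam(x) / lam_bar on (0, 1)."""
    return -1.0 + math.sqrt(1.0 + 3.0 * u)

n = 400
points = [inv_cdf((i + 0.5) / n) for i in range(n)]   # quantile placement
in_C = [x for x in points if x <= 0.5]

def min_over_C(rng):
    """min over C of W_n, with W_n(x_i) = (n / lam_bar) * xi_i, xi_i ~ Exp(1)."""
    return min(n / lam_bar * rng.expovariate(1.0) for _ in in_C)

rng = random.Random(0)
trials = 2000
mean_min = sum(min_over_C(rng) for _ in range(trials)) / trials
print(f"mean of min over C ~ {mean_min:.3f}, 1 / lam_C = {1 / lam_C:.3f}")
```

The minimum over $C$ is exactly exponential with rate $\bar{\lambda}\,\mu_n(C)$, so its sample mean should be close to $1/\lambda_C$ once $\mu_n(C)$ is close to $\lambda_C/\bar{\lambda}$.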

The previous construction of $W_\lambda$ closely mirrors the framework described in the introduction: namely, we trace the rescaled minimum of many exponentially-distributed variables. An alternative construction (which can be connected to the Poisson process construction of extremal processes in [Pic71]) is also possible. This construction makes certain properties of $W_\lambda$ simpler to visualize, and we include it for completeness.

###### Proposition 2.11.

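Let $\{x_i\}_{i \in \mathbb{N}}$ be a sequence of points, i.i.d. according to the probability measure $\mu_\lambda$. Let $\{\xi_i\}_{i \in \mathbb{N}}$ be an i.i.d. sequence of random variables, exponentially distributed with rate $\bar{\lambda}$. Define the following random functions:

$$W_n(x) = \begin{cases} \displaystyle\sum_{j=1}^{i} \xi_j, & \text{for } x = x_i,\ i \le n, \\ +\infty, & \text{otherwise.} \end{cases}$$

Then the $W_n$ converge weakly (in $\mathcal{P}(S(\bar{D}))$) towards $W_\lambda$.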

###### Proof.

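As in the previous construction, we only need to prove that $W_n$ asymptotically satisfies Properties 1 and 2 from Definition 1.1. We will do this by direct computation.

For Property 1 in Definition 1.1, given any closed set $C$, the probability that $x_i$ lies in $C$ is equal to $\tilde{\lambda}_C := \mu_\lambda(C)$. Thus the first index $i$ such that $x_i \in C$ is geometrically distributed with parameter $\tilde{\lambda}_C$. Furthermore, $\sum_{j=1}^{i} \xi_j$ has an Erlang distribution with density $\bar{\lambda}^{i} s^{i-1} e^{-\bar{\lambda} s}/(i-1)!$. In turn

$$P\Big( \min_{x \in C \cap \bar{D}} W_n \ge r \Big) = \int_r^{\infty} \sum_{j=1}^{n} (1 - \tilde{\lambda}_C)^{j-1}\, \tilde{\lambda}_C\, \frac{\bar{\lambda}^{j} s^{j-1} e^{-\bar{\lambda} s}}{(j-1)!}\,ds + \sum_{j=n+1}^{\infty} (1 - \tilde{\lambda}_C)^{j-1}\, \tilde{\lambda}_C,$$

where on the right we are using the fact that the choice of points is independent of the values of the $\xi_j$. Taking a limit as $n \to \infty$ and simplifying the series, we find that

$$\lim_{n \to \infty} P\Big( \min_{x \in C \cap \bar{D}} W_n \ge r \Big) = \int_r^{\infty} \tilde{\lambda}_C \bar{\lambda} \exp\big( -\tilde{\lambda}_C \bar{\lambda} s \big)\,ds.$$

Since $\tilde{\lambda}_C \bar{\lambda} = \lambda_C$, this proves Property 1.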

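With regards to Property 2, for disjoint closed sets $C_1, C_2$ and $r < s$ we can compute (letting $p_n$ be the associated density function):

$$\begin{aligned}
p_n\Big( \min_{C_1} W_n = r, \min_{C_2} W_n = s \Big)
&= \sum_{k_1=1}^{n} (1 - \tilde{\lambda}_{C_1} - \tilde{\lambda}_{C_2})^{k_1 - 1}\, \tilde{\lambda}_{C_1}\, \frac{\bar{\lambda}^{k_1} r^{k_1 - 1} e^{-\bar{\lambda} r}}{(k_1 - 1)!} \\
&\quad \times \sum_{k_2 = k_1 + 1}^{n} (1 - \tilde{\lambda}_{C_2})^{k_2 - k_1 - 1}\, \tilde{\lambda}_{C_2}\, \frac{\bar{\lambda}^{k_2 - k_1} (s - r)^{k_2 - k_1 - 1} e^{-\bar{\lambda}(s - r)}}{(k_2 - k_1 - 1)!} \\
&= \sum_{k_1=1}^{n} (1 - \tilde{\lambda}_{C_1} - \tilde{\lambda}_{C_2})^{k_1 - 1}\, \tilde{\lambda}_{C_1}\, \frac{\bar{\lambda}^{k_1} r^{k_1 - 1} e^{-\bar{\lambda} r}}{(k_1 - 1)!} \\
&\quad \times \sum_{\tilde{k} = 1}^{n - k_1} (1 - \tilde{\lambda}_{C_2})^{\tilde{k} - 1}\, \tilde{\lambda}_{C_2}\, \frac{\bar{\lambda}^{\tilde{k}} (s - r)^{\tilde{k} - 1} e^{-\bar{\lambda}(s - r)}}{(\tilde{k} - 1)!}.
\end{aligned}$$

Taking $n \to \infty$ we then obtain

$$\begin{aligned}
\lim_{n \to \infty} p_n\Big( \min_{C_1} W_n = r, \min_{C_2} W_n = s \Big)
&= \tilde{\lambda}_{C_1} \bar{\lambda} \exp\big( -\bar{\lambda} r + \bar{\lambda} r (1 - \tilde{\lambda}_{C_1} - \tilde{\lambda}_{C_2}) \big) \times \tilde{\lambda}_{C_2} \bar{\lambda} \exp\big( -\bar{\lambda}(s - r) + (s - r) \bar{\lambda} (1 - \tilde{\lambda}_{C_2}) \big) \\
&= \tilde{\lambda}_{C_1} \bar{\lambda} \exp\big( -\bar{\lambda} \tilde{\lambda}_{C_1} r \big) \times \tilde{\lambda}_{C_2} \bar{\lambda} \exp\big( -\bar{\lambda} \tilde{\lambda}_{C_2} s \big).
\end{aligned}$$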

This proves Property 2 in the case of two sets. The case with more than two sets is completely analogous. ∎
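The sequential construction translates directly into a simulation. The sketch below is illustrative and rests on our own choices: $D = (0,1)$, a hypothetical rate $\lambda(x) = 1 + x$, $C = [0, 0.5]$, and a cap `max_points` standing in for $n \to \infty$. It draws points from $\mu_\lambda$, accumulates exponential increments with rate $\bar{\lambda}$, and records the arrival time of the first point falling in $C$.

```python
import math
import random

lam_bar = 1.5   # total rate for the hypothetical lam(x) = 1 + x on (0, 1)
lam_C = 0.625   # integral of lam over C = [0, 0.5]

def sample_point(rng):
    """Inverse-CDF draw from mu_lambda, i.e. density lam(x) / lam_bar."""
    u = rng.random()
    return -1.0 + math.sqrt(1.0 + 3.0 * u)

def first_arrival_in_C(rng, max_points=10_000):
    """min over C of W_n: the arrival time of the first point landing in C."""
    clock = 0.0
    for _ in range(max_points):
        clock += rng.expovariate(lam_bar)      # xi_j ~ Exp(rate lam_bar)
        if sample_point(rng) <= 0.5:           # x_j falls in C
            return clock
    return math.inf                            # no point of C among max_points

rng = random.Random(0)
trials = 4000
times = [first_arrival_in_C(rng) for _ in range(trials)]
mean_time = sum(times) / trials
tail = sum(t > 1.0 for t in times) / trials
print(f"mean ~ {mean_time:.3f} (target {1 / lam_C:.3f})")
print(f"P(min > 1) ~ {tail:.3f} (target {math.exp(-lam_C):.3f})")
```

Both the sample mean and the empirical tail should match an exponential law with rate $\lambda_C$, in line with Property 1.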

###### Remark 2.12.

The previous construction highlights the memorylessness of the process $W_\lambda$. That is, the distribution of the $(k+1)$-th min can be found by restarting the process after the arrival of the $k$-th min. This memorylessness will be very convenient in establishing integral formulas in Section 3. However, one cannot expect this property to hold for other extreme value distributions, see Remark 1.5.

###### Remark 2.13.

There are several classical constructions of Brownian motion. Some, such as Lévy’s piecewise linear construction, are easy to visualize but only converge as a measure on the space of continuous functions. Others, such as the “wavelet-type” construction of Lévy-Ciesielski, or the Fourier construction of Wiener, converge uniformly towards Brownian motion (see e.g. Chapter 3 in [SP14] for more details). Here we only characterize the convergence of the distribution of $W_n$ towards $W_\lambda$ as measures on the space of lower semi-continuous functions. One could seek to demonstrate that certain alternative constructions converge uniformly, in an appropriate metric on the space of lower semi-continuous functions as in Proposition 2.7 (see also [DM93], Proposition 10.21). This will be the subject of future analysis.

#### 2.2.2 Existence and uniqueness of $k$-th argmins of $W_\lambda$

Next we turn to studying the existence of $k$-th argmins for the process $W_\lambda$. We begin with a definition:

###### Definition 2.14.

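Let $f \in S(\bar{D})$. We define the minimum value of $f$,

$$m_1(f) := \min_{x \in \bar{D}} f(x),$$

as well as the set of minimizers of $f$,

$$M_1(f) := \{x \in \bar{D} : f(x) = m_1(f)\}.$$

Moreover, having defined the numbers $m_1(f), \dots, m_{k-1}(f)$ and the sets $M_1(f), \dots, M_{k-1}(f)$ we define

$$m_k(f) := \inf_{x \in \bar{D} \setminus \bigcup_{i=1}^{k-1} M_i(f)} f(x) \tag{2.7}$$

and

$$M_k(f) := \{x \in \bar{D} : m_{k-1}(f) < f(x) = m_k(f)\}.$$

We refer to the elements of $M_k(f)$ as $k$-th argmins of $f$.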
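For a function that is finite only at finitely many points, as is the case for the $f_n$ of the introduction, the successive values $m_k$ and sets $M_k$ are simply the sorted distinct values and their level sets. The sketch below illustrates this reading of the definition; the helper name `argmin_levels` and the data are hypothetical.

```python
def argmin_levels(values, k):
    """First k pairs (m_j, M_j) for f given on finitely many points.

    `values` maps a location x to f(x); f is +infinity everywhere else.
    """
    levels = sorted(set(values.values()))     # distinct finite values, increasing
    return [(m, {x for x, v in values.items() if v == m}) for m in levels[:k]]

# Illustrative data: two locations tie for the smallest value.
f = {0.1: 2.0, 0.4: 0.5, 0.7: 1.3, 0.9: 0.5}
levels = argmin_levels(f, 3)
for j, (m, M) in enumerate(levels, start=1):
    print(f"m_{j} = {m}, M_{j} = {sorted(M)}")
```

In this example $M_1$ contains two points because of the tie; when the values are drawn from continuous distributions, ties occur with probability zero and each $M_k$ is a single point.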

###### Remark 2.15.

The set $M_1(f)$ is always non-empty due to the compactness of $\bar{D}$ and the lower semi-continuity of $f$. The lower semi-continuity of $f$ also implies that for every $k$, the set

$$\bigcup_{i=1}^{k} M_i(f)$$

is a closed set.

###### Remark 2.16.

In general the sets $M_k(f)$ for $k \ge 2$ may be empty. For example, it should be clear that a continuous function does not have $k$-th argmins for $k \ge 2$. In this paper, however, the random object $W_\lambda + g$ that we consider is almost surely a lower semi-continuous function with well-defined (and unique) $k$-argmins for all $k$ (see Proposition 2.19 below).

###### Remark 2.17.

It is an easy exercise to show that if $M_{k+1}(f)$ is a non-empty set then $M_k(f)$ is also non-empty.

The following lemma will be critical in proving the existence of $k$-argmins of $W_\lambda + g$.

###### Lemma 2.18.

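Let $k \in \mathbb{N}$, let $f_n, f \in S(\bar{D})$ and suppose that:

1. $f_n \xrightarrow{\Gamma} f$.

2. The set $M_i(f)$ is a singleton for all $i = 1, \dots, k$.

3. There exists a number $\beta > 0$ such that

 $$d(M_i(f_n), M_j(f_n)) > \beta, \qquad \forall n \in \mathbb{N},\ \forall i, j = 1, \dots, k+1,\ i \ne j. \tag{2.8}$$

In the above, $d(M_i(f_n), M_j(f_n))$ is defined as

 $$d(M_i(f_n), M_j(f_n)) := \inf\{|x - y| : x \in M_i(f_n),\ y \in M_j(f_n)\}.$$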
4. For sufficiently large , for all .

Then all limit points of $\{x_n^{k+1}\}$, where $x_n^{k+1} \in M_{k+1}(f_n)$, belong to $M_{k+1}(f)$. In particular, $M_{k+1}(f)$ is non-empty.

###### Proof.

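Step 1: We start by proving the case $k = 1$ in order to illustrate the ideas.

Let $x^1$ be the unique minimizer of $f$. By compactness of $\bar{D}$ we know that $x_n^2$ converges up to subsequence towards a point $\tilde{x}^2$; without loss of generality we assume that the full sequence converges towards $\tilde{x}^2$. We need to show that $\tilde{x}^2 \in M_2(f)$.

First, we claim that $\tilde{x}^2 \ne x^1$, and so $\tilde{x}^2 \notin M_1(f)$. To see this, let $x_n^1$ be a minimizer for $f_n$. By the $\Gamma$-convergence assumption and the fact that $x^1$ is the unique minimizer of $f$, it follows that $x_n^1 \to x^1$. In particular, for large enough $n$ we have

$$|x_n^1 - x^1| < \beta/2,$$

where $\beta$ is as in (2.8). From the triangle inequality it follows that for large enough $n$

$$|x_n^2 - x^1| \ge |x_n^1 - x_n^2| - |x_n^1 - x^1| \ge \beta - \beta/2 = \beta/2.$$

Therefore,

$$|\tilde{x}^2 - x^1| \ge \beta/2 > 0,$$

establishing the claim.

Next, we show that for arbitrary $x \in \bar{D} \setminus \{x^1\}$ we have that $f(\tilde{x}^2) \le f(x)$. This implies that $f(\tilde{x}^2) = m_2(f)$, which together with the above shows that $\tilde{x}^2 \in M_2(f)$. Using the lim-sup inequality we can find a sequence $\{x_n\}$ converging towards $x$ for which

$$\limsup_{n \to \infty} f_n(x_n) \le f(x).$$

Notice that for all large enough $n$, $x_n \notin M_1(f_n)$. Indeed, if that was not the case, the limit of $x_n$ would be $x^1$, which contradicts that $x \ne x^1$. In particular, the value of $f_n(x_n^2) \le f_n(x_n)$. Hence,

$$f(\tilde{x}^2) \le \liminf_{n \to \infty} f_n(x_n^2) \le \limsup_{n \to \infty} f_n(x_n) \le f(x),$$

where the first inequality follows from the liminf inequality given that $x_n^2 \to \tilde{x}^2$.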

Step 2: Generalizing to arbitrary follows the same ideas.

Let, for , be the unique -th argmin of and let, for We assume without loss of generality that for some . Our goal is to prove that .

Restricted to the set we have that for sufficiently large minimizes . -convergence then implies that . By the definition of we have that, for , and hence . Note that , simply by -convergence (on the whole space ) and the fact that is a singleton.

Now, we claim that for all we have that . We proceed by induction. The base case was already proved. Suppose that for all . If then we must have (by the fact that is a singleton) that . By -convergence there exists a sequence so that . This then implies that for sufficiently large. This violates the fact that , which then proves the claim.

Finally, let be arbitrary. We will show that which implies that By -convergence, there exists a sequence with . Since we have that, for