 # Template Matching with Ranks

We consider the problem of matching a template to a noisy signal. Motivated by some recent proposals in the signal processing literature, we suggest a rank-based method and study its asymptotic properties using some well-established techniques in empirical process theory combined with Hájek's projection method. The resulting estimator of the shift is shown to achieve a parametric rate of convergence and to be asymptotically normal. Some numerical simulations corroborate these findings.

## Authors

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Template matching is the process of matching a clean and noiseless template to an observed, typically noisy signal. This topic is closely related to the problem of matching two or more noisy signals, sometimes referred to as ‘aligning’ or ‘registering’ the signals, and to methodology in spatial statistics falling under the umbrella name of ‘scan statistic’. When the template has a point of discontinuity, matching a template to the signal can be interpreted as detecting the location of the discontinuity, a more specialized task more broadly referred to as ‘change-point detection’ in statistics. For pointers to the vast literature on these related topics, we refer the reader to the introduction of another recent paper of ours on the topic (Arias-Castro and Zheng, 2020).

While in our previous work we proposed and analyzed M-estimators for template matching, in the present paper we consider R-estimators instead. The former includes what is perhaps the most widely used method which consists in maximizing the Pearson correlation of the signal with a shift of the template; see (2) below. We focus here on the rank variant of this approach, which is an example of the latter category and consists, instead, in replacing the signal with the corresponding ranks before maximizing the correlation over shifts of the template; see (4).

Rank-based methods are, of course, classical in statistics (Šidák et al., 1999; Hettmansperger and McKean, 2010; Lehmann, 2006; Gibbons and Chakraborti, 2011). The theory of rank tests is particularly well-developed, featuring some prominent methods such as the Wilcoxon/Mann–Whitney and Kruskal–Wallis two- and multi-sample tests, and essentially all (other) distribution-free tests for goodness-of-fit such as the Kolmogorov–Smirnov test, and more closely related to our topic here, the Spearman and Kendall rank correlation tests for independence. The theory of rank estimators, sometimes called R-estimators, is also well developed, although perhaps not as well-known (Hettmansperger and McKean, 2010)

. The most famous example may be the Hodges–Lehmann estimator, which is derived from the Wilcoxon signed-rank test. In multiple linear regression, the asymptotic linearity and resulting normality of certain R-estimators is established in a number of publications

(Jurečková, 1971; Heiler and Willers, 1988; Koul and Mukherjee, 1993; Giraitis et al., 1996; Draper, 1988). Closer to our setting, some papers consider the use of ranks for the detection and/or localization of one or multiple change-points (Darkhovskh, 1976; Gombay and Hušková, 1998; Hušková, 1997; Lung-Yut-Fong et al., 2015; Gerstenberger, 2018; Wang et al., 2020; Arias-Castro et al., 2018).

In the signal and image processing literature per se, where a lot of the work on template matching resides, rank-based methods have been attracting some attention in recent years. Kordelas and Daras (2009) propose a rank variant of the well-known feature extractor SIFT, while Xiong et al. (2020) propose a rank-based local self-similarity feature descriptor for use in synthetic-aperture radar (SAR) imaging. Ayinde and Yang (2002)

present a face recognition approach using the rank correlation of Gabor filtered images, while

Galea and Farrugia (2016) apply the Spearman rank correlation for template matching in face photo-sketch recognition. Kong et al. (2008) construct a matched-filter object detection algorithm based on the Spearman rank correlation to detect Ca sparks in biochemical applications. A number of papers use ranks to align images, a task also known as ‘stereo matching’ (Banks and Bennamoun, 2001; Banks et al., 1999; Chen et al., 2012; Geng and Gou, 2012) — a problem we will not address here but which also has a sizable literature in statistics; we provide some pointers to the literature in our recent paper (Arias-Castro and Zheng, 2020).

### 1.1 Model and methods

We consider a standard model for template matching, where we observe a shift of the template with additive noise,

 Yi=f(xi−θ∗)+Zi,i=1,…,n, (1)

where denote the design points, is a known 1-periodic function referred to as the template, and is the unknown shift of interest. The noise or measurement error variables are assumed iid with density and distribution function denoted .

###### Assumption 1.

We assume everywhere that is 1-periodic, and in fact exactly so in the sense that on a set of positive measure whenever . We also assume that is Lipschitz continuous.

###### Assumption 2.

We assume everywhere that is even, so that the noise is symmetric about 0.

Our goal is, in signal processing terminology, to match the template to the signal , which in statistical terms consists in the estimation of the shift . One of the most popular ways to do so is via maximization of the Pearson correlation, leading to the estimator

 ^θ:=argmaxθn∑i=1Yif(xi−θ). (2)

In the present context, this is equivalent to the least squares estimator in that the same estimator is also the solution to the following least squares problem

 ^θ=argminθn∑i=1(Yi−f(xi−θ))2. (3)

Note that this corresponds to the maximum likelihood estimator when the noise distribution is Gaussian. Of course, other loss functions can be used, some of them leading to methods that are robust to noise distributions with heavy tails or to the presence of gross errors (outliers) in the signal. Our recent paper

(Arias-Castro and Zheng, 2020) studies these so-called M-estimators in great detail.

In the present paper we consider, instead, estimators based on ranks. These go by the name of R-estimators in the statistics literature. The most direct route to such an estimator is to replace the response values with their ranks, yielding

 ^θrank:=argmaxθn∑i=1Rif(xi−θ), (4)

where denotes the rank of in in increasing order. Doing so is sometimes called the ‘rank transformation’ in the signal processing literature. Note that this is similar in spirit to replacing the Pearson correlation in (2) with the Spearman rank correlation, except that we do not rank the values of the template itself. We find that there is no real reason to want to replace the template values with the corresponding ranks as the template is assumed to be free of noise.

### 1.2 Content

The rest of the paper is devoted to studying the asymptotic () properties of the estimator we propose in (4), which we will refer to as the R-estimator. In Section 2, we derive some basic results describing the asymptotic behavior of this estimator. In more detail, in Section 2.1 we discuss the consistency of this estimator, which we are not fully able to establish but is clearly supported by computer simulations; in Section 2.2 we derive a rate of convergence for our R-estimator, which happens to be parametric and also minimax optimal; and in Section 2.3 we derive a normal limit distribution of the same estimator, as well as its asymptotic relative efficiency with respect to maximum likelihood estimator under Gaussian noise. Section 3 summarizes the result of some numerical experiments we performed to probe our asymptotic theory in finite samples. Section 4 provides a discussion of possible extensions. The mathematical proofs are gathered in Section 5.

## 2 Theoretical properties

In this section we study the asymptotic properties of the estimator defined in (4). We first study its consistency in Section 2.1; bound its rate of convergence in Section 2.2; and derive its asymptotic (normal) limit distribution in Section 2.3.

### 2.1 Consistency

The estimator in (4) can be equivalently defined via

 ^θn:=argmaxθˆMn(θ), ˆMn(θ):=1nn∑i=1Rinf(xi−θ), (5)

where we have added the subscript to emphasize that the estimator is being computed on a sample of size . To understand the large-sample behavior of this estimator, we need to understand that of as a function of , and this leads directly to empirical process theory.

###### Lemma 2.1.

In probability,

 supθ∈R∣∣ˆMn(θ)−M(θ)∣∣n→∞−−−→0,

where

 M(θ):=∫10∫10Φ2(f(x0)−f(x))f(x0+θ∗−θ)dxdx0, (6)

with

With the uniform convergence of to established in Lemma 2.1, and with the fact that is continuous and 1-periodic, it suffices that be the unique maximizer of for the estimator defined in (5) to converge to in probability, that is, to be consistent (Van der Vaart, 1998, Th 5.7). We are able, under an additional mild assumption on the noise distribution, to prove that is a local maximizer of .

###### Proposition 2.2.

is twice differentiable, with , and if is positive everywhere, , in which case is a local maximizer of .

Unfortunately, we are not able to verify that is indeed a global maximizer. However, since it is borne out by our numerical experiments (Section 3), we conjecture this is true, possibly under some additional (reasonable) conditions on the model. In the meantime, we formulate this as an assumption, which henceforth forms part of our basic assumptions.

###### Assumption 3.

is the unique maximum point of defined in (6).

###### Theorem 2.3.

Under the basic assumptions, converges in probability to as .

### 2.2 Rate of convergence and minimaxity

Besides consistency, we derive the estimator’s rate of convergence in this section. The rate turns out to be parametric, i.e., the convergence of the estimator to the true value of the parameter is in .

###### Theorem 2.4.

Under the basic assumptions, is -consistent.

The parametric rate of happens to be minimax optimal in the present setting. This is established in (Arias-Castro and Zheng, 2020, Cor 3.9)

in the context of a random design corresponding to the situation where the design points, instead of being the grid points spanning the unit interval, are generated as an iid sample from the uniform distribution on the unit interval. But similar arguments carry over. We omit details and refer the reader to the discussion in

(Arias-Castro and Zheng, 2020, Sec 6.1).

###### Corollary 2.5.

The R-estimator achieves the minimax rate of convergence.

### 2.3 Limit distribution and asymptotic relative efficiency

In addition to obtaining a rate of convergence, we are also able to derive the asymptotic distribution, which happens to be normal.

###### Theorem 2.6.

Under the basic assumptions, strengthened with the assumption that is continuously differentiable,

converges weakly to the centered normal distribution with variance

, where

 γ2:=∫10∫∞−∞[∫10(f′(x)−f′(x0))Ξ(f(x)−f(x0),z)dx]2ϕ(z)dzdx0,

with , and

 M′′(θ∗)=−∫10∫10f′(x)2ϕ2(f(x)−f(x0))dxdx0,

with

Now that the R-estimator is known to be asymptotically normal with an explicit expression for the asymptotic variance (after standardization), we can consider its (Pitman) efficiency relative to the more popular estimator based on maximizing the Pearson correlation (2) (which coincides with the MLE when the noise is Gaussian). Indeed, this estimator (denoted now) was studied in (Arias-Castro and Zheng, 2020, Sec 3) in the setting of a random design. Adapting the arguments there, which are very similar to (and in fact simpler than) those used here, we find that is asymptotically normal with mean zero and variance , where is the noise variance. With Theorem 2.6, we are thus able to conclude the following.

###### Corollary 2.7.

Suppose that the noise has finite variance . Then the asymptotic efficiency of the R-estimator (4) relative to the standard estimator (2) is given by

 σ2/∫10f′(x)2dxγ2/M′′(θ∗)2.

The result, in fact, continues to hold even when the noise has infinite variance, and in that case the asymptotic relative efficiency of the R-estimator relative to the more common estimator is infinite.

## 3 Numerical experiments

We performed some simple numerical experiments to probe the asymptotic theory developed in the previous sections. We note that the implementation of the R-estimator (4) is completely straightforward, as it only requires replacing the observations with their ranks before the usual template matching by maximization of the correlation over shifts as in (2

), which is typically implemented by a fast Fourier transform. This ease of computation is in contrast with rank methods for, say, linear regression which are computationally demanding and require dedicated algorithms and implementations — difficulties that may explain the very limited adoption of such methods in practice.

In our experiments, we considered three noise distributions: Gaussian distribution, Student t-distribution with 3 degrees of freedom, and Cauchy distribution. We took

throughout, which is really without loss of generality since the two methods we compared — the standard method based on maximizing the Pearson correlation and the rank-based method that we study — are translation equivariant. We chose to work with the following three filters:

 Template A: f(x)=⎧⎨⎩4x−10.25≤x<0.5,3−4x0.5≤x<0.75,0otherwise, (7)
 Template B: f(x)=⎧⎪ ⎪ ⎪ ⎪ ⎪⎨⎪ ⎪ ⎪ ⎪ ⎪⎩10x−20.2≤x<0.3,4−10x0.3≤x<0.4,10x−60.6≤x<0.7,8−10x0.7≤x<0.8,0otherwise (8)

and

 Template C: f(x)=max{0,(1−(4x−2)2)3}. (9)

All are Lipschitz, with Template being even smoother. See Figure 1 for an illustration.

We set the sample size at . Each setting, defined by a choice of filter and of noise distribution, was repeated times. Box plots of estimation error are presented in Figures 2, 3 and 4, while the distribution of is depicted in Figures 5, 6 and 7 via histograms. The results are congruent with what the theory predicts: The rank-based method is slightly inferior to the standard method when the noise is Gaussian (exactly when the standard method coincides with the MLE), while it is superior when the noise distribution has heavier tails; and the histograms overlay nicely with the predicted asymptotic distribution. The asymptotic relative efficiency of R-estimator relative to the method based on maximizing the Pearson correlation is displayed in Table 1.

## 4 Discussion

Our main goal in this paper was to show that a standard rank-based approach to template matching is viable and amenable to study using well-established techniques in mathematical statistics — some basic results in empirical process theory and the projection method of Hájek. This provides some theoretical foundation for related approaches proposed in recent years in the signal processing literature. We chose to keep the exposition contained and, in particular, have focused on the ‘smooth setting’ of a Lipschitz template. We leave the equally important case of a discontinuous template for future work (likely by others). Based on our previous work (Arias-Castro and Zheng, 2020) — where we did study this case in detail — and on our work here — in particular Hájek’s projection technique used in the proof of Theorem 2.6 — we do believe that the study of this case is within the reach of similarly standard tools.

Some of the other extensions discussed in our previous work are also relevant here. In particular, while we focused on shifts in the context of 1D signals, other settings are possible, including shifts in 2D or 3D signals (i.e., images), as well as other transformations. We omit details and simply affirm that such extensions are also amenable to a similar mathematical analysis.

In signal processing, rank-based methods seem more prominently represented in the literature on signal registration. We are confident that the study of such methods is well within the range of established techniques in mathematical statistics. However, the situation becomes substantially more complex as 1) the setting is semi-parametric, and 2) some smoothing seems to be required to achieve good performance — as transpires from the statistics literature on the topic as mentioned in our previous work (Arias-Castro and Zheng, 2020). We leave a further exploration of rank-based methods for registration to future endeavors.

## 5 Proofs

### 5.1 Preliminaries

The following is an extension of the celebrated Glivenko–Cantelli theorem.

###### Lemma 5.1.

Suppose that

are independent random variables with uniformly tight and equicontinuous distributions

. Define

 ^F[n](y):=1nn∑i=1I{Yi,n≤y}, F[n]:=E[^F[n]]=1nn∑i=1Fi,n.

Then the following uniform convergence holds in probability

 supt|^F[n](t)−F[n](t)|n→∞−−−→0.
###### Proof.

The proof arguments are very close to those supporting the more classical situation in which the random variables have the same distribution (Van der Vaart, 1998, Th 19.1). We provide a proof for completeness only.

Our assumptions mean 1) that for each there is such that and ; and 2) that is continuous at 0; we then speak of as the modulus of continuity of the family . Fix and let be defined as above. Also, let be such that and set be such that for all . We then have, regardless of , that and ; and that is a modulus of continuity for , meaning that .

• For , we have

 |^F[n](t)−F[n](t)|≤^F[n](t)+F[n](t)≤^F[n](t1)+ε.
• For , we have

 |^F[n](t)−F[n](t)|≤(1−^F[n](t))+(1−F[n](t))≤(1−^F[n](tm))+ε.
• For , if is such that ,

 ^F[n](t)−F[n](t)≤^F[n](tj+1)−F[n](tj)≤^F[n](tj+1)−F[n](tj+1)+ε,

and

 ^F[n](t)−F[n](t)≥^F[n](tj)−F[n](tj+1)≤^F[n](tj)−F[n](tj)−ε.

By Chebyshev’s inequality, in probability as for every fixed . In particular, in probability, and under the event that this maximum is bounded by , we have . Since is arbitrary, the proof is complete. ∎

The following is a simple result on functions defined as the linear combination of uniformly equicontinuous functions with random coefficients.

###### Lemma 5.2.

Suppose that are independent such that and for all and all ; and that are uniformly bounded and uniformly equicontinuous functions either defined on a compact interval. Then, in probability,

 ∣∣1nn∑i=1Bi,nfi,n∣∣∞n→∞−−−→0.
###### Proof.

The arguments are quite similar to those supporting Lemma 5.1. Define

 Sn:=1nn∑i=1Bi,nfi,n.

Suppose without loss of generality that the functions are defined on the unit interval. Because they are uniformly equicontinuous, for any given , there is such that for all such that . With fixed, and as such, let be such that for all . Then for any , if is such that ,

 |Sn(t)−Sn(tj)|≤1nn∑i=1|Bi,n||fi,n(t)−fi,n(tj)|≤Ksupnsupi|fi,n(t)−fi,n(tj)|≤Kε.

In particular,

 |Sn|∞≤maxj|Sn(tj)|+Kε.

Furthermore, by Chebyshev’s inequality, in probability,

 maxj|Sn(tj)|n→∞−−−→0.

Hence, in probability,

 limsupn|Sn|∞≤Kε.

Since was chosen arbitrary, the proof is complete. ∎

The following is a well-known error bound for Riemann sums.

###### Lemma 5.3.

Assume that is continuous with modulus of continuity . Then

 ∣∣ ∣∣∫baf(u)du−1mm∑j=1f(a+j(b−a)/m)∣∣ ∣∣ ≤(b−a)ω((b−a)/m) ≤(b−a)2m|f′|∞if f % is Lipschitz.
###### Proof.

This well-known result is a simple consequence of partitioning into sub-intervals of length . ∎

The next two lemmas are refinements of Lemma 5.1 and Lemma 5.2. They clearly subsume them, but they are also much deeper, and we only provide proof sketches, relying on arguments borrowed from (Van der Vaart, 1998).

###### Lemma 5.4.

In the context of Lemma 5.1, for some constant ,

 E[supt|^F[n](t)−F[n](t)|]≤C/√n.
###### Proof.

The result is classical when the variables are not only independent, but also identically distributed, say for all and all , and is a special case of so-called entropy bounds on the supremum of an empirical processes of the form

 Sn:=supg∈G1√nn∑i=1(g(Yi,n)−E[g(Yi,n)]). (10)

For example, assuming that the class is uniformly bounded, Cor 19.35 in (Van der Vaart, 1998) gives

 E[Sn]≤C0J(G,F),

where is a constant and

 J(G,F):=∫∞0√logN(ε,G,L2(F))dε,

denoting the -bracketing number of the class with respect to the metric.111Two functions such that pointwise define a bracket made of all functions such that . It is said to be an -bracket with respect to , for a positive measure , if . Given a class of functions , its -bracketing number with respect to is the minimum number of -brackets needed to cover .

The proof of that result takes several pages, but a close examination reveals that the ‘identically distributed’ property is not used in an essential way. Indeed, the assumption that the variables are iid is only used when applying Bernstein’s concentration inequality (Lem 19.32 there), and it is well-known that the result applies in a generalized form to variables that are only independent, say . Everything follows from that, essentially verbatim, and yields

 E[Sn]≤C0J(G,F[n]).

It turns out that, for a given distribution function , can be bounded based on the modulus of continuity of (see Ex 19.6 in the same reference). And if is the modulus of continuity of , then it is also a modulus of continuity for , and with this we can bound independently of just based on .

When dealing with the empirical distribution function, which is our focus here, the class is taken to be . For that class, for any distribution function, and this implies via the arguments above that , concluding the proof. ∎

###### Lemma 5.5.

Suppose that are independent random variables that are centered and bounded in absolute value by . And let be -Lipschitz functions on with for all . Then there is a some constant such that, for any ,

 E[sup|t|≤t0∣∣1nn∑i=1Bifi(t)∣∣]≤Ct0√n.
###### Proof.

As in the proof of Lemma 5.4, we rely on entropy bounds. Here we use Dudley’s entropy bound as presented in (Giné and Nickl, 2016, Th 2.3.6). For a given , let , we have

 S(t)−S(s)=n∑i=1Bin(fi(t)−fi(s)), (11)

with the variables being independent, centered, and bounded in absolute value by . In (Giné and Nickl, 2016), by Eq (3.8) and based on Def 2.3.5, the process is sub-Gaussian on with respect to the metric . Because , Th 2.3.6 there gives that

 E[sup|t|≤t0|S(t)|]≤4√2∫D/20√log(2N(ε))dε,

where and are the diameter and -covering number of with respect to . Immediately, , and . With a simple change of variable in the integral, this gives us

 E[sup|t|≤t0|S(t)|]≤C1KLt0/√n,

for a universal constant . ∎

### 5.2 Proof of Lemma 2.1

We assume without loss of generality that . We have

 Ri=n∑j=1I{Yj≤Yi}=n^Ψn(Yi),where ^Ψn(y):=1nn∑j=1I{Yj≤y}.

Note that is the empirical distribution function of . Although these are not iid, they are independent, and the Glivenko–Cantelli theorem applies to give that, in probability,

 supy∈R|^Ψn(y)−E[^Ψn(y)]|n→∞−−−→0.

See Lemma 5.1 for details. Further, we have

 E[^Ψn(y)] =1nn∑j=1P(Yj≤y) =1nn∑j=1P(Zj≤y−f(xj)) =1nn∑j=1Φ(y−f(j/n)) n→∞−−−→Ψ(y):=∫10Φ(y−f(x))dx,

where the convergence is by definition of the Riemann integral defining the limit. The convergence is in fact uniform in . This comes from an application of Lemma 5.3 using with the fact that has derivative , which has supremum norm bounded by (independent of ). Hence, a simple application of the triangle inequality gives that, in probability,

 A1,n:=supy∈R|^Ψn(y)−Ψ(y)|n→∞−−−→0. (12)

This is useful to us because , which then triggers

 ˆMn(θ)=1nn∑i=1Ψ(Yi)f(xi−θ)±A2,n,

with . We may thus focus on the first term on the right-hand side. We have

 1nn∑i=1Ψ(Yi)f(xi−θ)=1nn∑i=1E[Ψ(Yi)]f(xi−θ)+Qn(θ),

with

 Qn(θ):=1nn∑i=1(Ψ(Yi)−E[Ψ(Yi)])f(xi−θ).

On the one hand, by a standard argument consisting in discretizing the values of and using the uniform continuity of , we obtain

 A3,n:=supθ∈R|Qn(θ)|n→∞−−−→0,

in probability. See Lemma 5.2 for details. On the other hand,

 1nn∑i=1E[Ψ(Yi)]f(xi−θ) =1nn∑i=1∫∞−∞Ψ(z+f(i/n))ϕ(z)dz⋅f(i/n−θ) n→∞−−−→∫10∫∞−∞Ψ(z+f(x))ϕ(z)dz⋅f(x−θ)dx =∫10∫10Φ2(f(x)−f(t))f(x−θ)dtdx=M(θ),

using again the definition of Riemann integral. In fact the convergence is uniform in , by an application of Lemma 5.3 to the function

 gθ(x):=∫∞−∞Ψ(z+f(x))ϕ(z)dz⋅f(x−θ),

whose derivative can be bounded independently of as follows:

 |g′θ(x)| =∣∣∫∞−∞f′(x)Ψ′(z+f(x))ϕ(z)dz⋅f(x−θ)+∫∞−∞Ψ(z+f(x))ϕ(z)dz⋅f′(x−θ)∣∣ ≤|f′|∞|Ψ′|∞|f|∞+|Ψ|∞|f′|∞ ≤|f′|∞|ϕ|∞|f|∞+|f′|∞,

using the fact that is a density, that is a distribution function, and that is non-negative and bounded by . All combined, we can conclude that indeed converges in probability as to uniformly in .

### 5.3 Proof of Proposition 2.2

Assume without loss of generality that . Define

 g(y):=∫10Φ2(y−f(x))dx,

so that

 M(θ)=∫10g(f(t))f(t−θ)dt.

Note that for all . When is Lipschitz, it is absolutely continuous with bounded derivative, so that by dominated convergence, is differentiable with derivative

 M′(θ)=−∫10g(f(t))f′(t−θ)dt=−∫10g(f(t+θ))f′(t)dt.

The reason we transferred to is to be able to differentiate again. Indeed, is also differentiable by dominated convergence, with derivative (recall that )

 g′(y)=∫10∫∞−∞ϕ(z+y−f(x))ϕ(z)dzdx,

which is bounded, so that in its second form is also differentiable (by dominated convergence again), with derivative

 M′′(θ)=−∫10f′(t+θ)g′(f(t+θ))f′(t)dt.

Therefore, is twice differentiable (with bounded second derivative at that).

We now look at . Let be the indeterminate integral of . We have

 M′(0) =−∫10g(f(t))f′(t)dt =−[G(f(1))−G(f(0))]=0,

by the fact that is 1-periodic. We also have

 M′′(0) =−∫10g′(f(t))f′(t)2dt≤0,

by the fact that is non-negative (by simply looking at the integrand defining it, recalling that is a density). In fact the inequality is strict by our assumption on , as it forces to be strictly positive everywhere.

### 5.4 Proof of Theorem 2.4

When working with the raw responses , the result can be proved using (Van der Vaart, 1998, Th 5.52). Since we work with the ranks instead, we elaborate, even though the core arguments are essentially the same. Assume without loss of generality that , so that we need to show that is bounded in probability.

On the one hand, by definition, we have . On the other hand, by consistency (since Theorem 2.3 applies), we have that is small, and by the fact that is close to quadratic in the neighborhood of (Proposition 2.2), we have that for some constant . Combined, these two observations yield

 C1^θ2n≤ˆMn(^θn)−M(^θn)−(ˆMn(0)−M(0)),

with probability tending to 1.

Let . For any , we have

 ˆMn(θ)−M(θ)−(ˆMn(0)−M(0)) (13) =1n∑i(^Ψn(Yi)−E[^Ψn](Yi))fi(θ) (14) +1n∑i(E[^Ψn](Yi)−Ψ(Yi))fi(θ) (15) (16) +1n∑iE[^Ψ(Yi)]fi(θ)−(M(θ)−M(0)). (17)

We saw in the proof of Lemma 2.1 that the terms in (15) and (17) are Riemannian sums and at most of order uniformly in given that the . (This is crude, but enough for our purposes here.) For (14), we apply Lemma 5.4 together with Markov’s inequality to get that, in probability as ,

 supy∈R|^Ψn(y)−E[^Ψn](y)|≤C2/√n.

With this, and the fact that , we have that the quantity in (14) is bounded in absolute value by for all , that is, this term is uniformly in . Hence, if denotes the term in (16), we have with probability tending to 1,

 C1^θ2n≤Sn(^θn)+C3(|^θn|/√n+1/n).

Let be such that whenever , so that when . Let be the smallest integer such that . Then, for , we have

 P(√n|^θn|≥2J) =∑j≥JP(2j<√n|^θn|≤2j+1) ≤∑j≥JP(max√n|θ|≤