# Learning Patterns for Detection with Multiscale Scan Statistics

This paper addresses detecting anomalous patterns in images, time-series, and tensor data when the location and scale of the pattern are unknown a priori. The multiscale scan statistic convolves the proposed pattern with the image at various scales and returns the maximum of the resulting tensor. Scale-corrected multiscale scan statistics apply different standardizations at each scale, and the limiting distribution under the null hypothesis (that the data is only noise) is known for smooth patterns. We consider the problem of simultaneously learning and detecting the anomalous pattern from a dictionary of smooth patterns and a database of many tensors. To this end, we show that the multiscale scan statistic is a subexponential random variable, and prove a chaining lemma for standardized suprema, which may be of independent interest. Then, by averaging the statistics over the database of tensors, we can learn the pattern and obtain Bernstein-type error bounds. We also provide a construction of an ϵ-net of the location and scale parameters, which gives a computationally tractable approximation with similar error bounds.


## 1 Introduction

Detection is the statistical task of determining if there is some structured signal within noisy data. If classification answers the question, “what am I seeing?”, detection answers the question, “do I see anything at all?”. In a sensor network (see [6]), one is often interested in the dual problems of noticing an anomaly (detection) and then determining its location and extent (classification). Sensor networks are deployed in natural environments for contaminant detection ([32, 31]), real-time surveillance ([4]), radiation monitoring ([3]), and fire detection ([22]). In medical imaging, the critical task is often to test the existence of an anomaly ([17, 14]). Quick detection of outbreaks of pathogens ([13, 24]) can lead to early intervention. Yet, given that this is such a fundamental task in many applications, the development of detection methodology lags behind the sophisticated tools for classification.

In images, time series, and tensors, it is natural to assume that there is a structured signal, such as blob-like objects, but we do not know its location or scale. In Figure 1, we can see a chemical plume from a multispectral image where each pixel value is lighter if a certain spectral signature is present (see [16]). We did not know a-priori where this chemical would appear, nor how large it would be. A natural approach to detecting signals of this type would be to center a regular shape such as a square or an ellipse around every pixel and detect anomalously large quantities of the spectral signature.

The aptly named scan statistic is a test statistic that is based on scanning the image or time series for a pattern that may be centered anywhere in the domain. For each location, one can form a likelihood ratio, and test if this statistic exceeds some predetermined threshold (thus, it is a generalized likelihood ratio test). Scan statistics are widely used in spatial detection applications; see [10] for a thorough introduction to the topic. They are commonly used to detect patterns in point clouds ([18, 20]), a closely related problem to our own. Because we often do not know the size as well as the location of the anomaly, we will scan all locations and scales simultaneously; this is called the multiscale scan statistic. Hence, the multiscale scan will translate various scaled versions of a pattern, such as rectangles with many options of side lengths or circles of varying radii. In Neyman-Pearson testing, we attempt to control the probability of false rejection under a null hypothesis, when our data only consists of noise. To this end, we either must approximate the distribution of the scan statistic asymptotically, or set the detection threshold by simulation or resampling.

[26] provided a weak limit for the scan consisting of all intervals in 1 dimension (other approximations can be found in [19] and [21]). In 2 dimensions, [11, 12, 30, 15] provided approximations of the null distribution for the multiscale scan statistic. [1, 2] analyzed scan statistics for blob-like patterns and determined thresholds for detectability in this context.

It was observed that if one naively tests all scales in the multiscale scan at the same threshold, then the rejection events will be dominated by the finest scale. It was shown in [8] that by separately standardizing the 1-dimensional scan at each scale one can detect with a signal-to-noise ratio that adapts to the scale of the anomalous signal. This scale correction for multiscale scan statistics with rectangular shapes was further studied in [9, 25]. [23] showed that similar results can be achieved for a large class of patterns that satisfy an average Hölder continuity assumption.

In traditional detection applications, a tensor is scanned for a known function or for specific blob-like patterns. Given a database of many tensors that share a repeated anomalous pattern, it may be possible to simultaneously test for all of the patterns in a class $\mathcal{F}$. The catch is that each time-series or image (more generally, tensor) will have the anomalous pattern in a different location and at a different scale. For example, we may have many time series, all of which have some embedded smooth signal (such as the sinusoid in Figure 1), where the sinusoid begins at various time points and has different periodicities. Without knowing a priori that we are looking for a sinusoid this problem seems intractable, but if we know enough about the signal (such as that it comes from a finite dictionary of functions), then with enough data we may both learn and detect the pattern within very noisy data.

### 1.1 Contributions

We will introduce the multiscale scan statistic, with scale correction, in Section 2, and describe an ϵ-net construction using repeated dilation and convolution operations. We begin our theoretical analysis, in Section 3, by proving a chaining result which shows that the standardized supremum of certain subGaussian random fields is a subexponential random variable. This is then used, in Section 4, to provide a finite sample bound on the scale-corrected multiscale scan statistic (until now it was only known that it was almost surely bounded). We conclude, in Section 5, by demonstrating that our ϵ-net construction is indeed correct and by providing type 2 error control.

## 2 Method and Model

### 2.1 Continuous scan statistic

Let’s begin with the basic scan statistic over an image. For an image, which can be represented by a matrix $Y$, we can convolve a pattern $P$ with the image,

$$(P \star Y)_{k,l} = \sum_{k',l'=-H}^{H} Y_{k-k',\,l-l'} P_{k',l'}, \qquad k,l = -L+H, \ldots, L-H.$$

Then the simple scan statistic is $\max_{k,l} (P \star Y)_{k,l}$ (one can take the absolute value by also considering $-P$). In the case that all entries of $P$ are equal, this scans a square activation pattern over the image. A common assumption in this problem is that, under the null hypothesis, the entries $Y_{k,l}$ are independent standard subGaussian random variables. For an arbitrary pattern matrix $P$, we would like to scale both dimensions, each by its own factor (e.g. stretching the square to form a rectangle). For general patterns this can be difficult, and one approach is to scale the dimensions of a continuous function, $f$, and then rasterize it. This analysis can become very cumbersome and not terribly enlightening, so we will approximate the pixels with a continuous domain and the image with a random field.
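The discrete scan described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the paper's implementation: the function names `convolve_valid` and `scan_statistic` are hypothetical, the pattern is assumed to be a square matrix with odd side length $2H+1$, and only positions where the pattern fits entirely inside the image are evaluated.

```python
def convolve_valid(image, pattern):
    """Convolution (P * Y)[k][l] = sum_{k',l'} Y[k-k'][l-l'] * P[k'][l'].

    `pattern` is a (2H+1) x (2H+1) list of lists whose entry [k'+H][l'+H]
    stores P[k'][l']; only centers (k, l) with full overlap are returned.
    """
    H = (len(pattern) - 1) // 2
    n_rows, n_cols = len(image), len(image[0])
    out = []
    for k in range(H, n_rows - H):
        row = []
        for l in range(H, n_cols - H):
            total = 0.0
            for kp in range(-H, H + 1):
                for lp in range(-H, H + 1):
                    total += image[k - kp][l - lp] * pattern[kp + H][lp + H]
            row.append(total)
        out.append(row)
    return out


def scan_statistic(image, pattern):
    """Simple scan statistic: the largest entry of the convolution."""
    return max(max(row) for row in convolve_valid(image, pattern))


# A 5 x 5 image containing a 3 x 3 block of ones, scanned with a 3 x 3 square.
image = [[1.0 if 1 <= r <= 3 and 1 <= c <= 3 else 0.0 for c in range(5)]
         for r in range(5)]
square = [[1.0] * 3 for _ in range(3)]
print(scan_statistic(image, square))
```

In practice one would use an FFT-based or library convolution; the nested loops are kept only to mirror the displayed sum term by term.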

In order to implement our scan, we begin by proposing a given pattern, $f$, which is a function over the $d$-dimensional domain. For images, $d = 2$, and for time-series, $d = 1$, but we will only assume that $d$ is fixed in our asymptotics. We assume that $\|f\|_{L_2} = 1$ for every $f \in \mathcal{F}$, and that $f$ is supported over $\Omega$. We will further assume that $f$ has a continuous gradient. For a given field, $X_i$, over $\Omega_L$, we propose a scale parameter,

$$h \in \mathcal{H} := \times_j [1, L),$$

such that $h_j$ is the scale parameter for dimension $j$. (Throughout, $j$ will always indicate dimension.) Given an $h \in \mathcal{H}$, we select

$$t \in \mathcal{T}_h := \times_j [-(L - h_j),\, L - h_j],$$

and test if the pattern centered at $t$ and scaled by $h$ is hidden within image $X_i$. This is accomplished by convolving the field, $dX_i$, with the scaled function $f_h$,

$$(f_h \star dX_i)(t) = \int_{\Omega_L} f_h(\tau)\, dX_i(t - \tau) = \int_{\Omega} \frac{1}{\sqrt{h_\bullet}} f(\tau)\, dX_i(t - h\tau),$$

where $f_h(\tau) = f(\tau / h) / \sqrt{h_\bullet}$ with $h_\bullet = \prod_j h_j$, and vector operations such as $\tau / h$ and $h\tau$ are performed elementwise. (Throughout, $f(\tau) = 0$ if $\tau$ is outside of the domain of $f$.) The multiscale scan statistic takes the form,

$$s(X_i; f) := \max_{h \in \mathcal{H}} v_h \left( \max_{t \in \mathcal{T}_h} (f_h \star dX_i)(t) - v_h \right), \tag{1}$$

where in this work we will take $v_h = \sqrt{2 \sum_j \log(L / h_j)}$. Hence, when the scale is coarse ($h$ is large), $v_h$ is smaller. At the finest scale, where $v_h$ is large, the maximum at this scale concentrates about $v_h$ with a rate parameter $v_h$ (meaning that it concentrates more tightly than the pixel noise variance of $1$).

Throughout, we will denote the set of all valid scale and location parameters as

$$\mathcal{D} := \{(t, h) : h \in \mathcal{H},\ t \in \mathcal{T}_h\}.$$

We will assume that the fields, $X_1, \ldots, X_n$, are independent and have additive white noise terms, $dW_i$. Our noise model is such that $(f_h \star dW_i)(t)$ is a zero mean subGaussian random field indexed by $(t, h) \in \mathcal{D}$ with variance $1$ (see Section 3 for a definition). We will test if the field is just noise (null hypothesis) or the noise is added to the function, $f$, which is translated by $t_i$ and scaled by $h_i$,

$$\begin{aligned} H_0 &: dX_i(\tau) = dW_i(\tau), \quad i = 1, \ldots, n, \\ H_1 &: dX_i(\tau) = \mu f_{h_i}(t_i - \tau)\, d\tau + dW_i(\tau) \ \text{ for some } f \in \mathcal{F} \text{ and } (t_i, h_i) \in \mathcal{D}, \quad i = 1, \ldots, n. \end{aligned}$$

Critically, the location and scale parameters, $(t_i, h_i)$, differ for each image. Notice that $\|f_h\|_{L_2} = 1$, so under the null hypothesis, the convolution has mean $0$ and variance $1$. Because each statistic, $s(X_i; f)$, is independent and standardized by $v_h$, we can average these in order to increase the power of our final test statistic. To this end, let’s define the pattern adapted multiscale scan statistic (PAMSS),

$$S_n(X; \mathcal{F}) := \max_{f \in \mathcal{F}} \frac{1}{\sqrt{n}} \sum_{i=1}^{n} s(X_i; f). \tag{2}$$

We will show that $s(X_i; f)$ is subexponential, and so we can obtain probabilistic bounds on $S_n(X; \mathcal{F})$ with the subexponential Bernstein-type inequality.
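To make the structure of (1) and (2) concrete, the following is a rough one-dimensional, discretized sketch. Several pieces are assumptions made only for illustration: the pattern is taken to be supported on $[-1, 1]$, a scale $h$ is an integer half-width in samples, the standardization is taken to be $v_h = \sqrt{2 \log(L/h)}$ with $L$ the series length, and all function names are hypothetical.

```python
import math


def rasterize(f, h):
    """Sample f on [-1, 1] at 2h + 1 points and rescale to unit Euclidean norm."""
    values = [f(j / h) for j in range(-h, h + 1)]
    norm = math.sqrt(sum(v * v for v in values))
    return [v / norm for v in values]


def max_convolution(x, kernel):
    """Largest value of sum_j kernel[j] * x[t - j] over positions t with full overlap."""
    m = len(kernel)
    return max(
        sum(kernel[j] * x[t - j] for j in range(m))
        for t in range(m - 1, len(x))
    )


def scale_correction(h, length):
    """Assumed standardization v_h = sqrt(2 log(length / h)); requires h < length."""
    return math.sqrt(2.0 * math.log(length / h))


def multiscale_scan(x, f, scales):
    """Scale-corrected scan: max over h of v_h * (max_t (f_h * x)(t) - v_h)."""
    best = float("-inf")
    for h in scales:
        v = scale_correction(h, len(x))
        best = max(best, v * (max_convolution(x, rasterize(f, h)) - v))
    return best


def pamss(database, dictionary, scales):
    """Pattern adapted statistic: max over f of n^{-1/2} * sum_i s(x_i; f)."""
    n = len(database)
    return max(
        sum(multiscale_scan(x, f, scales) for x in database) / math.sqrt(n)
        for f in dictionary
    )
```

A call such as `pamss(database, [f1, f2], scales=[2, 4, 8])` returns one number for the whole database. The maximizing dictionary element is the learned pattern, and comparing the value with a threshold gives the detection decision.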

### 2.2 Multiscale ϵ-net construction

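In practice, data is discrete, and the scan statistic must be computed over a finite set of scales and locations. Suppose that for sample $i$ we have a draw from the alternative hypothesis, so that $dX_i(\tau) = \mu f_{h_i}(t_i - \tau)\, d\tau + dW_i(\tau)$. Instead of scanning over all $(t, h) \in \mathcal{D}$, we use a finite subset, $\mathcal{D}' \subset \mathcal{D}$. Let $(t', h')$ be values in $\mathcal{D}'$ that are close to $(t_i, h_i)$ in some sense. Then the expectation of the scan at this approximating location and scale is

$$E (f_{h'} \star dX_i)(t') = \mu \int f_{h'}(\tau) f_{h_i}(\tau + t_i - t')\, d\tau.$$

If we define the shift operator $(S_t f)(\tau) = f(t - \tau)$ then we see that this expectation is $\mu \langle S_{t'} f_{h'}, S_{t_i} f_{h_i} \rangle = \mu \left(1 - \tfrac{1}{2} \|S_{t'} f_{h'} - S_{t_i} f_{h_i}\|_{L_2}^2\right)$. (This metric is used for ϵ-net construction in [2].) Incidentally, this metric appears when we consider the variation of our scan statistic under the null hypothesis,

$$\nu_f((t,h),(t',h')) := \|S_t f_h - S_{t'} f_{h'}\|_{L_2} = \left( V\left( (f_h \star dW)(t) - (f_{h'} \star dW)(t') \right) \right)^{\frac{1}{2}},$$

where $W$ is the multivariate Wiener process. This fact will be useful when we provide type 1 error control for our scan statistic. With this metric, we will say that a finite subset $\mathcal{D}' \subset \mathcal{D}$ is an ϵ-net if for any $f \in \mathcal{F}$ and any point $(t, h) \in \mathcal{D}$ there exists a point $(t', h') \in \mathcal{D}'$ such that $\nu_f((t,h),(t',h')) \le \epsilon$.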
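The metric $\nu_f$ can be approximated numerically, which helps build intuition for how far apart two location and scale pairs are. The sketch below is a one-dimensional illustration under stated assumptions: the pattern is a unit-norm tent function on $[-1, 1]$ (a hypothetical choice), the scaled pattern is taken to be $f_h(\tau) = f(\tau/h)/\sqrt{h}$, and the integral $\int (f_h(t - z) - f_{h'}(t' - z))^2\, dz$ is evaluated with a midpoint rule.

```python
import math


def tent(u):
    """Unit-L2-norm tent function supported on [-1, 1] (hypothetical pattern)."""
    return math.sqrt(1.5) * max(0.0, 1.0 - abs(u))


def scaled(f, h):
    """Return f_h(tau) = f(tau / h) / sqrt(h), which keeps the L2 norm equal to one."""
    return lambda tau: f(tau / h) / math.sqrt(h)


def nu(f, t, h, t2, h2, n_steps=20000):
    """Midpoint-rule approximation of || S_t f_h - S_t2 f_h2 ||_{L2} in one dimension."""
    f_h, f_h2 = scaled(f, h), scaled(f, h2)
    lo = min(t - h, t2 - h2)
    hi = max(t + h, t2 + h2)
    dz = (hi - lo) / n_steps
    total = 0.0
    for k in range(n_steps):
        z = lo + (k + 0.5) * dz
        diff = f_h(t - z) - f_h2(t2 - z)
        total += diff * diff * dz
    return math.sqrt(total)


# The same relative shift at two different scales gives the same distance.
print(nu(tent, 0.0, 1.0, 0.1, 1.0), nu(tent, 0.0, 2.0, 0.2, 2.0))
```

The two printed values agree because the metric depends on the shift only through its size relative to the scale, which is one way to see why a grid of locations can be coarser at coarser scales.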

The sensitivity of $\nu_f$ to small changes in $(t, h)$ will depend on the smoothness of $f$. If $f$ has large gradients then a small shift in $t$ can misalign the function with the unshifted version. To this end, we will consider two different notions of smoothness for the functions in our dictionary, $\mathcal{F}$. Define the isotropic total variation (recall that the functions have continuous gradients),

$$\|f\|_{TV} := \int_{\Omega} \|\nabla f(u)\|_2\, du.$$

Then we may assume that all of the functions are of bounded variation,

$$\exists \gamma_1 > 0 \ \text{ s.t. } \ \forall f \in \mathcal{F},\ \|f\|_{TV} \le \gamma_1. \tag{TVC}$$

The bounded variation assumption is consistent with the assumption in [8], although we generalize this work to the multidimensional case. Another notion of smoothness is the average Hölder condition of [23]. Define the Hölder functional,

$$A_{t,s}(f) := \int_{\Omega_L} |f(t - z) - f(s - z)|^2\, dz.$$

Then an alternative to the bounded variation assumption is that

$$\exists\, 0 < \gamma_2 \le 1 \ \text{ s.t. } \ \forall f \in \mathcal{F},\ A_{t,s}(f) \le c_A \|t - s\|_2^{2\gamma_2}, \tag{AHC}$$

where $c_A$ is some constant. Throughout the paper we will refer to both conditions, and denote $\gamma$ as either $\gamma_1$ or $\gamma_2$ depending on context.
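Both smoothness functionals are easy to evaluate numerically in one dimension, which can be useful for checking whether a candidate dictionary element plausibly satisfies (TVC) or (AHC). The sketch below is illustrative only: the tent function is a hypothetical pattern (piecewise smooth rather than continuously differentiable), the function names are made up, and the integrals are replaced by sums on a fine grid.

```python
def tent(u):
    """Tent function supported on [-1, 1] (hypothetical pattern)."""
    return max(0.0, 1.0 - abs(u))


def total_variation(f, lo, hi, n_steps=20000):
    """Approximate the 1-D total variation, int |f'(u)| du, by summed increments."""
    dx = (hi - lo) / n_steps
    values = [f(lo + k * dx) for k in range(n_steps + 1)]
    return sum(abs(values[k + 1] - values[k]) for k in range(n_steps))


def holder_functional(f, t, s, lo, hi, n_steps=20000):
    """Midpoint-rule approximation of A_{t,s}(f) = int |f(t - z) - f(s - z)|^2 dz."""
    dz = (hi - lo) / n_steps
    total = 0.0
    for k in range(n_steps):
        z = lo + (k + 0.5) * dz
        total += (f(t - z) - f(s - z)) ** 2 * dz
    return total


tv = total_variation(tent, -1.0, 1.0)
a = holder_functional(tent, 0.0, 0.1, -2.0, 2.0)
print(tv, a)
```

For this pattern the total variation is close to 2, and the Hölder functional at shift 0.1 stays below $2 \cdot 0.1^2$, consistent with (AHC) holding with $\gamma_2 = 1$ and $c_A = 2$ for this particular shift.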

Given that our functions satisfy some smoothness assumptions, we can specify an ϵ-net construction, which was developed first in [29]. We will begin with a subset of scales, indexed by a parameter $\beta > 1$ and a maximum exponent $\ell_{\max}$,

$$\mathcal{H}_\beta = \{\beta^{\ell} : \ell \in \{0, \ldots, \ell_{\max}\}^d\}.$$

At a given scale, we will consider a grid of evenly spaced locations, where the distance between grid points increases with the scale, $h$,

$$\mathcal{T}_{\alpha,h} := \left( \times_j (\alpha h_j \cdot \mathbb{Z}) \right) \cap \mathcal{T}_h.$$

The spacing, $\alpha$, is a tuning parameter as well in this construction. With $\beta, \alpha$ specified, we can consider the ϵ-net to be

$$\mathcal{D}_{\beta,\alpha} = \{(t, h) : h \in \mathcal{H}_\beta,\ t \in \mathcal{T}_{\alpha,h}\}.$$

This is similar to the construction in [25], which has a fast implementation on GPUs using a hierarchy of convolution and downsampling layers. The main idea is that instead of expanding the function one can instead downsample the tensor (but the details on a finite image are somewhat arduous). We expect that a similar implementation is possible in this context, when the functions are adaptively rasterized, but such developments are outside of the scope of this paper.
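The net construction above is simple enough to enumerate directly. The following sketch builds the geometric grid of scales and the scale-dependent grid of locations. It assumes that the largest exponent is the biggest integer $\ell$ with $\beta^{\ell} < L$ (so that every scale lies in $[1, L)$), and the function names are hypothetical.

```python
import itertools
import math


def scale_grid(L, beta, d):
    """Scales beta**l per dimension, for exponents l with beta**l < L (assumed l_max)."""
    l_max = 0
    while beta ** (l_max + 1) < L:
        l_max += 1
    per_dim = [beta ** l for l in range(l_max + 1)]
    return list(itertools.product(per_dim, repeat=d))


def location_grid(L, alpha, h):
    """Multiples of alpha * h_j that lie inside [-(L - h_j), L - h_j], per dimension."""
    axes = []
    for h_j in h:
        step = alpha * h_j
        k_max = math.floor((L - h_j) / step)
        axes.append([k * step for k in range(-k_max, k_max + 1)])
    return list(itertools.product(*axes))


def build_net(L, beta, alpha, d):
    """All (t, h) pairs with h on the scale grid and t on the grid for that scale."""
    return [(t, h) for h in scale_grid(L, beta, d) for t in location_grid(L, alpha, h)]


net = build_net(L=16, beta=2, alpha=0.5, d=1)
print(len(net))
```

Because the location spacing is proportional to the scale, coarse scales contribute few points. In this small example the finest scale accounts for more than half of the net.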

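Given an ϵ-net, we can compute the approximate scan by restricting the evaluations to the finite set of locations and scales,

$$e_{\beta,\alpha}(X_i; f) := \max_{(t,h) \in \mathcal{D}_{\beta,\alpha}} v_h \left( (f_h \star dX_i)(t) - v_h \right).$$

Then we can similarly define the ϵ-net pattern adapted multiscale scan statistic (ϵ-PAMSS),

$$E_n(X; \mathcal{F}) = \max_{f \in \mathcal{F}} \frac{1}{\sqrt{n}} \sum_{i=1}^{n} e_{\beta,\alpha}(X_i; f).$$

It is clear that $E_n(X; \mathcal{F}) \le S_n(X; \mathcal{F})$, since we are maximizing over a strict subset of the continuous scan. Hence, any type 1 error bound on $S_n(X; \mathcal{F})$ is also conferred to $E_n(X; \mathcal{F})$.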

## 3 A chaining bound for standardized suprema

The scale correction in (1) is based on a precise characterization of the rate and location of a supremum of the random field resulting from the convolution. This standardization complicates the analysis significantly, and until now, it was only known that the scale-corrected statistic was almost surely bounded. We initially select the location and scale parameters and take the supremum over this selection in forming the scan of $f$ over $X_i$. Because under $H_0$ the random variables $s(X_i; f)$ form iid copies of the same random variable, we will seek exponential concentration inequalities bounding $s(X_i; f)$. With this in hand, we can hope to obtain PAC-style bounds on their average and the resulting selection of pattern from a finite function class $\mathcal{F}$. We will restrict this work to finite $\mathcal{F}$, but a continuous class of location and scale parameters, $\mathcal{D}$. While this bar seems to be set pretty low compared to the rich developments in classification, such as Vapnik-Chervonenkis theory, as we will see, controlling our statistic under finite function classes is a challenging first step to a more complete understanding of learning patterns in detection. Let us begin with a formal definition of a subGaussian random field, and recall that this is what we assume for the field $(f_h \star dW_i)(t)$, an assumption that is satisfied when $W_i$ is the $d$-dimensional Wiener process.

###### Definition 1.

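We say that a random field, $\{Z(\iota)\}_{\iota \in \mathcal{I}}$, is a (zero mean) standard subGaussian process if

$$P\{|Z(\iota_0) - Z(\iota_1)| \ge u\} \le 2 \exp\left( -\frac{u^2}{2 d_Z(\iota_0, \iota_1)^2} \right), \tag{3}$$

$$P\{Z(\iota_0) \ge u\} \le \exp\left( -\frac{u^2}{2} \right), \tag{4}$$

for any $\iota_0, \iota_1 \in \mathcal{I}$ and $u > 0$, where $d_Z$ is the canonical distance.

Our noise model assumption can be formally stated as

$$\{(f_h \star dW_i)(t) : (t, h) \in \mathcal{D}\} \ \text{ is a subGaussian random field with canonical distance } \nu_f. \tag{5}$$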

The generic chaining is a tool for bounding the suprema of random fields with subGaussian tails [27] (it also has generalizations to other types of concentration). For general subGaussian random fields, the expectation of the supremum is bounded by a functional of the index set, and one can also show that the supremum, rescaled by this quantity, has standard subGaussian tails (using the standard chain construction). Hence, as the expectation bound increases, the scale of the supremum can grow with it (meaning that the supremum becomes more dispersed). This is in contrast to what we know from extreme value theory, where the maximum of independent random variables tends to concentrate more tightly (with a lower asymptotic variance), not less. For example, let $z_1, \ldots, z_N$ be iid standard normal random variables. Then we know that $a_N (\max_i z_i - b_N)$ approaches a Gumbel distribution, where $a_N, b_N$ are specific sequences (see [7]). Notice that the scale of the max decreases like $1/\sqrt{2 \log N}$, which is in contrast to what we obtain from the standard construction in the generic chaining, which has increasing scale. Multiscale scan statistics also have a Gumbel limiting distribution [25], and so we know that the scale of the statistic should be decreasing, and not increasing. The following theorem significantly modifies the construction in the generic chaining in order to provide an exponential inequality under parametric-style conditions on the entropy of the random field, $Z$.

###### Theorem 2.

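Let $\{Z(\iota)\}_{\iota \in \mathcal{I}}$ be a standard subGaussian process over an index set $\mathcal{I}$. Suppose that the metric space $(\mathcal{I}, d_Z)$ has the following bound on the $\epsilon$-covering number,

$$N(\mathcal{I}, d_Z, \epsilon) \le \Gamma \epsilon^{-\rho}. \tag{6}$$

Then for all sufficiently large $\Gamma$, the following supremum is bounded in probability,

$$P\left\{ \sqrt{c_0 \log \Gamma} \left( \sup_{\iota \in \mathcal{I}} Z(\iota) - \sqrt{2 \log \Gamma} \right) - a_0 \log\log \Gamma > u \right\} \le e^{-u}, \tag{7}$$

for $u > 0$, where $c_0, a_0$ are constants depending on $\rho$ (but not on $\Gamma$). In words, the supremum of such a subGaussian process is subexponential, with location and rate parameter both of order $\sqrt{\log \Gamma}$.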

Proof Sketch. The generic chaining consists of a clever use of the union bound, subGaussian concentration, and a detailed chain construction. In order to illustrate how we can obtain an exponential inequality for the max of subGaussian random variables, let us consider the max of $N$ iid standard Gaussian random variables, $z_1, \ldots, z_N$. Notice that by the union bound and subGaussian concentration,

$$P\left\{ \max_i z_i > \sqrt{2 \log N + u^2} \right\} \le N e^{-\log N - \frac{u^2}{2}} = e^{-\frac{u^2}{2}}.$$

Now, instead of bounding $\sqrt{2 \log N + u^2} \le \sqrt{2 \log N} + u$, as is done in the generic chaining, we will use the bound $\sqrt{2 \log N + u^2} \le \sqrt{2 \log N} + u^2 / (2\sqrt{2 \log N})$ (a first-order Taylor expansion of the square root around $2 \log N$). Hence, replacing $u^2$ by $u$, we obtain,

$$P\left\{ 2\sqrt{2 \log N} \left( \max_i z_i - \sqrt{2 \log N} \right) > u \right\} \le e^{-u/2}.$$

This is the main technique that we use for obtaining subexponential bounds from the max of independent subGaussian random variables.
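The union-bound argument for the maximum of iid Gaussians can be checked by simulation. The sketch below estimates the upper tail of the standardized maximum $2\sqrt{2 \log N}\,(\max_i z_i - \sqrt{2 \log N})$ and compares it with $e^{-u/2}$, which is the bound that follows directly from the Taylor step. The function names, the choice $N = 200$, the number of trials, and the seed are arbitrary illustrative choices, not values from the paper.

```python
import math
import random


def standardized_max(n_vars, rng):
    """Draw n_vars standard normals and return 2 * loc * (max - loc), loc = sqrt(2 log n)."""
    location = math.sqrt(2.0 * math.log(n_vars))
    maximum = max(rng.gauss(0.0, 1.0) for _ in range(n_vars))
    return 2.0 * location * (maximum - location)


def empirical_tail(n_vars, n_trials, u, seed=0):
    """Monte Carlo estimate of P{ standardized max > u }."""
    rng = random.Random(seed)
    hits = sum(standardized_max(n_vars, rng) > u for _ in range(n_trials))
    return hits / n_trials


for u in (0.5, 1.0, 2.0):
    print(u, empirical_tail(200, 1000, u), math.exp(-u / 2.0))
```

The empirical tail sits well below the bound, as expected, since the union bound ignores the polynomial factor in the Gaussian tail.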

The chain construction refers to a sequence of partitionings of the space $\mathcal{I}$. Given a partition $\mathcal{A}_n$ of $\mathcal{I}$, we let $A_n(\iota)$ be the element that contains the point $\iota$ and $\Delta(A_n(\iota))$ be its radius. In the standard generic chaining, a partition is called admissible if $|\mathcal{A}_n| \le 2^{2^n}$. Then the supremum is controlled by uniformly bounding the center of the single element of $\mathcal{A}_0$, and then the differences between the centers of the elements at level $n$ and their closest centers at level $n - 1$. The standard result is a bound on the supremum based on a functional of the radii $\Delta(A_n(\iota))$. For our modifications, the subexponential bound above requires a growing number of independent points to work, so we begin our chain at a deeper level than $n = 0$. Furthermore, we have to modify the definition of an admissible partition to be $|\mathcal{A}_n| \le a^{a^n}$, where $a \to 1$ as $\Gamma \to \infty$. Technical details regarding the chain introduce the $\log\log \Gamma$ term. See the appendix for a complete proof. Although we assume a specific form for the covering numbers, it may be possible to generalize this technique to other entropy bounds.

## 4 Type 1 error guarantees for learning patterns

A type 1 error (detecting an anomaly under the null hypothesis) is typically the first error to be controlled in the Neyman-Pearson testing framework. Our statistic will be compared to a threshold, which is determined through calibration, simulation, or theoretical guarantees. In this section, we will provide a finite-sample probabilistic bound for the multiscale scan statistic with exponential tail probability. We will then use this result to obtain a finite sample bound on the PAMSS, $S_n(X; \mathcal{F})$, which increases logarithmically with the number of functions, $|\mathcal{F}|$.

###### Lemma 3.

Suppose that satisfies either (TVC) or (AHC), and that satisfies the noise assumption, (5). Let , and . Then when ,

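$$P\left\{ c_1 \cdot \max_{h \in \mathcal{H}_\ell,\ t \in \mathcal{T}_h} v_h \left( (f_h \star dW)(t) - v_h \right) - a_1 \log\log L > u \right\} \le e^{-u} \tag{8}$$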

for constants depending on only.

###### Proof.

Let denote the index set in the above display. Throughout, let and denote arbitrary constants depending on alone. By assumption, is a subGaussian random field with canonical distance . An -net of , by definition, will be an -covering of , so we just need to bound the size of the -net, . By construction,

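$$|\mathcal{D}'_{\beta,\alpha}| = \sum_{h \in \mathcal{H}_2(\ell) \cap \mathcal{H}_\beta} |\mathcal{T}_{\alpha,h}| \le c_1 \sum_{h \in \mathcal{H}_2(\ell) \cap \mathcal{H}_\beta} \prod_j \frac{L}{\alpha h_j} \le c_1 \frac{(L \log_\beta 2)^d}{\alpha^d\, 2^{\sum_j \ell_j}},$$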

and we can take as specified in the proof of Theorem 6. Furthermore, notice that is within a factor of 2 of any in . Then, there are constants such that We can see this because for some constant depending on . By Theorem 2, we have the subexponential bound for the random variable,

$$\sup_{(t,h) \in \mathcal{D}'} y_h \left( (f_h \star dW)(t) - y_h \right) \quad \text{where} \quad y_h = \sqrt{2 \sum_j \log(L / h_j) + 2 \log \tilde{C}}.$$

Notice that which gives us the result, along with . ∎

From here, we simply apply the union bound with the bound in Lemma 3, for every element in the partitioning of the scales. There are on the order of $(\log L)^d$ such elements. This gives us

$$P\left\{ c_2 \cdot \frac{s(W_i, f)}{\log\log L} - a_2 > u \right\} \le e^{-u},$$

for some constants $c_2, a_2$. Hence, $z_i(f)$, the centered version of $s(W_i, f) / \log\log L$, is subexponential with Orlicz 1-norm bounded by a constant $K$. Thus, by the subexponential Bernstein inequality (Prop. 5.16 in [28]),

$$P\left\{ \sum_{i=1}^{n} z_i(f) \ge t \right\} \le \exp\left( -c_3 \min\left\{ \frac{t^2}{n K^2}, \frac{t}{K} \right\} \right),$$

where $c_3$ is an absolute constant. This gives us our main result (constants are absorbed below).
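The Bernstein-type bound can be inverted in closed form to give a threshold for a target error level $\delta$. Setting $\min\{t^2/(nK^2),\, t/K\} = \log(1/\delta)/c_3$ and solving for the smallest admissible $t$ gives $t = K \max\{\sqrt{n a}, a\}$ with $a = \log(1/\delta)/c_3$. The sketch below implements both directions. The values of `K` and `c` are inputs that the text leaves unspecified, so any numbers passed in are placeholders, and the function names are hypothetical.

```python
import math


def bernstein_tail(t, n, K, c):
    """Right-hand side of the bound: exp(-c * min(t^2 / (n K^2), t / K))."""
    return math.exp(-c * min(t * t / (n * K * K), t / K))


def bernstein_threshold(delta, n, K, c):
    """Smallest t for which the bound above is at most delta."""
    a = math.log(1.0 / delta) / c
    return K * max(math.sqrt(n * a), a)


# Placeholder constants, chosen only to exercise the functions.
t = bernstein_threshold(delta=0.05, n=100, K=1.0, c=1.0)
print(t, bernstein_tail(t, n=100, K=1.0, c=1.0))
```

When $n$ is large relative to $\log(1/\delta)$ the square-root branch is active and the threshold grows like $\sqrt{n \log(1/\delta)}$, which is the subGaussian regime. Otherwise the linear branch takes over.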

###### Theorem 4.

Let $\mathcal{F}$ be finite and assume that all functions in $\mathcal{F}$ satisfy either (TVC) or (AHC). Let

 (9)

then for some constants depending on , and ,

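$$P\left\{ S_n(X, \mathcal{F}) > F_n(\delta) \cdot \log\log L \right\} \le \delta, \tag{10}$$

when $X$ is drawn according to the null hypothesis, $H_0$.

Theorem 4 proves this paper’s main hypothesis, that we can learn patterns from a finite dictionary where the type 1 error bound increases logarithmically with $|\mathcal{F}|$. In fact, if $|\mathcal{F}|$ is not too large relative to the sample size, then we have subGaussian concentration of the final test statistic $S_n(X, \mathcal{F})$.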

## 5 ϵ-net approximation and type 2 error

We have provided a construction of the ϵ-net with parameters $\beta$ and $\alpha$, but we did not specify the selection of either, nor did we prove our claim that this indeed produces an ϵ-net. We used the construction of our ϵ-net in the proof of Lemma 3, so we will be careful in this section to prove the correctness of our construction from first principles. The following technical lemma is the main driver of these results.

###### Lemma 5.

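There is a constant $C$, not depending on the locations and scales, such that

1. Suppose that (TVC) holds for the class $\mathcal{F}$, then

 $$\nu_f((t,h),(t',h'))^2 \le C \gamma_1 \left( \left\| \frac{t - t'}{h} \right\|_2^2 + \left\| \frac{h - h'}{h} \right\|_2^2 + \left( \sqrt{\frac{h'_\bullet}{h_\bullet}} - 1 \right)^2 \right).$$
2. [[23]] Suppose that (AHC) holds for the class $\mathcal{F}$, then

 $$\nu_f((t,h),(t',h'))^2 \le C \left( \sum_{j=1}^{d} \left| \frac{t_j - t'_j}{h_j} \right|^{2\gamma_2} + \sum_{j=1}^{d} \left| \frac{h_j - h'_j}{\sqrt{h_j h'_j}} \right|^{2} + \sum_{j=1}^{d} \left| \frac{h_j - h'_j}{\sqrt{h_j h'_j}} \right|^{2\gamma_2} \right).$$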
###### Proof.

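(2) is proven in [23] (pg. 32) so we will focus on (1). We will partially follow the arguments in [8] (pg. 145). Let $W$ denote a standard Wiener process in $d$ dimensions. Notice that we can define $\nu_f$ as,

$$\nu_f((t,h),(t',h'))^2 := V\left( \int_{\Omega_L} f_h(t - z)\, dW(z) - \int_{\Omega_L} f_{h'}(t' - z)\, dW(z) \right).$$

Recall that

$$(f_h \star dW)(t) = \frac{1}{\sqrt{h_\bullet}} \int_{\Omega} f(u)\, dW(t - hu).$$

By integration by parts,

$$\frac{1}{\sqrt{h_\bullet}} \int_{\Omega} f(u)\, dW(t - hu) = \int_{\Omega} \frac{W(t - hu)}{\sqrt{h_\bullet}} \cdot \nabla f(u)\, du.$$

Therefore,

$$\begin{aligned} (f_h \star dW)(t) - (f_{h'} \star dW)(t') &= \int_{\Omega} \left( \frac{W(t - hu)}{\sqrt{h_\bullet}} - \frac{W(t' - h'u)}{\sqrt{h'_\bullet}} \right) \cdot \nabla f(u)\, du \\ &\le \int_{\Omega} \left\| \frac{W(t - hu)}{\sqrt{h_\bullet}} - \frac{W(t' - h'u)}{\sqrt{h'_\bullet}} \right\|_2 \cdot \|\nabla f(u)\|_2\, du \\ &\le \sup_{u \in \Omega} \left\| \frac{W(t - hu)}{\sqrt{h_\bullet}} - \frac{W(t' - h'u)}{\sqrt{h'_\bullet}} \right\|_2 \cdot \int_{\Omega} \|\nabla f(u)\|_2\, du. \end{aligned}$$

Hence,

$$\nu_f((t,h),(t',h'))^2 \le \|f\|_{TV}^2 \cdot E \sup_{u \in \Omega} \left\| \frac{W(t - hu)}{\sqrt{h_\bullet}} - \frac{W(t' - h'u)}{\sqrt{h'_\bullet}} \right\|_2^2.$$

Decompose the supremum term,

$$E \sup_{u \in \Omega} \left\| \frac{W(t - hu)}{\sqrt{h_\bullet}} - \frac{W(t' - h'u)}{\sqrt{h'_\bullet}} \right\|_2 \le E \sup_{u \in \Omega} \left( \left\| \frac{W(t - hu) - W(t' - h'u)}{\sqrt{h_\bullet}} \right\|_2 + \|W(t' - h'u)\|_2 \left| \frac{1}{\sqrt{h_\bullet}} - \frac{1}{\sqrt{h'_\bullet}} \right| \right).$$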

By Brownian scaling, Hence, the th coordinate in the above LHS is a 1-dimensional Brownian motion equal in distribution to, By the reflection principle, we have that for some constant ,

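$$E \sup_{u \in \Omega} B\left( \frac{t_j - t'_j}{h_j} - \left( 1 - \frac{h'_j}{h_j} \right) u_j \right)^2 \le c' \left( \left| \frac{t_j - t'_j}{h_j} \right|^2 + \left| \frac{h_j - h'_j}{h_j} \right|^2 \right).$$

Hence,

$$E \sup_{u \in \Omega} \left\| \frac{W(t - hu) - W(t' - h'u)}{\sqrt{h_\bullet}} \right\|_2^2 \le c' \sum_j \left( \left| \frac{h_j - h'_j}{h_j} \right|^2 + \left| \frac{t_j - t'_j}{h_j} \right|^2 \right).$$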

Without loss of generality, we can select by translation invariance of . Furthermore by a similar argument, , for another universal constant . Hence,

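$$E \sup_{u \in \Omega} \|W(t' - h'u)\|_2 \left| \frac{1}{\sqrt{h_\bullet}} - \frac{1}{\sqrt{h'_\bullet}} \right| \le c' \left| \sqrt{\frac{h'_\bullet}{h_\bullet}} - 1 \right|.$$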

Combining these bounds completes the proof. ∎

###### Theorem 6.

Suppose that either one of (TVC) or (AHC) holds. Let , then there exists a depending on such that for any , there exists (the -net) with

###### Proof.

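Throughout, let $\gamma$ mean either $\gamma_1$ or $\gamma_2$ depending on context. By the triangle inequality let us bound,

$$\nu_f((t,h),(t',h')) \le \nu_f((t,h'),(t',h')) + \nu_f((t,h),(t,h')).$$

In our construction, notice that for any $t \in \mathcal{T}_{h'}$ there is a grid point $t'$ in $\mathcal{T}_{\alpha,h'}$ that is within $\alpha h'_j$ from it in dimension $j$. Hence,

$$\sum_{j=1}^{d} \left| \frac{t_j - t'_j}{h'_j} \right|^{2\gamma_2} \le d \alpha^{2\gamma_2}; \qquad \sum_{j=1}^{d} \left| \frac{t_j - t'_j}{h'_j} \right|^{2} \le d \alpha^{2}.$$

Furthermore, by the ϵ-net construction, there exists an $h' \in \mathcal{H}_\beta$ such that $h_j$ and $h'_j$ differ by at most a factor of $\beta$ for every $j$. Thus,

$$\left| \frac{h_j - h'_j}{\sqrt{h_j h'_j}} \right| \le \beta - 1; \qquad \left| \sqrt{\frac{h'_\bullet}{h_\bullet}} - 1 \right| \le \beta^{\frac{d}{2}} - 1.$$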

Thus there is a constant, , depending on , such that for small enough and is sufficient. ∎

Suppose that we are under the alternative hypothesis, $H_1$, so that there is some embedded signal in each image. Consider evaluating the scan at the true location and scale, $(t_i, h_i)$, for a given field $X_i$; then $(f_{h_i} \star dX_i)(t_i)$ is normally distributed with mean $\mu$ and variance $1$. In the event that we scan over an ϵ-net, then by the arguments in Section 2, there is an element in the approximate scan with mean at least $\mu(1 - \epsilon^2/2)$. Hence, we have the following type 2 error bound, by summing the resulting normal random variables.

###### Proposition 7.

Suppose that are drawn from (with possibly different location and scale parameters). Then define and ,

where is the standard Normal CDF. The above display is also true if we let and substitute .

In order to have diminishing type 1 error probability, we set a threshold for at for . Assume that (however slowly) then to have the type 2 error probability to decrease as well, we require that

 μWn−Vn√n−Fn(δ)loglogL=ω(√Vnn)

Suppose that and (the standard multiscale scan setting), then this would require

 μ−vh1−Kvh1log1δ⋅loglogL=ω(1),

which is consistent up to some constants with previously known rates, [23]. Critically, the dependence on is logarithmic, and this provides the first such finite sample bound.

Let us conclude with a remark about the restrictiveness of the assumption that we have a finite function class $\mathcal{F}$. It is known that functions of bounded variation have Haar wavelet coefficients that are bounded in a weak norm [5]. It is reasonable to discretize the allowed coefficient values and then restrict our function class to functions with sparse wavelet coefficients, in which case the size of the class remains very manageable. One advantage of this approach is that the sparse Haar wavelets will naturally satisfy condition (TVC). It is outside of the scope of this work to extend the result to infinite function classes, but this would present a very interesting and important extension.

## 6 Conclusions

We have addressed learning and detecting patterns from a function class, $\mathcal{F}$, using multiscale scan statistics. We have introduced the multiscale scan statistic and proved a subexponential concentration bound for it, which relied on a novel chaining result for standardized suprema of subGaussian random fields (a result that may be of independent interest). We introduced the pattern adapted multiscale scan statistic, which can learn patterns in a database of tensors (when the locations and scales vary). This result allowed us to prove Bernstein-type concentration for the PAMSS, meaning that we can learn finite function classes that grow exponentially with the sample size, $n$. With evidence that representation learning and detection are not incompatible, we anticipate that efficient methods for learning functions in this setting will emerge, by using modern tools from deep learning and multiscale methods.

Acknowledgements: JS is supported in part by DMS-1712996.

## References

• [1] E. Arias-Castro, D.L. Donoho, and X. Huo. Near-optimal detection of geometric objects by fast multiscale methods. IEEE Trans. Inform. Theory, 51(7):2402–2425, 2005.
• [2] Ery Arias-Castro, Emmanuel J. Candès, and Arnaud Durand. Detection of an anomalous cluster in a network. Ann. Statist., 39(1):278–304, 2011.
• [3] Sean M. Brennan, Angela M. Mielke, David C. Torney, and Arthur B. Maccabe. Radiation detection with distributed sensor networks. Computer, 37(8):57–59, 2004.
• [4] Y. Caron, P. Makris, and N. Vincent. A method for detecting artificial objects in natural environments. In Proceedings 16th International Conference on Pattern Recognition, volume 1, pages 600–603. IEEE Comput. Soc., 2002.
• [5] Albert Cohen, Wolfgang Dahmen, Ingrid Daubechies, Ronald DeVore, et al. Harmonic analysis of the space bv. Revista Matematica Iberoamericana, 19(1):235–263, 2003.
• [6] D. Culler, D. Estrin, and M. Srivastava. Overview of sensor networks. IEEE Computer, 37(8):41–49, 2004.
• [7] Laurens De Haan and Ana Ferreira. Extreme value theory: an introduction. Springer Science & Business Media, 2007.
• [8] Lutz Dümbgen and Vladimir G Spokoiny. Multiscale testing of qualitative hypotheses. Annals of Statistics, pages 124–152, 2001.
• [9] Lutz Dümbgen and Günther Walther. Multiscale inference about a density. The Annals of Statistics, pages 1758–1785, 2008.
• [10] Joseph Glaz, Joseph I Naus, and Sylvan Wallenstein. Scan statistics. Springer, 2001.
• [11] Joseph Glaz and Zhenkui Zhang. Multiple window discrete scan statistics. Journal of Applied Statistics, 31(8):967–980, 2004.
• [12] G Haiman and C Preda. Estimation for the distribution of two-dimensional discrete scan statistics. Methodology and Computing in Applied Probability, 8(3):373–382, 2006.
• [13] R. Heffernan, F. Mostashari, D. Das, A. Karpati, M. Kulldorff, and D. Weiss. Syndromic surveillance in public health practice, New York City. Emerging Infectious Diseases, 10(5):858–864, 2004.
• [14] D. James, B. D. Clymer, and P. Schmalbrock. Texture detection of simulated microcalcification susceptibility effects in magnetic resonance imaging of breasts. Journal of Magnetic Resonance Imaging, 13(6):876–881, 2001.
• [15] Zakhar Kabluchko. Extremes of the standardized gaussian noise. Stochastic Processes and their Applications, 121(3):515–533, 2011.
• [16] Dimitris Manolakis and Gary Shaw. Detection algorithms for hyperspectral imaging applications. IEEE signal processing magazine, 19(1):29–43, 2002.
• [17] Nathan Moon, Elizabeth Bullitt, Koen van Leemput, and Guido Gerig. Automatic brain and tumor segmentation. In MICCAI ’02: Proceedings of the 5th International Conference on Medical Image Computing and Computer-Assisted Intervention-Part I, pages 372–379, London, UK, 2002. Springer-Verlag.
• [18] Joseph I. Naus. The distribution of the size of the maximum cluster of points on a line. J. Amer. Statist. Assoc., 60:532–538, 1965.
• [19] Joseph I Naus and Sylvan Wallenstein. Multiple window and cluster size scan procedures. Methodology and Computing in Applied Probability, 6(4):389–400, 2004.
• [20] Daniel B Neill. Fast subset scan for spatial pattern detection. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 74(2):337–360, 2012.
• [21] Vladimir Pozdnyakov, Joseph Glaz, Martin Kulldorff, and J Michael Steele. A martingale approach to scan statistics. Annals of the Institute of Statistical Mathematics, 57(1):21–37, 2005.
• [22] D. Pozo, F.J. Olmo, and L. Alados-Arboledas. Fire detection and growth monitoring using a multitemporal technique on AVHRR mid-infrared and thermal channels. Remote Sensing of Environment, 60(2):111–120, 1997.
• [23] Katharina Proksch, Frank Werner, and Axel Munk. Multiscale scanning in inverse problems. arXiv preprint arXiv:1611.04537, 2016.
• [24] L.D. Rotz and J.M. Hughes. Advances in detecting and responding to threats from bioterrorism and emerging infectious disease. Nature Medicine, pages S130–S136, 2004.
• [25] James Sharpnack and Ery Arias-Castro. Exact asymptotics for the scan statistic and fast alternatives. Electronic Journal of Statistics, 10(2):2641–2684, 2016.
• [26] David O Siegmund and Keith J Worsley. Testing for a signal with unknown location and scale in a stationary gaussian random field. The Annals of Statistics, pages 608–639, 1995.
• [27] Michel Talagrand. The generic chaining: upper and lower bounds of stochastic processes. Springer Science & Business Media, 2006.
• [28] Roman Vershynin. Introduction to the non-asymptotic analysis of random matrices. arXiv preprint arXiv:1011.3027, 2010.
• [29] Guenther Walther. Optimal and fast detection of spatial clusters with scan statistics. Ann. Statist., 38(2):1010–1033, 2010.
• [30] Xiao Wang and Joseph Glaz. Variable window scan statistics for normal data. Communications in Statistics-Theory and Methods, 43(10-12):2489–2504, 2014.
• [31] Brian A White, Antonios Tsourdos, Immanuel Ashokaraj, S Subchan, and Rafal Zbikowski. Contaminant cloud boundary monitoring using network of uav sensors. IEEE Sensors Journal, 8(10):1681–1692, 2008.
• [32] Y Jeffrey Yang, Roy C Haught, and James A Goodrich. Real-time contaminant detection and classification in a drinking water pipe using conventional water quality sensors: Techniques and experimental results. Journal of environmental management, 90(8):2494–2506, 2009.

## Appendix A Proof of Theorem 2

###### Proof.

We will follow the construction of generic chains as in [27], but will be significantly more careful about the details of the construction. This proof is similar in spirit to [8], but we will get probabilistic bounds and prefer this proof because it uses only first principles. Throughout this proof we will call variables that depend only on $\rho$ constants, and some of them may change from line to line. Variables from the main body of the paper, other than those defined in Theorem 2, may appear in this proof and mean something different (we suppose that they are in a different scope).

Let’s begin by defining and

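$$a = \left( 1 - \frac{1}{\log G} \right)^{-1} > 1.$$

Let an admissible partition, $\mathcal{A}_n$, be any partition of $\mathcal{I}$ of size at most $a^{a^n}$. Let $A_n(\iota)$ be the element of the partition containing $\iota$ and let the center of this element be

$$\tau_n(\iota) := \operatorname*{arg\,inf}_{\tau \in A_n(\iota)} \sup_{\iota' \in A_n(\iota)} d_Z(\tau, \iota'),$$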

and be its radius. Let then . By the union bound and (4),

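$$P\left\{ \exists \iota \in \mathcal{I} : Z(\tau_{k_0}(\iota)) > \sqrt{2 \log a \cdot a^{k_0} + u^2} \right\} \le a^{a^{k_0}} \exp\left( -\frac{1}{2} \left( 2 \log a \cdot a^{k_0} + u^2 \right) \right) = e^{-\frac{u^2}{2}}.$$

Let

$$\epsilon_k = G \cdot a^{-\frac{a^k}{\rho}}, \tag{11}$$

then there exists an admissible partition where the radii of the balls are at most $\epsilon_k$. Hence,

$$P\left\{ \exists \iota \in \mathcal{I} : |Z(\tau_{k+1}(\iota)) - Z(\tau_k(\iota))| > \epsilon_k \sqrt{2 \log a \cdot a^{k+1} + d_k u^2} \right\} \le 2 a^{a^{k+1}} \exp\left( -\frac{1}{2} \left( 2 \log a \cdot a^{k+1} + d_k u^2 \right) \right) = 2 e^{-\frac{d_k u^2}{2}},$$

where

$$d_k = e^{\frac{1}{2}(1 + \log a) \cdot (k - k_0)}. \tag{12}$$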

Define the quantities,

 Ak0 =√2loga⋅ak0+u2+