# Anomaly detection and classification for streaming data using PDEs

Nondominated sorting, also called Pareto Depth Analysis (PDA), is widely used in multi-objective optimization and has recently found important applications in multi-criteria anomaly detection. Recently, a partial differential equation (PDE) continuum limit was discovered for nondominated sorting leading to a very fast approximate sorting algorithm called PDE-based ranking. We propose in this paper a fast real-time streaming version of the PDA algorithm for anomaly detection that exploits the computational advantages of PDE continuum limits. Furthermore, we derive new PDE continuum limits for sorting points within their nondominated layers and show how the new PDEs can be used to classify anomalies based on which criterion was more significantly violated. We also prove statistical convergence rates for PDE-based ranking, and present the results of numerical experiments with both synthetic and real data.

## Authors

• 2 publications
• 24 publications
• 11 publications
04/10/2022

### How much can one learn a partial differential equation from its solution?

In this work we study the problem about learning a partial differential ...
08/20/2015

### Multi-criteria Similarity-based Anomaly Detection using Pareto Depth Analysis

We consider the problem of identifying patterns in a data set that exhib...
11/30/2018

### PDE-Net 2.0: Learning PDEs from Data with A Numeric-Symbolic Hybrid Deep Network

Partial differential equations (PDEs) are commonly derived based on empi...
06/09/2021

### Any equation is a forest: Symbolic genetic algorithm for discovering open-form partial differential equations (SGA-PDE)

Partial differential equations (PDEs) are concise and understandable rep...
04/30/2021

### A note on a PDE approach to option pricing under xVA

In this paper we study partial differential equations (PDEs) that can be...
03/11/2021

### Multi-objective discovery of PDE systems using evolutionary approach

Usually, the systems of partial differential equations (PDEs) are discov...
02/15/2018

### Black Hole Metric: Overcoming the PageRank Normalization Problem

In network science, there is often the need to sort the graph nodes. Whi...
##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Sorting, or ordering of multivariate data is an important and challenging problem in many fields of computational science. Since there is no canonical linear ordering for multivariate data, many different notions of ordering have been proposed in the literature [25], and the problem is very much application dependent.

In the context of multiobjective optimization, ordering by dominance relations has achieved prominence. A general multiobjective optimization problem involves finding among a set of feasible solutions those that minimize a collection of objectives. One feasible solution is said to dominate another if it gives a smaller value for every objective. The collection of feasible solutions that are not dominated by any other solution are called Pareto-optimal or nondominated. In the database community the Pareto-optimal solutions are called the skyline of the dataset [24].

The notion of Pareto-optimality is widely used in evolutionary algorithms for multiobjective optimization

[29]

, such as the Nondominated Sorting Genetic Algorithm (NSGA-II)

[11], the Strength Pareto Evolutionary Algorithm (SPEA) [32, 33], and the Pareto envelope-based selection algorithm (PESA) [10], among many others (see [16] for a survey). Central to many of these algorithms is the assignment of a fitness to each feasible solution based on sorting all the feasible solutions via dominance.

The NSGA-II algorithm assigns its fitness level via nondominated sorting, sometimes called Pareto Depth Analysis (PDA), which arranges the feasible solutions into layers by repeatedly peeling off the Pareto-optimal solutions. Nondominated sorting has also found applications in gene selection and ranking [18], anomaly detection [21, 22]

, and multiquery image retrieval

[20]. As it turns out, nondominated sorting is equivalent to the longest chain problem

, which has a long history in combinatorics and probability

[2, 17, 30].

Due to the wide use of NSGA-II, there has been significant interest in fast algorithms for nondominated sorting [11, 23, 15]. Recently, Calder et al. [6] established a continuum limit for nondominated sorting that corresponds to solving a Hamilton-Jacobi partial differential equation (HJE). This result shows that there is a simple asymptotic structure underlying nondominated sorting, and this opens the door to extremely fast algorithms based on exploiting this structure. Calder et al. [7] recently proposed a sublinear algorithm for approximate nondominated sorting called PDE-based ranking

that is based on estimating the distribution of the data and solving the HJE numerically.

The purpose of this paper is twofold. First, we show how to use PDE-based ranking to significantly improve the performance of algorithms that are based on nondominated sorting. To illustrate this in a concrete setting, we propose a new real-time version of the multi-criteria PDA anomaly detection algorithm from [22] that uses PDE-based ranking in place of nondominated sorting. The computational complexity is reduced by an order of magnitude (from quadratic to linear), and this allows the model to be updated in real-time upon the acquisition of each additional data sample. We also prove in Theorem 2 a statistical convergence rate for the PDE-based ranking continuum approximation.

Second, we present a new partial differential equation (PDE) continuum limit for ordering solutions within the layers generated by nondominated sorting in two dimensions. This new continuum limit allows us to efficiently explore the tradeoff between multiple objectives. In the context of multi-criteria anomaly detection, we show how to use this PDE continuum limit to classify anomalies based on which criterion is more significantly violated. We give a derivation of these new continuum limits and present a convergence analysis. In both cases, we trade exact algorithms of high computational complexity for fast approximate algorithms that are convergent, meaning that the error in the approximation goes to zero as the sample size grows.

This paper is organized as follows. In Section 2 we review the continuum limit for nondominated sorting from [6], and present the PDA multicriteria anomaly detection algorithm from [22]. In Section 3, we derive two new PDE continuum limits for ordering points within Pareto fronts, in Section 4 we construct fast upwind schemes for solving these PDEs numerically. In Section 5, we present our fast PDE-based anomaly detection and classification algorithm in the context of streaming data, and in Section 6 we present the results of numerical experiments for both real and synthetic data streams.

## 2 Previous work

### 2.1 Nondominated sorting

Nondominated sorting arranges a set of points in into layers by repeatedly peeling off the coordinatewise minimal points. The coordinatewise partial order on is defined by

 x≦y  ⟺  ∀i  xi≤yi    (x,y∈Rd).

Let be a collection of points in . We say a point is minimal (or nondominated), if no other non-identical point in is smaller with respect to the coordinatewise partial order . The first nondominated layer, denoted , consists of the minimal points from . The second nondominated layer, denoted , consists of the minimal points from and the layer is defined recursively by

 Fk=Minimal points from Sn∖(F1∪⋯∪Fk−1).

Nondominated sorting refers to the process of arranging the set into the nondominated layers , which are also called Pareto fronts. See Figure 1 for a demonstration of nondominated sorting applied to random points. The index of the Pareto front that a point lies on is often called the Pareto depth or Pareto rank of , and provides the fitness score for the NSGA-II algorithm. In the context of multiobjective optimization, represents the number of objectives.

The original nondominated sorting algorithm proposed in [11] requires memory and operations. The quadratic memory complexity in renders the algorithm intractable for even moderate . Jensen [23] proposed an algorithm with asymptotic complexity of as . The two dimensional version of Jensen’s algorithm was discovered independently in the combinatorics community by Felsner and Wernisch [14]. Fortin et al. [15] recently made some improvements to Jensen’s algorithm regarding its treatment of points with identical coordinates. The exponential complexity of the Jensen-Fortin algorithm with respect to suggests it may not be useful for high dimensional problems. However, recent numerical results have suggested a better asymptotic complexity as with fixed [22]. We also mention there are several other notable approaches to nondominated sorting [28, 8, 13].

Calder et al. [6] discovered a Hamilton-Jacobi equation continuum limit for nondominated sorting. The result applies in the setting where is a sequence of i.i.d. random variables with probability density on the unit box . Define the Pareto depth function associated with nondominated sorting of by if and only if . The following continuum limit was established in [6, 4].

###### Theorem 1 (HJE Continuum Limit).

With probability one

 n−1dUn⟶Cdu   uniformly on [0,1]d as n→∞

where is a constant depending only on , and is the unique nondecreasing viscosity solution of the Hamilton-Jacobi equation

 {ux1⋯uxd=fin (0,1]du=0on ∂(0,1)d∖(0,1]d. (1)

Here denotes the partial derivative of with respect to , and by nondecreasing we mean that for all . We note that when is the uniform density, the solution of (1) is

 u(x)=d(x1⋯xd)1d. (2)

Figure 1 gives an illustration of this continuum limit when are independent and uniformly distributed. The continuum limit in Theorem 1 states that the Pareto fronts converge to the level sets of the viscosity solution of (1). While the value of is not needed for sorting, we should mention that it is known only in dimension , in which case  [26, 31].

Calder et al. [7] proposed a fast algorithm for approximate nondominated sorting called PDE-based ranking that is based on estimating the density function from a small subset of the data and then solving (1) numerically. PDE-based ranking can drastically reduce the computation time of nondominated sorting in low dimensions () while maintaining very high sorting accuracy.

Let us say a few words about viscosity solutions. Hamilton-Jacobi equations like (1) generally do not admit classical solutions (i.e., continuously differentiable solutions) due to the possibility of crossing characteristics. There are, however, infinitely many functions that are differentiable almost everywhere and satisfy (1) at each point of differentiability. The notion of viscosity solution selects from among these infinitely many feasible solutions the one that is ‘physically correct’ for a very wide range of problems. The viscosity solution is correct in this context because it captures the continuum limit of the Pareto depth function [6].

The notion of viscosity solution is based on the maximum principle and enjoys very strong stability properties. It is a notion of weak solution that allows merely continuous functions to be solutions of a fully nonlinear PDE. While viscosity solutions may not possess the derivatives appearing in the equation in the classical sense, the reader will not lose much in the way of understanding by assuming that is continuously differentiable. In the context of viscosity solutions, the maximum principle is used to prove a comparison principle, which says that subsolutions lie below supersolutions provided their boundary conditions do as well. The reason the notion of viscosity solution is correct in this context is that nondominated sorting also obeys a comparison principle; namely, if we introduce new points into our data set (i.e., we increase the point density), then the Pareto depth function increases as well. For more details on viscosity solutions we refer the reader to [1].

### 2.2 Anomaly detection

To illustrate the computational advantages of PDE-based ranking, we consider a concrete application of nondominated sorting to anomaly detection [22]. Anomaly detection refers to the problem of detecting patterns in data that deviate from the expected behavior. It is an important and challenging problem with a wide array of applications, including computer intrusion detection, video surveillance, credit card fraud, and biometrics [19, 9]. Many anomaly detection algorithms rely on the availability of a measure of distance (or similarity) between data samples, and look for anomalies by finding samples that are far from their nearest neighbors (see [22] and references therein). These algorithms are usually called similarity-based, and are widely used due to their simplicity and robustness.

In contrast, feature-based algorithms seek to embed the data into a relatively low dimensional Euclidean space and make use of the ambient Euclidean (or other) distance to detect anomalies. Techniques used for feature-based algorithms include support vector machines (SVM), clustering, neural networks, and statistical approaches based on density estimation

[9]. In this paper we consider similarity-based approaches.

In many applications, multiple measures of similarity may be required to detect certain types of anomalies. For example, when tracking pedestrians in video surveillance, one criterion may correspond to differences in individual walking speeds, while another might correspond to differences in the shapes of trajectories. Using multiple criteria allows one to detect a wider range of anomalies than could be obtained from a single criterion alone.

Hsiao et al. [22] proposed an algorithm for multi-criteria anomaly detection that integrates the information from multiple similarity measures via nondominated sorting (or Pareto Depth Analysis (PDA)). Suppose we have a training set consisting of objects and measures of similarity for comparing these objects. Without loss of generality, we assume —a lower score indicates the objects are more similar with respect to the criteria. The training phase of the algorithm consists of computing the dyads

 Xi,j=(c1(Yi,Yj),…,cd(Yi,Yj))∈[0,1]d, (3)

and constructing the Pareto depth function by applying nondominated sorting to the points . Recall that if and only if belongs to the Pareto front.

The testing phase of the algorithm receives a new object and compares it to all training samples to create new dyads given by

 Zj=(c1(Y,Yj),…,cd(Y,Yj)).

Fix a number , and let denote the indices of training samples that are among the nearest neighbors of with respect to at least one similarity measure . The anomaly score for is

 ν=1|I|∑j∈IUn(Zi), (4)

and is declared an anomaly when is larger than a predefined threshold . We note that it is possible to allow different values of for each criterion. The idea is that nominal samples should be close to many of their nearest neighbors from the training data in one or many similarities, and thus the dyads will lie on earlier Pareto fronts. An anomalous sample should be far from its nearest neighbors in the training set in many or all of the similarities, and the dyads will consequently fall on deeper fronts. The PDA anomaly detection algorithm has been validated on real and synthetic data in [21, 22] and has been shown to achieve state of the art results for integrating information from multiple similarities.

## 3 New PDE continuum limits

The Hamilton-Jacobi Equation (HJE) continuum limit (1) gives information about which Pareto front a sample lies on. It is also important in applications to know where a sample lies within its Pareto front, as this gives information about the trade-off between the multiple objectives. We present here some new PDE continuum limits for ordering of points within Pareto fronts in dimension two.

Suppose and let be a sequence of i.i.d. random variables with continuous density on . Apply nondominated sorting to and then order the points within each Pareto front by -coordinate. This defines a function given by

 Vn(Xi):=Index of Xi within its Pareto front. (5)

Figure 2 gives an illustration of .

By the continuum limit of nondominated sorting (Theorem 1), there are on the order of Pareto fronts. Since there are points in total, each front should have on the order of points. Therefore let us suppose that

 n−12Vn⟶v    as   n→∞

uniformly with probability one, where is continuously differentiable.

Fix a large value of and consider a point . Fix and let

 y=x+ε∇⊥u(x),

where . Since is tangent to the level set , we have , i.e., and are roughly on the same Pareto front. Let denote the rectangle whose diagonal is the line segment from to , and let denote the number of points on the Pareto front passing through and that fall within . See Figure 3 for an illustration of the setup. Then we have

 ε∇v(x)⋅∇⊥u(x)≈v(y)−v(x)≈n−12(Vn(y)−Vn(x))=n−12Ln. (6)

Here, denotes the gradient of . When is small, the random variables within are approximately uniformly distributed within . Furthermore, as illustrated in Figure 3, we can scale to the unit box without changing the partial ordering within . Hence, it is reasonable to conjecture that as , where is a universal constant and is the number of samples falling in . While the value of is not needed for sorting (since we perform a normalization in (9) below), a simple scaling argument suggests that , so we will take

. By the law of large numbers

 m∼n∫Afdx≈n|A|f(x)=nε2ux1ux2f(x),

since the side lengths of are and . Combining this with , (6) and yields

 ε∇v(x)⋅∇⊥u(x)≈f(x)ε.

Hence this simple heuristic argument suggests that

satisfies the linear transport equation

 (7)

Recall is the viscosity solution of (1). When we have . If we plug this into (7) and look for a separable solution of the form we find that when

 v(x)=−log(x2)√x1x2. (8)

Since each Pareto front has in general a different number of points, the values of within different fronts are difficult to compare. Therefore it is natural to consider the following normalization:

 Wn(Xi):=Vn(Xi)#F(Xi), (9)

where denotes the Pareto front that belongs to. The quantity is an index between 0 and 1 that gives information about where the point falls within its Pareto front. The arguments above suggest that

 Wn⟶w   uniformly with probability one, where
 w(x)=v(x)v(1,ψ(u(x)), (10)

and is the inverse of . In other words, we are normalizing by the asymptotic number of points on the front to which belongs.

The expression in (10) is difficult to work with numerically. We will instead derive a PDE for . Differentiating (10) yields

 v∇w=w∇v−w2vx2ψ′(u)∇u.

Take the dot product of both sides with and recall (7) to find that

 v∇w⋅∇⊥u=w∇v⋅∇⊥u=wf.

Since on , can be characterized as the solution of the following transport equation

 (11)

We note that it would seem equally reasonable to have chosen the boundary condition on instead. However, in this case it is easy to verify that would solve (11), so the solution is not uniquely determined by the boundary condition on . This issue arises numerically as well. Indeed, we have found experimentally that if we solve (11) numerically with an upwind scheme and the boundary condition on we find the solver automatically computes instead of , and it is unclear how to select the correct solution without changing the boundary condition to on . It is impossible to specify both boundary conditions simultaneously since the characteristic curves, which are the level curves of , flow through both boundaries.

Note that when we have and . Plugging these into (11) and using the method characteristics we find that for

 w(x)=log(x2)log(x1)+log(x2). (12)

See Figure 4 for a comparison of and to their continuum limits (7) and (11), respectively. While the arguments in this section are not rigorous, we present a convergence analysis in Section 3.1 that gives very strong numerical evidence for their validity. A rigorous proof is the subject of current investigation and is far outside the scope of this paper.

On a more technical note, since is not necessarily differentiable the velocity in the transport equations (7) and (11) is not well-defined. To make sense of these PDE rigorously, we can instead write them in divergence form, since

 −div(u∇⊥v)=∇v⋅∇⊥u.

This suggests that it is possible to prove existence and uniqueness of weak solutions (defined via integration by parts) of (7) in the Sobolev space , under the requirement that is merely continuous. Such results are outside the scope of this paper, and we intend to pursue them in a future work.

### 3.1 Convergence analysis

We present here a convergence analysis for the continuum limits (7) and (11) in the case that , i.e., the samples are independent and uniformly distributed on the unit box . In this case we can solve all three PDEs (1), (7), and (11) in closed form using the formulas (2), (8), and (12), respectively.

We performed a convergence analysis by drawing independent and uniformly distributed on and computing and according to their definitions (5) and (9), respectively. We measured the discrepancy with the continuum limits in the and norms, computed by

 ∥v−n−12Vn∥ℓ1:=1nn∑i=1|v(Xi)−n−12Vn(Xi)|.

and

 ∥v−n−12Vn∥ℓ∞:=max1≤i≤n|v(Xi)−n−12Vn(Xi)|,

respectively. The definitions of and are similar. Figure 5 shows the errors for a single realization and various values of ranging from to . Each of the errors is observed to be converging to zero at a rate of where , except for . Upon closer inspection, the function is discontinuous at , hence uniform convergence is impossible. This discontinuity reflects the fact that near the Pareto fronts cut off an infinitesimal portion of the top corner of the box, and hence transitions from to over an infinitesimally short distance.

## 4 Numerical schemes

We address in this section the problems of solving the PDEs (1), (7), (11) numerically. Since each PDE involves the solution of the previous PDEs, they must be solved in the order (1)-(7)-(11). We solve (1) numerically using the (formally) first order accurate upwind scheme presented in [5]. The scheme is similar to fast sweeping, but only requires sweeping the grid exactly once, and thus has linear complexity in the number of grid points.

Here, we show how to construct upwind finite difference schemes for the transport PDEs (7) and (11). Fix a grid resolution and for define . We will solve the transport equations on the grid . Both PDEs are degenerate when , which can only happen when vanishes (since ). To avoid this degeneracy, we numerically precondition the density by replacing with before solving (1) numerically.

The transport equation (7) can be written out as

 ux2vx1−ux1vx2=f  on (0,1)2

with . The unknown function is ; is obtained by solving (1). Since and the coefficients of and have opposite signs. Thus an upwind scheme will use either (A) backward differences for and forward differences for , or (B) vice-versa. The choice depends on the direction we want information to propagate. Since our boundary condition is on and information flows along Pareto fronts in the positive direction and negative direction, the correct choice for an upwind scheme is (A) backward differences in and forward differences in , which effectively forces the scheme to look backwards along the current Pareto front (or level set of ).

Thus, our upwind scheme for (7) is

 {ux2D−1vh−ux1D+2vh=fin (0,1)2h,vh(0,x2)=vh(x1,1)=0. (13)

where is the numerical solution, and are the finite differences defined by

 D±iv(x):=±v(x±hei)−v(x)h,

where and . At each grid point, (13) is linear equation that is readily solved for in terms of and to obtain

 vh(x)=ux2(x)vh(x−he1)+ux1(x)vh(x+he2)+hf(x)ux1+ux2. (14)

The numerical solution is computed by sweeping the grid exactly once in the upwind direction starting with the boundary condition . We note that the boundary condition is just for numerical convenience, and is not used directly by the scheme, since so on the line so the solution depends only on when . When computing , we replace and in (14) by first order finite differences of the solution of (1) on the same grid. Our numerical experiments suggest that the scheme is not sensitive to the choice of discretization of and . We note that convergence of the scheme (13) is a classical result when , however, is in general only Hölder continuous. We leave the analysis of the scheme for to future work.

We now consider the second transport equation (11). If we again write the PDE out we have

 vux2wx1−vux1wx2=wf  in (0,1)2

with boundary condition . Since and we have the same upwind choices as before. However, here we want information to propagate from the boundary where or backwards along Pareto fronts (level sets of ) in the direction . Thus we use forward differences for and backward differences for , and our upwind scheme for (15) has the form

 {vhux2D+1wh−vhux1D−2wh=whfin (0,1)2h,wh(1,x2)=wh(x1,0)=1. (15)

This is a linear equation that can be solved for in terms of and to obtain

 wh(x)=vh(x)ux2(x)wh(x+he1)+vh(x)ux1(x)wh(x−he2)hf(x)+vh(x)ux2(x)+vh(x)ux1(x). (16)

This scheme can also be solved in a fast sweeping pattern visiting each grid point exactly once, and thus has linear complexity. Note that the boundary condition is not directly used by the scheme, since . We impose the condition simply for convenience. We note here as well that convergence of the scheme is classical when , and we leave the case of to future work.

## 5 Real-time anomaly detection

We propose here a modification of the PDA multicriteria anomaly detection algorithm [22] to the setting of online streaming data. Suppose we have a stream of possibly nonstationary data , and measures of similarity for comparing data samples. As before we suppose that . In the streaming setting, we observe the data sequentially and must determine whether is an anomaly based only on the previous history . Due to memory and computational constraints, it may not be feasible to use this entire history, especially when is large. Therefore, we fix and consider the windowed history

 Ht={Ys:t−T≤s≤t−1}. (17)

We use the history as training data in order to determine whether is an anomaly. Even without memory constraints, only the recent history can be considered reliable when the data is nonstationary. As before, we define dyads

 Xr,s=(c1(Yr,Ys),…,cd(Yr,Ys))∈[0,1]d

corresponding to every pair of the data stream. If we use the PDA anomaly detection algorithm with exact nondominated sorting, then we would need to store in memory all of the dyads corresponding to pairs from the history . Since the addition of a single new sample can potentially affect the arrangement of all the Pareto fronts, re-training the model when new samples are acquired requires applying nondominated sorting to all dyads, which has complexity slightly worse than for memory and operations. This makes it impossible to update the model frequently in the streaming setting without considering some type of approximation to the sorting.

Using PDE-based ranking we can reduce this complexity to

. We keep a running estimate of the marginal distribution of the dyads using the following kernel density estimator

 ft(x)=1nhd∑t−T≤r

Although there are terms in the sum above, the density estimation can be updated recursively in time by writing where

 gt(x)=1nhdt−1∑s=t−TK(x−Xs,th)−K(x−X(t−T−1),sh).

In our experiments, we use a simple histogram estimator, which is a special case of (18). We then compute an approximation of the Pareto depth function by solving the HJE (1) numerically using the estimated density . By Theorem 1 the continuum approximation of the anomaly score is

 νt=1|It|∑s∈ItUt(Xs,t), (19)

where denotes the indices of samples from the history that are among the nearest neighbors of with respect to for at least one . We declare anomalous if is greater than a predefined threshold . The steps above work in arbitrary dimension as well.

For the anomaly classification, we specialize to the case of . If is declared an anomaly, we then solve the transport equations (7) and (11) using the schemes (13) and (15), respectively, to obtain . We define the anomaly classification score

 μt=1|It|∑s∈ItWt(Xs,t), (20)

and we declare a -anomaly if , and a -anomaly if . The idea is that if a sample is a -anomaly, then the first coordinate of the dyads should be larger on average than in the training set, which is our windowed history. Therefore, the dyads corresponding to will be on average further to the right along each Pareto front. This corresponds to a front index larger than on average. The situation is similar for anomalies, except that the dyads will be biased towards the left side of the fronts. See Algorithm 1 for a summary of our algorithm in pseudocode.

There are some obvious modifications we could make to improve the performance of the algorithm. First, the continuum limit PDE need not be solved at every iteration, and could instead be solved only periodically, or whenever the density estimation has substantially changed. Second, to keep track of a larger history without incurring additional costs, the history could contain elements equally (or randomly) spaced among the previous samples, where . The algorithm would remain otherwise unchanged and the complexity of each iteration remains .

Since the PDE-based anomaly detection algorithm is based on continuum approximations, it is natural to seek a quantification of the approximation error. If we assume the dyads are i.i.d. we can prove the following convergence rate.

###### Theorem 2 (Convergence Rate).

Let be i.i.d. with a Lipschitz continuous probability density , where . For let denote the numerical solution of (1) obtained via estimating from via a histogram aligned to the grid of spacing on . Then there exist constants such that

 max[0,1]2h|ˆuh−u|≤C1√h (21)

holds with probability at least , where is the viscosity solution of (1).

Theorem 2 suggests we should choose as a function of so that . In particular, if we choose so that

 limn→∞nh5log(n)=∞,

then by the Borel-Cantelli Lemma almost surely and uniformly on as . We also note that Theorem 2 extends easily to higher dimensions . In this case the same convergence rate (21) holds with probability at least

 1−exp(−C2nh2d+1−dlog(h)).
###### Proof of Theorem 2.

Let be independent and identically distributed random variables on with Lipschitz continuous density . Recall that is assumed to be positive on , i.e., there exists such that . Let be the grid resolution for solving (1) numerically and for estimating the density with a histogram estimator, where . For and let

 Bij=[(i−1)h,ih)×[(j−1)h,jh)

denote the grid cell corresponding to , and let denote the number of samples from falling in . Then is a Binomial random variable with parameters and

 pij=∫Bijf(x)dx.

By the Chernoff-Hoeffding bound (see, e.g., [12]) we have

 P(|Nij−ENij|≥t)≤exp(−2t2n) (22)

for all . Let denote the grid points. The histogram estimation of at grid point is given by

 ˆfh(xij):=Nijn|Bij|=Nijnh2.

Combining this with (22) we have

 P(∣∣ˆfh(xij)−Eˆfh(xij)∣∣≥t)≤exp(−2nh4t2) (23)

Since we have

 ∣∣f(xij)−Eˆfh(xij)∣∣=1h2∣∣∣∫Bijf(xij)−f(x)dx∣∣∣≤1h2∫Bij|f(xij)−f(x)|dx≤Ch, (24)

due to the fact that is Lipschitz. Here, depends on the Lipschitz constant of , which is defined by

 Lip(f)=supx≠y|f(x)−f(y)||x−y|.

It follows from (24) that

 |f(xij)−ˆfh(xij)| ≤∣∣ˆfh(xij)−Eˆfh(xij)∣∣+∣∣f(xij)−Eˆfh(xij)∣∣ ≤∣∣ˆfh(xij)−Eˆfh(xij)∣∣+Ch.

Combining this with (23) and the union bound, there exists such that

 P(∥ˆfh−f∥∞≥λ)≤exp(−2nh4(λ−C1h)2−2log(h)), (25)

for all , where

 ∥u−v∥∞:=maxxij∈[0,1]2h|u(xij)−v(xij)|.

Let and denote the numerical solutions of (1) on the grid of spacing computed with and on the right hand side, respectively. Standard maximum principle arguments (see [5]) yield

 ∥ˆuh−uh∥∞≤C∥ˆfh−f∥∞, (26)

where the constant depends on the lower bound on . By [5, Theorem 1,2], there exists a constant such that

 ∥uh−u∥∞≤C√h.

Combining this with (26) we have

 ∥ˆuh−u∥∞≤C2(∥ˆfh−f∥∞+√h),

for some . By (25)

 P(∥ˆuh−uh∥∞≥C2(λ+√h))≤exp(−2nh4(λ−C1h)2−2log(h)).

Setting for large enough we have

 P(∥ˆuh−u∥∞≥C3√h)≤exp(−C4nh5−2log(h)),

for all . ∎

## 6 Numerical results

We present several experiments that provide numerical evidence supporting the above arguments and outlining the effectiveness of our algorithm. The first two experiments were performed using synthetic data from [21, 22]. The streaming experiments consist of total samples with a window history of . To underscore the adaptive nature of our algorithm, each of these experiments incurs a significant trend change in the middle of the stream. The third and final experiment was performed with a real pedestrian trajectory data set from a video surveillance problem.

To evaluate the performance of the streaming algorithm we use a Receiver Operating Characteristic (ROC) curve and its resulting Area Under the Curve (AUC). We consider how the AUC varies with time as the algorithm takes in points from a stream. When changing the trend in the simulated streams, we also accordingly change the data used to generate the ROC curves, thereby giving us an appropriate method to visualize the learning aspect of the algorithm. In each simulated data stream we evaluate both the anomaly detection and anomaly classification. The results presented below represent the average of 20 trials. All PDEs were solved on a grid.

In each experiment, we compare our continuum limits against the exact sorting PDA algorithm from [21, 22] and we see little to no difference in anomaly detection performance. The PDE-approximations reduce the complexity by an order of magnitude—from to . To give an idea of the difference in CPU time, each trial in the experiments below takes 27 seconds to process with the PDE-approximations, compared to 413 seconds with exact sorting. If we increase the stream length and data history by a factor of 3, the PDE-approximations take 160 seconds, while the exact sorting PDA algorithm takes over 9.3 hours.

### 6.1 Uniformly distributed data

The first experiment conducted with synthetic data took i.i.d. uniform samples on to be nominal, and uniform samples from the region to be anomalous. Halfway through the stream the nominal region was changed to the box , and the corresponding anomalous region was changed to . The two similarity criteria were simply taken to be the component-wise differences and , respectively. The nearest neighbour parameters were chosen as . At each time step in the simulated stream there was a 0.05 probability of drawing from the anomalous region.

Figure 6 shows the resulting AUCs at each time step. As expected, one can see a significant drop in the AUC of the anomaly detection at the mid-point when the trend is changed. We observe a sharp recovery of the AUC of the anomaly detection once the training history contains a significant number of samples from the new distribution. This illustrates how the algorithm can quickly and efficiently learn a new trend in the data. Note that the AUC for the anomaly classification remains unchanged throughout the experiment because the classification of the anomalies in the new trend are the same with respect to the old trend.

### 6.2 Categorical data

For the second experiment, we used the synthetic categorical data from [22]. Each sample consists of 2 groups of categorical data and . Each group is comprised of 20 different attributes, where each attribute can assume a different number of values. The number of possible values for the attribute of the group, denoted , is chosen uniformly at random between 6 and 10. Each attribute is then assigned a categorical distribution with parameters which are in turn drawn from a Dirichlet distribution with parameters .

The nominal distribution is characterized by setting and for every for every attribute. This forces a bias towards attributes assuming the value one. For the anomalous distribution we set for every , so that no attribute has a bias towards assuming any particular value. Halfway through the stream the nominal distributions were changed so that for every attribute, the parameters of the categorical distribution were drawn from a Dirichlet distribution with parameters and for every . This shifts the nominal bias towards the value two. The anomalous distribution was unchanged.

To generate a nominal sample, we draw from the nominal distribution for each group. To generate an anomalous sample, we randomly choose a group with probability and draw from the anomalous distribution for that group, and nominal distribution for the other. At each time step in the stream there was a probability of drawing an anomalous sample.

The similarity between samples was computed between respective groups using the Inverse Occurrence Frequency (IOF) measure presented in [3]. The Goodall2 and Overlap metrics gave similar performance. The nearest neighbour parameters were chosen as . Figure 6 shows the resulting AUCs at each time step. Similar to the previous experiment we observe a drop in the AUC of the anomaly detection and a recovery thereafter. We also observe a similar drop in anomaly classification and the corresponding recovery. In contrast to the previous example, the new anomalies are anomalies in both criteria with respect to the old trend, so that the classification has no bias towards a specific criteria.

### 6.3 Pedestrian trajectories

Our third experiment consisted of data from a real pedestrian trajectory data set [27], with over 100,000 trajectories. The first similarity criterion used to compare trajectories was their difference in shape, given by the

-distance between interpolated trajectories. The second was their difference in walking speed, given by the

-distance between the velocity histograms of each trajectory.

As a preliminary experiment, we tested the anomaly detection and anomaly classification on 1666 trajectories from a single day. These trajectories were hand-labelled as normal or anomalous by [22]. In each experiment the training set consisted of 500 trajectories randomly drawn from a total of 1666 trajectories that day. The mean AUCs of the PDE-based and exact sorting based algorithms were and , respectively and the ROC curves are shown in Figure 7(a). We observe very little difference between the exact sorting and the PDE-approximations. We cannot present quantitative results for anomaly classification in this setting as there is no ground truth labeled data available. Along with some normal trajectories, we also plotted some anomalous trajectories with their respective classification scores in Figure 7(b,c).

Finally, we applied the PDE-based streaming anomaly detection algorithm to a large portion of the pedestrian dataset, spanning over several days of data. Figure 6 shows the AUC as a function of artificial time for a simulated stream consisting of 15,000 trajectories with an initial training set of 400 randomly drawn trajectories. The small labeled portion of the dataset (approx. 1000 trajectories) was used to generate the ROC curves.

## 7 Conclusion

In this paper, we showed how to use some recently discovered PDE continuum limits for nondominated sorting to perform anomaly detection and classification in real-time in a streaming setting. The classification is performed using new PDE continuum limits for ordering within the nondominated layers. We proved convergence rates for the continuum approximations and presented the results of numerical experiments with synthetic and real data that show our algorithm can adapt quickly and efficiently to a changing data stream. Although we focused in this paper on the anomaly detection problem, the ideas are not restricted to this context. Indeed, nondominated sorting is widely used in multiobjective optimization, and the ideas in this paper potentially apply to any such application, leaving many interesting problems for future work.

In particular, we outline below some directions for future work that we are currently investigating.

1. The arguments used in Section 3 to derive the new PDE continuum limits (7) and (11) for sorting points within layers are not rigorous. We are currently investigating a rigorous proof of these conjectured continuum limits.

2. The upwind finite difference schemes for the transport equations (7) and (11) presented in Section 4 are provably convergent only when , which is not generally true, and only when we assume the exact values of and are used in the scheme. It would be interesting to prove convergence of these new upwind schemes without the assumption that , and under the more realistic condition that and are replaced by their finite difference approximations in the schemes for (7) and (11).

3. The PDEs for sorting points within fronts were presented in only dimensions here. It would be interesting to extend these results to higher dimensions. In dimensions, there is no canonical linear ordering of the points within each front. Instead, we can consider nondominated sorting of the points within each front under a partial order that “forgets” about one of the coordinates (that is, we project to by removing the coordinate and apply the usual nondominated sorting). This is akin to sorting the points within each front with respect to the direction. Thus, we have different ways to sort the points along each Pareto front, and similar arguments can be used to derive different continuum limit PDEs for sorting functions of the form

 ∏i≠k∇vk⋅∇⊥k,iu=ud−2xkf  in (0,1)d,

with boundary condition on . Here, , and