Online Data Thinning via Multi-Subspace Tracking

09/12/2016 ∙ by Xin Jiang, et al. ∙ 0

In an era of ubiquitous large-scale streaming data, the availability of data far exceeds the capacity of expert human analysts. In many settings, such data is either discarded or stored unprocessed in datacenters. This paper proposes a method of online data thinning, in which large-scale streaming datasets are winnowed to preserve unique, anomalous, or salient elements for timely expert analysis. At the heart of this proposed approach is an online anomaly detection method based on dynamic, low-rank Gaussian mixture models. Specifically, the high-dimensional covariances matrices associated with the Gaussian components are associated with low-rank models. According to this model, most observations lie near a union of subspaces. The low-rank modeling mitigates the curse of dimensionality associated with anomaly detection for high-dimensional data, and recent advances in subspace clustering and subspace tracking allow the proposed method to adapt to dynamic environments. Furthermore, the proposed method allows subsampling, is robust to missing data, and uses a mini-batch online optimization approach. The resulting algorithms are scalable, efficient, and are capable of operating in real time. Experiments on wide-area motion imagery and e-mail databases illustrate the efficacy of the proposed approach.



There are no comments yet.


page 3

page 7

page 20

page 21

page 22

page 24

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Modern sensors are collecting high-dimensional data at unprecedented volume and speed; human analysts cannot keep pace. For instance, many sources of intelligence data must be translated by human experts before they can be widely accessible to analysts and actionable; the translation step is a significant bottleneck [1]. Typical NASA missions collect terabytes of data every day [2, 3, 4, 5]. Incredibly, the Large Hadron Collider (LHC) at CERN “generates so much data that scientists must discard the overwhelming majority of it—hoping hard they’ve not thrown away anything useful.” [6] There is a pressing need to help analysts prioritize data accurately and efficiently from a storage medium or a data stream. This task is complicated by the fact that, typically, the data is neither thoroughly annotated nor meaningfully catalogued. Failure to extract relevant data could lead to incorrect conclusions in the analysis, while extraction of irrelevant data could overwhelm and frustrate human analysts, throttling the discovery process.

This paper focuses on scalable online data processing algorithms that can winnow large datasets to produce smaller subsets of the most important or informative data for human analysts. This process is described as “data thinning.” Often, the data thinning process involves flagging observations which are inconsistent with previous observations from a specified class or category of interest, or are ranked highly according to a learned ranking function. Typically we are interested in methods which can perform these assessments from streaming data, as batch algorithms are inefficient on very large datasets.

One generic approach to the problem of data thinning for large quantities of (possibly streaming) high-dimensional data requires estimating and tracking a probability distribution

underlying the stream of observations , and flagging an observation as anomalous whenever for some small threshold , as demonstrated in past work [7, 8]. Ultimately, the goal is to ensure that the flagged data is salient to human analysts on the receiving end without being buried in an avalanche of irrelevant data. Within this general framework, there are three key challenges:

  • Dynamic environments: The data may not be from a stationary distribution. For example, it may exhibit diurnal, location- or weather-dependent patterns. Effective data thinning methods must adapt to those dynamics and sources of bias. Global summary statistics and naive online learning algorithms will fail in this context.

  • High-dimensional data: Individual data points may be high-dimensional, resulting in the classical “curse of dimensionality” [9, 10]. While large quantities of data may be available, the combination of high-dimensional data and a non-stationary environment still results in an ill-posed estimation problem.

  • Real-time processing: In applications like those with NASA and CERN, large quantities of streaming data preclude computationally intensive or batch processing.

1.1 Data thinning for wide-area motion imagery

While our approach is not restricted to imaging data, one important application of our data thinning approach is real-time video analysis. Recent advances in optical engineering have led to the advent of new imaging sensors that collect data at an unprecedented rate and scale; these data often cannot be transmitted efficiently or analyzed by humans due to their sheer volume. For example, the ARGUS system developed by BAE Systems is reported to collect video-rate gigapixel imagery [11, 12], and even higher data rates are anticipated soon [13, 14, 15]. This type of data is often referred to as wide-area motion imagery (WAMI). Currently WAMI streams are used primarily in a forensic context – after a significant event occurs (e.g., a security breach), the data immediately preceding the event are analyzed reactively to piece together what led to that event. However, there is a strong need for predictive analysis which can be used to help anticipate or detect negative events in real time.

Unfortunately, the latter form of analysis is often infeasible for two reasons: (1) the data acquisition rate exceeds the capacity of many sensor platforms’ downlinks; and (2) size, weight, and power constraints limit processing capabilities on airborne sensor platforms. Thus an emerging and fundamental challenge is efficiently downloading salient information to ground-based analysts over a limited-bandwidth channel. While data compression has a long history, conventional compression methods may distort information particularly relevant to analysts. In particular, standard motion imagery compression techniques typically focus on optimizing peak signal-to-noise ratio or psycho-visual metrics which apply globally to an entire video and are often unrelated to any specific task.

Instead, a better solution would be to identify unique objects or regions of WAMI, and transmit only features of these objects. This concept is illustrated in Fig. 1. Ideally, this method will identify regions and features of a data stream most critical to a given task, and prioritize these features when preparing data for storage or transmission. This task is clearly related to “visual saliency detection” (cf., [16, 17, 18, 19]); we describe the connections between the proposed work and saliency detection in Section 2.

Note that in this setting a key challenge is that the sensor may be placed on a vibrating platform that introduces significant jitter into the data and precludes direct comparison of successive frames. While real-time video stabilization has been considered in the video processing literature (cf., [20, 21, 22, 23, 24]), such methods are often robust for small motions associated with a hand-held device and break down with large motions associated with mechanical vibrations. More robust methods capable of processing larger degrees of jitter can be computationally prohibitive on energy-constrained platforms.

Figure 1: Conceptual illustration of proposed objectives. An airborne platform collects wide-area motion imagery (WAMI), identifies task-specific salient patches, and transmits only those patches. The ground-based receiver can then perform more sophisticated processing, including registration, geolocation, and activity analysis.

1.2 Problem formulation and approach

Suppose we are given a sequence of data and for , , where denotes the ambient dimension. Assume that comes from some unknown distribution, i.e., there exists some sequence of distributions such that

where evolves over time, and its distribution density function is denoted by . The goal is to find the that are unusual or anomalous. In particular, we assign each observation an anomalousness score proportional to its negative log likelihood under the estimated model—i.e., . Observations with a low anomalousness score can then either be directed to a human analyst or flagged for further processing and analysis.

The key challenge here is two-fold: (a) the dimension of the signal, , can be quite large, and (b) may evolve rapidly over time. The combination of these factors means that our problem is ill-posed, because we are unlikely to gather enough samples to reliably learn the density .

This paper proposes a method for estimating and tracking the time-series of density functions over

. In stationary, low-dimensional settings, we might consider a Gaussian mixture model that could be estimated, for instance, using an online expectation-maximization (EM) algorithm

[25]. However, the non-stationary setting and high dimensions make that approach unviable, as we demonstrate experimentally later in the paper. The proposed approach, by contrast, considers a constrained class of Gaussian mixture models in which the Gaussian covariance matrices (each in the positive-semidefinite cone ) are low-rank. This model is equivalent to assuming most lie near a union of low-dimensional subspaces. While this union of subspaces is unknown a priori, we may leverage recent advances in subspace tracking (cf., [26, 27, 28, 29]) and subspace clustering (cf., [30, 31, 32, 33, 34, 35]) to yield an accurate sequence of density estimates , and mitigate the curse of dimensionality.

In addition, we consider certain computational and statistical tradeoffs associated with the data thinning problem. In particular, there are two ways to reduce the computational complexity associated with computing anomalousness scores. First, we can reduce the frequency with which we update our model. Second, we can subsample the elements of each and leverage missing data models for fast calculations and updates. We demonstrate that these methods, which are not amenable to standard stochastic filtering methods, can yield significant computational speedups with only small decreases in thinning accuracy.

1.3 Contributions and paper layout

This paper presents a data thinning method for high-dimensional streaming data in a dynamic environment. The algorithm adapts to changing environments using tracking methods for union of subspaces. As shown by both synthetic and real-data experiments, the algorithm (a) efficiently tracks the subspaces in which most observation lie and hence precisely detects observations that occur with low probability, and (b) can be applied to a variety of real-world applications and tasks.

Section 2 describes related work. Section 3.1 explains the probability density model based on unions of subspaces, and Section 3 presents the algorithm for tracking such densities. Section 4 describes the computational and statistical tradeoffs associated with the proposed approach. Section 5 reports synthetic experiments which demonstrates the ability of the algorithm to precisely track the density and detect anomalous signals within a changing environment. Section 6, tests the algorithm on the wide-area motion imagery (WAMI) videos to detect salient objects, while Section 7 tests the algorithm on the Enron email database to detect major events.

2 Related work

While data thinning is an emerging concept associated with modern high-dimensional, high-velocity data streams, the formulation described in Section 1.2 is closely related to anomaly detection, visual saliency detection, and subspace clustering and tracking.

2.1 Anomaly detection

The study of anomaly detection has a long and rich history, where the earliest papers can date back to the 19th century [36]. Despite the long history of the study of anomaly detection, most existing detection methods do not work well with high dimensional data, and often do not work online. A 2009 survey on anomaly detection [37] categorizes the available methods into classification-based methods, nearest neighbor-based methods, cluster-based methods, information theoretic methods, statistical anomaly detection methods, and spectral methods.

Among the six categories, classification based methods (cf., [38, 39, 40, 41, 42]) require a large training pool with labeled data that is typically unavailable in the settings of interest. Also, the classification based methods depend highly on the training data, and do not effectively adapt to changing environments. Nearest neighbor (cf., [43, 44, 45, 46, 47]) and cluster-based methods (cf., [48, 49, 50, 51]) can both be extended to work online, but the computational costs are usually high, scaling with the amount of data. Furthermore, the performance of the nearest neighbor and cluster-based methods highly depend on the distance measure, and the optimal distance measure is highly problem-dependent.

Certain statistical methods (cf., [52, 53, 54, 55]

) assume that the data are drawn from some standard or predetermined distribution, and determines outliers by computing the likelihood of the signal coming from such distributions. These methods can often work online, and do not rely on a big training set, but estimating the distribution of high-dimensional data is a non-trivial task, and the statistical assumptions do not always hold true, especially for high-dimensional data where there could be spatial correlations.

Information theoretic techniques (cf., [56, 57, 58]) identify the anomalies by trying to find a small subset such that removing the subset will greatly reduce the complexity of the whole set. The approach requires no supervision, and does not make assumptions about the underlying statistical distribution of the data. However, they usually have exponential time complexity and are batch methods. Additionally, it is difficult to assign anomalousness scores to a single data point.

Spectral methods (cf., [59, 60, 61, 62]) assume that data can be embedded into a lower dimensional subspace, and detect anomalies over the embedded space rather than the original space. Because spectral methods essentially operate on a reduced-dimensional representation of the data, they are well-suited to high-dimensional data. Spectral methods can also be integrated with other methods, and are thus highly versatile. However, spectral methods can incur high computational costs; even online anomaly detection algorithms (cf., [63], [64] and [65]) face this challenge. Furthermore, the subspace model underlying spectral methods is less flexible than the union of subspace model underlying this paper’s proposed method.

2.2 Visual saliency detection

In the special case of imagery or video data, data thinning is closely related to visual saliency detection. Like anomaly detection, saliency detection has been widely studied over the last few decades. A standard benchmark for comparison in image saliency detection is proposed by Itti et al. in [16]. This paper attempts to explain human visual search strategies, using biologically motivated algorithms. However, this algorithm is too slow to apply to real time videos. Hou and Zhang in [17] use spectral analysis to detect salient objects for faster speed. However, the analysis breaks down when multiple types of salient objects are present in the scene. Graph-based methods (cf., [18]) work well even when there is no central object in the scene, which is often difficult for other methods to handle, but suffers from high computational complexity. Rao et al. proposed a cluster-based algorithm in [19], where the salient object is identified by first clustering all the pixels according to their local features, and then finding the group of pixels that contains the most salient information. It works better than [16], but not as well as the graph-based algorithms. The information theoretic model based algorithm proposed in [66] claims to work as well as [16], but requires less tuning. This work is improved in [67] for faster speed and better performance.

Methods for image saliency detection have been extended to video saliency detection, but those methods assume a stable imaging platform and video stream free of jitter. In the WAMI application described above, however, sensors can be placed on vibrating platforms that preclude most video saliency detection methods.

2.3 Subspace clustering and tracking

The proposed method is also closely related to the subspace clustering and tracking algorithms. Subspace clustering is a relatively new, but vibrant field of study. These methods cluster observations into low-dimensional subspaces to mitigate the curse of dimensionality, which often make nearest-neighbors-based methods inaccurate [68]

. Early works in the field can only identify subspaces that are parallel to the axes, which is not useful when the data is not sparse, but lives on an arbitrarily oriented hyperplane. Newer methods

[30, 31, 32, 33, 34, 35], which are also called correlation clustering methods, can identify multiple arbitrarily angled subspaces at the same time, but all share the same problem of high computational cost. Even [35], which is shown to beat other methods in speed, still has an overall complexity of , where is the dimension of the problem, and is the total number of data points. More recent methods based on sparse modeling (cf., [69, 70, 71, 72, 73]) require solving convex optimization problems that can be inefficient in high-dimensional settings. Thus, the high complexity of the algorithms make them less than ideal candidates for an efficient online algorithm.

Subspace tracking is a classical problem that experienced recent attention with the development of algorithms that are robust to missing and outlier elements of the data points . For example, the Grassmannian Rank-One Update Subspace Estimation (GROUSE) [26], Parallel Estimation and Tracking by REcursive Least Squares (PETRELS) [27, 28], and Robust Online Subspace Estimation and Tracking Algorithm (ROSETA) [29]

effectively track a single subspace using incomplete data vectors. These algorithms are capable of tracking and adapting to changing environments. The subspace model used in these methods, however, is inherently strong, whereas a plethora of empirical studies have demonstrated that high-dimensional data often lie near manifolds with non-negligible curvature

[74, 75, 76].

In contrast, the non-parametric mixture of factor analyzers [77] uses a mixture of low-dimensional approximations to fit to unknown and spatially-varying (but static) curvatures. The Multiscale Online Union of SubSpaces Estimation (MOUSSE) method developed by Xie et al. [78] employs union of subspaces tracking for change point detection in high-dimensional streaming data. Thanks to the adoption of the state-of-the-art subspace tracking techniques, the algorithm is both accurate and efficient (with complexity linear in ). However, MOUSSE cannot be directly applied for our data thinning task for a few reasons. First, MOUSSE is designed for change-point detection and does not have a probabilistic model. Thus observations in a rare subspace would still be treated as typical, which makes it difficult to discover the rare observations. Second, MOUSSE can only process one observation at a time, i.e., it does not allow for mini-batch updates that can be helpful in data thinning applications, where data could arrive in blocks. Last but not least, although MOUSSE is able to deal with missing data, [78] does not explore the computational-statistical tradeoffs that are important for time- or power-sensitive applications. This paper presents a method that is designed for the data thinning task, has a specific statistical model, and allows for mini-batch updates which increases the algorithm’s efficiency. Also, we will explore the computational-statistical tradeoffs in Section 4.

3 Data thinning via tracking union of subspaces

3.1 Union of subspaces model

Recall from Section 1.2 that each is assumed to be drawn from a distribution with density , and that is modeled as a mixture of Gaussians where each Gaussian’s covariance matrix is the sum of a rank- matrix (for

) and a scaled identity matrix. We refer to this as a

dynamic low-rank GMM. In particular, the Gaussian mixture component is modeled as

where is the mean and

Here is assumed to have orthonormal columns, and is a diagonal matrix with positive diagonal entries. If , then would be rank- and any point drawn from that Gaussian would lie within the subspace spanned by the columns of – shifted by . By allowing we model points drawn from this Gaussian lying near that -dimensional shifted subspace. Overall, we model


where is the number of mixture components in the model at time and is the probability of coming from mixture component .

To better understand this model, we can think of each observation as having the form , where lies in a union of subspaces (or more precisely, because of the Gaussian means, a union of hyperplanes) defined by the s and within ellipsoids embedded in those hyperplanes, where the ellipsoid axis lengths are determined by the s.

Fig. 2 illustrates the union of subspaces model. Fig. (a)a shows a sample image where one person is walking on a road with trees on both sides [79]. In such a situation, we would want to be able to learn from a sequence of such images that the trees, grass and the road which occupy most of the pixels are typical of the background, and label the person as salient because it is uncommon in the scene. Fig. (b)b illustrates the union of subspaces model. When we divide the image into patches, the vast majority of patches are plant, and road patches, and only a few patches contain the person. Thus, the plant and road patches live on a union of subspaces as illustrated and can be thinned, leaving anomalous patches for further analysis.

(a) Image of a pedestrian walking on a road with trees on the sides
(b) Illustration of the union of subspaces idea
Figure 2: Illustration of the union of subspaces idea. Fig. (a)a shows a pedestrian walking on a road with trees on the sides [79]. The road and the plants occupy most of the pixels, and they can be considered living in a union of subspaces. The person on the road would be considered as an outlier.

3.2 Algorithm highlights

This section explains how the proposed method estimates the evolving Gaussian mixture model using the techniques from the union of subspaces tracking algorithms. These steps are summarized in in Fig. 3. As seen, this data thinning method shares some features with the online EM algorithm for GMM estimation. However, there are a few key differences which are elaborated below:

  • We constrain covariances to lie in a union of subspaces, which significantly reduces the problem size for estimating the covariance matrices. This constraint improves the accuracy of the algorithm, and also makes our method much stabler when the environment is changing rapidly relative to the data availability. This constraint also reduces computation time. (More details of computational complexity are discussed in Section 3.5.)

  • In some settings, such as when working with WAMI data, we receive groups of ’s simultaneously and can perform model updates more efficiently using mini-batch techniques. (The mini-batch approach is discussed in Section 3.3.3.)

  • For large, high-velocity data streams, real-time processing is paramount. Even evaluating the likelihood of each new observation can be time consuming. We explore subsampling-based approximations which reduce computational burden yet still yield accurate results. (Accuracy and computational complexity tradeoffs are discussed in Section 4.)

  • For the online EM algorithm for GMM estimation, the number of mixture components is selected a priori, and does not change for the duration of the task. This would work when the environment does not change over time, but is inappropriate for applications that work in dynamic environments. The proposed method adapts to changing numbers of mixture components, which allows the mixture model to better track the environmental dynamics. The method adapts the number of mixture components using a multiscale representation of a hierarchy of subspaces, which allows us to reduce the model order using a simple complexity regularization criterion. The method also tracks hidden subspaces which are then used to increase the model order when data calls for it. (More details about the multi-scale model is discussed in Section 3.3.)

Figure 3: Flow chart of the main steps in the data thinning method.

3.3 The Online Thinning algorithm

This section describes the updates of the parameters associated with the proposed dynamic low-rank GMM in (1). The updates of the mixture component weights () and means (

) are computed using stochastic gradient descent. The updates of the covariance matrices are more sophisticated and leverage subspace tracking methods. In particular, we focus on methods which admit observations

with missing elements; this will allow us to subsample for computational speedups. These updates are detailed below.

The biggest challenge is updating , the number of mixture components. In real-life applications, the number of mixture components is in general (a) not known a priori, and (b) can change with . Thus a mechanism for adaptively choosing the number of subspaces is needed. Reducing model order is slightly less challenging because it is relatively simple to merge two nearby mixture components. However, increasing model order is a much more complex issue, especially in an online setting.

To address these challenges, we organize these mixture components using a tree structure, as illustrated in Fig. 4. The idea for a multiscale tree structure stems from the multiscale harmonic analysis literature [80] and online updates of such models are introduced in [78]. In our setting, at time , the

node is associated with a Gaussian distribution parameterized by its mean vector

, low-rank covariance matrix parameters , and weight . Most of the probability mass associated with each Gaussian is an ellipsoid centered at , where and characterize the principle axes and principal axis lengths, respectively, of the ellipsoid. Finally, is approximately the probability of an observation falling inside this ellipsoid.

In the tree structure, we denote the set of leaf nodes as and have . The leaves of the tree correspond to the Gaussian mixture components in the model shown in Eq. (1). Each parent node corresponds to a single Gaussian which approximates the weighted sum of the Gaussians associated with its two children, where the weights correspond to the children’s parameters. Each of the tree leaves is also associated with two virtual children nodes. The virtual children nodes correspond to their own Gaussian distributions that can be used to grow the tree. The decision of pruning and growing are made based on (a) the accuracy of the Gaussian mixture model, i.e., the cumulative (with a forgetting factor) anomalousness score, and (b) the size of the mixture model, i.e., the total number of leaf nodes at time .

Figure 4: Multiscale representation of low-rank Gaussian mixture model. Consider a density with its mass concentrated along the black dashed curve. Each successive level in the multiscale representation has more Gaussian mixture components (depicted via contour plots) with covariance matrices corresponding to more compact ellipsoids, and hence yields a more accurate approximation of the underlying density. Given a particular binary tree representation of a GMM, the approximation error can be allowed to increase or decrease by pruning or growing the binary tree connecting the different scales. The ellipsoids are all very compact along some axes because they correspond to covariance matrices that are the sum of a low-rank matrix and a scaled identity matrix.

3.3.1 Computation of the Gaussian mixture likelihood (and anomalousness score)

The proposed algorithm uses the negative log-likelihood of the Gaussian mixture model give the data point as its anomalousness score. The likelihood of under the Gaussian associated with node is given by (recall )


Using the model in Eq. (1), the Gaussian mixture negative log-likelihood function (and hence anomalousness score) for any is:


3.3.2 Selective update

With the observation of each , the algorithm first compute the likelihood of under each of the Gaussian mixture components, and then assign to the component that maximizes the likelihood. Specifically, after the likelihood computations above, is assigned to the mixture component


Note that the weights are not used here in order to avoid biasing towards components with large weights. This assignment is made in order to reduce the computational complexity of the parameter update step: with each , instead of updating all the parameters of the entire tree, the algorithm only updates the tree branch associated leaf node . That is, the algorithm updates the parameters of node , all of its ancestors, and one of node ’s virtual children (the one under which is more likely). This approach significantly reduces the time complexity of the updates, especially when the model is complex (i.e., when the number of leaf nodes is large).

3.3.3 Mini-batch update

In previous sections, we have always assumed that we have one observation arriving at each time . However, in many applications, multiple observations can arrive simultaneously. For example, in WAMI settings, hundreds of image patches in a single frame arrive at the same time. One way to deal with this is simply treat each patch as arriving at a different time, and update the model parameters separately with each observation. However, when the number of patches is large (for HD videos, there can be thousands of patches per frame), this sequential processing can be extremely time-consuming.

To reduce the computation cost, we can instead update the mixture model in mini-batches, i.e., when multiple observations are received at the same time, we first compute the anomalousness score of each observation, and assign them to their own mixture component. The collection of observations assigned to a given mixture component then form a mini-batch. The mixture model and tree structure are then updated only once for each mini-batch. When the size of mini-batches is much larger than 1 (e.g., hundreds of image patches assigned to a tree of size ), this approach significantly reduces the number of times needed to update the mixture component parameters and tree structures, and thus saves computation time. Note that this mini-batch processing does not affect the computation of the anomalousness score and component assignment, where each observation is processed sequentially as if they arrive separately.

Thus, now instead of assuming a single vector arrives at time , we assume that we receive a collection of observations stored in matrix at time , where for all . A special case of this is , which is the sequential update without mini-batches. After assigning each column in to the leaf nodes in the hierarchical tree structure based on their distance to the corresponding mixture components, we can rewrite into mini-batches, , where . Here each is a block of data points that are assigned to the node in the tree (must be a leaf node). Note that .

Our update equations are based on a “forgetting factor” that places more weight on more recent observations; this quantity affects how quickly a changing distribution is tracked and is considered a tuning parameter to be set by the end user. Then for each leaf node that needs updates (i.e., with assigned observations), the weights are then updated by


Note that for the leaf nodes the weights need to add to 1, i.e.,  for all . If we initialize s.t. , and the weight of any parent node is the sum of the weights of its two children, then this update preserves for all . The mixture component means are updated by


The diagonal matrix


, contains eigenvalues of the covariance matrix of the projected data onto each subspace. Let


be a means matrix computed by concatenating copies of together. Let


be the residual signal, where the superscript denotes the pseudo-inverse of a matrix (for orthonormal , the pseudo-inverse is its transpose). Denote its row as . Then we can update


The subspace matrices are updated using Algorithm. 1. The updates of and are a mini-batch extension of the PETRELS [28, 27] update equations, with an added step of orthonormalization of since PETRELS does not guarantee the orthogonality of .

For the ancestors of each leaf node that got updated, we combine all the mini-batches assigned to its children, and update the node with the same formulae as above using the combined mini-batches. For the virtual children of leaf nodes that got updated, we divide each mini-batch into two sub-mini-batches based on the likelihood of each observation under the Gaussian of the virtual node, and update each virtual node with its assigned sub-mini-batch.

1:Initialize: (with training data),
Algorithm 1 Mini-Batch Update of Covariance Parameters

3.3.4 Tree structure update

The growing (splitting nodes) and pruning (merging nodes) of the tree structure allow the complexity of the GMM to adapt to the diversity of the observed data. The number of nodes in the tree controls the tradeoff between the model accuracy and complexity. The proposed method determines whether to grow or prune the tree by greedily minimizing a cost function consisting of the weighted cumulative anomalousness score (with weights corresponding to the forgetting factor described above) and the model complexity ().

Define as the cumulative anomalousness score where , and

For each node (including virtual children), a similar cumulative score is kept based only on the mini-batches assigned to that node. Let (for virtual nodes this set is the indices of its sub-mini-batch), initialize , and is updated by

Let tol be a pre-set error tolerance. For each leaf node that is assigned new observations, let be its parent, be its sibling, and be its virtual children. Let be a positive constant. Split node if




Note the left side of Ineq. (11) is the penalized cumulative score of node (where the penalty is proportional to the number of nodes in the tree), while the right side of Eq. (11) is the average penalized cumulative score of node ’s two virtual children. We split node if the average penalized cumulative score is smaller at the virtual children level.

Similarly, merge nodes and if




Note the left side of Ineq. (13) is the penalized (with tree size) cumulative score of node ’s parent , while the right side of Eq. (11) is the average penalized cumulative score of node and its sibling . We merge and if the average penalized cumulative score of and is larger than the penalized score of their parent. The use of these penalized scores to choose a tree which is both (a) a good fit to the observed data and (b) a small as possible to avoid overfitting is common in classification and regression trees [81, 82, 83, 84, 85, 86]. The splitting and merging operations are detailed in Algorithm 2 and Algorithm 3. The complete Online Thinning algorithm is summarized in Algorithm 4.

1:Input: Node with virtual children nodes and
3:Create new virtual children: for new leaf node , and for new leaf node
4:Let be the first column of
5:Initialize virtual nodes and :
6: for
Algorithm 2 Grow tree
1:Input: Node with children nodes and to be merged
2:Delete all four virtual children nodes of and
4:Define , as the virtual children nodes of the new leaf node
Algorithm 3 Prune tree
1:input: error tolerance , threshold , forgetting factor
2:initialize: tree structure, set initial error
3:for  do
4:     receive new data
5:     for  do
6:         let be the column of
7:         for all , compute likelihood of under node :
8:         compute anomalousness score :
9:         assign to leaf node
10:         compute the likelihood of under ’s two virtual children nodes, and also assign to the virtual child with higher likelihood
11:     end for
12:     update
13:     for  all nodes in the tree do
14:         set
15:         if  is not empty then
16:              denote all data assigned to node or its children as
17:              update
18:              update
19:              update
20:              set
21:              set
22:              for  do
23:                  update
24:              end for
25:              update by calling Algorithm 1
26:              if  and  then
27:                  call Algorithm 2
28:              else if  and  then
29:                  call Algorithm 3
30:              end if
31:         else update
32:         end if
33:     end for
35:end for
36:output: sequence of thinned data
Algorithm 4 Online Thinning with Mini-Batch Updates

3.4 Subsampling observations

When is large and computation time is critical, we can subsample the elements of each and leverage missing data models for fast calculations and updates. Algorithm 1 is a modified version of PETRELS [27, 28] with mini-batches. Note that PETRELS was specifically designed to work with missing entries, where [27, 28] thoroughly investigated the effect of missing data in subspace tracking algorithms.

Specifically, to modify our Online Thinning algorithm for with subsampled entries, we define be the subset of entries used at time . Assume all have the same size and define . Define an operator that selects the rows indexed by . Then, for the likelihood and score computation, denote , and compute

as the likelihood. For the mini-batch update step, replace and with and , respectively in Eq. (6),(7), and (8), and use Algorithm 5 instead of Algorithm 1.

1:Initialize: (with training data),
Algorithm 5 Mini-Batch Update of Covariance Parameters with Subsampling

3.5 Computational complexity

As discussed in Section 3, the union of subspaces assumption significantly reduces the problem size for estimating the covariance matrices. This not only improves the algorithm accuracy and stability for high dimensional problems, but also reduces computation time.

Take the computation of as an example. is the covariance matrix of the mixture component at time . For a full-rank GMM model, computing the takes operations, and computing given takes operations. Thus the total complexity is . However, with the low-rank assumption we have , and with the Woodbury matrix identity [87], we can compute


Note that computing is easy because (a) since the columns of are orthonormal, and (b) is diagonal. Computing the whole equation takes operations. Thus, by using the low-dimensional structure, we reduced the computation complexity from to .

Another example is the computation of the determinant of . For a full-rank GMM model, computing takes operations. For our low-rank model with , we can use the matrix determinant lemma [88] and compute

The number of operation needed is O(r) since and is diagonal.

3.5.1 Computational complexity without subsampling

Below we summarize the computational complexity of each major steps of the Online Thinning algorithm.Let be the total number of time steps. For simplicity, we assume the mini-batch sizes are the same for all . Let be the total number of observations received, and be the maximum number of leaf nodes in the tree.

  • Likelihood and score computation has a complexity of

    where each likelihood computation takes operations, and this is computed times per observation ( leaf nodes plus two virtual children of the assigned leaf node) for all observations. The computation of the anomalousness score is computationally inexpensive since it is a weighted sum of pre-computed likelihoods.

  • Mini-batch updates have complexities of at most

    where the first term comes from the calculation of and , and the second term comes from the orthonormalization of . The term is the number of batches received.

  • Tree structure updates have complexities of at most

    where the first term is an upper bound of complexity for updating the cumulative likelihood of the parents of leaf nodes (for leaf nodes and their virtual children, the likelihood is computed when at the score computing stage). In the second term, the term comes from the number of operations needed to copy the subspace when splitting nodes, and the maximum number of splitting at each time is bounded by the maximum number of leaf nodes (merging nodes is computational inexpensive).

Adding three steps together, the Online Thinning algorithm has a complexity of at most

3.5.2 Computational complexity with subsampling

Let be the number of entries observed after subsampling, then the complexity of each major step is as follows.

  • Likelihood and score computation has a complexity of

    which scales with .

  • Mini-batch updates have complexities of at most

    where the first term comes from the calculation of and , and is affected by subsampling. Note the extra comes from the added complexity from computing . When no subsampling is done, . However, this is generally not true when we perform subsampling, and this extra step adds complexity. However, in general scales with and is much larger than . The second term comes from the orthonormalization of , which is not affected by subsampling.

  • Tree structure updates have complexities of at most

    where the first term comes from the likelihood computation, and scales with . The second comes from the tree splitting, which is not affected by subsampling.

Adding three steps together, the Online Thinning algorithm has a complexity of at most

3.5.3 Remarks

Subsampling changes the first term of the complexity. Higher subsampling rates (smaller ) reduce the complexity of computing the likelihood, and affect some steps in the mini-batch update. However, subsampling does not affect the orthonormalization of , or the splitting and merging of tree structures. Additionally, subsampling makes the computation of difficult, since in general . Still, the effect of this added complexity in computing is generally small since is usually much larger than .

Changing the size of mini-batches changes , and thus the second term of the complexity. Specifically, in the algorithm, changing changes (a) the number of times needed to update the tree structure, (b) the number of times needed to update and its pseudo-inverse (see Algorithm 1), and (c) the number of time needed to perform the orthonormalization of (see Algorithm 1). When has the same order of (or larger than) , and without subsampling, the algorithm’s complexity is linear to the total number of observations , the observation dimension , the tree size , and the subspace dimension .

4 Computational and statistical tradeoffs

Different systems have different delay allowances and precision requirements, and it is natural to ask how much performance we sacrifice by trying to reduce the computation time. Understanding such tradeoffs is crucial for applications where real-time processing is required, yet the computational power is limited, as with many mobile surveillance systems. This section explores the tradeoff between processing time and detection accuracy for the Online Thinning algorithm.

There are two primary ways to reduce the computational complexity of the data thinning: (1) by randomly subsampling the entries of , i.e., we only use partially observed data to update the dynamic low-rank GMM model parameters and estimate the anomalousness score; and (2) by varying the size of the mini-batches. Note that these are made possible because, as discussed in Section 3, data thinning (a) is robust to unobserved entries, and (b) can process data in mini-batches, respectively.

To explore this further, two experiments are conducted—one in which we vary the mini-batch size, and one in which we vary the subsampling rate. For these experiments, the data is generated as follows: The ambient dimension is . We first generate points in

in a union of three (shifted) subspaces of dimension ten; in which 95% of the points lie in the union of the first two subspaces. The other 5% of the points lie in a third subspace that is orthogonal to the other two. All three subspaces have shifts close to 0. We then add white Gaussian noise with variance

to these points to generate our observations. The two subspaces where the 95% of observations come from are dynamic, where the subspaces rotate at a speed . For , we have

where is a skew-symmetric matrix. Denote the set of ’s coming from each of the three subspaces as , respectively. The goal is to identify the 5% of the observations that come from .

The experiment streams in four thousand observations in total. An initial model is estimated using the first one thousand samples, and the models are then updated in an online fashion for the remaining three thousand samples. The anomalousness score is calculated as the negative log-likelihood of each data point according to the estimated model. We then select observations for which , and compute the detection rate and false alarm rate

The threshold is tuned to minimize the detection error