The task of anomaly detection, i.e. the task of determining whether a given observation is unusual compared to a corpus of observations deemed to be normal or usual, is a challenge with applications in various fields such as medicine Hauskrecht et al. (2013), financial fraud Nian et al. (2016) and cybersecurity Jones and Sielken (2000).
The idea of using a metric to discriminate a corpus of scenarios from anomalies, and the view that an event is an anomaly if it is some distance from the set of observations, seem natural; these have been used many times Chandola et al. (2009). The main weaknesses of this approach are the arbitrariness of the choice of metric and, often, the lack of any principled way of calibrating the power of the technique. An important innovation in this paper is the use of the variance, the dual norm to the covariance, as the metric. As we will explain, in many ways it is surprising that this choice works, but in fact there is a strong and quite deep mathematical explanation for its effectiveness in terms of concentration of measure. It is a measure of exceptionality that can be applied to any corpus of data described through a vector feature set. It also provides internal measures of its own effectiveness, in terms of the extent to which members of the corpus are themselves anomalies to the rest of the corpus. It requires no external choices or parameters; for example, linear transformations of the features do not change the analysis or the measures of exceptionality at all.
1.1 Existing work
Anomaly detection comprises a vast literature spanning multiple disciplines Chandola et al. (2009). Among unsupervised anomaly detection techniques applicable to multivariate data, existing work includes density-based approaches Breunig et al. (2000), clustering He et al. (2003), isolation-based methods Liu et al. (2012), one-class support vector machines Amer et al. (2013) and neural networks Chalapathy and Chawla (2019).
The time series anomaly detection literature has largely focused on detecting anomalous points or subsequences within a time series, rather than detecting entire time series as anomalous. Hyndman et al. Hyndman et al. (2015) detect anomalous time series by calculating a set of features of the overall time series and projecting onto the two principal components; Beggel et al. Beggel et al. (2019) learn shapelet-based features that are particularly associated with the normal class.
1.2 Our work
There are many data science contexts where it is already meaningful to construct vector representations or features to describe data. Word2Vec Mikolov et al. (2013) and kernels Hofmann et al. (2008) provide two examples. The method introduced here could easily be applied in these contexts. In this paper, we initially, and specifically, focus on the signature as a vectorisation for streamed data, establishing that the methods are easy to apply and effective.
Definition 1.1 (Variance norm).
Let $\mu$ be a probability measure on a vector space $V$. The covariance quadratic form $\mathrm{Cov}_\mu(f) := \mathbb{E}_{x \sim \mu}[f(x)^2]$, defined on the dual $V^*$ of $V$, induces a dual norm defined for $x \in V$ by

$$\|x\|_\mu := \sup \{ f(x) : f \in V^*, \ \mathrm{Cov}_\mu(f) \le 1 \}$$

on $V$. It is finite on the linear span of the support of $\mu$, and infinite outside of it. We refer to this norm, computed for the measure re-centred to have mean zero, as the variance norm associated to $\mu$.
The variance norm is well defined whenever the measure has finite second moments and, in particular, for the empirical measure associated to a finite set of observations.
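For a finite corpus, the variance norm can be computed directly from the sample covariance matrix via its Moore-Penrose pseudo-inverse, as a Mahalanobis-type norm. The following is a minimal sketch in Python; the function name and numerical tolerance are our own choices, not from the paper's code:

```python
import numpy as np

def variance_norm(v, corpus):
    """Variance norm of the vector v with respect to the empirical
    measure of `corpus` (an (n, d) array of observations).

    Following Definition 1.1, the covariance is that of the corpus
    re-centred to have mean zero; the dual norm is then the
    Mahalanobis-type norm induced by the pseudo-inverse of the sample
    covariance. Returns np.inf when v lies outside the linear span of
    the centred support.
    """
    centred = corpus - corpus.mean(axis=0)
    cov = centred.T @ centred / len(corpus)
    cov_pinv = np.linalg.pinv(cov)
    # v lies in the span of the support iff projecting through the
    # covariance and back leaves it (numerically) unchanged.
    if not np.allclose(cov @ cov_pinv @ v, v, atol=1e-8):
        return np.inf
    return float(np.sqrt(v @ cov_pinv @ v))
```

Note that the pseudo-inverse handles degenerate (rank-deficient) covariances, matching the statement that the norm is finite exactly on the linear span of the support.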
This variance norm is surprisingly useful for detecting anomalies. Consider the standard normal distribution in $d$ dimensions. That is to say, consider $X = (X_1, \dots, X_d)$ where the $X_i$ are independent normal variables with mean zero and variance one. Then the covariance is the usual Euclidean inner product, and the variance norm is the usual Euclidean norm. Note that the expected value of $\|X\|^2$ is $d$, and so the norm of a typical sample is of the order of $\sqrt{d}$, converging to infinity with $d$. In the high-dimensional case, the norm is huge, and in the infinite-dimensional case it is infinite. For Brownian motion on $[0,1]$ it is the $L^2$ norm of the gradient of the path. Clearly no Brownian path has a derivative, let alone a square-integrable one. We see that the variance norm is intrinsic, but actually provides a very demanding notion of nearness that looks totally unrealistic.
However, keeping the Gaussian context, there is a fundamental theorem in stochastic analysis known as the TSB isoperimetric inequality (TSB stands for Tsirelson-Sudakov-Borell) Adler and Taylor (2007). An immediate corollary is that if one takes any set $A$ of probability one half in $\mathbb{R}^d$, and a new sample $x$ from the Gaussian measure, then the probability that $x$ is a variance-distance greater than $t$ from $A$ is at most $e^{-t^2/2}$, and so vanishingly small if $t$ is even of moderate size. A Brownian path may be irregular, but if you take a corpus of Brownian paths with probability at least a half, then a new path will differ from one of those paths by a differentiable path of small norm. This makes the variance norm an excellent measure of exceptionality: it is selective and discriminatory, but it must be used to compare with the corpus and not used directly. A new member of the corpus will be far away from most members of the corpus, but with very high probability there will be some members of the corpus to which it is very close. With this in mind we make the following definition:
Definition 1.2 (Conformance).
Let $\mu$ be a probability measure on a vector space $V$. Define the conformance of $x \in V$ to $\mu$ to be the distance

$$\mathrm{conf}_\mu(x) := \inf_{y \in \mathrm{supp}(\mu)} \|x - y\|_\mu.$$
If $T$ is a linear map, then $Tx$ is at least as conformant to the pushforward measure $T\mu$ as $x$ is to $\mu$ (the conformance score can only be reduced by a linear map).
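For an empirical measure, the infimum in the conformance definition becomes a minimum over the corpus points. A minimal self-contained sketch (our own helper names; the variance norm is taken from the pseudo-inverse of the re-centred sample covariance, finite-norm case only):

```python
import numpy as np

def conformance(x, corpus):
    """Conformance of x to the empirical measure of `corpus`:
    the smallest variance-norm distance from x to a corpus point.

    corpus: (n, d) array; x: (d,) vector. Differences lying outside
    the span of the centred corpus would have infinite norm; this
    sketch only covers the finite case.
    """
    centred = corpus - corpus.mean(axis=0)
    cov_pinv = np.linalg.pinv(centred.T @ centred / len(corpus))
    diffs = x - corpus                       # (n, d) differences x - y
    # Mahalanobis-type norm of each difference under the corpus covariance.
    dists = np.sqrt(np.einsum("nd,de,ne->n", diffs, cov_pinv, diffs))
    return float(dists.min())
```

In particular, the conformance of a corpus member to the corpus itself is zero, since the infimum is attained at the point itself.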
Keeping to the Gaussian context, let $A$ be any set of measure $\tfrac{1}{2}$ and let $\mu$ be the Gaussian measure restricted to $A$ and normalised. Then, reiterating, the TSB inequality ensures that conformance to $\mu$ is an excellent measure of exceptionality.
An empirical measure is not in itself Gaussian, even if drawn from a Gaussian. So taking half of the ensemble only captures the other half tightly when the sample size is large enough, compared with the dimension of the feature set, that balls around it capture a good proportion of the probability measure. Before that, the resolution provided by the feature set is so high that essentially every sample is non-conformant. Fortunately, this is easy to measure empirically. If we split the corpus randomly into two halves, there is a scale $r$ such that a point chosen from the second half of the corpus is within a distance $r$ of the first half with probability one half. From that scale on, if the conformance of a new observation to the corpus is $d$, then the ratio $d/r$ should provide an effective measure of being an anomaly, while $r$ itself provides a measure of the extent to which the dimension of the feature set is overwhelming the sample size, in which case every observation is innovative and an anomaly.
The non-Gaussian case lacks the very sharp theoretical underpinning of the Gaussian case, but the approach remains clear and its power can still easily be determined from the data. We validate the approach by identifying anomalies in streamed data using signatures as the vector features.
Our methodology provides a data-driven notion of a distance (i.e. conformance) between an arbitrary stream of data and the corpus. Moreover, it has four properties that are particularly useful for anomaly detection:
The variance norm is intrinsic to the vector representation and independent of any choice of basis.
The conformance score, as a measure of anomaly, does not depend on any external choice of metric, etc.
By using the signature to vectorise the corpus of streamed data, it is straightforward to accommodate streams that are differently sampled and essentially multimodal.
There are no distribution assumptions on the corpus of vectors.
The paper is structured as follows: Section 2 introduces the basic signature tools. In Section 3 we combine conformance with signatures to analyse anomalies in streamed data. In Section 4, we report results of numerical experiments on PenDigits, marine vessel traffic data and univariate time series from the UEA & UCR repository. In Section 5 we briefly summarise the contribution of the paper.
2 Streams of data and signature features
2.1 Streams of data
Below we give a formal definition of a stream of data (Kidger et al., 2019, Definition 2.1).
Definition 2.1 (Stream of data).
The space of streams of data in a set $E$ is defined as

$$\mathcal{S}(E) := \{ \mathbf{x} = (x_1, \dots, x_n) : x_i \in E, \ n \in \mathbb{N} \}.$$
When a person writes a character by hand, the stroke of the pen naturally determines a path. If we record the trajectory we obtain a two-dimensional stream of data. If we record the stroke of a different writer, the associated stream of data could have a different number of points. The distance between successive points may also vary.
2.2 Signature features
Definition 2.3 (Signature).
Let $\mathbf{x} = (x_1, \dots, x_n)$ be a stream of data in $d$ dimensions. Let $X = (X_t)_{t \in [1, n]}$ be such that $X_i = x_i$ for $i = 1, \dots, n$,
with linear interpolation in between. Then, we define the signature of $\mathbf{x}$ of order $N$ as

$$\mathrm{Sig}^N(\mathbf{x}) := \left( \int_{1 < t_1 < \cdots < t_k < n} dX_{t_1} \otimes \cdots \otimes dX_{t_k} \right)_{1 \le k \le N}.$$
The signature of a stream of data is a vector of scalars. The dimension of this vector is $\sum_{k=1}^{N} d^k = \frac{d^{N+1} - d}{d - 1}$ (for $d > 1$).
For each $N$, define $s_d(N) := \sum_{k=1}^{N} d^k$, which is the dimension of the signature of order $N$. There exists a product

$$\sqcup\!\sqcup : \mathbb{R}^{s_d(N)} \times \mathbb{R}^{s_d(N)} \to \mathbb{R}^{s_d(2N)}$$

called the shuffle product such that

$$\langle u, \mathrm{Sig}^N(\mathbf{x}) \rangle \, \langle v, \mathrm{Sig}^N(\mathbf{x}) \rangle = \langle u \sqcup\!\sqcup v, \mathrm{Sig}^{2N}(\mathbf{x}) \rangle \quad \text{for all } u, v \in \mathbb{R}^{s_d(N)},$$

where $\langle \cdot, \cdot \rangle$ denotes the inner (dot) product.
See (Kalsi et al., 2020, Definition 2.5) for an explicit construction of the shuffle product .
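To make the preceding definitions concrete, the truncated signature of a piecewise-linear stream can be computed level by level via Chen's relation. The following is our own minimal order-2 implementation (an optimised library such as iisignature would be used in practice); it also lets one check the shuffle identity numerically:

```python
import numpy as np

def signature_order2(stream):
    """Order-2 signature of a piecewise-linear stream, given as an
    (n, d) array of points.

    Over a linear segment with increment D, Chen's relation says level 1
    picks up D and level 2 picks up (current level 1) (x) D + D (x) D / 2.
    Returns the concatenated vector of dimension d + d^2.
    """
    d = stream.shape[1]
    lvl1 = np.zeros(d)
    lvl2 = np.zeros((d, d))
    for delta in np.diff(stream, axis=0):
        lvl2 += np.outer(lvl1, delta) + 0.5 * np.outer(delta, delta)
        lvl1 += delta
    return np.concatenate([lvl1, lvl2.ravel()])
```

As a sanity check, the shuffle identity above specialises at order 2 to $S^i S^j = S^{ij} + S^{ji}$, which the output of this function satisfies for any stream.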
2.3 Stream transformations
Stream transformations map a stream of data to another stream of data that one considers might contain relevant information for the problem at hand.
A stream transformation is a mapping

$$\phi : \mathcal{S}(\mathbb{R}^d) \to \mathcal{S}(\mathbb{R}^e),$$

where typically $e \ge d$.
Below we introduce a few stream transformations that have proved to be popular with signatures in the literature, and will be used in later sections. More than one transformation can simultaneously be applied on a single stream.
2.3.1 Time transformation
The time transformation adds an extra dimension to a stream of data, which accounts for time:

$$\phi_t : (x_1, \dots, x_n) \mapsto ((t_1, x_1), \dots, (t_n, x_n)),$$

where $t_1 < \cdots < t_n$ are chosen to be strictly increasing. If our data include timestamps $u_1, \dots, u_n$, a variant of this transformation uses the differences $u_i - u_{i-1}$ between successive timestamps as the extra coordinate.
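In code the time transformation is a one-liner. The conventions below (index-based times by default, and a zero first entry in the timestamp-difference variant) are our own assumptions:

```python
import numpy as np

def time_transform(stream, timestamps=None):
    """Prepend a time coordinate to an (n, d) stream.

    By default the time coordinate is the strictly increasing index
    0, 1, ..., n-1. If timestamps are given, the differences between
    successive timestamps are used instead (with 0 for the first point).
    """
    n = len(stream)
    if timestamps is None:
        t = np.arange(n, dtype=float)
    else:
        t = np.concatenate([[0.0], np.diff(np.asarray(timestamps, dtype=float))])
    return np.column_stack([t, stream])
```

The timestamp-difference variant is the one used for the marine vessel data in Section 4.2 to account for velocity.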
2.3.2 Lead-lag transformation
The lead-lag transformation of a $d$-dimensional stream of data of length $n$ is a $2d$-dimensional stream of data of length $2n - 1$, defined as follows:

$$\phi_{\mathrm{LL}}(\mathbf{x})_{2i-1} := (x_i, x_i) \ \ \text{for } i = 1, \dots, n, \qquad \phi_{\mathrm{LL}}(\mathbf{x})_{2i} := (x_{i+1}, x_i) \ \ \text{for } i = 1, \dots, n-1.$$

The work in Flint et al. (2016) studies the signature of lead-lag transformed streams.
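A common discrete convention interleaves a "lead" copy of the stream with a "lag" copy one step behind; the sketch below follows that convention (our own implementation, not library code):

```python
import numpy as np

def lead_lag(stream):
    """Lead-lag transform: an (n, d) stream becomes a (2n-1, 2d) stream.

    Odd-indexed output points are (x_i, x_i); even-indexed points are
    (x_{i+1}, x_i), so the first d coordinates 'lead' while the last d
    coordinates 'lag' one step behind.
    """
    out = []
    for i in range(len(stream)):
        out.append(np.concatenate([stream[i], stream[i]]))           # (x_i, x_i)
        if i + 1 < len(stream):
            out.append(np.concatenate([stream[i + 1], stream[i]]))   # (x_{i+1}, x_i)
    return np.array(out)
```

The level-2 signature of the lead-lag stream encodes quadratic-variation-type information that the raw stream's signature does not.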
2.3.3 Invisibility transform
Signatures are constructed from increments of the stream of data. As a consequence, all information about the absolute value of the steps of the stream is lost. Sometimes it is desirable to keep reference to the absolute value of the underlying stream; in this case the invisibility transform Wu et al. (2020) is useful. When taking the signature of a stream after applying the invisibility transform, the absolute value of the stream is preserved.
The invisibility transform lifts a stream $(x_1, \dots, x_n)$ in $\mathbb{R}^d$ to a stream in $\mathbb{R}^{d+1}$ that starts at the origin and carries an extra visibility coordinate:

$$(x_1, \dots, x_n) \mapsto \bigl( (\mathbf{0}, 0), \ (x_1, 0), \ (x_1, 1), \ (x_2, 1), \ \dots, \ (x_n, 1) \bigr).$$
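As a sketch, one discrete convention (our reading of the visibility-style transforms of Wu et al. (2020); the exact convention is an assumption here) prepends the origin with a visibility coordinate of 0 before switching it to 1:

```python
import numpy as np

def invisibility_transform(stream):
    """Lift an (n, d) stream to an (n+2, d+1) stream.

    The path starts at the origin with visibility 0, moves to the first
    point while still 'invisible', then switches visibility to 1 and
    follows the stream. The initial absolute position x_1 is thereby
    recoverable from increments, which the plain signature discards.
    """
    n, d = stream.shape
    lifted = np.column_stack([stream, np.ones(n)])       # (x_i, 1)
    start = np.zeros((1, d + 1))                         # (0, 0)
    first_invisible = np.concatenate([stream[0], [0.0]])[None, :]  # (x_1, 0)
    return np.vstack([start, first_invisible, lifted])
```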
3 Anomalies in streamed data
Let $\mu$ be a finite corpus (or empirical measure) of streams of data. Let $\mathrm{Sig}^N$ be the signature of order $N$. Then $\| \cdot \|_{\mathrm{Sig}^N(\mu)}$ is the variance norm associated with the empirical measure of $\{ \mathrm{Sig}^N(\mathbf{x}) : \mathbf{x} \in \mu \}$.
There are a number of interesting relationships between the variance norm and the signature, one being ease of computation: the variance norm of order $N$ can easily be computed from the expected signature of order $2N$.
Let $u, v \in \mathbb{R}^{s_d(N)}$. By the shuffle identity we have

$$\mathbb{E}_{\mathbf{x} \sim \mu} \bigl[ \langle u, \mathrm{Sig}^N(\mathbf{x}) \rangle \, \langle v, \mathrm{Sig}^N(\mathbf{x}) \rangle \bigr] = \bigl\langle u \sqcup\!\sqcup v, \ \mathbb{E}_{\mathbf{x} \sim \mu}[\mathrm{Sig}^{2N}(\mathbf{x})] \bigr\rangle,$$

so the covariance quadratic form on order-$N$ signature features is determined by the expected signature of order $2N$.
Appendix C describes some other interesting properties.
3.1 Anomaly detection using conformance
Let $\mu$ be a finite corpus of vector data. We use a large conformance score (Definition 1.2) to identify outlying behaviour. As explained above, each corpus has its own threshold of conformance. So, we randomly split the corpus into two equal-sized parts and denote the empirical probability measures on those two parts by $\mu_1$ and $\mu_2$. For a random point $y$ with law $\mu_2$ we can look at its conformance to $\mu_1$. By looking at the right $\alpha$-tail of the random variable $\mathrm{conf}_{\mu_1}(y)$, for a given probability $\alpha$ we have a natural quantified choice of anomalous behaviour: a point chosen randomly from $\mu_2$ has a probability of at most $\alpha$ of a conformance that exceeds the threshold.
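The split-and-calibrate procedure can be sketched in a few lines. This is our own minimal illustration (helper names are ours), using the pseudo-inverse of the empirical covariance for the variance norm and the empirical $(1-\alpha)$-quantile of the conformance scores as the threshold:

```python
import numpy as np

def calibrate_threshold(corpus, alpha=0.05, seed=None):
    """Randomly split an (n, d) corpus into halves mu1, mu2 and return
    the empirical (1 - alpha)-quantile of conformance(mu2 point -> mu1),
    to be used as an anomaly threshold for new observations."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(corpus))
    half = len(corpus) // 2
    mu1, mu2 = corpus[idx[:half]], corpus[idx[half:]]
    # Variance norm of mu1 via the pseudo-inverse of its covariance.
    centred = mu1 - mu1.mean(axis=0)
    cov_pinv = np.linalg.pinv(centred.T @ centred / len(mu1))
    def conf(x):
        diffs = x - mu1
        return np.sqrt(np.einsum("nd,de,ne->n", diffs, cov_pinv, diffs)).min()
    scores = np.array([conf(x) for x in mu2])
    return float(np.quantile(scores, 1 - alpha))
```

A new observation whose conformance to the first half exceeds this threshold is then flagged as anomalous at level $\alpha$.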
Depending on the choice of vector feature map for the corpus, the power of this approach will change. For example, if the feature map is very high-dimensional, the threshold will have poor discriminatory power; the same is true for very low-dimensional feature maps. This is where, in the context of streamed data, the graded nature of the signature features proves to be advantageous.
4 Numerical experiments
We apply our method to the task of unsupervised anomaly detection. That is, we have a data set partitioned into data deemed to be normal and data deemed to be anomalous. By further partitioning the normal data, we obtain the corpus which we use for training; as our testing data we use the remaining normal data together with the anomalous data.
We perform experiments on a 2018 MacBook Pro equipped with a 2.6 GHz 6-Core Intel Core i7 processor and 32 GB 2400 MHz DDR4 memory. For the results reported in Table 1, Table 2, Figure 2, the respective CPU times observed are 54min, 2d 3h 51min, 4h 59min. To compute signatures of streams, we use the iisignature library Reizenstein and Graham (2020).
4.1 Handwritten digits
We evaluate our proposed method using the PenDigits-orig data set Dua and Graff (2017). This data set consists of 10 992 instances of hand-written digits captured from 44 subjects using a digital tablet and stylus, with each digit represented approximately equally frequently. Each instance is represented as a 2-dimensional stream, based on sampling the stylus position at 10Hz.
We apply the PenDigits data to unsupervised anomaly detection by defining the normal class as the set of instances representing a single digit. We define the corpus as the subset of the normal class labelled as 'training' by the annotators, and the normal testing data as the instances labelled as 'testing'. Finally, we define the anomalous testing data as the set of testing instances not representing the chosen digit. We repeat this construction for each of the ten digits. Assuming that digit class is invariant to translation and scaling, we apply Min-Max normalisation to each individual stream.
Table 1 displays results based on taking signatures of varying order, without any stream transformations applied. The results are based on aggregating conformance values across the set of possible digits before computing the ROC AUC. As we observe, performance increases monotonically with the signature order, from 0.901 to 0.989. Figure 3 displays plots of empirical cumulative distributions of conformance values that we obtain for normal and anomalous testing data across signature orders.
4.2 Marine vessel traffic data
Next, we consider a sample of marine vessel traffic data222https://coast.noaa.gov/htdata/CMSP/AISDataHandler/2017/AIS_2017_01_Zone17.zip, accessed May 2020., based on the automatic identification system (AIS) which reports a ship’s geographical position alongside other vessel information. The AIS data that we consider were collected by the US Coast Guard in January 2017, with a total of 31 884 021 geographical positions recorded for 6 282 distinct vessel identifiers. We consider the stream of timestamped latitude/longitude position data associated with each vessel a representation of the vessel’s path. Figure 1 displays stream data for a sample of vessels.
We prepare the marine vessel data by retaining only those data points with a valid associated vessel identifier. In addition, we discard vessels with any missing or invalid vessel length information. Next, to help constrain computation time, we compress each stream by retaining a given position only if its distance relative to the previously retained position exceeds a threshold of 10m. Finally, to help ensure that streams are faithful representations of ship movement, we retain only those vessels whose distance between initial and final positions exceeds 5km. To evaluate the effect of stream length on performance, we disintegrate streams so that the distance between initial and final points in each sub-stream remains constant at a prescribed value. After disintegrating streams, we retain only those sub-streams whose maximum distance between successive points is less than 1km.
We partition the data by deeming a sub-stream normal if it belongs to a vessel with a reported vessel length greater than 100m. Conversely, we deem sub-streams anomalous if they belong to vessels with a reported length less than or equal to 50m. We obtain the corpus from 607 vessels, whose sub-streams total between 10 111 and 104 369, depending on the sub-stream length; we obtain the subset of normal instances used for testing from 607 vessels, whose sub-streams total between 11 254 and 114 071; lastly we obtain the set of anomalous instances from 997 vessels, whose sub-streams total between 8 890 and 123 237. To account for any imbalance in the number of sub-streams associated with vessels, we use for each of the aforementioned three subsets a weighted sample of 5 000 instances.
After computing sub-streams and transforming them as described in Sections 2.3.2 and 2.3.3, we apply Min-Max normalisation with respect to the corpus . To account for velocity, we incorporate the difference between successive timestamps as an additional dimension, as described in Section 2.3.1.
We report results based on taking signatures of varying order. For comparison, as a baseline approach we summarise each sub-stream by estimating its component-wise mean and covariance, retaining the upper triangular part of the covariance matrix. For a $d$-dimensional sub-stream this results in feature vectors of dimensionality $d + d(d+1)/2$, which we provide as input to an isolation forest Liu et al. (2008). We train the isolation forest using 100 trees, growing each tree in the ensemble from 256 samples represented by a single random feature.
Table 2 displays results for our proposed approach in comparison to the baseline, for combinations of stream transformations and values of the sub-stream length. Signature conformance yields higher ROC AUC scores than the baseline for 30 out of 32 parameter combinations. The maximum ROC AUC score of 0.891 is achieved by a combination of lead-lag, time differences, and invisibility reset transformations, using the signature conformance. Compared to the best-performing baseline parameter combination, this represents a performance gain of 6.8 percentage points.
Table 2: ROC AUC by stream transformation and sub-stream length, for signature conformance and the isolation forest baseline.
4.3 Univariate time series
For the specific case of detecting anomalous univariate time series, we benchmark our method against the ADSL shapelet method of Beggel et al. Beggel et al. (2019), using their set of 28 data sets from the UEA & UCR time series repository Bagnall et al. (2020), adapted in exactly the same manner. Each data set comprises a set of time series of equal length, together with class labels. One class (the same as in ADSL) is designated as a normal class, with all other classes designated as anomalies. To prepare the data for our method, we convert each time series into a 2-dimensional stream by incorporating a uniformly-increasing time dimension. We apply no other transformations to the data, and take signatures of a fixed order.
We create training and test sets exactly as in ADSL. The training corpus consists of 80% of the normal time series, contaminated by a proportion of anomalies (we compute results for anomaly rates of 0.1% and 5%). Across these data sets the relevant sizes vary widely: from 10 (Beef) to 840 (ChlorineConcentration at 5%); from 2 (Beef) to 200 (ChlorineConcentration); and from 19 (BeetleFly and BirdChicken at 0.1%) to 6401 (Wafer at 5%). We run experiments with ten random train-test splits, and take the median result. The performance measure used by ADSL is the balanced accuracy, which requires a threshold to be set for detecting anomalies. We report the best achievable balanced accuracy across all possible thresholds, and compare against the best value reported for ADSL. Figure 2 plots our results. Individual scores are available in Table 3.
Our method performs competitively with ADSL, both when the proportion of anomalies in the training corpus is low and when it is high. It is able to detect anomalies in four of the six data sets where ADSL struggles because the anomalies are less visually distinguishable (ChlorineConcentration, ECG200, Wafer, Wine). However, there are data sets where ADSL performs better (BeetleFly, BirdChicken, FaceFour, ToeSegmentation1 and ToeSegmentation2): these data sets largely originate from research into shapelet methods, and they appear to contain features that are detected well by shapelets. Applying transformations to the data sets before input may improve our method’s results.
5 Conclusion
Motivated by the TSB isoperimetric inequality, we introduce the notion of conformance as an intrinsic and canonical tool to identify anomalous behaviour. It seems well matched to the important challenge of identifying anomalous trajectories of streamed data against a corpus of 'normality'. The approach appears robust when tested against a wide variety of data sets.
The experiments in this paper focused on applications of the conformance method to streamed data. It would be interesting to study how the method works on other types of vector data.
Broader Impact
As with any anomaly detection method, there might be some intrinsic ethical issues depending on the data that are used and the intended use case, particularly if it involves people. However, the authors cannot identify any ethical issues that are specific to this method.
This work was supported by the Defence and Security Programme at the Alan Turing Institute, funded by the UK Government. PF, TL, IPA were supported by the Alan Turing Institute under the EPSRC grant EP/N510129/1 and the EPSRC program grant EP/S026347/1 DATASIG.
The authors are grateful for the UEA & UCR time series classification repository Bagnall et al. (2020), without which it would have been much more difficult to validate our approach.
- Adler and Taylor (2007). Gaussian inequalities. In Random Fields and Geometry, pp. 49–64. Springer.
- Amer et al. (2013). Enhancing one-class support vector machines for unsupervised anomaly detection. In Proceedings of the ACM SIGKDD Workshop on Outlier Detection and Description, pp. 8–15.
- Bagnall et al. (2020). The UEA & UCR time series classification repository. www.timeseriesclassification.com, accessed May 2020.
- Beggel et al. (2019). Time series anomaly detection based on shapelet learning. Computational Statistics 34 (3), pp. 945–976.
- Breunig et al. (2000). LOF: identifying density-based local outliers. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, pp. 93–104.
- Chalapathy and Chawla (2019). Deep learning for anomaly detection: a survey. arXiv preprint arXiv:1901.03407.
- Chandola et al. (2009). Anomaly detection: a survey. ACM Computing Surveys (CSUR) 41 (3), pp. 1–58.
- Dua and Graff (2017). UCI machine learning repository. University of California, Irvine, School of Information and Computer Sciences.
- Flint et al. (2016). Discretely sampled signals and the rough Hoff process. Stochastic Processes and their Applications 126 (9), pp. 2593–2614.
- Hauskrecht et al. (2013). Outlier detection for patient monitoring and alerting. Journal of Biomedical Informatics 46 (1), pp. 47–55.
- He et al. (2003). Discovering cluster-based local outliers. Pattern Recognition Letters 24 (9–10), pp. 1641–1650.
- Hofmann et al. (2008). Kernel methods in machine learning. The Annals of Statistics, pp. 1171–1220.
- Hyndman et al. (2015). Large-scale unusual time series detection. In Proceedings of the 2015 IEEE International Conference on Data Mining Workshop (ICDMW), pp. 1616–1619.
- Jones and Sielken (2000). Computer system intrusion detection: a survey. Technical report, University of Virginia.
- Kalsi et al. (2020). Optimal execution with rough path signatures. SIAM Journal on Financial Mathematics 11 (2), pp. 470–493.
- Kidger et al. (2019). Deep signature transforms. In Proceedings of the Advances in Neural Information Processing Systems Conference, pp. 3099–3109.
- Liu et al. (2008). Isolation forest. In Proceedings of the 2008 Eighth IEEE International Conference on Data Mining, pp. 413–422.
- Liu et al. (2012). Isolation-based anomaly detection. ACM Transactions on Knowledge Discovery from Data (TKDD) 6 (1), pp. 1–39.
- Lyons et al. (2007). Differential equations driven by rough paths. Springer.
- Mikolov et al. (2013). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
- Nian et al. (2016). Auto insurance fraud detection using unsupervised spectral ranking for anomaly. Journal of Finance and Data Science 2 (1), pp. 58–75.
- Reizenstein and Graham (2020). Algorithm 1004: the iisignature library: efficient calculation of iterated-integral signatures and log signatures. ACM Transactions on Mathematical Software (TOMS) 46 (1), pp. 1–21.
- Wu et al. (2020). Signature features with the visibility transformation. arXiv preprint arXiv:2004.04006.
Appendix A Plots of conformance distances for PenDigits data set
Appendix B Table of results for univariate time series data
|Data set||Conformance (0.1% anomaly rate)||ADSL (0.1% anomaly rate)||Conformance (5% anomaly rate)||ADSL (5% anomaly rate)|
|Adiac||1.00 (0.00)||0.99 (0.10)||0.99 (0.09)||0.95 (0.05)|
|ArrowHead||0.80 (0.07)||0.65 (0.03)||0.74 (0.06)||0.64 (0.03)|
|Beef||0.80 (0.22)||0.57 (0.15)||0.80 (0.22)||0.73 (0.12)|
|BeetleFly||0.75 (0.08)||0.90 (0.08)||0.72 (0.08)||0.84 (0.08)|
|BirdChicken||0.75 (0.13)||0.85 (0.15)||0.77 (0.15)||0.79 (0.09)|
|CBF||0.97 (0.01)||0.80 (0.04)||0.86 (0.03)||0.68 (0.03)|
|ChlorineConcentration||0.91 (0.01)||0.50 (0.00)||0.88 (0.01)||0.47 (0.01)|
|Coffee||0.80 (0.05)||0.84 (0.04)||0.78 (0.05)||0.73 (0.05)|
|ECG200||0.80 (0.06)||0.50 (0.03)||0.75 (0.05)||0.47 (0.04)|
|ECGFiveDays||0.97 (0.02)||0.94 (0.11)||0.83 (0.02)||0.86 (0.01)|
|FaceFour||0.78 (0.10)||0.94 (0.10)||0.78 (0.13)||0.88 (0.11)|
|GunPoint||0.85 (0.05)||0.75 (0.03)||0.81 (0.05)||0.68 (0.04)|
|Ham||0.52 (0.04)||0.50 (0.02)||0.52 (0.04)||0.50 (0.03)|
|Herring||0.58 (0.06)||0.52 (0.02)||0.57 (0.04)||0.49 (0.04)|
|Lightning2||0.73 (0.04)||0.63 (0.07)||0.75 (0.05)||0.50 (0.07)|
|Lightning7||0.94 (0.09)||0.73 (0.11)||0.82 (0.09)||0.68 (0.07)|
|Meat||0.94 (0.03)||1.00 (0.04)||0.79 (0.07)||0.87 (0.05)|
|MedicalImages||0.97 (0.03)||0.90 (0.03)||0.95 (0.04)||0.83 (0.05)|
|MoteStrain||0.89 (0.01)||0.74 (0.01)||0.86 (0.02)||0.71 (0.03)|
|Plane||1.00 (0.00)||1.00 (0.04)||1.00 (0.04)||1.00 (0.04)|
|Strawberry||0.92 (0.01)||0.77 (0.03)||0.88 (0.01)||0.67 (0.02)|
|Symbols||1.00 (0.01)||0.96 (0.02)||0.99 (0.01)||0.95 (0.03)|
|ToeSegmentation1||0.77 (0.03)||0.95 (0.01)||0.76 (0.05)||0.84 (0.03)|
|ToeSegmentation2||0.80 (0.06)||0.88 (0.02)||0.77 (0.06)||0.80 (0.10)|
|Trace||1.00 (0.00)||1.00 (0.04)||1.00 (0.05)||1.00 (0.02)|
|TwoLeadECG||0.92 (0.02)||0.89 (0.01)||0.82 (0.02)||0.81 (0.02)|
|Wafer||0.97 (0.02)||0.56 (0.02)||0.81 (0.03)||0.53 (0.01)|
|Wine||0.85 (0.06)||0.53 (0.02)||0.81 (0.09)||0.53 (0.02)|
Table 3: balanced accuracy for signature conformance and ADSL at 0.1% and 5% training anomaly rates. Values in brackets are standard deviations with respect to testing folds.
Appendix C Properties of the variance norm for signatures
Below we give a few properties of the variance norm (1) for streamed data. Intuitively, these properties can be interpreted as follows. The order $N$ of the signature can be seen as a measure of the resolution at which the streams are viewed. If $N$ is small, only general features of the streams are considered. As $N$ increases, more and more details of the streams are considered, as they are viewed at a higher resolution.
Given a finite corpus, any stream not belonging to the corpus is, in a way, an anomaly. In other words, viewed at a sufficiently high resolution, any stream that is not in the corpus is an anomaly. The degree to which it should be considered an anomaly should also increase with $N$:
Let $\mu$ be a finite corpus. Take a stream of data $\mathbf{x}$. Then, $\|\mathrm{Sig}^N(\mathbf{x})\|_\mu$ is non-decreasing as a function of $N$.
Let $M \le N$. We have

$$\|\mathrm{Sig}^M(\mathbf{x})\|_\mu \le \|\mathrm{Sig}^N(\mathbf{x})\|_\mu,$$

as the supremum defining the variance norm at order $N$ is taken over a larger set of linear functionals. ∎
Moreover, for a sufficiently high resolution, any stream of data not belonging to the corpus has infinite variance norm:
Let $\mu$ be a finite corpus. Let $\mathbf{x}$ be a stream of data that does not belong to the corpus, $\mathbf{x} \notin \mathrm{supp}(\mu)$. Then, there exists $N$ large enough such that $\|\mathrm{Sig}^N(\mathbf{x})\|_\mu = \infty$.
If $\mathbf{x} \notin \mathrm{supp}(\mu)$, there exists $N$ large enough such that $\mathrm{Sig}^N(\mathbf{x})$ is linearly independent of $\{\mathrm{Sig}^N(\mathbf{y}) : \mathbf{y} \in \mathrm{supp}(\mu)\}$, by uniqueness of signatures Lyons et al. (2007). Therefore, there exists a linear functional $f$ such that $f(\mathrm{Sig}^N(\mathbf{y})) = 0$ for all $\mathbf{y} \in \mathrm{supp}(\mu)$ and $f(\mathrm{Sig}^N(\mathbf{x})) \ne 0$. It then follows that $\mathrm{Cov}_\mu(\lambda f) = 0 \le 1$ for every $\lambda > 0$ while $\lambda f(\mathrm{Sig}^N(\mathbf{x})) \to \infty$, so the supremum defining $\|\mathrm{Sig}^N(\mathbf{x})\|_\mu$ is infinite. ∎