1 Introduction
In large-scale distributed systems or cloud environments, the detection of anomalous events allows operators to detect and understand operational issues and facilitates swift troubleshooting. Undetected anomalies can result in potentially significant losses and can negatively impact the customers of these systems and services. Designing an effective anomaly detection system is therefore an important task. This task entails significant challenges, beginning with the fact that the notion of “anomaly” is itself ambiguous and is used to denote different kinds of events depending on the application domain. For example, what is considered an anomaly when monitoring an EEG signal in a health care application differs from what is considered anomalous in financial time series. To reduce this ambiguity, we focus on anomaly detection in the context of our target application of monitoring compute systems and cloud resources, where the main objects of interest are the metrics emitted by these systems; we refer to this setting as cloud monitoring. We refer the reader to detailed overviews [1, 2, 3, 4, 5] of anomaly detection in other application areas.
In the setting of cloud monitoring, an anomaly detection system needs to be able to efficiently detect anomalous events in a streaming fashion. The fundamental difficulties that any such system has to face are threefold. First, due to the number and diversity of the monitored metrics (often millions) and the streaming nature of the data, it is uncommon to have sufficient amounts of labeled data available to employ supervised learning techniques. Even if labels are available, due to the subjectivity of the task, they may not represent an “objective” ground truth, and there may be significant disagreement between multiple labelers. This raises the need for unsupervised techniques. Second, the monitoring systems have to track the evolution of a large number of time series simultaneously, which often leads to a considerable flow of data to process in near real time, so the models have to scale efficiently to the amount of data available. Here, scalability comes not only in the traditional flavor of computational scalability, but also in terms of the need to involve experts to tune the systems. With millions of metrics to be monitored, methods are required that have robust out-of-the-box performance, requiring no manual intervention. Finally, the methods have to be flexible in order to handle time series of different nature (e.g. CPU usage, latency, error rate), and anomalies presenting a wide range of patterns: point anomalies, collective anomalies (a collective anomaly consists of a subset of points that deviates from the rest of the dataset even though each point individually may appear normal; see for example Figure 2), contextual anomalies, and abrupt changes in trend.

The main contribution of the present work is a novel anomaly detection method based on distributional time series models that addresses all three challenges. To the best of our knowledge, it is the first anomaly detection methodology that builds on a predictive model for a distributional time series representation. It employs an autoregressive LSTM-based recurrent neural network to provide flexibility while still being statistically sound. Our model scales well at inference time and has a compact model state, making it deployable in low-latency, streaming application scenarios. It readily allows the incorporation of covariates, enabling the model to detect contextual anomalies where the context is not limited to the temporal one. For example, high CPU utilization may be expected if the number of incoming requests is large, but abnormal when the number of requests is low.
Finally, our methodology can detect collective anomalies, which most non-distributional techniques are unable to detect.
We evaluate our method on a number of data sets, including synthetic, publicly available, and AWS-internal data sets, and show that our method compares favorably to the state of the art. While we developed our methodology for the cloud monitoring setting, we further show that it is competitive in classical anomaly detection settings.
2 Motivation
In the following, we motivate our distributional time series modeling approach from two angles: the data generation process of request-driven metrics, and high-frequency time series.
In a typical (micro)service monitoring setup, a metric datum is emitted for each request handled by the service, containing information about e.g. the processing time for the request and whether it resulted in a success or failure (and potentially more fine-grained information about the request and the response). The raw monitoring data is thus a stream of events, where each event is a tuple consisting of a timestamp and a set of measurements. As a measurement is triggered for each incoming request, the timestamps are not equally spaced, and for large services one may collect hundreds of thousands of events per minute. To facilitate further processing and modeling, the typical anomaly detection pipeline starts with a temporal aggregation step, where the event data is aggregated into fixed-size time intervals (e.g. one minute), recovering the classical, equally spaced time series setting. This aggregation of events requires choosing a meaningful statistic that summarizes all measurements within a given time interval, while allowing detection of abnormal behaviors. Commonly used summary statistics are e.g. the mean (or median) and extreme percentiles (e.g. the maximum, minimum, or the 99th percentile). However, the summary statistics chosen ultimately determine the range of anomalies one can detect, and one risks missing anomalies if the statistics are chosen inappropriately. For example, in Figure 1, we show three different quantiles of latency measurements of an internal service handling a large number of requests. We observe that if we choose to monitor the median or the 5% quantile, we would miss the anomaly that can be seen in the 95% quantile.
The method we propose here embraces this event-based data generation process by considering the entire distribution of measurements within each time interval. This means considering time series of equally spaced “points” in time, but where each “point” is a probability distribution, called a distributional data point. This is in contrast to most classical anomaly detection approaches, which take a time series of equally spaced, real-valued data points as input and do not explicitly model the temporal data aggregation step.
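As a concrete sketch of this aggregation step, the event stream can be bucketed into fixed-size intervals while keeping all measurements per interval. The function name and the one-minute default are illustrative only, not part of any production pipeline:

```python
from collections import defaultdict

def aggregate_events(events, interval=60.0):
    """Group raw (timestamp, value) monitoring events into fixed-size
    time intervals, keeping all measurements in each interval rather
    than reducing them to a single summary statistic."""
    buckets = defaultdict(list)
    for ts, value in events:
        buckets[int(ts // interval)].append(value)
    # Each interval now carries an empirical sample of the measurement
    # distribution -- one "distributional data point" per interval.
    return {k: sorted(v) for k, v in buckets.items()}
```

Any summary statistic (mean, 99th percentile, ...) can still be derived from these buckets, but the full empirical distribution is retained for the distributional model.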
Even though the proposed method was originally designed for the particular nature of the data described above, we demonstrate highly competitive performance even in the “classical” setting, where the starting point is a time series of real values sampled at a regular frequency. We discuss this via the example of high-frequency time series, arising for example from measuring the CPU utilization or temperature of a compute node every second. Our approach solves several difficulties specific to such metrics.
The main challenge one faces when performing anomaly detection for high-frequency data is that the temporal dynamics governing the data evolve at a slower pace than the frequency of observation. In typical application settings, meaningful variations in metrics are expected to occur from one hour to the next, but not every second. The underlying dynamics can thus often be adequately described using a one-hour or half-hour time granularity, with seasonal patterns that are daily, weekly, or even monthly. However, both classical and deep-learning-based time series models are commonly unable to model long-range dependencies (measured in number of observations), so if high-frequency data is modeled directly, these models commonly fail to capture medium- and long-term patterns.
Our approach allows modeling the temporal evolution at a more appropriate frequency by aggregating the observations, while retaining the ability to detect anomalies at the original frequency by modeling the distribution of observations within each time interval. Within each aggregated time interval, we treat the high-frequency data points as samples from this distribution. As an illustration, consider a time series with a one-minute sampling frequency. If aggregated hourly, this yields 60 observations per aggregated time interval. Based on the preceding one-hour data distributions, our approach predicts the distribution of the observations for the hour to come. Then, at each minute of that hour, we compute the likelihood of the current observation, which is used to determine whether it is anomalous or not. We can also compute the joint likelihood of the past observations in the hour, allowing us to detect collective anomalies. An example of such an anomaly is given in Figure 2. We can see that the variance of the data distribution decreases drastically: individually, each observation falls into a high-density region under the distribution of recent observations and does not appear as an outlier; however, observing these values in a row is highly improbable. Classical time series anomaly detection algorithms are not able to detect such anomalies.

3 Model
The backbone of our anomaly detection technique is a deep probabilistic distributional time series model, i.e. a probabilistic model for a time series $F_1, F_2, \dots$, where each element $F_t$ is a probability distribution over real values. See Figure 3 for an illustration. This is in contrast to the traditional setting, where the time series elements themselves are real values. In the following, we introduce the necessary notation and tools used in the rest of the paper.
We start by briefly giving a high-level description of the generic anomaly detection methodology with a probabilistic model. More details can be found, for example, in [6].
Anomaly detection with probabilistic models. One generic strategy for anomaly detection using probabilistic models is to compute the predictive distribution under the model, and mark an observation as anomalous if its probability under this distribution is low. In the time series setting, i.e. where the observations form a generic time series $y_1, y_2, \dots$ taking values in some space $\mathcal{Y}$, a probabilistic time series model aims at modeling $p(y_{t+1} \mid y_1, \dots, y_t)$, i.e. the conditional distribution of the next value given the history of observations. This conditional model can take various forms, the most commonly used one being a parametric model whose parameters are determined by the preceding observations. Once we construct an estimate of this predictive distribution, we can use it to define a credible region and mark any observation that falls outside of this credible region as anomalous.

In the classical time series anomaly detection setting, the observation space $\mathcal{Y}$ is taken to be $\mathbb{R}$ (or $\mathbb{R}^d$), whereas here we take it to be the space of probability distributions on $\mathbb{R}$.
3.1 Distributional Time Series
Let $F_1, F_2, \ldots, F_T$ be a time series of univariate probability distributions, represented by their cumulative distribution functions (CDFs). We assume that the support of all $F_t$ is a bounded interval $[a, b]$. (In practice, we need to determine these bounds for each monitored time series; one strategy is to choose them as a function of extreme values observed historically.) Even though these distributions are the objects of interest, we usually do not have access to them directly. Because of this, we also consider the scenario where we observe the $F_t$ only indirectly through samples, i.e. at each time $t$ a set of $N_t$ i.i.d. samples from $F_t$ is observed. Depending on the value of $N_t$, we can differentiate three real-world use cases:

Monitoring services with frequent requests: This corresponds to the setting described in Section 2, where for each time interval (e.g. each minute) the number of measurements $N_t$ is large. The underlying distributions can then be estimated with high enough precision for us to consider that they are directly observed. We will also refer to this as the asymptotic setting, since it corresponds to our model when $N_t \to \infty$.

High-frequency time series: This corresponds to the setting where the temporal resolution of the original time series is higher than the scale at which meaningful temporal variation occurs. As modeling long-range dependencies is challenging for both classical and deep-learning-based time series models, it can be desirable to aggregate the observations to a lower, more meaningful time granularity. This leads to the case where $N_t > 1$ but small, e.g. when aggregating from seconds to minutes. In this setting $N_t$ is usually constant, as the underlying time series will typically have a regular time granularity.

Low-frequency time series: We also consider the setting $N_t = 1$, where our model reduces to a classical probabilistic time series model over real-valued observations. Even though this is not the setting for which our approach was originally designed, we will show in the experiment section that it still yields competitive results.
Our model handles all three settings. We will refer to the last two scenarios as the finite settings, in contrast to the first one. In the asymptotic setting, the distributions $F_t$ are observed directly. In the finite settings, the $F_t$ are unobserved and we only observe samples from them. Therefore, we need to be able to assess the likelihood of $F_t$ in the asymptotic regime, and the likelihood of the samples, with $F_t$ marginalized out, in the finite regimes.
3.2 Probabilistic Model on Binned Densities
A common approach to modeling distributional data is to represent the functions of interest (e.g. the CDFs or PDFs) by a point in a carefully chosen finite-dimensional space. In this work, we consider the space of piecewise linear functions to approximate the CDFs, or equivalently, the space of binned (piecewise-constant) distributions to approximate the PDFs. (Other choices exist, which make different trade-offs. In the field of functional data analysis, where the objects of interest are general functions rather than probability distributions, a common choice for the finite-dimensional representation are coordinates with respect to a truncated orthonormal basis of functions, composed of e.g. sinusoidal, wavelet, or functional principal components. These function bases are not natural when modeling probability distributions, as it is non-trivial to enforce the required constraints: non-negativity and monotonicity for CDFs, non-negativity and unit integral for PDFs. Previous approaches employing such representations usually resort to a post-processing step that enforces these constraints. Another finite-dimensional representation is common in the probabilistic modeling literature, where densities are often modeled as a finite mixture of base densities, typically chosen to be Gaussian. This representation is by construction suitable for distributional data; however, obtaining such a decomposition is computationally expensive, even for a single given density.) Specifically, we choose to approximate each CDF by a piecewise linear function composed of $K$ linear pieces. A given function in this class is specified by two sets of parameters: the start and end points of the linear pieces (the knot positions), and the slopes in each segment. While it is possible to adapt the knot positions dynamically (as done in [7]), we keep the knot positions fixed and only model the temporal evolution of the slopes within each segment.
We divide $[a, b]$ into $K$ bins using the grid $a = b_0 < b_1 < \cdots < b_K = b$. Let $\tilde F_t$ be the piecewise linear CDF that interpolates the points $(b_k, F_t(b_k))_{k=0}^{K}$. Therefore, the corresponding density function is piecewise constant, and the probability of falling into one of the bins is given by

(1)  $p_{t,k} = \tilde F_t(b_k) - \tilde F_t(b_{k-1}), \qquad k = 1, \dots, K.$
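As an illustration of equation (1), the bin probabilities follow directly from the CDF values at the knots; this is a minimal sketch with hypothetical helper names:

```python
def bin_probabilities(cdf_values):
    """Given the CDF evaluated at the grid knots b_0 < ... < b_K
    (so cdf_values[0] == 0.0 and cdf_values[-1] == 1.0), return the
    K bin probabilities p_k = F(b_k) - F(b_{k-1})."""
    return [b - a for a, b in zip(cdf_values, cdf_values[1:])]

def piecewise_linear_cdf(x, grid, cdf_values):
    """Evaluate the piecewise linear CDF interpolating the points
    (grid[k], cdf_values[k])."""
    if x <= grid[0]:
        return 0.0
    if x >= grid[-1]:
        return 1.0
    for k in range(1, len(grid)):
        if x <= grid[k]:
            # linear interpolation inside bin k
            w = (x - grid[k - 1]) / (grid[k] - grid[k - 1])
            return cdf_values[k - 1] + w * (cdf_values[k] - cdf_values[k - 1])
```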
Specifying a distribution on the $K$-dimensional probability vector $p_t = (p_{t,1}, \dots, p_{t,K})$ entails a distribution over the piecewise linear CDFs $\tilde F_t$. We model this distribution over probability vectors using a Dirichlet distribution, i.e. $p_t \sim \mathrm{Dir}(\alpha_t)$, where $\alpha_t \in (0, \infty)^K$ denotes the concentration parameter whose temporal evolution is modeled using an RNN.
Before proceeding, let us emphasize that one can approximate any $F_t$ arbitrarily well as the grid becomes finer ($K$ becomes larger). Indeed, for any random variable $X$ with distribution $F_t$, and $\tilde X$ with distribution $\tilde F_t$ (its piecewise linear approximation as defined above), $\tilde X$ converges in distribution to $X$ as $K \to \infty$. Based on this remark, we simplify the setting, drop the notation $\tilde F_t$, and assume that the $F_t$ themselves are piecewise linear. To simplify notation further (and without loss of generality), we take $[a, b] = [0, 1]$.

Given $F_t$ as described above, the probability of a single observation $x$ given $p_t$ is $p_{t,k}$ if it falls in the interval $(b_{k-1}, b_k]$; in other words, the bin index of $x$ follows the categorical distribution $\mathrm{Cat}(p_t)$, where $\mathbb{1}\{A\}$ denotes the indicator of the event $A$ and $\mathrm{Cat}(p_t)$ refers to the categorical distribution with parameter $p_t$. Hence, the count vector $z_t$ with $z_{t,k} = \sum_{i=1}^{N_t} \mathbb{1}\{x_{t,i} \in (b_{k-1}, b_k]\}$ is a sufficient statistic, and the likelihood for a set of observations $x_{t,1}, \dots, x_{t,N_t}$ is given by
(2)  $z_t \mid p_t \sim \mathrm{Mult}(N_t, p_t),$

where $\mathrm{Mult}(N_t, p_t)$ refers to the Multinomial distribution with $N_t$ trials and outcome probabilities $p_t$. Since we take $p_t$ to be Dirichlet-distributed, it can be marginalized out and we have a closed-form probability mass function for the observations $z_t$. More precisely, $z_t$ follows a Dirichlet-Multinomial distribution $\mathrm{DirMult}(N_t, \alpha_t)$ with number of trials $N_t$ and concentration vector $\alpha_t$, with probability mass function

$P(z_t \mid \alpha_t) = \frac{N_t! \, \Gamma(A_t)}{\Gamma(N_t + A_t)} \prod_{k=1}^{K} \frac{\Gamma(z_{t,k} + \alpha_{t,k})}{z_{t,k}! \, \Gamma(\alpha_{t,k})}, \qquad A_t = \sum_{k=1}^{K} \alpha_{t,k}.$
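The Dirichlet-Multinomial log pmf can be evaluated with standard log-gamma functions. This is a sketch for illustration; the function name is ours, not taken from our implementation:

```python
import math

def dirichlet_multinomial_logpmf(counts, alpha):
    """Log pmf of the Dirichlet-Multinomial distribution with
    count vector `counts` and concentration vector `alpha`."""
    n, a = sum(counts), sum(alpha)
    # normalization: log N! + log Gamma(A) - log Gamma(N + A)
    logp = math.lgamma(n + 1) + math.lgamma(a) - math.lgamma(n + a)
    # per-bin terms: log Gamma(z_k + alpha_k) - log z_k! - log Gamma(alpha_k)
    for z, ak in zip(counts, alpha):
        logp += math.lgamma(z + ak) - math.lgamma(z + 1) - math.lgamma(ak)
    return logp
```

With a uniform concentration vector (all ones), the distribution is uniform over all count vectors summing to $N$, which provides a convenient sanity check.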
To summarize, given $\alpha_t$, the likelihood of the observation is:

(Asymptotic setting)  $\hat p_t \sim \mathrm{Dir}(\alpha_t)$
(Finite setting)  $z_t \sim \mathrm{DirMult}(N_t, \alpha_t)$

where, as explained previously, in the asymptotic regime we suppose that we directly observe $\hat p_t$, which is equal to the normalized counts $z_t / N_t$.
3.3 RNN Temporal Dynamics Model
In both settings, the temporal evolution of the data is described through the time-varying parameter $\alpha_t$, and it is this dynamic behavior that we aim to learn. To do so, we use an autoregressive LSTM-based recurrent neural network, whose architecture follows the one described in [8]. Figures 4 and 5 in the appendix illustrate the model’s architecture for the training and prediction steps.
Recurrent neural networks (RNNs) form a class of artificial neural networks designed to handle sequential data. They have been successfully used in a wide range of applications such as time series, natural language processing, and speech recognition. One of the key benefits of RNNs is their ability to handle sequences of varying lengths. RNNs sequentially update a hidden state $h_t$: at every time step $t$, the next hidden state is computed using the previous one and the next input (in a time series this can be the next observation and other covariates). A crucial detail is that the weights of the network are shared across time steps, which makes the RNN recurrent and capable of handling sequences of varying length. The hidden states can be seen as a dynamic memory of a feature representation of the raw input. This compact representation makes them amenable to streaming settings. Here, we mainly rely on long short-term memory networks (LSTMs), arguably the most popular subclass of RNNs.
Let $y_{1:T}$ be the sequence of observations, either $\hat p_{1:T}$ or $z_{1:T}$ depending on the setting. Denote by $\Theta$ the parameters of the RNN model. Given a horizon $\tau$, the aim is to predict the probability distribution of future trajectories $y_{T+1:T+\tau}$, with the potential use of observed covariates $c_{1:T+\tau}$.
The parameter $\alpha_t$ is a function of the output $h_t$ of an autoregressive recurrent neural network with

(3)  $h_t = \Psi_\Theta(h_{t-1}, y_{t-1}, c_t)$
(4)  $\alpha_t = \phi_\Theta(h_t)$

where $\Psi_\Theta$ is a multilayer recurrent neural network with LSTM cells. The model is autoregressive and recurrent in the sense that it uses, respectively, the observation at the last time step $y_{t-1}$ and the previous hidden state $h_{t-1}$ as input. Then a layer $\phi_\Theta$ projects the output to $(0, \infty)^K$, the domain of $\alpha_t$. The parameters of the model are chosen to minimize the negative log-likelihood:

$\mathcal{L}(\Theta) = -\sum_{t=1}^{T} \log P(y_t \mid \alpha_t).$
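A toy sketch of one unrolled step of equations (3) and (4) follows. A plain tanh cell stands in for the multilayer LSTM of the actual model, and all names and shapes are illustrative; the key points are the shared weights across time steps and the positivity of the emitted concentration vector:

```python
import math

def softplus(x):
    # numerically stable softplus: maps R -> (0, inf)
    return math.log1p(math.exp(-abs(x))) + max(x, 0.0)

def matvec(W, v):
    return [sum(wij * vj for wij, vj in zip(row, v)) for row in W]

def rnn_step(y_prev, h_prev, Wh, Wx, b, Wo, bo):
    """One autoregressive step: consume the previous observation and
    hidden state, emit the next hidden state and a positive
    concentration vector alpha (a stand-in for eqs. (3)-(4))."""
    pre = [u + v + c for u, v, c in
           zip(matvec(Wh, h_prev), matvec(Wx, y_prev), b)]
    h = [math.tanh(u) for u in pre]          # toy cell in place of LSTM
    alpha = [softplus(u + c) for u, c in zip(matvec(Wo, h), bo)]
    return h, alpha
```

Unrolling `rnn_step` over a sequence with the same weight matrices at every step gives the autoregressive model; the softplus projection ensures $\alpha_t$ lies in $(0, \infty)^K$.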
Finally, we note that when dealing with anomaly detection we only require a time horizon of $\tau = 1$.
3.4 Anomaly Detection with Level Sets
Once we forecast $\alpha_{t+1}$, we can assess whether the next observation is a potential anomaly. Indeed, given $\alpha_{t+1}$, we know the distribution of the random variable $Z_{t+1}$, of which the observation should be a sample if no anomaly happened. We can consequently compute a credible region $C_{1-\epsilon}$ with total mass $1 - \epsilon$ for a given level $\epsilon$. Then, if the observation falls outside $C_{1-\epsilon}$, we say that it is an anomaly. The difficulty one faces when considering credible regions is that they are not unique. Even though this problem already exists in the univariate setting, there it can easily be circumvented and natural credible intervals can be designed. In a multivariate setting, this issue is more challenging and one needs to choose meaningful credible regions. The credible regions we consider are the level sets of the likelihood, defined by

$L_\lambda = \{ z : P(z \mid \alpha_{t+1}) \ge \lambda \}.$

We then take $\lambda_\epsilon$ such that $P(Z_{t+1} \in L_{\lambda_\epsilon}) = 1 - \epsilon$ and set $C_{1-\epsilon} = L_{\lambda_\epsilon}$. In other words, the credible region is the highest density region that achieves a total mass of $1 - \epsilon$, and the observation $z$ is considered an anomaly if $P(z \mid \alpha_{t+1}) < \lambda_\epsilon$. The remaining difficulty is to compute $\lambda_\epsilon$. This theoretically requires computing the mass of the level sets as a function of $\lambda$, and then inverting that function. When the number of possible outcomes for $Z_{t+1}$ is finite and relatively small, this can be done exactly by computing the likelihoods of all outcomes. Otherwise, we approximate the inverse by means of a Monte Carlo method, following an idea that goes back to [9]. If we consider the univariate random variable defined as $\Lambda = P(Z_{t+1} \mid \alpha_{t+1})$, we remark that $\lambda_\epsilon$ can be interpreted as the $\epsilon$-quantile of its distribution. Therefore, we construct the following estimator $\hat\lambda_\epsilon$: first we sample $M$ realizations of $Z_{t+1}$, then we compute the associated likelihoods, and finally take the $\epsilon$-quantile of their empirical distribution.
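The quantile-based Monte Carlo estimator just described can be sketched as follows. The sampler and function names are illustrative; we draw from the predictive Dirichlet-Multinomial by first sampling a probability vector from the Dirichlet (via normalized Gamma draws) and then a count vector from the Multinomial:

```python
import math
import random

def dm_logpmf(counts, alpha):
    # Dirichlet-Multinomial log probability mass function
    n, a = sum(counts), sum(alpha)
    logp = math.lgamma(n + 1) + math.lgamma(a) - math.lgamma(n + a)
    for z, ak in zip(counts, alpha):
        logp += math.lgamma(z + ak) - math.lgamma(z + 1) - math.lgamma(ak)
    return logp

def sample_counts(alpha, n_trials, rng):
    # p ~ Dirichlet(alpha) via normalized Gamma draws
    g = [rng.gammavariate(a, 1.0) for a in alpha]
    s = sum(g)
    p = [x / s for x in g]
    # z ~ Multinomial(n_trials, p) by inverse-CDF sampling
    counts = [0] * len(alpha)
    for _ in range(n_trials):
        u, acc = rng.random(), 0.0
        for k, pk in enumerate(p):
            acc += pk
            if u <= acc:
                counts[k] += 1
                break
        else:
            counts[-1] += 1  # guard against floating-point round-off
    return counts

def mc_level_threshold(alpha, n_trials, eps, n_samples=2000, seed=0):
    """Estimate the level-set threshold lambda_eps: sample count vectors
    from the predictive distribution, score them, and take the
    eps-quantile of the empirical log-likelihood distribution."""
    rng = random.Random(seed)
    lls = sorted(dm_logpmf(sample_counts(alpha, n_trials, rng), alpha)
                 for _ in range(n_samples))
    return lls[int(eps * (n_samples - 1))]
```

An observed count vector `z` is then flagged as anomalous when `dm_logpmf(z, alpha)` falls below the estimated threshold.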
For the asymptotic or the low-frequency setting, we simply apply the approach described above. For the high-frequency setting, we use a two-stage approach, which we describe via the following illustrative example. Suppose that we observe a minute-frequency time series and are interested in hourly aggregation, and that we have $T$ hours of observations, which when aggregated give $T$ sets of $N = 60$ observations each. From the forecasting module, we predict $\alpha_{T+1}$ and hence the distribution of the observations for the hour to come. Every minute we obtain a new observation. In the first stage, before the hour is over, we assess every minute whether the current observation is anomalous. For this stage, there are only $K$ possible outcomes, since the observation will fall in one of the $K$ possible bins; therefore, we can compute the level sets exactly without Monte Carlo estimation. Once the hour is over, we assess whether the past $N$ observations jointly constitute a collective anomaly. This is done by constructing the count vector and checking whether it falls within the credible region. Here, since the number of possible outcomes is too large, we use a Monte Carlo estimate. If we want to detect collective anomalies that are shorter, we can add an intermediate stage: for example, every $m$ minutes we can assess whether the previous $m$ observations are jointly anomalous. Finally, as explained in the experiment section, we need to assign an anomaly score to each time point to evaluate the models. The score used is the logarithm of the p-value, which is the smallest $\epsilon$ for which a given point is considered an anomaly. For the two-stage strategy, we simply add the two scores.
4 Experiments
Our implementation (the code is available at https://github.com/awslabs/gluonts/tree/distribution_anomaly_detection/distribution_anomaly_detection) is based on GluonTS [10], which in turn is based on MXNet [11]. Since we inherit scalability for both training and inference from MXNet, we do not explicitly perform experiments with respect to computational efficiency and scalability, and instead report the following details. Note that we learn a global model (across all metrics), which takes roughly 3 minutes per 100 metrics. For such models, we do not have to retrain often, so we may disregard the training time for the production scenario. Inference is embarrassingly parallel. Scoring a single data point takes 1 ms for 1-minute aggregated data (note that we do not perform the costly Monte Carlo estimates at every time point). We can limit the memory consumption of the models to a fixed size of 80 kB per metric. We note that our model runs in a streaming setting where data is provided by Amazon Kinesis. We can process approximately 65k metrics per minute on a standard 16-core EC2 instance and scale out horizontally to millions of metrics effortlessly.
For all the experiments we learn the parameters of the model on the learning time range, and we perform anomaly detection on the detection time range. We fix the prediction length to $\tau = 1$ and perform anomaly detection in a streaming way: observing $y_{1:t}$, we predict the distribution of $y_{t+1}$, then assess the likelihood of the next observation. Then, knowing $y_{t+1}$, we predict the distribution of $y_{t+2}$ and compare it to the next observation, and so on. We consider two different grids to define the bins. The first one is the simple regular grid with equally spaced bin edges. The second grid is obtained using the quantiles of the marginal distribution: in the asymptotic case we compute the average of the observed CDFs, and in the finite setting we concatenate all the sets of observations; then, for both cases, we take the quantiles at regularly spaced levels. Depending on the problem, the regular or the quantile grid can be better. Indeed, the quantile grid can give poor results if some distributions are very different from the marginal, since this approach can leave large regions with very few bins. Another approach could be to consider a time-varying grid, for example using the quantiles of the marginal distribution of each day of the week, but we have not experimented with this approach yet.
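The two grid constructions can be sketched as follows; the function names are illustrative, and the quantile grid is shown for the finite setting, where all sets of observations are pooled:

```python
def regular_grid(lo, hi, n_bins):
    """Regularly spaced bin edges b_0 < ... < b_K on [lo, hi]."""
    step = (hi - lo) / n_bins
    return [lo + k * step for k in range(n_bins + 1)]

def quantile_grid(pooled, n_bins):
    """Bin edges at regularly spaced quantile levels of the pooled
    (marginal) sample, so each bin receives roughly equal mass."""
    xs = sorted(pooled)
    edges = []
    for k in range(n_bins + 1):
        # nearest-rank empirical quantile at level k / n_bins
        idx = min(int(k / n_bins * (len(xs) - 1)), len(xs) - 1)
        edges.append(xs[idx])
    return edges
```

On data whose marginal mass is concentrated in a small region, the quantile grid places most edges there, which is exactly the behavior that can backfire when individual distributions differ strongly from the marginal.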
4.1 Evaluation metric
For comparing the different models we use the area under the receiver operating characteristic curve (ROC-AUC). It is a metric commonly used in classification problems to compare algorithms whose performance depends on selecting a threshold. This measure quantifies how well a model is able to distinguish between the two classes. It takes values between 0 and 1, the higher the better. This score is independent of the threshold chosen, since it only considers the ranking of the observations by the model in terms of how anomalous they look. It therefore quantifies the maximum potential of a method.

4.2 Synthetic data
Let and , where is a period length and are i.i.d. noise. We consider the following two dynamics:

DS1:

DS2:
We consider a learning range of time points (corresponding to approximately two months of hourly data) and a detection time horizon of . In the detection time range, we add an anomaly with a fixed probability at each location independently. For each experiment, we use one of two types of anomalies: a sudden distributional shift (adding 1 to the mean), or a distributional collapse (reducing the standard deviation, as in Figure 2 in the introduction). We therefore get four different settings, which we denote DS1 (mean shift), DS1 (std collapse), DS2 (mean shift), and DS2 (std collapse). We set the threshold for anomaly detection to be . For the Monte Carlo approximation of the level set, we take $M$ samples from the predictive distribution of the log-likelihoods and estimate the corresponding quantile. An observation can then be considered anomalous in two cases. In the first case, the generated noise term falls outside the credible region: these can be considered statistical anomalies, and if the model perfectly captures the generating process, this should happen a fraction $\epsilon$ of the time on average; these are false positives. The second case corresponds to the anomalies that are artificially added, which can be considered malfunctions, or true positives.
In every experiment, we repeat the whole process (data generation, learning, and anomaly detection) several times to get an idea of the variability of the results.
Asymptotic setting
In this setting, similarly to popular services on AWS, we suppose that we have access to a grid of a thousand quantiles of $F_t$ at each time step, and construct the distributional observations from these quantiles. For this setting, we take a regular grid of bins, which is enough to detect the malfunctions with high accuracy. We report the results in Table 1. The proportion of statistical anomalies is computed on the set of time points that are not malfunctions; this corresponds to the False Positive Rate (FPR). We can see that the proportion of such anomalies is stable around the nominal level, as it should be. The introduction of malfunctions does not seem to impact it much. We can also notice that for dataset 1, where noise is added to the mean in the generating process, the algorithm is slightly worse at detecting distributional shifts. Similarly, in dataset 2, where noise is added to the standard deviation, we see decreased performance in detecting distributional collapse.
  False Positive Rate  Recall

DS1 (mean shift)
DS1 (std collapse)
DS2 (mean shift)
DS2 (std collapse)
Finite setting. In this setting, we observe $N = 60$ samples from every distribution $F_t$, which corresponds to hourly aggregation of minute data. Here the task is more complex. We take a quantile grid of bins. With such a small dimension, the model cannot always capture the changes in the trend; however, the relatively small amount of data prevents us from taking a larger $K$, contrary to the asymptotic case. In most practical settings, we are able to take a much larger $K$, since we can make use of multiple time series simultaneously, even though they represent different metrics (CPU usage, latency, number of connected users, etc.). But for this synthetic setting, we restrict our method by learning and performing anomaly detection one time series at a time.
The objective of this experiment is to see how well our approach performs compared to the standard approach of monitoring an aggregated statistic. We use two state-of-the-art open-source anomaly detection algorithms, namely Luminol and TwitterAD, as competitors. These algorithms are run on the appropriate aggregated statistics: for the mean malfunctions we use the time series of empirical means (per hour), and for the variance malfunctions we use the empirical standard deviations. While we use the right aggregated statistic for the injected anomalies, we note that in a practical setting we do not know which statistic is most appropriate to monitor.
The results are reported in Table 2. From this experiment we see that, in general, it is harder to detect standard deviation malfunctions, with the worst case being dataset 2 with collapse malfunctions. The most plausible interpretation is that higher-order statistics are more difficult to estimate with a finite number of samples.
  Distribution  TwitterAD  Luminol

DS1 (mean shift)  0.9928  0.9998  0.9400
DS1 (std collapse)  0.9864  0.5010  0.9691
DS2 (mean shift)  0.9973  0.9999  0.9596
DS2 (std collapse)  0.9797  0.4990  0.9456
4.3 Yahoo Webscope Dataset
Yahoo Webscope is an open dataset often used as a benchmark for anomaly detection, since it is labeled. It is composed of 367 time series, varying in length from 700 to 1700 observations. Some of these time series come from real traffic to Yahoo services and some are synthetic. The dataset is divided into four sub-benchmarks, from A1 to A4. The time frequency of all the time series is one hour. Since the frequency is relatively low, and since there are no collective anomalies in this dataset, we take $N_t = 1$. This setting corresponds to the regular anomaly detection setting, where we only have one observation per time step. We report the results of [12] to compare the performance of our approach with state-of-the-art anomaly detection algorithms. We report the results per sub-benchmark, since they contain different patterns. We use an initial fraction of each time series for training. We learn a single model for all the series of a same sub-benchmark, which means that we train the model on all the time series simultaneously. The hyperparameters are selected based on a few runs on the training set (which has been subdivided into training and validation sets). The results are given in Table 3. Here, since $N_t = 1$, the total number of possible outcomes is equal to $K$; therefore, we do not need Monte Carlo estimates.
Benchmark  iForest  OCSVM  LOF  PCA  TwitterAD  DeepAnT  FuseAD  Distribution 

A1  0.9471  0.9435  
A2  0.9999  
A3  0.9987  0.9988  
A4  0.9701 
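When the predictive distribution is discrete with few possible outcomes, the score can be computed exactly by enumeration instead of by Monte Carlo sampling. The exact statistic is not spelled out here, so the following is one standard construction (the total mass of outcomes no more likely than the observed one), offered as an illustrative assumption:

```python
def exact_tail_score(pmf, observed):
    """Plausibility of `observed` under a discrete predictive distribution
    `pmf` (a dict mapping outcome -> probability): the total mass of all
    outcomes that are no more likely than the observed one. Low values
    flag anomalies. With few possible outcomes this sum is exact, so no
    Monte Carlo approximation is needed."""
    p_obs = pmf[observed]
    return sum(p for p in pmf.values() if p <= p_obs)
```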
4.4 AWS data
Finally, we consider three benchmark datasets of high-frequency time series collected from AWS. These datasets are often used internally at Amazon to compare models. All datasets are composed of different metrics, among them CPU usage, latency, and number of users; the benchmarks have minute-level time frequencies, with B1 differing from B2 and B3 in frequency and number of series. We use the first part of the time range for training and the remainder for detection, and aggregate all time series to a common frequency of a few minutes. The quality of the labeling is heterogeneous, with B1 being the most reliable. Some examples of inaccurate labeling from the third benchmark can be found in Figure 7 in the appendix: anomalies identified by our method count as false positives under the labels but should probably be counted as true positives. We perform a two-stage anomaly detection: the first stage scores single observations, the second scores the collection of observations within each half-hour, and we simply add the two scores to obtain the final score. We again compare to Luminol and TwitterAD. The results, reported in Table 4, show that our method dominates on this dataset.
Benchmark  Distribution  TwitterAD  Luminol

B1  0.8183  0.7134  0.6467
B2  0.7534  0.5895  0.5804
B3  0.6860  0.5889  0.5860
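The two-stage score combination described above can be sketched as follows; `window` (the number of observations per half-hour) and the two score arrays are placeholders for the quantities produced by the two detection stages:

```python
def combined_scores(point_scores, window_scores, window):
    """Final score of each observation: its own point-level score plus the
    score of the window (e.g., the half-hour) that it falls into."""
    return [p + window_scores[i // window] for i, p in enumerate(point_scores)]
```

Summing the two scores lets an observation be flagged either because it is individually surprising or because the window it belongs to is collectively surprising.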
5 Related work
Anomaly detection is a rich field with many applications and solutions available (see for example [1, 2, 3, 4, 5, 13, 14, 15, 16, 17]). Since we consider a specific anomaly detection method for time series, we focus our discussion of related work on methods (as opposed to systems or solutions) and, more specifically, on unsupervised anomaly detection models. Such methods are appropriate in our scenario because labels are sparse in industrial settings like ours, with millions of time series. However, we acknowledge that supervised and semi-supervised methods would be an interesting avenue for future work [18].
Most related to our approach are anomaly detection methods using (probabilistic) time series or forecasting models. Overviews of forecasting models can be found in recent tutorials [6, 19]. These approaches have the advantage of providing an interpretable and normalized score (e.g., how likely a point is under the model), which allows thresholds to be tuned over large panels of time series simultaneously. Other approaches, e.g., [20, 21], do not allow for this and hence we do not consider them further.
Such an anomaly detection algorithm first estimates the parameters of a time series or forecasting model. The score of a point in the time series is then a quantification of the distance between the prediction and the observation (for example, a distance metric, or the likelihood of the observation if the model is probabilistic). The choice of time series model differs in the literature. The most common models are the classical ARIMA (see for example [22, 23] for applications) and, increasingly, deep learning based approaches as we apply here [24, 25, 12]. Even though attention-based models [26] have become the state of the art in the more general area of sequence learning, our choice of an RNN similar to [8, 7] is motivated by the streaming setting that we consider: the compact model state of an RNN is well suited to streaming, whereas attention-based models carry a prohibitively large state. For this reason we do not further consider them, or convolutional alternatives [27], here.
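The two scoring choices just mentioned can be sketched as follows, assuming for illustration a Gaussian one-step-ahead predictive distribution N(mu, sigma^2):

```python
import math

def l2_score(x, mu):
    """Distance-based score: squared error between prediction and observation."""
    return (x - mu) ** 2

def nll_score(x, mu, sigma):
    """Likelihood-based score for a probabilistic model: negative
    log-likelihood of x under a Gaussian predictive N(mu, sigma^2)."""
    return 0.5 * math.log(2 * math.pi * sigma ** 2) + (x - mu) ** 2 / (2 * sigma ** 2)
```

For a fixed sigma the two scores rank points identically; the likelihood-based score additionally adapts to the predicted uncertainty, which is what makes it normalizable across large panels of series.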
Existing time series models that consider distributional data rely on Functional Data Analysis, which provides the necessary mathematical tools for analyzing such problems (see [28], [29], or [30] for an overview of existing nonparametric methods). These Functional Time Series (FTS) models are the most closely related to our problem: the learner observes a time sequence of functional data and tries to forecast the next functions. This framework and the resulting methods have many applications, ranging from the study of demographic curves (for example [31, 32]) to electricity price forecasting (see [33]). While these models allow data points to be more general functions than distributions, the restriction to distributions has led to further models [34, 35, 36, 37], and Bayesian variants have also been proposed [38, 39, 40].
Instead of time series models, it is also possible to disregard time dependencies and cast the time series prediction as a regression problem, which would allow employing distribution regression models such as [41, 42, 43, 44, 45, 46]. Given the strong autocorrelation that we observe in metrics data in our application, time dependence should not be disregarded, and we therefore do not consider this approach further. To the best of our knowledge, ours is the only anomaly detection method relying on distributional time series models.
6 Conclusion
We presented the first anomaly detection method based on deep distributional time series models. The development of this model was motivated by real-world anomaly detection data and use cases that we commonly find in monitoring cloud services. In our experiments on synthetic, public, and AWS-internal data, our method compares favorably to other anomaly detection offerings. Our method was designed for streaming scenarios such as those that occur when monitoring compute metrics, and it is fully elastic.
While labels for anomalies are sparse, imbalanced, and noisy, they nevertheless exist. Future work should consider improving the algorithms described here by incorporating labels during learning and acquiring them during production runs, leading to a continuously improving anomaly detection system.
References
 [1] Douglas M Hawkins. Identification of outliers, volume 11. Springer, 1980.
 [2] SM Bendre. Outliers in statistical data, 1994.
 [3] Victoria Hodge and Jim Austin. A survey of outlier detection methodologies. Artificial Intelligence Review, 22(2):85–126, 2004.
 [4] Varun Chandola, Arindam Banerjee, and Vipin Kumar. Anomaly detection: A survey. ACM Computing Surveys (CSUR), 41(3):1–58, 2009.
 [5] Charu C Aggarwal. Outlier analysis. In Data Mining, pages 237–263. Springer, 2015.
 [6] Christos Faloutsos, Jan Gasthaus, Tim Januschowski, and Yuyang Wang. Forecasting big time series: old and new. Proceedings of the VLDB Endowment, 11(12):2102–2105, 2018.
 [7] Jan Gasthaus, Konstantinos Benidis, Yuyang Wang, Syama Sundar Rangapuram, David Salinas, Valentin Flunkert, and Tim Januschowski. Probabilistic forecasting with spline quantile function RNNs. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 1901–1910, 2019.
 [8] David Salinas, Valentin Flunkert, Jan Gasthaus, and Tim Januschowski. DeepAR: Probabilistic forecasting with autoregressive recurrent networks. International Journal of Forecasting, 2019.
 [9] Rob J Hyndman. Computing and graphing highest density regions. The American Statistician, 50(2):120–126, 1996.
 [10] Alexander Alexandrov, Konstantinos Benidis, Michael Bohlke-Schneider, Valentin Flunkert, Jan Gasthaus, Tim Januschowski, Danielle C Maddix, Syama Rangapuram, David Salinas, Jasper Schulz, et al. GluonTS: Probabilistic time series models in Python. arXiv preprint arXiv:1906.05264, 2019.
 [11] Tianqi Chen, Mu Li, Yutian Li, Min Lin, Naiyan Wang, Minjie Wang, Tianjun Xiao, Bing Xu, Chiyuan Zhang, and Zheng Zhang. MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems. arXiv preprint arXiv:1512.01274, 2015.
 [12] Mohsin Munir, Shoaib Ahmed Siddiqui, Muhammad Ali Chattha, Andreas Dengel, and Sheraz Ahmed. FuseAD: Unsupervised anomaly detection in streaming sensors data by fusing statistical and deep learning models. Sensors, 19(11):2451, 2019.
 [13] Sebastian Schelter, Dustin Lange, Philipp Schmidt, Meltem Celikel, Felix Biessmann, and Andreas Grafberger. Automating largescale data quality verification. Proceedings of the VLDB Endowment, 11(12):1781–1794, 2018.
 [14] Xiaoou Ding, Hongzhi Wang, Jiaxuan Su, Zijue Li, Jianzhong Li, and Hong Gao. Cleanits: A data cleaning system for industrial time series. Proceedings of the VLDB Endowment, 12(12):1786–1789, 2019.
 [15] Jordan Hochenbaum, Owen S Vallis, and Arun Kejariwal. Automatic anomaly detection in the cloud via statistical learning. arXiv preprint arXiv:1704.07706, 2017.
 [16] Hansheng Ren, Bixiong Xu, Yujing Wang, Chao Yi, Congrui Huang, Xiaoyu Kou, Tony Xing, Mao Yang, Jie Tong, and Qi Zhang. Time-series anomaly detection service at Microsoft. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 3009–3017, 2019.
 [17] Aoqian Zhang, Shaoxu Song, Jianmin Wang, and Philip S Yu. Time series data cleaning: From anomaly detection to anomaly repairing. Proceedings of the VLDB Endowment, 10(10):1046–1057, 2017.
 [18] Jingkun Gao, Xiaomin Song, Qingsong Wen, Pichao Wang, Liang Sun, and Huan Xu. RobustTAD: Robust time series anomaly detection via decomposition and convolutional neural networks. arXiv preprint arXiv:2002.09545, 2020.
 [19] Christos Faloutsos, Jan Gasthaus, Tim Januschowski, and Yuyang Wang. Classical and contemporary approaches to big time series forecasting. In Proceedings of the 2019 International Conference on Management of Data, pages 2042–2047, 2019.

 [20] Sudipto Guha, Nina Mishra, Gourav Roy, and Okke Schrijvers. Robust random cut forest based anomaly detection on streams. In International Conference on Machine Learning, pages 2712–2721, 2016.
 [21] Chin-Chia Michael Yeh, Yan Zhu, Liudmila Ulanova, Nurjahan Begum, Yifei Ding, Hoang Anh Dau, Diego Furtado Silva, Abdullah Mueen, and Eamonn Keogh. Matrix profile I: All pairs similarity joins for time series: a unifying view that includes motifs, discords and shapelets. In 2016 IEEE 16th International Conference on Data Mining (ICDM), pages 1317–1322. IEEE, 2016.
 [22] H Zare Moayedi and MA Masnadi-Shirazi. ARIMA model for network traffic prediction and anomaly detection. In 2008 International Symposium on Information Technology, volume 4, pages 1–6. IEEE, 2008.
 [23] Asrul H Yaacob, Ian KT Tan, Su Fong Chien, and Hon Khi Tan. ARIMA based network anomaly detection. In 2010 Second International Conference on Communication Software and Networks, pages 205–209. IEEE, 2010.
 [24] Pankaj Malhotra, Lovekesh Vig, Gautam Shroff, and Puneet Agarwal. Long short term memory networks for anomaly detection in time series. In Proceedings, volume 89, pages 89–94. Presses universitaires de Louvain, 2015.
 [25] Zachary C Lipton, David C Kale, Charles Elkan, and Randall Wetzel. Learning to diagnose with LSTM recurrent neural networks. arXiv preprint arXiv:1511.03677, 2015.
 [26] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in neural information processing systems, pages 5998–6008, 2017.
 [27] Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016.
 [28] James O Ramsay. Functional data analysis. Encyclopedia of Statistical Sciences, 4, 2004.
 [29] Lajos Horváth and Piotr Kokoszka. Inference for functional data with applications, volume 200. Springer Science & Business Media, 2012.
 [30] Frédéric Ferraty and Philippe Vieu. Nonparametric functional data analysis: theory and practice. Springer Science & Business Media, 2006.
 [31] Rob J Hyndman and Md Shahid Ullah. Robust forecasting of mortality and fertility rates: a functional data approach. Computational Statistics & Data Analysis, 51(10):4942–4956, 2007.
 [32] Rob J Hyndman and Han Lin Shang. Forecasting functional time series. Journal of the Korean Statistical Society, 38:199–211, 2009.
 [33] José Portela González, Antonio Munoz San Roque, and Estrella Alonso Perez. Forecasting functional time series with a new Hilbertian ARMAX model: Application to electricity price forecasting. IEEE Transactions on Power Systems, 33(1):545–556, 2017.
 [34] Joon Y Park and Junhui Qian. Functional regression of continuous state distributions. Journal of Econometrics, 167(2):397–412, 2012.
 [35] Yoosoon Chang, Chang Sik Kim, and Joon Y Park. Nonstationarity in time series of state densities. Journal of Econometrics, 192(1):152–167, 2016.
 [36] Yoosoon Chang, Robert K Kaufmann, Chang Sik Kim, J Isaac Miller, Joon Y Park, and Sungkeun Park. Evaluating trends in time series of distributions: A spatial fingerprint of human effects on climate. Journal of Econometrics, 214(1):274–294, 2020.
 [37] Piotr Kokoszka, Hong Miao, Alexander Petersen, and Han Lin Shang. Forecasting of density functions with an application to crosssectional and intraday returns. International Journal of Forecasting, 35(4):1304–1317, 2019.
 [38] Francois Caron, Manuel Davy, Arnaud Doucet, Emmanuel Duflos, and Philippe Vanheeghe. Bayesian inference for linear dynamic models with dirichlet process mixtures. IEEE Transactions on Signal Processing, 56(1):71–84, 2007.
 [39] Abel Rodriguez, Enrique Ter Horst, et al. Bayesian dynamic density estimation. Bayesian Analysis, 3(2):339–365, 2008.
 [40] Ramsés H Mena, Matteo Ruggiero, et al. Dynamic density estimation with diffusive dirichlet mixtures. Bernoulli, 22(2):901–926, 2016.
 [41] Krikamol Muandet, Kenji Fukumizu, Francesco Dinuzzo, and Bernhard Schölkopf. Learning from distributions via support measure machines. In Advances in neural information processing systems, pages 10–18, 2012.
 [42] Ho Chung Leon Law, Dougal Sutherland, Dino Sejdinovic, and Seth Flaxman. Bayesian approaches to distribution regression. In International Conference on Artificial Intelligence and Statistics, pages 1167–1176, 2018.
 [43] Zoltán Szabó, Bharath K Sriperumbudur, Barnabás Póczos, and Arthur Gretton. Learning theory for distribution regression. The Journal of Machine Learning Research, 17(1):5272–5311, 2016.
 [44] Barnabás Póczos, Aarti Singh, Alessandro Rinaldo, and Larry Wasserman. Distributionfree distribution regression. In Artificial Intelligence and Statistics, pages 507–515, 2013.
 [45] Junier Oliva, Barnabás Póczos, and Jeff Schneider. Distribution to distribution regression. In International Conference on Machine Learning, pages 1049–1057, 2013.
 [46] Connie Khor Li Kou, Hwee Kuan Lee, and Teck Khim Ng. A compact network learning model for distribution regression. Neural Networks, 110:199–212, 2019.
Appendix
6.1 RNN architecture
As explained in the main text, we use an autoregressive LSTM-based recurrent neural network whose architecture follows the one described in [8]. Figures 4 and 5 give a high-level description of the model’s architecture for the training and prediction steps; further details can be found in the aforementioned work. In an anomaly detection setting, we only consider one-step-ahead prediction in the prediction network.
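The resulting streaming detection loop can be sketched as follows, with hypothetical `predict` and `update` functions standing in for the LSTM's output layer and state transition; the surprise score shown here is a simple standardized error, used only for illustration:

```python
def stream_scores(observations, predict, update, state):
    """One pass over the stream with O(1) memory: the current recurrent
    state yields a one-step-ahead predictive, the incoming point is scored
    against it, and the state is then updated with that point."""
    scores = []
    for x in observations:
        mu, sigma = predict(state)          # predictive before seeing x
        scores.append(abs(x - mu) / sigma)  # standardized surprise
        state = update(state, x)            # fold x into the compact state
    return scores
```

For example, with a persistence predictor `predict = lambda s: (s, 1.0)` and `update = lambda s, x: x`, a sudden level change receives the largest score. It is this fixed-size state that makes the model suitable for low-latency streaming deployment.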
6.2 Further experimental results
Synthetic data: Asymptotic setting
Figure 6 shows an example of a statistical anomaly as well as a malfunction for dataset 1 with collapse malfunctions in the asymptotic setting: we show the observed distribution (in red) and the predicted coordinate-wise credible intervals (in blue). We also show the predicted distribution of log-likelihoods, using the histogram of samples from the predictive distribution.
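One way to obtain such coordinate-wise credible intervals from Monte Carlo draws of the predictive distribution is via empirical quantiles; this is a minimal sketch under that assumption, not necessarily the exact procedure used for the figure:

```python
def coordinatewise_credible_interval(samples, alpha=0.1):
    """Per-coordinate (lower, upper) bounds of the central 1 - alpha
    credible interval, computed from empirical quantiles of Monte Carlo
    draws (`samples`: a list of equal-length vectors from the predictive)."""
    n = len(samples)
    lo = round(alpha / 2 * (n - 1))
    hi = round((1 - alpha / 2) * (n - 1))
    intervals = []
    for coord in zip(*samples):
        ordered = sorted(coord)
        intervals.append((ordered[lo], ordered[hi]))
    return intervals
```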
Synthetic data: Finite setting
In Table 5, we report the average False Positive Rate and Recall in the finite setting.
False Positive Rate  Recall

DS1    
DS1  
DS2  
DS2    
DS2  
DS2 
AWS data
In Figure 7 we give two examples of time series from the AWS B2 benchmark dataset. As can be seen, the labeling can be inaccurate, which explains why the scores of all methods are not very high.