Anomaly Detection based on Compressed Data: an Information Theoretic Characterization

We analyze the effect of lossy compression in the processing of sensor signals that must be used to detect anomalous events in the system under observation. The intuitive relationship between the quality loss at higher compression and the possibility of telling anomalous behaviours from normal ones is formalized in terms of information-theoretic quantities. Some analytic derivations are carried out within the Gaussian framework and, where possible, in the asymptotic regime with respect to the length of the signals considered. Analytical conclusions are matched with the performance of practical detectors in a toy case, allowing the assessment of different compression/detector configurations.


I Introduction

A typical scenario for today's massive acquisition systems can be modelled as a large number of sensing units, each transforming some unknown physical quantities into samples of random processes that are then transmitted over a network. To reduce the transmission bitrate, signals are often compressed by a lossy mechanism that is theoretically capable of preserving the useful information. Before reaching some cloud facility in which they will be ultimately stored or processed, the corresponding bitstreams may traverse several levels of hierarchical aggregation and intermediate devices that are often indicated as the edge of the cloud [1]. For latency or privacy reasons, some computational tasks may benefit from being deployed at the edge. One of these tasks is the detection of anomalies/novelties.

This is especially true when dealing, for example, with networks that sensorize plants or structures subject to monitoring as depicted in Fig. 1. The aggregated sensor readings may be processed in the cloud for off-line monitoring relying on long-term historical trends, while the outputs of subsets of sensors may be processed at the edge to give low-latency feedback on possible critical events that require immediate intervention.

Usually, compression schemes applied to sensor data are asymmetric and entail a lightweight encoding performed on very-low complexity devices paired with a possibly expensive decoding stage running on the cloud. In these conditions, it is sensible that anomaly detectors work on compressed data and not on the recovered signal.

Yet, lossy compression bases its effectiveness on neglecting some of the signal details. This translates into a distortion between the original and the recovered signal, but also into a loss of details that, in principle, could have been used to tell normal behaviours from anomalous ones.

This sets a trade-off between compression and the distinguishability of normal and anomalous signals. Such a trade-off goes in parallel with the one between compression and distortion. Here, we analyze the former with the information-theoretic machinery that also leads to the well-known rate-distortion analysis, and implicitly show that the two trade-offs are different.

Fig. 1: A sensorized plant whose acquisitions are aggregated at the edge before being sent to the cloud.

In a sense, the approach we pursue is somewhat similar to the information-bottleneck scheme [2, 3]. In that scheme, distortion is substituted by a very general criterion which identifies the features that should be preserved when compressing with the information that the original signal contains about a second (suitably introduced) signal. Yet, our discussion takes a different direction since, when dealing with anomalies/novelties/outliers, we may completely ignore the statistics of the anomaly and, even if we have priors on it, we need to be able to treat also cases in which the mutual information between normal and anomalous signals is null.

For the same reason, the analysis we propose also differs from other modifications of classical rate-distortion theory that substitute energy-based distortion with perceptual criteria [4, 5].

Though not overlapping with the problem we address, it is also worthwhile to mention [6, 7], in which the authors assume that the original signal is characterized by some parameters (e.g., its mean) and study how the estimation of such parameters is affected by lossy compression.

Note also that other applications exist in which rate and distortion are paired with additional merit figures that take into account relevant features of the system. As an example, [8] adds computational-effort considerations to the rate-distortion analysis of wavelet-based video coding.

Finally, even without emphasis on compression, the analysis of suitably defined subcomponents of a signal to detect possible outlier behaviours is a classic theme that is still under investigation [9, 10, 11].

In this paper, we propose two information-theoretic measures of distinguishability to model the potential capabilities of two kinds of anomaly detectors, namely those that know the normal behaviour but are agnostic of the anomaly, and those that have information on both normal and anomalous behaviours. We then study how these measures behave when we change the distortion and thus the compression of the lossy mechanism, by deriving some analytical results when signals are Gaussian, both in the finite and asymptotic regime.

Section II gives the mathematical definition of the signals and quantities we use in the following, with special emphasis on the functionals that quantify the effectiveness of compression, the distortion, and the distinguishability between normal and anomalous signals. Section III specializes the model to the case in which signals are Gaussian and revisits well-known results on the Gaussian rate-distortion trade-off. Section IV applies the Gaussian assumption to specialize distinguishability both in the pointwise and in the average case, with an emphasis on the asymptotic characterization of anomalies in the high-dimensional case. Section V reports some numerical evidence analysing the behaviour of some suitably simplified anomaly-detection strategies with respect to ideal and suboptimal compression strategies. Theoretical curves anticipate many aspects of the practical performance trends and show that a compression that optimizes the rate-distortion trade-off does not necessarily strike the best compromise with distinguishability. Some conclusions are finally drawn. Proofs of the properties stated in the discussion are reported in the Appendix.

II Model definition

Fig. 2: The signal chain we consider is tuned to the normal signal, which is compressed into a smaller representation that can be re-expanded into a distorted version of the original signal. Sometimes, an anomalous signal is presented at the input of the chain. A detector should signal this based on the compressed version of the signal.

The signal flow we consider is reported in Fig. 2.

Normal and anomalous behaviours are modelled as two sources producing independent discrete-time, stationary, -dimensional stochastic processes and with different PDFs and . At any , the observable discrete-time process is either or .

The observable is passed to an encoding stage producing a compressed version that may then be decompressed into , where is a finite subset of . The compression mechanism is lossy as the encoding stage is not injective and thus introduces some distortion.

As discussed in the Introduction, the encoded signal may be used for anomaly detection, i.e., to decide whether the original signal is or . We assume the decoding stage to be injective so that, in abstract terms, the detector may be thought to work on even if in practical embodiments decoding is not performed at the edge.

Compression is designed assuming and thus for every . It is characterized by the average distortion

(1)

Since is finite, a digital word can be assigned to each of its elements so that is encoded with bits. This allows us to define the average rate of the stream as .

Rate and distortion are clearly two elements of a trade-off whose Pareto curve is the so-called rate-distortion curve, i.e., the function

To identify such a function, it is classical [12, Chapter 9] to model the cascade of encoding and decoding as a conditional PDF of the compressed signal given the original one, so that the joint probability of the two signals is the product of this conditional PDF and the marginal PDF of the original signal.

We design the system for normal cases , i.e., considering and the marginal PDF with which distortion is expressed as

With this, classical rate-distortion theory (see, e.g., [12, Chapter 13]) shows that

where is the mutual information between and [12, Chapter 8].

To proceed further it is convenient to define the functional

that is the average coding rate of a source characterized by the PDF with a code optimized for a source with PDF , so that is equal to the differential entropy of [12, Chapter 8].
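
To make this functional concrete, the following Monte Carlo sketch in Python (variable names, variances and sample size are ours, chosen for illustration only) estimates the average coding rate of a Gaussian source under matched and mismatched code models and checks the link with differential entropy.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def coding_rate(samples, model_pdf):
    # Average of -log2(model_pdf) over the samples: the coding rate of the
    # sample source under a code optimized for the given model.
    return -np.mean(np.log2(model_pdf(samples)))

sigma_f, sigma_g = 1.0, 2.0
x = rng.normal(0.0, sigma_f, size=200_000)        # samples from f = N(0, sigma_f^2)

rate_matched = coding_rate(x, lambda t: norm.pdf(t, scale=sigma_f))
rate_mismatched = coding_rate(x, lambda t: norm.pdf(t, scale=sigma_g))

# With a matched model the functional reduces to the differential entropy of f.
diff_entropy_f = 0.5 * np.log2(2 * np.pi * np.e * sigma_f ** 2)
print(rate_matched, diff_entropy_f)               # approximately equal
print(rate_mismatched - rate_matched)             # non-negative mismatch penalty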

When and we deal with a quadratic distortion constraint, one may derive a well-known Property [12, Chapter 13].

Property 1.

If and we constrain , then the minimum achievable rate has the lower bound

(2)

that can be achieved when the encoding is such that the encoding error is a Gaussian random variable.

Property 1 is the basis of classical developments and indicates that an encoding whose error is distributed as a Gaussian has rate-minimization capabilities.

If anomalies need to be considered, the situation to model becomes more complex. In fact, as rate decreases, distortion increases, and the compressed versions of normal and anomalous signals tend to be less distinguishable. Hence, if the compressed stream is used for anomaly detection, one expects that detector performance degrades for increasing compression. To quantify this effect we need theoretical performance figures that characterize the distinguishability of two signals.

To begin tackling the problem, one can notice that anomalies are distorted with the encoder tuned on normal signals . This means that the input and the output of the distortion are characterized by the joint PDF and the marginal PDF .

Anomaly detectors work on the difference between and that we quantify with two kinds of information-theoretic measure, which model two distinct scenarios, i.e., one in which the detector knows both and and one in which it knows only .

II-A Distinguishability in anomaly-aware detection

When both and are known, it is most natural to measure their distinguishability as

(3)

where one readily identifies the latter expression as the Kullback-Leibler divergence of anomalies from normal signals.

This models a detector that knows the optimal codes for both the normal and the anomalous sources. It then observes the distorted stream , encodes it with both codes and decides whether there is an anomaly or not depending on which of the two encoding rates is lower. As a result, large values correspond to system configurations with high detection capability.

II-B Distinguishability in anomaly-agnostic detection

When only the statistical characterization of the normal source is known at the detector, we may measure the distinguishability between and with

(4)

This models a detector using only the code optimized for the normal source. When anomalous signals appear, their encoding causes a difference in average rate that is used to reveal them. Hence, large values stand for high distinguishability.

Unlike , is not always positive, since there may be anomalies whose encoding yields a lower rate than that of normal signals. When a positive quantity is needed we will use .
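
The sign difference between the two measures can be checked with the standard closed forms for scalar zero-mean Gaussians; the sketch below (our own notation, values in bits) only illustrates that the anomaly-agnostic measure may be negative while the anomaly-aware one cannot.

import numpy as np

LOG2 = np.log(2.0)

def d_aware_bits(var_anom, var_norm):
    # Kullback-Leibler divergence of N(0, var_anom) from N(0, var_norm): always >= 0.
    r = var_anom / var_norm
    return 0.5 * (r - 1.0 - np.log(r)) / LOG2

def d_agnostic_bits(var_anom, var_norm):
    # Rate excess when coding the anomaly with the code tuned to the normal
    # source, relative to the rate of the normal source itself: may be negative.
    r = var_anom / var_norm
    return 0.5 * (r - 1.0) / LOG2

for var_anom in (0.25, 1.0, 4.0):
    print(var_anom, d_aware_bits(var_anom, 1.0), d_agnostic_bits(var_anom, 1.0))
# var_anom = 0.25 yields a positive anomaly-aware measure but a negative
# anomaly-agnostic one: that anomaly codes at a lower rate than normal signals.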

Both in the anomaly-aware and in the anomaly-agnostic case, we may want to assess average performance when the anomalies are randomly drawn from a certain set of possible behaviours.

III Optimum rate-distortion trade-off in the Gaussian framework

We assume that the signals are -dimensional Gaussian vectors and that instances at different time indices are independent and identically distributed; in the normal case they are zero-mean Gaussian vectors with a given covariance matrix, while anomalous instances are also independent and identically distributed zero-mean Gaussian vectors with their own covariance matrix.

In general, , but we will assume , meaning that, on average, each sample in the vector contributes a unit of energy.

By applying a proper orthonormal change of coordinates to (and thus to both and ) we may assume with and .

These assumptions allow us to derive an analytical assessment of the impact of compression on the anomaly-detection capabilities. Such derivations rely on the multivariate Gaussian PDF of a vector argument, parameterized by a mean vector and a covariance matrix.

The Gaussian nature of allows a well-known derivation [12, Chapter 13] of the rate-distortion function .

The basic building block comes from the application of Property 1 to a single Gaussian source [12, Lemma 13.3.2].

Property 2.

If then the minimum rate achievable given a chosen distortion is

When we deal with a vector of independent Gaussians, Property 2 must be paired with the fact that the total distortion is the sum of the distortions imposed on the individual components. This leads to a water-filling result [12, Theorem 13.3.3] depending on a parameter

(5)
(6)
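
Property 2 and the water-filling allocation can be summarized by the following numerical sketch (notation and variances are ours): each component is distorted by the minimum between a common water level and its own variance, the level being chosen so that the per-component distortions add up to the allowed total, and the rate is the sum of the per-component Gaussian rate-distortion functions.

import numpy as np

def reverse_water_filling(variances, total_distortion, tol=1e-12):
    # Return per-component distortions, per-component rates (bits) and the
    # water level such that sum(min(level, variances)) = total_distortion.
    lo, hi = 0.0, float(np.max(variances))
    while hi - lo > tol:
        level = 0.5 * (lo + hi)
        if np.minimum(level, variances).sum() > total_distortion:
            hi = level
        else:
            lo = level
    level = 0.5 * (lo + hi)
    d = np.minimum(level, variances)                   # distortion per component
    r = np.maximum(0.0, 0.5 * np.log2(variances / d))  # rate per component, in bits
    return d, r, level

variances = np.array([4.0, 2.0, 1.0, 0.5])             # eigenvalues of the normal covariance (placeholders)
d, r, level = reverse_water_filling(variances, total_distortion=1.5)
print(level, d, r.sum())   # components with variance below the level are zeroed out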

To track the effect of water filling we define the identity matrix of matching dimension and use the following matrices:

  • , whose diagonal elements account for the fraction of energy cancelled by distortion along each component;

  • , whose diagonal elements account for the fraction of energy that survives distortion along each component;

  • .

The trade-off between compression and distortion that is optimized along the rate-distortion curve is achieved by a proper distortion mechanism that produces a signal with a corresponding PDF. Both the mechanism and the resulting PDF are given by the following Property, whose derivation is in the appendix.

Property 3.

The optimal distortion is given by

(7)

and the optimally distorted signal has the PDF

(8)

The compression mechanism that optimally addresses the rate-distortion trade-off for the normal source is also used on the anomalous instances. This produces a distorted anomalous signal whose PDF is given by the following Property, whose proof is in the appendix.

Property 4.

If an anomalous source is encoded with the compression scheme of Property 3, then

(9)

Such a result has two noteworthy corner cases.

  • If there is no distortion. In fact, since , Property 4 gives .

  • If there is no anomaly, , and

    where the last equality holds since the possible disagreements between and correspond to components that are multiplied by zero by the last factor. Hence, Property 4 can be compared with Property 3 to confirm that .

IV Distinguishability in the Gaussian framework

IV-A Pointwise distinguishability

Assume that, after compression, the normal and the anomalous signals are both Gaussian -dimensional random vectors whose first components survive distortion and are non-null, while the other components are set to zero and thus cannot be used to tell anomalous from normal cases.

Say that the first components of are distributed according to while the corresponding components of are distributed according to .

Within this Gaussian assumption, we may obtain an expression for the distinguishability measures (3) and (4). The starting point is the specialization of the building block when and are Gaussian as given by the following Property whose derivation is in the appendix.

Property 5.

If and then

(10)

where indicates the determinant of its matrix argument.

By properly combining the definitions in (3) and (4) with (10) we obtain

(11)
(12)

in which distinguishability measures are given in bits.
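
As a numerical counterpart of (11) and (12), the following sketch uses the standard closed forms for zero-mean Gaussians restricted to the components that survive distortion; the two diagonal covariances are placeholders of ours.

import numpy as np

def d_agnostic_bits(cov_anom, cov_norm):
    # Excess coding rate of the anomaly under the normal code (can be negative).
    k = cov_norm.shape[0]
    m = np.linalg.solve(cov_norm, cov_anom)      # cov_norm^{-1} @ cov_anom
    return 0.5 * (np.trace(m) - k) / np.log(2.0)

def d_aware_bits(cov_anom, cov_norm):
    # Kullback-Leibler divergence of the anomaly from the normal signal (>= 0).
    m = np.linalg.solve(cov_norm, cov_anom)
    sign, logdet = np.linalg.slogdet(m)
    return d_agnostic_bits(cov_anom, cov_norm) - 0.5 * logdet / np.log(2.0)

cov_norm = np.diag([1.5, 1.0, 0.5])              # surviving normal components (placeholder)
cov_anom = np.diag([0.4, 0.4, 0.4])              # surviving anomalous components (placeholder)
print(d_aware_bits(cov_anom, cov_norm), d_agnostic_bits(cov_anom, cov_norm))
# The KL-based measure is non-negative and convex in cov_anom, while the
# rate-excess measure is linear in it.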

Clearly, both figures vanish when , i.e., when, along the directions surviving distortion, the normal and the anomalous signals coincide. Note also that, with respect to , is linear while is convex [13, Chapter 3].

Properties 3 and 4 imply that, when the normal and anomalous signals are Gaussian before compression, the performance of anomaly detectors depends on how much we are capable of distinguishing between the two distributions in (8) and (9).

In this case, we have , is the upper-left submatrix of in (8), and is the upper-left submatrix of in (9).

By straightforward substitution and exploiting the diagonality of both and , we obtain that is the upper-left submatrix of .

As a noteworthy particular case, when the normal signal is white, i.e., when , we have that and that for any , , and . Hence, , , and .

Hence, the distinguishability measures in (11) and (12) become

(13)
(14)

Having this result in the case of a normal signal distributed as white Gaussian noise is not surprising, since the distinguishability modelled by depends only on the statistics of a signal that has no exploitable structure.

IV-B Average on the set of possible anomalies

Since anomalies are zero-mean Gaussian vectors, they are completely defined by their covariance matrix .

We decompose with and orthonormal.

The set of all possible is

while the set of all possible is that of orthonormal matrices

Denoting the uniform distribution on a given domain, we will assume that when the eigenvalue profile is not known it is drawn uniformly from its set, and that when the orthonormal factor is not known it is drawn uniformly from its set, independently of the eigenvalue profile.

Note now that is invariant with respect to any permutation of the . Since , also must be invariant with respect to the same permutations so that for any . Since has a constrained sum and is the diagonal of we have . This implies

(15)
(16)

Hence, in our setting, the average anomaly is white and we may compute the corresponding distinguishability measures and . Note that, in this case, is the upper-left submatrix of , i.e., the diagonal matrix in which the -th diagonal entry is the sum of the -th diagonal entry of and of the -th diagonal entry of , i.e.,

With these quantities, the expressions of the distinguishability measures become

(17)
(18)

Note that due to the linearity of and the convexity of we have and .

Moreover, the very simple structure of allows the derivation of the following Property whose proof is in the appendix.

Property 6.

If , then for at least one point

Property 6 means that, in the case of an anomaly distributed as Gaussian white noise, there exists at least one critical level of distortion that makes detectors that do not use information on the anomaly ineffective.

IV-C Asymptotic distinguishability

White signals are not only the average anomalies but are also typical anomalies in the sense specified by the following Property whose proof is in the appendix.

Property 7.

If and then, as , tends to in probability.

Hence, when increases, most of the possible anomalies behave as white signals and , that thus enjoys the properties shown in the previous subsection.

V Numerical examples

In this section we match the theoretical derivations with the quantitative assessment of the performance of some practical anomaly detectors applied to compressed signals.

Normal signals are assumed to be where and is set to yield different degrees of non-whiteness. Non-whiteness is measured with the so-called localization defined as

that goes from when the signal is white to when all the energy is concentrated along a single direction of the signal space (see [14] for more details). To show the effect of realistic localization [15] we consider values of corresponding to .

Anomalous signals are generated as , where is randomly picked according to the uniform distribution defined in Section IV-B.

To generate , we follow [16] to first draw for and then set

To generate , we follow [17] and start by generating a matrix within the Ginibre ensemble [18], i.e., with independent entries for . We then set to the orthonormal factor of the QR decomposition of .
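
This construction can be sketched as follows; the orthonormal factor follows the Ginibre-plus-QR recipe of [17, 18], while the eigenvalue sampling shown here (normalized exponentials, i.e., a uniform point on the scaled simplex) is only our stand-in for the recipe of [16], whose details are not reproduced above.

import numpy as np

rng = np.random.default_rng(0)

def random_orthonormal(n):
    g = rng.standard_normal((n, n))          # Ginibre ensemble
    q, r = np.linalg.qr(g)
    return q * np.sign(np.diag(r))           # sign fix so the sampling is uniform (Haar)

def random_anomaly_covariance(n):
    lam = rng.exponential(size=n)
    lam *= n / lam.sum()                     # eigenvalues sum to n: unit energy per sample (our assumption)
    q = random_orthonormal(n)
    return q @ np.diag(lam) @ q.T

# Empirical counterpart of Property 7: as the dimension grows, a randomly drawn
# covariance tends to look like that of a white signal.
for n in (8, 32, 128):
    cov = random_anomaly_covariance(n)
    print(n, np.abs(cov - np.eye(n)).max())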

A first use of this random sampling is the possibility of pairing Property 7 with some numerical evidence. Fig. 3 reports the vanishing trend of the average squared and uniform deviation from of a population of uniformly distributed covariance matrices . Though not a theoretical result, note that empirical evidence supports a classical convergence.

Fig. 3: Trend of and when increases. Solid lines are median trends while shaded areas contain of the population.

As far as detector assessment is concerned, we decide to set and consider three compression techniques tuned to the normal signal and applied to both normal and anomalous instances. More specifically is mapped to by

  • the minimum-rate-given-distortion compression in (7) (Rate-Distortion Compression RDC);

  • projecting along the subspace spanned by the eigenvectors of the normal covariance with the largest eigenvalues (Principal Component Compression, PCC; a code sketch follows this list);

  • a family of autoencoders [19, Chapter 14] with an increasingly deficient latent representation (Auto-Encoder Compression, AEC). The encoder is a neural network with fully connected layers whose final dimension is that of the latent representation, and the decoder is the dual network with mirrored layer dimensions back to the input size. The family of autoencoders is trained to minimize the distortion computed as in (1). To smooth performance degradation we first train an autoencoder with the largest latent dimensionality. Then, the node of the latent representation along which we measure the least average energy is dropped to produce a smaller network with a reduced latent space, which is re-trained using the previous weights as initialization. This is repeated with decreasing latent dimensionality, thus considering larger distortion values.
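
A minimal sketch of the PCC mapping referenced above; the signal dimension, the number of kept components and the synthetic normal covariance are placeholders of ours.

import numpy as np

rng = np.random.default_rng(0)
n, k = 16, 4                                   # signal dimension, kept components (placeholders)

# Synthetic diagonal normal covariance with unit average energy per sample.
eigenvalues = np.linspace(2.0, 0.1, n)
eigenvalues *= n / eigenvalues.sum()
cov_norm = np.diag(eigenvalues)

# Eigenvectors of the normal covariance, sorted by decreasing eigenvalue.
w, v = np.linalg.eigh(cov_norm)
order = np.argsort(w)[::-1]
basis = v[:, order[:k]]                        # n x k matrix of leading eigenvectors

def pcc_encode(x):
    return basis.T @ x                         # k-dimensional compressed representation

def pcc_decode(y):
    return basis @ y                           # re-expansion into the original space

x = rng.multivariate_normal(np.zeros(n), cov_norm)
x_hat = pcc_decode(pcc_encode(x))
print(np.mean((x - x_hat) ** 2))               # distortion grows as k decreases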

The three schemes we consider address the trade-off between compression and distortion in different ways. This can be shown by pairing each of them with a quantization stage ensuring that rate values are finite. In particular, we encode each component of the compressed representation with a fixed number of bits, which yields finite rates per time step. We assume that quantization is fine enough to substantially preserve the Gaussian distribution of the signals, and thus evaluate the mutual information between the original and the compressed signal as if they were jointly Gaussian, with a covariance matrix that we estimate by Monte Carlo simulation [20]. Such an estimation yields the rate-distortion curves in Fig. 4.
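
The rate estimate just described can be sketched as follows; the toy linear compressor, the quantization step and the dimensions are placeholders of ours, and the closed-form Gaussian mutual-information expression is applied to a Monte Carlo estimate of the joint covariance.

import numpy as np

rng = np.random.default_rng(0)
n, k, n_samples = 8, 3, 100_000

cov_norm = np.diag(np.linspace(1.8, 0.2, n))
x = rng.multivariate_normal(np.zeros(n), cov_norm, size=n_samples)

basis = np.linalg.eigh(cov_norm)[1][:, ::-1][:, :k]   # leading eigenvectors (toy linear compressor)
step = 0.1                                            # quantization step (placeholder)
y = np.round((x @ basis) / step) * step               # quantized compressed representation

joint = np.cov(np.hstack([x, y]), rowvar=False)       # Monte Carlo joint covariance
logdet_x = np.linalg.slogdet(joint[:n, :n])[1]
logdet_y = np.linalg.slogdet(joint[n:, n:])[1]
logdet_xy = np.linalg.slogdet(joint)[1]
mutual_information_bits = 0.5 * (logdet_x + logdet_y - logdet_xy) / np.log(2.0)
print(mutual_information_bits)                        # estimated rate in bits per time step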

Fig. 4: Rate-distortion curves for the three compression schemes we consider and for different values of the localization of the original signal.

As expected, RDC yields the smallest rates while PCC gives the largest ones. Between the two we have AEC, whose performance depends on the effectiveness of the training.

The compressed version of the signal is then passed to a detector whose task is to compute a score such that high-score instances should be more likely to be anomalous. The final binary decision is taken by matching the score against a threshold.

We consider two detectors that do not rely on information on the anomaly:

  • a Likelihood Detector (LD) whose score is the inverse of the log-likelihood of the instance with respect to the normal signal distribution, so that to the instance we associate the score (both anomaly-agnostic scores are sketched in the code after this list);

  • a One-Class Support-Vector Machine (OCSVM) [21] trained on a set of instances of normal signals contaminated by 1% of unlabelled white instances to help the algorithm find the envelope of normal instances.
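
Both anomaly-agnostic scores can be sketched as follows; the compressed dimension, the covariances, the sample sizes and the OCSVM hyper-parameters are placeholders of ours, and the 1% contamination of the training set is omitted for brevity.

import numpy as np
from scipy.stats import multivariate_normal
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
k = 4                                           # dimension of the compressed signal (placeholder)
cov_norm = np.diag(np.linspace(1.5, 0.5, k))    # covariance of compressed normal signals (placeholder)

train = rng.multivariate_normal(np.zeros(k), cov_norm, size=2000)
test = rng.multivariate_normal(np.zeros(k), 0.4 * np.eye(k), size=10)   # e.g., anomalous instances

# Likelihood Detector: score is minus the log-likelihood under the normal model,
# so less likely instances get higher scores.
normal_model = multivariate_normal(mean=np.zeros(k), cov=cov_norm)
ld_scores = -normal_model.logpdf(test)

# One-Class SVM trained on normal instances; decision_function is high inside
# the learned envelope, so we flip its sign to obtain an anomaly score.
ocsvm = OneClassSVM(kernel="rbf", nu=0.01).fit(train)
ocsvm_scores = -ocsvm.decision_function(test)

print(ld_scores, ocsvm_scores)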

We also consider two detectors that are able to leverage information on the anomaly:

  • a Neyman-Pearson Detector (NPD) [22, Chapter 3], whose score is the difference between the log-likelihoods of the instance with respect to the normal and the anomalous distributions, so that to the instance we associate the score (a sketch follows this list);

  • a Deep Neural Network (DNN) with 3 fully connected hidden layers with , , neurons with ReLU activations and a final sigmoid neuron producing the score. The network is trained by backpropagation with a binary cross-entropy loss against a dataset containing labelled normal and anomalous instances.
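
A minimal sketch of the NPD score referenced above; the covariances and the sign convention (chosen so that larger scores flag anomalies, as required of all our detectors) are placeholders of ours.

import numpy as np
from scipy.stats import multivariate_normal

k = 4                                           # dimension of the compressed signal (placeholder)
normal_model = multivariate_normal(mean=np.zeros(k), cov=np.diag(np.linspace(1.5, 0.5, k)))
anomal_model = multivariate_normal(mean=np.zeros(k), cov=np.eye(k))

def npd_score(z):
    # High when the instance is better explained by the anomalous model.
    return anomal_model.logpdf(z) - normal_model.logpdf(z)

z = np.zeros(k)
print(npd_score(z))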

The LD and NPD detectors can be employed only on signals compressed by the RDC or PCC methods, since they rely on a statistical characterization of the signals that is not available after the nonlinear processing in AEC.

detector | training: # anomalies, # instances | assessment: # anomalies, # instances
LD
OCSVM 1
NPD
DNN 50 50
TABLE I: Number of anomalies () and, for each anomaly, the number of normal () and anomalous () signal instances used in the training and assessment of the detectors.
Fig. 5: Distinguishability measures , and against normalized distortion in case of RDC.
Fig. 6: Distinguishability measures , and against normalized distortion in case of PCC.

Table I shows how many different anomalies and how many signal instances are generated for the training (when needed) and for the assessment of the detectors. Note that in the DNN case we limited the number of anomalies since the training process must be repeated for each of them.

To be independent of the choice of thresholds, the detectors' performance is assessed by the Area-Under-the-Curve (AUC) methodology [23]. The AUC estimates the probability that, given a random normal instance and a random anomalous instance, the former has a lower score than the latter, as it should in an ideal setting. Hence, the AUC is a positive performance index.

Clearly, detectors with an AUC equal to 1/2 are no better than coin tossing. Yet, if the AUC is below 1/2, the score has some ability to distinguish normal from anomalous signals if it is interpreted in the reverse way. Hence, it is convenient to set our empirical distinguishability measure to

to obtain a non-negative quantity that increases up to when the detector improves.
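
As an illustration, the following sketch computes the AUC with scikit-learn and folds it around 1/2; the exact normalization used for the empirical measure above is not reproduced here, and the synthetic scores are placeholders of ours.

import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
scores_normal = rng.normal(0.0, 1.0, size=1000)
scores_anomal = rng.normal(1.0, 1.0, size=1000)      # anomalies tend to score higher

labels = np.concatenate([np.zeros(1000), np.ones(1000)])   # 1 marks anomalous instances
scores = np.concatenate([scores_normal, scores_anomal])

auc = roc_auc_score(labels, scores)
empirical_distinguishability = abs(auc - 0.5)        # 0 for coin tossing, grows as the detector improves
print(auc, empirical_distinguishability)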

In the following the trends of are reported and matched with the trends of and to show how theoretical properties reflect on real cases. Comparisons must be partially qualitative as the range and the significance of and are different from those of and as the latter refers to practical detectors. All plots are made against a normalized distortion in the range as larger relative distortions are usually beyond operative ranges.

V-A RDC

Fig. 5 summarizes the results we have in this case with two rows of 3 plots each. The upper row of plots corresponds to detectors that do not exploit information on the anomaly, while the lower row concerns detectors that may leverage information on the anomaly. Colors correspond to different localizations, dashed trends assume that the anomaly is the average one, i.e., white, and shaded areas show the span of the Monte Carlo population. The profiles of and on the left shall be matched with the profiles on the right, which correspond to the four detectors we consider. No profile appears for as in that case .

Theory anticipates that, without any knowledge of the anomaly (upper row), a limited amount of distortion may cause distinguishability to vanish and thus detectors to fail. This actually happens for practical detectors such as LD and OCSVM. The distortion level at which detectors fail is also anticipated by the theoretical curve and depends on the localization, as predicted by Property 6. Overall, theory anticipates that in the low-distortion region more localized signals are more distinguishable from anomalies, though they cause detector failures at smaller distortions than less localized signals.

Detectors leveraging the knowledge of the anomaly (lower row) fail completely only at the maximum distortion, as revealed by the abstract distinguishability measure. Also in this case, by comparing the theoretical trend with the zoomed areas in the NPD and DNN plots, we see how theory anticipates that in the low-distortion region more localized signals tend to be more distinguishable from anomalies but cause a more marked performance degradation of the detectors as distortion increases.

V-B PCC

From the point of view of the rate-distortion trade-off, PCC is largely suboptimal. Yet, due to its linear nature, the original and the compressed signals are still jointly Gaussian. This allows the theoretical and to be computed by means of (11) and (12).

Fig. 6 summarizes the results we have in this case with plots of the same kind as those in Fig. 5. The qualitative behaviours commented on in the previous subsection appear in the new plots and are anticipated by the theoretical trends.

The distortion levels at which anomaly-agnostic detectors fail change with respect to the RDC case but are still anticipated by the theoretical curves and Property 6.

In this case, the values of beyond the breakdown distortion levels increase slightly more than in the optimal compression scenario. Hence, by adopting a compression strategy that is suboptimal in the rate-distortion sense, one may obtain a better distinguishability of the compressed normal signal from the compressed anomalies. This is, in fact, what happens in practice, as highlighted by the LD and OCSVM plots in the first row of Fig. 6.

V-C AEC

In this case compression is non-linear, so that the original and the compressed signals may not be jointly Gaussian. This prevents us from computing the theoretical curves and , and from applying LD and NPD, which rely on the knowledge of the distribution of the signals.

For this reason, Fig. 7 reports only the performance of OCSVM and DNN detectors.

Notice how the qualitative trends of those performances still follow, though with a larger level of approximation, what is indicated by the theoretical curves for the other compressors.

Fig. 7: Distinguishability measure against normalized distortion in case of AEC.

VI Conclusion

Massive sensing systems may rely on lossy compression to reduce the bitrate needed to transmit acquisitions to the cloud while theoretically maintaining the important information. At some intermediate point along their path to centralized servers, compressed sensor readings may be processed for early detection of anomalies in the systems under observation. Such detection must be performed on compressed data.

We here analyze the trade-off between compression and performance of anomaly detectors modeling rate, distortion and distinguishability by means of information-theoretic quantities.

Such a trade-off is different from the well-known compression-distortion trade-off.

Some numerical examples in a toy case show that the theoretical predictions are able to anticipate a number of features of the performance trends of real detectors.

Appendix

Proof of Property 3.

Distortion is tuned to the normal case, which entails a memoryless source. Hence we may drop time indications and concentrate on a vector with independent components.

We know from [24] that for a given value of the parameter , each component is transformed separately into . In particular, if then is encoded into . If then is encoded into where, to achieve the Shannon lower bound, must be an instance of a Gaussian random variable independent of . Hence, the three quantities , and must be such that with

(19)

That explains in which sense encodes . In fact, the non-diagonal elements are positive and thus and are positively correlated.
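
The per-component construction behind (19) can be simulated with the standard Gaussian test channel; in the following sketch the variance and the allotted distortion are placeholders of ours, and we only check the stated properties: the output is positively correlated with the input, the mean squared error matches the allotted distortion, and the error is uncorrelated with the output.

import numpy as np

rng = np.random.default_rng(0)
var_x, dist = 1.5, 0.4                     # component variance and allotted distortion (dist < var_x)
n_samples = 500_000

x = rng.normal(0.0, np.sqrt(var_x), size=n_samples)
gain = 1.0 - dist / var_x
noise = rng.normal(0.0, np.sqrt(dist * gain), size=n_samples)   # Gaussian, independent of x
x_hat = gain * x + noise                   # standard Gaussian test channel

print(np.corrcoef(x, x_hat)[0, 1])         # positive correlation between input and output
print(np.mean((x - x_hat) ** 2))           # close to dist
print(np.var(x_hat))                       # close to var_x - dist
print(np.corrcoef(x_hat, x - x_hat)[0, 1]) # close to 0: error uncorrelated with the output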

From (19), if we agree to identify a Gaussian with zero variance with a Dirac's delta, we infer that and thus .

Moreover, with the upper-left submatrix of in (19).

If we assume that , from the joint probability of and , we may compute the action of on the -th component of as the PDF of given , i.e.,

If we define we have and

that becomes for (maximum distortion of this component implies that the corresponding output is set to ) and for (no distortion of this component, the output is equal to the input).

We may collect the component-wise PDFs into a vector PDF by using the matrix , and the matrix thus yielding the thesis. ∎

Proof of Property 4.

The PDF of distorted by means of can be computed as

Assume first to be in the low-distortion condition that implies , and write