I Introduction
A typical scenario for today's massive acquisition systems can be modelled as a large number of sensing units, each transforming some unknown physical quantities into samples of random processes that are then transmitted over a network. To reduce the transmission bitrate, signals are often compressed by a lossy mechanism that is theoretically capable of preserving the useful information. Before reaching some cloud facility in which they will be ultimately stored or processed, the corresponding bitstreams may traverse several levels of hierarchical aggregation and intermediate devices that are often indicated as the edge of the cloud [1]. For latency or privacy reasons, some computational tasks may benefit from deployment at the edge. One of those tasks is the detection of anomalies/novelties.
This is especially true when dealing, for example, with networks that sensorize plants or structures subject to monitoring, as depicted in Fig. 1. The aggregated sensor readings may be processed in the cloud for offline monitoring relying on long-term historical trends, while the outputs of subsets of sensors may be processed at the edge to give low-latency feedback on possible critical events that require immediate intervention.
Usually, compression schemes applied to sensor data are asymmetric and entail a lightweight encoding performed on very-low-complexity devices paired with a possibly expensive decoding stage running in the cloud. In these conditions, it is sensible that anomaly detectors work on compressed data and not on the recovered signal.
Yet, lossy compression bases its effectiveness on neglecting some of the signal details. This translates into a distortion between the original and the recovered signal, but also into a loss of details that, in principle, could have been used to tell normal behaviours from anomalous ones.
This sets a tradeoff between compression and the distinguishability of normal and anomalous signals. Such a tradeoff goes in parallel with the one between compression and distortion. Here, we analyze the former with the information-theoretic machinery that also leads to the well-known rate-distortion analysis, and implicitly show that the two tradeoffs are different.
In a sense, the approach we pursue is somehow similar to the information-bottleneck scheme [2, 3]. In that scheme, distortion is substituted by a very general criterion that identifies the features to be preserved when compressing with the information that the original signal contains about a second (suitably introduced) signal. Yet, our discussion takes a different direction since, when dealing with anomalies/novelties/outliers, we may completely ignore the statistics of the anomaly and, even if we have priors on it, we need to be able to treat also cases in which the mutual information between normal and anomalous signals is null.
For the same reason, the analysis we propose is also different from other modifications of classical rate-distortion theory that substitute energy-based distortion with perceptual criteria [4, 5].
Though not overlapping with the problem we address, it is also worthwhile to mention [6, 7], in which the authors assume that the original signal is characterized by some parameters (e.g., their mean) and study how the estimation of such parameters is affected by lossy compression.
Note also that other applications exist in which rate and distortion are paired with additional merit figures taking into account relevant features of the system. As an example, [8] adds computational-effort considerations to the rate-distortion analysis of wavelet-based video coding.
Finally, even without emphasis on compression, the analysis of suitably defined subcomponents of a signal to detect possible outlier behaviours is a classic theme that is still under investigation [9, 10, 11].
In this paper, we propose two information-theoretic measures of distinguishability to model the potential capabilities of two kinds of anomaly detectors, namely those that know the normal behaviour but are agnostic of the anomaly, and those that have information on both normal and anomalous behaviours. We then study how these measures behave when we change the distortion, and thus the compression, of the lossy mechanism, by deriving some analytical results when signals are Gaussian, both in the finite-dimensional and in the asymptotic regime.
Section II gives the mathematical definition of the signals and quantities we use in the following, with special emphasis on the functionals that quantify the effectiveness of compression, the distortion, and the distinguishability between normal and anomalous signals. Section III specializes the model to the case in which signals are Gaussian and revisits well-known results on the Gaussian rate-distortion tradeoff. Section IV applies the Gaussian assumption to specialize distinguishability both in the pointwise and in the average case, with an emphasis on the asymptotic characterization of anomalies in the high-dimensional case. Section V reports some numerical evidence analysing the behaviour of some suitably simplified anomaly detection strategies with respect to ideal and suboptimal compression strategies. Theoretical curves anticipate many aspects of practical performance trends and show that a compression that optimizes the rate-distortion tradeoff does not necessarily strike the best compromise with distinguishability. Some conclusions are finally drawn. Proofs of the properties stated in the discussion are reported in the Appendix.
II Model definition
The signal flow we consider is reported in Fig. 2.
Normal and anomalous behaviours are modelled as two sources producing independent discrete-time, stationary, dimensional stochastic processes and with different PDFs and . At any , the observable discrete-time process is either or .
The observable is passed to an encoding stage producing a compressed version that may then be decompressed into , where is a finite subset of . The compression mechanism is lossy as the encoding stage is not injective and thus introduces some distortion.
As discussed in the Introduction, the encoded signal may be used for anomaly detection, i.e., to decide whether the original signal is or . We assume the decoding stage to be injective so that, in abstract terms, the detector may be thought to work on even if in practical embodiments decoding is not performed at the edge.
Compression is designed assuming and thus for every . It is characterized by the average distortion
(1)  
Since is finite, a digital word can be assigned to each of its elements so that is encoded with bits. This allows us to define the average rate of the stream as .
Rate and distortion are clearly two elements of a tradeoff whose Pareto curve is the so-called rate-distortion curve, i.e., the function
To identify such a function, it is classical [12, Chapter 9] to model the cascade of encoding and decoding as the conditional PDF
such that the joint probability of
and is given by . We design the system for normal cases , i.e., considering and the marginal PDF with which distortion is expressed as
With this, classical rate-distortion theory (see, e.g., [12, Chapter 13]) shows that
where is the mutual information between and [12, Chapter 8].
To proceed further it is convenient to define the functional
that is the average coding rate of a source characterized by the PDF with a code optimized for a source with PDF , so that is equal to the differential entropy of [12, Chapter 8].
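This cross-coding-rate functional can be illustrated numerically by Monte Carlo sampling. The sketch below (function names are ours, scalar Gaussian case for simplicity) checks that coding a source with its own optimal code yields its differential entropy, while a mismatched code costs extra rate:

```python
import numpy as np

rng = np.random.default_rng(0)

def gauss_logpdf(x, mu, var):
    # log of the scalar Gaussian PDF
    return -0.5 * (np.log(2 * np.pi * var) + (x - mu) ** 2 / var)

def coding_rate(mu_p, var_p, mu_q, var_q, n_samples=200_000):
    # Monte Carlo estimate of -E_{x~p}[log q(x)] in nats: the average
    # rate of a source p coded with a code optimized for a source q
    x = rng.normal(mu_p, np.sqrt(var_p), size=n_samples)
    return -gauss_logpdf(x, mu_q, var_q).mean()

# with q = p the functional reduces to the differential entropy of p
h_est = coding_rate(0.0, 1.0, 0.0, 1.0)
h_true = 0.5 * np.log(2 * np.pi * np.e)  # entropy of N(0,1) in nats
```

Estimating the functional with a mismatched code, e.g. `coding_rate(0.0, 1.0, 0.0, 4.0)`, gives a strictly larger value, in agreement with the non-negativity of the divergences introduced below.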
When and we deal with a quadratic distortion constraint, one may derive a well-known Property [12, Chapter 13].
Property 1.
If and we constrain , then the minimum achievable rate has the lower bound
(2) 
that can be achieved when the encoding is such that
is a Gaussian random variable.
Property 1 is the basis of classical developments and indicates that an encoding in which distributes as a Gaussian has rate-minimization capabilities.
If anomalies need to be considered, the situation to model becomes more complex. In fact, as rate decreases, distortion increases, and the compressed versions of normal and anomalous signals tend to be less distinguishable. Hence, if the compressed stream is used for anomaly detection, one expects that detector performance degrades for increasing compression. To quantify this effect we need theoretical performance figures that characterize the distinguishability of two signals.
To begin tackling the problem, one can notice that anomalies are distorted with the encoder tuned on normal signals . This means that the input and the output of the distortion are characterized by the joint PDF and the marginal PDF .
Anomaly detectors work on the difference between and , which we quantify with two kinds of information-theoretic measures modelling two distinct scenarios, i.e., one in which the detector knows both and , and one in which it knows only .
II-A Distinguishability in anomaly-aware detection
When both and are known, it is most natural to measure their distinguishability as
(3)  
where one readily identifies the latter expression as the Kullback-Leibler divergence of anomalies from normal signals.
This models a detector that knows the optimal codes for both the normal and the anomalous sources. It then observes the distorted stream , encodes it with both codes and decides whether there is an anomaly or not depending on which of the two encoding rates is lower. As a result, large values correspond to system configurations with high detection capability.
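For zero-mean multivariate Gaussians, the Kullback-Leibler divergence used above has a standard closed form, which is what the later Gaussian specialization relies on. A minimal numerical sketch (function names are ours):

```python
import numpy as np

def kl_gauss(S_anom, S_norm):
    # D( N(0, S_anom) || N(0, S_norm) ) in nats, zero-mean case:
    # 0.5 * ( tr(S_norm^-1 S_anom) - n + log det S_norm - log det S_anom )
    n = S_norm.shape[0]
    t = np.trace(np.linalg.solve(S_norm, S_anom))
    _, ld_norm = np.linalg.slogdet(S_norm)
    _, ld_anom = np.linalg.slogdet(S_anom)
    return 0.5 * (t - n + ld_norm - ld_anom)
```

As expected, the divergence vanishes when the two covariances coincide and is strictly positive otherwise, which is what makes it usable as an anomaly-aware distinguishability measure.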
II-B Distinguishability in anomaly-agnostic detection
When only the statistical characterization of the normal source is known at the detector, we may measure the distinguishability between and with
(4)  
This models a detector using only the code optimized for the normal source. When anomalous signals appear, their encoding causes a difference in average rate that is used to reveal them. Hence, large values stand for high distinguishability.
Unlike , is not always positive since there may be anomalies whose encoding yields a lower rate with respect to normal signals. When a positive quantity is needed we will use .
Both in the anomalyaware and in the anomalyagnostic case, we may want to assess average performance when the anomalies are randomly drawn from a certain set of possible behaviours.
III Optimum rate-distortion tradeoff in the Gaussian framework
We assume that are
dimensional Gaussian vectors such that if
, then and are independent and identically distributed , i.e., zero-mean Gaussian vectors with covariance matrix . Anomalous instances are also independent and identically distributed . In general, , but we will assume meaning that, on average, each sample in the vector contributes a unit energy.
By applying a proper orthonormal change of coordinates to (and thus to both and ) we may assume with and .
These assumptions allow us to derive an analytical assessment of the impact of compression on the anomaly detection capabilities. Such derivations rely on the function
of the vector parameterized by and by , that is the PDF of a random vector .
The Gaussian nature of allows a well-known derivation [12, Chapter 13] of the rate-distortion function .
The basic building block comes from the application of Property 1 to a single Gaussian source [12, Lemma 13.3.2].
Property 2.
If then the minimum rate achievable given a chosen distortion is
When we deal with a vector of independent Gaussians, Property 2 must be paired with the fact that the total distortion is the sum of the distortions imposed on the individual components. This leads to a water-filling result [12, Theorem 13.3.3] depending on a parameter
(5)  
(6) 
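The reverse water-filling solution can be sketched numerically: given the eigenvalues of the normal covariance and a total distortion budget, bisection finds the water level so that each component contributes the minimum between the level and its eigenvalue to the total distortion; components falling below the level are zeroed, and the surviving ones are coded according to Property 2. Function and variable names here are ours:

```python
import numpy as np

def reverse_waterfilling(lam, d_total, iters=200):
    # find the water level theta such that, with per-component distortion
    # d_j = min(theta, lam_j), the total distortion sum_j d_j equals d_total
    lam = np.asarray(lam, dtype=float)
    lo, hi = 0.0, lam.max()
    for _ in range(iters):
        theta = 0.5 * (lo + hi)
        if np.minimum(theta, lam).sum() < d_total:
            lo = theta
        else:
            hi = theta
    d = np.minimum(theta, lam)
    # components with lam_j <= theta are zeroed (rate 0); the others
    # cost half the log of the surviving variance ratio, in bits
    rate = 0.5 * np.log2(lam / d).sum()
    return theta, d, rate
```

For instance, with eigenvalues (4, 1) and a total distortion budget of 1, the water level settles at 0.5 and the minimum rate is 2 bits.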
To track the effect of water-filling we define as the dimensional identity matrix and use the matrices:
- , whose diagonal elements account for the fraction of energy cancelled by distortion along each component;
- , whose diagonal elements account for the fraction of energy that survives distortion along each component;
- .
The tradeoff between compression and distortion that is optimized along the rate-distortion curve is due to a proper distortion that produces a signal with PDF . Both and are given by the following Property, whose derivation is in the appendix.
Property 3.
The optimal distortion is given by
(7) 
and the optimally distorted signal has the PDF
(8) 
The compression mechanism that optimally addresses the rate-distortion tradeoff for the normal source is also used on the anomalous instances. This produces , whose PDF is given by the following Property, whose proof is in the appendix.
Such a result has two noteworthy corner cases.

- If there is no distortion. In fact, since , Property 4 gives .
- If there is no anomaly, , and .
IV Distinguishability in the Gaussian framework
IV-A Pointwise distinguishability
Assume that, after compression, and are both Gaussian dimensional random vectors whose first components survive distortion and are non-null, while the other components are set to and thus cannot be used to tell anomalous from normal cases.
Say that the first components of are distributed according to while the corresponding components of are distributed according to .
Within this Gaussian assumption, we may obtain an expression for the distinguishability measures (3) and (4). The starting point is the specialization of the building block when and are Gaussian as given by the following Property whose derivation is in the appendix.
Property 5.
If and then
(10) 
where indicates the determinant of its matrix argument.
(11)  
(12) 
in which distinguishability measures are given in bits.
Clearly, both figures vanish when , i.e., when along the directions surviving distortion, the normal and the anomalous signals coincide. Note also that, with respect to , is linear while is convex [13, Chapter 3].
Properties 3 and 4 imply that, when the normal and anomalous signals are Gaussian before compression, the performance of anomaly detectors depends on how much we are capable of distinguishing between the two distributions in (8) and (9).
In this case, we have , is the upper-left submatrix of in (8), and is the upper-left submatrix of in (9). By straightforward substitution and exploiting the diagonality of both and , we obtain that is the upper-left submatrix of .
As a noteworthy particular case, when the normal signal is white, i.e., when , we have that and that for any , , and . Hence, , , and .
(13)  
(14) 
Having in the case of a normal signal distributed as white Gaussian noise is not surprising, since the distinguishability modelled by depends only on the statistics of , which has no exploitable structure.
IV-B Average on the set of possible anomalies
Since anomalies are zero-mean Gaussian vectors, they are completely defined by their covariance matrix .
We decompose with and orthonormal.
The set of all possible is
while the set of all possible is that of orthonormal matrices
By indicating with
the uniform distribution in the argument domain, we will assume that when
is not known then , and that when is not known then , independently of . Note now that is invariant with respect to any permutation of the . Since , also must be invariant with respect to the same permutations, so that for any . Since has a constrained sum and is the diagonal of , we have . This implies
(15)  
(16) 
Hence, in our setting, the average anomaly is white and we may compute the corresponding distinguishability measures and . Note that, in this case, is the upper-left submatrix of , i.e., the diagonal matrix in which the th diagonal entry is the sum of the th diagonal entry of and of the th diagonal entry of , i.e.,
With these quantities, the expressions of the distinguishability measures become
(17)  
(18) 
Note that due to the linearity of and the convexity of we have and .
Moreover, the very simple structure of allows the derivation of the following Property whose proof is in the appendix.
Property 6.
If , then for at least one point
Property 6 means that in case of
distributed as Gaussian white noise, there exists at least one critical level of distortion that makes detectors that do not use information on the anomaly ineffective.
IV-C Asymptotic distinguishability
White signals are not only the average anomalies but are also typical anomalies, in the sense specified by the following Property, whose proof is in the appendix.
Property 7.
If and then, as , tends to in probability.
Hence, when increases, most of the possible anomalies behave as white signals and , which thus enjoys the properties shown in the previous subsection.
V Numerical examples
In this section we match the theoretical derivations with the quantitative assessment of the performance of some practical anomaly detectors applied to compressed signals.
Normal signals are assumed to be where and is set to yield different degrees of non-whiteness. Non-whiteness is measured with the so-called localization, defined as
that goes from when the signal is white to when all the energy is concentrated along a single direction of the signal space (see [14] for more details). To show the effect of realistic localization [15], we consider values of corresponding to .
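The exact expression of the localization is elided above; a common form consistent with the stated range (1/n for a white signal, 1 when all the energy lies along a single direction) is the ratio between the trace of the squared covariance and the squared trace. The sketch below assumes that form:

```python
import numpy as np

def localization(K):
    # assumed form: L = tr(K^2) / tr(K)^2, ranging from 1/n (white signal)
    # to 1 (all the energy along a single direction of the signal space)
    return np.trace(K @ K) / np.trace(K) ** 2

n = 8
L_white = localization(np.eye(n))             # white covariance -> 1/n
K_peak = np.zeros((n, n)); K_peak[0, 0] = n   # all energy on one direction
L_peak = localization(K_peak)                 # -> 1
```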
Anomalous signals are generated as , where is randomly picked according to the uniform distribution defined in Section IV-B.
To generate , we follow [16] to first draw for and then set
To generate , we follow [17] and start by generating a matrix within the Ginibre ensemble [18], i.e., with independent entries for . We then set to the orthonormal factor of the decomposition of .
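The construction of [17] can be sketched as follows; the sign correction on the diagonal of the triangular factor is what makes the orthonormal factor uniform with respect to the Haar measure, rather than merely orthogonal:

```python
import numpy as np

rng = np.random.default_rng(1)

def haar_orthogonal(n):
    # QR of a Ginibre matrix (i.i.d. standard normal entries); fixing
    # the signs with R's diagonal yields a Haar-distributed Q [17]
    G = rng.standard_normal((n, n))
    Q, R = np.linalg.qr(G)
    return Q * np.sign(np.diag(R))

U = haar_orthogonal(16)
```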
A first use of this random sampling is the possibility of pairing Property 7 with some numerical evidence. Fig. 3 reports the vanishing trend of the average squared and uniform deviation from of a population of uniformly distributed covariance matrices . Though not a theoretical result, note that empirical evidence supports a classical convergence.
As far as detector assessment is concerned, we set and consider three compression techniques tuned to the normal signal and applied to both normal and anomalous instances. More specifically, is mapped to by

- the minimum-rate-given-distortion compression in (7) (Rate-Distortion Compression, RDC);
- the projection of along the subspace spanned by the eigenvectors of with the largest eigenvalues (Principal Component Compression, PCC);
- a family of autoencoders [19, Chapter 14] with an increasingly deficient latent representation (AutoEncoder Compression, AEC). Assuming that is the dimensionality of the latent representation, the encoder is a neural network with fully connected layers of dimensions , , , , and the decoder is the dual network whose layers have dimensions , , and , and whose number of inputs is . The family of autoencoders is trained to minimize the distortion computed as in (1). To smooth performance degradation we first train an autoencoder with . Then, the node of the latent representation along which we measure the least average energy is dropped to produce a smaller network with an dimensional latent space, which is retrained using the previous weights as initialization. This is repeated decreasing and thus considering larger distortion values.
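Of the three schemes, PCC is the simplest to sketch: project onto the span of the leading eigenvectors of the normal covariance, so that the expected distortion equals the energy lying in the discarded eigendirections. A minimal sketch (names are ours):

```python
import numpy as np

def pcc_projector(K, k):
    # keep the k eigendirections of K with the largest eigenvalues
    lam, U = np.linalg.eigh(K)   # eigenvalues in ascending order
    Uk = U[:, -k:]               # leading eigenvectors
    return Uk @ Uk.T             # orthogonal projector onto their span

# expected distortion E||x - Px||^2 = tr((I - P) K): the energy
# lying in the discarded eigendirections
K = np.diag([3.0, 2.0, 1.0])
P = pcc_projector(K, 2)
distortion = np.trace((np.eye(3) - P) @ K)
```

With the toy covariance above, keeping two components discards exactly the unit of energy carried by the smallest eigenvalue.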
The three schemes we consider address the tradeoff between compression and distortion in different ways. This can be shown by pairing each of them with a quantization stage ensuring that rate values are finite. In particular, we encode each component of with bits, and this yields rates of less than
bits per time step. We assume that quantization is fine enough to substantially preserve the Gaussian distribution of
and thus evaluate the mutual information between and as if they were jointly Gaussian, with a covariance matrix that we estimate by Monte Carlo simulation [20]. Such an estimation yields the rate-distortion curves in Fig. 4. As expected, RDC yields the smallest rates while PCC gives the largest ones. Between the two we have AEC, whose performance depends on the effectiveness of the training.
The compressed version of the signal is then passed to a detector whose task is to compute a score such that high-score instances should be more likely to be anomalous. The final binary decision is taken by matching the score against a threshold.
We consider two detectors not relying on information on the anomaly:
- a Likelihood Detector (LD), whose score is the inverse of the log-likelihood of the instance with respect to the normal signal distribution, so that to the instance we associate the score ;
- a One-Class Support-Vector Machine (OCSVM) [21], trained on a set of instances of normal signals contaminated by 1% of unlabelled white instances to help the algorithm find the envelope of normal instances.
We also consider two detectors that are able to leverage information on the anomaly:
- a Neyman-Pearson Detector (NPD) [22, Chapter 3], whose score is the difference between the log-likelihoods of the instance with respect to the normal and the anomalous distributions, so that to the instance we associate the score ;
- a Deep Neural Network (DNN) with 3 fully connected hidden layers of , , neurons with ReLU activations and a final sigmoid neuron producing the score. The network is trained by backpropagation with a binary cross-entropy loss against a dataset containing labelled normal and anomalous instances.
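The LD and NPD scores can be sketched for zero-mean Gaussian models. The sign conventions below (anomalous-looking instances receive the higher score) are our assumption, since the exact expressions are elided above; function names are ours:

```python
import numpy as np

def loglik(z, K):
    # log-likelihood of z under a zero-mean Gaussian with covariance K
    n = K.shape[0]
    _, ld = np.linalg.slogdet(K)
    return -0.5 * (n * np.log(2 * np.pi) + ld + z @ np.linalg.solve(K, z))

def ld_score(z, K_norm):
    # likelihood detector: low likelihood under the normal model -> high score
    return -loglik(z, K_norm)

def npd_score(z, K_norm, K_anom):
    # Neyman-Pearson detector: log-likelihood ratio between the anomalous
    # and the normal model
    return loglik(z, K_anom) - loglik(z, K_norm)
```

Under these conventions, an instance far from the bulk of the normal distribution raises the LD score, and an instance better explained by the anomalous covariance raises the NPD score.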
The LD and NPD detectors can be employed only on signals compressed by the RDC or PCC methods, since they rely on the statistical characterization of the signals, which is not available after the nonlinear processing in AEC.
detector | training: # anomalies, # instances | assessment: # anomalies, # instances
LD       |                                    |
OCSVM    | 1                                  |
NPD      |                                    |
DNN      | 50                                 | 50
Table I shows how many different anomalies and how many signal instances are generated for the training (when needed) and for the assessment of the detectors. Note that in the DNN case we limited the number of anomalies, since the training process must be repeated for each of them.
To be independent of the choice of thresholds, detectors' performance is assessed by the Area-Under-the-Curve (AUC) methodology [23]. The AUC estimates the probability that, given a random normal instance and a random anomalous instance, the former has a lower score than the latter, as it should be in an ideal setting. Hence, the AUC is a positive performance index.
Clearly, detectors with are no better than coin tossing. Yet, if , the score has some ability to distinguish normal and anomalous signals if it is interpreted in a reverse way. Hence, it is convenient to set our empirical distinguishability measure to
to obtain a nonnegative quantity that increases up to when the detector improves.
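The exact expression of the empirical measure is elided above; a form consistent with the text (non-negative, treating a consistently reversed score as still informative) is the AUC folded around the coin-tossing level 1/2, which we assume in the sketch below (names are ours):

```python
import numpy as np

def auc(scores_norm, scores_anom):
    # probability that a random anomalous score exceeds a random normal
    # one, with ties counted as half
    s0 = np.asarray(scores_norm)[:, None]
    s1 = np.asarray(scores_anom)[None, :]
    return (s1 > s0).mean() + 0.5 * (s1 == s0).mean()

def empirical_distinguishability(scores_norm, scores_anom):
    # assumed form: |AUC - 1/2|, so a reversed but consistent score
    # is as informative as a direct one
    return abs(auc(scores_norm, scores_anom) - 0.5)
```

Perfect separation of the two score populations gives the maximum value regardless of the sign convention of the score, while overlapping populations give a value near zero.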
In the following, the trends of are reported and matched with the trends of and to show how the theoretical properties reflect on real cases. Comparisons must be partially qualitative, as the range and the significance of and are different from those of and , since the latter refer to practical detectors. All plots are made against a normalized distortion in the range , as larger relative distortions are usually beyond operative ranges.
V-A RDC
Fig. 5 summarizes the results we have in this case with two rows of 3 plots each. The upper row of plots corresponds to detectors that do not exploit information on the anomaly, while the lower row concerns detectors that may leverage information on the anomaly. Colors correspond to different , dashed trends assume that the anomaly is the average one, i.e., white, and shaded areas show the span of of the Monte Carlo population. The profiles of and on the left shall be matched with the profiles on the right, which correspond to the four detectors we consider. No profile appears for , as in that case .
Theory anticipates that, without any knowledge of the anomaly (upper row), a limited amount of distortion may cause distinguishability to vanish and thus detectors to fail. This actually happens for practical detectors such as LD and OCSVM. The distortion level at which detectors fail is also anticipated by and depends on , as predicted by Property 6. Overall, theory anticipates that in the low-distortion region more localized signals are more distinguishable from anomalies, though they cause detector failures at smaller distortions with respect to less localized signals.
Detectors leveraging knowledge of the anomaly (lower row) fail completely only at distortion , as revealed by the abstract distinguishability measure . Also in this case, by comparing the trend of with the zoomed areas in the NPD and DNN plots, we see how theory anticipates that in the low-distortion region more localized signals tend to be more distinguishable from anomalies, but cause a more definite performance degradation of the detectors when increases.
V-B PCC
From the point of view of the rate-distortion tradeoff, PCC is largely suboptimal. Yet, due to its linear nature, and are still jointly Gaussian. This allows us to compute the theoretical and by means of (11) and (12).
Fig. 6 summarizes the results we have in this case with plots of the same kind as in Fig. 5. The qualitative behaviours commented on in the previous subsection appear in the new plots and are anticipated by the theoretical trends.
The distortion levels at which anomalyagnostic detectors fail change with respect to the RDC case but are still anticipated by the theoretical curves and Property 6.
In this case, the values of beyond the breakdown distortion levels increase slightly more than in the optimal compression scenario. Hence, by adopting a compression strategy that is suboptimal in the rate-distortion sense, one may obtain a better distinguishability of the compressed normal signal from the compressed anomalies. This is, in fact, what happens in practice, as highlighted by the LD and OCSVM plots in the first row of Fig. 6.
V-C AEC
In this case compression is nonlinear, so that and may not be jointly Gaussian. This prevents us from computing the theoretical curves and , and from applying LD and NPD, which rely on the knowledge of the distribution of the signals.
For this reason, Fig. 7 reports only the performance of OCSVM and DNN detectors.
Notice how the qualitative trends of those performances still follow, though with a larger level of approximation, what is indicated by the theoretical curves for the other compressors.
VI Conclusion
Massive sensing systems may rely on lossy compression to reduce the bitrate needed to transmit acquisitions to the cloud while theoretically maintaining the important information. At some intermediate point along their path to centralized servers, compressed sensor readings may be processed for early detection of anomalies in the systems under observation. Such detection must be performed on compressed data.
We here analyze the tradeoff between compression and the performance of anomaly detectors by modelling rate, distortion, and distinguishability by means of information-theoretic quantities.
Such a tradeoff is different from the wellknown compressiondistortion tradeoff.
Some numerical examples in a toy case show that the theoretical predictions are able to anticipate a number of features of the performance trends of real detectors.
Appendix
Proof of Property 3.
Distortion is tuned to the normal case, which entails memoryless sources. Hence we may drop time indications and concentrate on a vector with independent components for .
We know from [24] that for a given value of the parameter , each component is transformed separately into . In particular, if then is encoded into . If then is encoded into where, to achieve the Shannon lower bound, must be an instance of a Gaussian random variable independent of . Hence, the three quantities , and must be such that with
(19) 
That explains in which sense encodes . In fact, the non-diagonal elements are positive, and thus and are positively correlated.
From (19), if we agree to identify a Gaussian with variance with a Dirac’s delta we infer that and thus .
Moreover, with the upper-left submatrix of in (19).
If we assume that , from the joint probability of and , we may compute the action of on the th component of as the PDF of given , i.e.,
If we define we have and
that becomes for (maximum distortion of this component implies that the corresponding output is set to ) and for (no distortion of this component, the output is equal to the input).
We may collect the componentwise PDFs into a vector PDF by using the matrix , and the matrix thus yielding the thesis. ∎
Proof of Property 4.
The PDF of distorted by means of can be computed as
Assume first to be in the low-distortion condition, which implies , and write