Data Imputation through the Identification of Local Anomalies

09/30/2014
by Huseyin Ozkan, et al.

We introduce a comprehensive statistical framework in a model free setting for a complete treatment of localized data corruptions due to severe noise sources, e.g., an occluder in the case of a visual recording. Within this framework, we propose i) a novel algorithm to efficiently separate, i.e., detect and localize, possible corruptions from a given suspicious data instance and ii) a Maximum A Posteriori (MAP) estimator to impute the corrupted data. As a generalization of the Euclidean distance, we also propose a novel distance measure, which is based on the ranked deviations among the data attributes and empirically shown to be superior in separating the corruptions. Our algorithm first splits the suspicious instance into parts through a binary partitioning tree in the space of data attributes and iteratively tests those parts to detect local anomalies using the nominal statistics extracted from an uncorrupted (clean) reference data set. Once each part is labeled as anomalous vs normal, the corresponding binary patterns over this tree that characterize corruptions are identified and the affected attributes are imputed. Under a certain conditional independency structure assumed for the binary patterns, we analytically show that the false alarm rate of the introduced algorithm in detecting the corruptions is independent of the data and can be directly set without any parameter tuning. The proposed framework is tested over several well-known machine learning data sets with synthetically generated corruptions and experimentally shown to produce remarkable improvements in terms of classification performance with strong corruption separation capabilities. Our experiments also indicate that the proposed algorithms outperform the typical approaches and are robust to varying training phase conditions.


I Introduction

In many applications from a wide variety of fields, the data to be processed can be partially (or even almost completely) affected by severe noise in several phases, e.g., occlusions during a visual recording or packet losses during transmission in a communication channel. The partial, i.e., localized, corruptions introduced into the data by such problems often severely degrade the performance of the target application; for instance, face recognition or pedestrian detection under occlusion

[1, 2, 3, 4]. In order to reduce the impact of this adverse effect, we develop a complete and novel framework, which efficiently detects, localizes and imputes corruptions by identifying the local anomalies in a given suspicious data instance. We emphasize that neither the existence nor, if one exists, the location of a corruption is known in our framework. Moreover, the proposed algorithms do not assume a model but operate in a data driven manner.

We consider the local corruptions as statistical deviations from the nominal distribution of the uncorrupted (clean) observations. To detect and localize corruptions, i.e., such statistical deviations, we model a corruption as an anomaly due to an external factor (a communication failure in a channel or an occluder object in an image), which locally overwrites a data instance and moves it outside the support of the nominal distribution. However, the corruptions that we consider have, as examples of anomalies, further specific properties: (a) the corruptions in an instance are confined to unknown intervals along the data attributes, i.e., localized, and (b) not only a corrupted part but also all of its subparts are anomalous. Thus, a corruption is not an anomaly that arises merely from an incompatible combination of normal subparts. Based on these properties, which accurately model a wide variety of real life applications, we characterize the event of corruption and formulate the corresponding detection/localization as an anomaly detection problem, cf. [5, 6, 7, 8, 9, 10, 11].

The introduced algorithm applies a series of statistical tests with a pre-specified false alarm rate to the parts of the suspicious instance after extracting the nominal statistics from a reference (training) data set of uncorrupted (clean) observations. As a result, each part is labeled as anomalous/normal and the local anomalies are identified. These parts are generated and organized through a binary tree partitioning of the data attributes, each node of which corresponds to a part of the suspicious instance. Once the nodes (or parts) are labeled as anomalous/normal on this tree, the patterns of corruption are identified using the aforementioned characterization to detect and localize corruptions. We point out that this localization procedure transforms the nominal distribution into a multivariate Bernoulli distribution with a success probability that precisely coincides with the constant false alarm rate of the local anomaly tests. Considering the hierarchy among the binary labels implied by the tree as a directed acyclic graph, the resulting multivariate Bernoulli distribution achieves a certain dependency structure. Under this condition, we derive the false alarm rate of the proposed framework in detecting the corruptions and show that it is a constant rate, namely, no parameter tuning is required even if the data change.

If a corruption is localized, then we impute/replace the affected attributes with estimates of the underlying unknown true attributes. For this purpose, we additionally develop a novel Maximum A Posteriori (MAP) estimator using the “score function” defined in [8]. Our estimator exploits the local dependencies among the data attributes, where the locality is encoded in the binary partitioning tree. We point out that the implementation of this MAP estimator does not add extra computational cost since it utilizes the outputs of our anomaly detection approach, which are computed prior to the imputation phase. Furthermore, we also propose a novel distance measure named the “ranked Euclidean distance” as a generalization of the standard Euclidean distance, which is used in the course of labeling each part as anomalous/normal. The proposed distance measure is compared with the standard Euclidean distance in the experiments and shown to be superior in terms of detecting and localizing corruptions.

We conduct tests over several well-known machine learning data sets [12, 13], which are exposed to severe data corruptions. Our experiments indicate that the proposed framework achieves significant improvements in classification performance after imputation and outperforms the typical approaches. The proposed algorithms are also empirically shown to be robust to varying training phase conditions with strong corruption separation capabilities.

I-A Related Work

In this study, the corrupted attributes are considered to be statistically independent of the underlying unobserved true data, i.e., corrupted attributes are of no use in the estimation of their uncorrupted counterparts. Hence, if one knows which attributes are corrupted in an instance, then those attributes can readily be treated as missing data, cf. [14, 15, 16, 17, 18, 19]. For example, classification and clustering with missing data is a well-studied problem in the machine learning literature. The corresponding studies such as [17, 20, 16, 18, 21] are related to inference with incomplete data [17] and generative models [20], where Bayesian frameworks [18] are used for inference under missing data conditions. Alternatively, pseudo-likelihood [22] and dependency network [23] approaches solve the data completion problem by learning conditional distributions. In [24], the probability density of the missing data is modeled conditioned on a set of introduced latent variables and, thereafter, a MAP based inference is used. However, all of the studies [14, 15, 16, 17, 18, 24, 22, 23, 20, 21] either assume knowledge of the locations of the missing attributes or impose strong modeling constraints, as opposed to the model free solutions in this paper.

On the other hand, imputation is commonly used as a pre-processing tool [18]. The Mixture of Factor Analyzers [25] approach replaces the missing attributes with samples drawn from a parametric density, which models the distribution of the underlying true data. In contrast, the imputation techniques proposed in [26, 27] are both non-parametric and based on the inference of the posterior densities via certain kernel expansions. The MAP estimator in this study, however, does not even attempt to estimate the posterior density in either a parametric or non-parametric manner. Instead, the introduced method is based only on the sufficient rank statistics. We emphasize that unlike our approach, the incomplete data approaches generally assume knowledge of the missing attributes, i.e., that they are precisely localized and provided beforehand. For example, the occluded pixels in the event of occlusion of a target object in an image cannot be known a priori, which requires a detection and localization step. Since the existing studies do not have such a step, an exhaustive list of the occluded pixels, as the result of a manual inspection of the missing attributes, is required as an input to the algorithms proposed in the corresponding literature. In this regard, our study is the first to jointly handle the issues of detecting/localizing missing attributes, i.e., corruptions, as well as their imputation in one complete framework. Hence, the generic local corruption detection and imputation algorithm of our framework complements the missing data imputation approaches as an additional merit.

Data imputation and completion is also essential in image processing for handling corrupted images, e.g., [28, 29]. Generally, a corrupted image is restored by explicitly learning the image statistics [30, 31] or by using neural networks [32, 33, 34]. These denoising studies do not attempt to localize corruptions in an image, but treat them as noise and filter it out using statistical approaches applied to the image globally. Even though this is a valid approach for image enhancement, an attempt to correct/enhance an image globally in the case of only a localized corruption might even be detrimental, since the uncorrupted parts are also affected by global operations. Additionally, it is usually not possible to locally impute corrupted portions using denoising approaches. There exist several studies that aim at localization as well. Studies such as [4, 1] indicate that occlusion, as an example of corruption, is a common phenomenon and detrimental in pedestrian detection as well as face recognition applications. In this regard, detection of occluded, i.e., corrupted, visual objects has previously been investigated in a number of studies [35, 36, 37, 38]. In these studies, occlusion detection is performed using domain specific knowledge (visual cues) or external information (object geometry), which, however, are not always available in a generic data imputation setting. From the machine learning perspective, descriptors are extracted from various parts of the occluded object in [39] and, similarly, part-based descriptors are weighted with an occlusion measure in [40] to relieve the corresponding degrading effects. Since these approaches do not directly target handling occlusions, i.e., corruptions, they only provide partial or limited solutions. Several other studies propose solutions via extracting occlusion maps, e.g., [41, 42]. In [41], HOG based classification errors, and in [42], template based reconstruction errors, are used to generate such an occlusion map. However, both studies assume rigid models and rely significantly on domain specific knowledge; in general, they fail to remain applicable if the data source belongs to another domain. In this study, we assume that the data is generic and no domain information is available, yet detection and imputation of corruptions is necessary for improving the subsequent processing stages, such as classification.

I-B Summary of Contributions

  1. This study is the first to jointly handle localized data corruptions in one statistical framework that is designed completely model free for the goal of separating a corruption and imputing the affected data attributes. We also provide a false alarm rate (in detecting corruptions) analysis of the framework via directed acyclic graphs.

  2. A novel MAP estimator for imputation and a novel distance measure for corruption localization purposes are proposed.

  3. The proposed framework is computationally efficient in the sense that (i) it effectively utilizes a binary search for corruption separation, and (ii) the computational load due to our MAP based imputation is insignificant.

  4. We propose a characterization for anomalies, e.g., rarities, incompatible combinations and corruptions, which is a novel notion.

In Section II, we provide the problem description. We then present our algorithm in Section III and the associated computational complexity in Section IV. We report the corruption detection/localization performance of the proposed algorithm as well as the improvement in classification tasks achieved by the imputation in Section V. The paper concludes with a discussion in Section VI.

II Problem Description

We have a possibly corrupted test instance $x \in \mathbb{R}^d$ along with a set of uncorrupted (clean) independent and identically distributed observations $\mathcal{S} = \{s_1, \ldots, s_N\}$ drawn from $f_0$ as the nominal training (reference) data, where $d$ is the data dimensionality and $f_0$ is the unknown nominal density. The test instance is considered to be corrupted with probability $\pi$ by severe noise in multiple non-overlapping intervals along its dimensions (attributes), which are completely unknown. Suppose that for such an interval, the corruption is localized and confined to the attributes $x_i, \ldots, x_j$ for some $i$ and $j$ in $\{1, \ldots, d\}$ with $i \le j$. We assume that the corrupted attributes are uniformly and independently distributed, $x_n \sim \mathcal{U}$ for $i \le n \le j$, where $\mathcal{U}$ is the uniform distribution defined on a finite support. Moreover, the corrupting noise is also statistically independent of the true data and hence, the knowledge of the corrupted attributes is irrelevant to the uncorrupted counterparts. Note that this corruption model implies a total erasure of data in several unknown portions due to an independent source overwriting the attributes in those portions, e.g., an occluder in computer vision applications [1, 4]. Typically, since no information is provided about the independent source in such applications, we consider that the uniformity assumption draws a worst case scenario and is realistic. On the other hand, $x$ is considered to be uncorrupted with probability $1 - \pi$. Therefore, whether a test instance includes a corruption is unknown; and it is generally modeled to be drawn from the mixture $f = (1 - \pi) f_0 + \pi f_1$ [8], where $f_1$ is the probability density of the corrupted instances.

The density $f_1$ can be derived from the unknown nominal density $f_0$ using the described corruption model, if the distributions of $i$, $j$ and the number of corrupted intervals are further specified; this is unnecessary in the context of this paper. Hypothetically, if one could correct an instance drawn from the density $f_1$ by replacing all the corrupted attributes, e.g., $x_i, \ldots, x_j$, with the underlying true attributes, e.g., $\bar{x}_i, \ldots, \bar{x}_j$, and obtain $\bar{x}$, then $\bar{x}$ would follow the nominal density $f_0$. Similarly, if the corruptions in $x$ can be localized, then the corresponding portions would follow the multivariate uniform density of the appropriate dimensionality. On the other hand, this corruption model potentially creates significant statistical deviations from the reference data since a corrupted observation, in general, increasingly diverges from $f_0$ as the corruption strength increases. Here, the corruption strength can be considered as the number of corrupted attributes and/or the variance of the corruption noise that overwrites the true data. Furthermore, our modeling of corruptions poses a missing (incomplete) data problem since the unknown true attributes $\bar{x}_i, \ldots, \bar{x}_j$ in a corrupted interval are statistically irrelevant to the corrupted attributes $x_i, \ldots, x_j$. In this paper, by exploiting the statistical deviations from the nominal distribution of observations, we aim to detect and localize the possible corruptions in a given instance and impute the corrupted or missing attributes.

To this end, we formulate an anomaly detection approach to define this framework in Section III, where we draw the distinctions among several examples of anomalous observations and separate the event of corruption. Then, we propose our algorithm and analyze the associated false alarm probability in detecting corruptions as well as the computational complexity.

III A Novel Framework for Corruption Detection, Localization and Imputation

In this section, we develop a novel framework for a complete treatment of possible corruptions in the input data $x$. For presentational clarity and without loss of generality, we assume throughout this section that the input data can be corrupted only in a single interval. Note that the generalization to the case of corruptions spread over several intervals is immediate and, indeed, we present a corresponding detailed experiment in Section V. Since the corruptions are modeled as local statistical deviations within this framework, we give a brief description of the anomaly detection approach that we work with in Section III-A. Based on the characterization of corruptions through their distinctive properties in Section III-B, we present Algorithm TCS (Tree-based Corruption Separation). After we derive a novel MAP estimator for imputation in Section III-C, we derive the false alarm rate of the proposed framework in detecting the corruptions in Section III-D.

III-A Detection of Statistical Deviations: Anomalies

A localized corruption is considered to affect an instance in a certain part(s) such that the affected attributes statistically deviate from the vast majority of the data. The proposed algorithm in this paper localizes the corrupted attributes by identifying the local anomalies through a series of statistical checks of the test instance with the reference data. In this section, we briefly describe the anomaly detection approach that we work with and present a novel distance measure for the corruption localization purpose.

The probability density of a possibly corrupted test instance can be modeled as the mixture

$$f(x) = (1 - \pi) f_0(x) + \pi f_1(x),$$

where $f_0$ is the null hypothesis from which the nominal data are drawn, $f_1$ is the hypothesis representing the corrupted observations, and $\pi$ is the corresponding mixing coefficient. Within the framework of anomaly detection approaches, the nominal distribution $f_0$ is usually assumed unknown or hard to estimate; and instead, a set of nominal observations $\mathcal{S}$ is provided. Then for a given test instance $x$, the task in [8] is to decide whether the null hypothesis $f_0$ was realized or the alternative $f_1$, such that the detection rate (of anomalies) is maximized with a constant false alarm rate $\tau$. For this purpose, the score function [8]

$$\hat{p}(x) = \frac{1}{N} \sum_{n=1}^{N} \mathbb{1}\left\{ d_k(s_n) \ge d_k(x) \right\} \qquad (1)$$

is proposed, where $\mathbb{1}\{\cdot\}$ is the indicator function and $d_k(y)$ is the Euclidean distance from $y$ to its nearest $k$'th neighbor in $\mathcal{S} \setminus \{y\}$, if $y \in \mathcal{S}$; and to its nearest $k$'th neighbor in $\mathcal{S}$, otherwise. Based on this score function, the test instance $x$ is declared as anomalous [8], if

$$\hat{p}(x) \le \tau. \qquad (2)$$
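To make the test in (1)-(2) concrete, the following is a minimal sketch of the kNN-based score and the resulting anomaly test; the NumPy-based helper names are our own illustration, not the authors' implementation.

import numpy as np

def knn_distance(point, data, k):
    # Euclidean distance from `point` to its k'th nearest neighbor in `data`.
    dists = np.sort(np.linalg.norm(data - point, axis=1))
    return dists[k - 1]

def score(x, S, k):
    # Score of (1): fraction of reference points whose k-NN distance
    # (leave-one-out within S) is at least that of the test instance x.
    d_x = knn_distance(x, S, k)
    count = 0
    for i in range(len(S)):
        rest = np.delete(S, i, axis=0)  # exclude s_i itself
        if knn_distance(S[i], rest, k) >= d_x:
            count += 1
    return count / len(S)

def is_anomalous(x, S, k, tau):
    # Test of (2): declare an anomaly when the score falls below tau.
    return score(x, S, k) <= tau

For instance, with a clean reference set S of shape (N, d), is_anomalous(x, S, k=5, tau=0.05) labels x while keeping the false alarm rate approximately at tau.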

When the mixing distribution $f_1$ is assumed uniform, it is shown in [8] that $\hat{p}(x)$ is an asymptotically consistent estimator of the density level of the test instance,

$$p(x) = \int_{\mathbb{R}^d} \mathbb{1}\{ f_0(t) \le f_0(x) \}\, f_0(t)\, dt, \qquad (3)$$

under certain smoothness conditions. Remarkably, the region $\{x : p(x) \ge \tau\}$ provides the minimum volume set at level $\tau$, which is the most powerful decision region for testing $f_0$ vs $f_1$ with a constant false alarm rate $\tau$ [7]. We note that the precision of the test defined in (2) degrades faster with the dimensionality than it improves with the size of the training data. As a result, we here point out several practical issues about detecting the existence of a corruption with this approach.

Briefly, i) a direct test of an instance does not localize a possible corruption for imputation; ii) on the contrary, a truly corrupted instance, i.e., an instance of the hypothesis $f_1$, does not necessarily test positive due to the limited training data and high dimensionality, as well as the possibility that the corruption is not sufficiently strong; and iii) corruptions have further specific properties in addition to being anomalies, which must be incorporated to achieve a better false alarm rate compared to $\tau$.

Ranked Euclidean Distances: To address the first issue in this list, we propose a novel distance measure (not a metric in the mathematical sense), which is sensitive to only a certain fraction of the attributes for a given pair of instances $y$ and $z$. For instance, a corruption of only a single attribute in a given test instance might be so strong that the whole instance turns anomalous under the test in (2) used with the standard Euclidean distance. In this case, any part of the instance including the corrupted attribute would test positive, which creates an ambiguity in terms of the localization, i.e., separation, of the corrupted attribute, and in turn requires an exhaustive search over all possible subsets in the space of the attributes.

To overcome such ambiguities, we propose a distance measure so that the test in (2) results positive only when the corruption has a sufficiently large support, which disregards a pre-specified fraction of the attributes that are most responsible for a possible corruption. We define this measure for a $\gamma \in (0, 1]$ as

$$d_\gamma(y, z) = \sqrt{ \sum_{n=1}^{\lfloor \gamma d \rfloor} \left( y_{\sigma(n)} - z_{\sigma(n)} \right)^2 }, \qquad (4)$$

where $\sigma$ is a permutation of the attributes with

$$|y_{\sigma(1)} - z_{\sigma(1)}| \le |y_{\sigma(2)} - z_{\sigma(2)}| \le \cdots \le |y_{\sigma(d)} - z_{\sigma(d)}|$$

and $\lfloor \cdot \rfloor$ is the floor operator. Since this distance measure depends only on the $\gamma$ fraction of the least deviated attributes between $y$ and $z$, a corruption must have a support of length at least $d - \lfloor \gamma d \rfloor$ to make an instance anomalous with respect to the reference data. Here, $1 - \gamma$ can be seen as the precision of the localization when an anomalous instance is checked with the test in (2) using the distance measure defined in (4). This precision obviously cannot be made arbitrarily large since, as $\gamma$ approaches $0$, the distance becomes more prone to noise and the correlation structure between the attributes is less exploited. We investigate this trade-off further in our simulations. The distance measure recovers the standard Euclidean distance when $\gamma = 1$ and will be referred to in the rest of the paper as the “ranked Euclidean distance”. We note that for the cases $\gamma < 1$, $d_\gamma$ fails to be a metric in the mathematical sense, i.e., $d_\gamma(y, z) = 0 \Leftrightarrow y = z$ is not satisfied, which would require specifying a nominal density model for $f_0$ to derive the same asymptotic consistency as in [8] for the score values in estimating the density levels with $d_\gamma$. However, in this work, we do not assume any density model for $f_0$, nor do we make any stochastic assumptions regarding the data source.
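Under the notation reconstructed above, (4) admits the following direct implementation; gamma plays the role of the fraction parameter and the function name is ours.

import numpy as np

def ranked_euclidean(y, z, gamma=1.0):
    # Ranked Euclidean distance of (4): keep the floor(gamma * d) least
    # deviated attributes and take the root of their summed squared deviations.
    dev = (y - z) ** 2
    m = int(np.floor(gamma * len(y)))  # number of attributes kept
    kept = np.sort(dev)[:m]            # smallest deviations first
    return np.sqrt(kept.sum())

Note that gamma = 1.0 recovers the standard Euclidean distance, while for gamma < 1 two distinct vectors can be at distance zero, which is exactly the loss of the metric property discussed above.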

In the following section, we characterize the corruptions by presenting their specific properties and propose an algorithm to localize and impute corruptions.

III-B Modeling of Localized Corruptions

Fig. 1: An illustration of Algorithm TCS (Tree-based Corruption Separation).

If a test instance is subject to corruption in only a small part, the corruption might not be detectable when it is checked using an anomaly detection algorithm without a detailed analysis of its parts. On the other hand, an anomalous observation does not necessarily contain a corruption since it might simply be a false alarm, in fact an uncorrupted observation. To address these two issues, we propose a statistical analysis of a test instance through its parts using a binary partitioning tree in the space of data attributes on which, we also provide a characterization to separate the event of corruption among possible anomaly scenarios.

Suppose that an instance $x$ corresponds to the root node on a binary tree. Using half-way splits for presentational simplicity, let the first half of the attributes, $\{1, \ldots, \lfloor d/2 \rfloor\}$, be assigned to the left child node of the root and the remaining attributes, $\{\lfloor d/2 \rfloor + 1, \ldots, d\}$, be assigned to the right child node, Fig. 1. Note that the attribute sets of two sibling nodes are disjoint and their union recovers the attribute set of the parent. Based on this strategy for generating subparts of an instance, we propose Algorithm TCS (Tree-based Corruption Separation) to separate and impute corruptions, which recursively expands a depth-$D$ binary tree to partition the space of attributes. For each node $v$ created in the course of this expansion, the corresponding attributes/part of the test instance, e.g., $x_v$, is checked for consistency with the reference data restricted to those attributes, e.g., $\mathcal{S}_v$, using the test defined in (2). We here use the ranked Euclidean distance in this testing with a pre-specified $\gamma$. Therefore, each node encountered in this expansion is assigned a binary label as anomalous/normal and a fully labeled (possibly unbalanced) tree is obtained for the test instance $x$. We emphasize that Algorithm TCS does not completely construct this depth-$D$ binary tree at the beginning but instead expands it by creating the nodes and the edges as needed to achieve an efficient implementation, which continues until each data attribute is decided to be corrupted or uncorrupted.

We consider several scenarios in which the observation at a node can be anomalous. In Fig. 2, the nodes are illustrated as circles, if the corresponding part is found to be anomalous; and as squares otherwise. An anomaly can be wide-spread over the attributes and consist of anomalous subparts as illustrated in Fig. 2a, which is regarded as a conclusive pattern since a corruption is characterized and defined in Section II by the property that all subparts of a corrupted part are also corrupted. Hence, a corruption at the starred node in Fig. 2a is declared, unless it is the root node. Note that a global corruption at the root is disregarded in this work since it is not localized. In another case, an anomalous observation could be non-anomalous in its parts as illustrated in Fig. 2b, which simply happens due to an incompatible or rare combination of attributes in its subparts. This is a typical situation in which an anomalous observation is not corrupted. Hence, this case also provides a conclusive pattern in our consideration such that a corruption is rejected at the anomalous node. On the contrary, the case in Fig. 2c is an inconclusive pattern, which suggests a corruption at the right child; however, whether the corruption is spread over the attributes of that child or localized is unknown. Hence, the attributes of the right child are further split and explored similarly. Then, if the conclusive pattern in Fig. 2a (or Fig. 2b) is realized, the corruption is accepted and localized (or rejected) at the starred node in Fig. 2d. Otherwise, the search continues. On the other hand, if a significantly small subset of the corrupted attributes is left at the left child node in Fig. 2c, it might not be detectable and might be labeled as normal. Then the corresponding attributes should be further split as illustrated in Fig. 2d. This process recursively defines a corruption localization with an improved false alarm rate, as several anomalies are rejected for being false alarms, i.e., non-corrupted anomalies.

The introduced Algorithm TCS then searches the described binary tree for a corruption in a breadth-first fashion. When the conclusive (or terminating) pattern shown in Fig. 2a (Fig. 2b) is found in the course of this expansion, the search is stopped at the parent node of the found pattern, i.e., the tree is pruned on that branch, and a corruption is declared (or no corruption is found and no action is necessary) for the corresponding attributes. This search for corruption at each branch starting from the root node continues to the corresponding leaf node, unless a terminating pattern is found. Finally, if a conclusive pattern is not encountered at a branch from the root to an anomalous leaf, we opt to accept the corruption at the leaf to favor a better detection at the cost of an increased corruption false alarm rate. An illustration of the progress of the algorithm is given in Fig. 1, where the corrupted attributes are successfully located. Note that a small set of the attributes is mislabeled as corrupted, i.e., false alarms, in region 3, which can be corrected if the partitioning resolution is improved by increasing the depth $D$.

Fig. 2: An anomalous observation with several scenarios in its parts. Note that the starred nodes indicate localized corruptions. (a) A conclusive pattern: corruption is detected. (b) A conclusive pattern: corruption is rejected. (c) An inconclusive pattern: Anomaly consisting of normal and anomalous parts needs to be further explored. (d) Further exploration of the test instance to locate a possible corruption by searching a conclusive pattern.

III-C Maximum A Posteriori (MAP) Based Imputation

We emphasize that in most detection and estimation applications, the posterior density of the target, e.g., in (5), is too complicated to assume realistic parametric models, so that nonparametric approaches are often favored in such situations [43]. In accordance, we introduce an algorithm that works in a completely model free setting regarding both the localization of the corruptions and the imputation. Furthermore, we point out that MAP based estimators are generally known to generate more plausible results when the posterior density is multi-modal, compared to MMSE based estimators, i.e., simple (possibly weighted) averaging, which can even generate infeasible solutions [44, 45, 46]. This is often the case especially for computer vision and machine learning applications such as edge preserving image denoising [47]. For instance, the gradients in an occluded pedestrian image would get overly smoothed in an MMSE based imputation, which might cause gradient based feature extractors, e.g., HOG [48], to fail or not perform satisfactorily in a pedestrian detection application [4, 43]. For these reasons, we propose a novel MAP based imputation technique, which always generates feasible and likely estimates and approximates the true MAP estimator as the size of the reference data increases.

Once a corruption is localized for an instance $x$ at a node $v$, our task is to estimate the original attributes using the training data set $\mathcal{S}$ as well as the instance $x$, and impute accordingly, i.e., replace the corrupted attributes in $x$ with the estimates. Since we assume the corrupted attributes to be statistically independent of the underlying true data, we treat the corrupted attributes as missing data, which then should have no effect in the estimation of the true attributes. Hence, we condition this estimation on the remaining attributes of $x$. On the other hand, we note that in most applications such as image compression [49], data attributes in sufficiently close proximity are usually modeled to manifest high correlation. In accordance, we propose to estimate the unknown data conditioned on the attributes associated with its nearest neighbor on our tree, i.e., the sibling node $v_s$ of $v$. Note that due to the localization of corruptions by Algorithm TCS (Tree-based Corruption Separation), the attributes at the sibling node are certainly detected to be uncorrupted in the case of the standard Euclidean distance; and detected to be uncorrupted with significantly high probability in the case of the ranked Euclidean distance (cf. Section III-D). In the following, we introduce a novel Maximum A Posteriori (MAP) estimator of the true data underlying the corrupted attributes based on the standard Euclidean distance ($d_\gamma$ with $\gamma = 1$) and then discuss the generalization for the ranked Euclidean distance measure. We also stress that the implementation of this estimator is based only on the outputs of our corruption localization algorithm, which are computed before the imputation phase in the course of Algorithm TCS. Therefore, the imputation phase that we develop is computationally efficient in that it requires almost no further computations.

Input: test instance $x$; reference data $\mathcal{S}$; depth $D$; parameters $\tau$, $k$, $\gamma$

1: Initialize $C \leftarrow \emptyset$: set of corrupted attributes
2: Initialize $\hat{x} \leftarrow x$: imputed test data
3: Create the root node and label it via the test in (2)
4: procedure recurse($v$)
5:     Create nodes $v_l$ and $v_r$; and label them via the test in (2)
6:     if the pattern in Fig. 2a then
7:         if $v$ is the root then return
8:         else
9:             Declare corruption at $v$: $C \leftarrow C \cup v$
10:            Impute attributes in $v$
11:            return
12:        end if
13:    else if the pattern in Fig. 2b then return
14:    else if $v$ is a parent of a leaf then
15:        if $v_l$ (or $v_r$) is anomalous then
16:            Declare corruption at $v_l$ (or $v_r$): $C \leftarrow C \cup v_l$ (or $v_r$)
17:            Impute attributes in $v_l$ (or $v_r$)
18:        end if
19:        return
20:    else
21:        recurse($v_l$) and recurse($v_r$)
22:    end if
23: end procedure

Return: $C$ and $\hat{x}$

Algorithm 1 TCS — Tree-based Corruption Separation
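The recursive core of Algorithm 1 can be sketched as follows; the is_anomalous predicate (e.g., the test in (2) with the ranked Euclidean distance) and the impute callback are abstracted away, and the pattern checks reflect our reading of Fig. 2 rather than the authors' exact implementation.

def tcs(attrs, is_anomalous, impute, depth, is_root=True):
    # attrs: list of attribute indices at the current node.
    # Returns the set of attribute indices declared corrupted.
    corrupted = set()
    mid = len(attrs) // 2                      # half-way split
    left, right = attrs[:mid], attrs[mid:]
    a_node = is_anomalous(attrs)
    a_left, a_right = is_anomalous(left), is_anomalous(right)

    if a_node and a_left and a_right:          # pattern of Fig. 2a
        if not is_root:                        # global corruption disregarded
            corrupted |= set(attrs)
            impute(attrs)
        return corrupted
    if a_node and not a_left and not a_right:  # pattern of Fig. 2b: reject
        return corrupted
    if depth == 1:                             # parent of leaf nodes
        for child, flag in ((left, a_left), (right, a_right)):
            if flag:                           # anomalous leaf => corruption
                corrupted |= set(child)
                impute(child)
        return corrupted
    corrupted |= tcs(left, is_anomalous, impute, depth - 1, is_root=False)
    corrupted |= tcs(right, is_anomalous, impute, depth - 1, is_root=False)
    return corrupted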

Since the only part of the test instance relevant to the proposed MAP estimator is $x_{v_s}$, we have

$$f(\bar{x}_v \mid x) = f(\bar{x}_v \mid x_{v_s}), \qquad (5)$$

where $f(\bar{x}_v \mid x_{v_s})$ represents a realization of the conditional probability density of the true data $\bar{x}_v$ underlying the corrupted attributes $x_v$. Then, the MAP estimator of $\bar{x}_v$ maximizes the posterior distribution as

$$\hat{x}_v = \arg\max_{\bar{x}_v} f(\bar{x}_v \mid x_{v_s}).$$

For any $\epsilon > 0$, and under certain smoothness constraints on $f$, let

$$f(\bar{x}_v \mid z_{v_s}) \approx f(\bar{x}_v \mid x_{v_s}) \quad \text{for all } z_{v_s} \in B_\epsilon(x_{v_s})$$

hold with some probability $1 - \delta_\epsilon$, where $B_\epsilon(x_{v_s})$ (w.r.t. the standard Euclidean distance) is the $\epsilon$-ball around $x_{v_s}$ and $\delta_\epsilon \to 0$ as $\epsilon \to 0$. Hence, since $\epsilon$ can be made arbitrarily small, we obtain, by the Bayes' rule,

$$\hat{x}_v = \arg\max_{\bar{x}_v} f(\bar{x}_v, x_{v_s}) \qquad (6)$$

with probability $1 - \delta_\epsilon$, where the denominator $f(x_{v_s})$ is dropped since it does not depend on the maximizer, i.e., $\bar{x}_v$. To approximate the MAP estimator given in (6), we adapt the nonparametric k-nearest neighbor (knn) based density estimation approach [50]. Let us define a small neighborhood around $x_{v_s}$ in $\mathcal{S}$ as

$$N(x_{v_s}) = \left\{ s \in \mathcal{S} : \| s_{v_s} - x_{v_s} \| \le d_k(x_{v_s}) \right\}, \qquad (7)$$

where $\| \cdot \|$ is the Euclidean distance and $d_k(x_{v_s})$ is the distance from $x_{v_s}$ to its nearest $k$'th neighbor in $\mathcal{S}$ (restricted to the attributes of the sibling node $v_s$) for some $k \ll N$. Note that as $N \to \infty$, $\lambda(N(x_{v_s})) \to 0$, where $\lambda$ is the Lebesgue measure. Then (6) yields

$$\hat{x}_v \approx \arg\max_{\bar{x}_v} \max_{z_{v_s} \in N(x_{v_s})} f(\bar{x}_v, z_{v_s}) \qquad (8)$$

with probability $1 - \delta_\epsilon$. When $N$ is sufficiently large with $d_k(x_{v_s}) \le \epsilon$ for some $k$, or $\epsilon$ is sufficiently small, we assume that $f$ is subject to negligible variations only within this neighborhood. Then, denoting the parent of $v$ and $v_s$ by $v_p$, we (with probability $1 - \delta_\epsilon$) obtain the approximation

$$\hat{x}_v \approx \hat{s}_v \quad \text{with} \quad \hat{s} = \arg\max_{s \in N(x_{v_s})} f(s_{v_p}), \qquad (9)$$

where, in order to obtain the corresponding maximum in the reference set $\mathcal{S}$, knowing the rank statistics of the density in $\mathcal{S}$ is enough, i.e., explicitly estimating/computing the density is unnecessary. Therefore, using the density level function defined in (3), we obtain

$$\hat{s} = \arg\max_{s \in N(x_{v_s})} p(s_{v_p}). \qquad (10)$$

For sufficiently large $N$, note that $\hat{p}$ approximates $p$ [8], i.e.,

$$\hat{p}(s_{v_p}) \approx p(s_{v_p}). \qquad (11)$$

Using the result in (10) in combination with (11), we propose to use the MAP based estimator of the true data underlying the corrupted attributes,

$$\hat{x}_v = \hat{s}_v \quad \text{with} \quad \hat{s} = \arg\max_{s \in N(x_{v_s})} \hat{p}(s_{v_p}), \qquad (12)$$

based on which we replace, i.e., impute, the corrupted attributes in the instance $x$ with $\hat{x}_v$ and obtain the imputed data $\hat{x}$.

This estimator is implemented in Algorithm TCS (Tree-based Corruption Separation) at every node $v$ in the tree where a corruption is detected. Namely, we i) obtain the neighbors $N(x_{v_s})$ of the test instance in the reference data set with respect to the attributes associated with the sibling node $v_s$; ii) for those neighbors in $N(x_{v_s})$, find the one, say $\hat{s}$, attaining the largest score value defined in (1) using the attributes associated with the parent node $v_p$; then iii) impute the instance $x$, which is detected to be corrupted at the node $v$, using $\hat{s}$ for the attributes of $v$. In the realistic case of high dimensional and limited data, when the standard Euclidean distance is used as in our derivations, $x_{v_s}$ might include corrupted attributes even though it is detected as normal, which clearly adversely affects the calculation of the neighborhood in (7). In addition, $x_v$ might include only a small support of corruption, and then we would not like to impute it completely. To overcome these two issues, we propose to use the ranked Euclidean distance defined in (4). To this end, the neighborhood is defined using $d_\gamma$ with an appropriate $\gamma$ in (7). This cancels the adverse effect, up to a certain degree, of a possible corruption in $x_{v_s}$ as desired. Nevertheless, recalling that $d_\gamma$ only uses a $\gamma$ fraction of the attributes and sets the others free, $d_\gamma$ is not a metric in the mathematical sense and then, as $N \to \infty$, $\lambda(N(x_{v_s})) \to 0$ does not hold. As a result, the correlation structure given in (5) is less exploited in the imputation as $\gamma$ decreases. Meanwhile, as $\gamma$ decreases, the support of the detected corruption in $x_v$ increases, i.e., the localization improves. Therefore, we obviously have a trade-off between the imputation quality and the localization, which is sensitive to the choice of $\gamma$ and investigated in the experiments in greater detail. However, $\gamma$ should typically be set around $1/2$ since we use half-way splits. Finally, note that the imputation brings almost no further computational complexity, since these steps computationally depend only on the anomaly detection results, cf. (2) and (1), at the corrupted node, its sibling node, and its parent node, which are all generated prior to the imputation steps.
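Putting steps i)-iii) together, a minimal sketch of the estimator in (12) could read as follows; score refers to the sketch of (1) given earlier, and the index lists v and v_sib (with the implied parent) are assumed to be provided by Algorithm TCS.

import numpy as np

def map_impute(x, S, v, v_sib, k, score):
    # v, v_sib: attribute index lists of the corrupted node and its sibling.
    v_par = sorted(v + v_sib)                  # parent node attributes
    # i) k nearest reference neighbors w.r.t. the sibling attributes, cf. (7)
    d = np.linalg.norm(S[:, v_sib] - x[v_sib], axis=1)
    nbrs = np.argsort(d)[:k]
    # ii) among those neighbors, pick the one attaining the largest score
    #     of (1) evaluated on the parent node attributes
    best = max(nbrs, key=lambda i: score(S[i, v_par], S[:, v_par], k))
    # iii) impute the corrupted attributes from the chosen neighbor, cf. (12)
    x_hat = x.copy()
    x_hat[v] = S[best, v]
    return x_hat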

In the following section, the proposed framework is shown to achieve a constant false alarm rate in terms of the corruption detection. Moreover, this false alarm rate is precisely calculated under a certain dependency structure among the anomalous/normal labels on the partitioning tree.

III-D False Alarm Rate in Detecting Corruptions

Since the imputation is an “overwriting” operation, whether or not to impute a suspicious instance is certainly a “critical” decision. In the case of a false decision, i.e., if the suspicious instance is in fact uncorrupted, which is “a false alarm in detecting corruptions”, the imputation would correspond to data loss. In this section, we study the rate of such occurrences and analyze the false alarm rate of the proposed algorithms in detecting corruptions.

The anomaly detection test applied at every node in Algorithm TCS (Tree-based Corruption Separation) operates with a constant false alarm rate $\tau$, whereas the proposed approach is able to reject corruptions at anomalous nodes. For example, when the terminating pattern in Fig. 2b is encountered, all the anomalies that may be present in the tree rooted from the terminating pattern are rejected, i.e., they are not counted as corruptions. For this reason, the false alarm rate of the proposed approach must be defined in the sense of corruptions as opposed to anomalies. To analyze this false alarm rate in detecting corruptions, one must also account for the fact that the anomaly detection test at a node could be strongly correlated with the outputs of the previous tests in the course of Algorithm TCS, since the data attributes are in general correlated. In this section, we first model the labeling of the nodes, i.e., anomalous vs normal, on the partitioning tree, cf. Fig. 1, as a directed acyclic graph [51] achieving a certain dependency structure and then derive the false alarm rate of Algorithm TCS. Under this modeling, we also show that the constant false alarm rate $\tau$ in detecting the local anomalies at each node globally maps to an also constant false alarm rate in detecting the corruptions.

Recall that Algorithm TCS expands the binary tree in Fig. 1 for a given test instance $x$ and declares a corruption only if the conclusive pattern in Fig. 2a is encountered or a leaf node is found anomalous in the described breadth-first search. In addition to the corruption localization as well as the imputation capabilities of the proposed Algorithm TCS, let us denote the corruption detection in Algorithm TCS by $T(x) = 1$, if $x$ is detected to be corrupted, and $T(x) = 0$, otherwise. Then our task is to find the false alarm probability in detecting the corruptions, which is given by

$$P_{fa} = P\big(T(x) = 1 \mid x \sim f_0\big), \qquad (13)$$

where $\tau$ is the constant false alarm rate of the detection at each node and $f_0$ is the nominal density. Next, we observe that Algorithm TCS maps every data instance $x$ to a binary observation $b$ such that the nominal distribution $f_0$ is transformed into a multivariate Bernoulli distribution $f_B$, i.e.,

$$x \sim f_0 \;\Longrightarrow\; b = (b_1, b_2, \ldots, b_{2^{D+1} - 1}) \sim f_B,$$

where $b_n \in \{0, 1\}$, $D$ is the depth and $b_1$ is the anomaly decision at the root node such that $b_1 = 1$, if an anomaly is detected; and $b_1 = 0$, otherwise. Similarly for the others: for instance, $b_2$ is the decision at the left hand child of the root and $b_3$ is the decision at the right hand child. Note that the proposed algorithm does not completely construct the binary tree but expands it, i.e., the nodes and the edges are created as needed. Therefore, we do not completely observe the binary vector $b$ that an instance maps to; however, we temporarily suppose that all the labels are available for ease of exposition. Once $x$ is mapped to $b$, since Algorithm TCS declares a corruption based only on the vector of binary labels $b$, we equivalently have

$$P_{fa} = \sum_{b : T(b) = 1} f_B(b) = 1 - \sum_{b : T(b) = 0} f_B(b), \qquad (14)$$

where $T(b)$ is the corruption decision (with abuse of notation), $\{b : T(b) = 0\}$ is the complement of $\{b : T(b) = 1\}$, and $f_B$ is the corresponding nominal probability mass function.

In order to calculate the probability mass function $f_B$, we model the binary tree, where each node corresponds to a binary random variable, as a directed acyclic graph [51] such that the binary random variables at any two sibling nodes are independent conditioned on the knowledge of the label at the parent node. Namely, for any non-leaf node $v$ and its children $v_l$ and $v_r$ on the binary partitioning tree, we assume the following conditional independency for the associated random labels: $P(b_{v_l}, b_{v_r} \mid b_v) = P(b_{v_l} \mid b_v)\, P(b_{v_r} \mid b_v)$, from which we obtain

$$f_B(b) = P(b_1) \prod_{\text{non-leaf } v} P(b_{v_l} \mid b_v)\, P(b_{v_r} \mid b_v), \qquad (15)$$

cf. Fig. 3.

Fig. 3: We assume the conditional independency $P(b_{v_l}, b_{v_r} \mid b_v) = P(b_{v_l} \mid b_v)\, P(b_{v_r} \mid b_v)$. Moreover, $P(b_{v_l} \mid b_v) = (1 - \rho) P(b_{v_l}) + \rho\, \mathbb{1}\{b_{v_l} = b_v\}$, where $\rho$ defines the dependency between the parent node and its children such that a positive covariance is embedded. Note that $\rho = 0$ implies independency.

Here, we emphasize that $x$ (or its binary mapping $b$) is assumed to be uncorrupted in the false alarm analysis to calculate the probability given in (13), i.e., it does not have any localized corruptions by definition. Then, if $x$ is declared anomalous, at the root node without loss of generality, this anomaly is not due to a corruption but is simply a “rarity”, as the test in (2) is based on density levels. On the contrary to the case of corruption, since a “rarity” at a node is not a localized phenomenon, we expect the children to inherit the parent label independently. Therefore, we assumed the conditional independency in (15) as a generating dependency structure for the simplest graph presented in Fig. 3, which straightforwardly generalizes to the binary tree of the anomalous vs normal labels from the root to the leaves. Based on this, we obtain

$$f_B(b) = P(b_{v_0})\, \underbrace{P\big(b_{\mathcal{T}_{v_l}}, b_{v_l} \mid b_{v_0}\big)}_{\star}\, \underbrace{P\big(b_{\mathcal{T}_{v_r}}, b_{v_r} \mid b_{v_0}\big)}_{\star}, \qquad (16)$$

where $v_0$ is the root node, $b_{\mathcal{T}_v}$ is the collection of the binary variables associated with the nodes in the tree rooted from node $v$ that excludes $b_v$, and the last equation follows from (15) and the Bayes' rule. We observe that the starred factors in the expression (16) are of similar forms such that the last equation can be expanded further along similar lines of derivation until the leaves appear.

Thus, the calculation of $f_B(b)$ requires the calculation of probabilities of the form $P(b_c \mid b_v)$ or $P(b_c)$, e.g., in (16). Let us denote any child of the node $v$ by $c$ for generality. Note that if $b_c$ and $b_v$ were independent, then we would have $P(b_c = 1 \mid b_v) = P(b_c = 1) = \tau$ when $x \sim f_0$. However, we anticipate a statistical dependency between $b_c$ and $b_v$ generating a positive covariance. That is, conditioned on the knowledge of $b_v$, we would like to impose that $b_c$ is more likely to attain the value of $b_v$ compared to the prior conditions, i.e., a child is likely to inherit the label of its parent. On the other hand, provided that $b_c$ and $b_v$ are identically dependent, we would have $P(b_c \mid b_v) = \mathbb{1}\{b_c = b_v\}$, where $\mathbb{1}$ is the indicator function. To introduce this into the derivations, we parameterize the probability mass function as the weighted average between $P(b_c)$ and $\mathbb{1}\{b_c = b_v\}$ as

$$P(b_c \mid b_v) = (1 - \rho)\, P(b_c) + \rho\, \mathbb{1}\{b_c = b_v\}, \qquad (17)$$

where $\rho$ is a parameter defining the degree of dependency, which generates an increasing covariance as $\rho$ increases in the interval $[0, 1]$, such that $\rho = 0$ implies the statistical independency of $b_c$ and $b_v$; and $\rho = 1$ implies identical dependency. Then, the probability mass function $f_B$ can be calculated using this parametrization based on the recursion in (16). Hence, by exhaustively enumerating all possible $b$'s and running Algorithm TCS for each of them, one can calculate the false alarm rate in (14), which is not a practical choice. Instead, through the conditional factorization in (15), we opt to simplify the expression (14) and obtain an efficient recursion. To this end, for a given node $v$ with depth $\delta$, let us define the probability, conditioned on $b_v$, that Algorithm TCS does not declare a corruption in the tree rooted from $v$, denoted by $q_\delta(b_v)$.

Here, $q_\delta(b_v)$ depends solely on the depth variable $\delta$ (and the conditioned label) due to the symmetric factorization by the conditional independency from parents to children. Therefore, the notation simplifies to $q_\delta(0)$ or $q_\delta(1)$. Using the possible configurations for the children labels, we can calculate $q_\delta(b_v)$ as a function of $q_{\delta+1}(\cdot)$. Noting that two of those configurations are the conclusive patterns, i.e., the termination and corruption patterns, we obtain

$$q_\delta(1) = p_{10}^2 + 2\, p_{10}\, p_{11}\, q_{\delta+1}(0)\, q_{\delta+1}(1),$$

where $p_{ij} = P(b_c = j \mid b_v = i)$ as a short hand notation; the second term corresponds to the continuation of Algorithm TCS and the first term corresponds to the terminating pattern. Unlike the second term, the first term does not have a multiplier of the form $q_{\delta+1}(\cdot)$ since the search stops at such a node. Note that the corruption pattern is disregarded by definition. Similarly, we also have

$$q_\delta(0) = p_{00}^2\, q_{\delta+1}(0)^2 + 2\, p_{00}\, p_{01}\, q_{\delta+1}(0)\, q_{\delta+1}(1) + p_{01}^2\, q_{\delta+1}(1)^2.$$

Recalling that we declare corruptions at leaf nodes on the basis of local anomalies, we provide the initialization to the recursion as $q_D(0) = 1$ and $q_D(1) = 0$. On the other hand, we never declare corruptions at the root since we focus only on localized corruptions, which is an exception and can straightforwardly be incorporated in our recursions. In terms of the recursion at the root, the only change is that the corruption pattern should not be disregarded, as it does not lead to a corruption detection there and so does not stop the search. Then, we simply have

$$q_0(1) = p_{10}^2 + 2\, p_{10}\, p_{11}\, q_1(0)\, q_1(1) + p_{11}^2\, q_1(1)^2,$$

and the recursion stays valid for $0 < \delta < D$. Now that we have the recursion equations defined for all depth levels on the binary tree, we can efficiently calculate the false alarm rate of Algorithm TCS as follows. Letting $v_0$ represent the root node, we obtain from (14)

$$P_{fa} = 1 - \sum_{b : T(b) = 0} f_B(b) = 1 - P(b_{v_0} = 0)\, q_0(0) - P(b_{v_0} = 1)\, q_0(1).$$

Then, recalling that $P(b_{v_0} = 1) = \tau$, the false alarm rate is given by

$$P_{fa} = 1 - (1 - \tau)\, q_0(0) - \tau\, q_0(1), \qquad (18)$$

which is equivalent to first calculating the probability that Algorithm TCS never declares a corruption and then subtracting this probability from $1$.
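Under the reconstruction above, the recursion and (18) can be evaluated numerically as in the following sketch; the child-label probabilities follow the parametrization (17) with P(b_c = 1) = tau, and the exact recursion is our reading of the text rather than the authors' verbatim equations.

def corruption_false_alarm(tau, rho, D):
    # P(b_c = j | b_v = i) from (17), with the prior P(b_c = 1) = tau.
    p = {(i, j): (1 - rho) * (tau if j == 1 else 1 - tau)
                 + rho * (1.0 if i == j else 0.0)
         for i in (0, 1) for j in (0, 1)}
    q = {0: 1.0, 1: 0.0}    # leaves: an anomalous leaf yields a corruption
    for _ in range(D - 1):  # depths D-1 down to 1
        q = {0: sum(p[0, a] * p[0, b] * q[a] * q[b]
                    for a in (0, 1) for b in (0, 1)),
             # termination pattern stops the search; the corruption pattern
             # contributes zero probability of "no corruption"
             1: p[1, 0] ** 2 + 2 * p[1, 0] * p[1, 1] * q[0] * q[1]}
    # Root: the corruption pattern does not stop the search there
    q0_1 = (p[1, 0] ** 2 + 2 * p[1, 0] * p[1, 1] * q[0] * q[1]
            + p[1, 1] ** 2 * q[1] ** 2)
    q0_0 = sum(p[0, a] * p[0, b] * q[a] * q[b] for a in (0, 1) for b in (0, 1))
    return 1 - (1 - tau) * q0_0 - tau * q0_1

As a sanity check, corruption_false_alarm(tau, rho=1.0, D) returns tau for any depth, matching the identical dependency case, cf. Fig. 4.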

We point out that the false alarm rate of Algorithm TCS (Tree-based Corruption Separation) in detecting the corruptions, as found in (18), is a data-independent quantity. Therefore, under the simplification through the conditional independency (15), we conclude that the false alarm rate $\tau$ of the anomaly detection at each node maps to a constant false alarm probability $P_{fa}$ of our corruption detection. Secondly, even though the dependency parameter $\rho$ does not appear explicitly, i.e., is hidden, in the expression (18), $P_{fa}$ is clearly affected by $\rho$ through the probabilities $p_{ij}$. For example, if $\rho = 1$, i.e., if the binary label of a child node is identically dependent on the parent label, then it can be shown that $P_{fa} = \tau$. If $\rho = 0$, i.e., if the binary label of a child node is independent of the parent label, then $p_{01} = p_{11} = \tau$ and (18) is evaluated accordingly. In Fig. 4, we plot the hypothetical curves resulting from mapping the constant false alarm rate $\tau$ in detecting the local anomalies to the corruption false alarm rate $P_{fa}$ via the described model of conditional independency for several degrees of dependency $\rho$. We experimentally discuss the efficacy of this model in representing the relation between $\tau$ and $P_{fa}$ in Section V. Moreover, the parameter $\rho$ can also be chosen depth dependent, i.e., $\rho = \rho(\delta)$, instead of a uniform choice over the partitioning tree. An example of depth dependent modeling is given in Section V. Finally, note that the directed acyclic graph modeling of the anomalous vs normal labeling uniformly holds for all choices of $\gamma$ in the ranked Euclidean distance. We also discuss the impact of various $\gamma$'s on the fitness of the described dependency structure in Section V.

In the following section, we explain the important points of our implementation and discuss the corresponding computational complexity.

IV Computational Complexity

Computationally, the main building block in Algorithm TCS (Tree-based Corruption Separation) is the application of the anomaly test defined in (2), which requires the train-to-train distances as well as the test-to-train distances. Operating on these distances, the score function defined in (1) for the test instance must be computed, which in turn requires the computation and sorting of the $k$'th nearest neighbor distances. In addition, since we label each node as anomalous or not in our tree expansion, these distances must actually be computed at each node with respect to the corresponding attributes, e.g., $x_v$ and $\mathcal{S}_v$ at a node $v$. For this purpose, we adapt the “integral image” approach in the case of the standard Euclidean distance (see the sketch following this paragraph). Namely, for every pair of training instances, we define the cumulative volume of squared attribute deviations up to each attribute index (and similarly for the test instance against the training instances). Then, the squared distance restricted to the attributes of a node $v$, which corresponds to a set of attributes in consecutive positions, is simply the difference of two such volumes. The volumes and the sortings over the training set can be computed offline once the training set is provided, which defines the training phase complexity, where sorting is the dominant contributor. For a given test instance, we compute and sort the corresponding distances at each node in the expansion of our tree, which defines the test phase complexity of our algorithm, where sorting is again the dominant contributor. In the case of the ranked Euclidean distances, since it is no longer possible to utilize the integral image approach, the computational load is multiplied by constant factors. Next, we illustrate the efficacy of the proposed framework in separating, i.e., detecting and localizing, corruptions and imputing.
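The “integral image” idea amounts to one-dimensional prefix sums over squared attribute differences, as the following sketch illustrates; the names are ours and the snippet only demonstrates the constant-time interval query after an O(N^2 d) precomputation.

import numpy as np

def prefix_volumes(S):
    # V[n, m, i] = sum over the first i attributes of the squared
    # differences between training samples n and m.
    diff2 = (S[:, None, :] - S[None, :, :]) ** 2   # N x N x d
    zero = np.zeros(diff2.shape[:2] + (1,))
    return np.concatenate([zero, np.cumsum(diff2, axis=2)], axis=2)

def interval_sq_dist(V, n, m, i, j):
    # Squared Euclidean distance between samples n and m restricted to
    # the attribute positions i..j (inclusive, 0-indexed), in O(1).
    return V[n, m, j + 1] - V[n, m, i]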

V Experiments

In this section, we test the introduced Algorithm TCS (Tree-based Corruption Separation) over several well-known machine learning data sets subject to synthetically generated data corruptions to demonstrate the performance of the proposed approach. We first discuss the efficacy of the false alarm rate estimation method explained in Section III-D in terms of the corruption detection and evaluate the performance of the critical steps in Algorithm TCS, which are the corruption detection, localization and imputation. Then, we report the improvements achieved by the proposed framework in several classification tasks.

Fig. 4: Solid (dash-dot) curves correspond to the realizations (hypothetical results). The constant false alarm rate $\tau$ in detecting the local anomalies maps to a global constant false alarm rate $P_{fa}$ in detecting the corruptions with Algorithm TCS (Tree-based Corruption Separation). We observe that an appropriate setting of $\rho$ well approximates the relation between $\tau$ and $P_{fa}$. In the case of the identical dependency, i.e., $\rho = 1$, $P_{fa} = \tau$.

In the first set of experiments, we adapted a digit classification task with training and test sets based on the USPS data [12]. Each sample is a gray scale image of either a “0” or a “1” digit, where each pixel has a real intensity value in [0, 1]. We synthetically generate a corruption as described in Section II and apply it to each instance in the test set with a fixed probability. To be more precise, for a test instance chosen to be corrupted, we (uniformly) randomly specify a square region whose size is bounded between pre-specified fractions of the total area, and overwrite each pixel in this region with a value randomly drawn (using the uniform distribution) from the interval [0, 1]. Then, after the training and test instances are vectorized column wise, the proposed Algorithm TCS is provided with the clean training data and run over the test set. We emphasize that by this vectorization scheme, the corrupted square region corresponds to multiple corrupted intervals in the vectorized observation. Hence, this example also illustrates that Algorithm TCS can handle multiple corruptions. Ideally, the neighborhood size parameter $k$ for both imputation and corruption separation purposes should be optimized at every node of our binary tree, since the data dimensionality varies from node to node. However, we opt not to optimize it for presentational clarity and set $k$ near the midpoint of its feasible range, which is empirically found appropriate. Using the digit USPS data, we investigate the response of Algorithm TCS to the local anomaly detection false alarm rate $\tau$ and the ranked Euclidean distance parameter $\gamma$. As for the depth parameter, we use the deepest possible tree such that the leaves are associated with single pixels and hence, at least one pixel is then used in the distance calculation with $d_\gamma$.
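For concreteness, the synthetic occluder used in this experiment can be sketched as follows; the corruption probability and the area fractions are placeholders, since the exact values are not preserved in this text.

import numpy as np

rng = np.random.default_rng(0)

def corrupt(img, p=0.5, min_frac=0.05, max_frac=0.25):
    # With probability p, overwrite a random square region of `img`
    # (intensities in [0, 1]) with uniform noise, per the model of Section II.
    if rng.random() > p:
        return img
    h, w = img.shape
    area = rng.uniform(min_frac, max_frac) * h * w
    side = min(max(1, int(round(np.sqrt(area)))), h, w)
    r = rng.integers(0, h - side + 1)
    c = rng.integers(0, w - side + 1)
    out = img.copy()
    out[r:r + side, c:c + side] = rng.uniform(0.0, 1.0, size=(side, side))
    return out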

Fig. 5: ROC curves for detection and localization of corruptions. Solid (dash-dot) curves correspond to detection (localization) performances.

In Fig. 4, we compare the hypothetical false alarm rate derived in Section III-D with the corresponding experimental realizations with respect to the varying local anomaly detection false alarm rate $\tau$. The hypothetical map from $\tau$ to $P_{fa}$ is generated with several choices for the dependency parameter $\rho$, whereas the realizations correspond to several choices for the distance parameter $\gamma$. Our experiments indicate that when the statistical dependency in (17) from a parent node to one of its children, cf. Fig. 3, is chosen appropriately, the relationship between the local anomaly false alarm rate $\tau$ and the corruption detection false alarm rate $P_{fa}$ is accurately modeled. This experimentally shows that the labeling of local anomalies over a binary partitioning tree shown in Fig. 1 can be considered as a directed acyclic graph. We also observe that in the case of the Euclidean distance, i.e., $\gamma = 1$,