Deep Structured Cross-Modal Anomaly Detection

08/11/2019 ∙ by Yuening Li, et al. ∙ Arizona State University ∙ Texas A&M University

Anomaly detection is a fundamental problem in data mining with many real-world applications. The vast majority of existing anomaly detection methods focus predominantly on data collected from a single source. In real-world applications, instances often have multiple types of features, such as images (ID photos, fingerprints) and texts (bank transaction histories, user online social media posts), resulting in the so-called multi-modal data. In this paper, we focus on identifying anomalies whose patterns are disparate across different modalities, i.e., cross-modal anomalies. Some data instances within a multi-modal context are not anomalous when viewed separately in each individual modality, but contain inconsistent patterns when multiple sources are jointly considered. The existence of multi-modal data in many real-world scenarios brings both opportunities and challenges to the canonical task of anomaly detection. On the one hand, in multi-modal data, information from different modalities may complement each other in improving detection performance. On the other hand, complicated distributions across different modalities call for a principled framework to characterize their inherent and complex correlations, which are often difficult to capture with conventional linear models. To this end, we propose a novel deep structured anomaly detection framework to identify the cross-modal anomalies embedded in the data. Experiments on real-world datasets demonstrate the effectiveness of the proposed framework compared with the state-of-the-art.


I Introduction

Over the past few decades, various successful anomaly detection methods have been developed and widely applied in a number of high-impact domains, including fraud detection, cybersecurity, algorithmic trading, and medical treatment [4]. Identifying anomalies from a swarm of normal instances not only enables the detection and early alerting of malicious behaviors, but also helps enhance the performance of myriad downstream machine learning tasks.

Existing anomaly detection methods predominantly focus on data from a single source. However, in many real-world scenarios, the data we collect often comes from different sources or can be represented in different feature formats, forming the so-called multi-modal data. For instance, in synthetic ID detection, the target is to differentiate generated fake identities from the true identities of users [32]. In this task, each identity is associated with information from different modalities, such as ID photos, bank transaction histories, and user online social behaviors. Compared with data collected from a single source, different modalities presented together often reveal complementary insights into the underlying application domains.

In this paper, we focus on identifying anomalies whose patterns are disparate across different modalities, and we refer to the studied problem as cross-modal anomaly detection, which is different from the conventional anomaly detection problem. The major motivation for studying this problem is that a large portion of data instances within a multi-modal context are not anomalous when viewed separately in each individual modality, but present abnormal patterns or behaviors when multiple sources of information are jointly considered. For instance, in bank transaction records, each transaction is associated with different kinds of information, such as user profiles, account summaries, transaction records, and user signatures, which are naturally of different modalities. In this case, if the user profile is not consistent with other sources of information of the same user, e.g., the ID is inconsistent with the signature, it might indicate bank fraud associated with a serious crime. In other words, these anomalies present inconsistent behaviors across different modalities [8], and the accurate detection of these anomalies has significant implications in practice.

Fig. 1: An illustration of the proposed cross-modal anomaly detection framework - CMAD. In Figure 1(a), different colors represent different classes of instances (e.g., the blue denotes the class of number ‘three’ while the red denotes the class of number ‘seven’) and each shape denotes one particular data modality (e.g., the circle denotes the image modality and the triangle denotes the text tag modality). Figure 1(b) shows an example of processing inconsistent patterns (in orange dashed line) and consistent patterns (in green dashed line), where the goal is to find a consensus feature space in which consistent patterns across different modalities are pulled together while inconsistent patterns across different modalities are pushed away. Figure 1(c) shows the final data distribution in the consensus feature space, and we can find that the instances with consistent patterns across different modalities are indeed grouped together.

However, cross-modal anomaly detection remains a challenging task. The main reason is that, to distinguish cross-modal anomalies from normal instances, we need a principled framework to model the correlations between different modalities, as the distributions of different modalities can be complicated and may vary remarkably. Existing efforts either aim at maximizing the correlations between different modalities through linear mappings [27, 13], or try to extract features from different modalities with low-rank approximation techniques [15, 31]. In other words, the main focus of these methods is to embed various modalities into a consensus low-dimensional feature space with a shallow learning model. In fact, the correlations might not be well captured in such a space by these shallow models due to their limited representation ability. In a nutshell, shallow models cannot fully capture the nonlinear correlations among different modalities, which necessitates investigating the cross-modal anomaly detection problem with deep learning models.

In this paper, we aim to capture the nonlinear correlations among different modalities for cross-modal anomaly detection. In particular, we propose a deep structured cross-modal anomaly detection framework to identify inconsistent patterns or behaviors of instances across different modalities. The proposed method consists of three major modules. First, we train different deep neural networks to extract features from different modalities, and then project the features into a consensus latent feature space. Second, we pull samples with similar patterns across different modalities close to each other, while pushing samples with dissimilar patterns far away from each other. Finally, we distinguish the cross-modal anomalies by measuring the similarity across different modalities. The main contributions are as follows:

  • We systematically examine and define the cross-modal anomaly detection problem, and analyze the limitations of existing efforts.

  • We propose a novel cross-modal deep learning framework CMAD which could project data from different modalities into a comparable consensus feature space for anomaly detection when the original multi-modal data have different distributions.

  • We perform extensive experiments on various real-world multi-modal datasets to demonstrate the effectiveness of our proposed cross-modal anomaly detection framework.

II Related Work

In this section, we briefly review related work from two aspects: (1) conventional anomaly detection with data from a single modality; (2) cross-modal anomaly detection.

Anomaly detection refers to the problem of finding instances in a dataset that do not conform to the expected patterns or behaviors of the majority [4]. The anomaly detection problem has a wide range of applications in many high-impact domains, such as fraud detection [7], intrusion detection [3], and healthcare monitoring and alerting [22]. Conventional anomaly detection methods for single-modality data can be broadly categorized into classification based methods [5, 9] and clustering based methods [24]. The classification based methods learn a classifier from a set of labeled training data, and then classify the test data into normal or anomalous classes [5, 9]. The clustering based methods, on the other hand, often assume that anomalies belong to small and sparse clusters, while normal instances belong to large and dense clusters [24, 16]. While traditional classification based and clustering based methods are powerful in detecting anomalies among a swarm of normal instances, they lack the capability to model the cross-modal correlations among different modalities in multi-modal data.

Identifying cross-modal abnormal behaviors is a relatively new topic in the anomaly detection field. Only a few approaches have been developed to identify inconsistent patterns across different modalities. Some recent works detect cross-modal anomalies by analyzing the clustering results in different modalities. These methods are based on the assumption that the underlying clustering structure of normal instances is usually shared across multiple modalities, while anomalous instances fall into different clusters. In particular, in [8], a multi-view anomaly detection method is proposed. It first obtains the spectral embeddings of instances with an ensemble similarity matrix, and then calculates the anomaly score of each instance based on the cosine distance between different embeddings. Later on, Alvarez et al. proposed an affinity propagation based anomaly detection algorithm, which identifies anomalies by analyzing the neighborhoods of each instance in different views [18]. Another widely used family of cross-modal anomaly detection methods is based on low-rank learning; these methods represent data across different modalities in a low-rank feature space, and typical methods along this line include [31] and [15]. Both the aforementioned clustering based methods and low-rank learning based methods aim to identify inconsistent patterns across different modalities by learning meaningful feature representations. Like the linear CCA based methods, they are often limited in representation ability due to their linear projections.

III The Proposed Framework

In this section, we introduce the proposed cross-modal anomaly detection framework CMAD in detail. We begin with a formal definition of the cross-modal anomaly detection problem, and then propose a principled method to learn the consensus data representations across different modalities. After that, we discuss how to leverage deep structures to extract effective feature representations. Finally, we introduce how to perform cross-modal anomaly detection with the built deep learning model.

III-A Problem Definition

Given $V$ different modalities, $\mathbf{X}^{(1)}, \mathbf{X}^{(2)}, \ldots, \mathbf{X}^{(V)}$ are the corresponding feature matrices of these modalities, where $\mathbf{X}^{(v)} \in \mathbb{R}^{n \times d_v}$ and $d_v$ is the feature dimensionality of the $v$-th modality; $n$ is the number of instances in the dataset. The feature representation of the $i$-th instance is denoted as $\mathbf{x}_i$, where $i \in \{1, 2, \ldots, n\}$. $\mathbf{x}_i^{(v)}$ denotes the feature vector of the $i$-th instance from modality $v$. The class label of the $i$-th instance in modality $v$ is denoted as $y_i^{(v)}$, where $y_i^{(v)} \in \{1, 2, \ldots, C\}$. Our task is to find the anomalous instances whose patterns (w.r.t. class labels) are inconsistent across different modalities. For notational convenience and without loss of generality, we only include two modalities $\mathbf{X}^{(a)}$ and $\mathbf{X}^{(b)}$ when introducing the methodology, where $a, b \in \{1, 2, \ldots, V\}$ and $a \neq b$, but our method can be easily extended to any number of modalities.

Based on the aforementioned terminology, we define the problem of cross-modal anomaly detection as follows:

Definition (Cross-Modal Anomaly Detection): Given a dataset which contains multiple modalities as data sources, such as $\mathbf{X}^{(a)}$ and $\mathbf{X}^{(b)}$, the $i$-th instance is regarded as an anomaly when:

$\mathrm{sim}\big(\mathbf{x}_i^{(a)}, \mathbf{x}_i^{(b)}\big) < \epsilon$, (1)

where $\mathrm{sim}(\cdot, \cdot)$ is a function measuring the cross-modal similarity between a pair of observations; in this definition, the two observations belong to the same instance. The parameter $\epsilon$ is a pre-defined threshold value such that if the measured similarity is larger than or equal to $\epsilon$, we regard the instance as normal across the two modalities; otherwise, the instance is a cross-modal anomaly.
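To make the definition concrete, below is a minimal Python sketch of the decision rule in Eq. (1), assuming the two observations have already been mapped into a comparable space; the function names, and the use of cosine similarity as the instantiation of $\mathrm{sim}(\cdot, \cdot)$, anticipate Section III-B and are our illustration:

```python
import numpy as np

def cosine_sim(u, v):
    # Cosine similarity between two feature vectors that are assumed to
    # live in the same (consensus) feature space.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def is_cross_modal_anomaly(z_a, z_b, eps):
    # Decision rule of Eq. (1): flag the instance when the measured
    # cross-modal similarity falls below the threshold eps.
    return cosine_sim(z_a, z_b) < eps
```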

III-B Data Fusion across Different Modalities

To detect cross-modal anomalies, the fundamental prerequisite is a principled way to fuse the information from different modalities. A prevalent way to achieve this is to learn mapping functions that project instances from different modalities into a consensus feature space. This yields a unified feature space in which we can measure the cross-modal similarity of data instances, based on which cross-modal anomaly detection can easily be carried out.

To this end, we transform instances from the two modalities $\mathbf{X}^{(a)}$ and $\mathbf{X}^{(b)}$ into a consensus feature space with two linear transformation matrices, denoted as $\mathbf{W}^{(a)} \in \mathbb{R}^{d_a \times k}$ and $\mathbf{W}^{(b)} \in \mathbb{R}^{d_b \times k}$, where $k$ is the dimensionality of the resultant unified feature space and its value is much smaller than $d_a$ and $d_b$. After transformation, we obtain the following data representations for modalities $a$ and $b$:

$\mathbf{Z}^{(a)} = \mathbf{X}^{(a)} \mathbf{W}^{(a)}$, (2)
$\mathbf{Z}^{(b)} = \mathbf{X}^{(b)} \mathbf{W}^{(b)}$. (3)

Given a pair of instances $\mathbf{x}_i^{(a)}$ and $\mathbf{x}_j^{(b)}$, we employ the cosine similarity to measure their cross-modal similarity in the transformed feature space:

$\mathrm{sim}\big(\mathbf{z}_i^{(a)}, \mathbf{z}_j^{(b)}\big) = \dfrac{\mathbf{z}_i^{(a)\top} \mathbf{z}_j^{(b)}}{\|\mathbf{z}_i^{(a)}\| \, \|\mathbf{z}_j^{(b)}\|}$, (4)

where $\mathbf{z}_i^{(a)}$ and $\mathbf{z}_j^{(b)}$ denote the corresponding rows of $\mathbf{Z}^{(a)}$ and $\mathbf{Z}^{(b)}$, respectively.

To find the cross-modal anomalies, we need to ensure that for a pair of instances with consistent patterns across different modalities (i.e., $y_i^{(a)} = y_j^{(b)}$), their cross-modal similarity in the transformed feature space is high. To this end, the loss function can be expressed as follows:

$\mathcal{L}_1 = -\sum_{(i,j) \in \mathcal{P}} \mathrm{sim}\big(\mathbf{z}_i^{(a)}, \mathbf{z}_j^{(b)}\big)$, (5)

where the set $\mathcal{P}$ contains the instance pairs whose cross-modal observations are consistent, i.e., $y_i^{(a)} = y_j^{(b)}$.

Also, we need to penalize the instance pairs with inconsistent patterns across different modalities. In particular, in the training process, we not only make use of the instance pairs with consistent patterns across different modalities, but also incorporate an additional set of negative samples obtained by negative sampling [19]. Let $\mathcal{N}$ be the set of negative samples whose cross-modal patterns are inconsistent (i.e., $y_i^{(a)} \neq y_j^{(b)}$); then, to learn the mapping matrices, the loss function for encouraging dissimilarity of the negative samples can be presented in the following form:

$\mathcal{L}_2 = \sum_{(i,j) \in \mathcal{N}} \max\big(0, \mathrm{sim}(\mathbf{z}_i^{(a)}, \mathbf{z}_j^{(b)}) - m\big)$, (6)

where the hyperparameter $m$, as a margin value, controls the penalty degree of the inconsistent patterns. Specifically, if $\mathrm{sim}(\mathbf{z}_i^{(a)}, \mathbf{z}_j^{(b)}) > m$, it implies that an instance pair with inconsistent patterns across different modalities is regarded as similar, and it is necessary to penalize it in the loss function.

By combining Eq. (5) and Eq. (6), we arrive at the following objective function for learning the transformation matrices:

$\min_{\mathbf{W}^{(a)}, \mathbf{W}^{(b)}} \; \mathcal{L}_1 + \lambda \mathcal{L}_2 + \gamma \big( \|\mathbf{W}^{(a)}\|_F^2 + \|\mathbf{W}^{(b)}\|_F^2 \big)$, (7)

The above loss function consists of three terms. The first term “pulls” the projections of different modalities together if their cross-modal patterns (w.r.t. instance class labels) are consistent, while the second term “pushes” them further apart otherwise (as shown in Figure 1(b)). The resultant unified feature space is shown in Figure 1(c). The parameter $\lambda$ controls the influence of negative sampling. The third term is a regularization term, which controls the bias-variance trade-off to avoid over-fitting, and $\gamma$ is the regularization hyperparameter. The objective function can be effectively solved with coordinate descent methods, in which the variables $\mathbf{W}^{(a)}$ and $\mathbf{W}^{(b)}$ are updated in an alternating fashion until the objective converges to a local optimum.
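For concreteness, here is a minimal NumPy sketch of the linear objective in Eq. (7), written under the notation above; the function name, the pos/neg pair lists, and the default hyperparameter values are our illustrative assumptions rather than released code:

```python
import numpy as np

def cmad_linear_loss(Xa, Xb, Wa, Wb, pos, neg, m=0.3, lam=1.0, gamma=1e-3):
    """Sketch of the combined objective in Eq. (7).
    Xa: (n, d_a) and Xb: (n, d_b) feature matrices of the two modalities;
    Wa: (d_a, k) and Wb: (d_b, k) linear projection matrices;
    pos / neg: lists of (i, j) index pairs with consistent / inconsistent labels."""
    Za, Zb = Xa @ Wa, Xb @ Wb                              # Eqs. (2)-(3)

    def sim(i, j):                                         # cosine similarity, Eq. (4)
        u, v = Za[i], Zb[j]
        return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

    pull = -sum(sim(i, j) for i, j in pos)                 # Eq. (5): pull consistent pairs
    push = sum(max(0.0, sim(i, j) - m) for i, j in neg)    # Eq. (6): push past margin m
    reg = gamma * (np.sum(Wa**2) + np.sum(Wb**2))          # Frobenius regularization
    return pull + lam * push + reg
```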

Deep neural networks have been successfully applied as powerful feature extraction models for different types of data with a single modality, including text, image, and audio data [21]. The main reason for this success can be attributed to the fact that deep neural networks are capable of learning nonlinear mappings by extracting high-level abstractions from the raw input features. Over the past few decades, many successful deep neural networks, such as deep Boltzmann machines [25], auto-encoders [21], and recurrent neural networks [10, 11], have been widely applied and achieved state-of-the-art performance in many learning tasks.

Motivated by the success of deep neural networks, we develop a deep structured framework to characterize the features of different modalities for cross-modal anomaly detection. In particular, we employ a series of nonlinear mapping functions in the deep neural networks to map the information of each modality into a consensus feature space, in which instance pairs with consistent patterns across different modalities are pulled together, while pairs with inconsistent cross-modal patterns are pushed away. In this way, the trained model can help us identify the cross-modal anomalies in the test data. Compared with the previously introduced linear projections, the developed method employs deep neural networks to learn nonlinear mappings, and thus has a stronger ability to learn effective feature representations from the original multi-modal data [17]. In addition, the nonlinear mapping functions help fully capture the nonlinear correlations among different modalities, which are otherwise difficult to characterize with conventional linear projection functions. It should be noted that the previously mentioned linear mapping is mathematically equivalent to imposing a single fully connected layer (without nonlinear activation functions) in the deep neural networks. Specifically, if we replace the linear mapping process (through the mapping matrices $\mathbf{W}^{(a)}$ and $\mathbf{W}^{(b)}$) with neural networks, we have the loss function:

$\min_{\Theta} \; -\sum_{(i,j) \in \mathcal{P}} \mathrm{sim}\big(f(\mathbf{x}_i^{(a)}), g(\mathbf{x}_j^{(b)})\big) + \lambda \sum_{(i,j) \in \mathcal{N}} \max\big(0, \mathrm{sim}(f(\mathbf{x}_i^{(a)}), g(\mathbf{x}_j^{(b)})) - m\big)$, (8)

where $f(\cdot)$ and $g(\cdot)$ denote neural-network-like differentiable functions, and $\Theta$ denotes the set of all parameters of the deep neural networks. We determine the architecture of the neural networks based on the data modalities to be handled. Specifically, we extract features from images through a convolutional neural network (CNN), and train a fully connected neural network to process text tags. The parameter $\lambda$ controls the importance of the second term. In the experiments, we empirically set $\lambda$ to 1.
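As a concrete, and necessarily approximate, illustration of the two branches $f(\cdot)$ and $g(\cdot)$, here is a PyTorch sketch; the convolutional layers are our own assumption, since the paper only states that a CNN is used, while the tag branch loosely follows the 100-100-50 layout of Table I:

```python
import torch
import torch.nn as nn

class ImageBranch(nn.Module):
    """CNN branch f(.) for the image modality (architecture assumed)."""
    def __init__(self, out_dim=50):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2))
        self.fc = nn.Sequential(
            nn.Flatten(), nn.Linear(32 * 7 * 7, 150), nn.ReLU(),
            nn.Dropout(0.5),                     # dropout against over-fitting [26]
            nn.Linear(150, out_dim))

    def forward(self, x):                        # x: (batch, 1, 28, 28) MNIST images
        return self.fc(self.conv(x))

class TagBranch(nn.Module):
    """Fully connected branch g(.) for the tag modality, loosely following
    the 100-100-50 layout in Table I."""
    def __init__(self, in_dim=100, out_dim=50):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 100), nn.ReLU(),
            nn.Linear(100, out_dim))

    def forward(self, x):                        # x: (batch, 100) tag embeddings
        return self.net(x)
```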

Our goal is to minimize Eq. (8) by updating all the model parameters in the deep neural networks. With an initialization of the model parameters, the proposed deep model CMAD is optimized by Adam, an adaptive momentum based gradient descent method. The details are given in Algorithm 1. We make use of dropout [26] in the training process to avoid over-fitting. After the optimization process, the instances with consistent patterns across different modalities are pulled together within a small distance, while the instances with inconsistent patterns are pushed away from each other; the resulting distribution of data instances in the projected latent space is illustrated in Figure 1(c). We can then use CMAD to identify cross-modal anomalies by measuring the similarity across different modalities, as shown in Algorithm 2. We show how the performance of CMAD varies with different settings of the margin $m$ in Eq. (8) and the threshold $\epsilon$ in Eq. (1) in Section IV-D.

1: Input: a dataset which contains multiple modalities as data sources, such as $\mathbf{X}^{(a)}$ and $\mathbf{X}^{(b)}$; initialize the model parameters $\Theta$
2: for each epoch do
3:     for each mini-batch do
4:         for the positive training samples $(i, j)$ in $\mathcal{P}$ do
5:             accumulate the loss $-\mathrm{sim}\big(f(\mathbf{x}_i^{(a)}), g(\mathbf{x}_j^{(b)})\big)$;
6:         end for
7:         for the negative training samples $(i, j)$ in $\mathcal{N}$ from negative sampling do
8:             accumulate the loss $\lambda \max\big(0, \mathrm{sim}(f(\mathbf{x}_i^{(a)}), g(\mathbf{x}_j^{(b)})) - m\big)$;
9:         end for
10:     end for
11:     Update the model parameters $\Theta$ with Adam;
12: end for
13: Output: the learned model parameters $\Theta$ of CMAD.
Algorithm 1 The training process of CMAD.
1: Input: instances from two modalities, $\mathbf{X}^{(a)}$ and $\mathbf{X}^{(b)}$
2: Output: the cross-modal anomalies
3: load the learned model parameters $\Theta$ from Algorithm 1;
4: for $i \in \{1, 2, \ldots, n\}$ do
5:     if $\mathrm{sim}\big(f(\mathbf{x}_i^{(a)}), g(\mathbf{x}_i^{(b)})\big) < \epsilon$ then
6:         the $i$-th instance is a cross-modal anomaly;
7:     end if
8: end for
Algorithm 2 Anomaly detection with CMAD.
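The two algorithms translate almost line-for-line into PyTorch. Below is a hedged sketch: train_epoch mirrors Algorithm 1 and detect mirrors Algorithm 2, where f and g are the two branch networks (e.g., as sketched above). The loader interface and the default values are our assumptions; the defaults echo the $m = 0.3$, $\epsilon = 0.3$ setting reported in Section IV-D:

```python
import torch
import torch.nn.functional as F

def cmad_loss(za, zb, za_neg, zb_neg, m=0.3, lam=1.0):
    # Eq. (8): pull consistent pairs together (maximize cosine similarity),
    # and hinge-push inconsistent pairs until their similarity drops below m.
    pull = -F.cosine_similarity(za, zb).sum()
    push = torch.clamp(F.cosine_similarity(za_neg, zb_neg) - m, min=0).sum()
    return pull + lam * push

def train_epoch(f, g, optimizer, pos_loader, neg_loader, m=0.3, lam=1.0):
    """One epoch of Algorithm 1 (a sketch). pos_loader / neg_loader are
    assumed to yield (image_batch, tag_batch) pairs whose class labels
    agree / disagree across the two modalities."""
    for (img_p, tag_p), (img_n, tag_n) in zip(pos_loader, neg_loader):
        loss = cmad_loss(f(img_p), g(tag_p), f(img_n), g(tag_n), m, lam)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()                      # Adam update, Algorithm 1 line 11

def detect(f, g, imgs, tags, eps=0.3):
    """Algorithm 2: return indices of cross-modal anomalies, i.e., instances
    whose cross-modal cosine similarity falls below the threshold eps."""
    with torch.no_grad():
        scores = F.cosine_similarity(f(imgs), g(tags))
    return (scores < eps).nonzero(as_tuple=True)[0]
```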

IV Experiments

In this section, we conduct experiments to evaluate the effectiveness of the proposed CMAD framework in detecting cross-modal anomalies. We also perform case studies that visualize the distributions in the latent space to illustrate the effectiveness of CMAD.

IV-A Datasets

We use two real-world cross-modal datasets to assess the effectiveness of the proposed framework in cross-modal anomaly detection. The details of the datasets are as follows:

  • MNIST [14]: The MNIST dataset contains 60,000 original images representing ten different digits, each of 28×28 pixels. After adding a text tag to each image and embedding the tags into 100-dimensional vectors through different word embedding methods (Word2Vec [20], GloVe [23]), we synthetically generate training data with two different modalities. The image modality is fed into a CNN model, and the text modality is fed into a fully connected neural network for training.

  • RGB-D [12]: The RGB-D dataset provides two modalities, RGB images and depth images, representing the same real-world objects. In our experiments, we include 960 RGB images and 960 corresponding depth images from five kinds of objects in the RGB-D dataset: apples, staplers, coffee mugs, soda cans, and toothpaste. Both modalities are fed into CNN models for feature learning.

As there is no ground truth of anomalies in these datasets, we need to inject anomalies for evaluation. To generate the cross-modal anomalies, we adopt a widely used injection method, negative sampling, to create a number of instance pairs whose patterns are inconsistent across different modalities. Originally, we only have instances whose patterns are consistent across different modalities (with the same class labels). To introduce inconsistent instance pairs, we randomly sample instance pairs whose cross-modal patterns differ, i.e., $y_i^{(a)} \neq y_j^{(b)}$. More specifically, in the testing stage, we inject 1,019 inconsistent pairs into MNIST and 545 inconsistent pairs into RGB-D. Meanwhile, we include the same number of instances with consistent behavior, which have no overlap with the training stage.
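A minimal sketch of this injection procedure is given below; the function name and signature are our own illustration of the negative-sampling idea, not code from the paper:

```python
import numpy as np

def inject_cross_modal_anomalies(labels_a, labels_b, n_inject, rng=None):
    """Sample index pairs (i, j) whose class labels disagree across the two
    modalities, so that pairing the modality-a observation of instance i
    with the modality-b observation of instance j yields an inconsistent
    (anomalous) test pair."""
    if rng is None:
        rng = np.random.default_rng(0)
    pairs = []
    while len(pairs) < n_inject:
        i = int(rng.integers(len(labels_a)))
        j = int(rng.integers(len(labels_b)))
        if labels_a[i] != labels_b[j]:           # cross-modal patterns must differ
            pairs.append((i, j))
    return pairs
```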

Dataset | Dimensions of the representations in the hidden layers
MNIST Images | 784-1440-1280-320-150-50
MNIST Tags | 100-100-50
RGB-D RGB Images | 58575-1440-1280-320-150-50
RGB-D Depth Images | 58575-1440-1280-320-150-50

TABLE I: Hyperparameters of the deep neural models.

IV-B Baseline Algorithms

We introduce several widely used baseline methods to compare the performance of cross-modal anomaly detection on multi-modal data:

  • CCA & KCCA [27]. CCA models the correlation across multiple modalities and implicitly maps instances into a lower-dimensional space, with the goal of finding maximum correlations in the latent space via linear projections. Compared with CCA, KCCA [13, 2] uses an alternative projection strategy, nonlinearly mapping the multi-modal data into a consensus feature space with kernel tricks. We maximize the correlations of instances with consistent patterns in the training phase, and identify cross-modal anomalies by their cross-modal correlation in the test phase. Both CCA and KCCA are supervised.

  • PLS [1]. PLS is also a supervised feature representation learning method based on linear transformation. The major difference between PLS and CCA is that PLS maximizes the covariance between different modalities through linear transformations. We identify the cross-modal anomalies in the same way as CCA and KCCA.

  • HOAD [8]. HOAD is a supervised anomaly detection algorithm to find anomalies from the multi-modal data. It first obtains the spectral embeddings from the multi-modal data with an ensemble similarity matrix, and then calculates the anomaly score of each instance based on the cosine distance between different embeddings.

  • Embedding Network [28, 29]. A supervised representation learning framework based on deep neural networks that extracts features from different modalities, such as image and text. We distinguish the cross-modal anomalies by measuring the Euclidean distances across different modalities after projecting instances into the latent space.

In our experiment, we evaluate the performance of cross-modal anomaly detection using the metrics of precision, recall, and accuracy.

We use TP, FP, TN, FN as the number of true positives, false positives, true negatives and false negatives, respectively, in the predicted results. Their definitions are listed as follows:

Precision = TP / (TP + FP), (9)
Recall = TP / (TP + FN), (10)
Accuracy = (TP + TN) / (TP + FP + TN + FN). (11)
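Restated as a small Python helper for clarity (the function name is ours):

```python
def precision_recall_accuracy(tp, fp, tn, fn):
    # Direct restatement of Eqs. (9)-(11) from the confusion-matrix counts.
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return precision, recall, accuracy
```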

IV-C Anomaly Detection Performance

The parameter settings in our deep model CMAD vary across modalities and datasets. Specifically, we use a convolutional neural network (CNN) for the image modality [30, 6], and a fully-connected neural network for the text tag modality. The feature dimensionality of each layer is listed in Table I. Table II and Table III present the performance of cross-modal anomaly detection using different algorithms, including the accuracy, precision, and recall of the anomaly detection results. From these tables, we draw several observations, as discussed below.


Method | Accuracy | Precision | Recall
CCA | 0.6654 | 0.8911 | 0.6191
Kernel CCA (RBF) | 0.6923 | 0.7062 | 0.6948
PLS | 0.7235 | 0.6636 | 0.6233
HOAD | 0.5118 | 0.9220 | 0.5115
Embedding Network | 0.9504 | 0.9873 | 0.9127
CMAD | 0.9921 | 0.9971 | 0.9875

TABLE II: Performance on MNIST.

Method | Accuracy | Precision | Recall
CCA | 0.7682 | 0.7171 | 0.7988
Kernel CCA (RBF) | 0.8456 | 0.7134 | 0.9699
PLS | 0.6580 | 0.8304 | 0.7539
HOAD | 0.5137 | 0.9345 | 0.5124
Embedding Network | 0.5138 | 0.5075 | 0.9346
CMAD | 0.9512 | 0.9313 | 0.9742

TABLE III: Performance on RGB-D.

First, CMAD outperforms the baseline methods CCA, KCCA, PLS and HOAD in most cases, which validates the importance of applying deep models for cross-modal anomaly detection by extracting more effective feature representations from the original multi-modal data.

Second, we compare CMAD against the Embedding Network method. Similar to our proposed CMAD, the Embedding Network is also a deep structured approach. Based on the experimental results, both of the deep structured methods outperform the other, linear transformation based methods. These observations confirm the limitation of the linear projection based methods: linear transformations cannot fully capture the nonlinear correlations among different modalities for cross-modal anomaly detection.

There are two major differences between the proposed CMAD and the Embedding Network method. First, the Embedding Network is developed to handle text and image data only, while CMAD is able to handle different types of data with appropriate deep neural network designs. Second, the Embedding Network only narrows the Euclidean distance between pairs of instances with consistent information across different modalities, whereas in CMAD we not only pull samples with similar patterns across different modalities close to each other, but also push samples with dissimilar patterns far away from each other. Thus, CMAD can learn better representations from the multi-modal data.

Fig. 2: Precision (left) and recall (right) on the MNIST dataset with different hyperparameter settings. The figures show how precision and recall scores vary as we change the values of the margin $m$ and the threshold $\epsilon$.

Fig. 3: Left: the distribution of images in the projected space which only contains images of digits “3” and “7” (green denotes “3”, and brown denotes “7”). Right: the distribution of tags in the projected feature space which contains digits from zero to nine, each color denotes one digit.

IV-D Hyperparameter Settings

There are two hyperparameters to be tuned in our method, i.e., the margin $m$ in Eq. (8) and the threshold $\epsilon$ in Eq. (1). In Figure 2, we show how the performance of CMAD varies with different hyperparameter settings. In particular, we tune the values of these hyperparameters within the ranges {-0.4, -0.3, ..., 0.5} and {-0.2, -0.1, ..., 0.7}, respectively. In general, the algorithm achieves better precision when the threshold is set to be small and the margin is set to be large. The accuracy achieves its best value when $m$ is 0.3 and $\epsilon$ equals 0.3. Meanwhile, we find that the proposed CMAD achieves good performance (over 92%) across a wide range of hyperparameter settings; this loose requirement gives us more flexibility to find anomalies in real-world applications.
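The tuning loop itself is straightforward; a sketch follows, where evaluate is a hypothetical helper (not from the paper) assumed to train CMAD with margin m, flag anomalies at threshold eps, and return the resulting accuracy on the held-out pairs:

```python
import numpy as np

# Grid search over the hyperparameter ranges reported above.
# `evaluate` is a hypothetical helper, not part of the paper's code.
margins = np.round(np.arange(-0.4, 0.5 + 1e-9, 0.1), 1)
thresholds = np.round(np.arange(-0.2, 0.7 + 1e-9, 0.1), 1)
results = {(m, eps): evaluate(m=m, eps=eps)
           for m in margins for eps in thresholds}
best_m, best_eps = max(results, key=results.get)  # reported optimum: (0.3, 0.3)
```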

Fig. 4: Reconstruction results on MNIST. By finding the closest instances from the other modality, we demonstrate that the distances between the same objects from different modalities are smaller than those between different objects in the latent space. The upper row shows the selected ground-truth images, and the lower row shows images reconstructed by finding the closest neighbors of the corresponding text tags in the latent space.

IV-E Case Studies

Finally, we conduct two case studies to visualize the data distribution in the projected consensus feature space. Both cases are based on the MNIST dataset. The case studies consist of two parts. The first part represents the data distribution of each individual modality in the latent space, to demonstrate the effectiveness of the “pull” and “push” actions in our developed method. The second part demonstrates the representation ability of CMAD across different modalities.

In the first case, to show that CMAD is capable of separating different instances in the projected consensus feature space, we visualize the data distribution of each individual modality after the training stage. To represent the data distribution, we first map the original instances into the projected feature space. We then further reduce the dimensionality of the projected feature space to a two-dimensional visible feature space with PCA. As shown in Figure 3, in each modality, the data instances belonging to different classes are separated from each other. In Figure 3 (left), we find that when the images with labels “three” and “seven” are taken as the test data, most of the images with label “three” (green in the figure) are separated from the images with label “seven” (brown in the figure) in the projected two-dimensional space. In Figure 3 (right), we use the text tag as another data modality. We find that the distributions of different tags can also be separated in the projected consensus feature space.

The second case demonstrates the representation ability of the developed method across different modalities. In other words, our target is to verify that, after feature mapping, the distance between cross-modal observations of the same class is smaller than the distance between observations of different classes. We extract one specific instance from one modality in the projected consensus feature space and find its nearest (based on cosine distance) instances from the other modality.

As shown in Eq. (12), given an observation $\mathbf{x}_i^{(b)}$, where modality $b$ denotes the text data, we extract a set $\mathcal{K}$ of semantically similar image observations from modality $a$, and aggregate these observations to generate a new image:

$\hat{\mathbf{x}}_i = \frac{1}{|\mathcal{K}|} \sum_{j \in \mathcal{K}} \mathbf{x}_j^{(a)}$. (12)

After normalizing the generated image from Eq. (12) to the range [0, 255] for visualization in grayscale, we obtain a new image which represents an estimate of the image distribution close to the given text $\mathbf{x}_i^{(b)}$.
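A minimal sketch of this reconstruction step, assuming the latent codes have already been computed by the two branch networks; the argument names and the choice of k are our illustration:

```python
import numpy as np

def reconstruct_image_from_text(z_text, z_imgs, raw_imgs, k=10):
    """Sketch of Eq. (12): average the k images whose latent codes are
    closest (by cosine similarity) to the latent code of the given text,
    then rescale to [0, 255] for grayscale display."""
    sims = z_imgs @ z_text / (np.linalg.norm(z_imgs, axis=1)
                              * np.linalg.norm(z_text) + 1e-12)
    nearest = np.argsort(-sims)[:k]              # the k most similar image codes
    avg = raw_imgs[nearest].mean(axis=0)         # aggregation step of Eq. (12)
    avg = (avg - avg.min()) / (avg.max() - avg.min() + 1e-12)
    return (avg * 255).astype(np.uint8)          # normalized to [0, 255]
```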

We extract instances from the text modality, find the instances from the image modality with the closest cosine distances in the latent space, and reconstruct images from the text features. After normalization, we render the averaged nearest images as new, reconstructed grayscale images. The reconstruction results are reported in Figure 4. The upper row contains the ground-truth images, and the lower row shows the reconstructed results. Comparing the reconstructed results with the original images, we can intuitively see that our algorithm is effective in pulling consistent instances across different modalities together, and pushing inconsistent ones apart.

V Conclusions and Future Work

In this paper, we propose a novel cross-modal anomaly detection approach, CMAD, based on deep neural networks. The proposed CMAD framework is able to identify inconsistent patterns or behaviors of instances across different modalities by capturing their nonlinear correlations in the learned consensus latent feature space. First, we train deep structured models to represent features from different modalities, and project the features into a consensus latent feature space. Second, we “pull” the projections of a pair of instances from different modalities together if their cross-modal patterns are consistent, while “pushing” them further apart otherwise. Finally, we distinguish the cross-modal anomalies by measuring the distances across different modalities. We demonstrate the effectiveness of the developed approach on different multi-modal datasets, and the experimental results show that CMAD achieves better detection performance than other prevalent anomaly detection methods on multi-modal data.

There are a number of directions that can be explored in future research, including: (1) extending the current cross-modal anomaly detection framework to other complex data, such as sequential data and networked data; (2) developing a principled way to handle dynamic data that arrives at a fast pace, since in many real-world applications the real-time detection of fraudulent or malicious behaviors is often desired.

Acknowledgement

The work is, in part, supported by National Science Foundation (#IIS-1657196, #IIS-1750074, #CNS-1816497).

References

  • [1] H. Abdi (2003) Partial least square regression (PLS regression). Encyclopedia for Research Methods for the Social Sciences 6 (4), pp. 792–795.
  • [2] S. Akaho (2006) A kernel method for canonical correlation analysis. arXiv preprint cs/0609071.
  • [3] A. L. Buczak and E. Guven (2016) A survey of data mining and machine learning methods for cyber security intrusion detection. IEEE Communications Surveys & Tutorials 18 (2), pp. 1153–1176.
  • [4] V. Chandola, A. Banerjee, and V. Kumar (2009) Anomaly detection: a survey. ACM Computing Surveys (CSUR) 41 (3), pp. 15.
  • [5] C. Chang and S. Chiang (2002) Anomaly detection and classification for hyperspectral imagery. IEEE Transactions on Geoscience and Remote Sensing 40 (6), pp. 1314–1325.
  • [6] S. Chang, W. Han, J. Tang, G. Qi, C. C. Aggarwal, and T. S. Huang (2015) Heterogeneous network embedding via deep architectures. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 119–128.
  • [7] J. R. Dorronsoro, F. Ginel, C. Sánchez, and C. Cruz (1997) Neural fraud detection in credit card operations. IEEE Transactions on Neural Networks 8 (4), pp. 827–834.
  • [8] J. Gao, W. Fan, D. Turaga, S. Parthasarathy, and J. Han (2011) A spectral framework for detecting inconsistency across multi-source object relationships. In Proceedings of the 11th IEEE International Conference on Data Mining (ICDM), pp. 1050–1055.
  • [9] F. Gonzalez, D. Dasgupta, and R. Kozma (2002) Combining negative selection and classification techniques for anomaly detection. In Proceedings of the 2002 Congress on Evolutionary Computation (CEC'02), Vol. 1, pp. 705–710.
  • [10] A. Graves, A. Mohamed, and G. Hinton (2013) Speech recognition with deep recurrent neural networks. In Proceedings of the 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6645–6649.
  • [11] X. Huang, Q. Song, Y. Li, and X. Hu (2019) Graph recurrent networks with attributed random walks. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
  • [12] K. Lai, L. Bo, X. Ren, and D. Fox (2011) A large-scale hierarchical multi-view RGB-D object dataset. In Proceedings of the 2011 IEEE International Conference on Robotics and Automation (ICRA), pp. 1817–1824.
  • [13] P. L. Lai and C. Fyfe (2000) Kernel and nonlinear canonical correlation analysis. International Journal of Neural Systems 10 (05), pp. 365–377.
  • [14] Y. LeCun and C. Cortes (2010) MNIST handwritten digit database. http://yann.lecun.com/exdb/mnist/
  • [15] S. Li, M. Shao, and Y. Fu (2015) Multi-view low-rank analysis for outlier detection. In Proceedings of the 2015 SIAM International Conference on Data Mining, pp. 748–756.
  • [16]
  • [17] N. Liu, Q. Tan, Y. Li, H. Yang, J. Zhou, and X. Hu (2019) Is a single vector enough? Exploring node polysemy for network embedding. arXiv preprint arXiv:1905.10668.
  • [18] A. Marcos Alvarez, M. Yamada, A. Kimura, and T. Iwata (2013) Clustering-based anomaly detection in multi-view data. In Proceedings of the 22nd ACM International Conference on Information & Knowledge Management, pp. 1545–1548.
  • [19] T. Mikolov, Q. V. Le, and I. Sutskever (2013) Exploiting similarities among languages for machine translation. arXiv preprint arXiv:1309.4168.
  • [20] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean (2013) Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pp. 3111–3119.
  • [21] J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Y. Ng (2011) Multimodal deep learning. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 689–696.
  • [22] A. Pantelopoulos and N. G. Bourbakis (2010) A survey on wearable sensor-based systems for health monitoring and prognosis. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews) 40 (1), pp. 1–12.
  • [23] J. Pennington, R. Socher, and C. Manning (2014) GloVe: global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543.
  • [24] L. Portnoy, E. Eskin, and S. Stolfo (2001) Intrusion detection with unlabeled data using clustering. In Proceedings of the ACM CSS Workshop on Data Mining Applied to Security (DMSA-2001).
  • [25] R. Salakhutdinov and H. Larochelle (2010) Efficient learning of deep Boltzmann machines. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 693–700.
  • [26] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov (2014) Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 15 (1), pp. 1929–1958.
  • [27] B. Thompson (2000) Canonical correlation analysis.
  • [28] L. Wang, Y. Li, J. Huang, and S. Lazebnik (2018) Learning two-branch neural networks for image-text matching tasks. IEEE Transactions on Pattern Analysis and Machine Intelligence.
  • [29] L. Wang, Y. Li, and S. Lazebnik (2016) Learning deep structure-preserving image-text embeddings. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5005–5013.
  • [30] S. Zhai, Y. Cheng, W. Lu, and Z. Zhang (2016) Deep structured energy based models for anomaly detection. arXiv preprint arXiv:1605.07717.
  • [31] L. Zhang, S. Wang, X. Zhang, Y. Wang, B. Li, D. Shen, and S. Ji (2016) Collaborative multi-view denoising. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 2045–2054.
  • [32] D. Zhou, J. He, K. S. Candan, and H. Davulcu (2015) MUVIR: multi-view rare category detection. In IJCAI, pp. 4098–4104.