I Introduction
Over the past few decades, various successful anomaly detection methods have been developed and widely applied in a number of high-impact domains, including fraud detection, cybersecurity, algorithmic trading and medical treatment [4]. Identifying anomalies from a swarm of normal instances not only enables the detection and early alert of malicious behaviors but also helps enhance the performance of myriad downstream machine learning tasks.
Existing anomaly detection methods predominantly focus on data from a single source. In many real-world scenarios, however, the data we collect often comes from different sources or can be represented in different feature formats, forming so-called multimodal data. For instance, in synthetic ID detection, the goal is to differentiate generated fake identities from the true identities of users [32]. In this task, each identity is associated with information from different modalities, such as ID photos, bank transaction histories, and online social behaviors. Compared with data collected from a single source, different modalities presented together often reveal complementary insights into the underlying application domain.
In this paper, we focus on identifying anomalies whose patterns are disparate across different modalities, and we refer to the studied problem as cross-modal anomaly detection, which differs from the conventional anomaly detection problem. The major reason for proposing such a problem is that a large portion of data instances within a multimodal context are not anomalous when viewed separately in each individual modality, but present abnormal patterns or behaviors when multiple sources of information are jointly considered. For instance, in bank transaction records, each transaction is associated with different kinds of information such as user profiles, account summaries, transaction records and the user signature, which are naturally of different modalities. In this case, if the user profile is not consistent with other sources of information about the same user, for instance, if the ID is inconsistent with the signature, it might indicate bank fraud associated with a serious crime. In other words, these anomalies present inconsistent behaviors across different modalities [8], and their accurate detection has significant implications in practice.
However, cross-modal anomaly detection remains a challenging task. The main reason is that, to distinguish cross-modal anomalies from normal instances, we need a principled framework to model the correlations between different modalities, as the distributions of different modalities could be very complicated and may vary remarkably. Existing efforts either aim at maximizing the correlations between different modalities through linear mapping [27, 13], or try to extract features from different modalities with low-rank approximation techniques [15, 31]. In other words, the main focus of these methods is to embed various modalities into a low-dimensional consensus feature space with a shallow learning model. In fact, the correlations might not be well captured in this space by shallow models due to their limited representation ability. In a nutshell, these shallow models cannot fully capture the nonlinear correlations among different modalities, which necessitates the investigation of the cross-modal anomaly detection problem with deep learning models.
In this paper, we aim to capture the nonlinear correlations among different modalities for cross-modal anomaly detection. In particular, we propose a deep structured cross-modal anomaly detection framework to identify inconsistent patterns or behaviors of instances across different modalities. The proposed deep structured cross-modal anomaly detection method consists of three major modules. First, we train different deep neural networks to extract features from different modalities, and then project the features into a consensus latent feature space. Second, we pull samples with similar patterns across different modalities close to each other, while pushing samples with dissimilar patterns across different modalities far away from each other. Finally, we distinguish the cross-modal anomalies by measuring the similarity across different modalities. The main contributions are as follows:

We systematically examine and define the cross-modal anomaly detection problem, and analyze the limitations of existing efforts.

We propose a novel cross-modal deep learning framework, CMAD, which projects data from different modalities into a comparable consensus feature space for anomaly detection when the original multimodal data have different distributions.

We perform extensive experiments on various real-world multimodal datasets to demonstrate the effectiveness of our proposed cross-modal anomaly detection framework.
II Related Work
In this section, we briefly review related work from two aspects: (1) conventional anomaly detection with data from a single modality; (2) cross-modal anomaly detection.
Anomaly detection refers to the problem of finding instances in a dataset that do not conform to the expected patterns or behaviors of the majority [4]. The anomaly detection problem has a wide range of applications in many high-impact domains, such as fraud detection [7], intrusion detection [3], and healthcare monitoring and alert [22]. Conventional anomaly detection methods for single-modality data can be broadly categorized into classification-based methods [5, 9] and clustering-based methods [24]. Classification-based methods learn a classifier from a set of labeled training data, and then classify the test data into normal or anomalous classes [5, 9]. Clustering-based methods, on the other hand, often assume that anomalies belong to small and sparse clusters, while normal instances belong to large and dense clusters [24, 16]. While traditional classification-based and clustering-based methods are powerful in detecting anomalies among a swarm of normal instances, they lack the capability to model the correlations among different modalities in multimodal data.
Identifying cross-modal abnormal behaviors is a relatively new topic in the anomaly detection field. Only a few approaches have been developed to identify inconsistent patterns across different modalities. Some recent works detect cross-modal anomalies by analyzing the clustering results in different modalities. These methods are based on the assumption that the underlying clustering structure of normal instances is usually shared across multiple modalities, while anomalous instances belong to different clusters. In particular, a multi-view anomaly detection method is proposed in [8]. It first obtains the spectral embeddings of instances with an ensemble similarity matrix, and then calculates the anomaly score of each instance based on the cosine distance between different embeddings. Later on, Alvarez et al. proposed an affinity propagation based anomaly detection algorithm, which identifies anomalies by analyzing the neighborhoods of each instance in different views [18]. Another widely used family of cross-modal anomaly detection methods is based on low-rank learning: these methods represent data across different modalities in a low-rank feature space, and typical methods along this line include [31] and [15]. Both the aforementioned clustering-based methods and the low-rank learning based methods aim to identify inconsistent patterns across different modalities by learning meaningful feature representations. Like the linear CCA-based methods, these methods are often limited in their representation ability due to their linear projections.
III The Proposed Framework
In this section, we introduce the proposed cross-modal anomaly detection framework CMAD in detail. We begin with a formal definition of the cross-modal anomaly detection problem, and then propose a principled method to learn the consensus data representations across different modalities. After that, we discuss how to leverage deep structure to extract effective feature representations. Finally, we introduce how to perform cross-modal anomaly detection with the built deep learning model.
III-A Problem Definition
Given $m$ different modalities, $X^{1}, X^{2}, \ldots, X^{m}$ are the corresponding feature matrices of these modalities in $c$ different classes, where $X^{j} \in \mathbb{R}^{n \times d_j}$, and $d_j$ is the feature dimensionality of the $j$-th modality. $n$ is the number of instances in the dataset. The feature representation of the $i$-th instance is denoted as $x_i$, where $i \in \{1, 2, \ldots, n\}$. $x_i^{j}$ denotes the feature vector of the $i$-th instance from modality $j$. The class label of the $i$-th instance in modality $j$ is denoted as $y_i^{j}$, where $y_i^{j} \in \{1, 2, \ldots, c\}$. Our task is to find the anomalous instances whose patterns (w.r.t. class labels) are inconsistent across different modalities. For notational convenience and without loss of generality, we only include two modalities $a$ and $b$ when introducing the methodology, where $a, b \in \{1, 2, \ldots, m\}$ and $a \neq b$, but our method can be easily extended to any number of modalities.
Based on the aforementioned terminology, we define the problem of cross-modal anomaly detection as follows:
Definition (Cross-Modal Anomaly Detection): Given a dataset which contains multiple modalities as data sources, such as $X^{a}$ and $X^{b}$, the $i$-th instance in the dataset is regarded as an anomaly when:
$$ s(x_i^{a}, x_i^{b}) < \tau \tag{1} $$
where $s(\cdot, \cdot)$ is a function measuring the cross-modal similarity between a pair of observations; in this definition, both observations belong to the same instance. The parameter $\tau$ is a predefined threshold: if the measured similarity is larger than or equal to $\tau$, we regard the instance as normal across the two modalities; otherwise, the instance is a cross-modal anomaly.
III-B Data Fusion across Different Modalities
To detect cross-modal anomalies, the fundamental prerequisite is a principled way to fuse the information from different modalities. To achieve this, a prevalent approach is to learn mapping functions that project instances from different modalities into a consensus feature space. This gives us a unified feature space in which we can measure the cross-modal similarity of data instances, based on which cross-modal anomaly detection can be carried out.
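To this end, we transform instances from two modalities $a$ and $b$ into a consensus feature space with two linear transformation matrices, denoted as $W^{a} \in \mathbb{R}^{d_a \times k}$ and $W^{b} \in \mathbb{R}^{d_b \times k}$, where $k$ is the dimensionality of the resultant unified feature space and its value is much smaller than $d_a$ and $d_b$. After transformation, we obtain the following data representations for modalities $a$ and $b$:
$$ Z^{a} = X^{a} W^{a} \tag{2} $$
$$ Z^{b} = X^{b} W^{b} \tag{3} $$
Given a pair of instances $x_i^{a}$ and $x_j^{b}$, we employ the cosine similarity to measure their cross-modal similarity in the transformed feature space:
$$ s(x_i^{a}, x_j^{b}) = \frac{(x_i^{a} W^{a}) (x_j^{b} W^{b})^{\top}}{\lVert x_i^{a} W^{a} \rVert_2 \, \lVert x_j^{b} W^{b} \rVert_2} \tag{4} $$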
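As a minimal sketch of the linear projections and cosine similarity in Eqs. (2)-(4), combined with the threshold rule of the definition in Section III-A, the snippet below projects one observation from each modality, compares the projections, and flags the instance when the similarity falls below a threshold. The function names, the toy matrices `W_a` and `W_b`, and the threshold value are illustrative assumptions, not values from the paper.

```python
import math

def project(x, W):
    """Map a row vector x (length d) into the shared space using W (d rows, k columns)."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

def cosine_similarity(u, v):
    """Cosine of the angle between u and v; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention assumed for this sketch
    return dot / (norm_u * norm_v)

def cross_modal_similarity(x_a, x_b, W_a, W_b):
    """Similarity of two observations after projecting each with its own matrix."""
    return cosine_similarity(project(x_a, W_a), project(x_b, W_b))

def is_cross_modal_anomaly(x_a, x_b, W_a, W_b, threshold):
    """An instance is flagged when its cross-modal similarity is below the threshold."""
    return cross_modal_similarity(x_a, x_b, W_a, W_b) < threshold

# Toy example (hypothetical values): modality a has 3 features, modality b has 2,
# and the shared space has 2 dimensions.
W_a = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
W_b = [[1.0, 0.0], [0.0, 1.0]]
flag_consistent = is_cross_modal_anomaly([1.0, 0.0, 2.0], [3.0, 0.0], W_a, W_b, threshold=0.3)
flag_inconsistent = is_cross_modal_anomaly([1.0, 0.0, 2.0], [0.0, 2.0], W_a, W_b, threshold=0.3)
```

In this toy setting the first pair projects to parallel vectors and is treated as normal, while the second pair projects to orthogonal vectors and is flagged.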
To find the cross-modal anomalies, we need to ensure that for a pair of instances with consistent patterns across different modalities (i.e., $y_i^{a} = y_j^{b}$), the cross-modal similarity in the transformed feature space is high. To this end, the loss function can be expressed as follows:
$$ \mathcal{L}_{S} = \sum_{(i,j) \in S} \left( 1 - s(x_i^{a}, x_j^{b}) \right) \tag{5} $$
where the set $S$ contains the instance pairs whose cross-modal observations are consistent, i.e., $y_i^{a} = y_j^{b}$.
Also, we need to penalize instance pairs with inconsistent patterns across different modalities. In particular, in the training process, we not only make use of the instance pairs with consistent patterns across different modalities, but also incorporate an additional set of negative samples obtained by negative sampling [19]. Let $D$ be the set of negative samples whose cross-modal patterns are inconsistent (i.e., $y_i^{a} \neq y_j^{b}$); then, to learn the mapping matrices, the loss function that encourages dissimilarity of negative samples can be written as:
$$ \mathcal{L}_{D} = \sum_{(i,j) \in D} \max\left( 0,\; s(x_i^{a}, x_j^{b}) - \delta \right) \tag{6} $$
In the above formulation, the hyperparameter $\delta$, as a margin value, controls the penalty degree of the inconsistent patterns. Specifically, if $s(x_i^{a}, x_j^{b}) > \delta$, the instance pair with inconsistent patterns across different modalities is regarded as similar and needs to be penalized in the loss function.
Combining Eq. (5) and Eq. (6) leads to the following objective function for learning the transformation matrices:
$$ \min_{W^{a}, W^{b}} \; \sum_{(i,j) \in S} \left( 1 - s(x_i^{a}, x_j^{b}) \right) + \lambda \sum_{(i,j) \in D} \max\left( 0,\; s(x_i^{a}, x_j^{b}) - \delta \right) + \beta \left( \lVert W^{a} \rVert_F^2 + \lVert W^{b} \rVert_F^2 \right) \tag{7} $$
The above loss function consists of three terms. The first term “pulls” the projections of different modalities together if their cross-modal patterns (w.r.t. instance class labels) are consistent, while the second term “pushes” them further apart otherwise (as shown in Figure 1(b)). The resultant unified feature space is shown in Figure 1(c). The parameter $\lambda$ controls the influence of negative sampling. The third term is a regularization term, which controls the bias-variance trade-off in order to avoid overfitting, and $\beta$ is the regularization hyperparameter. The aforementioned objective function can be solved with coordinate descent methods, in which the variables $W^{a}$ and $W^{b}$ are updated in an alternating way until the objective function converges to a local optimum.
Deep neural networks have been successfully applied as powerful feature extraction models for different types of single-modality data, including text, image and audio data [21]. The main reason for this success is that deep neural networks are capable of learning nonlinear mappings by extracting high-level abstractions from the raw input features. Over the past few decades, many successful deep neural networks such as deep Boltzmann machines [25], autoencoders [21], and recurrent neural networks [10, 11] have been widely applied and have achieved state-of-the-art performance in many learning tasks.
Motivated by the success of deep neural networks, we develop a deep structured framework to characterize the features of different modalities for cross-modal anomaly detection. In particular, we employ a series of nonlinear mapping functions in the deep neural networks to map the information of each modality into a consensus feature space, in which the instance pairs with consistent patterns across different modalities are pulled together while the pairs with inconsistent cross-modal patterns are pushed away. In this way, the trained model can help us identify the cross-modal anomalies in the test data. Compared with the previously introduced linear projections, the developed method employs deep neural networks to learn nonlinear mappings, which gives it a stronger ability to learn effective feature representations from the original multimodal data [17]. In addition, the nonlinear mapping functions help to capture the nonlinear correlations among different modalities, which are otherwise difficult to characterize with conventional linear projection functions. It should be noted that the previously mentioned linear mapping is mathematically equivalent to a single fully connected layer (without nonlinear activation functions) in the deep neural networks. Specifically, if we replace the linear mapping process (through the mapping matrices $W^{a}$ and $W^{b}$) with neural networks, then we have the loss function as:
$$ \mathcal{L}(\theta) = \sum_{(i,j) \in S} \left( 1 - \cos\left( f(x_i^{a}; \theta), g(x_j^{b}; \theta) \right) \right) + \lambda \sum_{(i,j) \in D} \max\left( 0,\; \cos\left( f(x_i^{a}; \theta), g(x_j^{b}; \theta) \right) - \delta \right) \tag{8} $$
where $\cos(\cdot, \cdot)$ is the cosine similarity of Eq. (4) applied to the network outputs, $f$ and $g$ denote neural network-like differentiable functions, and $\theta$ denotes the set of all parameters of the deep neural networks. We determine the architecture of the neural networks based on the data modalities to be handled. Specifically, we extract features from images through a convolutional neural network (CNN), and train a fully connected neural network to process text tags. The parameter $\lambda$ controls the importance of the second term. In the experiments, we empirically set $\lambda$ to 1.
Our goal is to minimize Eq. (8) by updating all the model parameters in the deep neural networks. With an initialization of the model parameters, the proposed deep model CMAD is optimized by Adam, a gradient descent method based on adaptive moment estimation. The details are given in Algorithm 1. We make use of dropout [26] in the training process to avoid overfitting. After the optimization process, the instances with consistent patterns across different modalities will be pulled together within a small distance, while the instances with inconsistent patterns across different modalities will be pushed away from each other. As an example, the distribution of data instances in the projected latent space is shown in Figure 1(c). Thus, we can use CMAD to identify cross-modal anomalies by measuring the similarity across different modalities, as shown in Algorithm 2. We show how the performance of CMAD varies with different settings of the margin $\delta$ in Eq. (8) and the threshold $\tau$ in Eq. (1) in Section IV-D.
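The structure of the objective can be illustrated with a small sketch. It assumes a cosine-embedding form of the loss: consistent pairs contribute one minus their similarity, inconsistent pairs contribute a hinge on the similarity above the margin, and the two sums are combined with a weight. This functional form, the function name `cmad_style_loss`, and the toy embeddings are assumptions made for illustration; the embeddings stand in for the outputs of the two networks.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0.0 else 0.0

def cmad_style_loss(positive_pairs, negative_pairs, margin, neg_weight=1.0):
    """Pull term over consistent pairs plus weighted push term over inconsistent pairs.

    Each pair is (embedding_a, embedding_b), already mapped into the shared space.
    """
    pull = sum(1.0 - cosine_similarity(za, zb) for za, zb in positive_pairs)
    push = sum(max(0.0, cosine_similarity(za, zb) - margin) for za, zb in negative_pairs)
    return pull + neg_weight * push

# Hypothetical embeddings in a 2-dimensional shared space.
positives = [([1.0, 0.0], [2.0, 0.0])]   # same direction, similarity 1
negatives = [([1.0, 0.0], [1.0, 0.0]),   # similarity 1, above the margin
             ([1.0, 0.0], [0.0, 1.0])]   # similarity 0, below the margin
loss = cmad_style_loss(positives, negatives, margin=0.5, neg_weight=1.0)
```

Only the negative pair whose similarity exceeds the margin contributes to the push term, which mirrors the role of the margin described in the text. In practice the gradient of such a loss with respect to the network parameters would be computed by an automatic differentiation library rather than by hand.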
IV Experiments
In this section, we conduct experiments to evaluate the effectiveness of our proposed CMAD framework in detecting cross-modal anomalies. We also perform a case study that visualizes the distribution in the latent space to illustrate the effectiveness of CMAD.
IV-A Datasets
We use two real-world cross-modal datasets to assess the effectiveness of the proposed framework in cross-modal anomaly detection. The details of the datasets are as follows:

MNIST [14]: In the MNIST dataset, we have 60,000 original images of $1 \times 28 \times 28$ pixels representing ten different digits. After adding a text tag to each image and embedding the tags into 100-dimensional vectors with different word embedding methods (Word2Vec [20], GloVe [23]), we synthetically generate training data with two different modalities. The image modality is fed into a CNN model, and the text modality is fed into a fully connected neural network for training.

RGB-D [12]: In the RGB-D dataset, we have two modalities, RGB images and depth images, which represent the same objects in the real world. In our experiments, we include 960 RGB images and 960 corresponding depth images from five different kinds of objects in the RGB-D dataset: apples, staplers, coffee mugs, soda cans, and toothpastes. Both modalities are fed into a CNN model for feature learning.
As there is no ground truth of anomalies in these datasets, we need to inject anomalies for evaluation. To generate the cross-modal anomalies, we adopt a widely used injection method, negative sampling, to create a number of instance pairs whose patterns are inconsistent across different modalities. Originally, we only have instances whose patterns are consistent across different modalities (with the same class labels). To introduce inconsistent instance pairs, we randomly sample a number of instance pairs whose class labels differ across the two modalities, i.e., $y_i^{a} \neq y_j^{b}$. More specifically, in the testing stage, we inject 1,019 inconsistent pairs in MNIST and 545 inconsistent pairs in the RGB-D dataset. Meanwhile, we include the same proportion of instances with consistent behavior, which have no overlap with those used in the training stage.
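The injection step can be sketched as rejection sampling over index pairs: draw one index per modality and keep the pair only when the two class labels differ. The function name `sample_inconsistent_pairs`, the fixed seed, and the toy labels are assumptions for illustration; the paper does not specify the sampling procedure beyond negative sampling.

```python
import random

def sample_inconsistent_pairs(labels_a, labels_b, num_pairs, seed=0):
    """Draw index pairs (i, j) with labels_a[i] != labels_b[j].

    labels_a and labels_b hold the class labels of the instances in each modality.
    Pairs are drawn with replacement by rejection sampling.
    """
    if not labels_a or not labels_b or len(set(labels_a) | set(labels_b)) < 2:
        raise ValueError("need at least two distinct classes to build inconsistent pairs")
    rng = random.Random(seed)
    pairs = []
    while len(pairs) < num_pairs:
        i = rng.randrange(len(labels_a))
        j = rng.randrange(len(labels_b))
        if labels_a[i] != labels_b[j]:
            pairs.append((i, j))
    return pairs

# Hypothetical labels: both modalities describe the same five instances.
labels = [0, 0, 1, 1, 2]
injected = sample_inconsistent_pairs(labels, labels, num_pairs=10)
```

Each returned pair combines an observation from one modality with an observation of a different class from the other modality, so it can serve as an injected cross-modal anomaly.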
Dataset  Dimension of the representations in the hidden layers
MNIST Images  784-1440-1280-320-150-50
MNIST Tags  100-100-50
RGB-D RGB Images  58575-1440-1280-320-150-50
RGB-D Depth Images  58575-1440-1280-320-150-50

IV-B Baseline Algorithms
We introduce several widely used baseline methods to compare the performance of cross-modal anomaly detection on multimodal data:

CCA & KCCA [27]. CCA models correlation across multiple modalities and implicitly maps instances into a lower-dimensional space, with the goal of finding the maximum correlations in the latent space by linear projections. Compared with CCA, KCCA [13, 2] uses an alternative projection strategy, nonlinearly mapping the multimodal data into a consensus feature space with kernel tricks. We maximize the correlations of instances with consistent patterns in the training phase, and identify cross-modal anomalies by their cross-modal correlation in the test phase. Both CCA and KCCA are supervised.

PLS [1]. PLS is also a supervised feature representation learning method based on linear transformation. The major difference between PLS and CCA is that PLS maximizes the covariance between different modalities through linear transformations. We identify the cross-modal anomalies in the same way as for CCA and KCCA.

HOAD [8]. HOAD is a supervised anomaly detection algorithm to find anomalies from the multimodal data. It first obtains the spectral embeddings from the multimodal data with an ensemble similarity matrix, and then calculates the anomaly score of each instance based on the cosine distance between different embeddings.

Embedding Network [28, 29]. A supervised representation learning framework based on deep neural networks that extracts features from different modalities, such as image and text. We distinguish the cross-modal anomalies by measuring the Euclidean distances across different modalities after projecting instances into the latent space.
In our experiments, we evaluate the performance of cross-modal anomaly detection using precision, recall, and accuracy. We use TP, FP, TN, and FN to denote the numbers of true positives, false positives, true negatives and false negatives, respectively, in the predicted results. The metrics are defined as follows:
$$ \text{Precision} = \frac{TP}{TP + FP} \tag{9} $$
$$ \text{Recall} = \frac{TP}{TP + FN} \tag{10} $$
$$ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{11} $$
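The three metrics follow directly from the four confusion counts. The sketch below uses the standard definitions and assumes that the anomalous class is the positive class, which the text does not state explicitly; the helper names and the toy labels are illustrative.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, TN, FN, treating True as the anomalous (positive) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)
    fp = sum(1 for t, p in pairs if not t and p)
    tn = sum(1 for t, p in pairs if not t and not p)
    fn = sum(1 for t, p in pairs if t and not p)
    return tp, fp, tn, fn

def precision_recall_accuracy(y_true, y_pred):
    """Return (precision, recall, accuracy); a ratio with a zero denominator is 0.0."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    total = tp + fp + tn + fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / total if total else 0.0
    return precision, recall, accuracy

# Hypothetical predictions for six test pairs (True means "anomaly").
y_true = [True, True, True, False, False, False]
y_pred = [True, True, False, True, False, False]
precision, recall, accuracy = precision_recall_accuracy(y_true, y_pred)
```

Here two of the three predicted anomalies are real and two of the three real anomalies are found, so precision and recall are both 2/3, and four of the six pairs are classified correctly.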
IV-C Anomaly Detection Performance
The parameter settings in our deep model CMAD vary among different modalities and datasets. Specifically, we use a convolutional neural network (CNN) for the image modality [30, 6], and a fully connected neural network for the text tag modality. The feature dimensionality of each layer is listed in Table 2. Table 3 and Table 4 present the performance of cross-modal anomaly detection using different algorithms, in terms of the accuracy, precision and recall of the anomaly detection results.
Several observations can be drawn from the tables.
Method  Accuracy  Precision  Recall
CCA  0.6654  0.8911  0.6191
Kernel CCA (RBF)  0.6923  0.7062  0.6948
PLS  0.7235  0.6636  0.6233
HOAD  0.5118  0.9220  0.5115
Embedding Network  0.9504  0.9873  0.9127
CMAD  0.9921  0.9971  0.9875

Method  Accuracy  Precision  Recall
CCA  0.7682  0.7171  0.7988
Kernel CCA (RBF)  0.8456  0.7134  0.9699
PLS  0.6580  0.8304  0.7539
HOAD  0.5137  0.9345  0.5124
Embedding Network  0.5138  0.5075  0.9346
CMAD  0.9512  0.9313  0.9742

First, CMAD outperforms the baseline methods CCA, KCCA, PLS and HOAD in most cases, which validates the importance of applying deep models for cross-modal anomaly detection by extracting more effective feature representations from the original multimodal data.
Second, we compare CMAD against the Embedding Network method. Similar to our proposed CMAD, the Embedding Network is also a deep structure based approach. Based on the experimental results, both of the deep structured methods outperform other linear transformation based methods. These observations confirm the limitation of the linear projection based methods: they cannot fully capture nonlinear correlations among different modalities for cross-modal anomaly detection.
There are two major differences between the proposed CMAD and the Embedding Network method. First, the Embedding Network is developed to handle text and image data only, while CMAD is able to handle different types of data with appropriate deep neural network designs. Second, the Embedding Network narrows the Euclidean distance between pairs of instances with consistent information across different modalities, whereas in CMAD, we not only pull samples with similar patterns across different modalities close to each other, but also push samples with dissimilar patterns across different modalities far away from each other. Thus, CMAD can learn better representations from the multimodal data.
IV-D Hyperparameter Settings
There are two hyperparameters to be tuned in our method, i.e., the margin $\delta$ in Eq. (8) and the threshold $\tau$ in Eq. (1). In Figure 2, we show how the performance of CMAD varies with different hyperparameter settings. In particular, we tune the values of these hyperparameters within the ranges {-0.4, -0.3, …, 0.5} and {-0.2, -0.1, …, 0.7}, respectively. In general, the algorithm performs better w.r.t. precision when the threshold is small and the margin is large. The accuracy achieves its best value when $\delta$ is 0.3 and $\tau$ equals 0.3. Meanwhile, the proposed CMAD achieves good performance (over 92%) across a wide range of hyperparameter settings; this loose condition gives us more opportunity to find anomalies in real-world applications.
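A threshold sweep of this kind can be sketched as follows. The snippet evaluates accuracy on a grid of candidate thresholds and keeps the best one; the grid from -0.2 to 0.7 in steps of 0.1 is an assumed reading of the range above, and the similarity scores and labels are hypothetical toy values rather than experimental data.

```python
def accuracy_at_threshold(similarities, is_anomaly, threshold):
    """Fraction of instances classified correctly when flagging similarity < threshold."""
    correct = sum(1 for s, y in zip(similarities, is_anomaly) if (s < threshold) == y)
    return correct / len(similarities)

def sweep_thresholds(similarities, is_anomaly, thresholds):
    """Return (best_threshold, best_accuracy); ties keep the earliest threshold in the list."""
    best = None
    for tau in thresholds:
        acc = accuracy_at_threshold(similarities, is_anomaly, tau)
        if best is None or acc > best[1]:
            best = (tau, acc)
    return best

# Assumed grid: -0.2, -0.1, ..., 0.7 (rounded to avoid floating-point drift).
thresholds = [round(-0.2 + 0.1 * k, 1) for k in range(10)]
# Hypothetical cross-modal similarities and ground-truth anomaly flags.
similarities = [0.95, 0.80, 0.60, 0.10, -0.30, 0.25]
is_anomaly = [False, False, False, True, True, True]
best_tau, best_acc = sweep_thresholds(similarities, is_anomaly, thresholds)
```

The margin would be tuned in an outer loop in the same way, retraining the model for each candidate value, since it affects the learned embeddings rather than only the decision rule.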
IV-E Case Studies
Finally, we conduct two case studies to visualize the data distribution in the projected consensus feature space. Both cases are based on the MNIST dataset. The first case shows the data distribution of each individual modality in the latent space, and demonstrates the effectiveness of the “pull” and “push” actions in our developed method. The second case demonstrates the representation ability of CMAD across different modalities.
In the first case, in order to show that CMAD is capable of separating different instances in the projected consensus feature space, we visualize the data distribution in individual modalities after the training stage. To represent the data distribution, we first map the original instances into the projected feature space. Then we further reduce the dimensionality of the projected feature space to a two-dimensional feature space by PCA for visualization. As shown in Figure 3, in each modality, data instances with similar patterns are grouped together and separated from those with different patterns. In Figure 3 (left), we can see that when the images with labels “three” and “seven” are taken as the test data, most of the images with label “three” (green in the figure) are separated from the images with label “seven” (brown in the figure) in the projected two-dimensional space. In Figure 3 (right), we use the text tag as another data modality. We can see that the distributions of different tags are also separated in the projected consensus feature space.
The second case demonstrates the representation ability of the developed method across different modalities. In other words, our goal is to check whether, after feature mapping, observations with consistent patterns across modalities lie closer to each other than observations with inconsistent patterns. We take one specific instance from one modality in the projected consensus feature space and find its nearest instances (based on cosine distance) from the other modality.
As shown in Eq. (12), given an observation $x_i^{b}$, where modality $b$ denotes text data, we extract a set of semantically similar image observations $\mathcal{N}(x_i^{b})$ from $X^{a}$, and aggregate these observations to generate a new image:
$$ \tilde{x}_i^{a} = \sum_{x_j^{a} \in \mathcal{N}(x_i^{b})} x_j^{a} \tag{12} $$
After normalizing the generated image from Eq. (12) to the range [0, 255] for visualization in gray scale, we obtain a new image that represents an estimate of the distribution close to the given text $x_i^{b}$.
We extract instances from the text modality, find the instances from the image modality with the closest cosine distances in the latent space, and reconstruct images from the text features. After normalizing the nearest images, we represent them as new, reconstructed images in gray scale. The reconstructed results are reported in Figure 4. The upper part contains the ground-truth images, and the lower part shows the reconstructed results. Comparing the reconstructed results with the original images, we can see that our algorithm is effective in pulling together observations that are consistent across different modalities and pushing inconsistent ones apart.
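The retrieval and aggregation procedure of this case study can be sketched in a few lines. The snippet ranks image embeddings by cosine distance to a text embedding, sums the pixels of the nearest images, and rescales the result to [0, 255]. The pixel-wise sum, the min-max rescaling, the neighbourhood size `k`, and all toy values are assumptions for illustration.

```python
import math

def cosine_distance(u, v):
    """One minus cosine similarity; defined as 1.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm if norm > 0.0 else 1.0

def nearest_indices(query, candidates, k):
    """Indices of the k candidates closest to the query in cosine distance."""
    order = sorted(range(len(candidates)), key=lambda j: cosine_distance(query, candidates[j]))
    return order[:k]

def aggregate_images(images, indices):
    """Pixel-wise sum of the selected images (each image is a flat list of pixels)."""
    return [sum(images[j][p] for j in indices) for p in range(len(images[0]))]

def rescale_to_gray(pixels):
    """Min-max rescale to the range [0, 255]; a constant image maps to all zeros."""
    lo, hi = min(pixels), max(pixels)
    if hi == lo:
        return [0.0 for _ in pixels]
    return [255.0 * (p - lo) / (hi - lo) for p in pixels]

# Hypothetical latent embeddings and tiny 4-pixel "images".
text_embedding = [1.0, 0.0]
image_embeddings = [[0.9, 0.1], [0.0, 1.0], [1.0, 0.2]]
images = [[0.0, 1.0, 2.0, 3.0], [9.0, 9.0, 9.0, 9.0], [2.0, 3.0, 4.0, 5.0]]
neighbours = nearest_indices(text_embedding, image_embeddings, k=2)
reconstructed = rescale_to_gray(aggregate_images(images, neighbours))
```

Because the result is rescaled afterwards, summing and averaging the neighbours give the same visualization.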
V Conclusions and Future Work
In this paper, we propose a novel cross-modal anomaly detection approach, CMAD, based on deep neural networks. The proposed CMAD framework is able to identify inconsistent patterns or behaviors of instances across different modalities by capturing their nonlinear correlations in the learned consensus latent feature space. First, we train deep structured models to represent features from different modalities, and then project the features into a consensus latent feature space. Second, we “pull” the projections of a pair of instances from different modalities together if their cross-modal patterns are consistent, and “push” them further apart otherwise. Finally, we distinguish the cross-modal anomalies by measuring the distances across different modalities. We demonstrate the effectiveness of the developed approach on different multimodal datasets, and the experimental results show that CMAD achieves better detection performance than other prevalent anomaly detection methods on multimodal data.
There are a number of directions that can be explored in future research, including: (1) extending the current cross-modal anomaly detection framework to other complex data, such as sequential data and networked data; (2) developing a principled way to handle dynamic data that often arrives at a fast pace, since in many real-world applications the detection of fraudulent or malicious behaviors in real time is often desired.
Acknowledgement
The work is, in part, supported by the National Science Foundation (#IIS-1657196, #IIS-1750074, #CNS-1816497).
References
 [1] (2003) Partial least square regression (PLS regression). Encyclopedia for research methods for the social sciences 6 (4), pp. 792–795.
 [2] (2006) A kernel method for canonical correlation analysis. arXiv preprint cs/0609071.
 [3] (2016) A survey of data mining and machine learning methods for cyber security intrusion detection. IEEE Communications Surveys & Tutorials 18 (2), pp. 1153–1176.
 [4] (2009) Anomaly detection: a survey. ACM Computing Surveys (CSUR) 41 (3), pp. 15.
 [5] (2002) Anomaly detection and classification for hyperspectral imagery. IEEE Transactions on Geoscience and Remote Sensing 40 (6), pp. 1314–1325.
 [6] (2015) Heterogeneous network embedding via deep architectures. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 119–128.
 [7] (1997) Neural fraud detection in credit card operations. IEEE Transactions on Neural Networks 8 (4), pp. 827–834.
 [8] (2011) A spectral framework for detecting inconsistency across multi-source object relationships. In Data Mining (ICDM), 2011 IEEE 11th International Conference on, pp. 1050–1055.
 [9] (2002) Combining negative selection and classification techniques for anomaly detection. In Evolutionary Computation, 2002. CEC’02. Proceedings of the 2002 Congress on, Vol. 1, pp. 705–710.
 [10] (2013) Speech recognition with deep recurrent neural networks. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, pp. 6645–6649.
 [11] (2019) Graph recurrent networks with attributed random walks.
 [12] (2011) A large-scale hierarchical multi-view RGB-D object dataset. In Robotics and Automation (ICRA), 2011 IEEE International Conference on, pp. 1817–1824.
 [13] (2000) Kernel and nonlinear canonical correlation analysis. International Journal of Neural Systems 10 (05), pp. 365–377.
 [14] (2010) MNIST handwritten digit database. Note: http://yann.lecun.com/exdb/mnist/
 [15] (2015) Multi-view low-rank analysis for outlier detection. In Proceedings of the 2015 SIAM International Conference on Data Mining, pp. 748–756.
 [16]
 [17] (2019) Is a single vector enough? Exploring node polysemy for network embedding. arXiv preprint arXiv:1905.10668.
 [18] (2013) Clustering-based anomaly detection in multi-view data. In Proceedings of the 22nd ACM International Conference on Information & Knowledge Management, pp. 1545–1548.
 [19] (2013) Exploiting similarities among languages for machine translation. arXiv preprint arXiv:1309.4168.
 [20] (2013) Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pp. 3111–3119.
 [21] (2011) Multimodal deep learning. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 689–696.
 [22] (2010) A survey on wearable sensor-based systems for health monitoring and prognosis. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews) 40 (1), pp. 1–12.
 [23] (2014) GloVe: global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543.
 [24] (2001) Intrusion detection with unlabeled data using clustering. In Proceedings of ACM CSS Workshop on Data Mining Applied to Security (DMSA-2001).
 [25] (2010) Efficient learning of deep Boltzmann machines. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 693–700.
 [26] (2014) Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 15 (1), pp. 1929–1958.
 [27] (2000) Canonical correlation analysis.
 [28] (2018) Learning two-branch neural networks for image-text matching tasks. IEEE Transactions on Pattern Analysis and Machine Intelligence.
 [29] (2016) Learning deep structure-preserving image-text embeddings. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5005–5013.
 [30] (2016) Deep structured energy based models for anomaly detection. arXiv preprint arXiv:1605.07717.
 [31] (2016) Collaborative multi-view denoising. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 2045–2054.
 [32] (2015) MUVIR: multi-view rare category detection. In IJCAI, pp. 4098–4104.