Multi-view clustering (MVC) is a fundamental learning task in data mining, image segmentation and pattern recognition [29, 6, 36, 21]. The key to MVC is to exploit the consistent and complementary information among views that describe the data from different aspects, and this problem has attracted enormous attention. Existing multi-view clustering approaches can be grouped into four categories according to the mechanisms and principles involved, namely co-training, multi-kernel clustering, graph clustering and subspace clustering [17, 2, 11, 43, 30]. Co-training algorithms bootstrap the clustering results of different views by using prior or learned knowledge from the other views [12, 13, 20]. In multi-kernel clustering, the data are first mapped to high-dimensional spaces through kernel functions, and these kernels are then combined linearly or non-linearly to improve clustering performance [14, 8, 28, 4]. Multi-view graph clustering algorithms construct a graph similarity matrix for each view, and the main challenge is how to approximate a fusion graph [15, 33, 32, 39, 37, 18, 35, 42]. Multi-view subspace clustering can be further divided into subspace-based methods [34, 3, 19, 23, 41, 45, 27, 22] and matrix factorization methods [35, 46]. Both are designed to learn a low-dimensional representation shared by all views. Our method belongs to the non-negative matrix factorization family.
In recent years, non-negative matrix factorization (NMF) has seen steady development in multi-view subspace clustering. An NMF-based multi-view clustering algorithm has been proposed that searches for a factorization giving compatible clustering solutions across multiple views. A multi-manifold regularized non-negative matrix factorization framework (MMNMF) preserves the local geometric structure of the manifolds for multi-view clustering. Another method addresses multi-view feature selection and fusion by using matrix factorization, and an NMF model with co-orthogonal constraints has been designed to deal with the MVC problem. However, most algorithms based on matrix factorization follow a single-layer strategy. Only a few algorithms, such as DMVC, use a deep semi-NMF framework. DMVC focuses on the intrinsic geometric structure of each view, so graph regularizations are introduced to couple the output representations of the deep structures. However, DMVC needs to learn the values of its hyperparameters. A later method has been proposed to solve this problem and further improves performance. Although these methods have achieved success, they can still be improved from the following perspectives. First, since different views represent various attributes of the data items, existing methods discard view-specific features by forcing the representations to be consistent across views. Second, there is still a large gap in fully discovering the rich hidden information of the original data with deep matrix factorization structures under existing mechanisms.
In this paper, we propose a multi-view clustering method via deep semi-NMF to solve the above problems. We jointly optimize the representation learning of each view and the late fusion stage in a unified framework, which we term multi-view clustering with deep semi-NMF and global graph refinement (MVC-DMF-GGR). Firstly, we learn a low-dimensional and more compact representation for each view through the deep semi-NMF framework. As these representations originate from different views, the view-specific information can be well captured. Secondly, we use these learned representations to reconstruct the graph structure of each view and then merge them to approximate a common graph structure. Although the representation of each view may be different, the graph structures of the views tend to be similar because they all describe the same batch of samples. Therefore, following traditional graph-based methods, we combine representation learning and common graph structure learning in a joint optimization, with the aim of obtaining an optimal graph structure for clustering. Extensive experiments on six benchmark datasets are performed to evaluate the effectiveness of our proposed method, which achieves superior clustering performance compared with several state-of-the-art methods.
The contributions of this paper are summarized as follows:
We propose a multi-view clustering method with deep semi-NMF and global graph refinement (MVC-DMF-GGR). In this work, we unify representation learning and graph structure learning into one framework, so that the two can promote and guide each other and reach a consensus for clustering.
By introducing the deep semi-NMF framework, we decompose the feature matrix over multiple layers and capture the underlying information of each view. In the fusion stage, a graph regularization term is introduced to learn a graph structure shared by all views. The common graph unifies the internal geometric structures of the data among different views.
Extensive experiments are conducted on six multi-view datasets, and our proposed method shows clear superiority over other SOTA methods.
The rest of the paper is organized as follows. Section II outlines the related work on multi-view clustering via NMF. Section III introduces the proposed method and the alternating algorithm that solves the optimization problem, together with its convergence and computational complexity analysis. Section IV introduces the datasets and compared methods and presents the experimental results with analysis. Section V concludes the paper.
II Related Work
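We first introduce some notations. A bold capital symbol such as $\mathbf{A}$ represents a matrix. $\mathbf{A}_{i\cdot}$, $\mathbf{A}_{\cdot j}$ and $A_{ij}$ represent its $i$-th row, $j$-th column and $(i,j)$-th element. The subscript $i$ denotes the $i$-th layer and the superscript $(v)$ denotes the $v$-th view. The Frobenius norm of a matrix $\mathbf{A}$ is denoted as $\|\mathbf{A}\|_F$ and its trace is denoted as $\mathrm{tr}(\mathbf{A})$. $\mathbf{A}^{T}$ and $\mathbf{A}^{\dagger}$ denote the transpose and the Moore-Penrose generalized inverse of $\mathbf{A}$ respectively. We separate the negative and positive parts of a matrix $\mathbf{A}$ as $[\mathbf{A}]^{-} = (|\mathbf{A}| - \mathbf{A})/2$ and $[\mathbf{A}]^{+} = (|\mathbf{A}| + \mathbf{A})/2$.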
In this part, we briefly review several of the most related works, including Semi-NMF, deep Semi-NMF, Multi-view clustering via DMF, etc.
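Non-negative matrix factorization is an important topic in matrix factorization and can be used for clustering, spectral decomposition, and subspace identification. In practice, the source data may have mixed signs. Semi-NMF extends traditional NMF to this setting and comes with an alternating update algorithm for the related variables. NMF can be written as:
$$\mathbf{X}^{+} \approx \mathbf{Z}^{+}\mathbf{H}^{+} \tag{1}$$
Semi-NMF can be written as:
$$\mathbf{X}^{\pm} \approx \mathbf{Z}^{\pm}\mathbf{H}^{+} \tag{2}$$
where $\mathbf{X}\in\mathbb{R}^{d\times n}$ denotes the input data with $n$ samples, each composed of a $d$-dimensional feature. $\mathbf{X}^{+}$ in Eq. (1) indicates that the elements of the original data are non-negative, and $\mathbf{X}^{\pm}$ in Eq. (2) indicates that they have mixed signs. When NMF or semi-NMF is used in clustering, $\mathbf{Z}\in\mathbb{R}^{d\times k}$ is the cluster centroid matrix and $\mathbf{H}\in\mathbb{R}^{k\times n}$ is the soft clustering assignment matrix, or equivalently the $k$-dimensional representation. The difference between NMF and Semi-NMF is that the elements of $\mathbf{X}$ and $\mathbf{Z}$ are forced to be non-negative in NMF, while in Semi-NMF they can have mixed signs.
The optimization problem of Semi-NMF in Eq. (2), $\min_{\mathbf{Z},\mathbf{H}} \|\mathbf{X}-\mathbf{Z}\mathbf{H}\|_F^2$ subject to $\mathbf{H}\ge 0$, can be solved by alternately updating Z and H:
i) Optimizing $\mathbf{Z}$ given $\mathbf{H}$. With the soft clustering assignment matrix $\mathbf{H}$ fixed, Eq. (2) becomes the unconstrained problem $\min_{\mathbf{Z}} \|\mathbf{X}-\mathbf{Z}\mathbf{H}\|_F^2$. Setting the derivative with respect to $\mathbf{Z}$ to zero gives the solution $\mathbf{Z} = \mathbf{X}\mathbf{H}^{\dagger}$.
ii) Optimizing $\mathbf{H}$ given $\mathbf{Z}$. With $\mathbf{Z}$ fixed, $\mathbf{H}$ can be optimized by solving $\min_{\mathbf{H}} \|\mathbf{X}-\mathbf{Z}\mathbf{H}\|_F^2$ with the constraint $\mathbf{H}\ge 0$. By using the Lagrange method, we obtain the following update rule for $\mathbf{H}$, whose limiting solution satisfies the KKT condition (the product, quotient and square root are element-wise):
$$\mathbf{H} \leftarrow \mathbf{H} \odot \sqrt{\frac{[\mathbf{Z}^{T}\mathbf{X}]^{+} + [\mathbf{Z}^{T}\mathbf{Z}]^{-}\mathbf{H}}{[\mathbf{Z}^{T}\mathbf{X}]^{-} + [\mathbf{Z}^{T}\mathbf{Z}]^{+}\mathbf{H}}} \tag{3}$$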
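The alternating scheme above (a least-squares update of Z followed by a multiplicative update of H) can be sketched in NumPy. This is a minimal illustration rather than the authors' implementation: the function names, the random non-negative initialization, the fixed iteration count and the small constant `eps` that guards the division are all assumptions made for the example.

```python
import numpy as np

def pos(A):
    """Element-wise positive part: (|A| + A) / 2."""
    return (np.abs(A) + A) / 2.0

def neg(A):
    """Element-wise negative part: (|A| - A) / 2."""
    return (np.abs(A) - A) / 2.0

def semi_nmf(X, k, n_iter=100, eps=1e-10, seed=0):
    """Factorize a mixed-sign X (d x n) as Z @ H with H >= 0.

    Returns Z (d x k), H (k x n) and the squared reconstruction
    error recorded after every iteration.
    """
    rng = np.random.default_rng(seed)
    # Assumed initialization: strictly positive random H.
    H = rng.random((k, X.shape[1])) + 0.1
    errors = []
    for _ in range(n_iter):
        # Step i): Z = X H^dagger (unconstrained least squares).
        Z = X @ np.linalg.pinv(H)
        # Step ii): multiplicative update that keeps H non-negative.
        ZtX = Z.T @ X
        ZtZ = Z.T @ Z
        numer = pos(ZtX) + neg(ZtZ) @ H
        denom = neg(ZtX) + pos(ZtZ) @ H
        H = H * np.sqrt(numer / (denom + eps))
        errors.append(np.linalg.norm(X - Z @ H) ** 2)
    return Z, H, errors
```

Because Z may take any sign, the only constraint to maintain is H >= 0, and the multiplicative form does that automatically as long as the initial H is non-negative.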
II-B Deep Semi-NMF for representation learning
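The low-dimensional one-layer representation obtained by Semi-NMF cannot preserve the original features well due to its limited representation ability. A deep Semi-NMF framework for single-view data has therefore been proposed, which is able to learn a lower-dimensional hidden representation. This method promotes the applications of semi-NMF and provides interpretability for the improvement of clustering performance. Deep Semi-NMF can be written as,
$$\min_{\mathbf{Z}_i,\mathbf{H}_i}\ \|\mathbf{X}-\mathbf{Z}_1\mathbf{Z}_2\cdots\mathbf{Z}_m\mathbf{H}_m\|_F^2 \quad \text{s.t. } \mathbf{H}_i\ge 0,\ i=1,\dots,m \tag{4}$$
where $\mathbf{Z}_1$ denotes the mapping between the feature matrix $\mathbf{X}$ and the first representation $\mathbf{H}_1$, and $\mathbf{Z}_i$ denotes the mapping between the $(i-1)$-th representation $\mathbf{H}_{i-1}$ and the $i$-th representation $\mathbf{H}_i$. In other words, $\mathbf{H}_{i-1}\approx\mathbf{Z}_i\mathbf{H}_i$. $m$ denotes the depth of Semi-NMF. Following previous work, we denote $\boldsymbol{\Phi}=\mathbf{Z}_1\mathbf{Z}_2\cdots\mathbf{Z}_{i-1}$. The optimization problem can be solved by alternately updating $\mathbf{Z}_i$ and $\mathbf{H}_i$:
i) Optimizing $\mathbf{Z}_i$ while the others are fixed. Eq. (4) can be written as the unconstrained problem $\min_{\mathbf{Z}_i}\|\mathbf{X}-\boldsymbol{\Phi}\mathbf{Z}_i\tilde{\mathbf{H}}_i\|_F^2$, where $\tilde{\mathbf{H}}_i$ denotes the reconstruction of the $i$-th layer's representation. Setting the derivative with respect to $\mathbf{Z}_i$ to zero gives the solution $\mathbf{Z}_i=\boldsymbol{\Phi}^{\dagger}\mathbf{X}\tilde{\mathbf{H}}_i^{\dagger}$.
ii) Optimizing $\mathbf{H}_i$ while the others are fixed. $\mathbf{H}_i$ can be optimized by solving $\min_{\mathbf{H}_i}\|\mathbf{X}-\boldsymbol{\Phi}\mathbf{Z}_i\mathbf{H}_i\|_F^2$ with the constraint $\mathbf{H}_i\ge 0$. Writing $\boldsymbol{\Psi}=\boldsymbol{\Phi}\mathbf{Z}_i$ and using the Lagrange method, we obtain the following update rule for $\mathbf{H}_i$, whose limiting solution satisfies the KKT condition:
$$\mathbf{H}_i \leftarrow \mathbf{H}_i \odot \sqrt{\frac{[\boldsymbol{\Psi}^{T}\mathbf{X}]^{+} + [\boldsymbol{\Psi}^{T}\boldsymbol{\Psi}]^{-}\mathbf{H}_i}{[\boldsymbol{\Psi}^{T}\mathbf{X}]^{-} + [\boldsymbol{\Psi}^{T}\boldsymbol{\Psi}]^{+}\mathbf{H}_i}} \tag{5}$$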
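One sweep of these two layer-wise updates can be sketched as follows. This is a hedged NumPy sketch under stated assumptions, not the authors' code: `deep_error` and `fine_tune_sweep` are hypothetical names, the product of the preceding mappings is called `Phi`, the reconstruction of the i-th representation from the deeper layers is called `H_tilde`, and `eps` is an added constant that guards the division.

```python
import numpy as np

def pos(A):
    return (np.abs(A) + A) / 2.0

def neg(A):
    return (np.abs(A) - A) / 2.0

def deep_error(X, Zs, H_last):
    """Squared error ||X - Z_1 Z_2 ... Z_m H_m||_F^2."""
    recon = H_last
    for Z in reversed(Zs):
        recon = Z @ recon
    return np.linalg.norm(X - recon) ** 2

def fine_tune_sweep(X, Zs, Hs, eps=1e-10):
    """One pass over all layers: update Z_i, then H_i (lists are modified in place)."""
    m = len(Zs)
    for i in range(m):
        # Phi = Z_1 ... Z_{i-1}; identity for the first layer.
        Phi = np.eye(X.shape[0])
        for Z in Zs[:i]:
            Phi = Phi @ Z
        # H_tilde: layer-i representation rebuilt from the deeper layers
        # (equal to H_m itself for the last layer).
        H_tilde = Hs[-1]
        for Z in reversed(Zs[i + 1:]):
            H_tilde = Z @ H_tilde
        # Least-squares update of the mapping Z_i.
        Zs[i] = np.linalg.pinv(Phi) @ X @ np.linalg.pinv(H_tilde)
        # Multiplicative update of H_i using Psi = Z_1 ... Z_i.
        Psi = Phi @ Zs[i]
        PtX, PtP = Psi.T @ X, Psi.T @ Psi
        numer = pos(PtX) + neg(PtP) @ Hs[i]
        denom = neg(PtX) + pos(PtP) @ Hs[i]
        Hs[i] = Hs[i] * np.sqrt(numer / (denom + eps))
    return Zs, Hs
```

Only the last representation enters the reconstruction error directly; the intermediate representations are still updated so that each layer remains a usable non-negative encoding of the data.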
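DMVC combines deep semi-NMF with multi-view clustering. It simultaneously preserves a fixed geometric structure and learns representations through multiple layers. Formally, multi-view clustering with deep semi-NMF can be mathematically written as,
$$\min_{\mathbf{Z}_i^{(v)},\mathbf{H}_i^{(v)},\mathbf{H}_m,\alpha^{(v)}}\ \sum_{v=1}^{V}\big(\alpha^{(v)}\big)^{\gamma}\Big(\|\mathbf{X}^{(v)}-\mathbf{Z}_1^{(v)}\mathbf{Z}_2^{(v)}\cdots\mathbf{Z}_m^{(v)}\mathbf{H}_m\|_F^2+\beta\,\mathrm{tr}\big(\mathbf{H}_m\mathbf{L}^{(v)}\mathbf{H}_m^{T}\big)\Big) \quad \text{s.t. } \mathbf{H}_i^{(v)}\ge 0,\ \mathbf{H}_m\ge 0,\ \sum_{v=1}^{V}\alpha^{(v)}=1,\ \alpha^{(v)}\ge 0 \tag{6}$$
We denote $\mathbf{X}=\{\mathbf{X}^{(1)},\dots,\mathbf{X}^{(V)}\}$, where $\mathbf{X}^{(v)}\in\mathbb{R}^{d_v\times n}$ represents the feature matrix of the $v$-th view with $n$ samples, each of which is a $d_v$-dimensional feature. Similar to subsection II-B, $\mathbf{Z}_1^{(v)}$ denotes the mapping between the feature matrix and the first representation of the $v$-th view, and $\mathbf{Z}_i^{(v)}$ denotes the mapping between the $(i-1)$-th representation and the $i$-th representation of the $v$-th view. $V$ is the number of views and $m$ is the number of layers, also called the depth of Semi-NMF. $\mathbf{H}_m$ is the consensus latent representation for all views. $\alpha^{(v)}$ is the weighting coefficient of the $v$-th view, $\gamma$ is a coefficient that controls the weight distribution, and $\beta$ balances the graph regularization term. $\mathbf{L}^{(v)}$ denotes the $v$-th graph Laplacian, where the affinity matrix $\mathbf{A}^{(v)}$ is constructed from the feature matrix $\mathbf{X}^{(v)}$ using $k$-nearest neighbors and $\mathbf{L}^{(v)}=\mathbf{D}^{(v)}-\mathbf{A}^{(v)}$, with $\mathbf{D}^{(v)}$ the diagonal degree matrix of $\mathbf{A}^{(v)}$. The optimization problem of Eq. (6) can be solved by alternately updating $\mathbf{Z}_i^{(v)}$, $\mathbf{H}_i^{(v)}$, $\mathbf{H}_m$ and $\alpha^{(v)}$. The update rules of $\mathbf{Z}_i^{(v)}$, $\mathbf{H}_i^{(v)}$ and $\mathbf{H}_m$ are similar to those of deep semi-NMF. To update $\alpha^{(v)}$, we can use the Lagrange method and take the derivative of the Lagrange function with respect to $\alpha^{(v)}$.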
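The graph Laplacian built from a k-nearest-neighbor graph, as used in the graph regularization above, can be sketched as follows. The sketch assumes binary edge weights, Euclidean distances and symmetrization by taking the union of neighborhoods; the source does not state these choices, and `knn_laplacian` is a hypothetical name.

```python
import numpy as np

def knn_laplacian(X, k=5):
    """Build a symmetric k-NN affinity A and the Laplacian L = D - A.

    X is a d x n feature matrix whose columns are samples.
    """
    n = X.shape[1]
    # Pairwise squared Euclidean distances between columns.
    sq = np.sum(X ** 2, axis=0)
    dist = sq[:, None] + sq[None, :] - 2.0 * (X.T @ X)
    np.fill_diagonal(dist, np.inf)           # a sample is not its own neighbor
    idx = np.argsort(dist, axis=1)[:, :k]    # k nearest neighbors per sample
    A = np.zeros((n, n))
    A[np.repeat(np.arange(n), k), idx.ravel()] = 1.0
    A = np.maximum(A, A.T)                   # assumed symmetrization (union)
    L = np.diag(A.sum(axis=1)) - A           # unnormalized graph Laplacian
    return A, L
```

With this construction every row of L sums to zero and L is positive semi-definite, which is what makes a trace term of the form tr(H L H^T) a smoothness penalty on the representation H.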
III The proposed method
|Feature matrix of the -th view|
|-th layer cluster centroid matrix of the -th view|
|-th layer cluster centroid matrix of the -th view|
|-th layer feature representation of the -th view|
|the similarity matrix of the -th view|
|Consensus similarity matrix|
We first introduce the basic notations of our method, as described in Table I. For ease of reading, we also explain them at the relevant places in the paper.
As we mentioned before, the representations of all views in the last layer should be different in theory and the global graph structure which represents the relationship between samples should be consistent. Therefore, different from DMVC, we assume that the feature representations of the last layer in different views are different and a consensus local structure matrix should be fused with individual structures. The idea can be mathematically expressed as follows,
The meaning of , and are similar to these symbols in Eq. (6) described in Table I. denotes the -th layer of the -th view. constructs the similarity matrix in different layer. is the weight coefficient of the -th view for . denotes the consensus similarity matrix. denotes the similarity score between -th and -th sample so we need to add the constraints and for . The larger value is, the more likely two samples belong to the same cluster. We hope to obtain normalized solution, so we add the constraint .
Inspired by the tricks of the initialization in , we have pre-trained all of the layers to initialize the variables and by decomposing layer by layer. Firstly, we decompose the feature matrix of the -th view , where and . Following this, we decompose the new feature matrix , where and . We repeat the above steps until all layers have been pre-trained. We pre-train each of the layers to have an initial approximation of the matrices and which can greatly reduce the time for follow-up work. Then we use the value of and to initialize and by setting and . At the beginning, we argue that each view has the same contribution, so we initialize by the construction of with the same weight.
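The layer-by-layer pre-training described above can be sketched for a single view as follows: factorize the feature matrix, then factorize the resulting representation, and so on. This is a minimal NumPy sketch under assumptions, not the authors' implementation; `semi_nmf`, `pretrain`, the random initialization, the iteration count and the `layer_sizes` argument are all hypothetical.

```python
import numpy as np

def semi_nmf(X, k, n_iter=50, eps=1e-10, seed=0):
    """Compact semi-NMF: X (d x n) ~ Z @ H with H >= 0."""
    rng = np.random.default_rng(seed)
    H = rng.random((k, X.shape[1])) + 0.1    # assumed random non-negative init
    for _ in range(n_iter):
        Z = X @ np.linalg.pinv(H)
        A, B = Z.T @ X, Z.T @ Z
        numer = (np.abs(A) + A) / 2 + ((np.abs(B) - B) / 2) @ H
        denom = (np.abs(A) - A) / 2 + ((np.abs(B) + B) / 2) @ H
        H = H * np.sqrt(numer / (denom + eps))
    # Refit Z to the final H so the returned pair is consistent.
    return X @ np.linalg.pinv(H), H

def pretrain(X, layer_sizes):
    """Greedy pre-training: layer i factorizes the output H_{i-1} of layer i-1."""
    Zs, Hs, current = [], [], X
    for k in layer_sizes:
        Z, H = semi_nmf(current, k)
        Zs.append(Z)
        Hs.append(H)
        current = H                          # the next layer decomposes this representation
    return Zs, Hs
```

For a multi-view dataset the same routine would be called once per view, for example `pretrain(X_v, [100, 50])` with hypothetical layer sizes, before the joint fine-tuning starts.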
Because the objective function Eq. (7) is a non-convex problem, it seems unlikely to solve this problem in one step. So we propose a five-step alternate optimization method to address this problem. To reduce the total reconstruction error of the model, we also need to alternately minimize and in each layer.
III-B1 Update rule for matrix
By fixing , , and (), we can update by solving the following problem without constraint,
where , by setting , we can give the solutions as,
where and . and denotes the reconstruction of the -th layer’s representation for the -th view.
III-B2 Update rule for matrix
By fixing , , and , we can update by solving the following problem,
where . Following the update rule in , the update rule for can be written as,
We also update here for faster convergence and a simpler implementation.
III-B3 Update rule for matrix
By fixing , , and , we can update by solving the following problem,
where the variables are defined as follows,
We first give the update rule of , followed by its proof.
The limiting solution of the update rule in Eq. (14) satisfies the KKT condition.
We introduce the Lagrangian function as
In order to satisfy the constraint , we introduce the Lagrangian multiplier . By setting , we can obtain:
From the complementary slackness condition, we can obtain,
III-B4 Update rule for matrix
By fixing , and , we can update by solving the following problem,
where . This problem admits a closed-form solution,
where is the -th row of , is the -th row of .
The problem in Eq. (19) can be easily rewritten as independent row-wise optimization problems as follows,
The Lagrangian function of Eq. (21) is,
where and are the Lagrangian multipliers for the constraints and respectively. Then the KKT condition is written as,
From these conditions, we can easily obtain Eq. (20).
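A row-wise problem with a sum-to-one constraint and a non-negativity constraint is commonly solved through its KKT conditions, and the result has the shape of a Euclidean projection onto the probability simplex. The sketch below illustrates that generic technique only. It assumes each row subproblem can be written as minimizing the squared distance between the row `s` and some target vector `q` (a hypothetical stand-in, since the exact form of Eq. (20) is not reproduced here), and `project_to_simplex` is a hypothetical name.

```python
import numpy as np

def project_to_simplex(q):
    """Euclidean projection of q onto {s : s >= 0, sum(s) = 1}.

    The threshold theta plays the role of the multiplier of the
    sum-to-one constraint; the max(., 0) comes from complementary
    slackness of the non-negativity constraint.
    """
    u = np.sort(q)[::-1]                     # entries in descending order
    css = np.cumsum(u) - 1.0
    ks = np.arange(1, q.size + 1)
    rho = np.nonzero(u - css / ks > 0)[0][-1]
    theta = css[rho] / (rho + 1)
    return np.maximum(q - theta, 0.0)
```

Applied to every row of a matrix, for instance with `np.apply_along_axis(project_to_simplex, 1, Q)` for a hypothetical target matrix `Q`, this yields a row-normalized, non-negative similarity matrix.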
III-B5 Update rule for coefficient
By fixing , and , we can update by solving the following problem,
Supposing , we have that,
Note that =, we have =. Taking them into Eq. (25), the optimization can be written as follows,
where the variables are defined as follows,
For every , we have that . So the matrix is a positive semi-definite matrix and quadratic programming could be used in Eq. (26).
III-C Analysis and discussions
Computational Complexity: Pre-training and fine-tuning are the two main stages of our proposed method, and we will analyze them separately. To make the analysis clearer, we assume the dimensions in all the layers are the same. So we denote and the dimensions of the original feature for all the views are the same which denoted . denotes the number of iterations to achieve convergence in pre-training process and denotes the number of iterations to achieve convergence in fine-tuning process. So the complexity of pre-training and fine-tuning stages are and respectively, where normally. In conclusion, the time complexity of our algorithm is .
Convergence: The objective function is bounded below by 0. When one variable is optimized with the others fixed, each of the four subproblems (optimizing and as one subproblem) is strictly convex, so the objective of Algorithm 1 decreases monotonically at each iteration. A monotonically decreasing sequence that is bounded below converges, so the proposed algorithm is convergent.
In this part, we evaluate the clustering performance, the parameter sensitivity, and the convergence of Algorithm 1 on six benchmark datasets.
IV-A Benchmark Datasets
We select six datasets of two types: image and text. The key information of the datasets is shown in Table II and the sample images from two image data sets are illustrated in Figure 2. The details of these datasets are given below:
HW contains 2000 images of the ten digit classes 0-9. Each class has 200 images, which are described by six views: Profile correlations (216), Fourier coefficients (76), Karhunen coefficients (64), Morphological (6), Pixel averages (240), and Zernike moments (47). The number in brackets is the dimension of each view. We use only two views, Profile correlations and Pixel averages.
BBCSport (http://mlg.ucd.ie/datasets/segment.html) is derived from the BBC Sport section. It contains 544 documents, and each document is split into two related segments as views. The dimensions of the two views are 3183 and 3203 respectively.
BBC (http://mlg.ucd.ie/datasets/segment.html) is derived from the BBC news corpora. It contains 685 documents, and each document is split into four related segments as views, whose dimensions are 4659, 4633, 4665, and 4684 respectively.
3Sources (http://mlg.ucd.ie/datasets/3sources.html) is a document dataset collected from BBC, Reuters, and The Guardian. It contains 169 documents, which belong to six different themes: technology, health, business, politics, entertainment, and sport.
ORL (http://www.cl.cam.ac.uk/research/dtg/) was created by the Olivetti Research Laboratory in Cambridge, England. It is a face dataset containing 400 images of 40 different people. For each subject, images are taken at different times and under different lighting, facial expressions (open or closed eyes, smiling or not smiling), and facial details (with glasses or not). Three kinds of features, namely intensity, LBP, and Gabor features, are extracted from each image to obtain three views.
IV-B Compared Methods
We compare our proposed Algorithm 1 with 10 state-of-the-art multi-view clustering algorithms. Eight of them comprise four matrix decomposition clustering algorithms, co-training algorithms and other SOTA multi-view clustering algorithms.
BKM performs k-means on every view separately and selects the best single-view result as the final result.
AKM is regarded as a baseline method. It concatenates all of the views as one view and performs k-means to get the final result.
Kernel-based weighted multi-view clustering (MVKKM)  expresses all views by given kernel matrices. A weighted combination of the kernels is learned in parallel to the partitioning.
A co-training approach for multi-view spectral clustering (Co-train) applies the idea of co-training to spectral clustering. It works on the assumption that the true underlying clustering assigns a point to the same cluster irrespective of the view, and it introduces no hyperparameters.
Adaptive Structure Concept Factorization for Multi-view Clustering (MVCF)  is a method for data integration. This method correlates the affinity weights of all views with the inter-view correlation.
Self-weighted multi-view clustering with soft capped norm (SCaMVC)  learns an optimal weight for each view automatically without introducing an additive parameter. It mainly deals with different level noises and outliers by using soft capped norm.
Multi-view clustering via deep semi-NMF (DMVC)  proposes a deep matrix factorization framework for MVC. A graph regularization term is added to a deep NMF framework for preserving the inherent structure of the origin data. It is required that the representation in the last layer of each view is the same.
Auto-weighted multi-view clustering via deep matrix decomposition (AwDMVC)  learns lower-level hidden attributes for the subsequent clustering task. The weights of different views are automatically assigned without introducing extra hyperparameters.
IV-C Experiment Setup
For the proposed method, all original feature matrices are normalized first. We set the number of clusters to the true number of classes for each dataset. The trade-off parameter is selected from . We assume that the layer size should be correlated with the number of clusters, so we design two schemes with one layer size and another layer size . Where in are chosen from and respectively and in are chosen from and respectively. The reason why the third layer of is fixed to is explained in the parameter sensitivity analysis (Section IV-H). For the compared methods, we obtain the papers and code from the authors' websites and follow the hyper-parameter settings given in the papers.
The clustering performance is evaluated by three widely used criteria: clustering accuracy (ACC), normalized mutual information (NMI), and purity (PUR). We repeat each experiment 50 times to reduce the effect of random initialization and keep the best result. All experiments are conducted on a desktop computer with an Intel i9-9900K CPU @ 3.60GHz × 16 and 64GB RAM, running MATLAB 2017a (64-bit).
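The three criteria can be computed from the contingency table between predicted clusters and ground-truth classes. The sketch below is illustrative and is not the evaluation code used in the experiments: the function names are hypothetical, ACC uses the Hungarian matching from SciPy's `linear_sum_assignment`, and NMI is normalized by the geometric mean of the two entropies, which is one common convention that the text does not specify.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def contingency(y_true, y_pred):
    """Count matrix C[cluster, class] over all samples."""
    classes, ti = np.unique(y_true, return_inverse=True)
    clusters, pi = np.unique(y_pred, return_inverse=True)
    C = np.zeros((clusters.size, classes.size), dtype=np.int64)
    np.add.at(C, (pi, ti), 1)
    return C

def clustering_accuracy(y_true, y_pred):
    """ACC: best one-to-one matching between clusters and classes."""
    C = contingency(y_true, y_pred)
    rows, cols = linear_sum_assignment(-C)   # negate to maximize matched samples
    return C[rows, cols].sum() / len(y_true)

def purity(y_true, y_pred):
    """PUR: each cluster is credited with its majority class."""
    C = contingency(y_true, y_pred)
    return C.max(axis=1).sum() / len(y_true)

def nmi(y_true, y_pred):
    """NMI with geometric-mean normalization (an assumed convention)."""
    P = contingency(y_true, y_pred).astype(float)
    P /= P.sum()
    pi, pj = P.sum(axis=1), P.sum(axis=0)
    nz = P > 0
    mi = np.sum(P[nz] * np.log(P[nz] / np.outer(pi, pj)[nz]))
    h_i = -np.sum(pi[pi > 0] * np.log(pi[pi > 0]))
    h_j = -np.sum(pj[pj > 0] * np.log(pj[pj > 0]))
    return mi / np.sqrt(h_i * h_j) if h_i > 0 and h_j > 0 else 0.0
```

All three scores are invariant to how the cluster labels are numbered, which is why a matching or majority vote is needed before predictions can be compared with the ground truth.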
IV-D Experiment Results
Table III, Table IV and Table V show the clustering performance measured by ACC, NMI, and PUR on the six benchmark datasets. The best result on each dataset is marked in bold. Based on these tables, we can draw the following conclusions:
Table III reports ACC. In most cases, concatenating all features and then running k-means works better than running k-means on each single view and keeping the best result, which suggests that using all the information in the data is usually better than using a single aspect. Our proposed method achieves the best ACC on all six datasets. The ACC of our algorithm exceeds the second best method by , , , , , and on BBCSport, 3Sources, BBC, CiteSeer, ORL and HW respectively. The image datasets ORL and HW stand out: both show only a small increase in ACC, and Co-train exceeds our algorithm by on NMI as reported in Table IV. This suggests that our algorithm is more suitable for text datasets.
Compared with DMVC, which also uses a deep semi-NMF framework, our proposed algorithm improves clustering performance on the six benchmark datasets. The results suggest that reconstructing the global graph during optimization, instead of using a fixed graph structure, learns a better representation of the original data and a better global consensus graph.
AwDMVC also uses the framework of deep NMF. It assigns a weight to each view automatically when learning feature representation layer by layer without other hyperparameters being introduced. We outperform AwDMVC on all datasets by a large margin, which shows the merits of combining the representation learning and consensus graph reconstructing.
The above experimental results demonstrate that our proposed method is effective compared with other state-of-the-art methods. We attribute the superiority of our proposed algorithm to three factors: i) Our proposed method is based on a deep matrix decomposition framework, so it is more likely to find a meaningful representation layer by layer. ii) We abandon the original graph structure, which leads to poor clustering, and use the newly learned representation to reconstruct a consistent graph for clustering. iii) We propose a framework that unifies representation learning and consensus graph construction, so that learning the representation and reconstructing the graph can promote each other.
IV-E Ablation Study
We conduct a set of ablation experiments in which the number of layers is one, two, and three. The purpose is to verify whether a deeper factorization makes hidden information easier to extract and yields more valuable representations.
We record the best parameters such as when the depth is three. As can be seen in Table VI, we compare the results of different depths, namely , , . The results with three layers are always better than those with two layers, and the results with two layers are always better than those with one. The performance improves on BBCSport, 3Sources, BBC, CiteSeer and ORL by , , , and when the number of layers changes from to , and by , , , and when the number of layers changes from to . It is therefore necessary to choose an appropriate number of layers for each dataset.
We visualize the clustering results of our algorithm in Figure 3 and compare our algorithm with DMVC and AwDMVC in Figure 4. In these figures, samples of the same class share a color. The closer the points of the same color and the farther apart the points of different colors, the better the clustering performance. Figure 3 shows that the difference between intra-class and inter-class structure becomes more and more obvious as the number of iterations increases on the datasets BBCSport, BBC, and HW. Figure 4 shows that the clustering effect of DMVC is the least obvious: some intra-class structures can be seen, but the boundaries between clusters are very vague or absent. AwDMVC groups a class into a ring structure, but a ring always contains samples of other classes, which greatly reduces the clustering effect. In contrast, clear cluster structures can be seen for our algorithm on the two benchmarks.
It is theoretically guaranteed that our algorithm converges to a local minimum. We also verify the convergence experimentally. As shown in Figure 5, the objective value curves are plotted in red for the datasets 3Sources, BBC, and BBCSport. The objective decreases monotonically and the algorithm usually converges in fewer than 150 iterations, which experimentally confirms the convergence of our algorithm.
IV-H Parameter Sensitivity
There are two sets of parameters in our proposed method, i.e., the balance coefficient and the size of layer . Next we will analyze the sensitivity of , the selection rules of the last parameter in and , and the sensitivity of and in .
IV-H1 Sensitivity of
Figure 6 shows the influence of ACC result concerning the parameter under the best layer size setting which is obtained in previous experiments. Each small picture in Figure 6 contains two curves, where the red one is ours, the blue one is the second best algorithm. We can find that our algorithm outperforms the second best algorithm in most range of the on most of the benchmarks even if it is a little sensitive to the parameter .
IV-H2 Sensitivity of and in
Figure 8 shows the sensitivity results of and in on ORL, 3Sources and CiteSeer. From these figures, we observe that the performance is relatively stable across most parameter combinations, even without following the selection rule mentioned above. Despite slight variation, our method outperforms most algorithms on most of the benchmarks.
In this paper, we propose a novel multi-view clustering framework with deep semi-NMF, which simultaneously optimizes deep representation learning and consensus graph construction. In other words, the deep representation can be refined by the global consensus graph and vice versa. Through the multi-layer projection and the guidance of a consensus geometric structure that is constrained by a graph, the learned representation can capture more hidden attributes of the original features. Extensive experiments are conducted on six benchmarks, demonstrating the effectiveness of our proposed algorithm in comparison with ten SOTA methods. In the future, we will consider learning a consensus representation directly with a rotation matrix and constructing a more discriminative consensus graph.
This work was supported by the National Key R&D Program of China 2018YFB1003203 and the National Natural Science Foundation of China (Grants No. 61672528, No. 61773392 and No. 61872377).
- Multi-view k-means clustering on big data. Twenty-Third International Joint Conference on Artificial Intelligence.
- Diversity-induced multi-view subspace clustering. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 586–594.
- (2021) Relaxed multi-view clustering in latent embedding space. Information Fusion 68, pp. 8–21.
- (2019) . IEEE Transactions on Multimedia.
- (2010) Convex and Semi-Nonnegative Matrix Factorizations. IEEE Transactions on Pattern Analysis and Machine Intelligence 32, pp. 45–55.
- (2010) Multi-view video summarization. IEEE Transactions on Multimedia 12 (7), pp. 717–729.
- Reducing the dimensionality of data with neural networks. Science 313, pp. 504–507.
- (2020) Robust visual tracking via constrained multi-kernel correlation filters. IEEE Transactions on Multimedia 22 (11), pp. 2820–2832.
- (2018) Self-weighted multi-view clustering with soft capped norm. Knowledge-Based Systems 158, pp. 1–8.
- (2020) Auto-weighted multi-view clustering via deep matrix decomposition. Pattern Recognition 97, pp. 107015.
- (2020) Partition level multiview subspace clustering. Neural Networks 122, pp. 279–288.
- (2011) A co-training approach for multi-view spectral clustering. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 393–400.
- (2011) Co-regularized multi-view spectral clustering. Advances in Neural Information Processing Systems 24, pp. 1413–1421.
- (2016) Multiple kernel clustering with local kernel alignment maximization. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, pp. 1704–1710.
- (2020) Multi-view spectral clustering with high-order optimal neighborhood laplacian matrix. IEEE Transactions on Knowledge and Data Engineering.
- (2013) Multi-view clustering via joint nonnegative matrix factorization. In Proceedings of the 2013 SIAM International Conference on Data Mining, pp. 252–260.
- (2020) Optimal neighborhood multiple kernel clustering with adaptive local kernels. IEEE Transactions on Knowledge and Data Engineering.
- (2016) Parameter-free auto-weighted multiple graph learning: a framework for multiview clustering and semi-supervised classification. In IJCAI, pp. 1881–1887.
- (2020) Anchor-based multiview subspace clustering with diversity regularization. IEEE MultiMedia 27 (4), pp. 91–101.
- (2020) Unsupervised multi-view clustering by squeezing hybrid knowledge from cross view and each view. IEEE Transactions on Multimedia.
- (2019) Adaptive hypergraph embedded semi-supervised multi-label image annotation. IEEE Transactions on Multimedia 21 (11), pp. 2837–2849.
- (2018) Learning a joint affinity graph for multiview subspace clustering. IEEE Transactions on Multimedia 21 (7), pp. 1724–1736.
- (2019) Learning a Joint Affinity Graph for Multiview Subspace Clustering. IEEE Transactions on Multimedia 21, pp. 1724–1736.
- (2014) A deep semi-nmf model for learning hidden representations. In International Conference on Machine Learning, pp. 1692–1700.