In representation learning (RL), two critical issues need to be considered. First, how to make the learned representations more interpretable? Interpretability is a must in many applications. For instance, in a clinical setting, when applying deep learning (DL) and machine learning (ML) models to learn representations for patients and use the representations to assist clinical decision-making, we need to explain the representations to physicians such that the decision-making process is transparent, rather than being black-box. Second, how to avoid overfitting? It is often the case that the learned representations yield good performance on the training data, but perform less well on the testing data. How to improve the generalization performance on previously unseen data is important.
In this paper, we make an attempt towards addressing these two issues, via a unified approach. DL/ML models designed for representation learning are typically parameterized with a collection of weight vectors, each aiming at capturing a certain latent feature. For example, neural networks are equipped with multiple layers of hidden units where each unit is parameterized by a weight vector. In another representation learning model – sparse coding , a dictionary of basis vectors are utilized to reconstruct the data. In the interpretation of RL models, a major part is to interpret the learned weight vectors. Typically, elements of a weight vector have one-to-one correspondence with observed features and a weight vector is oftentimes interpreted by examining the top observed-features that correspond to the largest weights in this vector. For instance, when applying SC to reconstruct documents that are represented with bag-of-words feature vectors, each dimension of a basis vector corresponds to one word in the vocabulary. To visualize/interpret a basis vector, one can inspect the words corresponding to the large values in this vector. To achieve better interpretability, various constraints have been imposed on the weight vectors. Some notable ones are: (1) Sparsity  – which encourages most weights to be zero. Observed features that have zeros weights are considered to be irrelevant and one can focus on interpreting a few non-zero weights. (2) Diversity  – which encourages different weight vectors to be mutually “different” (e.g., having larger angles ). By doing this, the redundancy among weight vectors is reduced and cognitively one can map each weight vector to a physical concept in a more unambiguous way. (3) Non-negativeness  – which encourages the weights to be nonnegative since in certain scenarios (e.g., bag of words representation of documents), it is difficult to make sense of negative weights. In this paper, we propose a new perspective of interpretability: less-overlapness, which encourages the weight vectors to have small overlap in supports111The support of a vector is the set of indices of nonzero entries in this vector.. By doing this, each weight vector is anchored on a unique subset of observed features without being redundant with other vectors, which greatly facilitates interpretation. For example, if topic models  are learned in such a way, each topic will be characterized by a few representative words and the representative words of different topics are different. Such topics are more amenable for interpretation. Besides improving interpretability, less-overlapness helps alleviate overfitting. It imposes a structural constraint over the weight vectors, thus can effectively shrink the complexity of the function class induced by the RL models and improve the generalization performance on unseen data.
To encourage less-overlapness, we propose a regularizer that simultaneously encourages different weight vectors to be close to being orthogonal and each vector to be sparse, which jointly encourage vectors’ supports to have small overlap. The major contributions of this work include:
We propose a new type of regularization approach which encourages less-overlapness, for the sake of improving interpretability and reducing overfitting.
We apply the proposed regularizer to two models: neural networks and sparse coding (SC), and derive an efficient ADMM-based algorithm for the regularized SC problem.
In experiments, we demonstrate the empirical effectiveness of this regularizer.
In this section, we propose a nonoverlapness-promoting regularizer and apply it to two models.
2.1 Nonoverlapness-Promoting Regularization
We assume the model is parameterized by vectors . For a vector , its support is defined as – the indices of nonzero entries in . We first define a score to measure the overlap between two vectors:
which is the Jaccard index of their supports. The smalleris, the less overlapped the two vectors are. For vector, the overlap score is defined as the sum of pairwise scores
This score function is not smooth, which will result in great difficulty for optimization if used as a regularizer. Instead, we propose a smooth function that is motivated from and can achieve a similar effect as . The basic idea is: to encourage small overlap, we can encourage (1) each vector has a small number of non-zero entries and (2) the intersection of supports among vectors is small. To realize (1), we use an L1 regularizer to encourage the vectors to be sparse. To realize (2), we encourage the vectors to be close to being orthogonal. For two sparse vectors, if they are close to orthogonal, then their supports are landed on different positions. As a result, the intersection of supports is small.
We follow the method proposed by  to promote orthogonality. To encourage two vectors and to be close to being orthogonal, one can make their norm , close to one and their inner product close to zero. Based on this, one can promote orthogonality among a set of vectors by encouraging the Gram matrix (
) of these vectors to be close to an identity matrix. Since and zero are off the diagonal of and respectively, and and one are on the diagonal of and respectively, encouraging close to essentially makes close to zero and close to one. As a result, and are encouraged to be close to being orthogonal. In , one way proposed to measure the “closeness” between two matrices is to use the log-determinant divergence (LDD) . The LDD between two positive definite matrices and is defined as where denotes matrix trace. The closeness between and can be achieved by encouraging their LDD to be small.
Combining the orthogonality-promoting LDD regularizer with the sparsity-promoting L1 regularizer together, we obtain the following LDD-L1 regularizer
where is a tradeoff parameter between these two regularizers. As verified in experiments, this regularizer can effectively promote non-overlapness. The formal analysis of the relationship between Eq.(3) and Eq.(2) will be left for future study. It is worth noting that either L1 or LDD alone is not sufficient to reduce overlap. As illustrated in Figure 1(a) where only L1 is applied, though the two vectors are sparse, their supports are completely overlapped. In Figure 1(b) where the LDD regularizer is applied, though the two vectors are very close to orthogonal, their supports are completely overlapped since they are dense. In Figure 1(c) where the LDD-L1 regularizer is used, the two vectors are sparse and are close to being orthogonal. As a result, their supports are not overlapped.
2.2 Case Studies
In this section, we apply the LDD-L1 regularizer to two models.
In a neural network (NN) with hidden layers, each hidden layer is equipped with units and each unit is connected with all units in layer . Hidden unit at layer is parameterized by a weight vector . These hidden units aim at capturing latent features underlying data. For weight vectors in each layer , we apply the LDD-L1 regularizer to encourage them to have small overlap. An LDD-L1 regularized NN problem (LDD-L1-NN) can be defined in the following way:
where is the objective function of this NN.
Given data samples where is the feature dimension, we aim to use a dictionary of basis vectors to reconstruct , where is the number of basis vectors. Each data sample is reconstructed by taking a sparse linear combination of the basis vectors , where are the linear coefficients and most of them are zero. The reconstruction error is measured using the squared L2 norm . To achieve sparsity among the codes, L1 regularization is utilized: . To avoid the degenerated case where most coefficients are zero and the basis vectors are of large magnitude, L2 regularization is applied to the basis vectors: . We apply the LDD-L1 regularizer to encourage the supports of basis vectors to have small overlap. Putting these pieces together, we obtain the LDD-L1 regularized SC (LDD-L1-SC) problem
where denotes all the linear coefficients.
For LDD-L1-NNs, a simple subgradient descent algorithm is applied to learn the weight parameters. For LDD-L1-SC, we solve it by alternating between and : (1) updating with fixed; (2) updating with fixed. These two steps alternate until convergence. With fixed, the sub-problem defined over is
which can be decomposed into Lasso problems: for
where is the coefficient vector of the -th sample. Lasso can be solved by many algorithms, such as proximal gradient descent (PGD) . Fixing , the sub-problem defined over is:
We solve this problem using an ADMM-based algorithm. First, we write the problem into an equivalent form
Then we write down the augmented Lagrangian function
We minimize this Lagrangian function by alternating among , and .
The subproblem defined on is
which is a Lasso problem and can be solved using PGD .
The update equation of is simple.
The subproblem defined on is
which can be solved using a coordinate descent algorithm. The derivation is given in the next subsection.
3.1 Coordinate Descent Algorithm for Learning
In each iteration of the CD algorithm, one basis vector is chosen for update while the others are fixed. Without loss of generality, we assume it is . The sub-problem defined over is
To obtain the optimal solution, we take the derivative of the objective function and set it to zero. First, we discuss how to compute the derivative of w.r.t
. According to the chain rule, we have
where denotes the first column of . Let , then
According to the inverse of block matrix
where , , , , we have equals where
To this end, we obtain the full gradient of the objective function in Eq.(13):
Setting the gradient to zero, we get
Let , , , then and . Let be the eigen decomposition of , we have
The matrix is idempotent, i.e., , and its rank is . According to the property of idempotent matrix, the first eigenvalues of equal to one and the rest equal to zero. Thereafter, the first eigenvalues of equal to zero and the rest equal to one. Based on this property, Eq.(24) can be simplified as
After simplification, it is a quadratic function where has a closed form solution. Then we plug the solution of into Eq.(23) to get the solution of .
In these section, we present experimental results. The studies were performed on three models: sparse coding (SC) for document modeling, long short-term memory (LSTM)
network for language modeling and convolutional neural network (CNN) for image classification.
We used four datasets. The SC experiments were conducted on two text datasets: 20-Newsgroups222http://qwone.com/~jason/20Newsgroups/ (20-News) and Reuters Corpus333http://www.daviddlewis.com/resources/testcollections/rcv1/ Volume 1 (RCV1). The 20-News dataset contains newsgroup documents belonging to 20 categories, where 11314, 3766 and 3766 documents were used for training, validation and testing respectively. The original RCV1 dataset contains documents belonging to 103 categories. Following , we chose the largest 4 categories which contain 9625 documents, to carry out the study. The number of training, validation and testing documents are 5775, 1925, 1925 respectively. For both datasets, stopwords were removed and all words were changed into lower-case. Top 1000 words with the highest document frequency were selected to form the vocabulary. We used tf-idf to represent documents and the feature vector of each document is normalized to have unit L2 norm.
The LSTM experiments were conducted on the Penn Treebank (PTB) dataset , which consists of 923K training, 73K validation, and 82K test words. Following , top 10K words with highest frequency were selected to form the vocabulary. All other words are replaced with a special token UNK.
The CNN experiments were performed on the CIFAR-10 dataset444https://www.cs.toronto.edu/~kriz/cifar.html
. It consists of 32x32 color images belonging to 10 categories, where 50,000 images were used for training and 10,000 for testing. 5000 training images were used as the validation set for hyperparameter tuning. We augmented the dataset by first zero-padding the images with 4 pixels on each side, then randomly cropping the padded images to reproduce 32x32 images.
4.2 LDD-L1 and Non-overlapness
First of all, we verify whether the LDD-L1 regularizer is able to promote non-overlapness. The study is performed on the SC model and the 20-News dataset. The number of basis vectors was set to 50. For 5 choices of the regularization parameter of LDD-L1: , we ran the LDD-L1-SC model until convergence and measured the overlap score (defined in Eq.2) of the basis vectors. The tradeoff parameter inside LDD-L1 is set to 1. Figure 2 shows that the overlap score consistently decreases as the regularization parameter of LDD-L1 increases, which implies that LDD-L1 can effectively encourage non-overlapness. As a comparison, we replaced LDD-L1 with LDD-only and L1-only, and measured the overlap scores. As can be seen, for LDD-only, the overlap score remains to be 1 when the regularization parameter increases, which indicates that LDD alone is not able to reduce overlap. This is because under LDD-only, the vectors remain dense, which renders their supports to be completely overlapped. Under the same regularization parameter, LDD-L1 achieves lower overlap score than L1, which suggests that LDD-L1 is more effective in promoting non-overlapness. Given that – the tradeoff parameter associated with the L1 norm in LDD-L1 – is set to 1, the same regularization parameter imposes the same level of sparsity for both LDD-L1 and L1-only. Since LDD-L1 encourages the vectors to be mutually orthogonal, the intersection between vectors’ supports is small, which consequently results in small overlap. This is not the case for L1-only, which hence is less effective in reducing overlap.
In this section, we examine whether the weight vectors learned under LDD-L1 regularization are more interpretable, using SC as a study case. For each basis vector learned by LDD-L1-SC on the 20-News dataset, we use the words (referred to as representative words) that correspond to the supports of to interpret . Table 1 shows the representative words of 9 exemplar vectors. By analyzing the representative words, we can see vector 1-9 represent the following semantics respectively: crime, faith, job, war, university, research, service, religion and Jews. The representative words of these vectors have no overlap. As a result, it is easy to associate each vector with a unique concept, in other words, easy to interpret. Figure 3 visualizes the learned vectors where the black dots denote vectors’ supports. As can be seen, the supports of different basis vectors are landed over different words and their overlap is very small.
|6||application, performance, ideas|
|9||christ, jews, land, faq|
4.4 Reducing Overfitting
In this section, we verify whether LDD-L1 is able to reduce overfitting. The studies were performed on SC, LSTM and CNN. In each experiment, the hyperparameters were tuned on the validation set.
For 20-News, the number of basis vectors in LDD-L1-SC is set to 50. , , and are set to 1, 1, 0.1 and 0.001 respectively. For RCV1, the number of basis vectors is set to 200. , , and are set to 0.01, 1, 1 and 1 respectively. We compared LDD-L1 with LDD-only and L1-only.
To evaluate the model performance quantitatively, we applied the dictionary learned on the training data to infer the linear coefficients ( in Eq.4) of test documents, then performed
-nearest neighbors (KNN) classification on. Table 2 shows the classification accuracy on test sets of 20-News and RCV1 and the gap555Training accuracy minus test accuracy. between the accuracy on training and test sets. Without regularization, SC achieves a test accuracy of 0.592 on 20-News, which is lower than the training accuracy by 0.119. This suggests that an overfitting to training data occurs. With LDD-L1 regularization, the test accuracy is improved to 0.612 and the gap between training and test accuracy is reduced to 0.099, demonstrating the ability of LDD-L1 in alleviating overfitting. Though LDD alone and L1 alone improve test accuracy and reduce train/test gap, they perform less well than LDD-L1, which indicates that for overfitting reduction, encouraging non-overlapness is more effective than solely promoting orthogonality or solely promoting sparsity. Similar observations are made on the RCV1 dataset. Interestingly, the test accuracy achieved by LDD-L1-SC on RCV1 is even better than the training accuracy.
|–||Test||Gap between train and test||Test||Gap between train and test|
LSTM for Language Modeling
The LSTM network architecture follows the word language model (PytorchTM) provided in Pytorch666https://github.com/pytorch/examples/tree/master/word_language_model
. The number of hidden layers is set to 2. The embedding size is 1500. The size of hidden state is 1500. The word embedding and softmax weights are tied. The number of training epochs is 40. Dropout with 0.65 is used. The initial learning rate is 20. Gradient clipping threshold is 0.25. The size of mini-batch is 20. In LSTM training, the network is unrolled for 35 iterations. Perplexity is used for evaluating language modeling performance (lower is better). The weight parameters are initialized uniformly between [-0.1, 0.1]. The bias parameters are initialized as 0. We compare with the following regularizers: (1) L1 regularizer; (2) orthogonality-promoting regularizers based on cosine similarity (CS), incoherence (IC) , mutual angle (MA) , decorrelation (DC) , angular constraint (AC)  and LDD .
|Deep RNN ||107.5|
|Sum-Product Network ||100.0|
|LSTM (medium) ||82.7|
|LSTM (large) ||78.4|
|Variational LSTM with MC Dropout ||73.4|
|Pointer Sentinel LSTM ||70.9|
|Ensemble of 38 Large LSTMs ||68.7|
|Ensemble of 10 Large Variational LSTMs ||68.7|
|Variational RHN ||68.5|
|Variational LSTM +REAL ||68.5|
|Neural Architecture Search ||67.9|
|Variational RHN +RE [15, 45]||66.0|
|Variational RHN + WT ||65.4|
|Variational RHN + WT with MC dropout ||64.4|
|Neural Architecture Search + WT V1 ||64.0|
|Neural Architecture Search + WT V2 ||62.4|
Table 3 shows the perplexity on the PTB test set. Without regularization, PytorchLM achieves a perplexity of 72.3. With LDD-L1 regularization, the perplexity is significantly reduced to 71.1. This shows that LDD-L1 can effectively improve generalization performance. Compared with the sparsity-promoting L1 regularizer and orthogonality-promoting regularizers, LDD-L1 – which promotes non-overlapness by simultaneously promoting sparsity and orthogonality – achieves lower perplexity. For the convenience of readers, we also list the perplexity achieved by other state of the art deep learning models. The LDD-L1 regularizer can be applied to these models as well to potentially boost their performance.
CNN for Image Classification
The CNN architecture follows that of the wide residual network (WideResNet) 
. The depth and width are set to 28 and 10 respectively. The networks are trained using SGD, where the epoch number is 200, the learning rate is set to 0.1 initially and is dropped by 0.2 at 60, 120 and 160 epochs, the minibatch size is 128 and the Nesterov momentum is 0.9. The dropout probability is 0.3 and the L2 weight decay is 0.0005. Model performance is measured using error rate, which is the median of 5 runs. We compared with (1) L1 regularizer; (2) orthogonality-promoting regularizers including CS, IC, MA, DC, AC, LDD and one based on locally constrained decorrelation (LCD).
Table 4 shows classification errors on CIFAR-10 test set. Compared with the unregularized WideResNet which achieves an error rate of 3.89%, the proposed LDD-L1 regularizer greatly reduces the error to 3.60%. LDD-L1 outperforms the L1 regularizer and orthogonality-promoting regularizers, demonstrating that encouraging non-overlapness is more effective than encouraging sparsity alone or orthogonality alone in reducing overfitting. The error rates achieved by other state of the art methods are also listed.
|Highway Network ||7.60|
5 Related Works
The interpretation of representation learning models has been widely studied. Choi et al. 
develop a two-level neural attention model that detects influential variables in a reverse time order and use these variables to interpret predictions. Lipton discuss a taxonomy of both the desiderata and methods in interpretability research. Koh and Liang  propose to use influence functions to trace a model’s prediction back to its training data and identify training examples that are most relevant to a prediction. Dong et al. 
integrate topics extracted from human descriptions into neural networks via an interpretive loss and then use a prediction-difference maximization algorithm to interpret the learned features of each neuron. Our method is orthogonal to these existing approaches and can be potentially used with them together to further improve interpretability.
In this paper, we propose a new type of regularization approach that encourages the weight vectors to have less-overlapped supports. The proposed LDD-L1 regularizer simultaneously encourages the weight vectors to be sparse and close to being orthogonal, which jointly produces the effects of less overlap. We apply this regularizer to two models: neural networks and sparse coding (SC), and derive an efficient ADMM-based algorithm for solving the regularized SC problem. Experiments on various datasets demonstrate the effectiveness of this regularizer in alleviating overfitting and improving interpretability.
-  Yebo Bao, Hui Jiang, Lirong Dai, and Cong Liu. Incoherent training of deep neural networks to decorrelate bottleneck features for speech recognition. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, 2013.
-  David M Blei, Andrew Y Ng, and Michael I Jordan. Latent dirichlet allocation. the Journal of machine Learning research, 2003.
-  Deng Cai and Xiaofei He. Manifold adaptive experimental design for text categorization. IEEE Transactions on Knowledge and Data Engineering, 24(4):707–719, 2012.
-  Wei-Chen Cheng, Stanley Kok, Hoai Vu Pham, Hai Leong Chieu, and Kian Ming A Chai. Language modeling with sum-product networks. In Fifteenth Annual Conference of the International Speech Communication Association, 2014.
-  Edward Choi, Mohammad Taha Bahadori, Jimeng Sun, Joshua Kulas, Andy Schuetz, and Walter Stewart. Retain: An interpretable predictive model for healthcare using reverse time attention mechanism. In Advances in Neural Information Processing Systems, pages 3504–3512, 2016.
-  Djork-Arné Clevert, Thomas Unterthiner, and Sepp Hochreiter. Fast and accurate deep network learning by exponential linear units (elus). arXiv preprint arXiv:1511.07289, 2015.
-  Michael Cogswell, Faruk Ahmed, Ross Girshick, Larry Zitnick, and Dhruv Batra. Reducing overfitting in deep networks by decorrelating representations. arXiv preprint arXiv:1511.06068, 2015.
-  Yinpeng Dong, Hang Su, Jun Zhu, and Bo Zhang. Improving interpretability of deep neural networks with semantic information. arXiv preprint arXiv:1703.04096, 2017.
Yarin Gal and Zoubin Ghahramani.
A theoretically grounded application of dropout in recurrent neural networks.In Advances in Neural Information Processing Systems, pages 1019–1027, 2016.
-  Ian J Goodfellow, David Warde-Farley, Mehdi Mirza, Aaron Courville, and Yoshua Bengio. Maxout networks. ICML, 2013.
-  Benjamin Graham. Fractional max-pooling. arXiv preprint arXiv:1412.6071, 2014.
-  Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In , pages 770–778, 2016.
-  Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
-  Gao Huang, Zhuang Liu, Kilian Q Weinberger, and Laurens van der Maaten. Densely connected convolutional networks. arXiv preprint arXiv:1608.06993, 2016.
-  Hakan Inan, Khashayar Khosravi, and Richard Socher. Tying word vectors and word classifiers: A loss framework for language modeling. arXiv preprint arXiv:1611.01462, 2016.
Yoon Kim, Yacine Jernite, David Sontag, and Alexander M Rush.
Character-aware neural language models.
Thirtieth AAAI Conference on Artificial Intelligence, 2016.
-  Pang Wei Koh and Percy Liang. Understanding black-box predictions via influence functions. Proceedings of the 34th International Conference on Machine Learning, 2017.
-  Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, 2012.
-  Brian Kulis, Mátyás A Sustik, and Inderjit S Dhillon. Low-rank kernel learning with bregman matrix divergences. Journal of Machine Learning Research, 10(Feb):341–376, 2009.
-  Chen-Yu Lee, Saining Xie, Patrick Gallagher, Zhengyou Zhang, and Zhuowen Tu. Deeply-supervised nets. In Artificial Intelligence and Statistics, pages 562–570, 2015.
-  Daniel D Lee and H Sebastian Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401(6755):788–791, 1999.
-  Min Lin, Qiang Chen, and Shuicheng Yan. Network in network. arXiv preprint arXiv:1312.4400, 2013.
-  Zachary C Lipton. The mythos of model interpretability. arXiv preprint arXiv:1606.03490, 2016.
-  Mitchell P Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. Building a large annotated corpus of english: The penn treebank. Computational linguistics, 19(2):313–330, 1993.
-  Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843, 2016.
-  Tomas Mikolov, Martin Karafiat, and Lukas Burget. Recurrent neural network based language model.
-  Tomas Mikolov and Geoffrey Zweig. Context dependent recurrent neural network language model. 2012.
-  Dmytro Mishkin and Jiri Matas. All you need is a good init. arXiv preprint arXiv:1511.06422, 2015.
-  Bruno A Olshausen and David J Field. Sparse coding with an overcomplete basis set: A strategy employed by v1? Vision research, 1997.
-  Neal Parikh and Stephen Boyd. Proximal algorithms. Foundations and Trends in Optimization, 1(3):127–239, 2014.
-  Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. How to construct deep recurrent neural networks. arXiv preprint arXiv:1312.6026, 2013.
-  Pau Rodríguez, Jordi Gonzàlez, Guillem Cucurull, Josep M Gonfaus, and Xavier Roca. Regularizing cnns with locally constrained decorrelations. arXiv preprint arXiv:1611.01967, 2016.
-  Jost Tobias Springenberg, Alexey Dosovitskiy, Thomas Brox, and Martin Riedmiller. Striving for simplicity: The all convolutional net. arXiv preprint arXiv:1412.6806, 2014.
-  Rupesh Kumar Srivastava, Klaus Greff, and Jürgen Schmidhuber. Highway networks. arXiv preprint arXiv:1505.00387, 2015.
-  Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), pages 267–288, 1996.
Yichen Wang, Robert Chen, Joydeep Ghosh, Joshua C Denny, Abel Kho, You Chen,
Bradley A Malin, and Jimeng Sun.
Rubik: Knowledge guided tensor factorization and completion for health data analytics.In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1265–1274. ACM, 2015.
Pengtao Xie, Yuntian Deng, and Eric P. Xing.
Diversifying restricted boltzmann machine for document modeling.In ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2015.
-  Pengtao Xie, Yuntian Deng, Yi Zhou, Abhimanu Kumar, Yaoliang Yu, James Zou, and Eric P. Xing. Learning latent space models with angular constraints. In Proceedings of the 34th International Conference on Machine Learning, pages 3799–3810, 2017.
-  Pengtao Xie, Barnabas Poczos, and Eric P Xing. Near-orthogonality regularization in kernel methods. 2017.
-  Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. arXiv preprint arXiv:1611.05431, 2016.
-  Yoshihiro Yamada, Masakazu Iwamura, and Koichi Kise. Deep pyramidal residual networks with separated stochastic depth. arXiv preprint arXiv:1612.01230, 2016.
-  Yang Yu, Yu-Feng Li, and Zhi-Hua Zhou. Diversity regularized machine. In International Joint Conference on Artificial Intelligence. Citeseer, 2011.
-  Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.
-  Wojciech Zaremba, Ilya Sutskever, and Oriol Vinyals. Recurrent neural network regularization. arXiv preprint arXiv:1409.2329, 2014.
-  Julian Georg Zilly, Rupesh Kumar Srivastava, Jan Koutník, and Jürgen Schmidhuber. Recurrent highway networks. arXiv preprint arXiv:1607.03474, 2016.
-  Barret Zoph and Quoc V Le. Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578, 2016.