1 Introduction
In representation learning (RL), two critical issues need to be considered. First, how can the learned representations be made more interpretable? Interpretability is a must in many applications. For instance, in a clinical setting, when applying deep learning (DL) and machine learning (ML) models to learn representations of patients and using those representations to assist clinical decision-making, we need to explain the representations to physicians so that the decision-making process is transparent rather than a black box. Second, how can overfitting be avoided? It is often the case that the learned representations yield good performance on the training data but perform less well on the test data. Improving the generalization performance on previously unseen data is important.
In this paper, we make an attempt to address these two issues via a unified approach. DL/ML models designed for representation learning are typically parameterized with a collection of weight vectors, each aiming to capture a certain latent feature. For example, neural networks are equipped with multiple layers of hidden units, where each unit is parameterized by a weight vector. In another representation learning model – sparse coding (SC) [29] – a dictionary of basis vectors is utilized to reconstruct the data. In the interpretation of RL models, a major part is to interpret the learned weight vectors. Typically, the elements of a weight vector have one-to-one correspondence with observed features, and a weight vector is oftentimes interpreted by examining the top observed features that correspond to the largest weights in this vector. For instance, when applying SC to reconstruct documents that are represented with bag-of-words feature vectors, each dimension of a basis vector corresponds to one word in the vocabulary. To visualize/interpret a basis vector, one can inspect the words corresponding to the large values in this vector. To achieve better interpretability, various constraints have been imposed on the weight vectors. Some notable ones are: (1) Sparsity [35] – which encourages most weights to be zero. Observed features that have zero weights are considered irrelevant, and one can focus on interpreting a few nonzero weights. (2) Diversity [36] – which encourages different weight vectors to be mutually "different" (e.g., having larger angles [37]). By doing this, the redundancy among weight vectors is reduced, and cognitively one can map each weight vector to a physical concept in a more unambiguous way. (3) Non-negativeness [21] – which encourages the weights to be non-negative, since in certain scenarios (e.g., bag-of-words representation of documents) it is difficult to make sense of negative weights.
In this paper, we propose a new perspective of interpretability: less-overlapness, which encourages the weight vectors to have small overlap in supports (the support of a vector is the set of indices of its nonzero entries). By doing this, each weight vector is anchored on a unique subset of observed features without being redundant with other vectors, which greatly facilitates interpretation. For example, if topic models [2] are learned in such a way, each topic will be characterized by a few representative words and the representative words of different topics will be different. Such topics are more amenable to interpretation. Besides improving interpretability, less-overlapness helps alleviate overfitting. It imposes a structural constraint on the weight vectors, which can effectively shrink the complexity of the function class induced by the RL models and improve the generalization performance on unseen data.
To encourage less-overlapness, we propose a regularizer that simultaneously encourages different weight vectors to be close to being orthogonal and each vector to be sparse, which jointly encourages the vectors' supports to have small overlap. The major contributions of this work include:

We propose a new type of regularization approach which encourages less-overlapness, for the sake of improving interpretability and reducing overfitting.

We apply the proposed regularizer to two models: neural networks and sparse coding (SC), and derive an efficient ADMM-based algorithm for the regularized SC problem.

In experiments, we demonstrate the empirical effectiveness of this regularizer.
2 Methods
In this section, we propose a non-overlapness-promoting regularizer and apply it to two models.
2.1 Non-overlapness-Promoting Regularization
We assume the model is parameterized by $m$ vectors $\mathbf{w}_1,\cdots,\mathbf{w}_m$. For a vector $\mathbf{w}$, its support is defined as $supp(\mathbf{w})=\{i \mid w_i \neq 0\}$ – the indices of nonzero entries in $\mathbf{w}$. We first define a score to measure the overlap between two vectors:

$$o(\mathbf{w}_i,\mathbf{w}_j)=\frac{|supp(\mathbf{w}_i)\cap supp(\mathbf{w}_j)|}{|supp(\mathbf{w}_i)\cup supp(\mathbf{w}_j)|} \quad (1)$$

which is the Jaccard index of their supports. The smaller $o(\mathbf{w}_i,\mathbf{w}_j)$ is, the less overlapped the two vectors are. For $m$ vectors, the overlap score is defined as the average of the pairwise scores:

$$o(\{\mathbf{w}_i\}_{i=1}^{m})=\frac{1}{m(m-1)}\sum_{i\neq j} o(\mathbf{w}_i,\mathbf{w}_j) \quad (2)$$
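To make these two scores concrete, here is a small sketch that computes them with numpy; the zero tolerance and the averaging over unordered pairs (equivalent to averaging over ordered pairs, so the score lies in [0, 1]) are implementation choices of this sketch:

```python
import numpy as np

def support(w, tol=1e-12):
    """Indices of the (numerically) nonzero entries of w."""
    return set(np.flatnonzero(np.abs(w) > tol))

def pairwise_overlap(wi, wj):
    """Jaccard index of the supports of two vectors, as in Eq. (1)."""
    si, sj = support(wi), support(wj)
    union = si | sj
    if not union:  # both vectors are all-zero
        return 0.0
    return len(si & sj) / len(union)

def overlap_score(vectors):
    """Overlap score of a set of vectors, as in Eq. (2):
    pairwise Jaccard scores averaged over all unordered pairs."""
    m = len(vectors)
    pairs = [pairwise_overlap(vectors[i], vectors[j])
             for i in range(m) for j in range(i + 1, m)]
    return sum(pairs) / len(pairs)
```

Note that for dense vectors the score is 1 regardless of the angle between them, which is why a purely orthogonality-promoting regularizer cannot reduce it.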
This score function is not smooth, which will cause great difficulty in optimization if it is used as a regularizer. Instead, we propose a smooth function that is motivated by, and can achieve a similar effect to, the overlap score. The basic idea is: to encourage small overlap, we can encourage (1) each vector to have a small number of nonzero entries and (2) the intersection of supports among vectors to be small. To realize (1), we use an L1 regularizer to encourage the vectors to be sparse. To realize (2), we encourage the vectors to be close to being orthogonal. For two sparse vectors, if they are close to being orthogonal, their supports land on different positions. As a result, the intersection of their supports is small.
We follow the method proposed by [39] to promote orthogonality. To encourage two vectors $\mathbf{w}_i$ and $\mathbf{w}_j$ to be close to being orthogonal, one can make their L2 norms $\|\mathbf{w}_i\|_2$, $\|\mathbf{w}_j\|_2$ close to one and their inner product $\mathbf{w}_i^{\top}\mathbf{w}_j$ close to zero. Based on this, one can promote orthogonality among a set of vectors by encouraging the Gram matrix $\mathbf{G}$ (with $G_{ij}=\mathbf{w}_i^{\top}\mathbf{w}_j$) of these vectors to be close to an identity matrix $\mathbf{I}$. Since $\mathbf{w}_i^{\top}\mathbf{w}_j$ and zero are off the diagonal of $\mathbf{G}$ and $\mathbf{I}$ respectively, and $\|\mathbf{w}_i\|_2^2$ and one are on the diagonal of $\mathbf{G}$ and $\mathbf{I}$ respectively, encouraging $\mathbf{G}$ to be close to $\mathbf{I}$ essentially makes $\mathbf{w}_i^{\top}\mathbf{w}_j$ close to zero and $\|\mathbf{w}_i\|_2$ close to one. As a result, $\mathbf{w}_i$ and $\mathbf{w}_j$ are encouraged to be close to being orthogonal. In [39], one way proposed to measure the "closeness" between two matrices is the log-determinant divergence (LDD) [19]. The LDD between two $m\times m$ positive definite matrices $\mathbf{X}$ and $\mathbf{Y}$ is defined as $D(\mathbf{X},\mathbf{Y})=\mathrm{tr}(\mathbf{X}\mathbf{Y}^{-1})-\log\det(\mathbf{X}\mathbf{Y}^{-1})-m$, where $\mathrm{tr}(\cdot)$ denotes the matrix trace. The closeness between $\mathbf{G}$ and $\mathbf{I}$ can be achieved by encouraging their LDD $D(\mathbf{G},\mathbf{I})=\mathrm{tr}(\mathbf{G})-\log\det(\mathbf{G})-m$ to be small. Combining the orthogonality-promoting LDD regularizer with the sparsity-promoting L1 regularizer, we obtain the following LDD-L1 regularizer

$$\Omega(\{\mathbf{w}_i\}_{i=1}^{m})=\mathrm{tr}(\mathbf{G})-\log\det(\mathbf{G})+\gamma\sum_{i=1}^{m}\|\mathbf{w}_i\|_1 \quad (3)$$
where $\gamma$ is a tradeoff parameter between these two regularizers. As verified in experiments, this regularizer can effectively promote non-overlapness. A formal analysis of the relationship between Eq.(3) and Eq.(2) is left for future study. It is worth noting that neither L1 nor LDD alone is sufficient to reduce overlap. As illustrated in Figure 1(a), where only L1 is applied, though the two vectors are sparse, their supports are completely overlapped. In Figure 1(b), where only the LDD regularizer is applied, though the two vectors are very close to orthogonal, their supports are completely overlapped since they are dense. In Figure 1(c), where the LDD-L1 regularizer is used, the two vectors are sparse and close to being orthogonal. As a result, their supports are not overlapped.
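A minimal sketch of evaluating Eq. (3), with the rows of a matrix W as the weight vectors; dropping the constant −m term of the LDD (it does not affect optimization) is an assumption of this sketch:

```python
import numpy as np

def ldd_l1(W, gamma=1.0):
    """LDD-L1 regularizer of Eq. (3): tr(G) - log det(G) + gamma * sum_i ||w_i||_1,
    where G = W W^T is the Gram matrix of the rows of W."""
    G = W @ W.T
    sign, logdet = np.linalg.slogdet(G)  # numerically stabler than log(det(G))
    assert sign > 0, "G must be positive definite (rows linearly independent)"
    return np.trace(G) - logdet + gamma * np.abs(W).sum()
```

For example, two sparse orthonormal rows score lower than two dense, highly correlated rows of similar scale, matching the intuition of Figure 1.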
2.2 Case Studies
In this section, we apply the LDD-L1 regularizer to two models.
Neural Networks
In a neural network (NN) with $L$ hidden layers, hidden layer $l$ is equipped with $m^{(l)}$ units and each unit is connected with all units in layer $l-1$. Hidden unit $i$ at layer $l$ is parameterized by a weight vector $\mathbf{w}_i^{(l)}$. These hidden units aim at capturing the latent features underlying the data. For the weight vectors in each layer $l$, we apply the LDD-L1 regularizer to encourage them to have small overlap. An LDD-L1 regularized NN problem (LDD-L1-NN) can be defined in the following way:

$$\min_{\mathbf{W}}\;\mathcal{L}(\mathbf{W})+\lambda\sum_{l=1}^{L}\Omega(\{\mathbf{w}_i^{(l)}\}_{i=1}^{m^{(l)}})$$

where $\mathcal{L}(\mathbf{W})$ is the objective function of this NN and $\lambda$ is a regularization parameter.
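Training such a network with subgradient descent (Section 3) needs a subgradient of Eq. (3) with respect to a layer's weight matrix. Using $\partial/\partial\mathbf{W}\,[\mathrm{tr}(\mathbf{W}\mathbf{W}^{\top})-\log\det(\mathbf{W}\mathbf{W}^{\top})]=2\mathbf{W}-2(\mathbf{W}\mathbf{W}^{\top})^{-1}\mathbf{W}$ and $\mathrm{sign}(\mathbf{W})$ as a subgradient of the L1 term, a sketch (rows of W are the per-unit weight vectors; the commented update line uses hypothetical names):

```python
import numpy as np

def ldd_l1_subgrad(W, gamma=1.0):
    """Subgradient of the LDD-L1 regularizer of Eq. (3) w.r.t. W (rows = weight
    vectors): 2W - 2(WW^T)^{-1}W for the LDD part, gamma*sign(W) for the L1 part."""
    G_inv = np.linalg.inv(W @ W.T)
    return 2.0 * W - 2.0 * G_inv @ W + gamma * np.sign(W)

# One subgradient step on layer l (illustrative; task_grad, lr, lam are hypothetical):
# W_l -= lr * (task_grad(W_l) + lam * ldd_l1_subgrad(W_l, gamma))
```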
Sparse Coding
Given $n$ data samples $\{\mathbf{x}_i\}_{i=1}^{n}$ where $\mathbf{x}_i\in\mathbb{R}^{d}$ and $d$ is the feature dimension, we aim to use a dictionary of $m$ basis vectors $\mathbf{W}=[\mathbf{w}_1,\cdots,\mathbf{w}_m]$ to reconstruct the data. Each data sample $\mathbf{x}_i$ is reconstructed by taking a sparse linear combination of the basis vectors, $\mathbf{x}_i\approx\sum_{j=1}^{m}a_{ij}\mathbf{w}_j$, where $a_{ij}$ are the linear coefficients and most of them are zero. The reconstruction error is measured using the squared L2 norm $\|\mathbf{x}_i-\sum_{j=1}^{m}a_{ij}\mathbf{w}_j\|_2^2$. To achieve sparsity among the codes, L1 regularization is utilized: $\sum_{j=1}^{m}|a_{ij}|$. To avoid the degenerate case where most coefficients are zero and the basis vectors are of large magnitude, L2 regularization is applied to the basis vectors: $\|\mathbf{w}_j\|_2^2$. We apply the LDD-L1 regularizer to encourage the supports of the basis vectors to have small overlap. Putting these pieces together, we obtain the LDD-L1 regularized SC (LDD-L1-SC) problem

$$\min_{\mathbf{W},\mathbf{A}}\;\frac{1}{2}\sum_{i=1}^{n}\Big\|\mathbf{x}_i-\sum_{j=1}^{m}a_{ij}\mathbf{w}_j\Big\|_2^{2}+\lambda_1\sum_{i,j}|a_{ij}|+\lambda_2\sum_{j=1}^{m}\|\mathbf{w}_j\|_2^{2}+\lambda_3\,\Omega(\{\mathbf{w}_j\}_{j=1}^{m}) \quad (4)$$

where $\mathbf{A}$ denotes all the linear coefficients.
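Putting the four terms together, a sketch of evaluating the LDD-L1-SC objective; the 1/2 factor, the column conventions (columns of X are samples, columns of W are basis vectors, columns of A are codes) and the parameter names are assumptions of this sketch:

```python
import numpy as np

def sc_objective(X, W, A, lam1, lam2, lam3, gamma):
    """LDD-L1-regularized sparse coding objective: reconstruction error +
    L1 on the codes + L2 on the basis vectors + LDD-L1 on the basis vectors."""
    recon = 0.5 * np.linalg.norm(X - W @ A, 'fro') ** 2
    l1_codes = lam1 * np.abs(A).sum()
    l2_basis = lam2 * np.linalg.norm(W, 'fro') ** 2
    G = W.T @ W  # Gram matrix of the basis vectors (columns of W)
    sign, logdet = np.linalg.slogdet(G)
    ldd_l1_term = lam3 * (np.trace(G) - logdet + gamma * np.abs(W).sum())
    return recon + l1_codes + l2_basis + ldd_l1_term
```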
3 Algorithm
For LDD-L1-NNs, a simple subgradient descent algorithm is applied to learn the weight parameters. For LDD-L1-SC, we solve it by alternating between $\mathbf{A}$ and $\mathbf{W}$: (1) updating $\mathbf{A}$ with $\mathbf{W}$ fixed; (2) updating $\mathbf{W}$ with $\mathbf{A}$ fixed. These two steps alternate until convergence. With $\mathbf{W}$ fixed, the subproblem defined over $\mathbf{A}$ is

$$\min_{\mathbf{A}}\;\frac{1}{2}\sum_{i=1}^{n}\Big\|\mathbf{x}_i-\sum_{j=1}^{m}a_{ij}\mathbf{w}_j\Big\|_2^{2}+\lambda_1\sum_{i,j}|a_{ij}| \quad (5)$$

which can be decomposed into $n$ Lasso problems: for $i=1,\cdots,n$,

$$\min_{\mathbf{a}_i}\;\frac{1}{2}\|\mathbf{x}_i-\mathbf{W}\mathbf{a}_i\|_2^{2}+\lambda_1\|\mathbf{a}_i\|_1 \quad (6)$$

where $\mathbf{a}_i$ is the coefficient vector of the $i$-th sample. Lasso can be solved by many algorithms, such as proximal gradient descent (PGD) [30].
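A minimal ISTA-style PGD sketch for this per-sample Lasso subproblem, with constant step size 1/L; the fixed iteration count is a simplification (a practical solver would check convergence instead):

```python
import numpy as np

def soft_threshold(z, t):
    """Proximal operator of t * ||.||_1."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_pgd(x, W, lam1, n_iter=500):
    """Solve min_a 0.5*||x - W a||^2 + lam1*||a||_1 by proximal gradient
    descent (ISTA) with step size 1/L, L = largest eigenvalue of W^T W."""
    L = np.linalg.eigvalsh(W.T @ W).max()
    a = np.zeros(W.shape[1])
    for _ in range(n_iter):
        grad = W.T @ (W @ a - x)               # gradient of the smooth part
        a = soft_threshold(a - grad / L, lam1 / L)
    return a
```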
Fixing $\mathbf{A}$, the subproblem defined over $\mathbf{W}$ is

$$\min_{\mathbf{W}}\;\frac{1}{2}\sum_{i=1}^{n}\Big\|\mathbf{x}_i-\sum_{j=1}^{m}a_{ij}\mathbf{w}_j\Big\|_2^{2}+\lambda_2\sum_{j=1}^{m}\|\mathbf{w}_j\|_2^{2}+\lambda_3\Big(\mathrm{tr}(\mathbf{G})-\log\det(\mathbf{G})+\gamma\sum_{j=1}^{m}\|\mathbf{w}_j\|_1\Big) \quad (7)$$

where $\mathbf{G}$ is the Gram matrix of the basis vectors, $G_{jk}=\mathbf{w}_j^{\top}\mathbf{w}_k$. We solve this problem using an ADMM-based algorithm. First, we introduce an auxiliary variable $\widetilde{\mathbf{W}}$ and write the problem into an equivalent form

$$\min_{\mathbf{W},\widetilde{\mathbf{W}}}\;\frac{1}{2}\sum_{i=1}^{n}\Big\|\mathbf{x}_i-\sum_{j=1}^{m}a_{ij}\mathbf{w}_j\Big\|_2^{2}+\lambda_2\sum_{j=1}^{m}\|\mathbf{w}_j\|_2^{2}+\lambda_3\gamma\sum_{j=1}^{m}\|\mathbf{w}_j\|_1+\lambda_3\big(\mathrm{tr}(\widetilde{\mathbf{G}})-\log\det(\widetilde{\mathbf{G}})\big)\quad\text{s.t. }\mathbf{W}=\widetilde{\mathbf{W}} \quad (8)$$

where $\widetilde{\mathbf{G}}$ is the Gram matrix of the columns of $\widetilde{\mathbf{W}}$. Then we write down the augmented Lagrangian function

$$\frac{1}{2}\sum_{i=1}^{n}\Big\|\mathbf{x}_i-\sum_{j=1}^{m}a_{ij}\mathbf{w}_j\Big\|_2^{2}+\lambda_2\sum_{j=1}^{m}\|\mathbf{w}_j\|_2^{2}+\lambda_3\gamma\sum_{j=1}^{m}\|\mathbf{w}_j\|_1+\lambda_3\big(\mathrm{tr}(\widetilde{\mathbf{G}})-\log\det(\widetilde{\mathbf{G}})\big)+\langle\mathbf{U},\mathbf{W}-\widetilde{\mathbf{W}}\rangle+\frac{\rho}{2}\|\mathbf{W}-\widetilde{\mathbf{W}}\|_F^{2} \quad (9)$$
We minimize this Lagrangian function by alternating among $\mathbf{W}$, $\widetilde{\mathbf{W}}$ and $\mathbf{U}$.
Update $\mathbf{W}$
Update $\widetilde{\mathbf{W}}$ and $\mathbf{U}$
The update equation of $\mathbf{U}$ is simple:

$$\mathbf{U}\leftarrow\mathbf{U}+\rho(\mathbf{W}-\widetilde{\mathbf{W}}) \quad (11)$$

The subproblem defined on $\widetilde{\mathbf{W}}$ is

$$\min_{\widetilde{\mathbf{W}}}\;\lambda_3\big(\mathrm{tr}(\widetilde{\mathbf{G}})-\log\det(\widetilde{\mathbf{G}})\big)+\langle\mathbf{U},\mathbf{W}-\widetilde{\mathbf{W}}\rangle+\frac{\rho}{2}\|\mathbf{W}-\widetilde{\mathbf{W}}\|_F^{2} \quad (12)$$

which can be solved using a coordinate descent (CD) algorithm. The derivation is given in the next subsection.
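For illustration, this subproblem can also be attacked with plain gradient descent instead of coordinate descent; completing the square turns the linear and quadratic terms into $(\rho/2)\|\widetilde{\mathbf{W}}-\mathbf{B}\|_F^2$ with $\mathbf{B}=\mathbf{W}+\mathbf{U}/\rho$. The step-size heuristic and fixed iteration count below are assumptions of this sketch:

```python
import numpy as np

def solve_w_tilde(B, lam3, rho, n_iter=2000):
    """First-order solver for min_Wt lam3*(tr(G) - log det(G)) + (rho/2)*||Wt - B||_F^2,
    G = Wt^T Wt (columns of Wt are the basis vectors), with B = W + U/rho.
    A gradient-descent alternative to the coordinate descent of Section 3.1."""
    Wt = B.copy()
    lr = 1.0 / (2.0 * lam3 + rho)  # crude step-size heuristic (assumption)
    for _ in range(n_iter):
        G_inv = np.linalg.inv(Wt.T @ Wt)
        # d/dWt [tr(G) - log det(G)] = 2*Wt - 2*Wt*G^{-1}
        grad = 2.0 * lam3 * (Wt - Wt @ G_inv) + rho * (Wt - B)
        Wt -= lr * grad
    return Wt
```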
3.1 Coordinate Descent Algorithm for Learning $\widetilde{\mathbf{W}}$
In each iteration of the CD algorithm, one basis vector is chosen for update while the others are fixed. Without loss of generality, we assume it is $\widetilde{\mathbf{w}}_1$. The subproblem defined over $\widetilde{\mathbf{w}}_1$ is

$$\min_{\widetilde{\mathbf{w}}_1}\;\lambda_3\big(\|\widetilde{\mathbf{w}}_1\|_2^{2}-\log\det(\widetilde{\mathbf{G}})\big)-\mathbf{u}_1^{\top}\widetilde{\mathbf{w}}_1+\frac{\rho}{2}\|\mathbf{w}_1-\widetilde{\mathbf{w}}_1\|_2^{2} \quad (13)$$

where $\mathbf{u}_1$ and $\mathbf{w}_1$ denote the first columns of $\mathbf{U}$ and $\mathbf{W}$ respectively.
To obtain the optimal solution, we take the derivative of the objective function and set it to zero. First, we discuss how to compute the derivative of $\log\det(\widetilde{\mathbf{G}})$ w.r.t. $\widetilde{\mathbf{w}}_1$. According to the chain rule, we have

$$\frac{\partial \log\det(\widetilde{\mathbf{G}})}{\partial \widetilde{\mathbf{w}}_1}=2\widetilde{\mathbf{W}}(\widetilde{\mathbf{G}}^{-1})_{:,1} \quad (14)$$

where $(\widetilde{\mathbf{G}}^{-1})_{:,1}$ denotes the first column of $\widetilde{\mathbf{G}}^{-1}$. Partitioning $\widetilde{\mathbf{G}}$ with respect to $\widetilde{\mathbf{w}}_1$ and the remaining basis vectors, we have
(15) 
According to the formula for the inverse of a block matrix,
(16) 
where , , , , we have equals where
(17) 
(18) 
Then
(19) 
where
(20) 
We thus obtain the full gradient of the objective function in Eq.(13):
(21) 
Setting the gradient to zero, we get
(22) 
Let , , , then and . Let be the eigen decomposition of , we have
(23) 
Then
(24) 
The matrix is idempotent, i.e., multiplying it by itself yields itself. According to the spectral property of idempotent matrices, its eigenvalues are either one or zero: the leading eigenvalues equal one and the rest equal zero. Consequently, the leading eigenvalues of the complementary matrix equal zero and the rest equal one. Based on this property, Eq.(24) can be simplified as
(25) 
After simplification, it becomes a quadratic function that has a closed-form solution. Then we plug this solution into Eq.(23) to obtain the solution of the original subproblem.
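The simplification above rests on the stated spectral property: if $\mathbf{P}^2=\mathbf{P}$, every eigenvalue of $\mathbf{P}$ is 0 or 1, with as many unit eigenvalues as the rank. A quick numerical illustration using an orthogonal projector (a generic idempotent matrix, not the specific matrix of the derivation):

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((5, 2))
# Orthogonal projector onto the column space of B; idempotent, rank 2.
P = B @ np.linalg.inv(B.T @ B) @ B.T
assert np.allclose(P @ P, P)                    # idempotence
eigvals = np.sort(np.linalg.eigvalsh(P))
# rank-2 projector in R^5: three zero eigenvalues, two equal to one
assert np.allclose(eigvals, [0., 0., 0., 1., 1.], atol=1e-8)
```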
4 Experiments
In this section, we present experimental results. The studies were performed on three models: sparse coding (SC) for document modeling, the long short-term memory (LSTM) network [13] for language modeling, and the convolutional neural network (CNN) [18] for image classification.
4.1 Datasets
We used four datasets. The SC experiments were conducted on two text datasets: 20Newsgroups (20News, http://qwone.com/~jason/20Newsgroups/) and Reuters Corpus Volume 1 (RCV1, http://www.daviddlewis.com/resources/testcollections/rcv1/). The 20News dataset contains newsgroup documents belonging to 20 categories, where 11314, 3766 and 3766 documents were used for training, validation and testing respectively. The original RCV1 dataset contains documents belonging to 103 categories. Following [3], we chose the 4 largest categories, which contain 9625 documents, to carry out the study. The numbers of training, validation and testing documents are 5775, 1925 and 1925 respectively. For both datasets, stopwords were removed and all words were changed into lowercase. The top 1000 words with the highest document frequency were selected to form the vocabulary. We used tf-idf to represent documents, and the feature vector of each document was normalized to have unit L2 norm.
The LSTM experiments were conducted on the Penn Treebank (PTB) dataset [24], which consists of 923K training, 73K validation and 82K test words. Following [26], the top 10K words with the highest frequency were selected to form the vocabulary. All other words were replaced with a special token UNK.
The CNN experiments were performed on the CIFAR-10 dataset (https://www.cs.toronto.edu/~kriz/cifar.html). It consists of 32×32 color images belonging to 10 categories, where 50,000 images were used for training and 10,000 for testing. 5,000 of the training images were used as the validation set for hyperparameter tuning. We augmented the dataset by first zero-padding the images with 4 pixels on each side, then randomly cropping the padded images to produce 32×32 images.
4.2 LDD-L1 and Non-overlapness
First of all, we verify whether the LDD-L1 regularizer is able to promote non-overlapness. The study is performed on the SC model and the 20News dataset. The number of basis vectors was set to 50. For 5 choices of the regularization parameter $\lambda_3$ of LDD-L1, we ran the LDD-L1-SC model until convergence and measured the overlap score (defined in Eq. 2) of the basis vectors. The tradeoff parameter $\gamma$ inside LDD-L1 is set to 1. Figure 2 shows that the overlap score consistently decreases as the regularization parameter of LDD-L1 increases, which implies that LDD-L1 can effectively encourage non-overlapness. As a comparison, we replaced LDD-L1 with LDD-only and L1-only, and measured the overlap scores. As can be seen, for LDD-only, the overlap score remains 1 as the regularization parameter increases, which indicates that LDD alone is not able to reduce overlap. This is because under LDD-only, the vectors remain dense, which renders their supports completely overlapped. Under the same regularization parameter, LDD-L1 achieves a lower overlap score than L1, which suggests that LDD-L1 is more effective in promoting non-overlapness. Given that $\gamma$ – the tradeoff parameter associated with the L1 norm in LDD-L1 – is set to 1, the same regularization parameter imposes the same level of sparsity for both LDD-L1 and L1-only. Since LDD-L1 encourages the vectors to be mutually orthogonal, the intersection between the vectors' supports is small, which consequently results in small overlap. This is not the case for L1-only, which is hence less effective in reducing overlap.
4.3 Interpretability
In this section, we examine whether the weight vectors learned under LDD-L1 regularization are more interpretable, using SC as a study case. For each basis vector $\mathbf{w}$ learned by LDD-L1-SC on the 20News dataset, we use the words (referred to as representative words) that correspond to the support of $\mathbf{w}$ to interpret it. Table 1 shows the representative words of 9 exemplar vectors. By analyzing the representative words, we can see that vectors 1–9 represent the following semantics respectively: crime, faith, job, war, university, research, service, religion and Jews. The representative words of these vectors have no overlap. As a result, it is easy to associate each vector with a unique concept; in other words, the vectors are easy to interpret. Figure 3 visualizes the learned vectors, where the black dots denote the vectors' supports. As can be seen, the supports of different basis vectors land on different words and their overlap is very small.
Vector  Representative Words 

1  crime, guns 
2  faith, trust 
3  worked, manager 
4  weapons, citizens 
5  board, uiuc 
6  application, performance, ideas 
7  service, quality 
8  bible, moral 
9  christ, jews, land, faq 
4.4 Reducing Overfitting
In this section, we verify whether LDD-L1 is able to reduce overfitting. The studies were performed on SC, LSTM and CNN. In each experiment, the hyperparameters were tuned on the validation set.
Sparse Coding
For 20News, the number of basis vectors in LDD-L1-SC is set to 50. $\lambda_1$, $\lambda_2$, $\lambda_3$ and $\gamma$ are set to 1, 1, 0.1 and 0.001 respectively. For RCV1, the number of basis vectors is set to 200. $\lambda_1$, $\lambda_2$, $\lambda_3$ and $\gamma$ are set to 0.01, 1, 1 and 1 respectively. We compared LDD-L1 with LDD-only and L1-only.
To evaluate the model performance quantitatively, we applied the dictionary learned on the training data to infer the linear coefficients ($\mathbf{A}$ in Eq. 4) of the test documents, then performed k-nearest-neighbors (KNN) classification on the coefficients. Table 2 shows the classification accuracy on the test sets of 20News and RCV1, together with the gap (training accuracy minus test accuracy) between the accuracy on the training and test sets. Without regularization, SC achieves a test accuracy of 0.592 on 20News, which is lower than the training accuracy by 0.119, suggesting that overfitting to the training data occurs. With LDD-L1 regularization, the test accuracy is improved to 0.612 and the gap between training and test accuracy is reduced to 0.099, demonstrating the ability of LDD-L1 to alleviate overfitting. Though LDD alone and L1 alone improve test accuracy and reduce the train/test gap, they perform less well than LDD-L1, which indicates that for overfitting reduction, encouraging non-overlapness is more effective than solely promoting orthogonality or solely promoting sparsity. Similar observations are made on the RCV1 dataset. Interestingly, the test accuracy achieved by LDD-L1-SC on RCV1 is even better than the training accuracy.
Method  20News Test  20News Gap  RCV1 Test  RCV1 Gap
SC  0.592  0.119  0.872  0.009
LDD-SC  0.605  0.108  0.886  0.005
L1-SC  0.606  0.105  0.897  0.005
LDD-L1-SC  0.612  0.099  0.909  −0.015
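The evaluation protocol (classify test documents by running KNN on their inferred codes) can be sketched as follows; `knn_predict` is a hypothetical helper (columns as samples, Euclidean distance), not the exact setup used in the experiments:

```python
import numpy as np

def knn_predict(A_train, y_train, A_test, k=1):
    """k-NN classification on inferred sparse codes.
    Columns of A_train/A_test are samples; y_train holds integer labels."""
    preds = []
    for a in A_test.T:
        dists = np.linalg.norm(A_train.T - a, axis=1)   # Euclidean distances
        nearest = np.argsort(dists)[:k]                  # indices of k nearest codes
        preds.append(np.bincount(y_train[nearest]).argmax())  # majority vote
    return np.array(preds)
```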
LSTM for Language Modeling
The LSTM network architecture follows the word-level language model (Pytorch-LM) provided in Pytorch (https://github.com/pytorch/examples/tree/master/word_language_model). The number of hidden layers is set to 2. The embedding size is 1500. The size of the hidden state is 1500. The word-embedding and softmax weights are tied. The number of training epochs is 40. Dropout with probability 0.65 is used. The initial learning rate is 20. The gradient clipping threshold is 0.25. The minibatch size is 20. In LSTM training, the network is unrolled for 35 steps. Perplexity is used to evaluate language modeling performance (lower is better). The weight parameters are initialized uniformly in [-0.1, 0.1]. The bias parameters are initialized to 0. We compare with the following regularizers: (1) the L1 regularizer; (2) orthogonality-promoting regularizers based on cosine similarity (CS) [42], incoherence (IC) [1], mutual angle (MA) [37], decorrelation (DC) [7], angular constraint (AC) [38] and LDD [39].
Network  Test

RNN [27]  124.7 
RNN+LDA [27]  113.7 
Deep RNN [31]  107.5 
SumProduct Network [4]  100.0 
RNN+LDA+KN5+Cache [27]  92.0 
LSTM (medium) [44]  82.7 
CharCNN [16]  78.9 
LSTM (large) [44]  78.4 
Variational LSTM with MC Dropout [9]  73.4 
Pytorch-LM  72.3 
CS-Pytorch-LM [42]  71.8 
IC-Pytorch-LM [1]  71.9 
MA-Pytorch-LM [37]  72.0 
DC-Pytorch-LM [7]  72.2 
AC-Pytorch-LM [38]  71.5 
LDD-Pytorch-LM [39]  71.6 
L1-Pytorch-LM  71.8 
LDD-L1-Pytorch-LM  71.1 
Pointer Sentinel LSTM [25]  70.9 
Ensemble of 38 Large LSTMs [44]  68.7 
Ensemble of 10 Large Variational LSTMs [9]  68.7 
Variational RHN [45]  68.5 
Variational LSTM +REAL [15]  68.5 
Neural Architecture Search [46]  67.9 
Variational RHN +RE [15, 45]  66.0 
Variational RHN + WT [45]  65.4 
Variational RHN + WT with MC dropout [45]  64.4 
Neural Architecture Search + WT V1 [46]  64.0 
Neural Architecture Search + WT V2 [46]  62.4 
Table 3 shows the perplexity on the PTB test set. Without regularization, Pytorch-LM achieves a perplexity of 72.3. With LDD-L1 regularization, the perplexity is reduced to 71.1. This shows that LDD-L1 can effectively improve generalization performance. Compared with the sparsity-promoting L1 regularizer and the orthogonality-promoting regularizers, LDD-L1 – which promotes non-overlapness by simultaneously promoting sparsity and orthogonality – achieves lower perplexity. For the convenience of readers, we also list the perplexity achieved by other state-of-the-art deep learning models. The LDD-L1 regularizer can be applied to these models as well to potentially boost their performance.
CNN for Image Classification
The CNN architecture follows that of the wide residual network (WideResNet) [43]. The depth and width are set to 28 and 10 respectively. The networks are trained using SGD, where the number of epochs is 200, the learning rate is set to 0.1 initially and is multiplied by 0.2 at epochs 60, 120 and 160, the minibatch size is 128 and the Nesterov momentum is 0.9. The dropout probability is 0.3 and the L2 weight decay is 0.0005. Model performance is measured using the error rate, reported as the median of 5 runs. We compared with (1) the L1 regularizer; (2) orthogonality-promoting regularizers including CS, IC, MA, DC, AC, LDD and one based on locally constrained decorrelation (LCD) [32].
Table 4 shows classification errors on the CIFAR-10 test set. Compared with the unregularized WideResNet, which achieves an error rate of 3.89%, the proposed LDD-L1 regularizer greatly reduces the error to 3.60%. LDD-L1 outperforms the L1 regularizer and the orthogonality-promoting regularizers, demonstrating that encouraging non-overlapness is more effective than encouraging sparsity alone or orthogonality alone in reducing overfitting. The error rates achieved by other state-of-the-art methods are also listed.
Network  Error 

Maxout [10]  9.38 
NiN [22]  8.81 
DSN [20]  7.97 
Highway Network [34]  7.60 
AllCNN [33]  7.25 
ResNet [12]  6.61 
ELUNetwork [6]  6.55 
LSUV [28]  5.84 
Fract. Max-Pooling [11]  4.50 
WideResNet [43]  3.89 
CS-WideResNet [42]  3.81 
IC-WideResNet [1]  3.85 
MA-WideResNet [37]  3.68 
DC-WideResNet [7]  3.77 
LCD-WideResNet [32]  3.69 
AC-WideResNet [38]  3.63 
LDD-WideResNet [39]  3.65 
L1-WideResNet  3.81 
LDD-L1-WideResNet  3.60 
ResNeXt [40]  3.58 
PyramidNet [14]  3.48 
DenseNet [14]  3.46 
PyramidSepDrop [41]  3.31 
5 Related Works
The interpretation of representation learning models has been widely studied. Choi et al. [5] develop a two-level neural attention model that detects influential variables in reverse time order and uses these variables to interpret predictions. Lipton [23] discusses a taxonomy of both the desiderata and the methods in interpretability research. Koh and Liang [17] propose to use influence functions to trace a model's prediction back to its training data and identify the training examples that are most relevant to a prediction. Dong et al. [8] integrate topics extracted from human descriptions into neural networks via an interpretive loss and then use a prediction-difference maximization algorithm to interpret the learned features of each neuron. Our method is orthogonal to these existing approaches and can potentially be used together with them to further improve interpretability.
6 Conclusions
In this paper, we propose a new type of regularization approach that encourages the weight vectors to have less-overlapped supports. The proposed LDD-L1 regularizer simultaneously encourages the weight vectors to be sparse and close to being orthogonal, which jointly produces the effect of less overlap. We apply this regularizer to two models: neural networks and sparse coding (SC), and derive an efficient ADMM-based algorithm for solving the regularized SC problem. Experiments on various datasets demonstrate the effectiveness of this regularizer in alleviating overfitting and improving interpretability.
References
 [1] Yebo Bao, Hui Jiang, Lirong Dai, and Cong Liu. Incoherent training of deep neural networks to decorrelate bottleneck features for speech recognition. In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, 2013.
 [2] David M Blei, Andrew Y Ng, and Michael I Jordan. Latent dirichlet allocation. the Journal of machine Learning research, 2003.
 [3] Deng Cai and Xiaofei He. Manifold adaptive experimental design for text categorization. IEEE Transactions on Knowledge and Data Engineering, 24(4):707–719, 2012.
 [4] WeiChen Cheng, Stanley Kok, Hoai Vu Pham, Hai Leong Chieu, and Kian Ming A Chai. Language modeling with sumproduct networks. In Fifteenth Annual Conference of the International Speech Communication Association, 2014.
 [5] Edward Choi, Mohammad Taha Bahadori, Jimeng Sun, Joshua Kulas, Andy Schuetz, and Walter Stewart. Retain: An interpretable predictive model for healthcare using reverse time attention mechanism. In Advances in Neural Information Processing Systems, pages 3504–3512, 2016.
 [6] DjorkArné Clevert, Thomas Unterthiner, and Sepp Hochreiter. Fast and accurate deep network learning by exponential linear units (elus). arXiv preprint arXiv:1511.07289, 2015.
 [7] Michael Cogswell, Faruk Ahmed, Ross Girshick, Larry Zitnick, and Dhruv Batra. Reducing overfitting in deep networks by decorrelating representations. arXiv preprint arXiv:1511.06068, 2015.
 [8] Yinpeng Dong, Hang Su, Jun Zhu, and Bo Zhang. Improving interpretability of deep neural networks with semantic information. arXiv preprint arXiv:1703.04096, 2017.

 [9] Yarin Gal and Zoubin Ghahramani. A theoretically grounded application of dropout in recurrent neural networks. In Advances in Neural Information Processing Systems, pages 1019–1027, 2016.
 [10] Ian J Goodfellow, David Warde-Farley, Mehdi Mirza, Aaron Courville, and Yoshua Bengio. Maxout networks. ICML, 2013.
 [11] Benjamin Graham. Fractional maxpooling. arXiv preprint arXiv:1412.6071, 2014.

 [12] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
 [13] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
 [14] Gao Huang, Zhuang Liu, Kilian Q Weinberger, and Laurens van der Maaten. Densely connected convolutional networks. arXiv preprint arXiv:1608.06993, 2016.
 [15] Hakan Inan, Khashayar Khosravi, and Richard Socher. Tying word vectors and word classifiers: A loss framework for language modeling. arXiv preprint arXiv:1611.01462, 2016.

 [16] Yoon Kim, Yacine Jernite, David Sontag, and Alexander M Rush. Character-aware neural language models. In Thirtieth AAAI Conference on Artificial Intelligence, 2016.
 [17] Pang Wei Koh and Percy Liang. Understanding black-box predictions via influence functions. Proceedings of the 34th International Conference on Machine Learning, 2017.
 [18] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, 2012.
 [19] Brian Kulis, Mátyás A Sustik, and Inderjit S Dhillon. Lowrank kernel learning with bregman matrix divergences. Journal of Machine Learning Research, 10(Feb):341–376, 2009.
 [20] ChenYu Lee, Saining Xie, Patrick Gallagher, Zhengyou Zhang, and Zhuowen Tu. Deeplysupervised nets. In Artificial Intelligence and Statistics, pages 562–570, 2015.
 [21] Daniel D Lee and H Sebastian Seung. Learning the parts of objects by nonnegative matrix factorization. Nature, 401(6755):788–791, 1999.
 [22] Min Lin, Qiang Chen, and Shuicheng Yan. Network in network. arXiv preprint arXiv:1312.4400, 2013.
 [23] Zachary C Lipton. The mythos of model interpretability. arXiv preprint arXiv:1606.03490, 2016.
 [24] Mitchell P Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. Building a large annotated corpus of english: The penn treebank. Computational linguistics, 19(2):313–330, 1993.
 [25] Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843, 2016.
 [26] Tomas Mikolov, Martin Karafiat, and Lukas Burget. Recurrent neural network based language model. In Eleventh Annual Conference of the International Speech Communication Association, 2010.
 [27] Tomas Mikolov and Geoffrey Zweig. Context dependent recurrent neural network language model. 2012.
 [28] Dmytro Mishkin and Jiri Matas. All you need is a good init. arXiv preprint arXiv:1511.06422, 2015.
 [29] Bruno A Olshausen and David J Field. Sparse coding with an overcomplete basis set: A strategy employed by v1? Vision research, 1997.
 [30] Neal Parikh and Stephen Boyd. Proximal algorithms. Foundations and Trends in Optimization, 1(3):127–239, 2014.
 [31] Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. How to construct deep recurrent neural networks. arXiv preprint arXiv:1312.6026, 2013.
 [32] Pau Rodríguez, Jordi Gonzàlez, Guillem Cucurull, Josep M Gonfaus, and Xavier Roca. Regularizing cnns with locally constrained decorrelations. arXiv preprint arXiv:1611.01967, 2016.
 [33] Jost Tobias Springenberg, Alexey Dosovitskiy, Thomas Brox, and Martin Riedmiller. Striving for simplicity: The all convolutional net. arXiv preprint arXiv:1412.6806, 2014.
 [34] Rupesh Kumar Srivastava, Klaus Greff, and Jürgen Schmidhuber. Highway networks. arXiv preprint arXiv:1505.00387, 2015.
 [35] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), pages 267–288, 1996.

 [36] Yichen Wang, Robert Chen, Joydeep Ghosh, Joshua C Denny, Abel Kho, You Chen, Bradley A Malin, and Jimeng Sun. Rubik: Knowledge guided tensor factorization and completion for health data analytics. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1265–1274. ACM, 2015.
 [37] Pengtao Xie, Yuntian Deng, and Eric P. Xing. Diversifying restricted boltzmann machine for document modeling. In ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2015.
 [38] Pengtao Xie, Yuntian Deng, Yi Zhou, Abhimanu Kumar, Yaoliang Yu, James Zou, and Eric P. Xing. Learning latent space models with angular constraints. In Proceedings of the 34th International Conference on Machine Learning, pages 3799–3810, 2017.
 [39] Pengtao Xie, Barnabas Poczos, and Eric P Xing. Near-orthogonality regularization in kernel methods. 2017.
 [40] Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. arXiv preprint arXiv:1611.05431, 2016.
 [41] Yoshihiro Yamada, Masakazu Iwamura, and Koichi Kise. Deep pyramidal residual networks with separated stochastic depth. arXiv preprint arXiv:1612.01230, 2016.
 [42] Yang Yu, YuFeng Li, and ZhiHua Zhou. Diversity regularized machine. In International Joint Conference on Artificial Intelligence. Citeseer, 2011.
 [43] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.
 [44] Wojciech Zaremba, Ilya Sutskever, and Oriol Vinyals. Recurrent neural network regularization. arXiv preprint arXiv:1409.2329, 2014.
 [45] Julian Georg Zilly, Rupesh Kumar Srivastava, Jan Koutník, and Jürgen Schmidhuber. Recurrent highway networks. arXiv preprint arXiv:1607.03474, 2016.
 [46] Barret Zoph and Quoc V Le. Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578, 2016.