One central task in machine learning (ML) is to extract underlying patterns from observed data (bishop2006pattern; han2011data; fukunaga2013introduction), which is essential for making effective use of big data in many applications (council2013frontiers; jordan2015machine). For example, in the healthcare domain (sun2013big), with the prevalence of new technologies, data from electronic healthcare records (EHR), sensors, mobile applications, genomics, and social media is growing in both volume and complexity at an unprecedented rate (sun2013big). Distilling high-value information from such data, such as clinical phenotypes (ho2014marble; wang2015rubik), treatment plans (razali2009generating; martin2002method), and patient similarity (wang2011integrating; sun2012supervised), is vitally important but highly challenging. Among the various ML models and algorithms designed for pattern discovery, latent variable models (LVMs) (rabiner1989tutorial; bishop1998latent; knott1999latent; blei2003latent; hinton2006fast; airoldi2009mixed; blei2014build) or latent space models (LSMs) (rumelhart1985learning; deerwester1990indexing; olshausen1997sparse; lee1999learning; xing2002distance) are a large family of models that provide a principled and effective way to uncover knowledge hidden behind data. They have been widely used in text mining (deerwester1990indexing; blei2003latent), computer vision (olshausen1997sparse; lee1999learning), speech recognition (rabiner1989tutorial; hinton2012deep), computational biology (xing2007bayesian; song2009keller), and recommender systems (gunawardana2008tied; koren2009matrix). For instance, semantic-oriented distillation models such as Latent Semantic Indexing (LSI) (deerwester1990indexing), Latent Dirichlet Allocation (LDA) (blei2003latent), and Sparse Coding (olshausen1997sparse) have led to a number of breakthroughs in the automatic extraction of topics (blei2003latent), entity types (shu2009latent), and storylines (ahmed2012timeline) from textual information.
Multi-layer neural networks (krizhevsky2012imagenet; bahdanau2014neural) and latent graphical models (salakhutdinov2009deep; mohamed2011deep) have demonstrated great success in the automatic learning of low-, middle-, and high-level features from raw data, and have greatly advanced image classification (krizhevsky2012imagenet), machine translation (bahdanau2014neural), and speech recognition (hinton2012deep). In the healthcare domain, LVMs have also been applied to various analytic applications, including topic models for clinical notes analysis (arnold2012topic; cohen2014redundancy), restricted Boltzmann machines (hinton2010practical) for patient profile modeling (nguyen2013latent; tran2015learning), distance metric learning (xing2002distance) for patient similarity measures (sun2012supervised; wang2011integrating), and tensor factorization (cichocki2008nonnegative) for computational phenotyping (ho2014marble; wang2015rubik), to name a few.
Although LVMs are now widely used, several new challenges have emerged due to the dramatic growth of the volume and complexity of data: 1) When the popularity of patterns behind big data is distributed in a power-law fashion, where a few dominant patterns occur frequently whereas most patterns in the long-tail region are of low popularity (wang2014peacock; xie2015diversifying), standard LVMs are inadequate to capture the long-tail patterns, which can incur significant information loss (wang2014peacock; xie2015diversifying). For instance, Figure 1(a) shows the distribution of the number of documents belonging to each topic in the Wikipedia (partalas2015lshtc) document collection, with 2.4M documents and 0.33M topics. Dominant topics such as politics and economics are of high frequency, whereas long-tail topics such as symphony and painting are of low popularity. A possible reason standard LVMs are inadequate to capture the long-tail patterns may lie in the design of the objective function used for training. For example, a maximum likelihood estimator would reward itself by modeling the dominant patterns well, as they are the major contributors to the likelihood function, as illustrated in Figure 2. Since the dominant patterns denoted by the two large circles are the major contributors to the likelihood function, LVMs would allocate a number of triangles to cover the large circles as well as possible. On the other hand, the long-tail patterns denoted by the small circles contribute less to the likelihood function, so it is not very rewarding to model them well, and LVMs tend to ignore them. However, in practice, long-tail patterns are important, and ignoring them incurs significant information loss, as we illustrate below. First, the volume of long-tail patterns can be quite large (partalas2015lshtc; deng2009imagenet). For example, in the Wikipedia dataset (Figure 1(a)), 96.3% of the topics are used by fewer than 100 documents, and the percentage of documents labeled with long-tail topics is 51.8%. Second, in some applications (wang2014peacock), the long-tail patterns can be more interesting and useful. For example, Tencent Inc. applied topic models to advertising; in one application (Jin2014), they showed that long-tail topics such as losing weight and child nursing improve the click-through rate by 40%. 2) To cope with the rapidly growing complexity of patterns present in big data, ML practitioners typically increase the size and capacity of LVMs, which incurs great challenges for model training, inference, storage, and maintenance (xie2015learning): how can we reduce model complexity without compromising expressivity?
In LVMs, the number of components K incurs a tradeoff between model expressivity and complexity (xie2015learning). Under a small K, the model has fewer parameters, and hence lower complexity and better computational and statistical efficiency; the downside is that the expressivity of the model is low. For a large K, the model has high expressivity, but also high complexity and computational overhead. Figure 3 shows, in a document retrieval task based on distance metric learning (xie2015learning), how the retrieval precision and the training time vary as the number of components K grows. It is interesting to explore whether it is possible to simultaneously achieve high expressivity and low complexity under a small K. 3) Patterns discovered by existing LVMs from massive data often exhibit substantial redundancy and overlap, making them ambiguous and hard to interpret (wang2015rubik). To better assist humans in exploring the data and making decisions, it is desirable to learn patterns that are distinct and interpretable (wang2015rubik). This problem is especially severe in big data, where the amount and complexity of both patterns and data are large. For example, in computational phenotyping from EHR, it has been observed that the phenotypes learned by standard matrix and tensor factorization have much overlap, causing confusion, e.g., two similar treatment plans are learned for the same type of disease (wang2015rubik). It is necessary to control the latent space during learning to make the patterns distinct and interpretable.
In this paper, we develop and investigate a novel regularization technique for LVMs, which controls the geometry of the latent space during learning and simultaneously addresses the three challenges discussed above. Our proposed methods encourage the learned latent components of LVMs to be diverse, in the sense that they are favored to be mutually "different" from each other (in a sense made mathematically formal later in this paper), so as to accomplish long-tail coverage, low redundancy, and better interpretability. First, concerning the long-tail phenomenon in extracting latent patterns (e.g., clusters, topics) from data: if the model components are biased to be far apart from each other, then one would expect such components to be less overlapping and less aggregated over dominant patterns (as one often experiences in standard clustering algorithms (Zou_priorsfor)), and therefore more likely to capture the long-tail patterns, as illustrated in Figure 4. Second, concerning reducing model complexity without sacrificing expressivity: if the model components are preferred to be different from each other, then the patterns captured by different components are likely to have less redundancy and hence be complementary to each other. Consequently, it is possible to use a small set of components to capture a large proportion of the patterns, as illustrated in Figure 5. Third, concerning the interpretability of the learned components: if model components are encouraged to be distinct from each other and non-overlapping, then it is cognitively easier for humans to associate each component with an object or concept in the physical world.
The major contributions of this paper are:
We propose a diversity-promoting regularization approach to address several key problems in latent variable modeling: capturing long-tail patterns, reducing model complexity without compromising expressivity, and improving interpretability.
We propose a mutual angular regularizer (MAR) that encourages the components in LVMs to have larger mutual angles.
We develop optimization techniques for learning mutual-angle-regularized LVMs, whose objectives are non-smooth and non-convex and hence challenging to optimize.
Using neural networks (NNs) as a model instance, we analyze how MAR affects the generalization error bounds of NNs.
On restricted Boltzmann machines and distance metric learning, we empirically demonstrate that MAR can effectively capture long-tail patterns, reduce model complexity without sacrificing expressivity, and improve interpretability.
The rest of the paper is organized as follows. In Section 2, we review related works. In Section 3, we propose the mutual angular regularizer and present the optimization techniques. Section 4 analyzes how the MAR affects the generalization errors of neural networks and Section 5 gives empirical evaluations of MAR on restricted Boltzmann machine and distance metric learning. Section 6 concludes the paper.
2 Related Works
Latent Variable Models
Latent Variable Models (LVMs) (rabiner1989tutorial; bishop1998latent; knott1999latent; blei2003latent; hinton2006fast; airoldi2009mixed; blei2014build) or more generally Latent Space Models (LSMs) (rumelhart1985learning; deerwester1990indexing; olshausen1997sparse; lee1999learning; xing2002distance)
are a large family of models in machine learning that are widely utilized in application domains such as natural language processing (blei2003latent; petrov2007discriminative), computer vision (fei2005bayesian; krizhevsky2012imagenet), speech recognition (rabiner1989tutorial; mohamed2011deep), computational biology (xing2007bayesian; song2009keller), recommender systems (mnih2007probabilistic; salakhutdinov2007restricted), social network analysis (airoldi2009mixed; ho2012triangular), and so on. The utilities of LVMs/LSMs include, but are not limited to: 1) representation learning (Deep Neural Networks (krizhevsky2012imagenet; mohamed2011deep), Restricted Boltzmann Machines (hinton2010practical)); 2) semantic distillation (Latent Semantic Indexing (deerwester1990indexing), topic models (blei2003latent), Sparse Coding (olshausen1997sparse)); 3) dimension reduction (Factor Analysis (knott1999latent), Principal Component Analysis (jolliffe2002principal), Canonical Component Analysis (bishop2006pattern), Independent Component Analysis (hyvarinen2004independent)); 4) sequential data modeling (Hidden Markov Models (rabiner1989tutorial), Kalman Filtering (grewal2014kalman)); 5) data grouping (Gaussian Mixture Models (bishop2006pattern), the Mixed Membership Stochastic Block model (airoldi2009mixed)); 6) latent factor discovery (Matrix Factorization (koren2009matrix), Tensor Factorization (shashua2005non)). While existing latent variable models have demonstrated effectiveness on small- to moderate-scale data, they are inadequate for new problems that emerge with big data, such as the highly skewed distribution of pattern frequency (wang2014peacock; xie2015diversifying), the conflict between model complexity and effectiveness (xie2015learning), and the interpretability of the large number of patterns discovered from massive data (wang2015rubik), as explained in Section 1.
Regularization (hoerl1970ridge; tibshirani1996regression; recht2010guaranteed; wainwright2014structured; srivastava2014dropout) is an important concept and technique in machine learning, which can help alleviate overfitting (hoerl1970ridge; srivastava2014dropout), reduce model complexity (tibshirani1996regression; recht2010guaranteed), achieve certain properties of parameters such as sparsity (tibshirani1996regression; friedman2008sparse; jacob2009group; kim2010tree; jenatton2011structured) and low-rankness (fazel2003log; recht2010guaranteed; candes2011robust), stabilize an optimization problem (wainwright2014structured), and lead to algorithmic speed-ups (wainwright2014structured). Commonly used regularizers include the squared ℓ2 norm (hoerl1970ridge), the ℓ1 norm (tibshirani1996regression), and the group Lasso norm (yuan2006model) for parameters represented by vectors; the Frobenius norm (mnih2007probabilistic) and the trace norm (recht2010guaranteed) for parameters represented by matrices; and structured Hilbert norms for functions in Hilbert spaces (scholkopf2002learning). Regularization approaches promoting diversity in the underlying solutions have been studied and applied in ensemble learning (krogh1995neural; kuncheva2003measures; brown2005diversity; banfield2005ensemble; tang2006analysis; partalas2008focused; yu2011diversity), latent variable modeling (Zou_priorsfor; xie2015diversifying; xie2015learning), classification (malkin2008ratio; jalali2015variational), and multitask learning (jalali2015variational). Many works (krogh1995neural; kuncheva2003measures; brown2005diversity; banfield2005ensemble; tang2006analysis; partalas2008focused; yu2011diversity) explored how to select a diverse subset of base classifiers or regressors in ensemble learning, with the aim of improving generalization error and reducing computational complexity. Recently, (Zou_priorsfor; xie2015diversifying; xie2015learning) studied diversity-inducing regularization of latent variable models, which encourages the individual components in latent variable models to be different from each other, with the goal of capturing long-tail knowledge and reducing model complexity. In a multi-class classification problem, malkin2008ratio proposed to use the determinant of the covariance matrix to encourage classifiers to be different from each other. jalali2015variational proposed a class of convex diversity regularizers and applied them to hierarchical classification and multi-task learning. While these works nicely demonstrate the effectiveness of diversity-promoting regularizers via empirical experiments, a rigorous theoretical analysis has been missing. In this work, we aim to bridge this gap.
Generalization Performance of Neural Networks
The generalization performance of neural networks, in particular the approximation error and the estimation error, has been widely studied in the past several decades. For the approximation error, cybenko1989approximation demonstrated that finite linear combinations of compositions of a fixed univariate function and a set of affine functionals can uniformly approximate any continuous function. hornik1991approximation showed that neural networks with a single hidden layer, sufficiently many hidden units, and an arbitrary bounded and nonconstant activation function are universal approximators. leshno1993multilayer proved that multilayer feedforward networks with a non-polynomial activation function can approximate any function. Various error rates have also been derived under different assumptions on the target function. jones1992simple showed that if the target function is in the hypothesis set formed by neural networks with one hidden layer of n units, then the approximation error rate is O(1/√n). barron1993universal showed that neural networks with one layer of n hidden units and a sigmoid activation function can achieve an approximation error of order O(1/√n). makovoz1998uniform proved that if the target function is of a certain integral form, then neural networks with one layer of n hidden units can approximate it with an error rate that improves on O(1/√n) by a factor depending on the dimension d of the input x. As for the estimation error, we refer the reader to (anthony1999neural) for an extensive review, which introduces various estimation error bounds based on the VC dimension, the fat-shattering dimension, the pseudo-dimension, and so on.
3 Latent Variable Modeling with Mutual Angular Regularization
In this section, we begin with a review of latent variable models, then propose the mutual angular regularizer and utilize it to regularize LVMs, and finally present optimization techniques for learning mutual-angle-regularized LVMs.
3.1 Latent Variable Models
An LVM consists of two types of variables: the observed ones are utilized to model the observed data and the latent ones are used to characterize the hidden patterns. The interaction between observed and latent variables encodes the correlation between data and patterns. Under an LVM, extracting patterns from data corresponds to inferring the value of latent variables given the observed ones (blei2014build). For example, in topic models (blei2003latent; hinton2009replicated) which are widely employed to extract topics from documents, the observed variables are used to model words and the latent variables are used to capture topics. The knowledge and structures hidden behind data are usually composed of multiple patterns. For instance, the semantics underlying documents contains a set of themes (hofmann1999probabilistic; blei2003latent), such as politics, economics and education. Accordingly, latent variable models are parametrized by multiple components where each component aims to capture one pattern in the knowledge and is represented with a parameter vector. For instance, the components in Latent Dirichlet Allocation (blei2003latent) are called topics and each topic is parametrized by a multinomial vector.
3.2 Mutual Angular Regularizer
Motivated by the problems stated in Section 1
, we propose to regularize the components in LVMs to encourage them to spread out diversely. To measure the diversity of components, we define a mutual angular regularizer (MAR), a score that takes a larger value when the components are more diverse. Before quantifying the diversity of a set of components, we first measure the dissimilarity between two components (vectors). A good dissimilarity measure between two vectors should be invariant to scaling, translation, rotation, and orientation of the two vectors. Commonly used metrics such as the Euclidean distance, the L1 distance, and cosine similarity are not ideal since they are sensitive to either scaling or orientation. In this work, we use the non-obtuse angle as the dissimilarity measure of two vectors a and b, defined as θ(a, b) = arccos(|a · b| / (‖a‖‖b‖)). The non-obtuse angle differs from the ordinary definition of the angle in that it is always acute or right, which is preferred due to its insensitivity to vector orientation. Given this pairwise dissimilarity measure, we can measure the diversity of a vector set. Let A denote a set of K components, where each row of A is a d-dimensional component vector and the vectors are assumed to be linearly independent. We take each pair of vectors from A and compute their non-obtuse angle θ_ij. Given all these pairwise angles, the mutual angular regularizer is defined as the mean of these angles minus the variance of these angles:

Ω(A) = mean({θ_ij}) − γ · var({θ_ij}),
where γ ≥ 0 is a tradeoff parameter between the mean and the variance. The mean term summarizes how different these vectors are from each other on the whole. A larger mean indicates these vectors share larger angles in general, and hence are more diverse. The variance term is utilized to encourage the vectors to spread out evenly in different directions. A smaller variance indicates that each vector is uniformly different from all other vectors. Encouraging the variance to be small prevents the phenomenon where the vectors fall into several groups, with vectors in the same group having small angles and vectors across groups having large angles. Such a phenomenon renders the vectors redundant and less diverse, and hence should be prohibited. Figure 6 shows two sets of vectors, where the mean of the pairwise angles of the first set (Figure 6(a)) is roughly the same as that of the second set (Figure 6(b)), but the variances of these angles are quite different. In the first set, two vectors are very close to each other while the third one is different from them; hence the variance of the angles is large. In the second set, the vectors point evenly in different directions; hence the variance of the angles is small. The first set has redundant vectors, which contradicts diversity; hence it is desirable to prohibit such cases by encouraging the variance of the angles to be small.
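To make the definition concrete, the regularizer can be sketched in a few lines of NumPy (the function name and the `gamma` keyword standing in for the tradeoff parameter are ours, not the paper's notation):

```python
import numpy as np

def mutual_angular_regularizer(A, gamma=1.0):
    """Mean minus gamma * variance of the pairwise non-obtuse angles
    between the rows of A (shape (K, d), rows assumed nonzero)."""
    # Normalize rows so the measure is insensitive to scaling.
    U = A / np.linalg.norm(A, axis=1, keepdims=True)
    K = U.shape[0]
    # |cos| makes the angle non-obtuse, i.e. insensitive to orientation.
    angles = np.asarray([np.arccos(np.clip(abs(float(U[i] @ U[j])), 0.0, 1.0))
                         for i in range(K) for j in range(i + 1, K)])
    return angles.mean() - gamma * angles.var()

# Orthogonal components are maximally diverse: every angle is pi/2 and the
# variance is 0, so the score equals pi/2.
print(round(float(mutual_angular_regularizer(np.eye(3))), 4))  # 1.5708
```

Negating or rescaling any row leaves the score unchanged, which is exactly the invariance the non-obtuse angle is chosen for.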
3.3 Latent Variable Models with Mutual Angular Regularization
Given this mutual angular regularizer, we can apply it to regularize latent variable models and control the geometry of the latent space during learning. Specifically, it is employed to encourage the components to be diverse. Let L(A) denote a model-specific objective function, such as a likelihood (e.g., in topic models (blei2003latent) and restricted Boltzmann machines (hinton2010practical)) or a negative squared loss (sparse coding (olshausen1997sparse)). Without loss of generality, we assume it is to be maximized. To diversify the components in the LVM, we augment the objective function with Ω(A) and define the mutual angle regularized LVM (MAR-LVM) problem as

max_A L(A) + λ Ω(A),
where λ > 0 is a tradeoff parameter, which plays an important role in balancing the diversity of the model components and their fitness to the data. Under a small λ, the components are learned to best fit the data while their diversity is ignored; as discussed earlier, such components have high redundancy, may fail to cover long-tail patterns effectively, and are hard to interpret. Under a large λ, the components are regularized to have high diversity, but may not fit the data well. In sum, a proper λ needs to be chosen to achieve the optimal balance.
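The composition of the regularized objective can be sketched as follows (the helper names and the toy negative-squared-error data-fit term are our own illustration, standing in for a model-specific likelihood):

```python
import numpy as np

def mar(A, gamma=1.0):
    """Mutual angular regularizer: mean - gamma * variance of the pairwise
    non-obtuse angles between the rows of A."""
    U = A / np.linalg.norm(A, axis=1, keepdims=True)
    K = U.shape[0]
    ang = np.asarray([np.arccos(np.clip(abs(float(U[i] @ U[j])), 0.0, 1.0))
                      for i in range(K) for j in range(i + 1, K)])
    return ang.mean() - gamma * ang.var()

def mar_lvm_objective(A, data_fit, lam):
    """L(A) + lam * Omega(A): data fit plus weighted diversity, to be maximized."""
    return data_fit(A) + lam * mar(A)

# Toy data-fit term: negative squared reconstruction error of X under fixed
# codes S and components A (a sparse-coding-flavored stand-in).
rng = np.random.default_rng(0)
X, S, A = rng.normal(size=(5, 4)), rng.normal(size=(5, 3)), rng.normal(size=(3, 4))
fit = lambda M: -float(np.sum((X - S @ M) ** 2))

# lam = 0 recovers the unregularized objective; larger lam weights diversity.
print(mar_lvm_objective(A, fit, lam=0.0) == fit(A))  # True
```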
The mutual angular regularizer Ω(A) is non-smooth and non-convex, entailing great challenges for solving the problem defined in Eq. (2). In this section, we discuss how to address this issue. The basic strategy is to derive a smooth approximation of Ω(A) and optimize that approximation instead. For reasons that will become clear later, we first reformulate Ω(A) by decomposing each component into its direction and magnitude. Let A = diag(g)Ã, where g is a vector whose i-th entry g_i denotes the ℓ2 norm of the i-th row of A; then the ℓ2 norm of each row vector of Ã is 1. Based on the definition of the mutual angular regularizer, which depends only on the directions of the components, we have Ω(A) = Ω(Ã). Accordingly, the problem can be reformulated as

max_{g, Ã} L(diag(g)Ã) + λ Ω(Ã), s.t. ‖ã_i‖ = 1, i = 1, ..., K.
This problem can be solved by alternating between Ã and g: optimizing Ã with g fixed, and optimizing g with Ã fixed. With Ã fixed, the problem defined over g is max_g L(diag(g)Ã), which can be efficiently solved with many optimization methods, such as projected gradient descent (boyd2004convex) and the barrier method (boyd2004convex). Fixing g, the problem defined over Ã is max_Ã L(diag(g)Ã) + λ Ω(Ã), subject to the unit-norm constraints on the rows of Ã, which is still non-smooth and non-convex, entailing great obstacles for optimization. To address this issue, we derive a smooth lower bound Γ(Ã) of the regularizer and use this lower bound as a surrogate of Ω(Ã) during optimization.
Since the surrogate Γ(Ã) is smooth, the optimization problem in Eq. (6) is much easier than that in Eq. (5), and many algorithms can be applied, such as projected gradient descent (boyd2004convex). The overall algorithm is summarized in Algorithm 1.
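The alternating scheme can be sketched numerically as follows; the decomposition check and the projection step mirror the text, while the gradient G below is a random placeholder standing in for the gradient of the smooth surrogate plus the data-fit term (helper names are our own):

```python
import numpy as np

def omega(M, gamma=1.0):
    """Mutual angular regularizer of the rows of M (re-implementation)."""
    U = M / np.linalg.norm(M, axis=1, keepdims=True)
    K = U.shape[0]
    a = np.asarray([np.arccos(np.clip(abs(float(U[i] @ U[j])), 0.0, 1.0))
                    for i in range(K) for j in range(i + 1, K)])
    return a.mean() - gamma * a.var()

def project_rows(M):
    """Project each row of M onto the unit sphere."""
    return M / np.linalg.norm(M, axis=1, keepdims=True)

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 5))

# Magnitude/direction decomposition: A = diag(g) @ A_tilde, unit-norm rows.
g = np.linalg.norm(A, axis=1)
A_tilde = A / g[:, None]
assert np.allclose(np.diag(g) @ A_tilde, A)

# The regularizer depends only on the directions: Omega(A) = Omega(A_tilde).
assert np.isclose(omega(A), omega(A_tilde))

# One projected ascent step on A_tilde with a placeholder gradient G; the
# projection restores the unit-norm constraint after the step.
G = rng.normal(size=A_tilde.shape)
A_next = project_rows(A_tilde + 0.01 * G)
print(bool(np.allclose(np.linalg.norm(A_next, axis=1), 1.0)))  # True
```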
The lower bound is given in Theorem 1.
Let det(ÃÃᵀ) denote the determinant of the Gram matrix of Ã; since the rows of Ã are unit-norm and linearly independent, 0 < det(ÃÃᵀ) ≤ 1. Let Γ(Ã) = arcsin(√det(ÃÃᵀ)) − γ(π/2 − arcsin(√det(ÃÃᵀ)))²; then Γ(Ã) is a lower bound of Ω(Ã), and Γ(Ã) and Ω(Ã) have the same global optimum.
Let the parameter vector ã_i of component i be decomposed as ã_i = x + αy, where x lies in the subspace spanned by the other components {ã_j}_{j≠i}, y is a unit vector in the orthogonal complement of that subspace, and α is a scalar. Then det(ÃÃᵀ) = α² det(Ã₋ᵢÃ₋ᵢᵀ), where Ã₋ᵢ denotes Ã with the i-th row excluded.
The mutual angular regularizer comprises two terms, the mean and the variance of the pairwise angles, with Ω(Ã) = mean({θ_ij}) − γ · var({θ_ij}). We bound the two terms separately, and first bound the mean. Since the component vectors are assumed to be linearly independent, we have det(ÃÃᵀ) > 0. Decomposing ã_i according to Lemma 1, we have det(ÃÃᵀ) = α² det(Ã₋ᵢÃ₋ᵢᵀ) with α² ≤ ‖ã_i‖² = 1, so removing a row can only increase the determinant. We can therefore eliminate the rows of Ã other than ã_i and ã_j and apply this inequality repeatedly to conclude that det(ÃÃᵀ) is at most the determinant of the Gram matrix of the pair (ã_i, ã_j), which equals sin²θ_ij (and, in particular, det(ÃÃᵀ) ≤ 1). For any i ≠ j, the pairwise angle between ã_i and ã_j therefore satisfies θ_ij ≥ arcsin(√det(ÃÃᵀ)).
Thus mean({θ_ij}) ≥ arcsin(√det(ÃÃᵀ)). Now we bound the variance. For any i ≠ j, we have proved that θ_ij ≥ arcsin(√det(ÃÃᵀ)); from the definition of the non-obtuse angle, we also have θ_ij ≤ π/2. As the mean lies between the smallest and largest pairwise angles, arcsin(√det(ÃÃᵀ)) ≤ mean({θ_ij}) ≤ π/2. So |θ_ij − mean({θ_ij})| ≤ π/2 − arcsin(√det(ÃÃᵀ)), and hence var({θ_ij}) ≤ (π/2 − arcsin(√det(ÃÃᵀ)))². Combining the lower bound of the mean and the upper bound of the variance, we have Ω(Ã) ≥ arcsin(√det(ÃÃᵀ)) − γ(π/2 − arcsin(√det(ÃÃᵀ)))² = Γ(Ã). Both Ω(Ã) and Γ(Ã) attain the optimal value π/2 when the vectors in Ã are orthogonal to each other. This completes the proof.
Here we present an intuitive understanding of the lower bound. The square root of det(ÃÃᵀ) is the volume of the parallelepiped formed by the vectors in Ã. The volume of a parallelepiped depends on both the lengths of the vectors and the angles between them. Since the vectors in Ã are of unit length, the angles determine the volume. The larger det(ÃÃᵀ) is, the more likely it is (though not guaranteed) that these vectors share larger angles. Let f(t) = arcsin(√t), which is an increasing function; then Γ(Ã) = f(det(ÃÃᵀ)) − γ(π/2 − f(det(ÃÃᵀ)))², which is increasing w.r.t. det(ÃÃᵀ). This implies that the larger det(ÃÃᵀ) is, the more likely the vectors in Ã have larger angles, and accordingly the more likely the mutual angular regularizer is large.
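Assuming the surrogate takes the closed form Γ(Ã) = arcsin(√det(ÃÃᵀ)) − γ(π/2 − arcsin(√det(ÃÃᵀ)))² (a reconstruction consistent with the bounds above; the helper names are ours), the lower-bound property and the boundary behavior of the Gram determinant can be checked numerically:

```python
import numpy as np

def omega(U, gamma=1.0):
    """Mutual angular regularizer of a matrix with unit-norm rows."""
    K = U.shape[0]
    a = np.asarray([np.arccos(np.clip(abs(float(U[i] @ U[j])), 0.0, 1.0))
                    for i in range(K) for j in range(i + 1, K)])
    return a.mean() - gamma * a.var()

def surrogate(U, gamma=1.0):
    """arcsin(sqrt(det of Gram)) - gamma * (pi/2 - arcsin(...))**2."""
    det = max(float(np.linalg.det(U @ U.T)), 0.0)
    f = np.arcsin(np.clip(np.sqrt(det), 0.0, 1.0))  # clip guards roundoff
    return f - gamma * (np.pi / 2 - f) ** 2

# Orthonormal rows: det = 1 and both the surrogate and the regularizer
# attain the optimum pi/2.
Q = np.eye(4)[:3]
print(round(float(surrogate(Q)), 4), round(float(omega(Q)), 4))  # 1.5708 1.5708

# Nearly parallel rows drive the Gram determinant toward 0 (redundancy).
U2 = np.array([[1.0, 0.0], [0.999, np.sqrt(1 - 0.999 ** 2)]])
print(float(np.linalg.det(U2 @ U2.T)) < 0.01)  # True

# The surrogate never exceeds the regularizer on random unit-norm rows.
rng = np.random.default_rng(0)
for _ in range(100):
    U = rng.normal(size=(3, 5))
    U /= np.linalg.norm(U, axis=1, keepdims=True)
    assert surrogate(U) <= omega(U) + 1e-9
```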
While in general increasing a lower bound of a function does not ensure that the function itself increases, in our case it can be proved that the monotonicity of Γ(Ã) is closely aligned with that of Ω(Ã). Specifically, at each point, the gradient of Γ(Ã) is an ascent direction of Ω(Ã), as formally stated in Theorem 2. This property qualifies Γ(Ã) as a desirable surrogate of Ω(Ã).
Let ∇Γ(Ã) be the gradient of Γ(Ã) w.r.t. Ã. For any Ã with unit-norm rows, there exists η₀ > 0 such that for all stepsizes η ∈ (0, η₀), Ω(P(Ã + η∇Γ(Ã))) > Ω(Ã), where P denotes the row-wise projection onto the unit sphere.
To prove Theorem 2, we show that at each point the gradient of Γ(Ã) is an ascent direction of the mean of the angles and a descent direction of the variance of the angles, as formally stated in Theorems 3 and 4. Since Ω(Ã) is the difference between the mean and (γ times) the variance, the gradient of Γ(Ã) is an ascent direction of Ω(Ã).
Let ∇Γ(Ã) be the gradient of Γ(Ã) w.r.t. Ã. For any Ã with unit-norm rows, there exists η₀ > 0 such that for all stepsizes η ∈ (0, η₀), the mean of the pairwise angles of P(Ã + η∇Γ(Ã)) is larger than that of Ã, where P denotes the row-wise projection onto the unit sphere.
We first introduce some notations. Let , , where is the th row of . Let , , then . Let , .
The following lemmas are needed to prove Theorem 3.
Let the parameter vector ã_i of component i be decomposed as ã_i = x + αy, where x lies in the subspace spanned by the other components, y is a unit vector in the orthogonal complement of that subspace, and α is a scalar. Then the gradient of det(ÃÃᵀ) w.r.t. ã_i is cy, where c is a positive scalar.
, we have , where .
, , such that , where .
. So such that we have . This completes the proof.
Let ∇Γ(Ã) be the gradient of Γ(Ã) w.r.t. Ã. For any Ã with unit-norm rows, there exists η₀ > 0 such that for all stepsizes η ∈ (0, η₀), the variance of the pairwise angles of P(Ã + η∇Γ(Ã)) is smaller than that of Ã, where P denotes the row-wise projection onto the unit sphere.
To prove Theorem 4, we need the following lemma.
Given a nondecreasing sequence and a strictly decreasing function which satisfies , we define a sequence where . If , then , where denotes the variance of a sequence. Furthermore, let , we define a sequence where and is the indicator function, then .
The intuition behind the proof of Theorem 4 is as follows: when the stepsize is sufficiently small, we can ensure that the changes of the smaller angles (between consecutive iterations) are larger than the changes of the larger angles; then Lemma 5 can be used to prove that the variance decreases. Let denote . We sort in nondecreasing order and denote the resultant sequence by , then . We use the same order to index and denote the resultant sequence by , then . Let if and if ; then is a strictly decreasing function. Let . It is easy to see that when the stepsize is sufficiently small, . We consider two complementary cases: (1) ; (2) . If , then according to Lemma 5, we have , where . Furthermore, let , , then , where . can be written as:
Let , it can be further written as
As is nondecreasing and , we have . Let when and when , then and . Substituting and into , we can obtain:
Note that and , so , such that . As , we can draw the conclusion that . On the other hand,
So such that . Let , then