Rectified Factor Networks
We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.
The success of deep learning is to a large part based on advanced and efficient input representations [1, 2, 3, 4]. Sparse representations of the input are in general obtained by rectified linear units (ReLUs) [5, 6] and dropout. The key advantage of sparse representations is that dependencies between coding units are easy to model and to interpret. Most importantly, distinct concepts are much less likely to interfere in sparse representations. Using sparse representations, similarities of samples often break down to co-occurrences of features in these samples. In bioinformatics, sparse codes excelled in biclustering of gene expression data and in finding DNA sharing patterns between humans and Neanderthals.
Representations learned by ReLUs are not only sparse but also non-negative. Non-negative representations do not code the degree of absence of events or objects in the input. As the vast majority of events are supposed to be absent, coding their degree of absence would introduce a high level of random fluctuations. We also aim for non-linear input representations so that models can be stacked to construct hierarchical representations. Finally, the representations are supposed to have a large number of coding units to allow coding of rare and small events in the input. Rare events are only observed in few samples, like rare side effects in drug design, rare genotypes in genetics, or small customer groups in e-commerce. Small events affect only few input components, like pathways with few genes in biology, few relevant mutations in oncology, or a pattern of few products in e-commerce. In summary, our goal is to construct input representations that (1) are sparse, (2) are non-negative, (3) are non-linear, (4) use many code units, and (5) model structures in the input data (see next paragraph).
Current unsupervised deep learning approaches like autoencoders or restricted Boltzmann machines (RBMs) do not model specific structures in the data. On the other hand, generative models explain structures in the data, but their codes cannot be enforced to be sparse and non-negative. The input representation of a generative model is its posterior’s mean, median, or mode, which depends on the data. Therefore sparseness and non-negativity cannot be guaranteed independent of the data. For example, generative models with rectified priors, like rectified factor analysis, have zero posterior probability for negative values, therefore their means are positive and not sparse [10, 11]. Sparse priors do not guarantee sparse posteriors, as seen in the experiments with factor analysis with Laplacian and Jeffrey’s prior on the factors (see Tab. 1). To address the data dependence of the code, we employ the posterior regularization method. This method separates model characteristics from data dependent characteristics that are enforced by constraints on the model’s posterior.
We aim at representations that are feasible for many code units and massive datasets, therefore the computational complexity of generating a code is essential in our approach. For non-Gaussian priors, the computation of the posterior mean of a new input requires either numerically solving an integral or iteratively updating variational parameters. In contrast, for Gaussian priors the posterior mean is the product between the input and a matrix that is independent of the input. Still, the posterior regularization method leads to a quadratic (in the number of coding units) constrained optimization problem in each E-step (see Eq. (3) below). To speed up computation, we do not solve the quadratic problem but perform a gradient step. To allow for stochastic gradients and fast GPU implementations, the M-step is also a gradient step. These E-step and M-step modifications of the posterior regularization method result in a generalized alternating minimization (GAM) algorithm. We will show that the GAM algorithm used for RFN learning (i) converges and (ii) is correct. Correctness means that the RFN codes are non-negative, sparse, have a low reconstruction error, and explain the covariance structure of the data.
Our goal is to construct representations of the input that (1) are sparse, (2) are non-negative, (3) are non-linear, (4) use many code units, and (5) model structures in the input. Structures in the input are identified by a generative model, where the model assumptions determine which input structures are explained by the model. We want to model the covariance structure of the input, therefore we choose maximum likelihood factor analysis as the model. The constraints on the input representation are enforced by the posterior regularization method. Non-negative constraints lead to sparse and non-linear codes, while normalization constraints scale the signal part of each hidden (code) unit. Normalizing constraints prevent generative models from explaining away rare and small signals by noise. Explaining away becomes a serious problem for models with many coding units since their capacities are not utilized. Normalizing ensures that all hidden units are used, but at the cost of also coding random and spurious signals. Spurious and true signals must be separated in a subsequent step, either by supervised techniques, by evaluating coding units via additional data, or by domain experts.
A generative model with hidden units $h$ and data $v$ is defined by its prior $p(h)$ and its likelihood $p(v \mid h)$. The full model distribution $p(h, v) = p(v \mid h)\, p(h)$ can be expressed by the model’s posterior $p(h \mid v)$ and its evidence (marginal likelihood) $p(v)$: $p(h, v) = p(h \mid v)\, p(v)$. The representation of input $v$ is the posterior’s mean, median, or mode. The posterior regularization method introduces a variational distribution $Q(h \mid v) \in \mathcal{Q}$ from a family $\mathcal{Q}$, which approximates the posterior $p(h \mid v)$. We choose $\mathcal{Q}$ to constrain the posterior means to be non-negative and normalized. The full model distribution $p(h, v)$ contains all model assumptions and, thereby, defines which structures of the data are modeled. $Q(h \mid v) \in \mathcal{Q}$ contains data dependent constraints on the posterior, therefore on the code.
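For data $\{v\} = \{v_1, \ldots, v_n\}$, the posterior regularization method maximizes the objective $\mathcal{F}$:
$$\mathcal{F} = \frac{1}{n} \sum_{i=1}^{n} \log p(v_i) - \frac{1}{n} \sum_{i=1}^{n} D_{\mathrm{KL}}\bigl(Q(h_i \mid v_i) \,\|\, p(h_i \mid v_i)\bigr) = \frac{1}{n} \sum_{i=1}^{n} \int Q(h_i \mid v_i) \log p(v_i \mid h_i)\, dh_i - \frac{1}{n} \sum_{i=1}^{n} D_{\mathrm{KL}}\bigl(Q(h_i \mid v_i) \,\|\, p(h_i)\bigr), \tag{1}$$
where $D_{\mathrm{KL}}$ is the Kullback-Leibler distance. Maximizing $\mathcal{F}$ achieves two goals simultaneously: (1) extracting desired structures and information from the data as imposed by the generative model and (2) ensuring desired code properties via $Q \in \mathcal{Q}$.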
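The two usual ways of writing a posterior-regularized objective are related by a short derivation. The following is a sketch for a single sample, using assumed notation ($v$ for the input, $h$ for the hidden units, $Q$ for the variational distribution, $p$ for the model) that may differ from the authors' symbols:

```latex
\begin{align*}
D_{\mathrm{KL}}\bigl(Q(h \mid v) \,\|\, p(h \mid v)\bigr)
  &= \int Q(h \mid v) \log \frac{Q(h \mid v)\, p(v)}{p(v \mid h)\, p(h)} \, dh \\
  &= \log p(v) - \int Q(h \mid v) \log p(v \mid h)\, dh
     + D_{\mathrm{KL}}\bigl(Q(h \mid v) \,\|\, p(h)\bigr),
\end{align*}
and therefore
\begin{equation*}
\log p(v) - D_{\mathrm{KL}}\bigl(Q(h \mid v) \,\|\, p(h \mid v)\bigr)
  = \int Q(h \mid v) \log p(v \mid h)\, dh
    - D_{\mathrm{KL}}\bigl(Q(h \mid v) \,\|\, p(h)\bigr).
\end{equation*}
```

The left-hand form shows that the objective is a lower bound on the log evidence that is tight when $Q$ equals the posterior. The right-hand form isolates the expected log-likelihood term, which is the only part that depends on the likelihood parameters.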
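The factor analysis model $v = W h + \epsilon$ extracts the covariance structure of the data. The prior $h \sim \mathcal{N}(0, I)$ of the hidden units (factors) $h \in \mathbb{R}^l$ and the noise $\epsilon \sim \mathcal{N}(0, \Psi)$ of visible units (observations) $v \in \mathbb{R}^m$ are independent. The model parameters are the weight (loading) matrix $W \in \mathbb{R}^{m \times l}$ and the noise covariance matrix $\Psi \in \mathbb{R}^{m \times m}$. We assume diagonal $\Psi$ to explain correlations between input components by the hidden units and not by correlated noise. The factor analysis model is depicted in Fig. 1. Given the mean-centered data $\{v\} = \{v_1, \ldots, v_n\}$, the posterior $p(h_i \mid v_i)$ is Gaussian with mean vector $(\mu_p)_i$ and covariance matrix $\Sigma_p$:
$$(\mu_p)_i = \bigl(I + W^T \Psi^{-1} W\bigr)^{-1} W^T \Psi^{-1} v_i, \qquad \Sigma_p = \bigl(I + W^T \Psi^{-1} W\bigr)^{-1}. \tag{2}$$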
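The posterior of standard factor analysis with prior $h \sim \mathcal{N}(0, I)$ and diagonal noise covariance can be computed in a few lines. The sketch below is illustrative only: the function name `fa_posterior` and the row-per-sample data layout are assumptions, not part of the paper.

```python
import numpy as np

def fa_posterior(V, W, psi_diag):
    """Posterior of standard factor analysis, v = W h + eps.

    V        : (n, m) mean-centered data, one sample per row
    W        : (m, l) loading matrix
    psi_diag : (m,)   diagonal of the noise covariance Psi
    Returns the posterior means (n, l) and the shared covariance (l, l).
    """
    l = W.shape[1]
    WtPinv = W.T / psi_diag                      # W^T Psi^{-1}, shape (l, m)
    Sigma_p = np.linalg.inv(np.eye(l) + WtPinv @ W)
    # Row i equals (Sigma_p W^T Psi^{-1} v_i)^T; Sigma_p is symmetric.
    Mu_p = V @ WtPinv.T @ Sigma_p
    return Mu_p, Sigma_p
```

Because the matrix that maps an input to its posterior mean does not depend on the input, the codes for all samples come from a single matrix product, which is what makes Gaussian priors attractive for large datasets.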
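A rectified factor network (RFN) consists of a single or stacked factor analysis model(s) with constraints on the posterior. To incorporate the posterior constraints into the factor analysis model, we use the posterior regularization method, which maximizes the objective given in Eq. (1). Like the expectation-maximization (EM) algorithm, the posterior regularization method alternates between an E-step and an M-step. Minimizing the first $D_{\mathrm{KL}}$ of Eq. (1) with respect to $Q$ leads to a constrained optimization problem. For Gaussian distributions, the solution with $(\mu_p)_i$ and $\Sigma_p$ from Eq. (2) is $Q(h_i \mid v_i) \sim \mathcal{N}(\mu_i, \Sigma)$ with $\Sigma = \Sigma_p$ and the quadratic problem:
$$\min_{\mu_i} \; \frac{1}{n} \sum_{i=1}^{n} \bigl(\mu_i - (\mu_p)_i\bigr)^T \Sigma_p^{-1} \bigl(\mu_i - (\mu_p)_i\bigr) \quad \text{s.t.} \quad \forall i: \mu_i \geq 0, \quad \forall j: \frac{1}{n} \sum_{i=1}^{n} \mu_{ij}^2 = 1, \tag{3}$$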
where “$\geq$” is component-wise. This is a constrained non-convex quadratic optimization problem in the number of hidden units, which is too complex to be solved in each EM iteration. Therefore, we perform a step of the gradient projection algorithm [14, 15], which first performs a gradient step and then projects the result onto the feasible set. We start with a step of the projected Newton method, then we try the gradient projection algorithm, and thereafter the scaled gradient projection algorithm with reduced matrix. If these methods fail to decrease the objective in Eq. (3), we use the generalized reduced gradient method. It solves each equality constraint for one variable and inserts it into the objective while ensuring convex constraints. Alternatively, we use Rosen’s gradient projection method or its improvement. These methods guarantee a decrease of the E-step objective.
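The strategy of trying progressively more expensive updates until one decreases the E-step objective can be sketched as follows. This is a minimal illustration, not the authors' implementation: `e_objective`, `first_improving_update`, and the `candidates` list of `(name, update_function)` pairs are hypothetical names, and the objective is assumed to be the average quadratic form of Eq. (3).

```python
import numpy as np

def e_objective(Mu, Mu_p, Sigma_p_inv):
    """Average quadratic form (mu_i - mu_p_i)^T Sigma_p^{-1} (mu_i - mu_p_i)."""
    D = Mu - Mu_p
    return float(np.einsum("ij,jk,ik->", D, Sigma_p_inv, D) / Mu.shape[0])

def first_improving_update(Mu_old, Mu_p, Sigma_p_inv, candidates):
    """Return the first candidate update that lowers the E-step objective.

    candidates : list of (name, update) pairs, ordered from cheapest to most
                 expensive; each update maps the old means to feasible new means.
    """
    base = e_objective(Mu_old, Mu_p, Sigma_p_inv)
    for name, update in candidates:
        Mu_new = update(Mu_old)
        if e_objective(Mu_new, Mu_p, Sigma_p_inv) < base:
            return name, Mu_new
    # No candidate improved the objective: keep the old means.
    return None, Mu_old
```

Ordering the candidates from cheapest to most expensive means the costly fallbacks only run when the fast projections fail.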
Since the projection by Eq. (6) is very fast, the projected Newton and projected gradient update is very fast, too. A projected Newton step requires steps (see Eq. (7) and defined in Theorem 1), a projected gradient step requires steps, and a scaled gradient projection step requires steps. The RFN complexity per iteration is (see Alg. 1). In contrast, a quadratic program solver typically requires for the variables (the means of the hidden units for all samples) steps to find the minimum . We exemplify these values on our benchmark datasets MNIST (k, ) and CIFAR (k, ). The speedup with projected Newton or projected gradient in contrast to a quadratic solver is , which gives speedup ratios of for MNIST and for CIFAR. These speedup ratios show that efficient E-step updates are essential for RFN learning. Furthermore, on our computers, RAM restrictions limited quadratic program solvers to problems with k.
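The M-step decreases the expected reconstruction error
$$\mathcal{E} = -\frac{1}{n} \sum_{i=1}^{n} \int Q(h_i \mid v_i) \log p(v_i \mid h_i)\, dh_i = \frac{1}{2} \Bigl( m \log(2\pi) + \log|\Psi| + \mathrm{Tr}\bigl(\Psi^{-1} C\bigr) - 2\, \mathrm{Tr}\bigl(\Psi^{-1} W U^T\bigr) + \mathrm{Tr}\bigl(W^T \Psi^{-1} W S\bigr) \Bigr) \tag{4}$$
from Eq. (1) with respect to the model parameters $W$ and $\Psi$. Definitions of $C$, $U$ and $S$ are given in Alg. 1. The M-step performs a gradient step in the Newton direction, since we want to allow stochastic gradients, fast GPU implementation, and dropout regularization. The Newton step is derived in the supplementary material, which also gives further details. Also in the E-step, RFN learning performs a gradient step using projected Newton or gradient projection methods. These projection methods require the Euclidean projection of the posterior means $\{(\mu_p)_i\}$ onto the non-convex feasible set:
$$\min_{\mu_i} \; \frac{1}{n} \sum_{i=1}^{n} \bigl(\mu_i - (\mu_p)_i\bigr)^T \bigl(\mu_i - (\mu_p)_i\bigr) \quad \text{s.t.} \quad \forall i: \mu_i \geq 0, \quad \forall j: \frac{1}{n} \sum_{i=1}^{n} \mu_{ij}^2 = 1. \tag{5}$$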
See supplementary material. ∎
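A projection onto non-negative, normalized means can be sketched as rectification followed by rescaling. The code below assumes that the projection of Eq. (6) has this rectify-then-normalize form, with each code unit scaled to unit second moment across samples; the function name `project` and the handling of units without any positive entry are assumptions made for illustration.

```python
import numpy as np

def project(Mu, eps=1e-12):
    """Rectify the means, then rescale each code unit to unit second moment.

    Mu : (n, l) array, one row of posterior means per sample.
    """
    Mu_hat = np.maximum(Mu, 0.0)                       # rectify
    scale = np.sqrt(np.mean(Mu_hat ** 2, axis=0))      # per-unit RMS over samples
    # Assumption: a unit with no positive entry cannot be normalized
    # and is left at zero instead of dividing by zero.
    safe = np.where(scale > eps, scale, 1.0)
    return Mu_hat / safe
```

Both operations are element-wise or column-wise, so the projection costs a single pass over the code matrix and maps well to GPUs.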
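Using the projection $\mathrm{P}$ defined in Eq. (6), the E-step updates for the posterior means $\mu_i$ are:
$$\mu_i^{\text{new}} = \mathrm{P}\bigl(\mu_i^{\text{old}} + \gamma\, (d - \mu_i^{\text{old}})\bigr), \qquad d = \mathrm{P}\bigl(\mu_i^{\text{old}} + \lambda\, H^{-1} \Sigma_p^{-1} \bigl((\mu_p)_i - \mu_i^{\text{old}}\bigr)\bigr), \tag{7}$$
where we set $H^{-1} = \Sigma_p$ for the projected Newton method (thus $H^{-1} \Sigma_p^{-1} = I$), and $H^{-1} = I$ for the projected gradient method. For the scaled gradient projection algorithm with reduced matrix, the $\epsilon$-active set for $i$ consists of all $j$ with $\mu_{ij} \leq \epsilon$. The reduced matrix $H$ is the Hessian $\Sigma_p^{-1}$ with $\epsilon$-active columns and rows $j$ fixed to unit vectors $e_j$. The resulting algorithm is a posterior regularization method with a gradient-based E- and M-step, leading to a generalized alternating minimization (GAM) algorithm. The RFN learning algorithm is given in Alg. 1. Dropout regularization can be included before E-step2 by randomly setting code units $\mu_{ij}$ to zero with a predefined dropout rate (note that convergence results will no longer hold).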
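One way to read the E-step update is as a scaled gradient step followed by a projection and a damped move toward the projected point. The sketch below assumes the update $\mu^{\text{new}} = \mathrm{P}(\mu^{\text{old}} + \gamma (d - \mu^{\text{old}}))$ with $d = \mathrm{P}(\mu^{\text{old}} + \lambda H^{-1} \Sigma_p^{-1} (\mu_p - \mu^{\text{old}}))$ and a rectify-then-normalize projection; the names `project`, `e_step`, `lam`, and `gamma` are hypothetical.

```python
import numpy as np

def project(Mu, eps=1e-12):
    """Rectify, then rescale each code unit to unit second moment (assumed form)."""
    Mu_hat = np.maximum(Mu, 0.0)
    scale = np.sqrt(np.mean(Mu_hat ** 2, axis=0))
    return Mu_hat / np.where(scale > eps, scale, 1.0)

def e_step(Mu_old, Mu_p, Sigma_p, lam=1.0, gamma=1.0, method="newton"):
    """One projected Newton or projected gradient update of the posterior means.

    Mu_old, Mu_p : (n, l) current means and unconstrained posterior means
    Sigma_p      : (l, l) posterior covariance; the Hessian is its inverse
    """
    l = Sigma_p.shape[0]
    Sigma_p_inv = np.linalg.inv(Sigma_p)
    # Newton scaling cancels the Hessian; plain gradient uses the identity.
    H_inv = Sigma_p if method == "newton" else np.eye(l)
    step = (Mu_p - Mu_old) @ (H_inv @ Sigma_p_inv).T   # row i: H^-1 Sigma_p^-1 (mu_p_i - mu_i)
    d = project(Mu_old + lam * step)
    return project(Mu_old + gamma * (d - Mu_old))
```

With `method="newton"` and both step sizes set to one, the update reduces to projecting the unconstrained posterior means directly, which is the cheapest variant.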
Theorem 2 (Convergence). The rectified factor network (RFN) learning algorithm given in Alg. 1 is a “generalized alternating minimization” (GAM) algorithm and converges to a solution that maximizes the objective $\mathcal{F}$.
Alg. 1 ensures a decrease of the M-step objective, which is convex in $W$ and $\Psi^{-1}$. The update with $\eta = 1$ leads to the minimum of the objective. Convexity of the objective guarantees a decrease in the M-step for $0 < \eta \leq 1$ if not in a minimum. Alg. 1 ensures a decrease of the E-step objective by using gradient projection methods. All other requirements for GAM convergence are also fulfilled. ∎
The goal of the RFN algorithm is to explain the data and its covariance structure. The expected approximation error is defined in line 14 of Alg. 1. Theorem 3 states that the RFN algorithm is correct, that is, it explains the data (low reconstruction error) and captures the covariance structure as well as possible.
The fixed point equation for the update is . Using the definition of and , we have . is the ridge regression solution of
where is the trace. After multiplying out all in , we obtain:
For the fixed point of , the update rule gives: . Thus, minimizes given and . Multiplying the Woodbury identity for from left and right by gives
Inserting this into the expression for and taking the trace gives
Using the trace norm (nuclear norm or Ky-Fan n-norm) on matrices, Eq. (13) states that the left hand side of Eq. (14) is quadratic in for . The trace norm of a positive semi-definite matrix is its trace and bounds the Frobenius norm . Thus, for , the covariance is approximated up to a quadratic error in according to Eq. (9). The diagonal is exactly modeled. ∎
Since the minimization of the expected reconstruction error is based on $\mu_i$, the quality of reconstruction depends on the correlation between $\mu_i$ and $v_i$. We ensure maximal information in $\mu_i$ on $v_i$ by the I-projection (the minimal Kullback-Leibler distance) of the posterior onto the family of rectified and normalized Gaussian distributions.
Table 1: results for the undercomplete (50 code units), complete (100 code units), and overcomplete (150 code units) settings.
We assess the performance of rectified factor networks (RFNs) as unsupervised methods for data representation. We compare (1) RFN: rectified factor networks, (2) RFNn: RFNs without normalization, (3) DAE: denoising autoencoders with ReLUs, (4) RBM: restricted Boltzmann machines with Gaussian visible units, (5) FAsp: factor analysis with Jeffrey’s prior on the hidden units, which is sparser than a Laplace prior, (6) FAlap: factor analysis with Laplace prior on the hidden units, (7) ICA: independent component analysis by FastICA, (8) SFA: sparse factor analysis with a Laplace prior on the parameters, (9) FA: standard factor analysis, (10) PCA: principal component analysis. The number of components is fixed to 50, 100 and 150 for each method. We generated nine different benchmark datasets (D1 to D9), where each dataset consists of 100 instances. Each instance has 100 samples and 100 features, resulting in a 100×100 matrix. Into these matrices, biclusters are implanted. A bicluster is a pattern of particular features which is found in particular samples, like a pathway activated in some samples. An optimal representation will only code the biclusters that are present in a sample. The datasets have different noise levels and different bicluster sizes. Large biclusters have 20–30 samples and 20–30 features, while small biclusters have 3–8 samples and 3–8 features. The pattern’s signal strength in a particular sample was randomly drawn from a Gaussian distribution. Finally, to each matrix, zero-mean Gaussian background noise was added with standard deviation 1, 5, or 10. The datasets are characterized by Dx = $(\sigma, n_1, n_2)$ with background noise $\sigma$, number of large biclusters $n_1$, and number of small biclusters $n_2$: D1=(1,10,10), D2=(5,10,10), D3=(10,10,10), D4=(1,15,5), D5=(5,15,5), D6=(10,15,5), D7=(1,5,15), D8=(5,5,15), D9=(10,5,15).
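The data generation described above can be sketched as follows. Several details are not specified in the text and are assumptions made here for illustration: the feature pattern of each bicluster is drawn from a standard normal, the per-sample signal strength from a Gaussian with mean 1 and standard deviation 1, and overlapping biclusters add up. The function name `make_instance` is hypothetical.

```python
import numpy as np

def make_instance(sigma, n_large, n_small, n=100, m=100, seed=0):
    """Generate one n x m matrix with implanted biclusters and background noise."""
    rng = np.random.default_rng(seed)
    X = np.zeros((n, m))
    membership = []
    # Large biclusters span 20-30 samples/features, small ones 3-8.
    sizes = [(20, 30)] * n_large + [(3, 8)] * n_small
    for lo, hi in sizes:
        rows = rng.choice(n, size=rng.integers(lo, hi + 1), replace=False)
        cols = rng.choice(m, size=rng.integers(lo, hi + 1), replace=False)
        pattern = rng.normal(size=cols.size)              # assumed feature pattern
        strength = rng.normal(1.0, 1.0, size=rows.size)   # assumed mean and std
        X[np.ix_(rows, cols)] += np.outer(strength, pattern)
        membership.append((rows, cols))
    # Zero-mean Gaussian background noise with standard deviation sigma.
    X += rng.normal(0.0, sigma, size=(n, m))
    return X, membership
```

For example, `make_instance(sigma=5, n_large=10, n_small=10)` would correspond to one instance of the D2 setting under these assumptions.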
We evaluate the methods by (1) the sparseness of the components, (2) the input reconstruction error from the code, and (3) the covariance reconstruction error for generative models. For RFNs, sparseness is the percentage of the components that are exactly 0, while for other methods it is the percentage of components with an absolute value smaller than 0.01. The reconstruction error is the sum of the squared errors across samples. The covariance reconstruction error is the Frobenius norm of the difference between model and data covariance. See the supplement for more details on the data and for information on hyperparameter selection for the different methods. Tab. 1 gives averaged results for models with 50 (undercomplete), 100 (complete) and 150 (overcomplete) coding units. Results are the mean of 900 instances consisting of 100 instances for each dataset D1 to D9. In the supplement, we separately tabulate the results for D1 to D9 and confirm them with different noise levels. FAlap did not yield sparse codes since the variational parameter did not push the absolute representations below the threshold of 0.01. The variational approximation to the Laplacian is a Gaussian distribution. RFNs had the sparsest code, the lowest reconstruction error, and the lowest covariance approximation error of all methods that yielded sparse representations (SP > 10%).
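The three evaluation criteria can be written down directly. This is a sketch under stated assumptions: the function names are hypothetical, the data matrix is mean-centered with one sample per row, and the model covariance is taken to be the standard factor analysis form $W W^T + \Psi$, which may differ from the expression used for each compared method.

```python
import numpy as np

def sparseness(H, exact_zero=True, tol=0.01):
    """Percentage of code components that count as zero.

    exact_zero=True counts exact zeros (as for RFNs); otherwise a component
    counts as zero if its absolute value is below tol.
    """
    H = np.asarray(H, dtype=float)
    zero = (H == 0) if exact_zero else (np.abs(H) < tol)
    return 100.0 * zero.mean()

def reconstruction_error(V, V_hat):
    """Sum of squared errors across samples."""
    return float(np.sum((np.asarray(V) - np.asarray(V_hat)) ** 2))

def covariance_error(V, W, psi_diag):
    """Frobenius norm between data covariance and W W^T + Psi (assumed model form)."""
    C = V.T @ V / V.shape[0]                 # covariance of mean-centered data
    model = W @ W.T + np.diag(psi_diag)
    return float(np.linalg.norm(C - model, ord="fro"))
```

The two sparseness modes mirror the distinction in the text: rectified codes contain exact zeros, whereas codes from the other methods need a small threshold.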
We assess the performance of rectified factor networks (RFNs) when used for pretraining of deep networks. Stacked RFNs are obtained by first training a single-layer RFN and then passing on the resulting representation as input for training the next RFN. The deep network architectures use an RFN-pretrained first layer (RFN-1) or stacks of 3 RFNs giving a 3-hidden-layer network. The classification performance of deep networks with RFN-pretrained layers was compared to (i) support vector machines, (ii) deep networks pretrained by stacking denoising autoencoders (SDAE), (iii) stacking regular autoencoders (SAE), (iv) restricted Boltzmann machines (RBM), and (v) stacking restricted Boltzmann machines (DBN).
Table 2: The test error rate is reported together with the 95% confidence interval. The best performing method is given in bold, as well as those for which confidence intervals overlap. The first column gives the dataset, the second the size of training, validation and test set, and the last column indicates the number of hidden layers of the selected deep network. In only one case RFN pretraining was significantly worse than the best method but still the second best. In six out of the nine experiments RFN pretraining performed best, where in four cases it was significantly the best.
The benchmark datasets and results are taken from previous publications [25, 26, 27, 28] and contain: (i) MNIST (original MNIST), (ii) basic (a smaller subset of MNIST for training), (iii) bg-rand (MNIST with random noise background), (iv) bg-img (MNIST with random image background), (v) rect (discrimination between tall and wide rectangles), (vi) rect-img (discrimination between tall and wide rectangular images overlaid on random background images), (vii) convex (discrimination between convex and concave shapes), (viii) CIFAR-10 (60k color images in 10 classes), and (ix) NORB (29,160 stereo image pairs of 5 generic categories). For each dataset, the size of its training, validation and test set is given in the second column of Tab. 2. As preprocessing we only performed median centering. Model selection is based on the validation set performance. The RFN hyperparameters are (i) the number of units per layer and (ii) the dropout rate. The learning rate was fixed to its default value. For supervised fine-tuning with stochastic gradient descent, we selected the learning rate, the masking noise, and the number of layers. Fine-tuning was stopped based on the validation set performance. The test error rates together with the 95% confidence interval for deep network pretraining by RFNs and other methods are given in Tab. 2. Fig. 2(b) shows learned filters. The result of the best performing method is given in bold, as well as the result of those methods for which confidence intervals overlap. RFNs were only once significantly worse than the best method but still the second best. In six out of the nine experiments RFNs performed best, where in four cases they were significantly the best.
Using RFNs we analyzed gene expression datasets of two projects in the lead optimization phase of a big pharmaceutical company . The first project aimed at finding novel antipsychotics that target PDE10A. The second project was an oncology study that focused on compounds inhibiting the FGF receptor. In both projects, the expression data was summarized by FARMS  and standardized. RFNs were trained with 500 hidden units, no masking noise, and a learning rate of . The identified transcriptional modules are shown in Fig. 3. Panels A and B illustrate that RFNs found rare and small events in the input. In panel A only a few drugs are genotoxic (rare event) by downregulating the expression of a small number of tubulin genes (small event). The genotoxic effect stems from the formation of micronuclei (panel C and D) since the mitotic spindle apparatus is impaired. Also in panel B, RFN identified a rare and small event which is a transcriptional module that has a negative feedback to the MAPK signaling pathway. Rare events are unexpectedly inactive drugs (black dots), which do not inhibit the FGF receptor. Both findings were not detected by other unsupervised methods, while they were highly relevant and supported decision-making in both projects .
We have introduced rectified factor networks (RFNs) for constructing very sparse and non-linear input representations with many coding units in a generative framework. Like factor analysis, RFN learning explains the data variance by its model parameters. The RFN learning algorithm is a posterior regularization method which enforces non-negative and normalized posterior means. We have shown that RFN learning is a generalized alternating minimization method which can be proved to converge and to be correct. RFNs had the sparsest code, the lowest reconstruction error, and the lowest covariance approximation error of all methods that yielded sparse representations (SP > 10%). RFNs have shown that they improve performance if used for pretraining of deep networks. In two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that were so far missed by other unsupervised methods. These gene modules were highly relevant and supported the decision-making in both studies. RFNs are geared to large datasets, sparse coding, and many representational units, therefore they have high potential as unsupervised deep learning techniques.
The Tesla K40 used for this research was donated by the NVIDIA Corporation.
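G. E. Hinton and R. R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.
N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15:1929–1958, 2014.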
A. Ben-Tal and A. Nemirovski. Interior Point Polynomial Time Methods for Linear Programming, Conic Quadratic Programming, and Semidefinite Programming. In Lectures on Modern Convex Optimization, chapter 6, pages 377–442. Society for Industrial and Applied Mathematics, 2001.
This supplement contains additional information complementing the main manuscript and is structured as follows: First, the rectified factor network (RFN) learning algorithm with E- and M-step updates, weight decay and dropout regularization is given in Section S2. In Section S3, we prove that the RFN learning algorithm is a “generalized alternating minimization” (GAM) algorithm and converges to a solution that maximizes the RFN objective. The correctness of the RFN algorithm is proved in Section S4. Section S5 describes the maximum likelihood factor analysis model and the model selection by the EM algorithm. The RFN objective, which has to be maximized, is described in Section S6. Next, RFN’s GAM algorithm via gradient descent both in the M-step and the E-step is reported in Section S7. The following sections S8 and S9 describe the gradient-based M- and E-step, respectively. In Section S10, we describe how the sparseness of RFNs can be controlled by a Gaussian prior. Additional information on the selected hyperparameters of the benchmark methods is given in Section S11. The sections S12 and S13 describe the data generation of the benchmark datasets and report the results for three different experimental settings, namely for extracting 50 (undercomplete), 100 (complete) or 150 (overcomplete) factors / hidden units. Finally, Section S14 describes experiments that we conducted to assess the performance of RFN first layer pretraining on CIFAR-10 and CIFAR-100 for three deep convolutional network architectures: (i) the AlexNet [31, 32], (ii) Deeply Supervised Networks (DSN), and (iii) our 5-Convolution-Network-In-Network (5C-NIN).
Algorithm S2 is the rectified factor network (RFN) learning algorithm. The RFN algorithm calls Algorithm S3 to project the posterior probability $p(h_i \mid v_i)$ onto the family of rectified and normalized variational distributions $Q(h_i \mid v_i)$. Algorithm S3 guarantees an improvement of the E-step objective. Projection Algorithm S3 relies on different projections, where a more complicated projection is tried if a simpler one failed to improve the E-step objective. If all of the following Newton-based gradient projection methods fail to decrease the E-step objective, then projection Algorithm S3 falls back to gradient projection methods. First the equality constraints are solved and inserted into the objective. Thereafter, the constraints are convex and gradient projection methods are applied. This approach is called the “generalized reduced gradient method”, which is our preferred alternative method. If this method fails, then Rosen’s gradient projection method is used. Finally, the method of Haug and Arora is used.
First we consider Newton-based projection methods, which are used by Algorithm S3. Algorithm S5 performs a simple projection, which is the projected Newton method with learning rate set to one. This projection is very fast and ideally suited to be performed on GPUs for RFNs with many coding units. Algorithm S4 is a fast and simple projection without normalization, which is even simpler than Algorithm S5. Algorithm S6 generalizes Algorithm S5 by introducing step sizes $\lambda$ and $\gamma$. The step size $\lambda$ scales the gradient step, while $\gamma$ scales the difference between the old projection and the new projection. For both $\lambda$ and $\gamma$, annealing steps, that is, learning rate decay, are used to find an appropriate update.
If these Newton-based update rules do not work, then Algorithm S7 is used. Algorithm S7 performs a scaled projection with a reduced Hessian matrix $H$ instead of the full Hessian $\Sigma_p^{-1}$. For computing $H$, an $\epsilon$-active set is determined, which consists of all $j$ with $\mu_{ij} \leq \epsilon$. The reduced matrix $H$ is the Hessian $\Sigma_p^{-1}$ with $\epsilon$-active columns and rows $j$ fixed to the unit vector $e_j$.
The RFN algorithm allows regularization of the parameters $W$ and $\Psi$ (off-diagonal elements) by weight decay. Priors on the parameters can be introduced. If the priors are convex functions, then convergence of the RFN algorithm is still ensured. The weight decay Algorithm S8 can optionally be used after the M-step of Algorithm S2. Coding units can be regularized by dropout. However, dropout is not covered by the convergence proof for the RFN algorithm. The dropout Algorithm S9 is applied during the projection between rectifying and normalization. Methods like mini-batches or other stochastic gradient methods are not covered by the convergence proof for the RFN algorithm. However, it has been shown how to generalize the GAM convergence proof to mini-batches, as is done for the incremental EM algorithm. Dropout and other stochastic gradient methods can be shown to converge similarly to mini-batches.
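Applying dropout between rectification and normalization can be sketched as below. This is an illustration rather than the authors' Algorithm S9: the function name `project_with_dropout`, the Bernoulli mask, and the treatment of units that end up entirely zero are assumptions.

```python
import numpy as np

def project_with_dropout(Mu, rate, rng, eps=1e-12):
    """Rectify, drop code units at random, then normalize each unit.

    Mu   : (n, l) posterior means, one row per sample
    rate : probability of setting a single code value to zero
    rng  : numpy random Generator used to draw the dropout mask
    """
    Mu_hat = np.maximum(Mu, 0.0)                    # rectify
    keep = rng.random(Mu_hat.shape) >= rate         # True where the value is kept
    Mu_hat = Mu_hat * keep                          # dropout after rectification
    scale = np.sqrt(np.mean(Mu_hat ** 2, axis=0))   # normalize what remains
    # Assumption: units that are zero everywhere stay at zero.
    return Mu_hat / np.where(scale > eps, scale, 1.0)
```

Because normalization comes last, the surviving code values of each unit still satisfy the unit second-moment constraint after dropout.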
The rectified factor network (RFN) learning algorithm given in Algorithm S2 is a “generalized alternating minimization” (GAM) algorithm and converges to a solution that maximizes the objective $\mathcal{F}$.
The factor analysis EM algorithm is given by Eq. (81) and Eq. (82) in Section S5. Algorithm S2 is the factor analysis EM algorithm with a modified E-step and M-step. The E-step is modified by constraining the variational distribution to non-negative means and by normalizing its means across the samples. The M-step is modified to a Newton direction gradient step.
Like EM factor analysis, Algorithm S2 aims at maximizing the negative free energy , which is