Rectified Factor Networks

02/23/2015 ∙ by Djork-Arné Clevert, et al. ∙ Johannes Kepler University Linz

We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input. RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods.

1 Introduction

The success of deep learning is to a large part based on advanced and efficient input representations [1, 2, 3, 4]. These representations are sparse and hierarchical. Sparse representations of the input are in general obtained by rectified linear units (ReLU) [5, 6] and dropout [7]. The key advantage of sparse representations is that dependencies between coding units are easy to model and to interpret. Most importantly, distinct concepts are much less likely to interfere in sparse representations. Using sparse representations, similarities of samples often break down to co-occurrences of features in these samples. In bioinformatics, sparse codes excelled in biclustering of gene expression data [8] and in finding DNA sharing patterns between humans and Neanderthals [9].

Representations learned by ReLUs are not only sparse but also non-negative. Non-negative representations do not code the degree of absence of events or objects in the input. As the vast majority of events is supposed to be absent, coding their degree of absence would introduce a high level of random fluctuations. We also aim for non-linear input representations so that models can be stacked to construct hierarchical representations. Finally, the representations are supposed to have a large number of coding units to allow coding of rare and small events in the input. Rare events are only observed in few samples, like rare side effects in drug design, rare genotypes in genetics, or small customer groups in e-commerce. Small events affect only few input components, like pathways with few genes in biology, few relevant mutations in oncology, or a pattern of few products in e-commerce. In summary, our goal is to construct input representations that (1) are sparse, (2) are non-negative, (3) are non-linear, (4) use many code units, and (5) model structures in the input data (see next paragraph).

Current unsupervised deep learning approaches like autoencoders or restricted Boltzmann machines (RBMs) do not model specific structures in the data. On the other hand, generative models explain structures in the data but their codes cannot be enforced to be sparse and non-negative. The input representation of a generative model is its posterior’s mean, median, or mode, which depends on the data. Therefore sparseness and non-negativity cannot be guaranteed independent of the data. For example, generative models with rectified priors, like rectified factor analysis, have zero posterior probability for negative values, therefore their means are positive and not sparse [10, 11]. Sparse priors do not guarantee sparse posteriors as seen in the experiments with factor analysis with Laplacian and Jeffrey’s prior on the factors (see Tab. 1). To address the data dependence of the code, we employ the posterior regularization method [12]. This method separates model characteristics from data dependent characteristics that are enforced by constraints on the model’s posterior.

We aim at representations that are feasible for many code units and massive datasets, therefore the computational complexity of generating a code is essential in our approach. For non-Gaussian priors, computing the posterior mean of a new input requires either numerically solving an integral or iteratively updating variational parameters [13]. In contrast, for Gaussian priors the posterior mean is the product between the input and a matrix that is independent of the input. Still, the posterior regularization method leads to a quadratic (in the number of coding units) constrained optimization problem in each E-step (see Eq. (3) below). To speed up computation, we do not solve the quadratic problem but perform a gradient step. To allow for stochastic gradients and fast GPU implementations, the M-step is also a gradient step. These E-step and M-step modifications of the posterior regularization method result in a generalized alternating minimization (GAM) algorithm [21]. We will show that the GAM algorithm used for RFN learning (i) converges and (ii) is correct. Correctness means that the RFN codes are non-negative, sparse, have a low reconstruction error, and explain the covariance structure of the data.

2 Rectified Factor Network

Our goal is to construct representations of the input that (1) are sparse, (2) are non-negative, (3) are non-linear, (4) use many code units, and (5) model structures in the input. Structures in the input are identified by a generative model, where the model assumptions determine which input structures the model explains. We want to model the covariance structure of the input, therefore we choose maximum likelihood factor analysis as the model. The constraints on the input representation are enforced by the posterior regularization method [12]. Non-negative constraints lead to sparse and non-linear codes, while normalization constraints scale the signal part of each hidden (code) unit. Normalizing constraints prevent generative models from explaining away rare and small signals as noise. Explaining away becomes a serious problem for models with many coding units since their capacities are not utilized. Normalizing ensures that all hidden units are used, but at the cost of also coding random and spurious signals. Spurious and true signals must be separated in a subsequent step either by supervised techniques, by evaluating coding units via additional data, or by domain experts.

A generative model with hidden units $h$ and data $v$ is defined by its prior $p(h)$ and its likelihood $p(v \mid h)$. The full model distribution $p(h, v) = p(v \mid h)\, p(h)$ can be expressed by the model’s posterior $p(h \mid v)$ and its evidence (marginal likelihood) $p(v)$: $p(h, v) = p(h \mid v)\, p(v)$. The representation of input $v$ is the posterior’s mean, median, or mode. The posterior regularization method introduces a variational distribution $Q(h \mid v) \in \mathcal{Q}$ from a family $\mathcal{Q}$, which approximates the posterior $p(h \mid v)$. We choose $\mathcal{Q}$ to constrain the posterior means to be non-negative and normalized. The full model distribution $p(h, v)$ contains all model assumptions and, thereby, defines which structures of the data are modeled. $Q(h \mid v)$ contains data dependent constraints on the posterior, therefore on the code.

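For data $\{v\} = \{v_1, \ldots, v_n\}$, the posterior regularization method maximizes the objective $\mathcal{F}$ [12]:

$$
\begin{aligned}
\mathcal{F} &= \frac{1}{n} \sum_{i=1}^{n} \log p(v_i) - \frac{1}{n} \sum_{i=1}^{n} D_{\mathrm{KL}}\big(Q(h_i \mid v_i) \,\|\, p(h_i \mid v_i)\big) \\
&= \frac{1}{n} \sum_{i=1}^{n} \int Q(h_i \mid v_i) \log p(v_i \mid h_i) \, dh_i - \frac{1}{n} \sum_{i=1}^{n} D_{\mathrm{KL}}\big(Q(h_i \mid v_i) \,\|\, p(h_i)\big),
\end{aligned}
\tag{1}
$$

where $D_{\mathrm{KL}}$ is the Kullback-Leibler distance. Maximizing $\mathcal{F}$ achieves two goals simultaneously: (1) extracting desired structures and information from the data as imposed by the generative model and (2) ensuring desired code properties via $Q \in \mathcal{Q}$.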

Figure 1: Factor analysis model: hidden units (factors) $h$, visible units $v$, weight matrix $W$, noise $\epsilon$.

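The factor analysis model $v = W h + \epsilon$ extracts the covariance structure of the data. The prior $h \sim \mathcal{N}(0, I)$ of the hidden units (factors) $h \in \mathbb{R}^{l}$ and the noise $\epsilon \sim \mathcal{N}(0, \Psi)$ of visible units (observations) $v \in \mathbb{R}^{m}$ are independent. The model parameters are the weight (loading) matrix $W \in \mathbb{R}^{m \times l}$ and the noise covariance matrix $\Psi \in \mathbb{R}^{m \times m}$. We assume diagonal $\Psi$ to explain correlations between input components by the hidden units and not by correlated noise. The factor analysis model is depicted in Fig. 1. Given the mean-centered data $\{v\} = \{v_1, \ldots, v_n\}$, the posterior $p(h_i \mid v_i)$ is Gaussian with mean vector $(\mu_p)_i$ and covariance matrix $\Sigma_p$:

$$
(\mu_p)_i = \big(I + W^T \Psi^{-1} W\big)^{-1} W^T \Psi^{-1} v_i, \qquad \Sigma_p = \big(I + W^T \Psi^{-1} W\big)^{-1}.
\tag{2}
$$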
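The posterior of Eq. (2) can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: samples are stored as rows of `V`, the diagonal noise covariance is stored as a vector `psi`, and the function name `fa_posterior` as well as the toy data are hypothetical and not taken from the paper.

```python
import numpy as np

def fa_posterior(V, W, psi):
    """Posterior means and covariance of a factor analysis model.

    V:   (n, m) mean-centered data, one sample per row
    W:   (m, l) weight (loading) matrix
    psi: (m,)   diagonal of the noise covariance matrix Psi
    """
    l = W.shape[1]
    Wt_Pinv = W.T / psi                                # W^T Psi^{-1}, shape (l, m)
    Sigma_p = np.linalg.inv(np.eye(l) + Wt_Pinv @ W)   # (I + W^T Psi^{-1} W)^{-1}
    A = Sigma_p @ Wt_Pinv                              # matrix independent of the input
    Mu_p = V @ A.T                                     # row i holds (mu_p)_i = A v_i
    return Mu_p, Sigma_p

# Toy data (hypothetical sizes, for illustration only).
rng = np.random.default_rng(0)
n, m, l = 200, 6, 3
W = rng.normal(size=(m, l))
psi = rng.uniform(0.5, 1.5, size=m)
V = rng.normal(size=(n, m))
Mu_p, Sigma_p = fa_posterior(V, W, psi)
```

The matrix `A` does not depend on the input, so coding a new sample is a single matrix-vector product.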

A rectified factor network (RFN) consists of a single or stacked factor analysis model(s) with constraints on the posterior. To incorporate the posterior constraints into the factor analysis model, we use the posterior regularization method that maximizes the objective given in Eq. (1) [12]. Like the expectation-maximization (EM) algorithm, the posterior regularization method alternates between an E-step and an M-step. Minimizing the first $D_{\mathrm{KL}}$ of Eq. (1) with respect to $Q$ leads to a constrained optimization problem. For Gaussian distributions, the solution with $(\mu_p)_i$ and $\Sigma_p$ from Eq. (2) is $Q(h_i \mid v_i) \sim \mathcal{N}(\mu_i, \Sigma)$ with $\Sigma = \Sigma_p$ and the quadratic problem:

$$
\min_{\mu_i} \; \frac{1}{n} \sum_{i=1}^{n} \big(\mu_i - (\mu_p)_i\big)^T \Sigma_p^{-1} \big(\mu_i - (\mu_p)_i\big), \quad \text{s.t.} \quad \forall i: \mu_i \ge 0, \quad \forall j: \frac{1}{n} \sum_{i=1}^{n} \mu_{ij}^2 = 1,
\tag{3}
$$

where “$\ge$” is component-wise. This is a constrained non-convex quadratic optimization problem in the number of hidden units, which is too complex to be solved in each EM iteration. Therefore, we perform a step of the gradient projection algorithm [14, 15], which first performs a gradient step and then projects the result onto the feasible set. We start with a step of the projected Newton method, then we try the gradient projection algorithm, and thereafter the scaled gradient projection algorithm with reduced matrix [16] (see also [15]). If these methods fail to decrease the objective in Eq. (3), we use the generalized reduced method [17]. It solves each equality constraint for one variable and inserts it into the objective while ensuring convex constraints. Alternatively, we use Rosen’s gradient projection method [18] or its improvement [19]. These methods guarantee a decrease of the E-step objective.

Since the projection by Eq. (6) is very fast, the projected Newton and projected gradient update is very fast, too. A projected Newton step requires steps (see Eq. (7) and defined in Theorem 1), a projected gradient step requires steps, and a scaled gradient projection step requires steps. The RFN complexity per iteration is (see Alg. 1). In contrast, a quadratic program solver typically requires for the variables (the means of the hidden units for all samples) steps to find the minimum [20]. We exemplify these values on our benchmark datasets MNIST (k, ) and CIFAR (k, ). The speedup with projected Newton or projected gradient in contrast to a quadratic solver is , which gives speedup ratios of for MNIST and for CIFAR. These speedup ratios show that efficient E-step updates are essential for RFN learning. Furthermore, on our computers, RAM restrictions limited quadratic program solvers to problems with k.

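The M-step decreases the expected reconstruction error

$$
\begin{aligned}
\mathcal{E} &= - \frac{1}{n} \sum_{i=1}^{n} \int Q(h_i \mid v_i) \log p(v_i \mid h_i) \, dh_i \\
&= \frac{1}{2} \Big( m \log(2\pi) + \log |\Psi| + \mathrm{Tr}\big(\Psi^{-1} C\big) - 2\, \mathrm{Tr}\big(\Psi^{-1} W U^T\big) + \mathrm{Tr}\big(W^T \Psi^{-1} W S\big) \Big)
\end{aligned}
\tag{4}
$$

from Eq. (1) with respect to the model parameters $W$ and $\Psi$. Definitions of $C$, $U$ and $S$ are given in Alg. 1. The M-step performs a gradient step in the Newton direction, since we want to allow stochastic gradients, fast GPU implementation, and dropout regularization. The Newton step is derived in the supplementary material, which also gives further details. In the E-step, too, RFN learning performs a gradient step using projected Newton or gradient projection methods. These projection methods require the Euclidean projection $\mathrm{P}$ of the posterior means $\{(\mu_p)_i\}$ onto the non-convex feasible set:

$$
\min_{\mu_i} \; \frac{1}{n} \sum_{i=1}^{n} \big(\mu_i - (\mu_p)_i\big)^T \big(\mu_i - (\mu_p)_i\big), \quad \text{s.t.} \quad \forall i: \mu_i \ge 0, \quad \forall j: \frac{1}{n} \sum_{i=1}^{n} \mu_{ij}^2 = 1.
\tag{5}
$$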

The following Theorem 1 gives the Euclidean projection as solution to Eq. (5).

Theorem 1 (Euclidean Projection).

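If, for hidden unit $j$ with $1 \le j \le l$, at least one $(\mu_p)_{ij}$ is positive, then the solution to optimization problem Eq. (5) is

$$
\mu_{ij} = \big[\mathrm{P}\big((\mu_p)_i\big)\big]_j = \frac{\hat{\mu}_{ij}}{\sqrt{\frac{1}{n} \sum_{s=1}^{n} \hat{\mu}_{sj}^2}}, \qquad \hat{\mu}_{ij} = \max\{0, (\mu_p)_{ij}\}.
\tag{6}
$$

If all $(\mu_p)_{ij}$ are non-positive for unit $j$, then the optimization problem Eq. (5) has the solution $\mu_{ij} = \sqrt{n}$ for $i = \arg\max_{\hat{\imath}} \{(\mu_p)_{\hat{\imath} j}\}$ and $\mu_{ij} = 0$ otherwise.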

Proof.

See supplementary material. ∎
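The projection of Theorem 1 amounts to rectifying the posterior means and rescaling each hidden unit to unit second moment across samples. The following NumPy sketch illustrates this under stated assumptions: samples are rows and hidden units are columns, the function name `project` and the toy matrix are hypothetical, and the handling of a unit without any positive entry (all mass on the largest entry of that unit) is one reading of the degenerate case.

```python
import numpy as np

def project(Mu_p):
    """Rectify and normalize posterior means (rows: samples, columns: hidden units)."""
    n = Mu_p.shape[0]
    Mu_hat = np.maximum(Mu_p, 0.0)                  # rectification: max{0, (mu_p)_ij}
    scale = np.sqrt(np.mean(Mu_hat ** 2, axis=0))   # sqrt(1/n * sum_s mu_hat_sj^2) per unit
    Mu = np.zeros_like(Mu_hat)
    active = scale > 0                              # units with at least one positive entry
    Mu[:, active] = Mu_hat[:, active] / scale[active]
    # Assumed degenerate case: a unit with no positive entry puts sqrt(n)
    # on its largest (least negative) entry and zero elsewhere.
    for j in np.flatnonzero(~active):
        Mu[np.argmax(Mu_p[:, j]), j] = np.sqrt(n)
    return Mu

# Toy input: unit 0 has positive entries, unit 1 has none.
Mu_p = np.array([[1.0, -2.0],
                 [-3.0, -1.0],
                 [2.0, -4.0],
                 [0.5, -0.5]])
Mu = project(Mu_p)
```

Both columns of `Mu` are non-negative and satisfy the normalization constraint of Eq. (5), which is what makes the projection cheap: it needs only one pass over the posterior means.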

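Using the projection $\mathrm{P}$ defined in Eq. (6), the E-step updates for the posterior means $\mu_i$ are:

$$
\mu_i^{\mathrm{new}} = \mathrm{P}\big(\mu_i^{\mathrm{old}} + \gamma\, (d - \mu_i^{\mathrm{old}})\big), \qquad d = \mathrm{P}\big(\mu_i^{\mathrm{old}} + \lambda\, H^{-1} \Sigma_p^{-1} \big((\mu_p)_i - \mu_i^{\mathrm{old}}\big)\big),
\tag{7}
$$

where we set $H^{-1} = \Sigma_p$ for the projected Newton method (thus $H^{-1} \Sigma_p^{-1} = I$), and $H^{-1} = I$ for the projected gradient method. For the scaled gradient projection algorithm with reduced matrix, the $\epsilon$-active set for $i$ consists of all $j$ with $\mu_{ij} \le \epsilon$. The reduced matrix $H$ is the Hessian $\Sigma_p^{-1}$ with $\epsilon$-active columns and rows $j$ fixed to unit vectors $e_j$. The resulting algorithm is a posterior regularization method with a gradient based E- and M-step, leading to a generalized alternating minimization (GAM) algorithm [21]. The RFN learning algorithm is given in Alg. 1. Dropout regularization can be included before E-step2 by randomly setting code units to zero with a predefined dropout rate (note that the convergence results will then no longer hold).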

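1:  $C = \frac{1}{n} \sum_{i=1}^{n} v_i v_i^T$
2:  while STOP=false do
3:     ——E-step1——
4:     for all $1 \le i \le n$ do
5:        $(\mu_p)_i = (I + W^T \Psi^{-1} W)^{-1} W^T \Psi^{-1} v_i$
6:     end for
7:     $\Sigma = \Sigma_p = (I + W^T \Psi^{-1} W)^{-1}$
8:     ——Constraint Posterior——
9:     (1) projected Newton, (2) projected gradient, (3) scaled gradient projection, (4) generalized reduced method, (5) Rosen’s gradient projection
10:     ——E-step2——
11:     $U = \frac{1}{n} \sum_{i=1}^{n} v_i \mu_i^T$
12:     $S = \frac{1}{n} \sum_{i=1}^{n} \mu_i \mu_i^T + \Sigma$
13:     ——M-step——
14:     $E = C - U W^T - W U^T + W S W^T$
15:     $W = W + \eta\, (U S^{-1} - W)$
16:     for all $1 \le k \le m$ do
17:        $\Psi_{kk} = \Psi_{kk} + \eta\, (E_{kk} - \Psi_{kk})$
18:     end for
19:     if stopping criterion is met: STOP=true
20:  end while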

Complexity: objective : ; E-step1: ; projected Newton: ; projected gradient: ; scaled gradient projection: ; E-step2: ; M-step: ; overall complexity with projected Newton / gradient for : .

Algorithm 1 Rectified Factor Network.
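One pass through the loop of Alg. 1 can be sketched in NumPy as follows. The sketch rests on several assumptions that are not confirmed by the text: the constrained posterior is obtained by a single projected Newton step with unit step sizes, which reduces to projecting the unconstrained posterior means; the M-step uses damped updates with a step size `eta`; every hidden unit has at least one positive posterior mean; and all function and variable names are hypothetical.

```python
import numpy as np

def project(Mu_p):
    """Rectify, then scale each hidden unit to unit second moment across samples."""
    Mu_hat = np.maximum(Mu_p, 0.0)
    scale = np.sqrt(np.mean(Mu_hat ** 2, axis=0))
    return Mu_hat / np.maximum(scale, 1e-12)   # assumes each unit has a positive entry

def rfn_iteration(V, W, psi, eta=0.1):
    """One E-step / M-step pass (assumed update rules, see the lead-in)."""
    n, m = V.shape
    l = W.shape[1]
    C = V.T @ V / n                                   # data covariance (V is mean-centered)
    # E-step1: unconstrained Gaussian posterior
    Wt_Pinv = W.T / psi                               # W^T Psi^{-1}
    Sigma = np.linalg.inv(np.eye(l) + Wt_Pinv @ W)
    Mu_p = V @ (Sigma @ Wt_Pinv).T
    # Constrained posterior: simplest choice, projection of the posterior means
    Mu = project(Mu_p)
    # E-step2: sufficient statistics
    U = V.T @ Mu / n
    S = Mu.T @ Mu / n + Sigma
    # M-step: damped step towards the minimizer of the reconstruction error
    E = C - U @ W.T - W @ U.T + W @ S @ W.T
    W_new = W + eta * (U @ np.linalg.inv(S) - W)
    psi_new = psi + eta * (np.diag(E) - psi)
    return W_new, psi_new, Mu

# Toy run on random data (hypothetical sizes).
rng = np.random.default_rng(0)
V = rng.normal(size=(500, 8))
V -= V.mean(axis=0)                                   # mean-centering
W = 0.1 * rng.normal(size=(8, 4))
psi = np.ones(8)
for _ in range(20):
    W, psi, Mu = rfn_iteration(V, W, psi, eta=0.1)
```

Because the loop body consists of matrix products plus one small $l \times l$ inversion, it maps directly onto mini-batches and GPU matrix libraries.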

3 Convergence and Correctness of RFN Learning

Convergence of RFN Learning.

Theorem 2 states that Alg. 1 converges to a maximum of the objective $\mathcal{F}$.

Theorem 2 (RFN Convergence).

The rectified factor network (RFN) learning algorithm given in Alg. 1 is a “generalized alternating minimization” (GAM) algorithm and converges to a solution that maximizes the objective $\mathcal{F}$.

Proof.

We present a sketch of the proof, which is given in detail in the supplement. For convergence, we show that Alg. 1 is a GAM algorithm, which converges according to Proposition 5 in [21].

Alg. 1 ensures a decrease of the M-step objective, which is convex in $W$ and $\Psi^{-1}$. The update with $\eta = 1$ leads to the minimum of the objective. Convexity of the objective guarantees a decrease in the M-step for $0 < \eta \le 1$ if not in a minimum. Alg. 1 ensures a decrease of the E-step objective by using gradient projection methods. All other requirements for GAM convergence are also fulfilled. ∎

Proposition 5 in [21] is based on Zangwill’s generalized convergence theorem, thus updates of the RFN algorithm are viewed as point-to-set mappings [22]. Therefore the numerical precision, the choice of the methods in the E-step, and GPU implementations are covered by the proof.

Correctness of RFN Learning.

The goal of the RFN algorithm is to explain the data and its covariance structure. The expected approximation error $E$ is defined in line 14 of Alg. 1. Theorem 3 states that the RFN algorithm is correct, that is, it explains the data (low reconstruction error) and captures the covariance structure as well as possible.

Theorem 3 (RFN Correctness).

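The fixed point $W$ of Alg. 1 minimizes $\mathrm{Tr}(\Psi)$ given $\mu_i$ and $\Sigma$ by ridge regression with

$$
\mathrm{Tr}(\Psi) = \frac{1}{n} \sum_{i=1}^{n} \|\epsilon_i\|_2^2 + \big\|W \Sigma^{1/2}\big\|_F^2,
\tag{8}
$$

where $\epsilon_i = v_i - W \mu_i$. The model explains the data covariance matrix $C$ by

$$
C = \Psi + W S W^T
\tag{9}
$$

up to an error, which is quadratic in $\Psi$ for $\Psi \ll W W^T$. The reconstruction error $\frac{1}{n} \sum_{i=1}^{n} \|\epsilon_i\|_2^2$ is quadratic in $\Psi$ for $\Psi \ll W W^T$.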

Proof.

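The fixed point equation for the $W$ update is $\Delta W = U S^{-1} - W = 0 \Rightarrow W = U S^{-1}$. Using the definition of $U$ and $S$, we have $W = \big(\frac{1}{n} \sum_{i=1}^{n} v_i \mu_i^T\big) \big(\frac{1}{n} \sum_{i=1}^{n} \mu_i \mu_i^T + \Sigma\big)^{-1}$. $W$ is the ridge regression solution of

$$
\frac{1}{n} \sum_{i=1}^{n} \|v_i - W \mu_i\|_2^2 + \big\|W \Sigma^{1/2}\big\|_F^2 = \mathrm{Tr}\Big(\frac{1}{n} \sum_{i=1}^{n} \epsilon_i \epsilon_i^T + W \Sigma W^T\Big),
\tag{10}
$$

where $\mathrm{Tr}$ is the trace. After multiplying out all $\epsilon_i \epsilon_i^T$ in $\frac{1}{n} \sum_{i=1}^{n} \epsilon_i \epsilon_i^T$, we obtain:

$$
E = \frac{1}{n} \sum_{i=1}^{n} \epsilon_i \epsilon_i^T + W \Sigma W^T.
\tag{11}
$$

For the fixed point of $\Psi$, the update rule gives: $\mathrm{diag}(\Psi) = \mathrm{diag}\big(\frac{1}{n} \sum_{i=1}^{n} \epsilon_i \epsilon_i^T + W \Sigma W^T\big)$. Thus, $W$ minimizes $\mathrm{Tr}(\Psi)$ given $\mu_i$ and $\Sigma$. Multiplying the Woodbury identity for $(W W^T + \Psi)^{-1}$ from left and right by $\Psi$ gives

$$
W \Sigma W^T = \Psi - \Psi \big(W W^T + \Psi\big)^{-1} \Psi.
\tag{12}
$$

Inserting this into the expression for $\mathrm{diag}(\Psi)$ and taking the trace gives

$$
\mathrm{Tr}\Big(\frac{1}{n} \sum_{i=1}^{n} \epsilon_i \epsilon_i^T\Big) = \mathrm{Tr}\Big(\Psi \big(W W^T + \Psi\big)^{-1} \Psi\Big) \le \mathrm{Tr}\Big(\big(W W^T + \Psi\big)^{-1}\Big)\, \mathrm{Tr}(\Psi)^2.
\tag{13}
$$

Therefore for $\Psi \ll W W^T$ the error is quadratic in $\Psi$. $W U^T = W S W^T = U W^T$ follows from the fixed point equation $U = W S$. Using this and Eq. (12), Eq. (11) is

$$
\frac{1}{n} \sum_{i=1}^{n} \epsilon_i \epsilon_i^T - \Psi \big(W W^T + \Psi\big)^{-1} \Psi = C - \Psi - W S W^T.
\tag{14}
$$

Using the trace norm (nuclear norm or Ky-Fan n-norm) on matrices, Eq. (13) states that the left hand side of Eq. (14) is quadratic in $\Psi$ for $\Psi \ll W W^T$. The trace norm of a positive semi-definite matrix is its trace and bounds the Frobenius norm [23]. Thus, for $\Psi \ll W W^T$, the covariance is approximated up to a quadratic error in $\Psi$ according to Eq. (9). The diagonal is exactly modeled. ∎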
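The Woodbury step of the proof can be checked numerically. The sketch below verifies, on random toy matrices, that $W \Sigma W^T = \Psi - \Psi (W W^T + \Psi)^{-1} \Psi$ holds when $\Sigma = (I + W^T \Psi^{-1} W)^{-1}$; the sizes and the random seed are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
m, l = 5, 3                                   # visible and hidden dimensions (toy sizes)
W = rng.normal(size=(m, l))
Psi = np.diag(rng.uniform(0.1, 1.0, size=m))  # diagonal, positive noise covariance

# Posterior covariance of the factor analysis model
Sigma = np.linalg.inv(np.eye(l) + W.T @ np.linalg.inv(Psi) @ W)

# Woodbury identity multiplied by Psi from the left and from the right
lhs = W @ Sigma @ W.T
rhs = Psi - Psi @ np.linalg.inv(W @ W.T + Psi) @ Psi

identity_holds = np.allclose(lhs, rhs)
```

The check is independent of the RFN constraints, since the identity only involves the model parameters.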

Since the minimization of the expected reconstruction error $\mathrm{Tr}(\Psi)$ is based on $\mu_i$, the quality of reconstruction depends on the correlation between $\mu_i$ and $v_i$. We ensure maximal information in $\mu_i$ on $v_i$ by the I-projection (the minimal Kullback-Leibler distance) of the posterior onto the family of rectified and normalized Gaussian distributions.

4 Experiments

          undercomplete 50 code units     complete 100 code units       overcomplete 150 code units
Method    SP      ER       CO             SP      ER       CO           SP      ER       CO
RFN       75±0    249±3    108±3          81±1    68±9     26±6         85±1    17±6     7±6
RFNn      74±0    295±4    140±4          79±0    185±5    59±3         80±0    142±4    35±2
DAE       66±0    251±3    -              69±0    147±2    -            71±0    130±2    -
RBM       15±1    310±4    -              7±1     287±4    -            5±0     286±4    -
FAsp      40±1    999±63   999±99         63±0    999±65   999±99       80±0    999±65   999±99
FAlap     4±0     239±6    341±19         6±0     46±4     985±45       4±0     46±4     976±53
ICA       2±0     174±2    -              3±1     0±0      -            3±1     0±0      -
SFA       1±0     218±5    94±3           1±0     16±1     114±5        1±0     16±1     285±7
FA        1±0     218±4    90±3           1±0     16±1     83±4         1±0     16±1     263±6
PCA       0±0     174±2    -              2±0     0±0      -            2±0     0±0      -

Table 1: Comparison of RFN with other unsupervised methods, where the upper part contains methods that yielded sparse codes. Criteria: sparseness of the code (SP), reconstruction error (ER), difference between data and model covariance (CO). The panels give the results for models with 50, 100 and 150 coding units. Results are the mean of 900 instances, 100 instances for each dataset D1 to D9 (maximal value: 999). RFNs had the sparsest code, the lowest reconstruction error, and the lowest covariance approximation error of all methods that yielded sparse representations (SP>10%).

RFNs vs. Other Unsupervised Methods.

We assess the performance of rectified factor networks (RFNs) as unsupervised methods for data representation. We compare (1) RFN: rectified factor networks, (2) RFNn: RFNs without normalization, (3) DAE: denoising autoencoders with ReLUs, (4) RBM: restricted Boltzmann machines with Gaussian visible units, (5) FAsp: factor analysis with Jeffrey’s prior on the hidden units, which is sparser than a Laplace prior, (6) FAlap: factor analysis with Laplace prior on the hidden units, (7) ICA: independent component analysis by FastICA [24], (8) SFA: sparse factor analysis with a Laplace prior on the parameters, (9) FA: standard factor analysis, (10) PCA: principal component analysis. The number of components is fixed to 50, 100 and 150 for each method. We generated nine different benchmark datasets (D1 to D9), where each dataset consists of 100 instances. Each instance has 100 samples and 100 features, resulting in a 100×100 matrix. Into these matrices, biclusters are implanted [8]. A bicluster is a pattern of particular features which is found in particular samples, like a pathway activated in some samples. An optimal representation will only code the biclusters that are present in a sample. The datasets have different noise levels and different bicluster sizes. Large biclusters have 20–30 samples and 20–30 features, while small biclusters have 3–8 samples and 3–8 features. The pattern’s signal strength in a particular sample was randomly chosen according to a Gaussian distribution. Finally, to each matrix, zero-mean Gaussian background noise was added with standard deviation 1, 5, or 10. The datasets are characterized by Dx=$(\sigma, n_1, n_2)$ with background noise $\sigma$, number of large biclusters $n_1$, and number of small biclusters $n_2$: D1=(1,10,10), D2=(5,10,10), D3=(10,10,10), D4=(1,15,5), D5=(5,15,5), D6=(10,15,5), D7=(1,5,15), D8=(5,5,15), D9=(10,5,15).

We evaluated the methods according to (1) the sparseness of the components, (2) the input reconstruction error from the code, and (3) the covariance reconstruction error for generative models. For RFNs, sparseness is the percentage of the components that are exactly 0, while for other methods it is the percentage of components with an absolute value smaller than 0.01. The reconstruction error is the sum of the squared errors across samples. The covariance reconstruction error is the Frobenius norm of the difference between model and data covariance. See the supplement for more details on the data and for information on hyperparameter selection for the different methods. Tab. 1 gives averaged results for models with 50 (undercomplete), 100 (complete) and 150 (overcomplete) coding units. Results are the mean of 900 instances consisting of 100 instances for each dataset D1 to D9. In the supplement, we separately tabulate the results for D1 to D9 and confirm them with different noise levels. FAlap did not yield sparse codes since the variational parameter did not push the absolute representations below the threshold of 0.01. The variational approximation to the Laplacian is a Gaussian distribution [13]. RFNs had the sparsest code, the lowest reconstruction error, and the lowest covariance approximation error of all methods that yielded sparse representations (SP>10%).
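The three evaluation criteria can be written down compactly. The NumPy sketch below is an illustration under assumptions: the function names and the toy code matrix are hypothetical, and the reconstruction error is taken as the plain sum of squared errors over all samples and features without any further normalization.

```python
import numpy as np

def sparseness(code, exact_zero=True, threshold=0.01):
    """Percentage of code entries counted as zero.

    exact_zero=True  counts entries that are exactly 0 (criterion used for RFNs),
    exact_zero=False counts entries with absolute value below the threshold.
    """
    zero = (code == 0) if exact_zero else (np.abs(code) < threshold)
    return 100.0 * np.mean(zero)

def reconstruction_error(V, V_hat):
    """Sum of squared errors between the data and its reconstruction."""
    return np.sum((V - V_hat) ** 2)

def covariance_error(C_data, C_model):
    """Frobenius norm of the difference between data and model covariance."""
    return np.linalg.norm(C_data - C_model, ord="fro")

# Toy code matrix: two samples, four code units.
code = np.array([[0.0, 0.5, 0.0, 0.005],
                 [1.2, 0.0, 0.0, -0.003]])
sp_exact = sparseness(code)                        # counts exact zeros only
sp_thresh = sparseness(code, exact_zero=False)     # also counts |x| < 0.01
```

On the toy matrix, the thresholded criterion counts the two near-zero entries in addition to the four exact zeros, which shows why the threshold matters for methods whose codes never reach exactly zero.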

RFN Pretraining for Deep Nets.
Figure 2: Randomly selected filters trained on image datasets using an RFN with 1024 hidden units. Panels: (a) MNIST digits, (b) MNIST digits with random image background, (c) MNIST digits with random noise background, (d) convex and concave shapes, (e) tall and wide rectangles, (f) rectangular images on background images, (g) CIFAR-10 images (best viewed in color), (h) NORB images. RFNs learned stroke, local and global blob detectors. RFNs are robust to background noise (b,c,f).

We assess the performance of rectified factor networks (RFNs) when used for pretraining of deep networks. Stacked RFNs are obtained by first training a single layer RFN and then passing on the resulting representation as input for training the next RFN. The deep network architectures use an RFN-pretrained first layer (RFN-1) or stacks of 3 RFNs giving a 3-hidden layer network. The classification performance of deep networks with RFN pretrained layers was compared to (i) support vector machines, (ii) deep networks pretrained by stacking denoising autoencoders (SDAE), (iii) stacking regular autoencoders (SAE), (iv) restricted Boltzmann machines (RBM), and (v) stacking restricted Boltzmann machines (DBN).


Dataset    Size           SVM           RBM           DBN           SAE           SDAE          RFN           Layers
MNIST      50k-10k-10k    1.40±0.23     1.21±0.21     1.24±0.22     1.40±0.23     1.28±0.22     1.27±0.22     (1)
basic      10k-2k-50k     3.03±0.15     3.94±0.17     3.11±0.15     3.46±0.16     2.84±0.15     2.66±0.14     (1)
bg-rand    10k-2k-50k     14.58±0.31    9.80±0.26     6.73±0.22     11.28±0.28    10.30±0.27    7.94±0.24     (3)
bg-img     10k-2k-50k     22.61±0.37    16.15±0.32    16.31±0.32    23.00±0.37    16.68±0.33    15.66±0.32    (1)
rect       1k-0.2k-50k    2.15±0.13     4.71±0.19     2.60±0.14     2.41±0.13     1.99±0.12     0.63±0.06     (1)
rect-img   10k-2k-50k     24.04±0.37    23.69±0.37    22.50±0.37    24.05±0.37    21.59±0.36    20.77±0.36    (1)
convex     10k-2k-50k     19.13±0.34    19.92±0.35    18.63±0.34    18.41±0.34    19.06±0.34    16.41±0.32    (1)
NORB       19k-5k-24k     11.6±0.40     8.31±0.35     -             10.10±0.38    9.50±0.37     7.00±0.32     (1)
CIFAR      40k-10k-10k    62.7±0.95     40.39±0.96    43.38±0.97    43.25±0.97    -             41.29±0.95    (1)

Table 2: Results of deep networks pretrained by RFNs and other models (taken from [25, 26, 27, 28]). The test error rate is reported together with the 95% confidence interval. The best performing method is given in bold, as well as those for which confidence intervals overlap. The first column gives the dataset, the second the size of training, validation and test set, the last column indicates the number of hidden layers of the selected deep network. In only one case RFN pretraining was significantly worse than the best method but still the second best. In six out of the nine experiments RFN pretraining performed best, where in four cases it was significantly the best.

The benchmark datasets and results are taken from previous publications [25, 26, 27, 28] and contain: (i) MNIST (original MNIST), (ii) basic (a smaller subset of MNIST for training), (iii) bg-rand (MNIST with random noise background), (iv) bg-img (MNIST with random image background), (v) rect (discrimination between tall and wide rectangles), (vi) rect-img (discrimination between tall and wide rectangular images overlaid on random background images), (vii) convex (discrimination between convex and concave shapes), (viii) CIFAR-10 (60k color images in 10 classes), and (ix) NORB (29,160 stereo image pairs of 5 generic categories). For each dataset, the size of its training, validation and test set is given in the second column of Tab. 2. As preprocessing we only performed median centering. Model selection is based on the validation set performance [26]. The RFN hyperparameters are (i) the number of units per layer and (ii) the dropout rate, each selected from a predefined set. The learning rate was fixed to its default value. For supervised fine-tuning with stochastic gradient descent, we selected the learning rate, the masking noise, and the number of layers, each from a predefined set. Fine-tuning was stopped based on the validation set performance, following [26]. The test error rates together with the 95% confidence interval (computed according to [26]) for deep network pretraining by RFNs and other methods are given in Tab. 2. Fig. 2(b) shows learned filters. The result of the best performing method is given in bold, as well as the result of those methods for which confidence intervals overlap. RFNs were only once significantly worse than the best method but still the second best. In six out of the nine experiments RFNs performed best, where in four cases it was significantly the best.
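The reported interval half-widths are consistent with the usual normal approximation to a binomial proportion. The sketch below assumes this is how the intervals were computed (the text only states that they follow [26]); the function name `error_rate_ci95` is hypothetical.

```python
import math

def error_rate_ci95(error_percent, n_test):
    """Half-width (in percent) of a 95% confidence interval for a test error rate.

    Uses the normal approximation to the binomial: 1.96 * sqrt(p * (1 - p) / n).
    """
    p = error_percent / 100.0
    half_width = 1.96 * math.sqrt(p * (1.0 - p) / n_test)
    return 100.0 * half_width

# RFN entries of Tab. 2: MNIST (10k test samples) and basic (50k test samples).
ci_mnist = round(error_rate_ci95(1.27, 10_000), 2)
ci_basic = round(error_rate_ci95(2.66, 50_000), 2)
```

Under this assumption the two values reproduce the half-widths listed for RFN on MNIST and basic in Tab. 2 (0.22 and 0.14), and the larger test sets of the MNIST variants explain their tighter intervals.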

RFNs in Drug Discovery.

Using RFNs we analyzed gene expression datasets of two projects in the lead optimization phase of a big pharmaceutical company [29]. The first project aimed at finding novel antipsychotics that target PDE10A. The second project was an oncology study that focused on compounds inhibiting the FGF receptor. In both projects, the expression data was summarized by FARMS [30] and standardized. RFNs were trained with 500 hidden units, no masking noise, and a learning rate of . The identified transcriptional modules are shown in Fig. 3. Panels A and B illustrate that RFNs found rare and small events in the input. In panel A only a few drugs are genotoxic (rare event) by downregulating the expression of a small number of tubulin genes (small event). The genotoxic effect stems from the formation of micronuclei (panel C and D) since the mitotic spindle apparatus is impaired. Also in panel B, RFN identified a rare and small event which is a transcriptional module that has a negative feedback to the MAPK signaling pathway. Rare events are unexpectedly inactive drugs (black dots), which do not inhibit the FGF receptor. Both findings were not detected by other unsupervised methods, while they were highly relevant and supported decision-making in both projects [29].

Figure 3: Examples of small and rare events identified by RFN in two drug design studies, which were missed by previous methods. Panel A and B: first row gives the coding unit, while the other rows display expression values of genes for controls (red), active drugs (green), and inactive drugs (black). Drugs (green) in panel A strongly downregulate the expression of tubulin genes which hints at a genotoxic effect by the formation of micronuclei (C). The micronuclei were confirmed by microscopic analysis (D). Drugs (green) in panel B show a transcriptional effect on genes with a negative feedback to the MAPK signaling pathway (E) and therefore are potential cancer drugs.

5 Conclusion

We have introduced rectified factor networks (RFNs) for constructing very sparse and non-linear input representations with many coding units in a generative framework. Like factor analysis, RFN learning explains the data variance by its model parameters. The RFN learning algorithm is a posterior regularization method which enforces non-negative and normalized posterior means. We have shown that RFN learning is a generalized alternating minimization method which can be proved to converge and to be correct. RFNs had the sparsest code, the lowest reconstruction error, and the lowest covariance approximation error of all methods that yielded sparse representations (SP>10%). RFNs have shown that they improve performance if used for pretraining of deep networks. In two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that were so far missed by other unsupervised methods. These gene modules were highly relevant and supported the decision-making in both studies. RFNs are geared to large datasets, sparse coding, and many representational units, therefore they have high potential as unsupervised deep learning techniques.

Acknowledgment.

The Tesla K40 used for this research was donated by the NVIDIA Corporation.

References

  • [1] G. E. Hinton and R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.
  • [2] Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle. Greedy layer-wise training of deep networks. In B. Schölkopf, J. C. Platt, and T. Hoffman, editors, NIPS, pages 153–160. MIT Press, 2007.
  • [3] J. Schmidhuber. Deep learning in neural networks: An overview. Neural Networks, 61:85–117, 2015.
  • [4] Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 521(7553):436–444, 2015.
  • [5] V. Nair and G. E. Hinton. Rectified linear units improve restricted Boltzmann machines. In ICML, pages 807–814. Omnipress 2010, ISBN 978-1-60558-907-7, 2010.
  • [6] X. Glorot, A. Bordes, and Y. Bengio. Deep sparse rectifier neural networks. In AISTATS, volume 15, pages 315–323, 2011.
  • [7] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15:1929–1958, 2014.
  • [8] S. Hochreiter, U. Bodenhofer, et al. FABIA: factor analysis for bicluster acquisition. Bioinformatics, 26(12):1520–1527, 2010.
  • [9] S. Hochreiter. HapFABIA: Identification of very short segments of identity by descent characterized by rare variants in large sequencing data. Nucleic Acids Res., 41(22):e202, 2013.
  • [10] B. J. Frey and G. E. Hinton. Variational learning in nonlinear Gaussian belief networks. Neural Computation, 11(1):193–214, 1999.
  • [11] M. Harva and A. Kaban. Variational learning for rectified factor analysis. Signal Processing, 87(3):509–527, 2007.
  • [12] K. Ganchev, J. Graca, J. Gillenwater, and B. Taskar. Posterior regularization for structured latent variable models. Journal of Machine Learning Research, 11:2001–2049, 2010.
  • [13] J. Palmer, D. Wipf, K. Kreutz-Delgado, and B. Rao. Variational EM algorithms for non-Gaussian latent variable models. In NIPS, volume 18, pages 1059–1066, 2006.
  • [14] D. P. Bertsekas. On the Goldstein-Levitin-Polyak gradient projection method. IEEE Trans. Automat. Control, 21:174–184, 1976.
  • [15] C. T. Kelley. Iterative Methods for Optimization. Society for Industrial and Applied Mathematics (SIAM), Philadelphia, 1999.
  • [16] D. P. Bertsekas. Projected Newton methods for optimization problems with simple constraints. SIAM J. Control Optim., 20:221–246, 1982.
  • [17] J. Abadie and J. Carpentier. Optimization, chapter Generalization of the Wolfe Reduced Gradient Method to the Case of Nonlinear Constraints. Academic Press, 1969.
  • [18] J. B. Rosen. The gradient projection method for nonlinear programming. part ii. nonlinear constraints. Journal of the Society for Industrial and Applied Mathematics, 9(4):514–532, 1961.
  • [19] E. J. Haug and J. S. Arora. Applied optimal design. J. Wiley & Sons, New York, 1979.
  • [20] A. Ben-Tal and A. Nemirovski. Interior Point Polynomial Time Methods for Linear Programming, Conic Quadratic Programming, and Semidefinite Programming, chapter 6, pages 377–442. Society for Industrial and Applied Mathematics, 2001.
  • [21] A. Gunawardana and W. Byrne. Convergence theorems for generalized alternating minimization procedures. Journal of Machine Learning Research, 6:2049–2073, 2005.
  • [22] W. I. Zangwill. Nonlinear Programming: A Unified Approach. Prentice Hall, Englewood Cliffs, N.J., 1969.
  • [23] N. Srebro. Learning with Matrix Factorizations. PhD thesis, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, 2004.
  • [24] A. Hyvärinen and E. Oja. A fast fixed-point algorithm for independent component analysis. Neural Comput., 9(7):1483–1492, 1997.
  • [25] Y. LeCun, F.-J. Huang, and L. Bottou. Learning methods for generic object recognition with invariance to pose and lighting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE Press, 2004.
  • [26] P. Vincent, H. Larochelle, et al. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. JMLR, 11:3371–3408, 2010.
  • [27] H. Larochelle, D. Erhan, et al. An empirical evaluation of deep architectures on problems with many factors of variation. In ICML, pages 473–480, 2007.
  • [28] A. Krizhevsky. Learning multiple layers of features from tiny images. Master’s thesis, Department of Computer Science, University of Toronto, 2009.
  • [29] B. Verbist, G. Klambauer, et al. Using transcriptomics to guide lead optimization in drug discovery projects: Lessons learned from the QSTAR project. Drug Discovery Today, 20(5):505–513, 2015.
  • [30] S. Hochreiter, D.-A. Clevert, and K. Obermayer. A new summarization method for Affymetrix probe level data. Bioinformatics, 22(8):943–949, 2006.

Supplementary Material

Appendix S1 Introduction

This supplement contains additional information complementing the main manuscript and is structured as follows. First, the rectified factor network (RFN) learning algorithm with E- and M-step updates, weight decay, and dropout regularization is given in Section S2. In Section S3, we prove that the RFN learning algorithm is a “generalized alternating minimization” (GAM) algorithm and converges to a solution that maximizes the RFN objective. The correctness of the RFN algorithm is proved in Section S4. Section S5 describes the maximum likelihood factor analysis model and model selection by the EM algorithm. The RFN objective, which has to be maximized, is described in Section S6. Next, RFN’s GAM algorithm via gradient descent in both the M-step and the E-step is reported in Section S7. Sections S8 and S9 describe the gradient-based M-step and E-step, respectively. In Section S10, we describe how the sparseness of RFNs can be controlled by a Gaussian prior. Additional information on the selected hyperparameters of the benchmark methods is given in Section S11. Sections S12 and S13 describe the data generation of the benchmark datasets and report the results for three experimental settings, namely extracting 50 (undercomplete), 100 (complete), or 150 (overcomplete) factors / hidden units. Finally, Section S14 describes the experiments we conducted to assess the performance of RFN first-layer pretraining on CIFAR-10 and CIFAR-100 for three deep convolutional network architectures: (i) AlexNet [31, 32], (ii) Deeply Supervised Networks (DSN) [33], and (iii) our 5-Convolution-Network-In-Network (5C-NIN).

Appendix S2 Rectified Factor Network (RFN) Algorithms

Algorithm S2 is the rectified factor network (RFN) learning algorithm. The RFN algorithm calls Algorithm S3 to project the posterior probability onto the family of rectified and normalized variational distributions. Algorithm S3 guarantees an improvement of the E-step objective. Projection Algorithm S3 relies on a sequence of projections, where a more complicated projection is tried only if a simpler one failed to improve the E-step objective. If all of the Newton-based projection methods described below fail to decrease the E-step objective, then projection Algorithm S3 falls back to general gradient projection methods. First, the equality constraints are solved and inserted into the objective. Thereafter, the constraints are convex and gradient projection methods are applied. This approach is called the “generalized reduced gradient method” [17], which is our preferred alternative method. If this method fails, then Rosen’s gradient projection method [18] is used. Finally, the method of Haug and Arora [19] is used.

First we consider the Newton-based projection methods used by Algorithm S3. Algorithm S5 performs a simple projection, which is the projected Newton method with the learning rate set to one. This projection is very fast and ideally suited to GPUs for RFNs with many coding units. Algorithm S4 is the fast and simple projection without normalization and is even simpler than Algorithm S5. Algorithm S6 generalizes Algorithm S5 by introducing two step sizes: the first scales the gradient step, while the second scales the difference between the old projection and the new projection. For both step sizes, annealing (that is, learning rate decay) is used to find an appropriate update.

If these Newton-based update rules do not work, then Algorithm S7 is used. Algorithm S7 performs a scaled projection with a reduced Hessian matrix instead of the full Hessian. To compute the reduced matrix, an ε-active set is determined, which consists of the components that lie within ε of the non-negativity bound. The reduced matrix is the Hessian with the ε-active columns and rows fixed to unit vectors.

The RFN algorithm allows regularization of the parameters W and Ψ (off-diagonal elements) by weight decay. Priors on the parameters can be introduced. If the priors are convex functions, then convergence of the RFN algorithm is still ensured. The weight decay Algorithm S8 can optionally be used after the M-step of Algorithm S2. Coding units can be regularized by dropout. However, dropout is not covered by the convergence proof for the RFN algorithm. The dropout Algorithm S9 is applied during the projection, between rectifying and normalization. Mini-batches and other stochastic gradient methods are likewise not covered by the convergence proof for the RFN algorithm. However, the GAM convergence proof can be generalized to mini-batches in the way shown in [21] for the incremental EM algorithm. Dropout and other stochastic gradient methods can be shown to converge in a similar way.

    
  
  for : ,
  number of coding units
  
  , , , , , ,
  
  , element-wise random in ,
  , STOP=false
  
  while STOP=false do
     ——E-step1——
     for all  do
        
     end for
     
     ——Projection——
     perform projection of onto the feasible set by Algorithm S3 giving
     ——E-step2——
     
       
     
     ——M-step——
     
     
     —– update——
     
     —–diagonal update——
     for all  do
        
     end for
     —–full update——
     
     —–bound parameters——
     
     
     if stopping criterion is met: STOP=true
  end while
Algorithm S2 Rectified Factor Network
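The formulas of Algorithm S2 did not survive in the listing above, so the following is only a minimal sketch of its control flow (E-step 1, projection, E-step 2, M-step, parameter bounding). It assumes the standard factor analysis posterior, a damped step toward the EM solution in the M-step, and a diagonal noise covariance. The names `rfn_sketch`, `eta`, `psi_min`, and `project`, as well as the initialization, are illustrative assumptions rather than the authors' settings.

```python
import numpy as np

def rfn_sketch(V, n_units, n_iter=50, eta=0.1, psi_min=1e-3, project=None, seed=0):
    """Hedged RFN-style loop. V: (n, m) centered data, rows are samples."""
    if project is None:
        # Assumption: rectifying-only projection (no normalization).
        project = lambda mu: np.maximum(mu, 0.0)
    rng = np.random.default_rng(seed)
    n, m = V.shape
    C = V.T @ V / n                                # data covariance
    W = rng.uniform(-0.1, 0.1, size=(m, n_units))  # assumed initialization range
    psi = np.maximum(np.diag(C), psi_min)          # assumed init: data variances
    for _ in range(n_iter):
        # E-step 1: Gaussian posterior of standard factor analysis.
        Wt_psi_inv = W.T / psi                     # W^T Psi^{-1}, shape (l, m)
        Sigma = np.linalg.inv(np.eye(n_units) + Wt_psi_inv @ W)
        mu_p = V @ Wt_psi_inv.T @ Sigma            # row i is Sigma W^T Psi^{-1} v_i
        # Projection of the posterior means onto the feasible set.
        mu = project(mu_p)
        # E-step 2: sufficient statistics of the projected posterior.
        U = V.T @ mu / n
        S = mu.T @ mu / n + Sigma
        # M-step: damped step toward the EM solution (assumed form).
        E = C - U @ W.T - W @ U.T + W @ S @ W.T
        W = W + eta * (U @ np.linalg.inv(S) - W)
        psi = psi + eta * (np.diag(E) - psi)
        # Bound parameters: keep the noise variances positive.
        psi = np.maximum(psi, psi_min)
    return W, psi, mu
```

Calling `rfn_sketch(V, n_units=3)` on a centered data matrix returns the loading matrix, the diagonal noise variances, and the non-negative posterior means. A different projection can be plugged in through the `project` argument.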
    
  
  obtain that decrease the E-step objective
  
  ,
  for : , ,
  simple projection (rectified or rectified & normalized),
  E-step objective:
  , , , , (for -active set)
  
  —–Simple Projection——
  perform Newton Projection by Algorithm S5 or Algorithm S4
  —–Scaled Projection——
  if  then
     following loop for: (1) , (2) , or (3) and annealing
     
     while  and and  do
         (skipped for annealing)
         (skipped for annealing)
        perform Scaled Newton Projection by Algorithm S6
     end while
  end if
  —–Scaled Projection With Reduced Matrix——
  if  then
     determine -active set as all with
     set to with -active columns and rows fixed to
     following loop for: (1) , (2) , or (3) and annealing
     
     while  and and  do
         (skipped for annealing)
         (skipped for annealing)
        perform Scaled Projection With Reduced Matrix by Algorithm S7
     end while
  end if
  —–General Gradient Projection——
  while  do
     use generalized reduced gradient [17] OR
     use Rosen’s gradient projection [18] OR
     use method of Haug and Arora [19]
  end while
Algorithm S3 Projection with E-Step Improvement
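The fallback structure of Algorithm S3 can be illustrated with a small sketch: candidate updates are tried from simplest to most general, and the first one that lowers the E-step objective is accepted. The function name `project_with_improvement` and the quadratic stand-in objective (identity covariance) are assumptions made for illustration. Unlike Algorithm S3, this sketch simply returns the old value if every candidate fails.

```python
import numpy as np

def project_with_improvement(mu_old, candidates, objective):
    """Return the first candidate update that lowers `objective`, else mu_old."""
    base = objective(mu_old)
    for propose in candidates:
        mu_new = propose(mu_old)
        if objective(mu_new) < base:
            return mu_new
    return mu_old

# Toy setting: unconstrained posterior means and a quadratic stand-in objective.
mu_p = np.array([[1.0, -2.0], [0.5, 3.0]])
objective = lambda mu: 0.5 * np.sum((mu - mu_p) ** 2)

# Candidates ordered from simple to more conservative (both hypothetical).
rectify = lambda mu: np.maximum(mu_p, 0.0)
half_step = lambda mu: mu + 0.5 * (np.maximum(mu_p, 0.0) - mu)

mu_old = np.ones_like(mu_p)
mu_new = project_with_improvement(mu_old, [rectify, half_step], objective)
```

In this toy case the simple rectification already improves the objective, so the more conservative candidate is never evaluated.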
    
  
  for : project onto feasible set giving
  
  
  
  for all  do
     
  end for
Algorithm S4 Simple Projection: Rectifying
    
  
  for : project onto feasible set giving
  
  for :
  
  for all  do
     for all  do
        
     end for
  end for
  
  for all  do
     if at least one  then
        for all  do
           
        end for
     else
        for all  do
           
        end for
     end if
  end for
Algorithm S5 Simple Projection: Rectifying and Normalization
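As a rough illustration of the simple projection, the sketch below rectifies the posterior means and then rescales each coding unit so that its mean square across samples equals one. The exact normalization formula and the handling of coding units with no positive entry are lost in the listing above. Here such units are left at zero, which is a simplifying assumption and not necessarily what the else-branch of Algorithm S5 does.

```python
import numpy as np

def rectify_and_normalize(mu_p):
    """mu_p: (n, l) posterior means, rows are samples, columns are coding units."""
    mu = np.maximum(mu_p, 0.0)                 # rectifying step
    second_moment = np.mean(mu ** 2, axis=0)   # per coding unit, across samples
    active = second_moment > 0.0               # units with at least one positive mean
    mu[:, active] /= np.sqrt(second_moment[active])
    return mu                                  # all-zero units stay zero (assumption)
```

After the call, every coding unit with at least one positive entry has unit second moment across the samples, and all entries are non-negative.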
    
  
  perform a scaled Newton step with subsequent projection
  
  for :
  for :
  simple projection (rectified or rectified & normalized),
   (gradient step size), (projection difference)
  
  
  
Algorithm S6 Scaled Newton Projection
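A possible reading of the scaled Newton projection is sketched below. It assumes that the E-step objective is quadratic in the mean with the inverse posterior covariance as Hessian, so that a full Newton step from the old mean lands on the unconstrained posterior mean. The parameter names `lam` (gradient step size) and `gam` (projection difference) are hypothetical labels for the two step sizes.

```python
import numpy as np

def scaled_newton_projection(mu_old, mu_p, project, lam=1.0, gam=1.0):
    """One scaled Newton step followed by a projection and a damped move."""
    # Scaled Newton step: move a fraction lam toward the unconstrained optimum.
    stepped = mu_old + lam * (mu_p - mu_old)
    # Project the stepped point onto the feasible set.
    projected = project(stepped)
    # Move only a fraction gam from the old projection toward the new one.
    return mu_old + gam * (projected - mu_old)
```

With `lam=1` and `gam=1` this reduces to projecting the unconstrained posterior mean directly, which matches the description of the simple projection as a projected Newton step with learning rate one. Annealing would correspond to calling the function repeatedly with decreasing `lam` or `gam` until the objective improves.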
    
  
  perform a scaled projection step with reduced matrix
  
  for :
  for :
  simple projection (rectified or rectified & normalized),
  , , ,
  
  
  
Algorithm S7 Scaled Projection With Reduced Matrix
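The construction of the reduced matrix can be sketched as follows. The membership test `mu <= eps` for the active set is an assumption based on the usual definition for non-negativity constraints, and the names `reduced_hessian`, `H`, and `eps` are illustrative.

```python
import numpy as np

def reduced_hessian(H, mu, eps):
    """Replace rows and columns of active components by unit vectors."""
    active = mu <= eps                 # assumed definition of the active set
    H_bar = H.copy()
    H_bar[active, :] = 0.0             # clear active rows
    H_bar[:, active] = 0.0             # clear active columns
    H_bar[active, active] = 1.0        # put ones on the active diagonal entries
    return H_bar, active

H = np.array([[2.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 4.0]])
mu = np.array([0.0, 0.5, 1.0])
H_bar, active = reduced_hessian(H, mu, eps=0.1)
```

Fixing the active rows and columns to unit vectors decouples the components that sit at the bound from the remaining ones, so a Newton-type step with the reduced matrix treats them as plain gradient coordinates.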
    
  
  Parameters
  Weight decay factors (Gaussian) and (Laplacian)
  
  
  
  
  
Algorithm S8 Weight Decay
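The update formulas of Algorithm S8 are missing from the listing above, so the sketch below only shows one common way to realize the two decay types. The Gaussian part shrinks the weights proportionally. The Laplacian part uses soft thresholding, which is a substituted technique chosen here because it never flips the sign of a weight. The function and parameter names are assumptions.

```python
import numpy as np

def weight_decay(W, gauss=0.0, laplace=0.0):
    """Apply Gaussian (L2-type) and Laplacian (L1-type) weight decay to W."""
    W = W - gauss * W                                       # proportional shrinkage
    W = np.sign(W) * np.maximum(np.abs(W) - laplace, 0.0)   # soft thresholding
    return W
```

With both factors set to zero the weights are returned unchanged. Small weights are set exactly to zero by the Laplacian part, which encourages sparse loading matrices.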
    
  
  for :
  dropout probability
  
  for all  do
     for all  do
        
        
     end for
  end for
Algorithm S9 Dropout
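A minimal sketch of dropout on the coding units is given below: each posterior mean is set to zero independently with the dropout probability. The name `d` for the dropout probability and the absence of any rescaling of the surviving units are assumptions, since the corresponding lines of the listing are lost.

```python
import numpy as np

def dropout(mu, d, rng):
    """Zero each entry of mu independently with probability d."""
    keep = rng.random(mu.shape) >= d   # True with probability 1 - d
    return mu * keep

rng = np.random.default_rng(0)
mu = np.array([[0.5, 1.5], [2.0, 0.0]])
dropped = dropout(mu, d=0.5, rng=rng)
```

In the RFN setting this step would be applied after rectifying and before normalization, as stated in Section S2.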

Appendix S3 Convergence Proof for the RFN Learning Algorithm

Theorem 4 (RFN Convergence).

The rectified factor network (RFN) learning algorithm given in Algorithm S2 is a “generalized alternating minimization” (GAM) algorithm and converges to a solution that maximizes the RFN objective.

Proof.

The factor analysis EM algorithm is given by Eq. (81) and Eq. (82) in Section S5. Algorithm S2 is the factor analysis EM algorithm with a modified E-step and a modified M-step. The E-step is modified by constraining the variational distribution to non-negative means and by normalizing its means across the samples. The M-step is modified to a gradient step in the Newton direction.

Like EM factor analysis, Algorithm S2 aims at maximizing the negative free energy, which is

(15)