Differentiable Pooling for Hierarchical Feature Learning

06/30/2012 ∙ by Matthew D. Zeiler, et al. ∙ NYU

We introduce a parametric form of pooling, based on a Gaussian, which can be optimized alongside the features in a single global objective function. By contrast, existing pooling schemes are based on heuristics (e.g. local maximum) and have no clear link to the cost function of the model. Furthermore, the variables of the Gaussian explicitly store location information, distinct from the appearance captured by the features, thus providing a what/where decomposition of the input signal. Although the differentiable pooling scheme can be incorporated in a wide range of hierarchical models, we demonstrate it in the context of a Deconvolutional Network model (Zeiler et al. ICCV 2011). We also explore a number of secondary issues within this model and present detailed experiments on MNIST digits.


1 Introduction

A number of recent approaches in vision and machine learning have explored hierarchical representations for images and video, with the goal of learning features for object recognition. One class of methods, for example Convolutional Neural Networks [13] or the recent RICA model of Le et al. [12], uses a purely feed-forward hierarchy that maps the input image to a set of features which are presented to a simple classifier. Another class of models attempts to build hierarchical generative models of the data. These include Deep Belief Networks [9], Deep Boltzmann Machines [19] and the Compositional Models of Zhu et al. [23, 4].

Spatial pooling is a key mechanism in all these hierarchical image representations, giving invariance to local perturbations of the input and allowing higher-level features to model large portions of the image. Sum and max pooling are the most common forms, with max being typically preferred (see Boureau et al. [3] for an analysis).

In this paper we introduce a parametric form of pooling that can be directly integrated into the overall objective function of many hierarchical models. Using a Gaussian parametric model, we can directly optimize the mean and variance of each Gaussian pooling region during inference to minimize a global objective function. This contrasts with existing pooling methods that just optimize a local criterion (e.g. max over a region). Adjusting the variance of each Gaussian allows a smooth transition between selecting a single element from the pooling region (akin to max pooling) and averaging over it (like a sum operation).

Integrating pooling into the objective facilitates joint training and inference across all layers of the hierarchy, something that is often a major issue in many deep models. During training, most approaches build up layer-by-layer, holding the output of the layer beneath fixed. However, this is sub-optimal, since the features in the low-layers cannot use top-down information from a higher layer to improve them. A few approaches do perform full joint training of the layers, notably the Deep Boltzmann Machine [19], and Eslami et al. [5], as applied to images, and the Deep Energy Models of Ngiam et al. [15]. We demonstrate our differentiable pooling in a third model with this capability, the Deconvolutional Networks of Zeiler et al. [22]. This is a simple sparse-coding model that can be easily stacked and we show how joint inference and training of all layers is possible, using the differentiable pooling. However, differentiable pooling is not confined to the Deconvolutional Network model – it is capable of being incorporated into many existing hierarchical models.

The latent variables that control the Gaussians in our pooling scheme store location information (“where”), distinct from the features that capture appearance (“what”). This separation of what/where is also present in Ranzato et al. [17], the transforming auto-encoders of Hinton et al. [7], and Zeiler et al. [22].

In this paper, we also explore a number of secondary issues that help with training deep models: non-negativity constraints; different forms of sparsity; overcoming local minima during inference and different sparsity levels during training and testing.

Figure 1: (a): A 2-layer model architecture. (b): Schematic of inference in a two layer model. (c): Illustration of the Gaussian parameterization used in our differentiable pooling.

2 Model Overview

We explain our contributions in the context of a Deconvolutional Network, introduced by Zeiler et al. [22]. This model is a hierarchical form of convolutional sparse coding that can learn invariant image features in an unsupervised manner. Its simplicity allows the easy integration of differentiable pooling and is amenable to joint inference over all layers.

Let us start by reviewing a single Deconvolutional Network layer, presented with an input image y (having K_0 color channels). The goal is to produce a reconstruction ŷ from sparse features z that is close to y. We achieve this by minimizing:

C(y) = \frac{\lambda}{2} \| \hat{y} - y \|_2^2 + \sum_i |z_i|^p \qquad (1)

where λ is a hyper-parameter that controls the influence of the reconstruction term. z consists of a set of 2-D feature maps, thus forming an over-complete basis. To give a unique solution, a sparsity constraint on z is needed and we use an element-wise ℓ_p pseudo-norm, where 0 < p ≤ 1. The reconstruction ŷ is produced from z by two sub-layers: Unpooling and Convolution.

2.1 Unpooling

In the unpooling sub-stage, each 2D feature map z_k undergoes an unpooling operation to produce a larger 2D unpooled feature map ẑ_k (3D (un)pooling is also possible, as explored in [22]). Each element z_{k,n} influences a small neighborhood N_n in the unpooled map ẑ_k (2x2 in our experiments), via a set of weights w within the neighborhood:

\hat{z}_{k,(i,j)} = w_{k,(i,j)} \, z_{k,n}, \quad (i,j) \in N_n \qquad (2)

We constrain the weights within each neighborhood to have unit ℓ2-norm, as this makes the unpooling operation invertible (combining Eqns. 2 and 3 gives z_{k,n} = \left( \sum_{(i,j) \in N_n} w_{k,(i,j)}^2 \right) z_{k,n}, hence \sum_{(i,j) \in N_n} w_{k,(i,j)}^2 = 1). The inverse (pooling) operation computes each element z_{k,n} as the weighted sum over neighborhood N_n of the unpooled map ẑ_k:

z_{k,n} = \sum_{(i,j) \in N_n} w_{k,(i,j)} \, \hat{z}_{k,(i,j)} \qquad (3)

In Zeiler et al. [22], max (un)pooling was used, equivalent to the weights w being all zero except for a single element set to 1. In this work, we consider more general w's, as detailed in Section 2.5, treating them as latent variables which are inferred for each input image. Note that each element in z_k has its own set of w's.

For the rest of the paper, we consider the neighborhoods to be non-overlapping, but the above formulation generalizes to overlapping regions as well. For brevity, we write the unpooling operation as a single linear matrix U_w, parameterized by the weights w: ẑ = U_w z.
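
To make the (un)pooling operations concrete, the following NumPy sketch (our illustration, not the authors' code) implements Eqns. 2 and 3 for a single neighborhood; the unit ℓ2-norm constraint on the weights makes the pool-unpool round trip exact.

import numpy as np

def unpool_neighborhood(z_n, w):
    """Eqn. 2: spread one pooled value z_n over its neighborhood using weights w."""
    return w * z_n

def pool_neighborhood(z_hat_n, w):
    """Eqn. 3: weighted sum of the unpooled neighborhood back to a single value."""
    return (w * z_hat_n).sum()

# Example 2x2 neighborhood with unit l2-norm weights (sum of squares is 1)
w = np.array([[0.8, 0.2],
              [0.5, np.sqrt(1.0 - 0.8**2 - 0.2**2 - 0.5**2)]])
z_n = 3.0
assert np.isclose(pool_neighborhood(unpool_neighborhood(z_n, w), w), z_n)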

2.2 Convolution

In the convolution sub-stage, the reconstruction ŷ is formed by convolving the 2D unpooled feature maps ẑ_k with filters f and summing them:

\hat{y}_c = \sum_k f_{k,c} * \hat{z}_k \qquad (4)

where * is the 2D convolution operator and c indexes the color channels. The filters f are the parameters of the model, common to all images. The feature maps z are latent variables, specific to each image. For notational brevity, we combine the convolution and summing operations into a single convolution matrix F and convert the multiple 2D maps ẑ_k into a single vector ẑ: ŷ = F ẑ.
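
As an illustration of Eqn. 4, the following sketch (ours, assuming 'same' boundary handling; the paper does not specify the border treatment) performs the convolve-and-sum reconstruction:

import numpy as np
from scipy.signal import convolve2d

def reconstruct(unpooled_maps, filters):
    """Eqn. 4: y_hat_c = sum_k f_{k,c} * z_hat_k, summed over feature maps k.

    unpooled_maps : (K, H, W) array of unpooled feature maps z_hat
    filters       : (K, C, fh, fw) array of filters connecting map k to channel c
    """
    K, C = filters.shape[0], filters.shape[1]
    H, W = unpooled_maps.shape[1:]
    y_hat = np.zeros((C, H, W))
    for c in range(C):
        for k in range(K):
            y_hat[c] += convolve2d(unpooled_maps[k], filters[k, c], mode='same')
    return y_hat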

2.3 Discussion of Single Layer

The combination of the unpooling and convolution operations gives the reconstruction ŷ = F U_w z from the features z, and hence the single-layer objective:

C_1(y) = \frac{\lambda}{2} \| F U_w z - y \|_2^2 + \sum_i |z_i|^p \qquad (5)

A single layer of the model is shown in the lower part of Fig. 1(a). This integrated formulation allows the straightforward optimization of the filters f, the features z and the (un)pooling weights w to minimize a single objective function. While most other models also learn filters and features, the pooling operation is typically fixed. Direct optimization of Eqn. 5 with respect to w is one of the main contributions of this work and is described in Section 2.5.

Note that, given fixed weights w, the reconstruction F U_w z is linear in z, thus Eqn. 5 describes a tri-linear model, with w coding position (where) information about the (what) features z.

Eqn. 5 differs from the original Deconvolutional Network formulation [22] in several important ways. First, sparsity is imposed directly on the pooled features z, as opposed to the unpooled maps ẑ. This integrates pooling into the objective function, allowing it to become part of the inference. Second, [22] considers only p = 1, rather than the hyper-Laplacian (p < 1) sparsity we employ. Third, z is non-negative, as opposed to [22] where there was no such constraint. Fourth, and most importantly, by inferring the optimal (un)pooling weights w we directly minimize the objective function of the model. Fixed sum or max pooling, employed by other approaches, is a local heuristic that has no clear relationship to the overall cost.

2.4 Multiple Layers

Multi-layer models are constructed by stacking the single layer model described above, in the same manner as Zeiler et al. [22]. The feature maps from one layer become the input maps to the layer above (which now has as many “color channels” as there are feature maps in the layer below).

An important property of the model is that the feature maps exist solely at the top of the model (there are no explicit features in the intermediate layers); thus the only variables at the intermediate layers are the filters f_l and the unpooling weights w_l. For an L-layer model, the reconstruction is:

\hat{y} = F_1 U_{w_1} F_2 U_{w_2} \cdots F_L U_{w_L} z_L \qquad (6)

where F_l and U_{w_l} are the convolution and unpooling operations of each layer l. We condense the sequence of unpooling and convolution operations into a single reconstruction operator R = F_1 U_{w_1} \cdots F_L U_{w_L}, which lets us write the overall objective for a multi-layer model (shown here for a single image y, but optimized over a set of images during training):

C(y) = \frac{\lambda}{2} \| R z_L - y \|_2^2 + \sum_i |z_{L,i}|^p \qquad (7)

A multi-layer model is shown in Fig. 1(a). Note that since R is linear, given the (un)pooling weights w, the reconstruction term is easily differentiable. The derivative of R is simply R^T, which is a forward propagation operator. This takes a signal at the input and repeatedly convolves (using flipped versions of the filters at each layer) and pools (using the weights w) all the way up to the features. This is a key operation for both inference and learning, as described in Section 3 and Section 4 respectively. Fig. 1(b) illustrates the reconstruction and forward propagation operations.

2.5 Differentiable Pooling

We impose a parametric form on the (un)pooling weights w to ensure that the features are invariant to small changes in the input. Without such a constraint, the pooling would be able to memorize the unpooled features perfectly, giving “lossless” pooling which would not generalize at all.

The parametric model we use is a 2D axis-aligned Gaussian, with mean (μ_x, μ_y) and precisions (ρ_x, ρ_y) over the pooling neighborhood N_n introduced in Section 2.1. The Gaussian is normalized within the extent of the pooling region to give weights whose squares sum to 1 (thus giving unit ℓ2 norm):

w_{(i,j)} = \frac{g_{(i,j)}}{\sqrt{\sum_{(u,v) \in N_n} g_{(u,v)}^2}} \qquad (8)

where g_{(i,j)} is the value of the (unnormalized) Gaussian at location (i,j) within the neighborhood N_n:

g_{(i,j)} = \exp\left( -\frac{\rho_x}{2} (i - \mu_x)^2 - \frac{\rho_y}{2} (j - \mu_y)^2 \right) \qquad (9)

Fig. 1(c) shows an illustration of this parameterization. For brevity, we let θ_n = {μ_x, μ_y, ρ_x, ρ_y} be the parameters for neighborhood N_n, and write the unpooling operation as ẑ = U_{w(θ)} z. The Gaussian representation has several advantages over existing sum or max pooling (a short NumPy sketch of the parameterization follows the list below):

  • Varying the mean of the Gaussian selects a particular region in the unpooled feature map, just like max pooling. This makes the feature invariant to small translations within the unpooled maps.

  • Varying the precision of the Gaussian allows a smooth variation between max and sum operations (high and low precision respectively).

  • Changes in precision allow invariance to small scale changes in the unpooled features. For example, the width of an edge can easily be altered by adjusting the variance (see Fig. 2(c)).

  • The continuous nature of the Gaussian allows sub-pixel reconstruction that avoids aliasing artifacts, which can occur with max pooling. See Fig. 5 for an illustration of this.

  • The Gaussian representation is differentiable, i.e. the gradient of Eqn. 5 with respect to the parameters θ has an analytic form, as detailed in Section 3.2.
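
The following NumPy sketch (our illustration) computes the normalized Gaussian weights of Eqns. 8 and 9 for a single pooling neighborhood; the neighborhood size and parameter values are only examples.

import numpy as np

def gaussian_weights(mu, prec, size=2):
    """Unit l2-norm Gaussian (un)pooling weights for one size x size neighborhood (Eqns. 8-9).

    mu   : (mu_x, mu_y) mean of the axis-aligned Gaussian within the neighborhood
    prec : (rho_x, rho_y) precisions along each axis
    """
    i, j = np.meshgrid(np.arange(size), np.arange(size), indexing='ij')
    g = np.exp(-0.5 * prec[0] * (i - mu[0]) ** 2 - 0.5 * prec[1] * (j - mu[1]) ** 2)
    return g / np.sqrt((g ** 2).sum())

# High precision concentrates the weight on a single location (max-like behaviour);
# low precision spreads it almost uniformly (sum/average-like behaviour).
w_sharp = gaussian_weights(mu=(0.0, 1.0), prec=(50.0, 50.0))
w_flat = gaussian_weights(mu=(0.5, 0.5), prec=(0.01, 0.01))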

2.6 Non-Negativity

In standard sparse coding and other learning methods both the feature activations and the learned parameters can be positive or negative. This contrasts with our model, in which we enforce non-negativity.

This is motivated by several factors. First, there is no notion of negative intensities or objects in the visual world. Second, the Gaussian parameterization used in the differentiable pooling scheme, described in Section 2.5, has positive weights and so cannot represent individual negative values in the unpooled feature maps. Third, there is some biological evidence for non-negative representations within the brain [10]. Finally, we find experimentally that non-negativity reduces the flexibility of the model, encouraging it to learn good representations. The features computed at test time have improved classification performance, compared with models without this constraint (see Section 6.4).

2.7 Hyper-Laplacian Sparsity

Most sparse coding models utilize the ℓ1-norm to enforce a sparsity constraint on the features [16], as a tractable proxy for optimizing sparsity [21]. However, a drawback of this form of regularization is that it gives the same cost to two elements of value 0.5 as to one element at 1 and the other at 0, even though the latter configuration is sparser and should be preferred.

To encourage the sparser, lower-cost solution, we use an ℓ_p pseudo-norm with p < 1 in Eqn. 5, inspired by Krishnan and Fergus [11], which aggressively pushes small elements toward zero. To optimize this, we experimented with the techniques in [11], but settled on gradient descent for simplicity.
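
As a worked comparison (our illustration, using p = 0.5 as an example exponent):

\ell_1 \; (p = 1): \quad |0.5| + |0.5| \;=\; 1 \;=\; |1| + |0|
p = 0.5: \quad |0.5|^{0.5} + |0.5|^{0.5} \;\approx\; 1.41 \;>\; |1|^{0.5} + |0|^{0.5} \;=\; 1

so the hyper-Laplacian penalty strictly prefers the sparser configuration, whereas the ℓ1 penalty is indifferent between the two.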

3 Inference

During inference, the filters f at all layers are fixed and the objective is to find the features z and the (un)pooling variables θ for all neighborhoods and all layers that minimize Eqn. 7. We do this by alternating between updating the features z and the Gaussian pooling variables θ, while holding the other fixed.

3.1 Feature Updates

For a given layer l, we seek the features z_l that minimize the cost C(y) (Eqn. 7), given an input image y, filters f and unpooling variables θ. This is a large convolutional sparse coding problem and we adapt the ISTA scheme of Beck and Teboulle [1], which uses an iterative framework of gradient and shrinkage steps.

Gradient step: The gradient of C(y) with respect to z_l is:

\frac{\partial C(y)}{\partial z_l} = \lambda R^T (R z_l - y) \qquad (10)

This involves first reconstructing the input from the current features, ŷ = R z_l, computing the error signal e = ŷ − y, and then forward propagating the error up the network to obtain the top layer gradient g = λ R^T e. Given the gradient, we can then update z_l:

z_l \leftarrow z_l - \beta \, g \qquad (11)

where the parameter β sets the size of the gradient step.

Shrinkage step: Following the gradient step, we perform a per-element shrinkage operation that clamps small elements in z_l to zero, increasing its sparsity. For p = 1, we use the standard soft shrinkage:

z_l \leftarrow \mathrm{sign}(z_l) \max(|z_l| - \beta, 0) \qquad (12)

For p < 1, we instead take a gradient step on the regularizer:

z_l \leftarrow z_l - \beta \, p \, \mathrm{sign}(z_l) \, |z_l|^{p-1} \qquad (13)

Projection step: After shrinking small elements away, the solution is then projected onto the non-negative set:

z_l \leftarrow \max(z_l, 0) \qquad (14)

Step size calculation: In order to set a step size for the feature map optimization, we employ an estimation technique for steepest descent problems [20] which uses the gradient g:

\beta = \frac{g^T g}{\lambda \, g^T R^T R \, g} \qquad (15)

Automating the step-size computation has two advantages. First, each layer requires a significantly different step size on account of the differences in architecture, making it hard to set manually. Second, by computing the step size before each gradient step, each ISTA iteration makes good progress at reducing the overall cost. In practice, we find fixed step sizes to be significantly inferior.

β is computed once per mini-batch. For efficiency, instead of computing the denominator in Eqn. 15 for each image, we estimate it from a small portion (10%) of each mini-batch.
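
Putting the four steps together, a single ISTA iteration can be sketched as follows (our illustration; R and Rt are assumed callables implementing the reconstruction operator R and the forward-propagation operator R^T):

import numpy as np

def ista_step(z, y, R, Rt, lam=1.0, p=1.0):
    """One ISTA-style feature update (Eqns. 10-15), written as a sketch."""
    e = R(z) - y                                     # reconstruction error (Eqn. 10)
    g = lam * Rt(e)                                  # gradient at the features
    beta = g.ravel() @ g.ravel() / (lam * np.sum(R(g) ** 2) + 1e-12)  # step size (Eqn. 15)
    z = z - beta * g                                 # gradient step (Eqn. 11)
    if p == 1.0:                                     # shrinkage (Eqn. 12)
        z = np.sign(z) * np.maximum(np.abs(z) - beta, 0.0)
    else:                                            # gradient step on the p < 1 regularizer (Eqn. 13)
        nz = np.abs(z) > 1e-8
        z[nz] -= beta * p * np.sign(z[nz]) * np.abs(z[nz]) ** (p - 1.0)
    return np.maximum(z, 0.0)                        # projection to non-negative (Eqn. 14)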

Reset step:

Repeated optimization of the objective function tends to get stuck in local minima as it proceeds over the dataset for several epochs. We found a simple and effective way to overcome this problem: by setting all feature maps z to 0 every few epochs (essentially re-initializing inference), cleaner filters and better performing features can be learned, as demonstrated in Section 6.5.

This reset may be explained as follows. During alternating inference and learning stages, the model can overfit a mini-batch of data by optimizing either the filters or feature maps too much. This causes the model to lock up in a state where no new feature map element can turn on because the reconstruction performance is sufficient to have only a small error propagating forward to the feature level. Since no new features turn on after shrinkage, the filters remain fixed as they continue to get the same gradients. This can happen early in the learning procedure when the filters are still not optimal and therefore the learned representation suffers. By resetting the feature maps, at the next epoch the model has to reevaluate how to reconstruct the image from scratch, and can therefore turn on the optimal feature elements and continue to optimize the filters.

3.2 (Un)pooling Variable Updates

Given a model with L layers, we wish to update the (un)pooling variables θ_l at each intermediate layer to optimize the objective C(y). We assume that the filters and features are fixed.

The gradients for the pooling variables involve combining, at layer l, the forward-propagated error signal with the top-down reconstruction signal. This combined signal then drives the update of the pooling variables. More formally, for a weight w_{(i,j)} in neighborhood N_n of the layer l unpooled maps:

\frac{\partial C(y)}{\partial w_{(i,j)}} = \lambda \left[ F_l^T U_{w_{l-1}}^T \cdots F_1^T (\hat{y} - y) \right]_{(i,j)} \left[ F_{l+1} U_{w_{l+1}} \cdots F_L U_{w_L} z_L \right]_n \qquad (16)

where the second bracketed term is the top-down reconstruction from the layer L feature maps down to the layer l (pooled) feature maps, and the first is the error propagated up to the layer l unpooled maps.

With the chosen Gaussian parameterization of the pooling regions, the chain rule can be used to compute the gradient for each parameter θ ∈ {μ_x, μ_y, ρ_x, ρ_y} of neighborhood n:

\frac{\partial C(y)}{\partial \theta} = \sum_{(i,j) \in N_n} \frac{\partial C(y)}{\partial w_{(i,j)}} \frac{\partial w_{(i,j)}}{\partial \theta} \qquad (17)

where n is the neighborhood index. Differentiating the normalized weights of Eqn. 8:

\frac{\partial w_{(i,j)}}{\partial g_{(u,v)}} = \frac{\delta_{(i,j),(u,v)} - w_{(i,j)} w_{(u,v)}}{\|g\|} \qquad (18)

\frac{\partial w_{(i,j)}}{\partial \theta} = \sum_{(u,v) \in N_n} \frac{\partial w_{(i,j)}}{\partial g_{(u,v)}} \frac{\partial g_{(u,v)}}{\partial \theta} \qquad (19)

\|g\| = \sqrt{\sum_{(u,v) \in N_n} g_{(u,v)}^2} \qquad (20)

and differentiating the unnormalized Gaussian of Eqn. 9:

\frac{\partial g_{(i,j)}}{\partial \mu_x} = \rho_x (i - \mu_x) \, g_{(i,j)} \qquad (21)
\frac{\partial g_{(i,j)}}{\partial \mu_y} = \rho_y (j - \mu_y) \, g_{(i,j)} \qquad (22)
\frac{\partial g_{(i,j)}}{\partial \rho_x} = -\tfrac{1}{2} (i - \mu_x)^2 \, g_{(i,j)} \qquad (23)
\frac{\partial g_{(i,j)}}{\partial \rho_y} = -\tfrac{1}{2} (j - \mu_y)^2 \, g_{(i,j)} \qquad (24)

where i and j are the coordinates within the pooling neighborhood N_n.

Once the complete gradient is computed as in Eqn. 17, we take a gradient step on each pooling variable:

\theta \leftarrow \theta - \beta_\theta \, \frac{\partial C(y)}{\partial \theta} \qquad (25)

using a fixed step size β_θ. We experimented with a step-size estimate similar to Eqn. 15 for the pooling parameters, however we found the estimates to be unstable, likely due to the nonlinear derivatives involved in the Gaussian pooling.
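
As a sanity check on the derivatives above (our illustration; variable names are not from the paper), the analytic gradient of the normalized weights can be compared against a finite-difference estimate:

import numpy as np

def gaussian_weights(theta, size=2):
    """theta = (mu_x, mu_y, rho_x, rho_y); unit l2-norm weights of Eqns. 8-9."""
    mu_x, mu_y, rho_x, rho_y = theta
    i, j = np.meshgrid(np.arange(size), np.arange(size), indexing='ij')
    g = np.exp(-0.5 * rho_x * (i - mu_x) ** 2 - 0.5 * rho_y * (j - mu_y) ** 2)
    return g / np.sqrt((g ** 2).sum())

def weight_gradients(theta, size=2):
    """d w_(i,j) / d theta for theta in {mu_x, mu_y, rho_x, rho_y} (Eqns. 18-24)."""
    mu_x, mu_y, rho_x, rho_y = theta
    i, j = np.meshgrid(np.arange(size), np.arange(size), indexing='ij')
    g = np.exp(-0.5 * rho_x * (i - mu_x) ** 2 - 0.5 * rho_y * (j - mu_y) ** 2)
    norm = np.sqrt((g ** 2).sum())
    w = g / norm
    dg = [rho_x * (i - mu_x) * g,        # d g / d mu_x
          rho_y * (j - mu_y) * g,        # d g / d mu_y
          -0.5 * (i - mu_x) ** 2 * g,    # d g / d rho_x
          -0.5 * (j - mu_y) ** 2 * g]    # d g / d rho_y
    # d w / d theta = (d g / d theta - w * sum(w * d g / d theta)) / ||g||
    return [(d - w * (w * d).sum()) / norm for d in dg]

theta = np.array([0.3, 0.8, 2.0, 0.5])
for k, dw in enumerate(weight_gradients(theta)):
    eps = 1e-5
    tp, tm = theta.copy(), theta.copy()
    tp[k] += eps
    tm[k] -= eps
    num = (gaussian_weights(tp) - gaussian_weights(tm)) / (2 * eps)
    assert np.allclose(dw, num, atol=1e-5)  # analytic and numerical gradients agree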

0:  Training set Y, # layers L, # epochs E, # ISTA steps T
0:  Regularization coefficients λ_l, # feature maps K_l
0:  Pooling step sizes β_θ
1:  for l = 1 : L do %% Loop over layers
2:     Init. features/filters: z_l = 0, f_l ∼ N(0, ε)
3:     Init. pooling variables (switches): θ_1, …, θ_l
4:     for epoch e = 1 : E do %% Epoch iteration
5:         for each image y in Y do %% Loop over images
6:            for t = 1 : T do %% ISTA iteration
7:               Reconstruct input: ŷ = R z_l
8:               Compute reconstruction error: e = ŷ − y
9:               Propagate error up to layer l: g = λ_l R^T e
10:               Estimate step size β as in Eqn. 15
11:               Take gradient step on z_l: z_l = z_l − β g
12:               Perform shrink: z_l = shrink(z_l, β) (Eqns. 12/13)
13:               Project to positive: z_l = max(z_l, 0)
14:               for m = 1 : l do %% Loop over lower layers
15:                   Take gradient step on θ_m: θ_m = θ_m − β_θ ∂C(y)/∂θ_m (Eqn. 25)
16:               end for
17:            end for
18:         end for
19:         Update f_l by solving Eqn. 26 using CG
20:         Project f_l to positive and unit length
21:     end for
22:  end for
23:  Output: filters f, feature maps z and pooling variables θ.
Algorithm 1 Learning with Differentiable Pooling in Deconvolutional Networks

4 Learning

After inference of the top layer feature maps z_L and the (un)pooling variables θ for all layers is complete, the filters in each layer are updated. This is done using the gradient of the cost with respect to each layer's filters f_l:

\frac{\partial C(y)}{\partial f_l^{c,k}} = \lambda \left[ U_{w_{l-1}}^T F_{l-1}^T \cdots F_1^T (\hat{y} - y) \right]_c * \left[ U_{w_l} F_{l+1} \cdots F_L U_{w_L} z_L \right]_k \qquad (26)

where the left term is the bottom-up error signal propagated up to the feature maps below the given filters, and the right term is the top-down reconstruction of the unpooled feature maps ẑ_l. The gradient is therefore the convolution between all combinations of input error maps to the layer (indexed by c) and the unpooled feature maps reconstructed from above (indexed by k), resulting in updates of each filter plane f_l^{c,k}, for each layer l.

In practice we use batch conjugate gradient updates for learning the filters, as the model is linear in f_l once the feature maps and pooling parameters are inferred. After 2 steps of conjugate gradients, the filters are projected to be non-negative and renormalized to unit length.
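
A per-layer sketch of this filter gradient (our illustration; it assumes the forward reconstruction uses 'valid' convolutions, so the boundary handling may differ from the authors' implementation):

import numpy as np
from scipy.signal import correlate2d

def filter_gradient(err_maps, unpooled_maps, filt_shape, lam=1.0):
    """Gradient of the reconstruction cost w.r.t. one layer's filters (cf. Eqn. 26).

    err_maps      : (C, H, W) error signal propagated to the maps directly below the filters
    unpooled_maps : (K, Hu, Wu) unpooled feature maps reconstructed from the layer above
    filt_shape    : (fh, fw) spatial size of each filter plane, with Hu = H + fh - 1
    """
    C, K = err_maps.shape[0], unpooled_maps.shape[0]
    fh, fw = filt_shape
    grad = np.zeros((C, K, fh, fw))
    for c in range(C):
        for k in range(K):
            g = correlate2d(unpooled_maps[k], err_maps[c], mode='valid')
            grad[c, k] = lam * g[::-1, ::-1]  # flip: the forward pass is a true convolution
    return grad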

4.1 Joint Inference

The objective function explicitly constrains the reconstruction from the top layer features to be close to the input image. From this we can calculate gradients for each layer's filters and pooling variables while optimizing the top-level feature maps. Therefore, for each image, we can infer the local shifts and scalings of the low-level features as the high-level concepts develop.

We have found that pre-training the first layer in one phase of training and then using the learned layer 1 pooling variables and filters to initialize a second phase of training works best. The second phase of training optimizes the second layer objective, from which we can update the layer 2 features, filters and pooling variables together with the layer 1 filters and pooling variables jointly. If care is not taken in this joint update, the first layer features can trade off representation power with the second layer filters. This can result in the second layer filters capturing the details while the first layer filters become dots. To avoid this problem, after the first phase of training we hold the layer 1 filters fixed and optimize the remaining variables jointly. Thus, while the filters are learned layer by layer, inference is always performed jointly across all layers. This has the nice property that the low-level parts can move and scale as the pooling variables are optimized while the high-level concepts are learned.

5 Initialization of Parameters

Before training, the filter parameters are initialized to Gaussian distributed random values. After this random initialization, the filters are projected to be non-negative and normalized to unit length before training begins.

Before inference, either at the start of training or at test time, we initialize the feature maps to 0. This creates a reconstruction of 0 in pixel space, so the initial gradient being propagated up the network is simply the forward propagation of the input image, λ R^T y (up to sign). This is similar to a feedforward network for the first iteration of inference. While forward propagating this signal up the network, we can leverage the Gaussian parameterization of the pooling regions to fit the pooling parameters using moment matching. That is, at each layer, we extract the optimal pooling parameters θ that fit this bottom-up signal. This provides a natural initialization of both the pooling variables at each layer and the top-level feature activations, given the input image and the filter initialization.
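
The text does not spell out the moment-matching procedure, so the following sketch shows one plausible way to fit the Gaussian parameters of a neighborhood to the bottom-up signal; the magnitude weighting and the variance floor are our assumptions.

import numpy as np

def fit_gaussian_by_moments(patch, eps=1e-6):
    """Fit (mu_x, mu_y, rho_x, rho_y) to one pooling neighborhood by moment matching."""
    h, w = patch.shape
    i, j = np.meshgrid(np.arange(h), np.arange(w), indexing='ij')
    m = np.abs(patch) + eps                      # use activation magnitudes as weights
    m = m / m.sum()
    mu_x, mu_y = (m * i).sum(), (m * j).sum()
    var_x = (m * (i - mu_x) ** 2).sum() + eps    # floor keeps the precision finite
    var_y = (m * (j - mu_y) ** 2).sum() + eps
    return mu_x, mu_y, 1.0 / var_x, 1.0 / var_y  # precisions are inverse variances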

6 Experiments

Evaluation on MNIST We choose to evaluate our model on the MNIST handwritten digit classification task. This dataset provides a relatively large number of training instances per class, has many other results to compare to, and allows easy interpretation of how a trained model is decomposing each image.

Pre-processing: The inputs were the unprocessed MNIST digits at 28x28 resolution. Since no preprocessing was done, the elements remained nonnegative.

Model architecture: We trained a 2 layer model with 5x5 filters in each layer and 2x2 non-overlapping pooling regions. The first layer contained 16 feature maps and the second layer contained 48 feature maps. Each of these 48 feature maps connects randomly to 8 different layer 1 feature maps through the second layer filters. These sizes were chosen to be comparable to [22] while being more amenable to GPU processing. The receptive fields of the second layer features are 14x14 pixels with this configuration, or one quarter the input image size.

Classification: One motivation of this paper was to analyze how the classification pipeline of Zeiler et al. [22] could be simplified by making the top level features of the network more informative. Therefore, in this paper we simply treat the top level activations inferred for each image as input to a linear SVM [6].

The only post-processing applied to these high-level activations is that overlapping patches are extracted and pooled, analogous to the dense SIFT processing which has been shown by many computer vision researchers to improve results [2]. This step provides an expansion in the number of inputs, allowing the linear SVM to operate in a higher dimensional space. For layer 1 classification these patches were 9x9 elements of the layer 1 feature maps. For layer 2 they were 6x6 patches, roughly the same ratio to the feature map size as for layer 1. These patches were concatenated as input to the classifier. Throughout the experiments we did not combine features from multiple layers, concatenating only layer 1 patches together for layer 1 classification and only layer 2 patches together for layer 2 classification. These final inputs to the classifier were each normalized to unit length.
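
A sketch of this classification pipeline for a single image (our illustration; the patch stride is an assumption, and sklearn's LinearSVC stands in for the liblinear classifier of [6]):

import numpy as np
from sklearn.svm import LinearSVC  # liblinear-backed linear SVM

def classifier_input(feature_maps, patch=9, stride=4):
    """Extract overlapping patches from each top-level feature map, concatenate
    them and normalize the result to unit length."""
    K, H, W = feature_maps.shape
    pieces = []
    for r in range(0, H - patch + 1, stride):
        for c in range(0, W - patch + 1, stride):
            pieces.append(feature_maps[:, r:r + patch, c:c + patch].ravel())
    v = np.concatenate(pieces)
    return v / (np.linalg.norm(v) + 1e-12)

# X = np.stack([classifier_input(f) for f in inferred_feature_maps]); y = digit_labels
# clf = LinearSVC().fit(X, y)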

Hyperparameters: By cross-validating on a 50,000 train and 10,000 validation split of the MNIST images, we selected the values of λ and p that gave optimal classification performance. Each layer was trained with 100 ISTA steps per epoch for 50 epochs (passes through the dataset). After epoch 25, the feature maps were reset to 0 during training. At test time, we found that higher settings of these hyperparameters improved classification, as did running only 50 ISTA steps of inference.

Figure 2: Visualization of the trained model: (a) reconstructions from layer 2, (b) the 16 layer 1 filter weights, (c) invariance visualization for layer 1 incorporating unpooling and convolutions (see Section 6.1 for details) (d) layer 2 filter weights (shown as 16 groups of filter planes connecting to all 48 layer 2 maps), (e) layer 2 pixel space invariance visualization of features projected down from samples of the layer 2 feature distribution (see Section 6.1).

6.1 Model visualization

By visualizing the filters and feature maps of the model, we can easily understand what it has learned. In Fig. 2(a) we demonstrate sharp reconstructions of the input images from the second layer feature maps. In Fig. 2(b) we display the raw filter coefficients for layer 1, which have learned small pieces of strokes. By incorporating the pooling parameters into the layer, these filters are robust to small changes in the input.

Visualizing these invariances of a model can be helpful in understanding the inputs the model is sensitive to. Searching through the dataset of inferred feature map activations and selecting the maximum element per feature map to project downward into the pixel space, as in [22], is one way of visualizing these invariances. However, these selected elements are only exemplars of inputs that most strongly activated that feature. In Fig. 2(c) we show a more representative selection of invariances by instead selecting a feature activation to be projected down based on sampling from the distribution of activations for that feature in the dataset. This gives a less biased view of what activates that feature than selecting the largest few activations from the dataset. Once a sample is selected for a given feature map, the pooling variables corresponding to the image from which the activation was selected are used in the unpooling stages to do the top-down visualization.

Examining the 16 sample visualizations for each feature in Fig. 2(c) shows the scale and shifts that the Gaussian pooling provides to these relatively simple first layer filters. We can continue to analyze the model by viewing the layer 2 filter planes in Fig. 2(d). Each of the 48 second layer features has 16 filter planes (shown in separate groups), one connecting to each of the layer 1 feature maps. While the second layer filters are difficult to understand directly, we can visualize the learned representation of the second layer by projecting down all the way to the pixel space through layer 1. Fig. 2(e) shows, for each of the 48 feature maps, a 4x4 grid of pixel space projections obtained by sampling 16 activations from the distribution of activations of each layer 2 feature and projecting each down separately via alternating convolution and unpooling with the corresponding pooling variables.

While analyzing the features in pixel space is informative, we have also found it is useful to view the features as decompositions of an input image to know how the model is representing the data. One possible method of displaying the decomposition is by coloring each pixel of the reconstruction according to which feature it came from. Each feature is assigned a hue (in no particular order) and the associated reconstruction produced then defines the saturation of that color. The resulting image therefore depicts the high level feature assignments. Pixels with brownish colors indicate a summation of several colors (features) together. Note that the input images themselves are grayscale – the colors are just for visualization purposes.
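
One way such a colored decomposition could be produced is sketched below (our illustration; the exact blending used for the figures may differ):

import numpy as np
from matplotlib.colors import hsv_to_rgb

def colorize_decomposition(per_feature_recons):
    """Blend a per-feature pixel-space reconstruction into a single color image.

    per_feature_recons : (K, H, W) reconstruction contributed by each top-level feature;
    each feature gets a hue and its reconstruction magnitude sets the color strength,
    so overlapping features mix (appearing brownish), as described in the text.
    """
    K = per_feature_recons.shape[0]
    hues = np.linspace(0.0, 1.0, K, endpoint=False)                      # one hue per feature
    base = hsv_to_rgb(np.stack([hues, np.ones(K), np.ones(K)], axis=1))  # (K, 3) RGB colors
    r = np.clip(per_feature_recons, 0.0, None)
    r = r / (r.max() + 1e-12)
    img = np.einsum('khw,kc->hwc', r, base)                              # weighted sum of colors
    return np.clip(img, 0.0, 1.0)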

Figure 3: One layer decomposition of a digit into parts. From the top down the layer 1 feature maps (a) are unpooled into (b) and convolved with (e) to produce the reconstruction (f). The colors in the reconstruction simply represent which feature the reconstructed pixel came from.
Figure 4: Two layer decomposition of a digit into parts. From the top down, the layer 2 feature maps (a) are unpooled into (b) and convolved with the layer 2 filters to produce the reconstruction of the layer 1 feature maps (c). These are unpooled into (d) and convolved with the layer 1 filters (e) to produce (f), colored according to the layer 2 feature from which each pixel was reconstructed.

In Fig. 3(d) we show such a reconstruction from layer 1 for the original image in (e). To understand the model we also show the layer 1 feature map activations in (a) with their corresponding color assignment around them. Notice that the sparse distribution of activations can reconstruct the entire image by utilizing the Gaussian pooling and the layer 1 filters in (c). Fig. 3(b) shows the result of this unpooling operation on the feature maps. Notice in the orange and purple boxes the elongated lines in the unpooled maps, made possible by a low precision in one dimension.

Fig. 4 takes this analysis one step further by using the second layer of the model. Starting from 3 features in the layer 2 feature maps as shown in (a), they are unpooled (as shown in (b)) and then convolved with the second layer filters to reconstruct many elements down onto the first layer feature maps (c). These are further unpooled to (d), where again you can see the benefit of the Gaussian pooling smoothly transitioning between non-overlapping pooling regions. These are finally convolved with the first layer filters (e) to give the decomposition shown in (f). Notice how long-range structures are grouped into common features in the higher layer compared to the layer 1 decomposition of Fig. 3.

6.2 Max Pooling vs Gaussian Pooling

The discrete locations that max pooling allows within a region are a limiting factor in the reconstruction quality of the model. Fig. 5 (bottom) shows that a significant aliasing effect is present in the visualizations of the model when max pooling is used. With the complex interactions between positive and negative elements removed, the model is not able to form smooth transitions between non-overlapping pooling regions, even though the filters used in the succeeding convolution sub-layer overlap between regions. Using the Gaussian pooling, the model can infer the desired precisions and means in order to optimize the reconstruction quality from the high layers of the model.

This fine tuning of the reconstruction allows for improvements without significantly varying the feature activations (i.e. it maintains or decreases the sparsity while adjusting the pooling parameters). This is confirmed in Fig. 6, where we break down the cost function into its reconstruction and regularization terms. In this figure we also display the sparsity of each model, as this can be used directly for comparison.

The Gaussian pooling significantly outperforms Max pooling in terms of optimizing the objective. By not being able to adjust the pooling variables to optimize the overall cost, Max pooling plateaus despite running for many epochs. Additionally it has a much higher cost throughout training. In contrast, the cost with Gaussian pooling decreases smoothly throughout training because the model can fine tune the pooling parameters to explain much more with each feature activation. This property is shown in Table 1 to significantly improve classification performance compared to Max pooling when stacking.

Layer 1 Layer 2
Max Pooling
Gaussian Pooling
Table 1: MNIST error rate of max pooling versus Gaussian pooling for 1 and 2 layer models. Note the performance improvement when stacking layers with Gaussian pooling.
Figure 5: Feature decomposition comparison between Gaussian pooling (top) and max pooling (bottom). Each reconstructed pixel’s color corresponds to the layer 2 feature map it was reconstructed from. Note the reuse of similar strokes in digits of a different class. Aliasing artifacts are present in the reconstructions using max pooling – see Section 6.2.
Figure 6: Breakdown of cost function into reconstruction and regularization terms for Max and Gaussian pooling for 2 layer models. Gaussian pooling gives consistently lower cost than max pooling. Furthermore, the sparsity (shown in blue), is significantly lower for the Gaussian pooling, although not explicitly part of the cost.

6.3 Joint Inference

One of the main criticisms of sparse coding methods is that inference must be conducted even at test time due to the lack of a feedforward connection to encode the features. In our approach we discovered two fundamental techniques that mitigate this drawback.

The first is that running a joint inference procedure over both layers of our network improves the classification performance compared to running each layer separately. Instead of inferring the feature maps and pooling variables for the first layer and then using these pooling variables to initialize the second layer inference (2 phases), we can directly run inference with a two layer model. The differentiable pooling allows us to infer the pooling variables of both layers in addition to the layer 2 feature values simultaneously in 1 phase. At the first iteration of inference we leverage the ability to fit the Gaussian pooling parameters in a feed forward way as mentioned in Section 5. This halves the number of inference iterations needed by not requiring any first layer inference prior to inferring the second layer.

To examine this first discovery in depth, we considered several combinations of how to jointly train and then run inference at test time with this model. During training we have found, both qualitatively in terms of feature diversity and quantitatively in terms of classification performance, that training in separate phases, one for each layer of the model, works better than jointly training both layers from scratch. In the second phase of training, when optimizing for reconstruction from the second layer feature maps, the first layer pooling variables and filters can either be updated or held fixed. Each row of Table 2 examines a combination of these updates during training. We can see that the optimal training scheme was with fixed first layer filters but pooling updates on both layers. This made the system more stable while still allowing the first layer filters to move and scale as needed by updating the first layer pooling variables.

In all cases we see a significant reduction in error rates when doing inference in 1 phase. The middle column of the table shows this 1 phase inference but without optimizing the first layer pooling parameters, whereas the last column does optimize them. We see an improvement from updating the first layer pooling parameters for all but the last row, which was trained without such updates and so is accustomed to that type of inference. This improvement with joint inference of the first layer pooling variables and the second layer features is a key finding which is only possible with differentiable pooling.

Training scheme | Infer in 2 phases | Infer in 1 phase (no layer 1 pooling updates) | Infer in 1 phase (all updates)
Updating
Updating
Updating
No Layer 1 Updates
Table 2: Comparison of joint training techniques. Each row is a trained two layer model that updates selected layer 1 variables during training (in addition to the layer 2 features, filters and pooling variables). The three columns use these models but run inference at test time in 2 phases, in 1 phase without updating the layer 1 pooling variables, and in 1 phase with all updates, respectively.

The second discovery that reduces evaluation time is that running the same number of ISTA iterations as was done during training does not give optimal classification performance, possibly due to over-sparsification of the features. Similarly running with too few iterations also reduces performance. Fig. 7 shows a plot comparing the number of ISTA iterations to the classification performance with an optimum at 50 ISTA steps, half the number used during training.

Figure 7: Comparison of classification errors versus number of ISTA steps used during inference.

6.4 Effects of Non-Negativity

With negative elements present in the system, many possible solutions can be found during optimization, because subtractions allow portions of high-level features to be removed. This has the effect of making the features less discriminative: the model can adjust the parameters between the high-level feature activations and the input image in order to reconstruct better, while assigning less meaning to the feature activations themselves.

To show this is not an artifact of the Gaussian pooling being more suited to non-negative systems (due to the summation over the pooling region possibly leading to cancellations if negatives are present), we include a comparison to max pooling in Table 3. In both cases, enforcing positivity via projected gradient descent improves the discriminative information preserved in the features.

Positive/Negative Non-negative
Max Pooling
Gaussian Pooling
Table 3: MNIST error rate for Max and Gaussian models trained with and without the non-negativity constraint.

6.5 Effects of Feature Reset

When training the model on MNIST, some less-than-optimal filters are learned when the feature maps are not reset. For example, in Fig. 8(c) many of the layer 1 filters are block-like, such as the one in the 3rd row, 2nd column. However, this same feature in (a) improves if the feature maps are reset to 0 once, halfway through training. This single reset is enough to encourage the filters to specialize and improve. Similarly, the layer 2 pixel visualizations in (b) have much more variation due to the reset compared to (d), which did not have the reset. In particular, notice the many blob-like features learned without the reset in (d), such as the 2nd and 5th rows of the 1st column, that improve in (b). These larger, more varied features learned with the reset help improve classification performance, as shown in Table 4.

Figure 8: Qualitative difference in first layer filters with (left) and without (right) resetting of the feature maps.
Trained with No Reset Trained with Reset
Table 4: MNIST error rates for 2 layer models trained with and without resetting the feature maps.

6.6 Effects of Hyper-Laplacian Sparsity

It has previously been shown that sparsity encourages the learning of distinctive features, but that it is not necessarily useful for classification [18, 22]. We analyze this in the context of hyper-Laplacian sparsity applied to both training and inference. In this comparison we trained two models, one with an ℓ1 (p = 1) prior on the feature maps and the other with a hyper-Laplacian (p < 1) prior. Once trained, we took each model and ran inference with both priors. For reference, the sparsity levels of the two training runs were 4.2 and 20.2 under the same λ setting. Since the amount of sparsity can also be controlled during inference by λ, we plot in Fig. 9 the classification performance for various λ settings in these four model combinations.

Figure 9: Error rates for the p = 1 and p < 1 priors used in training and inference.

Interestingly, utilizing the added sparsity enforced during training by the hyper-Laplacian (p < 1) prior, while using the more relaxed ℓ1 prior for inference, is the optimal combination for all λ settings. This suggests sparsity is useful during training to learn meaningful features, but is not as useful for inference at test time.

6.7 Comparison to Other Methods

We chose the MNIST dataset for its large number of results to compare to. Of these, deep learning methods typically fall into one of two categories: 1) those that are completely unsupervised and have a simple classifier on top, or 2) those that are fine-tuned discriminatively with labels. Our method falls into the first category as it is completely unsupervised during training, and only the linear SVM applied on top has access to the label information of the training set. We do not back-propagate this information through the network, but this would be an interesting future direction to pursue. Table 5 shows our method is competitive with other deep generative models, even surpassing several which use discriminative fine-tuning.

Pre-training Fine-tuning
Our Method
CDBN (1+2 layers) [14]
DBN (3 layers) [8] [9]
DBM (2 layers) [19]
Table 5: MNIST error rates for related generative models.

7 Discussion

In this work we introduced the concept of differentiable pooling for deep learning methods. We also demonstrated that jointly training the model improves performance, that positivity encourages the model to learn better representations, and that there is an optimal amount of sparsity to be used during training and inference. Finally, we introduced a simple resetting scheme to avoid local minima and learn better features. We believe many of the approaches and findings in this work are applicable not only to Deconvolutional Networks but also to sparse coding and other deep learning methods in general.

References

  • [1] A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.
  • [2] Y. Boureau, F. Bach, Y. LeCun, and J. Ponce. Learning mid-level features for recognition. In CVPR. IEEE, 2010.
  • [3] Y. Boureau, J. Ponce, and Y. LeCun. A theoretical analysis of feature pooling in vision algorithms. In ICML, 2010.
  • [4] Y. Chen, L. Zhu, C. Lin, A. Yuille, and H. Zhang. Rapid inference on a novel and/or graph for object detection, segmentation and parsing. In NIPS, 2007.
  • [5] S. Eslami, N. Heess, and J. Winn. The shape boltzmann machine: a strong model of object shape. In CVPR, 2012.
  • [6] R. E. Fan, K. W. Chang, C. J. Hsieh, X. R. Wang, and C. J. Lin. Liblinear: A library for large linear classification. Journal of Machine Learning Research, 9:1871–1874, 2008.
  • [7] G. E. Hinton, A. Krizhevsky, and S. Wang. Transforming auto-encoders. In ICANN-11, 2011.
  • [8] G. E. Hinton, S. Osindero, and Y. W. Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18:1527–1554, 2006.
  • [9] G. E. Hinton and R. R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.
  • [10] P. O. Hoyer. Modeling receptive fields with non-negative sparse coding. Neurocomputing, 52-54:547–552, 2008.
  • [11] D. Krishnan and R. Fergus. Fast image deconvolution using hyper-Laplacian priors. In NIPS, 2009.
  • [12] Q. Le, M. Ranzato, R. Monga, M. Devin, K. Chen, G. Corrado, J. Dean, and A. Ng. Building high-level features using large scale unsupervised learning. In ICML, 2012.
  • [13] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Comput., 1(4):541–551, 1989.
  • [14] H. Lee, R. Grosse, R. Ranganath, and A. Y. Ng. Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In ICML, pages 609–616, 2009.
  • [15] J. Ngiam, Z. Chen, P. Koh, and A. Ng. Learning deep energy models. In ICML, 2011.
  • [16] B. A. Olshausen and D. J. Field. Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research, 37(23):3311–3325, 1997.
  • [17] M. Ranzato, F. Huang, Y. Boureau, and Y. LeCun. Unsupervised learning of invariant feature hierarchies with applications to object recognition. In CVPR, 2007.
  • [18] R. Rigamonti, M. Brown, and V. Lepetit. Are sparse representations really relevant for image classification? In CVPR, pages 1545–1552, 2011.
  • [19] R. Salakhutdinov and G. Hinton. Deep Boltzmann machines. In AISTATS, volume 5, pages 448–455, 2009.
  • [20] J. R. Shewchuk. An introduction to the conjugate gradient method without the agonizing pain. Technical Report CMU-CS-94-125, Carnegie Mellon University, 1994.
  • [21] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, 58, 1996.
  • [22] M. Zeiler, G. Taylor, and R. Fergus. Adaptive deconvolutional networks for mid and high level feature learning. In ICCV, 2011.
  • [23] L. Zhu, Y. Chen, A. Yuille, and W. Freeman. Latent hierarchical structural learning for object detection. In CVPR, 2010.