Learning to relate images: Mapping units, complex cells and simultaneous eigenspaces

10/01/2011, by Roland Memisevic et al.

A fundamental operation in many vision tasks, including motion understanding, stereopsis, visual odometry, and invariant recognition, is establishing correspondences between images or between images and data from other modalities. We present an analysis of the role that multiplicative interactions play in learning such correspondences, and we show how learning and inferring relationships between images can be viewed as detecting rotations in the eigenspaces shared among a set of orthogonal matrices. We review a variety of recent multiplicative sparse coding methods in light of this observation. We also review how the squaring operation performed by energy models and by models of complex cells can be thought of as a way to implement multiplicative interactions. This suggests that the main utility of including complex cells in computational models of vision may be that they can encode relations, not invariances.


1 Introduction

Correspondence is arguably the most ubiquitous computational primitive in vision: tracking amounts to establishing correspondences between frames; stereo vision, between different views of a scene; optical flow, between any two images; invariant recognition, between images and invariant descriptions in memory; odometry, between images and motion information; action recognition, between frames; and so on. In these and many other tasks, it is the relationship between images, not the content of any single image, that carries the relevant information. Representing structure within a single image, such as contours, can also be considered an instance of a correspondence problem, namely between areas, or pixels, within the image. (The importance of image correspondence in action understanding is nicely illustrated in Heider and Simmel’s 1944 video of geometric objects engaged in various “social activities” [15], although the original intent of that video goes beyond making a case for correspondences. Each single frame depicts a rather meaningless set of geometric objects and conveys almost no information about the content of the movie. The only way to understand the movie is by understanding the motions and actions, and thus by decoding the relationships between frames.) The fact that correspondence is such a common operation across vision suggests that the task of representing relations may have to be kept in mind when trying to build autonomous vision systems and when trying to understand biological vision.

A lot of progress has been made recently in building models that learn to solve tasks like object recognition from independent, static images. One of the reasons for this progress is the use of local features, which help virtually eliminate the notoriously difficult problems of occlusions and small invariances. A central finding is that the right choice of features, not the choice of high-level classifier or computational pipeline, is what typically makes a system work well. Interestingly, some of the best performing recognition models are highly biologically consistent, in that they are based on features that are learned from data without supervision. Besides being biologically plausible, feature learning comes with various benefits, such as reducing tedious engineering, helping adapt to new domains and allowing for some degree of end-to-end learning in place of constructing, and then combining, a large number of modules to solve a task. The fact that tasks like object recognition can be solved using biologically consistent, learning-based methods raises the question of whether understanding relations can be amenable to learning in the same way. If so, this may open up the road to learning-based and/or biologically consistent approaches to a much larger variety of problems than static object recognition, and perhaps also beyond vision.

In this paper, we review a variety of recent methods that address correspondence tasks by learning local features. We discuss how the common computational principle behind all of these methods is multiplicative interactions, which were introduced to the vision community years ago under the terms “mapping units” [18] and “dynamic mappings” [48]. An illustration of mapping units is shown in Figure 1: the three variables shown in the figure interact multiplicatively, and as a result, each variable (say, z) can be thought of as dynamically modulating the connection between the other two variables in the model (x and y). Likewise, the value of any variable (e.g., z) can be thought of as depending on the product of the other variables (x, y) [18]. This is in contrast to common feature learning models like ICA, restricted Boltzmann machines, auto-encoder networks and many others, all of which are based on bi-partite networks that do not involve any three-way multiplicative interactions. In these models, independent hidden variables interact with independent observable variables, such that the value of any variable depends on a weighted sum, not a product, of the other variables. Closely related to models of mapping units are energy models (for example, [1]), which may be thought of as a way to “emulate” multiplicative interactions by computing squares.

We shall show how both mapping units and energy models can be viewed as ways to learn and detect rotations in a set of shared invariant subspaces of a set of commuting matrices. Our analysis may help explain why action recognition methods seem to profit from squaring non-linearities (for example, [27]), and it predicts that squaring and cross-products will be helpful, in general, in applications that involve representing relations.

1.1 A brief history of multiplicative interactions

Shortly after mapping units were introduced [18], energy models [1] received a lot of attention. Energy models are closely related to cross-correlation models [2], which, in turn, are a type of multiplicative interaction model. Energy models have been used as a way to model motion (relating time frames in a video) [1] and stereo vision (relating images across different eyes or cameras) [33]. An energy model is a computational unit that relates images by summing over squared responses of, typically two, linear projections of the input data. This operation can be shown to encode translations independently of content [7], [37] (cf. Section 3).

Early approaches to building and applying energy and cross-correlation models were based entirely on hand-wiring (see, for example, [37], [41], [7]). Practically all of these models use Gabor filters as the linear receptive fields whose responses are squared and summed. The focus on Gabor features has somewhat biased the analysis of energy models towards the Fourier spectrum as the main object of interest (see, for example, [7, 37]). As we shall discuss in Section 3, Fourier components arise only as the special case of one transformation class, namely translation, and many of the analyses apply more generally and to other types of transformation.

Gabor-based energy models have also been applied monocularly. In this case they encode features independently of the Fourier-phase of the input. As a result, their responses are invariant to small translations as well as to contrast variations of the input. In part for this reason, energy models have been popular in models of complex cells, which are known to show similar invariance properties (see, for example, [22]).
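To make the squaring argument concrete, the following toy numpy snippet (ours, not taken from the paper) builds a quadrature pair of Gabor filters and compares individual filter responses with their summed squares, for a narrow-band signal and a slightly shifted copy of it. The individual responses change with the shift, while the pooled energy stays roughly constant. All numbers (signal length, frequency, envelope widths, shift) are arbitrary choices for illustration.

```python
import numpy as np

n = 128
t = np.arange(n)
freq = 0.1           # carrier frequency of the Gabor pair (cycles per sample)
center = n // 2

# Quadrature Gabor pair: same Gaussian envelope, cosine vs. sine carrier.
envelope = np.exp(-0.5 * ((t - center) / 8.0) ** 2)
gabor_even = envelope * np.cos(2 * np.pi * freq * t)
gabor_odd  = envelope * np.sin(2 * np.pi * freq * t)

def energy(signal):
    """Energy-model response: sum of squared quadrature filter outputs."""
    return np.dot(gabor_even, signal) ** 2 + np.dot(gabor_odd, signal) ** 2

# A narrow-band test signal and a slightly shifted copy of it.
signal = np.cos(2 * np.pi * freq * t + 0.3) * np.exp(-0.5 * ((t - center) / 20.0) ** 2)
shifted = np.roll(signal, 3)

print(np.dot(gabor_even, signal), np.dot(gabor_even, shifted))  # changes with the shift
print(energy(signal), energy(shifted))                          # approximately unchanged
```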

Shortly after energy and cross-correlation models emerged, some attention turned to learning invariances with higher-order neural networks, which are neural networks trained on polynomial basis expansions of their inputs [11]. Higher-order neural networks can be composed of units that compute sums of products. These units are sometimes referred to as “Sigma-Pi units” [40] (where “Pi” stands for product and “Sigma” for sum). At about the same time, [42] discussed how multiplicative interactions make it possible to build distributed representations of symbolic data.

In 1995, Kohonen introduced the “Adaptive Subspace Self-Organizing Map” (ASSOM) [26], which computes sums over squared filter responses to represent data. Like the energy model, the ASSOM is based on the idea that the sum of squared responses is invariant to various properties of its inputs. In contrast to the early energy models, the ASSOM is trained from data. Inspired by the ASSOM, [23] introduced “Independent Subspace Analysis” (ISA), which puts the same idea into the context of more conventional sparse coding models. Extensions of this work are topographic ICA [23] and [50], where sums are computed not over separate but over shared groups of squared filter responses.

In a parallel line of work, bi-linear models were used as an approach to learning in the presence of multiplicative interactions [45]. This early work on bi-linear models used them as global models trained on whole images rather than using local receptive fields. In contrast to more recent approaches to learning with multiplicative interactions, training typically involved filling a two-dimensional grid with data that shows two types of variability (sometimes called “style” and “content”). The purpose of bi-linear models is then to untangle the two degrees of freedom in the data. More recent work does not make this distinction, and the purpose of multiplicative hidden variables is merely to capture the multiple ways in which two images can be related.

[13], [36] and [30], for example, show how multiplicative interactions make it possible to model the multitude of relationships between frames in natural videos. [30] also show how they allow us to model more general classes of relations between images. An earlier multiplicative interaction model, which is also related to bi-linear models, is the “routing circuit” [35].

Multiplicative interactions have also been used to model structure within static images, which can be thought of as modeling higher-order relations, and, in particular, pair-wise products, between pixel intensities (for example, [25, 23, 49, 21, 38, 6, 29]).

Recently, [32] showed how multiplicative interactions between a class label and a feature vector can be viewed as an invariant classifier, where each class is represented by a manifold of allowable transformations. This work may be viewed as a modern version of the model that introduced the term mapping units in 1981 [18]. The main difference between 2011 and 1981 is that models are now trained from large datasets.

2 Learning to relate images

2.1 Feature learning

We briefly review standard feature learning models in this section and we discuss relational feature learning in Section 2.2. We discuss extensions of relational models and how they relate to complex cells and to energy models in Section 3.

Practically all standard feature learning models can be represented by a graphical model like the one shown in Figure 2 (a). The model is a bi-partite network that connects a set of unobserved, latent variables z = (z_1, …, z_K) with a set of observable variables (for example, pixels) x = (x_1, …, x_N). The weights w_ik, which connect pixel x_i with hidden unit z_k, are learned from a set of training images x^α. The vector of latent variables in Figure 2 (a) is considered to be unobserved, so one has to infer it, separately for each training case, along with the model parameters during training. The graphical model shown in the figure only represents how the dependencies between the components of x and z are parameterized; it does not define a model or learning algorithm. A large variety of models and learning algorithms can be parameterized as in the figure, including principal components, mixture models, k-means clustering, and restricted Boltzmann machines [16]. Each of these can in principle be used as a feature learning method (see, for example, [5] for a recent quantitative comparison).

For the hidden variables to extract useful structure from the images, their capacity needs to be constrained. The simplest form of constraint is to let the dimensionality K of z be smaller than the dimensionality N of the images. Learning in this case amounts to performing dimensionality reduction. It has become obvious recently that it is more useful in most applications to use an over-complete representation, that is, K > N, and to constrain the capacity of the latent variables instead by forcing the hidden unit activities to be sparse. In Figure 2, and in what follows, z is to be understood as capacity-constrained, but it should be kept in mind that capacity can be (and often is) constrained in ways other than sparsity. The most common operations in the model, after training, are “inference” (or “analysis”): given an image x, compute z; and “generation” (or “synthesis”): invent a latent vector z, then compute x.

Figure 2: (a) Sparse coding graphical model. (b) Auto-encoder network.

A simple way to train a model, given a set of training images x^α, is by minimizing reconstruction error combined with a sparsity-encouraging term for the hidden variables (for example, [34]):

$$ \sum_\alpha \left\| x^\alpha - W z^\alpha \right\|^2 + \lambda \sum_\alpha \left\| z^\alpha \right\|_1 \qquad (1) $$

Optimization is with respect to both W and all z^α. To this end, it is common to alternate between optimizing W and optimizing all z^α. After training, inference amounts to minimizing the same expression with respect to z for test images (with W fixed).

To avoid iterative optimization during inference, one can eliminate z by defining it implicitly as a function of x. A common choice is z = σ(Ax), where A is a matrix and σ is a squashing non-linearity, such as the sigmoid, which confines the values of z to a fixed interval. This model is the well-known auto-encoder (for example, [47]), and it is depicted in Figure 2 (b). Learning amounts to minimizing reconstruction error with respect to both A and W. In practice, it is common to tie the weights by enforcing A = Wᵀ, in order to reduce the number of parameters and for consistency with other sparse coding models.

One can add a penalty term that encourages sparsity of the latent variables. Alternatively, one can train auto-encoders such that they de-noise corrupted versions of their inputs, which can be achieved by simply feeding in corrupted inputs during training (but measuring reconstruction error with respect to the original data). This turns auto-encoders into “de-noising auto-encoders” [47], which show properties similar to other sparse coding methods, but inference, like in a standard auto-encoder, is a simple feed-forward mapping.
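As an illustration of the feed-forward inference and gradient-based training described above, here is a minimal numpy sketch of a tied-weight de-noising auto-encoder with mask-out corruption. It is our own simplification (plain per-sample gradient descent, no biases, arbitrary hyper-parameters), not the procedure used in any of the cited papers.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def train_dae(X, K, noise=0.3, lr=0.01, epochs=50):
    """Tied-weight de-noising auto-encoder sketch.
    X: training patches, one per row; K: number of hidden units."""
    N = X.shape[1]
    W = 0.01 * rng.standard_normal((N, K))      # tied weights: encoder W.T, decoder W
    for _ in range(epochs):
        for x in X:
            x_tilde = x * (rng.random(N) > noise)   # mask-out corruption
            z = sigmoid(W.T @ x_tilde)              # inference: z = sigma(W.T x_tilde)
            x_hat = W @ z                           # linear reconstruction
            r = x_hat - x                           # error w.r.t. the *clean* input
            # Gradient of ||x_hat - x||^2 for tied weights (factor 2 absorbed in lr):
            # decoder term + encoder term through the sigmoid.
            grad_W = np.outer(r, z) + np.outer(x_tilde, (W.T @ r) * z * (1 - z))
            W -= lr * grad_W
    return W
```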

A technique similar to the auto-encoder is the restricted Boltzmann machine (RBM). RBMs define the joint probability distribution

$$ p(x, z) = \frac{1}{Z} \exp\left( x^{\mathrm{T}} W z \right), \qquad Z = \sum_{x, z} \exp\left( x^{\mathrm{T}} W z \right) \qquad (2) $$

from which one can derive

$$ p(z_k = 1 \mid x) = \sigma\!\left( \sum_i w_{ik} x_i \right) \qquad (3) $$

showing that inference, again, amounts to a linear mapping plus a non-linearity. Learning amounts to maximizing the average log-probability of the training data. Since the derivatives with respect to the parameters are not tractable (due to the normalizing constant Z in Eq. 2), it is common to use Gibbs sampling to approximate them. This leads to a Hebbian-like learning rule known as contrastive divergence training [16].
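For concreteness, one contrastive-divergence update (CD-1) for a binary RBM without bias terms might look as follows. This is a sketch under our own simplifying assumptions, not code from the cited work.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

def cd1_step(W, x, lr=0.01):
    """One CD-1 update of the weight matrix W for a single binary data vector x."""
    # Positive phase: infer hiddens from the data.
    pz = sigmoid(W.T @ x)                           # p(z_k = 1 | x)
    z = (rng.random(pz.shape) < pz).astype(float)
    # Negative phase: one step of Gibbs sampling.
    px = sigmoid(W @ z)                             # p(x_i = 1 | z)
    x_neg = (rng.random(px.shape) < px).astype(float)
    pz_neg = sigmoid(W.T @ x_neg)
    # Hebbian-like rule: data correlations minus model ("fantasy") correlations.
    W += lr * (np.outer(x, pz) - np.outer(x_neg, pz_neg))
    return W
```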

Another common sparse coding method is independent components analysis (ICA) (for example, [22]). One way to train an ICA model that is complete (that is, where z has the same dimensionality as x) is by encouraging latent responses to be sparse, while preventing the weights from becoming degenerate [22]:

$$ \min_W \; \sum_\alpha \left\| W^{\mathrm{T}} x^\alpha \right\|_1 \qquad (4) $$
$$ \text{subject to} \quad W W^{\mathrm{T}} = I \qquad (5) $$

Enforcing the orthogonality constraint can be inefficient in practice, since it requires an eigendecomposition.

For most feature learning models, inference and generation are variations of the two linear mappings

$$ z = W^{\mathrm{T}} x \qquad (6) $$
$$ x = W z \qquad (7) $$

The columns of W are typically referred to as “features” or “filters” (although a more appropriate term would be “basis functions”; we shall use these terms interchangeably). Practically all methods yield Gabor-like features when trained on natural images. An advantage of non-linear models, such as RBMs and auto-encoders, is that stacking them makes it possible to learn feature hierarchies (“deep learning”) [17].

In practice, it is common to add bias terms, so that inference and generation (Eqs. 6 and 7) are affine rather than linear functions, for example, z = σ(Wᵀx + b) for some bias vector b. We shall refrain from adding bias terms to avoid clutter, noting that, alternatively, one may think of x and z as being given in “homogeneous” coordinates, containing an extra, constant 1-dimension.

Feature learning is typically performed on small image patches rather than on whole images. One reason for this is that training and inference can be computationally demanding. More importantly, local features make it possible to deal with images of different sizes, and to deal with occlusions and local object variations. Given a trained model, two common ways to perform invariant recognition on test images are:

“Bag-of-features”: Crop patches around interest points (such as SIFT or Harris corners), compute the latent representation z for each patch, collapse (add up) all representations to obtain a single vector, and classify using a standard classifier. There are several variations of this scheme, including using an extra clustering step before collapsing features, or using a histogram similarity in place of Euclidean distance for the collapsed representation.

“Convolutional”: Crop patches from the image along a regular grid; compute z for each patch; concatenate all descriptors into one very large vector; classify using a standard classifier. One can also use combinations of the two schemes (see, for example, [5]).

Local features yield highly competitive performance in object recognition tasks (for example, [5]). In the next section we discuss recent approaches to extending feature learning to encode relations between, as opposed to content within, images.

2.2 Encoding relations

We now consider the task of learning relations between two images x and y, as illustrated in Figure 3 (face images taken from the database described in [46]), and we discuss the role of multiplicative interactions in learning relations.







2.2.1 The need for multiplicative interactions

A naive approach to modeling relations between two images would be to perform sparse coding on their concatenation. A hidden unit in such a model would receive as input the sum of two projections, one from each image. To detect a particular transformation, the two receptive fields would need to be defined such that one receptive field is the other, modified by the transformation that the hidden unit is supposed to detect. The net input that the hidden unit receives will then tend to be high for image pairs showing the transformation. However, the net input will depend equally on the images themselves. The reason is that hidden variables are akin to logical “OR” gates, which accumulate evidence (see, for example, [51] for a discussion).

It is straightforward to build a content-independent detector if we allow for multiplicative interactions between the variables. In particular, consider the outer product between two one-dimensional, binary images, as shown in Figure 4. Every component of this matrix constitutes evidence for exactly one type of transformation (translation, in the example). The components act like AND gates that detect coincidences. Since a component of the outer product is equal to 1 only when both corresponding pixels are equal to 1, a hidden unit that pools over multiple components (Figure 4 (c)) is much less likely to receive spurious activity that depends on the image content rather than on the transformation. Note that pooling over the components of the outer product amounts to computing the correlation of the output image with a transformed version of the input image. The same is true for real-valued data.
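The following toy numpy snippet (ours) illustrates the AND-gate argument for one-dimensional binary “images”: pooling the outer-product entries along the k-th (cyclic) diagonal yields a detector for “shift by k pixels” whose response peaks at the true shift for almost any random content.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 16

def shift_evidence(x, y, k):
    """Pool the outer-product entries that constitute evidence for a shift by k."""
    outer = np.outer(x, y)                               # outer[i, j] = x_i * y_j
    return sum(outer[i, (i + k) % n] for i in range(n))  # entries with j = i + k

x = (rng.random(n) > 0.5).astype(float)   # random binary "image"
y = np.roll(x, 3)                         # the same image, shifted by 3 pixels

print([shift_evidence(x, y, k) for k in range(6)])
# The value at k = 3 is the largest, regardless of the particular random content x.
```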


Based on these observations, a variety of sparse coding models were suggested which encode transformations (for example, [36, 13, 30]). The number of parameters is typically equal to the number of hidden variables times the number of input pixels times the number of output pixels. It is instructive to think of the parameters as populating a three-way tensor W with components w_ijk.

Figure 5 shows two alternative illustrations of this type of model (adapted from [30]). Sub-figure (a) shows that each hidden variable z_k can blend in a slice of the parameter tensor. Each slice is a matrix connecting each input pixel to each output pixel. We can think of this matrix as performing linear regression in the space of stacked gray-value intensities, commonly known as a “warp”. Thus, the model as a whole can be thought of as defining a factorial mixture of warps.

Alternatively, each input pixel x_i can be thought of as blending in a slice of the parameter tensor. Thus, we can think of the model as a standard sparse coding model on the output image (Figure 5 (b)), whose parameters are modulated by the input image. This turns the model into a predictive or conditional sparse coding model [36, 30]. In both cases, hidden variables take on the role of dynamic mapping units [18, 48], which encode the relationship, not the content, of the images. Each unit in the model can gate connections between other variables in the model. We shall refer to this type of model as “gated sparse coding”, or synonymously as a “cross-correlation model”.

Figure 5: Relating images using multiplicative interactions. Two equivalent views of the same type of model.

As in a standard sparse coding model, one needs to include biases in practice. The set of model parameters thus consists of the three-way parameters w_ijk, as well as of single-node bias parameters for x, y and z. One could also include “higher-order biases” [30], which connect two groups of variables (for example, a pairwise term between x and y), but it is not common to do so. Like before, we shall drop all bias terms in what follows in order to avoid clutter. Both simple biases and higher-order biases can be implemented by adding constant-1 dimensions to the data and to the hidden variables.

2.3 Inference

The graphical model of gated sparse coding models is tri-partite, whereas that of a standard sparse coding model is bi-partite. Inference can be performed in almost the same way as in a standard sparse coding model, whenever two out of the three groups of variables have been observed.

Consider, for example, the task of inferring z, given x and y (see Figure 6 (a)). Recall that for a standard sparse coding model we have z = Wᵀy (up to component-wise non-linearities). It is instructive to think of the gated sparse coding model as turning the weights into a function of x. If that function is linear, with components of the form \sum_i w_{ijk} x_i, we get:

$$ z_k = \sum_j \Big( \sum_i w_{ijk} x_i \Big) y_j = \sum_{ij} w_{ijk} \, x_i \, y_j \qquad (8) $$

which is exactly of the form discussed in the previous section.

Eq. 8 shows that inference amounts to computing, for each hidden unit z_k, a quadratic form in x and y defined by a slice of the weight tensor. Considering either x or y as fixed, one can also think of inference as a simple linear function, like in a standard sparse coding model. This property is typical of models with bi-linear dependencies [45]. Despite the similarity to a standard sparse coding model, the meaning of inference differs from standard sparse coding: the meaning of z, here, is the transformation that takes x to y (or vice versa).

Inferring y, given an image x and a transformation z (Figure 6 (b)), yields the analogous expression

$$ y_j = \sum_{ik} w_{ijk} \, x_i \, z_k \qquad (9) $$

so inference is again a quadratic form. The meaning of y is now “x transformed according to the known transformation z”.

For the analysis in Section 4 it is useful to note that, when z is given, y is a linear function of x (cf. Eq. 9), so it can be written

$$ y = L(z) \, x \qquad (10) $$

for some matrix L(z), which itself is a function of z. Commonly, x and y represent vectorized images, so that the linear function is a warp. Note that the representation of the linear function is factorial: the hidden variables make it possible to compose a warp additively from constituent components, much like a factorial sparse coding model (in contrast to a genuine mixture model) makes it possible to compose an image from independent components.
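The three inference expressions are easy to state with einsum. The following sketch (ours, with a random parameter tensor, arbitrary sizes, and non-linearities and biases omitted) computes Eqs. 8 and 9 and checks that the prediction of Eq. 9 is indeed the linear warp of Eq. 10:

```python
import numpy as np

rng = np.random.default_rng(0)
I, J, K = 64, 64, 100                        # input pixels, output pixels, hidden units
W = rng.standard_normal((I, J, K))           # three-way parameter tensor w_ijk
x = rng.standard_normal(I)
y = rng.standard_normal(J)
z_code = rng.standard_normal(K)              # a given transformation code

# Eq. 8: infer the transformation code from an image pair.
z_inferred = np.einsum('ijk,i,j->k', W, x, y)

# Eq. 9: infer the output image from an input image and a transformation code.
y_inferred = np.einsum('ijk,i,k->j', W, x, z_code)

# Eq. 10: for fixed z, the mapping x -> y is the linear warp L(z).
L_of_z = np.einsum('ijk,k->ij', W, z_code)           # an I x J matrix
assert np.allclose(y_inferred, L_of_z.T @ x)         # same prediction
```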

Figure 6: Inferring any one group of variables, given the other two, is like inference in a standard sparse coding model. Blue shading represents conditioning.

As in a standard sparse coding model, it can be useful in some applications to assign a number to an input, quantifying how well it is represented by the model. For this number to be useful, it has to be “calibrated”, which is typically achieved by using a probabilistic model. In contrast to a simple sparse coding model, training a probabilistic gated sparse coding model can be slightly more complicated, because of the conditional dependencies among the three groups of variables. We discuss this issue in detail in the next section.

2.4 Learning

Training data for a gated sparse coding model consists of pairs of images (x^α, y^α). Training is similar to standard sparse coding, but there are some important differences. In particular, note that the gated model is like a sparse coding model whose input is the vectorized outer product xyᵀ (cf. Section 2.2), so that standard learning criteria, such as squared error, are obviously not appropriate.






2.4.1 Predictive training

One way to train the model is to utilize the view of it as predictive sparse coding (Figure 6 (b)) and to train it conditionally, by predicting y given x [13], [36], [30].

Recall that we can think of the input x as modulating the parameters. This modulation is case-dependent, so learning can be viewed as “sparse coding with case-dependent weights”. The cost that data case α contributes is

$$ \sum_j \Big( y_j^\alpha - \sum_{ik} w_{ijk} \, x_i^\alpha z_k^\alpha \Big)^2 + \lambda \left\| z^\alpha \right\|_1 \qquad (11) $$

Differentiating with respect to the parameters is the same as in a standard sparse coding model; in particular, the model is still linear with respect to the parameters. Predictive learning is therefore possible with gradient-based optimization, similar to standard feature learning (cf. Section 2.1).
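Because the model is linear in the parameters, the gradient of the reconstruction term of Eq. 11 for a single case is simply an outer product of the input, the residual and the code. A sketch (ours; the sparsity term is left out and the factor of 2 is absorbed into the learning rate):

```python
import numpy as np

def predictive_grad(W, x, y, z):
    """Gradient of the squared prediction error of Eq. 11 w.r.t. the tensor W,
    for one training case; z is assumed given, e.g. from an inference step."""
    y_hat = np.einsum('ijk,i,k->j', W, x, z)     # model prediction (Eq. 9)
    r = y_hat - y                                # residual
    return np.einsum('i,j,k->ijk', x, r, z)      # dL/dw_ijk proportional to x_i * r_j * z_k

# One gradient-descent step on W for a single case:
# W -= lr * predictive_grad(W, x, y, z)
```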

To avoid iterative inference, it is possible to adapt various sparse coding variants, like auto-encoders and RBMs (Section 2.1), to the conditional case. As an example, we obtain a “gated Boltzmann machine” (GBM) by changing the energy function into the three-way energy [30]

$$ E(y, z; x) = - \sum_{ijk} w_{ijk} \, x_i \, y_j \, z_k \qquad (12) $$

and exponentiating and normalizing:

$$ p(y, z \mid x) = \frac{1}{Z(x)} \exp\big( -E(y, z; x) \big), \qquad Z(x) = \sum_{y, z} \exp\big( -E(y, z; x) \big) \qquad (13) $$

Note that the normalization is over y and z only, which is consistent with our goal of defining a predictive model. It is possible to define a joint model, but this makes training more difficult (cf. Section 2.4.2). As in a standard RBM, training involves sampling the unobserved variables. In the relational RBM, samples are drawn from the conditional distributions p(y | z, x) and p(z | y, x).

As another example, we can turn an auto-encoder into a relational auto-encoder by defining the encoder and decoder parameters as linear functions of x ([28], [29]). Learning is then essentially the same as in a standard auto-encoder modeling y. In particular, the model is still a directed acyclic graph, so one can use simple back-propagation to train it. See Figure 7 for an illustration.

Figure 7: (a) Relational auto-encoder. (b) Toy data commonly used to test relational models. There is no structure in the images, only in their relationship.
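A forward pass of such a relational auto-encoder can be sketched as follows. This is our own illustration using the full, unfactored parameter tensor and separate encoder/decoder tensors; biases are omitted, and in practice one would factorize the parameters as described in Section 3.1. The loss can be back-propagated by hand, as in the gradient sketch above, or with any automatic-differentiation tool.

```python
import numpy as np

sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

def relational_ae_forward(W_enc, W_dec, x, y):
    """Encoder and decoder weights are both modulated by the input image x."""
    z = sigmoid(np.einsum('ijk,i,j->k', W_enc, x, y))   # mapping code (cf. Eq. 8)
    y_hat = np.einsum('ijk,i,k->j', W_dec, x, z)        # reconstruction of y (cf. Eq. 9)
    loss = np.sum((y_hat - y) ** 2)
    return z, y_hat, loss
```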

2.4.2 Symmetric training

In probabilistic terms, predictive training amounts to modeling the conditional distribution p(y | x). [43] show how modeling instead the joint distribution p(x, y) can make it possible to perform image matching, by allowing us to quantify how compatible any two images are under the trained model.

Formally, modeling the joint amounts simply to changing the normalization constant of the three-way RBM to a sum over x, y and z (cf. the previous section). Learning is more complicated, however, because the simplifying view of case-based modulation no longer holds. [43] suggest using three-way Gibbs sampling to train the model.


As an alternative to modeling a joint probability distribution, [29] show how one can instead use a relational auto-encoder trained symmetrically on the sum of the two predictive objectives,

$$ L = L(y \mid x) + L(x \mid y) \qquad (14) $$

where each term is a predictive reconstruction cost of the form in Eq. 11, with the roles of x and y exchanged. This forces the parameters to be able to transform in both directions, and it can give performance similar to symmetrically trained, fully probabilistic models [29]. Like an auto-encoder, this model can be trained with gradient-based optimization.
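A self-contained sketch of the symmetric objective (ours) is shown below; it assumes that x and y have the same dimensionality, so that the same parameter tensors can be applied in both directions with the roles of the two images swapped.

```python
import numpy as np

def predict(W_enc, W_dec, a, b):
    """Predict image b from image a through the inferred mapping code."""
    z = 1.0 / (1.0 + np.exp(-np.einsum('ijk,i,j->k', W_enc, a, b)))
    return np.einsum('ijk,i,k->j', W_dec, a, z)

def symmetric_loss(W_enc, W_dec, x, y):
    """Sum of the two predictive reconstruction errors (cf. Eq. 14)."""
    return np.sum((predict(W_enc, W_dec, x, y) - y) ** 2) \
         + np.sum((predict(W_enc, W_dec, y, x) - x) ** 2)
```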


2.4.3 Learning higher-order within-image structure

Another reason for learning the joint distribution is that it allows us to model higher-order within-image structure (for example, [25, 39, 23]).

[39] apply a GBM to the task of modeling second-order within-image features, that is, features that encode pair-wise products of pixel intensities. They show that this can be achieved by optimizing the joint GBM distribution and using the same image as both input x and output y. In contrast to [43], [39] suggest hybrid Monte Carlo to train the joint model.


One can also combine higher-order models with standard sparse coding models, by using some hidden units to model higher-order structure and some to learn linear codes [38, 29].




2.4.4 Toy example: Motion extraction and analogy making

Figure 8 (a) shows a toy example of a gated Boltzmann machine applied to translations. The model was trained on image pairs showing i.i.d. random dots, where the output image is a copy of the input image shifted in a random direction. The center column in both plots in Figure 8 visualizes the inferred transformation as a vector field. The vector field was produced by (i) inferring the transformation given the image pair (Eq. 8), (ii) computing the transformation from the inferred hiddens, and (iii) finding for each input pixel the output position it is most strongly connected to [30]. The two right-most columns in both plots show how the inferred transformation can be applied to new images by analogy, that is, by computing the output image given a new input image and the inferred transformation (Eq. 9). Figure 8 (b) shows an example where the transformations are split-screen translations, that is, translations which are independent in the top half vs. the bottom half of the image. This illustrates how the model has to decompose transformations into factorial constituent transformations.


Figure 8: Inferring motion direction from test data. (a) Coherent motion across the whole image. (b) “Factorial motion” that is independent in different image regions. In both plots, the meaning of the five columns is as follows (left to right): random test images x, random test images y, inferred flow field, new test image x, inferred output y.

3 Factorization and energy models

In the following, we discuss the close relationship between gated sparse coding models and energy models. To this end, we first describe how parameter factorization makes it possible to pre-process input images and thereby reduce the number of parameters.





3.1 Factorizing the gating parameters

The number of gating parameters is roughly cubic in the number of pixels, if we assume that the number of constituent transformations is about the same as the number of pixels. It can easily be more for highly over-complete hiddens. [31] suggest reducing that number by factorizing the parameter tensor into three matrices, such that each component is given by the “three-way inner product”

$$ w_{ijk} = \sum_{f=1}^{F} w^x_{if} \, w^y_{jf} \, w^z_{kf} \qquad (15) $$

Here, F is the number of hidden “factors”, which, like the number of hidden units, has to be chosen by hand or by cross-validation. The matrices W^x, W^y and W^z are of size (number of input pixels) × F, (number of output pixels) × F and (number of hidden units) × F, respectively.

An illustration of this factorization is given in Figure 9 (a). It is interesting to note that, under this factorization, the activity of output variable y_j, by using the distributive law, can be written

$$ y_j = \sum_f w^y_{jf} \Big( \sum_i w^x_{if} x_i \Big) \Big( \sum_k w^z_{kf} z_k \Big) \qquad (16) $$

Similarly, for z_k we have

$$ z_k = \sum_f w^z_{kf} \Big( \sum_i w^x_{if} x_i \Big) \Big( \sum_j w^y_{jf} y_j \Big) \qquad (17) $$

One can obtain a similar expression for the energy in a gated Boltzmann machine. Eq. 17 shows that factorization can be viewed as filter matching: for inference, each group of variables x, y and z is projected onto linear basis functions, which are subsequently multiplied, as illustrated in Figure 9 (b).
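In code, factored inference never forms the full tensor: it projects x, y (or x, z) onto their filter banks, multiplies the responses factor-wise, and pools with the third matrix. A sketch (ours), with non-linearities and biases omitted:

```python
import numpy as np

# Wx: N x F, Wy: M x F, Wz: K x F, where N, M, K are the numbers of input
# pixels, output pixels and hidden units, and F is the number of factors.

def infer_z(Wx, Wy, Wz, x, y):
    fx = Wx.T @ x            # factor responses of the input image
    fy = Wy.T @ y            # factor responses of the output image
    return Wz @ (fx * fy)    # Eq. 17: multiply filter responses, then pool

def infer_y(Wx, Wy, Wz, x, z):
    fx = Wx.T @ x
    fz = Wz.T @ z
    return Wy @ (fx * fz)    # Eq. 16

# The factorization w_ijk = sum_f Wx[i,f] * Wy[j,f] * Wz[k,f]  (Eq. 15)
# reduces the parameter count from N*M*K to (N + M + K) * F.
```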

Figure 10: Input filters learned from various types of transformation. Top-left: Translation, Top-right: Rotation, Bottom-left: split-screen translation, Bottom-right: Natural videos. See figure 11 on the next page for corresponding output filters.
Figure 11: Output filters learned from various types of transformation. Top-left: Translation, Top-right: Rotation, Bottom-left: split-screen translation, Bottom-right: Natural videos. See figure 10 on the previous page for corresponding input filters.
Figure 3: Learning to encode relations: we consider the task of learning latent variables that encode the relationship between images x and y, independently of their content.
Figure 1: Symbolic representation of a mapping unit [18]. The triangle symbolizes multiplicative interactions between the three variables x, y and z. The value of any one of the three variables is a function of the product of the others.
