Convolutional neural networks (CNNs) have achieved remarkable performance on vision problems such as image classification [34, 57, 60] or object detection and localization [20, 55, 73]. Beyond impressive results, they have an unmatched resilience to dataset bias. It is now well known that a network trained to solve a task on a certain dataset (e.g. object recognition on ImageNet) can be easily fine-tuned to a related problem on another dataset (e.g. object detection on MS-COCO). Less studied is robustness to task bias, i.e. generalization across tasks. In this work, we consider an important class of such problems, where a classifier trained on a set of semantics is transferred to a second set of semantics, which are loose combinations of the original ones. We consider the particular case where original semantics are object classes and target semantics are scene classes that somehow depend on those objects.
Task transfer has been a topic of significant interest in computer vision. Prominent examples of cross-task transfer include object detectors learned from object recognition models [20, 23], object recognizers based on attribute detectors [36, 2] and complex activity recognition methods based on attribute detection [44, 40] or object recognition [27, 28]. Our particular interest in object-to-scene transfer stems from the complex relation between the two domains. A scene can be described as a collection of multiple objects occurring in an unpredictable layout. Localizing the scene semantics is already a difficult task. This is compounded by the difficulty of mapping localized semantics into a holistic scene representation. The problem of knowledge transfer from object to scene recognizers is therefore very challenging.
One might argue that, instead of using transfer, a scene classifier CNN can be trained directly from a large dataset of scene images. This approach has two major limitations. First, it does not leverage all the work already devoted to object recognition in the literature. Both datasets and models have to be designed from scratch, which is time-consuming. Second, the “directly learned” CNN does not necessarily model relations between holistic scene descriptions and scene objects. This can degrade classification performance. We consider instead the prediction of holistic scene tags from the scores produced by an object CNN classifier. Since it leverages available object recognition CNNs, this type of transfer is more efficient in terms of data collection and training. We show that it can also produce better scene classification results. This is because a scene classifier can leverage the recognition of certain types of rocks, tree stumps, or lizard species to distinguish between “Arizona Desert” and “Joshua Tree National Park”. A holistically trained CNN can have difficulty homing in on these objects as discriminators between the two classes.
The proposed object-to-scene transfer is based on the bag of semantics (BoS) representation. It derives a scene representation by scoring a set of image patches with a pre-trained object classifier. The probabilities of different objects are the scene semantics and the set of probability vectors the BoS. A holistic scene classifier is then applied to the BoS, to transfer knowledge from objects to scenes. Several authors have argued for semantic image representations in vision [66, 52, 59, 35, 37, 4, 39]. They have been used to describe objects by their attributes, represent scenes as collections of objects [39, 32] and capture contextual relations between classes. For tasks such as hashing or large scale retrieval, a global semantic descriptor is usually preferred [63, 5]. Works on zero-shot object-based scene representation also use a global semantic image descriptor, mainly because the object-to-scene transfer functions used in these problems require dimensions of the descriptor to be interpretable as object scores. Proposals for scene classification, on the other hand, tend to rely on the BoS [59, 35, 39]. However, while the BoS outperforms low-level features in low dimensions, it has been less effective for high dimensional descriptors such as the Fisher vector (FV). There are two problems: region semantics can be noisy, and it is hard to map a bag of probability vectors into a high dimensional scene representation, such as a FV.
In this work, we rely on object recognition CNNs to overcome the first problem. We obtain a BoS by using these networks to extract semantic descriptors (object class posterior probability vectors) from local image patches. We then extend the FV to this BoS. This semantic Fisher vector amounts to a large set of non-linear pooling operators that act on high-dimensional probability vectors. We show that, unlike the case of low-level features, this extension cannot be implemented by the classical Gaussian mixture model FV (GMM-FV). We simplify the derivation of FVs for other models, by linking the FV of any probabilistic model to the Q-function of the expectation-maximization (EM) algorithm used to estimate its parameters. It is shown that the FV can be trivially computed as a combination of the E and M steps of EM. This link also enables the leveraging of efficient EM implementations to compute FVs. It is, however, shown that even a more natural distribution for probability vectors, the Dirichlet mixture model (DMM), fails to generate an effective FV for the BoS.
We hypothesize that this is due to the non-Euclidean nature of the probability simplex, which makes the modeling of probability distributions quite complex. Since the FV is always defined with respect to a reference probability distribution, this hurts classification performance. For the GMM-FV, the problem is that the assumption of a locally Euclidean geometry is not well suited for image semantics, which are defined on the simplex. For the DMM-FV, the problem is a lack of explicit modeling of second order statistics of these semantics. Nevertheless, an analysis of the DMM-FV reveals a non-linear log embedding that maps a multinomial distribution to its natural parameter space, where the Euclidean assumption is effective. This suggests using a GMM model on the natural parameter space of the semantic multinomial (SMN), leading to the logSMN-FV. In fact, because the multinomial has several natural-space parametrizations, we seek the one best suited for CNN semantics. This turns out to be the inverse of the softmax implemented at the network output. Since the CNN is optimized for these semantics, this parametrization has the benefits of end-to-end training. It is shown that a GMM-FV of the pre-softmax CNN outputs significantly outperforms the GMM-FV and the DMM-FV.
While these results show an advantage for modeling second order statistics, the use of a GMM of diagonal covariances limits the ability of the GMM-FV to approximate the non-linear manifold of CNN natural parameter features. For this, we resort to a richer generative model, the mixture of factor analyzers (MFA) [18, 65], which locally approximates the natural-space BoS manifold by a set of low-dimensional linear subspaces, derived from covariance information. We derive the MFA Fisher score (MFA-FS) and corresponding MFA-FV and show that the covariance statistics captured by these descriptors are highly discriminant for CNN semantics, significantly outperforming the GMM-FV. To allow end-to-end training, the MFA-FS is finally implemented as a neural network layer. The resulting MFAFSNet is an object-to-scene transfer network that can be fine-tuned for scene classification by backpropagation. This further improves scene classification performance.
Experiments on the SUN and MIT Indoor datasets show that the MFA representations (MFA-FS and MFAFSNet) outperform scene classifiers based on lower level CNN features [21, 8, 45, 46], alternative approaches for second order pooling of CNN semantics [43, 16], and even CNNs learned directly from scene datasets [73, 30, 22]. This is surprising, since the MFA representations perform task transfer, applying object recognition CNNs to scene classification, and require little scene training data. This is unlike direct CNN training, which requires a much larger scene dataset, such as Places. Furthermore, the two representations are complementary: the combination of the MFA-FS and the scene CNN significantly outperforms either method in isolation. The combined classifier has state-of-the-art scene classification performance, achieving sizable improvements over all previous approaches.
2 Bag of Semantics Classification
We start by reviewing the foundations of BoS classification.
2.1 Prior work
Figure 1 presents the architecture of the BoS classifier. Given an image, indexed by spatial location, it defines an initial mapping into a set of retinotopic feature maps. These preserve the spatial topology of the image and encode local visual information. They have been implemented with handcrafted descriptors, such as SIFT or HoG, or with the convolutional layers of a CNN. The next stage is a second retinotopic mapping into the space of classifier outputs. The classifiers that define this mapping are pre-trained on an auxiliary set of semantic concepts, e.g. objects [42, 39] or themes [54, 53, 35], that occur locally within images. At each location, they map the descriptors extracted at that location into a semantic vector, whose entries are probabilities of occurrence of the individual semantic concepts. The image is thus transformed into a collection or “bag” of semantics. However, due to its retinotopic nature, a BoS is sensitive to variations in scene layout. If an object changes position in the field of view, the semantic feature map will change completely. To guarantee invariance, the BoS is embedded into a fixed-length non-retinotopic representation, using a non-linear mapping into a high dimensional feature space. This space must have a Euclidean structure that supports classification with linear decision boundaries.
Prior BoS scene classifiers [54, 53, 35, 42, 39] had limited success, for two reasons. First, scene semantics are non-trivial to localize. Scenes are collections of objects and stuff in diverse layouts. Detecting these entities can be challenging. Object detectors based on handcrafted features, such as SIFT or HoG, lacked discriminative power, producing mappings riddled with semantic noise. Second, it can be difficult to design an invariant scene descriptor (the embedding). The classical pooling of the bag of descriptors into a vector of statistics works well for low and mid-level features [10, 38, 70] but is far less effective for the BoS. Semantic features are class probabilities that inhabit a highly non-Euclidean simplex. Commonly used statistics, such as average or max pooling [35, 39], do not perform well in this space. Our experiments show that even sophisticated non-linear embeddings, such as FVs, can perform poorly.
The introduction of deep CNNs [34, 57, 60] has all but solved the problem of noisy semantics. These models learn highly discriminative and non-linear image mappings that are far superior to handcrafted features. Their top layers have been shown to be selective for semantics such as faces and object parts. As discussed in the following section, scoring the local regions of a scene with an object recognition CNN produces a robust BoS. It remains to design the embedding. This is discussed in the remainder of the paper.
2.2 CNN semantics
Given a vocabulary $\mathcal{V} = \{v_1, \ldots, v_S\}$ of $S$ semantic concepts, an image $I$ can be described as a bag of instances from these concepts, localized within image patches/regions. Defining an $S$-dimensional binary indicator vector $s_i$, such that $s_{ir} = 1$ and $s_{ik} = 0$, $k \neq r$, when the $i$th image patch $x_i$ depicts the semantic class $r$, the image can be represented as $I = \{s_1, \ldots, s_n\}$, where $n$ is the total number of patches. Assuming that $s_i$ is sampled from a multinomial distribution of parameter $\pi_i$, the log-likelihood of $I$ is
$$L = \log \prod_{i=1}^{n} \prod_{r=1}^{S} \pi_{ir}^{s_{ir}} = \sum_{i=1}^{n} \sum_{r=1}^{S} s_{ir} \log \pi_{ir}. \tag{1}$$
Since the semantic labels $s_i$ for image regions are unknown, it is common to rely instead on the expected log-likelihood
$$E[L] = \sum_{i=1}^{n} \sum_{r=1}^{S} E[s_{ir}] \log \pi_{ir}, \tag{2}$$
where $E[s_{ir}] = \pi_{ir}$ are the scene semantics for patch $i$, and (2) depends only on the multinomial parameters $\pi_i$. This parameter vector is denoted the semantic multinomial (SMN). SMNs are computed by applying a classifier, trained on the semantics of $\mathcal{V}$, to the image patches $x_i$, and using the resulting posterior class probabilities as $\pi_i$. This is illustrated in Figure 2, for a CNN classifier. Each patch is mapped into the probability simplex, denoted the semantic space in Figure 1. The image is finally represented by the SMN collection $\{\pi_1, \ldots, \pi_n\}$. This is the BoS.
[Figure: a) bedroom scene; b) “day bed”; c) “quilt, comforter”; d) “window screen”]
Throughout this work, we use ImageNet classes as the semantic vocabulary and object recognition CNNs to estimate the SMNs. For efficient BoS extraction, the CNN is implemented as a fully convolutional network, generating the BoS with a single forward pass per image. This requires changing fully connected layers into 1x1 convolutional layers. The receptive field of a fully convolutional CNN can be altered by resizing the input image. For example, for 512x512 images, the fully convolutional implementation extracts SMNs from 128x128 pixel patches 32 pixels apart. Figure 3 illustrates the high quality of the resulting semantics. Recognizers of the “bed”, “window” and “quilt” objects are highly active in the regions where these objects appear in a bedroom scene.
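As a rough illustration of the sampling arithmetic above, the sketch below enumerates the patch grid implied by 128x128 patches taken 32 pixels apart. It assumes no padding, so only patches that lie fully inside the image are counted; a real network with padded layers may produce a slightly different grid. The function name `smn_grid` and its parameters are hypothetical and are not taken from the original implementation.

```python
def smn_grid(image_size, patch_size=128, stride=32):
    """Return the top-left corners of all patches that fit inside a square image.

    Assumption: no padding, so a patch is kept only if it lies fully inside.
    """
    steps = (image_size - patch_size) // stride + 1
    offsets = [i * stride for i in range(steps)]
    return [(top, left) for top in offsets for left in offsets]


corners = smn_grid(512)
print(len(corners))   # 169 patches, i.e. a 13 x 13 grid of SMNs
print(corners[-1])    # (384, 384), top-left corner of the last patch
```

Under these assumptions a 512x512 image yields a 13x13 grid, so the BoS of one image contains 169 SMNs.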
3 Semantic Embedding
3.1 Fisher Vectors
Images are frequently represented by a bag of descriptors $\mathcal{D} = \{x_1, \ldots, x_n\}$ sampled independently from some generative model $p(x; \theta)$. An embedding is used to map this representation into a fixed-length vector suitable for classification. A popular mapping is the gradient (with respect to $\theta$) of the log-likelihood, $\nabla_\theta L(\theta) = \frac{\partial}{\partial \theta} \log p(\mathcal{D}; \theta)$, evaluated at a background model $\theta^b$. This is known as the Fisher score of $\theta$. This gradient vector is often normalized as $\mathcal{F}^{-1/2} \nabla_\theta L(\theta)$, where $\mathcal{F}^{1/2}$ is the square root of the Fisher information matrix $\mathcal{F}$ of $p(x; \theta)$. This is the FV of $\mathcal{D}$ [26, 49].
Since, for independent sampling, $\log p(\mathcal{D}; \theta)$ is a sum of the log-likelihoods $\log p(x_i; \theta)$, the FV is a vector of pooling operators, whose strength depends on the expressiveness of the generative model $p(x; \theta)$. The FV based on a large Gaussian mixture model (GMM) is known to be a strong descriptor of image context [48, 56]. However, for models like GMMs or hidden Markov models, the FV can have various implementations of very different complexity, and deriving an efficient implementation is not always easy. We next show that Fisher scores can be trivially obtained using a single step of the expectation-maximization (EM) algorithm commonly used to learn such models. This unifies the EM and FV computations, enabling the use of many efficient implementations previously uncovered in the EM literature to implement FVs.
3.2 Fisher Scores from EM
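Consider the log-likelihood $\log p(\mathcal{D}; \theta) = \log \int p(\mathcal{D}, z; \theta)\, dz$ of $\mathcal{D}$ under a latent-variable model of hidden variable $z$. Since the left-hand side is independent of the hidden variable, this can be written in an alternate form
$$\log p(\mathcal{D}; \theta) = \int q(z) \log p(\mathcal{D}, z; \theta)\, dz - \int q(z) \log q(z)\, dz + \int q(z) \log \frac{q(z)}{p(z | \mathcal{D}; \theta)}\, dz = Q(q; \theta) + H(q) + KL(q \,\|\, p; \theta), \tag{3}$$
where $Q(q; \theta)$ is the “Q” function of the EM algorithm, $q(z)$ a general probability distribution, $H(q)$ its differential entropy and $KL(q \,\|\, p; \theta)$ the Kullback-Leibler divergence between the posterior $p(z | \mathcal{D}; \theta)$ and $q(z)$. It follows that
$$\frac{\partial}{\partial \theta} \log p(\mathcal{D}; \theta) = \frac{\partial}{\partial \theta} Q(q; \theta) + \frac{\partial}{\partial \theta} KL(q \,\|\, p; \theta). \tag{4}$$
Each iteration of the EM algorithm chooses the distribution $q(z) = p(z | \mathcal{D}; \theta^b)$, where $\theta^b$ is a reference parameter vector (the parameter estimates from the previous iteration). In this case, $Q(q; \theta) = E_{z | \mathcal{D}; \theta^b}[\log p(\mathcal{D}, z; \theta)]$ and
$$\frac{\partial}{\partial \theta} KL(q \,\|\, p; \theta) \Big|_{\theta = \theta^b} = -\int \frac{p(z | \mathcal{D}; \theta^b)}{p(z | \mathcal{D}; \theta^b)} \frac{\partial}{\partial \theta} p(z | \mathcal{D}; \theta) \Big|_{\theta = \theta^b} dz = -\frac{\partial}{\partial \theta} \int p(z | \mathcal{D}; \theta)\, dz \Big|_{\theta = \theta^b} = 0.$$
It follows from (4) that
$$\frac{\partial}{\partial \theta} \log p(\mathcal{D}; \theta) \Big|_{\theta = \theta^b} = \frac{\partial}{\partial \theta} Q(p(z | \mathcal{D}; \theta^b); \theta) \Big|_{\theta = \theta^b}.$$
In summary, the Fisher score of background model $\theta^b$ is the gradient of the Q-function of EM evaluated at the reference model $\theta^b$. The computation of the Fisher score thus simplifies into the two steps of EM. First, the E-step computes the Q function at the reference $\theta^b$. Second, the M-step evaluates the gradient of Q with respect to $\theta$ at $\theta = \theta^b$. Since latent variable models are learned with EM, efficient implementations of these steps are usually already available in the literature, e.g. the Baum-Welch algorithm used to learn hidden Markov models. Hence, the connection to EM makes the derivation of the Fisher score trivial for most models of interest.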
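The identity between the Fisher score and the gradient of the Q function can be checked numerically on a toy model. The sketch below is only an illustration under stated assumptions: a one-dimensional two-component GMM with made-up parameters and data, scores taken with respect to the means only, and hypothetical helper names. It compares the EM-style score (posteriors from the E-step, held fixed while differentiating) with a finite-difference gradient of the log-likelihood.

```python
import math


def gauss(x, mu, sigma):
    """Density of a 1-D Gaussian with mean mu and standard deviation sigma."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))


def log_likelihood(data, weights, means, sigmas):
    """Log-likelihood of the data under the 1-D GMM."""
    return sum(
        math.log(sum(w * gauss(x, m, s) for w, m, s in zip(weights, means, sigmas)))
        for x in data
    )


def posteriors(x, weights, means, sigmas):
    """E-step: posterior probability of each component given x."""
    joint = [w * gauss(x, m, s) for w, m, s in zip(weights, means, sigmas)]
    total = sum(joint)
    return [j / total for j in joint]


def fisher_score_means(data, weights, means, sigmas):
    """Gradient of Q w.r.t. each mean, evaluated at the reference parameters."""
    score = [0.0] * len(means)
    for x in data:
        post = posteriors(x, weights, means, sigmas)
        for k, (m, s) in enumerate(zip(means, sigmas)):
            score[k] += post[k] * (x - m) / s ** 2
    return score


def numeric_gradient_means(data, weights, means, sigmas, eps=1e-6):
    """Central finite-difference gradient of the log-likelihood w.r.t. the means."""
    grad = []
    for k in range(len(means)):
        up, down = list(means), list(means)
        up[k] += eps
        down[k] -= eps
        grad.append(
            (log_likelihood(data, weights, up, sigmas)
             - log_likelihood(data, weights, down, sigmas)) / (2 * eps)
        )
    return grad


# Toy data and reference model (illustrative values only).
data = [-1.2, -0.7, 0.1, 0.9, 1.5, 2.3]
weights, means, sigmas = [0.4, 0.6], [-0.5, 1.0], [0.8, 1.2]
em_score = fisher_score_means(data, weights, means, sigmas)
fd_score = numeric_gradient_means(data, weights, means, sigmas)
print(em_score, fd_score)
```

The two vectors should agree up to finite-difference error, which is what the derivation predicts when the posteriors are computed at the same reference parameters used for the gradient.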
4 Semantic Fisher vectors
In this section, we discuss the encoding of the image BoS into semantic FVs.
4.1 Gaussian Mixture FVs
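The standard FV is derived from a GMM of diagonal covariance, here denoted the variance-GMM. Under this generative model, a mixture component $z_i$ is first sampled from a hidden variable $z$ of categorical distribution $p(z = k) = w_k$. A descriptor $x_i$ is then sampled from the Gaussian component $p(x | z = k)$ of mean $\mu_k$ and variance $\sigma_k^2$, which is a diagonal matrix of entries $\sigma_{kd}^2$. Both hidden and observed variables are sampled independently. The Q function is
$$Q(p(z | \mathcal{D}; \theta^b); \theta) = \sum_i E_{z_i | x_i; \theta^b}\Big[\sum_k I(z_i, k) \log p(x_i | z_i = k; \theta)\, w_k\Big] = \sum_{i,k} p(k | x_i; \theta^b) \log p(x_i | z_i = k; \theta)\, w_k,$$
where $I(\cdot)$ is the indicator function. The probabilities $p(k | x_i; \theta^b)$ are the only quantities computed in the E-step. The M-step then computes the gradient with respect to parameters $\theta = \{\mu_k, \sigma_k\}$
$$\frac{\partial L}{\partial \mu_{kd}} = \sum_i p(k | x_i)\, \frac{x_{id} - \mu_{kd}}{\sigma_{kd}^2}, \tag{10}$$
$$\frac{\partial L}{\partial \sigma_{kd}} = \sum_i p(k | x_i) \Big[\frac{(x_{id} - \mu_{kd})^2}{\sigma_{kd}^3} - \frac{1}{\sigma_{kd}}\Big], \tag{11}$$
where $L$ indicates the log-likelihood of the image and $x_{id}$ is the $d$th entry of vector $x_i$.
These are also the components of the Fisher score, when evaluated using a reference model learned (with EM) from all training data. The FV is obtained by scaling the gradient vectors by an approximate Fisher information matrix. This leads to the following mean and variance components of the GMM-FV
$$G_{\mu_{kd}}(I) = \frac{1}{n \sqrt{w_k}} \sum_i p(k | x_i)\, \frac{x_{id} - \mu_{kd}}{\sigma_{kd}}, \tag{12}$$
$$G_{\sigma_{kd}}(I) = \frac{1}{n \sqrt{2 w_k}} \sum_i p(k | x_i) \Big[\Big(\frac{x_{id} - \mu_{kd}}{\sigma_{kd}}\Big)^2 - 1\Big]. \tag{13}$$
For a single Gaussian component of zero mean, (12) reduces to the average pooling operator. For mixtures of many components, (12) implements a pooling operator per component, restricting each operator to descriptors of large probability under the component. The FV can also implement other pooling operations, e.g. capturing higher order statistics as in (13). Many variations of the GMM-FV have been proposed to enable discriminative learning, spatial feature encoding or non-iid mixture modeling. However, for low-level features and large enough mixtures, the classical FV of (12) and (13) is still considered state-of-the-art.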
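The pooling interpretation of the GMM-FV can be made concrete with a small sketch. The code below assumes the standard normalized mean and variance components from the FV literature (posterior-weighted, whitened residuals scaled by $1/(n\sqrt{w_k})$ and $1/(n\sqrt{2w_k})$); the function names and the toy data are hypothetical, and the GMM parameters are taken as given rather than learned with EM.

```python
import math


def gmm_posteriors(x, weights, means, sigmas):
    """p(k | x) for a diagonal-covariance GMM; sigmas hold standard deviations."""
    log_joint = []
    for w, mu, sd in zip(weights, means, sigmas):
        ll = math.log(w)
        for xd, md, sdd in zip(x, mu, sd):
            ll += -0.5 * ((xd - md) / sdd) ** 2 - math.log(sdd) - 0.5 * math.log(2 * math.pi)
        log_joint.append(ll)
    top = max(log_joint)  # subtract the max for numerical stability
    unnorm = [math.exp(v - top) for v in log_joint]
    total = sum(unnorm)
    return [u / total for u in unnorm]


def gmm_fisher_vector(descriptors, weights, means, sigmas):
    """Return (mean components, variance components), each a K x D nested list."""
    n, K, D = len(descriptors), len(weights), len(means[0])
    fv_mean = [[0.0] * D for _ in range(K)]
    fv_var = [[0.0] * D for _ in range(K)]
    for x in descriptors:
        post = gmm_posteriors(x, weights, means, sigmas)
        for k in range(K):
            for d in range(D):
                z = (x[d] - means[k][d]) / sigmas[k][d]  # whitened residual
                fv_mean[k][d] += post[k] * z
                fv_var[k][d] += post[k] * (z * z - 1.0)
    for k in range(K):
        for d in range(D):
            fv_mean[k][d] /= n * math.sqrt(weights[k])
            fv_var[k][d] /= n * math.sqrt(2.0 * weights[k])
    return fv_mean, fv_var


# Single zero-mean, unit-variance component: the mean part is average pooling.
descriptors = [[1.0, 2.0], [3.0, 6.0]]
fv_mean, fv_var = gmm_fisher_vector(descriptors, [1.0], [[0.0, 0.0]], [[1.0, 1.0]])
print(fv_mean[0])  # [2.0, 4.0], the average of the descriptors
```

With more components, each row of `fv_mean` pools only the descriptors that the corresponding component explains well, which is the per-component pooling behaviour described above.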
4.2 Dirichlet Mixture FVs
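The variance-GMM is a default model for low-level visual descriptors [64, 10, 56, 29]. However, SMNs, which inhabit a probability simplex, are more naturally modeled by the Dirichlet mixture model (DMM). This follows from the fact that the Dirichlet distribution is the most popular model for probability vectors. For example, it is widely used for text modeling, as a prior of the latent Dirichlet allocation model, and for SIFT based image categorization [15, 9]. The DMM was previously used to model “theme” based SMNs. It is defined as
$$p(\pi | \{\alpha_k, w_k\}_{k=1}^{K}) = \sum_{k=1}^{K} w_k \frac{1}{Z(\alpha_k)} \exp\Big(\sum_l (\alpha_{kl} - 1) \log \pi_l\Big), \tag{14}$$
where $\alpha_k$ is the Dirichlet parameter of the $k$th mixture component and $w_k$ denotes the mixture weight. $Z(\alpha_k)$ is the normalizing constant $\prod_l \Gamma(\alpha_{kl}) / \Gamma(\sum_l \alpha_{kl})$, where $\Gamma(\cdot)$ is the Gamma function. The generative process is as follows. A mixture component $k$ is sampled from a categorical distribution $p(z = k) = w_k$. An observation $\pi$ is then sampled from the selected Dirichlet component of parameter $\alpha_k$. This makes the observation a multinomial distribution that resides on the probability simplex.
The EM algorithm for DMM learning has Q function
$$Q = \sum_i \sum_k h_{ik} \Big[\sum_l (\alpha_{kl} - 1) \log \pi_{il} - \log Z(\alpha_k)\Big], \tag{15}$$
where $h_{ik}$ is the posterior probability of the sample $\pi_i$ being under the $k$th component and we ignore terms that do not depend on the parameters $\alpha_k$. (Footnote 1: Gradients w.r.t. mixture weights are less informative than w.r.t. other parameters and are ignored in the FV literature [48, 49, 56].) The expression for the Fisher scores of a DMM is
$$G_{\alpha_{kl}}(I) = \frac{\partial L}{\partial \alpha_{kl}} = \sum_i h_{ik} \Big(\log \pi_{il} - \psi(\alpha_{kl}) + \psi\big(\textstyle\sum_m \alpha_{km}\big)\Big), \tag{16}$$
where $\psi(x) = \partial \log \Gamma(x) / \partial x$ is the digamma function. As usual in the FV literature, we approximate the Fisher information by the component-wise block diagonal matrix
$$\mathcal{F}^{(k)}_{lm} = E\Big[-\frac{\partial^2 \log p(\pi | \{\alpha_k, w_k\})}{\partial \alpha_{kl}\, \partial \alpha_{km}}\Big] \approx w_k \Big(\psi'(\alpha_{kl})\, \delta(l, m) - \psi'\big(\textstyle\sum_j \alpha_{kj}\big)\Big),$$
where $\delta(l, m) = 1$ if $l = m$ and $0$ otherwise.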
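A minimal sketch of the DMM Fisher score follows, assuming the score with respect to each Dirichlet parameter is the posterior-weighted sum of $\log \pi_{il} - \psi(\alpha_{kl}) + \psi(\sum_m \alpha_{km})$, which is the gradient of the Dirichlet mixture log-likelihood. The digamma function is not in the Python standard library, so it is approximated here with the usual recurrence plus asymptotic series; all function names, parameter values and SMNs are illustrative assumptions.

```python
import math


def digamma(x):
    """Digamma for x > 0: shift x upwards by recurrence, then use the asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    return result + math.log(x) - 0.5 / x - inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))


def dirichlet_log_pdf(pi, alpha):
    """Log density of a Dirichlet of parameter alpha at the probability vector pi."""
    log_z = sum(math.lgamma(a) for a in alpha) - math.lgamma(sum(alpha))
    return sum((a - 1.0) * math.log(p) for a, p in zip(alpha, pi)) - log_z


def dmm_posteriors(pi, weights, alphas):
    """Posterior probability of each Dirichlet component given the SMN pi."""
    log_joint = [math.log(w) + dirichlet_log_pdf(pi, a) for w, a in zip(weights, alphas)]
    top = max(log_joint)
    unnorm = [math.exp(v - top) for v in log_joint]
    total = sum(unnorm)
    return [u / total for u in unnorm]


def dmm_fisher_score(smns, weights, alphas):
    """Gradient of the DMM log-likelihood w.r.t. every alpha_kl (K x S nested list)."""
    K, S = len(weights), len(alphas[0])
    score = [[0.0] * S for _ in range(K)]
    for pi in smns:
        h = dmm_posteriors(pi, weights, alphas)
        for k in range(K):
            psi_sum = digamma(sum(alphas[k]))
            for l in range(S):
                score[k][l] += h[k] * (math.log(pi[l]) - digamma(alphas[k][l]) + psi_sum)
    return score


# Toy SMNs over three classes and a two-component DMM (illustrative values only).
smns = [[0.2, 0.3, 0.5], [0.6, 0.1, 0.3], [0.1, 0.8, 0.1]]
weights = [0.3, 0.7]
alphas = [[1.5, 2.0, 3.0], [4.0, 1.2, 0.8]]
score = dmm_fisher_score(smns, weights, alphas)
print(score)
```

Note that the SMNs enter the score only through $\log \pi_{il}$, which is the log embedding discussed in the next section.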
4.3 The logSMN-FV
To understand the benefits and limitations of the GMM-FV and DMM-FV it helps to investigate their relationships. Consider the application of the two FVs to the set of SMNs $\{\pi_1, \ldots, \pi_n\}$ extracted from image $I$. In both cases, the FV can be written as
$$G_k(I) = \frac{1}{n} \sum_i p(k | \pi_i)\, \gamma_k \big(\nu(\pi_i) - \xi_k\big),$$
where $p(k | \pi_i)$, $\nu(\cdot)$, $\xi_k$ and $\gamma_k$ are defined in Table I. This is a pooling mechanism that combines four operations: $p(k | \pi_i)$ assigns the SMNs $\pi_i$ to the components $k$, $\nu(\cdot)$ embeds each SMN into the space where pooling takes place, $\xi_k$ defines a centroid with respect to which the residuals are computed, and $\gamma_k$ scales or normalizes that residual.
There are three main differences between the FVs. First, while the GMM-FV lacks an embedding, the DMM-FV uses $\nu(\pi) = \log \pi$. Second, while the GMM-FV has independent parameters to define centroids ($\mu_k$) and scaling ($\sigma_k$), the parameters of the DMM-FV are coupled, since the centroids and the scaling parameter are both determined by the DMM parameters $\alpha_k$. Finally, the two FVs differ in the assignments $p(k | \pi_i)$ and centroids $\xi_k$. However, the centroids are closely related. Assuming a background mixture model learned from a training set $\{\pi_i\}$, they are the parameters that set (12) and (16) to zero upon convergence of EM. This leads to the expressions
$$\mu_{kl} = \frac{\sum_i p(k | \pi_i)\, \pi_{il}}{\sum_i p(k | \pi_i)}, \qquad \psi(\alpha_{kl}) - \psi\big(\textstyle\sum_m \alpha_{km}\big) = \frac{\sum_i h_{ik} \log \pi_{il}}{\sum_i h_{ik}}.$$
The differences in the assignments are also mostly of detail, since
$$p(k | \pi_i) \propto w_k \exp\Big(-\sum_l \Big[\frac{(\pi_{il} - \mu_{kl})^2}{2 \sigma_{kl}^2} + \log \sigma_{kl}\Big]\Big), \qquad h_{ik} \propto w_k \exp\Big(\sum_l (\alpha_{kl} - 1) \log \pi_{il} - \log Z(\alpha_k)\Big)$$
are both softmax type non-linearities. For both assignments and centroids, the most significant difference is the use of the $\log \pi$ embedding in the DMM-FV.
In summary, the two FVs differ mostly in the use of the $\log \pi$ embedding by the DMM-FV and the greater modeling flexibility of the GMM-FV, due to the availability of independent localization (centroid) and scale parameters. This suggests the possibility of combining the strengths of the two FVs by applying the GMM-FV after this embedding. We refer to this as the logSMN-FV.
4.4 FVs in Natural Parameter Space
The gains of the log embedding can be explained by the non-Euclidean nature of the probability simplex. For some insight on this, consider the two binary classification problems of Figures 4 a) and b). In a) the two classes are Gaussian, in b) Laplacian. Both problems have class-conditional distributions where is the class label and , with for Laplacian and for Gaussian. Figures 4 a) and b) show the iso-contours of the probability distributions under the two scenarios. Note that the two classifiers use very different metrics.
The posterior distribution of class is, in both cases,
where is the sigmoid. Since this is very non-linear, the projection of the samples into the semantic space destroys the Euclidean structure of the original spaces . This is illustrated in c), which shows the posterior surface and the projections for Gaussian . In this space, the shortest path between two samples is not a line. The sigmoid also makes the posterior surfaces of the two problems very similar. The surface of the Laplacian problem in b) is visually indistinguishable from c). In summary, Euclidean classifiers with two very different metrics transform the data into highly non-Euclidean semantic spaces that are almost indistinguishable. This reduces the effectiveness of modeling probabilities directly with GMMs or DMMs, producing weak FV embeddings.
The problem can be avoided by noting that SMNs are the parameters of the multinomial, which is a member of the exponential family of distributions
$$p(s; \pi) = h(s)\, g(\pi) \exp\big(\eta(\pi)^T T(s)\big),$$
where $T(s)$ is denoted a sufficient statistic. In this family, the re-parametrization $\nu = \eta(\pi)$ makes the (log) probability distribution linear in the sufficient statistic
$$p(s; \nu) = h(s)\, g(\eta^{-1}(\nu)) \exp\big(\nu^T T(s)\big).$$
This is called the natural parametrization of the distribution. Under this parametrization, the multinomial log-likelihood of the BoS in (2) yields a natural parameter vector $\nu_i = \eta(\pi_i)$ for each patch $x_i$, instead of a probability vector. For the binary semantics of Figure 4,
$$\nu = \log \frac{\pi}{1 - \pi}$$
is the logit transform. This maps the highly non-linear semantic space of Figure 4 c) into the linear space of d), which preserves the Euclidean structure of a) and b). Hence, while the variance-GMM is not well matched to the geometry of the probability simplex where $\pi$ is defined, it is a good model for distributions on the (Euclidean) natural parameter space defined by $\nu$.
Similarly, for multiclass semantics, the mapping from multinomial to natural parameter space is a one-to-one transformation into a space with Euclidean structure. In fact, the multinomial of parameter vector $\pi$ has three possible natural parametrizations
$$\nu^{(1)}_k = \log \pi_k, \qquad \nu^{(2)}_k = \log \pi_k + C, \qquad \nu^{(3)}_k = \log \frac{\pi_k}{\pi_S},$$
where $\nu_k$ and $\pi_k$ are the entries of $\nu$ and $\pi$, respectively. The fact that logSMNs implement $\nu^{(1)}$ explains the good performance of the logSMN-FV. However, the existence of two alternative embeddings raises the question of whether this is the best natural parameter space embedding for the BoS produced by a CNN. Note that, under $\nu^{(2)}$, $\pi$ defines a probability vector if and only if $C = \log \sum_j e^{\nu^{(2)}_j}$. Hence, the mapping from $\nu^{(2)}$ to $\pi$ is the softmax function commonly implemented at the CNN output, $\pi_k = e^{\nu^{(2)}_k} / \sum_j e^{\nu^{(2)}_j}$. This implies that CNNs learn to optimally discriminate data in the natural parameter space defined by $\nu^{(2)}$ and, for CNN semantics, $\nu^{(2)}$ should enable better scene classification.
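The relation between the softmax output and the log-type embeddings can be illustrated with a few lines of code. The sketch below assumes the three natural parametrizations take the forms $\log \pi_k$, $\log \pi_k + C$ and $\log(\pi_k / \pi_S)$, with the last class as the reference; the names `nu1`, `nu2` and `nu3` and the toy network outputs are hypothetical.

```python
import math


def softmax(nu):
    """Map pre-softmax outputs to a probability vector (an SMN)."""
    top = max(nu)
    exps = [math.exp(v - top) for v in nu]
    total = sum(exps)
    return [e / total for e in exps]


def log_sum_exp(nu):
    """Numerically stable log of the sum of exponentials."""
    top = max(nu)
    return top + math.log(sum(math.exp(v - top) for v in nu))


def nu1(pi):
    """Log embedding, as used by the logSMN-FV."""
    return [math.log(p) for p in pi]


def nu2(pi, c):
    """Log embedding shifted by a constant c; c = log-sum-exp inverts the softmax."""
    return [math.log(p) + c for p in pi]


def nu3(pi):
    """Log-ratio embedding with the last class as reference."""
    return [math.log(p / pi[-1]) for p in pi]


logits = [2.0, -1.0, 0.5]          # toy pre-softmax network outputs
pi = softmax(logits)
recovered = nu2(pi, log_sum_exp(logits))
print(pi)
print(recovered)                   # matches the original logits up to rounding
```

In practice the constant is not recoverable from the SMN alone, so the pre-softmax outputs have to be read directly from the network rather than computed from the probabilities.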
[Figure: a) variance GMM; b) MFA]
4.5 The MFA-FV
The models introduced so far mostly disregard semantic feature covariance. The Dirichlet mixture is, by design, incapable of modeling second order statistics. As usual in the FV literature [48, 49, 56], the GMM-based FVs assume a diagonal covariance per mixture component. While standard for SIFT descriptors, this is not suitable for the much higher dimensional CNN features, which are more likely to populate a low-dimensional manifold of the ambient semantic space. As illustrated in Figure 5, the variance-GMM requires many components to cover such a distribution. While a full covariance GMM could be substantially more efficient, covariance modeling is difficult in high dimensions. The data available for transfer learning is rarely sufficient to learn full covariances.
In this work, we explore approximate covariance modeling using mixtures of factor analyzers (MFAs). As illustrated in Figure 5, the MFA approximates a non-linear data manifold by a set of locally linear subspaces. Each mixture component generates Gaussian data in a low dimensional latent space, which is then projected linearly into the high dimensional observation space. This is a low rank approximation of the full covariance Gaussian, which can be learned with the small amounts of data available for transfer learning. It generates high-dimensional covariance statistics that can be exploited by a FV for better classification.
4.5.1 MFA Fisher scores
A factor analyzer (FA) models high dimensional observations $x \in \mathbb{R}^D$ in terms of latent “factors” $z \in \mathbb{R}^R$ defined on a low-dimensional subspace, $R \ll D$. Specifically, $x = \Lambda z + \epsilon$, where $\Lambda$ is the factor loading matrix and $\epsilon$ additive noise. Factors are distributed as $z \sim \mathcal{N}(0, I)$ and noise as $\epsilon \sim \mathcal{N}(0, \Psi)$, where $\Psi$ is a diagonal matrix. It can be shown that $x$ follows a Gaussian distribution of covariance $S = \Lambda \Lambda^T + \Psi$. Since this is a full covariance matrix, the FA is better suited for high dimensional data than a Gaussian of diagonal covariance.
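To make the low-rank structure concrete, the sketch below builds the covariance implied by a factor analyzer, assuming the standard form of a loading matrix times its transpose plus a diagonal noise term, and compares raw parameter counts for diagonal, factor-analyzer and full covariances. The function names and the toy values are hypothetical, and the counts ignore the rotational redundancy of the loading matrix.

```python
def fa_covariance(loading, noise_var):
    """Covariance of a factor analyzer: loading @ loading.T plus diagonal noise.

    loading is a D x R nested list, noise_var a list of D noise variances.
    """
    D = len(loading)
    cov = [
        [sum(a * b for a, b in zip(loading[i], loading[j])) for j in range(D)]
        for i in range(D)
    ]
    for i in range(D):
        cov[i][i] += noise_var[i]
    return cov


def parameter_counts(D, R):
    """Raw number of covariance parameters for three Gaussian models."""
    return {
        "diagonal": D,
        "factor_analyzer": D * R + D,
        "full": D * (D + 1) // 2,
    }


loading = [[1.0, 0.0], [0.5, 2.0], [0.0, 1.0]]   # D = 3, R = 2 (toy values)
noise_var = [0.1, 0.2, 0.3]
S = fa_covariance(loading, noise_var)
print(S)                            # off-diagonal entries are non-zero
print(parameter_counts(1000, 10))   # full: 500500, factor analyzer: 11000
```

For a 1000-dimensional descriptor and 10 factors, the factor analyzer needs roughly 2% of the parameters of a full covariance while still capturing correlations between dimensions, which is why it can be learned from small transfer datasets.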
The MFA extends the FA so as to allow a piece-wise linear approximation of a non-linear data manifold. It has two hidden variables: a discrete variable $s$, $p(s = k) = w_k$, which determines the mixture assignments, and a continuous latent variable $z \in \mathbb{R}^R$, $p(z | s = k) = \mathcal{N}(z; 0, I)$, which is a low dimensional projection of the observation variable $x \in \mathbb{R}^D$, whose distribution given $z$ and $s = k$ is Gaussian of mean $\Lambda_k z + \mu_k$ and diagonal noise covariance. Hence, the $k$th MFA component is a FA of mean $\mu_k$ and subspace defined by $\Lambda_k$. As illustrated in Figure 5, the MFA components approximate the distribution of $x$ by a set of sub-spaces. The MFA can be learned with an EM algorithm of Q function
where . After some simplifications, defining
the E step reduces to computing
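The M-step computes the Fisher scores of $\theta = \{\mu_k, \Lambda_k\}$. After some algebraic manipulations, these can be written as
$$G_{\mu_k}(I) = \sum_i p(k | x_i; \theta^b)\, S_k^{-1} (x_i - \mu_k), \tag{36}$$
$$G_{\Lambda_k}(I) = \sum_i p(k | x_i; \theta^b) \Big[S_k^{-1} (x_i - \mu_k)(x_i - \mu_k)^T S_k^{-1} - S_k^{-1}\Big] \Lambda_k, \tag{37}$$
where $S_k$ is the covariance of the $k$th mixture component and all parameters are evaluated at the reference model $\theta^b$.
For a detailed discussion of the Q function, the reader is referred to the EM derivation in the MFA literature. Note that the scores with respect to the means are functionally similar to the first order residuals of (10). However, the scores with respect to the factor loading matrices $\Lambda_k$ account for covariance statistics of the observations $x_i$, not just variances. We refer to (36) and (37) as the MFA Fisher scores (MFA-FS).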
4.5.2 MFA Fisher Information
The MFA-FV is obtained by scaling the MFA-FS by the Fisher information matrix. As before, this is approximated by a block-diagonal matrix that scales the Fisher scores of the mixture component by the inverse square-root of
Here is the weight of the mixture, the data term of its Fisher score, and the covariance with respect to the mixture component. For the mean scores of (36) this is simply the component covariance . For the factor loading scores it is the covariance of the data term of (37). This is a matrix, whose entry
is the product of two Gaussian random variables
where is the element of vector . The covariance matrix of the vectorized Fisher score is then
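This can be simplified by using Isserlis’ theorem, which states that, for zero-mean jointly Gaussian random variables $x_1, x_2, x_3, x_4$, $E[x_1 x_2 x_3 x_4] = E[x_1 x_2] E[x_3 x_4] + E[x_1 x_3] E[x_2 x_4] + E[x_1 x_4] E[x_2 x_3]$. It follows that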
5 Neural Network Embedding
The FVs above are implemented independently of the CNN used to extract the SMNs. The mixture model is learned from the extracted BoS and the FV derived from its parameters. In this section, we redesign the MFA-FS embedding as a CNN layer, to enable end-to-end training.
5.1 The MFA-FS Layer
Since the mixture component has distribution , it follows that
finally leads to
Figure 6 shows how (49) can be implemented as a network layer. The bottom branch computes the posterior probability of (5.1). The top branch computes the remainder of the summation argument. The computation of (48) is similar. The bottom branch is identical, the top branch omits the operations beyond . However, because the benefits of this component are small, we only use the layer of Figure 6.
5.2 Network Architecture
The overall architecture of the MFAFSNet is shown in Figure 7. A model pretrained on ImageNet is used to extract a vector of image features. This network is applied to image patches, producing multiple feature maps per image to classify. When the patches are of a single scale, the model is converted to a fully convolutional network. For patches of multiple scales, the final pooling layer is replaced with a region-of-interest (ROI) pooling layer, which accepts feature maps of multiple sizes and produces a fixed size output. This is a standard practice in object detection [20, 19]. The feature vector is reduced in dimensionality by a fully connected (fc) layer of appropriate dimensions, and fed to the MFA-FS layer of Figure 6. Note that this layer pools multiple local features, corresponding to objects of different sizes and in different image locations, generating a single feature vector for the whole image. This is fed to a power normalization layer and an L2 normalization layer, and finally to a linear classifier layer.
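The two normalization layers at the end of the pipeline can be sketched as follows. The exponent is an assumption: the text does not state it, and the value 0.5 (signed square root) used below is simply the common choice in the FV literature. The function name `power_l2_normalize` is hypothetical.

```python
import math


def power_l2_normalize(vec, alpha=0.5):
    """Signed power normalization followed by L2 normalization.

    alpha = 0.5 is an assumed default (signed square root), not a value from the paper.
    """
    powered = [math.copysign(abs(v) ** alpha, v) for v in vec]
    norm = math.sqrt(sum(p * p for p in powered))
    if norm == 0.0:
        return powered  # all-zero input: nothing to rescale
    return [p / norm for p in powered]


pooled = [4.0, -9.0, 0.0]           # toy pooled descriptor
normalized = power_l2_normalize(pooled)
print(normalized)                   # unit-length vector with the signs preserved
```

Power normalization compresses large entries so that a few dominant dimensions do not control the linear classifier, and the L2 step removes the dependence on the number of pooled patches.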
5.3 Loss Function
While the parameters , , , and are learned by back-propagation, they must maintain their interpretation as statistical quantities. This requires that (30) and (45)-(47) hold. Some of these constraints do not need to be enforced. For example, since (47) is the only to involve , there is a one to one relationship between and , independently of the value of . In result, it is equivalent to learn under the constraint of (47) or simply learn , which leads to a simpler optimization. A similar observation holds for (30), which is the only constraint on . On the other hand, some of the relationships must be enforced to maintain the MFA-FS interpretation. These are (45), (46), and the symmetry of matrix
. They are enforced by adding regularization terms to the loss function. For training setand classification loss , this leads to a loss function