Semantic Fisher Scores for Task Transfer: Using Objects to Classify Scenes

by   Mandar Dixit, et al.

The transfer of a neural network (CNN) trained to recognize objects to the task of scene classification is considered. A Bag-of-Semantics (BoS) representation is first induced, by feeding scene image patches to the object CNN, and representing the scene image by the ensuing bag of posterior class probability vectors (semantic posteriors). The encoding of the BoS with a Fisher vector(FV) is then studied. A link is established between the FV of any probabilistic model and the Q-function of the expectation-maximization(EM) algorithm used to estimate its parameters by maximum likelihood. A network implementation of the MFA Fisher Score (MFA-FS), denoted as the MFAFSNet, is finally proposed to enable end-to-end training. Experiments with various object CNNs and datasets show that the approach has state-of-the-art transfer performance. Somewhat surprisingly, the scene classification results are superior to those of a CNN explicitly trained for scene classification, using a large scene dataset (Places). This suggests that holistic analysis is insufficient for scene classification. The modeling of local object semantics appears to be at least equally important. The two approaches are also shown to be strongly complementary, leading to very large scene classification gains when combined, and outperforming all previous scene classification approaches by a sizeable margin


Object Detectors Emerge in Deep Scene CNNs

With the success of new computational architectures for visual processin...

Deep FisherNet for Object Classification

Despite the great success of convolutional neural networks (CNN) for the...

FOSNet: An End-to-End Trainable Deep Neural Network for Scene Recognition

Scene recognition is an image recognition problem aimed at predicting th...

Multiple instance dense connected convolution neural network for aerial image scene classification

With the development of deep learning, many state-of-the-art natural ima...

Deriving Visual Semantics from Spatial Context: An Adaptation of LSA and Word2Vec to generate Object and Scene Embeddings from Images

Embeddings are an important tool for the representation of word meaning....

Scene Parsing by Weakly Supervised Learning with Image Descriptions

This paper investigates a fundamental problem of scene understanding: ho...

Understanding the Fisher Vector: a multimodal part model

Fisher Vectors and related orderless visual statistics have demonstrated...

Please sign up or login with your details

Forgot password? Click here to reset