SwiDeN : Convolutional Neural Networks For Depiction Invariant Object Recognition

07/29/2016 ∙ by Ravi Kiran Sarvadevabhatla, et al. ∙ Indian Institute of Science

Current state-of-the-art object recognition architectures achieve impressive performance but are typically specialized for a single depictive style (e.g. photos only, sketches only). In this paper, we present SwiDeN: our Convolutional Neural Network (CNN) architecture which recognizes objects regardless of how they are visually depicted (line drawing, realistic shaded drawing, photograph etc.). In SwiDeN, we utilize a novel ‘deep’ depictive style-based switching mechanism which appropriately addresses the depiction-specific and depiction-invariant aspects of the problem. We compare SwiDeN with alternative architectures and prior work on a 50-category Photo-Art dataset containing objects depicted in multiple styles. Experimental results show that SwiDeN outperforms other approaches for the depiction-invariant object recognition problem.






1 Introduction

Depiction-invariant object recognition is the ability to determine an object’s category regardless of how the object is visually depicted (line drawing, realistic shaded drawing, photograph etc.). Given the varying level of abstraction and complexity in depiction (See Figure 1), this is a challenging task. Human beings easily accomplish depiction-invariant recognition but machine-based systems are nowhere close to a similar level of performance. Current state-of-the-art object recognition architectures do achieve good performance but they are specialized for a single depiction style (e.g. photos [8], sketches [19]). Therefore, designing architectures which recognize objects regardless of depiction style can facilitate progress towards matching human-level abilities. Moreover, the associated performance scores can also aid in quantitatively determining the semantic gap between human and machine capabilities [15].

Surprisingly, not much work exists for depiction-invariant object recognition. To address this gap, we propose a Convolutional Neural Network (CNN) architecture for depiction-invariant object category recognition which we call SwiDeN (Section 3). A novel aspect of our architecture is a ‘deep’ dynamic switching mechanism between two parallel CNN sub-architectures (Section 3.1.1). Our switch-based design not only reduces the overall burden of the generalized object recognition task but also enables the system to address depiction-specific and depiction-invariant aspects of the problem. We compare our approach with baselines, alternative architectures (Section 4.2) and previous work on a 50-category Photo-Art dataset containing multiple depictions of objects (Section 4). Experimental results show that our architecture outperforms other architectures, especially for non-photo object depictions (Section 5).

Figure 1: Sample images from the Photo-Art-50 dataset grouped by category. For each category, one image each from the ‘Art’ (left in the pair) and ‘Photo’ depictive styles is shown. Given the extreme changes in appearance, recognizing such images regardless of depiction is extremely challenging.
(a) Baseline
(b) Gradient Reversal Network (GRN)
(c) Switching Deep Networks (SwiDeN)
Figure 2: Our proposed architecture SwiDeN is shown in 2(c). The depictive style of the cartoon-ish horse image is determined as ‘Art’ by Switch (purple block). An associated switch layer relays it to the ‘Art’ sub-network (green block). The latter’s output is passed via a series of shared layers and finally, a softmax classifier generates the label Horse. Figure 2(a) is the baseline architecture. Figure 2(b) (GRN) is a modification of the architecture proposed by Ganin et al. [7]. VGG-19 [14] is used as the base network for all architectures.

2 Related Work

Object class (category) recognition, albeit restricted to photographic depictions, has been studied extensively by researchers [5, 6, 12]. However, little previous work exists for truly general multi-depiction object recognition. Wu et al. [16] construct multi-attribute part-graphs for object categories and use graph matching for classification on the same dataset we use. However, their evaluation procedure, also used by Cai et al. [3, 4], induces an unreasonable amount of category bias which makes comparison difficult. We present an alternative evaluation procedure which is more principled (see Section 4.1). Xiao et al. [17] present a graph-based object modelling approach and evaluate it on augmented classes of Caltech-256. Shrivastava et al. [13] utilize a depiction-invariant method for image matching. Domain adaptation approaches have also been tried [4]. However, when the domain-specific identifiers (e.g. target domain labels) are available as in our case, a domain-adaptation procedure unnecessarily makes the overall problem harder since the objective in domain-adaptation is typically to “forget” the source domain.

All the approaches mentioned above utilize multiple hand-crafted modules in the recognition pipeline. To the best of our knowledge, ours is the first end-to-end deep learning approach for depiction-invariant recognition of object categories.

3 Our framework

3.1 Motivation

Instead of learning from scratch, a common paradigm is to utilize pre-trained CNNs as a starting point while constructing deep networks of interest. We follow a similar paradigm in our approach.

In an effort to represent the sheer variety seen in image content, the convolutional layers in a CNN typically contain a large number of learnable filters. However, the filters are only sufficient to the extent that the depiction style remains unchanged (e.g. photographs). To accommodate the increase in variety when images from additional depiction styles need to be recognized, a naïve strategy would be to add additional learnable filters for each convolution layer of a pre-trained network (in this case, the network could be one pre-trained for a particular depiction style, e.g. photographs) and perform fine-tuning. However, this strategy results in an unbalanced learning regime since convolutional layers now contain a mixture of learnt and non-learnt filters. In addition, the added filters necessitate an ad-hoc grouping of filter layers to ensure operational consistency which further complicates the overall framework.

An alternative design would be to learn the filters for each depictive style separately. In this design, a shallow-layer sub-network exists for each depictive style (see Figure 2(c)). Since our final objective is to achieve depiction-invariant recognition, we require our network to learn a depiction-invariant feature representation. This is achieved by having a final set of layers that is shared across depictive styles. To serve as a relay mechanism between the initial depiction-specific sub-network branches and the shared, deeper depiction-invariant fully-connected layers, we employ a custom-designed “switch” (Section 3.1.1). The switch is trained such that given an image, it determines its depictive style and selects the corresponding depiction-specific sub-network for processing the image. The output of this sub-network is then processed by the depiction-invariant layers of the network. The network culminates in a typical softmax-based classification layer which determines the image category, regardless of its depictive style (Figure 2(c)).

Next, we describe the depiction style-based switching mechanism. Subsequently, we delve into the architectural details of the main network pipeline which we dub SwiDeN (Switching Deep Network).

3.1.1 Switch

To realize the switching mechanism mentioned in Section 3.1, we design and train a switch network (see Figure 2(c)), henceforth referred to as Switch, that determines the depiction style of the input image and passes the image to the corresponding depiction sub-network (Photo or Art). The Switch has two convolution layers which capture depiction-discriminative features such as edges, textures, corners, colors and their conjunctions [20]. The first convolution layer is initialized from AlexNet [10]. The features from the first layer are max pooled while the features from the second convolution layer are average pooled globally [11]. The pooled features are processed by two fully connected layers and passed to a classifier layer which determines the depiction style of the input image as ‘Art’ or ‘Photo’. For better generalization, we use dropout for fully connected layers with as the dropout value. We trained Switch using stochastic gradient descent (SGD) [10] with a base learning rate and momentum . Overall, Switch achieves an average accuracy of ( for ‘Art’ and for ‘Photo’).
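As a small illustration of the global average pooling step mentioned above, the following sketch reduces each feature map (channel) to a single number by averaging over all spatial positions. The function name and the nested-list representation of feature maps are assumptions made for this example; they are not taken from the Switch implementation.

```python
from typing import List

FeatureMap = List[List[float]]  # one channel: rows x columns


def global_average_pool(feature_maps: List[FeatureMap]) -> List[float]:
    """Average every channel over all of its spatial positions."""
    pooled = []
    for channel in feature_maps:
        total = sum(sum(row) for row in channel)
        count = len(channel) * len(channel[0])
        pooled.append(total / count)
    return pooled


# Two toy 2x2 channels (illustrative values only).
toy_maps = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[0.0, 0.0], [0.0, 8.0]],
]
pooled_features = global_average_pool(toy_maps)
```

The pooled vector has one entry per channel regardless of the spatial size of the input, which is what lets the fully connected layers that follow operate on a fixed-length input.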

Switch’s inability to achieve accuracy can be attributed to the fact that some photo images have a predominantly artistic quality and vice-versa (see supplementary material). While this may seem like a liability, in practice, all we require is that Switch achieve a reasonably high accuracy which ensures an overall burden reduction for the filter learning process.

3.1.2 Switching Deep Network (SwiDeN)

The initial portion of SwiDeN consists of two separate sub-networks, one each for photo and art depiction style. During training, Switch (Section 3.1.1) selects the sub-network branch through which the input image is passed in the forward pass and ensures that the corresponding network loss is backpropagated through the branch selected during the forward pass. The layers after Switch are shared layers, designed to learn depiction-style invariant representations.
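The routing described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the Caffe implementation: the helper `make_switching_forward`, the toy switch rule and the toy branch functions are all hypothetical stand-ins for the real convolutional sub-networks and shared layers.

```python
from typing import Callable, Dict, List

Vector = List[float]
Layer = Callable[[Vector], Vector]


def make_switching_forward(
    switch: Callable[[Vector], str],
    branches: Dict[str, Layer],
    shared: Layer,
) -> Callable[[Vector], Dict[str, object]]:
    """Build a forward function that hard-routes each input through one branch."""

    def forward(image: Vector) -> Dict[str, object]:
        style = switch(image)              # depiction style: 'Art' or 'Photo'
        features = branches[style](image)  # only the selected branch is evaluated
        output = shared(features)          # shared, depiction-invariant layers
        return {"style": style, "output": output}

    return forward


# Hypothetical toy components; real ones would be learned networks.
def toy_switch(image: Vector) -> str:
    return "Photo" if sum(image) / len(image) > 0.5 else "Art"


toy_branches: Dict[str, Layer] = {
    "Art": lambda x: [2.0 * v for v in x],
    "Photo": lambda x: [v + 1.0 for v in x],
}


def toy_shared(features: Vector) -> Vector:
    return [sum(features)]


forward = make_switching_forward(toy_switch, toy_branches, toy_shared)
art_result = forward([0.1, 0.2, 0.3])
photo_result = forward([0.9, 0.8, 0.7])
```

Because only the selected branch takes part in the forward computation, a backward pass through this structure would reach only that branch's parameters, while the shared layers receive updates from images of both styles.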

For our problem, we build SwiDeN using VGG-19 deep network [14] layers. We select a subset of initial convolutional layers of VGG-19 and utilize them as the sub-networks for each depictive style. The rest of the VGG-19 layers (except the final classification layer) are used as the shared layers of SwiDeN. Figure 2(c) illustrates a SwiDeN architecture where the first four convolutional layers of VGG-19 are used for the depictive style sub-networks (C1a - C4a for ‘Art’ and C1p - C4p for ‘Photo’) and the rest of VGG-19 layers (C-5, FC-6, FC-7) form the shared portion.

In our experiments, we systematically examined the effect on recognition performance when the first convolutional layers of VGG-19 are used as depiction-style sub-networks (Section 5). For the rest of the paper, we refer to the corresponding architectures as C1-S, C2-S, C3-S, C4-S and C5-S. Thus, the architecture in Figure 2(c) is C4-S.
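The naming scheme can be made concrete with a short sketch that splits the layer sequence from Figure 2(c) into style-specific and shared parts. The stage list below uses simplified labels (C1 to C5, FC6, FC7) and the function `split_stages` is a hypothetical helper introduced only for illustration.

```python
from typing import Dict, List

# Simplified labels for the VGG-19 stages named in Figure 2(c);
# the final classification layer is left out, as in the text.
VGG19_STAGES: List[str] = ["C1", "C2", "C3", "C4", "C5", "FC6", "FC7"]


def split_stages(k: int) -> Dict[str, object]:
    """Describe the Ck-S variant: first k conv stages per style, rest shared."""
    if not 1 <= k <= 5:
        raise ValueError("k must be between 1 and 5")
    branch = VGG19_STAGES[:k]
    return {
        "name": f"C{k}-S",
        "art": [stage + "a" for stage in branch],
        "photo": [stage + "p" for stage in branch],
        "shared": VGG19_STAGES[k:],
    }


c4s = split_stages(4)
all_variants = [split_stages(k)["name"] for k in range(1, 6)]
```

For `k = 4` this reproduces the layout in Figure 2(c): C1a to C4a and C1p to C4p as the style sub-networks, with C5, FC6 and FC7 shared.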

4 Experiments

4.1 Dataset

We evaluate the classification performance on the Photo-Art-50 dataset [3]. This dataset contains classes and to images in each class with approximately half photo and half art images. The authors also provide train-test splits for comparative evaluation. However, the splits are unbalanced and do not include a validation split, thus inducing significant class bias during evaluation. To avoid this issue, we create our own train, validation and test splits. We create five random splits, each containing images from each category for training ( art and photo) and images for testing ( art and photo). The remaining images from each category are used for validation. We augment the dataset by taking 5 crops of size (four corner crops and the center crop) after rescaling the smallest side of the image to . For training images, the center crop alone is centered around the bounding box. Multiple objects of same class in a single image are ignored. We plan to release our balanced splits to the public.
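The five-crop augmentation can be sketched as a computation of crop boxes. The sizes used in the example (256 for the rescaled smallest side, 224 for the crop) are assumptions chosen for illustration, and both function names are hypothetical. The sketch also omits the bounding-box-centered crop used for training images.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom)


def rescale_smallest_side(width: int, height: int, target: int) -> Tuple[int, int]:
    """Return the image size after scaling so the smallest side equals target."""
    scale = target / min(width, height)
    return round(width * scale), round(height * scale)


def five_crop_boxes(width: int, height: int, crop: int) -> List[Box]:
    """Return boxes for the four corner crops followed by the center crop."""
    if crop > width or crop > height:
        raise ValueError("crop size exceeds image size")
    right = width - crop
    bottom = height - crop
    cx, cy = right // 2, bottom // 2
    return [
        (0, 0, crop, crop),                # top-left
        (right, 0, width, crop),           # top-right
        (0, bottom, crop, height),         # bottom-left
        (right, bottom, width, height),    # bottom-right
        (cx, cy, cx + crop, cy + crop),    # center
    ]


# Illustrative sizes only: a 640x480 image, smallest side 256, crop 224.
new_w, new_h = rescale_smallest_side(640, 480, 256)
boxes = five_crop_boxes(new_w, new_h, 224)
```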

4.2 Comparison architectures

4.2.1 Baseline

As a natural baseline, we fine-tune VGG-19 using the training data for the classes described in Section 4.1. For training, we used a stochastic gradient descent (SGD) method with a base learning rate and momentum to learn the weights. The learning rate was stepped down by a factor of when the validation accuracy plateaued.
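One possible way to step the learning rate down when validation accuracy plateaus is sketched below. The plateau criterion (no improvement for `patience` consecutive checks) and all numeric values are assumptions for illustration; the text does not specify how the plateau was detected.

```python
class PlateauStepDown:
    """Divide the learning rate by `factor` when validation accuracy stalls."""

    def __init__(self, base_lr: float, factor: float, patience: int) -> None:
        self.lr = base_lr
        self.factor = factor
        self.patience = patience
        self.best = float("-inf")
        self.stale = 0  # consecutive checks without improvement

    def update(self, val_accuracy: float) -> float:
        """Record one validation result and return the learning rate to use next."""
        if val_accuracy > self.best:
            self.best = val_accuracy
            self.stale = 0
        else:
            self.stale += 1
            if self.stale >= self.patience:
                self.lr /= self.factor
                self.stale = 0
        return self.lr


# Hypothetical values: base rate 0.001, step factor 10, patience of 2 checks.
schedule = PlateauStepDown(base_lr=0.001, factor=10.0, patience=2)
rates = [schedule.update(acc) for acc in [0.80, 0.85, 0.85, 0.84, 0.86]]
```

In this toy run the rate stays at its base value while accuracy improves and drops once after two consecutive checks without improvement.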

4.2.2 Gradient Reversal Network

Ganin et al. [7] propose a deep network-based domain-adaptation framework. The authors aim to maximize the target domain accuracy by simultaneously minimizing the target-domain label loss function and maximizing the loss for domain type (target or source) classification. To achieve this, they introduce a gradient reversal layer which not only assists domain-adaptation but also helps learn a domain-invariant representation (FC-8 in Figure 2(b)). Intrigued by this domain-invariance feature, we wished to examine the architecture’s suitability for our cross-depiction problem by viewing depictive styles as domains. However, in their original formulation, Ganin et al. maximize the accuracy for a single domain (depictive style). Therefore, we modify their formulation such that the overall network loss for both the domains (‘Art’ and ‘Photo’) is minimized. In addition, we replace AlexNet used by Ganin et al. with VGG-19. For the rest of the paper, we shall refer to this modified formulation as Gradient Reversal Network (GRN).

We initialized GRN with VGG-19 model weights and performed training using SGD with a base learning rate of and momentum . A uniform learning rate was maintained throughout training. For the gradient reversal layer’s scaling factor (see [7] for details), we tried values of and found that gave the best result.
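The behaviour of a gradient reversal layer can be illustrated with a toy class: the forward pass is the identity, while the backward pass multiplies the incoming gradient by the negative of the scaling factor. The class below is a hypothetical pure-Python stand-in, not the Caffe layer provided by Ganin et al., and the scaling value in the example is arbitrary.

```python
from typing import List


class GradientReversal:
    """Identity in the forward pass; negated, scaled gradient in the backward pass."""

    def __init__(self, scale: float) -> None:
        self.scale = scale  # the layer's scaling factor

    def forward(self, features: List[float]) -> List[float]:
        # Features pass through unchanged to the domain (style) classifier.
        return list(features)

    def backward(self, grad_output: List[float]) -> List[float]:
        # The gradient from the style classifier is flipped, so layers below
        # are pushed towards features that make the styles harder to separate.
        return [-self.scale * g for g in grad_output]


layer = GradientReversal(scale=2.0)  # arbitrary illustrative value
passed = layer.forward([1.0, -2.0])
reversed_grad = layer.backward([0.5, -1.0])
```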

4.2.3 SwiDeN: training

The same training procedure and hyperparameters as in the baseline were used for training SwiDeN architectures C1-S, C2-S, ..., C5-S (Section 3.1.2) with the exception of the learning rates for the depictive style sub-networks. For the ‘Art’ sub-network, we used a learning rate scaled by a factor of since the base network (VGG) is primarily trained for non-Art images. The learning rate was stepped down by a factor of when the validation accuracy plateaued.

4.3 Evaluation

For evaluation, we determined the final label by pooling the results for five crops of the test image (four corner crops and one center crop) for all the architectures.
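One common way to pool crop-level results is to average the class scores over the five crops and pick the highest-scoring class. The text does not state which pooling rule was used, so averaging is an assumption here, and the function name and toy scores are illustrative.

```python
from typing import List, Sequence


def pool_crop_predictions(crop_scores: Sequence[Sequence[float]],
                          labels: Sequence[str]) -> str:
    """Average per-class scores over all crops and return the top label."""
    n_crops = len(crop_scores)
    mean_scores: List[float] = [sum(column) / n_crops for column in zip(*crop_scores)]
    best_index = max(range(len(mean_scores)), key=mean_scores.__getitem__)
    return labels[best_index]


# Toy softmax outputs for five crops over three classes.
toy_labels = ["cat", "horse", "car"]
toy_scores = [
    [0.6, 0.3, 0.1],
    [0.2, 0.7, 0.1],
    [0.5, 0.4, 0.1],
    [0.1, 0.8, 0.1],
    [0.4, 0.5, 0.1],
]
final_label = pool_crop_predictions(toy_scores, toy_labels)
```

Here two individual crops favour "cat", yet the pooled decision is "horse" because its mean score across all five crops is higher.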

4.4 Implementation

We used Caffe [9] for all experiments on the baseline. For SwiDeN, we integrated the switch layer from a branch of Caffe [2] into the master branch [1] and customized it for our experiments involving SwiDeN. For experiments on GRN, we used the Caffe version provided by Ganin et al. [7].

5 Results

Arch.            Overall Acc.    Art Acc.    Photo Acc.
Baseline         93.80%          89.80%      97.80%
GRN              92.64%          88.52%      96.76%
SwiDeN (Ours)    94.42%          91.12%      97.72%
Table 1: Classification accuracy for different architectures.

For each architecture (baseline, GRN and SwiDeN (C4-S)), we computed the average test set accuracy across all the classes and all the splits. We collate the results into three groups: accuracy regardless of depictive style (‘Overall’) and style-wise accuracies for ‘Photo’ (i.e. accuracy on photographic test images only) and ‘Art’. The results can be seen in Table 1. Our SwiDeN architecture outperforms the other two architectures overall and for ‘Art’ while remaining competitive for ‘Photo’. In SwiDeN, the depiction-style Switch guided sub-network learning reduces the overall burden for the deeper shared layers in learning a robust depiction-invariant representation, which in turn contributes to SwiDeN’s performance.
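As a quick sanity check on Table 1, the ‘Overall’ column should equal the mean of the ‘Art’ and ‘Photo’ columns if the test set contains equal numbers of images from each style. That balance is an assumption here, suggested by the split description in Section 4.1. The snippet below only re-uses the numbers reported in Table 1.

```python
# (overall, art, photo) accuracies in percent, copied from Table 1.
TABLE_1 = {
    "Baseline": (93.80, 89.80, 97.80),
    "GRN": (92.64, 88.52, 96.76),
    "SwiDeN": (94.42, 91.12, 97.72),
}


def balanced_overall(art: float, photo: float) -> float:
    """Overall accuracy when both styles contribute equally many test images."""
    return (art + photo) / 2.0


# Absolute difference between the reported and the recomputed overall accuracy.
gaps = {
    name: abs(overall - balanced_overall(art, photo))
    for name, (overall, art, photo) in TABLE_1.items()
}

# Improvement of SwiDeN over the baseline on 'Art', in percentage points.
art_gain = TABLE_1["SwiDeN"][1] - TABLE_1["Baseline"][1]
```

All three rows agree with the balanced mean to within rounding, and the gain of SwiDeN over the baseline on ‘Art’ is about 1.32 percentage points.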

GRN performs worse than the baseline and SwiDeN. Similar to SwiDeN, GRN also utilizes feedback from a depiction-style classifier. However, the feedback is provided coarsely and indirectly (in terms of loss). Moreover, the feedback is provided at a layer situated deep in the network. This hinders the fine-tuning of shallower (convolutional) filters to learn ‘Art’-specific filters, thus affecting the performance on ‘Art’ in particular and overall performance in general.

We also observe that the baseline performs slightly better than other architectures for ‘Photo’ style. This is to be expected since the original filters are highly-tuned for photos. However, its performance for ‘Art’ is relatively lower compared to SwiDeN. This shows that the complexity involved in cross-depiction recognition cannot be addressed merely by employing typical transfer learning approaches such as fine-tuning.

In spite of the class-bias induced by the splits provided by Cai et al. [3], we compared the performance of our C4-S SwiDeN architecture against that of the multi-attribute part-graph model proposed by Wu et al. [16]. To aid training, we augment the training set by performing RGB jittering and horizontal flips on all images, and morphological operations on ‘Art’ images. As Table 2 shows, SwiDeN achieves state-of-the-art results, outperforming the result of Wu et al. [16] overall and for ‘Photo’ images while remaining competitive for ‘Art’.

Table 3 summarizes the performance of different SwiDeN architectures. As can be seen, C4-S outperforms the other SwiDeN architectures in overall and ‘Art’ accuracy. As an interesting observation, the trends in overall accuracy and ‘Art’ accuracy as the depth of depictive-style sub-networks increases resemble the patterns observed by Yosinski et al. [18] for deep networks but in the context of transfer learning.

Arch.             Overall Acc.    Art Acc.    Photo Acc.
Wu et al. [16]    89.67%          89.06%      90.29%
SwiDeN (Ours)     93.02%          88.47%      97.56%
Table 2: Classification accuracy on train-test splits by Cai et al. [3].

SwiDeN Arch.    Overall Acc.    Art Acc.    Photo Acc.
C1-S            94.22%          90.44%      98.00%
C2-S            94.36%          90.8%       97.92%
C3-S            93.96%          90.4%       97.52%
C4-S            94.42%          91.12%      97.72%
C5-S            92.64%          88.52%      96.76%
Table 3: Classification accuracy for different SwiDeN architectures C1-S to C5-S (see Section 3.1.2).

6 Conclusion

In this paper, we have described SwiDeN, our end-to-end deep learning framework for recognizing objects regardless of depiction. A key aspect of SwiDeN is the ‘deep’ depictive style-based switching mechanism which judiciously addresses depiction-specific and depiction-invariant aspects of the problem. Addressing these aspects enables us to achieve state-of-the-art results on a challenging dataset containing ‘Photo’ and ‘Art’ style object depictions. In future, we plan to explore unsupervised network learning approaches. Our code and pre-trained models can be accessed at https://github.com/val-iisc/swiden.