Fine-Grained Visual Categorization (FGVC) is an important and active research topic in computer vision [19, 3]. It is the task of identifying objects from subordinate categories, for example, recognizing species of birds. Unlike fine-grained categorization, in attribute recognition at a fine-grained level we are interested in classifying the attributes of an instance. An example can be seen in Figure 1, where an instance of a specific bird is classified for different attributes such as its bill shape or under-part color. Retrieving such attributes of objects has been shown to be effective in improving object recognition and categorization. Further, since such higher-order descriptions of images provide semantically meaningful information [1, 8], along with low-level features (whether hand-crafted or extracted from different layers of a CNN) they can facilitate the task of fine-grained retrieval, for instance for product search [27, 30]. Specifically, it has been noted that it is worthwhile to learn a fine-grained model capable of characterizing the fine-grained visual similarity between images within the same category.
Previous work has addressed the problem of recognizing attributes of an object at a fine-grained level using other techniques, such as relying on text information to associate attributes with the object, for instance describing jewelry or other fashion items with fine-grained attributes such as color or texture. Other works have taken advantage of localizing the object or its fine parts [6, 35, 2, 31], for example by detecting discriminative visual attributes in bird images. In this work, we demonstrate that retrieving attributes of an object at a fine-grained level can be viewed as a multi-attribute classification task. As opposed to multi-label object classification (the example on the left in Figure 1, taken from the COCO dataset, where the scene is classified for multiple objects at a high level of recognition), in multi-attribute classification a single instance of an object is tagged for its different attributes at a fine-grained level of recognition (see the bird example on the right in Figure 1, which is tagged for the shape of its bill, its under-part colors, etc.). However, we can take advantage of methods from the multi-label image classification domain. Pairwise ranking loss has proven to be successful in multi-label image classification with tag annotations as the ground truth [32, 12].
The objective of the pairwise ranking loss is that the scores of positive labels should be larger than the scores of negative labels. The derivative of the ranking loss does not vanish when the logits have high values, which makes it robust against the vanishing gradient problem. Moreover, as in multi-label object classification, in multi-attribute classification we are interested in the per-attribute accuracy results.
As mentioned before, fine-grained categorization is an area of research very close to attribute extraction at the fine-grained level. The fine-grained categorization problem has been addressed using different techniques, including part-based or localization models. However, these techniques often require additional expensive annotations such as bounding boxes or keypoints. Among recent techniques, bilinear convolutional neural networks (BCNN) have been shown to improve the results of fine-grained categorization significantly. In BCNN,
an image is passed through two (similar or different) convolutional networks, and the outputs of their last convolutional layer (more precisely, conv+pool) are multiplied using the outer product at each location of the image and sum-pooled to obtain the bilinear vector. This additional step is defined as the bilinear-pool layer. In our proposed FineTag architecture we adopt the idea of the bilinear-pool layer to capture a feature map with an emphasis on fine details, and use it not as a second training step but as a convolutional layer in the network, enabling end-to-end training.
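For concreteness, the bilinear-pool step (outer product at each location, sum-pooled over locations) can be sketched in a few lines of pure Python; the toy shapes and names here are illustrative, not taken from the paper:

```python
def bilinear_pool(feat_a, feat_b):
    """Sum-pooled outer product of two conv feature maps.

    feat_a: H x W x Ca nested lists, feat_b: H x W x Cb -- outputs of
    the last conv layer of two (possibly identical) CNN streams.
    Returns the flattened (Ca * Cb)-dimensional bilinear descriptor.
    """
    H, W = len(feat_a), len(feat_a[0])
    Ca, Cb = len(feat_a[0][0]), len(feat_b[0][0])
    pooled = [[0.0] * Cb for _ in range(Ca)]
    for h in range(H):
        for w in range(W):
            a, b = feat_a[h][w], feat_b[h][w]
            # outer product at this spatial location, accumulated
            for i in range(Ca):
                for j in range(Cb):
                    pooled[i][j] += a[i] * b[j]
    return [v for row in pooled for v in row]
```

In practice this is implemented with batched tensor operations rather than loops; the sketch only fixes the semantics of the layer.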
Our contributions in this paper are as follows: 1. We introduce the FineTag architecture, a simple network (in terms of the number of parameters) that retrieves the attributes of an instance at a fine-grained level of detail. 2. We adapt the Caltech CUB200 birds dataset, which was initially collected and organized for fine-grained categorization, for multi-attribute classification at a fine-grained level.
The paper is organized as follows: In Section 2, we describe the architecture of our network (FineTag), including the loss function used by the model as well as the evaluation metrics. Section 3 discusses the preparation of the data for the experiment, as well as the data challenges in fine tagging. In Section 4, the experiments and results are discussed, and Section 5 summarizes our work and outlines future challenges and goals.
2.1 FineTag Network Architecture
Figure 2 shows the architecture of our proposed model (the graph in the first row). Our proposed architecture is fully convolutional and is based on VGG16. VGG16 is chosen due to its small filter kernel size (3×3) compared to other existing convolutional neural networks, which makes it suitable for capturing fine texture details in an object. We apply the bilinear-pool layer as an outer product layer and use a loss function suitable for multi-label classification, proposing an end-to-end model for multi-attribute extraction of an instance at a fine-grained level.
Similar to previous work, we use VGG16 pre-trained on the ImageNet dataset, truncated at a convolutional layer after the non-linearities. More specifically, we extract the features from the last convolutional layer of VGG16 with its non-linearity, i.e. layer 30 (conv5_3 + ReLU).
The feature map at this level has 512 channels at each spatial location. We project a copy of this feature map into a 20-dimensional ICA projection space, which is generated from the feature maps extracted at the same layer when a set of images from the same dataset is passed through the VGG16 network. This results in a reduced feature map with 20 channels. We also tried PCA for dimensionality reduction; however, we found ICA to perform better than PCA, at least on the CUB200 bird dataset.
Then, the sum of the outer product of the full and reduced feature maps at each location is calculated, which yields a 512×20 matrix per location; summing over the spatial dimensions reduces this to 512 × 20 = 10,240 features. As can be seen in Figure 2, the ICA projection can be designed as a convolutional operation with weights and biases of size 1×1×512×20 and 20, respectively. This architecture relies on the local correlation between pairs of features for creating the final list of logits necessary for multi-labelling. The layer following the sum of the outer product is a fully connected layer without any non-linearity (similar to the last fully connected layer in the original VGG16 design). To get the predicted tags for an image, the results are passed through a multi-label classifier trained with the multi-label ranking loss (shown in green in Figure 2).
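The equivalence between the ICA projection and a 1×1 convolution can be illustrated with a minimal sketch: a 1×1 convolution is simply a per-location matrix multiply plus bias. The code below is our own illustration with toy shapes; in FineTag the input would have 512 channels, the projection 20 outputs, and the ICA coefficients would serve as the weights:

```python
def conv1x1(feat, weights, biases):
    """Apply a 1x1 convolution as a per-location matrix multiply.

    feat: H x W x C_in nested lists; weights: C_in x C_out; biases: C_out.
    Returns an H x W x C_out feature map.
    """
    c_in, c_out = len(weights), len(weights[0])
    out = []
    for row in feat:
        out_row = []
        for px in row:  # px is the C_in-dimensional feature at one location
            out_row.append([
                sum(px[c] * weights[c][d] for c in range(c_in)) + biases[d]
                for d in range(c_out)
            ])
        out.append(out_row)
    return out
```

The outer-product pooling described above is then taken between the original (512-channel) and projected (20-channel) maps, giving the 10,240-dimensional bilinear feature.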
The number of classes (see Figure 2) is the size of the attribute vocabulary. For instance, in the case of the CUB200 bird dataset, where there are 312 attributes in total, the network is trained to output 312 logits. The network is trained with a multi-label ranking loss (Eq. 2). The rank of the logits gives the tagging results: an optimally trained network ranks the labels present in the image higher than the missing ones.
An important advantage of our proposed architecture is that it requires far fewer parameters to train than the baseline network, the very deep VGG16. Figure 2 provides a visual comparison of the weights and biases in the layers of our proposed network and of VGG16+rank. Apart from the parameters shared between the two models in the convolutional ReLU layers, with 312 attributes (the number of attributes in the CUB200 bird dataset) our model only needs to train the small 1×1 ICA convolution and a single fully connected layer, whereas VGG16 must train its three large fully connected layers for the same number of attributes. As the parameter counts and the extra fully connected layers in VGG16 predict, FineTag is more than 30% faster to train.
Another advantage of our proposed architecture is that, unlike what is usual with fine-grained recognition techniques [6, 2], FineTag does not require explicitly locating the parts of the object.
We emphasize that FineTag is an end-to-end architecture: the ICA projection coefficients are calculated only once beforehand and used to initialize the bilinear layer (shown in red in Figure 2).
2.2 Training and Loss Functions
We build the baseline training on the VGG16 architecture. We explore multiple loss functions. Ranking has been shown to be a good loss for highly unbalanced multi-labelling problems [28, 33]. One choice would be the hinge ranking loss [32, 12]:

$$\ell_{\text{hinge}}(x, Y) = \sum_{p \in Y} \sum_{n \notin Y} \max\big(0,\, 1 + f_n(x) - f_p(x)\big) \qquad (1)$$

where $f$ is a label (attribute) prediction model that maps an image $x$ to a $K$-dimensional label space representing the confidence scores, and $Y$ is the set of true labels. The loss is designed such that $f$ produces a vector whose values for true labels are greater than those for negative labels (i.e. $f_p(x) > f_n(x)$ for $p \in Y$, $n \notin Y$). This creates the framework of learning to rank via pairwise comparisons.
In practice we use a smoothed variant:

$$\ell_{\text{LSEP}}(x, Y) = \log\Big(1 + \sum_{p \in Y} \sum_{n \notin Y} \exp\big(f_n(x) - f_p(x)\big)\Big) \qquad (2)$$

which is basically a smooth approximation of Eq. 1 using the log-sum-exp pairwise function.
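The two objectives can be sketched directly from their definitions (our own illustrative code; `scores` is a plain list of the K logits and `positives` the set of true-label indices):

```python
import math

def hinge_rank_loss(scores, positives, margin=1.0):
    """Pairwise hinge ranking loss: penalize every (positive, negative)
    pair whose positive score does not beat the negative by `margin`."""
    neg = [j for j in range(len(scores)) if j not in positives]
    return sum(max(0.0, margin - scores[i] + scores[j])
               for i in positives for j in neg)

def lsep_loss(scores, positives):
    """Smooth log-sum-exp approximation of the pairwise hinge loss."""
    neg = [j for j in range(len(scores)) if j not in positives]
    return math.log1p(sum(math.exp(scores[j] - scores[i])
                          for i in positives for j in neg))
```

Note that the gradient of the smooth version never saturates for badly ranked pairs, which is the robustness-to-vanishing-gradients property mentioned above.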
2.3 Evaluation metrics
We use ranking-based average precision (AVGPREC) as our main evaluation metric. This metric is designed to evaluate ranked retrieval results of labels: for each image, the algorithm returns a ranked list of labels. The average precision metric computes, for each relevant label in the retrieval list, the percentage of relevant labels that are ranked at least as high as itself, and averages these percentages over all relevant labels. AVGPREC is defined as:

$$\text{AVGPREC}(x) = \frac{1}{|Y|} \sum_{\lambda \in Y} \frac{\big|\{\lambda' \in Y : r(\lambda') \le r(\lambda)\}\big|}{r(\lambda)}$$

where $Y$ is the list of relevant labels for the given instance $x$, and $r(\lambda)$ and $r(\lambda')$ are the positions of labels $\lambda$ and $\lambda'$ in the predicted ranking; $\lambda'$ is ranked at least as high as $\lambda$, i.e. its position is at or before that of $\lambda$. We finally average over all images.
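The per-image computation can be sketched as follows (illustrative code; `scores` are the per-label logits for one image and `relevant` the set of ground-truth label indices):

```python
def avg_prec(scores, relevant):
    """Ranking-based average precision for a single image."""
    # rank labels by descending score; rank[label] is its 1-based position
    ranking = sorted(range(len(scores)), key=lambda k: -scores[k])
    rank = {label: pos + 1 for pos, label in enumerate(ranking)}
    total = 0.0
    for lab in relevant:
        # relevant labels ranked at or above this one
        higher = sum(1 for lab2 in relevant if rank[lab2] <= rank[lab])
        total += higher / rank[lab]
    return total / len(relevant)
```

A perfect ranking (all relevant labels before all irrelevant ones) yields 1.0.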
Additionally, we consider the average precision (AP) per individual attribute (tag). To calculate this, we consider, for all images, the value of the logit corresponding to the given label. We then compute precision and recall at different threshold levels, create a precision-recall curve, and average the precision over different recall values. However, we take into account the observation of [4, 9] that linear interpolation of the points on the precision-recall curve provides an overly optimistic measure of classifier performance. Therefore, we adopt a variant implementation of AP [25, 7] which does not interpolate the precision-recall curve. We finally calculate the weighted mean average precision (W_MAP), weighting by the frequency of instances per label.
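The non-interpolated AP for one attribute can be sketched as follows (illustrative code; each entry of `scores`/`labels` corresponds to one image, with `labels` holding the 0/1 ground truth for that attribute):

```python
def average_precision(scores, labels):
    """Non-interpolated AP for a single attribute over all images:
    average of the precision values at each true-positive position."""
    order = sorted(range(len(scores)), key=lambda k: -scores[k])
    tp, ap = 0, 0.0
    n_pos = sum(labels)
    for i, idx in enumerate(order, start=1):
        if labels[idx]:
            tp += 1
            ap += tp / i  # precision at this recall point, no interpolation
    return ap / n_pos if n_pos else 0.0
```

Because no interpolation is applied between recall points, this matches the stricter variant of AP rather than the optimistic trapezoidal estimate.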
The multi-label instance classification task requires a fine-grained dataset where each item is annotated with a set of individual labels, ideally with a balanced and sufficient number of images per label. There are a few fine-grained datasets available [29, 23, 15]; however, most of them do not include detailed part annotations. Since they were initially collected and prepared for fine-grained categorization, the only annotations provided are the classes the images belong to; for instance, the images in the FGVC aircraft dataset are annotated with the model variant, family, and manufacturer names.
On the other hand, datasets which are provided for multi-label classification at a high level of recognition such as COCO  are not suitable since the items are not labeled at a fine-grained level.
The CUB200 bird dataset is widely used in the fine-grained visual categorization domain. It contains 11788 images of 200 species of birds, where each image is provided with several attributes. There are 312 attributes in total, which form a vocabulary of the same size for our experiment. An example from this dataset along with its tags is shown in Figure 1.
In the original dataset, the id of each image is repeated per attribute (i.e. 312 separate lines are written for each image), and a binary value of 1 or 0 indicates the presence or absence of each attribute for the corresponding image. As usual in a multi-label image classification task, we need to provide our network with labels for each training image. We generate a binary vector of attributes of size 312 (the total number of attributes) per image: an index holding the value one represents the presence of that attribute, and zero means the corresponding attribute is absent. These attribute labels are extracted for each image in the training, test and validation sets to be used as input to the network. (The new format of the CUB200 bird train, test and validation sets will be made available.) Here, we do not take into account the confidence with which each attribute was annotated. For the 11788 images of the dataset this provides us with a matrix of size 11788 × 312. We use the same training set as previous work, with around 6000 images. From the total of 5794 test images we separate around 700 images for the validation set, which leaves 5094 images for the test set.
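The conversion from per-attribute annotation lines to binary label vectors can be sketched as follows (the field layout follows the CUB `image_attribute_labels.txt` convention of `<image_id> <attribute_id> <is_present> ...`; the function and variable names are ours, and extra fields such as the annotator's certainty are ignored, as described above):

```python
def build_label_matrix(lines, n_attrs=312):
    """Convert CUB-style per-attribute lines into binary label vectors.

    Each line: "<image_id> <attribute_id> <is_present> ..." with
    attribute ids numbered 1..n_attrs.
    Returns a dict mapping image_id -> list of n_attrs 0/1 values.
    """
    vecs = {}
    for line in lines:
        img_id, attr_id, present = line.split()[:3]
        vec = vecs.setdefault(img_id, [0] * n_attrs)
        if present == '1':
            vec[int(attr_id) - 1] = 1  # attribute ids are 1-based
    return vecs
```

Stacking the resulting vectors over all images gives the label matrix fed to the network.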
The 312 attributes in the CUB dataset are organized into 28 groups, including bill_shape, eye_colour, wing_shape, etc. Each group contains from three to 14 attribute values. For instance, the bill_shape could be curved, dagger, hooked, needle, hooked_seabird, spatulate, all-purpose, cone or specialized. As with the underpart_colour in Figure 1, which is both blue and grey, some groups of attributes may have more than one value present, which makes the task of retrieving the attributes even harder and makes the problem inherently multi-label.
We adopt the VGG16 architecture as the backbone for our multi-labelling experiments. We compare our FineTag architecture with a VGG16-based architecture, both trained with the ranking loss (denoted as VGG16+rank in Figure 2). For VGG16+rank, all layers are initialized with the weights pre-trained on ImageNet by VGG16. We repeated the experiments with two different optimizers: stochastic gradient descent with momentum and the Adam optimizer.
Similarly, we trained our FineTag architecture, initializing the convolutional layers with the weights of VGG16 trained on ImageNet classification. We initialized the 1×1 convolutional filter (ICA in Figure 2) with coefficients obtained from dimensionality reduction based on ICA, or alternatively on PCA. We explored different numbers of components, from three to 100, and found 20 components to be optimal. As with VGG16, we repeated the experiment with two optimizers: stochastic gradient descent with momentum and Adam; the results are shown in rows two and five of Table 1, respectively.
Finally, to make a fairer comparison of the transfer learning exploited by our architecture, we repeated the experiment for VGG16+rank using the pre-trained ImageNet weights only for the convolutional layers of VGG16, and randomly initialized the remaining fully connected layers with Xavier initialization. This way, both architectures use the same amount of transferred information.
The code is built on the TensorFlow framework, and the experiments are run on an NVIDIA Tesla V100 GPU. We train both networks for a few epochs. We found a batch size of 16 and a learning rate of 0.00001 with the Adam optimizer (0.0001 with stochastic gradient descent with momentum) to be the optimal settings for both networks.
Table 1 summarizes the weighted mean average precision (W_MAP) over all labels as well as the ranking-based average precision (AVGPREC) over all images (see Section 2.3). With both optimization methods, the FineTag architecture performs better. Moreover, FineTag is quite robust to the choice of optimizer: even with stochastic gradient descent with momentum it achieves almost the same results as with Adam, while the deep net's acceptable performance is highly dependent on the type of optimizer. Further, a fairer comparison between the FineTag architecture (last row of the table) and the VGG16+rank architecture with ImageNet pre-trained weights used only to initialize the convolutional layers (second row from the bottom) shows that FineTag performs significantly better on the CUB200 birds dataset.
Table 2 shows the weighted mean average precision (W_MAP) per group of labels. As mentioned before, there are 28 groups of attributes, listed in the header of Table 2 (split into two sections to fit the space). Each group contains between three and 14 attribute values, and the results in Table 2 are the average over all values per group. To have a fair evaluation, we take the number of images per label into account: a label can have as few as 15 images in the whole dataset or as many as almost 10000. This inevitably affects the training process; as is often the case with imbalanced data, when a label has very few images, its mean average precision is significantly lower (shown in Figure 3). This holds for both networks and both optimizers. To compensate for the unbalanced distribution of instances across labels, we multiply the mean average precision of each label by a weight calculated as the frequency of instances for that label over the whole dataset.
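The weighting scheme described above amounts to a frequency-weighted mean over the per-label AP values (illustrative sketch, names ours):

```python
def weighted_map(ap_per_label, count_per_label):
    """W_MAP: per-label AP weighted by label frequency.

    ap_per_label: AP of each label; count_per_label: number of
    instances of each label in the dataset.
    """
    total = sum(count_per_label)
    return sum(ap * c / total
               for ap, c in zip(ap_per_label, count_per_label))
```

Rare labels, whose AP estimates are noisy, thus contribute to the aggregate in proportion to their presence in the data.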
In Table 2, the first row of each section holds the results for the VGG16+rank architecture using ImageNet weights to initialize all layers, with the momentum optimizer (denoted vgg16R_aw_mom). The second row of each section shows the results of the FineTag architecture with the momentum optimizer (denoted FineTag_mom). The next two rows give the results for the VGG16+rank architecture with the Adam optimizer, once initializing all layers with pre-trained ImageNet weights and once initializing only the convolutional layers (vgg16R_aw_adam and vgg16R_cw_adam, respectively). Finally, the last row of the table holds the results for the FineTag architecture with the Adam optimizer (FineTag_adam). FineTag returns higher accuracy for most groups of attributes. Again, when both networks are trained under the same conditions, i.e. using the pre-trained ImageNet weights only for the convolutional layers, FineTag outperforms the VGG16+rank architecture. (To keep the analysis of this table visually simple, we highlight only the best results regardless of the chosen optimizer.)
As mentioned before, the FineTag architecture is a much smaller network in terms of the number of parameters than VGG16+rank.
In this paper, we introduced the FineTag architecture: a network for recognizing multiple tags of a single item (the multi-label item classification task) at a fine-grained level of detail, a task which has so far received little attention. We compared our proposed architecture with a deep net (VGG16) using a rank-based loss for classification. Our experiments on the CUB200 bird dataset showed that FineTag outperforms the baseline architecture on almost every group of attributes. The architecture is fully convolutional, with relatively high-resolution feature maps, which allows it to run on images of arbitrary size. Further, we showed that the FineTag architecture has very few parameters compared to the baseline architecture, a highly important characteristic for deployment in a retrieval system. Moreover, being a shallower network, FineTag is not strongly dependent on the type of optimizer and performs well even with a simple optimizer such as stochastic gradient descent with momentum.
One challenge that remains for the multi-label item classification task at a fine-grained level is the lack of suitable datasets. Training a network for fine tagging requires a well-balanced, densely labelled dataset. For our experiments, we adapted the CUB200 birds dataset to a format suitable for the network. However, even in the CUB200 bird dataset the number of images per label is not well proportioned, varying from 15 to almost 10000 images. In future experiments, we would like to address this issue by creating better datasets. Labelling the CUB200 bird dataset was a big challenge since it requires expert opinions, but other datasets such as the aircraft dataset could be labelled accurately for future experiments without expert knowledge.
-  Berg, T.L., Berg, A.C., Shih, J.: Automatic attribute discovery and characterization from noisy web data. In: European Conference on Computer Vision. pp. 663–676. Springer (2010)
-  Berg, T., Belhumeur, P.N.: Poof: Part-based one-vs.-one features for fine-grained categorization, face verification, and attribute estimation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 955–962. IEEE (2013)
-  Cui, Y., Song, Y., Sun, C., Howard, A., Belongie, S.: Large scale fine-grained categorization and domain-specific transfer learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 4109–4118 (2018)
-  Davis, J., Goadrich, M.: The relationship between precision-recall and roc curves. In: Proceedings of the 23rd international conference on Machine learning. pp. 233–240. ACM (2006)
-  Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 248–255. IEEE (2009)
-  Duan, K., Parikh, D., Crandall, D., Grauman, K.: Discovering localized attributes for fine-grained recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 3474–3481. IEEE (2012)
-  Everingham, M., Van Gool, L., Williams, C.K., Winn, J., Zisserman, A.: The pascal visual object classes (voc) challenge. International journal of computer vision 88(2), 303–338 (2010)
-  Farhadi, A., Endres, I., Hoiem, D., Forsyth, D.: Describing objects by their attributes. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 1778–1785. IEEE (2009)
-  Flach, P., Kull, M.: Precision-recall-gain curves: Pr analysis done right. In: Advances in Neural Information Processing Systems. pp. 838–846 (2015)
-  Fürnkranz, J., Hüllermeier, E., Mencía, E.L., Brinker, K.: Multilabel classification via calibrated label ranking. Machine learning 73(2), 133–153 (2008)
-  Glorot, X., Bengio, Y.: Understanding the difficulty of training deep feedforward neural networks. In: Proceedings of the thirteenth international conference on artificial intelligence and statistics. pp. 249–256 (2010)
-  Gong, Y., Jia, Y., Leung, T., Toshev, A., Ioffe, S.: Deep convolutional ranking for multilabel image annotation. arXiv preprint arXiv:1312.4894 (2013)
-  Hyvärinen, A.: Survey on independent component analysis (1999)
-  Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
-  Krause, J., Stark, M., Deng, J., Fei-Fei, L.: 3d object representations for fine-grained categorization. In: IEEE International Conference on Computer Vision Workshops (ICCVW). pp. 554–561. IEEE (2013)
-  Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. In: Advances in neural information processing systems. pp. 1097–1105 (2012)
-  Li, Y., Song, Y., Luo, J.: Improving pairwise ranking for multi-label image classification. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017)
-  Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft coco: Common objects in context. In: European conference on computer vision. pp. 740–755. Springer (2014)
-  Lin, T.Y., RoyChowdhury, A., Maji, S.: Bilinear cnn models for fine-grained visual recognition. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 1449–1457 (2015)
-  Liu, T.Y., et al.: Learning to rank for information retrieval. Foundations and Trends® in Information Retrieval 3(3), 225–331 (2009)
-  Liu, Z., Luo, P., Qiu, S., Wang, X., Tang, X.: Deepfashion: Powering robust clothes recognition and retrieval with rich annotations. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 1096–1104 (2016)
-  Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR). pp. 3431–3440 (2015)
-  Maji, S., Rahtu, E., Kannala, J., Blaschko, M., Vedaldi, A.: Fine-grained visual classification of aircraft. arXiv preprint arXiv:1306.5151 (2013)
-  Ruder, S.: An overview of gradient descent optimization algorithms. arXiv preprint arXiv:1609.04747 (2016)
-  Schütze, H., Manning, C.D., Raghavan, P.: Introduction to information retrieval, vol. 39. Cambridge University Press (2008)
-  Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)
-  Song, J., Song, Y.Z., Xiang, T., Hospedales, T.: Fine-grained image retrieval: the text/sketch input dilemma. In: Proceedings of the British Machine Vision Conference (BMVC). BMVA Press (2017)
-  Usunier, N., Buffoni, D., Gallinari, P.: Ranking with ordered weighted pairwise classification. In: Proceedings of the 26th Annual International Conference on Machine Learning. pp. 1057–1064. ICML ’09, ACM, New York, NY, USA (2009). https://doi.org/10.1145/1553374.1553509, http://doi.acm.org/10.1145/1553374.1553509
-  Wah, C., Branson, S., Welinder, P., Perona, P., Belongie, S.: The caltech-ucsd birds-200-2011 dataset (2011)
-  Wang, J., Leung, T., Rosenberg, C., Wang, J., Philbin, J., Chen, B., Wu, Y., et al.: Learning fine-grained image similarity with deep ranking. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 1386–1393 (2014)
-  Wei, X.S., Luo, J.H., Wu, J., Zhou, Z.H.: Selective convolutional descriptor aggregation for fine-grained image retrieval. IEEE Transactions on Image Processing 26(6), 2868–2881 (2017)
-  Weston, J., Bengio, S., Usunier, N.: Wsabie: Scaling up to large vocabulary image annotation. In: IJCAI. vol. 11, pp. 2764–2770 (2011)
-  Weston, J., Bengio, S., Usunier, N.: Wsabie: Scaling up to large vocabulary image annotation. In: Proceedings of the International Joint Conference on Artificial Intelligence, IJCAI (2011)
-  Zhang, N., Donahue, J., Girshick, R., Darrell, T.: Part-based r-cnns for fine-grained category detection. In: European conference on computer vision. pp. 834–849. Springer (2014)
-  Zhang, N., Farrell, R., Iandola, F., Darrell, T.: Deformable part descriptors for fine-grained recognition and attribute prediction. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 729–736 (2013)
-  Zhao, F., Huang, Y., Wang, L., Tan, T.: Deep semantic ranking based hashing for multi-label image retrieval. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 1556–1564. IEEE (2015)