Fine-grained visual categorization (FGVC) is an important but challenging task due to high intra-class variances and low inter-class variances caused by deformation, occlusion, illumination, etc. We present an attention convolutional binary neural tree architecture to address these problems for weakly supervised FGVC. Specifically, we incorporate convolutional operations along the edges of the tree structure, and use the routing functions in each node to determine the root-to-leaf computational paths within the tree. The final decision is computed as the summation of the predictions from the leaf nodes. The deep convolutional operations learn to capture the representations of objects, and the tree structure characterizes the coarse-to-fine hierarchical feature learning process. In addition, we use the attention transformer module to force the network to capture discriminative features. The negative log-likelihood loss is used to train the entire network end to end by SGD with back-propagation. Experiments on the CUB-200-2011, Stanford Cars, and Aircraft datasets demonstrate that the proposed method performs favorably against state-of-the-art methods.
The high intra-class and low inter-class visual variances caused by deformation, occlusion, and illumination make FGVC a highly challenging task. Recently, the FGVC task has been dominated by convolutional neural networks (CNNs) due to their strong classification performance. Several algorithms [27, 24] focus on extracting discriminative subtle parts for accurate results. However, it is hard for a single deep CNN model to describe the differences between subordinate classes, see Figure 1.
The object-part attention model for FGVC uses both object and part attention to exploit the subtle, local differences that distinguish subcategories, which demonstrates the effectiveness of using multiple deep models that concentrate on different object regions.
Inspired by prior work on neural trees, we design an attention convolutional binary neural tree architecture (ACNet) for weakly supervised FGVC, which incorporates convolutional operations along the edges of the tree structure and uses the routing functions in each node to determine the root-to-leaf computational paths within the tree, as in deep neural networks. With this architecture, our method inherits both the representation learning ability of the deep convolutional model and the coarse-to-fine hierarchical feature learning process. In this way, different branches in the tree structure focus on different local object regions for classification. The final decision is computed as the summation of the predictions from all leaf nodes. Meanwhile, we use the attention transformer module to force the network to capture discriminative features for accurate results. The negative log-likelihood loss is adopted to train the entire network end to end by stochastic gradient descent (SGD) with back-propagation.
Notably, in contrast to methods that adaptively grow the tree structure during learning, our method uses a complete binary tree structure with a pre-specified depth, which is data-independent. In addition, the attention transformer module is used to further help our network achieve better performance. Several experiments are conducted on the CUB-200-2011, Stanford Cars, and Aircraft datasets, demonstrating the favorable performance of the proposed method compared to the state-of-the-art methods. We also carry out an ablation study to comprehensively understand the influence of the different components of the proposed method. The main contributions of this paper are summarized as follows.
We present an attention convolutional binary neural tree architecture for FGVC, which incorporates convolutional operations along the edges of the tree structure and uses the routing functions in each node to determine the root-to-leaf computational paths within the tree. The final decision is summed over all predictions from the leaf nodes.
The attention transformer module is introduced to enforce the network to capture discriminative features for accurate results.
Extensive experiments conducted on three challenging datasets, i.e., CUB-200-2011, Stanford Cars, and Aircraft, demonstrate that our method performs favorably against state-of-the-art methods.
Deep supervised methods.
Some algorithms use object annotations or even dense part/keypoint annotations to guide the training of deep CNN models for FGVC. Zhang et al. propose to learn two detectors, i.e., a whole-object detector and a part detector, to predict the fine-grained categories based on the pose-normalized representation. Liu et al. propose a fully convolutional attention network that glimpses local discriminative regions to adapt to different fine-grained domains. Another work constructs the part-stacked CNN architecture, which explicitly explains the fine-grained recognition process by modeling subtle differences between object parts. However, these methods rely on labor-intensive part annotations, which limits their application in real scenarios.
Deep weakly supervised methods.
To that end, more recent methods require only image-level annotations. Zheng et al. introduce a multi-attention CNN model, where the part generation and feature learning processes reinforce each other for accurate results. Fu et al. develop a recurrent attention module to recursively learn discriminative region attention and region-based feature representation at multiple scales in a mutually reinforced way. Recently, Sun et al. regulate multiple object parts among different input images by using multiple attention region features of each input image. However, the aforementioned methods merely integrate the attention mechanism in a single network, which limits their performance.
Decision trees are effective algorithms for classification tasks. A decision tree selects the appropriate direction based on the characteristics of the features. Its inherent interpretability makes it a promising direction for gaining insight into the internal mechanisms of deep learning. Xiao proposes the principle of the fully functioned neural graph and designs a neural decision tree model for the categorization task. Frosst and Hinton develop a deep neural decision tree model to understand the decision mechanism for a particular test case in a learned network. In our work, we integrate the decision tree with a neural network to implement sub-branch selection and representation learning simultaneously.
Attention mechanisms have played an important role in deep learning by mimicking the human visual mechanism. In one line of work, attention is used to make sure the student model focuses on the same discriminative regions as the teacher model. Another approach applies a cascade attention mechanism on different layers of a CNN and concatenates the results to obtain a discriminative representation as the input of the final linear classifier. Hu et al. apply the attention mechanism to channels and allocate different weights according to the contribution of each channel. The CBAM module combines spatial region attention with feature map attention. In contrast to the aforementioned methods, we apply the attention mechanism on each branch of the tree architecture to seek the discriminative regions for classification.
Our ACNet model aims to classify each object sample in $X$ into sub-categories, i.e., to assign each sample in $X$ a category label from $Y$. The model consists of four modules, i.e., the backbone network, the branch routing, the attention transformer, and the label prediction modules, as shown in Figure 2. We define the ACNet as a pair $(\mathbb{T}, \mathbb{O})$, where $\mathbb{T}$ defines the topology of the tree and $\mathbb{O}$ denotes the set of operations along the edges of $\mathbb{T}$. Notably, we use the full binary tree $\mathbb{T} = \{\mathcal{V}, \mathcal{E}\}$, where $\mathcal{V} = \{v_1, \dots, v_n\}$ is the set of nodes, $n$ is the total number of nodes, $\mathcal{E} = \{e_1, \dots, e_k\}$ is the set of edges between nodes, and $k$ is the total number of edges. Since we use the full binary tree $\mathbb{T}$, we have $n = 2^h - 1$ and $k = 2^h - 2$, where $h$ is the height of $\mathbb{T}$. Each node in $\mathbb{T}$ is formed by a routing module determining the sending path of samples, and the attention transformers are used as the operations along the edges.
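The tree topology described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `full_binary_tree`, the heap-style node numbering, and the convention that a single node has height 1 are choices made for this sketch and are not taken from the paper.

```python
def full_binary_tree(height):
    """Return (nodes, edges) of a full binary tree with `height` levels.

    Assumption of this sketch: nodes are numbered 1..n in breadth-first
    order, so node i has children 2*i and 2*i + 1, and a tree with a
    single node has height 1.
    """
    if height < 1:
        raise ValueError("height must be at least 1")
    n = 2 ** height - 1
    nodes = list(range(1, n + 1))
    edges = []
    for parent in nodes:
        for child in (2 * parent, 2 * parent + 1):
            if child <= n:
                edges.append((parent, child))
    return nodes, edges


nodes, edges = full_binary_tree(3)
print(len(nodes), len(edges))
```

Under this convention every non-root node contributes exactly one edge to its parent, so the edge count is always one less than the node count.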
Meanwhile, we use an asymmetrical architecture in the full binary tree, i.e., two attention transformers are used on the left edge, and one attention transformer is used on the right edge. In this way, the network is able to capture features at different scales for accurate results. The detailed architecture of our ACNet model is described as follows.
Backbone network module.
Since the discriminative regions in fine-grained categories are highly localized, we need to use a relatively small receptive field for the extracted features by constraining the size and stride of the convolutional filters and pooling kernels. The truncated VGG-16 model (i.e., retaining the layers from conv1_1 to conv4_3), pre-trained on the ILSVRC CLS-LOC dataset, is used as the backbone network module to extract features. Similar to prior work, we use an input image size different from the default. Notably, ACNet can also work with other pre-trained networks, such as ResNet and Inception V2.
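To make the receptive-field argument concrete, the following sketch computes the receptive field and cumulative stride at conv4_3. It assumes the standard VGG-16 configuration (3x3 convolutions with stride 1 and 2x2 max-pooling with stride 2); the helper name `receptive_field` is introduced here for illustration only.

```python
# (name, kernel size, stride) for the standard VGG-16 layers up to conv4_3.
TRUNCATED_VGG16 = [
    ("conv1_1", 3, 1), ("conv1_2", 3, 1), ("pool1", 2, 2),
    ("conv2_1", 3, 1), ("conv2_2", 3, 1), ("pool2", 2, 2),
    ("conv3_1", 3, 1), ("conv3_2", 3, 1), ("conv3_3", 3, 1), ("pool3", 2, 2),
    ("conv4_1", 3, 1), ("conv4_2", 3, 1), ("conv4_3", 3, 1),
]


def receptive_field(layers):
    """Return (receptive field, cumulative stride) after the given layers."""
    rf, jump = 1, 1
    for _, kernel, stride in layers:
        rf += (kernel - 1) * jump  # each layer widens the field by (k - 1) input steps
        jump *= stride             # distance in input pixels between adjacent outputs
    return rf, jump


rf, stride = receptive_field(TRUNCATED_VGG16)
print(rf, stride)
```

Truncating after conv4_3 skips the later pooling and convolution stages of the full network, which keeps both the stride and the receptive field smaller than they would be at the end of VGG-16.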
Branch routing module.
As described above, we use the branch routing module to determine which child (i.e., left or right) each sample is sent to. Specifically, as shown in Figure 2(b), the $i$-th routing module at the $k$-th layer, $R_i^k(\cdot)$, uses one convolutional layer followed by a global context block. The global context block is an improvement over the simplified non-local (NL) block and the Squeeze-Excitation (SE) block: it shares the implementation of the context modeling and fusion steps with the simplified NL block, and shares the transform step with the SE block. In this way, the context information is integrated to better describe the objects. After that, we use global average pooling, element-wise square-root and L2 normalization, and a fully connected layer with the sigmoid activation function to produce a scalar value in $[0, 1]$ indicating the probability of the sample being sent to the left or right sub-branch. Let $\phi_i^k(x_j)$ denote the probability, produced by the branch routing module $R_i^k(x_j)$, of the $j$-th sample $x_j$ being sent to the right sub-branch, where $\phi_i^k(x_j) \in [0, 1]$ and $i = 1, \dots, 2^{k-1}$. The probability of the sample being sent to the left sub-branch is then $1 - \phi_i^k(x_j)$. If $\phi_i^k(x_j)$ is larger than $0.5$, we prefer the right path over the left one; otherwise, the left branch dominates the final decision.
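The tail of the routing module (pooling, normalization, and the sigmoid output) can be sketched as follows. This is a simplified illustration, not the authors' implementation: the convolutional layer and the global context block are omitted, the square root is taken in signed form so that negative activations are handled, and the names `routing_probability`, `weights`, and `bias` are hypothetical.

```python
import math


def routing_probability(feature_map, weights, bias):
    """Probability of sending a sample to the right child.

    feature_map: list of C channels, each an H x W nested list of floats.
    weights, bias: parameters of a fully connected layer with one output.
    """
    # Global average pooling: one scalar per channel.
    pooled = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in feature_map]
    # Element-wise square root (signed, an assumption of this sketch).
    rooted = [math.copysign(math.sqrt(abs(v)), v) for v in pooled]
    # L2 normalization; the epsilon guards against division by zero.
    norm = math.sqrt(sum(v * v for v in rooted)) + 1e-12
    normed = [v / norm for v in rooted]
    # Fully connected layer followed by the sigmoid activation.
    logit = sum(w * v for w, v in zip(weights, normed)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


features = [[[1.0, 3.0], [5.0, 7.0]], [[0.0, 0.0], [4.0, 0.0]]]
p_right = routing_probability(features, weights=[0.8, -0.3], bias=0.1)
p_left = 1.0 - p_right
print(p_right > 0.5, p_left)
```

Because the two probabilities are complementary, a single scalar output is enough to weight both children.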
Attention transformer module.
Inspired by [14, 37], we introduce an attention module in the transformers to force the network to capture discriminative features, see Figure 3. Specifically, following a convolutional layer, we insert an attention module that generates a channel attention map using a batch normalization (BN) layer, a global average pooling (GAP) layer, a fully connected (FC) layer with the ReLU activation function, and a second FC layer with the sigmoid function. In this way, the network is guided to focus on meaningful features for accurate results.
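The channel attention described above can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions: the batch normalization layer is left out for brevity, the attention map is assumed to rescale the input channels multiplicatively, and the names `channel_attention`, `w1`, `b1`, `w2`, and `b2` are hypothetical.

```python
import math


def channel_attention(feature_map, w1, b1, w2, b2):
    """Rescale channels with gates from GAP -> FC + ReLU -> FC + sigmoid.

    feature_map: list of C channels, each an H x W nested list.
    w1, b1: first FC layer (hidden x C weights, hidden biases).
    w2, b2: second FC layer (C x hidden weights, C biases).
    """
    # Global average pooling gives one scalar per channel.
    squeezed = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in feature_map]
    # First FC layer followed by ReLU.
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed)) + b)
              for row, b in zip(w1, b1)]
    # Second FC layer followed by sigmoid: one gate in (0, 1) per channel.
    gates = [1.0 / (1.0 + math.exp(-(sum(w * h for w, h in zip(row, hidden)) + b)))
             for row, b in zip(w2, b2)]
    # Multiply every value of a channel by that channel's gate.
    scaled = [[[v * g for v in row] for row in ch]
              for ch, g in zip(feature_map, gates)]
    return scaled, gates


features = [[[1.0, 2.0], [3.0, 4.0]], [[0.5, 0.5], [0.5, 0.5]]]
scaled, gates = channel_attention(
    features, w1=[[0.4, -0.2]], b1=[0.0], w2=[[1.0], [-1.0]], b2=[0.0, 0.0]
)
print(gates)
```

In this toy call the first channel receives the larger gate, so its responses are attenuated less than those of the second channel.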
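Label prediction module.
For each leaf node in our ACNet model, we use a label prediction module $P_i$ (i.e., $i = 1, \dots, 2^{h-1}$) to predict the subordinate category of the object $x_j$, see Figure 2. Let $r_i^k(x_j)$ be the accumulated probability of the object $x_j$ passing from the root node to the $i$-th node at the $k$-th layer, i.e., the product of the routing probabilities along that path. For example, if the path from the root passes through $R_1^1, R_1^2, \dots, R_1^{k-1}$ to the first node at the $k$-th layer, i.e., the object $x_j$ is always sent to the left child, we have $r_1^k(x_j) = \prod_{l=1}^{k-1} \big(1 - \phi_1^l(x_j)\big)$, where $\phi_1^l(x_j)$ is the probability, output by the routing module $R_1^l$, of sending $x_j$ to the right child. As shown in Figure 2, the label prediction module is formed by a batch normalization layer, a convolutional layer, a max-pooling layer, a sqrt and L2 normalization layer, and a fully connected layer. Then, the final prediction $C(x_j)$ of the $j$-th object $x_j$ is computed as the summation of all leaf predictions multiplied by the accumulated probabilities generated by the branch routing modules along the path, i.e., $C(x_j) = \sum_{i=1}^{2^{h-1}} P_i(x_j)\, r_i^h(x_j)$. We would like to emphasize that $\|C(x_j)\|_1 = 1$, i.e., the confidences of $x_j$ belonging to all subordinate classes sum to $1$,
$$\|C(x_j)\|_1 = \Big\| \sum_{i=1}^{2^{h-1}} P_i(x_j)\, r_i^h(x_j) \Big\|_1 = 1,$$
where $r_i^h(x_j)$ is the accumulated probability of the $i$-th node at the leaf layer. We present a short argument that $\|C(x_j)\|_1 = 1$ as follows.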
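The weighted summation of leaf predictions can be sketched as follows. The data layout is an assumption of this sketch: `routing[k][i]` holds the right-branch probability of the i-th routing module at layer k (0-indexed), the leaf predictions are assumed to be normalized class distributions, and both function names are hypothetical.

```python
def leaf_path_probabilities(routing):
    """Accumulated probability of reaching each leaf, ordered left to right."""
    probs = [1.0]  # the root is reached with probability 1
    for layer in routing:
        children = []
        for parent_prob, right in zip(probs, layer):
            children.append(parent_prob * (1.0 - right))  # left child
            children.append(parent_prob * right)          # right child
        probs = children
    return probs


def final_prediction(leaf_predictions, leaf_probs):
    """Sum of the leaf class distributions weighted by their path probabilities."""
    num_classes = len(leaf_predictions[0])
    return [sum(r * pred[c] for r, pred in zip(leaf_probs, leaf_predictions))
            for c in range(num_classes)]


# A tree with three layers: one root router, two routers below it, four leaves.
routing = [[0.3], [0.6, 0.1]]
leaf_predictions = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5], [0.4, 0.6]]
leaf_probs = leaf_path_probabilities(routing)
prediction = final_prediction(leaf_predictions, leaf_probs)
print(leaf_probs, prediction)
```

Since every leaf distribution sums to one and the path probabilities sum to one, the combined prediction is again a distribution over classes.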
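Let $r_i^k(x_j)$ be the accumulated probability of the $i$-th branch routing module $R_i^k$ at the $k$-th layer, and let $\phi_i^k(x_j)$ be the probability with which $R_i^k$ sends $x_j$ to its right child. The accumulated probabilities of the left and right children of $R_i^k$ are then $r_{2i-1}^{k+1}(x_j)$ and $r_{2i}^{k+1}(x_j)$, respectively. First, we show that the sum of the accumulated probabilities $r_{2i-1}^{k+1}(x_j)$ and $r_{2i}^{k+1}(x_j)$ is equal to the accumulated probability of their parent $r_i^k(x_j)$. That is,
$$r_{2i-1}^{k+1}(x_j) + r_{2i}^{k+1}(x_j) = \big(1 - \phi_i^k(x_j)\big)\, r_i^k(x_j) + \phi_i^k(x_j)\, r_i^k(x_j) = r_i^k(x_j).$$
Meanwhile, since we use the full binary tree in our ACNet model, the leaves can be grouped into sibling pairs, and we have
$$\sum_{i=1}^{2^{h-1}} r_i^h(x_j) = \sum_{i=1}^{2^{h-2}} \big( r_{2i-1}^h(x_j) + r_{2i}^h(x_j) \big).$$
Based on the above two equations, we can further get
$$\sum_{i=1}^{2^{h-1}} r_i^h(x_j) = \sum_{i=1}^{2^{h-2}} r_i^{h-1}(x_j).$$
This process is carried out iteratively, and we have
$$\sum_{i=1}^{2^{h-1}} r_i^h(x_j) = \dots = r_1^1(x_j) = 1.$$
Provided that each leaf prediction is itself a normalized distribution (e.g., a softmax output), the confidences in the final prediction therefore also sum to $1$.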
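The iterative argument above can be checked numerically. The sketch below draws random routing probabilities for a tree with four layers (an arbitrary choice for this illustration) and verifies that each pair of children sums to its parent and that every layer sums to one; the function name `accumulate` and the data layout are hypothetical.

```python
import random


def accumulate(routing):
    """Return the accumulated probabilities of every layer, root layer first.

    routing[k][i] is the right-branch probability of the i-th routing
    module at layer k (0-indexed); this layout is an assumption of the sketch.
    """
    layers = [[1.0]]
    for layer in routing:
        children = []
        for parent_prob, right in zip(layers[-1], layer):
            children.extend([parent_prob * (1.0 - right), parent_prob * right])
        layers.append(children)
    return layers


random.seed(0)
num_layers = 4
routing = [[random.random() for _ in range(2 ** k)] for k in range(num_layers - 1)]
layers = accumulate(routing)

# Step 1 of the argument: the two children always sum to their parent.
for parents, children in zip(layers, layers[1:]):
    for i, parent_prob in enumerate(parents):
        assert abs(children[2 * i] + children[2 * i + 1] - parent_prob) < 1e-12

# Consequence: every layer, including the leaf layer, sums to one.
layer_sums = [sum(layer) for layer in layers]
print(layer_sums)
```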
In the training phase, we use the cropping and flipping operations to augment data to construct a robust model to adapt to variations of objects. That is, we first rescale the original images such that their shorter side is pixels. After that, we randomly crop the patches with the size , and randomly flip them to generate the training samples.
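The geometry of this augmentation pipeline can be sketched as follows. The concrete sizes used in the example call are hypothetical placeholders, not the values used in the paper, and the function name `augment_geometry` is introduced only for this illustration; the sketch computes the rescaled size, the crop box, and the flip decision without touching pixel data.

```python
import random


def augment_geometry(width, height, short_side, crop_size, rng=random):
    """Return (new_width, new_height, crop_box, flip) for one training sample.

    The image is rescaled so that its shorter side equals `short_side`,
    then a random `crop_size` x `crop_size` box and a random horizontal
    flip are drawn. Requires crop_size <= short_side.
    """
    scale = short_side / min(width, height)
    new_w = max(round(width * scale), short_side)
    new_h = max(round(height * scale), short_side)
    left = rng.randint(0, new_w - crop_size)  # randint is inclusive on both ends
    top = rng.randint(0, new_h - crop_size)
    flip = rng.random() < 0.5
    return new_w, new_h, (left, top, left + crop_size, top + crop_size), flip


# Hypothetical sizes, chosen only to exercise the function.
rng = random.Random(0)
new_w, new_h, box, flip = augment_geometry(640, 480, short_side=256, crop_size=224, rng=rng)
print(new_w, new_h, box, flip)
```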
The loss function for our ACNet is formed by two parts, i.e., the loss for the predictions of the leaf nodes, and the loss for the final prediction, which is computed by the summation over all predictions from the leaf nodes. That is,
$$L = L\big(C(x_j), y^*\big) + \sum_{i=1}^{2^{h-1}} L\big(P_i(x_j), y^*\big),$$
where $h$ is the height of the tree, $L(C(x_j), y^*)$ is the negative log-likelihood loss of the final prediction $C(x_j)$ and the ground truth label $y^*$, and $L(P_i(x_j), y^*)$ is the negative log-likelihood loss of the $i$-th leaf prediction $P_i(x_j)$ and the ground truth label $y^*$.
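The two-part loss can be sketched as follows, assuming that the final prediction and every leaf prediction are given as class probability lists and that the two parts are added with equal weight; the function names and the small epsilon inside the logarithm are choices of this sketch.

```python
import math


def nll(probabilities, label):
    """Negative log-likelihood of the ground truth class."""
    return -math.log(probabilities[label] + 1e-12)  # epsilon avoids log(0)


def acnet_loss(final_prediction, leaf_predictions, label):
    """Loss of the final prediction plus the losses of all leaf predictions."""
    return nll(final_prediction, label) + sum(nll(p, label) for p in leaf_predictions)


final = [0.7, 0.3]
leaves = [[0.9, 0.1], [0.5, 0.5]]
loss = acnet_loss(final, leaves, label=0)
print(loss)
```

Supervising the leaves individually, in addition to the combined output, gives every branch a direct training signal.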
The backbone network in our ACNet method is pre-trained on the ILSVRC CLS-LOC dataset . The “xavier” method  is used to randomly initialize the parameters of the convolutional layers. The entire training process is formed by two stages. For the first stage, the parameters in the truncated VGG-16 network are fixed, and other parameters are trained with epochs. The batch size is set to in training with the initial learning rate . The learning rate is gradually divided by at the -th, -th, -th, and -th epochs. In the second stage, we fine-tune the entire network for epochs. We use the batch size in training with the initial learning rate . The learning rate is gradually divided by at the -th, -th, and -th epochs. We use the SGD algorithm to train the network with momentum, and weight decay in the first stage and weight decay in the second stage.
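The step-wise learning rate decay used in both stages can be sketched as a small helper. The base learning rate, milestones, and division factor in the example are hypothetical placeholders rather than the paper's settings, and the function name `step_learning_rate` is introduced only for this illustration.

```python
def step_learning_rate(base_lr, epoch, milestones, factor):
    """Learning rate at `epoch` when it is divided by `factor` at each milestone."""
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr / (factor ** drops)


# Hypothetical schedule, for illustration only.
schedule = [step_learning_rate(1.0, e, milestones=[10, 20], factor=10) for e in (0, 9, 10, 25)]
print(schedule)
```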
|LRB P ||84.2|
|Improved B-CNN ||85.8|
|KERL w/ HR ||87.0|
Several experiments are conducted on three FGVC datasets, i.e., CUB-200-2011, Stanford Cars, and Aircraft, to demonstrate the effectiveness of the proposed method. Our method is implemented in the Caffe library. All source code of the proposed method will be made publicly available after the paper is accepted. All models are trained on a workstation with a 3.26 GHz Intel processor, 512 GB memory, and eight Nvidia V100 GPUs.
|Improved B-CNN ||92.0|
|Improved B-CNN ||88.5|
The Caltech-UCSD birds dataset (CUB-200-2011)  consists of annotated images in subordinate categories, including images for training and images for testing. The fine-grained classification results are shown in Table 1.
As shown in Table 1, the best supervised method, i.e., PN-CNN, which uses both object and part level annotations, produces top-1 accuracy on the CUB-200-2011 dataset. (Notably, supervised methods require object or part level annotations, demanding significant human effort. Thus, most recent methods are weakly supervised, and the state-of-the-art weakly supervised methods now surpass the performance of previous supervised methods.) Without part-level annotation, MAMC produces top-1 accuracy using two attention branches to learn discriminative features in different regions. KERL w/ HR designs a single deep gated graph neural network to learn discriminative features, achieving better top-1 accuracy. Compared to the state-of-the-art weakly supervised methods [5, 8, 35], our method achieves the best top-1 accuracy. This is attributed to the designed attention transformer module and the coarse-to-fine hierarchical feature learning process.
The Stanford Cars dataset  contains images from classes, which is formed by images for training and images for testing. The subordinate categories are determined by the Make, Model, and Year of cars.
As shown in Table 2, previous methods using part-level annotations (i.e., FCAN and PA-CNN) reach only limited top-1 accuracy. The recent weakly supervised method WS-DAN designs an attention-guided data augmentation strategy to exploit discriminative object parts. Without using any elaborate data augmentation strategy, our method achieves the best top-1 accuracy.
|Height of the Tree||Top-1 Acc. (%)|
|Mode||Level||Leaf Node||Top-1 Acc. (%)|
The Aircraft dataset  is a fine-grained dataset of different aircraft variants formed by annotated images, which is divided into two subsets, i.e., the training set with images and the testing set with images. Specifically, the category labels are determined by the Model, Variant, Family and Manufacturer of airplanes. The evaluation results are presented in Table 3.
As shown in Table 3, our model performs on par with the state-of-the-art method MA-CNN , i.e., vs. top-1 accuracy. The operations along different root-to-leaf paths in our tree architecture focus on exploiting discriminative features on different object regions, which help each other to achieve the best performance in FGVC.
We conduct several ablation experiments to study the influence of some important parameters and different components of our ACNet method on the CUB-200-2011 dataset.
Effectiveness of the tree architecture.
To validate the effectiveness of the tree architecture design, we construct two variants, i.e., VGG and w/ Tree, of our ACNet method. Specifically, we construct the VGG method by only using the VGG-16 backbone network for classification, and further integrate the tree architecture to form the w/ Tree method. The evaluation results are reported in Figure 5. As shown in Figure 5, we find that using the tree architecture significantly improves the accuracy, i.e., improvements in top-1 accuracy, which demonstrates the effectiveness of the designed tree architecture in our ACNet method.
In addition, we use the Grad-CAM method to generate heatmaps that visualize the responses of the leaf nodes in our ACNet model on the CUB-200-2011 dataset in Figure 4. As shown in Figure 4, different leaf nodes concentrate on different regions of the images. For example, the leaf node corresponding to the first column focuses more on the background region, the leaf node corresponding to the second column focuses more on the head region, and the other two leaf nodes are more interested in the patches of wings and tail. The different leaf nodes complement each other to construct a more effective model for accurate results.
Height of the tree.
To explore the effect of the height of the tree, we construct four variants with different tree heights in Table 4. Notably, the tree degenerates to a single node at the smallest height, i.e., only the backbone VGG-16 network is used for classification. As shown in Table 4, our ACNet achieves its best top-1 accuracy at an intermediate height. With a smaller height, the ACNet model has a limited number of parameters, which are not enough to represent the significant variations of the subordinate categories. With a larger height, however, too many parameters combined with a limited amount of training data cause the ACNet model to overfit, which greatly affects the performance.
Asymmetrical architecture of the tree.
To validate the effectiveness of the asymmetrical architecture design, we construct two variants, i.e., one using the symmetrical architecture and another using the asymmetrical architecture, with the tree height fixed. The evaluation results are reported in Table 5. As shown in Table 5, the asymmetrical architecture improves the top-1 accuracy over the symmetrical one. We speculate that the asymmetrical architecture is able to fuse various features with different receptive fields for better performance.
|Pooling||Top-1 Acc. (%)|
Effectiveness of the attention transformer module.
We construct a variant of the proposed ACNet model, “w/ Tree-Attn”, to validate the effectiveness of the attention transformer module in Figure 5. Specifically, we add the attention block to the transformer module of the “w/ Tree” method to construct the “w/ Tree-Attn” method. As shown in Figure 5, the “w/ Tree-Attn” method performs consistently better than the “w/ Tree” method, producing higher top-1 accuracy across different numbers of channels, which demonstrates that the attention mechanism is effective for fine-grained classification.
Components in the branch routing module.
We analyze the effectiveness of the global context block in the branch routing module in Figure 5. As shown in Figure 5, our ACNet method produces the best results across different numbers of channels in the branch routing module. After removing the global context block, the top-1 accuracy drops on average, which demonstrates that the global context block is useful for improving the accuracy of fine-grained classification.
Meanwhile, we also study the effectiveness of the pooling strategy in the branch routing module in Table 6. As shown in Table 6, using global max-pooling (GMP) instead of global average pooling (GAP) leads to a drop in top-1 accuracy on the CUB-200-2011 dataset. We speculate that the GAP operation encourages the filter to focus on regions with a high average response instead of only the maximal ones, which integrates more context information for better performance.
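A toy example may help illustrate the intuition behind this comparison. The two single-channel maps below are made up for illustration: they share the same maximum but differ in how widely the response is spread, which only average pooling can distinguish.

```python
def global_average_pool(channel):
    """Mean of all values in an H x W nested list."""
    return sum(map(sum, channel)) / (len(channel) * len(channel[0]))


def global_max_pool(channel):
    """Maximum of all values in an H x W nested list."""
    return max(max(row) for row in channel)


peaked = [[0.0, 0.0], [0.0, 4.0]]  # a single strong response
spread = [[4.0, 4.0], [4.0, 4.0]]  # a strong response everywhere

gap_values = (global_average_pool(peaked), global_average_pool(spread))
gmp_values = (global_max_pool(peaked), global_max_pool(spread))
print(gap_values, gmp_values)
```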
In addition, we use the Grad-CAM method to visualize the focus of three different branch routing modules from Figure 2 in Figure 6. As shown in Figure 6, different branch routing modules focus on different discriminative regions. For example, the feature maps of the first module pay more attention to the whole bird region, while the feature maps of the other two modules focus more on the wings and head regions of the bird, see the Bobolink example in the first row of Figure 6. This phenomenon demonstrates that the hierarchical feature extraction process in the tree architecture gradually forces our model to focus on more discriminative detail regions of the object.
In this paper, we present an attention convolutional binary neural tree for weakly supervised FGVC, which incorporates convolutional operations along the edges of the tree structure and uses the routing functions in each node to determine the root-to-leaf computational paths within the tree. The final decision is computed as the summation of the predictions from the leaf nodes. To force the network to capture discriminative features for accurate results, we insert the attention transformer module into the convolutional operations along the edges. The entire network is trained end to end by the SGD optimization method with the negative log-likelihood loss. Extensive experiments are conducted on three challenging datasets, i.e., CUB-200-2011, Stanford Cars, and Aircraft, demonstrating the favorable performance of the proposed method against state-of-the-art methods.