Deep convolutional neural networks (CNNs) have achieved great success in image processing[30, 31, 17, 11] and have been applied to object detection and recognition[13, 24, 28] with strong performance. As a simple, noninvasive treatment with few side effects, Chinese herbs are widely used in China and a number of other Asian countries for healthcare[39, 37]. Automatic recognition of Chinese herbs therefore has wide practical value. However, as far as we know, there is no prior research on this task, and training models for herbal recognition is difficult because sufficient herb data is lacking.
In this paper, we are the first to propose a CNN model for the Chinese herbal recognition task, and we present a standard dataset for it. The task differs from both regular object recognition[13, 17] and fine-grained image recognition[10, 43]: the former focuses on distinguishing the outline and shape of objects, while the latter needs more detailed features in order to separate classes with similar shapes but different details. Chinese herbal recognition is confronted with both cases: (a) some herbs are so distinctive that they are easily classified by shape features rather than detailed features; (b) some herbs with similar shapes need to be classified by more fine-grained features. The features extracted from convolution layers of different depths are rich in diversity: features from earlier layers are more representational, whereas features from deeper layers are more abstract and carry more semantics[24, 22]. To address these challenges, we choose Feature Pyramid Networks (FPN) to merge features from different levels, diversifying the image features overall and improving the performance of herbal recognition with CNNs.
Unlike the traditional FPN, we are the first to introduce channel-wise attention into the process of fusing features from different levels. In this way, our models can dynamically adjust the weights of features from different levels and thus adaptively control how strongly the features encoded at each level are selected. Furthermore, we add spatial attention to spatially recalibrate the features misaligned by the several upsampling or downsampling operators applied during feed-forward propagation.
More importantly, both the channel-wise and the spatial attention are improved in this paper as follows: (a) The original SE mechanism is limited to re-scaling the weights of features from the same layer, whereas the competitive attention proposed in this paper extends the modeling range of channel-wise attention (and, likewise, of spatial attention) and explicitly models the competitive channel dependencies between spatial and semantic information during fusion at the various levels. (b) The feature maps from the bottom-up pathway, which are rich in spatial information, are introduced into the spatial attention modeling as references for recalibrating the misaligned and spatially coarser features from the top-down pathway. With these improvements, tailored to our specific structure and task, we can jointly model the channel relationships across levels and the channel dependencies between the spatial and semantic information flows, and spatially recalibrate misaligned features.
With the proposed methods, we aim to improve the performance of Chinese herbal recognition. The contributions of this study can be summarized as follows:
1. We build and present the standard Chinese-Herbs recognition dataset (CNH-98). We further build the corresponding tiny-Chinese-Herbs dataset (TCNH-98), which is used to train models for local recognition of herbs.
2. We introduce both channel-wise and spatial attention mechanisms into pyramid networks and further improve their structures, proposing channel-wise competitive attention and spatial reference attention. The former focuses on modeling channel dependencies between the spatial and semantic information flows, and the latter recalibrates the misaligned features using the spatial information flow for reference.
3. We are the first to apply a pyramid ConvNet to Chinese herbal recognition, motivated by the characteristics of the recognition task.
4. We conduct experiments on the proposed datasets to validate the superior performance of the presented models on the task of Chinese herbal recognition.
2 Related Work
Feature Pyramid. Feature pyramid networks were proposed to obtain image features at different scales. Based on this motivation, numerous methods using multi-level features in CNNs have been proposed, for example using RoI pooling or skip connections to construct a pyramid. With RoI pooling on proposal regions, HyperNet, ParseNet and ION concatenate features of multiple layers before computing predictions, and [4, 38] also aggregate context at different scales with spatial pooling. The Stacked Hourglass network is a typical feature pyramid structure with skip connections, which combines features of different levels for key point estimation. Inspired by the Hourglass module, FPN designs a network with strong semantics at all scales for object detection, and FANet improves it further by augmenting lower-level feature maps. Several other approaches, including PRM for pose estimation, U-Net for segmentation and RON for object detection, handle multi-level features with skip connections. In our work, we introduce an attentional fusion method based on FPN to competitively model the relationship between spatial information and semantics for Chinese herbal recognition.
Attention in CNN. As attention has become widely applied in CNN modeling, it is commonly used in two primary ways: channel-wise attention, which explicitly models interdependencies between channels, and spatial attention, which re-weights the spatial signals of the image[33, 21, 35, 43]. Furthermore, some models, such as SCA-CNN, combine spatial and channel-wise attention[6, 23]. However, these models are limited to local regions. To solve this problem, self-attention [34, 9] was proposed to capture long-range dependencies between local and global features. Additionally, there are attention models based on domain knowledge [5, 7]. The interaction-aware pyramid also introduces attention into the network to model long-range relationships. Different from that work, our proposed attention mechanism is based on the specific structure of FPN and explicitly models a trade-off between spatial and semantic information for Chinese herbal recognition.
CNN Applied on Tasks like Herbal Recognition. CNNs have been used on tasks similar to Chinese herbal recognition, such as plant recognition, which mainly focuses on leaf recognition[15, 2]. Other similar tasks, such as flower recognition[12, 36], can also be addressed with CNNs. As far as we know, no one has yet used CNNs to recognize Chinese herbs, and we are the first to propose this approach.
3 Chinese-Herbs Dataset Collection
The Chinese-Herbs Dataset (CNH-98) is a collection of 9184 images in 98 categories covering the common Chinese herbs. Because each image usually contains multiple repeated herbs, we also crop each image into several non-overlapping tiny images to construct a Tiny-Chinese-Herbs Dataset (TCNH-98) of 51198 images. Both datasets are randomly divided into training and validation sets at a ratio of 4:1. Fig. 1 shows some examples from CNH-98 (left) and their TCNH-98 crops (right). The sample datasets are available at https://github.com/scut-aitcm/Chinese-Herbs-Dataset.
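The random 4:1 division can be illustrated with a minimal sketch. The function name, the seed and the placeholder file names are assumptions made for illustration; they are not taken from the released dataset tooling, and whether the split was drawn per class or over the whole collection is not stated here.

```python
import random

def split_train_val(items, train_parts=4, val_parts=1, seed=0):
    """Shuffle items and split them into train/validation lists at train_parts:val_parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_train = len(shuffled) * train_parts // (train_parts + val_parts)
    return shuffled[:n_train], shuffled[n_train:]

# Hypothetical file names standing in for the 9184 CNH-98 images.
image_names = [f"img_{i:05d}.jpg" for i in range(9184)]
train, val = split_train_val(image_names)
print(len(train), len(val))  # 7347 1837
```

Applied to the 51198 TCNH-98 crops, the same rule yields 40958 training images.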
3.1 Chinese-Herbs Dataset
In this dataset, most of the images are photos that we took ourselves in medicinal herb stores, hospitals and similar places; the others were collected from Google Images. The smallest dimension of the images is about 250 pixels. Each class contains 94 images on average, and more than 41 classes include over 100 images. To ensure that labels and data are usable and correctly matched, the labels were reviewed by human annotators.
3.2 Tiny-Chinese-Herbs Dataset
The tiny dataset was sampled from the CNH-98 dataset above by cropping tiles of a fixed size, and we ensured that the tiles did not overlap. Because some factors degrade image quality, such as blank areas in the original image, we dropped images under the following conditions, as judged by the annotators: (i) the image is blank or the proportion of herbs in it is too small; (ii) the image does not contain herbs (for example, only containers or background); (iii) the annotators cannot recognize the content, for example when it shows only parts of the original herbs. Overall, we obtained an average of 522 images per class, with a minimum of 100 per class.
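Non-overlapping cropping of this kind can be sketched as follows. The tile size is left as a parameter because the value is not given in the text above; the 64-pixel tile and the 250 x 300 image in the example are purely hypothetical, as is the choice to discard partial tiles at the borders. The manual filtering by annotators is not modeled.

```python
def tile_boxes(width, height, tile):
    """Return (left, top, right, bottom) boxes of non-overlapping tile x tile crops.

    Partial tiles at the right and bottom edges are discarded.
    """
    boxes = []
    for top in range(0, height - tile + 1, tile):
        for left in range(0, width - tile + 1, tile):
            boxes.append((left, top, left + tile, top + tile))
    return boxes

# Hypothetical example: a 250 x 300 image cut into 64 x 64 tiles.
boxes = tile_boxes(250, 300, 64)
print(len(boxes))  # 3 columns x 4 rows = 12
```

Each box could then be passed to an image library's crop routine; stepping by the full tile size is what guarantees that no two crops share pixels.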
Note that the Tiny-Chinese-Herbs dataset may pose more severe challenges for herbal recognition because of the limited image size and the incomplete features of the herbs, even though the dataset is larger.
4 Competitive Attentional Fusion Pyramid Networks
In this section, considering the characteristics of Chinese herbal recognition, we first extend the Feature Pyramid Network (FPN) to this task. Next, we propose a competitive attentional fusion mechanism based on the original FPN to adapt it to the task. Finally, to address the problem of misaligned features, we propose a spatial recalibration method, which is combined with the attentional fusion mechanism.
4.1 Apply FPN to Chinese Herbal Recognition
A characteristic of herbal recognition tasks is that the shapes of some herbs are so distinctive that they are easily classified using the high-resolution features from the lower levels of the network, while herbs with similar shapes usually need to be classified by features from the higher levels, which contain more fine-grained semantic information. The feature maps from various layers of the networks are shown in Fig. 2. We therefore apply FPN to Chinese herbal recognition, because its pyramid structure can fuse features from multiple hierarchies.
FPN consists of two pathways, a bottom-up pathway and a top-down pathway, together with lateral connections. It builds a feature pyramid with high-level semantics throughout by naturally exploiting a ConvNet's pyramidal feature hierarchy. The bottom-up pathway is the feed-forward computation of the backbone ConvNet, which results in a feature hierarchy of feature maps at several scales with a scaling step of 2. The top-down pathway generates higher-resolution features by upsampling the last group of feature maps of the bottom-up pathway by a factor of 2. Compared with the features on the same level of the bottom-up pathway, these upsampled feature maps are spatially coarser but semantically stronger; hence we naturally refer to the bottom-up pathway as the spatial flow and to the top-down pathway as the semantic flow. As described in the design of FPN, the output of the last layer of each level is merged, via lateral connections and element-wise addition, with the same-sized upsampled feature maps from the top-down pathway. The merged feature maps are then fed into the next upsampling step. The result is a fused feature pyramid that has strong semantic and spatial information at all scales.
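The merge step of a plain FPN can be illustrated with a small sketch on nested lists. It is a simplification under stated assumptions: feature maps are stored as [C][H][W] lists, the lateral 1x1 convolution is omitted (the two inputs are assumed to have the same number of channels already), and nearest-neighbour upsampling is used for brevity. The function names are hypothetical.

```python
def upsample_nearest_2x(fmap):
    """Nearest-neighbour upsampling by 2 for a feature map stored as [C][H][W] lists."""
    out = []
    for channel in fmap:
        rows = []
        for row in channel:
            wide = [v for v in row for _ in range(2)]  # repeat each column twice
            rows.append(wide)
            rows.append(list(wide))                    # repeat each row twice
        out.append(rows)
    return out

def merge_level(lateral, top_down):
    """Add a lateral (spatial-flow) map to the upsampled top-down (semantic-flow) map."""
    up = upsample_nearest_2x(top_down)
    return [[[a + b for a, b in zip(row_l, row_u)]
             for row_l, row_u in zip(ch_l, ch_u)]
            for ch_l, ch_u in zip(lateral, up)]

# Toy example: one channel, a 2x2 lateral map and a 1x1 coarser map.
lateral = [[[1.0, 2.0], [3.0, 4.0]]]
coarse = [[[10.0]]]
print(merge_level(lateral, coarse))  # [[[11.0, 12.0], [13.0, 14.0]]]
```

In the toy example every position of the finer map receives the same coarse value, which shows why the top-down contribution is spatially coarse: one upsampled value covers a 2x2 block.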
4.2 Competitive Attention between Spatial and Semantic Flows
The fusion mode of features in FPN described above treats the spatial and semantic flows indiscriminately at all scales, which is likely to cause redundancies in the fused features. From an intuitive point of view, we propose a competitive attention mechanism that allows the network to explicitly model the competition between the spatial and semantic flows during fusion, so that the network can selectively emphasize richer semantic or spatial features and suppress redundant ones.
To achieve this, we obtain a global information embedding for each flow by applying global pooling to the feature maps from the lateral connections of the spatial flow and to the upsampled feature maps of the semantic flow, respectively. The two embeddings are combined and used as the joint input of the excitation operation, which captures channel-wise dependencies between the spatial and semantic flows. That is, the excitation, with its learnable parameters, operates on the concatenation of the descriptors produced by the squeeze operation on the two flows. Its output is divided into two parts, which rescale the features of the spatial flow and of the semantic flow respectively.
The competition between the spatial and semantic flows is thus modeled by the Competitive Attention module and acts on each channel of both flows. On the one hand, this way of merging features can be regarded as an adaptive competition between the two flows: the recalibration depends on both flows and dynamically adjusts their complementary weights for each other. On the other hand, it expresses a trade-off between the spatial and semantic flows. Finally, the rescaled features of the two flows are merged to form the output of the Competitive Attention module.
Fig. 3 gives an overview of our Competitive Attentional Fusion Pyramid Network and the details of the Competitive Attention module. The difference between the typical SE and Competitive Attention is that, based on the particular structure and meaning of FPN, we introduce the two flows into SE simultaneously to model their channel relationships competitively as a trade-off, and we adjust both flows at the same time.
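The squeeze, joint excitation, split and rescale steps described in this section can be sketched in plain Python. This is an illustrative reading of the text rather than the reference implementation: the two-layer bottleneck with ReLU follows the usual SE design and is an assumption here, as are all function and parameter names (`w1`, `b1`, `w2`, `b2`) and the use of element-wise addition to merge the rescaled flows.

```python
import math

def global_avg_pool(fmap):
    """Squeeze: average each channel of a [C][H][W] feature map to one scalar."""
    return [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in fmap]

def dense(x, weight, bias):
    """Fully connected layer: weight is [out][in], bias is [out]."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def scale(fmap, gates):
    """Multiply every channel of fmap by its scalar gate."""
    return [[[g * v for v in row] for row in ch] for ch, g in zip(fmap, gates)]

def competitive_attention(spatial, semantic, w1, b1, w2, b2):
    """Jointly gate two flows with C channels each, then add them element-wise."""
    c = len(spatial)
    z = global_avg_pool(spatial) + global_avg_pool(semantic)  # concatenated squeeze, length 2C
    hidden = [max(0.0, h) for h in dense(z, w1, b1)]          # bottleneck + ReLU (assumed)
    gates = [sigmoid(s) for s in dense(hidden, w2, b2)]       # length 2C, each in (0, 1)
    g_spatial, g_semantic = gates[:c], gates[c:]              # one half per flow
    a, b = scale(spatial, g_spatial), scale(semantic, g_semantic)
    fused = [[[x + y for x, y in zip(ra, rb)] for ra, rb in zip(ca, cb)]
             for ca, cb in zip(a, b)]
    return fused, g_spatial, g_semantic

# Toy run: C = 2 channels, 2x2 maps, 2 hidden units, all-zero parameters.
spatial = [[[1.0, 2.0], [3.0, 4.0]], [[0.0, 0.0], [0.0, 0.0]]]
semantic = [[[4.0, 3.0], [2.0, 1.0]], [[2.0, 2.0], [2.0, 2.0]]]
w1, b1 = [[0.0] * 4 for _ in range(2)], [0.0, 0.0]
w2, b2 = [[0.0] * 2 for _ in range(4)], [0.0] * 4
fused, g_sp, g_se = competitive_attention(spatial, semantic, w1, b1, w2, b2)
print(g_sp, g_se)  # zero parameters give sigmoid(0) = 0.5 for every gate
```

Because both halves of the gate vector are computed from the same joint descriptor, a change in either flow can shift the weights of both, which is the sense in which the two flows compete.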
4.3 Spatial Reference Recalibration
As discussed above, the upsampled features before merging are spatially coarser because they are the products of several downsampling and upsampling operators. In other words, their spatial information, such as location, is less accurate and may even be misaligned. This is also why feature fusion is needed.
However, it should be noted that merging by element-wise addition, as described above, relies heavily on spatial information, so features fused in this way are likely to be sub-optimal. Consequently, we introduce a method that spatially recalibrates the misaligned features by modeling pixel-level spatial attention with the spatial flow for reference. As in Harmonious Spatial Attention (HA), we compress the feature maps with global cross-channel average pooling to reduce the parameters of the subsequent conv layer, but unlike HA we model the two flows simultaneously. The compressed maps of the two flows are concatenated and fed into a conv layer with stride 2, and the output is resized to the original size by bilinear interpolation with a factor of 2. Finally, we add a scaling conv layer to reduce the aliasing effect of bilinear upsampling. As a result, we obtain 2 feature maps that rescale the values of the features from the two flows at pixel level, respectively. In addition, this mechanism contributes to the robustness of the network, allowing it to use different upsampling methods on the top-down pathway.
To combine competitive attention with spatial recalibration, we further attach two convolution layers after the tensor multiplication on the two flows respectively, since the two processes are not mutually independent. Finally, we apply sigmoid operations to normalise. More details of the SRR module and its combination with Competitive Attention are shown in Fig. 4.
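The pixel-level gating idea can be shown with a deliberately simplified sketch. It keeps only the cross-channel average pooling of both flows and a per-pixel sigmoid gate for each flow; the stride-2 conv layer, the bilinear resize and the scaling conv layer described above are replaced by a single hypothetical 2x2 per-pixel mixing (`mix`, `bias`), so this is a stand-in for the idea, not the SRR module itself. All names are assumptions.

```python
import math

def channel_mean(fmap):
    """Cross-channel average pooling: [C][H][W] -> [H][W]."""
    c, h, w = len(fmap), len(fmap[0]), len(fmap[0][0])
    return [[sum(fmap[k][i][j] for k in range(c)) / c for j in range(w)] for i in range(h)]

def spatial_gates(spatial, semantic, mix, bias):
    """Return two [H][W] gate maps, one per flow.

    mix is a 2x2 matrix and bias a length-2 list: a per-pixel (1x1) stand-in
    for the conv layers described in the text.
    """
    m_sp, m_se = channel_mean(spatial), channel_mean(semantic)
    gates = ([], [])
    for row_sp, row_se in zip(m_sp, m_se):
        rows = ([], [])
        for p, q in zip(row_sp, row_se):
            for k in range(2):
                logit = mix[k][0] * p + mix[k][1] * q + bias[k]
                rows[k].append(1.0 / (1.0 + math.exp(-logit)))
        gates[0].append(rows[0])
        gates[1].append(rows[1])
    return gates

def rescale(fmap, gate):
    """Multiply every channel of fmap by the same [H][W] gate map."""
    return [[[v * g for v, g in zip(row, grow)] for row, grow in zip(ch, gate)] for ch in fmap]

# Toy example: C = 2, H = 1, W = 2, identity mixing and zero bias.
spatial = [[[1.0, 0.0]], [[3.0, 0.0]]]
semantic = [[[0.0, 2.0]], [[0.0, 2.0]]]
g_sp, g_se = spatial_gates(spatial, semantic, mix=[[1.0, 0.0], [0.0, 1.0]], bias=[0.0, 0.0])
recalibrated_spatial = rescale(spatial, g_sp)
recalibrated_semantic = rescale(semantic, g_se)
```

With a non-diagonal `mix`, the gate for one flow depends on the pooled map of the other, which is how the spatial flow can serve as a reference for the semantic flow.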
5.1 Implementation Details
For fair comparison, each plain FPN and its corresponding CA, SRR and SRR-CA counterparts are trained with identical optimisation schemes. For the CNH-98 and TCNH-98 datasets, we train all our models with three degrees of data augmentation: no data augmentation, standard data augmentation (+) and mixup, an advanced data augmentation technique. On CNH-98, standard data augmentation (translation/mirroring) is adopted for the training set and a 224x224 crop is randomly sampled. All images are normalized with mean values and standard deviations. When testing, our implementation follows established practice. On TCNH-98, we follow the standard practice and data augmentation for CIFAR. All models are trained from scratch by the SGD optimizer with Nesterov momentum of 0.9.
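The advanced augmentation used in these experiments is mixup [41], which trains on convex combinations of pairs of examples and their labels. A minimal sketch for a single pair is given below; the Beta parameter `alpha=1.0` is an assumed default, since the value used in the experiments is not stated here, and the function name is hypothetical.

```python
import random

def mixup_pair(x_a, y_a, x_b, y_b, alpha=1.0, rng=random):
    """Blend two examples; x_* are flat feature lists, y_* are one-hot label lists."""
    lam = rng.betavariate(alpha, alpha)  # mixing coefficient in [0, 1]
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y_a, y_b)]
    return x, y, lam

# Toy example with two 2-pixel "images" from two different classes.
x, y, lam = mixup_pair([0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0],
                       rng=random.Random(0))
print(round(sum(y), 6))  # the blended label still sums to 1
```

In practice the pairing is usually done within a mini-batch by mixing the batch with a shuffled copy of itself.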
On CNH-98, we train our models with batch size 64, for 300 epochs with standard augmentation and mixup and for 120 epochs with no augmentation. The learning rate is initialized to 0.1 and divided by 5 at epochs 120, 200 and 260 for standard augmentation and mixup, and at epochs 30, 60 and 90 for no augmentation; weight decay is set to 0.0005 and 0.0001 respectively. In particular, models trained with mixup are trained with the traditional strategy for the last 20 epochs.
On TCNH-98, our models are trained for 300 epochs with batch size 128; the initial learning rate is 0.1 and is divided by 10 at the 100th, 150th and 200th epochs. We also set the weight decay to 0.0001, following the setting for CIFAR. When training without data augmentation, the learning rate is instead divided by 5 at epochs 30, 60 and 90.
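The step schedules above can be written as a small helper. This is a sketch under one assumption that the text does not settle: the drop takes effect from the listed epoch onward, with epochs counted from 0. The function name is hypothetical.

```python
def step_lr(epoch, base_lr, milestones, factor):
    """Learning rate at a given epoch: base_lr divided by factor at each milestone reached."""
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr / (factor ** drops)

# CNH-98 with standard augmentation or mixup: divide by 5 at epochs 120, 200, 260.
for epoch in (0, 120, 200, 260):
    print(epoch, step_lr(epoch, 0.1, (120, 200, 260), 5))

# TCNH-98: divide by 10 at epochs 100, 150, 200.
for epoch in (0, 100, 150, 200):
    print(epoch, step_lr(epoch, 0.1, (100, 150, 200), 10))
```

Common deep learning frameworks provide equivalent multi-step schedulers, so a helper like this is mainly useful for checking the schedule by hand.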
5.2 Results of Chinese Herbal Recognition
We evaluate our methods on the CNH-98 and TCNH-98 datasets with pre-act ResNet as the backbone network. The results of the comparative experiments for FPN with and without the CA and SRR-CA modules are shown in Table 1 and can be summarized as follows.
First of all, as shown in Table 1, FPN indeed achieves better results than pre-act ResNet on both CNH-98 and TCNH-98, which supports the conjecture in Section 4.1 that FPN is better suited to Chinese herbal recognition; we take the FPN experiment as the baseline. Furthermore, on both CNH-98 and TCNH-98, FPN-CA outperforms the baseline, and FPN-SRR-CA further improves performance across different depths, or at least maintains it, without many extra parameters.
Secondly, FPN-SRR exceeds FPN in almost all settings except CNH-98 without data augmentation, which demonstrates the effectiveness of the SRR module in most cases and suggests that the CA and SRR modules are not two separate processes but need to be modeled jointly; hence it is reasonable to attach a convolution layer after the combination of the SRR and CA modules. As for the performance of SRR on CNH-98 with no augmentation, we infer that overfitting occurs owing to the small size of the CNH-98 dataset. Moreover, on CNH-98, compared with FPN-34, FPN-SRR-CA-18 increases validation accuracy by 1.7% with no augmentation and by 1.2% with standard augmentation, and matches or slightly exceeds FPN-34 with mixup. In particular, FPN-SRR-CA-18 has higher accuracy than FPN-SRR-CA-34 with no augmentation, from which we infer that a depth of 34 is too deep to fit a small dataset like CNH-98, and that our CA and SRR-CA modules can reduce overfitting and improve the generalization ability of the models, and thus perform better with deeper networks. In contrast, when training on TCNH-98, whose training set consists of 40958 images, with standard augmentation and mixup, we notice underfitting for networks of depth 20, which indicates that the representation capacity of models of depth 20 is too limited. Increasing the depth of the networks reduces this phenomenon, showing that deeper networks can perform better.
Mixup can be seen as an advanced method of data augmentation. However, on the TCNH-98 dataset, models trained with mixup achieve worse results. We argue that, as an augmentation approach, it further aggravates underfitting, which naturally leads to worse results. Because the representation capacity of networks of depth 20 is limited, the TCNH-98 dataset is in fact better suited to deeper networks, as shown by the experiments on models of depth 56, which reduce underfitting.
5.3 Further Analysis and Discussion
The analysis in Section 5.2 has shown the effectiveness of the CA and SRR-CA modules. In this section, we discuss the effects of our approaches from an intuitive point of view. The internal feature maps from different levels of three models, FPN-18, FPN-CA-18 and FPN-SRR-CA-18, are shown in the top part of Fig. 5, from which we conclude that our methods strengthen the representation of the networks. The earlier layers of FPN extract almost only contour features, whereas with our FPN-CA models the features gain more detailed information, as seen by comparing the feature maps of FPN with and without CA on levels 1 and 2 in Fig. 5. It is worth mentioning that the features extracted by models with CA modules are sparser and more accurate than those of the original FPN, especially for the feature maps of level 3. Moreover, SRR-CA modules can further spatially recalibrate the misaligned feature maps, mainly for deeper features, as typically shown in level 3 of Fig. 5, which gives the features stronger spatial information and richer semantics. Additionally, we compute the distributions of the activations of the CA modules in the FPN-CA and FPN-SRR-CA models. The attentional activation values of the CA and SRR modules are highly active and distinctive, and the heatmaps of the SRR modules can reconstruct the distribution of the original image, which suggests that our methods indeed contribute to re-weighting and recalibrating features.
As shown in the distribution of the channel-wise attentional outputs, the activation values of features from deeper layers are mostly uniform and tend to 0.5, because features from deeper layers have already been adjusted during training, so the CA modules perform less adjustment. We notice that the activation values on the deepest level of the spatial flow are mostly higher than those of the semantic flow, while from deeper to earlier levels the activation of the semantic flow gradually stands out in the competition. This confirms our conjecture that high-level features are spatially coarser and strongly semantic, in contrast to low-level features, and it also indicates that the proposed mechanism can supply complementary spatial or semantic information as required by different levels. The analysis of the heatmap activations of the SRR modules leads to the same conclusion. Comparing the channel-wise attentional outputs of FPN-CA and FPN-SRR-CA, the channel-wise activation of FPN-SRR-CA tends to be more stable than that of FPN-CA owing to SRR, which makes the features more accurate and whose effects are passed through the network.
In this paper, we are the first to propose a standard Chinese herbs dataset for recognition. Based on the characteristics of the Chinese herbal recognition task, we introduce attention mechanisms into pyramid networks to model the channel relationships of features from various levels. Furthermore, we improve channel-wise and spatial attention and propose the competitive attention and spatial reference recalibration modules, which respectively model channel dependencies between the spatial and semantic flows during feature fusion and spatially recalibrate the misaligned feature maps with the spatial flow for reference. We apply the improved pyramid network to Chinese herbal recognition and evaluate our methods on the proposed CNH-98 and TCNH-98 datasets, obtaining superior performance to the traditional pyramid networks.
-  Google images. Website. http://images.google.com/.
-  F. Ayaz, A. Ari, and D. Hanbay. Leaf recognition based on artificial neural network. In International Artificial Intelligence and Data Processing Symposium, pages 1–5, 2017.
-  S. Bell, C. L. Zitnick, K. Bala, and R. Girshick. Inside-outside net: Detecting objects in context with skip pooling and recurrent neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2874–2883, 2016.
-  J.-R. Chang and Y.-S. Chen. Pyramid stereo matching network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5410–5418, 2018.
-  J. Chen, H. Zhang, X. He, L. Nie, W. Liu, and T.-S. Chua. Attentive collaborative filtering: Multimedia recommendation with item-and component-level attention. In Proceedings of the 40th International ACM SIGIR conference on Research and Development in Information Retrieval, pages 335–344. ACM, 2017.
-  L. Chen, H. Zhang, J. Xiao, L. Nie, J. Shao, W. Liu, and T.-S. Chua. Sca-cnn: Spatial and channel-wise attention in convolutional networks for image captioning. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6298–6306. IEEE, 2017.
-  E. Choi, M. T. Bahadori, L. Song, W. F. Stewart, and J. Sun. Gram: graph-based attention model for healthcare representation learning. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 787–795. ACM, 2017.
-  Y. Du, C. Yuan, B. Li, L. Zhao, Y. Li, and W. Hu. Interaction-aware spatio-temporal pyramid attention networks for action classification. arXiv preprint arXiv:1808.01106, 2018.
-  J. Fu, J. Liu, H. Tian, Z. Fang, and H. Lu. Dual attention network for scene segmentation. arXiv preprint arXiv:1809.02983, 2018.
-  J. Fu, H. Zheng, and T. Mei. Look closer to see better: Recurrent attention convolutional neural network for fine-grained image recognition. In CVPR, volume 2, page 3, 2017.
-  R. Girshick. Fast r-cnn. In Proceedings of the IEEE international conference on computer vision, pages 1440–1448, 2015.
-  I. Gogul and V. S. Kumar. Flower species recognition system using convolution neural networks and transfer learning. In International Conference on Signal Processing, pages 1–6, 2017.
-  K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
-  K. He, X. Zhang, S. Ren, and J. Sun. Identity mappings in deep residual networks. In European conference on computer vision, pages 630–645. Springer, 2016.
-  J. Hu, Z. Chen, M. Yang, R. Zhang, and Y. Cui. A multiscale fusion convolutional neural network for plant leaf recognition. IEEE Signal Processing Letters, 25(6):853–857, 2018.
-  J. Hu, L. Shen, and G. Sun. Squeeze-and-excitation networks. arXiv preprint arXiv:1709.01507, 7, 2017.
-  G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger. Densely connected convolutional networks. In CVPR, volume 1, page 3, 2017.
-  S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
-  T. Kong, F. Sun, A. Yao, H. Liu, M. Lu, and Y. Chen. Ron: Reverse connection with objectness prior networks for object detection. In IEEE Conference on Computer Vision and Pattern Recognition, volume 1, page 2, 2017.
-  T. Kong, A. Yao, Y. Chen, and F. Sun. Hypernet: Towards accurate region proposal generation and joint object detection. In IEEE Conference on Computer Vision and Pattern Recognition, pages 845–853, 2016.
-  W. Li, X. Zhu, and S. Gong. Harmonious attention network for person re-identification. In CVPR, volume 1, page 2, 2018.
-  T.-Y. Lin, P. Dollár, R. B. Girshick, K. He, B. Hariharan, and S. J. Belongie. Feature pyramid networks for object detection. In CVPR, volume 1, page 4, 2017.
-  D. Linsley, D. Scheibler, S. Eberhardt, and T. Serre. Global-and-local attention networks for visual recognition. arXiv preprint arXiv:1805.08819, 2018.
-  W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. Ssd: Single shot multibox detector. In European conference on computer vision, pages 21–37. Springer, 2016.
-  W. Liu, A. Rabinovich, and A. C. Berg. Parsenet: Looking wider to see better. arXiv preprint arXiv:1506.04579, 2015.
-  A. Newell, K. Yang, and J. Deng. Stacked hourglass networks for human pose estimation. In European Conference on Computer Vision, pages 483–499. Springer, 2016.
-  T. V. Nguyen, Q. Zhao, and S. Yan. Attentive systems: A survey. International Journal of Computer Vision, 126(1):86–110, 2018.
-  J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. You only look once: Unified, real-time object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 779–788, 2016.
-  O. Ronneberger, P. Fischer, and T. Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pages 234–241. Springer, 2015.
-  K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
-  C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1–9, 2015.
-  B. P. Tóth, M. J. Tóth, D. Papp, and G. Szücs. Deep learning and svm classification for plant recognition in content-based large scale image retrieval. In CLEF (Working Notes), pages 569–578, 2016.
-  F. Wang, M. Jiang, C. Qian, S. Yang, C. Li, H. Zhang, X. Wang, and X. Tang. Residual attention network for image classification. arXiv preprint arXiv:1704.06904, 2017.
-  X. Wang, R. Girshick, A. Gupta, and K. He. Non-local neural networks. arXiv preprint arXiv:1711.07971, 10, 2017.
-  S. Woo, J. Park, J.-Y. Lee, and I. S. Kweon. Cbam: Convolutional block attention module. In Proc. of European Conf. on Computer Vision (ECCV), 2018.
-  X. Xia, C. Xu, and B. Nan. Inception-v3 for flower classification. In International Conference on Image, Vision and Computing, pages 783–787, 2017.
-  C. C. Yang and P. Veltri. Intelligent healthcare informatics in big data era. Artificial intelligence in medicine, 65(2):75–77, 2015.
-  H. Yang, D. Huang, Y. Wang, and A. K. Jain. Learning face age progression: A pyramid architecture of gans. arXiv preprint arXiv:1711.10352, 2017.
-  J. Yang, Y. Hong, and S. Ma. Impact of the new health care reform on hospital expenditure in china: A case study from a pilot city. China Economic Review, 39:1–14, 2016.
-  W. Yang, S. Li, W. Ouyang, H. Li, and X. Wang. Learning feature pyramids for human pose estimation. In The IEEE International Conference on Computer Vision (ICCV), volume 2, 2017.
-  H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz. mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412, 2017.
-  J. Zhang, X. Wu, J. Zhu, and S. C. Hoi. Feature agglomeration networks for single stage face detection. arXiv preprint arXiv:1712.00721, 2017.
-  H. Zheng, J. Fu, T. Mei, and J. Luo. Learning multi-attention convolutional neural network for fine-grained image recognition. In Int. Conf. on Computer Vision, volume 6, 2017.
Appendix A Details of Chinese Herbs Datasets
A.1 Distributions of Examples in CNH-98 Dataset
|Main Categories||Herbs Examples|
|Fruits & Seeds||Star Anise, Siraitia Grosvenorii, Ginkgo, Chinese Wolfberry, Selfheal, Fructus Arctii, etc.|
|Rhizome||Liquorice, Thorowax Root, Unibract Fritillary Bulb, etc.|
|Flowers||Saffron, Flos Daturae, Cloves, Magnolia, Coltsfoot, Flos Jasmine, Lily, etc.|
|Bark||Cinnamon, Cortex Moutan, Eucommia Ulmoides, etc.|
|Thallophyte||Glossy Ganoderma, Tremella, Cordyceps Sinensis, etc.|
|Whole Herbs||Abrus cantoniensis, Anoectochilus roxburghii, etc.|
|Leaves||Lophatherum Gracile, etc.|
|Resin||Frankincense, Myrrh, etc.|
Chinese herbs are usually obtained from natural plants and from parts of fungi and algae. Our Chinese-Herbs Dataset (CNH-98) is a collection of 9184 images of 98 classes, which can be divided into 8 categories: Fruits & Seeds, Rhizome, Flowers, Bark, Thallophyte, Whole Herbs, Leaves and Resin. Examples of each category are shown in Table 2.
Fig. 6 (left) shows the distribution of the number of Chinese herb classes across the 8 categories. The majority of classes belong to Fruits & Seeds and Rhizome, with 42 and 32 classes respectively, so the CNH-98 dataset is relatively unbalanced. Moreover, as shown in Fig. 6 (right), the number of images is also unbalanced across the 98 classes: the largest class is Amomum Tsaoko with 247 images and the smallest is Chestnut Shell, in Fruits & Seeds, with 14 images.
A.2 Exhibition of Main Categories
In this section, we show examples of the primary categories in CNH-98 and their corresponding cropped examples in TCNH-98 (Fig. 7). Although the examples in TCNH-98 are only local crops, almost every one of them contains at least one herb with a complete shape, thanks to the repetition of herbs within CNH-98 images. Furthermore, the shapes of herbs in different categories are highly distinctive, while the appearances of different classes within the same category are similar. This is precisely the motivation for our proposed methods: herbs with distinctive shapes can be classified by features from the earlier layers of the network, while herbs with similar shapes but different details need to be recognized by more semantic features from deeper levels.
Appendix B Evaluation with Other Upsampling Strategies
|Model||Backbone Depth||Nearest||Deconvolution(# params)||Bilinear*|
To validate the robustness of our models to different upsampling strategies, as mentioned in Section 4.3, we evaluate our methods with other upsampling methods, including nearest neighbor and deconvolution. The results are shown in Table 3. The accuracy of FPN fluctuates more than that of our models: its maximum discrepancy is about 1%, compared with only 0.1% to 0.5% for our models. Even when using only the spatial attention SRR modules, our models are more stable, which indicates that the SRR modules help the network adapt to various upsampling strategies and perform more robustly.
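To make the difference between two of these strategies concrete, the following is a minimal pure-Python sketch of nearest-neighbor and bilinear upsampling on a single 2D feature map. It is an illustration only, not the implementation used in our experiments: the function names are hypothetical, the bilinear variant assumes the corner-aligned convention (frameworks differ on this), and deconvolution is omitted because it is a learned operator whose weights cannot be shown in a fixed example.

```python
def upsample_nearest(fmap, scale=2):
    """Nearest-neighbor upsampling: repeat each value scale x scale times."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(scale)]
        out.extend([list(wide) for _ in range(scale)])
    return out


def upsample_bilinear(fmap, scale=2):
    """Bilinear upsampling, assuming corner-aligned sampling positions."""
    h, w = len(fmap), len(fmap[0])
    out_h, out_w = h * scale, w * scale
    out = []
    for i in range(out_h):
        # Map the output row index back to a fractional input coordinate.
        y = i * (h - 1) / (out_h - 1) if out_h > 1 else 0.0
        y0 = int(y)
        y1 = min(y0 + 1, h - 1)
        dy = y - y0
        row = []
        for j in range(out_w):
            x = j * (w - 1) / (out_w - 1) if out_w > 1 else 0.0
            x0 = int(x)
            x1 = min(x0 + 1, w - 1)
            dx = x - x0
            # Interpolate horizontally on the two rows, then vertically.
            top = fmap[y0][x0] * (1 - dx) + fmap[y0][x1] * dx
            bottom = fmap[y1][x0] * (1 - dx) + fmap[y1][x1] * dx
            row.append(top * (1 - dy) + bottom * dy)
        out.append(row)
    return out


fmap = [[0.0, 1.0],
        [2.0, 3.0]]
nearest = upsample_nearest(fmap)    # blocky copies of each input value
bilinear = upsample_bilinear(fmap)  # smooth ramp between input values
for name, result in (("nearest", nearest), ("bilinear", bilinear)):
    print(name)
    for r in result:
        print("  ", [round(v, 2) for v in r])
```

Nearest-neighbor upsampling produces piecewise-constant blocks, while bilinear upsampling produces a smooth ramp, so the same low-resolution map yields spatially different high-resolution features before merging. Differences of this kind are one plausible source of the accuracy fluctuation across strategies.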
Appendix C Further Analysis of Intermediate Results in FPN-CA/SRR-CA
In this section, we extract intermediate features from our FPN-CA-18 and FPN-SRR-CA-18 models, which use ResNet-18 as the backbone network. We define the layers that produce output maps of the same size as one pyramid level; the features extracted from the last layer of each level are shown in Fig. 8-10 for three examples. Additionally, we report statistics of the activation values of the competitive attention from the two pathways during feature merging, together with the corresponding spatial attention heatmaps.
Visually, the information in many feature channels is clearly suppressed, either by re-scaling with a small weight or by retaining only more local features, and with this adjustment the models achieve better performance. This supports our inference in Section 4.2 that the fusion method of the original FPN leads to redundancy in the feature maps, which is also one of our motivations for proposing the attention mechanism. Moreover, the attended regions are clearly visible, such as the serrated petal shape of Flos Chrysanthemum in Fig. 8, and some blurry features are spatially recalibrated and become clearer.
These changes occur mainly at the low levels of the network for both FPN-CA and FPN-SRR-CA, while high-level features are adjusted only slightly, as shown in level 3 of Fig. 8-10. This is confirmed by the activation statistics in Fig. 11. The activation values from level 4 to level 3 stay at about 0.5 and fluctuate only slightly, because the features of both the spatial and semantic flows are extracted from deep layers before merging and have already been adjusted sufficiently. At the low levels, however, features from the semantic flow are activated more strongly and those from the spatial flow more sparsely. This indicates a high information density in the semantic flow, which is more beneficial for classification. Furthermore, most features from the spatial flow have weak discriminative power, are redundant, and are suppressed; only a small part of them is selected to supplement the semantic flow.
Compared with the competitive attention activations of FPN-CA, the channel features of FPN-SRR-CA are less suppressed. We infer that the SRR modules help restore spatial information for misaligned features, which increases the information density of the semantic flow, so its representation is more strongly activated (the CA activation values are almost all non-zero). This suggests that the SRR-CA module is more cautious when reducing redundancy in the feature maps.
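The behavior of the competitive attention weights discussed above can be illustrated with a small sketch. It assumes, purely for illustration, that the competition between the semantic and spatial flows takes the form of a per-channel softmax over globally pooled descriptors, so that the two weights of each channel sum to 1. The learned layers of the actual modules in Section 4.2 are omitted, and all function names are hypothetical.

```python
import math


def global_avg_pool(fmap):
    """fmap is a list of channels, each a 2D list; returns one scalar per channel."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in fmap]


def competitive_weights(semantic, spatial):
    """Per-channel softmax over the two pathways (assumed form, no learned layers)."""
    weights = []
    for s, p in zip(global_avg_pool(semantic), global_avg_pool(spatial)):
        m = max(s, p)  # subtract the maximum for numerical stability
        es, ep = math.exp(s - m), math.exp(p - m)
        weights.append((es / (es + ep), ep / (es + ep)))
    return weights


def fuse(semantic, spatial):
    """Merge two feature maps channel by channel using the competing weights."""
    weights = competitive_weights(semantic, spatial)
    fused = [
        [[ws * a + wp * b for a, b in zip(row_s, row_p)]
         for row_s, row_p in zip(ch_s, ch_p)]
        for ch_s, ch_p, (ws, wp) in zip(semantic, spatial, weights)
    ]
    return fused, weights


# Two channels of size 2x2. In channel 0 both flows respond equally;
# in channel 1 the semantic flow responds much more strongly.
semantic = [[[1.0, 1.0], [1.0, 1.0]], [[2.0, 2.0], [2.0, 2.0]]]
spatial = [[[1.0, 1.0], [1.0, 1.0]], [[0.0, 0.0], [0.0, 0.0]]]
fused, weights = fuse(semantic, spatial)
for c, (ws, wp) in enumerate(weights):
    print(f"channel {c}: semantic weight {ws:.3f}, spatial weight {wp:.3f}")
```

Under this assumed form, a channel on which both flows respond equally receives weights of 0.5 each, matching the roughly constant values observed at the high levels, while a channel dominated by the semantic flow pushes the spatial weight toward zero, as observed at the low levels.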
As shown in the heatmaps of the SRR attention modules, the attention outputs of different regions are clearly distinguishable, and the absolute activation values on the target are usually larger. However, for Flos Chrysanthemum in Fig. 8 of this appendix and Unibract Fritillary Bulb in Fig. 5 of the main text, the SRR attention at level 1 focuses more on the background. We infer that the SRR attention activations are closely related to the original image, especially at the low levels of the network: low-level features closely reproduce the original image and are more sensitive to color. Therefore, because the background is dark, the absolute activation values on the background are larger than those on the target. Nevertheless, SRR attention still distinguishes between regions and recalibrates the misaligned features.