Chinese Herbal Recognition based on Competitive Attentional Fusion of Multi-hierarchies Pyramid Features

12/23/2018 ∙ by Yingxue Xu, et al. ∙ 14

Convolutional neural networks (CNNs) have been applied successfully to image recognition tasks. In this study, we explore automatic herbal recognition with CNNs and build the first standard Chinese herbs datasets. According to the characteristics of herbal images, we propose competitive attentional fusion pyramid networks to model the features of herbal images. The networks model the relationship between feature maps from different levels and re-weight multi-level channels with a channel-wise attention mechanism. In this way, we can dynamically adjust the weights of feature maps from various layers according to the visual characteristics of each herbal image. Moreover, we introduce spatial attention to recalibrate the misaligned features caused by sampling during feature amalgamation. Extensive experiments on our proposed datasets validate the superior performance of our proposed models. The Chinese herbs datasets will be released upon acceptance to facilitate research on Chinese herbal recognition.




1 Introduction

Deep convolutional neural networks (CNNs) have achieved great success in the field of image processing [30, 31, 17, 11] and have been applied to object detection and recognition [13, 24, 28] with strong performance. As a simple, noninvasive treatment with few side effects, Chinese herbs are widely used in China and a number of other Asian countries for healthcare [39, 37]. Automatic recognition of Chinese herbs therefore has wide practical value. However, as far as we know, there is no prior research on this task, and it is difficult to train models for herbal recognition due to the lack of sufficient herb data.
In this paper, we propose the first CNN model for the Chinese herbal recognition task, together with a standard dataset for Chinese herbal recognition. The task differs from both regular object recognition [13, 17] and fine-grained image recognition [10, 43]: the former focuses on distinguishing the outline and shape of objects, while the latter needs more detailed features in order to classify objects with similar shapes but different details. Chinese herbal recognition confronts both cases: (a) some herbs are so distinctive that they are easily classified by shape features rather than detailed features; (b) some herbs with similar shapes need to be classified by more fine-grained features. The features extracted from convolution layers of different depths are diverse: features from earlier layers are more representational, whereas features from deeper layers are more abstract and contain more semantics [24, 22]. To address these challenges, we choose Feature Pyramid Networks (FPN) [22] to merge features from different levels, which diversifies the image features overall and improves the performance of herbal recognition with CNNs.
Compared with the traditional FPN [22], we introduce channel-wise attention [16] into the process of fusing features from different levels. In this way, our models can dynamically adjust the weights of features from different levels, which makes it possible to adaptively control how strongly features encoded at each level are selected. Furthermore, we combine spatial attention [21] to spatially recalibrate the misaligned features caused by several upsampling or downsampling operators during feed-forward propagation.

More importantly, both the channel-wise and the spatial attention are improved in this paper as follows: (a) The original SE mechanism is limited to re-scaling the weights of features from the same layer, while the competitive attention proposed in this paper extends the modeling range of channel-wise attention (and likewise of spatial attention) and explicitly models the competitive channel dependencies between spatial and semantic information during fusion at various levels. (b) The feature maps from the bottom-up pathway, which are rich in spatial information and provide a reference for the misaligned and spatially coarser features from the top-down pathway, are introduced into the spatial attention modeling to recalibrate the misaligned features. With these improvements, tailored to our specific structures and tasks, we can jointly model the channel relationships across levels and the channel dependencies between spatial and semantic information flows, and spatially recalibrate misaligned features.

With the methods proposed above, we aim to improve the performance of Chinese herbal recognition. The contributions of this study can be summarized as follows:
1. We build and present the standard Chinese-Herbs recognition dataset (CNH-98). We further build the corresponding tiny-Chinese-Herbs dataset (TCNH-98), which is used to train models for local recognition of herbs.
2. We introduce both channel-wise and spatial attention mechanisms into pyramid networks and further improve their structures to propose channel-wise competitive attention and spatial reference attention. The former focuses on modeling channel dependencies between spatial and semantic information flows, and the latter recalibrates the misaligned features using the spatial information flow as a reference.
3. We are the first to apply a pyramid ConvNet to Chinese herbal recognition, motivated by the characteristics of the recognition task.
4. We conduct experiments on the proposed datasets to validate the superior performance of the presented models on the task of Chinese herbal recognition.

2 Related Work

Feature Pyramid. Feature pyramid networks were proposed to obtain image features at different scales [22]. Based on this motivation, numerous methods that use multi-level features in CNNs have been proposed, such as RoI pooling [11] or skip connections that construct a pyramid [26]. With RoI pooling on proposal regions, HyperNet [20], ParseNet [25] and ION [3] concatenate features of multiple layers before computing predictions, and [4, 38] also aggregate context at different scales with spatial pooling. The Stacked Hourglass network [26] is the typical feature pyramid structure with skip connections, combining features of different levels for key point estimation. Inspired by the Hourglass module, FPN [22] designs a network with strong semantics at all scales for object detection, and FANet [42] improves it further by augmenting lower-level feature maps. Several other approaches, including PRM [40] for pose estimation, U-Net [29] for segmentation and RON [19] for object detection, handle multi-level features with skip connections. In our work, we introduce an attentional fusion method based on FPN [22] to competitively model the relationship between spatial information and semantics for Chinese herbal recognition.
Attention in CNN. Attention is now widely applied in the modeling process of CNNs [27] and is commonly used in two ways: channel-wise attention [16], which explicitly models interdependencies between channels, and spatial attention, which re-weights the spatial signals of the image [33, 21, 35, 43]. Some models combine both spatial and channel-wise attention, such as SCA-CNN [6, 23]. However, these models are limited to local regions. To solve this problem, self-attention [34, 9] was proposed to capture long-range dependencies between local and global features. Additionally, there are attention models based on domain knowledge [5, 7]. The interaction-aware pyramid [8] also introduces attention into the network to model long-range relationships. Different from [8], our proposed attention mechanism is based on the specific structure of FPN [22] and explicitly models a trade-off between spatial and semantic information for Chinese herbal recognition.
CNN Applied to Tasks like Herbal Recognition. Some tasks that use CNNs are similar to Chinese herbal recognition, such as plant recognition [32], which mainly focuses on leaf recognition [15, 2]. Other similar tasks, such as flower recognition [12, 36], can also be addressed with CNNs. As far as we know, no one has used CNNs to recognize Chinese herbs so far, and we are the first to propose this approach.

3 Chinese-Herbs Dataset Collection

Figure 1: Examples from the proposed CNH-98 dataset (left) and TCNH-98 dataset (right).

The Chinese-Herbs Dataset (CNH-98) is a collection of 9184 images of 98 categories covering the common Chinese herbs. Because each image usually contains multiple repeated herbs, we also crop each image into several non-overlapping tiny images to construct a Tiny-Chinese-Herbs Dataset (TCNH-98) of 51198 images. Both datasets are divided randomly into training and validation sets in a 4:1 proportion. Fig. 1 shows some examples from CNH-98 (left) and their crops in TCNH-98 (right). The sample datasets are available.

3.1 Chinese-Herbs Dataset

Most of the images in this dataset were acquired by taking photos ourselves in medicinal herb stores, hospitals and similar places. The others were collected from Google Images [1]. The smallest dimension of the images is about 250 pixels. Each class contains 94 images on average, and more than 41 classes include over 100 images. To ensure that labels and data match, the labels were reviewed by human annotators.

3.2 Tiny-Chinese-Herbs Dataset

The tiny dataset was sampled from the CNH-98 dataset above as fixed-size crops, and we ensured that the crops do not overlap. Some factors interfere with the quality of the crops, such as blank areas in the original image. We therefore dropped crops under the following conditions, as judged by the annotators: (i) the crop is blank or the proportion of herbs in it is too small, (ii) the crop does not contain herbs (for example, it shows only containers or background), (iii) the annotators cannot recognize the content, for example when only parts of the original herbs are visible. Overall, we obtained an average of 522 images per class, with a minimum of 100 per class.

Note that the Tiny-Chinese-Herbs dataset may pose more severe challenges for herbal recognition because of the limited image size and the incomplete features of herbs, even though the dataset itself is larger.
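The cropping and splitting procedure described in this section can be sketched as follows. This is a minimal illustration and not the authors' pipeline: the tile size of 64 pixels, the function names and the fixed seed are all assumptions, and the manual filtering by annotators is not modeled.

```python
import numpy as np

def tile_without_overlap(image, tile):
    """Split an H x W x C image into non-overlapping tile x tile crops.

    Border pixels that do not fill a whole tile are discarded.
    """
    h, w = image.shape[:2]
    crops = []
    for top in range(0, h - tile + 1, tile):
        for left in range(0, w - tile + 1, tile):
            crops.append(image[top:top + tile, left:left + tile])
    return crops

def split_train_val(n_items, train_ratio=0.8, seed=0):
    """Random split of item indices; train_ratio=0.8 gives the 4:1 proportion."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_items)
    n_train = int(round(n_items * train_ratio))
    return order[:n_train], order[n_train:]

# Hypothetical 250 x 300 photo; tile=64 is an assumed crop size.
image = np.zeros((250, 300, 3), dtype=np.uint8)
crops = tile_without_overlap(image, tile=64)
train_idx, val_idx = split_train_val(len(crops))
print(len(crops), len(train_idx), len(val_idx))
```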

4 Competitive Attentional Fusion Pyramid Networks

In this section, considering the characteristics of Chinese herbal recognition, we first apply the Feature Pyramid Network (FPN [22]) to this task. Next, we propose a competitive attentional fusion mechanism based on the original FPN to adapt it to the task. Finally, to address the problem of misaligned features, we propose a spatial recalibration method, which is combined with the attentional fusion mechanism.

4.1 Apply FPN to Chinese Herbal Recognition

Figure 2: Feature pyramid. Here we extract feature maps from four levels of ResNet-18 to form a feature pyramid.

Herbal recognition has a particular characteristic: the shapes of some herbs are so distinctive that they are easily classified using the high-resolution features from the lower levels of the network, while some herbs with similar shapes need to be classified by features from the higher levels, which contain more fine-grained semantic information. The feature maps from various layers of the network are shown in Fig. 2. We therefore apply FPN [22] to Chinese herbal recognition, because FPN can fuse multi-hierarchy features with its pyramid structure.
FPN consists of two pathways, a bottom-up pathway and a top-down pathway, together with lateral connections. It builds a feature pyramid with high-level semantics throughout by naturally exploiting a ConvNet's pyramid feature hierarchy. The bottom-up pathway is the feed-forward computation of the backbone ConvNet, which results in a feature hierarchy containing feature maps at several scales with a scaling step of 2. The top-down pathway generates higher-resolution features by upsampling by a factor of 2, starting from the last group of feature maps on the bottom-up pathway. We denote the output of upsampling at level $l$ by $U_l$. Compared with the features on the same level from the bottom-up pathway, these feature maps are spatially coarser but semantically stronger; hence we naturally refer to the bottom-up pathway as the spatial flow and to the top-down pathway as the semantic flow. As described in the design of FPN, the output of the last layer of level $l$, passed through a lateral connection and denoted $L_l$, is merged with the corresponding feature maps of the same size from the top-down pathway, as follows:

$$P_l = L_l + U_l,$$

where $P_l$ will be fed into the next upsampling process. The result is a fusion feature pyramid that has strong semantic and spatial information at all scales.
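The top-down fusion described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions: all levels already share the same channel count (as they would after the lateral connections), nearest-neighbour upsampling stands in for the upsampling operator, and the function names are hypothetical.

```python
import numpy as np

def upsample2x_nearest(x):
    """Upsample a C x H x W array by a factor of 2 (nearest neighbour)."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def top_down_fusion(laterals):
    """Fuse lateral maps from the deepest level down to the shallowest.

    laterals: list of C x H x W arrays ordered shallow -> deep, each level
    half the resolution of the previous one, all with the same channel count.
    Returns the fused maps in the same shallow -> deep order.
    """
    fused = [laterals[-1]]                 # the deepest level starts the top-down pathway
    for lateral in reversed(laterals[:-1]):
        up = upsample2x_nearest(fused[0])  # semantic flow, spatially coarser
        fused.insert(0, lateral + up)      # element-wise addition with the spatial flow
    return fused

rng = np.random.default_rng(0)
# Four hypothetical levels with resolutions 32, 16, 8 and 4.
laterals = [rng.standard_normal((8, 32 // 2**i, 32 // 2**i)) for i in range(4)]
pyramid = top_down_fusion(laterals)
print([p.shape for p in pyramid])
```

Every fused level keeps the resolution of its lateral input, which is why the addition requires the factor-of-2 upsampling first.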

4.2 Competitive Attention between Spatial and Semantic Flows

Figure 3: Overview of Competitive Attentional Fusion Pyramid Network and Competitive Attention modules.

The fusion mode of FPN described above treats the spatial and semantic flows indiscriminately at all scales, which is likely to cause redundancies in the fused features. Intuitively, we propose a competitive attention mechanism that allows the network to explicitly model the competition between the spatial and semantic flows during fusion, so that the network can selectively emphasize the richer semantic or spatial features and suppress redundant ones.
To achieve this, we gather the global information $z^{L}$ and $z^{U}$, embedded from the feature maps $L$ delivered by the lateral connections of the spatial flow and from the upsampled feature maps $U$ of the semantic flow, respectively:

$$z^{L} = F_{gp}(L), \qquad z^{U} = F_{gp}(U),$$

where $F_{gp}(\cdot)$ denotes the operation of global pooling. The combination of $z^{L}$ and $z^{U}$ is used as the joint input of the excitation operation to capture channel-wise dependencies between the spatial and semantic flows:

$$s = F_{ex}([z^{L}, z^{U}]; W) = \sigma\left(W_2\, \delta\left(W_1 [z^{L}, z^{U}]\right)\right),$$

where $[z^{L}, z^{U}]$ refers to the concatenation of the descriptors produced by the above squeeze operation on the two flows, $W_1$ and $W_2$ are the parameters, and $\sigma$ and $\delta$ are the sigmoid and ReLU functions as in SE [16]. The output $s$ of the excitation operator is divided into two parts, $s^{L}$ and $s^{U}$, which rescale the weights of the features $L$ and $U$ respectively, as follows:

$$\tilde{L} = s^{L} \cdot L, \qquad \tilde{U} = s^{U} \cdot U,$$

where $\cdot$ means channel-wise multiplication. The competition between the spatial and semantic flows is modeled by the Competitive Attention module proposed above and acts on each channel of both flows. On the one hand, this way of merging features can be regarded as an adaptive competition between the two flows: the recalibration depends on both flows and dynamically adjusts the complementary weights each gives the other. On the other hand, it expresses a trade-off between the spatial and semantic flows. Finally, the Competitive Attention module is reformulated as:

$$P = s^{L} \cdot L + s^{U} \cdot U.$$
Fig. 3 shows an overview of our Competitive Attentional Fusion Pyramid Network and the details of the Competitive Attention module. The difference between the typical SE block and Competitive Attention is that, based on the particular structure and meaning of FPN, we feed two flows into SE simultaneously to model their channel relationship competitively as a trade-off, and we adjust both flows at the same time.
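The squeeze, joint excitation, split and rescale steps of the Competitive Attention module can be sketched as below. This is a minimal NumPy sketch under assumptions, not the authors' implementation: global average pooling is used as the global pooling, the excitation is taken to be the SE-style FC, ReLU, FC, sigmoid stack, the hidden width is an arbitrary choice, and the rescaled flows are merged by element-wise addition as in FPN.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def competitive_attention(lateral, upsampled, w1, w2):
    """Rescale and merge two C x H x W flows with jointly computed channel weights.

    w1 has shape (hidden, 2C) and w2 has shape (2C, hidden); `hidden` is an
    assumed hyperparameter.
    """
    c = lateral.shape[0]
    # Squeeze: global average pooling per channel on each flow, then concatenate.
    z = np.concatenate([lateral.mean(axis=(1, 2)), upsampled.mean(axis=(1, 2))])
    # Excitation on the joint descriptor: FC -> ReLU -> FC -> sigmoid.
    s = sigmoid(w2 @ np.maximum(w1 @ z, 0.0))
    # Split the 2C weights into one part per flow.
    s_spatial, s_semantic = s[:c], s[c:]
    # Rescale each flow channel-wise, then merge by element-wise addition.
    fused = s_spatial[:, None, None] * lateral + s_semantic[:, None, None] * upsampled
    return fused, s_spatial, s_semantic

rng = np.random.default_rng(0)
C, H, W, hidden = 8, 6, 6, 4  # all sizes are illustrative assumptions
lateral = rng.standard_normal((C, H, W))
upsampled = rng.standard_normal((C, H, W))
w1 = rng.standard_normal((hidden, 2 * C)) * 0.1
w2 = rng.standard_normal((2 * C, hidden)) * 0.1
fused, s_spatial, s_semantic = competitive_attention(lateral, upsampled, w1, w2)
print(fused.shape, s_spatial.shape, s_semantic.shape)
```

Because both sets of weights come from one excitation over the concatenated descriptor, the weight given to a spatial channel depends on the semantic flow and vice versa, which is the difference from applying SE to each flow separately.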

4.3 Spatial Reference Recalibration

Figure 4: Spatial Reference Recalibration module and its combination with Competitive Attention module. The Batch Normalisation [18] (BN) (attached to two conv layers after tensor multiplication) is not shown for brevity. We resize feature maps in SRR module with the factor of 2 by bilinear interpolation.

As discussed above, the upsampled features are spatially coarser before merging because they are the products of several downsampling and upsampling operators. In other words, their spatial information, such as location, is less accurate and even misaligned. That is also why we need fusion features.
However, it should be noted that merging by element-wise addition relies heavily on spatial information, so the fusion features merged in this way are likely to be sub-optimal. Consequently, we introduce a method to spatially recalibrate the misaligned features by modeling pixel-level spatial attention with the spatial flow as a reference. Similarly to Harmonious-Spatial-Attention [21] (HA), we compress the feature maps by global cross-channel average pooling to reduce the parameters of the subsequent conv layer, but unlike HA we model the two flows simultaneously:

$$m^{L} = \frac{1}{C}\sum_{c=1}^{C} L_{c}, \qquad m^{U} = \frac{1}{C}\sum_{c=1}^{C} U_{c},$$

where $L_c$ and $U_c$ are the $c$-th channels of the spatial-flow and semantic-flow feature maps. $m^{L}$ and $m^{U}$ are concatenated and fed into the next conv layer with stride 2, whose output is then resized to the original size by bilinear interpolation with a factor of 2. Finally, we add a scaling conv layer to reduce the aliasing effect of bilinear upsampling. As a result, we obtain 2 feature maps that rescale the values of the features from the two flows at pixel level, respectively. In addition, this mechanism contributes to the robustness of the network, allowing it to use different upsampling methods on the top-down pathway.
To combine the competitive attention with spatial recalibration, we further attach a convolution layer after the tensor multiplication on each of the two flows, since the two processes are not mutually independent. Finally, we apply sigmoid operations for normalisation. More details of the SRR module and its combination with Competitive Attention are shown in Fig. 4.
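The core idea of this section, pooling both flows across channels and producing one pixel-level gate per flow, can be sketched as follows. This is a deliberately simplified sketch: the strided convolution, bilinear resize, scaling convolution and batch normalisation are all replaced by a single hypothetical 2 x 2 mixing matrix (equivalent to a 1x1 convolution over the two pooled maps), so it illustrates the data flow rather than the exact SRR module.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def spatial_reference_gate(lateral, upsampled, mix):
    """Pixel-level gating of two C x H x W flows.

    mix is a 2 x 2 matrix standing in for the conv layers of the module;
    it is an assumption made for this sketch.
    """
    # Cross-channel average pooling: one H x W map per flow.
    pooled = np.stack([lateral.mean(axis=0), upsampled.mean(axis=0)])  # 2 x H x W
    # Mix the two pooled maps at every pixel, so each gate sees both flows.
    logits = np.einsum("oi,ihw->ohw", mix, pooled)
    gates = sigmoid(logits)                                            # 2 x H x W
    # Rescale every channel of each flow with that flow's spatial gate.
    return gates[0] * lateral, gates[1] * upsampled, gates

rng = np.random.default_rng(0)
lateral = rng.standard_normal((4, 6, 6))
upsampled = rng.standard_normal((4, 6, 6))
mix = np.array([[1.0, -1.0], [-1.0, 1.0]])  # illustrative values only
lat_out, up_out, gates = spatial_reference_gate(lateral, upsampled, mix)
print(lat_out.shape, up_out.shape, gates.shape)
```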

5 Experiments

5.1 Implementation Details

For fair comparison, each plain FPN and its corresponding CA, SRR and SRR-CA counterparts are trained with identical optimisation schemes. For the CNH-98 and TCNH-98 datasets, we train all our models with three degrees of data augmentation: no data augmentation, standard data augmentation (+) and mixup [41], an advanced data augmentation technique. On CNH-98, the standard data augmentation (translation/mirroring) is adopted for the training set and a 224x224 crop is randomly sampled. All images are normalized with mean values and standard deviations. When testing, our implementation follows the practice in [16]. On TCNH-98, we follow the standard practice and data augmentation in [13] for CIFAR. All models were trained from scratch with the SGD optimizer using Nesterov momentum of 0.9.
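For readers unfamiliar with mixup [41], the augmentation blends pairs of training examples and their labels with a random coefficient drawn from a Beta distribution. The sketch below is a generic illustration; the value alpha=1.0, the batch shapes and the function name are assumptions, since the hyperparameters used in these experiments are not stated here.

```python
import numpy as np

def mixup_batch(images, onehot_labels, alpha, rng):
    """Blend each example with a randomly chosen partner from the same batch."""
    lam = rng.beta(alpha, alpha)            # mixing coefficient in [0, 1]
    perm = rng.permutation(len(images))     # partner index for every example
    mixed_x = lam * images + (1.0 - lam) * images[perm]
    mixed_y = lam * onehot_labels + (1.0 - lam) * onehot_labels[perm]
    return mixed_x, mixed_y, lam

rng = np.random.default_rng(0)
images = rng.random((4, 3, 8, 8))           # hypothetical batch of 4 small images
labels = np.eye(98)[[0, 5, 5, 97]]          # one-hot labels over 98 herb classes
mixed_x, mixed_y, lam = mixup_batch(images, labels, alpha=1.0, rng=rng)
print(mixed_x.shape, mixed_y.shape)
```

Each mixed label remains a valid distribution over the 98 classes, so the usual cross-entropy loss can be applied to soft targets.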

During training on CNH-98, we train our models with batch size 64, for 300 epochs with standard augmentation and mixup and for 120 epochs with no augmentation. The learning rate is initialized to 0.1 and divided by 5 at epochs 120, 200 and 260 for standard augmentation and mixup, and at epochs 30, 60 and 90 for no augmentation; weight decay is set to 0.0005 and 0.0001 respectively. In particular, models trained with mixup are trained with the traditional strategy for the last 20 epochs.
During training on TCNH-98, our models are trained for 300 epochs with batch size 128; the initial learning rate is 0.1 and is divided by 10 at the 100th, 150th and 200th epochs. We set the weight decay to 0.0001, following [13] for CIFAR. When training without data augmentation, the learning rate is instead divided by 5 at epochs 30, 60 and 90.
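The step schedules described above can be written as one small function. This is a sketch for checking the numbers, with the assumption that each division takes effect at the start of the listed epoch; the function and constant names are hypothetical.

```python
def step_lr(epoch, base_lr, milestones, factor):
    """Learning rate after dividing base_lr by `factor` at each milestone reached."""
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr / (factor ** drops)

# Schedules as stated in the text.
CNH_AUGMENTED = dict(base_lr=0.1, milestones=(120, 200, 260), factor=5)
CNH_NO_AUG = dict(base_lr=0.1, milestones=(30, 60, 90), factor=5)
TCNH_AUGMENTED = dict(base_lr=0.1, milestones=(100, 150, 200), factor=10)

for epoch in (0, 120, 200, 260):
    print(epoch, step_lr(epoch, **CNH_AUGMENTED))
```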

5.2 Results of Chinese Herbal Recognition

Chinese-Herbs

| Model | Backbone depth | Params | CNH-98 (no aug.) | CNH-98+ (standard aug.) | CNH-98 (mixup) |
|---|---|---|---|---|---|
| pre-act ResNet-18 [14] | 18 | 11.7M | 74.5 | 91.7 | 93.3 |
| FPN-pre-act ResNet-18 [22] | 18 | 13.3M | 74.7 | 91.9 | 93.5 |
| FPN-CA-18 (Ours) | 18 | 13.4M | 75.3 | 92.9 | 94.2 |
| FPN-SRR-18 (Ours) | 18 | 13.3M | 72.5 | 92.5 | 93.8 |
| FPN-SRR-CA-18 (Ours) | 18 | 13.8M | 76.8 | 93.5 | 94.1 |
| FPN-pre-act ResNet-34 [22] | 34 | 23.4M | 75.1 | 92.3 | 94.1 |
| FPN-CA-34 (Ours) | 34 | 23.5M | 76.1 | 93.5 | 94.6 |
| FPN-SRR-34 (Ours) | 34 | 23.4M | - | 92.7 | - |
| FPN-SRR-CA-34 (Ours) | 34 | 23.9M | 76.3 | 93.8 | 94.8 |

Tiny-Chinese-Herbs

| Model | Backbone depth | Params | TCNH-98 (no aug.) | TCNH-98+ (standard aug.) | TCNH-98 (mixup) |
|---|---|---|---|---|---|
| pre-act ResNet-20 [14] | 20 | 0.28M | 63.0 | 74.8 | 72.8 |
| FPN-pre-act ResNet-20 [22] | 20 | 0.31M | 63.1 | 75.2 | 72.9 |
| FPN-CA-20 (Ours) | 20 | 0.31M | 62.8 | 75.8 | 73.6 |
| FPN-SRR-20 (Ours) | 20 | 0.31M | - | 75.5 | 73.3 |
| FPN-SRR-CA-20 (Ours) | 20 | 0.31M | 63.8 | 75.8 | 73.7 |
| FPN-pre-act ResNet-56 [22] | 56 | 0.89M | 64.3 | 77.4 | 77.4 |
| FPN-CA-56 (Ours) | 56 | 0.89M | 63.1 | 77.7 | 76.7 |
| FPN-SRR-CA-56 (Ours) | 56 | 0.90M | 62.8 | 77.6 | 77.6 |

Table 1: Accuracy rates (%) of different methods on the CNH-98 and TCNH-98 datasets; the best records of our models are bold. We compare our models with the original FPN and its backbone networks, trained with no data augmentation, standard augmentation (+) or mixup.

We evaluate our methods on the CNH-98 and TCNH-98 datasets with pre-act ResNet [14] as the backbone network. The results of the comparative experiments for FPN with and without the CA and SRR-CA modules are shown in Table 1, and we summarize them as follows:
First of all, as shown in both parts of Table 1, FPN indeed achieves better results than pre-act ResNet on both CNH-98 and TCNH-98, which supports the conjecture in Section 4.1 that FPN is better suited to the task of Chinese herbal recognition. We take the FPN experiments as the baseline. Furthermore, on both CNH-98 and TCNH-98, FPN-CA can outperform the baseline, and FPN-SRR-CA can further improve performance across different depths, or at least maintain it, without many extra parameters.
Secondly, FPN-SRR exceeds FPN in almost every setting except CNH-98 without data augmentation, which demonstrates the effectiveness of the SRR modules in most cases. It also suggests that the CA and SRR modules are not two separate processes but need to be modeled jointly; hence it is reasonable to attach convolution layers after the combination of the SRR and CA modules. As for the performance of SRR on CNH-98 with no augmentation, we infer that overfitting occurs owing to the small size of the CNH-98 dataset. Moreover, on the CNH-98 dataset, FPN-SRR-CA-18 improves validation accuracy over FPN-34 by 1.7% with no augmentation and by 1.2% with standard augmentation, and matches FPN-34 with mixup. In particular, FPN-SRR-CA-18 has a higher accuracy than FPN-SRR-CA-34 with no augmentation. We infer that a depth of 34 is too deep for a small dataset like CNH-98, and that our CA and SRR-CA modules can reduce overfitting and improve the generalization ability of the models, and thus perform better with deeper networks. In contrast, when training with standard augmentation and mixup on TCNH-98, whose training split consists of 40958 images, we notice underfitting for networks of depth 20, which indicates that the representational capacity of models with depth 20 is too limited. Increasing the depth of the networks reduces this phenomenon, showing that deeper networks perform better on this dataset.
Mixup [41] can be seen as an advanced method of data augmentation. However, on the TCNH-98 dataset, models trained with mixup achieve worse results. We argue that mixup, as an augmentation approach, further aggravates underfitting, which naturally leads to a worse result. Because the representation of networks with depth 20 is limited, the TCNH-98 dataset is actually better suited to deeper networks, as shown by the experiments on models with depth 56, which reduce underfitting.

5.3 Further Analysis and Discussion

Figure 5: Top: internal feature maps of an example from four levels on three models: (a) FPN, (b) FPN-CA, (c) FPN-SRR-CA. Middle: the activation values (solid lines for bottom-up pathway via lateral connections and dotted lines for top-down pathway) of competitive attention on (b) and (c). Bottom: the heatmaps of SRR attention.

The analysis in Section 5.2 has shown the effectiveness of the CA and SRR-CA modules. In this section, we discuss the effects of our approaches from an intuitive point of view. The internal feature maps from different levels of three models, FPN-18, FPN-CA-18 and FPN-SRR-CA-18, are shown in the top part of Fig. 5, from which we conclude that our methods strengthen the representation of the networks. Looking at the feature maps, the early layers of FPN extract almost only contour features, whereas with our FPN-CA models the features carry more detailed information, as seen by comparing the feature maps of FPN with and without CA on levels 1 and 2 in Fig. 5. It is worth mentioning that the features extracted by the models with CA modules are sparser and more accurate than those of the original FPN, especially for the feature maps of level 3. Moreover, SRR-CA modules can further spatially recalibrate the misaligned feature maps, mainly for deeper features, as typically shown in level 3 of Fig. 5, which gives the features stronger spatial information and richer semantics. Additionally, we collect statistics on the distributions of the activations of the CA modules in the FPN-CA and FPN-SRR-CA models. The attentional activation values of the CA and SRR modules are very active and distinct, and the heatmap of the SRR modules can reconstruct the distribution of the original image, which suggests that our methods indeed contribute to re-weighting and recalibrating features.
As shown in the distribution of channel-wise attentional outputs, the activation values of features from deeper layers are always uniform and tend to 0.5, because features from deeper layers have already been adjusted during training, so the CA modules perform less adjustment. We notice that the activation values on the deepest level of the spatial flow are almost always higher than those from the semantic flow, while from deep to shallow levels, the activations from the semantic flow gradually stand out in the competition. This confirms our conjecture that high-level features are spatially coarser and strongly semantic, in contrast to low-level features, and it indicates that the proposed mechanism can complement spatial or semantic information according to the requirements of different levels. The analysis of the heatmap activations of the SRR modules leads to the same conclusion. Comparing the channel-wise attentional outputs of FPN-CA and FPN-SRR-CA, the channel-wise activations of FPN-SRR-CA tend to be more stable than those of FPN-CA owing to the effectiveness of SRR, which makes features more accurate, and the effects of SRR are passed through the network.

6 Conclusion

In this paper, we propose the first standard Chinese Herbs dataset for recognition. Based on the characteristics of the Chinese herbal recognition task, we introduce attention mechanisms into pyramid networks to model the channel relationships of features from various levels. Furthermore, we improve channel-wise and spatial attention and propose the competitive attention and spatial reference recalibration modules, which respectively model channel dependencies between the spatial and semantic flows during feature fusion and spatially recalibrate the misaligned feature maps with the spatial flow as a reference. We apply the improved pyramid network to Chinese herbal recognition and evaluate our methods on the proposed CNH-98 and TCNH-98 datasets, obtaining superior performance to the traditional pyramid networks.


Appendix A Details of Chinese Herbs Datasets

a.1 Distributions of Examples in CNH-98 Dataset

| Main Categories | Herbs Examples |
|---|---|
| Fruits & Seeds | Star Anise, Siraitia Grosvenorii, Ginkgo, Chinese Wolfberry, Selfheal, Fructus Arctii, etc. |
| Rhizome | Liquorice, Thorowax Root, Rhizoma Alismatis, Unibract Fritillary Bulb, etc. |
| Flowers | Saffron, Flos Daturae, Cloves, Magnolia, Coltsfoot, Flos Jasmine, Lily, etc. |
| Bark | Cinnamon, Cortex Moutan, Eucommia Ulmoides, etc. |
| Thallophyte | Glossy Ganoderma, Tremella, Cordyceps Sinensis, etc. |
| Whole Herbs | Abrus cantoniensis, Anoectochilus roxburghii, etc. |
| Leaves | Lophatherum Gracile, etc. |
| Resin | Frankincense, Myrrh, etc. |

Table 2: Main categories of CNH dataset and their corresponding examples.

Chinese herbs are usually obtained from natural plants and from parts of fungi and algae. Our Chinese-Herbs Dataset (CNH-98) is a collection of 9184 images of 98 classes, which can be divided into 8 categories: Fruits & Seeds, Rhizome, Flowers, Bark, Thallophyte, Whole Herbs, Leaves and Resin. Examples of each category are shown in Table 2.

Figure 6: Distribution of Chinese Herbs Categories (left) and number of images for each class in CNH-98 (right).

Fig. 6 (left) shows the distribution of the number of Chinese herb classes over the 8 categories. The majority of classes are Fruits & Seeds and Rhizome, with 42 and 32 classes respectively, so the CNH-98 dataset is relatively unbalanced. Moreover, as shown in Fig. 6 (right), the number of images is unbalanced across the 98 classes: the largest class is Amomum Tsaoko with 247 images and the smallest is Chestnut Shell, in Fruits & Seeds, with 14 images.

a.2 Exhibition of Main Categories

Figure 7: Examples of Main Categories in CNH-98 and their Corresponding Examples in TCNH-98. From left to right, the examples on the left in CNH-98 correspond to those on the right in TCNH-98, from top to bottom.

In this section, we show examples of the main categories in CNH-98 and their corresponding crops in TCNH-98 (Fig. 7). Although the examples in TCNH-98 are only local parts, almost every example in TCNH-98 contains at least one herb with a complete shape, thanks to the repetition of herbs within CNH-98 images. Furthermore, the shapes of Chinese herbs in different categories are highly distinctive, while the appearances of different classes within the same category are similar. This is exactly the motivation of our proposed methods: herbs with distinctive shapes can be classified by features from earlier layers of the network, while herbs with similar shapes but different details need to be recognized by more semantic features from deeper levels.

Appendix B Evaluate with Other Upsampling Strategies

| Model | Backbone Depth | Nearest | Deconvolution (# params) | Bilinear* |
|---|---|---|---|---|
| FPN-18 | 18 | 91.0 | 90.9 (16.5M) | 91.9 |
| FPN-34 | 34 | 91.5 | - | 92.3 |
| FPN-SRR-CA-18 (ours) | 18 | 93.4 | 93.0 (16.9M) | 93.5 |
| FPN-SRR-CA-34 (ours) | 34 | 93.6 | - | 93.8 |
| FPN-SRR-18 (ours) | 18 | 92.3 | 92.3 (16.5M) | 92.5 |
| FPN-SRR-34 (ours) | 34 | 92.7 | - | 92.7 |

Table 3: Accuracy rates (%). Compare our models (*) with FPN using different upsampling strategies.

To validate the robustness of our models to different upsampling strategies, as mentioned in Section 4.3, we evaluate our methods with other upsampling methods, including nearest neighbor and deconvolution. The results are shown in Table 3. The accuracy rates of FPN fluctuate more than those of our models: the maximum discrepancy is 1% for FPN but only 0.1% to 0.5% for our models. Even when using only the spatial attention SRR modules, our models are more stable, which shows that the SRR modules help the network adapt to various upsampling strategies and perform more robustly.
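The spread of accuracy across upsampling strategies can be checked directly from the Table 3 values. The short sketch below only re-computes numbers already in the table; missing deconvolution entries are simply left out, and the dictionary and function names are hypothetical.

```python
# Accuracy (%) per upsampling strategy, copied from Table 3 (missing entries omitted).
TABLE3 = {
    "FPN-18": [91.0, 90.9, 91.9],
    "FPN-34": [91.5, 92.3],
    "FPN-SRR-CA-18": [93.4, 93.0, 93.5],
    "FPN-SRR-CA-34": [93.6, 93.8],
    "FPN-SRR-18": [92.3, 92.3, 92.5],
    "FPN-SRR-34": [92.7, 92.7],
}

def accuracy_spread(values):
    """Difference between the best and worst accuracy, rounded to one decimal."""
    return round(max(values) - min(values), 1)

spreads = {model: accuracy_spread(acc) for model, acc in TABLE3.items()}
for model, spread in spreads.items():
    print(f"{model}: {spread}")
```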

Appendix C Further Analysis of Intermediate Results in FPN-CA/SRR-CA

Figure 8: Flos Chrysanthemum: intermediate features on various channels from 4 levels of models FPN-CA/SRR-CA-18, and their heatmaps of SRR Attention. Features on level 4 are the initial features of the top-down pathway, thus they are not fusion features. For each block (except for level 4), the top (-) are not re-scaled by attention and the bottom shows the features reweighted on the corresponding channel.
Figure 9: Szechuan Lovage Rhizome: intermediate features on various channels from 4 levels and their heatmaps of SRR Attention. For each block (except for level 4), the top (-) are not re-scaled by attention and the bottom shows the features reweighted on the corresponding channel.
Figure 10: Selfheal: intermediate features on various channels from 4 levels and their heatmaps of SRR Attention. For each block (except for level 4), the top (-) are not re-scaled by attention and the bottom shows the features reweighted on the corresponding channel.
Figure 11: Outputs of Competitive Attention on models FPN-CA/SRR-CA (solid lines for bottom-up pathway via lateral connections and dotted lines for top-down pathway).

In this section, we extract the intermediate features from our models FPN-CA/SRR-CA-18 with ResNet-18 as the backbone network. We define the layers producing output maps of the same size as one pyramid level; the features extracted from the last layer of each level are shown in Figs. 8-10 for three examples. Additionally, we collect statistics on the activation values of competitive attention from the two pathways during feature merging, together with their spatial attention heatmaps.
Intuitively, we can see that the information of some features on many channels is suppressed, either by re-scaling with a small weight or by retaining more local features, and with this adjustment the models achieve better performance. This confirms our inference in Section 4.2 that the fusion method of the original FPN leads to redundancies in feature maps, which is also one of our motivations for proposing the attention mechanism. Moreover, the attended regions are clearly visible, such as the serrated petal shape of Flos Chrysanthemum in Fig. 8, and some fuzzy features are recalibrated spatially and presented more clearly.
These changes mostly occur at the low levels of the network for both FPN-CA and FPN-SRR-CA, while features from the high levels are not adjusted much, as shown in level 3 of Figs. 8-10. This can be verified by the activation value statistics in Fig. 11. The activation values of levels 4 to 3 always stay at about 0.5 and fluctuate slightly, because the features of the spatial and semantic flows before merging are extracted from the deep layers and are already adjusted enough. At the low levels, however, features from the semantic flow are more active and those from the spatial flow are sparser, which reflects the high information density of the semantic flow, which is more beneficial to classification. Furthermore, a majority of the features from the spatial flow, which have weak classification ability, are redundant and suppressed, and only a small part of them is selected to supplement the semantic flow.
Compared with the activation values of competitive attention in FPN-CA, the features on various channels of FPN-SRR-CA are less suppressed. We infer that the SRR modules help restore the spatial information of misaligned features, which results in a higher information density in the semantic flow; hence its representation is more active (the activation values of CA are almost all non-zero). This reflects that the SRR-CA module is more cautious when reducing redundancies in feature maps.
As shown in the heatmaps of the SRR attention modules, the attention outputs of different regions are clearly distinct, and the absolute values of the target activations are usually larger. However, for the examples of Flos Chrysanthemum in Fig. 8 of the appendix and Unibract Fritillary Bulb in Fig. 5 of the main text, the SRR attention focuses more on the background at level 1. We infer that the activation values of SRR attention are closely related to the original images, especially at the low levels of the network. The low-level features can closely restore the original images and are more sensitive to colors. Therefore, because of the dark background colors, the absolute values of the background activations are larger than those of the target. Despite this, SRR attention plays a role in distinguishing different regions and recalibrates the misaligned features.