Plants play an irreplaceable role in our world and directly affect many domains, such as agriculture, climate, and ecological systems. They are also the main source of food for human survival and development. Many problems, such as habitat degradation, global warming, ecosystem destruction, environmental deterioration, and species extinction, are related to plant protection. Plant species identification is the prerequisite for protection, and there has been much research on the issue. Image classification is now considered a way to improve plant taxonomy. It is one of the most promising solutions among the related research, as discussed in goeau2016plant, and it has long been an active research topic.
Because the flowers and fruits of plants are seasonal, some researchers consider leaves more suitable for identification. Early computer-aided plant species classification frequently used leaves, and most image-based identification methods and evaluation data were based on leaf images kumar2012leafsnap ; backes2009plant ; cerutti2011parametric. However, most leaf images at that time were specimens or scans, and the way to acquire samples was strict. Afterwards, flowers began to be employed nilsback2008automated ; angelova2012development ; nilsback2006visual.
Clearly, approaches that use only leaves or flowers are insufficient for realistic plant identification and protection. More diverse parts of plants have to be considered for accurate identification, especially because leaves are not visible on plants all year round. Compared with photos taken in realistic conditions, the images in those datasets have simple backgrounds, because the camera is close to the target when the picture is taken. We believe that these are not real-world identification tasks. In this paper, we focus on plant species identification, especially realistic recognition. For a real-world plant identification task, plant image samples should cover many parts, such as fruits, branches, and the entire plant, in addition to leaves and flowers. At the same time, the way plant images are created and acquired should not be strict: they can be taken by different users at different times, as the users see fit. Image samples can have complicated backgrounds, and the scenes include both indoor and field settings. We believe that these conditions characterize real-world plant identification. The task is more challenging than traditional species identification, but also more valuable. Only real-world species recognition can enable better, more convenient, and more comprehensive plant protection.
In the last few years, a number of projects and organizations such as iNaturalist and Botanica have generated large amounts of biodiversity data. Big biodiversity data is now easier to obtain than in the past agarwal2006first, because it is convenient for the vast majority of people to take plant images with mobile phones. The way plant images are created and acquired has become easier and closer to a real-world scenario. Besides, people like to share the images and chat with each other on their personal social networks.
In this paper, we first review related work on plant species recognition and discuss commonly used approaches that rely on leaves and flowers for traditional species identification. We then propose a novel framework and an effective data augmentation method for realistic recognition. Because deep convolutional neural networks (CNNs) provide a useful tool for large-scale image classification, we base our work on deep learning. As stated in Yarbus1967eye ; neisser1967cognitive, human visual attention focuses on the salient objects that are to be recognized in an image. We crop the image according to visual attention and name the operation attention cropping (AC). AC is accomplished with a saliency map generated by a saliency detection approach. Data augmentation is well known to be an important operation in deep learning, and here we apply AC as a data augmentation method. The schematic diagram of our framework is shown in Figure 1. We validate the proposed approach through a series of comparisons, and the results show that it outperforms the compared methods.
2 Related work
There are many plant identification approaches that use digital images. As stated above, early algorithms mainly work with leaves. The Flavia dataset Flavia and the Swedish leaf database Swedish are two typical leaf datasets. Samples of both are shown in Figure 2. wu2007leaf employed a Probabilistic Neural Network (PNN) with image and data processing techniques to implement general-purpose automated leaf recognition for plant classification. Using an Artificial Neural Network (ANN), soderkvist2001computer designed a computer vision classifier to identify different Swedish tree classes from their leaves. Many methods employ shape or curvature features, as plants are basically classified according to the shapes of their leaves bai2010learning ; neto2006plant ; du2006computer ; wang2014hierarchical ; ling2007shape. According to the theory of plant shape taxonomy, shape and curvature features are relatively discriminative for leaf images. They are efficient especially when the image content is extremely simple. Two-dimensional multifractal detrended fluctuation analysis is used for plant classification in wang2015two.
Afterwards, flower image samples began to be employed. The Oxford flower dataset is used in many related studies nilsback2006visual ; nilsback2007delving ; nilsback2008automated ; nilsback2010delving. nilsback2006visual developed a visual vocabulary that explicitly represents the various aspects (colour, shape, and texture) that distinguish one flower from another. nilsback2008automated investigated to what extent combinations of features can improve classification performance on a large dataset of flower classes. Samples of the Oxford flower dataset are shown in the last row of Figure 2. The works of nilsback2007delving and nilsback2010delving are similar: both focus on algorithms for automatically segmenting flowers in colour photographs.
As stated above, more parts of a plant, such as flowers, leaves, fruits, branches, and stems, should be used for realistic plant species recognition. The image data can be collected by many different contributors, with different cameras, in different areas and periods of the year, from different individual plants, and so on. Recently, an image-based plant identification dataset called PlantCLEF was introduced. It is close to real-world conditions, and our work in this paper uses it. Samples of the dataset are shown in Figure 3. The schematic diagram of our proposed method is shown in Figure 1. We describe our approach in detail in the next section.
3.1 Attention cropping
The background of images taken in real-world conditions is usually complicated. A real-world plant image contains more than one object, i.e. target plants and background objects (small stones, ruderals, branches, non-target leaves, and other interfering objects). Moreover, target plants may touch or cover the background objects. However, what is to be recognized in an image is the salient object that we pay attention to. Visual attention facilitates our ability to rapidly locate the most important information in a scene Yarbus1967eye, and for a given object our attention focuses at first sight on the most useful points neisser1967cognitive. As demonstrated in Figure 4 (a), the centered object indicated with a red box is the real target of interest and is to be recognized. The object at the bottom left, boxed with a black rectangle, is not an object for recognition and should be neglected. In addition to such non-target objects, there are also interfering objects and redundant content of no value, as demonstrated in Figure 4 (b) and (c). Although an image has much content, the object is recognized from only a sketchy, concentrated impression held in the mind. Other non-salient parts are neglected or ignored, and we are not even aware of the redundant content during the first judgement.
Here, the salient regions that we attend to are obtained with saliency detection. We obtain the image saliency map using the approach described in li2013visual, which shows that convolving the image amplitude spectrum with a low-pass Gaussian kernel is equivalent to an image saliency detector:

$$S = \mathcal{F}^{-1}\left\{ A_S(u, v)\, e^{i P(u, v)} \right\} \qquad (1)$$

where $P(u, v)$ is the original phase spectrum and $A_S(u, v)$ is the resulting smoothed amplitude. $A_S(u, v)$ is as follows:

$$A_S(u, v) = \left| \mathcal{F}\{ f(x, y) \} \right| * g \qquad (2)$$

Here $f(x, y)$ is the input image, $\mathcal{F}$ and $\mathcal{F}^{-1}$ denote the Fourier transform and its inverse, $g$ is the low-pass Gaussian kernel, and $*$ denotes convolution.

The saliency map is obtained by reconstructing the 2D signal from the original phase and the smoothed amplitude spectrum. The scale of the low-pass Gaussian kernel is selected by minimizing the entropy of the saliency map. Once the proper scale is specified, the smoothed amplitude $A_S(u, v)$ is computed according to formula (2). The Hypercomplex Fourier Transform (HFT) replaces the standard Fourier Transform (FT) to perform the analysis in the frequency domain. After the phase spectrum $P(u, v)$ and the smoothed amplitude $A_S(u, v)$ are computed, the saliency map is obtained from formula (1). The Gaussian kernel scale can also be set by hand, but to get a better saliency map we find a proper scale by spectrum scale-space analysis. The resulting saliency map is then used for segmentation.
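The spectrum-smoothing idea described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the implementation used in the paper: it works on a single grayscale channel with the standard FFT instead of the Hypercomplex Fourier Transform, and the function names, the candidate scales, the output smoothing, and the histogram-based entropy estimate are all hypothetical choices.

```python
import numpy as np

def gaussian_kernel_2d(shape, sigma):
    """Centered 2-D Gaussian of the given shape, normalized to sum to 1."""
    h, w = shape
    y = np.arange(h) - h // 2
    x = np.arange(w) - w // 2
    yy, xx = np.meshgrid(y, x, indexing="ij")
    g = np.exp(-(xx ** 2 + yy ** 2) / (2.0 * sigma ** 2))
    return g / g.sum()

def circular_smooth(a, sigma):
    """Circular convolution of `a` with a Gaussian, computed via the FFT."""
    g = np.fft.ifftshift(gaussian_kernel_2d(a.shape, sigma))  # peak moved to [0, 0]
    return np.real(np.fft.ifft2(np.fft.fft2(a) * np.fft.fft2(g)))

def spectral_saliency(gray, sigma_amp=3.0, sigma_out=2.0):
    """Smooth the amplitude spectrum, keep the original phase, reconstruct."""
    spectrum = np.fft.fft2(gray)
    amplitude = np.abs(spectrum)
    phase = np.angle(spectrum)
    smoothed = circular_smooth(amplitude, sigma_amp)       # low-pass on the amplitude
    recon = np.fft.ifft2(smoothed * np.exp(1j * phase))    # original phase is reused
    sal = circular_smooth(np.abs(recon) ** 2, sigma_out)   # assumed post-smoothing
    sal -= sal.min()
    peak = sal.max()
    return sal / peak if peak > 0 else sal                 # rescaled to [0, 1]

def entropy(sal, bins=64):
    """Shannon entropy of the saliency histogram (assumed estimator)."""
    hist, _ = np.histogram(sal, bins=bins, range=(0.0, 1.0))
    p = hist[hist > 0] / hist.sum()
    return float(-(p * np.log2(p)).sum())

def best_scale_saliency(gray, sigmas=(1.0, 2.0, 4.0, 8.0, 16.0)):
    """Pick the candidate kernel scale whose saliency map has minimum entropy."""
    maps = [spectral_saliency(gray, s) for s in sigmas]
    return min(maps, key=entropy)
```

Calling `best_scale_saliency(gray)` on a 2-D float array returns a map of the same shape with values in [0, 1], which could then be passed on to a segmentation step.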
After that, image segmentation is carried out to generate the regions of interest (ROI) for recognition. K-means is used to perform the segmentation. Given a set of observations $(x_1, x_2, \ldots, x_n)$, where each observation is a $d$-dimensional real vector, k-means clustering aims to partition the $n$ observations into $k$ ($\le n$) sets $S = \{S_1, S_2, \ldots, S_k\}$ so as to minimize the within-cluster sum of squares (WCSS), i.e. the variance. Formally, the objective is to find:

$$\arg\min_{S} \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^2$$

where $\mu_i$ is the mean of the points in $S_i$. Cropping is then performed according to the segmentation results. It takes the segmentation result vector and a threshold th, which controls the degree of cropping, and yields the start and end positions of the target area. th is obtained by multiplying the cluster number $k$ by a ratio parameter, namely the ratio of the clusters to be cropped out to the total number of clusters.

The cluster number and the ratio parameter are set empirically in this paper; other methods could choose them adaptively. From the computed start and end positions we get the corresponding coordinates of the ROI. Finally, the original image is cropped at these coordinates to obtain the attended image regions. Non-salient parts such as distant background, indistinct surroundings, and corners are cropped out. Taking a flower sample (Anemone nemorosa L.) as an example, we illustrate attention cropping, including the intermediate results, in Figure 5. In addition, comparisons between original images and final attention cropping results are shown in Figure 6.
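A minimal sketch of the segmentation and cropping step follows. Several details are assumptions made for illustration rather than the procedure of the paper: k-means is run on scalar saliency values, the threshold th is taken as the integer part of the cluster number times the ratio, the th least salient clusters are discarded, and the crop is the bounding box of the remaining pixels. The names `kmeans_1d` and `attention_crop` and the default values of `k` and `ratio` are hypothetical.

```python
import numpy as np

def kmeans_1d(values, k, iters=20):
    """Lloyd's algorithm on scalar values; returns (labels, centers)."""
    values = np.asarray(values, dtype=float)
    # Deterministic initialisation: k evenly spaced quantiles of the data.
    centers = np.quantile(values, np.linspace(0.0, 1.0, k))
    for _ in range(iters):
        labels = np.argmin(np.abs(values[:, None] - centers[None, :]), axis=1)
        for j in range(k):
            members = values[labels == j]
            if members.size:                      # empty clusters keep their center
                centers[j] = members.mean()
    labels = np.argmin(np.abs(values[:, None] - centers[None, :]), axis=1)
    return labels, centers

def attention_crop(image, saliency, k=5, ratio=0.4):
    """Crop `image` to the bounding box of the most salient clusters."""
    labels, centers = kmeans_1d(saliency.ravel(), k)
    order = np.argsort(centers)                   # least salient cluster first
    th = min(k - 1, max(0, int(k * ratio)))       # number of clusters to crop out
    keep = order[th:]
    mask = np.isin(labels, keep).reshape(saliency.shape)
    rows = np.flatnonzero(mask.any(axis=1))
    cols = np.flatnonzero(mask.any(axis=0))
    if rows.size == 0 or cols.size == 0:          # nothing salient: keep the image
        return image
    return image[rows[0]:rows[-1] + 1, cols[0]:cols[-1] + 1]
```

With `ratio=0` nothing is discarded and the full image is returned; larger ratios give tighter crops around the most salient region.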
3.2 Convolutional Neural Network
Convolutional neural networks (CNNs) demonstrate impressive results in image classification, and we adopt a CNN for classification in this paper. A CNN takes the raw image as input and produces the image category as output, forming an end-to-end system, and it learns image features during training. A CNN consists of three types of layers: convolutional layers, pooling layers, and at least one final fully connected layer. The outputs are generally normalized with a Softmax activation function and therefore approximate posterior class probabilities. For a given output feature map $x = (x_1, \ldots, x_N)$, the Softmax activation is as follows:

$$p_j = \frac{e^{x_j}}{\sum_{k=1}^{N} e^{x_k}}$$

where $p_j$ denotes the probability of $x$ mapping to class $j$, subject to the constraints $\sum_{j=1}^{N} p_j = 1$ and $0 \le p_j \le 1$.
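As a small illustration of the Softmax normalization and its constraints, the following sketch computes class probabilities from a list of raw scores. The function name is hypothetical, and subtracting the maximum score is a standard numerical-stability step that is assumed here and not mentioned in the text.

```python
import math

def softmax(scores):
    """Map raw class scores to probabilities that sum to 1."""
    m = max(scores)                              # shift for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([1.0, 2.0, 3.0])
predicted_class = max(range(len(probs)), key=probs.__getitem__)
```

Here the largest score receives the largest probability, so `predicted_class` is the index of the third entry.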
4 Experiments and analysis
We first evaluate the performance of our method on PlantCLEF. The dataset is composed of about one hundred thousand pictures belonging to 1000 species. Each picture belongs to one and only one of the seven types of views reported in the metadata: entire plant, fruit, leaf, flower, stem, branch, and leaf scan.
An originality of PlantCLEF is that its "social nature" makes it closer to the conditions of a real-world identification scenario: (i) images of the same species come from distinct plants living in distinct areas, (ii) pictures are taken by different users who might not use the same image acquisition protocol, and (iii) pictures are taken at different periods of the year.
4.1 Experimental setup
We then apply the proposed framework and the attention cropping (AC) data augmentation method. The experiments are conducted with two NVIDIA GeForce GTX 1080 GPUs. The deep learning software tool is PyTorch. MATLAB is also used in our experiments: the attention cropping preprocessing is implemented in MATLAB. For simplicity, the total cluster number and the cropping ratio are set to fixed values.
Two neural networks, ResNet50 and Inception v3, are chosen to evaluate the performance of the proposed framework. ResNet50 is a residual network. Residual networks are similar to VGG but learn residual functions with reference to the layer inputs, which eases the training of deeper neural networks; they are therefore not purely sequential models. ResNet50 has a depth of 50 layers, and its major kernel size is 3×3. Inception v3 introduces inception modules, which help increase the width of the network. An inception module is a convolutional block composed of different kinds of convolutional kernels. Apart from the commonly used 3×3 kernel, other sizes such as 1×7, 7×7, 1×1, and 1×3 are also adopted. Large and small convolutional kernels are used together in one block: large kernels and feature maps come with a small number of kernels, while small kernels and feature maps come with a large number of kernels.
Besides, experiments without AC are also carried out to demonstrate the effectiveness of the proposed data augmentation method. For training efficiency, the image preprocessing is carried out off-line, so there are two training datasets in total: the original one and the one obtained with AC. Training then follows the normal procedure. We set the batch size to 64 and decay the learning rate by a factor of 10 every 30 epochs. The initial learning rate is 0.1 and the total number of epochs is 90. Momentum is set to 0.9, and weight decay is also applied.
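The learning-rate schedule described above (initial rate 0.1, divided by 10 every 30 epochs, 90 epochs in total) can be written as a one-line function. The sketch below is plain Python with a hypothetical function name; in PyTorch the same behaviour is usually obtained with `torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)`.

```python
def step_lr(epoch, base_lr=0.1, step=30, gamma=0.1):
    """Learning rate at a given epoch under step decay."""
    return base_lr * gamma ** (epoch // step)

# Learning rate for each of the 90 training epochs.
schedule = [step_lr(epoch) for epoch in range(90)]
```

The rate is therefore about 0.1 for epochs 0 to 29, 0.01 for epochs 30 to 59, and 0.001 for epochs 60 to 89.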
The input of a deep learning network generally requires images of the same size. All images are resized to the given size after data augmentation such as flipping and random cropping, which are widely used to train networks in the vision community. In random cropping, a crop with a random size of 0.08 to 1.0 of the original size and a random aspect ratio of 3/4 to 4/3 of the original aspect ratio is made. In contrast, AC data augmentation follows visual cognition: it focuses on the more important and interesting information. The crop is finally resized to the input size of the respective network: 224×224×3 for ResNet50 and 299×299×3 for Inception v3.
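For comparison with AC, the conventional random cropping step can be sketched as a sampler of crop boxes. The sketch follows the commonly used recipe (area fraction drawn from 0.08 to 1.0, aspect ratio drawn log-uniformly from 3/4 to 4/3, a few retries, then a central fallback); the function name, the number of attempts, and the fallback rule are assumptions, and the aspect ratio is sampled as an absolute width-to-height ratio.

```python
import math
import random

def sample_crop_box(width, height, scale=(0.08, 1.0), ratio=(3 / 4, 4 / 3),
                    attempts=10, rng=random):
    """Return (left, top, crop_width, crop_height) of a random crop."""
    area = width * height
    log_lo, log_hi = math.log(ratio[0]), math.log(ratio[1])
    for _ in range(attempts):
        target_area = area * rng.uniform(scale[0], scale[1])
        aspect = math.exp(rng.uniform(log_lo, log_hi))   # log-uniform aspect ratio
        w = int(round(math.sqrt(target_area * aspect)))
        h = int(round(math.sqrt(target_area / aspect)))
        if 0 < w <= width and 0 < h <= height:
            left = rng.randint(0, width - w)
            top = rng.randint(0, height - h)
            return left, top, w, h
    # Fallback when no sampled box fits: central square crop.
    side = min(width, height)
    return (width - side) // 2, (height - side) // 2, side, side
```

The returned box would then be cut out and resized to the network input size, for example 224×224 for ResNet50. Unlike AC, the box position here does not depend on the image content.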
All samples are then shuffled for training. The number of neurons in the last fully connected (FC) layer is set to N, the number of categories. The loss function is the cross-entropy loss, formulated as follows:

$$L = -\sum_{i=1}^{N} y_i \log p_i$$

where $p_i$ is the predicted probability of the input belonging to class $i$ and $y$ is the ground truth distribution.
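A short sketch of the cross-entropy loss for a single sample is given below. The function name and the small clipping constant `eps`, which guards against taking the logarithm of zero, are assumptions for illustration.

```python
import math

def cross_entropy(probs, target, eps=1e-12):
    """Cross-entropy between predicted probabilities and a target distribution."""
    return -sum(t * math.log(max(p, eps)) for p, t in zip(probs, target))

# One-hot ground truth for class 2 and a prediction that gives it probability 0.5.
loss = cross_entropy([0.25, 0.25, 0.5], [0.0, 0.0, 1.0])
```

With a one-hot target the loss reduces to the negative log-probability assigned to the true class, so a more confident correct prediction gives a smaller loss.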
The learning method is stochastic gradient descent (SGD), and it is kept the same for both deep learning networks. Consider the problem of minimizing an objective function that has the form of a sum:

$$Q(w) = \frac{1}{n} \sum_{i=1}^{n} Q_i(w)$$

where the parameter $w$ which minimizes $Q(w)$ is to be estimated. $Q$ is the loss or cost function. Each summand function $Q_i$ is typically associated with the $i$-th observation in the data set (used for training). The gradient descent method would perform the following iterations:

$$w := w - \eta \nabla Q(w) = w - \frac{\eta}{n} \sum_{i=1}^{n} \nabla Q_i(w)$$

where $\eta$ is a step size (sometimes called the learning rate in machine learning). In stochastic (or "on-line") gradient descent, the true gradient of $Q(w)$ is approximated by the gradient at a single example in each iteration, $w := w - \eta \nabla Q_i(w)$. The example is selected randomly (or the data are shuffled), instead of processing all examples as a single group (as in standard gradient descent) or in the order in which they appear in the training set.
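The single-example update can be illustrated on a toy least-squares problem. The sketch below is not the training code of the paper: the data, the step size, the number of epochs, and the function names are made up for illustration, with each summand taken as the squared error of one observation.

```python
import random

def sgd(grad_i, w0, n, lr=0.05, epochs=50, seed=0):
    """Plain SGD: one parameter update per training example, reshuffled each epoch."""
    rng = random.Random(seed)
    w = w0
    order = list(range(n))
    for _ in range(epochs):
        rng.shuffle(order)                 # visit examples in random order
        for i in order:
            w = w - lr * grad_i(w, i)      # step along the single-example gradient
    return w

# Toy data generated by y = 2 * x, with summand Q_i(w) = (w * x_i - y_i) ** 2.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

def grad_i(w, i):
    """Derivative of Q_i with respect to w."""
    return 2.0 * (w * xs[i] - ys[i]) * xs[i]

w_hat = sgd(grad_i, w0=0.0, n=len(xs))
```

On this noise-free problem the estimate `w_hat` converges to the true slope of 2. In practice the update is usually computed on mini-batches rather than single examples.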
4.2 Results and analysis
Attention cropping (AC) is carried out after the parameters are set. Apart from flowers, AC also performs well on other view types such as leaf, leaf scan, and fruit. The AC pipelines for more species and view types, including leaf, fruit, leaf scan, and entire plant, are demonstrated in Figure 7.
AC results for different plant species and view types are shown in Figure 8. There are several scenarios, and they differ from each other. The results show that the targets and foregrounds are preserved, while the distant surroundings and redundant content are cropped out after attention cropping. Although the scenarios vary and the background of realistic images is complicated, AC helps locate the most important information, i.e. what we are most interested in, for real-world identification.
The final identification is performed using pre-trained deep convolutional neural networks. The accuracy of different methods on the PlantCLEF test set is shown in Table 1. Attention cropping is abbreviated as "AC".
Table 1: Accuracy of different methods on the PlantCLEF test set.
|Method|Accuracy|
|---|---|
|Reyes et al. reyes2015fine|0.486|
|Ours (ResNet50 without AC)|0.636|
|Ours (Inception v3 without AC)|0.653|
|Ours (ResNet50 with AC)|0.680|
|Ours (Inception v3 with AC)|0.695|
Among the compared methods, an early one uses the kernel descriptor (KDES) for feature extraction; champ2015comparative and ge2015content also report results on this task. choi2015plant uses five complementary CNN classifiers and combines the image classification results with the Borda-fuse method. From Table 1, we first observe that the precision of methods using CNNs is higher than that of methods using hand-crafted features, which confirms the superiority of deep learning approaches. State-of-the-art results are obtained with our proposed approach: the accuracies of "ResNet50+AC" and "Inception v3+AC" are higher than the others. Turning to another comparison, the results of CNNs using attention cropping augmentation are better than those of CNNs without it, and the improvement is observed for both models. From Table 1, the improvement is about 4 percentage points, which indicates that the proposed data augmentation method outperforms the original models by a large margin. With attention cropping, "ResNet50+AC" improves accuracy by 4.4 percentage points over the original ResNet50 (0.636 to 0.680), and "Inception v3+AC" exceeds the original Inception v3 by 4.2 percentage points (0.653 to 0.695). The results show that attention cropping is an effective data augmentation method that improves performance substantially.
Table 2: Accuracy of different methods on the Oxford flower dataset.
|Method|Accuracy|
|---|---|
|Angelova and Zhu angelova2013efficient|0.806|
|Selective joint fine-tuning ge2017borrowing|0.947|
|Ours (ResNet50 without AC)|0.924|
|Ours (Inception v3 without AC)|0.928|
|Ours (ResNet50 with AC)|0.947|
|Ours (Inception v3 with AC)|0.951|
In addition to the dataset for real-world plant species recognition, we provide supplementary experiments on the Oxford flower dataset, which represents traditional plant species recognition. Oxford flower consists of 102 categories of flowers common in the UK. Compared with the dataset for real-world identification, the background in Oxford flower samples is simple and the objects to be recognized nearly fill the image. The identification task is easy, and the accuracy can be high even with common methods. The results are shown in Table 2.
lihua2015two proposes a two-layer local constrained sparse coding architecture. yoo2015multi introduces multi-scale pyramid pooling and adds a Fisher-kernel-based pooling layer on top of a pre-trained CNN. rippel2015metric also reports accuracy on this dataset. ge2017borrowing introduces selective joint fine-tuning and reaches an accuracy of 0.947. Our results demonstrate new state-of-the-art performance, 0.951, on Oxford 102 flowers. The effectiveness of AC is also evident in Table 2: with AC, the accuracy exceeds that of the original model by about 2.3 percentage points on Oxford flowers, for both ResNet50 (0.924 to 0.947) and Inception v3 (0.928 to 0.951). This indicates that our scheme improves performance on different types of datasets. Note that AC has a greater advantage for real-world recognition than for conventional recognition, where the background is simple; the improvement is quite significant for the former. One reason is that AC makes "hard" samples in real-world recognition "easy", whereas the samples in conventional recognition are "easy" to begin with.
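The gains attributed to AC can be recomputed directly from the accuracies listed in Table 1 and Table 2. The short script below does only this arithmetic; the dictionary layout and the function name are chosen for illustration.

```python
# Accuracies (without AC, with AC) copied from Table 1 and Table 2.
results = {
    "PlantCLEF": {"ResNet50": (0.636, 0.680), "Inception v3": (0.653, 0.695)},
    "Oxford flower": {"ResNet50": (0.924, 0.947), "Inception v3": (0.928, 0.951)},
}

def gain_points(without_ac, with_ac):
    """Absolute accuracy gain in percentage points, rounded to one decimal."""
    return round((with_ac - without_ac) * 100, 1)

gains = {
    dataset: {model: gain_points(*acc) for model, acc in models.items()}
    for dataset, models in results.items()
}
```

The gains on PlantCLEF (4.4 and 4.2 points) are roughly twice those on Oxford flower (2.3 points for both models), which matches the observation that AC helps more when the background is complicated.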
In this paper, we address the real-world species recognition task, which is more challenging and more meaningful than traditional identification. Based on deep learning and visual attention, a novel recognition scheme is proposed: images are cropped according to visual attention before recognition. AC helps focus on the real target of interest and removes interference, and we apply it as a data augmentation method. Although the deep learning community has many data augmentation methods, to our knowledge this is the first time images are cropped according to visual attention for data augmentation. Extensive comparative experiments are carried out on different types of datasets, including Oxford flower, a traditional dataset, and PlantCLEF, a dataset for real-world identification. The experiments show new state-of-the-art results. More importantly, the comparisons indicate that AC yields a clear improvement, which is quite significant especially in realistic identification. It is worth mentioning that AC can be applied to other recognition tasks and application scenarios in the vision community, although we mainly focus on real-world plant species recognition in this paper.
With regard to future work, it would be interesting to investigate how a recognition system can handle unknown, never-seen categories. New machine learning technologies are likely to keep playing a role in the interdisciplinary research field of species recognition, including real-world species identification, over the next few years.
The work described in this paper is supported by the National Natural Science Foundation of China (NSFC) under Grant 61771346.
- (1) H. Goëau, P. Bonnet, A. Joly, Plant identification in an open-world (LifeCLEF 2016), in: Working Notes of CLEF 2016 - Conference and Labs of the Evaluation Forum, Évora, Portugal, 5-8 September 2016, pp. 428–439.
- (2) N. Kumar, P. N. Belhumeur, A. Biswas, D. W. Jacobs, W. J. Kress, I. C. Lopez, J. V. Soares, Leafsnap: A computer vision system for automatic plant species identification, in: Computer Vision–ECCV 2012, Springer, 2012, pp. 502–516.
- (3) A. R. Backes, D. Casanova, O. M. Bruno, Plant leaf identification based on volumetric fractal dimension, International Journal of Pattern Recognition and Artificial Intelligence 23 (06) (2009) 1145–1160.
- (4) G. Cerutti, L. Tougne, A. Vacavant, D. Coquin, A parametric active polygon for leaf segmentation and shape estimation, Advances in Visual Computing (2011) 202–213.
- (5) M.-E. Nilsback, A. Zisserman, Automated flower classification over a large number of classes, in: Computer Vision, Graphics & Image Processing, 2008. ICVGIP’08. Sixth Indian Conference on, IEEE, 2008, pp. 722–729.
- (6) A. Angelova, S. Zhu, Y. Lin, J. Wong, C. Shpecht, Development and deployment of a large-scale flower recognition mobile app, NEC Labs America Technical Report.
- (7) M.-E. Nilsback, A. Zisserman, A visual vocabulary for flower classification, in: Computer Vision and Pattern Recognition, 2006 IEEE Computer Society Conference on, Vol. 2, IEEE, 2006, pp. 1447–1454.
- (8) G. Agarwal, P. Belhumeur, S. Feiner, D. Jacobs, W. J. Kress, R. Ramamoorthi, N. A. Bourg, N. Dixit, H. Ling, D. Mahajan, et al., First steps toward an electronic field guide for plants, Taxon 55 (3) (2006) 597–610.
- (9) A. L. Yarbus, Eye Movements and Vision, Plenum Press, 1967.
- (10) U. Neisser, Cognitive Psychology, Appleton-Century-Crofts, 1967.
- (11) Flavia leaf dataset, http://flavia.sourceforge.net/.
- (12) Swedish leaf database, http://www.cvl.isy.liu.se/.
- (13) S. G. Wu, F. S. Bao, E. Y. Xu, Y.-X. Wang, Y.-F. Chang, Q.-L. Xiang, A leaf recognition algorithm for plant classification using probabilistic neural network, in: Signal Processing and Information Technology, 2007 IEEE International Symposium on, IEEE, 2007, pp. 11–16.
- (14) O. Söderkvist, Computer vision classification of leaves from swedish trees (2001).
- (15) X. Bai, X. Yang, L. J. Latecki, W. Liu, Z. Tu, Learning context-sensitive shape similarity by graph transduction, IEEE Transactions on Pattern Analysis and Machine Intelligence 32 (5) (2010) 861–874.
- (16) J. C. Neto, G. E. Meyer, D. D. Jones, A. K. Samal, Plant species identification using elliptic fourier leaf shape analysis, Computers and electronics in agriculture 50 (2) (2006) 121–134.
- (17) J.-X. Du, D.-S. Huang, X.-F. Wang, X. Gu, Computer-aided plant species identification (capsi) based on leaf shape matching technique, Transactions of the Institute of Measurement and Control 28 (3) (2006) 275–285.
- (18) B. Wang, Y. Gao, Hierarchical string cuts: a translation, rotation, scale, and mirror invariant descriptor for fast shape retrieval, IEEE Transactions on Image Processing 23 (9) (2014) 4101–4111.
- (19) H. Ling, D. W. Jacobs, Shape classification using the inner-distance, IEEE transactions on pattern analysis and machine intelligence 29 (2) (2007) 286–299.
- (20) F. Wang, D.-w. Liao, J.-w. Li, G.-p. Liao, Two-dimensional multifractal detrended fluctuation analysis for plant identification, Plant methods 11 (1) (2015) 12.
- (21) M.-E. Nilsback, A. Zisserman, Delving into the whorl of flower segmentation, in: BMVC, 2007, pp. 1–10.
- (22) M.-E. Nilsback, A. Zisserman, Delving deeper into the whorl of flower segmentation, Image and Vision Computing 28 (6) (2010) 1049–1062.
- (23) J. Li, M. D. Levine, X. An, X. Xu, H. He, Visual saliency based on scale-space analysis in the frequency domain, IEEE transactions on pattern analysis and machine intelligence 35 (4) (2013) 996–1010.
- (24) T.-L. Le, N.-D. Duong, H. Vu, T. T.-N. Nguyen, MICA at LifeCLEF 2015: Multi-organ plant identification, in: CLEF (Working Notes), 2015.
- (25) A. K. Reyes, J. C. Caicedo, J. E. Camargo, Fine-tuning deep convolutional networks for plant recognition, in: CLEF (Working Notes), 2015.
- (26) J. Champ, T. Lorieul, M. Servajean, A. Joly, A comparative study of fine-grained classification methods in the context of the LifeCLEF plant identification challenge 2015, in: CLEF: Conference and Labs of the Evaluation forum, Vol. 1391, 2015.
- (27) Z. Ge, C. McCool, C. Sanderson, P. Corke, Content specific feature learning for fine-grained plant classification, in: CLEF (Working Notes), 2015.
- (28) S. Choi, Plant identification with deep convolutional neural network: SNUMedinfo at LifeCLEF plant identification task 2015, in: CLEF (Working Notes), 2015.
- (29) A. Angelova, S. Zhu, Efficient object detection and segmentation for fine-grained recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 811–818.
- (30) G. Lihua, G. Chenggan, A two-layer local constrained sparse coding method for fine-grained visual categorization, arXiv preprint arXiv:1505.02505.
- (31) D. Yoo, S. Park, J.-Y. Lee, I. So Kweon, Multi-scale pyramid pooling for deep convolutional representation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2015, pp. 71–80.
- (32) O. Rippel, M. Paluri, P. Dollar, L. Bourdev, Metric learning with adaptive density discrimination, arXiv preprint arXiv:1511.05939.
- (33) Y.-D. Kim, T. Jang, B. Han, S. Choi, Learning to select pre-trained deep representations with bayesian evidence framework, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 5318–5326.
- (34) W. Ge, Y. Yu, Borrowing treasures from the wealthy: Deep transfer learning through selective joint fine-tuning, in: Proc. IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, Vol. 6, 2017.