Two-View Fine-grained Classification of Plant Species

05/18/2020 ∙ by Voncarlos M. Araujo, et al.

Automatic plant classification is a challenging problem due to the wide biodiversity of the existing plant species in a fine-grained scenario. Powerful deep learning architectures have been used to improve the classification performance in such a fine-grained problem, but they usually yield models that are highly dependent on a large training dataset and are not scalable. In this paper, we propose a novel method based on a two-view leaf image representation and a hierarchical classification strategy for fine-grained recognition of plant species. It uses the botanical taxonomy as a basis for a coarse-to-fine strategy applied to identify the plant genus and species. The two-view representation provides complementary global and local features of leaf images. A deep metric based on Siamese convolutional neural networks is used to reduce the dependence on a large number of training samples and make the method scalable to new plant species. The experimental results on two challenging fine-grained datasets of leaf images (i.e. LifeCLEF 2015 and LeafSnap) have shown the effectiveness of the proposed method, which achieved recognition accuracy of 0.87 and 0.96, respectively.




1 Introduction

Automated plant classification concerns the assignment of plant images to botanical species by applying machine learning algorithms [11, 29, 42]. The classification task may be performed on an image of the entire plant or just on parts of it such as branches, flowers, fruits, leaves, or stems. The main challenge of this pattern recognition task is related to the wide biodiversity of the existing plant species, in which it is possible to observe the likeness between different species (high inter-class similarity), and sometimes significant differences among samples belonging to the same species (high intra-class variability). The blue dotted rectangle in Figure 1 shows an example of the possible similarity among different species, while the red line rectangle presents an example of the difference that may exist within samples of the same species caused by changes in shape, color, and texture. Such variability is usually caused by plant maturity, or by pose and illumination variations that result from the image acquisition process. On top of that, there is the data imbalance problem, since some species are very rare in the flora, as well as the scalability constraint, as the number of plant species being discovered by scientists is continuously growing.

Figure 1: Intra-class and inter-class problem. The inter-class example (blue dotted rectangle) shows very similar species, while the intra-class example (red line rectangle) shows variations in background, occlusion, pose, color, illumination and plant maturity stage within the same species.

One may find in the literature some important strategies used to deal with the difficulties inherent to plant species recognition [7, 39, 44, 45]. However, in recent years, methods based on fine-grained image classification (FGIC) have received special attention from the scientific community [4, 13, 31, 36]. Such methods consist of discriminating between classes in a subcategory of objects, such as species of birds, animals or models of vehicles. Unlike traditional image classification methods, FGIC methods first recognize coarse classes and then go further by discriminating fine classes, for which the classification difficulty is higher due to the presence of intra-class and inter-class variability like that observed in the case of plant species. FGIC-based methods may exploit the taxonomic relationship between the plant classes, which are hierarchically organized based on shared biological characteristics [30] into three levels of abstraction: family, genus, and species. Exploiting these characteristics may help us to distinguish very similar classes, by first selecting candidates at a coarse level of the hierarchy, which can then be distinguished at the finest level of the hierarchy (coarse-to-fine strategy). The visual identification of leaves carried out by domain experts is generally based on this hierarchical strategy, the so-called "plant taxonomy relationship" [33].

With this in mind, we propose in this paper a two-view similarity learning strategy for fine-grained plant classification, which consists of two stages that exploit different views of the leaf images. In the first stage, a coarse classification by plant genus is carried out using a deep metric based on a Siamese Convolutional Neural Network (SCNN) to compute the similarity between a testing sample and the reference images previously defined for each plant genus. The deep metric learned from pairs of images provides the distance between two image samples, namely an unknown plant image and a genus reference image. At this stage, the entire leaf image is used, i.e., the input of the SCNN model is the whole leaf image, which characterizes a global view in terms of problem representation. The output of this stage is a ranked list of the top-k genus candidates.

In the second stage, a fine classification of plant species is performed. Similarly, an SCNN is used as a deep metric, where the similarity is computed between the test sample and the reference images representing the plant species in the top-k genus candidates returned by the first stage. Here, a local representation (view) of the leaf image is used, i.e., the SCNN receives as input a cropped image extracted from the center of the leaf image. The output of the second stage is a final ranked list of plant species obtained by combining the outputs of both stages.

The rationale behind the two-view scheme is to provide different representations of the problem. In the first view, the similarity computed by the SCNN takes into account global features extracted from the entire leaf image (shape and color), while in the second view, local features based on texture and the plant veins are considered. Such a representation strategy allows us to treat some specific issues of the leaf classification problem. For instance, species of plants inside the same level of taxonomy (e.g. genus) may look similar in terms of global features, but they present subtle local differences in their texture and vein patterns that are important to characterize their species.

We carried out extensive experiments and compared the proposed method with both hand-crafted methods and state-of-the-art methods based on deep learning that use different CNN architectures. For this purpose, a robust experimental protocol was defined based on two challenging fine-grained datasets of plant leaf images: LifeCLEF 2015 [19] and LeafSnap [22]. In most of the experiments, the proposed method outperforms several existing methods by achieving superior classification accuracy using few samples to compute the similarity between images. Besides that, the learned models do not need to be retrained when new plant species are added, which makes the proposed method highly scalable.

The contribution of this paper is threefold: (i) the two-view representation of plant species enables the capture of coarse and fine features of the leaves, which are very useful to distinguish among different genera and different species, respectively; (ii) the proposed method exploits the natural hierarchy of the problem combining coarse and fine representations into a hierarchical strategy that reduces the complexity of the classification task as a lower number of classes need to be disambiguated at each hierarchical level; (iii) the proposed method is highly scalable as new plant species can be easily added without retraining the SCNN models. This is highly desirable in a context of plant classification where the number of plant species is continuously growing.

The paper is organized as follows. Section 2 reviews the relevant literature in plant classification. In Section 3, the proposed method is described in detail. Section 4 presents our experimental findings on plant classification. Finally, Section 5 presents our conclusions, perspectives of future work, and final remarks.

2 Related Work

Recently, studies on plant classification based on image processing have become an interesting research topic in computer vision [3, 4, 9, 17, 43]. In the literature, there are many datasets that can be employed to evaluate plant classification methods, such as Flavia [42], Foliage [20], Swedish [35], LeafSnap [22], LifeCLEF [19], ICL [36], and MalayaKew [24]. These datasets represent the problem domain well, exposing its many difficulties such as fine-grained complexity, imbalanced distribution, large intra-class variability, small inter-class variability, and noisy images.

One may find in the literature several contributions to plant species recognition. Naresh and Nagendraswamy [27] introduced a symbolic approach based on textural features extracted from leaf images for plant species recognition. A modified local binary pattern was proposed to extract features and the classification was performed using a simple nearest neighbor classifier. Besides, clustering was used to define multiple class representatives: similar leaf samples were grouped using a threshold to create clusters and so decrease the intra-class variation. However, in their experiments they observed the need to incorporate features extracted from other views of the leaf to improve the recognition accuracy.

Aakif and Khan [1] proposed a shape-defining feature, which is combined with morphological and Fourier descriptors. These features were used with artificial neural networks. The method was evaluated on a proprietary dataset of 14 classes as well as on Flavia and ICL datasets. Their emphasis was more related to the performance in terms of computational time than the recognition accuracy.

Fine-grained recognition is a challenging problem that consists of recognizing subordinate categories such as species of birds [14], dog breeds [21] and flower species [2]. Over the past decade, fine-grained recognition has achieved high-performance levels thanks to the combination of deep learning techniques with large annotated training datasets [41]. Some recent works have considered deep learning techniques for fine-grained plant classification [4, 6, 23]. In particular, Barré et al. [6] and Lee et al. [23] have shown how convolutional neural networks (CNN) learn representations from plant leaf images using a deconvolutional approach. The most important finding is that shape information alone is not a good choice due to the occurrence of similar leaf contours, especially in closely related species. Therefore, it is important to exploit other kinds of features that may be present in the leaf structure.

As observed in several works, CNN models usually need a large amount of data for training [18, 28, 38]. For instance, Barbedo [5] analyzed the impact of the number of training samples on the accuracy of a CNN and found that a substantial amount of training data is required to provide solid results. Barré et al. [6] described an approach based on CNNs for plant classification, which employs data augmentation based on low-level transformations applied to the leaf images, such as shifting, scaling and rotation. They correctly recognized 86.3% of the 184 species on the LeafSnap dataset. However, the resulting CNN model needs to be retrained to include new plant species, which is a time-consuming process. Barré et al. [6] have also pointed out that most of the misclassified plants belonged to species that show strong visual similarities. Zhu et al. [45] introduced a two-way attention hierarchical model using CNNs. The first attention way consists of recognizing the family level based on plant taxonomy. The second attention way finds a discriminative part of an input image through a heat-map strategy. They conducted experiments on the MalayaKew and ICL datasets, and the CNN with the Xception architecture achieved an accuracy of 99% on both datasets. They used 90% of each dataset for training and the remaining 10% for testing. Although the authors stated that they do not use any data augmentation strategy, they balanced the training dataset so that each class has a roughly equal number of samples.

A fine-grained classification approach may provide as output just a single class probability or a set of classes, the so-called "confidence-set", which includes the true class at a given confidence level. To this end, an input image is classified and the top-k best-ranked classes are selected as the confidence-set. Sfar et al. [31] proposed a hierarchical classification of plants, in which they measured the posterior probabilities for each node of the hierarchy and then created the confidence-set using a confidence threshold. The experiments were carried out on four datasets, three of which have a balanced number of samples per class in the training set. However, they observed a poor performance on the LifeCLEF 2011 dataset, which is imbalanced, since their approach fails to recognize the species that have few training samples.

Wang and Wang [40] used a few-shot learning method based on a Siamese Convolutional Neural Network (SCNN) to recognize plant leaves. The Euclidean distance was used to measure the distance between features. The SCNN used in [40] was inspired by the structure of GoogLeNet. They evaluated the proposed method on the Flavia, Swedish, and LeafSnap datasets. They used a small number of training samples, and the experimental results have shown that the highest classification accuracy (95.32%, 91.37%, and 91.75% for the Flavia, Swedish and LeafSnap datasets, respectively) was achieved by models trained with 20 samples per class. Zhi-Yong et al. [44] also used an SCNN for plant recognition. They proposed a spatial structure using a deep metric. The SCNN was used to learn an embedding with similar and dissimilar pairs. Similar pairs were formed using the same organ of plants, and dissimilar pairs were formed from different species of plants. They evaluated the performance on LifeCLEF 2015, where they obtained a score of 0.84, surpassing all other methods. Although these are the best results, it is worth mentioning that the spatial structure is modeled by recurrent neural networks. Recently, Figueroa-Mata and Mata-Montero [12] proposed a way to learn a similarity metric that discriminates plant species. They compared whether SCNN models are better than CNN models regarding performance and computational cost. Also, new species (20 leaves of a Costa Rican dataset) never seen by the SCNN model were evaluated without retraining the proposed model. In their first experiment, they concluded that for datasets with fewer than 20 images per species, the SCNN performed better than the CNN in the context of plant recognition, besides having a lower computational cost. The second experiment has shown that the SCNN can generalize to other plant species without any retraining of the model.

To the best of our knowledge, from the existing methods in the literature [6, 15, 25, 37], only Wang and Wang [40], Zhi-Yong et al. [44] and Figueroa-Mata and Mata-Montero [12] have exploited deep metrics to compute the similarity between plant images. However, there is no previous work that uses a two-view similarity scheme combined with fine-grained classification of plants. The use of plant hierarchy and similarity learning makes our method more accurate and scalable as we show in the next sections.

3 Proposed Approach

We propose a fine-grained approach for the classification of plant species from the leaf image. The coarse-to-fine classification strategy unveils the plant genus in the first stage and then its species in the second stage, as illustrated in Figure 2. In the first stage, a coarse classification according to the plant genus is carried out using a deep metric based on an SCNN, which computes the similarity between a leaf image and reference images previously chosen to represent each plant genus. At this point, features are extracted from the entire leaf image, which is considered the first view of the proposed approach. The rationale is to compute the similarity between a test image and the genus reference images by first considering a global view of the plant, i.e. representing the leaf by general features such as the leaf shape and color. The output of the first stage is a ranked list of the top-k genus candidates.

In the second stage, given the top-k genus candidates found in the first stage, a fine classification is performed considering only the plant species that belong to those genera. Similarly, an SCNN is used as a deep metric, which is now computed on a different view of the leaf images that considers a local representation (second view). To this end, the SCNN receives as input a cropped image taken from the center of the leaf image. The idea behind this strategy is to perform a fine classification of plant species based on a finer representation of the leaf image that focuses on local details such as the texture and the vein patterns that are usually present in the central portion of leaves. The output of the second stage is a ranked list of plant species which is weighted by the output of the first stage, as shown in Figure 2. In the next sections, we present the proposed method in detail.
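The two views described above can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: the function name `center_crop` and the `crop_fraction` parameter are hypothetical, and the text does not state the actual size of the cropped region.

```python
def center_crop(image, crop_fraction=0.5):
    """Return a square crop taken from the centre of `image`.

    `image` is a list of pixel rows. `crop_fraction` is a hypothetical
    parameter: the source does not state how large the central crop is.
    """
    height, width = len(image), len(image[0])
    side = max(1, int(min(height, width) * crop_fraction))  # square side length
    top = (height - side) // 2
    left = (width - side) // 2
    return [row[left:left + side] for row in image[top:top + side]]


# Toy 8 x 6 "image" whose pixels record their own (row, column) position.
leaf = [[(r, c) for c in range(6)] for r in range(8)]

global_view = leaf               # first view: the entire leaf image
local_view = center_crop(leaf)   # second view: central square region
```

In a real pipeline both views would then be resized to the input resolution expected by the network before being passed to the corresponding SCNN.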

Figure 2: Overview of the proposed method for fine-grained classification of plant species from the leaf image.

3.1 SCNN-based Deep Metric

The similarity between the reference patterns and the leaf image is computed in both representation schemes of the proposed method with an SCNN [8]. The difference between the deep metrics of the two stages is that the SCNNs are trained on different views of the leaf image. In the first stage, the SCNN is trained on the entire leaf image, while in the second stage, only a square region taken from the central area of the leaf image is used. Both SCNNs are based on two parallel CNNs that use the architecture of the VGG16 CNN [32]. These parallel CNNs are pre-trained on the ImageNet dataset and share their weights. The SCNNs take as input two color images with a resolution of 224 × 224 pixels. Each CNN has five blocks of convolutional layers interleaved with five max-pooling layers, followed by two fully connected layers and an output layer. The first two blocks have two convolutional layers each, with 64 and 128 filters, respectively. The other three blocks have three convolutional layers each, with 256, 512 and 512 filters, respectively. All filters have size 3 × 3 and the max-pooling layers have pool size and stride 2. The units in the final convolutional layer are flattened into a single vector. The fully connected layers have 4,096 units each. The original output layer of the VGG16 CNN has 1,000 units that compute the class scores. This output layer is replaced by a layer that computes a distance metric between the last fully connected layers of each Siamese twin (Equation 1). The ReLU activation function is used in all layers to mitigate the vanishing gradient problem. Stochastic gradient descent was used to train the CNNs and, to prevent over-fitting, we use L2 regularization in the optimization algorithm and dropout after each fully connected layer.
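As a quick sanity check of the layer description above, the following Python sketch traces the spatial size of the feature maps through the five blocks of one Siamese twin. It assumes the standard VGG16 configuration (five blocks with 2, 2, 3, 3 and 3 convolutional layers) and "same" padding, which the text does not state explicitly; the names `BLOCKS` and `trace_twin` are illustrative only.

```python
# (number of convolutional layers, number of filters) for the five blocks,
# following the standard VGG16 layout.
BLOCKS = [(2, 64), (2, 128), (3, 256), (3, 512), (3, 512)]


def trace_twin(input_size=224, kernel=3, pool=2):
    """Return (flattened_units, conv_params) for one twin of the SCNN."""
    size, channels, conv_params = input_size, 3, 0
    for n_convs, filters in BLOCKS:
        for _ in range(n_convs):
            # Assumption: 'same' padding, so a 3x3 convolution keeps the
            # spatial size. Parameters = weights + one bias per filter.
            conv_params += kernel * kernel * channels * filters + filters
            channels = filters
        size //= pool  # 2x2 max-pooling with stride 2 halves each side
    return size * size * channels, conv_params


flattened_units, conv_params = trace_twin()
```

Under these assumptions the 224 × 224 input is reduced to 7 × 7 × 512 feature maps, so the vector fed to the first fully connected layer has 25,088 units.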

Pairs of training samples are provided to the network and, during training, the logistic regression loss function (Equation 2) is minimized iteratively. The main steps of the iterative training algorithm [34] used to learn the deep metric models are presented in Algorithm 1. The SCNN computes a distance metric between the features extracted from the reference pattern and the test image, as denoted in Equation 1.

$$D(x_1, x_2) = \lVert f(x_1) - f(x_2) \rVert_1 \qquad (1)$$

where $D$ is the L1 distance and $f$ is a non-linear transformation performed by the CNN. Finally, the L1 distance between $f(x_1)$ and $f(x_2)$ is fed to a logistic regression loss function, which minimizes the difference of the probability distribution when $x_1$ and $x_2$ are from the same category, and maximizes it when they belong to different categories [10]. The logistic regression loss function is shown in Equation 2.

$$\mathcal{L}(x_1, x_2, y) = -\left[\, y \log p(x_1, x_2) + (1 - y) \log\bigl(1 - p(x_1, x_2)\bigr) \right] \qquad (2)$$

where $p(x_1, x_2)$ is the probability, predicted from $D(x_1, x_2)$, that the two images belong to the same class, and $y$ is the label of the input pair. If the input images are from the same class, then $y=1$; otherwise, $y=0$.
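The distance and the loss described above can be sketched in a few lines of Python. This is a hedged illustration, not the authors' implementation: the feature vectors are toy values standing in for CNN outputs, and the mapping from distance to probability (`same_class_probability`, a decaying exponential) is a hypothetical choice, since the text does not specify how the distance is turned into a probability.

```python
import math


def l1_distance(features_a, features_b):
    """L1 distance between two feature vectors of equal length."""
    return sum(abs(a - b) for a, b in zip(features_a, features_b))


def same_class_probability(distance, scale=1.0):
    """Hypothetical mapping (not from the source): probability decays with distance."""
    return math.exp(-scale * distance)


def logistic_loss(probability, label, eps=1e-12):
    """Binary logistic (cross-entropy) loss for one pair; label is 1 or 0."""
    p = min(max(probability, eps), 1.0 - eps)  # avoid log(0)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))


# Toy feature vectors standing in for the outputs of the two CNN twins.
reference = [0.20, 0.90]
similar_leaf = [0.25, 0.85]
different_leaf = [0.90, 0.10]

d_close = l1_distance(reference, similar_leaf)
d_far = l1_distance(reference, different_leaf)
loss_close_same = logistic_loss(same_class_probability(d_close), label=1)
loss_far_same = logistic_loss(same_class_probability(d_far), label=1)
loss_far_diff = logistic_loss(same_class_probability(d_far), label=0)
```

With this toy mapping, a pair labeled as similar is penalized more the farther apart its features are, while a distant pair labeled as dissimilar incurs a small loss, which is the behaviour the loss is meant to encourage.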

Algorithm 1: Iterative training method.
Input: training set composed of the positive and negative example sets
Output: the L1 distance and the loss for a training batch
    RandomSample ...
    for ... do
        ...
    end for
    // Compute L1 distance.
    // Initialize loss.
    for ... do
        // Update loss.
    end for
The algorithm uses the total number of positive examples, the total number of negative examples, the example labels, and the total number of training iterations; RandomSample() denotes a set of elements chosen randomly from the positive or the negative set.
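The sampling step of the training procedure can be sketched as follows. This is a minimal sketch under stated assumptions rather than a reproduction of Algorithm 1: the helper names `build_pairs` and `sample_pair_batch`, the toy image identifiers and the batch sizes are all hypothetical.

```python
import random
from itertools import combinations


def build_pairs(samples):
    """Split all image pairs into positive (same class) and negative pairs.

    `samples` is a list of (image_id, class_label) tuples.
    """
    positive, negative = [], []
    for (id_a, cls_a), (id_b, cls_b) in combinations(samples, 2):
        (positive if cls_a == cls_b else negative).append((id_a, id_b))
    return positive, negative


def sample_pair_batch(positive, negative, n_pos, n_neg, rng):
    """Draw a random batch of labelled pairs: label 1 = same class, 0 = different."""
    batch = [(pair, 1) for pair in rng.sample(positive, n_pos)]
    batch += [(pair, 0) for pair in rng.sample(negative, n_neg)]
    rng.shuffle(batch)
    return batch


# Toy data: three classes with three images each (hypothetical identifiers).
samples = [(f"img_{cls}{i}", cls) for cls in "ABC" for i in range(3)]
positive, negative = build_pairs(samples)

rng = random.Random(0)  # fixed seed so the sketch is reproducible
batch = sample_pair_batch(positive, negative, n_pos=4, n_neg=12, rng=rng)
```

Each element of `batch` would then be passed through the two twins, the L1 distance computed, and the loss accumulated over the batch before the weights are updated.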

3.2 Hierarchical Classification Strategy

A coarse-to-fine classification is performed considering the hierarchical botanical taxonomy. To this end, the plant is first classified according to its genus in the first stage of the method, and then its species is determined. For each species, we randomly selected a number of labeled samples as reference images. The number of reference images per species was defined experimentally (from 1 to 6). The reference images for each genus are those selected for each of its species. A ranked list of the top-k genus candidates is the output of the first stage of our hierarchical classification, corresponding to the coarse classification step, which is based on the global representation of the leaf image. The top-k genus candidates are used to select the species to be evaluated in the second stage of the hierarchical classification, which considers the local representation of the leaf image. Finally, the genus (coarse classification) and species (fine classification) results are combined to produce a final ranked list of the best hypotheses of plant species.
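The coarse stage described above can be sketched as a ranking over reference images followed by a lookup of the candidate species. The sketch below is illustrative only: the distances, reference identifiers and the placeholder genera `genus_B` and `genus_C` are made-up toy values, and the function names are hypothetical.

```python
from collections import Counter


def coarse_stage(distances, reference_genus, k):
    """Rank reference images by distance to the test image (global view).

    Returns the ids of the k closest references and how often each genus
    appears among them.
    """
    top_k = sorted(distances, key=distances.get)[:k]
    genus_frequency = Counter(reference_genus[ref] for ref in top_k)
    return top_k, genus_frequency


def candidate_species(genus_frequency, species_by_genus):
    """Species passed on to the fine stage: all species of the retrieved genera."""
    return [s for genus in genus_frequency for s in species_by_genus[genus]]


# Toy data (hypothetical values).
reference_genus = {"q1": "Quercus", "q2": "Quercus", "b1": "genus_B", "c1": "genus_C"}
species_by_genus = {
    "Quercus": ["marilandica", "stellata"],
    "genus_B": ["species_b"],
    "genus_C": ["species_c"],
}
distances = {"q1": 0.2, "q2": 0.5, "b1": 0.4, "c1": 0.9}

top_k, genus_frequency = coarse_stage(distances, reference_genus, k=3)
candidates = candidate_species(genus_frequency, species_by_genus)
```

Only the species of the genera retrieved in the top-k list reach the second stage, which is how the hierarchy reduces the number of classes to disambiguate.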

With this hierarchical classification, we can better deal with the inter-class and intra-class variations observed in the plant species domain. For example, in Figure 3a, four samples of plants are plotted in the global view space without considering the plant hierarchy. In such a scenario, the discrimination must be carried out among all species, which makes it harder to separate species with similar characteristics and increases the complexity of the problem. The hierarchical strategy alleviates the intra-class and inter-class problems by clustering the leaves that have similar characteristics. For instance, by grouping the similar leaves in Figure 3b, we can see that the species belong to three different genera. When the hierarchy is considered (the genus is used), the classification process becomes easier, mainly when distinguishing species that belong to different genera. However, some species of the same genus may be very similar. This is why a local analysis of the leaf image (Figure 3c) must be carried out to deal with such a possible low variability in terms of shape and color between species inside a genus.

Figure 3: a) Different species (a, b, c and d) with similar characteristics are plotted in the global view space; b) The similar categories are separated into three groups that represent the genus taxonomy; c) Very similar species are discriminated in the local view space.

3.3 Fusion Schema

The final classification is given by the fusion of the outputs of the first and second stages. The output of the first stage is a list that represents the top-k ranking of genus references for a given test sample, provided by the SCNN trained on the global view representation of the leaf image. It is important to remember that, depending on the number of references defined, we can have up to six references per genus. Thus, the list contains k references, each of which is the reference of a given genus, and we can compute the frequency of each genus in the list, which is used to weight the output of the second stage.

In the second stage, the similarity value between the cropped test image and the cropped reference image of each genus present in the list is provided by the SCNN considering the local view representation of the leaf image. Finally, the output of the second stage is a list of species ordered by a score computed as described in Equation 3.

In Equation 3, the weight of each genus, computed in the first stage, is normalized by the sum of the weights.
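One plausible reading of this fusion step is sketched below in Python. It is an assumption, not the authors' Equation 3: the score of each species is taken to be its local-view similarity multiplied by the normalised frequency of its genus in the first-stage list. The function name `fuse` and all values are hypothetical.

```python
def fuse(genus_frequency, local_similarity, species_genus):
    """Rank species by (genus weight / sum of weights) * local-view similarity.

    Assumption: the genus weight is its frequency in the first-stage list and
    the two terms are combined by a product.
    """
    total = sum(genus_frequency.values())
    scores = {
        species: (genus_frequency.get(species_genus[species], 0) / total) * similarity
        for species, similarity in local_similarity.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy values (hypothetical).
genus_frequency = {"Quercus": 2, "genus_B": 1}
species_genus = {"marilandica": "Quercus", "stellata": "Quercus", "species_b": "genus_B"}
local_similarity = {"marilandica": 0.60, "stellata": 0.90, "species_b": 0.95}

ranking = fuse(genus_frequency, local_similarity, species_genus)
```

In this toy case the species with the highest raw local similarity belongs to a genus that appeared only once in the first-stage list, so after weighting it falls behind both species of the more frequent genus.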

4 Experimental Results

This section describes the datasets and experiments used to evaluate the proposed method.

4.1 Datasets

We have evaluated the proposed approach on two fine-grained datasets: LifeCLEF 2015 and LeafSnap. The reason for choosing these two datasets is that both of them have a wide variability of plant leaf species and represent challenging tasks for the scientific community. The images of these datasets were gathered by various photographers in globally distributed locations, under mixed conditions of background, position, color and lighting, factors that greatly influence the quality of the images. Moreover, both datasets are imbalanced and, for some plant species, there are few samples available for training. Since one of the goals of this paper is to classify plant species with a small number of labeled samples, we chose to use only six images per species for training the models, which corresponds to the minimum number of training samples per species found in these datasets when creating a subset of labeled samples to train the SCNN. Such a supervised training subset contains positive and negative sample pairs. A positive sample pair means that the two images belong to the same category, and it is labeled as 1. A negative sample pair means that the two images belong to different categories, and it is labeled as 0. Table 1 shows the total number of training samples per species and per genus when considering only six samples per species in each dataset.

Dataset                    Number of categories   Total number of images
LifeCLEF 2015 (genus)      43                     258
LeafSnap (genus)           73                     438
LifeCLEF 2015 (species)    60                     360
LeafSnap (species)         184                    1,104
Table 1: The total number of training images per genus and per species when considering just six samples per category.
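The totals in Table 1 follow directly from using six images per category. The short Python check below makes the arithmetic explicit; it reads the first numeric column of the table as the number of categories, which is an interpretation of the table rather than something the text states.

```python
SAMPLES_PER_CATEGORY = 6

# Number of categories per dataset and taxonomic level, as read from Table 1.
categories = {
    "LifeCLEF 2015 (genus)": 43,
    "LeafSnap (genus)": 73,
    "LifeCLEF 2015 (species)": 60,
    "LeafSnap (species)": 184,
}

# Total number of training images = categories x samples per category.
totals = {name: count * SAMPLES_PER_CATEGORY for name, count in categories.items()}

for name, total in totals.items():
    print(f"{name}: {total} images")
```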

With this set of samples, we generated training subsets containing more non-similar pairs than similar ones, as recommended by Melekhov et al. [26]. Table 2 shows the number of positive and negative samples in our training subsets. The LifeCLEF 2015 dataset already has a pre-defined test set made up of 221 leaf images. For the LeafSnap dataset, on the other hand, we randomly chose 15 images per class to compose the test set, totaling 2,760 leaf images.

Dataset                    Positive samples   Negative samples
LifeCLEF 2015 (genus)      4,000              12,000
LeafSnap (genus)           6,000              20,000
LifeCLEF 2015 (species)    8,000              35,000
LeafSnap (species)         10,000             50,000
Table 2: The number of positive and negative samples in the training sets.

4.2 Analysis and Experiments

We start this section by presenting two important analyses that are necessary to define the proposed method. Section 4.2.1 is used to define the configuration of the coarse-to-fine hierarchy. In Section 4.2.2, a second analysis shows the importance of the proposed two-view representation of the leaf images.

Four experiments were performed to evaluate the proposed method. In Section 4.2.3, we have an overall performance evaluation. Section 4.2.4 shows the contribution of using the two-view coarse-to-fine classification strategy for plant species recognition. Besides, we evaluated the proposed method regarding the impact of the imbalanced data (Section 4.2.5), and scalability (Section 4.2.6). Finally, in Section 4.2.7, we compare the proposed method with the state-of-the-art.

4.2.1 Configuration of the coarse-to-fine classification

In this section, we evaluate how to choose the coarse and fine levels of the proposed hierarchical strategy. To this end, we performed the classification of leaf images taking individually each taxonomic group: family, genus, and species considering different views: global and local. The results observed on the test dataset LifeCLEF 2015 are shown in Table 3.

Taxonomic group    Classification accuracy
                   Global    Local
Family             0.6951    0.6301
Genus              0.8589    0.6733
Species            0.7509    0.7400
Table 3: Individual classification considering each taxonomic group (family, genus and species) and different views (global and local).

Table 3 shows a poor classification performance of the family group when compared to genus. The reason is that inside a family there are usually several distinct genera. Thus, we discarded the family group and defined our fine-grained approach as genus-to-species by using genus as coarse classification and species as the fine classification. Another reason to use the genus as the first step is that for botanists, the genus group is the most discriminative key feature [16]. However, as shown in Figure 1, inside the same genus (blue dotted rectangle) there are still very similar species that require additional effort to be classified. To deal with this problem we propose the two-view representation of the leaf image in which global and local features are combined.

4.2.2 Importance of the two-view representation

The proposed two-view representation allows us to exploit global and local features of the leaf image. As one may see in Figure 4, the entire leaf image provides features related to the general shape of the leaf in the convolutional layers of the CNN (first, second and third layers). The last two layers (fourth and fifth) seem to contain fine details of the leaf contours. On the other hand, the features extracted from the cropped leaf images are related to the veins and local texture of the leaf. Such a two-view representation can provide complementary features for the proposed fine-grained classification.

For instance, inside the genus Quercus there are 21 plant species in the LeafSnap dataset. In Figure 5, we can see that some species of that genus are very similar in terms of global features (general shape), but they present some small differences in their local features (plant veins). This is the case of the species "marilandica" and "stellata". In fact, the global features may be enough to distinguish between genera, but they are not enough to distinguish between species.

Figure 4: Two-view representation - the feature maps along the layers of the SCNN for entire and cropped images.
Figure 5: Entire and cropped image samples of species inside the genus Quercus.

4.2.3 Global performance evaluation

In this section, the proposed hierarchical classification based on a two-view similarity scheme is evaluated. For each dataset, the overall results are computed using the average classification score proposed in LifeCLEF 2015, which is described in [19], as well as the accuracy, which is computed by Equation 4.

$$\text{Accuracy} = \frac{\text{number of correctly classified test samples}}{\text{total number of test samples}} \qquad (4)$$

Tables 4 and 5 present the results for the LifeCLEF 2015 and LeafSnap datasets, respectively. The best results for both datasets were achieved using six reference samples per species and 30 candidates in the ranked list (k=30). In the second stage, the accuracy is computed considering only the top-ranked species (top-1), but we reached an accuracy of 1.0 when considering the top-5 species for both datasets, as reported in Table 6.
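The top-1, top-3 and top-5 accuracies mentioned here can be computed with a few lines of Python. The sketch below is a generic illustration with made-up rankings, not the evaluation script used in the experiments; the function name `top_k_accuracy` is hypothetical.

```python
def top_k_accuracy(ranked_predictions, true_labels, k=1):
    """Fraction of test samples whose true label is among the k best-ranked classes."""
    hits = sum(
        1
        for ranking, truth in zip(ranked_predictions, true_labels)
        if truth in ranking[:k]
    )
    return hits / len(true_labels)


# Toy ranked lists for four test samples (hypothetical class names).
rankings = [
    ["a", "b", "c"],
    ["b", "a", "c"],
    ["c", "b", "a"],
    ["a", "c", "b"],
]
truths = ["a", "a", "a", "b"]

top1 = top_k_accuracy(rankings, truths, k=1)
top2 = top_k_accuracy(rankings, truths, k=2)
top3 = top_k_accuracy(rankings, truths, k=3)
```

With k=1 this reduces to the plain accuracy: the number of correctly classified test samples divided by the total number of test samples.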

It is important to notice that the number of references has an important impact on the results. Therefore, we show in Figure 6 different values over different top-. It is expected that as the top- grows, the performance should increase. However, we notice that there is a decrease in performance when using =50. This is directly related to the coarse-to-fine classification, in which in the coarse stage, we define how many species will be taken to the fine stage using the genus references that appear in the top-.

Figure 6: Classification accuracy considering different number of references () and sizes () for the ranking list using the ImageCLEF 2015 dataset.

Figure 7 shows the average number of species passed to the fine stage for different sizes of the ranking list when six references per species are used. As the ranking list grows, so does the average number of species taken to the fine stage: going from 30 to 50 candidates raises it from five to nine classes. As a consequence, the final classification accuracy decreases. This corroborates our assumption that a coarse-to-fine classification is an interesting strategy to deal with the high variability across plant species.
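The effect of the ranking-list size on the fine stage can be illustrated with a minimal sketch. It assumes that the coarse stage ranks the reference images by a genus-level similarity score, keeps the best `k` of them, and passes every species of the retrieved genera to a fine stage that uses a separate species-level score. The function name `coarse_to_fine`, the parameter `k` and all scores below are hypothetical toy values, not the authors' implementation or data.

```python
def coarse_to_fine(genus_scores, species_scores, ref_species, species_genus, k):
    """Return (candidate species, predicted species) for one test image.

    genus_scores[i]   : coarse similarity between the test image and reference i
    species_scores[i] : fine similarity between the test image and reference i
    ref_species[i]    : species label of reference i
    species_genus     : dict mapping each species to its genus
    k                 : size of the ranked list kept by the coarse stage
    """
    # Coarse stage: rank references by genus-level similarity, keep the k best.
    ranked = sorted(range(len(genus_scores)),
                    key=lambda i: genus_scores[i], reverse=True)
    genera = {species_genus[ref_species[i]] for i in ranked[:k]}

    # Every species that belongs to a retrieved genus goes to the fine stage.
    candidates = sorted(s for s, g in species_genus.items() if g in genera)

    # Fine stage: best species-level score among the candidate species only.
    allowed = [i for i, s in enumerate(ref_species) if s in candidates]
    best = max(allowed, key=lambda i: species_scores[i])
    return candidates, ref_species[best]


# Toy data (invented for illustration only).
species_genus = {
    "Quercus cerris": "Quercus", "Quercus rubra": "Quercus",
    "Betula pendula": "Betula", "Betula utilis": "Betula",
}
ref_species = ["Quercus cerris", "Quercus rubra", "Betula pendula", "Betula utilis"]
genus_scores = [0.9, 0.8, 0.6, 0.2]
species_scores = [0.5, 0.7, 0.95, 0.1]

for k in (1, 3):
    cands, pred = coarse_to_fine(genus_scores, species_scores,
                                 ref_species, species_genus, k)
    print(k, len(cands), pred)
```

With `k=1` only the two Quercus species reach the fine stage, while `k=3` also admits the Betula species and changes the final decision. This mirrors, on toy numbers, how a longer ranking list enlarges the candidate set that the fine stage must separate.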

Figure 7: Average number of species provided by the coarse stage for each ranking-list size, when using six references per species.
Hierarchical stage     Ranked-list size    Number of references
                                           1       3       6
1st (Genus)            5                   0.72    0.77    0.77
1st (Genus)            15                  0.86    0.84    0.81
1st (Genus)            30                  0.98    0.96    0.95
1st (Genus)            50                  0.99    0.98    0.98
2nd (Species)*(5)      1                   0.71    0.74    0.77
2nd (Species)*(15)     1                   0.75    0.80    0.81
2nd (Species)*(30)     1                   0.81    0.86    0.87
2nd (Species)*(50)     1                   0.78    0.79    0.80
Number of references: reference samples per category in the classification phase; *( ): size of the ranked list of hypotheses used in the first stage.
Table 4: Average classification score of the hierarchical classification for the LifeCLEF 2015 dataset.
Hierarchical stage     Ranked-list size    Number of references
                                           1       3       6
1st (Genus)            5                   0.91    0.92    0.95
1st (Genus)            15                  0.96    0.96    0.98
1st (Genus)            30                  0.99    0.98    0.98
1st (Genus)            50                  0.99    0.98    0.97
2nd (Species)*(5)      1                   0.74    0.75    0.79
2nd (Species)*(15)     1                   0.80    0.83    0.88
2nd (Species)*(30)     1                   0.91    0.95    0.96
2nd (Species)*(50)     1                   0.81    0.85    0.87
Number of references: reference samples per category in the classification phase; *( ): size of the ranked list of hypotheses used in the first stage.
Table 5: Overall accuracy of the hierarchical classification for the LeafSnap dataset.
Dataset                          Top-1    Top-3    Top-5
LifeCLEF 2015 (Genus+Species)    0.87     0.94     1.0
LeafSnap (Genus+Species)         0.96     0.99     1.0
Table 6: Final accuracy of the proposed method with six references per species and a ranked list of 30 candidates in the first stage (genus), considering the top 1, 3 and 5 hypotheses in the second stage (species).
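The Top-1, Top-3 and Top-5 columns of Table 6 can be read as top-k accuracy. The sketch below assumes the usual definition, in which a test image counts as correct when its true species appears among the first `k` ranked hypotheses. The function name and the toy rankings are illustrative assumptions and do not reproduce the paper's data.

```python
def top_k_accuracy(ranked_predictions, true_labels, k):
    """Fraction of test images whose true label is among the first k hypotheses."""
    hits = sum(
        1
        for ranking, truth in zip(ranked_predictions, true_labels)
        if truth in ranking[:k]
    )
    return hits / len(true_labels)


# Toy rankings for four test images (invented labels).
rankings = [
    ["a", "b", "c"],
    ["b", "a", "c"],
    ["c", "b", "a"],
    ["a", "c", "b"],
]
truth = ["a", "a", "a", "b"]

for k in (1, 2, 3):
    print(k, top_k_accuracy(rankings, truth, k))
```

Because a hit at rank `k` is also a hit at every larger rank, the value can only stay equal or grow with `k`, which is the pattern seen across the columns of Table 6.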

4.2.4 Performance of the two-view classification

Figure 8 shows a radar chart of the percentage of correctly recognized leaf images of each species for the coarse-to-fine classification using just one view (the global representation of the leaf image) and two views (global and local representations). We considered the top-1 results on the LifeCLEF 2015 dataset. According to Figure 8, the two-view representation improves the classification rates of several species compared to the one-view representation. The cropped leaf images helped to reduce the confusion between species. For instance, the eighth species, Quercus cerris, has an accuracy of 0.0 with the one-view representation, which rises to 0.55 with the two-view representation.
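A per-species comparison of this kind can be computed from the true and predicted labels of the test images. The sketch below is only an illustration: the helper name `per_species_accuracy` is hypothetical, and the predictions are invented toy values that do not reproduce the figures reported in the paper.

```python
from collections import Counter


def per_species_accuracy(true_labels, predicted_labels):
    """Map each species to the fraction of its test images predicted correctly."""
    total = Counter(true_labels)
    correct = Counter(t for t, p in zip(true_labels, predicted_labels) if t == p)
    return {species: correct[species] / total[species] for species in total}


# Toy test set: four images of one species, two of another (invented data).
truth = ["Quercus cerris"] * 4 + ["Betula pendula"] * 2
one_view = ["Quercus rubra", "Quercus petraea", "Quercus rubra",
            "Quercus pubescens", "Betula pendula", "Betula pendula"]
two_view = ["Quercus cerris", "Quercus cerris", "Quercus rubra",
            "Quercus cerris", "Betula pendula", "Betula utilis"]

print(per_species_accuracy(truth, one_view))
print(per_species_accuracy(truth, two_view))
```

Reporting accuracy per species rather than as one global figure is what makes both effects visible: a species that gains from the second view and one that loses from it.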

Figure 8: Comparison of the proportion of correctly recognized leaves of each species in the proposed coarse-to-fine classification using one view (just global images) and two views (global and local images). 60 species from the LifeCLEF 2015 dataset are evaluated.

As we can see, when just one view is considered, Quercus cerris is confused with Quercus petraea, Quercus rubra, and Quercus pubescens. Figure 9 shows leaves of Quercus cerris, Quercus petraea, Quercus rubra, and Quercus pubescens, respectively. The four species in Figure 9 clearly have similar morphological characteristics, which explains the confusion of the proposed approach when only the global view is adopted.

Figure 9: Samples of confused species (global representation): a) Quercus cerris; b) Quercus petraea; c) Quercus rubra; d) Quercus pubescens.

However, the accuracy of a few species dropped with the two-view strategy. This is the case, for instance, of the forty-eighth species in Figure 8, Betula pendula. Inspecting the output of the two-view approach for related species, we found that it is confused with Betula utilis and Betula pubescens, which have very similar texture and vein patterns, as shown in Figure 10.

Figure 10: Samples of confused species (local representation): a) Betula pendula; b) Betula utilis; c) Betula pubescens.

4.2.5 Impact of imbalanced data

Figure 11 presents the results for some species of the LifeCLEF 2015 dataset when we did not control the number of samples per species in the training set. The plot shows 18 classes with a minimum of 6 and a maximum of 382 training samples. As one may see, the classification performance of classes with very few samples (see the red boxes) is not affected when using the SCNN model. Because the models are not trained as a regular classifier of plant species but learn a distance metric that measures the similarity between two images, they appear not to be sensitive to imbalanced data.

Figure 11: Performance on 18 species of the LifeCLEF 2015 dataset. Each box is identified by two numbers: the first is the number of training samples of the species and the second is the number of test samples.

4.2.6 Scalability

The scalability of the proposed method can be evaluated by considering plant species not seen during the training step. To handle such species, the proposed method only needs reference images of the new species to be added. Therefore, one of its main advantages is that it does not require retraining the SCNNs, which avoids a time-consuming process.
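The idea of extending the set of species without retraining can be sketched as a gallery of reference embeddings. The sketch assumes that each image has already been mapped to a fixed-length vector by a trained network, and it uses plain Euclidean distance to the nearest reference as a stand-in for the learned similarity. The class `ReferenceGallery`, its methods and the 2-D vectors are hypothetical and serve only as an illustration.

```python
import math


class ReferenceGallery:
    """Nearest-reference classifier over fixed embeddings (hypothetical helper)."""

    def __init__(self):
        self.references = []  # list of (embedding, species) pairs

    def add_species(self, species, embeddings):
        # Registering a species only stores its reference embeddings;
        # the embedding network itself is left untouched.
        for emb in embeddings:
            self.references.append((list(emb), species))

    def classify(self, query):
        # Predict the species of the closest reference (Euclidean distance).
        _, species = min(self.references, key=lambda r: math.dist(r[0], query))
        return species


gallery = ReferenceGallery()
gallery.add_species("Quercus cerris", [[0.0, 0.0], [0.1, 0.0]])
gallery.add_species("Betula pendula", [[1.0, 1.0]])

query = [1.9, 0.1]  # embedding of a test leaf from a species not yet registered
before = gallery.classify(query)

# Adding the new species is a data operation only: no training step is involved.
gallery.add_species("Quercus rubra", [[2.0, 0.0]])
after = gallery.classify(query)
print(before, "->", after)
```

In this toy run the query is first assigned to the closest known species and is recognized correctly once a single reference of the new species has been stored.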

Figure 12 shows how adding new classes affects the accuracy of the proposed method. The accuracy before adding new classes was 0.87 for the 60 classes of the LifeCLEF 2015 dataset. After adding 12, 50, 100, 150 and 184 new classes from the LeafSnap dataset, the accuracy dropped to 0.86, 0.85, 0.83, 0.82 and 0.81, respectively. Even after adding 184 new species from a different dataset, the proposed method sustained an accuracy close to that achieved for only 60 species. These new species were never seen by the model: none of their samples were used for training the SCNNs, which only compute the similarity between the reference and test images. This experiment thus shows that the proposed method scales relatively well with the number of classes. Besides, it performed well in a cross-dataset evaluation.

Table 7 shows the time needed to classify a single leaf image for a growing number of species. We observed that the computational time grows more slowly than linearly as the number of reference images grows: the time added per new class shrinks as more classes are included.
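One way to check this trend is to compute, from the timings reported in Table 7, the extra time incurred per added class between consecutive configurations. The short sketch below does that arithmetic. Only the numbers come from the table, and the function name `marginal_cost_ms` is an illustrative choice.

```python
# (number of classes, seconds to classify one image), copied from Table 7.
timings = [
    (60, 0.2010),
    (72, 0.3965),
    (110, 0.9276),
    (160, 1.3830),
    (210, 1.6789),
    (244, 1.8458),
]


def marginal_cost_ms(rows):
    """Extra milliseconds per added class between consecutive rows."""
    return [
        1000 * (t1 - t0) / (n1 - n0)
        for (n0, t0), (n1, t1) in zip(rows, rows[1:])
    ]


costs = marginal_cost_ms(timings)
for (n0, _), (n1, _), cost in zip(timings, timings[1:], costs):
    print(f"{n0:>3} -> {n1:>3} classes: {cost:.1f} ms per added class")
```

The marginal cost falls at every step, from about 16.3 ms per added class between 60 and 72 classes to about 4.9 ms between 210 and 244 classes.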

Number of classes    Dataset                     Time (sec)
60                   LifeCLEF 2015               0.2010
72                   LifeCLEF 2015 + LeafSnap    0.3965
110                  LifeCLEF 2015 + LeafSnap    0.9276
160                  LifeCLEF 2015 + LeafSnap    1.3830
210                  LifeCLEF 2015 + LeafSnap    1.6789
244                  LifeCLEF 2015 + LeafSnap    1.8458
Table 7: Computational time to classify one leaf image as the number of classes grows.
Figure 12: Accuracy of the proposed method as new leaf classes never seen by the model are added.

4.2.7 Comparison with the State-of-the-Art

We compared the results of the proposed method with works in the literature that used the LifeCLEF 2015 and LeafSnap datasets, as shown in Table 8. Six of these eight recent works on plant identification use CNN models [37, 25, 15, 4, 6, 7], while the other two use SCNN models [44, 40]. Unlike all these methods, we combine a fine-grained strategy with a two-view representation of the leaf image. As one may see, the proposed method outperforms the related works, besides providing a scalable solution.

Reference               Approach    LifeCLEF 2015    LeafSnap
Sungbin [37]            CNN         0.76             -
Lee et al. [25]         CNN         0.80             -
Araújo et al. [4]       CNN         0.86             -
Ghazi et al. [15]       CNN         0.84             -
Zhi-Yong et al. [44]    SCNN        0.84             -
Barré et al. [6]        CNN         -                0.86
Bodhwani et al. [7]     CNN         -                0.93
Wang and Wang [40]      SCNN        -                0.91
Proposed Method         SCNN        0.87             0.96
Table 8: Comparison with the state-of-the-art for leaf classification for LifeCLEF 2015 and LeafSnap datasets.

5 Conclusion

We proposed a novel method based on a two-view leaf image representation and a hierarchical classification strategy for fine-grained recognition of plant species. The botanical taxonomy drives a coarse-to-fine classification strategy that identifies the plant genus and then the species. The two-view representation of the plant leaf provides complementary global (shape and color) and local (texture and veins) features. A deep metric based on SCNNs reduces the dependence of the proposed method on a large number of training samples. Besides, the SCNNs make the proposed method scalable: new plant species can be easily integrated by adding reference images, without retraining the models.

The experiments on two challenging fine-grained datasets of leaf images (LifeCLEF 2015 and LeafSnap) confirmed the effectiveness of the proposed method, whose recognition accuracy reaches 0.87 and 0.96 on those datasets, respectively. As future work, we plan to investigate CNN architectures for building an SCNN with a hierarchical property for plant classification.


  • Aakif and Khan [2015] Aakif, A., Khan, M.F., 2015. Automatic classification of plants based on their leaves. Biosystems Engineering 139, 66–75. doi:10.1016/j.biosystemseng.2015.08.003.
  • Angelova and Zhu [2013] Angelova, A., Zhu, S., 2013. Efficient Object Detection and Segmentation for Fine-Grained Recognition, in: 2013 IEEE Conference on Computer Vision and Pattern Recognition, pp. 811–818. doi:10.1109/CVPR.2013.110.
  • Araújo et al. [2017] Araújo, V., Britto, A.S., Brun, A.L., Koerich, A.L., Falate, R., 2017. Multiple classifier system for plant leaf recognition, in: 2017 IEEE International Conference on Systems, Man, and Cybernetics (SMC). doi:10.1109/SMC.2017.8122891.
  • Araújo et al. [2018] Araújo, V.M., Britto Jr., A.S., Brun, A.L., Koerich, A.L., Oliveira, L.E.S., 2018. Fine-grained hierarchical classification of plant leaf images using fusion of deep models, in: IEEE 30th International Conference on Tools with Artificial Intelligence, ICTAI 2018, 5-7 November 2018, Volos, Greece, pp. 1–5.
  • Barbedo [2018] Barbedo, J.G.A., 2018. Impact of dataset size and variety on the effectiveness of deep learning and transfer learning for plant disease classification. Computers and Electronics in Agriculture 153, 46–53. doi:10.1016/j.compag.2018.08.013.
  • Barré et al. [2017] Barré, P., Stöver, B.C., Müller, K.F., Steinhage, V., 2017. LeafNet: A computer vision system for automatic plant species identification. Ecological Informatics 40, 50–56. doi:10.1016/j.ecoinf.2017.05.005.
  • Bodhwani et al. [2019] Bodhwani, V., Acharjya, D.P., Bodhwani, U., 2019. Deep residual networks for plant identification. Procedia Computer Science 152, 186–194. doi:10.1016/j.procs.2019.05.042.
  • Bromley et al. [1993] Bromley, J., Guyon, I., LeCun, Y., Säckinger, E., Shah, R., 1993. Signature Verification Using a “Siamese” Time Delay Neural Network, in: Proceedings of the 6th International Conference on Neural Information Processing Systems, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA. pp. 737–744.
  • Chaki et al. [2015] Chaki, J., Parekh, R., Bhattacharya, S., 2015. Plant leaf recognition using texture and shape features with neural classifiers. Pattern Recognition Letters 58, 61–68. doi:10.1016/j.patrec.2015.02.010.
  • Chopra et al. [2005] Chopra, S., Hadsell, R., LeCun, Y., 2005. Learning a similarity metric discriminatively, with application to face verification, in: Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), Volume 1, IEEE Computer Society, USA. pp. 539–546. doi:10.1109/CVPR.2005.202.
  • Elhariri et al. [2014] Elhariri, E., El-Bendary, N., Hassanien, A.E., 2014. Plant classification system based on leaf features. Proceedings of 2014 9th IEEE International Conference on Computer Engineering and Systems, ICCES 2014, 271–276. doi:10.1109/ICCES.2014.7030971.
  • Figueroa-Mata and Mata-Montero [2020] Figueroa-Mata, G., Mata-Montero, E., 2020. Using a convolutional siamese network for image-based plant species identification with small datasets. Biomimetics 5. doi:10.3390/biomimetics5010008.
  • Ge et al. [2016] Ge, Z., Bewley, A., McCool, C., Corke, P., Upcroft, B., Sanderson, C., 2016. Fine-grained classification via mixture of deep convolutional neural networks, in: 2016 IEEE Winter Conference on Applications of Computer Vision (WACV), IEEE. pp. 1–6. doi:10.1109/WACV.2016.7477700, arXiv:1511.09209.
  • Ge et al. [2015] Ge, Z., McCool, C., Sanderson, C., Bewley, A., Chen, Z., Corke, P., 2015. Fine-grained bird species recognition via hierarchical subset learning, in: 2015 IEEE International Conference on Image Processing (ICIP), pp. 561–565. doi:10.1109/ICIP.2015.7350861.
  • Ghazi et al. [2017] Ghazi, M.M., Yanikoglu, B., Aptoula, E., 2017. Plant identification using deep neural networks via optimization of transfer learning parameters. Neurocomputing 235, 228–235. doi:
  • Griffing [2011] Griffing, L.R., 2011. Who invented the dichotomous key? Richard Waller’s watercolors of the herbs of Britain. American Journal of Botany 98, 1911–1923. doi:10.3732/ajb.1100188.
  • Grinblat et al. [2016] Grinblat, G.L., Uzal, L.C., Larese, M.G., Granitto, P.M., 2016. Deep learning for plant identification using vein morphological patterns. Computers and Electronics in Agriculture 127, 418–424. doi:10.1016/j.compag.2016.07.003.
  • Han et al. [2018] Han, D., Liu, Q., Fan, W., 2018. A new image classification method using cnn transfer learning and web data augmentation. Expert Systems with Applications 95, 43–56. doi:
  • Joly et al. [2015] Joly, A., Goëau, H., Glotin, H., Spampinato, C., Bonnet, P., Vellinga, W.P., Planqué, R., Rauber, A., Palazzo, S., Fisher, B., Müller, H., 2015. LifeCLEF 2015: Multimedia Life Species Identification Challenges, in: Proceedings of the 6th International Conference on Experimental IR Meets Multilinguality, Multimodality, and Interaction - Volume 9283, pp. 462–483. doi:10.1007/978-3-319-24027-5_46.
  • Kadir et al. [2012] Kadir, A., Nugroho, L.E., Susanto, A., Santosa, P.I., 2012. Performance Improvement of Leaf Identification System Using Principal Component Analysis. International Journal of Advanced Science and Technology 44, 113–124.
  • Khosla et al. [2011] Khosla, A., Jayadevaprakash, N., Yao, B., Li, F.F., 2011. Novel dataset for fine-grained image categorization, in: First Workshop on Fine-Grained Visual Categorization, CVPR.
  • Kumar et al. [2012] Kumar, N., Belhumeur, P.N., Biswas, A., Jacobs, D.W., Kress, W.J., Lopez, I.C., Soares, J.V.B., 2012. Leafsnap: A Computer Vision System for Automatic Plant Species Identification, in: Computer Vision – ECCV 2012, pp. 502–516.
  • Lee et al. [2017a] Lee, S.H., Chan, C.S., Mayo, S.J., Remagnino, P., 2017a. How deep learning extracts and learns leaf features for plant classification. Pattern Recognition 71, 1–13. doi:10.1016/j.patcog.2017.05.015.
  • Lee et al. [2015] Lee, S.H., Chan, C.S., Wilkin, P., Remagnino, P., 2015. Deep-plant: Plant identification with convolutional neural networks. Proceedings - International Conference on Image Processing, ICIP 2015-December, 452–456. doi:10.1109/ICIP.2015.7350839, arXiv:1506.08425.
  • Lee et al. [2017b] Lee, S.H., Chang, Y.L., Chan, C.S., Remagnino, P., 2017b. Hgo-cnn: Hybrid generic-organ convolutional neural network for multi-organ plant classification, in: 2017 IEEE Int’l Conference on Image Processing (ICIP), pp. 4462–4466. doi:10.1109/ICIP.2017.8297126.
  • Melekhov et al. [2016] Melekhov, I., Kannala, J., Rahtu, E., 2016. Siamese network features for image matching, in: 2016 23rd International Conference on Pattern Recognition (ICPR), pp. 378–383. doi:10.1109/ICPR.2016.7899663.
  • Naresh and Nagendraswamy [2016] Naresh, Y.G., Nagendraswamy, H.S., 2016. Classification of medicinal plants: An approach using modified LBP with symbolic representation. Neurocomputing 173, 1789–1797. doi:10.1016/j.neucom.2015.08.090.
  • Pawara et al. [2017] Pawara, P., Okafor, E., Schomaker, L., Wiering, M., 2017. Data Augmentation for Plant Classification, in: Advanced Concepts for Intelligent Vision Systems, Springer International Publishing, Cham. pp. 615–626.
  • Priya et al. [2012] Priya, C.A., Balasaravanan, T., Thanamani, A.S., 2012. An efficient leaf recognition algorithm for plant classification using support vector machine. International Conference on Pattern Recognition, Informatics and Medical Engineering, PRIME 2012, 428–432. doi:10.1109/ICPRIME.2012.6208384.
  • Schuh and Brower [2009] Schuh, R.T., Brower, A.V.Z., 2009. Biological Systematics: Principles and Applications. 1 ed., Cornell University Press.
  • Sfar et al. [2015] Sfar, A.R., Boujemaa, N., Geman, D., 2015. Confidence Sets for Fine-Grained Categorization and Plant Species Identification. International Journal of Computer Vision 111, 255–275. doi:10.1007/s11263-014-0743-3.
  • Simonyan and Zisserman [2014] Simonyan, K., Zisserman, A., 2014. Very deep convolutional networks for large-scale image recognition. CoRR abs/1409.1556.
  • Simpson [2010] Simpson, M., 2010. Plant Systematics. Elsevier Science.
  • Snell et al. [2017] Snell, J., Swersky, K., Zemel, R.S., 2017. Prototypical networks for few-shot learning. CoRR abs/1703.05175. arXiv:1703.05175.
  • Söderkvist [2001] Söderkvist, O., 2001. Computer vision classification of leaves from swedish trees.
  • Šulc and Matas [2017] Šulc, M., Matas, J., 2017. Fine-grained recognition of plants from images. Plant Methods 13, 1–14. doi:10.1186/s13007-017-0265-4.
  • Sungbin [2015] Sungbin, C., 2015. Plant identification with deep convolutional neural network snumedinfo at lifeclef plant identification task 2015. CLEF (Working Notes).
  • Taylor and Nitschke [2017] Taylor, L., Nitschke, G., 2017. Improving deep learning using generic data augmentation. CoRR abs/1708.06020. arXiv:1708.06020.
  • Wäldchen et al. [2018] Wäldchen, J., Rzanny, M., Seeland, M., Mäder, P., 2018. Automated plant species identification: Trends and future directions. PLoS Computational Biology 14, 1–19. doi:10.1371/journal.pcbi.1005993.
  • Wang and Wang [2019] Wang, B., Wang, D., 2019. Plant Leaves Classification: A Few-Shot Learning Method Based on Siamese Network. IEEE Access 7, 151754–151763. doi:10.1109/ACCESS.2019.2947510.
  • Wei et al. [2019] Wei, X.S., Wang, P., Liu, L., Shen, C., Wu, J., 2019. Piecewise Classifier Mappings: Learning Fine-Grained Learners for Novel Categories With Few Examples. IEEE Transactions on Image Processing 28, 6116–6125. doi:10.1109/tip.2019.2924811, arXiv:1805.04288.
  • Wu et al. [2007] Wu, S.G., Bao, F.S., Xu, E.Y., Wang, Y.X., Chang, Y.F., Xiang, Q.L., 2007. A leaf recognition algorithm for plant classification using probabilistic neural network. ISSPIT 2007 - 2007 IEEE International Symposium on Signal Processing and Information Technology, 11–16. doi:10.1109/ISSPIT.2007.4458016, arXiv:0707.4289.
  • Yanikoglu et al. [2014] Yanikoglu, B., Aptoula, E., Tirkaz, C., 2014. Automatic plant identification from photographs. Machine Vision and Applications 25, 1369–1383. doi:10.1007/s00138-014-0612-7.
  • Zhi-Yong et al. [2018] Zhi-Yong, G., Xie, H.X., Li, J.F., Liu, S.L., 2018. Spatial-structure siamese network for plant identification. International Journal of Pattern Recognition and Artificial Intelligence 32, 1850035. doi:10.1142/S0218001418500350, arXiv:
  • Zhu et al. [2019] Zhu, Y., Sun, W., Cao, X., Wang, C., Wu, D., 2019. TA-CNN: Two-way attention models in deep convolutional neural network for plant recognition. Neurocomputing. doi:10.1016/j.neucom.2019.07.016.