Human experts vs. machines in taxa recognition

08/23/2017 ∙ by Johanna Ärje, et al. ∙ Jyväskylän yliopisto

Biomonitoring of waterbodies is vital as the number of anthropogenic stressors on aquatic ecosystems keeps growing. However, the continuous decrease in funding makes it impossible to meet monitoring goals or sustain traditional manual sample processing. In this paper, we review what kind of statistical tools can be used to enhance the cost efficiency of biomonitoring: We explore automated identification of freshwater macroinvertebrates which are used as one indicator group in biomonitoring of aquatic ecosystems. We present the first classification results of a new imaging system producing multiple images per specimen. Moreover, these results are compared with the results of human experts. On a data set of 29 taxonomical groups, automated classification produces a higher average accuracy than human experts.




I Introduction

Due to its inherent slowness, traditional manual identification has long been a bottleneck in bioassessments (Fig. 1). The growing demand for biological monitoring and the declining funding and number of taxonomic experts are forcing ecologists to search for alternatives to the cost-intensive and time-consuming manual identification of monitoring samples (Borja and Elliott, 2013; Nygård et al., 2016). Identification of taxonomic groups in biomonitoring of, e.g., aquatic environments often involves a large number of samples, many specimens per sample, and many taxonomic groups to identify. For example, even in relatively species-poor regions like Finland, the calculation of Water Framework Directive related indices often involves hundreds of individual specimens from 118-349 lotic diatom taxa and 44-113 lotic benthic macroinvertebrate taxa (Ärje et al., 2016).





Figure 1: A schematic of the biomonitoring process.

While a growing body of work has used different genetic tools (e.g. Elbrecht et al., 2017; Zimmermann et al., 2015) for species identification, these methods are not yet standardized or capable of producing the reliable abundance data currently required by, e.g., the Water Framework Directive. While we have also worked on genetic approaches and acknowledge the great promise that genetic taxa identification methods hold (e.g. Hering et al., 2018), we will not explore them here but instead examine the suitability of machine learning techniques on image data for routine taxa identification.

Many studies on automatic classification of biological image data have been published during the past decade. Yousef Kalafi et al. (2018) have done an extensive review on automatic species identification and automated imaging systems. Classification methods for aquatic macroinvertebrates have been proposed in several studies (e.g. Culverhouse et al., 2006; Lytle et al., 2010; Kiranyaz et al., 2011; Ärje et al., 2013; Joutsijoki et al., 2014; Raitoharju et al., 2018). The most popular classification methods used for identification of biological image data, such as insects, are deep neural networks and support vector machines (Kho et al., 2017).

Despite the potential of computational as well as DNA methods for taxa identification, some taxonomists continue to object to the shift from manual to novel identification methods (Kelly et al., 2015; Leese et al., 2018). Biologists who take a cursory look at automated identification often mistrust computational methods because they observe that a classifier is unable to separate two specimens that are clearly different to the human eye. Similarly, experts are baffled when the same classifier is able to discriminate between two specimens from low-resolution images while they as taxonomic experts cannot. This mismatch in the ability of computers to identify taxa, observed for single cases, is often mistakenly extrapolated into an overall unreliability of algorithms. But how different, truly, are the logic used and the overall accuracy of taxonomic experts and algorithms?

Only a few studies assess the accuracy of human experts and automatic classifiers, and their consequences for aquatic biomonitoring. In a study on human accuracy, Haase et al. (2010) reported on the audit of macroinvertebrate samples from an EU Water Framework Directive monitoring program. They found a great discrepancy between the experts determining the true taxonomic classes and the audited laboratory workers. In contrast, in a study on the effect of mistakes made in automated taxa identification on biological indices, Ärje et al. (2017) found a relatively small impact. Literature on direct human versus machine comparisons in classification tasks in an aquatic biomonitoring context is equally scant and ambiguous. Culverhouse et al. (2003) compared human and machine identification of six phytoplankton species using images and noted a similar average performance for both the experts and a computer algorithm. In Lytle et al. (2010), automatic classifiers outperformed 26 humans (a mix of experts and amateurs) when distinguishing between two stonefly taxa. Given these contrasting results, we feel it is necessary to simultaneously examine the effect of taxonomic hierarchy and of using human logical pathways for human and computer-based identification.

Taxonomic experts identify specimens based on a predefined taxonomic resolution, while automatic classifiers operate on the information of taxonomic rank used in the training data. There are different ways of accounting for a data hierarchy, such as taxonomy, in classification. Hierarchical classification is widely investigated in the current literature. Silla and Freitas (2011) sought to describe and unify the concepts of methods used in hierarchical classification problems from different domains. Using the existing literature, they categorized the classification approaches into: 1) flat classification, where the classification is performed at the most specific (deepest) rank of the taxonomy, which may not always be species level, 2) local classification per level, per node or per parent node, and 3) global classification, where the whole hierarchical structure of taxonomy is taken into account at once. They found that the existing literature suggested any local or global hierarchical classifier performed better than a flat classifier, if the performance measure was specifically designed for a hierarchical structure.

Several subsequent studies have compared flat classifiers to hierarchical classifiers. Rodrigues et al. (2012) did not find a significant difference between flat and hierarchical approaches in classification of points-of-interest for land-use analysis, whereas Levatic et al. (2015) found that the use of hierarchy and multi-label structure improved classification results when compared to single-label cases. Babbar et al. (2016) performed a theoretical study on the difference between flat and hierarchical classification and found that for well-balanced data flat classifiers should be preferred, whereas hierarchical classifiers are better for unbalanced data.

Automatic classification of benthic macroinvertebrates, as well as plankton, has received increasing attention in recent years. However, most of the previous studies have focused on single-image data (see e.g. Ärje et al., 2010; Kiranyaz et al., 2011; Ärje et al., 2013; Joutsijoki et al., 2014; Uusitalo et al., 2016; Lee et al., 2016; Ärje et al., 2017) and have not taken the inherent hierarchical structure of the data into account. In single-image data studies, the posture of the specimens can have a substantial impact on the classification. Besides Lytle et al. (2010), an imaging system producing multiple-image data is presented in Raitoharju et al. (2018). In this paper, we present a comparison of taxonomic experts and automatic classification methods on benthic macroinvertebrate data that incorporates information on the taxonomic resolution. We test flat classifiers, local per level classifiers, and hierarchical top-down classification, i.e., local classification per parent node, and perform the automatic classification using convolutional neural networks and support vector machines. The results are compared with the results of a proficiency test organized for human taxonomic experts and with a test where taxonomic experts used the same images as the automatic classifiers. The comparisons evaluate traditional single level accuracy and additionally use a novel variant of an accuracy measure that accounts for the hierarchical structure of the data.

II Theory

II-A Hierarchy in classification

Silla and Freitas (2011) unified the concepts of methods used in hierarchical classification problems, and in this section we follow their terminology.

Human experts base visual identification of, e.g., invertebrate taxa on rules defined in International Commission on Zoological Nomenclature (1999). Therefore, human experts can be thought of as hierarchical, local per parent node classifiers (see Fig. 5(c)) that first identify the order of the specimen, then the family, genus, and species. The classification task is not necessarily a single level problem as some taxa need to be identified to different taxonomic levels (see Fig. 5(a)), either because of predefined rules, such as minimal taxonomic requirements, or as a function of necessity when specimens lack characteristics needed to allow for better resolution. While genus or family might be enough for some taxa, others might require species level identification, depending on what the taxa information is later used for.

Figure 5: Different types of classifiers for hierarchical data, each illustrated on the same example taxonomy (Order A; Families A, B, C; Genera A, B, C; Species A, B, C): (a) Flat classification, (b) Local classification per level, (c) Local classification per parent node. The dashed boxes represent a single trained classifier.

Usually, automatic classification methods have no information on the possible hierarchical nature of the data. The classifiers simply aspire to identify the specimens to the class labels provided in the training data. In the case of benthic macroinvertebrate data, the class labels represent a mix of families, genera, and species. An algorithm working this way is called a flat classifier as it is not aware that species A and B belong to the same genus A, but uses the same approach to distinguish them from each other as when separating species A from genus B. Flat classification produces a single label prediction for each specimen but the hierarchical level of that label may vary depending on the data (Fig. 5(a)).

Depending on what the taxa information is later used for, it could be beneficial to build a classifier that identifies a certain taxonomic rank well. For example, a common biological index used in river macroinvertebrate biomonitoring is the number of typical EPT families (Ephemeroptera, Plecoptera, Trichoptera). For the purpose of evaluating this index, it would be reasonable to train a classifier to identify the family level with high accuracy. However, such a classifier trained with the family level labels would have no intrinsic information on certain families descending from the same order. This type of a classification scheme is known as local classification per level (see Fig. 5(b)). One could build a classification system with local level classifiers for each level of the hierarchy. While such a system would predict multiple labels for each specimen, there would be no guarantee that the predictions for the different levels are taxonomically coherent.

It is also possible to build a hierarchical classification system that accounts for the hierarchical nature of the data and force it to operate in the same manner as human experts. This requires building a sequence of several classifiers: i) an order level classifier to predict the order of each specimen, ii) multiple family level classifiers, one for each possible order present in the data, iii) multiple genus level classifiers, one for each family present in the data, and finally, iv) multiple species level classifiers to predict the species within each genus. This type of a hierarchical classification scheme is known as local classification per parent node and it predicts the labels for each rank of taxonomic resolution for all the specimens in the data (see Fig. 5(c)). While a human-like hierarchical classifier is guaranteed to logically follow taxonomy, all classification errors made on higher levels of hierarchy will propagate to the lower level predictions.
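The top-down scheme described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy taxonomy, the feature names, and the threshold rules standing in for trained classifiers are all hypothetical and are not taken from the study.

```python
from typing import Callable, Dict, List

# Hypothetical toy taxonomy (not the paper's data): parent node -> child labels.
CHILDREN: Dict[str, List[str]] = {
    "root": ["Order A", "Order B"],
    "Order A": ["Family A", "Family B"],
    "Family B": ["Genus A", "Genus B"],
    "Genus A": ["Species A", "Species B"],
}

Classifier = Callable[[Dict[str, float]], str]


def make_rule(feature: str, threshold: float, low: str, high: str) -> Classifier:
    """Stand-in for a trained model: a one-feature threshold rule."""
    def rule(x: Dict[str, float]) -> str:
        return low if x[feature] < threshold else high
    return rule


# One (hypothetical) classifier per parent node.
CLASSIFIERS: Dict[str, Classifier] = {
    "root": make_rule("length_mm", 10.0, "Order A", "Order B"),
    "Order A": make_rule("width_mm", 2.0, "Family A", "Family B"),
    "Family B": make_rule("gill_count", 1.0, "Genus A", "Genus B"),
    "Genus A": make_rule("spot_count", 3.0, "Species A", "Species B"),
}


def predict_top_down(x: Dict[str, float],
                     classifiers: Dict[str, Classifier] = CLASSIFIERS) -> List[str]:
    """Return the predicted label at every rank, from order downwards."""
    path: List[str] = []
    node = "root"
    while node in CHILDREN:          # stop when the node has no children
        node = classifiers[node](x)  # the parent's classifier picks one child
        path.append(node)
    return path


specimen = {"length_mm": 6.0, "width_mm": 3.0, "gill_count": 0.0, "spot_count": 5.0}
print(predict_top_down(specimen))
```

Because each step only chooses among the children of the previously predicted node, the output is always a taxonomically coherent path. The same property means that a wrong decision at the root cannot be corrected further down, which is the error propagation mentioned in the text.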

The focus of this work is on the comparison of identification results obtained by taxonomic expert logic and machine logic. As traditional machine logic uses flat classification and taxonomic expert logic can be thought of as local classification per parent node, we will not consider global hierarchical classifiers.

II-B Performance measures

Traditionally, classification methods are compared based on their accuracy, which is the proportion of correct predictions, or on their classification error (CE),

$$\mathrm{CE} = \frac{1}{N} \sum_{i=1}^{N} L(y_i, \hat{y}_i),$$

where $y_i$ and $\hat{y}_i$ are the true and predicted labels of observation $i$, $L$ is a 0-1 loss function and $N$ is the total number of observations. Other measures of performance such as false positive rate, false negative rate, sensitivity, and specificity can also be calculated from the confusion matrix and take single label predictions into account. These performance measures can be calculated either for flat classification (Fig. 5(a)) or for each level of local classification (Fig. 5(b), 5(c)).

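With hierarchical data, each observation has multiple labels and we need to measure the performance as a whole accounting for all the labels. Verma et al. (2012) presented a context-sensitive loss (CSL) function which takes the top-down success into account. They used this loss function to define the context-sensitive error (CSE),

$$\mathrm{CSE} = \frac{1}{N} \sum_{i=1}^{N} \frac{\mathrm{CSL}(y_i, \hat{y}_i)}{R},$$

where $N$ is the total number of observations, $\mathrm{CSL}(y_i, \hat{y}_i)$ is the number of levels lost for observation $i$ with true labels $y_i$ and predicted labels $\hat{y}_i$, counted from the first misclassified level down to the deepest level, and $R$ is the total number of levels in the hierarchy.

Because the deepest available level of hierarchy can vary in taxonomic data, we propose to modify the measure to a level-aware context-sensitive error (LCSE),

$$\mathrm{LCSE} = \frac{1}{N} \sum_{i=1}^{N} \frac{\mathrm{CSL}(y_i, \hat{y}_i)}{R_i},$$

where $\mathrm{CSL}(y_i, \hat{y}_i)$ is as above and $R_i$ is the number of available levels in the hierarchy for observation $i$.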
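To make the three error measures concrete, the following Python sketch computes CE, CSE, and LCSE on a handful of made-up label paths. It assumes that the context-sensitive loss counts every level from the first top-down mismatch to the deepest available level as lost; that reading, the function names, and the toy labels are assumptions for illustration, not the authors' implementation.

```python
from typing import List, Sequence

LabelPath = Sequence[str]  # labels from order down to the deepest available rank


def csl(true: LabelPath, pred: LabelPath) -> int:
    """Assumed context-sensitive loss: levels lost from the first mismatch down.

    Levels predicted below the deepest true level are ignored in this sketch.
    """
    for level, label in enumerate(true):
        if level >= len(pred) or pred[level] != label:
            return len(true) - level
    return 0


def ce(trues: List[LabelPath], preds: List[LabelPath]) -> float:
    """Classification error: share of observations with any mismatch (0-1 loss)."""
    return sum(csl(t, p) > 0 for t, p in zip(trues, preds)) / len(trues)


def cse(trues: List[LabelPath], preds: List[LabelPath], n_levels: int) -> float:
    """Context-sensitive error: loss normalised by the total number of levels."""
    return sum(csl(t, p) / n_levels for t, p in zip(trues, preds)) / len(trues)


def lcse(trues: List[LabelPath], preds: List[LabelPath]) -> float:
    """Level-aware variant: loss normalised by each observation's own depth."""
    return sum(csl(t, p) / len(t) for t, p in zip(trues, preds)) / len(trues)


# Made-up example: one species-level error, one family-level error on a taxon
# that is only identified to family, and one fully correct prediction.
trues = [
    ("Order A", "Family B", "Genus A", "Species A"),
    ("Order A", "Family A"),
    ("Order A", "Family C", "Genus C", "Species C"),
]
preds = [
    ("Order A", "Family B", "Genus A", "Species B"),
    ("Order A", "Family C"),
    ("Order A", "Family C", "Genus C", "Species C"),
]

print(ce(trues, preds), cse(trues, preds, n_levels=4), lcse(trues, preds))
```

In this toy case the family-level error on the shallow taxon costs 1/4 under CSE but 1/2 under LCSE, which is the effect of normalising by the number of levels actually available for that observation.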

III Materials and methods

III-A Proficiency test for human experts

In order to compare automatic and manual classification, we needed classification results on the same set of taxa for both. The Finnish Environment Institute (SYKE), an appointed National Reference Laboratory in the environmental sector in Finland, organized a proficiency test on taxonomic identification of boreal freshwater lotic, lentic, profundal, and North-Eastern Baltic benthic macroinvertebrates in 2016. The aim of the test was to assess the reliability of professional and semi-professional identification of macroinvertebrate taxa routinely encountered during North-Eastern Baltic coastal or boreal lake and stream monitoring (Meissner et al., 2017). A part of the proficiency test included 10 participants, each of whom identified a different set of 50 specimens of lotic freshwater macroinvertebrates belonging to a total of 46 taxonomic groups, of which 39 are in common with the multiple-image data introduced in the following Section III-B (see taxa list in Table III). The samples sent out to the participants included 0–4 specimens of each taxon. The class labels of the 39 overlapping taxa consisted of 26 species, 12 genera, and one family. The chosen taxonomic resolution is based on the requirements for the Finnish national freshwater monitoring program for macroinvertebrates (Meissner et al., 2010). The ’true’ labels of the specimens were predetermined by an expert panel and the specimens were shipped to the participants. Participants were provided with a list of the almost 300 possible taxa labels (Meissner et al., 2017).

III-B Image data

We produced all images with a new imaging system described in Raitoharju et al. (2018) that allows for multiple images per specimen. The system is illustrated in Fig. 6. It consists of two Basler ACA1920-155UC cameras (frame rate of 150 fps) with Megapixel Macro Lens (f=75mm, F:3.5-CWD535mm) placed at a 90 degree angle to each other, a high power LED light and a cuvette (i.e. a rectangular test tube) in a metal container. The device is sealed with a lid to block any extra light. The imaging system includes software that builds a model of the background of the cuvette filled with alcohol and sets off the cameras when a significant change in the view of the camera is detected. When a macroinvertebrate specimen is put into the cuvette, it sinks and both cameras take multiple shots of it (Fig. 7). The number of images per specimen depends on the size and weight of each specimen: heavier specimens sink faster, leading to a smaller number of images. Compared to the system and data described in Raitoharju et al. (2018), we have improved the system to handle more than two images per specimen.

Figure 6: Schematic of the imaging system for macroinvertebrates pictured from above.
Figure 7: Example images of a Polycentropus flavomaculatus specimen from two cameras. The top row images are from camera 1 and the bottom row images from camera 2.

Using the described imaging device, the Finnish Environment Institute compiled a new image database of 126 lotic freshwater macroinvertebrate taxa and over 2.6 million images. For the current work, we restricted the number of classes to the 39 taxa present in the human proficiency test described in Section III-A to compare the classification results with those of the taxonomic experts. We also restricted the number of images per specimen to a maximum of 50 images for computational reasons. If a specimen had more images from both cameras combined, we randomly selected 50 of them. The final data comprises 9631 observations and a total of 460004 images belonging to 39 taxa at the deepest available taxonomic rank. In total, considering one taxonomic rank at a time, the data consists of 7 orders, 23 families, 30 genera, and 26 species (see Fig. 8). The number of specimens for each taxon and the taxonomic resolution are shown in Table III. The image resolution for this data varies from pixels to pixels. The ’true’ labels for the specimens were defined by a group of taxonomic experts. While we acknowledge that there might be some mislabeled specimens, combining the knowledge of multiple taxonomic experts should improve the accuracy (Caley et al., 2014). We provide the data for public use in
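The cap of 50 images per specimen mentioned above amounts to a random subsample without replacement. A minimal Python sketch follows, assuming the images of each specimen (from both cameras combined) are available as a list of file names; the function name, the dictionary layout, and the fixed seed are hypothetical choices for illustration.

```python
import random
from typing import Dict, List


def cap_images(images_by_specimen: Dict[str, List[str]],
               max_images: int = 50,
               seed: int = 0) -> Dict[str, List[str]]:
    """Keep at most `max_images` randomly chosen images per specimen."""
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    capped: Dict[str, List[str]] = {}
    for specimen_id, images in images_by_specimen.items():
        if len(images) > max_images:
            capped[specimen_id] = rng.sample(images, max_images)
        else:
            capped[specimen_id] = list(images)
    return capped


# Hypothetical example: a light specimen with 120 images, a heavy one with 12.
example = {
    "specimen_1": [f"specimen_1_{k}.png" for k in range(120)],
    "specimen_2": [f"specimen_2_{k}.png" for k in range(12)],
}
capped = cap_images(example)
print({specimen: len(images) for specimen, images in capped.items()})
```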

Figure 8: Taxonomic resolution and distribution of the multilabel image data. The area of the slices represents the relative size of each taxonomic group at the different ranks of taxonomic hierarchy.

III-C Classification set-up

To have classification results comparable to the proficiency test, we compiled a set of data divisions for the image data with the exact same number of test specimens per taxon as in the proficiency test. As the proficiency test had 10 participants identifying lotic freshwater macroinvertebrates, we created 10 data divisions. The test sets comprise 45–46 randomly selected specimens belonging to the 39 taxonomic groups present in both the physical data and the image data. The test sets have an approximately equal number of specimens from each class. We divided the rest of the specimens of each data split into training (80 %) and validation (20 %) sets. Due to the nature of the collected data, the training and validation data are unbalanced. In the following sections, these data sets are referred to as the “comparison data”. The number of specimens per test set in the comparison data is lower than in the proficiency test because 4–5 specimens sent to each participant belonged to taxonomic groups not present in the image data.
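One way to build such a data division is sketched below in Python: draw a prescribed number of test specimens per taxon, then split the pooled remainder 80/20 into training and validation sets. The function name, the toy taxa, and the per-taxon test counts are hypothetical, and pooling the remainder before the 80/20 split is an assumption, since the text does not say whether that split was stratified.

```python
import random
from typing import Dict, List, Tuple


def comparison_split(specimens_by_taxon: Dict[str, List[str]],
                     test_counts: Dict[str, int],
                     train_fraction: float = 0.8,
                     seed: int = 0) -> Tuple[List[str], List[str], List[str]]:
    """Return (train, validation, test) specimen ids for one data division."""
    rng = random.Random(seed)
    test: List[str] = []
    rest: List[str] = []
    for taxon in sorted(specimens_by_taxon):
        ids = list(specimens_by_taxon[taxon])
        rng.shuffle(ids)
        k = test_counts.get(taxon, 0)   # test specimens prescribed for this taxon
        test.extend(ids[:k])
        rest.extend(ids[k:])
    rng.shuffle(rest)                    # pooled, unstratified remainder (assumption)
    n_train = round(train_fraction * len(rest))
    return rest[:n_train], rest[n_train:], test


# Hypothetical toy data: three taxa of very different sizes.
toy = {
    "Taxon A": [f"A{k}" for k in range(20)],
    "Taxon B": [f"B{k}" for k in range(5)],
    "Taxon C": [f"C{k}" for k in range(30)],
}
train, val, test = comparison_split(toy, {"Taxon A": 2, "Taxon B": 1, "Taxon C": 0})
print(len(train), len(val), len(test))
```

Calling the function with ten different seeds would give ten divisions. The test sets stay small and roughly balanced across taxa, while the training and validation sets inherit the imbalance of the underlying collection.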

Since the comparison between professionals and semi-professionals analysing physical data with a laboratory microscope and automatic classifiers using image data is unequal, we asked the proficiency test participants to also try to identify taxa from the test images of the comparison data. Each participant received one of the test sets and a list of the 39 possible taxa labels. To avoid fatigue and to encourage more experts to participate, we restricted the number of images per test specimen to 10. The automatic classifiers used exactly the same test data. In addition, because some of the images are fuzzy, the experts were allowed to classify the taxa to a higher taxonomic rank if they were unsure. The automatic classifiers always predicted the classes of the test specimens to the deepest available rank of taxonomic resolution. Of the ten experts participating in the proficiency test, three volunteered to take part in this image classification study.

As the comparison test sets are very small, we also studied the performance of the automatic classifiers on larger test sets. We split the specimens randomly into training (70 %), validation (10 %) and test (20 %) data 10 times. This time the number of specimens in each taxon varied in all training, validation, and test sets depending on the size of the taxa in the dataset. We refer to these sets as the “machine learning data” as the splitting is typical for machine learning, but not suitable for comparisons with humans. For the test sets in the machine learning data, we included all images (max. 50) per specimen.

We considered different approaches to take the hierarchical nature of the data into account: A flat classifier is a single classifier with the 39 taxa as output labels. Local per level classifiers are built for each taxonomic rank separately: a classifier for the orders and another classifier for the families. We only trained local per level classifiers for the two highest taxonomic ranks as some of the taxa in the data have information only on these ranks. The top-down, local per parent node classifier is a system comprising 17 classifiers: one classifier at the top to identify the order of a specimen, four classifiers at the family level as there are four orders with more than one family within them, five classifiers at the genus level, and seven classifiers at the species level (see Table III). Some of the specimens get their predictions already at the order level since there are three orders with only one family or genus within them. In the data, there are two genera (Leuctra sp. and Nemoura sp.) for which only some of the specimens have information on species (Leuctra nigra and Nemoura cinerea). To separate these groups with the local per parent node classification approach, we temporarily marked the species for the rest of the Leuctra sp. and Nemoura sp. specimens as ’’. We trained the local species level classifiers and if they predicted the ’’ label, we marked the specimen as predicted only to genus level.

III-D Classification methods

For the automatic classification, we selected convolutional neural networks (CNN; Krizhevsky et al., 2012) and support vector machines (SVM; Cortes and Vapnik, 1995). As our CNN model, we used the MatConvNet (Vedaldi and Lenc, 2015) implementation of the AlexNet CNN architecture (Krizhevsky et al., 2012). The architecture has five convolution layers followed by three fully-connected layers. The last fully-connected layer is followed by a softmax loss layer during training and a softmax layer during testing. In our tests, we also considered the output of the last fully-connected layer instead of the softmax output, because we observed that this produced better results when the final class was decided based on the average of the outputs for each image of a specimen. We trained flat and local per level classifiers from scratch using 60 training epochs. For the 17 classifiers of each local per parent node classifier, we took the flat classifier for the corresponding data split as our starting point and fine-tuned the network for 10 epochs (5 epochs for only the last fully-connected layer, 3 epochs for all fully-connected layers, and 2 epochs for all layers). In all cases, we used a batch size of 256 and trained the network using stochastic gradient descent with a momentum of 0.9. When training from scratch, we used a learning rate varying from 0.01 to 0.0001 and for fine-tuning a learning rate varying from 0.005 to 0.0001. We saved the networks after each epoch and selected the final model based on the classification accuracy on the validation set.

While CNNs use the original images as input, we extracted a set of 64 simple geometry and intensity-based features from the images using ImageJ (Rasband, 2010) for SVMs. The geometric features extracted include, e.g., area, perimeter, width and height of a bounding rectangle, while the intensity-based features were extracted from gray, red, green, and blue scale channels of the images. The complete set of features is listed in detail in Ärje et al. (2013). As these features are simple and the classification task of identifying such a large number of classes is a complex one, we found that making a principal component transformation on the features improves classification results. Therefore, we performed a principal component transformation, as well as standardization, on the features before using them for classification.

We built our SVM model (Chang and Lin, 2011) using R (R Core Team, 2016) package e1071 (Meyer et al., 2018) and used a Gaussian kernel. For flat classification and local per level classification, we performed a grid search for the parameters over and . For the local per parent node hierarchical classification system, we explored a larger grid as the classification problems can be very different from one another at different nodes of the hierarchical system. Due to the amount of data and time consumed by evaluating just a single parameter combination, we did the following: we randomly selected one image per specimen and used this data to perform the grid search for the parameters over and . After determining the optimal parameter values with this smaller data, we did a small, , grid search around those values with all the images (max. 50 images per specimen).

For both the comparison and the machine learning data, we did the following: With each data split, we used the training data to train the model and the validation data to either select the best epoch to stop training (CNNs) or select optimal parameter values (SVMs) based on the classification accuracy of the validation specimens. With SVMs, we combined the training and validation data to train the final model after fixing the parameters. At the end, we classified each test image and selected the final class for each specimen using either average output (CNNs) or majority vote over all the images of the specimen (CNNs, SVMs).
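The two specimen-level decision rules can behave differently on the same per-image outputs. The Python sketch below illustrates this with made-up scores for a two-class problem; the function names and numbers are hypothetical, and the scores stand for any per-class output of a classifier (softmax probabilities or last-layer activations).

```python
from collections import Counter
from typing import List, Sequence


def decide_by_average(outputs: Sequence[Sequence[float]], classes: Sequence[str]) -> str:
    """Average the per-class outputs over all images, then take the best class."""
    n_images = len(outputs)
    mean_scores = [sum(column) / n_images for column in zip(*outputs)]
    return classes[mean_scores.index(max(mean_scores))]


def decide_by_vote(outputs: Sequence[Sequence[float]], classes: Sequence[str]) -> str:
    """Classify each image separately, then take the most frequent class."""
    votes = Counter(classes[list(scores).index(max(scores))] for scores in outputs)
    return votes.most_common(1)[0][0]


# Made-up outputs for three images of one specimen: two images weakly favour
# "Taxon B", one sharp image strongly favours "Taxon A".
classes: List[str] = ["Taxon A", "Taxon B"]
outputs = [
    [0.40, 0.60],
    [0.45, 0.55],
    [0.95, 0.05],
]

print(decide_by_average(outputs, classes))  # Taxon A
print(decide_by_vote(outputs, classes))     # Taxon B
```

Averaging keeps the confidence of each image, so one clear view can outweigh several ambiguous ones, whereas majority voting discards that information.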

IV Analysis and inference

IV-A Comparison data

Classification results for the comparison test sets of the image data as well as results of the proficiency test on physical data are presented in Table I. The first row of results shows the average CE on the deepest available rank of taxonomy. These are the results traditionally examined with flat classifiers. Taxonomic experts using physical data and microscopes to identify the taxa still outperform the automatic approaches. This result by taxonomic experts can be considered as a gold-standard to compare to. However, taxonomic experts predicting taxa from the images make the most classification errors. This is understandable as the image quality can be sub-par for some specimens and the experts have not studied identification from these types of images. For the automatic classifiers, CNN using the flat classification approach and the average output for deciding the final class has the lowest CE and is in the range of taxonomic experts with physical data. The average output clearly outperforms the majority vote as a decision rule for the final class even though the number of images per specimen is relatively high.

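| Rank | Measure | CNN flat, aver. | CNN flat, vote | CNN local/level | CNN hier. | SVM flat | SVM local/level | SVM hier. | Experts, images | Experts, physical data |
|---|---|---|---|---|---|---|---|---|---|---|
| Deepest level | CE (mean) | 0.114 | 0.131 | - | 0.138 | 0.243 | - | 0.28 | 0.553 | 0.061 |
| | CE (sd) | 0.036 | 0.054 | - | 0.055 | 0.081 | - | 0.074 | 0.153 | 0.053 |
| | LCSE (mean) | 0.052 | 0.070 | - | 0.070 | 0.173 | - | 0.191 | 0.353 | 0.028 |
| | LCSE (sd) | 0.023 | 0.034 | - | 0.036 | 0.061 | - | 0.053 | 0.162 | 0.024 |
| Order | CE (mean) | 0.004 | 0.018 | 0.011 | 0.011 | 0.085 | 0.075 | 0.075 | 0.210 | 0.007 |
| | CE (sd) | 0.009 | 0.02 | 0.012 | 0.012 | 0.041 | 0.026 | 0.026 | 0.190 | 0.015 |
| Family | CE (mean) | 0.039 | 0.059 | 0.150 | 0.059 | 0.173 | 0.181 | 0.193 | 0.291 | 0.020 |
| | CE (sd) | 0.029 | 0.037 | 0.259 | 0.044 | 0.070 | 0.062 | 0.069 | 0.151 | 0.020 |
| Error structure | #ERR(order) | 2 | 8 | - | 5 | 39 | - | 34 | 29 | 3 |
| | #ERR(family) | 16 | 19 | - | 22 | 40 | - | 54 | 11 | 6 |
| | #ERR(genus) | 12 | 12 | - | 12 | 16 | - | 22 | 15 | 6 |
| | #ERR(species) | 22 | 21 | - | 24 | 16 | - | 18 | 21 | 13 |

Table I: Classification results for comparison test data. CE and LCSE are averaged over all 10 experts/data splits (for experts with images, 3 data splits). The number of new classification errors at each taxonomic rank is summed over all 10 data splits (for experts with images, 3 data splits).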

While flat classification gives only single level and single label predictions, it is still possible to make comparisons on different ranks of taxonomic resolution. We simply take the predictions from the deepest rank of taxonomy of the data and add the ascending taxa labels accordingly. Let us call this a bottom-up examination. Using the bottom-up examination, we can calculate LCSE also for flat classifiers. The LCSE values for all classifiers as well as for taxonomic experts are clearly smaller than the CE values (see Table I). This means that most of the classification errors occur on deeper ranks of taxonomic resolution while the order and family might be predicted correctly. If all the classification errors were made already on the order level, CE and LCSE would be the same. For taxonomic experts using physical data, LCSE is close to zero as expected since taxonomic experts use a top-down hierarchical logic for the classification task, and identifying the higher ranks of taxonomy should be an easy task for an expert. Also in terms of LCSE, CNNs get close to the taxonomic expert level.
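The bottom-up examination is a lookup of ancestors in the taxonomy. A minimal Python sketch, assuming the taxonomy is stored as a child-to-parent map; the map below is a hypothetical toy example rather than the taxa of this study.

```python
from typing import Dict, List

# Hypothetical toy taxonomy: child label -> parent label.
PARENT: Dict[str, str] = {
    "Species A": "Genus A",
    "Species B": "Genus A",
    "Genus A": "Family B",
    "Genus B": "Family B",
    "Family A": "Order A",
    "Family B": "Order A",
}


def ascend(label: str, parent: Dict[str, str] = PARENT) -> List[str]:
    """Expand a flat label into its full path, listed from order downwards."""
    path = [label]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return path[::-1]


# A flat classifier confuses Species A with Genus B: wrong at the deepest
# rank, yet the ascending labels still agree on order and family.
true_path = ascend("Species A")
pred_path = ascend("Genus B")
matching_ranks = sum(1 for t, p in zip(true_path, pred_path) if t == p)

print(true_path)
print(pred_path)
print(matching_ranks)
```

Once flat predictions are expanded this way, rank-wise errors and hierarchical measures can be computed for them exactly as for a top-down classifier.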

Contrary to the previous findings in hierarchical classification literature (Silla and Freitas, 2011), the flat classifiers for both CNN and SVM produce better results than the hierarchical classification approach. Babbar et al. (2016) stated in their study that if the data is highly unbalanced, hierarchical classifiers are better options even though their empirical error (CE) may be higher due to error propagation. While our test data is balanced, the training data used to train the classifiers is not. However, taking the hierarchical nature of the data into account when building the classifier produces not only a higher CE but also a little higher LCSE. It is worth noting that the optimization of the classifiers is based on CE, not LCSE. The only improvement the hierarchical classification system offers is a slightly lower CE on the order level for SVM. Note that for the order level, the hierarchical classifier and the local per level classifier are the same. Interestingly, the local per level SVM and CNN classifiers for family level perform worse than the flat classifiers with the ascending taxa labels. The notably high CE for local per level CNN for family level is due to data split three, where CNN classifies all observations to the family Elmidae. When leaving this data split out, the average classification error is 7 %.

The bottom part of Table I shows the error structure for each classifier and the taxonomic experts. The numbers of new errors at the different taxonomic ranks sum up to the total number of misclassifications for the 10 balanced test splits. The difference between taxonomic expert and machine logic is evident in the number of errors on each taxonomic rank. For taxonomic experts using physical data, there are very few misclassifications at the order level and the number of errors increases with the taxonomic resolution. For experts using image data, all the order level errors are due to completely missing predictions for images too challenging to identify. That is, all the predictions made by the experts were correct at the order level and, as with physical data, the number of errors increases with the taxonomic rank. For the automatic classifiers, most misclassifications are made at either species or family level. There is no such clear hierarchy in the error structure as for the taxonomic experts.
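The error structure can be tallied by recording, for each misclassified specimen, the highest rank at which the prediction first departs from the true path. The Python sketch below does this for a few made-up label paths; it assumes that a missing prediction at a rank counts as an error at that rank, and the function names and toy labels are hypothetical.

```python
from typing import Dict, List, Optional, Sequence

RANKS = ("order", "family", "genus", "species")


def first_error_rank(true: Sequence[str], pred: Sequence[str]) -> Optional[str]:
    """Rank of the first top-down mismatch, or None if the whole path is correct."""
    for level, label in enumerate(true):
        if level >= len(pred) or pred[level] != label:  # missing counts as an error
            return RANKS[level]
    return None


def error_structure(trues: List[Sequence[str]], preds: List[Sequence[str]]) -> Dict[str, int]:
    """Count the new errors introduced at each taxonomic rank."""
    counts = {rank: 0 for rank in RANKS}
    for true, pred in zip(trues, preds):
        rank = first_error_rank(true, pred)
        if rank is not None:
            counts[rank] += 1
    return counts


# Made-up example with four specimens.
trues = [
    ("Order A", "Family B", "Genus A", "Species A"),
    ("Order A", "Family B", "Genus A", "Species B"),
    ("Order A", "Family A"),
    ("Order A", "Family B", "Genus B"),
]
preds = [
    ("Order A", "Family B", "Genus A", "Species B"),  # wrong species
    (),                                                # left unidentified
    ("Order A", "Family A"),                           # correct
    ("Order A", "Family A"),                           # wrong family
]

structure = error_structure(trues, preds)
print(structure)
```

Each misclassified specimen is counted exactly once, so the counts over all ranks add up to the total number of misclassifications, as in the bottom part of Table I.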

In biomonitoring and ecosystem assessment, not only a low number of classification errors is essential but also the type of errors made, as some misclassifications can have a higher cost than others. To examine this, we analysed the confusion matrices of the classifiers and taxonomic experts. Concerning especially demanding taxa, both the taxonomists and automatic classifiers had difficulties identifying Hydropsyche saxonica. Human experts easily misclassified them as Hydropsyche angustipennis when using physical data and into a mix of other Hydropsyche species when using image data. The image data has no Hydropsyche angustipennis specimens, and the automatic classifiers predicted many of the Hydropsyche saxonica to be Hydropsyche pellucidula (see Fig. 9). Hydropsyche saxonica is also one of the least represented taxa in the image data with only 17 specimens (see Table III), which is likely the reason the automatic classifiers have trouble classifying them. Besides this taxon, the human experts found one other taxon challenging in the physical data: some Rhyacophila nubila were misclassified as Rhyacophila fasciata. With the more difficult image data, the taxonomic experts classified these individuals to genus level only or left them unidentified, while SVMs mixed them with other taxa as there were no Rhyacophila fasciata in the image data. In addition, with the image data, the human experts had trouble identifying Elmis aenea, with some specimens left completely unidentified and some misclassified as Oulimnius tuberculatus. The automatic classifiers identified this taxon more easily.

IV-B Machine learning data

The results on the machine learning data with larger test sets are shown in Table II. Both CE and LCSE for all the classifiers are clearly lower with these data splits. That is due to two factors: these results are more stable, meaning they are not affected by individual difficult specimens, and here the size of each taxon in the test set reflects the size of the taxon in the training/validation sets. The comparison test sets of Section IV-A had only 0–4 specimens of each taxon, and therefore the taxa with only a few training specimens had the same weight as the taxa with hundreds of training specimens. For the machine learning data, taxa with little training data will also have only a few test specimens and a small weight in the classification error of the entire test set.
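The weighting effect described above can be made concrete with a small arithmetic sketch. The per-taxon error rates and test set sizes below are made-up illustrative numbers, not results from the study, and the function name is hypothetical. The overall CE is simply the per-taxon error weighted by the number of test specimens of each taxon.

```python
def overall_error(per_taxon_error, test_counts):
    """Overall CE as the specimen-weighted mean of per-taxon error rates."""
    total = sum(test_counts.values())
    wrong = sum(per_taxon_error[t] * test_counts[t] for t in test_counts)
    return wrong / total


# Hypothetical error rates: a rare taxon is hard, a common taxon is easy.
per_taxon_error = {"rare taxon": 0.80, "common taxon": 0.05}

# Balanced comparison-style test set: the same number of specimens per taxon.
balanced = overall_error(per_taxon_error, {"rare taxon": 2, "common taxon": 2})

# Proportional test set: test counts follow the (unbalanced) taxon sizes.
proportional = overall_error(per_taxon_error, {"rare taxon": 3, "common taxon": 130})

print(f"balanced CE: {balanced:.3f}, proportional CE: {proportional:.3f}")
```

With identical per-taxon performance, the proportional test set yields a much lower overall CE because the rare, difficult taxon contributes only a few specimens.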

Level            Measure        CNN flat, aver.  CNN flat, vote  CNN local/level  CNN hier.      SVM flat       SVM local/level  SVM hier.
Deepest level    CE             0.078 (0.009)    0.087 (0.009)   -                0.087 (0.013)  0.17 (0.008)   -                0.181 (0.009)
Deepest level    LCSE           0.044 (0.006)    0.052 (0.006)   -                0.048 (0.005)  0.124 (0.006)  -                0.129 (0.008)
Order            CE             0.01 (0.002)     0.015 (0.003)   0.011 (0.002)    0.011 (0.002)  0.055 (0.006)  0.053 (0.005)    0.053 (0.005)
Family           CE             0.041 (0.006)    0.05 (0.007)    0.033 (0.003)    0.044 (0.004)  0.129 (0.006)  0.126 (0.008)    0.135 (0.011)
Error structure  #ERR(order)    194              287             -                216            1071           -                1017
Error structure  #ERR(family)   605              685             -                638            1428           -                1589
Error structure  #ERR(genus)    304              307             -                319            455            -                505
Error structure  #ERR(species)  410              412             -                510            344            -                393
Table II: Classification results for machine learning test data. CE and LCSE are averaged over all 10 data splits, where each test split has . The number of new classification errors at each taxonomic rank is summed over all 10 data splits, where .

The results are similar to those in Table I. CNNs produce the best classification results. Again, the flat classification versions of CNN and SVM outperform the hierarchical classifiers, contradicting previous findings of hierarchical classification studies (Silla and Freitas, 2011). With the machine learning data splits, the local per level classification approach gives a slightly lower CE than the flat classifier at both the order level (SVM) and the family level (SVM and CNN).

When considering individual challenging taxa, the best classifier, CNN, mostly has trouble with the least represented taxa in the data due to a lack of adequate training data. The smallest taxa are Hydropsyche saxonica, Nemoura cinerea, Capnopsis schilleri, Sialis sp., Leuctra nigra and Sphaerium sp. with average number of specimens in the training data, and respectively. With the exceptions of Sialis sp. and Sphaerium sp., the average CE for these taxa ranged from 62% to 98% for CNNs and from 61% to 100% for SVMs. In contrast, all the classifiers performed well on classifying Sphaerium sp. (), and CNNs also performed relatively well on classifying Sialis sp. ().

One reason why the hierarchical, local per parent node approach performs worse than flat classification could be that the hierarchy in the data is not based on visual aspects. The taxonomic resolution is based on affinity, which can be independent of the appearance of the taxa. However, the automatic classifiers base all classification decisions on visual features; hence, the man-made hierarchy of the data could confuse the classifiers. Fig. 9 gives examples of taxa that belong to the same family or genus but have clear differences in their appearance, e.g., size.
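The following toy sketch illustrates how a local per parent node classifier routes a specimen from the root downwards and why an early mistake cannot be corrected at deeper levels. The node names, features, and decision rules are entirely hypothetical stand-ins chosen for illustration; they are not the classifiers used in this work.

```python
def predict_top_down(x, node_classifiers, root="root"):
    """Follow one local classifier per parent node until a leaf is reached."""
    path, node = [], root
    while node in node_classifiers:
        node = node_classifiers[node](x)  # choose among the children of `node`
        path.append(node)
    return path


# Hypothetical rule-based stand-ins for trained per-parent classifiers.
node_classifiers = {
    "root": lambda x: "order_A" if x["size"] > 5.0 else "order_B",
    "order_A": lambda x: "family_A1" if x["dark"] else "family_A2",
    "order_B": lambda x: "family_B1",
}

# A specimen that truly belongs to order_A / family_A2 but is unusually small.
small_member_of_A = {"size": 3.0, "dark": False}
path = predict_top_down(small_member_of_A, node_classifiers)
print(path)

# A typical large, dark specimen of order_A / family_A1 is routed correctly.
typical_member_of_A = {"size": 8.0, "dark": True}
typical_path = predict_top_down(typical_member_of_A, node_classifiers)
print(typical_path)
```

Once the root classifier sends the small specimen to the wrong order, only labels under that order remain reachable, which is the error propagation mentioned earlier. A flat classifier sees all leaf classes at once and is not constrained by a hierarchy that may not match visual similarity.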

Figure 9: Examples of visual differences among taxa belonging to the same family or genus. Top row: Hydropsyche pellucidula, Hydropsyche saxonica, and Hydropsyche siltalai all belong to the genus Hydropsyche. Bottom row: Neureclipsis bimaculata, Plectrocnemia sp., Polycentropus flavomaculatus, and Polycentropus irroratus all belong to the family Polycentropodidae. In both cases, the taxa are of different sizes and colors.

V Discussion

The status assessment of ecosystems is often based on the use of biological indicators that are manually identified by human experts. The manual collection and identification of the data by ecological experts is, however, known to be costly and time-consuming. While a growing number of recent studies explore the enormous potential of genetic identification methods, these are not yet standardized and thus cannot currently be used to their full potential for legislative biomonitoring purposes (e.g. Hering et al., 2018). An interim solution could lie in a computer-based identification system that simply replaces the step of human identification in current biomonitoring while preserving all other steps of the existing process chain. To switch to this novel approach, ecologists must start to trust the machine logic. In this work, we compared human expert predictions for physical and image data to those of machine learning methods on image data.

To automate the identification process, we have developed a generic imaging system producing multiple images for each specimen. With our imaging system, we collected a large dataset of benthic freshwater macroinvertebrate images and assigned labels consisting of multiple taxonomic ranks. The classical approach in computer-based identification has been flat classification, where the classification is performed at the most specific rank of the taxonomic resolution. In addition to the classical flat approach, we also considered local hierarchical classifiers, namely local per level classifiers and local per parent node classifiers. We selected convolutional neural networks (CNNs) and support vector machines (SVMs) as classification methods. We are not aware of any earlier works applying local hierarchical classifiers based on the taxonomic resolution of invertebrates. We evaluated both automatic classifiers and taxonomic experts using the classification error (CE) at the most specific level and a novel variant of the context sensitivity error (CSE) that takes the top-down success into account. We call this variant the level-aware context-sensitive error (LCSE).

We split the image data to produce test sets similar to the ones used in the proficiency test with physical data for taxonomic experts to be able to directly compare machines and human experts. We found that the taxonomic experts obtained the best classification performance when analysing the physical data using a microscope (CE=6.1% and LCSE=2.8%) and the worst when using the image data (CE=55.3% and LCSE=35.3%). The best automatic classifier was the CNN using flat classification approach and the average output of all the images for a specimen as the decision rule to decide the final label (CE=11.4% and LCSE=5.3%). This result is well within the range of human experts taking part in the proficiency test. We observed also that, contrary to earlier observations in the literature, the flat classifiers with both CNN and SVM performed better than the local per parent node hierarchical classifiers. We assume this is because the hierarchy based on the taxonomic resolution does not necessarily correlate with the visual similarity of the taxa. The hierarchical classifiers would be likely more successful if they could first separate the easiest superclasses and then concentrate on more subtle differences within those superclasses. Besides the CE and LCSE measures, we also investigated the main differences in confusion matrices. The most difficult classes were partially overlapping for machines and experts, but there were some differences as well. Human experts using images preferred to stay at higher ranks of taxonomic hierarchy for difficult taxa while machines were forced to predict the deepest possible level, and thus, ended up predicting wrong species. Unsurprisingly, we observed that CNNs had trouble identifying the classes with a low amount of training samples.
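Since each specimen has multiple images, the per-image classifier outputs must be combined into one label per specimen. The sketch below contrasts averaging the outputs with voting over per-image decisions. It assumes that "vote" means a simple majority vote over the per-image top predictions; the function names and the scores are hypothetical and only illustrate that the two rules can disagree.

```python
from collections import Counter


def decide_by_average(image_scores):
    """Average the class scores over all images, then pick the best class."""
    n_classes = len(image_scores[0])
    mean = [sum(s[c] for s in image_scores) / len(image_scores)
            for c in range(n_classes)]
    return max(range(n_classes), key=mean.__getitem__)


def decide_by_vote(image_scores):
    """Pick the class predicted for the largest number of individual images.

    Ties are resolved in favour of the class that was encountered first.
    """
    votes = [max(range(len(s)), key=s.__getitem__) for s in image_scores]
    return Counter(votes).most_common(1)[0][0]


# Made-up class scores for three images of one specimen (three classes).
image_scores = [
    [0.40, 0.35, 0.25],  # weakly favours class 0
    [0.45, 0.40, 0.15],  # weakly favours class 0
    [0.05, 0.90, 0.05],  # strongly favours class 1
]
label_average = decide_by_average(image_scores)
label_vote = decide_by_vote(image_scores)
print(label_average, label_vote)
```

In this toy case voting follows the two weakly confident images, whereas averaging lets the single confident image dominate, which shows how the choice of decision rule can change the final label.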

The test sets in our comparison data were kept very small so as not to burden the human participants too much. This naturally makes the results unstable in the sense that a few difficult specimens or bad images may affect the results considerably. Therefore, we also evaluated the automatic classifiers on different data splits, where the test sets were considerably larger and also represented the overall taxa distribution. The ranking of the automatic classifiers with respect to the CE and LCSE measures was similar, while the absolute CE and LCSE values were much smaller for these larger test sets. Again, forcing automatic classifiers to operate with the logic of human experts, i.e., the local per parent node approach, did not improve classification results.

The main purpose of this paper was to investigate differences in the identification logic of humans and machines. Taxonomic experts still outperformed the selected automatic methods, but the CNNs’ performance was close and fell within the range of typical human experts. In the future, we will apply more advanced machine learning techniques, boost the performance on the rarest classes using, e.g., transfer learning and data augmentation, and consider global hierarchical classifiers. We expect that automatic methods can replace human experts in the routine-like identification of the easiest taxa already in the near future, while the human experts or genetic methods can concentrate on the harder cases. Therefore, it is important that ecologists start having confidence in the machines’ ability to perform this task and better understand the main challenges that are associated with automatic identification.


We thank the Academy of Finland for the grants of Ärje (284513, 289076), Tirronen (289076, 289104), Kärkkäinen (289076), Meissner (289104), and Raitoharju (288584). We would like to thank CSC for computational resources.


  • Ärje et al. (2016) Ärje, J., Choi, K.-P., Divino, F., Meissner, K., and Kärkkäinen, S. (2016). Understanding the statistical properties of the percent model affinity index can improve biomonitoring related decision making. Stochastic Environmental Research and Risk Assessment, 30(7):1981–2008.
  • Ärje et al. (2017) Ärje, J., Kärkkäinen, S., Meissner, K., Iosifidis, A., Ince, T., Gabbouj, M., and Kiranyaz, S. (2017). The effect of automated taxa identification errors on biological indices. Expert Systems with Applications, 72:108–120.
  • Ärje et al. (2010) Ärje, J., Kärkkäinen, S., Meissner, K., and Turpeinen, T. (2010). Statistical classification methods and proportion estimation – an application to a macroinvertebrate image database. Proceedings of the 2010 IEEE Workshop on Machine Learning for Signal Processing (MLSP).
  • Ärje et al. (2013) Ärje, J., Kärkkäinen, S., Turpeinen, T., and Meissner, K. (2013). Breaking the curse of dimensionality in quadratic discriminant analysis models with a novel variant of a Bayes classifier enhances automated taxa identification of freshwater macroinvertebrates. Environmetrics, 24(4):248–259.
  • Babbar et al. (2016) Babbar, R., Partalas, I., Gaussier, E., Amini, M.-R., and Amblard, C. (2016). Learning taxonomy adaptation in large scale classification. Journal of Machine Learning Research, 17:1–37.
  • Borja and Elliott (2013) Borja, A. and Elliott, M. (2013). Marine monitoring during an economic crisis: the cure is worse than the disease. Marine Pollution Bulletin, 68:1–3.
  • Caley et al. (2014) Caley, M. J., O’Leary, R. A., Fisher, R., Low-Choy, S., Johnson, S., and Mengersen, K. (2014). What is an expert? a systems perspective on expertise. Ecology and Evolution, 4(3):231–242.
  • Chang and Lin (2011) Chang, C.-C. and Lin, C.-J. (2011). LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2(3):27:1–27.
  • Cortes and Vapnik (1995) Cortes, C. and Vapnik, V. (1995). Support-vector networks. Machine Learning, 20:273–297.
  • Culverhouse et al. (2003) Culverhouse, P., Williams, R., Reguera, B., Herry, V., and González-Gil, S. (2003). Do experts make mistakes? A comparison of human and machine identification of dinoflagellates. Marine Ecology Progress Series, 247:17–25.
  • Culverhouse et al. (2006) Culverhouse, P., Williams, R., Reguera, B., Herry, V., and González-Gil, S. (2006). Automatic image analysis of plankton: future perspectives. Marine Ecology Progress Series, 312.
  • Elbrecht et al. (2017) Elbrecht, V., Vamos, E. E., Meissner, K., Aroviita, J., and Leese, F. (2017). Assessing strengths and weaknesses of DNA metabarcoding-based macroinvertebrate identification for routine stream monitoring. Methods in Ecology and Evolution, 8(10):1265–1275.
  • Haase et al. (2010) Haase, P., Pauls, S. U., Schindehütte, K., and Sunderman, A. (2010). First audit of macroinvertebrate samples from an EU Water Framework Directive monitoring program: human error greatly lowers precision of assessment results. Journal of the North American Benthological Society, 29(4):1279–1291.
  • Hering et al. (2018) Hering, D., Borja, A., Jones, J. I., Pont, D., Boets, P., Bouchez, A., Bruce, K., Drakare, S., Hänfling, B., Kahlert, M., Leese, F., Meissner, K., Mergen, P., Reyjol, Y., Segurado, P., Vogler, A., and Kelly, M. (2018). Implementation options for DNA-based identification into ecological status assessment under the European Water Framework Directive. Water Research, 138:192–205.
  • International commission on zoological nomenclature (1999) International commission on zoological nomenclature (1999). International code of zoological nomenclature. International Trust for Zoological Nomenclature, fourth edition.
  • Joutsijoki et al. (2014) Joutsijoki, H., Meissner, K., Gabbouj, M., Kiranyaz, S., Raitoharju, J., Ärje, J., Kärkkäinen, S., Tirronen, V., Turpeinen, T., and Juhola, M. (2014). Evaluating the performance of artificial neural networks for the classification of freshwater benthic macroinvertebrates. Ecological Informatics, 20:1–12.
  • Kelly et al. (2015) Kelly, M., Schneider, S., and King, L. (2015). Customs, habits, and traditions: the role of nonscientific factors in the development of ecological assessment methods. WIREs Water, 2:159–165.
  • Kho et al. (2017) Kho, S. J., Manickam, S., Malek, S., Mosleh, M., and Dhillon, S. K. (2017). Automated plant identification using artificial neural network and support vector machine. Frontiers in Life Science, 10(1):98–107.
  • Kiranyaz et al. (2011) Kiranyaz, S., Ince, T., Pulkkinen, J., Gabbouj, M., Ärje, J., Kärkkäinen, S., Tirronen, V., Juhola, M., Turpeinen, T., and Meissner, K. (2011). Classification and retrieval on macroinvertebrate image databases. Computers in Biology and Medicine, 41(7):463–472.
  • Krizhevsky et al. (2012) Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (NIPS), pages 1097–1105.
  • Lee et al. (2016) Lee, H., Park, M., and Kim, J. (2016). Plankton classification on imbalanced large scale database via convolutional neural networks with transfer learning. Image Processing (ICIP), 2016 IEEE International Conference on, pages 3713–3717.
  • Leese et al. (2018) Leese, F., Bouchez, A., Abarenkov, K., Altermatt, F., Borja, Á., Bruce, K., Ekrema, T., Čiamporová-Zaťovičová, F., Costa, F. O., Duarte, S., Elbrecht, V., Fontaneto, D., Franc, A., Geiger, M. F., Hering, D., Kahlert, M., Kalamujić Stroil, B., Kelly, M., Keskin, E., Liska, I., Mergen, P., Meissner, K., Pawlowski, J., Penev, L., Reyjol, Y., Rotter, A., Steinke, D., van der Wal, B., Vitecek, S., Zimmermann, J., and Weigand, A. M. (2018). Why we need sustainable networks bridging countries, disciplines, cultures and generations for aquatic biomonitoring 2.0: a perspective derived from the DNAqua-Net COST Action. Advances in Ecological Research, 58:63–99.
  • Levatic et al. (2015) Levatic, J., Kocev, D., and Dzeroski, S. (2015). The importance of the label hierarchy in hierarchical multi-label classification. Journal of Intelligent Information Systems, 45(2):247–271.
  • Lytle et al. (2010) Lytle, D. A., Martínez-Muñoz, G., Zhang, W., Larios, N., Shapiro, L., Paasch, R., Moldenke, A., Mortensen, E. N., Todorovic, S., and Dietterich, T. G. (2010). Automated processing and identification of benthic invertebrate samples. Journal of the North American Benthological Society, 29(3):867–874.
  • Meissner et al. (2010) Meissner, K., Aroviita, J., Hellsten, S., Järvinen, M., Karjalainen, S. M., Kuoppala, M., Mykrä, H., and Vuori, K.-M. (2010). Jokien ja järvien biologinen seuranta - näytteenotosta tiedon tallentamiseen. Online guidance.
  • Meissner et al. (2017) Meissner, K., Nygård, H., Björklöf, K., Jaale, M., Hasari, M., Laitila, L., Rissanen, J., and Leivuori, M. (2017). Proficiency test 04/2016: Taxonomic identification of boreal freshwater lotic, lentic, profundal and North-Eastern Baltic benthic macroinvertebrates. Reports of the Finnish Environment Institute, 2.
  • Meyer et al. (2018) Meyer, D., Dimitriadou, E., Hornik, K., Weingessel, A., and Leisch, F. (2018). e1071: Misc Functions of the Department of Statistics, Probability Theory Group (Formerly: E1071), TU Wien. R package version 1.7-1.
  • Nygård et al. (2016) Nygård, H., Oinonen, S., Lehtiniemi, M., Hällfors, H., Rantajärvi, E., and Uusitalo, L. (2016). Price versus value of marine monitoring. Frontiers in Marine Science, 3:205.
  • R Core Team (2016) R Core Team (2016). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.
  • Raitoharju et al. (2018) Raitoharju, J., Riabchenko, E., Ahmad, I., Iosifidis, A., Gabbouj, M., Kiranyaz, S., Tirronen, V., Ärje, J., Kärkkäinen, S., and Meissner, K. (2018). Benchmark database for fine-grained image classification of benthic macroinvertebrates. Image and Vision Computing, 78:73–83.
  • Rasband (2010) Rasband, W. S. (1997-2010). ImageJ. U.S. National Institutes of Health, Bethesda, Maryland, USA.
  • Rodrigues et al. (2012) Rodrigues, F., Pereira, F. C., Alves, A., Jiang, S., and Ferreira, J. (2012). Automatic classification of points-of-interest for land-use analysis. Proceedings of GEOProcessing 2012: The Fourth International Conference on Advanced Geographic Information Systems, Applications, and Services, pages 41–49.
  • Silla and Freitas (2011) Silla, C. N. J. and Freitas, A. A. (2011). A survey of hierarchical classification across different application domains. Data Mining and Knowledge Discovery, 22(1–2):31–72.
  • Uusitalo et al. (2016) Uusitalo, L., Fernandes, J. A., Bachiller, E., Tasala, S., and Lehtiniemi, M. (2016). Semi-automated classification method addressing Marine Strategy Framework Directive (MSFD) zooplankton indicators. Ecological Indicators, 71:398–405.
  • Vedaldi and Lenc (2015) Vedaldi, A. and Lenc, K. (2015). MatConvNet: Convolutional neural networks for Matlab. In Proceedings of International Conference on Multimedia, pages 689–692.
  • Verma et al. (2012) Verma, N., Mahajan, D., Sellamanickam, S., and Nair, V. (2012). Learning hierarchical similarity metrics. 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2280–2287. Providence, RI USA.
  • Yousef Kalafi et al. (2018) Yousef Kalafi, E., Town, C., and Kaur Dhillon, S. (2018). How automated image analysis techniques help scientists in species identification and classification. Folia Morphologica, 77(2):179–193.
  • Zimmermann et al. (2015) Zimmermann, J., Glockner, G., Jahn, R., Enke, N., and Gemeinholzer, B. (2015). Meta-barcoding vs. morphological identification to assess diatom diversity in environmental studies. Molecular Ecology Resources, 15:526–542.


Taxa Species Genus Family Order #specimens #images
Elmis aenea Elmis aenea Elmis Elmidae Coleoptera 648 32398
Limnius volckmari Limnius volckmari Limnius Elmidae Coleoptera 314 15621
Oulimnius tuberculatus Oulimnius tuberculatus Oulimnius Elmidae Coleoptera 335 16674
Hydraena sp. - Hydraena Hydraenidae Coleoptera 198 9900
Simuliidae - - Simuliidae Diptera 887 44240
Ameletus inopinatus Ameletus inopinatus Ameletus Ameletidae Ephemeroptera 127 6346
Baetis rhodani Baetis rhodani Baetis Baetidae Ephemeroptera 404 19829
Baetis vernus group Baetis vernus Baetis Baetidae Ephemeroptera 176 8588
Ephemerella aurivillii Ephemerella aurivillii Ephemerella Ephemerellidae Ephemeroptera 356 16458
Ephemerella mucronata Ephemerella mucronata Ephemerella Ephemerellidae Ephemeroptera 304 15175
Heptagenia sulphurea Heptagenia sulphurea Heptagenia Heptageniidae Ephemeroptera 438 21502
Kageronia fuscogrisea Kageronia fuscogrisea Kageronia Heptageniidae Ephemeroptera 222 10826
Leptophlebia sp. - Leptophlebia Leptophlebiidae Ephemeroptera 412 20366
Sialis sp. - Sialis Sialidae Megaloptera 26 1162
Capnopsis schilleri Capnopsis schilleri Capnopsis Capniidae Plecoptera 21 1050
Leuctra nigra Leuctra nigra Leuctra Leuctridae Plecoptera 27 1350
Leuctra sp. - Leuctra Leuctridae Plecoptera 298 14899
Amphinemura borealis Amphinemura borealis Amphinemura Nemouridae Plecoptera 322 16100
Nemoura cinerea Nemoura cinerea Nemoura Nemouridae Plecoptera 16 800
Nemoura sp. - Nemoura Nemouridae Plecoptera 187 9314
Protonemura sp. - Protonemura Nemouridae Plecoptera 100 4908
Diura sp. - Diura Perlodidae Plecoptera 98 4427
Isoperla sp. - Isoperla Perlodidae Plecoptera 243 12148
Taeniopteryx nebulosa Taeniopteryx nebulosa Taeniopteryx Taeniopterygidae Plecoptera 331 16325
Micrasema gelidum Micrasema gelidum Micrasema Brachycentridae Trichoptera 233 11528
Micrasema setiferum Micrasema setiferum Micrasema Brachycentridae Trichoptera 323 13819
Agapetus sp. - Agapetus Glossosomatidae Trichoptera 290 14387
Silo pallipes Silo pallipes Silo Goeridae Trichoptera 56 2658
Hydropsyche pellucidula Hydropsyche pellucidula Hydropsyche Hydropsychidae Trichoptera 192 6513
Hydropsyche saxonica Hydropsyche saxonica Hydropsyche Hydropsychidae Trichoptera 17 490
Hydropsyche siltalai Hydropsyche siltalai Hydropsyche Hydropsychidae Trichoptera 395 19456
Oxyethira sp. - Oxyethira Hydroptilidae Trichoptera 218 10381
Lepidostoma hirtum Lepidostoma hirtum Lepidostoma Lepidostomatidae Trichoptera 267 10982
Neureclipsis bimaculata Neureclipsis bimaculata Neureclipsis Polycentropodidae Trichoptera 477 23721
Plectrocnemia sp. - Plectrocnemia Polycentropodidae Trichoptera 63 3015
Polycentropus flavomaculatus Polycentropus flavomaculatus Polycentropus Polycentropodidae Trichoptera 224 11005
Polycentropus irroratus Polycentropus irroratus Polycentropus Polycentropodidae Trichoptera 59 2917
Rhyacophila nubila Rhyacophila nubila Rhyacophila Rhyacophilidae Trichoptera 177 6993
Sphaerium sp. - Sphaerium Sphaeriidae Veneroida 150 1733
Table III: Taxonomic resolution of the multiple image data and the numbers of specimens and images per taxon. Taxa included in the proficiency test for human experts but not included in the image data were Brachyptera risi, Cloeon sp., Cloeon dipterum group, Cloeon inscriptum, Cloeon simile, Helobdella stagnalis, and Tinodes waeneri.