Pictures recorded by cameras installed in wild habitats have served as windows into the lives of wild animals for many decades. Modern camera traps combine infrared sensors with cameras so that images are captured automatically when motion is detected nearby. Although this makes them non-intrusive and inexpensive to deploy over large areas, camera traps often capture spurious data such as leaves blowing past the sensor. Despite this, the millions of photographs produced each year are used to inform several studies. Currently, perusal of these pictures by human experts is the most effective and popular way to extract information from the data. Manual sorting is cumbersome and scales poorly with the volume of data. The ability to automatically extract information from millions of unorganized photographs of animals can have a profound impact on studies in animal biology, ecology and conservation.
Advancements in Machine Learning algorithms and Computer Vision techniques have resulted in a large number of algorithms that can automatically extract information from visual data. Unfortunately, images from camera traps present a daunting combination of difficulties, including insufficient light, poor framerates, cluttered backgrounds and significant occlusion. Early solutions coped with these problems by introducing a manual step and by tailoring the solution to certain characteristic visual patterns in the data. For example, Extract Compare was originally proposed for the identification of individual tigers using a manual mark-up of silhouettes of an unknown individual. It used a 3D database of scanned shapes of tigers and their associated textures and images. Although the software tool is currently available for several species, generalizing it from tigers to other species required extensive effort and research.
Fully-automatic methods have been developed to detect the presence of animals in camera trap images. Specifically, the use of Deep Convolutional Neural Networks (DCNN) has been shown to be effective with large datasets [Norouzzadeh] (millions of training images) as well as smaller datasets [Chen] (hundreds). One of the early adopters of DCNNs for camera trap data, Yu et al[Yu]Yang]), from cropped images. Although they achieve 82% accuracy in detecting animals, this requires some manual preprocessing (cropping). Another approach to animal detection is to compute the difference between images from the same camera trap within a short period of time[Figueroa]. This method is useful for identifying the presence of marked animals (such as felines) that can be observed in a sequence of images.
There has been some work on species classification using DCNN. Chen et al[Chen] develop a fully automatic method for classification across 20 species from North America, using a dataset of about 20,000 images. The images were used to train a DCNN that achieves an overall species classification accuracy of . It is well known that performance of DCNNs depends heavily on the volume of supervisory training data[Glorot]. More recently Norouzzadeh et al[Norouzzadeh] developed a method that achieves an accuracy of using the VGG model (Simonyan[simonyan]).
While many of the methods discussed above are useful in detecting the presence of an animal or identifying the species of the detected animal, there is a paucity of automatic techniques that can recognise individuals. For tigers, other data such as pug-marks and roars have been used to perform recognition with success rate. To the best of our knowledge, this paper introduces the first fully automatic method for recognition of individuals from camera trap images. We present a unified methodology that also performs animal detection and species identification at rates comparable to existing methods. The main contribution is that our method performs individual recognition of tigers and leopards with limited training data.
We present the effectiveness of our classifiers for detecting animals, identifying the species of animals and recognising individual tigers and leopards in different subsections. We assessed our proposed classifiers quantitatively by calculating the proportion of correct predictions. We used four statistical measures that consider different combinations of true positives and true negatives: sensitivity, specificity, precision and accuracy. Sensitivity is the fraction of images from a particular class that are correctly classified as belonging to that class. Specificity is the fraction of images not belonging to a particular class that are correctly identified as such. Precision is the fraction of images reported to be of a particular class that are correctly identified. Accuracy is the fraction of correct classifications, positive or negative. If $TP$, $TN$, $FP$ and $FN$ denote the numbers of true positives, true negatives, false positives and false negatives respectively, then the measures are calculated using the following formulae:
$$\text{Sensitivity} = \frac{TP}{TP + FN}, \qquad \text{Specificity} = \frac{TN}{TN + FP},$$
$$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}.$$
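As a minimal sketch, the four measures described above can be computed from per-class confusion counts as follows. The names `ConfusionCounts`, `counts_for_class` and `classification_measures` are illustrative assumptions and do not come from the original experiments. A multi-class prediction is reduced to a one-vs-rest count for the class of interest.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """One-vs-rest confusion counts for a single class (hypothetical helper)."""
    tp: int  # true positives
    tn: int  # true negatives
    fp: int  # false positives
    fn: int  # false negatives


def counts_for_class(true_labels, predicted_labels, target):
    """Count TP, TN, FP and FN for `target` against all other labels."""
    tp = tn = fp = fn = 0
    for truth, predicted in zip(true_labels, predicted_labels):
        if predicted == target and truth == target:
            tp += 1
        elif predicted == target:
            fp += 1
        elif truth == target:
            fn += 1
        else:
            tn += 1
    return ConfusionCounts(tp, tn, fp, fn)


def _ratio(numerator, denominator):
    # Return NaN rather than raising when a class has no relevant examples.
    return numerator / denominator if denominator else float("nan")


def classification_measures(c):
    """Sensitivity, specificity, precision and accuracy from the counts."""
    return {
        "sensitivity": _ratio(c.tp, c.tp + c.fn),
        "specificity": _ratio(c.tn, c.tn + c.fp),
        "precision": _ratio(c.tp, c.tp + c.fp),
        "accuracy": _ratio(c.tp + c.tn, c.tp + c.tn + c.fp + c.fn),
    }
```

Calling `classification_measures(counts_for_class(truth, predictions, "tiger"))` would give the four measures for the tiger class; repeating this for every label yields per-class values of the kind reported in the following subsections.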
1.1 Animal detection
Our method is able to detect animals with an accuracy of about . Although it is possible that other state-of-the-art detectors may outperform our method on this task alone, the true strength of our approach lies in its added abilities, such as classifying species and recognising individuals. In our experiments we observed that, for detecting the presence of animals in images, increasing the volume of the dataset (Figure 1a) indeed leads to improvement of all measures. It is reassuring to note that with only of the images in the dataset, all measures are already better than . It was also observed that training with fewer examples yields better results when tested on a large test set (Figure 1b). Finally, we observed that a split of the dataset between training and test yields the best results (Figure 1b). So we adopt this split for all other results in this paper. Details of the experimental setup used to obtain these results are explained in Section 2.2.
Figure 1: (a) volume of dataset; (b) proportion of training; (c) increasing training (fixed test).
1.2 Species identification
When our method was applied to classify images based on the species of the animal present, we observed that the resulting accuracy varies widely across species. Even for species with similar numbers of training examples, the rates of false positives and false negatives vary widely. While we observed low confusion for certain species (leopard, chital, dhole, tiger), our method was less effective in distinguishing animals which lack distinguishing patterns (elephant, sambar, muntjac, bear). Figure 2 shows the false positive and false negative rates across different species. The full confusion matrix is shown in Figure 4.
Figure 2: (a) false positive rate; (b) false negative rate.
1.3 Individual recognition
Our method is able to recognise individual animals from two species (leopard and tiger) with an accuracy of about . We performed several tests across different training sets containing different numbers of individuals as permitted by the data (up to 62 leopards and 32 tigers). Figure 3 shows plots of accuracy and sensitivity when our method was trained with a balanced dataset (similar numbers of different individuals) for leopard (a) and tiger (b).
Figure 3: (a) leopard; (b) tiger.
We use a corpus of camera-trap images that have been tagged by human experts – as belonging to a particular species and, where possible, an individual within the species – to train a machine learning algorithm to replicate the tags when presented with only the images. In this context, learning is defined [Mohri] as progressive improvement of performance on the specific classification task. In the rest of this section, we explain how the data needs to be pre-processed and organised, the specific machine learning tool that we found to be most effective for the task and details on how we performed individual recognition on photographs of tigers and leopards.
The dataset used in this work was collected by the Wildlife Conservation Society (WCS) India via efforts over decades to install more than 258 camera traps across the jungles of Southern India. The photographs used in this paper have a resolution of 2,048x1,536 pixels and those captured at night time are illuminated by a flash. These images were meticulously labelled by volunteers so that each image has a tag that identifies the species of animal present in it (if any). In addition, databases of specific individuals (identified by characteristic skin markings) were created for two species (tiger and leopard). Experts assigned unique identification tags (names) to tiger and leopard individuals using a software tool [Hiby] that performs pattern recognition on skin markings. Similar datasets have been used in previous studies[karanth1998, royle].
Our dataset consists of a total of 19512 images, of which 9070 contain animals from ten species of interest (for this work): bear (Melursus ursinus), chital (Axis axis), dhole (Cuon alpinus), elephant (Elephas maximus), gaur (Bos gaurus), leopard (Panthera pardus fusca), muntjac (Muntiacus), sambar (Rusa unicolor), tiger (Panthera tigris), and wild pig (Sus scrofa). Table 1 summarizes the distribution of images of the dataset among species. We labelled images that did not contain an animal from the above list as Unclassified and considered them as negative examples. Unclassified images are those taken by camera traps when triggered by humans who inhabit neighbouring villages, by vehicles of rangers, or by other animals which we do not consider for this study (e.g. dogs, hares and porcupines).
|Species||> images||to images||to images||< images||Total|
For animal detection, we observed better performance with a balanced dataset (9070 with animals and 9070 unclassified images). Finally, for training and evaluating our classifier that performs individual recognition, the dataset contains a total of tiger images with varying numbers of images of each of the individuals. Three of the individuals are represented by more than images each, twelve individuals are seen in more than images each and there are at least images for individuals; this distribution is summarized in Table 2. There are images of leopards, of which individuals are captured in more than 100 photographs and individuals are in at least images each, as also summarized in Table 2.
2.2 Animal detection
Detecting the presence of animals in images is a routine binary classification task commonly encountered in computer vision applications. We found that a widely used architecture called AlexNet, pre-trained on a large database of general images (ImageNet), is well suited to this task. We appended a Support Vector Machine (SVM) with a linear kernel to the output of AlexNet and trained this additional layer using our dataset. For this, we found that using an equal number of positive and negative examples (9070 images with animals and 9070 without) yielded the best results.
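To illustrate the idea of training only a linear SVM on top of fixed network outputs, the sketch below fits a linear classifier to feature vectors that are assumed to have been extracted already by a pre-trained network. The function names, the hyperparameters and the plain stochastic subgradient solver for the regularised hinge loss are assumptions made for illustration; the source does not state which SVM solver or settings were used.

```python
def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def train_linear_svm(features, labels, epochs=50, learning_rate=0.1, reg=0.01):
    """Fit weights w and bias b by subgradient descent on the hinge loss.

    features: list of equal-length feature vectors (assumed precomputed,
              e.g. activations of a pre-trained network).
    labels:   +1 for "animal present", -1 for "no animal".
    """
    w = [0.0] * len(features[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            margin = y * (_dot(w, x) + b)
            # The L2 regulariser shrinks the weights at every step.
            w = [wi * (1.0 - learning_rate * reg) for wi in w]
            if margin < 1.0:
                # Margin violated: move the hyperplane towards this example.
                w = [wi + learning_rate * y * xi for wi, xi in zip(w, x)]
                b += learning_rate * y
    return w, b


def predict(w, b, x):
    """Return +1 (animal) or -1 (no animal) for one feature vector."""
    return 1 if _dot(w, x) + b >= 0.0 else -1
```

Because the pre-trained layers stay fixed in this setup, only `w` and `b` are learned, which is one reason such a classifier can be trained with comparatively few labelled images.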
We observed that the classification accuracy was sensitive to whether or not the images were captured in daylight. To investigate this, we conducted three separate sets of training and validation experiments. In the first (diurnal), we used 3995 images of animals in daylight (and 3995 unclassified). The second set consisted of 1178 images of animals at night (and 1178 unclassified). Finally, we considered all images mixed (9070 with animals and 9070 without). For each set, we separated the images into training and validation with a split of . The accuracies were , and respectively. Although the lower accuracy for nighttime images could be attributed to using fewer images in training, it is unlikely to be the only cause since using a smaller set of diurnal images outperforms the larger set with nighttime images mixed in. These results are presented in Table 3.
|Sub-dataset||Num. Images||Training Acc||Test Acc|
We investigated the impact of the quantity of data on our automatic animal detector by repeating the experiment using four different fractions of the dataset. We randomly subsampled the dataset to , , and of its original size and in each case we repeated the experiment for 10 different random samples. For all experiments we maintained the split between training and validation. The results of these experiments, plotted in Figure 1(a), demonstrate that all metrics indeed improve when more data is used. However, even with only of the dataset, all metrics are above .
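The subsampling protocol described above can be sketched as follows. This is an assumption-laden illustration rather than the code used for the experiments: `fraction` and `train_ratio` are left as parameters because their values are not reproduced here, and `evaluate` is a hypothetical callback that trains on the first list and returns a score on the second.

```python
import random
from statistics import mean


def subsample_and_split(items, fraction, train_ratio, rng):
    """Draw a random subset of `items` and split it into train/validation."""
    subset_size = round(len(items) * fraction)
    subset = rng.sample(items, subset_size)  # sampling without replacement
    cut = round(subset_size * train_ratio)
    return subset[:cut], subset[cut:]


def repeated_experiment(items, fraction, train_ratio, evaluate,
                        repetitions=10, seed=0):
    """Average `evaluate(train, validation)` over independent random subsamples."""
    rng = random.Random(seed)  # fixed seed so the runs are reproducible
    scores = []
    for _ in range(repetitions):
        train, validation = subsample_and_split(items, fraction, train_ratio, rng)
        scores.append(evaluate(train, validation))
    return mean(scores)
```

Averaging over ten independent draws, as in the text, reduces the chance that a single lucky or unlucky subsample dominates the reported metric.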
To understand the importance of selecting the proportion of images to use in training, we performed two experiments. First, we fixed the number of validation images to (2721 images) and compared the results of training with of the training set (6349 images). For each measurement, we averaged ten random repetitions, as before. The result (Figure 1.b) shows that although the trend is increasing, the differences diminish faster than when the test and training set were increased in volume. This suggests that perhaps there are difficult cases in the validation set which are less detrimental to average results when the validation set is large. Finally, we tested the performance of our animal detector (Figure 1.c) on various fractions of training and validation sets and found that is a good choice.
2.3 Species identification
We tested two methods for identifying the species of animals in camera trap images. First, as with animal detection, we applied a pre-trained AlexNet in conjunction with an SVM layer that is trained on our dataset. The accuracy for all species except Chital and Muntjac is over using this solution. We experimented with first running an animal detector on the input images, followed by species identification. Indeed, this improved the accuracies for these two classes to and , but accuracies for a few other species (Sambar, Elephant and Wild Pig) dropped significantly. Our hypothesis is that there is insufficient data to effectively train a classifier for species identification without spatial information about where animals are in the images.
As a second method, we adapted a recent method developed for object detection given image-level labels [bilen]. This method, called Weakly Supervised Deep Detection Network (WSDDN), introduces a spatial pyramid pooling layer on top of AlexNet’s convolutional layers. The output of these layers is then used in parallel to perform recognition over multiple rectangular regions in the image and detection of the rectangular region that contains most of the salient information associated with the image-level label.
Originally, image-level classification scores are obtained by summing the region scores over all regions in each image. This approach of Bilen and Vedaldi tends to weigh one (or a few) strongly predicted ’tiger’ regions equally with several regions that weakly predict the presence of a ’tiger’. We replace this with the following approach. We identify the maximum score amongst all classes for each rectangular region and select the top 30 regions based on this maximum class score. Then, we average the score for each class over these 30 regions and pick the class with the highest mean (top-1 class) as the predicted species. Table 4 shows the performance obtained in species classification with the pre-trained AlexNet model and with the WSDDN model.
|Species||BC+AlexNet||AlexNet||WSDDN top-1||WSDDN top-5|
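The region-score aggregation described above (rank regions by their maximum class score, keep the top 30, average per class, pick the highest mean) can be sketched as below, next to the original summation for comparison. The data layout (one list of class scores per region) and the function names are assumptions for illustration and are not taken from the WSDDN implementation.

```python
def predict_by_sum(region_scores, class_names):
    """Original aggregation: sum each class score over all regions."""
    totals = [sum(region[c] for region in region_scores)
              for c in range(len(class_names))]
    best = max(range(len(class_names)), key=totals.__getitem__)
    return class_names[best]


def predict_by_top_regions(region_scores, class_names, top_k=30):
    """Modified aggregation: average class scores over the top_k regions.

    Regions are ranked by their maximum score over all classes, so that
    confidently classified regions dominate the image-level decision.
    """
    ranked = sorted(region_scores, key=max, reverse=True)
    selected = ranked[:top_k]  # fewer than top_k regions is handled naturally
    means = [sum(region[c] for region in selected) / len(selected)
             for c in range(len(class_names))]
    best = max(range(len(class_names)), key=means.__getitem__)
    return class_names[best], means
```

In a toy image with one confident tiger region and many regions weakly favouring leopard, the summation can be outvoted by the weak regions, whereas the top-region average with a small `top_k` follows the confident region. When an image has no more than `top_k` regions, the two rules select the same class, since the mean is then just the sum divided by the number of regions.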
Figure: (a) recognising individual tigers; (b) recognising individual leopards.
2.4 Individual recognition of tiger and leopard
We applied the same architecture (WSDDN) to recognise individual animals by treating each individual (rather than species) as a separate class. That is, the pre-trained network was fine-tuned using several different images of known individuals to learn a mapping from characteristic markings on the skin to the specific individuals with those markings. For this, we used the labelled individuals in the dataset. We tested the efficacy of individual recognition by training with different combinations: only tigers, only leopards, and leopards and tigers combined.
For the case where we trained using only one species (leopards or tigers), we investigated the impact of balancing the dataset so that each class (individual) was represented equally during training. In addition, motivated by the important role that segmentation plays in vision-based tasks [shukla], we explored the use of segmented animals to train our networks. We segmented images automatically using a unary classifier (our animal detector on patches) whose result is used to obtain a segmentation mask by considering pair-wise similarities between patches [krahenbuhl]. The numbers of true negatives, false negatives, false positives and true positives for the various combinations are shown in Figure 5.
To create a balanced dataset across all individuals, dozens of images were removed so that every individual was represented by the same number of images in the training data. Despite this reduction in total data, we found that balancing the training dataset significantly improves all metrics for tigers as well as leopards (Figure 5). More surprisingly, we observed that training using automatically segmented tigers did not improve the classifier.
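One simple way to obtain such a balanced training set is to downsample every individual to the size of the least-represented one, as in the sketch below. Whether the removed images were chosen at random, and whether the target count was the minimum over individuals, is not stated in the text; both are assumptions here, as are the function name and the `(image_id, individual_id)` input format.

```python
import random
from collections import defaultdict


def balance_individuals(labelled_images, seed=0):
    """Keep the same number of images for every individual.

    labelled_images: iterable of (image_id, individual_id) pairs.
    Returns a list of pairs in which each individual appears exactly as
    often as the least-represented individual in the input.
    """
    rng = random.Random(seed)
    by_individual = defaultdict(list)
    for image_id, individual in labelled_images:
        by_individual[individual].append(image_id)

    # Assumed target: the size of the smallest class.
    target = min(len(images) for images in by_individual.values())

    balanced = []
    for individual in sorted(by_individual):
        # Randomly discard the surplus images of this individual.
        for image_id in rng.sample(by_individual[individual], target):
            balanced.append((image_id, individual))
    return balanced
```

The trade-off noted in the text is visible here: every surplus image of a well-photographed individual is discarded, yet the classifier no longer sees some individuals far more often than others during training.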
We tested a jointly trained classifier that is able to recognize individual leopards and tigers. We chose the tiger individuals and leopard individuals for which we had more than images each and trained a classifier to recognize any of the individuals. We found the accuracy and specificity for all individuals to be high (). However, the sensitivity of the classifier varied widely across the individuals (see Figure 6). We attribute the differences to the varying quality of the images for different individuals.