The iWildCam 2020 Competition Dataset

by Sara Beery et al.

Camera traps enable the automatic collection of large quantities of image data. Biologists all over the world use camera traps to monitor animal populations. We have recently been making strides towards automatic species classification in camera trap images. However, as we try to expand the geographic scope of these models we are faced with an interesting question: how do we train models that perform well on new (unseen during training) camera trap locations? Can we leverage data from other modalities, such as citizen science data and remote sensing data? In order to tackle this problem, we have prepared a challenge where the training data and test data are from different cameras spread across the globe. For each camera, we provide a series of remote sensing imagery that is tied to the location of the camera. We also provide citizen science imagery from the set of species seen in our data. The challenge is to correctly classify species in the test camera traps.





1 Introduction

In order to understand the effects of pollution, exploitation, urbanization, global warming, and conservation policy on our planet’s biodiversity, we need access to accurate, consistent biodiversity measurements. Researchers often use camera traps – static, motion-triggered cameras placed in the wild – to study changes in species diversity, population density, and behavioral patterns. These cameras can take thousands of images per day, and the time it takes for human experts to identify species in the data is a major bottleneck. By automating this process, we can provide an important tool for scalable biodiversity assessment.

Camera trap images are taken automatically based on a triggered sensor, so there is no guarantee that the animal will be centered, focused, well-lit, or at an appropriate scale (they can be either very close or very far from the camera, each causing its own problems). See Fig. 2 for examples of these challenges. Further, up to 70% of the photos at any given location may be triggered by something other than an animal, such as wind in the trees. Automating camera trap labeling is not a new challenge for the computer vision community [23, 7, 26, 15, 10, 25, 22, 17, 5, 1, 6, 16, 18, 3, 19]. However, most of the proposed solutions have used the same camera locations for both training and testing the performance of an automated system. If we wish to build systems that are trained to detect and classify animals and then deployed to new locations without further training, we must measure the ability of machine learning and computer vision to generalize to new environments [4, 19]. This is central to the 2018 [4], 2019 [2], and 2020 iWildCam challenges.

Figure 1: The iWildCam 2020 dataset. This year’s dataset includes data from multiple modalities: camera traps, citizen scientists, and remote sensing. Here we can see an example of data from a camera trap paired with a visualization of the infrared channel of the paired remote sensing imagery.

The 2020 iWildCam challenge includes a new component: the use of multiple data modalities (see Fig. 1). An ecosystem can be monitored in a variety of ways (e.g. camera traps, citizen scientists, remote sensing) each of which has its own strengths and limitations. To facilitate the exploration of techniques for combining these complementary data streams, we provide a time series of remote sensing imagery for each camera trap location as well as curated subsets of the iNaturalist competition datasets matching the species seen in the camera trap data. It has been shown that species classification performance can be dramatically improved by using information beyond the image itself [14, 8, 6] so we expect that participants will find creative and effective uses for this data.

(1) Illumination (2) Blur (3) ROI Size (4) Occlusion (5) Camouflage (6) Perspective
Figure 2: Common data challenges in camera trap images. (1) Illumination: Animals are not always well-lit. (2) Motion blur: common with poor illumination at night. (3) Size of the region of interest (ROI): Animals can be small or far from the camera. (4) Occlusion: e.g. by bushes or rocks. (5) Camouflage: decreases saliency in animals’ natural habitat. (6) Perspective: Animals can be close to the camera, resulting in partial views of the body.

2 Data Preparation

The dataset consists of three primary components: (i) camera trap images, (ii) citizen science images, and (iii) multispectral imagery for each camera location.

2.1 Camera Trap Data

The camera trap data (along with expert annotations) is provided by the Wildlife Conservation Society (WCS) [24]. We split the data by camera location, so no images from the test cameras are included in the training set to avoid overfitting to one set of backgrounds [5].

The training set contains images from locations, and the test set contains images from locations. These locations are spread across 12 countries in different parts of the world. Each image is associated with a location ID so that images from the same location can be linked. As is typical for camera traps, approximately 50% of the total number of images are empty (this varies per location).
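As an aside for practitioners, a location-wise split of this kind can be sketched as follows. The record format with a `location` field is an assumption for illustration (the released images do carry a location ID):

```python
import random

def split_by_location(images, test_fraction=0.2, seed=0):
    """Split image records into train/test by camera location,
    so that no location contributes images to both sides."""
    locations = sorted({img["location"] for img in images})
    rng = random.Random(seed)
    rng.shuffle(locations)
    n_test = max(1, int(len(locations) * test_fraction))
    test_locs = set(locations[:n_test])
    train = [img for img in images if img["location"] not in test_locs]
    test = [img for img in images if img["location"] in test_locs]
    return train, test

# Toy records: 100 images spread over 5 locations.
images = [{"id": i, "location": i % 5} for i in range(100)]
train, test = split_by_location(images)
# No location appears on both sides of the split.
assert {i["location"] for i in train}.isdisjoint({i["location"] for i in test})
```

Splitting on location rather than on individual images is what prevents a model from scoring well simply by memorizing static backgrounds.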

There are 276 species represented in the camera trap images. The class distribution is long-tailed, as shown in Fig. 3. Since we have split the data by location, some classes appear only in the training set. Any images with classes that appeared only in the test set were removed.
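The final filtering step described above (removing any class that appears only in the test split) can be sketched as follows; the `(image_id, class_label)` pair representation is an assumption for illustration:

```python
def drop_test_only_classes(train, test):
    """Keep only test examples whose class also appears in training.
    Examples are (image_id, class_label) pairs (an assumed format)."""
    train_classes = {label for _, label in train}
    return [ex for ex in test if ex[1] in train_classes]

train = [("a", "rat"), ("b", "civet")]
test = [("c", "rat"), ("d", "weasel")]  # "weasel" never seen in training
filtered = drop_test_only_classes(train, test)
assert filtered == [("c", "rat")]
```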

Figure 3: Camera trap class distribution. Per-class distribution of the camera trap data, which exhibits a long tail. We show examples of both a common class (the African giant pouched rat) and a rare class (the Indonesian mountain weasel). Within the plot we show images of each species, centered and focused, from iNaturalist. On the right we show images of each species within the frame of a camera trap, from WCS.

2.2 iNaturalist Data

iNaturalist is an online community where citizen scientists post photos of plants and animals and collaboratively identify the species [13]. To facilitate the use of iNaturalist data, we provide a mapping from our classes into the iNaturalist taxonomy. (Note that for the purposes of the competition, competitors may only use iNaturalist data from the iNaturalist competition datasets.) We also provide the subsets of the iNaturalist 2017-2019 competition datasets [21] that correspond to species seen in the camera trap data. This data provides additional images for training, covering classes.

Though small relative to the camera trap data, the iNaturalist data has some unique characteristics. First, the class distribution is completely different (though it is still long tailed). Second, iNaturalist images are typically higher quality than the corresponding camera trap images, providing valuable examples for hard classes. See Fig. 4 for a comparison between iNaturalist images and camera trap images.

(1) Class ID 101 (2) Class ID 563 (3) Class ID 154
Figure 4: Camera trap data (left) vs iNaturalist data (right). (1) Animal is large, so camera trap image does not fully capture it. (2) Animal is small, so it makes up a small part of the camera trap images. (3) Quality is equivalent, although iNaturalist images have more camera pose and animal pose variation.

2.3 Remote Sensing Data

For each camera location we provide multispectral imagery collected by the Landsat 8 satellite [20]. All data comes from the Landsat 8 Tier 1 Surface Reflectance dataset [11] provided by Google Earth Engine [12]. This data has been atmospherically corrected and meets certain radiometric and geometric quality standards.

Data collection. The precise location of a camera trap is generally considered to be sensitive information, so we first obfuscate the coordinates of the camera. For each time point when imagery is available (the Landsat 8 satellite images the Earth once every 16 days), we extract a square patch centered at the obfuscated coordinates consisting of 9 bands of multispectral imagery and 2 bands of per-pixel metadata. Each patch covers an area of 6 km × 6 km. Since one Landsat 8 pixel covers an area of 30 m × 30 m, each patch is 200 × 200 pixels. Note that the bit depth of Landsat 8 data is 16.
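The patch geometry follows directly from the footprint and the sensor resolution (6000 m / 30 m per pixel = 200 pixels per side, with 9 spectral bands plus 2 QA bands stacked):

```python
def patch_shape(extent_m=6000, resolution_m=30, n_spectral=9, n_qa=2):
    """Array shape of one patch: (bands, height, width).
    6 km footprint at 30 m/pixel gives a 200 x 200 pixel patch."""
    side = extent_m // resolution_m
    return (n_spectral + n_qa, side, side)

assert patch_shape() == (11, 200, 200)
```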

The multispectral imagery consists of 9 different bands, ordered by descending frequency / ascending wavelength. Band 1 is ultra-blue. Bands 2, 3, and 4 are traditional blue, green, and red. Bands 5-9 are infrared. Note that bands 8 and 9 are from a different sensor than bands 1-7 and have been upsampled from 100m/pixel to 30m/pixel. Refer to [11] or [20] for more details.
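Given this band ordering, a viewable true-color image can be composed from the red, green, and blue bands. The scale factor below is an assumption (surface-reflectance products commonly encode reflectance as integers, often in roughly the 0-10000 range); check the dataset documentation before relying on it:

```python
import numpy as np

def rgb_composite(patch, scale=10000.0):
    """Build an H x W x 3 float RGB image from a (bands, H, W) patch.
    Bands are 1-indexed in the text (red=4, green=3, blue=2),
    so the array indices are 3, 2, 1."""
    rgb = patch[[3, 2, 1]].astype(np.float32) / scale
    return np.clip(rgb, 0.0, 1.0).transpose(1, 2, 0)

# Synthetic 16-bit patch: 9 spectral + 2 QA bands, 200 x 200 pixels.
patch = np.random.randint(0, 10000, size=(11, 200, 200), dtype=np.uint16)
img = rgb_composite(patch)
assert img.shape == (200, 200, 3)
```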

Each patch of imagery has two corresponding quality assessment (QA) bands which carry per-pixel metadata. The first QA band (pixelqa) contains automatically generated labels for classes like clear, water, cloud, or cloud shadow which can help to interpret the pixel values. The second QA band (radsatqa) labels the pixels in each band for which the sensor was saturated. Cloud cover and saturated pixels are common issues in remote sensing data, and the QA bands may provide some assistance. However, they are automatically generated and cannot be trusted completely. See [11] for more details.
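Masking out unusable pixels with the pixelqa band might look like the sketch below. The bit positions are an assumption based on the Collection 1 CFMask convention (bit 1 = clear, bit 3 = cloud shadow, bit 5 = cloud); verify them against the product documentation before use:

```python
import numpy as np

# Assumed CFMask-style bit positions in the pixelqa band (hypothetical
# here; confirm against the Landsat 8 Surface Reflectance docs).
CLEAR_BIT, SHADOW_BIT, CLOUD_BIT = 1, 3, 5

def usable_mask(pixelqa):
    """Boolean mask of pixels flagged clear and free of cloud
    or cloud shadow."""
    qa = pixelqa.astype(np.uint16)
    clear = (qa >> CLEAR_BIT) & 1
    shadow = (qa >> SHADOW_BIT) & 1
    cloud = (qa >> CLOUD_BIT) & 1
    return (clear == 1) & (shadow == 0) & (cloud == 0)

qa = np.array([[0b000010, 0b100000],   # clear | cloud
               [0b001010, 0b000000]],  # clear+shadow | fill
              dtype=np.uint16)
mask = usable_mask(qa)  # only the clear pixel survives
```

As the text notes, these labels are automatically generated, so a mask like this is a filter of convenience rather than ground truth.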

3 Baseline Results

We trained a basic image classifier as a baseline for comparison. The model is a randomly initialized Inception-v3 with input size , which was trained using only camera trap images. During training, images were randomly cropped and perturbed in brightness, saturation, hue, and contrast. We used the RMSProp optimizer with an initial learning rate of 0.0045 and a decay factor of 0.94.
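The decaying learning-rate schedule amounts to multiplying the initial rate by the decay factor at a fixed interval. The interval (`decay_steps`) is an assumption; the text gives only the initial rate and the factor:

```python
def learning_rate(step, initial_lr=0.0045, decay=0.94, decay_steps=1):
    """Exponentially decayed learning rate:
    lr = initial_lr * decay ** (step // decay_steps)."""
    return initial_lr * decay ** (step // decay_steps)

assert learning_rate(0) == 0.0045
assert abs(learning_rate(2) - 0.0045 * 0.94 ** 2) < 1e-12
```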

Let $C$ be the number of classes. We trained using the class-balanced loss from [9], given by

$$\mathrm{CB}(\mathbf{p}, y) = \frac{1-\beta}{1-\beta^{n_y}} \, L(\mathbf{p}, y),$$

where $\mathbf{p}$ is the vector of predicted class probabilities (after softmax), $y$ is the ground-truth class, $L$ is the categorical cross-entropy loss, $n_y$ is the number of samples for class $y$, and $\beta$ is a hyperparameter which we set to 0.9.
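The class-balanced weighting can be sketched directly from this definition; the function names are illustrative:

```python
import math

def class_balanced_weight(n_y, beta=0.9):
    """Per-class weight (1 - beta) / (1 - beta**n_y) from Cui et al. [9]:
    rare classes (small n_y) receive larger weights."""
    return (1.0 - beta) / (1.0 - beta ** n_y)

def cb_cross_entropy(probs, y, class_counts, beta=0.9):
    """Class-balanced categorical cross-entropy for one example:
    the usual -log p_y scaled by the effective-number weight for class y."""
    return class_balanced_weight(class_counts[y], beta) * -math.log(probs[y])

# A rare class (5 samples) gets a larger weight than a common one (1000).
counts = [1000, 5]
w_common = class_balanced_weight(counts[0])
w_rare = class_balanced_weight(counts[1])
assert w_rare > w_common
```

Note that as $n_y \to \infty$ the weight approaches $1-\beta$, while for $n_y = 1$ it equals 1, which is what damps the dominance of head classes in a long-tailed distribution.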

This baseline achieved a macro-averaged F1 score of and an accuracy of on the iWildCam 2020 test set.
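Macro-averaged F1 weights every class equally regardless of frequency, which makes it a natural metric for long-tailed data. A minimal reference implementation:

```python
def macro_f1(y_true, y_pred, classes):
    """Macro-averaged F1: compute F1 per class, then take the
    unweighted mean, so rare classes count as much as common ones."""
    f1s = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * precision * recall / (precision + recall)
                   if precision + recall else 0.0)
    return sum(f1s) / len(f1s)

score = macro_f1([0, 0, 1, 1], [0, 1, 1, 1], [0, 1])
assert abs(score - 11 / 15) < 1e-9  # (2/3 + 4/5) / 2
```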

4 Conclusion

The iWildCam 2020 dataset provides a test bed for studying generalization to new locations at a larger geographic scale than previous iWildCam competitions [4, 2]. In addition, it facilitates exploration of multimodal approaches to camera trap image classification and pairs remote sensing imagery with camera trap imagery for the first time.

In subsequent years, we plan to extend the iWildCam challenge by adding additional data streams and tasks, such as detection and segmentation. We hope to use the knowledge we gain throughout these challenges to facilitate the development of systems that can accurately provide real-time species ID and counts in camera trap images at a global scale. Any forward progress made will have a direct impact on the scalability of biodiversity research geographically, temporally, and taxonomically.

5 Acknowledgements

We would like to thank Dan Morris and Siyu Yang (Microsoft AI for Earth) for their help curating the dataset, providing bounding boxes from the MegaDetector, and hosting the data on Azure. We also thank the Wildlife Conservation Society for providing the camera trap data and annotations. We thank Kaggle for supporting the iWildCam competition for the past three years. Thanks also to the FGVC Workshop, Visipedia, and our advisor Pietro Perona for continued support. This work was supported in part by NSF GRFP Grant No. 1745301. The views are those of the authors and do not necessarily reflect the views of the NSF.


  • [1] S. Beery, Y. Liu, D. Morris, J. Piavis, A. Kapoor, N. Joshi, M. Meister, and P. Perona (2020) Synthetic examples improve generalization for rare classes. In The IEEE Winter Conference on Applications of Computer Vision, pp. 863–873. Cited by: §1.
  • [2] S. Beery, D. Morris, and P. Perona (2019) The iwildcam 2019 challenge dataset. ArXiv abs/1907.07617. Cited by: §1, §4.
  • [3] S. Beery, D. Morris, and S. Yang (2019) Efficient pipeline for camera trap image review. arXiv preprint arXiv:1907.06772. Cited by: §1.
  • [4] S. Beery, G. van Horn, O. MacAodha, and P. Perona (2019) The iwildcam 2018 challenge dataset. arXiv preprint arXiv:1904.05986. Cited by: §1, §4.
  • [5] S. Beery, G. Van Horn, and P. Perona (2018) Recognition in terra incognita. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 456–473. Cited by: §1, §2.1.
  • [6] S. Beery, G. Wu, V. Rathod, R. Votel, and J. Huang (2020) Context r-cnn: long term temporal context for per-camera object detection. arXiv preprint arXiv:1912.03538. Cited by: §1, §1.
  • [7] G. Chen, T. X. Han, Z. He, R. Kays, and T. Forrester (2014) Deep convolutional neural network based species recognition for wild animal monitoring. In Image Processing (ICIP), 2014 IEEE International Conference on, pp. 858–862. Cited by: §1.
  • [8] G. Chu, B. Potetz, W. Wang, A. Howard, Y. Song, F. Brucher, T. Leung, and H. Adam (2019) Geo-aware networks for fine-grained recognition. ICCV Workshop on Computer Vision for Wildlife Conservation. Cited by: §1.
  • [9] Y. Cui, M. Jia, T. Lin, Y. Song, and S. J. Belongie (2019) Class-balanced loss based on effective number of samples. CoRR abs/1901.05555. External Links: Link, 1901.05555 Cited by: §3.
  • [10] J. Giraldo-Zuluaga, A. Salazar, A. Gomez, and A. Diaz-Pulido (2017) Camera-trap images segmentation using multi-layer robust principal component analysis. The Visual Computer, pp. 1–13. Cited by: §1.
  • [11] Google Earth Engine USGS Landsat 8 Surface Reflectance Tier 1. Cited by: §2.3.
  • [12] N. Gorelick, M. Hancher, M. Dixon, S. Ilyushchenko, D. Thau, and R. Moore (2017) Google earth engine: planetary-scale geospatial analysis for everyone. Remote Sensing of Environment. External Links: Document, Link Cited by: §2.3.
  • [13] iNaturalist. Cited by: §2.2.
  • [14] O. Mac Aodha, E. Cole, and P. Perona (2019) Presence-only geographical priors for fine-grained image classification. ICCV. Cited by: §1.
  • [15] A. Miguel, S. Beery, E. Flores, L. Klemesrud, and R. Bayrakcismith (2016) Finding areas of motion in camera trap images. In Image Processing (ICIP), 2016 IEEE International Conference on, pp. 1334–1338. Cited by: §1.
  • [16] M. S. Norouzzadeh, D. Morris, S. Beery, N. Joshi, N. Jojic, and J. Clune (2019) A deep active learning system for species identification and counting in camera trap images. arXiv preprint arXiv:1910.09716. Cited by: §1.
  • [17] M. S. Norouzzadeh, A. Nguyen, M. Kosmala, A. Swanson, C. Packer, and J. Clune (2017) Automatically identifying wild animals in camera trap images with deep learning. arXiv preprint arXiv:1703.05830. Cited by: §1.
  • [18] S. Schneider, G. W. Taylor, and S. Kremer (2018) Deep learning object detection methods for ecological camera trap data. In 2018 15th Conference on Computer and Robot Vision (CRV), pp. 321–328. Cited by: §1.
  • [19] M. A. Tabak, M. S. Norouzzadeh, D. W. Wolfson, E. J. Newton, R. K. Boughton, J. S. Ivan, E. A. Odell, E. S. Newkirk, R. Y. Conrey, J. L. Stenglein, et al. (2020) Improving the accessibility and transferability of machine learning algorithms for identification of animals in camera trap images: mlwic2. bioRxiv. Cited by: §1.
  • [20] U.S. Geological Survey Landsat 8 Imagery. Cited by: §2.3.
  • [21] G. Van Horn, O. Mac Aodha, Y. Song, Y. Cui, C. Sun, A. Shepard, H. Adam, P. Perona, and S. Belongie (2018) The inaturalist species classification and detection dataset. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 8769–8778. Cited by: §2.2.
  • [22] A. G. Villa, A. Salazar, and F. Vargas (2017) Towards automatic wild animal monitoring: identification of animal species in camera-trap images using very deep convolutional neural networks. Ecological Informatics 41, pp. 24–32. Cited by: §1.
  • [23] M. J. Wilber, W. J. Scheirer, P. Leitner, B. Heflin, J. Zott, D. Reinke, D. K. Delaney, and T. E. Boult (2013) Animal recognition in the mojave desert: vision tools for field biologists. In Applications of Computer Vision (WACV), 2013 IEEE Workshop on, pp. 206–213. Cited by: §1.
  • [24] Wildlife Conservation Society Camera Traps Dataset. Cited by: §2.1.
  • [25] H. Yousif, J. Yuan, R. Kays, and Z. He (2017) Fast human-animal detection from highly cluttered camera-trap images using joint background modeling and deep learning classification. In Circuits and Systems (ISCAS), 2017 IEEE International Symposium on, pp. 1–4. Cited by: §1.
  • [26] Z. Zhang, Z. He, G. Cao, and W. Cao (2016) Animal detection from highly cluttered natural scenes using spatiotemporal object region proposals and patch verification. IEEE Transactions on Multimedia 18 (10), pp. 2079–2092. Cited by: §1.