1 Introduction
A sufficient amount of labelled data is critical for machine-learning based models, and a lack of training data often forms the bottleneck in the development of new algorithms. This problem is magnified in the area of digital agriculture, as the objects of interest – plants – vary widely in appearance depending on their growing stage, specific cultivar, and health. Plants also change in appearance in response to outside factors such as drought, time of day, temperature, humidity, and available sunlight. Furthermore, the correct classification of plants requires expert knowledge, which cannot easily be crowdsourced. All of this makes the labelling of plant data significantly harder than similar image-labelling tasks. Yet, as we witness the introduction of sensors [28, 30, 1, 18], robotics [21, 5, 6, 11, 25, 23], and machine learning [20, 29, 19, 22, 17, 16] to agricultural applications, there is a strong demand for such training data. This research area, running under the names of precision agriculture, digital agriculture, smart farming, or Agriculture 4.0, has the potential to increase yields while reducing the usage of resources such as water, fertilizer, and herbicides [16, 8, 3, 9, 4, 12, 26, 24, 14, 15, 27, 2, 10]. This next revolution in agriculture is fuelled by data, in particular labelled image data with rich metadata.
Table 1: Overview of the datasets.

| Dataset | Type | Image Count |
|---|---|---|
| Lab-data | single plant images | >1,200,000 |
| Subsample | single plant images | 14,000 |
| Field-data | video frames | 542,177 |
In this paper we describe two datasets, each consisting of hundreds of thousands of images, suitable for machine-learning and computer vision applications. The first dataset, the lab-data, consists of indoor-grown crops and weeds imaged from a wide variety of angles. The selected species are common on farmlands in the Canadian prairies and many US states. This dataset consists of images of individual plants, images showing several plants with and without bounding boxes, and metadata for each image and plant. Since April 2020, more than 1.2 million images have been added to this dataset at the time of writing. All of these images were captured and automatically labelled by a robotic system as in [7]. The second dataset, the field-data, consists of images taken in the field during the growing seasons of 2019 and 2020. These images show a top-down perspective of the crops (and weeds) as they grow in cultivated farmland.
We present here these two datasets with respect to some of their key metrics, such as the number of images per species. Further, we provide a sample from the lab-data consisting of 14,000 images. This sample (around 1% of the total data) is intended to give researchers an overview of how the data is structured with respect to available plant ages and growing stages, as well as the number of individual plants imaged by the system. For this, we carefully selected images from the entire dataset such that the age distribution per species is preserved and a wide range of individual plants is represented. Following best practices for data accessibility, as outlined in [20], we give immediate open access to this data sample, provide the metadata, and give insights on how the data was collected. In terms of long-term data storage and accessibility, both full datasets will eventually be fully open-source following a data management plan of stagewise release. Our goal is to provide researchers in digital agriculture with labelled data to facilitate data-driven innovation in the field.
The rest of this paper is structured as follows: In Chapter 2 we describe the general structure of both datasets and their metadata. In Chapter 3 we visualize and describe key characteristics of the datasets, such as the number of individual plants per species. In Chapter 4 we describe the structure of the sample. We conclude the paper in Chapter 5 with information on the planned growth of the datasets and how to obtain the sample and the original dataset.
2 Data- and Metadata-Structure
The lab-data can be divided into four different kinds of files that relate to each other as follows:
- Plain images: These are the images as captured by the camera. They typically show several plants in the same image.
- Bounding box images: These images are the same as the plain images, with the difference that they are overlaid with visible bounding boxes around the plants as calculated by the system. Plants too close to the border of the image or overlapping too much with each other are not bounded by the system.
- Single plant images: These are images cropped out of the plain images according to the calculated bounding boxes. Only plants for which a bounding box has been drawn are cropped out as individual images.
- JSON files: These files contain the metadata associated with each plain image and are described in more detail below.
See Figure 1 for examples of a plain image, the respective image with bounding boxes, and cropped out single plant images.
[Figure 1: A plain image, the same image overlaid with bounding boxes, and the cropped out single plant images.]

Each JSON file contains information about the plain image and the respective single plant images as follows (a short usage sketch is given after the list):
- version: An internally used version number.
- file_name: The filename of the original image (the plain image).
- bb_file_name: The filename of the bounding box image.
- singles, bb_images, source_file_path: These fields are internally used filepath information for retrieving data from object storage.
- date: A string containing the day of imaging in the format YYYY-(M)M-(D)D. The string does not contain leading zeroes for months or days.
- time: A string containing the time of imaging in the format (h)h:(m)m:(s)s. The string does not contain leading zeroes for hours, minutes, or seconds. The timezone is CST (Central Standard Time; UTC-06:00) or CDT (Central Daylight Time; UTC-05:00), depending on the time of year.
- room, institute, camera, lens: These fields encode the location and the camera equipment used.
- vertical_res, horizontal_res: The resolution in pixels along the vertical and horizontal image axis, respectively.
- camera_pose: This field contains the following subfields: x, y, z, polar_angle, azimuthal_angle. The first three subfields describe the camera position in cm with respect to an origin point inside the imageable volume of the system. The latter two subfields describe the camera's pan (polar_angle) and tilt (azimuthal_angle). See Figure 3 for details.
- bounding_boxes: This field contains a list of elements, each corresponding to one plant around which a bounding box was drawn by the system (see above for the description of bounding box images and single plant images). Each element in this list contains the same subfields as follows:
    - plant_id: Each plant is identified through an individual identifier. This allows differentiating individual plants of the same breed and species.
    - label: A common name label attached to the plant, such as "Canola".
    - scientific_name: The scientific name of the plant, such as Brassica napus.
    - subimage_file_name: The filename of the respective single plant image. This field is used to associate the single plant images with the plain image from which they were cropped, or with the respective JSON file (by replacing the file extension .jpg with .json).
    - date_planted: The day on which the plant was planted. This information is used to determine a plant's age, which in our case is defined as the number of days that have elapsed between planting and imaging.
    - position_id: An internally used identifier for the spatial position at which the plant was located when imaged.
    - x_min, y_min, x_max, y_max: The coordinates of the upper-left and lower-right corners of the calculated bounding box, respectively.
[Figure 3: The camera pose: position (x, y, z) in cm and the camera's pan (polar_angle) and tilt (azimuthal_angle).]
The field-data collection is similar in structure to the above. However, as there are no labels attached to individual plants, we only use the following fields: version, file_name, date, time, room, institute, camera, lens, label, vertical_res, horizontal_res, source_file_path. Note that the label field refers to the entire image, whereas for lab-data it was associated with a cropped out single plant image. The entry under label thus describes the type of crop that is cultivated on the farmland on which the image was taken. Indeed, as imaging also took place before the application of herbicides, weeds can be seen between the dominant crops in some images. Other fields in the JSON files have the same interpretation and usage as above. An example of a field-data image is given in Figure 2.
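As a further illustration, per-label image counts over the field-data (as later summarized in Table 5) could be tallied directly from these JSON files; the directory name field_data_json/ below is a hypothetical placeholder.

```python
# Minimal sketch: counting field-data images per crop label by scanning
# the JSON metadata. "field_data_json/" is a hypothetical placeholder.
import json
from collections import Counter
from pathlib import Path

counts = Counter()
for path in Path("field_data_json/").glob("*.json"):
    meta = json.loads(path.read_text())
    counts[meta["label"]] += 1  # label describes the cultivated crop for the whole image

for label, n in counts.most_common():
    print(label, n)
```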
3 Dataset Characteristics
We now describe the lab-data with respect to different metrics, followed by metrics on the field-data.
[Figure 4: Number of images taken by the system since April 2020; the third panel shows the imaged plants' age distribution (7-day bins).]
Figure 4 shows the number of images taken by the system starting from April 2020 (imaging before April 2020 also took place and is available on request; it is not included in this dataset because we were still experimenting with the imaging system itself). As of this writing we have taken more than 446,000 plain images (that is, images that contain multiple plants) and cropped out over 1.2 million single plant images from these. The system is run several times per week, producing thousands of new plain images. The data acquisition rate drops significantly after October 2020, as access was restricted due to the COVID pandemic. We anticipate the acquisition rate to rise as accessibility to our facilities improves.
The third panel in Figure 4 shows the imaged plants' age distribution. We define age as the number of days elapsed between seeding the individual plant and the time the image was taken. The histogram uses a binning size of 7 days. The majority of plants were imaged when they were between 7 and 35 days old. This corresponds to the growth stage on farmland at which it is critical to distinguish between weeds and crops. Thus, our emphasis on plants of this age matches the data needed for important applications in digital agriculture, such as estimating the germination rate of crops, quantifying weeds, and automated weeding. Under our definition of age, the germination time itself influences the age value in our metadata. Since germination times vary between species, so do their age distributions. For example, the age distribution of weeds is generally shifted towards "older" plants by one or more weeks compared to crops. Indeed, in our efforts to grow both of them, we have found that weeds require a longer time and more care to germinate. Most plants are only imaged up to the point at which they "outgrow" the system, i.e., where the plants' size and shape lead to overlaps and inaccuracies when calculating bounding boxes.
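A small sketch of how the 7-day binned histogram could be reproduced is given below; the ages array holds synthetic placeholder values standing in for the ages derived from the date and date_planted metadata fields.

```python
# Minimal sketch: a 7-day binned age histogram as in Figure 4.
# The ages array is a synthetic placeholder; in practice it would be
# gathered from the "date" and "date_planted" metadata fields.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
ages = rng.integers(0, 70, size=1000)   # placeholder ages in days

bins = np.arange(0, ages.max() + 8, 7)  # bin edges every 7 days
counts, edges = np.histogram(ages, bins=bins)

plt.bar(edges[:-1], counts, width=7, align="edge", edgecolor="black")
plt.xlabel("plant age (days)")
plt.ylabel("number of single plant images")
plt.show()
```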
Table 2 lists how many single plant images per species are in the dataset. The number varies strongly by species due to the availability of seeds, germination success, and access to our facilities.
In Table 3 we list how many individual plants we have imaged per species. Again, the numbers vary due to the availability and germination success of seeds and access to our facilities.
Table 2: Number of single plant images per species in the lab-data.

| Common Name | Scientific Name | Single Plant Images |
|---|---|---|
| Barley | Hordeum vulgare | 30597 |
| Barnyard Grass | Echinochloa crus-galli | 76258 |
| Common Bean | Phaseolus vulgaris | 159217 |
| Canada Thistle | Cirsium arvense | 89731 |
| Canola | Brassica napus | 255004 |
| Dandelion | Taraxacum officinale | 87426 |
| Field Pea | Pisum sativum | 68658 |
| Oat | Avena sativa | 59153 |
| Smartweed | Persicaria spp. | 99650 |
| Soybean | Glycine max | 203980 |
| Wheat | Triticum aestivum | 120417 |
| Wild Buckwheat | Fallopia convolvulus | 24973 |
| Wild Oat | Avena fatua | 7065 |
| Yellow Foxtail | Setaria pumila | 14815 |
Table 3: Number of individual plants imaged per species.

| Common Name | Scientific Name | Individual Plants |
|---|---|---|
| Barley | Hordeum vulgare | 14 |
| Barnyard Grass | Echinochloa crus-galli | 21 |
| Common Bean | Phaseolus vulgaris | 53 |
| Canada Thistle | Cirsium arvense | 14 |
| Canola | Brassica napus | 128 |
| Dandelion | Taraxacum officinale | 16 |
| Field Pea | Pisum sativum | 24 |
| Oat | Avena sativa | 28 |
| Smartweed | Persicaria spp. | 21 |
| Soybean | Glycine max | 84 |
| Wheat | Triticum aestivum | 47 |
| Wild Buckwheat | Fallopia convolvulus | 15 |
| Wild Oat | Avena fatua | 3 |
| Yellow Foxtail | Setaria pumila | 5 |
We now give a short description of the field-data. The field-data was collected by imaging the field via a stereoscopic camera mounted on a tractor. The camera, pointed straight down, records a video as the tractor drives through the field. We chose one of the two video channels to extract frames as images. These images form the field-data collection. The number of images extracted is chosen such that consecutive images show some overlap with respect to the area imaged. We also provide the video data itself, so that users can extract images under their own timing conditions or work on the video directly. Table 4 and Table 5 give a breakdown of the number of images extracted from the videos per month and per species, respectively. Field-data from the 2020 growing season is further accompanied by metadata about temperature, wind speed, cloud coverage, and camera height above ground.
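For users who prefer their own timing conditions, a minimal frame-extraction sketch might look as follows; the video file name and stride are assumptions, with the stride tuned in practice so that consecutive frames overlap on the ground.

```python
# Minimal sketch: extracting frames from a field-data video at a fixed
# stride. The file name and stride value are hypothetical placeholders.
import cv2

video = cv2.VideoCapture("field_video_left_channel.mp4")
stride = 30                       # keep every 30th frame (assumption)
frame_idx, kept = 0, 0
while True:
    ok, frame = video.read()     # returns False when the video ends
    if not ok:
        break
    if frame_idx % stride == 0:
        cv2.imwrite(f"frame_{kept:06d}.jpg", frame)
        kept += 1
    frame_idx += 1
video.release()
print(f"extracted {kept} frames")
```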
Table 4: Number of field-data images extracted per month.

| Year | Month | Image Count |
|---|---|---|
| 2019 | June | 45954 |
| 2019 | July | 84033 |
| 2020 | May | 14084 |
| 2020 | June | 197980 |
| 2020 | July | 167896 |
| 2020 | August | 32230 |
Table 5: Number of field-data images per species.

| Common Name | Scientific Name | Image Count |
|---|---|---|
| Canola | Brassica napus | 137551 |
| Faba bean | Vicia faba | 24147 |
| Oat | Avena sativa | 18578 |
| Soybean | Glycine max | 342780 |
| Wheat | Triticum aestivum | 19121 |
4 Description of the subsample
To create a visual overview of the lab-data, we created a subsample of it that is structured as follows: For each species listed in Table 2 we selected 1,000 single plant images; thus the subsample contains 14,000 images. Furthermore, within each of these categories we selected images such that the age distribution of the 1,000 images closely matches the age distribution of all available images for that species. In addition, we selected images such that all individual plants grown are represented in the subsample, with the following exceptions: there are 51 individual Common Bean plants present in the subsample (instead of 53 in the entire dataset), as well as 113 Canola plants (of 128), 51 Soybean plants (of 84), and 37 Wheat plants (of 47). The distribution of the image dimensions (width, height) in the subsample resembles the size distribution of the entire dataset; we did, however, not select images to directly optimize the sample under that criterion.
The total size on disk of the subsample is approximately 2.2 GB. The subsample contains only single plant images, which are organized in one subfolder per species. We consider this subsample a good entry point into the entirety of the dataset, which can be used to train initial models, for example, simple models that differentiate between species or classes of species (e.g., monocots versus dicots, or crops versus weeds). A minimal training sketch is given below.
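The following is a minimal sketch of such an initial model, assuming the subsample has been unpacked into a folder subsample/ with one subfolder per species; the architecture and hyperparameters are illustrative choices, not recommendations from the dataset authors.

```python
# Minimal sketch: training a species classifier on the subsample.
# "subsample/" is an assumed folder layout (one subfolder per species);
# ImageFolder infers the 14 class labels from the subfolder names.
import torch
from torch import nn
from torchvision import datasets, models, transforms

tf = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])
data = datasets.ImageFolder("subsample/", transform=tf)
loader = torch.utils.data.DataLoader(data, batch_size=32, shuffle=True)

# Small off-the-shelf backbone, trained from scratch (illustrative choice)
model = models.resnet18(weights=None)
model.fc = nn.Linear(model.fc.in_features, len(data.classes))
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

model.train()
for images, labels in loader:    # one epoch shown for brevity
    opt.zero_grad()
    loss = loss_fn(model(images), labels)
    loss.backward()
    opt.step()
```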
5 Conclusion and data availability
In this paper we presented an extensive dataset of labelled plant images. These images show crops and weeds common in the Canadian prairies and northern US states. After describing the data structure, we presented a subsample that mirrors the full dataset in key characteristics but is smaller in overall size and thus more tractable. We are actively growing the dataset along several dimensions: new field- and lab-data is being acquired and processed as of writing. Furthermore, additional data sources, such as the generation of 3D point clouds and hyperspectral scans, are being tested and developed. Additional field-data sources are also being explored, including imagery from UAVs and a semi-autonomous rover. Data from these sources will accompany the datasets presented in this paper in the near future.
The 14,000-image sample is available at https://doi.org/10.25739/rwcw-ex45 on the CyVerse Data Store, a portal for full data lifecycle management. The full dataset, which contains 1.2 million single plant images (and counting), is made available to researchers and industry through the data portal hosted by EMILI at http://emilicanada.com/ (Digital Agriculture Asset Map). The authors take Lobet's general critique [20] of data-driven research in digital agriculture (or any research field) seriously. We further created a datasheet following the guidelines of Gebru et al. [13].
References
- [1] (2018) Nanostructured (bio)sensors for smart agriculture. TrAC Trends in Analytical Chemistry 98, pp. 95–103.
- [2] (2019) The digitisation of agriculture: a survey of research activities on smart farming. Array 3-4, pp. 100009.
- [3] (2018) Deep learning with unsupervised data labeling for weed detection in line crops in UAV images. Remote Sensing 10 (11).
- [4] (2013) Digital image processing techniques for detecting, quantifying and classifying plant diseases. SpringerPlus 2, pp. 660.
- [5] (2017) Agricultural robots for field operations. Part 2: operations and systems. Biosystems Engineering 153, pp. 110–128.
- [6] (2017) Agricultural robots for field operations. Part 2: operations and systems. Biosystems Engineering 153, pp. 110–128.
- [7] (2020) An embedded system for the automated generation of labeled plant images to enable machine learning applications in agriculture. PLOS ONE 15 (12), pp. 1–23.
- [8] (2017) Controlled comparison of machine vision algorithms for Rumex and Urtica detection in grassland. Computers and Electronics in Agriculture 140, pp. 123–138.
- [9] (2018) Analysis of morphology-based features for classification of crop and weeds in precision agriculture. IEEE Robotics and Automation Letters 3 (4), pp. 2950–2956.
- [10] (2020) Smart farming: agriculture's shift from a labor intensive to technology native industry. Internet of Things 9, pp. 100142.
- [11] (2018) Agricultural robotics: the future of robotic agriculture. CoRR abs/1806.06762.
- [12] (2015) Lights, camera, action: high-throughput plant phenotyping is ready for a close-up. Current Opinion in Plant Biology 24, pp. 93–99.
- [13] (2020) Datasheets for datasets. arXiv:1803.09010.
- [14] (2017) High-throughput phenotyping. American Journal of Botany 104 (4), pp. 505–508.
- [15] (2018) Citizen crowds and experts: observer variability in image-based plant phenotyping. Plant Methods 14 (1), pp. 1–14.
- [16] (2019) A comprehensive review on automation in agriculture using artificial intelligence. Artificial Intelligence in Agriculture 2, pp. 1–12.
- [17] (2018) Deep learning in agriculture: a survey. Computers and Electronics in Agriculture 147, pp. 70–90.
- [18] (2019) Evolution of Internet of Things (IoT) and its significant impact in the field of precision agriculture. Computers and Electronics in Agriculture 157, pp. 218–231.
- [19] (2018) Machine learning in agriculture: a review. Sensors 18 (8).
- [20] (2017) Image analysis in plant sciences: publish then perish. Trends in Plant Science 22 (7), pp. 559–566.
- [21] (2016) Advances in robotic agriculture for crops. Biosystems Engineering 100 (146), pp. 1–2.
- [22] (2018) Computer vision and artificial intelligence in precision agriculture for grain crops: a systematic review. Computers and Electronics in Agriculture 153, pp. 69–81.
- [23] (2019) Farming reimagined: a case study of autonomous farm equipment and creating an innovation opportunity space for broadacre smart farming. NJAS - Wageningen Journal of Life Sciences 90-91, pp. 100307.
- [24] (2017) High throughput phenotyping to accelerate crop breeding and monitoring of diseases in the field. Current Opinion in Plant Biology 38, pp. 184–192.
- [25] (2018) Research and development in agricultural robotics: a perspective of digital farming. International Journal of Agricultural and Biological Engineering 11, pp. 1–14.
- [26] (2016) Machine learning for high-throughput stress phenotyping in plants. Trends in Plant Science 21 (2), pp. 110–124.
- [27] (2017) Plant phenomics, from sensors to knowledge. Current Biology 27 (15), pp. R770–R783.
- [28] (2016) 3-D imaging systems for agricultural applications - a review. Sensors 16 (5).
- [29] (2018) Automated plant species identification - trends and future directions. PLOS Computational Biology 14, pp. 1–19.
- [30] (2017) A survey of ranging and imaging techniques for precision agriculture phenotyping. IEEE/ASME Transactions on Mechatronics 22 (6), pp. 2428–2439.