I Introduction
Image classification with convolutional neural networks (CNNs) has become a staple of today’s machine learning discussion. Here, the utilization of GPUs as well as the availability of large, open-access datasets enabled the explosive success of CNNs. Some examples of these datasets are MNIST
[lecun1998mnist, deng2012mnist], CIFAR
[krizhevsky2009learning], and ImageNet
[ILSVRC15]. Fuelled by computational power, improved architectures, and data, ML has made considerable progress. However, the need for data persists to this day. Indeed, there is a general “more is better” mentality when it comes to the amount of training samples available. Yet, at the same time, data generation, labelling, storage, dissemination, and usage come with non-negligible costs in time, infrastructure, and money (see for example [najafabadi2015deep]). As a consequence, it becomes increasingly important to answer the following questions:
How much data is needed to achieve a certain performance goal?

How does performance relate to sample size?

Can available data be reduced to a subset without impairing the performance of the models trained on it? Can the removal of samples even improve model performances?

How can such a reduction of data be performed? How can we valuate data points within a training set and determine which data points can or should be discarded?
Due to the black-box nature of modern deep neural networks and the high dimensionality of images, the above questions are difficult to address directly. It is thus useful to make large-scale observations; one such observation is the concept of learning curves. A learning curve plots a model’s performance on a held-out testing set (this metric is often called risk)
against the number of samples the model was trained on. Several studies have observed a power-law relationship between risk and training volume [hestness2017deep, rosenfeld2019constructive, spigler2020asymptotic, bahri2021explaining]; we denote the exponent of this power law by β. There are theoretical discussions that suggest that the value of β depends on the inherent dimensionality of the sample data [bahri2021explaining]. It must also be said that the power-law relationship does not hold on all sample scales. Instead, we can observe three phases for learning curves [hestness2017deep], which are, in order of increasing size of the training set: (i) the small-data phase, in which the model does not perform significantly better than making random predictions; (ii) the phase in which the power-law relationship holds; (iii) the phase of irreducible error, in which the power-law relation comes to an end and no further improvement can be observed. Even though the learning curve transitions into and out of the power-law behavior, this description makes model performances somewhat predictable. The basic idea is this: we can train the model on a sequence of relatively small subsets and determine the exponent β. Then we can use the power-law relation to estimate the model’s performance for the full dataset (assuming we do not transition into the phase of irreducible error in the meantime!)
[rosenfeld2019constructive]. This idea can be extended to more elaborate methods; see [viering2021shape] for an overview. In this paper we investigate a dataset of plant images on a blue background, see Figure 1 for an example. The sample consists of 90900 images: 10000 training images and 100 validation images for each of 9 different plant classes. The images show a variety of growth stages for each class and cover a wide range of imaging angles. The images are randomly selected from a larger collection that was created and labelled by an autonomous robotic imager [10.1371/journal.pone.0243923]
. The machine learning task we consider is to correctly classify each image as one of the 9 plant species. In this context we perform the following analysis:

We confirm the power-law relationship between training volume and model accuracies and measure the respective exponent β.

We investigate how the introduction of noise (purposefully mislabelling a random subset of training samples) affects β.

We re-evaluate β after changing the model from a randomly initialized one to a model pretrained on ImageNet. This reduces the number of trainable parameters by several orders of magnitude.
These are necessary first steps to describe the data quality with respect to training effectiveness. Our observations indicate that noise has a significant impact on the exponent β. Furthermore, we observe that a reduction of trainable parameters leads to a much smaller β.
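The power-law extrapolation idea described in the introduction can be sketched in a few lines. This is a minimal illustration with hypothetical error values, not measurements from our experiments: the fit recovers β as the negative slope in log-log space and extrapolates to a larger training volume.

```python
# Sketch: estimate the learning-curve exponent beta from small training
# subsets and extrapolate to the full dataset.  The (size, error) pairs
# below are hypothetical placeholders, not results from this paper.
import numpy as np

sizes = np.array([900, 1800, 3600, 7200, 14400])
errors = np.array([0.30, 0.21, 0.15, 0.105, 0.074])

# In log-log space the power law  error = a * n**(-beta)  is a straight
# line with slope -beta, so an ordinary least-squares fit recovers beta.
slope, intercept = np.polyfit(np.log(sizes), np.log(errors), 1)
beta, a = -slope, np.exp(intercept)

# Extrapolate to the full 90 000-image training set (valid only as long
# as the curve stays inside its power-law phase).
predicted_error = a * 90_000 ** -beta
print(f"beta ~ {beta:.2f}, predicted error at n=90000: {predicted_error:.3f}")
```

This only works if the pilot subsets already lie in the power-law phase; points from the small-data phase would bias the fitted slope.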
II Description of Dataset
In this paper we consider plant images taken by our robotic imager EAGLI (see [10.1371/journal.pone.0243923]). This system is capable of automatically imaging and labelling several plants at once from a wide variety of angles and distances. Examples of such images can be seen in Figure 1. The overall purpose of these images is machine learning applications in digital agriculture, such as weed detection, yield estimates, plant health, and phenotyping. The whole dataset (more than 1.5 million images of the type seen in Figure 1, plus over half a million additional unlabelled field images and over half a million labelled images that contain multiple plants) is available via the TerraByte project homepage: https://terrabyte.acs.uwinnipeg.ca/resources.html. See also [beck2022terrabyte] for more details on the downloader. More data, as well as new types of data, are continuously being added.
Out of this whole dataset we have randomly selected 90900 images such that there are 10100 images per plant class. The classes we chose comprise 7 different cash crops (e.g., barley, wheat, soybeans, peas) and 2 types of weeds (barnyard grass and smartweed) that are common to the Manitoba region. Thus, the dataset consists of 90000 sample images and 900 test images. Throughout this paper the 900 test images are the same for all risk evaluations and have not been used in the training of any of the models. The 90000 sample images are shuffled per class. This ensures that two images of the same individual plant imaged on the same day are not listed in order when creating the training subsets. Then we created an increasing sequence of subsets, such that each larger subset contains all the images of the previous subsets. The smallest sample subset contains 10 images per class (90 images in total) whereas the largest subset is the entire sample set of 10000 images per class. We denote these subsets by S_n, where n is the size of the subset.
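The nested-subset construction described above can be sketched as follows. The class names, image ids, and sizes are illustrative placeholders; the key point is that each class is shuffled exactly once, so every smaller subset is contained in all larger ones.

```python
# Sketch of the nested-subset construction: shuffle each class once, then
# take growing prefixes so every subset contains all smaller subsets.
import random

def make_nested_subsets(images_per_class, sizes_per_class, seed=0):
    """images_per_class: dict class -> list of image ids (chronological).
    Returns dict n_per_class -> list of (class, image id) pairs."""
    rng = random.Random(seed)
    shuffled = {}
    for cls, imgs in images_per_class.items():
        imgs = list(imgs)
        rng.shuffle(imgs)   # break the per-plant, per-day ordering
        shuffled[cls] = imgs
    subsets = {}
    for n in sizes_per_class:   # e.g. 10, 100, 1000, 10000 images per class
        subsets[n] = [(cls, img) for cls, imgs in shuffled.items()
                      for img in imgs[:n]]
    return subsets

# toy usage with 3 hypothetical classes of 20 images each
data = {c: [f"{c}_{i:04d}" for i in range(20)] for c in ["wheat", "oat", "pea"]}
subs = make_nested_subsets(data, sizes_per_class=[5, 10, 20])
```

Because each class is shuffled only once, adding samples never replaces earlier ones, which keeps learning-curve measurements on different subset sizes comparable.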
III Training Setup
We note that results can depend on model architecture, model size, and tuning of hyperparameters. However, at this point we are foremost interested in the data quality, a description of data redundancy, and learning curve parameters. The literature suggests that we can expect not identical, yet very similar, results when choosing a different model to train on the same data (see for example [rosenfeld2019constructive]). Unless mentioned otherwise, we always trained a classification model as follows:

We use the popular ResNet50 architecture [he2016deep] for image classification with the Adam optimizer.

We choose cross-entropy as the loss function.

The model’s weights have been initialized randomly. As the classes are balanced, there is no need for class weights.

In addition to the cross-entropy loss, we also track the top-1 accuracy of the model.

20% of the training set is held out as validation data. It is used to determine when the model has converged. For models with equal training volume the same data is held out.

We use early stopping on the validation accuracy with a patience of 50 epochs, and a maximum of 100 training epochs.

The only data augmentation used is a 50% chance to horizontally flip the image and rescaling to 224×224 pixels.

Validation accuracy and loss are reported either for the model that triggered the early stopping or for the best model found.

To evaluate the trained models we classify the 900 images of the overall held-out test set. This is independent of n, the volume of the training and validation sets used.
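The early-stopping rule from the list above can be sketched as a small helper. The per-epoch accuracy stream is a stand-in for real validation results; the function only models the stopping logic (patience, epoch cap, best model found), not the actual training.

```python
# Sketch of early stopping on validation accuracy: stop when the accuracy
# has not improved for `patience` epochs, capped at `max_epochs`, and
# report the best epoch seen.  Accuracies here are illustrative.
def train_with_early_stopping(val_acc_per_epoch, patience=50, max_epochs=100):
    best_acc, best_epoch, waited = -1.0, -1, 0
    for epoch, acc in enumerate(val_acc_per_epoch):
        if epoch >= max_epochs:
            break                    # hard cap of 100 training epochs
        if acc > best_acc:
            best_acc, best_epoch, waited = acc, epoch, 0
        else:
            waited += 1
            if waited >= patience:   # no improvement for `patience` epochs
                break
    return best_epoch, best_acc

# toy run: accuracy peaks at epoch 3, then plateaus slightly lower
history = [0.50, 0.70, 0.80, 0.85] + [0.84] * 60
best_epoch, best_acc = train_with_early_stopping(history, patience=50)
```

Reporting the best epoch rather than the last one corresponds to restoring the best checkpoint, matching the reporting rule in the list above.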
IV Results
IV-A Learning curve exponent
Recent research [hestness2017deep, rosenfeld2019constructive, spigler2020asymptotic, bahri2021explaining] observed a power-law behavior for learning curves of the form
ε(n) = a · n^(−β),
with β > 0, where ε denotes the risk and n the training volume. For example, [hestness2017deep] reports such an exponent after training a family of ResNet models on ImageNet data. We note here that power-law relationships appear as a linear curve in log-log plots with slope −β, due to log ε = log a − β log n. To validate that the same power law can be observed in our dataset, we train our ResNet model on several sample sets S_n. The result is illustrated in Figure 2: it shows the model’s performance over the full range of training subsets on a log-log plot. From the observed accuracies and cross-entropy loss we identify the power-law region starting only at intermediate subset sizes. Further, we did not observe reaching the phase of irreducible error. We fitted a power-law curve over the data points within the power-law region and obtained an exponent β for the top-1 accuracy error as well as for the cross-entropy loss. These values are relatively large, at least compared to the exponent for ResNet models on ImageNet [hestness2017deep]
. Having a large exponent tells us that training a classification task on our data is relatively easy; the reasons for that can be manifold. We suspect three data characteristics working together towards this effect: (i) For one, the representatives of the classes in our dataset likely have less variance than the classes in ImageNet. For example, two different soybeans differ mostly in the number and position of their leaves, and even those correlate strongly with the plant’s maturity. For comparison, a typical class in ImageNet, say “Bicycle”, can be represented by a wide range of colors and forms. (ii) The objects to classify in the plant dataset are all in front of a homogeneous, mostly noise-free, unicolored background. This is not the case in ImageNet, where the objects to classify are placed in natural environments. (iii) Our images have been automatically labelled and thus have virtually no labelling errors. ImageNet, instead, was manually labelled, and for early versions of ImageNet (on which the exponent in
[hestness2017deep] is based) it is estimated that 5% of the images are mislabelled [northcutt2021pervasive]. We now investigate in more detail the impact of noisy labels for our data.

IV-B Noisy training data
In this evaluation we purposefully change the labels in the training sets S_n before using them in training. To be more specific: in each S_n we changed, with a 5% chance per datum, the label to a random label selected from one of the 8 possible wrong labels. Consequently, if a datum was changed, it is changed in the same way in each subset it is a member of. Since the validation set is a subset of S_n, the introduced noise spreads over both the training and the validation set. We did not introduce noise to the held-out test set of 900 images. Figure 3 shows the learning curves for top-1 accuracy error and cross-entropy loss, respectively. As expected, the overall performance on the test set has suffered compared to using the original training data. More noteworthy, however, is that introducing 5% noise has resulted in a significant decrease of the exponent β and thus a reduced effectiveness of the training data. For the top-1 accuracy error the difference is 0.115; for the cross-entropy loss the difference is 0.136. Noisy training data has affected the effectiveness of the training data on all scales, and we need disproportionately more data to compensate for this effect and achieve comparable performance.
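The label-corruption scheme can be sketched as follows. A fixed seed and a deterministic iteration order ensure that a flipped datum carries the same wrong label in every subset it appears in. The image ids and 9-class setup are illustrative.

```python
# Sketch of the 5% label-corruption scheme: each datum is flipped at most
# once, deterministically, so it keeps the same wrong label in all subsets.
import random

def corrupt_labels(labels, num_classes=9, p=0.05, seed=42):
    """labels: dict image id -> class index. Returns a corrupted copy."""
    rng = random.Random(seed)
    noisy = {}
    for img_id in sorted(labels):   # sorted -> deterministic flip pattern
        y = labels[img_id]
        if rng.random() < p:
            # choose uniformly among the 8 possible wrong labels
            y = rng.choice([c for c in range(num_classes) if c != y])
        noisy[img_id] = y
    return noisy

labels = {f"img_{i:05d}": i % 9 for i in range(90_000)}
noisy = corrupt_labels(labels)
flipped = sum(labels[k] != noisy[k] for k in labels)
```

Roughly 5% of the 90 000 labels end up flipped, and calling the function again with the same seed reproduces exactly the same corruption.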
IV-C Reduction of model parameters
To investigate how the exponent β changes when modifying the number of trainable parameters, we perform a simple change in the ResNet model. Instead of starting the model from scratch with randomly initialized weights, we load a model that has already been trained on ImageNet and freeze all layers but the output layer. This reduces the number of trainable parameters from 23.5 million to 18 thousand. The effect on the learning curve is illustrated in Figure 4. We see an even larger decrease of the exponent, compared to introducing noise, with the exponent dropping to approximately 0.2 on both metrics. This shows that models trained on ImageNet do not transfer well to our data, and the respective training becomes very ineffective. This trend can be detected very early, with training on sample sets as small as 1% of the available samples. This serves as an example of how learning curve parameters can lead us to models that are good candidates for a given machine learning task.
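A back-of-the-envelope check of the parameter counts quoted above, assuming the standard ResNet-50 layout: the global-pooled feature vector has 2048 entries, so a fresh 9-class output layer contributes 2048×9 weights plus 9 biases.

```python
# Trainable parameters when everything but the output layer is frozen.
# Assumes the standard ResNet-50 penultimate feature size of 2048.
features = 2048                            # ResNet-50 pooled feature dim
classes = 9                                # plant species in our dataset
trainable = features * classes + classes   # head weights + biases
total = 23_500_000                         # approx. full parameter count
reduction = total / trainable
print(trainable, round(reduction))
```

This gives 18441 trainable parameters, i.e. the "18 thousand" quoted above, a reduction by roughly three orders of magnitude.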
V Conclusion
We investigated learning curves for a dataset that consists of different crops and weeds common to the Manitoba prairies. We first observed that the learning curves for our data follow a power-law relation with a large exponent β for the cross-entropy loss, indicating that the classification task is comparably easy relative to other public datasets, such as ImageNet. We then investigated how the introduction of labelling noise or a reduction of trainable parameters influences the exponent. Both resulted in a significant decrease of the exponent, and thus a disproportionately larger amount of data is required to achieve results comparable to the first scenario (no noise, randomly initialized weights). By comparing the parameters of learning curves for different models on the same dataset, one can quickly determine which models are more suitable for the task at hand. We invite researchers to analyze our dataset further. It is available at: https://terrabyte.acs.uwinnipeg.ca/resources.html