Image classification through convolutional neural networks (CNNs) has become a staple of today’s machine learning discussion. The utilization of GPUs as well as the availability of large, open-access datasets enabled the explosive success of CNNs. Some examples of these datasets are MNIST [lecun1998mnist, deng2012mnist] and ImageNet [ILSVRC15]. Fuelled by computational power, improved architectures, and data, ML has made considerable progress. However, the need for data persists to this day. Indeed, there is a general “more is better” mentality when it comes to the number of training samples available. Yet, at the same time, data generation, labelling, storage, dissemination, and usage come with non-negligible costs in time, infrastructure, and money (see for example [najafabadi2015deep]). As a consequence, it becomes increasingly important to answer the following questions:
How much data is needed to achieve a certain performance goal?
How does performance relate to sample size?
Can available data be reduced to a subset without impairing the performance of the models trained on it? Can the removal of samples even improve model performance?
How can such a reduction of data be performed? How can we assign a value to data points within a training set and determine which data points can or should be discarded?
Due to the black-box nature of modern deep neural networks and the high dimensionality of images, the above questions are difficult to address directly. It is thus useful to make large-scale observations; one of these is the concept of learning curves. A learning curve plots a model’s performance on a held-out test set (this metric is often called risk) against the number of samples the model was trained on. Several studies have observed a power-law relationship between risk and training volume [hestness2017deep, rosenfeld2019constructive, spigler2020asymptotic, bahri2021explaining] with exponents usually in the range of . There are theoretical discussions that suggest that the value of the exponent depends on the inherent dimensionality of the sample data [bahri2021explaining]. It must also be said that the power-law relationship does not hold on all sample scales. Instead, we can observe three phases for learning curves [hestness2017deep], which are, in order of increasing training set size: (i) the small data phase, in which the model does not perform significantly better than random guessing; (ii) the phase in which the power-law relationship holds; (iii) the phase of irreducible error, in which the power-law relation comes to an end and no further improvement can be observed. Even though the learning curve transitions into and out of the power-law behavior, this description makes model performance somewhat predictable. The basic idea is this: we can train the model on a sequence of relatively small subsets and determine the exponent. Then we can use the power-law relation to estimate the model’s performance for the full dataset (assuming we do not transition into the phase of irreducible error in the meantime) [rosenfeld2019constructive]. This idea can be extended to more elaborate methods; see [viering2021shape] for an overview.
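The extrapolation idea described above can be sketched in a few lines. This is a minimal illustration, not the procedure of the cited works: the function names, the prefactor `a`, the exponent `b`, and the pilot measurements are all assumptions introduced here, and the pilot errors are synthetic values generated from a known power law so that the fit can be checked.

```python
import math

def fit_power_law(sizes, errors):
    """Least-squares fit of log(error) = log(a) - b * log(n).

    Returns (a, b) such that error is approximately a * n ** (-b).
    """
    xs = [math.log(n) for n in sizes]
    ys = [math.log(e) for e in errors]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return math.exp(intercept), -slope

def extrapolate(a, b, n):
    """Predicted error at training size n, valid only inside the power-law phase."""
    return a * n ** (-b)

# Synthetic (made-up) pilot runs generated from error = 2 * n ** -0.5.
pilot_sizes = [100, 200, 400, 800]
pilot_errors = [2.0 * n ** -0.5 for n in pilot_sizes]

a, b = fit_power_law(pilot_sizes, pilot_errors)
predicted = extrapolate(a, b, 10000)
print(f"a={a:.3f}, b={b:.3f}, predicted error at n=10000: {predicted:.4f}")
```

With real measurements the fit should only use points that lie inside the power-law phase; points from the small data phase or the phase of irreducible error would bias the estimated exponent.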
In this paper we investigate a dataset of plant images on a blue background; see Figure 1 for an example. The sample consists of 90900 images: 10000 training images and 100 test images for each of 9 different plant classes. The images show a variety of different growth stages for each class and cover a wide range of imaging angles. The images are randomly selected from a larger collection that was created and labelled by an autonomous robotic imager [10.1371/journal.pone.0243923]. The machine learning task we consider is to correctly assign each image to one of the 9 plant species. In this context we perform the following analysis:
We confirm the power-law relationship between training volume and model accuracy and measure the respective exponent.
We investigate how the introduction of noise (purposefully mislabelling a random subset of training samples) affects the exponent.
We investigate how the exponent is affected by changing the model from a randomly initialized one to a model pretrained on ImageNet. This reduces the number of trainable parameters by several orders of magnitude.
These are necessary first steps to describe the data quality with respect to training effectiveness. Our observations indicate that noise has a significant impact on the exponent. Furthermore, we observe that a reduction of trainable parameters leads to a much worse exponent.
II Description of Dataset
In this paper we consider plant images taken by our robotic imager EAGL-I (see [10.1371/journal.pone.0243923]). This system is capable of automatically imaging and labelling several plants at once from a wide variety of angles and distances. Examples of such images can be seen in Figure 1. The overall purpose of these images is to support machine learning applications in digital agriculture, such as weed detection, yield estimation, plant-health assessment, and phenotyping. The whole dataset (more than 1.5 million images of the type seen in Figure 1, plus over half a million additional unlabelled field images and over half a million labelled images that contain multiple plants) is available via the TerraByte project homepage (https://terrabyte.acs.uwinnipeg.ca/resources.html); see also [beck2022terrabyte] for more details on the downloader. More data, as well as new types of data, are continuously being added.
Out of this whole dataset we have randomly selected 90900 images such that there are 10100 images per plant class. The classes we chose contain 7 different cash crops (e.g., barley, wheat, soybeans, peas) and 2 types of weeds (barnyard grass and smartweed) that are common to the Manitoba region. Thus, the dataset consists of 90000 sample images and 900 test images. Throughout this paper the 900 test images are the same for all risk evaluations and have not been used in the training of any of the models. The 90000 sample images are shuffled per class. This ensures that two images of the same individual plant imaged on the same day are not listed in order when creating the training subsets. We then created an increasing sequence of subsets, such that each larger subset contains all the images of the previous subsets. The smallest sample subset contains 10 images per class (90 images in total), whereas the largest subset is the entire sample set of 10000 images per class. We will denote these subsets by , where is the size of the subset, for example or .
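The construction of nested subsets described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' tooling: the function name `nested_subsets`, the file names, the seed, and the toy subset sizes are hypothetical. The essential point is that each class is shuffled once and every subset is a prefix of that shuffled list, so smaller subsets are automatically contained in larger ones.

```python
import random

def nested_subsets(images_by_class, sizes_per_class, seed=0):
    """Shuffle each class once, then take prefixes of increasing length.

    Because every subset is a prefix of the same shuffled list, each
    larger subset contains all images of the smaller ones.
    """
    rng = random.Random(seed)
    shuffled = {}
    for label, images in images_by_class.items():
        images = list(images)  # copy so the caller's list is not modified
        rng.shuffle(images)
        shuffled[label] = images
    return {
        k: {label: imgs[:k] for label, imgs in shuffled.items()}
        for k in sorted(sizes_per_class)
    }

# Toy example with made-up file names: 3 classes, 20 images each.
toy = {
    label: [f"{label}_{i:03d}.png" for i in range(20)]
    for label in ("barley", "wheat", "smartweed")
}
subsets = nested_subsets(toy, sizes_per_class=[5, 10, 20])
for k, per_class in subsets.items():
    print(k, "images per class,", sum(len(v) for v in per_class.values()), "in total")
```

For the dataset in this paper the per-class sizes would range from 10 to 10000; the intermediate sizes are not assumed here.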
III Training Setup
We note that results can depend on model architecture, model size, and tuning of hyperparameters. However, at this point we are primarily interested in the data quality, a description of data redundancy, and learning curve parameters. The literature suggests that we can expect similar, though not identical, results when choosing a different model to train on the same data (see for example [rosenfeld2019constructive]). Unless mentioned otherwise, we always trained a classification model as follows:
We use the popular ResNet50 architecture [he2016deep] for image classification with the Adam optimizer.
We choose cross-entropy as the loss function.
The model’s weights are initialized randomly. As the classes are balanced, there is no need for class weights.
In addition to cross-entropy loss, we also track the top-1 accuracy of the model.
20% of the training set is held out as validation data. It is used to determine when the model has converged. In models with equal training volume the same data is held out.
We use early stopping on the validation-accuracy with a patience of 50 epochs, and a maximum of 100 training epochs.
The only data augmentation used is a horizontal flip applied with 50% probability. All images are rescaled to 224x224 pixels.
Validation accuracy and loss are reported either for the model that triggered the early stopping or for the best model found.
To evaluate the trained models we classify the 900 images of the overall held-out test set. This is independent of , the volume of the training set and validation set used.
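The stopping rule in the list above (patience of 50 epochs, at most 100 epochs) can be sketched independently of any deep learning framework. The framework used by the authors is not stated, so the function name `train_with_early_stopping` and the callback `run_epoch` are hypothetical; `run_epoch` stands in for one epoch of training followed by a validation pass, and the validation curve in the example is made up.

```python
def train_with_early_stopping(run_epoch, patience=50, max_epochs=100):
    """Run epochs until validation accuracy has not improved for `patience` epochs.

    `run_epoch(epoch)` is a placeholder that trains for one epoch and
    returns the validation accuracy. Returns (best_epoch, best_accuracy,
    last_epoch), which distinguishes the best model found from the model
    that triggered the stop.
    """
    best_acc = float("-inf")
    best_epoch = -1
    last_epoch = -1
    for epoch in range(max_epochs):
        last_epoch = epoch
        val_acc = run_epoch(epoch)
        if val_acc > best_acc:
            best_acc, best_epoch = val_acc, epoch
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` consecutive epochs
    return best_epoch, best_acc, last_epoch

# Made-up validation curve: improves until epoch 10, then stays flat.
def fake_epoch(epoch):
    return min(epoch, 10) / 10

result = train_with_early_stopping(fake_epoch)
print(result)
```

In this toy run the best model is found at epoch 10 and the loop stops after 50 further epochs without improvement, before the 100-epoch cap is reached.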
IV-A Learning curve exponent
Recent research [hestness2017deep, rosenfeld2019constructive, spigler2020asymptotic, bahri2021explaining] observed a power-law behavior for learning curves of the form:
with . For example, [hestness2017deep] reports a value of after training a family of ResNet models on ImageNet data. We note here that power-law relationships appear as a straight line in log-log plots with slope , due to . To validate that the same power law can be observed in our dataset, we train our ResNet model on several sample sets . The result is illustrated in Figure 2, which shows the model’s performance over the full range of training subsets on a log-log plot. From the observed accuracies and cross-entropy losses we identify the power-law region as starting no earlier than at the subset of size . Furthermore, we did not observe a transition into the phase of irreducible error. We fitted a power-law curve over the data points from to and observe an exponent of for the top-1 accuracy error and for the cross-entropy loss. These values are relatively large, at least compared to for ResNet models on ImageNet [hestness2017deep]. Having a large exponent tells us that training a classification task on our data is relatively easy, although the reasons for this can be manifold. We suspect three data characteristics working together towards this effect: (i) The representatives of the classes in our dataset likely have less variance than the classes in ImageNet. For example, two different soybean plants differ mostly in the number and position of their leaves, and even those correlate strongly with the plant’s maturity. For comparison, a typical class in ImageNet, say “Bicycle”, can be represented by a wide range of colors and forms. (ii) The objects to classify in the plant dataset are all in front of a homogeneous, mostly noise-free, single-colored background. This is not the case in ImageNet, where the objects to classify are placed in natural environments. (iii) Our images were automatically labelled and thus have virtually no labelling errors. ImageNet, in contrast, was manually labelled, and for early versions of ImageNet (on which the exponent in [hestness2017deep] is based) it is estimated that 5% of the images are mislabelled [northcutt2021pervasive]. We now investigate in more detail the impact of noisy labels for our data.
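The log-log linearity used in this subsection to read off the exponent can be made explicit with a short derivation. The symbols are assumptions chosen for illustration and need not match the notation of the cited works: \varepsilon(n) denotes the risk after training on n samples, a > 0 is a prefactor, and \beta > 0 is the exponent.

```latex
\varepsilon(n) = a\, n^{-\beta}
\quad\Longrightarrow\quad
\log \varepsilon(n) = \log a - \beta \log n ,
\qquad
\beta = -\,\frac{\log \varepsilon(n_2) - \log \varepsilon(n_1)}{\log n_2 - \log n_1} .

% Data required to reduce the risk by a factor c > 1:
\frac{\varepsilon(n')}{\varepsilon(n)} = \frac{1}{c}
\quad\Longleftrightarrow\quad
\frac{n'}{n} = c^{1/\beta} .
```

Under this form, a larger exponent means that each multiplicative increase in data buys a larger reduction in risk. Halving the risk, for instance, requires 2^{1/\beta} times as much data, which grows quickly as \beta shrinks.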
IV-B Noisy training data
In this evaluation we purposefully change the labels in , before using them in training. To be more specific: in we replaced the label of each datum, with a probability of 5%, by a label drawn at random from the 8 possible wrong labels. Consequently, if a datum was changed in it is changed in the same way for each subset it is a member of. Since the validation set is a subset of the introduced noise spreads over both the training and the validation set. We did not introduce noise to the held-out test set of 900 images. Figure 3 shows the learning curves for top-1 accuracy error and cross-entropy loss, respectively. As expected, the overall performance on the test set has suffered compared to using the original training data. More noteworthy, however, is that introducing 5% noise has resulted in a significant decrease of the exponent and thus a reduced effectiveness of the training data. For the top-1 accuracy error the difference is 0.115; for the cross-entropy loss the difference is 0.136. Noisy training data has affected the effectiveness of the training data on all scales, and we need disproportionately more data to compensate for this effect and achieve comparable performance.
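The label-noise procedure described above can be sketched as follows. This is a minimal illustration under assumptions: the function name `inject_label_noise`, the integer class encoding 0 to 8, the seed, and the toy label list are hypothetical. Noise is applied once to the full label list, so every nested subset inherits the same corrupted labels.

```python
import random

def inject_label_noise(labels, num_classes=9, noise_rate=0.05, seed=0):
    """Replace each label, with probability `noise_rate`, by a wrong label.

    The replacement is drawn uniformly from the other `num_classes - 1`
    classes, so a corrupted label is never equal to the original one.
    """
    rng = random.Random(seed)
    noisy = []
    for label in labels:
        if rng.random() < noise_rate:
            wrong = [c for c in range(num_classes) if c != label]
            noisy.append(rng.choice(wrong))
        else:
            noisy.append(label)
    return noisy

# Toy example: 9000 balanced labels encoded as integers 0..8.
clean = [i % 9 for i in range(9000)]
noisy = inject_label_noise(clean)
flipped = sum(1 for c, n in zip(clean, noisy) if c != n)
print(f"{flipped} of {len(clean)} labels changed ({flipped / len(clean):.1%})")
```

Fixing the seed and corrupting the full set before subsetting is one way to obtain the consistency property stated in the text; the test labels would be left untouched.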
IV-C Reduction of model parameters
To investigate how the exponent changes when modifying the number of trainable parameters, we perform a simple change in the ResNet model. Instead of starting the model from scratch with randomly initialized weights, we load a model that has already been trained on ImageNet and freeze all layers except the output layer. This reduces the number of trainable parameters from 23.5 million to 18 thousand. The effect on the learning curve is illustrated in Figure 4. We see an even larger decrease of the exponent, compared to introducing noise, with the exponent dropping to approximately 0.2 on both metrics. This shows that models trained on ImageNet do not transfer well to our data and the respective training becomes very ineffective. This trend can be determined very early, with training on sample sets as small as 1% of the available samples. This serves as an example of how learning curve parameters can lead us to models that are good candidates for a given machine learning task.
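The figure of roughly 18 thousand trainable parameters can be checked with a short calculation. The sketch assumes the output layer is a single fully connected layer with bias on top of the 2048-dimensional pooled feature vector of a standard ResNet50; the exact head used by the authors is not specified, and the helper name `dense_layer_params` is hypothetical.

```python
import math

FEATURES = 2048      # width of the pooled feature vector in a standard ResNet50
NUM_CLASSES = 9      # plant classes in this paper
FULL_MODEL = 23.5e6  # trainable parameters of the unfrozen model, as stated in the text

def dense_layer_params(in_features, out_features, bias=True):
    """Number of weights (plus optional biases) in a fully connected layer."""
    return in_features * out_features + (out_features if bias else 0)

head_params = dense_layer_params(FEATURES, NUM_CLASSES)
orders = math.log10(FULL_MODEL / head_params)
print(head_params)  # 18441
print(f"reduction: about {orders:.1f} orders of magnitude")
```

Under these assumptions the frozen model trains 2048 * 9 + 9 = 18441 parameters, which is consistent with the 18 thousand quoted above.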
We investigated learning curves for a dataset that consists of different crops and weeds common to the Manitoba prairies. We first observed that the learning curves for our data follow a power-law relation with a large exponent (cross-entropy loss), indicating that the classification task is comparatively easy relative to other public datasets, such as ImageNet. We then investigated how the introduction of labelling noise or a reduction of trainable parameters influences the exponent. Both resulted in a significant decrease of the exponent, and thus a disproportionately larger amount of data is required to achieve results comparable to the first scenario (no noise, randomly initialized weights). By comparing the parameters of learning curves for different models on the same dataset, one can quickly determine which models are more suitable for the task at hand. We invite researchers to analyze our dataset further. It is available at: https://terrabyte.acs.uwinnipeg.ca/resources.html