Computer vision systems for quality inspection are widespread throughout agriculture and many other industries. Deep learning has become the driving force in many applications, largely due to its potentially high accuracy and the ease of use afforded by the large number of open-source libraries. The common methodology for training networks is either to adapt an open-source network or to design a new one. However, it can be difficult to choose which network is best for a specific task, as the choice often comes with a trade-off between complexity, accuracy and speed. Our contribution in this work is therefore a systematic approach that creates an overview of this trade-off for a specific agricultural task: recognising corn kernel fragments in silage harvested by a forage harvester. In corn silage, kernels must be cracked sufficiently such that, when the silage is used as fodder for dairy cows, the starch content is easily ingested and milk yield can be optimised (Johnson et al., 2003). A recognition system for processing quality can help farmers use their machine optimally, avoiding both quality decreases of up to 25% and inefficient usage of diesel fuel (Marsh, 2013). Furthermore, such systems can help address a potential food crisis as the population is expected to reach 9.1 billion by 2050 (FAO, 2009).
This work extends upon that done in Rasmussen and Moeslund (2019), where it was shown that kernel fragment shape and size characteristics could be measured with Convolutional Neural Networks (CNNs) for bounding-box detection and instance segmentation. However, only a single form of each was trained, and it is unknown whether these architectures are optimal. In Huang et al. (2017) the trade-off between speed and accuracy was explored for CNN-based object detectors. Whilst comprehensive and useful, as the open-source implementations are available through the TensorFlow object detection API, the networks there are trained and evaluated on the large COCO benchmark dataset (Lin et al., 2014), and it is less clear what the trade-off is for a specific use-case on a smaller scale such as kernel fragmentation. We provide an overview of the trade-off for kernel recognition by training variants of three meta-architectures of increasing complexity with the API from Huang et al. (2017), exploring different feature extractors and input image resolutions. This allows us to show an approach for determining optimal model design choices for CNN-based kernel fragment recognition.
The data used to train and test the networks are the same as in Rasmussen and Moeslund (2019) and consist of RGB images of silage taken post-harvest. Typically, kernel processing evaluation requires the separation of kernels and stover (leaves and stalks), either through manual means as in Mertens (2005) and Penn State Extension (2016) followed by sieving measurements, or through sieving estimation with image processing (Drewry et al., 2019). However, the manual separation step can be cumbersome, making it impractical for a farmer whilst harvesting. Therefore, in Rasmussen and Moeslund (2019) images and annotations were collected of non-separated corn silage for a direct measurement.
The dataset consists of a total of 2043 images with 11601 kernel fragment annotations. A notable difference in this work compared to Rasmussen and Moeslund (2019) is that a validation set is added to combat overfitting, with the model variant achieving the lowest validation loss selected for evaluation. In Rasmussen and Moeslund (2019) the data was split 60% for training and 40% for testing; here we keep the same training set but evenly split the original test set such that validation and test cover 20% each. To vary the image size when training and testing models, images are resized from the original dimensions of 640×1280 to either 600×1200, 400×730 or 200×365 using bilinear interpolation.
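The 60/20/20 split described above can be sketched as follows. This is a minimal illustration using integer image ids; the paper reuses the fixed 60% training split from the earlier work and halves the old test set, whereas this sketch simply draws a fresh seeded random split with the same proportions.

```python
import random

# Dataset size reported above.
NUM_IMAGES = 2043

def split_dataset(image_ids, seed=0):
    """Split image ids into 60% train, 20% validation, 20% test.

    A fresh seeded shuffle for illustration only; the paper keeps
    the original training set and splits the former test set.
    """
    ids = list(image_ids)
    random.Random(seed).shuffle(ids)
    n_train = int(0.6 * len(ids))
    n_val = int(0.2 * len(ids))
    return (ids[:n_train],
            ids[n_train:n_train + n_val],
            ids[n_train + n_val:])

train, val, test = split_dataset(range(NUM_IMAGES))
```

With 2043 images this yields 1225 training, 408 validation and 410 test images; any remainder from integer truncation lands in the test split.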
3 CNN Meta-Architectures
The TensorFlow object detection API provides a number of options for meta-architectures and includes pre-trained models with different backbone feature extractors and hyperparameters. Hyperparameters for the training of our models remained unchanged from the configuration files provided in the API, apart from the learning rate being decreased by a factor of 10, as only fine-tuning is performed. Networks are trained using TensorFlow 1.13.1 on a machine containing an NVIDIA Titan Xp and a GTX 1080 Ti.
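The factor-of-10 learning-rate decrease can be applied by editing the API's text-protobuf pipeline config before training. A minimal sketch, assuming the `initial_learning_rate` field used in the API's sample optimizer configs; configs that use learning-rate schedules have additional fields this sketch does not handle.

```python
import re

def scale_learning_rate(config_text, factor=0.1):
    """Scale every initial_learning_rate value in a pipeline
    config by `factor` (0.1 gives the factor-of-10 decrease
    used when fine-tuning from a pre-trained checkpoint)."""
    def repl(match):
        value = float(match.group(2)) * factor
        return f"{match.group(1)}{value:.6g}"
    pattern = r"(initial_learning_rate:\s*)([0-9.eE+-]+)"
    return re.sub(pattern, repl, config_text)

cfg = "optimizer { rms_prop_optimizer { initial_learning_rate: 0.004 } }"
print(scale_learning_rate(cfg))
```

The rest of the config (anchor boxes, augmentation, input resolution) is left untouched, matching the paper's approach of changing only the learning rate.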
The first meta-architecture adopted is the Single Shot MultiBox Detector (SSD), an efficient single-stage bounding-box detector. SSD achieves competitive accuracy whilst running much faster than more complex networks. For varying complexity of feature extraction within SSD we adopt MobileNetv1 (Howard et al., 2017), MobileNetv2 (Sandler et al., 2018) and InceptionV2 (Szegedy et al., 2016). Next, we train Faster R-CNN, a two-stage bounding-box detector that utilises a Region Proposal Network (RPN) to produce candidate proposals whose boxes are then regressed and classified. For Faster R-CNN we train variants with InceptionV2, ResNet50 and ResNet101 from He et al. (2015). Lastly, and most complex, is the instance segmentation network Mask R-CNN (He et al., 2017). The network extends Faster R-CNN with the added ability to produce a mask for each prediction. As the RPN is also part of Mask R-CNN, the network can also output bounding-boxes, so both forms will be evaluated. The feature extractors trained for Mask R-CNN are likewise InceptionV2, ResNet50 and ResNet101.
The results in Table 1 are based upon a subset of the COCO metrics, where models with bounding-box predictions can be seen in the first section and segmentation models in the second. Additionally, we show the AP@0.5 results from Rasmussen and Moeslund (2019) for R-FCN (Dai et al., 2016b) with ResNet101 and MNC (Dai et al., 2016a) with AlexNet (Krizhevsky et al., 2012). As mentioned in Section 2, we altered the test set such that a validation set is also available. Therefore, the results are not calculated on exactly the same images as in Rasmussen and Moeslund (2019), but we argue that the new test set is large enough for the results to be comparable.
**Bounding-box predictions:**

| Model | Image size | AP | AP@0.5 | AR@100 | Inference time (ms) |
|---|---|---|---|---|---|
| R-FCN ResNet101 (Rasmussen and Moeslund, 2019) | 600×1200 | NA | 34.0 | NA | 101.0 |
| Faster R-CNN InceptionV2 | 600×1200 | 25.6 | 51.9 | 45.1 | 51.1 |
| Faster R-CNN ResNet50 | 600×1200 | 24.5 | 51.1 | 45.8 | 96.8 |
| Faster R-CNN ResNet101 | 600×1200 | 25.5 | 52.1 | 45.3 | 112.4 |
| Mask R-CNN InceptionV2 | 600×1200 | 26.0 | 52.7 | 46.5 | 129.8 |
| Mask R-CNN ResNet50 | 600×1200 | 26.4 | 50.7 | 49.2 | 316.6 |
| Mask R-CNN ResNet101 | 600×1200 | 26.9 | 52.4 | 50.1 | 381.5 |

**Segmentation predictions:**

| Model | Image size | AP | AP@0.5 | AR@100 | Inference time (ms) |
|---|---|---|---|---|---|
| MNC AlexNet (Rasmussen and Moeslund, 2019) | 600×1200 | NA | 36.1 | NA | 87.0 |
| Mask R-CNN InceptionV2 | 600×1200 | 23.3 | 51.5 | 41.2 | 129.8 |
| Mask R-CNN ResNet50 | 600×1200 | 23.7 | 49.8 | 43.6 | 316.6 |
| Mask R-CNN ResNet101 | 600×1200 | 25.3 | 52.0 | 46.4 | 381.5 |
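For reference, the AP@0.5 metric in Table 1 counts a predicted box as a true positive only when its intersection-over-union (IoU) with a ground-truth box is at least 0.5. A minimal sketch of the IoU computation (an illustration, not the pycocotools implementation used for the reported numbers):

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes,
    each given as (xmin, ymin, xmax, ymax) in pixels."""
    ix = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    iy = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = ix * iy
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# Two 10x10 boxes overlapping by half: IoU = 50 / 150 = 1/3,
# so this pair would NOT count as a match at the 0.5 threshold.
print(iou((0, 0, 10, 10), (5, 0, 15, 10)))
```

The plain AP column averages this matching over IoU thresholds from 0.5 to 0.95, which is why it is stricter than AP@0.5.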
The results in Table 1 are visualised in Figure 1, where we show AP@0.5 in (a), AP in (b) and AR@100 in (c), all against the inference time of the models. Firstly, we see a significant improvement in AP@0.5 in comparison to the R-FCN model from Rasmussen and Moeslund (2019), in addition to a decrease in inference time for all SSD variants and some of the Faster R-CNN and Mask R-CNN variants. The models trained in this work have an AP@0.5 around 20 percentage points higher, while running up to 5-8× faster for bounding-boxes. The segmentation variants, however, proved slower than previously, with only Mask R-CNN InceptionV2 at image size 200×365 running 1.27× faster while improving AP@0.5 by 0.6 percentage points in comparison to the MNC model from Rasmussen and Moeslund (2019). Improvements of up to 17.7 percentage points are seen for more complex models, but at the cost of increased inference time.
Comparing the varying meta-architecture complexity, we see a slight gain in the metrics when evaluating bounding-box outputs. However, this comes at a cost in inference time, especially between Faster R-CNN and Mask R-CNN. Within each meta-architecture we see slight differences between feature extractors. At 600×1200, AP for SSD improves by 9.4% from MobileNetv1 to MobileNetv2 but falls for InceptionV2; Faster R-CNN increases by 4.5% from InceptionV2 to ResNet101, and Mask R-CNN by 3.5% from InceptionV2 to ResNet101. This shows that less is gained by spending time determining the optimal feature extraction architecture than by choosing the meta-architecture. This is in contrast to the findings in Huang et al. (2017), where large improvements could be made; for example, Faster R-CNN had a 70% increase in AP on the MS COCO test set across the evaluated feature extractors. Finally, we do see improvements in the metrics when increasing the image size from 200×365 to 400×730, but less so between 400×730 and 600×1200. Additionally, a significant increase in inference time is seen for most meta-architectures at the largest image size.
Lastly, an example image with predictions from the best performing model with respect to AP and AP@0.5 can be seen in Figure 2.
In this work we have shown a systematic approach to training object recognition networks for the task of kernel fragment recognition in corn silage, whilst providing an overview of the trade-off between complexity, accuracy and speed. We show that slight improvements in AP and AR can be made by adopting more complex meta-architectures, but at a larger cost in inference time. For all models the gain in AP and AR from a small to a medium image size was considerable, but minimal or even negative when increasing to the largest size. Minimal improvements could be made by altering the feature extractor within each meta-architecture, in contrast to the findings on COCO in Huang et al. (2017). We propose that this approach can be transferred to other similar domains where training data can be sparse in order to select an appropriate model, and we speculate that the design choices for our models could be directly transferred to tasks with similar imagery, such as scenes with high amounts of clutter and occlusion. The improvements in kernel fragment recognition through better model selection open possibilities for a more efficient and robust system for farmers to obtain improved yields.
This work was funded by Innovation Fund Denmark under Grant 7038-00170B.
- Instance-aware semantic segmentation via multi-task network cascades. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3150–3158.
- R-FCN: object detection via region-based fully convolutional networks. In Advances in Neural Information Processing Systems 29, pp. 379–387.
- Predicting kernel processing score of harvested and processed corn silage via image processing techniques. Computers and Electronics in Agriculture 160, pp. 144–152.
- How to Feed the World 2050. FAO. http://www.fao.org/fileadmin/templates/wsfs/docs/expert_paper/How_to_Feed_the_World_in_2050.pdf (accessed February 10, 2020).
- Mask R-CNN. In 2017 IEEE International Conference on Computer Vision (ICCV), pp. 2980–2988.
- Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778.
- MobileNets: efficient convolutional neural networks for mobile vision applications. CoRR abs/1704.04861.
- Speed/accuracy trade-offs for modern convolutional object detectors. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3296–3297.
- Corn silage management: effects of hybrid, chop length, and mechanical processing on digestion and energy content. Journal of Dairy Science 86 (1), pp. 208–231.
- ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, pp. 1097–1105.
- Microsoft COCO: common objects in context. In ECCV, pp. 740–755.
- A comparison of fuel usage and harvest capacity in self-propelled forage harvesters. International Journal of Agricultural and Biosystems Engineering 7 (7), pp. 649–654.
- Particle size, fragmentation index, and effective fiber: tools for evaluating the physical attributes of corn silages. In Proceedings of the Four-State Dairy Nutrition and Management Conference.
- Penn State Particle Separator. Penn State. https://extension.psu.edu/penn-state-particle-separator (accessed February 10, 2020).
- Maize silage kernel fragment estimation using deep learning-based object recognition in non-separated kernel/stover RGB images. Sensors 19 (19).
- MobileNetV2: inverted residuals and linear bottlenecks. In 2018 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4510–4520.
- Rethinking the inception architecture for computer vision. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2818–2826.