With the versatility of machine learning tools expanding to various domains (Bengio, 2008), neural networks are becoming particularly popular for a wide variety of applications. For instance, convolutional neural network-based applications include Graph Transformer Networks (GTN) for rapid, online recognition of handwriting (LeCun et al., 1998), large vocabulary continuous speech recognition (Sercu et al., 2015) and avatar CAPTCHA machine image recognition (Cheung, 2012), which trains machines to distinguish between human faces and computer-generated faces. In the area of object detection, dimension reduction techniques such as principal component analysis (PCA) (Malagon-Borja & Fuentes, 2007), independent component analysis and linear discriminant analysis (LDA) (Martinez & Kak, 2001) have been widely used, and support vector machines (SVM) (Burges, 1998) have been quite successful for similar applications as well. They provide competitive, prior-knowledge-free mapping of complex planes in images to their high-dimensional space. Part-based discriminative models (Felzenszwalb et al., 2010) and exemplar-SVMs (Malisiewicz et al., 2011) have been tested on the Pascal visual object classes (VOC) challenges and robotics applications. However, the presence of a large amount of data (microscopic images of soybean cyst nematode (SCN) eggs with significant intra-class variety and non-uniform illumination), the intent to avoid hand-crafted features and the availability of powerful computing infrastructure made deep neural networks a suitable candidate for the current application. From a model architecture perspective, multi-scale convolutional networks for scene labeling (Farabet et al., 2013) and the pool-based pylon model for segmentation from trees (Lempitsky et al., 2011) are similar in their multi-scale approach on superpixels. However, the object classes in this case are more similar to one another than those on which results were obtained by direct application of either model. Applying multi-scale architectures here would, in principle, be detrimental to the performance of an autoencoder that is trained to be selective.
Soybean cyst nematodes (SCN) are unwanted microorganisms that compete with the roots of soybean plants for available nutrients, causing stunting, limiting nodulation for nitrogen fixation and therefore causing large yield losses (Grabau, 2011; Tylka, 2008). The cysts are formed by dead female worms which, before dying, have already secreted the eggs and still provide suitable conditions for their continued development. The challenge therefore is to isolate the eggs from the many other particles in a soil sample. The current practice is to manually identify and count such eggs under a microscope, which is extremely tedious, time-consuming and significantly error-prone. Unfortunately, numerous computer vision-based automation attempts have failed because the problem is highly nontrivial: SCN eggs are rare in a typical microscopic image frame and closely resemble various non-egg objects in those frames. Thus, isolating the rare SCN eggs from undesired non-egg particles in microscopic image frames (as shown in fig. 1) is a complex object detection problem that has enormous plant science implications.
In this context, this work attempts to develop an efficient, high throughput, end-to-end SCN egg detection solution using a novel convolutional selective autoencoder approach. The primary contributions of this work are summarized as follows:
- a novel selective autoencoder approach: training a deep convolutional autoencoder to suppress undesired parts of an image frame while allowing the desired parts through, resulting in efficient object detection;
- a demonstration of the efficacy of the proposed method on a new, impactful plant science application involving rare object detection in a microscopic image frame cluttered with disturbances that closely resemble the objects of interest (typically SCN eggs among all objects);
- differencing-enhanced non-maximum post-processing for improving detection performance.
1.1 Paper layout
This section introduced the paper and the motivation for the research. Section 2 describes the prior work underlying the platform and the inspiration for improvement. Section 3 is devoted to the formulation and description of the composite architecture. In section 4, dataset generation is discussed and the algorithm's implementation is presented. The results are analysed and discussed in section 5. The summary, conclusions and future directions are provided in the concluding section.
2 Background and motivation
2.1 Convolutional networks
Convolutional networks are discriminative models that rely primarily on local neighborhood matching for data dimension reduction using a nonlinear mapping (e.g. sigmoid, softmax, hyperbolic tangent). Each unit of a feature map shares common weights, or kernels, so that training is efficient with fewer trainable parameters than in fully connected layers; an additive bias is added to the result, which is then squashed. Feature extraction and classifier learning are the two main functions of these networks (LeCun et al., 1998). However, to learn the most expressive features, we have to determine the invariance-rich codes embedded in the raw data and then follow with a fully connected layer to further reduce the dimensionality of the data and map the most important codes to a low-dimensional representation of the examples. Many image processing and complex simulation tasks depend on the invariance property of the convolutional neural network stated in (LeCun & Bengio, 1998) to prevent overfitting by learning expressive codes. The feature maps are able to preserve local neighborhood patterns for each receptive field, as with the overcomplete dictionaries in (Aharon et al., 2006). The fully connected layers tend to complement the learned features by propagating only the highly active weights and serving as the classifier. A full and detailed review may be found in (LeCun et al., 1998), where the authors note the advantage of enforcing local correlation through convolution before spatio-temporal recognition. For efficient learning, convolutional networks are able to utilize distributed map-reduce frameworks (Fung & Mann, 2004) as well as GPU computing.
2.2 Selective autoencoders
Deep autoencoders typically extract hierarchical features leading to a relatively small code layer that captures succinct information about an input image, such that it can be reliably reconstructed through the decoding layers. Among various uses, denoising autoencoders (Vincent et al., 2008) are particularly interesting as they help denoise input images by reconstructing cleaner versions of them. With a similar motivation, we train a deep convolutional autoencoder to suppress undesired parts (non-egg objects) of an image frame while allowing the desired parts (egg objects) through, resulting in efficient object detection. In this process, an image frame is divided into many smaller patches. A patch is labeled as an egg patch when a full SCN egg is enclosed and centered in it; a negative example is a patch in which no egg is present or in which an egg is not both full and centered, as shown in fig. 2. A similar idea can be found in (Malisiewicz et al., 2011), where an exemplar-based SVM is applied to a neuro-psychological problem. Other similar formulations in (Keeler et al., 1991; Matan et al., 1992) are sometimes called centering. Also, in (LeCun et al., 1998), a segmentation graph was considered as prior knowledge with the aim of learning better features.
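The patch labeling rule described above can be sketched as follows. This is only an illustration: the function name, the `(top, left, bottom, right)` box format and the centering tolerance `tol` are assumptions, since the text does not state how strictly "centered" is interpreted.

```python
def is_egg_patch(patch_box, egg_box, tol=2):
    """Return True if the egg box lies fully inside the patch and is centered.

    Boxes are (top, left, bottom, right) in pixel coordinates.
    `tol` is a hypothetical tolerance (in pixels) on the center offset.
    """
    pt, pl, pb, pr = patch_box
    et, el, eb, er = egg_box
    # The egg must be completely enclosed by the patch.
    enclosed = pt <= et and pl <= el and eb <= pb and er <= pr
    # Compare the centers of the two boxes along each axis.
    dy = abs((pt + pb) / 2 - (et + eb) / 2)
    dx = abs((pl + pr) / 2 - (el + er) / 2)
    return enclosed and dy <= tol and dx <= tol


centered = is_egg_patch((0, 0, 16, 16), (4, 4, 12, 12))
off_center = is_egg_patch((0, 0, 16, 16), (0, 0, 8, 8))
partial = is_egg_patch((0, 0, 16, 16), (10, 10, 20, 20))
```

Under this rule, a patch containing a full but off-center egg, or only part of an egg, is treated as a negative example, in the same way as a patch with no egg.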
3 Algorithm description
Based on the performance of convolutional networks (convnets) on the several tasks reviewed, an end-to-end convolutional autoencoder (as shown in fig. 3) is designed for the current problem.
Given an image frame, $P$ patches of a fixed dimension were extracted from it, ensuring adequate localisation of the algorithm on frames. An original patch is denoted by $X_i$ and the corresponding label patch by $Y_i$, for $i = 1, \dots, P$. The constituent layers in the model learning steps from data are outlined as follows:
Preprocessing & patch labels: The data pairs of original and label patches are globally normalized together. Furthermore, in order to reduce the probability of false alarms while emphasizing only egg-like shape, size and pose, patches are blocked (considered negative examples) where the non-egg varieties are extremely similar in shape to eggs or where some eggs are only partly visible.
Convnet layers: At each convolution or deconvolution layer, filters of a chosen size are convolved with the patches to learn feature maps, from which a joint weight over the feature maps, useful for enforcing local correlation, is learnt to characterize all maps as follows,

$$X^{(l)} = f\left(W^{(l)} \star X^{(l-1)} + b^{(l)}\right),$$

where $f$ is the squashing function (a rectified linear unit is used here) and $\star$ is the convolution operator acting on the joint weights $W^{(l)}$ and the input from the previous layer $X^{(l-1)}$, with $b^{(l)}$ the biases.
To enhance the invariance further, pooling is done to select representative features in a local neighborhood. It ensures that the neuron activations in a locality do not all favor high entropy, in which case information gets diffused. In this formulation, max-pooling (Scherer et al., 2010) was used to select the representative of each neighborhood: each entry of the pooled feature map is the maximum of the input from the previous layer over one receptive field, with the receptive fields tiling the horizontal and vertical input dimensions.
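The convolution, squashing and max-pooling steps described above can be illustrated with a small NumPy sketch. It is not the Lasagne implementation used in this work: the function names, the single-map "valid" convolution (implemented as cross-correlation, as is common in convnet libraries) and the toy sizes are all assumptions chosen for readability.

```python
import numpy as np


def relu(x):
    """Rectified linear unit used as the squashing function."""
    return np.maximum(x, 0.0)


def conv2d_valid(x, w, b):
    """'Valid' 2-D cross-correlation of one map x with one kernel w, plus bias b."""
    kh, kw = w.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * w) + b
    return out


def maxpool2d(x, ph, pw):
    """Non-overlapping max-pooling; rows/columns that do not fill a window are dropped."""
    oh, ow = x.shape[0] // ph, x.shape[1] // pw
    x = x[:oh * ph, :ow * pw]
    return x.reshape(oh, ph, ow, pw).max(axis=(1, 3))


# Toy 6x6 patch and a hypothetical 2x2 kernel.
patch = np.arange(36, dtype=float).reshape(6, 6)
kernel = np.array([[-1.0, 0.0], [0.0, 1.0]])
feature_map = relu(conv2d_valid(patch, kernel, b=0.0))
pooled = maxpool2d(feature_map, 2, 2)
```

Because the pooling windows do not overlap, each pooled entry summarizes exactly one local neighborhood of the feature map.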
Corruption-induced fully connected layers: After flattening the feature maps from the convolution and subsampling layers, the input features to this layer, say $x$, are corrupted with random Gaussian noise to produce $\tilde{x}$. This was done to exploit some established benefits of the denoising autoencoder architecture proposed by (Vincent et al., 2008). The encoder is given by

$$h = f_e\left(W_e \tilde{x} + b_e\right)$$

and the decoder is given by

$$\hat{x} = f_d\left(W_d h + b_d\right),$$

where $f_d$ and $f_e$ stand for the decoder and encoder functions respectively, and are also rectified linear units (ReLU). The biases $b_e$ and $b_d$ are subscripted with a letter for each layer respectively, while the weights $W_e$ and $W_d$ are not necessarily tied. In the foregoing, the nonlinearity (Yuan, 2014), the rectified linear unit, is given by

$$f(z) = \max(0, z).$$
It intuitively has the advantage of maximizing the likelihoods whenever there is an egg patch.
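A minimal NumPy sketch of the corruption-induced fully connected encoder and decoder is given below. The layer sizes, the noise standard deviation `noise_std` and the random initialisation are illustrative assumptions, not values from this work; the weights are kept untied, as described above.

```python
import numpy as np


def relu(x):
    """Rectified linear unit."""
    return np.maximum(x, 0.0)


def denoising_forward(x, W_enc, b_enc, W_dec, b_dec, noise_std, rng):
    """Corrupt x with Gaussian noise, then encode and decode with untied weights."""
    x_tilde = x + rng.normal(0.0, noise_std, size=x.shape)  # corruption step
    h = relu(W_enc @ x_tilde + b_enc)                        # encoder
    x_hat = relu(W_dec @ h + b_dec)                          # decoder
    return x_tilde, h, x_hat


rng = np.random.default_rng(0)
x = rng.random(8)                                    # flattened input features
W_enc, b_enc = rng.normal(size=(4, 8)), np.zeros(4)  # hypothetical 8 -> 4 code layer
W_dec, b_dec = rng.normal(size=(8, 4)), np.zeros(8)  # hypothetical 4 -> 8 decoder
x_tilde, h, x_hat = denoising_forward(x, W_enc, b_enc, W_dec, b_dec, 0.1, rng)
```

During training, the reconstruction is compared with the clean target rather than the corrupted input, which is what encourages the layer to learn features that are robust to the noise.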
Unpool: In this layer, the pooled dimension is reversed by stretching and widening (Jones, 2015) the features identified by the filters of the previous layer. It also upscales the activations around the symmetry lines of each feature map, which are then optimized by the back-propagation algorithm.
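One simple way to realise the stretching and widening described above is to repeat each activation along both axes. The sketch below assumes this repetition-based upscaling; the function name and the factors are hypothetical, and other unpooling strategies exist.

```python
import numpy as np


def unpool2d(x, ph, pw):
    """Upscale a feature map by repeating each activation ph times vertically
    and pw times horizontally."""
    return np.repeat(np.repeat(x, ph, axis=0), pw, axis=1)


pooled = np.array([[1.0, 2.0],
                   [3.0, 4.0]])
restored = unpool2d(pooled, 2, 2)
```

With a factor of 2 in each direction, the 2x2 map above becomes a 4x4 map made of four constant 2x2 blocks, restoring the dimension that pooling removed.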
Error minimization: The training process includes a regularization function, as in (LeCun et al., 1998), without which the error profile would not generally be monotonically decreasing. Nesterov momentum-based (Sutskever et al., 2013) stochastic gradient descent was used for the reconstruction error updates, as it gave improved results compared to other optimization methods: the adaptive subgradient method, ADAGRAD (Duchi et al., 2011), and the adaptive learning rate method, ADADELTA (Zeiler, 2012). Let $\hat{Y}_i$ be the reconstructed output and $Y_i$ the label for patch $i = 1, \dots, P$, and let $\theta = \{W, b\}$ be the set of weights and biases for all layers. The loss function, which is minimized at each time step over all layers by the back-propagation algorithm, is expressed as

$$J(\theta) = E(\theta) + \lambda R(\theta),$$

where $\lambda$ is a parameter controlling the regularization function $R(\theta)$, a penalty accumulated over the weights, indexed by layer and by the dimension of the weight at each layer. This explicit regularization is used even though (Bengio, 2008) points out that SGD with early stopping is equivalent to an $\ell_2$ regularization. The mean square error training loss is given by

$$E(\theta) = \frac{1}{P} \sum_{i=1}^{P} \left\lVert \hat{Y}_i - Y_i \right\rVert^2.$$

The weights are then updated at each time step $t$ of the stochastic gradient descent, as explained in (LeCun et al., 1998), to be

$$\theta_{t+1} = \theta_t - \eta \frac{\partial J(\theta_t)}{\partial \theta},$$

where $\eta$ is the learning rate, the equivalent of the step size in optimisation problems.
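The regularized loss and the update rule can be sketched in NumPy as follows. The exact form of the penalty is not recoverable from the text, so the sum of L1 and L2 terms below is an assumption, as are the function names and the toy numbers. The Nesterov step follows the look-ahead formulation of (Sutskever et al., 2013).

```python
import numpy as np


def regularized_mse(y_hat, y, weights, lam):
    """Mean squared reconstruction error plus lam times a weight penalty.

    The penalty sums L1 and L2 terms over all weight arrays (an assumption).
    """
    mse = np.mean((y_hat - y) ** 2)
    penalty = sum(np.sum(np.abs(w)) + np.sum(w ** 2) for w in weights)
    return mse + lam * penalty


def sgd_step(w, grad, lr):
    """Plain gradient descent update with learning rate lr."""
    return w - lr * grad


def nesterov_step(w, v, grad_fn, lr, mu):
    """One Nesterov-momentum update: the gradient is evaluated at the look-ahead point."""
    v_new = mu * v - lr * grad_fn(w + mu * v)
    return w + v_new, v_new


loss = regularized_mse(np.array([1.0, 2.0]), np.array([1.0, 0.0]),
                       [np.array([1.0, -1.0])], lam=0.5)
w_plain = sgd_step(np.array([1.0]), np.array([2.0]), lr=0.1)
# Toy objective 0.5 * w**2, whose gradient is simply w.
w_new, v_new = nesterov_step(1.0, 0.0, lambda w: w, lr=0.1, mu=0.9)
```

Evaluating the gradient at the look-ahead point `w + mu * v` is what distinguishes the Nesterov update from classical momentum, which evaluates it at `w`.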
Further background details can be found in (Masci et al., 2011); the aspects described so far, together with those in section 4, are the most important ones and the improvements made in this work.
4 Dataset and implementation
The dataset is typically generated using a 1-inch-diameter soil probe to collect soil. Soil was collected during Fall 2015 by random placement of the soil probe within several farms in the state of Iowa exhibiting different levels of SCN infestation. Each soil sample was mixed together in a bag and washed with water. Purple dye was then applied to the soil, and the soil was sonicated to break apart the sac, releasing the SCN eggs. A small sample was put on a cover slip and images of the sample were taken using a camera through a microscope.
About a thousand images were collected using this protocol. These images were then labeled by trained plant pathologists. Labeling consisted of carefully screening each image and identifying the location of every SCN egg present in that image. To enable efficient labeling, a Matlab-based app with the GUI shown in fig. 1 was created to simplify identifying and marking the locations of SCN eggs in an image. The app design included a user-friendly way of selecting images, zoom functions, drawing a rectangular region of interest over each egg, saving the location and skipping an image if no SCN eggs were found. The app was deployed on a touch-screen-enabled device such as the Microsoft Surface Pro, allowing the plant pathologists who physically detect the eggs to use just their fingertips for rapid labeling. The bounding box of every SCN egg in the 644 images was extracted and stored.
Training set: The data are divided mainly into the cropped and labeled training frames and the labeled egg sets. Only a portion of the latter was used for training, while the remainder was randomly left out to test the model. For uniformity, the labeled sets were resized down to the patch size, while the unsegmented frames were patched and vectorized into patches of the same size, and both were concatenated. The total set available to train the model after transformation is shown in table 1. The transformation includes rotation of each egg to cover the variety of orientations in the input space, as this helps reduce the parameters to learn. The training dataset was split into a training part and a validation part.
|set type||original dimension||final dimension|
|s,t,r & l||rotations|
|s, r & l||rotations|
Training process: Training required that the learning rate be kept low, together with a momentum term, to prevent oscillations about the minima. The trade-off was that the model had to be trained for several more epochs. As stated earlier, regularization parameters were added to widen the parameter search space for locating the minima, since that helps to minimize the difference between the test and training errors.
The training was done on a Titan Black GPU with 2880 CUDA cores and 6GB of memory, using Theano (Bergstra et al., 2010) with the Lasagne and nolearn wrappers (Thoma, 2016) in Python, based on the improvements described in Section 3. Lasagne provides the layer details, nonlinearity types, objective function, Theano extension and more. Nolearn, on the other hand, is a coordinating library for the implementation of the layers in Lasagne, including the visualization aspects. In training, a filter size and a non-overlapping pooling size were chosen that were found experimentally to be less costly in producing the results. The algorithm was trained in batches of patches, with a batch size that was found to be suitable. The trained model had 743209 learnable parameters overall. Batch iterative training in the nolearn and Lasagne functions was replaced with Theano's LeNet 5 (LeCun et al., 1998) early stopping algorithm, which showed a further reduction in validation error relative to the training error.
In the progress plot shown in fig. 4, the effect of our regularizers is to artificially raise the training error above the validation error in order to accommodate more training epochs. More training epochs are required for the lowered learning rate, as pointed out earlier, while the learning rate was optimal for gradual convergence to the minimum achieved.
Two different models using receptive field were explored as shown in fig. 3.
Model 1: (compressed decoder) Used feature maps at the unpooling and deconvolution layers. The model limits the spreading of compressed information over many maps. However, it may discard information due to its limited capacity.
Model 2: (uncompressed decoder) Used feature maps at the unpooling and deconvolution layers. The model is capable of capturing more of the information from the fully connected layer, with a potential problem of high information entropy.
Testing set: The testing process used patches extracted from each frame at fixed strides, where the final vertical and horizontal numbers of patches are determined by the frame dimensions, the patch dimensions and the strides.
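The relation between frame size, patch size, stride and the number of test patches can be illustrated with a standard sliding-window sketch. The counting rule `(frame - patch) // stride + 1` is the usual one for windows that must lie fully inside the frame and is an assumption here, not necessarily the exact expression used in this work; the frame and patch sizes are toy values.

```python
import numpy as np


def extract_patches(frame, ph, pw, sy, sx):
    """Slide a ph-by-pw window over the frame with vertical stride sy and
    horizontal stride sx, keeping only windows that lie fully inside the frame."""
    n_rows = (frame.shape[0] - ph) // sy + 1
    n_cols = (frame.shape[1] - pw) // sx + 1
    patches = np.stack([
        frame[r * sy:r * sy + ph, c * sx:c * sx + pw]
        for r in range(n_rows) for c in range(n_cols)
    ])
    return patches, n_rows, n_cols


frame = np.zeros((10, 12))
patches, n_rows, n_cols = extract_patches(frame, 4, 4, 2, 2)
```

Halving the stride in both directions roughly quadruples the number of patches per frame, which is the source of the trade-off between detection accuracy and computation time.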
Post-processing algorithm: In order to reduce false alarms, especially when non-egg particles have a high degree of similarity with the eggs, two forms of post-processing around local neighborhoods in each frame were explored; they are described in Algorithm 1.
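As a rough illustration of post-processing around local neighborhoods, the sketch below applies a generic thresholded non-maximum suppression to a map of detection scores. It is not Algorithm 1 and does not include the differencing enhancement; the function name, the neighborhood `radius` and the `threshold` value are assumptions.

```python
import numpy as np


def local_maxima(score, radius, threshold):
    """Return (row, col) positions whose score exceeds the threshold and equals
    the maximum of the surrounding (2 * radius + 1)-wide neighborhood."""
    peaks = []
    h, w = score.shape
    for r in range(h):
        for c in range(w):
            r0, r1 = max(0, r - radius), min(h, r + radius + 1)
            c0, c1 = max(0, c - radius), min(w, c + radius + 1)
            if score[r, c] > threshold and score[r, c] == score[r0:r1, c0:c1].max():
                peaks.append((r, c))
    return peaks


score = np.zeros((5, 5))
score[1, 1] = 0.9   # strong response
score[1, 2] = 0.6   # weaker neighbor, suppressed by the response next to it
score[4, 4] = 0.8   # isolated response
peaks = local_maxima(score, radius=1, threshold=0.5)
```

Only the strongest response in each neighborhood survives, so several overlapping patches that fire on the same object yield a single detection.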
5 Results and discussions
In this section, performance of the convolutional selective autoencoder is presented and analyzed with respect to the SCN egg detection problem. The analysis is divided into two main parts: detection effectiveness (from an algorithmic perspective) and computation time and accuracy (from an application requirement perspective).
Before discussing the algorithm's detection effectiveness, a justification of the pipeline's ability to reproduce the patch-blocking training is shown in fig. 6.
5.1 Detection effectiveness
Egg detection results obtained from the convnet-based tool-chain for randomly chosen testing sets are shown in fig 7.
The results on plates a and b of fig. 7, where the algorithm captures only the eggs, show its effectiveness in suppressing the neighboring non-egg particles. In these cases, properties such as the shape, pose and illumination of the non-egg particles are reasonably different from those of the egg particles. This matters especially in the local neighborhood of the eggs, where highly similar non-eggs could easily have influenced the result negatively.
On plate c of fig. 7, both model types have one false alarm at different locations for an optimal post-processing threshold value. While the influence of local neighbors may not be large, the possibility that a human labeler mislabeled a non-egg particle may explain such an anomaly, as shown in the result of Model 2. For example, one may argue that the highlighted false alarm on plate d of fig. 7 is in fact an egg, indicating a strong possibility of human labeling error. Therefore, the tool-chain can potentially help experts identify some of their probable errors in identifying the eggs, as well as remove the bias in detection, since human decisions are subject to changes of interpretation. Model 2 generally seems to show more "false alarms", possibly due to its enlarged feature maps. This is also supported by the low probability it usually assigns to labeled ground-truth eggs that are detected by Model 1. The latter, on the contrary, usually detects eggs with high fidelity, and no doubtful misses were recorded.
An envisaged defect of our proposed framework would be the non-detection of boundary-situated eggs. To address this, an end correction scheme was added to the framework. This is a zero-padding scheme that extends each test frame beyond its boundaries so that eggs with edges directly on the boundary of the frame can be centered in at least some patches. In this study, we added padding of the same size as a patch on all sides of an image frame. This ensures that the patching process effectively covers an object on the image boundary, allowing the algorithm to enclose an egg completely with higher probability, at least once, for high-resolution post-processing.
Figure 8 shows a situation with a boundary-situated egg. The low activation in plate c causes a missed detection of the boundary-situated egg due to the lack of boundary padding, whereas end correction enables a successful detection, as shown in plate d.
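The end correction scheme amounts to zero-padding each test frame by one patch size on every side before patching. A minimal NumPy sketch, with a hypothetical function name and toy frame and patch sizes, is:

```python
import numpy as np


def pad_frame(frame, ph, pw):
    """Zero-pad a 2-D frame by one patch height (top and bottom) and one patch
    width (left and right)."""
    return np.pad(frame, ((ph, ph), (pw, pw)), mode="constant", constant_values=0)


frame = np.ones((6, 8))
padded = pad_frame(frame, 4, 4)
```

After padding, a window can be positioned so that an object touching the original frame boundary sits at its center, which is not possible for windows restricted to the unpadded frame.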
5.2 Computation time and accuracy
While training the convnet model with the large datasets described earlier takes several hours even with GPU parallelization, testing to identify the eggs in the frames can be a much faster, high-throughput operation. Testing patches were created with an adequate stride in order to reduce the number of eggs whose bounding box is missed. All the previous results were generated with a single stride setting. A larger stride, for instance, would mean a higher chance of eggs not being fully enclosed. Hence, with a lower patch stride, accuracy increases at the cost of increased complexity. The smallest possible stride, at constant frame and patch dimensions, should give the best possible accuracy. However, it would require large memory and computation time. Formally, the computation complexity for prediction due to the patching step can be described as:
Table 2 shows the detection times required for an image frame with different patch strides. In comparison, a well-trained expert plant pathologist may take on the order of minutes to examine a frame.
|stride||P - #of patches/frame||detection time(sec)|
In most test cases, the () stride provides a good trade-off between computation time and accuracy.
Typical performance metrics used for object detection tasks, such as the accuracy and the confusion matrix, may be inadequate due to the overwhelming presence of non-egg particles. Therefore, the following three performance metrics were formulated specifically for the current rare object detection problem:
- average detection accuracy, ADA
- average miss-to-egg ratio, AMER
- average non-eggs discarded, AND
Based on the metrics, the combined result for all test frames is shown in table 3. Note that the post-processing thresholds are chosen with a higher preference for detection accuracy than for lowering false alarms. The rationale is that with a low missed detection probability, the resulting frames can be quickly examined by the experts to reject the false alarms (which are drastically fewer in number than the non-egg objects in the original frames) and still have a reliable count of eggs.
It is clear that the models are very efficient in discarding most of the non-egg particles (metric AND). While the detection performance is also significantly high (metric ADA), false alarms are mostly caused by objects with characteristics very similar to those of eggs (metric AND). However, many of the false alarms can be caused by human labeling errors, and therefore the tool-chain may require a re-verification step by the experts to adaptively improve the model performance. Note that one of the assumptions of this selective autoencoder framework is that the patch size must be at least as large as the largest SCN egg to be detected.
6 Summary, conclusions and future work
An end-to-end convolutional selective autoencoder approach is developed for a complex rare object detection problem. Hyperparameters and model structures for the convolutional network are meticulously explored for a critical plant science problem: automated detection of SCN eggs in microscopic images of soil samples. The machine learning pipeline uses expert-labeled training examples (with the possibility of human errors) and can serve as a decision support tool that has the potential to save agricultural scientists enormous time in characterizing a significant disease affecting soybean yield in the United States. From a machine learning perspective, a major issue is that a typical image frame in this application mostly contains other objects that are extremely similar to the objects of interest (SCN eggs). Therefore, hand-crafting features becomes a very difficult proposition and deep learning becomes an appropriate choice. The following research areas are currently being pursued: (i) improving the pre-processing of the object patches by learning an optimal transformation for more efficient detection; (ii) adaptively fusing decisions from multiple deep architectures; (iii) exploring various unpooling strategies and interfacing classifiers at the fully connected layers; and (iv) learning to automatically count the rare objects within the automated pipeline.
The authors gratefully acknowledge the support of NVIDIA Corporation with the donation of the GeForce GTX TITAN Black GPU used for this research.
- Aharon et al. (2006) Aharon, Michal, Elad, Michael, and Bruckstein, Alfred. An algorithm for designing overcomplete dictionaries for sparse representation. IEEE Transactions on Signal Processing, 54(11):4311–4322, 2006.
- Bengio (2008) Bengio, Yoshua. Learning deep architectures for AI. Foundations and Trends in Machine Learning, pp. 1–71, 2008.
- Bergstra et al. (2010) Bergstra, James, Breuleux, Olivier, Bastien, Frederic, Lamblin, Pascal, Pascanu, Razvan, Desjardins, Guillaume, Turian, Joseph, Warde-Farley, David, and Bengio, Yoshua. Theano: a CPU and GPU math expression compiler. Proceedings of the Python for Scientific Computing Conference (SciPy), June 2010. Oral Presentation.
- Burges (1998) Burges, Christopher J.C. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2, 1998.
- Cheung (2012) Cheung, Brian. Convolutional neural networks applied to human face classification. ICMLA, 2(12):580–583, 2012.
- Collobert & Weston (2008) Collobert, Ronan and Weston, Jason. A unified architecture for natural language processing: deep neural networks with multitask learning. 25th International Conference on Machine Learning, pp. 1–7, 2008.
- Duchi et al. (2011) Duchi, John, Hazan, Elad, and Singer, Yoram. Adaptive subgradient methods for online learning and stochastic optimization. JMLR, 12:2121–2159, July 2011.
- Farabet et al. (2013) Farabet, Clement, Couprie, Camille, Najman, Laurent, and LeCun, Yann. Learning hierarchical features for scene labeling. exdb, pp. 1–15, 2013.
- Felzenszwalb et al. (2010) Felzenszwalb, Pedro F., Girshick, Ross B., McAllester, David, and Ramanan, Deva. Object detection with discriminatively trained part based models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(9), September 2010.
- Fung & Mann (2004) Fung, James and Mann, Steve. Using multiple graphics cards as a general purpose parallel computer: Applications to computer vision. International Conference on Pattern Recognition, ICPR, 1:805–808, August 2004.
- Grabau (2011) Grabau, Joseph Zane. Management strategies for control of soybean cyst nematode and their effect on the nematode community. Master’s thesis, University of Minnesota, St Paul, MN 55108-6068, June 2011.
- Jones (2015) Jones, Swarbrick. Convolutional autoencoders in python/theano/lasagne, April 2015. URL https://swarbrickjones.wordpress.com/.
- Keeler et al. (1991) Keeler, James, Rumelhart, David, and Leow, Wee-Kheng. Integrated segmentation and recognition of hand printed numerals. NIPS, 4(397):557–563, 1991.
- LeCun & Bengio (1998) LeCun, Yann and Bengio, Yoshua. Convolutional networks for images, speech and time-series. In The Handbook of Brain Theory and Neural Networks. MIT Press, 1998.
- LeCun et al. (1998) LeCun, Yann, Bottou, Leon, Bengio, Yoshua, and Haffner, Patrick. Gradient-based learning applied to document recognition. Proc of IEEE, pp. 1–46, November 1998.
- Lempitsky et al. (2011) Lempitsky, Victor, Vedaldi, Andrea, and Zisserman, Andrew. Pylon model for semantic segmentation. NIPS, pp. 1–9, 2011.
- Malagon-Borja & Fuentes (2007) Malagon-Borja, Luis and Fuentes, Olac. Object detection using image reconstruction with PCA. Elsevier Image Vision Computing, 03(004):1–8, March 2007.
- Malisiewicz et al. (2011) Malisiewicz, Tomasz, Gupta, Abhinav, and Efros, Alexei A. Ensemble of exemplar-SVMs for object detection and beyond. ICCV, 2011.
- Martinez & Kak (2001) Martinez, Aleix M. and Kak, Avinash C. PCA versus LDA. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(2):228–233, February 2001.
- Masci et al. (2011) Masci, Jonathan, Meier, Ueli, Ciresan, Dan, and Schmidhuber, Jurgen. Stacked convolutional auto-encoders for hierarchical feature extraction, 2011.
- Matan et al. (1992) Matan, Ofer, Burges, Christopher, LeCun, Yann, and Denker, John. Multi-digit recognition using a space displacement neural network. NIPS, 557:488 – 495, 1992.
- Scherer et al. (2010) Scherer, Dominik, Muller, Andreas, and Behnke, Sven. Evaluation of pooling operations in convolutional architectures for object recognition. International Conference on Artificial Neural Networks, ICANN, pp. 1–10, 2010.
- Sercu et al. (2015) Sercu, Tom, Puhrsch, Christian, Kingsbury, Brian, and LeCun, Yann. Very deep multilingual convolutional neural networks for lvcsr. arXiv:1509.08967v1 [cs.CL], pp. 5, September 2015.
- Sutskever et al. (2013) Sutskever, Ilya, Martens, James, Dahl, George, and Hinton, Geoffrey. On the importance of initialization and momentum in deep learning. International Conference on Machine Learning, JMLR, 28, 2013.
- Thoma (2016) Thoma, Martin. Lasagne for python newbies, February 2016. URL https://martin-thoma.com/lasagne-for-python-newbies/.
- Tylka (2008) Tylka, Gregory L. Soybean nematode management: Field guide. online, 2008. URL http://www.iasoybeans.com/sites/default/files/production-research/scnfieldguide.pdf.
- Vincent et al. (2008) Vincent, Pascal, Larochelle, Hugo, and Bengio, Yoshua. Extracting and composing robust features with denoising autoencoders. Proceedings of the 25th International conference on Machine Learning-ICML ’08, pp. 1096–1103, 2008.
- Yuan (2014) Yuan, Eric. Convolutional neural networks, March 2014. URL http://eric-yuan.me/cnn/.
- Zeiler (2012) Zeiler, Matthew D. Adadelta: an adaptive learning rate method. arXiv:1212.5701v1, pp. 1–6, December 2012.