Image-based dense object tracking is of broad interest in the monitoring of crowd movement as well as the study of collective behavior in biological systems Li et al. (2015). Automated recognition of individuals in a dense group based on video recordings would allow for the efficient implementation of monitoring and tracking frameworks without additional manual labeling or tracking devices, which are often either impractical or invasive. The challenges in image-based dense object recognition include occlusions and variability in viewpoints and individual appearance. However, recent progress in convolutional neural networks (CNNs) for image segmentation Long et al. (2015), scene analysis Pinheiro and Collobert (2014), and object detection Dai et al. (2014); He et al. (2015); Sermanet et al. (2013); Ren et al. (2016) represents a promising development towards dense object detection and tracking. Here we apply these tools to a classical unsolved problem in behavioral ecology: the identification of individual organisms in a honeybee hive.
Honeybees have long drawn fascination and the study of their behavior has yielded important insights into animal communication, physiology, and neuroscience von Frisch (1967); Seeley (2010); Winston (1991); Karaboga and Akay (2009). Honeybees also provide an excellent model system for the study of social behavior as they can be viewed in the natural environment of an observation hive (Fig. 1). However, the complexity of a hive environment presents significant challenges for automated image-based analysis and previous techniques have shown only limited success, particularly under natural conditions Florea (2013); Hendriks et al. (2012); Kimura et al. (2011, 2014); Wario et al. (2015). A typical colony consists of hundreds or thousands of closely packed, often occluded, and continually moving individuals. The bees are unevenly distributed over a complex background, the honeycomb, which consists of a variety of different cells containing nectar, pollen, and brood in various stages of development. If tracking difficulties can be resolved, however, automated image analysis would easily surpass human observers by simultaneously following large numbers of organisms, thus permitting sophisticated studies of social behavior including subtle effects of genetic and molecular perturbations.
Leveraging high-resolution images of an observation beehive, we present a method for recognizing and localizing individuals across frames of a video recording. We devise a problem-specific individual labeling, adapt a previously proposed segmentation architecture, and expand its functionality to infer individual bee orientation on the comb. We then strengthen this approach by combining image data from successive time frames in a recurrent manner, allowing for a substantial reduction in computational cost without compromising accuracy. As no labeled data for this problem exist, we took advantage of the distributed online marketplace of Amazon Mechanical Turk (AMT) to create extensive training data at modest cost. Our method achieves detection accuracy comparable to human performance on this real-world dense object image data. Finally, we demonstrate the usefulness of our detection techniques towards a full tracking solution by producing exemplar trajectories with simple registration methods.
While there have been numerous computer-tracking approaches for the study of social insects, most rely on marking individuals, either with simple spots placed on a few individuals Biesmeijer and Seeley (2005), or with more complex barcode tags that distinguish a higher number of individuals Mersch et al. (2013); Wario et al. (2015). Tagging is an obvious solution to recognizing individuals in a dense environment; however, it is laborious, inapplicable to other systems, and impractical on a whole-colony scale. As new individuals emerge, it becomes impossible to mark them without opening and significantly disrupting the colony. Additionally, tag recognition fails in situations of partial tag occlusion or viewpoint change Wario et al. (2015). Due to similar difficulties, previous studies of human crowd tracking were limited to a few individuals Kratz and Nishino (2010); Ali and Dailey (2009) or relied on priors about collective motion to aid the performance of tracking algorithms Ge et al. (2012); Rodriguez et al. (2011).
A necessary step towards efficient, image-based dense object tracking is the capacity for individual detection in each frame of a video recording. Recent advances in CNNs have demonstrated their capability to detect and recognize objects in an image (e.g., Girshick (2015)). Such object detection methods typically involve an exhaustive sliding-window search Sermanet et al. (2013) or a variety of region-based proposals Hosang et al. (2016). The detection step is then followed by Sermanet et al. (2013) or coupled with Ren et al. (2016); Pinheiro et al. (2015) classification of the detected object in the proposed box-shaped region Sermanet et al. (2013); He et al. (2015) or a masked patch Dai et al. (2014); Pinheiro et al. (2015). Such two-step or two-function architectures were designed for images containing multi-class, highly variable, and sparse objects.
In contrast, images of honeybee colonies, cells, or human crowds can contain large numbers of densely packed and highly similar individuals of the same category. In these cases, region-based detection proposals can produce a large list of candidate regions, possibly even covering the entire image, with distinct objects sharing the same bounding box or mask. Additionally, as each image contains a large number of relatively small objects, keeping the initial image resolution is important for precise object localization. Approximate bounding-box estimation Sermanet et al. (2013), as well as image rescaling Pinheiro et al. (2015), can result in a location-estimation error margin that is too large for distinguishing among individuals.
Fully convolutional networks Long et al. (2015) allow for image segmentation and categorization at the individual pixel level. These networks are proposal-free and produce label maps for the entire image at its original resolution. Within this framework each pixel is attributed a category; however, differentiation between instances of objects of the same category is not possible. Instance-aware segmentation has previously been proposed Dai et al. (2015), introducing box-level instance proposals. Images of high-density objects with numerous adjacent individuals require methods that recognize object instances accurately and efficiently, independent of the number of instances present in the image.
More recently, deep recurrent neural networks (RNNs) were introduced to resolve the task of state estimation with application to the problem of multi-object tracking Ondruska and Posner (2016). Using simulated and real laser sensor measurements, this work aimed at predicting the current, unoccluded, complete scene given a sequence of observations capturing only partial information about the scene. A generative probabilistic model inspired by Bayesian filtering Chen (2003) was proposed and framed as an RNN architecture allowing for accurate scene estimation even when presented with incomplete observations. The efficacy of this approach, however, was demonstrated entirely on simulated data or simple near-perfect sensor measurements with smooth, linear motion generating black-and-white images, where object detection is not part of the tracking task. Here we test the strength of the Bayesian filtering concept on real-world image data comprising dense and cluttered objects with unknown motion dynamics.
We propose a solution integrating the fully convolutional neural network U-Net Ronneberger et al. (2015) (Fig. 2) with a recurrent component for accurate object detection in a video sequence. To allow for object instance recognition, we defined an adapted labeling covering only the central part of each individual, non-adjacent to the labels of other individuals. We demonstrate the capacity of the network to accurately reproduce these labels, which additionally allow for recognition of the main axis of each individual. To further indicate the head direction on the main body axis, we propose a loss function approximating the individual orientation angle and expand the foreground-background segmentation with object orientation estimation. In addition, the recurrent component of the network leverages the information encoded in the video sequence and improves accuracy, while keeping the network at a fraction of the size of the original U-Net. Our proposed approach can localize individuals and recognize their orientation in successive frames of a video recording efficiently, in one iteration, without a separate region proposal, sliding window, or masking step, thus providing an important foundation for individual object tracking in a dense group.
Imaging experiment and dataset
Image data were generated from high-resolution video recordings of a custom-designed observation beehive in which a honeybee colony was placed on one side of a comb, covered with transparent glass, and illuminated with infrared light, which is imperceptible to the bees (Fig. 1). In brief, the hive was situated on the roof of a laboratory building at OIST Graduate University within a prefabricated room of size 3.6 m x 2.7 m x 2.3 m. The temperature was kept constant and the humidity between 30% and 40%. An entrance/exit pipe 20 cm long connected the hive to the outside. We used a Vieworks Industrial Camera VC series VC-25MX-M72D0-DIN-FM (CMOS sensor, 25 Megapixels, CoaXPress interface, monochrome, F-mount, image size 5120 x 5120 pixels) located 1 m from the hive, so that a typical bee body covered pixels. The glass surface covered 51 cm x 51 cm. Infrared LEDs operating at 850 nm were mounted around the camera at an angle to avoid reflections. We placed LEDs on four 23 cm x 22 cm panels, each equipped with 14 strips (6 LEDs per strip) for a total of 84 LEDs per panel, generating 13.4 W per panel. Additionally, we used one high-power infrared spot made of 3 LEDs (ENGIN LZ4-00R608) operating at 850 nm and generating 9 W. Image data were streamed through four optic fibers to a server where they were compressed without loss and stored with custom software. The resulting images are grayscale with 8-bit encoding. The data analyzed here come from two video recordings at two different frame rates. For the higher frame rate, the infrared light intensity was doubled to compensate for the shorter exposure time. Imaged colonies typically contain more than 500 individuals.
A subset of bee instances was additionally submitted for labeling 10 times by independent workers to obtain an estimate of human error in position and angle labeling. This error was calculated as the standard deviation of the distance of each of the 10 labels to the reference label used in the main dataset for training and testing.
For the original image (a), we show training labels marking fully visible bees (b, blue) and abdomens of bees inside honeycomb cells (b, red). The rest of the image is background. (c) Results of the segmentation network, in which each pixel of the input image is classified with a background, bee, or abdomen label. (d) Predicted locations and body-axis estimations (d, red) compared to human labeling (d, yellow). For each contiguously labeled region, the predicted bee location was calculated as the centroid and the predicted body axis as the angle of the first principal component. Regions representing abdomens are drawn as circles, as their orientation is ambiguous. Two unlabeled false positives (FPs) are present in this example at the image boundary, as well as a questionable class-label mismatch: a partially visible bee was labeled as fully visible (blue class label) but predicted as a bee abdomen (red class label).
As the outcome of annotation, every individual in an image (e.g. Fig. 3a) was assigned the coordinates of its central point relative to the top-left corner of the image, a label type (one type when the full bee body is visible and another when the bee is inside a comb cell), and the body rotation angle measured clockwise against the vertical pointing upwards. To use this information for segmentation-based individual localization, we generated regions centered over the central point of each bee. For fully visible bees the regions were ellipses with fixed semi-minor and semi-major axes, rotated by the body angle (Fig. 3b); for bees inside comb cells the regions were circles of fixed radius. Regions of this shape and size cover the central part of each bee and are non-adjacent to the regions covering neighboring bees in the image.
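The rotated elliptical label regions described above can be rasterized as binary masks. The sketch below is a minimal illustration; the semi-axis lengths are placeholder values, not the sizes used in the paper, and the angle convention (clockwise from the upward vertical, in image coordinates with y increasing downward) follows the labeling described in the text.

```python
import numpy as np

def ellipse_label_mask(h, w, cx, cy, angle_deg, semi_minor, semi_major):
    """Binary mask of a rotated ellipse centered on a bee's central point.

    The angle is measured clockwise from the vertical pointing upwards,
    matching the labeling convention described in the text.
    """
    ys, xs = np.mgrid[0:h, 0:w]
    # Shift pixel coordinates to the ellipse center.
    x, y = xs - cx, ys - cy
    t = np.deg2rad(angle_deg)
    # Coordinate along the body (major) axis and across it (minor axis);
    # note y grows downward in image coordinates.
    u = x * np.sin(t) - y * np.cos(t)
    v = x * np.cos(t) + y * np.sin(t)
    return (u / semi_major) ** 2 + (v / semi_minor) ** 2 <= 1.0
```

For a bee at (25, 25) oriented straight up, `ellipse_label_mask(50, 50, 25, 25, 0, 4, 9)` produces a vertical ellipse 9 pixels long and 4 pixels wide from the center.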
To compensate for the class imbalance between foreground bee regions and the non-bee background, we generated weights used for balancing the loss function at every pixel. For every bee region, a 2D Gaussian of the same shape was generated, centered over the bee's central point. In the class segmentation task, the Gaussian was scaled by the ratio in the training set of background pixels to bee-region pixels of the given type; in the orientation-angle task, it was scaled by the ratio of background pixels to bee-region pixels of any type.
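The per-pixel weight map can be sketched as follows. For simplicity this version uses isotropic Gaussians, whereas the text describes Gaussians matching the rotated elliptical region shape, and the class-balance ratio is passed in as a precomputed constant; both simplifications are assumptions for illustration.

```python
import numpy as np

def gaussian_weight_map(h, w, centers, sigma, bg_to_fg_ratio):
    """Per-pixel loss weights: background pixels weigh 1, and each bee
    region is up-weighted by a 2D Gaussian centered on the bee's central
    point and scaled by the background/foreground pixel-count ratio."""
    weights = np.ones((h, w), dtype=np.float64)
    ys, xs = np.mgrid[0:h, 0:w]
    for cx, cy in centers:
        g = np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))
        weights += bg_to_fg_ratio * g
    return weights
```

At a bee's central point the weight approaches 1 + ratio, decaying back to the background weight of 1 away from any bee.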
Training and testing datasets were organized in two ways. First, a set of images randomly sampled in equal proportions from the two recordings was used as the test set. Second, the images were organized in 60 sequences of 360 images each. In this time-series data, the first 324 images of each sequence were used for training and the remaining 36 for testing.
Network and training
We used the U-Net Ronneberger et al. (2015) segmentation architecture. The number of filters in the initial convolutional layer was doubled after every pooling layer in the contracting path and divided by 2 after each deconvolution in the expansive path (Fig. 2). The convolution kernel size was set to 3.
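As an illustration of this architecture, below is a minimal PyTorch sketch of a reduced U-Net with 32 first-layer filters and kernel size 3. The depth shown, the padding choice, and the three-class output head are illustrative assumptions, not the exact published configuration (which was implemented in Caffe2).

```python
import torch
import torch.nn as nn

def conv_block(c_in, c_out):
    # Two 3x3 convolutions per level, as in U-Net; padding keeps spatial size.
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(c_out, c_out, 3, padding=1), nn.ReLU(inplace=True))

class SmallUNet(nn.Module):
    """Reduced U-Net sketch: filters double after each pooling in the
    contracting path and halve after each up-convolution in the expansive
    path, starting from 32 filters."""
    def __init__(self, n_classes=3, base=32):
        super().__init__()
        self.down1 = conv_block(1, base)
        self.down2 = conv_block(base, base * 2)
        self.bottom = conv_block(base * 2, base * 4)
        self.pool = nn.MaxPool2d(2)
        self.up2 = nn.ConvTranspose2d(base * 4, base * 2, 2, stride=2)
        self.dec2 = conv_block(base * 4, base * 2)
        self.up1 = nn.ConvTranspose2d(base * 2, base, 2, stride=2)
        self.dec1 = conv_block(base * 2, base)
        self.head = nn.Conv2d(base, n_classes, 1)

    def forward(self, x):
        d1 = self.down1(x)
        d2 = self.down2(self.pool(d1))
        b = self.bottom(self.pool(d2))
        # Skip connections: concatenate encoder features with upsampled ones.
        u2 = self.dec2(torch.cat([self.up2(b), d2], dim=1))
        u1 = self.dec1(torch.cat([self.up1(u2), d1], dim=1))
        return self.head(u1)
```

The network maps a grayscale input to a per-pixel label map of the same spatial size, as required for segmentation at the original resolution.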
We first trained the network for foreground-background segmentation with the loss function defined as a 3-class softmax scaled by the class imbalance in the entire training set. Next, we expanded the task to estimating the orientation of each individual. Each foreground pixel, instead of a class label, was set to the value of the bee's rotation angle, and background pixels were assigned a constant background value. Class identity was not used in this expanded task. The loss function was defined as a class-weighted penalty on the difference between the predicted and labeled orientation angles.
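A minimal sketch of such a weighted orientation loss follows. The unit-circle (cos, sin) comparison used here to avoid the 0°/360° wrap-around is an assumption for illustration, not necessarily the paper's exact functional form.

```python
import numpy as np

def orientation_loss(pred_angle, label_angle, weight, fg_mask):
    """Weighted angular regression loss averaged over foreground pixels.

    Angles (in degrees) are compared through their unit-circle
    representation so that 359 and 1 degrees are treated as close.
    """
    p = np.deg2rad(pred_angle)
    l = np.deg2rad(label_angle)
    # Squared chord distance on the unit circle; 0 when angles agree,
    # maximal (4) when they point in opposite directions.
    d = (np.cos(p) - np.cos(l)) ** 2 + (np.sin(p) - np.sin(l)) ** 2
    return float(np.sum(weight * d * fg_mask) / max(np.sum(fg_mask), 1))
```

With unit weights, identical angles give a loss of 0, and predictions opposite to the labels give the maximal per-pixel penalty of 4.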
In the network output, each contiguous foreground region was interpreted as an individual bee. Foreground patches much smaller or larger than the label size were discarded. The centroid location was calculated as the mean of the x- and y-coordinates of the points in each region. The main body axis was calculated as the angle of the first principal component of the points in each region. In the segmentation task, the region class was assigned as the class identity of the majority of pixels within the region. In the orientation recognition task, the predicted angle was calculated as the top quantile of all values predicted in the given foreground region. This strategy was motivated by the observation that the orientations at the outer edges of a region are often underestimated, and that the highest value found within a region is closest to the labeled orientation angle. Besides serving as an independent prediction, the orientation angle was used to assign back and front to the region's principal axis.
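The per-region post-processing (centroid, principal-axis angle, and top-quantile orientation) can be sketched as follows, assuming connected-component extraction has already produced a boolean mask for one region; the 0.95 quantile is an illustrative choice.

```python
import numpy as np

def region_pose(region_mask, angle_map, quantile=0.95):
    """Centroid, principal-axis angle, and top-quantile orientation for one
    contiguous foreground region of the network output."""
    ys, xs = np.nonzero(region_mask)
    cx, cy = xs.mean(), ys.mean()
    # First principal component of the pixel coordinates gives the body axis.
    pts = np.stack([xs - cx, ys - cy])
    cov = pts @ pts.T / len(xs)
    evals, evecs = np.linalg.eigh(cov)
    vx, vy = evecs[:, np.argmax(evals)]
    # Undirected axis angle, clockwise from the upward vertical, in [0, 180).
    axis_angle = np.degrees(np.arctan2(vx, -vy)) % 180.0
    # Top quantile of per-pixel angle predictions: edge pixels tend to
    # underestimate the orientation, so the highest values are most reliable.
    angle = float(np.quantile(angle_map[region_mask], quantile))
    return (cx, cy), axis_angle, angle
```

For a horizontal strip of foreground pixels, the recovered axis angle is 90° from the vertical, and the centroid is the strip's midpoint.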
We additionally expanded the functionality of U-Net to take advantage of regularities in the image time series. In each pass of network training or prediction, the penultimate layer was kept as a prior for the next pass. In the following pass, the next image in the time sequence was used as input and the penultimate layer was concatenated with the prior representation before calculating the network output.
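A minimal PyTorch sketch of this recurrent extension: an output head that concatenates the current penultimate feature map with the one retained from the previous frame. The module structure and the zero-initialized prior for the first frame are illustrative assumptions.

```python
import torch
import torch.nn as nn

class RecurrentHead(nn.Module):
    """Output head that fuses the current penultimate feature map with the
    one kept as a prior from the previous frame in the sequence."""
    def __init__(self, channels, n_out):
        super().__init__()
        # 1x1 convolution over the concatenated (current + prior) features.
        self.out = nn.Conv2d(channels * 2, n_out, 1)
        self.prior = None

    def forward(self, features):
        if self.prior is None:
            # First frame of a sequence: no prior information yet.
            self.prior = torch.zeros_like(features)
        y = self.out(torch.cat([features, self.prior], dim=1))
        # Keep the current features as the prior for the next frame.
        self.prior = features.detach()
        return y
```

Feeding consecutive frames through the same head lets each prediction draw on both the current and the previous time point, as described in the text.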
Adaptive moment estimation Kingma and Ba (2014) was used during training. Method accuracy was estimated in terms of the capacity to correctly recognize and localize all individuals in an image. We implemented the CNN using Caffe2.
We first tested whether individual recognition can be accomplished with the chosen segmentation architecture and two classes of foreground pixels: those that are part of fully visible bees and those that are part of the abdomens of bees inside honeycomb cells. We found that the original U-Net architecture resulted in substantial overfitting and an increase in the loss function on the test set (Supplemental Fig. S4); hence we reduced the size of the U-Net by using 32 filters in the first convolutional layer and eliminating one pooling and one deconvolution layer. This reduced U-Net contained a small fraction of the parameters of the original U-Net. Decreasing the number of parameters diminished overfitting. Even so, overfitting was still observable in the reduced network (Supplemental Fig. S4). We also tested different regularization scenarios using weight decay and dropout Srivastava et al., none of which achieved satisfactory performance within a feasible span of training (Supplemental Fig. S4). This could be due to the fact that in fully convolutional neural networks, such as U-Net, there is no fully connected layer on which dropout is usually performed. Unlike fully connected layers, convolutional layers have a small number of parameters compared to the size of their feature maps. Hence, convolutional layers are believed to suffer less from overfitting and, even though dropout has shown its effectiveness in convolutional layers in some cases Park and Kwak; Springenberg et al. (2014), its effect there has not been studied thoroughly.
We therefore used early stopping in training as a measure against overfitting. In the following, we report network performance at a fixed number of training iterations, selected based on the increase of the test loss of the segmentation network. We apply this stopping criterion to the training of the segmentation network as well as to the orientation-finding and recurrent networks described below.
| | TP | d [pixel] | class | FP | angle | axis | oriented axis |
|---|---|---|---|---|---|---|---|
| Human labeling | - | 0.15 (0.07) | 0.04 | 6.7 | 7.7 | - | - |
| Segmentation | 0.96 | 0.21 (0.12) | 0.19 | 5.9 | - | 13.3 (11.2) | - |
| Orientation | 0.94 | 0.18 (0.10) | - | 5.6 | 34.0 (32.2) | 15.7 (13.1) | 22.1 (16.7) |
| Recurrent orientation | 0.96 | 0.14 (0.06) | - | 5.1 | 15.2 (13.3) | 10.6 (8.8) | 12.1 (9.7) |
The segmentation network predicted individual location with sub-pixel precision on average (Table 1), which is similar to the variability among human-assigned labels and much less than a typical bee width. While the class prediction was also accurate, there was a seemingly high number of false positives (FPs). However, we noted that most FPs occur at the image boundary, where only an incomplete object is visible (e.g. Fig. 3d, Supplemental Fig. S5). Similarly, the disagreement among human labelers was highest at the image boundaries, with most disagreements located within narrow pixel margins of the image (Supplemental Fig. S6-S7). Therefore, in a comprehensive tracking solution, the number of FPs can be reduced by discarding the boundary regions and using overlapping image patches. Additionally, we noticed multiple examples of FPs that, upon closer inspection, were instead abdomens of bees inside cells that were difficult for human raters to spot (see e.g. Supplemental Fig. S8). Therefore, among the FPs predicted as bee abdomens, we expect some to be unlabeled true positives (TPs). Foreground class identity – full bee body vs. abdomen of a partially visible bee – was incorrectly assigned in a small fraction of cases; note, however, that the distinction between the two can often be disputable (e.g. Fig. 3).
We used the elliptical shapes of the segmented regions representing bee bodies to deduce the main body-axis orientation. In particular, we found that the first principal component of the segmented patches provided a relatively precise approximation of each individual's orientation, with only a small average difference from the labeled axis orientation (Fig. 3d, Table 1).
Location and orientation recognition
We expanded the segmentation network into an architecture appropriate for the estimation of object orientation angle instead of object category. In this approach, foreground class labels were exchanged for the object instance's orientation angle. This architecture produced performance similar to the segmentation network, with a high TP rate and accurate body-axis recognition based on the label shape (Table 1), suggesting that changing the label and loss function did not affect the foreground-background segmentation accuracy.
For the orientation angle, we observed that the error distribution exhibited a small constant baseline component indicative of random predictions (Supplemental Fig. S9), and to avoid the undue influence of outlier values we report the median error in our results (Table 1). This baseline error can be partially explained by uncertainties at image boundaries, as well as by the variability of angle labeling among human raters (Supplemental Fig. S10-S11), and is not related to the bee density in the image. The orientation-angle prediction error is proportionally similar to the axis error, given that the head-tail orientation error can range up to 180° while the axis error can range only up to 90°. Notably, rotation invariance of CNNs is an unresolved question and more complicated solutions have been proposed to address it Dieleman et al. (2015); Marcos et al. (2016); Worrall et al. (2016). It is therefore encouraging that a relatively simple loss function with a reduced U-Net segmentation network allows for approximating the orientation of densely packed honeybees. Moreover, the predicted orientation angle can be used merely to indicate the head-tail direction on the axis estimated from the shape of the label. In this way we obtained a reduced orientation error of the directed axis with this network (Table 1).
Recurrent detection and tracking
We inspected whether regularities in object appearance and movement across time could improve the orientation-angle prediction. Image data were organized in a time sequence and, in successive iterations of training and testing, consecutive images in the sequence were fed as network input. In each iteration, the penultimate layer of the network was kept as the representation of a prior and concatenated with the penultimate-layer representation of the following image in the sequence in the next iteration of training or testing. In this way, the network output was a result of the information at both the previous and current time points.
Indeed, we found that incorporating time-series image data reduced the error in orientation-angle prediction two-fold and the axis-prediction error to roughly two-thirds of its previous value. The orientation error obtained by orienting the body axis with the predicted orientation angle was also markedly reduced (Fig. 4, Table 1), which is significantly better than the non-recurrent approach (Kruskal-Wallis test) and only marginally higher than the variability observed among human raters.
Finally, to explore whether our bee detection results could provide the foundation for fully automated image-based tracking, we used elementary methods to reconstruct bee trajectories. We matched the closest individuals in consecutive time points and, when a trajectory was lost, searched up to five frames ahead for a close match that could complete it. Each individual's position, orientation angle, and velocity were taken into account in the matching. Additionally, short trajectories beginning or ending in the central parts of the image were discarded as potential FPs. As we have no ground-truth labels for the individuals' trajectories in our data, we cannot yet quantitatively assess the accuracy of the trajectory estimation performed in this way. We note, however, many examples that appear relatively complete (Fig. 5, Supplemental Movie) among the 60 sequences of 36 frames of the test set.
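The naive matching described above can be sketched as a greedy nearest-neighbor linker. The distance threshold, the gap handling, and the use of position alone (rather than position, orientation angle, and velocity) are simplifying assumptions for illustration.

```python
import numpy as np

def link_detections(frames, max_dist=30.0, max_gap=5):
    """Greedily link per-frame detections into trajectories.

    `frames` is a list of arrays of (x, y) bee positions, one per frame.
    A track that finds no match within `max_dist` stays open for up to
    `max_gap` frames before it is closed, mirroring the look-ahead search
    for lost trajectories described in the text.
    """
    closed, open_tracks = [], []
    for t, dets in enumerate(frames):
        dets = np.asarray(dets, dtype=float)
        unused = set(range(len(dets)))
        for tr in open_tracks:
            if not unused:
                break
            cand = list(unused)
            d = np.linalg.norm(dets[cand] - tr['pts'][-1], axis=1)
            j = int(np.argmin(d))
            if d[j] <= max_dist:
                tr['pts'].append(dets[cand[j]])
                tr['last_t'] = t
                unused.discard(cand[j])
        # Close tracks lost for more than max_gap frames.
        still_open = []
        for tr in open_tracks:
            (still_open if t - tr['last_t'] <= max_gap else closed).append(tr)
        open_tracks = still_open
        # Unmatched detections start new tracks.
        for j in unused:
            open_tracks.append({'pts': [dets[j]], 'last_t': t})
    return [tr['pts'] for tr in closed + open_tracks]
```

Two well-separated bees drifting smoothly across three frames yield two trajectories of three points each.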
Accurate individual recognition is an important step towards automated dense object tracking. Here we described an approach for recognizing all individual bees and their orientations in the natural environment of a densely packed honeybee comb. We leverage the power of current segmentation architectures and design a labeling that encodes additional information about the segmented object, both in label shape and value, which allowed us to accurately indicate each individual's position and orientation. We additionally enhance our recognition approach through a recursive framework that brings the improved accuracy near the level of human labeling, at a strongly reduced computational cost.
While our principal advance is dense object detection, our results are an important step towards individual trajectory reconstruction, as demonstrated with our naive matching approach. Of course, quantitative trajectory reconstruction requires algorithms and analysis beyond the scope of this manuscript. Nevertheless, the positive examples achieved through this simple matching approach, even at a low frame rate, demonstrate that our results provide important steps towards fully automated image-based dense object tracking.
Finally, we suggest that the environment of a honeybee hive offers an excellent model system for the development of tracking approaches. The hive is dense and complex though still tractable for labeling and offers unparalleled access for video recordings. We expect our work to foster significant advances in the quantitative study of this important social organism. In addition, our labeled dataset can be used for the development of other image-based tracking methods and the flexibility of CNN-based segmentation will allow our approach to be usefully applied to a variety of systems.
Funding for this work was provided by the OIST Graduate University to ASM and GS. Additional funding was provided by KAKENHI grants 16H06209 and 16KK0175 from the Japan Society for the Promotion of Science to ASM. We are grateful to Yoann Portugal for assistance with colony maintenance and image acquisition, as well as to Quoc-Viet Ha for work on the video acquisition and storage pipeline.
- Li et al. (2015) T. Li, H. Chang, M. Wang, B. Ni, R. Hong, and S. Yan, (2015), arXiv:1502.01812 [cs.CV] .
- Long et al. (2015) J. Long, E. Shelhamer, and T. Darrell, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2015) pp. 3431–3440.
- Pinheiro and Collobert (2014) P. Pinheiro and R. Collobert, in International Conference on Machine Learning (2014) pp. 82–90.
- Dai et al. (2014) J. Dai, K. He, and J. Sun, (2014), arXiv:1412.1283 [cs.CV] .
- He et al. (2015) K. He, X. Zhang, S. Ren, and J. Sun, IEEE Trans. Pattern Anal. Mach. Intell. 37, 1904 (2015).
- Sermanet et al. (2013) P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun, (2013), arXiv:1312.6229 [cs.CV] .
- Ren et al. (2016) S. Ren, K. He, R. Girshick, and J. Sun, IEEE Trans. Pattern Anal. Mach. Intell. (2016).
- von Frisch (1967) K. von Frisch, The Dance Language and Orientation of Bees, Vol. 159 (Belknap Press of Harvard University Press, 1967).
- Seeley (2010) T. D. Seeley, Honeybee Democracy (Princeton University Press, 2010).
- Winston (1991) M. L. Winston, The Biology of the Honey Bee (Harvard University Press, 1991).
- Karaboga and Akay (2009) D. Karaboga and B. Akay, Artif. Intell. Rev. 31, 61 (2009).
- Florea (2013) M. I. Florea, Automatic detection of honeybees in a hive, Ph.D. thesis (2013).
- Hendriks et al. (2012) C. Hendriks, Z. Yu, A. Lecocq, T. Bakker, B. Locke, and O. Terenius, in Workshop Vis. Observation Anal. Anim. Insect Behav. ICPR (Citeseer, 2012).
- Kimura et al. (2011) T. Kimura, M. Ohashi, R. Okada, and H. Ikeno, Apidologie 42, 607 (2011).
- Kimura et al. (2014) T. Kimura, M. Ohashi, K. Crailsheim, T. Schmickl, R. Okada, G. Radspieler, and H. Ikeno, PLoS One 9, e84656 (2014).
- Wario et al. (2015) F. Wario, B. Wild, M. J. Couvillon, R. Rojas, and T. Landgraf, Front. Ecol. Evol. 3 (2015).
- Biesmeijer and Seeley (2005) J. C. Biesmeijer and T. D. Seeley, Behav. Ecol. Sociobiol. 59, 133 (2005).
- Mersch et al. (2013) D. P. Mersch, A. Crespi, and L. Keller, Science 340, 1090 (2013).
- Kratz and Nishino (2010) L. Kratz and K. Nishino, in Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on (IEEE, 2010) pp. 693–700.
- Ali and Dailey (2009) I. Ali and M. N. Dailey, in Advanced Concepts for Intelligent Vision Systems, Lecture Notes in Computer Science, edited by J. Blanc-Talon, W. Philips, D. Popescu, and P. Scheunders (Springer Berlin Heidelberg, 2009) pp. 540–549.
- Ge et al. (2012) W. Ge, R. T. Collins, and R. B. Ruback, IEEE Trans. Pattern Anal. Mach. Intell. 34, 1003 (2012).
- Rodriguez et al. (2011) M. Rodriguez, J. Sivic, I. Laptev, and J. Y. Audibert, in 2011 International Conference on Computer Vision (2011) pp. 1235–1242.
- Girshick (2015) R. Girshick, in Proceedings of the IEEE International Conference on Computer Vision (2015) pp. 1440–1448.
- Hosang et al. (2016) J. Hosang, R. Benenson, P. Dollár, and B. Schiele, IEEE Trans. Pattern Anal. Mach. Intell. 38, 814 (2016).
- Pinheiro et al. (2015) P. O. Pinheiro, R. Collobert, and P. Dollar, (2015), arXiv:1506.06204 [cs.CV] .
- Dai et al. (2015) J. Dai, K. He, and J. Sun, (2015), arXiv:1512.04412 [cs.CV] .
- Ondruska and Posner (2016) P. Ondruska and I. Posner, (2016), arXiv:1602.00991 [cs.LG] .
- Chen (2003) Z. Chen, Statistics 182, 1 (2003).
- Ronneberger et al. (2015) O. Ronneberger, P. Fischer, and T. Brox, in Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015, Lecture Notes in Computer Science, edited by N. Navab, J. Hornegger, W. M. Wells, and A. F. Frangi (Springer International Publishing, 2015) pp. 234–241.
- Kingma and Ba (2014) D. Kingma and J. Ba, (2014), arXiv:1412.6980 [cs.LG] .
- Srivastava et al. (2014) N. Srivastava, G. R. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, J. Mach. Learn. Res. 15, 1929 (2014).
- Park and Kwak S. Park and N. Kwak, in Asian Conference on Computer Vision (2016).
- Springenberg et al. (2014) J. T. Springenberg, A. Dosovitskiy, T. Brox, and M. Riedmiller, (2014), arXiv:1412.6806 [cs.LG] .
- Dieleman et al. (2015) S. Dieleman, K. W. Willett, and J. Dambre, (2015), arXiv:1503.07077 [astro-ph.IM] .
- Marcos et al. (2016) D. Marcos, M. Volpi, and D. Tuia, (2016), arXiv:1604.06720 [cs.CV] .
- Worrall et al. (2016) D. E. Worrall, S. J. Garbin, D. Turmukhambetov, and G. J. Brostow, (2016), arXiv:1612.04642 [cs.CV] .