Towards dense object tracking in a 2D honeybee hive

12/22/2017 ∙ by Katarzyna Bozek, et al. ∙ Okinawa Institute of Science and Technology Graduate University

From human crowds to cells in tissue, the detection and efficient tracking of multiple objects in dense configurations is an important and unsolved problem. In the past, limitations of image analysis have restricted studies of dense groups to tracking a single individual or a subset of marked individuals, or to coarse-grained group-level dynamics, all of which yield incomplete information. Here, we combine convolutional neural networks (CNNs) with the model environment of a honeybee hive to automatically recognize all individuals in a dense group from raw image data. We create new, adapted individual labeling and use the segmentation architecture U-Net with a loss function dependent on both object identity and orientation. We additionally exploit temporal regularities of the video recording in a recurrent manner and achieve near human-level performance while reducing the network size by 94% compared to the original U-Net architecture. Given our novel application of CNNs, we generate extensive problem-specific image data in which labeled examples are produced through a custom interface with Amazon Mechanical Turk. This dataset contains over 375,000 labeled bee instances across 720 video frames at 2 FPS, representing an extensive resource for the development and testing of tracking methods. We correctly detect 96% of individuals, with a location error that is a small fraction of a typical body dimension and an orientation error of 12 degrees, approximating the variability of human raters. Our results provide an important step towards efficient image-based dense object tracking by allowing for the accurate determination of object location and orientation across time-series image data within one network architecture.





Image-based dense object tracking is of broad interest in the monitoring of crowd movement as well as the study of collective behavior in biological systems Li et al. (2015). Automated recognition of individuals in a dense group based on video recording would allow for the efficient implementation of monitoring and tracking frameworks with no additional manual labeling or tracking devices, which are often either impractical or invasive. The challenges in image-based dense object recognition include occlusions and variability in viewpoints and individual appearance. However, recent progress in convolutional neural networks (CNNs) for image segmentation Long et al. (2015), scene analysis Pinheiro and Collobert (2014), and object detection Dai et al. (2014); He et al. (2015); Sermanet et al. (2013); Ren et al. (2016) represents a promising development towards dense object detection and tracking. Here we apply these tools to a classical unsolved problem in behavioral ecology, the identification of individual organisms in a honeybee hive.

Honeybees have long drawn fascination and the study of their behavior has yielded important insights into animal communication, physiology, and neuroscience von Frisch (1967); Seeley (2010); Winston (1991); Karaboga and Akay (2009). Honeybees also provide an excellent model system for the study of social behavior as they can be viewed in the natural environment of an observation hive (Fig. 1). However, the complexity of a hive environment presents significant challenges for automated image-based analysis and previous techniques have shown only limited success, particularly under natural conditions Florea (2013); Hendriks et al. (2012); Kimura et al. (2011, 2014); Wario et al. (2015). A typical colony consists of hundreds or thousands of closely packed, often occluded, and continually moving individuals. The bees are unevenly distributed over a complex background, the honeycomb, which consists of a variety of different cells containing nectar, pollen, and brood in various stages of development. If tracking difficulties can be resolved, however, automated image analysis would easily surpass human observers by simultaneously following large numbers of organisms, thus permitting sophisticated studies of social behavior including subtle effects of genetic and molecular perturbations.

Leveraging high-resolution images of an observation bee hive, we present a method of individual recognition and localization across frames of a video recording. We devise a problem-specific individual labeling, adapt a previously proposed segmentation architecture, and expand its functionality to infer individual bee orientation on the comb. We next strengthen this approach by combining image data from consecutive time frames in a recurrent manner, which allows for a substantial reduction of computational cost without compromising accuracy. As no labeled data for this problem exist, we took advantage of the distributed online marketplace of Amazon Mechanical Turk (AMT) to create extensive training data at modest cost. Our method achieves detection accuracy comparable to human performance on this real-world dense object image data. Finally, we demonstrate the usefulness of our detection techniques towards a full tracking solution by producing exemplar trajectories with simple registration methods.

Related work

While there have been numerous computer-tracking approaches for the study of social insects, most of them rely on marking individuals, either with simple spots placed on a few individuals Biesmeijer and Seeley (2005), or more complex tags with barcodes that distinguish a higher number of individuals Mersch et al. (2013); Wario et al. (2015). Tagging is an obvious solution to recognizing individuals in a dense environment, however, it is laborious, inapplicable to other systems, and impractical on a whole-colony scale. As new individuals emerge, it becomes impossible to mark them without opening and significantly disrupting the colony. Additionally, tag recognition becomes impossible in the situations of partial tag occlusion or viewpoint change Wario et al. (2015). Due to similar difficulties, previous studies of human crowd tracking were limited to few individuals Kratz and Nishino (2010); Ali and Dailey (2009) or based on priors about collective motion to aid the performance of tracking algorithms Ge et al. (2012); Rodriguez et al. (2011).

A necessary step towards efficient, image-based dense object tracking is the capacity for individual detection in each frame of a video recording. Recent advances in CNNs have demonstrated their capability to detect and recognize objects in an image (e.g., Girshick (2015)). Such object detection methods typically involve an exhaustive sliding window search Sermanet et al. (2013) or a variety of region-based proposals Hosang et al. (2016). The detection step is then followed by Sermanet et al. (2013) or coupled with Ren et al. (2016); Pinheiro et al. (2015) classification of the detected object in the proposed box-shaped region Sermanet et al. (2013); He et al. (2015) or a masked patch Dai et al. (2014); Pinheiro et al. (2015). Such two-step or two-function architectures were designed for images containing multi-class, largely variable, and sparse objects.

In contrast, the images of honeybee colonies, cells, or human crowds can contain large numbers of densely packed and highly similar individuals of the same category. In these cases, region-based detection proposals can produce a large list of candidate regions, possibly even covering the entire image, with distinct objects sharing the same bounding box or mask. Additionally, as each image contains a large number of relatively small objects, keeping the initial image resolution is important for precise object localization. Approximative bounding box estimation Sermanet et al. (2013), as well as image rescaling Pinheiro et al. (2015), can result in an error margin of the location estimation which is too large for distinguishing among individuals.

Fully convolutional networks Long et al. (2015) allow for image segmentation and categorization on an individual pixel level. These networks are proposal-free and produce label maps for the entire image at its original resolution. Within this framework, each pixel is attributed a category, however, differentiation between instances of objects of the same category is not possible. Instance-aware segmentation has been previously proposed Dai et al. (2015) introducing box-level instance proposals. Images of high-density objects with numerous adjacent individuals necessitate developments allowing for accurate object instance recognition in an efficient manner independent of the number of instances present in the image.

More recently, deep recurrent neural networks (RNNs) were introduced to resolve the task of state estimation with application to the problem of multi-object tracking Ondruska and Posner (2016). Using simulated and real laser sensor measurements, this work aimed at predicting the current, unoccluded, complete scene given a sequence of observations capturing only partial information about the scene. A generative probabilistic model inspired by Bayesian filtering Chen (2003) was proposed and framed as an RNN architecture allowing for accurate scene estimation even when presented with incomplete observations. The efficacy of this approach, however, was demonstrated entirely on simulated data or simple near-perfect sensor measurements with smooth, linear motion generating black-and-white images where object detection is not part of the tracking task. Here we test the strength of the Bayesian filtering concept on real-world image data comprising dense and cluttered objects with unknown motion dynamics.

Figure 1: Observation beehive and imaging arrangement. Image data was generated from high resolution video recordings of a custom-designed observation beehive in which a honeybee colony was placed on an artificial comb, covered with transparent glass and illuminated with infrared light. Colonies in the observation hive are approximately two-dimensional and can range in size from hundreds to thousands of individuals.

I Approach

We propose a solution integrating the fully convolutional neural network U-Net Ronneberger et al. (2015) (Fig. 2) with a recurrent component for accurate object detection in a video sequence. In order to allow for object instance recognition, we defined an adapted labeling covering only the central part of each individual and non-adjacent to other individuals. We demonstrate the capacity of the network to accurately reproduce these labels which additionally allow for recognition of the main axis of each individual. To further indicate the head direction on the main body axis, we propose a loss function approximating individual orientation angle and expand the foreground-background segmentation with object orientation angle estimation. In addition, the recurrent component of the network leverages the information encoded in the video sequence and improves accuracy, while keeping the network at a fraction of the size of the original U-Net. Our proposed approach can localize individuals and recognize their orientation in following frames of a video recording efficiently, in one iteration, without a separate region proposal, sliding window, or masking, thus providing an important foundation for further individual object tracking in a dense group.

Figure 2: Network architecture. We used the U-Net architecture with a reduced number of filters and one fewer pooling and deconvolution step. A recurrent element was added before the final prediction: the prior representation was stored (pink) and concatenated with the representation of the next image in the time series (red).

Imaging experiment and dataset

Image data was generated from high-resolution video recordings of a custom-designed observation beehive in which a honeybee colony was placed on one side of a beehive comb, covered with transparent glass and illuminated with infrared light, which is imperceptible to the bees (Fig. 1). In brief, the hive was situated on the roof of a laboratory building at OIST Graduate University within a prefabricated room of size 3.6 m x 2.7 m x 2.3 m. The temperature was kept constant at and the humidity between 30 and 40%. An entrance/exit pipe 20 cm long connected the hive to the outside. We used a Vieworks Industrial Camera VC series VC-25MX-M72D0-DIN-FM (CMOS sensor, 25 Megapixels, CoaXpress interface, monochrome, F-mount, with image size of 5120 x 5120 pixels) located 1 m from the hive, so that a typical bee body covered pixels. The glass surface covered 51 cm x 51 cm. Infrared LEDs operating at 850 nm were mounted around the camera at an angle to avoid reflections. We placed LEDs on four 23 cm x 22 cm panels, with each panel equipped with 14 strips (6 LEDs per strip) for a total of 84 LEDs per panel, generating 13.4 W per panel. Additionally, we used one high-power infrared spot made of 3 LEDs (ENGIN LZ4-00R608) operating at 850 nm and generating 9 W. Image data was streamed with four optic fibers to a server where it was compressed without loss and stored with custom software. The resulting images are in grayscale with 8-bit encoding. The data analyzed here come from two video recordings at 30 FPS and 70 FPS. For the higher 70 FPS time resolution, the infrared light intensity was doubled to compensate for the shorter exposure time. Imaged colonies typically contain more than 500 individuals.

Data labeling

We devised a custom JavaScript interface for manual annotation of bee locations and orientations in the images (Supplemental Fig. S1). Through the interface the user defines a bee position and orientation by dragging, dropping, and rotating a bee symbol in an image. An additional round symbol was used to mark the abdomens of bees partially hidden inside a comb cell, where the orientation angle is difficult to determine. We used this interface to generate a labeled image set through AMT. We used 360 frames of the 30 FPS and 360 of the 70 FPS recording, both down-sampled to 2 FPS. In each frame we selected regions of size of and pixels, respectively, containing most of the colony bees against various backgrounds that we submitted for labeling (Supplemental Figs. S2-S3). As a result we obtained a dataset of 30 FPS and 70 FPS pixel images containing total of labeled bees, with an average of bees per image. We also submitted four frames, two from each recording, with a total of bee instances for labeling 10 times by independent workers to obtain an estimate of human error in position and angle labeling. This error was calculated as the standard deviation of the distance of each of the 10 labels to the reference label used in the main dataset for training and testing.

Figure 3: Example results for segmentation. Panels: (a) original image, (b) segmentation labels, (c) segmentation results, (d) position and body axis. For the original image (a), we show training labels marking bees (b, blue), and abdomens of bees inside honeycomb cells (b, red). The rest of the image is background. (c) Results of the segmentation network in which each pixel in the input image is classified with a background label, bee label, or abdomen label. (d) We show predicted locations and body axis estimations (d, red) compared to human labeling (d, yellow). For each contiguously labeled region, the predicted bee location was calculated as the centroid and the predicted body axis was calculated as the angle of the first principal component. Regions representing abdomens are drawn as circles as orientation is ambiguous. Two unlabeled false positives (FPs) are present in this example at the image boundary, as well as a questionable class label mismatch: a partially visible bee was labeled as fully visible (blue class label) but predicted as bee abdomen (red class label).

Training data

As the annotation outcome, every individual in an image (e.g. Fig. 3a) was assigned denoting the coordinates of the central point of a bee against the top-left corner of the image, type of the label ( when full bee body is visible and when the bee is inside a comb cell), and the body rotation angle against the vertical pointing upwards and calculated clockwise ( if ). To use this information for segmentation-based individual localization, we generated regions centered over the central point of each bee. For labels with the regions were ellipse-shaped with semi-minor axis pixels and semi-major axis pixels, and rotated by the angle (Fig. 3b). For labels where the regions were circular with pixels. Regions of this shape and size cover the central parts of each bee and are non-adjacent to regions covering neighboring bees in the image.
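The elliptical label regions described above can be rasterized directly from a bee's central point and rotation angle. The following is a minimal NumPy sketch rather than the authors' code: the function name `ellipse_label_mask`, the image size, and the axis lengths in the usage lines are illustrative assumptions. The angle convention (clockwise from the upward vertical, with the image origin in the top-left corner) follows the text.

```python
import numpy as np

def ellipse_label_mask(shape, cx, cy, semi_minor, semi_major, angle_deg):
    """Boolean mask of an ellipse centred at (cx, cy) whose major axis is
    rotated clockwise by angle_deg from the upward vertical (image coords)."""
    h, w = shape
    yy, xx = np.mgrid[0:h, 0:w]
    dx = xx - cx
    dy = yy - cy
    theta = np.deg2rad(angle_deg)
    # Unit vector of the body (major) axis: angle 0 points up (negative y),
    # and positive angles turn clockwise on screen (towards positive x).
    ux, uy = np.sin(theta), -np.cos(theta)
    along = dx * ux + dy * uy        # coordinate along the major axis
    across = -dx * uy + dy * ux      # coordinate along the minor axis
    return (along / semi_major) ** 2 + (across / semi_minor) ** 2 <= 1.0

# Hypothetical sizes, chosen only for illustration.
upright = ellipse_label_mask((101, 101), 50, 50, 10, 20, 0.0)
sideways = ellipse_label_mask((101, 101), 50, 50, 10, 20, 90.0)
```

A circular region for a bee inside a comb cell is the special case in which both semi-axes are equal, so the angle has no effect.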

To compensate for the class imbalance between foreground bee regions and the non-bee background, we generated weights used for balancing the loss function at every pixel. For every bee region a 2D Gaussian of the same shape was generated, centered over the bee central point, and scaled by either the proportion in the training set of the background pixels to the number of bee-region pixels of the given type and in the task of class segmentation, or scaled by proportion in the training set of the background pixels to the number of bee-region pixels of any type in the task of finding bee orientation angle.

Training and testing datasets were organized in two ways. First, out of the images were randomly sampled in equal proportions from the 30 FPS and 70 FPS recordings and used as the test set. Second, the images were organized in 60 sequences of 360 images of pixel size. In this time series data the first 324 images of each sequence were used for training and the remaining 36 for testing.

Network and training

We used the U-Net Ronneberger et al. (2015) segmentation architecture. The number of filters in the initial convolutional layer was doubled after every pooling layer in the contracting path and halved after each deconvolution in the expansive path (Fig. 2). The convolution kernel size was set to 3.

We first trained the network for foreground-background segmentation with the loss function defined as 3-class softmax scaled by the class imbalance in the entire training set. Next, we expanded the task to finding the direction of each individual orientation. Each foreground pixel, instead of the class label, was set at the value of the bee rotation angle and background pixels were labeled as . Class identity was not used in this expanded task. The loss function was defined as:


where is the class weight and are the predicted and labeled orientation angle, respectively.

In the network output each contiguous foreground region was interpreted as an individual bee. Foreground patches smaller than and larger than pixels were discarded, as the label size is pixels. The centroid location was calculated as the mid-point of all x- and y-coordinates of points in each region. The main body axis was calculated as the angle of the first principal component of the points in each region. In the segmentation task, region class was assigned as the class identity of the majority of pixels within given region. In the bee orientation recognition task, the predicted angle was calculated as the top quantile of all values predicted in the given foreground region. This strategy was motivated by the observation that the orientations in the outer edges of a region are often underestimated, and that the highest value found within a region is closest to the labeled orientation angle. In addition to an independent prediction, the orientation angle was used to assign back and front to the region principal axis.
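The post-processing of a single foreground region can be sketched as follows. This is an illustrative NumPy reading of the description above, not the authors' implementation: `region_pose` and `direct_axis` are hypothetical names, the region is assumed to be already extracted as pixel coordinates, and the angle convention (degrees clockwise from the upward vertical) is taken from the labeling description.

```python
import numpy as np

def region_pose(ys, xs):
    """Centroid and undirected body-axis angle of one foreground region.

    ys, xs: pixel row/column indices of the region. The axis angle is in
    degrees within [0, 180), measured clockwise from the upward vertical.
    """
    pts = np.column_stack([xs, ys]).astype(float)
    centroid = pts.mean(axis=0)
    cov = np.cov((pts - centroid).T)
    eigvals, eigvecs = np.linalg.eigh(cov)
    vx, vy = eigvecs[:, np.argmax(eigvals)]   # first principal component
    # Image y grows downwards, so "up" is -y; clockwise from up is atan2(x, -y).
    axis_deg = np.degrees(np.arctan2(vx, -vy)) % 180.0
    return centroid, axis_deg

def direct_axis(axis_deg, predicted_deg):
    """Pick the axis direction (axis or axis + 180) closer to the predicted angle."""
    candidates = np.array([axis_deg, axis_deg + 180.0]) % 360.0
    diff = np.abs((candidates - predicted_deg + 180.0) % 360.0 - 180.0)
    return float(candidates[np.argmin(diff)])

# A horizontal bar of pixels: the axis lies at 90 degrees from the vertical.
ys, xs = np.mgrid[9:12, 0:21]
centroid, axis = region_pose(ys.ravel(), xs.ravel())
heading = direct_axis(axis, predicted_deg=260.0)
```

Because the principal component only fixes a line, the predicted angle is used here solely to choose between the two opposite directions along that line, which mirrors the role the text assigns to it.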

We additionally expanded the functionality of U-Net to take advantage of regularities in the image time series. In each pass of the network during training or prediction, the penultimate layer was kept as a prior for the next pass. In the following pass, the next image in the time sequence was used as input and its penultimate layer was concatenated with the prior representation before calculating the network output.
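The recurrent step can be illustrated with a toy stand-in for the final prediction layer. This sketch makes several assumptions that are not stated in the text: the class name `RecurrentHead` is hypothetical, the final layer is modeled as a 1x1 convolution (a per-pixel matrix product) with random weights, and the first frame of a sequence is given an all-zero prior. It only shows the data flow of storing one frame's penultimate representation and concatenating it with the next frame's.

```python
import numpy as np

class RecurrentHead:
    """Toy stand-in for the final prediction step with a stored prior."""

    def __init__(self, n_features, n_outputs, seed=0):
        rng = np.random.default_rng(seed)
        # A 1x1 convolution over [current, prior] features is a per-pixel
        # matrix product with 2 * n_features input channels.
        self.weights = rng.normal(scale=0.1, size=(2 * n_features, n_outputs))
        self.prior = None

    def step(self, features):
        """features: (H, W, n_features) penultimate-layer map of one frame."""
        if self.prior is None:
            # Assumption: the first frame of a sequence uses an all-zero prior.
            self.prior = np.zeros_like(features)
        stacked = np.concatenate([features, self.prior], axis=-1)
        output = stacked @ self.weights
        self.prior = features          # becomes the prior for the next frame
        return output

head = RecurrentHead(n_features=4, n_outputs=3)
frame_features = np.ones((5, 5, 4))
first = head.step(frame_features)    # prediction with an empty prior
second = head.step(frame_features)   # prediction informed by the previous frame
```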

Adaptive moment estimation (Adam) Kingma and Ba (2014) was used during training. Method accuracy was estimated in terms of the capacity to correctly recognize and localize all individuals in an image. We implemented the CNN using Caffe2.

Figure 4: Example results for body orientation prediction. Panels: (a) original image, (b) network output, (c) body axis and angle, (d) directed axis. For each original image input (a), the network produces orientation predictions (b) for pixels identified as foreground (classes bee or bee abdomen). Orientation values are represented by the colorwheel within the dashed square. As in Fig. 3 we estimate body location (c, small squares) and body axis (c, white lines) by computing the centroid and first principal component of contiguous foreground regions, respectively. The body orientation is separately computed as the mean orientation angle for each region (c, red arrows). The location and body orientation from human labeling are denoted by yellow arrows. (d) The final predicted body orientation angle is calculated as the body axis with the direction indicated by the estimated angle (d, labels in yellow and predicted directed axis in red). The observation hive is aligned perpendicular to the floor so that a vertically oriented bee is shown as a vertical arrow.



We first tested whether individual recognition can be accomplished with the chosen segmentation architecture and two classes of foreground pixels, those that are part of visible bees and those that are part of the abdomens of bees inside the honeycomb cells. We found that the original U-Net architecture resulted in substantial overfitting and an increase in the loss function on the test set (Supplemental Fig. S4), hence we reduced the size of the U-Net by using 32 filters in the first convolutional layer and eliminating one pooling and one deconvolution layer. This reduced U-Net contained a total of parameters compared to parameters in the original U-Net, thus shrinking the network to just of the original size. Decreasing the number of parameters diminished overfitting. Even so, overfitting was still observable in the reduced network (Supplemental Fig. S4). We also tested different regularization scenarios using weight decay and dropout Srivastava et al. , none of which achieved satisfactory performance within the feasible time span of training (Supplemental Fig. S4). This could be due to the fact that in fully convolutional neural networks, such as U-Net, there is no fully connected layer on which dropout is usually performed. Unlike fully connected layers, convolutional layers have a small number of parameters compared to the size of their feature maps. Hence, it is believed that convolutional layers suffer less from overfitting and, even though dropout has shown its effectiveness in convolutional layers in some cases Park and Kwak ; Springenberg et al. (2014), its effect in convolutional layers has not been studied thoroughly.

We therefore used early stopping in training as a measure against overfitting. In the following, we report network performance after 18 iterations of training, the iteration selected based on the increase of the loss function of this segmentation network. We apply this stopping criterion to training of this segmentation network as well as the orientation-finding network and the recurrent network described below.

                        TP     FP            Error:
                                             Object   Position   Orientation   Axis [°]      Directed
                                             class    [pixel]    angle [°]                   axis [°]
Human labeling          -      0.15 (0.07)   0.04     6.7        7.7           -             -
Segmentation            0.96   0.21 (0.12)   0.19     5.9        -             13.3 (11.2)   -
Orientation             0.94   0.18 (0.10)   -        5.6        34.0 (32.2)   15.7 (13.1)   22.1 (16.7)
Recurrent orientation   0.96   0.14 (0.06)   -        5.1        15.2 (13.3)   10.6 (8.8)    12.1 (9.7)
Table 1: Summary results for location and orientation prediction. In the first row we show the variability among human raters estimated by repeating the labeling task 10 times on an image set. TP: true positives; FP: false positives. For network performance, the median of the error values is listed. Values in brackets are the results after a pixel margin of the image is discarded, eliminating predictions on partially visible objects. Results cited in the abstract are marked in bold and the full error distributions are presented in Supplemental Fig. S9.

The segmentation network predicted individual location with a precision of pixel on average (Table 1), which is similar to the variability among human-assigned labels ( pixels) and much less than a typical bee width of pixels. While the class prediction was also accurate, there was a seemingly high number of false positives (FPs). However, we noted that most FPs are reported on the image boundary where only an incomplete object is visible – are within pixel margin of the image (e.g. Fig. 3d, Supplemental Fig. S5). Similarly, the disagreement among human labelers was the highest on the image boundaries, of disagreements are located within pixel margins of the image, a surface of of the size of the images annotated by the raters (Supplemental Fig. S6-S7). Therefore, in a comprehensive tracking solution, the number of FPs can be reduced by discarding the boundary regions and using overlapping image patches. Additionally, we noticed multiple examples of FPs that, upon closer inspection, were instead abdomens of bees inside cells that were difficult for human raters to spot (see e.g. Supplemental Fig. S8). Therefore, among the of FPs predicted as bee abdomens, we expect some to be unlabeled true positives (TPs). Foreground class identity (full bee body vs. abdomen of a partially visible bee) was incorrectly assigned in of cases; however, note that the distinction between the two can often be disputable (e.g. Fig. 3).

We used the elliptical shapes of segmented regions representing bee bodies to deduce the main body axis orientation. In particular, we found that the first principal component of the segmented patches resulted in a relatively precise approximation of each individual orientation with only difference on average with the labeled axis orientation (Fig. 3d, Table 1).

Location and orientation recognition

We expanded the segmentation network into an architecture appropriate for the estimation of object orientation angle instead of object category. In this approach, foreground class labels were exchanged with object instance orientation angle. This architecture produced similar performance to the segmentation network with a high TP rate and body axis recognition based on the label shape , suggesting that changing the label and loss function did not affect the foreground-background segmentation accuracy.

For the orientation angle, we observed that the error distribution exhibited a small constant baseline component indicative of random predictions (Supplemental Fig. S9), and to avoid the undue influence of outlier values we report median error in our results (Table 1). This baseline error can be partially explained by uncertainties at image boundaries, as well as by the variability of angle labeling among human raters (Supplemental Fig. S10-S11), and is not related to the bee density in the image. The orientation angle prediction has a median error of which is proportionally similar to the axis error given that the head-tail orientation error can range within [0°, 180°] and the axis error within [0°, 90°]. Notably, rotation invariance of CNNs is an unresolved question and more complicated solutions were proposed to address it Dieleman et al. (2015); Marcos et al. (2016); Worrall et al. (2016). It is therefore encouraging that a relatively simple loss function with a reduced U-Net segmentation network allows for approximation of the orientation of the densely packed honeybees. Moreover, the predicted orientation angle can be used merely to indicate the head-tail location on the axis estimated from the shape of the label. In this way we obtained an orientation error of the directed axis with this network (Table 1).
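The two error ranges compared above follow from how angular differences wrap around. A directed (head-tail) angle repeats every 360°, so its absolute error is at most 180°, whereas an undirected axis repeats every 180°, so its error is at most 90°. The sketch below shows one way to compute both; the function names are illustrative and not taken from the source.

```python
import numpy as np

def orientation_error(pred_deg, label_deg):
    """Absolute head-tail angle difference in degrees, within [0, 180]."""
    diff = (np.asarray(pred_deg, dtype=float) - label_deg) % 360.0
    return np.minimum(diff, 360.0 - diff)

def axis_error(pred_deg, label_deg):
    """Absolute undirected-axis difference in degrees, within [0, 90]."""
    diff = (np.asarray(pred_deg, dtype=float) - label_deg) % 180.0
    return np.minimum(diff, 180.0 - diff)

# A prediction pointing exactly backwards is the worst case for the
# directed angle but a perfect match for the undirected axis.
flipped_orientation = orientation_error(10.0, 190.0)
flipped_axis = axis_error(10.0, 190.0)
median_error = np.median(orientation_error([0.0, 10.0, 350.0], 0.0))
```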

Recurrent detection and tracking

We inspected whether regularities in object appearance and movement across time could improve the orientation angle prediction. Image data were organized in a time sequence and, in following iterations of training and testing, consecutive images in the sequence were fed as network input. In each iteration, the penultimate layer of the network was kept as representation of a prior that was concatenated with the same penultimate layer representation of the following image in the sequence in the next iteration of training or testing. In this way network output was a result of both the information in the previous and current time point.

Indeed, we found that incorporating time series image data roughly halved the error in orientation angle prediction and reduced the axis prediction error to about two thirds. The orientation error obtained by orienting body axis with the predicted orientation angle was reduced to (Fig. 4, Table 1), which is significantly better than the non-recurrent approach (Kruskal-Wallis test, ) and only marginally higher than the variability observed among human raters.

Finally, to explore whether our bee detection results could provide the foundation for fully automated image-based tracking, we used elementary ideas to reconstruct bee trajectories. We matched the closest individuals in consecutive time points and, in case a trajectory was lost, searched up to five frames ahead for a close match that could complete this trajectory. Individual’s position, orientation, angle, and velocity were taken into account in the matching. Additionally, short trajectories beginning or ending in the central parts of the image were discarded as potential FPs. As we have no ground truth labels for the individuals’ trajectories in our data, we cannot yet quantitatively assess the accuracy of trajectories estimated in this way. We note, however, many examples that appear relatively complete (Fig. 5, Supplemental Movie) among the 60 sequences of 36 frames of the test set.
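A simplified version of this matching step can be sketched as greedy nearest-neighbour linking with a tolerance for short gaps. The code below is an assumption-laden illustration, not the authors' procedure: it uses position only (the text also takes orientation and velocity into account), the distance threshold `max_dist` is a hypothetical parameter, and the removal of short central trajectories is omitted.

```python
import numpy as np

def link_detections(frames, max_dist=30.0, max_gap=5):
    """Greedy nearest-neighbour linking of per-frame detections.

    frames: list of (N_t, 2) arrays of (x, y) positions, one per frame.
    Returns a list of trajectories, each a list of (frame_index, x, y).
    """
    finished, active = [], []
    for t, dets in enumerate(frames):
        dets = np.asarray(dets, dtype=float).reshape(-1, 2)
        unused = set(range(len(dets)))
        # Trajectories seen most recently get first pick of the detections.
        for traj in sorted(active, key=lambda tr: -tr[-1][0]):
            if not unused:
                break
            _, x, y = traj[-1]
            idx = min(unused, key=lambda i: np.hypot(dets[i, 0] - x, dets[i, 1] - y))
            if np.hypot(dets[idx, 0] - x, dets[idx, 1] - y) <= max_dist:
                traj.append((t, dets[idx, 0], dets[idx, 1]))
                unused.remove(idx)
        # Retire trajectories that can no longer be matched within max_gap frames.
        still_active = []
        for traj in active:
            (finished if t - traj[-1][0] >= max_gap else still_active).append(traj)
        # Every unmatched detection starts a new trajectory.
        active = still_active + [[(t, dets[i, 0], dets[i, 1])] for i in sorted(unused)]
    return finished + active

frames = [
    np.array([[0.0, 0.0], [100.0, 100.0]]),
    np.array([[2.0, 1.0], [101.0, 99.0]]),
    np.empty((0, 2)),                      # a frame with no detections
    np.array([[4.0, 2.0]]),
]
trajectories = link_detections(frames)
```

In the usage lines the first bee is missed in the third frame and picked up again in the fourth, so its trajectory bridges a one-frame gap.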

Figure 5: Trajectory reconstruction. The results of a recurrent approach to object detection allow for trajectory reconstruction using an elementary matching method for registering individuals across frames (see Supplemental Movie).

II Conclusions

Accurate individual recognition is an important step towards automated dense object tracking. Here we described an approach for recognizing all individual bees and their orientation in the natural environment of a densely packed honeybee comb. We leverage the power of current segmentation architectures and design labeling to encode additional information about the segmented object, both in label shape and value, which allowed us to accurately indicate each individual’s position and orientation. We additionally enhance our recognition approach through a recurrent framework that places the improved accuracy near the level of human labeling, at a strongly reduced computational cost.

While our principal advance is dense object detection, our results are an important step towards individual trajectory reconstruction as demonstrated with our naive matching approach. Of course, quantitative trajectory reconstruction requires algorithms and analysis beyond the scope of this manuscript. Nevertheless, the positive examples achieved through this simple matching approach, even at a low frame rate (2 FPS), demonstrate that the results provide important steps towards fully automated image-based dense object tracking.

Finally, we suggest that the environment of a honeybee hive offers an excellent model system for the development of tracking approaches. The hive is dense and complex though still tractable for labeling and offers unparalleled access for video recordings. We expect our work to foster significant advances in the quantitative study of this important social organism. In addition, our labeled dataset can be used for the development of other image-based tracking methods and the flexibility of CNN-based segmentation will allow our approach to be usefully applied to a variety of systems.


Funding for this work was provided by the OIST Graduate University to ASM and GS. Additional funding was provided by KAKENHI grants 16H06209 and 16KK0175 from the Japan Society for the Promotion of Science to ASM. We are grateful to Yoann Portugal for assistance with colony maintenance and image acquisition, as well as to Quoc-Viet Ha for work on the video acquisition and storage pipeline.


Supplementary Material

Figure S1: Amazon Mechanical Turk labelling schematic. Annotators were instructed to mark all the bees in an image of a bee comb by dragging and dropping a bee symbol onto each bee in the image and matching the symbol’s orientation angle. Two symbols were available for marking: a bee symbol for fully visible bees and a circle symbol for bees that are inside the cells, where only the bee abdomen is visible. Annotators were also instructed to use the same symbols to mark the small number of bees that are upside down.

Figure S2: Regions of the 30 FPS beehive video used for human labeling and network training. The squares designate the size of subregions used as one Amazon Mechanical Turk task.
Figure S3: Regions of the 70 FPS beehive video used for human labeling and network training. The squares designate the size of subregions used as one Amazon Mechanical Turk task.
Figure S4: The change of the loss function with training epoch using the reduced U-Net (first column), the full U-Net (second column) and different levels of dropout (columns 3 and 4). The network performing the segmentation task is shown in the first row, and the network performing the orientation angle search in the second row. The loss for the training set is marked in blue and the loss for the test set in red. The full U-Net results in substantial overfitting, while reducing the size of U-Net reduces the amount of overfitting. Various levels of dropout result in prohibitively slow training (3rd column), or also lead to overfitting with worse overall results (4th column). For this reason we chose early stopping in training (iteration 18, indicated on the upper-left panel) as a measure against overfitting.

Figure S5: Edge effects of the input image reduce network performance. We show a spatial histogram of the number of incorrectly predicted bees (false positives, FPs) across the 512x512 pixel image patches used as input to the network.
(a) 30 FPS recording
(b) 70 FPS recording
Figure S6: Edge effects of the input image increase variability among human annotators. We show a spatial histogram of the location of disagreements in bee labeling among human raters. In an individual AMT task, annotators labeled a 1024x1024 pixel image patch, one of which is indicated with a square shape in the images (white dashed outline).
Figure S7: The numbers of labeling disagreements among AMT annotators (left) and of FPs identified through the network segmentation (right) are large near the image patch boundary, suggesting an edge effect that can be reduced in later implementations.
Figure S8: Bees that are partially obscured inside comb cells are often hard for human labelers to identify but are still correctly segmented. Shown are examples of difficult-to-label cases. For each original image (left column), the bee abdomens unnoticed by labelers are highlighted (middle column). These abdomens were, however, picked up by the segmentation network, as shown in the corresponding images in the right column, with labels marked in yellow and predictions in red. Such cases contribute to the number of FPs in the network performance reported in Table I.
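The edge effect summarized in Figures S5 to S7 can be quantified by binning error locations over the image patch and comparing the border cells with the interior. The following is a minimal sketch under assumptions of our own: the function names, the 8x8 grid and the definition of "border" as the outermost ring of cells are illustrative and are not taken from the analysis behind the figures.

```python
def spatial_histogram(points, patch_size=512, bins=8):
    """Count (x, y) error locations in a bins x bins grid over a square patch.

    points: iterable of (x, y) pixel coordinates inside the patch.
    Returns a list of rows; hist[row][col] is the count in that cell.
    """
    cell = patch_size / bins
    hist = [[0] * bins for _ in range(bins)]
    for x, y in points:
        # Ignore points that fall outside the patch.
        if 0 <= x < patch_size and 0 <= y < patch_size:
            hist[int(y // cell)][int(x // cell)] += 1
    return hist


def border_fraction(hist):
    """Fraction of all counts that lie in the outermost ring of grid cells."""
    n = len(hist)
    total = sum(sum(row) for row in hist)
    if total == 0:
        return 0.0
    border = sum(
        hist[i][j]
        for i in range(n)
        for j in range(n)
        if i in (0, n - 1) or j in (0, n - 1)
    )
    return border / total


# Hypothetical false-positive locations in a 512x512 patch.
fp_locations = [(0, 0), (511, 511), (256, 256), (10, 500)]
hist = spatial_histogram(fp_locations)
print(border_fraction(hist))
```

With a uniform distribution of errors, the border ring of an 8x8 grid would hold 28/64 of the counts, so a substantially larger fraction would be consistent with an edge effect.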
Figure S9: Network prediction errors for all labelled bee instances in the 2,176 images of the test dataset. We show error histograms as well as the mean and median errors for position, orientation angle, axis angle and directed axis angle predictions. The flat tails for angle predictions suggest a small baseline of random predictions.
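To make the error measures concrete, the sketch below computes a position error and two angular errors and summarizes them by mean and median. It rests on assumed definitions: the orientation error is taken as the smallest difference on the full circle (head direction matters), and the axis error as the smallest difference modulo 180 degrees (head and tail are interchangeable). These definitions and all function names are illustrative and may differ from those used to produce the histograms.

```python
import math
import statistics


def position_error(pred_xy, true_xy):
    """Euclidean distance in pixels between predicted and labeled positions."""
    return math.hypot(pred_xy[0] - true_xy[0], pred_xy[1] - true_xy[1])


def orientation_error(pred_deg, true_deg):
    """Smallest absolute angular difference on the full circle, in [0, 180]."""
    d = abs(pred_deg - true_deg) % 360.0
    return min(d, 360.0 - d)


def axis_error(pred_deg, true_deg):
    """Smallest difference between body axes (head/tail ignored), in [0, 90]."""
    d = abs(pred_deg - true_deg) % 180.0
    return min(d, 180.0 - d)


def summarize(errors):
    """Return (mean, median) of a non-empty list of errors."""
    return statistics.mean(errors), statistics.median(errors)


# Hypothetical (predicted, labeled) orientation angles in degrees.
pairs = [(350.0, 10.0), (185.0, 0.0), (90.0, 92.0)]
print(summarize([orientation_error(p, t) for p, t in pairs]))
print(summarize([axis_error(p, t) for p, t in pairs]))
```

Under these definitions, a prediction that confuses head and tail produces a large orientation error but a small axis error, which is one way the two histograms can differ.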
Figure S10: Variability in annotated bee position and orientation among human raters. For 2034 bee instances we show the histogram of the standard deviation of 10 repeated annotation tasks against the one reference annotation used in the dataset for network training and testing.
Figure S11: Example of the variability among human raters. The same image was presented to 10 different annotators from Amazon Mechanical Turk. Each yellow line is centered on and aligned to a honeybee body identified by an annotator. Circles are centered on image locations identified as bee abdomens.
Supplementary Movie: In the ancillary file “simple_tracking.mp4” we show an example of reconstructed trajectories. Individuals in one frame are matched to the closest individuals in the following frames using position, orientation, angle, and velocity. If a trajectory is lost, we search up to five frames ahead for a close match to complete this trajectory.
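The matching procedure described for the Supplementary Movie can be illustrated with a greedy nearest-neighbour sketch. This is not the code used to produce the movie. The cost function (distance to a constant-velocity prediction plus a weighted orientation difference), the thresholds `max_dist` and `angle_weight`, and the function names are all assumptions made for illustration; only the idea of matching on position, orientation and velocity, and of bridging gaps of up to five frames, comes from the description above.

```python
import math


def angle_diff(a, b):
    """Smallest absolute difference between two angles in degrees, in [0, 180]."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)


def match_tracks(frames, max_dist=50.0, angle_weight=0.5, max_gap=5):
    """Greedily link detections across frames into trajectories.

    frames: list of frames, each a list of (x, y, angle_deg) detections.
    Returns a list of tracks, each a list of (frame_index, x, y, angle_deg).
    """
    tracks = []
    for t, detections in enumerate(frames):
        candidates = []
        for ti, track in enumerate(tracks):
            t_last, x, y, a = track[-1]
            gap = t - t_last
            if gap > max_gap:
                continue  # lost for too long; do not extend this track
            # Constant-velocity prediction from the last two track points.
            vx = vy = 0.0
            if len(track) >= 2:
                t_prev, xp, yp, _ = track[-2]
                dt = t_last - t_prev
                vx, vy = (x - xp) / dt, (y - yp) / dt
            px, py = x + vx * gap, y + vy * gap
            for di, (dx, dy, da) in enumerate(detections):
                dist = math.hypot(dx - px, dy - py)
                if dist <= max_dist:
                    cost = dist + angle_weight * angle_diff(a, da)
                    candidates.append((cost, ti, di))
        # Assign the cheapest pairs first; each track and detection used once.
        candidates.sort()
        used_tracks, used_dets = set(), set()
        for cost, ti, di in candidates:
            if ti in used_tracks or di in used_dets:
                continue
            tracks[ti].append((t, *detections[di]))
            used_tracks.add(ti)
            used_dets.add(di)
        # Unmatched detections start new tracks.
        for di, det in enumerate(detections):
            if di not in used_dets:
                tracks.append([(t, *det)])
    return tracks


# Two hypothetical bees; the second is missed by the detector in frame 2.
frames = [
    [(0, 0, 0), (100, 100, 90)],
    [(5, 0, 0), (100, 105, 90)],
    [(10, 0, 0)],
    [(15, 0, 0), (100, 115, 90)],
]
for track in match_tracks(frames):
    print([p[0] for p in track])
```

In this example the second trajectory is continued across the missing frame because the track remains eligible for matching for up to `max_gap` frames. A greedy assignment is the simplest choice; an optimal assignment (for example the Hungarian algorithm) would be a natural refinement in dense regions where several bees compete for the same detection.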