Frequently counting the number of pigs in grouping houses is a critical management task for large-scale pig farming facilities. On one hand, pigs are often moved into different barns at distinct growth stages or grouped into separate large pens by size, and farmers need to know how many pigs are in each large pen. On the other hand, comparing the counting result with the expected number of pigs enables the early detection of unexpected events, e.g., missing pigs. However, walking around the pig barns to count a large number of pigs is labor-intensive. Thus, automated pig counting and monitoring using computer vision techniques is a promising way to support intensive pig farming management while reducing cost.
In recent years, computer vision algorithms have been widely adopted to support agriculture and farming automation, such as cattle gait tracking, pig weight estimation and fruit counting. Despite this exciting progress, pig counting remains a very challenging task, due to large pig movements, high group density, overlapping, occlusion and camera perspective, as illustrated in Fig. 1. Few works in the literature have studied the development of automated pig counting systems. Existing works only handled pig counting in a single image. Nonetheless, as shown in Fig. 1, the field of view of a single image is restricted to a small region, so a single image cannot monitor a large pig grouping house. Furthermore, single-image methods cannot deal with pigs that frequently enter or exit the camera view. To overcome these challenges, we presented a novel automated counting algorithm with an inspection robot and a monocular fisheye camera. Fig. 2a shows two pictures of our inspection robot, with a fisheye camera, installed on the roof rail in our experimental pig grouping houses. Fig. 2b visualizes a single video frame with the detected pig skeletons output by our pig counting pipeline.
The main contributions of this paper are summarized as follows. 1) A sensor configuration is presented, which is suitable for pig counting in large-scale grouping houses. 2) A novel bottom-up detection method is proposed to identify pigs, while addressing detection challenges due to overlapping, occlusion and deformation of body shapes. 3) A novel online spatial-aware temporal response filtering (STRF) method is designed to suppress false positives caused by tracking failures or pig movements. 4) An efficient implementation of the counting pipeline is designed and deployed to an embedded system, where it achieves high-speed performance.
II Related Work
Counting in a crowd is an important but challenging task due to severe occlusion, perspective distortions, complex illumination and the diverse distribution of target sizes. Recently, deep-learning-based methods [17, 18] have been developed to estimate a single-image density map for crowd counting. Sindagi et al. developed a contextual pyramid convolutional neural network (CNN) for crowd density map estimation. Both global and local contexts were employed in the network to achieve better accuracy. Shen et al. proposed an adversarial cross-scale consistency pursuit method to improve the estimation consistency and reduce the averaging effect. These methods formulate the counting problem as density map estimation, and thus have the advantage of handling severe occlusion and perspective distortions. However, density-map-based methods lose the detailed individual information and discard the accurate location of each single target. Therefore, they lose the ability to associate targets across time and are not suitable for video-based counting.
Recently, researchers in agriculture have presented many works tackling counting problems in various scenarios. Tian et al. counted pigs in a single image using a CNN-based method for pig density map estimation. Similar to the density-map-based methods above, this method is not suitable for video-based counting. As a single image only has a small field of view (as shown in Fig. 1), it cannot be used for pig counting in large grouping houses. Liu et al. [9, 8] developed a fruit counting pipeline using a monocular camera. Individual fruits are first segmented using a CNN-based method, and then tracked by a Kalman-filter-corrected Kanade-Lucas-Tomasi (KLT) tracker. A structure from motion (SfM) algorithm was utilized to obtain relative 3D location and size estimates in order to reject outliers and double-counted fruit tracks. This method is only suitable for counting rigid, stationary targets, and does not work for counting moving livestock. Hodgson et al. demonstrated that images collected by unmanned aerial vehicles (UAV) could help wildlife monitoring and counting. Rivas et al. studied cattle detection from aerial-view photos. Counting based on aerial views is promising, but could hardly be used for indoor livestock counting without developing algorithms to handle severe occlusion (e.g. caused by indoor building structures or perspective distortions), overlapping, and double-counted tracking trajectories caused by animals entering or exiting the camera view.
Different from previous approaches, we presented a novel video-based pig counting system for large pig grouping houses. The developed counting pipeline overcame dense detection challenges (e.g. overlapping or occlusion) with a novel bottom-up algorithm for detecting and associating pig body parts. An STRF method was developed to obtain the final count while reducing the counting error caused by tracking failures or pig movements.
In this work, we presented an efficient pig counting system for large pig grouping houses. Fig. 3 shows the entire algorithm pipeline. In our counting system, the camera moved along the roof from one end of the pig grouping house to the other and scanned the whole house from a top-down view. A single counting pass corresponded to one such scan of the house. As summarized in Fig. 3, we detected multiple pig body keypoints, associated them to localize each individual pig, tracked the pigs across frames, and obtained the counting result using the STRF method.
III-A Sensor Configuration
An inspection robot, which could move back and forth along a rail installed on the roof of the pig house, was used for pig counting data acquisition and processing from a top-down view (Fig. 2a). Several sensors used for different applications were installed inside the robot, including a monocular fisheye camera for pig counting, an RGB-D camera for pig weight estimation, and gas and temperature sensors for environmental control. Inside the inspection robot, an embedded system with a RockChip RK3399 multi-core ARM processor was used for processing data from the cameras and running the pig counting algorithm.
III-B Bottom-Up Detection
Bounding-box-based detection methods have been widely used. These methods first propose the locations of detection candidates as bounding boxes, and then classify each box as a real target or not. Non-maximum suppression (NMS) is employed as a post-processing step to significantly reduce false positive candidates by removing bounding boxes that have high overlap ratios (intersection over union) with each other. Nonetheless, using bounding boxes to localize the pigs is sub-optimal in this application. The deformable, long oval shapes of pigs are very challenging for bounding-box-based approaches in crowded scenes. As shown in Fig. 4C1, the bounding boxes around two adjacent pigs have a very high overlap ratio, and this ambiguity tends to confuse the training of the neural network. Moreover, at inference, the NMS post-processing step forces the detector to select only one bounding box in such highly overlapping cases, resulting in false negatives. Compared with bounding boxes, pig skeletons defined by keypoints are more suitable for differentiating pigs in a crowd, as shown in Fig. 4C2. In this work, we defined five pig body keypoints, including one mid point (red), two quarter points (green) and two end points (blue), together with tree-structured pig skeletons connecting the adjacent keypoints.
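The NMS failure mode described above can be illustrated with a minimal sketch. The box coordinates, the scores and the 0.5 overlap threshold below are hypothetical values chosen for illustration; they are not taken from the paper's detectors.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep boxes in descending score order, dropping any box
    that overlaps an already kept box by more than iou_threshold."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


# Hypothetical boxes around two adjacent pigs lying side by side.
pig_a = (10, 10, 110, 70)
pig_b = (20, 15, 120, 75)

overlap = iou(pig_a, pig_b)                   # about 0.70
kept = nms([pig_a, pig_b], [0.9, 0.8])        # only the first box survives
```

Both boxes are correct detections in this example, yet the second one is suppressed because its overlap with the first exceeds the threshold, which is the false negative the text refers to.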
We proposed a bottom-up detection method to overcome the aforementioned limitations. This method consisted of a keypoint detection step and a keypoint association step. Both steps were based on a deep convolutional encoder-decoder network. The network output two different kinds of maps: 1) a four-channel keypoint heatmap and 2) an offset vector field (Fig. 5d). The heatmap provided the information needed to classify each pixel as one of the keypoint classes or as background. The offset vector field indicated the relative positions of adjacent keypoints, which helped the system group the keypoints and identify which pig instance they belonged to.
Keypoint detection The goal was to detect all visible keypoints belonging to each single pig in the input. For this purpose, we applied a fully convolutional network to produce heatmaps with four channels (one channel for each of the three keypoint types and one for background), which had the same size as the input image. The heatmap prediction was formulated as a per-pixel multi-class classification problem: for each pixel location, the neural network learned to predict whether it belonged to one of the keypoint types or to the background. Following prior work, we generated the classification targets as follows. Let $D_R(p)$ be a circular region centered at position $p$ with radius $R$. We denoted $p_{j,k}$ as the $j$-th keypoint of type $k$. All pixels in $D_R(p_{j,k})$ had the same class label $k$. In this work, $R$ was set to 5. Cross-entropy loss was employed for this task. At the testing stage, the local maxima of the heatmaps were chosen as the predicted keypoints.
Keypoint association Due to the instance-agnostic nature of the keypoints predicted on the heatmaps, a unique instance ID had to be assigned to each detected keypoint so that we could "connect the dots" belonging to the same individual instance. For this purpose, we added to our neural network a separate two-channel output, an offset field indicating the displacement from a given keypoint to its parent in the skeleton (Fig. 4C2). Here we denoted $\pi(p_{j,k})$ as the parent node of keypoint $p_{j,k}$. If $x \in D_R(p_{j,k})$, the target offset $F(x)$ was a vector starting from $x$. If $p_{j,k}$ itself was a root node, i.e. $\pi(p_{j,k}) = \varnothing$, $F(x)$ ended at $p_{j,k}$; otherwise $F(x)$ ended at $\pi(p_{j,k})$.
Let us denote the offset field predicted by the network as $\hat{F}$. In order to supervise the training, the regression loss for the offset field was defined as
$$L_{\mathrm{offset}} = \sum_{x} M(x)\,\lVert \hat{F}(x) - F(x) \rVert,$$
where $M$ was the binary background mask used for ignoring the regression loss at the background pixels, where the offset vectors were undefined.
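The target generation described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `make_targets`, the integer type labels, the use of label 0 for background and the integer pixel coordinates are all hypothetical choices, and `None` entries play the role of the background mask (offset undefined).

```python
import math

BACKGROUND = 0  # hypothetical label for pixels outside every keypoint disc


def make_targets(height, width, keypoints, parents, radius=5):
    """Build per-pixel class labels and offset targets.

    keypoints: list of (x, y, type_label) with integer pixel coordinates.
    parents:   list with the index of each keypoint's parent, or None for a root.
    Returns (labels, offsets); offsets[y][x] is None at background pixels.
    """
    labels = [[BACKGROUND] * width for _ in range(height)]
    offsets = [[None] * width for _ in range(height)]
    for idx, (kx, ky, ktype) in enumerate(keypoints):
        parent = parents[idx]
        # A root keypoint points at itself, any other keypoint at its parent.
        tx, ty = (kx, ky) if parent is None else keypoints[parent][:2]
        for y in range(max(0, ky - radius), min(height, ky + radius + 1)):
            for x in range(max(0, kx - radius), min(width, kx + radius + 1)):
                if math.hypot(x - kx, y - ky) <= radius:
                    labels[y][x] = ktype
                    offsets[y][x] = (tx - x, ty - y)
    return labels, offsets


# One mid point (type 1, root) and one quarter point (type 2) whose parent is the mid point.
labels, offsets = make_targets(24, 32, [(10, 10, 1), (20, 10, 2)], [None, 0])
```

Every pixel inside a disc stores the vector from that pixel to the target keypoint, so a prediction made slightly off-center still points at the same parent location.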
At the testing stage, an iterative greedy algorithm was adopted to associate the predicted keypoints. We alternately searched for the best candidate parent node of every predicted keypoint and removed the surplus keypoints from the candidate children lists, until no better hypothesis could be found. The best candidate parent node was defined as the keypoint that was in the correct class and matched the predicted offset vector best. The Euclidean distance between the predicted offset and the actual offset was used to measure the match.
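A simplified, single-pass version of this association step is sketched below. It is not the authors' iterative algorithm: it sorts all child-parent hypotheses by the Euclidean distance between predicted and actual offset and accepts them greedily, with a hypothetical `capacity` parameter limiting how many children one parent may take.

```python
import math


def associate(children, parents, capacity=1):
    """Greedily link child keypoints to parent keypoints.

    children: list of (x, y, dx, dy), a position plus the predicted offset to the parent.
    parents:  list of (x, y) candidate parents of the expected keypoint type.
    capacity: maximum number of children accepted per parent (assumption).
    Returns a dict {child_index: parent_index}.
    """
    pairs = []
    for ci, (x, y, dx, dy) in enumerate(children):
        for pi, (px, py) in enumerate(parents):
            # Distance between the predicted offset and the actual offset.
            err = math.hypot((px - x) - dx, (py - y) - dy)
            pairs.append((err, ci, pi))
    pairs.sort()

    links = {}
    load = [0] * len(parents)
    for err, ci, pi in pairs:
        if ci not in links and load[pi] < capacity:
            links[ci] = pi
            load[pi] += 1
    return links


# Two mid points and four quarter points; each mid point accepts two quarter points.
mid_points = [(50, 50), (50, 80)]
quarter_points = [(30, 50, 19, 1), (70, 50, -20, 0), (30, 80, 20, 0), (70, 82, -20, -2)]
links = associate(quarter_points, mid_points, capacity=2)
```

In this toy input each quarter point is linked to the mid point its offset points at, so the four keypoints split into two pig instances.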
Architecture of the network We proposed an architecture (Fig. 5b) for the network. Depthwise separable convolutions were used as the basic building blocks to reduce the computational cost. Following prior work, we used location-withheld max pooling to improve the localization accuracy: the indices selected at the max pooling layers of the encoder were preserved and passed to the corresponding up-sampling layers of the decoder.
III-C Keypoints Tracking
In order to count pigs across video frames, an efficient online tracking method was employed to associate pig keypoints temporally. This method took the grouped pig keypoints of single frames as input, and assigned a unique identification number (ID) to each pig across frames. This problem was formulated as a bipartite-graph-matching-based energy maximization problem. The estimated pig candidates $P^{t}$ at frame $t$ were associated with the previous pig candidates $P^{t-1}$ at frame $t-1$ by bipartite graph matching:
$$\hat{Z} = \arg\max_{Z} \sum_{i}\sum_{j} z_{ij}\,\psi(p_i^{t}, p_j^{t-1}), \qquad \text{s.t. } \sum_{i} z_{ij} \le 1,\ \sum_{j} z_{ij} \le 1,$$
where $p_i^{t}$ was the $i$-th pig candidate in $P^{t}$ and $p_j^{t-1}$ was the $j$-th pig candidate in $P^{t-1}$. $z_{ij}$ was a binary variable that indicated whether $p_i^{t}$ and $p_j^{t-1}$ were associated. The potential $\psi$ represented the similarity between $p_i^{t}$ and $p_j^{t-1}$:
$$\psi(p_i^{t}, p_j^{t-1}) = \alpha\, S_a(p_i^{t}, p_j^{t-1}) + \beta\, S_s(p_i^{t}, p_j^{t-1}),$$
where $S_a$ represented the keypoint appearance similarity between candidates and $S_s$ the spatial similarity. $\alpha$ and $\beta$ were hyper-parameters that balanced the contributions of the two terms.
The spatial similarity $S_s$ was calculated from the distance between the propagated spatial location and the encoded center location. The appearance similarity $S_a$ was calculated from the distance between the deep features embedded at all keypoints of $p_i^{t}$ and $p_j^{t-1}$, where $f_k$ represented the deep appearance feature of keypoint $k$, obtained from the convolution layer before the last up-sampling layer of our keypoint CNN, and $w_k$ were hyper-parameters that weighted the keypoints.
The aforementioned bipartite graph matching problem was solved using the Hungarian method.
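To make the matching objective concrete, the sketch below finds the assignment with the highest total similarity by exhaustive search. This is a stand-in for the Hungarian method that is only practical for a handful of candidates; the function name and the similarity values are hypothetical. In practice a library solver such as `scipy.optimize.linear_sum_assignment` with `maximize=True` returns the same optimum in polynomial time.

```python
from itertools import permutations


def best_matching(similarity):
    """Return (pairs, score) maximizing the summed similarity.

    similarity[i][j] is the potential between current candidate i and
    previous candidate j. Assumes there are at least as many columns as rows.
    """
    n_rows, n_cols = len(similarity), len(similarity[0])
    best_score, best_pairs = float("-inf"), []
    for cols in permutations(range(n_cols), n_rows):
        score = sum(similarity[i][j] for i, j in enumerate(cols))
        if score > best_score:
            best_score, best_pairs = score, list(enumerate(cols))
    return best_pairs, best_score


# Hypothetical similarities where the locally best pair (0, 0) is not part of the optimum.
similarity = [
    [0.90, 0.80],
    [0.85, 0.10],
]
pairs, score = best_matching(similarity)
```

Picking the largest entry first would pair candidate 0 with previous candidate 0 and leave candidate 1 with a poor match, whereas the global optimum swaps them, which is why an optimal assignment solver is preferred over per-candidate nearest matching.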
III-D Spatial-Aware Temporal Response Filtering
Traditionally, video-based counting methods [9, 8] took the number of unique tracklet IDs as the final counting result. These methods were suitable for cases where the target objects were stationary and object occlusion was very rare. In the large-scale pig counting scenario, however, pigs moved fast in different directions, and the same pig would often walk out of the camera view and come back again. In addition, indoor building structures (e.g. the feeding machine) would sometimes block a large part of the camera view, causing severe occlusions. Occlusions lasting many frames would cause tracking failures and break the trajectory of one single object into two or more. In these cases, counting the number of unique tracklet IDs would suffer from large false positive errors. To overcome these limitations, we presented a novel spatial-aware temporal response filtering (STRF) method to perform online counting while minimizing the false positives.
The STRF took the tracking trajectories of all previous frames as input, and output the final counting number. It consisted of two steps: 1) spatial encoding; and 2) temporal response filtering. The spatial encoding stage processed each video frame independently, and each detected pig candidate in the frame was assigned a code number based on its spatial location. The temporal response filtering stage examined each candidate's trajectory across time and obtained a count number, $c_i$, for this single candidate. The final counting result was the sum of the count numbers of all candidates: $C = \sum_i c_i$.
As shown in Fig. 6a, the spatial encoding stage divided each image frame into an activated zone and a deactivated zone by an activity scanning line. This scanning line was stationary within a single frame, but scanned the whole pig house as it moved with the inspection robot. Each detected pig candidate was assigned an activity code based on the zone it was in. In our work, pigs in the activated zone were assigned code value $0$, and pigs in the deactivated zone were assigned code value $1$. The deactivated zone indicated that all candidates inside it had already been counted by the algorithm; the candidates in the activated zone would be counted when the activity scanning line scanned through them.
In the temporal response filtering step, a list of spatial codes in temporal order was generated for each trajectory. Each trajectory had one list of spatial codes, and each element of the list corresponded to a time point. Fig. 6c illustrates one example of a single pig trajectory over seven consecutive time points, where the blue and red colors represent the two code values. As shown, the generated temporal code was [0, 0, 0, 0, 1, 1, 1]. The final count for this trajectory was obtained as the sum of the first-order differences of the temporal codes. In this case, the count was $1$, which indicated that this pig was scanned once (it moved from the activated zone into the deactivated zone) and that $1$ should be added to the total count. Similarly, Fig. 6d shows a pig trajectory with code [1, 0, 0, 0, 1, 1, 0, 0], for which the sum of the first-order differences gave a count of $-1$. This meant that this pig, which had been counted before, moved from the scanned zone to the to-be-scanned zone. Thus, $1$ should be subtracted from the total count. This design enabled the algorithm to avoid false positive counts caused by pigs moving into or out of the camera view. Fig. 6e-g show examples in which the pig trajectory count was $0$. Fig. 6e-f represent pig trajectories that never crossed the scanning line. Fig. 6g represents cases where the trajectory started and ended in the same activity zone. These examples demonstrate that STRF would not be influenced by tracking failures (e.g. one trajectory broken into several by occlusion) that happened within a single zone. In this study, a low-pass filter with a window size of 5 was applied before the first-order difference was calculated. This low-pass filtering step was designed to suppress trajectory jitter near the activity scanning line.
The final counting result for the whole video also added the number of candidates detected in the deactivated zone of the first frame and the number of candidates detected in the activated zone of the last frame.
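The counting rule can be sketched in a few lines. The sketch assumes code 0 for the activated zone and code 1 for the deactivated zone, which is the assignment consistent with the worked examples, and it uses a majority-vote filter with edge padding as the low-pass filter, since the text does not specify the filter type. All function names are hypothetical.

```python
def smooth_codes(codes, window=5):
    """Majority-vote low-pass filter with edge padding (assumed filter type)."""
    if not codes:
        return []
    half = window // 2
    padded = [codes[0]] * half + list(codes) + [codes[-1]] * half
    return [1 if 2 * sum(padded[i:i + window]) > window else 0
            for i in range(len(codes))]


def trajectory_count(codes, window=5):
    """Sum of first-order differences of the smoothed spatial codes."""
    smoothed = smooth_codes(codes, window)
    return sum(b - a for a, b in zip(smoothed, smoothed[1:]))


def total_count(trajectories, deactivated_in_first_frame, activated_in_last_frame, window=5):
    """Per-trajectory counts plus the two boundary terms of the first and last frame."""
    crossings = sum(trajectory_count(codes, window) for codes in trajectories)
    return crossings + deactivated_in_first_frame + activated_in_last_frame


scanned_once = trajectory_count([0, 0, 0, 0, 1, 1, 1])        # +1
moved_back = trajectory_count([1, 0, 0, 0, 1, 1, 0, 0])       # -1
stayed = trajectory_count([0, 0, 0])                          # 0

# Hypothetical pass: four trajectories, 3 pigs already behind the line in the
# first frame and 2 pigs still ahead of it in the last frame.
total = total_count(
    [[0, 0, 0, 0, 1, 1, 1], [1, 0, 0, 0, 1, 1, 0, 0], [0, 0, 0], [1, 1, 1, 1]],
    deactivated_in_first_frame=3,
    activated_in_last_frame=2,
)
```

Because the per-trajectory count depends only on where the smoothed code sequence starts and ends, a trajectory that is broken and restarted inside one zone contributes zero, which matches the behaviour described for Fig. 6e-g.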
We collected 51 videos with inspection robots installed in the pig grouping houses of two different pig farming corporations. All videos were originally recorded at 1280×720 resolution with a frame rate of 25 frames per second. For this study, we first resized the video frames to 360×640, and then cropped them to 352×640. All experiments in this work used this resolution. Each video (pig house) contained 120 to 250 pigs. The length of the videos ranged from 2 minutes to 4 minutes. We randomly split these videos into three subsets: 21 for training, 5 for validation and 25 for testing. The ground truth was provided by workers, who counted the pigs inside the grouping houses when the videos were recorded.
IV-A Comparison With Human Reader
To demonstrate the effectiveness of our method, we compared the performance of our counting system with that of human readers on the test dataset. There were three readers in this study. The readers were required to provide counts by watching the same top-down view videos that were used as input to the algorithm. There was no time limit for the reading process, and the readers were allowed to pause, rewind and replay the videos and to take notes as often as they wished. Each reader estimated the pig count for every video in the test dataset. The counting errors of both the proposed method and the human readers were evaluated using mean absolute percentage error (MAPE) and mean absolute error (MAE). The three readers had MAPE of 11.0%, 17.4% and 15.9%, and MAE of 12.6, 26.3 and 25.2, respectively. The average time that the human readers spent per video was around 1.5 hours. In contrast, our method had a MAPE of 2.67% and an MAE of 3.32, significantly outperforming the human readers.
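For reference, the two error metrics can be computed as in the short sketch below. The per-video counts in the example are made-up numbers for illustration only and are not data from this study.

```python
def mae(predicted, truth):
    """Mean absolute error between predicted and ground-truth counts."""
    return sum(abs(p - t) for p, t in zip(predicted, truth)) / len(truth)


def mape(predicted, truth):
    """Mean absolute percentage error, in percent."""
    return 100.0 * sum(abs(p - t) / t for p, t in zip(predicted, truth)) / len(truth)


# Hypothetical counts for three videos.
truth = [120, 200, 250]
predicted = [118, 205, 250]

example_mae = mae(predicted, truth)     # (2 + 5 + 0) / 3
example_mape = mape(predicted, truth)   # 100 * (2/120 + 5/200 + 0) / 3
```

MAPE normalizes each error by the true count of its video, so a miss of five pigs weighs more in a house of 120 pigs than in a house of 250.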
IV-B Ablation Study
To validate our proposed CNN architecture for keypoint detection, we compared our method with UNet and the stacked Hourglass network using the same training, validation and test datasets. Both methods were modified to fit our pixel-level keypoint detection pipeline. Following prior work, the cropping operators were removed from UNet and 7 UNet submodules were used. The stacked Hourglass network tested had two hourglass modules stacked. The Percent of Detected Joints (PDJ) was used as the evaluation metric. A keypoint was considered detected if the distance between the predicted keypoint and the ground truth was smaller than a fraction of the total length of the skeleton of the pig. As shown in Table I, our method achieved better keypoint detection accuracy for all 5 body parts with significantly lower computation cost and a smaller parameter size.
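A minimal sketch of the PDJ computation follows. The fraction of the skeleton length used as the threshold is not stated in the text, so the default of 0.2 below is a hypothetical value, as are the function name and the example coordinates.

```python
import math


def pdj(predicted, truth, skeleton_length, fraction=0.2):
    """Share of keypoints whose prediction lies within
    fraction * skeleton_length of the ground truth (strictly smaller)."""
    threshold = fraction * skeleton_length
    detected = sum(
        1
        for (px, py), (tx, ty) in zip(predicted, truth)
        if math.hypot(px - tx, py - ty) < threshold
    )
    return detected / len(truth)


# Hypothetical pig with a 40-pixel skeleton: the threshold is 8 pixels.
predicted = [(0, 0), (10, 0), (30, 0)]
truth = [(0, 1), (10, 9), (30, 0)]
score = pdj(predicted, truth, skeleton_length=40)   # two of three keypoints detected
```

Scaling the threshold by the skeleton length keeps the metric comparable between pigs that appear at different sizes in the fisheye image.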
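Table II: Counting error (MAPE and MAE) of each detector with tracking, without and with STRF.
| Method | MAPE | MAE |
| --- | --- | --- |
| SSD + Tracking | 327% | 412 |
| YOLOv3 + Tracking | 247% | 368 |
| Proposed Keypoint Detector + Tracking | 152% | 191 |
| SSD + Tracking + STRF | 10.1% | 12.2 |
| YOLOv3 + Tracking + STRF | 5.35% | 7.00 |
| Proposed Keypoint Detector + Tracking + STRF | 2.67% | 3.32 |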
We also compared our bottom-up detection method with SSD and YOLOv3 using the top-down bounding box detection metric, mean average precision at 0.5 IOU (mAP@0.5). The proposed bottom-up approach did not directly output bounding boxes of pigs. Thus, we used the bounding boxes of the keypoints/skeletons instead. It should be noted that the keypoint bounding boxes are stricter and harder to predict, and that our network was never trained for the bounding box detection task. SSD achieved 73.3% mAP and YOLOv3 achieved 79.7% mAP, while our method achieved 84.3% mAP. Although its task was more challenging, our method showed better performance. It should also be noted that a large part of the detection failures happened around the image boundaries, where large fisheye distortion and image cutoff occurred. Due to the design of the STRF method, most of these failures did not influence the final counting result.
To evaluate the effectiveness of the STRF method, we compared the counting results with and without STRF using our detection method, SSD and YOLOv3, respectively. Table II shows that the MAPE and MAE were significantly smaller when STRF was used. Our method also achieved better performance than SSD and YOLOv3, both with and without STRF.
IV-C Runtime Analysis
We analyzed the runtime performance of our method using the test dataset. On a desktop computer with an Intel i7-6850K CPU and 32GB of DDR4 2133MHz memory, it ran at 3.42 frames per second (FPS). When accelerated by a single NVIDIA GeForce GTX 1080Ti GPU, it achieved 82.6 FPS. The proposed counting algorithm has also been deployed on two different edge computing devices. It achieved 0.625 FPS on a Firefly-RK3399 platform, which had 2GB of memory and a Rockchip RK3399 CPU. On an NVIDIA Jetson Nano platform with 4GB of memory, a quad-core ARM A57 CPU and 128 CUDA cores, it achieved 3.19 FPS.
In this work, we presented a hardware configuration and a novel, efficient algorithm for pig counting in large grouping houses. An inspection robot with a monocular fisheye camera was installed on rails on the roof, along which the robot could move back and forth to collect top-down view videos. First, a novel and efficient bottom-up CNN detection approach was developed to detect pigs in the crowd. Second, an online tracking method was employed to associate pig IDs temporally. Finally, a novel STRF method was proposed to calculate the final pig count while avoiding most false positive counts caused by tracking failures or large pig movements. The low-computation-cost design significantly reduced the computation time and model size. This counting algorithm has been deployed on the edge computing device of the inspection robot, and achieved counting accuracy superior to that of human readers.
- (2017) SegNet: a deep convolutional encoder-decoder architecture for image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 39 (12), pp. 2481–2495.
- (2017) Realtime multi-person 2D pose estimation using part affinity fields. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7291–7299.
- (2018) Object detection for cattle gait tracking. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 2206–2213.
- (2016) Precision wildlife monitoring using unmanned aerial vehicles. Scientific Reports 6, pp. 22574.
- (2017) MobileNets: efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861.
- (2017) Image-to-image translation with conditional adversarial networks. In IEEE Conference on Computer Vision and Pattern Recognition.
- (2016) SSD: single shot multibox detector. In European Conference on Computer Vision, pp. 21–37.
- (2018) Robust fruit counting: combining deep learning, tracking, and structure from motion. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 1045–1052.
- (2019) Monocular camera based fruit counting and mapping with semantic data association. IEEE Robotics and Automation Letters 4 (3), pp. 2296–2303.
- (2016) Stacked hourglass networks for human pose estimation. In European Conference on Computer Vision, pp. 483–499.
- (2018) PersonLab: person pose estimation and instance segmentation with a bottom-up, part-based, geometric embedding model. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 269–286.
- (2018) On-barn pig weight estimation based on body measurements by a Kinect v1 depth camera. Computers and Electronics in Agriculture 148, pp. 29–36.
- (2018) YOLOv3: an incremental improvement. arXiv preprint arXiv:1804.02767.
- (2015) Faster R-CNN: towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pp. 91–99.
- (2018) Detection of cattle using drones and convolutional neural networks. Sensors 18 (7), pp. 2048.
- (2015) U-Net: convolutional networks for biomedical image segmentation. In MICCAI, pp. 234–241.
- (2018) Crowd counting via adversarial cross-scale consistency pursuit. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5245–5254.
- (2017) Generating high-quality crowd density maps using contextual pyramid CNNs. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1861–1870.
- (2019) Automated pig counting using deep learning. Computers and Electronics in Agriculture 163.
- (2014) DeepPose: human pose estimation via deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1653–1660.