Australia has a reputation for safe and high-quality food. Trust in the provenance of that food - including region of origin, sustainability and ethical production - creates premium products that command higher prices and open new markets for Australian producers. Most of the provenance claims valued by consumers relate to the farm where the animals are raised. These include origin (region or country), animal welfare and sustainable practices. As part of automated farm provenance, we have conducted experiments and research to automatically detect and classify animal behaviours.
This research component focuses on monitoring cattle’s activities and behaviours, particularly drinking events, using annotations of video recordings of cattle near the water trough area, as shown in Fig. 1. This information was then used to label signals from motion sensors on the cattle’s collars and ear tags to train embedded machine-learning algorithms. Because manually labelling cattle behaviours from videos (and synchronising the labels with motion sensor signals) is labour-intensive, there is a need for an image-based machine learning solution that automates this process once enough labelled image data is available. The problems this project aims to solve include:
What camera views are optimal for capturing informative video streams of cow behaviours, particularly drinking behaviour events?
What is the best way to annotate video?
How to identify individual cows from video?
How to classify cow behaviours from video and associate behaviours with cow IDs?
A related work  proposes bounding-box cattle detection and single-frame behaviour recognition for hoof disease and estrous behaviour, based on body curvature and the overlap between bounding boxes. However, this work does not identify individual cows or distinguish subtle behaviours such as drinking versus grazing, so it is difficult to know which cow is doing what. Existing approaches to identifying individual cows include using nose patterns  or body patterns  or rear-view videos , but these require a well-controlled environment and/or close-up image capture that are difficult to set up in the field. To reduce the complexity and cost of cattle behaviour recognition, motion sensors such as accelerometers and gyroscopes associated with cattle IDs  have been used, but these lack behaviour annotation. Video recording and labelling are therefore required to generate annotations so that machine learning algorithms can recognise behaviours from sensor signals [23, 20].
In this paper, we propose a solution to detect, identify and recognise behaviours of individual cattle. We collected videos from multiple cameras, annotated video for identifying individual cattle and their action, and then developed an image-based deep-learning solution to automate the process. Source code for a baseline solution, raw videos and data generated from this project are released at https://github.com/chuong/cattle_identification_action_recognition.
There are a few challenges when working with cows in the field:
It is challenging to identify similar animals from a larger herd in the field. Because the cows in our experiment are all black and of similar size, cow identification based on body patterns  or nose prints  cannot be used. Furthermore, cow identification based on rear-view videos and body movements  cannot be applied in an uncontrolled field environment. This in turn makes it difficult to associate activities with cow IDs.
Behaviour annotation is challenging, particularly for non-expert annotators, because many behaviours look similar and are hard to distinguish. Furthermore, the transition between different behaviours is sometimes not clear-cut.
It is difficult to synchronise video recordings from cameras that record asynchronously in the field.
Videos are long and require significant resources to generate a complete set of annotations.
There is significant self-occlusion when the animals get close together, and action recognition could be view dependent.
As these tasks are challenging for humans, it is also challenging to train a computer to perform them. A number of technical solutions are therefore used to work around these problems. The experiment was approved by the CSIRO FD McMaster Laboratory Chiswick Animal Ethics Committee with the animal research authority number 17/20.
II-A Data collection
Multiple fisheye cameras (GoPro Hero 5 Black) with built-in GPS receivers were installed at the CSIRO Armidale site to record videos. The GPS signal embedded in the recorded videos provided accurate timestamps to synchronise the different cameras to within the uncertainty of one frame. External power banks were used to power the cameras throughout the day in all weather conditions.
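As a minimal sketch of how GPS timestamps can align frames across cameras, the function below maps a UTC time to the nearest frame index in a video. It assumes a constant frame rate and a known GPS time for the first frame of each video. The function name `frame_index_at`, the start times and the frame rate of 30 are hypothetical and not taken from the experiment.

```python
from datetime import datetime, timedelta, timezone


def frame_index_at(utc_time, first_frame_utc, fps):
    """Return the index of the frame closest to utc_time.

    first_frame_utc is the GPS-derived UTC time of the video's first frame
    (hypothetical input; how it is extracted is outside this sketch).
    """
    offset_s = (utc_time - first_frame_utc).total_seconds()
    if offset_s < 0:
        raise ValueError("utc_time is before the start of the video")
    return round(offset_s * fps)


# Hypothetical start times for two cameras that began recording 2.5 s apart.
start_a = datetime(2020, 3, 18, 0, 0, 0, tzinfo=timezone.utc)
start_b = start_a + timedelta(seconds=2.5)
event = start_a + timedelta(seconds=60)

frame_a = frame_index_at(event, start_a, fps=30)
frame_b = frame_index_at(event, start_b, fps=30)
print(frame_a, frame_b)  # 1800 1725
```

With this mapping, an event marked in one camera's video can be looked up in the others by converting through UTC time rather than through frame counts.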
To tell the cows apart, numbers were handwritten on different parts of each cow’s body so that at least one of them is visible from any view angle at close enough distances, as shown in Fig. 2. Fortunately, the shapes of the limited set of numbers are distinct enough for a deep network to distinguish without explicit handwriting recognition.
II-B Data annotation
Because the video recordings are long, annotation is only applied when at least one cow comes close to the main camera. As this work focuses on the drinking action, all video segments containing drinking actions at the water trough are first marked in the videos recorded by the top camera. The duration of each video segment is then doubled, centred on the original segment, and annotation is applied to the frames within the doubled duration. This allows other, non-drinking behaviours to be observed and annotated without causing a significant class imbalance.
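The doubling of each drinking segment can be sketched as follows. This is an illustration only: the function name `expand_segment` is hypothetical, and clipping the window to the start and end of the video is an assumption that the text does not state.

```python
def expand_segment(start_s, end_s, video_length_s):
    """Double a segment's duration while keeping it centred on the original."""
    if end_s < start_s:
        raise ValueError("end_s must not be earlier than start_s")
    half_extra = (end_s - start_s) / 2.0  # half the original duration per side
    new_start = max(0.0, start_s - half_extra)          # assumed clipping
    new_end = min(video_length_s, end_s + half_extra)   # assumed clipping
    return new_start, new_end


# A 12-second drinking segment becomes a 24-second annotation window.
print(expand_segment(311.0, 323.0, 3600.0))  # (305.0, 329.0)
```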
To reduce annotation effort, only key frames are selected for ID annotation; linear interpolation between the key frames gives sufficiently accurate annotations for the frames in between. Cow IDs and associated bounding boxes are manually annotated on the key frames and saved as a ground truth dataset for identification. The cow ID and bounding box annotations are visualised to verify their accuracy.
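Linear interpolation of a bounding box between two key frames can be sketched as below. The (x, y, width, height) box layout, the function name `interpolate_box` and the example values are assumptions made for illustration.

```python
def interpolate_box(box_a, box_b, frame_a, frame_b, frame):
    """Linearly interpolate a box between key frames frame_a and frame_b.

    Boxes are (x, y, width, height) tuples (an assumed layout).
    """
    if frame_a >= frame_b or not frame_a <= frame <= frame_b:
        raise ValueError("need frame_a <= frame <= frame_b and frame_a < frame_b")
    t = (frame - frame_a) / (frame_b - frame_a)  # 0 at frame_a, 1 at frame_b
    return tuple(a + t * (b - a) for a, b in zip(box_a, box_b))


# Hypothetical key-frame annotations for one cow, eight frames apart.
key_0 = (100.0, 200.0, 50.0, 40.0)
key_8 = (120.0, 192.0, 58.0, 40.0)
print(interpolate_box(key_0, key_8, 0, 8, 2))  # (105.0, 198.0, 52.0, 40.0)
```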
Behaviour annotation is performed separately and directly on the videos and stored in VTT subtitle  files. Each entry includes the cow ID and the start and end times of a behaviour or action. VLC player is used to play the video and verify the annotation from the subtitles.
Both cow ID and behaviour annotations are shown together in Fig. 3.
II-C Approaches in object detection
An object detection network is needed for cow ID detection. This could be done using popular deep network architectures such as YOLO  and Faster R-CNN . Unlike conventional object detection and classification, where objects have distinct appearances, the cows in this experiment look almost identical except for the numbers written on their bodies. Cow detection is therefore expected to work well, but classifying cows for identification is challenging. For this reason, Cascade R-CNN  was chosen, as it is expected to be better at fine-grained object detection and classification.
II-D Approaches in action recognition
There are two approaches to action detection for cow behaviours. The first approach detects the action of a single cow that has already been identified by ID detection. This approach requires a video containing only a single cow in the middle of the image. The second approach is action recognition for multiple objects in an image, which includes both object detection and action recognition.
Unfortunately, existing network architectures for the second approach, such as YOWO  and MOC-Detector , require training from scratch on new datasets and do not train well on our cow dataset. One reason is the limited size of the dataset. Another is that the cows’ actions look very similar to each other, partly because their heads point toward the ground most of the time. This makes it difficult for the second approach to distinguish similar cows and recognise their actions at the same time. As a result, we have to rely on the first approach. One candidate for action recognition is Temporal Segment Networks (TSN) , which recognises actions from a stack of frames, with optional input of optical flow between consecutive frames.
III-A Data collection and preprocessing
An experiment was carried out from 18/03/2020 to 27/03/2020. Every day, eight cows, with IDs painted on their sides and heads to distinguish them, were released into the monitoring area. A water trough with an automatic refilling mechanism was installed at the side of the patch and connected to a water supply. The experiment protocol was approved by the Animal Research Authority.
Three GoPro Hero 5 Black  cameras were mounted on poles next to the water trough to focus on drinking behaviour events. The cameras were powered by large power banks connected via USB-C cables.
Videos were recorded from the 3 cameras for 8 hours a day for 8 days. The views from three cameras next to the water trough are shown in Fig. 2.
The middle camera view shows clear separation between cows, so video annotation was carried out on the videos recorded from this camera. GPS data embedded in the GoPro video files can be extracted to obtain timestamps using FFMPEG  and GoPro-Utils [19, 8]: the commands ffmpeg and gpmd2csv are used to extract the GPS timestamps and locations from the GoPro videos.
Annotation for cow identification was performed using the CVAT  annotation tool, which can export to different data formats for deep learning. This annotation was carried out on the videos recorded over the 8 days of the experiment and exported to the COCO dataset format. Training, validation and testing data are split in a 0.70:0.05:0.25 ratio.
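A 0.70:0.05:0.25 split could be produced along the following lines. This is a sketch under assumptions: the text does not say whether samples were shuffled or how they were grouped before splitting, and the helper `split_dataset` with its fixed random seed is hypothetical.

```python
import random


def split_dataset(items, ratios=(0.70, 0.05, 0.25), seed=0):
    """Shuffle items and split them into training, validation and testing lists."""
    items = list(items)
    random.Random(seed).shuffle(items)  # assumed: random split with a fixed seed
    n_train = round(len(items) * ratios[0])
    n_val = round(len(items) * ratios[1])
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]  # the remainder goes to the test set
    return train, val, test


train, val, test = split_dataset(range(100))
print(len(train), len(val), len(test))  # 70 5 25
```

Note that shuffling individual frames from the same video can place near-identical images in both training and testing sets; splitting by video segment or by day is an alternative worth considering.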
Annotation for cow behaviour was performed using the VLC video player  and VTT subtitles . Cow ID and behaviour annotations are processed using a custom Python script to export to the Kinetics  dataset format.
Each cow is annotated with a bounding box and an ID using CVAT . The annotation was performed on key frames, 1 for every 9 frames of the video. Linear interpolation is performed to obtain annotations for the frames between the key frames. Behaviours are recorded in VTT subtitle  files, which store the time span of each behaviour to an accuracy of one second, for example:
0:05:11.000 --> 0:05:23.000
Cow 2 Drinking

0:05:17.000 --> 0:05:42.000
Cow 4 Other

0:05:22.000 --> 0:05:40.000
Cow 8 Grazing
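Cues written in this way can be read back into structured records with a short parser. The sketch below is not the custom script mentioned in the text; the regular expression, the function names and the output dictionary layout are assumptions. It only handles cues whose text has the form "Cow <number> <behaviour>".

```python
import re

# Matches "H:MM:SS.mmm --> H:MM:SS.mmm" followed by "Cow <id> <behaviour>".
CUE_RE = re.compile(
    r"(\d+):(\d{2}):(\d{2})\.(\d{3})\s+-->\s+"
    r"(\d+):(\d{2}):(\d{2})\.(\d{3})\s+"
    r"Cow\s+(\d+)\s+(\w+)"
)


def to_seconds(hours, minutes, seconds, millis):
    """Convert timestamp fields (as strings) to seconds."""
    return int(hours) * 3600 + int(minutes) * 60 + int(seconds) + int(millis) / 1000.0


def parse_cues(text):
    """Return one record per cue: start, end, cow_id and behaviour."""
    cues = []
    for match in CUE_RE.finditer(text):
        g = match.groups()
        cues.append({
            "start": to_seconds(*g[0:4]),
            "end": to_seconds(*g[4:8]),
            "cow_id": int(g[8]),
            "behaviour": g[9],
        })
    return cues


sample = """0:05:11.000 --> 0:05:23.000
Cow 2 Drinking

0:05:17.000 --> 0:05:42.000
Cow 4 Other

0:05:22.000 --> 0:05:40.000
Cow 8 Grazing
"""
cues = parse_cues(sample)
print(cues[0])
```

Because cues for different cows may overlap in time, each record keeps its own cow ID so that behaviours can later be matched to the bounding boxes of the same cow.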
A visualisation of the combined ID, bounding box and behaviour annotations is shown in Fig. 3. The annotations were exported to different formats, including COCO  and Kinetics , depending on the input required by the deep learning platform.
For behaviour recognition, video clips of each action are extracted in Kinetics400 format with a frame size of 256×256 pixels. This is achieved by converting each cow’s rectangular bounding box to a square and scaling it to this image size, as shown in Fig. 4. Zero padding is added for regions outside the input image so that the cow stays mostly in the middle of the frame.
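The geometry of the square crop can be sketched as follows. The code only plans the crop: it reports which part of the square lies inside the image, how much zero padding each side needs, and the scale factor to the output size. Using the longer box side as the square side and the (x, y, width, height) box layout are assumptions, and the function name `square_crop_plan` is hypothetical.

```python
def square_crop_plan(box, image_size, out_size=256):
    """Plan a square crop centred on a bounding box.

    box is (x, y, width, height) in pixels; image_size is (width, height).
    Returns (inside, padding, scale): the crop region that lies inside the
    image, the zero padding needed per side, and the factor to out_size.
    """
    x, y, w, h = box
    img_w, img_h = image_size
    side = max(w, h)  # assumed: the longer box side defines the square
    if side <= 0:
        raise ValueError("box must have a positive size")
    centre_x, centre_y = x + w / 2.0, y + h / 2.0
    left, top = centre_x - side / 2.0, centre_y - side / 2.0
    right, bottom = left + side, top + side
    inside = (max(left, 0.0), max(top, 0.0), min(right, img_w), min(bottom, img_h))
    padding = {
        "left": max(0.0, -left),
        "top": max(0.0, -top),
        "right": max(0.0, right - img_w),
        "bottom": max(0.0, bottom - img_h),
    }
    return inside, padding, out_size / side


# A hypothetical cow near the right edge of a 1920x1080 frame.
inside, padding, scale = square_crop_plan((1800, 400, 200, 100), (1920, 1080))
print(inside, padding, scale)
```

In this example the square extends 80 pixels past the right edge of the image, so that strip would be filled with zeros before the crop is scaled to the output size.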
A new cow behaviour dataset containing a single cow in each video frame, following the format of Kinetics400, has been created. A total of 1715 videos were exported: the “Drinking” class has 360 videos, “Grazing” 413 videos and “Other” 942 videos. Again, training, validation and testing data are split in a 0.70:0.05:0.25 ratio.
III-B Cow identification
Among different deep learning frameworks, we chose the MMDetection framework , which is based on PyTorch and supports multiple object classification and detection architectures and different data formats. Cascade R-CNN, shown in Fig. 5, was trained with cow ID (8 classes) and bounding box annotations as ground truth data. After training Cascade R-CNN on the cow COCO dataset, testing yielded an average precision of 0.812 and an average recall of 0.844. For comparison, the testing accuracy achieved by  is 89.95% on their private dataset, which consists of cropped images captured from a relatively fixed view of a cow, while our cow images were captured from arbitrary views.
III-C Cow behaviour classification
We chose Temporal Segment Networks (TSN)  without temporal information, from the MMAction2 library , to train on the Kinetics  data exported from our annotations. Fine-tuning a TSN action recognition model (pretrained on Kinetics400) over 100 epochs with a learning rate of 8e-5 and a decay of 1e-4 leads to an average accuracy of 0.72. More details of the testing are shown in Table I. The accuracy for the “Drinking” and “Grazing” actions is quite high, while the accuracy for the “Other” action is quite low. This is because the “Other” videos contain some “Drinking” and “Grazing” actions that were missed during the annotation process.
III-D Joint cow identification and behaviour recognition
To process a video stream and obtain cow IDs and their behaviours, we combine the two networks into a single pipeline, as shown in Fig. 7. The first network takes a video frame and detects all cows and their bounding boxes. The second network takes a 1-second cropped video based on the bounding box of each cow and classifies its behaviour. The output of the pipeline is the cow ID and the confidence of each behaviour, as shown in Fig. 7.
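The control flow of such a two-stage pipeline can be sketched as below, with the two networks passed in as plain callables. This is not the released implementation: the interfaces of `detect_cows` and `classify_clip`, the use of non-overlapping 1-second windows, the reuse of boxes from the middle frame and the frame rate of 30 are all assumptions made for illustration.

```python
def label_stream(frames, detect_cows, classify_clip, fps=30):
    """Run ID detection and behaviour classification on 1-second windows.

    detect_cows(frame) -> iterable of (cow_id, box)        (hypothetical)
    classify_clip(frames, box) -> {behaviour: confidence}  (hypothetical)
    """
    results = []
    window = []
    for frame in frames:
        window.append(frame)
        if len(window) < fps:
            continue
        # Assumption: boxes from the middle frame are reused for the whole window.
        middle = window[len(window) // 2]
        for cow_id, box in detect_cows(middle):
            results.append((cow_id, classify_clip(window, box)))
        window = []  # start the next non-overlapping 1-second window
    return results


# Stand-in models so that the sketch runs without any trained network.
def fake_detector(frame):
    return [(2, (10, 20, 100, 80))]


def fake_classifier(frames, box):
    return {"Drinking": 0.8, "Grazing": 0.15, "Other": 0.05}


labels = label_stream(range(60), fake_detector, fake_classifier, fps=30)
print(labels)
```

Passing the models in as callables keeps the loop independent of any particular detection or action recognition library, so trained networks can be swapped in behind the same two interfaces.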
IV Discussion and future work
This research has led to three important outcomes. The first is a camera setup that can be used not only to acquire videos for annotating motion sensor data, but also to directly monitor cow welfare issues such as shy feeding. The second is annotated video data of cow IDs, bounding boxes and behaviours, which serves as ground truth for machine learning on the videos and on synchronised sensor data from collars and ear tags. The third is a processing pipeline with a trained cow ID detector and a trained behaviour classifier that can automatically label new video streams captured from similar views.
We demonstrated a working prototype that can automatically monitor cow welfare on farm using low-powered embedded sensors and edge machine learning. Such a system could play a major role in evidence-based quality control and automatic compliance in the supply chain.
Several improvements are needed before this system can be deployed in production and large-scale environments:
Large-scale cattle identification and tracking. Our work is limited to a small, fixed number of cattle (8) that are labelled and used to train a network. In reality, the number of cattle could be in the hundreds or thousands, so the current approach of using handwritten numbers to distinguish cows may not be practical. A possible approach is zero-shot identification and/or continuous 2D or 3D tracking. Additional sensors could be utilised to re-identify cattle.
Multiview cattle identification and behaviour recognition. Although videos from 3 or more cameras were captured, only videos from Camera 1 were labelled and used in this study. Single-view identification is not sufficient when there are more cattle and occlusion is unavoidable. A multiview camera approach could address this problem and potentially improve the accuracy.
Big data management and visualisation. The amount of data generated from multiple cameras is large and difficult to manage and synchronise. There is a need for a dedicated management system and a user interface to allow a user to navigate the data at any timestamp and across all available videos. This will also facilitate multiview data annotation and quality control.
This research was funded by Science and Industry Endowment Fund https://sief.org.au. We acknowledge the following technical staff who have contributed to the research at CSIRO FD McMaster Laboratory Chiswick: Katie Austin (Technical Officer), Alistair Donaldson (Technical Officer) and Reg Woodgate (Senior Technical Officer) with NSW Department of Primary Industries, and Jody McNally (Research Technician) and Troy Kalinowski (Technical Officer) with CSIRO Agriculture and Food. Thanks are also due to Dr Ron Li and Mr Wyman Huang from Imaging and Computer Vision Group of CSIRO Data61 for their support.
-  (2020) FFmpeg. Note: https://ffmpeg.org/ Cited by: §III-A.
-  (2020) Image-based individual cow recognition using body patterns. Image 11 (3). Cited by: §I, 1st item, §III-B.
-  (2020) Cattle identification: the history of nose prints approach in brief. In IOP Conference Series: Earth and Environmental Science, Vol. 594, pp. 012026. Cited by: §I, 1st item.
-  (2016) A validation of technologies monitoring dairy cow feeding, ruminating, and lying behaviors. Journal of dairy science 99 (9), pp. 7458–7466. Cited by: §I.
-  (2018) Cascade r-cnn: delving into high quality object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 6154–6162. Cited by: §II-C.
-  (2018) A short note about kinetics-600. arXiv preprint arXiv:1808.01340. Cited by: §III-A, §III-A, §III-C.
-  (2020) HERO5 BLACK — GoPro. Note: https://gopro.com/en/au/update/hero5 Cited by: §III-A.
-  (2020) JuanIrache/gopro-utils: Tools to parse metadata from GoPro Hero 5 & 6. Note: https://github.com/JuanIrache/gopro-utils Cited by: §III-A.
-  (2017) Cow behavior recognition based on image analysis and activities. International Journal of Agricultural and Biological Engineering 10 (3), pp. 165–174. Cited by: §I.
-  (2020) Actions as moving points. In European Conference on Computer Vision, pp. 68–84. Cited by: §II-D.
-  (2014) Microsoft coco: common objects in context. In European conference on computer vision, pp. 740–755. Cited by: §III-A.
-  (2020) Towards improving spatiotemporal action recognition in videos. arXiv preprint arXiv:2012.08097. Cited by: §II-D.
-  (2020) open-mmlab/mmaction2: OpenMMLab’s Next Generation Toolbox for Action Understanding. Note: https://github.com/open-mmlab/mmaction2 Cited by: §III-B, §III-C.
-  (2020) openvinotoolkit/cvat: Powerful and efficient Computer Vision Annotation Tool. Note: https://github.com/openvinotoolkit/cvat Cited by: §III-A, §III-A.
-  (2019) Individual cattle identification using a deep learning based framework. IFAC-PapersOnLine 52 (30), pp. 318–323. Cited by: §I, 1st item.
-  (2018) Cattle behaviour classification from collar, halter, and ear tag sensors. Information processing in agriculture 5 (1), pp. 124–133. Cited by: §I.
-  (2016) You only look once: unified, real-time object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 779–788. Cited by: §II-C.
-  (2015) Faster r-cnn: towards real-time object detection with region proposal networks. Advances in neural information processing systems 28, pp. 91–99. Cited by: §II-C.
-  (2020) stilldavid/gopro-utils: Tools to parse meta data from GoPro 5 & 6. Note: https://github.com/stilldavid/gopro-utils Cited by: §III-A.
-  (2021) Real-time behavioral recognition in dairy cows based on geomagnetism and acceleration information. IEEE Access 9, pp. 109497–109509. Cited by: §I.
-  Official download of VLC media player, the best Open Source player. Note: https://www.videolan.org/vlc/index.html Cited by: §III-A.
-  (2020) WebVTT: The Web Video Text Tracks Format. Note: https://www.w3.org/TR/webvtt1/ Cited by: §II-B, §III-A, §III-A.
-  (2018) Development and validation of an ensemble classifier for real-time recognition of cow behavior patterns from accelerometer data and location data. PloS one 13 (9), pp. e0203546. Cited by: §I.
-  (2016) Temporal segment networks: towards good practices for deep action recognition. In European conference on computer vision, pp. 20–36. Cited by: §II-D, §III-C.