Visual Identification of Individual Holstein Friesian Cattle via Deep Metric Learning

Underpinning code for this paper is available in the MetricLearningIdentification repository.
Holstein Friesian cattle exhibit individually-characteristic black and white coat patterns visually akin to those arising from Turing's reaction-diffusion systems. This work takes advantage of these natural markings in order to automate visual detection and biometric identification of individual Holstein Friesians via convolutional neural networks and deep metric learning techniques. Using agriculturally relevant top-down imaging, we present methods for the detection, localisation, and identification of individual Holstein Friesians in an open herd setting, i.e. where changes in the herd do not require system re-training. We propose the use of SoftMax-based reciprocal triplet loss to address the identification problem and evaluate the techniques in detail against fixed herd paradigms. We find that deep metric learning systems show strong performance even under conditions where cattle unseen during system training are to be identified and re-identified, achieving 98.2% accuracy when trained on just half of the population. This work paves the way for facilitating the visual non-intrusive monitoring of cattle applicable to precision farming for automated health and welfare monitoring and to veterinary research in behavioural analysis, disease outbreak tracing, and more.
Motivation. Driven by their high milk yield Tadesse and Dessie (2003), black and white patterned Holstein Friesian and British Friesian cattle are the dominant dairy cattle breeds farmed in the UK New et al. (2005); Department for Environment, Food and Rural Affairs (2008) (see Fig. 2). Legal frameworks mandate traceability of livestock throughout their lives European Parliament and Council (1997); United States Department of Agriculture (USDA) Animal and Plant Health Inspection Service, in order to identify individuals for monitoring, control of disease outbreaks, and more Hansen et al. (2018); Smith et al. (2005); Bowling et al. (2008); Caporale et al. (2001). For cattle this is realised in the form of a national tracking database linked to a unique ear-tag identification for each animal Houston (2001); Buick (2004); Shanahan et al. (2009), or additionally via injectable transponders Klindtworth et al. (1999), branding Adcock et al. (2018), and more Awad (2016) (see Fig. 3). Such tags, however, cannot provide the continuous localisation of individuals that would open up numerous applications in precision farming and a number of research areas, including welfare assessment, behavioural and social analysis, disease development, and infection transmission, amongst others Ungar et al. (2005); Turner et al. (2000). Even for conventional identification, tagging has been called into question from a welfare standpoint Johnston et al. (1996); Edwards and Johnston (1999), regarding longevity and reliability Fosgate et al. (2006), and for causing permanent damage Edwards et al. (2001); Wardrope (1995). Building upon previous research Martinez-Ortiz et al. (2013); Li et al. (2017); Andrew et al. (2016, 2017, 2019); Andrew (2019); Andrew et al. (2020), we propose to take advantage of the intrinsic, characteristic formations of the breed's coat pattern in order to perform non-intrusive visual identification (ID), laying down the essential precursors to continuous monitoring of herds on an individual animal level via non-intrusive visual observation (see Fig. 1).
RoIs projected into this latent ID space can then be classified by a lightweight approach such as k-nearest neighbours, ultimately yielding cattle identities for input images. Unknown cattle can be projected into this same space as long as the model has learnt a sufficiently discriminative reduction such that their new embeddings can be differentiated from other clusters based on distance.
Closed-set Identification. Our previous works showed that visual cattle detection, localisation, and re-identification via deep learning is robustly feasible in closed-set scenarios, where a system is trained and tested on a fixed set of known Holstein Friesian cattle under study Andrew et al. (2016, 2017); Andrew (2019). However, in this setup, imagery of all animals must be captured and manually annotated/identified before system training can take place. Consequently, any change in the population or transfer of the system to a new herd requires labour-intensive data gathering and labelling, plus computationally demanding retraining of the system.
Open-set Identification. In this paper our focus is on a flexible scenario: the open-set recognition of individual Holstein Friesian cattle. Instead of only being able to recognise individuals that have been seen before and trained against, the system should be able to identify and re-identify cattle that have never been seen before without further retraining. To provide a complete process, we propose a full pipeline for detection and open-set recognition from image input to IDs (see Fig. 1).
The remainder of this paper and its contributions are organised as follows: Section 2 discusses relevant related works in the context of this paper. Next, Section 4 outlines Holstein Friesian breed RoI detection, the first stage of the proposed identification pipeline, followed by the second stage in Section 5 on open-set individual recognition with extensive experiments on various relevant techniques. Finally, concluding remarks and possible avenues for future work are given in Section 7.
The most longstanding approaches to cattle biometrics leverage the discovery of the cattle muzzle as a dermatoglyphic trait, dating as far back as 1922 to the work of Petersen Petersen (1922). Since then, this property has been taken advantage of in the form of semi-automated approaches Kumar and Singh (2017); Kumar et al. (2017); Kimura et al. (2004); Tharwat et al. (2014) and those operating automatically on muzzle images Awad and Hassaballah (2019); El Hadad et al. (2015); Barry et al. (2007). These techniques, however, rely upon the presence of heavily constrained images of the cattle muzzle that are not easily attainable. Other works have looked towards retinal biometrics Allen et al. (2008), facial features Barbedo et al. (2019); Cai and Li (2013), and body scans Arslan et al. (2014), all requiring specialised imaging.
Only a few works have utilised advancements in the field of computer vision for the automated extraction of individual identity based on full-body dorsal features Martinez-Ortiz et al. (2013); Li et al. (2017). Our previous works have taken advantage of this property, exploiting hand-crafted features extracted from the coat Andrew et al. (2016) (similar to a later work by Li et al. (2017)), an approach which was outperformed by a deep approach using convolutional neural networks extracting spatio-temporal features Andrew et al. (2017, 2019); Andrew (2019), similar to Qiao et al. (2019). More recently, there have been works that integrate multiple views of cattle faces for identification Barbedo et al. (2019), utilise thermal imagery for background subtraction as a pre-processing technique for a standard CNN-based classification pipeline Bhole et al. (2019), and detect cattle presence from UAV-acquired imagery Barbedo et al. (2019). In this work we continue to exploit dorsal biometric features from coat patterns exhibited by Holstein and Holstein Friesian breeds as they provably provide sufficient distinction across populations. In addition, the images are easily acquired via static ceiling-mounted cameras, or outdoors using UAVs. Note that such birds-eye view images provide a canonical and consistent viewpoint of the object, the possibility of occlusions is widely eradicated, and imagery can be captured in a non-intrusive manner.

Object detectors generally fall into two classes: one-stage detectors such as SSD Liu et al. (2016) and YOLO Redmon et al. (2016), which infer class probabilities and bounding box offsets within a single feedforward network, and two-stage detectors such as Faster R-CNN Ren et al. (2015), which first generate candidate regions and then classify and refine them.
The problem of open-set recognition – that is, automatically re-identifying never before seen objects – is a well-studied area in computer vision and machine learning. Traditional and seminal techniques typically have their foundations in probabilistic and statistical approaches Jain et al. (2014); Scheirer et al. (2014); Rudd et al. (2017), with alternatives including specialised support vector machines Scheirer et al. (2012); Júnior et al. (2016) and more Bendale and Boult (2015); Júnior et al. (2017). However, given the performance gains on benchmark datasets achieved using deep learning and neural network techniques Sermanet et al. (2013); Girshick et al. (2014); Krizhevsky et al. (2012), approaches to open-set recognition have followed suit. Proposed deep models can be found to operate in an autoencoder paradigm Oza and Patel (2019); Yoshihashi et al. (2019), where a network learns to transform an image input into an efficient latent representation and then reconstructs it from that representation as closely as possible. Alternatives include open-set loss function formulations that replace softmax Bendale and Boult (2016), the generation of counterfactual images close to the training set to strengthen object discrimination Neal et al. (2018), and approaches that combine these two techniques Ge et al. (2017); Shu et al. (2017). Some further, less relevant techniques are discussed in Geng et al. (2018).

The approach taken in this work is to learn a latent representation of the training set of individual cattle in the form of an embedding that generalises the visual uniqueness of the breed beyond that of the specific training herd. The idea is that this dimensionality reduction should be discriminative to the extent that new, unseen individuals projected into this same space will differ significantly from the embeddings of the known training set. This form of approach has a history in the literature Meyer and Drummond (2019); Lagunes-Fortiz et al. (2019); Hassen and Chan (2018), where embeddings were originally used for human re-identification Schroff et al. (2015); Hermans et al. (2017), as well as data aggregation and clustering Oh Song et al. (2016); Opitz et al. (2018); Oh Song et al. (2017). In our experiments, we will investigate the effect of various loss functions for constructing latent spaces Schroff et al. (2015); Lagunes-Fortiz et al. (2019); Masullo et al. (2019) and quantify their suitability for the open-set recognition of Holstein Friesian cattle.
To facilitate the experiments carried out in this paper, we introduce the OpenCows2020 dataset, which will be made available publicly. The dataset consists of indoor and outdoor top-down imagery collated from our previous works and datasets Andrew et al. (2016, 2017, 2019). Indoor footage was acquired with statically affixed cameras, whilst outdoor imagery was captured onboard a UAV. The dataset is split into two components detailed below: (a) for cattle detection and localisation, the first stage in our pipeline, and (b) for open-set identification.
The detection and localisation component of the OpenCows2020 dataset consists of whole images with hand-annotated cattle regions across in-barn and outdoor settings. When training a detector on this set, one obtains a model that is widely domain agnostic with respect to the environment and can be deployed in a variety of farming-relevant conditions. Around 52% of this set are original, non-augmented images. The rest were synthesised with a combination of random cropping, scaling, rotation, blurring, and more using the imgaug library Jung et al. (2020) to enhance the training set, as sketched below. For each cow, we manually annotated a bounding box that encloses the animal's torso, excluding the head, neck, legs, and tail in adherence with the VOC 2012 guidelines Everingham et al. This is in order to limit content to a canonical, compact, and minimally deforming species-relevant region. Illustrative examples from this set are given in Fig. 4.
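As an illustration of this augmentation step, below is a minimal sketch using the imgaug library cited above; the specific augmenters and parameter ranges are assumptions for demonstration, not the exact pipeline used to build the dataset, and `images`/`boxes` are placeholder inputs.

```python
import imgaug.augmenters as iaa

# Hypothetical augmentation pipeline: random crop, scale, rotation and blur.
# Parameter ranges are illustrative only.
seq = iaa.Sequential([
    iaa.Crop(percent=(0, 0.1)),                      # random cropping
    iaa.Affine(scale=(0.8, 1.2), rotate=(-45, 45)),  # scaling and rotation
    iaa.GaussianBlur(sigma=(0.0, 1.5)),              # blurring
])

# imgaug transforms bounding boxes consistently with the images, so the
# hand-annotated torso boxes remain valid after augmentation.
images_aug, boxes_aug = seq(images=images, bounding_boxes=boxes)
```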
The second component of the OpenCows2020 dataset consists of identified cattle from the detection image set. Individuals with fewer than 20 instances were discarded, resulting in a population of 46 individuals. A random example from each individual is given in Figure 5 to illustrate the variety in coat patterns, as well as the various acquisition methods, backgrounds/environments, illumination conditions, etc.
The first stage in the pipeline (see Fig. 1, blue) is to automatically and robustly detect and locate Holstein Friesian cattle within relevant imagery. That is, we want to train a generic breed-wide cattle detector such that, for some image input, we receive as output a set of bounding box coordinates with confidence scores (see Figure 6) enclosing every cow within it. Note that the object class of (all) cattle is highly diverse, with each individual presenting a different coat pattern. The RetinaNet Lin et al. (2017b) architecture serves as the detection backbone for this breed recognition task, where we will compare its performance against other relevant seminal baselines (see Section 4.3).
RetinaNet consists of a backbone feature pyramid network Lin et al. (2017a) followed by two task-specific sub-networks. One sub-network performs object classification on the backbone's output using focal loss, the other regresses the bounding box position. To implement focal loss, we first define $p_t$ as follows for convenience:

$$p_t = \begin{cases} p & \text{if } y = 1 \\ 1 - p & \text{otherwise,} \end{cases} \tag{1}$$

where $y$ is the ground-truth label and $p$ is the estimated probability for the case $y = 1$. For detection we only need to separate cattle from the background, presenting a binary classification problem. As such, focal loss is defined as:

$$FL(p_t) = -\alpha_t \, (1 - p_t)^{\gamma} \log(p_t), \tag{2}$$

where $-\log(p_t)$ is the cross entropy for binary classification, $(1 - p_t)^{\gamma}$ is the modulating factor that balances easy/difficult samples, and $\alpha_t$ balances the number of positive/negative samples. The focal loss function ensures that the training process pays attention to positive and difficult samples first.
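For concreteness, a minimal PyTorch sketch of the binary focal loss of Eqs. (1) and (2) might look as follows; this is an illustrative implementation, not the authors' code.

```python
import torch
import torch.nn.functional as F

def binary_focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Binary focal loss, Eqs. (1)-(2). targets are 0/1 floats."""
    p = torch.sigmoid(logits)
    # Standard binary cross entropy, -log(p_t), computed stably from logits.
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = p * targets + (1 - p) * (1 - targets)              # Eq. (1)
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)  # class balancing
    return (alpha_t * (1.0 - p_t) ** gamma * ce).mean()      # Eq. (2)
```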
The regression sub-network predicts four parameters $t = (t_x, t_y, t_w, t_h)$ representing the offset coordinates between an anchor box and its ground-truth box. The ground-truth offsets $t^* = (t^*_x, t^*_y, t^*_w, t^*_h)$ can be expressed as:

$$t^*_x = \frac{x - x_a}{w_a}, \quad t^*_y = \frac{y - y_a}{h_a}, \quad t^*_w = \log\!\left(\frac{w}{w_a}\right), \quad t^*_h = \log\!\left(\frac{h}{h_a}\right), \tag{3}$$

where $(x, y, w, h)$ is the ground-truth box and $(x_a, y_a, w_a, h_a)$ is the anchor box; the width and height of a bounding box are given by $w$ and $h$. The regression loss can be defined as:

$$L_{reg} = \sum_{i \in \{x, y, w, h\}} \text{smooth}_{L1}(t_i - t^*_i), \tag{4}$$

where the smooth L1 loss is defined as:

$$\text{smooth}_{L1}(x) = \begin{cases} 0.5\,x^2 & \text{if } |x| < 1 \\ |x| - 0.5 & \text{otherwise.} \end{cases} \tag{5}$$
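A short sketch of the target encoding of Eq. (3) and the smooth L1 penalty of Eq. (5), under the assumption that boxes are given in centre-size (x, y, w, h) format:

```python
import torch

def encode_offsets(gt, anchors):
    """Eq. (3): regression targets from (x, y, w, h) centre-size boxes."""
    tx = (gt[:, 0] - anchors[:, 0]) / anchors[:, 2]
    ty = (gt[:, 1] - anchors[:, 1]) / anchors[:, 3]
    tw = torch.log(gt[:, 2] / anchors[:, 2])
    th = torch.log(gt[:, 3] / anchors[:, 3])
    return torch.stack([tx, ty, tw, th], dim=1)

def smooth_l1(x):
    """Eq. (5): quadratic near zero, linear beyond |x| = 1."""
    return torch.where(x.abs() < 1, 0.5 * x ** 2, x.abs() - 0.5)
```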
Our particular RetinaNet implementation utilises a ResNet-50 backbone He et al. (2016) as the feature pyramid network, with weights pre-trained on ImageNet Deng et al. (2009). The intersection over union (IoU) threshold, the prior anchor's confidence of foreground, and other parameters are set to those proposed in Lin et al. (2017b). The network was fine-tuned on the detection component of our dataset via stochastic gradient descent Robbins and Monro (1951) with momentum Qian (1999) and weight decay. Training and testing splits were randomly chosen, with any synthetic instances removed from the test set. Focal loss parameters were selected as γ = 2 and α = 0.25, with the classification and regression losses weighted equally. Training time was around 30 hours on an Nvidia Tesla P100-PCIE-16GB GPU. Finally, to provide a suitable comparison with baselines, two popular and seminal architectures – YOLOv3 Redmon and Farhadi (2018) and Faster R-CNN Ren et al. (2015) – are evaluated on the same dataset and splits in the following section.

Quantitative comparisons of the proposed detection method against classic and recent approaches are shown in Table 1. Mean average precision (mAP) is the chosen metric to quantitatively compare performance, computed via the area under the precision-recall curve obtained from each method. As can be seen, the strongest performance was achieved by the RetinaNet-underpinned architecture at near perfect mAP rates suitable for practical application, which justifies the network's use in our proposed image-to-ID pipeline. Specifically, our implementation obtains this performance with the following parameter choices: confidence score threshold = 0.5, non-maximum suppression (NMS) threshold = 0.28, IoU threshold = 0.5.
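The fine-tuning setup described above can be sketched with torchvision's off-the-shelf RetinaNet; this is an illustrative reconstruction rather than the paper's exact code, and the learning rate, momentum, and weight decay values below are placeholders, as is `data_loader`.

```python
import torch
from torchvision.models.detection import retinanet_resnet50_fpn

# Two classes: background + cattle. Backbone weights are ImageNet pre-trained.
model = retinanet_resnet50_fpn(pretrained_backbone=True, num_classes=2)
optimiser = torch.optim.SGD(model.parameters(), lr=1e-3,   # placeholder values
                            momentum=0.9, weight_decay=1e-4)

model.train()
for images, targets in data_loader:  # targets: [{"boxes": Tensor, "labels": Tensor}]
    loss_dict = model(images, targets)  # focal classification + box regression losses
    loss = sum(loss_dict.values())
    optimiser.zero_grad()
    loss.backward()
    optimiser.step()
```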
Figure 7 depicts limitations and shows instances of RetinaNet detection failures. Examples (a) and (b) arise from image boundary clipping following the VOC labelling guidelines Everingham et al. on object visibility/occlusion, which can be avoided in most practical applications by ignoring boundary areas. In (c), poor localisation is the result of closely situated cattle in conjunction with the choice of a low NMS threshold. We chose to keep the NMS threshold as low as possible, since higher values occasionally lead to false positive detections in groups of crowded cattle (see Fig. 6(a)). Finally, we found that in rare cases, as shown in (e), when two cattle are parallel, in close proximity, and have a diagonal heading, a predicted box between the two cows can sometimes be observed. This is a result of an intrinsic drawback of axis-aligned bounding boxes: for objects with a diagonal heading, a ground-truth bounding box necessarily includes many background pixels. Consequently, those background pixels may be occupied by neighbouring cattle, which misguides the network.
Table 1: Cattle detection performance.

| Method | Two-stage | Focal loss | mAP (%) |
|---|---|---|---|
| YOLOv3 Redmon and Farhadi (2018) | N | N | 80.3 |
| Faster R-CNN Ren et al. (2015) (ResNet-50 backbone) | Y | N | 94.8 |
| RetinaNet Lin et al. (2017b) (ResNet-50 backbone) | N | Y | 97.5 |
Given robustly identified image regions that contain cattle, we would like to discriminate individuals, seen or unseen, without the costly step of manually labelling new individuals and fully re-training a closed-set classifier. The key idea for approaching this task is to learn a mapping into a class-distinctive latent space where embeddings of images of the same individual naturally cluster together. Such a feature embedding encodes a latent representation of inputs and, for images, also equates to a significant dimensionality reduction from a full pixel matrix to a vector of size $n$, where $n$ is the dimensionality of the embedded space. In the latent space, distances directly encode input similarity, hence the term metric learning. To actually classify inputs after constructing a successful embedding, a lightweight clustering algorithm can be applied to the latent space (e.g. k-Nearest Neighbours) where clusters now represent individuals.
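Such a mapping can be realised, for instance, by replacing the classification head of a standard CNN with a low-dimensional projection; the sketch below assumes a ResNet-50 backbone and a 128-dimensional embedding, in line with the experimental setup described later.

```python
import torch.nn as nn
from torchvision.models import resnet50

class EmbeddingNet(nn.Module):
    """Maps an RoI image to an n-dimensional latent vector."""
    def __init__(self, n_dims=128):
        super().__init__()
        self.backbone = resnet50(pretrained=True)  # ImageNet weights
        # Swap the 1000-way classifier for an embedding projection.
        self.backbone.fc = nn.Linear(self.backbone.fc.in_features, n_dims)

    def forward(self, x):
        return self.backbone(x)  # shape: (batch, n_dims)
```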
Success in building this form of latent representation relies heavily – amongst many other factors – upon the careful choice of a loss function that naturally yields an identity-clustered space. A seminal example in metric learning originates from the use of Siamese architectures Hadsell et al. (2006), where image pairs are passed through a dual-stream network with coupled weights to obtain their embeddings. Weights $\theta$ are shared between two identical network streams $f$:

$$e_1 = f(x_1; \theta), \qquad e_2 = f(x_2; \theta). \tag{7}$$
The authors then proposed training this architecture with a contrastive loss to cluster instances according to their class:

$$L_{contrastive} = (1 - y)\,\tfrac{1}{2} D^2 + y\,\tfrac{1}{2}\left[\max(0,\ m - D)\right]^2, \tag{8}$$

where $y$ is a binary label denoting similarity or dissimilarity of the inputs $(x_1, x_2)$, $m$ is a margin, and $D = \lVert e_1 - e_2 \rVert_2$ is the Euclidean distance between the two embeddings with dimensionality $n$. The problem with this formulation is that it cannot simultaneously encourage learning of visual similarities and dissimilarities, both of which are critical for obtaining clean, well-separated clusters in our coat pattern differentiation task. This shortcoming can be overcome by a triplet loss formulation Schroff et al. (2015), utilising the embeddings of a triplet $(x_a, x_p, x_n)$ containing three image inputs denoting an anchor, a positive example from the same class, and a negative example from a different class, respectively. The idea is to encourage minimal distance between the anchor $x_a$ and the positive $x_p$, and maximal distance between the anchor and the negative sample $x_n$ in the embedded space. Figure 8(a) illustrates the learning goal, whilst the loss function is given by:

$$L_{TL} = \max\left(0,\ \lVert f(x_a) - f(x_p) \rVert^2 - \lVert f(x_a) - f(x_n) \rVert^2 + \alpha\right), \tag{9}$$

where $\alpha$ denotes a constant margin hyperparameter. The inclusion of the constant $\alpha$ often turns out to cause learning issues, since the margin can be satisfied at any distance from the anchor; Figure 8(b) illustrates this problem. Alleviating this limitation is a recent formulation named reciprocal triplet loss Masullo et al. (2019), which removes the margin hyperparameter altogether:

$$L_{RTL} = \lVert f(x_a) - f(x_p) \rVert^2 + \frac{1}{\lVert f(x_a) - f(x_n) \rVert^2}. \tag{10}$$
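A compact PyTorch sketch of the triplet and reciprocal triplet losses of Eqs. (9) and (10); the margin value is an illustrative placeholder.

```python
import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=0.5):
    """Eq. (9): margin-based triplet loss over batches of embeddings."""
    d_ap = (anchor - positive).pow(2).sum(dim=1)  # squared anchor-positive distance
    d_an = (anchor - negative).pow(2).sum(dim=1)  # squared anchor-negative distance
    return F.relu(d_ap - d_an + margin).mean()

def reciprocal_triplet_loss(anchor, positive, negative, eps=1e-8):
    """Eq. (10): margin-free; the negative term decays with distance."""
    d_ap = (anchor - positive).pow(2).sum(dim=1)
    d_an = (anchor - negative).pow(2).sum(dim=1)
    return (d_ap + 1.0 / (d_an + eps)).mean()
```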
Recent work Lagunes-Fortiz et al. (2019) has demonstrated improvements in open-set recognition on various datasets Hodan et al. (2017); Wang et al. (2017) via the inclusion of a SoftMax term in the triplet loss formulation during training, given by:

$$L = L_{TL} + \lambda\, L_{SoftMax}, \tag{11}$$

where

$$L_{SoftMax} = -\sum_{i} \log \frac{e^{z_{y_i}}}{\sum_{j} e^{z_j}}, \tag{12}$$

with $z$ denoting the class logits for an input of ground-truth class $y_i$, $\lambda$ a constant weighting hyperparameter, and $L_{TL}$ the standard triplet loss as defined in Equation 9. For our experiments, we select $\lambda$ as suggested in the original paper Lagunes-Fortiz et al. (2019), the result of a parameter grid search. This formulation is able to outperform the standard triplet loss approach since it combines the best of both worlds: fully supervised learning and a separable embedded space. Most importantly for the task at hand, we propose to combine a fully supervised SoftMax loss term with the reciprocal triplet loss formulation, which removes the necessity of specifying a margin parameter. This combination is novel and given by:

$$L_{SRTL} = L_{RTL} + \lambda\, L_{SoftMax}, \tag{13}$$

where $L_{RTL}$ and $L_{SoftMax}$ are defined by Equations 10 and 12 above, respectively. Comparative results for all of these loss functions are given in our experiments as follows.
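Building on the sketches above, the proposed combined loss of Eq. (13) could be written as follows; the weighting value is a placeholder, since the paper selects λ by grid search, and `reciprocal_triplet_loss` refers to the earlier sketch.

```python
import torch.nn.functional as F

def softmax_rtl_loss(anchor, positive, negative, logits, labels, lam=0.01):
    """Eq. (13): reciprocal triplet loss plus a weighted SoftMax
    (cross-entropy) term over the known training identities.
    lam is a placeholder value, not the paper's grid-searched setting."""
    rtl = reciprocal_triplet_loss(anchor, positive, negative)  # Eq. (10)
    ce = F.cross_entropy(logits, labels)                       # Eq. (12)
    return rtl + lam * ce
```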
In the following section, we compare and contrast different triplet loss functions to quantitatively show performance differences on our task of open-set identification of Holstein Friesian cattle. The goal of the experiments carried out here is to investigate the extent to which different feature embedding spaces are suitable for our specific open-set classification task. Within the context of the overall identification pipeline given in Figure 1, we will assume that the earlier stage (as described in Section 4) has successfully detected the presence of cattle and extracted good-quality regions of interest. These regions are now ready to be identified, as assessed in these experiments.
The employed embedding network utilises a ResNet-50 backbone He et al. (2016), with weights pre-trained on ImageNet Deng et al. (2009). The final fully connected layer was set to have 128 outputs, defining the dimensionality of the embedding space. This choice was founded on existing research suggesting 128 dimensions to be suitable for fine-grained recognition tasks such as face recognition Schroff et al. (2015) or image class retrieval Balntas et al. (2016). In each experiment, the network was fine-tuned on the training portion of the identification regions in the OpenCows2020 dataset. We chose Stochastic Gradient Descent Robbins and Monro (1951) with momentum Qian (1999) and weight decay as the optimiser. For every training run, the reported accuracy value is the highest achieved over the epochs of training. Of note is that we found the momentum component led to significant instability during training with reciprocal triplet loss, thus we disabled it for runs using that function. Finally, for a comparative closed-set classifier chosen as another baseline, the same ResNet-50 architecture was used.

Once an image is passed through the network, we obtain its 128-dimensional embedding. We then used k-NN (as suggested by similar research Lagunes-Fortiz et al. (2019)), where more complex alternatives provided only negligible performance gain. Using k-NN to classify unseen classes operates by projecting every non-testing instance from every class into the latent space, both those seen and unseen during the network training. Subsequently, every testing instance (of known and unknown individuals) is also projected into the latent space. Finally, each testing instance is classified by votes from the surrounding nearest embeddings from non-testing instances. Accuracy is then defined as the number of correct predictions divided by the cardinality of the testing set.
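A sketch of this classification step using scikit-learn; `embed()` stands in for a forward pass through the trained embedding network, the gallery/test variables are placeholders, and k = 5 is an illustrative choice rather than a confirmed setting.

```python
from sklearn.neighbors import KNeighborsClassifier

# Gallery: non-testing instances of all individuals, both seen and unseen
# during network training; test: held-out instances to be identified.
knn = KNeighborsClassifier(n_neighbors=5)  # k is illustrative
knn.fit(embed(gallery_images), gallery_labels)
predictions = knn.predict(embed(test_images))
accuracy = (predictions == test_labels).mean()
```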
To validate the model in its capacity to generalise from seen to unseen individuals, we perform several k-fold cross validations. In order to do so, the set of individuals is randomly split into k evenly-sized bins. For each fold i, the i-th bin forms the unseen set of individuals (withheld during training), and the rest form the known set, which is trained against. The number of folds is incrementally lowered to observe the effect of withholding more individuals from training; Table 2 illustrates the quantitative results. That is, how well does the model perform on an increasingly open problem? Within each individual class, its instances were randomly split into training and testing samples at a fixed ratio. These splits remain constant throughout experimentation to ensure consistency and enable quantitative comparison.
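The individual-level splitting protocol might be sketched as follows; note that the bins partition individuals, not images, so entire identities are withheld per fold. Function and variable names are hypothetical.

```python
import numpy as np

def individual_folds(individual_ids, k, seed=0):
    """Yield (known, unseen) identity sets for each of k folds."""
    rng = np.random.default_rng(seed)
    bins = np.array_split(rng.permutation(individual_ids), k)
    for i in range(k):
        unseen = set(bins[i])                 # withheld from training
        known = set(individual_ids) - unseen  # trained against
        yield known, unseen
```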
During training, one observes the network learning quickly, and as a result a large fraction of triplets is rendered relatively uninformative. The commonly-employed remedy is to mine triplets a priori for difficult examples. This offline process was superseded by Hermans et al. (2017), who proposed two online methods for mining more appropriate triplets: 'batch hard' and 'batch all'. Triplets are mined within each mini-batch during training and their triplet loss computed over the selections. In this way, a costly offline search before training is no longer necessary. Consequently, we employ 'batch hard' here as our online mining strategy, as given by:
$$L_{BH} = \sum_{a \in \mathcal{B}} \left[\, \max_{p:\, y_p = y_a} d\big(f(x_a), f(x_p)\big) \;-\; \min_{n:\, y_n \neq y_a} d\big(f(x_a), f(x_n)\big) + \alpha \,\right]_{+}, \tag{14}$$

where $\mathcal{B}$ is the mini-batch, $y_a$ are the anchor classes, and $x_a$ are the images for those anchors. This formulation selects moderate triplets overall, since they are the hardest examples within each mini-batch, which is in turn a small subset of the training data. We use this mining strategy for all of the tested loss functions given in the following results section.
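A sketch of 'batch hard' mining over a mini-batch of embeddings, following Eq. (14); the margin value is illustrative.

```python
import torch

def batch_hard_triplet_loss(embeddings, labels, margin=0.5):
    """For every anchor in the mini-batch, take its hardest (furthest)
    positive and hardest (closest) negative, as in Hermans et al. (2017)."""
    dists = torch.cdist(embeddings, embeddings)        # pairwise distances
    same = labels.unsqueeze(0) == labels.unsqueeze(1)  # same-class mask
    eye = torch.eye(len(labels), dtype=torch.bool, device=labels.device)
    # Hardest positive: max distance among same-class, non-self pairs.
    hardest_pos = dists.masked_fill(~same | eye, float("-inf")).max(dim=1).values
    # Hardest negative: min distance among different-class pairs.
    hardest_neg = dists.masked_fill(same, float("inf")).min(dim=1).values
    return torch.relu(hardest_pos - hardest_neg + margin).mean()
```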
Table 2: Average accuracy (%) per loss function across known/unknown splits.

| Known / Unknown (%) | 40 / 60 | 30 / 70 | 20 / 80 | 10 / 90 |
|---|---|---|---|---|
| Cross-entropy (closed-set) | 36.69 | 25.6 | 13.1 | 7.86 |
| Triplet Loss Schroff et al. (2015) | 93.34 | 84.06 | 81.65 | 71.57 |
| Reciprocal Triplet Loss Masullo et al. (2019) | 83.35 | 94.96 | 89.11 | 87.7 |
| Softmax + Triplet Loss Lagunes-Fortiz et al. (2019) | 97.78 | 94.56 | 89.31 | 86.9 |
| Softmax + Reciprocal Triplet Loss (ours) | 97.38 | 95.56 | 89.52 | 84.48 |
Key quantitative results for our experiments are given in Table 2. As can be seen, we found that our proposed combination of a supervised Softmax term with the reciprocal triplet loss function led to a slight performance gain when compared to other functions. Figure 9 illustrates these values in graph form, expressing the ability of the implemented methods to cope with an increasingly open-set problem. Visible in the graph is also a standard CNN-based classification baseline using Softmax and cross-entropy loss. As one would expect, its accuracy declines linearly with the openness of the identification problem; by design, the baseline method can in no way generalise to unseen classes. In stark contrast, all embedding-based methods can be seen to drastically outperform the implemented baseline, suggesting the suitability of this form of approach to the problem. Encouragingly, as shown in Figure 10, we found that identification error had no tendency to originate from the unknown identity set.
One issue we encountered is that when there are only a small number of training classes, the model can quickly learn to satisfy that limited set, achieving near-zero loss and 100% accuracy on the validation data for those seen classes. However, the construction of the latent space remains widely incomplete: there is no room for the model to learn any further, and thus performance on novel classes cannot be improved. For best performance in practice, therefore, we suggest utilising as wide an identity landscape as possible (many individuals) to carve out a diverse latent space capturing a wide range of intra-breed variance. The avoidance of overfitting is critical, as illustrated in Figure 11, where eventual perfect performance (overfitting) on a small set of known training identities does not allow performance to generalise to novel classes. The reciprocal triplet loss formulation performs slightly better across the learning task, which is reflected quantitatively in our findings (see Figure 9). Thus, we suggest utilisation of RTL over the original triplet loss function for the task at hand.
To provide a qualitative visualisation, we include Figure 12, a visualisation of the embedded space and the corresponding clusters. This plot and the others in this section were produced using the t-distributed Stochastic Neighbour Embedding (t-SNE) van der Maaten and Hinton (2008) technique for visualising high-dimensional spaces with a fixed perplexity. Visible – particularly in relation to the embedded training set (see Fig. 12(a)) – is the success of the model trained via triplet loss formulations, clumping like identities together whilst distancing others. This is then sufficient to cluster and thereby re-identify never before seen testing identities (see Fig. 12(b)). Most importantly, despite only being shown half of the identity classes during training, the model learned a discriminative enough embedding that generalises well to previously unseen cattle. Thus, surprisingly few coat pattern identities are sufficient to create a latent space that can successfully accommodate and cluster unseen identities.
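Such plots can be reproduced with scikit-learn's t-SNE; the perplexity value below is illustrative, not the paper's exact setting, and `embeddings`/`labels` are placeholder arrays.

```python
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Project n-dimensional embeddings to 2-D for visualisation.
coords = TSNE(n_components=2, perplexity=30).fit_transform(embeddings)
plt.scatter(coords[:, 0], coords[:, 1], c=labels, cmap="tab20", s=8)
plt.title("t-SNE projection of the learned identity space")
plt.show()
```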
Figure 13 visualises the embeddings of the consistent training set for a 50% open problem across all the implemented loss functions used to train latent spaces. The inclusion of a Softmax component in the loss function provided quantifiable improvements in identification accuracy. This is also qualitatively reflected in the quality of the embeddings and corresponding clusters, comparing the top and bottom rows in Figure 13. Thus, both quantitative and qualitative findings reinforce the suitability of the proposed method to the task at hand. The core technical takeaway is that the inclusion of a fully supervised loss term appears to beneficially support a purely metric learning-based approach in training a discriminative and separable latent representation that is able to generalise to unseen instances of Holstein Friesians. Figure 14 illustrates an example from each class overlaid in this same latent space. This visualises the spatial similarities and dissimilarities the network uses to generate separable embeddings for the classes seen during training, which generalise to unseen individuals (shown in red).
This work proposes a complete pipeline for identifying individual Holstein Friesian cattle, both seen and never before seen, in agriculturally-relevant imagery. An assessment of existing state-of-the-art object detectors determined that they are well-suited to serve as an initial breed-wide cattle detector, and RetinaNet demonstrated sufficiently strong performance at 97.5% mAP on the employed dataset. Extensive experiments in open-set recognition found that surprisingly few instances are needed in order to learn and construct a robust embedding space – from image RoI to ID clusters – that generalises well to unseen cattle. Specifically, Reciprocal Triplet Loss in conjunction with a supervised Softmax component was found to demonstrably generalise best in terms of performance across open-set experiments. For instance, for a latent space built from 23 out of 46 individuals, a cross-validated accuracy of 98.2% was observed. Considering its wider application, these experiments suggest that the proposed pipeline is a viable step towards automating cattle detection and identification non-intrusively in agriculturally-relevant scenarios where herds change dynamically over time. Importantly, the identification component can be trained at the time of deployment on a present herd and, as shown here for the first time, performs well without re-enrolment of individuals or re-training of the system as the population changes, a key requirement for transferability in practical agricultural settings.
Further research will look towards investigating the scalability of this form of approach to large populations. That is, increasing the base number of individuals via additional data acquisition with the intention of learning a general representation of dorsal features exhibited by Holstein Friesian cattle. In doing so, this paves the way for the model to generalise to new farms and new herds prior to deployment, with significant implications for the precision livestock farming sector.
Another future avenue of research will investigate extension to movement tracking from video sequences through continuous re-identification. As we have shown that our cattle detection and individual identification techniques are highly accurate, the incorporation of simple tracking techniques between video frames has the potential to filter out any remaining errors. How robust this approach will be to heavy bunching of cows (for example, before milking in traditional parlours) remains to be tested.
Further goals include the incorporation of collision detection for analysis of social networks and transmission dynamics, and behaviour detection for automated welfare and health assessment, which would allow longitudinal tracking of the disease and welfare status of individual cows. In this regard, the addition of a depth imagery component alongside standard RGB to support and improve these objectives needs to be evaluated.
References

Andrew et al. (2016). Automatic individual Holstein Friesian cattle identification via selective local coat pattern matching in RGB-D imagery. In 2016 IEEE International Conference on Image Processing (ICIP), pp. 484–488.

Barry et al. (2007). Using muzzle pattern recognition as a biometric approach for cattle identification. Transactions of the ASABE 50(3), pp. 1073–1080.

Department for Environment, Food and Rural Affairs (2008). The Cattle Book 2008: descriptive statistics of cattle numbers in Great Britain on 1 June 2008. DEFRA.

Hodan et al. (2017). T-LESS: an RGB-D dataset for 6D pose estimation of texture-less objects. In 2017 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 880–888.

Lagunes-Fortiz et al. (2019). The importance of metric learning for robotic vision: open set recognition and active learning. In 2019 International Conference on Robotics and Automation (ICRA), pp. 2924–2931.

van der Maaten and Hinton (2008). Visualizing high-dimensional data using t-SNE. Journal of Machine Learning Research 9(Nov), pp. 2579–2605.