Animal biometrics, especially image-based individual re-identification, has recently gained extensive attention due to the availability of large volumes of wildlife image data gathered via automatic game cameras and citizen science projects. The benefits of automated re-identification methods are evident as they allow valuable data for conservation efforts to be obtained, for example, accurate population size estimates and novel information about animal migration and behavior patterns (mccoy2018long; araujo2020getting). Compared to traditional methods such as tagging, which may cause stress and change the behavior of the animal, image-based re-identification offers a non-invasive technique for monitoring endangered species (norouzzadeh2018automatically).
The Saimaa ringed seal (Pusa hispida saimensis) is an endangered species native to Lake Saimaa, Finland. Seals of this species have a distinct ring-like pelage pattern, which is both permanent and unique for each individual, providing a basis for re-identification. An ongoing conservation effort (koivuniemi2016photo; koivuniemi2019mark; kunnasranta2021sealed) uses image-based re-identification to study animal migration and behavior. Currently, however, this re-identification work is carried out manually, which in view of the large number of images is very labour intensive and time consuming. Automated computer vision-based re-identification would clearly be of great benefit when carrying out this task.
A variety of methods for animal re-identification exist that utilize distinct characteristics in fur, feather and skin patterns (hotspotter; berger2015ibeis; moskvyak2019robust; li2019amur), and methods originally developed for human face re-identification have been successfully applied to animals (deb2018face; crouse2017lemur; agarwal2019triplet). Visual animal re-identification can be formulated as a task of finding a match for the given query image from a database of known individuals, which is equivalent to a content-based image retrieval (CBIR) problem (smeulders2000content) where an image is searched from a database based on the image content. However, despite the clear similarity between CBIR and re-identification tasks, the utilization of CBIR approaches for animal re-identification has remained largely unstudied.
Saimaa ringed seals introduce additional challenges to the re-identification that make the task more difficult compared to many other animals for which re-identification has already been successfully applied. First, the image data is extremely biased. The majority of images are collected using static game cameras producing images with the same viewing angle and background and a limited set of possible seal locations and poses in the frame. At the same time, the high site fidelity (a tendency to return to previously visited locations) and low sociality of Saimaa ringed seals often result in a large portion of images of one individual seal being captured by only one game camera. Machine learning models trained on this kind of data tend to learn features that do not generalize to new datasets (e.g., data from a different year with different game camera locations). Moreover, as only a small portion of Saimaa ringed seal habitat can be covered with game cameras, datasets for seal identification are usually complemented with DSLR camera images, as well as images obtained via citizen science projects (e.g., mobile phone camera pictures). This image heterogeneity introduces a domain shift and due to the fact that different individuals are often captured with different cameras, it also contributes to the database bias problem. Finally, re-identifying Saimaa ringed seals from images is very challenging per se because of: (i) the large variation in possible poses, which is further exacerbated by the deformable nature of the seals, (ii) the non-uniform pelage patterns, limiting the size of the regions that can be used for the re-identification task, and (iii) the low contrast between the ring pattern and the rest of the pelage, as well as the varying appearance (e.g., wet and dry fur). 
Re-identification of Saimaa ringed seals is therefore considerably more difficult than, for example, zebra re-identification, where there are clearly visible patterns and limited variation in the pose of the torso.
In this paper, we address the above challenges by proposing the NOvel Ringed seal re-identification by Pelage Pattern Aggregation (NORPPA) method for automatic Saimaa ringed seal re-identification (Fig. 1). The method is inspired by CBIR methods and builds on earlier work (nepovinnykh2020siamese) where Siamese networks were utilized to learn a similarity metric for local patches of pelage patterns. We further develop this approach by proposing an improved pattern feature embedding, which is done by utilizing affine invariant local CNN features and aggregating them into a fixed size embedding vector describing global features. The input image is first preprocessed using tone mapping and then segmented to detect and separate the seals from the background. The pelage pattern is further extracted using a U-net encoder-decoder (ronneberger2015u) based method. Affine invariant features are extracted and aggregated into a descriptor. Finally, the re-identification is performed by finding a descriptor with the minimum distance from the database of known individuals.
In the experimental part of the work, we show that the proposed method outperforms previously developed re-identification methods for Saimaa ringed seals as well as HotSpotter (hotspotter), a popular species-agnostic pattern-based re-identification approach, on the challenging task of Saimaa ringed seal re-identification. In addition, different variations of the method are comprehensively evaluated to find the best pattern feature embeddings for the task. The main contributions of this paper can be summarized as follows: (i) a novel Saimaa ringed seal re-identification method (NORPPA) inspired by content-based image retrieval methods, (ii) a novel combination of local affine-covariant region learning, CNN-based descriptors and feature aggregation to obtain a single fixed-size pattern embedding vector with high discrimination power, and (iii) extensive evaluation of the method and its modifications on a challenging Saimaa ringed seal dataset. While the method was developed for Saimaa ringed seals, it is also possible to apply it to other patterned species as shown by badreldeen2021metric.
2 Related work
2.1 Animal re-identification
Animal re-identification is a broad term referring to the process of identifying an individual animal based on its features. The features are based on biological traits, and they can be captured in a number of ways, for example, acoustically (hartwig2005individual; pruchova2017cues) or visually in the form of images (vidal2021perspectives) or videos (freytag2016Chimpanzee). Currently, image-based approaches are the most widely utilized due to the relative ease of data acquisition and manual analysis (schneider2019past).
Various animal species can be re-identified by different types of visually unique biological traits such as fur pattern, face or fin shape. Examples of such traits are presented in Fig. 2. Algorithmically, the methods can be divided into classification and metric-based approaches (vidal2021perspectives). Classification-based approaches assume that the database of known individuals is known and finite, and the final algorithm can only identify individuals from that database. Metric-based methods, on the other hand, aim to learn a similarity metric between the input images. The re-identification is then performed by clustering or matching based on the similarity, which means that metric-based approaches are not limited by the initial database and can be applied to new individuals without retraining. Re-identification algorithms also differ in the feature extraction approaches used, which can be manual or semi-manual, where user input is required to extract salient regions, or automatic, where input images are fully processed by the method. Fully automatic methods are of most interest as they would allow efficient analysis of large data volumes.
One of the largest wildlife re-identification projects, Wildbook (berger2017wildbook), uses different kinds of algorithms for edge-based or pattern-based re-identification. The most efficient algorithms for deep learning edge-based re-identification are CurvRank (weideman2017integral), finFindR (thompson2019finfindr), and OC/WDTW (bogucki2019applying). These have been applied to marine mammals such as bottlenose dolphins (Tursiops truncatus), humpback whales (Megaptera novaeangliae), and right whales (Eubalaena glacialis), and use the unique shape of the tail or fins to identify the animals.
Wildbook uses the PIE and HotSpotter metric-based algorithms to re-identify animals by pattern. PIE (moskvyak2019robust) is a deep learning-based method for matching individuals invariantly to the pose. It receives shape embedding and pose embedding separately and normalizes the shape to match the individual regardless of the specific pose. PIE was originally developed for manta rays (moskvyak2019robust), but in Wildbook it is also used for humpback whale flukes, orcas, and right whales. HotSpotter (hotspotter) is a SIFT-based (lowe1999object) species-agnostic algorithm that uses viewpoint invariant descriptors and a scoring mechanism which emphasizes the most distinctive key points, called “hot spots,” on an animal pattern. The HotSpotter algorithm has been successfully used for re-identification of zebras (Equus quagga) (hotspotter), giraffes (Giraffa tippelskirchi) (parham2017animal), jaguars (Panthera onca) (hotspotter) and ocelots (Leopardus pardalis) (nipko2020identifying).
Most recent methods for animal re-identification utilize deep learning, particularly convolutional neural networks (CNNs) (schneider2019past; schneider2020similarity). CNNs have been successfully applied for re-identification of primate faces (deb2018face; brust2017towards) and for pattern-based re-identification and recognition of Amur tigers (Panthera tigris) (li2019amur; Liu_2019_ICCV_Workshops; Liu_2019_ICCV), cattle muzzle (kumar2018deep), zebras (Equus quagga) and giraffes (Giraffa tippelskirchi) (badreldeen2021metric). In order to improve re-identification accuracy, pose estimation and key point alignment have been proposed (yeleshetty20203d; yu2021ap; moskvyak2021keypoint).
2.2 Ringed seal re-identification
A number of methods for the re-identification of Saimaa ringed seals have been developed (zhelezniakov2015segmentation; chehrsimin2017automatic; nepovinnykh2018identification; chelak2021eden). Generally, the methods start with preprocessing steps, including seal segmentation, and then proceed to analyzing the unique pelage pattern to generate a descriptor for each individual seal. A seal segmentation method utilizing superpixel classification was proposed in (zhelezniakov2015segmentation). The re-identification method employs common texture features extracted from the segmented seal and a Bayesian classifier. Additional color normalization and contrast enhancement steps were applied in (chehrsimin2017automatic) to make the pattern more visible. The actual re-identification was performed using the HotSpotter algorithm (hotspotter).
The first attempt to utilize CNNs for Saimaa ringed seal identification was made in (nepovinnykh2018identification). The individual re-identification was reformulated as a classification problem where each class corresponds to a unique individual, and transfer learning was utilized to train an individual classifier. While the performance is good on a small dataset, the method is only able to reliably perform the re-identification if there is a large set of example images for each individual. Furthermore, the whole system needs to be retrained if a new seal individual is introduced. Finally, it is unclear if the high accuracy is due to the method’s ability to learn the necessary features from the pelage pattern, or if it also learns features such as pose, size, or illumination, which separate individuals in the used dataset but do not provide the means to generalize the method to other datasets.
In order to address these issues, a one-shot approach was proposed in (nepovinnykh2020siamese). The method starts with CNN-based segmentation of the seal. The pelage pattern is extracted utilizing a Sato tubeness filter-based method. For the re-identification, the whole pattern image is divided into patches, which are then fed into an embedding CNN. The CNN is trained using a triplet loss and essentially provides a metric that measures the visual similarity between the patches. Re-identification is then performed based on this similarity by using topology-preserving projections. The main advantage of using a triplet CNN is the ability to easily add new individuals into the database since no retraining is necessary.
The pattern embedding step is crucial for any re-identification method, as a distinctive but compact embedding that captures the characteristics of the pattern forms the basis for successful re-identification. The pattern embedding step was considered in more detail in (chelak2021eden), where EDEN, a new pooling layer, was proposed to account for the spatial distribution of pattern features. It was shown that the proposed pooling layer increases the matching accuracy of the pattern patches.
Another version of the re-identification algorithm was proposed and applied to the sister species of Saimaa ringed seals, Ladoga ringed seals (Pusa hispida ladogensis), in (ladoga). Ladoga ringed seals have a similar pattern to Saimaa ringed seals, which means that the same re-identification algorithm is applicable to both species. Two new steps were introduced into the pipeline: individual grouping and Fisher Vector computation. The individual grouping step focuses on finding multiple instances of the same individual from an image sequence. This rather simple image retrieval-based method was shown to attain high accuracy in matching individuals within an image sequence, producing sets of images of each seal to be re-identified. This was shown to be beneficial for the re-identification as it helps to compensate for poor image quality, which often results in an inability to extract patterns from some images, and it allows a larger portion of the pattern to be captured as the seal changes its pose between images. The Fisher Vector (perronnin2007fisher; perronnin2010large; perronnin2010improving) is used to aggregate patch descriptors from an individual seal into a single image descriptor. The vector further allows the patch descriptors from multiple images to be aggregated, providing a straightforward tool for the utilization of the image sets produced by the grouping step. Aggregated image descriptors are used to find a match from the database of known individuals by calculating distances. Promising results were obtained on Ladoga ringed seal re-identification.
2.3 Content based image retrieval
The task of visual animal re-identification can be formulated as a task of finding the most similar image from the database to the given query image. This formulation matches the definition of content-based image retrieval (CBIR) (smeulders2000content) and motivates study of the suitability of CBIR methods for animal re-identification.
CBIR methods usually consist of two main steps: feature extraction and feature aggregation. The feature extraction problem can be solved using standard hand-crafted features, such as Scale Invariant Feature Transform (SIFT) (lowe2004distinctive; arandjelovic2012three), or extraction by convolutional neural networks (see e.g. (HardNet2017)). Then, feature aggregation creates a descriptor for each image that can be used to find the most similar image from the database. Traditional methods such as Bag of Words (BOW) (sivic2003video), Vector of Locally Aggregated Descriptors (VLAD) (jegou2010aggregating) and the Fisher Vector (perronnin2007fisher; perronnin2010large; perronnin2010improving) do the aggregation using a specially constructed vocabulary. The vocabulary is usually created by an unsupervised clustering algorithm. For example, k-means (macqueen1967some) is used for VLAD and a Gaussian Mixture Model (GMM) (mclachlan1988mixture) is used for the Fisher Vector. Finally, fixed-size descriptors are created for each image based on the vocabulary and extracted features. The distance between these descriptors is inversely proportional to the visual similarity.
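To make the vocabulary-based aggregation concrete, the following is a minimal NumPy sketch of VLAD encoding. It assumes that the vocabulary (cluster centres) has already been obtained, e.g. by k-means; the function name `vlad` and the array shapes are illustrative assumptions and are not taken from any of the cited implementations.

```python
import numpy as np

def vlad(features, centers):
    """Aggregate local features (N, D) into one VLAD descriptor of size K*D.

    `centers` (K, D) is an assumed, precomputed vocabulary (e.g. k-means centres).
    """
    # Squared Euclidean distance from every feature to every centre: (N, K)
    d2 = ((features[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    nearest = d2.argmin(axis=1)  # hard assignment of each feature to one visual word

    k, d = centers.shape
    residuals = np.zeros((k, d))
    for idx in range(k):
        assigned = features[nearest == idx]
        if len(assigned):
            # Accumulate residuals between the features and their centre
            residuals[idx] = (assigned - centers[idx]).sum(axis=0)

    descriptor = residuals.ravel()
    norm = np.linalg.norm(descriptor)
    # L2-normalize so that descriptors of different images are comparable
    return descriptor / norm if norm > 0 else descriptor
```

The descriptor length depends only on the vocabulary size and the feature dimensionality, not on the number of local features, which is what allows images with different numbers of detected regions to be compared directly.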
Due to the availability of data and the convenience of end-to-end approaches, deep learning-based methods for CBIR are becoming increasingly popular. The advantage of CNN-based methods is that the two main steps, feature extraction and feature aggregation, are naturally implemented as a part of the network architecture, with the first part of the network being a feature extractor and a final specialized layer doing the feature aggregation. For example, there have been several attempts to create deep analogues of traditional methods, such as NetVLAD (arandjelovic2016netvlad), where a generalized VLAD layer is used to aggregate CNN-extracted features.
In (gong2014multi; babenko2014neural), fully-connected layers are used to generate the final descriptor, which is a standard approach for CNNs. In (razavian2016visual), a global max pooling approach is introduced which produces the final descriptor from the activation maps by taking a maximum value from each filter activation, resulting in a descriptor of the same size as the number of filters. Different variants of global pooling operations have also been studied. These include integral max-pooling (tolias2015particular), sum pooling (babenko2015aggregating) and generalized mean pooling (radenovic2018fine). Integral max-pooling (tolias2015particular) is particularly interesting since it creates the final descriptor by applying max-pooling to the overlapping image regions, which also allows spatial information to be encoded.
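The global pooling operations mentioned above can be illustrated with a short NumPy sketch operating on a stack of activation maps of shape (C, H, W). The function names and the default exponent `p=3.0` are assumptions made for illustration only; they do not reproduce the cited implementations.

```python
import numpy as np

def global_max_pool(activations):
    """Maximum of each filter activation map: (C, H, W) -> (C,)."""
    return activations.max(axis=(1, 2))

def global_sum_pool(activations):
    """Sum of each filter activation map: (C, H, W) -> (C,)."""
    return activations.sum(axis=(1, 2))

def generalized_mean_pool(activations, p=3.0, eps=1e-6):
    """Generalized mean of each map; p=1 gives average pooling,
    and large p approaches max pooling."""
    clipped = np.clip(activations, eps, None)  # keep values positive before the power
    return (clipped ** p).mean(axis=(1, 2)) ** (1.0 / p)
```

In all three cases the descriptor has one value per filter, so its size equals the number of filters regardless of the spatial resolution of the input.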
The proposed NORPPA method consists of six steps: 1) image preprocessing, 2) seal instance segmentation, 3) pelage pattern extraction, 4) feature extraction, 5) feature aggregation and 6) individual re-identification (see Fig. 3).
3.1 Image preprocessing
Depending on illumination conditions, variation in the contrast of the images can be rather high. This could lead to a loss of detail in the region of interest, i.e. the seal and its pelage pattern. In order to rectify this issue, we employ a tone-mapping approach to equalize the contrast in dark and bright image regions. The algorithm proposed by (mantiuk_perceptual_2006) is used due to its ability to produce realistic tone-mapped images without introducing visual artifacts. This method considers contrast on multiple spatial frequencies while using gradient methods with some additional extensions to ensure that the global brightness levels are not reversed and low-frequency details are properly reconstructed. Examples of images before and after preprocessing are presented in Fig. 4.
3.2 Seal instance segmentation
The seal instance segmentation step is important since most of the images are obtained using static camera traps. This, together with the fact that seal individuals tend to use the same sites or areas inter-annually, causes one seal individual to be very often captured with the same camera (same background). This increases the risk that the supervised identification algorithm learns to identify the background instead of the actual seal if the full image or the bounding box around the seal is used. Consequently, this behavior may result in a system that is unable to identify the seal in a new environment.
Instance segmentation is performed using Mask R-CNN (he2017mask). A segmentation model trained for Ladoga ringed seals from (ladoga) is utilised. This is possible due to the two species being visually almost indistinguishable. Ladoga ringed seals are more numerous than Saimaa ringed seals and they are often captured in large groups, which makes it easier to collect and annotate a large training dataset for the segmentation. For more details about the instance segmentation model and training procedure, see (ladoga).
After the segmentation masks are obtained, morphological closing and opening are applied to close the holes and smooth the borders. Examples of segmentation results are presented in Fig. 5.
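The mask post-processing can be sketched as follows. This is a minimal NumPy illustration that assumes a binary mask and a 3x3 square structuring element; the structuring element, the helper names and the order "closing, then opening" are assumptions, since these details are not specified here.

```python
import numpy as np

def _neighbourhood_views(mask, pad_value):
    """Return the nine 3x3-shifted views of a padded binary mask."""
    h, w = mask.shape
    padded = np.pad(mask, 1, constant_values=pad_value)
    return [padded[dy:dy + h, dx:dx + w] for dy in range(3) for dx in range(3)]

def dilate(mask):
    # A pixel is set if any pixel in its 3x3 neighbourhood is set
    return np.logical_or.reduce(_neighbourhood_views(mask, False))

def erode(mask):
    # A pixel is kept only if its whole 3x3 neighbourhood is set
    return np.logical_and.reduce(_neighbourhood_views(mask, False))

def clean_mask(mask):
    """Closing (fills small holes) followed by opening (removes small specks)."""
    mask = mask.astype(bool)
    closed = erode(dilate(mask))
    return dilate(erode(closed))
```

Closing fills holes smaller than the structuring element, whereas opening removes isolated foreground specks and smooths the mask border.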
3.3 Pelage pattern extraction
The main distinguishing feature of a seal is its pelage pattern, which is both permanent and unique to each seal, allowing the identification of individuals over their whole lifetime. The pelage pattern forms the basis for the proposed re-identification method. In order to focus the attention on the pattern and discard irrelevant information causing database bias, such as illumination and other visual factors (e.g., wet fur looks different from dry fur), the pattern is segmented. This is done using a CNN-based method utilizing the U-net encoder-decoder architecture (ronneberger2015u). The output of the method is a binarized image of the pelage pattern (see Fig. 6). The pattern image is further post-processed to remove small noise by using unsharp masking and morphological opening. All images are then resized in such a way that the mean width of the pattern lines is the same for all images, bringing them into the same scale. This is necessary because the images are obtained from various sources and the image resolution has a large variation. For a more detailed explanation of the pattern extraction step, as well as a comparison to other methods, see (zavialkin2020cnn).
3.4 Feature extraction
Seals can be found in a variety of poses. The deformable nature of a seal's body results in distorted and warped patterns in images. While the pattern as a whole is transformed in a non-linear way, it can be argued that small local regions experience close to affine transformations, making an affine invariant feature extractor suitable for the task. For this purpose, a combination of the HesAffNet (AffNet2018) detector and the HardNet (HardNet2017) descriptor is used.
The combination of a Hessian-Affine detector (mikolajczyk2004scale) with RootSIFT (arandjelovic2012three) used to be considered a gold standard for local feature extraction and description. However, with the increasing size of available datasets and rapidly developing field of deep learning, CNN-based methods are now able to outperform previous handcrafted features. The combination of HesAffNet (AffNet2018) and HardNet (HardNet2017) is able to provide state-of-the-art results in image retrieval tasks, which makes those methods particularly useful for animal re-identification as well.
HesAffNet is a modification of the classical Hessian Affine Region detector (mikolajczyk2002affine; mikolajczyk2004scale), where the shape estimation step is done by the AffNet CNN. The detector is based on the Harris cornerness measure (Harris1988ACC), which uses a second moments matrix to find regions of interest by estimating the most prominent gradient directions. This method is combined with the multiscale approach from (lindeberg1998feature), which uses the Laplacian of Gaussian to find extrema in the scale space. The same concept can be further extended to all affine transformations, not just the scale. However, the degree of freedom is much higher for affine transformations, which complicates the process and requires a special shape adaptation algorithm. The original Hessian Affine detector used Baumberg iteration (Baumberg2000), which is replaced by an AffNet CNN in HesAffNet.
AffNet and HardNet are closely related, sharing the architecture and a similar training procedure. During the training of HardNet, batches of $n$ matching patch pairs $(a_i, p_i)$, $i = 1, \dots, n$, are chosen, each containing an anchor $a_i$ and a positive match $p_i$. Each pair corresponds to a different location, i.e. there are no other matches except for the ones in each pair. Each patch is encoded by the network, and a matrix of pair-wise distances $d(a_i, p_j)$ between all anchors and positive matches is computed. For each pair, the closest non-matching descriptor from the batch is chosen, and the final hard negative margin loss is computed as
$$L = \frac{1}{n} \sum_{i=1}^{n} \max\left(0,\ 1 + d(a_i, p_i) - \min\left(d(a_i, p_{j_{\min}}),\ d(a_{k_{\min}}, p_i)\right)\right),$$
where $p_{j_{\min}}$ is the closest non-matching positive to $a_i$, and $a_{k_{\min}}$ is the closest non-matching anchor to $p_i$.
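The hard negative margin loss described above can be sketched in NumPy as follows. This is a forward-pass illustration only (no gradients), assuming that the descriptors of a batch are already computed, that the Euclidean distance is used, and that the margin equals 1; the function name is hypothetical and the code is not the original HardNet implementation.

```python
import numpy as np

def hard_negative_margin_loss(anchors, positives, margin=1.0):
    """anchors, positives: (n, D) descriptors; row i of each forms a matching pair."""
    # Pair-wise Euclidean distances: dist[i, j] = d(a_i, p_j)
    diff = anchors[:, None, :] - positives[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2) + 1e-12)

    matching = np.diag(dist)  # d(a_i, p_i)

    # Exclude the matching pairs before searching for the hardest negatives
    non_matching = dist.copy()
    np.fill_diagonal(non_matching, np.inf)
    closest_positive = non_matching.min(axis=1)  # closest non-matching positive to a_i
    closest_anchor = non_matching.min(axis=0)    # closest non-matching anchor to p_i
    hardest_negative = np.minimum(closest_positive, closest_anchor)

    return np.maximum(0.0, margin + matching - hardest_negative).mean()
```

The loss is zero once every matching pair is closer than its hardest negative by at least the margin, so only the most confusing non-matching descriptors in a batch contribute to training.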
AffNet utilizes a slightly different training procedure, the main difference being that the derivative for the negative term in the loss is set to 0. This loss is called hard negative-constant and helps avoid the situations where positive samples cannot be moved closer together because of a negative sample lying between them in the metric space. The training procedure for AffNet is also more complicated, since it is learning affine shapes and not just a distance metric. Therefore, spatial transformers are used to transform input patches according to the predicted shape. The transformed patches are then fed into a descriptor network, e.g. HardNet, and only then is the loss calculated and backpropagated through both networks. An example of applying HesAffNet to a preprocessed image is visualised in Fig. 7.
3.5 Feature aggregation
Features are aggregated using the Fisher Vector (perronnin2007fisher; perronnin2010large; perronnin2010improving). First, Principal Component Analysis (PCA) is applied to the resulting feature embeddings to decorrelate the features and reduce the dimensionality. This is important for Fisher Vectors, which are known to produce large descriptors. The images in the database of known individuals are used to learn the principal components. Next, a visual vocabulary is constructed by fitting a Gaussian Mixture Model (GMM) to the features from the database. Then, Fisher Vectors are created for each image by computing the partial derivatives of the log-likelihood function with respect to the GMM parameters and concatenating them. Kernel PCA (scholkopf_nonlinear_1998) is applied to further reduce the dimensionality of the resulting image descriptors, which helps to reduce the storage requirements for the database, as well as to speed up the database search for the re-identification.
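The first step of the aggregation, learning the principal components on database features and applying them to new features, can be sketched as below. This is a minimal NumPy illustration via the singular value decomposition; the function names and the number of retained components are assumptions, not values used in the experiments.

```python
import numpy as np

def fit_pca(db_features, n_components):
    """Learn the PCA mean and principal axes from database features (N, D)."""
    mean = db_features.mean(axis=0)
    # Rows of vt are the principal axes, ordered by decreasing variance
    _, _, vt = np.linalg.svd(db_features - mean, full_matrices=False)
    return mean, vt[:n_components]

def apply_pca(features, mean, components):
    """Project features (M, D) onto the learned axes: result has shape (M, n_components)."""
    return (features - mean) @ components.T
```

The same `mean` and `components` learned from the database would be reused for query images, so that database and query descriptors live in the same reduced space.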
3.5.1 Fisher Vector
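Let $X = \{x_t,\ t = 1, \dots, T\}$ be a sample of $T$ observations and $u_\lambda$ be a probability density function modelling the distribution of the data, where $\lambda$ is a vector of its parameters. The score is defined as the gradient of the log-likelihood of the data on the model:
$$G_\lambda^X = \nabla_\lambda \log u_\lambda(X).$$
This score function can be used to define the Fisher Information Matrix (FIM) (amari2000methods):
$$F_\lambda = \mathbb{E}_{x \sim u_\lambda}\left[ G_\lambda^x \, {G_\lambda^x}^\top \right],$$
which acts as a local metric for a parametric family of distributions. This metric can also be used to measure the similarity between two samples using the Fisher Kernel (FK) (jaakkola1998exploiting):
$$K_{FK}(X, Y) = {G_\lambda^X}^\top F_\lambda^{-1} G_\lambda^Y = {\mathcal{G}_\lambda^X}^\top \mathcal{G}_\lambda^Y,$$
where $L_\lambda$ is the Cholesky decomposition of $F_\lambda^{-1}$, i.e. $F_\lambda^{-1} = L_\lambda^\top L_\lambda$, and $\mathcal{G}_\lambda^X = L_\lambda G_\lambda^X$ and $\mathcal{G}_\lambda^Y = L_\lambda G_\lambda^Y$ are the Fisher Vectors of samples $X$ and $Y$ respectively. By using Fisher Vectors, it is possible to calculate the kernel as a simple dot product, which can be utilized efficiently by linear classifiers. When constructing a Fisher Vector for an image, the local features in the set are assumed to be independent, meaning that the final descriptor can be constructed as a sum of Fisher Vectors for each local feature, i.e.
$$\mathcal{G}_\lambda^X = \sum_{t=1}^{T} L_\lambda \nabla_\lambda \log u_\lambda(x_t).$$
Usually, a Gaussian Mixture Model (GMM) is used as $u_\lambda$, since it can be used to approximate any continuous distribution with arbitrary precision (titterington1985statistical). Then, the vector of parameters $\lambda$ contains mixture weights $w_k$, mean vectors $\mu_k$ and covariance matrices $\Sigma_k$ for each Gaussian $u_k$, $k = 1, \dots, K$. Using the assumption that the assignment of each feature to mixture components is almost hard, i.e. each feature is assigned to only one cluster, it can be inferred (sanchez2013image) that the FIM is diagonal, which means that $L_\lambda$ is just a coordinate-wise normalization of the gradient vectors. With diagonal covariance matrices, as assumed in (sanchez2013image), and $\sigma_k^2$ denoting the diagonal of $\Sigma_k$, the final normalized gradients are then defined as follows
$$\mathcal{G}_{\mu_k}^X = \frac{1}{\sqrt{w_k}} \sum_{t=1}^{T} \gamma_t(k) \left( \frac{x_t - \mu_k}{\sigma_k} \right),$$
$$\mathcal{G}_{\sigma_k}^X = \frac{1}{\sqrt{w_k}} \sum_{t=1}^{T} \gamma_t(k) \frac{1}{\sqrt{2}} \left[ \frac{(x_t - \mu_k)^2}{\sigma_k^2} - 1 \right],$$
where the division and the square are applied element-wise and $\gamma_t(k)$ is the soft assignment function
$$\gamma_t(k) = \frac{w_k u_k(x_t)}{\sum_{j=1}^{K} w_j u_j(x_t)}.$$
It should be noted that the gradients for the weight parameters are usually omitted, since they do not provide much additional information (perronnin2010improving). The remaining gradients are concatenated into a vector of size $2DK$, where $D$ is the dimensionality of the samples and $K$ is the number of components in the GMM. It has been shown (perronnin2010improving) that $L_2$ and Power normalization generally improve the performance of the method. Therefore, it is common to apply Power and $L_2$ normalization to the Fisher Vector to get the final descriptor.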
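As an illustration of the encoding, the sketch below computes a Fisher Vector from local features and the parameters of an already fitted GMM. It assumes diagonal covariances (passed as per-dimension standard deviations), omits the weight gradients, and uses a power normalization exponent of 0.5; the function name and these parameter choices are assumptions made for the example rather than the settings used in the experiments.

```python
import numpy as np

def fisher_vector(x, weights, means, sigmas):
    """x: (T, D) local features; weights: (K,); means, sigmas: (K, D).

    Returns a power- and L2-normalized Fisher Vector of size 2*D*K.
    """
    t, d = x.shape
    z = (x[:, None, :] - means[None, :, :]) / sigmas[None, :, :]  # (T, K, D)

    # Log-density of each diagonal Gaussian, then soft assignments (posteriors)
    log_gauss = (-0.5 * (z ** 2).sum(axis=2)
                 - np.log(sigmas).sum(axis=1)[None, :]
                 - 0.5 * d * np.log(2.0 * np.pi))
    log_post = np.log(weights)[None, :] + log_gauss
    log_post -= log_post.max(axis=1, keepdims=True)  # numerical stability
    gamma = np.exp(log_post)
    gamma /= gamma.sum(axis=1, keepdims=True)  # (T, K), each row sums to 1

    # Normalized gradients with respect to the means and standard deviations
    g_mu = (gamma[:, :, None] * z).sum(axis=0) / np.sqrt(weights)[:, None]
    g_sigma = (gamma[:, :, None] * (z ** 2 - 1.0)).sum(axis=0) / np.sqrt(2.0 * weights)[:, None]

    fv = np.concatenate([g_mu.ravel(), g_sigma.ravel()])
    fv = np.sign(fv) * np.sqrt(np.abs(fv))  # power normalization (exponent 0.5 assumed)
    norm = np.linalg.norm(fv)
    return fv / norm if norm > 0 else fv    # L2 normalization
```

Because the sum runs over all local features, descriptors from several images of the same individual can be aggregated simply by stacking their local features before calling the function.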
3.6 Individual re-identification
Re-identification is done by calculating the cosine distance from the query image descriptor to each image descriptor in the database of known individuals and selecting the individual ID with the lowest distance. To visualize the re-identification and to provide a semi-automatic tool for experts, heatmaps highlighting the similar areas in the patterns of the query image and database images are computed. This is done using the following method. First, features from a query are paired with the closest database features. Then, pairs with a distance larger than the 10th percentile of distances are discarded. The remaining pairs are used to find the homography using Direct Linear Transform (DLT) (DLT) and Random Sample Consensus (RANSAC) (RANSAC). The inliers of the final homography are highlighted with ellipses aligned and transformed according to the extracted affine regions. The intensity of each ellipse is inversely proportional to the distance between the local features in the corresponding pair, i.e. directly proportional to their similarity.
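The matching step and the percentile-based filtering of feature pairs can be sketched as follows. This is a minimal NumPy illustration; the function names, the use of a list of IDs, and the handling of ties are assumptions, and the homography estimation with DLT and RANSAC is not included.

```python
import numpy as np

def reidentify(query, db_descriptors, db_ids):
    """Return the ID of the database descriptor with the lowest cosine distance."""
    q = query / np.linalg.norm(query)
    db = db_descriptors / np.linalg.norm(db_descriptors, axis=1, keepdims=True)
    distances = 1.0 - db @ q  # cosine distance to every database descriptor
    best = int(np.argmin(distances))
    return db_ids[best], float(distances[best])

def keep_closest_pairs(pair_distances, percentile=10):
    """Indices of feature pairs whose distance does not exceed the given percentile."""
    threshold = np.percentile(pair_distances, percentile)
    return np.flatnonzero(pair_distances <= threshold)
```

A usage example with toy descriptors:

```python
db = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
seal_id, distance = reidentify(np.array([0.9, 0.1]), db, ["a", "b", "c"])
```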
4 Experiments and results
The dataset consists of 57 individual seals with a total of 2080 images. The dataset is divided into two subsets: database and query. The database subset contains a minimal number of high-quality unique images that are enough to cover the full body pattern of each seal. The query subset contains the remaining images, which depict the same individuals as the database. It should be noted that high-quality images were prioritized when constructing the database and, therefore, images in the query subset often have lower quality. Examples of images from both subsets are presented in Fig. 8. The dataset has been made publicly available. For further description of the dataset, see (dataset).
To train and evaluate the patch embedding (feature extraction) and matching (finding the corresponding patch in other images), a separate dataset of pattern image patches (see Fig. 9) was constructed (chelak2021eden). The dataset contains, in total, 4599 images (patches of the size pixels). The data is divided into training and testing subsets. The training subset contains 3016 images and 16 classes. The testing subset contains 1583 images and 26 classes that are different from the classes in the training subset. Each class corresponds to one manually selected location in the pelage pattern of one individual seal. Each sample from one class was extracted from a different image of the same seal. For estimation of the accuracy of the method, the testing set was divided into database and query subsets with a ratio of 1 to 2. The images that were used to construct the dataset of pattern image patches are not included in the database and query subsets of the re-identification dataset.
4.2 Feature extraction
The feature extraction step contains two differences compared to the previous version of the Saimaa ringed seal re-identification algorithm (nepovinnykh2020siamese). The first difference is that the region of interest detection approach uses affine invariant regions (HesAffNet) instead of dense patches. The second difference is a switch to the HardNet network to compute the patch embedding. To assess the necessity of each of these changes, both modifications were evaluated separately. Hyperparameters for all versions of the algorithm were chosen using the Tree-structured Parzen Estimator algorithm (bergstra2011algorithms). The results of the experiments are presented in Table 1.
As can be seen, both HesAffNet for region of interest detection and HardNet for patch embedding computation improve the accuracy noticeably. This finding leads to the conclusion that the dense patches approach cannot handle more general cases, whereas affine invariant features provide much needed robustness to various imaging conditions.
To evaluate the effect of the pelage pattern extraction on the algorithm’s accuracy, an ablation study was performed. The results with and without the pattern extraction step are presented in Table 2. It is clear that the pelage pattern extraction significantly increases the accuracy of the algorithm.
4.3 Patch embedding network
The following experiments were conducted in order to further improve the method:
Training and fine-tuning of HardNet on different datasets,
Various architecture modifications to the HardNet model.
4.3.1 Training and fine-tuning
The original HardNet was trained on the union of HPatches (balntas2019hpatches) and Brown (brown_automatic_2007) datasets. Typically, fine-tuning a machine learning model on domain-specific training data improves the method performance in a new domain. To test this on Saimaa ringed seal re-identification, we fine-tuned the HardNet model on patches of pelage pattern images. Fine-tuned models were compared to the pretrained model, a model trained from scratch on the pattern patches, and a model trained on the union of all datasets.
The results are presented in Table 3. For the training, all hyperparameters and random seeds were taken from the original implementation of HardNet (HardNet2017).
Table 3: Comparison of results for HardNet trained and fine-tuned on various datasets. We report the mean with standard deviation.
While fine-tuning on the patches dataset improved the accuracy of the patch matching, the overall accuracy of the full-image matching dropped significantly. One possible reason is that the patches dataset was created using patches of the same scale, while the patches extracted by HesAffNet during the full re-identification algorithm vary in scale, leading to a different level of detail.
Training on the union of all datasets showed no considerable improvements. This result can be explained by the small size of the pelage pattern patches dataset in comparison to the combined sizes of the Brown and HPatches datasets. In other words, since HardNet utilizes triplet sampling during the training stage, the probability of an image from the pelage pattern dataset appearing in a triplet is extremely small.
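The dilution argument can be made concrete with a small calculation. The sketch below assumes that patches are drawn uniformly from the pooled data, which simplifies HardNet's actual batch construction. The combined size of the Brown and HPatches data used here is a purely hypothetical placeholder; only the 3016 pelage training patches come from the text.

```python
def pelage_fraction(n_pelage, n_other):
    """Probability that one uniformly drawn patch is a pelage patch."""
    return n_pelage / (n_pelage + n_other)


def prob_all_from_pelage(n_pelage, n_other, draws=3):
    """Probability that `draws` independent uniform draws are all pelage patches."""
    return pelage_fraction(n_pelage, n_other) ** draws


N_PELAGE = 3016      # training patches reported in the text
N_OTHER = 1_000_000  # hypothetical size of Brown + HPatches combined

p_single = pelage_fraction(N_PELAGE, N_OTHER)
p_triplet = prob_all_from_pelage(N_PELAGE, N_OTHER)
print(f"single patch: {p_single:.4%}, full triplet: {p_triplet:.2e}")
```

Whatever the exact sizes are, the fraction shrinks as the general-purpose data grows, so one possible remedy, not evaluated in the text, would be to oversample the pelage patches during training.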
4.3.2 Architecture modifications
Several further modifications to the HardNet architecture were also considered. First, a Self-Organized Operational Neural Network (Self-ONN) (malik2021self) was incorporated into the HardNet model. Self-ONNs are networks consisting of layers that are the generalizations of convolutional layers. Simply put, each value in a convolutional kernel can be seen as a linear function, and this function can be generalized through Taylor series approximation with coefficients learned by the network. Such an approach leads to great nonlinearity even with shallow networks. Other modifications include the use of an EDEN pooling layer (chelak2021eden), as well as changes to the number of channels and the output vector size.
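As an illustration of this idea, the sketch below implements a single-channel, one-dimensional version of the operation in plain Python: each kernel position holds Q coefficients, one per power of the input. The function name, the 1-D setting, and the omission of a bias and an activation function are simplifications assumed here, not the layer used in the experiments.

```python
def self_onn_1d(signal, weights):
    """Valid-mode 1-D operational 'convolution' (cross-correlation form).

    weights[q][j] multiplies signal[i + j] ** (q + 1), so `weights` has shape
    (Q, kernel_size). With Q == 1 this reduces to an ordinary convolution.
    """
    kernel_size = len(weights[0])
    out = []
    for i in range(len(signal) - kernel_size + 1):
        total = 0.0
        for q, w_q in enumerate(weights):
            for j, w in enumerate(w_q):
                total += w * signal[i + j] ** (q + 1)
        out.append(total)
    return out
```

Because the kernel stores one coefficient set per Taylor term, the number of weights grows linearly with the chosen degree Q.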
The following models were evaluated:
HardNetONN. This model has the same architecture as HardNet in terms of layers and number of channels in each layer. The only difference is that each convolutional layer is replaced by a Self-ONN layer with a Taylor series degree of 3 for all layers, which leads to three times as many parameters.
HardNetONNDrop. This model has the same architecture as HardNetONN, but the last layer has dropout with a probability of 0.3, as in the original HardNet.
HardNetONN + EDEN. This model has the same architecture as HardNetONN, except that the last convolutional layer kernel is downsampled from to so that pooling can be applied. After the pooling, a vector of size 128 is fed into a fully connected layer with an output size of 128, resulting in a compact embedding.
HardNetONNSmall. This model has the same number of parameters as HardNet. All the layers were shrunk by half and the Taylor series degree was set to 4 for all layers. Consequently, the final vector has a size of 64 instead of 128.
HardNet3_384. This model is an original HardNet with 3 times as many channels in all of the layers. Therefore, it has an output vector of size 384 instead of 128.
HardNet3_128. This model is the same as HardNet3_384 but with an output vector size of 128.
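The parameter counts implied by these descriptions can be checked with a short calculation. The sketch below assumes the commonly published HardNet layout of seven convolutional layers without biases or learnable normalization parameters; this layout is an assumption here and is not stated in the text. It also assumes that a Self-ONN layer stores one kernel per Taylor term.

```python
# Assumed HardNet layer layout: (in_channels, out_channels, kernel_size),
# no biases and no learnable normalization parameters.
HARDNET = [(1, 32, 3), (32, 32, 3), (32, 64, 3), (64, 64, 3),
           (64, 128, 3), (128, 128, 3), (128, 128, 8)]


def scale_channels(layers, factor):
    """Multiply every channel count by `factor`, keeping the 1-channel input."""
    scaled = []
    for idx, (c_in, c_out, k) in enumerate(layers):
        new_in = c_in if idx == 0 else int(c_in * factor)
        scaled.append((new_in, int(c_out * factor), k))
    return scaled


def count_weights(layers, taylor_degree=1):
    """Kernel weights; a Self-ONN layer stores one kernel per Taylor term."""
    return taylor_degree * sum(c_in * c_out * k * k for c_in, c_out, k in layers)


base = count_weights(HARDNET)
onn = count_weights(HARDNET, taylor_degree=3)
onn_small = count_weights(scale_channels(HARDNET, 0.5), taylor_degree=4)
wide = count_weights(scale_channels(HARDNET, 3))
print(base, onn, onn_small, wide)
```

Under these assumptions the counts come to about 1.33 million weights for HardNet, exactly three times that for HardNetONN, almost the same as HardNet for HardNetONNSmall (halving the channels quarters each layer, and degree 4 restores it), and about 12 million for HardNet3_384.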
A comparison of the models is presented in Table 4.
|HardNetONN + EDEN||1.3|
The HardNetONN and HardNet3_384 models show higher accuracy on both patch matching and full re-identification tasks than other versions of HardNet. Moreover, although HardNet3_384 has 12 million parameters and a vector size of 384, the difference in scores compared to HardNetONN is small, with the TOP-5 full re-identification score difference being negligible. A comparison of the processing speed of the models is presented in Fig. 10. Overall, the improvements over the baseline HardNet are rather small, while they result in a noticeable increase in computation time, limiting their usability in practice.
The accuracy of HardNetONNSmall is lower than that of HardNet, although it has the same number of parameters. This can be explained by the fact that the embedding vector of HardNetONNSmall is half the size and may not contain enough information to learn a good metric. Additionally, HardNetONN + EDEN also scored lower than the original HardNet, although higher than HardNetONNSmall. The reason could lie in the redundancy of the inductive bias provided by the pooling, as well as the worse convergence of the model.
4.4 Qualitative evaluation
Visual examples of the re-identification results for the proposed NORPPA method are presented in Fig. 11. For the final version we use HardNet trained on Brown and HPatches datasets. Upon inspecting the results with highlighted areas, it is evident that the proposed method learns to perform the matching between query and database images based on the characteristics of the pelage pattern. Furthermore, it can be seen that the method is able to find the corresponding regions in the patterns in very challenging cases (Fig. 12).
4.5 Quantitative evaluation
The SaimaaReID (nepovinnykh2020siamese), LadogaReID (ladoga) without the grouping step, and NORPPA seal re-identification methods were compared to HotSpotter (hotspotter), which is another method developed for patterned animal re-identification. HotSpotter is species-agnostic, and as such can be applied to Saimaa ringed seals as well. The results of NORPPA and HotSpotter for the Saimaa ringed seal dataset are presented in Table 5. It can be seen that the proposed method clearly outperforms HotSpotter based on TOP-1 accuracy. The difference is even clearer for TOP-5 accuracy, implying that even when NORPPA fails to correctly re-identify the seal, it is often able to provide a high rank for the correct match in the database. This is especially useful when the method is applied in a semi-supervised manner where the algorithm provides a set of possible matches for the expert to verify.
By considering a larger number of top matches, it is possible to further increase the chances of finding the correct individual. The plot of the top-k accuracy relative to the value of k is presented in Fig. 13. The relationship for the NORPPA, SaimaaReID and LadogaReID methods is logarithmic in nature, with fast growth for small values of k that slows down significantly for higher values. HotSpotter, on the other hand, exhibits almost no improvement after TOP-2 accuracy, with the difference between TOP-1 and TOP-5 accuracy being only about 2%, while the difference for NORPPA is almost 10%. This improvement in accuracy is a desirable property for a semi-automatic approach, offering a considerable accuracy improvement in exchange for a relatively small increase in the manual work required (as compared to a fully manual approach). Depending on the final application and available data, the relationship between the top-k accuracy and k can be used to determine the optimal number of matches to be returned by the algorithm.
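For reference, the accuracy measure discussed here can be computed as in the following minimal sketch. It assumes that each query yields a list of candidate individual identities ordered from best to worst match; the function names and the input layout are illustrative and are not taken from the published code.

```python
def top_k_accuracy(ranked_ids, true_ids, k):
    """Fraction of queries whose true identity is among the first k candidates.

    ranked_ids: one list of candidate identities per query, best match first.
    true_ids:   the correct identity of each query.
    """
    if len(ranked_ids) != len(true_ids):
        raise ValueError("one ranking is needed per query")
    hits = sum(truth in ranking[:k] for ranking, truth in zip(ranked_ids, true_ids))
    return hits / len(true_ids)


def accuracy_curve(ranked_ids, true_ids, max_k):
    """Top-k accuracy for k = 1..max_k, e.g. for plotting accuracy against k."""
    return [top_k_accuracy(ranked_ids, true_ids, k) for k in range(1, max_k + 1)]
```

The resulting curve never decreases as more candidates are considered, so the number of matches shown to an expert can be chosen as the smallest value after which the gain in accuracy no longer justifies the extra verification effort.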
A novel method for Saimaa ringed seal re-identification called NOvel Ringed seal re-identification by Pelage Pattern Aggregation (NORPPA) was proposed in this paper. The method utilizes pelage pattern extraction and feature aggregation inspired by content-based image retrieval techniques. The re-identification pipeline consists of image enhancement, seal instance segmentation by Mask R-CNN, U-net based pelage pattern extraction, pattern feature extraction, feature aggregation, and individual re-identification by database search. Improved pattern feature embeddings were proposed by employing affine-invariant region of interest detection, CNN-based feature descriptors, and Fisher Vector feature aggregation to obtain fixed-size embedding vectors with high discriminative power. The proposed method was applied to a novel and challenging Saimaa ringed seal dataset and showed superior performance compared to HotSpotter and earlier versions of the Saimaa ringed seal re-identification method by the authors. One additional benefit of the proposed method is that it allows features to be aggregated over multiple images. This opens interesting possibilities for further research, as sequences of game camera images can be utilized to create a single descriptor for a larger portion of a pelage pattern by filling in the gaps created by obstructions and viewpoints. While the method was developed for Saimaa ringed seals, it can also be applied to other patterned animal species.
The authors would like to thank the Raija ja Ossi Tuuliaisen Säätiö Foundation and the project CoExist (Project ID: KS1549) for funding the research. In addition, the authors would like to thank Vincent Biard, Piia Mutka, Marja Niemi, and Mervi Kunnasranta from the Department of Environmental and Biological Sciences at the University of Eastern Finland (UEF) for providing the data of Saimaa ringed seals and their expert knowledge of identifying each individual.
The research is a part of the project CoExist (Project ID: KS1549) funded by the European Union, the Russian Federation and the Republic of Finland via The Southeast Finland–Russia CBC 2014-2020 programme.
Conflict of interest
We declare no competing interests.
Data collection was done under permits by the Finnish environmental authorities ELY centre (ESAELY/1290/2015, POKELY/1232/2015, KASELY/2014/2015 and POSELY/313/07.01/2012) and Metsähallitus (MH 5813/2013 and MH 6377/2018/05.04.01).
Consent for publication
All authors consent that the publisher has the authors’ permission to publish research findings. All authors guarantee that the research findings have not been previously published.
Availability of data and materials
All data and materials are publicly available at https://doi.org/10.23729/0f4a3296-3b10-40c8-9ad3-0cf00a5a4a53
The code for the described experiments is available at https://github.com/kwadraterry/Norppa
T. Eerola and H. Kälviäinen were responsible for the supervision of the research, designing methodology, and project administration; E. Nepovinnykh and I. Chelak implemented the algorithm. E. Nepovinnykh, I. Chelak, T. Eerola, and H. Kälviäinen prepared the original draft of the manuscript. All the authors gave the final approval for publication.