Structured Context Enhancement Network for Mouse Pose Estimation

12/01/2020 · Feixiang Zhou et al. · University of Leicester

Automated analysis of mouse behaviours is crucial for many applications in neuroscience. However, quantifying mouse behaviours from videos or images remains a challenging problem, in which pose estimation plays an important role in describing mouse behaviours. Although deep learning based methods have made promising advances in mouse or other animal pose estimation, they cannot properly handle complicated scenarios (e.g., occlusions, invisible keypoints and abnormal poses). In particular, since the mouse body is highly deformable, it is challenging to accurately locate different keypoints on it. In this paper, we propose a novel hourglass network based model, namely the Graphical Model based Structured Context Enhancement Network (GM-SCENet), in which two effective modules, i.e., the Structured Context Mixer (SCM) and the Cascaded Multi-Level Supervision module (CMLS), are designed. The SCM adaptively learns and enhances the proposed structured context information of each mouse part through a novel graphical model that pays close attention to the differences between body parts. The CMLS module is then designed to jointly train the proposed SCM and the hourglass network by generating multi-level information, which increases the robustness of the whole network. Based on the multi-level predictions from the SCM and the CMLS module, we also propose an inference method to enhance the localization results. Finally, we evaluate our proposed approach against several baselines...


I Introduction

Mouse models have become an important tool to study neurodegenerative disorders such as Alzheimer's [1] and Parkinson's disease [2] in neuroscience. Modelling and measuring mouse behaviour is key to understanding brain function and disease mechanisms [3], due to the potential similarity and homology between humans and animals. Historically, studying mouse behaviour has been a time-consuming and difficult task because the collected data requires experts to analyse it manually. In contrast, recent advances in computer vision and machine learning have facilitated intelligent analysis of complex behaviours [4, 5, 6]. This makes a wide range of behavioural experiments possible, which has yielded new insights into the pathologies and treatment of specific diseases carried by mice [7, 8, 9].

In this paper, we are particularly interested in 2D mouse pose estimation, one of the fundamental problems in mouse behaviour analysis. Mouse pose estimation is defined as the problem of measuring mouse posture, which denotes the geometrical configuration of body parts in images or videos [10]. This task is significant because pose information can be used to enrich mouse behaviour representations by providing configuration features [11]. Mouse pose estimation is challenging due to the small size of the subject, occlusions, invisible keypoints, abnormal poses and the deformation of the mouse body. Early approaches focused on labelling the locations of interest with unique and recognizable markers to simplify pose analysis [12, 13, 14]. However, these systems are invasive to the targeted subjects. To overcome this problem, several advanced markerless methods have been proposed for mouse modelling. A common technique in these methods is to fit a specific skeleton or active model, such as a bounding box [15], ellipse [16] or contour [17]. These methods have produced promising results, but their flexibility is limited as they require sophisticated skeleton models.

With the development of Deep Convolutional Neural Networks (DCNNs), significant progress has been achieved in human and animal pose estimation [18, 19, 20, 21]. Most existing systems (e.g., DeepLabCut [22], LEAP [23] and DeepPoseKit [21]) are built on deep networks originally designed for human pose estimation. For example, DeepLabCut combines the feature detectors of DeeperCut [24] with readout layers, namely deconvolutional layers, for markerless pose estimation, and exploits transfer learning to train the deep model with fewer examples. Similarly, LEAP has also been developed on top of an existing network, i.e., SegNet [25], and provides a graphical interface for labelling body parts. However, its preprocessing is computationally expensive, limiting the application of the system in other environments. DeepPoseKit uses two hourglass modules at different scales to learn multi-scale geometry from keypoints, based on the standard hourglass network [26].

Although the system performance on different animal datasets reported in these works is positive, the datasets used in these previous studies are less challenging because they exhibit small postural variations against clean backgrounds. These systems mainly concentrate on building working pipelines for animal pose estimation by transplanting human pose estimation networks. In addition, the common pipeline of most existing deep models is to generate heatmaps (also known as score maps) of all the keypoints simultaneously at the last stage of the network [21, 22, 23, 26, 27, 28]. However, these systems have not fully considered the differences between body parts, which are significantly important for mouse pose estimation. We believe this omission stems from the fact that spatial correlations between mouse body parts are much weaker (due to highly deformable shapes) than those between human body parts.

According to the above analysis, we here propose a Graphical Model based Structured Context Enhancement Network (GM-SCENet) for mouse pose estimation, as shown in Fig. 1. We use the standard hourglass network [26] as the basic structure due to its strong multi-scale feature representations as shown in Fig. 1(a). In our approach, the proposed Structured Context Mixer (SCM) shown in Fig. 1(b) first captures global spatial configurations of the full mouse body to represent the global contextual information by concatenating multi-level feature maps from the hourglass network. Using this global contextual information, our proposed SCM explores the contextual features from four directions of each keypoint to represent the keypoint-specific contextual information, focusing on the detailed feature descriptions of the specific mouse part. Afterwards, we model the relationship between the keypoint-specific and the global contextual information by a novel graphical model for information fusion. We define these two types of contextual information as structured context information hereafter.

Fig. 1: Overview of the proposed framework for mouse pose estimation. (a) The structure of the hourglass network [26]. We stack 3 hourglass modules to capture multi-level pose features. (b) The structure of the Structured Context Mixer (SCM). It first generates global contextual information by fusing multi-level features from (a). To describe the differences between body parts, we then design 4 independent subnetworks to predict the heatmap of the corresponding keypoint, where each subnetwork explores global and keypoint-specific contextual information to represent the structured context information of each keypoint (see Section III-A1). The blue blocks refer to the 4 subnetworks, in each of which we propose a novel graphical model to adaptively fuse the global and keypoint-specific contextual information for structured context enhancement (see Section III-A2). (c) The structure of Cascaded Multi-level Supervision (CMLS). We exploit this module as intermediate supervision with multi-scale outputs for joint training of the SCM, the CMLS module and the hourglass network (see Section III-B). We attach this module to the end of each hourglass module and the proposed SCM. In particular, the last CMLS module preserves the spatial correlation between different parts by combining the heatmaps generated in the SCM and the multi-level features from the hourglass network. This helps us refine the results generated in the SCM, which focuses on describing the differences between the body parts. In this way, our proposed GM-SCENet is able to model the differences and spatial correlations between different parts simultaneously. Moreover, during the inference phase, the prediction results (locations of keypoints) generated in all the CMLS modules and in the SCM are used as input to our proposed Multi-level Keypoint Aggregation (MLKA) inference method to generate the final localization results (see Section III-C). All these components are trained in an end-to-end fashion. The loss in (c) covers all the keypoints, while the four losses in (b) correspond to keypoints 1-4, respectively.

In addition, we introduce a novel Cascaded Multi-level Supervision module (CMLS) [27, 26] to jointly train the proposed SCM and the hourglass network, as shown in Fig. 1(c). This allows our network to generate multi-level information, which is helpful for addressing the problems of scale variations and small subjects by strengthening contextual feature learning. In fact, the prediction results of the current scale/level, which can be seen as prior knowledge, are used for semantic description and supervision of high-scale feature maps. In the meantime, the extracted features and the predicted heatmaps within each CMLS module can be further refined using a cascaded structure. Finally, to integrate multi-level localization results, we also design an inference algorithm called Multi-level Keypoint Aggregation (MLKA) to produce the final prediction results, where the locations of the selected multi-level keypoints from all the supervision layers are aggregated.

Fig. 1 shows our proposed framework for mouse pose estimation. The main contributions of our work can be summarized as follows:

  • We propose a novel Graphical Model based Structured Context Enhancement Network (GM-SCENet) which consists of Structured Context Mixer (SCM), Cascaded Multi-level Supervision module (CMLS) and the backbone hourglass network. The GM-SCENet has the ability of modelling the difference and spatial correlation between different mouse parts simultaneously.

  • The proposed SCM can adaptively learn and enhance the structured context information of each keypoint. This is achieved by exploring the global and keypoint-specific contextual information to describe the difference between body parts, and designing a novel graphical model to explicitly model the relationship between them for information fusion.

  • We also develop a novel CMLS module to jointly train our SCM and the hourglass network, which adopts multi-scale features for supervision learning and increases the robustness of the whole network. In addition, we design a Multi-Level Keypoint Aggregation (MLKA) algorithm to integrate multi-level localization results.

  • We introduce a new challenging mouse dataset for pose estimation and behaviour recognition, namely the Parkinson's Disease Mouse Behaviour (PDMB) dataset. Several common behaviours and the locations of four body parts (i.e., snout, left ear, right ear and tail base) are included in this dataset. To the best of our knowledge, this is the first large public dataset for mouse pose estimation collected from standard CCTV cameras.

II Related Work

In this section, we review established approaches related to our proposed framework. Since most current methods for mouse or other animal pose estimation are based on existing human pose estimation networks, we first review existing methods for human pose estimation. Then, we discuss approaches used for mouse pose estimation.

II-A Human Pose Estimation

Many traditional methods for single-person pose estimation adopt graph structures, e.g., pictorial structures [29] or graphical models [30], to explicitly model the spatial relationships between body parts. However, the prediction results of keypoints rely heavily on hand-crafted features, which are susceptible to difficult conditions such as occlusion. Recently, significant improvements have been achieved by designing DCNNs to learn high-level representations [31, 32, 33, 28, 34, 35]. DeepPose [36] is the first attempt to use deep models for pose estimation, directly regressing the keypoint coordinates with an end-to-end network. However, making predictions in this way misses the spatial relationships within images. Therefore, recent methods mainly adopt fully convolutional neural networks (FCNNs) to regress 2D Gaussian heatmaps [27, 26], followed by further inference using Gaussian estimation. For instance, Newell et al. [26] first propose the hourglass network to gradually capture representations across scales by repeated bottom-up and top-down processing. They also adopt intermediate supervision to train the entire network, and learn spatial relationships with a larger receptive field. In [28], Sun et al. design the High-Resolution Net (HRNet), which pursues high-resolution representations by fusing multi-scale feature maps of parallel multi-resolution subnetworks. Despite their performance, it remains an open problem to locate keypoints accurately under abnormal postures, small objects and occluded body parts.

II-B Mouse Pose Estimation

In previous works, fitting active shape models was a popular approach to understanding mouse postures. For example, De Chaumont et al. [16] employ a set of geometrical primitives for mouse modelling, where mouse posture is represented by three parts, i.e., head, trunk and neck, with constrained spatial distances between them. Similarly, in [17], 12 B-spline deformable 2D templates are adopted to represent a mouse contour, which is mathematically described with 8 elliptical parameters. However, these methods require sophisticated skeleton models, which limits their ability to fit different datasets. Recently, most deep learning based methods for 2D mouse or other animal pose estimation have been built on human pose estimation models such as DeeperCut [24], SegNet [25] and the hourglass network [26]. In [21], Graving et al. present a multi-scale deep learning model called Stacked DenseNet to generate confidence maps of keypoints, based on the hourglass architecture. This network is able to capture multi-scale information through intermediate supervision. Nevertheless, the ability of these approaches to deal with problems such as occlusion and small targets against large backgrounds is still weak.

III Proposed Methods

III-A Structured Context Mixer

III-A1 Structured Context Information Representations

Extracting keypoints of mouse body parts is non-trivial due to occlusion, small size and the highly deformable body. In order to accurately locate different keypoints of the targets, most deep models take multi-scale or high-resolution information into account [21], [26], [28], whilst looking at contextual information. In general, contextual information refers to the regions surrounding the targets, and it has proved effective in pose estimation [26], [34], object detection [37, 38] and co-saliency detection [39]. However, these models ignore, to some extent, the differences between keypoints, which are significant for mouse pose estimation due to the relatively weak spatial correlations caused by the highly deformable mouse body. Unlike previous methods, we explore the structured context information of each body part, including global and keypoint-specific contextual information, to describe the differences between parts.

Before generating the structured context information of each keypoint, we aggregate the low- and high-level feature maps from multiple stacks of the hourglass network to represent the global contextual information of the whole subject, as shown in Fig. 1. In fact, the aggregated features also encode the local contextual information derived from low-level layers. The global contextual feature map $\mathbf{F}_g$ is defined as:

$$\mathbf{F}_g = \mathcal{C}(\mathbf{f}_1, \mathbf{f}_2, \dots, \mathbf{f}_S) \tag{1}$$

where $\mathcal{C}(\cdot)$ represents channel-wise concatenation, and $\mathbf{f}_1$, $\mathbf{f}_2$ and $\mathbf{f}_S$ are the outputs of the first, second and $S$-th stack of the hourglass network, respectively.
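In PyTorch terms, Eq. (1) is a channel-wise concatenation of the per-stack outputs; a minimal sketch follows (the function name and the equal-resolution assumption are ours, not the paper's):

```python
import torch

def aggregate_global_context(stack_outputs):
    """Eq. (1): channel-wise concatenation of hourglass stack outputs.

    stack_outputs: a list of (N, C, H, W) tensors, one per hourglass stack,
    assumed here to share the same spatial resolution.
    """
    return torch.cat(stack_outputs, dim=1)
```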

Afterwards, we extract keypoint-specific contextual information, where multiple discriminative regions are selected as the contextual information of each keypoint. This is beneficial to the accurate localization of keypoints. Inspired by the corner pooling scheme used in [40], we explore four context features to represent the keypoint-specific contextual information by the Directionmax operation, as shown in Fig. 2, which enables us to localize each keypoint using prior knowledge.

Fig. 2: Directionmax operation. (a) Topmax. (b) Leftmax. (c) Bottommax. (d) Rightmax. For each channel of feature maps, we take the maximum values in four directions.

In [40], to locate the top-left corner of a bounding box, one looks horizontally towards the right for the topmost boundary of a target and vertically towards the bottom for the leftmost boundary. Similarly, one looks horizontally towards the left for the bottommost boundary of a target and vertically towards the top for the rightmost boundary to locate the bottom-right corner. Different from their work, we focus on locating keypoints. The problem of locating the top-left and bottom-right corners can be considered as that of locating a single point when these two corners spatially overlap. Therefore, following the corner pooling used in [40], we adopt the Topmax, Leftmax, Bottommax and Rightmax operations illustrated in Fig. 2 to generate four regions as keypoint-specific contextual information. For a keypoint, Topmax means that we look vertically towards the bottom to explicitly investigate the area below the keypoint, while Leftmax suggests that we look horizontally towards the right to examine the area to the right of the keypoint. Similarly, Bottommax and Rightmax look vertically towards the top and horizontally towards the left, respectively. Given input feature maps $\mathbf{f}$ of height $H$, the computation of Topmax shown in Fig. 2 can be expressed as:

$$\mathbf{t}_{ij} = \begin{cases} \max\big(\mathbf{f}_{ij}, \mathbf{t}_{(i+1)j}\big) & \text{if } i < H \\ \mathbf{f}_{Hj} & \text{otherwise} \end{cases} \tag{2}$$

where $\mathbf{f}$ is the feature maps input to the Topmax operation, $\mathbf{t}$ is the output feature maps, i.e., the keypoint-specific contextual information, and $\mathbf{f}_{ij}$ and $\mathbf{t}_{ij}$ are the feature vectors at location $(i, j)$.
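Since each Directionmax variant is a cumulative maximum along one spatial axis, it can be written compactly with torch.cummax. The sketch below follows that reading; the function name and interface are our assumptions:

```python
import torch

def directionmax(f: torch.Tensor, direction: str) -> torch.Tensor:
    """Directionmax over (N, C, H, W) feature maps, as in Fig. 2.

    'top'    : max over rows at or below each location (Eq. (2), scanned bottom-up)
    'bottom' : max over rows at or above each location
    'left'   : max over columns at or to the right of each location
    'right'  : max over columns at or to the left of each location
    """
    if direction == 'top':
        out, _ = torch.cummax(torch.flip(f, [2]), dim=2)
        return torch.flip(out, [2])
    if direction == 'bottom':
        out, _ = torch.cummax(f, dim=2)
        return out
    if direction == 'left':
        out, _ = torch.cummax(torch.flip(f, [3]), dim=3)
        return torch.flip(out, [3])
    if direction == 'right':
        out, _ = torch.cummax(f, dim=3)
        return out
    raise ValueError(f'unknown direction: {direction}')
```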

III-A2 Structured Context Information Modelling with a Graphical Model

Although the global contextual information defined in Eq. (1) can be used to directly infer the locations of keypoints, it pays less attention to the local characteristics of each keypoint, i.e., the keypoint-specific contextual information that describes the differences between body parts. Thus, in this subsection, we aim at modelling the dependency between the global and keypoint-specific contextual information so as to fuse the two types of contextual information. Fig. 3 shows the structure of the graphical model based Structured Context Mixer. In Fig. 3(a), $\mathbf{F}_g$ is the fusion of the outputs from multiple hourglass modules, which preserves the multi-level feature representations. On top of these fused features, we use five 3×3 Conv-BN-ReLU layers to further refine the generated feature maps. One of the resulting maps is considered as the global contextual features, whilst the other four maps are fed into Topmax, Leftmax, Bottommax and Rightmax shown in Fig. 2, respectively, to generate the keypoint-specific contextual features. Fig. 3(b) shows the proposed graphical model, which effectively fuses the two types of contextual features for each keypoint to enhance important features. In particular, we design two attention gates ($G_1$ and $G_2$), which interact with each other to control the flow of information between the related feature maps, i.e., the global and keypoint-specific contextual feature maps.

      
Fig. 3: Structured Context Mixer (SCM). (a) The whole structure of SCM. (b) The detailed architecture of our proposed graphical model. The arrows indicate the dependency between the variables used in our message passing algorithm, as shown in Algorithm 1. The solid arrows denote the updates from the keypoint-specific and global contextual feature maps, while the dashed ones represent the updates from the two attention gates.

We formulate the task of mouse pose estimation in images as the problem of learning a non-linear mapping $\mathcal{M}: \mathcal{I} \rightarrow \mathcal{K}$, where $\mathcal{I}$ and $\mathcal{K}$ refer to the image and keypoint spaces, respectively. More formally, let $\{(\mathbf{I}^{n}, \mathbf{K}^{n})\}_{n=1}^{N}$ be the training set of mouse images, where $\mathbf{I}^{n}$ denotes an input image and $\mathbf{K}^{n}$ represents the corresponding ground-truth labels. $\mathbf{K}^{n}$ can be represented as $\mathbf{K}^{n} = \{\mathbf{k}^{n}_{1}, \dots, \mathbf{k}^{n}_{J}\}$, where $J$ is the number of keypoints defined in the image space $\mathcal{I}$.

To learn $\mathcal{M}$, we propose a framework composed of three main parts: the Structured Context Mixer, the Cascaded Multi-Level Supervision module and the backbone network. The proposed SCM aims to model the dependency between keypoint-specific and global contextual information, focusing on both the spatial correlations and the differences between keypoints. In this paper, we denote by $\mathbf{F} = \{\mathbf{F}_g, \mathbf{F}_s\}$ the structured context feature maps of each keypoint, where $\mathbf{F}_g$ represents the global contextual information defined in Eq. (1) and $\mathbf{F}_s$ the keypoint-specific contextual information. Here, each contextual feature map can be represented as a set of feature vectors $\{\mathbf{f}_i\}_{i=1}^{P}$, $\mathbf{f}_i \in \mathbb{R}^{C}$, where $P$ is the number of pixels in a contextual feature map and $C$ is the number of channels of the contextual feature maps. Unlike previous systems that model contextual information using direct concatenation [41], channel attention [42] or spatial attention [34], we propose to adaptively fuse the structured context information by learning a set of latent feature maps $\mathbf{H} = \{\mathbf{H}_g, \mathbf{H}_s\}$ with a novel graphical model. In particular, we design two interactive attention gates $G_1$ and $G_2$ to control the flow of information between the keypoint-specific and global contextual information. With this two-gate mechanism, our network can adaptively learn and enhance the relatively important keypoint-specific contextual features while preventing the potential loss of useful information. In the SCM, the proposed graphical model jointly infers the hidden contextual features $\mathbf{H}$ and the attention gates $G_1$ and $G_2$. Given the observed structured context feature maps $\mathbf{F}$, we aim to infer the hidden structured context feature representation $\mathbf{H}$ and the attention variables $G_1$ and $G_2$. We formalize the problem within a conditional random field framework and define a graphical model by constructing the following conditional distribution:

$$P(\mathbf{H}, G_1, G_2 \mid \mathbf{F}, \Theta) = \frac{1}{Z(\mathbf{F}, \Theta)} \exp\{E(\mathbf{H}, G_1, G_2, \mathbf{F})\} \tag{3}$$

where $Z(\mathbf{F}, \Theta)$ is the partition function for normalization, $\Theta$ is the set of parameters and $\mathbf{H} = \{\mathbf{H}_g, \mathbf{H}_s\}$. The energy function is defined as:

$$E(\mathbf{H}, G_1, G_2, \mathbf{F}) = \Phi(\mathbf{H}, \mathbf{F}) + \Gamma_1(\mathbf{H}_s, \mathbf{H}_g, G_1) + \Gamma_2(\mathbf{H}_s, \mathbf{H}_g, G_2) + \Psi(G_1, G_2) \tag{4}$$

The first term in Eq. (4) denotes the unary potentials relating the observed variables $\mathbf{F}$ to the corresponding latent variables $\mathbf{H}$. It can be defined as:

$$\Phi(\mathbf{H}, \mathbf{F}) = \sum_{m \in \{g, s\}} \sum_{i=1}^{P} \phi(\mathbf{h}_{m,i}, \mathbf{f}_{m,i}) \tag{5}$$

Following the previous work [43], we intend to drive the latent features to be close to the observed features. Thus, we adopt a multidimensional Gaussian form to represent $\phi$ as follows:

$$\phi(\mathbf{h}_{m,i}, \mathbf{f}_{m,i}) = -\frac{1}{2}\, d(\mathbf{h}_{m,i}, \mathbf{f}_{m,i})^{\top}\, \Sigma^{-1}\, d(\mathbf{h}_{m,i}, \mathbf{f}_{m,i}) \tag{6}$$

where $d(\mathbf{h}_{m,i}, \mathbf{f}_{m,i})$ is the element-wise (Manhattan) distance between $\mathbf{h}_{m,i}$ and $\mathbf{f}_{m,i}$, and $\Sigma$ is the covariance matrix. Here, we assume $\Sigma$ is the identity matrix. The second and third terms in Eq. (4) are two branches that model the relationship between the latent keypoint-specific and the latent global contextual feature maps. For each branch, we design a gate to regulate the flow of information between the two types of contextual features. Inspired by [43, 34], we define two bilinear potentials, i.e., $\Gamma_1$ and $\Gamma_2$, to represent the dependency between the latent keypoint-specific and the latent global contextual feature maps. $\Gamma_1$ and $\Gamma_2$ can be defined as:

$$\Gamma_1(\mathbf{H}_s, \mathbf{H}_g, G_1) = \sum_{i=1}^{P} (\mathbf{g}_{1,i} \odot \mathbf{h}_{s,i})^{\top}\, \mathbf{K}_1\, \mathbf{h}_{g,i} \tag{7}$$
$$\Gamma_2(\mathbf{H}_s, \mathbf{H}_g, G_2) = \sum_{i=1}^{P} (\mathbf{g}_{2,i} \odot \mathbf{h}_{s,i})^{\top}\, \mathbf{K}_2\, \mathbf{h}_{g,i} \tag{8}$$

where $\mathbf{K}_1, \mathbf{K}_2 \in \mathbb{R}^{C_s \times C_g}$, $\odot$ denotes the element-wise product, and $C_s$ and $C_g$ denote the number of channels of the keypoint-specific and global contextual feature maps, respectively.

Beyond the two gated branches shown in Eqs. (7) and (8), we also consider the relationship between the two gates. Different from previous works modelling the spatial relationship between pixels within a feature map [34, 44], we focus on the spatial dependency between feature maps. In other words, the two gates are spatially interactive, which can guide our proposed GM-SCENet to preserve more useful keypoint-specific contextual information. We define:

$$\Psi(G_1, G_2) = \sum_{i=1}^{P} \mathbf{g}_{1,i}^{\top}\, \mathbf{K}_3\, \mathbf{g}_{2,i} \tag{9}$$

where $\mathbf{K}_3 \in \mathbb{R}^{C_1 \times C_2}$, and $C_1$ and $C_2$ are the number of channels of the gates $G_1$ and $G_2$, respectively.

To infer the distribution in Eq. (3), we adopt the classical mean-field approximation [45], which is effective and efficient for high-dimensional data. $P(\mathbf{H}, G_1, G_2 \mid \mathbf{F}, \Theta)$ can be approximated by a new distribution $Q$, represented as a product of independent distributions as follows:

$$Q(\mathbf{H}_s, \mathbf{H}_g, G_1, G_2) = Q_s(\mathbf{H}_s)\, Q_g(\mathbf{H}_g)\, Q_1(G_1)\, Q_2(G_2) \tag{10}$$

where $Q_s$, $Q_g$, $Q_1$ and $Q_2$ are independent marginal distributions. Afterwards, we minimise the difference between the two distributions using the Kullback-Leibler (KL) divergence, which is formulated as follows:

$$\mathrm{KL}(Q \,\|\, P) = \int Q(\mathbf{x}) \log \frac{Q(\mathbf{x})}{P(\mathbf{x} \mid \mathbf{F}, \Theta)}\, d\mathbf{x}, \qquad \mathbf{x} = (\mathbf{H}_s, \mathbf{H}_g, G_1, G_2) \tag{11}$$

Our target is to minimize this divergence. We therefore expand Eq. (11) into the following form:

$$\mathrm{KL}(Q \,\|\, P) = \mathbb{E}_{Q}[\log Q(\mathbf{x})] - \mathbb{E}_{Q}[E(\mathbf{x}, \mathbf{F})] + \log Z(\mathbf{F}, \Theta) \tag{12}$$
$$\mathcal{F}(Q) = \mathbb{E}_{Q}[\log Q(\mathbf{x})] - \mathbb{E}_{Q}[E(\mathbf{x}, \mathbf{F})] \tag{13}$$

where $\mathbb{E}_{Q}[\cdot]$ represents the expectation under the distribution $Q$, and $\log Z(\mathbf{F}, \Theta)$ is a constant. Then, we minimize $\mathcal{F}(Q)$, which can be regarded as a constrained optimization problem. Given the normalization condition $\int Q_j(\mathbf{x}_j)\, d\mathbf{x}_j = 1$ for each marginal, we rewrite Eq. (12) as:

$$\mathcal{L}(Q) = \mathcal{F}(Q) + \sum_{j} \lambda_j \Big( \int Q_j(\mathbf{x}_j)\, d\mathbf{x}_j - 1 \Big) \tag{14}$$

To seek the minimum, we take the derivative of $\mathcal{L}(Q)$ with respect to each marginal $Q_j$ and set the derivative to zero, as shown in Eq. (15). Starting from this minimization, we can derive the final mean-field update shown in Eq. (16) by combining Eqs. (3)-(10).

$$\frac{\partial \mathcal{L}(Q)}{\partial Q_j(\mathbf{x}_j)} = 0 \tag{15}$$
$$Q_s(\mathbf{H}_s) \propto \exp\big\{ \mathbb{E}_{Q_g Q_1 Q_2}[E(\mathbf{x}, \mathbf{F})] \big\} \tag{16}$$

Eqs. (12)-(16) show the process of the mean-field approximation, in which one latent variable is updated while the remaining latent variables are kept fixed. In the same way, $Q_g$, $Q_1$ and $Q_2$ can be derived as:

$$Q_g(\mathbf{H}_g) \propto \exp\big\{ \mathbb{E}_{Q_s Q_1 Q_2}[E(\mathbf{x}, \mathbf{F})] \big\} \tag{17}$$
$$Q_1(G_1) \propto \exp\big\{ \mathbb{E}_{Q_s Q_g Q_2}[E(\mathbf{x}, \mathbf{F})] \big\} \tag{18}$$
$$Q_2(G_2) \propto \exp\big\{ \mathbb{E}_{Q_s Q_g Q_1}[E(\mathbf{x}, \mathbf{F})] \big\} \tag{19}$$

Eqs. (16) and (17) show the posterior distributions of $\mathbf{H}_s$ and $\mathbf{H}_g$, which follow Gaussian distributions. We denote the expectations of $Q_s$, $Q_g$, $Q_1$ and $Q_2$ by $\bar{\mathbf{H}}_s$, $\bar{\mathbf{H}}_g$, $\bar{G}_1$ and $\bar{G}_2$, respectively. By combining Eqs. (6), (7) and (8), the following mean-field updates can be derived for the latent structured context representations:

$$\bar{\mathbf{h}}_{s,i} = \mathbf{f}_{s,i} + \bar{\mathbf{g}}_{1,i} \odot (\mathbf{K}_1 \bar{\mathbf{h}}_{g,i}) + \bar{\mathbf{g}}_{2,i} \odot (\mathbf{K}_2 \bar{\mathbf{h}}_{g,i}) \tag{20}$$
$$\bar{\mathbf{h}}_{g,i} = \mathbf{f}_{g,i} + \mathbf{K}_1^{\top} (\bar{\mathbf{g}}_{1,i} \odot \bar{\mathbf{h}}_{s,i}) + \mathbf{K}_2^{\top} (\bar{\mathbf{g}}_{2,i} \odot \bar{\mathbf{h}}_{s,i}) \tag{21}$$

In Section III-A2, we defined $G_1$ and $G_2$ as binary variables. Therefore, the expectations of their distributions equal the probabilities $Q_1(G_1 = 1)$ and $Q_2(G_2 = 1)$, respectively. The expectations $\bar{G}_1$ and $\bar{G}_2$ can be derived from the distributions shown in Eqs. (18) and (19) and the potential functions defined in Eqs. (7), (8) and (9):

$$\bar{\mathbf{g}}_{1,i} = \sigma\big( \bar{\mathbf{h}}_{s,i} \odot (\mathbf{K}_1 \bar{\mathbf{h}}_{g,i}) + \mathbf{K}_3\, \bar{\mathbf{g}}_{2,i} \big) \tag{22}$$
$$\bar{\mathbf{g}}_{2,i} = \sigma\big( \bar{\mathbf{h}}_{s,i} \odot (\mathbf{K}_2 \bar{\mathbf{h}}_{g,i}) + \mathbf{K}_3^{\top}\, \bar{\mathbf{g}}_{1,i} \big) \tag{23}$$

where $\sigma(\cdot)$ is the sigmoid function. According to Eqs. (22) and (23), the update of each gate depends on the expected values of the hidden features and of the other gate. In particular, the spatially interactive gates also enable us to refine the distributions of the hidden features, i.e., the keypoint-specific and global contextual features shown in Eqs. (20) and (21), respectively.

0:  Input: global contextual feature maps $\bar{\mathbf{H}}_g^{0}$ initialized with the corresponding feature observation $\mathbf{F}_g$, keypoint-specific contextual feature maps $\bar{\mathbf{H}}_s^{0}$ initialized with $\mathbf{F}_s$, gates $\bar{G}_1^{0}$ and $\bar{G}_2^{0}$ initialized to zero, and the number of iterations $T$.
0:  Output: enhanced structured context feature maps, i.e., the global contextual feature maps $\bar{\mathbf{H}}_g^{T}$ updated in the last iteration, as shown in Fig. 3(b) and Eq. (21).
1:  for $t = 1$ to $T$ do
2:     $\mathbf{M}_1 \leftarrow \bar{\mathbf{H}}_s^{t-1} \odot (\mathbf{K}_1 * \bar{\mathbf{H}}_g^{t-1})$, $\mathbf{M}_2 \leftarrow \bar{\mathbf{H}}_s^{t-1} \odot (\mathbf{K}_2 * \bar{\mathbf{H}}_g^{t-1})$;
3:     $\mathbf{N}_1 \leftarrow \mathbf{K}_3 * \bar{G}_2^{t-1}$, $\mathbf{N}_2 \leftarrow \mathbf{K}_3 * \bar{G}_1^{t-1}$;
4:     $\bar{G}_1^{t} \leftarrow \sigma(\mathbf{M}_1 \oplus \mathbf{N}_1)$;
5:     $\bar{G}_2^{t} \leftarrow \sigma(\mathbf{M}_2 \oplus \mathbf{N}_2)$;
6:     $\mathbf{U}_1 \leftarrow \bar{G}_1^{t} \odot (\mathbf{K}_1 * \bar{\mathbf{H}}_s^{t-1})$;
7:     $\mathbf{U}_2 \leftarrow \bar{G}_2^{t} \odot (\mathbf{K}_2 * \bar{\mathbf{H}}_s^{t-1})$;
8:     $\bar{\mathbf{H}}_g^{t} \leftarrow \mathbf{F}_g \oplus \mathbf{U}_1 \oplus \mathbf{U}_2$;
9:     $\mathbf{V}_1 \leftarrow \bar{G}_1^{t} \odot (\mathbf{K}_1 * \bar{\mathbf{H}}_g^{t})$;
10:     $\mathbf{V}_2 \leftarrow \bar{G}_2^{t} \odot (\mathbf{K}_2 * \bar{\mathbf{H}}_g^{t})$;
11:     $\bar{\mathbf{H}}_s^{t} \leftarrow \mathbf{F}_s \oplus \mathbf{V}_1 \oplus \mathbf{V}_2$;
12:  end for
13:  return  Enhanced structured context feature maps $\bar{\mathbf{H}}_g^{T}$.
Algorithm 1 Algorithm for mean-field updates in our proposed Structured Context Mixer with CNN.

III-A3 End-to-End Optimization

Following [43], [46], we convert the mean-field inference equations shown in Eqs. (20)-(23) into convolutional operations within a deep neural network, so that our Structured Context Mixer can be trained jointly with the remaining networks. Our goal is to achieve end-to-end optimization of the proposed GM-SCENet.

The mean-field updates of the two gates shown in Eqs. (22) and (23) can be implemented with deep neural networks in several steps as follows: (a) message passing between the global contextual feature maps and the keypoint-specific contextual feature maps is performed by $\mathbf{M}_1 = \bar{\mathbf{H}}_s \odot (\mathbf{K}_1 * \bar{\mathbf{H}}_g)$ and $\mathbf{M}_2 = \bar{\mathbf{H}}_s \odot (\mathbf{K}_2 * \bar{\mathbf{H}}_g)$, where $\mathbf{K}_1$ and $\mathbf{K}_2$ represent convolutional kernels and the symbols $\odot$ and $*$ denote the element-wise product and the convolutional operation, respectively; (b) message passing between the two gates is performed by $\mathbf{N}_1 = \mathbf{K}_3 * \bar{G}_2$ and $\mathbf{N}_2 = \mathbf{K}_3 * \bar{G}_1$, where $\mathbf{K}_3$ is a convolution kernel; (c) the normalization of the two gates is performed by $\bar{G}_1 = \sigma(\mathbf{M}_1 \oplus \mathbf{N}_1)$ and $\bar{G}_2 = \sigma(\mathbf{M}_2 \oplus \mathbf{N}_2)$, where $\oplus$ represents the element-wise addition operation.

Afterwards, we conduct the mean-field update of the global contextual feature maps shown in Eq. (21) once the two gates have been updated. The steps are as follows: (a) message passing from the keypoint-specific contextual feature maps to the global contextual feature maps under the control of gate $\bar{G}_1$ is achieved by $\mathbf{U}_1 = \bar{G}_1 \odot (\mathbf{K}_1 * \bar{\mathbf{H}}_s)$; (b) message passing from the keypoint-specific contextual feature maps to the global contextual feature maps under the control of gate $\bar{G}_2$ is performed by $\mathbf{U}_2 = \bar{G}_2 \odot (\mathbf{K}_2 * \bar{\mathbf{H}}_s)$; (c) the final update incorporates the unary term, i.e., the observed global contextual feature maps, and is performed by $\bar{\mathbf{H}}_g = \mathbf{F}_g \oplus \mathbf{U}_1 \oplus \mathbf{U}_2$.

Similarly, following Eq. (20), the latent feature maps corresponding to the keypoint-specific contextual information can be updated. The mean-field updates in our proposed SCM are summarized in Algorithm 1.
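Read this way, one mean-field iteration reduces to a handful of convolutions, element-wise products and sigmoids. The following PyTorch sketch illustrates the update order under the notation above; the module name, kernel sizes, zero gate initialization, the shared kernels for both message directions and the equal channel counts are all our simplifying assumptions, not the paper's specification:

```python
import torch
import torch.nn as nn

class MeanFieldSCM(nn.Module):
    """Gated mean-field updates of the SCM as convolutions (Eqs. (20)-(23))."""

    def __init__(self, channels: int, iterations: int = 2):
        super().__init__()
        self.iterations = iterations
        self.k1 = nn.Conv2d(channels, channels, 3, padding=1)  # K_1
        self.k2 = nn.Conv2d(channels, channels, 3, padding=1)  # K_2
        self.k3 = nn.Conv2d(channels, channels, 3, padding=1)  # K_3 (gate-gate)

    def forward(self, f_s: torch.Tensor, f_g: torch.Tensor) -> torch.Tensor:
        # Latents start at their observations; gates start closed (an assumption).
        h_s, h_g = f_s, f_g
        g1 = torch.zeros_like(f_s)
        g2 = torch.zeros_like(f_s)
        for _ in range(self.iterations):
            # Gate updates, Eqs. (22)-(23): feature message + gate message, then sigmoid.
            g1 = torch.sigmoid(h_s * self.k1(h_g) + self.k3(g2))
            g2 = torch.sigmoid(h_s * self.k2(h_g) + self.k3(g1))
            # Eq. (21): global latents = unary term + gated messages from h_s.
            h_g = f_g + g1 * self.k1(h_s) + g2 * self.k2(h_s)
            # Eq. (20): keypoint-specific latents, symmetrically.
            h_s = f_s + g1 * self.k1(h_g) + g2 * self.k2(h_g)
        return h_g  # enhanced structured context features
```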

III-B Cascaded Multi-level Supervision

Since mice are small and highly deformable, single-scale supervision may not work well [26], [22]. In this section, we design a novel Cascaded Multi-level Supervision module (CMLS) as intermediate supervision to jointly train the proposed Structured Context Mixer and the hourglass network, where multi-level features, i.e., multi-stage features at multiple scales, are used for supervision, as shown in Figs. 1 and 4. Different from previous research [35], our supervision module performs module-wise supervision rather than layer-wise supervision. In other words, we adopt this module to connect different hourglass modules and to further refine the prediction results from the SCM, as shown in Fig. 1. In particular, we combine the supervision information with historical information by a concatenation operation in the first stage, before adopting a deconvolution layer to extract high-resolution information in the second stage. We use concatenation here to preserve feature maps of higher dimensions. This allows the deconvolution operation to capture more high-resolution information, which provides more detailed features and thus benefits the large-scale supervision in the second stage. Similar to [26], we also use an identity mapping to add the input information to the output of the second stage. Furthermore, in the CMLS module, we concatenate two identical Multi-level Supervision (MLS) modules, taking the output of the first MLS module as the input of the second to build a cascaded structure. In this way, small-scale supervision information serves as prior knowledge for large-scale supervision, and the large-scale supervision information is in turn treated as prior knowledge for the following MLS module. In addition, the generated multi-level supervision information can drive our SCM to strengthen contextual feature learning. Therefore, such a coarse-to-fine process helps our GM-SCENet generate multi-level feature maps for precisely localizing the keypoints of the mouse.

Fig. 4: Architecture of the Cascaded Multi-level Supervision module (CMLS).

The mathematical formulations of the first and second stages in our proposed CMLS are as follows:

$$\mathbf{y}_{1}^{m} = \mathcal{R}_1\big(\mathcal{C}(\mathbf{x}_{1}^{m})\big) \oplus \mathcal{I}(\mathbf{x}_{1}^{m}) \tag{24}$$
$$\mathbf{y}_{2}^{m} = \mathcal{R}_2\big(\mathcal{D}(\mathbf{x}_{2}^{m})\big) \oplus \mathcal{I}(\mathbf{x}_{2}^{m}) \tag{25}$$

where $\mathcal{C}$ denotes the channel-wise concatenation, $\mathbf{x}_{1}^{m}$ and $\mathbf{y}_{1}^{m}$ are the input and output of the first stage of the $m$-th MLS module, $\mathbf{x}_{2}^{m}$ and $\mathbf{y}_{2}^{m}$ are the input and output of the second stage of the $m$-th MLS module, and $\mathbf{x}_{2}^{m} = \mathbf{y}_{1}^{m}$. $\mathcal{I}$ represents the identity mapping. $\mathcal{R}_1$ is a stack of the Residual Module and Conv-BN-ReLU, followed by two convolutions. $\mathcal{D}$ is a stack of deconvolution and a residual block followed by Conv-BN-ReLU and a downsampling operation, and $\mathcal{R}_2$ is a combination of convolution, downsampling operation and convolution.
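To make the two-stage structure concrete, here is a hedged PyTorch sketch of one MLS supervision stage: it fuses incoming features with the previous-stage heatmaps (prior knowledge), optionally upsamples by deconvolution for the larger-scale second stage, and predicts heatmaps. The class name, channel sizes and layer choices are illustrative assumptions rather than the exact architecture:

```python
import torch
import torch.nn as nn

class MLSStage(nn.Module):
    """One supervision stage of the (C)MLS module: a sketch, not the exact design."""

    def __init__(self, in_ch: int, num_keypoints: int = 4, upsample: bool = False):
        super().__init__()
        if upsample:
            # Second stage: deconvolution for high-resolution supervision.
            fuse = nn.ConvTranspose2d(in_ch + num_keypoints, in_ch, 4, stride=2, padding=1)
        else:
            # First stage: fuse features and heatmaps at the same scale.
            fuse = nn.Conv2d(in_ch + num_keypoints, in_ch, 3, padding=1)
        self.body = nn.Sequential(fuse, nn.BatchNorm2d(in_ch), nn.ReLU(inplace=True))
        self.head = nn.Conv2d(in_ch, num_keypoints, 1)  # heatmap prediction

    def forward(self, feats: torch.Tensor, heatmaps: torch.Tensor):
        # Concatenate supervision (heatmaps) with historical features, cf. Eqs. (24)-(25).
        x = self.body(torch.cat([feats, heatmaps], dim=1))
        return x, self.head(x)  # refined features and supervised heatmaps
```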

We use a Mean-Squared Error (MSE) based loss function [21], [34] to train the entire network. To represent the ground-truth keypoint labels, we generate a heatmap $\mathbf{S}_j$ for each single keypoint $j$ by a 2D Gaussian distribution centred at the mouse part position $\mathbf{k}_j$. We apply the MSE loss function to the proposed SCM and CMLS module (i.e., the losses shown in Fig. 1), which can be denoted as:

$$\mathcal{L} = \frac{1}{J} \sum_{j=1}^{J} \sum_{i} \big\| \hat{\mathbf{S}}_j(i) - \mathbf{S}_j(i) \big\|^{2} \tag{26}$$

where $i$ denotes the $i$-th location and $\hat{\mathbf{S}}_j$ denotes the predicted heatmap for keypoint $j$. In our paper, we add up all the supervision losses from the SCM and the CMLS modules. During the inference phase, we obtain the location of keypoint $j$ from the predicted heatmap by choosing the position with the maximum score, i.e., $\hat{\mathbf{k}}_j = \arg\max_{i} \hat{\mathbf{S}}_j(i)$. Then, taking all the predicted heatmaps from the inner layers of the GM-SCENet into account and combining them with the Multi-Level Keypoint Aggregation in Section III-C, we obtain the final prediction results.
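The target construction, loss and decoding are standard heatmap-regression machinery; a compact sketch follows (sigma and all function names are our assumptions):

```python
import torch

def gaussian_heatmap(size, center, sigma: float = 2.0) -> torch.Tensor:
    """Ground-truth target: a 2D Gaussian centred at a keypoint (targets of Eq. (26)).

    size: (H, W); center: (x, y) in pixels; sigma is an assumed spread.
    """
    h, w = size
    ys = torch.arange(h, dtype=torch.float32).view(-1, 1)
    xs = torch.arange(w, dtype=torch.float32).view(1, -1)
    cx, cy = center
    return torch.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))

def heatmap_mse(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Eq. (26): MSE between predicted and ground-truth heatmaps."""
    return ((pred - target) ** 2).mean()

def decode(heatmaps: torch.Tensor) -> torch.Tensor:
    """Pick the argmax location of each (J, H, W) heatmap as the keypoint."""
    j, h, w = heatmaps.shape
    idx = heatmaps.view(j, -1).argmax(dim=1)
    return torch.stack([idx % w, idx // w], dim=1)  # (x, y) per keypoint
```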

III-C Multi-level Keypoint Aggregation for Inference

In general, the final prediction results of deep pose estimation models come from the last stage of the network. Here, we argue that the results generated in the inner layers may be more accurate when intermediate supervision, including our proposed Cascaded Multi-level Supervision, is applied. Similar to the popular Non-Maximum Suppression (NMS) post-processing step used in object detection, most top-down multi-person pose estimation methods adopt such a step to remove unreasonable poses based on multiple bounding boxes detected around a person instance [28], [47]. Inspired by this concept, we propose an inference algorithm for single-mouse pose estimation, i.e., Multi-level Keypoint Aggregation (MLKA), to integrate the prediction results generated in the SCM and CMLS modules. We adopt the CMLS module as intermediate supervision for training in our proposed framework, and consider all the predicted heatmaps at multiple scales as potential candidates, as shown in Fig. 4.

During the inference phase, we first obtain the positions of the keypoints from the generated heatmaps, and then adopt the MLKA to average the selected multi-level keypoints, i.e., the multi-scale keypoints from different CMLS modules. Specifically, we first resize all the keypoints at different scales to the scale of the input image. Then, we choose the keypoint candidates that are close to the keypoints from the last stage, using the Object Keypoint Similarity (OKS) metric to compare the similarities of keypoints of the same part. After selecting all the potential keypoints of different parts with a preset threshold, we aggregate the selected keypoints of each part by directly averaging them to generate a new set of keypoints and the skeleton. The OKS metric has been used for human pose estimation on different datasets [28]. We modify the OKS metric and use it to measure the similarity between the results generated in the last stage and those predicted in the inner layers. The modified OKS can be formulated as:

$$\mathrm{OKS} = \frac{\sum_{j} \exp\big( -d_j^{2} / 2 s^{2} \kappa_j^{2} \big)\, \delta(v_j > 0)}{\sum_{j} \delta(v_j > 0)} \tag{27}$$

where $d_j$ is the Euclidean distance between the detected $j$-th keypoint and the corresponding ground truth (here, the keypoint predicted at the last stage), $\delta(\cdot)$ is the Kronecker function, which equals 1 if the condition holds and 0 otherwise, $v_j$ is the visibility flag of the ground truth, $s$ is the object scale, and $\kappa_j$ is a per-keypoint constant that controls falloff.
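A hedged sketch of the MLKA step under this reading: decode keypoints at every supervision level, keep those whose OKS-style similarity to the last-stage prediction clears the threshold, and average the survivors. All names, shapes and the fallback behaviour are our assumptions:

```python
import torch

def mlka(candidates: torch.Tensor, final_kpts: torch.Tensor,
         kappa: torch.Tensor, scale: float, thr: float = 0.2) -> torch.Tensor:
    """Multi-level Keypoint Aggregation (Section III-C), a sketch.

    candidates: (L, J, 2) keypoints decoded from all supervision levels,
                already resized to the input-image scale.
    final_kpts: (J, 2) last-stage keypoints, used as the reference.
    kappa:      (J,) per-keypoint falloff constants; scale: object scale.
    """
    d2 = ((candidates - final_kpts) ** 2).sum(dim=2)          # (L, J) squared distances
    oks = torch.exp(-d2 / (2 * (scale ** 2) * (kappa ** 2)))  # Eq. (27)-style similarity
    keep = (oks > thr).float().unsqueeze(-1)                  # (L, J, 1) selection mask
    agg = (candidates * keep).sum(dim=0) / keep.sum(dim=0).clamp(min=1.0)
    # Fall back to the last-stage prediction when nothing is selected.
    empty = keep.sum(dim=0) == 0                              # (J, 1)
    return torch.where(empty, final_kpts, agg)
```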

IV Experimental Setup

To evaluate the performance of our proposed methods, we conduct comprehensive evaluations on two mouse pose datasets, i.e., the DeepLabCut Mouse Pose dataset (https://zenodo.org/record/4008504) and our PDMB dataset. In this section, we first introduce the two datasets and the evaluation metrics. Then, we describe the implementation details.

IV-A Datasets and Evaluation Metrics

IV-A1 DeepLabCut Mouse Pose Dataset

The dataset was recorded by two different cameras with resolutions of 640×480 and 1700×1200 pixels, respectively. Most images are 640×480, while some images were cropped around the mice to generate images of approximately 800×800. There are 1066 images from multiple experimental sessions of 7 different mice. We randomly split the DeepLabCut Mouse Pose dataset into a training set of 853 images and a test set of 213 images. Four parts (i.e., snout, left ear, right ear and tail base) are labelled in this dataset.

IV-A2 PDMB Dataset

In this paper, we introduce a new dataset that was collected in collaboration with biologists at Queen's University Belfast, United Kingdom, for a study on the neurophysiological mechanisms of mice with Parkinson's disease (PD) [48]. The neurotoxin 1-methyl-4-phenyl-1,2,3,6-tetrahydropyridine (MPTP) is used as a model of PD, and it has become an invaluable aid for producing experimental parkinsonism since its discovery in 1983 [49]. We recorded 4 videos of 4 mice using a Sony Action camera (HDR-AS15) (top view) with a frame rate of 30 fps and 640×480 resolution. All the mice received MPTP treatment and were housed under constant climatic conditions with free access to food and water. All experimental procedures were performed in accordance with the Guidance on the Operation of the Animals (Scientific Procedures) Act, 1986 (UK) and approved by the Queen's University Belfast Animal Welfare and Ethical Review Body. We divided each video into 6 clips, each lasting about 10 minutes. Then, we extracted frames at different frequencies and built our top-view PDMB dataset with 4248, 3000 and 2000 images for training, validation and testing, respectively. Notably, our dataset contains a wide range of behaviours such as rearing, grooming and eating. After proper training, six professionals were invited to label the locations of the snout, left ear, right ear and tail base. In addition, we also annotate the visibility of each keypoint.

IV-A3 Evaluation Metrics

Following [22], we use the Root Mean Square Error (RMSE) for evaluation. Furthermore, we introduce the Percentage of Correct Keypoints (PCK) as an additional evaluation metric for the PDMB and DeepLabCut Mouse Pose datasets. Different from the PCK metric used for human pose estimation datasets, we modify it for our PDMB dataset, as shown in Eq. (28). A detected keypoint is considered correct if the distance between the predicted and the ground-truth keypoint is within a certain threshold:

$$\mathrm{PCK}@\epsilon = \frac{1}{J} \sum_{j=1}^{J} \delta\Big( \frac{\| \hat{\mathbf{k}}_j - \mathbf{k}_j \|_{2}}{d} \le \epsilon \Big) \tag{28}$$

where $\epsilon$ represents the threshold and $d$ represents a constant for normalization.
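Under this definition, PCK@thr is a thresholded, normalized distance averaged over keypoints; a short sketch (the function name and the choice of normalization constant are assumptions):

```python
import torch

def pck(pred: torch.Tensor, gt: torch.Tensor, norm: float, thr: float = 0.2) -> float:
    """Eq. (28): fraction of keypoints whose normalized error is within thr.

    pred, gt: (N, J, 2) predicted and ground-truth keypoints;
    norm: normalization constant (e.g. a reference body length), an assumption.
    """
    dist = torch.linalg.norm(pred - gt, dim=2) / norm  # (N, J)
    return (dist <= thr).float().mean().item()
```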

IV-B Implementation Details

All the experiments are performed on a server with an Intel Xeon CPU @ 2.40 GHz and two 16 GB Nvidia Tesla P100 GPUs. The parameters are optimized by the Adam algorithm. For the DeepLabCut Mouse Pose dataset, we use an initial learning rate of 1e-4, and all training and test images are resized to a resolution of 640×480. For the PDMB dataset, the initial learning rate is set to 1e-5. We use the validation split of the PDMB dataset to monitor the training process. The proposed GM-SCENet is implemented in PyTorch. The source code and dataset will be available at https://github.com/FeixiangZhou/GM-SCENet.

V Results and Discussion

In this section, we conduct comprehensive experiments to evaluate the performance of our proposed methods. Firstly, to validate the effectiveness of the proposed Structured Context Mixer (SCM), Cascaded Multi-level Supervision module (CMLS) and Multi-level Keypoint Aggregation (MLKA), we conduct ablation experiments on the PDMB validation and DeepLabCut Mouse Pose test datasets. We use the 3-stack hourglass network as our baseline. Based on this baseline, we first investigate each proposed component, followed by a comprehensive analysis of the impact of each module (i.e., SCM, CMLS and MLKA) on the whole network. Then, we compare our network with prior state-of-the-art networks for animal and human pose estimation on both the PDMB and DeepLabCut Mouse Pose test datasets.

Methods | RMSE | Snout | Left ear | Right ear | Tail base | Mean
Baseline | 4.89 | 93.38 | 92.67 | 93.97 | 86.87 | 91.72
SCM(Conv()):
  Iteration 1 | 4.98 | 94.99 | 93.87 | 95.77 | 83.40 | 92.00
  Iteration 2 | 3.74 | 96.90 | 96.50 | 96.07 | 88.87 | 94.58
  Iteration 3 | 3.07 | 97.67 | 97.83 | 97.17 | 91.67 | 96.08
SCM(Conv()):
  Iteration 1 | 4.79 | 92.55 | 86.83 | 93.60 | 85.83 | 89.70
  Iteration 2 | 3.91 | 95.13 | 91.97 | 96.90 | 88.93 | 93.23
  Iteration 3 | 3.83 | 95.89 | 95.20 | 97.53 | 87.80 | 94.11
SCM(Conv()+Conv()):
  Iteration 1 | 4.30 | 94.15 | 92.87 | 97.17 | 87.63 | 92.95
  Iteration 2 | 3.76 | 96.62 | 91.27 | 97.13 | 90.07 | 93.77
  Iteration 3 | 3.55 | 97.81 | 91.67 | 97.60 | 89.53 | 94.15
TABLE I: Ablation experiments for the Structured Context Mixer on the PDMB validation dataset (RMSE (pixels) and PCK@0.2).

V-A Ablation Studies on the Structured Context Mixer

In this experiment, we investigate the influence of the proposed Structured Context Mixer with different numbers of iterations during the message passing process and different kernel sizes for $\mathbf{K}_1$, $\mathbf{K}_2$ and $\mathbf{K}_3$ shown in Algorithm 1. Tables I and S1 show the performance comparisons of the different networks on the PDMB validation and DeepLabCut Mouse Pose test datasets, respectively. We design 3 types of convolutional kernel configurations (i.e., Conv(), Conv(), and Conv()+Conv()), where Conv()+Conv() means that the kernels related to gate $G_1$ use one kernel size with stride 1 and the padding needed to keep the same spatial resolution, while the kernels related to gate $G_2$ use the other kernel size, likewise with stride 1 and corresponding padding. We do not choose larger convolutional kernels because this would affect the efficiency of the whole network. As shown in Table I, with the first kernel configuration we achieve the lowest RMSE of 3.07 pixels and the highest mean PCK score of 96.08. When the iteration number is set to 2, we observe that, compared with the networks using the other two types of SCM, the mean PCK score is improved from 93.23 and 93.77 to 94.58, respectively, by adopting the first configuration. The same trend can be observed when the number of iterations is set to 3. In fact, larger convolutional kernels increase the receptive field of the network, which pushes our SCM to focus on larger context regions.

With respect to the number of iterations in the phase of inferring hidden structured context features, we train the networks with 1, 2 and 3 iterations, respectively. The results on the PDMB validation set show that the performance of the networks improves continuously with an increasing number of iterations, which demonstrates that the proposed SCM can adaptively learn and enhance the structured context information of each keypoint, taking into account the differences between mouse parts. However, on the DeepLabCut Mouse Pose test set, the performance becomes worse when the number of iterations is set to 3, as shown in Table S1. The main reason is that the complexity of the network increases with the number of iterations, which requires more samples for training. Overall, although the network using the first SCM configuration with 3 iterations achieves the lowest RMSE and the highest PCK score on the PDMB validation set, increasing the number of iterations incurs more computational overhead. Therefore, in our implementation, we set the number of iterations to 2 on both datasets. In particular, we make all the convolutions in the first and second iterations share their weights to reduce the number of parameters of the network for faster inference.

For some challenging mouse keypoints on the PDMB dataset, e.g., the tail base, which is frequently occluded, we achieve a PCK score of 91.67, a 4.8 improvement over the baseline model. This experimentally confirms that adding the SCM to explore keypoint-specific contextual information is also beneficial for determining the positions of occluded parts.

V-B Ablation Studies on the Cascaded Multi-level Supervision Module

Methods | RMSE | Snout | Left ear | Right ear | Tail base | Mean
Baseline | 4.89 | 93.38 | 92.67 | 93.97 | 86.87 | 91.72
MLS(All Cas1) | 4.45 | 93.49 | 96.10 | 98.16 | 86.93 | 93.67
CMLS(Start Cas2) | 5.75 | 92.13 | 97.20 | 97.93 | 86.37 | 93.41
CMLS(Middle Cas2) | 4.98 | 94.78 | 91.07 | 96.63 | 85.60 | 92.02
CMLS(Final Cas2) | 5.39 | 97.08 | 94.00 | 97.80 | 86.73 | 93.90
CMLS(All Cas2) | 13.61 | 95.19 | 93.07 | 96.33 | 78.83 | 90.85
TABLE II: Ablation experiments for the Cascaded Multi-level Supervision module on the PDMB validation dataset (RMSE (pixels) and PCK@0.2).

In this subsection, we investigate the effect of the CMLS module on the prediction results. In our experiments, we design 5 structures based on the baseline model. MLS(All Cas1) refers to the case where we use the Multi-level Supervision (MLS) module as intermediate supervision at the end of the first and second hourglass modules and our SCM, while CMLS(All Cas2) refers to a cascaded structure composed of 2 identical MLS modules at all three positions. For CMLS(Start Cas2), CMLS(Middle Cas2) and CMLS(Final Cas2), we place the cascaded structure at the end of the first hourglass, the second hourglass and the SCM, respectively, while single MLS modules are kept at the other two positions. As shown in Tables II and S2, all the networks with the CMLS module except the last one improve the PCK score of the baseline model. However, we do not witness a significant decrease in the RMSE across all the mouse parts for most of the networks. The reason is that supervision applied to the inner layers of the network can easily affect the localization results of challenging mouse parts such as the tail base on the PDMB dataset. As we can see from Table II, CMLS(All Cas2) has a PCK score of 78.83 for the tail base, which is the worst among all the networks. On the other hand, compared with the baseline model, CMLS(Start Cas2) and MLS(All Cas1) obtain improvements of 4.53 and 4.19 for the left and right ears, respectively, which demonstrates that our Cascaded Multi-level Supervision module can refine the prediction results of relatively easy parts by providing large-scale (high-resolution) supervision information. Furthermore, MLS(All Cas1) achieves a mean PCK score of 93.67, and the score is further improved to 93.90 by applying the cascaded structure to the end of the SCM. This shows the superior performance of our CMLS module over the single-scale supervision scheme.

Methods | RMSE | Snout | Left ear | Right ear | Tail base | Mean
SCM+CMLS(w/o MLKA) | 2.89 | 97.35 | 98.70 | 98.80 | 90.90 | 96.44
SCM+CMLS(with small-scale MLKA) | 2.90 | 97.84 | 99.17 | 98.60 | 91.87 | 96.87
SCM+CMLS(with large-scale MLKA) | 2.81 | 97.60 | 98.90 | 98.67 | 91.40 | 96.64
SCM+CMLS(with multi-scale MLKA) | 2.81 | 97.84 | 99.20 | 98.60 | 91.87 | 96.88
TABLE III: Ablation experiments for Multi-level Keypoint Aggregation on the PDMB validation dataset (RMSE (pixels) and PCK@0.2).
Fig. 5: Visualization of the prediction results from the Structured Context Mixer and the last Cascaded Multi-level Supervision module of our GM-SCENet. (a) The input images (we crop the original images for better display). (b) Heatmaps predicted by the Structured Context Mixer for the 4 parts. The last column in (b) shows the combined heatmaps of all the parts. (c) Mouse skeletons generated by the Structured Context Mixer (dots with different colours represent the predicted results of different parts, the symbol '+' represents the ground truth, and the purple lines are the skeletons). (d) Heatmaps further refined by the last Cascaded Multi-level Supervision module. (e) Skeletons further refined by the last Cascaded Multi-level Supervision module.

V-C Ablation Studies on Multi-level Keypoint Aggregation

The CMLS module drives the whole network to generate multi-level features for supervision during the training phase, as described in Section III-B. After obtaining all the multi-level predicted keypoints, we aggregate the chosen keypoints according to the MLKA strategy. In our experiments, we design 3 schemes to evaluate the performance of the proposed MLKA, using the network consisting of the 3-stack hourglass network, SCM (Conv(), Iteration 2) and CMLS (Final Cas2) for comparison. In the first scheme, we generate the keypoints by aggregating all the selected small-scale keypoints and the predicted keypoints from the last stage. In the other two schemes, we aggregate all the selected large-scale and multi-scale keypoints, respectively. The threshold of the keypoint similarity in all the experiments is set to 0.2. As shown in Table III, the inference strategies combined with small-scale, large-scale and multi-scale MLKA all improve the mean PCK score. In addition, the inference strategy with multi-scale MLKA has the lowest RMSE and the best mean PCK score among the three schemes on the PDMB validation set. Moreover, we obtain improvements of 0.01 and 0.24 by replacing the small-scale and large-scale MLKA strategies with the multi-scale MLKA, respectively, and a 0.44 improvement by applying multi-scale MLKA to the inference phase of the baseline network.

Dataset | SCM | CMLS | RMSE | Snout | Left ear | Right ear | Tail base | Mean
PDMB | | | 4.89 | 93.38 | 92.67 | 93.97 | 86.87 | 91.72
PDMB | ✓ | | 3.74 | 96.90 | 96.50 | 96.07 | 88.87 | 94.58
PDMB | | ✓ | 5.39 | 97.08 | 94.00 | 97.80 | 86.73 | 93.90
PDMB | ✓ | ✓ | 2.89 | 97.35 | 98.70 | 98.80 | 90.90 | 96.44
DeepLabCut Mouse Pose | | | 3.83 | 89.35 | 90.28 | 90.28 | 93.06 | 90.74
DeepLabCut Mouse Pose | ✓ | | 3.67 | 95.77 | 96.24 | 90.61 | 98.12 | 95.19
DeepLabCut Mouse Pose | | ✓ | 3.52 | 94.84 | 97.65 | 95.77 | 97.65 | 96.48
DeepLabCut Mouse Pose | ✓ | ✓ | 3.13 | 96.71 | 98.60 | 97.65 | 98.59 | 97.89
TABLE IV: Ablation experiments for the Structured Context Mixer and Cascaded Multi-level Supervision module on the PDMB validation and DeepLabCut Mouse Pose test datasets (RMSE (pixels) and PCK@0.2).
Dataset | PDMB | | | | | | DeepLabCut Mouse Pose | | | | |
Methods | RMSE | Snout | Left ear | Right ear | Tail base | Mean | RMSE | Snout | Left ear | Right ear | Tail base | Mean
Newell et al. [26] | 4.29 | 95.69 | 92.75 | 95.05 | 93.38 | 94.22 | 5.80 | 92.96 | 93.43 | 91.55 | 88.73 | 91.67
Mathis et al. [22] | 3.73 | 96.50 | 98.25 | 97.85 | 93.58 | 96.54 | 3.64 | 93.90 | 93.90 | 85.92 | 98.58 | 93.08
Pereira et al. [23] | 4.00 | 94.68 | 94.10 | 96.25 | 93.23 | 94.56 | 5.70 | 88.26 | 94.37 | 94.37 | 96.24 | 93.31
Graving et al. [21] | 3.23 | 96.96 | 98.89 | 97.30 | 94.64 | 96.95 | 3.48 | 96.24 | 95.31 | 92.02 | 98.12 | 95.42
Toshev et al. [36] | 5.78 | 83.92 | 83.75 | 89.90 | 63.21 | 80.20 | 7.30 | 73.24 | 64.32 | 72.30 | 79.34 | 72.30
Wei et al. [27] | 4.71 | 93.41 | 92.05 | 92.60 | 94.69 | 93.19 | 6.19 | 91.08 | 89.67 | 81.22 | 96.71 | 89.67
Chu et al. [34] | 4.20 | 96.65 | 94.35 | 97.00 | 94.64 | 95.66 | 5.54 | 91.08 | 92.49 | 91.55 | 96.71 | 92.96
Ours (SCM) | 3.46 | 96.91 | 97.60 | 98.20 | 95.19 | 96.97 | 3.67 | 95.77 | 96.24 | 90.61 | 98.12 | 95.19
Ours (CMLS) | 4.79 | 96.45 | 95.50 | 97.30 | 93.13 | 95.60 | 3.52 | 94.84 | 97.65 | 95.77 | 97.65 | 96.48
Ours (GM-SCENet) | 2.84 | 97.87 | 98.65 | 98.80 | 95.89 | 97.80 | 3.13 | 96.71 | 98.60 | 97.65 | 98.59 | 97.89
Ours (GM-SCENet+MLKA) | 2.79 | 97.72 | 98.90 | 99.10 | 96.34 | 98.01 | 3.10 | 97.65 | 98.12 | 98.12 | 98.59 | 98.12
TABLE V: Comparisons of RMSE and PCK@0.2 scores on the PDMB and DeepLabCut Mouse Pose test sets.

V-D Comprehensive Analysis of the Structured Context Mixer and Cascaded Multi-level Supervision Module

We also investigate the contribution of the proposed Structured Context Mixer and Cascaded Multi-level Supervision module to the whole network on both the PDMB validation and DeepLabCut Mouse Pose test datasets. In addition to adding each proposed module to the baseline model separately, we further apply the two proposed modules to the baseline model simultaneously. The results are reported in Table IV. On the two datasets, our method achieves the highest mean PCK scores of 96.44 and 97.89 with the two proposed modules applied simultaneously, increases of 4.72 and 7.15 over the baseline model. Additionally, the network combining the SCM and CMLS module achieves the lowest RMSE of 2.89 pixels on the PDMB validation set and 3.13 pixels on the DeepLabCut Mouse Pose test set.

We also observe that the overall performance of the network with only the CMLS module outperforms that of the SCM based network across almost all the keypoints on the DeepLabCut Mouse Pose dataset. Different from the tail base in the PDMB dataset, the CMLS based network does not cause an obvious decrease in the PCK score of the snout, which is a relatively challenging part in the DeepLabCut Mouse Pose dataset. On the contrary, the CMLS based network improves the PCK score of the snout from 89.35 to 94.84. As discussed in Section V-B, the localization errors of challenging mouse parts may be amplified when only the CMLS module is used, although those of relatively easy parts can be further reduced. However, the CMLS based network can effectively deal with the problem of scale variations by providing multi-level supervision information. In fact, the size of the mouse in the input image varies since the images in the DeepLabCut Mouse Pose dataset have different resolutions. Therefore, our network achieves better performance by combining both the SCM and the CMLS module, where the SCM adaptively learns and enhances structured context information by exploring the differences between parts, and the CMLS provides multi-level supervision information to strengthen contextual feature learning.

Finally, we explore the effect of the CMLS module applied to the end of the SCM through a qualitative evaluation on our PDMB validation set, as demonstrated in Fig. 5. We observe that the network produces more accurate localization results when the CMLS module is added to the end of the SCM in challenging cases such as abnormal mouse poses and occluded or invisible keypoints. The network without the CMLS module at the end of the SCM produces failed predictions, as shown in Fig. 5(c). For example, in the 2nd row of Fig. 5(c), the mouse's left foot is mispredicted as the tail base due to the local similarity caused by the highly deformable body. By contrast, adding the CMLS module to the end of the SCM results in refined predictions, as shown in Fig. 5(e). It is also noticeable that the improved network produces higher responses in the ambiguous regions of the heatmaps, as depicted in Fig. 5(d) compared to the 5th column of Fig. 5(b). This is because the last CMLS module preserves the spatial correlations between different parts, where the predicted heatmaps from the SCM are used as prior knowledge and combined with the multi-level features from the baseline model.

V-E Comparison with State-of-the-Art Pose Estimation Networks

In this section, we report the RMSE and PCK scores (at the threshold of 0.2) of our proposed method and other state-of-the-art methods on the PDMB and DeepLabCut Mouse Pose test sets, as shown in Table V. Among them, three are state-of-the-art approaches for animal pose estimation and four are networks for human pose estimation.

On the PDMB test set, our proposed GM-SCENet achieves a state-of-the-art RMSE of 2.84 pixels and a mean PCK score of 97.80. Compared with the Stacked DenseNet for animal pose estimation [21], our GM-SCENet improves the mean PCK score from 96.95 to 97.80, and the performance can be further improved to 98.01 by adding the MLKA inference algorithm. In particular, for the most challenging mouse part, i.e., the tail base, our GM-SCENet achieves a 1.25 improvement. The PCK scores for the snout, left and right ears are also improved to 97.87, 98.65 and 98.80, respectively. Among the popular networks for human pose estimation, the model of [36] has the worst performance. This is because it directly predicts the numerical value of each coordinate rather than the heatmap of each part with a CNN; making predictions in this way ignores the spatial relationships between keypoints in an image. Compared with the 8-stack hourglass network of [26], our GM-SCENet (built on a 3-stack hourglass network) outperforms it by a large margin, with a 3.58 improvement in mean PCK score. In addition, we observe that our SCM based network achieves competitive performance in comparison with the other methods, demonstrating the effectiveness of our proposed SCM for structured context enhancement. The performance can be further improved by adding the CMLS module. Figs. 6(a) and S1 show the mean PCK curves of the different methods and the PCK curves for each keypoint, respectively. Our method (GM-SCENet+MLKA) achieves the best performance across different thresholds among all the methods. These results demonstrate the superior performance of our proposed network over the other state-of-the-art methods in terms of the PCK@0.2 score.

Fig. 6: Comparisons of mean PCK curves on (a) the PDMB and (b) the DeepLabCut Mouse Pose test sets.
Fig. 7: Comparisons of keypoint RMSE on (a) the PDMB and (b) the DeepLabCut test sets. The rectangles corresponding to individual methods depict the distribution of keypoint RMSE between the minimum and maximum values.

On the DeepLabCut Mouse Pose test set, our SCM based network achieves a mean PCK score of 95.19, which exceeds the results of the other state-of-the-art networks except for the Stacked DenseNet. Our CMLS based network, however, obtains a 1.06 improvement over the Stacked DenseNet. Furthermore, the proposed GM-SCENet improves the PCK scores by large margins of 3.29 and 3.28 on the left and right ears compared with their closest competitors, and achieves a 2.47 improvement on average compared with the Stacked DenseNet. The use of the MLKA scheme results in a further improvement in the mean PCK score and a decrease in the keypoint RMSE. Figs. 6(b) and S2 show the mean PCK curves and the per-part PCK curves of the different methods. The GM-SCENet, with or without the MLKA, achieves competitive performance compared with the other state-of-the-art methods on the DeepLabCut Mouse Pose test set.

It is also noteworthy that, although our PDMB dataset contains more challenging mouse poses, the different methods perform better on it than on the DeepLabCut Mouse Pose dataset. The reason is that PDMB is a larger dataset with more images for training. On the smaller DeepLabCut Mouse Pose dataset, our proposed method achieves a significant improvement in PCK score compared with the other state-of-the-art approaches. This shows that our method works satisfactorily on both small and relatively large datasets. Fig. S3 gives some examples of the estimated mouse poses on the PDMB and DeepLabCut test sets. Our proposed GM-SCENet deals well with occlusions, invisible keypoints, abnormal poses, deformable mouse body and scale variations.

Finally, we also compare the distributions of keypoint RMSE on the PDMB and DeepLabCut test sets (we randomly choose a group of samples rather than the whole test set for this comparison in our experiments). The box plots in Figs. 7(a) and (b) measure the effectiveness of our proposed method and four comparison methods on the two datasets in terms of keypoints’ RMSE and stability. They show that our GM-SCENet achieves the lowest keypoints’ RMSE with the lowest variance on both datasets, demonstrating that our proposed GM-SCENet effectively reduces the keypoints’ RMSE and produces more stable pose estimation results.
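A minimal matplotlib sketch for this style of box-plot comparison is given below; the method names and RMSE arrays are placeholders rather than the values reported in Fig. 7.

```python
import matplotlib.pyplot as plt
import numpy as np

def plot_rmse_boxes(rmse_by_method, title):
    """Box plots of per-sample keypoint RMSE for several methods.
    rmse_by_method: dict mapping method name -> 1-D array of RMSE values."""
    fig, ax = plt.subplots()
    ax.boxplot(list(rmse_by_method.values()),
               labels=list(rmse_by_method.keys()))
    ax.set_ylabel("Keypoint RMSE (pixels)")
    ax.set_title(title)
    fig.tight_layout()
    plt.show()

# Example with synthetic per-sample RMSE values (placeholders).
rng = np.random.default_rng(0)
plot_rmse_boxes({"GM-SCENet": rng.gamma(2.0, 1.5, size=50),
                 "Baseline": rng.gamma(3.0, 1.8, size=50)},
                "PDMB test set")
```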

VI Conclusion

We have presented a novel end-to-end architecture called Graphical Model based Structured Context Enhancement Network (GM-SCENet) for mouse pose estimation. By concentrating on the differences between mouse parts, the proposed Structured Context Mixer (SCM), based on a novel graphical model, adaptively learns and enhances the structured context information of individual mouse parts, which is composed of global and keypoint-specific contextual features. Meanwhile, a Cascaded Multi-level Supervision module (CMLS) has been designed to jointly train the SCM and the backbone network by providing multi-level supervision information, which strengthens contextual feature learning and improves the robustness of the whole network. We have also designed an inference strategy, i.e., Multi-level Keypoint Aggregation (MLKA), where selected multi-level keypoints are aggregated for better prediction results. Experimental results on our Parkinson’s Disease Mouse Behaviour (PDMB) dataset and the standard DeepLabCut Mouse Pose dataset demonstrate that the proposed approach outperforms several baseline methods. Our future work will explore the social behaviour analysis of mice with Parkinson’s disease using the proposed mouse pose estimation model.

References

  • [1] L. Lewejohann, A. M. Hoppmann, P. Kegel, M. Kritzler, A. Krüger, and N. Sachser, “Behavioral phenotyping of a murine model of Alzheimer’s disease in a seminaturalistic environment using RFID tracking,” Behavior Research Methods, vol. 41, no. 3, pp. 850–856, 2009.
  • [2] S. R. Blume, D. K. Cass, and K. Y. Tseng, “Stepping test in mice: A reliable approach in determining forelimb akinesia in MPTP-induced Parkinsonism,” Experimental Neurology, vol. 219, no. 1, pp. 208–211, 2009.
  • [3] A. Abbott, “Novartis reboots brain division. After years in the doldrums, research into neurological disorders is about to undergo a major change of direction,” Nature, vol. 502, pp. 153–154, 2013.
  • [4] Z. Jiang, D. Crookes, B. D. Green, Y. Zhao, H. Ma, L. Li, S. Zhang, D. Tao, and H. Zhou, “Context-Aware Mouse Behavior Recognition Using Hidden Markov Models,” IEEE Transactions on Image Processing, vol. 28, no. 3, pp. 1133–1148, 2019.
  • [5] N. G. Nguyen, D. Phan, F. R. Lumbanraja, M. R. Faisal, B. Abapihi, B. Purnama, M. K. Delimayanti, K. R. Mahmudah, M. Kubo, and K. Satou, “Applying Deep Learning Models to Mouse Behavior Recognition,” Journal of Biomedical Science and Engineering, vol. 12, no. 02, pp. 183–196, 2019.
  • [6] A. A. Robie, K. M. Seagraves, S. E. Egnor, and K. Branson, “Machine vision methods for analyzing social interactions,” Journal of Experimental Biology, vol. 220, no. 1, pp. 25–34, 2017.
  • [7] A. T. Schaefer and A. Claridge-Chang, “The surveillance state of behavioral automation,” Current Opinion in Neurobiology, vol. 22, no. 1, pp. 170–176, 2012.
  • [8] T. Kilpeläinen, U. H. Julku, R. Svarcbahs, and T. T. Myöhänen, “Behavioural and dopaminergic changes in double mutated human A30P*A53T alpha-synuclein transgenic mouse model of Parkinson’s disease,” Scientific Reports, vol. 9, no. 1, pp. 1–13, 2019.
  • [9] S. R. Egnor and K. Branson, “Computational Analysis of Behavior,” Annual Review of Neuroscience, vol. 39, no. 1, pp. 217–236, 2016.
  • [10] M. W. Mathis and A. Mathis, “Deep learning tools for the measurement of animal behavior in neuroscience,” Current opinion in neurobiology, vol. 60, pp. 1–11, 2020.
  • [11] A. Arac, P. Zhao, B. H. Dobkin, S. T. Carmichael, and P. Golshani, “Deepbehavior: A deep learning toolbox for automated analysis of animal and human behavior imaging data,” Frontiers in Systems Neuroscience, vol. 13, p. 20, 2019.
  • [12] O. H. Maghsoudi, A. V. Tabrizi, B. Robertson, and A. Spence, “Superpixels based marker tracking vs. hue thresholding in rodent biomechanics application,” in Conference Record of the 51st Asilomar Conference on Signals, Systems and Computers (ACSSC 2017), 2018, pp. 209–213.
  • [13] S. Ohayon, O. Avni, A. L. Taylor, P. Perona, and S. E. Roian Egnor, “Automated multi-day tracking of marked mice for the analysis of social behaviour,” Journal of Neuroscience Methods, vol. 219, no. 1, pp. 10–19, 2013.
  • [14] “Three-dimensional rodent motion analysis and neurodegenerative disorders,” Journal of Neuroscience Methods, vol. 231, pp. 31 – 37, 2014.
  • [15] X. P. Burgos-Artizzu, P. Dollár, D. Lin, D. J. Anderson, and P. Perona, “Social behavior recognition in continuous video,” in 2012 IEEE Conference on Computer Vision and Pattern Recognition, 2012, pp. 1322–1329.
  • [16] F. De Chaumont, R. D. S. Coura, P. Serreau, A. Cressant, J. Chabout, S. Granon, and J. C. Olivo-Marin, “Computerized video analysis of social interactions in mice,” Nature Methods, vol. 9, no. 4, pp. 410–417, 2012.
  • [17] K. Branson and S. Belongie, “Tracking multiple mouse contours (without too many samples),” Proceedings - 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, CVPR 2005, vol. I, pp. 1039–1046, 2005.
  • [18] J. Cao, H. Tang, H. S. Fang, X. Shen, C. Lu, and Y. W. Tai, “Cross-domain adaptation for animal pose estimation,” in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 9497–9506.
  • [19] X. Liu, S.-y. Yu, N. Flierman, S. Loyola, M. Kamermans, T. M. Hoogland, and C. I. D. Zeeuw, “OptiFlex: video-based animal pose estimation using deep learning enhanced by optical flow,” bioRxiv preprint 2020.04.04.025494, 2020.
  • [20] T. Zhang, L. Liu, K. Zhao, A. Wiliem, G. Hemson, and B. Lovell, “Omni-supervised joint detection and pose estimation for wild animals,” Pattern Recognition Letters, vol. 132, pp. 84–90, 2020.
  • [21] J. M. Graving, D. Chae, H. Naik, L. Li, B. Koger, B. R. Costelloe, and I. D. Couzin, “Deepposekit, a software toolkit for fast and robust animal pose estimation using deep learning,” eLife, vol. 8, pp. 1–42, 2019.
  • [22] A. Mathis, P. Mamidanna, K. M. Cury, T. Abe, V. N. Murthy, M. W. Mathis, and M. Bethge, “DeepLabCut: markerless pose estimation of user-defined body parts with deep learning,” Nature Neuroscience, vol. 21, no. 9, pp. 1281–1289, 2018.
  • [23] T. D. Pereira, D. E. Aldarondo, L. Willmore, M. Kislin, S. S. Wang, M. Murthy, and J. W. Shaevitz, “Fast animal pose estimation using deep neural networks,” Nature Methods, vol. 16, no. 1, pp. 117–125, 2019.
  • [24] E. Insafutdinov, L. Pishchulin, B. Andres, M. Andriluka, and B. Schiele, “Deepercut: A deeper, stronger, and faster multi-person pose estimation model,” in European Conference on Computer Vision.   Springer, 2016, pp. 34–50.
  • [25] V. Badrinarayanan, A. Kendall, and R. Cipolla, “Segnet: A deep convolutional encoder-decoder architecture for image segmentation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 12, pp. 2481–2495, 2017.
  • [26] A. Newell, K. Yang, and J. Deng, “Stacked hourglass networks for human pose estimation,” in European Conference on Computer Vision.   Springer, 2016, pp. 483–499.
  • [27] S.-E. Wei, V. Ramakrishna, T. Kanade, and Y. Sheikh, “Convolutional Pose Machines,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
  • [28] K. Sun, B. Xiao, D. Liu, and J. Wang, “Deep high-resolution representation learning for human pose estimation,” in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2019, pp. 5686–5696.
  • [29] Y. Yang and D. Ramanan, “Articulated pose estimation with flexible mixtures-of-parts,” in CVPR 2011.   IEEE, 2011, pp. 1385–1392.
  • [30] X. Chen and A. L. Yuille, “Articulated pose estimation by a graphical model with image dependent pairwise relations,” in Advances in neural information processing systems, 2014, pp. 1736–1744.
  • [31] D. Zhang, G. Guo, D. Huang, and J. Han, “Poseflow: A deep motion representation for understanding human behaviors in videos,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 6762–6770.
  • [32] L. Dong, X. Chen, R. Wang, Q. Zhang, and E. Izquierdo, “Adore: An adaptive holons representation framework for human pose estimation,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 28, no. 10, pp. 2803–2813, 2017.
  • [33] S. Liu, Y. Li, and G. Hua, “Human pose estimation in video via structured space learning and halfway temporal evaluation,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 29, no. 7, pp. 2029–2038, 2018.
  • [34] X. Chu, W. Yang, W. Ouyang, C. Ma, A. L. Yuille, and X. Wang, “Multi-context attention for human pose estimation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1831–1840.
  • [35] L. Ke, M.-C. Chang, H. Qi, and S. Lyu, “Multi-scale structure-aware network for human pose estimation,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 713–728.
  • [36] A. Toshev and C. Szegedy, “Deeppose: Human pose estimation via deep neural networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 1653–1660.
  • [37] D. Zhang, J. Han, G. Guo, and L. Zhao, “Learning object detectors with semi-annotated weak labels,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 29, no. 12, pp. 3622–3635, 2018.
  • [38] J. Nie, Y. Pang, S. Zhao, J. Han, and X. Li, “Efficient selective context network for accurate object detection,” IEEE Transactions on Circuits and Systems for Video Technology, pp. 1–1, 2020.
  • [39] D. Zhang, H. Fu, J. Han, A. Borji, and X. Li, “A review of co-saliency detection algorithms: Fundamentals, applications, and challenges,” ACM Transactions on Intelligent Systems and Technology (TIST), vol. 9, no. 4, pp. 1–31, 2018.
  • [40] H. Law and J. Deng, “Cornernet: Detecting objects as paired keypoints,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 734–750.
  • [41] S. Gidaris and N. Komodakis, “Object detection via a multi-region and semantic segmentation-aware cnn model,” in Proceedings of the IEEE international conference on computer vision, 2015, pp. 1134–1142.
  • [42] J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 7132–7141.
  • [43] D. Xu, W. Ouyang, X. Alameda-Pineda, E. Ricci, X. Wang, and N. Sebe, “Learning deep structured multi-scale features using attention-gated crfs for contour prediction,” in Advances in Neural Information Processing Systems, 2017, pp. 3961–3970.
  • [44] D. Xu, W. Wang, H. Tang, H. Liu, N. Sebe, and E. Ricci, “Structured attention guided convolutional neural fields for monocular depth estimation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 3917–3925.
  • [45] P. Krähenbühl and V. Koltun, “Efficient inference in fully connected crfs with gaussian edge potentials,” in Advances in neural information processing systems, 2011, pp. 109–117.
  • [46] X. Chu, W. Ouyang, X. Wang et al., “Crf-cnn: Modeling structured information in human pose estimation,” in Advances in Neural Information Processing Systems, 2016, pp. 316–324.
  • [47] B. Xiao, H. Wu, and Y. Wei, “Simple baselines for human pose estimation and tracking,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 466–481.
  • [48] Z. Jiang, F. Zhou, A. Zhao, X. Li, L. Li, D. Tao, X. Li, and H. Zhou, “Muti-view mouse social behaviour recognition with deep graphical model,” arXiv preprint arXiv:2011.02451, 2020.
  • [49] V. Jackson-Lewis and S. Przedborski, “Protocol for the mptp mouse model of parkinson’s disease,” Nature Protocols, vol. 2, no. 1, p. 141, 2007.

Supplementary A

Methods                      RMSE    Snout   Left ear   Right ear   Tail base   Mean

Baseline                     3.83    89.35   90.28      90.28       93.06       90.74

SCM(Conv())
  Iteration 1                4.70    91.20   95.83      95.37       96.30       94.68
  Iteration 2                4.11    91.08   97.18      96.24       98.12       95.66
  Iteration 3                14.87   91.08   76.53      69.95       88.26       81.46
SCM(Conv())
  Iteration 1                4.09    94.37   93.43      92.96       94.84       93.90
  Iteration 2                3.67    95.77   96.24      90.61       98.12       95.19
  Iteration 3                4.52    92.49   90.61      93.90       94.84       92.96
SCM(Conv()+Conv())
  Iteration 1                3.97    95.31   97.65      88.26       95.30       94.13
  Iteration 2                3.93    95.77   96.71      89.67       95.31       94.37
  Iteration 3                8.55    91.08   89.67      84.04       81.69       86.62
TABLE S1: Ablation experiments of the Structured Context Mixer on the DeepLabCut Mouse Pose test set (RMSE in pixels; the remaining columns report PCK@0.2 scores).
Methods                      RMSE    Snout   Left ear   Right ear   Tail base   Mean

Baseline                     3.83    89.35   90.28      90.28       93.06       90.74

MLS(All Cas1)                7.85    90.28   91.67      95.37       98.61       93.98
CMLS(Start Cas2)             3.52    94.84   97.65      95.77       97.65       96.48
CMLS(Middle Cas2)            5.35    90.14   90.61      88.26       98.59       91.90
CMLS(Final Cas2)             4.51    89.20   93.42      95.77       98.12       94.13
CMLS(All Cas2)               8.01    81.22   93.43      92.02       96.24       90.73
TABLE S2: Ablation experiments of the Cascaded Multi-level Supervision module on the DeepLabCut Mouse Pose test set (RMSE in pixels; the remaining columns report PCK@0.2 scores).

Supplementary B

Fig. S1: Comparisons of the PCK curves for each part, i.e., (a) snout, (b) left ear, (c) right ear and (d) tail base, on the PDMB test set.

Supplementary C

Fig. S2: Comparisons of the PCK curves for each part, i.e., (a) snout, (b) left ear, (c) right ear and (d) tail base, on the DeepLabCut Mouse Pose test set.

Supplementary D

Fig. S3: Examples of the estimated mouse poses on the (a) PDMB and (b) DeepLabCut Mouse Pose test sets (best viewed in electronic form with 4× zoom-in). Our proposed method deals well with occlusions, invisible keypoints, abnormal poses and a deformable mouse body on the PDMB dataset. On the DeepLabCut Mouse Pose dataset, our method is robust to scale variations.