Mouse models have been extensively developed to study cognitive and neurological disorders such as Down syndrome, autism, Alzheimer’s disease and Parkinson’s disease. Comprehensive behavioural phenotyping of transgenic mice can reveal the underlying functional roles of genes, and provide new insights into the pathophysiology and treatment of the diseases carried by the mice [18, 34, 25, 32]. Historically, such behaviour is primarily labelled by an expert, which is a time-consuming, labour-intensive and error-prone task. To reduce the inherently high labour cost and inter-investigator variability associated with the manual annotation of data, reliable and high-throughput methods for automated quantitative analysis of mouse behaviours have become extremely important.
Previous automated systems have mainly relied on the use of various sensors to monitor animal behaviours. These established technologies include the use of infrared sensors, radio-frequency identification (RFID) transponders and photobeams. Such approaches have been successfully applied to the analysis of simple pre-programmed behaviours such as running and resting. However, the limited capacity of these sensor-based approaches restricts the complexity of the behaviours that can be measured: they cannot handle more complex mouse behaviours such as eating, attacking, or sniffing. Vision-based techniques are thus needed to recognise such subtle mouse behaviours.
Table I: Description of mouse behaviours.

| Behaviour | Description |
| --- | --- |
| approach | Moving toward another mouse in a straight line without obvious exploration |
| attack | Biting/pulling fur of another mouse |
| copulation | Copulation of male and female mice |
| chase | A following mouse attempts to maintain a close distance to another mouse while the latter is moving |
| circle | Circling around own axis or chasing tail |
| drink | Licking at the spout of the water bottle |
| eat | Gnawing/eating food pellets held by the fore-paws |
| clean | Washing the muzzle with fore-paws (including licking fore-paws) or grooming the fur or hind-paws by means of licking or chewing |
| human | Human intervenes with mice |
| sniff | Sniffing any body part of another mouse |
| up | Exploring while standing in an upright posture |
| walk away | Moving away from another mouse in a straight line without obvious exploration |
| other | Behaviour other than defined in this ethogram, or when it is not visible what behaviour the mouse displays |
firstly estimated the positions of the mouse body parts (e.g. head and trunk) by deploying a geometrical primitive model and a temporal watershed segmentation algorithm respectively, and then recognised mouse behaviours based on these positions. Since they only used top-view video recordings, it is difficult to recognise behaviours that involve vertical movements, e.g. ‘rearing’. In contrast, side-view video recordings may supply a better perspective for some bouts of behaviour. For example, Jhuang et al. [22]
developed and implemented a novel Hidden Markov Model algorithm for behaviour recognition using visual and contextual features. These systems were successful at measuring single-mouse behaviour. However, if multiple mice are in the scene, such systems lack the ability to recognise the interactions between mice due to occlusion. In such cases, the ambiguity caused by occlusion can be mitigated by adopting multiple-view observations. Burgos-Artizzu et al.
designed a system for recognising social behaviours of mice from both top and side views. They firstly extracted spatio-temporal and trajectory features and then applied AdaBoost to classify the extracted features. However, their approach can only learn view-specific feature representations; the relationship between different cameras and the temporal transition of mouse behaviours are not addressed. Hong et al.
utilized a top-view camera and a top-view depth sensor to track and extract the body-pose features of mice by fitting an ellipse to each of them. These body pose features are then integrated with pixel changes from the side-view to train a classifier. Similar to the method of Burgos-Artizzu et al., this method also ignored the relationship between different cameras and the temporal transition of mouse behaviours. Another popular method employing multi-view cameras is to reconstruct the 3D pose of a mouse[33, 48, 45], but it requires additional equipment, calibration of cameras, higher computational resources, and 3D tracking software.
In this paper, we are particularly interested in recognising mouse behaviours (see Table I for the description of mouse behaviours) by fusing view-specific features, which is a challenging task due to large data variations over different views. Recently, several approaches have been proposed to address the problem of multi-view action recognition. Liu et al.  presented a bipartite-graph-based method to bridge the semantic gap across view-dependent vocabularies. Zheng et al.  proposed to learn a set of view-specific dictionaries for individual views and a common dictionary can be shared by different views. Junejo et al.  summarised actions at various views by using a so-called self-similarity matrix (SSM) descriptor. In order to enhance the representation power of SSM, Yan et al. proposed a multi-task learning approach to share discriminative SSM features between different views. However, these methods can only deal with segmented sequences, each of which contains only one subject’s behaviours.
Probabilistic graphical models are a useful tool for the dynamic behaviour recognition problem due to their ability to fully exploit the spatial and temporal structures of data. Normally, graphical models fall into two main categories: generative and discriminative models. Some of the popular approaches use generative models such as the Hidden Markov Model (HMM) and Dynamic Bayesian Networks. In particular, Brand et al. introduced a coupled HMM to model interacting processes, and Murphy et al. introduced Dynamic Bayesian Networks to model complex dependencies in the hidden (or observed) state variables. Comparatively, discriminative models such as conditional random fields (CRFs) are more commonly used due to their better predictive power. CRFs have been extended to model latent states, e.g. in the Hidden Conditional Random Field (HCRF). The Latent-Dynamic CRF (LDCRF) is a variant of the HCRF tailored to the dynamic behaviour recognition problem. Song et al. further extended LDCRF to the multi-view (MV) domain and proposed an MV-LDCRF model by defining view-specific and view-shared edges.
In this paper, we describe a multi-view mouse behaviour recognition system based on trajectory-based motion and spatio-temporal features, as shown in Fig. 1. Specifically, we propose a novel deep probabilistic graphical model with the aim of modelling: (1) the temporal relationship of image frames in each view, (2) the relationship between camera views, and (3) the correlations between the neighbouring labels.
II Proposed Methods
In this section, we give details of our feature extraction approach, which extracts meaningful features from videos, and of our proposed MV-LADDM model, which fuses and dynamically classifies these collected features. The overview of the proposed system is shown in Fig. 1.
II-A Feature Extraction
From the video data, two types of features were extracted: spatio-temporal features and trajectory-based motion features. Each was chosen to capture a different aspect of mouse posture and movement. The spatio-temporal features used in this study include local visual features and contextual features. Both are computed from spatio-temporal interest points, which are obtained by applying a Laplacian of Gaussian (LoG) kernel filter along the spatial dimensions and a quadrature pair of 1-D Gabor kernel filters along the temporal dimension. For the computation of local visual features, we extract the brightness gradients of three channels from the cuboid around each interest point. The contextual features can be computed in the form $\boldsymbol{f}_i = \boldsymbol{p}_c - \boldsymbol{p}_i$, where $\boldsymbol{p}_c$ and $\boldsymbol{p}_i$ represent the coordinates of the centre and the $i$-th interest point, respectively. These features can characterise both the spatial location and the temporal changes of the mice.
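The interest-point detection described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the values of `sigma`, `tau` and the Gabor centre frequency `4/tau` are assumptions not specified in the text.

```python
import numpy as np
from scipy import ndimage

def interest_point_response(video, sigma=2.0, tau=1.5):
    """Response map for spatio-temporal interest points: a Laplacian of
    Gaussian (LoG) along the spatial dimensions and a quadrature pair of
    1-D Gabor filters along the temporal dimension. Hypothetical sketch;
    sigma, tau and the centre frequency are illustrative assumptions.

    video: float array of shape (T, H, W).
    """
    # Spatial LoG, applied frame by frame.
    log = np.stack([ndimage.gaussian_laplace(f, sigma) for f in video])

    # Quadrature pair of 1-D temporal Gabor filters (even/odd phase).
    t = np.arange(-int(4 * tau), int(4 * tau) + 1, dtype=float)
    envelope = np.exp(-t**2 / (2 * tau**2))
    omega = 4.0 / tau                      # centre frequency (assumption)
    g_even = envelope * np.cos(2 * np.pi * omega * t)
    g_odd = envelope * np.sin(2 * np.pi * omega * t)

    even = ndimage.convolve1d(log, g_even, axis=0)
    odd = ndimage.convolve1d(log, g_odd, axis=0)
    return even**2 + odd**2   # interest points = local maxima of this map
```

Interest points would then be taken as local maxima of the returned response volume, with a cuboid of brightness gradients extracted around each.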
The computation of trajectory-based motion features
is based on the combination of dense trajectories and deeply learned features, as deep learning has produced remarkable results in human action recognition [54, 49, 2, 39]. The first step of computing dense trajectories is to densely sample a set of points on a grid with a step size of 5 pixels on 8 spatial scales, which has been shown to give satisfactory results in . Points in homogeneous areas are eliminated if the eigenvalues of their autocorrelation matrices are below a pre-defined threshold. Afterwards, these sampled points are tracked using a median filter in a dense flow field. To compute deeply learned features, we adopt the temporal stream nets proposed in . The temporal stream nets are trained on the stacked optical flow fields of the action dataset, describing the dynamic motion information. Similar to , we also choose the trajectory-constrained sampling and pooling descriptors from the conv3 and conv4 layers of the temporal stream nets. Finally, we de-correlate TDD with PCA and reduce its dimensionality.
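The dense-sampling step (grid sampling plus removal of homogeneous points via the autocorrelation eigenvalue test) can be sketched as below. The 3×3 tensor smoothing and the `quality` fraction are assumptions mirroring common implementations, not values from the text:

```python
import numpy as np
from scipy import ndimage

def dense_sample(frame, step=5, quality=0.001):
    """Densely sample points on a grid (step of 5 pixels, as in the text),
    discarding points in homogeneous areas whose autocorrelation-matrix
    (structure-tensor) eigenvalues fall below a threshold.
    frame: 2-D float array; returns an (N, 2) array of (row, col) points.
    """
    gy, gx = np.gradient(frame)
    # Structure-tensor entries, smoothed over a 3x3 neighbourhood
    # (the window size is an assumption).
    jxx = ndimage.uniform_filter(gx * gx, 3)
    jyy = ndimage.uniform_filter(gy * gy, 3)
    jxy = ndimage.uniform_filter(gx * gy, 3)
    # Smaller eigenvalue of [[jxx, jxy], [jxy, jyy]] at every pixel.
    tr, det = jxx + jyy, jxx * jyy - jxy * jxy
    lam_min = tr / 2 - np.sqrt(np.maximum(tr**2 / 4 - det, 0))

    thresh = quality * lam_min.max()       # pre-defined threshold
    ys, xs = np.mgrid[step // 2:frame.shape[0]:step,
                      step // 2:frame.shape[1]:step]
    keep = lam_min[ys, xs] > thresh
    return np.stack([ys[keep], xs[keep]], axis=1)
```

On a perfectly homogeneous frame every eigenvalue is zero, so no points survive, which is exactly the intended pruning behaviour.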
We apply Fisher Vectors (FVs)  to encode all the features into high-dimensional representations, which have been proved effective for action recognition in previous works [53, 54, 22]. We firstly train a Gaussian Mixture Model (GMM) with parameters $\{w_k, \boldsymbol{\mu}_k, \boldsymbol{\sigma}_k\}_{k=1}^{K}$ for each type of features, where $w_k$, $\boldsymbol{\mu}_k$, $\boldsymbol{\sigma}_k$ and $K$ respectively denote the mixture weight, mean vector, standard deviation vector (diagonal covariance) and the number of Gaussians. Then, the FV can be computed in the following form:

$$\mathcal{G}_{\boldsymbol{\mu}_k} = \frac{1}{N\sqrt{w_k}} \sum_{n=1}^{N} \gamma_n(k)\,\frac{\boldsymbol{x}_n - \boldsymbol{\mu}_k}{\boldsymbol{\sigma}_k}, \qquad \mathcal{G}_{\boldsymbol{\sigma}_k} = \frac{1}{N\sqrt{2w_k}} \sum_{n=1}^{N} \gamma_n(k)\left[\frac{(\boldsymbol{x}_n - \boldsymbol{\mu}_k)^2}{\boldsymbol{\sigma}_k^2} - 1\right],$$

where $N$ is the number of interest points or trajectories within a sliding window and $\gamma_n(k)$ is the soft assignment weight of descriptor $\boldsymbol{x}_n$ to the $k$-th Gaussian: $\gamma_n(k) = w_k u_k(\boldsymbol{x}_n) / \sum_{j=1}^{K} w_j u_j(\boldsymbol{x}_n)$. We concatenate $\mathcal{G}_{\boldsymbol{\mu}_k}$ and $\mathcal{G}_{\boldsymbol{\sigma}_k}$ after having applied power normalisation, followed by $\ell_2$ normalisation, to each of them. Finally, we create a view-specific feature for each sliding window by concatenating the FVs computed from all the features, as shown in Fig. 1.
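The encoding step above can be sketched with a diagonal-covariance GMM. The descriptor dimensionality and number of Gaussians below are illustrative, and the power/ℓ2 normalisation is left to the caller:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fisher_vector(x, gmm):
    """Encode a set of local descriptors x (N, D) into a Fisher Vector
    using a fitted diagonal-covariance GMM -- a sketch of the standard
    first/second-order FV described in the text.
    """
    w, mu = gmm.weights_, gmm.means_           # (K,), (K, D)
    sigma = np.sqrt(gmm.covariances_)          # (K, D) std dev (diag cov)
    gamma = gmm.predict_proba(x)               # (N, K) soft assignments
    n = x.shape[0]

    fv = []
    for k in range(len(w)):
        diff = (x - mu[k]) / sigma[k]          # (N, D)
        g_mu = gamma[:, k:k + 1] * diff
        g_sig = gamma[:, k:k + 1] * (diff**2 - 1)
        fv.append(g_mu.sum(0) / (n * np.sqrt(w[k])))
        fv.append(g_sig.sum(0) / (n * np.sqrt(2 * w[k])))
    return np.concatenate(fv)                  # length 2*K*D

# Usage: fit the GMM on training descriptors, then encode each window.
rng = np.random.default_rng(0)
train = rng.normal(size=(500, 8))              # toy descriptors (assumption)
gmm = GaussianMixture(n_components=4, covariance_type='diag',
                      random_state=0).fit(train)
fv = fisher_vector(rng.normal(size=(60, 8)), gmm)   # shape (2*4*8,)
```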
II-B Multi-view Latent-Attention Dynamic Discriminative Model
In our model, we denote the input as a set of multi-view sequences $X = \{X^{(v)}\}_{v=1}^{V}$, where each $X^{(v)} = (\boldsymbol{x}^{(v)}_1, \dots, \boldsymbol{x}^{(v)}_T)$ consists of an observation sequence of length $T$ from the $v$-th view. Each $\boldsymbol{x}^{(v)}_t$ is associated with a label $y_t$ at the timestamp $t$. Similar to MV-LDCRF, which extends LDCRF (as shown in Fig. 2) to model the sub-structure of multi-view sequences, we also use latent variables. However, different from their method, where the hidden variables are contemporaneously connected between views as shown in Fig. 2, we instead introduce a set of higher-level latent variables for deep view-shared representations. In addition, since there are strong dependencies across the output labels (for example, social behaviours often switch back and forth between ‘approach’ and ‘walk away’ in our test videos), we add edges between the neighbouring labels to encode the temporal transition of social behaviours, as shown in Fig. 2. Let $H = \{H^{(v)}\}_{v=1}^{V}$, where each $H^{(v)}$ is a hidden state sequence of length $T$, model the view-specific sub-structure, and let $Z$
model the deep view-shared sub-structure. We are interested in modelling the conditional probability $P(Y \mid X)$ parameterised by $\theta$, where $Y = (y_1, \dots, y_T)$ is a sequence of labels. The conditional distribution with latent variables $H$ and $Z$ can be modelled as follows:
To describe the relationships between the random variables, we represent our model as a Markov random field (an undirected graph) over $Y$, $Z$, $H$ and $X$. Its edges comprise: edges connecting neighbouring labels; edges connecting the view-shared latent variables $Z$ with the view-specific latent variables $H$; edges connecting view-specific latent variables; and edges connecting view-specific latent variables with the observation sequences $X$. Based on the global Markov property, variables $Y$ and $H$ are conditionally independent given variables $Z$, as shown in Fig. 2. We also observe that variables $Z$ and $X$ are conditionally independent given variables $H$. Hence, we can express our model as:
The first three terms are energy functions, to be defined later; the remaining terms are partition functions for normalisation.
II-B1 Energy functions
where the two feature functions are defined on the edges of the label chain and the label-latent edges, encoding the relationship between the neighbouring labels and between variables $Y$ and $Z$, respectively. We represent the former as a transition score matrix $\boldsymbol{A}$, in which element $A_{ij}$ denotes the transition score from label $i$ to label $j$ in the next timestamp. The latter is represented as an inner product $\boldsymbol{w}_{y_t}^{\top}\boldsymbol{z}_t$, where $\boldsymbol{w}_{y_t}$ is a weight vector and the inner product can be interpreted as a measure of the plausibility of label $y_t$ given $\boldsymbol{z}_t$.
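Read this way, the label-chain energy is a sum of transition-matrix lookups plus inner products. A hypothetical sketch (the names `A`, `W` and the latent summaries follow the notation assumed above and are illustrative):

```python
import numpy as np

def label_energy(labels, feats, A, W):
    """Sketch of the label-chain energy: a transition score A[y_t, y_{t+1}]
    between neighbouring labels, plus an inner product <W[y_t], z_t> scoring
    the plausibility of label y_t given the latent summary z_t.
    labels: list of int labels; feats: (T, D) latent summaries;
    A: (C, C) transition matrix; W: (C, D) label weight vectors.
    """
    pairwise = sum(A[labels[t], labels[t + 1]]
                   for t in range(len(labels) - 1))
    unary = sum(float(W[labels[t]] @ feats[t])
                for t in range(len(labels)))
    return pairwise + unary
```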
The energy function for our model is:
where the third energy term encodes the relationship between variables $Z$ and $H$. In it, we assume the hidden states from the different views are conditionally independent given the latent variable $\boldsymbol{z}$. The latent variable $\boldsymbol{z}$
is used to represent the multi-view data. A common probabilistic graphical model for representing multi-view data is the deep Boltzmann machine (DBM), which stacks restricted Boltzmann machines (RBMs) as building blocks. However, as described in , the latent variables are preferred to be binary when we use RBMs; if both the visible and the latent variables are Gaussian, the instability of training RBMs becomes worse. Moreover, it is computationally expensive to train RBMs on high-dimensional data because of the Monte Carlo practice. Recently, variational autoencoders (VAEs) have been proposed to overcome these problems. However, how to extend VAEs to handle multi-view data remains an open challenge. Here, we introduce a multi-view latent-attention variational autoencoder (MLVAE) (see Fig. 3) that uses a multi-Gaussian inference model in combination with latent attention networks to solve the multi-view inference problem.
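The multi-Gaussian inference at the heart of such a model combines the per-view Gaussian posteriors by a product of experts, in which precisions add and the joint mean is precision-weighted. A minimal sketch (diagonal covariances assumed; a standard-normal prior expert could be added the same way), together with the closed-form KL regulariser to the standard-normal prior:

```python
import numpy as np

def product_of_gaussians(mus, vars_):
    """Combine per-view diagonal-Gaussian posteriors q(z|x_v) into one
    view-shared Gaussian by a product of experts: precisions add, and the
    joint mean is the precision-weighted mean of the per-view means.
    mus, vars_: arrays of shape (V, D).
    """
    prec = 1.0 / vars_                       # per-view precisions
    var = 1.0 / prec.sum(0)                  # joint variance
    mu = var * (prec * mus).sum(0)           # precision-weighted mean
    return mu, var

def kl_to_standard_normal(mu, var):
    """KL( N(mu, diag(var)) || N(0, I) ) for a D-dimensional diagonal
    Gaussian: 0.5 * sum(var + mu^2 - 1 - log var)."""
    return 0.5 * np.sum(var + mu**2 - 1.0 - np.log(var))
```

For two views with means 0 and 2 and unit variances, the fused posterior has mean 1 and variance 0.5: agreement across views tightens the joint estimate.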
Since Eq. (5) needs to marginalise the latent variables $Z$ and $H$ to derive $P(Y \mid X)$, its computational complexity is exponentially proportional to the cardinality of $Z$ and $H$. To perform the inference efficiently, following the approximation used in greedy layer-wise learning for deep belief nets reported in , we formulate:
We then adopt variational inference (VI), a popularly used method in Bayesian inference, which handles high-dimensional data more efficiently than the MCMC used in RBMs. Following VI, we approximate the true posterior with a variational distribution. Then, we minimise the difference between those two distributions using the Kullback–Leibler (KL) divergence metric, which is formulated as follows:
However, computing the evidence requires exponential time, as it needs to be evaluated over all the configurations of the latent variables. In order to avoid computing it, we reformulate Eq. (12) as an objective function:
where the variational distribution factorises over its parameters under our conditional independence assumption, and each view has a generative network with its own parameters.
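The step from the KL objective to a tractable one rests on a standard identity, written out here with a generic variational distribution $q(\boldsymbol{z})$ and evidence $p(\boldsymbol{x})$ (a sketch of the usual variational argument, not the paper's exact equations):

```latex
\begin{aligned}
\mathrm{KL}\big(q(\boldsymbol{z})\,\|\,p(\boldsymbol{z}\mid\boldsymbol{x})\big)
  &= \mathbb{E}_{q}\!\big[\log q(\boldsymbol{z}) - \log p(\boldsymbol{z}\mid\boldsymbol{x})\big] \\
  &= \log p(\boldsymbol{x})
   - \underbrace{\mathbb{E}_{q}\!\big[\log p(\boldsymbol{x},\boldsymbol{z}) - \log q(\boldsymbol{z})\big]}_{\text{ELBO}} .
\end{aligned}
```

Since $\log p(\boldsymbol{x})$ is constant with respect to $q$, minimising the KL divergence is equivalent to maximising the ELBO, which never requires evaluating $p(\boldsymbol{x})$ itself.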
The prior over the latent variable is specified as a standard normal distribution. With the derivation in Supplementary A, we obtain the posterior as:
That is, the posterior takes the form of a product of the individual per-view posteriors divided by the prior. We approximate each per-view posterior with the inference network of the corresponding view. For simplicity, each of these is presumed Gaussian, with its own mean and variance. Then, the product can be computed as follows:
where $D$ is the dimensionality of the latent variable, $(\cdot)^{\top}$ denotes the transpose operation and $\boldsymbol{I}$ is a $D$-by-$D$ identity matrix. The joint posterior can then be represented as:
We observe that the combined posterior is still Gaussian, with the mean and variance given above. Hence, the KL divergence between the approximate posterior and the prior in Eq. (13) can be computed as follows:
[Fig. 4: video frames from (a) CRIM13 and (b) our PDMB dataset, and (c) the camera locations]
where $\mathrm{tr}(\cdot)$ is the trace function summing the diagonal elements of a matrix, and $|\cdot|$ is the determinant, which for a diagonal matrix can be computed as the product of its diagonal entries. The whole model can be trained by maximising our ELBO. Although our current model can learn joint representations of the multi-view data, there is still some information that cannot be acquired from every view. As discussed in , the top view is suitable for detecting behaviours like ‘chase’ and ‘walk away’, while other behaviours, e.g. ‘drink’ and ‘eat’, are best recognised from the side view. To utilise such view-private information, we adopt a latent attention network to learn attention weights for both view-shared and view-specific latent variables. For instance, given $V$ views, we can compute $V$ view-specific latent variables and one view-shared latent variable. Hence, the expectation in Eq. (10) can be calculated as:
where each latent variable receives a score based on its relevance to the behavioural label. We calculate this score as follows:
where the attention score measures the relationship between the latent variable and the behavioural label.
Here, $\mathrm{emb}(\cdot)$ is a word embedding function which is widely used in natural language processing, and the weight matrix $\boldsymbol{W}$ is a parameter to be learned.
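The attention computation can be sketched as a softmax over scores $e_i = \boldsymbol{h}_i^{\top} \boldsymbol{W}\,\mathrm{emb}(y)$. Everything here (shapes, the given embedding vector) is illustrative, since $\boldsymbol{W}$ and the embedding are learned in the real model:

```python
import numpy as np

def attention_weights(latents, label_emb, W):
    """Softmax attention over view-specific and view-shared latent
    variables: each latent h_i gets a score e_i = h_i^T W emb(y) measuring
    its relevance to the behavioural label, and the weights are softmax(e).
    latents: (M, D) stacked latents; label_emb: (E,); W: (D, E).
    """
    scores = latents @ W @ label_emb        # (M,) relevance scores
    scores = scores - scores.max()          # numerical stability
    exp = np.exp(scores)
    return exp / exp.sum()                  # weights sum to 1
```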
To calculate the expectation in Eq. (11), we adopt the classical LSTM. Then, we have:
The hidden states are defined as in a traditional Recurrent Neural Network (RNN), while the LSTM has an extra state, called the cell, which is protected and controlled by three gates. Hence, the hidden state can be calculated as below:
where the gating functions and their parameters have the same definitions as in the standard LSTM, and the remaining weight matrices and bias terms are parameters to be learned.
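For reference, one step of the classical LSTM cell adopted here can be sketched as follows (stacking the four gates into one weight matrix is an implementation assumption, not a detail from the text):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, Wx, Wh, b):
    """One step of a standard LSTM cell: input, forget and output gates
    protect and control the extra cell state c.
    x: (D,), h, c: (H,), Wx: (4H, D), Wh: (4H, H), b: (4H,).
    """
    H = h.shape[0]
    z = Wx @ x + Wh @ h + b
    i = sigmoid(z[:H])            # input gate
    f = sigmoid(z[H:2 * H])       # forget gate
    o = sigmoid(z[2 * H:3 * H])   # output gate
    g = np.tanh(z[3 * H:])        # candidate cell update
    c_new = f * c + i * g         # gated cell state
    h_new = o * np.tanh(c_new)    # gated hidden state
    return h_new, c_new
```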
III-A Video database
III-A1 CRIM13 dataset
In this section, we firstly give an overview of a publicly available multi-view mouse social behaviour dataset: the Caltech Resident-Intruder Mouse (CRIM13) dataset. This dataset was used to study neurophysiological mechanisms in the mouse brain. It consists of 237×2 videos recorded using synchronised top- and side-view cameras with a resolution of 640×480 pixels and a frame rate of 25 Hz. Each video lasts around 10 min and was annotated frame by frame. There are 12+1 different mutually exclusive behaviour categories, i.e. 12 behaviours plus one otherwise unspecified behaviour. Fig. 4(a) shows video frames of the approaching behaviour in both top and side views. The occurrence probabilities of the behaviours are expressed as percentages in Fig. S1(a). The behaviours in CRIM13 are highly imbalanced: except for ‘other’ (56.0%), the most frequent behaviour is ‘sniff’ (13.9%), and the least frequent are ‘circle’ and ‘drink’ (only 0.4% each).
III-A2 PDMB dataset
In this paper, we introduce a new dataset, collected in collaboration with biologists at Queen’s University Belfast, United Kingdom, for a study on motion recordings of mice with Parkinson’s disease (PD). The neurotoxin 1-methyl-4-phenyl-1,2,3,6-tetrahydropyridine (MPTP) is used as a model of PD, and has become an invaluable aid for producing experimental parkinsonism since its discovery in 1983 [47, 20, 19, 13]. Six C57bl/6 female mice received MPTP treatment while another six wild-type female mice were used as controls. All mice used throughout this study were housed (3 mice of the same type per cage) in a controlled environment with a constant temperature of () and light condition (long fluorescent lamp of 40 W), and under constant climatic conditions with free access to food and water (placed in the corner of the cage). All experimental procedures were performed in accordance with the Guidance on the Operation of the Animals (Scientific Procedures) Act, 1986 (UK) and approved by the Queen’s University Belfast Animal Welfare and Ethical Review Body.
The proposed dataset consists of 12×3 annotated videos (6 videos for MPTP-treated mice and 6 videos for control mice) recorded using three synchronised Sony Action cameras (HDR-AS15) (one top-view and two side-view) with a frame rate of 30 fps and a video resolution of 640×480 pixels. Fig. 4(b) and (c) show video frames of the approaching behaviour in the three views and the locations of our cameras. All videos (216,000×3 frames in total) contain 8+1 behaviours of two freely behaving mice, and each video lasts around 10 minutes. The occurrences of the activities of a healthy mouse are shown in Fig. S1(b).
III-B View-specific feature representation
To extract view-specific features, sliding windows are centred at each frame, within which all types of view-specific features are computed. The method for extracting view-specific features is adapted from previous works on single-view mouse behaviour recognition [22, 23]. We adopt spatio-temporal and trajectory-based motion features, as both result in satisfactory performance . More technical details can be found in the Methods section.
Fig. 5: Receiver Operating Characteristic (ROC) curves of the classification outcome for the CRIM13 dataset. The classifiers are (a) Logistic Regression (LR), (b) Bernoulli naive Bayes (BNB), (c) 5-nearest neighbours (KNN), (d) AdaBoost with the base estimator of Random Forest (AdaB), (e) Random Forest (RF), and (f) Support Vector Machine with a linear kernel (SVM).
| Feature extraction method | | LR | BNB | KNN | AdaB | RF | SVM | Average |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Trajectory-based motion features | IDT | 29.7% | 28.0% | 22.7% | 24.2% | 34.3% | 22.4% | 26.9% |
To evaluate the contribution of these features towards the recognition of mouse behaviours, as an example, we examine different classifiers on the sliding windows collected from the top-view videos. These approaches rely neither on multi-view feature fusion nor on the temporal context of mouse behaviours derived from the view-specific features. To this end, we collect a subset of the CRIM13 dataset which was also used in  for analysing their feature extraction method. This small validation dataset includes 20 top-view videos randomly chosen from the whole dataset and is evenly divided into training and testing sets. We assess some of the most widely used trajectory-based motion features, spatio-temporal features and their combinations. In the approaches based on trajectory-based motion features, we use the established Improved Dense Trajectory (IDT) technique, which densely samples image points and tracks them using optical flow. In the evaluation, we deploy the default trajectory length of 15 frames. For each trajectory, we compute the Histograms of Oriented Gradients (HOG), Histograms of Optical Flow (HOF) and Motion Boundary Histograms (MBH) descriptors proposed in . The final dimensions of the descriptors are 96 for HOG, 108 for HOF and 192 for MBH. Another trajectory-based motion feature extraction approach in our assessment is the trajectory-pooled deep convolutional descriptor (TDD)
. The goal of TDD is to combine the benefits of both trajectory-based and deeply learned features. This local trajectory-aligned descriptor is computed from the spatial and temporal nets. Following their default settings, we use the descriptors from the conv4 and conv5 layers for the spatial nets, and the conv3 and conv4 layers for the temporal nets. These networks are pre-trained on ImageNet and fine-tuned on the UCF-101
dataset. Finally, we concatenate these descriptors and reduce the dimensionality of the vector using Principal Component Analysis (PCA) (256 components are kept as default).
The majority of papers published so far have shown the promising performance of the above approaches on human action datasets, but very few have explored mouse behaviours. Popularly used spatio-temporal feature extraction approaches include VF&CF, Harris3D, Cuboids, and LTP. In our experiments, all the parameters used in these approaches were set to the original configurations that give the best results in mouse behaviour recognition [22, 7]. We incorporate these features with individual classifiers and illustrate the classification results in Table II, where the classifiers include Logistic Regression (LR), Bernoulli naive Bayes (BNB), 5-nearest neighbours (KNN), Random Forest (RF), AdaBoost (AdaB) with the base estimator of RF, and Support Vector Machine (SVM) with a linear kernel. We also report their average accuracy in the table. The highlighted figures in the table demonstrate that the use of TDD, VF&CF and their combination usually results in the best classification accuracy. In particular, for BNB, KNN and SVM, the combined features achieve better accuracy than either feature used individually. The effectiveness of the other features is significantly lower than that of TDD and VF&CF. Note that VF&CF achieves 15.7% higher accuracy than Cuboids, which had been reported to achieve the best performance . It is also observed that features combined with IDT deteriorate the system performance; in fact, complementary features perform much better than casual feature combinations for improving system performance. Receiver Operating Characteristic (ROC) curves of the individual classifiers with different feature combinations, and their areas under the curve (AUC), are shown in Fig. 5. We also observe that the combination of TDD and VF&CF has the highest AUC, the best performance for each classifier.
III-C Social Behaviour Recognition
In our system, for efficiency, all the view-specific features are computed from a small sliding window (40 frames) centred at each frame. Our system aims to assign every sliding window to one of the pre-defined behaviour categories. For this challenging task, the temporal and view contexts of each specific behaviour are fully utilised. To do so, we propose a novel Multi-view Latent-Attention Dynamic Discriminative Model that includes (1) modelling of the temporal relationship of image frames for each segment, (2) modelling of the relationship between views, and (3) modelling of the correlations between the labels in neighbouring regions. Details of the system implementation are provided in the Methods section. For efficiency and simplicity, we divide the experiments in this section into two parts: View-Shared and View-Attention Behaviour Recognition.
Traditionally, classification accuracy is defined as the percentage of samples that are correctly labelled out of the overall number of samples. While a valid measure, this metric cannot reflect per-behaviour performance on datasets with a severely imbalanced class distribution. To better measure the system performance, we here use the average recognition rate per behaviour.
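Concretely, this metric averages per-behaviour recalls so that rare behaviours count as much as frequent ones; a minimal sketch:

```python
import numpy as np

def average_recognition_rate(y_true, y_pred, n_classes):
    """Average recognition rate per behaviour: the mean of per-class
    recalls, so that rare behaviours weigh as much as frequent ones.
    Classes absent from y_true are skipped.
    """
    rates = []
    for c in range(n_classes):
        mask = (y_true == c)
        if mask.any():
            rates.append((y_pred[mask] == c).mean())
    return float(np.mean(rates))
```

For `y_true = [0, 0, 0, 1]` and `y_pred = [0, 0, 1, 1]`, the overall accuracy is 75% while the average recognition rate is (2/3 + 1)/2 ≈ 83.3%; on a severely imbalanced dataset the two can diverge sharply, which is why the per-behaviour average is the fairer measure.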
III-C1 View-Shared Behaviour Recognition
In this experiment, we leave out the view-specific features and only use the learned view-shared features. For a fair comparison, we adopt the same classifier (i.e. a linear SVM) and compare its recognition results with those of canonical correlation analysis (CCA) , kernel CCA (KCCA)  and deep CCAs , resulting in Table III. It is worth pointing out that CCA measures the linear relationship between two views in the projected space. KCCA is an extension of standard CCA in which the explicit mapping to the feature space is avoided and the correlation is computed in the feature space by replacing scalar products with a kernel function in the input space. We adopt Gaussian and polynomial kernels for the comparison in this study. DCCA  addresses the scalability issue using deep learning, and we vary the number of nodes in its output layer from 50 to 150 in our experiment for deeper exploration. As shown in Table III, our approach achieves the best recognition rate for 11 out of 12 behaviours, significantly better than the other state-of-the-art approaches, which also demonstrates the effectiveness of our learned features. Moreover, using variational inference (more details can be found in the Methods section), our model can effectively mitigate overfitting, a strength when dealing with imbalanced datasets.
| Behaviour | PBMV | KCCA (Gaussian) | DCCA | BILSTM | DCLSTM | Burgos-Artizzu et al. | Ours (View-shared) | Ours (View-attention) | Ours (without label correlation) | Ours (with label correlation) |
III-C2 View-Attention Behaviour Recognition
This experiment is conducted with both the view-specific and view-shared features, where the former capture the unique dynamics of each view whilst the latter encode the interactions between the views. In our proposed model, attention scores are automatically learned to measure the contribution of each view-specific and view-shared feature to the recognition of mouse behaviours. Our view-attention behaviour recognition approach is compared against existing approaches such as [41, 17, 27, 1, 12, 56]. PBMVboost  is a two-level multi-view learning approach which learns the distribution over view-specific classifiers or views in a single step by boosting. The number of iterations used in PBMVboost is set to 100 with a tree depth of 13 (the number of classes), determined experimentally. CCA, KCCA and DCCA can capture the correlation between the representations from different views, but how to utilise the view-specific information is not addressed in these approaches. BILSTM and DCLSTM
are two Long Short-Term Memory (LSTM) based approaches, with their hyperparameters set to the optimal values (epochs: 100, batch size: 50, learning rate: 0.001). The importance of modelling the correlations between neighbouring labels in our approach is also evaluated.
Table IV shows that our approach with label-correlation modelling achieves the highest averaging accuracy of 71.7%. Our view-attention approach outperforms the view-shared approach, suggesting the effectiveness of adding the attention model to the framework. Without view-specific features, the shared features alone are not discriminative enough for satisfactory classification, especially when features are not shared across different views. Methods such as [12, 41, 56] also exploit view-specific features, but they treat the features across views equally and thus cannot properly weigh the importance of features collected from different views. In Table IV, we also observe that BILSTM and DCLSTM perform poorly (accuracy lower than 20%) in recognising ‘copulation’ and ‘walk away’.
The importance of modelling label correlation is clearly demonstrated in Figs. S2 and 6. Our two approaches achieve superior performance over all the other approaches, demonstrating the benefits of label-correlation modelling and attention modelling in this experiment. Fig. S3
shows the average agreement rates of our approaches over 2-, 4- and 6-minute intervals. For statistical analysis, two-sample t-tests and paired t-tests were performed under the assumption of Gaussian errors; Wilcoxon signed-rank tests were also used to examine this assumption. All the testing results suggest that our method with label correlation significantly improves the average agreement rate (). Furthermore, we do not see any significant difference in the mean average agreement rates over the various intervals, as shown in Fig. S3(a), (b) and (c), suggesting that the performance of our approaches does not degrade over time. In addition, our approach is robust against viewpoint variations and can achieve satisfactory performance in multi-view recognition.
To demonstrate the versatility of our proposed approach under different laboratory settings, we here use the proposed system to discriminate the behaviours of control mice and MPTP-treated mice in the Parkinson’s disease dataset. Similar to CRIM13, the whole dataset is evenly divided into training and testing sets. Fig. S4 shows the agreement between the labelling results of our MV-LADDM model and the expert annotators on the testing set. The agreement is satisfactory for most behaviours, whereas 18% of the ‘approach’ behaviour is incorrectly classified as ‘walk away’, 18% of the ‘sniff’ behaviour is incorrectly classified as ‘up’, and 16% of the ‘up’ behaviour is incorrectly classified as ‘walk away’. Nevertheless, compared with the other methods, our approach still achieves the highest averaging accuracy of 71.9% and the best performance for 7 out of 8 behaviours, as shown in Table V. Experiments on both datasets thus show a high agreement rate for the proposed model. To demonstrate the applicability of the proposed system to behaviour phenotyping of the MPTP mouse model of Parkinson’s disease, we analyse the behaviour frequencies measured over a 60-min period for the MPTP-treated mice and their control strains in Fig. 7. We observe that the MPTP-treated mice, compared to the control group, engage less in ‘up’, ‘circle’, ‘clean’ and ‘approach’ and more in ‘sniff’.
Table V columns: Behaviour | PBMV | KCCA (Gaussian) | DCCA | BILSTM | DCLSTM | Ours (all)
IV. Discussion and Conclusion
Automated social behaviour recognition for mice is an important problem due to its clear benefits: repeatability, objectivity, consistency, efficiency and cost-effectiveness. Traditional automated systems use sensors such as infrared sensors, radio-frequency identification (RFID) transponders and photobeams, or single 2D cameras. However, such sensor-based or single-view approaches are limited in their ability to recognise complex mouse behaviours. In contrast, multi-view behaviour recognition systems have demonstrated their potential to recognise mouse behaviours under occlusion.
Here, we have proposed a deep probabilistic model to perform multi-view quantification of social behaviour in mice. Our approach jointly models the temporal relationship between frames in each view, the relationship between views, and the correlation between labels in neighbouring areas. Moreover, our system utilises both view-shared and view-specific features to accurately characterise mouse social behaviours despite the distance between viewpoints and appearance variations.
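The fusion of view-shared and view-specific feature streams with learned per-stream contributions can be illustrated in a few lines. This is a minimal numpy sketch, assuming made-up feature dimensions and random placeholder weights; it is not the learned parameterisation of our model.

```python
# Sketch: weighted fusion of view-specific and view-shared features.
# Dimensions and "learned" weights are placeholders for illustration.
import numpy as np

rng = np.random.default_rng(0)

n_views, d = 2, 8
view_specific = [rng.normal(size=d) for _ in range(n_views)]  # one per camera
view_shared = rng.normal(size=d)                              # common stream

# Per-stream contribution weights; softmax keeps them positive and
# summing to one, so each stream's contribution is explicit.
logits = rng.normal(size=n_views + 1)
weights = np.exp(logits) / np.exp(logits).sum()

fused = weights[-1] * view_shared
for w, feat in zip(weights[:-1], view_specific):
    fused = fused + w * feat

print(fused.shape)
```

The key design point mirrored here is that the weights are free parameters rather than fixed to be equal, so the model can downweight an uninformative view.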
We benchmarked every component of our approach separately. The performance of various feature extractors for mouse behaviour recognition was first evaluated on the CRIM13 dataset. The experimental results show that the combination of TDD and VF&CF has the highest AUC value and accuracy, outperforming the other combined features and each feature used individually. This suggests that the multiplicity and complementarity of heterogeneous features provide significant benefit in the study of mouse behaviour. To verify the effectiveness of the view-shared substructure in our model, our system was tested independently and also compared to other methods with view-shared feature representations. We show that our approach achieves the best performance for 11 out of 12 behaviours. Thanks to variational inference, our model can effectively handle imbalanced datasets. Modelling label correlation is also demonstrated to yield 6% higher average accuracy than the model without it. The statistical significance of our results is confirmed by our statistical analysis using the two-sample t-test, paired t-test and Wilcoxon signed-rank test. We also demonstrate that the performance of our approaches does not deteriorate over time. Compared to the other state-of-the-art methods, whose best average accuracy is 62.6%, our best model (with label correlation) achieves a significantly better average accuracy of 71.7%. A major advantage of our proposed method is that our model can automatically learn the contribution of each view-specific and view-shared feature, whereas the comparative approaches treat the features across views equally. In addition, we provide a new multi-view video dataset for motion monitoring of mice with Parkinson’s disease.
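The two headline metrics in this comparison, accuracy and AUC, can be computed without any learning machinery. The sketch below uses the pairwise-count (Mann-Whitney) formulation of binary ROC AUC; the helper names and toy labels are our own illustration, not the paper's evaluation code.

```python
# Sketch of the evaluation metrics: classification accuracy and a
# binary ROC AUC computed from pairwise score comparisons.
def accuracy(y_true, y_pred):
    """Fraction of frames whose predicted label matches the annotation."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def auc(y_true, scores):
    """ROC AUC: fraction of (positive, negative) pairs ranked correctly,
    counting ties as half. y_true holds 0/1 labels."""
    pos = [s for s, y in zip(scores, y_true) if y == 1]
    neg = [s for s, y in zip(scores, y_true) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy one-vs-rest example for a single behaviour class.
labels = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.3, 0.2]
print(auc(labels, scores))
```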
We also validated our system on the PDMB dataset in two important respects: the generalisation ability of the proposed deep graphical model to new datasets, and the applicability of the proposed system to behaviour phenotyping of the MPTP mouse model of Parkinson’s disease.
In addition, our experiments show that our spatio-temporal and trajectory-based motion features are still insufficient to distinguish between similar behaviours such as ‘drink’ and ‘eat’, or ‘approach’ and ‘walk away’. We believe better performance is achievable with (a) a better-coordinated multi-camera system that shares visual information across views, and (b) the development of characteristic features that capture mouse posture for motion identification.
In summary, we describe, to our knowledge, the first deep graphical model that integrates features extracted from video recordings of multiple views to perform automated quantification of the social behaviours of freely interacting mice in a home-cage environment. The proposed approach has the potential to be a valuable tool for quantitative phenotyping of complex behaviours, including those in studies of mice with neurodegenerative diseases.
Figure panels: (a) CRIM13; (b) our PDMB dataset.
The p-value determines the statistical significance of the results and gives the probability, under the null hypothesis, of observing a result at least as extreme as ours if our method with label correlation does not improve the average agreement rate. If the p-value is less than the significance level (typically set to 0.05), the null hypothesis is rejected. In our statistical analysis, the two-sample t-test, paired t-test and Wilcoxon signed-rank test are performed to measure the p-value. Since p < 0.05, our null hypothesis is therefore rejected.