
Detection and Tracking of Multiple Mice Using Part Proposal Networks

by Zheheng Jiang, et al.

The study of mouse social behaviours has been increasingly undertaken in neuroscience research. However, automated quantification of mouse behaviours from the videos of interacting mice is still a challenging problem, where object tracking plays a key role in locating mice in their living spaces. Artificial markers are often applied for multiple mice tracking, which are intrusive and consequently interfere with the movements of mice in a dynamic environment. In this paper, we propose a novel method to continuously track several mice and individual parts without requiring any specific tagging. Firstly, we propose an efficient and robust deep learning based mouse part detection scheme to generate part candidates. Subsequently, we propose a novel Bayesian Integer Linear Programming Model that jointly assigns the part candidates to individual targets with necessary geometric constraints whilst establishing pair-wise association between the detected parts. There is no publicly available dataset in the research community that provides a quantitative test-bed for the part detection and tracking of multiple mice, and we here introduce a new challenging Multi-Mice PartsTrack dataset that is made of complex behaviours and actions. Finally, we evaluate our proposed approach against several baselines on our new datasets, where the results show that our method outperforms the other state-of-the-art approaches in terms of accuracy.





1 Introduction

In neuroscience research, mouse models are valuable tools to understand the pathology and development of neurological conditions such as Alzheimer’s and Parkinson’s diseases [1, 2, 3, 4, 5, 6, 7]. However, manually annotating long video recordings is a time-consuming task. Furthermore, manual documentation suffers from a number of limitations, such as high subjectivity and poor replicability. Hence, there is an increasing interest in the development of systems for automated analysis of rodents’ behaviour from videos.

Definitions of social behaviours can vary to some extent, from all the behaviours that occur when two or more animals are present in the scene [8], to only behaviours in which one influences another [9]. Regardless of the definition, to automatically analyse social behaviours, discriminative features are often required: to record behaviours, to track the positions (or parts) of the participants, to identify individuals across time and space, and to quantify animal interactions. For automated behaviour analysis, a reliable and smart tracking method is required to associate the detected behaviours with the correct individuals. Moreover, accurate locations of mouse parts obtained from a stable tracking method enable the representation of interactions and allow for behavioural classification. Many applications, such as [10, 11, 12], analyse mouse behaviours using the tracking results of mice. Simultaneous tracking of two or more individuals poses a challenge in the computer vision community, because mice are mostly identical and highly deformable objects. In addition, social interaction between individuals makes the tracking task even more complicated due to frequent occlusions. A popular method to track individuals during interactions is to label each subject with a unique marker, e.g., by bleaching [13] or colouring [14]. Also, Galsworthy [12] monitor multiple mice in a single cage by using radio transmitters implanted under the skin and then record their activities by the detection coils. However, these systems are invasive for the tested subjects, and the labelling method in these systems very likely influences an individual’s social behaviours as it frequently provides an olfactory and/or visual stimulus [15]. Therefore, markerless identification is generally preferred.

Fig. 1: Architecture of the proposed tracking system. Firstly, we propose an efficient and robust deep learning based mouse part detection scheme to generate part candidates. Subsequently, we propose a novel Bayesian Integer Linear Programming Model that jointly assigns the part candidates to individual targets with necessary geometric constraints whilst establishing pair-wise association between the detected parts.

Several markerless approaches have been proposed to tackle this challenging problem. A common approach is to rely on suitable foreground segmentation, which separates the extracted foreground pixels into several spatially connected groups using clustering algorithms such as Expectation-Maximization for Gaussian Mixture Models ([16, 13, 17]) or watershed segmentation ([11]). However, if two mice are spatially close, in contact or occluding each other, it can be difficult to separate them only based on foreground detection. Moreover, the location estimation of mouse parts is unreliable when occlusion occurs. De Chaumont et al. [18] make use of prior knowledge when tracking the mouse parts, and develop a model that connects a set of geometrical primitives under physics-based principles and adapts the model’s parts to the moving mouse based on the defined physical engines. However, this type of method requires sophisticated skeleton models which are hard-coded in the system, which limits its flexibility.

The above methods belong to the family of detection-free tracking methods, which require manual initialization of the mouse location in the first frame, and then track mice and their parts in subsequent frames. Although these methods have no need of pre-trained object detectors, they are prone to drifts and identity switches. Recently, tracking-by-detection methods have become popular as they handle deformation and occlusion effectively. The essential idea is to first detect objects and then handle the data association problem over frames. This approach has been widely used for human tracking and pose estimation [19, 20, 21].

To address the part tracking problem of multiple mice in the context of tracking-by-detection, we break the original problem into three interdependent but also associated sub-problems. First, the positions of all the parts in each frame must be identified. Second, the detected parts must be assembled to form a physical mouse. Finally, the positions of the parts must be connected across image frames in order to generate trajectories.

Detecting mouse parts in the first sub-problem is challenging due to the small size and subtle local inter-class differences between image frames. To address this problem, we propose a mouse parts and body detection framework based on the features from deep neural networks. In fact, the use of deep neural networks has already obtained promising outcomes for part detection and pose estimation of humans in some challenging benchmarks [19, 20, 22]. This suggests that deep learning architectures can be applied to part detection of lab animals. For the second and third problems, we propose a Bayesian Integer Linear Program (BILP) model that resolves these problems by minimizing a joint objective function using Bayesian inference. In this way, we can handle targets-candidates assignment and pair-wise part association within a single cost function. In order to solve the challenging problems in mouse part detection and tracking, we here design and evaluate a novel framework based on part association. Figure 1 shows the flowchart of the proposed tracking system. In summary, we have the following contributions:

1. We propose a reliable detection framework that efficiently identifies the mouse body and parts and obtains competitive performance on detection benchmarks.

2. With the detection candidates generated by the proposed detector, a Bayesian Integer Linear Program Model is proposed to estimate the locations of all the mouse parts present in an image. The formulation is based on the targets’ assignment and pair-wise part association, subject to mutual consistency and exclusion constraints.

3. We formulate the parts’ localization as a Bayesian inference process that combines the output of the proposed detector with prior geometric models of mice. Unlike previous work such as [18], our prior on the configuration of the mouse parts is not hard-coded but derived from our large collection of labeled training samples. In addition to the geometric models, we also introduce motion cues to compensate for the missing appearance information.

4. Since there is no publicly available dataset that provides a foundation to quantitatively evaluate multi-mice part tracking, we also introduce a new competitive Multi-Mice PartsTrack dataset that comprises a set of video recordings of two or three mice in a home cage from the top view. Several common behaviours are included in this dataset, such as ‘approaching’, ‘following’, ‘moving away’, ‘nose contact’, ‘solitary’ and ‘pinning’. The ground truth of the part and body locations of mice was manually labeled for each image frame.

2 Related Work

In this section, we review the established approaches related to our proposed system. Section 2.1 reviews the existing methods for tracking multiple mice. Section 2.2 discusses the works used for multi-people tracking.

2.1 Mouse based Tracking

Typically, the separation of foreground and background can be used as the first step of a multi-subject tracking algorithm, e.g. [16, 13, 17, 11, 18, 23, 24, 25]. In mouse tracking, the knowledge of the foreground can be used to improve the accuracy of the tracking scheme.

Some existing approaches for tracking multiple mice focus on modelling the appearance of a mouse. For example, Hong et al.[25] first apply background subtraction and image segmentation from the top-view camera. They then fit an ellipse to each mouse in the foreground. Thus, the position and body orientation of each mouse is described by the fit ellipse. Twining et al.[23] find mice in a new image by fitting active shape models to different locations in the image and selecting the instance that best fits the models. Similarly, de Chaumont et al. [18] manually define shape models based on geometrical primitives and fit these models to images using a physical engine. Although these methods are fast and work well, the flexibility of such methods is limited since they require sophisticated skeleton models. Moreover, if two mice are touching, or one mouse is occluding the other, it can be difficult to separate them based only on the shapes of the blobs of foreground pixels. Alternatively, Pérez-Escudero et al. [17] use fingerprints extracted from image frames in which the mice are not interacting. The fingerprints were then used to deal with occlusions and identity shifts.

In general, motion is a useful cue for dealing with occlusion. Model-based tracking approaches can benefit the identification by incorporating motion cues into the model. A frequently adopted tracking method that combines both appearance and motion is based on particle filtering. For example, Pistori et al. [24] extend a particle filtering approach with certain variations on the observation model to track multiple mice from the top view. Branson et al.[16] present a particle filtering algorithm by tracking the contours of multiple mice and acquire the images of the targets from a side view. The ability of a particle filter to correctly track mice depends on how well the observation model works with mice. Due to the highly deformable shapes of mice, it is very difficult to explicitly model the entire mouse.

Another solution to the problem of occlusions is to mark the mice. Shemesh et al. [26], for example, apply fluorescent paints that light up in different colors under UVA light, and Ohayon et al. [13] dye the fur with different patterns of strokes and dots. An alternative to visible markers is radio-frequency identification (RFID), which is used by Weissbrod et al. [27] to identify individuals in combination with video data. However, these markers must be placed before recording and may be a potential distraction to mice. To deal with these issues, Giancardo et al. [11] use a high-resolution thermal camera to detect minor differences in body temperature. 3D tracking can also disambiguate occlusions by using depth cameras (Hong et al. [25]) or multiple video cameras at different viewpoints (Sheets et al. [28]), which is a challenging solution requiring additional equipment and calibration work. A summary of mouse based tracking is given in Supplementary A.

2.2 People based Tracking

Human tracking is also a well studied topic in computer vision. Some of the techniques developed for human tracking may be applied to mouse tracking. Most filter-based tracking methods, such as the Kalman filter, particle filter and correlation filter, are well suited for online applications due to their recursive nature. Some approaches in this category focus on tracking a single target by model evolution [29, 30]. Others aim at training better object classifiers [31] or learning better target representations [32, 33]. However, these methods cannot guarantee a global optimum as they conduct tracking on each target individually.

To alleviate the problems of filter-based tracking methods, tracking-by-detection methods have been used in many applications. The fundamental idea is to first detect objects in each frame and then address the data association problem. Recent approaches in this category have focused on improving the performance of the object detectors or designing better data association techniques to improve tracking performance. For example, Shu et al. [34] propose an extension to the deformable part-based human detector [35] and utilize the visible part to infer the state of the whole object. For data-association optimisation, network flows ([36, 37]) and their variations ([38, 39]) have recently been established. This type of algorithm uses a directed acyclic graph (DAG) to form the detection hypotheses in each image frame and then finds the solution through the minimum-cost maximum-flow algorithm. Zhang et al. [36] showed that a promising solution to the network flow problem can be found in polynomial time and is highly efficient in practice. Pirsiavash et al. [37] also adopt network flow and use dynamic programming to search for a high quality sub-optimal solution. Different variations of network flows are also used. The authors of [38] propose a multi-commodity network flow to better incorporate the appearance consistency between groups of people. Dehghan et al. [39] solve detection and data association problems simultaneously by using a multi-commodity flow graph in an inner loop of a structured learning tracker. Compared against these methods, we consider targets assignment and pair-wise part association in our problem formulation. Also, we utilize Bayesian inference to construct correct correspondence between the proposals and the measurements. Moreover, we introduce a geometric constraint in the Bayesian inference to help reduce the ambiguity caused by nearby targets with similar appearance.

3 Proposed Mouse parts and body Detection

As stated before, a powerful part and body detector supplies the solid foundation for body tracking. Driven by the recent development in deep neural networks with ‘attention’ mechanisms, we here propose a novel mouse part and body detection framework which consists of two components: a multi-stage Part Proposal Network that generates suitable proposals of mouse parts and body, followed by a fully connected network that classifies these proposals using higher-resolution convolutional and Local Binary Pattern (LBP) features extracted from the original images.

3.1 Part Proposal Network (PPN) for Mouse parts and body Detection

Detecting mouse parts is non-trivial due to their small size and subtle local inter-class differences across images. To develop a strong detector, we first create a proposal generation network which directs the downstream classifier where to look. The region proposal network (RPN) in Faster R-CNN [40] and selective search [41, 42] are two state-of-the-art methods for object proposal generation. These methods were developed as class-agnostic proposal generators that aim to create bounding boxes that may contain objects. Without class-specific knowledge, these region proposals may not be accurate. The Deformable Part Model (DPM) [35] is another frequently used algorithm to generate small object proposals. DPM and its variants are class-specific methods based on Histogram of Oriented Gradients (HOG) features. In addition, most negative examples generated by these methods are easy examples, because they are randomly selected from the background. Unlike these methods, we adopt a multi-stage architecture to perform effective hard negative mining. With finer region proposals, the accuracy of the downstream classifier can be further improved. In this section, we describe the proposed multi-stage PPN (as shown in Fig. 2), which can generate high-quality object proposals from the convolutional feature maps.

Fig. 2: Architecture of the proposed Part Proposal Network.

Fig. 3: Confidence maps of the root boxes of the mouse head (first row) and tail base (second row) at Stage 1, Stage 2 and Stage 3 (columns). In stage 1, there is confusion between the mouse head and the tail. Moreover, many proposals are located on the background. However, the estimates are increasingly refined in the later stages, as shown in the highlighted areas.

We adopt the VGG-19 net [43], pre-trained on the ImageNet dataset [44], as the backbone network. The image is first analysed by a convolutional network (initialized by the first 15 layers of the VGG-19 net [43] and fine-tuned for the region proposal task) to generate a set of feature maps that is input to our proposed PPN. We use a partial VGG-19 network and add two further trainable convolutional networks after it, which aim to extract useful features and generate feature maps for the subsequent proposal generation tasks. To generate the region proposals, we establish a multi-stage class-specific proposal generation network. In each stage, we create a new network including an intermediate 3×3 convolutional layer sliding on the feature maps, followed by two sibling 1×1 convolutional layers for classification and bounding box regression respectively. For each sliding window in the 3×3 convolutional layer, we simultaneously predict multiple region proposals (called root boxes) in the original image. In our method, the scales and aspect ratios of the root boxes differ for individual objects and depend on the object size in the image. The classification layer provides confidence scores of the root boxes, which can be used to mine and add negative root boxes into a mini-batch of the next stage. Each mini-batch contains many positive and negative example root boxes from a single image. Initially, all the positive root boxes in an image are selected and the negative root boxes are randomly sampled. In each stage, hard negative root boxes are used to replace the previous negative root boxes in the mini-batch. For each stage, we define two sets that consist of all the positive and all the negative root boxes in the mini-batch respectively, together with a universal set containing all the positive, negative and unlabeled root boxes in an image.
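The mini-batch construction described above (all positives plus randomly sampled negatives at first, then hard negatives in later stages) can be illustrated with a minimal sketch. The function names, the label encoding (1 = positive, 0 = negative, -1 = unlabeled) and the rule "hardest = highest confidence score among negatives" are assumptions made for illustration; they are not taken from the paper's implementation.

```python
import random


def initial_minibatch(labels, num_neg, seed=0):
    """First stage: keep every positive root box, sample negatives at random.

    labels[i] is 1 (positive), 0 (negative) or -1 (unlabeled) for root box i.
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg_pool = [i for i, y in enumerate(labels) if y == 0]
    neg = rng.sample(neg_pool, min(num_neg, len(neg_pool)))
    return pos, neg


def mine_hard_negatives(labels, scores, num_neg):
    """Later stages: pick the negatives the previous stage scored highest.

    A negative root box with a high confidence score is a false alarm of the
    previous stage, so it is treated as a 'hard' example here (assumption).
    """
    neg_pool = [i for i, y in enumerate(labels) if y == 0]
    neg_pool.sort(key=lambda i: scores[i], reverse=True)
    return neg_pool[:num_neg]
```

In this sketch, unlabeled root boxes are never selected, and the positive set stays fixed across stages while only the negative set is replaced.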

With these definitions, we define a multi-task loss function for an image at each stage, which combines a classification loss and a bounding box regression loss. The classification loss is computed from the predicted probability of each root box being a positive object (mouse ‘body’ or a mouse part) at that stage, and from the corresponding predicted probabilities for the negative root boxes. For the regression loss, we use the robust loss function (smooth L1) defined in [40], applied to the vector of 4 parameterized coordinates of the predicted bounding box associated with a root box at that stage and that of the ground-truth box. All the terms are finally normalized by the mini-batch size.
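As a small illustration of the regression term, the sketch below implements the standard smooth L1 function and the box parameterization commonly used with Faster R-CNN style detectors (centre offsets scaled by the reference box size, log-ratios for width and height). The function names and the (cx, cy, w, h) box layout are illustrative assumptions, and the exact weighting between the classification and regression terms in the paper is not reproduced here.

```python
import math


def smooth_l1(x):
    """Smooth L1: quadratic near zero, linear for |x| >= 1."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1.0 else ax - 0.5


def encode_box(box, root_box):
    """Parameterize a (cx, cy, w, h) box relative to a root box (assumed layout)."""
    cx, cy, w, h = box
    rx, ry, rw, rh = root_box
    return ((cx - rx) / rw, (cy - ry) / rh, math.log(w / rw), math.log(h / rh))


def regression_loss(pred_t, gt_t):
    """Sum of smooth L1 over the 4 parameterized coordinates."""
    return sum(smooth_l1(p - g) for p, g in zip(pred_t, gt_t))
```

The quadratic region keeps gradients small for nearly correct boxes, while the linear region limits the influence of outliers, which is why this loss is described as robust.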

In order to utilize the region proposal networks trained at every stage, we assign a weight to the output of each stage's classification-regression network based on its pseudo-loss, which is computed from counts of root boxes. After all stages of training, the final confidence scores and bounding boxes of the root boxes with respect to each object are computed from these weighted stage outputs.

Fig. 3 shows the refinement of the confidence maps across different stages. To reduce the number of the proposals for efficiency, we remove all the bounding boxes that have an intersection-over-union (IoU) [40] ratio over 0.7 with another bounding box that has a higher detection confidence.
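The proposal pruning step can be sketched as a standard greedy non-maximum suppression with an IoU threshold of 0.7. This is a minimal pure-Python illustration, assuming boxes are given as (x1, y1, x2, y2) corner coordinates; the greedy variant shown here compares each box only against the boxes already kept, which is the usual formulation and may differ in detail from the authors' code.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, thresh=0.7):
    """Return indices of kept boxes, highest confidence first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Drop box i if it overlaps too much with a higher-scoring kept box.
        if all(iou(boxes[i], boxes[k]) <= thresh for k in keep):
            keep.append(i)
    return keep
```

For example, two boxes shifted by one pixel over a 10 by 10 region have an IoU of 90/110, so the lower-scoring one is suppressed, while a distant box is kept.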

3.2 Part Candidates generation

With the proposals generated by PPN, inspired by the Faster R-CNN layer [40], we also crop the proposals and extract fixed-length features from the feature maps. However, the feature maps of Faster R-CNN are of low resolution for detecting small objects (e.g. mouse heads and tails). Generally, a small proposal box is mapped to only a small region (sometimes 1×1×n) at the last pooling layer. Such a small feature map lacks discriminative information, and thus degrades the subsequent classifier. We address this problem by using the pooling features from the shallower layers, and by additionally extracting texture features (e.g. LBP features) from the original image. For classifier training, we construct the training set by selecting the top-ranked 200 proposals (and ground truths) of each image for each class. At the testing stage, we use the trained classifier to classify all the proposals in an image, and then perform non-maximum suppression independently for each class to obtain part candidates.

4 Parts Tracking of Multi-mouse

In order to achieve successful body and part tracking for all the mice freely interacting in a home cage, we here propose a Bayesian Integer linear program (BILP) formulation. Solving this BILP will yield an optimal solution which provides the locations of mouse parts in continuous videos, whilst fulfilling mouse part association over time.

4.1 Problem Formulation

Given a video sequence containing parts, we generate a set of detection candidates using the proposed detector at time . For the task of multi-mouse part tracking, our goal is to jointly address two problems: (1) Successful assembling of the detected parts to represent each individual mouse. (2) Correct alignment of the mouse parts to form motion trajectories.

For the benefits of representation, we encode these two problems through two binary vectors of and :


where suggests that, if the detected parts and can be used to form an individual mouse, i.e. with .

Our approach to the second problem is to assign correct identities to the detection candidates. In an easy case, the relationship between the targets and the detection candidates is bijective, which means that all the tracked targets are also observed and each measurement was generated by the tracked targets. This is an unrealistic scenario as it does not consider the false detection and targets occlusions. In order to solve this assignment problem, we introduce a placeholder for a ‘fake’ (or missed) detection and also define to include the ‘fake’ detection and all the detection candidates. Similar to

, a binary variable

represents that the detection candidate index which is generated by target at time . Here, we have paired with if candidate is assigned to target and otherwise. By definition, and consist of all possible solutions at time .

To ensure that every solution is physically feasible, we add several constraints: (1) each candidate (except for the fake hypothesis) is assigned to at most one target, which gives constraint (a); (2) each target is uniquely assigned to a candidate, which gives constraint (b); and (3) only candidates and targets of the same type (e.g. head to head or tail to tail) can be matched, except for the fake hypothesis, which gives constraint (c). In order to ensure that feasible solutions result in valid target assignment and pair-wise part association, we apply an additional constraint (d): two detected parts are connected if and only if both detections are assigned to different targets.
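To make the role of the 'fake' detection and of constraints (a), (b) and (c) concrete, the following sketch enumerates all feasible target-to-candidate assignments for a tiny problem and returns the cheapest one. It is a brute-force illustration only: the cost values, the `fake_cost` parameter and the function name are hypothetical, constraint (d) and the pair-wise association variables are left out, and the paper solves the full problem as an integer linear program rather than by enumeration.

```python
from itertools import product

FAKE = None  # placeholder for a missed ('fake') detection


def best_assignment(target_types, cand_types, cost, fake_cost):
    """Cheapest assignment of one candidate (or FAKE) to every target.

    cost[t][c] is the cost of assigning candidate c to target t.
    """
    options = []
    for t_type in target_types:
        # Constraint (c): only candidates of the same type; FAKE is always allowed.
        same = [c for c, c_type in enumerate(cand_types) if c_type == t_type]
        options.append(same + [FAKE])

    best, best_cost = None, float("inf")
    # Constraint (b): product() picks exactly one option per target.
    for choice in product(*options):
        real = [c for c in choice if c is not FAKE]
        # Constraint (a): a real candidate is used by at most one target.
        if len(real) != len(set(real)):
            continue
        total = sum(fake_cost if c is FAKE else cost[t][c]
                    for t, c in enumerate(choice))
        if total < best_cost:
            best, best_cost = choice, total
    return best, best_cost
```

With two head targets and a single head candidate, one target necessarily falls back to the fake detection, which is how occlusions and missed detections are absorbed without violating the uniqueness constraints.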

4.2 Bayesian Integer Linear Program

We jointly resolve the targets assignment and part association problems by minimizing the following cost function:


where the value of and which attain the minimum value of (6) are the maximum likelihood of target assignment and part association at time .

4.2.1 Bayesian Inference

With regard to the target tracking problem, the assumption of the Markov property in the target state sequence is frequently employed in the Extended Kalman filter [45], particle filtering [46, 47], MCFPHD [48] or MDP [49]. This type of approach is appropriate for the task of online tracking due to its concern with the region neighbourhood. We also utilize this assumption to resolve the target assignment problem. We consider the states of all targets, where the state vector of a target contains dynamic information, namely its position and velocity at a given time, together with the positions of all part detection candidates at that time. The assumption of the Markov property includes two aspects. First, the kinematic state dynamics follow a first-order Markov chain, meaning that the state at a given time depends only on the state at the previous time. Second, each observation is related only to the state that corresponds to it. Then we have the following formulations:


where, and are our state transition (or motion) and observation models respectively. and are the mean value and the error covariance matrix at state .

is the normal distribution.

and represent the covariances of the process and observation noise with the mean value of zero. is calculated by marginalizing and :


However, in a realistic case, Markovian assumption based methods carry the associated danger of drifting away from the correct target, as they treat the targets as conditionally independent of one another. This risk can be mitigated by optimizing data assignment and considering prior knowledge, as shown in [50, 51, 52]. In our approach, when solving the data association problem, we introduce a Geometric Model as the prior knowledge. Let (where denote the locations of the mouse parts in of training samples centered at a single mouse. More specifically, let denote the locations of parts, where is the location of the part in the training sample. Let refer to a geometric template and suppose that the locations of the mouse parts in a feasible solution is generated by one of our geometric templates . We then expand as follows:


where our collection of training samples have been introduced into the calculation of , before it can be marginalized out.

By conditioning on the geometric template , the locations of the mouse parts can be treated as conditionally independent of one another. As the association probability of the fake detection is not constrained by our geometric template, we rewrite the first term of (9) as follows:




where is the detection probability of the part detection candidate at time and

is a probability distribution obtained from our motion model,

is a assignment probability representing that the detection candidate index is generated by target at time . If , which means the target is missing, the can be estimated using an empirical parameter of the false detection density.

Combining (7),(8), (9) and (10) yields:


where, represents that, if candidate is assigned to target , i.e. the location of target is known, how well the part location in the geometric template fits the location of this target. Computing the sum shown in (12) is challenging due to a large amount of training samples. However, we notice that, if is very small, it will be unlikely to contribute much to the final outcome. Thus, we mainly consider geometric templates with large . To cope with this problem, we firstly find the nearest neighbors to satisfy in the appearance space, where is a set of body candidates with element which are generated by the proposed detector at time and is the top-scoring body candidates which contain the candidate . The selection of the body targets can be formulated as follows:




where, is the detection probability and is an empty value. In our experiments, we set .

If , we then fit a Gaussian model to the location of each part in the body neighbours and use this presentation to estimate the first term of (12) given the type (e.g. head or mouth) of the candidate . Otherwise, we treat the first term of (12) as a small variable value of . We reformulate (12) as follows:


where defines a scoring function, which is estimated by our proposed geometric model given the location of the candidate in the body target.

4.2.2 Parts association model

To improve the multi-target tracking accuracy, several methods ([53, 54, 55]) have also presented group models, in which each object is considered to have relationships with other objects and its surroundings. These models can alleviate performance deterioration in crowded scenes. Similarly, we here propose a part association model to establish the pair-wise relationship between the detection candidates. In our approach, the part association probability in (6) is obtained as follows:


where, probability depends on the types and of the part detections and . If , we define . This means that two close detections denoting the same part should belong to the same mouse. If the connection between the two detections of the same type exists, the detections are merged with the weighted mean of the detections, where the weights are equal to . If

, we firstly construct a 1-D Gaussian distribution using the Euclidean distance between different parts of single mouse in our dataset. Then we use the distribution to estimate the probability
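The distance-based association between parts of different types can be sketched by fitting a 1-D Gaussian to training distances and evaluating it for a new pair of detections. This is a minimal sketch under stated assumptions: the function names are hypothetical, the Gaussian density value is used directly as an association score, and any normalization applied in the paper is not reproduced.

```python
import math
from statistics import mean, pstdev


def fit_distance_gaussian(part_pairs):
    """Fit mean and standard deviation of the Euclidean distance between
    two part types (e.g. head and tail base) over single-mouse samples."""
    dists = [math.dist(p, q) for p, q in part_pairs]
    return mean(dists), pstdev(dists)


def gaussian_pdf(d, mu, sigma):
    """Density of a 1-D Gaussian at distance d."""
    z = (d - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def association_score(det_a, det_b, mu, sigma):
    """Score how plausible it is that two detections belong to one mouse."""
    return gaussian_pdf(math.dist(det_a, det_b), mu, sigma)
```

A head and a tail-base detection whose separation is close to the typical body length in the training data receive a high score, while pairs that are much closer or further apart are penalized.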


4.2.3 Optimization

By introducing (9) and (16), the cost function of (6) can be rewritten as:

s.t.  (a),(b),(c),(d) in section 4.1 (17)

Finally, the problems of targets assignment and part association are jointly reformulated as:


where is a binary vector of length , and is the cost vector with if , if and if . and stand for a constraint matrix and linear equality constraints respectively, resulting in constraints (a), (b), (c) and (d).

To solve Eq. (18), we first relax the binary constraint on the solution vector to the continuous interval [0, 1]. Then, we utilize a branch-and-bound [56] based global optimization method, which is summarized in Supplementary B and C.

4.3 Targets state update

After having obtained an optimal solution of Eq. (18), we can associate each target with its corresponding detection candidate. If a target is assigned to a fake detection candidate, then the target state is simply estimated by the prediction of the motion model. If a target is assigned to a true detection candidate, our goal is to maximize the posterior probability of the target state given the detection. The update formula can be represented as:


To compute the posterior probability, we first define


Following the Bayes’ rule, we can directly derive the posterior probability as follows:


where parameters have the same definitions as those of Eq. (7).

Afterwards, the estimated state is the mean of the Gaussian distribution:
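As a rough illustration of this update, the sketch below computes the posterior mean for each coordinate independently. It assumes a Gaussian prediction from the motion model and Gaussian detection noise with scalar variances; the names `pred_var`, `det_var` and `is_fake` are hypothetical and do not correspond to symbols in the equations above.

```python
def gaussian_update(pred_mean, pred_var, detection, det_var):
    """Posterior mean and variance for one coordinate.

    Prior: N(pred_mean, pred_var) from the motion model.
    Likelihood: N(detection, det_var) from the detector.
    Both are assumed Gaussian, so the posterior is Gaussian as well.
    """
    gain = pred_var / (pred_var + det_var)
    post_mean = pred_mean + gain * (detection - pred_mean)
    post_var = (1.0 - gain) * pred_var
    return post_mean, post_var


def update_target(pred_xy, detection_xy, pred_var, det_var, is_fake):
    """Return the updated position of one target.

    If the target was assigned to a fake detection, keep the motion-model
    prediction; otherwise take the posterior mean per coordinate.
    """
    if is_fake:
        return tuple(pred_xy)
    return tuple(
        gaussian_update(p, pred_var, z, det_var)[0]
        for p, z in zip(pred_xy, detection_xy)
    )


updated = update_target((10.0, 20.0), (14.0, 20.0), 4.0, 4.0, is_fake=False)
kept = update_target((10.0, 20.0), (14.0, 20.0), 4.0, 4.0, is_fake=True)
```

With equal prediction and detection variances, the updated position lies halfway between the prediction and the detection.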


The part tracking of multiple mice is illustrated in Supplementary D and summarized in Alg. 1.

Input: a video sequence , the proposed mouse parts and body detector, C training samples of a single mouse.
Output: bounding boxes of all targets.
1:  for each frame of the video sequence do
2:     Generate a set of parts and body detection candidates using the proposed mouse parts and body detector on the frame
3:     Fit a geometric model in appearance feature to O neighbours of the top-scoring body candidate which contain candidate m;
4:     Compute , and in (15);
5:     Construct constraint matrix A in (18) using constraints (a), (b), (c) and (d);
6:     Construct cost vector in (18);
7:     Obtain the best solution of and by solving (18) using Alg. S1;
8:     Update the state of each target using (22);
9:     Obtain bounding boxes of all targets dependent on the best solution of and ;
10:  end for
11:  return  bounding boxes of all the targets.
Algorithm 1 Algorithm for mouse part tracking.

5 Experimental Set-up

5.1 Our Multi-Mice Parts Track Dataset

In this paper, we introduce our new dataset for multi-mice part tracking in videos. The dataset was collected in collaboration with biologists at Queen’s University Belfast for a study of the neurophysiological mechanisms involved in Parkinson’s disease. In our dataset, two or three mice interact freely in a 50 × 110 × 30 cm home cage and are recorded from the top view using a Sony Action camera (HDR-AS15) at a frame rate of 30 fps and a VGA resolution of 640 × 480 pixels. All experiments are conducted in an environment-controlled room with constant temperature (27 °C) and lighting (a 40 W long fluorescent lamp). The dataset provides detailed annotations for multiple mice in each video, as shown in Supplementary E. The mice used throughout this study were housed under constant climatic conditions with free access to food and water. All the experimental procedures were performed in accordance with the Guidance on the Operation of the Animals (Scientific Procedures) Act, 1986 (UK) and approved by the Queen’s University Belfast Animal Welfare and Ethical Review Body. Our database covers a wide range of activities such as contacting, following and crossing. Moreover, our database contains a large amount of mouse appearance and mouse part occlusion. After proper training, six professionals were invited to annotate mouse heads and tail bases and to localise each mouse body in the videos. We assign a unique identity to every mouse part appearing in the images. If a mouse part is in the field of view but becomes invisible due to occlusion, it is marked ‘occluded’. Mouse parts outside the image border are not annotated. In total, our dataset yields 5 annotated videos of two mice and 5 annotated videos of three mice, and each video lasts 3 minutes. In order to evaluate the part tracking accuracy, we introduce new evaluation metrics to the proposed dataset, and also report results for several baseline methods. We split the dataset into training and testing sets of equal duration and train our network using transfer learning from pre-trained models.

5.2 Evaluation metrics

In order to evaluate the proposed mouse part detector, we use the widely adopted precision, recall, average precision (AP) and mean average precision (mAP) [57]. For a specific class, precision is the ratio of the positive object detections to the total number of objects that the classifier predicted, while recall is the ratio of the positive object detections to the total number of objects labelled as ground truth. The precision-recall curves are obtained by varying the model score threshold between 0 and 1, which determines what is counted as a positive detection of the class. Note that, in our metrics, only the object detections that have an IoU ratio over 0.5 with the ground truth are counted as positive detections, while the rest are negatives. The AP score is defined as the average of the precision at a set of 11 equally spaced recall values, and mAP is the average of the AP over all the classes. To provide a fair comparison for both types of occlusion handling, we consider an occluded mouse part correctly estimated if either (a) it is predicted at the correct location despite being occluded, or (b) it is not predicted at all. Otherwise, the prediction is considered a false positive.
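The detection metrics above can be sketched for a single image and a single class as follows. This is an illustrative sketch, not the evaluation code used in the experiments: the greedy matching rule, the function names and the toy boxes are assumptions, and the AP uses the common interpolated 11-point convention (maximum precision at recall greater than or equal to each level).

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)


def precision_recall_points(detections, ground_truth, iou_threshold=0.5):
    """Return (precision, recall) after each detection, in descending score order.

    detections: list of (score, box); ground_truth: non-empty list of boxes.
    Each ground-truth box can be matched by at most one detection.
    """
    matched, tp, fp, points = set(), 0, 0, []
    for _, box in sorted(detections, key=lambda d: -d[0]):
        best_iou, best_idx = 0.0, None
        for idx, gt in enumerate(ground_truth):
            if idx not in matched and iou(box, gt) > best_iou:
                best_iou, best_idx = iou(box, gt), idx
        if best_idx is not None and best_iou > iou_threshold:
            matched.add(best_idx)
            tp += 1
        else:
            fp += 1
        points.append((tp / (tp + fp), tp / len(ground_truth)))
    return points


def average_precision_11pt(points):
    """Interpolated AP over the recall levels 0.0, 0.1, ..., 1.0."""
    total = 0.0
    for level in [i / 10 for i in range(11)]:
        precisions = [p for p, r in points if r >= level]
        total += max(precisions) if precisions else 0.0
    return total / 11


gt_boxes = [(0, 0, 10, 10), (20, 20, 30, 30)]
dets = [(0.9, (0, 0, 10, 10)), (0.8, (50, 50, 60, 60)), (0.7, (20, 20, 30, 31))]
ap = average_precision_11pt(precision_recall_points(dets, gt_boxes))
```

In this toy case the second detection is a false positive, so the precision drops to 2/3 for recall levels above 0.5, and the mAP would simply be the mean of such AP values over the head, tail base and body classes.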

To evaluate the part tracking performance, we consider each mouse part trajectory as one individual target, and compute the multiple object tracking precision (MOTP) and the multiple object tracking accuracy (MOTA). The latter is derived from three types of error ratios: false positives (FP), missed targets (false negatives, FN), and identity switches (IDs). These errors are normalized by the number of objects appearing in the image frames, and their sum is subtracted from 100% to give the tracking accuracy, so that 100% corresponds to zero errors. MOTP measures how precisely each mouse part has been localized. We also report the trajectory-based measures of the number of mostly tracked (MT) and mostly lost (ML) targets. If a track hypothesis covers at least 80% of the life span of a ground-truth trajectory, the target is considered MT. If less than 20% of the life span is tracked, the target is considered ML. IDF1 measures the ratio of the correctly identified detections.
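The following sketch shows how MOTA and the mostly tracked / mostly lost labels follow from the error counts, using the standard CLEAR MOT definition. The counts in the example are made up for illustration and are not taken from our experiments; the label "PT" (partially tracked) for the remaining trajectories is an assumed name.

```python
def mota(false_positives, misses, id_switches, num_ground_truth):
    """MOTA = 1 - (FP + FN + IDs) / number of ground-truth objects over all frames."""
    return 1.0 - (false_positives + misses + id_switches) / num_ground_truth


def classify_trajectory(tracked_frames, total_frames):
    """Label one ground-truth trajectory by the fraction of its life span that is tracked."""
    ratio = tracked_frames / total_frames
    if ratio >= 0.8:
        return "MT"  # mostly tracked
    if ratio < 0.2:
        return "ML"  # mostly lost
    return "PT"      # partially tracked (assumed label for everything in between)


# Hypothetical counts: 50 FP, 40 misses and 10 ID switches over 1000 annotated objects.
example_mota = mota(50, 40, 10, 1000)
labels = [classify_trajectory(t, 100) for t in (95, 50, 10)]
```

Here the three error types together account for 10% of the annotated objects, which gives a MOTA of 90%.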

6 Experimental Results

In this section, we evaluate the proposed method for part tracking on the newly introduced Multi-Mice PartsTrack dataset.

6.1 Evaluation of Part and Body Detector

6.1.1 Implementation

The scales and aspect ratios of the root boxes in our method differ between mouse parts and body, since root boxes of inappropriate scales and aspect ratios are ineffective for mouse detection. For mouse ‘head’ and ‘tail base’, we choose a single aspect ratio of 1.13 (width to height) and four scales with root box widths of 24, 29, 35 and 42 pixels, based on the statistics of the object shapes in the training set. For the mouse ‘body’, we use multiple aspect ratios of 0.5 (landscape), 1 (square) and 2 (portrait), and three scales with root box widths of 50, 80 and 128 pixels. We label a root box as a positive example if its IoU ratio is greater than 0.7 with one ground-truth box, and as a negative example if its IoU ratio is lower than 0.3 with all the ground-truth boxes. Root boxes that are neither positive nor negative do not contribute to the training of the RPNs. This experiment is conducted on the 2-mice dataset.
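The root box configuration and the labelling rule can be sketched as below. The helper names and the centre position are hypothetical, and the code treats every aspect ratio as width divided by height; for the body ratios {0.5, 1, 2} the resulting set of box shapes is the same under either convention.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)


def make_root_boxes(center, widths, aspect_ratios):
    """Boxes (x1, y1, x2, y2) centred on `center`; aspect ratio = width / height."""
    cx, cy = center
    boxes = []
    for w in widths:
        for ratio in aspect_ratios:
            h = w / ratio
            boxes.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return boxes


def label_root_box(box, gt_boxes, pos_thr=0.7, neg_thr=0.3):
    """Return 1 (positive), 0 (negative) or -1 (ignored during training)."""
    best = max((iou(box, gt) for gt in gt_boxes), default=0.0)
    if best > pos_thr:
        return 1
    if best < neg_thr:
        return 0
    return -1


# Settings quoted in the text: parts use one ratio and four widths,
# the body uses three ratios and three widths.
part_boxes = make_root_boxes((320, 240), widths=(24, 29, 35, 42), aspect_ratios=(1.13,))
body_boxes = make_root_boxes((320, 240), widths=(50, 80, 128), aspect_ratios=(0.5, 1, 2))
```

This yields four root boxes per location for the parts and nine for the body; boxes whose best IoU falls between 0.3 and 0.7 receive the label -1 and are skipped when training the RPNs.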

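| Method | Pre-trained model | Layers | Head | Tail base | Body | mAP |
|---|---|---|---|---|---|---|
| Faster R-CNN [40] | ResNet50 | / | 93.5% | 76.0% | 97.1% | 88.9% |
| Faster R-CNN [40] | ResNet101 | / | 96.3% | 80.2% | 95.6% | 90.7% |
| SSD300 [58] | VGGNet16 [43] | / | 86.9% | 63.7% | 98.1% | 82.9% |
| SSD512 [58] | VGGNet16 [43] | / | 90.6% | 75.4% | 98.2% | 88.1% |
| YOLO [59] | Darknet [59] | / | 93.6% | 77.3% | 98.6% | 89.8% |
| Ours | AlexNet [60] | 5 layers | 88.8% | 44.8% | 86.9% | 73.5% |
| Ours | AlexNet [60] | 9 layers | 49.5% | 25.7% | 89.5% | 54.9% |
| Ours | AlexNet [60] | 15 layers | 35.2% | 12.4% | 86.8% | 44.8% |
| Ours | VGGNet16 [43] | 6 layers | 89.9% | 67.5% | 77.3% | 78.2% |
| Ours | VGGNet16 [43] | 11 layers | 91.5% | 74.2% | 90.3% | 85.3% |
| Ours | VGGNet16 [43] | 18 layers | 85.8% | 51.0% | 89.8% | 75.5% |
| Ours | VGGNet19 [43] | 6 layers | 91.5% | 64.0% | 87.3% | 80.9% |
| Ours | VGGNet19 [43] | 11 layers | 95.6% | 78.4% | 95.4% | 89.8% |
| Ours | VGGNet19 [43] | 15 layers | 98.4% | 89.4% | 98.3% | 95.4% |
| Ours | VGGNet19 [43] | 20 layers | 76.8% | 42.1% | 91.5% | 70.1% |

TABLE I: Performance (precision) of the proposed part detector using different methods.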

6.1.2 Results

We first investigate the performance of the proposed Part and Body detector using different pre-trained models. All the experiments adopt a 3-stage Part Proposal Network, chosen as a trade-off between speed and system performance, as shown in Supplementary F Fig. S3. From Table I, we observe that the deeper network (VGGNet19) achieves superior performance over the shallower networks (AlexNet and VGGNet16). This is because the deeper network can learn richer image representations. Surprisingly, however, the accuracy on the mouse parts degrades when using more than 9 layers of AlexNet, 11 layers of VGGNet16 or 15 layers of VGGNet19. This limitation is partially due to the low-resolution features in the feature maps of the higher layers. These features are not discriminative on small regions, and thus degrade the performance of the downstream classifier. In comparison, these features are still sufficient to distinguish the mouse body from the background, as the area of the mouse body is 3 times larger than that of the mouse head and the tail base. This result also suggests that, if reliable features can be extracted, the downstream classifier is able to improve the detection accuracy. In addition, using fewer than 6 layers of VGGNet16 or VGGNet19 starts to show accuracy degradation, which can be explained by the weaker representation ability of the shallower layers.

| ROI feature | Classifier | Head | Tail base | Background |
|---|---|---|---|---|
| LBP | linear SVM | 93.1% | 92.2% | 94.7% |
| LBP | fc layers | 95.8% | 94.3% | 95.1% |
| HOG | linear SVM | 93.2% | 85.3% | 83.4% |
| HOG | fc layers | 94.1% | 87.2% | 83.7% |
| SIFT + FV | linear SVM | 77.6% | 72.4% | 60.8% |
| SIFT + FV | fc layers | 78.1% | 73.8% | 61.7% |
| Conv | linear SVM | 92.9% | 91.1% | 94.7% |
| Conv | fc layers | 93.2% | 92.3% | 96.3% |
| Conv + LBP | linear SVM | 96.5% | 95.4% | 95.3% |
| Conv + LBP | fc layers | 98.8% | 97.3% | 96.0% |
| Conv + HOG | linear SVM | 93.3% | 89.5% | 90.3% |
| Conv + HOG | fc layers | 94.5% | 90.6% | 92.3% |

TABLE II: Comparisons of different classifiers and features.
(a) head (b) tail
Fig. 4: ROC curve shows the true positive rate (TPR) against the false positive rate (FPR) for different mouse parts.
(a) head (b) tail (c) body
Fig. 5: Precision/Recall curves of different methods on the 2-mice dataset.
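| Method | MOTA ↑ | MOTP ↑ | MT ↑ | ML ↓ | FP ↓ | FN ↓ | IDs ↓ | IDF1 ↑ |
|---|---|---|---|---|---|---|---|---|
| Impact of the constraints | | | | | | | | |
| All | 81.1 | 65.5 | 25 | 0 | 566 | 546 | 12 | 91.2 |
| All(b,c,d) | 80.2 | 65.6 | 24 | 0 | 581 | 571 | 26 | 89.7 |
| All(a,c,d) | 79.5 | 65.6 | 25 | 0 | 615 | 595 | 12 | 90.75 |
| All(a,b,d) | 75.4 | 66.0 | 22 | 0 | 738 | 718 | 12 | 90.2 |
| All(a,b,c) | 74.8 | 65.8 | 24 | 0 | 764 | 735 | 12 | 89.2 |
| Impact of the geometric model | | | | | | | | |
| All without geometric model | 70.3 | 66.6 | 21 | 0 | 892 | 872 | 12 | 88.8 |
| Impact of the motion model | | | | | | | | |
| All without motion model | 61.2 | 65.5 | 22 | 2 | 1101 | 1061 | 152 | 76.14 |
| Impact of the parts association model | | | | | | | | |
| All without parts association model | 76.3 | 65.4 | 24 | 1 | 688 | 693 | 30 | 88.5 |
| Comparison with the state-of-the-art | | | | | | | | |
| MOTDT [61] | 2.5 | 31.4 | 21 | 0 | 1592 | 1152 | 3103 | 20.2 |
| MDP [49] | 64.6 | 70.2 | 21 | 0 | 1114 | 994 | 18 | / |
| SORT [62] | 59.9 | 70.5 | 19 | 1 | 971 | 1331 | 106 | 15.8 |
| JPDAm [63] | 67.0 | 65.1 | 26 | 0 | 1264 | 548 | 155 | 44.6 |

TABLE III: Quantitative evaluation of multi-mice parts tracking on the 3-Mice dataset.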

In Table II, we report the capability of different features to distinguish mouse parts from the background. Since CNN features perform better than the hand-crafted features for body detection, we do not exploit any hand-crafted feature for it, in order to maintain the system accuracy. For fair comparison, all the proposals are generated from the same PPN, which is initialized with the first 15 layers of VGGNet19. We produce part proposals from images and extract low-level features such as LBP, HOG and SIFT. Interestingly, training our fc layers with LBP features on the same set of PPN proposals leads to a better average accuracy of 95.1% (vs. 93.9% for Conv and 88.3% for HOG), as shown in Table II. Combining LBP and Conv features, we achieve the best result of 97.37%. Similar results are also obtained using a linear SVM classifier, but the fc layers are more flexible and can be trained with other layers in an end-to-end style. Since too few interest points are detected on the mouse head and the tail base, SIFT features encoded by Fisher Vector [64] give a much worse result on this task. Figure 4 clearly shows that our combined LBP and Conv features achieve the best performance for classifying mouse ‘head’ and ‘tail base’ proposals.
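The average accuracies quoted above are the means of the head, tail base and background columns of the fc-layer rows in Table II. The short check below reproduces them; the dictionary name is hypothetical and the values are copied from the table.

```python
# (head, tail base, background) accuracies of the fc-layer rows in Table II, in percent.
table_ii_fc = {
    "LBP": (95.8, 94.3, 95.1),
    "HOG": (94.1, 87.2, 83.7),
    "Conv": (93.2, 92.3, 96.3),
    "Conv + LBP": (98.8, 97.3, 96.0),
}

# Mean over the three classes, rounded to two decimals.
averages = {name: round(sum(vals) / len(vals), 2) for name, vals in table_ii_fc.items()}

for name, avg in averages.items():
    print(f"{name}: {avg:.2f}%")
```

The means come out at roughly 95.07% for LBP, 93.93% for Conv, 88.33% for HOG and 97.37% for Conv + LBP, matching the figures in the text after rounding.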

We also compare our detection network against Faster R-CNN [40], SSD [58] and YOLO [59] on our dataset. These methods are also trained by transfer learning from pre-trained models. We train Faster R-CNN from two pre-trained models, Residual Network 50 and 101 [65]. SSD [58] has two input sizes (300 × 300 vs. 512 × 512); the experiments show that the model with the larger input size leads to better results. YOLO uses its own pre-trained model named Darknet. As shown in Figure 5, most methods have high AP on the detection of head and body. However, the performance of Faster R-CNN, SSD and YOLO falls considerably when they are applied to detecting the tail base. Our detection method performs better on the tail base, and its AP degrades less than that of the other methods.

6.2 Evaluation of Multi-mice parts tracking

6.2.1 Implementation

In our tracking algorithm, each target’s state contains position and velocity . We model the motion of each target based on the linear dynamical system discretized in the time domain and predict the state of each target in the next image frame as , where is a constant transition model, and . is a system noise and subject to a Gaussian distribution with covariance , . Here and are the sampling period and the process noise parameter respectively. The uniform clutter density is estimated as , and are the width and height of the image, is the average number of the false detections per image frame. This experiment is conducted on both our 2-Mice and 3-Mice datasets.
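A minimal sketch of such a motion model is given below, under stated assumptions: the state is taken to be (x, y, vx, vy), the transition is the usual constant-velocity form, and the clutter density is assumed to be the average number of false detections divided by the image area. The process noise covariance is omitted, and all function and parameter names are hypothetical.

```python
def predict_state(state, tau):
    """Constant-velocity prediction of (x, y, vx, vy) over the sampling period tau."""
    x, y, vx, vy = state
    return (x + tau * vx, y + tau * vy, vx, vy)


def clutter_density(avg_false_detections, width, height):
    """Uniform clutter density: false detections per frame spread over the image area."""
    return avg_false_detections / (width * height)


# 30 fps video gives a sampling period of 1/30 s; velocities are in pixels per second.
predicted = predict_state((100.0, 50.0, 30.0, -60.0), tau=1 / 30)

# A 640 x 480 image with a hypothetical average of 3 false detections per frame.
density = clutter_density(3, 640, 480)
```

With these toy values the target moves by about one pixel in x and two pixels in y per frame, while its velocity is left unchanged by the prediction.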

6.2.2 Results

The results of multi-mice part tracking are reported in Table III and Supplementary G Table S2. We quantitatively evaluate the proposed system using common multi-object tracking metrics. The up and down arrows in the table indicate whether higher or lower values are better. To evaluate the proposed optimization objective function (6) for multi-mice parts tracking, we have quantified the impact of the different constraints (a)-(d) during the optimization. To this end, we optimize the problem by removing one constraint at a time. As shown in Table III, all the types of constraints make evident contributions to the system performance, as they ensure that every solution is physically feasible. We also examine the impact of the geometric, motion and part association models. Removing any one of the three components significantly decreases the performance. In particular, the motion model plays the most crucial role, which is obvious on the 3-Mice dataset. The part association and geometric models effectively reduce ID switches and FP respectively. This is expected, since these two models help to obtain the best target assignment and parts association. We also implement several standard multi-target tracking methods on our Multi-Mice PartsTrack dataset. For fair comparison, we use the proposed PPN detector to generate part bounding boxes, and perform part tracking using several state-of-the-art trackers [62, 63, 49]. As shown at the bottom of Table III and Supplementary G Table S2, our tracker achieves the highest MOTA scores. Moreover, our method yields the lowest number of ID switches and the highest IDF1 scores. This is primarily due to our powerful geometric and pair-based part association models, which can handle part identities more robustly. Figs. S4, S6 and S7 in Supplementary H provide a qualitative comparison of the proposed tracking method against MDP [49] and JPDAm [63] using the same detection results. As shown in Fig. S4, MDP swaps the identities of targets 2 and 5 after the occlusion caused by ‘Pinning’ in the image of the middle column, while JPDAm assigns a new identity to target 6 after it is occluded by target 3. Similar problems can also be seen in Figs. S6 and S7. Compared to these two tracking methods, the proposed approach correctly integrates the detection results with the tracking process and predicts the occlusion. Fig. 6 shows exemplar tracking results of the test sequences in the proposed Multi-Mice PartsTrack dataset.

(a) (b)
Fig. 6: Tracking ground-truth (top) and results (bottom) of the test sequences in the proposed Multi-Mice PartsTrack dataset. The trajectory and rectangle of each mouse part are shown in different colours: (a) the leftmost mouse: head (labelled as 4) is in green and tail (labelled as 3) is in pink; the rightmost mouse: head (labelled as 1) is in grey and tail (labelled as 2) is in cyan; (b) the leftmost mouse: head (labelled as 3) is in pink and tail (labelled as 4) is in green; the upper right mouse: head (labelled as 5) is in GreenYellow and tail (labelled as 6) is in blue; the lower right mouse: head (labelled as 1) is in grey and tail (labelled as 2) is in cyan.

7 Conclusion

In this paper, we have presented a novel method for markerless multi-mice part tracking. We have demonstrated that the proposed multi-stage Parts and Body detector performs effective hard negative mining and achieves excellent detection results. We also proposed a new formulation based on target assignment with learned geometric constraints and a pair-wise association scheme with motion consistency and restriction. Moreover, we presented a challenging annotated dataset to evaluate algorithms for multi-mice part tracking. Experimental results on the proposed datasets demonstrate that the proposed algorithm outperforms the other baseline methods. Our future work will explore social behaviour analysis using the proposed tracking method.


  • [1] L. H. Tecott and E. J. Nestler, “Neurobehavioral assessment in the information age,” Nature Neuroscience, vol. 7, no. 5, pp. 462–466, 2004.
  • [2] D. Brunner, E. Nestler, and E. Leahy, “In need of high-throughput behavioral systems,” Drug discovery today, vol. 7, no. 18, pp. S107–S112, 2002.
  • [3] D. Houle, D. R. Govindaraju, and S. Omholt, “Phenomics: the next challenge,” Nature reviews genetics, vol. 11, no. 12, pp. 855–866, 2010.
  • [4] J. Askenasy, “Approaching disturbed sleep in late parkinson’s disease: first step toward a proposal for a revised updrs,” Parkinsonism & related disorders, vol. 8, no. 2, pp. 123–131, 2001.
  • [5] A. Vogel-Ciernia, D. P. Matheos, R. M. Barrett, E. A. Kramár, S. Azzawi, Y. Chen, C. N. Magnan, M. Zeller, A. Sylvain, J. Haettig et al., “The neuron-specific chromatin regulatory subunit baf53b is necessary for synaptic plasticity and memory,” Nature neuroscience, vol. 16, no. 5, pp. 552–561, 2013.
  • [6] L. Lewejohann, A. M. Hoppmann, P. Kegel, M. Kritzler, A. Krüger, and N. Sachser, “Behavioral phenotyping of a murine model of alzheimer’s disease in a seminaturalistic environment using rfid tracking,” Behavior research methods, vol. 41, no. 3, pp. 850–856, 2009.
  • [7] A. V. Kalueff, A. M. Stewart, C. Song, K. C. Berridge, A. M. Graybiel, and J. C. Fentress, “Neurobiology of rodent self-grooming and its value for translational neuroscience,” Nature Reviews Neuroscience, vol. 17, no. 1, pp. 45–59, 2016.
  • [8] J. Altmann, “Observational study of behavior: sampling methods,” Behaviour, vol. 49, no. 3, pp. 227–266, 1974.
  • [9] M. B. Sokolowski, “Social interactions in “simple” model systems,” Neuron, vol. 65, no. 6, pp. 780–794, 2010.
  • [10] A. Spink, R. Tegelenbosch, M. Buma, and L. Noldus, “The ethovision video tracking system—a tool for behavioral phenotyping of transgenic mice,” Physiology & behavior, vol. 73, no. 5, pp. 731–744, 2001.
  • [11] L. Giancardo, D. Sona, H. Huang, S. Sannino, F. Managò, D. Scheggia, F. Papaleo, and V. Murino, “Automatic visual tracking and social behaviour analysis with multiple mice,” PloS one, vol. 8, no. 9, p. e74557, 2013.
  • [12] M. J. Galsworthy, I. Amrein, P. A. Kuptsov, I. I. Poletaeva, P. Zinn, A. Rau, A. Vyssotski, and H.-P. Lipp, “A comparison of wild-caught wood mice and bank voles in the intellicage: assessing exploration, daily activity patterns and place learning paradigms,” Behavioural brain research, vol. 157, no. 2, pp. 211–217, 2005.
  • [13] S. Ohayon, O. Avni, A. L. Taylor, P. Perona, and S. R. Egnor, “Automated multi-day tracking of marked mice for the analysis of social behaviour,” Journal of neuroscience methods, vol. 219, no. 1, pp. 10–19, 2013.
  • [14] S. Ballesta, G. Reymond, M. Pozzobon, and J.-R. Duhamel, “A real-time 3d video tracking system for monitoring primate groups,” Journal of neuroscience methods, vol. 234, pp. 147–152, 2014.
  • [15] R. Dennis, R. Newberry, H.-W. Cheng, and I. Estevez, “Appearance matters: artificial marking alters aggression and stress,” Poultry science, vol. 87, no. 10, pp. 1939–1946, 2008.
  • [16] K. Branson and S. Belongie, “Tracking multiple mouse contours (without too many samples),” in Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, vol. 1.   IEEE, 2005, pp. 1039–1046.
  • [17] A. Pérez-Escudero, J. Vicente-Page, R. C. Hinz, S. Arganda, and G. G. De Polavieja, “idtracker: tracking individuals in a group by automatic identification of unmarked animals,” Nature methods, vol. 11, no. 7, p. 743, 2014.
  • [18] F. De Chaumont, R. D.-S. Coura, P. Serreau, A. Cressant, J. Chabout, S. Granon, and J.-C. Olivo-Marin, “Computerized video analysis of social interactions in mice,” Nature methods, vol. 9, no. 4, p. 410, 2012.
  • [19] L. Pishchulin, E. Insafutdinov, S. Tang, B. Andres, M. Andriluka, P. V. Gehler, and B. Schiele, “Deepcut: Joint subset partition and labeling for multi person pose estimation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 4929–4937.
  • [20] E. Insafutdinov, L. Pishchulin, B. Andres, M. Andriluka, and B. Schiele, “Deepercut: A deeper, stronger, and faster multi-person pose estimation model,” in European Conference on Computer Vision.   Springer, 2016, pp. 34–50.
  • [21] S. Tang, B. Andres, M. Andriluka, and B. Schiele, “Subgraph decomposition for multi-target tracking,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 5033–5041.
  • [22] E. Insafutdinov, M. Andriluka, L. Pishchulin, S. Tang, E. Levinkov, B. Andres, and B. Schiele, “Arttrack: Articulated multi-person tracking in the wild.”
  • [23] C. Twining, C. Taylor, and P. Courtney, “Robust tracking and posture description for laboratory rodents using active shape models,” Behavior Research Methods, Instruments, & Computers, vol. 33, no. 3, pp. 381–391, 2001.
  • [24] H. Pistori, V. V. V. A. Odakura, J. B. O. Monteiro, W. N. Gonçalves, A. R. Roel, J. de Andrade Silva, and B. B. Machado, “Mice and larvae tracking using a particle filter with an auto-adjustable observation model,” Pattern Recognition Letters, vol. 31, no. 4, pp. 337–346, 2010.
  • [25] W. Hong, A. Kennedy, X. P. Burgos-Artizzu, M. Zelikowsky, S. G. Navonne, P. Perona, and D. J. Anderson, “Automated measurement of mouse social behaviors using depth sensing, video tracking, and machine learning,” Proceedings of the National Academy of Sciences, vol. 112, no. 38, pp. E5351–E5360, 2015.
  • [26] Y. Shemesh, Y. Sztainberg, O. Forkosh, T. Shlapobersky, A. Chen, and E. Schneidman, “High-order social interactions in groups of mice,” Elife, vol. 2, p. e00759, 2013.
  • [27] A. Weissbrod, A. Shapiro, G. Vasserman, L. Edry, M. Dayan, A. Yitzhaky, L. Hertzberg, O. Feinerman, and T. Kimchi, “Automated long-term tracking and social behavioural phenotyping of animal colonies within a semi-natural environment,” Nature communications, vol. 4, p. 2018, 2013.
  • [28] A. L. Sheets, P.-L. Lai, L. C. Fisher, and D. M. Basso, “Quantitative evaluation of 3d mouse behaviors and motor function in the open-field after spinal cord injury using markerless motion tracking,” PloS one, vol. 8, no. 9, p. e74536, 2013.
  • [29] T. Zhang, C. Xu, and M.-H. Yang, “Multi-task correlation particle filter for robust object tracking,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, vol. 1, no. 2, 2017, p. 3.
  • [30] J. F. Henriques, R. Caseiro, P. Martins, and J. Batista, “High-speed tracking with kernelized correlation filters,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 37, no. 3, pp. 583–596, 2015.
  • [31] Y. Wu, M. Pei, M. Yang, J. Yuan, and Y. Jia, “Robust discriminative tracking via landmark-based label propagation,” IEEE Transactions on Image Processing, vol. 24, no. 5, pp. 1510–1523, 2015.
  • [32] H. Li, Y. Li, and F. Porikli, “Deeptrack: Learning discriminative feature representations online for robust visual tracking,” IEEE Transactions on Image Processing, vol. 25, no. 4, pp. 1834–1848, 2016.
  • [33] X. Jia, H. Lu, and M.-H. Yang, “Visual tracking via coarse and fine structural local sparse appearance models,” IEEE Transactions on Image processing, vol. 25, no. 10, pp. 4555–4564, 2016.
  • [34] G. Shu, A. Dehghan, O. Oreifej, E. Hand, and M. Shah, “Part-based multiple-person tracking with partial occlusion handling,” in Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on.   IEEE, 2012, pp. 1815–1821.
  • [35] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan, “Object detection with discriminatively trained part-based models,” IEEE transactions on pattern analysis and machine intelligence, vol. 32, no. 9, pp. 1627–1645, 2010.
  • [36] L. Zhang, Y. Li, and R. Nevatia, “Global data association for multi-object tracking using network flows,” in Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on.   IEEE, 2008, pp. 1–8.
  • [37] H. Pirsiavash, D. Ramanan, and C. C. Fowlkes, “Globally-optimal greedy algorithms for tracking a variable number of objects,” in Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on.   IEEE, 2011, pp. 1201–1208.
  • [38] H. B. Shitrit, J. Berclaz, F. Fleuret, and P. Fua, “Multi-commodity network flow for tracking multiple people,” IEEE transactions on pattern analysis and machine intelligence, vol. 36, no. 8, pp. 1614–1627, 2014.
  • [39] A. Dehghan, Y. Tian, P. H. Torr, and M. Shah, “Target identity-aware network flow for online multiple target tracking,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1146–1154.
  • [40] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object detection with region proposal networks,” in Advances in neural information processing systems, 2015, pp. 91–99.
  • [41] J. R. Uijlings, K. E. Van De Sande, T. Gevers, and A. W. Smeulders, “Selective search for object recognition,” International journal of computer vision, vol. 104, no. 2, pp. 154–171, 2013.
  • [42] R. Girshick, “Fast r-cnn,” in Proceedings of the IEEE international conference on computer vision, 2015, pp. 1440–1448.
  • [43] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
  • [44] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein et al., “Imagenet large scale visual recognition challenge,” International Journal of Computer Vision, vol. 115, no. 3, pp. 211–252, 2015.
  • [45] D. Mitzel and B. Leibe, “Real-time multi-person tracking with detector assisted structure propagation,” in 2011 IEEE International Conference on Computer Vision Workshops (ICCV Workshops).   IEEE, 2011, pp. 974–981.
  • [46] Z. Khan, T. Balch, and F. Dellaert, “An mcmc-based particle filter for tracking multiple interacting targets,” in European Conference on Computer Vision.   Springer, 2004, pp. 279–290.
  • [47] M. D. Breitenstein, F. Reichlin, B. Leibe, E. Koller-Meier, and L. Van Gool, “Robust tracking-by-detection using a detector confidence particle filter,” in Computer Vision, 2009 IEEE 12th International Conference on.   IEEE, 2009, pp. 1515–1522.
  • [48] N. Wojke and D. Paulus, “Global data association for the probability hypothesis density filter using network flows,” in 2016 IEEE International Conference on Robotics and Automation (ICRA).   IEEE, 2016, pp. 567–572.
  • [49] Y. Xiang, A. Alahi, and S. Savarese, “Learning to track: Online multi-object tracking by decision making,” in Proceedings of the IEEE international conference on computer vision, 2015, pp. 4705–4713.
  • [50] N. Chenouard, I. Bloch, and J.-C. Olivo-Marin, “Multiple hypothesis tracking for cluttered biological image sequences,” IEEE transactions on pattern analysis and machine intelligence, vol. 35, no. 11, pp. 2736–2750, 2013.
  • [51] A. A. Perera, C. Srinivas, A. Hoogs, G. Brooksby, and W. Hu, “Multi-object tracking through simultaneous long occlusions and split-merge conditions,” in Computer Vision and Pattern Recognition, 2006 IEEE Computer Society Conference on, vol. 1.   IEEE, 2006, pp. 666–673.
  • [52] J. Giebel, D. M. Gavrila, and C. Schnörr, “A bayesian framework for multi-cue 3d object tracking,” in European Conference on Computer Vision.   Springer, 2004, pp. 241–252.
  • [53] Z. Qin and C. R. Shelton, “Improving multi-target tracking via social grouping,” in Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on.   IEEE, 2012, pp. 1972–1978.
  • [54] K. Yamaguchi, A. C. Berg, L. E. Ortiz, and T. L. Berg, “Who are you with and where are you going?” in Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on.   IEEE, 2011, pp. 1345–1352.
  • [55] S. Pellegrini, A. Ess, K. Schindler, and L. Van Gool, “You’ll never walk alone: Modeling social behavior for multi-target tracking,” in Computer Vision, 2009 IEEE 12th International Conference on.   IEEE, 2009, pp. 261–268.
  • [56] E. L. Lawler and D. E. Wood, “Branch-and-bound methods: A survey,” Operations research, vol. 14, no. 4, pp. 699–719, 1966.
  • [57] K. Bernardin and R. Stiefelhagen, “Evaluating multiple object tracking performance: the clear mot metrics,” Journal on Image and Video Processing, vol. 2008, p. 1, 2008.
  • [58] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, “Ssd: Single shot multibox detector,” in European conference on computer vision.   Springer, 2016, pp. 21–37.
  • [59] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: Unified, real-time object detection,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 779–788.
  • [60] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, 2012, pp. 1097–1105.
  • [61] L. Chen, H. Ai, Z. Zhuang, and C. Shang, “Real-time multiple people tracking with deeply learned candidate selection and person re-identification,” 07 2018.
  • [62] A. Bewley, Z. Ge, L. Ott, F. Ramos, and B. Upcroft, “Simple online and realtime tracking,” in Image Processing (ICIP), 2016 IEEE International Conference on.   IEEE, 2016, pp. 3464–3468.
  • [63] S. Hamid Rezatofighi, A. Milan, Z. Zhang, Q. Shi, A. Dick, and I. Reid, “Joint probabilistic data association revisited,” in Proceedings of the IEEE international conference on computer vision, 2015, pp. 3047–3055.
  • [64] J. Sánchez, F. Perronnin, T. Mensink, and J. Verbeek, “Image classification with the fisher vector: Theory and practice,” International journal of computer vision, vol. 105, no. 3, pp. 222–245, 2013.
  • [65] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.

Supplementary A

Table S1 summarizes the mouse tracking methods described in Section 2.1.

| Authors | Requirements | Brief Introduction | Limitation | Parts Localization |
|---|---|---|---|---|
| Twining et al. [23] | Standard camera (top view) | Use active shape models to detect targets | Its flexibility is limited because of the required sophisticated skeleton models | Yes (influenced by active shape models) |
| de Chaumont et al. [18] | Standard camera (top view) | Fit the defined geometrical models to images | Its flexibility is limited since it requires sophisticated skeleton models | Yes (influenced by geometrical models) |
| Pérez-Escudero et al. [17] | Standard camera (top view) | Use fingerprints to resolve the occlusions and identity combined with motion models | Easily influenced by illumination variation | No |
| Pistori et al. [24] | Standard camera (top view) | Extension of the standard particle filtering approach | Difficult to explicitly model the entire mouse due to highly deformable shapes | No |
| Branson et al. [16] | Standard camera (side view) | A particle filtering algorithm for tracking the contours of multiple mice | Difficult to explicitly model the entire mouse due to highly deformable shapes of mice | No |
| Weissbrod et al. [27] | RFID and standard camera (top view) | Use RFID to identify individuals in combination with video data | RFID does not provide sufficient spatial accuracy and temporal resolution | No |
| Giancardo et al. [11] | Thermal camera (top view) | Use thermal camera to detect minor changes in body temperature | Thermal images do not provide appearance information such as illumination, contrast and texture | Yes (influenced by frame difference and the estimated mouse shape) |
| Shemesh et al. [26] | Fluorescent colors, UVA light and sensitive color camera (top view) | Lighting change | Marking is invasive and can change mice behaviours | No |
| Ohayon et al. [13] | Dye and camera mounted above the enclosure | Dye the fur with different patterns of strokes and dots | Marking is invasive and can modify mice behaviour | Yes (influenced by dye patterns) |
| Hong et al. [25] | Standard camera (top view) and depth camera (top view) | Background subtraction and image segmentation using all cameras | It requires additional equipment and calibration of cameras | Yes (influenced by fitted ellipses) |
| Sheets et al. [28] | Ten synchronised video cameras | Shape-from-silhouette to reconstruct 3D shape of the mouse | It requires additional equipment and calibration of cameras | Yes (influenced by 3D shape) |
| Ours | Standard camera (top view) | Novel BILP Model to associate part candidates with individual targets | Computing time increases as the number of mice rises | Yes (by the proposed Bayesian Integer Linear Programming Model) |

Advantage: our approach can track the parts of multiple mice using a standard camera without color marking or additional equipment.
TABLE S1: Summary of mouse-based tracking methods.

Supplementary B

Input:  the cost vector , the constraint matrix A and the linear equality constraints h.
Output:  the optimal solution
1:  Find the optimal solution to (17) with the 0-1 restrictions relaxed.
2:  At the root node, let the relaxed solution be the lower bound and a randomly selected 0-1 solution be the upper bound , set
3:  while  do
4:     Create two new constraints, a ‘≤’ constraint and a ‘≥’ constraint, for the minimum fractional variable of the optimal solution.
5:     Create two new nodes, one for the ‘≤’ constraint and one for the ‘≥’ constraint.
6:     Solve the relaxed linear programming model with the new constraint at each of these nodes;
7:     let the relaxed solution be the lower bound and the existing maximum 0-1 solution be the upper bound , set ;
8:     Select the node with the minimum lower bound for branching.
9:  end while
10:  return  the optimal solution
Algorithm S1 Algorithm for optimization of cost function (17).
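
The branching loop of Alg. S1 can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the function name `branch_and_bound` and the toy assignment problem are hypothetical, variables are branched on in index order rather than by fractional value, and the LP relaxation at each node is replaced by a cruder lower bound that simply drops the equality constraints. What it keeps from Alg. S1 is the structure: each node carries a lower bound, the best feasible 0-1 solution found so far gives the upper bound, each branch fixes one variable to 0 or 1, and the node with the minimum lower bound is expanded next.

```python
import heapq
from itertools import count


def branch_and_bound(c, A, h):
    """Minimise c.x subject to A x = h with x in {0,1}^n (best-first search).

    Illustrative only: the bound below is NOT an LP relaxation (assumption
    made to avoid an external solver); it ignores the constraints entirely.
    """
    n = len(c)

    def lower_bound(fixed):
        # Cost of the variables fixed so far, plus the most optimistic
        # contribution of each free variable (0 if c_j >= 0, else c_j).
        k = len(fixed)
        return (sum(c[j] * fixed[j] for j in range(k))
                + sum(min(0.0, c[j]) for j in range(k, n)))

    def feasible(x):
        # Check every equality constraint row of A x = h.
        return all(sum(a * xj for a, xj in zip(row, x)) == hi
                   for row, hi in zip(A, h))

    best_x, upper = None, float("inf")   # incumbent and upper bound
    tie = count()                        # tie-breaker for equal bounds
    heap = [(lower_bound(()), next(tie), ())]
    while heap:
        lb, _, fixed = heapq.heappop(heap)   # node with minimum lower bound
        if lb >= upper:
            break                            # bounds have met: incumbent is optimal
        if len(fixed) == n:                  # leaf: lb is the exact cost
            if feasible(fixed):
                best_x, upper = list(fixed), lb
            continue
        for v in (0, 1):                     # two children: x_j = 0 and x_j = 1
            child = fixed + (v,)
            child_lb = lower_bound(child)
            if child_lb < upper:             # prune nodes that cannot improve
                heapq.heappush(heap, (child_lb, next(tie), child))
    return best_x, upper


# Hypothetical toy problem: assign 2 parts to 2 mice, x = [x11, x12, x21, x22].
c = [4, 1, 2, 3]
A = [[1, 1, 0, 0],   # part 1 goes to exactly one mouse
     [0, 0, 1, 1],   # part 2 goes to exactly one mouse
     [1, 0, 1, 0],   # mouse 1 receives exactly one part
     [0, 1, 0, 1]]   # mouse 2 receives exactly one part
h = [1, 1, 1, 1]
x_opt, cost = branch_and_bound(c, A, h)   # x_opt == [0, 1, 1, 0], cost == 3
```

Because the bound ignores the constraints, this sketch prunes far less than an LP-based bound would and is only practical for very small problems; swapping `lower_bound` for a relaxed LP solve would bring it closer to the procedure described above.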

Supplementary C

Alg. S1 finds the global minimum of the cost function (18) over a dimensional solution space . For a subspace , we define . Thus, , and are functions to compute the lower and upper bounds respectively. We first show that after a large number of iterations , the list of partition must contain a subspace of the original volume. The volume of the subspace is defined as , is the interval of along the th dimension and therefore


we can also have


We define the condition number of subspace as


Since we define as the minimum lower bound as described in Alg.S1, we have


Combining equations (1), (2), (3) and (4):


where . Thus, with the increasing , the maximum dimension of , which is the smallest subspace in , is decreasing. As goes to zero, the difference between the upper and lower bounds(i.e. ) uniformly converges to zero.
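
For orientation, the usual volume argument for branch-and-bound convergence can be sketched as follows. The notation is assumed for illustration and need not match the symbols or equation numbers of this appendix: $Q_{\mathrm{init}} \subset \mathbb{R}^m$ is the initial box, $\mathcal{L}_k$ is the partition after $k$ iterations (taken to contain $k$ boxes), $Q = [l_1,u_1]\times\dots\times[l_m,u_m]$, and $\mathrm{size}(Q)$ is half the longest edge of $Q$. The final step additionally assumes that the splitting rule keeps the condition number bounded by some constant $C$, which holds for longest-edge bisection but is not established here for Alg. S1.

```latex
% Volume of a box and the pigeonhole bound on the smallest box in the partition
\mathrm{vol}(Q) = \prod_{i=1}^{m} (u_i - l_i), \qquad
\min_{Q \in \mathcal{L}_k} \mathrm{vol}(Q) \le \frac{\mathrm{vol}(Q_{\mathrm{init}})}{k}

% Condition number and size of a box
\mathrm{cond}(Q) = \frac{\max_i (u_i - l_i)}{\min_i (u_i - l_i)}, \qquad
\mathrm{size}(Q) = \tfrac{1}{2} \max_i (u_i - l_i)

% A small volume forces a small size when the condition number is bounded
\mathrm{vol}(Q) \ge \max_i (u_i - l_i)\,\bigl(\min_i (u_i - l_i)\bigr)^{m-1}
 = \frac{(2\,\mathrm{size}(Q))^{m}}{\mathrm{cond}(Q)^{m-1}}
 \ge \left(\frac{2\,\mathrm{size}(Q)}{\mathrm{cond}(Q)}\right)^{m}
\;\Longrightarrow\;
\mathrm{size}(Q) \le \tfrac{1}{2}\,\mathrm{cond}(Q)\,\mathrm{vol}(Q)^{1/m}

% Combining, with cond(Q) <= C assumed for every box produced by the splitting rule
\min_{Q \in \mathcal{L}_k} \mathrm{size}(Q)
 \le \tfrac{1}{2}\, C \left(\frac{\mathrm{vol}(Q_{\mathrm{init}})}{k}\right)^{1/m}
 \;\xrightarrow[k \to \infty]{}\; 0
```

Under these assumptions the smallest box shrinks as the iteration count grows, which is the property the closing paragraph above relies on when it lets the gap between the upper and lower bounds go to zero.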

Supplementary D

Fig. S1 shows the part tracking of multiple mice described in Section 4.3.

Fig. S1: Top: Part detection candidates shown over four frames. Middle: Estimated locations of the parts for all the mice. Each coloured line corresponds to a unique mouse identity and each coloured bounding box corresponds to a unique part identity. Bottom: Estimated trajectories of all the parts.

Supplementary E

Fig. S2 shows some example frames and annotations from the proposed Multi-Mice PartsTrack dataset described in section 5.1.

Fig. S2: Example frames and annotations from the proposed Multi-Mice PartsTrack dataset.
(a) head (b) tail (c) body
Fig. S3: Precision/Recall curves of the proposed Part and Body proposal Network across various stages on the 2-mice dataset.
Impact of the constraints
All 85.0 65.5 19 0 272 288 8 86.6
All(b,c,d) 84.6 65.4 19 0 280 288 18 83.0
All(a,c,d) 81.0 66.1 17 2 374 331 18 80.9
All(a,b,d) 82.7 66.0 18 0 342 299 14 82.4
All(a,b,c) 80.1 69.6 17 0 399 356 18 79.8
Impact of the geometric model
All without geometric model 71.4 66.1 15 0 557 514 15 65.5
Impact of the motion model
All without motion model 70.1 65.4 17 2 576 533 27 67.4
Impact of the parts association model
All without parts association model 74.4 66.3 16 0 487 462 41 71.7
Comparison with the state-of-the-art
MOTDT [61] 22.2 30.4 7 0 658 504 830 40.4
MDP [49] 67.9 69.9 14 0 630 583 10 /
SORT [62] 56.7 69.5 7 0 571 509 10 32.2
JPDAm [63] 55.0 67.6 11 0 858 815 34 60.0
TABLE S2: Quantitative evaluation of multi-mice parts tracking on the 2-Mice dataset.

Supplementary F

Fig. S3 shows the Precision/Recall curves of the proposed Part and Body proposal Network across various stages on the 2-mice dataset described in section 6.1.2.

Supplementary G

Table S2 shows the quantitative evaluation of multi-mice parts tracking on the 2-Mice dataset described in Section 6.2.2.

Supplementary H

Figs. S4, S6 and S7 provide the qualitative comparison of the proposed tracking method against MDP [49] and JPDAm [63] described in section 6.2.2.

Fig. S4: Qualitative comparison of the proposed tracking method (row 4) against MDP [49] (row 2) and JPDAm [63] (row 3) using the same detection results (row 1). MDP swaps the identities of targets 2 and 5 after the occlusion caused by ‘Pinning’ at frame 106, i.e. the identity is swapped between the tail of the upper left mouse and the head of the lower left mouse. JPDAm assigns a new identity to target 6 after it is occluded by target 3, i.e. the occluded tail of the upper left mouse is assigned a new identity number.
Fig. S5: Confidence maps of the mouse head and tail bases in Fig. S4.
Fig. S6: Qualitative comparison of the proposed tracking method (row 4) against MDP [49] (row 2) and JPDAm [63] (row 3) using the same detection results (row 1). In the MDP results at image frame 876, target 3 (the tail of the leftmost mouse) drifts and replaces target 6 (the tail of the middle mouse). After the drift of target 3, the original object is assigned a new identity number, as shown at image frame 911. In the JPDAm results, target 3 (the tail of the leftmost mouse) drifts towards target 5 (the head of the leftmost mouse) and finally replaces target 5 at image frame 911.
Fig. S7: Qualitative comparison of the proposed tracking method (row 4) against MDP [49] (row 2) and JPDAm [63] (row 3) using the same detection results (row 1). The problems of identity swap and target drift are more serious in this situation. In the MDP results, target 3 (the tail of the leftmost mouse at image frame 1819) drifts and replaces target 1 (the tail of the upper right mouse in the first column) at image frame 1861. Moreover, in the same frame, target 7 (the head of the leftmost mouse at image frame 1819) is replaced by target 1, and its tail is assigned a new identity number 9. Similar problems occur in JPDAm: target 6 (the tail of the upper right mouse at image frame 1819) replaces target 1 (the head of the upper right mouse at image frame 1819), and target 6 is replaced by a new identity number 7 at image frame 1861. Although target 6 (the tail of the upper right mouse at image frame 1819) in our algorithm also drifts due to occlusion at image frame 1836, it finally finds the correct object when the occluded part appears again, as shown at image frame 1861.