Online multi-object tracking (MOT) has become a critical research problem and a perception technique required in many real-time applications such as intelligent driving (geiger2012are) and action recognition (RN583). Given the complexity of real-world scenarios, the MOT task is inherently difficult; the key challenges are appearance variations of the same target, similar appearances of different targets, and frequent occlusion by a cluster of targets. With the development of deep learning, MOT has made extraordinary progress (RN550; wen2020ua-detrac). MOT algorithms generate consistent trajectories by localizing and identifying multiple targets in consecutive frames, and they can be divided into two categories: offline and online approaches. Offline approaches exploit the entire video to produce tracking results, which makes them unsuitable for online applications such as intelligent driving. In contrast, online MOT approaches use only the data available up to the current moment.
Owing to improvements in detection accuracy (ren2017faster), object association methods, which associate current detection results with historical sequences, have been widely used. However, object association depends heavily on the quality of the detection results. Once the detections of tracked objects are lost, ignored, or inaccurate, tracking is likely to fail. These issues can be relieved by using recent high-precision single object trackers (RN1215; RN1212; RN1214) that follow a tracking-by-prediction paradigm. Single object trackers only use the detections of the first frame to predict the status of a tracked object in subsequent frames (li2016robust). Nevertheless, when the tracked object is occluded, these trackers tend to drift (RN717; 2021Person). To make up for the shortcomings of the tracking-by-prediction paradigm, we propose a method that combines the benefits of object association with single object trackers to improve MOT performance. A set of single object trackers is employed to track every object in the majority of video frames. When the tracking score falls below a threshold, the object association method is used instead.
In addition, object association in MOT refers to associating the current candidate detection with a series of historical tracklets (RN994; RN724). A historical tracklet usually contains a pedestrian sequence with temporal features, while a candidate detection only contains a two-dimensional image representation. Object association must therefore be conducted between two different modalities: detection results and image sequences. A number of studies (RN969; RN970) have demonstrated that extracting temporal features from image sequences yields pedestrian sequence features that are more robust in various complex environments. Nevertheless, such approaches neglect that the current detection image lacks temporal features during object association, so the current detection representation cannot utilize temporal features (Figure 1). Meanwhile, the feature difference between the current candidate detection and the historical tracklets makes it more difficult to evaluate similarity in object association. It is therefore important to design an approach that reinforces the current detection feature with historical temporal knowledge.
To deal with the neglected temporal features of the current detection result and the feature difference in object association, we introduce a novel Spatial-Temporal Mutual Representation Learning (STURE) method. The idea is inspired by the mutual learning strategy (RN983), which learns features collaboratively throughout the training process. In our STURE method, the spatial information is extracted by a detection learning network, and the temporal information learned by a sequence learning network is transferred to the detection learning network. At the training stage, given a pedestrian sequence, the current detection features learned by the detection learning network are forced to fit the features of the sequence learning network. With STURE, these spatial-temporal features are learned mutually by the sequence learning network and the detection learning network. At the test stage, the reinforced detection learning network is used to extract the features of the current detection. Because of the learned temporal information, the reinforced current detection features are robust to various complex environments, just like the historical sequence features in Figure 1. At the same time, the feature difference between detection and sequence is reduced, so the current candidate detection can be associated with the historical tracklets more reliably in object association.
To sum up, our main contributions are listed as follows:
A STURE architecture is proposed to resolve the feature difference between the spatial features of the current detection and the spatial-temporal features of the historical sequence in object association.
To enhance the mutual learning and identification ability of the proposed method, we design three loss functions: cross loss, modality loss and similarity loss, which help the detection learning network obtain temporal features.
A tracking-by-prediction MOT paradigm is designed to alleviate the drift problem of single object tracking by using the features reinforced with STURE, which improves the accuracy and robustness of the proposed method.
Extensive experiments on the MOT benchmarks, together with ablation studies, demonstrate that the proposed algorithm achieves competitive tracking accuracy against state-of-the-art online MOT methods.
2 Related work
2.1 Multi-Object Tracking
Current MOT solutions typically involve object association, which usually follows the tracking-by-detection paradigm, for example by associating detections across frames according to their pairwise affinity. Depending on whether the whole video is used, MOT approaches are split into online and offline approaches. Offline methods (tang2017multiple) leverage both historical and future data, so they can take advantage of information about the entire video sequence. They normally treat MOT as a graph optimization problem with different paradigms (li2020graph), such as k-partite graph (dehghan2015gmmcp) and multicut (tang2017multiple).
Online MOT approaches (chu2017online; RN1004; junbo2020a), in contrast, do not utilize future information and are likely to fail when the detection of the tracked target is inaccurate. Most previous approaches (bae2014robust; xu2019spatial-temporal) adopt a tracking-by-detection pipeline, whose performance is largely dependent on the detection results. Recently, some methods (feng2019multi; RN455; RN542) combine the merits of single object tracking and data association to carry out online MOT and generally obtain better tracking results. FAMNet (RN542) integrates a single object tracker and an object manager into a tracking model to recover false negatives. LSST (feng2019multi) utilizes a single object tracker to capture short-term cues, a re-identification submodule to extract long-term cues, and a switcher-aware classifier to make matching decisions. This strategy improves the tracking accuracy, but it also brings a high running cost. MTP (kim2021discriminative) also considers discriminative appearance association and all previous tracks simultaneously to improve online tracking performance. In this study, a similarity association method with a single object tracker and robust appearance features is introduced to deal with imperfect detection. The experiments indicate that the new MOT approach obtains competitive tracking results compared with existing algorithms.
2.2 Spatial-Temporal Modeling
Extracting spatial-temporal features is important for modeling information across sequence frames. Typically, a recurrent structure is used to model temporal features (chung2017a; mclaughlin2016recurrent). Temporal average pooling (mclaughlin2016recurrent) is used at each time step to extract the features of video sequences. Nevertheless, these methods cannot deal with multiple adjacent regions. More recently, non-local neural networks have been used to model long-term temporal relations (RN580).
For online MOT, a robust motion representation is important for object association. Recently, many online MOT methods (chu2017online; RN455; RN601) based on deep neural networks have been proposed. For example, a siamese network (leal-taixe2016learning) has been applied to estimate the affinity of provided detections by aggregating target appearance and optical flow information. In particular, Long Short-Term Memory (LSTM) is used to extract spatial representations (RN455); it processes detection results in the sequence one by one and outputs the similarity between them. In this study, we use the self-attention mechanism to extract the temporal representations in a pedestrian sequence.
2.3 Mutual Learning
Generally, mutual learning (hinton2015distilling; romero2015fitnets; RN982) is a widely used cooperative learning method applied throughout the training process. In some methods, learning is performed by optimizing the KL divergence between the distributions of the final outputs from two different networks (RN983). In others, the outputs of the intermediate layers of the two networks are optimized (romero2015fitnets), or feature transfer is performed by calculating cross-sample similarities in metric learning problems (RN982; RN984). In this study, temporal features are learned mutually by optimizing the loss between the current detection features and the historical sequence features in a mutual representation space. Compared with previous works (RN982; RN927), we design a new model architecture and new loss functions. In addition, the cross loss is designed on the basis of differences in the mutual representation space. The sequence learning network and the detection learning network are trained synchronously, rather than training the former first. To the best of our knowledge, among online MOT methods, the proposed STURE is the first attempt to train a sequence learning network and a detection learning network in a mutual representation space.
Firstly, we present a complete model architecture overview of the STURE. Secondly, we give the detailed information of the detection learning network and the sequence learning network. Thirdly, we introduce the design of loss function and the training method. In the end, the trained models are utilized to associate the current detection and the historical tracklets in online MOT.
3.1 Architecture Overview
The architecture of STURE for detection-to-sequence object association is presented in Figure 2. The sequence learning network learns spatial features and models temporal relations between sequence frames at the same time, while the detection learning network learns spatial features of the current detection. The spatial-temporal features are extracted by the sequence learning network and the detection learning network in a mutual representation space with the designed losses. By optimizing the total objective function, the detection representations and the sequence representations are learned simultaneously. In the end, detection-to-sequence object association is conducted according to the similarity between the historical tracklets and the current detection. The detailed procedures are presented in the following parts.
3.2 Spatial-Temporal Modeling Network
3.2.1 Sequence Learning Network
In order to extract the spatial-temporal features of historical pedestrian tracklets in a single model, a Convolutional Neural Network (CNN) with a non-local self-attention mechanism (RN580) is used in the sequence learning network. By exploiting temporal relations within the sequence, the non-local neural network can summarize image-wise features into a single feature, and then output the activation of each location as the weighted average of all positions in the input representation (RN580).
Figure 3 indicates the architecture of the sequence learning model based on ResNet (he2016deep). Five non-local layers are particularly embedded into CNN, and the final down-sampling operation of the ResNet in the last convolution is deleted to obtain a high-resolution feature map (RN985). Given pedestrian sequences , which can be described as:
where is the index of a pedestrian sequence, , and is the number of frames for each pedestrian sequence (here is 8). So the sequence feature can be extracted by :
where is the feature of frame in sequence , and . The sequence features of several pedestrians are compressed into a single sequence representation by using 3D pooling ():
3.2.2 Detection Learning Network
In order to learn spatial features, the detection learning network uses a ResNet with the last fully-connected layer removed. As in the sequence learning network, the last downsampling step of ResNet-50 is deleted to obtain higher spatial resolution and more refined detection representations. Once the temporal features among pedestrian sequences are ignored, the input sequence can be regarded as a set of independent detections. The detection learning network is therefore used to learn the representations of these independent detections:
where ) is the corresponding detection feature of the sequence frame .
Both the sequence learning model and the detection learning model use ResNet as the backbone; the only difference between them is that the former adds non-local layers to extract temporal features.
3.3 Spatial-Temporal Mutual Representation Learning
In online MOT, the quality of object association depends heavily on robust feature learning. Extracting temporal knowledge from pedestrian sequences improves the robustness of the spatial features under various environmental challenges (RN986). Nevertheless, the detection learning network, whose inputs are flat images, is unable to model temporal correlations, which prevents it from learning temporal features. To deal with this issue, the proposed STURE makes the outputs of the detection learning network match the representations of the sequence learning network in a mutual representation space.
For the specific pedestrian sequence , Equation 2, Equation 3 and Equation 4 are utilized to extract sequence features and detection features , respectively. Because learns spatial features and temporal correlations among pedestrian sequences , contains both the spatial and temporal features. To extract the sequence’s temporal feature representation in the detection learning network and sequence learning network mutually, the STURE is designed to optimize the following losses.
3.3.1 Cross Loss
The STURE forces the detection learning network to match more refined temporal features in the mutual representation space. To this end, the STURE is formulated to minimize the error between pedestrian sequence representations and the corresponding detection representations. It learns temporal features in the sequence learning network and the detection learning network mutually to exploit deep features. The representation of the object feature is expressed by cross-sample differences. The feature of the whole sequences can be described as. We use cross-sample differences to measure the difference between detection-detection, detection-sequence and sequence-sequence pairs. The Euclidean difference matrix of the cross-sample can be indicated as .
where each submatrix is a matrix with the same value, . In this, is the length of pedestrian sequence and is the number of sequences.
where is the difference submatrix between the sequence and ().
The cross detection differences are forced to fit the cross sequence frame difference matrix to learn the detection feature in the mutual representation space.
where , and every element is the Euclidean difference between the detection and detection ().
In this way, the temporal features can be transferred to the detection learning network. The cross loss is formulated as:
where denotes the error between the cross difference matrices of the detections and the sequences. We can use Equation 8 to reconstruct the detection representation function by learning the sequence representations , and it can be viewed as a continuous representation reconstruction from a group of data (RN987; RN988). The sequence and detection learning networks are similar to FitNets (romero2015fitnets), except for how the model outputs are handled. In FitNets, the outputs are converted to an identical size with an additional convolution. By comparison, we do not require an extra convolution because the outputs of the sequence and detection learning networks already have identical size. After training, the detection learning network has learned from the pedestrian sequences, which gives it the desired temporal features.
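The cross loss described above can be illustrated with a minimal sketch. It assumes that each sequence feature is repeated once per frame, so that the sequence difference matrix consists of constant blocks as stated in the text, and that the error is the mean squared difference between the two matrices. The function names and the mean normalization are assumptions made for illustration; they are not taken from the original formulation.

```python
import math


def pairwise_distances(features):
    """Return the matrix of Euclidean distances between all feature vectors."""
    return [[math.dist(a, b) for b in features] for a in features]


def cross_loss(detection_feats, sequence_feats, frames_per_sequence):
    """Mean squared gap between detection and sequence difference matrices.

    detection_feats: N*T vectors, ordered sequence by sequence.
    sequence_feats:  N vectors, one per pedestrian sequence.
    """
    # Repeat every sequence feature T times so both matrices are (N*T) x (N*T);
    # each T x T block of the sequence matrix then holds a single value.
    expanded = [s for s in sequence_feats for _ in range(frames_per_sequence)]
    d_det = pairwise_distances(detection_feats)
    d_seq = pairwise_distances(expanded)
    n = len(detection_feats)
    total = sum((d_det[i][j] - d_seq[i][j]) ** 2
                for i in range(n) for j in range(n))
    return total / (n * n)
```

When the detection features already reproduce the pairwise structure of the sequence features, the loss is zero; any mismatch in the cross-sample distances is penalized quadratically.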
In addition to the cross loss, identification losses are added to extract discriminative representations for detection-to-sequence object association. Any loss that improves discriminability can be used in the same way. In our work, the modality loss and the similarity loss are used.
3.3.2 Modality Loss
The triplet loss (cheng2016person) is utilized to keep interindividual differences in the mutual representation space. In this study, two types of modality differences are designed, including cross-modality and within-modality loss.
The cross-modality loss keeps the difference between detection and sequence representation, and it is able to enhance the representation discriminability of different modalities. It is formulated as:
where the previous term is the sequence-to-detection loss, and the latter is the detection-to-sequence loss. indicates the pre-set margin, means the difference in Euclidean space. and are the positive and negative sample datasets of the pedestrian ( and ), respectively.
Similarly, the within-modality loss keeps relative differences within the same modality, which lets the method discriminate fine-grained features of different objects in the same modality. It is formulated as:
where the previous term is the sequence-to-sequence loss, and the latter is the detection-to-detection loss.
Together, the within-modality and cross-modality losses allow detection-to-sequence features to be extracted more effectively, so we integrate the two kinds of modality losses. The final modality loss can be defined as:
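The triplet terms described in this subsection can be sketched as follows. The sketch assumes a hinge-style triplet term with hardest-positive and hardest-negative mining, which is one common way to instantiate a triplet loss and is not confirmed by the text. The function names, the argument layout, and the choice of mining are all hypothetical.

```python
import math


def triplet_term(anchor, positives, negatives, margin):
    """Hinge triplet term using the hardest positive and hardest negative."""
    hardest_pos = max(math.dist(anchor, p) for p in positives)
    hardest_neg = min(math.dist(anchor, n) for n in negatives)
    return max(0.0, margin + hardest_pos - hardest_neg)


def cross_modality_loss(seq_anchor, det_anchor, seq_pos, seq_neg,
                        det_pos, det_neg, margin):
    """Sequence-to-detection term plus detection-to-sequence term."""
    seq_to_det = triplet_term(seq_anchor, det_pos, det_neg, margin)
    det_to_seq = triplet_term(det_anchor, seq_pos, seq_neg, margin)
    return seq_to_det + det_to_seq


def within_modality_loss(seq_anchor, det_anchor, seq_pos, seq_neg,
                         det_pos, det_neg, margin):
    """Sequence-to-sequence term plus detection-to-detection term."""
    seq_to_seq = triplet_term(seq_anchor, seq_pos, seq_neg, margin)
    det_to_det = triplet_term(det_anchor, det_pos, det_neg, margin)
    return seq_to_seq + det_to_det
```

Both functions take the same inputs, so the same sampled batch can feed the cross-modality and within-modality terms without recomputing features.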
3.3.3 Similarity Loss
Because pedestrian identities are category-level information provided in MOT datasets, two classifiers that share the same weights are constructed to map the detection representations and the sequence representations into a mutual representation space. The object classifier consists of several fully-connected layers followed by a softmax function, and the number of output channels equals the number of identities in the MOT datasets. The similarity loss is formed as the cross entropy loss between the inferred object label and the ground truth label.
where is the ground truth, and is the predicted label.
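A minimal sketch of such a cross entropy loss over identity logits is given below. It assumes that the detection and sequence classifier outputs are scored against the same identity label and that the per-sample losses are averaged; both choices, as well as the function names, are assumptions for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, label):
    """Negative log-probability assigned to the ground-truth identity."""
    return -math.log(softmax(logits)[label])


def similarity_loss(det_logits, seq_logits, labels):
    """Average identity cross entropy over detection and sequence outputs."""
    losses = [cross_entropy(d, y) + cross_entropy(s, y)
              for d, s, y in zip(det_logits, seq_logits, labels)]
    return sum(losses) / len(losses)
```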
3.3.4 Total Loss
The spatial-temporal features are learned simultaneously in detection learning network and sequence learning network. So the total loss is composed of cross loss, modality loss and similarity loss as follows:
3.4 Training Strategy
The labeled bounding boxes and identity data of pedestrians supplied in the MOT training dataset are used to produce the pedestrian sequence and the current candidate detections. We utilize it to train our proposed model.
3.4.1 Data Augmentation
The training set in MOT datasets does not contain enough pedestrian tracklets, and each pedestrian sequence contains a limited number of detection results. The association model is therefore liable to overfit the training data. In this study, a number of image augmentation methods are employed to relieve these difficulties.
Firstly, the training set is augmented by randomly cropping and rescaling the input detection images; horizontal flipping is also used. Furthermore, to simulate noisy tracking conditions and improve the robustness of the proposed model, noise is mixed into the pedestrian sequences by randomly substituting detections of other pedestrians. Because some sequences in the training set may contain only a small number of pedestrian images, each tracklet is sampled with the same probability to relieve the sample imbalance problem, and the appropriate channel number is prepared for the sequence learning network.
3.4.2 Data Sampling
To optimize the proposed network with the different objective functions in Equation 13, a special data sampling method is applied to the MOT datasets. pedestrians are selected stochastically in every training iteration. sequences are generated randomly for every pedestrian, and each sequence contains detections. If a sequence is shorter than , it is sampled with equal probability from the preserved frames to meet the requirement of the model input. The whole pedestrian sequences are fed to the sequence learning network. Besides, no more than recent tracking results of each object are preserved. At the same time, the current candidate detections, which constitute a detection batch, are fed to the detection learning network. To reduce the computational cost, all data in each detection batch are reused to evaluate the three objective functions in Equation 13.
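The sampling scheme above can be sketched as follows. The sketch assumes that a sufficiently long tracklet contributes a random contiguous window and that a short tracklet is filled by drawing frames uniformly with replacement. The names `sample_sequence` and `sample_batch`, the contiguous-window choice, and the seeding are hypothetical.

```python
import random


def sample_sequence(frames, length, rng):
    """Return `length` frames from one tracklet, kept in temporal order."""
    if len(frames) >= length:
        start = rng.randrange(len(frames) - length + 1)
        return frames[start:start + length]
    # Tracklet too short: draw frames with equal probability, with replacement.
    picks = sorted(rng.randrange(len(frames)) for _ in range(length))
    return [frames[i] for i in picks]


def sample_batch(tracks, num_ids, seqs_per_id, length, seed=0):
    """Pick `num_ids` pedestrians and `seqs_per_id` sequences for each one.

    tracks maps a pedestrian id to its list of preserved frames.
    """
    rng = random.Random(seed)
    chosen = rng.sample(sorted(tracks), num_ids)
    return {pid: [sample_sequence(tracks[pid], length, rng)
                  for _ in range(seqs_per_id)]
            for pid in chosen}
```

Every sampled sequence has the same length, so the batch can be stacked directly as input to the sequence learning network, while the individual frames form the detection batch.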
3.4.3 Selective Back-propagation
For each input, the goal of STURE is to force the sequence learning network and the detection learning network to output similar features. The two networks would produce the same representations if the STURE loss were minimized to its optimum. Hence, updating the sequence learning network with the cross loss would restrain its temporal feature learning, and the detection learning network would not learn the desired temporal features. To solve this problem, the cross loss is not used to update the sequence learning network at the training stage. This selective back-propagation strategy lets the detection learning network learn more robust representations without diminishing the ability of the sequence learning network to learn temporal features.
3.4.4 Similarity Computing
To evaluate the affinity between a candidate detection and a historical tracklet, we merge the integrated sequence feature (1 × 2048) and the output (1 × 2048) from the detection learning network into one representation, and then feed it into the linear classifier. The linear classifier has three fully connected layers whose input dimensions are 4096, 256, and 32, respectively. Each fully connected layer incorporates a batch normalization layer and an activation function. The last layer of the linear classifier outputs the affinity between candidate detections and historical tracklets. A softmax operation follows, and the similarity loss is formed by the cross entropy loss between the inferred label and the ground truth.
Finally, the whole network is trained by optimizing the total loss according to Equation 13.
3.5 Object Association
A single object tracker is used to track each object in every video frame. Once the single object tracker starts to drift, it is suspended and the status of the tracked object changes to drifting. The tracked object status can be described as:
where , , and are tracking score, track score threshold, and overlap rate between the tracked target and detection, respectively. The average overlap of the pedestrian sequences is denoted as
When determining the status of tracked object, the average value of in the historical frames is taken into consideration. For Equation 15, the overlap rate between current detection result and the tracked object is indicated as
where is assigned to 0 when the maximal intersection over union (IoU) of the previous tracked target (all tracking object within frames) and the current detection in video frames is lower than . Otherwise, is assigned to 1.
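The status rule and the overlap indicator can be illustrated with a short sketch. It assumes that a target counts as tracked only when its tracking score exceeds the score threshold and its average overlap over the recent history exceeds an overlap threshold, and that boxes are given as `(x1, y1, x2, y2)`. The function names, the box format, and the exact combination of conditions are assumptions, since the status equation is not reproduced here.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def overlap_flag(track_box, detections, iou_threshold):
    """1 if some detection overlaps the tracked box enough, otherwise 0."""
    best = max((iou(track_box, d) for d in detections), default=0.0)
    return 1 if best >= iou_threshold else 0


def track_status(score, score_threshold, overlap_history, overlap_threshold):
    """Classify a target as 'tracked' or 'drifting' (assumed rule)."""
    if overlap_history:
        mean_overlap = sum(overlap_history) / len(overlap_history)
    else:
        mean_overlap = 0.0
    if score > score_threshold and mean_overlap > overlap_threshold:
        return "tracked"
    return "drifting"
```

Averaging the overlap flag over recent frames keeps a single missed detection from immediately switching a target to the drifting state.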
The motion information is utilized to choose current candidate detection results before evaluating the affinity for similarity association. Once the tracked object drifts, the size of the pedestrian bounding box at the frame will remain unchanged, and a linear prediction method is utilized to infer the object’s position at the latest moment . Let be the center position of the tracked object at frame , so the velocity of the tracked object at frame is evaluated as
where indicates the length of historical sequence. Hence, the position of the tracked object in frame is inferred as
If a detection lies near the predicted position of the tracked object (the distance between the inferred position and the detection is smaller than a threshold value ) and does not overlap with any other tracked object, it is treated as a candidate detection in the current frame . The affinity is then measured between the current detections and the historical sequences.
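The linear motion prediction and the distance-based gating described above can be sketched as follows. The sketch assumes that the velocity is the average per-frame displacement of the box center over the stored history and that candidates are selected by Euclidean distance to the predicted center. The function names and the averaging over the whole history are assumptions for illustration, and the check against other tracked objects is omitted.

```python
import math


def estimate_velocity(centers):
    """Average per-frame displacement over the stored (x, y) centers."""
    span = len(centers) - 1
    if span < 1:
        return (0.0, 0.0)
    (x0, y0), (x1, y1) = centers[0], centers[-1]
    return ((x1 - x0) / span, (y1 - y0) / span)


def predict_center(centers, frames_ahead):
    """Extrapolate the last known center linearly by `frames_ahead` frames."""
    vx, vy = estimate_velocity(centers)
    x, y = centers[-1]
    return (x + vx * frames_ahead, y + vy * frames_ahead)


def candidate_detections(predicted, detection_centers, max_distance):
    """Keep detections whose center lies close to the predicted position."""
    return [c for c in detection_centers
            if math.dist(predicted, c) < max_distance]
```

Gating by predicted position keeps the number of affinity evaluations small, because only nearby detections are passed on to the similarity association step.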
3.5.1 Similarity Association
As shown in Figure 4, similarity association is used to decide whether the status of a tracked object should be converted to tracked or kept as drifting. In this stage, each association operation is performed between flat images and a set of pedestrian sequences. In the process of detection-to-sequence association, the current detection and the historical tracklet representations are extracted by the detection learning network and the sequence learning network, respectively. After the representations are extracted, the similarity between the current detection and the historical tracklets is evaluated, and detection-to-sequence object association is conducted according to this similarity. The most similar detection is selected, and a similarity threshold determines whether the drifting object is linked to the sequence.
In this tracking method, a natural solution is to use the highest tracking score of the target in the confidence map to evaluate the reliability of a single object tracker. Nevertheless, if only the tracking score is used, false alarm detections with high confidence values are liable to be tracked persistently. Typically, a tracked object that continuously fails to overlap with any detection is much more likely to be a false positive. To solve this problem, the single object tracker and the overlap rate of bounding boxes and are used to remove the false positives.
Finally, the drifting objects are assigned to current detection results based on the pairwise affinity values between the historical sequences and the candidate detection results.
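One simple way to realize this assignment is a greedy matching over the affinity values, sketched below. The text does not state which assignment algorithm is used, so the greedy strategy, the function name `associate`, and the one-to-one constraint are assumptions; an optimal assignment method could be substituted.

```python
def associate(affinity, threshold):
    """Greedily match drifting tracks to candidate detections.

    affinity[i][j] is the similarity between drifting track i and
    candidate detection j. Returns a dict {track index: detection index}.
    """
    pairs = sorted(((value, i, j)
                    for i, row in enumerate(affinity)
                    for j, value in enumerate(row)), reverse=True)
    matches, used_tracks, used_dets = {}, set(), set()
    for value, i, j in pairs:
        if value < threshold:
            break  # all remaining pairs are below the similarity threshold
        if i in used_tracks or j in used_dets:
            continue
        matches[i] = j
        used_tracks.add(i)
        used_dets.add(j)
    return matches
```

Tracks that receive no match keep their drifting status, and unmatched detections remain available as potential new objects.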
3.5.2 Object Appearing and Disappearing
In the process of MOT, using the detection results supplied by MOT benchmark (RN583), we can initialize a newly emerged object and start the tracking process. When the overlap rates of a current detection result with all tracked objects are lower than a threshold, it will be viewed as a new potential object. To prevent false alarm detection results, when a pedestrian sequence in the new candidate detection is greater than a threshold value during frames continuously, it will be viewed as an initial tracked object.
With respect to object disappearing, a tracked target which does not overlap with any detection is viewed as drifting and is then deleted from the tracked list. Single object tracking is terminated when the target remains in the drifting status for more than frames or moves directly out of view.
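The appearing and disappearing rules can be summarized as a small state machine. The sketch below assumes that a new candidate becomes a tracked object after a fixed number of consecutive confirmations and is removed after drifting for more than a fixed number of frames or leaving the view. The class name, the state names, and both frame counts are hypothetical parameters.

```python
class TrackLifecycle:
    """Minimal state machine for a target: tentative, tracked, drifting, removed."""

    def __init__(self, init_frames, max_drift_frames):
        self.init_frames = init_frames
        self.max_drift_frames = max_drift_frames
        self.state = "tentative"
        self.hits = 0
        self.drift = 0

    def update(self, matched, in_view=True):
        """Advance one frame; `matched` says whether the target was confirmed."""
        if not in_view:
            self.state = "removed"
        elif self.state == "tentative":
            # Confirmations must be consecutive to start a new trajectory.
            self.hits = self.hits + 1 if matched else 0
            if self.hits >= self.init_frames:
                self.state = "tracked"
        elif self.state in ("tracked", "drifting"):
            if matched:
                self.drift = 0
                self.state = "tracked"
            else:
                self.drift += 1
                if self.drift > self.max_drift_frames:
                    self.state = "removed"
                else:
                    self.state = "drifting"
        return self.state
```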
4.1 Datasets and Evaluation Protocol
4.1.1 Benchmark Datasets
The proposed method is evaluated on the MOT16 (RN583), MOT17 (RN583) and MOT20 (2019CVPR19) datasets. In total, MOT16 contains 7 fully labeled training videos and 7 testing videos recorded by static or moving cameras. MOT17 has as many video sequences as MOT16, and it additionally supplies three image-based object detectors: DPM (RN584), Faster-RCNN (ren2017faster), and SDP (yang2016exploit), which have different detection accuracy and noise levels and thus support the testing of different MOT methods. MOT20 consists of 8 new sequences depicting very crowded, challenging scenes.
4.1.2 Evaluation Protocols
Various evaluation protocols of the MOT benchmarks (RN583) are used for a fair comparison. In addition to the classic multi-object tracking accuracy (MOTA) (RN475) and multi-object tracking precision (MOTP) (RN550), the evaluation metrics also include the ratio of correctly identified detections (IDF1), ID recall (ristani2016performance) (IDR, the fraction of ground-truth detections that are correctly identified), the total number of false positives (FP), mostly tracked targets (MT), mostly lost targets (ML), the total number of identity switches (IDS), the total number of times a trajectory is fragmented (Frag), and the processing speed (Hz). In particular, ID recall was proposed in (ristani2016performance) and has been introduced to the MOT benchmarks. It can be used to evaluate the consistency of the predicted identities with the actual identities.
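For reference, the standard definitions of MOTA, ID recall, and IDF1 can be written as a few helper functions. The sketch uses the usual count-based formulas, where `idtp`, `idfp`, and `idfn` denote identity-level true positives, false positives, and false negatives. The function and argument names are chosen here for illustration.

```python
def mota(false_negatives, false_positives, id_switches, num_gt):
    """MOTA = 1 - (FN + FP + IDS) / number of ground-truth detections."""
    return 1.0 - (false_negatives + false_positives + id_switches) / num_gt


def id_recall(idtp, idfn):
    """Fraction of ground-truth detections that are correctly identified."""
    return idtp / (idtp + idfn)


def idf1(idtp, idfp, idfn):
    """Harmonic mean of ID precision and ID recall."""
    return 2 * idtp / (2 * idtp + idfp + idfn)
```

Note that MOTA can become negative when the total number of errors exceeds the number of ground-truth detections.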
4.2 Implementation Details
We use ECO (danelljan2017eco) as the single object tracker in our proposed method. A ResNet pre-trained on ImageNet (RN991) is used as the backbone module, and the approach in (RN580) is adopted to initialize the parameters of the non-local layers. As for ResNet-50, the length of its output is . The maximum number of preserved results in a trajectory is set to 100 and the length of a tracklet is set to 8. Every detection result of the tracked pedestrian is resized to 256 × 128. The batch size is set to 32. The Adaptive Moment Estimation (Adam) (RN586) optimizer with the learning rate of is used to optimize the proposed model.
The values of tracking parameters are assigned on the basis of the MOTA results. Given as the raw frame frequency, the initialization threshold value of trajectory is assigned to . The termination threshold value of trajectory is assigned to . The distance for evaluating whether the object is tracked is assigned to . The thresholds of the appearance similarity and tracking score are assigned to and , respectively. The threshold values of the difference and overlap are and . Besides, the threshold values of the tracking score and appearance are selected by grid search. Experiments are run on a workstation with an Intel Core i9-9820X CPU. The MOT algorithm is implemented in Python with the PyTorch 1.3.0 library (paszke2019pytorch) and runs on Ubuntu 18.04. The whole training procedure takes 3 hours for 80 epochs on an NVIDIA GeForce RTX 2080Ti.
During the test stage, the representations of a detection result are extracted by the detection learning network. Each whole pedestrian sequence is first split into several 32-frame sequences. The sequence learning model is then used to extract the sequence representation of every split sequence. The final compressed sequence representation is the average of all these sequence features.
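The test-time aggregation can be sketched as splitting a tracklet into fixed-size chunks, encoding each chunk, and averaging the chunk features. In the sketch, `encode_sequence` is a hypothetical stand-in for the sequence learning network, and allowing a shorter final chunk is an assumption about how leftover frames are handled.

```python
def chunk(frames, size):
    """Split a tracklet into consecutive chunks of at most `size` frames."""
    return [frames[i:i + size] for i in range(0, len(frames), size)]


def mean_vector(vectors):
    """Element-wise average of equally sized feature vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def tracklet_feature(frames, chunk_size, encode_sequence):
    """Average the features of all chunks of one tracklet.

    encode_sequence is a placeholder for the sequence learning network:
    it maps a list of frames to one feature vector.
    """
    features = [encode_sequence(part) for part in chunk(frames, chunk_size)]
    return mean_vector(features)
```

As a usage example, `tracklet_feature(frames, 32, model_fn)` would follow the 32-frame split described above, where `model_fn` wraps the trained sequence learning network.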
4.3 Evaluations on MOT benchmark
The designed method STURE is compared with the various types of MOT approaches. The evaluations are shown in Table 1, Table 2 and Table 3 respectively. N1T (baisa2019development) is the approaches of hand-crafted representation, and PHD_DAL (2019Online), HISP (baisa2021robust), GMPHD_ReId (baisa2021occlusion) and GNN (li2020graph) are the methods based on deep representation. It is obvious that those approaches which use deep representations can perform better than the traditional approaches. Thus, to a certain extent, the proposed approach gains excellent performance compared with the published algorithms based on deep learning. Moreover, compared with discriminative appearance association in DDAL (RN601), MTP (kim2021discriminative) and LSST (feng2019multi), our method also have superior online tracking performance. SORT (bewley2016simple), IOU_KMM (urbann2021online), Tracktor++ (bergmann2019tracking) and FlowTracker (nishimura2021sdof) try to combine detection into tracking to improve running speed without losing too much precision.
The proposed online MOT algorithm STURE is evaluated on the test sets of the MOT16, MOT17 and MOT20 benchmarks and compared with other online methods. Table 1, Table 2 and Table 3 report the tracking results on MOT16, MOT17 and MOT20, respectively. The comparison covers MOTA, MOTP, IDF1, IDR, FP, FN, MT, ML, IDS and Frag. Compared with the second-best existing online methods, STURE achieves the best performance in MOTA and Frag on MOT16, in MOTA and FP on MOT17, and in MOTA and IDF1 on MOT20. These results show that STURE performs well against other online tracking methods in both precision and speed across various metrics, which demonstrates its advantages in MOT.
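For readers less familiar with the headline metrics, the following sketch computes MOTA and IDF1 from aggregate counts using their standard definitions (CLEAR MOT for MOTA, identity-based F1 for IDF1). The function names and the example counts are hypothetical and are not taken from the tables.

```python
def mota(fn: int, fp: int, ids: int, num_gt: int) -> float:
    """MOTA = 1 - (FN + FP + IDS) / GT, with counts summed over all frames."""
    if num_gt <= 0:
        raise ValueError("num_gt must be positive")
    return 1.0 - (fn + fp + ids) / num_gt


def idf1(idtp: int, idfp: int, idfn: int) -> float:
    """IDF1 = 2 * IDTP / (2 * IDTP + IDFP + IDFN)."""
    denom = 2 * idtp + idfp + idfn
    if denom <= 0:
        raise ValueError("at least one count must be positive")
    return 2 * idtp / denom


if __name__ == "__main__":
    # Made-up counts, for illustration only.
    print(mota(fn=25, fp=20, ids=5, num_gt=100))
    print(idf1(idtp=40, idfp=10, idfn=10))
```

Because MOTA subtracts all three error types from one, it can become negative when the errors outnumber the ground-truth objects.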
4.4 Ablation Study
We also conduct ablation experiments. One foundational module is removed at a time to verify the effectiveness of each component of the proposed approach, as depicted in Figure 6. Each foundational module is described below.
4.4.1 STURE method
To verify the validity of the STURE training approach for object association in the MOT task, we remove it and use the original detection and sequence features to associate the lost target. In addition, a convolutional operation is applied to the current detections, and the maximal tracking score on the confidence map is used to compute the affinity for object association. The results of detection-to-sequence object association on the two MOT benchmarks are shown in Figure 6.
These ablation experiments show that STURE improves performance significantly and consistently. Specifically, STURE increases the MOTA by 8.1% on MOT16 and 9.6% on MOT17, which indicates that temporal representation is crucial for detection-to-sequence feature learning and object association. The results show that STURE extracts spatial-temporal representations effectively from various perspectives and that these representations are mutually complementary.
We visualize the distribution of the learned mutual representations and the corresponding tracking results without and with STURE using t-SNE (van2008visualizing), as shown in Figure 5. The original representations of the same identity are not compact, as depicted in Figure 5 (a). After STURE, the learned mutual representations become more consistent, as depicted in Figure 5 (b), which improves robustness to MOT challenges, as depicted in Figure 5 (c) and (d). Thus, the reinforced representations can improve tracking performance significantly.
4.4.2 Non-local block
The non-local layers are used to extract temporal representations across the pedestrian sequence. To demonstrate their effectiveness, we remove them and use the classic CNN structure to learn sequence representations, and the sequence frame feature is replaced with the sequence representation obtained from 3D average pooling. In Figure 6, the variant without non-local layers still clearly outperforms the variant without STURE on MOT16 and MOT17; that is, removing STURE lowers tracking performance more than removing the non-local layers. Compared with the classical 3D pooling operation, the non-local layers appear to extract temporal features better, and they also help the detection learning network learn temporal features more effectively. Overall, the proposed STURE is much more important than the non-local blocks for improving tracking performance.
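To make the contrast between the two temporal aggregation schemes concrete, the following simplified sketch compares plain temporal average pooling with a non-local style operation in which every frame feature attends to all frames of the sequence. It is an illustration under strong assumptions: the learned projections of an actual non-local block are omitted, the features are plain vectors instead of feature maps, and the function names are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def temporal_average_pool(frames: List[Vector]) -> Vector:
    """Collapse a sequence of frame features into one vector by averaging."""
    t = len(frames)
    return [sum(f[d] for f in frames) / t for d in range(len(frames[0]))]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def non_local_temporal(frames: List[Vector]) -> List[Vector]:
    """Simplified non-local operation over time.

    Each frame feature is updated with a similarity-weighted sum of all
    frame features (dot-product similarity, softmax-normalized), added
    back through a residual connection. Learned projections are omitted.
    """
    out = []
    for xi in frames:
        sims = [sum(a * b for a, b in zip(xi, xj)) for xj in frames]
        weights = softmax(sims)
        agg = [sum(w * xj[d] for w, xj in zip(weights, frames))
               for d in range(len(xi))]
        out.append([x + a for x, a in zip(xi, agg)])
    return out


if __name__ == "__main__":
    seq = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
    print(temporal_average_pool(seq))
    print(non_local_temporal(seq))
```

Average pooling weights every frame equally, whereas the non-local operation lets each frame draw more strongly on the frames it resembles, which is one plausible reading of why it captures temporal structure better.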
The proposed STURE and the robust object association method jointly handle trajectory association in online MOT. Object association is performed between the current candidate detections and the historical tracklets.
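As a rough illustration of detection-to-tracklet association, the sketch below matches detection features to tracklet features by cosine similarity and accepts a match only above an appearance-similarity threshold. The greedy matching strategy, the function names and the threshold value in the example are assumptions made for illustration; the actual assignment procedure and threshold are not restated here.

```python
import math
from typing import Dict, List

Vector = List[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return dot / (na * nb)


def associate(
    detections: List[Vector],
    tracklets: Dict[int, Vector],
    sim_threshold: float,
) -> Dict[int, int]:
    """Greedily map tracklet id -> detection index by descending similarity.

    Pairs below sim_threshold are rejected; each tracklet and each
    detection is used at most once.
    """
    pairs = [
        (cosine_similarity(det, feat), tid, di)
        for di, det in enumerate(detections)
        for tid, feat in tracklets.items()
    ]
    pairs.sort(key=lambda p: p[0], reverse=True)
    matches: Dict[int, int] = {}
    used_detections = set()
    for sim, tid, di in pairs:
        if sim < sim_threshold:
            break  # all remaining pairs are even less similar
        if tid in matches or di in used_detections:
            continue
        matches[tid] = di
        used_detections.add(di)
    return matches


if __name__ == "__main__":
    tracklets = {1: [1.0, 0.0], 2: [0.0, 1.0]}
    detections = [[0.1, 0.9], [0.9, 0.1], [-1.0, -1.0]]
    print(associate(detections, tracklets, sim_threshold=0.5))
```

Detections left unmatched by such a step would be candidates for new trajectories, and unmatched tracklets candidates for termination, subject to the initialization and termination thresholds.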
4.5.1 Sequence Size
The number of pedestrian sequence images is a crucial parameter for the tracking performance of the designed architecture. The performance for different values of is shown in Figure 7. When is set to , the optimal MOTA and MOTP are obtained simultaneously.
4.5.2 Tracking hyper-parameters
In addition, several experiments are conducted to show the effect of the various thresholds, such as those for trajectory initialization, trajectory termination, appearance similarity, tracking score and IoU. The tracking parameters are tested under various settings. The MOTA varies drastically with the settings of and . Therefore, and are selected by grid search based on the trained association network. When and , the MOTA is maximized. The tracking parameters are therefore set as described above.
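The grid search over two thresholds can be sketched as follows. This is a minimal illustration: the two searched parameters are taken to be the tracking-score and appearance-similarity thresholds, following the earlier statement that these are selected by grid search; `evaluate_mota` is a hypothetical callable standing in for a full tracking run on validation data, and the candidate values in the example are arbitrary.

```python
from itertools import product
from typing import Callable, Iterable, Tuple


def grid_search(
    score_thresholds: Iterable[float],
    appearance_thresholds: Iterable[float],
    evaluate_mota: Callable[[float, float], float],
) -> Tuple[float, float, float]:
    """Return (score_threshold, appearance_threshold, mota) with the best MOTA."""
    best = None
    for ts, ta in product(score_thresholds, appearance_thresholds):
        value = evaluate_mota(ts, ta)
        if best is None or value > best[2]:
            best = (ts, ta, value)
    if best is None:
        raise ValueError("empty search grid")
    return best


if __name__ == "__main__":
    # Toy objective with a known optimum, standing in for a real tracking
    # evaluation; it says nothing about the thresholds used in the paper.
    toy_mota = lambda ts, ta: -((ts - 0.3) ** 2 + (ta - 0.7) ** 2)
    print(grid_search([0.1, 0.3, 0.5], [0.5, 0.7, 0.9], toy_mota))
```

Since each evaluation requires running the tracker, the cost grows with the product of the two grid sizes, which is why only the most sensitive thresholds are searched this way.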
Tracking examples are shown in Figure 8. In general, the single object tracker tends to drift when the object moves fast in the camera view or is affected by other objects. In this work, a robust association approach is used to handle this drift. In addition, the single object tracker handles occlusion well.
In this study, a novel STURE method is proposed for robust object association in the online MOT task. The STURE scheme learns temporal representations mutually in the sequence learning network and the detection learning network. Through this mutual learning, the detection learning network learns temporal features and becomes more robust, and the feature difference between current detections and historical sequences is reduced. This helps associate current detection results with historical sequences when the detection results are imperfect. Compared with existing online MOT methods, the proposed method obtains better tracking results on the MOT benchmarks, and extensive experiments demonstrate the effectiveness of the designed approach.
This work was partially supported by the National Key Research and Development Program of China (No. 2018YFB1308604), the National Natural Science Foundation of China (No. 61976086, 62106071), the Hunan Innovation Technology Investment Project (No. 2019GK5061), and the Special Project of Foshan Science and Technology Innovation Team (No. FS0AA-KJ919-4402-0069).