Person re-identification (Re-ID) aims to match pedestrians’ identities across disjoint camera views distributed at different locations [chen2018video, wu2020decentralised, wang2014person, wu2021generalising]. Early Re-ID studies concentrated on exploring appearance patterns unique to each identity from still images [fu2019horizontal, li2018harmonious, yao2019deep], and have shown remarkable discrimination capacity. However, such methods assume well-curated data in which the identity information is preserved in images. This assumption dramatically restricts their scalability and usability in many practical application scenarios, where video data are captured in uncontrollable environments that are the norm rather than the exception [li2019unsupervised, liu2017quality]. Video person Re-ID, beyond still images, requires analysing and assembling information from a sequence of video frames in each tracklet so as to build a more discriminative and robust representation of pedestrians in motion, minimising information corruption from poor frames and ID-switch [liu2019spatial, xu2017jointly, chen2018video, fu2019sta, zhang2018multi, hou2020temporal].
In the literature, one of the most commonly adopted techniques for assembling identity information from different video frames is averaging by pooling [suh2018part, si2018dual]. By assuming all frames are of equal importance, the pooling method neglects their diverse qualities caused by constantly changing environments and/or unreliable pedestrian detections. Therefore, the aggregated tracklet representations are likely to be impacted by various types of noise, as shown in Fig. 1. To selectively assemble video frames rather than averaging them, attention mechanisms [hu2018squeeze, vaswani2017attention, wang2018non, woo2018cbam, huang2021cross] have been studied to explore the correlations between the global visual features of frames (Fig. 1 (b)), so that the common appearance patterns shared among frames in the same tracklet are maintained while unusual and low-quality frames are removed or ignored [li2020relation, matiyali2020video, liu2021watching, wang2021robust]. In contrast to the global appearance correlations, an alternative approach [zhang2020multi, hou2019vrstc, hou2020temporal] compares video frames by local parts (Fig. 1
(a)) so as to identify outliers that are significantly misaligned with other frames in a tracklet. Although they share the same objective of adaptively assembling only the relevant video frames, the two approaches differ in the granularity of the information they exploit. In isolation, both are sub-optimal in different real-world video scenes: the local-parts approach is fragile if the detected pedestrians are not well-aligned, while the global-appearance approach is spatially insensitive, tending to miscorrelate patterns of interest in the background. Beyond attentive assembling, Recurrent Neural Networks (RNNs) [xu2017jointly, mclaughlin2016recurrent] have also been exploited to model temporal information for representing frame sequences in video tracklets. However, this approach is also vulnerable to noisy frames without careful frame selection [wu2019spatio].
In this work, we propose a tracklet frame assembling approach to video person Re-ID termed Local-Global Associative Assembling (LOGA). As shown in Fig. 1 (d), the LOGA method adaptively assembles video frames in the same tracklet by a Local Aligned Quality (LAQ) module and a Global Correlated Quality (GCQ) module, which assess the importance/relevance of frames by both their local part alignments and global appearance correlations as well as their mutual reinforcement. Whilst most existing spatial-temporal attentive methods focus on combining temporal information with intra-frame spatial attention, we aim to exploit inter-frame complements more effectively, a different focus that stands to benefit from advances in per-frame learning. Specifically, the LAQ module divides all video frames in a tracklet into the same set of spatial parts and assesses each frame’s quality by its part-wise alignment to the other frames, so as to measure both inter-frame visual similarity and spatial alignment. On the other hand, the GCQ module is applied to the holistic feature representation of each frame to consider inter-frame global appearance correlations, which is more robust to local part misalignment but spatially insensitive, hence less reliable against miscorrelation of information, e.g. irrelevant patterns in the background. Furthermore, to associate the local and global information and exploit their mutual benefits, we take the tracklet representation assembled by the LAQ as its prototype and compare the global visual feature of each frame with it in the GCQ module, so that the two modules are encouraged to find a trade-off between the local and global information and cope with different types of noise more reliably.
Contributions of this work are three-fold: (1) To the best of our knowledge, we make the first attempt to explore the association and mutual promotion of frames’ local part alignments and global appearance correlations in assembling a sequence descriptor, so as to improve the model’s robustness to noisy frames and inter-frame ID-switch in video Re-ID. (2) We propose a new video person Re-ID model termed Local-Global Associative Assembling (LOGA) that learns a discriminative and reliable representation for video tracklets by adaptively assembling frames of diverse qualities. (3) We introduce a local-assembled global appearance prototype to associate the local and global visual information by exploiting their mutual agreements to facilitate the learning of a discriminative tracklet representation.
Extensive experiments show the performance advantages and superior robustness of the proposed LOGA model over the state-of-the-art video Re-ID models on four video Re-ID benchmarks: MARS [zheng2016mars], Duke-Video [ristani2016performance, wu2018exploit], Duke-SI [li2019unsupervised], and iLIDS-VID [wang2014person].
2 Related Work
Video person Re-ID aims to learn an expressive appearance feature and/or distance metric from a sequence of frames, i.e., a video tracklet. To take advantage of the additional temporal information and complementary spatial information intrinsically available in video tracklets, existing approaches explore either local part alignments [song2018region, bao2019preserving, zhang2020multi, hou2019vrstc, hou2020temporal] or global appearance correlations [liu2017quality, li2019global, zhang2019scan, li2020temporal, li2020relation, matiyali2020video, liu2021watching, wang2021robust] to assemble the per-frame representations with high robustness to their diverse qualities.
Local part alignments.
Considering the consistent body structure shared among humans and the arbitrary combinations of body parts’ appearances unique to each identity, it is intuitive to differentiate images/frames of pedestrians by their visual similarity in different parts. In this spirit, local-parts assembling approaches [song2018region, bao2019preserving, zhang2020multi, hou2019vrstc, hou2020temporal] apply per-part comparisons of video frames in the same tracklet to identify outliers which are misaligned with others in most local parts, so as to restore the corrupted parts of frames with the complements of others [hou2019vrstc, hou2020temporal] or downgrade their importance in frame assembling [song2018region, bao2019preserving, zhang2020multi]. However, the hypothesis that a pedestrian detected in different video frames is mostly well-aligned is often untrue due to unreliable auto-generated person bounding boxes, e.g. the importance of a noise-free video frame might be underestimated due to the spatial shift of its detected bounding box from those in other frames. In this work, we further consider the holistic visual similarity of video frames when assessing their quality, which helps avoid inaccurate assessments caused by part misalignments.
Global appearance correlations.
In contrast to the local-parts approaches, methods based on global-appearance [liu2017quality, li2019global, zhang2019scan, li2020temporal, li2020relation, matiyali2020video, liu2021watching, wang2021robust]
take advantage of the strong representational power of convolutional neural networks (CNNs) [goodfellow2016deep, lecun2015deep] to learn correlations between video frames holistically, so that irrelevant frames, which are likely of low quality, are suppressed in frame assembling. However, CNN features can be insensitive to spatial shifts, resulting in potential miscorrelations of visually similar but irrelevant parts, e.g. the ID-switch issue shown in Fig. 1 (b) is hard to detect due to the subtle differences in the two pedestrians’ outfits. This will result in misassembling frames to represent a tracklet. To address this problem, we propose to enhance the global-appearance methods by jointly exploring frames’ holistic visual correlations and their local part alignments, considering inter-frame spatial relations.
Beyond the temporal assembling approaches discussed above, spatial attention [woo2018cbam] is also popular in both image and video person Re-ID [li2018harmonious, zhong2020robust, xiang2020part, fu2019sta, wu2019spatio]. By exploring the correlations of local parts within a still image or across different video frames, the spatial attention mechanism is able to adaptively focus on the more discriminative regions regardless of their spatial locations. However, this is prone to miscorrelation of information in video frames, as in the global-correlated assembling approaches. In contrast, our LAQ module investigates the alignments of the same part across different video frames, focusing on exploiting complementary inter-frame information in a tracklet.
There are a few recent attempts at jointly exploring local and global information for frame assembling in video Re-ID [chen2020frame, yang2020spatial]. However, they learn from these two types of information with few interactions, either by a dual-branch network [chen2020frame] or by feature concatenation [yang2020spatial], and overlook the local-global mutual impacts (Fig. 1 (c)). We validate the effectiveness of the proposed LOGA over those assembling strategies in both performance evaluation (Section 4.1) and ablation analysis (Section 4.2).
3 Video Person Re-ID
Given video tracklets, each containing frames depicting pedestrians in motion, the objective of video person Re-ID is to derive a representation model from the tracklet data which is capable of extracting discriminative feature representations for Re-ID matching across disjoint camera views. Considering the diverse and unknown sources of noise that commonly exist in surveillance videos, which lead to distractions in different frames, it is essential for the model to effectively recognise visual patterns that are specific to each pedestrian and to selectively assemble frames into a tracklet representation. This is inherently challenging due to the uncertain nature of noise in tracklets of people in motion against backgrounds of visually similar distractors.
3.1 Local-Global Associative Assembling
In this work, we propose a Local-Global Associative Assembling (LOGA) model to address this problem by selecting information from video frames in the same tracklet according to both their local part alignments and global appearance correlations, as well as the synergy and mutual promotion of these two types of information. For notational clarity, in the following we focus on the formulation of assembling frames in a single video tracklet and omit its tracklet index. As shown in Fig. 2, the video tracklet is first fed into a Local Aligned Quality (LAQ) module to assess the quality of frames regarding their part-wise alignment:
In Eq. (1), the learnable parameters of the LAQ yield the importance of each frame as determined by its alignments with other frames in local parts. Then, a Global Correlated Quality (GCQ) module is devised and applied to the holistic visual representation of each frame to determine their global appearance correlations. Instead of focusing only on the global visual features, which are prone to spatially insensitive miscorrelation, we explore the mutual synergy between local and global information by associating LAQ and GCQ through a prototypical descriptor. This assembles frames’ global features according to their local-parts quality for correlation exploration in GCQ:
where each frame’s quality regarding its global-appearance features is computed with the learnable parameters of the GCQ module. In this way, the final representation of a tracklet is obtained by associating LAQ and GCQ through the prototype:
With the tracklet-level representations, the LOGA model can be trained with arbitrary conventional Re-ID objectives in an end-to-end manner. In inference, a generic distance metric (e.g. cosine distance) is used to measure pairwise visual similarity of tracklets for video Re-ID matching. The overall learning process of the LOGA model is depicted in Algorithm 1.
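As a concrete illustration of the inference step, the following minimal sketch ranks gallery tracklets by cosine similarity to a query representation. The function name `cosine_rank` and the array shapes are our own illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def cosine_rank(query, gallery):
    """Rank gallery tracklet representations by cosine similarity to a query.

    query:   (D,)   tracklet-level feature of the query
    gallery: (G, D) tracklet-level features of the gallery
    Returns gallery indices sorted from best to worst match.
    """
    q = query / np.linalg.norm(query)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    similarities = g @ q              # cosine similarity per gallery entry
    return np.argsort(-similarities)  # descending similarity
```

The top-ranked gallery index is taken as the Re-ID match for the query.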
Local aligned quality.
To explore the visual similarity of frames in terms of their local alignments, we separate them uniformly into non-overlapping patches (parts) and apply a patch-wise cross-frame convolution to recognise the aligned local patterns. This is accomplished by first flattening the 2D frames and then stacking them in the channel dimension as the raw representation of the tracklet, maintaining the inter-frame spatial correspondence. A 1D convolution is then applied to this representation to explore the per-part visual patterns,
where the 1D convolution function uses a trainable kernel. The size of the kernel is determined by the granularity of the spatial separation, given the height and width of the frames. The computed results
encode the part-wise importance of every frame, which is then aggregated by pooling followed by a multi-layer perceptron (MLP) to obtain the per-frame scores:
In Eq. (6), a frame-wise mean pooling function is followed by a single-layer MLP activated by ReLU. The resulting scores are then normalised by a softmax function as the indication of per-frame importance to the tracklet. In this way, the LAQ learns to assess each frame’s quality by its local part alignments to the other frames, so as to identify misaligned outlier frames and suppress them from representing a tracklet.
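The LAQ computation described above can be sketched as follows. This is a simplified illustration under stated assumptions — single-channel flattened frame features, a scalar single-layer MLP, and hypothetical names such as `laq_scores` — rather than the exact implementation.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def laq_scores(frames, conv_w, mlp_w, mlp_b):
    """Sketch of Local Aligned Quality (LAQ) scoring.

    frames: (T, L)    T flattened frames, L = H * W (single channel for brevity)
    conv_w: (T, T, k) cross-frame 1D conv weights; kernel size k = part size,
                      applied with stride k over non-overlapping parts
    mlp_w, mlp_b:     scalars of a single-layer ReLU MLP on the pooled response
    Returns softmax-normalised per-frame importance scores, shape (T,).
    """
    T, L = frames.shape
    k = conv_w.shape[-1]
    parts = frames.reshape(T, L // k, k)            # (T, P, k) per-part slices
    # stacking frames in the channel dim: each output response at a part
    # mixes the aligned parts of every input frame
    resp = np.einsum('ofk,fpk->op', conv_w, parts)  # (T, P) part-wise responses
    pooled = resp.mean(axis=1)                      # frame-wise mean pooling
    hidden = np.maximum(0.0, mlp_w * pooled + mlp_b)  # single-layer MLP + ReLU
    return softmax(hidden)                          # per-frame importance
```

A misaligned frame whose parts disagree with the others receives weak responses and hence a small normalised score.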
Global correlated quality.
The GCQ module is formulated to explore inter-frame correlations according to the frames’ global appearances. However, the spatially invariant characteristic of CNN features tends to miscorrelate patterns of interest with potential noise in the background, i.e. completely ignoring spatial part alignment. Hence, we propose to establish the GCQ on the results yielded by the LAQ so as to associate them by their synergy. Specifically, given the frames’ importance computed by Eq. (6) regarding their local part alignments, we first assemble their visual features accordingly by Eq. (2), which serves as the appearance prototype of a tracklet. Then, the global-appearance quality of a frame is estimated according to the correlation between its global features and the prototype:
The two functions in Eq. (7) linearly transform the prototype and the frame’s features, respectively; both are followed by batch normalisation. In this way, the video frames with higher appearance correlations to the pedestrian’s prototype will be highlighted with larger quality scores, while the mis-correlated ones will be suppressed.
Given the global-appearance quality of frames, their visual features can be selectively aggregated by:
where the transformation is identical in form to those in Eq. (7) but with independent parameters. Rather than taking this aggregation as the final representation of the tracklet, in light of residual learning [he2016deep], we distil the complementary information from the frames’ global appearance correlations to enhance the prototype computed from local-parts quality, so as to minimise representational error from identity-irrelevant part misalignments. To that end, we further learn the residual with respect to the prototype and obtain the visual feature representation of the tracklet by:
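The prototype assembling (Eq. (2)), GCQ scoring (Eq. (7)), and residual enhancement (Eq. (9)) can be sketched together as follows. Batch normalisation is omitted and all names and shapes are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def gcq_assemble(feats, laq_w, W_p, W_g, W_v):
    """Sketch of GCQ scoring and residual assembling.

    feats: (T, D)  global per-frame features
    laq_w: (T,)    LAQ importance scores (Eq. (6)), summing to 1
    W_p, W_g, W_v: (D, D) linear transforms (batch normalisation omitted)
    Returns the final tracklet representation, shape (D,).
    """
    proto = laq_w @ feats                   # local-assembled prototype (Eq. (2))
    # correlation between each (transformed) frame and the prototype (Eq. (7))
    corr = (feats @ W_g.T) @ (W_p @ proto)  # (T,)
    gcq_w = softmax(corr)                   # global-correlated quality scores
    residual = gcq_w @ (feats @ W_v.T)      # globally assembled residual
    return proto + residual                 # residual enhancement (Eq. (9))
```

Note how the prototype both anchors the correlation scores and remains part of the final representation, so global correlations only refine rather than replace the local-aligned assembly.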
This design not only explores the global features of frames but also considers their local part alignments for optimising a discriminative tracklet representation.
3.2 Model Training
Given the formulations of LAQ and GCQ, the proposed LOGA model can benefit from conventional learning supervision. Specifically, the LOGA model is jointly trained with a softmax cross-entropy loss and a triplet ranking loss [hermans2017defense]. The softmax cross-entropy loss is employed to optimise identity classification:
In Eq. (10), a one-hot indicator encodes the ground-truth identity of the tracklet, and a linear classifier maps the tracklet’s representation into a prediction distribution over the total number of identities. Moreover, the triplet ranking loss explicitly draws the features of a positive tracklet pair sharing the same identity closer in the learned latent space while pushing negative pairs apart:
where the two compared representations are of two randomly sampled tracklets with the same and different ground-truth labels, respectively, their distances are measured in the feature space, and a predefined margin is applied. The overall optimisation objective of a batch of tracklets is then formulated by combining the two losses:
where the batch term denotes the size of a mini-batch. Since the objective function in Eq. (12) is differentiable, the LOGA model can be trained end-to-end by the conventional stochastic gradient descent algorithm in a batch-wise manner.
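A minimal sketch of the combined objective follows, assuming pre-formed (anchor, positive, negative) index triples and the margin of 0.3 used in our experiments; the function names are our own and batching details are simplified.

```python
import numpy as np

def cross_entropy(logits, label):
    """Softmax cross-entropy for one tracklet's identity prediction."""
    z = logits - logits.max()
    log_prob = z - np.log(np.exp(z).sum())
    return -log_prob[label]

def triplet_loss(anchor, positive, negative, margin=0.3):
    """Triplet ranking loss with Euclidean distance and a fixed margin."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(0.0, d_pos - d_neg + margin)

def batch_loss(feats, logits, labels, triplets, margin=0.3):
    """Combined objective over a mini-batch, in the spirit of Eq. (12).

    feats:    (B, D) tracklet representations
    logits:   (B, C) identity predictions from the linear classifier
    labels:   (B,)   ground-truth identity indices
    triplets: list of (anchor, positive, negative) index triples
    """
    ce = np.mean([cross_entropy(logits[i], labels[i]) for i in range(len(labels))])
    tri = np.mean([triplet_loss(feats[a], feats[p], feats[n], margin)
                   for a, p, n in triplets])
    return ce + tri
```

The cross-entropy term supervises identity classification while the triplet term directly shapes the metric space used at inference.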
The proposed Local-Global Associative Assembling (LOGA) is evaluated on five video-based Re-ID datasets: MARS [zheng2016mars], Duke-Video [ristani2016performance, wu2018exploit], Duke-SI [li2019unsupervised], iLIDS-VID [wang2014person], and PRID2011 [hirzer2011person]. Example tracklets are shown in Fig. 3. MARS has 20,478 tracklets of 1,261 persons captured from a camera network with 6 near-synchronised cameras. Duke-Video is a newly released large-scale benchmark of 1,812 person identities with 4,832 tracklets. Duke-SI is a fully auto-generated version of Duke-Video without manual frame selection, and is thus more practical and challenging. The iLIDS-VID dataset is relatively small-scale, including 600 video tracklets of 300 persons captured by two disjoint cameras in an airport arrival hall. PRID2011 is another small-scale dataset containing 1,134 tracklets from 934 identities captured by two cameras.
To evaluate the effectiveness of the proposed LOGA model, we adopt two commonly used performance metrics in person Re-ID: the Cumulative Matching Characteristics (CMC) and the mean Average Precision (mAP) [zheng2015scalable].
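For reference, the two metrics can be computed per query as in the following sketch; these are the standard definitions and the helper names are illustrative.

```python
import numpy as np

def average_precision(ranked_matches):
    """AP for one query: ranked_matches[i] is True if the i-th ranked
    gallery tracklet shares the query's identity."""
    hits = np.flatnonzero(ranked_matches)
    if hits.size == 0:
        return 0.0
    # precision at each correct hit, averaged over all correct matches
    precisions = (np.arange(hits.size) + 1) / (hits + 1)
    return float(precisions.mean())

def cmc_at_k(ranked_matches, k):
    """Rank-k matching rate for one query: 1 if a correct match
    appears within the top-k results, else 0."""
    return int(np.asarray(ranked_matches)[:k].any())
```

mAP averages `average_precision` over all queries, and the CMC curve averages `cmc_at_k` over all queries for each rank k.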
For fair comparisons, we took a ResNet50 [he2016deep]
as the backbone network for global visual feature extraction [gu2020appearance]. Given that video tracklets are composed of arbitrary numbers of frames, we split each tracklet into several clips with a fixed length of 10. We randomly sampled identity instances, each with 8 clips, to construct a mini-batch in model training. All frames were resized and augmented by random horizontal flipping. We used Adam [kingma2014adam] with weight decay for model optimisation. The margin in Eq. (11) is set to 0.3, and the dimension of the representations follows [gu2020appearance, luo2019bag]. The kernel size for the 1D convolution in Eq. (5
) is set to 10. The model was trained on two P100 GPUs for 240 epochs, with the learning rate linearly decayed from its initial value by a fixed factor at regular intervals of training epochs. During the testing stage, the tracklet-level representation was obtained by average pooling the learned representations of its clips. Cosine distance was then used to measure the distances between a query and every tracklet in the gallery for Re-ID.
4.1 Comparisons to the State-of-the-Art
In Table 1, we compare the proposed LOGA model with a wide range of state-of-the-art video person Re-ID methods. The LOGA model yielded the best results across the board, which suggests the efficacy of associatively exploring local part alignments and global appearance correlations in assembling a discriminative representation of a tracklet. Whilst maintaining its competitiveness on the large-scale MARS and the well-curated Duke-Video datasets, the LOGA model achieved compelling improvements over the other methods on iLIDS-VID, and its performance advantage is more significant on the automatically detected and segmented Duke-SI, where LOGA outperformed the others by 1.9%, 1.7% and 1.1% on mAP, rank-1 and rank-5, respectively.
4.2 Ablation Study
We conducted further studies to experimentally investigate the effectiveness of exploring the complementary local and global information by considering one while ablating the other, and also to demonstrate the superiority of our associative assembling over the dual-branch strategy [chen2020frame], which uses the local and global information separately. We also provide comprehensive visualisations for intuitive understanding.
We started by examining the role of local part alignments by introducing LAQ alone for frame assembling. Fig. 4 (pink vs. orange) shows that both metrics decrease on most datasets. This is caused by the unrealistic assumption that the local regions of all frames are well-aligned, which is shown to be unreliable due to uncontrollable environments and fragile detection/segmentation. We further examined the importance of global appearance by solely employing GCQ for frame assembling. The unsatisfying performance reported in Fig. 4 (pink vs. gray) suggests that assessing the quality of frames solely according to the unobstructed global appearance is unreliable, owing to the fine-grained details being ignored. In contrast, when both LAQ and GCQ are adopted, LOGA exhibits a remarkable advantage over all the other counterparts (green vs. others). This demonstrates that both LAQ and GCQ are indispensable.
Effects of assembling strategy.
We further studied the effects of different strategies for joining the local and global information in frame assembling: (1) separate assembling by two individual branches learned in parallel according to the two kinds of information [chen2020frame]; (2) direct connection of local and global information by rescaling the per-frame visual features according to their normalised local alignment scores (Eq. (6)) and then exploring their global correlations by conventional self-attention on the rescaled features; (3) associative assembling by combining the local-assembled prototype and the global-assembled residual (Eq. (9)) to exploit their synergy. The comparison given in Fig. 6 (green vs. others) shows a noticeable advantage of LOGA over the dual-branch and direct-connection counterparts, which demonstrates the effectiveness of the proposed associative assembling strategy.
Effects of local part size.
We studied the effects of local part size by varying the kernel size of the 1D convolution in Eq. (5) on iLIDS-VID. The experimental results shown in Fig. 6 indicate our model’s robustness to this hyper-parameter within a wide range of values, thanks to the subsequent GCQ module which helps refine the local alignment scores according to global correlations. Given that increasing the kernel size does not benefit performance but increases the model’s complexity, we set it to 10 in practice.
Fig. 7 shows several video clips stacked with their activation maps generated according to their local-parts quality. Each frame’s local-aligned score (upper, Eq. (6)) and global-correlated score (lower, Eq. (7)) are attached at its bottom-right corner. As exhibited, LOGA is robust to various kinds of noise, providing faithful importance scores for assembling a discriminative representation. The activation maps accurately reveal the critical regions for Re-ID. The global-correlated scores are obtained with the complementary appearance information and can thus reliably adjust biased local-aligned scores. For instance, as shown in Fig. 7, LAQ enables the network to focus on the target instead of the switched ID or the irrelevant multi-detected ID, as shown in the activation maps. For low-quality frames caused by partial detection, scale variation, occlusion, etc., LAQ can faithfully assess the local quality. The suitable importance scores revealed by the association of LAQ and GCQ efficiently guide LOGA to learn the representation from the most discriminative regions in the most discriminative frames.
In this work, we present a novel Local-Global Associative Assembling (LOGA) method for video person Re-ID that selectively assembles video frames of diverse qualities to derive a more reliable and discriminative representation of a video tracklet. This is accomplished by assessing each frame’s quality according to both its local part alignments and global appearance correlations, so as to refrain from integrating undesired visual information into the tracklet’s representation and causing identity mismatch. Different from existing approaches which explore either local or global information separately, our LOGA method constructs a local-assembled global appearance prototype of a tracklet so as to alleviate biased quality assessment caused by either identity-irrelevant misalignment or spatially insensitive appearance miscorrelation. Extensive experiments on five benchmark datasets show the performance advantages of LOGA over a wide range of state-of-the-art video Re-ID methods. Detailed ablation studies are also conducted to provide in-depth discussions about the rationale and essence of different components in our model design.