Recent advances in Metric Learning [schroff2015metric1-facenet, sohn2016metric2-npair, wang2017metric3-angular, ustinova2016metric4-hist] give researchers the foundation for computing suitable distance metrics between data points. In this context, Re-Identification (Re-ID) has greatly benefited in diverse domains [zheng2016survey-person, khan2019survey-vehicle, schneider2019survey-animal], as the common paradigm requires distance measures that are robust to variations in background clutter, as well as to different viewpoints. To meet these criteria, various deep learning based approaches leverage videos to provide detailed descriptions for both query and gallery items. However, such a setting, known as Video-To-Video (V2V) Re-ID, does not represent a viable option in many scenarios (e.g. surveillance) [zhang2017lstm, xie2019i2vreason1, nguyen2018i2vreason2, gu2019TKP], where the query comprises a single image (Image-To-Video, I2V).
As observed in [gu2019TKP], a large gap in Re-ID performance still persists between V2V and I2V, highlighting the number of query images as a critical factor in achieving good results. In contrast, we argue that the learnt representation should not be heavily affected when few images are shown to the network (e.g. only one). To bridge this gap, [gu2019TKP, bhardwaj2019fewerframes] propose a teacher-student paradigm, in which the student, in contrast with the teacher, has access to only a small fraction of the frames in the video. Since the student is trained to mimic the output space of its teacher, it generalises better than its teacher when a single frame is available. Note that these approaches rely on transferring temporal information: as datasets often come with tracking annotation, they can guide the transfer from a tracklet into one of its frames. In this respect, we point out the limits of transferring temporal information: it is reasonable to assume a high correlation between frames from the same tracklet (Fig. 1a), which may leave the transfer underexploited. Moreover, limiting the analysis to the temporal domain does not guarantee robustness to variation in background appearances.
Here, we take a step forward and consider which information to transfer, shifting the paradigm from time to views: we argue that more valuable information arises when ensembling diverse views of the same target (Fig. 1c). This information often comes for free, as various datasets [zheng2016mars, wu2018duke1, liu2016veri, bergamini2018multi] provide images capturing the same target from different camera viewpoints. To support our claim, Fig. 1 (right) reports pairwise distances computed on top of ResNet-50, when trained on Person and Vehicle Re-ID. In more detail: matrices from Fig. 1b visualise the distances when tracklets are provided as input, whereas Fig. 1d shows the same for sets of views. As one can see, leveraging different views leads to a more distinctive blockwise pattern: namely, activations from the same identity are more consistent than the ones computed in the tracklet scenario. As shown in [tung2019similarity], this reflects a higher capacity to capture the semantics of the dataset, and therefore richer knowledge that a teacher can transfer to a student.
Based on the above, we propose Views Knowledge Distillation (VKD), which transfers the knowledge lying in several views in a teacher-student fashion. VKD devises a two-stage procedure, which pins the visual variety as a teaching signal for a student who has to recover it using fewer views. We remark the following contributions: i) the student outperforms its teacher by a large margin, especially in the Image-To-Video setting; ii) a thorough investigation shows that the student focuses more on the target compared to its teacher and discards uninformative details; iii) importantly, we do not limit our analysis to a single domain, but instead achieve strong results on Person, Vehicle and Animal Re-ID.
2 Related Works
The I2V Re-ID task has been successfully applied to multiple domains. In person Re-ID, [wang2017p2snet] frames it as a point-to-set task, where image and video domains are aligned using a single deep network. The authors of [zhang2017lstm] exploit temporal information by aggregating frame features via a Long Short-Term Memory network. Finally, a dedicated sub-network aggregates video features and matches them against single-image query ones. The authors of MGAT [bao2019maskedGAT] employ a Graph Neural Network to model relationships between samples from different identities, thus enforcing similarity in the feature space. Dealing with vehicle Re-ID, the authors of [liu2017provid] introduce a large-scale dataset (VeRi-776) and propose PROVID and PROVID-BOT, which combine appearance and plate information in a progressive fashion. Differently, RAM [liu2018ram] exploits multiple branches to extract global and local features, imposing a separate supervision on each branch and devising an additional one to predict vehicle attributes. VAMI [zhou2018vami] employs a viewpoint-aware attention model to select core regions for different viewpoints. At inference time, it obtains a multi-view descriptor through a conditional generative network, inferring information regarding the unobserved viewpoints. Differently, our approach asks the student to do it implicitly and in a lightweight fashion, thus avoiding the need for additional modules. Similarly to VAMI, [chu2019vanet] predicts the vehicle viewpoint along with appearance features; at inference, the framework provides distances according to the predicted viewpoint.
Knowledge Distillation was first investigated in [romero2014fitnets, hinton2015distilling, zagoruyko2016payingattention] for model compression: the idea is to instruct a lightweight model (student) to mimic the capabilities of a deeper one (teacher). As a benefit, one can achieve both faster inference and lower memory consumption, without experiencing a large drop in performance. In this work, we benefit from the techniques proposed in [hinton2015distilling, tung2019similarity] for a different purpose: we are not primarily interested in training a lightweight module, but in improving the original model itself. In this framework, often called self-distillation [furlanello2018born, yang2018tolerantteacher], the transfer occurs from the teacher to a student with the same architecture, with the aim of improving the overall performance at the end of the training. Here, we go a step further and introduce an asymmetry between the teacher and the student, which has access to fewer frames. In this respect, our work closely relates to what [bhardwaj2019fewerframes] devises for Video Classification. Besides facing another task, a key difference exists: while [bhardwaj2019fewerframes] limits the transfer along the temporal axis, our proposal advocates distilling many views into fewer ones. On this latter point, we shall show that the teaching signal can be further enhanced when opening to diverse camera viewpoints. In the Re-Identification field, Temporal Knowledge Propagation (TKP) [gu2019TKP] similarly exploits intra-tracklet information to encourage the image-level representations to approach the video-level ones.
In contrast with TKP: i) we do not rely on matching internal representations but instead their distances solely, thus making our proposal viable for cross-architecture transfer too; ii) at inference time, we make use of a single shared network to deal with both image and video domains, thus halving the number of parameters; iii) during transfer, we benefit from a larger visual variety, emerging from several viewpoints.
We pursue the aim of learning a function mapping a set of images into a representative embedding space. Specifically, each set is a sequence of bounding-box crops depicting a target (e.g. a person or a car), whose identity we are interested in inferring. We take advantage of Convolutional Neural Networks (CNNs) to model such a function. Here, we look for two distinctive properties, aspiring to representations that are i) invariant to differences in background and viewpoint and ii) robust to a reduction in the number of query images. To achieve this, our proposal frames the training algorithm as a two-stage procedure, as follows:
First step (Sec. 3.1): the backbone network is trained for the standard Video-To-Video setting.
Second step (Sec. 3.2): we appoint it as the teacher and freeze its parameters. Then, a new network with the role of the student is instantiated. As depicted in Fig. 2, we feed frames representing different views as input to the teacher and ask the student to mimic the same outputs from fewer frames.
3.1 Teacher Network
Without loss of generality, we will refer to ResNet-50 [he2016deep] as the backbone network, namely a module mapping each image of the input set to a fixed-size representation (in this case ). Following previous works [luo2019bag, gu2019TKP], we initialise the network weights on ImageNet and additionally include a few amendments [luo2019bag] (i.e. batch normalisation followed by a linear layer). In addition, to benefit from fine-grained spatial details, the stride of the last residual block is decreased from 2 to 1.
Given a set of images, several solutions [liu2017quality, zhang2017lstm, liu2019NVAN] may be assessed for designing the aggregation module, which fuses a variable-length set of representations into a single one. Here, we simply compute the set-level embedding through temporal average pooling. While we acknowledge that better aggregation modules exist, we do not focus on devising a new one, but instead on improving the upstream feature extractor.
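As a minimal sketch of the aggregation step described above, the following Python snippet averages per-frame embeddings into a single set-level vector. The function name `average_pool` and the toy values are illustrative assumptions and are not taken from the original implementation.

```python
from typing import List, Sequence


def average_pool(frame_embeddings: Sequence[Sequence[float]]) -> List[float]:
    """Fuse a variable-length set of per-frame embeddings into one vector by averaging."""
    if not frame_embeddings:
        raise ValueError("at least one frame embedding is required")
    dim = len(frame_embeddings[0])
    if any(len(e) != dim for e in frame_embeddings):
        raise ValueError("all frame embeddings must share the same dimensionality")
    n = len(frame_embeddings)
    # Average each coordinate independently over the frames of the set.
    return [sum(e[d] for e in frame_embeddings) / n for d in range(dim)]


# Toy example: three frames, each described by a 2-dimensional embedding.
set_embedding = average_pool([[1.0, 0.0], [3.0, 2.0], [2.0, 4.0]])
```

Because the average is taken over however many frames are supplied, the same function serves both a full set and a single image, in which case it returns the frame embedding unchanged.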
We train the base network, which will be the teacher during the following stage, combining a classification term (cross-entropy) with the triplet loss. (For the sake of clarity, all the loss terms refer to a single example; in the implementation, we extend the penalties to a batch by averaging.) The first can be formulated as:
where and represent the one-hot labels (identities) and the output of the softmax respectively. The second term encourages distance constraints in feature space, moving closer representations from the same target and pulling away ones from different targets. Formally:
where and are the hardest positive and negative for an anchor within the batch. In doing so, we rely on the batch hard strategy [hermans2017defensetriplet] and include P identities coupled with K samples in each batch. Importantly, each set comprises images drawn from the same tracklet [liu2019NVAN, fu2019STA].
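The two training terms can be sketched as follows. This is only an illustration under stated assumptions: distances are taken to be Euclidean, and the triplet term uses the soft-margin variant, one of the formulations associated with the batch hard strategy [hermans2017defensetriplet]; the exact form used here is not recoverable from the text. All function names and toy values are hypothetical.

```python
import math
from typing import List, Sequence


def cross_entropy(softmax_output: Sequence[float], identity: int) -> float:
    """Cross-entropy between a one-hot identity label and the softmax output."""
    return -math.log(softmax_output[identity])


def batch_hard_soft_margin(embeddings: Sequence[Sequence[float]],
                           identities: Sequence[int]) -> float:
    """Average soft-margin triplet loss using the hardest positive/negative per anchor."""
    losses: List[float] = []
    for a, anchor in enumerate(embeddings):
        positives = [math.dist(anchor, e) for i, e in enumerate(embeddings)
                     if i != a and identities[i] == identities[a]]
        negatives = [math.dist(anchor, e) for i, e in enumerate(embeddings)
                     if identities[i] != identities[a]]
        if not positives or not negatives:
            continue  # this anchor has no valid triplet in the batch
        hardest_positive = max(positives)  # farthest sample of the same identity
        hardest_negative = min(negatives)  # closest sample of another identity
        # Assumed soft-margin form: ln(1 + exp(d_pos - d_neg)).
        losses.append(math.log1p(math.exp(hardest_positive - hardest_negative)))
    if not losses:
        raise ValueError("the batch contains no valid triplet")
    return sum(losses) / len(losses)


# Toy batch with P = 2 identities and K = 2 set-level embeddings each.
batch = [[0.0, 0.0], [0.0, 1.0], [5.0, 0.0], [5.0, 1.0]]
ids = [0, 0, 1, 1]
triplet_term = batch_hard_soft_margin(batch, ids)
classification_term = cross_entropy([0.1, 0.9], identity=1)
```

In this toy batch every anchor sees its hardest positive at distance 1 and its hardest negative at distance 5, so the triplet term is small; bringing the two identities closer together would increase it.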
3.2 Views Knowledge Distillation (VKD)
After training the teacher, we propose to enrich its representation capabilities, especially when only few images are made available to the model. To achieve this, our proposal bets on the knowledge we can gather from different views, depicting the same object under different conditions. When facing re-identification tasks, one can often exploit camera viewpoints [zheng2016mars, ristani2016duke, liu2016veri] to provide a larger variety of appearances for the target identity. Ideally, we would like to teach a new network to recover such a variety even from a single image. Since this information may not be inferred from a single frame, this can lead to an ill-posed task. Still, one can underpin this knowledge as a supervision signal, encouraging the student to focus on important details and favourably discover new ones. On this latter point, we refer the reader to Section 4.4 for a comprehensive discussion.
Views Knowledge Distillation (VKD) stresses this idea by forcing a student network to match the outputs of the teacher . In doing so, we: i) allow the teacher to access frames from different viewpoints; ii) force the student to mimic the teacher output starting from a subset with cardinality (in our experiments, and ). The frames in are uniformly sampled from without replacement. This asymmetry between the teacher and the student leads to a self-distillation objective, where the latter can achieve better solutions despite inheriting the same architecture of the former.
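The asymmetry between the two inputs can be illustrated with a short sketch: the student receives a subset of the teacher frames, drawn uniformly without replacement. The set sizes below (8 teacher views, 2 student views) are placeholder values chosen for the example, and `student_subset` is a hypothetical helper name.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def student_subset(teacher_frames: Sequence[T], m: int, rng: random.Random) -> List[T]:
    """Uniformly draw m distinct frames from the teacher input (no replacement)."""
    if not 0 < m <= len(teacher_frames):
        raise ValueError("m must be between 1 and the number of teacher frames")
    return rng.sample(list(teacher_frames), m)


# Placeholder sizes: 8 views for the teacher, 2 for the student.
rng = random.Random(0)
teacher_frames = [f"view_{i}" for i in range(8)]
student_frames = student_subset(teacher_frames, 2, rng)
```

Sampling the student input from the teacher input guarantees that the student never sees a frame the teacher did not, so the teacher output remains a valid target for it.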
To accomplish this, VKD exploits the Knowledge Distillation loss [hinton2015distilling]:
where and are the distributions, smoothed by a temperature, we attempt to match. (Since the teacher parameters are fixed, its entropy is constant and the objective of Eq. 3 reduces to the cross-entropy between the two distributions.) Since the student experiences a task different from that of the teacher, Eq. 3 resembles the regularisation term imposed by [li2016learning] to relieve catastrophic forgetting. In a similar vein, we intend to strengthen the model in the presence of few images, whilst not deteriorating the capabilities it achieved with longer sequences.
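A minimal sketch of the distillation term follows, assuming it is the Kullback-Leibler divergence between temperature-smoothed teacher and student distributions, as in [hinton2015distilling]. Any additional scaling factor is omitted, and the temperature and logits are illustrative values only. The snippet also checks numerically the remark that, with a fixed teacher, the divergence differs from the cross-entropy only by the (constant) teacher entropy.

```python
import math
from typing import List, Sequence


def softened(logits: Sequence[float], temperature: float) -> List[float]:
    """Softmax of logits divided by a temperature (higher means smoother)."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def cross_entropy(p: Sequence[float], q: Sequence[float]) -> float:
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0.0)


def entropy(p: Sequence[float]) -> float:
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)


# Illustrative logits and temperature (not the values used in the experiments).
tau = 4.0
p_teacher = softened([4.0, 1.0, 0.5], tau)
p_student = softened([2.5, 1.5, 0.2], tau)
kd_term = kl_divergence(p_teacher, p_student)
```

Since `entropy(p_teacher)` does not depend on the student, minimising `kd_term` and minimising `cross_entropy(p_teacher, p_student)` lead to the same student.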
In addition to fitting the output distribution of the teacher (Eq. 3), our proposal devises additional constraints on the embedding space learnt by the student. In detail, VKD encourages the student to mirror the pairwise distances spanned by the teacher. Indicating with the distance induced by the teacher between the -th and -th sets (the same notation also holds for the student), VKD seeks to minimise:
where equals the batch size. Since the teacher has access to several viewpoints, we argue that the distances spanned in its space yield a powerful description of the corresponding identities. From the student's perspective, distance preservation provides additional semantic knowledge. Therefore, it constitutes an effective supervision signal, whose optimisation is made more challenging by the fewer images available to the student.
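The distance preservation idea can be sketched as below, assuming Euclidean distances and an unnormalised sum of squared differences over all pairs of sets in the batch; the normalisation actually adopted is not recoverable from the text, and the function names are hypothetical. Note that the teacher and student embeddings may have different dimensionality, since only their pairwise distances are compared.

```python
import math
from itertools import combinations
from typing import List, Sequence


def pairwise_distances(embeddings: Sequence[Sequence[float]]) -> List[List[float]]:
    """Matrix of Euclidean distances between all set-level embeddings of a batch."""
    n = len(embeddings)
    return [[math.dist(embeddings[i], embeddings[j]) for j in range(n)] for i in range(n)]


def distance_preservation(teacher_emb: Sequence[Sequence[float]],
                          student_emb: Sequence[Sequence[float]]) -> float:
    """Sum over pairs (i < j) of the squared gap between teacher and student distances."""
    if len(teacher_emb) != len(student_emb):
        raise ValueError("teacher and student must embed the same batch")
    d_t = pairwise_distances(teacher_emb)
    d_s = pairwise_distances(student_emb)
    pairs = combinations(range(len(teacher_emb)), 2)
    return sum((d_t[i][j] - d_s[i][j]) ** 2 for i, j in pairs)


# Toy batch of three sets: 3-dimensional teacher space, 2-dimensional student space.
teacher_space = [[0.0, 0.0, 0.0], [3.0, 4.0, 0.0], [0.0, 0.0, 5.0]]
student_space = [[0.0, 0.0], [3.0, 4.0], [0.0, 5.0]]
dp_term = distance_preservation(teacher_space, student_space)
```

Here two of the three pairwise distances already agree, so the penalty comes entirely from the remaining pair.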
Even though VKD focuses on self-distillation, we highlight that both and allow matching models with different embedding sizes, which would not be viable under the minimisation performed by [gu2019TKP]. As an example, it is still possible to distill ResNet-101 () into MobileNet-V2 [sandler2018mobilenetv2] ().
The VKD overall objective combines the distillation terms ( and ) with the ones optimised by the teacher - and - that promote higher conditional likelihood w.r.t. ground truth labels. To sum up, VKD aims at strengthening the features of a CNN in Re-ID settings through the following optimisation problem:
Here, two hyperparameters balance the contributions to the total loss. We conclude with a final note on the student initialisation: we empirically found it beneficial to start from the teacher weights except for the last convolutional block, which is reinitialised according to the ImageNet pretraining. We argue this represents a good compromise between exploring new configurations and exploiting the abilities already achieved by the teacher.
We indicate the query-gallery matching as x2x, where both x terms are features that can be generated from either a single frame (I) or multiple frames (V). In the Image-to-Image (I2I) setting, features extracted from a query image are matched against features from individual images in the gallery. This protocol, which has been amply employed for person Re-ID and face recognition, has a light resource footprint. However, a single image captures only a single view of the identity, which may not be enough for identities exhibiting multi-modal distributions. In contrast, the Video-to-Video (V2V) setting makes it possible to capture and combine different modes in the input, but with a significant increase in the number of operations and in memory. Finally, the Image-to-Video (I2V) setting [zhou2018SCCN, zhou2018vami, liu2018ram, wang2017oife, liu2017provid] represents a good compromise: building the gallery may be slow, but it is often performed offline. Moreover, matching is extremely fast, as a query comprises only a single image. We remark that: i) we adopt the standard “Cross Camera Validation” protocol, not considering examples of the gallery from the same camera of the query at evaluation; and ii) even if VKD relies on frames from different cameras during training, we strictly adhere to the common schema and switch to tracklet-based inputs at evaluation time.
While settings vary across datasets, evaluation metrics for Re-Identification are shared by the vast majority of works in the field. In the following, we report performance in terms of top-k accuracy and mean Average Precision (mAP). By combining them, we evaluate VKD both in terms of accuracy and ranking performance.
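To make the two metrics concrete, the sketch below scores a single query against a gallery. It assumes one reading of the cross-camera protocol, namely that every gallery item recorded by the query camera is discarded before ranking; the function name, argument layout and toy values are illustrative. The mAP is then the mean of the per-query average precision, and top-k accuracy the mean of the per-query hits.

```python
from typing import Sequence, Tuple


def evaluate_query(distances: Sequence[float],
                   gallery_ids: Sequence[int],
                   gallery_cams: Sequence[int],
                   query_id: int,
                   query_cam: int,
                   k: int = 1) -> Tuple[float, float]:
    """Return (top-k hit, average precision) for one query."""
    # Cross-camera filtering (assumed form): drop gallery items from the query camera.
    kept = [i for i, cam in enumerate(gallery_cams) if cam != query_cam]
    ranked = sorted(kept, key=lambda i: distances[i])  # closest gallery item first
    matches = [gallery_ids[i] == query_id for i in ranked]
    n_relevant = sum(matches)
    if n_relevant == 0:
        raise ValueError("the query has no valid match in the gallery")
    top_k = 1.0 if any(matches[:k]) else 0.0
    hits, precision_sum = 0, 0.0
    for rank, is_match in enumerate(matches, start=1):
        if is_match:
            hits += 1
            precision_sum += hits / rank  # precision at each correct retrieval
    return top_k, precision_sum / n_relevant


# Toy gallery of five items; the query has identity 7 and comes from camera 0.
toy_distances = [0.1, 0.2, 0.3, 0.4, 0.5]
toy_ids = [7, 3, 7, 3, 7]
toy_cams = [0, 1, 1, 2, 2]
top1, ap = evaluate_query(toy_distances, toy_ids, toy_cams, query_id=7, query_cam=0, k=1)
```

In the toy example the closest gallery item shares the query camera and is therefore ignored, which is why the top-1 hit is lost even though its identity matches.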
Person Re-ID: MARS [zheng2016mars] comprises 19680 tracklets from 6 different cameras, capturing 1260 different identities (split between 625 for the training set, 626 for the gallery and 622 for the query) with 59 frames per tracklet on average. MARS has proven to be a challenging dataset because it has been automatically annotated, leading to errors and false detections [zheng2016survey-person]. The Duke [ristani2016duke] dataset was first introduced for multi-target and multi-camera surveillance purposes, and then expanded to include person attributes and identities (414 of them). Consistent with [gu2019TKP, si2018DuATM, liu2019NVAN, matiyali2019clip-simil], we use the Duke-Video-ReID [wu2018duke1] variant, where identities have been manually annotated from tracking information. (In the following, we refer to Duke-Video-ReID simply as Duke. Another variant of Duke named Duke-ReID exists [ristani2018duke2], but it does not come with query tracklets.) It comprises 5534 video tracklets from 8 different cameras, with 167 frames per tracklet on average. Following [gu2019TKP], we extract the first frame of every tracklet when testing in the I2V setting, for both MARS and Duke.
Vehicle Re-ID: VeRi-776 [liu2016veri] has been collected from 20 fixed cameras, capturing vehicles moving on a circular road in a area. It contains 18397 tracklets with an average of 6 frames per tracklet, capturing 775 identities split between train (575) and gallery (200). The query set shares its identities with the gallery but, differently from the other two sets, it includes only a single image for each (id, camera) pair. Consequently, all recent methods perform the evaluation following the I2V setting.
Animal Re-ID: The Amur Tiger Re-Identification in the Wild (ATRW) [li2019atrw] is a recently introduced dataset collected from a diverse set of wild zoos. The training set includes 107 subjects and 17.6 images on average per identity; no information is provided to aggregate images into tracklets. Only the I2I setting can be evaluated, through a remote HTTP server. As done in [liu2019tiger-top1], we horizontally flip the training images to double the number of identities available, thus resulting in 214 training identities.
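The flipping trick can be sketched as follows, under the assumption that each mirrored image is assigned a new label obtained by offsetting the original one by the number of identities; this labelling scheme and the helper names are hypothetical, and images are represented as plain nested lists for brevity.

```python
from typing import List, Sequence, Tuple

Image = List[List[int]]  # rows of pixel values


def flip_horizontally(image: Image) -> Image:
    """Mirror an image left to right by reversing every row."""
    return [list(reversed(row)) for row in image]


def double_identities(images: Sequence[Image],
                      labels: Sequence[int],
                      num_identities: int) -> Tuple[List[Image], List[int]]:
    """Append a mirrored copy of every image, labelled as a brand new identity."""
    out_images = list(images)
    out_labels = list(labels)
    for image, label in zip(images, labels):
        out_images.append(flip_horizontally(image))
        out_labels.append(label + num_identities)  # assumed labelling scheme
    return out_images, out_labels


# Toy data: two tiny images belonging to two identities.
images = [[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [1, 1, 2]]]
labels = [0, 1]
aug_images, aug_labels = double_identities(images, labels, num_identities=2)
```

Applied to the 107 training subjects, the same offset would yield the 214 training identities mentioned above.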
Following [hermans2017defensetriplet, liu2019NVAN], we adopt the following hyperparameters for MARS and Duke: i) each batch contains identities with samples each; ii) each sample comprises 8 images equally spaced in a tracklet. Differently, for image-based datasets (ATRW and VeRi-776) we increase to and use a single image at a time. All the teacher networks are trained for 300 epochs using Adam [kingma2014adam], setting the learning rate to and multiplying it by every 100 epochs. During the distillation stage, we feed images to the teacher and ones (picked at random) to the student. We found it beneficial to train the student longer: we set the number of epochs to 500 and the learning rate decay steps at 300 and 450. We keep fixed (Eq. 3), and (Eq. 5) in all experiments. To improve generalisation, we apply data augmentation as described in [luo2019bag]. Finally, we put the teacher in training mode during distillation (consequently, batch normalisation [ioffe2015batch] statistics are computed on a batch basis): as observed in [bagherinezhad2018label], this provides more accurate teacher labels.
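As an illustration of how a training sample might be drawn from a tracklet, the sketch below picks 8 equally spaced frame indices. The rounding rule and the handling of short tracklets are assumptions made for the example, and `equally_spaced_indices` is a hypothetical helper.

```python
from typing import List


def equally_spaced_indices(tracklet_length: int, num_frames: int = 8) -> List[int]:
    """Indices of num_frames frames spread evenly from the first to the last frame."""
    if tracklet_length <= 0 or num_frames <= 0:
        raise ValueError("tracklet_length and num_frames must be positive")
    if num_frames == 1:
        return [0]
    step = (tracklet_length - 1) / (num_frames - 1)
    # Rounding may repeat indices when the tracklet is shorter than num_frames.
    return [round(i * step) for i in range(num_frames)]


# A 15-frame tracklet yields every second frame.
indices = equally_spaced_indices(15)
```

Spreading the indices over the whole tracklet, rather than taking consecutive frames, exposes the network to as much of the tracklet's variation as 8 frames allow.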
In this section we show the benefits of self-distillation for person and vehicle Re-ID. We indicate the teacher with the name of the backbone (e.g. ResNet-50) and append “VKD” for its student (e.g. ResVKD-50). To validate our ideas, we do not limit the analysis to ResNet-*; rather, we also test self-distillation on DenseNet-121 [huang2017densely] and MobileNet-V2 1.0X [sandler2018mobilenetv2]. Since learning what and where to look represents an appealing property when dealing with Re-ID tasks [fu2019STA], we additionally conduct experiments on ResNet-50 coupled with Bottleneck Attention Modules [park2018bam] (ResNet-50bam).
Table 1 reports the comparisons for different backbones: in the vast majority of the settings, the student outperforms its teacher. Such a finding is particularly evident when looking at the I2V setting, where the mAP metric gains on average. The same holds for the I2I setting on VeRi-776, and in part also on V2V. We draw the following remarks: i) in accordance with the objective the student seeks to optimise, our proposal leads to greater improvements when few images are available; ii) bridging the gap between I2V and V2V does not imply a significant information loss when more frames are available; on the contrary, it sometimes results in superior performance; iii) the previous considerations hold true across different architectures. As additional evidence, the plots in Figure 3 draw a comparison between models before and after distillation. VKD improves metrics considerably on all three datasets, as highlighted by the gap between the teachers and their corresponding students. Surprisingly, this often applies when comparing lighter students with deeper teachers: as an example, ResVKD-34 scores better than even ResNet-101 on VeRi-776, regardless of the number of images sampled for a gallery tracklet.
4.3 Comparison with State-Of-The-Art
Tables 4, 4 and 4 report a thorough comparison with current state-of-the-art (SOTA) methods, on MARS, Duke and VeRi-776 respectively. As is common practice [gu2019TKP, bao2019maskedGAT, qian2019san], we focus our analysis on ResNet-50, and in particular on its distilled variants ResVKD-50 and ResVKD-50bam. Our method clearly outperforms other competitors, with an increase in mAP w.r.t. top-scorers of 6.3% on MARS, 8.6% on Duke and 5% on VeRi-776. This result is fully in line with our goal of conferring robustness when just a single image is provided as query. In doing so, we do not make any task-specific assumption, thus rendering our proposal easily applicable to both person and vehicle Re-ID.
Analogously, we conduct experiments on the V2V setting and report results in Table 7 (MARS) and Table 7 (Duke). (Since VeRi-776 does not include any tracklet information in the query set, following all other competitors we limit experiments to the I2V setting only.) Here, VKD yields the following results: on the one hand, on MARS it pushes a baseline architecture such as ResVKD-50 close to NVAN and STE-NVAN [liu2019NVAN], the latter being tailored for the V2V setting. Moreover, when exploiting spatial attention modules (ResVKD-50bam), it establishes new SOTA results, suggesting that a positive transfer also occurs when matching tracklets. On the other hand, the same does not hold true for Duke, where exploiting video features as in STA [fu2019STA] and NVAN appears rewarding. We leave the investigation of further improvements on V2V to future work. As of today, our proposal is the only one guaranteeing consistent and stable results under both I2V and V2V settings.
4.4 Analysis on VKD
In the absence of camera information.
Here, we address the setting where we do not have access to camera information. As an example, when dealing with animal Re-ID this information is often missing and datasets come with images and labels solely: can VKD still provide any improvement? We think so, as one can still exploit the visual diversity lying in a bag of randomly sampled images. To demonstrate our claim, we test our proposal on Amur Tigers re-identification (ATRW), which was conceived as an Image-To-Image dataset. During comparisons: i) since other works do not conform to a unique backbone, here we opt for ResNet-101; ii) as is common practice in this benchmark [liu2019tiger-top1, liu2019tiger-top2, yu2019tiger-top-3], we leverage re-ranking [zhong2017re]. Table 7 compares VKD against the top scorers in the “Computer Vision for Wildlife Conservation 2019” competition. Importantly, the student ResVKD-101 improves over its teacher (1.5% on mAP and 2.9% on top) and places second behind [liu2019tiger-top1], confirming its effectiveness in a challenging scenario. Moreover, we remark that the top-scorer requires additional annotations, such as body parts and pose information, which we do not exploit.
Distilling viewpoints vs time.
Figure 4.4 shows results of distilling knowledge from multiple views against time (i.e. multiple frames from a tracklet). On the one hand, as multiple views hold more “visual variety”, the student builds a more invariant representation for the identity. On the other hand, a student trained with tracklets still considerably outperforms the teacher. This shows that, although the visual variety is reduced, our distillation approach still successfully exploits it.
VKD reduces the camera bias.
As pointed out in [tian2018eliminating], the appearance encoded by a CNN is heavily affected by external factors surrounding the target object (e.g. different backgrounds, viewpoints, illumination …). In this respect, is our proposal effective for reducing such a bias? To investigate this aspect, we perform a camera classification test on both the teacher (e.g. ResNet-34) and the student network (e.g. ResVKD-34) by fitting a linear classifier on top of their features, with the aim of predicting the camera the picture is taken from. We freeze all backbone layers and train for 300 epochs (with the learning rate halved every 50 epochs). Table 4.4 reports performance on the gallery set for different teachers and students. To provide a better understanding, we include a baseline that computes predictions by sampling from the cameras' prior distribution. As expected: i) the teacher outperforms the baseline, suggesting it is in fact biased towards background conditions; ii) the student consistently reduces the bias, confirming that VKD encourages the student to focus on identity features and to drop viewpoint-specific information. Finally, note that time-based distillation does not yield the bias reduction we observe for VKD (see supplementary materials).
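The prior-sampling baseline admits a simple closed form, sketched below under the assumption that each prediction is drawn independently from the empirical camera frequencies: the prediction is then correct with probability equal to the sum of the squared camera frequencies. The helper names and the toy labels are illustrative.

```python
from collections import Counter
from typing import Dict, Sequence


def camera_prior(camera_labels: Sequence[int]) -> Dict[int, float]:
    """Empirical frequency of each camera in a set of images."""
    counts = Counter(camera_labels)
    total = len(camera_labels)
    return {camera: n / total for camera, n in counts.items()}


def expected_prior_accuracy(camera_labels: Sequence[int]) -> float:
    """Expected accuracy when predictions are sampled from the camera prior."""
    # A guess is right when the sampled camera equals the true one: sum of p_c ** 2.
    return sum(p * p for p in camera_prior(camera_labels).values())


# Toy gallery: camera 0 appears four times, cameras 1 and 2 twice each.
toy_cameras = [0, 0, 0, 0, 1, 1, 2, 2]
baseline_accuracy = expected_prior_accuracy(toy_cameras)
```

A linear probe scoring well above this value indicates that camera identity can be read off the features, which is the bias the test is designed to expose.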
Can performance of the student be obtained without distillation?
To highlight the advantages of the two-stage procedure discussed above, we here consider a teacher (ResNet-50) trained directly using few frames () only. The first two rows of Table 8 show the performance achieved by this baseline (using tracklets and views respectively). Results show that major improvements come from the teacher-student paradigm we devise (third row), rather than from simply reducing the number of input images available to the teacher.
To further assess the differences between teachers and students, we leverage GradCam [selvaraju2017gradcam] to highlight the input regions that have been considered paramount for predicting the identity. Figure 4 depicts the impact of VKD for various examples from MARS, VeRi-776 and ATRW. In general, the student network pays more attention to the subject of interest compared to its teacher. For person and animal Re-ID, background features are suppressed (third and last columns) while attention tends to spread to the whole subject (first and fourth columns). When dealing with vehicle Re-ID, one can appreciate how the attention becomes equally distributed on symmetric parts, such as front and rear lights (second, seventh and last columns). Please see supplementary materials for more examples, as well as a qualitative analysis of some of our model errors.
Differently from other approaches [bhardwaj2019fewerframes, gu2019TKP], VKD is not confined to self-distillation, but instead allows the knowledge transfer from a complex architecture (e.g. ResNet-101) into a simpler one, such as MobileNet-V2 or ResNet-34 (cross-distillation). Here, drawing inspiration from the model compression area, we attempt to reduce the network complexity and, at the same time, increase the gain we already achieve through self-distillation. In this respect, Table 10 shows results of cross-distillation, for various combinations of a teacher and a student. It appears that the better the teacher, the better the student: as an example, ResVKD-34 gains an additional 3% mAP on Duke when educated by ResNet-101 rather than by “itself”.
| Student (#params) | Teacher (#params) | MARS top-1 | MARS mAP | Duke top-1 | Duke mAP | VeRi-776 top-1 | VeRi-776 mAP |
|---|---|---|---|---|---|---|---|
| ResNet-34 (21.2M) | ResNet-34 (21.2M) | 82.17 | 73.68 | 83.33 | 80.60 | 94.76 | 79.02 |
| ResNet-50 (23.5M) | ResNet-50 (23.5M) | 83.89 | 77.27 | 85.61 | 83.81 | 95.17 | 82.16 |
| MobileNet-V2 (2.2M) | MobileNet-V2 (2.2M) | 83.33 | 73.95 | 83.76 | 80.83 | 92.61 | 75.27 |
On the impact of loss terms.
We perform a thorough ablation study (Table 9) on the student loss (Eq. 5). Note that leveraging ground truth solely (second row) hurts performance. Differently, the best performance for both metrics is obtained by exploiting the teacher signal (from the third row onward), with particular emphasis on , which proves to be a fundamental component.
An effective Re-ID method requires visual descriptors robust to changes in both background appearances and viewpoints. Moreover, its effectiveness should be ensured even for queries composed of a single image. To this end, we proposed Views Knowledge Distillation (VKD), a teacher-student approach where the student observes only a small subset of input views. This strategy encourages the student to discover better representations: as a result, it outperforms its teacher at the end of the training. Importantly, VKD shows robustness on diverse domains (person, vehicle and animal), surpassing by a wide margin the state of the art in I2V. Thanks to an extensive analysis, we highlight that the student presents a stronger focus on the target and reduces the camera bias.
The authors would like to acknowledge Farm4Trade for its financial and technical support.