I Introduction
An essential task in visual surveillance systems is to automatically associate individuals across disjoint camera views, a problem known as person re-identification (re-ID). It has gained considerable popularity in video surveillance, multimedia, and security systems owing to its prospect of retrieving persons of interest from large amounts of video footage. Most existing approaches focus on matching still images represented by spatial visual appearance (shape, texture, and colour). Specifically, one matches a probe (or query) person observed in one camera view against a set of gallery candidates captured by another, disjoint view to generate a ranked list according to matching distance or similarity. To this end, a body of methods has been developed to extract invariant features [1, 2, 3, 4, 5] or to learn discriminative matching models [6, 7, 8, 9, 10, 11, 12]. However, appearance cues are intrinsically limited by the inevitable visual ambiguity and unreliability caused by appearance similarities among different people, and by appearance variations of the same person under significant cross-view changes in pose, illumination, and cluttered background. This motivates the search for additional, complementary visual information sources for person re-ID.
On the other hand, videos or image sequences are widely available from surveillance cameras, and they inherently contain more information than independent still images. A natural question is whether more useful information can be obtained from videos than from still images. Videos are an abundant and rich source of human motion and appearance information. For example, given a sequence of images, temporal information related to a person's motion can be captured, which may disambiguate difficult cases that arise when recognising the same person in a different camera. However, working with videos creates new challenges, such as dealing with sequences of arbitrary length and learning effective representations while disentangling nuisance factors caused by visual variations.
Most approaches to video-based person re-ID are based on supervised learning to optimise a discriminant distance metric under which intra-video distances are minimised and inter-video distances maximised [14, 15, 13]. They typically extract spatial-temporal features (e.g., HOG3D [16]) on each fragment, so that videos are represented as sets of extracted features. To learn data-dependent, high-level video features, [17, 18] use Long Short-Term Memory (LSTM) networks to aggregate frame-wise CNN features into video-level representations while mapping them into hidden states with temporal dependency. However, videos are much higher-dimensional entities, and it becomes increasingly difficult to do credit assignment over frame selections and to learn long-range relationships among frames, unless one collects much more labeled data or performs extensive feature engineering (e.g., computing the right kinds of flow features) to keep the dimensionality low. As a matter of fact, collecting massive labeled pairs of video sequences is not feasible for person re-ID.

Challenges in Video Matching based Person Re-ID
Matching video footage of pedestrians in realistic surveillance settings raises several challenges. First, it demands a faithful model with long-distance dependency that maps sequences into latent variables, such that both the dynamics of the input patterns and effective representations can be learned simultaneously. Second, the model should be designed without demanding a large amount of labeled training samples [19]; that is, it calls for representation learning with access to only a few labeled data. Last but not least, latent variables learned as video representations across camera views inherently exhibit distribution divergence, which should be eliminated to make matching comparable [20]. An illustration of feature distribution divergence is shown in Fig. 2. The initial probability distributions of latent variables across views are very arbitrary (depicted in Fig. 2 (a)). Thus, a unified-view approach is required to construct rich latent variables and allow comparable matching in the cross-view setting. Fig. 2 (b) shows that after cross-view adversarial learning, the features are transformed to be view-invariant and comparable for matching (see Fig. 2 (c), as measured by the KL-divergence).

Our Method
In this work, we propose few-shot deep adversarial neural networks to learn latent variables as representations for cross-view video-based person re-ID in the context of few labeled training video pairs. Video observations are modelled using variational recurrent neural networks (VRNNs) [23], which use variational methods to produce latent representations while capturing the temporal dependency across time steps. To achieve view-invariance, we perform adversarial training to make the latent representations invariant across camera views. Specifically, the VRNNs contain variational autoencoders that provide a class of latent variables to capture the input dynamics, all of which are conditioned on the previous encoders via the hidden states of an RNN. To make the learned features view-invariant, the network is augmented with a cross-view verification objective that is updated adversarially against view changes, encouraging view-invariant features to emerge in the course of optimisation. The proposed approach is generic in that it can be created atop any existing feed-forward architecture that is trainable by back-propagation. Meanwhile, the network is easy to optimise by adding a gradient reversal layer [24], which leaves the input unchanged during the forward pass and reverses the gradient by multiplying it by a negative scalar during back-propagation.
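The behaviour of the gradient reversal layer from [24] can be illustrated with a minimal sketch: identity on the forward pass, gradient scaled by a negative factor on the backward pass. The function names here are illustrative stand-ins, not part of any specific framework.

```python
import numpy as np

def grl_forward(x):
    """Forward pass of a gradient reversal layer: leave the input unchanged."""
    return x

def grl_backward(grad_output, lam):
    """Backward pass: reverse the incoming gradient, scaled by lam,
    so the upstream feature extractor receives a negated gradient."""
    return -lam * grad_output
```

In a real network this pair would be wrapped as a custom autograd operation; the key point is that the forward output equals the input while the backward gradient changes sign.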
The inputs to the model are high-level percepts for a pair of videos, extracted by applying a convolutional net, namely VGGNet [25] trained on ImageNet [26]. These percepts are the activations of the last rectified linear layer of the CNN, which are put through the VRNNs to capture the latent dependencies at different time steps. However, the latent variables derived from VRNNs are weakly regulated and exhibit varied distributions caused by view changes [27]. To this end, we propose adversarial learning, which applies a cross-view verification objective to explicitly promote view-invariant feature learning. To evaluate the learned representations, we qualitatively and quantitatively analyse the predictions and matching rates in people recognition made by the model.

Contributions
The major contributions of this paper are threefold:

We propose deep few-shot adversarial learning to produce effective video representations for video-based person re-ID with few labeled training video pairs. The proposed model, built atop VRNNs [23], maps video sequences into latent variables that serve as video representations by capturing the temporal dynamics.

Our approach addresses the distribution disparity of the learned deep features by performing adversarial training [24] to ensure view-invariance.

The algorithm is based on few-shot learning, and is thus advantageous in generalisation capability, widening its applicability to large-scale networked cameras.
II Related Work
II-A Person Re-identification
The majority of approaches to person re-identification are image-based and can be broadly divided into two categories. The first category [1, 3, 4, 5, 29, 30] employs invariant features, aiming to extract features that are both discriminative and invariant against dramatic visual changes across views. The second stream comprises metric learning based methods, which extract features for each image and then learn a metric under which the training data exhibit strong inter-class differences and intra-class similarities [31, 6, 32, 33, 34, 35]. These approaches consider only a one-shot image per person per view, which is inherently weak when multiple shots are available, owing to the intrinsically ambiguous and noisy appearance of people and the large cross-view appearance variations.
The availability of video in many realistic scenarios indicates that multiple images can be exploited to improve matching performance. Multi-shot approaches to person re-identification [1, 2, 13, 36, 11, 37] use multiple images of a person to extract appearance descriptors that model person appearance. In these methods, multiple images from a sequence are used either to enhance local image region/patch spatial feature descriptions [1, 2, 36, 11] or to extract additional appearance information such as temporal change statistics [37, 38]. These methods, however, treat the multiple images independently, whereas in video-based person re-identification a video contains more information than independent images, e.g., the underlying dynamics of a moving person and its temporal evolution.
Recently, a number of frameworks have been developed to address person re-ID in the video setting [15, 39, 13, 14]. The Dynamic Time Warping (DTW) model [40], a common sequence matching algorithm widely used in pattern recognition, has been applied to video-based person re-ID [41]. Wang et al. [13, 42] partially solve this problem by formulating a discriminative video ranking (DVR) model using the space-time HOG3D feature [16]. In [39], both spatial and temporal alignment are considered to generate a body-action model from which Fisher vectors are learned and extracted from individual body-action units, known as STFV3D. However, these approaches implicitly assume that all image sequences are synchronised, which does not hold in practice because different people take different actions. An unsupervised approach is introduced by [43], where a spatial-temporal person representation is formed by encoding multi-scale spatial-temporal dynamics (including histograms of oriented 3D spatio-temporal gradients and spatio-temporal pyramids) from unaligned image sequences. A supervised top-push distance learning model (TDL) [14] is proposed to enforce a top-push constraint [44] in the optimisation of top-rank matching in person re-ID. It uses video-based representations (HOG3D, colour histograms, and LBP) to maximise the inter-class margin between different persons, and likewise in [15].
Nonetheless, these pipelines perform feature extraction and metric learning separately, where the low-level features are too generic to determine which of them are useful in the matching process. Meanwhile, the distance metric learning is principled on individual examples (e.g., triplets), which carry little information about the neighbourhood structure and thus do not generalise across data sets. With the remarkable success achieved by deep neural networks (DNNs) in visual recognition, person re-ID has witnessed great progress by applying DNNs to learn ranking functions based on pairs [45, 46, 47, 48] or triplets of images [49]. These methods, which use network architectures such as the “Siamese network”, learn a direct mapping from raw image pixels to a feature space where diverse images of the same person are close, while images of different persons are widely separated. Another DNN-based approach to re-ID uses an autoencoder to learn an invariant colour feature whilst ignoring spatial features [50], which turn out to be crucial in matching pedestrians [51, 52, 53, 54]. However, these architectures do not exploit any form of temporal information and are thus not applicable to video-based person re-ID. To introduce temporal signals into a DNN, McLaughlin et al. present a recurrent neural network for video-based person re-identification [17]. It combines recurrence and temporal pooling of appearance data with representation learning to yield an invariant representation for each person's video sequence. Wu et al. [22] deliver an end-to-end approach that learns spatial-temporal features and a corresponding similarity metric given a pair of time series.
II-B Deep Generative Models for Videos
Ranzato et al. [55] proposed a generative model for videos that uses a recurrent neural network to predict the next frame or interpolate between frames. In that work, the authors quantise image patches into a large dictionary and train the model to predict the identity of the target patch. However, this introduces an arbitrary dictionary size and discards the notion of patches being similar or dissimilar to one another. A more recent model using an Encoder-Decoder LSTM to learn representations of video sequences is proposed by Srivastava et al. [56], where the Encoder LSTM maps an input sequence into a fixed-length representation and the Decoder LSTM produces the future sequence. Nonetheless, these studies rely on Recurrent Neural Networks (RNNs) whose transition functions are deterministic and cannot capture variability in the input space. In this work, we study the visual variability in video sequences and capture it by developing a recurrent model with Gaussian latent variables. The family of Recurrent Gaussian Process (RGP) models is defined in [57]; similarly to RNNs, these models are able to learn temporal representations from data. The authors of [57] also propose an RGP model with a latent autoregressive structure, where the intractability brought by the recurrent GP priors is tackled with a variational approximation. While our method is similar to RGP [57] in employing an autoregressive structure over latent states, this work effectively addresses the distribution problem of latent variables so as to provide the best approximation to the true posterior, which is not resolved in [57]. In contrast to the model of [57], which needs an expensive iterative inference scheme for continuous latent variables with intractable posterior distributions, we perform efficient inference and learning even in the intractable case. Our inference model, based on variational Bayes [58, 59], reparameterises the variational lower bound, which can then be straightforwardly optimised using standard stochastic gradient descent techniques.
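The reparameterisation just mentioned can be sketched in a few lines: rather than sampling a Gaussian latent variable directly, one samples standard noise and transforms it, so gradients flow through the mean and standard deviation. This is a generic sketch of the trick from [58], with an illustrative function name.

```python
import numpy as np

def reparameterise(mu, sigma, eps=None, rng=None):
    """Draw z ~ N(mu, diag(sigma^2)) via z = mu + sigma * eps,
    where eps ~ N(0, I); mu and sigma stay differentiable."""
    rng = rng or np.random.default_rng(0)
    if eps is None:
        eps = rng.standard_normal(mu.shape)
    return mu + sigma * eps
```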
II-C Few-Shot Learning
Few-shot learning aims to learn novel concepts from very few examples [60]. It typically demands an effective representation learning pipeline with good generalisation ability. Generative models can be used to generate additional examples so as to improve the learner's performance on novel classes [61, 60]. Intuitively, the challenge is that such few training examples capture very little of a category's intra-class variation. For instance, if the category is a particular pedestrian, we may only have examples of the person's frontal view in one camera and of his/her back view in another. Amongst representation learning approaches, metric learning methods such as the triplet loss [62, 63, 49, 64] or Siamese networks [65, 66, 45, 48] have been used to automatically learn feature representations in which objects of the same class are placed closer together. However, these approaches cannot be directly applied to video-based person re-ID because they do not explicitly address the cross-view variations that lead to feature distribution divergence.
III Few-Shot Deep Adversarial Neural Networks for Cross-View Video-based Person Re-Identification
In this section, we present a deep adversarial learning approach that learns video representations for person re-ID in a few-shot setting. In video-based person re-ID, we need to learn from a few sequence examples per person to produce discriminative representations. This motivates the setting we are interested in: “few-shot” learning, which consists of learning a person/class from a few labelled sequence examples. While deep learning typically requires large training datasets to update its parameters, we develop our model using variational recurrent neural networks (VRNNs) with continuous latent random variables to model the video sequences, which allow efficient inference and learning. The learned latent temporal representations are then put through adversarial training to make them view-invariant. In the following, we first formally describe the problem setting and definitions in Section III-A. Then, we present the VRNNs and the principle of adversarial learning in Section III-B and Section III-C, respectively. Section III-D contains the optimisation and Section III-E details the inference for testing.

III-A Problem Formulation
Given a set of variable-length video sequences captured in a network of cameras, let the video observations be $\{X_i\}_{i=1}^{N}$, where $X_i = \{x_i^1, \ldots, x_i^{T_i}\}$ is a person video containing $T_i$ frames. In person re-ID, a video sequence of a person appearing in one camera is known as the probe video, and re-identification amounts to finding the correct match for the probe from a set of gallery videos. In this setting, we further assume that each video sequence in the probe camera and its correspondence in the gallery set come with labels (each person corresponds to a class). Without loss of generality, $T_i$ does not have to equal $T_j$. Hence, each training pair consists of a randomly selected probe sequence and its correspondence for the same person with label $y \in \{1, \ldots, C\}$, where $C$ denotes the total number of persons. Overall, the training objective of our framework is to jointly achieve two goals: predict the label for each coupled probe and gallery video pair; and impose adversarial learning to regularise the learned representations to be view-invariant.
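The pairing scheme described above can be sketched as follows: for each identity, one probe sequence and one gallery sequence of the same person are randomly drawn, and the identity index serves as the class label. This is a hypothetical helper, not code from the paper.

```python
import random

def build_training_pairs(probe_seqs, gallery_seqs, seed=0):
    """probe_seqs / gallery_seqs: dict mapping person_id -> list of sequences.
    Returns a list of (probe_sequence, gallery_sequence, class_label) tuples,
    one pair per identity."""
    rng = random.Random(seed)
    pairs = []
    for label, pid in enumerate(sorted(probe_seqs)):
        x_probe = rng.choice(probe_seqs[pid])      # random probe sequence
        x_gallery = rng.choice(gallery_seqs[pid])  # its cross-view correspondence
        pairs.append((x_probe, x_gallery, label))
    return pairs
```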
III-B Variational Recurrent Neural Networks (VRNNs)
To model the temporal dependencies between the latent random variables across time steps, we employ the variational recurrent neural networks (VRNNs) [23], which contain a variational autoencoder [58] at each time step. These autoencoders are conditioned on previous ones via the hidden state of an RNN such as an LSTM [67]. Thus, for each time step $t$ of frame $x_t$, a latent random variable $z_t$ can be inferred as

$$z_t \mid x_t, h_{t-1} \sim \mathcal{N}\big(\mu_{z,t}, \mathrm{diag}(\sigma_{z,t}^2)\big), \quad [\mu_{z,t}, \sigma_{z,t}] = \varphi^{\mathrm{enc}}(x_t, h_{t-1}), \qquad (1)$$

with the prior $z_t \sim \mathcal{N}\big(\mu_{0,t}, \mathrm{diag}(\sigma_{0,t}^2)\big)$, where $[\mu_{0,t}, \sigma_{0,t}] = \varphi^{\mathrm{prior}}(h_{t-1})$. All $\varphi$ denote parameterised functions generating the distribution parameters, and can be any highly flexible function such as a deep neural network. Then, for each $z_t$, the data can be generated via

$$x_t \mid z_t, h_{t-1} \sim \mathcal{N}\big(\mu_{x,t}, \mathrm{diag}(\sigma_{x,t}^2)\big), \quad [\mu_{x,t}, \sigma_{x,t}] = \varphi^{\mathrm{dec}}(z_t, h_{t-1}), \qquad (2)$$

and learned by optimising the VRNN objective function:

$$\mathcal{L}_{\mathrm{VRNN}}(\phi, \theta) = \mathbb{E}_{q(z_{\leq T} \mid x_{\leq T})}\Big[\sum_{t=1}^{T} \big(-\mathrm{KL}\big(q(z_t \mid x_{\leq t}, z_{<t}) \,\|\, p(z_t \mid x_{<t}, z_{<t})\big) + \log p(x_t \mid z_{\leq t}, x_{<t})\big)\Big], \qquad (3)$$

where $\mathrm{KL}(q \,\|\, p)$ is the Kullback-Leibler divergence between two distributions $q$ and $p$. Here $q(z_t \mid x_{\leq t}, z_{<t})$ is the inference model, $p(z_t \mid x_{<t}, z_{<t})$ is the prior, and $p(x_t \mid z_{\leq t}, x_{<t})$ is the generative model; $\phi$ and $\theta$ are the parameters of the VRNN's encoder and decoder, respectively. Thus, for each frame sequence $X = \{x_1, \ldots, x_T\}$, we use the latent variables $z = \{z_1, \ldots, z_T\}$ as the overall feature representation for the following classification task, since they capture the temporal latent dependencies across time steps.

III-C Deep Adversarial Learning for View-Invariance
Recall that in the context of few-shot person re-ID there are limited labeled pairs of video sequences for training; we therefore form the training objective by jointly optimising two tasks: classification on each sequence and verification on the correspondence. Intuitively, imposing the classification loss on the inputs optimises discriminative classifiers over identities with relative similarity in the context of all training classes, because the optimal similarity is probabilistically determined by respecting all training categories [27, 22]. The verification loss is designed to regularise the classifiers towards view-invariance. Formally, let $C_p(\cdot; \theta_p)$ and $C_g(\cdot; \theta_g)$ represent the probe classifier (predicting class labels for probe sequences $X^p$) and the gallery classifier (predicting class labels for gallery sequences $X^g$), respectively, with parameters $\theta_p$ and $\theta_g$, for a given paired input $(X^p, X^g)$. Here, $C_p$ and $C_g$ can be modelled using deep neural networks. For ease of notation, each paired input is coupled with one person, and the training progressively considers one person out of the $C$ identities. Therefore, the classification losses for $C_p$ and $C_g$ can be defined respectively as

$$\mathcal{L}_{\mathrm{cls}}^{p}(\phi, \theta_p) = \ell\big(C_p(E_\phi(X^p); \theta_p), y\big), \qquad \mathcal{L}_{\mathrm{cls}}^{g}(\phi, \theta_g) = \ell\big(C_g(E_\phi(X^g); \theta_g), y\big), \qquad (4)$$

where $\ell(\cdot, \cdot)$ represents the categorical cross-entropy loss function, and $E_\phi$ is the VRNN encoder that maps an input sequence into its latent representation $z$. Hence, for notational convenience, we define the combined classification loss as

$$\mathcal{L}_{\mathrm{cls}} = \mathcal{L}_{\mathrm{cls}}^{p} + \mathcal{L}_{\mathrm{cls}}^{g}. \qquad (5)$$
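As a rough illustration of the two classification heads, the sketch below assumes each head is a linear-softmax classifier over a latent vector and sums the two categorical cross-entropies; the weight matrices and names are illustrative, not the paper's architecture.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D logit vector."""
    e = np.exp(logits - logits.max())
    return e / e.sum()

def cross_entropy(logits, label):
    """Categorical cross-entropy of a single example."""
    return -np.log(softmax(logits)[label])

def classification_loss(z_p, z_g, W_p, W_g, label):
    """Combined loss: probe head cross-entropy + gallery head cross-entropy,
    with W_p, W_g as stand-in linear classifier weights."""
    return cross_entropy(W_p @ z_p, label) + cross_entropy(W_g @ z_g, label)
```

With zero logits over C classes each head contributes log(C), which is a handy sanity check when wiring up such heads.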
Combining the VRNN training and the classification task yields the following optimisation:

$$\min_{\phi, \theta, \theta_p, \theta_g} \; -\mathcal{L}_{\mathrm{VRNN}}(\phi, \theta) + \mathcal{L}_{\mathrm{cls}}(\phi, \theta_p, \theta_g). \qquad (6)$$
As we aim at representations with view-invariance, we can adversarially train the above objective by incorporating the verification loss. In other words, the verification loss is introduced as a regulariser of Eq. (6):

$$\mathcal{L}_{\mathrm{ver}}(\phi, \theta_p, \theta_g) = \big\| C_p(E_\phi(X^p); \theta_p) - C_g(E_\phi(X^g); \theta_g) \big\|^2, \qquad (7)$$

which penalises the difference between the two classifiers' predictions on corresponding sequences across views.
Jointly combining the optimisation problems in Eq. (6) and Eq. (7) leads to our objective function, which can be mathematically expressed as:

$$E(\phi, \theta, \theta_p, \theta_g) = -\mathcal{L}_{\mathrm{VRNN}}(\phi, \theta) + \mathcal{L}_{\mathrm{cls}}(\phi, \theta_p, \theta_g) - \lambda \mathcal{L}_{\mathrm{ver}}(\phi, \theta_p, \theta_g), \qquad (8)$$

where $\lambda$ is a tuning hyper-parameter balancing the trade-off between optimising for view-invariant representations and optimising classification accuracy.
III-D Optimisation
The optimisation of Eq. (8) involves minimisation over some parameters and maximisation over others, i.e., we iteratively solve the following problems by finding the saddle points $\hat{\phi}$, $\hat{\theta}$, $\hat{\theta}_p$, $\hat{\theta}_g$ such that:

$$(\hat{\phi}, \hat{\theta}) = \arg\min_{\phi, \theta} E(\phi, \theta, \hat{\theta}_p, \hat{\theta}_g), \qquad (\hat{\theta}_p, \hat{\theta}_g) = \arg\max_{\theta_p, \theta_g} E(\hat{\phi}, \hat{\theta}, \theta_p, \theta_g). \qquad (9)$$
This problem can be tackled with a simple stochastic gradient procedure, making updates in the opposite direction of the gradient of Eq. (8) for the minimised parameters, and in the direction of the gradient for the maximised parameters. In practice, stochastic estimates of the gradients are computed by averaging over a subset of the training examples. Thus, with learning rate $\mu$, the gradient updates can be calculated as:

$$\phi \leftarrow \phi - \mu \frac{\partial E}{\partial \phi}, \quad \theta \leftarrow \theta - \mu \frac{\partial E}{\partial \theta}, \quad \theta_p \leftarrow \theta_p + \mu \frac{\partial E}{\partial \theta_p}, \quad \theta_g \leftarrow \theta_g + \mu \frac{\partial E}{\partial \theta_g}. \qquad (10)$$
We use stochastic gradient descent (SGD) to update $\theta$. For the other parameters, we use SGD with a gradient reversal layer [24] to update a feed-forward deep network comprising the feature extractor (the VRNN's encoder) fed into the classification task ($C_p$ and $C_g$) and the cross-view verification loss. The role of gradient reversal is to make the gradients from the classification and the view difference subtracted instead of summed. This ensures that the classification loss is minimised while the feature representations become view-invariant. Thus, the resulting feature representations capture temporal dependencies (due to the VRNN objective $\mathcal{L}_{\mathrm{VRNN}}$) and cross-view invariance (due to the verification regulariser $\mathcal{L}_{\mathrm{ver}}$). The optimisation is depicted in Fig. 3.
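The effect of the reversal on the encoder update can be illustrated numerically. A minimal sketch with scalar stand-in gradients (the helper names `encoder_grad` and `sgd_step` are illustrative, not from the paper):

```python
def encoder_grad(g_cls, g_ver, lam):
    """Gradient reaching the encoder: the verification gradient is
    subtracted from the classification gradient (the effect of the
    gradient reversal layer), scaled by the trade-off lam."""
    return g_cls - lam * g_ver

def sgd_step(param, grad, mu):
    """Plain SGD descent step with learning rate mu."""
    return param - mu * grad
```

For example, with a classification gradient of 1.0, a verification gradient of 0.5, and lam = 0.6, the encoder receives a net gradient of 0.7 rather than 1.3, i.e., the two terms oppose rather than reinforce each other.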
Why should this framework learn good video features for video-based person re-ID? Latent variables learned without regularisation are not comparable across videos in the cross-view setting, as they exhibit view variability and divergent distributions. The formalism of cross-view verification offers a systematic mechanism for reducing view variations by generalising the classifiers, where the loss function penalises the differences between the classifiers under each view. This draws a connection between this loss and the regularisation of feature activations.
III-E Inference and Complexity Analysis
Once training is accomplished, the inference model can be used to produce the feature representation of a video sequence as $z \sim \mathcal{N}(\mu_z, \mathrm{diag}(\sigma_z^2))$, where $\mu_z$ and $\sigma_z$ denote the parameters of a Gaussian whose mean and variance are the output of a nonlinear function of the hidden state $h$, that is, $[\mu_z, \sigma_z] = \varphi^{\mathrm{enc}}(h)$. Given two unknown test sequences $X^p$ and $X^g$, we obtain the latent variables $z^p$ and $z^g$ as their respective video representations and compute their similarity via the element-wise inner product between the latent variables [22], that is, $s = \langle z^p, z^g \rangle$. The illustration of the inference is shown in Fig. LABEL:fig:inference.
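The matching step above reduces to an inner product between the two latent vectors; a minimal sketch with stand-in vectors (the latent representations themselves would come from the trained encoder):

```python
import numpy as np

def sequence_similarity(z_probe, z_gallery):
    """Similarity between two sequences: inner product of their
    latent representations; a larger value means a better match."""
    return float(np.dot(z_probe, z_gallery))
```

At test time one would compute this score between the probe's latent vector and every gallery latent vector, then sort the gallery in descending score order to obtain the ranked list.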
The main computational cost in training comes from learning the parameters of the Gaussian posterior, that is, $\phi$ and $\theta$. In our model, we use pre-trained neural networks [25] to extract features from the input frames; the feature extractors are parameterised by the last fully connected layers of the deep convolutional neural networks [25]. Fortunately, this feature extraction can be performed offline, and hence we consider the complexity with respect to $\phi$ and $\theta$, which are typically a stacked three-layer RNN with 1024 LSTM units per layer. The standard vanilla LSTM has a high computational cost because of its complex structure [68]: it has three gates controlling the memory cell in addition to the input activation, so its computational cost is four times larger than that of a simple RNN with the same memory-cell dimension. To reduce this cost, we compress the output vector of the LSTM by a linear projection, referred to as a recurrent projection [69]. Assume the input vector is $n_i$-dimensional and the hidden state is $n_c$-dimensional; the sizes of the weight matrices multiplying the input and the hidden state vectors are $4 n_c \times n_i$ and $4 n_c \times n_c$, respectively. Therefore, the computational cost of a single LSTM step is approximately $4 n_c (n_i + n_c)$ multiplications. Since the feature dimension from the deep ConvNets [25] is 1024 and $n_c = 1024$, the number of multiplications becomes about $8.4 \times 10^6$. In our model, we compress the output vector into an $n_r$-dimensional vector via a weight matrix of size $n_c \times n_r$. The compressed vector is fed as the input to the memory block of the next layer and as the hidden state to the gate activations of the same layer at the next time step. Thus, the computational cost of this light LSTM becomes approximately $4 n_c n_r + 4 n_c n_r + n_c n_r = 9 n_c n_r$ multiplications. If $n_r = 256$ and $n_c = 1024$, the number of multiplications becomes about $2.4 \times 10^6$, resulting in a 72% reduction from the vanilla LSTM.

IV Experiments
IV-A Datasets
To evaluate the performance of the proposed approach, we conduct extensive experiments on three image-sequence-based person re-ID datasets: iLIDS-VID [13], PRID2011 [28], and MARS [21]. Example images are shown in Fig. 4.

The iLIDS-VID dataset consists of 600 image sequences of 300 randomly sampled people, created from two non-overlapping camera views in the iLIDS multiple-camera tracking scenario. The sequences are of varying length, ranging from 23 to 192 images, with an average of 73. This dataset is very challenging due to variations in lighting and viewpoint caused by cross-camera views, similar appearances among people, and cluttered backgrounds.

The PRID 2011 dataset includes 400 image sequences of 200 persons from two adjacent camera views. Each sequence is between 5 and 675 frames long, with an average of 100. Compared with iLIDS-VID, this dataset was captured in uncrowded outdoor scenes with rare occlusions and clean backgrounds. However, it exhibits obvious colour changes and shadows in one of the views.

The MARS dataset contains 1,261 pedestrians and 20,000 video sequences, making it the largest video re-ID dataset. It provides rich motion information, using the DPM detector [70] and the GMMCP tracker [71] for pedestrian detection and tracking. As suggested by [21], the dataset is evenly divided into train and test sets containing 631 and 630 identities, respectively.
Fig. 4: Example images. (a) iLIDS-VID; (b) PRID2011; (c) MARS.
IV-B Experimental Settings
We extract percepts using the convolutional neural network model of [25]. To fit each input frame to the model, we resize it to 224 × 224 and run it through the convnet to produce the RGB percepts. This is implemented by simply rescaling each image to a fixed size (we use OpenCV to resize each frame to 224 pixels in width and height). In person re-ID, cropping images has the drawback of potentially excluding critical parts of pedestrians, whilst simple resizing still yields plausible performance because frames from person re-ID datasets do not have very different aspect ratios. Additionally, we compute flow percepts by extracting optical flow with the Brox method and train the temporal-stream convolutional network as described in [72]. The optical flow is computed using the off-the-shelf GPU implementation of [73] from the OpenCV toolbox. Despite the fast computation time (0.06s per frame pair), it would still impose a bottleneck if done on-the-fly, so we pre-compute the flow before training. To avoid storing the displacement fields as floats, the horizontal and vertical components of the flow are linearly rescaled to an integer range. We use the 4096-dimensional fc6 layer as the input representation for both RGB and flow percepts.
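The flow-storage step above can be sketched as a linear rescaling into an 8-bit integer range; the [0, 255] target range is an assumption (the exact range is not stated), and the function name is illustrative.

```python
import numpy as np

def rescale_flow(flow):
    """Linearly map a float flow component field to [0, 255] and store
    it as 8-bit integers instead of floats, halving-or-better the size."""
    lo, hi = flow.min(), flow.max()
    scaled = (flow - lo) / max(hi - lo, 1e-12) * 255.0  # guard flat fields
    return np.round(scaled).astype(np.uint8)
```

Reconstructing the original values for training would additionally require storing the per-field (lo, hi) pair so the mapping can be inverted.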
For the iLIDS-VID and PRID2011 datasets, we randomly split each dataset into two subsets of equal size, one for training and the other for testing. Each of these two datasets has only two disjoint cameras, and each person has one sequence under each camera; thus, the sequences from the first camera are used as the probe set while the sequences from the second camera are used as the gallery. To constitute the training pairs, each person is regarded as a class, and we select one probe frame sequence and its gallery correspondence as a training pair. Note that for the MARS dataset we maintain the IDs that have both a probe and its correspondence, leaving 625 IDs for training. When testing on the MARS dataset, which has more than two cameras, we randomly select one camera as the probe view and a different, randomly selected camera as the gallery. Training was performed with the Adam optimiser for 50 epochs, with an early-stopping criterion that halts when the validation loss has not decreased for 10 epochs. To set hyper-parameters such as $\lambda$, we use cross-validation: the labeled paired videos in the training set are further split into a training subset and a validation subset containing 90% and 10% of the original examples, respectively. We use the labeled subset to learn the classifiers $C_p$ and $C_g$, evaluate the learned classifiers on the validation subset, and select the parameters corresponding to the classifiers with the lowest validation risk. All our models are implemented on the public Torch [74] platform, and all experiments are conducted on a single NVIDIA GeForce GTX TITAN X GPU with 12 GB memory.

Evaluation Measure
To evaluate the performance of our method and compare against other methods, we employ the standard Cumulated Matching Characteristics (CMC) curve as our evaluation metric, which indicates the probability of finding the correct match within the top $R$ ranks of the ranked gallery. We report Rank-$R$ average matching rates, obtained by randomly splitting each dataset into training and testing sets over 10 trials and averaging the results.

TABLE I: Comparison with deep generative feature learning models: Rank-1 (R=1) and Rank-20 (R=20) matching rates (%).

Methods      | iLIDS-VID      | PRID2011       | MARS
             | R=1    R=20    | R=1    R=20    | R=1    R=20
VLSTM [56]   | 39.7   89.5    | 57.7   95.2    | 30.5   62.4
VAE [58]     | 48.1   96.9    | 63.7   96.8    | 33.4   72.1
RDANN        | 54.0   96.9    | 73.8   97.7    | 46.4   90.2
VRNN [23]    | 51.0   95.8    | 69.7   97.1    | 41.1   88.6
Ours         |
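The CMC evaluation used above can be sketched in a few lines: given a probe-by-gallery distance matrix and each probe's true match index, rank-k accuracy is the fraction of probes whose correct gallery item appears within the top k of the ranked list. This is a generic sketch of the metric, not the paper's evaluation code.

```python
import numpy as np

def cmc(dist, true_idx, k):
    """Rank-k matching rate from a (num_probe x num_gallery) distance
    matrix; true_idx[i] is the gallery index matching probe i."""
    ranks = np.argsort(dist, axis=1)  # ascending distance = best first
    hits = [true_idx[i] in ranks[i, :k] for i in range(dist.shape[0])]
    return float(np.mean(hits))
```

Evaluating this at k = 1, 5, 10, 20 traces out the CMC curve reported in the tables.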
IV-C Hyperparameter Selection
In this section, we describe the selection of the hyper-parameter $\lambda$ using a variant of the reverse cross-validation approach [75] to optimise $\lambda$ in the context of view-invariance adaptation. The reverse validation risk associated with each candidate value of $\lambda$ is evaluated as follows. Given the video pairs from the probe view and the gallery view, we split each set into a training set (containing 90% of the original examples) and a validation set (the remaining 10%). We use the two training sets to learn a classifier that classifies each identity across views. Then, we learn a reverse classifier using the gallery training set and the unclassified part of the probe set. Finally, the learned reverse classifier is evaluated on the probe validation set, yielding a reverse validation risk for the process of addressing cross-view invariance. This process is repeated with multiple values of $\lambda$, and the selected value is the one corresponding to the classifier with the lowest reverse validation risk. As shown in Fig. 5, the adaptation parameter $\lambda$ is chosen among 9 values up to 1 on a logarithmic scale, and the optimal value of $\lambda$ is 0.6, where the method attains the lowest validation risk.
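The outer loop of this search reduces to evaluating a validation risk for each candidate on a logarithmic grid and keeping the minimiser. A minimal sketch, where `risk_fn` stands in for the (expensive) reverse-validation procedure and the grid endpoints are assumptions (the text gives only 9 values up to 1):

```python
import numpy as np

def select_lambda(candidates, risk_fn):
    """Return the candidate lambda with the lowest validation risk."""
    risks = [risk_fn(lam) for lam in candidates]
    return candidates[int(np.argmin(risks))]

# 9 candidate values on a log scale ending at 1 (lower endpoint assumed)
grid = np.logspace(-2, 0, 9)
```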
IV-D Ablation Studies
In this section, we conduct extensive experimental analysis to answer the following questions: (a) How does our method perform compared with other feature learning models for videos? (b) Are the learned temporal latent features effective in matching persons against view changes? (c) How is the performance of our method affected by varying the input length of the videos? (d) How large is the computational complexity compared with existing deep recurrent models?
Comparison with Other Feature Learning Models for Videos
To evaluate the effectiveness of the latent video representations learned by our model, we compare with state-of-the-art deep generative models:

RDANN: The domain-adversarial neural network [24] is a deep domain adaptation model which uses two components to create domain-invariant representations: a feature extractor that produces the data's latent representations, and an adversarial domain labeller that pushes the feature extractor to produce domain-invariant features. In our experiment, the feature extractor is a recurrent network (LSTM) with fc6 features as the input representation, where each LSTM has 2048 units.

VLSTM [56]: This method uses LSTM networks to learn representations of video sequences. The state of the encoder LSTM after the last input has been read is taken as the representation of the input video; as in RDANN, each LSTM has 2048 units.

VAE [58]: The variational autoencoder provides high-level latent random variables to model the variability observed in the data, combining a highly flexible non-linear mapping between the latent random state and the observed output with effective approximate inference.

VRNN [23]: The variational recurrent neural network extends the VAE into a recurrent framework for modelling high-dimensional sequences.
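The adversarial component shared by RDANN [24] and our model is typically realised as a gradient reversal layer: identity in the forward pass, sign-flipped gradient in the backward pass. A conceptual numpy sketch (not the paper's implementation; `lam` is the adaptation weight):

```python
import numpy as np

def grl_forward(x):
    """Gradient Reversal Layer: identity in the forward pass."""
    return x

def grl_backward(grad, lam=1.0):
    """Backward pass: flip the gradient's sign, scaled by the adaptation
    weight, so the feature extractor is updated to *confuse* the
    view/domain classifier rather than help it."""
    return -lam * grad

# Toy check: a gradient that would sharpen view discrimination becomes
# a push in the opposite direction for the feature extractor.
g = np.array([0.5, -2.0])
reversed_g = grl_backward(g, lam=0.6)
```

In a full framework this layer sits between the feature extractor and the view discriminator, so a single backward pass trains both adversaries.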
The comparison results are given in Table I. The proposed method is superior on all datasets, with large improvements over VLSTM [56]. The lower recognition accuracy of the LSTM model is possibly due to the difficulty of training under large visual variations and cross-view divergence. VAE [58] performs better than VLSTM [56] by introducing latent random variables, which are effective in learning the mapping between videos and latent representations. RDANN [24] outperforms VRNN [23] by performing domain-invariant training; however, RDANN does not focus on the underlying temporal dependencies. In contrast, our approach explicitly addresses the cross-view feature distribution changes by adversarial training on the latent variables produced by VRNNs.
View-invariance Study
The second experiment demonstrates the view-invariance property of our features. To this end, we evaluate the matching performance of our method against the difference between feature distributions. Fig. 6, with dual Y-axes, shows two evaluations: the red line indicates the video matching performance of our method, which reduces the divergence between feature distributions across camera views as training proceeds; the blue line represents the difference between the feature distributions under the two camera views, measured by the KL-divergence. In this experiment, we use the testing data from the iLIDS-VID dataset and record the representations as the number of training epochs increases. We observe an improvement in matching accuracy as training proceeds, aligned with the decrease in feature distribution divergence as measured by the variational bound on the KL-divergence.
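One simple way to obtain such a divergence curve (a sketch, not the paper's exact estimator) is to fit a diagonal Gaussian to each view's features at each epoch and evaluate the closed-form KL between them:

```python
import numpy as np

def kl_diag_gaussians(mu1, var1, mu2, var2):
    """KL( N(mu1, var1) || N(mu2, var2) ) for diagonal Gaussians,
    summed over feature dimensions."""
    return 0.5 * np.sum(np.log(var2 / var1)
                        + (var1 + (mu1 - mu2) ** 2) / var2 - 1.0)

def view_divergence(feats_a, feats_b):
    """Fit a diagonal Gaussian to each view's feature matrix
    (rows = samples) and return the KL between the two fits."""
    mu1, v1 = feats_a.mean(0), feats_a.var(0) + 1e-8
    mu2, v2 = feats_b.mean(0), feats_b.var(0) + 1e-8
    return kl_diag_gaussians(mu1, v1, mu2, v2)
```

A decreasing `view_divergence` across epochs corresponds to the blue line in Fig. 6 flattening toward zero.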
(a) iLIDS-VID  (b) PRID2011  (c) MARS
Variable Length
The third study investigates how performance varies with the length of the probe and gallery sequences at test time. Evaluations are conducted on the three datasets, with the probe and gallery sequence lengths varied between 1 and 128 frames in steps of powers of two. Results are shown in Fig. 7, where a bar matrix gives the rank-1 re-identification matching rate as a function of the probe and gallery sequence lengths. Increasing the length of either the probe or the gallery sequence improves matching accuracy, and longer gallery sequences bring larger performance gains than longer probe sequences.
Computational Cost Analysis
The final study analyses the computational cost of our model compared with existing deep recurrent models. We consider a vanilla LSTM [68] with three layers for comparison. Both the vanilla LSTM and the light LSTM in our model are trained with 1024-dimensional memory cells, and the light LSTM has an additional 256-dimensional recurrent projection layer. The comparison results are shown in Table II. The LSTM with projection outperforms the vanilla LSTM while using roughly one third of its weight parameters.
Models  Number of Weights  Rank-1 iLIDS-VID  Rank-1 PRID2011
vanilla LSTM [68]  22,595,584  59.6  79.0 
Ours  7,548,160  60.1  79.3 
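The parameter saving from the recurrent projection can be seen from a back-of-envelope count. The sketch below assumes hypothetical 4096-d fc6 inputs and a simple weights-plus-biases tally, so the totals will not reproduce Table II exactly (the paper's counting convention is not stated); it only illustrates why projecting the recurrent state shrinks the model:

```python
def lstm_params(input_dim, hidden, proj=None):
    """Weight count of one LSTM layer.  With `proj`, a recurrent
    projection (as in Sak et al. [69]) feeds back a `proj`-dimensional
    state instead of the full `hidden`-dimensional one."""
    rec = proj if proj else hidden
    n = 4 * hidden * (input_dim + rec) + 4 * hidden  # gate weights + biases
    if proj:
        n += hidden * proj                           # projection matrix
    return n

# Hypothetical 4096-d inputs; three stacked layers as in Table II.
vanilla = sum(lstm_params(x, 1024) for x in (4096, 1024, 1024))
projected = sum(lstm_params(x, 1024, proj=256) for x in (4096, 256, 256))
```

The quadratic `hidden * hidden` recurrent term dominates the vanilla count, and the projection replaces it with the much smaller `hidden * proj` term.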
IV-E Impact of Adversarial Learning
An interesting question is when to perform the adversarial operation: at every time step, or only at the last time step of latent representation learning? If the adversarial training is performed at every time step, namely early fusion, the network learns to produce view-invariant representations conditioned on the subsets of the input sequence. On the other hand, adversarial learning only at the last time step (i.e., late fusion) might be incapable of eliminating feature variations during representation learning. To study the impact of adversarial learning on the optimal view-invariant representations, we empirically test the two strategies by assessing the CMC values computed at each time step. The comparison results on the MARS dataset are shown in Fig. 8. The results indicate that progressively performing adversarial learning on the learned features at each time step leads to better representations.
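The two fusion strategies differ only in which latent states receive the adversarial penalty. A schematic sketch (the discriminator `view_logit_fn` and the binary cross-entropy form are illustrative assumptions, not the paper's exact loss):

```python
import numpy as np

def adversarial_loss(z, view_logit_fn):
    """Illustrative binary cross-entropy of a view discriminator that
    should assign latent z to the probe view (label 1)."""
    p = 1.0 / (1.0 + np.exp(-view_logit_fn(z)))
    return -np.log(p + 1e-12)

def early_fusion_loss(latents, view_logit_fn):
    """Early fusion: the adversarial term accumulates at every time step."""
    return sum(adversarial_loss(z, view_logit_fn) for z in latents)

def late_fusion_loss(latents, view_logit_fn):
    """Late fusion: only the final time step's latent is penalised."""
    return adversarial_loss(latents[-1], view_logit_fn)
```

Under early fusion, every intermediate latent is pushed toward view-invariance, which is the behaviour Fig. 8 favours.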
IV-F Comparison with State-of-the-art Approaches
In this section, we compare our method with state-of-the-art video-based person reID approaches. We consider the following competitors:

XQDA [11]: Cross-view Quadratic Discriminant Analysis, a supervised approach based on static appearance features that simultaneously learns a discriminant low-dimensional subspace and a QDA metric.

eSDC [5]: An effective unsupervised spatial-appearance-based method that learns localised appearance saliency statistics for measuring local patch importance.

SDALF [1]: A classic hand-crafted visual appearance feature for reID.

ISR [76]: A weighted dictionary-learning-based algorithm that iteratively extends sparse discriminative classifiers.

STFV3D [39]: A low-level feature-based Fisher vector learning and extraction method applied to spatially and temporally aligned video fragments.

RCN [17]: A contemporary deep learning approach that places LSTMs on top of a CNN to learn video-level features for person reID.

TSDTW [43]: An unsupervised method that extracts space-time person representations by encoding multiple granularities of spatio-temporal dynamics in the form of time series.

TDL [14]: A top-push distance learning (TDL) model incorporating a top-push constraint to quantify ambiguous video representations for video-based person reID.

RFAnet [18]: It uses a Long Short-Term Memory (LSTM) network to aggregate frame-wise person features in a recurrent manner.

[15]: A set-based supervised distance learning model that learns a pair of intra-video and inter-video distance metrics.

What-and-where [22]: An end-to-end Siamese-like network that matches person videos by attending to distinct local regions.

CUG [77]: It exploits the unlabelled tracklets to update the CNNs by stepwise learning.
The comparison results on the three datasets are given in Table III and Fig. 9. Our method shows large improvements over multi-shot methods including XQDA [11], ISR [76], and eSDC [5], in which the temporal dependency across frames is not considered. Compared with 3D-feature-based methods such as HOG3D [16]+DVR [13], SDALF [1]+DVR [13], and STFV3D [39], our method attains higher matching rates by learning hidden representations able to faithfully describe person videos with variations. Competing methods RCN [17] and RFAnet [18] are based on an RNN structure and use deterministic functions to compute transitions from inputs to hidden states; they cannot capture the data variability that must be addressed in cross-view video matching, which leads to inferior performance relative to our approach. In contrast, our approach explicitly addresses cross-view differences through adversarial learning, and the learned video features are more effective for person reID. For example, on the iLIDS-VID dataset, RCN [17] attains a 58.0% rank-1 matching rate while the proposed method achieves 60.1%. On the PRID2011 dataset, our method outperforms [15] by 2.5% at rank-1 while using few labelled training pairs. Note that our method performs second to What-and-where [22] at rank-1 and rank-5 on the MARS dataset; a possible reason is that What-and-where [22] relies on supervised learning with body-part annotations to match persons, whereas our method does not require any region annotations. When combined with KissME [8], Ours+KissME achieves further improvements on all datasets; for instance, it achieves the best rank-1 results (64.6% and 84.2%) on the iLIDS-VID and PRID2011 datasets and the best mAP (52.1%) on the MARS dataset. This suggests the performance can be further improved by combining our features with metric learning algorithms.
  iLIDS-VID  PRID2011  MARS
Methods  R=1  R=5  R=10  R=20  R=1  R=5  R=10  R=20  R=1  R=5  R=10  R=20  mAP
XQDA [11]  16.7  39.1  52.3  66.8  46.3  78.2  89.1  96.3  30.7  46.6  53.5  60.9  16.4 
ISR [76]  7.9  22.8  30.0  41.8  17.3  38.2  53.4  64.5           
eSDC [5]  10.2  24.8  35.5  52.9  25.8  43.6  52.6  62.0           
HOG3D [16]+DVR [13]  23.3  42.4  55.3  68.4  28.9  55.3  65.5  82.8  12.4  33.2  54.7  71.8   
SDALF [1] +DVR [13]  26.7  49.3  60.6  71.6  31.6  58.0  67.3  85.3  4.1  12.3  20.2  25.1  1.8 
STFV3D [39]  37.0  64.3  77.0  86.9  42.1  71.9  84.4  91.6           
RCN [17]  58.0  84.0  91.0  96.0  70.0  90.0  95.0  97.0           
TSDTW [43]  31.5  62.1  72.8  82.4  41.7  67.1  79.4  90.1           
TDL[14]  56.3  87.6  95.6  56.7  80.0  87.6  93.6            
RFAnet [18]  49.3  76.8  85.3  90.0  58.2  85.8  93.4  97.9           
[15]  48.7  81.1  89.2  97.3  76.7  95.6  96.7  98.9           
CNN+KissME+MQ [21]  48.8  75.6  84.5  92.6  69.9  90.6  96.5  98.2  68.3  82.6  86.0  89.4  49.3 
CUG [77]                  62.6  74.9  79.7  82.6  42.4 
What-and-where [22]  61.2  80.7  90.3  97.3  74.8  92.6  98.6  88.3  50.5  
Ours  60.1  97.9  97.9  54.6  76.5  96.4  
Ours+KissME [8]  97.9  61.2  79.5 
(a) iLIDS-VID  (b) PRID2011
IV-G Comparison with Other Few-Shot Learning Methods
We also compare with two recently proposed few-shot learning methods: matching networks [78] and model regression [79]. Matching networks take a nearest-neighbour approach that trains an embedding end-to-end for the few-shot learning task. Model regression trains a small MLP to regress from the classifier trained on a small dataset to the classifier trained on the full dataset. Both techniques have high capacity for learning from few examples and facilitate recognition in the small-sample-size regime across a broad range of tasks, including domain adaptation and fine-grained recognition. Comparison results are shown in Fig. 10. In terms of overall performance, our method consistently outperforms the two competitors on the MARS dataset. Matching networks exhibit performance similar to ours; however, they are based on nearest neighbours and keep the entire training set in memory, making them more expensive at test time than our method and model regressors. Note that we do not perform this comparison on the iLIDS-VID and PRID2011 datasets because those datasets contain only two cross-view sequences per person.
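The test-time cost argument above follows from the matching mechanism itself: every query is compared against the whole support set held in memory. A minimal sketch of that mechanism (cosine nearest neighbour; names are illustrative):

```python
import numpy as np

def nn_match(query_emb, support_embs, support_labels):
    """Label a query by its nearest (cosine) neighbour in the support
    set -- the matching mechanism underlying matching networks [78].
    Note the full support set must be kept in memory at test time."""
    q = np.asarray(query_emb, dtype=float)
    s = np.asarray(support_embs, dtype=float)
    q = q / np.linalg.norm(q)
    s = s / np.linalg.norm(s, axis=1, keepdims=True)
    return support_labels[int(np.argmax(s @ q))]
```

Each query costs one similarity computation per stored support example, which is what makes this approach expensive relative to a fixed parametric classifier.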
V Conclusion and Future Work
We propose a novel few-shot deep adversarial model with a latent variational structure to learn deep temporal representations for video-based person re-identification with few labelled training examples. The proposed method is based on a variational recurrent neural network, which provides a family of latent variables to describe input videos with temporal relationships. To eliminate the feature distribution divergence caused by view changes, the learned latent features are adversarially trained through a cross-view verification loss to make them view-invariant. The proposed method requires few labelled training examples while remaining generic and scalable to larger camera networks. In future work, we will explore one-shot or zero-shot learning with memories to improve the generalisation of the model.
Acknowledgement
Lin Wu was supported by ARC LP160101797 Improved Pathology by Fusion of Digital Microscopy and Plain Text Reports. Yang Wang was supported by National Natural Science Foundation of China, under Grant No 61806035. Meng Wang was supported by National Natural Science Foundation of China, under Grant No 61432019, 61732008, and 61725203.
References
 [1] M. Farenzena, L. Bazzani, A. Perina, V. Murino, and M. Cristani, “Person reidentification by symmetrydriven accumulation of local features,” in CVPR, 2010.
 [2] N. Gheissari, T. B. Sebastian, and R. Hartley, “Person reidentification using spatiotemporal appearance,” in CVPR, 2006.
 [3] R. Zhao, W. Ouyang, and X. Wang, “Learning midlevel filters for person reidentification,” in CVPR, 2014.
 [4] D. Gray and H. Tao, “Viewpoint invariant pedestrian recognition with an ensemble of localized features,” in ECCV, 2008.
 [5] R. Zhao, W. Ouyang, and X. Wang, “Unsupervised salience learning for person reidentification,” in CVPR, 2013.
 [6] Z. Li, S. Chang, F. Liang, T. S. Huang, L. Cao, and J. Smith, “Learning locallyadaptive decision functions for person verification,” in CVPR, 2013.
 [7] L. Ma, X. Yang, and D. Tao, “Person reidentification over camera networks using multitask distance metric learning,” IEEE Transactions on Image Processing, vol. 23, no. 8, pp. 3656–3670, 2014.
 [8] M. Kostinger, M. Hirzer, P. Wohlhart, P. M. Roth, and H. Bischof, “Large scale metric learning from equivalence constraints,” in CVPR, 2012.
 [9] W. Zheng, S. Gong, and T. Xiang, “Reidentification by relative distance comparison,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 3, pp. 653–668, 2013.
 [10] F. Xiong, M. Gou, O. Camps, and M. Sznaier, “Person reidentification using kernelbased metric learning methods,” in ECCV, 2014.
 [11] S. Liao, Y. Hu, X. Zhu, and S. Z. Li, “Person reidentification by local maximal occurrence representation and metric learning,” in CVPR, 2015.
 [12] B. Prosser, W. Zheng, S. Gong, and T. Xiang, “Person reidentification by support vector ranking,” in BMVC, 2010.
 [13] T. Wang, S. Gong, X. Zhu, and S. Wang, “Person reidentification by video ranking,” in ECCV, 2014.
 [14] J. You, A. Wu, X. Li, and W.S. Zheng, “Toppush videobased person reidentification,” in CVPR, 2016.
 [15] X. Zhu, X.Y. Jing, F. Wu, and H. Feng, “Videobased person reidentification by simultaneously learning intravideo and intervideo distance metrics,” in IJCAI, 2016.
 [16] A. Klaser, M. Marszałek, and C. Schmid, “A spatiotemporal descriptor based on 3dgradients,” in BMVC, 2008.
 [17] N. McLaughlin, J. M. del Rincon, and P. Miller, “Recurrent convolutional network for videobased person reidentification,” in CVPR, 2016.
 [18] Y. Yan, B. Ni, Z. Song, C. Ma, Y. Yan, and X. Yang, “Person reidentification via recurrent feature aggregation,” in ECCV, 2016.
 [19] H. Fan, L. Zheng, and Y. Yang, “Unsupervised person reidentification: Clustering and finetuning,” in arXiv:1705.10444, 2017.
 [20] C. Zhang, L. Wu, and Y. Wang, “Crossing generative adversarial networks for crossview person reidentification,” Neurocomputing, vol. 340, no. 7, pp. 259–269, 2019.
 [21] L. Zheng, Z. Bie, Y. Sun, J. Wang, C. Su, S. Wang, and Q. Tian, “Mars: A video benchmark for largescale person reidentification,” in ECCV, 2016.
 [22] L. Wu, Y. Wang, J. Gao, and X. Li, “Whatandwhere to look: Deep siamese attention networks for videobased person reidentification,” IEEE Transactions on Multimedia, DOI: 10.1109/TMM.2018.2877886, 2018.
 [23] J. Chung, K. Kastner, L. Dinh, K. Goel, A. Courville, and Y. Bengio, “A recurrent latent variable model for sequential data,” in NIPS, 2015.

 [24] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky, “Domainadversarial training of neural networks,” The Journal of Machine Learning Research, vol. 17, no. 1, pp. 2096–2030, 2016.
 [25] K. Simonyan and A. Zisserman, “Very deep convolutional networks for largescale image recognition,” in ICLR, 2015.
 [26] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in NIPS, 2012.
 [27] L. Wu, C. Shen, and A. van den Hengel, “Deep linear discriminant analysis on fisher networks: A hybrid architecture for person reidentification,” Pattern Recognition, vol. 65, p. 238–250, 2017.
 [28] M. Hirzer, C. Beleznai, P. M. Roth, and H. Bischof, “Person reidentification by descriptive and discriminative classification,” in SCIA, 2011.
 [29] L. Wu, Y. Wang, X. Li, and J. Gao, “Whatandwhere to match: Deep spatially multiplicative integration networks for person reidentification,” Pattern Recognition, vol. 76, pp. 727–738, 2018.
 [30] L. An, S. Yang, and B. Bhanu, “Person reidentification by robust canonical correlation analysis,” IEEE Signal Processing Letters, vol. 22, no. 8, pp. 1103–1107, 2015.
 [31] S. Pedagadi, J. Orwell, S. Velastin, and B. Boghossian, “Local fisher discriminant analysis for pedestrian reidentification,” in CVPR, 2013.
 [32] M. Kostinger, M. Hirzer, P. Wohlhart, P. M. Roth, and H. Bischof, “Large scale metric learning from equivalence constraints,” in CVPR, 2012.
 [33] L. Zhang, T. Xiang, and S. Gong, “Learning a discriminative null space for person reidentification,” in CVPR, 2016.
 [34] L. Wu, Y. Wang, J. Gao, and X. Li, “Deep adaptive feature embedding with local sample distributions for person reidentification,” Pattern Recognition, vol. 73, pp. 275–288, 2018.
 [35] Y. Wang, X. Lin, L. Wu, and et al, “Robust subspace clustering for multiview data by exploiting correlation consensus,” IEEE Trans. Image Processing, vol. 24, no. 11, pp. 3939–3949, 2015.
 [36] Y. Xu, L. Lin, W. Zheng, and X. Liu, “Human reidentification by matching compositional template with cluster sampling,” in ICCV, 2013.
 [37] A. BedagkarGala and S. Shah, “Partbased spatiotemporal model for multiperson reidentification,” Pattern Recognition Letters, vol. 33, pp. 1908–1915, 2012.
 [38] Y. Shen, T. Xiao, H. Li, S. Yi, and X. Wang, “Learning deep neural networks for vehicle reid with visualspatiotemporal path proposals,” in ICCV, 2017.
 [39] K. Liu, B. Ma, W. Zhang, and R. Huang, “A spatiotemporal appearance representation for videobased person reidentification,” in ICCV, 2015.
 [40] D. J. Berndt and J. Clifford, “Using dynamic time warping to find patterns in time series,” in Proceedings of the 3rd International Conference on Knowledge Discovery and Data Mining, 1994, pp. 359–370.
 [41] D. Simonnet, M. Lewandowski, S. A. Velastin, and J. Orwell, “Reidentification of pedestrians in crowds using dynamic time warping,” in ECCV, 2012.
 [42] T. Wang, S. Gong, X. Zhu, and S. Wang, “Person reidentification by video ranking,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 12, pp. 2501–2514, 2016.
 [43] X. Ma, X. Zhu, S. Gong, X. Xie, J. Hu, K.M. Lam, and Y. Zhong, “Person reidentification by unsupervised video matching,” Pattern Recognition, vol. 65, pp. 197–210, 2017.
 [44] N. Li, R. Jin, and Z.H. Zhou, “Toprank optimization in linear time,” in NIPS, 2014.
 [45] L. Wu, C. Shen, and A. van den Hengel, “Personnet: Person reidentification with deep convolutional neural networks,” in CoRR abs/1601.07255, 2016.
 [46] E. Ahmed, M. Jones, and T. K. Marks, “An improved deep learning architecture for person reidentification,” in CVPR, 2015.
 [47] W. Li, R. Zhao, X. Tang, and X. Wang, “Deepreid: Deep filter pairing neural network for person reidentification,” in CVPR, 2014.
 [48] D. Yi, Z. Lei, S. Liao, and S. Z. Li, “Deep metric learning for person reidentification,” in ICPR, 2014.
 [49] S.Z. Chen, C.C. Guo, and J.H. Lai, “Deep ranking for reidentification via joint representation learning,” IEEE Transactions on Image Processing, vol. 25, no. 5, pp. 2353–2367, 2016.
 [50] G. W. R. R. Varior and J. Lu, “Learning invariant color features for person reidentification,” in arXiv:1410.1035, 2014.
 [51] Y. Shen, W. Lin, J. Yan, M. Xu, J. Wu, and J. Wang, “Person reidentification with correspondence structure learning,” in ICCV, 2015.
 [52] L. Wu, Y. Wang, L. Shao, and M. Wang, “3d personvlad: Learning deep global representations for videobased person reidentification,” IEEE Transactions on Neural Networks and Learning Systems, DOI: 10.1109/TNNLS.2019.2891244, 2019.
 [53] E. Ustinova, Y. Ganin, and V. Lempitsky, “Multiregion bilinear convolutional neural networks for person reidentification,” in arXiv:1512.05300, 2015.
 [54] L. Wu, Y. Wang, X. Li, and J. Gao, “Deep attentionbased spatially recursive networks for finegrained visual recognition,” IEEE Transactions on Cybernetics, vol. 49, no. 5, pp. 1791–1802, 2019.
 [55] M. Ranzato, A. Szlam, J. Bruna, M. Mathieu, R. Collobert, and S. Chopra, “Video (language) modeling: a baseline for generative models of natural videos,” in arXiv:1412.6604, 2014.

 [56] N. Srivastava, E. Mansimov, and R. Salakhutdinov, “Unsupervised learning of video representations using lstms,” in ICML, 2015.
 [57] C. L. C. Mattos, Z. Dai, A. Damianou, J. Forth, G. A. Barreto, and N. D. Lawrence, “Recurrent gaussian processes,” in ICLR, 2016.
 [58] D. P. Kingma and M. Welling, “Autoencoding variational bayes,” in ICLR, 2014.
 [59] L. Wu, Y. Wang, and L. Shao, “Cycleconsistent deep generative hashing for crossmodal retrieval,” IEEE Transactions on Image Processing, vol. 28, no. 4, pp. 1602–1612, 2019.
 [60] B. Hariharan and R. Girshick, “Lowshot visual recognition by shrinking and hallucinating features,” in arXiv:1606.02819, 2016.
 [61] A. Wong and A. L. Yuille, “One shot learning via compositions of meaningful patches,” in ICCV, 2015.

 [62] F. Schroff, D. Kalenichenko, and J. Philbin, “Facenet: A unified embedding for face recognition and clustering,” in CVPR, 2015.
 [63] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf, “Webscale training for face identification,” in CVPR, 2015.
 [64] S. Ding, L. Lin, G. Wang, and H. Chao, “Deep feature learning with relative distance comparison for person reidentification,” Pattern Recognition, vol. 48, no. 10, pp. 2993–3003, 2015.
 [65] R. Hadsell, S. Chopra, and Y. LeCun, “Dimensionality reduction by learning an invariant mapping,” in CVPR, 2006.
 [66] G. Koch, R. Zemel, and R. Salakhutdinov, “Siamese neural networks for oneshot image recognition,” in ICML Deep Learning Workshop, 2015.
 [67] S. Hochreiter and J. Schmidhuber, “Long shortterm memory,” in Neural computation, 1997.
 [68] A. Graves and J. Schmidhuber, “Framewise phoneme classification with bidirectional lstm and other neural network architectures,” Neural Networks, vol. 18, no. 5–6, pp. 602–610, 2005.
 [69] H. Sak, A. Senior, and F. Beaufays, “Long shortterm memory based recurrent neural network architectures for large vocabulary speech recognition,” in arXiv: 1402.1128, 2014.
 [70] P. Felzenszwalb, R. Girshick, D. McAllester, and D. Ramanan, “Object detection with discriminatively trained partbased models,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 9, pp. 1627–1645, 2010.
 [71] A. Dehghan, S. Modiri, and A. M. Sha, “Gmmcp tracker: globally optimal generalized maximum multi clique problem for multiple object tracking,” in CVPR, 2015.
 [72] K. Simonyan and A. Zisserman, “Twostream convolutional networks for action recognition,” in NIPS, 2015.
 [73] T. Brox, A. Bruhn, N. Papenberg, and J. Weickert, “High accuracy optical flow estimation based on a theory for warping,” in ECCV, 2004.
 [74] R. Collobert, K. Kavukcuoglu, and C. Farabet, “Torch7: A matlablike environment for machine learning,” in BigLearn, NIPS Workshop, 2011.

 [75] E. Zhong, W. Fan, Q. Yang, O. Verscheure, and J. Ren, “Cross validation framework to choose amongst models and datasets for transfer learning,” in Machine Learning and Knowledge Discovery in Databases, 2010, pp. 547–562.
 [76] G. Lisanti, I. Masi, and A. D. Bagdanov, “Person reidentification by iterative reweighted sparse ranking,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 37, no. 8, pp. 1629–1642, 2015.
 [77] Y. Wu, Y. Lin, X. Dong, Y. Yan, W. Ouyang, and Y. Yang, “Exploit the unknown gradually: Oneshot videobased person reidentification by stepwise learning,” in CVPR, 2018.
 [78] O. Vinyals, C. Blundell, T. Lillicrap, K. Kavukcuoglu, and D. Wierstra, “Matching networks for one shot learning,” in arXiv:1606.04080, 2016.
 [79] Y. Wang and M. Hebert, “Learning to learn: model regression networks for easy small sample learning,” in ECCV, 2016.