1 Introduction
Measuring the distance between data points is a central ingredient of many data analysis and machine learning applications. Several kernel methods (kernel PCA (Schölkopf et al., 1997), kernel NMF (Li & Ding, 2006), etc.) and other nonparametric approaches, such as k-nearest neighbours (Altman, 1992), rely on the availability of a suitable distance function. Computer vision pipelines, e.g. for tracking over time, perform matching based on similarity scores.
But designing a distance function can be challenging: it is not always obvious how to write down mathematical formulae that accurately express a notion of similarity. Learning such functions has hence proven to be a viable alternative to manual engineering in this respect (NCA (Goldberger et al., 2005), metric learning (Xing et al., 2003), etc.). Often, these methods rely on the availability of pairs labelled as similar or dissimilar. A different route is that of exploiting the structure that latent-variable models learn. These approaches rest on the assumption that a set of high-dimensional observations is explained by points in a much simpler latent space. In their respective probabilistic versions, a latent prior distribution is transformed nonlinearly to give rise to a distribution of observations. The hope is that simple distances, such as the Euclidean distance measured in latent space, implement a function of similarity. Yet, these approaches do not incorporate the variation of the observations with respect to the latent points. For example, the observations will vary much more when a path in latent space crosses a class boundary.
In fact, recent approaches to nonlinear latent-variable models, such as the generative adversarial network (Goodfellow et al., 2014) or the variational autoencoder (VAE) (Kingma & Welling, 2014; Rezende et al., 2014), regularise the latent space to be compact, i.e. to remove low-density regions. This is in contrast to the aforementioned hope that Euclidean distances appropriately reflect similarity.
The above insight leads us to the development of flat manifold variational autoencoders. This class of VAEs defines the latent space as a Riemannian manifold and regularises the Riemannian metric tensor to be a scaled identity matrix. In this context, a flat manifold is a Riemannian manifold that is isometric to the Euclidean space. So as not to compromise expressiveness, we relax the compactness assumption and make use of a recently introduced hierarchical prior (Klushyn et al., 2019). As a consequence, the model is capable of learning a latent representation where the Euclidean metric is a proxy for the similarity between data points. This results in a computationally efficient distance metric that is practical for applications in real-time scenarios.
2 Variational Auto-Encoders with Flat Latent Manifolds
2.1 Background on Learning Hierarchical Priors in VAEs
Latent-variable models are defined as

(1) $p_\theta(\mathbf{x}) = \int p_\theta(\mathbf{x} \mid \mathbf{z})\, p(\mathbf{z})\, \mathrm{d}\mathbf{z},$

where $\mathbf{z}$ represents the latent variables and $\mathbf{x}$ the observable data. The integral in Eq. (1) is usually intractable, but it can be approximated by maximising the evidence lower bound (ELBO) (Kingma & Welling, 2014; Rezende et al., 2014):

(2) $\mathcal{L}_{\mathrm{ELBO}} = \mathbb{E}_{p_\mathcal{D}(\mathbf{x})} \big[ \mathbb{E}_{q_\phi(\mathbf{z} \mid \mathbf{x})} [ \log p_\theta(\mathbf{x} \mid \mathbf{z}) ] - \mathrm{KL}\big( q_\phi(\mathbf{z} \mid \mathbf{x}) \,\|\, p(\mathbf{z}) \big) \big],$

where $p_\mathcal{D}(\mathbf{x})$ is the empirical distribution of the data. The distribution parameters of the approximate posterior $q_\phi(\mathbf{z} \mid \mathbf{x})$ and the likelihood $p_\theta(\mathbf{x} \mid \mathbf{z})$ are represented by neural networks. The prior $p(\mathbf{z})$ is usually defined as a standard normal distribution. This model is commonly referred to as the variational autoencoder (VAE).
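As a minimal illustration of the two ELBO terms in Eq. (2), the sketch below computes a single-sample estimate for a toy Gaussian VAE with unit output variance. All names (`elbo_terms`, the toy decoder) are ours, not from the paper:

```python
import numpy as np

def elbo_terms(x, mu_z, logvar_z, decode, rng):
    """Single-sample ELBO estimate for a Gaussian VAE with unit output
    variance: E_q[log p(x|z)] - KL(q(z|x) || N(0, I)), up to a constant."""
    eps = rng.standard_normal(mu_z.shape)
    z = mu_z + np.exp(0.5 * logvar_z) * eps      # reparameterisation trick
    x_hat = decode(z)
    rec = -0.5 * np.sum((x - x_hat) ** 2)        # Gaussian log-likelihood term
    # Analytic KL between N(mu, diag(var)) and the standard normal prior
    kl = 0.5 * np.sum(np.exp(logvar_z) + mu_z ** 2 - 1.0 - logvar_z)
    return rec - kl, kl

rng = np.random.default_rng(0)
x = np.array([1.0, 0.0])
W = np.array([[0.5, -0.2], [0.1, 0.3]])
decode = lambda z: np.tanh(W @ z)                # hypothetical toy decoder
elbo, kl = elbo_terms(x, np.zeros(2), np.zeros(2), decode, rng)
assert kl == 0.0   # q(z|x) equals the prior here, so the KL term vanishes
```

In practice both expectations are estimated over minibatches and the networks are trained by gradient ascent on this objective.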
However, a standard normal prior often leads to an over-regularisation of the approximate posterior, which results in a less informative learned latent representation of the data (Tomczak & Welling, 2018; Klushyn et al., 2019). To enable the model to learn an informative latent representation, Klushyn et al. (2019) propose to use a flexible hierarchical prior $p_\Theta(\mathbf{z}) = \int p_\Theta(\mathbf{z} \mid \boldsymbol{\zeta})\, p(\boldsymbol{\zeta})\, \mathrm{d}\boldsymbol{\zeta}$, where $p(\boldsymbol{\zeta})$ is the standard normal distribution. Since the optimal prior is the aggregated posterior (Tomczak & Welling, 2018), the above integral is approximated by an importance-weighted (IW) bound (Burda et al., 2015) based on samples from $q_\Phi(\boldsymbol{\zeta} \mid \mathbf{z})$. This leads to a model with two stochastic layers and the following upper bound on the KL term:

(3) $\mathrm{KL}\big(q_\phi(\mathbf{z} \mid \mathbf{x}) \,\|\, p_\Theta(\mathbf{z})\big) \leq \mathbb{E}_{q_\phi(\mathbf{z} \mid \mathbf{x})} \Big[ \log q_\phi(\mathbf{z} \mid \mathbf{x}) - \mathbb{E}_{\boldsymbol{\zeta}_{1:K} \sim q_\Phi(\boldsymbol{\zeta} \mid \mathbf{z})} \Big[ \log \frac{1}{K} \sum_{k=1}^{K} \frac{p_\Theta(\mathbf{z} \mid \boldsymbol{\zeta}_k)\, p(\boldsymbol{\zeta}_k)}{q_\Phi(\boldsymbol{\zeta}_k \mid \mathbf{z})} \Big] \Big],$

where $K$ is the number of importance samples.
where is the number of importance samples. Since it has been shown that high ELBO values do not necessarily correlate with informative latent representations (Alemi et al., 2018; Higgins et al., 2017)—which is also the case for hierarchical models (Sønderby et al., 2016)—different optimisation approaches have been introduced (Bowman et al., 2016; Sønderby et al., 2016). Klushyn et al. (2019) follow the line of argument in (Rezende & Viola, 2018) and reformulate the resulting ELBO as the Lagrangian of a constrained optimisation problem:
(4) $\mathcal{F}_{\mathrm{VHP}}(\theta, \Theta, \phi, \Phi; \lambda) = \mathcal{F}^{\mathrm{KL}}(\theta, \Theta, \phi, \Phi) + \lambda \big( \mathbb{E}_{p_\mathcal{D}(\mathbf{x})} \mathbb{E}_{q_\phi(\mathbf{z} \mid \mathbf{x})} [ C_\theta(\mathbf{x}, \mathbf{z}) ] - \kappa^2 \big),$

with the optimisation objective $\mathcal{F}^{\mathrm{KL}}$, i.e. the importance-weighted KL term of Eq. (3) averaged over the data, the inequality constraint $\mathbb{E}_{p_\mathcal{D}(\mathbf{x})} \mathbb{E}_{q_\phi(\mathbf{z} \mid \mathbf{x})} [ C_\theta(\mathbf{x}, \mathbf{z}) ] \leq \kappa^2$, and the Lagrange multiplier $\lambda$. $C_\theta(\mathbf{x}, \mathbf{z})$ is defined as the reconstruction-error-related term in $-\log p_\theta(\mathbf{x} \mid \mathbf{z})$. Thus, we obtain the following optimisation problem:

(5) $\min_{\theta, \Theta, \phi, \Phi}\; \max_{\lambda \geq 0}\; \mathcal{F}_{\mathrm{VHP}}(\theta, \Theta, \phi, \Phi; \lambda).$
Building on that, the authors propose an optimisation algorithm, including a $\lambda$-update scheme, to achieve a tight lower bound on the log-likelihood. This approach is referred to as the variational hierarchical prior (VHP) VAE.
2.2 Learning Flat Latent Manifolds with VAEs
The VHP-VAE is able to learn a latent representation that corresponds to the topology of the data manifold (Klushyn et al., 2019). However, it is not guaranteed that the (Euclidean) distance between encoded data points in the latent space is a meaningful distance metric with respect to the observation space. In this work, we aim to measure the distance between observed data points directly in the latent space by means of the Euclidean distance of their encodings.
Chen et al. (2018a) and Arvanitidis et al. (2018) define the latent space of a VAE as a Riemannian manifold. This approach allows for computing the observation-space length of a trajectory $\gamma: [0, 1] \to \mathbb{R}^{n_z}$ in the latent space:

(6) $L(\gamma) = \int_0^1 \sqrt{\dot{\gamma}(t)^{\mathsf{T}}\, \mathbf{G}(\gamma(t))\, \dot{\gamma}(t)}\; \mathrm{d}t,$

where $\mathbf{G}$ is the Riemannian metric tensor and $\dot{\gamma}(t)$ the time derivative of the trajectory. We define the observation-space distance as the length of the shortest possible path

(7) $D(\mathbf{z}_1, \mathbf{z}_2) = \min_{\gamma} L(\gamma), \quad \text{s.t. } \gamma(0) = \mathbf{z}_1,\ \gamma(1) = \mathbf{z}_2,$

between two data points. The trajectory $\gamma$ that minimises $L(\gamma)$ is referred to as the geodesic. In the context of VAEs, $\mathbf{z}$ is transformed by a continuous function $f(\mathbf{z})$ (the decoder) to the observation space. The metric tensor is defined as $\mathbf{G}(\mathbf{z}) = \mathbf{J}(\mathbf{z})^{\mathsf{T}} \mathbf{J}(\mathbf{z})$, where $\mathbf{J} = \partial f / \partial \mathbf{z}$ is the Jacobian of the decoder.
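The length functional in Eq. (6) can be approximated numerically by discretising a latent trajectory and estimating the decoder Jacobian by finite differences. The sketch below uses our own naming and a toy linear decoder, for which the discretised length is exact:

```python
import numpy as np

def metric_tensor(decode, z, h=1e-5):
    """G(z) = J(z)^T J(z), with J estimated column-wise by finite differences."""
    d = z.shape[0]
    f0 = decode(z)
    J = np.stack([(decode(z + h * np.eye(d)[i]) - f0) / h for i in range(d)],
                 axis=1)
    return J.T @ J

def curve_length(decode, z0, z1, n=100):
    """Discretised Eq. (6) for the straight line gamma(t) = (1-t) z0 + t z1."""
    ts = np.linspace(0.0, 1.0, n + 1)
    dz = (z1 - z0) / n                       # gamma_dot(t) * dt
    length = 0.0
    for t in ts[:-1]:
        z = (1 - t) * z0 + t * z1
        G = metric_tensor(decode, z)
        length += np.sqrt(dz @ G @ dz)
    return length

# For a linear decoder f(z) = A z, the length of the straight line equals
# the observation-space distance ||A z1 - A z0||.
A = np.array([[2.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
decode = lambda z: A @ z
z0, z1 = np.zeros(2), np.array([1.0, 1.0])
assert abs(curve_length(decode, z0, z1) - np.linalg.norm(A @ (z1 - z0))) < 1e-3
```

For nonlinear decoders the straight line is generally not the geodesic; Eq. (7) requires minimising this length over trajectories.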
To measure the observation-space distance directly in the latent space, distances in the observation space should be proportional to distances in the latent space:

(8) $D(\mathbf{z}_1, \mathbf{z}_2) \propto \lVert \mathbf{z}_1 - \mathbf{z}_2 \rVert_2,$

where we define the Euclidean distance $\lVert \cdot \rVert_2$ as the distance metric in the latent space. This requires that the Riemannian metric tensor is $\mathbf{G}(\mathbf{z}) = c^2 \mathbf{I}$, with a constant $c \in \mathbb{R}$. As a consequence, the Euclidean distance in the latent space corresponds to the geodesic distance. We refer to a manifold with this property as a flat manifold (Lee, 2006). To obtain a flat latent manifold, the model typically needs to learn complex latent representations of the data (see experiments in Sec. 4). Therefore, we propose the following approach: (i) to enable our model to learn complex latent representations, we apply a flexible prior (VHP), which is learned by the model (empirical Bayes); and (ii) we regularise the curvature of the decoder such that $\mathbf{G}(\mathbf{z}) = c^2 \mathbf{I}$.
For this purpose, the VHP-VAE, introduced in Sec. 2.1, is extended by a Jacobian-regularisation term. We define the regularisation term as part of the optimisation objective, which is in line with the constrained-optimisation setting. The resulting objective function is

(9) $\mathcal{F}_{\mathrm{VHP\text{-}FM}} = \mathcal{F}_{\mathrm{VHP}} + \eta\, \mathbb{E}_{p_\mathcal{D}(\mathbf{x})} \mathbb{E}_{q_\phi(\mathbf{z} \mid \mathbf{x})} \big[ \lVert \mathbf{G}(\mathbf{z}) - c^2 \mathbf{I} \rVert_2^2 \big],$

where $\eta$ is a hyperparameter determining the influence of the regularisation and $c$ the scaling factor. Additionally, we use a stochastic approximation (first-order Taylor expansion) of the Jacobian to improve the computational efficiency (Rifai et al., 2011b):
(10) $\mathbf{J}_i(\mathbf{z}) \approx \frac{f(\mathbf{z} + \sigma\, \mathbf{e}_i) - f(\mathbf{z})}{\sigma},$

where $\mathbf{J}_i$ is the Jacobian (column) of the $i$-th latent dimension, $\mathbf{e}_i$ a standard basis vector, and $\sigma$ a small perturbation. This approximation method allows for a faster computation of the gradient and avoids the second-derivative problem of piecewise-linear layers (Chen et al., 2018a). However, the influence of the regularisation term in Eq. (9) on the decoder function is limited to regions where data is available. To overcome this issue, we propose to use mixup (Zhang et al., 2018), a data-augmentation method that was introduced in the context of supervised learning. We extend this method to the VAE framework (unsupervised learning) by applying it to encoded data in the latent space. This approach allows augmenting data by interpolating between two encoded data points $\mathbf{z}_1$ and $\mathbf{z}_2$:

(11) $g(\mathbf{z}_1, \mathbf{z}_2, \lambda) = \lambda\, \mathbf{z}_1 + (1 - \lambda)\, \mathbf{z}_2,$

with $\lambda \sim \mathrm{U}(-\alpha, 1 + \alpha)$ and $\alpha \geq 0$. In contrast to (Zhang et al., 2018), where $\lambda \in [0, 1]$ limits the data augmentation to convex combinations, we define $\lambda \sim \mathrm{U}(-\alpha, 1 + \alpha)$ to take into account the outer edge of the data manifold. By combining mixup in Eq. (11) with Eq. (9), we obtain the objective function of our flat manifold VAE (FMVAE):
(12) $\mathcal{F}_{\mathrm{VHP\text{-}FMVAE}} = \mathcal{F}_{\mathrm{VHP}} + \eta\, \mathbb{E}_{p_\mathcal{D}(\mathbf{x}_1),\, p_\mathcal{D}(\mathbf{x}_2)} \mathbb{E}_{q_\phi(\mathbf{z}_1 \mid \mathbf{x}_1),\, q_\phi(\mathbf{z}_2 \mid \mathbf{x}_2)} \mathbb{E}_{\lambda} \big[ \lVert \mathbf{G}\big(g(\mathbf{z}_1, \mathbf{z}_2, \lambda)\big) - c^2 \mathbf{I} \rVert_2^2 \big].$

Inspired by batch normalisation, we define the squared scaling factor $c^2$ to be the mean over the batch samples and diagonal elements of $\mathbf{G}$ (see App. A.2 for empirical support):

(13) $c^2 = \operatorname{mean}_{n, i} \Big[ \mathbf{G}\big(g(\mathbf{z}_1^{(n)}, \mathbf{z}_2^{(n)}, \lambda^{(n)})\big) \Big]_{ii}.$
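The latent-space mixup of Eq. (11) is straightforward to implement; the sketch below (our naming) draws the interpolation coefficient from the extended range U(-alpha, 1 + alpha):

```python
import numpy as np

def latent_mixup(z1, z2, alpha, rng):
    """g(z1, z2, lam) = lam * z1 + (1 - lam) * z2 with
    lam ~ U(-alpha, 1 + alpha), cf. Eq. (11)."""
    lam = rng.uniform(-alpha, 1.0 + alpha)
    return lam * z1 + (1.0 - lam) * z2

rng = np.random.default_rng(0)
z1, z2 = np.zeros(2), np.ones(2)
samples = np.array([latent_mixup(z1, z2, 0.5, rng) for _ in range(1000)])
# alpha > 0 places some augmented points beyond the segment between z1 and
# z2, covering the outer edge of the data manifold.
assert samples.min() < 0.0 and samples.max() > 1.0
```

With alpha = 0 this reduces to ordinary convex combinations between the two encodings.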
The optimisation algorithm (Alg. 1) and further details about the optimisation process can be found in App. A.4.
By using augmented data, we regularise $\mathbf{G}(\mathbf{z})$ to be a scaled identity matrix for the entire latent space enclosed by the data manifold. Hence, the VHP-FMVAE learns a Euclidean latent space. As a consequence, the function $f$ (the decoder) is, up to the scaling factor $c$, distance-preserving, since $D_{\mathbf{x}} = c \cdot D_{\mathbf{z}}$, where $D_{\mathbf{x}}$ and $D_{\mathbf{z}}$ refer to the distance in the observation and latent space, respectively.
The decoder of the proposed approach satisfies the Lipschitz continuity condition $\lVert f(\mathbf{z}_1) - f(\mathbf{z}_2) \rVert_2 \leq L\, \lVert \mathbf{z}_1 - \mathbf{z}_2 \rVert_2$. We consider the decoder function, and hence the latent space, as smooth if the Lipschitz constant $L$ is close to the scaling factor $c$.
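Putting the pieces together, the sketch below estimates the flatness regulariser on a batch, using finite-difference Jacobians in the spirit of Eq. (10) and a batch-tied scaling factor as in Eq. (13). Function names and the toy decoder are ours:

```python
import numpy as np

def jacobian_fd(decode, z, sigma=1e-4):
    """Column-wise finite-difference Jacobian: J_i ~ (f(z + s e_i) - f(z)) / s."""
    d = z.shape[0]
    f0 = decode(z)
    cols = [(decode(z + sigma * np.eye(d)[i]) - f0) / sigma for i in range(d)]
    return np.stack(cols, axis=1)

def flatness_regulariser(decode, z_batch):
    """Batch estimate of E[||G(z) - c^2 I||^2], with c^2 the mean over batch
    samples and diagonal elements of G."""
    Gs = []
    for z in z_batch:
        J = jacobian_fd(decode, z)
        Gs.append(J.T @ J)
    c2 = np.mean([np.diag(G) for G in Gs])
    d = z_batch.shape[1]
    penalty = np.mean([np.sum((G - c2 * np.eye(d)) ** 2) for G in Gs])
    return penalty, c2

# A decoder composing a rotation with a scaling by 3 is already flat:
# G(z) = 9 I everywhere, so the penalty vanishes and c^2 recovers the scale.
Q = np.array([[0.0, 1.0], [-1.0, 0.0]])       # orthogonal (rotation) matrix
decode = lambda z: 3.0 * (Q @ z)
z_batch = np.random.default_rng(0).standard_normal((16, 2))
penalty, c2 = flatness_regulariser(decode, z_batch)
assert penalty < 1e-6 and abs(c2 - 9.0) < 1e-3
```

For a trained decoder the penalty is minimised jointly with the ELBO, driving the latent geometry towards the distance-preserving regime described above.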
3 Related Work
Interpretation of the VAE's latent space. In general, the latent space of VAEs is considered to be Euclidean (e.g. Kingma et al., 2016; Higgins et al., 2017), but it is not constrained to be Euclidean. This can be problematic if we require a precise metric that is based on the latent space. Some recent works (Mathieu et al., 2019; Grattarola et al., 2018) adapted the latent space to be non-Euclidean to match the data structure. We solve the problem from another perspective: we enforce the latent space to be Euclidean.
Jacobian and Hessian regularisation. In (Rifai et al., 2011a), the authors proposed to regularise the Jacobian and Hessian of the encoder. However, it is more difficult to augment data in the observation space than in the latent space. Encoder regularisation enables the model to perform better in tasks such as object recognition by means of the latent space; by contrast, decoder regularisation enables the model to perform tasks such as generating motions based on the latent space. In (Hadjeres et al., 2017), the Jacobian of the decoder was regularised to be as small as possible, i.e. close to zero. By contrast, we regularise the Riemannian metric tensor to be a scaled identity matrix, which constrains the Jacobian to be constant and hence the Hessian to be zero. (Nie & Patel, 2019) regularised the Jacobian with respect to the weights for GANs. In terms of supervised learning, (Jakubovitz & Giryes, 2018) used Jacobian regularisation to improve robustness for classification.
Metric learning. Various metric-learning approaches for both deep supervised and unsupervised models have been proposed. For instance, deep metric learning (Hoffer & Ailon, 2015) uses a triplet network for supervised learning. (Karaletsos et al., 2016) introduced an unsupervised metric-learning method, where a VAE is combined with triplets; however, a human oracle is still required. By contrast, our approach is completely based on unsupervised learning, using the Euclidean distance in the latent space as a distance metric. Our proposed method is similar to metric-learning methods such as Large Margin Nearest Neighbour (Weinberger & Saul, 2009), which pulls target neighbours together and pushes impostors away; the difference is that our approach is unsupervised.
Constraints in latent space. Constraints on time (e.g. Wang et al., 2007; Chen et al., 2016, 2015) allow obtaining similar distance metrics in the latent space. However, due to the lack of data, constraints on time cannot guarantee that the metric is correct between different sequences. Our method, by contrast, can be used for general datasets.
Data augmentation. The latent space is formed arbitrarily in regions where data is missing. Zhang et al. (2018) proposed mixup, an approach for augmenting input data and labels for supervised learning. Various followup studies of mixup were developed, such as (Verma et al., 2018; Beckham et al., 2019). We extend mixup to the VAE framework (unsupervised learning) by applying it to encoded data in the latent space. This facilitates the regularisation of regions where no data is available. As a consequence, similarity of data points can be measured in the latent space by applying the Euclidean metric.
Geodesic. Recent studies on geodesics for generative models (e.g. Tosi et al., 2014; Chen et al., 2018a; Arvanitidis et al., 2018) focus on methods for computing or finding the geodesic in the latent space. By contrast, we use the geodesic/Riemannian distance to influence the learned latent manifold. (Frenzel et al., 2019) projected the latent space onto a new latent space in which the geodesic is equivalent to the Euclidean interpolation. However, these two separate processes (VAE training and projection) likely hinder the model from finding the latent features autonomously. Another difference is that previous work assumes distances, defined by geodesics, can only be measured by following the data manifold. This is useful in scenarios such as avoiding unseen barriers between two data points, e.g. (Chen et al., 2018b), but it does not allow measuring distances between different categories. In this work, we focus on learning a general distance metric.

4 Experiments
We test our method on artificial pendulum images, human motion data, MNIST, and MOT16. We measure the performance in terms of equidistances, interpolation smoothness, and distance computation. Additionally, our method is applied to a real-world environment: a video-tracking benchmark, where the tracking and re-identification capabilities are evaluated.
The Riemannian metric tensor captures many intrinsic properties of a manifold and measures local angles, lengths, surface areas, and volumes (Bronstein et al., 2017). Therefore, the models are quantified based on the Riemannian metric tensor by computing condition numbers and magnification factors. The condition number, i.e. the ratio of the most elongated to the least elongated direction, is defined as $\lambda_{\max}/\lambda_{\min}$, where $\lambda_{\max}$ and $\lambda_{\min}$ are the largest and smallest eigenvalues of $\mathbf{G}$. The magnification factor $\mathrm{MF} = \sqrt{\det \mathbf{G}}$ (Bishop et al., 1997) depicts the sensitivity of the likelihood function. When projecting from the Riemannian (latent) to the Euclidean (observation) space, the MF can be considered a scaling coefficient. Since we cannot directly compare the MFs of different models, the MFs are normalised, i.e. divided by their means. The closer the condition number and the normalised MF are to one, the more invariant the model is with respect to the Riemannian metric tensor. In other words, the condition number and the normalised MF evaluate whether $\mathbf{G}(\mathbf{z})$ is approximately constant and proportional to the identity. In order to make the visualisations of the magnification factor in Sec. 4.1 (Fig. 1) and Sec. 4.2 (Fig. 3 and Fig. 7) comparable, we define the respective upper range of the colour bar accordingly. The MFs are computed with training data and by using a grid area, respectively.
To be in line with previous literature (e.g. Higgins et al., 2017; Sønderby et al., 2016), we use the parametrisation $\beta = 1/\lambda$ of the Lagrange multiplier in our experiments.
4.1 Artificial Pendulum Dataset
The pendulum dataset (Klushyn et al., 2019; Chen et al., 2018a) consists of 16×16-pixel images generated by a pendulum simulator. We generated images for joint angles sampled from a fixed range of degrees. Additionally, we added Gaussian noise to each pixel.
As seen in Fig. 1, without regularisation, the contour lines are denser in the centre of the latent space. The reason is that, in contrast to the VHP-VAE, the regularisation term in the VHP-FMVAE smoothens the latent space ($\mathbf{G}(\mathbf{z}) \approx c^2 \mathbf{I}$), as visualised by the MF and the equidistance plots. In Fig. 2, the VHP-FMVAE and the VHP-VAE are compared in terms of condition number and normalised MF. In both cases the VHP-FMVAE outperforms the VHP-VAE.
4.2 Human Motion Capture Database
To evaluate our approach on the CMU human motion dataset (http://mocap.cs.cmu.edu), we select five different movements: walking (subject 35), jogging (subject 35), balancing (subject 49), punching (subject 143), and kicking (subject 74). After data preprocessing, the input data is a 50-dimensional vector of joint angles. Note that the dataset is not balanced: walking, for example, has more data points than jogging.
Tab. 1: Ratios of Euclidean to geodesic distances in the observation and latent space.

dataset | method | observation | latent
Human | VHP-FMVAE | 1.02 ± 0.06 | 0.93 ± 0.03
Human | VHP-VAE | 1.23 ± 0.20 | 0.82 ± 0.10
MNIST | VHP-FMVAE | 1.01 ± 0.08 | 0.92 ± 0.05
MNIST | VHP-VAE | 1.13 ± 0.22 | 0.70 ± 0.31
Equidistance plots. In Fig. 3, we randomly select a data point from each class as the centres of the equidistance plots. In the case of our proposed method, the equidistance contour lines are homogeneous, while in the case of the VHP-VAE, they are distorted in regions of high MF values. Thus, the mapping from latent to observation space learned by the VHP-FMVAE is approximately distance-preserving. Additionally, we use the condition number and the normalised MF to evaluate $\mathbf{G}$ based on 3,000 random samples. In contrast to the VHP-VAE, both the condition number and the normalised MF values of the VHP-FMVAE are close to one, which indicates that $\mathbf{G}(\mathbf{z}) \approx c^2 \mathbf{I}$.
Smoothness. We randomly sample 100 pairs of points and linearly interpolate between each pair. The second derivative of each trajectory is defined as the smoothness factor. Fig. 5 illustrates that the VHP-FMVAE significantly outperforms the VHP-VAE in terms of smoothness. Fig. 6 shows five examples of the interpolated trajectories.
Verification of the distance metric. To verify that the Euclidean distance in the latent space corresponds to the geodesic distance, we approximate the geodesic by using a graph-based approach (Chen et al., 2019). The graph of the baseline has 14,400 nodes, which are sampled in the latent space using a uniform distribution. Each node has 12 neighbours. In Fig. 3, five geodesics are compared to the corresponding Euclidean interpolations. Tab. 1 shows the ratios of Euclidean distances in the latent space to geodesic distances, as well as the related ratios in the observation space. To compute the ratios, we randomly sampled 100 pairs of points and interpolated between each pair. If the ratio of the distances is close to one, the Euclidean interpolation approximates the geodesic. The VHP-FMVAE outperforms the VHP-VAE.

Influence of the data augmentation and the identity term. Fig. 4 and Fig. 6(a) show the influence of the data augmentation (see Sec. 2.2). Without data augmentation, the influence of the regularisation term is limited to regions where data is available, as verified by the high MF values between the different movements. As an additional experiment, Fig. 4 and Fig. 6(b) illustrate the influence of the identity term $c^2 \mathbf{I}$. If we remove it, the regularisation term penalises the Jacobian itself. As a consequence, the model is not able to learn a flat latent manifold.
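The graph-based geodesic baseline amounts to a shortest-path search over latent-space nodes, with edge weights given by (approximate) Riemannian segment lengths. The toy sketch below, with our own naming, uses Dijkstra's algorithm rather than the exact procedure of Chen et al. (2019):

```python
import heapq
import numpy as np

def graph_geodesic(edges, weight, src, dst):
    """Dijkstra shortest-path length on a neighbour graph; weight(i, j)
    returns the (approximate) Riemannian length of the edge (i, j)."""
    dist = {src: 0.0}
    heap = [(0.0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == dst:
            return d
        if d > dist.get(u, np.inf):
            continue                      # stale heap entry
        for v in edges[u]:
            nd = d + weight(u, v)
            if nd < dist.get(v, np.inf):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return np.inf

# Toy check on a 3-node chain in an already-Euclidean latent space: the
# graph geodesic between the endpoints equals the straight-line distance.
nodes = [np.array([0.0, 0.0]), np.array([1.0, 0.0]), np.array([2.0, 0.0])]
edges = {0: [1], 1: [0, 2], 2: [1]}
w = lambda i, j: np.linalg.norm(nodes[i] - nodes[j])
assert np.isclose(graph_geodesic(edges, w, 0, 2), 2.0)
```

On a flat latent manifold the graph geodesic and the Euclidean interpolation coincide, which is exactly what the distance ratios in Tab. 1 quantify.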
4.3 MNIST
The binarised MNIST dataset (Larochelle & Murray, 2011) consists of 50,000 training and 10,000 test images of handwritten digits (zero to nine), each 28×28 pixels in size.
Both of our evaluation metrics, the condition number and the normalised MF, show that the VHP-FMVAE outperforms the VHP-VAE (see Fig. 8 and Fig. 9). In contrast to the VHP-VAE, the VHP-FMVAE learns a latent space where Euclidean distances are close to geodesic distances (see Tab. 1). This indicates that $\mathbf{G}(\mathbf{z})$ is approximately constant.

4.4 MOT16 Object-Tracking Database
We evaluate our approach on the MOT16 object-tracking database (Milan et al., 2016), which is a large-scale person re-identification dataset containing both static and dynamic scenes from diverse cameras.
Method | Type | IDF | IDP | IDR | Recall | Precision | FAR | MT
VHP-FMVAE-SORT (ours) | unsupervised | 63.7 | 77.0 | 54.3 | 65.0 | 92.3 | 1.12 | 158
VHP-FMVAE-SORT (ours) | unsupervised | 64.2 | 77.6 | 54.8 | 65.1 | 92.3 | 1.13 | 162
VHP-VAE-SORT | unsupervised | 60.5 | 72.3 | 52.1 | 65.8 | 91.4 | 1.28 | 170
SORT | n.a. | 57.0 | 67.4 | 49.4 | 66.4 | 90.6 | 1.44 | 158
DeepSORT | supervised | 64.7 | 76.9 | 55.8 | 66.7 | 91.9 | 1.22 | 180

Method | PT | ML | FP | FN | IDs | FM | MOTA | MOTP | MOTAL
VHP-FMVAE-SORT (ours) | 269 | 90 | 5950 | 38592 | 616 | 1143 | 59.1 | 81.8 | 59.7
VHP-FMVAE-SORT (ours) | 265 | 90 | 6026 | 38515 | 598 | 1163 | 59.1 | 81.8 | 59.7
VHP-VAE-SORT | 266 | 81 | 6820 | 37739 | 693 | 1264 | 59.0 | 81.6 | 59.6
SORT | 275 | 84 | 7643 | 37071 | 1486 | 1515 | 58.2 | 81.9 | 59.5
DeepSORT | 250 | 87 | 6506 | 36747 | 585 | 1165 | 60.3 | 81.6 | 60.8
We compare with two baselines: SORT (Bewley et al., 2016) and DeepSORT (Wojke et al., 2017). SORT is a simple online and real-time tracking method, which uses bounding-box intersection-over-union (IOU) for associating detections between frames and Kalman filters for the track predictions. It relies on good two-dimensional bounding-box detections from a separate detector, and suffers from ID switching when tracks overlap in the image. DeepSORT extends the original SORT algorithm to integrate appearance information based on a deep appearance descriptor, which helps with re-identification in the case of such overlaps or missed detections. The deep appearance descriptor is trained using a supervised cosine metric-learning approach (Wojke & Bewley, 2018). The candidate object locations of the pre-generated detections for SORT, DeepSORT, and our method are taken from (Yu et al., 2016). Further details regarding the implementation can be found in App. A.3.

We use the following metrics for evaluation; (↑) indicates that a higher score is better, and (↓) that a lower score is better.
IDF (↑): ID F1 Score
IDP (↑): ID Precision
IDR (↑): ID Recall
FAR (↓): False Alarm Ratio
MT (↑): Mostly Tracked Trajectories
PT: Partially Tracked Trajectories
ML (↓): Mostly Lost Trajectories
FP (↓): False Positives
FN (↓): False Negatives
IDs (↓): Number of times an ID switches to a different previously tracked object
FM (↓): Fragmentations
MOTA (↑): Multi-object tracking accuracy
MOTP (↑): Multi-object tracking precision
MOTAL (↑): Log tracking accuracy
Tab. 2 shows that the performance of the proposed method is better than that of the model without Jacobian regularisation, and even close to the performance of supervised learning. All methods depend on the same underlying detector for object candidates and identical Kalman filter parameters. Compared to the SORT baseline, which does not utilise any appearance information, DeepSORT has 2.54 times, VHP-VAE-SORT 2.14 times, and the two VHP-FMVAE-SORT variants 2.41 and 2.48 times fewer ID switches. Whilst the supervised DeepSORT descriptor has the fewest, using unsupervised VAEs with flat decoders results in only 2.2% more switches, without the need for labels. Furthermore, by ensuring a quasi-Euclidean latent space, one can query nearest neighbours efficiently via data structures such as k-d trees. Fig. 10 shows an example of the results. In the other example videos, the VHP-FMVAE-SORT performs similarly to DeepSORT. Videos of the results can be downloaded at: http://tiny.cc/0s71cz
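To illustrate the k-d-tree point: assuming latent distances are approximately Euclidean, stored track embeddings can be indexed and matched by nearest-neighbour queries. This is a hypothetical sketch using SciPy's `cKDTree`, not the benchmark code:

```python
import numpy as np
from scipy.spatial import cKDTree

# Hypothetical re-identification sketch: index stored appearance embeddings
# of known tracks, then match an incoming detection's embedding by its
# nearest Euclidean neighbour.
rng = np.random.default_rng(0)
track_embeddings = rng.standard_normal((500, 8))   # stored track codes
tree = cKDTree(track_embeddings)

detection = track_embeddings[42] + 1e-3            # slightly perturbed query
dist, idx = tree.query(detection, k=1)
assert idx == 42                                   # matched to the right track
```

Such index structures rely on the query metric being (approximately) Euclidean, which is precisely what the flatness regularisation provides; with a curved latent geometry, Euclidean nearest neighbours need not be nearest in the Riemannian sense.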
5 Conclusion
In this paper, we have proposed a novel approach, which we call the flat manifold variational autoencoder. We have shown that this class of VAEs learns a latent representation where the Euclidean metric is a proxy for the similarity between data points. This is realised by interpreting the latent space as a Riemannian manifold and by combining a powerful empirical-Bayes prior with a regularisation method that constrains the Riemannian metric tensor to be a scaled identity matrix. Experiments on several datasets have shown the effectiveness of our proposed algorithm for measuring similarity. On the MOT16 object-tracking database, the performance of our unsupervised method nears that of state-of-the-art supervised approaches.
Acknowledgements
Thanks to Botond Cseke and Alexandros Paraschos for their useful feedback on this work.
References
 Alemi et al. (2018) Alemi, A. A., Poole, B., Fischer, I., Dillon, J. V., Saurous, R. A., and Murphy, K. Fixing a broken ELBO. ICML, 2018.
 Altman (1992) Altman, N. S. An introduction to kernel and nearest-neighbor nonparametric regression. The American Statistician, 46(3):175–185, 1992.
 Arvanitidis et al. (2018) Arvanitidis, G., Hansen, L. K., and Hauberg, S. Latent space oddity: on the curvature of deep generative models. In ICLR, 2018.
 Beckham et al. (2019) Beckham, C., Honari, S., Lamb, A. M., Verma, V., Ghadiri, F., Hjelm, R. D., Bengio, Y., and Pal, C. On adversarial mixup resynthesis. NeurIPS, 2019.
 Bewley et al. (2016) Bewley, A., Ge, Z., Ott, L., Ramos, F., and Upcroft, B. Simple online and realtime tracking. In IEEE ICIP, pp. 3464–3468, 2016.

 Bishop et al. (1997) Bishop, C. M., Svensén, M., and Williams, C. K. Magnification factors for the SOM and GTM algorithms. In Proceedings Workshop on Self-Organizing Maps, 1997.
 Bowman et al. (2016) Bowman, S. R., Vilnis, L., Vinyals, O., Dai, A., Jozefowicz, R., and Bengio, S. Generating sentences from a continuous space. CoNLL, 2016.

 Bronstein et al. (2017) Bronstein, M. M., Bruna, J., LeCun, Y., Szlam, A., and Vandergheynst, P. Geometric deep learning: going beyond Euclidean data. IEEE Signal Processing Magazine, 34(4):18–42, 2017.
 Burda et al. (2015) Burda, Y., Grosse, R. B., and Salakhutdinov, R. Importance weighted autoencoders. CoRR, abs/1509.00519, 2015.

 Chen et al. (2015) Chen, N., Bayer, J., Urban, S., and van der Smagt, P. Efficient movement representation by embedding dynamic movement primitives in deep autoencoders. In IEEE-RAS 15th International Conference on Humanoid Robots (Humanoids), pp. 434–440, 2015.
 Chen et al. (2016) Chen, N., Karl, M., and van der Smagt, P. Dynamic movement primitives in latent space of time-dependent variational autoencoders. In IEEE-RAS 16th International Conference on Humanoid Robots (Humanoids), pp. 629–636, 2016.
 Chen et al. (2018a) Chen, N., Klushyn, A., Kurle, R., Jiang, X., Bayer, J., and van der Smagt, P. Metrics for deep generative models. In AISTATS, pp. 1540–1550, 2018a.
 Chen et al. (2018b) Chen, N., Klushyn, A., Paraschos, A., Benbouzid, D., and van der Smagt, P. Active learning based on data uncertainty and model sensitivity. IEEE/RSJ IROS, 2018b.
 Chen et al. (2019) Chen, N., Ferroni, F., Klushyn, A., Paraschos, A., Bayer, J., and van der Smagt, P. Fast approximate geodesics for deep generative models. In ICANN, 2019.
 Frenzel et al. (2019) Frenzel, M. F., Teleaga, B., and Ushio, A. Latent space cartography: Generalised metricinspired measures and measurebased transformations for generative models. arXiv preprint arXiv:1902.02113, 2019.
 Goldberger et al. (2005) Goldberger, J., Hinton, G. E., Roweis, S. T., and Salakhutdinov, R. R. Neighbourhood components analysis. In Advances in neural information processing systems, pp. 513–520, 2005.
 Goodfellow et al. (2014) Goodfellow, I., PougetAbadie, J., Mirza, M., Xu, B., WardeFarley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. In NIPS, pp. 2672–2680, 2014.
 Grattarola et al. (2018) Grattarola, D., Zambon, D., Alippi, C., and Livi, L. Learning graph embeddings on constantcurvature manifolds for change detection in graph streams. arXiv preprint arXiv:1805.06299, 2018.
 Hadjeres et al. (2017) Hadjeres, G., Nielsen, F., and Pachet, F. GLSR-VAE: geodesic latent space regularization for variational autoencoder architectures. In IEEE Symposium Series on Computational Intelligence (SSCI), pp. 1–7, 2017.
 Higgins et al. (2017) Higgins, I., Matthey, L., Pal, A., Burgess, C., Glorot, X., Botvinick, M., Mohamed, S., and Lerchner, A. BetaVAE: Learning basic visual concepts with a constrained variational framework. ICLR, 2017.

 Hoffer & Ailon (2015) Hoffer, E. and Ailon, N. Deep metric learning using triplet network. In International Workshop on Similarity-Based Pattern Recognition, pp. 84–92. Springer, 2015.
 Jakubovitz & Giryes (2018) Jakubovitz, D. and Giryes, R. Improving DNN robustness to adversarial attacks using Jacobian regularization. In ECCV, pp. 514–529, 2018.
 Karaletsos et al. (2016) Karaletsos, T., Belongie, S., and Rätsch, G. Bayesian representation learning with oracle constraints. ICLR, 2016.
 Kingma & Welling (2014) Kingma, D. P. and Welling, M. Autoencoding variational Bayes. ICLR, 2014.
 Kingma et al. (2016) Kingma, D. P., Salimans, T., Jozefowicz, R., Chen, X., Sutskever, I., and Welling, M. Improving Variational Inference with Inverse Autoregressive Flow. NIPS, 2016.
 Klushyn et al. (2019) Klushyn, A., Chen, N., Kurle, R., Cseke, B., and van der Smagt, P. Learning hierarchical priors in VAEs. NeurIPS, 2019.

 Larochelle & Murray (2011) Larochelle, H. and Murray, I. The neural autoregressive distribution estimator. In International Conference on Artificial Intelligence and Statistics, pp. 29–37, 2011.
 Lee (2006) Lee, J. M. Riemannian manifolds: an introduction to curvature, volume 176. Springer Science & Business Media, 2006.
 Li & Ding (2006) Li, T. and Ding, C. The relationships among various nonnegative matrix factorization methods for clustering. In International Conference on Data Mining, pp. 362–371. IEEE, 2006.
 Mathieu et al. (2019) Mathieu, E., Lan, C. L., Maddison, C. J., Tomioka, R., and Teh, Y. W. Hierarchical representations with Poincaré variational autoencoders. NeurIPS, 2019.
 Milan et al. (2016) Milan, A., Leal-Taixé, L., Reid, I., Roth, S., and Schindler, K. MOT16: A benchmark for multi-object tracking. arXiv preprint arXiv:1603.00831, 2016.
 Nie & Patel (2019) Nie, W. and Patel, A. Towards a better understanding and regularization of GAN training dynamics. In UAI, 2019.
 Rezende & Viola (2018) Rezende, D. J. and Viola, F. Taming VAEs. arXiv preprint arXiv:1810.00597, 2018.

 Rezende et al. (2014) Rezende, D. J., Mohamed, S., and Wierstra, D. Stochastic backpropagation and approximate inference in deep generative models. In ICML, volume 32, pp. 1278–1286, 2014.
 Rifai et al. (2011a) Rifai, S., Dauphin, Y. N., Vincent, P., Bengio, Y., and Muller, X. The manifold tangent classifier. In Advances in Neural Information Processing Systems, pp. 2294–2302, 2011a.
 Rifai et al. (2011b) Rifai, S., Mesnil, G., Vincent, P., Muller, X., Bengio, Y., Dauphin, Y., and Glorot, X. Higher order contractive auto-encoder. In ECML-PKDD, pp. 645–660. Springer, 2011b.
 Ristani et al. (2016) Ristani, E., Solera, F., Zou, R. S., Cucchiara, R., and Tomasi, C. Performance measures and a data set for multitarget, multicamera tracking. CoRR, abs/1609.01775, 2016.

 Schölkopf et al. (1997) Schölkopf, B., Smola, A., and Müller, K.-R. Kernel principal component analysis. In International Conference on Artificial Neural Networks, pp. 583–588. Springer, 1997.
 Simonyan & Zisserman (2015) Simonyan, K. and Zisserman, A. Very deep convolutional networks for large-scale image recognition. ICLR, 2015.
 Sønderby et al. (2016) Sønderby, C. K., Raiko, T., Maaløe, L., Sønderby, S. K., and Winther, O. Ladder variational autoencoders. NIPS, 2016.
 Tomczak & Welling (2018) Tomczak, J. M. and Welling, M. VAE with a VampPrior. In International Conference on Artificial Intelligence and Statistics, pp. 1214–1223, 2018.
 Tosi et al. (2014) Tosi, A., Hauberg, S., Vellido, A., and Lawrence, N. D. Metrics for probabilistic geometries. In UAI, pp. 800–808, 2014.
 Verma et al. (2018) Verma, V., Lamb, A., Beckham, C., Najafi, A., Mitliagkas, I., Courville, A., LopezPaz, D., and Bengio, Y. Manifold mixup: Better representations by interpolating hidden states. ICML, 2018.
 Wang et al. (2007) Wang, J. M., Fleet, D. J., and Hertzmann, A. Gaussian process dynamical models for human motion. IEEE transactions on pattern analysis and machine intelligence, 30(2):283–298, 2007.
 Weinberger & Saul (2009) Weinberger, K. Q. and Saul, L. K. Distance metric learning for large margin nearest neighbor classification. Journal of Machine Learning Research, 10:207–244, 2009.
 Wojke & Bewley (2018) Wojke, N. and Bewley, A. Deep cosine metric learning for person reidentification. In IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 748–756, 2018.
 Wojke et al. (2017) Wojke, N., Bewley, A., and Paulus, D. Simple online and realtime tracking with a deep association metric. In IEEE International Conference on Image Processing, pp. 3645–3649, 2017.
 Xing et al. (2003) Xing, E. P., Jordan, M. I., Russell, S. J., and Ng, A. Y. Distance metric learning with application to clustering with sideinformation. In NIPS, pp. 521–528, 2003.
 Yu et al. (2016) Yu, F., Li, W., Li, Q., Liu, Y., Shi, X., and Yan, J. POI: multiple object tracking with high performance detection and appearance feature. CoRR, abs/1610.06136, 2016.
 Zhang et al. (2018) Zhang, H., Cisse, M., Dauphin, Y. N., and LopezPaz, D. mixup: Beyond empirical risk minimization. International Conference on Learning Representations, 2018.
Appendix A Appendix
A.1 Vector Field
A.2 Influence of $c^2$
A.3 Implementation of VHP-FMVAE-SORT
We evaluate the performance of our model by replacing the appearance descriptor from DeepSORT with the latent-space embedding of the respective autoencoder, using the same embedding size of 128. The hyperparameters were held constant: the minimum detection confidence, the NMS max overlap, the max cosine distance, and the max appearance budget. We tested the VHP-VAE and two hyperparameter variants of our regularised VHP-FMVAE.

A.4 Optimisation Process
Note: to be in line with previous literature (e.g. Higgins et al., 2017; Sønderby et al., 2016), we use the parametrisation $\beta = 1/\lambda$ of the Lagrange multiplier in our experiments.
As introduced in (Klushyn et al., 2019), we apply the following update scheme for $\beta = 1/\lambda$:

(14) $\beta_t = \beta_{t-1} \cdot \exp\big[\nu \cdot f_\beta(\hat{C}_t, \beta_{t-1}; \tau)\big],$

where $f_\beta$ is defined as

(15) $f_\beta(\hat{C}_t, \beta_{t-1}; \tau) = \big(1 - H(\beta_{t-1} - 1)\big) \cdot \tanh\big(\tau \cdot (\kappa^2 - \hat{C}_t)\big),$

$H$ is the Heaviside function and $\tau$ a slope parameter. Here $\hat{C}_t$ denotes the current estimate of the reconstruction-error term.
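A minimal sketch of a multiplicative beta-update with a Heaviside gate and a tanh slope, in the spirit of Klushyn et al. (2019); the exact functional form below is our paraphrase, not necessarily the published equation:

```python
import numpy as np

def update_beta(beta, C_hat, kappa2, nu=0.1, tau=1.0):
    """Multiplicative update for beta = 1/lambda: beta grows while the
    reconstruction constraint C_hat <= kappa^2 is satisfied and shrinks
    when it is violated; the Heaviside gate keeps beta from growing
    beyond one."""
    gate = 1.0 if beta >= 1.0 else 0.0       # H(beta - 1)
    f_beta = (1.0 - gate) * np.tanh(tau * (kappa2 - C_hat))
    return beta * np.exp(nu * f_beta)

# Warm-up behaviour: starting from a small beta, the constraint being
# satisfied (C_hat < kappa^2) drives beta upwards until the gate engages.
beta = 1e-3
for _ in range(200):
    beta = update_beta(beta, C_hat=0.5, kappa2=1.0)
assert beta > 1e-3                           # beta increased
assert update_beta(1.0, 2.0, 1.0) == 1.0     # gated once beta reaches one
```

In training, this schedule shifts the objective from reconstruction-dominated optimisation towards the full bound as the constraint becomes satisfied.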
A.5 Model Architectures

Pendulum (optimiser: Adam, learning rate $10^{-4}$)
  Input: 16×16×1. Latents: 2.
  Encoder: FC 256, 256. ReLU activation.
  Decoder: FC 256, 256. ReLU activation. Gaussian.
  $q_\Phi(\boldsymbol{\zeta} \mid \mathbf{z})$: FC 256, 256. ReLU activation.
  $p_\Theta(\mathbf{z} \mid \boldsymbol{\zeta})$: FC 256, 256. ReLU activation.
  Others: = 0.025, = 1, = 16.

CMU Human (optimiser: Adam, learning rate $10^{-4}$)
  Input: 50. Latents: 2.
  Encoder: FC 256, 256, 256, 256. ReLU activation.
  Decoder: FC 256, 256, 256, 256. ReLU activation. Gaussian.
  $q_\Phi(\boldsymbol{\zeta} \mid \mathbf{z})$: FC 256, 256, 256, 256. ReLU activation.
  $p_\Theta(\mathbf{z} \mid \boldsymbol{\zeta})$: FC 256, 256, 256, 256. ReLU activation.
  Others: = 0.03, = 1, = 32.

MNIST (optimiser: Adam, learning rate $10^{-4}$)
  Input: 28×28×1. Latents: 2.
  Encoder: FC 256, 256, 256, 256. ReLU activation.
  Decoder: FC 256, 256, 256, 256. ReLU activation. Bernoulli.
  $q_\Phi(\boldsymbol{\zeta} \mid \mathbf{z})$: FC 256, 256, 256, 256. ReLU activation.
  $p_\Theta(\mathbf{z} \mid \boldsymbol{\zeta})$: FC 256, 256, 256, 256. ReLU activation.
  Others: = 0.245, = 1, = 16.

MOT16 (optimiser: Adam, learning rate $3 \cdot 10^{-5}$)
  Input: 64×64×3. Latents: 128.
  Encoder: VGG16 (Simonyan & Zisserman, 2015).
  Decoder: Conv2DT+Conv2D 256, 128, 64, 32, 16. ReLU activation. Gaussian.
  $q_\Phi(\boldsymbol{\zeta} \mid \mathbf{z})$: FC 512, 512. ReLU activation.
  $p_\Theta(\mathbf{z} \mid \boldsymbol{\zeta})$: FC 512, 512. ReLU activation.
  Others: = 0.8, = 1, = 8.