1 Introduction
An instance embedding is a mapping $f$ from an input $x$, such as an image, to a vector representation, $z \in \mathbb{R}^D$, such that “similar” inputs are mapped to nearby points in space. Embeddings are a versatile representation that supports various downstream tasks, including image retrieval (Babenko et al., 2014) and face recognition (Schroff et al., 2015).
Instance embeddings are often treated deterministically, i.e., $z = f(x)$ is a point in $\mathbb{R}^D$. We refer to this approach as a point embedding. One drawback of this representation is the difficulty of modeling aleatoric uncertainty (Kendall & Gal, 2017), i.e., uncertainty induced by the input. In the case of images, this can be caused by occlusion, blurriness, low contrast, and other factors.
To illustrate this, consider the example in Figure 1(a). On the left, we show an image composed of two adjacent MNIST digits, the first of which is highly occluded. The right digit is clearly a 7, but the left digit could be a 1 or a 4. One way to express this uncertainty about which choice to make is to map the input to a region of space, representing the inherent uncertainty of “where it belongs”.
We propose a new method, called hedged instance embedding (HIB), which achieves this goal. Each embedding is represented as a random variable, $Z \sim p(z \mid x) \in \mathbb{R}^D$. The embedding effectively spreads probability mass across locations in space, depending on the level of uncertainty. For example, in Figure 1(b), the corrupted image is mapped to a two-component mixture of Gaussians covering both the “17” and “47” clusters. We propose a training scheme for HIB with a learnable-margin contrastive loss and the variational information bottleneck (VIB) principle (Alemi et al., 2016; Achille & Soatto, 2018).
To evaluate our method, we propose a novel open-source dataset, N-digit MNIST (https://github.com/google/ndigitmnist). Using this dataset, we show that HIB exhibits several desirable properties compared to point embeddings: (1) downstream task performance (e.g., recognition and verification) improves for uncertain inputs; (2) the embedding space exhibits enhanced structural regularity; and (3) it provides a per-exemplar uncertainty measure that predicts when the output of the system is reliable.
2 Methods
In this section, we describe our method in detail.
2.1 Point embeddings
Standard point embedding methods try to compute embeddings such that $z_1 = f(x_1)$ and $z_2 = f(x_2)$ are “close” in the embedding space if $x_1$ and $x_2$ are “similar” in the ambient space. To obtain such a mapping, we must decide on the definition of “closeness” as well as a training objective, as we explain below.
Contrastive loss
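Contrastive loss (Hadsell et al., 2006) is designed to encourage a small Euclidean distance between a similar pair, and a distance of at least a margin $M > 0$ for a dissimilar pair. The loss is
$$\mathcal{L}_{\text{con}} = \begin{cases} \|z_1 - z_2\|_2^2 & \text{if match} \\ \max\left(M - \|z_1 - z_2\|_2,\, 0\right)^2 & \text{if non-match} \end{cases} \tag{1}$$
where $z_i = f(x_i)$. The hyperparameter $M$ is usually set heuristically or based on validation-set performance.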
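As a minimal sketch of Equation 1, the snippet below evaluates the standard hard-margin contrastive loss of Hadsell et al. (2006) for a single pair of embeddings given as plain Python lists. The function names and the default margin of 1.0 are illustrative assumptions, not settings taken from our experiments.

```python
import math


def euclidean_distance(z1, z2):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(z1, z2)))


def contrastive_loss(z1, z2, is_match, margin=1.0):
    """Hard-margin contrastive loss for one embedding pair.

    `margin` is a hypothetical default; in practice it must be tuned.
    """
    distance = euclidean_distance(z1, z2)
    if is_match:
        # Matching pairs are pulled together.
        return distance ** 2
    # Non-matching pairs are pushed apart until they are `margin` away.
    return max(margin - distance, 0.0) ** 2
```

Note that a non-matching pair that is already farther apart than the margin contributes zero loss, which is why the choice of margin matters.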
Soft contrastive loss
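A probabilistic alternative to contrastive loss, which we use in our experiments, is defined as follows. It represents the probability that a pair of points is matching:
$$p(m \mid z_1, z_2) := \sigma\left(-a \|z_1 - z_2\|_2 + b\right) \tag{2}$$
with scalar parameters $a > 0$ and $b \in \mathbb{R}$, and the sigmoid function $\sigma(t) = \frac{1}{1 + e^{-t}}$. This formulation calibrates Euclidean distances into a probabilistic expression for similarity. Instead of setting a hard threshold like $M$, $a$ and $b$ together comprise a soft threshold on the Euclidean distance. We will later let $a$ and $b$ be trained from data.
Having defined the match probability $p(m \mid z_1, z_2)$, we formulate the contrastive loss as a binary classification loss based on the softmax cross-entropy (negative log-likelihood loss). More precisely, for an embedding pair $(z_1, z_2)$ the loss is defined as
$$\mathcal{L}_{\text{softcon}} = -\log p(m = \hat{m} \mid z_1, z_2) = \begin{cases} -\log p(m \mid z_1, z_2) & \text{if } \hat{m} = 1 \\ -\log\left(1 - p(m \mid z_1, z_2)\right) & \text{if } \hat{m} = 0 \end{cases} \tag{3}$$
where $\hat{m}$ is the indicator function with value $1$ for a ground-truth match and $0$ otherwise.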
Although some prior work has explored this soft contrastive loss (e.g. Bertinetto et al. (2016); Orekondy et al. (2018)), it does not seem to be widely used. However, in our experiments, it performs strictly better than the hard margin version, as explained in Appendix B.
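The match probability of Equation 2 and the soft contrastive loss of Equation 3 can be sketched as follows. This is an illustrative pure-Python version under stated assumptions: the scale `a` and offset `b` are passed in as fixed numbers (in HIB they are learned), and the small `eps` clamp is an implementation detail added here to avoid taking the logarithm of zero.

```python
import math


def sigmoid(t):
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def match_probability(z1, z2, a, b):
    """Probability that two point embeddings match: sigmoid(-a * distance + b)."""
    distance = math.sqrt(sum((u - v) ** 2 for u, v in zip(z1, z2)))
    return sigmoid(-a * distance + b)


def soft_contrastive_loss(z1, z2, is_match, a, b, eps=1e-12):
    """Negative log-likelihood of the ground-truth match label."""
    p = match_probability(z1, z2, a, b)
    likelihood = p if is_match else 1.0 - p
    # `eps` (an assumption of this sketch) guards against log(0).
    return -math.log(max(likelihood, eps))
```

With `b = 0`, two identical embeddings receive a match probability of exactly 0.5, so the offset `b` controls how confident the model can be about nearby pairs.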
2.2 Stochastic embeddings
In HIB, we treat embeddings as stochastic mappings $x \mapsto Z$, and write the distribution as $Z \sim p(z \mid x)$. In the sections below, we show how to learn and use this mapping.
Match probability for probabilistic embeddings
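The probability of two inputs matching, given in Equation 2, can easily be extended to stochastic embeddings, as follows:
$$p(m \mid x_1, x_2) = \int p(m \mid z_1, z_2)\, p(z_1 \mid x_1)\, p(z_2 \mid x_2)\, dz_1\, dz_2 \tag{4}$$
We approximate this integral via Monte Carlo sampling from $z_1^{(k_1)} \sim p(z_1 \mid x_1)$ and $z_2^{(k_2)} \sim p(z_2 \mid x_2)$:
$$p(m \mid x_1, x_2) \approx \frac{1}{K^2} \sum_{k_1=1}^{K} \sum_{k_2=1}^{K} p\left(m \mid z_1^{(k_1)}, z_2^{(k_2)}\right) \tag{5}$$
In practice, we get good results using $K = 8$ samples per input image. Now we discuss the computation of $p(z \mid x)$.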
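The Monte Carlo approximation of Equation 5 amounts to averaging the point-wise match probability over all pairs of samples. The sketch below assumes the samples have already been drawn and are given as lists of vectors; the function names are hypothetical.

```python
import math


def sigmoid(t):
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def match_probability(z1, z2, a, b):
    """Point-wise match probability: sigmoid(-a * distance + b)."""
    distance = math.sqrt(sum((u - v) ** 2 for u, v in zip(z1, z2)))
    return sigmoid(-a * distance + b)


def monte_carlo_match_probability(samples1, samples2, a, b):
    """Average the match probability over every (z1, z2) sample pair."""
    total = 0.0
    for z1 in samples1:
        for z2 in samples2:
            total += match_probability(z1, z2, a, b)
    return total / (len(samples1) * len(samples2))
```

The cost is quadratic in the number of samples per input, which is one reason to keep that number small.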
Single Gaussian embedding
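The simplest setting is to let $p(z \mid x)$ be a $D$-dimensional Gaussian with mean $\mu(x)$ and diagonal covariance $\Sigma(x)$, where $\mu$ and $\Sigma$ are computed via a deep neural network with a shared “body” and $2D$ total outputs. Given a Gaussian representation, we can draw $K$ samples $z^{(1)}, \dots, z^{(K)} \overset{\text{iid}}{\sim} p(z \mid x)$, which we can use to approximate the match probability. Furthermore, we can use the reparametrization trick (Kingma & Welling, 2013) to rewrite the samples as $z^{(k)} = \operatorname{diag}\left(\sqrt{\Sigma(x)}\right) \cdot \epsilon^{(k)} + \mu(x)$, where $\epsilon^{(1)}, \dots, \epsilon^{(K)} \overset{\text{iid}}{\sim} N(0, I)$. This enables easy backpropagation during training.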
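The reparametrized sampling step can be illustrated with a short sketch. It assumes the network outputs are already available as a mean vector and a vector of per-dimension variances (the diagonal of the covariance); the function name and argument layout are our own for illustration, and no gradients are computed here.

```python
import math
import random


def sample_gaussian_embedding(mean, variance, num_samples, rng):
    """Draw samples z = mean + sqrt(variance) * eps, with eps ~ N(0, I).

    `mean` and `variance` are equal-length lists; `variance` holds the
    diagonal of the covariance matrix. `rng` is a `random.Random` instance.
    """
    samples = []
    for _ in range(num_samples):
        # The noise is independent of the network outputs, so in an autodiff
        # framework gradients can flow through `mean` and `variance`.
        noise = [rng.gauss(0.0, 1.0) for _ in mean]
        samples.append(
            [m + math.sqrt(v) * e for m, v, e in zip(mean, variance, noise)]
        )
    return samples
```

For example, `sample_gaussian_embedding([0.0, 0.0], [1.0, 4.0], 8, random.Random(0))` returns eight 2-dimensional samples whose second coordinate has twice the standard deviation of the first.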
Mixture of Gaussians (MoG) embedding
We can obtain a more flexible representation of uncertainty by using a mixture of $C$ Gaussians to represent our embeddings, i.e., $p(z \mid x) = \frac{1}{C} \sum_{c=1}^{C} N\left(z;\, \mu(x, c),\, \Sigma(x, c)\right)$. When approximating Equation 5, we use stratified sampling, i.e., we sample the same number of samples from each Gaussian component.
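Stratified sampling from a mixture of Gaussians can be sketched as below: every component contributes the same number of samples, rather than first sampling a component index at random. The data layout (one mean list and one variance list per component) is an assumption made for illustration.

```python
import math
import random


def sample_mog_stratified(means, variances, samples_per_component, rng):
    """Draw the same number of samples from each diagonal Gaussian component.

    `means[c]` and `variances[c]` describe component c; `rng` is a
    `random.Random` instance.
    """
    samples = []
    for mean, variance in zip(means, variances):
        for _ in range(samples_per_component):
            samples.append(
                [m + math.sqrt(v) * rng.gauss(0.0, 1.0)
                 for m, v in zip(mean, variance)]
            )
    return samples
```

With two components and four samples per component, the result contains eight samples, the first four from the first component and the last four from the second.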
Computational considerations
The overall pipeline for computing the match probability is shown in Figure 2. If we use a single Gaussian embedding, the cost (time complexity) of computing the stochastic representation is essentially the same as for point embedding methods, due to the use of a shared network for computing $\mu$ and $\Sigma$. Also, the space requirement is only $2\times$ that of a point embedding. (This is an important consideration for many embedding-based methods.)
2.3 VIB training objective
For training our stochastic embedding, we combine two ingredients: the soft contrastive loss in Equation 3 and the VIB principle (Alemi et al., 2016; Achille & Soatto, 2018). We start with a summary of the original VIB formulation, and then describe its extension to our setting.
Variational Information Bottleneck (VIB)
A discriminative model $p(y \mid x)$ is trained under the information bottleneck principle (Tishby et al., 1999) by maximizing the following objective:
$$I(z, y) - \beta I(z, x) \tag{6}$$
where $I$ is the mutual information, and $\beta > 0$ is a hyperparameter which controls the tradeoff between the sufficiency of $z$ for predicting $y$, and the minimality (size) of the representation. Intuitively, this objective lets the latent encoding $z$ capture the salient parts of $x$ (salient for predicting $y$), while preventing it from “memorising” other parts of the input which are irrelevant.
Computing the mutual information is generally computationally intractable, but it is possible to use a tractable variational approximation, as shown in Alemi et al. (2016) and Achille & Soatto (2018). In particular, under the Markov assumption that $p(z \mid x, y) = p(z \mid x)$, we arrive at a lower bound on Equation 6 for every training data point $(x, y)$ as follows:
$$-\mathcal{L}_{\text{VIB}} := \mathbb{E}_{z \sim p(z \mid x)}\left[\log q(y \mid z)\right] - \beta \cdot \mathrm{KL}\left(p(z \mid x) \,\|\, r(z)\right) \tag{7}$$
where $p(z \mid x)$ is the latent distribution for $x$, $q(y \mid z)$ is the decoder (classifier), and $r(z)$ is an approximate marginal term that is typically set to the unit Gaussian $N(0, I)$.
In Alemi et al. (2016), this approach was shown (experimentally) to be more robust to adversarial image perturbations than deterministic classifiers. It has also been shown to provide a useful way to detect out-of-domain inputs (Alemi et al., 2018). Hence we use it as the foundation for our approach.
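When the latent distribution is a diagonal Gaussian and the approximate marginal is the unit Gaussian, the KL regularizer has a well-known closed form, $\frac{1}{2} \sum_d \left(\sigma_d^2 + \mu_d^2 - 1 - \log \sigma_d^2\right)$. The sketch below evaluates it for one input; the function name is hypothetical, and the variances are assumed to be strictly positive.

```python
import math


def kl_diag_gaussian_to_unit(mean, variance):
    """KL( N(mean, diag(variance)) || N(0, I) ) for a diagonal Gaussian.

    `variance` must contain strictly positive values.
    """
    return 0.5 * sum(
        v + m * m - 1.0 - math.log(v) for m, v in zip(mean, variance)
    )
```

The value is zero exactly when the embedding distribution equals the unit Gaussian, and grows as the mean moves away from the origin or the variance departs from one.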
VIB for learning stochastic embeddings
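We now apply the above method to learn our stochastic embedding. In particular, we train a discriminative model based on matching or mismatching pairs of inputs $(x_1, x_2)$, by minimizing the following loss:
$$\mathcal{L}_{\text{VIBEmb}} := -\mathbb{E}_{z_1 \sim p(z_1 \mid x_1),\, z_2 \sim p(z_2 \mid x_2)}\left[\log p(m = \hat{m} \mid z_1, z_2)\right] + \beta \cdot \left[\mathrm{KL}\left(p(z_1 \mid x_1) \,\|\, r(z_1)\right) + \mathrm{KL}\left(p(z_2 \mid x_2) \,\|\, r(z_2)\right)\right] \tag{8}$$
where the first term is given by the negative log-likelihood loss with respect to the ground-truth match $\hat{m}$ (this is identical to Equation 3, the soft contrastive loss), and the second term is the KL regularization term, with $r(z) = N(z; 0, I)$. The full derivation is in Appendix E. We optimize this loss with respect to the embedding function $(\mu(x), \Sigma(x))$, as well as with respect to the $a$ and $b$ terms in the match probability in Equation 2.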
Note that most pairs are not matching, so the $m = 1$ class is rare. To handle this, we encourage a balance of $m = 0$ and $m = 1$ pair samples within each SGD minibatch by using two streams of input sample images. One stream samples images from the training set at random, and the other selects images from specific class labels; the two streams are then randomly shuffled together to produce the final batch. As a result, each minibatch has plenty of positive pairs even when there are a large number of classes.
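One possible way to realize the two-stream batching described above is sketched below. The even split between the two streams, the number of selected classes, and all names are assumptions made for illustration; the text does not specify these details.

```python
import random


def sample_two_stream_batch(labels, batch_size, num_focus_classes, rng):
    """Return shuffled example indices for one minibatch and the chosen classes.

    `labels[i]` is the (sortable) class label of training example i and
    `rng` is a `random.Random` instance.
    """
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)

    # Stream 1: images drawn uniformly at random from the training set.
    random_stream = [rng.randrange(len(labels)) for _ in range(batch_size // 2)]

    # Stream 2: images drawn only from a few selected classes, so that
    # same-class (positive) pairs are guaranteed to appear in the batch.
    focus_classes = rng.sample(sorted(by_class), num_focus_classes)
    class_stream = [
        rng.choice(by_class[rng.choice(focus_classes)])
        for _ in range(batch_size - len(random_stream))
    ]

    batch = random_stream + class_stream
    rng.shuffle(batch)
    return batch, focus_classes
```

With few selected classes relative to the batch size, the second stream alone yields many matching pairs regardless of how many classes the dataset has.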
2.4 Uncertainty measure
One useful property of our method is that it provides a novel measure of uncertainty of the embedding, $\eta(x)$, which we define in terms of the self-mismatch probability as follows:
$$\eta(x) := 1 - p(m \mid x, x) \tag{9}$$
Uncertainty is high when an embedding is spread across the latent space. We compute this quantity using the Monte Carlo estimation in Equation 5. Prior works (Vilnis & McCallum, 2014; Bojchevski & Günnemann, 2018) have computed uncertainty for Gaussian embeddings based on volumetric measures like the trace or determinant of the covariance matrix. However, our metric can be computed for any distribution from which one can sample, including multimodal distributions like a mixture of Gaussians.
3 Related Work
In this section, we mention the most closely related work from the fields of deep learning and probabilistic modeling.
Probabilistic DNNs
Several works have considered the problem of estimating the uncertainty of a regression or classification model, $p(y \mid x)$, when given ambiguous inputs. One of the simplest and most widely used techniques is known as Monte Carlo dropout (Gal & Ghahramani, 2016). In this approach, different random components of the hidden activations are “masked out” and a distribution over the outputs is computed. By contrast, we compute a parametric representation of the uncertainty and use Monte Carlo to approximate the probability of two points matching.
VAEs and VIB
A variational autoencoder (VAE; Kingma & Welling, 2013) is a latent variable model of the form $p(x, z) = p(z)\, p(x \mid z)$, in which the generative decoder $p(x \mid z)$ and an encoder network $q(z \mid x)$ are trained jointly so as to maximize the evidence lower bound. By contrast, we compute a discriminative model on pairs of inputs to maximize a lower bound on the match probability. The variational information bottleneck (VIB) method (Alemi et al., 2016; Achille & Soatto, 2018) uses a variational approximation similar to the VAE to approximate the information bottleneck objective (Tishby et al., 1999). We build on this as explained in Section 2.3.
Point embeddings
Instance embeddings are often trained with metric learning objectives, such as contrastive (Hadsell et al., 2006) and triplet (Schroff et al., 2015) losses. Although these methods work well, they require careful sampling schemes (Wu et al., 2017; Movshovitz-Attias et al., 2017). Many alternatives have attempted to remove this dependency on sampling, including softmax cross-entropy loss coupled with the centre loss (Wan et al., 2018), or a clustering-based loss (Song et al., 2017), and have improved the embedding quality. In HIB, we use a soft contrastive loss, as explained in Section 2.1.
Probabilistic embeddings
The idea of probabilistic embeddings is not new. For example, Vilnis & McCallum (2014) proposed Gaussian embeddings to represent levels of specificity of word embeddings (e.g., “Bach” is more specific than “composer”). The closeness of two Gaussians is based on their KL-divergence, and uncertainty is computed from the spread of the Gaussian (the determinant of the covariance matrix). See also Karaletsos et al. (2015) and Bojchevski & Günnemann (2018) for related work. Neelakantan et al. (2014) proposed to represent each word using multiple prototypes, using a “best of $K$” loss when training. HIB, on the other hand, measures closeness based on a quantity related to the expected Euclidean distance, and measures uncertainty using the self-mismatch probability.
4 Experiments
In this section, we report our experimental results, where we compare our stochastic embeddings to point embeddings. We consider two main tasks: the verification task (i.e., determining if two input images correspond to the same class or not), and the identification task (i.e., predicting the label for an input image). For the latter, we use a K-nearest neighbors (KNN) approach. We compare the performance of three methods: a baseline deterministic embedding method, our stochastic embedding method with a Gaussian embedding, and our stochastic embedding method with a mixture of Gaussians embedding. We also conduct a qualitative comparison of the embeddings of each method.
4.1 Experimental details
We conduct all our experiments on a new dataset we created called N-digit MNIST, which consists of images composed of $N$ adjacent MNIST digits, which may be randomly occluded (partially or fully). See Appendix A for details. During training, we occlude 20% of the digits independently. A single image can have multiple corrupted digits. During testing, we consider both clean (unoccluded) and corrupted (occluded) images, and report results separately. We use images with $N = 2$ and $N = 3$ digits. In the Appendix, we provide details on the open-source version of this data, which is intended for use by other researchers and to ensure reproducibility.
Since our dataset is fairly simple, we use a shallow CNN model to compute the embedding function. Specifically, it consists of 2 convolutional layers, each followed by max pooling, and then a fully connected layer mapping to $D$ dimensions. We focus on the case where $D = 2$ or $D = 3$. When we use more dimensions, we find that all methods (both stochastic and deterministic) perform almost perfectly (upper 90%), so there are no interesting differences to report. We leave exploration of more challenging datasets, and higher-dimensional embeddings, to future work.
Our networks are built with TensorFlow (Abadi et al., 2015). For each task, the input is an image of size $28 \times (N \times 28)$, where $N$ is the number of concatenated digit images. We use a batch size of 128 and 500k training iterations. Each model is trained from scratch with random weight initialization.
4.2 Qualitative evaluation of the representation
Figure 4 shows HIB 2D Gaussian embeddings for the clean and corrupt subsets of the test set. We can easily see that the corrupt images generally have larger (i.e., less certain) embeddings. In the Appendix, Figure 8 shows a similar result when using a 2D MoG representation, and Figure 8 shows a similar result for 3D Gaussian embeddings.
Figure 4 illustrates the embeddings for several test set images, overlaid with an indication of each class’ centroid. Hedged embeddings capture the uncertainty that may exist across complex subsets of the class label space, by learning a layout of the embedding space such that classes that may be confused are able to receive density from the underlying hedged embedding distribution.
We observe enhanced spatial regularity when using HIB. Classes with a common least or most significant digit roughly align parallel to the $x$ or $y$ axis. This is because of the diagonal structure of the embedding covariance matrix. By controlling the parametrization of the covariance matrix, one may apply varying degrees and types of structures over the embedding space (e.g., diagonally aligned embeddings). See Appendix D for more analysis of the learned latent space.
4.3 Quantitative evaluation of the benefits of stochastic embedding
We first measure performance on the verification task, where the network is used to compute $p(m \mid x_1, x_2)$ for 10k pairs of test images, half of which are matches and half are not. The average precision (AP) of this task is reported in the top half of Table 1. HIB shows improved performance, especially for corrupted test images. For example, in the $N = 2$ digit case, when using $D = 2$ dimensions, point embeddings achieve 88.0% AP on corrupt test images, while hedged instance embeddings improve this to 90.7% with $C = 1$ Gaussian, and 91.2% with $C = 2$ Gaussians.
We next measure performance on the KNN identification task. The bottom half of Table 1 reports the results. Again, the proposed stochastic embeddings generally outperform point embeddings, with the greatest advantage for the corrupted input samples. For example, in the $N = 2$ digit case, when using $D = 2$ dimensions, point embeddings achieve 58.3% accuracy on corrupt test images, while HIB improves this to 76.0% with $C = 1$ Gaussian, and 75.7% with $C = 2$ Gaussians.
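For reference, the average precision used for the verification task above can be computed from predicted match probabilities and ground-truth match labels as in the following sketch. This is the common ranking-based definition (the mean of the precision values at the rank of each true match); the function name is hypothetical and tie handling is left to the sort order.

```python
def average_precision(scores, labels):
    """Average precision of a ranking induced by `scores`.

    `scores[i]` is the predicted match probability for pair i and
    `labels[i]` is 1 for a ground-truth match and 0 otherwise.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            # Precision among the top `rank` pairs, recorded at each true match.
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0
```

A ranking that places every matching pair ahead of every non-matching pair attains an average precision of 1.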

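| | N=2, D=2 | | | N=2, D=3 | | | N=3, D=2 | | | N=3, D=3 | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | point | MoG1 | MoG2 | point | MoG1 | MoG2 | point | MoG1 | MoG2 | point | MoG1 | MoG2 |
| Verification AP | | | | | | | | | | | | |
| clean | 0.987 | 0.989 | 0.990 | 0.996 | 0.996 | 0.996 | 0.978 | 0.981 | 0.976 | 0.987 | 0.989 | 0.991 |
| corrupt | 0.880 | 0.907 | 0.912 | 0.913 | 0.926 | 0.932 | 0.886 | 0.899 | 0.904 | 0.901 | 0.922 | 0.925 |
| KNN Accuracy | | | | | | | | | | | | |
| clean | 0.871 | 0.879 | 0.888 | 0.942 | 0.953 | 0.939 | 0.554 | 0.591 | 0.540 | 0.795 | 0.770 | 0.766 |
| corrupt | 0.583 | 0.760 | 0.757 | 0.874 | 0.909 | 0.885 | 0.274 | 0.350 | 0.351 | 0.522 | 0.555 | 0.598 |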
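| | N=2, D=2 | | N=2, D=3 | | N=3, D=2 | | N=3, D=3 | |
|---|---|---|---|---|---|---|---|---|
| | MoG1 | MoG2 | MoG1 | MoG2 | MoG1 | MoG2 | MoG1 | MoG2 |
| AP Correlation | | | | | | | | |
| clean | 0.74 | 0.43 | 0.68 | 0.48 | 0.63 | 0.28 | 0.51 | 0.39 |
| corrupt | 0.81 | 0.79 | 0.86 | 0.87 | 0.82 | 0.76 | 0.85 | 0.79 |
| KNN Correlation | | | | | | | | |
| clean | 0.71 | 0.57 | 0.72 | 0.47 | 0.76 | 0.29 | 0.74 | 0.54 |
| corrupt | 0.47 | 0.43 | 0.55 | 0.52 | 0.49 | 0.50 | 0.67 | 0.34 |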
4.4 Known unknowns
In this section, we address the task of estimating when an input can be reliably recognized or not, which has important practical applications. To do this, we use the measure of uncertainty defined in Equation 9.
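The self-mismatch uncertainty of Equation 9 can be estimated by drawing two independent sets of samples from the same embedding distribution and averaging their mismatch probability. The sketch below does this for a single diagonal Gaussian embedding; the names and the fixed values of `a` and `b` passed by the caller are assumptions for illustration.

```python
import math
import random


def _sigmoid(t):
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def self_mismatch_uncertainty(mean, variance, a, b, num_samples, rng):
    """Monte Carlo estimate of 1 - p(match | x, x) for a diagonal Gaussian."""

    def draw():
        return [
            [m + math.sqrt(v) * rng.gauss(0.0, 1.0) for m, v in zip(mean, variance)]
            for _ in range(num_samples)
        ]

    # Two independent sample sets from the same embedding distribution.
    samples1, samples2 = draw(), draw()
    total = 0.0
    for z1 in samples1:
        for z2 in samples2:
            distance = math.sqrt(sum((u - v) ** 2 for u, v in zip(z1, z2)))
            total += _sigmoid(-a * distance + b)
    return 1.0 - total / (num_samples ** 2)
```

A tightly concentrated embedding yields samples that are close to one another and hence a low estimate, whereas a widely spread embedding yields a value close to 1.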
We measure the utility of $\eta(x)$ for the identification task as follows. For the test set, we sort all test input examples according to $\eta(x)$, and bin examples into 20 bins ranging from the lowest to the highest range of uncertainty. We then measure the KNN classification accuracy for the examples falling in each bin.
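The binning procedure can be sketched as follows. The snippet assumes equal-sized bins (by example count) and that per-example correctness of the KNN prediction is already available; both the function name and the handling of empty bins are illustrative choices.

```python
def accuracy_per_uncertainty_bin(uncertainties, is_correct, num_bins=20):
    """Sort examples by uncertainty and report the accuracy within each bin.

    Returns a list of length `num_bins`, ordered from the least to the most
    uncertain bin; a bin with no examples is reported as None.
    """
    order = sorted(range(len(uncertainties)), key=lambda i: uncertainties[i])
    accuracies = []
    for b in range(num_bins):
        start = b * len(order) // num_bins
        end = (b + 1) * len(order) // num_bins
        members = order[start:end]
        if members:
            accuracies.append(sum(is_correct[i] for i in members) / len(members))
        else:
            accuracies.append(None)
    return accuracies
```

If the uncertainty measure is informative, the returned accuracies should tend to decrease from the first bin to the last.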
To measure the utility of $\eta(x)$ for the verification task, we take random pairs of samples, $(x_1, x_2)$, and compute the mean of their uncertainties, $\eta(x_1, x_2) = \frac{1}{2}\left(\eta(x_1) + \eta(x_2)\right)$. We then distribute the test pairs to 20 equal-sized bins according to their uncertainty levels, and compute the probability of a match for each pair. To cope with the severe class imbalance (most pairs don’t match), we measure performance for each bin using average precision (AP). Kendall’s tau is then applied to measure the uncertainty-performance correlation.
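Kendall’s tau between the bin index and the per-bin performance can be computed as in this sketch. It implements the simple tau-a variant (tied pairs count as neither concordant nor discordant), which is an assumption on our part, and flips the sign so that performance decreasing with uncertainty gives a positive value.

```python
def kendall_tau(xs, ys):
    """Kendall's tau-a rank correlation between two equal-length sequences."""
    n = len(xs)
    concordant = 0
    discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            product = (xs[i] - xs[j]) * (ys[i] - ys[j])
            if product > 0:
                concordant += 1
            elif product < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)


def uncertainty_performance_correlation(per_bin_performance):
    """Sign-inverted tau between bin index (uncertainty level) and performance."""
    bin_indices = list(range(len(per_bin_performance)))
    return -kendall_tau(bin_indices, per_bin_performance)
```

For example, per-bin accuracies that fall strictly from the least to the most uncertain bin give a correlation of exactly 1 under this convention.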
Figure 5 plots the AP and KNN accuracy vs the uncertainty bin index, for both clean and corrupted inputs. We see that when the performance drops off, the model’s uncertainty measure increases, as desired.
To quantify this, we compute the correlation between the performance metric and the uncertainty metric. Instead of the standard linear correlation (Pearson correlation coefficient), we use Kendall’s tau correlation (Kendall, 1938), which measures the degree of monotonicity between the performance and the uncertainty level (bin index), inverting the sign so that positive correlation aligns with our goal. The results of different models are shown in Table 2. Because the uncertainty measure includes a sampling step, we repeat each evaluation 10 times and report the mean results, with standard deviations ranging from 0.02 to 0.12. In general, the measure $\eta(x)$ correlates with the task performance. As a baseline for point embeddings in KNN, we explored using the distance to the nearest neighbor as a proxy for uncertainty, but found that it performed poorly. The HIB uncertainty metric correlates with task accuracy even within the subset of clean (uncorrupted) input images having no corrupted digits, indicating that HIB’s understanding of uncertainty goes beyond simply detecting which images are corrupted.
5 Discussion and Future Work
Hedged instance embedding is a stochastic embedding that captures the uncertainty of the mapping of an image to a latent embedding space, by spreading density across plausible locations. This results in improved performance on various tasks, such as verification and identification, especially for ambiguous corrupted input. It also allows for a simple way to estimate the uncertainty of the embedding that is correlated with performance on downstream tasks.
There are many possible directions for future work, including experimenting with higher-dimensional embeddings and harder datasets. It would also be interesting to consider the “open world” (or “unknown unknowns”) scenario, in which the test set may contain examples of novel classes, such as digit combinations that were not in the training set (see, e.g., Lakkaraju et al. (2017); Günther et al. (2017)). This is likely to result in uncertainty about where to embed the input which is different from the uncertainty induced by occlusion, since uncertainty due to the open world is epistemic (due to lack of knowledge of a class), whereas uncertainty due to occlusion is aleatoric (intrinsic, due to lack of information in the input), as explained in Kendall & Gal (2017). Preliminary experiments suggest that $\eta(x)$ correlates well with detecting occluded inputs, but does not work as well for novel classes. We leave more detailed modeling of epistemic uncertainty as future work.
Acknowledgments
We are grateful to Alex Alemi and Josh Dillon, who were very helpful with discussions and suggestions related to VAEs and Variational Information Bottleneck.
References

Abadi et al. (2015) Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dandelion Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. URL https://www.tensorflow.org/. Software available from tensorflow.org.
Achille & Soatto (2018) Alessandro Achille and Stefano Soatto. On the emergence of invariance and disentangling in deep representations. JMLR, 18:1–34, 2018.
Alemi et al. (2016) Alexander A. Alemi, Ian Fischer, Joshua V. Dillon, and Kevin Murphy. Deep variational information bottleneck. In ICLR, 2016.
Alemi et al. (2018) Alexander A. Alemi, Ian Fischer, and Joshua V. Dillon. Uncertainty in the variational information bottleneck. In UAI Workshop on Uncertainty in Deep Learning, 2018.
Babenko et al. (2014) Artem Babenko, Anton Slesarev, Alexandr Chigorin, and Victor Lempitsky. Neural codes for image retrieval. In ECCV, 2014.
Bertinetto et al. (2016) Luca Bertinetto, Jack Valmadre, João F. Henriques, Andrea Vedaldi, and Philip H. S. Torr. Fully-convolutional siamese networks for object tracking. arXiv preprint arXiv:1606.09549, 2016.
Bojchevski & Günnemann (2018) Aleksandar Bojchevski and Stephan Günnemann. Deep Gaussian embedding of attributed graphs: Unsupervised inductive learning via ranking. In ICLR, 2018.
Gal & Ghahramani (2016) Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In ICML, 2016.
Günther et al. (2017) Manuel Günther, Steve Cruz, Ethan M. Rudd, and Terrance E. Boult. Toward open-set face recognition. In CVPR Biometrics Workshop, 2017.
Hadsell et al. (2006) R. Hadsell, S. Chopra, and Y. LeCun. Dimensionality reduction by learning an invariant mapping. In CVPR, 2006.
Karaletsos et al. (2015) Theofanis Karaletsos, Serge Belongie, and Gunnar Rätsch. Bayesian representation learning with oracle constraints. In ICLR, 2015.
Kendall & Gal (2017) Alex Kendall and Yarin Gal. What uncertainties do we need in Bayesian deep learning for computer vision? In NIPS, 2017.
Kendall (1938) Maurice G. Kendall. A new measure of rank correlation. Biometrika, 30(1/2):81–93, 1938.
Kingma & Welling (2013) Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. In ICLR, 2013.
Lakkaraju et al. (2017) H. Lakkaraju, E. Kamar, R. Caruana, and E. Horvitz. Identifying unknown unknowns in the open world: Representations and policies for guided exploration, 2017.
LeCun (1998) Yann LeCun. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/, 1998.
Movshovitz-Attias et al. (2017) Yair Movshovitz-Attias, Alexander Toshev, Thomas K. Leung, Sergey Ioffe, and Saurabh Singh. No fuss distance metric learning using proxies. In ICCV, 2017.
Neelakantan et al. (2014) Arvind Neelakantan, Jeevan Shankar, Alexandre Passos, and Andrew McCallum. Efficient nonparametric estimation of multiple embeddings per word in vector space. In EMNLP, 2014.
Orekondy et al. (2018) Tribhuvanesh Orekondy, Seong Joon Oh, Bernt Schiele, and Mario Fritz. Understanding and controlling user linkability in decentralized learning. arXiv preprint arXiv:1805.05838, 2018.
Schroff et al. (2015) Florian Schroff, Dmitry Kalenichenko, and James Philbin. FaceNet: A unified embedding for face recognition and clustering. In CVPR, 2015.
Song et al. (2017) Hyun Oh Song, Stefanie Jegelka, Vivek Rathod, and Kevin Murphy. Deep metric learning via facility location. In CVPR, 2017.
Tishby et al. (1999) N. Tishby, F. C. Pereira, and W. Bialek. The information bottleneck method. In The 37th Annual Allerton Conf. on Communication, Control, and Computing, 1999.
Vilnis & McCallum (2014) Luke Vilnis and Andrew McCallum. Word representations via Gaussian embedding. In ICLR, 2014.
Wan et al. (2018) Weitao Wan, Yuanyi Zhong, Tianpeng Li, and Jiansheng Chen. Rethinking feature distribution for loss functions in image classification. In CVPR, 2018.
Wu et al. (2017) Chao-Yuan Wu, R. Manmatha, Alexander J. Smola, and Philipp Krähenbühl. Sampling matters in deep embedding learning. In ICCV, 2017.
Appendix A Ndigit MNIST Dataset
We present N-digit MNIST (https://github.com/google/ndigitmnist), a new dataset based upon MNIST (LeCun, 1998) that has a number of classes that grows exponentially in the number of digits $N$, for which embedding-style classification methods are well-suited. The dataset is created by horizontally concatenating $N$ MNIST digit images. While constructing new classes, we respect the training and test splits. For example, a test image from 2-digit MNIST of a “54” does not have its “5” or its “4” shared with any image from the training set (in any position).
| Number of Digits | Total Classes | Training Classes | Unseen Test Classes | Seen Test Classes | Training Images | Test Images |
|---|---|---|---|---|---|---|
| 2 | 100 | 70 | 30 | 70 | | |
| 3 | 1000 | 700 | 300 | 700 | | |
We employ 2- and 3-digit MNIST (Table 3) in our experiments. This dataset is meant to provide a test bed for easier and more efficient evaluation of embedding algorithms than with larger and more realistic datasets. N-digit MNIST is more suitable for embedding evaluation than other synthetic datasets due to the exponentially increasing number of classes as well as the factorizability aspect: each digit position plays a role analogous to, e.g., a face attribute in face datasets.
We inject uncertainty into the underlying tasks by randomly occluding (with black fill-in) regions of images at training and test time. Specifically, the corruption operation is applied independently to each digit of each number sample in the dataset. A random-sized square patch is placed at a random location in each digit image: the patch side length is sampled first, and the top-left corner coordinates of the patch are then sampled so that the occluded region is always a square of the sampled side length. During training, we set independent binary random flags for every digit, determining whether to perform the occlusion at all; the occlusion chance is set to 20%. For testing, we prepare twin datasets, clean and corrupt, with digit images that are either never occluded or always occluded, respectively.
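The occlusion procedure can be sketched as below for a single square digit image stored as a list of rows. The uniform sampling of the side length and corner position, the use of 0 as the black fill value, and the function names are assumptions made for this illustration; only the 20% training-time occlusion chance is taken from the text.

```python
import random


def occlude_digit(image, rng):
    """Black out a random square patch of a square digit image.

    Returns the occluded copy and the sampled patch side length.
    """
    size = len(image)
    # Assumed: side length uniform over 0..size (0 means no visible change).
    side = rng.randint(0, size)
    # The corner is chosen so that the whole patch stays inside the image.
    top = rng.randint(0, size - side)
    left = rng.randint(0, size - side)
    occluded = [row[:] for row in image]
    for r in range(top, top + side):
        for c in range(left, left + side):
            occluded[r][c] = 0
    return occluded, side


def maybe_occlude_digit(image, rng, chance=0.2):
    """Training-time corruption: occlude with probability `chance`."""
    if rng.random() < chance:
        return occlude_digit(image, rng)[0]
    return [row[:] for row in image]
```

For an N-digit image, the same routine would be applied independently to each 28 by 28 digit before the digits are concatenated horizontally.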
Appendix B Soft Contrastive Loss versus Contrastive Loss
As a building block for HIB, soft contrastive loss was proposed in Section 2.1. Soft contrastive loss has a conceptual advantage over the vanilla contrastive loss in that the margin hyperparameter $M$ does not have to be hand-tuned. Here we verify that soft contrastive loss outperforms the vanilla version over a range of $M$ values.
Figure 6 shows the verification (average precision) and identification (KNN accuracy) performance of embedding 2-digit MNIST samples. In both evaluations, soft contrastive loss performs at least as well as the vanilla contrastive loss at every margin value tested. This new formulation removes one hyperparameter from the learning process, while not sacrificing performance.


Appendix C Extra Results
Appendix D Organization of the Latent Embedding Space


As hedged instance embedding training progresses, it is advantageous for any subset of classes that may be confused with one another to be situated in the embedding space such that a given input image’s embedding can strategically place probability mass. We observe that this impacts the organization of the underlying space. For example, in the 2D embeddings shown in Figure 10, the centers of mass for each class are roughly axis-aligned, so that classes that share a most significant (tens) digit vary by x-coordinate, and classes that share a least significant (ones) digit vary by y-coordinate.
To further explore this idea, we embed 2-digit MNIST into a single dimension, to see how the classes get embedded along the number line. For hedged instance embedding, a single Gaussian embedding was chosen as the representation. We conjectured that because hedged instance embedding reduces objective loss by placing groups of confusing categories near one another, the resulting embedding space would be organized to encourage classes that share a tens or ones digit to be nearby. Figure 10 shows an example embedding learned by the two methods.
We assess the embedding space as follows. First, the centroids for each of the 100 classes are derived from the test set embeddings. After sorting the classes by centroid, a count is made of adjacent class pairs that share a ones or tens digit, with the maximum possible being 99. The hedged embeddings outscored the point embeddings on each of the four trials, with scores ranging from 76 to 80 versus scores of 42 to 74. Similarly, consider a run as a series of consecutive class pairs that share a ones or tens digit. The average run contains 4.6 classes for hedged embeddings, and only 3.0 for point embeddings, as illustrated in Figure 10.
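The adjacency score described above can be computed with a few lines of code. The sketch assumes 1D centroids supplied as a dictionary from the 2-digit class label (0 to 99) to its centroid position; the function names are hypothetical.

```python
def shares_digit(class_a, class_b):
    """True if two 2-digit classes share their tens digit or their ones digit."""
    return class_a // 10 == class_b // 10 or class_a % 10 == class_b % 10


def count_adjacent_shared_digit_pairs(centroids):
    """Count neighbouring classes along the number line that share a digit.

    `centroids` maps each class label to its 1D centroid position.
    """
    ordered = sorted(centroids, key=centroids.get)
    return sum(shares_digit(a, b) for a, b in zip(ordered, ordered[1:]))
```

As a sanity check, laying out the 100 classes in numeric order gives a score of 90, since only the nine decade boundaries (such as 9 next to 10) separate classes that share no digit.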
Appendix E Derivation of VIB Objective for Stochastic Embedding
Our goal is to train a discriminative model for match prediction on a pair of variables, $p(m \mid f(x_1), f(x_2))$, as opposed to predicting a class label, $p(y \mid f(x))$. Our VIB loss (Equation 8) follows straightforwardly from the original VIB, with two additional independence assumptions. In particular, we assume that the samples in the pair are independent, so $p(x_1, x_2) = p(x_1)\, p(x_2)$. We also assume the embeddings do not depend on the other input in the pair, $p(z_1, z_2 \mid x_1, x_2) = p(z_1 \mid x_1)\, p(z_2 \mid x_2)$. With these two assumptions, the VIB objective is given by the following:
$$I\left((z_1, z_2), m\right) - \beta I\left((z_1, z_2), (x_1, x_2)\right) \tag{10}$$
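We variationally bound the first term using the approximation $q(m \mid z_1, z_2)$ of $p(m \mid z_1, z_2)$ as follows:
\begin{align}
I\left((z_1, z_2), m\right) &= \int p(m, z_1, z_2) \log \frac{p(m, z_1, z_2)}{p(m)\, p(z_1, z_2)}\, dm\, dz_1\, dz_2 \tag{11} \\
&= \int p(m, z_1, z_2) \log \frac{p(m \mid z_1, z_2)}{p(m)}\, dm\, dz_1\, dz_2 \tag{12} \\
&= \int p(m, z_1, z_2) \log \frac{q(m \mid z_1, z_2)}{p(m)}\, dm\, dz_1\, dz_2 + \mathbb{E}_{p(z_1, z_2)}\left[\mathrm{KL}\left(p(m \mid z_1, z_2) \,\|\, q(m \mid z_1, z_2)\right)\right] \tag{13} \\
&\geq \int p(m, z_1, z_2) \log \frac{q(m \mid z_1, z_2)}{p(m)}\, dm\, dz_1\, dz_2 \tag{14} \\
&= \int p(m, z_1, z_2) \log q(m \mid z_1, z_2)\, dm\, dz_1\, dz_2 + H(m) \tag{15} \\
&\geq \int p(m, z_1, z_2) \log q(m \mid z_1, z_2)\, dm\, dz_1\, dz_2 \tag{16} \\
&= \int p(m \mid x_1, x_2)\, p(z_1 \mid x_1)\, p(z_2 \mid x_2)\, p(x_1)\, p(x_2) \log q(m \mid z_1, z_2)\, dm\, dz_1\, dz_2\, dx_1\, dx_2 \tag{17}
\end{align}
The inequalities follow from the non-negativity of KL-divergence $\mathrm{KL}(\cdot)$ and entropy $H(\cdot)$. The final equality follows from our assumptions above, together with the Markov assumption that the embeddings depend on $m$ only through the inputs.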
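The second term is variationally bounded using the approximation $r(z_1)\, r(z_2)$ of the marginal $p(z_1, z_2)$ as follows:
\begin{align}
I\left((z_1, z_2), (x_1, x_2)\right) &= \int p(z_1, z_2, x_1, x_2) \log \frac{p(z_1, z_2 \mid x_1, x_2)}{p(z_1, z_2)}\, dz_1\, dz_2\, dx_1\, dx_2 \tag{18} \\
&= \int p(z_1, z_2, x_1, x_2) \log \frac{p(z_1, z_2 \mid x_1, x_2)}{r(z_1)\, r(z_2)}\, dz_1\, dz_2\, dx_1\, dx_2 - \mathrm{KL}\left(p(z_1, z_2) \,\|\, r(z_1)\, r(z_2)\right) \tag{19} \\
&\leq \int p(z_1, z_2, x_1, x_2) \log \frac{p(z_1, z_2 \mid x_1, x_2)}{r(z_1)\, r(z_2)}\, dz_1\, dz_2\, dx_1\, dx_2 \tag{20} \\
&= \int p(z_1, z_2 \mid x_1, x_2)\, p(x_1, x_2) \log \frac{p(z_1, z_2 \mid x_1, x_2)}{r(z_1)\, r(z_2)}\, dz_1\, dz_2\, dx_1\, dx_2 \tag{21} \\
&= \int p(z_1 \mid x_1)\, p(x_1) \log \frac{p(z_1 \mid x_1)}{r(z_1)}\, dz_1\, dx_1 + \int p(z_2 \mid x_2)\, p(x_2) \log \frac{p(z_2 \mid x_2)}{r(z_2)}\, dz_2\, dx_2 \tag{22}
\end{align}
The inequality, again, follows from the non-negativity of KL-divergence, and the last equality follows from our additional independence assumptions.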