Deep metric learning models (Hoffer and Ailon, 2015; Snell et al., 2017) similarly learn representations of data using a neural network, but differ from ordinary deep networks in that they learn a representation of data that can directly be used to classify a novel dataset. Such representations have been interpreted as meta-learned knowledge which a nearest neighbor classifier can use to classify novel classes given only a few examples. The primary focus of previous work (Koch et al., 2015; Vinyals et al., 2016; Snell et al., 2017; Sung et al., 2018; Oreshkin et al., 2018) was on the metric inherited by comparing two datapoints using their learned representation.
Rather than focusing on the metric, we tackle the problem of optimizing the representation itself. In particular, we argue that a good representation of data should be as concise as possible while still predicting class labels well. We propose Discrete InfoMax COdes (DIMCO), a model that learns a discrete representation of data. We maximize the correlation between representation and label by directly maximizing their mutual information, which can be evaluated in closed form because the representation is discrete. This approach has the advantage that it requires neither a mapping from representations to labels nor batches that are split into train and test sets.
Our specific contributions are:
- We derive generalization bounds for meta-learning that show the respective roles of task size and number of tasks.
- We propose DIMCO, a model that learns concise discrete codes. DIMCO (1) generalizes better than previous models when trained with small datasets, and (2) is more memory- and time-efficient for image retrieval.
2 Supervised Representation Learning
We outline two tasks which can be seen as instances of the more general problem of supervised representation learning. We define supervised representation learning as the task of using class labels to learn useful representations of data. This problem differs from standard classification as it aims to learn a representation that generalizes to other datasets rather than directly predicting the labels themselves.
The few-shot classification task consists of episodes, each of which is a small dataset with a train/test split. In N-way K-shot classification, each episode has a train set with K labeled datapoints from each of N classes, and a test set of unlabeled instances from the same N classes. Within each episode, the model observes the train set to predict the labels of the test set images and is evaluated on its accuracy.
Image retrieval is the problem of taking a query image and retrieving the most similar images from a large database. Models for this task are evaluated by measuring the similarity between a query image and the retrieved images. An example of such a measure is Recall@K: the fraction of queries for which at least one relevant image appears among the K retrieved images, where the definition of "relevant" depends on the specific dataset. For class-labeled images, an image is relevant to a query image if the two belong to the same class.
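Recall@K over a labeled database can be computed as a simple sketch (function and variable names are ours, not from the paper):

```python
import numpy as np

def recall_at_k(query_emb, db_emb, query_labels, db_labels, k):
    """Fraction of queries whose k nearest database items
    contain at least one relevant (same-class) item."""
    # Pairwise squared Euclidean distances: (num_queries, num_db)
    d2 = ((query_emb[:, None, :] - db_emb[None, :, :]) ** 2).sum(-1)
    nearest = np.argsort(d2, axis=1)[:, :k]  # indices of the k nearest items
    hits = (db_labels[nearest] == query_labels[:, None]).any(axis=1)
    return hits.mean()
```

Any similarity measure can replace the Euclidean distance here; only the ranking of database items matters.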
Learning a continuous representation and comparing data in embedding space has been proposed as a solution to both few-shot classification (Vinyals et al., 2016; Snell et al., 2017) and image retrieval (Hoffer and Ailon, 2015; Sohn, 2016). We show in section 6 that the metrics for these two problems are strongly correlated, which motivates our consideration of the more general problem of supervised representation learning. In the next section, we propose an alternative information-theoretic objective for supervised representation learning.
3 An Information-Theoretic Perspective on Representation Learning
Throughout this section, we denote data, representations, and labels as x, z, and y, respectively. Capital symbols X, Z, Y denote the corresponding random variables.
The mutual information between two random variables Z and Y is defined as

I(Z; Y) = Σ_{z, y} p(z, y) log [ p(z, y) / (p(z) p(y)) ] = H(Z) − H(Z | Y).

I(Z; Y) is a symmetric quantity which measures the amount of information shared between Z and Y. It is zero when Z and Y are independent and increases with the degree of dependence between them. We refer the reader to Cover and Thomas (2012) for further exposition.
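As a concrete check of these properties, the mutual information of a discrete joint distribution can be computed directly from its probability table (a sketch; names are ours):

```python
import numpy as np

def mutual_information(p_zy):
    """I(Z; Y) in nats for a joint probability table p_zy[z, y]."""
    p_z = p_zy.sum(axis=1, keepdims=True)  # marginal p(z)
    p_y = p_zy.sum(axis=0, keepdims=True)  # marginal p(y)
    mask = p_zy > 0                        # 0 log 0 = 0 convention
    ratio = p_zy[mask] / (p_z @ p_y)[mask]
    return float((p_zy[mask] * np.log(ratio)).sum())
```

An independent joint (an outer product of marginals) gives exactly zero, while a perfectly correlated binary joint gives log 2.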
3.1 Problem Setup
We now describe our meta-learning problem setup. Define a task τ to be a distribution p_τ(x, y) over data and labels. Let tasks τ_1, …, τ_T be sampled i.i.d. from a distribution over tasks p(τ). Associated with task τ_t is a dataset D_t, which is a set of n i.i.d. samples from the data distribution p_{τ_t}(x, y).
Denote the model parameters by θ and write the representation as z = f_θ(x) to show its dependence on data and parameters. Our learning objective is the expectation, over tasks, of the negative mutual information between representation and labels:

L(θ) = E_{τ ∼ p(τ)} [ −I(Z; Y) ].
This differs from previous formulations of representation learning in the following ways:
- The objective is the negative mutual information within each batch.
- Each task is not split into train and test sets.
This objective is closely related to previous loss functions and to previous evaluation metrics for supervised representation learning. We show in appendix A that previous loss functions can be seen as approximations to this quantity, and experiments in section 6 show that the mutual information is strongly correlated with metrics such as few-shot accuracy and Recall@K.
3.2 Generalization Bound
We bound the true expected loss using the empirical loss; the precise statement (theorem 1) bounds the generalization gap by terms that scale with the representation's support size |Z| and decay with the number of examples per task n and the number of tasks T.
Proof. See appendix B. ∎
The generalization gap has three terms, two of which decrease as n increases, while the other decreases as T increases. Typically in few-shot learning, T is very large while n is small: a 5-way 1-shot miniImageNet episode contains only five labeled examples, while the number of possible tasks is combinatorially large. We therefore claim that the terms involving n are the main difficulty in generalizing to new tasks. We see from theorem 1 that using short representations (i.e., small |Z|) can compensate for having a small train set (i.e., small n).
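Schematically, with all constants and log(1/δ) factors omitted (this is a sketch of the bound's shape, consistent with the qualitative description above; the exact statement and constants are in appendix B):

```latex
\underbrace{\mathcal{L}(\theta)}_{\text{true loss}}
\;\lesssim\;
\underbrace{\hat{\mathcal{L}}(\theta)}_{\text{empirical loss}}
\;+\;
\underbrace{O\!\left(\frac{|\mathcal{Z}|}{\sqrt{n}}\right) + O\!\left(\frac{\log n}{\sqrt{n}}\right)}_{\text{per-task estimation (shrinks with } n\text{)}}
\;+\;
\underbrace{O\!\left(\sqrt{\frac{d_{VC}}{T}}\right)}_{\text{task-level estimation (shrinks with } T\text{)}}
```

The first two excess terms shrink with the task size n and grow with the code's support size |Z|; the last shrinks with the number of tasks T.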
4 Discrete Infomax Codes (DIMCO)
We now present our model, Discrete InfoMax COdes (DIMCO). Motivated by section 3, DIMCO produces a short discrete code and is trained by maximizing the mutual information I(Z; Y). Figure 1 graphically shows the overall structure of DIMCO.
4.1 Factorized Discrete Codes
We propose a factorized discrete representation scheme which lets us represent discrete distributions with exponentially fewer parameters than listing the probability of each event. We represent each event as the product of k independent events, each of which has d possibilities. We thus have d^k events in total, but only require kd parameters to represent the probability of each event. Binary codes can be viewed as the special case of this scheme where d = 2. This factorization trick allows us to consider representations with support size as large as d^k (section 6). This representation has the advantage of requiring only k log2 d bits per datapoint, whereas a D-dimensional continuous vector embedding requires 32D bits (assuming 32-bit floats).
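The memory comparison is simple arithmetic; a sketch with illustrative sizes (the particular values of k, d, and the embedding dimension below are placeholders, not the paper's settings):

```python
import math

def code_bits(k, d):
    """Bits to store one factorized code: k codewords, d values each."""
    return k * math.log2(d)

def float_embedding_bits(dim, bits_per_float=32):
    """Bits to store one continuous float embedding vector."""
    return dim * bits_per_float

# e.g. 64 codewords with 8 values each vs. a 64-dim float32 embedding:
# 64 * log2(8) = 192 bits vs. 64 * 32 = 2048 bits
```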
Recall that we represent a given image using k independent discrete distributions, each of which has d possibilities. First, a (convolutional) neural network f_θ takes image x as input and outputs a vector of length kd, which we reshape into a matrix of size k × d:
Each row of this matrix represents the logits of a discrete distribution. We apply the softmax function to each row to get probabilities.
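The reshape-and-softmax head described above can be sketched in numpy (the network body is abstracted away as an arbitrary logit vector; the function name is ours):

```python
import numpy as np

def code_probabilities(logits_flat, k, d):
    """Reshape a length-(k*d) logit vector into a (k, d) matrix and
    apply a row-wise softmax, giving k categorical distributions."""
    logits = logits_flat.reshape(k, d)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(logits)
    return exp / exp.sum(axis=1, keepdims=True)
```

Each of the k rows of the returned matrix sums to one and parameterizes one codeword's distribution.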
The i-th codeword z_i is sampled from the categorical distribution given by the i-th row of these probabilities:
The representation z for the image is the concatenation of the codewords z_1, …, z_k. z is a discrete random variable and p_θ(z | x) is its distribution. Instead of sampling z, we directly use the probabilities p_θ(z | x) to compute the objective:

I(Z; Y) = H(Z) − H(Z | Y).
The first term, H(Z), can be calculated by taking the average of the probability matrices over the batch and computing the entropy of each codeword:
The second term is H(Z | Y) = Σ_{y=1}^{C} p(y) H(Z | Y = y), where C is the number of classes. The marginal probability of a label, p(y), is its frequency in the batch. H(Z | Y = y) can be obtained by computing (11) using only the datapoints for which the label is y.
Though we have motivated the use of −I(Z; Y) as a loss function throughout this paper, the decomposition in (10) provides yet another perspective. Minimizing H(Z | Y) encourages discriminative behavior: it pushes the code distribution of each class to be as concentrated as possible. Maximizing H(Z) incentivizes the model to use all possible values of z overall. We emphasize that such closed-form computation of I(Z; Y) is only possible because we are using discrete codes.
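The closed-form loss can be sketched as follows from a batch of (k, d) probability matrices and integer labels (a sketch, not the paper's implementation; it treats the k codewords independently and sums their entropies, and all names are ours):

```python
import numpy as np

def entropy(p, axis=-1):
    """Entropy in nats, with the 0 log 0 = 0 convention."""
    safe = np.where(p > 0, p, 1.0)
    return -(np.where(p > 0, p * np.log(safe), 0.0)).sum(axis=axis)

def infomax_loss(probs, labels):
    """Negative mutual information -I(Z; Y), summed over codewords.

    probs:  (batch, k, d) row-stochastic probability matrices
    labels: (batch,) integer class labels
    """
    h_z = entropy(probs.mean(axis=0)).sum()  # sum_i H(Z_i), batch-averaged
    h_z_given_y = 0.0
    for y in np.unique(labels):
        mask = labels == y
        p_y = mask.mean()  # class frequency within the batch
        h_z_given_y += p_y * entropy(probs[mask].mean(axis=0)).sum()
    return -(h_z - h_z_given_y)  # -I(Z; Y)
```

Perfectly class-separated codes attain the minimum −H(Z), while codes that ignore the label give exactly zero.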
Fix a train image and a test image, and let z* be the most likely code for the train image. The similarity between the train image and the test image is measured by the probability of the test image producing z*. This amounts to computing the product (in practice, a sum of log probabilities for numerical stability) of the test image's probabilities of z*_i for each codeword i:
We use this as a similarity metric for both few-shot classification and image retrieval. We perform few-shot classification by computing the most likely code for each class via eq. (12) and assigning each test image to the class with the highest value of (13). We similarly perform image retrieval by mapping each support image to its most likely code (12) and, for each query image, retrieving the support image with the highest (13).
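The classification rule above can be sketched as follows (a sketch under our reading of (12)–(13): the per-class most likely code is taken as the per-codeword argmax of the mean class probabilities; all function names are ours):

```python
import numpy as np

def most_likely_code(probs):
    """Per-codeword argmax of a (k, d) probability matrix -> (k,) code."""
    return probs.argmax(axis=1)

def log_score(code, probs, eps=1e-12):
    """Sum of log-probabilities the (k, d) matrix assigns to `code`."""
    return float(np.log(probs[np.arange(len(code)), code] + eps).sum())

def few_shot_classify(support_probs, support_labels, query_probs):
    """Pick the class whose most likely code the query assigns the
    highest summed log-probability."""
    classes = np.unique(support_labels)
    codes = [most_likely_code(support_probs[support_labels == c].mean(axis=0))
             for c in classes]
    scores = [log_score(code, query_probs) for code in codes]
    return classes[int(np.argmax(scores))]
```

Retrieval works the same way, with one code per support image instead of one per class.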
5 Related Work
The concept of learning short descriptions of data that maximally correlate with the label is closely related to the information bottleneck (Tishby et al., 2000). This principle states that I(Z; Y) should be maximized while I(X; Z) is simultaneously minimized. DIMCO maximizes I(Z; Y) while keeping H(Z) low via its hyperparameters k and d. DIMCO is thus also related to the deterministic information bottleneck (Strouse and Schwab, 2017), which extends the information bottleneck by minimizing H(Z) instead of I(X; Z). Note that these quantities are related by the inequality I(X; Z) ≤ H(Z), which is tight when Z is an efficient code of X.
Information Theory and Unsupervised Representation Learning
Many works have applied information-theoretic principles to unsupervised representation learning. Bell and Sejnowski (1995) uses a mutual information objective to derive an algorithm for blind source separation. Slonim et al. (2005) derives a clustering algorithm based on the rate-distortion tradeoff. Chen et al. (2016) optimizes a lower bound of the mutual information to make a subset of its latent dimensions correlate with specific pre-specified features. Alemi et al. (2017) analyses the objective of VAEs from a rate-distortion theory perspective. Our work also uses information-theoretic principles for representation learning, but we apply these principles to a supervised meta-learning setting.
Discrete representations have been studied at least since the beginning of information theory (Shannon, 1948). Recent deep learning methods have proposed ways to directly learn discrete representations. Rolfe (2016) and van den Oord et al. (2017) learn variational autoencoders with discrete latent variables. Hu et al. (2017) learns discrete representations in an unsupervised manner by maximizing the mutual information between representation and data. In contrast, DIMCO assumes a supervised setting and performs infomax using labels instead of data.
Jeong and Song (2018) is close in spirit to our model: their method learns a quantizable continuous representation. Within each batch, their algorithm solves a minimum cost flow problem to find the locally optimal binary hash code. The training procedure of DIMCO is much simpler since it directly computes its loss function without requiring such inner-loop optimization. Additionally, the focus of Jeong and Song (2018) is on the speedup gained by using sparse binary hash codes, whereas our work focuses on learning an efficient (dense) discrete representation of data.
The idea of using factorized representations to increase representational power has appeared in other contexts. Jegou et al. (2011) factorizes a continuous input into a Cartesian product of quantized low-dimensional subspaces. Norouzi and Fleet (2013) uses factorized representations to compactly represent a combinatorially large set of cluster centers. Vaswani et al. (2017) uses as one of its core components multi-head attention, which factorizes its output into the Cartesian product of dot-product attention in several independent subspaces.
Our analysis provides a unifying view of embedding-based meta-learning (Vinyals et al., 2016; Snell et al., 2017; Oreshkin et al., 2018) and image retrieval (Hoffer and Ailon, 2015; Oh Song et al., 2016; Sohn, 2016; Movshovitz-Attias et al., 2017; Wu et al., 2017; Duan et al., 2018) from the perspective of supervised representation learning. We show in appendix A that the loss functions of these methods can be seen as approximations to the mutual information I(Z; Y). While all of these previous methods require a train/test (also called query/anchor) split within each task, DIMCO simply optimizes an information-theoretic quantity of each batch, removing the need for such structured batch construction.
Meta-Learning with Simple Inner-Loop Learners
Many works on gradient-based meta-learning have reported benefits from using few task-specific parameters. Lee and Choi (2018) learns a subset of the full network to alter during task-specific learning. Rusu et al. (2018) explicitly represents each task with a low-dimensional latent space. Zintgraf et al. (2018) alters only a pre-specified subset of the full network during task-specific learning. Our results further support this consensus that meta-learning models with simple task-specific learners generalize to new tasks more easily. We additionally made connections from this idea to information-theoretic principles and used this connection to derive generalization bounds for few-shot learning.
6 Experiments
We use two datasets with standard splits in our experiments. The miniImageNet dataset is a subset of the ImageNet (Krizhevsky et al., 2012) dataset made for few-shot classification. It consists of 100 classes, each containing 600 images of size 84 × 84. The classes are split into 64 training, 16 validation, and 20 test classes. The Caltech-UCSD Birds-200-2011 (CUB200) dataset (Wah et al., 2011) consists of 11,788 images of birds from 200 classes. The classes are split into 100 training and 100 test classes.
We use two different CNN backbones for our experiments: the 4-layer convnet commonly used for meta-learning (Finn et al., 2017; Sung et al., 2018; Liu et al., 2018), and the Inception network (Szegedy et al., 2015) with batch normalization (Ioffe and Szegedy, 2015), which is commonly used for deep image retrieval (Sohn, 2016; Movshovitz-Attias et al., 2017; Wu et al., 2017).
6.1 Correlation of Metrics
This experiment verifies whether mutual information is indeed a reasonable metric for the quality of a representation. Using the miniImageNet dataset, we trained many independent runs of DIMCO with varying hyperparameters. We used the test split to compute five metrics: K-way N-shot accuracy for several (K, N), Recall@1, and the mutual information I(Z; Y).
Due to space constraints, we show the pairwise correlations between these metrics in fig. 5 of the appendix. We see that all metrics are very strongly correlated. We point out that while I(Z; Y) correlates with previous metrics for fixed k and d, it is not suitable as a general evaluation metric since its scale depends on these hyperparameters: its maximum value grows with the code size (I(Z; Y) ≤ H(Z) ≤ k log d).
6.2 What does each code learn?
We inspected which features were encoded in a small DIMCO model (small k and d) after training on miniImageNet. Recall that each image produces a k × d probability matrix (eq. 6, 7). For each of these kd entries, we plotted the test-set images that assigned the highest probability to that entry. We show images corresponding to four such entries in fig. 2 and more in fig. 7 of the appendix.
The top left code in fig. 2 is representative of the bookshelf class. On the other hand, the bottom right code corresponds to animals with fur and assigns high probability to images of many different classes. We interpret this as DIMCO learning a distributed representation: by aggregating such complementary features in each of its k codewords, DIMCO is able to classify novel classes given only a few datapoints.
6.3 Small Train Set
This experiment shows how each model performs when learning from a small dataset. We trained each model using a reduced number of samples from each training class in the miniImageNet dataset: we reduce the full train split (64 classes, 600 images per class) to the same 64 classes with only the chosen number of images per class. We compare against three methods: prototypical networks (Snell et al., 2017), Triplet networks (Hoffer and Ailon, 2015), and multiclass N-pair loss (Sohn, 2016). After training with a subsampled dataset, we test using the full test split.
Few-shot accuracies and retrieval recall of each method are shown in fig. 3. First, note that DIMCO is the only method that can be trained with a dataset containing a single example per class. This is because the other methods require at least one train and one test example per class within each batch, while DIMCO requires no such train/test split and simply maximizes the mutual information within a batch. DIMCO learns much more effectively when the number of examples per class is low. We attribute this to our model's low inner-loop generalization gap (section 3.2). Because our model can learn effectively from small batches, it can learn from a small total number of training examples.
6.4 Fine-Grained Image Retrieval
We conducted a fine-grained image retrieval experiment on the CUB200 dataset. We compare DIMCO to the multiclass N-pair loss (Sohn, 2016), a state-of-the-art deep image retrieval method. For this experiment only, we use the Inception network specified at the beginning of this section. Using the same Inception encoder backbone, we trained DIMCO with various code sizes (k, d) and multiclass N-pair with various embedding dimensions. We measured the time per query for each method on a single Tesla P40 GPU by averaging the time required over many batches of queries.
Results in fig. 4 show that the compact code of DIMCO takes roughly an order of magnitude less memory than N-pair loss at similar performance, and also reduces retrieval query time. This experiment additionally demonstrates that discrete representations can match the performance of state-of-the-art methods on this relatively large-scale task, and that DIMCO can be trained with a large neural network backbone without significant overfitting. By contrast, experiments reported in Mishra et al. (2017) indicate that MAML (Finn et al., 2017) overfits severely when trained with a deeper backbone.
7 Conclusion
We introduced DIMCO, a model that learns a discrete representation of data by directly optimizing its mutual information with the label. To evaluate our initial intuition that shorter representations generalize better between tasks, we provided generalization bounds that become tighter as the representation gets shorter. We additionally performed meta-learning experiments showing that the concise representations learned by DIMCO generalize well even when learning from very small datasets.
Previous meta-learning models required batches with the specific structure of an evenly balanced train/test split. Because DIMCO can be trained using any batch of labeled data, we believe it is a step towards bridging the gap between the seemingly disparate problems of few-shot classification and traditional classification.
- Alemi et al.  Alexander A Alemi, Ian Fischer, Joshua V Dillon, and Kevin Murphy. Deep variational information bottleneck. arXiv preprint arXiv:1612.00410, 2016.
- Alemi et al.  Alexander A Alemi, Ben Poole, Ian Fischer, Joshua V Dillon, Rif A Saurous, and Kevin Murphy. Fixing a broken elbo. arXiv preprint arXiv:1711.00464, 2017.
- Amit and Meir  Ron Amit and Ron Meir. Meta-learning by adjusting priors based on extended pac-bayes theory. arXiv preprint arXiv:1711.01244, 2017.
- Bell and Sejnowski  Anthony J Bell and Terrence J Sejnowski. An information-maximization approach to blind separation and blind deconvolution. Neural computation, 7(6):1129–1159, 1995.
- Chen et al.  Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. Infogan: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in neural information processing systems, pages 2172–2180, 2016.
- Cover and Thomas  Thomas M Cover and Joy A Thomas. Elements of information theory. John Wiley & Sons, 2012.
- Duan et al.  Yueqi Duan, Wenzhao Zheng, Xudong Lin, Jiwen Lu, and Jie Zhou. Deep adversarial metric learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.
- Finn and Levine  Chelsea Finn and Sergey Levine. Meta-learning and universality: Deep representations and gradient descent can approximate any learning algorithm. arXiv preprint arXiv:1710.11622, 2017.
- Finn et al.  Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. arXiv preprint arXiv:1703.03400, 2017.
- Goyal et al.  Anirudh Goyal, Riashat Islam, Daniel Strouse, Zafarali Ahmed, Matthew Botvinick, Hugo Larochelle, Sergey Levine, and Yoshua Bengio. Infobot: Transfer and exploration via the information bottleneck. arXiv preprint arXiv:1901.10902, 2019.
- Hoffer and Ailon  Elad Hoffer and Nir Ailon. Deep metric learning using triplet network. In International Workshop on Similarity-Based Pattern Recognition, pages 84–92. Springer, 2015.
- Hu et al.  Weihua Hu, Takeru Miyato, Seiya Tokui, Eiichi Matsumoto, and Masashi Sugiyama. Learning discrete representations via information maximizing self-augmented training. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1558–1567. JMLR. org, 2017.
- Ioffe and Szegedy  Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
- Jegou et al.  Herve Jegou, Matthijs Douze, and Cordelia Schmid. Product quantization for nearest neighbor search. IEEE transactions on pattern analysis and machine intelligence, 33(1):117–128, 2011.
- Jeong and Song  Yeonwoo Jeong and Hyun Oh Song. Efficient end-to-end learning for quantizable representations. arXiv preprint arXiv:1805.05809, 2018.
- Kingma and Ba  Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- Koch et al.  Gregory Koch, Richard Zemel, and Ruslan Salakhutdinov. Siamese neural networks for one-shot image recognition. In ICML deep learning workshop, volume 2, 2015.
- Krizhevsky et al.  Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
- Lee and Choi  Yoonho Lee and Seungjin Choi. Gradient-based meta-learning with learned layerwise metric and subspace. 2018.
- Liu et al.  Yanbin Liu, Juho Lee, Minseop Park, Saehoon Kim, and Yi Yang. Transductive propagation network for few-shot learning. arXiv preprint arXiv:1805.10002, 2018.
- Mishra et al.  Nikhil Mishra, Mostafa Rohaninejad, Xi Chen, and Pieter Abbeel. A simple neural attentive meta-learner. arXiv preprint arXiv:1707.03141, 2017.
- Movshovitz-Attias et al.  Yair Movshovitz-Attias, Alexander Toshev, Thomas K Leung, Sergey Ioffe, and Saurabh Singh. No fuss distance metric learning using proxies. In Proceedings of the IEEE International Conference on Computer Vision, pages 360–368, 2017.
- Norouzi and Fleet  Mohammad Norouzi and David J Fleet. Cartesian k-means. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3017–3024, 2013.
- Oh Song et al.  Hyun Oh Song, Yu Xiang, Stefanie Jegelka, and Silvio Savarese. Deep metric learning via lifted structured feature embedding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4004–4012, 2016.
- Oreshkin et al.  Boris Oreshkin, Pau Rodríguez López, and Alexandre Lacoste. Tadam: Task dependent adaptive metric for improved few-shot learning. In Advances in Neural Information Processing Systems, pages 721–731, 2018.
- Ravi and Larochelle  Sachin Ravi and Hugo Larochelle. Optimization as a model for few-shot learning. 2016.
- Rolfe  Jason Tyler Rolfe. Discrete variational autoencoders. arXiv preprint arXiv:1609.02200, 2016.
- Rusu et al.  Andrei A Rusu, Dushyant Rao, Jakub Sygnowski, Oriol Vinyals, Razvan Pascanu, Simon Osindero, and Raia Hadsell. Meta-learning with latent embedding optimization. arXiv preprint arXiv:1807.05960, 2018.
- Shamir et al.  Ohad Shamir, Sivan Sabato, and Naftali Tishby. Learning and generalization with the information bottleneck. Theoretical Computer Science, 411(29-30):2696–2711, 2010.
- Shannon  Claude Elwood Shannon. A mathematical theory of communication. Bell system technical journal, 27(3):379–423, 1948.
- Shwartz-Ziv and Tishby  Ravid Shwartz-Ziv and Naftali Tishby. Opening the black box of deep neural networks via information. arXiv preprint arXiv:1703.00810, 2017.
- Slonim et al.  Noam Slonim, Gurinder Singh Atwal, Gašper Tkačik, and William Bialek. Information-based clustering. Proceedings of the National Academy of Sciences, 102(51):18297–18302, 2005.
- Snell et al.  Jake Snell, Kevin Swersky, and Richard Zemel. Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems, pages 4077–4087, 2017.
- Sohn  Kihyuk Sohn. Improved deep metric learning with multi-class n-pair loss objective. In Advances in Neural Information Processing Systems, pages 1857–1865, 2016.
- Strouse and Schwab  DJ Strouse and David J Schwab. The deterministic information bottleneck. Neural computation, 29(6):1611–1630, 2017.
- Sung et al.  Flood Sung, Yongxin Yang, Li Zhang, Tao Xiang, Philip HS Torr, and Timothy M Hospedales. Learning to compare: Relation network for few-shot learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1199–1208, 2018.
- Szegedy et al.  Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1–9, 2015.
- Tishby and Zaslavsky  Naftali Tishby and Noga Zaslavsky. Deep learning and the information bottleneck principle. In 2015 IEEE Information Theory Workshop (ITW), pages 1–5. IEEE, 2015.
- Tishby et al.  Naftali Tishby, Fernando C Pereira, and William Bialek. The information bottleneck method. arXiv preprint physics/0004057, 2000.
- van den Oord et al.  Aaron van den Oord, Oriol Vinyals, et al. Neural discrete representation learning. In Advances in Neural Information Processing Systems, pages 6306–6315, 2017.
- Vaswani et al.  Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in neural information processing systems, pages 5998–6008, 2017.
- Vinyals et al.  Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Daan Wierstra, et al. Matching networks for one shot learning. In Advances in neural information processing systems, pages 3630–3638, 2016.
- Wah et al.  Catherine Wah, Steve Branson, Peter Welinder, Pietro Perona, and Serge Belongie. The caltech-ucsd birds-200-2011 dataset. 2011.
- Wu et al.  Chao-Yuan Wu, R Manmatha, Alexander J Smola, and Philipp Krahenbuhl. Sampling matters in deep embedding learning. In Proceedings of the IEEE International Conference on Computer Vision, pages 2840–2848, 2017.
- Zintgraf et al.  Luisa M Zintgraf, Kyriacos Shiarlis, Vitaly Kurin, Katja Hofmann, and Shimon Whiteson. Caml: Fast context adaptation via meta-learning. arXiv preprint arXiv:1810.03642, 2018.
Appendix A Previous Loss functions Are Approximations to Mutual Information
Let q_φ(y | z) be a parameterized prediction of y given z, which tries to approximate the true conditional distribution p(y | z). Typically in a classification network, φ is the parameters of a learned projection matrix and q_φ is the final linear layer followed by a softmax. The expected cross-entropy loss can be written as

E_{z, y} [ −log q_φ(y | z) ] = H(Y | Z) + E_z [ KL( p(y | z) ‖ q_φ(y | z) ) ].    (14)

Assuming that the approximate distribution q_φ(y | z) is sufficiently close to p(y | z), minimizing (14) can be seen as

min H(Y | Z) = min [ H(Y) − I(Z; Y) ] = max I(Z; Y),

where the last equality uses the fact that H(Y) is independent of the model parameters. Therefore, cross-entropy minimization is approximate maximization of the mutual information between representation Z and labels Y.
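The identity behind this argument, expected cross-entropy = H(Y | Z) + E_z[KL(p ‖ q)], can be checked numerically on an arbitrary joint distribution (a sketch with made-up distributions; names are ours):

```python
import numpy as np

def ce_decomposition(p_zy, q_y_given_z):
    """Return (cross_entropy, H(Y|Z) + E_z KL(p(y|z) || q(y|z))), in nats.

    p_zy:        (Z, Y) joint probability table, strictly positive
    q_y_given_z: (Z, Y) row-stochastic approximate conditional
    """
    p_z = p_zy.sum(axis=1, keepdims=True)
    p_y_given_z = p_zy / p_z                       # true conditional p(y|z)
    ce = -(p_zy * np.log(q_y_given_z)).sum()       # E[-log q(y|z)]
    h_y_given_z = -(p_zy * np.log(p_y_given_z)).sum()
    kl = (p_zy * np.log(p_y_given_z / q_y_given_z)).sum()
    return ce, h_y_given_z + kl
```

The two returned quantities agree up to floating-point error for any valid pair of distributions.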
The approximation lies in parameterizing q_φ as a linear projection. This structure cannot generalize to new classes because the parameters φ are specific to the labels seen during training. For a model to generalize to unseen classes, one must amortize the learning of this approximate conditional distribution. Vinyals et al. (2016) and Snell et al. (2017) sidestepped this issue by using the embeddings of each class's examples to construct the conditional distribution.
The Triplet loss (Hoffer and Ailon, 2015) is defined as

L_triplet = ‖f_q − f_+‖² − ‖f_q − f_−‖²,

where f_q, f_+, f_− are the embedding vectors of query, positive, and negative images. Let y_q denote the label of the query datapoint. Recall that the density of a unit-variance Gaussian is p(x) = c · exp(−‖x − μ‖² / 2), where c is a constant. Let q_+ and q_− be unit Gaussian distributions centered at f_+ and f_−, respectively. We have

L_triplet = 2 [ log q_−(f_q) − log q_+(f_q) ] ≈ −2 [ log p(f_q | y_q) − log p(f_q) ],

a pointwise approximation to −2 I(Z; Y).
Two approximations were made in the process. We first assumed that the embedding distribution of images not in is equal to the distribution of all embeddings. This is reasonable when each class only represents a small fraction of the full data. We also approximated the embedding distributions with unit Gaussian distributions centered at single samples from each.
Multiclass N-pair loss (Sohn, 2016) was proposed as an alternative to the Triplet loss. This loss function requires one positive embedding f_+ and multiple negative embeddings f_1^−, …, f_{N−1}^−, and takes the form

L_npair = log ( 1 + Σ_i exp( f_q · f_i^− − f_q · f_+ ) ).

This can be seen as the cross-entropy loss applied to the softmax of the inner products of f_q with the positive and negative embeddings. Following the same logic as the cross-entropy loss, this is also an approximation to −I(Z; Y). This objective should have lower variance than the Triplet loss since it approximates the marginal distribution using more examples.
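The equivalence between the N-pair form and a softmax cross-entropy over inner products can be verified numerically (a sketch; names are ours):

```python
import numpy as np

def npair_loss(f_query, f_pos, f_negs):
    """log(1 + sum_i exp(q . n_i - q . p))."""
    pos = f_query @ f_pos
    negs = f_negs @ f_query
    return float(np.log1p(np.exp(negs - pos).sum()))

def softmax_ce_form(f_query, f_pos, f_negs):
    """Cross-entropy of the positive under a softmax over all inner products."""
    logits = np.concatenate([[f_query @ f_pos], f_negs @ f_query])
    logits = logits - logits.max()  # numerical stability
    return float(-np.log(np.exp(logits[0]) / np.exp(logits).sum()))
```

The two functions compute the same quantity, since dividing the softmax numerator and denominator by exp(q · p) recovers the N-pair form.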
Adversarial Metric Learning
Deep Adversarial Metric Learning (Duan et al., 2018) tackles the problem that most negative examples are uninformative by directly generating meaningful negative embeddings. This model employs a generator which takes as input the embeddings of anchor, positive, and negative images. The generator then outputs a "synthetic negative" embedding that is hard to distinguish from a positive embedding while remaining close to the negative embedding.
This can be seen as optimizing the same objective while estimating the marginal distribution of embeddings using a generative network rather than directly from samples. Rather than modelling the marginal itself, this method conditionally generates synthetic negatives that are hard to distinguish from positive embeddings while remaining sufficiently close to both the anchor and negative embeddings.
Appendix B Proof of Theorem 1
The following lemma was proved in Shamir et al. (2010); we restate it using our notation.
Lemma. Let Z be a random mapping of X, and let S be a sample of size n drawn from the joint probability distribution p(X, Y). Denote the empirical mutual information estimated from S between Z and Y as Î(Z; Y). For any δ ∈ (0, 1), with probability at least 1 − δ, the deviation |Î(Z; Y) − I(Z; Y)| is bounded by terms that scale with |Z| and log(1/δ) and decay as O(1/√n); see Shamir et al. (2010) for the exact constants.
We simplify this bound and plug in our specific quantities of interest (Z as the code, Y as the label). We similarly bound the error caused by estimating the expectation over p(τ) with a finite number of T tasks sampled from p(τ). Denote the finite-sample estimate of L(θ) as L̂(θ). Let the mapping from x to z be parameterized by θ and let this model class have VC dimension d_VC. Using standard uniform-convergence arguments, we can state that with high probability the task-level estimation error decays as O(√(d_VC / T)), where d_VC is the VC dimension of the hypothesis class {f_θ}.
Appendix C Experiments and Implementation Details
Every experiment was conducted on a single Nvidia V100 GPU with CUDA 9.2. We used PyTorch version 1.0.1. Each experiment was performed with different fixed initial seeds; we manually fix seeds with manual_seed() for Python, PyTorch, and NumPy.
For experiments with the 4-layer convnet, we use the Adam optimizer [Kingma and Ba, 2014] with learning rate 3e-4. For the Inception network, we use SGD with momentum and learning rate 3e-5.
Correlation of Metrics experiment
We report the average over many batches of few-shot accuracies and mutual information. I(Z; Y) was computed using class-balanced batches containing an equal number of images from each sampled class. We additionally show in fig. 6 the correlation between few-shot accuracies, Recall@1, and NMI for three previously proposed losses (triplet, N-pair, protonet).
Small Train Set Experiment
For this experiment, we used the Adam optimizer and performed a log-uniform hyperparameter sweep over the learning rate. For DIMCO, we additionally swept the code hyperparameters k and d. For the other methods, we set the embedding dimension to match. For each combination of loss and number of training examples per class, we ran the experiment several times and report the mean and standard deviation of the top runs.