This paper concerns unsupervised representation learning: using unlabeled data to learn a representation function $f$ such that replacing data point $x$ by feature vector $f(x)$ in new classification tasks reduces the requirement for labeled data. This is distinct from semi-supervised learning, where learning can leverage unlabeled as well as labeled data. (Section 7 surveys other prior ideas and models.)
For images, a proof of existence
for broadly useful representations is the output of the penultimate layer (the one before the softmax) of a powerful deep net trained on ImageNet. In natural language processing (NLP), low-dimensional representations of text – called text embeddings – have been computed with unlabeled data (Peters et al., 2018; Devlin et al., 2018). Often the embedding function is trained by using the embedding of a piece of text to predict the surrounding text (Kiros et al., 2015; Logeswaran & Lee, 2018; Pagliardini et al., 2018). Similar methods that leverage similarity in nearby frames in a video clip have had some success for images as well (Wang & Gupta, 2015).
Many of these algorithms are related: they assume access to pairs or tuples (in the form of co-occurrences) of text/images that are more semantically similar than randomly sampled text/images, and their objective forces representations to respect this similarity on average. For instance, in order to learn a representation function $f$ for sentences, a simplified version of what Logeswaran & Lee (2018) minimize is the following loss function
$$\mathbb{E}_{x,x^+,x^-}\left[-\log\frac{e^{f(x)^{T} f(x^{+})}}{e^{f(x)^{T} f(x^{+})}+e^{f(x)^{T} f(x^{-})}}\right]$$
where $(x, x^+)$ are a similar pair and $x^-$ is presumably dissimilar to $x$ (often chosen to be a random point) and typically referred to as a negative sample. Though reminiscent of past ideas – e.g. kernel learning, metric learning, co-training (Cortes et al., 2010; Bellet et al., 2013; Blum & Mitchell, 1998) – these algorithms lack a theoretical framework quantifying when and why they work. While it seems intuitive that minimizing such loss functions should lead to representations that capture 'similarity,' formally it is unclear why the learned representations should do well on downstream linear classification tasks – their somewhat mysterious success is often treated as an obvious consequence. To analyze this success, a framework must connect 'similarity' in unlabeled data with the semantic information that is implicitly present in downstream tasks.
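To make this objective concrete, here is a minimal NumPy sketch of the pairwise logistic contrastive loss above, using the identity $-\log\frac{e^a}{e^a+e^b} = \log(1+e^{b-a})$. The function name and the toy 2-d vectors are our own illustrative choices, not from the original works:

```python
import numpy as np

def contrastive_logistic_loss(f_x, f_pos, f_neg):
    """Logistic contrastive loss for one (anchor, positive, negative) triple,
    equal to log(1 + exp(f(x)^T f(x^-) - f(x)^T f(x^+)))."""
    return np.log1p(np.exp(f_x @ f_neg - f_x @ f_pos))

# Toy 2-d representations: the anchor is close to the positive, far from the negative.
f_x = np.array([1.0, 0.0])
f_pos = np.array([0.9, 0.1])
f_neg = np.array([-1.0, 0.0])
loss = contrastive_logistic_loss(f_x, f_pos, f_neg)
```

Swapping the roles of the positive and the negative sample increases the loss, which is the sense in which minimizing the objective forces representations to respect similarity.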
We propose the term Contrastive Learning for such methods and provide a new conceptual framework with minimal assumptions. (The alternative would be to make assumptions about generative models of data, which is difficult for images and text.) Our main contributions are the following:
We formalize the notion of semantic similarity by introducing latent classes: similar pairs are assumed to be drawn from the same latent class, and a downstream task comprises a subset of these latent classes.
Under this formalization, we prove that a representation function $\hat{f}$ learned from a function class $\mathcal{F}$ by contrastive learning has low average linear classification loss if $\mathcal{F}$ contains a function with low unsupervised loss. Additionally, we show a generalization bound for contrastive learning that depends on the Rademacher complexity of $\mathcal{F}$. After highlighting inherent limitations of negative sampling, we show sufficient properties of $\mathcal{F}$ which allow us to overcome these limitations.
Using insights from the above framework, we provide a novel extension of the algorithm that can leverage larger blocks of similar points than pairs, has better theoretical guarantees, and performs better in practice.
Ideally, one would like to show that contrastive learning always gives representations that compete
with those learned from the same function class with plentiful labeled data. Our formal framework allows a rigorous study of such questions: we show a simple counterexample that prevents such a blanket statement without further assumptions. However, if the representations are well-concentrated and the mean classifier (Definition 2.1) has good performance, we can show a weaker version of the ideal result (Corollary 5.1.1). Sections 2 and 3 give an overview of the framework and the results, and subsequent sections deal with the analysis. Related work is discussed in Section 7 and Section 8 describes experimental verification and support for our framework.
2 Framework for Contrastive Learning
We first set up notation and describe the framework for unlabeled data and classification tasks that will be essential for our analysis. Let $\mathcal{X}$ denote the set of all possible data points. Contrastive learning assumes access to similar data in the form of pairs $(x, x^+)$ that come from a distribution $\mathcal{D}_{sim}$, as well as $k$ i.i.d. negative samples $x_1^-, x_2^-, \dots, x_k^-$ from a distribution $\mathcal{D}_{neg}$ that are presumably unrelated to $x$. Learning is done over $\mathcal{F}$, a class of representation functions $f : \mathcal{X} \to \mathbb{R}^d$, such that $\|f(\cdot)\| \le R$ for some $R > 0$.
To formalize the notion of semantically similar pairs $(x, x^+)$, we introduce the concept of latent classes.
Let $\mathcal{C}$ denote the set of all latent classes. Associated with each class $c \in \mathcal{C}$ is a probability distribution $\mathcal{D}_c$ over $\mathcal{X}$.
Roughly, $\mathcal{D}_c(x)$ captures how relevant $x$ is to class $c$. For example, $\mathcal{X}$ could be natural images and $c$ the class "dog," whose associated $\mathcal{D}_c$ assigns high probability to images containing dogs and low/zero probabilities to other images. Classes can overlap arbitrarily. (An image of a dog by a tree can appear in both $\mathcal{D}_{dog}$ and $\mathcal{D}_{tree}$.) Finally, we assume a distribution $\rho$ over the classes that characterizes how these classes naturally occur in the unlabeled data. Note that we make no assumption about the functional form of $\mathcal{D}_c$ or $\rho$.
To formalize similarity, we assume similar data points $x, x^+$ are i.i.d. draws from the same class distribution $\mathcal{D}_c$ for some class $c$ picked randomly according to measure $\rho$. Negative samples are drawn from the marginal of $\mathcal{D}_{sim}$:
$$\mathcal{D}_{sim}(x, x^+) = \mathbb{E}_{c \sim \rho}\left[\mathcal{D}_c(x)\,\mathcal{D}_c(x^+)\right] \tag{1}$$
$$\mathcal{D}_{neg}(x^-) = \mathbb{E}_{c \sim \rho}\left[\mathcal{D}_c(x^-)\right] \tag{2}$$
Since classes are allowed to overlap and/or be fine-grained, this is a plausible formalization of "similarity." As the identity of the class is not revealed, we call it unlabeled data. Currently, empirical works heuristically identify such similar pairs from co-occurring image or text data.
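As a concrete (hypothetical) instantiation of this data model, the following sketch takes each latent class distribution $\mathcal{D}_c$ to be a small Gaussian around a per-class mean, with $\rho$ uniform; similar pairs share a class, and negatives are drawn from the marginal. The Gaussian form is our own illustrative choice — the framework itself assumes nothing about $\mathcal{D}_c$ or $\rho$:

```python
import numpy as np

rng = np.random.default_rng(0)
num_classes, dim = 5, 3
# Hypothetical latent classes: D_c is a Gaussian around a per-class mean,
# and rho (the distribution over classes) is uniform.
class_means = rng.normal(size=(num_classes, dim))
rho = np.full(num_classes, 1.0 / num_classes)

def sample_similar_pair():
    """(x, x+) ~ D_sim: draw a class c ~ rho, then two i.i.d. points from D_c."""
    c = rng.choice(num_classes, p=rho)
    return class_means[c] + 0.1 * rng.normal(size=(2, dim))

def sample_negative():
    """x- ~ D_neg: the marginal of D_sim, i.e. c ~ rho, then x- ~ D_c."""
    c = rng.choice(num_classes, p=rho)
    return class_means[c] + 0.1 * rng.normal(size=dim)

x, x_pos = sample_similar_pair()
x_neg = sample_negative()
```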
We now characterize the tasks that a representation function $f$ will be tested on. A $(k+1)$-way (we use $k$ as the number of negative samples later) supervised task $\mathcal{T}$ consists of distinct classes $\{c_1, \dots, c_{k+1}\} \subseteq \mathcal{C}$. The labeled dataset for the task $\mathcal{T}$ consists of i.i.d. draws from the following process:
A label $c \in \{c_1, \dots, c_{k+1}\}$ is picked according to a distribution $\mathcal{D}_{\mathcal{T}}$. Then, a sample $x$ is drawn from $\mathcal{D}_c$. Together they form a labeled pair $(x, c)$ with distribution
$$\mathcal{D}_{\mathcal{T}}(x, c) = \mathcal{D}_c(x)\,\mathcal{D}_{\mathcal{T}}(c)$$
A key subtlety in this formulation is that the classes in downstream tasks and their associated data distributions $\mathcal{D}_c$ are the same as in the unlabeled data. This provides a path to formalizing how capturing similarity in unlabeled data can lead to quantitative guarantees on downstream tasks. $\mathcal{D}_{\mathcal{T}}$ is assumed to be uniform for theorems in the main paper (we state and prove the general case in the Appendix).
Evaluation Metric for Representations
The quality of the representation function $f$ is evaluated by its performance on a multi-class classification task using linear classification. For this subsection, we fix a task $\mathcal{T} = \{c_1, \dots, c_{k+1}\}$. A multi-class classifier for $\mathcal{T}$ is a function $g : \mathcal{X} \to \mathbb{R}^{k+1}$ whose output coordinates are indexed by the classes in task $\mathcal{T}$.
The loss incurred by $g$ on point $(x, c)$ is defined as $\ell\left(\{g(x)_c - g(x)_{c'}\}_{c' \ne c}\right)$, which is a function of a $k$-dimensional vector of differences in the coordinates. The two losses we will consider in this work are the standard hinge loss $\ell(v) = \max\{0, 1 + \max_i\{-v_i\}\}$ and the logistic loss $\ell(v) = \log_2\left(1 + \sum_i \exp(-v_i)\right)$ for $v \in \mathbb{R}^k$. Then the supervised loss of the classifier $g$ is
$$L_{sup}(\mathcal{T}, g) := \mathbb{E}_{(x,c)\sim\mathcal{D}_{\mathcal{T}}}\left[\ell\left(\{g(x)_c - g(x)_{c'}\}_{c' \ne c}\right)\right]$$
To use a representation function $f$ with a linear classifier, a matrix $W \in \mathbb{R}^{(k+1) \times d}$ is trained and $g(x) = Wf(x)$ is used to evaluate classification loss on tasks. Since the best $W$ can be found by fixing $f$ and training a linear classifier, we abuse notation and define the supervised loss of $f$ on $\mathcal{T}$ to be the loss when the best $W$ is chosen for $f$:
$$L_{sup}(\mathcal{T}, f) = \inf_{W \in \mathbb{R}^{(k+1)\times d}} L_{sup}(\mathcal{T}, Wf)$$
Crucial to our results and experiments will be a specific $W$ whose rows are the means of the representations of each class, which we define below.
Definition 2.1 (Mean Classifier).
For a function $f$ and task $\mathcal{T} = (c_1, \dots, c_{k+1})$, the mean classifier is $W^{\mu}$ whose $c$-th row is the mean $\mu_c$ of representations of inputs with label $c$: $\mu_c := \mathbb{E}_{x \sim \mathcal{D}_c}\left[f(x)\right]$. We use $L^{\mu}_{sup}(\mathcal{T}, f) := L_{sup}(\mathcal{T}, W^{\mu}f)$ as shorthand for its loss.
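As an illustration, the mean classifier is simple to form from finite samples: average the representations within each class and classify by the largest coordinate of $W^{\mu}f(x)$. The sketch below uses our own toy representations, not the paper's experiments:

```python
import numpy as np

def mean_classifier(reps, labels, classes):
    """W^mu: the row for class c is the mean representation of points labeled c."""
    return np.stack([reps[labels == c].mean(axis=0) for c in classes])

def predict(W_mu, rep):
    """Classify by the largest coordinate of W^mu f(x)."""
    return int(np.argmax(W_mu @ rep))

# Toy representations for a binary task with classes {0, 1}.
reps = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
labels = np.array([0, 0, 1, 1])
W_mu = mean_classifier(reps, labels, [0, 1])
```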
Since contrastive learning has access to data with latent class distribution $\rho$, it is natural to have better guarantees for tasks involving classes that have higher probability in $\rho$.
Definition 2.2 (Average Supervised Loss).
Average loss for a function $f$ on $(k+1)$-way tasks is defined as
$$L_{sup}(f) := \mathbb{E}_{\{c_i\}_{i=1}^{k+1} \sim \rho^{k+1}}\left[L_{sup}\left(\{c_i\}_{i=1}^{k+1}, f\right) \;\middle|\; c_i \ne c_j \;\forall i \ne j\right]$$
The average supervised loss of its mean classifier is
$$L^{\mu}_{sup}(f) := \mathbb{E}_{\{c_i\}_{i=1}^{k+1} \sim \rho^{k+1}}\left[L^{\mu}_{sup}\left(\{c_i\}_{i=1}^{k+1}, f\right) \;\middle|\; c_i \ne c_j \;\forall i \ne j\right]$$
Contrastive Learning Algorithm
We describe the training objective for contrastive learning: the choice of loss function $\ell$ is dictated by the $\ell$ used in the supervised evaluation, and $k$ denotes the number of negative samples used for training. Let $(x, x^+) \sim \mathcal{D}_{sim}$ and $(x_1^-, \dots, x_k^-) \sim \mathcal{D}_{neg}^k$, as defined in Equations (1) and (2).
Definition 2.3 (Unsupervised Loss).
The population loss is
$$L_{un}(f) := \mathbb{E}\left[\ell\left(\left\{f(x)^T\left(f(x^+) - f(x_i^-)\right)\right\}_{i=1}^{k}\right)\right] \tag{5}$$
and its empirical counterpart with $M$ samples $(x_j, x_j^+, x_{j1}^-, \dots, x_{jk}^-)_{j=1}^{M}$ from $\mathcal{D}_{sim} \times \mathcal{D}_{neg}^k$ is
$$\widehat{L}_{un}(f) := \frac{1}{M} \sum_{j=1}^{M} \ell\left(\left\{f(x_j)^T\left(f(x_j^+) - f(x_{ji}^-)\right)\right\}_{i=1}^{k}\right) \tag{6}$$
Note that, by the assumptions of the framework described above, we can now express the unsupervised loss as
$$L_{un}(f) = \mathbb{E}_{c^+, c_i^- \sim \rho^{k+1}}\;\mathbb{E}_{\substack{x, x^+ \sim \mathcal{D}_{c^+}^2 \\ x_i^- \sim \mathcal{D}_{c_i^-}}}\left[\ell\left(\left\{f(x)^T\left(f(x^+) - f(x_i^-)\right)\right\}_{i=1}^{k}\right)\right]$$
The algorithm to learn a representation function from $\mathcal{F}$ is to find a function $\hat{f} \in \operatorname{argmin}_{f \in \mathcal{F}} \widehat{L}_{un}(f)$ that minimizes the empirical unsupervised loss. This function can be subsequently used for supervised linear classification tasks. In the following section we proceed to give an overview of our results that stem from this framework.
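For concreteness, here is a small sketch of the empirical objective with the hinge loss and $k$ negative samples per similar pair. The normalization used as a stand-in representation function is our own choice for illustration:

```python
import numpy as np

def hinge_loss(v):
    """Standard hinge loss on the k-vector of differences: max(0, 1 + max_i(-v_i))."""
    return max(0.0, 1.0 + np.max(-np.atleast_1d(v)))

def empirical_unsup_loss(f, samples):
    """hat{L}_un(f): average over M samples (x, x+, {x_i^-}) of
    l({f(x)^T (f(x+) - f(x_i^-))}_i)."""
    total = 0.0
    for x, x_pos, negs in samples:
        fx, fpos = f(x), f(x_pos)
        v = np.array([fx @ (fpos - f(xn)) for xn in negs])
        total += hinge_loss(v)
    return total / len(samples)

# A stand-in "representation function": normalization to the unit sphere.
f = lambda z: z / np.linalg.norm(z)
samples = [(np.array([1.0, 0.0]), np.array([1.0, 0.1]),
            [np.array([0.0, 1.0]), np.array([-1.0, 0.0])])]
loss = empirical_unsup_loss(f, samples)
```

Here the anchor agrees with its positive and disagrees with both negatives, so the hinge loss is close to zero.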
3 Overview of Analysis and Results
What can one provably say about the performance of $\hat{f}$? As a first step we show that $L_{un}$ is like a "surrogate" for $L_{sup}$ by showing that $L_{sup}(f) \le \alpha L_{un}(f)$ for all $f \in \mathcal{F}$, suggesting that minimizing $L_{un}$ makes sense. This lets us show a bound on the supervised performance of the representation $\hat{f}$ learned by the algorithm. For instance, when training with one negative sample, the performance on average binary classification has the following guarantee:
Theorem 4.1 (Informal binary version).
$$L_{sup}(\hat{f}) \le L^{\mu}_{sup}(\hat{f}) \le \alpha\,L_{un}(f) + \eta\,\mathrm{Gen}_M + \delta \qquad \forall f \in \mathcal{F}$$
where $\alpha, \eta, \delta$ are constants depending on the distribution $\rho$ and $\delta \to 0$ as $M \to \infty$. When $\rho$ is uniform and $|\mathcal{C}| \to \infty$, we have that $\alpha, \eta \to 1$ and $\delta \to 0$.
At first glance, this bound seems to offer a somewhat complete picture: when the number of classes is large, if the unsupervised loss can be made small by $\mathcal{F}$, then the supervised loss of $\hat{f}$, learned using finite samples, is small.
While encouraging, this result still leaves open the question: can $L_{un}$ indeed be made small on reasonable datasets using function classes of interest, even though the similar pair and negative sample can come from the same latent class? We shed light on this by upper-bounding $L_{un}$ by two components: (a) the loss $L^{\ne}_{un}$ for the case where the positive and negative samples are from different classes; (b) a notion of deviation $s(f)$ within each class.
Theorem 4.5 (Informal binary version).
$$L_{sup}(\hat{f}) \le L^{\mu}_{sup}(\hat{f}) \le \alpha\,L^{\ne}_{un}(f) + \beta\,s(f) + \eta\,\mathrm{Gen}_M + \delta \qquad \forall f \in \mathcal{F}$$
for constants $\alpha, \beta, \eta, \delta$ that depend on the distribution $\rho$. Again, when $\rho$ is uniform and $|\mathcal{C}| \to \infty$ we have $\alpha, \eta \to 1$ and $\beta, \delta \to 0$.
This bound lets us infer the following: if the class $\mathcal{F}$ is rich enough to contain a function $f$ for which $L^{\ne}_{un}(f) + \beta\,s(f)$ is low, then $\hat{f}$ has high supervised performance. Both $L^{\ne}_{un}(f)$ and $s(f)$ can potentially be made small for rich enough $\mathcal{F}$.
Ideally, however, one would like to show that $\hat{f}$ can compete on classification tasks with every $f \in \mathcal{F}$:
$$L_{sup}(\hat{f}) \le \alpha\,L_{sup}(f) + \eta\,\mathrm{Gen}_M \qquad \forall f \in \mathcal{F} \tag{7}$$
Unfortunately, we show in Section 5.1 that the algorithm can pick something far from the optimal $f$. However, we extend Theorem 4.5 to a bound similar to (7) (where the classification is done using the mean classifier) under assumptions about the intraclass concentration of $f$ and about its mean classifier having high margin.
Sections 6.1 and 6.2 extend our results to the more complicated setting where the algorithm uses $k$ negative samples (5) and note an interesting behavior: increasing the number of negative samples beyond a threshold can hurt the performance. In Section 6.3 we show a novel extension of the algorithm that utilizes larger blocks of similar points. Finally, we perform controlled experiments in Section 8 to validate components of our framework and corroborate our suspicion that the mean classifier of representations learned using labeled data has good classification performance.
4 Guaranteed Average Binary Classification
To provide the main insights, we prove the algorithm's guarantee when we use only 1 negative sample ($k = 1$). For this section, let $L_{sup}(f)$ and $L^{\mu}_{sup}(f)$ be as in Definition 2.2 for binary tasks. We will refer to the two classes in the supervised task as well as in the unsupervised loss as $c^+, c^-$. Let $S = \{(x_j, x_j^+, x_j^-)\}_{j=1}^{M}$ be our training set sampled from the distribution $\mathcal{D}_{sim} \times \mathcal{D}_{neg}$ and $\hat{f} \in \operatorname{argmin}_{f \in \mathcal{F}} \widehat{L}_{un}(f)$.
4.1 Upper Bound using Unsupervised Loss
Let $f|_S = \left(f_t(x_j), f_t(x_j^+), f_t(x_j^-)\right)_{j \in [M],\, t \in [d]}$ be the restriction of $f$ on $S$, for any $f \in \mathcal{F}$. Then, the statistical complexity measure relevant to the estimation of the representations is the following Rademacher average
$$\mathcal{R}_S(\mathcal{F}) := \mathbb{E}_{\sigma \sim \{\pm 1\}^{3dM}}\left[\sup_{f \in \mathcal{F}} \langle \sigma, f|_S \rangle\right]$$
Let $\tau := \mathbb{E}_{c, c' \sim \rho^2}\,\mathbb{1}\{c = c'\}$ be the probability that two classes sampled independently from $\rho$ are the same.
Theorem 4.1.
With probability at least $1 - \delta$, for all $f \in \mathcal{F}$,
$$L^{\mu}_{sup}(\hat{f}) \le \frac{1}{1-\tau}\left(L_{un}(f) - \tau\right) + \frac{1}{1-\tau}\,\mathrm{Gen}_M$$
where
$$\mathrm{Gen}_M = O\left(R\,\frac{\mathcal{R}_S(\mathcal{F})}{M} + R^2 \sqrt{\frac{\log \frac{1}{\delta}}{M}}\right)$$
The complexity measure $\mathcal{R}_S(\mathcal{F})$ is tightly related to the labeled sample complexity of the classification tasks. For the function class $\mathcal{G} = \{w^T f(\cdot) \mid f \in \mathcal{F}\}$ that one would use to solve a binary task from scratch using labeled data, it can be shown that $\mathcal{R}_S(\mathcal{G}) \le \mathcal{R}_S(\mathcal{F})$, where $\mathcal{R}_S(\mathcal{G})$ is the usual Rademacher complexity of $\mathcal{G}$ on $S$ (Definition 3.1 from Mohri et al. (2018)).
We state two key lemmas needed to prove the theorem.
Lemma 4.2.
With probability at least $1 - \delta$ over the training set $S$, for all $f \in \mathcal{F}$,
$$L_{un}(\hat{f}) \le L_{un}(f) + \mathrm{Gen}_M$$
Lemma 4.3.
For all $f \in \mathcal{F}$,
$$L^{\mu}_{sup}(f) \le \frac{1}{1-\tau}\left(L_{un}(f) - \tau\right)$$
Proof of Lemma 4.3. The key idea in the proof is the use of Jensen's inequality. Unlike the unsupervised loss, which uses a random point from a class as a classifier, using the mean of the class as the classifier should only make the loss lower. Let $\mu_c := \mathbb{E}_{x \sim \mathcal{D}_c}[f(x)]$ be the mean of the class $c$.
$$L_{un}(f) \overset{(a)}{=} \mathbb{E}_{c^+, c^- \sim \rho^2}\;\mathbb{E}_{\substack{x, x^+ \sim \mathcal{D}_{c^+}^2 \\ x^- \sim \mathcal{D}_{c^-}}}\left[\ell\left(f(x)^T\left(f(x^+) - f(x^-)\right)\right)\right]$$
$$\overset{(b)}{\ge} \mathbb{E}_{c^+, c^- \sim \rho^2}\;\mathbb{E}_{x \sim \mathcal{D}_{c^+}}\left[\ell\left(f(x)^T\left(\mu_{c^+} - \mu_{c^-}\right)\right)\right] \overset{(c)}{=} (1-\tau)\,L^{\mu}_{sup}(f) + \tau$$
where (a) follows from the definitions in (1) and (2), (b) follows from the convexity of $\ell$ and Jensen's inequality by taking the expectation over $x^+, x^-$ inside the function, and (c) follows by splitting the expectation into the cases $c^+ = c^-$ and $c^+ \ne c^-$ (when $c^+ = c^-$ the loss is $\ell(0) = 1$ for both losses we consider), from symmetry in $c^+$ and $c^-$
in sampling, and since classes in tasks are uniformly distributed (general distributions are handled in Appendix B.1). Rearranging terms completes the proof. ∎
Proof of Theorem 4.1. Combining Lemmas 4.2 and 4.3, with probability at least $1-\delta$, for all $f \in \mathcal{F}$,
$$L^{\mu}_{sup}(\hat{f}) \le \frac{1}{1-\tau}\left(L_{un}(\hat{f}) - \tau\right) \le \frac{1}{1-\tau}\left(L_{un}(f) + \mathrm{Gen}_M - \tau\right)$$
∎
One could argue that if $\mathcal{F}$ is rich enough such that $L_{un}$ can be made small, then Theorem 4.1 suffices. However, in the next section we explain why this may not always be possible, and we show one way to alleviate this.
4.2 Price of Negative Sampling: Class Collision
Note first that the unsupervised loss can be decomposed as
$$L_{un}(f) = (1-\tau)\,L^{\ne}_{un}(f) + \tau\,L^{=}_{un}(f)$$
where $L^{\ne}_{un}(f)$ is the loss suffered when the similar pair and the negative sample come from different classes,
$$L^{\ne}_{un}(f) := \mathbb{E}_{c^+, c^- \sim \rho^2}\left[\mathbb{E}_{\substack{x, x^+ \sim \mathcal{D}_{c^+}^2 \\ x^- \sim \mathcal{D}_{c^-}}}\ell\left(f(x)^T\left(f(x^+) - f(x^-)\right)\right) \;\middle|\; c^+ \ne c^-\right]$$
and $L^{=}_{un}(f)$ is the loss when they come from the same class. Let $\nu$ be a distribution over $\mathcal{C}$ with $\nu(c) \propto \rho^2(c)$; then
$$L^{=}_{un}(f) = \mathbb{E}_{c \sim \nu}\;\mathbb{E}_{x, x^+, x^- \sim \mathcal{D}_c^3}\left[\ell\left(f(x)^T\left(f(x^+) - f(x^-)\right)\right)\right] \ge 1$$
by Jensen's inequality again, which implies $L_{un}(f) \ge \tau$. In general, without any further assumptions on $f$, $L^{=}_{un}(f)$ can be far from 1, rendering the bound in Theorem 4.1 useless. However, as we will show, the magnitude of $L^{=}_{un}(f) - 1$ can be controlled by the intraclass deviation of $f$. Let $\Sigma(f, c)$ be the covariance matrix of $f(x)$ when $x \sim \mathcal{D}_c$. We define a notion of intraclass deviation as follows:
$$s(f) := \mathbb{E}_{c \sim \nu}\left[\sqrt{\|\Sigma(f, c)\|_2}\;\mathbb{E}_{x \sim \mathcal{D}_c}\|f(x)\|\right]$$
Lemma 4.4.
For all $f \in \mathcal{F}$,
$$L^{=}_{un}(f) - 1 \le c'\,s(f)$$
where $c'$ is a positive constant.
Theorem 4.5.
With probability at least $1 - \delta$, for all $f \in \mathcal{F}$,
$$L^{\mu}_{sup}(\hat{f}) \le L^{\ne}_{un}(f) + \beta\,s(f) + \eta\,\mathrm{Gen}_M$$
where $\beta = c'\frac{\tau}{1-\tau}$, $\eta = \frac{1}{1-\tau}$, and $c'$ is a constant.
The above bound highlights two sufficient properties of the function class for unsupervised learning to work: when the function class $\mathcal{F}$ is rich enough to contain some $f$ with low $L^{\ne}_{un}(f)$ as well as low $s(f)$, then $\hat{f}$, the empirical minimizer of the unsupervised loss – learned using a sufficiently large number of samples – will have good performance on supervised tasks (low $L^{\mu}_{sup}(\hat{f})$).
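The intraclass deviation $s(f)$ above is straightforward to estimate from samples. The sketch below (with our own toy Gaussian representations, and $\nu$ taken uniform) contrasts a concentrated representation with a spread-out one:

```python
import numpy as np

def intraclass_deviation(reps_by_class, nu):
    """Empirical analogue of s(f) = E_{c~nu}[ sqrt(||Sigma(f,c)||_2) * E_{x~D_c} ||f(x)|| ],
    with Sigma(f,c) the covariance of the representations within class c."""
    total = 0.0
    for weight, reps in zip(nu, reps_by_class):
        spectral_norm = np.linalg.norm(np.cov(reps, rowvar=False), ord=2)
        mean_norm = np.mean(np.linalg.norm(reps, axis=1))
        total += weight * np.sqrt(spectral_norm) * mean_norm
    return total

rng = np.random.default_rng(1)
means = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
# Concentrated vs. spread-out representations for the same two classes.
tight = [m + 0.01 * rng.normal(size=(100, 2)) for m in means]
loose = [m + 1.00 * rng.normal(size=(100, 2)) for m in means]
nu = [0.5, 0.5]
```

The concentrated representation yields a much smaller deviation, which is exactly the regime where the class-collision term in the bound is under control.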
5 Towards Competitive Guarantees
We provide intuition and counter-examples for why contrastive learning does not always pick the best supervised representation and show how our bound captures these. Under additional assumptions, we show a competitive bound where classification is done using the mean classifier.
5.1 Limitations of contrastive learning
The bound provided in Theorem 4.5 might not appear to be the most natural guarantee for the algorithm. Ideally one would like to show a bound like the following: for all $f \in \mathcal{F}$,
$$L_{sup}(\hat{f}) \le \alpha\,L_{sup}(f) + \eta\,\mathrm{Gen}_M$$
for constants $\alpha, \eta$ and generalization error $\mathrm{Gen}_M$. This guarantees that $\hat{f}$ is competitive against the best $f$ on the average binary classification task. However, the bound we prove has the following form: for all $f \in \mathcal{F}$,
$$L^{\mu}_{sup}(\hat{f}) \le \alpha\,L^{\ne}_{un}(f) + \beta\,s(f) + \eta\,\mathrm{Gen}_M$$
To show that this discrepancy is not an artifact of our analysis but rather stems from limitations of the algorithm, we present two examples in Figure 1. Our bound appropriately captures these two issues individually, owing to the large values of $L^{\ne}_{un}(f)$ or $s(f)$ in each case, for the optimal $f$.
In Figure 1a, we see that there is a direction on which the data can be projected to perfectly separate the classes. Since the algorithm takes inner products between the representations, it inevitably considers the spurious components along the orthogonal directions. This issue manifests in our bound as the term $L^{\ne}_{un}(f)$ being high even when $L_{sup}(f) = 0$. Hence, contrastive learning will not always work when the only guarantee we have is that $\mathcal{F}$ can make $L_{sup}$ small.
This should not be too surprising, since we show a relatively strong guarantee – a bound on $L^{\mu}_{sup}$ for the mean classifier of $\hat{f}$. This suggests a natural stronger assumption that $\mathcal{F}$ can make $L^{\mu}_{sup}$ small (which is observed experimentally in Section 8 for function classes of interest) and raises the question of showing a bound that looks like the following: for all $f \in \mathcal{F}$,
$$L^{\mu}_{sup}(\hat{f}) \le \alpha\,L^{\mu}_{sup}(f) + \eta\,\mathrm{Gen}_M \tag{11}$$
without accounting for any intraclass deviation – recall that $s(f)$ captures a notion of this deviation in our bound. However this is not true: high intraclass deviation may not imply high $L^{\mu}_{sup}(f)$, but it can make $L^{=}_{un}(f)$ (and thus $L_{un}(f)$) high, resulting in the failure of the algorithm. Consequently, the term $s(f)$ also increases while $L^{\mu}_{sup}(f)$ does not necessarily have to. This issue, apparent in Figure 1b, shows that a guarantee like (11) cannot be shown without further assumptions.
5.2 Competitive Bound via Intraclass Concentration
We saw that $L^{\mu}_{sup}(f)$ being small does not imply low $L^{\mu}_{sup}(\hat{f})$ if $f$ is not concentrated within the classes. In this section we show that when there is an $f$ that has intraclass concentration in a strong sense (sub-Gaussianity) and can separate classes with high margin (on average) with the mean classifier, then $L^{\mu}_{sup}(\hat{f})$ will be low.
Let $\ell_{\gamma}(v) = \max\left\{0,\, 1 + \frac{\max_i\{-v_i\}}{\gamma}\right\}$ be the hinge loss with margin $\gamma$ and $L^{\mu}_{\gamma,sup}(f)$ be $L^{\mu}_{sup}(f)$ with $\ell_{\gamma}$ as the loss function.
For $f \in \mathcal{F}$, if the random variable $f(X)$, where $X \sim \mathcal{D}_c$, is $\sigma^2$-sub-Gaussian in every direction for every class $c$ and has maximum norm $R = \max_{x \in \mathcal{X}} \|f(x)\|$, then for all $\epsilon > 0$,
$$L_{un}(f) \le \gamma\,L^{\mu}_{\gamma,sup}(f) + \epsilon$$
where $\gamma = 1 + c' R \sigma \sqrt{\log \frac{R}{\epsilon}}$ and $c'$ is some constant.
For all $\epsilon > 0$, with probability at least $1 - \delta$, for all $f \in \mathcal{F}$,
$$L^{\mu}_{sup}(\hat{f}) \le \gamma(f)\,L^{\mu}_{\gamma(f),sup}(f) + \beta\,s(f) + \eta\,\mathrm{Gen}_M + \epsilon$$
where $\gamma(f)$ is as defined in Lemma 5.1, $\beta = c'\frac{\tau}{1-\tau}$, $\eta = \frac{1}{1-\tau}$, and $c'$ is a constant.
6 Multiple Negative Samples and Block Similarity
In this section we explore two extensions to our analysis. First, in Section 6.1, inspired by empirical works like Logeswaran & Lee (2018) that often use more than one negative sample for every similar pair, we show provable guarantees for this case by careful handling of class collision. Additionally, in Section 6.2 we show simple examples where increasing negative samples beyond a certain threshold can hurt contrastive learning. Second, in Section 6.3, we explore a modified algorithm that leverages access to blocks of similar data, rather than just pairs and show that it has stronger guarantees as well as performs better in practice.
6.1 Guarantees for k Negative Samples
(Informal version) For all $f \in \mathcal{F}$,
$$L^{\mu}_{sup}(\hat{f}) \le \alpha\,L^{\ne}_{un}(f) + \beta\,s(f) + \eta\,\mathrm{Gen}_M$$
where $\alpha$, $\beta$, and $L^{\ne}_{un}$ are extensions of the corresponding terms from Section 4 and $\mathrm{Gen}_M$ remains unchanged. The formal statement of the theorem and its proof appear in Appendix B.1. The key differences from Theorem 4.5 are the coefficient $\beta$ and the distribution of tasks in $L^{\mu}_{sup}$ that we describe below. The coefficient of $s(f)$ increases with $k$, e.g. when $\rho$ is uniform and $k \ll |\mathcal{C}|$, it grows roughly linearly in $k$.
The average supervised loss that we bound is
$$L^{\mu}_{sup}(\hat{f}, \mathcal{D}) := \mathbb{E}_{\mathcal{T} \sim \mathcal{D}}\left[L^{\mu}_{sup}(\mathcal{T}, \hat{f})\right]$$
where $\mathcal{D}$ is a distribution over tasks, defined as follows: sample $k+1$ classes $c^+, c_1^-, \dots, c_k^- \sim \rho^{k+1}$, conditioned on the event that $c^+$ does not also appear as a negative sample. Then, set $\mathcal{T}$ to be the set of distinct classes in $\{c^+, c_1^-, \dots, c_k^-\}$. The label distribution $\mathcal{D}_{\mathcal{T}}$ is defined using $\rho$.
Bounding this loss directly gives a bound for the average $(k+1)$-wise classification loss $L^{\mu}_{sup}(\hat{f})$ from Definition 2.2, since $L^{\mu}_{sup}(\hat{f}) \le \frac{L^{\mu}_{sup}(\hat{f}, \mathcal{D})}{p}$, where $p$ is the probability that the $k+1$ sampled classes are distinct. For $k \ll |\mathcal{C}|$ and $\rho \approx$ uniform, these metrics are almost equal.
6.2 Effect of Excessive Negative Sampling
The standard belief is that increasing the number of negative samples always helps, at the cost of increased computational cost. In fact, for Noise Contrastive Estimation (NCE) (Gutmann & Hyvärinen, 2010), which is invoked to explain the success of negative sampling, increasing the number of negative samples has been shown to provably improve the asymptotic variance of the learned parameters. However, we find that such a phenomenon does not always hold for contrastive learning – larger $k$ can hurt performance for the same inherent reasons highlighted in Section 5.1, as we illustrate next.
When $\rho$ is close to uniform and the number of negative samples is $k = \Omega(|\mathcal{C}|)$, frequent class collisions can prevent the unsupervised algorithm from learning the representation that is optimal for the supervised problem. In this case, owing to the contribution of $s(f)$ being high, a large number of negative samples could hurt. This problem, in fact, can arise even when the number of negative samples is much smaller than the number of classes. For instance, if the best representation function groups classes into $t$ "clusters" (this can happen when $\mathcal{F}$ is not rich enough) such that it cannot contrast well between classes from the same cluster, then $L^{\ne}_{un}$ will contribute to the unsupervised loss being high even when $k = \Omega(t)$. We illustrate, by examples, how these issues can lead to picking a suboptimal $\hat{f}$ in Appendix C. Experimental results in Figures 2a and 2b also suggest that a larger number of negative samples hurts performance beyond a threshold, confirming our suspicions.
6.3 Blocks of Similar Points
Often a dataset consists of blocks of similar data instead of just pairs: a block consists of $x, x_1^+, \dots, x_b^+$ that are i.i.d. draws from a class distribution $\mathcal{D}_c$ for a class $c \sim \rho$. In text, for instance, paragraphs can be thought of as blocks of sentences sampled from the same latent class. How can an algorithm leverage this additional structure?
We propose an algorithm that uses two blocks: one of positive samples $x_1^+, \dots, x_b^+$ that are i.i.d. samples from $\mathcal{D}_c$, and another one of negative samples $x_1^-, \dots, x_b^-$ that are i.i.d. samples from $\mathcal{D}_{neg}$. Our proposed algorithm then minimizes the following loss:
$$L^{block}_{un}(f) := \mathbb{E}\left[\ell\left(f(x)^T\left(\frac{\sum_{i} f(x_i^+)}{b} - \frac{\sum_{i} f(x_i^-)}{b}\right)\right)\right] \tag{12}$$
To understand why this loss function makes sense, recall that the connection between $L^{\mu}_{sup}$ and $L_{un}$ was made in Lemma 4.3 by applying Jensen's inequality. Thus, the algorithm that uses the average of the positive and negative samples in blocks as a proxy for the classifier, instead of just one point each, should have a strictly better bound owing to Jensen's inequality getting tighter. We formalize this intuition below. Let $\tau$ be as defined in Section 4.
Proposition 6.2. For all $f \in \mathcal{F}$,
$$L^{\mu}_{sup}(f) \le \frac{L^{block}_{un}(f) - \tau}{1-\tau} \le \frac{L_{un}(f) - \tau}{1-\tau}$$
This bound tells us that $L^{block}_{un}$ is a better surrogate for $L^{\mu}_{sup}$, making it a more attractive choice than $L_{un}$ when larger blocks are available. (A rigorous comparison of the generalization errors is left for future work.) The algorithm can be extended, analogously to Equation (5), to handle more than one negative block. Experimentally we find that minimizing $L^{block}_{un}$ instead of $L_{un}$ can lead to better performance, and our results are summarized in Section 8.2. We defer the proof of Proposition 6.2 to Appendix A.4.
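The tightening effect of Jensen's inequality can be seen directly in code: for a fixed set of samples, convexity of $\ell$ implies the block loss never exceeds the average of the pairwise losses over all (positive, negative) combinations. A small sketch with our own toy vectors and the identity as a stand-in representation:

```python
import numpy as np

def logistic_loss(v):
    """Logistic loss log2(1 + sum_i exp(-v_i))."""
    return np.log2(1.0 + np.sum(np.exp(-np.atleast_1d(v))))

def block_loss(f, x, pos_block, neg_block):
    """L^block_un for one sample: block means of the representations stand in
    for a single positive and a single negative point."""
    pos_mean = np.mean([f(p) for p in pos_block], axis=0)
    neg_mean = np.mean([f(n) for n in neg_block], axis=0)
    return logistic_loss(f(x) @ (pos_mean - neg_mean))

f = lambda z: z                          # identity representation for illustration
x = np.array([1.0, 0.0])
pos_block = [np.array([0.9, 0.1]), np.array([1.1, -0.1])]
neg_block = [np.array([0.0, 1.0]), np.array([-0.5, 0.5])]
# Average pairwise loss over all (positive, negative) combinations.
pairwise = np.mean([logistic_loss(f(x) @ (f(p) - f(n)))
                    for p in pos_block for n in neg_block])
```

Since the inner-product argument of the block loss equals the average of the pairwise arguments, convexity of the logistic loss guarantees `block_loss(...) <= pairwise` for any choice of samples.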
7 Related Work
The contrastive learning framework is inspired by several empirical works, some of which were mentioned in the introduction. The use of co-occurring words as semantically similar points and negative sampling for learning word embeddings was introduced in Mikolov et al. (2013). Subsequently, similar ideas have been used by Logeswaran & Lee (2018) and Pagliardini et al. (2018) for sentence representations and by Wang & Gupta (2015) for images. Notably, the sentence representations learned by the quick thoughts (QT) method in Logeswaran & Lee (2018) that we analyze have state-of-the-art results on many text classification tasks. Previous attempts have been made to explain negative sampling (Dyer, 2014) using the idea of Noise Contrastive Estimation (NCE) (Gutmann & Hyvärinen, 2010), which relies on the assumption that the data distribution belongs to some known parametric family. This assumption enables them to consider a broader class of distributions for negative sampling. The mean classifier that appears in our guarantees is of significance in meta-learning and is a core component of ProtoNets (Snell et al., 2017).
Our data model for similarity is reminiscent of the one in co-training (Blum & Mitchell, 1998). They assume access to pairs of "views" with the same label that are conditionally independent given the label. Our unlabeled data model can be seen as a special case of theirs, where the two views have the same conditional distributions. However, they additionally assume access to some labeled data (semi-supervised), while we learn representations using only unlabeled data, which can be subsequently used for classification when labeled data is presented. Two-stage kernel learning (Cortes et al., 2010; Kumar et al., 2012) is similar in this sense: in the first stage, a positive linear combination of some base kernels is learned and is then used for classification in the second stage; they assume access to labels in both stages. Similarity/metric learning (Bellet et al., 2012; 2013) learns a linear feature map that gives low distance to similar points and high distance to dissimilar ones. While they identify dissimilar pairs using labels, due to lack of labels we resort to negative sampling and pay the price of class collision. While these works analyze linear function classes, we can handle arbitrarily powerful representations. Learning of representations that are broadly useful on a distribution of tasks is done in multitask learning, specifically in the learning-to-learn model (Maurer et al., 2016), but using labeled data.
Recently Hazan & Ma (2016) proposed “assumption-free” methods for representation learning via MDL/compression arguments, but do not obtain any guarantees comparable to ours on downstream classification tasks. As noted by Arora & Risteski (2017), this compression approach has to preserve all input information (e.g. preserve every pixel of the image) which seems suboptimal.
8 Experimental Results
We report experiments in text and vision domains supporting our theory. Since contrastive learning has already been shown to obtain state-of-the-art results on text classification by quick thoughts (QT) in Logeswaran & Lee (2018), most of our experiments are conducted to corroborate our theoretical analysis. We also show that our extension to similarity blocks in Section 6.3 can improve QT on a real-world task.
Datasets: Two datasets were used in the controlled experiments. (1) The CIFAR-100 dataset (Krizhevsky, 2009), consisting of 32x32 images categorized into 100 classes with a 50000/10000 train/test split. (2) Lacking an appropriate NLP dataset with a large number of classes, we create the Wiki-3029 dataset, consisting of 3029 Wikipedia articles as the classes and 200 sentences from each article as samples. The train/dev/test split is 70%/10%/20%. To test our method on a more standard task, we also use the unsupervised part of the IMDb review corpus (Maas et al., 2011), which consists of 560K sentences from 50K movie reviews. Representations trained using this corpus are evaluated on the supervised IMDb binary classification task, consisting of training and testing sets with 25K reviews each.
8.1 Controlled Experiments
To simulate the data generation process described in Section 2, we generate similar pairs (blocks) of data points by sampling from the same class. Dissimilar pairs (negative samples) are selected randomly. Contrastive learning was done using our objectives (5), and compared to the performance of standard supervised training, with both using the same architecture for the representation $f$. For CIFAR-100 we use VGG-16 (Simonyan & Zisserman, 2014) with an additional 512x100 linear layer added at the end to make the final representations 100-dimensional, while for Wiki-3029 we use a Gated Recurrent Unit (GRU) network (Chung et al., 2015) with output dimension 300 and fix the word embedding layer with pretrained GloVe embeddings (Pennington et al., 2014). The unsupervised model for CIFAR-100 is trained with 500 blocks of size 2 with 4 negative samples, and for Wiki-3029 we use 20 blocks of size 10 with 8 negative samples. We test (1) learned representations on average tasks by using the mean classifier and compare to representations trained using labeled data; (2) the effect of various parameters like the amount of unlabeled data $N$ (in practice we reuse the blocks for negative sampling, so $N$ loses the corresponding factor of $k$), the number of negative samples ($k$), and the block size ($b$) on representation quality; (3) whether the supervised loss tracks the unsupervised loss as suggested by Theorem 4.1; (4) the performance of the mean classifier of the supervised model.
Results: These appear in Table 1. For Wiki-3029 the unsupervised performance is very close to the supervised performance in all respects, while for CIFAR-100 the average-task performance is respectable, rising to good for binary classification. One surprise is that the mean classifier, central to our analysis of unsupervised learning, performs well also with representations learned by supervised training on CIFAR-100. Even the mean computed from just a handful of labeled samples performs well, coming close in accuracy to the full sample mean classifier on CIFAR-100. This suggests that representations learnt by standard supervised deep learning are actually quite concentrated. We also notice that the supervised representations have fairly low unsupervised training loss (as low as 0.4), even though the optimization is minimizing a different objective.
To measure the sample complexity benefit provided by contrastive learning, we train the supervised model on just a fraction of the dataset and compare it with an unsupervised model trained on unlabeled data whose mean classifiers are computed using the same amount of labeled data. We find that the unsupervised model beats the supervised model by a noticeable margin on the 100-way task and on the average binary task when only 50 labeled samples are used.
Figure 2 highlights the positive effect of increasing the number of negative samples as well as the amount of data used by the unsupervised algorithm. In both cases, using a lot of negative examples stops helping after a point, confirming our suspicions in Section 6.2. We also demonstrate how the supervised loss tracks the unsupervised test loss in Figure 2c.
8.2 Effect of Block Size
As suggested in Section 6.3, a natural extension to the model would be access to blocks of similar points. We refer to our method of minimizing the loss in (12) as CURL for Contrastive Unsupervised Representation Learning and perform experiments on CIFAR-100, Wiki-3029, and IMDb. In Table 2 we see that for CIFAR-100 and Wiki-3029, increasing block size yields an improvement in classification accuracy. For IMDb, as is evident in Table 2, using larger blocks provides a clear benefit and the method does better than QT, which has state-of-the-art performance on many tasks. A thorough evaluation of CURL and its variants on other unlabeled datasets is left for future work.
9 Conclusion
Contrastive learning methods have been empirically successful at learning useful feature representations. We provide a new conceptual framework for thinking about this form of learning, which also allows us to formally treat issues such as guarantees on the quality of the learned representations. The framework gives fresh insights into what guarantees are possible and impossible, and shapes the search for new assumptions to add to the framework that allow tighter guarantees. The framework currently ignores issues of efficient minimization of various loss functions, and instead studies the interrelationships of their minimizers as well as sample complexity requirements for training to generalize, while clarifying what generalization means in this setting. Our approach should be viewed as a first cut; possible extensions include allowing tree structure – more generally metric structure – among the latent classes. Connections to meta-learning and transfer learning may arise.
We use experiments primarily to illustrate and support the new framework. But one experiment on sentence embeddings already illustrates how fresh insights derived from our framework can lead to improvements upon state-of-the-art models in this active area. We hope that further progress will follow, and that our theoretical insights will begin to influence practice, including design of new heuristics to identify semantically similar/dissimilar pairs.
This work is supported by NSF, ONR, the Simons Foundation, the Schmidt Foundation, Mozilla Research, Amazon Research, DARPA, and SRC. We thank Rong Ge, Elad Hazan, Sham Kakade, Karthik Narasimhan, Karan Singh and Yi Zhang for helpful discussions and suggestions.
- Arora & Risteski (2017) Arora, S. and Risteski, A. Provable benefits of representation learning. arXiv, 2017.
- Bellet et al. (2012) Bellet, A., Habrard, A., and Sebban, M. Similarity learning for provably accurate sparse linear classification. arXiv preprint arXiv:1206.6476, 2012.
- Bellet et al. (2013) Bellet, A., Habrard, A., and Sebban, M. A survey on metric learning for feature vectors and structured data. CoRR, abs/1306.6709, 2013.
- Blum & Mitchell (1998) Blum, A. and Mitchell, T. Combining labeled and unlabeled data with co-training. In Proceedings of the Eleventh Annual Conference on Computational Learning Theory, COLT' 98, 1998.
- Chung et al. (2015) Chung, J., Gulcehre, C., Cho, K., and Bengio, Y. Gated feedback recurrent neural networks. In Proceedings of the 32nd International Conference on Machine Learning, ICML'15. JMLR.org, 2015.
- Cortes et al. (2010) Cortes, C., Mohri, M., and Rostamizadeh, A. Two-stage learning kernel algorithms. 2010.
- Devlin et al. (2018) Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
- Dyer (2014) Dyer, C. Notes on noise contrastive estimation and negative sampling. CoRR, abs/1410.8251, 2014. URL http://arxiv.org/abs/1410.8251.
- Gutmann & Hyvärinen (2010) Gutmann, M. and Hyvärinen, A. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 297–304, 2010.
- Hazan & Ma (2016) Hazan, E. and Ma, T. A non-generative framework and convex relaxations for unsupervised learning. In Neural Information Processing Systems, 2016.
- Kiros et al. (2015) Kiros, R., Zhu, Y., Salakhutdinov, R., Zemel, R. S., Torralba, A., Urtasun, R., and Fidler, S. Skip-thought vectors. In Neural Information Processing Systems, 2015.
- Krizhevsky (2009) Krizhevsky, A. Learning multiple layers of features from tiny images. Technical report, 2009.
- Kumar et al. (2012) Kumar, A., Niculescu-Mizil, A., Kavukcoglu, K., and Daumé, H. A binary classification framework for two-stage multiple kernel learning. In Proceedings of the 29th International Conference on Machine Learning, ICML'12, 2012.
- Logeswaran & Lee (2018) Logeswaran, L. and Lee, H. An efficient framework for learning sentence representations. In Proceedings of the International Conference on Learning Representations, 2018.
- Maas et al. (2011) Maas, A. L., Daly, R. E., Pham, P. T., Huang, D., Ng, A. Y., and Potts, C. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the ACL: Human Language Technologies, 2011.
- Maurer (2016) Maurer, A. A vector-contraction inequality for rademacher complexities. In International Conference on Algorithmic Learning Theory, pp. 3–17. Springer, 2016.
- Maurer et al. (2016) Maurer, A., Pontil, M., and Romera-Paredes, B. The benefit of multitask representation learning. J. Mach. Learn. Res., 2016.
- Mikolov et al. (2013) Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., and Dean, J. Distributed representations of words and phrases and their compositionality. In Neural Information Processing Systems, 2013.
- Mohri et al. (2018) Mohri, M., Rostamizadeh, A., and Talwalkar, A. Foundations of machine learning. MIT press, 2018.
- Pagliardini et al. (2018) Pagliardini, M., Gupta, P., and Jaggi, M. Unsupervised learning of sentence embeddings using compositional n-gram features. In Proceedings of the North American Chapter of the ACL: Human Language Technologies, 2018.
- Pennington et al. (2014) Pennington, J., Socher, R., and Manning, C. D. GloVe: Global vectors for word representation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2014.
- Peters et al. (2018) Peters, M. E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., and Zettlemoyer, L. Deep contextualized word representations. In Proceedings of NAACL-HLT, 2018.
- Simonyan & Zisserman (2014) Simonyan, K. and Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
- Snell et al. (2017) Snell, J., Swersky, K., and Zemel, R. Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems 30. 2017.
- Wang & Gupta (2015) Wang, X. and Gupta, A. Unsupervised learning of visual representations using videos. In Proceedings of the IEEE International Conference on Computer Vision, 2015.
Appendix A Deferred Proofs
A.1 Class Collision Lemma
We prove a general Lemma, from which Lemma 4.4 can be derived directly.
Lemma A.1. Let $c \in \mathcal{C}$ and let $\ell$ be either the $t$-way hinge loss or $t$-way logistic loss, as defined in Section 2. Let $x, x^{+}, x_{1}^{-}, \dots, x_{t}^{-}$ be iid draws from $\mathcal{D}_c$. For all $f \in \mathcal{F}$, let

$$L^{=}_{un,c}(f) = \mathbb{E}\left[\ell\left(\left\{f(x)^{T}\left(f(x^{+}) - f(x_{i}^{-})\right)\right\}_{i=1}^{t}\right)\right].$$

Then

$$L^{=}_{un,c}(f) - \ell(\vec{0}) \le c'\, t \sqrt{\|\Sigma(f,c)\|_{2}}\; \mathbb{E}_{x \sim \mathcal{D}_c}\left[\|f(x)\|\right],$$

where $\Sigma(f,c)$ is the covariance matrix of $f(x)$ for $x \sim \mathcal{D}_c$ and $c'$ is a positive constant.
Proof of Lemma A.1.
Fix an $f \in \mathcal{F}$ and let $z_i = f(x)^{T}\left(f(x^{+}) - f(x_i^{-})\right)$ and $z = (z_1, \dots, z_t)$. First, we show that $\ell(z) \le \ell(\vec{0}) + c' \sum_{i=1}^{t} |z_i|$, for some constant $c'$. Note that $\ell(\vec{0})$ depends only on $t$.

$t$-way hinge loss: By definition $\ell(z) = \max\left(0,\, 1 + \max_{i}(-z_i)\right) \le 1 + \sum_{i} |z_i| = \ell(\vec{0}) + \sum_{i} |z_i|$. Here, $c' = 1$.

$t$-way logistic loss: By definition $\ell(z) = \log_2\left(1 + \sum_{i} e^{-z_i}\right)$, we have $\ell(z) \le \log_2\left((1+t)\, e^{\sum_i |z_i|}\right) = \ell(\vec{0}) + \frac{1}{\ln 2}\sum_{i} |z_i|$. Here, $c' = \frac{1}{\ln 2}$.

Finally, taking expectations over the iid draws,

$$L^{=}_{un,c}(f) - \ell(\vec{0}) \le c'\, \mathbb{E}\Big[\sum_{i=1}^{t} |z_i|\Big] = c'\, t\, \mathbb{E}\left[\left|f(x)^{T}\left(f(x^{+}) - f(x^{-})\right)\right|\right].$$

But, conditioning on $x$ and applying Jensen's inequality, since $f(x^{+}) - f(x^{-})$ has mean zero and covariance $2\Sigma(f,c)$,

$$\mathbb{E}\left[\left|f(x)^{T}\left(f(x^{+}) - f(x^{-})\right)\right|\right] \le \mathbb{E}_{x}\left[\|f(x)\| \sqrt{2\,\|\Sigma(f,c)\|_2}\right] = \sqrt{2\,\|\Sigma(f,c)\|_2}\;\mathbb{E}_{x \sim \mathcal{D}_c}\left[\|f(x)\|\right],$$

and the factor $\sqrt{2}$ is absorbed into the constant $c'$.
A.2 Proof of Lemma 5.1
Fix an $f \in \mathcal{F}$ and suppose that within each class $c$, $f(x)$ is $\sigma^2$-subgaussian in every direction.⁸ (⁸ A random variable $X$ is called $\sigma^2$-subgaussian if $\mathbb{E}\left[e^{\lambda (X - \mathbb{E}X)}\right] \le e^{\lambda^2 \sigma^2 / 2}$, $\forall \lambda \in \mathbb{R}$. A random vector $V$ is $\sigma^2$-subgaussian in every direction if, for every unit vector $u$, the random variable $\langle V, u \rangle$ is $\sigma^2$-subgaussian.) Let $\mu_c = \mathbb{E}_{x \sim \mathcal{D}_c}\left[f(x)\right]$. This means that for all $c \in \mathcal{C}$ and unit vectors $u$, for $x \sim \mathcal{D}_c$, we have that $u^{T}\left(f(x) - \mu_c\right)$ is $\sigma^2$-subgaussian. Let $\epsilon > 0$ and let $\gamma$ be as in the statement of Lemma 5.1.
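As a reminder of why subgaussianity is the right hypothesis for the concentration step that follows, the standard Chernoff argument (a textbook fact, recalled here only for convenience) turns the moment-generating-function bound in the footnote into a Gaussian-type tail bound for a centered $\sigma^2$-subgaussian $X$:

```latex
\mathbb{P}\left[X \ge s\right]
  \le \inf_{\lambda > 0} e^{-\lambda s}\,\mathbb{E}\!\left[e^{\lambda X}\right]
  \le \inf_{\lambda > 0} e^{-\lambda s + \lambda^2 \sigma^2 / 2}
  = e^{-s^2 / (2\sigma^2)},
  \qquad \text{(optimum at } \lambda = s/\sigma^2\text{)}.
```

Applied to $u^{T}(f(x) - \mu_c)$ for each unit direction $u$, this is what lets the projections of $f(x)$ onto classifier directions concentrate around the class means $\mu_c$.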