Functional Regularization for Representation Learning: A Unified Theoretical Perspective

08/06/2020 ∙ by Siddhant Garg, et al. ∙ University of Wisconsin-Madison

Unsupervised and self-supervised learning approaches have become a crucial tool to learn representations for downstream prediction tasks. While these approaches are widely used in practice and achieve impressive empirical gains, their theoretical understanding largely lags behind. Towards bridging this gap, we present a unifying perspective where several such approaches can be viewed as imposing a regularization on the representation via a learnable function using unlabeled data. We propose a discriminative theoretical framework for analyzing the sample complexity of these approaches. Our sample complexity bounds show that, with carefully chosen hypothesis classes to exploit the structure in the data, such functional regularization can prune the hypothesis space and help reduce the labeled data needed. We then provide two concrete examples of functional regularization, one using auto-encoders and the other using masked self-supervision, and apply the framework to quantify the reduction in the sample complexity bound. We also provide complementary empirical results for the examples to support our analysis.

1 Introduction

Advancements in machine learning have resulted in large prediction models, which need large amounts of labeled data for effective learning. Expensive label annotation costs have increased the popularity of unsupervised (or self-supervised) representation learning techniques using additional unlabeled data. These techniques learn a representation function on the input, and a prediction function over the representation for the target prediction task. Unlabeled data is utilised by posing an auxiliary unsupervised learning task on the representation, e.g., using the representation to reconstruct the input. Some popular examples of the auxiliary task are auto-encoders Rumelhart and McClelland (1987); Baldi (2011), sparse dictionaries Sadeghi et al. (2013), masked self-supervision Devlin et al. (2019), manifold learning Cayton (2005), and others Bengio et al. (2012). These approaches have been extensively used in computer vision Vincent et al. (2008); Zhang et al. (2017); Dosovitskiy et al. (2014) and natural language processing Yang et al. (2017); Devlin et al. (2019); Liu et al. (2019) applications, and have achieved impressive empirical performance.

An important takeaway from these empirical studies is that learning representations using unlabeled data can drastically reduce the size of labeled data needed for the prediction task. In contrast to the popularity and impressive practical gains of these representation learning approaches, there have been far fewer theoretical studies towards understanding them, most of which have been specific to individual approaches. While intuitively, the distributions of the unlabeled and labeled data along with the choice of models are crucial factors governing the empirical gains, theoretically there is still ambiguity over questions like "When can the auxiliary task over the unlabeled data help? How much can it reduce the sample size of the labeled data by?"

In this work, we take a step to improve the theoretical understanding of the benefits of learning representations for the target prediction task via an auxiliary task. We focus on analyzing the sample complexity of labeled and unlabeled data for this representation learning paradigm. Such an analysis can help in identifying conditions to achieve a significant reduction in sample complexity of the labeled data. Arguably, this is one of the most fundamental questions for this learning paradigm, and existing literature on this has been limited and scattered, specific to individual approaches.

We propose a unified perspective where several representation learning approaches can be viewed as if they impose a regularization on the representation via a learnable regularization function. Under this paradigm, representations are learned jointly on unlabeled and labeled data. The former is used in the auxiliary task to jointly learn the representation and the regularization function. The latter is used in the target prediction task to learn the representation and the prediction function. Henceforth, we refer to this paradigm as representation learning via functional regularization.

We present a PAC-style discriminative framework Valiant (1984) to bound the sample complexities of labeled and unlabeled data under different assumptions on the models and data distributions. In particular, we show that functional regularization using unlabeled data can prune the model hypothesis class for learning representations, leading to a reduction of the labeled data required for the prediction task.

To demonstrate the application of our framework, we construct two concrete examples of functional regularization, one using an auto-encoder and the other using masked self-supervision. These specific functional regularization settings allow us to quantify the reduction in the sample bounds of labeled data more explicitly. We also provide complementary empirical support to our theoretical results through experiments on synthetic data for these examples.

2 Related Work

Self-supervised approaches to learn image representations have been extensively used in computer vision through auxiliary tasks such as masked image patch prediction Dosovitskiy et al. (2014), image rotations Gidaris et al. (2018), pixel colorization Zhang et al. (2016), context prediction of image patches Doersch et al. (2015); Pathak et al. (2016); Noroozi and Favaro (2016); Hénaff et al. (2019), etc. Furthermore, several similar approaches find practical use in the field of robotics Sofman et al. (2006); Pinto and Gupta (2015); Agrawal et al. (2016); Ebert et al. (2017); Jang et al. (2018). Masked self-supervision, where representations are learnt by hiding a portion of the input and then reconstructing it, has led to powerful language models like BERT Devlin et al. (2019) and RoBERTa Liu et al. (2019) in natural language processing. There have also been numerous research studies on other approaches to learn meaningful representations for a target task, such as dictionary learning Rodriguez and Sapiro (2008); Mairal et al. (2009) and manifold learning Rifai et al. (2011). Bengio et al. (2012) present an extensive review of multiple representation learning approaches.

On the theoretical front, Balcan and Blum (2010) present a discriminative framework for analyzing semi-supervised learning, and show that unlabeled data can reduce the labeled sample complexity; Chapelle et al. (2006) also study this benefit of using unlabeled data. Our setting utilizes unlabelled data through a learnable regularization function, allowing us to unify multiple representation learning approaches, while these works concentrate on utilizing unlabeled data through a fixed function. Some other works have explored the benefit of unlabeled data for domain adaptation, e.g., Ben-David et al. (2010); Ben-David and Urner (2012). Our setting differs from this since our goal is to learn a prediction function on the labeled data, rather than to handle a change in the domain of the labeled data from source to target. Another line of related work is on multi-task learning, such as Baxter (2000); Liu et al. (2017). They show that multiple supervised learning tasks on different, but related, distributions can help generalization. Our work differs from these since we focus on learning a supervised task using auxiliary unsupervised tasks on the unlabeled data.

There have also been theoretical studies on specific approaches, without providing a unified holistic perspective. For example, Erhan et al. (2010) present comprehensive empirical studies on the benefits of unsupervised pre-training for deep learning; some of their empirical results show that it shrinks the hypothesis space searched during learning, which also forms the intuition behind our analysis.

Saunshi et al. (2019) present a theoretical framework to analyse unsupervised representation learning techniques that can be posed as a contrastive learning problem, with their results later improved by Nozawa et al. (2019). Hazan and Ma (2016) provide a theoretical analysis of unsupervised learning from an optimization viewpoint with applications to dictionary learning and spectral auto-encoders. Le et al. (2018) prove uniform stability generalization bounds for linear auto-encoders and empirically demonstrate the benefits of using supervised auto-encoders.

3 Problem Formulation

Consider labeled data from a distribution $\mathcal{D}$ over the domains $\mathcal{X} \times \mathcal{Y}$, where $\mathcal{X}$ is the input feature space and $\mathcal{Y}$ is the label space. The goal is to learn a predictor that fits $\mathcal{D}$. This can be achieved by first learning a representation function $h$ over the input and then learning a predictor $f$ on the representation. Denote the hypothesis classes for $h$ and $f$ by $\mathcal{H}$ and $\mathcal{F}$ respectively, and the loss function by $\ell(f(h(x)), y)$. Without loss of generality, assume $\ell \in [0, 1]$. We are interested in representation learning approaches where $h$ is learned with the help of an auxiliary task on unlabeled data from a distribution $\mathcal{D}_U$ (same or different from the marginal distribution of $\mathcal{D}$ over $\mathcal{X}$). Such an auxiliary task is posed as an unsupervised (or rather a self-supervised) learning task depending only on the input feature $x$.

Representation learning using auto-encoders is an example that fits this consideration, where given input $x$, the goal is to learn $h$ s.t. $x$ can be decoded back from $h(x)$. More precisely, the decoder $g$ takes the representation $h(x)$ and decodes it to $\hat{x} = g(h(x))$. Here $h$ and $g$ are learnt by minimizing the reconstruction error between $x$ and $\hat{x}$ (e.g., $\|x - g(h(x))\|_2^2$). Representation learning via masked self-supervision is another example of our problem setting, where the goal is to learn $h$ such that a part of the input (e.g., the first coordinate $x_1$) can be predicted from the representation of the remaining part (e.g., $h(\tilde{x})$, where $\tilde{x}$ is the input with $x_1$ masked). This approach uses a function $g$ over the representation to predict the masked part of the input. $h$ and $g$ are optimized by minimizing the error between the true and the predicted values (e.g., $(x_1 - g(h(\tilde{x})))^2$).

Now consider a simple example of a regression problem where we use masked self-supervision to learn a representation that predicts $x_1$ from $\tilde{x}$. If the coordinates of $x$ are mutually independent, then the auxiliary task will not be able to learn a meaningful representation of $\tilde{x}$ for predicting $x_1$, since $x_1$ is independent of all the other coordinates of $x$. On the other extreme, if all the coordinates of $x$ are equal, masked self-supervision can learn the perfect underlying representation for prediction, which corresponds to a single coordinate of $x$. This shows two contrasting abstractions of the inherent structure in the data and how the benefits of using a specific auxiliary task may vary. Our work aims at providing a framework for analyzing sample complexity and clarifying these subtleties on the benefits of the auxiliary task depending on the data distribution.

4 Functional Regularization: A Unified Perspective

We make a key observation that the auxiliary task in several representation learning approaches provides a regularization on the representation function via a learnable function. To better illustrate this viewpoint, consider the auxiliary task of an auto-encoder, where the decoder can be viewed as such a learnable function, and the reconstruction error can be viewed as a regularization penalty imposed on the representation $h$ through the decoder for the data point $x$.

To formalize this notion, we consider learning representations via an auxiliary task which involves a learnable function $g$ and a loss of the form $\ell_r(h, g, x)$ on the representation $h$ via $g$ for an input $x$. We refer to $g$ as the regularization function and $\ell_r$ as the regularization loss. Let $\mathcal{G}$ denote the hypothesis class for $g$. Without loss of generality we assume that $\ell_r \in [0, 1]$.

Definition 1.

Given a loss function $\ell_r(h, g, x)$ for an input $x$ involving a representation $h$ and a regularization function $g$, the regularization loss of $h$ and $g$ on a distribution $\mathcal{D}_U$ over $\mathcal{X}$ is defined as

$$L_r(h, g, \mathcal{D}_U) = \mathbb{E}_{x \sim \mathcal{D}_U}\left[\ell_r(h, g, x)\right]. \qquad (1)$$

The regularization loss of a representation function $h$ on $\mathcal{D}_U$ is defined as

$$L_r(h, \mathcal{D}_U) = \min_{g \in \mathcal{G}} L_r(h, g, \mathcal{D}_U). \qquad (2)$$

We can similarly define $\hat{L}_r(h, g, U)$ and $\hat{L}_r(h, U)$ to denote the losses over a fixed set $U$ of unlabeled data points, i.e., $\hat{L}_r(h, g, U) = \frac{1}{|U|}\sum_{x \in U} \ell_r(h, g, x)$ and $\hat{L}_r(h, U) = \min_{g \in \mathcal{G}} \hat{L}_r(h, g, U)$.

Here, $L_r(h, \mathcal{D}_U)$ can be viewed as a notion of incompatibility of a representation function $h$ with the data distribution $\mathcal{D}_U$. This formalizes the prior knowledge about the representation function and the data. For example, in auto-encoders $L_r(h, \mathcal{D}_U)$ measures how well the representation function complies with the prior knowledge of the input being reconstructible from the representation.

We now introduce a notion for the subset of representation functions having a bounded regularization loss, which is crucial for our sample complexity analysis.

Definition 2.

Given $\tau \ge 0$, the $\tau$-regularization-loss subset of representation hypotheses is:

$$\mathcal{H}_{\tau} = \{ h \in \mathcal{H} : L_r(h, \mathcal{D}_U) \le \tau \}. \qquad (3)$$

We also define the prediction loss over the data distribution $\mathcal{D}$ for a prediction function $f$ on top of a representation $h$: $L_c(f, h, \mathcal{D}) = \mathbb{E}_{(x, y) \sim \mathcal{D}}\left[\ell(f(h(x)), y)\right]$. Similarly, the empirical loss on the labeled data set $S$ is $\hat{L}_c(f, h, S) = \frac{1}{|S|}\sum_{(x, y) \in S} \ell(f(h(x)), y)$. In summary, given hypothesis classes $\mathcal{F}$, $\mathcal{H}$, and $\mathcal{G}$, a labeled dataset $S$, an unlabeled dataset $U$, and a threshold $\tau$ on the regularization loss, we consider the following learning problem:

$$\min_{f \in \mathcal{F},\, h \in \mathcal{H}} \hat{L}_c(f, h, S) \quad \text{subject to} \quad \hat{L}_r(h, U) \le \tau. \qquad (4)$$
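To make the learning problem concrete, the following is a minimal PyTorch sketch of one joint training step. The penalized form with weight `lam` (in place of the hard constraint in Equation (4)) and the module interfaces are our own illustrative assumptions, not the paper's algorithm.

```python
# A minimal sketch of the learning problem in Eq. (4): prediction loss on labeled
# data plus a penalized regularization loss (via the learnable g) on unlabeled data.
# The penalty weight `lam` stands in for the hard constraint "L_r(h, U) <= tau".
import torch


def joint_step(h, f, g, reg_loss_fn, opt, x_lab, y_lab, x_unlab, lam=1.0):
    """One joint update of the representation h, predictor f, and regularization function g."""
    pred_loss = ((f(h(x_lab)).squeeze(-1) - y_lab) ** 2).mean()  # empirical L_c on S
    reg_loss = reg_loss_fn(h, g, x_unlab)                        # empirical L_r on U
    loss = pred_loss + lam * reg_loss
    opt.zero_grad()
    loss.backward()
    opt.step()
    return pred_loss.item(), reg_loss.item()
```

The specific auxiliary tasks in the next subsection supply the `reg_loss_fn` argument.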

4.1 Instantiations of Functional Regularization

Here we show that three popular representation learning approaches are instantiations of our framework by specifying the corresponding regularization functions $g$ and regularization losses $\ell_r$. We describe more instantiations of our framework, such as manifold learning and dictionary learning, in Appendix A.

Auto-encoder. Recall from Section 3 that there is a decoder function $g$ that takes the representation $h(x)$ and decodes it to $\hat{x} = g(h(x))$, where $h$ and $g$ are learnt by minimizing the error $\|x - g(h(x))\|_2^2$. The reconstruction error corresponds to the regularization loss $\ell_r(h, g, x) = \|x - g(h(x))\|_2^2$. $\mathcal{H}_\tau$ is the subset of representation functions with at most $\tau$ reconstruction error using the best decoder in $\mathcal{G}$.
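As a hedged illustration, this instantiation plugs into the `reg_loss_fn` slot of the sketch above as follows (averaging over the batch is an assumption of the sketch):

```python
# Auto-encoder regularization: g is a decoder and ell_r(h, g, x) = ||x - g(h(x))||^2.
def autoencoder_reg_loss(h, g, x_unlab):
    recon = g(h(x_unlab))                             # decode the representation back to input space
    return ((recon - x_unlab) ** 2).sum(dim=1).mean()
```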

Masked Self-supervised Learning. Recall the simple example from Section 3 where $x_1$, the first coordinate of the input $x$, is predicted from the representation of the remaining part (i.e., of $\tilde{x}$, the input with $x_1$ masked). The prediction function for $x_1$ using $h(\tilde{x})$ corresponds to the regularization function $g$, and $\ell_r(h, g, x) = (x_1 - g(h(\tilde{x})))^2$ corresponds to the regularization loss. $\mathcal{H}_\tau$ is the subset of $\mathcal{H}$ which have at most $\tau$ MSE on predicting $x_1$ using the best function $g \in \mathcal{G}$. More general masked self-supervised learning methods can be similarly mapped in our framework.
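A corresponding sketch for this instantiation (masking by zeroing the first coordinate is an illustrative choice, not specified by the paper):

```python
# Masked self-supervision: predict the first coordinate x_1 from the representation
# of the input with x_1 masked, ell_r(h, g, x) = (x_1 - g(h(x_masked)))^2.
def masked_reg_loss(h, g, x_unlab):
    target = x_unlab[:, 0]                            # the coordinate to be predicted
    x_masked = x_unlab.clone()
    x_masked[:, 0] = 0.0                              # mask x_1 (zeroing is an assumption)
    pred = g(h(x_masked)).squeeze(-1)
    return ((pred - target) ** 2).mean()
```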

Variational Auto-encoder. VAEs encode the input $x$ as a distribution over a parametric latent space instead of a single point, and sample a latent $z$ from it to reconstruct $x$ using a decoder $g$. The encoder is used to model the underlying mean $\mu(x)$ and co-variance matrix $\Sigma(x)$ of the distribution over $z$. VAEs are trained by minimising a loss over the parameters $(\mu, \Sigma)$ and $g$,

$$\ell_r(h, g, x) = \mathbb{E}_{z \sim \mathcal{N}(\mu(x), \Sigma(x))}\left[\|x - g(z)\|_2^2\right] + \mathrm{KL}\left(\mathcal{N}(\mu(x), \Sigma(x)) \,\|\, p(z)\right),$$

where $p(z)$ is specified as the prior distribution over $z$ (e.g., $\mathcal{N}(0, I)$). The encoder $(\mu, \Sigma)$ can be viewed as the representation function $h$, the decoder $g$ as the learnable regularization function, and the loss above as the regularization loss in our framework. Then $\mathcal{H}_\tau$ is the subset of encoders which have at most $\tau$ VAE loss when using the best decoder for it.
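A sketch of this regularization loss for a diagonal-Gaussian encoder with the standard $\mathcal{N}(0, I)$ prior and the reparameterization trick; the `(mu, log_var)` encoder interface is our assumption:

```python
# VAE regularization: reconstruction error plus KL(N(mu, diag(exp(log_var))) || N(0, I)).
import torch

def vae_reg_loss(encoder, decoder, x_unlab):
    mu, log_var = encoder(x_unlab)                               # parameters of q(z | x)
    z = mu + torch.randn_like(mu) * torch.exp(0.5 * log_var)     # reparameterized sample
    recon_err = ((decoder(z) - x_unlab) ** 2).sum(dim=1)
    kl = 0.5 * (torch.exp(log_var) + mu ** 2 - 1.0 - log_var).sum(dim=1)
    return (recon_err + kl).mean()
```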

4.2 Sample Complexity Analysis

To analyze the sample complexity of these approaches using functional regularization, we first enumerate the considerations on the data and the hypothesis classes: 1) the labeled and unlabeled data can either be from the same or different distributions (i.e., same domain or different domains); 2) the hypothesis classes can contain a zero-error hypothesis or not (i.e., the setting being realizable or unrealizable); 3) the hypothesis classes can be finite or infinite in size. We perform the analysis for different combinations of these assumptions. While the bounds have some differences, our proofs share a common high-level intuition across the different settings. We now present sample complexity bounds for three interesting, characteristic settings; bounds for several other settings and the proofs of all the theorems are presented in Appendix B.

Same Domain, Realizable, Finite Hypothesis Classes. We begin with the simplest setting, where the unlabeled dataset $U$ and the labeled dataset $S$ are from the same distribution (i.e., $\mathcal{D}_U$ is the marginal of $\mathcal{D}$ over $\mathcal{X}$), and the hypothesis classes contain functions with zero prediction and regularization loss. We further assume that the hypothesis classes are finite in size. We get the following result:

Theorem 1.

Suppose there exist $h^* \in \mathcal{H}$ and $f^* \in \mathcal{F}$ such that $L_c(f^*, h^*, \mathcal{D}) = 0$ and $L_r(h^*, \mathcal{D}_U) = 0$. For any $\epsilon, \delta \in (0, 1)$, a set $U$ of $m_u$ unlabeled examples and a set $S$ of $m_l$ labeled examples are sufficient to learn to an error $\epsilon$ with probability at least $1 - \delta$, where

$$m_u = \frac{1}{\epsilon}\left[\ln|\mathcal{H}| + \ln|\mathcal{G}| + \ln\frac{2}{\delta}\right], \qquad m_l = \frac{1}{\epsilon}\left[\ln|\mathcal{H}_{\epsilon}| + \ln|\mathcal{F}| + \ln\frac{2}{\delta}\right]. \qquad (5)$$

In particular, with probability at least $1 - \delta$, all hypotheses $(f, h)$ with $\hat{L}_c(f, h, S) = 0$ and $\hat{L}_r(h, U) = 0$ will have $L_c(f, h, \mathcal{D}) \le \epsilon$.

Theorem 1 shows that, if the target pair $(f^*, h^*)$ is indeed perfectly correct for prediction and $h^*$ has a zero regularization loss, then optimizing the prediction and regularization losses to 0 over the said number of data points is sufficient to learn an accurate predictor in the PAC learning sense.

Recall that standard analysis shows that without unlabeled data, $\frac{1}{\epsilon}\left[\ln|\mathcal{H}| + \ln|\mathcal{F}| + \ln\frac{2}{\delta}\right]$ labeled points are needed to get the same error guarantee. On comparing the bounds, Theorem 1 shows that functional regularization can prune away some hypotheses in $\mathcal{H}$, thereby replacing the factor $\ln|\mathcal{H}|$ with $\ln|\mathcal{H}_\epsilon|$ for the pruned subset in the bound. Thus, the sample complexity bound is reduced by $\frac{1}{\epsilon}\ln\frac{|\mathcal{H}|}{|\mathcal{H}_\epsilon|}$. Equivalently, the error bound is reduced by $\frac{1}{m_l}\ln\frac{|\mathcal{H}|}{|\mathcal{H}_\epsilon|}$ when using $m_l$ labeled data points. So the auxiliary task is helpful for learning the predictor when $|\mathcal{H}_\epsilon|$ is significantly smaller than $|\mathcal{H}|$, avoiding the requirement of a large number of labeled points to find a good representation function among them.
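Written out side by side (using the reconstructed form of the bound stated above), the comparison of labeled sample bounds is:

```latex
% Labeled sample bounds for the same error \epsilon and confidence \delta
m_l^{\mathrm{no\ reg}} = \frac{1}{\epsilon}\Big[\ln|\mathcal{H}| + \ln|\mathcal{F}| + \ln\tfrac{2}{\delta}\Big],
\qquad
m_l^{\mathrm{reg}} = \frac{1}{\epsilon}\Big[\ln|\mathcal{H}_{\epsilon}| + \ln|\mathcal{F}| + \ln\tfrac{2}{\delta}\Big],
\qquad
m_l^{\mathrm{no\ reg}} - m_l^{\mathrm{reg}} = \frac{1}{\epsilon}\,\ln\frac{|\mathcal{H}|}{|\mathcal{H}_{\epsilon}|}.
```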

Same Domain, Unrealizable, Infinite Hypothesis Classes. We now present the result for a more elaborate setting, where both the prediction and regularization losses are non-zero. We also relax the assumption that the hypothesis classes are finite. We use metric entropy to measure the capacity of the hypothesis classes for demonstration here; alternative capacity measures like VC-dimension or Rademacher complexity can also be used with essentially no change to the analysis. Assume that the parameter space of $\mathcal{H}$ is equipped with a norm, and let $N(\mathcal{H}, \epsilon_0)$ denote the $\epsilon_0$-covering number of $\mathcal{H}$; similarly for $\mathcal{F}$ and $\mathcal{G}$. Let the Lipschitz constants of the losses w.r.t. these norms be bounded by $L$.

Theorem 2.

Suppose there exist $h^* \in \mathcal{H}$ and $f^* \in \mathcal{F}$ such that $L_c(f^*, h^*, \mathcal{D}) = \epsilon_c$ and $L_r(h^*, \mathcal{D}_U) = \epsilon_r$. For any $\epsilon, \delta \in (0, 1)$, a set $U$ of $m_u$ unlabeled examples and a set $S$ of $m_l$ labeled examples are sufficient to learn to an error $\epsilon_c + \epsilon$ with probability at least $1 - \delta$, where

$$m_u = \frac{C}{\epsilon^2}\left[\ln N\!\left(\mathcal{H}, \tfrac{\epsilon}{L}\right) + \ln N\!\left(\mathcal{G}, \tfrac{\epsilon}{L}\right) + \ln\frac{1}{\delta}\right], \qquad (6)$$
$$m_l = \frac{C}{\epsilon^2}\left[\ln N\!\left(\mathcal{H}_{\tau}, \tfrac{\epsilon}{L}\right) + \ln N\!\left(\mathcal{F}, \tfrac{\epsilon}{L}\right) + \ln\frac{1}{\delta}\right] \quad \text{with } \tau = \epsilon_r + \epsilon, \qquad (7)$$

for some absolute constant $C$. In particular, with probability at least $1 - \delta$, the $(\hat{f}, \hat{h})$ that optimize $\hat{L}_c(f, h, S)$ subject to $\hat{L}_r(h, U) \le \tau$ have $L_c(\hat{f}, \hat{h}, \mathcal{D}) \le \epsilon_c + \epsilon$.

Theorem 2 shows that optimizing the prediction loss subject to the regularization loss being bounded by $\tau$ can give a solution with prediction loss close to the optimal. The sample complexity bounds are broadly similar to those in the simpler realizable and finite hypothesis class setting, but with the zero losses replaced by $\epsilon_c$ and $\epsilon_r$ due to the unrealizability, and the logarithms of the hypothesis class sizes replaced by their metric entropies due to the classes being infinite. Compared with the standard bound on $m_l$ without unlabeled data, we show a reduction of $\frac{C}{\epsilon^2}\left[\ln N(\mathcal{H}, \frac{\epsilon}{L}) - \ln N(\mathcal{H}_\tau, \frac{\epsilon}{L})\right]$ in the labeled sample bound; equivalently, the error bound is reduced correspondingly when using the same number of labeled points.

We bring attention to some subtleties which are worth noting. Firstly, the regularization loss $\epsilon_r$ of $h^*$ need not be optimal; there may be other $h \in \mathcal{H}$ which achieve a smaller regularization loss (even $0$). Secondly, the prediction loss of the learned solution is bounded by $\epsilon_c + \epsilon$, which is independent of $\epsilon_r$. Similarly, the bounds on $m_u$ and $m_l$ mainly depend on $(\mathcal{H}, \mathcal{G})$ and $(\mathcal{H}_\tau, \mathcal{F})$ respectively, while $m_l$ depends on $\epsilon_r$ only through the term $N(\mathcal{H}_\tau, \frac{\epsilon}{L})$. Thus, even when the regularization loss is large (e.g., the reconstruction of an auto-encoder is far from accurate), it is still possible to learn an accurate predictor with a significantly reduced labeled data size using the unlabeled data. This suggests that when designing an auxiliary task (i.e., choosing $\mathcal{G}$ and $\ell_r$), it is not necessary to ensure that the "ground-truth" $h^*$ has a small regularization loss. Rather, one should ensure that only a small fraction of $\mathcal{H}$ have a smaller (or similar) regularization loss than $h^*$, so as to reduce the labeled sample complexity.

This bound also shows that $\tau$ should be carefully chosen for the constraint $\hat{L}_r(h, U) \le \tau$. With a very small $\tau$, the ground-truth $h^*$ (or hypotheses of similar quality) may not satisfy the constraint and may become infeasible for learning. With a very large $\tau$, the auxiliary task may not reduce the labelled sample complexity. Practical learning algorithms typically turn this constraint into a regularization-like term, i.e., they optimize $\hat{L}_c(f, h, S) + \lambda \hat{L}_r(h, U)$. For such objectives, the requirement on $\tau$ translates to carefully choosing $\lambda$: a very large $\lambda$ corresponds to a small effective $\tau$ (so good hypotheses may become infeasible), while a very small $\lambda$ corresponds to a large effective $\tau$ and may not reduce the labeled sample complexity.

Different Domain, Unrealizable, Infinite Hypothesis Classes. In practice, the unlabeled data is often from a different domain than the labeled data. For example, state-of-the-art NLP systems are often pre-trained on a large general-purpose unlabeled corpus (e.g., the entire Wikipedia) while the target task has a small specific labeled corpus (e.g., a set of medical records). That is, the unlabeled data is from a distribution $\mathcal{D}_U$ different from the marginal of $\mathcal{D}$ over $\mathcal{X}$. For this setting, we have the following bound:

Theorem 3.

Suppose the unlabeled data is from a distribution $\mathcal{D}_U$ different from the marginal of $\mathcal{D}$ over $\mathcal{X}$. Suppose there exist $h^* \in \mathcal{H}$ and $f^* \in \mathcal{F}$ such that $L_c(f^*, h^*, \mathcal{D}) = \epsilon_c$ and $L_r(h^*, \mathcal{D}_U) = \epsilon_r$. Then the same sample complexity bounds as in Theorem 2 hold (replacing the pruned subset defined with respect to the marginal of $\mathcal{D}$ with the pruned subset defined with respect to $\mathcal{D}_U$ in Equation 7).

The bound is similar to that in the setting where the domain distributions are the same. It implies that unlabeled data from a different domain than the labeled data can help in learning the target task, as long as there exists a "ground-truth" representation function $h^*$, shared across the two domains, that has a small prediction loss on the labeled data and a suitable regularization loss on the unlabeled data. The former (small prediction loss) is typically assumed according to domain knowledge, e.g., for image data, common visual perception features are believed to be shared across different types of images. The latter (suitable regularization loss) means only a small fraction of $\mathcal{H}$ have a smaller (or similar) regularization loss than $h^*$, which requires a careful design of $\mathcal{G}$ and $\ell_r$.

5 Applying the Theoretical Framework to Concrete Examples

The analysis in Section 4 shows that the reduction in the sample complexity bound depends on the pruned subset $\mathcal{H}_\tau$, which captures the effect of the regularization function and the properties of the unlabeled data distribution. Our generic framework can be applied to various concrete configurations of the hypothesis classes and data distributions. This way we can quantify the reduction more explicitly by investigating $\mathcal{H}_\tau$. We provide two such examples: one using an auto-encoder regularization and the other using a masked self-supervision regularization. We outline how to bound the sample complexity for these examples, and present the complete details and proofs in Appendix C.

5.1 An Example of Functional Regularization via Auto-encoder

Learning Without Functional Regularization. Consider $\mathcal{H}$ to be the class of linear functions from $\mathbb{R}^d$ to $\mathbb{R}^r$ with $r < d$, and $\mathcal{F}$ to be the class of linear functions over some activations. That is,

$$\mathcal{H} = \{ h(x) = Wx : W \in \mathbb{R}^{r \times d} \}, \qquad \mathcal{F} = \{ f(v) = \theta^\top \sigma(v) : \theta \in \mathbb{R}^r \}. \qquad (8)$$

Here $\sigma$ is an activation function (e.g., the quadratic activation), and the rows of $W$ and the vector $\theta$ have norms bounded by 1. We consider the Mean Square Error prediction loss, i.e., $\ell(f(h(x)), y) = (f(h(x)) - y)^2$. Without prior knowledge on the data, using no functional regularization corresponds to end-to-end training of $f \circ h$ on the labeled data $S$.

Data Property. We consider a setting where the data has properties which allow functional regularization. We assume that the data consists of a signal and noise, where the signal lies in a certain $r$-dimensional subspace. Formally, let the columns of $A$ be the eigenvectors of the covariance $\Sigma = \mathbb{E}_x[xx^\top]$; then the prediction labels are largely determined by the signal in the first $r$ directions: $y = \theta^{*\top}\sigma(A_r^\top x) + \nu$ with $\nu \sim \mathcal{N}(0, \sigma_\nu^2)$, where $\theta^*$ is a ground-truth parameter with $\|\theta^*\| \le 1$, $A_r$ is the set of the first $r$ eigenvectors of $\Sigma$, and $\nu$ is a small Gaussian noise. We assume a difference in the $r$-th and $(r+1)$-th eigenvalues of $\Sigma$ to distinguish the corresponding eigenvectors. Let $\lambda_i$ denote the $i$-th eigenvalue of $\Sigma$.

Learning With Functional Regularization. Knowing that the signal lies in an $r$-dimensional subspace, we can perform auto-encoder functional regularization. Let $\mathcal{G}$ be a class of linear functions from $\mathbb{R}^r$ to $\mathbb{R}^d$, i.e., $g(v) = Bv$ where $B \in \mathbb{R}^{d \times r}$ has orthonormal columns. The regularization loss is $\ell_r(h, g, x) = \|x - g(h(x))\|_2^2$. For simplicity, we assume access to infinite unlabeled data.

Without regularization, the standard covering argument shows that the labeled sample complexity for an error close to the optimal scales with the metric entropy of $\mathcal{F} \circ \mathcal{H}$ (roughly the number of parameters $rd + r$), for some absolute constant. Applying our framework when using regularization with a small threshold $\tau$, the sample complexity is instead governed by the metric entropy of the pruned subset $\mathcal{H}_\tau$. We then show (Proof in Lemma 6 of Appendix C.1) that any $h(x) = Wx$ with small reconstruction error must have the rows of $W$ lie essentially in the span of $A_r$, where $A_r$ refers to the sub-matrix of the columns of $A$ having indices in $\{1, \dots, r\}$. Therefore, the labeled sample complexity bound is reduced, i.e., the error bound is reduced when using the same number of labeled points. The reduction is roughly linear in $r$ initially and then grows slower with $r$. Interestingly, the reduction depends on the hidden dimension $r$ but has little dependence on the input dimension $d$.
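To make this example concrete, below is a small self-contained PyTorch sketch of the setup. The dimensions, noise scale, learning rate, number of steps, and the fact that the norm and orthonormality constraints are dropped are all illustrative assumptions, not the configuration used in the paper's experiments.

```python
import torch

torch.manual_seed(0)
d, r, n_lab, n_unlab = 100, 10, 200, 5000              # illustrative sizes
A = torch.linalg.qr(torch.randn(d, d))[0]              # orthonormal basis; first r columns carry the signal
signal = 2.0 * torch.randn(n_lab + n_unlab, r)         # larger-variance signal coefficients
noise = 0.1 * torch.randn(n_lab + n_unlab, d - r)      # small noise in the remaining directions
X = signal @ A[:, :r].T + noise @ A[:, r:].T
theta_star = torch.randn(r) / r ** 0.5
y = (signal ** 2) @ theta_star + 0.01 * torch.randn(n_lab + n_unlab)  # quadratic-activation labels
X_lab, y_lab, X_unlab = X[:n_lab], y[:n_lab], X[n_lab:]

def train(lam):
    """Jointly train (W, theta, B); lam = 0.0 recovers end-to-end training."""
    W = (0.1 * torch.randn(r, d)).requires_grad_()     # representation h(x) = Wx
    theta = (0.1 * torch.randn(r)).requires_grad_()    # predictor f(v) = <theta, sigma(v)>
    B = (0.1 * torch.randn(d, r)).requires_grad_()     # linear decoder g(v) = Bv
    opt = torch.optim.Adam([W, theta, B], lr=1e-2)
    for _ in range(2000):
        pred = ((X_lab @ W.T) ** 2) @ theta                      # quadratic activation
        pred_loss = ((pred - y_lab) ** 2).mean()
        recon = (X_unlab @ W.T) @ B.T                            # g(h(x)) = B W x
        reg_loss = ((recon - X_unlab) ** 2).sum(dim=1).mean()
        opt.zero_grad()
        (pred_loss + lam * reg_loss).backward()
        opt.step()
    return pred_loss.item()

print("end-to-end:", train(0.0), " with functional regularization:", train(1.0))
```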

5.2 An Example of Functional Regularization via Masked Self-supervision

Learning Without Functional Regularization. Let $\mathcal{H}$ be linear functions from $\mathbb{R}^d$ to $\mathbb{R}^r$ with $r < d$, followed by a quadratic activation, and $\mathcal{F}$ be linear functions from $\mathbb{R}^r$ to $\mathbb{R}$. That is,

$$\mathcal{H} = \{ h(x) = \sigma(Wx) : W \in \mathbb{R}^{r \times d} \}, \qquad \mathcal{F} = \{ f(v) = \theta^\top v : \theta \in \mathbb{R}^r \}. \qquad (9)$$

Here $\sigma(v_i) = v_i^2$ for each coordinate $i$ is the quadratic activation function. W.l.o.g., we assume that the rows of $W$ and the vector $\theta$ have norm bounded by 1. Without prior knowledge on the data, using no functional regularization corresponds to end-to-end training of $f \circ h$ on the labeled data $S$.

Data Property. We consider the setting where the first coordinate $x_1$ of a data point is determined by the remaining coordinates through a ground-truth map whose directions are aligned with the top-$r$ eigenvectors of the covariance of the masked input $\tilde{x}$. Furthermore, the label is given by $y = \theta^{*\top} h^*(\tilde{x}) + \nu$ for some ground-truth parameter $\theta^*$ with $\|\theta^*\| \le 1$ and a small Gaussian noise $\nu$. We also assume a difference in the $r$-th and $(r+1)$-th eigenvalues of the covariance to distinguish the corresponding eigenvectors.

Learning With Functional Regularization. Suppose we have prior knowledge that $x_1$ depends on the remaining coordinates only through a few directions, i.e., it can be expressed in terms of some vectors spanning an $r$-dimensional subspace with $r < d$. Based on this, we perform masked self-supervision by constraining the prediction of the first coordinate $x_1$ to be a function $g$ of the representation $h(\tilde{x})$ of the remaining part, and choosing the regularization loss $\ell_r(h, g, x) = (x_1 - g(h(\tilde{x})))^2$. Again for simplicity, we assume access to infinite unlabeled data and set the regularization loss threshold $\tau$ to a suitably small value.

On applying our framework, we get that functional regularization can reduce the labeled sample bound by an amount governed by the gap between the metric entropies of $\mathcal{H}$ and of the pruned subset $\mathcal{H}_\tau$. We can bound this gap using properties of the data covariance and the quadratic activation (Proof in Lemma 7 of Appendix C.2). Using this we can show a reduction of the sample bound, i.e., a reduction in the error bound when using the same amount of labeled data. We also note that this reduction depends on the hidden dimension $r$ but has little dependence on the input dimension $d$.
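A compact sketch of the masked self-supervision step in this example (linear maps only; the quadratic activation, norm constraints, and the exact data-generating parameters from the analysis are omitted, and `beta_star` is a hypothetical ground-truth vector):

```python
import torch

torch.manual_seed(0)
d, r, n_unlab = 100, 10, 5000
X_rest = torch.randn(n_unlab, d - 1)                    # coordinates x_2, ..., x_d
beta_star = torch.zeros(d - 1)
beta_star[:r] = torch.randn(r) / r ** 0.5               # x_1 depends on only r directions
x1 = X_rest @ beta_star + 0.01 * torch.randn(n_unlab)   # the coordinate to be masked and predicted
X = torch.cat([x1.unsqueeze(1), X_rest], dim=1)

W = (0.1 * torch.randn(r, d - 1)).requires_grad_()      # representation of the unmasked part
v = (0.1 * torch.randn(r)).requires_grad_()             # regularization function g(u) = <v, u>
opt = torch.optim.Adam([W, v], lr=1e-2)
for _ in range(1000):
    rep = X[:, 1:] @ W.T                                # h(x with x_1 masked)
    reg_loss = ((rep @ v - X[:, 0]) ** 2).mean()        # ell_r = (x_1 - g(h(x_masked)))^2
    opt.zero_grad()
    reg_loss.backward()
    opt.step()
# W (fitted on unlabeled data) is then trained jointly with a predictor f on the labeled data.
```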

5.3 Empirical Support

While there is abundant empirical evidence on the benefits of auxiliary tasks in various applications, our framework allows mathematical analysis and we can get non-trivial implications from the two examples. Therefore, here we focus on experimentally verifying the following implications for the two examples: 1) the reduction in prediction error (between end-to-end training and functional regularization) using the same labelled data; 2) the reduction in prediction error on varying a property of the data and hypotheses (specifically, varying the parameter $r$); 3) that the reduction in prediction error is obtained due to pruning the hypothesis class. We present the complete experimental details in Appendix D for reproducibility, along with additional results which verify that the reduction has little dependence on the input dimension $d$. For completeness, we also present additional experiments verifying the benefits of functional regularization on synthetic and real data in Appendix E.

Setup. For the auto-encoder example, we randomly generate $r$ orthonormal vectors in $\mathbb{R}^d$, together with means and variances for the signal coefficients so that the signal dominates the noise. We sample the signal coefficients and generate $x$. For generating $y$, we use a randomly generated ground-truth vector $\theta^*$. We generate unlabeled, labeled training and labeled test points. We use the quadratic activation function and follow the specification in Section 5.1 (e.g., hypothesis classes, losses). For the masked self-supervision example, we similarly generate the data and then follow the specifications in Section 5.2. We report Test MSE averaged over 10 runs as the metric.

To support our key theoretical intuition that functional regularization prunes the hypothesis classes, we seek to visualize the learned model. Since multiple permutations of model parameters can result in the same model behavior, we visualise the function (from input to output) represented by the model instead of its parameters, using the method from Erhan et al. (2010). Formally, we concatenate the outputs on the test set points from a trained model into a single vector and visualise the vector in 2D using t-SNE van der Maaten and Hinton (2008). We perform 1000 independent runs for each of the two models (with and without functional regularization) and plot the vectors for comparison. See Appendix D.1 for more details.
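A sketch of this visualization procedure, assuming scikit-learn's t-SNE implementation; `models_reg`, `models_e2e`, and `X_test` are placeholders for the independently trained models of the two settings and the shared test inputs:

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def function_vectors(models, X_test):
    """Concatenate each trained model's test-set outputs into one vector per model."""
    return np.stack([np.asarray(m(X_test)).ravel() for m in models])

def plot_function_space(models_reg, models_e2e, X_test):
    """2D t-SNE of the functions represented by the two sets of trained models."""
    vecs = np.vstack([function_vectors(models_reg, X_test),
                      function_vectors(models_e2e, X_test)])
    emb = TSNE(n_components=2, init="pca", random_state=0).fit_transform(vecs)
    n = len(models_reg)
    plt.scatter(emb[:n, 0], emb[:n, 1], s=8, label="with functional regularization")
    plt.scatter(emb[n:, 0], emb[n:, 1], s=8, label="end-to-end")
    plt.legend()
    plt.show()
```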

Figure 1: Auto-Encoder: Figure 1(a) shows the variation of test MSE with the labeled data size (here $r=30$), Figure 1(b) shows this variation with the parameter $r$, and Figure 1(c) shows the 2D visualization of the functional approximation using t-SNE. Masked Self-Supervision: Figures 1(d), 1(e) and 1(f) show the corresponding plots. Reduction refers to the Test MSE of end-to-end training minus the Test MSE with regularization.

Results. Figure 1(a) plots the Test MSE loss vs. the size of the labeled data when $r = 30$. We observe that with the same labeled data size, functional regularization can significantly reduce the error compared to end-to-end training. Equivalently, it needs much fewer labeled samples to achieve the same error as end-to-end training (e.g., 500 vs. 10,000 points). Also, the error without regularization does not decrease for small sample sizes while it decreases with regularization, suggesting that the regularization can even help alleviate optimization difficulty. Figure 1(b) shows the effect of varying $r$ (i.e., the dimension of the subspace containing signals for prediction). We observe that the reduction in the error increases roughly linearly with $r$ and then grows slower, as predicted by our analysis. Figure 1(c) visualizes the prediction functions learned. It shows that when using functional regularization, the learned functions stay in a small functional space, while they are scattered otherwise. This supports our intuition for the theoretical analysis. This result also interestingly suggests that pruning the representation hypothesis space via functional regularization translates to a compact functional space for the prediction, even through optimization. We make similar observations for the masked self-supervision example in Figures 1(d)-1(f), which provide additional support for our analysis.

6 Conclusion

In this paper we have presented a unified discriminative framework for analyzing many representation learning approaches using unlabeled data, by viewing them as imposing a regularization on the representation via a learnable function. We have derived sample complexity bounds under various assumptions on the hypothesis classes and data, and shown that the functional regularization can be used to prune the hypothesis class and reduce the labeled sample complexity. We have also applied our framework to two concrete examples. An interesting future work direction is to investigate the effect of such functional regularization on the optimization of the learning methods.

Broader Impact

Our paper is mostly theoretical in nature and we foresee no immediate societal impact. Our theoretical framework may inspire development of improved representation learning methods on unlabeled data, which may have a positive impact in practice. In addition to the theoretical machine learning community, we believe that our precise and easy-to-read formulation of unsupervised learning for downstream tasks can be highly beneficial to engineering-inclined machine learning researchers.

References

  • [1] P. Agrawal, A. Nair, P. Abbeel, J. Malik, and S. Levine (2016-06) Learning to poke by poking: experiential learning of intuitive physics. Conference on Neural Information Processing Systems. Cited by: §2.
  • [2] M. Balcan and A. Blum (2010-03) A discriminative model for semi-supervised learning. J. ACM 57 (3). External Links: ISSN 0004-5411, Link, Document Cited by: §2.
  • [3] P. Baldi (2011) Autoencoders, unsupervised learning and deep architectures. In Proceedings of the 2011 International Conference on Unsupervised and Transfer Learning Workshop - Volume 27, UTLW’11, pp. 37–50. Cited by: §1.
  • [4] N. Bansal, X. Chen, and Z. Wang (2018) Can we gain more from orthogonality regularizations in training deep cnns?. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, NIPS’18, Red Hook, NY, USA, pp. 4266–4276. Cited by: §D.1.
  • [5] J. Baxter (2000-03) A model of inductive bias learning. J. Artif. Int. Res. 12 (1), pp. 149–198. External Links: ISSN 1076-9757 Cited by: §2.
  • [6] S. Ben-David, J. Blitzer, K. Crammer, A. Kulesza, F. Pereira, and J. W. Vaughan (2010) A theory of learning from different domains. Machine learning 79 (1-2), pp. 151–175. Cited by: §2.
  • [7] S. Ben-David and R. Urner (2012) On the hardness of domain adaptation and the utility of unlabeled target samples. In Proceedings of the 23rd International Conference on Algorithmic Learning Theory, ALT’12, Berlin, Heidelberg, pp. 139–153. External Links: ISBN 9783642341052, Link, Document Cited by: §2.
  • [8] Y. Bengio, A. Courville, and P. Vincent (2012) Representation learning: a review and new perspectives. External Links: 1206.5538 Cited by: §1, §2.
  • [9] A. Brock, T. Lim, J. M. Ritchie, and N. Weston (2017) Neural photo editing with introspective adversarial networks. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings, External Links: Link Cited by: §D.1.
  • [10] L. Cayton (2005) Algorithms for manifold learning. Univ. of California at San Diego Tech. Rep 12 (1-17), pp. 1. Cited by: §1.
  • [11] O. Chapelle, B. Schölkopf, and A. Zien (2006) Semi-supervised learning (adaptive computation and machine learning). The MIT Press. External Links: ISBN 0262033585 Cited by: §2.
  • [12] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019-06) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 4171–4186. External Links: Link, Document Cited by: Appendix A, §E.2, §1, §2.
  • [13] C. Doersch, A. Gupta, and A. A. Efros (2015) Unsupervised visual representation learning by context prediction. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), ICCV ’15, USA, pp. 1422–1430. External Links: ISBN 9781467383912, Link, Document Cited by: §2.
  • [14] W. B. Dolan and C. Brockett (2005) Automatically constructing a corpus of sentential paraphrases. In Proceedings of the Third International Workshop on Paraphrasing (IWP2005), External Links: Link Cited by: §E.2.
  • [15] A. Dosovitskiy, J. T. Springenberg, M. Riedmiller, and T. Brox (2014) Discriminative unsupervised feature learning with convolutional neural networks. In Advances in Neural Information Processing Systems 27, Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger (Eds.), pp. 766–774. Cited by: §1, §2.
  • [16] F. Ebert, C. Finn, A. X. Lee, and S. Levine (2017) Self-supervised visual planning with temporal skip connections. In CoRL, Cited by: §2.
  • [17] D. Erhan, Y. Bengio, A. Courville, P. Manzagol, P. Vincent, and S. Bengio (2010-03) Why does unsupervised pre-training help deep learning?. J. Mach. Learn. Res. 11, pp. 625–660. External Links: ISSN 1532-4435 Cited by: §2, §5.3.
  • [18] S. Gidaris, P. Singh, and N. Komodakis (2018) Unsupervised representation learning by predicting image rotations. In International Conference on Learning Representations, External Links: 1803.07728 Cited by: §2.
  • [19] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680. Cited by: Appendix A.
  • [20] E. Hazan and T. Ma (2016) A non-generative framework and convex relaxations for unsupervised learning. In Proceedings of the 30th International Conference on Neural Information Processing Systems, NIPS’16, Red Hook, NY, USA, pp. 3314–3322. External Links: ISBN 9781510838819 Cited by: §2.
  • [21] O. J. Hénaff, A. Srinivas, J. D. Fauw, A. Razavi, C. Doersch, S. M. A. Eslami, and A. van den Oord (2019) Data-efficient image recognition with contrastive predictive coding. External Links: 1905.09272 Cited by: §2.
  • [22] G. E. Hinton and R. R. Salakhutdinov (2006) Reducing the dimensionality of data with neural networks. science 313 (5786), pp. 504–507. Cited by: Appendix A.
  • [23] E. Hoffer and N. Ailon (2015) Deep metric learning using triplet network.. In ICLR (Workshop), Y. Bengio and Y. LeCun (Eds.), External Links: Link Cited by: Appendix A.
  • [24] E. Jang, C. Devin, V. Vanhoucke, and S. Levine (2018-29–31 Oct) Grasp2Vec: learning object representations from self-supervised grasping. In Proceedings of The 2nd Conference on Robot Learning, A. Billard, A. Dragan, J. Peters, and J. Morimoto (Eds.), Proceedings of Machine Learning Research, Vol. 87, , pp. 99–112. External Links: Link Cited by: §2.
  • [25] O. Kovaleva, A. Romanov, A. Rogers, and A. Rumshisky (2019-11) Revealing the dark secrets of BERT. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 4365–4374. External Links: Link, Document Cited by: §E.2.
  • [26] L. Le, A. Patterson, and M. White (2018) Supervised autoencoders: improving generalization performance with unsupervised regularizers. In Advances in Neural Information Processing Systems 31, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (Eds.), pp. 107–117. External Links: Link Cited by: §2.
  • [27] T. Liu, D. Tao, M. Song, and S. J. Maybank (2017-02) Algorithm-dependent generalization bounds for multi-task learning. IEEE Trans. Pattern Anal. Mach. Intell. 39 (2), pp. 227–241. External Links: ISSN 0162-8828, Link, Document Cited by: §2.
  • [28] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov (2019) RoBERTa: A robustly optimized BERT pretraining approach. CoRR abs/1907.11692. External Links: Link, 1907.11692 Cited by: §1, §2.
  • [29] J. Mairal, J. Ponce, G. Sapiro, A. Zisserman, and F. R. Bach (2009) Supervised dictionary learning. In Advances in Neural Information Processing Systems 21, D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou (Eds.), pp. 1033–1040. External Links: Link Cited by: §2.
  • [30] M. Mohri, A. Rostamizadeh, and A. Talwalkar (2018) Foundations of machine learning. MIT press. Cited by: §B.3.
  • [31] M. Noroozi and P. Favaro (2016) Unsupervised learning of visual representations by solving jigsaw puzzles. In ECCV, Cited by: §2.
  • [32] K. Nozawa, P. Germain, and B. Guedj (2019) PAC-bayesian contrastive unsupervised representation learning. External Links: 1910.04464 Cited by: §2.
  • [33] D. Pathak, P. Krähenbühl, J. Donahue, T. Darrell, and A. A. Efros (2016) Context encoders: feature learning by inpainting. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2536–2544. Cited by: §2.
  • [34] L. Pinto and A. Gupta (2015) Supersizing self-supervision: learning to grasp from 50k tries and 700 robot hours. 2016 IEEE International Conference on Robotics and Automation (ICRA), pp. 3406–3413. Cited by: §2.
  • [35] S. Rifai, Y. N. Dauphin, P. Vincent, Y. Bengio, and X. Muller (2011) The manifold tangent classifier. In Advances in Neural Information Processing Systems 24, J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. Pereira, and K. Q. Weinberger (Eds.), pp. 2294–2302. External Links: Link Cited by: §2.
  • [36] F. Rodriguez and G. Sapiro (2008) Sparse representations for image classification: learning discriminative and reconstructive non-parametric dictionaries. Technical report MINNESOTA UNIV MINNEAPOLIS. Cited by: §2.
  • [37] D. E. Rumelhart and J. L. McClelland (1987) Learning internal representations by error propagation. In Parallel Distributed Processing: Explorations in the Microstructure of Cognition: Foundations, Vol. , pp. 318–362. Cited by: §1.
  • [38] M. Sadeghi, M. Babaie-Zadeh, and C. Jutten (2013) Dictionary learning for sparse representation: a novel approach. IEEE Signal Processing Letters 20 (12), pp. 1195–1198. Cited by: §1.
  • [39] N. Saunshi, O. Plevrakis, S. Arora, M. Khodak, and H. Khandeparkar (2019-09–15 Jun) A theoretical analysis of contrastive unsupervised representation learning. In Proceedings of the 36th International Conference on Machine Learning, K. Chaudhuri and R. Salakhutdinov (Eds.), Proceedings of Machine Learning Research, Vol. 97, Long Beach, California, USA, pp. 5628–5637. External Links: Link Cited by: §2.
  • [40] P. Smolensky (1986) Information processing in dynamical systems: foundations of harmony theory. In Parallel distributed processing: explorations in the microstructure of cognition, vol. 1: foundations, pp. 194–281. Cited by: Appendix A.
  • [41] B. Sofman, E. Lin, J. Bagnell, J. Cole, N. Vandapel, and A. Stentz (2006-11) Improving robot navigation through self‐supervised online learning. J. Field Robotics 23, pp. 1059–1075. External Links: Document Cited by: §2.
  • [42] L. G. Valiant (1984) A theory of the learnable. Communications of the ACM 27 (11), pp. 1134–1142. Cited by: §1.
  • [43] L. van der Maaten and G. Hinton (2008) Visualizing data using t-SNE. Journal of Machine Learning Research 9, pp. 2579–2605. External Links: Link Cited by: §D.1, §5.3.
  • [44] R. Vershynin (2018) High-dimensional probability: an introduction with applications in data science. Vol. 47, Cambridge university press. Cited by: §B.2.
  • [45] P. Vincent, H. Larochelle, Y. Bengio, and P. Manzagol (2008) Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th International Conference on Machine Learning, ICML ’08, New York, NY, USA, pp. 1096–1103. External Links: ISBN 9781605582054, Link, Document Cited by: §1.
  • [46] Z. Yang, Z. Hu, R. Salakhutdinov, and T. Berg-Kirkpatrick (2017-06–11 Aug) Improved variational autoencoders for text modeling using dilated convolutions. In Proceedings of the 34th International Conference on Machine Learning, D. Precup and Y. W. Teh (Eds.), Proceedings of Machine Learning Research, Vol. 70, International Convention Centre, Sydney, Australia, pp. 3881–3890. External Links: Link Cited by: §1.
  • [47] C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals (2016) Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530. Cited by: §B.3.
  • [48] R. Zhang, P. Isola, and A. Efros (2016-10) Colorful image colorization. In European Conference on Computer Vision, Vol. 9907, pp. 649–666. External Links: ISBN 978-3-319-46486-2, Document Cited by: §2.
  • [49] R. Y. Zhang, P. Isola, and A. A. Efros (2017-11-06) Split-brain autoencoders: unsupervised learning by cross-channel prediction. In Proceedings - 30th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Proceedings - 30th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, United States, pp. 645–654 (English (US)). External Links: Document Cited by: §1.

Appendix A Instantiations of Functional Regularization

Here we show that several unsupervised (self-supervised) representation learning strategies can be viewed as imposing a learnable function to regularize the representations being learned. We note that the class $\mathcal{G}$ can be an index set instead of a class of functions; our framework applies as long as the loss $\ell_r(h, g, x)$ is well defined (see the manifold learning example). $\mathcal{G}$ can also have only a single $g$, corresponding to the special case of a fixed regularizer (see the norm penalty example).

Auto-encoder.

Auto-encoders use an encoder function $h$ to map the input $x$ to a lower dimensional space and a decoder network $g$ to reconstruct the input back from $h(x)$ using a MSE loss. One can view $g$ as a regularizer on the feature representation $h(x)$ through the regularization loss $\ell_r(h, g, x) = \|x - g(h(x))\|_2^2$. $\mathcal{H}_\tau$ is the subset of representation functions with at most $\tau$ reconstruction error using the best decoder in $\mathcal{G}$.

Variants of standard auto-encoders like denoising auto-encoders or sparse auto-encoders can be formulated similarly as a functional regularization on the representation being learnt.

Masked Self-supervised Learning.

Masked self-supervision techniques, in abstract terms, cover a portion of the input and then predict the masked input portion [12]. More concretely, say the input $x$ is masked as $\tilde{x}$ and a function $g$ is learned to predict the masked input portion from an input representation $h(\tilde{x})$. This function, used to reconstruct the masked portion, can be viewed as imposing a regularization on $h$ through a MSE regularization loss given by $\ell_r(h, g, x) = \|x_{\mathrm{masked}} - g(h(\tilde{x}))\|_2^2$. $\mathcal{H}_\tau$ is the subset of $\mathcal{H}$ which have at most $\tau$ MSE on predicting the masked portion using the best function $g \in \mathcal{G}$.

Variational Auto-encoder.

VAEs encode the input $x$ as a distribution over a parametric latent space instead of a single point, and sample a latent $z$ from it to reconstruct $x$ using a decoder $g$. The encoder is used to model the underlying mean $\mu(x)$ and co-variance matrix $\Sigma(x)$ of the distribution over $z$. VAEs are trained by minimising the loss $\ell_r(h, g, x) = \mathbb{E}_{z \sim \mathcal{N}(\mu(x), \Sigma(x))}\left[\|x - g(z)\|_2^2\right] + \mathrm{KL}\left(\mathcal{N}(\mu(x), \Sigma(x)) \,\|\, p(z)\right)$, where $p(z)$ is specified as the prior distribution over $z$ (e.g., $\mathcal{N}(0, I)$). The encoder can be viewed as the representation function $h$, the decoder as the learnable regularization function $g$, and the loss as the regularization loss in our framework. Then $\mathcal{H}_\tau$ is the subset of encoders which have at most $\tau$ VAE loss when using the best decoder for it.

Manifold Learning through the Triplet Loss.

Learning manifold representations through metric learning is a popular technique used in computer vision applications [23]. A triplet loss formulation is used to learn a distance metric for the representations, by trying to minimise this metric between a baseline and a positive sample and maximising the metric between the baseline and a negative sample. This is achieved by learning a representation function $h$ for an input $x$. Considering a triple of input samples $(x, x^+, x^-)$ corresponding to a baseline, a positive and a negative sample, we use a loss of the form $\ell_r(h, (x, x^+, x^-)) = \max\{0,\ \|h(x) - h(x^+)\|_2^2 - \|h(x) - h(x^-)\|_2^2\}$ to learn $h$. This is a special instantiation of our framework using a dummy $\mathcal{G}$ having a single function $g$, where the regularization loss is computed over a triple of input samples.

Further, one can also consider some variants of the standard triplet loss formulation under our functional regularization perspective. For example, let the triplet loss be $\max\{0,\ \|h(x) - h(x^+)\|_2^2 - \|h(x) - h(x^-)\|_2^2 + \alpha\}$, where $\alpha$ is a margin between the positive and negative pairs. When $\alpha$ is learnable, this corresponds to a functional regularization where $g$ is the margin $\alpha$, and the regularization loss is the triplet loss above. In this case, the class $\mathcal{G}$ is not defined on top of the representation $h(x)$. However, our framework and the sample complexity analysis can still be applied through the definition of the regularization loss.
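A short PyTorch sketch of this variant; making `margin` a tensor with `requires_grad=True` is what lets it play the role of the learnable regularization function here:

```python
import torch

def triplet_reg_loss(h, x, x_pos, x_neg, margin):
    """Triplet regularization loss; `margin` may be a learnable scalar tensor."""
    d_pos = ((h(x) - h(x_pos)) ** 2).sum(dim=1)       # metric to the positive sample
    d_neg = ((h(x) - h(x_neg)) ** 2).sum(dim=1)       # metric to the negative sample
    return torch.clamp(d_pos - d_neg + margin, min=0.0).mean()
```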

Sparse Dictionary Learning.

Sparse dictionary learning is an unsupervised learning approach to obtain a sparse low-dimensional representation of the input data. Here we consider a distributional view of sparse dictionary learning. Given a distribution $\mathcal{D}_U$ over unlabelled data and a sparsity hyper-parameter $\lambda$, we want to find a dictionary matrix $B$ and a sparse representation $h(x)$ for each $x$, so as to minimize the error $\mathbb{E}_{x \sim \mathcal{D}_U}[\ell_r(h, x)]$, where $\ell_r$ is the error on one point defined as $\ell_r(h, x) = \|x - B h(x)\|_2^2 + \lambda \|h(x)\|_1$, subject to the constraint that each column of $B$ has norm bounded by 1. The learned representations can then be used for a target prediction task. Under our framework, we can view the representation function as corresponding to $h$, with $B$ as the parameter of the representation function. The regularization function class $\mathcal{G}$ has a single $g$, and the regularization loss is $\ell_r$ above.
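A sketch of the per-batch regularization loss under the l1 formulation written above (the l1 penalty and the name `lam` are assumptions of this reconstruction):

```python
# Sparse-coding regularization: reconstruction error plus an l1 sparsity penalty
# on the codes; B is the dictionary (d x r) and codes are the representations (n x r).
def dictionary_reg_loss(B, codes, x_unlab, lam=0.1):
    recon_err = ((x_unlab - codes @ B.T) ** 2).sum(dim=1)
    sparsity = codes.abs().sum(dim=1)
    return (recon_err + lam * sparsity).mean()
```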

Our framework also captures an interesting variant of dictionary learning. Consider another dictionary matrix $B'$ and a hyper-parameter $\rho$. The representation function still corresponds to $h$, with $B$ as the parameter. The regularization function class is now given by $\mathcal{G} = \{ g = B' : \|B' - B\| \le \rho \}$, and the regularization loss is defined as $\ell_r(h, g, x) = \|x - B' h(x)\|_2^2 + \lambda \|h(x)\|_1$. This special case of dictionary learning allows the encoding and decoding steps to use two different dictionaries $B$ and $B'$ while constraining the difference between them. When $\rho = 0$, this variant reduces to the original version described earlier.

Explicit Norm Penalty.

Techniques imposing explicit regularization on the representation being learned often add an $\ell_2$ norm penalty on the representation, i.e., $\|h(x)\|_2^2$, to the prediction loss while jointly training $h$ and $f$. This can be viewed as a special case of our framework using a fixed regularization function, with the regularization loss $\ell_r(h, g, x) = \|h(x)\|_2^2$.
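In the interface of the sketch in Section 4, this fixed regularizer is simply the following (the unused `g` argument reflects that $\mathcal{G}$ has a single fixed element):

```python
def norm_penalty_reg_loss(h, g, x_unlab):
    """Fixed regularizer: an l2 norm penalty on the representation itself; g is unused."""
    return (h(x_unlab) ** 2).sum(dim=1).mean()
```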

Restricted Boltzmann Machines.

Restricted Boltzmann Machines (RBM) [40, 22] generate hidden representations $z$ for an input $x$ through unsupervised learning on unlabeled data. RBMs are characterized by a joint distribution over the input $x$ and the representation $z$: $p(x, z) = \frac{1}{Z} \exp(-E(x, z))$, where $Z$ is the partition function and $E$ is the energy function defined as $E(x, z) = -a^\top x - b^\top z - x^\top W z$, where $(a, b, W)$ are parameters to be learned.

Then $p(z \mid x)$, for a fixed $x$, is a distribution parameterized by $b$ and $W$, which can be denoted as $p_{b, W}(z \mid x)$. Similarly, $p(x \mid z)$ is parameterized by $a$ and $W$ and thus can be denoted as $p_{a, W}(x \mid z)$. Given unlabeled data from $\mathcal{D}_U$, the objective of the RBM is to minimize the negative log-likelihood $\mathbb{E}_{x \sim \mathcal{D}_U}[-\ln p(x)]$.

While the standard RBM objective does not have a direct analogy under our functional regularization framework, a heuristic variant can be formulated under our framework. Let $\mathbb{E}_{x \sim p}$ denote the expectation over the marginal distribution of $x$ in the RBM, $\mathbb{E}_{z \sim p}$ the expectation over the marginal distribution of $z$, and $\mathbb{E}_{x \sim \mathcal{D}_U}$ the expectation over $\mathcal{D}_U$. Then the following hold for the standard RBM:

(10)
(11)

In the heuristic variant, we replace the expectation over the RBM marginal of $x$ with the expectation over $\mathcal{D}_U$ in Equation (10):