1 Introduction
AI is undergoing a paradigm shift with the rise of big/foundation models bommasani2021opportunities ; yuan2022roadmap , e.g., BERT stickland2019bert , GPT-3 brown2020language , CLIP radford2021learning , DALL-Es ramesh2021zero ; ramesh2022hierarchical , MAE he2021masked , etc. Foundation models, often based on mask-and-predict pretraining and downstream finetuning, benefit from pretraining on broad data at scale and accordingly demonstrate diverse downstream task capabilities with impressive robustness stickland2019bert , adaptability bommasani2021opportunities ; he2021masked , and generalization brown2020language ; ramesh2021zero . Therefore, they are rapidly being integrated into real-world AI systems, e.g., BERT into Google search (footnote 1: https://blog.google/products/search/searchlanguageunderstandingbert/) and Codex chen2021evaluating into GitHub’s Copilot (footnote 2: https://copilot.github.com/). Despite their impressive performance and practical capabilities, a unified theoretical framework justifying foundation models remains missing bommasani2021opportunities ; yuan2022roadmap ; such a framework is crucial for their further improvement and extension and is likely a milestone for the foundation model community tamkin2021dabs .
To address that challenge, we first observe that the success of foundation models is mainly attributed to the following two flexibilities, in addition to increasingly powerful parallel computing techniques.


Data Flexibility. Foundation models are not “picky” about their training data, enabling training with large-scale/Internet-scale data of great diversity (e.g., across many domains). Such training data, often orders of magnitude larger than conventional machine learning datasets and gathered with minimal manual collection (i.e., human intervention), are likely more consistent with the underlying data distribution, leading to a narrowed training-test gap and therefore improved generalization and robustness of models.

Task Flexibility. Foundation models are often trained across many tasks in potentially many domains (footnote 3: this will be made clearer in Section 3.1 and Section 3.3); their multipurpose training nature may offer new leverage for learning shared, compositional, and intrinsic meta-knowledge encoded in the model parameters, leading to improved model performance, adaptability, and generalization capabilities papadimitriou2020learning ; wu2021lime ; lu2021pretrained ; aghajanyan2021muppet . Note that compositionality is a crucial ingredient of human intelligence lake2017building and may hold the key to out-of-distribution generalization bommasani2021opportunities .
Next, by referring to the literature baker2019emergent ; bommasani2021opportunities ; yuan2022roadmap and reviewing the development of deep learning, we perceive a vague avenue for artificial intelligence, that is, big data, big/foundation models, big learning, and so on. A clear trend in deep learning is that more exploited information yields better model performance, e.g., utilizing big data as a source of abundant information and developing big/foundation models to exploit that information from the model perspective. However, we observe that most existing machine learning paradigms do not comprehensively exploit the abundant information within big training data. By contrast, the fundamental unconscious mind and the vision system of human brains are excellent at exquisite information exploitation in a multitasking manner bargh2008unconscious ; mesquita2015human ; ludwig2014foveal ; saarela2015integration .
Motivated by the above observations, we propose to further strengthen those flexibilities with more exquisite exploitation of data information from the learning perspective, via a newly introduced universal machine learning paradigm named big learning, mimicking human brains.
The presented big learning leverages a universal/foundation model to simultaneously model many-to-all joint/conditional/marginal data distributions, manifested as a “big” training task that contains many machine learning paradigms as special cases. Big learning comes with three main contributions.


It serves as a theoretical platform for justifying, analyzing, and improving big/foundation models, because most of them are implicitly doing (parts of) big learning, as revealed in Section 3.

By modeling many-to-all joint/conditional/marginal data distributions, big learning (i) comprehensively exploits the available data information (thus focusing on the data essence) and delivers the corresponding data capabilities (valuable for, e.g., data completion and flexible counterfactual analysis) and (ii) embraces statistical sharing power to implicitly summarize intrinsic compositional data meta-knowledge within model parameters, enhancing the model’s robustness, adaptability, and generalization capabilities.

It delivers extraordinary data and task flexibilities by enabling large-scale training with complete/incomplete data on diverse learning tasks across various domains, leading to (i) minimal human intervention in data collection and learning-task specification, (ii) a significantly reduced training-test (or pretraining-finetuning) gap, and (iii) potentially an avenue to true self-learning on the Internet.
2 Related Work and Preliminary
Big/Foundation Models. Taking shape in NLP, big/foundation models have drastically changed the research and practice of AI bommasani2021opportunities ; yuan2022roadmap . For example, BERT stickland2019bert and GPT-3 brown2020language have had a huge impact on the field of NLP, whereas CLIP radford2021learning , DALL-Es ramesh2021zero ; ramesh2022hierarchical , MAE he2021masked , Florence yuan2021florence , etc. have attracted wide attention from both the NLP and CV research fields. Most foundation models are trained in a mask-and-predict manner, i.e., holding out a portion of the input and then training the model to predict the missing content, as demonstrated in Fig. (b) and Fig. (b). We will reveal that mask-and-predict learning is a special case of a principled universal machine learning paradigm, i.e., the proposed big learning, which justifies the success of foundation models and provides theoretical guidance for their improvement.
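The mask-and-predict recipe described above can be sketched in a few lines. This is only an illustration under stated assumptions: the helper name `mask_and_predict_split`, the 50% mask ratio, and the toy token list are hypothetical and are not taken from any of the cited models.

```python
import random

def mask_and_predict_split(tokens, mask_ratio, rng):
    """Hold out a random portion of a sequence; return (visible, targets)."""
    n_masked = max(1, round(mask_ratio * len(tokens)))
    masked = set(rng.sample(range(len(tokens)), n_masked))
    # Visible entries are what the model sees; targets are what it must predict.
    visible = [(i, tok) for i, tok in enumerate(tokens) if i not in masked]
    targets = [(i, tok) for i, tok in enumerate(tokens) if i in masked]
    return visible, targets

rng = random.Random(0)  # fixed seed so the sketch is reproducible
tokens = ["foundation", "models", "learn", "from", "broad", "data"]
visible, targets = mask_and_predict_split(tokens, mask_ratio=0.5, rng=rng)
print("model input :", visible)
print("to predict  :", targets)
```

Positions are kept alongside the tokens because the model needs to know where the held-out content sits in the sequence.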
Transformers and Vision Transformers (ViTs). Based on the self-attention mechanism vaswani2017attention , Transformers have been serving as the de facto model architecture for foundation models in both NLP and CV fields radford2018improving ; liu2019roberta ; stickland2019bert ; radford2019language ; brown2020language . Transformers often take as input a sequence of discrete indexes of a given length and output the corresponding latent embeddings with a given embedding dimension for downstream applications; attention is applied among the sequence locations layer by layer. ViTs dosovitskiy2020image are Transformers modified for dealing with continuous images, which have been empirically shown to have better generalization and robustness than convolutional neural networks naseer2021intriguing . Different from Transformers, which embed discrete indexes into high-dimensional continuous features, ViTs directly employ flattened image patches as those continuous features, as demonstrated in Fig. (b). It’s well known that Transformers/ViTs are often over-parameterized and therefore data/information hungry; we will reveal that this property of Transformers/ViTs, together with their great modeling flexibility, exactly matches our big learning.
Multimodal Learning Objectives. Two famous examples of multimodal learning objectives are (i) maximum likelihood learning with discrete categorical observations, i.e., the cross-entropy loss, and (ii) adversarial learning for continuous observations, i.e., the GAN loss goodfellow2014generative .


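Given data-label pairs $(x, y) \sim q(x, y)$ and a classifier $p_\theta(y \mid x)$ that outputs the probabilities of $x$ belonging to category $c$, the cross-entropy loss is identical to
$$\mathbb{E}_{q(x, y)}\Big[-\sum_{c} \mathbb{1}[y = c]\, \log p_\theta(c \mid x)\Big] \tag{1}$$
where $\mathbb{1}[y = c]$ evaluates to 1 if $y = c$, 0 otherwise, and the optimal $p_{\theta^*}(y \mid x) = q(y \mid x)$. Note that both $q(y \mid x)$ and $p_\theta(y \mid x)$ are often regarded as categorical distributions that are capable of modeling multiple modes; for example, consider the diverse generations of GPT-3 brown2020language .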

Generative adversarial nets (GANs) are widely used for synthesizing highly realistic images brock2018large ; Karras_2019_CVPR ; karras2019analyzing ; karras2021alias . A standard GAN goodfellow2014generative consists of a generator $G_\theta$ and a discriminator $D_\phi$, which are trained in an adversarial manner via
$$\min_\theta \max_\phi \; \mathbb{E}_{q(x)}\big[\log D_\phi(x)\big] + \mathbb{E}_{p(z)}\big[\log\big(1 - D_\phi(G_\theta(z))\big)\big] \tag{2}$$
where $q(x)$ is the data distribution and $p_\theta(x)$ is the generated distribution with the generative process $x = G_\theta(z),\ z \sim p(z)$. Here $p(z)$ is an easy-to-sample distribution, like a normal distribution. With the optimal discriminator $D_{\phi^*}(x) = q(x) / (q(x) + p_\theta(x))$, Eq. (2) minimizes the Jensen-Shannon divergence between $q(x)$ and $p_\theta(x)$ goodfellow2014generative . Recently, the community has begun to explore integrating ViTs into GANs jiang2021transgan ; lee2021vitgan ; zhao2021improved ; zhang2021styleswin . For example, ViTGAN lee2021vitgan , delivering SOTA generative performance, employs simple modifications to the ViT architecture to form the ViT-based generator and discriminator, but adopts many techniques to regularize the discriminator for stabilized training. Empirically, we also find it challenging to stabilize GAN training with a ViT-based discriminator.
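The two objectives above can be made concrete with a small numerical sketch. It assumes hand-written probabilities in place of real model outputs; the function names `cross_entropy` and `gan_value` are hypothetical helpers introduced only for this illustration.

```python
import math

def cross_entropy(probs, labels):
    """Average negative log-probability assigned to the true category."""
    return -sum(math.log(p[y]) for p, y in zip(probs, labels)) / len(labels)

def gan_value(d_real, d_fake):
    """Monte Carlo estimate of the standard GAN value function, given
    discriminator outputs in (0, 1) on real and on generated samples."""
    real_term = sum(math.log(d) for d in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - d) for d in d_fake) / len(d_fake)
    return real_term + fake_term

# Toy classifier outputs for two samples over three categories (assumed values).
probs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
labels = [0, 1]
ce = cross_entropy(probs, labels)

# When generated and real data are indistinguishable, the optimal
# discriminator outputs 0.5 everywhere and the value equals -log(4).
v = gan_value([0.5, 0.5], [0.5, 0.5])
print(ce, v)
```

The value of -log 4 at equilibrium is the well-known reference point of the standard GAN analysis; anything above it signals that the discriminator can still tell real from generated samples.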
3 Big Learning: A Universal Machine Learning Paradigm
As mentioned in the Introduction, the presented big learning has extraordinary data flexibility: the training data may be incomplete, e.g., with missing dimensions/values in the feature $x$ (with length $L$ and dimension $D$, like flattened patches of an image) or with missing supervisions $y$ (like a label).
To better introduce our big learning, we first present its main idea in simplified unsupervised settings, where a data sample contains only a feature $x$, and then generalize its scope to the general settings, where a data sample contains both a feature $x$ and a supervision $y$. Note that in both cases the data sample may be incomplete, with missing dimensions/values.
3.1 Unsupervised Big Learning
In unsupervised settings, we focus mainly on generation tasks for introduction. Given a collection of data samples from the underlying data distribution $q(x)$, mainstream machine learning paradigms concentrate solely on joint modeling, i.e., constructing a joint model $p_\theta(x)$ to resemble the joint data distribution $q(x)$, or informally $p_\theta(x) \to q(x)$, using GANs goodfellow2014generative ; brock2018large ; Karras_2019_CVPR ; karras2019analyzing ; karras2021alias , VAEs kingma2013auto ; dai2018diagnosing , Flows dinh2014nice ; dinh2016density ; kingma2018glow , diffusion models ho2020denoising ; song2020score , etc.
Motivations. We highlight two practical situations where joint modeling is restrictive. (i) In addition to a potentially limited number of complete data samples, plenty of incomplete ones are often available, e.g., in medical/biological scenarios. Joint modeling cannot handle incomplete data, yet simply discarding those incomplete samples (and the valuable information therein) is wasteful and inexpedient, especially where data collection is expensive. Besides, discarding incomplete data samples potentially introduces unexpected interventions, likely damaging the i.i.d. assumption that lays the foundation of machine learning. (ii) It’s worth highlighting that, given a dataset with complete data, one already receives data samples from all joint/conditional/marginal distributions; therefore, ideally, one should comprehensively exploit that valuable information, e.g., to form all the associated data capabilities (like various conditional sampling for data completion) or to leverage different joint/conditional/marginal perspectives (formed as different tasks) to regularize each other. (Footnote 4: If the joint modeling is learned perfectly, it’s possible but often computationally expensive to recover all conditional/marginal capabilities. Moreover, that perfect modeling assumption is likely violated in practice, hindering the retrieval of conditional/marginal capabilities.)
The above practical situations motivate us to model many-to-all joint/conditional/marginal distributions simultaneously (manifested as “big” learning with massive tasks in high dimensions), so as to enable flexible training with all available complete/incomplete data and, at the same time, “collect” comprehensive data capabilities via exquisite data exploitation. Note that incomplete data are readily utilized in the corresponding conditional/marginal tasks.
However, it’s not straightforward to do that big learning, because of the massive learning tasks rising from modeling all joint/conditional/marginal distributions. Consider a simple length dimensional problem, where , , , and the length index set ,

the goal of the joint modeling is to deliver ;

naively, big learning needs to simultaneously construct models
(3) so as to yield the corresponding data capabilities, like with the available data pairs from training data.
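To give a feel for how quickly the number of tasks grows, the following sketch enumerates every joint/conditional/marginal task for a short sequence. It rests on an assumed convention that is not spelled out in the text: a task is a pair of disjoint index subsets (inputs, outputs) with a nonempty output subset and a possibly empty input subset. The function name `enumerate_tasks` is hypothetical.

```python
from itertools import combinations

def enumerate_tasks(length):
    """List all (inputs, outputs) index pairs: outputs nonempty, inputs disjoint."""
    indices = range(1, length + 1)
    tasks = []
    for n_in in range(length):  # inputs may be empty but must leave room for outputs
        for inputs in combinations(indices, n_in):
            rest = [i for i in indices if i not in inputs]
            for n_out in range(1, len(rest) + 1):
                for outputs in combinations(rest, n_out):
                    tasks.append((inputs, outputs))
    return tasks

tasks = enumerate_tasks(3)
print(len(tasks))  # 19 under this convention, i.e. 3**3 - 2**3
for inputs, outputs in tasks[:5]:
    print("given", inputs, "-> model", outputs)
```

Under this convention each index is an input, an output, or unused, and at least one index must be an output, so the count is 3^L - 2^L for length L; building one model per task quickly becomes impractical.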
Table 1: Conventional joint modeling versus unsupervised big learning.

| Compared Methods | Joint Modeling | Unsupervised Big Learning |
|---|---|---|
| Intuitively | Straightforward | Complicated/Intractable |
| Training Data | Complete Data | Complete/Incomplete Data |
| Data Exploitation | Single Joint Perspective | Exquisite Many-to-all Perspectives |
| Capabilities After Training | Joint | Joint/Conditional/Marginal |
| Potential Downstream Applications | Limited | Extremely Abundant |
Unsupervised Big Learning. By referring to Eq. (3) and considering a general problem with feature $x$ and index set $\mathbb{L}$, ideally one would need to build a very large number of joint/conditional/marginal models in total (a count expressed through numbers of combinations from a set of elements), which is clearly prohibitive. (Footnote 5: Considering implementation complexities, we only consider big learning along the length dimension here; it’s straightforward but likely expensive to generalize further.) We alternatively propose to leverage a universal model
$$p_\theta(x_T \mid x_S) \longrightarrow q(x_T \mid x_S) \tag{4}$$
with shared parameters $\theta$ to model all of them simultaneously, where $S$ and $T$ denote the random non-overlapping input/output index subsets, respectively. Note that $S \cup T$ need not be $\mathbb{L}$, indicating that our unsupervised big learning can naturally handle incomplete data (with the model architecture and training objective detailed below). Fig. 3 demonstrates the unsupervised big learning based on Eq. (4), and Table 1 compares it with the conventional joint modeling.
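A universal model trained on random input/output index subsets needs a way to draw those subsets at each training step. The sketch below shows one possible sampler; it is an assumption for illustration (uniform over all valid pairs via rejection sampling), not the sampling scheme used in the experiments, and `sample_task` is a hypothetical name.

```python
import random

def sample_task(length, rng):
    """Sample non-overlapping input/output index subsets with nonempty outputs."""
    while True:
        # Each index independently becomes an input, an output, or is left unused.
        roles = [rng.choice(("input", "output", "unused")) for _ in range(length)]
        if "output" in roles:  # reject draws that have nothing to predict
            break
    inputs = [i for i, role in enumerate(roles) if role == "input"]
    outputs = [i for i, role in enumerate(roles) if role == "output"]
    return inputs, outputs

rng = random.Random(0)
for _ in range(3):
    inputs, outputs = sample_task(length=4, rng=rng)
    print("condition on", inputs, "-> generate", outputs)
```

Because indices may be left unused, the union of inputs and outputs need not cover the full index set, which mirrors how incomplete samples can still supply training tasks.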
Model Architecture and Training Objective. Since the lengths/dimensions of the input and output are not fixed, it’s not easy to build the universal model on convolutions. Motivated by the modeling flexibility of Transformers vaswani2017attention and the fact that most existing foundation models are built on top of Transformers and implicitly do big learning (as revealed below), we propose to build the universal model in Eq. (4) on the Transformer architecture. The training objective is task-specific and is often specified as a simple and commonly used machine learning objective, as exemplified below, where we reveal step by step that common foundation models are implicitly doing big learning.


Let $x$ denote a sequence of continuous embeddings, such as the flattened patches of an image in ViTs dosovitskiy2020image ; he2021masked ; wei2021masked , as demonstrated in Fig. (b). Big learning aims at learning the data capabilities of generating a subset of image patches $x_T$ given another subset of patches $x_S$, manifested as versatile data completions (nonempty $S$) or joint/marginal generations (empty $S$). Considering the diversity of input/output patches, it’s natural to build the universal model as a bidirectional ViT that models the generative process of $x_T$ conditioned on $x_S$, mimicking a (conditional) GAN generator. An additional noise token/input is often necessary lee2021vitgan ; zhang2021styleswin , in addition to a GAN loss and a specially designed GAN discriminator (see the following example and Appendix A for details). One may also consider extensions associated with VAEs kingma2013auto , Flows dinh2014nice , and diffusion models ho2020denoising ; song2020score .
MAE he2021masked as a Special Case. By employing a unimodal Gaussian likelihood whose mean is given by the ViT model, our unsupervised big learning based on the maximum likelihood objective recovers the MAE, which predicts $x_T$ based on $x_S$ using the mean squared error loss, under additional constraints on the index subsets $S$ and $T$.

Let $x$ denote a sequence of discrete tokens, like text words stickland2019bert ; brown2020language or vector-quantized image patches ramesh2021zero . The task is to predict the output/target/masked tokens $x_T$ given the input/source ones $x_S$. It’s straightforward to build the universal model as a bidirectional Transformer encoder, which outputs, at each output index, the probabilities of a categorical distribution for prediction. With the cross-entropy loss and additional constraints on the index subsets $S$ and $T$, big learning exactly reduces to the masked language modeling of BERT stickland2019bert . Of course, unsupervised big learning also recovers autoregressive language models like GPTs radford2018improving ; radford2019language ; brown2020language , with special settings for both $S$ and $T$.
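The link between a Gaussian likelihood and the mean squared error, which underlies the MAE special case above, can be checked numerically. The sketch assumes an isotropic Gaussian with a fixed standard deviation; the pixel values and the names `gaussian_nll` and `mse` are illustrative assumptions.

```python
import math

def gaussian_nll(target, mean, sigma=1.0):
    """Negative log-likelihood of independent Gaussians with shared std sigma."""
    squared_error = sum((t - m) ** 2 for t, m in zip(target, mean))
    log_norm = len(target) * math.log(sigma * math.sqrt(2.0 * math.pi))
    return 0.5 * squared_error / sigma ** 2 + log_norm

def mse(target, mean):
    """Mean squared error over the masked (output) entries."""
    return sum((t - m) ** 2 for t, m in zip(target, mean)) / len(target)

masked_pixels = [0.2, 0.9, 0.4]   # held-out entries to be predicted
pred_a = [0.1, 0.8, 0.5]          # a close prediction
pred_b = [0.6, 0.3, 0.0]          # a poor prediction

nll_gap = gaussian_nll(masked_pixels, pred_a) - gaussian_nll(masked_pixels, pred_b)
mse_gap = mse(masked_pixels, pred_a) - mse(masked_pixels, pred_b)
# With sigma = 1 the two gaps differ only by the constant factor len / 2.
print(nll_gap, 0.5 * len(masked_pixels) * mse_gap)
```

Since the normalization term does not depend on the prediction, minimizing this likelihood-based objective and minimizing the mean squared error select the same predictor.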
Take the former with GAN objectives for example. For simplicity, we illustrate with the standard GAN goodfellow2014generative ; one may also consider generalizing to other GAN variants. Given a universal model , modeling the generative processes of given for all pairs, and a randomly sampled union set with different realizations, e.g., ,


one could match any model distributions to the corresponding underlying (subset) data distribution with
(5) where the optimal discriminator is with . is also modeled as a Transformer (see Appendix A for details).

one can also enable “communications” among any two model distributions via
(6) where the “communication” discriminator is constructed indirectly with the same neural network
from Eq. (5). Proofs are given in Appendix A.
Ideally, after training, the model distribution should match the data distribution for all pairs of input/output index subsets, i.e., big learning delivers all joint/conditional/marginal data capabilities. A figure in Appendix A demonstrates the main idea.
By comparing with the MAE he2021masked (see Fig. (b)), which employs a unimodal Gaussian likelihood, it’s clear that our unsupervised big learning, based on the GAN objectives in Eqs. (5) and (6), is capable of handling practical situations where the outputs are multimodal given the inputs. Besides, different from the MAE, which assumes independence among pixels, our unsupervised big learning leverages a GAN discriminator to implicitly capture the underlying pixel-level dependence.
3.2 Discussions on Unsupervised Big Learning
The following discussions are readily extended to our big learning presented in Section 3.3.
Can we share one universal model among diverse input/output index-subset pairs? Yes. We justify our positive answer in the following three steps. Note that we focus solely on the modeling here.


For modeling $x_T$ given $x_S$ with a fixed $S$ and varying $T$, one can of course employ a shared “joint” model for the joint generative process of all remaining entries given $x_S$, because its “marginal” generative process associated with $T$ exactly models the corresponding conditional distribution of $x_T$ given $x_S$.

For a fixed union $S \cup T$ but different $(S, T)$ pairs, extensive and successful empirical evaluations of existing foundation models, like BERT stickland2019bert , GPTs radford2018improving ; radford2019language ; brown2020language , and MAE he2021masked , have shown that it’s possible to employ a shared universal model.

Combining the previous two steps, we conclude our positive answer.
On the model capacity of the universal model. Collecting many-to-all data capabilities within one universal/foundation model places tremendous demands on its model capacity. Fortunately, Transformers are well known to be data/information hungry, and they offer modeling flexibility and amenability to parallel computing, making them suitable for the universal model. Moreover, huge Transformers have already emerged, e.g., BaGuaLu with 174 trillion parameters ma2022bagualu . Therefore, model capacity might not be an issue for big learning (please also refer to our experiments).
On the generalization of model parameters and latent features. As mentioned above, existing big/foundation models, which show extraordinary robustness, adaptability, and generalization capabilities, are implicitly doing big learning. Accordingly, we leverage our big learning to explain why they have such amazing characteristics.


Firstly, by referring to Eq. (4) and Fig. 3, both the model parameters and the latent features of the universal model are shared among many-to-all data tasks (with potentially complete/incomplete data and diverse learning objectives; footnote 6: this will be made clearer in Section 3.3), manifesting a massive multitask learning scenario that exquisitely exploits the data information with statistical sharing power. Because all tasks share a consistent goal of modeling (some perspective of) the one underlying data distribution, it’s expected that big learning would encourage the parameters (and also the latent features) to summarize the intrinsic data information and potentially form compositional and generally applicable data meta-knowledge papadimitriou2020learning ; wu2021lime ; lu2021pretrained , manifested as those amazing characteristics.

Secondly, along with delivering many-to-all joint/conditional/marginal data capabilities that have great practical value, big learning, benefiting from its massive training tasks, is also expected to minimize the influence of human intervention in data collection and learning-task specification, as well as to significantly reduce the training-test (or pretraining-finetuning) gap, which is believed to be crucial for justifying the real-world performance and extraordinary capabilities of big/foundation models stickland2019bert ; he2021masked .
On the weighting of massive training tasks. It’s worth highlighting that big learning comes with flexible weighting of its massive training tasks, via various sampling implementations of the input/output index-subset pair. How to optimally weight those training tasks is important but likely task-dependent; we leave that to future research. Alternatively, we notice that real-world training datasets are often composed of both complete and incomplete samples, meaning the corresponding pairs are already given; therefore, one may prefer to “let the data speak for themselves” in real-world applications. As for large-scale pretraining, one may also employ a specific sampling (like uniform) for the pair according to the available domain knowledge. It’s worth emphasizing that, despite different weighting strategies, the optimum is the same and consistent, i.e., informationally identical to the underlying data distribution.
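One way to “let the data speak for themselves” is to derive each training task from the observed entries of a sample. The sketch below is an assumption for illustration: missing values are marked with `None`, the observed indices are split at random into inputs and outputs, and `task_from_sample` is a hypothetical helper.

```python
import random

def task_from_sample(sample, rng):
    """Derive (inputs, outputs) index subsets from one possibly incomplete sample."""
    observed = [i for i, value in enumerate(sample) if value is not None]
    if not observed:
        raise ValueError("sample has no observed entries")
    n_out = rng.randint(1, len(observed))          # at least one output index
    outputs = sorted(rng.sample(observed, n_out))
    inputs = [i for i in observed if i not in outputs]
    return inputs, outputs

rng = random.Random(0)
incomplete = [0.3, None, 1.2, None, 0.7]           # entries 1 and 3 are missing
inputs, outputs = task_from_sample(incomplete, rng)
print("inputs:", inputs, "outputs:", outputs)
```

With this scheme the task weighting is induced by the missingness pattern of the dataset itself rather than by a hand-chosen distribution over index subsets.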
Generalizations based on domain knowledge. In practical situations, directly modeling the data within the observed domain may not be a good choice he2021masked ; wei2021masked . We reveal that, with trustworthy domain knowledge, one may alternatively model in a transformed domain (e.g., wei2021masked ; see Fig. (b)), where the inputs and/or outputs are transformed by domain-knowledge-inspired functions.
3.3 Big Learning
Based on the unsupervised big learning introduced in the previous sections, where a data sample contains only a feature $x$ with length $L$ and dimension $D$, we next present its generalized version, where a data sample contains both a feature $x$ and a supervision $y$. Accordingly, the random non-overlapping input/output index subsets are expanded to also cover the index set of $y$. Although we focus on a simple form of supervision for introduction, the following techniques and discussions readily generalize to more complicated real-world situations.
Thanks to the modeling flexibility of unsupervised big learning, we can readily generalize it into big learning in general settings (demonstrated in Fig. (a)), whose main idea is to model the conditional distribution of the outputs given the inputs for random pairs of the expanded index subsets, with the target being the underlying data distribution.
Model Architecture and Training Objective. For situations where $x$ and $y$ have the same data type (e.g., both denote a sequence of continuous image patches), the techniques presented in Section 3.1 generalize straightforwardly. We next elaborate on practical situations with multimodality gupta2021towards ; li2021towards ; ramesh2021zero ; ramesh2022hierarchical ; baevski2022data2vec , for example, one of them denoting a discrete token sequence while the other is a continuous embedding sequence. We reveal two solutions.


Transform one data type to the other for alignment. For example, we can vector-quantize the continuous embedding sequence into a sequence of discrete tokens, similar to DALL-E ramesh2021zero , and then employ techniques similar to those introduced in Section 3.1.
Recursively reuse to model the dependence of and . The key idea is to exploit the flexibility of our big learning. Specifically, we can unfold the learning via
(7) where and . Big learning with Eq. (7) first forwardpropagates twice through the universal/foundation model , with the output of the first propagation inserted to the input of the second one; after calculating the objective, big learning then backpropagates gradients twice to the parameters for model updating, thanks to the continuity of . Note both and have one unique data type after folding.
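The two-pass idea, where the output of a first propagation is fed into a second one, can be illustrated on a toy discrete distribution. This sketch assumes that the unfolding in Eq. (7) follows the chain rule of probability; the table `joint` and the names `first_pass` and `second_pass` are hypothetical stand-ins for the two propagations through the universal model.

```python
# Toy joint distribution over a binary supervision y and a binary feature x,
# keyed as (y, x). The numbers are arbitrary and only need to sum to one.
joint = {(0, 0): 0.1, (0, 1): 0.3, (1, 0): 0.4, (1, 1): 0.2}

def first_pass(y):
    """Stand-in for the first propagation: the marginal probability of y."""
    return sum(p for (y_value, _), p in joint.items() if y_value == y)

def second_pass(x, y):
    """Stand-in for the second propagation: x given the first output y."""
    return joint[(y, x)] / first_pass(y)

# Multiplying the two stages recovers the joint distribution exactly.
recovered = {(y, x): first_pass(y) * second_pass(x, y) for (y, x) in joint}
for key in joint:
    print(key, joint[key], round(recovered[key], 12))
```

Each stage has outputs of a single data type, which is what makes the recursive reuse attractive when the feature and the supervision differ in type.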
BERT Pretraining as a Special Case. Section 3.1 reveals that our (unsupervised) big learning contains the masked language modeling part of BERT pretraining as a special case. By further introducing a conditional independence simplification to Eq. (7), together with a particular setting of the index subsets, one can readily verify that our big learning recovers BERT pretraining, including both masked language modeling and next sentence prediction stickland2019bert .
3.4 Discussions on Big Learning
Table 2 (column headings): Supervised Learning | Self-supervised Learning | Unconditioned Generation | Conditioned Generation

Big learning serves as a universal machine learning paradigm. Benefiting from its modeling flexibility, big learning has most machine learning paradigms as special cases, including supervised learning, self-supervised learning, generative learning, their mixtures, etc., as shown in Table 2. (Footnote 7: We focus on the core idea here for demonstration and highlight that the implementation details can be task-specific, such as network architectures, max-likelihood/adversarial objectives, etc.) That universality of big learning, combined with its flexibility, enables flexible combinations of existing machine learning paradigms with knowledge communications (via the shared model parameters as well as training objectives like Eq. (6)); therefore, the proposed big learning might facilitate semantically diverse multitask self-learning on the Internet, producing brain-scale big/foundation models with reinforced performance, robustness, generalization, and general intelligence.
Big learning versus self-supervised contrastive learning. Contrastive learning he2020momentum ; chen2020simple ; grill2020bootstrap ; chen2021exploring focuses mainly on exploiting (image) domain prior knowledge to learn generally applicable data representations for downstream tasks. From the perspective of prior exploitation, contrastive learning is orthogonal to our big learning, which is mostly data driven. However, as mentioned in Section 3.2, the proposed big learning can be combined with contrastive learning to incorporate trustworthy domain priors, thanks to its great flexibility. Further discussions are given in Appendix C.
On the i.i.d. assumption. The i.i.d. assumption is one of the key foundations of deep learning, but it’s also well known that the training data collected for practical applications are rarely i.i.d., leading to a training-test mismatch that significantly hinders the reliability of deep learning models. Existing foundation models have begun to demonstrate increasing robustness and generalization capabilities towards that mismatch, benefiting from their implicit implementations of big learning. It’s reasonable to believe (for the following three reasons) that our big learning, embracing its remarkable flexibility for large-scale training data and massive training tasks, will significantly reduce that training-test mismatch (or pretraining-finetuning gap) and therefore behave robustly in real-world applications.


Manually collecting or filtering data samples will likely introduce unintentional interventions that violate the i.i.d. assumption and enlarge the training-test mismatch. The data flexibility of our big learning makes it possible to conduct training with minimal-to-none human intervention in data collection and, accordingly, to “let the data speak for themselves.”

Thanks to big learning’s remarkable flexibilities in complete/incomplete training data and massive learning tasks, its “experience horizons,” associated with both the data (i.e., the scope of the underlying data distribution) and the tasks, are significantly enlarged, making it less likely for a training-test mismatch (or pretraining-finetuning gap) to emerge.

Even with the same dataset of all complete samples, big learning is also expected to behave more robustly with respect to the i.i.d. assumption, because (i) collecting i.i.d. complete samples in the joint domain is often challenging for practical applications; (ii) the conditional tasks of big learning are always implemented with perfect and trustworthy i.i.d. samples from the corresponding conditional data distributions; and (iii) big learning enables communications among tasks, which is expected to bring benefits.
4 Experiments
Because of the wide scope of our big learning, it’s extremely challenging and time-consuming to comprehensively evaluate its effectiveness in general settings. We’d like to emphasize that the great success of existing big/foundation models has provided concrete evidence that supports our big learning. In what follows, we concentrate on the simplified but fundamental unsupervised big learning for demonstration, which is already a great challenge, as detailed below.
Based on Section 3.1 (more specifically, Eqs. (5) and (6)), we conduct unsupervised big learning with all joint/conditional/marginal data tasks (via sampling of the input/output index subsets) on the image datasets MNIST and CelebA. After training, we diversely test the data completion capabilities of the big-learned model, and then use abused, anomalous out-of-domain completion tasks to challenge the generalization capability of our big learning. See Appendix A for the experimental details.
4.1 Versatile Data Completion Capabilities with Adaptive Generation Diversity
We first test the big-learned data generation/completion capabilities with different ratios of the input index subset $S$ within the full index set $\mathbb{L}$. For a specific ratio, we either randomly sample the corresponding number of image patches or choose the initial portion of $\mathbb{L}$ to form the input $x_S$, which is then fed to the universal generator for image completion. Fig. 7 shows the results.
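The two ways of forming the input patches for a given ratio can be sketched as follows. The sketch assumes that the patches not selected as inputs are the ones to be completed; the 16-patch grid, the 25% ratio, and the name `choose_source` are illustrative assumptions rather than the experimental configuration.

```python
import random

def choose_source(num_patches, ratio, mode, rng):
    """Pick input patch indices for a given ratio; the rest are to be completed."""
    k = int(round(ratio * num_patches))
    if mode == "random":
        inputs = sorted(rng.sample(range(num_patches), k))
    elif mode == "initial":
        inputs = list(range(k))                 # the leading portion of the sequence
    else:
        raise ValueError("mode must be 'random' or 'initial'")
    chosen = set(inputs)
    outputs = [i for i in range(num_patches) if i not in chosen]
    return inputs, outputs

rng = random.Random(0)
print(choose_source(16, 0.25, "initial", rng))
print(choose_source(16, 0.25, "random", rng))
```

A ratio of zero yields an empty input set, which corresponds to unconditional generation driven by the noise alone.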
It’s clear that after big learning, the universal generator masters many-to-all joint/conditional/marginal data capabilities simultaneously. Besides, big learning automatically learns from the data an adaptive generation diversity that follows human intuition. Specifically, with limited-to-none $S$ (i.e., less conditional information), big learning delivers increasingly diverse generations that are mainly controlled by the noise, whereas, with larger ratios for $S$ (i.e., more conditional information), the generation becomes increasingly deterministic and gets close to the source image, as expected (see Appendix D for more results).
We then test the big-learned data completion capability with respect to various $S$ and noise settings, with the corresponding results summarized in Fig. 8. On the one hand, given an image and a random noise, big learning clearly delivers, for various $S$, diverse and (relatively) realistic generations on both MNIST (see the variations in class/stroke-thickness/shape/angle) and CelebA (see the varying identity/hairstyle/makeup/expression). On the other hand, given a specific $x_S$ with limited information, the big-learned generator, when fed different noises, also generates realistic images with diversity, justifying the multimodal generation power of big learning.
4.2 Remarkable Generalization on Abused Anomalous Outofdomain Completion
We also design abused, anomalous out-of-domain completion tasks to challenge the generalization of our big learning. Specifically, we intentionally design the input $x_S$ with (i) abused interventions (e.g., random relocation and duplication) applied to image patches, as shown in Fig. 9(a); (ii) mixed-up patches from different data samples (see Fig. 9(b)); and (iii) unseen out-of-domain image patches, as shown in Fig. 9(c)(d).
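The abused interventions listed above can be mimicked on a toy patch sequence. The sketch assumes patches are represented by labels in a flat list; the helper names `relocate`, `duplicate`, and `mix_up` are hypothetical and do not reflect the authors' implementation.

```python
import random

def relocate(patches, rng):
    """Random relocation: shuffle the positions of the given patches."""
    shuffled = list(patches)
    rng.shuffle(shuffled)
    return shuffled

def duplicate(patches, src, dst):
    """Duplication: overwrite the patch at position dst with the one at src."""
    out = list(patches)
    out[dst] = out[src]
    return out

def mix_up(patches_a, patches_b, take_from_b):
    """Mix-up: take the listed positions from sample B, the rest from sample A."""
    return [b if i in take_from_b else a
            for i, (a, b) in enumerate(zip(patches_a, patches_b))]

rng = random.Random(0)
sample_a = ["a0", "a1", "a2", "a3"]
sample_b = ["b0", "b1", "b2", "b3"]
print(relocate(sample_a, rng))
print(duplicate(sample_a, src=0, dst=2))
print(mix_up(sample_a, sample_b, take_from_b={1, 3}))
```

All three operations produce inputs that never occur in the training data, which is what makes them a test of generalization rather than of memorization.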
It’s clear that our big learning manages to handle these abused inputs with reasonable image completion; for example, the realistic characters with overall consistent style and smooth strokes in Fig. 9(a), the harmoniously completed faces even with mismatched face frame and hair color in Fig. 9(b), as well as the fluent and consistent out-of-domain completion with smooth junctions in Fig. 9(c)(d). Those surprising results from abused anomalous out-of-domain completion, consistent with existing findings ramesh2021zero ; ramesh2022hierarchical , justify the remarkable generalization capability of the presented big learning.
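To make the abused interventions concrete, the sketch below builds two of the anomalous inputs on a patch grid: random relocation with duplication, and mixed-up patches from two samples. It is a hypothetical NumPy illustration under stated assumptions (single-channel images, non-overlapping square patches, a 50/50 mixing rule); the helper names `patchify`, `relocate_and_duplicate`, and `mix_up` are not from the paper.

```python
import numpy as np


def patchify(image, patch):
    """Split an (H, W) image into a row-major (N, patch, patch) stack."""
    h, w = image.shape
    gh, gw = h // patch, w // patch
    return (image[:gh * patch, :gw * patch]
            .reshape(gh, patch, gw, patch)
            .swapaxes(1, 2)
            .reshape(gh * gw, patch, patch))


def relocate_and_duplicate(patches, num_kept, rng):
    """Show randomly drawn patches at random grid positions.

    Contents are drawn with replacement (so duplicates may occur),
    positions without replacement (so no two patches collide).
    """
    n = len(patches)
    content = rng.choice(n, size=num_kept, replace=True)
    position = rng.choice(n, size=num_kept, replace=False)
    return position, patches[content]


def mix_up(patches_a, patches_b, rng):
    """Take the patch at each position from sample a or b at random."""
    take_b = rng.random(len(patches_a)) < 0.5  # assumed 50/50 mixing
    return np.where(take_b[:, None, None], patches_b, patches_a)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    a = patchify(rng.random((28, 28)), 7)   # 16 patches of 7x7
    b = patchify(rng.random((28, 28)), 7)
    pos, vals = relocate_and_duplicate(a, num_kept=5, rng=rng)
    print(pos, vals.shape, mix_up(a, b, rng).shape)
```

The returned positions and patch contents would then play the role of the conditioning input handed to the generator; how the paper's experiments encode them is not specified here.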
5 Conclusions
We propose big learning for justifying, analyzing, and improving big/foundation models. Our big learning (i) is equipped with marvelous flexibility for both training-data and training-task customization; (ii) comprehensively exploits the available data information and potentially delivers all joint/conditional/marginal data capabilities after training; (iii) significantly reduces the training-test (or pretraining-finetuning) gap and thus improves model robustness and generalization capabilities; and (iv) unifies conventional machine learning paradigms and enables their flexible cooperation, manifesting a universal learning paradigm.
Though inspiring, big learning shares the same pros and cons as foundation models bommasani2021opportunities ; yuan2022roadmap . For example, comprehensively demonstrating and evaluating big learning is extremely challenging and time-consuming. Besides, because of its wide scope and the associated massive and diverse training tasks, existing machine-learning building blocks may not be sufficient for big learning; e.g., a reliable technique to stabilize GAN training with a ViT-based discriminator in high dimensions is still under development, to our knowledge. Big learning needs our community, and vice versa.
References
 (1) Armen Aghajanyan, Anchit Gupta, Akshat Shrivastava, Xilun Chen, Luke Zettlemoyer, and Sonal Gupta. Muppet: Massive multi-task representations with pre-finetuning. arXiv preprint arXiv:2101.11038, 2021.
 (2) Alexei Baevski, Wei-Ning Hsu, Qiantong Xu, Arun Babu, Jiatao Gu, and Michael Auli. Data2vec: A general framework for self-supervised learning in speech, vision and language. arXiv preprint arXiv:2202.03555, 2022.
 (3) Bowen Baker, Ingmar Kanitscheider, Todor Markov, Yi Wu, Glenn Powell, Bob McGrew, and Igor Mordatch. Emergent tool use from multi-agent autocurricula. arXiv preprint arXiv:1909.07528, 2019.
 (4) John A Bargh and Ezequiel Morsella. The unconscious mind. Perspectives on psychological science, 3(1):73–79, 2008.
 (5) Rishi Bommasani, Drew A Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, et al. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258, 2021.
 (6) A. Brock, J. Donahue, and K. Simonyan. Large scale GAN training for high fidelity natural image synthesis. In ICLR, 2019.
 (7) Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
 (8) Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021.
 (9) Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In International conference on machine learning, pages 1597–1607. PMLR, 2020.

 (10) Xinlei Chen and Kaiming He. Exploring simple siamese representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15750–15758, 2021.
 (11) B. Dai and D. Wipf. Diagnosing and enhancing VAE models. In ICLR, 2019.
 (12) Laurent Dinh, David Krueger, and Yoshua Bengio. Nice: Nonlinear independent components estimation. arXiv preprint arXiv:1410.8516, 2014.
 (13) Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using real nvp. arXiv preprint arXiv:1605.08803, 2016.
 (14) Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
 (15) I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NIPS, pages 2672–2680, 2014.
 (16) Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Guo, Mohammad Gheshlaghi Azar, et al. Bootstrap your own latent - a new approach to self-supervised learning. Advances in neural information processing systems, 33:21271–21284, 2020.
 (17) Tanmay Gupta, Amita Kamath, Aniruddha Kembhavi, and Derek Hoiem. Towards general purpose vision systems. arXiv preprint arXiv:2104.00743, 2021.
 (18) Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. arXiv preprint arXiv:2111.06377, 2021.
 (19) Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9729–9738, 2020.
 (20) Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020.
 (21) Yifan Jiang, Shiyu Chang, and Zhangyang Wang. Transgan: Two pure transformers can make one strong gan, and that can scale up. Advances in Neural Information Processing Systems, 34, 2021.

 (22) T. Karras, S. Laine, and T. Aila. A style-based generator architecture for generative adversarial networks. In CVPR, June 2019.
 (23) T. Karras, S. Laine, M. Aittala, J. Hellsten, J. Lehtinen, and T. Aila. Analyzing and improving the image quality of StyleGAN. arXiv preprint arXiv:1912.04958, 2019.
 (24) Tero Karras, Miika Aittala, Janne Hellsten, Samuli Laine, Jaakko Lehtinen, and Timo Aila. Training generative adversarial networks with limited data. Advances in Neural Information Processing Systems, 33:12104–12114, 2020.
 (25) Tero Karras, Miika Aittala, Samuli Laine, Erik Härkönen, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Alias-free generative adversarial networks. Advances in Neural Information Processing Systems, 34, 2021.
 (26) Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
 (27) Durk P Kingma and Prafulla Dhariwal. Glow: Generative flow with invertible 1x1 convolutions. Advances in neural information processing systems, 31, 2018.
 (28) Brenden M Lake, Tomer D Ullman, Joshua B Tenenbaum, and Samuel J Gershman. Building machines that learn and think like people. Behavioral and brain sciences, 40, 2017.
 (29) Kwonjoon Lee, Huiwen Chang, Lu Jiang, Han Zhang, Zhuowen Tu, and Ce Liu. Vitgan: Training gans with vision transformers. arXiv preprint arXiv:2107.04589, 2021.
 (30) Qing Li, Boqing Gong, Yin Cui, Dan Kondratyuk, Xianzhi Du, Ming-Hsuan Yang, and Matthew Brown. Towards a unified foundation model: Jointly pretraining transformers on unpaired images and text. arXiv preprint arXiv:2112.07074, 2021.
 (31) Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov. RoBERTa: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692, 2019.
 (32) Kevin Lu, Aditya Grover, Pieter Abbeel, and Igor Mordatch. Pretrained transformers as universal computation engines. arXiv preprint arXiv:2103.05247, 2021.
 (33) Casimir JH Ludwig, J Rhys Davies, and Miguel P Eckstein. Foveal analysis and peripheral selection during active visual sampling. Proceedings of the National Academy of Sciences, 111(2):E291–E299, 2014.
 (34) Zixuan Ma, Jiaao He, Jiezhong Qiu, Huanqi Cao, Yuanwei Wang, Zhenbo Sun, Liyan Zheng, Haojie Wang, Shizhi Tang, Tianyu Zheng, et al. BaGuaLu: targeting brain scale pretrained models with over 37 million cores. In Proceedings of the 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 192–204, 2022.
 (35) Anabela Mesquita. Human behavior, psychology, and social interaction in the digital era. IGI Global, 2015.
 (36) Muhammad Muzammal Naseer, Kanchana Ranasinghe, Salman H Khan, Munawar Hayat, Fahad Shahbaz Khan, and Ming-Hsuan Yang. Intriguing properties of vision transformers. Advances in Neural Information Processing Systems, 34:23296–23308, 2021.
 (37) M. Nilsback and A. Zisserman. Automated flower classification over a large number of classes. In ICCV, Graphics & Image Processing, pages 722–729. IEEE, 2008.
 (38) Isabel Papadimitriou and Dan Jurafsky. Learning music helps you read: Using transfer to study linguistic structure in language models. arXiv preprint arXiv:2004.14601, 2020.
 (39) A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever. Language models are unsupervised multitask learners. OpenAI Blog, 1(8):9, 2019.
 (40) Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
 (41) Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding by generative pretraining. 2018.
 (42) Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 2022.
 (43) Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, and Ilya Sutskever. Zero-shot text-to-image generation. In International Conference on Machine Learning, pages 8821–8831. PMLR, 2021.
 (44) Toni P Saarela and Michael S Landy. Integration trumps selection in object recognition. Current Biology, 25(7):920–927, 2015.
 (45) Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. Score-based generative modeling through stochastic differential equations. arXiv preprint arXiv:2011.13456, 2020.
 (46) A. Stickland and I. Murray. BERT and PALs: Projected attention layers for efficient adaptation in multitask learning. arXiv preprint arXiv:1902.02671, 2019.
 (47) Alex Tamkin, Vincent Liu, Rongfei Lu, Daniel Fein, Colin Schultz, and Noah Goodman. DABS: A domain-agnostic benchmark for self-supervised learning. arXiv preprint arXiv:2111.12062, 2021.
 (48) Hugo Touvron, Matthieu Cord, Alexandre Sablayrolles, Gabriel Synnaeve, and Hervé Jégou. Going deeper with image transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 32–42, 2021.
 (49) A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. In NeurIPS, pages 5998–6008, 2017.
 (50) Chen Wei, Haoqi Fan, Saining Xie, Chao-Yuan Wu, Alan Yuille, and Christoph Feichtenhofer. Masked feature prediction for self-supervised visual pretraining. arXiv preprint arXiv:2112.09133, 2021.
 (51) Yuhuai Wu, Markus N Rabe, Wenda Li, Jimmy Ba, Roger B Grosse, and Christian Szegedy. Lime: Learning inductive bias for primitives of mathematical reasoning. In International Conference on Machine Learning, pages 11251–11262. PMLR, 2021.
 (52) Lu Yuan, Dongdong Chen, Yi-Ling Chen, Noel Codella, Xiyang Dai, Jianfeng Gao, Houdong Hu, Xuedong Huang, Boxin Li, Chunyuan Li, et al. Florence: A new foundation model for computer vision. arXiv preprint arXiv:2111.11432, 2021.
 (53) Sha Yuan, Hanyu Zhao, Shuai Zhao, Jiahong Leng, Yangxiao Liang, Xiaozhi Wang, Jifan Yu, Xin Lv, Zhou Shao, Jiaao He, et al. A roadmap for big model. arXiv preprint arXiv:2203.14101, 2022.
 (54) Bowen Zhang, Shuyang Gu, Bo Zhang, Jianmin Bao, Dong Chen, Fang Wen, Yong Wang, and Baining Guo. Styleswin: Transformer-based gan for high-resolution image generation. arXiv preprint arXiv:2112.10762, 2021.
 (55) Long Zhao, Zizhao Zhang, Ting Chen, Dimitris Metaxas, and Han Zhang. Improved transformer for high-resolution gans. Advances in Neural Information Processing Systems, 34, 2021.