Big Learning: A Universal Machine Learning Paradigm?

by Yulai Cong, et al.

Recent breakthroughs based on big/foundation models reveal a vague avenue for artificial intelligence, that is, big data, big/foundation models, big learning, ⋯. Following that avenue, here we elaborate on the newly introduced big learning. Specifically, big learning comprehensively exploits the available information inherent in large-scale complete/incomplete data, by simultaneously learning to model many-to-all joint/conditional/marginal data distributions (thus named big learning) with one universal foundation model. We reveal that big learning is what existing foundation models are implicitly doing; accordingly, our big learning provides high-level guidance for flexible design and improvements of foundation models, accelerating the true self-learning on the Internet. Besides, big learning (i) is equipped with marvelous flexibility for both training data and training-task customization; (ii) potentially delivers all joint/conditional/marginal data capabilities after training; (iii) significantly reduces the training-test gap with improved model generalization; and (iv) unifies conventional machine learning paradigms, e.g., supervised learning, unsupervised learning, and generative learning, and enables their flexible cooperation, manifesting a universal learning paradigm.



1 Introduction

AI is undergoing a paradigm shift with the rise of big/foundation models bommasani2021opportunities ; yuan2022roadmap , e.g., BERT stickland2019bert , GPT-3 brown2020language , CLIP radford2021learning , DALL-Es ramesh2021zero ; ramesh2022hierarchical , MAE he2021masked , etc. Foundation models, often based on mask-and-predict pretraining and downstream finetuning, are capable of benefiting from pretraining on broad data at scale and accordingly demonstrate diverse downstream task capabilities with impressive robustness stickland2019bert , adaptability bommasani2021opportunities ; he2021masked , and generalization brown2020language ; ramesh2021zero . Therefore, they are rapidly being integrated into real-world AI systems, e.g., BERT into Google search, Codex chen2021evaluating into GitHub’s Copilot, etc.

Despite their impressive performance and practical capabilities, a unified theoretical framework justifying foundation models remains missing bommasani2021opportunities ; yuan2022roadmap , which is crucial for their further improvements and extensions and is likely a milestone for the foundation model community tamkin2021dabs .

To address that challenge, we first observe that the success of foundation models is mainly attributed to the following two flexibilities, in addition to increasingly powerful parallel computing techniques.

  • Data Flexibility. Foundation models are not “picky” about their training data, enabling training with large-scale/Internet-scale data with great diversity (e.g., across many domains). Those training data, often orders of magnitude larger than conventional machine learning datasets, are likely more consistent with the underlying data distribution, with minimal manual collection (i.e., human interventions), leading to a narrowed training-test gap and therefore improved generalization and robustness of models.

  • Task Flexibility. Foundation models are often trained across many tasks in potentially many domains (this will be made clearer in Section 3.1 and Section 3.3); their multi-purpose training nature may offer new leverage for learning shared, compositional, and intrinsic meta-knowledge encoded in the model parameters, leading to improved model performance, adaptability, and generalization capabilities papadimitriou2020learning ; wu2021lime ; lu2021pretrained ; aghajanyan2021muppet . Note that compositionality is a crucial ingredient of human intelligence lake2017building and may hold the key for out-of-distribution generalization bommasani2021opportunities .

Next, by referring to the literature baker2019emergent ; bommasani2021opportunities ; yuan2022roadmap and reviewing the development of deep learning, we perceive a vague avenue for artificial intelligence, that is,

big data, big/foundation models, big learning, ⋯.

A clear trend in deep learning is that more exploited information leads to better model performance, e.g., utilizing big data as a source of abundant information and developing big/foundation models to exploit that information from the model perspective. However, we observe that most existing machine learning paradigms do not comprehensively exploit the abundant information within big training data. By contrast, the fundamental unconscious mind and the vision system of human brains are excellent at exquisite information exploitation in a multitasking manner bargh2008unconscious ; mesquita2015human ; ludwig2014foveal ; saarela2015integration .

Motivated by the above observations, we propose to further strengthen those flexibilities with more exquisite exploitation of data information from the learning perspective, via a newly introduced universal machine learning paradigm named big learning, mimicking human brains.

The presented big learning leverages a universal/foundation model to simultaneously model many-to-all joint/conditional/marginal data distributions, manifested as a “big” training task that contains many machine learning paradigms as special cases. Our big learning comes with three main contributions.


  • It serves as a theoretical platform for justifying, analyzing, and improving big/foundation models, because most of them are implicitly doing (parts of) big learning, as revealed in Section 3.

  • By modeling many-to-all joint/conditional/marginal data distributions, big learning (i) comprehensively exploits the available data information (thus focusing on the data essence) and delivers the corresponding data capabilities (valuable for, e.g., data completion and flexible counter-factual analysis) and (ii) embraces statistical sharing power to implicitly summarize intrinsic compositional data meta-knowledge within model parameters, enhancing the model’s robustness, adaptability, and generalization capabilities.

  • It delivers extraordinary data and task flexibilities by enabling large-scale training with complete/incomplete data on diverse learning tasks across various domains, leading to (i) minimal human interventions in data collection and learning-task specification, (ii) a significantly reduced training-test (or pretraining-finetuning) gap, and (iii) potentially an avenue to the true self-learning on the Internet.

2 Related Work and Preliminary

Big/Foundation Models. Taking shape in NLP, big/foundation models have drastically changed the research and practice of AI bommasani2021opportunities ; yuan2022roadmap . For example, BERT stickland2019bert and GPT-3 brown2020language have had a huge impact on the field of NLP, whereas CLIP radford2021learning , DALL-Es ramesh2021zero ; ramesh2022hierarchical , MAE he2021masked , Florence yuan2021florence , etc. have attracted wide attention from both the NLP and CV research fields. Most foundation models are trained in a mask-and-predict manner, i.e., holding out a portion of the input and then training the model to predict the missing content, as demonstrated in Fig. 3(b) and Fig. 6(b). We will reveal that mask-and-predict learning is a special case of a principled universal machine learning paradigm, i.e., the proposed big learning, which justifies the success of foundation models and provides theoretical guidance for their improvements.

Transformers and Vision Transformers (ViTs). Based on the self-attention mechanism vaswani2017attention , Transformers have been serving as the de facto model architecture for foundation models in both NLP and CV fields radford2018improving ; liu2019roberta ; stickland2019bert ; radford2019language ; brown2020language . Often Transformers take as input a sequence of discrete indexes of length $L$ and output the corresponding latent embeddings with embedding dimension $D$ for downstream applications; attention is applied among the $L$ locations layer by layer. ViTs dosovitskiy2020image are Transformers modified for dealing with continuous images, which have been shown empirically to have better generalization and robustness than convolutional neural networks naseer2021intriguing . Different from Transformers, which embed discrete indexes into high-dimensional continuous features, ViTs directly employ flattened image patches as those continuous features, as demonstrated in Fig. 3(b). It’s well known that Transformers/ViTs are often over-parameterized and therefore data/information hungry; we will reveal that this property of Transformers/ViTs, together with their great modeling flexibility, exactly matches our big learning.

Multi-modal Learning Objectives. Two famous examples of multi-modal learning objectives are (i) maximum likelihood learning with discrete categorical observations, i.e., the cross-entropy loss, and (ii) adversarial learning for continuous observations, i.e., the GAN loss goodfellow2014generative .

  • Given data-label pairs $(x, y) \sim q(x, y)$ with $y \in \{1, \dots, C\}$ and a classifier $p_\theta(y|x)$ that outputs the probabilities $p_\theta(y=c|x)$ of $x$ belonging to category $c$, the cross-entropy loss is identical to

    $$\mathbb{E}_{q(x,y)}\Big[-\sum_{c=1}^{C} \mathbb{1}[y=c]\, \log p_\theta(y=c|x)\Big] = \mathbb{E}_{q(x)}\, \mathrm{KL}\big[q(y|x)\,\|\,p_\theta(y|x)\big] + \mathrm{const}, \qquad (1)$$

    where $\mathbb{1}[y=c]$ evaluates to 1 if $y=c$, 0 otherwise, and the optimal $p_{\theta^*}(y|x) = q(y|x)$. Note both $q(y|x)$ and $p_\theta(y|x)$ are often regarded as categorical distributions that are capable of modeling multiple modes; for example, consider the diverse generation from GPT-3 brown2020language .

  • Generative adversarial nets (GANs) are widely used for synthesizing highly realistic images brock2018large ; Karras_2019_CVPR ; karras2019analyzing ; karras2021alias . A standard GAN goodfellow2014generative consists of a generator $G_\theta$ and a discriminator $D_\phi$, both of which are trained in an adversarial manner via

    $$\min_\theta \max_\phi\; \mathbb{E}_{q(x)} \log D_\phi(x) + \mathbb{E}_{p_\theta(x)} \log\big(1 - D_\phi(x)\big), \qquad (2)$$

    where $q(x)$ is the data distribution and $p_\theta(x)$ is the generated distribution with the generative process $x = G_\theta(z),\ z \sim p(z)$. $p(z)$ is an easy-to-sample distribution, like a normal distribution. With the optimal $D_{\phi^*}$, Eq. (2) minimizes the Jensen-Shannon divergence $\mathrm{JS}[q(x)\,\|\,p_\theta(x)]$ goodfellow2014generative . Recently, the community has begun to explore integrating ViTs into GANs jiang2021transgan ; lee2021vitgan ; zhao2021improved ; zhang2021styleswin . For example, the ViTGAN lee2021vitgan , delivering SOTA generative performance, employs simple modifications to the ViT architecture to form the ViT-based generator and discriminator, but adopts many techniques to regularize the discriminator for stabilized training. Empirically, we also find it challenging to stabilize GAN training with a ViT-based discriminator.
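As a small numerical illustration of the two objectives above, the following sketch evaluates a cross-entropy loss and the standard GAN value of Eq. (2) on hand-made numbers. The function names and the toy probabilities are assumptions introduced here for illustration; they are not taken from the authors' implementation.

```python
import math

def cross_entropy(probs, labels):
    """Mean negative log-probability assigned to the true category."""
    return -sum(math.log(p[y]) for p, y in zip(probs, labels)) / len(labels)

def gan_value(d_real, d_fake):
    """Standard GAN value: E_data[log D(x)] + E_model[log(1 - D(x))]."""
    real_term = sum(math.log(d) for d in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - d) for d in d_fake) / len(d_fake)
    return real_term + fake_term

# Toy classifier outputs for two samples over three categories (made up).
probs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
labels = [0, 1]
ce = cross_entropy(probs, labels)

# A discriminator that outputs 0.5 everywhere cannot tell data from samples;
# its value is -log 4, the value reached when the generator matches the data.
v_chance = gan_value([0.5] * 4, [0.5] * 4)
print(ce, v_chance)
```

The second quantity shows why the optimal discriminator matters: only when the discriminator is informative does the value rise above $-\log 4$, and that gap is what the generator is trained to close.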

3 Big Learning: A Universal Machine Learning Paradigm

As mentioned in the Introduction, the presented big learning has extraordinary data flexibility, where the training data may be incomplete, e.g., with missing dimensions/values in the feature $x \in \mathbb{R}^{L \times D}$ (with length $L$ and dimension $D$, like flattened patches of an image) or with missing supervisions (like a label $y$).

To better introduce our big learning, we first present its main idea in simplified unsupervised settings, where a data sample contains only a feature $x$, followed by generalizing its scope to the general settings, where a data sample contains both a feature $x$ and a supervision $y$. Note that in both cases, $x$ may be incomplete with missing dimensions/values.

3.1 Unsupervised Big Learning

In unsupervised settings, we focus mainly on generation tasks for introduction. Given a collection of data samples from the underlying data distribution $q(x)$, mainstream machine learning paradigms concentrate solely on joint modeling, i.e., constructing a joint model $p_\theta(x)$ to resemble the joint data distribution $q(x)$, or informally $p_\theta(x) \to q(x)$, using GANs goodfellow2014generative ; brock2018large ; Karras_2019_CVPR ; karras2019analyzing ; karras2021alias , VAEs kingma2013auto ; dai2018diagnosing , Flows dinh2014nice ; dinh2016density ; kingma2018glow , diffusion models ho2020denoising ; song2020score , etc.

Motivations. We highlight two practical situations where that joint modeling is restricted. (i) In addition to potentially limited complete data samples, often plenty of incomplete ones are available, e.g., in medical/biological scenarios. The joint modeling cannot handle incomplete data, making it wasteful and inexpedient to simply discard those incomplete ones (and the valuable information therein), especially where data collection is expensive. Besides, discarding incomplete data samples potentially introduces unexpected interventions, likely damaging the i.i.d. assumption that lays the foundation of machine learning. (ii) It’s worth highlighting that, given a dataset with complete data, one already receives the data samples from all joint/conditional/marginal distributions; therefore, ideally, one should comprehensively exploit that valuable information, e.g., to form all the associated data capabilities (like various conditional sampling for data completion) or to leverage different joint/conditional/marginal perspectives (formed as different tasks) to regularize each other. (If the joint modeling is learned perfectly, it’s possible but often computationally expensive to recover all conditional/marginal capabilities. Moreover, that perfect modeling assumption is likely violated in practice, hindering the retrieval of conditional/marginal capabilities.)

(a) Unsupervised Big Learning
(b) MAE he2021masked and HOG wei2021masked
Figure 3: Unsupervised big learning (a) and its special cases (b). Often a mask token [M] is inserted at the input locations outside $S$ for forward propagation, while no loss is back-propagated to the output locations outside $T$. Note that inserting the [M] token later in a middle layer (but at the same location) often lightens the computation and memory burdens and also improves the performance he2021masked .

The above practical situations motivate us to model many-to-all joint/conditional/marginal distributions simultaneously (manifested as “big” learning with massive tasks in high dimensions), so as to enable flexible training with all available complete/incomplete data and, at the same time, “collect” comprehensive data capabilities via exquisite data exploitation. Note incomplete data are readily utilized in the corresponding conditional/marginal tasks.

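However, it’s not straightforward to do that big learning, because of the massive learning tasks arising from modeling all joint/conditional/marginal distributions. Consider a simple problem with a short sequence $x = [x_1, \dots, x_L]^\top$ and the length index set $\mathbb{L} = \{1, \dots, L\}$:

  • the goal of the joint modeling is to deliver $p_\theta(x) \to q(x)$;

  • naively, big learning needs to simultaneously construct the models

    $$p_\theta(x_T \mid x_S) \to q(x_T \mid x_S) \quad \text{for all non-overlapping } S, T \subseteq \mathbb{L},\ T \neq \emptyset, \qquad (3)$$

    so as to yield the corresponding data capabilities, like sampling $x_T$ given $x_S$, with the data pairs $(x_S, x_T)$ available from training data.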

| Compared Methods | Joint Modeling | Unsupervised Big Learning |
| --- | --- | --- |
| Intuitively | Straightforward | Complicated/Intractable |
| Training Data | Complete Data | Complete/Incomplete Data |
| Data Exploitation | Single Joint Perspective | Exquisite Many-to-all Perspectives |
| Capabilities After Training | Joint | Joint/Conditional/Marginal |
| Potential Downstream Applications | Limited | Extremely Abundant |

Table 1: Comparing joint modeling with unsupervised big learning.

Unsupervised Big Learning. By referring to Eq. (3) and considering a general problem with $x \in \mathbb{R}^{L \times D}$ and the index set $\mathbb{L} = \{1, \dots, L\}$, ideally, one would need to build a combinatorially large number of joint/conditional/marginal models in total (counted with the binomial coefficients $\binom{L}{i}$, i.e., the number of $i$-combinations from a set with $L$ elements), which is clearly prohibitive. Considering implementation complexities, we only consider big learning along the length dimension $L$ here; it’s straightforward but likely expensive to generalize further. We alternatively propose to leverage a universal model

$$p_\theta(x_T \mid x_S), \quad \forall (S, T), \qquad (4)$$

with shared parameters $\theta$ to model all of them simultaneously, where $S$ and $T$ denote the random non-overlapping input/output index subsets, respectively. Note $S \cup T$ need not be $\mathbb{L}$, indicating that our unsupervised big learning can naturally handle incomplete data (with the model architecture and training objective detailed below). Fig. 3 demonstrates the unsupervised big learning based on Eq. (4) and Table 1 compares it with the conventional joint modeling.
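To make the size of the task space concrete, here is a minimal sketch that samples one random pair of non-overlapping input/output index subsets and counts all such pairs by brute force. The helper names (`sample_task`, `count_tasks`), the zero-based indexing, and the uniform role assignment are assumptions made for illustration; the counting convention used here (each index is an input, an output, or unused, with a non-empty output set) gives $3^L - 2^L$ tasks and may differ from the convention behind the paper's own count.

```python
import random
from itertools import product

def sample_task(L, rng):
    """Sample non-overlapping index subsets (S, T) of {0, ..., L-1} with T non-empty."""
    while True:
        # Each index is independently a source ("S"), a target ("T"), or unused ("-").
        roles = [rng.choice("ST-") for _ in range(L)]
        S = [i for i, r in enumerate(roles) if r == "S"]
        T = [i for i, r in enumerate(roles) if r == "T"]
        if T:  # reject assignments with nothing to predict
            return S, T

def count_tasks(L):
    """Count all (S, T) pairs with S and T disjoint and T non-empty, by enumeration."""
    return sum(1 for roles in product("ST-", repeat=L) if "T" in roles)

rng = random.Random(0)
S, T = sample_task(5, rng)
print("one sampled task:", S, T)
print("number of tasks for L = 4:", count_tasks(4))
```

Even for a sequence of a few hundred patches this count is astronomically large, which is why a single model with shared parameters, trained on sampled pairs, is used instead of one model per task.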

Model Architecture and Training Objective. Since the lengths/dimensions of the input $x_S$ and the output $x_T$ are not fixed, it’s not easy to model $p_\theta(x_T|x_S)$ based on convolutions. Motivated by the modeling flexibility of Transformers vaswani2017attention and the fact that most existing foundation models are built on top of Transformers and implicitly doing big learning (as revealed below), we propose to model $p_\theta(x_T|x_S)$ in Eq. (4) based on the Transformer architecture. The training objective is task-specific and is often specified as a simple and commonly used machine learning objective, as exemplified below, where we reveal part by part that common foundation models are implicitly doing big learning.

  • Let $x$ denote a sequence of continuous embeddings, such as the flattened patches of an image in ViTs dosovitskiy2020image ; he2021masked ; wei2021masked , as demonstrated in Fig. 3(b). Big learning aims at learning the data capabilities of generating a subset of image patches $x_T$ given another subset of patches $x_S$, manifested as versatile data completions ($S \neq \emptyset$) or joint/marginal generations ($S = \emptyset$). Considering the diversity of input/output patches $(x_S, x_T)$, it’s natural to model $p_\theta(x_T|x_S)$ as a bidirectional ViT that models the generative process of $x_T$ conditioned on $x_S$, mimicking a (conditional-)GAN generator. An additional noise token/input is often necessary lee2021vitgan ; zhang2021styleswin , in addition to a GAN loss and a specially designed GAN discriminator (see the following example and Appendix A for details). One may also consider extensions associated with VAEs kingma2013auto , Flows dinh2014nice , and diffusion models ho2020denoising ; song2020score .

    MAE he2021masked as a Special Case. By employing a unimodal Gaussian likelihood $p_\theta(x_T|x_S) = \mathcal{N}(x_T \mid f_\theta(x_S), \sigma^2 I)$, where $f_\theta$ is the ViT model, our unsupervised big learning based on the maximum likelihood objective recovers the MAE, which predicts $x_T$ based on $x_S$ using the mean squared error loss, with additional constraints of $S \cup T = \mathbb{L}$ and $S$ being a subset of $\mathbb{L}$.

  • Let $x$ denote a sequence of discrete tokens, like text words stickland2019bert ; brown2020language or vector-quantized image patches ramesh2021zero . The task is to predict the output/target/masked tokens $x_T$ given the input/source ones $x_S$. It’s straightforward to model $p_\theta(x_T|x_S)$ as a bidirectional Transformer encoder, which outputs at each index of $T$ the probabilities of a categorical distribution for prediction. With the employed cross-entropy loss and additional constraints of $S \cup T = \mathbb{L}$ and $T$ being a subset of $\mathbb{L}$, big learning exactly reduces to the masked language modeling of BERT stickland2019bert . Of course, unsupervised big learning will also recover auto-regressive language models like GPTs radford2018improving ; radford2019language ; brown2020language , with special settings for both $S$ and $T$.

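Take the former with GAN objectives for example. For simplicity, we illustrate with the standard GAN goodfellow2014generative ; one may also consider generalizing to other GAN variants. Given a universal model $p_\theta(x_T|x_S)$, modeling the generative processes of $x_T$ given $x_S$ for all $(S, T)$ pairs, and a randomly sampled union set $S \cup T$ with different realizations, e.g., $(S^1, T^1)$ and $(S^2, T^2)$,

  1. one could match any model distribution $p_\theta(x_T|x_S)\,q(x_S)$ to the corresponding underlying (subset) data distribution $q(x_{S \cup T})$ with

    $$\min_\theta \max_\phi\; \mathbb{E}_{q(x_{S \cup T})} \log \sigma\big[f_\phi(x; S, T)\big] + \mathbb{E}_{p_\theta(x_T|x_S)\,q(x_S)} \log \sigma\big[-f_\phi(x; S, T)\big], \qquad (5)$$

    where the optimal discriminator is $\sigma[f_{\phi^*}(x; S, T)]$ with $f_{\phi^*}(x; S, T) = \log \frac{q(x_{S \cup T})}{p_\theta(x_T|x_S)\,q(x_S)}$ and $\sigma$ the sigmoid function. $f_\phi$ is also modeled as a Transformer (see Appendix A for details).

  2. one can also enable “communications” among any two model distributions via

    $$\min_\theta \max_\phi\; \mathbb{E}_{p_\theta(x_{T^1}|x_{S^1})\,q(x_{S^1})} \log \sigma\big[f'_\phi(x)\big] + \mathbb{E}_{p_\theta(x_{T^2}|x_{S^2})\,q(x_{S^2})} \log \sigma\big[-f'_\phi(x)\big], \qquad (6)$$

    where the “communication” discriminator $f'_\phi(x) = f_\phi(x; S^2, T^2) - f_\phi(x; S^1, T^1)$ is constructed indirectly with the same neural network $f_\phi$ from Eq. (5). Proofs are given in Appendix A.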

Ideally, after training, one should yield $p_{\theta^*}(x_T|x_S) = q(x_T|x_S)$ for all $(S, T)$ pairs, i.e., big learning delivers all joint/conditional/marginal data capabilities. A figure in Appendix A demonstrates the main idea.

Compared with the MAE he2021masked (see Fig. 3(b)), which employs a unimodal Gaussian likelihood, it’s clear that our unsupervised big learning, based on GAN objectives in Eqs. (5) and (6), is capable of handling practical situations with multimodal $x_T$ given $x_S$. Besides, different from the MAE assuming independence among pixels, our unsupervised big learning leverages a GAN discriminator to implicitly embrace the underlying pixel-level dependence.
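The MAE-style special case discussed above can be sketched in a few lines: mask every location outside the input subset, predict the whole sequence, and compute a mean squared error only at the output locations (under a fixed-variance Gaussian likelihood, maximizing likelihood is equivalent to minimizing this squared error). Everything named here is hypothetical: `MASK` stands in for the [M] token, scalar values stand in for image patches, and `toy_model` is a placeholder predictor rather than a ViT.

```python
MASK = None  # hypothetical stand-in for the [M] token

def mask_input(x, S):
    """Keep entries at the input indices S and put the mask token elsewhere."""
    keep = set(S)
    return [v if i in keep else MASK for i, v in enumerate(x)]

def toy_model(x_masked):
    """Placeholder predictor: fill every location with the mean of the visible values."""
    visible = [v for v in x_masked if v is not MASK]
    mean = sum(visible) / len(visible)
    return [mean] * len(x_masked)

def masked_mse(pred, x, T):
    """Mean squared error restricted to the output indices T (no loss elsewhere)."""
    return sum((pred[i] - x[i]) ** 2 for i in T) / len(T)

x = [1.0, 3.0, 2.0, 6.0]   # a made-up sequence of four scalar "patches"
S, T = [0, 1], [2, 3]      # observe the first two, predict the last two
pred = toy_model(mask_input(x, S))
loss = masked_mse(pred, x, T)
print(loss)
```

Because this loss scores each output location independently around a single predicted mean, it cannot represent several plausible completions at once, which is the limitation the GAN-based objectives are meant to address.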

3.2 Discussions on Unsupervised Big Learning

The following discussions are readily extended to our big learning presented in Section 3.3.

Can we share one universal model $p_\theta(x_T|x_S)$ among diverse $(S, T)$ pairs? Yes. We justify our positive answer in the following three steps. Note we focus solely on the modeling here.

  1. For modeling $p_\theta(x_T|x_S)$ with a fixed $S$ and varying $T$, one can of course employ a shared “joint” model to model the joint generative process of $x_{\mathbb{L} \setminus S}$ given $x_S$, because its “marginal” generative process associated with $T$ exactly models the corresponding $p_\theta(x_T|x_S)$.

  2. For $p_\theta(x_T|x_S)$ with a fixed union $S \cup T$ but different $(S, T)$ pairs, extensive and successful empirical evaluations from existing foundation models, like BERT stickland2019bert , GPTs radford2018improving ; radford2019language ; brown2020language , and MAE he2021masked , have shown that it’s possible to employ a shared universal model.

  3. Combining the previous two steps, we conclude our positive answer.

On the model capacity of $p_\theta(x_T|x_S)$. Collecting many-to-all data capabilities within one universal/foundation model brings tremendous challenges to its model capacity. Fortunately, Transformers are well known to be data/information hungry, along with their modeling flexibility and parallel-computing amenability, making them suitable for modeling $p_\theta(x_T|x_S)$. Moreover, huge Transformers have already emerged, e.g., the BaGuaLu with 174 trillion parameters ma2022bagualu . Therefore, the model capacity might not be an issue for big learning (please also refer to our experiments).

On the generalization of model parameters and latent features. As mentioned above, existing big/foundation models, showing extraordinary robustness, adaptability, and generalization capabilities, are implicitly doing big learning. Accordingly, we leverage our big learning to explain why they have such amazing characteristics.

  • Firstly, by referring to Eq. (4) and Fig. 3, both the model parameters and the latent features of $p_\theta(x_T|x_S)$ are shared among many-to-all data tasks (with potentially complete/incomplete data and diverse learning objectives; this will be made clearer in Section 3.3), manifesting a massive multi-task learning scenario that exquisitely exploits the data information with statistical sharing power. Because all tasks share a consistent goal to model (some perspective of) the one underlying data distribution $q(x)$, it’s expected that big learning would encourage the parameters (and also the latent features) to summarize the intrinsic data information and potentially form compositional and generally applicable data meta-knowledge papadimitriou2020learning ; wu2021lime ; lu2021pretrained , manifested as those amazing characteristics.

  • Secondly, along with delivering many-to-all joint/conditional/marginal data capabilities that have great practical value, big learning, benefiting from its massive training tasks, is also expected to minimize the influence of human interventions in data collection and learning-task specification as well as to significantly reduce the training-test (or pretraining-finetuning) gap, which is believed to be crucial for justifying the real-world performance and extraordinary capabilities of big/foundation models stickland2019bert ; he2021masked .

On the weighting of massive training tasks. It’s worth highlighting that big learning comes with flexible weighting of its massive training tasks, via various sampling implementations of the $(S, T)$ pair. How to optimally weight those training tasks is important but is likely task-dependent; we leave that as future research. Alternatively, we notice that real-world training datasets are often composed of both complete and incomplete samples, meaning the corresponding $(S, T)$ pairs are already given; therefore, one may prefer to “let the data speak for themselves” in real-world applications. As for large-scale pretraining, one may also employ a specific sampling (like uniform) for $(S, T)$ according to the available domain knowledge. It’s worth emphasizing that, despite different weighting strategies, the optimum is the same and consistent, i.e., informationally identical to the underlying data distribution $q(x)$.

Generalizations based on domain knowledge. Considering practical situations, directly modeling within the observed domain, i.e., $p_\theta(x_T|x_S)$, may not be a good choice he2021masked ; wei2021masked . We reveal that, with trustworthy domain knowledge, one may alternatively model in a transformed domain, i.e., with domain-knowledge-inspired functions applied to $x_S$ and/or $x_T$ before modeling (e.g., wei2021masked ; see Fig. 3(b)).

3.3 Big Learning

(a) Big Learning
(b) BERT stickland2019bert
Figure 6: Big learning (a) and its special case of BERT (b). Similar to the mask token [M] for $x$ (see Fig. 3(b)), we employ another mask token for $y$, which works identically to the classification token [CLS] in BERT settings stickland2019bert and the start-of-sentence token in GPT settings brown2020language . Often inserting the mask tokens later in a middle layer improves performance he2021masked ; touvron2021going .

Based on the unsupervised big learning introduced in the previous section, where a data sample contains only $x$ with length $L$ and dimension $D$, we next present its generalized version with a data sample containing both a feature $x$ and a supervision $y$. Accordingly, the random non-overlapping input/output index subsets are expanded as $S' = S \cup S^y$ and $T' = T \cup T^y$, respectively, where $S, T \subseteq \mathbb{L}$ and $S^y, T^y \subseteq \mathbb{L}^y$ with $\mathbb{L}^y$ being the index set of $y$. Although we focus on this setting for introduction, the following techniques and discussions are readily generalized to more complicated real-world situations.

Thanks to the modeling flexibility of unsupervised big learning, we can readily generalize it into big learning in general settings (demonstrated in Fig. 6(a)), whose main idea is to model $p_\theta(X_{T'}|X_{S'}) \to q(X_{T'}|X_{S'})$ for random $(S', T')$ pairs, with $X = (x, y)$ and $q(X)$ being the underlying data distribution.

Model Architecture and Training Objective. For situations where $x$ and $y$ have the same data type (e.g., both $x$ and $y$ denote a sequence of continuous image patches), the techniques presented in Section 3.1 are straightforwardly generalized. We next elaborate on practical situations with multimodality gupta2021towards ; li2021towards ; ramesh2021zero ; ramesh2022hierarchical ; baevski2022data2vec , for example, one of $x$ and $y$ being a discrete token sequence while the other is a continuous embedding sequence. We reveal two solutions.

  1. Transform one data type to the other one for alignment. For example, we can vector-quantize the continuous embedding sequence into a sequence of discrete tokens, similar to DALL-E ramesh2021zero , followed by employing similar techniques introduced in Section 3.1.

  2. Recursively reuse $p_\theta$ to model the dependence of $x$ and $y$. The key idea is to exploit the flexibility of our big learning. Specifically, we can unfold the learning via

    $$p_\theta(X_{T'} \mid X_{S'}) = p_\theta(X_{T_1} \mid X_{S'})\; p_\theta(X_{T_2} \mid X_{S' \cup T_1}), \qquad (7)$$

    where $X = (x, y)$ and $T' = T_1 \cup T_2$ splits the output indexes by data type. Big learning with Eq. (7) first forward-propagates twice through the universal/foundation model $p_\theta$, with the output of the first propagation inserted into the input of the second one; after calculating the objective, big learning then back-propagates gradients twice to the parameters $\theta$ for model updating, thanks to the continuity of the first output $X_{T_1}$. Note both $X_{T_1}$ and $X_{T_2}$ have one unique data type after unfolding.

BERT pretraining as a Special Case. Section 3.1 reveals that our (unsupervised) big learning contains the masked language modeling part of the BERT pretraining as a special case. By further introducing the conditional independence simplification to Eq. (7), i.e., treating the two output parts as conditionally independent given the input, and with the corresponding settings of the input/output index subsets, one can readily verify that our big learning recovers the BERT pretraining including both masked language modeling and next sentence prediction stickland2019bert .

3.4 Discussions on Big Learning

Supervised Learning | Self-supervised Learning | Unconditioned Generation | Conditioned Generation
Table 2: Example special cases of big learning. Without loss of generality, we assume $y$ is the label paired with $x$.

Big learning serves as a universal machine learning paradigm. Benefiting from its modeling flexibility, big learning has most machine learning paradigms as special cases, including supervised learning, self-supervised learning, generative learning, their mixtures, etc., as shown in Table 2. (We focus on the core idea here for demonstration and highlight that the implementation details can be task-specific, such as network architectures, max-likelihood/adversarial objectives, etc.) That universality of big learning, combined with its flexibility, enables flexible combinations of existing machine learning paradigms with knowledge communications (via the shared model parameters as well as training objectives like Eq. (6)); therefore, the proposed big learning might facilitate semantically diverse multi-task self-learning on the Internet, producing brain-scale big/foundation models with reinforced performance, robustness, generalization, and general intelligence.
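One plausible way to read the special cases named above is as different choices of which indexes are given as input and which are predicted as output. The sketch below encodes that reading; the index names (`x1`, `x2`, `x3`, `y`), the dictionary layout, and the specific assignments are assumptions made here for illustration and are not a transcription of the paper's Table 2.

```python
X = ("x1", "x2", "x3")  # hypothetical feature indexes
Y = ("y",)              # hypothetical label index

# Each paradigm is a choice of input (S) and output (T) indexes.
PARADIGMS = {
    "supervised learning":      {"S": X,     "T": Y},      # predict y from x
    "self-supervised learning": {"S": X[:2], "T": X[2:]},  # predict held-out x from visible x
    "unconditioned generation": {"S": (),    "T": X},      # generate x from nothing
    "conditioned generation":   {"S": Y,     "T": X},      # generate x given y
}

def is_valid_task(task):
    """A task needs a non-empty output set that does not overlap its input set."""
    s, t = set(task["S"]), set(task["T"])
    return bool(t) and not (s & t)

for name, task in PARADIGMS.items():
    print(f"{name}: given {task['S']}, predict {task['T']}, valid={is_valid_task(task)}")
```

Under this reading, mixing paradigms during training amounts to sampling among several such input/output choices while keeping one shared model.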

Big learning versus self-supervised contrastive learning. Contrastive learning he2020momentum ; chen2020simple ; grill2020bootstrap ; chen2021exploring focuses mainly on exploiting the (image) domain prior knowledge to learn generally applicable data representations for downstream tasks. From the perspective of prior exploitation, contrastive learning is orthogonal to our big learning that is mostly data driven. However, as aforementioned in Section 3.2, the proposed big learning can be combined with contrastive learning to incorporate trustworthy domain priors, thanks to its great flexibility. Further discussions are given in Appendix C.

On the i.i.d. assumption. The i.i.d. assumption is one of the key foundations of deep learning, but it’s also well known that the training data collected for practical applications are rarely i.i.d., leading to a training-test mismatch that significantly hinders the reliability of deep learning models. Existing foundation models have begun to demonstrate increasing robustness and generalization capabilities towards that mismatch, benefiting from their implicit implementations of big learning. It’s reasonable to believe (for the following three reasons) that our big learning, embracing its remarkable flexibility for large-scale training data and massive training tasks, will significantly reduce that training-test mismatch (or pretraining-finetuning gap) and therefore behave robustly in real-world applications.

  1. Manually collecting or filtering data samples will likely introduce unintentional interventions that violate the i.i.d. assumption and enlarge the training-test mismatch. The data flexibility of our big learning makes it possible to conduct training with minimal-to-none human interventions in data collection, and accordingly, “let the data speak for themselves.”

  2. Thanks to big learning’s remarkable flexibilities in complete/incomplete training data and massive learning tasks, its “experience horizons,” associated with both the data (i.e., the scope of the underlying data distribution $q(x)$) and the tasks, are significantly enlarged, making it less likely for a training-test mismatch (or pretraining-finetuning gap) to emerge.

  3. Even with the same dataset with all complete samples, big learning is also expected to behave more robustly with respect to the i.i.d. assumption, because (i) collecting i.i.d. complete samples in the joint domain is often challenging for practical applications; (ii) the conditional tasks of big learning, e.g., $p_\theta(x_T|x_S)$, are always implemented with perfect and trustworthy i.i.d. samples from $q(x_T|x_S)$; and (iii) big learning enables communications among tasks, which is expected to bring benefits.

4 Experiments

Because of the wide scope of our big learning, it’s extremely challenging and time-consuming to comprehensively evaluate its effectiveness in general settings. We’d like to emphasize that the great success of existing big/foundation models has provided concrete evidence that supports our big learning. In what follows, we concentrate on the simplified but fundamental unsupervised big learning for demonstration, which is already a great challenge, as detailed below.

Based on Section 3.1 (more specifically, Eqs. (5) and (6)), we conduct unsupervised big learning with all joint/conditional/marginal data tasks (obtained via sampling) on the MNIST and CelebA image datasets. After training, we test the data completion capabilities of the big-learned model in diverse settings, and then challenge the generalization capability of our big learning with abused anomalous out-of-domain completion tasks. See Appendix A for the experimental details.

4.1 Versatile Data Completion Capabilities with Adaptive Generation Diversity

Figure 7: Versatile data generation/completion capabilities from big learning. The first row with light-blue boxes shows the different conditioning inputs, with the observed ratio increasing from left to right. The rightmost column gives the complete image.

We first test the big-learned data generation/completion capabilities with different ratios of observed image patches. For a specific ratio, we either randomly sample the corresponding number of image patches or choose the initial portion of the patches to form the conditioning input, which is then fed to the universal generator for image completion. Fig. 7 shows the results.
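The two patch-selection strategies described above can be sketched as follows. This is only an illustrative sketch, not the authors' implementation: it assumes the image is split into a regular grid of non-overlapping square patches, and the names `select_patches` and `mask_image`, the patch size, and the choice to zero out unobserved patches are all hypothetical.

```python
import numpy as np


def select_patches(num_patches, ratio, mode="random", rng=None):
    """Return a boolean mask over patch indices marking the observed patches.

    mode="random"  : observed patches are drawn uniformly without replacement.
    mode="initial" : the first round(ratio * num_patches) patches are observed.
    (Both the rounding rule and the mode names are assumptions.)
    """
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must lie in [0, 1]")
    rng = np.random.default_rng() if rng is None else rng
    k = int(round(ratio * num_patches))
    mask = np.zeros(num_patches, dtype=bool)
    if mode == "random":
        mask[rng.choice(num_patches, size=k, replace=False)] = True
    elif mode == "initial":
        mask[:k] = True
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    return mask


def mask_image(image, mask, patch):
    """Keep observed patches of an (H, W) or (H, W, C) image; zero the rest."""
    h, w = image.shape[:2]
    gh, gw = h // patch, w // patch
    if mask.size != gh * gw:
        raise ValueError("mask length must equal the number of patches")
    # Expand the patch-level mask to pixel resolution.
    grid = mask.reshape(gh, gw)
    pixel_mask = np.repeat(np.repeat(grid, patch, axis=0), patch, axis=1)
    if image.ndim == 3:
        pixel_mask = pixel_mask[..., None]
    return np.where(pixel_mask, image[: gh * patch, : gw * patch], 0)


if __name__ == "__main__":
    image = np.ones((28, 28))  # stand-in for an MNIST-sized image
    observed = select_patches(16, ratio=0.25, mode="initial")
    conditioning_input = mask_image(image, observed, patch=7)
    print(observed.sum(), conditioning_input.shape)
```

Sweeping `ratio` from 0 to 1 under either mode gives a sequence of increasingly informative conditioning inputs of the kind the experiment describes; how the unobserved patches are actually represented for the generator is left open here.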

It’s clear that after big learning, the universal generator masters many-to-all joint/conditional/marginal data capabilities simultaneously. Besides, big learning automatically learns from the data an adaptive generation diversity that follows human intuition. Specifically, with little to no conditioning input (i.e., less conditional information), big learning delivers increasingly diverse generations that are mainly controlled by the noise, whereas with larger observed ratios (i.e., more conditional information), the generation becomes increasingly deterministic and approaches the source image, as expected (see Appendix D for more results).

We then test the big-learned data completion capability with respect to various conditioning inputs and noise settings, with the corresponding results summarized in Fig. 8. On the one hand, given an image and a random noise, big learning clearly delivers diverse and (relatively) realistic generations across the various conditioning inputs on both MNIST (see the variations in class/stroke-thickness/shape/angle) and CelebA (see the varying identity/hair-style/make-up/expression). On the other hand, given a specific conditioning input with limited information, the big-learned generator, when fed different noises, also generates realistic images with diversity, demonstrating the multi-modal generation power of big learning.

The experimental results in Figs. 7 and 8 demonstrate that, by comprehensively exploiting the available information inherent in large-scale complete/incomplete data, big learning is capable of delivering versatile data completion capabilities with adaptive generation diversity.

Figure 8: Versatile data completion capabilities from big learning with respect to various conditioning inputs (left) and noise (right). The conditioning inputs are shown in the upper-right light-blue boxes, while the red boxes show the given image (left) and the given conditioning input (right), respectively.

4.2 Remarkable Generalization on Abused Anomalous Out-of-domain Completion

Figure 9: Abused anomalous completion for demonstrating the generalization of big learning. (a) Random center patches placed in the upper/left locations as the conditioning input (top), and duplicated and placed in the center (bottom), with a model trained on MNIST. (b)-(d) use the same model trained on CelebA. (b) Combining image parts from two different images to form the conditioning input. (c)-(d) Out-of-domain conditioning inputs from Flowers nilsback2008automated (c) and MetFaces karras2020training (d).

We also design abused anomalous out-of-domain completion tasks to challenge the generalization of our big learning. Specifically, we intentionally design conditioning inputs with (i) abused interventions (e.g., random relocation and duplication) applied to image patches, as shown in Fig. 9(a); (ii) mixed-up patches from different data samples (see Fig. 9(b)); and (iii) unseen out-of-domain image patches, as shown in Fig. 9(c)-(d).
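The first two kinds of abused conditioning input, relocated/duplicated patches and mixed-up patches, can be illustrated with a small NumPy sketch. It is a hedged illustration under simple assumptions (square patches addressed by their top-left corner, a blank canvas for unobserved pixels); the helper names `place_patch` and `mix_up` are hypothetical and do not come from the paper.

```python
import numpy as np


def place_patch(image, src, dsts, size):
    """Copy the size x size patch at top-left corner `src` to every corner in `dsts`.

    One destination corresponds to relocation; several correspond to duplication.
    Returns the blank-canvas conditioning input and a boolean map of observed pixels.
    """
    sr, sc = src
    patch = image[sr:sr + size, sc:sc + size]
    canvas = np.zeros_like(image)
    observed = np.zeros(image.shape[:2], dtype=bool)
    for dr, dc in dsts:
        canvas[dr:dr + size, dc:dc + size] = patch
        observed[dr:dr + size, dc:dc + size] = True
    return canvas, observed


def mix_up(image_a, image_b, take_from_b):
    """Combine two same-shaped images: pixels where `take_from_b` is True come from B."""
    mask = take_from_b if image_a.ndim == 2 else take_from_b[..., None]
    return np.where(mask, image_b, image_a)


if __name__ == "__main__":
    img = np.arange(64, dtype=float).reshape(8, 8)
    # Duplicate the 2x2 patch at (2, 2) into two new locations.
    canvas, observed = place_patch(img, src=(2, 2), dsts=[(0, 0), (4, 4)], size=2)
    print(int(observed.sum()))  # number of observed pixels

    a, b = np.zeros((4, 4)), np.ones((4, 4))
    right_half = np.zeros((4, 4), dtype=bool)
    right_half[:, 2:] = True
    print(mix_up(a, b, right_half))
```

The third kind, out-of-domain patches, needs no extra helper in this sketch: the same masking is simply applied to an image drawn from a different dataset than the one used for training.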

It’s clear that our big learning manages to handle these abused conditioning inputs with reasonable image completion; examples include the realistic characters with overall consistent style and smooth strokes in Fig. 9(a), the harmoniously completed faces even with mismatched face frame and hair color in Fig. 9(b), and the fluent and consistent out-of-domain completion with smooth junctions in Fig. 9(c)-(d). Those surprising results from abused anomalous out-of-domain completion, consistent with existing findings ramesh2021zero ; ramesh2022hierarchical , support the remarkable generalization capability of the presented big learning.

5 Conclusions

We propose big learning for justifying, analyzing, and improving big/foundation models. Our big learning (i) is equipped with marvelous flexibility for both training data and training-task customization; (ii) comprehensively exploits the available data information and potentially delivers all joint/conditional/marginal data capabilities after training; (iii) significantly reduces the training-test (or pretraining-finetuning) gap and thus improves model robustness and generalization capabilities; and (iv) unifies conventional machine learning paradigms and enables their flexible cooperation, manifesting a universal learning paradigm.

Though inspiring, big learning shares the same pros and cons as foundation models bommasani2021opportunities ; yuan2022roadmap . For example, comprehensively demonstrating and evaluating big learning is extremely challenging and time-consuming. Besides, because of its wide scope and the associated massive and diverse training tasks, existing machine-learning building blocks may not be sufficient for big learning; e.g., to our knowledge, a reliable technique to stabilize GAN training with a ViT-based discriminator in high dimensions is still under development. Big learning needs our community, and vice versa.