1 Introduction
A hallmark of general intelligence is the ability to handle open-ended environments, which roughly means complex, diverse environments without manually predefined tasks adams_mapping_2012 . In recent years, although machine learning systems have achieved remarkable success in various human-specified domains krizhevsky_imagenet_2012 ; he_deep_2016 ; mnih_humanlevel_2015 ; vaswani_attention_2017 , the problem of learning in open-ended environments without manual task specification remains largely open goertzel_artificial_2007 ; clune_aigas_2019 ; colas_curious:_2019 ; wang_enhanced_2020 . The main reason is that this problem has not been clearly defined, and the essential difference between open-ended and traditional, “close-ended” environments has not been explicitly identified, rendering this direction vague and elusive. Hence, a formal problem definition and a principled learning framework are crucial for its further development.
In this paper, we study this problem from the perspective of open-ended data, i.e., the data sampled from open-ended environments, in the context of supervised learning. Without manual task definition, open-ended environments may involve multiple tasks simultaneously with no inter-task delineation. Therefore, open-ended data may contain data sampled from multiple tasks, which exhibits an essential structural difference from conventional supervised data. Concretely, open-ended data is a mixture of the data generated by multiple target functions, arising from the emergence of different contexts or the semantic ambiguity of the data itself, which nullifies the assumption in conventional supervised learning that the relation between inputs and outputs can be modeled by a single target function vapnik_nature_2013 . For example, it is rational to map an image of a red sphere to “red” or to “sphere”, and a video clip of a nodding man may be labeled “yes” by some people and “no” by others. To formally characterize the structural property of open-ended data, we introduce a novel, dataset-level concept termed mapping rank, defined as the minimal number of deterministic functions required to “fully express” all input-output relations in the data.
Definition 1 (Mapping rank).
Let $\mathcal{X}$ be an input space, $\mathcal{Y}$ an output space, and $D \subseteq \mathcal{X} \times \mathcal{Y}$ a dataset with cardinality $N$. Let $F$ be a function set with cardinality $c$, in which each element is a single-valued deterministic function from $\mathcal{X}$ to $\mathcal{Y}$. Then, the mapping rank of $D$, denoted by $\mathrm{rank}(D)$, is defined as the minimal positive integer $c$ satisfying that there exists a function set $F$ with $|F| = c$ such that for every $(x, y) \in D$, there exists $f \in F$ with $f(x) = y$.
Under this definition, conventional supervised data has a mapping rank of $1$, as it assumes that each input instance corresponds to a unique output; thus the whole dataset can be expressed using a single function. In contrast, open-ended data has a mapping rank larger than $1$, since different outputs exist for the same input. Hence, conventional supervised learning is problematic for open-ended data: a single function is insufficient to express data with mapping rank larger than $1$, resulting in “conflict” between different sample pairs when running empirical risk minimization (ERM). This phenomenon has been observed by prior works finn_online_2019 ; su_task_2020 , where a vanilla agent trying to regress multiple target functions trivially outputs their mean, leading to an inevitable training error regardless of the model class adopted. Meanwhile, we note that our setting is essentially different from multi-task learning caruana_multitask_1997 and multi-label learning zhang_review_2014 : multi-task learning aims to perform inductive transfer between related tasks, while the mapping rank of the data in each task is $1$; multi-label learning assigns each input a fixed label set containing multiple labels, which also yields a mapping rank of $1$, though over a larger output space. Therefore, for open-ended data, a manual allocation or aggregation process is required to reduce the mapping rank to $1$ to enable effective learning, as shown in Figure 1. This raises an important question: Is it possible to learn directly from open-ended data with mapping rank larger than $1$?
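For a finite dataset, the mapping rank admits a simple characterization: since the covering functions in Definition 1 are otherwise unconstrained, the minimal number of functions needed equals the largest number of distinct outputs paired with any single input. The following sketch (illustrative code, not part of the original formulation) computes it:

```python
from collections import defaultdict

def mapping_rank(pairs):
    """Mapping rank of a finite dataset: the minimal number of
    single-valued deterministic functions needed so that every (x, y)
    pair is realized by at least one of them. For finite data this
    equals the largest number of distinct outputs tied to any input."""
    outputs = defaultdict(set)
    for x, y in pairs:
        outputs[x].add(y)
    return max(len(ys) for ys in outputs.values())

# A red sphere labeled both "red" and "sphere" forces a rank of 2.
data = [("red_sphere", "red"), ("red_sphere", "sphere"), ("blue_cube", "blue")]
assert mapping_rank(data) == 2
```

Conventional supervised data returns $1$ under this computation; any larger value signals the conflict described above.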
In this work, we give an affirmative answer to this question by presenting an Open-ended Supervised Learning (OSL) framework to enable human-free learning from open-ended data. Contrary to previous methods that rely on human effort to passively eliminate the conflict in data, OSL does the opposite by actively leveraging the conflict to establish a hierarchical cognition structure. Concretely, OSL introduces a set of low-level prediction models and a novel, high-level subjective function that automatically allocates the data to these models so that the data processed by each model has no conflict, mimicking the human subjectivity that acts in the manual allocation of conflicting samples. The motivation of this process is that if the subjective function yields an inappropriate allocation, i.e., assigning conflicting data samples to the same model, then the global training error cannot reach its minimum due to this conflict, thus driving the subjective function to improve its allocation. Based on probably approximately correct (PAC) learnability haussler_probably_1990 , we prove that our design overcomes the learnability issue of conventional supervised learning given sufficient low-level models.

The hierarchical nature of OSL leads to a novel global error decomposition consisting of two parts: a high-level subjective error that measures the “rationality” of its data allocation, and a low-level model error that measures the accuracy of the models. This differs from conventional supervised learning, where only the model error exists. Utilizing the tools of statistical learning theory vapnik_nature_2013 , we locate the error terms reflecting both types of error in the generalization error of OSL, and discuss their respective controlling strategies.

We conduct extensive experiments including both regression and classification tasks with open-ended data to verify the efficacy of OSL. Our results show that our OSL agent can effectively learn from open-ended data without additional human intervention, achieving small subjective and model errors simultaneously, and exhibiting human-like task-level cognition with only instance-level supervision.
In summary, we establish a novel, theoretically grounded supervised learning paradigm for open-ended data. Our contributions are threefold:
Open-ended data and mapping rank. We formalize the new problem of learning from open-ended data, and propose a novel concept termed mapping rank to model the structural property of open-ended data, outlining its fundamental difference from conventional supervised data.
OSL framework and theory. We present the OSL framework to enable effective learning from open-ended data (Section 2), and justify its learnability (Section 3) and generalizability (Section 4) based on statistical learning theory.
Global error decomposition and subjective error minimization. We show that our framework induces a novel global error decomposition, and empirically demonstrate that the minimization of the high-level subjective error is crucial for achieving rational task-level cognition (Section 5).
2 Open-Ended Supervised Learning
In this section, we present the formulation of OSL, introducing its sampling process, the global objective, and the derivation of the subjective function. We adhere to the conventional terminology of supervised learning and let $\mathcal{X}$ be an input space, $\mathcal{Y}$ an output space, $\mathcal{H}$ a hypothesis space where each hypothesis (model) is a function from $\mathcal{X}$ to $\mathcal{Y}$, and $\ell$ a non-negative and bounded loss function (without loss of generality). We write $[n] := \{1, \ldots, n\}$ for positive integers $n$, and denote by $\mathbb{1}[\cdot]$ the indicator function. We use superscripts (e.g., $x^i$ and $x^{i,j}$) to denote sampling indices and subscripts (e.g., $h_k$) as element indices.

2.1 Problem Statement
As introduced in Section 1, we consider open-ended data with mapping rank $r > 1$, and it is straightforward to subsume the traditional setting with $r = 1$ within our framework as a degenerate case. To formulate the generation process of open-ended data, we introduce the notion of domain. Inspired by bendavid_theory_2010 , we define a domain as a pair $(\mathcal{D}, f)$ consisting of a distribution $\mathcal{D}$ on $\mathcal{X}$ and a deterministic target function $f\colon \mathcal{X} \to \mathcal{Y}$, and assume that the open-ended data is generated by a domain set $\mathcal{E}$ containing $T$ (agnostic to the learner) domains, resulting in a dataset $D$ with $N$ samples and a mapping rank $r$. Hence, $D$ contains the data from all these domains, yet the exact sample-wise domain information is unavailable. It is easy to verify that $r \le T$, since the target functions in all domains are deterministic (i.e., there is no conflict in intra-domain samples).
We consider a bilevel sampling procedure: first, $n$ domain samples are drawn i.i.d. from a distribution $\tau$ defined on $\mathcal{E}$ (thus the same domain can be sampled multiple times), resulting in $n$ sampling episodes; second, in each episode, $m$ data samples are drawn i.i.d. from the distribution of the sampled domain and labeled by its target function (hence $N = nm$). This sampling regime is analogous to the bilevel sampling process adopted by meta-learning pentina_pacbayesian_2014 ; amit_metalearning_2018 . However, meta-learning typically assumes a dense distribution of related domains to enable task-level generalization, while OSL is compatible with scarce and disparate domains, and inter-domain transfer is not our concern.
In the above setting, an episodic sample number parameter $m$ is introduced to maintain the local consistency of data, implicitly assuming that the data samples in every size-$m$ batch belong to the same domain. We will both theoretically and empirically verify the necessity of this assumption in the following sections. Intuitively, setting $m = 1$ may result in the multiplicity of solutions (see su_task_2020 and Figure 3(b) in our experiments) and fail to maintain the integrity of the original domains.
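The bilevel sampling procedure above can be sketched as follows, with two hypothetical toy domains standing in for the domain set (all names and functions here are illustrative assumptions, not the paper's experimental setup):

```python
import random

random.seed(0)

# Hypothetical domain set: each domain pairs an input sampler (the
# distribution on the input space) with a deterministic target function.
domains = [
    (lambda: random.uniform(-1, 1), lambda x: 2 * x),   # domain 0
    (lambda: random.uniform(-1, 1), lambda x: x ** 2),  # domain 1
]

def sample_episodes(n, m):
    """Bilevel sampling: draw n domains i.i.d. (the same domain may
    recur), then m labeled samples per episode, so every size-m batch
    is domain-consistent and N = n * m samples are produced in total."""
    episodes = []
    for _ in range(n):
        sample_x, target = random.choice(domains)  # top level: tau
        batch = [(x, target(x)) for x in (sample_x() for _ in range(m))]
        episodes.append(batch)
    return episodes

batches = sample_episodes(n=4, m=8)  # N = 32 samples across 4 episodes
```

Pooling all batches and discarding the episode boundaries yields exactly the kind of conflicting, rank-$2$ dataset discussed in Section 1.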
The aim of the learner is to predict each output $y$ given the input $x$. As mentioned in Section 1, a single model is not sufficient in this setting due to the conflict in data. Thus, we equip the learner with a hypothesis set consisting of $K$ hypotheses, enhancing its expressive capability. Although both the number of domains and the mapping rank $r$ are assumed to be unknown, we will show that in general $K \ge r$ suffices (Section 3.2), which eases the difficulty of setting the hyperparameter $K$.

Applications. Open-ended data naturally emerges in various machine learning applications. Note that although most machine learning datasets used today are not open-ended, a mapping rank larger than one is quite common for raw data directly collected from the real world. Therefore, if the machine can learn from open-ended data directly, the cumbersome manual data cleaning process can be avoided. Also, even after manual data cleaning, in many cases the innate conflict in data is inevitable in a single-label regime. For example, in the well-known ImageNet dataset deng_imagenet:_2009 , while each image is assigned a single label, it may contain multiple objects of interest. This makes the annotated label only one of many equally valid descriptions of the image, which has led to issues in both training and evaluation as suggested by recent research shankar_evaluating_2020 ; beyer_are_2020 ; yun_relabeling_2021 . Meanwhile, in some scenarios it is not possible to manually check and resolve the conflict in data from multiple sources in advance due to privacy or other constraints, such as in federated learning where the data is distributed across multiple clients mcmahan_communicationefficient_2017 ; kairouz_advances_2021 . In these scenarios, OSL can be applied to automatically detect and resolve the conflict.

2.2 Global Error
Since the open-ended dataset implicitly contains multiple underlying input-output mapping relations, a natural starting point is the empirical multi-task loss:
$$\hat{E}_{\mathrm{oracle}} = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{m} \sum_{j=1}^{m} \ell\big(h_{o(S^i)}(x^{i,j}),\, y^{i,j}\big) \qquad (1)$$
where $o$ is an oracle mapping function that determines which hypothesis each data batch $S^i = \{(x^{i,j}, y^{i,j})\}_{j=1}^{m}$ is assigned to. However, in OSL the oracle mapping function is unavailable, imposing a fundamental discrepancy. To tackle this difficulty, we substitute the oracle mapping function with a learnable empirical subjective function $\hat{s}$ that aims to select a hypothesis from the hypothesis set for each data batch $S^i$. This substitution yields the empirical global error of OSL:
$$\hat{E} = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{m} \sum_{j=1}^{m} \ell\big(h_{\hat{s}(S^i)}(x^{i,j}),\, y^{i,j}\big) \qquad (2)$$
Our insight is that the data batch itself can be harnessed to guide its suitable allocation in the presence of mapping conflict: a single model trained on conflicting data batches incurs an inevitable training error, hindering the optimization of the global error (2), which in turn facilitates data allocations with less conflict. The corresponding expected global error is
$$E = \mathbb{E}_{(\mathcal{D}, f) \sim \tau}\, \mathbb{E}_{x \sim \mathcal{D}} \Big[ \ell\big(h_{s(\mathcal{D}, f)}(x),\, f(x)\big) \Big] \qquad (3)$$
where $s$ is an expected subjective function, which can be viewed as the empirical subjective function with infinitely many samples from every single domain, so that all available domain information is fully reflected by the data samples. The global objective (3) characterizes the test process where the learner usually experiences only one task (domain) in a particular time period (which is natural in the real world); therefore its performance can be tested separately in all domains.
So far, our framework remains incomplete since the subjective function remains undefined. In the next section, we present our design of the subjective function and elucidate its rationale.
2.3 Design of the Subjective Function
To attain a reasonable choice of the subjective function, in this section we provide an alternative, probabilistic view of OSL from the angle of maximum conditional likelihood, and draw an intriguing connection between the choice of the subjective function and posterior maximization using expectation-maximization (EM) dempster_maximum_1977 on the open-ended data.

Let $p_k(y \mid x)$ represent the predictive conditional distribution induced by the $k$-th hypothesis in the hypothesis set. We consider maximizing the empirical log-likelihood of the data under the hypothesis selected for each batch. Using EM, in the M-step we seek to maximize the objective
$$Q = \sum_{i=1}^{n} \sum_{k=1}^{K} \gamma^{i}_{k} \Big( \log \pi_k + \sum_{j=1}^{m} \log p_k\big(y^{i,j} \mid x^{i,j}\big) \Big) \qquad (4)$$
where $\gamma^{i}_{k}$ denotes the responsibility of the $k$-th hypothesis in the hypothesis set w.r.t. the local data batch $S^i$, and $\pi_k$ denotes the prior of the $k$-th hypothesis; we adopt the local consistency assumption, i.e., data in the same batch shares the same hypothesis. This draws a direct connection between the empirical posterior calculation of $\gamma^{i}_{k}$ in the E-step and the empirical subjective function. The main difference is that here we assume the subjective function to be deterministic, representing a hard assignment of the data to the task. This entails the usage of a variant known as hard EM samdani_unified_2012 , which considers the posterior to be a Dirac delta function. Applying this constraint, the E-step assigns all responsibility to the hypothesis with the highest local likelihood under a uniform prior, motivating a principled choice of the subjective function:
$$\hat{s}(S) = \arg\min_{k \in [K]} \frac{1}{|S|} \sum_{(x, y) \in S} \ell\big(h_k(x),\, y\big) \qquad (5a)$$

$$s(\mathcal{D}, f) = \arg\min_{k \in [K]} \mathbb{E}_{x \sim \mathcal{D}}\, \ell\big(h_k(x),\, f(x)\big) \qquad (5b)$$
which can be interpreted as selecting the hypothesis that incurs the smallest (empirical or expected) error. While this connection is not rigorous in general, in some cases exact equivalence can be obtained when certain types of loss functions and likelihood families are applied, which encompasses common regression and classification settings. We provide concrete examples in Appendix A.
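The empirical subjective function (5a) suggests a simple select-then-train loop: score every hypothesis on the incoming batch, then update only the winner. A minimal sketch with hypothetical scalar linear models and squared loss (a toy stand-in for the parameterized models used later, not the paper's implementation):

```python
import random

random.seed(1)

K, lr = 2, 0.1
theta = [random.gauss(0, 0.1) for _ in range(K)]  # K scalar models h_k(x) = theta_k * x

def episodic_error(k, batch):
    # Mean squared loss of hypothesis k on one episode's batch.
    return sum((theta[k] * x - y) ** 2 for x, y in batch) / len(batch)

def osl_step(batch):
    # Empirical subjective function (5a): pick the hypothesis with the
    # smallest error on this batch...
    k = min(range(K), key=lambda j: episodic_error(j, batch))
    # ...then take one SGD step on the selected hypothesis only.
    grad = sum(2 * (theta[k] * x - y) * x for x, y in batch) / len(batch)
    theta[k] -= lr * grad
    return k

# Two conflicting target functions, y = 3x and y = -3x: rank-2 data.
for _ in range(2000):
    slope = random.choice([3.0, -3.0])
    batch = [(x, slope * x) for x in (random.uniform(-1, 1) for _ in range(8))]
    osl_step(batch)
```

After training, the two models split the conflicting targets between them (one slope near $3$, the other near $-3$), which is exactly the allocation behavior the subjective function is meant to induce.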
3 Algorithm
In this section, we present our OSL algorithm, discuss the form of its optimal solutions, and analyze its learnability based on the PAC learnability framework.
3.1 Algorithm Overview
We assume that the hypothesis set comprises $K$ parameterized hypotheses with parameter vectors $\theta_1, \ldots, \theta_K$, respectively. At a high level, with the choice of the empirical subjective function (5a), our algorithm consists of two phases in each sampling episode: (i) evaluating the error of each hypothesis in the hypothesis set w.r.t. the data in this episode, and (ii) training the hypothesis with the smallest error. For brevity, we introduce a notion of empirical episodic error, defined as

$$\hat{e}^{\,i}(\theta) = \frac{1}{m} \sum_{j=1}^{m} \ell\big(h_\theta(x^{i,j}),\, y^{i,j}\big) \qquad (6)$$
where $h_\theta$ is a hypothesis parameterized by $\theta$. Then, phase (i) aims to find a hypothesis that minimizes (6). Note that this selection process may induce a bias between the empirical and expected objectives (2) and (3), since a hypothesis that minimizes the empirical loss on finite samples may not minimize the expected loss of the domain. Hence, the global error of OSL can be intuitively decomposed into a high-level subjective error that measures the reliability of the model selection, and a low-level model error that measures the accuracy of the models, of which we provide a detailed theoretical analysis in Section 4. In practice, we parameterize each hypothesis in the hypothesis set with a deep neural network (DNN), and apply stochastic gradient descent (SGD) for the optimization. We provide the pseudocode of OSL in Appendix B.

3.2 PAC Analysis
We now analyze the learnability of OSL based on PAC learnability valiant_theory_1984 . Since our analysis directly applies to conventional supervised learning by setting $K = 1$, we also verify the conflict phenomenon mentioned in Section 1 from a theoretical perspective. While learnability in conventional PAC analysis mainly relates to the choice of the hypothesis space, open-ended data imposes a new source of complexity through its mapping rank, and we expect the cardinality of the proposed hypothesis set to compensate for this complexity. We consider the realizable case where the hypothesis space covers the target functions in all domains, e.g., when complex hypothesis spaces such as parameterized DNNs are adopted, which is practical and helps underline the core characteristic of our problem. The proofs are in Appendix C.1 and C.2. We begin with a result on the form of the optimal solutions of OSL.
Theorem 1 (Form of the optimal solutions of OSL).
Assume that the target functions in all domains are realizable. Then, the following two propositions are equivalent:

(1) For all domain distributions and data distributions, the expected global error (3) equals zero.

(2) For each domain in the domain set, there exists a hypothesis in the hypothesis set that agrees with the domain's target function on the support of its data distribution.
Theorem 1 suggests that minimizing the expected global error (3) with (5b) elicits a globally optimal solution where every target function is learned accurately in the realizable case. Note that this does not require $K \ge T$ hypotheses for $T$ domains, since non-conflicting domains can be incorporated into the same model. In other words, what determines the minimal cardinality of the hypothesis set is not the number of domains, but the number of conflicting domains, which is exactly characterized by the mapping rank. Formally, we attain a necessary condition for the PAC learnability of OSL.
Theorem 2.
A necessary condition of the PAC learnability of OSL is $K \ge r$, where $K$ is the cardinality of the hypothesis set and $r$ is the mapping rank of the data.
Theorem 2 indicates that the cardinality of the hypothesis set should be large enough to enable effective learning, and shows the impact of mapping rank on the learnability of OSL.
Remark 1.
While it is generally hard to derive a necessary and sufficient condition for PAC learnability theoretically (which requires a sample-efficient optimization algorithm), we empirically find that $K \ge r$ is indeed the essential condition for learnability with complex hypothesis spaces (parameterized DNNs). We also note that several recent works allenzhu_convergence_2019 ; du_gradient_2019 have proved that over-parameterized neural networks trained by SGD can achieve zero training error in polynomial time despite non-convexity, which may also be used to enhance our analysis. We leave a more rigorous study for future work.
4 Controlling the Generalization Error
We have shown that minimizing the expected global error is sufficient for effective learning from open-ended data. However, in practice, since we only have access to the empirical global error, how to control the discrepancy between these two errors, i.e., the generalization error, remains crucial. In this section, we identify the terms in the generalization error that respectively correspond to the subjective error and the model error in the error decomposition of OSL, and discuss their controlling strategies. Our key findings are: (i) the number of episodes and the number of episodic samples can compensate for each other in controlling the generalization error incurred by the models (the instance estimation error), and (ii) the number of episodic samples is critical for controlling the generalization error incurred by the subjective function (the subjective estimation error). The proof is deferred to Appendix C.3.

Theorem 3 (Generalization error bound of OSL).
For any $\delta \in (0, 1)$, the following inequality holds uniformly for all hypothesis sets with probability at least $1 - \delta$:
(7a)  
(7b)  
(7c) 
where the first complexity term is taken over the function set of domain-wise expected errors, the second over the function set of sample-wise errors, $n_t$ is the sampling count of the target function of the $t$-th domain, and $\mathrm{VC}(\cdot)$ denotes the Vapnik-Chervonenkis (VC) dimension vapnik_uniform_1971 of a real-valued function set.
Theorem 3 indicates that the expected global error is bounded by the empirical global error plus three terms. The subjective estimation error term (7c) is derived by bounding the discrepancy between the empirical and expected subjective functions due to the limitation of finite episodic samples (the detailed derivation is in Appendix C.3). This term can be controlled by the sample-level complexity term and the number of episodic samples $m$, which certifies the necessity of the local consistency assumption in Section 2. Although in theory this error term converges to zero only if $m \to \infty$, in practice we find that usually a very small $m$ suffices (see Section 5). We posit that this is because the domains in our experiments are relatively diverse, thus reducing the difficulty of discriminating between different domains. The domain estimation error term (7a) contains a domain-level complexity term, and converges to zero as the number of episodes reaches infinity ($n \to \infty$); the instance estimation error term (7b) contains sample-level complexity terms, and converges to zero as the number of samples in each episode or the number of episodes reaches infinity ($m \to \infty$ or $n \to \infty$), showing the synergy between high-level domain samples and low-level data samples in controlling the model-wise generalization error.
Remark 2.
We compare our bound (7) with existing bounds for conventional supervised learning vapnik_nature_2013 ; mcallester_pacbayesian_1999 and meta-learning pentina_pacbayesian_2014 ; amit_metalearning_2018 . Typically, supervised learning bounds contain an instance-level complexity term as in (7b), and meta-learning bounds further contain a task-level complexity term as in (7a). Yet, conventional supervised learning only considers a single domain or multiple known domains, while meta-learning treats each episode as a new domain rather than a domain that may have been encountered before as in OSL. Thus, none of these bounds contains an explicit inference term as in (7c).
Remark 3.
Although our bound adopts the conventional VC dimension as the complexity measure, it is also compatible with other data-dependent complexity measures, e.g., the Rademacher and Gaussian complexities bartlett_rademacher_2002 ; bartlett_model_2002 ; koltchinskii_rademacher_2000 ; koltchinskii_rademacher_2001 ; koltchinskii_empirical_2002 . It is worth noting that the bounds based on these measures share the same asymptotic properties w.r.t. $n$ and $m$ as our bound above.
5 Experiments
Open-ended data exists in a wide range of machine learning domains. In this section, we consider two basic supervised learning tasks: regression and classification. Our experiments are designed to (i) validate our theoretical claims, (ii) assess the effectiveness of OSL by measuring both its subjective and model errors, (iii) compare OSL with task-specific baselines, and (iv) investigate the relation between minimizing the subjective error and achieving rational (human-like) task-level cognition.
(Figure: sample images from Colored MNIST and Fashion Product Images.)
5.1 Setup and Baselines
Open-ended regression. We consider an open-ended regression task similar to our motivating example in Figure 1. In this task, data points are sampled from three heterogeneous functions with widely varied shapes, as shown in Figure 2(a) (solid lines). As mentioned in Section 1, traditional supervised learning fails at this task due to the mapping conflict.
In this task, we compare OSL with three baselines. (1) Vanilla: a conventional ERM-based learner. (2) MAML finn_modelagnostic_2017 : a popular gradient-based meta-learning approach. (3) Modular alet_modular_2018 : a modular meta-learning approach that employs multiple modules to foster combinatorial generalization. To verify our theoretical results, we also run OSL with different numbers of hypotheses and different sampling hyperparameters. More details, including all hyperparameter settings, can be found in Appendix D.
Open-ended classification. We consider two open-ended image recognition tasks where the same image may correspond to different labels in different sample pairs. We refer to these tasks according to the structure of their label spaces, namely the parallel and hierarchical tasks, representing two common real-world application cases. In detail, for the parallel task, we derive the data from Colored MNIST, in which each digit image is assigned a number label and a color label, and from Fashion Product Images fashion_product_dataset , a multi-attribute clothes dataset that involves 3 main parallel tasks, namely gender, category, and color classification, as shown in Figure 1(a); we construct our open-ended datasets by randomly choosing one label from the label set of each image. For the hierarchical task, we derive the data from CIFAR-100 krizhevsky2009learning , a widely-used image recognition dataset comprising 100 classes with “fine” labels subsumed by 20 superclasses with “coarse” labels, as shown in Figure 1(b); we construct our open-ended dataset by randomly using the fine label or the coarse label for each image.
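This construction can be sketched as follows, with hypothetical records standing in for the multi-attribute annotations (image ids and labels are illustrative; in our reading, each episode draws its labels from a single attribute so that batches stay domain-consistent, matching the bilevel sampling of Section 2.1):

```python
import random

random.seed(0)

# Hypothetical records mimicking Fashion Product Images: each image id
# carries one label per attribute (gender, category, color).
records = [
    ("img_001", ("men", "tshirt", "blue")),
    ("img_002", ("women", "dress", "red")),
    ("img_003", ("men", "shoes", "black")),
]

def sample_episode(records, m, num_tasks=3):
    """One episode of the open-ended construction: pick one attribute
    (task) at random, then label m randomly drawn images with that
    attribute only. Across episodes the same image can thus appear
    with labels from different tasks, yielding a mapping rank > 1."""
    t = random.randrange(num_tasks)
    return [(img, labels[t]) for img, labels in
            (random.choice(records) for _ in range(m))]

episodes = [sample_episode(records, m=4) for _ in range(50)]
```

The hierarchical (CIFAR-100-style) variant would instead pick between a coarse and a fine labeling per episode, but the mechanism is the same.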
Compared with open-ended regression, open-ended classification is a more practical yet less challenging task, since its output space (label space) is discrete and finite. Therefore, it can alternatively be modeled by more existing methods that we would like to compare. (1) Probabilistic concepts (ProbCon): this baseline considers the general scenario in which the relation between inputs and outputs depends on a conditional probability distribution devroye_probabilistic_1996 . We adopt a DNN to learn this probability distribution using the cross-entropy loss and choose the top labels as the final prediction results. (2) Semi-supervised multi-label learning: this class of methods models open-ended classification as the problem of “multi-label learning with missing data” Dhillon2014Large ; bi2014multilabel ; huang2019improving , i.e., only one label in each label set is available. We compare OSL with two representative semi-supervised methods: Pseudo-label (PseudoL) lee2013pseudo and Label propagation (LabelProp) tang2011image ; iscen2019label . In addition, we introduce two oracle baselines using additional information. (3) Full labels (FullL): a standard multi-label learning method where we provide the full label set for each image, hence there are no missing labels. (4) Full tasks (FullT): a standard multi-task learning method where the “task” of each image is designated by human experts in advance to ensure that there is no mapping conflict. More details can be found in Appendix D.

5.2 Evaluation Metrics
Due to its hierarchical nature, the global error of OSL is related to the error of both the highlevel subjective function and the lowlevel models. Since these two types of errors are mutually influenced and thus hard to measure solely from the overall performance of the agent, we respectively propose two metrics to quantitatively estimate these errors.
Subjective error. This metric measures the learner's ability to perform appropriate data allocation. Given a domain, a suitable subjective function should yield stable allocations for all data batches sampled from that domain. Thus, we measure the error of the subjective function using the rate of inconsistent data allocations, which we define as
$$\mathrm{SubErr}(t) = \min_{k \in [K]} \frac{1}{n_t} \sum_{i\,:\, t^i = t} \mathbb{1}\big[\hat{s}(S^i) \neq k\big] \qquad (8)$$
where $t^i$ denotes the domain from which episode $i$ is sampled, and $n_t$ denotes the total number of batches sampled from domain $t$ (the same below).
Model error. This metric measures the learner's ability to make accurate in-domain predictions, which is analogous to traditional single-task error metrics. Given a domain $t$, this metric takes the following form:
$$\mathrm{ModErr}(t) = \frac{1}{n_t m} \sum_{i\,:\, t^i = t} \sum_{j=1}^{m} \ell\big(h_{\hat{s}(S^i)}(x^{i,j}),\, y^{i,j}\big) \qquad (9)$$
where we apply the squared error for regression and the 0-1 error for classification.
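Both metrics can be sketched in a few lines (an illustrative reading of (8) and (9): the subjective error is the fraction of a domain's batches not routed to that domain's most frequently selected hypothesis, and the model error is the mean per-sample loss of the selected models):

```python
from collections import Counter

def subjective_error(allocations):
    """Inconsistency rate (8): the fraction of a domain's batches not
    assigned to that domain's most frequently selected hypothesis."""
    counts = Counter(allocations)
    return 1.0 - counts.most_common(1)[0][1] / len(allocations)

def model_error(preds_and_targets):
    """Mean per-sample loss (9) of the selected models on one domain
    (squared error here; a 0-1 loss would be used for classification)."""
    return sum((p - t) ** 2 for p, t in preds_and_targets) / len(preds_and_targets)

# Batches of one domain were routed to hypothesis 0 in 9 of 10 episodes,
# so the inconsistency rate is 1/10.
assert abs(subjective_error([0] * 9 + [1]) - 0.1) < 1e-9
```

Separating the two metrics this way lets us attribute a failure either to unstable routing (high subjective error) or to inaccurate models (high model error).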
5.3 Empirical Results and Comparisons
Open-ended regression. We compare the performance of OSL and the baselines in Figure 4. Unsurprisingly, the vanilla baseline converges to a trivial mean function (dashed curve in Figure 2(a)). MAML successfully predicts the left part of all target functions by fine-tuning on episodic samples, but fails on the right part where the functions exhibit larger differences. We hypothesize that this is because meta-learning typically requires tasks that are sufficient in number and more similar to one another. Although Modular correctly predicts the general trend of the curves, its predictions are still inaccurate in fine-grained details. Note that both meta-learning methods use more episodic data samples than OSL (see Appendix D.3.1). Meanwhile, OSL with $K \ge r$ successfully distinguishes the different functions and recovers each of them accurately, while OSL with $K < r$ fails, which matches our analysis in Section 3.2. In particular, OSL with $K > r$ automatically leaves one network redundant (dashed curve in Figure 2(f)), demonstrating the robustness of our algorithm.
Figure 4 illustrates the impact of the sampling hyperparameters on OSL. Concretely, OSL with fewer sampling episodes induces a large model error (9), which corresponds to the instance estimation term in the generalization error (7b), since the product $nm$ is not sufficiently large. On the other hand, OSL with fewer episodic samples induces a large subjective error (8), which corresponds to the subjective estimation term in the generalization error (7c). Another interesting phenomenon is that the curves in Figure 3(b) are partially swapped when the subjective error is large. This indicates that minimizing the subjective error subserves better task-level cognition.
Table 1: Subjective error (SubErr) and model error (ModErr) of each method on the open-ended classification datasets (lower is better).

Colored MNIST (Num = number, Col = color):

| Method | SubErr Num | SubErr Col | ModErr Num | ModErr Col |
|---|---|---|---|---|
| ProbCon | 0.09 | 0.34 | 3.04 | 0.39 |
| PseudoL | 6.62 | 9.25 | 7.47 | 10.08 |
| LabelProp | 7.53 | 0.28 | 11.52 | 13.57 |
| OSL (ours) | 0.10 | 0.00 | 1.70 | 0.03 |
| FullL | 0.23 | 0.00 | 1.02 | 0.00 |
| FullT | 0 | 0 | 1.20 | 0.00 |

Fashion Product Images (Gen = gender, Cat = category, Col = color):

| Method | SubErr Gen | SubErr Cat | SubErr Col | ModErr Gen | ModErr Cat | ModErr Col |
|---|---|---|---|---|---|---|
| ProbCon | 8.81 | 13.54 | 12.50 | 22.80 | 18.02 | 33.15 |
| PseudoL | 4.95 | 5.40 | 14.59 | 33.69 | 20.04 | 34.06 |
| LabelProp | 2.91 | 7.22 | 21.97 | 14.59 | 50.43 | 64.48 |
| OSL (ours) | 0.00 | 0.00 | 0.00 | 7.87 | 1.93 | 12.85 |
| FullL | 1.19 | 0.86 | 7.44 | 8.46 | 1.17 | 9.45 |
| FullT | 0 | 0 | 0 | 7.14 | 1.90 | 11.04 |

CIFAR-100 (Sup = superclass, Cla = class):

| Method | SubErr Sup | SubErr Cla | ModErr Sup | ModErr Cla |
|---|---|---|---|---|
| ProbCon | 5.59 | 5.44 | 27.35 | 31.20 |
| PseudoL | 9.26 | 8.38 | 28.46 | 38.96 |
| LabelProp | 18.82 | 10.34 | 66.62 | 45.17 |
| OSL (ours) | 1.05 | 0.82 | 21.40 | 25.05 |
| FullL | 7.84 | 0.84 | 22.08 | 26.29 |
| FullT | 0 | 0 | 21.11 | 25.08 |
Open-ended classification. Table 1 shows the results of OSL and the baselines on the open-ended classification tasks. On all tasks, OSL outperforms all baselines on both the subjective error and the model error. It is also worth noting that, compared to the oracle FullL with full label annotations, OSL still induces a smaller subjective error, showing a strong capability of task-level cognition that resembles the “ground-truth” cognition of human experts (FullT). In addition, we visualize the features output by OSL on Fashion Product Images, as shown in Figure 5. The visualization shows that different models of OSL automatically focus on different label subspaces corresponding to different domains, which further complements our empirical results.
6 Related Work
Apart from the formulation in this paper, open-ended data may also be formulated using the framework of multi-label learning with partial labels zhang_review_2014 or probabilistic density estimation, such as probabilistic concepts kearns_efficient_1994 ; kearns_toward_1994 ; devroye_probabilistic_1996 and the energy-based learning framework lecun_tutorial_2006 . A fundamental difference between these formulations and OSL is that these methods learn a unified model without a cognition hierarchy, while our method explicitly encourages the learner to perform high-level data allocation, resulting in a set of independently reusable and interpretable models. A recent work by Su et al. su_task_2020 studies a problem similar to ours, where the goal of the learner is to learn from multi-task samples without task annotation, but they adopt a different objective function and a centralized gating network for data allocation, while the subjective function in OSL is fully decentralized and thus more compatible with parallelization; moreover, our work provides formal theoretical justification.
Extensive literature has investigated the collaboration of multiple models (or modules) in completing one or more tasks doya_multiple_2002 ; shazeer_outrageously_2017 ; meyerson_modular_2019 ; alet_modular_2018 ; nagabandi_deep_2019 ; chen_modular_2020 ; yang_multitask_2020 . The difference between our work and these works is that our multi-model architecture is driven by the inherent conflict in open-ended data without manual alignment between models and tasks, and we only allow a single low-level model to be invoked during a particular sampling episode.
7 Discussion
An immense open problem in machine learning is to identify the main factors that contribute to the gap between contemporary machine intelligence and general intelligence. While various viewpoints exist, recently the notion of open-ended environments has been advocated by an increasing number of works goertzel_artificial_2007 ; clune_aigas_2019 ; silver_reward_2021 . Nevertheless, to the best of our knowledge, we are the first to formalize this notion in the context of supervised learning and to provide a general learning framework with theoretical guarantees. We hope that this work can facilitate future research in this direction.
Limitations and future work. An important limitation of the current work is that it can only attain rational task cognition when there is indeed conflict between data samples; in contrast, humans may also allocate the data to different tasks when there is no conflict, e.g., consider a digit recognition task and a face recognition task, where the data from these two tasks is not conflicting. Therefore, how to develop a task cognition objective for general scenarios remains an important challenge. Also, our formulation is limited to data that is fully informative, i.e., absolute predictions can be made given the inputs. While this assumption is valid in a variety of machine learning applications, it is interesting to devise algorithms that can also handle data with inherent stochasticity. Other important future work includes developing lifelong learning agents that benefit from the growing diversity of open-ended data as the sampling process continuously proceeds, and extending our framework to reinforcement learning domains.
Acknowledgments
This work was supported in part by the National Natural Science Foundation of China under Grants 61671266, 61836004, and 61836014, and in part by the Tsinghua-Guoqiang research program under Grant 2019GQG0006. The authors would like to thank Chongkai Gao for helpful discussions.
References
 [1] Sam Adams, Itamar Arel, Joscha Bach, Robert Coop, Rod Furlan, Ben Goertzel, J. Storrs Hall, Alexei Samsonovich, Matthias Scheutz, Matthew Schlesinger, Stuart C. Shapiro, and John Sowa. Mapping the landscape of human-level artificial general intelligence. AI Magazine, 33(1):25–42, 2012.
 [2] Param Aggarwal. Fashion product images dataset. https://www.kaggle.com/paramaggarwal/fashionproductimagesdataset, 2019.
 [3] Ferran Alet, Tomás Lozano-Pérez, and Leslie P. Kaelbling. Modular meta-learning. In CoRL, 2018.

 [4] Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. A convergence theory for deep learning via over-parameterization. In ICML, 2019.
 [5] Ron Amit and Ron Meir. Meta-learning by adjusting priors based on extended PAC-Bayes theory. In ICML, 2018.
 [6] Peter L. Bartlett, Stephane Boucheron, and Gábor Lugosi. Model selection and error estimation. Machine Learning, 48:85–113, 2002.
 [7] Peter L. Bartlett and Shahar Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3:463–482, 2002.
 [8] Shai Ben-David, John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jennifer Wortman Vaughan. A theory of learning from different domains. Machine Learning, 79(1-2):151–175, 2010.
 [9] Lucas Beyer, Olivier J. Hénaff, Alexander Kolesnikov, Xiaohua Zhai, and Aäron van den Oord. Are we done with ImageNet? arXiv preprint arXiv:2006.07159, 2020.
 [10] Wei Bi and James Kwok. Multi-label classification with label correlations and missing labels. In AAAI, 2014.
 [11] Rich Caruana. Multitask learning. Machine learning, 28(1):41–75, 1997.
 [12] Yutian Chen, Abram L. Friesen, Feryal Behbahani, Arnaud Doucet, David Budden, Matthew W. Hoffman, and Nando de Freitas. Modular meta-learning with shrinkage. In Advances in Neural Information Processing Systems, 2020.
 [13] Jeff Clune. AI-GAs: AI-generating algorithms, an alternate paradigm for producing general artificial intelligence. arXiv preprint arXiv:1905.10985, 2019.
 [14] Cédric Colas, Pierre Fournier, Olivier Sigaud, Mohamed Chetouani, and Pierre-Yves Oudeyer. CURIOUS: Intrinsically motivated multi-task, multi-goal reinforcement learning. In ICML, 2019.
 [15] Arthur P. Dempster, Nan M. Laird, and Donald B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B (Methodological), 39(1):1–22, 1977.
 [16] Jia Deng, Wei Dong, Richard Socher, LiJia Li, Kai Li, and Li FeiFei. ImageNet: A largescale hierarchical image database. In CVPR, 2009.

 [17] Luc Devroye, László Györfi, and Gábor Lugosi. A probabilistic theory of pattern recognition, volume 31 of Stochastic Modelling and Applied Probability. Springer, New York, 1996.
 [18] Kenji Doya, Kazuyuki Samejima, Kenichi Katagiri, and Mitsuo Kawato. Multiple model-based reinforcement learning. Neural Computation, 14(6):1347–1369, 2002.
 [19] Simon S. Du, Jason D. Lee, Haochuan Li, Liwei Wang, and Xiyu Zhai. Gradient descent finds global minima of deep neural networks. In ICML, 2019.
 [20] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In ICML, 2017.
 [21] Chelsea Finn, Aravind Rajeswaran, Sham Kakade, and Sergey Levine. Online meta-learning. In ICML, 2019.
 [22] Ben Goertzel and Cassio Pennachin. Artificial general intelligence, volume 2. 2007.
 [23] David Haussler. Probably approximately correct learning. In AAAI, 1990.
 [24] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
 [25] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In CVPR, 2017.
 [26] Jun Huang, Feng Qin, Xiao Zheng, Zekai Cheng, Zhixiang Yuan, Weigang Zhang, and Qingming Huang. Improving multi-label classification with missing labels by learning label-specific features. Information Sciences, 492:124–146, 2019.

 [27] Ahmet Iscen, Giorgos Tolias, Yannis Avrithis, and Ondrej Chum. Label propagation for deep semi-supervised learning. In CVPR, 2019.
 [28] Peter Kairouz, H. Brendan McMahan, Brendan Avent, Aurélien Bellet, Mehdi Bennis, Arjun Nitin Bhagoji, Kallista Bonawitz, Zachary Charles, Graham Cormode, Rachel Cummings, Rafael G. L. D'Oliveira, Hubert Eichner, Salim El Rouayheb, David Evans, Josh Gardner, Zachary Garrett, Adrià Gascón, Badih Ghazi, Phillip B. Gibbons, Marco Gruteser, Zaid Harchaoui, Chaoyang He, Lie He, Zhouyuan Huo, Ben Hutchinson, Justin Hsu, Martin Jaggi, Tara Javidi, Gauri Joshi, Mikhail Khodak, Jakub Konečný, Aleksandra Korolova, Farinaz Koushanfar, Sanmi Koyejo, Tancrède Lepoint, Yang Liu, Prateek Mittal, Mehryar Mohri, Richard Nock, Ayfer Özgür, Rasmus Pagh, Mariana Raykova, Hang Qi, Daniel Ramage, Ramesh Raskar, Dawn Song, Weikang Song, Sebastian U. Stich, Ziteng Sun, Ananda Theertha Suresh, Florian Tramèr, Praneeth Vepakomma, Jianyu Wang, Li Xiong, Zheng Xu, Qiang Yang, Felix X. Yu, Han Yu, and Sen Zhao. Advances and open problems in federated learning. Foundations and Trends in Machine Learning, 14(1-2):1–210, 2021.
 [29] Michael J. Kearns and Robert E. Schapire. Efficient distribution-free learning of probabilistic concepts. Journal of Computer and System Sciences, 48(3):464–497, 1994.
 [30] Michael J Kearns, Robert E Schapire, and Linda M Sellie. Toward efficient agnostic learning. Machine Learning, 17:115–141, 1994.
 [31] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
 [32] Vladimir Koltchinskii. Rademacher penalties and structural risk minimization. IEEE Transactions on Information Theory, 47(5):1902–1914, 2001.
 [33] Vladimir Koltchinskii and Dmitriy Panchenko. Rademacher processes and bounding the risk of function learning. In High dimensional probability II, pages 443–457. Springer, 2000.

 [34] Vladimir Koltchinskii and Dmitry Panchenko. Empirical margin distributions and bounding the generalization error of combined classifiers. Annals of Statistics, 30(1):1–50, 2002.
 [35] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. 2009.

 [36] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, 2012.
 [37] Yann LeCun, Sumit Chopra, Raia Hadsell, M. Ranzato, and F. Huang. A tutorial on energy-based learning. Predicting Structured Data, 2006.
 [38] Dong-Hyun Lee. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In Workshop on Challenges in Representation Learning, ICML, 2013.
 [39] David A. McAllester. PAC-Bayesian model averaging. In COLT, 1999.
 [40] H. Brendan McMahan, Eider Moore, Daniel Ramage, and Seth Hampson. Communication-efficient learning of deep networks from decentralized data. In AISTATS, 2017.
 [41] Elliot Meyerson and Risto Miikkulainen. Modular universal reparameterization: Deep multitask learning across diverse domains. In Advances in Neural Information Processing Systems, 2019.
 [42] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, and Georg Ostrovski. Humanlevel control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
 [43] Anusha Nagabandi, Chelsea Finn, and Sergey Levine. Deep online learning via meta-learning: Continual adaptation for model-based RL. In ICLR, 2019.
 [44] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, 2019.
 [45] Anastasia Pentina and Christoph H. Lampert. A PAC-Bayesian bound for lifelong learning. In ICML, 2014.
 [46] Rajhans Samdani, MingWei Chang, and Dan Roth. Unified expectation maximization. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2012.
 [47] Vaishaal Shankar, Rebecca Roelofs, Horia Mania, Alex Fang, Benjamin Recht, and Ludwig Schmidt. Evaluating machine accuracy on ImageNet. In ICML, 2020.
 [48] Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. In ICLR, 2017.
 [49] David Silver, Satinder Singh, Doina Precup, and Richard S. Sutton. Reward is enough. Artificial Intelligence, 2021.
 [50] Xin Su, Yizhou Jiang, Shangqi Guo, and Feng Chen. Task understanding from confusing multitask data. In ICML, 2020.
 [51] Jinhui Tang, Richang Hong, Shuicheng Yan, Tat-Seng Chua, Guo-Jun Qi, and Ramesh Jain. Image annotation by kNN-sparse graph-based label propagation over noisily tagged web images. ACM Transactions on Intelligent Systems and Technology, 2(2):1–15, 2011.
 [52] Leslie G. Valiant. A theory of the learnable. Communications of the ACM, 27(11):1134–1142, 1984.
 [53] V. N. Vapnik and A. Ya Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, 16(2):264–280, 1971.
 [54] Vladimir Vapnik. The nature of statistical learning theory. Springer science & business media, 2013.
 [55] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, 2017.
 [56] Rui Wang, Joel Lehman, Aditya Rawal, Jiale Zhi, Yulun Li, Jeff Clune, and Kenneth O. Stanley. Enhanced POET: Open-ended reinforcement learning through unbounded invention of learning challenges and their solutions. In ICML, 2020.
 [57] Ruihan Yang, Huazhe Xu, Yi Wu, and Xiaolong Wang. Multi-task reinforcement learning with soft modularization. In Advances in Neural Information Processing Systems, 2020.
 [58] Hsiang-Fu Yu, Prateek Jain, Purushottam Kar, and Inderjit Dhillon. Large-scale multi-label learning with missing labels. In ICML, 2014.
 [59] Sangdoo Yun, Seong Joon Oh, Byeongho Heo, Dongyoon Han, Junsuk Choe, and Sanghyuk Chun. Re-labeling ImageNet: From single to multi-labels, from global to localized labels. In CVPR, 2021.
 [60] Min-Ling Zhang and Zhi-Hua Zhou. A review on multi-label learning algorithms. IEEE Transactions on Knowledge and Data Engineering, 26(8):1819–1837, 2014.
Checklist

1. For all authors…
   (a) Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope?
   (b) Did you discuss any potential negative societal impacts of your work?
   (c) Have you read the ethics review guidelines and ensured that your paper conforms to them?
2. If you are including theoretical results…
   (a) Did you state the full set of assumptions of all theoretical results?
   (b) Did you include complete proofs of all theoretical results? See Appendix C.
3. If you ran experiments…
   (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? See the supplemental material.
   (b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? See Appendix D.
   (c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)?
   (d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? See Appendix D.
4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets…
   (a) If your work uses existing assets, did you cite the creators? See Appendix D.
   (b) Did you mention the license of the assets? See Appendix D.
   (c) Did you include any new assets either in the supplemental material or as a URL?
   (d) Did you discuss whether and how consent was obtained from people whose data you're using/curating? See Appendix D.
   (e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? See Appendix D.
5. If you used crowdsourcing or conducted research with human subjects…
   (a) Did you include the full text of instructions given to participants and screenshots, if applicable?
   (b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable?
   (c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation?

Appendix A Discussion on the Subjective Function
In Section 2.3, we derive our design of the subjective function using an EM-based maximum likelihood formulation. However, the exact equivalence between the E-step of hard EM and the subjective function has not been established, since it relies on the exact form of the loss function and the conditional likelihood. As a complement, in the sequel we provide two examples where, under the uniform prior, the exact equivalence between calculating the posterior and the expected subjective function (5b) can be obtained (hence their empirical counterparts are also equivalent). Recall that under the uniform prior, the E-step of hard EM selects the model with the maximal conditional likelihood.
Example 1 (Regression with isotropic Gaussian joint likelihood).
Consider a regression task where we assume that, given the input random variable, the joint distribution of its conditional likelihoods conforms to an isotropic Gaussian with a shared variance parameter and an identity covariance structure. Taking the negative logarithm of the likelihood, maximizing the posterior under the uniform prior is then equivalent to minimizing the expected subjective function with the squared loss.
Example 2 (Classification with independent categorical likelihoods).
Consider a multi-class classification task where we assume that, given the input random variable, all conditional likelihoods conform to independent categorical distributions whose parameters form valid probability vectors over the total number of classes. Taking the negative logarithm, maximizing the posterior under the uniform prior is then equivalent to minimizing the expected subjective function with the cross-entropy loss, where the predicted probability of each class is compared against the corresponding ground truth.
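The two equivalences above can be summarized in a short derivation sketch, under assumed standard notation (f_k for the k-th low-level model, σ² for the shared variance, C for the number of classes, and p̂_k(x) for the predicted class-probability vector of model k; these symbols are not from the original text):

```latex
% Assumed notation: f_k is the k-th low-level model, \sigma^2 the shared
% variance, C the number of classes, \hat{p}_k(x) the predicted
% class-probability vector of model k.
% Example 1: isotropic Gaussian likelihood <=> squared loss.
-\log p(y \mid x, f_k)
  = \frac{1}{2\sigma^2}\,\lVert y - f_k(x)\rVert_2^2 + \mathrm{const},
\qquad
\arg\max_k \, p(y \mid x, f_k) = \arg\min_k \, \lVert y - f_k(x)\rVert_2^2 .
% Example 2: independent categorical likelihoods <=> cross-entropy loss.
-\log p(y \mid x, f_k)
  = -\sum_{c=1}^{C} y_c \log \hat{p}_{k,c}(x),
\qquad
\arg\max_k \, p(y \mid x, f_k) = \arg\min_k \, \mathrm{CE}\big(\hat{p}_k(x),\, y\big).
```

In both cases the additive constant and the positive scaling do not affect the argmax, which is why the posterior maximization and the loss minimization coincide under the uniform prior.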
Appendix B Algorithm Pseudocode
We provide the pseudocode of OSL in Algorithm 1.
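To make the episode-level structure of Algorithm 1 concrete, the following is a minimal sketch of one OSL sampling episode in the spirit described in the main text: the learner allocates the whole episode to the low-level model with the smallest average loss (the empirical counterpart of the subjective function) and updates only that model. The names `osl_episode_step`, `loss_fn`, and `update_fn` are hypothetical, not from the paper's code.

```python
import numpy as np

def osl_episode_step(models, loss_fn, update_fn, batch_x, batch_y):
    """One OSL sampling episode (hypothetical sketch).

    High-level cognition: allocate the episode to the low-level model with
    the smallest average loss (empirical subjective function), then perform
    low-level learning by updating only that model.
    """
    # Empirical subjective function: mean loss of each model on the episode.
    scores = [float(np.mean(loss_fn(m, batch_x, batch_y))) for m in models]
    k = int(np.argmin(scores))              # pick exactly one low-level model
    update_fn(models[k], batch_x, batch_y)  # train only the chosen model
    return k

# Toy demo with two fixed predictors on conflicting "tasks" (y = x vs. y = -x).
models = [lambda x: x, lambda x: -x]
sq_loss = lambda m, X, Y: (m(X) - Y) ** 2
no_update = lambda m, X, Y: None  # placeholder: a real update would fit model k
X = np.array([1.0, 2.0, 3.0])
assert osl_episode_step(models, sq_loss, no_update, X, X) == 0   # task y = x
assert osl_episode_step(models, sq_loss, no_update, X, -X) == 1  # task y = -x
```

Because each episode invokes a single model, different models naturally specialize to different target functions as conflicting episodes arrive, which is the mechanism the visualization in Figure 5 reflects.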
Appendix C Proofs of Theoretical Results
In this section, we provide the proofs of our theoretical results. For better exposition, we restate each theorem before its proof.
C.1 Proof of Theorem 1
Theorem 1 (Form of the optimal solution).
Assume that the target functions in all domains are realizable. Then, the following two propositions are equivalent:

(1) For all domain distributions and data distributions , .

(2) For each in , there exists such that .
Proof.
The derivation of proposition (1) from proposition (2) is immediate. On the other hand, suppose that proposition (2) is false, i.e., there exists such that for all , . Then, we have
From the above we know that , thus there exists such that for every . This indicates that , which contradicts proposition (1). Therefore, proposition (2) must hold if proposition (1) is true. ∎
C.2 Proof of Theorem 2
We first revisit the definition of PAC learnability.
Definition 2 (PAC learnability).
A target function set class is said to be PAC learnable if there exists an algorithm and a polynomial function such that for any and , for all distributions and distribution set , the following holds for any sample size no smaller than the polynomial:
(10) 
The above definition can be viewed as an extension of single-task PAC learnability haussler_probably_1990 , which considers the problem of learning a single target function. Based on this definition, we give the proof below.
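For concreteness, the condition in Definition 2 typically takes the following standard PAC form, with assumed notation not drawn from the original text (A for the learning algorithm, S for an i.i.d. sample of size m, er(·) for the expected error of the output hypothesis set, and poly(·) for the sample-complexity polynomial):

```latex
% Assumed notation: A is the learning algorithm, S an i.i.d. sample of
% size m, er(.) the expected error of the output hypothesis set, and
% poly(.) the sample-complexity polynomial of Definition 2.
\Pr_{S \sim \mathcal{D}^m}\!\left[\, \mathrm{er}\big(A(S)\big) \le \epsilon \,\right] \ge 1 - \delta,
\qquad \forall\, m \ge \mathrm{poly}\!\left(1/\epsilon,\, 1/\delta\right).
```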
Theorem 2.
A necessary condition for the PAC learnability of OSL is that the size of the hypothesis set is no smaller than the mapping rank (Definition 1).
Proof.
According to Definition 2, if OSL is PAC learnable, there must exist an algorithm that outputs a hypothesis set with zero error, i.e., for every and (otherwise the inequality (10) would not hold for sufficiently small and ). Theorem 1 indicates that this is equivalent to , which is impossible if the size of the hypothesis set is smaller than the mapping rank, according to Definition 1. ∎
C.3 Proof of Theorem 3
We first introduce several technical lemmas.
Lemma 1.
Let A_1, …, A_n be a set of events satisfying P(A_i) ≥ 1 − δ_i, with some δ_i ∈ (0, 1). Then, P(⋂_{i=1}^{n} A_i) ≥ 1 − ∑_{i=1}^{n} δ_i.
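Assuming Lemma 1 is the standard intersection form of the union bound (events A_i each holding with probability at least 1 − δ_i), its proof is a single line:

```latex
% One-line proof of the intersection form of the union bound.
\Pr\!\Big[\bigcap_{i=1}^{n} A_i\Big]
  = 1 - \Pr\!\Big[\bigcup_{i=1}^{n} A_i^{c}\Big]
  \ge 1 - \sum_{i=1}^{n} \Pr\!\big[A_i^{c}\big]
  \ge 1 - \sum_{i=1}^{n} \delta_i .
```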
Lemma 2 (Single-task generalization error bound vapnik_nature_2013 ).
Let be a measurable and bounded real-valued function set, of which the Vapnik–Chervonenkis (VC) dimension vapnik_uniform_1971 is . Let be data samples sampled i.i.d. from a distribution with size . Then, for any , the following inequality holds with probability at least :
(11) 
where
(12) 
Then, we upper bound the error induced by the estimation of the expected subjective function (5b) in each sampling episode of OSL, which is critical for bounding the generalization error. Recall that we use superscripts to denote the sampling index, e.g., denotes the th domain sample, which can be any domain in the domain set . We define two shorthands as follows:
(13a)  
(13b) 
where denotes the th sampling episode of OSL, is the hypothesis set.
Lemma 3 (Subjective estimation error bound).
Let be episodic samples in the th sampling episode i.i.d. drawn from domain with size . Then, for any , the following inequality holds uniformly for all hypotheses with probability at least :
(14)  
where is the function set of the sample-wise error.
Proof.
We have the following decomposition:
in which the original difference is decomposed into three terms. By definitions (13a) and (13b), the middle term is non-positive. Using Lemma 2 by substituting with and replacing with , the first and the last terms can both be bounded by with probability at least (Lemma 1). Combining these two bounds using Lemma 1 completes the proof. ∎
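The decomposition used in this proof follows the standard empirical risk minimization argument; with assumed notation not drawn from the original text (L̂ for the empirical subjective estimate (13a), L for its expected counterpart (13b), ĥ for the empirical minimizer, and h* for the expected minimizer), it reads:

```latex
% Assumed notation: \hat{L} is the empirical subjective estimate (13a),
% L its expected counterpart (13b), \hat{h} the minimizer of \hat{L},
% and h^* the minimizer of L.
L(\hat{h}) - L(h^*)
  = \underbrace{\big[L(\hat{h}) - \hat{L}(\hat{h})\big]}_{\text{uniform convergence}}
  + \underbrace{\big[\hat{L}(\hat{h}) - \hat{L}(h^*)\big]}_{\le\, 0 \text{ by minimality of } \hat{h}}
  + \underbrace{\big[\hat{L}(h^*) - L(h^*)\big]}_{\text{uniform convergence}} .
```

The middle term is non-positive because ĥ minimizes the empirical quantity, and the two outer terms are each controlled by a uniform-convergence bound of the form in Lemma 2.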
Now we can give the proof of the main theorem.
Theorem 3 (Generalization error bound of OSL).
For any , the following inequality holds uniformly for all hypothesis sets with probability at least :
(15a)  
(15b)  
(15c) 
where is the function set of the domain-wise expected error, is the function set of the sample-wise error, and is the sampling count of the target function from the th domain .
Proof.
Combining the objectives (2) and (3) with the subjective function (5a) (5b), we have the following decomposition:
By definitions (13a) and (13b), we rewrite the equation above:
in which the generalization error of OSL is decomposed into three terms. By substituting in Lemma 2 by and replacing with , the first term can be bounded by with probability at least . By replacing with in Lemma 3, we bound the last term by with probability at least . There remains the middle term, for which we have