Subjective Learning for Open-Ended Data

08/27/2021
by Tianren Zhang, et al. (Tsinghua University; Tencent)

Conventional machine learning methods typically assume that data is split according to tasks, and the data in each task can be modeled by a single target function. However, this assumption is invalid in open-ended environments where no manual task definition is available. In this paper, we present a novel supervised learning paradigm of learning from open-ended data. Open-ended data inherently requires multiple single-valued deterministic mapping functions to capture all its input-output relations, exhibiting an essential structural difference from conventional supervised data. We formally expound this structural property with a novel concept termed mapping rank, and show that open-ended data poses a fundamental difficulty for conventional supervised learning, since different data samples may conflict with each other if the mapping rank of data is larger than one. To address this issue, we devise an Open-ended Supervised Learning (OSL) framework, of which the key innovation is a subjective function that automatically allocates the data among multiple candidate models to resolve the conflict, developing a natural cognition hierarchy. We demonstrate the efficacy of OSL both theoretically and empirically, and show that OSL achieves human-like task cognition without task-level supervision.


1 Introduction

A hallmark of general intelligence is the ability to handle open-ended environments, which roughly means complex, diverse environments without manually pre-defined tasks adams_mapping_2012 . In recent years, although machine learning systems have achieved remarkable success in various human-specified domains krizhevsky_imagenet_2012 ; he_deep_2016 ; mnih_human-level_2015 ; vaswani_attention_2017 , the problem of learning in open-ended environments without manual task specification remains largely open goertzel_artificial_2007 ; clune_aigas_2019 ; colas_curious:_2019 ; wang_enhanced_2020 . The main reason is that this problem has not been clearly defined, and the essential difference between open-ended and traditional, “close-ended” environments has not been explicitly identified, rendering this direction vague and elusive. Hence, a formal problem definition and a principled learning framework are crucial for its further development.

In this paper, we study this problem from the perspective of open-ended data, i.e., the data sampled from open-ended environments, in the context of supervised learning. Without manual task definition, open-ended environments may involve multiple tasks simultaneously with no inter-task delineation. Therefore, open-ended data may contain data sampled from multiple tasks, which exhibits an essential structural difference from conventional supervised data. Concretely, open-ended data is a mixture of the data generated by multiple target functions due to the emergence of different contexts or the semantic ambiguity of the data itself, which nullifies the assumption in conventional supervised learning that the relation between inputs and outputs can be modeled by a single target function vapnik_nature_2013 . For example, it is rational to map an image of a red sphere to “red” or “sphere”, and a video clip of a nodding man may be labeled as “yes” by some people and “no” by others. To formally characterize the structural property of open-ended data, we introduce a novel, dataset-level concept termed mapping rank, which is defined as the minimal number of deterministic functions required to “fully express” all input-output relations in the data.

Definition 1 (Mapping rank).

Let $\mathcal{X}$ be an input space, $\mathcal{Y}$ an output space, and $\mathcal{D} \subseteq \mathcal{X} \times \mathcal{Y}$ a dataset with cardinality $N$. Let $\mathcal{F}$ be a function set with cardinality $k$, in which each element $f \in \mathcal{F}$ is a single-valued deterministic function from $\mathcal{X}$ to $\mathcal{Y}$. Then, the mapping rank of $\mathcal{D}$, denoted by $R(\mathcal{D})$, is defined as the minimal positive integer $k$ such that there exists a function set $\mathcal{F}$ with $|\mathcal{F}| = k$ for which, for every $(x, y) \in \mathcal{D}$, there exists $f \in \mathcal{F}$ with $f(x) = y$.
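For a finite dataset, the definition reduces to a simple quantity: since a single-valued function can realize at most one output per distinct input, the minimal number of functions needed equals the largest number of distinct outputs attached to any single input. The minimal sketch below (our own illustration, with hypothetical toy data) computes it directly:

```python
from collections import defaultdict

def mapping_rank(dataset):
    """Mapping rank of a finite dataset (Definition 1): the largest number of
    distinct outputs attached to any single input, since each single-valued
    function can cover at most one (x, y) pair per distinct x."""
    outputs = defaultdict(set)
    for x, y in dataset:
        outputs[x].add(y)
    return max(len(ys) for ys in outputs.values())

# Conventional supervised data: every input has a unique output -> rank 1.
print(mapping_rank([(0, "cat"), (1, "dog"), (2, "cat")]))     # 1
# Open-ended data: the same input carries conflicting outputs -> rank 2,
# e.g. a red sphere labeled "red" in one pair and "sphere" in another.
print(mapping_rank([(0, "red"), (0, "sphere"), (1, "red")]))  # 2
```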

Figure 1: Comparison between multi-task learning, multi-label learning and OSL.

Under this definition, conventional supervised data has a mapping rank of one, as it assumes that each input instance corresponds to a unique output; thus the whole dataset can be expressed using a single function. In contrast, open-ended data has a mapping rank larger than one, since different outputs exist for the same input. Hence, conventional supervised learning is problematic for open-ended data: a single function is insufficient to express data with mapping rank larger than one, resulting in a “conflict” between different sample pairs when running empirical risk minimization (ERM). This phenomenon has been observed by prior works finn_online_2019 ; su_task_2020 , where a vanilla agent trying to regress from multiple target functions trivially outputs their mean, leading to an inevitable training error regardless of the model class adopted. Meanwhile, we note that our setting is essentially different from multi-task learning caruana_multitask_1997 and multi-label learning zhang_review_2014 : multi-task learning aims to perform inductive transfer between related tasks, while the mapping rank of the data in each task is one; multi-label learning assigns each input a fixed label set containing multiple labels, which also has a mapping rank of one, although a larger output space is considered. Therefore, for open-ended data, a manual allocation or aggregation process is required to transform the data from mapping rank larger than one to mapping rank one to enable effective learning, as shown in Figure 1. This raises an important question: is it possible to learn directly from open-ended data with mapping rank larger than one?
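A one-line toy calculation (our own illustration, not from the paper) makes the conflict concrete under the squared loss:

```python
import numpy as np

# Two conflicting samples share the input x = 0 but carry targets 0 and 1.
# A single function minimizing squared error at x = 0 must output their mean,
# leaving an irreducible training error no matter how expressive the model is.
targets = np.array([0.0, 1.0])
best_single_output = targets.mean()                       # 0.5
irreducible_error = np.mean((targets - best_single_output) ** 2)
print(best_single_output, irreducible_error)              # 0.5 0.25
```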

In this work, we give an affirmative answer to this question by presenting an Open-ended Supervised Learning (OSL) framework that enables human-free learning from open-ended data. Contrary to previous methods that rely on human effort to passively eliminate the conflict in data, OSL does the opposite by actively leveraging the conflict to establish a hierarchical cognition structure. Concretely, OSL introduces a set of low-level prediction models and a novel, high-level subjective function that automatically allocates the data to these models so that the data processed by each model has no conflict, mimicking the human subjectivity at work in the manual allocation of conflicting samples. The motivation for this design is that if the subjective function yields an inappropriate allocation, i.e., assigns conflicting data samples to the same model, then the global training error cannot reach its minimum due to this conflict, which drives the subjective function to improve its allocation. Based on probably approximately correct (PAC) learnability haussler_probably_1990 , we prove that our design overcomes the learnability issue of conventional supervised learning given sufficiently many low-level models.

The hierarchical nature of OSL leads to a novel global error decomposition consisting of two parts: a high-level subjective error that measures the “rationality” of the data allocation, and a low-level model error that measures the accuracy of the models. This differs from conventional supervised learning, where only the model error exists. Utilizing tools from statistical learning theory vapnik_nature_2013 , we locate the terms reflecting both types of errors in the generalization error of OSL, and discuss their respective controlling strategies.

We conduct extensive experiments including both regression and classification tasks with open-ended data to verify the efficacy of OSL. Our results show that our OSL agent can effectively learn from open-ended data without additional human intervention, simultaneously achieving small subjective and model errors, and exhibiting human-like task-level cognition with only instance-level supervision.

In summary, we establish a novel, theoretically grounded supervised learning paradigm for open-ended data. Our contributions are three-fold:

Open-ended data and mapping rank. We formalize a new problem of learning from open-ended data, and propose a novel concept termed mapping rank to model the structural property of open-ended data, outlining its fundamental difference from conventional supervised data.

OSL framework and theory. We present an OSL framework to enable effective learning from open-ended data (Section 2), and justify its learnability (Section 3) and generalizability (Section 4) respectively based on statistical learning theory.

Global error decomposition and subjective error minimization. We show that our framework induces a novel global error decomposition, and empirically demonstrate that the minimization of the high-level subjective error is crucial for achieving rational task-level cognition (Section 5).

2 Open-Ended Supervised Learning

In this section, we present the formulation of OSL, introducing its sampling process, the global objective, and the derivation of the subjective function. We adhere to conventional supervised learning terminology and let $\mathcal{X}$ be an input space, $\mathcal{Y}$ an output space, $\mathcal{H}$ a hypothesis space where each hypothesis (model) $h \in \mathcal{H}$ is a function from $\mathcal{X}$ to $\mathcal{Y}$, and $\ell$ a non-negative and bounded loss function (without loss of generality). We write $[n] = \{1, \dots, n\}$ for positive integers $n$, and denote by $\mathbb{1}[\cdot]$ the indicator function. We use superscripts (e.g., $x^i$) to denote sampling indices and subscripts (e.g., $h_j$) as element indices.

2.1 Problem Statement

As introduced in Section 1, we consider open-ended data with mapping rank larger than one; it is straightforward to subsume the traditional setting with mapping rank one within our framework as a degenerate case. Without loss of generality, to formulate the generation process of open-ended data we introduce the notion of a domain. Inspired by ben-david_theory_2010 , we define a domain $D = (\mu, g)$ as a pair consisting of a distribution $\mu$ on $\mathcal{X}$ and a deterministic target function $g: \mathcal{X} \to \mathcal{Y}$, and assume that the open-ended data is generated by a domain set $\mathbb{D}$ containing $T$ domains (agnostic to the learner), resulting in a dataset $\mathcal{D}$ with $N$ samples and mapping rank $R$. Hence, $\mathcal{D}$ contains the data from all these domains, yet the exact sample-wise domain information is unavailable. It is easy to verify that $R \leq T$, since the target functions in all domains are deterministic (i.e., there is no conflict among intra-domain samples).

We consider a bilevel sampling procedure: first, $n$ domain samples are drawn i.i.d. from a distribution $\tau$ defined on $\mathbb{D}$ (thus the same domain can be sampled multiple times), resulting in $n$ sampling episodes; second, in each episode $i$, $m$ data samples are drawn i.i.d. from the distribution $\mu^i$ and labeled by the target function $g^i$ of domain $D^i$ (hence $N = nm$). This sampling regime is analogous to the bilevel sampling process adopted by meta-learning pentina_pac-bayesian_2014 ; amit_meta-learning_2018 . However, meta-learning typically assumes a dense distribution of related domains to enable task-level generalization, while OSL is compatible with scarce and disparate domains, and inter-domain transfer is not our concern.
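The sketch below simulates this bilevel sampling process for a regression-style setting; the specific domains, target functions, and hyperparameter values are illustrative assumptions rather than the paper's configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical domain set: each domain pairs a data distribution with a
# deterministic target function; the functions conflict on the shared input range.
domains = [
    (lambda size: rng.uniform(-1.0, 1.0, size), np.sin),
    (lambda size: rng.uniform(-1.0, 1.0, size), np.cos),
    (lambda size: rng.uniform(-1.0, 1.0, size), lambda x: x ** 2),
]

def sample_open_ended_data(n_episodes, m_samples):
    """Bilevel sampling: draw one domain per episode, then draw m labeled
    samples from it. Batches are locally consistent, but no domain id is kept."""
    episodes = []
    for _ in range(n_episodes):
        dist, target = domains[rng.integers(len(domains))]   # domain ~ tau
        x = dist(m_samples)                                   # x ~ mu of the domain
        episodes.append((x, target(x)))                       # y = g(x), deterministic
    return episodes

data = sample_open_ended_data(n_episodes=100, m_samples=5)
```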

In the above setting, an episodic sample number parameter $m$ is introduced to maintain the local consistency of the data, implicitly assuming that the data samples in every size-$m$ ($m > 1$) batch belong to the same domain. We will verify the necessity of this assumption both theoretically and empirically in the following sections. Intuitively, setting $m = 1$ may result in a multiplicity of solutions (see su_task_2020 and Figure 3(b) in our experiments) and fail to maintain the integrity of the original domains.

The aim of the learner is to predict each output $y$ given the input $x$. As mentioned in Section 1, a single model is not sufficient in this setting due to the conflict in the data. Thus, we equip the learner with a hypothesis set $\mathbb{H} = \{h_1, \dots, h_K\}$ consisting of $K$ hypotheses, enhancing its expressive capability. Although both $T$ and $R$ are assumed to be unknown, we will show that in general $K \geq R$ suffices (Section 3.2), which eases the difficulty of setting the hyperparameter $K$.

Applications. Open-ended data naturally emerges in various machine learning applications. Note that although most machine learning datasets used today are not open-ended, a mapping rank larger than one is quite common in raw data directly collected from the real world. Therefore, if the machine can learn from open-ended data directly, the cumbersome manual data-cleaning process can be avoided. Moreover, even after manual data cleaning, the innate conflict in data is in many cases inevitable in a single-label regime. For example, in the well-known ImageNet dataset deng_imagenet:_2009 , each image is assigned a single label, yet it may contain multiple objects of interest. This makes the annotated label only one of many equally valid descriptions of the image, which has led to issues in both training and evaluation, as suggested by recent research shankar_evaluating_2020 ; beyer_are_2020 ; yun_re-labeling_2021 . Meanwhile, in some scenarios it is not possible to manually check and resolve the conflict in data from multiple sources in advance, due to privacy or other constraints, such as in federated learning where the data is distributed over multiple clients mcmahan_communication-efficient_2017 ; kairouz_advances_2021 . In these scenarios, OSL can be applied to automatically detect and resolve the conflict.

2.2 Global Error

Since the open-ended dataset implicitly contains multiple underlying input-output mapping relations, a natural starting point is the empirical multi-task loss

$$\hat{\mathcal{E}}_{\mathrm{MTL}}(\mathbb{H}) = \frac{1}{nm} \sum_{i=1}^{n} \sum_{l=1}^{m} \ell\big(h_{o(i)}(x^{i,l}),\, y^{i,l}\big), \quad (1)$$

where $o: [n] \to [K]$ is an oracle mapping function that determines which hypothesis each data batch is assigned to. However, in OSL the oracle mapping function is unavailable, imposing a fundamental discrepancy. To tackle this difficulty, we substitute the oracle mapping function with a learnable empirical subjective function $\hat{s}$ that selects a hypothesis from the hypothesis set for the data batch $\mathcal{D}^i = \{(x^{i,l}, y^{i,l})\}_{l=1}^{m}$. This substitution yields the empirical global error of OSL:

$$\hat{\mathcal{E}}(\mathbb{H}) = \frac{1}{nm} \sum_{i=1}^{n} \sum_{l=1}^{m} \ell\big(h_{\hat{s}(\mathcal{D}^i)}(x^{i,l}),\, y^{i,l}\big). \quad (2)$$

Our insight is that the data batch itself can be harnessed to guide its allocation in the presence of mapping conflict: a single model trained on conflicting data batches incurs an inevitable training error, hindering the optimization of the global error (2), which in turn facilitates data allocation with less conflict. The corresponding expected global error is

$$\mathcal{E}(\mathbb{H}) = \mathbb{E}_{D \sim \tau}\, \mathbb{E}_{x \sim \mu_D}\, \ell\big(h_{s(D)}(x),\, g_D(x)\big), \quad (3)$$

where $s$ is an expected subjective function, which can be viewed as the empirical subjective function given infinite samples from every single domain, so that all available domain information is fully reflected by the data samples. The global objective (3) characterizes the test process in which the learner typically experiences only one task (domain) in a particular time period (which is natural in the real world), so its performance can be tested separately in each domain.

So far, our framework remains incomplete since the subjective function remains undefined. In the next section, we present our design of the subjective function and elucidate its rationale.

2.3 Design of the Subjective Function

To attain a reasonable choice of the subjective function, in this section we provide an alternative, probabilistic view of OSL from the angle of maximum conditional likelihood, and draw an intriguing connection between the choice of the subjective function and posterior maximization using expectation-maximization (EM) dempster_maximum_1977-1 on the open-ended data.

Let $p_j(y \mid x)$ represent the predictive conditional distribution of the $j$-th hypothesis in the hypothesis set $\mathbb{H}$. We consider maximizing the empirical log-likelihood of the data under the selected hypotheses. Using EM, in the M-step we seek to maximize the objective

$$Q = \sum_{i=1}^{n} \sum_{j=1}^{K} \gamma^{i}_{j} \Big( \log \pi_j + \sum_{l=1}^{m} \log p_j\big(y^{i,l} \mid x^{i,l}\big) \Big), \quad (4)$$

where $\gamma^{i}_{j}$ denotes the responsibility of the $j$-th hypothesis in the hypothesis set w.r.t. the local data batch $\mathcal{D}^i$, and $\pi_j$ denotes the prior of the $j$-th hypothesis in $\mathbb{H}$; we adopt the local consistency assumption, i.e., data in the same batch share the same hypothesis. This draws a direct connection between the empirical posterior calculation of $\gamma^{i}_{j}$ in the E-step and the empirical subjective function. The main difference is that here we assume the subjective function to be deterministic, representing a hard assignment of the data to the task. This entails a variant known as hard EM samdani_unified_2012 , which constrains the posterior to be a Dirac delta function. Applying this constraint, the E-step under a uniform prior selects the hypothesis with the largest data likelihood, motivating a principled choice of the subjective function:

$$\hat{s}(\mathcal{D}^i) = \arg\min_{j \in [K]} \frac{1}{m} \sum_{l=1}^{m} \ell\big(h_j(x^{i,l}),\, y^{i,l}\big), \quad (5a)$$
$$s(D) = \arg\min_{j \in [K]} \mathbb{E}_{x \sim \mu_D}\, \ell\big(h_j(x),\, g_D(x)\big), \quad (5b)$$

which can be interpreted as selecting the hypothesis that incurs the smallest (empirical or expected) error. While this connection is not rigorous in general, in some cases exact equivalence can be obtained when certain types of loss functions and likelihood families are applied, encompassing common regression and classification settings. We provide concrete examples in Appendix A.
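Operationally, the empirical subjective function (5a) is just an argmin over the per-batch losses of the candidate hypotheses. A minimal PyTorch-style sketch (the models and loss function are placeholders, not the authors' implementation):

```python
import torch

def subjective_select(models, loss_fn, x_batch, y_batch):
    """Empirical subjective function (5a): return the index of the hypothesis
    that incurs the smallest average loss on the current data batch."""
    with torch.no_grad():
        losses = [loss_fn(model(x_batch), y_batch).mean() for model in models]
    return int(torch.argmin(torch.stack(losses)))
```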

3 Algorithm

In this section, we present our OSL algorithm, discuss the form of its optimal solutions, and analyze its learnability based on the PAC learnability framework.

3.1 Algorithm Overview

We assume that the hypothesis set $\mathbb{H}$ comprises $K$ parameterized hypotheses with parameter vectors $\theta_1, \dots, \theta_K$, respectively. At a high level, with the choice of the empirical subjective function (5a), our algorithm consists of two phases in each sampling episode: (i) evaluating the error of each hypothesis in $\mathbb{H}$ w.r.t. the data in this episode, and (ii) training the hypothesis with the smallest error. For brevity, we introduce the empirical episodic error

$$\hat{\mathcal{E}}^{i}(\theta_j) = \frac{1}{m} \sum_{l=1}^{m} \ell\big(h_{\theta_j}(x^{i,l}),\, y^{i,l}\big), \quad (6)$$

where $h_{\theta_j}$ is a hypothesis parameterized by $\theta_j$. Phase (i) then aims to find a hypothesis that minimizes (6). Note that this selection process may induce a bias between the empirical and expected objectives (2) and (3), since a hypothesis that minimizes the empirical loss on finite samples may not minimize the expected loss of the domain. Hence, the global error of OSL can be intuitively decomposed into a high-level subjective error that measures the reliability of the model selection, and a low-level model error that measures the accuracy of the models; we provide a detailed theoretical analysis in Section 4. In practice, we parameterize each hypothesis in the hypothesis set with a deep neural network (DNN), and apply stochastic gradient descent (SGD) for optimization. We provide the pseudo-code of OSL in Appendix B.

3.2 PAC Analysis

We now analyze the learnability of OSL based on PAC learnability valiant_theory_1984 . Since our analysis directly applies to conventional supervised learning by setting $K = 1$, we also verify the conflict phenomenon mentioned in Section 1 from a theoretical perspective. While learnability in conventional PAC analysis mainly relates to the choice of the hypothesis space, open-ended data introduces a new source of complexity through its mapping rank, and we expect the cardinality of the proposed hypothesis set to compensate for this complexity. We consider the realizable case where the hypothesis space covers the target functions in all domains, e.g., when complex hypothesis spaces such as parameterized DNNs are adopted, which is practical and helps underline the core characteristic of our problem. The proofs are in Appendix C.1 and C.2. We begin with a result on the form of the optimal solutions of OSL.

Theorem 1 (Form of the optimal solutions of OSL).

Assume that the target functions in all domains are realizable. Then, the following two propositions are equivalent:

  1. For all domain distributions and data distributions, the expected global error (3) equals zero.

  2. For each domain in $\mathbb{D}$, there exists a hypothesis $h \in \mathbb{H}$ that coincides with the domain's target function on the support of its data distribution.

Theorem 1 suggests that minimizing the expected global error (3) with (5b) elicits a globally optimal solution in which every target function is learned accurately in the realizable case. Note that this does not require $T$ hypotheses for $T$ domains, since non-conflicting domains can be incorporated into the same model. In other words, what determines the minimal cardinality of the hypothesis set is not the number of domains, but the number of conflicting domains, which is exactly characterized by the mapping rank. Formally, we obtain a necessary condition for the PAC learnability of OSL.

Theorem 2.

A necessary condition for the PAC learnability of OSL is $K \geq R$.

Theorem 2 indicates that the cardinality of the hypothesis set should be large enough to enable effective learning, and shows the impact of mapping rank on the learnability of OSL.

Remark 1.

While it is generally hard to derive a necessary and sufficient condition of PAC learnability theoretically (which requires a sample-efficient optimization algorithm), we empirically find that $K \geq R$ is indeed an essential condition for learnability with complex hypothesis spaces (parameterized DNNs). We also note that several recent works allen-zhu_convergence_2019 ; du_gradient_2019 have proved that over-parameterized neural networks trained by SGD can achieve zero training error in polynomial time despite non-convexity, which may also be used to strengthen our analysis. We leave a more rigorous study for future work.

4 Controlling the Generalization Error

We have shown that minimizing the expected global error is sufficient for effective learning from open-ended data. In practice, however, we only have access to the empirical global error, so controlling the discrepancy between these two errors, i.e., the generalization error, remains crucial. In this section, we identify the terms in the generalization error that correspond to the subjective error and the model error in the error decomposition of OSL, and discuss their controlling strategies. Our key findings are that (i) the number of episodes and the number of episodic samples can compensate for each other in controlling the generalization error incurred by the models (instance estimation error), and (ii) the number of episodic samples is critical for controlling the generalization error incurred by the subjective function (subjective estimation error). The proof is deferred to Appendix C.3.

Theorem 3 (Generalization error bound of OSL).

For any $\delta \in (0, 1)$, the following inequality holds uniformly for all hypothesis sets with probability at least $1 - \delta$:

(7a)
(7b)
(7c)

where the first set is the function set of the domain-wise expected error, the second is the function set of the sample-wise error, the count denotes the number of times the target function of the corresponding domain is sampled, and the complexity measure is the Vapnik-Chervonenkis (VC) dimension vapnik_uniform_1971 of a real-valued function set.

Theorem 3 indicates that the expected global error is bounded by the empirical global error plus three terms. The subjective estimation error term (7c) is derived by bounding the discrepancy between the empirical and the expected subjective functions due to finite episodic samples (the detailed derivation is in Appendix C.3). This term is controlled by a sample-level complexity term and the number of episodic samples $m$, which certifies the necessity of the local consistency assumption in Section 2. Although in theory this term converges to zero only as $m \to \infty$, in practice we find that a very small $m$ usually suffices (see Section 5). We posit that this is because the domains in our experiments are relatively diverse, which reduces the difficulty of discriminating between different domains. The domain estimation error term (7a) contains a domain-level complexity term and converges to zero as the number of episodes goes to infinity ($n \to \infty$); the instance estimation error term (7b) contains sample-level complexity terms and converges to zero as the number of samples per episode or the number of episodes goes to infinity ($m \to \infty$ or $n \to \infty$), showing the synergy between high-level domain samples and low-level data samples in controlling the model-wise generalization error.

Remark 2.

We compare our bound (7) with existing bounds for conventional supervised learning vapnik_nature_2013 ; mcallester_pac-bayesian_1999 and meta-learning pentina_pac-bayesian_2014 ; amit_meta-learning_2018 . Typically, supervised learning bounds contain an instance-level complexity term like (7b), and meta-learning bounds further contain a task-level complexity term like (7a). Yet, conventional supervised learning only considers a single domain or multiple known domains, while meta-learning treats each episode as a new domain rather than a domain that may have been encountered before, as in OSL. Thus, none of these bounds contains an explicit inference term like (7c).

Remark 3.

Although our bound adopts the conventional VC dimension as the complexity measure, it is also compatible with other data-dependent complexity measures, e.g., Rademacher and Gaussian complexities bartlett_rademacher_2002 ; bartlett_model_2002 ; koltchinskii_rademacher_2000 ; koltchinskii_rademacher_2001 ; koltchinskii_empirical_2002 . It is worth noting that bounds based on these measures share the same asymptotic behavior w.r.t. $n$ and $m$ as our bound above.

5 Experiments

Open-ended data exists in a wide range of machine learning domains. In this section, we consider two basic supervised learning tasks: regression and classification. Our experiments are designed to (i) validate our theoretical claims, (ii) assess the effectiveness of OSL by measuring both its subjective and model errors, (iii) compare OSL with task-specific baselines, and (iv) investigate the relation between minimizing the subjective error and achieving rational (human-like) task-level cognition.

Figure 2: Open-ended classification tasks and datasets. (a) Parallel task: Colored MNIST and Fashion Product Images. (b) Hierarchical task: CIFAR-100.

5.1 Setup and Baselines

Open-ended regression. We consider an open-ended regression task similar to our motivating example in Figure 1. In this task, data points are sampled from three heterogeneous functions with widely varied shapes, as shown in Figure 2(a) (solid lines). As we have mentioned in Section 1, traditional supervised learning fails in this task due to mapping conflict.

In this task, we compare OSL with three baselines. (1) Vanilla: a conventional ERM-based learner. (2) MAML finn_model-agnostic_2017 : a popular gradient-based meta-learning approach. (3) Modular alet_modular_2018 : a modular meta-learning approach that employs multiple modules to foster combinatorial generalization. We fix the default hyperparameters of OSL; to verify our theoretical results, we also run OSL with different numbers of hypotheses and different sampling hyperparameters ($n$ and $m$). More details can be found in Appendix D.

Open-ended classification. We consider two open-ended image recognition tasks in which the same image may correspond to different labels in different sample pairs. We refer to these tasks according to the structure of their label spaces, namely the parallel and hierarchical tasks, representing two common real-world application cases. In detail, for the parallel task, we derive the data from Colored MNIST, in which each digit is assigned both a number label and a color label, and from Fashion Product Images fashion_product_dataset , a multi-attribute clothing dataset that involves three main parallel tasks, namely gender, category, and color classification, as shown in Figure 1(a); we construct our open-ended datasets by randomly choosing one label from the label set of each image. For the hierarchical task, we derive the data from CIFAR-100 krizhevsky2009learning , a widely used image recognition dataset comprising 100 classes with “fine” labels subsumed by 20 superclasses with “coarse” labels, as shown in Figure 1(b); we construct our open-ended dataset by randomly using either the fine label or the coarse label for each image.
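The open-ended construction can be emulated by fixing, once per image, a label drawn uniformly at random from its label set. The wrapper below is our own illustrative sketch (label indices and toy data are hypothetical), not the authors' released code:

```python
import random

def make_open_ended(samples, seed=0):
    """Fix, once per example, a single label drawn uniformly at random from the
    example's label set; the resulting dataset keeps no record of which task
    the retained label came from."""
    rng = random.Random(seed)
    return [(x, rng.choice(sorted(label_set))) for x, label_set in samples]

# Hypothetical Colored MNIST-style example: each image carries a digit label
# (0-9) and a color label (10 = red, 11 = green); one of them is kept at random.
toy = [("img0", {3, 10}), ("img1", {7, 11})]
print(make_open_ended(toy))
```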

Figure 3: Results of OSL and the baselines on the open-ended regression task: (a) Vanilla (dashed), (b) MAML, (c) Modular, (d)-(f) OSL with different numbers of hypotheses.
Figure 4: The impact of sampling hyperparameters on OSL: (a) fewer episodes, (b) fewer episodic samples, (c) subjective error of (b) (dashed).

Compared with open-ended regression, open-ended classification is a more practical yet less challenging task, since its output space (label space) is discrete and finite. Therefore, it can alternatively be modeled by more existing methods, which we compare against. (1) Probabilistic concepts (ProbCon): this baseline considers the general scenario in which the relation between inputs and outputs is described by a conditional probability distribution devroye_probabilistic_1996 . We adopt a DNN to learn this probability distribution using the cross-entropy loss and choose the top-scoring labels as the final prediction. (2) Semi-supervised multi-label learning: this class of methods models open-ended classification as “multi-label learning with missing labels” Dhillon2014Large ; bi2014multilabel ; huang2019improving , i.e., only one label in each label set is available. We compare OSL with two representative semi-supervised methods: Pseudo-label (Pseudo-L) lee2013pseudo and Label propagation (LabelProp) tang2011image ; iscen2019label . In addition, we introduce two oracle baselines that use additional information. (3) Full labels (Full-L): a standard multi-label learning method where we provide the full label set for each image, so there are no missing labels. (4) Full tasks (Full-T): a standard multi-task learning method where the “task” of each image is designated in advance by human experts to ensure that there is no mapping conflict. More details can be found in Appendix D.

5.2 Evaluation Metrics

Due to its hierarchical nature, the global error of OSL is related to the errors of both the high-level subjective function and the low-level models. Since these two types of errors influence each other and are thus hard to measure solely from the overall performance of the agent, we propose two metrics to estimate them quantitatively.

Subjective error. This metric measures the learner's ability to perform appropriate data allocation. Given a domain, a suitable subjective function should yield stable allocations for all data batches sampled from that domain. Thus, we measure the error of the subjective function by the rate of inconsistent data allocations, defined as

(8)

where the denominator denotes the total number of samples in the domain (the same below).

Model error. This metric measures the learner's ability to make accurate in-domain predictions, analogous to traditional single-task error metrics. Given a domain, this metric takes the following form:

(9)

where the per-sample error is the squared error for regression and the classification (zero-one) error for classification.
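Under one natural reading of these metrics, the subjective error is the fraction of a domain's samples not routed to that domain's most frequent hypothesis, and the model error is the average in-domain prediction error. The sketch below follows this interpretation (ours, with hypothetical inputs):

```python
from collections import Counter

def subjective_error(allocations):
    """Rate of inconsistent allocations within one domain: the fraction of that
    domain's samples (or batches) not routed to its most frequent hypothesis."""
    counts = Counter(allocations)
    return 1.0 - counts.most_common(1)[0][1] / len(allocations)

def model_error(per_sample_errors):
    """Average in-domain prediction error, e.g. squared error for regression
    or 0-1 error for classification."""
    return sum(per_sample_errors) / len(per_sample_errors)

# Example: 100 batches of one domain, 92 routed to hypothesis 2, 8 elsewhere.
print(subjective_error([2] * 92 + [0] * 8))   # 0.08
```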

5.3 Empirical Results and Comparisons

Figure 5: Visualization of the features output by OSL in Fashion Product Images.

Open-ended regression. We compare the performance of OSL and the baselines in Figure 4. Unsurprisingly, the vanilla baseline converges to a trivial mean function (dashed curve in Figure 2(a)). MAML successfully predicts the left part of all target functions by fine-tuning on episodic samples, but fails on the right part, where the functions differ more. We hypothesize that this is because meta-learning typically requires tasks that are both numerous and mutually similar. Although Modular correctly predicts the general trend of the curves, its predictions are still inaccurate in fine-grained details. Note that both meta-learning methods use more episodic data samples than OSL (see Appendix D.3.1). Meanwhile, OSL with a sufficient number of hypotheses successfully distinguishes the different functions and recovers each of them accurately, while OSL with too few hypotheses fails, which matches our analysis in Section 3.2. In particular, OSL with surplus hypotheses automatically leaves a redundant network (dashed curve in Figure 2(f)), demonstrating the robustness of our algorithm.

Figure 4 illustrates the impact of the sampling hyperparameters on OSL. Concretely, OSL with fewer sampling episodes induces a large model error (9), which corresponds to the sample-wise estimation term in the generalization error (7b), since the product $nm$ is not sufficiently large. On the other hand, OSL with fewer episodic samples induces a large subjective error (8), which corresponds to the subjective estimation term in the generalization error (7c). Another interesting phenomenon is that the predicted curves in Figure 3(b) are partially swapped when the subjective error is large. This indicates that minimizing the subjective error subserves better task-level cognition.

Methods | Colored MNIST SubErr (Num, Col) | Colored MNIST ModErr (Num, Col) | Fashion Product Images SubErr (Gen, Cat, Col) | Fashion Product Images ModErr (Gen, Cat, Col) | CIFAR-100 SubErr (Sup, Cla) | CIFAR-100 ModErr (Sup, Cla)
ProbCon | 0.09, 0.34 | 3.04, 0.39 | 8.81, 13.54, 12.50 | 22.80, 18.02, 33.15 | 5.59, 5.44 | 27.35, 31.20
Pseudo-L | 6.62, 9.25 | 7.47, 10.08 | 4.95, 5.40, 14.59 | 33.69, 20.04, 34.06 | 9.26, 8.38 | 28.46, 38.96
LabelProp | 7.53, 0.28 | 11.52, 13.57 | 2.91, 7.22, 21.97 | 14.59, 50.43, 64.48 | 18.82, 10.34 | 66.62, 45.17
OSL (ours) | 0.10, 0.00 | 1.70, 0.03 | 0.00, 0.00, 0.00 | 7.87, 1.93, 12.85 | 1.05, 0.82 | 21.40, 25.05
Full-L | 0.23, 0.00 | 1.02, 0.00 | 1.19, 0.86, 7.44 | 8.46, 1.17, 9.45 | 7.84, 0.84 | 22.08, 26.29
Full-T | 0, 0 | 1.20, 0.00 | 0, 0, 0 | 7.14, 1.90, 11.04 | 0, 0 | 21.11, 25.08
Table 1: Results of OSL and the baselines on open-ended classification tasks. We report subjective (SubErr) and model (ModErr) errors on the Number (Num) and Color (Col) domains of Colored MNIST, the Gender (Gen), Category (Cat) and Color (Col) domains of Fashion Product Images, and the Superclass (Sup) and Class (Cla) domains of CIFAR-100, respectively.

Open-ended classification. Table 1 shows the results of OSL and the baselines on the open-ended classification tasks. On all tasks, OSL outperforms all baselines on both the subjective error and the model error. It is also worth noting that, compared to the oracle Full-L with full label annotations, OSL still attains a smaller subjective error, showing a strong capability for task-level cognition that resembles the “ground truth” cognition of human experts (Full-T). In addition, we visualize the features output by OSL on Fashion Product Images, as shown in Figure 5. The visualization shows that different models of OSL automatically focus on different label subspaces corresponding to different domains, which further corroborates our empirical results.

6 Related Work

Apart from the formulation in this paper, open-ended data may also be formulated using the framework of multi-label learning with partial labels zhang_review_2014 or probabilistic density estimation, such as probabilistic concepts kearns_efficient_1994 ; kearns_toward_1994 ; devroye_probabilistic_1996 and the energy-based learning framework lecun_tutorial_2006 . A fundamental difference between these formulations and OSL is that they learn a unified model without a cognition hierarchy, while our method explicitly encourages the learner to perform high-level data allocation and results in a set of independently reusable and interpretable models. A recent work by Su et al. su_task_2020 studies a problem similar to ours, where the goal of the learner is to learn from multi-task samples without task annotation, but they adopt a different objective function and a centralized gating network for data allocation, whereas the subjective function in OSL is fully decentralized and thus more compatible with parallelization, and our work provides formal theoretical justification.

Extensive literature has investigated the collaboration of multiple models (or modules) in completing one or more tasks doya_multiple_2002 ; shazeer_outrageously_2017 ; meyerson_modular_2019 ; alet_modular_2018 ; nagabandi_deep_2019 ; chen_modular_2020 ; yang_multi-task_2020 . The difference between our work and these works is that our multi-model architecture is driven by the inherent conflict in open-ended data without manual alignment between models and tasks, and we only allow a single low-level model to be invoked during a particular sampling episode.

7 Discussion

An immense open problem in machine learning is to identify the main factors that contribute to the gap between contemporary machine intelligence and general intelligence. While various viewpoints exist, recently the notion of open-ended environments has been advocated by an increasing number of works goertzel_artificial_2007 ; clune_aigas_2019 ; silver_reward_2021 . Nevertheless, to the best of our knowledge, we are the first to formalize this notion in the context of supervised learning and provide a general learning framework with theoretical guarantees. We hope that this work can facilitate future research in this direction.

Limitations and future work. An important limitation of the current work is that it can only attain rational task cognition when there is indeed conflict between data samples (mapping rank larger than one); in contrast, humans may also allocate data to different tasks when there is no conflict (mapping rank equal to one), e.g., a digit recognition task and a face recognition task whose data do not conflict. Therefore, how to develop a task cognition objective for general scenarios remains an important challenge. Also, our formulation is limited to data that is fully informative, i.e., absolute predictions can be made given the inputs. While this assumption is valid in a variety of machine learning applications, it is interesting to devise algorithms that can also handle data with inherent stochasticity. Other important future work includes developing lifelong learning agents that benefit from the growing diversity of open-ended data as the sampling process proceeds, and extending our framework to reinforcement learning domains.

This work was supported in part by the National Natural Science Foundation of China under Grant 61671266, Grant 61836004, Grant 61836014 and in part by the Tsinghua-Guoqiang research program under Grant 2019GQG0006. The authors would like to thank Chongkai Gao for helpful discussions.

References

  • [1] Sam Adams, Itmar Arel, Joscha Bach, Robert Coop, Rod Furlan, Ben Goertzel, J. Storrs Hall, Alexei Samsonovich, Matthias Scheutz, Matthew Schlesinger, Stuart C. Shapiro, and John Sowa. Mapping the landscape of human-level artificial general intelligence. AI Magazine, 33(1):25–42, 2012.
  • [2] Param Aggarwal. Fashion product images dataset. https://www.kaggle.com/paramaggarwal/fashion-product-images-dataset, 2019.
  • [3] Ferran Alet, Tomás Lozano-Pérez, and Leslie P. Kaelbling. Modular meta-learning. In CoRL, 2018.
  • [4] Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. A convergence theory for deep learning via over-parameterization. In ICML, 2019.
  • [5] Ron Amit and Ron Meir. Meta-learning by adjusting priors based on extended pac-bayes theory. In ICML, 2018.
  • [6] Peter L. Bartlett, Stephane Boucheron, and Gábor Lugosi. Model selection and error estimation. Machine Learning, 48:85–113, 2002.
  • [7] Peter L. Bartlett and Shahar Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3:463–482, 2002.
  • [8] Shai Ben-David, John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jennifer Wortman Vaughan. A theory of learning from different domains. Machine learning, 79(1-2):151–175, 2010.
  • [9] Lucas Beyer, Olivier J. Hénaff, Alexander Kolesnikov, Xiaohua Zhai, and Aäron van den Oord. Are we done with ImageNet? arXiv preprint arXiv:2006.07159, 2020.
  • [10] Wei Bi and James Kwok. Multilabel classification with label correlations and missing labels. In AAAI, 2014.
  • [11] Rich Caruana. Multitask learning. Machine learning, 28(1):41–75, 1997.
  • [12] Yutian Chen, Abram L Friesen, Feryal Behbahani, Arnaud Doucet, David Budden, Matthew W Hoffman, and Nando de Freitas. Modular meta-learning with shrinkage. In Advances in Neural Information Processing Systems, 2020.
  • [13] Jeff Clune. AI-GAs: AI-generating algorithms, an alternate paradigm for producing general artificial intelligence. arXiv preprint arXiv:1905.10985, 2019.
  • [14] Cédric Colas, Pierre Fournier, Olivier Sigaud, Mohamed Chetouani, and Pierre-Yves Oudeyer. CURIOUS: Intrinsically motivated multi-task, multi-goal reinforcement learning. In ICML, 2019.
  • [15] Arthur P. Dempster, Nan M. Laird, and Donald B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B (Methodological), 39(1):1–22, 1977.
  • [16] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
  • [17] Luc Devroye, László Györfi, and Gábor Lugosi. A probabilistic theory of pattern recognition, volume 31 of Stochastic Modelling and Applied Probability. Springer, New York, 1996.
  • [18] Kenji Doya, Kazuyuki Samejima, Ken-ichi Katagiri, and Mitsuo Kawato. Multiple model-based reinforcement learning. Neural Computation, 14(6):1347–1369, 2002.
  • [19] Simon S. Du, Jason D. Lee, Haochuan Li, Liwei Wang, and Xiyu Zhai. Gradient descent finds global minima of deep neural networks. In ICML, 2019.
  • [20] Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In ICML, 2017.
  • [21] Chelsea Finn, Aravind Rajeswaran, Sham Kakade, and Sergey Levine. Online meta-learning. In ICML, 2019.
  • [22] Ben Goertzel and Cassio Pennachin. Artificial general intelligence, volume 2. 2007.
  • [23] David Haussler. Probably approximately correct learning. In AAAI, 1990.
  • [24] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
  • [25] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In CVPR, 2017.
  • [26] Jun Huang, Feng Qin, Xiao Zheng, Zekai Cheng, Zhixiang Yuan, Weigang Zhang, and Qingming Huang. Improving multi-label classification with missing labels by learning label-specific features. Information Sciences, 492:124–146, 2019.
  • [27] Ahmet Iscen, Giorgos Tolias, Yannis Avrithis, and Ondrej Chum. Label propagation for deep semi-supervised learning. In CVPR, 2019.
  • [28] Peter Kairouz, H. Brendan McMahan, Brendan Avent, Aurélien Bellet, Mehdi Bennis, Arjun Nitin Bhagoji, Kallista Bonawitz, Zachary Charles, Graham Cormode, Rachel Cummings, Rafael G. L. D’Oliveira, Hubert Eichner, Salim El Rouayheb, David Evans, Josh Gardner, Zachary Garrett, Adrià Gascón, Badih Ghazi, Phillip B. Gibbons, Marco Gruteser, Zaid Harchaoui, Chaoyang He, Lie He, Zhouyuan Huo, Ben Hutchinson, Justin Hsu, Martin Jaggi, Tara Javidi, Gauri Joshi, Mikhail Khodak, Jakub Konečný, Aleksandra Korolova, Farinaz Koushanfar, Sanmi Koyejo, Tancrède Lepoint, Yang Liu, Prateek Mittal, Mehryar Mohri, Richard Nock, Ayfer Özgür, Rasmus Pagh, Mariana Raykova, Hang Qi, Daniel Ramage, Ramesh Raskar, Dawn Song, Weikang Song, Sebastian U. Stich, Ziteng Sun, Ananda Theertha Suresh, Florian Tramèr, Praneeth Vepakomma, Jianyu Wang, Li Xiong, Zheng Xu, Qiang Yang, Felix X. Yu, Han Yu, and Sen Zhao. Advances and open problems in federated learning. Foundations and Trends in Machine Learning, 14(1-2):1–210, 2021.
  • [29] Michael J. Kearns and Robert E. Schapire. Efficient distribution-free learning of probabilistic concepts. Journal of Computer and System Sciences, 48(3):464–497, 1994.
  • [30] Michael J Kearns, Robert E Schapire, and Linda M Sellie. Toward efficient agnostic learning. Machine Learning, 17:115–141, 1994.
  • [31] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
  • [32] Vladimir Koltchinskii. Rademacher penalties and structural risk minimization. IEEE Transactions on Information Theory, 47(5):1902–1914, 2001.
  • [33] Vladimir Koltchinskii and Dmitriy Panchenko. Rademacher processes and bounding the risk of function learning. In High dimensional probability II, pages 443–457. Springer, 2000.
  • [34] Vladimir Koltchinskii and Dmitry Panchenko. Empirical margin distributions and bounding the generalization error of combined classifiers. Annals of Statistics, 30(1):1–50, 2002.
  • [35] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. 2009.
  • [36] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, 2012.
  • [37] Yann LeCun, Sumit Chopra, Raia Hadsell, M. Ranzato, and F. Huang. A tutorial on energy-based learning. Predicting structured data, 2006.
  • [38] Dong-Hyun Lee. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In Workshop on challenges in representation learning, ICML, 2013.
  • [39] David A. McAllester. PAC-Bayesian model averaging. In COLT, 1999.
  • [40] H. Brendan McMahan, Eider Moore, Daniel Ramage, and Seth Hampson. Communication-efficient learning of deep networks from decentralized data. In AISTATS, 2017.
  • [41] Elliot Meyerson and Risto Miikkulainen. Modular universal reparameterization: Deep multi-task learning across diverse domains. In Advances in Neural Information Processing Systems, 2019.
  • [42] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, and Georg Ostrovski. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
  • [43] Anusha Nagabandi, Chelsea Finn, and Sergey Levine. Deep online learning via meta-learning: Continual adaptation for model-based RL. In ICLR, 2019.
  • [44] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, 2019.
  • [45] Anastasia Pentina and Christoph H Lampert. A pac-bayesian bound for lifelong learning. In ICML, 2014.
  • [46] Rajhans Samdani, Ming-Wei Chang, and Dan Roth. Unified expectation maximization. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2012.
  • [47] Vaishaal Shankar, Rebecca Roelofs, Horia Mania, Alex Fang, Benjamin Recht, and Ludwig Schmidt. Evaluating machine accuracy on ImageNet. In ICML, 2020.
  • [48] Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. In ICLR, 2017.
  • [49] David Silver, Satinder Singh, Doina Precup, and Richard S. Sutton. Reward is enough. Artificial Intelligence, 2021.
  • [50] Xin Su, Yizhou Jiang, Shangqi Guo, and Feng Chen. Task understanding from confusing multi-task data. In ICML, 2020.
  • [51] Jinhui Tang, Richang Hong, Shuicheng Yan, Tat-Seng Chua, Guo-Jun Qi, and Ramesh Jain. Image annotation by k nn-sparse graph-based label propagation over noisily tagged web images. ACM Transactions on Intelligent Systems and Technology, 2(2):1–15, 2011.
  • [52] Leslie G. Valiant. A theory of the learnable. Communications of the ACM, 27(11):1134–1142, 1984.
  • [53] V. N. Vapnik and A. Ya Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, 16(2):264–280, 1971.
  • [54] Vladimir Vapnik. The nature of statistical learning theory. Springer science & business media, 2013.
  • [55] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, 2017.
  • [56] Rui Wang, Joel Lehman, Aditya Rawal, Jiale Zhi, Yulun Li, Jeff Clune, and Kenneth O. Stanley. Enhanced poet: Open-ended reinforcement learning through unbounded invention of learning challenges and their solutions. In ICML, 2020.
  • [57] Ruihan Yang, Huazhe Xu, Yi Wu, and Xiaolong Wang. Multi-task reinforcement learning with soft modularization. In Advances in Neural Information Processing Systems, 2020.
  • [58] Hsiang-Fu Yu, Prateek Jain, Purushottam Kar, and Inderjit Dhillon. Large-scale multi-label learning with missing labels. In ICML, 2014.
  • [59] Sangdoo Yun, Seong Joon Oh, Byeongho Heo, Dongyoon Han, Junsuk Choe, and Sanghyuk Chun. Re-labeling ImageNet: From single to multi-labels, from global to localized labels. In CVPR, 2021.
  • [60] Min-Ling Zhang and Zhi-Hua Zhou. A review on multi-label learning algorithms. IEEE Transactions on Knowledge and Data Engineering, 26(8):1819–1837, 2014.

Checklist

  1. For all authors…

    1. Do the main claims made in the abstract and introduction accurately reflect the paper’s contributions and scope?

    2. Did you describe the limitations of your work? See Remark 1 and Section 7.

    3. Did you discuss any potential negative societal impacts of your work?

    4. Have you read the ethics review guidelines and ensured that your paper conforms to them?

  2. If you are including theoretical results…

    1. Did you state the full set of assumptions of all theoretical results?

    2. Did you include complete proofs of all theoretical results? See Appendix C.

  3. If you ran experiments…

    1. Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? See the supplemental material.

    2. Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? See Appendix D.

    3. Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)?

    4. Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? See Appendix D.

  4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets…

    1. If your work uses existing assets, did you cite the creators? See Appendix D.

    2. Did you mention the license of the assets? See Appendix D.

    3. Did you include any new assets either in the supplemental material or as a URL?

    4. Did you discuss whether and how consent was obtained from people whose data you’re using/curating? See Appendix D.

    5. Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? See Appendix D.

  5. If you used crowdsourcing or conducted research with human subjects…

    1. Did you include the full text of instructions given to participants and screenshots, if applicable?

    2. Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable?

    3. Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation?

Appendix A Discussion on the Subjective Function

In Section 2.3, we derive our design of the subjective function using an EM-based maximum likelihood formulation. However, the exact equivalence between the E-step of hard EM and the subjective function has not been established, since it relies on the exact forms of the loss function and the conditional likelihood. As a complement, we provide below two examples where, under a uniform prior, exact equivalence between computing the posterior and the expected subjective function (5b) can be obtained (hence their empirical counterparts are also equivalent). Recall that under a uniform prior, the E-step of hard EM selects the hypothesis with the largest conditional log-likelihood.

Example 1 (Regression with isotropic Gaussian likelihood).

Consider a regression task where we assume that, given the input $x$, the conditional likelihood of each hypothesis $h_j$ conforms to an isotropic Gaussian $p_j(y \mid x) = \mathcal{N}\big(y;\, h_j(x),\, \sigma^2 I\big)$, where $\sigma^2$ is a variance parameter and $I$ is an identity matrix. From $\log p_j(y \mid x) = -\frac{1}{2\sigma^2}\lVert y - h_j(x)\rVert^2 + \mathrm{const}$ we have $\arg\max_j \log p_j(y \mid x) = \arg\min_j \lVert y - h_j(x)\rVert^2$. This equals the expected subjective function with the squared loss $\ell(h(x), y) = \lVert h(x) - y\rVert^2$.

Example 2 (Classification with independent categorical likelihoods).

Consider a multi-class classification task where we assume that, given the input $x$, each conditional likelihood conforms to a categorical distribution with parameters $\{p_j(c \mid x)\}_{c=1}^{C}$, where $\sum_{c=1}^{C} p_j(c \mid x) = 1$ and $C$ is the total number of classes. From $\log p_j(y \mid x) = \sum_{c=1}^{C} \mathbb{1}[y = c]\, \log p_j(c \mid x)$ we have $\arg\max_j \log p_j(y \mid x) = \arg\min_j \big(-\sum_{c=1}^{C} \mathbb{1}[y = c]\, \log p_j(c \mid x)\big)$, where $p_j(c \mid x)$ denotes the predicted probability of the $c$-th class and $\mathbb{1}[y = c]$ is the corresponding ground truth. This equals the expected subjective function with the cross-entropy loss.

Appendix B Algorithm Pseudo-code

We provide the pseudo-code of OSL in Algorithm 1.

Appendix C Proofs of Theoretical Results

In this section, we provide the proofs of our theoretical results. For better exposition, we restate each theorem before its proof.

C.1 Proof of Theorem 1

Theorem 1 (Form of the optimal solution).

Assume that the target functions in all domains are realizable. Then, the following two propositions are equivalent:

  1. For all domain distributions and data distributions, the expected global error (3) equals zero.

  2. For each domain in $\mathbb{D}$, there exists a hypothesis $h \in \mathbb{H}$ that coincides with the domain's target function on the support of its data distribution.

Proof.

The derivation of proposition (1) from proposition (2) is obvious. On the other hand, suppose proposition (2) is false, i.e., there exists a domain in $\mathbb{D}$ whose target function is matched by no hypothesis in $\mathbb{H}$ on a set of positive measure under its data distribution. Then the expected error of every hypothesis on this domain is strictly positive, so for any domain distribution that places positive mass on this domain the expected global error (3) is strictly positive, which contradicts proposition (1). Therefore, proposition (2) must hold if proposition (1) is true. ∎

C.2 Proof of Theorem 2

We first revisit the definition of PAC learnability.

Definition 2 (PAC learnability).

A target function set class is said to be PAC learnable if there exists an algorithm and a polynomial function $\mathrm{poly}(\cdot, \cdot)$ such that for any $\epsilon > 0$ and $\delta \in (0, 1)$, for all data distributions and domain distribution sets, the following holds for any sample size no smaller than $\mathrm{poly}(1/\epsilon, 1/\delta)$:

(10)

The above definition can be viewed as an extension of the single-task PAC learnability haussler_probably_1990 that considers the problem of learning a single target function. Based on this definition, in the following we give the proof.

Theorem 2.

A necessary condition for the PAC learnability of OSL is $K \geq R$.

Proof.

According to Definition 2, if OSL is PAC learnable, there must exist an algorithm that outputs a hypothesis set with zero expected global error for every domain distribution and data distribution (otherwise inequality (10) would not hold for small enough $\epsilon$ and $\delta$). Theorem 1 indicates that this is equivalent to every target function in the domain set being realized by some hypothesis in $\mathbb{H}$, which is impossible if $K < R$ according to Definition 1. ∎

Require: Hypothesis set $\mathbb{H}$, sampling hyperparameters $n$ and $m$.
1:  for a number of epochs do
2:     for $n$ episodes do
3:        Sample a data batch of $m$ samples from the episode's domain.
4:        Select a hypothesis from the hypothesis set using the empirical subjective function (5a).
5:        Train the selected hypothesis by minimizing the empirical episodic error (6).
6:     end for
7:  end for
Algorithm 1 OSL: Open-ended Supervised Learning
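A minimal PyTorch rendering of Algorithm 1 is sketched below; the data iterator, regression loss, optimizer choice, and network sizes are illustrative assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

def osl_train(models, episodes, epochs=10, lr=1e-3):
    """OSL training loop (cf. Algorithm 1): for each locally consistent episode,
    route the batch to the hypothesis with the smallest empirical episodic error
    (phase i) and update only that hypothesis (phase ii)."""
    loss_fn = nn.MSELoss()
    optimizers = [torch.optim.Adam(m.parameters(), lr=lr) for m in models]
    for _ in range(epochs):
        for x, y in episodes:
            with torch.no_grad():                      # empirical subjective function (5a)
                losses = [loss_fn(m(x), y) for m in models]
            j = int(torch.argmin(torch.stack(losses)))
            optimizers[j].zero_grad()                  # minimize the episodic error (6)
            loss_fn(models[j](x), y).backward()
            optimizers[j].step()
    return models

# Hypothetical usage: three small MLPs for the open-ended regression task.
models = [nn.Sequential(nn.Linear(1, 64), nn.ReLU(), nn.Linear(64, 1)) for _ in range(3)]
```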

C.3 Proof of Theorem 3

We first introduce several technical lemmas.

Lemma 1.

Let $\{A_i\}_{i=1}^{k}$ be a set of events satisfying $\mathbb{P}(A_i) \geq 1 - \delta_i$ for some $\delta_i \in (0, 1)$, $i \in [k]$. Then, $\mathbb{P}\big(\bigcap_{i=1}^{k} A_i\big) \geq 1 - \sum_{i=1}^{k} \delta_i$.

Lemma 2 (Single-task generalization error bound vapnik_nature_2013 ).

Let $\mathcal{F}$ be a measurable and bounded real-valued function set whose Vapnik-Chervonenkis (VC) dimension vapnik_uniform_1971 is $d$. Let $S$ be a set of $N$ data samples drawn i.i.d. from a distribution. Then, for any $\delta \in (0, 1)$, the following inequality holds with probability at least $1 - \delta$:

(11)

where

(12)

Then, we upper bound the error induced by estimating the expected subjective function (5b) in each sampling episode of OSL, which is critical for bounding the generalization error. Recall that we use superscripts to denote sampling indices, e.g., the $i$-th domain sample, which can be any domain in the domain set $\mathbb{D}$. We define two shorthands as follows:

(13a)
(13b)

where the superscript indexes the corresponding sampling episode of OSL and $\mathbb{H}$ is the hypothesis set.

Lemma 3 (Subjective estimation error bound).

Let the episodic data samples in the $i$-th sampling episode be drawn i.i.d. from domain $D^i$ with size $m$. Then, for any $\delta \in (0, 1)$, the following inequality holds uniformly for all hypotheses with probability at least $1 - \delta$:

(14)

where is the function set of the sample-wise error.

Proof.

We have the following decomposition:

in which the original difference is decomposed into three terms. By definitions (13a) and (13b), the middle term is non-positive. Using Lemma 2 on the function set of the sample-wise error with confidence parameter $\delta/2$, the first and last terms can each be bounded with probability at least $1 - \delta/2$. Combining these two bounds with Lemma 1 completes the proof. ∎

Now we can give the proof of the main theorem.

Theorem 3 (Generalization error bound of OSL).

For any $\delta \in (0, 1)$, the following inequality holds uniformly for all hypothesis sets with probability at least $1 - \delta$:

(15a)
(15b)
(15c)

where the first set is the function set of the domain-wise expected error, the second is the function set of the sample-wise error, and the count denotes the number of times the target function of the corresponding domain is sampled.

Proof.

Combining the objectives (2) and (3) with the subjective functions (5a) and (5b), we have the following decomposition:

By definitions (13a) and (13b), we rewrite the equation above:

in which the generalization error of OSL is decomposed into three terms. Applying Lemma 2 to the function set of the domain-wise expected error bounds the first term with high probability. Applying Lemma 3 to each sampling episode bounds the last term with high probability. There remains the middle term, for which we have