Knowledge Aggregation via Epsilon Model Spaces

05/20/2018 · by Neel Guha, et al.

In many practical applications, machine learning is divided over multiple agents, where each agent learns a different task and/or learns from a different dataset. We present Epsilon Model Spaces (EMS), a framework for learning a global model by aggregating local learnings performed by each agent. Our approach forgoes sharing of data between agents, makes no assumptions on the distribution of data across agents, and requires minimal communication between agents. We empirically validate our techniques on MNIST experiments and discuss how EMS can generalize to a wide range of problem settings, including federated averaging and catastrophic forgetting. We believe our framework to be among the first to lay out a general methodology for "combining" distinct models.


1 Introduction

In traditional machine learning settings, the dataset for a task is consolidated on one device, where a single data example (or batch) can be fetched at any time. In these settings, a model can be easily trained using standard methods (e.g., stochastic gradient descent).

Increasingly, however, datasets may be distributed over multiple devices, where sharing data between devices is precluded for privacy and/or communication reasons. Our goal in these settings is to learn a single model over the entire dataset without sharing data between devices. Examples of such settings include:

  • Patient records from different hospitals, which could be used to train disease-diagnosing models.

  • Sensor data from different self-driving cars, which could be used to improve perceptual and navigational models.

  • Text messages from different cell phone users, which could be used to train improved auto-correct/spell checking models.

In these cases, there is a significant communication and/or privacy cost with sharing data. Hospitals are prevented from sharing data with outside entities and users may prefer to keep personal data on their device. Sensor data from many different cars may be too expensive to communicate and/or aggregate.

There are advantages to learning a single global model over the entire dataset (as opposed to individual models over each device). The data contained on a single device may not be sufficient to train a complex model (especially deep neural networks). When devices generate their own data, the distribution of data across different devices may be non-i.i.d. Training a single model across all data could improve generalization and robustness.

Recent approaches [22; 25; 13] learn a global model by iteratively averaging local learnings performed on each device. In a single round of communication, each device communicates a parameter update (based on local data) to a central server, which averages updates from all devices to calculate a new set of global parameters. These parameters are communicated back to each device, and the procedure repeats until some terminal condition. These approaches face several challenges:

Non-I.I.D data: When the distribution of data across devices differs significantly, averaging parameter updates could be suboptimal on certain devices or prevent convergence for the global parameters. This is particularly common when devices generate their own data. For example, hospitals in different parts of the United States may see patients with different symptoms for the same condition.

Communication Costs: If devices communicate infrequently or unreliably, approaches requiring synchronization between devices will fail. For example, if a user switches off their phone, they are no longer capable of sending updates to a central server. In some cases, the size of a device’s message may face bandwidth limitations.

In our work, we present Good-Enough Model Spaces (GEMS), an alternative framework to traditional federated learning approaches. In GEMS, the goal is to learn a globally satisficing (i.e. "good-enough") hypothesis by considering each learner's set of locally satisficing hypotheses. Assuming a globally satisficing hypothesis is locally satisficing for every learner, we can learn one by intersecting each learner's set of locally satisficing hypotheses. GEMS is inspired by Version Spaces [23], in which a set of hypotheses consistent with the data is maintained during learning.

This approach avoids the need to share data between learners, requires fewer rounds of communication, and can provide constraints on the size of the global learned model.

In this paper, we make the following contributions.

  1. We present a formalization of GEMS agnostic to the function class of the hypothesis (model type).

  2. We present techniques for applying GEMS to both shallow hypotheses (logit models) and multilayer hypotheses (deep neural networks). In particular, we address the problem introduced by the existence of a large number of isomorphic models in multilayer neural networks.

  3. We discuss how "public" data available to the central server can be used to fine-tune a globally satisficing model.

  4. Finally, we present empirical validation of our methods on a variety of image recognition (MNIST) and text understanding (sentiment analysis with Twitter) experiments.

2 Literature Review

We now discuss the relation of GEMS to existing work.

Distributed Learning. Distributed learning is a common paradigm in industry settings, where a single dataset may be partitioned between multiple data centers [8; 20]. However, these approaches are intended for i.i.d settings, where the partition of data across devices may be explicitly specified to optimize performance. Further, these approaches assume constant communication with devices and may involve a very large number of updates. In contrast, our approach allows for non-i.i.d partitions of data and requires very few updates.

Federated Optimization. More recently, Federated Averaging [22] has been explored for learning aggregate models over distributed datasets. Each learner uses a set of global parameters to calculate a gradient update based on local data. The central server applies the average of all learner gradients to the global parameters. [25] also present the distributed selective gradient descent algorithm, where learners only send updates for their most important parameters. [6] examine learning sparse SVMs over distributed medical data. Though federated averaging has found empirical success, it has significant communication costs: the number of communication rounds required for convergence can be large. Our approach allows learners to identify a global aggregate model within a few rounds of communication. In cases where agents may be faulty or unreliable (e.g. cellphones), this presents significant advantages.

Ensemble Techniques. Our approach draws comparisons to work in ensemble learning, including bagging [9], Adaboost [12], distributed boosting [18] and distillation [16]. At a high level, GEMS and ensemble methods both attempt to "combine" models by aggregating the knowledge they contain. However, there are several important distinctions. Ensemble methods can produce arbitrarily large models, where the size of the final model is the size of the entire ensemble; when storage is constrained, this may be impractical. Additionally, ensemble techniques assume an i.i.d. distribution, and most approaches assume that every model in the ensemble is trained on the same dataset. This differs significantly from our setting.

Privacy. Learning "privacy-preserving" models is an active area of work. Applying some noise to a model/query’s output has been widely explored [2; 7]. Other methods have proposed generating surrogate datasets (which are significantly different from the original data) but still sufficient to learn an accurate model [3; 4]. Similarly, [24] propose constructing an ensemble of black-box teachers (each trained locally on the agent datasets) and training a student network on the aggregated votes of the teacher. In general, these approaches do not scale to a distributed data setting or require the central server to have a significant amount of unlabeled data. It is not clear how privacy guarantees are affected in non-i.i.d settings, or when one of the learners is considered adversarial. These are questions we hope to explore in future work.

3 Good-Enough Model Spaces

In a distributed learning setting, a training set $D$ drawn from $\mathcal{X} \times \mathcal{Y}$ is divided amongst $K$ learners in a potentially unbalanced and non-i.i.d. manner. We let $D_k$ denote the subset of training examples belonging to learner $k$, such that $D = \bigcup_{k=1}^{K} D_k$. Our goal is to learn a global aggregate model (hypothesis) $h^*$ for a predictive task over $D$. We find that $h^*$ can be learned by sharing information about local learnings (as opposed to raw data) between agents.

We modify the traditional approach of identifying the optimal (i.e. empirical risk minimizing) model over all of $D$. Instead, we learn a model that is good-enough, or satisficing [26], over $D$. Formally, we define a hypothesis decider $Q(h, D)$ which determines whether a given hypothesis $h$ is satisficing over some data $D$. We discuss possible definitions for $Q$ later. There exists a set of globally good-enough models $H^*$, where:

$H^* = \{ h \in \mathcal{H} : Q(h, D) = 1 \}$    (1)

In our approach, each learner $k$ communicates a support set over the hypothesis space $\mathcal{H}$ consisting of hypotheses which are good-enough for its local data $D_k$, i.e., its Good-Enough Model Set $H_k$. This is given by:

$H_k = \{ h \in \mathcal{H} : Q(h, D_k) = 1 \}$    (2)

If we assume that there exists a hypothesis $h^* \in H^*$ which is also locally satisficing for every learner, then $h^* \in \bigcap_{k=1}^{K} H_k$. Informally, a hypothesis in the good-enough model space of every one of the learners should be globally good-enough. Our approach learns $h^*$ by identifying the good-enough model space $H_k$ for each learner and calculating their intersection. When the data partition across learners is extremely imbalanced or skewed, the global good-enough model might not be good-enough for all local learners. In such cases, $\bigcap_{k=1}^{K} H_k = \emptyset$ and our approach will fail.

Figure 1 visualizes this approach for a model class with only two weights ($w_1$ and $w_2$) and two learners ("red" and "blue"). The GEMS for each learner is a set of regions over the weight space (the blue regions correspond to one learner and the red regions correspond to the second learner). A model in the overlap of both gives us $h^*$. In terms of this visualization, our goal is to find a point in the intersecting space.

3.1 Good-Enough Criteria

There are different approaches to defining $Q$. In the original Version Spaces work, $Q$ was defined as the set of theories/models consistent with the data observed. From a Bayesian perspective, each learner's learning is in the form of a set of priors over the set of possible models. In this case, $Q$ can be defined by thresholding $P_k(h)$, where $P_k$ corresponds to a learner's Bayesian prior over the model space and $P_k(h)$ scores the likelihood of a hypothesis $h$ being good-enough. [5] present methods for training deep neural networks that induce unique probability distributions over the weights of a network. Applying a threshold to these parameter distributions could be used to define $Q$.

$Q$ may also be used to enforce additional constraints specific to a learner. For example, $Q$ could restrict a learner's model space to models which are differentially private [1], or provide some other privacy guarantee.

In this work, we define a simple $Q$ in terms of the maximum acceptable cross-validation loss of $h$. For some hyperparameter $\epsilon$ and loss function $\ell$,

$Q(h, D_k) = \mathbb{1}\left[ \ell(h, D_k) \leq \epsilon \right]$

For the global good-enough model, $Q$ is defined over the entire dataset $D$.
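To make the criterion concrete, here is a minimal Python sketch of this loss-threshold decider. The `squared_loss` function, the synthetic data, and the chosen epsilon are illustrative assumptions, not details from the paper.

```python
import numpy as np

def Q(h, data, loss_fn, epsilon):
    """Hypothesis decider: h is good-enough on `data` when its loss is at most epsilon."""
    return loss_fn(h, data) <= epsilon

# Illustrative use with a linear model and squared loss on synthetic data.
def squared_loss(weights, data):
    X, y = data
    return float(np.mean((X @ weights - y) ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
w_true = rng.normal(size=5)
y = X @ w_true + 0.01 * rng.normal(size=100)

print(Q(w_true, (X, y), squared_loss, epsilon=0.05))       # True: near-optimal weights
print(Q(np.zeros(5), (X, y), squared_loss, epsilon=0.05))  # almost certainly False
```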

We use a fixed $\epsilon$ across all learners. In some settings, defining a different $\epsilon$ for each learner may be appropriate. This may be beneficial when the data is extremely unbalanced (i.e. one learner has orders of magnitude more data than another).

There is a natural trade-off between the "selectivity" of $Q$, our ability to identify an intersection between different learners' hypothesis spaces, and the performance of the resulting model. If $Q$ is selective, i.e. $\epsilon$ is small, then finding an intersection between different learners' hypothesis spaces is harder. If $\epsilon$ is larger, the hypothesis spaces will be larger and identifying an intersection will be easier. However, an identified hypothesis in the intersection may have poorer performance.

For non-convex loss surfaces, a given $\epsilon$ might induce a discontiguous and non-convex good-enough model space. For certain $\epsilon$, the good-enough model space might include regions of the hypothesis space corresponding to local optima. In general, however, our goal is not to compute all of $H_k$, but simply to find a model that lies in the intersection of the different learners' $H_k$. In our experiments, we find that even simple approximations of $H_k$ provide good results.

3.2 Communication Efficiency

In the idealized scenario, the coordinating agent requires only one update from each learner (i.e. one approximation of $H_k$) to identify $h^*$. In addition to reducing the number of communication rounds, this allows for asynchronous learning, where all learners are not required to be "active" at the same time.

With multi-layer neural networks, it is useful to allow for a small number of updates in both directions. Regardless, the number of communication rounds is still smaller than in other approaches like [22].

3.3 Privacy

Because no data is shared between learners, our approach affords a privacy benefit. However, there is an unknown privacy cost incurred by sharing $H_k$. Recent work [11] has demonstrated that a model's weights can be inverted to learn characteristics of the underlying training data. Determining the privacy loss of sharing $H_k$ and defining countermeasures is an area for future work. For example, the inclusion/exclusion of a single training example on learner $k$ will impact the boundaries of the resulting support space. A differentially private approach [10] might involve perturbing the support space boundaries.
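The countermeasure above is only sketched in the text. As a purely illustrative example of what perturbing a shared boundary might look like, one could add noise to a ball's radius before communicating it; the `noise_scale` parameter is an assumption, and no formal differential-privacy accounting is implied.

```python
import numpy as np

def perturb_radius(radius, noise_scale, seed=None):
    """Illustrative only: add Laplace noise to a ball's radius before sharing it,
    clamping at zero. This sketches the suggested countermeasure; it does not by
    itself provide a formal differential-privacy guarantee."""
    rng = np.random.default_rng(seed)
    return max(radius + rng.laplace(loc=0.0, scale=noise_scale), 0.0)
```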

3.4 Fine Tuning

In certain cases, there may be a small sample of public data available to all agents. After the coordinating agent identifies $h^*$, it can fine-tune its weights with this public data. This is analogous to optimizing within the intersection of all $H_k$.
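As a rough sketch of this fine-tuning step, the snippet below updates only the output layer of a two-layer ReLU network on a small public set using plain gradient descent. The shapes, learning rate, and number of steps are placeholder assumptions rather than settings from the paper.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def fine_tune_output_layer(W1, b1, W2, b2, X_pub, Y_pub, lr=0.1, steps=100):
    """Fine-tune only the final-layer weights (W2, b2) of a two-layer network
    h(x) = softmax(relu(x W1 + b1) W2 + b2) on a small public dataset.
    Y_pub is expected to contain one-hot labels."""
    H = relu(X_pub @ W1 + b1)                      # hidden features (kept frozen)
    for _ in range(steps):
        logits = H @ W2 + b2
        logits -= logits.max(axis=1, keepdims=True)
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        grad = (probs - Y_pub) / len(X_pub)        # softmax cross-entropy gradient
        W2 -= lr * (H.T @ grad)
        b2 -= lr * grad.sum(axis=0)
    return W2, b2
```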

Figure 1: Illustration of good-enough model spaces

4 Computing GEMS

In applying our methods, we identify two variants: cases where the hypothesis function class corresponds to a shallow model and cases where it corresponds to a deep model. Shallow models include linear regressors, logit models, etc., where the input space corresponds to the feature space.

Deep models (such as DNNs) pose additional challenges arising out of the explosion in the number of isomorphic models. For example, swapping neurons within a DNN's hidden layer does not affect the behavior of the network, but corresponds to a model located in a very different region of the weight space.

We now discuss how to determine $H_k$ and identify $h^*$ for both of these cases. Though our algorithms are described with two learners, they can easily generalize to more learners.

Metric     $h_1$   $h_2$   Averaged   Gold Model   $h^*$   Fine-tuned $h^*$
Accuracy   0.51    0.51    0.66       0.92         0.83    0.88
Loss       2.05    1.88    1.20       0.27         0.54    0.39
Table 1: MNIST Deep GEMS. The accuracy and loss (cross-entropy) for each agent's local model ($h_1$ and $h_2$), the averaged model (average of $h_1$ and $h_2$), the gold model, $h^*$, and a fine-tuned $h^*$ in the deep experiment. The learned $h^*$ contained 68 hidden neurons (41 pairs of intersections).
Metric     $h_1$   $h_2$   Averaged   Gold Model   $h^*$   Fine-tuned $h^*$
Accuracy   0.51    0.53    0.42       0.95         0.71    0.90
Loss       2.79    2.43    1.72       0.16         1.264   0.66
Table 2: MNIST Multi-round GEMS. The accuracy and loss (cross-entropy) for each learner's local model ($h_1$ and $h_2$), the averaged model, the gold model, $h^*$, and the fine-tuned $h^*$ in the multi-round setting for a three layer model.

4.1 Shallow Models

For a shallow hypothesis class with $d$ trainable parameters, we approximate $H_k$ as a set of closed balls in $\mathbb{R}^d$ where all models contained by each ball produce an empirical loss less than the hyperparameter $\epsilon$ on a learner's local data. A learner's support space is thus defined by a set of tuples $(c, r)$, where $c$ denotes the center of a ball and $r$ denotes its radius. Each $c$ corresponds to the weights of the optimal model trained on learner $k$'s local data from a given set of initial weights. $r$ is determined by evaluating the local loss of uniformly sampled models at a fixed radius from $c$. The number of samples is a function of $d$, the number of parameters in the model. If all sampled models produce a loss less than $\epsilon$, we increase the radius and repeat until the sampling radius produces non-satisficing models. The final radius producing all satisficing models is used to define a ball of good-enough models that is part of $H_k$. It is possible that there are points inside a ball that have an error greater than $\epsilon$. However, with a sufficiently large number of samples, we can generate a ball such that, with high probability, every point inside the ball is a good-enough model.
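The radius search above can be sketched in a few lines of numpy. The sampling count, initial radius, and step size below are illustrative defaults, and `loss_fn(weights, data)` stands for any loss evaluated on the learner's local data.

```python
import numpy as np

def sample_on_sphere(center, radius, n_samples, rng):
    """Sample weight vectors uniformly on the sphere of the given radius around center."""
    d = center.shape[0]
    dirs = rng.normal(size=(n_samples, d))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    return center + radius * dirs

def estimate_ball(center, local_data, loss_fn, epsilon,
                  r_init=0.01, r_step=0.01, n_samples=100, seed=0):
    """Grow the radius until some sampled model on the sphere exceeds epsilon;
    return the largest radius whose samples were all good-enough."""
    rng = np.random.default_rng(seed)
    r = r_init
    while True:
        models = sample_on_sphere(center, r, n_samples, rng)
        if all(loss_fn(m, local_data) <= epsilon for m in models):
            r += r_step
        else:
            return center, max(r - r_step, 0.0)
```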

Given intersecting balls $(c_1, r_1) \in H_1$ and $(c_2, r_2) \in H_2$ from two different learners, we approximate $h^*$ as the center of their intersection. If no intersection exists, we retrain the local models using different parameter initializations and repeat the process. At each iteration, we add the ball corresponding to that iteration's hypothesis space for each learner to that learner's accumulated hypothesis space. Algorithm 1 provides the precise formulation.

Input: Local data $D_1$, $D_2$, threshold $\epsilon$

1: Initialize empty hypothesis spaces $H_1$ and $H_2$
2: while True do
3:     $(c_1, r_1) \leftarrow$ GetGoodEnoughModels($D_1$, $\epsilon$)
4:     $(c_2, r_2) \leftarrow$ GetGoodEnoughModels($D_2$, $\epsilon$)
5:     Add $(c_1, r_1)$ to $H_1$ and $(c_2, r_2)$ to $H_2$
6:     if $H_1$ and $H_2$ contain intersecting balls then
7:         return $h^*$, the center of the intersection of the intersecting balls
8:
9: function GetGoodEnoughModels($D_k$, $\epsilon$)
10:     $c \leftarrow$ weights of the optimal model trained on $D_k$ (from a fresh initialization)
11:     $r \leftarrow r_0$ (initial radius); fix a step size $\delta$
12:     while True do
13:         $S \leftarrow$ models sampled uniformly at radius $r$ from $c$
14:         if all $s \in S$ produce loss less than $\epsilon$ on $D_k$ then
15:             $r \leftarrow r + \delta$
16:         else
17:             return $(c, r - \delta)$
Algorithm 1 Shallow GEMS for two learners
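To complement Algorithm 1, here is a small numpy sketch of the intersection test and one simple way to pick a model from the overlap of two balls. The choice of the overlap midpoint along the line between centers is our illustrative approximation of the "center of intersection", not a formula given in the paper.

```python
import numpy as np

def balls_intersect(c1, r1, c2, r2):
    """Two closed balls intersect when the distance between centers
    does not exceed the sum of the radii."""
    return np.linalg.norm(c1 - c2) <= r1 + r2

def intersection_center(c1, r1, c2, r2):
    """One simple choice of a model inside both balls: the midpoint of the
    overlap region measured along the segment joining the two centers."""
    d = np.linalg.norm(c2 - c1)
    if d == 0:
        return c1.copy()
    u = (c2 - c1) / d
    lo, hi = max(d - r2, 0.0), min(r1, d)   # overlap extent from c1 along u
    return c1 + u * (lo + hi) / 2.0
```

By construction, the returned point lies inside both balls whenever they intersect, so it is good-enough with respect to both learners' sampled approximations.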

4.2 Deep Models

Multilayer neural networks consist of stacked transformations, where the initial layers act as feature extractors. It is likely in our setting that different learners may require different features in order to learn accurate models for their tasks/data. Our approach treats each neuron as a distinct shallow model, and identifies its support space using the methodology described above.

The global model is approximated by stitching together neurons from both learners. For all neurons in the same layer $i$ from both learners, we evaluate the cartesian product of their support spaces. If a particular feature extractor is common to both learners, the corresponding neurons should have overlapping support spaces. The neuron at the center of the intersection of these support spaces is inserted into $h^*$ at layer $i$. If a particular feature extractor is unique to one learner, its neuron's support space may not overlap with any of the neuron support spaces from the other learner. However, this feature extractor is still necessary in the context of the task. The neuron is thus inserted unchanged into layer $i$ of $h^*$.
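A schematic sketch of this stitching step for one hidden layer is shown below. It assumes each neuron's support space has already been approximated as a (center, radius) ball over its flattened incoming and outgoing weights, reuses the `balls_intersect` and `intersection_center` helpers sketched earlier, and uses a greedy one-to-one matching rather than scoring the full cartesian product.

```python
# Requires balls_intersect and intersection_center from the earlier shallow-GEMS sketch.

def stitch_layer(neurons_1, neurons_2):
    """Combine one hidden layer from two learners.

    Each entry of neurons_1 / neurons_2 is a (center, radius) ball over the
    flattened incoming + outgoing weights of one neuron. Matched neurons are
    merged at the center of their intersection; unmatched neurons are copied
    over unchanged.
    """
    merged, used_2 = [], set()
    for c1, r1 in neurons_1:
        match = None
        for j, (c2, r2) in enumerate(neurons_2):
            if j not in used_2 and balls_intersect(c1, r1, c2, r2):
                match = j
                break
        if match is None:
            merged.append(c1)                      # unique feature extractor: keep as-is
        else:
            c2, r2 = neurons_2[match]
            merged.append(intersection_center(c1, r1, c2, r2))
            used_2.add(match)
    merged.extend(c2 for j, (c2, r2) in enumerate(neurons_2) if j not in used_2)
    return merged                                  # neuron weight vectors for the aggregate layer
```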

Metric     $h_1$   $h_2$   Averaged   Gold Model   $h^*$   Fine-tuned $h^*$
Accuracy   0.63    0.53    0.62       0.72         0.68    0.68
Loss       0.65    0.68    0.68       0.59         0.64    0.64
Table 3: Twitter Deep GEMS Results. The accuracy and loss (cross-entropy) for each agent's local model ($h_1$ and $h_2$), the averaged model (average of $h_1$ and $h_2$), the gold model, $h^*$, and a fine-tuned $h^*$ in the deep experiment. The learned $h^*$ contained 82 hidden neurons (36 pairs of intersections).
Metric     $h_1$   $h_2$   Averaged   Gold Model   $h^*$   Fine-tuned $h^*$
Accuracy   0.61    0.53    0.51       0.72         0.67    0.68
Loss       0.66    0.67    0.69       0.67         0.66    0.66
Table 4: Twitter Multi-Round GEMS Results. The accuracy and loss (cross-entropy) for each agent's local model ($h_1$ and $h_2$), the averaged model (average of $h_1$ and $h_2$), the gold model, $h^*$, and a fine-tuned $h^*$ in the deep experiment with a three layer network.

Input: Data for Learner 1 and 2 ($D_1$ and $D_2$), threshold $\epsilon$

1: function Combine($D_1$, $D_2$, $\epsilon$)
2:     Train local models $h_1$ on $D_1$ and $h_2$ on $D_2$
3:     Initialize an empty aggregate model $h^*$
4:     for all layers $i$ do
5:         for neuron $n$ in layer $i$ of Learner 1 do: GetNeuronSpace($n$, $D_1$, $\epsilon$)
6:         for neuron $m$ in layer $i$ of Learner 2 do: GetNeuronSpace($m$, $D_2$, $\epsilon$)
7:         for all pairs of neurons with intersecting epsilon spaces do
8:             Insert the center of the intersection into layer $i$ of the aggregate model
9:         for all remaining neurons $n$ and $m$ with no corresponding intersection do
10:             Insert the neuron unchanged into layer $i$ of the aggregate model
11:     return $h^*$
12: function GetNeuronSpace($n$, $D_k$, $\epsilon$)
13:     $c \leftarrow$ incoming and outgoing weights of $n$; $r \leftarrow r_0$ (initial radius); fix a step size $\delta$
14:     while True do
15:         $S \leftarrow$ neurons sampled uniformly at radius $r$ from $c$
16:         if all $s \in S$ produce loss less than $\epsilon$ on $D_k$ then $r \leftarrow r + \delta$
17:         else return $(c, r - \delta)$
Algorithm 2 Deep GEMS for two learners

5 Experimental Results

We conduct experiments on MNIST [19] and Twitter sentiment data [14]. Our Twitter dataset consisted of 121578 tweets (60583 positive and 60995 negative) from 50,000 users. All of our experiments follow a similar methodology and involve two distinct learners. We begin by identifying an adversarial partition over the dataset to create local datasets for both learners (discussed in more detail below). We divide each learner's local dataset into a train/test/validation split. All results are reported on the aggregated test set across both learners. We use the validation split for each learner to determine $\epsilon$. In general, our partitions are such that models trained locally on a learner's train data perform relatively poorly on the aggregated test data.

We present experiments for shallow hypotheses on MNIST, and for deep hypotheses on both the Twitter data and MNIST.

5.1 Shallow Hypothesis Class

We learn a logistic model to predict the parity (even/odd) of MNIST images, where learner 1's data consists of digits 0-4 and learner 2's data consists of digits 5-9. We featurize each image by training a two layer neural network with 50 hidden neurons and passing each image through the first layer. Each learner constructs $H_k$ on the featurized train set for its digits, and the final models are evaluated on the aggregate featurized test set across both learners.
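A rough sketch of this featurization step is shown below, using scikit-learn's MLPClassifier as a stand-in for training details the paper does not specify; the iteration count and data handling are illustrative assumptions.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

def fit_featurizer(X_train, y_train, hidden=50, seed=0):
    """Train a one-hidden-layer network and return a function that maps raw
    pixels to the hidden-layer activations (the shared feature space)."""
    net = MLPClassifier(hidden_layer_sizes=(hidden,), activation="relu",
                        max_iter=50, random_state=seed)
    net.fit(X_train, y_train)
    W1, b1 = net.coefs_[0], net.intercepts_[0]
    return lambda X: np.maximum(X @ W1 + b1, 0.0)   # first-layer ReLU features
```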

Metric     $h_1$   $h_2$   Gold Model   $h^*$
Accuracy   0.75    0.73    0.93         0.91
Loss       0.99    0.736   0.19         0.22
Table 5: Shallow GEMS Results. The accuracy and loss for each learner's local model ($h_1$ and $h_2$), the gold model, and $h^*$ in the shallow experiment.

Table 5 presents the accuracy and loss (cross-entropy) on the aggregated global test set. $h_1$ is a model trained on digits 0-4, and $h_2$ is a model trained on digits 5-9. The gold model was learned on aggregated train data from both learners. The reported results correspond to the lowest epsilon for which shallow GEMS identified an aggregate model. We found that at lower values of epsilon, GEMS failed to identify an intersection between each agent's epsilon space.

5.2 Deep Hypothesis Class

Our second experiment learns a two layer neural network. For MNIST, we scramble every image's pixels (in a uniform manner) to generate a transformed MNIST dataset. Learner 1 learns digit mappings (0-9) on the original MNIST dataset, while Learner 2 learns digit mappings on the transformed dataset. This experimental setting is commonly found in work on catastrophic forgetting (the tendency of neural networks to "forget" old knowledge when trained on data for new tasks) [17; 21; 15]. This constitutes an adversarial experimental setting, where the data distributions over the learners differ significantly.

Each learner learned a two layer fully connected neural network with 50 hidden neurons, ReLU activations, and dropout [27] on its local data. The hypothesis space for a neuron consists of both the incoming and outgoing weights. For fine-tuning $h^*$, we held out 200 images from the original MNIST dataset and 200 images from the transformed MNIST dataset. This data was used to update the final layer weights of $h^*$.

Table 1 compares the accuracy and loss (cross-entropy) on the aggregated test set across both learners (both regular and transformed MNIST images). $h_1$ is the local model trained only on the original MNIST digits and $h_2$ is the model trained only on the transformed MNIST digits. The averaged model was produced by averaging the weights of $h_1$ and $h_2$. The gold model was trained on the aggregate of data across all agents.

We compare our techniques against naively averaging the weights of both learners' local models in the deep hypothesis setting. We found that when both local models start with the same weight initialization, the averaged model has an accuracy of 79% and a loss of 0.78 on the aggregated data. When each local model is initialized with a random set of weights, the averaged model has an accuracy of 66% and a loss of 1.20. In both cases, our learned global model outperforms the averaged model (performing approximately on par with the results reported above).
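For reference, the naive averaging baseline amounts to an element-wise mean of the two local models' parameters, as in the minimal sketch below (assuming both models share an identical architecture and parameter ordering).

```python
def average_models(params_1, params_2):
    """Naive baseline: element-wise average of two models' parameter arrays.
    Assumes both models have identical architectures and parameter shapes."""
    return [(p1 + p2) / 2.0 for p1, p2 in zip(params_1, params_2)]
```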

For Twitter, we use a bag-of-words model consisting of TF-IDF scores for the 1000 most frequent terms across all tweets. We randomly selected a subset of terms and partitioned tweets by whether or not they contained any of the terms. This forces a non-i.i.d distribution across both learners. Each learner learned a two-layer neural network with 50 hidden neurons, ReLU activations, and dropout. The results of the local models and the aggregated models are presented in Table 3.

Finally, we experimented with a multi-round variant which scales to models with three or more layers. In this variant, learners communicate one update per hidden layer. After identifying a layer with good-enough neurons, each learner retrains its model using this layer (updating weights for upper layers). Table 2 presents the results for this approach on a three layer network with 50 hidden neurons in each layer (ReLU activations) for the deep hypothesis experiment. $h^*$ and the fine-tuned $h^*$ significantly outperform both local models and naive averaging. Table 4 contains the results of a similar three layer network on the Twitter dataset.
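The multi-round variant can be outlined as below. The `Learner` interface (`get_neuron_balls`, `retrain_above`) is a hypothetical placeholder for each learner's local training code, and `stitch_layer` refers to the earlier sketch; this is a schematic outline under those assumptions, not the paper's implementation.

```python
def multi_round_gems(learner_1, learner_2, num_hidden_layers):
    """Schematic outline of the layer-wise multi-round variant.

    For each hidden layer, both learners share neuron support spaces, the
    coordinator stitches the layer, and each learner then freezes the agreed
    layer and retrains the layers above it before the next round.
    """
    agreed_layers = []
    for layer in range(num_hidden_layers):
        balls_1 = learner_1.get_neuron_balls(layer)   # hypothetical learner API
        balls_2 = learner_2.get_neuron_balls(layer)
        merged = stitch_layer(balls_1, balls_2)        # see earlier sketch
        agreed_layers.append(merged)
        learner_1.retrain_above(layer, merged)         # freeze agreed layer, retrain upper layers
        learner_2.retrain_above(layer, merged)
    return agreed_layers
```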

6 Discussion

In our experiments, we find that the global model consistently outperforms learner local models, while requiring only a few rounds of communication. We now discuss several observations and tradeoffs of our approaches.

6.1 Data Splits

We also evaluated our methods under less adversarial data splits. Specifically, we allow each learner's dataset to be a random mixture of both the original and transformed MNIST data according to some mixing factor $\alpha$. At one extreme of $\alpha$, learner 1's data is entirely sampled from the original MNIST dataset (and learner 2's data is entirely sampled from the transformed dataset). At the other extreme, approximately half of each learner's data is sampled from each of the original and transformed datasets. The first setting is thus the most adversarial, with the least data overlap between agents, while the second is the most "friendly" setting, with an i.i.d. distribution of data across agents. Figure 3 displays the loss for different models. For simplicity, we only present the performance of the better performing local model ($h_1$ vs. $h_2$). We find that, in general, the global model performs consistently well for different data mixtures.
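A small sketch of how such mixed splits can be generated is given below. We assume the mixing factor is the fraction of each learner's data drawn from the other learner's source (so 0 is fully separated and 0.5 is approximately i.i.d.), which is one reasonable reading of the setup above.

```python
import numpy as np

def mixture_split(original, transformed, alpha, seed=0):
    """Split two equally sized data pools between two learners.

    alpha is the fraction of each learner's data drawn from the *other*
    learner's source: alpha = 0 gives fully separated (adversarial) datasets,
    alpha = 0.5 gives an approximately i.i.d. split.
    """
    rng = np.random.default_rng(seed)
    n = min(len(original), len(transformed))
    k = int(alpha * n)
    orig_idx = rng.permutation(n)
    trans_idx = rng.permutation(n)
    learner_1 = [original[i] for i in orig_idx[k:]] + [transformed[i] for i in trans_idx[:k]]
    learner_2 = [transformed[i] for i in trans_idx[k:]] + [original[i] for i in orig_idx[:k]]
    return learner_1, learner_2
```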

6.2 Model Size

For shallow models, our approach ensures that the learned global model contains the same number of parameters as each learner’s local model. In settings where memory is limited or evaluating a model is expensive, this is advantageous. This differs from ensemble methods, where the size of the final model would increase as more learners are added.

For deep models, our approach balances model size with performance. By picking greater values of epsilon, we can implicitly reduce the size of the aggregate model. We experimented with this by learning the global model $h^*$ for different epsilon thresholds (Figure 2). Picking greater values of epsilon increases the tolerance for each neuron, allowing for more intersections and a smaller hidden layer.

6.3 Hypothesis Sampling Tradeoff

Our approach identifies $H_k$ by sampling a fixed number of hypotheses at radius $r$ from a center $c$ and evaluating the loss of each sampled hypothesis. As the radius increases, so does the region from which to sample. Thus, we expect the accuracy of our approximation of $H_k$ to weaken for greater radii. However, if the number of samples were to scale with the size of the ball, calculating $H_k$ for large radii would be computationally infeasible.

Related to this tradeoff is the question of coverage. For a given $H_k$ induced by a threshold $\epsilon$, it would be useful to quantify the strength of $H_k$. This could be defined as:

  1. The proportion of hypotheses in $H_k$ whose loss is less than $\epsilon$ (i.e. precision).

  2. The proportion of all hypotheses with loss less than $\epsilon$ that are contained in $H_k$ (i.e. recall).

Unfortunately, it is not clear how these could be calculated for non-convex loss surfaces (in a computationally feasible manner). In future work, we hope to develop better methods for evaluating $H_k$.

Figure 2: We experiment with different values of epsilon. For each value of epsilon, we identify the global model and report the number of neurons in the hidden layer.
Figure 3: Performance on different data splits. For clarity, we only present the result of the better performing local model ("Best Local Model").

7 Conclusion and Future Work

In summary, we present techniques for learning global models over distributed datasets in relatively few rounds of communication and with no data sharing. By considering the set of good-enough models for each learner, we require far less synchronicity between learners. We find that simple $\epsilon$-ball approximations of a learner's good-enough model set produce promising results.

In future work, we hope to expand our techniques to different model architectures (convolutional layers, decision tree ensembles, etc). In addition, we would like to explore the theoretical performance and privacy guarantees of this approach. Finally, we are interested in gaining a deeper understanding of how good-enough model spaces can be used to construct "libraries" of models.

References