The standard machine learning paradigm involves algorithms that learn from centralized data, possibly pooled together from multiple data sources. The computations involved may be done on a single machine or farmed out to a cluster of machines. However, in the real world, data often live in silos and amalgamating them may be prohibitively expensive due to communication costs, time sensitivity, or privacy concerns. Consider, for instance, data recorded from sensors embedded in wearable devices. Such data is inherently private, can be voluminous depending on the sampling rate of the sensors, and may be time sensitive depending on the analysis of interest. Pooling data from many users is technically challenging owing to the severe computational burden of moving large amounts of data, and is fraught with privacy concerns stemming from potential data breaches that may expose a user’s protected health information (PHI).
Federated learning addresses these pitfalls by obviating the need for centralized data, instead designing algorithms that learn from sequestered data sources. These algorithms iterate between training local models on each data source and distilling them into a global federated model, all without explicitly combining data from different sources. Typical federated learning algorithms, however, require access to locally stored data for learning. A more extreme case surfaces when one has access to models pre-trained on local data but not the data itself. Such situations may arise from catastrophic data loss but increasingly also from regulations such as the general data protection regulation (GDPR) (EU, 2016), which place severe restrictions on the storage and sharing of personal data. Learned models that capture only aggregate statistics of the data can typically be disseminated with fewer limitations. A natural question then is, can “legacy” models trained independently on data from different sources be combined into an improved federated model?
Here, we develop and carefully investigate a probabilistic federated learning framework with a particular emphasis on training and aggregating neural network models. We assume that either local data or pre-trained models trained on local data are available. When data is available, we proceed by training local models for each data source, in parallel. We then match the estimated local model parameters (groups of weight vectors in the case of neural networks) across data sources to construct a global network. The matching, to be formally defined later, is governed by the posterior of a Beta-Bernoulli process (BBP) (Thibaux & Jordan, 2007), a Bayesian nonparametric (BNP) model that allows the local parameters to either match existing global ones or to create new global parameters if existing ones are poor matches.
Our construction provides several advantages over existing approaches. First, it decouples the learning of local models from their amalgamation into a global federated model. This decoupling allows us to remain agnostic about the local learning algorithms, which may be adapted as necessary, with each data source potentially even using a different learning algorithm. Moreover, given only pre-trained models, our BBP-informed matching procedure is able to combine them into a federated global model without requiring additional data or knowledge of the learning algorithms used to generate the pre-trained models. This is in sharp contrast with existing work on federated learning of neural networks (McMahan et al., 2017), which requires strong assumptions about the local learners, for instance, that they share the same random initialization, and is not applicable for combining pre-trained models. Next, the BNP nature of our model ensures that we recover compressed global models with fewer parameters than the cardinality of the set of all local parameters. Unlike naive ensembles of local models, this allows us to store fewer parameters and perform more efficient inference at test time, requiring only a single forward pass through the compressed model as opposed to $J$ forward passes, one for each local model. While techniques such as knowledge distillation (Hinton et al., 2015) allow for the cost of multiple forward passes to be amortized, training the distilled model itself requires access to data pooled across all sources or an auxiliary dataset, luxuries unavailable in our scenario. Finally, even in the traditional federated learning scenario, where local and global models are learned together, we show empirically that our proposed method outperforms existing distributed training and federated learning algorithms (Dean et al., 2012; McMahan et al., 2017) while requiring far fewer communications between the local data sources and the global model server.
The remainder of the paper is organized as follows. We briefly introduce the Beta-Bernoulli process in Section 2 before describing our model for federated learning in Section 3. We thoroughly evaluate the proposed model and demonstrate its utility empirically in Section 4. Finally, Section 5 discusses current limitations of our work and open questions.
2 Background and Related Works
Our approach builds on tools from Bayesian nonparametrics, in particular the Beta-Bernoulli Process (BBP) (Thibaux & Jordan, 2007) and the closely related Indian Buffet Process (IBP) (Griffiths & Ghahramani, 2011). We briefly review these ideas before describing our approach.
2.1 Beta-Bernoulli Process (BBP)
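Let $Q$ be a random measure distributed by a Beta process with mass parameter $\gamma_0$ and base measure $H$. That is, $Q \mid \gamma_0, H \sim \mathrm{BP}(1, \gamma_0 H)$. It follows that $Q$ is a discrete (not probability) measure $Q = \sum_i q_i \delta_{\theta_i}$ formed by a countably infinite set of (weight, atom) pairs $(q_i, \theta_i) \in [0, 1] \times \Omega$. The weights $\{q_i\}_{i=1}^{\infty}$ are distributed by a stick-breaking process (Teh et al., 2007): $c_i \sim \mathrm{Beta}(\gamma_0, 1)$, $q_i = \prod_{j=1}^{i} c_j$, and the atoms are drawn i.i.d. from the normalized base measure, $\theta_i \sim H / H(\Omega)$, with domain $\Omega$. In this paper, $\Omega$ is simply $\mathbb{R}^{D}$ for some $D$. Subsets of atoms in the random measure $Q$ are then selected using a Bernoulli process with base measure $Q$. That is, each subset $\mathcal{T}_j$ with $j = 1, \ldots, J$ is characterized by a Bernoulli process with base measure $Q$, $\mathcal{T}_j \mid Q \sim \mathrm{BeP}(Q)$. Each subset $\mathcal{T}_j$ is also a discrete measure formed by pairs $(b_{ji}, \theta_i) \in \{0, 1\} \times \Omega$, $\mathcal{T}_j := \sum_i b_{ji} \delta_{\theta_i}$, where $b_{ji} \mid q_i \sim \mathrm{Bernoulli}(q_i)$ for all $i$ is a binary random variable indicating whether atom $\theta_i$ belongs to subset $\mathcal{T}_j$. The collection of such subsets is then said to be distributed by a Beta-Bernoulli process.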
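To make the construction concrete, the following is a minimal sketch of a truncated draw from a Beta-Bernoulli process. It assumes a finite truncation level, a standard normal base measure, and the stick-breaking form with Beta(mass, 1) sticks; the function name `sample_bbp` and all parameter names are illustrative and not taken from the paper.

```python
import numpy as np


def sample_bbp(num_subsets, gamma0, truncation, dim, rng):
    """Truncated Beta-Bernoulli draw (illustrative sketch, not the authors' code)."""
    # Stick-breaking weights: c_i ~ Beta(gamma0, 1), q_i = c_1 * ... * c_i.
    sticks = rng.beta(gamma0, 1.0, size=truncation)
    weights = np.cumprod(sticks)
    # Atoms drawn i.i.d. from a standard normal base measure (an assumption).
    atoms = rng.standard_normal((truncation, dim))
    # b_ji ~ Bernoulli(q_i): row j marks which atoms subset j selects.
    membership = (rng.random((num_subsets, truncation)) < weights).astype(int)
    return weights, atoms, membership


rng = np.random.default_rng(0)
weights, atoms, membership = sample_bbp(
    num_subsets=5, gamma0=3.0, truncation=50, dim=2, rng=rng
)
print(membership.sum(axis=1))  # number of atoms selected by each subset
```

Because the weights are cumulative products of numbers in (0, 1), they decay with the atom index, so each subset selects only a few of the leading atoms.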
2.2 Indian Buffet Process (IBP)
The above subsets are conditionally independent given $Q$. Thus, marginalizing $Q$ will induce dependencies among them. In particular, we have $\mathcal{T}_J \mid \mathcal{T}_1, \ldots, \mathcal{T}_{J-1} \sim \mathrm{BeP}\left(H \frac{\gamma_0}{J} + \sum_i \frac{m_i}{J} \delta_{\theta_i}\right)$, where $m_i = \sum_{j=1}^{J-1} b_{ji}$ (dependency on $J$ is suppressed in the notation for simplicity); this marginal process is sometimes called the Indian Buffet Process. The IBP can be equivalently described by the following culinary metaphor. Imagine $J$ customers arrive sequentially at a buffet and choose dishes to sample as follows: the first customer tries $\mathrm{Poisson}(\gamma_0)$ dishes. Every subsequent $j$-th customer tries each of the previously selected dishes according to their popularity, i.e. dish $i$ with probability $m_i / j$, and then tries $\mathrm{Poisson}(\gamma_0 / j)$ new dishes.
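The culinary metaphor translates almost line by line into a sampler. The sketch below assumes the standard one-parameter IBP, in which customer number j takes an existing dish with probability (count so far) / j and then a Poisson(mass / j) number of new dishes; the function name `sample_ibp` and its arguments are illustrative.

```python
import numpy as np


def sample_ibp(num_customers, gamma0, rng):
    """Sample a binary customer-by-dish matrix from the IBP (illustrative sketch)."""
    dish_counts = []  # dish_counts[i] = number of earlier customers who tried dish i
    choices = []      # choices[j] = dish indices picked by customer j + 1
    for j in range(1, num_customers + 1):
        # Existing dishes are chosen according to their popularity.
        picked = [i for i, m in enumerate(dish_counts) if rng.random() < m / j]
        for i in picked:
            dish_counts[i] += 1
        # New dishes: Poisson(gamma0 / j) of them, appended as new columns.
        num_new = int(rng.poisson(gamma0 / j))
        picked += list(range(len(dish_counts), len(dish_counts) + num_new))
        dish_counts += [1] * num_new
        choices.append(picked)
    Z = np.zeros((num_customers, len(dish_counts)), dtype=int)
    for row, picked in enumerate(choices):
        Z[row, np.array(picked, dtype=int)] = 1
    return Z


rng = np.random.default_rng(1)
Z = sample_ibp(num_customers=6, gamma0=4.0, rng=rng)
print(Z.shape)
```

Each row of `Z` plays the role of one subset of atoms, and every column is used by at least one customer by construction.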
The IBP, which specifies a distribution over sparse binary matrices with infinitely many columns, was originally demonstrated for latent factor analysis (Ghahramani & Griffiths, 2005). Several extensions to the IBP (and the equivalent BBP) have been developed; see Griffiths & Ghahramani (2011) for a review. Our work is related to a recent application of these ideas to distributed topic modeling (Yurochkin et al., 2018), where the authors use the BBP for modeling topics learned from multiple collections of documents, and provide an inference scheme based on the Hungarian algorithm (Kuhn, 1955).
2.3 Federated and Distributed Learning
Federated learning has garnered interest from the machine learning community of late. Smith et al. (2017) pose federated learning as a multi-task learning problem, which exploits the convexity and decomposability of the cost function of the underlying support vector machine (SVM) model for distributed learning. This approach, however, does not extend to the neural network structure considered in our work. McMahan et al. (2017) use strategies based on simple averaging of the local learner weights to learn the federated model. However, as pointed out by the authors, such naive averaging of model parameters can be disastrous for non-convex cost functions. To cope, they have to use a scheme where the local learners are forced to share the same random initialization. In contrast, our proposed framework is naturally immune to such issues since its development assumes nothing specific about how the local models were trained. Moreover, unlike existing work in this area, our framework is non-parametric in nature, allowing the federated model to flexibly grow or shrink its complexity (i.e., its size) to account for varying data complexity.
There is also significant work on distributed deep learning (Lian et al., 2015, 2017; Moritz et al., 2015; Li et al., 2014; Dean et al., 2012). However, the emphasis of these works is on scalable training from large data and they typically require frequent communication between the distributed nodes to be effective. Yet others explore distributed optimization with a specific emphasis on communication efficiency (Zhang et al., 2013; Shamir et al., 2014; Yang, 2013; Ma et al., 2015; Zhang & Lin, 2015). However, as pointed out by McMahan et al. (2017), these works primarily focus on settings with convex cost functions and often assume that each distributed data source contains an equal number of data instances. These assumptions, in general, do not hold in our scenario. Finally, neither these distributed learning approaches nor existing federated learning approaches decouple local training from global model aggregation. As a result, they are not suitable for combining pre-trained legacy models, a particular problem of interest in this paper.
3 Probabilistic Federated Neural Matching
We now describe how the Bayesian nonparametric machinery can be applied to the problem of federated learning with neural networks. Our goal will be to identify subsets of neurons in each of the $J$ local models that match neurons in other local models. We will then appropriately combine the matched neurons to form a global model.
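Our approach to federated learning builds upon the following basic problem. Suppose we have trained $J$ Multilayer Perceptrons (MLPs) with one hidden layer each. For the $j$th MLP, $j = 1, \ldots, J$, let $V_j^{(0)} \in \mathbb{R}^{D \times L_j}$ and $\tilde{v}_j^{(0)} \in \mathbb{R}^{L_j}$ be the weights and biases of the hidden layer; $V_j^{(1)} \in \mathbb{R}^{L_j \times K}$ and $\tilde{v}_j^{(1)} \in \mathbb{R}^{K}$ be the weights and biases of the softmax layer; $D$ be the data dimension, $L_j$ the number of neurons on the hidden layer; and $K$ the number of classes. We consider a simple architecture: $f_j(x) = \mathrm{softmax}(\sigma(x V_j^{(0)} + \tilde{v}_j^{(0)}) V_j^{(1)} + \tilde{v}_j^{(1)})$, where $\sigma(\cdot)$ is some nonlinearity (sigmoid, ReLU, etc.). Given the collection of weights and biases $\{V_j^{(0)}, \tilde{v}_j^{(0)}, V_j^{(1)}, \tilde{v}_j^{(1)}\}_{j=1}^{J}$, we want to learn a global neural network with weights and biases $\Theta^{(0)} \in \mathbb{R}^{D \times L}$, $\tilde{\theta}^{(0)} \in \mathbb{R}^{L}$, $\Theta^{(1)} \in \mathbb{R}^{L \times K}$, $\tilde{\theta}^{(1)} \in \mathbb{R}^{K}$, where $L \ll \sum_{j=1}^{J} L_j$ is an unknown number of hidden units of the global network to be inferred.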
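Our first observation is that the ordering of neurons of the hidden layer of an MLP is permutation invariant. Consider any permutation $\tau(1, \ldots, L_j)$ of the $j$-th MLP: reordering columns of $V_j^{(0)}$, biases $\tilde{v}_j^{(0)}$ and rows of $V_j^{(1)}$ according to $\tau(1, \ldots, L_j)$ will not affect the outputs $f_j(x)$ for any value of $x$. Therefore, instead of treating weights as matrices and biases as vectors, we view them as unordered collections of vectors $\{v_{jl}^{(0)} \in \mathbb{R}^{D}\}_{l=1}^{L_j}$, $\{v_{jl}^{(1)} \in \mathbb{R}^{K}\}_{l=1}^{L_j}$ and scalars $\{\tilde{v}_{jl}^{(0)} \in \mathbb{R}\}_{l=1}^{L_j}$, correspondingly.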
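The permutation invariance can be checked numerically. The sketch below builds a small one-hidden-layer MLP with a ReLU nonlinearity and a softmax output, permutes its hidden units, and compares the outputs. The variable names (`W0`, `b0`, `W1`, `b1`) and the layer sizes are illustrative choices, not the paper's.

```python
import numpy as np


def mlp_forward(x, W0, b0, W1, b1):
    """One hidden layer with ReLU followed by a softmax layer."""
    hidden = np.maximum(x @ W0 + b0, 0.0)
    logits = hidden @ W1 + b1
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)


rng = np.random.default_rng(0)
D, L, K = 4, 6, 3  # input dimension, hidden neurons, classes
W0, b0 = rng.standard_normal((D, L)), rng.standard_normal(L)
W1, b1 = rng.standard_normal((L, K)), rng.standard_normal(K)
x = rng.standard_normal((5, D))

# Reorder hidden units: columns of W0, entries of b0, and rows of W1.
perm = rng.permutation(L)
out_original = mlp_forward(x, W0, b0, W1, b1)
out_permuted = mlp_forward(x, W0[:, perm], b0[perm], W1[perm, :], b1)
print(np.allclose(out_original, out_permuted))
```

The output bias `b1` is left untouched because the permutation only acts on the hidden layer.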
Hidden layers in neural networks are commonly viewed as feature extractors. This perspective can be justified by the fact that the last layer of a neural network classifier simply performs a softmax regression. Since neural networks often outperform basic softmax regression, they must be learning high quality feature representations of the raw input data. Mathematically, in our setup, every hidden neuron of the $j$-th MLP represents a new feature $\tilde{x}_l(\cdot) = \sigma(\langle \cdot, v_{jl}^{(0)} \rangle + \tilde{v}_{jl}^{(0)})$. Our second observation is that each pair $(v_{jl}^{(0)}, \tilde{v}_{jl}^{(0)})$ parameterizes the corresponding neuron’s feature extractor. Since the MLPs are trained on the same general type of data (not necessarily homogeneous), we assume that they share at least some feature extractors that serve the same purpose. However, due to the permutation invariance issue discussed previously, a feature extractor indexed by $l$ from the $j$-th MLP is unlikely to correspond to a feature extractor with the same index from a different MLP. In order to construct a set of global feature extractors (neurons), we must model the process of grouping and combining feature extractors of a collection of MLPs.
3.1 Single Layer Neural Matching
We now present the key building block of our framework, a Beta-Bernoulli Process (Thibaux & Jordan, 2007) based model of MLP weight parameters. Our model assumes the following generative process. First, draw a collection of global atoms (hidden layer neurons) from a Beta process prior with a base measure $H$ and mass parameter $\gamma_0$, $Q = \sum_i q_i \delta_{\theta_i}$. In our experiments we choose $H = \mathcal{N}(\mu_0, \Sigma_0)$ as the base measure with $\mu_0 \in \mathbb{R}^{D+1+K}$ and diagonal $\Sigma_0$. Each $\theta_i \in \mathbb{R}^{D+1+K}$ is a concatenated vector of $[\theta_i^{(0)} \in \mathbb{R}^{D}, \tilde{\theta}_i^{(0)} \in \mathbb{R}, \theta_i^{(1)} \in \mathbb{R}^{K}]$ formed from the feature extractor weight-bias pairs with the corresponding weights of the softmax regression. In what follows, we will use “batch” to refer to a partition of the data.
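Next, for each $j = 1, \ldots, J$ select a subset of the global atoms for batch $j$ via the Bernoulli process:
$$\mathcal{T}_j := \sum_i b_{ji} \delta_{\theta_i}, \quad \text{where } b_{ji} \mid q_i \sim \mathrm{Bernoulli}(q_i) \ \forall i. \tag{1}$$
$\mathcal{T}_j$ is supported by atoms $\{\theta_i : b_{ji} = 1, \ i = 1, 2, \ldots\}$, which represent the identities of the atoms (neurons) used by batch $j$. Finally, assume that observed local atoms are noisy measurements of the corresponding global atoms:
$$v_{jl} \mid \mathcal{T}_j \sim \mathcal{N}(\mathcal{T}_{jl}, \Sigma_j) \ \text{ for } l = 1, \ldots, L_j; \quad L_j := \mathrm{card}(\mathcal{T}_j), \tag{2}$$
with $v_{jl} = [v_{jl}^{(0)}, \tilde{v}_{jl}^{(0)}, v_{jl}^{(1)}]$ being the weights, biases, and softmax regression weights corresponding to the $l$-th neuron of the $j$-th MLP trained with $L_j$ neurons on the data of batch $j$.
Under this model, the key quantity to be inferred is the collection of random variables that match observed atoms (neurons) at any batch to the global atoms. We denote the collection of these random variables as $\{B^j\}_{j=1}^{J}$, where $B^j_{i,l} = 1$ implies that $\mathcal{T}_{jl} = \theta_i$ (there is a one-to-one correspondence between $\{b_{ji}\}_{i=1}^{\infty}$ and $B^j$).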
Maximum a posteriori estimation.
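We now derive an algorithm for MAP estimation of global atoms for the model presented above. The objective function to be maximized is the posterior of $\{\theta_i\}_{i=1}^{\infty}$ and $\{B^j\}_{j=1}^{J}$:
$$\arg\max_{\{\theta_i\}, \{B^j\}} P(\{\theta_i\}, \{B^j\} \mid \{v_{jl}\}) \propto P(\{v_{jl}\} \mid \{\theta_i\}, \{B^j\}) \, P(\{B^j\}) \, P(\{\theta_i\}). \tag{3}$$
The next proposition follows from Gaussian-Gaussian conjugacy.
Proposition 1. Given $\{B^j\}$, the MAP estimate of $\{\theta_i\}$ is given by
$$\hat{\theta}_i = \frac{\mu_0 / \sigma_0^2 + \sum_{j,l} B^j_{i,l} v_{jl} / \sigma_j^2}{1 / \sigma_0^2 + \sum_{j,l} B^j_{i,l} / \sigma_j^2} \quad \text{for } i = 1, \ldots, L, \tag{4}$$
where for simplicity we assume $\Sigma_0 = I \sigma_0^2$ and $\Sigma_j = I \sigma_j^2$.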
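As a small numerical illustration of the conjugacy argument, the sketch below computes the posterior mode of one global neuron from the local neurons matched to it. It assumes isotropic covariances and, for brevity, a single noise variance shared by all batches (the model allows one per batch); the function name `map_global_atom` and its arguments are illustrative.

```python
import numpy as np


def map_global_atom(matched, mu0, sigma0_sq, sigma_sq):
    """Posterior mode of a global neuron given its matched local neurons.

    matched:   (n, dim) array of local neurons assigned to this global neuron
    mu0:       (dim,) prior mean
    sigma0_sq: prior variance (isotropic)
    sigma_sq:  noise variance, shared across batches here (a simplification)
    """
    mu0 = np.asarray(mu0, dtype=float)
    matched = np.asarray(matched, dtype=float).reshape(-1, mu0.shape[0])
    numerator = mu0 / sigma0_sq + matched.sum(axis=0) / sigma_sq
    denominator = 1.0 / sigma0_sq + matched.shape[0] / sigma_sq
    return numerator / denominator


mu0 = np.zeros(2)
estimate = map_global_atom([[2.0, 4.0], [4.0, 8.0]], mu0, sigma0_sq=1.0, sigma_sq=1.0)
print(estimate)
```

With no matched neurons the estimate falls back to the prior mean, and as more neurons are matched it moves toward their average.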
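Using this fact, we can cast the optimization corresponding to (3) with respect to only $\{B^j\}_{j=1}^{J}$. Taking the natural logarithm we obtain:
$$\arg\max_{\{B^j\}} \ \frac{1}{2} \sum_i \frac{\left\| \frac{\mu_0}{\sigma_0^2} + \sum_{j,l} B^j_{i,l} \frac{v_{jl}}{\sigma_j^2} \right\|^2}{\frac{1}{\sigma_0^2} + \sum_{j,l} B^j_{i,l} / \sigma_j^2} + \log P(\{B^j\}). \tag{5}$$
We consider an iterative optimization approach: fixing all but one $B^j$, we find the corresponding optimal assignment, then pick a new $j$ at random and proceed until convergence. In the following we use the notation $-j$ to denote “all but $j$”. Let $L_{-j} = \max\{i : B^{-j}_{i,l} = 1\}$ denote the number of active global weights outside of group $j$. We now rearrange the first term of (5) by partitioning it into $i = 1, \ldots, L_{-j}$ and $i = L_{-j} + 1, \ldots, L_{-j} + L_j$. We are interested in solving for $B^j$, hence we can modify the objective function by subtracting terms independent of $B^j$ and noting that $\sum_l B^j_{i,l} \in \{0, 1\}$, i.e. it is 1 if some neuron from batch $j$ is matched to global neuron $i$ and 0 otherwise:
$$\frac{1}{2} \sum_{i=1}^{L_{-j} + L_j} \sum_{l=1}^{L_j} B^j_{i,l} \left( \frac{\left\| \frac{\mu_0}{\sigma_0^2} + \frac{v_{jl}}{\sigma_j^2} + \sum_{-j,l} B^j_{i,l} \frac{v_{jl}}{\sigma_j^2} \right\|^2}{\frac{1}{\sigma_0^2} + \frac{1}{\sigma_j^2} + \sum_{-j,l} B^j_{i,l} / \sigma_j^2} - \frac{\left\| \frac{\mu_0}{\sigma_0^2} + \sum_{-j,l} B^j_{i,l} \frac{v_{jl}}{\sigma_j^2} \right\|^2}{\frac{1}{\sigma_0^2} + \sum_{-j,l} B^j_{i,l} / \sigma_j^2} \right). \tag{6}$$
Now we consider the second term of (5):
$$\log P(\{B^j\}) = \log P(B^j \mid B^{-j}) + \log P(B^{-j}).$$
First, because we are optimizing for $B^j$, we can ignore $\log P(B^{-j})$. Second, due to exchangeability of batches (i.e. customers of the IBP), we can always consider $B^j$ to be the last batch (i.e. last customer of the IBP). Let $m_i^{-j} = \sum_{-j,l} B^j_{i,l}$ denote the number of times batch weights were assigned to global weight $i$ outside of group $j$. We then obtain, up to terms independent of $B^j$:
$$\log P(B^j \mid B^{-j}) = \sum_{i=1}^{L_{-j}} \left( \sum_{l=1}^{L_j} B^j_{i,l} \right) \log \frac{m_i^{-j}}{J - m_i^{-j}} + \sum_{i=L_{-j}+1}^{L_{-j}+L_j} \left( \sum_{l=1}^{L_j} B^j_{i,l} \right) \left( \log \frac{\gamma_0}{J} - \log(i - L_{-j}) \right). \tag{7}$$
Combining (6) and (7) yields the following proposition.
Proposition 2. The (negative) assignment cost specification for finding $B^j$ is
$$-C^j_{i,l} = \begin{cases} \dfrac{\left\| \frac{\mu_0}{\sigma_0^2} + \frac{v_{jl}}{\sigma_j^2} + \sum_{-j,l} B^j_{i,l} \frac{v_{jl}}{\sigma_j^2} \right\|^2}{\frac{1}{\sigma_0^2} + \frac{1}{\sigma_j^2} + \sum_{-j,l} B^j_{i,l} / \sigma_j^2} - \dfrac{\left\| \frac{\mu_0}{\sigma_0^2} + \sum_{-j,l} B^j_{i,l} \frac{v_{jl}}{\sigma_j^2} \right\|^2}{\frac{1}{\sigma_0^2} + \sum_{-j,l} B^j_{i,l} / \sigma_j^2} + 2 \log \dfrac{m_i^{-j}}{J - m_i^{-j}}, & i \le L_{-j}, \\[3ex] \dfrac{\left\| \frac{\mu_0}{\sigma_0^2} + \frac{v_{jl}}{\sigma_j^2} \right\|^2}{\frac{1}{\sigma_0^2} + \frac{1}{\sigma_j^2}} - \dfrac{\left\| \frac{\mu_0}{\sigma_0^2} \right\|^2}{\frac{1}{\sigma_0^2}} - 2 \log \dfrac{i - L_{-j}}{\gamma_0 / J}, & L_{-j} < i \le L_{-j} + L_j. \end{cases} \tag{8}$$
We then apply the Hungarian algorithm to find the minimizer of $\sum_i \sum_l B^j_{i,l} C^j_{i,l}$ and obtain the neuron matching assignments. The proof is described in Supplement A.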
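The matching step for a single batch can be sketched as follows. This is an illustrative reading of the procedure rather than the authors' implementation: it assumes isotropic Gaussians with one noise variance shared by all batches, scores each (local neuron, global neuron) pair by the gain in the Gaussian term plus twice the log prior odds, scores opening the k-th new global neuron with a penalty of 2 log(k J / mass), and solves the resulting assignment problem with `scipy.optimize.linear_sum_assignment`. The function name `match_batch` and all argument names are hypothetical.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def match_batch(local, global_sum, global_count, num_batches,
                mu0=0.0, sigma0_sq=10.0, sigma_sq=1.0, gamma0=1.0):
    """Assign each local neuron to an existing or a new global neuron.

    local:        (L_j, dim) neurons of the batch being matched
    global_sum:   (L_minus, dim) per global neuron, sum of matched neurons
                  from the other batches
    global_count: (L_minus,) number of other batches using each global neuron
    Returns, per local neuron, a column index; indices >= L_minus are new neurons.
    """
    num_local, dim = local.shape
    prior_nat = np.full(dim, mu0) / sigma0_sq
    prior_prec = 1.0 / sigma0_sq

    # Gain from adding local neuron l to existing global neuron i.
    nat = prior_nat + global_sum / sigma_sq
    prec = prior_prec + global_count / sigma_sq
    joined = nat[None, :, :] + local[:, None, :] / sigma_sq
    gain_existing = ((joined ** 2).sum(axis=-1) / (prec + 1.0 / sigma_sq)
                     - (nat ** 2).sum(axis=-1) / prec
                     + 2.0 * np.log(global_count / (num_batches - global_count)))

    # Gain from opening the k-th new global neuron for local neuron l.
    alone = prior_nat + local / sigma_sq
    gain_alone = ((alone ** 2).sum(axis=-1) / (prior_prec + 1.0 / sigma_sq)
                  - (prior_nat ** 2).sum() / prior_prec)
    order = np.arange(1, num_local + 1)
    gain_new = gain_alone[:, None] - 2.0 * np.log(order * num_batches / gamma0)[None, :]

    cost = -np.hstack([gain_existing, gain_new])
    _, assignment = linear_sum_assignment(cost)
    return assignment


# Toy check: batch 2 holds a permuted, slightly noisy copy of batch 1's neurons.
rng = np.random.default_rng(0)
global_neurons = 5.0 * np.eye(3, 4)
perm = np.array([2, 0, 1])
local_neurons = global_neurons[perm] + 0.01 * rng.standard_normal((3, 4))
assignment = match_batch(local_neurons, global_neurons, np.ones(3), num_batches=2)
print(assignment)
```

In the toy check the recovered assignment undoes the permutation, while a local neuron that is far from every existing global neuron is routed to a new column instead.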
We summarize the overall single layer inference procedure in Figure 1 below.
3.2 Multilayer Neural Matching
The model we have presented thus far can handle any arbitrary width single layer neural network, which is known to be theoretically sufficient for approximating any function of interest (Hornik et al., 1989). However, deep neural networks with moderate layer widths are known to be beneficial both practically (LeCun et al., 2015) and theoretically (Poggio et al., 2017). We extend our neural matching approach to these deep architectures by defining a generative model of deep neural network weights from outputs back to inputs (top-down). Let $C$ denote the number of hidden layers and $L^c$ the number of neurons on the $c$-th layer. Then $L^{C+1} = K$ is the number of labels and $L^0 = D$ is the input dimension. In the top-down approach, we consider the global atoms to be vectors of outgoing weights from a neuron instead of weights forming a neuron as it was in the single hidden layer model. This change is needed to avoid base measures with unbounded dimensions.
Starting with the top hidden layer $c = C$, we generate each layer following a model similar to that used in the single layer case. For each layer we generate a collection of global atoms and select a subset of them for each batch using the Beta-Bernoulli process construction. $L^{c+1}$ is the number of neurons on the layer $c + 1$, which controls the dimension of the atoms in layer $c$.
Definition 1 (Multilayer generative process).
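Starting with layer $c = C$, generate (as in the single layer process)
$$Q^c \mid \gamma_0^c, H^c, L^{c+1} \sim \mathrm{BP}(1, \gamma_0^c H^c), \ \text{ then } Q^c = \sum_i q_i^c \delta_{\theta_i^c}, \quad \theta_i^c \sim \mathcal{N}(\mu_0^c, \Sigma_0^c), \ \mu_0^c \in \mathbb{R}^{L^{c+1}},$$
$$\mathcal{T}_j^c := \sum_i b_{ji}^c \delta_{\theta_i^c}, \quad \text{where } b_{ji}^c \mid q_i^c \sim \mathrm{Bernoulli}(q_i^c).$$
This is the set of global atoms (neurons) used by batch $j$ in layer $c$; it contains atoms $\{\theta_i^c : b_{ji}^c = 1, \ i = 1, 2, \ldots\}$. Finally, generate the observed local atoms:
$$v_{jl}^c \mid \mathcal{T}_j^c \sim \mathcal{N}(\mathcal{T}_{jl}^c, \Sigma_j^c) \ \text{ for } l = 1, \ldots, L_j^c,$$
where we have set $L_j^c := \mathrm{card}(\mathcal{T}_j^c)$. Next, compute the generated number of global neurons $L^c = \mathrm{card}\{\cup_{j=1}^{J} \mathcal{T}_j^c\}$ and repeat this generative process for the next layer $c - 1$. Repeat until all layers are generated ($c = C, \ldots, 1$).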
An important difference from the single layer model is that we should now set to 0 some of the dimensions of $v_{jl}^c \in \mathbb{R}^{L^{c+1}}$, since they correspond to weights outgoing to neurons of the layer $c + 1$ not present on the batch $j$, i.e. $v_{jli}^c := 0$ if $b_{ji}^{c+1} = 0$ for $i = 1, \ldots, L^{c+1}$. The resulting model can be understood as follows. There is a global fully connected neural network with $L^c$ neurons on layer $c$ and there are $J$ partially connected neural networks with $L_j^c$ active neurons on layer $c$, while weights corresponding to the remaining $L^c - L_j^c$ neurons are zeroes and have no effect locally.
Our model can conceptually handle permuted ordering of the input dimensions across batches; however, in most practical cases the ordering of input dimensions is consistent across batches. Thus, we assume that the weights connecting the first hidden layer to the inputs exhibit permutation invariance only on the side of the first hidden layer. Similarly to how all weights were concatenated in the single hidden layer model, we consider $\mu_0^c \in \mathbb{R}^{D + L^{c+1}}$ for $c = 1$. We also note that the bias term can be added to the model; we omitted it to simplify notation.
Following the top-down generative model, we adopt a greedy inference procedure that first infers the matching of the top layer and then proceeds down the layers of the network. This is possible because the generative process for each layer depends only on the identity and number of the global neurons in the layer above it, hence once we infer the $(c+1)$-th layer of the global model we can apply the single layer inference algorithm (Algorithm 1) to the $c$-th layer. This greedy setup is illustrated in Supplement Figure 4. The per-layer inference follows directly from the single layer case, yielding the following propositions.
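Proposition 3. The (negative) assignment cost specification for finding $B^{j,c}$ is
$$-C^{j,c}_{i,l} = \begin{cases} \dfrac{\left\| \frac{\mu_0^c}{(\sigma_0^c)^2} + \frac{v_{jl}^c}{(\sigma_j^c)^2} + \sum_{-j,l} B^{j,c}_{i,l} \frac{v_{jl}^c}{(\sigma_j^c)^2} \right\|^2}{\frac{1}{(\sigma_0^c)^2} + \frac{1}{(\sigma_j^c)^2} + \sum_{-j,l} B^{j,c}_{i,l} / (\sigma_j^c)^2} - \dfrac{\left\| \frac{\mu_0^c}{(\sigma_0^c)^2} + \sum_{-j,l} B^{j,c}_{i,l} \frac{v_{jl}^c}{(\sigma_j^c)^2} \right\|^2}{\frac{1}{(\sigma_0^c)^2} + \sum_{-j,l} B^{j,c}_{i,l} / (\sigma_j^c)^2} + 2 \log \dfrac{m_i^{c,-j}}{J - m_i^{c,-j}}, & i \le L^c_{-j}, \\[3ex] \dfrac{\left\| \frac{\mu_0^c}{(\sigma_0^c)^2} + \frac{v_{jl}^c}{(\sigma_j^c)^2} \right\|^2}{\frac{1}{(\sigma_0^c)^2} + \frac{1}{(\sigma_j^c)^2}} - \dfrac{\left\| \frac{\mu_0^c}{(\sigma_0^c)^2} \right\|^2}{\frac{1}{(\sigma_0^c)^2}} - 2 \log \dfrac{i - L^c_{-j}}{\gamma_0^c / J}, & L^c_{-j} < i \le L^c_{-j} + L^c_j, \end{cases}$$
where for simplicity we assume $\Sigma_0^c = I (\sigma_0^c)^2$ and $\Sigma_j^c = I (\sigma_j^c)^2$. Here $L^c_{-j}$ and $m_i^{c,-j}$ are the layer-$c$ counterparts of the single layer quantities: the number of active global neurons outside of batch $j$, and the number of batches other than $j$ assigned to global neuron $i$. We then apply the Hungarian algorithm to find the minimizer of $\sum_i \sum_l B^{j,c}_{i,l} C^{j,c}_{i,l}$ and obtain the neuron matching assignments.
Proposition 4. Given the assignment $\{B^{j,c}\}$, the MAP estimate of $\{\theta_i^c\}$ is given by
$$\hat{\theta}_i^c = \frac{\mu_0^c / (\sigma_0^c)^2 + \sum_{j,l} B^{j,c}_{i,l} v_{jl}^c / (\sigma_j^c)^2}{1 / (\sigma_0^c)^2 + \sum_{j,l} B^{j,c}_{i,l} / (\sigma_j^c)^2} \quad \text{for } i = 1, \ldots, L^c.$$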
We combine these propositions and summarize the overall multilayer inference procedure in Supplement Algorithm 2.
3.3 Neural Matching with Additional Communications
In the traditional federated learning scenario, where local and global models are learned together, a common approach (see e.g., McMahan et al. (2017)) is to learn via rounds of communication between local and global models. Typically, local model parameters are trained for a few epochs, sent to the server for updating the global model, and then reinitialized with the global model parameters for the new round. One of the key factors in federated learning is the number of communications required to achieve an accurate global model. In the preceding sections we proposed Probabilistic Federated Neural Matching (PFNM) to aggregate local models in a single communication round. Our approach can be naturally extended to benefit from additional communication rounds as follows.
Let $t$ denote a communication round. To initialize local models at round $t + 1$ we set $v_{jl}^{t+1} = \sum_i B^{j,t}_{i,l} \theta_i^t$. Recall that $\sum_i B^{j,t}_{i,l} = 1$ for all $l = 1, \ldots, L_j$ and $j = 1, \ldots, J$, hence a local model is initialized with a subset of the global model, keeping local model size constant across communication rounds (this also holds for the multilayer case). After local models are updated, we proceed to apply matching to obtain a new global model. Note that global model size can change across communication rounds; in particular, we expect it to shrink as local models improve on each step.
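The re-initialization step amounts to reading each local neuron off the global neuron it was matched to. The sketch below assumes the matching for one batch is stored as an index vector (one global index per local neuron) and expands it into a binary matrix with exactly one nonzero entry per column; the function name `init_local_from_global` and the variable names are illustrative.

```python
import numpy as np


def init_local_from_global(global_atoms, assignment):
    """Initialize a batch's neurons for the next round from the global model.

    global_atoms: (L, dim) current global neurons
    assignment:   length-L_j sequence; assignment[l] is the index of the global
                  neuron matched to local neuron l
    """
    assignment = np.asarray(assignment, dtype=int)
    num_global = global_atoms.shape[0]
    # Binary matching matrix with a single 1 in every column.
    B = np.zeros((num_global, assignment.shape[0]))
    B[assignment, np.arange(assignment.shape[0])] = 1.0
    # Local neuron l is the sum over i of B[i, l] * global_atoms[i].
    return B.T @ global_atoms


global_atoms = np.arange(12, dtype=float).reshape(4, 3)
local_init = init_local_from_global(global_atoms, [2, 0])
print(local_init)
```

The local model keeps its own size (two neurons here) even though the global model has four.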
4 Experiments
To verify our methodology we simulate federated learning scenarios using two standard datasets: MNIST and CIFAR-10. We randomly partition each of these datasets into $J$ batches. Two partition strategies are of interest: (a) a homogeneous partition where each batch has approximately equal proportion of each of the $K$ classes; and (b) a heterogeneous partition for which batch sizes and class proportions are unbalanced. We simulate a heterogeneous partition by simulating $p_k \sim \mathrm{Dir}_J(\alpha)$ and allocating a $p_{k,j}$ proportion of the instances of class $k$ to batch $j$. Note that due to the small concentration parameter $\alpha$ of the Dirichlet distribution, some sampled batches may not have any examples of certain classes of data. For each of the four combinations of partition strategy and dataset we run repeated trials to obtain mean performances with standard deviations.
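A heterogeneous partition of this kind can be sketched as follows. For every class, a proportion vector over batches is drawn from a symmetric Dirichlet distribution and the shuffled instances of that class are split accordingly. The function name `dirichlet_partition`, the toy labels, and the concentration value 0.5 used in the example are assumptions made for illustration.

```python
import numpy as np


def dirichlet_partition(labels, num_batches, concentration, rng):
    """Split example indices into batches with Dirichlet class proportions."""
    labels = np.asarray(labels)
    batches = [[] for _ in range(num_batches)]
    for k in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == k))
        # Proportion of class k sent to each batch.
        proportions = rng.dirichlet(np.full(num_batches, concentration))
        # Convert cumulative proportions into split points over idx.
        cuts = (np.cumsum(proportions)[:-1] * len(idx)).astype(int)
        for j, part in enumerate(np.split(idx, cuts)):
            batches[j].extend(part.tolist())
    return batches


rng = np.random.default_rng(0)
labels = rng.integers(0, 10, size=1000)  # toy labels standing in for a dataset
batches = dirichlet_partition(labels, num_batches=5, concentration=0.5, rng=rng)
print([len(b) for b in batches])
```

With a small concentration value the proportion vectors are typically very uneven, so batch sizes differ and some batches may receive no examples of a given class.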
In our empirical studies below, we will show that our framework can aggregate multiple local neural networks trained independently on different batches of data into an efficient, modest-size global neural network with as few as a single communication round. We also demonstrate enhanced performance when additional communication is allowed.
Learning with single communication
First we consider a scenario where a global neural network needs to be constructed with a single communication round. This imitates the real-world scenario where data is no longer available and we only have access to pre-trained local models (i.e. “legacy” models). To be useful, this global neural network needs to outperform the individual local models. Ensemble methods (Dietterich, 2000; Breiman, 2001) are a classic approach for combining predictions of multiple learners. They often perform well in practice even when the ensemble members are of poor quality. Unfortunately, in the case of neural networks, ensembles have large storage and inference costs, stemming from having to store and forward propagate through all local networks.
The performance of local NNs and the ensemble method define the lower and upper extremes of aggregating when limited to a single communication. We also compare to other strong baselines, including federated averaging of local neural networks trained with the same random initialization as proposed by McMahan et al. (2017). We note that a federated averaging variant without the shared initialization would likely be more realistic when trying to aggregate pre-trained models, but this variant performs significantly worse than all other baselines. We also consider k-Means clustering (Lloyd, 1982) of vectors constructed by concatenating weights and biases of local neural networks. The key difference between k-Means and our approach is that clustering, unlike matching, allows several neurons from a single neural network to be assigned to the same global neuron, potentially averaging out their individual feature representations. Further, k-Means requires us to choose k in advance. In contrast, PFNM nonparametrically learns the global model size, and its other hyperparameters, i.e., $\sigma$, $\sigma_0$, $\gamma_0$, are chosen based on the training data. We discuss parameter sensitivity in Supplement D.3.
Figure 2 presents our results with single hidden layer neural networks for varying number of batches $J$. Note that a higher number of batches implies fewer data instances per batch, leading to poorer local model performances. The upper plots summarize test data accuracy, while the lower plots show the model size compression achieved by PFNM. Specifically, we plot $\log \frac{L}{\sum_j L_j}$, which is the log ratio of the PFNM global model size $L$ to the total number of neurons across all local models (i.e. the size of an ensemble model). In this and subsequent experiments each local neural network has $L_j = 100$ hidden neurons. We see that PFNM produces strong results, occasionally even outperforming ensembles. In the heterogeneous setting we observe a noticeable degradation in the performance of the local NNs and of k-means, while PFNM retains its good performance. It is worth noting that the gap between PFNM and ensemble increases on CIFAR10 with $J$, while it is constant (and even in favor of PFNM) on MNIST. This is not surprising. Ensemble methods are known to perform particularly well at aggregating “weak” learners (recall higher $J$ implies smaller batches) (Breiman, 2001), while PFNM assumes the neural networks being aggregated already perform reasonably well.
Next, we investigate aggregation of multi-layer neural networks, each using a hundred neurons per layer. The extension of k-means to this setting is unclear and k-means is excluded from further comparisons. In Figure 2, we show that PFNM again provides drastic and consistent improvements over local models and federated averaging. It performs marginally worse than ensembles, especially for deeper networks on CIFAR10. This aligns with our previous observation — when there is insufficient data for training good local models, PFNM’s performance marginally degrades with respect to ensembles, but still provides significant compression over ensembles.
We briefly discuss complexity of our algorithms and experiment run-times in Supplement C.
Learning with limited communication
While in some scenarios limiting communication to a single communication round may be a hard constraint, we also consider situations that frequently arise in practice, where a limited amount of communication is permissible. To this end, we investigate federated learning with a fixed number of batches $J$ and up to twenty communications when the data has a homogeneous partition and up to fifty communications under a heterogeneous partition. We compare PFNM, using the communication procedure from Section 3.3 (with hyperparameters held fixed across experiments), to federated averaging and the distributed optimization approach, downpour SGD (D-SGD) of Dean et al. (2012). In this limited communication setting, the ensembles can be outperformed by many distributed learning algorithms provided a large enough communication budget. An interesting metric then is the number of communication rounds required to outperform ensembles.
We report results with both one and two layer neural networks in Figure 3. In either case, we use a hundred neurons per layer. PFNM outperforms ensembles in all scenarios given sufficient communications. Moreover, in all experiments, PFNM requires significantly fewer communication rounds than both federated averaging and D-SGD to achieve a given performance level. In addition to improved performance, additional rounds of communication allow PFNM to shrink the size of the global model as demonstrated in the figure. In Figures 2(a) through 2(f) we note steady improvement in accuracy and reduction in the global model size. In CIFAR10 experiments, the two layer PFNM network’s performance temporarily drops, which corresponds to a sharp reduction in the size of the global network. See Figures 2(g) and 2(h).
5 Discussion
In this work, we have developed methods for federated learning of neural networks, and empirically demonstrated their favorable properties. Our methods are particularly effective at learning compressed federated networks from pre-trained local networks and, with a modest communication budget, can outperform state-of-the-art algorithms for federated learning of neural networks. In future work, we plan to explore more sophisticated ways of combining local networks, especially in the regime where each local network has very few training instances. Our current matching approach is completely unsupervised; incorporating some form of supervision may help further improve the performance of the global network, especially when the local networks are of poor quality. Finally, it is of interest to extend our modeling framework to other architectures such as Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs). The permutation invariance necessitating matching inference also arises in CNNs since any permutation of the filters results in the same output; however, additional bookkeeping is needed due to the pooling operations.
References
- Breiman (2001) Breiman, L. Random forests. Machine learning, 45(1):5–32, 2001.
- Date & Nagi (2016) Date, K. and Nagi, R. Gpu-accelerated hungarian algorithms for the linear assignment problem. Parallel Computing, 57:52–72, 2016.
- Dean et al. (2012) Dean, J., Corrado, G., Monga, R., Chen, K., Devin, M., Mao, M., Senior, A., Tucker, P., Yang, K., Le, Q. V., et al. Large scale distributed deep networks. In Advances in neural information processing systems, pp. 1223–1231, 2012.
- Dietterich (2000) Dietterich, T. G. Ensemble methods in machine learning. In International workshop on multiple classifier systems, pp. 1–15. Springer, 2000.
- EU (2016) EU. Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data, and repealing Directive 95/46/EC (General Data Protection Regulation). Official Journal of the European Union, L119:1–88, may 2016. URL http://eur-lex.europa.eu/legal-content/EN/TXT/?uri=OJ:L:2016:119:TOC.
- Ghahramani & Griffiths (2005) Ghahramani, Z. and Griffiths, T. L. Infinite latent feature models and the Indian buffet process. In Advances in Neural Information Processing Systems, pp. 475–482, 2005.
- Griffiths & Ghahramani (2011) Griffiths, T. L. and Ghahramani, Z. The Indian buffet process: An introduction and review. Journal of Machine Learning Research, 12:1185–1224, 2011.
- Hinton et al. (2015) Hinton, G., Vinyals, O., and Dean, J. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
- Hornik et al. (1989) Hornik, K., Stinchcombe, M., and White, H. Multilayer feedforward networks are universal approximators. Neural networks, 2(5):359–366, 1989.
- Kuhn (1955) Kuhn, H. W. The Hungarian method for the assignment problem. Naval Research Logistics (NRL), 2(1-2):83–97, 1955.
- LeCun et al. (2015) LeCun, Y., Bengio, Y., and Hinton, G. Deep learning. nature, 521(7553):436, 2015.
- Li et al. (2014) Li, M., Andersen, D. G., Park, J. W., Smola, A. J., Ahmed, A., Josifovski, V., Long, J., Shekita, E. J., and Su, B.-Y. Scaling distributed machine learning with the parameter server. In OSDI, volume 14, pp. 583–598, 2014.
- Lian et al. (2015) Lian, X., Huang, Y., Li, Y., and Liu, J. Asynchronous parallel stochastic gradient for nonconvex optimization. In Advances in Neural Information Processing Systems, pp. 2737–2745, 2015.
- Lian et al. (2017) Lian, X., Zhang, C., Zhang, H., Hsieh, C.-J., Zhang, W., and Liu, J. Can decentralized algorithms outperform centralized algorithms? A case study for decentralized parallel stochastic gradient descent. In Advances in Neural Information Processing Systems 30, pp. 5330–5340, 2017.
- Lloyd (1982) Lloyd, S. Least squares quantization in PCM. Information Theory, IEEE Transactions on, 28(2):129–137, Mar 1982.
- Ma et al. (2015) Ma, C., Smith, V., Jaggi, M., Jordan, M., Richtarik, P., and Takac, M. Adding vs. averaging in distributed primal-dual optimization. In Proceedings of the 32nd International Conference on Machine Learning, pp. 1973–1982, 2015.
- McMahan et al. (2017) McMahan, B., Moore, E., Ramage, D., Hampson, S., and Agüera y Arcas, B. Communication-efficient learning of deep networks from decentralized data. In Artificial Intelligence and Statistics, pp. 1273–1282, 2017.
- Moritz et al. (2015) Moritz, P., Nishihara, R., Stoica, I., and Jordan, M. I. Sparknet: Training deep networks in spark. arXiv preprint arXiv:1511.06051, 2015.
- Paszke et al. (2017) Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., and Lerer, A. Automatic differentiation in pytorch. In NIPS-W, 2017.
- Poggio et al. (2017) Poggio, T., Mhaskar, H., Rosasco, L., Miranda, B., and Liao, Q. Why and when can deep-but not shallow-networks avoid the curse of dimensionality: a review. International Journal of Automation and Computing, 14(5):503–519, 2017.
- Reddi et al. (2018) Reddi, S. J., Kale, S., and Kumar, S. On the convergence of adam and beyond. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=ryQu7f-RZ.
- Shamir et al. (2014) Shamir, O., Srebro, N., and Zhang, T. Communication-efficient distributed optimization using an approximate newton-type method. In International conference on machine learning, pp. 1000–1008, 2014.
- Smith et al. (2017) Smith, V., Chiang, C.-K., Sanjabi, M., and Talwalkar, A. S. Federated multi-task learning. In Advances in Neural Information Processing Systems, pp. 4424–4434, 2017.
- Teh et al. (2007) Teh, Y. W., Görür, D., and Ghahramani, Z. Stick-breaking construction for the Indian buffet process. In Artificial Intelligence and Statistics, pp. 556–563, 2007.
- Thibaux & Jordan (2007) Thibaux, R. and Jordan, M. I. Hierarchical Beta processes and the Indian buffet process. In Artificial Intelligence and Statistics, pp. 564–571, 2007.
- Yang (2013) Yang, T. Trading computation for communication: Distributed stochastic dual coordinate ascent. In Advances in Neural Information Processing Systems, pp. 629–637, 2013.
- Yurochkin et al. (2018) Yurochkin, M., Fan, Z., Guha, A., Koutris, P., and Nguyen, X. Scalable inference of topic evolution via models for latent geometric structures. arXiv preprint arXiv:1809.08738, 2018.
- Zhang & Lin (2015) Zhang, Y. and Lin, X. Disco: Distributed optimization for self-concordant empirical loss. In International conference on machine learning, pp. 362–370, 2015.
- Zhang et al. (2013) Zhang, Y., Duchi, J., Jordan, M. I., and Wainwright, M. J. Information-theoretic lower bounds for distributed statistical estimation with communication constraints. In Advances in Neural Information Processing Systems, pp. 2328–2336, 2013.
Appendix A Single Hidden Layer Inference
The goal of maximum a posteriori (MAP) estimation is to maximize the posterior probability of the latent variables: global atoms and assignments of observed neural network weight estimates to global atoms , given estimates of the batch weights :
MAP estimates given matching.
First we note that given it is straightforward to find MAP estimates of based on Gaussian-Gaussian conjugacy:
where is the number of active global atoms, which is an (unknown) latent random variable identified by . For simplicity we assume , and .
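As an illustration of the Gaussian-Gaussian conjugacy step, the following is a minimal sketch of the posterior mode of a single global atom once the matching is fixed. It assumes an isotropic Gaussian prior and isotropic, batch-specific Gaussian noise. The function name `map_global_atom` and the arguments `mu0`, `sigma0`, and `assigned_sigmas` are hypothetical names introduced here and are not taken from our code.

```python
import numpy as np


def map_global_atom(assigned_weights, assigned_sigmas, mu0, sigma0):
    """Posterior mode of one global atom (illustrative sketch).

    Assumed model: atom ~ N(mu0, sigma0^2 I) and each matched local weight
    vector ~ N(atom, sigma^2 I), with sigma given per vector.

    assigned_weights: shape (n, D), local weight vectors matched to this atom
    assigned_sigmas:  shape (n,), noise std of the batch each vector came from
    mu0:              shape (D,), prior mean
    sigma0:           scalar prior std
    """
    assigned_weights = np.asarray(assigned_weights, dtype=float)
    precisions = 1.0 / np.asarray(assigned_sigmas, dtype=float) ** 2
    prior_precision = 1.0 / sigma0 ** 2
    # Precision-weighted sum of the prior mean and the matched observations.
    numerator = prior_precision * np.asarray(mu0, dtype=float) + precisions @ assigned_weights
    denominator = prior_precision + precisions.sum()
    return numerator / denominator


theta = map_global_atom([[1.0, 2.0], [3.0, 4.0]], [1.0, 1.0], np.zeros(2), 1.0)
```

The estimate is a precision-weighted average: a tight prior pulls the atom toward the prior mean, while a very diffuse prior recovers the plain average of the matched weights.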
Inference of atom assignments (Proposition 2).
We can now cast the optimization corresponding to (12) with respect to only . Taking the natural logarithm, we obtain:
We now simplify the first term of (14) (in this and subsequent derivations we use to say that two objective functions are equivalent up to terms independent of the variables of interest):
We consider an iterative optimization approach: fixing all but one we find the corresponding optimal assignment, then pick a new at random and repeat until convergence. We define notation to denote “all but ”, and let denote number of active global weights outside of group . We partition (15) between and , and since we are solving for , we subtract terms independent of :
Now observe that , i.e. it is 1 if some neuron from batch is matched to global neuron and 0 otherwise. Due to this we can rewrite (16) as a linear sum assignment problem:
Now we consider the second term of (14):
First, because we are optimizing for , we can ignore . Second, due to exchangeability of batches (i.e. customers of the IBP), we can always consider to be the last batch (i.e. last customer of the IBP). Let denote number of times batch weights were assigned to global atom outside of group . We now obtain the following:
We now rearrange (18) as a linear sum assignment problem:
This completes the proof of Proposition 1.
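To make the single-batch matching step concrete, the following is a minimal sketch of solving one linear sum assignment problem with `scipy.optimize.linear_sum_assignment`. The cost used here is a deliberately simplified stand-in and is not the cost derived above: matching a local neuron to an existing global atom costs their squared Euclidean distance, and opening a new atom costs a constant `new_atom_cost`. The function name and all parameter names are hypothetical.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def match_batch_to_global(local_weights, global_atoms, new_atom_cost):
    """Assign each local neuron to an existing global atom or to a new one.

    Illustrative cost only (an assumption, not the derived objective):
    squared distance for existing atoms, a constant for new atoms.
    Returns an integer array; entries >= number of global atoms mean "new atom".
    """
    local_weights = np.asarray(local_weights, dtype=float)            # (n_local, D)
    dim = local_weights.shape[1]
    global_atoms = np.asarray(global_atoms, dtype=float).reshape(-1, dim)
    n_local = local_weights.shape[0]

    # Pairwise squared distances between local neurons and global atoms.
    diffs = local_weights[:, None, :] - global_atoms[None, :, :]
    match_cost = (diffs ** 2).sum(axis=2)                             # (n_local, n_global)

    # One extra column per local neuron, so every neuron may open a new atom.
    new_cost = np.full((n_local, n_local), float(new_atom_cost))
    cost = np.hstack([match_cost, new_cost])

    rows, cols = linear_sum_assignment(cost)
    assignment = np.empty(n_local, dtype=int)
    assignment[rows] = cols
    return assignment


assignment = match_batch_to_global(
    [[0.0, 0.0], [5.0, 5.0]], [[0.1, 0.0], [10.0, 10.0]], new_atom_cost=4.0
)
```

In this toy example the first local neuron lies close to the first global atom and is matched to it, while the second is far from both atoms and is cheaper to place in a new one. In the iterative scheme described above, such a step would be repeated for a randomly chosen batch with all other assignments held fixed.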
Appendix B Multilayer Inference Details
Appendix C Complexity Analysis
In this section we briefly discuss the complexity of our algorithms. The worst case complexity per layer is achieved when no neurons are matched and is equal to for building the cost matrix and for running the Hungarian algorithm, where is the number of neurons per batch (here for simplicity we assume that each batch has the same number of neurons) and is the number of batches. The best case complexity per layer (i.e. when all neurons are matched) is . Note also that the complexity is independent of the data size. In practice the complexity is closer to the best case since the global model size is moderate (i.e. ). Actual timings with our code for the experiments in Figure 2 are as follows: 40 sec for Fig. 1(a), 1(b) at groups; 500 sec for Fig. 1(c), 1(d) at (the term is dominating as the CIFAR10 dimension is much higher than that of MNIST); 60 sec for Fig. 1(e), 1(f) () at layers; 150 sec for Fig. 1(g), 1(h) () at . The computations were done using 2 CPU cores and 4GB memory on a machine with 3.0 GHz core speed. We note that (i) this computation only needs to be performed once; (ii) the cost matrix construction, which appears to be dominating, can be trivially sped up using GPUs; and (iii) recent work demonstrates impressive large scale running times for the Hungarian algorithm using GPUs (Date & Nagi, 2016).
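To illustrate why cost matrix construction parallelizes well, the following sketch compares a naive double loop with a vectorized construction of a pairwise squared-distance matrix. The squared distance is an assumed stand-in for the actual cost, and both function names are hypothetical. The vectorized version reduces the work to a single matrix product, which is the kind of operation that array libraries typically run efficiently on GPUs.

```python
import numpy as np


def pairwise_sq_dists_loop(a, b):
    """Naive construction: one inner product per (row of a, row of b) pair."""
    out = np.empty((a.shape[0], b.shape[0]))
    for i in range(a.shape[0]):
        for k in range(b.shape[0]):
            diff = a[i] - b[k]
            out[i, k] = diff @ diff
    return out


def pairwise_sq_dists_vectorized(a, b):
    """Same matrix via ||x - y||^2 = ||x||^2 + ||y||^2 - 2 x.y."""
    a_sq = (a ** 2).sum(axis=1)[:, None]   # (n, 1)
    b_sq = (b ** 2).sum(axis=1)[None, :]   # (1, m)
    dists = a_sq + b_sq - 2.0 * (a @ b.T)
    # Clip tiny negative values caused by floating point cancellation.
    return np.maximum(dists, 0.0)


rng = np.random.default_rng(0)
local = rng.normal(size=(5, 7))    # 5 local neurons of dimension 7
atoms = rng.normal(size=(4, 7))    # 4 global atoms of dimension 7
cost = pairwise_sq_dists_vectorized(local, atoms)
```

Both functions do work proportional to the number of rows of `a` times the number of rows of `b` times the dimension, but the vectorized form moves that work into optimized linear algebra routines.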