reframes the learning process in a context that is described by a collection of constraints. Such constraints are the means used to inject knowledge into the learning process, and they represent different aspects of the task at hand. The goal of LfC is to learn a vector function (classifier, regressor, etc.) by solving a constrained optimization problem, where the learned function is required to maximize some regularity conditions in the space to which it belongs. Different types of knowledge can be exploited in LfC, including knowledge represented using First-Order Logic (FOL) formulas. For example, knowledge on the relationships among classes [5, 6], on the interactions among different tasks, and on labeled regions of the input space can easily be converted into constraints and embedded in the LfC learning problem (including point-wise constraints on supervised pairs). The strength of LfC is most evident when using semi-supervised data, thus enforcing constraints also on unsupervised samples. Depending on the type of knowledge, constraints can be convex or non-convex, and enforced in a soft or hard way.
To the best of our knowledge, LfC has always been conceived as a centralized framework, where constraints (i.e., knowledge), data, and the learned predictors are all handled within the same computational unit. This paper studies the extension of LfC to the distributed setting, where multiple computational nodes, connected over a network, contribute to the learning process. This setting is inspired by the way data and knowledge are organized nowadays: it is extremely common to participate in communities over the net, sharing some resources (e.g., public photos on social networks), keeping others local (e.g., private pictures taken with a personal smartphone, saved on the cloud), and needing to develop (and eventually share) customized or more robust services that might benefit both from private data and from public data taken from the net (e.g., a recognizer of pictures of a custom type). Our goal consists in formulating a distributed implementation of LfC with a generic structure that covers the described setting and that could be further extended to emphasize more specific aspects, such as the ones related to privacy-preserving methods [9, 10, 11, 12]. The generality of LfC prevents the direct application of many distributed optimization approaches, since we need to support hard, soft, convex, and nonconvex constraints. There has been significant recent progress in the field of distributed constrained optimization [14, 15, 16, 17], and the Asynchronous Method of Multipliers (ASYMM) [18, 19] offers the capability of dealing with convex and nonconvex constraints that are locally defined in the computational nodes. Moreover, ASYMM has been proven to be equivalent to a centralized instance of the Method of Multipliers, thus inheriting the properties of its centralized counterpart.
It is worth observing that there has recently been an increased interest in distributed learning scenarios (see, e.g., [21, 22, 23, 24]). Specific frameworks have been studied, like that of federated learning [25, 26], and several algorithms have been proposed [27, 28, 29]. However, distributed learning is usually understood as learning from distributed datasets, and central servers are required to perform at least a part of the learning process. Conversely, in this paper we consider a scenario in which not only data, but also knowledge is distributed in the network. Moreover, we exploit a fully distributed architecture, in which no central computational unit is required.
The main contribution of this work is to tailor the ASYMM algorithm to the aforementioned LfC distributed setting, showing how constraints can be used as a bridge between shared and private resources. As a proof-of-concept, the model is applied to two real-world problems: digit image classification and document classification. In both cases, we consider semi-supervised data, constraints on supervised examples, and constraints devised from FOL formulas. The results show that, in this distributed setting, FOL-based constraints improve the quality of the private classifiers, and local and shared constraints are asymptotically fulfilled.
2 Learning from constraints
In the framework of LfC, we consider the problem of finding the most suitable vector function, subject to a set of constraints that models the available knowledge on the considered problem. The function belongs to a space of functions mapping the input space (whose dimensionality is the one of the input data) to a multi-dimensional output space, on which a regularity measure is defined, and each of its components is referred to as a "task function" (for example, a classifier of a certain class). It is pretty common to enforce point-wise constraints, i.e., constraints applied to the function evaluated on a given collection of data points, and to consider both the bilateral and/or unilateral cases, which we denote by
respectively. Notice that the bilateral and unilateral forms compactly indicate vectors of constraints (for simplicity, all the constraints are applied to the same collection of data points, but our approach also holds when different constraints operate on different data; we will sometimes replace the vector with an explicit list of functions). We consider the data collection to be partitioned into a collection of points for which a label is known and a set of unlabeled points,
A popular category of constraints that is frequently exploited in LfC is given by polynomials derived from First-Order Logic formulas. In particular, each task function is assumed to implement the activation of a predicate that describes a property of the considered environment, and FOL formulas represent relationships among such properties, i.e., among the task functions. FOL formulas are then converted into numerical constraints using Triangular Norms (T-Norms), special binary functions that generalize the conjunction operator. For example, we might know that, in the considered environment, one property always implies another one. This information is converted into a bilateral constraint of Eq. (1) that, in the case of the product T-Norm, requires the product between the activation of the first predicate and the complement of the activation of the second one to vanish, and it is applied to all the available data points (see the referenced works for more examples). In this paper, we assume each task function
to be computed by a generic neural network. As regularity measure we use the squared norm of the weights, leading to the popular weight decay term (for simplicity, we avoid reporting this term in the following equations).
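As an illustrative sketch, the product T-Norm translation of a universally quantified implication can be written as follows (the helper names are hypothetical, and the two arrays stand for the outputs of two task functions on a batch of points):

```python
import numpy as np

# Product T-Norm generalization of the logic connectives (hypothetical
# helper names; the framework only assumes that some T-Norm is chosen).
def t_and(a, b):          # conjunction
    return a * b

def t_not(a):             # negation
    return 1.0 - a

def t_implies(a, b):      # a => b, rewritten as not(a and not b)
    return t_not(t_and(a, t_not(b)))

# A universally quantified formula "forall x: a(x) => b(x)" becomes the
# bilateral constraint t_implies(a, b) - 1 = 0 on every data point, i.e.
# a(x) * (1 - b(x)) = 0.
def implication_constraint(f_a, f_b):
    return f_a * (1.0 - f_b)

f_a = np.array([1.0, 0.0, 0.8])
f_b = np.array([1.0, 0.3, 1.0])
print(implication_constraint(f_a, f_b))   # [0. 0. 0.]: the formula holds
```

The constraint vanishes whenever the implication is satisfied in the [0, 1]-valued semantics of the T-Norm, so forcing it to zero on all data points enforces the formula.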
Depending on the nature of the constraints, it could be necessary to enforce some of them in a hard way, and others in a soft manner. While supervisions on examples are generally subject to noise (suggesting a penalty-based soft enforcement), there might be structural or environment-related conditions that must be enforced in a hard way. In the rest of the paper, we will distinguish between the constraints that must be enforced in a hard way and the sum of the penalty functions associated to the soft constraints (whose definitions can include weighting terms to give different importance to the different soft constraints). Moreover, the choices of both the form of the constraints of (1) and of the form of the predictors usually end up generating constraints that are nonconvex with respect to the model parameters subject to optimization. This consideration holds even more strongly when the predictor is a generic neural net.
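As a minimal sketch of the soft enforcement, a quadratic penalty with per-constraint weights (hypothetical helper name and values) can be written as:

```python
import numpy as np

# Hedged sketch: soft constraints enter the objective as penalties, while
# hard constraints are kept as explicit (in)equality constraints for the
# solver. Here, a quadratic penalty on bilateral soft constraints g(x) = 0,
# with optional weights to give them different importance.
def soft_penalty(g_values, weights=None):
    g = np.asarray(g_values, dtype=float)
    w = np.ones_like(g) if weights is None else np.asarray(weights, float)
    return float(np.sum(w * g ** 2))

# Example: two soft constraints, the second counted twice as much.
print(soft_penalty([0.1, -0.2], weights=[1.0, 2.0]))  # 0.01 + 2*0.04 = 0.09
```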
3 Distributed framework
When moving to the distributed setting, we consider computational nodes connected over a network, whose underlying connectivity structure can be represented through an undirected and connected graph, whose vertices are the nodes and whose edges are the node-to-node connections. Nodes might have some specific requirements in terms of the resources they want to keep private (local) and the ones they want to share with the other nodes. We distinguish among three types of resources: data (i.e., the available data points), knowledge (i.e., constraints), and predictors (the outcome of the learning process, i.e., the task functions). Figure 1 illustrates the distributed framework with privacy conditions, where we can distinguish the node-private resources and the shared ones, accessible by all the nodes. From the notation point of view, a subscript indicates a private resource of the corresponding node, while a superscript is used to refer to shared resources.
More formally, for each node we consider some private data, private knowledge, and some private predictors modeled with a vector function whose learnable parameters are those of a private neural network. Similarly, for the whole network we define shared data, shared knowledge, and some shared predictors implemented by a vector function with its own parameters. Then, the merged collection of all the data (Eq. (2)) and the set of all the model parameters (shared and private) are
The centralized optimization problem we have to solve is
where the constraints involve the just-introduced private and shared vector functions. Notice that the private constraints can involve both private and shared predictors, thus bridging shared and local resources. Due to their shared nature, the shared constraints can be enforced on all the available data.
The first constraint ensures the consistency of the local copies over the graph, and the last two constraints (involving shared resources) are now replicated, one copy per node, each acting on the corresponding private portion of the data. The objective function has been regrouped so as to be a summation over the node index, paying attention to differently weigh the terms when applied to shared or private data. This formulation of the problem can be more easily partitioned among the nodes and facilitates the application of a distributed optimization algorithm.
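The consistency constraints on the local copies can be sketched as follows (hypothetical variable names; one equality constraint per edge of the graph):

```python
import numpy as np

# Sketch: each node i holds a local copy w[i] of the shared parameters;
# consistency is enforced edge-wise on the communication graph.
edges = [(0, 1), (1, 2), (2, 3)]                   # a path graph on 4 nodes
w = {i: np.full(2, float(i)) for i in range(4)}    # deliberately inconsistent

def consensus_residuals(w, edges):
    # One equality constraint w_i - w_j = 0 per edge; stacking them gives
    # the first constraint of the distributed reformulation.
    return np.concatenate([w[i] - w[j] for (i, j) in edges])

r = consensus_residuals(w, edges)
print(np.linalg.norm(r))   # > 0: the local copies do not yet agree
```

When all the residuals are driven to zero, every node holds the same shared parameters, because the graph is connected.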
4 ASYMM algorithm
The Asynchronous Method of Multipliers (ASYMM) is a distributed optimization algorithm that involves no central authority and that solves constrained optimization problems in which both local cost functions and constraints can be nonconvex. Thus, it is a well-suited method for solving problem (4) in a distributed way, once it is rewritten in the form (5).
The idea behind ASYMM revolves around the concept of computational units that wake up asynchronously at different time instants, perform some operations, and broadcast their local copies of the shared parameters (and some other variables) to their neighbors. We assume that each node keeps waking up indefinitely and that the time interval between two consecutive awakenings is bounded for all nodes. Moreover, we assume for simplicity that two nodes cannot be awake at the same time instant. When a node wakes up, it performs a gradient descent step on a locally defined augmented Lagrangian, until every neighboring node matches a convergence criterion based on a node-defined tolerance. By doing so, the nodes collectively approach a stationary point of the whole augmented Lagrangian of the considered optimization problem. The convergence check on the augmented Lagrangian is performed by the nodes in a distributed way, using a logic-AND algorithm. When a node becomes aware of the convergence condition, it performs one ascent step on its local multiplier vector and increases its penalty parameters. After a node has received the updated multipliers and penalty parameters associated to the shared constraints from all its neighbors, it starts a new Lagrangian minimization. Under suitable technical assumptions, it can be shown that the computational units collectively converge to a local minimum of problem (4) which satisfies all the constraints. Moreover, private resources are never transmitted over the network.
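The mechanics described above can be illustrated on a toy problem, serializing the asynchronous awakenings in a single process (all names and parameter values are hypothetical, and an exact coordinate step replaces the generic gradient step for simplicity):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy, serialized simulation of the ASYMM mechanics (not the paper's
# implementation): two nodes minimize (x_i - a_i)^2 subject to the
# coupling constraint x_0 - x_1 = 0.
a = [0.0, 2.0]
x = [0.0, 0.0]
lam, rho, tol = 0.0, 1.0, 1e-6

def grad(i):
    # Gradient of the local augmented Lagrangian with respect to x_i.
    sign = 1.0 if i == 0 else -1.0
    return 2.0 * (x[i] - a[i]) + sign * (lam + rho * (x[0] - x[1]))

for _ in range(200_000):
    i = int(rng.integers(2))              # a node wakes up asynchronously
    x[i] -= grad(i) / (2.0 + rho)         # exact coordinate step on its Lagrangian
    if max(abs(grad(0)), abs(grad(1))) < tol:
        c = x[0] - x[1]                   # near-stationarity detected by all
        lam += rho * c                    # ascent step on the multiplier
        rho *= 2.0                        # penalty increase
        if abs(c) < 1e-5:
            break

print(round(x[0], 3), round(x[1], 3))     # both close to the optimum 1.0
```

The inner loop mimics the asynchronous primal descent, and the multiplier/penalty update mimics what each node does once the distributed convergence check succeeds.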
In order to devise a specialized version of ASYMM for problem (5), we need to introduce the corresponding augmented Lagrangian. We associate a multiplier vector and a penalty parameter to each consistency (equality) constraint and, similarly, a multiplier and a penalty parameter to each equality (resp. inequality) constraint of every node, where a superscript denotes the association with the local copy of the shared constraints. Moreover, we denote by a single vector the stack of all the penalty parameters and, consistently, by corresponding vectors the stacks of the multipliers. Finally, in order to present the next equations in a simpler way, we define two parametric functions with a compact notation, one for the equality terms and one for the inequality terms, where the positive-part operator is to be understood component-wise. Then, the augmented Lagrangian associated to (5) is
where is a (column) vector of ones.
In order to collectively minimize (6), the nodes in ASYMM need to compute a local augmented Lagrangian. The local augmented Lagrangian of a node groups all the terms in (6) that depend on its local variables, and it is defined as
where the node-local multipliers, penalty parameters, and constraint functions are defined accordingly.
Finally, we define a local binary matrix for each node, whose size depends on the graph diameter. Such a matrix is used to perform the distributed logic-AND algorithm, which is a building block of ASYMM.
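A simplified, synchronous version of a distributed logic-AND (flooding the AND of the local flags for a number of rounds equal to the graph diameter; the matrix-based bookkeeping of ASYMM is omitted) can be sketched as:

```python
# Hedged sketch of a distributed logic-AND: after a number of synchronous
# rounds equal to the graph diameter, every node knows whether ALL nodes
# have (and keep) their local flag set.
def distributed_and(adjacency, flags, diameter):
    state = list(flags)
    for _ in range(diameter):
        # each node ANDs its own flag with those of its neighbors
        state = [state[i] and all(state[j] for j in adjacency[i])
                 for i in range(len(state))]
    return state

# Path graph 0-1-2-3 (diameter 3).
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(distributed_and(adj, [True, True, True, True], 3))   # all True
print(distributed_and(adj, [True, True, False, True], 3))  # all False
```

In ASYMM, this primitive lets every node detect, without a central coordinator, that the whole network has reached the stationarity tolerance for the current primal minimization.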
It has been shown that the distributed Algorithm 1 is equivalent to a centralized version of the Method of Multipliers in which the primal update is carried out by means of a block-coordinate gradient descent on the augmented Lagrangian (see, e.g., the literature on coordinate descent algorithms). Specifically, there exists a sequence of (centralized) block-coordinate gradient descent steps that returns the same sequence of estimates as those computed by ASYMM. This equivalence property implies that ASYMM inherits all the convergence properties of the centralized block-coordinate Method of Multipliers. In particular, under technical assumptions on the local augmented Lagrangians, similar to those adopted in the centralized case, the estimates generated by ASYMM converge to a local minimum. A key point to establish this result is to bound the norm of the gradient of the augmented Lagrangian in (6) by a function of the local tolerances employed in Algorithm 1. The interested reader is referred to [18, 19] for a thorough theoretical analysis of ASYMM.
5 Experiments

We evaluate our approach in two different distributed environments, focusing on digit recognition and document classification, respectively.
5.1 Digit Recognition
We consider a network composed of 10 nodes, one per digit class. In the context of digit recognition, each node aims at learning to recognize a specific digit given its image, where we assume that node i learns to recognize digit i. Notice that each node could also learn to recognize more than one digit; we consider only one digit per node for the sake of presentation. The i-th recognizer is a private function, and the i-th node has access to private data composed of positive examples of such digit and negative examples of other digits (labeled as positive and negative, respectively, and collected in its labeled set), and of unsupervised examples (belonging to its unlabeled set). No shared data are considered. All the nodes of the network have access to a shared function with two scalar outputs, predicting whether the input digit is even (first output) or odd (second output; we used two outputs to emphasize the role of the mutual-exclusivity constraints that we will introduce shortly).
Fitting the labeled examples in a node is a private soft constraint, and it depends on the node's private recognizer and private labeled data only.
Due to the private nature of the local data, each node has no information that it can directly use to learn the shared even/odd predictor in a discriminative way. However, each node has private knowledge about the fact that its associated digit is either even or odd, and all the nodes have access to the shared knowledge that even and odd are mutually exclusive. Using FOL, this knowledge is expressed by universally quantified formulas (for the sake of simplicity, we skip the arguments of the predicates).
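Under the product T-Norm, the resulting constraints can be sketched as follows (hypothetical names; f_digit stands for a private recognizer, p_even and p_odd for the two outputs of the shared predictor):

```python
import numpy as np

# Hedged sketch (product T-Norm): the private formula "digit_i => even"
# and the shared formula "not(even and odd)" become bilateral constraints,
# enforced on every data point.
def implies_even_constraint(f_digit, p_even):
    # digit_i(x) => even(x)  ->  f_digit(x) * (1 - p_even(x)) = 0
    return f_digit * (1.0 - p_even)

def mutual_exclusivity_constraint(p_even, p_odd):
    # not(even(x) and odd(x))  ->  p_even(x) * p_odd(x) = 0
    return p_even * p_odd

f4 = np.array([1.0, 0.0])         # a node's recognizer on two samples
p_even = np.array([1.0, 0.2])
p_odd = np.array([0.0, 0.2])
print(implies_even_constraint(f4, p_even))           # [0. 0.]: satisfied
print(mutual_exclusivity_constraint(p_even, p_odd))  # [0. 0.04]: violated on x2
```

The implication constraint only binds where the private recognizer fires, which is how the shared predictor receives an even/odd signal without ever seeing even/odd labels.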
We selected the popular MNIST dataset to test our algorithm in the proposed scenario. MNIST consists of black and white images of handwritten digits of size 28 × 28 pixels. Each image is represented through a normalized, flattened vector, in which each entry is a pixel intensity. The dataset comes divided into a training set and a test set, which consist of 60,000 and 10,000 labeled samples, respectively.
In order to generate the private data we proceeded as follows. We randomly selected a subset of the training examples, evenly distributed among the classes, and we built each private labeled set from them. In particular, each node keeps all the selected images of its digit as positive examples, and images of the other digits as negative examples (evenly distributed among the other digits, and such that there is no overlap among the negative examples of the different nodes). The unsupervised sets were built by taking the training examples not involved in the previous operations, and evenly assigning them to the nodes with no overlap (keeping the original class distribution). We remark that this setting is different from the one commonly assumed in semi-supervised classification on the MNIST data, in which the same data (supervised and unsupervised) are shared by all the classifiers and only one class is predicted for each test example [32, 33, 34].
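The data generation above can be sketched as follows (synthetic labels stand in for the MNIST images, and the set sizes are illustrative, not the ones of the experiment):

```python
import numpy as np

rng = np.random.default_rng(0)

# Sketch of the private-data generation with synthetic labels.
labels = rng.integers(0, 10, size=1000)
supervised = rng.permutation(1000)[:400]        # indices used as labeled data
pool = np.setdiff1d(np.arange(1000), supervised)

used_negatives = set()
L = {}                                           # L[i]: labeled set of node i
for i in range(10):
    pos = [j for j in supervised if labels[j] == i]
    # disjoint negatives: no overlap among the negative sets of the nodes
    # (a simplification: positives of one node may still serve as negatives
    # of another, and class balance of the negatives is not enforced here)
    neg = [j for j in supervised
           if labels[j] != i and j not in used_negatives][:len(pos)]
    used_negatives.update(neg)
    L[i] = np.array(pos + neg)

# Unsupervised sets: remaining examples, evenly split with no overlap.
U = np.array_split(rng.permutation(pool), 10)
assert sum(len(u) for u in U) == len(pool)
```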
We modeled each function using a simple neural architecture, namely a Multi-Layer Perceptron (MLP) with one hidden layer and an output unit with sigmoidal activation function. Then, we ran ASYMM multiple times, repeating the aforementioned data generation at each run. Each run consists of 50000 total iterations (which means 5000 awakenings per node on average). After solving the optimization problem, the learned predictors were tested on the original MNIST test set. The mean and standard deviation of the obtained F1 scores (the F1 score is a classification performance metric, typically defined in terms of precision P and recall R as F1 = 2PR/(P + R)) are reported in Table 2 (left column) for each predictor. While the results confirm that each node reaches its goal of learning a recognizer for its digit, we can also see how the system learns to correctly predict even and odd digits without having access to any example labeled as even or odd, but only using the hard constraints that are enforced on private data in a distributed setting. In order to experimentally verify the theoretical properties of ASYMM, we solved the same optimization problem in a centralized way, using the (centralized) Method of Multipliers. The results are reported in Table 2 (Centralized Semi-Supervised column) and are very close to those of the distributed implementation, the small discrepancies being due to the different orders in which the block-coordinate descent steps are performed. In the distributed scenario, in order to give an idea of the role played by unsupervised data, the simulation has been repeated without using the unsupervised data, and the results are reported in Table 2 (right column). The F1 scores of all the classifiers are lower than in the semi-supervised scenario, confirming that the distributed implementation positively exploits the unsupervised data. Finally, in order to show how the (hard) logic constraints are asymptotically satisfied, Figure 2 (left) reports the average constraint violation (over all the problem constraints) along the evolution of one run of ASYMM, in logarithmic scale.
5.2 Document Classification
In the second application, we consider the problem of document classification. We focus on a network of 6 nodes, each of them associated with a category of documents (6 classes). Differently from the previous experiment, each node is assumed to have access to a shared predictor with 6 outputs, which models the class-membership scores of an input document and which must be learned in a distributed setting. What makes this task more challenging is that each node is associated with a unique document class, and it is equipped with a private dataset of (supervised) positive-only examples from the category associated with it (i.e., all its samples are positive examples), and with a set of unlabeled documents. Moreover, each node has some limited and incomplete private knowledge on how its document category is related to the other ones. The goal of the experiment is to make the 6-class classifier available to each node, without sharing private data, learning from positive examples and constraints in a distributed setting.
We implemented a network of 6 nodes, and picked the following document categories: clothing (1), politics (2), running (3), shoes (4), sport (5) and wrestling (6), where the number indicates the node index associated to each of them. The private knowledge of each node is reported in Table 3, from which, using the polynomial forms in Table 1, the local constraints can easily be retrieved. As an example, following the described setup, node 4 has access to positive examples of category 4 (shoes) and to some other unlabeled data. It also knows how shoes is related to some other categories. In particular, it knows that the following two relations hold: ¬(politics ∧ shoes) and (running ∧ shoes) ⇒ clothing. Following the rules of Table 1, the local constraint consists of the two bilateral conditions p_politics(x) · p_shoes(x) = 0 and p_running(x) · p_shoes(x) · (1 − p_clothing(x)) = 0, enforced on all the available data points x.
The local objective function, instead, has the same form for all the nodes and is defined as
The considered problem is in the form of (5) and, hence, can be solved by the ASYMM algorithm.
|local knowledge||aware nodes|
|¬(politics ∧ clothing)||2, 1|
|¬(politics ∧ sport)||2, 5|
|¬(politics ∧ running)||2, 3|
|¬(politics ∧ shoes)||2, 4|
|wrestling ⇒ sport||6, 5|
|(running ∧ shoes) ⇒ clothing||3, 4, 2|
|running ⇒ sport||3, 5|
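As a sketch, the two local constraints of node 4 under the product T-Norm read as follows (p_* are hypothetical names for the corresponding outputs of the shared classifier on a data point):

```python
import numpy as np

# Hedged sketch (product T-Norm) of node 4's local constraints, derived
# from the two formulas it knows: not(politics and shoes), and
# (running and shoes) => clothing.
def node4_constraints(p_pol, p_run, p_shoes, p_cloth):
    c1 = p_pol * p_shoes                      # not(politics and shoes) -> = 0
    c2 = p_run * p_shoes * (1.0 - p_cloth)    # (running and shoes) => clothing
    return np.array([c1, c2])

print(node4_constraints(0.0, 0.9, 0.9, 1.0))  # [0. 0.]: both satisfied
print(node4_constraints(0.5, 0.9, 0.9, 0.0))  # [0.45 0.81]: both violated
```

Each node only instantiates the constraints it is aware of (Table 3), yet all of them act on the same shared classifier, which is how partial private knowledge is combined across the network.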
A collection of documents belonging to the selected categories has been obtained by crawling Wikipedia. We downloaded up to a fixed number of pages for each category, where a portion of the pages was taken by exploring sub-categories (limiting the depth of the exploration, and randomly deciding whether a sub-category should be considered or not). Documents were represented by tf-idf vectors over a fixed dictionary of words. In this experiment, classes are not mutually exclusive. We marked 70% of the documents of each class as supervised samples, while the remaining 30% were marked as unsupervised. All the unsupervised data have been merged, shuffled, and evenly assigned to the nodes (without overlap). We explored a transductive learning scenario, so the unlabeled data are also used to evaluate the quality of the learned classifiers.
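A minimal tf-idf computation can be sketched as follows (toy documents and vocabulary; the experiment uses a fixed dictionary built from the crawled pages):

```python
import math
from collections import Counter

# Toy corpus: three tokenized documents.
docs = [["sport", "running", "shoes"],
        ["politics", "election"],
        ["running", "marathon", "sport"]]

# Document frequency of each term, over the whole corpus.
df = Counter(t for d in docs for t in set(d))
N = len(docs)

def tfidf(doc):
    # tf-idf weight: term frequency times (log) inverse document frequency.
    tf = Counter(doc)
    return {t: (tf[t] / len(doc)) * math.log(N / df[t]) for t in tf}

vec = tfidf(docs[0])
print(vec)   # rarer terms (e.g. "shoes") get larger weights than common ones
```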
We tested two types of architectures for the classifiers: neural networks without hidden layers (referred to as "single layer") and neural networks with one hidden layer (composed of 100 units with tanh activation). Both architectures share some common properties: their output units have sigmoidal activation functions and a fixed negative bias, and we enforced a strong regularization (weight decay) to better cope with the selected setting (learning from positive examples). ASYMM has been run for a fixed number of total iterations and the final results are reported in Table 4. The system learns to classify the 6 classes, with some low-precision results due to the lack of discriminative information (many false positives). One of the classifiers with the highest scores is the one of the politics class, since it is the only class which, by means of constraints, is known to be completely disjoint from the others. The classifiers that are more involved in constraints, such as sport and running, yield better results than the other ones. As a matter of fact, the other classifiers suffer from the small amount of knowledge which is injected into the system through the constraints. For example, the clothing classifier has no information to discriminate samples from the class sport. Introducing a hidden layer allows the system to develop strongly non-linear decision boundaries around the given positive examples, increasing the overall performance. As in the previous example, in order to corroborate the theoretical properties of ASYMM, we also report the results obtained by solving the considered optimization problem in a centralized way (Table 4, last column, single layer case), which are once again almost equivalent to those provided by ASYMM.
In order to further evaluate the proposed distributed setting, we compare the results with those of a centralized approach in which the knowledge on the relationships between the classifiers is not enforced through constraints, but by means of additional supervised data. In particular, we merged all the training sets of the classifiers, and we enriched the supervision with additional labels that are coherent with the logic constraints of Table 3. For example, the data that are labeled as running or wrestling are also labeled as sport (due to constraints 6 and 8 of Table 3), while examples from the politics class also become negative examples of the class sport (due to constraint 3 of Table 3). We allowed all the classifiers to have access to these data (thus violating the privacy assumption we made in the distributed setting), so that each of them can exploit an augmented collection of supervised training examples with respect to the distributed case. We remark that the classifiers now also have negative examples at their disposal. Then, we excluded the logic constraints and the unsupervised data from the optimization (unsupervised data is not used whenever we drop the logic constraints). Besides the two architectures considered in the set-up of Table 4, we also include the case in which the classifiers are modeled as RBF networks, in order to evaluate the effect of locally supported units in this learning problem. For each classifier we considered 1000 RBF neurons followed by a hidden layer of 100 units with tanh activation (the centers of the RBF network were estimated on a small out-of-sample set of Wikipedia documents, and shared among the nodes). Results are reported in Table 5. The RBF network performs better on some classifiers, but the overall performance is lower than that obtained with the other two architectures.
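The locally supported RBF units can be sketched as follows (hypothetical width parameter, and far fewer centers than the 1000 used in the experiment):

```python
import numpy as np

rng = np.random.default_rng(0)

# Sketch of a Gaussian RBF feature layer around fixed centers; in the
# experiment, the centers were estimated out-of-sample and shared among
# the nodes.
def rbf_features(X, centers, gamma=1.0):
    # phi_k(x) = exp(-gamma * ||x - c_k||^2), one feature per center
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

X = rng.normal(size=(5, 3))          # 5 toy documents in 3 dimensions
C = rng.normal(size=(4, 3))          # 4 centers (instead of 1000)
Phi = rbf_features(X, C)
print(Phi.shape)                     # (5, 4): one activation per center
```

Since each unit responds only near its center, the induced decision regions are locally supported, which is the property the comparison aims to probe.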
By comparing the results reported in Tables 4 and 5, it can be seen that the semi-supervised scenario (in which we exploit only positive examples, logic constraints, and unsupervised data) leads, in general, to slightly better scores than those obtained in the centralized setting with artificially generated labelings. This comparison emphasizes the quality of the proposed distributed setting and validates the idea of sharing knowledge by means of constraints. Finally, the average constraint violation along the evolution of ASYMM is reported in Figure 2 (right) for the single-layer architecture, showing that, as expected, the violation vanishes as the algorithm proceeds.
Table 5 columns: Standard Network (1 Hidden Layer), Standard Network (Single Layer), RBF Network (Single Layer). Table 4 columns: 1 Hidden Layer, Single Layer.
6 Conclusions

We proposed a distributed implementation of the framework of Learning from Constraints. We exploited the Asynchronous Method of Multipliers (ASYMM), and we implemented and evaluated a distributed setting where local (private) and shared resources (including constraints) are considered. Experiments were performed on distributed digit and document classification tasks, confirming the quality of the proposed approach.
References

-  M. Gori, Machine Learning: A Constraint-based Approach. Morgan Kaufmann, 2017.
-  G. Gnecco, M. Gori, S. Melacci, and M. Sanguineti, “Foundations of support constraint machines,” Neural computation, vol. 27, no. 2, pp. 388–480, 2015.
-  ——, “Learning with mixed hard/soft point-wise constraints,” IEEE transactions on neural networks and learning systems, vol. 26, no. 9, pp. 2019–2032, 2015.
-  M. Gori and S. Melacci, “Constraint verification with kernel machines,” IEEE transactions on neural networks and learning systems, vol. 24, no. 5, pp. 825–831, 2013.
-  M. Maggini, S. Melacci, and L. Sarti, “Learning from pairwise constraints by similarity neural networks,” Neural Networks, vol. 26, pp. 141–158, 2012.
-  S. Melacci and M. Gori, “Semi-supervised multiclass kernel machines with probabilistic constraints,” in Lecture Notes in Computer Science, vol. 6934, R. Pirrone and F. Sorbello, Eds. Springer, 2011, pp. 21–32.
-  S. Melacci, M. Maggini, and M. Gori, “Semi-supervised learning with constraints for multi-view object recognition,” in Lecture Notes in Computer Science, vol. 5769. Springer, 2009, pp. 653–662.
-  S. Melacci and M. Gori, “Learning with box kernels,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, pp. 2680–2692, 2013.
-  M. J. Wainwright, M. I. Jordan, and J. C. Duchi, “Privacy aware learning,” in Advances in Neural Information Processing Systems, 2012, pp. 1430–1438.
-  K. Chaudhuri, C. Monteleoni, and A. D. Sarwate, “Differentially private empirical risk minimization,” Journal of Machine Learning Research, vol. 12, no. Mar, pp. 1069–1109, 2011.
-  A. Rajkumar and S. Agarwal, “A differentially private stochastic gradient descent algorithm for multiparty classification,” in Artificial Intelligence and Statistics, 2012, pp. 933–941.
-  R. Fierimonte, S. Scardapane, A. Uncini, and M. Panella, “Fully decentralized semi-supervised learning via privacy-preserving matrix completion,” IEEE transactions on neural networks and learning systems, vol. 28, no. 11, pp. 2699–2711, 2016.
-  D. Yin, A. Pananjady, M. Lam, D. Papailiopoulos, K. Ramchandran, and P. Bartlett, “Gradient diversity: a key ingredient for scalable distributed learning,” in International Conference on Artificial Intelligence and Statistics, 2018, pp. 1998–2007.
-  H.-T. Wai, A. Scaglione, J. Lafond, and E. Moulines, “A projection-free decentralized algorithm for non-convex optimization,” in Signal and Information Processing (GlobalSIP), 2016 IEEE Global Conference on. IEEE, 2016, pp. 475–479.
-  P. Di Lorenzo and G. Scutari, “Next: In-network nonconvex optimization,” IEEE Transactions on Signal and Information Processing over Networks, vol. 2, no. 2, pp. 120–136, 2016.
-  T. Tatarenko and B. Touri, “Non-convex distributed optimization,” IEEE Transactions on Automatic Control, vol. 62, no. 8, pp. 3744–3757, 2017.
-  K. Margellos, A. Falsone, S. Garatti, and M. Prandini, “Distributed constrained optimization and consensus in uncertain networks via proximal minimization,” IEEE Transactions on Automatic Control, vol. 63, no. 5, pp. 1372–1387, 2018.
-  F. Farina, A. Garulli, A. Giannitrapani, and G. Notarstefano, “A distributed asynchronous method of multipliers for constrained nonconvex optimization,” Automatica, vol. 103, pp. 243 – 253, 2019.
-  ——, “Asynchronous distributed method of multipliers for constrained nonconvex optimization,” in 2018 European Control Conference, 2018.
-  D. P. Bertsekas, Constrained optimization and Lagrange multiplier methods. Academic press, 2014.
-  J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, M. Mao, A. Senior, P. Tucker, K. Yang, Q. V. Le et al., “Large scale distributed deep networks,” in Advances in neural information processing systems, 2012, pp. 1223–1231.
-  T. Kraska, A. Talwalkar, J. C. Duchi, R. Griffith, M. J. Franklin, and M. I. Jordan, “Mlbase: A distributed machine-learning system.” in Cidr, vol. 1, 2013, pp. 2–1.
-  M. Li, D. G. Andersen, J. W. Park, A. J. Smola, A. Ahmed, V. Josifovski, J. Long, E. J. Shekita, and B.-Y. Su, “Scaling distributed machine learning with the parameter server,” in 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14), 2014, pp. 583–598.
-  E. P. Xing, Q. Ho, W. Dai, J. K. Kim, J. Wei, S. Lee, X. Zheng, P. Xie, A. Kumar, and Y. Yu, “Petuum: A new platform for distributed machine learning on big data,” IEEE Transactions on Big Data, vol. 1, no. 2, pp. 49–67, 2015.
-  J. Konečnỳ, H. B. McMahan, F. X. Yu, P. Richtárik, A. T. Suresh, and D. Bacon, “Federated learning: Strategies for improving communication efficiency,” arXiv preprint arXiv:1610.05492, 2016.
-  V. Smith, C.-K. Chiang, M. Sanjabi, and A. S. Talwalkar, “Federated multi-task learning,” in Advances in Neural Information Processing Systems, 2017, pp. 4424–4434.
-  S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, “Distributed optimization and statistical learning via the alternating direction method of multipliers,” Foundations and Trends® in Machine Learning, vol. 3, no. 1, pp. 1–122, 2011.
-  J. C. Duchi, A. Agarwal, and M. J. Wainwright, “Dual averaging for distributed optimization: convergence analysis and network scaling,” IEEE Transactions on Automatic Control, vol. 57, no. 3, pp. 592–606, 2012.
-  L. Georgopoulos and M. Hasler, “Distributed machine learning in networks by consensus,” Neurocomputing, vol. 124, pp. 2–12, 2014.
-  E. P. Klement, R. Mesiar, and E. Pap, Triangular norms. Springer Science & Business Media, 2013, vol. 8.
-  S. J. Wright, “Coordinate descent algorithms,” Mathematical Programming, vol. 151, no. 1, pp. 3–34, 2015.
-  D. P. Kingma, S. Mohamed, D. J. Rezende, and M. Welling, “Semi-supervised learning with deep generative models,” in Advances in neural information processing systems, 2014, pp. 3581–3589.
-  M. Wang, W. Fu, S. Hao, D. Tao, and X. Wu, “Scalable semi-supervised learning by efficient anchor graph regularization,” IEEE Transactions on Knowledge and Data Engineering, vol. 28, no. 7, pp. 1864–1877, 2016.
-  T. Miyato, S.-i. Maeda, S. Ishii, and M. Koyama, “Virtual adversarial training: a regularization method for supervised and semi-supervised learning,” IEEE transactions on pattern analysis and machine intelligence, 2018.