Many combinatorial problems in machine learning can be cast as the minimization of submodular functions (i.e., set functions that exhibit a diminishing marginal returns property). Applications include isotonic regression, image segmentation and reconstruction, and semi-supervised clustering (see, e.g., [bach2013learning]).
In this paper we consider the problem of minimizing, in a distributed fashion (without any central unit), the sum of many submodular functions, i.e.,
$$\min_{X \subseteq V} \; \sum_{i=1}^{N} F_i(X), \tag{1}$$
where $V$ is called the ground set and the functions $F_i$ are submodular.
We consider a scenario in which problem (1) is to be solved by peer agents communicating locally and performing local computations. The communication is modeled as a directed graph , where is the set of agents and is the set of directed edges in the graph. Each agent receives information only from its in-neighbors, i.e., agents , while it sends messages only to its out-neighbors , where we have included agent itself in these sets. In this set-up, each agent knows only a portion of the entire optimization problem. Namely, agent , knows the function and the set only. Moreover, the local functions must be maintained private by each agent and cannot be shared.
To give some insight into how the proposed scenario arises, let us introduce the distributed image segmentation problem that we will consider later on as a numerical example. Given a certain image to segment, the ground set consists of the pixels of such an image. We consider a scenario in which each of the agents in the network has access to only a portion of the image. Figure 1 illustrates this set-up together with the associated communication graph. Given , the local submodular functions are constructed by using some locally retrieved information, like pixel intensities. While agents do not want to share any information on how they compute local pixel intensities (due to, e.g., local proprietary algorithms), their common goal is to correctly segment the entire image.
Such a distributed set-up is motivated by the modern organization of data and computational power. It is extremely common for computational units to be connected in networks, sharing some resources while keeping others private, see, e.g., [stone2000multiagent, decker1987distributed]. Distributed algorithms in which agents do not need to disclose their own private data are therefore of considerable practical interest. This paradigm has received significant attention in the last decade in the areas of control and signal processing [ahmed2016distributed, chen2018internet].
Submodular minimization problems can be addressed in two main ways. On the one hand, a number of combinatorial algorithms have been proposed [iwata2001combinatorial, iwata2009simple], some based on graph-cut algorithms [jegelka2011fast] or relying on problems with a particular structure [kolmogorov2012minimizing]. On the other hand, convex optimization techniques can be exploited to tackle submodular minimization problems by resorting to the so-called Lovász extension. Many specialized algorithms have been developed in recent years by building on the particular properties of submodular functions (see [bach2013learning] and references therein). In this paper we focus on the problem of minimizing the sum of many submodular functions, which has received attention in many works [stobbe2010efficient, kolmogorov2012minimizing, jegelka2013reflection, fix2013structured, nishihara2014convergence]. In particular, centralized algorithms have been proposed based on smoothed convex minimization [stobbe2010efficient] or alternating projections and splitting methods [jegelka2013reflection], whose convergence rate is studied in [nishihara2014convergence]. This problem structure typically arises, for example, in Markov Random Fields (MRF) Maximum a-Posteriori (MAP) problems [shanu2016min, fix2013structured], a notable example of which is image segmentation.
Distributed approaches for tackling submodular optimization problems started to appear only recently. Submodular maximization problems have been treated and approximately solved in a distributed way in several works [kim2011distributed, mirzasoleiman2013distributed, bogunovic2017distributed, williams2017decentralized, gharesifard2017distributed, grimsman2017impact]. In particular, distributed submodular maximization subject to matroid constraints is addressed in [williams2017decentralized, gharesifard2017distributed], while in [grimsman2017impact] the authors handle the design of communication structures maximizing the worst-case efficiency of the well-known greedy algorithm for submodular maximization when applied over networks. Distributed algorithms for submodular minimization problems, by contrast, have received little attention so far. In [jaleel2018real] a distributed subgradient method is proposed, while in [testa2018distributed] a greedy column generation algorithm is given.
Contribution and organization
The main contribution of this paper is the MIxing bloCKs and grEedY (MICKEY) method, i.e., a distributed block-wise algorithm for solving problem (1). At any iteration, each agent computes a weighted average of its local copies of the neighbors' solution estimates. Then, it selects a random block and performs an ad-hoc (block-wise) greedy algorithm (based on the one in [bach2013learning, Section 3.2]) until the selected block is reached. Finally, based on the output of the greedy algorithm, the selected block of the local solution estimate is updated and broadcast to the out-neighbors. The proposed algorithm is shown to produce cost-optimal solutions in expected value by showing that it is an instance of the Distributed Block Proximal Method presented in [farina2019arXivProximal]. In fact, the partial greedy algorithm performed on the local submodular cost function is shown to compute a block of a subgradient of its Lovász extension.
A key property of this algorithm is that each agent is required to update and transmit only one block of its solution estimate. In fact, it is quite common for networks to have communication bandwidth restrictions. In these cases the entire state variable may not fit the communication channels and, thus, standard distributed optimization algorithms cannot be applied. Furthermore, the greedy algorithm can be very time consuming when an oracle for evaluating the submodular functions is not available and, hence, stopping it earlier can reduce the computational load.
Notation and definitions
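Given a vector $x \in \mathbb{R}^n$, we denote by $x_\ell$ the $\ell$-th entry of $x$. Let $V$ be a finite, non-empty set with cardinality $n = |V|$. We denote by $2^V$ the set of all its subsets. Given a set $X \subseteq V$, we denote by $\mathbf{1}_X \in \mathbb{R}^n$ its indicator vector, defined as $\mathbf{1}_{X\ell} = 1$ if $\ell \in X$, and $\mathbf{1}_{X\ell} = 0$ if $\ell \notin X$. A set function $F : 2^V \to \mathbb{R}$ is said to be submodular if it exhibits the diminishing marginal returns property, i.e., for all $A, B \subseteq V$ with $A \subseteq B$, and for all $j \in V \setminus B$, it holds that $F(A \cup \{j\}) - F(A) \geq F(B \cup \{j\}) - F(B)$. In the following we assume $F(X) < \infty$ for all $X \subseteq V$ and, without loss of generality, $F(\emptyset) = 0$. Given a submodular function $F$, we define the associated base polyhedron as $\mathcal{B}(F) := \{ w \in \mathbb{R}^n \mid \sum_{\ell \in X} w_\ell \leq F(X) \;\; \forall X \in 2^V, \; \sum_{\ell \in V} w_\ell = F(V) \}$ and we denote by $f(x) := \max_{w \in \mathcal{B}(F)} w^\top x$ the Lovász extension of $F$. When $F$ is a submodular function, then $f$ is a continuous, piece-wise affine, nonsmooth convex function.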
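To make these definitions concrete, the following minimal Python sketch checks the diminishing marginal returns property by brute force and evaluates the Lovász extension through the standard sorting-based formula. The toy cut function and all names (`cut_value`, `is_submodular`, `lovasz_extension`) are illustrative assumptions and do not come from the paper.

```python
from itertools import combinations

def cut_value(X, edges):
    """Toy submodular function: total weight of undirected edges crossing X."""
    X = set(X)
    return sum(w for (u, v, w) in edges if (u in X) != (v in X))

def is_submodular(F, ground):
    """Brute-force check: for A subset of B and j not in B,
    F(A + j) - F(A) >= F(B + j) - F(B). Only viable for tiny ground sets."""
    ground = list(ground)
    subsets = [set(c) for r in range(len(ground) + 1)
               for c in combinations(ground, r)]
    for A in subsets:
        for B in subsets:
            if not A <= B:
                continue
            for j in ground:
                if j in B:
                    continue
                if F(A | {j}) - F(A) < F(B | {j}) - F(B) - 1e-12:
                    return False
    return True

def lovasz_extension(F, x):
    """Evaluate the Lovász extension at x: sort the entries in decreasing
    order and accumulate x_l times the marginal gain of adding l."""
    order = sorted(range(len(x)), key=lambda l: -x[l])
    value, prefix, prev = 0.0, set(), F(set())
    for l in order:
        prefix.add(l)
        cur = F(prefix)
        value += x[l] * (cur - prev)
        prev = cur
    return value

# Hypothetical path graph 0 - 1 - 2 with edge weights 1 and 2.
edges = [(0, 1, 1.0), (1, 2, 2.0)]
F = lambda X: cut_value(X, edges)
```

On indicator vectors the extension coincides with the set function, e.g., `lovasz_extension(F, [1.0, 0.0, 1.0])` equals `F({0, 2})`.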
2 Distributed algorithm
2.1 Algorithm description
In order to describe the proposed algorithm, let us introduce the following nonsmooth convex optimization problem
$$\min_{x \in [0,1]^n} \; \sum_{i=1}^{N} f_i(x), \tag{2}$$
where $f_i$ is the Lovász extension of $F_i$ for all $i \in \{1, \dots, N\}$. It can be shown that solving problem (2) is equivalent to solving problem (1) (see, e.g., [lovasz1983submodular] and [bach2013learning, Proposition 3.7]). In fact, given a solution $x^\star$ to problem (2), a solution $X^\star$ to problem (1) can be retrieved by thresholding the components of $x^\star$ at an arbitrary $\tau \in (0,1)$ (see [bach2019submodular]), i.e.,
$$X^\star = \{ \ell \in V \mid x^\star_\ell > \tau \}. \tag{3}$$
In order to compute a single block of a subgradient of , each agent is equipped with a local routine (reported next), that we call BlockGreedy and that resembles a local (block-wise) version of the greedy algorithm in [bach2013learning, Section 3.2]. This routine takes as inputs a vector and the required block , and returns the -th block of a subgradient of at .
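As an illustration of the idea behind such a routine, the sketch below compares a full greedy pass with an early-terminated variant that walks the sorted ordering only until the requested (scalar) block is reached. It is a minimal sketch under our own assumptions: the function names, the tie-breaking rule (by index) and the toy cut function are hypothetical, and the routine in the paper may differ in its details.

```python
def cut_value(X, edges):
    """Toy submodular function: total weight of undirected edges crossing X."""
    X = set(X)
    return sum(w for (u, v, w) in edges if (u in X) != (v in X))

def greedy_subgradient(F, x):
    """Full greedy pass: returns one subgradient of the Lovász extension
    of F at x, using one evaluation of F per entry of x."""
    order = sorted(range(len(x)), key=lambda l: -x[l])
    g, prefix, prev = [0.0] * len(x), set(), F(set())
    for l in order:
        prefix.add(l)
        cur = F(prefix)
        g[l] = cur - prev
        prev = cur
    return g

def block_greedy(F, x, block):
    """Early-terminated variant (assumed form): walk the same ordering and
    stop as soon as the requested index is reached, returning only the
    marginal gain of adding it to the elements that precede it."""
    order = sorted(range(len(x)), key=lambda l: -x[l])
    prefix = set()
    for l in order:
        if l == block:
            return F(prefix | {l}) - F(prefix)
        prefix.add(l)
    raise ValueError("block index out of range")

# Hypothetical path graph 0 - 1 - 2 with edge weights 1 and 2.
edges = [(0, 1, 1.0), (1, 2, 2.0)]
F = lambda X: cut_value(X, edges)
x = [0.2, 0.9, 0.5]
full = greedy_subgradient(F, x)
```

By construction, `block_greedy(F, x, l)` returns the same value as `full[l]` for every index `l`, while never touching the entries that follow `l` in the ordering.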
The MICKEY algorithm works as follows. Each agent stores a local solution estimate of problem (2) and, for each in-neighbor , a local copy of the corresponding solution estimate . At the beginning, each node selects the initial condition at random in and shares it with its out-neighbors. We associate to the communication graph a weighted adjacency matrix and we denote with the weight associated to the edge . At each iteration , agent performs three tasks:
it computes a weighted average ;
it randomly picks a block and runs the BlockGreedy routine;
based on the output of the BlockGreedy routine, it updates only and broadcasts it to its out-neighbors .
Agents halt the algorithm after iterations and recover the local estimates of the set solution to problem (1) by thresholding the value of as in (3). Notice that, to avoid introducing additional notation, we have assumed each block of the optimization variable to be scalar (so that blocks are selected in ). However, blocks of arbitrary sizes can be used (as shown in the subsequent analysis). A pseudocode of the proposed algorithm is reported in the next table.
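The three tasks listed above can be simulated with the following minimal Python sketch. Several ingredients are assumptions of ours rather than statements of the paper: the update is taken to be a projected subgradient step on the selected block only (with projection onto the interval [0, 1]), the stepsize is 1/(t+1), blocks are drawn uniformly, communication is synchronous and reliable, and the toy local functions (a modular term plus a cut term) are hypothetical.

```python
import random

def make_local(c, edges):
    """Hypothetical local function: modular term plus a cut term; F(empty) = 0."""
    def F(X):
        X = set(X)
        modular = sum(c[l] for l in X)
        crossing = sum(w for (u, v, w) in edges if (u in X) != (v in X))
        return modular + crossing
    return F

def block_greedy(F, y, block):
    """Marginal gain of `block` w.r.t. the entries preceding it in the
    decreasing ordering of y (ties broken by index)."""
    prefix = {l for l in range(len(y)) if (-y[l], l) < (-y[block], block)}
    return F(prefix | {block}) - F(prefix)

def mickey_sketch(local_F, W, n, T, seed=0):
    """Synchronous simulation of the three tasks performed by each agent."""
    rng = random.Random(seed)
    N = len(local_F)
    # Random initial estimates in [0, 1]^n, one row per agent.
    x = [[rng.random() for _ in range(n)] for _ in range(N)]
    for t in range(T):
        alpha = 1.0 / (t + 1)  # assumed diminishing stepsize
        new_x = [row[:] for row in x]
        for i in range(N):
            # Task 1: weighted average of the neighbors' estimates.
            y = [sum(W[i][j] * x[j][l] for j in range(N)) for l in range(n)]
            # Task 2: draw a block (uniformly, by assumption) and run the
            # early-terminated greedy routine on the averaged point.
            l = rng.randrange(n)
            g = block_greedy(local_F[i], y, l)
            # Task 3 (assumed form): projected subgradient step on block l
            # only; the other blocks keep their previous values.
            new_x[i][l] = min(1.0, max(0.0, y[l] - alpha * g))
        x = new_x
    return x

# Toy instance (hypothetical data): two agents, ground set {0, 1, 2}.
F1 = make_local([-2.0, 0.0, 0.0], [(0, 1, 1.0)])
F2 = make_local([0.0, -2.0, 3.0], [(1, 2, 1.0)])
W = [[0.5, 0.5], [0.5, 0.5]]
estimates = mickey_sketch([F1, F2], W, n=3, T=200)
```

Each row of `estimates` is the local solution estimate of one agent; a set solution would then be obtained by thresholding each row with a common threshold.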
The proposed algorithm possesses many interesting features. Its distributed nature requires agents to communicate only with their direct neighbors, without resorting to multi-hop communications. Moreover, all the local computations involve locally defined quantities only. In fact, stepsize sequences and block drawing probabilities are locally defined at each node.
The block-wise updates and communications bring benefits in two areas. Communicating single blocks of the optimization variable, instead of the entire one, can significantly reduce the communication bandwidth required by each agent in broadcasting its local estimates. This makes the proposed algorithm implementable in networks with communication bandwidth restrictions. Moreover, the classical greedy algorithm requires evaluating the submodular function $n$ times, with $n$ the cardinality of the ground set, in order to produce a subgradient. When $n$ is very large and an oracle for evaluating functions is not available, this can be a very time-consuming task. In the numerical example section, we consider the minimum graph cut problem. Evaluating the value of a cut for a graph with nodes and arcs, requires a running-time . In the BlockGreedy routine, the greedy algorithm is terminated earlier, i.e., when the needed block is reached. Such an early termination can significantly speed up the algorithm in those cases in which the submodular function evaluations constitute the bottleneck of the algorithmic evolution.
In order to state the convergence properties of the proposed algorithm, let us make the following two assumptions on the communication graph and the associated weight matrix .
Assumption 1 (Strongly connected graph).
The digraph is strongly connected.
Assumption 2 (Doubly stochastic weight matrix).
For all , the weights of the weight matrix satisfy
if , if and only if ;
there exists a constant such that and if , then ;
The above two assumptions are very common when designing distributed optimization algorithms. In particular, Assumption 1 guarantees that information spreads through the entire network, while Assumption 2 ensures that each agent gives sufficient weight to the information coming from its in-neighbors.
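For a given network, conditions of this kind can be verified numerically. The sketch below is a hypothetical checker: it tests strong connectivity by reachability, the positivity pattern and the uniform lower bound on nonzero weights, and, following the title of Assumption 2, unit row and column sums. The function names, the representation of in-neighbors as sets (each containing the agent itself) and the tolerance are our assumptions.

```python
def reachable(adj, start):
    """Nodes reachable from `start` by depth-first search over `adj`."""
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen

def is_strongly_connected(in_neighbors):
    """Every node reaches node 0 and is reached from node 0."""
    N = len(in_neighbors)
    out_neighbors = [[i for i in range(N) if j in in_neighbors[i]]
                     for j in range(N)]
    return (len(reachable(out_neighbors, 0)) == N
            and len(reachable(in_neighbors, 0)) == N)

def check_weights(W, in_neighbors, eta, tol=1e-9):
    """Check the sparsity pattern, the lower bound eta on nonzero weights,
    and double stochasticity of the weight matrix W."""
    N = len(W)
    for i in range(N):
        for j in range(N):
            if (W[i][j] > 0) != (j in in_neighbors[i]):
                return False
            if 0 < W[i][j] < eta:
                return False
    rows_ok = all(abs(sum(W[i]) - 1.0) <= tol for i in range(N))
    cols_ok = all(abs(sum(W[i][j] for i in range(N)) - 1.0) <= tol
                  for j in range(N))
    return rows_ok and cols_ok

# Hypothetical directed ring of three agents with self-loops.
ring_in = [{0, 2}, {1, 0}, {2, 1}]
W_ring = [[0.5, 0.0, 0.5],
          [0.5, 0.5, 0.0],
          [0.0, 0.5, 0.5]]
```

For this ring, both checks succeed with a lower bound of 0.5, whereas any larger bound is violated.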
Let be the average over the agents of the local solution estimates at iteration and define . Then, in the next result, we show that by cooperating through the proposed algorithm all the agents agree on a common solution and the produced sequences are asymptotically cost optimal in expected value when .
By the same arguments as in [farina2019arXivProximal, Lemma 3.1], it can be shown that for all and all . Then (10) follows from [farina2019arXivProximal, Lemma 5.11]. Moreover, as anticipated, it can be shown that is the -th block of a subgradient of the function in problem (2) (see, e.g., [bach2013learning, Section 3.2]). In fact, being defined as the support function of the base polyhedron , i.e., , the greedy algorithm [bach2013learning, Section 3.2] iteratively computes a subgradient of component by component. Moreover, subgradients of are bounded by some constant , since every component of a subgradient of is computed as the difference of over two different subsets of . Given that, the proposed algorithm can be seen as a special instance of the Distributed Block Proximal Method in [farina2019arXivProximal]. Thus, since Assumptions 1 and 2 hold, it inherits all the convergence properties of the Distributed Block Proximal Method and, under the assumption of diminishing stepsizes (9), the result in (11) follows (see [farina2019arXivProximal, Theorem 5.15]). ∎
Notice that the result in Theorem 1 does not say anything about the convergence of the sequences , but only states that if diminishing stepsizes are employed, asymptotically these sequences are consensual and cost optimal in expected value.
Despite that, from a practical point of view, two facts typically happen. First, agents approach consensus, i.e., for all , the value becomes small, extremely fast, so that they all agree on a common solution. Second, if the number of iterations in the algorithm is sufficiently large, the value of is a good solution to problem (2). Then, given , each agent can reconstruct a set solution to problem (1) by using (8) and, in order to obtain the same solution for all the agents, we consider a unique threshold value, known to all the agents, .
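The reconstruction step described above amounts to a single comparison per entry. In the minimal sketch below, the strict inequality, the default threshold of 0.5 and the numerical values of the two local estimates are illustrative assumptions.

```python
def threshold(x, tau=0.5):
    """Set of indices whose entry is strictly larger than tau."""
    return {l for l, value in enumerate(x) if value > tau}

# Two agents with slightly different (hypothetical) local estimates
# apply the same threshold value.
x_1 = [0.97, 0.91, 0.04]
x_2 = [0.95, 0.88, 0.07]
X_1, X_2 = threshold(x_1), threshold(x_2)
```

Since both agents use the same threshold and their estimates are close to consensus, they recover the same set here.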
3 Numerical example: cooperative image segmentation
Submodular minimization has been widely applied to computer vision problems such as image classification, segmentation and reconstruction, see, e.g., [stobbe2010efficient, jegelka2013reflection, greig1989exact]. In this section, we consider a binary image segmentation problem in which agents have to cooperate in order to separate an object from the background in an image of size pixels (with ). Each agent has access only to a portion of the entire image, see Figure 2, and can communicate according to the graph reported in the figure.
Before giving the details of the distributed experimental set-up, let us introduce how such a problem is usually treated in a centralized way, i.e., by casting it into an $s$–$t$ minimum cut problem.
3.1 $s$–$t$ minimum cut problem
Assume the entire image is available for segmentation, and denote as the set of pixels. As shown, e.g., in [greig1989exact, boykov2006graph], this problem can be reduced to an equivalent $s$–$t$ minimum cut problem, which can be approached by submodular minimization techniques. More in detail, this approach is based on the construction of a weighted digraph , where is the set of nodes, is the edge set and is a positive weighted adjacency matrix. There are two sets of directed edges and , with positive weights and respectively, for all . Moreover, there is an undirected edge between any two neighboring pixels with weight . The weights and represent individual penalties for assigning pixel to the object and to the background respectively. On the other hand, given two pixels and , the weight can be interpreted as a penalty for a discontinuity between their intensities.
In order to quantify the weights defined above, let us denote by the intensity of pixel . Then, see, e.g., [boykov2006graph], is computed as
is a constant modeling, e.g., the variance of the camera noise. Moreover, weights and are respectively computed as
where is a constant and (respectively ) denotes the probability of pixel to belong to the foreground (respectively background).
The goal of the $s$–$t$ minimum cut problem is to find a subset of pixels such that the sum of the weights of the edges from to is minimized.
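As a worked illustration of this construction, the following sketch evaluates the cut value of a candidate pixel subset on a tiny one-dimensional "image" and finds the best subset by exhaustive search. Everything specific here is an assumption of ours: the Gaussian form of the pairwise weight, the use of the raw intensities as terminal weights, the value of the noise constant and the four pixel intensities.

```python
import math
from itertools import combinations

def pairwise_weight(a, b, sigma2):
    """Assumed Gaussian similarity between two pixel intensities."""
    return math.exp(-((a - b) ** 2) / (2.0 * sigma2))

def st_cut_value(X, w_s, w_t, pair_w):
    """Weight of the cut separating the source side (s plus X) from the rest:
    source edges of pixels left out of X, sink edges of pixels in X, and
    pixel-to-pixel edges with exactly one endpoint in X."""
    X = set(X)
    total = sum(w for i, w in enumerate(w_s) if i not in X)
    total += sum(w for i, w in enumerate(w_t) if i in X)
    total += sum(w for (i, j), w in pair_w.items() if (i in X) != (j in X))
    return total

def brute_force_min_cut(n, w_s, w_t, pair_w):
    """Exhaustive search over all subsets; only viable for tiny instances."""
    subsets = (set(c) for r in range(n + 1)
               for c in combinations(range(n), r))
    return min(subsets, key=lambda X: st_cut_value(X, w_s, w_t, pair_w))

# Hypothetical 4-pixel row: two bright pixels followed by two dark ones.
intensity = [0.9, 0.8, 0.1, 0.2]
pair_w = {(i, i + 1): pairwise_weight(intensity[i], intensity[i + 1], 0.05)
          for i in range(3)}
w_s = intensity                      # assumed source-side terminal weights
w_t = [1.0 - v for v in intensity]   # assumed sink-side terminal weights
X_best = brute_force_min_cut(4, w_s, w_t, pair_w)
```

With these weights the cheapest cut separates the two bright pixels from the two dark ones, since the only pixel-to-pixel edge that is cut joins pixels with very different intensities and therefore carries a small weight.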
3.2 Distributed set-up
In the considered distributed set-up, agents are connected according to a strongly-connected Erdős-Rényi random digraph and each of them has access only to a portion of the entire image (see Figure 2). In this set-up, clearly, each agent can assign weights only to some edges in so that, it cannot segment the entire image on its own.
Let be the set of pixels seen by agent . Each node assigns a local intensity to each pixel . Then, it computes its local weights as
Given the above locally defined weights, each agent constructs its private submodular function as
Here, the first term takes into account the edges from to , the second one those from to , and the third one those from to . The last term is a normalization term guaranteeing . Then, by plugging (12) in problem (1), the optimization problem that the agents have to cooperatively solve in order to segment the given image, turns out to be
We applied the MICKEY distributed algorithm to this set-up and we split the optimization variable in blocks. In order to mimic possible errors in the construction of the local weights, we added some random noise to the image. We ran the algorithm for iterations and the results are shown in Figure 3. Each row is associated to one network agent while each column is associated to a different time stamp. More in detail, we show the initial condition at time and the candidate (continuous) solution at iterations. The last column represents the solution of each agent obtained by thresholding with and .
As shown in Figure 3, the local solution set estimates are almost identical. Moreover, the connectivity structure of the network clearly affects the evolution of the local estimates.
In this paper we presented MICKEY, a distributed algorithm for minimizing the sum of many submodular functions without any central unit. It involves random block updates and communications, thus requiring a reduced local computational load and allowing its deployment on networks with low communication bandwidth (since it requires a small amount of information to be transmitted at each iteration). Its convergence in expected value has been shown under mild assumptions. The MICKEY algorithm has been tested on a cooperative image segmentation problem in which each agent has access to only a portion of the entire image.
This result is part of a project that has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No 638992 - OPT4SMART).