Recently, the need has arisen to design algorithms that distribute decision making among a collection of agents or computing devices. This need has been motivated by problems from statistics, machine learning and robotics. These problems include:
(Information acquisition) How should a team of mobile robots move in order to acquire information about an environmental process or reduce uncertainty in a mapping task? 
$$\max_{\mathcal{S} \in \mathcal{I}} f(\mathcal{S}) \qquad (1)$$
where $f$ is a submodular function (i.e. it has a diminishing-returns property), $\mathcal{V}$ is a finite ground set of all decision variables, and $\mathcal{I}$ is a family of allowable subsets of $\mathcal{V}$. In words, the goal of (1) is to pick a set $\mathcal{S}$ from the family of allowable subsets that maximizes the submodular set function $f$. A wide class of relevant objective functions such as mutual information and weighted coverage are submodular; this has motivated a growing body of work surrounding submodular optimization problems.
Intuitively, it is useful to think of the problem in (1) as a distributed $n$-player game. In this game, each player or agent has a distinct local strategy set of actions. The goal of the game is for each agent to choose at most one action from its own strategy set to maximize a problem-specific notion of reward. Therefore, the problem is distributed in the sense that agents can only form a control policy with the actions from their local, distinct strategy sets. To maximize reward, agents are allowed to communicate with their direct neighbors in a bidirectional communication graph. In this way, we might think of these agents as robots that collectively aim to solve a coverage problem in an unknown environment by communicating their sensing actions to their nearest neighbors. Throughout this work, we will refer to this multi-agent game example to elucidate our results.
In this paper, our aim is to study problem (1) in a distributed setting, which we will formally introduce in Section III; this setting differs considerably from the centralized setting, which has been studied thoroughly in past work (see ). Notably, the distributed setting admits a more challenging problem because agents can only communicate locally with respect to a communication graph. Therefore designing an efficient communication scheme among agents is a concomitant requirement for the distributed setting, whereas in the centralized setting, there is no such desideratum.
I-A Related Work
The optimization problem in (1) has previously been studied in settings that differ significantly from the setting studied in this paper. In particular,  addresses this problem in a centralized setting and shows that a centralized algorithm can obtain the tight approximation of the optimal solution. In this way,  is perhaps the closest to this paper in that both manuscripts introduce algorithms that obtain the tight guarantee for problem (1) with respect to a particular setting. However, the setting of  is inherently centralized, whereas our setting is distributed.
Another similar line of work concerns the so-called “master-worker” model. In this framework, agents solve a distributed optimization problem such as (1) by exchanging local information with a centralized master node. However, this setting also differs from the setting studied in this work in that our results assume an entirely distributed setting with no centralized node -.
Fundamentally, the optimization problem posed in (1) is NP-hard. However, near-optimal solutions to (1) can be approximated by greedy algorithms. In the distributed context, the sequential greedy algorithm (SGA) has been rigorously studied in . This work poses (1) as a communication problem among agents distributed in a directed acyclic graph (DAG) working to optimize a global objective function. The authors of  offer upper and lower bounds on the performance of SGA based on the clique number of the underlying DAG. Building on this,  analyzes the communication redundancy in such an approach and proposes a distributed planning technique that randomly partitions the agents in the DAG. On the other hand,  extends the work of  to a sequential setting in which agents have limited access to the prior decisions of other agents. Extensions of SGA such as the distributed SGA (DSGA) have also been proposed. In particular,  poses (1) as a multi-robot exploration problem and uses DSGA to quantify the suboptimality incurred by redundant sensing information.
Others have proposed novel algorithms with the goal of avoiding the communication overhead incurred by deploying SGA for a large number of agents. Instead of explicitly solving (1), many of these algorithms seek to solve a continuous relaxation of this problem , . This continualization of the problem in (1) was originally introduced in . In particular,  proposes several gradient ascent-style algorithms for solving a problem akin to (1) in which each agent has access to a local objective function. Similarly, novel algorithms have been developed for solving problems such as unconstrained submodular maximization  and submodular maximization with matroid constraints ,  by first lifting these problems to the continuous domain.
Another notable direction in solving problem (1) has been to define an auxiliary or surrogate function in place of the original submodular objective. For instance,  introduces a distributed algorithm for maximizing a submodular auxiliary function subject to matroid constraints that obtains the optimal approximation. This approach of defining surrogate functions in place of the submodular objective differs significantly from our approach.
In this paper, we formulate the general case of maximizing a submodular function subject to a distributed partition matroid constraint in Problem 1. We then formulate the continuous relaxation of this problem via the multilinear extension in Problem 2. Both of these problems are formally defined in Section III. We study the special case of this optimization problem in which each agent can compute the global objective function and its gradient; however, we assume that each agent only has access to a local, distinct set of actions. Under these constraints, we develop Constraint-Distributed Continuous Greedy (CDCG), a novel algorithm for solving the continuous relaxation of the distributed submodular optimization problem that achieves a tight $(1-1/e)$ approximation of the optimal solution, which is known to be the best possible approximation unless $\mathrm{P} = \mathrm{NP}$. We offer an analysis of the proposed algorithm and prove that it achieves the tight $(1-1/e)$ approximation and that its error term vanishes at a linear rate.
Previous work on the distributed version of this problem can approximate the optimal solution to within a multiplicative factor of $1/2$ via the SGA or DSGA. Algorithms for slightly different settings, such as the setting in which each node has access to a local objective function and these local functions are averaged to form a global objective function, can also achieve the $(1-1/e)$ approximation. Similarly, it has been shown that the optimal approximation can be achieved in the centralized setting. However, to the best of our knowledge, the CDCG algorithm presented in this paper is the first algorithm that is guaranteed to achieve the $(1-1/e)$ approximation of the optimal solution in this distributed setting. The proofs of all lemmas and theorems in this work will be included in the arXiv version of this paper.
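Throughout this paper, lowercase bold-face (e.g. $\mathbf{v}$) will denote a vector, while uppercase bold-face (e.g. $\mathbf{W}$) will denote a matrix. The $i$th component of a vector $\mathbf{v}$ will be denoted $v_i$; the element in the $i$th row and $j$th column of a matrix $\mathbf{W}$ will be denoted by $w_{ij}$. The inner product between two vectors $\mathbf{x}$ and $\mathbf{y}$ will be denoted by $\langle \mathbf{x}, \mathbf{y} \rangle$ and the Euclidean norm of a vector $\mathbf{v}$ will be denoted by $\|\mathbf{v}\|$. Given two vectors $\mathbf{x}$ and $\mathbf{y}$, we define $\mathbf{x} \vee \mathbf{y}$ as the (vector-valued) component-wise maximum between $\mathbf{x}$ and $\mathbf{y}$; similarly, $\mathbf{x} \wedge \mathbf{y}$ will denote the component-wise minimum between $\mathbf{x}$ and $\mathbf{y}$. We will use the notation $\mathbf{0}_n$ to denote an $n$-dimensional vector in which each component is zero; similarly $\mathbf{1}_n$ will denote an $n$-dimensional vector in which each component is one. Calligraphic fonts will denote sets (e.g. $\mathcal{Y}$). Given a set $\mathcal{Y}$, $|\mathcal{Y}|$ will denote the cardinality of $\mathcal{Y}$, while $2^{\mathcal{Y}}$ will denote the power set of $\mathcal{Y}$. $\mathbb{1}_{\mathcal{Y}}$ will represent the indicator function for the set $\mathcal{Y}$. That is, $\mathbb{1}_{\mathcal{Y}}$ is the function that takes value one if its argument is an element of $\mathcal{Y}$ and takes value zero otherwise.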
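Let $\mathcal{V}$ be a finite set and let $f : 2^{\mathcal{V}} \to \mathbb{R}_+$ be a set function mapping subsets of $\mathcal{V}$ to the nonnegative real line. In this setting, $\mathcal{V}$ is commonly referred to as the ground set. The function $f$ is called submodular if for every $\mathcal{A}, \mathcal{B} \subseteq \mathcal{V}$,
$$f(\mathcal{A}) + f(\mathcal{B}) \geq f(\mathcal{A} \cup \mathcal{B}) + f(\mathcal{A} \cap \mathcal{B}).$$
In essence, submodularity amounts to having a so-called diminishing-returns property, meaning that the incremental value of adding a single element to the argument of $f$ is no less than that of adding the same element to a superset of the argument. To illustrate this, we will slightly overburden our notation by defining
$$f(e \mid \mathcal{A}) = f(\mathcal{A} \cup \{e\}) - f(\mathcal{A})$$
as the marginal reward of $e$ given $\mathcal{A}$. This gives rise to an equivalent definition of submodularity. In particular, $f$ is said to be submodular if for every $\mathcal{A} \subseteq \mathcal{B} \subseteq \mathcal{V}$ and $e \in \mathcal{V} \setminus \mathcal{B}$,
$$f(e \mid \mathcal{A}) \geq f(e \mid \mathcal{B}).$$
Throughout this paper, we will consider submodular functions that are also monotone, meaning that for every $\mathcal{A} \subseteq \mathcal{B} \subseteq \mathcal{V}$, $f(\mathcal{A}) \leq f(\mathcal{B})$, and normalized so that $f(\emptyset) = 0$.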
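To make the diminishing-returns property concrete, the following is a minimal sketch that checks it by brute force on a toy coverage function. The ground set `COVER` and all function names are hypothetical choices made for illustration only; exhaustive enumeration is feasible only for very small ground sets.

```python
from itertools import combinations

# Hypothetical toy instance: each element covers a few numbered cells.
COVER = {
    "a": {1, 2, 3},
    "b": {3, 4},
    "c": {4, 5, 6},
    "d": {1, 6},
}

def f(subset):
    """Coverage function: number of distinct cells covered by `subset`."""
    covered = set()
    for e in subset:
        covered |= COVER[e]
    return len(covered)

def marginal(e, subset):
    """Marginal reward of adding element e to `subset`."""
    return f(set(subset) | {e}) - f(subset)

def powerset(items):
    """Yield every subset of `items` as a set."""
    items = list(items)
    for r in range(len(items) + 1):
        for combo in combinations(items, r):
            yield set(combo)

def is_submodular(ground):
    """Check marginal(e, A) >= marginal(e, B) for all A subset of B and e outside B."""
    for B in powerset(ground):
        for A in powerset(B):
            for e in set(ground) - B:
                if marginal(e, A) < marginal(e, B):
                    return False
    return True

def is_monotone(ground):
    """Check f(A) <= f(B) whenever A is a subset of B."""
    return all(f(A) <= f(B) for B in powerset(ground) for A in powerset(B))
```

In this toy instance, adding `"b"` to the empty set gains two cells, whereas adding it to `{"a", "c"}` gains none, which is the diminishing-returns behavior described above.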
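In practice, one often encounters a constraint on the allowable subsets of the ground set when maximizing a submodular objective function. Concretely, if $\mathcal{I} \subseteq 2^{\mathcal{V}}$ is a nonempty family of allowable subsets of the ground set $\mathcal{V}$, then the tuple $(\mathcal{V}, \mathcal{I})$ is a matroid if the following criteria are satisfied:
(Heredity) For any $\mathcal{A} \subseteq \mathcal{B} \subseteq \mathcal{V}$, if $\mathcal{B} \in \mathcal{I}$, then $\mathcal{A} \in \mathcal{I}$.
(Augmentation) For any $\mathcal{A}, \mathcal{B} \in \mathcal{I}$, if $|\mathcal{A}| < |\mathcal{B}|$, then there exists an $e \in \mathcal{B} \setminus \mathcal{A}$ such that $\mathcal{A} \cup \{e\} \in \mathcal{I}$.
Furthermore, if $\mathcal{V}$ is partitioned into disjoint sets $\mathcal{V}_1, \dots, \mathcal{V}_n$, then $(\mathcal{V}, \mathcal{I})$ is a partition matroid if there exist positive integers $k_1, \dots, k_n$ such that
$$\mathcal{I} = \{\mathcal{S} \subseteq \mathcal{V} : |\mathcal{S} \cap \mathcal{V}_i| \leq k_i \ \text{for all } i = 1, \dots, n\}.$$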
Partition matroids are particularly useful when defining the constraints of a distributed optimization problem because they can be used to describe a setting in which a ground set of all possible actions is written as the product of disjoint local action spaces .
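As an illustration of the partition matroid constraint, the sketch below tests whether a candidate selection respects per-block capacities. The function name, the block contents, and the capacities are hypothetical and chosen only for the example.

```python
def is_independent(selection, blocks, capacities):
    """Return True if `selection` lies in the ground set and takes at most
    capacities[i] elements from blocks[i] (the partition matroid condition)."""
    selection = set(selection)
    ground = set().union(*blocks)
    if not selection <= ground:
        return False
    return all(len(selection & block) <= cap
               for block, cap in zip(blocks, capacities))

# Hypothetical example: two agents, each allowed at most one action.
BLOCKS = [{"a1", "a2"}, {"b1", "b2", "b3"}]
CAPACITIES = [1, 1]
```

With these capacities, `{"a1", "b2"}` is an allowable set while `{"a1", "a2"}` is not, since the latter takes two actions from the first agent's block. Heredity is visible directly: removing elements from an allowable set can only lower each block count.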
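The notion of submodularity can be extended to the continuous domain. Consider a set $\mathcal{X} = \prod_{i=1}^{n} \mathcal{X}_i$, where $\mathcal{X}_i$ is a compact subset of $\mathbb{R}_+$ for $i = 1, \dots, n$. We call a continuous function $F : \mathcal{X} \to \mathbb{R}_+$ submodular if $\forall\, \mathbf{x}, \mathbf{y} \in \mathcal{X}$,
$$F(\mathbf{x}) + F(\mathbf{y}) \geq F(\mathbf{x} \vee \mathbf{y}) + F(\mathbf{x} \wedge \mathbf{y}).$$
As in the discrete case, we say that a continuous function $F$ is monotone if $\forall\, \mathbf{x}, \mathbf{y} \in \mathcal{X}$, $\mathbf{x} \leq \mathbf{y}$ implies that $F(\mathbf{x}) \leq F(\mathbf{y})$. Furthermore, if $F$ is differentiable, we say that $F$ is DR-submodular, where DR stands for “diminishing-returns,” if the gradients are antitone. That is, $\forall\, \mathbf{x}, \mathbf{y} \in \mathcal{X}$, $F$ is DR-submodular if $\mathbf{x} \leq \mathbf{y}$ implies that $\nabla F(\mathbf{x}) \geq \nabla F(\mathbf{y})$.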
III Problem Statement
Given the aforementioned applications which emphasize the utility of maximizing submodular functions subject to distributed partition matroid constraints, we formulate the main problem of this paper:
Problem 1 (Submodular Maximization Subject to a Distributed Partition Matroid Constraint).
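Consider a collection of $n$ agents that form the set $\mathcal{N} = \{1, \dots, n\}$. Let $f : 2^{\mathcal{V}} \to \mathbb{R}_+$ be a normalized and monotone submodular function and let $\mathcal{V}_1, \dots, \mathcal{V}_n$ be a pairwise disjoint partition of a finite ground set $\mathcal{V}$, wherein each agent $i \in \mathcal{N}$ can only choose actions from its local strategy set $\mathcal{V}_i$. Consider the partition matroid $(\mathcal{V}, \mathcal{I})$, where
$$\mathcal{I} = \{\mathcal{S} \subseteq \mathcal{V} : |\mathcal{S} \cap \mathcal{V}_i| \leq 1 \ \text{for all } i \in \mathcal{N}\}.$$
The problem of submodular maximization subject to a distributed partition matroid constraint is to maximize $f$ by selecting a set $\mathcal{S}$ from the family of allowable subsets so that $\mathcal{S} \in \mathcal{I}$. Formally:
$$\max_{\mathcal{S} \in \mathcal{I}} f(\mathcal{S}).$$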
In effect, the distributed partition matroid constraint in Problem 1 enforces that each agent can choose at most one action from its local strategy set . Note that in this setting, each agent can only choose actions from its own local strategy set. Therefore, this problem is distributed in the sense that agents can only determine the actions taken by other agents by directly communicating with one another.
III-A Sequential Greedy Algorithm
It is well known that the sequential greedy algorithm (SGA), in which each agent $i$ chooses an action sequentially based on
$$y_i \in \operatorname*{arg\,max}_{y \in \mathcal{V}_i} f(y \mid \mathcal{S}_{i-1}),$$
where $\mathcal{S}_{i-1} = \{y_1, \dots, y_{i-1}\}$, approximates the optimal solution to within a multiplicative factor of $1/2$. The drawbacks of this algorithm are twofold. Firstly, as we will show, it is possible to achieve the $(1-1/e)$ approximation of the optimal solution. Secondly, as its name suggests, SGA is sequential in nature and therefore it scales very poorly in the number of agents. That is, each agent must wait for each of the previous agents to compute their contribution to the set $\mathcal{S}$.
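The sequential structure of SGA can be sketched in a few lines. The example below is a minimal illustration on a toy coverage objective; the footprints, action names, and tie-breaking rule (alphabetical order) are assumptions made here and are not taken from the paper.

```python
def coverage(actions, footprint):
    """Number of distinct cells covered by the chosen actions."""
    covered = set()
    for a in actions:
        covered |= footprint[a]
    return len(covered)

def sequential_greedy(strategy_sets, footprint):
    """Each agent, in order, picks the local action with the largest marginal
    gain given the choices already made by the agents before it."""
    chosen = []
    for local_actions in strategy_sets:
        base = coverage(chosen, footprint)
        best = max(sorted(local_actions),
                   key=lambda a: coverage(chosen + [a], footprint) - base)
        chosen.append(best)
    return chosen

# Hypothetical instance on which greedy is suboptimal.
FOOTPRINT = {"a1": {1, 2}, "a2": {3}, "b1": {1, 2}}
STRATEGY_SETS = [{"a1", "a2"}, {"b1"}]
```

On this instance the first agent greedily picks `a1`, after which the second agent's only action adds nothing, so SGA covers two cells while the choice `a2`, `b1` covers three. The loop also makes the scaling issue visible: agent $i$ cannot start until agents $1, \dots, i-1$ have finished.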
III-B Continuous Extension of Problem 1
Sequential algorithms such as SGA can only achieve a $1/2$ approximation of the optimal solution. To achieve the best possible $(1-1/e)$ approximation of the optimal solution, it is necessary to extend Problem 1 to the continuous domain via the so-called multilinear extension $F$ of the submodular objective function $f$. Thus, the method we use in this work to achieve the tight $(1-1/e)$ approximation relies on the continualization of Problem 1. Importantly, it has been shown that Problem 1 and the optimization problem engendered by lifting Problem 1 to the continuous domain via this multilinear extension yield the same solution . Furthermore, by applying proper rounding techniques, such as those described in Section 5.1 of  and in  and  to the continuous relaxation of Problem 1, one can obtain the tight $(1-1/e)$ approximation for Problem 1.
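For intuition, the multilinear extension is commonly defined as the expected value of $f$ on a random set that includes each element $e$ independently with probability $x_e$. The sketch below evaluates this expectation exactly by enumerating all subsets, which is feasible only for tiny ground sets; the toy objective and all names are hypothetical.

```python
from itertools import product

def multilinear_extension(f, ground, x):
    """Exact F(x): sum over subsets S of f(S) times the probability of drawing S
    when each element e is included independently with probability x[e]."""
    ground = list(ground)
    total = 0.0
    for mask in product([0, 1], repeat=len(ground)):
        prob = 1.0
        subset = set()
        for e, bit in zip(ground, mask):
            prob *= x[e] if bit else 1.0 - x[e]
            if bit:
                subset.add(e)
        total += prob * f(subset)
    return total

# Hypothetical coverage objective on two overlapping footprints.
FOOTPRINT = {"u": {1, 2}, "v": {2, 3}}

def f(subset):
    covered = set()
    for e in subset:
        covered |= FOOTPRINT[e]
    return len(covered)
```

At a 0/1 vector the extension agrees with the set function, for instance `multilinear_extension(f, FOOTPRINT, {"u": 1.0, "v": 0.0})` equals `f({"u"})`. This agreement at the vertices is what allows a fractional solution to be rounded back to a set.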
Therefore, our approach in this paper will be to lift Problem 1 to the continuous domain. We formulate this problem in the following way:
Problem 2 (Continuous Relaxation of Problem 1).
Problem 2 is distributed in the sense that each agent is associated with its own distinct continuous strategy space . Formally, the set is defined as
IV Constraint-Distributed Continuous Greedy
In this section, we present Constraint-Distributed Continuous Greedy (CDCG), a decentralized algorithm for solving Problem 2. The pseudo-code of CDCG is described in Algorithm 1. At a high level, this algorithm involves updating each agent’s local decision variable based on the aggregated belief of a small group of other agents about the best control policy. In essence, inter-agent communication within small groups of agents facilitates local decision making.
For clarity, we introduce a simple framework for the inter-agent communication structure. In CDCG, agents share their decision variables with a small subset of local agents in . To encode the notion of locality, suppose that each agent is a node in a bidirectional communication graph in which denotes the set of edges. Given this structure, we assume that each agent can only communicate its decision variable with its direct neighbors in . Let us denote the neighbor set of agent by . Then the set of edges can be written . We adopt this notation for the remainder of this paper.
IV-A Intuition for CDCG Algorithm
The goal of CDCG at a given node $i$ is to learn the local decision variable $\mathbf{x}_i$. CDCG is run at each node in $\mathcal{N}$ to assemble the collection $\{\mathbf{x}_i^T\}_{i \in \mathcal{N}}$, where $T$ is a given positive integer; this collection represents an approximate solution to Problem 2 and guarantees that each agent contributes at most one element to the solution. Then, by applying proper rounding techniques to each element of the collection such as those discussed in , , and , we obtain a solution to Problem 1. In the following sections, we show that this solution achieves the tight $(1-1/e)$ approximation of the optimal solution.
In the analysis of CDCG, we add the superscript to the vectors and defined in Algorithm 1. This superscript denotes the iteration number so that and represent the values of the local variables and at iteration respectively.
IV-B Description of Steps for CDCG
From the perspective of node , CDCG takes two arguments: nonnegative weights for each and a positive integer . The weights correspond to the row in a doubly-stochastic weight matrix and is the number of iterations for which the algorithm will run. The weight matrix is a design parameter of the problem and must fulfill a number of technical requirements that are fully described in Appendix A . Before any computation, the local decision variable is initialized to the zero vector.
Computation proceeds in rounds. In each round, the first step is to calculate the gradient of the multilinear extension function evaluated at the local decision variable from the previous iteration.
In line 3 of Algorithm 1, we calculate the ascent direction at iteration in the following way:
Intuitively, one can think of as the vector from the set that is most aligned with . To define the set , first define the set as the set of indices of the elements in that correspond to elements in . Then
Using this notation, we can equivalently define by
Next, in line 4 of Algorithm 1, is updated; in particular, we set
In this way, the governing principle is to collaboratively accumulate the local belief about the optimal decision and to then move in the approximate direction of steepest ascent from this point.
After rounds of computation at each node , we obtain a local decision variable at each node. By applying proper rounding techniques, we obtain a decision variable for each agent . Rounding in a decentralized manner is discussed in Section 5.1 of . The rounding techniques of  build on “pipage rounding”  and “swap rounding” , which are both centralized rounding techniques. The collection of these decision variables form the set , which represents our solution to Problem 1.
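The round structure described in this section (evaluate the gradient at the previous local iterate, pick the best-aligned direction from the local constraint set, then mix neighbors' variables and step along that direction) can be sketched as follows. This is a rough illustration under several assumptions that are not taken from Algorithm 1: the update is assumed to be a weighted average of all agents' previous variables plus `step` times the ascent direction, `step` defaults to $1/T$, the ascent direction is assumed to be the indicator of the best local coordinate, and the gradient is computed by exact enumeration. All names are hypothetical.

```python
from itertools import product

def multilinear(f, ground, x):
    """Exact multilinear extension by subset enumeration (tiny ground sets only)."""
    total = 0.0
    for mask in product([0, 1], repeat=len(ground)):
        prob = 1.0
        for e, bit in zip(ground, mask):
            prob *= x[e] if bit else 1.0 - x[e]
        total += prob * f({e for e, bit in zip(ground, mask) if bit})
    return total

def gradient(f, ground, x):
    """Partial derivative in x[e] equals F(x with x[e]=1) - F(x with x[e]=0)."""
    return {e: multilinear(f, ground, {**x, e: 1.0})
               - multilinear(f, ground, {**x, e: 0.0}) for e in ground}

def cdcg_sketch(f, blocks, W, T, step=None):
    """Assumed update: x_i <- sum_j W[i][j] * x_j + step * v_i, where v_i is the
    indicator of the coordinate in agent i's block with the largest gradient."""
    step = 1.0 / T if step is None else step
    ground = [e for block in blocks for e in block]
    n = len(blocks)
    x = [{e: 0.0 for e in ground} for _ in range(n)]  # zero initialization
    for _ in range(T):
        nxt = []
        for i in range(n):
            grad = gradient(f, ground, x[i])            # gradient at previous iterate
            best = max(blocks[i], key=lambda e: grad[e])  # ascent direction
            mixed = {e: sum(W[i][j] * x[j][e] for j in range(n)) for e in ground}
            mixed[best] += step                         # move along the direction
            nxt.append(mixed)
        x = nxt
    return x

# Hypothetical two-agent instance with uniform weights on a complete graph.
FOOTPRINT = {"a1": {1, 2}, "a2": {3}, "b1": {1, 2}}
def f(S):
    return len(set().union(*(FOOTPRINT[e] for e in S)))
BLOCKS = [["a1", "a2"], ["b1"]]
W = [[0.5, 0.5], [0.5, 0.5]]
```

Calling `cdcg_sketch(f, BLOCKS, W, T=10)` returns one fractional vector per agent; a rounding step, which is not shown here, would still be needed to obtain a set. The step size is exposed as a parameter precisely because its value in the sketch is an assumption.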
V Convergence Analysis
The main result in this paper is to show that in the distributed setting of Problem 2, CDCG achieves a tight multiplicative approximation of the optimal solution. The following theorem summarizes this result.
Consider the CDCG algorithm described in Algorithm 1. Let denote the global maximizer of the optimization problem defined in Problem 2, and assume that a positive integer and a doubly-stochastic weight matrix are given. Then provided that the assumptions outlined in Appendix A hold, for all nodes , the local variables obtained after iterations satisfy
Succinctly, Theorem 1 means that the sequence of local iterates generated by CDCG achieves the optimal approximation ratio $(1-1/e)$ and that the error term vanishes at a linear rate of $O(1/T)$. That is,
$$F(\mathbf{x}_i^T) \geq (1 - 1/e)\, F(\mathbf{x}^*) - O(1/T),$$
which implies that each agent reaches an objective value larger than $(1-1/e)\,\mathrm{OPT} - \epsilon$ after $O(1/\epsilon)$ rounds of communication. Previous work can only guarantee an objective value of $(1/2)\,\mathrm{OPT}$. The proof of this result will be provided in the arXiv version of this paper.
VI Simulation Results
To evaluate the proposed algorithm, we consider a multi-agent area coverage problem. In this setting, each agent is constrained to move in a two-dimensional grid. We assume that each agent has a finite sensing radius so that it can observe those grid points that lie within a square whose side length is determined by this radius. The objective is for the agents to collectively maximize the cardinality of the union of their observation sets of grid points. In other words, given an initial configuration, the problem is to choose an action for each agent that maximizes the overall coverage of the grid.
Consider an initial configuration of agents in states for with the dynamic constraint , where is a control input from a discrete set Elements from this set represent the admissible actions for each agent in the two-dimensional grid.
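The coverage objective of this experiment can be sketched as below. The sketch assumes that an agent at a grid point observes the axis-aligned square extending `radius` cells in each direction and clipped to the grid, and that the admissible controls are unit moves plus staying in place; both choices, along with all names, are assumptions for illustration rather than the settings used in the simulations.

```python
def observed_cells(pos, radius, grid_size):
    """Grid points in the square of half-width `radius` centered at `pos`,
    clipped to a grid_size x grid_size grid (assumed sensing footprint)."""
    px, py = pos
    return {(x, y)
            for x in range(max(0, px - radius), min(grid_size, px + radius + 1))
            for y in range(max(0, py - radius), min(grid_size, py + radius + 1))}

def coverage(positions, radius, grid_size):
    """Cardinality of the union of all agents' observation sets."""
    covered = set()
    for p in positions:
        covered |= observed_cells(p, radius, grid_size)
    return len(covered)

# Hypothetical discrete control set: stay, right, left, up, down.
ACTIONS = [(0, 0), (1, 0), (-1, 0), (0, 1), (0, -1)]

def step(pos, u):
    """Apply a control input u to a position (state plus control)."""
    return (pos[0] + u[0], pos[1] + u[1])
```

Two agents at the same location cover no more than one agent does, whereas moving one of them by a single cell enlarges the union; this overlap is why the objective exhibits diminishing returns.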
In our simulation, we compared the performance of SGA against CDCG on the coverage task posed above for a variable number of agents. For simplicity, we assumed that the underlying communication graph used in CDCG was fully connected and that each value in the weight matrix was . A random initialization for each agent’s position and the coverages achieved by CDCG and SGA are shown in Figures 0(a), 0(b), and 0(c) respectively. We compared the performance of these algorithms across ten random initializations of starting locations for the agents; the mean performance of each algorithm and the respective standard deviations are shown in Figure 0(d). In each trial, we ran both algorithms 50 times, each of which produced a control input for each agent. For each initialization, we ran CDCG for iterations.
We also compared the coverages achieved by CDCG and SGA for a setting in which each agent’s starting position is the center of the grid. The results of this experiment are shown in Figure 0(e). We ran both algorithms a total of 15 times; we ran CDCG for iterations. Interestingly, SGA converges to a local maximum in this problem, whereas CDCG achieves the optimal value.
In this work, we described an approach for achieving the optimal approximation to a class of submodular optimization problems subject to a distributed partition matroid constraint. The algorithm we proposed outperforms the sequential greedy algorithm in two senses:
CDCG achieves the tight $(1-1/e)$ approximation for the optimal solution whereas SGA can only achieve a $1/2$ approximation.
CDCG imposes a limited communication structure on this problem, which allows for significant gains via parallelization. SGA is sequential in nature and therefore is not parallelizable.
We showed empirically via an area coverage simulation with multiple agents that CDCG outperforms the sequential greedy algorithm.
-  Y. Hu, H. Chen, J.G. Lou, and J. Li, “Distributed density estimation using non-parametric statistics,” 27th International Conference on Distributed Computing Systems, 2007.
-  B. Mirzasoleiman, A. Karbasi, R. Sarkar, and A. Krause, “Distributed submodular maximization,” Journal of Machine Learning Research, vol. 17, no. 238, pp. 1-44, 2016.
-  B. Schlotfeldt, D. Thakur, N. Atanasov, V. Kumar, and G. J. Pappas, “Anytime planning for decentralized multirobot active information gathering,” IEEE Robotics and Automation Letters, vol. 3, no. 2, pp. 1025-1032, 2018.
-  M. Zhong and C. G. Cassandras, “Distributed coverage control and data collection with mobile sensor networks,” IEEE Transactions on Automatic Control, vol. 56, no. 10, pp. 2445-2455, 2011.
-  A. Singh, A. Krause, C. Guestrin, and W. J. Kaiser, “Efficient informative sensing using multiple robots,” Journal of Artificial Intelligence Research, vol. 34, pp. 707-755, 2009.
-  K. Wei, Y. Liu, K. Kirchhoff, and J. Bilmes, “Using document summarization techniques for speech data subset selection,” Proceedings of NAACL-HLT, pp. 721-726, 2013.
-  D. Golovin and A. Krause, “Adaptive submodularity: theory and applications in active learning and stochastic optimization,” Journal of Artificial Intelligence Research, vol. 42, pp. 427-486, 2011.
-  J. Djolonga, S. Tschiatschek, and A. Krause, “Variational Inference in Mixed Probabilistic Submodular Models,” Advances in Neural Information Processing Systems 29, 2016.
-  A. Mokhtari, H. Hassani, and A. Karbasi, “Decentralized submodular maximization: bridging discrete and continuous settings,” arXiv preprint arXiv:1802.03825v1, 2018.
-  B. Mirzasoleiman, A. Karbasi, R. Sarkar, and A. Krause. “Distributed submodular maximization: Identifying representative elements in massive data,” Advances in Neural Information Processing Systems, 2013.
-  G. Calinescu, C. Chekuri, M. Pál, and J. Vondrák, “Maximizing a monotone submodular function subject to a matroid constraint,” SIAM Journal on Computing, vol. 40, no. 6, pp. 1740-1766, 2011.
-  B. Mirzasoleiman, A. Karbasi, R. Sarkar, and A. Krause. “Distributed submodular maximization: Identifying representative elements in massive data.” Advances in Neural Information Processing Systems, 2013.
-  R. Barbosa, A. Ene, H. Nguyen, and J. Ward. “The power of randomization: Distributed submodular maximization on massive datasets.” International Conference on Machine Learning, pp. 1236-1244, 2015.
-  G. L. Nemhauser, L. A. Wolsey, and M. L. Fisher, “An analysis of approximations for maximizing submodular set functions-I.” Mathematical Programming, vol. 15, no. 1., pp. 265-294, 1978.
-  G. L. Nemhauser and L. A. Wolsey, “Best algorithms for approximating the maximum of a submodular set function.” Mathematics of Operations Research, vol. 3, no. 3, pp. 177-188, 1978.
-  B. Gharesifard and S. L. Smith, “Distributed submodular maximization with limited information.” IEEE Transactions on Control of Network Systems, vol. 5, no. 4, pp. 1635-1645, 2017.
-  M. Corah and N. Michael, “Distributed submodular maximization on partition matroids for planning on large sensor networks.” IEEE Conference on Decision and Control (CDC), pp. 6792-6799, 2018.
-  D. Grimsman, M.S. Ali, J.P. Hespanha, and J.R. Marden, “The Impact of Information in Greedy Submodular Maximization,” IEEE Transactions on Control of Network Systems, 2017.
-  M. Corah and N. Michael, “Efficient online multi-robot exploration via distributed sequential greedy assignment,” Robotics: Science and Systems, 2017.
-  H. Hassani, M. Soltanolkotabi, and A. Karbasi, “Gradient methods for submodular maximization,” Advances in Neural Information Processing Systems, pp. 5841–5851. 2017.
-  A. Mokhtari, H. Hassani, and A. Karbasi, “Stochastic conditional gradient methods: From convex minimization to submodular maximization,” arXiv preprint arXiv:1804.09554, 2018.
-  N. Buchbinder, M. Feldman, J. S. Naor, and R. Schwartz. “A tight linear time (1/2)-approximation for unconstrained submodular maximization,” SIAM Journal on Computing, pp. 1384–1402, 2015.
-  N. Buchbinder, M. Feldman, J. S. Naor, and R. Schwartz, “Submodular maximization with cardinality constraints,” Proceedings of the twenty-fifth annual ACM-SIAM symposium on Discrete algorithms, pp. 1433-1452. Society for Industrial and Applied Mathematics, 2014.
-  A. Clark, B. Alomair, L. Bushnell, and R. Poovendran, “Scalable and distributed submodular maximization with matroid constraints,” 2015 13th International Symposium on Modeling and Optimization in Mobile, Ad Hoc, and Wireless Networks (WiOpt), pp. 435-442, 2015.
-  L. A. Wolsey, “An analysis of the greedy algorithm for the submodular set covering problem,” Combinatorica, 1982.
-  C. Chekuri, J. Vondrak, and R. Zenklusen. “Submodular function maximization via the multilinear relaxation and contention resolution schemes,” SIAM Journal on Computing, pp. 1831-1879, 2014.
Assumptions for Theorem 1
This is a trivial consequence of the multilinear extension , since is contained in the unit cube. Furthermore, we assume that the gradient of the multilinear extension of the objective function in Problem 1 is -Lipschitz continuous, i.e. that
so that by (10). Again, this is not a limiting assumption, because the domain of is compact, which implies the Lipschitzness of . Also, we assume that the norm of the gradient of is bounded over , i.e. that
which again follows from the compactness of the domain of . It is then easy to show that (12) and the multivariable mean value theorem imply that is -Lipschitz continuous over .111Note that in this case, since is the multilinear extension of , assumptions (10), (11), and (12) all hold. Moreover, the constants , , and all depend on the maximum singleton value of . For further justification, see -. Finally, it will be prudent to mention that for the multilinear extension of any monotone and submodular function , it holds that and
For justification, see .
Now consider the communication framework described in Section IV and the weight matrix . This matrix is a parameter that is designed to match the criteria and setting of a given application. We assume that the weights used in CDCG are nonnegative so that ; furthermore, if node , then . Also, we assume that the weight matrix is doubly stochastic and symmetric, and that . The assumptions made about are similar to those described in .
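One standard way to build a weight matrix with the properties listed above (nonnegative, zero for non-neighbors, symmetric, doubly stochastic) is the Metropolis rule. The sketch below is offered as an example construction only; it is not claimed to be the one used in the paper, and the function name and adjacency-list format are hypothetical.

```python
def metropolis_weights(neighbors):
    """Build a symmetric, doubly stochastic weight matrix from an undirected
    graph given as adjacency lists: neighbors[i] lists the neighbors of node i.
    Off-diagonal weight on edge (i, j) is 1 / (1 + max(deg_i, deg_j)); the
    diagonal absorbs the remaining mass so that each row sums to one."""
    n = len(neighbors)
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in neighbors[i]:
            W[i][j] = 1.0 / (1 + max(len(neighbors[i]), len(neighbors[j])))
        W[i][i] = 1.0 - sum(W[i])  # diagonal is still zero when summed here
    return W

# Hypothetical path graph 0 - 1 - 2.
PATH = [[1], [0, 2], [1]]
```

For the path graph, nodes 0 and 2 are not neighbors, so the corresponding entries of the matrix are zero, while every row and column still sums to one.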
Lastly, consider that past work has studied the case in which the objective function is distributed . However, our setting is one in which the problem is distributed in the constraints rather than the objective. Therefore, we assume that each agent has access to an oracle for computing the objective submodular function .
In this appendix, we offer proofs of lemmas that support the proof of Theorem 1.
In general, the goal of Lemma 1 is to show that the local decision variable $\mathbf{x}_i^t$ for each agent converges to the mean $\bar{\mathbf{x}}^t$. Then, in Lemma 2, we show that these means are Cauchy, meaning that for a sufficiently large number of iterations $T$, the distance between $\bar{\mathbf{x}}^{t}$ and $\bar{\mathbf{x}}^{t+1}$ becomes arbitrarily small. Together, Lemma 1 and Lemma 2 establish that for a sufficiently large number of iterations, the nodes come to a consensus on the optimal decision. Lemma 4 and Lemma 5 are technical results used in the proof of Theorem 1.
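The consensus mechanism behind Lemma 1 can be illustrated in isolation. The sketch below repeatedly mixes scalar values with a symmetric doubly stochastic matrix and records the largest deviation from the mean after each round. It deliberately omits the ascent step of CDCG, so it shows only the averaging effect; the matrix `W_PATH` and the initial values are hypothetical.

```python
def mix(W, values):
    """One round of weighted averaging: each node replaces its value with a
    convex combination of the values held by all nodes."""
    n = len(values)
    return [sum(W[i][j] * values[j] for j in range(n)) for i in range(n)]

def consensus_gaps(W, values, rounds):
    """Return the final values and, for each round, the largest deviation of
    any node's value from the (preserved) network-wide mean."""
    mean = sum(values) / len(values)
    gaps = []
    for _ in range(rounds):
        values = mix(W, values)
        gaps.append(max(abs(v - mean) for v in values))
    return values, gaps

# Hypothetical symmetric doubly stochastic weights on the path graph 0 - 1 - 2.
W_PATH = [[2 / 3, 1 / 3, 0.0],
          [1 / 3, 1 / 3, 1 / 3],
          [0.0, 1 / 3, 2 / 3]]
```

Because the columns of the matrix sum to one, the mean is unchanged by mixing, and the deviation from it shrinks geometrically at a rate governed by the second largest eigenvalue magnitude of the matrix.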
For any iteration where , it follows that the Euclidean distance between the local variable at node and the mean of the local variables can be bounded by
where $\beta$ is the magnitude of the eigenvalue of $\mathbf{W}$ that has the second largest magnitude among all eigenvalues of $\mathbf{W}$.
Define $\mathbf{x}^t$ and $\mathbf{v}^t$ as the concatenations of the local variables and ascent directions in CDCG at iteration $t$. The update rule for the local variables in Algorithm 1 leads to the expression
Next, if we premultiply both sides of (14) by the matrix , which is the Kronecker product of the matrices and , we obtain
The left hand side of (15) can be simplified to
where the first inequality follows from the Cauchy-Schwarz inequality and the fact that the norm of a matrix does not change if we take its Kronecker product with the identity matrix. The second inequality holds because
. Note that the eigenvectors of the matricesand are the same for all . Therefore, the largest eigenvalue of is 1 with eigenvector and the second largest magnitude of the eigenvalues is , where is the second largest magnitude of the eigenvalues of . Also note that because is an eigenvector of , it follows that all of the other eigenvectors of are orthogonal to since is symmetric. Hence we can bound the norm by . Applying this substitution to the right hand side of (18) yields
Since , we find that
For any iteration for , the Euclidean distance between the means and of the local variables and respectively for at consecutive iterations and can be bounded by
Averaging both sides of the update rule for of Algorithm 1 across the set of agents yields the following expression for :
Since if , we can rewrite the RHS of (22) in the following way:
Note that because the Euclidean distance between points of the polytope is assumed to be bounded, . The expression in (21) follows. ∎
Let . Then the vector is in the constraint set .
In Lemma 1 we proved that converges to . We show that by induction. Because we assign , it is clear that . Now as inductive hypothesis, we assume that is in . Observe that we can write . Thus by the inductive hypothesis and the fact that , it follows that is a convex combination of elements of . That is, we can write . Therefore , and so converges to a point in . ∎
Let be the multilinear extension of a monotone submodular function where is a discrete ground set. Then
where denotes the projection of onto the set .
Let be the multilinear extension of a monotone submodular function where is a discrete ground set. Then
Proof of Theorem 1
This Appendix establishes the main result of this paper.
Due to the assumption that is -Lipschitz,
Here (30) follows from the linearity of inner products and then from adding and subtracting . Our immediate goal is to bound (30) from below. To do so, consider that by the Cauchy-Schwarz inequality,
where (31) is due to the assumption that is -Lipschitz continuous and (32) follows from Lemma 1. Next, because is defined as the argmax between and vectors in the Step 3 of Algorithm 1 and by Lemma 4 we have
By Lemma 5, if we let , we can conclude that
By construction, since . Then we can infer from (34) that