Black-box or zero-order optimization is a recurring problem that arises in many real-world settings, ranging from robotics 10.5555/1625275.1625428 and sensor networks 10.1145/1791212.1791238, to hyperparameter tuning in applications of machine learning 10.5555/2986459.2986743. Often in such settings, the underlying function that maps variables to a reward (loss) is unknown and can be costly to query. Thus, several methods for black-box optimization have been developed to address this problem 7352306; 10.5555/1070432.1070486; 10.1162/106365601750190398; powell1973search, including variants that explicitly account for the cost to update different variables in the system.
While the precise function that maps input variables to an output loss may be unknown, real-world systems typically have additional structure. A common structural assumption is that the system is composed of a sequence of operations, each with its own variables, which produces modular structure that affects the overall cost of updating variables throughout. When analyzing data in a variety of scientific settings, inferences often involve multiple stages of processing that are chained together sequentially into a pipeline. These situations arise in a wide range of fields, from genomics davis2017genomics and neuroimaging abraham2014machine; johnson2019toward, to robotics chu2018real and computer vision chowdhury2019quantifying; ronneberger2015u. When optimizing systems in an end-to-end manner, the cost not only to query, but also to switch to new variable settings at early stages in a pipeline, can be prohibitive. Despite a long history of black-box optimization, the role of modular (block-based) structure and switching costs has not yet been explored. Hence, a systematic optimization method that leverages both cost and modular structure is needed to optimize complex scientific pipelines moving forward.
In light of these motivations, we introduce a new switch cost-aware algorithm for black-box optimization called Lazy Modular Bayesian Optimization (LaMBO). This method leverages modular structure in a system to reduce overall cumulative costs during optimization (Algorithm 1). To quantify the cost of switching in modular systems, we model the cost of each query as the aggregate cost of rerunning modules from the first point in the processing chain where a variable must be updated. In this scenario, when variables at later stages of processing are updated, the outputs from earlier modules are frozen (stored) and can be used to facilitate downstream optimization until it is necessary to switch variables early on. This idea can be codified with the notion of the movement augmented regret, which measures both the functional optimality and the cost of changing variables at early stages of processing. We show that, by encouraging the optimization method to be lazy, i.e., to minimize variable switching in early modules, LaMBO achieves a sublinear rate in terms of this new form of regularized regret. To the best of our knowledge, LaMBO is the first algorithm for incorporating modular system structure into BO with strong theoretical guarantees.
To empirically evaluate the performance of the proposed method, we applied LaMBO to a number of synthetic datasets used in the literature. When compared with traditional Bayesian optimization (BO) baselines Srinivas:2010:GPO:3104322.3104451; movckus1975bayesian; wang2017max and cost-aware variants snoek2012practical; lee2020costaware; abdolshah2019cost, we find that our method outperforms these methods in terms of the trade-off between deviation from the global optimum and the cumulative cost. We next apply LaMBO to a problem arising in neuroscience where the goal is to produce a 3D segmentation of brain structure from high-resolution imaging data lee2019convolutional; johnson2019toward. In this application, we are tasked with end-to-end optimization of a three-module pipeline for 3D reconstruction of neuroanatomical structures from slices of 2D images. The three modules correspond to three sequential operations: (i) data pre-processing, (ii) pixel-level semantic segmentation with a deep neural network ronneberger2015u, and (iii) data post-processing steps to form a 3D reconstruction. Our empirical results show that the hyperparameters can be jointly optimized across multiple modules to near-optimal performance within 1.4 hours, compared with 5.6 hours for the best of the alternatives.
Summary of Contributions:
(i) In Section 3, we propose LaMBO, a switch-cost-aware method to solve black-box optimization in systems with modular structure. To the best of our knowledge, this is the first attempt to leverage modular system structure in the design of a cost-efficient algorithm for black-box optimization with theoretical guarantees. (ii) In Section 4, we analyze the performance of LaMBO using techniques from both the multi-armed bandit and BO literature and provide conditions under which the method achieves sublinear movement regret. (iii) In Section 5, we test our method on synthetic functions and demonstrate that the method can effectively solve switch cost-aware optimization across modular compositions of functions. (iv) In Section 5, we apply our method to a 3D brain image segmentation task, where the processing steps are represented in a sequential-block structure. We show that by minimizing variable switching in early modules, we can optimize our system while also reducing the total cost needed.
2 Background and Related Work
Black-box optimization methods aim to find the global minimum of an unknown function with only a few queries. Let and be the optimal function value and optimizer, respectively. Standard algorithms seek to produce a sequence of inputs that result in (potentially) noisy observations such that will approach the optimal value quickly. A common choice to measure performance of a candidate algorithm is the cumulative regret: . Among the many different approaches for black-box optimization, BO is a celebrated probabilistic method whose statistical inferences are tractable and theoretically grounded. It uses a Gaussian process (GP) prior on the distribution of the unknown function , which is characterized by a mean function and a kernel function . Let , , and
represent the noise variance. In this case, we can update the posterior with simple closed-form formulas:
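For reference, the closed-form update in its standard textbook form is sketched below. The notation is an assumption, since the original symbols are not recoverable from this text: a zero prior mean, $k_t(x) = [k(x_1, x), \dots, k(x_t, x)]^\top$, $K_t = [k(x_i, x_j)]_{i,j \le t}$, observations $y_{1:t}$, and noise variance $\sigma^2$.

```latex
\begin{aligned}
\mu_t(x)      &= k_t(x)^\top \left(K_t + \sigma^2 I\right)^{-1} y_{1:t}, \\
k_t(x, x')    &= k(x, x') - k_t(x)^\top \left(K_t + \sigma^2 I\right)^{-1} k_t(x'), \\
\sigma_t^2(x) &= k_t(x, x).
\end{aligned}
```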
Common algorithms that use a BO framework include the Upper Confidence Bound (UCB) Srinivas:2010:GPO:3104322.3104451, Expected Improvement (EI) 10.1007/BFb0006170, and entropy search wang2017max algorithms. At the heart of all of these methods is the design of an acquisition function that is used to select the next evaluation point, i.e., . These functions make trade-offs between exploration and exploitation and are constructed using the posterior statistics. In this paper, we will use BO as a subroutine and adopt the UCB acquisition function due to its success in both theory and practice. The GP-UCB acquisition function is given by , where is a design parameter that controls the amount of exploration in the algorithm.
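As an illustration of how a confidence-bound acquisition rule uses the posterior statistics, the following is a minimal sketch and not the authors' implementation. It assumes a zero-mean GP with a unit-variance squared exponential kernel, a finite candidate set, and a loss that is minimized, so the rule picks the lowest value of mean minus a scaled standard deviation. The function names and the default values of `noise_var` and `beta` are hypothetical.

```python
import numpy as np

def sq_exp_kernel(a, b, lengthscale=1.0):
    """Squared exponential kernel matrix between the rows of a and b."""
    sq_dist = (np.sum(a**2, axis=1)[:, None]
               + np.sum(b**2, axis=1)[None, :]
               - 2.0 * a @ b.T)
    return np.exp(-0.5 * np.maximum(sq_dist, 0.0) / lengthscale**2)

def gp_posterior(x_train, y_train, x_query, noise_var=1e-2):
    """Zero-mean GP posterior mean and standard deviation at x_query."""
    k_tt = sq_exp_kernel(x_train, x_train) + noise_var * np.eye(len(x_train))
    k_tq = sq_exp_kernel(x_train, x_query)
    solved = np.linalg.solve(k_tt, k_tq)        # (K + noise * I)^{-1} k(x)
    mean = solved.T @ y_train
    var = 1.0 - np.sum(k_tq * solved, axis=0)   # k(x, x) = 1 for this kernel
    return mean, np.sqrt(np.maximum(var, 0.0))

def next_query_lcb(x_train, y_train, candidates, beta=4.0):
    """Pick the candidate minimizing mean - sqrt(beta) * std (assumed loss setting)."""
    mean, std = gp_posterior(x_train, y_train, candidates)
    return candidates[np.argmin(mean - np.sqrt(beta) * std)]
```

A larger `beta` weights the uncertainty term more heavily and therefore favors exploration, which matches the role of the design parameter described above.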
Cost-aware Bayesian optimization is a topic that has received recent attention snoek2012practical; gardner2014bayesian. Instead of trying to minimize a function purely using the fewest samples, cost-aware methods strive to find the optimizer with the least cumulative cost. Common strategies take the cost explicitly into account in the design of the acquisition function snoek2012practical; abdolshah2019cost; lee2020costaware. Other approaches use dynamic programming to solve this problem with cumulative budget constraints lam2016bayesian; lam2017lookahead. A closely related topic is multi-fidelity Bayesian optimization kandasamy2016gaussian; kandasamy2017multi; sen2018multi; song2018general, where one may choose to make trade-offs between function accuracy and evaluation cost.
Slowly Moving Bandit Algorithm:
In contrast to previous approaches where cost depends on current inputs, cost in our setting depends on how variables change between iterations. A related topic lies in the Multi-armed Bandit (MAB) literature auer2002finite; arora2012online, under the special setting of switching costs kalai2005efficient; NIPS2017_7000; pmlr-v65-koren17a; dekel2014bandits; feldman2016online. In this setting, optimization is cast into a selection problem where optimal variables (arms) are selected from a set to minimize an unknown loss function . At each iteration , we can query an oracle to measure the loss (inverse reward) by pulling arm . In the switch-cost-aware case, there is a cost metric which incurs cost when switching between arms from to . The objective is to minimize a linear combination of loss and switching cost. In pmlr-v65-koren17a, the authors propose the slowly moving bandit algorithm (SMB) to tackle the problem with a general cost metric. Here, we extend the idea to the setting of black-box optimization with modular structure.
Because this method will be employed in our later algorithm, we will describe it in more detail. SMB is based on a multiplicative update strategy auer2002finite that encodes the cost of switching between arms in a tree; each arm is a leaf and the cost to switch from one arm to another is encoded in the distance between their corresponding leaves in the tree. At each iteration , SMB chooses an arm according to a probability distribution conditioned on the level of the tree (the root is level ) selected at the last iteration. We will make the sampling distribution precise momentarily. The distribution is then updated with a standard multiplicative update rule , where is the learning rate and is the estimated loss. Compared with basic bandit algorithms, there are two key modifications in SMB. First, it uses conditional sampling to encourage slow switching. This constrains the arm selection to be close to the previous choice, where distance is embedded in the tree’s structure. Formally, an arm is drawn according to the following conditional distribution, where is a random level chosen at the previous iteration, and denotes the leaves (arms) that belong to the subtree rooted at level which has as one of its leaves. This ensures that remains in the same subtree as in the previous iteration. Second, to utilize the classic multiplicative method, SMB makes sure that, on average, the conditional sampling is equivalent to direct sampling by modifying the loss estimators as,
where is an unmodified loss estimator for algorithms without switching cost, and
are i.i.d. uniform random variables in.
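The two ingredients described above, a multiplicative weight update and sampling restricted to a subtree, can be sketched as follows. This is a simplified illustration under stated assumptions and not the full SMB algorithm: it omits the random level selection and the modified loss estimator, and the function names and the explicit list of subtree leaves are hypothetical.

```python
import numpy as np

def multiplicative_update(probs, est_losses, lr):
    """Exponential-weights step: p(i) <- p(i) * exp(-lr * loss(i)), renormalized."""
    weights = probs * np.exp(-lr * est_losses)
    return weights / weights.sum()

def sample_within_subtree(probs, subtree_leaves, rng):
    """Draw an arm from probs conditioned on the arm lying in subtree_leaves."""
    leaves = np.asarray(subtree_leaves)
    cond = probs[leaves] / probs[leaves].sum()
    return int(rng.choice(leaves, p=cond))
```

Restricting the draw to the leaves of the subtree that contains the previous arm is what keeps consecutive choices close in the tree metric.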
3 Lazy Modular Bayesian Optimization (LaMBO)
3.1 Problem Setup
A key assumption underlying this work is that the black-box system of interest is modular, or consists of a sequence of different operations, each consisting of a distinct set of variables. To make this precise, let denote the variables in the module, and let denote the set of all variables when jointly considering the entire system (end-to-end optimization). Our aim is to optimize all variables jointly or to find . The function is unknown in advance, but when given variables are input into the system, this generates a noisy output , where is -sub-Gaussian. We also assume there is a cost associated with switching a variable in the module, where . In our image analysis pipeline example (Figure 1a), costs can be thought of as the time or amount of compute required to update a variable in a specific stage/module of the overarching pipeline.
To trade off cost efficiency against functional optimality, we define the movement augmented regret as,
where is the movement cost at time , defined as , and is an indicator that equals to when any variables in modules before the module have been changed from the previous iteration. We note that this definition implicitly assumes that the cost to run the final module () is negligible.
serves as a regularizer which is added to the standard definition of the cumulative regret. In general, the function value and the cost are measured in different units, so serves as a parameter to trade off the two quantities.
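To make the cost model concrete, the sketch below computes a per-iteration movement cost and the resulting regularized regret. It rests on an assumed reading of the definition: a change in any module triggers the cost of rerunning that module and every later costed module, and the final module is left out because its cost is treated as negligible. The function names and the argument layout are hypothetical.

```python
def movement_cost(prev_params, new_params, module_costs):
    """Cost of rerunning every costed module from the first changed one onward.

    prev_params / new_params: one entry per costed module, in pipeline order.
    module_costs: cost of rerunning each of those modules.
    """
    for idx, (old, new) in enumerate(zip(prev_params, new_params)):
        if old != new:
            return sum(module_costs[idx:])
    return 0.0

def movement_augmented_regret(losses, optimum, move_costs, lam):
    """Cumulative regret plus lam times the cumulative movement cost."""
    return sum(loss - optimum for loss in losses) + lam * sum(move_costs)
```

With costs of 10 and 1 for two costed modules, changing only the second module costs 1, whereas changing the first costs 11, which is why lazy switching in early modules pays off.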
In Algorithm 1, we provide pseudocode for LaMBO. This algorithm uses SMB pmlr-v65-koren17a to impose switching costs on the sampling system and couples this method with a BO strategy to achieve our objective. To step through the method, we point the reader to Figure 1. In this example, we depict a system that consists of two modules, partitioned into two and three sub-regions (subsets), respectively (Figure 1B). This partitioning of the space is then translated into a tree (Figure 1C) which encodes the cost to switch variables based upon the distance between the two partitions, represented as nodes on the tree. Finally, after selecting a joint variable subset of the space (arm) we use a BO strategy to estimate the underlying function of interest using Gaussian Process regression within each leaf (visualized in Figure 1 on the right). We now step through the details of the proposed approach.
Step 1) Modular Structure Embedding Phase. To apply SMB as part of our optimization phase, we first need to encode the switching costs associated with the system of interest. We start by linking each arm with a region (subset) of variable space. The regions are flexible and can be partitioned in different ways, but should reflect the modular structure in the system. Thus, we choose to partition the variable space of each module separately. Specifically, defines a partition for the module, where . We require these sets to be disjoint for . Thus, when selecting an arm, we select a joint region of the first modules, i.e., . (Footnote 1: We exclude the last module from the partitioning procedure since the cost of changing parameters in the last module is the minimum cost per iteration, and its parameters can be changed freely at each iteration.)
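As a small illustration of how arms arise from per-module partitions, the sketch below enumerates every joint region as one region index per partitioned module. The interval partitions in the usage note are hypothetical and serve only to show the counting.

```python
from itertools import product

def enumerate_arms(partitions):
    """Each arm picks one region index per partitioned module."""
    return list(product(*(range(len(regions)) for regions in partitions)))
```

For example, `enumerate_arms([[(0.0, 0.5), (0.5, 1.0)], [(0.0, 0.3), (0.3, 0.6), (0.6, 1.0)]])` yields six arms, one for each pairing of a region in the first module with a region in the second.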
Next, we represent the arms in a tree. Intuitively, we want the shortest path between two leaves of the tree to encode the cost of switching between the corresponding sets of hyperparameters (arms). Specifically, in Line of Algorithm 1, we call a subroutine ConstructMSET which returns a tree (modular structure embedding tree, MSET), given a partitioning of the variables across all modules and depth parameters , where is the depth of the m-th module. The partition and modular specification define the leaves of the tree and the depth parameters control the probability of switching, with higher depth in a module corresponding to lower switching probability (more laziness). In our example (Figure 1C), the tree consists of two parts (colored blue and red) divided by the first forks: the upper portion corresponds to the partition of the first module, while the lower portion corresponds to the partition of the second module. In this case, the depth in the second module is set to 3 to reflect higher relative costs between the two modules and encourage lazy switching behavior.
Step 2) Optimization Phase. The remaining task is to devise a strategy for arm selection and to estimate the local optimum within the corresponding variable subset. We propose to use SMB for region (arm) selection, and BO to search within the selected region (Line ). The parameters of SMB and BO are updated at each iteration (Line ). Unfortunately, direct application of BO changes all variables at each iteration, which typically incurs the maximum cost. Hence, we propose an alternative lazy strategy: when the same variable subset is selected in an early module, we reuse the results from the previous iteration rather than updating the outputs from this lazy module. This means that we do not need to rerun the module and thus can minimize the overall cost. Specifically, let be the arm we have selected and be its associated variable region. We propose to search for a block-wise update that minimizes the loss as follows:
where is the first module that has a variable region that differs from the previous iteration , and is a BO acquisition function.
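The lazy strategy can be pictured as a pipeline that caches each module's output and reruns only from the first module whose parameters changed. The class below is a sketch of that caching idea under stated assumptions, not the authors' code: the class name, the `f(inputs, params)` module signature, and the assumption that the input data is identical across queries are all hypothetical.

```python
class LazyPipeline:
    """Sequential pipeline that caches each module's output between queries."""

    def __init__(self, modules):
        self.modules = modules              # list of callables f(inputs, params)
        self.prev_params = None
        self.cached_outputs = []            # per-module outputs from the last run
        self.run_counts = [0] * len(modules)

    def __call__(self, data, params):
        start = 0
        if self.prev_params is not None:
            changed = [i for i, (old, new)
                       in enumerate(zip(self.prev_params, params)) if old != new]
            # Rerun from the first changed module; if nothing changed,
            # rerun only the last module.
            start = changed[0] if changed else len(self.modules) - 1
        # Reuse the frozen output of the module just before the restart point.
        value = data if start == 0 else self.cached_outputs[start - 1]
        self.cached_outputs = self.cached_outputs[:start]
        for idx in range(start, len(self.modules)):
            value = self.modules[idx](value, params[idx])
            self.cached_outputs.append(value)
            self.run_counts[idx] += 1
        self.prev_params = list(params)
        return value
```

Tracking `run_counts` makes the saving visible: queries that only change late-stage parameters never increment the counters of earlier modules.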
4 Algorithmic Analysis
4.1 Theoretical Results
In this section, we analyze the performance of Algorithm 1. The main result, which is stated in Theorem 1, shows that LaMBO achieves sublinear movement regret when the parameters of the input tree are set properly based upon the cost structure of the system. To state our result, we first introduce some mild assumptions below.
Assumption 1. The function is -Lipschitz, non-negative, and has bounded norm in the reproducing kernel Hilbert space . (Footnote 2: can be estimated by ; for instance, for exponential kernel since .)
The following lemma shows that LaMBO is capable of accumulating sublinear cost by proper setup of the depth parameters of MSET. The proof is constructive and provides analytical results for the depths based on the costs. The detailed formulas can be found in Supp. A.2.
Lemma 1 (Cumulative Switching Cost). For sufficiently large , there exist depth parameters of the MSET such that LaMBO accumulates movement cost
The next theorem shows that a simple partition strategy, along with ’s set according to the previous lemma, gives sublinear movement regret. Without additional information about how to partition each module, the simplest way to partition the space is uniformly. Hence, in the analysis we adopt a uniform partition strategy characterized by , where denotes the Euclidean diameter of the partitioned subset . The result is stated in terms of the maximum information gain, which is defined below.
Definition 1. Maximum Information Gain. Let be defined in the domain . The observation of at any is given by the model , . For any set , let and denote the set of function values and observations at points in , and denote the Shannon Mutual Information. The Maximum Information Gain is defined by . (Footnote 3: In the BO literature, is commonly used to specify the smoothness of the function class; see Supp. A.1.)
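For a GP prior with Gaussian observation noise, the mutual information between function values and observations at a fixed set of points has the well-known closed form of one half times the log-determinant of the identity plus the kernel matrix divided by the noise variance. The sketch below evaluates that quantity for one given set; the maximum information gain is the largest such value over all sets of a given size, which this sketch does not search for. The function name is hypothetical.

```python
import numpy as np

def information_gain(kernel_matrix, noise_var):
    """I(y_A; f_A) = 0.5 * log det(I + K_A / noise_var) for a GP with Gaussian noise."""
    n = kernel_matrix.shape[0]
    _, logdet = np.linalg.slogdet(np.eye(n) + kernel_matrix / noise_var)
    return 0.5 * logdet
```

Using `slogdet` rather than `det` avoids overflow or underflow when the set of points is large.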
Now we present our main theoretical result; a proof and detailed constants can be found in Supp. A.2.
Theorem 1 (Regret Bound for LaMBO). For , let denote the dimension of . Suppose for all , , we set , . The MSET has a uniform partition of each with diameters , where the depth parameters are chosen according to Lemma 1, and the UCB acquisition function is used. Then LaMBO achieves the expected movement regret
The movement metric in our problem is not Lipschitz so Theorem in pmlr-v65-koren17a is not applicable; however, our result achieves a similar rate. This is due to our strategy that variables stay the same when the same subset has been selected, in addition to the fact that our loss estimator leverages correlation between arms.
4.2 Implementation Details
A crucial part of the algorithm is the construction of the MSET, which involves partitioning the variables in each module and setting the depth parameters (’s). For an MSET with leaves to choose from, LaMBO requires solving local BO optimization problems per iteration. Hence, we initially partition each variable space of module into only two subsets, and abandon subsets when their arm selection probability is below some threshold. In our experiments, we always set it to be , where denotes the number of leaves of the MSET. After that, we further divide the remaining subsets to increase the resolution. This procedure could be iterated further, although we typically do not go beyond two stages of refinement. In our implementation, we typically fix and set when are estimated. In both cases, we dynamically increase the parameters by every iterations when the frequency of changing variables in the module exceeds a threshold (default ). Empirically, we found that the performance is quite robust in the range for the different cost ratios we tested.
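The pruning step described above can be sketched as follows. The threshold value is left as a free argument because it is not recoverable from this text, and renormalizing the surviving probabilities is an assumption of this sketch rather than a documented detail of the method; the function name is hypothetical.

```python
import numpy as np

def prune_arms(probs, threshold):
    """Drop arms whose selection probability is below threshold; renormalize the rest."""
    probs = np.asarray(probs, dtype=float)
    keep = np.flatnonzero(probs >= threshold)
    return keep, probs[keep] / probs[keep].sum()
```

The returned indices identify the subsets that remain candidates for a further round of subdivision.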
In this section, we evaluate the performance on two different tasks and compare with other methods for BO. First, we test LaMBO on benchmark synthetic functions used in other studies vellanki2017process; kirschner2019adaptive. Following this, we study LaMBO’s performance on a neuroscience application where multiple modules are jointly optimized to maximize end-to-end performance.
For synthetic benchmarks we selected the 6D Hartmann, 6D Rastrigin, and 6D Griewank functions, as well as the 8D Ackley function. To simulate a -module scenario, we partition the dimension space of the modules to be , , , and , respectively. To emphasize the effect of accumulation of cost in early stages of processing, we set the cost ratio between the first and second module to 10 to 1. The sampling noise is assumed to be independent Gaussian with standard deviation. For simplicity, we used the squared exponential kernel and initialized it using random samples before starting the inference procedure. In our experiments, the functions are normalized by their maximum absolute value for clear comparisons, the regularization parameter is fixed to , the UCB parameter is set each iteration as , and the learning rate is set to . For construction of the MSET, we test on the simplest case where and partition the variable space in each module into 2 sets aligned with a random coordinate. In practice, we found that some adaptive strategies could boost the performance (see Supp. B.2 for further details). The curves on synthetic data and real data were computed by averaging across and simulations, respectively, to provide information about the variability across different random initializations.
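For readers who want to reproduce the flavor of these benchmarks, the sketch below gives the standard definitions of three of the test functions, each with global minimum 0 at the origin, together with a helper that splits a flat variable vector into two module blocks. The split point is a free argument because the dimension splits used in the experiments are not recoverable from this text, and the helper name is hypothetical. The Hartmann function is omitted because it requires a table of constants.

```python
import numpy as np

def ackley(x):
    """Ackley function with the usual constants a=20, b=0.2, c=2*pi."""
    x = np.asarray(x, dtype=float)
    return (-20.0 * np.exp(-0.2 * np.sqrt(np.mean(x**2)))
            - np.exp(np.mean(np.cos(2.0 * np.pi * x))) + 20.0 + np.e)

def rastrigin(x):
    """Rastrigin function: 10 d + sum(x_i^2 - 10 cos(2 pi x_i))."""
    x = np.asarray(x, dtype=float)
    return 10.0 * x.size + np.sum(x**2 - 10.0 * np.cos(2.0 * np.pi * x))

def griewank(x):
    """Griewank function: 1 + sum(x_i^2) / 4000 - prod(cos(x_i / sqrt(i)))."""
    x = np.asarray(x, dtype=float)
    idx = np.arange(1, x.size + 1)
    return 1.0 + np.sum(x**2) / 4000.0 - np.prod(np.cos(x / np.sqrt(idx)))

def split_modules(x, first_dim):
    """Split a flat variable vector into (module 1, module 2) blocks."""
    x = np.asarray(x, dtype=float)
    return x[:first_dim], x[first_dim:]
```

Treating the leading coordinates as the first module mimics a pipeline in which changing those coordinates is the expensive operation.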
We compared LaMBO with common baselines GP-UCB Srinivas:2010:GPO:3104322.3104451, GP-EI movckus1975bayesian, Max-value entropy search wang2017max, random sampling, and three cost-aware strategies: EIpu snoek2012practical, CA-MOBO abdolshah2019cost, and CArBO lee2020costaware. To adapt the cost-aware strategies to our setting, we set the sampling cost associated with switching variables in modules. To start, we tested the methods on two synthetic functions (Figure 2A-B; Supp. B.3 provides results for two additional synthetic functions). Empirically, we find that in most of the synthetic cases tested, LaMBO performs similarly to other methods early on but starts to outperform the alternatives with further iterations. This could be explained by inaccurate estimation of the function at early stages, and by the fact that aggressive input changes could then outperform the more conservative or lazy strategy used in LaMBO. However, as more samples are gathered, LaMBO becomes more cost-efficient by being lazy in variable switching. The traditional cost-aware strategies do not perform well because they use static strategies to treat the dynamic switching costs and ignore the effect of modular structure. We also complement our findings by verifying the theory. We use the 6D Hartmann function to study the convergence of LaMBO in terms of the cumulative movement regret, by computing the averaged movement regret as a function of the time horizon (Figure 2D). LaMBO has the lowest regret, and the cost-aware methods fall between the non-cost-aware strategies and ours. These results point to the advantage of LaMBO over direct adaptation of traditional cost-aware strategies.
We investigate how modular structure affects our algorithm from three different angles: variable dimension, cost ratio, and number of modules. First, we study the impact of dimension by splitting the variables in the Ackley 8D function into three different configurations (, , and ) (Figure 2C). We find that our method performs consistently across the different variable splits, with some degradation as the complexity of the first module increases, which suggests that aggressive switching can be essential when optimizing a module is hard. Second, we examine the impact of the relative cost of modules. For this purpose, we conduct the same experiment as in (A) but change the cost ratio to 1:1 (Figure 2E). In this case, existing cost-aware methods have the best performance while LaMBO attains comparable results. This suggests that our algorithm can be even more beneficial in the asymmetric cost setting. Third, we explore a 3-module problem, where the Ackley 8D test function is divided into dimensions and the cost is set to , respectively across modules (Figure 2F). We observe that LaMBO outperforms the alternatives by a large margin, which suggests that the advantage of LaMBO grows when the system comprises more modules.
Optimizing a multi-stage neuroimaging pipeline:
Segmentation and identification of neural structures of interest (e.g., cell bodies, axons, or vasculature) is an important problem in connectomics and other applications of brain mapping helmstaedter2013connectomic; oh2014mesoscale. When dealing with large datasets, transfer can be challenging and thus workflows must be re-optimized for each new dataset johnson2019toward. Here, we consider the optimization of a relatively common three-stage brain mapping pipeline, consisting of pre-processing (image enhancement via denoising and contrast normalization), semantic segmentation for pixel-level prediction (via a U-net architecture), and a post-processing operation (to reconstruct a 3D volume and detect objects therein). To optimize this pipeline, we use a publicly available X-ray microCT dataset prasad2019mapping to set up the experiments in both a two-module (no pre-processing) and a full three-module version of the pipeline (details of the search space for each module are in Supp. B.1). In our experiments, we define the cost to be the aggregate recorded clock time for generating an output after changing a variable in a specific module, which was sec (pre), sec (U-net), and sec (post). To test LaMBO on the problem, we gathered a data set consisting of combinations of hyperparameters by exhaustive search. The best set of hyperparameters in this case achieved an f1-score of . In the two-module case (Figure 2G), we observed a transition effect: once enough cost has been spent, LaMBO starts to increase its gap in performance over other methods. In the three-module case (Figure 2H), the advantage is even more pronounced and the transition happens earlier. Quantitatively, LaMBO gets close to the optimum (within 5%) in only 25% of the time required by the best alternative approach (1.4 vs. 5.6 hours).
In this paper, we introduced a new algorithm for Bayesian optimization that leverages known modular structure in an otherwise black-box system to minimize the overall cost required for global optimization. To do this, we introduced a new notion of regret, movement augmented regret, that captures the notion of trying to minimize the number of changes to variables in early modules of a system. We designed an algorithm to minimize the movement augmented regret and demonstrated that our method performs well both in theory and in practice. In our analysis, we showed that LaMBO achieves sublinear movement regret. In practice, we showed that our method outperforms other standard approaches in both synthetic datasets and in a brain mapping application. Thus, our results demonstrate that LaMBO not only can be used to achieve good performance for global optimization but can also be used to reduce the optimization cost of structured black-box systems.
This paper addresses a real-world problem of system optimization that is encountered in a variety of scientific disciplines. Increasingly, as we expand the size of datasets in these different domains, we need automated solutions to apply advanced machine learning systems to new datasets and re-optimize systems in an end-to-end manner. Here, we showed how to leverage structure in such systems to inspire the design of a new black-box optimization approach. We demonstrated its application in a relatively simple neuroscience application; however, this method provides a general approach that can be used in a number of scientific domains and make major impacts in advancing healthcare and science.
Appendix A Technical Preliminaries and Proofs
A.1 Common Bounds for Maximum Information Gain
Our theoretical results are presented in terms of the notion of maximum information gain, defined as:
Definition 1. Maximum Information Gain Let be defined in the domain . The observation of at any is given by the model , . For any set , let and denote the set of function values and observations at points in , and denote the Shannon Mutual Information. The Maximum Information Gain is defined by
Maximum information gain has been shown to be a fundamental quantity for analysis throughout the Bayesian optimization literature. Analytical bounds for common kernels are listed below [Srinivas:2010:GPO:3104322.3104451]:
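$\gamma_T = \mathcal{O}(d \log T)$, for the linear kernel.
$\gamma_T = \mathcal{O}((\log T)^{d+1})$, for the Squared Exponential kernel.
$\gamma_T = \mathcal{O}(T^{d(d+1)/(2\nu + d(d+1))} \log T)$, for Matérn kernels with $\nu > 1$,
where $\gamma_T$ denotes the maximum information gain after $T$ observations and $d$ is the dimension of the input space.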
In Theorem , the two terms contributing to the regret arise from bandit and Bayesian optimization, respectively. It immediately follows from the bounds above that the regret is dominated by the former for linear and squared exponential, and Matérn (if ) kernels.
A.2 Proofs of Theoretical Results
(Theorem of [Chowdhury:2017:KMB:3305381.3305469]). Let be a function lying in the RKHS of kernel such that with input dimension . Assume the process of observation noise is -sub-gaussian. Then setting
we have the following holds with probability at least ,
where are given by the formula
(Lemma 4 of [Chowdhury:2017:KMB:3305381.3305469]). Suppose we sample the objective function at then the sum of standard deviations is bounded by,
Our first goal is to prove Lemma 4.
Suppose the learning rate of the LaMBO is set to be , where is the depth of the MSET, then the expected cumulative regret of LaMBO is:
We can compare the result with SMB in [pmlr-v65-koren17a] where . Note that ours is a lower bound of it as . They could potentially have a large gap between them in terms of order. This performance improvement is due to our loss estimator adapted to arm correlation, whereas [pmlr-v65-koren17a] considers the pure bandit information.
For any sequence of , denote to be the solution of and assume for some constant , then there exists an such that LaMBO has the property
In particular, if then for all .
The last statement is trivial from definition. We will prove Eq. (6) by induction on . Since by Eq. (4) and the UCB is upper bounded by , the statement holds for . Now assume it holds for . Then firstly,
Secondly, applying Jensen’s inequality, we have
where the second inequality follows from the induction hypothesis. Therefore, the proof is complete by mathematical induction. ∎
For all and the following hold:
For all we have
With probability at least , we have that .
The proof of the second property is identical to Lemma 8 in [NIPS2017_7000] and is thus omitted. Now we prove the first property. Note that we only need to prove
We will again use mathematical induction to prove the above statement. The initial case holds trivially. Now assume the statement is true for . Then for ,
where the last equality follows from the induction assumption. On the other hand,
where the last equality follows from the independence between and . Hence we have
Now if . Let be the subtree such that then by the tower rule for expectation we have
which completes the proof by induction. ∎
For all , we have .
[alon2015online]. Let and be real vectors such that then a sequence of probability vectors defined by and for all ,
have the property that
for any .
Now we are ready to prove Lemma 5.
Next we prove Lemma 1 in the main text.
For sufficiently large , suppose for that the parameters of an MSET are chosen recursively,
Then LaMBO results in cumulative costs
The proof proceeds by showing, first, that the movement cost is dominated by an HST metric and, second, that under the tree metric the cumulative cost is bounded by the quantity in the lemma. To define the HST metric formally, we introduce the following terminology in accordance with [pmlr-v65-koren17a]. Given nodes , in the MSET , let LCA() be their least common ancestor node. Then the scaled HST metric is defined as follows:
Under this metric, the cost incurred from changing variables in the module is
Then the condition of dominance over the original cost is, for ,
Rearrangements of these linear inequalities yield the solution for to as
Under the condition in Eq. (A.2), the cost incurred from the HST metric Eq. (15) is larger than our original cost. Hence, an upper bound for the cost incurred from the metric will also bound our cumulative cost.
Now we bound the cumulative cost under this HST metric. Observe that and belong to the same subtree on level of the tree with probability at least ; therefore we have
On the other hand, the condition of admits a non-negative solution of for sufficiently large . This condition implies an upper bound on . Finally, combining this upper bound of with Eq. (A.2) completes the proof. ∎
Now we are in the last stage of proving Theorem 1.