Active Learning for Contextual Search with Binary Feedbacks

10/03/2021 · by Quanquan, et al.

In this paper, we study the learning problem in contextual search, which is motivated by applications such as first-price auctions, personalized medicine experiments, and feature-based pricing experiments. In particular, for a sequence of arriving context vectors, each associated with an underlying value, the decision-maker either makes a query at a certain point or skips the context. The decision-maker observes only binary feedback on the relationship between the query point and the value associated with the context. We study a PAC learning setting, where the goal is to learn the underlying mean value function in context with a minimum number of queries. To address this challenge, we propose a tri-section search approach combined with a margin-based active learning method. We show that the algorithm only needs to make O(1/ε^2) queries to achieve an ε-estimation accuracy. This sample complexity significantly improves on the passive setting, which requires at least Ω(1/ε^4) queries.


1 Introduction

Contextual search, which extends the classical binary search problem to high dimensions, finds a wide range of applications, such as auctions, dynamic pricing, and personalized medicine. In the contextual search problem, at each round t, an item (e.g., a customer or a patient) arrives sequentially, each with a contextual vector x_t accessible to the decision-maker. We assume that the context incurs an unknown stochastic value v_t = f(x_t) + ε_t, where f is the mean value function and ε_t is the stochastic noise. The decision-maker selects a query point p_t and then observes the binary feedback, i.e., whether p_t ≤ v_t or vice versa. The true value v_t is never revealed. To better fit the motivating applications illustrated below, the decision-maker is allowed to skip making a query to save her budget. Our goal is to learn the mean value function f with a minimum number of queries. We now explain three motivating applications:
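
As a concrete illustration of this interaction protocol, the following minimal simulation shows what the learner does and does not observe; the linear form of f matches the model adopted later in the paper, while the Gaussian noise and all variable names are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
theta, beta = rng.normal(size=d), 0.3          # unknown to the learner

def step(x, p):
    """One round: query point p against context x; only binary feedback is returned."""
    v = theta @ x + beta + rng.normal(scale=0.1)   # value v_t is never revealed
    return 1 if p <= v else -1                     # feedback: is the query below the value?

x = rng.normal(size=d); x /= np.linalg.norm(x)     # context vector on the unit sphere
y = step(x, p=0.0)                                 # the learner sees only y in {-1, +1}
```

The decision to skip a context corresponds to simply not calling `step` for that round.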

First-price Auction: In the Ads industry, many well-known Ad exchange platforms have recently shifted from the second-price auction to the first-price auction, such as AppNexus, Index Exchange, OpenX, and Google Ad Manager (Sluis 2017, Davies 2019). Compared to the second-price auction, the first-price auction has the advantages of enhanced transparency for bidders and potentially increased revenue for the seller. The first-price auction has become increasingly popular in real-time bidding, which runs online auctions over a large number of demand-side platforms.

For an agent designing her bidding strategy, it is critically important to learn, given the context of an ad, the market price v_t, i.e., the highest bid from her competitors. This problem can be naturally cast as a contextual search problem. Each x_t corresponds to an incoming bid request (each bid request usually contains the auction information, including user information (e.g., location, browser label), publisher information (e.g., web URL and ad slot size), and the ad content), and v_t corresponds to the unknown market price for this auction. The agent first decides whether she wants to participate in the auction and then posts her bid p_t if she joins the auction. In many first-price auction settings in real-time bidding, the agent only observes the binary feedback on whether she wins the auction, but not the market price v_t. This problem is referred to as the landscape forecasting problem in the literature (Zhang et al. 2014). Existing research mainly focuses on landscape forecasting under the second-price auction (Wang et al. 2016, Ren et al. 2019). However, the feedback structure of the first-price auction is fundamentally different from that of the second-price auction. In the second-price auction, the winning agent is able to observe the market price v_t, as it is exactly the cost that she needs to pay. In contrast, the winning agent only receives binary feedback in a standard first-price auction. Moreover, since an agent has the right to avoid making a bid, our contextual search problem setting allows the choice of not making a query to accommodate our motivating application.

Personalized Medicine Experiment: In personalized medicine, a common practice is to leverage a clinical trial experiment to determine the appropriate dosage of a drug for an individual. Bastani and Bayati (2020) adopted a linear bandit model (i.e., a linear form of f) to investigate the relationship between the optimal dosage and the patient's profile. We consider a clinical trial experiment for an expensive drug. The profile of each arriving potential experimental unit is characterized by x_t (e.g., her demographics, diagnosis, medications, and genetics). The algorithm first decides whether the experimental unit will be given the drug, and if so, recommends a dosage p_t. The appropriate dosage level is modeled by v_t = f(x_t) + ε_t. The algorithm receives binary feedback on whether the recommended dosage is above or below the appropriate level. Moreover, we assume that each clinical trial is costly. Thus, the goal is to use the minimum number of trials to learn the ideal personalized dosage level for the drug (i.e., the function f).

Feature-based Pricing Experiment: For an online shopping platform that sells a large number of different products, it is important to understand customers' valuation of each product. Feature-based pricing models the valuation as a linear function of the product's feature vector (Qiang and Bayati 2016, Javanmard and Nazerzadeh 2019, Cohen et al. 2020). It would be very costly to conduct price experiments for all products. Thus, based on the feature vector x_t of each product, the algorithm first decides whether a price experiment should be conducted for this product. If so, a price p_t is suggested, and the binary purchase decision is observed subsequently.

Motivated by these applications, the goal of this paper is to propose an efficient algorithm to learn f. Following the existing literature on contextual search and feature-based pricing, we also adopt a linear model for the mean valuation function, i.e., f(x) = θᵀx + β for some unknown coefficient vector θ and intercept β. Compared to the existing literature, our contextual search problem has the following unique features, which call for new algorithmic development:

  1. First, the existing contextual search setup aims to minimize either the absolute loss |p_t − v_t| or the ε-ball loss 𝟙{|p_t − v_t| > ε} for some pre-determined ε over time, where 𝟙{·} denotes the indicator function. In contrast, we consider a learning problem, where the goal is to learn f as accurately as possible. Therefore, we adopt a probably approximately correct (PAC) setting (see (2) in Sec. 2) instead of the regret minimization setting in the existing literature (Lobel et al. 2017, Leme and Schneider 2018, Krishnamurthy et al. 2021). To facilitate the analysis of this learning problem, we assume the stochasticity of the contextual information x_t.

  2. Second, as we are motivated by experimental applications, the decision-maker should judge the benefit of a context to the learning problem. Therefore, compared to the existing contextual search, our problem has another layer of decision, i.e., whether to conduct a query or not, beyond the decision of the query point itself.

To address this problem, we adopt the active learning framework from machine learning (Settles 2012). In particular, we adopt the margin-based active learning approach (Balcan et al. 2007). At a high level, let f̂ be the current estimate of the underlying function and p be the query point. For an arriving context x_t, margin-based active learning makes a query only if |f̂(x_t) − p| is sufficiently small, which indicates that it is difficult to determine the relationship between p and v_t. Although this is an intuitive approach, existing margin-based active learning approaches cannot be applied to our problem due to the existence of the intercept β. In fact, there is a famous negative result, which shows that active learning cannot significantly improve the sample complexity over passive learning for linear binary classification models with intercepts (Dasgupta 2005b). Please refer to Figure 1 in Sec. 1.1 for details.
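
The margin-based query rule described above can be sketched in a few lines; the function name and the fixed margin value here are our own illustrative choices, not from the paper:

```python
import numpy as np

def should_query(w_hat, b_hat, x, margin):
    """Query only when x lies close to the current estimated hyperplane
    w_hat . x + b_hat = 0, i.e., when the binary label is genuinely uncertain."""
    return abs(w_hat @ x + b_hat) <= margin

w_hat, b_hat = np.array([1.0, 0.0]), 0.0
assert should_query(w_hat, b_hat, np.array([0.05, 0.9]), margin=0.1)     # near plane: query
assert not should_query(w_hat, b_hat, np.array([0.8, 0.1]), margin=0.1)  # far away: skip
```

Contexts far from the hyperplane are skipped because their labels are already predictable from the current estimate, so querying them adds little information.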

To address this challenge, we propose an active learning procedure consisting of three major stages:

  1. The first stage of the algorithm uses trisection search to locate two queries p₁ and p₂ that are close to the underlying intercept term β, without consuming too many labeled (queried) samples. In this first stage, sample selection (i.e., determining whether a sample is to be labeled/queried or not) is not carried out, but the algorithm actively explores different actions in order to obtain p₁ and p₂ that are close to β;

  2. The second stage of the algorithm applies margin-based active learning to learn the linear model and an intercept term depending on both β and the fixed action. In this second stage, sample selection is carried out, as only those users with contextual vectors close to the classification hyperplane will be queried/labeled (see Algorithm 3 later for details). The action taken in this stage (on selected samples) is fixed to either p₁ or p₂, as obtained in the first stage.

    Note that, although this classification model still has non-zero intercept terms, the closeness of p₁ and p₂ to β implies that the labels obtained under actions p₁ or p₂ are balanced, circumventing the negative results in the work of Dasgupta (2005b), which specifically constructed counter-examples with unbalanced labels. In Figure 1 and the related work section below, we give a detailed account of this negative example and how it presents challenges to active learning. Indeed, our theoretical analysis extends the arguments in Balcan et al. (2007) to this more general setting of linear classification with intercepts and balanced labels, with similar convergence rates derived.

  3. The final stage of the algorithm reconstructs the mean utility model from the estimated linear models and intercepts. Because margin-based active learning can only estimate a linear model up to scaling, we need model estimates at two different actions (corresponding to two different effective intercepts) in order to reconstruct θ and β in f. Details of how this reconstruction is carried out are given in the last two lines of Algorithm 1.

We establish the sample complexity bound for the proposed margin-based active learning with a tri-section search scheme. We show that, among the incoming contexts, the decision-maker only needs to make Õ(1/ε^2) queries to estimate the mean value function within ε-precision (with high probability), where Õ hides the dependence on the dimension and other logarithmic factors. We also show that in the passive setting, where the decision-maker is required to conduct queries for all arriving contexts as in the standard contextual search, the sample complexity would be at least Ω(1/ε^4) (see Remark 3.3).

1.1 Related work

Our problem setting can be viewed as a variant of the contextual search problem, which is an extension of the classical binary search. In binary search, the decision-maker tries to guess a fixed constant value v (i.e., the same value for all contexts in our problem) by iteratively making queries p_t. In the PAC learning setting, the binary search algorithm only needs O(log(1/ε)) queries to estimate v within ε-precision. Due to the importance of applications such as personalized medicine and feature-based pricing, contextual search has received a lot of attention in recent years. The existing literature mainly adopts the linear model for the mean value function. For the ε-ball loss, Lobel et al. (2017) established a regret lower bound and proposed the projected volume algorithm that achieves a near-optimal regret. For the absolute loss, Leme and Schneider (2018) established a corresponding regret bound. As explained in the introduction, to fit the applications considered in our paper (e.g., landscape forecasting in the first-price auction), we adopt a PAC learning setting and equip the decision-maker with the ability to pass on an incoming context (e.g., skipping an auction). While most contextual search settings in the literature consider adversarial contextual information, we assume the stochasticity of the contextual information since we study a learning problem.
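
For intuition, the classical (noiseless) binary search baseline mentioned above takes only a few lines; halving the interval at each query yields the O(log(1/ε)) query complexity:

```python
def binary_search(oracle, lo=0.0, hi=1.0, eps=1e-3):
    """Estimate a fixed value v in [lo, hi] from binary feedback oracle(p) = (p <= v).
    The interval halves with each query, so ceil(log2((hi - lo) / eps)) queries suffice."""
    queries = 0
    while hi - lo > eps:
        mid = (lo + hi) / 2
        queries += 1
        if oracle(mid):      # mid <= v: the value lies in the upper half
            lo = mid
        else:                # mid > v: the value lies in the lower half
            hi = mid
    return (lo + hi) / 2, queries

v = 0.3721
est, q = binary_search(lambda p: p <= v)   # → estimate within 1e-3 after 10 queries
```

Contextual search replaces the single unknown constant v with a value f(x_t) that changes with each arriving context, which is what makes the high-dimensional problem substantially harder.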

In one of our motivating examples, the first-price auction, there are several recent research works devoted to optimizing the bidding strategy in repeated auctions (Balseiro et al. 2019, Han et al. 2020a, b). These works formulate the problem as contextual bandits with certain monotone structures. In particular, Balseiro et al. (2019) proposed a cross-learning contextual bandits formulation, where at each round the decision-maker learns the rewards associated with all possible contextual information after making a decision. In the context of the first-price auction, Balseiro et al. (2019) modeled the bidder's own valuation as the context, and thus the bidder would know the reward associated with other possible valuations after making the bid. In our setting, the contextual information is the bid request's profile (e.g., the information of the user, publisher, and ad content). Furthermore, the formulation and results of Balseiro et al. (2019) allow for only a finite number of possible bids, following the classical literature on contextual bandits. In contrast, our modeling and algorithm allow the placed bids to vary continuously on the real line to permit more flexible bidding strategies. Han et al. (2020a, b) formulated the bidding problem as a contextual bandit framework with either the censored feedback structure (i.e., the bidder only knows the winner's bid) or the full-information feedback structure (i.e., the bidder knows the highest competing bid). Our paper mainly focuses on the landscape forecasting problem, which facilitates the design of the bidding strategy by providing an accurate estimation of the highest competing bid. Moreover, we consider a more limited (and thus more challenging) feedback structure with only binary feedback.

Figure 1: Illustration of negative examples of problem instances constructed in (Dasgupta 2005b).

Active learning is an important research area in machine learning, originating from the seminal work of Cohn et al. (1994) dating back to the 1990s. The main idea behind active learning is to equip the learning algorithm with the ability to select samples or data points to be labeled, improving its sample complexity in applications where labels are expensive to obtain but unlabeled data are abundant. Many successful algorithms have been developed for active learning, such as bisection search for one-dimensional noiseless problems (Dasgupta 2005b), greedy methods (Dasgupta 2005a), disagreement-based active learning (Hanneke 2007, Balcan et al. 2009, Zhang and Chaudhuri 2014), margin-based active learning (Balcan et al. 2007, Balcan and Long 2013, Wang and Singh 2016), and active learning based on surrogate loss functions (Awasthi et al. 2017, Balcan and Zhang 2017). Due to the vast literature on active learning we cannot cite all related works here, and we refer interested readers to the excellent review of Hanneke et al. (2014) for an overview of this area.

Our approach in this paper resembles the margin-based active learning method (Balcan et al. 2007, Balcan and Long 2013, Wang and Singh 2016), which was developed for linear classifiers and has been popular in the active learning literature due to its intuitive nature, tight sample complexity, and relative ease of implementation. However, while linear classifiers seem simple, non-homogeneous linear classifiers (i.e., linear classifiers with an intercept term) present notorious challenges to active learning algorithms. More specifically, the work of Dasgupta (2005b) shows that non-homogeneous linear classifiers may produce unbalanced samples, as in the example shown on the left panel of Figure 1. In this illustrative example, potential linear classifiers are within distance ε of the domain boundary, and thus active learning cannot asymptotically improve the sample complexity over passive learning, as it takes Ω(1/ε) samples to hit the boundaries. Note that it is easy to verify that, if a non-homogeneous linear classifier is within distance ε of the boundary and the underlying distribution of unlabeled samples is relatively uniform, the probability of seeing a positive sample (indicated by the region colored blue in Figure 1) is also on the order of ε. To overcome this counter-example, in this paper we exploit the special structure of the contextual search problem to "balance" the labels, as shown on the right panel of Figure 1. While the balanced model still possesses a non-zero intercept term, the classifier will generally be away from the boundary, which our theoretical analysis shows is sufficient for obtaining the desired sample complexity results for active learning.

Active learning has been an important area in machine learning, but it has not received much attention in operations management. This paper takes a preliminary step toward exploring applications of active learning, and we hope it will inspire more research on active learning to address challenges arising from operations management.

1.2 Paper organization

The rest of the paper is organized as follows. Sec. 2 describes the problem formulation and necessary assumptions. Sec. 3 develops our margin-based active learning algorithm with the tri-section search and establishes the sample complexity bound. The technical proofs are provided in Sec. 4. We provide the numerical simulation studies in Sec. 5, followed by the conclusion in Sec. 6. Proofs of some technical lemmas are relegated to the appendix.

2 Problem Formulation and Assumptions

In our model, the items (e.g., ads or experimental units) arrive sequentially, each with a contextual or feature vector x_t accessible to the decision-maker. We assume that the contextual vectors are independently and identically distributed with respect to an unknown underlying distribution P_X. We also assume ‖x_t‖₂ ≤ 1 for ease of illustration. Given the contextual vector x_t, the "valuation" v_t of the item (e.g., the highest competing bid in a first-price auction or the appropriate dosage in a personalized medical treatment) follows a linear model:

(1)    v_t = f(x_t) + ε_t = θᵀx_t + β + ε_t,

where f(x) = θᵀx + β is an underlying linear model with a fixed but unknown coefficient vector θ and intercept β, and the noise terms ε_t are independently and identically distributed stochastic variations with respect to an unknown distribution P_ε.

After observing the contextual vector x_t, the decision-maker makes one of the following two decisions:

  1. Let the item pass without taking any actions, and thereby without obtaining any feedback/information;

  2. Make a query at a point p_t, and observe the binary feedback indicating whether p_t ≤ v_t or p_t > v_t.

Since making a query (e.g., posting a bid to participate in an auction or admitting an experimental unit into a clinical trial program) incurs a much higher implicit cost than passing (i.e., taking no action), the main goal of the decision-maker is to use as few queries as possible to estimate the mean valuation function to a certain precision. More specifically, let ε, δ ∈ (0, 1) be target accuracy and probability parameters. We use n(ε, δ) to denote the number of queries a learning algorithm takes in order to produce an estimate f̂ that satisfies

(2)    Pr( |f̂(x) − f(x)| ≤ ε for all ‖x‖₂ ≤ 1 ) ≥ 1 − δ.

Clearly, the smaller n(ε, δ) is, the more efficient the designed learning algorithm is. The main objective of this paper is to design an active learning algorithm that minimizes n(ε, δ). Additionally, we use m(ε, δ) to denote the total number of samples (i.e., the number of total incoming contexts) an algorithm requires to obtain an estimate satisfying Eq. (2). While those incoming contexts skipped by our algorithm usually do not incur extra cost, it is desirable that m(ε, δ) remain reasonable because the supply of experimental units might still be limited. In the active learning literature, the total sample count is considered reasonable if it is polynomial in 1/ε and 1/δ (Cohn 1996, Cohn et al. 1994, Balcan et al. 2007).

Throughout this paper we impose the following assumptions.

  1. (A1) There exists a constant B < ∞ such that ‖θ‖₂ ≤ B and |β| ≤ B;

  2. (A2) The distribution P_X satisfies the following conditions: it is supported on the unit ball B_d = {x : ‖x‖₂ ≤ 1}; it admits a probability density function p_X; and there exist constants 0 < c ≤ C < ∞ such that c·p_U(x) ≤ p_X(x) ≤ C·p_U(x) for all x, where p_U is the probability density function (PDF) of the uniform distribution on B_d;

  3. (A3) The distribution P_ε satisfies the following conditions: Pr(ε_t ≤ 0) = 1/2; it admits a probability density function p_ε; and there exist positive constants such that p_ε is bounded away from zero and bounded above in a neighborhood of zero.

Assumption (A1) is a standard boundedness assumption imposed on the model parameters. Assumption (A2) assumes that the contextual vectors are independently and identically distributed with respect to a bounded and non-degenerate distribution that is unknown. Similar "non-degeneracy" or "covariate diversity" assumptions were also adopted in the contextual learning literature (Bastani and Bayati 2020, Bastani et al. 2021), and the assumption is actually weaker than in some of the existing works on active learning (Balcan et al. 2007, Wang and Singh 2016), which require P_X to be the exact uniform distribution over the unit ball.

Assumption (A3) is a general condition imposed on the distribution of the noise variables. Essentially, it assumes that zero is the median of the noise distribution P_ε, which ensures that the linear classifier induced by f is the optimal Bayes classifier. The same assumption is common in the active learning literature (Balcan et al. 2007, Wang and Singh 2016). Note that we do not assume the noise distribution has any specific parametric form (e.g., Logistic or Probit noise), making our results generally applicable to a broad range of problems.
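
A quick Monte Carlo check illustrates this median-noise property; Laplace noise is used here as an arbitrary non-Gaussian, median-zero choice, and the numeric values are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
f_x = 0.4                                      # mean value f(x) at some fixed context x
noise = rng.laplace(scale=0.2, size=200_000)   # any median-zero noise distribution works

def prob_feedback_positive(p):
    """Empirical Pr(p <= v) with v = f(x) + eps."""
    return np.mean(p <= f_x + noise)

# Below f(x) the positive-feedback probability exceeds 1/2; above f(x) it drops below.
assert prob_feedback_positive(0.3) > 0.5 > prob_feedback_positive(0.5)
```

This crossing of 1/2 exactly at f(x) is what makes the linear model the Bayes classifier, and it is the signal both the trisection search and the margin-based learner exploit.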

3 Margin-based Active Learning with Tri-section Search

1:Input: dimension d, accuracy parameters ε, δ, algorithm parameters;
2:(p₁, p₂) ← TrisectionSearch(ε, δ);
3:(ŵ₁, b̂₁) ← MarginBasedActiveLearning(p₁, ε, δ);
4:(ŵ₂, b̂₂) ← MarginBasedActiveLearning(p₂, ε, δ);
5:Reconstruct the coefficient estimate θ̂ and the intercept estimate β̂ from (ŵ₁, b̂₁), (ŵ₂, b̂₂), and (p₁, p₂);
6:Output: utility function estimate f̂(x) = θ̂ᵀx + β̂.
Algorithm 1 A meta-algorithm for actively learning contextual functions.

The main algorithm we propose for actively learning contextual functions is given in Algorithm 1. Its main idea can be summarized as follows.

The first step is to find two actions or bidding prices p₁ and p₂ that are reasonably close to the mean utility parameter. This ensures that when the actions are fixed at p₁ or p₂, the labels received from the user stream are relatively balanced, thereby circumventing the negative results in the work of Dasgupta (2005b). In Sec. 3.1 we show how p₁ and p₂ can be found without using too many labeled samples, via a trisection search idea.

After obtaining the candidate bidding prices p₁ and p₂, we use a margin-based active learning algorithm to estimate the linear model and the mean utility. The margin-based active learning algorithm is similar to that of Balcan et al. (2007), with the difference being that in our setting the active learning algorithm needs to incorporate a (relatively small) intercept term, which complicates its design and analysis.

Finally, we use the estimates obtained from the above-mentioned active learning procedure under the two different fixed actions to reconstruct the linear utility parameters θ and β. The reason we need two fixed actions is that the active learning procedure solves a classification problem, for which we can only estimate the linear model and its intercept up to scaling: multiplying both the linear model and its intercept by a constant leaves the classification problem unchanged. Hence, we need two fixed actions to construct an approximate linear system of equations, the solution of which gives us consistent estimates of θ and β.

Below we briefly explain the intuition behind the construction of the utility function estimate in Algorithm 1. For simplicity we omit the learning errors incurred in the two MarginBasedActiveLearning invocations. Because the margin-based active learning algorithm learns linear classifiers only up to normalization (see Algorithm 3), the classifier learned under the fixed action p_i is equivalent to a normalized version of the true model with effective intercept β − p_i. Again, we emphasize that this equivalence only holds approximately due to the learning errors, which we omit for ease of explanation. Comparing the normalized models learned under the two actions p₁ and p₂ yields an approximate linear system in θ and β, and solving this system gives the estimates θ̂ and β̂ used in the utility function estimate f̂ in Algorithm 1.
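
To make the two-action reconstruction concrete, the following sketch works out the linear system under the illustrative assumption that the learned classifier (w, b) is normalized to unit Euclidean length; the algorithm's exact normalization may differ, and learning errors are ignored here:

```python
import numpy as np

theta, beta = np.array([0.6, -0.3]), 0.2    # ground truth (unknown to the learner)
p1, p2 = 0.1, 0.4                           # the two fixed actions

def normalized_classifier(p):
    """What active learning can recover under action p: the Bayes classifier
    (theta, beta - p), known only up to scale; here normalized to unit length."""
    v = np.concatenate([theta, [beta - p]])
    v /= np.linalg.norm(v)
    return v[:-1], v[-1]                    # (w, b) with ||(w, b)||_2 = 1

(w1, b1), (w2, b2) = normalized_classifier(p1), normalized_classifier(p2)

# Under this normalization, beta - p_i = (b_i / ||w_i||) * ||theta|| for i = 1, 2.
# Subtracting the two equations isolates ||theta||; back-substitution gives beta.
r1, r2 = b1 / np.linalg.norm(w1), b2 / np.linalg.norm(w2)
theta_norm = (p2 - p1) / (r1 - r2)
theta_hat = theta_norm * w1 / np.linalg.norm(w1)
beta_hat = p1 + r1 * theta_norm
```

With a single action, theta_norm could not be identified, which is why Algorithm 1 runs the active learning subroutine twice.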

3.1 Tri-section search for accurate mean utility

1:function TrisectionSearch()
2:     Initialize: sample counter , lower and upper bounds , ;
3:     while  do
4:         , , , , ;
5:         while  and  do
6:              For an incoming user , take action and observe result ;
7:              For another incoming user , take action and observe result ;
8:              , , , ;
9:              Update: and ;
10:         end while
11:         Set if or and otherwise;
12:     end while
13:     return .
14:end function
Algorithm 2 A tri-section search algorithm to roughly estimate the mean utility parameter

Let β₀ be the unique value such that Pr(v_t ≤ β₀) = 1/2. Because P_X and P_ε admit PDFs, such a value β₀ exists and is unique. Intuitively, if one commits to the fixed action β₀, then the labels received by the algorithm should be balanced. Algorithm 2 shows how to find actions that are reasonably close to β₀ without consuming too many labeled samples.

Figure 2: Graphical illustration of the main idea behind Algorithm 2.

The main idea behind Algorithm 2 is a trisection search approach, motivated by the fact that the probability Pr(p ≤ v_t) is a monotonically decreasing function of p, and furthermore, as p moves away from β₀, the gap between this probability and 1/2 increases (see, e.g., Lemma 4.1.2 in the proof). This allows us to use a trisection search procedure to localize the value of β₀ by comparing empirical estimates of this probability with 1/2 at the current candidate values. More specifically, at each iteration the algorithm maintains the two trisection midpoints of the current interval, together with lower and upper empirical estimates of the query probability at each midpoint. Whenever either probability is clearly separated from 1/2, the algorithm moves the corresponding endpoint inward. The algorithm is guaranteed to maintain that the search interval contains β₀, thanks to the monotonicity of the probability with respect to p.
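
A simplified sketch of this trisection idea follows, with a one-dimensional, context-free value for illustration and a fixed confidence margin standing in for the algorithm's lower/upper estimates; the balancing value, noise, and all thresholds are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)

def q_hat(p, n=4000):
    """Empirical Pr(p <= v) from n fresh binary feedbacks. Illustrative setup:
    v = 0.35 + Gaussian noise (in the paper v also depends on a context x_t)."""
    v = 0.35 + rng.normal(scale=0.1, size=n)
    return np.mean(p <= v)

lo, hi, tau = 0.0, 1.0, 0.05    # tau: separation required before moving an endpoint
while hi - lo > 0.04:
    m1, m2 = lo + (hi - lo) / 3, hi - (hi - lo) / 3
    # q is decreasing in p and equals 1/2 at the balancing action beta0.
    if q_hat(m1) < 0.5 - tau:
        hi = m1                 # q(m1) clearly below 1/2  =>  beta0 < m1
    elif q_hat(m2) > 0.5 + tau:
        lo = m2                 # q(m2) clearly above 1/2  =>  beta0 > m2
    else:
        lo, hi = m1, m2         # beta0 lies near or between the two midpoints
```

Because each step discards at least a third of the interval, the bracket shrinks geometrically while only a modest number of labeled samples is spent per step.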

The following technical lemmas, proved in Sec. 4.1, state the objective and guarantees of Algorithm 2. The first lemma establishes that the "balancing" value β₀ is close to the intercept β in the utility model, which is helpful for our later analysis. The second lemma establishes that, with high probability, the two actions returned by TrisectionSearch sandwich the "label-balancing" action β₀, and it also upper bounds the total number of labeled (queried) samples consumed in the algorithmic procedure.

3.2 Margin-based Active Learning

In Algorithm 3 we provide the pseudocode description of the margin-based active learning algorithm we use to actively learn a linear model with an intercept.

1:function MarginBasedActiveLearning()
2:     Collect samples with action and let , be the queried samples;
3:     Let ;
4:     Let ;
5:     for  do
6:         , , , ;
7:         while  do
8:              Observe context vector for the next object;
9:              if  then
10:                  Invoke action and let be the collected binary feedback;
11:                  Update ;
12:              end if
13:         end while
14:         ;
15:     end for
16:     return .
17:end function
Algorithm 3 Margin-based active learning for non-homogeneous linear classifiers

Note that in Algorithm 3 the query point p is fixed, with the algorithm only able to select which sample/contextual vector to act upon. Since the query point is fixed, we can consider linear models with intercepts of the form h_{w,b}(x) = sign(wᵀx + b). For such a model, we define the error of h_{w,b} under the query point p as

(3)    err(h_{w,b}) = Pr( h_{w,b}(x_t) ≠ y_t ),

where y_t is the binary feedback received under the query point p. Note that the model corresponding to the true parameters, with effective intercept β − p, attains the smallest error defined in Eq. (3). This is because it is the Bayes classifier; that is, Pr(y_t = 1 | x_t) ≥ 1/2 if and only if θᵀx_t + β − p ≥ 0. Hence, we can also define the excess error of a model h as

(4)    err(h) − err(h*),

where h* denotes the Bayes classifier.
Figure 3: Graphical illustration of the main idea of Algorithm 3.

Figure 3 illustrates the principles of Algorithm 3. The main idea of Algorithm 3 is simple: the algorithm first uses a "warm-up" epoch of queried samples to construct a preliminary model estimate. There is no sample selection or active learning in this warm-up procedure, and the analysis of the excess error of the warm-up estimate follows standard VC theory for empirical risk minimization with binary classifiers (see, e.g., Lemma A in the proof and also Balcan et al. (2007), Vapnik and Chervonenkis (2015), Vapnik (2013)). Next, in each epoch the algorithm only takes an action for those users whose contextual vectors are close to the current classification hyperplane (i.e., those users with a small "margin"). This concentrates the labeled/queried samples on the region close to the classification hyperplane, which helps reduce the number of queried samples, as the queried samples are collected in the regions that are most uncertain from a binary classification perspective.
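
The warm-up-then-shrinking-margin structure can be sketched as follows for a homogeneous (intercept-free) classifier; the least-squares fit stands in for the ERM step, and the epoch counts, margins, and noise level are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)
d = 3
theta = np.array([0.8, -0.5, 0.3]); theta /= np.linalg.norm(theta)

def feedback(x):
    """Noisy binary label for the classifier sign(theta . x)."""
    return np.sign(theta @ x + rng.normal(scale=0.05))

def fit(X, y):
    """Crude linear-classifier estimate: least squares on the +/-1 labels."""
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    return w / np.linalg.norm(w)

# Warm-up epoch: label everything, no sample selection.
X = rng.normal(size=(200, d)); X /= np.linalg.norm(X, axis=1, keepdims=True)
w_hat = fit(X, np.array([feedback(x) for x in X]))
n_queried = 200

# Active epochs: query only contexts within a shrinking margin of the current plane.
for k in range(1, 4):
    margin = 0.5 / 2**k
    Xk, yk = [], []
    while len(Xk) < 200:
        x = rng.normal(size=d); x /= np.linalg.norm(x)
        if abs(w_hat @ x) <= margin:      # selective querying near the hyperplane
            Xk.append(x); yk.append(feedback(x))
            n_queried += 1
    w_hat = fit(np.array(Xk), np.array(yk))
```

Each epoch discards the many easy-to-classify contexts for free and spends its fixed label budget only where the current estimate is uncertain, which is the source of the savings in queried samples.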

The following lemma, proved in Sec. 4.2, is the main result of this section. Let the estimate be returned by Algorithm 3 with appropriately chosen parameters. Then for a sufficiently large sample budget and sufficiently small ε, with high probability the following hold:

  1. the excess error of the returned model is at most the target accuracy;

  2. Algorithm 3 consumes a bounded number of queried samples and total samples.

Essentially, Lemma 3.2 shows that the estimated linear model produced by Algorithm 3 attains the target excess risk with high probability. The lemma also upper bounds the numbers of queried and total samples consumed in the estimation procedure. The number of labeled samples required is an order of magnitude smaller than the total number of samples consumed. This shows that the active learning procedure is capable of drastically reducing the number of queried samples required to attain an accurate model estimate, by being selective about the user context vectors.

3.3 Sample complexity analysis of Algorithm 1

In this section we establish the following theorem, which analyzes the sample complexity (both the samples that are queried and the samples that are passed) of Algorithm 1, and provides guidance on the selection of the algorithm's input parameters.

Suppose Algorithm 1 is executed with appropriately chosen input parameters. Then for sufficiently small ε and a sufficiently large sample budget, with high probability it holds that |f̂(x) − f(x)| ≤ ε for all x in the unit ball. Furthermore, the algorithm makes n(ε, δ) = Õ(1/ε^2) queries among the m(ε, δ) total samples/contexts.

If the decision-maker must make queries for all incoming contexts/samples (i.e., skipping uninformative samples is not allowed), then at least Ω(1/ε^4) samples are required. To see this, note that standard classification theory characterizes the number of samples needed to obtain a linear classifier with a given excess classification error (see, e.g., Mammen and Tsybakov (1999), Ben-David and Urner (2014)). The estimation error can then be related to the classification error via an integration argument, as follows. Let α denote the angle between the estimated and true coefficient vectors. If both are normalized to unit length and α is small, the probability that the two induced classifiers disagree on a random context scales linearly with α. This shows that achieving an ε-accurate estimate forces the classification excess error down to the order of ε, indicating a sample complexity lower bound of Ω(1/ε^4) in the passive setting.
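
The geometric step of this argument, namely that for spherically symmetric contexts the disagreement probability between two unit-norm classifiers equals their angle divided by π, can be checked by Monte Carlo:

```python
import numpy as np

rng = np.random.default_rng(4)
alpha = 0.2                                   # angle between the two unit vectors
u = np.array([1.0, 0.0, 0.0])
v = np.array([np.cos(alpha), np.sin(alpha), 0.0])

X = rng.normal(size=(200_000, 3))             # spherically symmetric contexts
disagree = np.mean(np.sign(X @ u) != np.sign(X @ v))

# For spherically symmetric x, Pr[sign(u.x) != sign(v.x)] = alpha / pi.
assert abs(disagree - alpha / np.pi) < 0.01
```

The disagreement region is the wedge between the two hyperplanes, whose probability mass is proportional to the angle, so small estimation error requires a proportionally small classification error.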

Theorem 3.3 shows that, by using more unlabeled/unqueried samples than labeled ones, the utility function estimate produced by our active learning algorithm is within ε estimation error with high probability. In the numerical studies of Sec. 5, we will see that the availability of unlabeled samples greatly improves the estimation accuracy of an active learning algorithm compared to a passive learning baseline, which cannot skip or select the samples to query.

4 Technical Proofs

In this section we present the proofs of the main results in this paper. Some technical lemmas that are either easy to prove or cited/rephrased from existing works are presented in the appendix. For simplicity, let P_U denote the uniform distribution on the unit ball for all proofs in this section.

4.1 Proof of results in Sec. 3.1

4.1.1 Proof of Lemma 2.

First note that is equivalent to , with . Note also that we may assume because is invariant to . In this proof we shall use the lower and upper bounds of by connecting it with the uniform distribution on , . Because is isotropic, we may assume without loss of generality that and . We will also abbreviate and since all margins in this proof are with respect to . Then for all with , and furthermore

Subsequently, by Assumption (A2) and Lemma A, it holds that

(5)

With , we have . Noting that for , Eq. (5) can then be simplified to

(6)

On the other hand, for all with , and furthermore

Subsequently, by Assumption (A2) and Lemma A, it holds that

(7)

Combining Eqs. (5,7) we obtain

To satisfy the above inequality, must satisfy

which proves Lemma 2.

4.1.2 Proof of Lemma 2.

For notational simplicity define . Clearly, and is a monotonically decreasing function. By Hölder's inequality, at sample we have . The same inequality holds for as well. By the union bound, the probability that and hold throughout the entire Algorithm 2 is lower bounded by