Privacy Amplification via Random Check-Ins

07/13/2020 ∙ by Borja Balle, et al. ∙ Google

Differentially Private Stochastic Gradient Descent (DP-SGD) forms a fundamental building block in many applications for learning over sensitive data. Two standard approaches, privacy amplification by subsampling, and privacy amplification by shuffling, permit adding lower noise in DP-SGD than via naïve schemes. A key assumption in both these approaches is that the elements in the data set can be uniformly sampled, or be uniformly permuted – constraints that may become prohibitive when the data is processed in a decentralized or distributed fashion. In this paper, we focus on conducting iterative methods like DP-SGD in the setting of federated learning (FL) wherein the data is distributed among many devices (clients). Our main contribution is the random check-in distributed protocol, which crucially relies only on randomized participation decisions made locally and independently by each client. It has privacy/accuracy trade-offs similar to privacy amplification by subsampling/shuffling. However, our method does not require server-initiated communication, or even knowledge of the population size. To our knowledge, this is the first privacy amplification tailored for a distributed learning framework, and it may have broader applicability beyond FL. Along the way, we extend privacy amplification by shuffling to incorporate (ϵ,δ)-DP local randomizers, and exponentially improve its guarantees. In practical regimes, this improvement allows for similar privacy and utility using data from an order of magnitude fewer users.




1 Introduction

Modern mobile devices and web services benefit significantly from large-scale machine learning, often involving training on user (client) data. When such data is sensitive, steps must be taken to ensure privacy, and a formal guarantee of differential privacy (DP) [16, 15] is the gold standard. For this reason, DP has been adopted by companies including Google [20, 9, 18], Apple [2], Microsoft [13], and LinkedIn [31], as well as the US Census Bureau [26].

Other privacy-enhancing techniques can be combined with DP to obtain additional benefits. In particular, cross-device federated learning (FL) [27] allows model training while keeping client data decentralized (each participating device keeps its own local dataset, and only sends model updates or gradients to the coordinating server). However, existing approaches to combining FL and DP make a number of assumptions that are unrealistic in real-world FL deployments such as [10]. To highlight these challenges, we must first review the state-of-the-art in centralized DP training, where differentially private stochastic gradient descent (DP-SGD) [34, 8, 1] is ubiquitous. It achieves optimal error for convex problems [8], and can also be applied to non-convex problems, including deep learning, where the privacy amplification offered by randomly subsampling data to form batches is critical for obtaining meaningful DP guarantees [25, 8, 1, 5, 37].

Attempts to combine FL and the above lines of DP research have been made previously; notably, [28, 3] extended the approach of [1] to FL and user-level DP. However, these works and others in the area sidestep a critical issue: the DP guarantees require very specific sampling or shuffling schemes assuming, for example, that each client participates in each iteration with a fixed probability. While possible in theory, such schemes are incompatible with the practical constraints and design goals of cross-device FL protocols [10]; to quote [23], a comprehensive recent FL survey, “such a sampling procedure is nearly impossible in practice.” (Footnote 1: In cross-silo FL applications [23], an enumerated set of addressable institutions or data-silos participate in FL, and so explicit server-mediated subsampling or shuffling using existing techniques may be feasible.) The fundamental challenge is that clients decide when they will be available for training and when they will check in to the server, and by design the server cannot index specific clients. In fact, it may not even know the size of the participating population.

Our work targets these challenges. Our primary goal is to provide strong central DP guarantees for the final model released by FL-like protocols, under the assumption of a trusted orchestrating server. (Footnote 2: Notably, our guarantees are obtained by amplifying the privacy provided by local DP randomizers; we treat this use of local DP as an implementation detail in accomplishing the primary goal of central DP. As a byproduct, our approach offers (weaker) local DP guarantees even in the presence of an untrusted server.) This is accomplished by building upon recent work on amplification by shuffling [19, 12, 18, 22, 6] and combining it with new analysis techniques targeting FL-specific challenges (e.g., client-initiated communications, non-addressable global population, and constrained client availability).

We propose the first privacy amplification analysis specifically tailored for distributed learning frameworks. At the heart of our result is a novel technique, called random check-in, that relies only on randomness independently generated by each individual client participating in the training procedure. We show that distributed learning protocols based on random check-ins can attain privacy gains similar to privacy amplification by subsampling/shuffling (see Table 1 for a comparison), while requiring minimal coordination from the server. While we restrict our exposition to distributed DP-SGD within the FL framework for clarity and concreteness (see Figure 1 for a schematic of one of our protocols), we note that the techniques used in our analyses are broadly applicable to any distributed iterative method and might be of interest in other applications. (Footnote 3: In particular, the Federated Averaging [27] algorithm, which computes an update based on multiple local SGD steps rather than a single gradient, can immediately be plugged into our framework.)




Figure 1: A schematic of the Random Check-ins protocol with Fixed Windows (Section 3.1) for Distributed DP-SGD (Algorithm 1). For the central DP guarantee, all solid arrows represent communication over privileged channels not accessible to any external adversary. (a) Clients performing random check-ins with a fixed window of time steps. ‘X’ denotes that the client randomly chose to abstain from participating. (b) A time step at the server, where for training time $i$, the server selects a client from those who checked in for time $i$, requests an update for model $\theta_i$, and then updates the model to $\theta_{i+1}$ (or the gradient accumulator if using minibatches).


The main contributions of this paper can be summarized as follows:

  1. We propose random check-ins, the first privacy amplification technique for distributed systems with minimal server-side overhead. We also instantiate three distributed learning protocols that use random check-ins, each addressing different natural constraints that arise in applications.

  2. We provide formal privacy guarantees for our protocols, and show that random check-ins attain similar rates of privacy amplification as subsampling and shuffling while reducing the need for server-side orchestration. We also provide utility guarantees for one of our protocols in the convex case that match the optimal privacy/accuracy trade-offs for DP-SGD in the central setting [7].

  3. As a byproduct of our analysis, we improve privacy amplification by shuffling [19] on two fronts. For the case of $\varepsilon_0$-DP local randomizers, we exponentially improve the dependence of the final central DP $\varepsilon$ on the local parameter $\varepsilon_0$. Figure 2 provides a numerical comparison of the bound from [19] with our bound; for typical parameter values this improvement allows us to provide similar privacy guarantees while reducing the number of required users by one order of magnitude. We also extend the analysis to the case of $(\varepsilon_0, \delta_0)$-DP local randomizers, including Gaussian randomizers that are widely used in practice.

    Figure 2: Values of (for ) after amplification by shuffling of -DP local randomizers obtained from: Theorem 5.1 (solid lines) and [19, Theorem 7] (dotted lines). The grey line represents the threshold of no amplification (); after crossing the line amplification bounds become vacuous. Observe that our bounds with and are similar to the bounds from [19] with and , respectively.

Related work

Our work considers the paradigm of federated learning as a stylized example throughout the paper. We refer the reader to [23] for an excellent overview of the state-of-the-art in federated learning, along with a suite of interesting open problems. There is a rich literature on studying differentially private ERM via DP-SGD [34, 8, 1, 39, 35, 30]. However, constraints such as limited availability in distributed settings restrict direct applications of existing techniques. There is also a growing line of works on privacy amplification by shuffling  [9, 19, 12, 4, 6, 22, 18] that focus on various ways in which protocols can be designed using trusted shuffling primitives. Lastly, privacy amplification by iteration [21] is another recent advancement that can be applied in an iterative distributed setting, but it is limited to convex objectives.

2 Background and Problem Formulation

Differential Privacy

To formally introduce our notion of privacy, we first define neighboring data sets. We will refer to a pair of data sets $D, D' \in \mathcal{D}^n$ as neighbors if $D'$ can be obtained from $D$ by modifying one sample $d_i \in D$ for some $i \in [n]$.

Definition 2.1 (Differential privacy [16, 15]).

A randomized algorithm $\mathcal{A} : \mathcal{D}^n \to \mathcal{S}$ is $(\varepsilon, \delta)$-differentially private if, for any pair of neighboring data sets $D, D' \in \mathcal{D}^n$, and for all events $S$ in the output range of $\mathcal{A}$, we have $\Pr[\mathcal{A}(D) \in S] \le e^{\varepsilon} \cdot \Pr[\mathcal{A}(D') \in S] + \delta$.

For meaningful central DP guarantees (i.e., when $n > 1$), $\varepsilon$ is assumed to be a small constant, and $\delta \ll 1/n$. The case $\delta = 0$ is often referred to as pure DP (in which case, we just write $\varepsilon$-DP). We shall also use the term approximate DP when $\delta > 0$.

Adaptive differentially private mechanisms occur naturally when constructing complex DP algorithms, e.g., DP-SGD. In addition to the dataset $D$, adaptive mechanisms also receive as input the output of other differentially private mechanisms. Formally, we say that an adaptive mechanism $\mathcal{A} : \mathcal{S}' \times \mathcal{D}^n \to \mathcal{S}$ is $(\varepsilon, \delta)$-DP if the mechanism $\mathcal{A}(s', \cdot)$ is $(\varepsilon, \delta)$-DP for every $s' \in \mathcal{S}'$.

Specializing Definition 2.1 to the case $n = 1$ gives what we call a local randomizer, which provides a local DP guarantee. Local randomizers are the typical building blocks of local DP protocols where individuals privatize their data before sending it to an aggregator for analysis [25].
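As a concrete illustration of a pure-DP local randomizer, the following sketch checks the inequality of Definition 2.1 (with a single record and no additive slack) for binary randomized response. Randomized response is a standard textbook mechanism and is used here only as an example; it is not one of the protocols of this paper, and the function names are hypothetical.

```python
import math


def rr_output_probs(bit: int, eps0: float) -> dict:
    """Output distribution of binary randomized response applied to `bit`."""
    # Probability of reporting the true bit; the flipped bit gets the rest.
    keep = math.exp(eps0) / (1.0 + math.exp(eps0))
    return {bit: keep, 1 - bit: 1.0 - keep}


def satisfies_pure_dp(eps0: float, tol: float = 1e-12) -> bool:
    """Check Pr[A(d) = s] <= e^eps0 * Pr[A(d') = s] for all inputs and outputs.

    For a discrete output space and no additive slack, checking single
    outcomes is enough, since any event is a disjoint union of outcomes.
    """
    bound = math.exp(eps0)
    for d in (0, 1):
        for d_prime in (0, 1):
            p = rr_output_probs(d, eps0)
            q = rr_output_probs(d_prime, eps0)
            if any(p[s] > bound * q[s] + tol for s in (0, 1)):
                return False
    return True
```

The small tolerance only absorbs floating-point rounding: for this mechanism the worst-case probability ratio equals the bound exactly.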

Problem Setup

The distributed learning setup we consider in this paper involves $n$ clients, where each client $j \in [n]$ holds a data record $d_j \in \mathcal{D}$ (Footnote 4: Each client is identified as a user. In a general FL setting, each $d_j$ can correspond to a local data set [10].), forming a distributed data set $D = (d_1, \ldots, d_n)$. We assume a coordinating server wants to train the parameters $\theta \in \Theta$ of a model by using the dataset $D$ to perform stochastic gradient descent steps according to some loss function $\ell : \mathcal{D} \times \Theta \to \mathbb{R}_+$. The server’s goal is to protect the privacy of all the individuals in $D$ by providing strong DP guarantees against an adversary that can observe the final trained model as well as all the intermediate model parameters. We assume the server is trusted, all devices adhere to the prescribed protocol (i.e., there are no malicious users), and all server-client communications are privileged (i.e., they cannot be detected or eavesdropped by an external adversary).

The server starts with model parameters $\theta_1$ and over a sequence of $m$ time slots produces a sequence of model parameters $\theta_2, \ldots, \theta_{m+1}$. Our random check-ins technique allows clients to independently decide when to offer their contributions for a model update. If and when a client’s contribution is accepted by the server, she uses the current parameters $\theta$ and her data $d$ to send a privatized gradient of the form $\mathcal{A}_{ldp}(\nabla_\theta \ell(d, \theta))$ to the server, where $\mathcal{A}_{ldp}$ is a DP local randomizer (e.g., performing gradient clipping and adding Gaussian noise).
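The kind of local randomizer mentioned above (gradient clipping followed by Gaussian noise) can be sketched as follows. This is a minimal illustration under our own assumptions: the function names and the parameters `clip_norm` and `sigma` are hypothetical, and the sketch does not calibrate `sigma` to any particular privacy level.

```python
import math
import random


def clip_l2(vec, clip_norm):
    """Scale `vec` down so that its L2 norm is at most `clip_norm`."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm <= clip_norm or norm == 0.0:
        return list(vec)
    scale = clip_norm / norm
    return [x * scale for x in vec]


def gaussian_local_randomizer(grad, clip_norm, sigma, rng):
    """Clip the gradient, then add independent Gaussian noise per coordinate.

    `sigma` is the noise standard deviation; choosing it to meet a target
    local DP guarantee is outside the scope of this sketch.
    """
    clipped = clip_l2(grad, clip_norm)
    return [x + rng.gauss(0.0, sigma) for x in clipped]
```

A client would call `gaussian_local_randomizer(grad, clip_norm, sigma, random.Random())` on its own gradient before sending anything to the server, so the raw gradient never leaves the device.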


Our results consider three different setups inspired by practical applications [10]: (1) The server uses $m \ll n$ time slots, where at most one user’s update is used in each slot, for a total of $m/b$ minibatch SGD iterations. It is assumed all users are available for the duration of the protocol, but the server does not have enough bandwidth to process updates from every user (Section 3.1); (2) The server uses $m \approx n/b$ time slots, and all users are available for the duration of the protocol (Section 4.1). On average, $b$ users contribute updates to each time slot, and so, we take $m$ minibatch SGD steps; (3) As with (2), but each user is only available during a small window of time relative to the duration of the protocol (Section 4.2).

3 Distributed Learning with Random Check-Ins

This section presents the random check-ins technique for privacy amplification in the context of distributed learning. We formally define the random check-ins procedure, describe a fully distributed DP-SGD protocol with random check-ins, and analyze its privacy and utility guarantees.

3.1 Random Check-Ins with a Fixed Window

Consider the distributed learning setup described in Section 2 where each client is willing to participate in the training procedure as long as their data remains private. To boost the privacy guarantees provided by the local randomizer $\mathcal{A}_{ldp}$, we will let clients volunteer their updates at a random time slot of their choosing. This randomization creates uncertainty about whether an individual’s data was used in a particular update, much as uniform subsampling or shuffling does. We formalize this concept using the notion of random check-in, which can be informally expressed as a client in a distributed iterative learning framework randomizing their instant of participation, and determining with some probability whether to participate in the process at all.

Definition 3.1 (Random check-in).

Let $\mathcal{A}$ be a distributed learning protocol with $m$ check-in time slots. For a set $R_j \subseteq [m]$ and probability $p_j \in [0, 1]$, client $j$ performs an $(R_j, p_j)$-check-in in the protocol if with probability $p_j$ she requests the server to participate in $\mathcal{A}$ at a time step $I$ chosen uniformly at random from $R_j$, and otherwise abstains from participating. If $p_j = 1$, we alternatively denote it as an $R_j$-check-in.
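The client-side decision in Definition 3.1 amounts to one biased coin flip followed by one uniform draw. A minimal sketch, with a hypothetical function name and `None` standing in for "abstain":

```python
import random


def random_check_in(window, p, rng):
    """Perform a (window, p)-check-in.

    With probability `p`, return a time step drawn uniformly at random from
    `window`; otherwise return None to signal that the client abstains.
    """
    if rng.random() < p:
        return rng.choice(list(window))
    return None
```

For example, `random_check_in(range(1, 11), 0.5, random.Random())` returns a slot in 1..10 about half of the time and `None` otherwise. All randomness is generated on the client, so the server never needs to index or sample clients.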

Our first distributed learning protocol based on random check-ins is presented in Algorithm 1. Client $j$ independently decides in which of the possible time steps (if any) she is willing to participate by performing an $(R_j, p_j)$-check-in. We set $R_j = [m]$ for all $j \in [n]$, and assume all clients are available throughout the duration of the protocol. (Footnote 5: We make this assumption only for utility; the privacy guarantees are independent of this assumption.) On the server side, at each time step $i \in [m]$, a random client $J_i$ among all the ones that checked in at time $i$ is queried: this client receives the current model $\theta_i$, locally computes a gradient update $\nabla_\theta \ell(d_{J_i}, \theta_i)$ using their data $d_{J_i}$, and returns to the server a privatized version of the gradient obtained using a local randomizer $\mathcal{A}_{ldp}$. Clients checked in at time $i$ that are not selected do not participate in the training procedure. If at time $i$ no client is available, the server adds a “dummy” gradient to update the model.

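Server-side protocol:
  parameters: local randomizer $\mathcal{A}_{ldp}$, number of steps $m$
  Initialize model $\theta_1$
  Initialize gradient accumulator $g_1 \leftarrow 0$
  for $i \in [m]$ do
    $S_i \leftarrow \{ j : \text{User}(j) \text{ checked in at time } i \}$
    if $S_i$ is empty then
      $\tilde{g}_i \leftarrow \mathcal{A}_{ldp}(0)$   // Dummy gradient
    else
      Sample $J_i$ uniformly at random from $S_i$
      Request User($J_i$) for update to model $\theta_i$
      Receive $\tilde{g}_i$ from User($J_i$)
    $(\theta_{i+1}, g_{i+1}) \leftarrow \text{ModelUpdate}(\theta_i, g_i + \tilde{g}_i, i)$
    Output $\theta_{i+1}$

Client-side protocol for User($j$):
  parameters: check-in window $R_j$, check-in probability $p_j$, loss function $\ell$, local randomizer $\mathcal{A}_{ldp}$
  private inputs: datapoint $d_j$
  if a $p_j$-biased coin returns heads then
    Check-in with the server at time $I$ chosen uniformly at random from $R_j$
    if receive request for update to model $\theta_I$ then
      $\tilde{g}_I \leftarrow \mathcal{A}_{ldp}(\nabla_\theta \ell(d_j, \theta_I))$
      Send $\tilde{g}_I$ to server

ModelUpdate($\theta$, $g$, $i$):
  parameters: batch size $b$, learning rate $\eta$
  if $i \bmod b = 0$ then
    return $(\theta - \frac{\eta}{b} g, \; 0)$   // Gradient descent step
  else
    return $(\theta, \; g)$   // Skip update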


Algorithm 1 – Distributed DP-SGD with random check-ins (fixed window).
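To make the control flow of Algorithm 1 concrete, here is a small single-process simulation. It is only a sketch under simplifying assumptions of our own: the model is a scalar, the loss is the toy squared loss $\frac{1}{2}(\theta - d)^2$ (so the gradient is $\theta - d$), the local randomizer clips to `[-clip, clip]` and adds Gaussian noise with standard deviation `sigma`, and all names are hypothetical.

```python
import random


def run_fixed_window(data, m, p0, lr, clip, sigma, batch=1, seed=0):
    """Simulate distributed DP-SGD with random check-ins into the window [m]."""
    rng = random.Random(seed)

    # Client side: each client flips a p0-biased coin and, on heads,
    # checks in at a slot chosen uniformly at random among the m slots.
    checked_in = [[] for _ in range(m)]
    for j in range(len(data)):
        if rng.random() < p0:
            checked_in[rng.randrange(m)].append(j)

    def randomizer(grad):
        # Toy local randomizer: clip the scalar gradient, add Gaussian noise.
        clipped = max(-clip, min(clip, grad))
        return clipped + rng.gauss(0.0, sigma)

    theta, accumulator, dummies = 0.0, 0.0, 0
    models = []
    for i in range(1, m + 1):
        slot = checked_in[i - 1]
        if not slot:
            noisy = randomizer(0.0)  # dummy gradient: randomizer on zero
            dummies += 1
        else:
            j = rng.choice(slot)  # only one checked-in client is queried
            noisy = randomizer(theta - data[j])
        accumulator += noisy
        if i % batch == 0:  # gradient descent step once per minibatch
            theta -= (lr / batch) * accumulator
            accumulator = 0.0
        models.append(theta)
    return models, dummies
```

A call such as `run_fixed_window(data, m=100, p0=0.1, lr=0.05, clip=1.0, sigma=0.5)` returns the sequence of released models together with the number of dummy updates. Note that the server never samples clients itself; it only chooses among those who volunteered for the current slot.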

3.2 Privacy Analysis

From a privacy standpoint, Algorithm 1 shares an important pattern with DP-SGD: each model update uses noisy gradients obtained from a random subset of the population. However, there exist two key factors that make the privacy analysis of our protocol more challenging than the existing analysis based on subsampling and shuffling. First, unlike in the case of uniform sampling where the randomness in each update is independent, here there is a correlation induced by the fact that clients that check-in into one step cannot check-in into a different step. Second, in shuffling there is also a similar correlation between updates, but there we can ensure each update uses the same number of datapoints, while here the server does not control the number of clients that will check-in into each individual step. Nonetheless, the following result shows that random check-ins provides a factor of privacy amplification comparable to these techniques.

Theorem 3.2 (Amplification via random check-ins into a fixed window).

Suppose is an -DP local randomizer. Let be the protocol from Algorithm 1 with check-in probability and check-in window for each client . For any , algorithm is -DP with . In particular, for and , we get . Furthermore, if is -DP with , then is -DP with and .

Remark 1

We can always increase privacy in the above statement by decreasing the check-in probability $p_0$. However, this will also increase the number of dummy updates, which suggests choosing $p_0 = \Theta(m/n)$. With such a choice, we obtain an amplification factor of $\sqrt{m}/n$. Critically, however, exact knowledge of the population size is not required to have a precise DP guarantee above.

Remark 2

At first look, the amplification factor of may appear stronger than the typical factor obtained via uniform subsampling/shuffling. Note that one run of our technique provides updates (as opposed to updates via the other methods). When the server has sufficient capacity, we can set to recover a amplification. The primary advantage of our approach is that we can benefit from amplification in terms of even if only a much smaller number of updates are actually processed. We can also extend our approach to recover the amplification even when the server is rate limited (), by repeating the protocol adaptively times to get Corollary 3.3 from Theorem 3.2 and applying advanced composition for DP [17].

Corollary 3.3.

For algorithm described in Theorem 3.2, suppose is an -DP local randomizer s.t. , and . Setting , and running repetitions of results in a total of updates, along with an overall central -DP guarantee with and , where hides polylog factors in and .

Comparison to Existing Privacy Amplification Techniques

Table 1 provides a comparison of the bound in Corollary 3.3 to other existing techniques, for performing one epoch of training (i.e., using one update from each client). Note that for this comparison, we assume that , since for all the shown amplification bounds can be written as . “None” denotes a naïve scheme (with no privacy amplification) where each client is used exactly once in an arbitrary order. Also, note that in general, the guarantees via privacy amplification by subsampling/shuffling apply only under the assumption of complete participation availability of each client. (Footnote 6: By a complete participation availability for a client, we mean that the client should be available to participate when requested by the server for any time step(s) of training.) Thus, they define the upper limits of achieving such amplifications. Also, note that even though the bound in Corollary 3.3 appears better than amplification via shuffling, our technique does include dummy updates which do not occur in the other techniques. For linear optimization problems, it is easy to see that our technique will add a factor of more noise as compared to the other two privacy amplification techniques at the same privacy level.

Source of Privacy Amplification for Central DP None [14, 33] Uniform subsampling [25, 8, 1] Shuffling [19] Shuffling (Theorem 5.1, This paper) Random check-ins with a fixed window (Theorem 3.2, This paper)
Table 1: Comparison with existing amplification techniques for a data set of size , running iterations of DP-SGD with batch size of 1 and -DP local randomizers. For ease of exposition, we assume , and hide polylog factors in and .

Proof Sketch for Theorem 3.2

Here, we provide a summary of the argument used to prove Theorem 3.2 in the case . (Footnote 7: Full proofs for every result in the paper are provided in Appendix A.) First, note that it is enough to argue about the privacy of the sequence of noisy gradients by post-processing. Also, the role each client plays in the protocol is symmetric, so w.l.o.g. we can consider two datasets differing in the first position. Next, we imagine that the last clients make the same random check-in choices in and . Letting denote the number of such clients that check-in into step , we model these choices by a pair of sequences where is the data record of an arbitrary client who checked-in into step (with representing a “dummy” data record if no client checked-in), and represents the probability that client ’s data will be picked to participate in the protocol at step if she checks-in in step . Conditioned on these choices, the noisy gradients produced by can be obtained by: (1) initializing a dataset ; (2) sampling , and replacing with in w.p. ; (3) producing the outputs by applying a sequence of -DP adaptive local randomizers to by setting . Here each of the uses all past gradients to compute the model and return .

The final step involves a variant of the amplification by swapping technique [19, Theorem 8] which we call amplification by probable replacement. The key idea is to reformulate the composition of the applied to the random dataset , to a composition of mechanisms of the form . Mechanism uses the gradient history to compute and returns with probability , and otherwise. Note that before the process begins, we have for every ; our analysis shows that the posterior probability after observing the first gradients is not too far from the prior: . The desired bound is then obtained by using the overlapping mixtures technique [5] to show that is -DP with respect to changes on , and heterogeneous advanced composition [24] to compute the final of composing the adaptively.

3.3 Utility Analysis

Proposition 3.4 (Dummy updates in random check-ins with a fixed window).

For algorithm described in Theorem 3.2, the expected number of dummy updates performed by the server is at most . For if , we get at most expected dummy updates.
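The quantity in Proposition 3.4 can be explored numerically. The sketch below assumes the setting of Algorithm 1 with every client using the same check-in probability `p0` and the full window of `m` slots. Under that assumption, a given slot stays empty with probability $(1 - p_0/m)^n$, so by linearity of expectation the expected number of empty slots is $m (1 - p_0/m)^n$. This is our own elementary calculation, offered as a sanity check; it is not a restatement of the bound in the proposition, and the function names are hypothetical.

```python
import random


def expected_empty_slots(n, m, p0):
    """Exact expected number of slots with no check-in (by linearity)."""
    return m * (1.0 - p0 / m) ** n


def simulate_empty_slots(n, m, p0, trials, seed=0):
    """Monte Carlo estimate of the expected number of empty slots."""
    rng = random.Random(seed)
    total_empty = 0
    for _ in range(trials):
        occupied = set()
        for _ in range(n):
            if rng.random() < p0:  # client decides to check in
                occupied.add(rng.randrange(m))  # uniform slot in the window
        total_empty += m - len(occupied)
    return total_empty / trials
```

Comparing `expected_empty_slots(n, m, p0)` with `simulate_empty_slots(n, m, p0, trials)` for a few parameter settings shows how lowering `p0` trades fewer real updates for more dummy ones.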

Utility for Convex ERMs

We now instantiate our amplification theorem (Theorem 3.2) in the context of differentially private empirical risk minimization (ERM). For convex ERMs, we will show that DP-SGD [34, 8, 1] in conjunction with our privacy amplification theorem (Theorem 3.2) is capable of achieving the optimal privacy/accuracy trade-offs [8].

Theorem 3.5 (Utility guarantee).

Suppose in algorithm described in Theorem 3.2 the loss is -Lipschitz and convex in its second parameter and the model space has dimension and diameter , i.e., . Furthermore, let be a distribution on , define the population risk , and let . If is a local randomizer that adds Gaussian noise with variance , and the learning rate for a model update at step is set to be , then the output of on a dataset containing i.i.d. samples from satisfies (Footnote 8: Here, hides a polylog factor in .)

Remark 3

Note that as , it is easy to see for that Theorem 3.5 achieves the optimal population risk trade-off [8, 7].

4 Variations: Thrifty Updates, and Sliding Windows

This section presents two variants of the main protocol from the previous section. The first variant makes a better use of the updates provided by each user at the expense of a small increase in the privacy cost. The second variant allows users to check-in into a sliding window to model the case where different users might be available during different time windows.

4.1 Leveraging Updates from Multiple Users

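Server-side protocol:
  parameters: total update steps $m$
  Initialize model $\theta_1$
  for $i \in [m]$ do
    $S_i \leftarrow \{ j : \text{User}(j) \text{ checked in at time } i \}$
    if $S_i$ is empty then
      $\theta_{i+1} \leftarrow \theta_i$
    else
      $\tilde{g}_i \leftarrow 0$
      for $j \in S_i$ do
        Request User($j$) for update to model $\theta_i$
        Receive $\tilde{g}_{i,j}$ from User($j$)
        $\tilde{g}_i \leftarrow \tilde{g}_i + \tilde{g}_{i,j}$
      $\theta_{i+1} \leftarrow \theta_i - \frac{\eta}{|S_i|} \tilde{g}_i$
    Output $\theta_{i+1}$

Algorithm 2 – Distributed DP-SGD with random check-ins (averaged updates).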

Now, we present a variant of Algorithm 1 which, at the expense of a mild increase in the privacy cost, removes the need for dummy updates, and for discarding all but one of the clients checked-in at every time step. The server-side protocol of this version is given in Algorithm 2 (the client-side protocol is identical as Algorithm 1). Note that here, if no client checked-in at some step , the server simply skips the update. Furthermore, if at some step multiple clients checked in, the server requests gradients from all the clients, and performs a model update using the average of the submitted noisy gradients.
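The averaged-update variant described above can be simulated in the same toy setting used for illustration elsewhere: a scalar model, squared loss with gradient $\theta - d$, and a clip-plus-Gaussian-noise local randomizer. These modeling choices, the default check-in probability of 1, and all names are assumptions made for this sketch.

```python
import random


def run_averaged_updates(data, m, lr, clip, sigma, p0=1.0, seed=0):
    """Simulate the server loop that averages all noisy gradients in a slot."""
    rng = random.Random(seed)

    # Client side: same random check-in as before, window = all m slots.
    checked_in = [[] for _ in range(m)]
    for j in range(len(data)):
        if rng.random() < p0:
            checked_in[rng.randrange(m)].append(j)

    def randomizer(grad):
        # Toy local randomizer: clip the scalar gradient, add Gaussian noise.
        clipped = max(-clip, min(clip, grad))
        return clipped + rng.gauss(0.0, sigma)

    theta, skipped = 0.0, 0
    models = []
    for slot in checked_in:
        if slot:
            # Every checked-in client contributes; the server averages.
            noisy = [randomizer(theta - data[j]) for j in slot]
            theta -= lr * sum(noisy) / len(noisy)
        else:
            skipped += 1  # no dummy gradient: the update is simply skipped
        models.append(theta)
    return models, skipped
```

Compared with the fixed-window protocol, no client who checks in is discarded and no noise-only updates are applied, which is the data-efficiency gain the text describes.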

These changes have the obvious advantage of reducing the noise in the model coming from dummy updates, and increasing the algorithm’s data efficiency by utilizing gradients provided by all available clients. The corresponding privacy analysis becomes more challenging because (1) the adversary gains information about the time steps where no clients checked-in, and (2) the server uses the potentially non-private count of clients checked-in at time when performing the model update. Nonetheless, we show that the privacy guarantees of Algorithm 2 are similar to those of Algorithm 1 with an additional factor, and the restriction of non-collusion among the participating clients. For simplicity, we only analyze the case where each client has check-in probability .

Theorem 4.1 (Amplification via random check-ins with averaged updates).

Suppose is an -DP local randomizer. Let be the protocol from Algorithm 2 performing averaged model updates with check-in probability and check-in window for each user . Algorithm is -DP with

where . In particular, for we get . Furthermore, if is -DP with , then is -DP with and .

Next, we provide a utility guarantee for in terms of the excess population risk for convex ERMs (similar to Theorem 3.5).

Theorem 4.2 (Utility guarantee).

Suppose in algorithm described in Theorem 4.1 the loss is -Lipschitz and convex in its second parameter and the model space has dimension and diameter , i.e., . Furthermore, let be a distribution on , define the population risk , and let . If is a local randomizer that adds Gaussian noise with variance , and the learning rate for a model update at step is set to be , then the output of on a dataset containing i.i.d. samples from satisfies

Furthermore, if the loss is -smooth in its second parameter and we set the step-size , then we have

Comparison to Algorithm 1 in Section 3: Recall that in we can achieve a small fixed by taking and , in which case the excess risk bound in Theorem 3.5 becomes . On the other hand, in we can obtain a fixed small by taking . In this case the excess risks in Theorem 4.2 are bounded by in the convex case, and by in the convex and smooth case. Thus, we observe that all the bounds recover the optimal population risk trade-offs from [8, 7] as , and for and non-smooth loss provides a better trade-off than , while on smooth losses and are incomparable. Note that (with ) will not attain a better bound on smooth losses because each update is based on a single data-point. Setting will reduce the number of updates to for , whereas to get an excess risk bound for for smooth losses where more than one data point is sampled at each time step will require extending the privacy analysis to incorporate the change, which is beyond the scope of this paper.

4.2 Random Check-Ins with a Sliding Window

The second variant we consider removes the need for all clients to be available throughout the training period. Instead, we assume that the training period comprises $n$ time steps, and each client $j \in [n]$ is only available during a window of $m$ time steps. Clients perform a random check-in to provide the server with an update during their window of availability. For simplicity, we assume clients wake up in order, one every time step, so client $j$ will perform a random check-in within the window $R_j = \{j, \ldots, j + m - 1\}$. The server will perform $n - m + 1$ updates starting at time $m$ to provide a warm-up period where the first $m$ clients perform their random check-ins.

Theorem 4.3 (Amplification via random check-ins with sliding windows).

Suppose is an -DP local randomizer. Let be the distributed algorithm performing model updates with check-in probability and check-in window for each user . For any , algorithm is -DP with . For and , we get . Furthermore, if is -DP with , then is -DP with and .

Remark 4

We can always increase privacy in the statement above by increasing the window size $m$. However, that also increases the number of clients who do not participate in training because their scheduled check-in time is before the process begins, or after it terminates. Moreover, the number of empty slots where the server introduces dummy updates will also increase, which we would want to minimize for good accuracy. Thus, $m$ introduces a trade-off between accuracy and privacy.

Proposition 4.4 (Dummy updates in random check-ins with sliding windows).

For algorithm described in Theorem 4.3, the expected number of dummy gradient updates performed by the server is at most .

5 Improvements to Amplification via Shuffling

Here, we provide an improvement on privacy amplification by shuffling. This is obtained using two technical lemmas (deferred to the supplementary material) to tighten the analysis of amplification by swapping, a central component in the analysis of amplification by shuffling given in [19].

Theorem 5.1 (Amplification via Shuffling).

Let , , be a sequence of adaptive -DP local randomizers. Let be the algorithm that given a dataset samples a uniform random permutation over , sequentially computes and outputs . For any , algorithm satisfies -DP with . Furthermore, if , , is -DP with , then satisfies -DP with and .
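The procedure analyzed in Theorem 5.1 (draw a uniform random permutation, then apply adaptive local randomizers to the records in that order) can be sketched as follows. Binary randomized response is used as a stand-in local randomizer purely for illustration; it ignores the history it is given, although the interface allows adaptivity. All names here are hypothetical.

```python
import math
import random


def shuffle_then_randomize(data, local_randomizer, rng):
    """Apply a local randomizer to the records in uniformly random order.

    Each call receives the tuple of all previous outputs, so the randomizer
    may be adaptive, as in the sequential setting described above.
    """
    order = list(range(len(data)))
    rng.shuffle(order)  # uniform random permutation of the indices
    outputs = []
    for idx in order:
        outputs.append(local_randomizer(tuple(outputs), data[idx], rng))
    return outputs


def randomized_response(history, bit, rng, eps0=1.0):
    """Toy pure-DP local randomizer on one bit; `history` is unused here."""
    keep = math.exp(eps0) / (1.0 + math.exp(eps0))
    return bit if rng.random() < keep else 1 - bit
```

An observer who sees only the output sequence cannot tell which position corresponds to which user, which is the source of the amplification the theorem quantifies.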

For comparison, the guarantee in [19, Theorem 7] in the case results in

6 Conclusion

Our work highlights the fact that proving DP guarantees for distributed or decentralized systems can be substantially more challenging than for centralized systems. In a distributed setting it becomes much harder to precisely control and characterize the randomness in the system, and this precise characterization and control of randomness is at the heart of DP guarantees. Specifically, production FL systems do not satisfy the assumptions that are typically made under state-of-the-art privacy accounting schemes, such as privacy amplification via subsampling. Without such accounting schemes, service providers cannot provide DP statements with small $\varepsilon$’s. This work, though largely theoretical in nature, proposes a method shaped by the practical constraints of distributed systems that allows for rigorous privacy statements under realistic assumptions.

Nevertheless, there is more to do. Our theorems are sharpest in the high-privacy regime (small $\varepsilon$’s), which may be too conservative to provide sufficient utility for some applications. While significantly relaxed from previous work, our assumptions will still not hold in all real-world systems. Thus, we hope this work encourages further collaboration between distributed systems and DP theory researchers in establishing protocols that address the full range of possible systems constraints as well as improving the full breadth of the privacy vs. utility Pareto frontier.


The authors would like to thank Vitaly Feldman for suggesting the idea of privacy accounting in DP-SGD via shuffling, and for help in identifying and fixing a mistake in the way a previous version of this paper handled $(\varepsilon_0, \delta_0)$-DP local randomizers.


Appendix A Omitted Results and Proofs

Lemma A.1.

Let be an -DP local randomizer. For , and , define to return with probability , and a sample from an arbitrary distribution over with probability . For any and any set of outcomes , we have


Fix a set of outcomes . By -LDP of , for any , we get


Now, for dataset and , we have:

where the third equality follows as , and the first inequality follows using inequality 1, and the fourth equality follows as . ∎

Lemma A.2.

Let be mechanisms of the form . Suppose there exist constants and such that each is -DP with . Then, for any , the -fold adaptive composition of is -DP with .


We start by applying the heterogeneous advanced composition for DP [24] for the sequence of mechanisms to get -DP for the composition, where


Let us start by bounding the second term in equation 2. First, observe that:


where the first inequality follows from .

Now, we have: