1 Introduction
There is an increasing concern that most current published research findings are false [falseResearch, statistCrisis]. This “crisis” is mainly due to the difficulty of analyzing large amounts of data. In particular, the statistical inference theory typically assumes that the tests/procedures to be performed are selected before the data are gathered, i.e., statistical independence. By contrast, the data analysis practice is an inherently adaptive process: new hypotheses and analyses are formulated based on the outcomes of previous tests on the same data.
To circumvent this issue, one could collect fresh samples of data for every new test performed, but this is often expensive and impractical. Alternatively, one could naively divide the original dataset into smaller subsets, and apply each new test to a new subset. However, this severely limits the amount of data available for each test, which in turn negatively affects its accuracy. A more common approach is to save one subset for testing, called the holdout set [genAdap], and to reuse it multiple times. This may, however, lead to overfitting to the holdout set itself. This can be observed, for example, in the machine learning competitions Kaggle [genAdap] or ImageNet [statValidity]. In these competitions, the participants are given a sequence of samples and are required to provide a model with good prediction capabilities. The model is then tested on a hidden test set by the organizers, and a score is returned to the participants, who can then submit a new model. This procedure is repeated until the end of the competition, at which point the best models are evaluated on yet another test set: the contrast between the final scores and those obtained on the continuously reused test set shows significant evidence of overfitting.
The difference between the performance of the model on the training data and on a fresh sample of data is called the generalization error in the machine learning literature. Standard approaches [stability] to bound this error rely on the notion of stability of learning algorithms. Roughly speaking, an algorithm generalizes well if small changes in its input lead to only small changes in its output (i.e., it is stable). A more recent line of work [statValidity, genAdap, algoStability], initiated by Dwork et al. [statValidity], takes adaptivity explicitly into account. These works mainly rely on the notion of differential privacy (which originated in the database security literature [DworkCalibrating]). Besides inducing a suitable notion of stability, differentially private algorithms behave well under composition. That is, a sequence of differentially private algorithms induces a differentially private algorithm, allowing us to make generalization guarantees even in the adaptive setting. However, as it comes from the privacy literature, differential privacy is quite restrictive, which led to the introduction of more relaxed notions (that still behave well under composition) such as max information [genAdap, maxInfo].
In this paper, we view the learning algorithms as conditional distributions, and provide generalization guarantees using the notion of maximal leakage. Maximal leakage is a secrecy metric that has appeared both in the computer security literature [CompSecQuantLeakage, LeakageGeneralized, CompSecAddMultMaxLeakage] and in the information theory literature [leakage]. It quantifies the leakage of information from a random variable $X$ to another random variable $Y$, and is denoted by $\mathcal{L}(X \to Y)$. The basic insight is as follows: if a learning algorithm leaks little information about the training data, then it will generalize well. Moreover, similarly to differential privacy, maximal leakage behaves well under composition: we can bound the leakage of a sequence of algorithms if each of them has bounded leakage. It is also robust against postprocessing. In addition, it is much less restrictive than differential privacy (e.g., maximal leakage is always bounded for finite random variables, whereas the differential privacy parameter can be infinite). Compared to max information, which is rarely given in closed form, maximal leakage is easier to compute and analyze. Indeed, it is simply given by the following formula (for finite $X$ and $Y$):
(1) $\mathcal{L}(X \to Y) = \log \sum_{y \in \mathcal{Y}} \max_{x \in \mathcal{X} : P_X(x) > 0} P_{Y|X}(y|x)$
1.1 Overview of Results
Let $X$ and $Y$ be two random variables distributed over $\mathcal{X}$ and $\mathcal{Y}$, respectively, and let $E \subseteq \mathcal{X} \times \mathcal{Y}$ be an event. Our first result bounds the probability of $E$ under the joint distribution $P_{XY}$ of $(X, Y)$ in terms of the marginal distribution $P_X$ of $X$, the fibers $E_y$ (i.e., $E_y = \{x : (x, y) \in E\}$), and the leakage $\mathcal{L}(X \to Y)$ from $X$ to $Y$:
(2) $P_{XY}(E) \leq \max_{y \in \mathcal{Y}} P_X(E_y) \cdot \exp(\mathcal{L}(X \to Y))$
This type of bound allows us to connect the probability of an event, measured when the dependence is considered, with the probability of the same event but under the assumption of statistical independence (with the same marginals). Whenever we have independence, we have $\mathcal{L}(X \to Y) = 0$. Hence $P_{XY}(E) \leq \max_{y} P_X(E_y)$, and we retrieve the bound that does not account for the dependence (examples of these bounds, exploited later on, are McDiarmid’s inequality and the significance level used in hypothesis testing to control the false discovery probability). That is, our bounds can recover the classical bounds of the non-adaptive setting. Otherwise, when independence is not satisfied, our bound introduces a multiplicative term that grows as the dependence of $Y$ on $X$ grows. The formal statement and the proof can be found in Section 3.1, Theorem 3.1.
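The bound can be sanity-checked numerically on a small finite example. The sketch below is only illustrative: the prior, the channel, and the helper names `joint_prob` and `max_fiber_prob` are assumptions made for this example, and `exp_leakage` uses the closed form $\exp(\mathcal{L}(X \to Y)) = \sum_y \max_{x : P_X(x) > 0} P_{Y|X}(y|x)$ for finite alphabets.

```python
from itertools import combinations, product

# Hypothetical toy setup: X uniform on {0, 1, 2}, Y in {0, 1}.
prior = [1 / 3, 1 / 3, 1 / 3]
channel = [[0.9, 0.1], [0.5, 0.5], [0.2, 0.8]]  # channel[x][y] = P(Y=y | X=x)
xs, ys = range(3), range(2)

# exp(maximal leakage): sum over outputs of the largest conditional probability.
exp_leakage = sum(max(channel[x][y] for x in xs if prior[x] > 0) for y in ys)

def joint_prob(event):
    """Probability of the event under the joint distribution of (X, Y)."""
    return sum(prior[x] * channel[x][y] for (x, y) in event)

def max_fiber_prob(event):
    """Largest marginal probability of a fiber E_y = {x : (x, y) in E}."""
    return max(sum(prior[x] for x in xs if (x, y) in event) for y in ys)

# Check the bound on every possible event (all subsets of the 6 pairs).
pairs = list(product(xs, ys))
all_events = [set(c) for r in range(len(pairs) + 1) for c in combinations(pairs, r)]
bound_holds = all(
    joint_prob(e) <= max_fiber_prob(e) * exp_leakage + 1e-12 for e in all_events
)

# One concrete event, for inspection.
example = {(0, 0), (1, 1)}
lhs = joint_prob(example)
rhs = max_fiber_prob(example) * exp_leakage
```

For the example event, the joint probability is $1.4/3$, each fiber has marginal probability $1/3$, and the leakage-corrected bound is $1.7/3$.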
With a suitable choice of what $X$, $Y$, and $E$ represent, we derive useful bounds in the contexts of both the generalization error and post-selection hypothesis testing. To wit, our main application is the following bound on the generalization error for 0-1 loss functions.
Theorem (Theorem 3.2 below).
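Let $\mathcal{X} \times \mathcal{Y}$ be the sample space and $\mathcal{H}$ be the set of hypotheses. Let $\mathcal{A}: \mathcal{X}^n \times \mathcal{Y}^n \to \mathcal{H}$ be a learning algorithm that, given a sequence $S$ of $n$ points, returns a hypothesis $h \in \mathcal{H}$. Suppose $S$ is sampled i.i.d. according to some distribution $\mathcal{D}$, i.e., $S \sim \mathcal{D}^n$. Let $E = \{(S, h) : |L_{\mathcal{D}}(h) - L_S(h)| > \eta\}$, where $L_{\mathcal{D}}$ and $L_S$ denote the true and the empirical error and $\eta \in (0, 1)$ is the accuracy. Then,
(3) $\mathbb{P}(E) \leq 2 \cdot \exp(\mathcal{L}(S \to \mathcal{A}(S)) - 2n\eta^2)$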
Theorem 3.2 offers a bound on the generalization error that can be exponentially decreasing with since the leakage cannot grow more than linearly in . For instance, if is DP with , we show that the above bound is exponentially decreasing. However, the bound applies more generally even if does not satisfy differential privacy for any . More details about this comparison can be found in Section 3.2.
Another connection between generalization error and informationtheoretic measures can be found in [learningMI] where, under the same assumption, it is shown that . Whenever our bound is exponentially decreasing in , the improvement in the sample complexity is exponential in the confidence of the bound. This comparison is further discussed in Section 3.4.
Our analysis also applies to settings more general than 0-1 loss functions:
Theorem (Theorem 3.4 below).
Let be an algorithm. Let be a random variable distributed over and let . Given , for every let satisfy . Then,
(4) 
The idea behind Theorem 3.4 is the following: suppose we have a collection of events and that each of these events has a small probability of occurring. Then, even when , the probability that remains small as long as is small. This implies that, in the spirit of [statValidity], given an algorithm with bounded maximal leakage, adaptive analyses involving can be thought as almost nonadaptive, up to a correction factor equal to .
In [statValidity, Theorem 11], under the assumption of being DP for some range of values of , it is shown that We thus provide a more general result, as the class of algorithms with bounded leakage is not restricted to the differentially private ones. Furthermore, by exploiting a connection between differential and maximal leakage, we show that our bound is tighter if . Moreover, an immediate consequence of the theorem is that , since , thus recovering the same statement of [genAdap, Theorem 9]. The details of the comparisons can be found in Sections 3.2 and 3.5.
This rewriting of our results allows us to apply them even in the context of adaptive hypothesis testing.
Theorem (Theorem 3.5 below).
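Let $\mathcal{A}: \mathcal{X}^n \to \mathcal{T}$ be an algorithm for selecting a test statistic $t \in \mathcal{T}$. Let $X$ be a random dataset over $\mathcal{X}^n$. Suppose that $\alpha \in [0, 1]$ is the significance level chosen to control the false discovery probability for the test statistic $t$. Denote with $E$ the event that $\mathcal{A}$ selects a statistic such that the null hypothesis is true but its p-value is at most $\alpha$ (the event that we make a false discovery using $\alpha$ as significance level). Then,
(5) $\mathbb{P}(E) \leq \alpha \cdot \exp(\mathcal{L}(X \to \mathcal{A}(X)))$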
Similar results based on approximate max-information are derived in [maxInfo]. However, max-information requires knowledge of the prior $P_X$ to be computed correctly. By contrast, to compute $\mathcal{L}(X \to Y)$ we only need to know the support of $X$ and partial information about the conditional distribution $P_{Y|X}$.
Other than providing generalization guarantees, one important characteristic of maximal leakage is that it is robust under postprocessing and composes adaptively.
Lemma (Robustness to postprocessing, Lemma 2.2 below).
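Let $\mathcal{A}: \mathcal{X} \to \mathcal{Y}$ and $\mathcal{B}: \mathcal{Y} \to \mathcal{Y}'$ be two algorithms. Then, $\mathcal{L}(X \to \mathcal{B}(\mathcal{A}(X))) \leq \mathcal{L}(X \to \mathcal{A}(X))$.
The implication is the following: no matter which postprocessing is applied to the outcome of the algorithm, the generalization guarantees provided by a bound on the leakage of $\mathcal{A}$ extend to $\mathcal{B} \circ \mathcal{A}$ as well, making maximal leakage a robust measure.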
Lemma (Adaptive composition, Lemma 2.5 below).
Consider a sequence of algorithms: such that for every the outputs of are inputs to . Then, denoting by the (random) outputs of the algorithm:
(6) 
From this we can derive that if each algorithm in the sequence has bounded leakage (and thus generalizes), the adaptive composition of the whole sequence also has bounded leakage (although with a worse bound), and can therefore retain its generalization guarantees and avoid overfitting. This property is fundamental for practical applications.
1.2 More Related Work
In addition to differentially private algorithms, Dwork et al. [genAdap] show that algorithms whose output can be described concisely generalize well. They further introduce max information to unify the analysis of both classes of algorithms. Consequently, one can provide generalization guarantees for a sequence of algorithms that alternate between differential privacy and short description. In [maxInfo], the authors connect max information with the notion of approximate differential privacy, but show that there are no generalization guarantees for an arbitrary composition of algorithms that are approximate-DP and algorithms with short description length.
With a more information-theoretic approach, bounds on the exploration bias and/or the generalization error are given in [explBiasMI, infoThGenAn, jiao2017dependence, explBiasLeak], using mutual information and other dependence measures.
1.3 Outline
The paper is organized as follows. In Section 2 we discuss the problem setting: we define adaptive data analysis in Section 2.1 and maximal leakage in Section 2.2, where we also present most of its properties (including robustness to postprocessing and adaptive composition). In Section 3 we present our main results: the generalization guarantees implied by a bound on the maximal leakage (Section 3.1) and their application to post-selection hypothesis testing (Section 3.3). We also compare maximal leakage to differential privacy, mutual information and max-information (Sections 3.2, 3.4 and 3.5). The proofs are deferred to the Appendices.
2 Problem setting
2.1 Adaptive Data Analysis
Before entering into the details of our results, we define the model of adaptive composition we will be considering throughout the exposition and already used in [statValidity, genAdap, maxInfo].
Definition 2.1.
Let be a set. Let be a random variable over . Let be a sequence of algorithms such that . Denote with . The adaptive composition of is an algorithm that takes as an input and sequentially executes the algorithms as described by the sequence
This level of generality allows us to formalize the behaviour of a data analyst who, after viewing the previous outcomes of the analysis performed, decides what to do next. A potential analyst would execute a sequence of algorithms that are known to have a certain property (e.g., generalize well) when used without adaptivity. The question we would like to address is the following: is this property also maintained by the adaptive composition of the sequence? The answer is not trivial since, for every $i$, the outcome of $\mathcal{A}_i$ depends both on $X$ and on the previous outputs, which themselves depend on the data. However, when this property is guaranteed by some measure that itself composes adaptively (like differential privacy or, as we will show soon, maximal leakage), then it can be preserved.
For the remainder of this paper, we will only consider finite sets, and $\log$ is taken to the base $e$.
2.2 Maximal Leakage
In this section, we review some basic properties of maximal leakage. As mentioned earlier, the main properties of (approximate) differential privacy and max information that make them useful for adaptive data analysis are: 1) robustness to postprocessing, and 2) adaptive composition. We show that maximal leakage also satisfies these properties, with the advantage of being less restrictive than differential privacy, and easier to analyze than max information. Maximal leakage was introduced as a way of measuring the leakage from $X$ to $Y$, hence the following definition:
Definition 2.2 (Def. 1 of [leakage]).
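Given a joint distribution $P_{XY}$ on finite alphabets $\mathcal{X}$ and $\mathcal{Y}$, the maximal leakage from $X$ to $Y$ is defined as:
(7) $\mathcal{L}(X \to Y) = \sup_{U - X - Y - \hat{U}} \log \frac{\mathbb{P}(U = \hat{U})}{\max_{u \in \mathcal{U}} P_U(u)}$
where $U$ and $\hat{U}$ take values in the same finite, but arbitrary, alphabet.
It is shown in [leakage, Theorem 1] that:
(8) $\mathcal{L}(X \to Y) = \log \sum_{y \in \mathcal{Y}} \max_{x \in \mathcal{X} : P_X(x) > 0} P_{Y|X}(y|x)$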
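As a rough illustration of the closed form, the sketch below evaluates $\log \sum_y \max_{x : P_X(x) > 0} P_{Y|X}(y|x)$ for a channel stored as a nested list. The function name `maximal_leakage`, the row-per-input layout, and the two toy channels are assumptions made for this example; they are not taken from [leakage].

```python
import math

def maximal_leakage(channel, prior=None):
    """Maximal leakage (in nats) of a finite channel.

    channel[x][y] holds P(Y=y | X=x); prior[x] holds P(X=x).
    Only the support of the prior matters; if prior is None,
    every input is assumed to have positive probability.
    """
    if prior is None:
        prior = [1.0] * len(channel)
    support = [x for x, p in enumerate(prior) if p > 0]
    num_outputs = len(channel[0])
    # For each output symbol keep the largest conditional probability
    # over the inputs in the support, then sum over the outputs.
    total = sum(max(channel[x][y] for x in support) for y in range(num_outputs))
    return math.log(total)

# Toy channels (hypothetical, for illustration only).
independent = [[0.3, 0.7], [0.3, 0.7]]  # Y ignores X
identity = [[1.0, 0.0], [0.0, 1.0]]     # Y reveals X exactly

leak_independent = maximal_leakage(independent)
leak_identity = maximal_leakage(identity)
# Restricting the support to a single input removes all leakage.
leak_single_input = maximal_leakage(identity, prior=[1.0, 0.0])
```

The two extremes behave as expected: the channel that ignores its input leaks nothing, while the identity channel on a binary alphabet leaks $\log 2$ nats.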
Some important properties of the maximal leakage are the following:
Lemma 2.1 ([leakage]).
For any joint distribution on finite alphabets and ,

, with equality if and only if and are independent.

.

.
The following is a direct application of the first property of the lemma:
Lemma 2.2 (Robustness to postprocessing).
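Let $\mathcal{X}$ be the sample space and let $X$ be distributed over $\mathcal{X}$. Let $\mathcal{Y}$ and $\mathcal{Y}'$ be output spaces, and consider $\mathcal{A}: \mathcal{X} \to \mathcal{Y}$ and $\mathcal{B}: \mathcal{Y} \to \mathcal{Y}'$. Then, $\mathcal{L}(X \to \mathcal{B}(\mathcal{A}(X))) \leq \mathcal{L}(X \to \mathcal{A}(X))$.
The useful implication of this result is as follows: any generalization guarantee provided by $\mathcal{A}$ cannot be invalidated by further processing the output of $\mathcal{A}$. In order to analyze the behavior of the adaptive composition of algorithms in terms of maximal leakage, we first need the following definition of conditional maximal leakage.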
Definition 2.3 (Conditional Maximal Leakage [leakageLong]).
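Given a joint distribution $P_{XYZ}$ on alphabets $\mathcal{X} \times \mathcal{Y} \times \mathcal{Z}$, define:
(9) $\mathcal{L}(X \to Y | Z) = \sup_{U : U - X - Y | Z} \log \frac{\mathbb{P}(U = \hat{U}(Y, Z))}{\mathbb{P}(U = \tilde{U}(Z))}$
where $U$ takes values in an arbitrary finite alphabet and we consider $\hat{U}$ and $\tilde{U}$ to be the optimal estimators of $U$ given $(Y, Z)$ and $Z$, respectively.
Again, it is shown in [leakageLong] that:
(10) $\mathcal{L}(X \to Y | Z) = \log \max_{z : P_Z(z) > 0} \sum_{y \in \mathcal{Y}} \max_{x : P_{X|Z}(x|z) > 0} P_{Y|XZ}(y|x, z)$
(11) $\mathcal{L}(X \to (Y, Z)) \leq \mathcal{L}(X \to Y) + \mathcal{L}(X \to Z | Y)$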
A useful consequence of this inequality is the following result.
Lemma 2.3 (Adaptive Composition of Maximal Leakage).
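Let $\mathcal{A}: \mathcal{X} \to \mathcal{Y}$ be an algorithm such that $\mathcal{L}(X \to \mathcal{A}(X)) \leq k_1$. Let $\mathcal{B}: \mathcal{X} \times \mathcal{Y} \to \mathcal{Z}$ be an algorithm such that $\mathcal{L}(X \to \mathcal{B}(X, y)) \leq k_2$ for all $y \in \mathcal{Y}$. Then $\mathcal{L}(X \to (\mathcal{A}(X), \mathcal{B}(X, \mathcal{A}(X)))) \leq k_1 + k_2$.
The proof of this lemma relies crucially on the fact that maximal leakage depends on the marginal $P_X$ only through its support. In order to generalize the result to the adaptive composition of $n$ algorithms, we need to lift the property stated in inequality (11) to more than two outputs.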
Lemma 2.4.
Let and be random variables.
(12) 
An immediate application of Lemma 2.4 leads us to the following result.
Lemma 2.5.
Consider a sequence of algorithms: where for each , . Suppose that for all and for all , . Then, denoting by the (random) outputs of the algorithm:
(13) 
The proofs can be found in Appendix C.
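A small numerical check can make the composition property concrete in the two-algorithm case: the leakage of the pair of outputs should not exceed the leakage of the first algorithm plus the worst-case leakage of the second over the values it may receive from the first. The sketch below is illustrative only; the channels `A` and `B`, the helper `leakage`, and the use of the closed form (log of the sum of column maxima, valid when every input has positive probability) are assumptions chosen for this example.

```python
import math

def leakage(channel):
    """Maximal leakage in nats; channel[x][y] = P(y | x), all inputs in the support."""
    n_out = len(channel[0])
    return math.log(sum(max(row[y] for row in channel) for y in range(n_out)))

# Hypothetical first algorithm A: X in {0, 1, 2} -> Y1 in {0, 1}.
A = [[0.8, 0.2], [0.5, 0.5], [0.3, 0.7]]

# Hypothetical second algorithm B, one channel per previously observed y1:
# B[y1][x][y2] = P(Y2 = y2 | X = x, Y1 = y1).
B = [
    [[0.6, 0.4], [0.5, 0.5], [0.4, 0.6]],
    [[0.9, 0.1], [0.7, 0.3], [0.7, 0.3]],
]

# Channel of the adaptive composition: the output is the pair (y1, y2).
composed = [
    [A[x][y1] * B[y1][x][y2] for y1 in range(2) for y2 in range(2)]
    for x in range(3)
]

k1 = leakage(A)                                 # leakage of the first algorithm
k2 = max(leakage(B[y1]) for y1 in range(2))     # worst case over the value of y1
total = leakage(composed)                       # leakage of the composition
```

In this instance `total` equals $\log 1.5$, while `k1 + k2` equals $\log 1.5 + \log 1.2$, so the additive bound holds with some slack.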
3 Main Results
Inspired by the results based on differential privacy [learningDp] and mutual information [learningMI], we derive new bounds using maximal leakage. The underlying intuition is the following: if the outcome of a learning algorithm leaks only little information from the training data, then it will generalize well. In addition to adaptive composition and robustness against postprocessing, maximal leakage has the following advantages:

it can be computed using only a high-level description of the conditional distribution $P_{Y|X}$, and it depends on $P_X$ only through its support;

it allows us to obtain an exponentially decreasing bound in the number of samples of the training set.
3.1 Low Leakage implies low generalization error
Our main result allows us to connect the probability of some event happening under the assumption of statistical independence between the input and output random variables and and the probability of the same event, taking into account the dependence of the output from the input. This connection relies on the Rényi divergence of order infinity between the distributions of and , the first one representing the statistical independence scenario and the second one considering the dependence between the two. (In [learningMI], a similar connection is considered, using KL divergence instead.)
Theorem 3.1.
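Let $P_{XY}$ be a distribution on the space $\mathcal{X} \times \mathcal{Y}$ and denote with $P_X$ the marginal distribution of the random variable $X$. Let $E \subseteq \mathcal{X} \times \mathcal{Y}$ be an event and, for every $y \in \mathcal{Y}$, denote with $E_y$ the set $\{x \in \mathcal{X} : (x, y) \in E\}$. Then,
(14) $P_{XY}(E) \leq \max_{y \in \mathcal{Y}} P_X(E_y) \cdot \exp(\mathcal{L}(X \to Y))$
The proof is in Appendix B. An immediate application can be found in statistical learning theory and relies on McDiarmid’s inequality. More precisely, the object of study is the generalization error of learning algorithms. To wit: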
Definition 3.1.
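Let $\mathcal{X}, \mathcal{Y}$ be two sets, respectively the domain set and the label set, and let $\mathcal{Z} = \mathcal{X} \times \mathcal{Y}$. Let $\mathcal{H}$ be the hypothesis class, i.e., a set of prediction rules $h: \mathcal{X} \to \mathcal{Y}$. Given $n \in \mathbb{N}$, a learning algorithm is a map $\mathcal{A}: \mathcal{Z}^n \to \mathcal{H}$ that, given as an input a finite sequence of domain point-label pairs $S = ((x_1, y_1), \ldots, (x_n, y_n)) \in \mathcal{Z}^n$, outputs some classifier $h = \mathcal{A}(S) \in \mathcal{H}$.
In order to estimate the capability of a learning algorithm to correctly classify unseen instances of the domain set, i.e., its generalization capability, the concept of generalization error is introduced.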
Definition 3.2.
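Let $\mathcal{D}$ be some distribution over $\mathcal{X} \times \mathcal{Y}$. The error (or risk) of a prediction rule $h$ with respect to $\mathcal{D}$ is defined as
(15) $L_{\mathcal{D}}(h) = \mathbb{P}_{(x, y) \sim \mathcal{D}}(h(x) \neq y)$
while, given a sample $S = ((x_1, y_1), \ldots, (x_n, y_n))$, the empirical error of $h$ with respect to $S$ is defined as
(16) $L_S(h) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}\{h(x_i) \neq y_i\}$
Moreover, given a learning algorithm $\mathcal{A}$, its generalization error with respect to $S$ is defined as:
(17) $|L_{\mathcal{D}}(\mathcal{A}(S)) - L_S(\mathcal{A}(S))|$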
Theorem 3.2.
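Let $\mathcal{X} \times \mathcal{Y}$ be the sample space and $\mathcal{H}$ be the set of hypotheses. Let $\mathcal{A}: \mathcal{X}^n \times \mathcal{Y}^n \to \mathcal{H}$ be a learning algorithm that, given a sequence $S$ of $n$ points, returns a hypothesis $h \in \mathcal{H}$. Suppose $S$ is sampled i.i.d. according to some distribution $\mathcal{D}$ over $\mathcal{X} \times \mathcal{Y}$, i.e., $S \sim \mathcal{D}^n$. Given $\eta \in (0, 1)$, let $E = \{(S, h) : |L_{\mathcal{D}}(h) - L_S(h)| > \eta\}$. Then,
(18) $\mathbb{P}(E) \leq 2 \cdot \exp(\mathcal{L}(S \to \mathcal{A}(S)) - 2n\eta^2)$
Proof.
Whenever $\mathcal{A}(S)$ is independent from the samples $S$ we have that $\mathcal{L}(S \to \mathcal{A}(S)) = 0$ and we immediately fall back to the non-adaptive scenario. We are hence proposing a generalization of these bounds whenever adaptivity is introduced. A concrete example can be seen with Theorem 3.2: in this particular case, whenever $\mathcal{A}(S)$ is independent from $S$ we immediately retrieve that $\mathbb{P}(E) \leq 2 \cdot \exp(-2n\eta^2)$, i.e., McDiarmid’s inequality with sensitivity $1/n$.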
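To get a feel for the numbers, the following sketch evaluates a bound of the form $2 \exp(\mathcal{L} - 2n\eta^2)$, which is the shape obtained by multiplying the McDiarmid tail for sensitivity $1/n$ by $\exp(\mathcal{L})$. It is only an illustration: the function names are hypothetical, and `samples_needed` additionally assumes that the leakage grows at most linearly in $n$, at a rate `leakage_rate` nats per sample chosen by the user.

```python
import math

def leakage_generalization_bound(leakage_nats, n, eta):
    """Evaluate 2 * exp(leakage - 2 * n * eta^2), capped at 1."""
    return min(1.0, 2.0 * math.exp(leakage_nats - 2.0 * n * eta ** 2))

def samples_needed(leakage_rate, eta, delta):
    """Smallest n with 2 * exp(n * leakage_rate - 2 * n * eta^2) <= delta.

    Assumes the leakage is at most n * leakage_rate. Returns None when
    leakage_rate >= 2 * eta^2, since the bound then does not decrease with n.
    """
    gap = 2.0 * eta ** 2 - leakage_rate
    if gap <= 0:
        return None
    return math.ceil(math.log(2.0 / delta) / gap)

# Non-adaptive baseline: zero leakage recovers the plain McDiarmid tail.
baseline = leakage_generalization_bound(0.0, n=1000, eta=0.05)

# Adaptive case with a hypothetical leakage of 0.001 nats per sample.
n_required = samples_needed(leakage_rate=0.001, eta=0.05, delta=0.05)
```

The helper makes the trade-off visible: the bound decays exponentially in $n$ only while the per-sample leakage stays below $2\eta^2$.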
3.2 Maximal Leakage and Differential Privacy
In the line of work started by Dwork et al. [statValidity, genAdap], the idea is to exploit the stability induced by differential privacy in order to derive generalization guarantees. The notion of algorithms that we will be considering is slightly more general than the notion of learning algorithms exposed before. Before proceeding to the comparison let us state a simple, but useful relationship between maximal leakage and pure differential privacy, proved in Appendix C.
Lemma 3.1.
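Let $\mathcal{A}: \mathcal{X}^n \to \mathcal{Y}$ be an $\epsilon$-differentially private randomized algorithm; then $\mathcal{L}(X \to \mathcal{A}(X)) \leq \epsilon \cdot n$.
This suggests an immediate application of Theorem 3.2. Indeed, suppose $\mathcal{A}$ is an $\epsilon$-DP algorithm, then:
(20) $\mathbb{P}(E) \leq 2 \cdot \exp(-n(2\eta^2 - \epsilon))$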
In order for the bound to be decreasing with , we need leading us to , where represents the accuracy of the generalization error and the privacy parameter. Thus, for fixed , as long as the privacy parameter is smaller than , we have guaranteed generalization capabilities for with an exponentially decreasing bound. For , it is shown in [statValidity, Theorem 9] that It is easy to check that, for large enough , our bound is tighter if . A more general result shown in [statValidity] is the following:
Theorem 3.3 (Thm. 11 of [statValidity]).
Let be an differentially private algorithm. Let be a random variable. Let be the corresponding output random variable. Assume that for every there is a subset such that . Then, for we have that
(21) 
The theorem just stated shows that, if we have a collection of events and each of these has a small probability of happening, then even considering , with , i.e. introducing adaptivity, the probability will remain small. This theorem, when applied together with McDiarmid’s inequality allows us to characterize the generalization capabilities of DP algorithms, by simply using as the righthand side of McDiarmid’s inequality. A rephrasing of Theorem 3.1 allows us to compare our results with Theorem 3.3:
Theorem 3.4.
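Let $\mathcal{A}: \mathcal{X} \to \mathcal{Y}$ be an algorithm. Let $X$ be a random variable distributed over $\mathcal{X}$ according to $P_X$ and let $Y = \mathcal{A}(X)$. Assume that for every $y \in \mathcal{Y}$ there is a subset $R(y) \subseteq \mathcal{X}$ with the property that $\mathbb{P}(X \in R(y)) \leq \beta$. Then,
(22) $\mathbb{P}(X \in R(Y)) \leq \exp(\mathcal{L}(X \to Y)) \cdot \beta$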
The proof is immediate once we have Theorem 3.1. The type of result we are providing here is qualitatively different from the ones derived with differential privacy. We do not pose any constraint on the algorithm itself but rather propose a way of estimating how the probabilities we are interested in change, by measuring the level of dependence we are introducing using maximal leakage.
Now, suppose we have an differentially private algorithm with . Theorem 3.3 provides a fixed bound of , while with our Theorems 3.4 and 3.1 we obtain that:
(23) 
Hence, whenever the privacy parameter is lower than we are able to provide a better bound. In the spirit of comparison with the differential privacyderived results, let us also state Theorem 3.2 with a general sensitivity :
(24) 
Corollary 7 of [genAdap] states that whenever an algorithm outputs a function of sensitivity and is DP then, denoting with a random variable distributed over and with we have that . It is easy to see that Theorem (24) provides a tighter bound whenever the accuracy . As already stated before, the family of algorithms with bounded maximal leakage is not restricted to algorithms that are differentially private: a simple example can be found in algorithms with a bounded range, since . This simple relation allows us to immediately retrieve another result stated in [genAdap, Theorem 9]: , showing how Theorem 3.4 is more general than both Theorems 6 and 9 of [genAdap].
3.3 Postselection Hypothesis Testing
As already underlined in [maxInfo], most state-of-the-art results that rely on differential privacy provide interesting guarantees only for low-sensitivity functions. They cannot be applied, for instance, to the problem of adaptively performing hypothesis tests while providing statistical guarantees. The reason for this is that p-values have a large sensitivity [maxInfo, Lemma B.1]. The theorem proven in [algoStability] that relies on DP, when applied to p-values, provides trivial error guarantees. When applying Theorem 3.2 to p-values, the result we get is the following:
(25) 
suggesting that, in order to get a meaningful bound, according to this measure of leakage, it is necessary to leak very little information about the dataset. More precisely, we need . Notice that in the same framework but without adaptivity, McDiarmid’s inequality itself is not able to provide a better bound than .
Consider instead the problem of bounding the probability of making a false discovery when the statistic to apply is selected by some data-dependent algorithm. In this context, the guarantees that allow one to upper-bound this probability by the significance level no longer hold. Using maximal leakage, it is possible to show how to adjust the significance level in these adaptive settings in order to have a guaranteed bound on the probability of error. As a corollary of Theorem 3.1 we obtain the following:
Theorem 3.5.
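Let $\mathcal{A}: \mathcal{X}^n \to \mathcal{T}$ be a data-dependent algorithm for selecting a test statistic $t \in \mathcal{T}$. Let $X$ be a random dataset over $\mathcal{X}^n$. Suppose that $\alpha \in [0, 1]$ is the significance level chosen to control the false discovery probability for the test statistic $t$. Denote with $E$ the event that $\mathcal{A}$ selects a statistic such that the null hypothesis is true but its p-value is at most $\alpha$. Then,
(26) $\mathbb{P}(E) \leq \alpha \cdot \exp(\mathcal{L}(X \to \mathcal{A}(X)))$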
This result suggests a very simple approach: if the analyst wishes to achieve a bound of on the probability of making a false discovery in adaptive settings, the significance level to be used should be no higher than . Suppose, as typically done in the field, that we wish to achieve an upperbound of . Whenever we intend to choose a statistic from a set , one straightforward approach would be to use as a significance value the quantity . With , this would imply choosing a significance level . The application of Theorem 3.5 is thus almost immediate. Once again, if is independent from , we recover the classical bound of . Also, Theorem 3.5 is similar to a result obtained in [maxInfo] and that uses the notion of approximate maxinformation instead.
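The adjustment described above can be sketched in a few lines. The code assumes that the false discovery probability is bounded by the significance level multiplied by $\exp(\mathcal{L})$, so that the level is shrunk by $\exp(-\mathcal{L})$; the function names and the target value are hypothetical. The worst-case helper uses the fact that maximal leakage never exceeds the logarithm of the number of possible outputs, here the number of candidate statistics.

```python
import math

def adjusted_significance(target, leakage_nats):
    """Significance level whose product with exp(leakage) equals the target."""
    return target * math.exp(-leakage_nats)

def worst_case_leakage(num_statistics):
    """Upper bound on the leakage of any selection rule with this many outputs."""
    return math.log(num_statistics)

target = 0.05

# No adaptivity (zero leakage): the level is left unchanged.
level_independent = adjusted_significance(target, 0.0)

# Selection among 10 candidate statistics with no further knowledge:
# the worst-case leakage log(10) yields a Bonferroni-like correction.
level_worst_case = adjusted_significance(target, worst_case_leakage(10))

# A selection rule known to leak only log(2) nats needs a milder correction.
level_low_leakage = adjusted_significance(target, math.log(2))
```

With these inputs the worst-case level is `target / 10`, while a selection rule leaking only $\log 2$ nats allows `target / 2`.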
3.4 Maximal Leakage and Mutual Information
One interesting result in the field, that connects the generalization error with Mutual Information, under the same assumptions of Theorem 3.2, is the following (Theorem 8 of [learningMI]):
(27) 
Let us compare this result with Theorem 3.2 in terms of sample complexity.
Definition 3.3.
Fix . Let be an hypothesis class. The sample complexity of with respect to , denoted by , is defined as the smallest for which there exists a learning algorithm such that, for every distribution over the domain yields
(28) 
If there is no such then .
From Theorem 3.2, it follows that using a sample size of yields a learner for with accuracy and confidence and this, in turn, implies that . Using the same reasoning with inequality (27), we get The reduction in the sample complexity, in this regime, is exponential in . Moreover, as shown in [learningMI], if we consider the case where and , we have that the VCdimension of is and, being , our bound recovers exactly the VCdimension bound [learningBook], which is always sharp.
3.5 Maximal Leakage and Max Information
One of the main motivations for the definition of approximate max-information is the generalization guarantee it provides, recalled here for convenience.
Lemma 3.2 (Generalization via MaxInformation, Thm. 13 of [genAdap]).
Let be a random dataset in and let be such that for some , then, for any event :
(29) 
The result looks quite similar to Theorem 3.2, but the two measures, max-information and maximal leakage, although related, can be quite different. In this section we analyze the connections and differences between the two measures, underlining the corresponding implications. Thanks to the constraint on the distributions that a bound on max-information imposes, we can formalize the following connection.
Theorem 3.6.
Let be a randomized algorithm such that . Then,
Proof.
. Having a bound of on the MaxInformation of means that for all and ; and this implies that ∎
With respect to approximate maxinformation instead, we can state the following.
Theorem 3.7.
Let be a randomized algorithm. Let be a random variable distributed over and let . For any
(30) 
Before showing this theorem we need the following, intermediate result. We denote with the distribution associated with .
Lemma 3.3 (Lemma 18 of [genAdap] ).
Let be two random variables over the same domain . If then .
Proof.
Fix any . Denote with , we want to show that using Lemma 3.3.
Notice that .
Using Markov Inequality we can proceed in the following way:
(31)  
(32)  
(33) 
Hence, ∎
Here, the relationship between the measures is inverted, but the role played by $\beta$ can lead to undesirable behaviours of approximate max-information. The following example, indeed, shows how approximate max-information can be unbounded while, in the discrete case, the maximal leakage between two random variables is always bounded by the logarithm of the smaller of the two alphabet cardinalities.
Example 3.1.
Let us fix a Suppose . We have that . For the approximate maxinformation we have: . It can thus be arbitrarily large.
Another interesting characteristic of maxinformation is that, differently from differential privacy, it can be bounded even if we have deterministic algorithms. It is easy to see that whenever there is a deterministic mapping and Differential Privacy is enforced on it, a lower bound on of is retrieved. Trying to relax it to Differential Privacy does not help either, as one would need rendering it practically useless. Maxinformation, instead, can be finite, and this observation is implied by the connection with what in literature is known as “description length” of an algorithm, and synthesized in the following result [genAdap]: Let be a randomized algorithm, for every ,
(34) 
In terms of generalization guarantees, according to Lemma 3.2, this translates to:
(35) 
In contrast, the bound on maximal leakage that bounded description length provides is the following:
Lemma 3.4.
Let be a randomized algorithm, then .
Lemma 3.4, along with Theorem 3.1, immediately translates into the following generalization guarantee:
(36) 
The bound is different from the one in (35) as on the righthand side we have
while in (35) appears . For practical purposes though, the quantity is typically bounded for every , as for instance when using McDiarmid’s inequality, like in Theorem 3.2. This would imply a bound on both the max and the expectation, and when such technique is used then our bound is clearly tighter as .
It is also worth noticing that (34) can be seen as a consequence of Theorem 3.7 and Lemma 3.4.
Once again, due to the presence of , maximal leakage can provide tighter bounds, as we can see from the following example.
Example 3.2.
Let be an algorithm. Denoting with the size of the range of , we can bound Approximate MaxInformation by while maximal leakage is always bounded by . Since is typically very small in the key applications, and the corresponding multiplicative factors in the bounds are and , the difference between the two bounds can be substantial.
It is also important to notice that the difference between the two measures is not restricted to deterministic mechanisms. The following is an example of a randomized mapping where maximal leakage is smaller than approximate max-information, for small $\beta$.
Example 3.3.
Consider and (the output of a Binary Erasure Channel, with erasure probability , when X is transmitted). More formally, we have that and the following randomized mapping: and . In this case, the Maximal Leakage is [leakageLong]; while, for Approximate MaxInformation one finds (after a series of computations) that: It is easy to see how for a fixed and for going to , Approximate MaxInformation approaches while Maximal Leakage is strictly smaller.
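The maximal leakage of an erasure channel can be checked numerically. For a binary erasure channel, the sum over outputs of the largest conditional probability is $2(1 - \epsilon) + \epsilon = 2 - \epsilon$, where $\epsilon$ is used here as an assumed symbol for the erasure probability, so the leakage should equal $\log(2 - \epsilon)$. The helper names below are hypothetical, and the approximate max-information of the same channel is not computed.

```python
import math

def bec_channel(erasure_prob):
    """Rows are inputs 0 and 1; columns are outputs 0, 1 and 'erased'."""
    keep = 1.0 - erasure_prob
    return [[keep, 0.0, erasure_prob], [0.0, keep, erasure_prob]]

def maximal_leakage(channel):
    """log of the sum over outputs of the largest conditional probability (nats)."""
    n_out = len(channel[0])
    return math.log(sum(max(row[y] for row in channel) for y in range(n_out)))

erasure_probs = [0.0, 0.25, 0.5, 1.0]
leakages = [maximal_leakage(bec_channel(e)) for e in erasure_probs]
closed_form = [math.log(2.0 - e) for e in erasure_probs]
```

The leakage decreases from $\log 2$ (no erasures, the input is fully revealed) to $0$ (everything erased), and it does not depend on the distribution of the input as long as both input symbols have positive probability.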
Since we are mostly interested in the regimes where $\beta$ is small, these examples show that maximal leakage provides tighter bounds and, since it is always bounded in discrete settings, makes our results more widely applicable.
Appendix A Definitions and Tools
Definition A.1.
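Let $X, Y$ be two discrete random variables defined over the same alphabet $\mathcal{X}$. The max-divergence of $X$ from $Y$ is defined as:
(37) $D_\infty(X \| Y) = \log \max_{x \in \mathcal{X}} \frac{\mathbb{P}(X = x)}{\mathbb{P}(Y = x)}$
while the approximate max-divergence is defined as:
(38) $D_\infty^\beta(X \| Y) = \log \max_{\mathcal{O} \subseteq \mathcal{X} : \mathbb{P}(X \in \mathcal{O}) > \beta} \frac{\mathbb{P}(X \in \mathcal{O}) - \beta}{\mathbb{P}(Y \in \mathcal{O})}$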
Definition A.2 (Differential Privacy [genAdap]).
Let . A randomized algorithm is said to be differentially private if, for all the pairs of datasets that differ in a single component we have that
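An important probability tool, often used in the field, is McDiarmid’s inequality, a concentration-of-measure result for functions of a certain sensitivity. We will say that a function $f: \mathcal{X}^n \to \mathbb{R}$ has sensitivity $c$ if, for every $i \in \{1, \ldots, n\}$ and all $x_1, \ldots, x_n, x_i' \in \mathcal{X}$,
(39) $|f(x_1, \ldots, x_i, \ldots, x_n) - f(x_1, \ldots, x_i', \ldots, x_n)| \leq c$
Lemma A.1 (McDiarmid’s inequality).
Let $X_1, \ldots, X_n$ be independent random variables taking values in the common alphabet $\mathcal{X}$.
Let $f: \mathcal{X}^n \to \mathbb{R}$ be a function of sensitivity $c$.
For every $\eta > 0$,
(40) $\mathbb{P}(f(X_1, \ldots, X_n) - \mathbb{E}[f(X_1, \ldots, X_n)] \geq \eta) \leq \exp\left(-\frac{2\eta^2}{n c^2}\right)$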
Definition A.3 (Def. 10 of [genAdap]).
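Let $X, Y$ be two random variables jointly distributed according to $P_{XY}$. Then, the max-information between $X$ and $Y$ is defined as follows:
(41) $I_\infty(X; Y) = D_\infty((X, Y) \| X \times Y)$
while the approximate max-information is defined as:
(42) $I_\infty^\beta(X; Y) = D_\infty^\beta((X, Y) \| X \times Y)$
where $X \times Y$ denotes a pair drawn from the product of the marginals of $X$ and $Y$.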
Appendix B Proof of Theorem 3.1
Proof.
Fix , we have the following two distributions on : . Denote with
By definition of Rényi Divergence of order we have [RenyiKLDiv],
(43) 
We can say that, for every :
(44)  
(45)  
(46) 
Taking expectation on both sides we get: