On the Optimality of Trees Generated by ID3

07/11/2019 ∙ by Alon Brutzkus, et al.

Since its inception in the 1980s, ID3 has become one of the most successful and widely used algorithms for learning decision trees. However, its theoretical properties remain poorly understood. In this work, we analyze the heuristic of growing a decision tree with ID3 for a limited number of iterations t, under the assumption that nodes are split as in the case of exact information gain and probability computations. In several settings, we provide theoretical and empirical evidence that the TopDown variant of ID3, introduced by Kearns and Mansour (1996), produces trees with optimal or near-optimal test error among all trees with t internal nodes. We prove optimality in the case of learning conjunctions under product distributions and learning read-once DNFs with 2 terms under the uniform distribution. Using efficient dynamic programming algorithms, we empirically show that TopDown generates trees that are near-optimal (∼1% difference from optimal test error) in a large number of settings for learning read-once DNFs under product distributions.


1 Introduction

Decision tree algorithms are widely used in various learning tasks and competitions. The most popular algorithms, which include ID3 (Quinlan, 1986) and its successors C4.5 and CART, use a greedy top-down approach to grow trees. In each iteration, ID3 chooses a leaf and replaces it with an internal node connected to two new leaves. This splitting operation is based on a splitting criterion, which promotes reduction of the training error. The popularity of this algorithm stems from its simplicity, interpretability and good generalization performance.

Despite its success in practice, the theoretical properties of trees generated by ID3 are not well understood. For example, consider the heuristic of running ID3 for a limited number of iterations $t$. In this case, the best guarantee one can hope for is that the generated tree has the lowest test error among all trees with $t$ internal nodes. ID3 may not generate an optimal tree with $t$ internal nodes; for example, this holds for learning the parity function under the uniform distribution (Kearns, 1996). However, to the best of our knowledge, there are no results which show under which conditions ID3 does generate a bounded-size tree whose error is close to the error of an optimal tree with the same number of internal nodes. The empirical success of ID3 may suggest that such conditions exist.

In this work, we analyze the optimality of the trees generated by TopDown (Kearns & Mansour, 1999), a variant of ID3, after running for $t$ iterations. TopDown is an implementation of ID3 in which, in each iteration, the leaf chosen to be split is the one with the largest gain (entropy reduction) weighted by the probability of reaching the leaf. We provide theoretical and empirical evidence that, in the case of exact gain and probability computations (see Remark 3.1 for a discussion of this assumption), TopDown does generate optimal or near-optimal trees in several settings.

On the theory side, we consider two settings for which we show that, for all $t$, TopDown generates the tree with optimal test error among all trees with $t$ internal nodes. We show this for learning conjunctions under product distributions and for learning read-once DNFs with two terms under the uniform distribution. Empirically, we devise efficient dynamic programming algorithms to calculate the optimal trees in a large number of settings for learning read-once DNFs under product distributions. For each DNF, we calculate the average, across $t$, of the difference between the test error of the generated tree and that of the optimal tree. We show that for all DNFs this average is small (on the order of 1%).

Our results suggest that for product distributions, the TopDown algorithm is a good choice for learning read-once DNFs using bounded-size decision trees. Surprisingly, to the best of our knowledge, TopDown is not widely used in practice. Rather, a similar variant (Shi, 2007), which we denote BestFirst, is used; for instance, it is used in WEKA (Hall et al., 2009). In each iteration, BestFirst chooses to split the leaf with the largest gain, without taking into account the probability of reaching the leaf. TopDown is a more natural choice than BestFirst, because in each iteration it induces a larger reduction of an upper bound on the training error (see Section 3 for further details). We further corroborate this in our settings and show, in theory and experiments, that TopDown has a clear advantage in performance over BestFirst.

2 Related Work

The ID3 algorithm was introduced by Quinlan (1986). There are a few papers which study its theoretical properties. The main difference between our work and previous ones is that we provide test-error guarantees for trees of practical size, by analyzing the trees generated by ID3 at each iteration. In contrast, previous works provide guarantees for ID3 only in the case of building a large tree which implements the target function exactly, or a tree of very large polynomial size.

The work most closely related to ours is Fiat & Pechyony (2004), which shows that ID3 can learn read-once DNFs and linear functions under the uniform distribution in polynomial time. They also show that ID3 builds the tree of minimal size among all trees that implement the ground-truth function. Their results require ID3 to build a tree which implements the ground truth exactly, which can result in a very large tree.

In a concurrent work, Brutzkus et al. (2019) use smoothed analysis to show that ID3 can learn juntas under product distributions in polynomial time. Their result requires building a tree of large polynomial size that implements the target function exactly. Furthermore, their analysis and techniques are different from ours. Their paper is included in the supplementary material.

Another related work is Kearns & Mansour (1999), which introduces the TopDown variant of ID3. They show that TopDown is a boosting algorithm, under the assumption that at each node there is a weak approximation of the target function. To get a test error guarantee of $\epsilon$, their result requires building a very large tree, which is highly impractical. Other works study the learnability of decision trees via algorithms that differ from those used in practice (O’Donnell & Servedio, 2007; Kalai & Teng, 2008; Bshouty & Burroughs, 2003; Bshouty et al., 2005; Ehrenfeucht & Haussler, 1989; Chen & Moitra, 2018) or show hardness results for learning decision trees (Rivest, 1987; Hancock et al., 1996).

3 Preliminaries

Distributional Assumptions: Let $\mathcal{X} = \{0,1\}^n$ be the domain and $\mathcal{Y} = \{0,1\}$ be the label set. Let $\mathcal{D}$ be a product distribution over $\mathcal{X}$ realizable by a read-once DNF. Namely, for an example $(x, y)$ it holds that $y = f(x)$, where $x$ is drawn from a product distribution over $\{0,1\}^n$ and $f$ is a read-once DNF. Recall that a read-once DNF is a DNF formula in which each variable appears at most once, e.g., $(x_1 \wedge x_2) \vee (x_3 \wedge x_4 \wedge x_5)$.
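To make this setup concrete, the following small sketch (ours, not the authors') represents a read-once DNF as a list of disjoint variable-index sets and evaluates it on samples from a product distribution; the class and function names and the example formula are illustrative assumptions, not notation from the paper.

import random

class ReadOnceDNF:
    """A read-once DNF given as a list of terms; each term is a set of
    variable indices, and every index appears in at most one term."""
    def __init__(self, terms):
        self.terms = [frozenset(t) for t in terms]

    def __call__(self, x):
        # f(x) = 1 iff some term has all of its variables set to 1.
        return int(any(all(x[i] for i in term) for term in self.terms))

def sample_product(p):
    """Draw x from the product distribution with P(x_i = 1) = p[i]."""
    return [int(random.random() < pi) for pi in p]

if __name__ == "__main__":
    f = ReadOnceDNF([{0, 1}, {2, 3, 4}])      # (x1 AND x2) OR (x3 AND x4 AND x5)
    p = [0.5] * 5                             # uniform distribution over {0,1}^5
    xs = [sample_product(p) for _ in range(100000)]
    print(sum(f(x) for x in xs) / len(xs))    # Monte Carlo estimate of P[f(x) = 1]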

Decision Trees: Let $T$ be any decision tree whose internal nodes are labeled with features $x_1, \dots, x_n$. For a node $v$ in the tree $T$, we let $w_T(v)$ be the probability that a randomly chosen $x$ reaches $v$ in $T$, and let $q_T(v)$ be the probability that $f(x) = 1$ given that $x$ reaches $v$. For convenience, we will usually omit the subscript $T$ from the latter definitions when the tree used is clear from the context. Let $L(T)$ be the set of leaves of $T$ and $N(T)$ be the set of internal nodes (non-leaves). We assume that each leaf $l$ is labeled $1$ if $q(l) \geq 1/2$ and $0$ otherwise. If $l \in L(T)$ and $x_i$ is a feature, we let $T_{l \to x_i}$ be the same as the tree $T$, except that the leaf $l$ is replaced with an internal node labeled by $x_i$ and connected to two leaves $l_0$ and $l_1$. The leaf $l_b$ corresponds to the assignment $x_i = b$, and each leaf is labeled according to the majority label with respect to $\mathcal{D}$ conditioned on reaching the leaf. For a leaf $l$, let $A(l)$ be the set of features that are not on the path from the root to $l$.

Let $\epsilon(T) = \Pr[T(x) \neq f(x)]$ be the error of the tree $T$. Then it holds that $\epsilon(T) = \sum_{l \in L(T)} w(l)\min\{q(l), 1 - q(l)\}$, where the minimum reflects the majority label at each leaf. Let $H(q) = -q\log q - (1-q)\log(1-q)$ be the binary entropy function, where the $\log$ is base $2$, and define the entropy of $T$ to be $H(T) = \sum_{l \in L(T)} w(l) H(q(l))$, which satisfies $\epsilon(T) \leq H(T)$.
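As a quick illustration of these quantities in code (a sketch under the notation reconstructed above, where $w(l)$ is the probability of reaching leaf $l$ and $q(l)$ is the conditional probability that $f(x) = 1$ there), the error and entropy of a tree are simple sums over its leaves, and the bound $\epsilon(T) \leq H(T)$ follows from $\min\{q, 1-q\} \leq H(q)$:

from math import log2

def H(q):
    """Binary entropy (base 2); H(0) = H(1) = 0."""
    return 0.0 if q in (0.0, 1.0) else -q * log2(q) - (1 - q) * log2(1 - q)

def tree_error(leaves):
    """leaves: list of (w, q) pairs, one per leaf, where w is the reach
    probability and q the conditional P[f(x) = 1]. Each leaf predicts the
    majority label, so its error mass is w * min(q, 1 - q)."""
    return sum(w * min(q, 1 - q) for w, q in leaves)

def tree_entropy(leaves):
    """H(T) = sum over leaves of w(l) * H(q(l)); since min(q, 1-q) <= H(q),
    this upper-bounds the error of the tree."""
    return sum(w * H(q) for w, q in leaves)

leaves = [(0.5, 0.0), (0.5, 0.34375)]          # a hypothetical two-leaf tree
assert tree_error(leaves) <= tree_entropy(leaves)
print(tree_error(leaves), tree_entropy(leaves))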

Algorithm: The TopDown algorithm, introduced by Kearns & Mansour (1999), is a variant of ID3 (Quinlan, 1986). For our analysis we assume that TopDown computes exact probabilities and information gains in each iteration. Thus, without loss of generality, we can assume that it has access to the distribution $\mathcal{D}$.

The main difference between TopDown and ID3 is the choice of the node to split in each iteration. TopDown chooses the split which maximally decreases $H(T)$ and therefore, hopefully, reduces $\epsilon(T)$ as well. Formally, in each iteration it chooses a leaf $l \in L(T)$ and a feature $x_i$, where $x_i \in A(l)$, which maximize

$$w(l)\Big[H(q(l)) - \Pr[x_i = 0 \mid l]\,H(q(l_0)) - \Pr[x_i = 1 \mid l]\,H(q(l_1))\Big], \qquad (1)$$

where $q(l_b)$ is the probability that $f(x) = 1$ given that $x$ reaches $l$ and $x_i = b$. We let $T_t$ be the tree computed by TopDown at iteration $t$. The algorithm is given in Figure 1.

Remark 3.1.

In this work we focus on the optimality of the trees that TopDown generates at each iteration. In practice, the number of iterations of algorithms such as TopDown is limited to avoid overfitting. Ultimately, we would like the algorithm to choose the leaf and feature in each iteration as in the case of exact gain and probability calculations. This case may occur in practice for a bounded-size tree where at each split there is sufficient data for accurate estimation. Thus, it is desirable to provide guarantees in this case, and we assume that it holds in our analysis. Notice that our analysis is different from the standard PAC setting, where sample complexity guarantees are given.

  Initialize $T$ to be a single leaf labeled by the majority label with respect to $\mathcal{D}$.
  while $T$ has fewer than $t$ internal nodes:
     $best \leftarrow 0$.
     for each pair $l \in L(T)$ and $x_i \in A(l)$:
        $g \leftarrow$ the weighted gain of splitting $l$ on $x_i$ (Equation 1).
        if $g > best$ then:
           $best \leftarrow g$; $l^* \leftarrow l$; $x^* \leftarrow x_i$.
     $T \leftarrow T_{l^* \to x^*}$.
  return $T$
Algorithm: TopDown($\mathcal{D}$, $t$)
Figure 1: The TopDown algorithm.
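For concreteness, here is a minimal executable sketch of this loop under exact computations (our own code, not the authors' implementation). It enumerates $\{0,1\}^n$ to obtain exact reach probabilities and conditional label probabilities, which is only feasible for small $n$; all function and variable names are ours.

from itertools import product
from math import log2

def H(q):
    return 0.0 if q in (0.0, 1.0) else -q * log2(q) - (1 - q) * log2(1 - q)

def stats(constraints, points):
    """Reach probability w and conditional P[f(x)=1] q of the leaf defined by
    `constraints` (a dict variable -> bit), given (x, prob, label) triples."""
    w = s = 0.0
    for x, pr, y in points:
        if all(x[i] == b for i, b in constraints.items()):
            w += pr
            s += pr * y
    return w, (s / w if w > 0 else 0.0)

def topdown(f, p, t):
    """Grow a tree with at most t internal nodes using the weighted-gain rule
    of Equation (1). A leaf is represented by its path constraints; the
    function returns (constraints, majority label) pairs."""
    n = len(p)
    points = []
    for x in product((0, 1), repeat=n):
        pr = 1.0
        for i, b in enumerate(x):
            pr *= p[i] if b else 1 - p[i]
        points.append((x, pr, f(x)))
    leaves = [dict()]                      # start from a single (root) leaf
    for _ in range(t):
        best, choice = 0.0, None
        for l in leaves:
            w, q = stats(l, points)
            for i in range(n):
                if i in l:
                    continue
                w0, q0 = stats({**l, i: 0}, points)
                w1, q1 = stats({**l, i: 1}, points)
                gain = w * H(q) - w0 * H(q0) - w1 * H(q1)   # weighted gain
                if gain > best:
                    best, choice = gain, (l, i)
        if choice is None:                 # no split decreases H(T); stop early
            break
        l, i = choice
        leaves.remove(l)
        leaves += [{**l, i: 0}, {**l, i: 1}]
    return [(l, int(stats(l, points)[1] >= 0.5)) for l in leaves]

def error(leaves, f, p):
    """Exact test error of the returned leaves under the product distribution p."""
    n = len(p)
    err = 0.0
    for x in product((0, 1), repeat=n):
        pr = 1.0
        for i, b in enumerate(x):
            pr *= p[i] if b else 1 - p[i]
        label = next(lab for constr, lab in leaves
                     if all(x[j] == b for j, b in constr.items()))
        err += pr * (label != f(x))
    return err

if __name__ == "__main__":
    g = lambda x: int((x[0] and x[1]) or (x[2] and x[3] and x[4]))
    p = [0.5] * 5
    print(error(topdown(g, p, t=4), g, p))

Exhaustive enumeration is used only to keep the sketch short; what matters for the analysis is the exact-computation assumption of Remark 3.1, not this particular implementation.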

4 Conjunctions and Product Distributions

In this section we consider learning a conjunction on $d$ out of $n$ bits with TopDown under a product distribution. We will show that, in the case of exact information gain computations, for each number of iterations $t$, TopDown generates the tree with the best test error among all trees with at most $t$ internal nodes.

4.1 Setup and Additional Notations

Let $I \subseteq [n]$ be a subset of indices such that $|I| = d$. In this section we assume the target function $f(x) = \bigwedge_{i \in I} x_i$. Note that $f$ is realizable by a depth-$d$ tree. Let $\mathcal{D}$ be the product distribution on $\{0,1\}^n$ defined in Section 3, and for each $i$ denote $p_i = \Pr[x_i = 1]$.

For any tree $T$, let $F(T)$ be the set of features that appear in its nodes. We let $\mathcal{T}_t$ be the set of all decision trees with at most $t$ internal nodes. We say that a binary tree is right-skewed if the left child of each internal node is a leaf with label 0. We denote by $\mathcal{R}_t$ the set of all right-skewed trees $T \in \mathcal{T}_t$ such that $F(T) \subseteq I$.
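As a concrete reference point for the analysis below (our own sketch, using the reconstructed notation $p_i = \Pr[x_i = 1]$), the test error of a right-skewed tree over a subset $S \subseteq I$ of the conjunction's features has a simple closed form: the tree outputs 0 unless all features in $S$ equal 1, so it can err only at the final all-ones leaf, which takes the majority label.

from math import prod

def right_skewed_error(p, I, S):
    """Test error of the right-skewed tree that splits on the features in S
    (a subset of the conjunction's index set I) under a product distribution
    with P(x_i = 1) = p[i]; every left child is a 0-labeled leaf."""
    assert set(S) <= set(I)
    reach = prod(p[i] for i in S)              # probability all chosen features are 1
    q = prod(p[j] for j in I if j not in S)    # P[f(x) = 1 | reached the last leaf]
    return reach * min(q, 1 - q)               # the last leaf takes the majority label

# For a fixed budget |S| = t, this is minimized by the t features with smallest p[i].
p = {0: 0.3, 1: 0.6, 2: 0.9}                   # hypothetical p_i values
I = [0, 1, 2]
print(right_skewed_error(p, I, [0]), right_skewed_error(p, I, [2]))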

4.2 Main Result

In this section we will provide a partial proof of the following theorem. The remaining details are deferred to the supplementary material.

Theorem 4.1.

Assume that ID3 runs for $t$ iterations. Then it outputs the tree with optimal test error among all trees in $\mathcal{T}_t$.

For the proof we will need the following key lemma which is used throughout our analysis. The proof is given in the supplementary material.

Lemma 4.2.

Let and . Then:

  1. and this inequality is strict if .

  2. .

The proof outline of Theorem 4.1 goes as follows. First, we show that the set of optimal trees in $\mathcal{T}_t$ intersects with the set $\mathcal{R}_t$ (Lemma 4.3). Then we will show that TopDown chooses features in $I$ in ascending order of $p_i$ (Lemma 4.4). In Lemma 4.5 we will prove that the tree found by TopDown has the best test error in the set $\mathcal{R}_t$. By combining all of these facts together we get the theorem.

Lemma 4.3.

For any $t$ there exists a right-skewed tree $T \in \mathcal{R}_t$ such that $T$ has the lowest test error among all trees in $\mathcal{T}_t$.

The proof idea is to use Lemma 4.2 to show that any tree in $\mathcal{T}_t$ can be converted to a right-skewed tree $T'$ such that $F(T') \subseteq I$, without increasing the test error. The full proof appears in the supplementary material.

The next lemma shows, using Lemma 4.2, that ID3 chooses features in $I$ in ascending order of $p_i$. The proof is given in the supplementary material.

Lemma 4.4.

Assume that ID3 runs for $t \leq d$ iterations. Then $T_t$ is right-skewed, its internal nodes are labeled by the $t$ features in $I$ with the smallest $p_i$, and its test error is $\big(\prod_{i=1}^{t} p_i\big)\min\big\{\prod_{i=t+1}^{d} p_i,\ 1 - \prod_{i=t+1}^{d} p_i\big\}$, assuming $p_1 \leq \dots \leq p_d$. (In the case that there are features $i < j$ in $I$ with $p_i = p_j$, we assume, without loss of generality, that TopDown chooses feature $x_i$ before $x_j$.)

The next lemma shows that the test error of the tree generated by ID3 at iteration $t$ is the lowest among all test errors of trees in $\mathcal{R}_t$.

Lemma 4.5.

The following equality holds: $\epsilon(T_t) = \min_{T \in \mathcal{R}_t} \epsilon(T)$.

Proof.

Define . Let such that and . By definition of , there exists and such that . Define . It suffices to prove that . Denote and . It holds that and . Since , and , we conclude that by Lemma 4.2. ∎

We are now ready to prove the theorem.

Proof of Theorem 4.1.

By Lemma 4.3, there exists a tree in $\mathcal{R}_t$ that has optimal test error among all trees in $\mathcal{T}_t$. By Lemma 4.4, ID3 generates a right-skewed tree $T_t$ with features in $I$, i.e., $T_t \in \mathcal{R}_t$. Finally, Lemma 4.5 implies that $\epsilon(T_t) = \min_{T \in \mathcal{R}_t}\epsilon(T)$, which proves the theorem. ∎

5 Read-Once DNF with 2 Terms and Uniform Distribution

In this section, we analyze the TopDown algorithm for learning read-once DNFs with 2 terms under the uniform distribution. Similarly to the previous section, we show that for each iteration $t$, TopDown generates a tree with the best test error among all trees with at most $t$ internal nodes. However, in this case the analysis is more involved.

5.1 Setup and Additional Notations

Learning Setup: We assume a Boolean target function consisting of two terms over disjoint sets of variables. WLOG, we only consider literals which are variables and not their negations; by symmetry, our analysis holds for all literal configurations. We assume that $\mathcal{D}$ is the uniform distribution over $\{0,1\}^n$. For convenience, we will denote $f = (x_1 \wedge \dots \wedge x_{d_1}) \vee (y_1 \wedge \dots \wedge y_{d_2})$, where each variable in this formula corresponds to an entry of the input. We say that the $x_i$ are $x$-variables and similarly define $y$-variables.

Additional Notations and Definitions: Let $T$ be any decision tree. Let $r$ be the root of $T$, and let $T_L$ and $T_R$ be the left and right sub-trees of $r$, respectively. For a node $v$, let $d(v)$ be its depth in the tree, where the depth of the root is 0. We let $F_v$ be the DNF formula corresponding to the node $v$: this is the ground-truth DNF conditioned on all the variable assignments on the path from the root to $v$. If a node is split with respect to an $x$-variable, then we say that it is an $x$-node; similarly, we define $y$-nodes. We refer to all remaining nodes, which are neither $x$-nodes nor $y$-nodes, as a third type. From now on, we consider trees whose nodes are one of these three types.

Let $\mathcal{T}_t$ be the set of all trees with at most $t$ internal nodes. For a node $v$, consider its split with respect to the variable with maximal information gain, and let $v_0$ be its left child and $v_1$ its right child after the split. We define the weighted gain of $v$ as $g(v) = w(v)\big[H(q(v)) - \tfrac{1}{2}H(q(v_0)) - \tfrac{1}{2}H(q(v_1))\big]$ (under the uniform distribution each branch has probability $\tfrac{1}{2}$). In each iteration, TopDown chooses the leaf for which $g(v)$ is maximal; equivalently, this is the leaf which maximally decreases $H(T)$. We also define the error reduction of the node $v$ to be $e(v) = w(v)\min\{q(v), 1-q(v)\} - w(v_0)\min\{q(v_0), 1-q(v_0)\} - w(v_1)\min\{q(v_1), 1-q(v_1)\}$, and the error reduction of $T$ to be $e(T) = \sum_{v \in N(T)} e(v)$. In each iteration in which a node $v$ is split, the test error is decreased by $e(v)$. Therefore, we get the following identity:

$$\epsilon(T) = \min\{q(r), 1 - q(r)\} - e(T), \qquad (2)$$

where the first term is the error of the single-leaf tree. By Equation 2, we can reason about $\epsilon(T)$ through $e(T)$: for $\epsilon(T)$ to be minimal, $e(T)$ must be maximal.

For a tree $T$ we define its right-path to be the set of nodes on its right-most path. If the right-path consists only of $x$-nodes, we say that it is a right $x$-path; similarly, we define a right $y$-path. We define $\mathcal{P}$ to be the set of all trees that consist only of a right-path, where the nodes are either all $x$-nodes or all $y$-nodes; we also say that these trees are right-paths. We define $\mathcal{S}$ to be the set of all trees such that for each node on the right-path the following holds: if it is an $x$-node, then its left sub-tree is a tree in $\mathcal{P}$ with $y$-nodes; similarly, if it is a $y$-node, then its left sub-tree is a tree in $\mathcal{P}$ with $x$-nodes. In Figure 2 we illustrate these sets of trees. For a tree $T' \in \mathcal{P}$ consisting of $x$-nodes, we say that $T'$ is a full right $x$-path if it has one node for each $x$-variable.

Figure 2: Two examples of trees in $\mathcal{S}$, shown in panels (a) and (b). $x$-nodes are in blue, $y$-nodes in green and leaves in red. The left sub-trees of nodes on the right-path are trees in $\mathcal{P}$.

5.1.1 Main Result

We will show that for each number of iterations $t$, TopDown builds a tree, which we denote by $T^*_t$, and this tree has the best test error among all trees with at most $t$ internal nodes. Formally, $T^*_t$ is a tree with $t$ internal nodes in $\mathcal{S}$ whose right-path consists only of $x$-nodes. The left sub-tree of each $x$-node is in $\mathcal{P}$ and consists only of $y$-nodes. Furthermore, for any two $x$-nodes $u$ and $v$ in the right-path such that $v$ is deeper than $u$, the following holds: the left sub-tree of $v$ is not a leaf only if the left sub-tree of $u$ is a full right $y$-path. Figure 3 shows $T^*_t$, for several values of $t$, for a concrete formula. We state our main result in the following theorem:

Theorem 5.1.

Let $t \geq 1$. Then $T_t = T^*_t$, and $T^*_t$ is a tree with the optimal test error in $\mathcal{T}_t$.

5.1.2 Proof Sketch of Theorem 5.1

The proof proceeds as follows. In the first part we show that for each iteration $t$, TopDown outputs the tree $T^*_t$, i.e., $T_t = T^*_t$ (Proposition 5.2). In the second part, we show that for any $t$, $T^*_t$ has minimum test error among all trees in $\mathcal{T}_t$ (Proposition 5.3). These two parts together prove Theorem 5.1.

We begin with the first part:

Proposition 5.2.

Assume TopDown runs for $t$ iterations. Then $T_t = T^*_t$.

The proof uses a result of Fiat & Pechyony (2004), which shows that in the setting of this section, for each node that ID3 splits, it chooses a variable of a minimal-size term of the corresponding DNF. For example, if the minimal-size term of the node's DNF is $x_1 \wedge x_2$, then ID3 chooses either $x_1$ or $x_2$ (they have the same gain due to the uniform distribution assumption). Then, the proof follows by several inequalities involving the entropy function. These inequalities arise from comparing the weighted gains of pairs of nodes and showing that their correctness implies that $T_t = T^*_t$. We defer the proof to the supplementary material.

Next, we show the following proposition.

Proposition 5.3.

Let $t \geq 1$. Then $T^*_t$ is a tree with optimal test error among all trees in $\mathcal{T}_t$.

Figure 3: Examples of $T^*_t$ for a concrete DNF; panels (a)–(d) correspond to increasing values of $t$. $x$-nodes are in blue, $y$-nodes in green and leaves in red. The tree in panel (d) has 0 test error.

The idea of the proof is to first show, by induction on $t$, that there exists an optimal tree in $\mathcal{S}$. Then, the proof proceeds by showing that any tree in $\mathcal{S}$ with $t$ internal nodes can be converted to $T^*_t$ without increasing the test error. To illustrate the latter part with a simple example, consider the case where $t = 5$. In this case the tree in Figure 2(b), which we denote by $T'$, is equal to $T^*_5$, whereas the tree in Figure 1(a), which we denote by $T''$, is a tree in $\mathcal{S}$ with 5 internal nodes which is not $T^*_5$. In this example, it can be shown by direct calculation that $\epsilon(T') \leq \epsilon(T'')$. However, to illustrate our proof in the general case, let $v'$ be the left child of the root in $T'$ and let $v''$ be the left child of the right child of the root in $T''$. One can show that $e(v') \geq e(v'')$, i.e., the node in $T'$ contributes at least as much error reduction as its counterpart in $T''$. By continuing this way for the rest of the nodes on the left sub-trees, we get $e(T') \geq e(T'')$, which implies by Equation 2 that $\epsilon(T') \leq \epsilon(T'')$. This technique of comparing the error reductions of nodes allows us to handle more complex cases, e.g., to show that the tree in Figure 1(b) is not optimal. The full proof is given in the supplementary material.

6 Empirical Results

In this section we present dynamic programming algorithms that allow us to calculate optimal trees efficiently in a large number of settings. We will need the following notation for this section. For a read-once DNF $f$ and an integer $t$, let $\mathrm{opt}_f(t)$ be the minimal test error over all trees with at most $t$ internal nodes, over $\mathcal{D}$ with ground-truth DNF $f$. For any $t$ we let $\epsilon_f(t)$ be the test error of the tree TopDown outputs after $t$ iterations assuming ground-truth $f$. We define $\Delta_f(t) = \epsilon_f(t) - \mathrm{opt}_f(t)$, its average over $t$, denoted $\bar{\Delta}_f$, and its maximum over $t$, denoted $\Delta_f^{\max}$.

6.1 Uniform Distribution

In this section we assume that $\mathcal{D}$ is the uniform distribution. Let $f$ be a read-once DNF with terms $T_1, \dots, T_k$; we refer to $f$ as a set of terms. For a term $T$, let $T'$ be a term with $|T'| = |T| - 1$ (under the uniform distribution only term sizes matter). Let $\mathcal{F}$ be the set of all read-once DNFs over the uniform distribution with at most 8 terms and at most 8 literals in each term. We use the following relation to compute optimal trees (splitting the root on a variable of term $T$ removes $T$ on the 0-branch and shrinks it to $T'$ on the 1-branch, each branch having probability $\tfrac{1}{2}$):

$$\mathrm{opt}_f(t) = \min_{T \in f}\ \min_{t_0 + t_1 = t - 1}\ \Big[\tfrac{1}{2}\,\mathrm{opt}_{f \setminus \{T\}}(t_0) + \tfrac{1}{2}\,\mathrm{opt}_{(f \setminus \{T\}) \cup \{T'\}}(t_1)\Big],$$

with $\mathrm{opt}_f(0) = \min\{\Pr[f = 1], \Pr[f = 0]\}$, and the conventions that a DNF containing an empty term is identically 1 and the empty DNF is identically 0.

The correctness of the formula follows since $\mathcal{D}$ is a product distribution and a sub-tree of an optimal tree is optimal; see the supplementary material for details. (Note that we do not need to consider trees with variables that are not in the DNF; see the supplementary material.) We calculated $\mathrm{opt}_f(t)$ for all $f \in \mathcal{F}$ and all relevant values of $t$. For each $f$ we calculated $\bar{\Delta}_f$ and $\Delta_f^{\max}$. Empirically, both quantities were small for all $f \in \mathcal{F}$. In Figure 4(a), we plot a histogram of $\bar{\Delta}_f$ for all $f \in \mathcal{F}$. These results show that for many read-once DNFs and numbers of iterations $t$, TopDown is near-optimal, with a difference from the optimal error of roughly 1%.
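A memoized sketch of this dynamic program (our own code, not the authors' implementation). Under the uniform distribution a read-once DNF is characterized by the multiset of its term sizes, and splitting on a variable of a term removes the term on the 0-branch and shortens it by one literal on the 1-branch, each branch having probability 1/2; the exact recurrence and parameter ranges used in the paper may differ in presentation.

from functools import lru_cache
from math import prod

def p_true(sizes):
    """P[f(x) = 1] under the uniform distribution for a read-once DNF
    whose terms have the sizes in `sizes` (terms are over disjoint variables)."""
    return 1.0 - prod(1.0 - 2.0 ** (-s) for s in sizes)

@lru_cache(maxsize=None)
def opt(sizes, t):
    """Minimal test error over trees with at most t internal nodes when the
    ground truth is a read-once DNF with term sizes `sizes` (a sorted tuple)."""
    if not sizes:                        # empty DNF: constant 0, a leaf is exact
        return 0.0
    if 0 in sizes:                       # an empty (satisfied) term: constant 1
        return 0.0
    q = p_true(sizes)
    best = min(q, 1.0 - q)               # option: stop and use a majority leaf
    if t == 0:
        return best
    for i, s in enumerate(sizes):        # root splits on a variable of term i
        rest = tuple(sorted(sizes[:i] + sizes[i + 1:]))
        shrunk = tuple(sorted(rest + (s - 1,)))
        for t0 in range(t):              # split the remaining budget t - 1
            cand = 0.5 * opt(rest, t0) + 0.5 * opt(shrunk, t - 1 - t0)
            best = min(best, cand)
    return best

# Example: f = (x1 AND x2) OR (x3 AND x4 AND x5), i.e. term sizes (2, 3).
print(opt((2, 3), 5))

Comparing opt(term_sizes, t) with the test error of a TopDown run (e.g., the sketch after Figure 1) for each t yields the quantities $\Delta_f(t)$ defined above, under our reconstruction of the notation.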

6.2 Product Distributions

In this section we assume that the distribution over the variables is a product distribution in which each variable has distribution $\mathrm{Bernoulli}(p)$ or $\mathrm{Bernoulli}(q)$ for fixed $p, q \in (0,1)$. We experimented with three pairs of $(p, q)$ values. Let $f$ be a read-once DNF with terms $T_1, \dots, T_k$; we refer to $f$ as a set of terms. For each term $T$, let $a(T)$ be the number of its variables with distribution $\mathrm{Bernoulli}(p)$ and similarly define $b(T)$ for $\mathrm{Bernoulli}(q)$. Denote by $T_{a,b}$ a term with $a$ variables of distribution $\mathrm{Bernoulli}(p)$ and $b$ variables of distribution $\mathrm{Bernoulli}(q)$. Let $\mathcal{F}'$ be the set of all read-once DNFs over $\mathcal{D}$ with at most 4 terms and at most 5 literals, where each literal is a variable (not its negation) which has distribution $\mathrm{Bernoulli}(p)$ or $\mathrm{Bernoulli}(q)$.

We use the following relation to compute optimal trees:

$$\mathrm{opt}_f(t) = \min_{T \in f}\ \min_{\rho}\ \min_{t_0 + t_1 = t - 1}\Big[(1 - \rho)\,\mathrm{opt}_{f \setminus \{T\}}(t_0) + \rho\,\mathrm{opt}_{(f \setminus \{T\}) \cup \{T'\}}(t_1)\Big], \qquad (3)$$

where $\rho \in \{p, q\}$ is the probability that the chosen split variable of $T$ equals 1 (so the split is on a $\mathrm{Bernoulli}(p)$ or a $\mathrm{Bernoulli}(q)$ variable of $T$), $T'$ is $T$ with that variable removed, and the base case and conventions are as in the uniform case.

The correctness of the formula follows similarly to the uniform case; see the supplementary material for details. We calculated $\mathrm{opt}_f(t)$ for all $f \in \mathcal{F}'$ and all relevant values of $t$. For each $f$ we calculated $\bar{\Delta}_f$ and $\Delta_f^{\max}$. Empirically, both quantities were small for all three $(p, q)$ pairs. In Figure 4 we plot histograms of $\bar{\Delta}_f$ over all $f \in \mathcal{F}'$ for the three pairs. As in the previous section, these results show that TopDown is near-optimal, with a difference from the optimal error of roughly 1%.
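The same memoization extends to the two-probability product distributions used here; a term is now characterized by its counts of $\mathrm{Bernoulli}(p)$ and $\mathrm{Bernoulli}(q)$ variables, and the two branches of a split have probabilities $1 - \rho$ and $\rho$ with $\rho \in \{p, q\}$. A hedged sketch (the names, structure, and example values below are ours, not the paper's):

from functools import lru_cache

@lru_cache(maxsize=None)
def opt_prod(terms, t, p, q):
    """terms: sorted tuple of (a, b) pairs, where a term has a Bernoulli(p)
    variables and b Bernoulli(q) variables; returns the minimal test error
    over trees with at most t internal nodes."""
    if not terms:                            # empty DNF: constant 0
        return 0.0
    if (0, 0) in terms:                      # an empty term: constant 1
        return 0.0
    prob1 = 1.0
    for a, b in terms:
        prob1 *= 1.0 - (p ** a) * (q ** b)   # term satisfied with prob p^a q^b
    prob1 = 1.0 - prob1
    best = min(prob1, 1.0 - prob1)           # option: a single majority leaf
    if t == 0:
        return best
    for i, (a, b) in enumerate(terms):
        rest = tuple(sorted(terms[:i] + terms[i + 1:]))
        options = []
        if a > 0:                            # split on a Bernoulli(p) variable
            options.append((p, (a - 1, b)))
        if b > 0:                            # split on a Bernoulli(q) variable
            options.append((q, (a, b - 1)))
        for rho, shrunk_term in options:
            shrunk = tuple(sorted(rest + (shrunk_term,)))
            for t0 in range(t):
                cand = ((1 - rho) * opt_prod(rest, t0, p, q)
                        + rho * opt_prod(shrunk, t - 1 - t0, p, q))
                best = min(best, cand)
    return best

# Illustrative (p, q) values; the pairs used in the paper are not reproduced here.
print(opt_prod(((1, 2), (2, 0)), 4, 0.5, 0.75))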

Figure 4: Empirical results for TopDown and BestFirst. The plots show histograms of $\bar{\Delta}_f$ values for all DNFs in the corresponding setting (one axis is in log-scale). (a) Performance of TopDown under the uniform distribution. (b) Performance of TopDown under a product distribution with the first $(p, q)$ pair. (c) Performance of TopDown and BestFirst under a product distribution with the second $(p, q)$ pair. (d) Performance of TopDown and BestFirst under a product distribution with the third $(p, q)$ pair.

7 Comparison with BestFirst

The BestFirst algorithm (Shi, 2007) is similar to TopDown but uses a different policy to choose leaves in each iteration. A version of BestFirst is used in WEKA (Hall et al., 2009). Instead of choosing the leaf and feature with maximal weighted gain (Equation 1), it chooses the leaf and feature with maximal gain, i.e., it maximizes the information gain without the $w(l)$ factor. This can degrade performance compared to TopDown. For example, consider learning the two-term formula of Section 5 under the uniform distribution. As shown in Section 5, after 5 iterations TopDown will generate the tree in Figure 2(b), and this is the optimal tree with 5 internal nodes. However, it can be shown that BestFirst can generate the tree in Figure 1(a) after 5 iterations (see the supplementary material for details). As shown in Section 5.1.2, the latter tree is sub-optimal. Empirically, we ran the experiments of Section 6.2 with BestFirst for two of the $(p, q)$ pairs. Figures 4(c) and 4(d) show that the trees generated by BestFirst can be far from optimal.
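Relative to the TopDown sketch after Figure 1, the only change BestFirst makes is in the leaf-selection score: it drops the $w(l)$ factor. A minimal illustration of the two scoring rules (names are ours):

from math import log2

def H(q):
    """Binary entropy, as in the earlier sketches."""
    return 0.0 if q in (0.0, 1.0) else -q * log2(q) - (1 - q) * log2(1 - q)

def split_score(w, q, w0, q0, w1, q1, weighted=True):
    """Score of splitting a leaf with reach probability w and conditional
    P[f(x)=1] = q into children with statistics (w0, q0) and (w1, q1).
    weighted=True  -> TopDown's weighted gain (Equation 1);
    weighted=False -> BestFirst's unweighted information gain."""
    gain = H(q) - (w0 / w) * H(q0) - (w1 / w) * H(q1)
    return w * gain if weighted else gain

With exact computations, the two rules can pick different leaves when a low-probability leaf has a large unweighted gain, which is what drives the difference in the example above.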

8 Conclusion

In this work we analyze the optimality of trees generated by the TopDown algorithm. We show through theory and experiments that in a large number of settings for learning read-once DNFs under product distributions, TopDown generates trees with optimal or near-optimal test error. There are many interesting directions for future work. First, it would be interesting to close the gap between our theory and experiments; we conjecture that in most cases, TopDown generates near-optimal trees for learning read-once DNFs under product distributions. It would also be interesting to consider other distributions with dependencies between variables, as well as other DNFs. Providing theoretical guarantees for random forests and gradient boosting is a challenging direction for future work. Following the results in Section 7, it would be interesting to see whether TopDown can be used to improve performance in practical applications.

References

  • Brutzkus et al. (2019) Brutzkus, A., Daniely, A., and Malach, E. Id3 learns juntas for smoothed product distributions. arXiv preprint arXiv:1906.08654, 2019.
  • Bshouty & Burroughs (2003) Bshouty, N. H. and Burroughs, L. On the proper learning of axis-parallel concepts. Journal of Machine Learning Research, 4(Jun):157–176, 2003.
  • Bshouty et al. (2005) Bshouty, N. H., Mossel, E., O’Donnell, R., and Servedio, R. A. Learning dnf from random walks. Journal of Computer and System Sciences, 71(3):250–265, 2005.
  • Chen & Moitra (2018) Chen, S. and Moitra, A. Beyond the low-degree algorithm: Mixtures of subcubes and their applications. arXiv preprint arXiv:1803.06521, 2018.
  • Ehrenfeucht & Haussler (1989) Ehrenfeucht, A. and Haussler, D. Learning decision trees from random examples. Information and Computation, 82(3):231–246, 1989.
  • Fiat & Pechyony (2004) Fiat, A. and Pechyony, D. Decision trees: More theoretical justification for practical algorithms. In International Conference on Algorithmic Learning Theory, pp. 156–170. Springer, 2004.
  • Hall et al. (2009) Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., and Witten, I. H. The weka data mining software: an update. ACM SIGKDD explorations newsletter, 11(1):10–18, 2009.
  • Hancock et al. (1996) Hancock, T., Jiang, T., Li, M., and Tromp, J. Lower bounds on learning decision lists and trees. Information and Computation, 126(2):114–122, 1996.
  • Kalai & Teng (2008) Kalai, A. T. and Teng, S.-H. Decision trees are pac-learnable from most product distributions: a smoothed analysis. arXiv preprint arXiv:0812.0933, 2008.
  • Kearns (1996) Kearns, M. Boosting theory towards practice: Recent developments in decision tree induction and the weak learning framework. In Proceedings of the National Conference on Artificial Intelligence, pp. 1337–1339, 1996.
  • Kearns & Mansour (1999) Kearns, M. and Mansour, Y. On the boosting ability of top–down decision tree learning algorithms. Journal of Computer and System Sciences, 58(1):109–128, 1999.
  • O’Donnell & Servedio (2007) O’Donnell, R. and Servedio, R. A. Learning monotone decision trees in polynomial time. SIAM Journal on Computing, 37(3):827–844, 2007.
  • Quinlan (1986) Quinlan, J. R. Induction of decision trees. Machine learning, 1(1):81–106, 1986.
  • Rivest (1987) Rivest, R. L. Learning decision lists. Machine learning, 2(3):229–246, 1987.
  • Shi (2007) Shi, H. Best-first decision tree learning. PhD thesis, The University of Waikato, 2007.

Appendix A Proofs for Section 4

A.1 Proof of Lemma 4.2

  1. Define . Then and by the fact that we get for :

    where the last inequality follows since and . This completes the proof.

  2. We will consider several cases. If then holds iff which is equivalent to . If then and and the claim holds. Finally, if then the desired inequality is equivalent to which holds since .

A.2 Proof of Lemma 4.3

Let be a tree that has the lowest test error among all trees in . We will construct from a right-skewed tree such that while not increasing the test error. If then we are done. This follows since for each node that is in the right-most path from the root to the right-most leaf in with feature in , its left sub-tree can be replaced with a left leaf with label , without increasing the test error. This results in a right-skewed tree with at most internal nodes. By adding more nodes with features in we cannot increase the test error. To see this, let and assume we add to as a right child of the right leaf in . Denote by the resulting tree. Then, because any right-skewed tree with , can only err in the case that for all . Similarly, . Let , , and . Then, by Lemma 4.2, we have , which is equivalent to . Therefore, we can get the desired tree .

Now assume that contains a node with a feature in . Let be such a node for which the tree rooted at contains, besides , only nodes with features in . Denote this sub-tree by . Then has the following structure. Without loss of generality, the right sub-tree and the left sub-tree of are both right-skewed (because otherwise we can replace each with a right leaf with label without increasing the test error). Consider the following modification to . Connect the left sub-tree of to the right-most leaf of , remove the node and replace it with its right child. Let , be the new right leaf in the tree (that was previously the right leaf of the left sub-tree of ). Choose the label for which results in lowest test error. Finally, remove nodes such that for each feature, there is at most one node with that feature in the path from the root to . Let be the tree obtained by this modification to . Then has one less node with feature in compared to . It remains to show that . This will finish the proof, because we can apply this modification multiple times until we have only features with nodes in . Then we can use the previous argument in the case that .

We will now show that . Let be the set of nodes in the path from the root to node in the tree , excluding . Let be the internal nodes in the right sub-tree of and be the internal nodes in the left sub-tree of . Recall that the left and right sub-tree are right-skewed. For any node with feature in let be the corresponding probability according to the label of in the path. Then we get the following:

(4)

where

This follows since is the error of the path in from the root to the right most leaf in the right sub-tree of node . Similarly, is the error of the path from the root to the right most leaf in the left sub-tree of node and is the error in the path in from the root to the new right leaf.

Let , , and . By Lemma 4.2, it holds that , or equivalently, . Similarly, we have . Hence, by Equation 4 we conclude that .

A.3 Proof of Lemma 4.4

We will first prove by induction that . For the base case . Assume that up until iteration , ID3 chose the features . First we note that since feature is independent of features in , and depends only on features in , it follows that for any iteration, the gain of feature is zero.

Now, for any the gain of feature is

where the last inequality follows from the concavity of and the fact that . Therefore, if we are done because TopDown will choose feature which has the only non-zero gain.

If then let . By setting , , and applying Lemma 4.2 we have , or equivalently, and the inequality is strict if . Therefore, has the largest gain in iteration and TopDown will choose it.

Finally, we note that the latter proof shows that TopDown builds a right-skewed tree. It follows that the test error of is .

Appendix B Proofs for Section 5

B.1 Proof of Proposition 5.2

We first prove several inequalities which involve the entropy function.

Lemma B.1.

Let , then

Proof.

Define . Since , it suffices to prove that for . We have,

First, we notice that . Therefore, we are left to show that

(5)

By the following 2 inequalities:

  1. .

  2. .

proving Equation 5 reduces to showing that , which is true, as desired.

Lemma B.2.

Let , and . Then,

Proof.

Define . We will prove that the inequality holds for all . The inequality can be proved to hold for the remaining cases by calculating with sufficiently high precision.

Since . It suffices to show that for . We have,