We consider a class of machine learning algorithms that use hierarchical structures of classifiers to reduce the computational complexity of training and prediction in large-scale problems characterized by a large number of labels. Problems of this type are often referred to as extreme classification (Prabhu and Varma, 2014). The hierarchical structure usually takes the form of a label tree in which a leaf corresponds to one and only one label. The nodes of the tree contain classifiers that direct the test examples from the root down to the leaf nodes. We study the subclass of these algorithms with probabilistic classifiers, i.e., classifiers with responses in the range $[0,1]$. Examples of such algorithms for multi-class classification include hierarchical softmax (HSM) (Morin and Bengio, 2005), as implemented for example in fastText (Joulin et al., 2017), and conditional probability estimation trees (Beygelzimer et al., 2009a). For multi-label classification this idea is known under the name of probabilistic label trees (PLTs) (Jasinska et al., 2016), and has been implemented in Parabel (Prabhu et al., 2018) and extremeText (Wydmuch et al., 2018). Note that the PLT model can be treated as a generalization of algorithms for both multi-class and multi-label classification (Wydmuch et al., 2018).
We present a wide spectrum of theoretical results concerning training and prediction costs of PLTs. We first define the multi-label problem (Section 2). Then, we define the PLT model and state some of its important properties (Section 3). As a starting point of our analysis, we define the training cost for a single instance as the number of nodes where it is involved in training classifiers (Section 4). The rationale behind this cost is that the learning methods often used to train the node classifiers scale linearly with the sample size. We note that, according to our definition, the popular 1-vs-All approach has cost equal to $m$, the number of labels. This cost can be significantly reduced by using PLTs. We then address the problem of finding a tree structure that minimizes the training cost (Section 5). We first show that the decision version of this problem is NP-complete (Section 5.1). Nevertheless, there exists an $O(\log m)$ approximation that can be computed in linear time (Section 5.2). We also consider two special cases: multi-class (Section 5.3) and multi-label with nested labels (Section 5.4), for which we obtain a constant approximation and an exact solution, respectively, both computed in time linear in $m$. We also consider the prediction cost, defined as the number of nodes visited during classification of a test example (Section 6). We first show that under additional assumptions prediction can be made in $O(\log m)$ time. Finally, we prove an upper bound on the expected prediction cost expressed in terms of the expected training cost and the statistical error of the node classifiers.
The problem of optimizing the training cost is closely related to the binary merging problem in databases (Ghosh et al., 2015). The hardness result in (Ghosh et al., 2015), however, does not generalize to our setting as it is limited to binary trees only. Nevertheless, our approximation result is partly based on the results from (Ghosh et al., 2015). The training cost we use is similar to the one considered in (Grave et al., 2017), but the authors there consider a specific class of shallow trees. The Huffman tree is a popular choice for HSM (many word2vec implementations (Mikolov et al., 2013) and fastText (Joulin et al., 2017) use binary Huffman trees). This strategy is justified as for multi-class with binary trees the Huffman code is optimal (Wydmuch et al., 2018). Surprisingly, the solution for the general multi-class case has been unknown prior to this work. The problem of learning the tree structure to improve the predictive performance is studied in (Jernite et al., 2017; Prabhu et al., 2018). Ideally, however, one would like to have a procedure that minimizes two objectives: the computational cost and statistical error.
2 Multi-label classification
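Let $\mathcal{X}$ denote an instance space, and let $\mathcal{L} = [m]$ be a finite set of $m$ class labels. We assume that an instance $\mathbf{x} \in \mathcal{X}$ is associated with a subset of labels $\mathcal{L}_{\mathbf{x}} \subseteq \mathcal{L}$ (the subset can be empty); this subset is often called the set of relevant labels, while the complement $\mathcal{L} \setminus \mathcal{L}_{\mathbf{x}}$ is considered as irrelevant for $\mathbf{x}$. We assume $m$ to be a large number (e.g., $m \ge 10^5$), but the size of the set of relevant labels $\mathcal{L}_{\mathbf{x}}$ is usually much smaller than $m$, i.e., $|\mathcal{L}_{\mathbf{x}}| \ll m$. We identify the set $\mathcal{L}_{\mathbf{x}}$ of relevant labels with the binary vector $\mathbf{y} = (y_1, \ldots, y_m)$, in which $y_j = 1 \Leftrightarrow j \in \mathcal{L}_{\mathbf{x}}$. By $\mathcal{Y} = \{0,1\}^m$ we denote the set of all possible label vectors. We assume that observations $(\mathbf{x}, \mathbf{y})$ are generated independently and identically according to a probability distribution $P(\mathbf{x}, \mathbf{y})$ defined on $\mathcal{X} \times \mathcal{Y}$. Observe that the above definitions include as special cases multi-class classification (where $\|\mathbf{y}\|_1 = 1$) and $k$-sparse multi-label classification (where $\|\mathbf{y}\|_1 \le k$). [Footnote 1: We use $[n]$ to denote the set of integers from $1$ to $n$, and $\|\mathbf{x}\|_1$ to denote the $L_1$ norm of $\mathbf{x}$.]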
We are interested in multi-label classifiers that estimate conditional probabilities of labels, $\eta_j(\mathbf{x}) = P(y_j = 1 \mid \mathbf{x})$, $j \in \mathcal{L}$, as accurately as possible, i.e., with the smallest possible $L_1$-estimation error $|\eta_j(\mathbf{x}) - \hat{\eta}_j(\mathbf{x})|$, where $\hat{\eta}_j(\mathbf{x})$ is an estimate of $\eta_j(\mathbf{x})$. This statement of the problem is justified by the fact that optimal predictions in terms of statistical decision theory for many performance measures used in multi-label classification, such as the Hamming loss, precision@k, and the micro- and macro-F-measure, are determined through the conditional probabilities of labels (Dembczyński et al., 2010; Kotlowski and Dembczyński, 2016; Koyejo et al., 2015).
3 Probabilistic label trees (PLTs)
We will work with the set $\mathcal{T}$ of rooted, leaf-labeled trees with $m$ leaves. We denote a single tree by $T$ and its set of leaves by $L_T$. The leaf $l_j \in L_T$ corresponds to the label $j \in \mathcal{L}$. The set of leaves of a (sub)tree rooted in an inner node $v$ is denoted by $L_v$. The parent node of $v$ is denoted by $\mathrm{pa}(v)$, and the set of child nodes by $\mathrm{Ch}(v)$. The path from node $v$ to the root is denoted by $\mathrm{Path}(v)$. The length of the path, i.e., the number of nodes on the path, is denoted by $\mathrm{len}_v$. The set of all nodes is denoted by $V_T$. The degree of a node $v \in V_T$, i.e., the number of its children, is denoted by $\deg_v = |\mathrm{Ch}(v)|$.
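PLT uses tree $T$ to factorize the conditional probabilities of labels, $\eta_j(\mathbf{x}) = P(y_j = 1 \mid \mathbf{x})$, for $j \in \mathcal{L}$. To this end let us define for every $\mathbf{y}$ a corresponding vector $\mathbf{z}$ of length $|V_T|$, whose coordinates, indexed by $v \in V_T$, are given by:
$$z_v = \Big[\!\Big[ \sum_{j \,:\, l_j \in L_v} y_j \ge 1 \Big]\!\Big], \quad \text{or equivalently} \quad z_v = \bigvee_{j \,:\, l_j \in L_v} y_j .$$
[Footnote 2: Note that $\mathbf{z}$ depends on $T$, but $T$ will always be obvious from the context.] [Footnote 3: We will also use leaves $l_j$ to index the elements of vector $\mathbf{y}$.]
With the above definition, it holds based on the chain rule that for any $v \in V_T$:
$$\eta_v(\mathbf{x}) = P(z_v = 1 \mid \mathbf{x}) = \prod_{v' \in \mathrm{Path}(v)} \eta(\mathbf{x}, v'), \tag{1}$$
where $\eta(\mathbf{x}, v) = P(z_v = 1 \mid z_{\mathrm{pa}(v)} = 1, \mathbf{x})$ for non-root nodes, and $\eta(\mathbf{x}, v) = P(z_v = 1 \mid \mathbf{x})$ for the root $r_T$ (see, e.g., Jasinska et al. 2016). Notice that for the leaf nodes we get the conditional probabilities of labels, i.e.,
$$\eta_{l_j}(\mathbf{x}) = \eta_j(\mathbf{x}), \quad \text{for } l_j \in L_T. \tag{2}$$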
The following result states the relation between probabilities of the parent node and its children.
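Proposition 1. For any $T$ and $P(\mathbf{y} \mid \mathbf{x})$, the probability $\eta_v(\mathbf{x})$ of any internal node $v \in V_T \setminus L_T$ satisfies:
$$\max\{\eta_{v'}(\mathbf{x}) : v' \in \mathrm{Ch}(v)\} \;\le\; \eta_v(\mathbf{x}) \;\le\; \min\Big\{1, \sum_{v' \in \mathrm{Ch}(v)} \eta_{v'}(\mathbf{x})\Big\}.$$
Proof. We first prove the first inequality. From the definition of tree $T$ and $z_v$, we have that $z_v = 1 \Rightarrow z_{\mathrm{pa}(v)} = 1$ since $L_v \subseteq L_{\mathrm{pa}(v)}$. Taking the expectation with respect to $P(\mathbf{y} \mid \mathbf{x})$, we obtain that $\eta_{v'}(\mathbf{x}) \le \eta_v(\mathbf{x})$ for every $v' \in \mathrm{Ch}(v)$.
For the second inequality, obviously we have $\eta_v(\mathbf{x}) \le 1$. Furthermore, if $z_v = 1$, then there exists at least one $v' \in \mathrm{Ch}(v)$ for which $z_{v'} = 1$. In other words, $z_v \le \sum_{v' \in \mathrm{Ch}(v)} z_{v'}$. Therefore, by taking the expectation with respect to $P(\mathbf{y} \mid \mathbf{x})$ we obtain $\eta_v(\mathbf{x}) \le \sum_{v' \in \mathrm{Ch}(v)} \eta_{v'}(\mathbf{x})$. ∎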
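The node indicators used in this argument can be illustrated with a small sketch. The toy tree, the node names, and the helper `node_indicators` below are hypothetical and serve only as an illustration: a node is marked positive exactly when at least one label in its subtree is relevant, and the pointwise version of the two inequalities (before taking expectations) is then checked on every internal node.

```python
from typing import Dict, List

# Hypothetical toy tree: internal nodes map to their children,
# leaves map to the index of the label they represent.
children: Dict[str, List[str]] = {
    "root": ["a", "b"],
    "a": ["l0", "l1"],
    "b": ["l2", "l3"],
}
leaf_label = {"l0": 0, "l1": 1, "l2": 2, "l3": 3}


def node_indicators(y, children, leaf_label, root="root"):
    """Return z with z[v] = 1 iff some label under node v is relevant in y."""
    z = {}

    def visit(v):
        if v in leaf_label:
            z[v] = int(y[leaf_label[v]])
        else:
            # Build the full list first so that every child is visited.
            child_values = [visit(c) for c in children[v]]
            z[v] = int(any(child_values))
        return z[v]

    visit(root)
    return z


z = node_indicators([0, 1, 0, 0], children, leaf_label)

# Pointwise form of the two inequalities for every internal node.
for v, ch in children.items():
    assert max(z[c] for c in ch) <= z[v] <= min(1, sum(z[c] for c in ch))
```

Averaging these indicators over label vectors drawn from the conditional distribution gives the node probabilities, which is why the inequalities carry over to them.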
To estimate $\eta(\mathbf{x}, v)$, for $v \in V_T$, we use a function class $\mathcal{H} : \mathcal{X} \to [0,1]$ which contains probabilistic classifiers of choice, for example, logistic regressors. We assign a classifier from $\mathcal{H}$ to each node of the tree $T$. We shall index this set of classifiers by the elements of $V_T$ as $H = \{\hat{\eta}(v) \in \mathcal{H} : v \in V_T\}$. We also denote by $\hat{\eta}(\mathbf{x}, v)$ the estimate of $\eta(\mathbf{x}, v)$ obtained for a given $\mathbf{x}$ in node $v$. The estimates obey equations analogous to (1) and (2). However, as the probabilistic classifiers can be trained independently of each other, Proposition 1 may not apply to the estimated probabilities. This can be fixed by a proper normalization during prediction.
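As a minimal sketch of how estimates obeying the chain rule can be computed, the snippet below multiplies node-conditional estimates along the path from a node to the root. The tree, the numeric values, and the function name `node_probability` are hypothetical; the normalization step mentioned above is not shown.

```python
# Hypothetical toy tree given by parent pointers (None marks the root).
parent = {"root": None, "a": "root", "b": "root",
          "l0": "a", "l1": "a", "l2": "b", "l3": "b"}

# Hypothetical node-classifier outputs for one fixed instance x: each value
# estimates P(z_v = 1 | z_pa(v) = 1, x), or P(z_v = 1 | x) at the root.
cond_estimate = {"root": 0.9, "a": 0.8, "b": 0.3,
                 "l0": 0.5, "l1": 0.6, "l2": 0.7, "l3": 0.4}


def node_probability(v, parent, cond_estimate):
    """Multiply the conditional estimates on the path from v up to the root."""
    prob = 1.0
    while v is not None:
        prob *= cond_estimate[v]
        v = parent[v]
    return prob


# Estimated label probabilities are the node probabilities of the leaves.
label_estimates = {v: node_probability(v, parent, cond_estimate)
                   for v in ("l0", "l1", "l2", "l3")}
```

Computing one leaf estimate touches only the nodes on a single root-to-leaf path, which is what makes the tree factorization attractive when the number of labels is large.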
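The quality of the estimates of conditional probabilities $\hat{\eta}_j(\mathbf{x})$, $j \in \mathcal{L}$, can be expressed in terms of the $L_1$-estimation error in each node classifier, i.e., by $|\eta(\mathbf{x}, v) - \hat{\eta}(\mathbf{x}, v)|$. Based on similar results from (Beygelzimer et al., 2009b) and (Wydmuch et al., 2018) we get the following bound, which for $v = l_j$ gives the guarantees for $\hat{\eta}_j(\mathbf{x})$, $j \in \mathcal{L}$.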
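Theorem 1. For any tree $T$ and $P(\mathbf{y} \mid \mathbf{x})$ the following holds for $v \in V_T$:
$$|\eta_v(\mathbf{x}) - \hat{\eta}_v(\mathbf{x})| \;\le\; \sum_{v' \in \mathrm{Path}(v)} \eta_{\mathrm{pa}(v')}(\mathbf{x}) \, \big|\eta(\mathbf{x}, v') - \hat{\eta}(\mathbf{x}, v')\big|,$$
where for the root node $r_T$ we set $\eta_{\mathrm{pa}(r_T)}(\mathbf{x}) = 1$.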
Proof. This result can be found as a part of the proof of Theorem 1 in Appendix A in (Wydmuch et al., 2018). It is presented in Eq. (6) therein. However, this result is stated only for conditional probabilities of labels $\eta_j(\mathbf{x})$ and their estimates $\hat{\eta}_j(\mathbf{x})$. The generalization to any node $v \in V_T$ is straightforward, as the chain rule (1) applies to any node and the same transformations yield the result. ∎
4 Training complexity
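Training data $\mathcal{D} = \{(\mathbf{x}_i, \mathbf{y}_i)\}_{i=1}^{n}$ consist of tuples of feature vector $\mathbf{x}_i \in \mathcal{X}$ and label vector $\mathbf{y}_i \in \{0,1\}^m$. The labels for the entire training set can be written in a matrix form $\mathbf{Y} \in \{0,1\}^{n \times m}$ whose $j$-th column is denoted by $\mathbf{y}_j$. We also use a corresponding matrix $\mathbf{Z} \in \{0,1\}^{n \times |V_T|}$, with columns indexed by $v \in V_T$ and denoted by $\mathbf{z}_v$.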
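We define the training complexity of PLTs in terms of the number of nodes in which a training example $(\mathbf{x}, \mathbf{y})$ is used. This number follows from the definition of the tree and the PLT model (1). We use each training example in the root $r_T$ (to estimate $P(z_{r_T} = 1 \mid \mathbf{x})$) and in each node $v$ for which $z_{\mathrm{pa}(v)} = 1$ (to estimate $P(z_v = 1 \mid z_{\mathrm{pa}(v)} = 1, \mathbf{x})$). Therefore, we define the training cost for a single training example $(\mathbf{x}, \mathbf{y})$ by:
$$c(T, \mathbf{y}) = 1 + \sum_{v \in V_T \setminus \{r_T\}} z_{\mathrm{pa}(v)} .$$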
Algorithm 1 shows the AssignToNodes method, which identifies for a training example the set of positive and negative nodes, i.e., the nodes for which the training example is treated respectively as positive (i.e., $z_v = 1$) or negative (i.e., $z_v = 0$); see the pseudocode and the comments there for details of the method. [Footnote 4: Notice that the AssignToNodes method has $O(c(T, \mathbf{y}))$ time complexity, assuming that the set operations are performed in $O(1)$ time (e.g., the set is implemented by a hash table).] Based on this assignment a learning algorithm of choice, either batch or online, trains the node classifiers $\hat{\eta}(v)$, $v \in V_T$. The training cost for the set $\mathcal{D}$ is then expressed by:
$$C_T(\mathbf{Y}) = \sum_{i=1}^{n} c(T, \mathbf{y}_i) .$$
The above quantities are justified from the learning point of view for the following reasons. On the one hand, in an online setting, the complexity of an update of a PLT based on a single sample $(\mathbf{x}, \mathbf{y})$ is indeed proportional to $c(T, \mathbf{y})$ when the inner nodes use linear classifiers trained by optimizing some smooth loss with stochastic gradient descent (SGD), which is often the method of choice along with PLTs. Moreover, even if SGD is used in an offline setting, state-of-the-art packages such as fastText run several epochs over the training data. Therefore, their training time is proportional to $C_T(\mathbf{Y})$, not taking into account the complexity of other layers. On the other hand, if we update the inner node models in a batch setting, the training time is again linear in $C_T(\mathbf{Y})$ for several large-scale learning methods whose training process is based on optimizing some smooth loss, such as logistic regression (Allen-Zhu, 2017).
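The assignment of a training example to positive and negative nodes can be sketched as follows. This is only an illustration under assumed data structures (parent pointers, child lists, a label-to-leaf map, all hypothetical), not a reproduction of the pseudocode referred to as Algorithm 1: it walks up from every relevant leaf, marks the visited nodes as positive, and marks the remaining children of positive nodes as negative. The number of returned nodes is the per-example training cost.

```python
def assign_to_nodes(relevant_labels, parent, children, leaf_of_label, root):
    """Return (positive, negative) node sets for one training example."""
    positive, negative = set(), set()
    # Walk up from every relevant leaf until an already-marked node is reached.
    for j in relevant_labels:
        v = leaf_of_label[j]
        while v is not None and v not in positive:
            positive.add(v)
            v = parent[v]
    # Children of positive nodes that are not positive themselves are negative.
    for v in positive:
        for child in children.get(v, []):
            if child not in positive:
                negative.add(child)
    # With no relevant label the example is a negative example in the root only.
    if not positive:
        negative.add(root)
    return positive, negative


# Hypothetical perfect binary tree over four labels.
children = {"root": ["a", "b"], "a": ["l0", "l1"], "b": ["l2", "l3"]}
parent = {"root": None}
for v, ch in children.items():
    for c in ch:
        parent[c] = v
leaf_of_label = {0: "l0", 1: "l1", 2: "l2", 3: "l3"}

positive, negative = assign_to_nodes({1}, parent, children, leaf_of_label, "root")
cost = len(positive) + len(negative)  # number of nodes that use the example
```

With hash-based sets each node is touched a constant number of times, so the running time of this sketch is proportional to the number of nodes it returns.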
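The next proposition gives an upper bound for the cost $c(T, \mathbf{y})$.
Proposition 2. For any tree $T$ and vector $\mathbf{y}$ it holds that:
$$c(T, \mathbf{y}) \le 1 + \|\mathbf{y}\|_1 \cdot \mathrm{depth}_T \cdot \deg_T ,$$
where $\mathrm{depth}_T = \max_{v \in L_T} \mathrm{len}_v - 1$ is the depth of the tree, and $\deg_T = \max_{v \in V_T} \deg_v$ is the highest degree of a node in $T$.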
Proof. First notice that a training example is always used in the root node, either as a positive example $(\mathbf{x}, 1)$, if $\|\mathbf{y}\|_1 > 0$, or as a negative example $(\mathbf{x}, 0)$, if $\|\mathbf{y}\|_1 = 0$. Therefore the cost is at least 1. If $\|\mathbf{y}\|_1 > 0$, the training example is also used as a positive example in all the nodes on paths from the root to leaves corresponding to labels $j$ for which $y_j = 1$ in $\mathbf{y}$. As the root has already been counted, we have at most $\mathrm{depth}_T = \max_{v \in L_T} \mathrm{len}_v - 1$ such nodes for each positive label in $\mathbf{y}$. Moreover, the training example is used as a negative example in all siblings of the nodes on the paths determined above, unless it is already a positive example in the sibling node. The highest degree of a node in the tree is $\deg_T$. Taking the above into account, the cost $c(T, \mathbf{y})$ is upper bounded by $1 + \|\mathbf{y}\|_1 \cdot \mathrm{depth}_T \cdot \deg_T$. The bound is tight, for example, if $\|\mathbf{y}\|_1 = 1$ and $T$ is a perfect $\deg_T$-ary tree (all non-leaf nodes have equal degree and the paths to the root from all leaves are of the same length). ∎
Consider $k$-sparse multi-label classification (i.e., $\|\mathbf{y}\|_1 \le k$). For a balanced tree of constant $\deg_T$ and $\mathrm{depth}_T = O(\log m)$, the training cost is $c(T, \mathbf{y}) = O(k \log m)$.
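In the proposition below we express the cost $C_T(\mathbf{Y})$ in terms of vectors $\mathbf{z}_v$. Each such vector indicates the positive examples for node $v$. We refer to $\|\mathbf{z}_v\|_1$ as the Hamming weight of the node $v$. Moreover, we use $c(T, v) = \|\mathbf{z}_v\|_1 \cdot \deg_v$ for the cost of the node $v$.
Proposition 3. For any tree $T$ and label matrix $\mathbf{Y}$ it holds that:
$$C_T(\mathbf{Y}) = n + \sum_{v \in V_T} \|\mathbf{z}_v\|_1 \cdot \deg_v = n + \sum_{v \in V_T} c(T, v) .$$
Proof. Obviously, we have that:
$$C_T(\mathbf{Y}) = \sum_{i=1}^{n} c(T, \mathbf{y}_i) = n + \sum_{v \in V_T \setminus \{r_T\}} \|\mathbf{z}_{\mathrm{pa}(v)}\|_1 ,$$
as elements $z_{i,v}$ constitute matrix $\mathbf{Z}$ with columns corresponding to the nodes of $T$. Next, notice that for each $v \in V_T$, we have:
$$\sum_{v' \in \mathrm{Ch}(v)} \|\mathbf{z}_{\mathrm{pa}(v')}\|_1 = \|\mathbf{z}_v\|_1 \cdot \deg_v , \quad \text{and hence} \quad C_T(\mathbf{Y}) = n + \sum_{v \in V_T} \|\mathbf{z}_v\|_1 \cdot \deg_v = n + \sum_{v \in V_T} c(T, v) .$$
The last sum is over all nodes, as for $v \in L_T$ we have $\deg_v = 0$. The final equation is obtained by the definition of the cost of the node $v$, i.e., $c(T, v) = \|\mathbf{z}_v\|_1 \cdot \deg_v$. ∎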
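The identity between the per-example and the per-node view of the training cost can be checked numerically. The sketch below uses a hypothetical toy tree and label matrix: it computes the total cost once by summing, for each example, one unit for the root plus one unit for every non-root node whose parent is positive, and once as the number of examples plus the sum of Hamming weight times degree over the internal nodes.

```python
# Hypothetical toy tree and label matrix (rows are examples, columns are labels).
children = {"root": ["a", "b"], "a": ["l0", "l1"], "b": ["l2", "l3"]}
leaf_label = {"l0": 0, "l1": 1, "l2": 2, "l3": 3}
parent = {c: v for v, ch in children.items() for c in ch}  # non-root nodes only

Y = [
    [0, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 1, 1],
]


def indicators(y):
    """Node indicators z_v for a single label vector y."""
    z = {}

    def visit(v):
        if v in leaf_label:
            z[v] = y[leaf_label[v]]
        else:
            z[v] = max([visit(c) for c in children[v]])
        return z[v]

    visit("root")
    return z


Z = [indicators(y) for y in Y]
n = len(Y)

# Per-example view: 1 for the root plus every non-root node with a positive parent.
cost_per_example = sum(1 + sum(z[parent[v]] for v in parent) for z in Z)

# Per-node view: n plus Hamming weight times degree, summed over internal nodes
# (leaves have degree zero and contribute nothing).
cost_per_node = n + sum(sum(z[v] for z in Z) * len(children[v]) for v in children)
```

Both quantities agree on this example, and the per-node form makes explicit that moving frequently positive nodes under low-degree parents reduces the cost.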
Next we show a counterpart of Proposition 1 for training data.
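Proposition 4. For any $T$ and label matrix $\mathbf{Y}$, the Hamming weight of any internal node $v \in V_T \setminus L_T$ satisfies:
$$\max\{\|\mathbf{z}_{v'}\|_1 : v' \in \mathrm{Ch}(v)\} \;\le\; \|\mathbf{z}_v\|_1 \;\le\; \min\Big\{n, \sum_{v' \in \mathrm{Ch}(v)} \|\mathbf{z}_{v'}\|_1\Big\},$$
with equality on the left holding for label covering distributions, i.e., when some label $j$ is relevant for every training example ($\|\mathbf{y}_j\|_1 = n$), and equality on the right holding for multi-class distributions, i.e., when $\|\mathbf{y}_i\|_1 = 1$ for every $i \in [n]$.
Proof. The proof follows the same steps as the proof of Proposition 1, with the difference that instead of the expectation with respect to $P(\mathbf{y} \mid \mathbf{x})$, we take the sum over the training examples.
The left inequality becomes an equality, for example, for the label covering distribution, since $\|\mathbf{z}_{v'}\|_1 = n$ for the child node $v'$ under which there is label $j$, i.e., $l_j \in L_{v'}$, or $v'$ is the leaf node corresponding to label $j$.
The right inequality becomes an equality, for example, for the multi-class distribution, since for each training example there is always only one child $v'$ for which $z_{v'} = 1$. ∎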
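Another important quantity we use is the expected training cost:
$$C_P(T) = \mathbb{E}_{\mathbf{y} \sim P(\mathbf{y})}\big[c(T, \mathbf{y})\big] .$$
Proposition 5. For any tree $T$ and distribution $P(\mathbf{y})$ it holds that:
$$C_P(T) = 1 + \sum_{v \in V_T} P(z_v = 1) \cdot \deg_v .$$
Proof. The result follows immediately by taking the expectation of $c(T, \mathbf{y})$ and the same observation as in Proposition 4. For $v \in V_T$, we have:
$$\sum_{v' \in \mathrm{Ch}(v)} P(z_{\mathrm{pa}(v')} = 1) = P(z_v = 1) \cdot \deg_v .$$
Namely, we have
$$C_P(T) = 1 + \sum_{v \in V_T \setminus \{r_T\}} P(z_{\mathrm{pa}(v)} = 1) = 1 + \sum_{v \in V_T} P(z_v = 1) \cdot \deg_v .$$
The last sum is over all nodes, as for $v \in L_T$ we have $\deg_v = 0$. ∎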
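Proposition 6. For any tree $T$ and distribution $P(\mathbf{y})$ it holds that:
$$\mathbb{E}_{\mathbf{y} \sim P(\mathbf{y})}\big[c(T, \mathbf{y})\big] \le 1 + \mathbb{E}_{\mathbf{y} \sim P(\mathbf{y})}\big[\|\mathbf{y}\|_1\big] \cdot \mathrm{depth}_T \cdot \deg_T ,$$
where $\mathrm{depth}_T = \max_{v \in L_T} \mathrm{len}_v - 1$ is the depth of the tree, and $\deg_T = \max_{v \in V_T} \deg_v$ is the highest degree of a node in $T$.
Proof. The proof follows immediately from Proposition 2 by taking the expectation over $\mathbf{y}$. ∎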
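Proposition 7. For any $T$ and distribution $P(\mathbf{x}, \mathbf{y})$, the probability $P(z_v = 1)$ of any internal node $v \in V_T \setminus L_T$ satisfies:
$$\max\{P(z_{v'} = 1) : v' \in \mathrm{Ch}(v)\} \;\le\; P(z_v = 1) \;\le\; \min\Big\{1, \sum_{v' \in \mathrm{Ch}(v)} P(z_{v'} = 1)\Big\}.$$
Proof. The proposition follows immediately from Proposition 1 by taking the expectation over $\mathbf{x}$. ∎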
Next, we state the relation between the finite sample and expected training costs. Using the fact that has bounded difference property, we can compute its deviation from its mean as follows.
For any PLT with label tree , it holds that
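Proof. We can directly apply the concentration result for functions with bounded differences (see Section 3.2 of Boucheron et al. 2013). It only remains to upper bound $|c(T, \mathbf{y}) - c(T, \mathbf{y}')|$ for any $\mathbf{y}$, where $\mathbf{y}'$ is the same as $\mathbf{y}$ except that the $j$-th component is flipped. First, consider the case when $y_j = 0$ and let us flip its value. Based on Proposition 5, the training algorithm of PLT updates each child of an inner node if there is at least one leaf $l_j$ in the subtree below that node for which $y_j = 1$; otherwise it does not update the child classifiers with the given example. Flipping $y_j$ can therefore only affect the children of the inner nodes on the path from $l_j$ to the root, so $|c(T, \mathbf{y}) - c(T, \mathbf{y}')|$ cannot be bigger than $\mathrm{depth}_T \cdot \deg_T$. The same argument applies to the case when $y_j = 1$, which concludes the proof. ∎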
Note that for balanced binary trees, thus is close to its expected value with samples with high probability. This lower bound suggests that one should not consider optimizing the training complexity based on fewer examples, since the empirical value which one would like to optimize over the space of labeled trees, might significantly deviate from its expected value.
5 Optimizing the training complexity
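In this section, we focus on the algorithmic and hardness results for minimizing the cost $C_T(\mathbf{Y})$. In the analysis, we mainly refer to matrices $\mathbf{Y}$ and $\mathbf{Z}$ via their columns $\mathbf{y}_j$, $j \in \mathcal{L}$, and $\mathbf{z}_v$, $v \in V_T$, respectively. We assume $\mathbf{Y}$ to be stored efficiently, for example, as a sparse matrix whenever possible. We also use the normalized Hamming weights $\|\mathbf{y}_j\|_1 / n$ and $\|\mathbf{z}_v\|_1 / n$, which are the fractions of positive examples in the corresponding nodes.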
5.1 Hardness of training cost minimization
First we formally define the decision version of the cost minimization problem.
Definition 1 (PLT training cost problem).
For a label matrix $\mathbf{Y}$ and a parameter $w$, decide whether there exists a tree $T \in \mathcal{T}$ such that $C_T(\mathbf{Y}) \le w$.
We prove NP-hardness of PLT training cost by a reduction from the Clique problem (one of the classical NP-complete problems; Garey and Johnson, 1979), defined as follows.
Definition 2 (Clique).
For an undirected graph $G = (V, E)$ and a parameter $k$, decide whether $G$ contains a clique on $k$ nodes.
The PLT training cost problem is NP-complete.
We remark that a problem similar to PLT training cost has been studied in the database literature. In particular, the problem of finding an optimal binary tree is proven to be NP-hard in (Ghosh et al., 2015). Note that the result of (Ghosh et al., 2015) does not imply hardness of the PLT training cost problem.
5.2 Logarithmic approximation for multi-label case
Despite the hardness of the problem, we are able to give a simple algorithm which achieves an $O(\log m)$ approximation. As remarked above, the problem of finding an optimal binary PLT tree is equivalent to the binary merging problem considered in (Ghosh et al., 2015).
Definition 3 (Binary merging).
For a ground set $U$ of size $n$, and a collection of $m$ sets $A_1, \ldots, A_m$, where each $A_j \subseteq U$, a merge schedule is a pair $(T, \pi)$ of a full binary tree $T$ with $m$ labeled leaves, and a permutation $\pi : [m] \to [m]$ which assigns every set $A_j$ to the leaf number $\pi(j)$. [Footnote 5: A full binary tree is a tree where every non-leaf node has exactly $2$ children.] The binary merging problem is to find a merge schedule of the minimum cost:
$$\mathrm{cost}(T, \pi) = \sum_{v \in V_T} |A_v| ,$$
where $A_v$ is the union of sets assigned to the leaves of the subtree rooted at the node $v$.
While binary merging is NP-complete, it admits an $O(\log m)$ approximation (Ghosh et al., 2015). The lemma below, showing that any PLT training cost problem can be 2-approximated by a binary PLT tree, gives a simple $O(\log m)$-approximation for the PLT training cost problem: it suffices to find an approximately optimal binary tree using the algorithms from (Ghosh et al., 2015) (e.g., one of the algorithms presented there is a simple modification of the Huffman tree building algorithm).
Lemma 1. For any PLT training cost instance $\mathbf{Y}$, it holds that
$$\min_{T \in \mathcal{T}_2} C_T(\mathbf{Y}) \le 2 \cdot \min_{T \in \mathcal{T}} C_T(\mathbf{Y}) ,$$
where $\mathcal{T}_2$ denotes the set of trees in which each internal node (including the root) has degree $2$.
Proof. Consider an optimal tree $T^*$. Starting from the root, replace every node with more than $2$ children by an arbitrary binary tree whose set of leaves is the set of children of this node. Consider a node $v$ of $T^*$, and let $v_1, \ldots, v_k$ be the children of $v$. The cost of the node $v$ is $k \cdot \|\mathbf{z}_v\|_1$. Any binary tree with the leaves $v_1, \ldots, v_k$ has $k - 1$ internal nodes, each of them has degree two, and the Hamming weight of its label is at most $\|\mathbf{z}_v\|_1$. Thus, the sum of the costs of the internal nodes of this binary tree is at most $2(k-1) \cdot \|\mathbf{z}_v\|_1 < 2k \cdot \|\mathbf{z}_v\|_1$. When we repeat this procedure for all internal nodes of $T^*$, we increase the cost of each node by at most a factor of $2$. Thus, the resulting binary tree is a $2$-approximation of $T^*$. ∎
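The replacement step used in this argument can be sketched in code. The snippet below is a hypothetical illustration (the names `binarize`, `labels_under`, and `training_cost`, the auxiliary node naming, and the toy data are all assumptions): every node with more than two children is replaced by a chain-shaped binary tree over the same children, and the training cost, computed as the number of examples plus Hamming weight times degree over internal nodes, is compared before and after.

```python
def binarize(children):
    """Return a copy of `children` in which every node has at most two children."""
    out = {}
    counter = 0
    for v, ch in children.items():
        ch = list(ch)
        cur = v
        # Peel off one child at a time and push the rest under a new auxiliary node.
        while len(ch) > 2:
            counter += 1
            aux = f"{v}#aux{counter}"
            out[cur] = [ch[0], aux]
            ch = ch[1:]
            cur = aux
        out[cur] = ch
    return out


def labels_under(v, children, leaf_label):
    """Labels whose leaves lie in the subtree rooted at v."""
    if v in leaf_label:
        return [leaf_label[v]]
    return [j for c in children[v] for j in labels_under(c, children, leaf_label)]


def training_cost(Y, children, leaf_label):
    """n plus Hamming weight times degree, summed over the internal nodes."""
    total = len(Y)
    for v, ch in children.items():
        labels = labels_under(v, children, leaf_label)
        weight = sum(1 for y in Y if any(y[j] for j in labels))
        total += weight * len(ch)
    return total


# Hypothetical instance: a star tree over four labels.
star = {"root": ["l0", "l1", "l2", "l3"]}
leaf_label = {"l0": 0, "l1": 1, "l2": 2, "l3": 3}
Y = [[1, 0, 0, 0], [1, 0, 0, 0], [0, 0, 0, 1], [0, 1, 0, 0]]

binary = binarize(star)
star_cost = training_cost(Y, star, leaf_label)
binary_cost = training_cost(Y, binary, leaf_label)
assert binary_cost <= 2 * star_cost
```

The chain shape is only one possible choice; the argument allows an arbitrary binary tree over the children, and a different shape may give a lower cost on a given instance.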
We are able, however, to give another algorithm, based on ternary complete trees, with a slightly better constant in the approximation ratio.666We use to denote the logarithm base .
There exists an algorithm which runs in time and achieves an approximation guarantee of for the PLT training cost problem, i.e., the output of the algorithm satisfies
The algorithm constructs in linear time a complete ternary tree of depth , and assigns the vectors to the leaves arbitrarily. From the definition of the cost function we have that for every tree . On the other hand, from Proposition 3 we have that , which completes the proof. ∎
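A possible way to build such a tree is sketched below. The bottom-up grouping, the tuple-based node identifiers, and the function name `ternary_tree` are assumptions made for illustration; when the number of labels is not a power of three, the last group on a level may contain fewer than three nodes.

```python
def ternary_tree(m):
    """Group labels 0..m-1 bottom-up in threes; return (children, root, depth)."""
    level = [("leaf", j) for j in range(m)]  # labels are assigned in arbitrary order
    children = {}
    depth = 0
    next_id = 0
    while len(level) > 1:
        next_level = []
        for i in range(0, len(level), 3):
            node = ("inner", next_id)
            next_id += 1
            children[node] = level[i:i + 3]
            next_level.append(node)
        level = next_level
        depth += 1
    return children, level[0], depth


children, root, depth = ternary_tree(10)
```

The levels shrink geometrically, so the total number of created nodes, and hence the running time of this sketch, is linear in the number of labels, and the resulting depth is the smallest integer $t$ with $3^t \ge m$.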
We remark that any improvement of the approximation ratio of Theorem 3 would solve an open problem. Indeed, since the proof of Lemma 1 is constructive and efficient, any -approximation algorithm for the PLT training cost problem would imply an -approximation of an optimal binary tree, and this would improve the best known approximation ratio for the binary merging problem.
5.3 Multi-class case
In the multi-class case, we have for each in . For ease of exposition, we assume that the columns are sorted such that .
Remark that for trees of a fixed degree $\lambda$ for all internal nodes, the optimal solution is the $\lambda$-ary Huffman tree. Here, we do not have this restriction and have different costs for nodes of different degrees, which makes the problem more difficult. Nevertheless, we give two efficient algorithms which find almost optimal solutions for every instance of the multi-class PLT training cost problem. Namely, these algorithms find a solution within a small additive error. Moreover, these algorithms run in linear time $O(m)$.
We will use the entropy function defined as , for , and .777For ease of exposition, we do not require the arguments of the entropy function to sum up to . We will use the fact that for (this follows from Jensen’s inequality). We will also make use of the following corollary of Jensen’s inequality.
Let , and . Let . Then
Since is concave for , by Jensen’s inequality we have that:
We start by showing a lower bound for the multi-class case.
Let be an instance of the multi-class case. The cost of any tree for is at least
We prove this Lemma by induction on the number of inner nodes of . If has only one inner node (the root), then
because for every integer .
Now assume has more than one inner nodes. Consider an inner node of on the longest distance from the root. All children of are leaves. W.l.o.g. assume that the children of are for . In the multi-class case we have that , and the cost . Now let be the tree with the children of removed (while keeping the label of the new leaf ). Then where derived from by replacing the columns by the column . By the induction hypothesis, . Let . Then we have that
where the second inequality is due to Proposition 9, and the last inequality holds for every integer . ∎
As an upper bound, we prove that both a ternary Shannon code and a ternary Huffman tree give an almost optimal solution in the multi-class case. Both algorithms will construct a tree where each node (possibly except for one) has exactly three children. Remark that in the multi-class case the Hamming weight of each internal node is the sum of the Hamming weights of all leaves in its subtree (which follows from Proposition 4).
A ternary Shannon code and a ternary Huffman tree for , which both can be constructed in time , solve the multi-class PLT training cost problem with an additive error of at most , i.e., the output of the algorithm satisfies
Recall that for a leaf corresponding to the vector , denotes the number of nodes on the path from to the root of the tree. Since in ternary Shannon and Huffman trees, the degree of each node is at most , the total cost of these trees is at most .
It is known that the value of the Shannon code is upper bounded by (see, e.g., Section 5.4 in Cover and Thomas 2012). This implies that the cost of the corresponding ternary Shannon tree is
It is also known that the weight of the ternary Huffman code is upper bounded by the same quantity (see, e.g., Section 5.8 in Cover and Thomas 2012). Thus, the same upper bound holds for a ternary Huffman tree for the PLT training cost problem. This, together with Lemma 2, implies approximation with an additive error of at most .
Now we show that in our case, PLT trees corresponding to Shannon and Huffman codes can be constructed even more efficiently. We assume a sparse representation of the input by the numbers $\|\mathbf{y}_j\|_1$, $j \in [m]$. From now on we will only store and work with these numbers. Since all $\|\mathbf{y}_j\|_1$ are integers between $0$ and $n$, we can sort them using bucket sort in linear time. In the Shannon code, the depth of the leaf $l_j$ is $\lceil \log_3 (n / \|\mathbf{y}_j\|_1) \rceil$. We can construct the corresponding tree going from the root. We add internal nodes one by one, and connect leaves of the corresponding depth to this tree in ascending order of depth. This algorithm takes one pass over the sorted data, and also runs in linear time. Thus, the running time of the whole algorithm is linear.
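The two ingredients of this construction, bucket sorting the label counts and assigning code depths, can be sketched as follows. The depth rule used here is the standard ternary Shannon code length, the smallest integer $t$ with $3^t \cdot w_j \ge n$ for a label with count $w_j$, which is an assumption about the intended rule. The sketch further assumes that every label occurs at least once; the function names and the toy counts are hypothetical.

```python
from fractions import Fraction


def bucket_sort_labels(weights, n):
    """Return label indices in ascending order of weight (weights lie in 0..n)."""
    buckets = [[] for _ in range(n + 1)]
    for j, w in enumerate(weights):
        buckets[w].append(j)
    return [j for bucket in buckets for j in bucket]


def shannon_depths(weights, n):
    """Smallest t with 3**t * w >= n for each weight w (integer arithmetic only)."""
    depths = []
    for w in weights:
        if w < 1:
            raise ValueError("every label is assumed to occur at least once")
        t, power = 0, 1
        while w * power < n:
            power *= 3
            t += 1
        depths.append(t)
    return depths


# Hypothetical multi-class instance: label counts summing to n = 18.
weights = [9, 3, 3, 1, 1, 1]
n = sum(weights)

order = bucket_sort_labels(weights, n)
depths = shannon_depths(weights, n)

# Kraft's inequality guarantees that a ternary tree with these leaf depths exists.
kraft_sum = sum(Fraction(1, 3 ** t) for t in depths)
assert kraft_sum <= 1
```

The inner loop of `shannon_depths` is kept simple for readability and is not itself linear-time; processing the labels in bucket order and raising the depth only when needed is one way to avoid the repeated work.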
For the Huffman code, we will also store a Bucket sorting of the current set of . Namely, we introduce an array where