Locally Constant Networks

09/30/2019 · Guang-He Lee, et al. · MIT

We show how neural models can be used to realize piece-wise constant functions such as decision trees. Our approach builds on ReLU networks that are piece-wise linear and hence their associated gradients with respect to the inputs are locally constant. We formally establish the equivalence between the classes of locally constant networks and decision trees. Moreover, we highlight several advantageous properties of locally constant networks, including how they realize decision trees with parameter sharing across branching / leaves. Indeed, only M neurons suffice to implicitly model an oblique decision tree with 2^M leaf nodes. The neural representation also enables us to adopt many tools developed for deep networks (e.g., DropConnect (Wan et al. 2013)) while implicitly training decision trees. We demonstrate that our method outperforms alternative techniques for training oblique decision trees in the context of molecular property classification and regression tasks.


1 Introduction

Decision trees (Breiman et al., 1984) employ a series of simple decision nodes, arranged in a tree, to transparently capture how the predicted outcome is reached. Functionally, such tree-based models, including random forests (Breiman, 2001), realize piece-wise constant functions. Beyond their status as de facto interpretable models, they have also persisted as state-of-the-art models on some tabular (Sandulescu and Chiru, 2016) and chemical datasets (Wu et al., 2018). Deep neural models, in contrast, are highly flexible and continuous, and demonstrably effective in practice, but they lack transparency. We merge these two contrasting views by introducing a new family of neural models that implicitly learn and represent oblique decision trees.

Prior work has attempted to generalize classic decision trees by extending coordinate-wise cuts to weighted linear classifications. The resulting family of models is known as oblique decision trees (Murthy et al., 1993). However, this generalization introduces a challenging combinatorial, non-differentiable optimization problem over the linear parameters at each decision node. The simple sorting procedures used for successively finding branch-wise optimal coordinate cuts are no longer available, making these models considerably harder to train. While finding the optimal oblique decision tree can be cast as a mixed integer linear program (Bertsimas and Dunn, 2017), scaling remains a challenge.

In this work, we provide an effective, implicit representation of piece-wise constant mappings, termed locally constant networks. Our approach exploits piece-wise linear models such as ReLU networks as basic building blocks. Linearity of the mapping within each region of such models means that the gradient with respect to the input coordinates is locally constant. We therefore implicitly represent piece-wise constant mappings through gradients evaluated on ReLU networks. We prove the equivalence between the class of oblique decision trees and the class of the proposed locally constant neural models. However, the sizes required for equivalent representations can be substantially different. For example, a locally constant network with M neurons can implicitly realize an oblique decision tree whose explicit form requires 2^M - 1 oblique decision nodes. The exponential complexity reduction in the corresponding neural representation illustrates the degree to which parameters are shared across the locally constant regions.

Our locally constant networks can be learned via gradient descent, and they can be explicitly converted back to oblique decision trees for interpretability. For learning via gradient descent, however, it is necessary to employ a smooth annealing of the piece-wise linear activation functions so as to keep the gradients themselves continuous. Moreover, we need to evaluate the gradients of all the neurons with respect to the inputs. To address this bottleneck, we devise a dynamic programming algorithm that computes all the necessary gradient information in a single forward pass. A number of extensions are possible. For instance, we can construct approximately locally constant networks by switching activation functions, or apply helpful techniques developed for standard deep models (e.g., DropConnect (Wan et al., 2013)) while implicitly training decision trees.

We empirically test our model on molecular property classification and regression tasks (Wu et al., 2018), where tree-based models remain state-of-the-art. We compare our approach against recent methods for training oblique decision trees and against classic ensemble methods such as gradient boosting (Friedman, 2001) and random forest. Empirically, a single locally constant network consistently outperforms alternative methods for training oblique decision trees by a large margin, and an ensemble of locally constant networks is competitive with classic ensemble methods.

2 Related Work

Locally constant networks are built on a mixed integer linear representation of piece-wise linear networks, i.e., any feed-forward network with a piece-wise linear activation function such as ReLU (Nair and Hinton, 2010). One can specify a set of integers encoding the active linear piece of each neuron, called an activation pattern (Raghu et al., 2017). The feasible set of an activation pattern forms a convex polyhedron in the input space (Lee et al., 2019), on which the network degenerates to a linear model. This framework motivates us to leverage the locally invariant derivatives of such networks to construct a locally constant network. Activation patterns have also been exploited in the literature for other purposes, such as deriving robustness certificates (Weng et al., 2018). We refer the readers to the recent work of Lee et al. (2019) and the references therein.

The class of locally constant networks is equivalent to the class of oblique decision trees. Some classic methods also construct neural networks that reproduce decision trees (Sethi, 1990; Brent, 1991; Cios and Liu, 1992) by utilizing step functions and logic gates (e.g., AND/NEGATION) as activation functions. These methods were developed when back-propagation was not yet practically useful, and their motivation was to exploit the effective learning procedures of decision trees to train neural networks. Our goal is the opposite: to leverage successful deep models to train oblique decision trees.

Learning oblique decision trees is challenging even for a greedy algorithm: for a single oblique split, the number of distinct ways to separate the data points grows combinatorially with the input dimension (Murthy et al., 1993), in contrast to the modest number of candidate coordinate-wise cuts. Existing learning algorithms for oblique decision trees include greedy induction, global optimization, and iterative refinement of an initial tree. We review some representative works, and refer the readers to the references therein.

Optimizing each oblique split in greedy induction can be realized by coordinate descent (Murthy et al., 1994) or by a coordinate-cut search in some linear projection space (Menze et al., 2011; Wickramarachchi et al., 2016). However, such greedy constructions tend to get stuck in poor local optima. Some works attempt to find the global optimum given a fixed tree structure by formulating a linear program (Bennett, 1994) or a mixed integer linear program (Bertsimas and Dunn, 2017), but these methods do not scale to ordinary tree sizes (e.g., depth greater than 4). Iterative refinement is more scalable than global optimization, with CART (Breiman et al., 1984) as the typical initialization. Carreira-Perpinán and Tavallali (2018) develop an alternating optimization method that iteratively trains a linear classifier on each decision node, which yields state-of-the-art empirical performance, but the approach is only applicable to classification problems.

Norouzi et al. (2015) propose to perform gradient descent on a sub-differentiable upper bound of the tree prediction error, but the gradients with respect to the oblique decision nodes are unavailable whenever the upper bound is tight. In contrast, our method performs gradient descent on a differentiable relaxation that is gradually annealed to a locally constant network.

3 Methodology

In this section, we introduce the notation and basics in §3.1, construct the locally constant networks in §3.2-3.3, analyze the networks in §3.4-3.5, and develop practical formulations and algorithms in §3.6-3.7. Note that we will propose two (equivalent) architectures of locally constant networks in §3.3 and §3.6, which are useful for theoretical analyses and practical purposes, respectively.

3.1 Notation and basics

The proposed approach is built on feed-forward networks that yield piece-wise linear mappings. Here we first introduce a canonical example of such networks and elaborate on its piece-wise linearity. We consider the densely connected architecture (Huang et al., 2017), where each hidden layer takes all the previous layers as input; it subsumes other existing feed-forward architectures such as residual networks (He et al., 2016). For such a network with the set of parameters θ, we denote the number of hidden layers as M and the number of neurons in the l-th layer as N_l; we denote the neurons in the l-th layer, before and after activation, as z^l and x^l, respectively, and we sometimes interchangeably refer to the input instance as x^0 with N_0 equal to the input dimension D. To simplify exposition, we write x^{0:l} for the concatenation of (x^0, ..., x^l). The neurons are defined via the weight matrix W^l and the bias vector b^l in each layer l. Concretely,

z^l = W^l x^{0:(l-1)} + b^l,   x^l = σ(z^l),   for l = 1, ..., M,   (1)

where σ(·) is a point-wise activation function. Note that both z^l and x^l are functions of the specific instance x^0, where we drop the functional dependency to simplify notation. We index all the neurons in the network by pairs (l, i) of layer and coordinate. In this work, we will use ReLU (Nair and Hinton, 2010) as a canonical example of the activation function,

σ(z) = max(0, z),   (2)

but the results naturally generalize to other piece-wise linear activation functions such as leaky ReLU (Maas et al., 2013). The output of the entire network is an affine transformation of all the hidden layers, with weight matrix W^{M+1} and bias vector b^{M+1}.
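To make the notation concrete, here is a minimal PyTorch sketch of the densely connected piece-wise linear network described above; the class name and layer sizes are illustrative, not taken from the paper.

```python
import torch
import torch.nn as nn

class DenseReLUNet(nn.Module):
    """Densely connected feed-forward net: layer l takes the input and all previous layers."""

    def __init__(self, input_dim, layer_sizes):
        super().__init__()
        self.layers = nn.ModuleList()
        in_dim = input_dim
        for n_l in layer_sizes:
            # W^l and b^l act on the concatenation x^{0:(l-1)}
            self.layers.append(nn.Linear(in_dim, n_l))
            in_dim += n_l

    def forward(self, x0):
        feats = [x0]                                # x^0
        for layer in self.layers:
            z = layer(torch.cat(feats, dim=-1))     # z^l = W^l x^{0:(l-1)} + b^l
            feats.append(torch.relu(z))             # x^l = sigma(z^l)
        return torch.cat(feats[1:], dim=-1)         # all hidden activations x^{1:M}
```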

3.2 Local linearity

It is widely known that this class of networks yields piece-wise linear functions. The result is typically proved by associating the end-to-end behavior of the network with its activation pattern, i.e., which linear piece of each neuron is active: once an activation pattern is fixed across the entire network, the network degenerates to a linear model, and the feasible set associated with an activation pattern is a natural characterization of a locally linear region of the network.

Formally, we define the activation pattern as the collection of activation indicator functions, one for each neuron (or, equivalently, the derivatives of the ReLU units; see below); each indicator is again a function of x^0, where we omit the dependency for brevity:

o^l_i = 1[z^l_i ≥ 0],   for all layers l and coordinates i,   (3)

where 1[·] is the indicator function. Note that, for mathematical correctness, we define o^l_i = 1 at z^l_i = 0; this choice is arbitrary, and one can change it to o^l_i = 0 at z^l_i = 0 without affecting most of the derivations. Given a fixed activation pattern, we can specify the feasible set of inputs x^0 that realize this activation pattern. Within the feasible set, the non-linear ReLU can be re-written as a linear function; for example, for a neuron with o^l_i = 1, we can re-write x^l_i = max(0, z^l_i) = z^l_i. As a result, the network has a consistent end-to-end linear behavior across the entire feasible set. One can prove that the feasible sets partition the input space into disjoint convex polyhedra (the boundary of each polyhedron depends on the specific definition of the activation pattern, so, under some definitions in the literature, the resulting polyhedra may overlap on their boundaries), which yields a natural representation of the locally linear regions. Since we will only use this result to motivate the construction of locally constant networks, we refer the readers to Lee et al. (2019) for a detailed justification of the piece-wise linearity of such networks.
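As an illustration of the activation pattern and its locally linear region, the sketch below (a hypothetical helper built on the `DenseReLUNet` sketch above) records the pattern for an input and checks that two inputs sharing a pattern also share the same input Jacobian.

```python
import torch

def activation_pattern(net, x0):
    """Return the flattened pattern o = 1[z^l >= 0] over all neurons of a DenseReLUNet."""
    feats, pattern = [x0], []
    for layer in net.layers:
        z = layer(torch.cat(feats, dim=-1))
        pattern.append((z >= 0).float())
        feats.append(torch.relu(z))
    return torch.cat(pattern, dim=-1)

# Two inputs with the same activation pattern lie in the same convex polyhedron,
# on which the network is a single affine map, so their Jacobians coincide.
net = DenseReLUNet(input_dim=4, layer_sizes=[3, 3])
xa, xb = torch.randn(1, 4), torch.randn(1, 4)
if torch.equal(activation_pattern(net, xa), activation_pattern(net, xb)):
    ja = torch.autograd.functional.jacobian(lambda x: net(x).sum(), xa)
    jb = torch.autograd.functional.jacobian(lambda x: net(x).sum(), xb)
    assert torch.allclose(ja, jb)
```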

3.3 Canonical locally constant networks

Figure 1: Toy examples of equivalent representations of the same mappings for different depths M. Here the locally constant networks have 1 neuron per layer. We show the locally constant networks on the LHS, the raw mappings in the middle, and the equivalent oblique decision trees on the RHS.

Since the ReLU network is piece-wise linear, its derivative with respect to the input is immediately a piece-wise constant function. Here we use ∂x^l/∂x^0 to denote the Jacobian of the neurons with respect to the input, and we assume the Jacobian is consistent with Eq. (3) at the boundaries of the locally linear regions. Since any function taking the piece-wise constant Jacobian as input remains piece-wise constant, we can construct a variety of locally constant networks by composition.

However, in order to simplify the derivation, we first make the trivial observation that the activation pattern in each locally linear region is also locally invariant. More broadly, any quantity that is invariant in each locally linear region can be utilized to build locally constant networks. We thus define locally constant networks as any composite function that leverages the local invariance of piece-wise linear networks. For the theoretical analyses, we consider the architecture below.

Canonical architecture. We denote the concatenation of all the indicators o^l_i as O(x^0). We will use the composite function g(O(x^0)) as the canonical architecture of locally constant networks for theoretical analyses, where g is simply a table mapping each activation pattern to an output value.

Before elucidating the representational equivalence to oblique decision trees, we first show some toy examples of canonical locally constant networks and their equivalent mappings in Fig. 1, which illustrates their construction when there is only 1 neuron per layer (i.e., N_l = 1 for every l). When M = 1, the activation pattern is a single indicator of a linear decision, so the locally constant network realizes the two-region mapping shown in the middle, which can also be represented as an oblique decision tree with depth 1. When M = 2, the activations of the previous layers control different linear behaviors of a neuron with respect to the input, thus realizing a hierarchical structure just as an oblique decision tree does: the linear realization of z^2 depends on whether x^1 is activated, so the model can also be interpreted as the decision tree on the RHS, where the concrete realization of the second decision depends on the previous decision variable o^1_1. Afterwards, we can map either the activation patterns on the LHS or the decision patterns on the RHS to an output value, which leads to the mapping in the middle.

3.4 Representational equivalence

In this section, we prove the equivalence between the class of oblique decision trees and the class of locally constant networks. We first observe that any unbalanced oblique decision tree can be re-written as a balanced tree by adding dummy decision nodes. Hence, we can define the class of oblique decision trees with the balance constraint:

Definition 1.

The class of oblique decision trees contains any function that can be procedurally defined, for some depth T and input x in R^D, as follows:

  1. r_1 = 1[w_1 · x + b_1 ≥ 0], where w_1 and b_1 denote the weight and bias of the root decision node.

  2. For d = 2, ..., T: r_d = 1[w_{r_{1:d-1}} · x + b_{r_{1:d-1}} ≥ 0], where w_{r_{1:d-1}} and b_{r_{1:d-1}} denote the weight and bias of the decision node reached after the decision pattern r_{1:d-1} = (r_1, ..., r_{d-1}).

  3. The tree outputs the leaf value associated with the complete decision pattern r_{1:T}.

The class of locally constant networks is defined by the canonical architecture with finite depth and width. We first prove that we can represent any oblique decision tree as a locally constant network. Since a typical oblique decision tree can produce an arbitrary weight in each decision node (cf. the structurally dependent weights of the oblique decision trees in Fig. 1), the idea is to utilize a network with only one hidden layer so that the neurons do not constrain one another. Concretely,

Theorem 2.

The class of locally constant networks ⊇ the class of oblique decision trees.

Proof.

For any oblique decision tree with depth T, there are at most 2^T - 1 pairs of weights and biases, one per decision node. We construct a locally constant network with a single hidden layer whose neurons are in one-to-one correspondence with the decision nodes, such that each (weight, bias) pair in the oblique decision tree equals the (weight, bias) pair of the corresponding neuron.

Each leaf node of the decision tree is associated with an output value and a sequence of T decisions along its root-to-leaf path; these decisions are determined by the activation indicators of the corresponding neurons. We therefore set the table g of the locally constant network to output, for every activation pattern consistent with the decisions leading to a leaf, the value of that leaf.

As a result, the constructed locally constant network yields the same output as the given oblique decision tree for all the inputs routed to each leaf node, which concludes the proof. ∎

Then we prove that the class of locally constant networks is a subset of the class of oblique decision trees, which simply follows the construction of the toy examples in Fig. 1.

Theorem 3.

The class of locally constant networks ⊆ the class of oblique decision trees.

Proof.

Any locally constant network can be re-written to have one neuron per layer, by expanding each layer with N_l > 1 neurons into N_l consecutive layers without effective intra-connections. Below, the notation refers to the converted locally constant network with one neuron per layer and M hidden layers. We define the following oblique decision tree of depth M, for d = 1, ..., M:

  1. r_1 = o^1_1, whose decision node uses the weight and bias of the first neuron z^1_1.

  2. For d > 1: r_d = o^d_1, whose decision node uses the locally effective weight and bias of z^d_1 once the activations of the preceding neurons are fixed by the decision pattern r_{1:d-1}. Note that r_{1:d-1} coincides with the activation pattern of the first d - 1 neurons.

  3. The leaf value associated with a complete decision pattern r_{1:M} is the table output g(r_{1:M}).

Note that, in order to be a valid decision tree, the weight and bias of each decision node have to be unique for all x that yield the same decision pattern r_{1:d-1}. To see this, observe that, for d > 1, once the activation pattern of the preceding neurons is fixed, each z^d_1 is a fixed affine function of the input, so its weight and bias are fixed quantities given a decision pattern.

Since the decision pattern of the constructed tree coincides with the activation pattern of the network and the leaves reproduce the table g, the two models yield the same mapping. ∎

Despite the simplicity of the proof, it has some practical implications:

Remark 4.

The proof of Theorem 3 implies that we can train a locally constant network with M neurons, and convert it to an oblique decision tree with depth M (for interpretability).

Remark 5.

The proof of Theorem 3 establishes that, given a fixed number of neurons, it suffices (representationally) to only consider the locally constant networks with one neuron per layer.

Remark 5 is important for learning small locally constant networks (which can be converted to shallow decision trees for interpretability), since representation capacity is critical for low-capacity models. In the remainder of the paper, we will only consider the setting with one neuron per layer (N_l = 1).
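As a minimal sketch of the conversion mentioned in Remark 4, assuming the one-neuron-per-layer `DenseReLUNet` above (names and layout are illustrative): for every prefix of decisions, the effective weight and bias of the next decision node can be read off the network by fixing the activations of the preceding neurons. The leaf values are then read off the output map g on each complete decision pattern.

```python
import itertools
import torch

def lcn_to_tree(net, input_dim):
    """Enumerate the decision nodes of the oblique tree implied by a 1-neuron-per-layer net.

    Returns {decision_prefix: (weight, bias)}, where the node reached after the
    decisions in `decision_prefix` tests  weight . x + bias >= 0.
    """
    nodes = {}
    for d, layer_d in enumerate(net.layers):
        for prefix in itertools.product([0.0, 1.0], repeat=d):
            x = torch.zeros(1, input_dim, requires_grad=True)
            feats = [x]
            for o, layer in zip(prefix, net.layers[:d]):
                z = layer(torch.cat(feats, dim=-1))
                feats.append(o * z)          # fix the preceding activation: keep z if o=1, zero it if o=0
            z_d = layer_d(torch.cat(feats, dim=-1)).sum()
            (weight,) = torch.autograd.grad(z_d, x)          # locally effective weight of this node
            nodes[prefix] = (weight.squeeze(0), z_d.item())  # bias = value of z^d at x = 0
    return nodes
```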

3.5 Structurally shared parameterization

Although we have established the exact class-level equivalence between locally constant networks and oblique decision trees, once we restrict the depth M of the locally constant networks, they can no longer reproduce all the decision trees of depth M. The reason is intuitive: we are effectively using M pairs of (weight, bias) in the locally constant network to implicitly realize 2^M - 1 pairs of (weight, bias) in the corresponding oblique decision tree. Such an exponential reduction of the effective parameters in the representation of oblique decision trees acts as a “dimension reduction” of the model capacity. This section aims to reveal the shared parameterization embedded in the oblique decision trees derived from locally constant networks.

In this section, the oblique decision trees and the associated parameters refer to the decision trees obtained via the proof of Theorem 3. We start the analysis with a decomposition of the deepest decision weight among the preceding weights. To simplify notation, let v^M denote the block of W^M acting directly on the input x^0 and let u^M_i denote the scalar coefficient of W^M on the preceding neuron x^i. Since z^M = W^M x^{0:(M-1)} + b^M and each x^i is an affine transformation of the input within a locally linear region,

∂z^M/∂x^0 = v^M + Σ_{i=1}^{M-1} u^M_i · o^i_1 · (∂z^i_1/∂x^0),

where we simply re-write the derivatives in terms of tree parameters. Since v^M is fixed across all the regions, the above decomposition implies that, in the induced tree, all the weights at the same depth are restricted to be a linear combination of the fixed basis v^M and the corresponding preceding weights. We can extend this analysis to compare weights within the same layer, and we begin by comparing weights whose decision patterns differ in a single coordinate. To help interpret the statement below, recall that w_{r_{1:d}} is the weight of the decision node reached after the decision pattern r_{1:d}.

Lemma 6.

For an oblique decision tree of depth M induced by a locally constant network, consider any two decision patterns of the same length that agree in every coordinate except one. Then the difference between the two corresponding decision weights is determined, up to a scaling factor, by the weight of the decision node at which the two patterns diverge.

The proof involves some algebraic manipulation and is deferred to Appendix A. Lemma 6 characterizes an interesting structural constraint embedded in the oblique decision trees realized by locally constant networks: a discrepancy in the decision patterns is reflected in the discrepancy of the corresponding weights, up to a scaling factor. The analysis can be generalized to all the weights in the same layer, and the message is similar.

Proposition 7.

For an oblique decision tree of depth M induced by a locally constant network, and any two decision patterns of the same length that agree in every coordinate except for a subset of coordinates, we have

(4)

The statement can be proved by applying Lemma 6 multiple times.

Discussion. Here we summarize this section and provide some discussion. Locally constant networks implicitly represent oblique decision trees of the same depth with a structurally shared parameterization. In the implied oblique decision trees, the weight of each decision node is a linear combination of a weight shared across the whole layer and all the preceding weights. The analysis explains how locally constant networks use only M weight vectors to model a decision tree with 2^M - 1 decision nodes; this yields a strong regularization effect that helps avoid overfitting, and it aids computation by exponentially reducing the memory required to store the weights.

3.6 Standard locally constant networks and extensions

The simple structure of the canonical locally constant networks is beneficial for theoretical analysis, but it is not practical for learning, since the discrete activation pattern does not provide gradients for training the networks. Indeed, the derivative of each indicator o^l_i with respect to z^l_i is zero almost everywhere and undefined at zero, so the gradient of the canonical architecture with respect to the model parameters carries no learning signal. Here we present another architecture that is equivalent to the canonical architecture, but exhibits sub-gradients with respect to the model parameters and is flexible for model extension.

Standard architecture. We assume one neuron per layer (N_l = 1; cf. Remark 5). We denote the Jacobian of all the neurons after activation with respect to the input as ∂x^{1:M}/∂x^0, treated as a single vectorized representation. We then define the standard architecture as g(∂x^{1:M}/∂x^0), where g is a fully-connected network.

We abbreviate the standard locally constant networks as Lcn. Note that each x^l is locally linear and thus the Jacobian ∂x^{1:M}/∂x^0 is locally constant. We replace the activation pattern with the Jacobian as the invariant representation of each locally linear region (in practice, we also include each bias, which is omitted here to simplify exposition), and we replace the table with a differentiable function g that takes real vectors as input. The gradients of Lcn with respect to its parameters are thus established through the derivatives of g and the mixed partial derivatives of the neurons (the derivatives of ∂x^{1:M}/∂x^0).
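A minimal sketch of the standard architecture with one neuron per layer (class and argument names are illustrative): the input-gradient of every neuron is computed with autograd, kept in the graph, and fed to a small fully-connected g. The biases mentioned above are omitted from the sketch for brevity.

```python
import torch
import torch.nn as nn

class LCN(nn.Module):
    """Standard locally constant network: g applied to the Jacobian of all neurons w.r.t. the input."""

    def __init__(self, input_dim, depth, out_dim, activation=torch.relu):
        super().__init__()
        self.activation = activation
        self.layers = nn.ModuleList()
        in_dim = input_dim
        for _ in range(depth):
            self.layers.append(nn.Linear(in_dim, 1))     # one neuron per layer
            in_dim += 1
        self.g = nn.Linear(depth * input_dim, out_dim)   # g: here a linear map on the vectorized Jacobian

    def forward(self, x0):
        x0 = x0.requires_grad_(True)
        feats, grads = [x0], []
        for layer in self.layers:
            z = layer(torch.cat(feats, dim=-1))
            x_l = self.activation(z)
            feats.append(x_l)
            # d x^l / d x^0, kept in the graph so the loss stays differentiable w.r.t. all parameters
            (g_l,) = torch.autograd.grad(x_l.sum(), x0, create_graph=True)
            grads.append(g_l)
        return self.g(torch.cat(grads, dim=-1))
```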

One can prove the equivalence between the standard architecture and the canonical architecture; here we provide a sketch of the proof. Since the activation pattern uniquely identifies the Jacobian ∂x^{1:M}/∂x^0, and the table maps each region to a vector in an unconstrained manner, the canonical architecture is no less powerful than the standard architecture. To prove the other direction, observe that, for each neuron with the activations of its preceding neurons fixed, either only one of its two activation states is feasible, or the two states induce different Jacobians. In either case the Jacobian identifies the activation pattern, so we can construct a sufficiently expressive network g to match the table.

Discussion. All the previous analyses extend to the standard architecture due to the above representational equivalence. In addition, the standard architecture exhibits a new property that is only partially present in the canonical architecture. For decision and leaf nodes to which no training data is routed, there is no way to obtain learning signals in classic oblique decision trees. However, due to the shared parameterization (see §3.5), locally constant networks can “learn” all the decision nodes in the implied oblique decision trees (given a way to optimize the networks), and the standard architecture can even “learn” all the leaf nodes due to the parameterized output function g.

Extensions. The construction of (standard) locally constant networks enables several natural extensions, owing to the flexibility of the neural architecture and the decision tree interpretation. The original locally linear network (Lln), which outputs a linear function instead of a constant for each region, can be regarded as one such extension. Here we discuss two examples.


  • Approximately locally constant networks (Alcn): we can change the activation function while keeping the model architecture of Lcn. For example, we can replace ReLU with softplus, log(1 + exp(z)), which leads to an approximately locally constant network, as the softplus function has an approximately locally constant derivative for inputs with large absolute value (see the sketch after this list). Note that the canonical architecture (a tabular g) is not compatible with such an extension.

  • Ensemble locally constant networks (Elcn): since each Lcn can only output at most 2^M distinct values, it is limited for complex tasks like regression (akin to a single decision tree). We can instead use an additive ensemble of Lcn or Alcn models to increase the capacity, where the ensemble prediction is the sum of the outputs of its base models.
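Both extensions are small changes on top of the LCN sketch above (again with illustrative names): Alcn swaps the activation for softplus, and Elcn sums the outputs of several base models.

```python
import torch
import torch.nn.functional as F

# Alcn: same architecture, but softplus gives approximately locally constant derivatives.
alcn = LCN(input_dim=2048, depth=8, out_dim=1, activation=F.softplus)

# Elcn: an additive ensemble whose prediction is the sum of its base models' outputs.
ensemble = [LCN(input_dim=2048, depth=8, out_dim=1, activation=F.softplus) for _ in range(5)]
x = torch.randn(4, 2048)                      # a batch of hypothetical fingerprint features
prediction = sum(model(x) for model in ensemble)
```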

3.7 Learning and computation

In this section, we discuss training algorithms and efficient computation for the proposed models.

Training Lcn and Alcn. Even though Lcn is sub-differentiable, the network does not exhibit useful gradient information for learning each locally constant representation whenever a neuron is inactive: operationally, an inactive ReLU zeroes out both the corresponding entry of the Jacobian and its gradient with respect to the model parameters. To alleviate the problem, we propose to leverage softplus as an infinitely differentiable approximation of ReLU to obtain meaningful learning signals for every neuron. Concretely, we conduct the following annealing during training:

σ_γ(z) = γ · max(0, z) + (1 − γ) · log(1 + exp(z)),   (5)

where γ ∈ [0, 1] is an iteration-dependent annealing parameter and σ_γ replaces the activation in Eq. (1). Both Lcn and Alcn can be constructed as special cases of Eq. (5). We train Lcn with γ equal to the ratio between the current epoch and the total number of epochs, so that the activation is annealed to ReLU by the end of training, and we train Alcn with γ = 0. Both models are optimized via gradient descent.
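A sketch of the annealing schedule, assuming the interpolated activation reconstructed in Eq. (5); the training-loop skeleton and names are illustrative.

```python
import torch
import torch.nn.functional as F

def annealed_activation(z, gamma):
    """Interpolate between softplus (gamma = 0) and ReLU (gamma = 1), cf. Eq. (5)."""
    return gamma * torch.relu(z) + (1.0 - gamma) * F.softplus(z)

model = LCN(input_dim=2048, depth=8, out_dim=1)
total_epochs = 100
for epoch in range(total_epochs):
    gamma = epoch / total_epochs                          # Lcn schedule; use gamma = 0 for Alcn
    model.activation = lambda z, g=gamma: annealed_activation(z, g)
    # ... one epoch of gradient descent on the training loss ...
```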

We also apply DropConnect (Wan et al., 2013) to the weight matrices during training. Despite the simple form DropConnect takes in a locally constant network, it entails a structured dropout on the weights of the corresponding oblique decision tree (see §3.5), which would be challenging to reproduce in a typical oblique decision tree. In addition, it encourages exploration of the parameter space, which is easy to see for the raw Lcn: the randomization can flip an inactive neuron to an active one, establishing an effective learning signal. Note that standard Dropout (Srivastava et al., 2014) is not ideal for the low-capacity models considered here.
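A minimal DropConnect sketch for a weight matrix (a standard formulation with inverted scaling, not code from the paper): each weight entry is independently zeroed during training, which in the implied tree perturbs every decision weight that depends on it.

```python
import torch

def dropconnect(weight, p, training=True):
    """Randomly zero individual weight entries with probability p; rescale to keep the expectation."""
    if not training or p == 0.0:
        return weight
    mask = torch.bernoulli(torch.full_like(weight, 1.0 - p))
    return weight * mask / (1.0 - p)
```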

Training Elcn. Since each ensemble component is sub-differentiable, we could in principle learn the whole ensemble through gradient descent. However, this approach is not scalable in practice due to memory constraints. Instead, we propose to train the ensemble in a boosting fashion:

  1. We first train an initial locally constant network.

  2. For each subsequent iteration, we add a new base model and optimize it while keeping the previously trained components fixed.

Note that, in the second step, only the latest model is optimized, so we can simply store the predictions of the preceding models without loading them into memory. Each partial ensemble can be learned directly through gradient descent, without resorting to meta-algorithms such as adaptive boosting (Freund and Schapire, 1997) or gradient boosting (Friedman, 2001).
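A sketch of the boosting-style loop described above, with a squared-error loss and illustrative names: only the newest base model receives gradients, while earlier components contribute through their cached predictions.

```python
import torch

def train_elcn(make_base_model, x, y, num_base_models, epochs, lr=1e-3):
    """Incrementally fit an additive ensemble; trained components are frozen as cached predictions."""
    cached_pred = torch.zeros_like(y)            # sum of predictions of the already-trained components
    ensemble = []
    for _ in range(num_base_models):
        model = make_base_model()
        opt = torch.optim.Adam(model.parameters(), lr=lr, amsgrad=True)  # AMSGrad variant, cf. Appendix B
        for _ in range(epochs):
            opt.zero_grad()
            loss = ((cached_pred + model(x) - y) ** 2).mean()
            loss.backward()
            opt.step()
        cached_pred = cached_pred + model(x).detach()
        ensemble.append(model)
    return ensemble
```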

Computation. Lcn and Alcn are built on the gradients of all the neurons, ∂x^{1:M}/∂x^0, which can be computationally challenging to obtain. Existing automatic differentiation (e.g., back-propagation) computes the gradient of a single scalar output, so obtaining all M neuron gradients naively requires M backward passes. Instead, we propose an efficient dynamic programming procedure that only requires a single forward pass. Writing v^l for the block of W^l acting directly on the input and u^l_i for its coefficient on the preceding neuron x^i (one neuron per layer):

  1. ∂x^1/∂x^0 = o^1 · v^1.

  2. ∂x^l/∂x^0 = o^l · (v^l + Σ_{i=1}^{l-1} u^l_i · ∂x^i/∂x^0),

where the ∂x^i/∂x^0 are the gradients already computed for the preceding layers. The cost of the dynamic program is dominated by the inner summation in each iteration, whereas straightforward back-propagation re-computes the partial solutions for each neuron and hence incurs an extra factor on the order of the depth. The inner summation can be parallelized on a GPU, which reduces the sequential cost of both approaches while preserving the advantage of the dynamic program.
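A sketch of the dynamic program for a one-neuron-per-layer network laid out as in the earlier sketches (each nn.Linear takes the concatenation of the input and all previous neurons): the gradient of every neuron reuses the gradients already computed for the preceding layers, so a single forward pass suffices.

```python
import torch

def forward_with_gradients(net, x0):
    """Compute every neuron x^l and its gradient d x^l / d x^0 in one forward pass."""
    batch, input_dim = x0.shape
    feats, grads = [x0], []
    for layer in net.layers:
        w = layer.weight.squeeze(0)                  # length: input_dim + (number of preceding neurons)
        z = layer(torch.cat(feats, dim=-1))
        o = (z >= 0).float()                         # activation indicator = ReLU derivative
        # d z^l / d x^0 = direct input block + sum_i w_i * d x^i / d x^0 over preceding neurons
        dz = w[:input_dim].expand(batch, -1)
        for i, g_prev in enumerate(grads):
            dz = dz + w[input_dim + i] * g_prev
        grads.append(o * dz)                         # d x^l / d x^0 = o^l * d z^l / d x^0
        feats.append(torch.relu(z))
    return feats[1:], grads
```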

4 Experiment

Dataset Bace HIV SIDER Tox21 PDBbind
Task (Multi-label) binary classification Regression
Number of labels 1 1 27 12 1
Number of data 1,513 41,127 1,427 7,831 11,908
Table 1: Dataset statistics

Here we evaluate the efficacy of our models (Lcn, Alcn, and Elcn) on the chemical property prediction datasets from MoleculeNet (Wu et al., 2018), where random forest performs competitively. We include four (multi-label) binary classification datasets and one regression dataset; their statistics are given in Table 1. We follow the literature to construct the features (Wu et al., 2018). Specifically, we use the standard Morgan fingerprint (Rogers and Hahn, 2010), 2,048 binary indicators of chemical substructures, for the classification datasets, and ‘grid features’ (fingerprints of pairs between the ligand and the protein; see Wu et al. (2018)) for the regression dataset. Each dataset is split into (train, validation, test) sets under the criterion specified in MoleculeNet.
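For concreteness, a featurization sketch using RDKit Morgan fingerprints with the bit length described above (the exact MoleculeNet preprocessing pipeline may differ; the radius and helper name are assumptions):

```python
import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def morgan_features(smiles_list, n_bits=2048, radius=2):
    """Map SMILES strings to binary Morgan-fingerprint feature vectors."""
    feats = []
    for smiles in smiles_list:
        mol = Chem.MolFromSmiles(smiles)
        fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits)
        arr = np.zeros(n_bits)
        DataStructs.ConvertToNumpyArray(fp, arr)
        feats.append(arr)
    return np.stack(feats).astype(np.float32)
```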

We compare Lcn and its extensions (Lln, Alcn, and Elcn) with the following baselines:


  • (Oblique) decision trees: Cart (Breiman et al. (1984)), Hhcart (Wickramarachchi et al. (2016); oblique decision trees induced greedily on linear projections), and Tao (Carreira-Perpinán and Tavallali (2018); oblique decision trees trained via alternating optimization).

  • Tree ensembles: Rf (Breiman (2001); random forest) and Gbdt (Friedman (2001); gradient boosting decision trees).

  • Graph networks: Gcn (Duvenaud et al. (2015); graph convolutional networks on molecules).

For the decision trees and for Lcn, Lln, and Alcn, we tune the tree depth; for Lcn, Lln, and Alcn, we also tune the DropConnect probability. Since regression tasks require precise estimates of the prediction values while classification tasks do not, we tune the number of hidden layers of g for the regression task, and simply use a linear model for the classification tasks. For Elcn, we use Alcn as the base model and tune the ensemble size separately for the classification and regression tasks. To train our models, we use the cross-entropy loss for the classification tasks and the mean squared error for the regression task. Other details are available in Appendix B.

Following the chemistry literature (Wu et al., 2018), we measure performance by AUC for classification and by root-mean-squared error (RMSE) for regression. For each dataset, we train a model for each label, compute the mean and standard deviation of the performance across several random seeds, and report their average across all the labels within the dataset. The results are shown in Table 2.

Among the (oblique) decision tree training algorithms, our Lcn achieves the best performance. The continuous extension (Alcn) consistently improves on Lcn, which is expected since Lcn is limited in the number of distinct outputs (leaf values) it can produce. Among the ensemble methods, the proposed Elcn always outperforms its classic counterpart, Gbdt, and sometimes outperforms Rf. Overall, Lcn is the state-of-the-art method for learning oblique decision trees, and Elcn performs competitively against other alternatives for training tree ensembles.

Dataset Bace (AUC) HIV (AUC) SIDER (AUC) Tox21 (AUC) PDBbind (RMSE)
Cart 0.652 ± 0.024 0.544 ± 0.009 0.570 ± 0.010 0.651 ± 0.005 1.573 ± 0.000
Hhcart 0.545 ± 0.016 0.636 ± 0.000 0.570 ± 0.009 0.638 ± 0.007 1.530 ± 0.000
Tao 0.734 ± 0.000 0.627 ± 0.000 0.577 ± 0.004 0.676 ± 0.003 Not applicable
Lcn 0.839 ± 0.013 0.728 ± 0.013 0.624 ± 0.044 0.781 ± 0.017 1.508 ± 0.017
Lln 0.818 ± 0.007 0.737 ± 0.009 0.677 ± 0.014 0.813 ± 0.009 1.627 ± 0.008
Alcn 0.854 ± 0.007 0.738 ± 0.009 0.653 ± 0.044 0.814 ± 0.009 1.369 ± 0.007
Rf 0.869 ± 0.003 0.796 ± 0.007 0.685 ± 0.011 0.839 ± 0.007 1.256 ± 0.002
Gbdt 0.859 ± 0.005 0.748 ± 0.001 0.668 ± 0.014 0.812 ± 0.011 1.247 ± 0.002
Elcn 0.874 ± 0.005 0.757 ± 0.011 0.685 ± 0.010 0.822 ± 0.006 1.219 ± 0.007
Gcn 0.783 ± 0.014 0.763 ± 0.016 0.638 ± 0.012 0.829 ± 0.006 1.44 ± 0.12
Table 2: Main results (mean ± standard deviation). The first four rows are (oblique) decision tree methods, the next two rows are single-model extensions of Lcn, the next three rows are ensemble methods, and the last row is Gcn. The results of Gcn are copied from Wu et al. (2018).
Figure 2: Empirical analysis for oblique decision trees on the HIV dataset: (a) learning curve of Lcn (ablation study for the proposed training method); (b) training performance and (c) testing performance of different training methods.

Empirical analysis. Here we analyze the proposed Lcn in terms of optimization and generalization performance on the large HIV dataset. We conduct an ablation study of the proposed training method for Lcn in Fig. 2(a). Direct training (without annealing) does not suffice to learn Lcn, while the proposed annealing succeeds in optimization; even better optimization and generalization are achieved by adding DropConnect, which corroborates our hypothesis about its exploration effect during training (§3.7) as well as its well-known regularization effect. Compared to other methods (Fig. 2(b)), only Tao attains comparable training performance. In terms of generalization (Fig. 2(c)), the competitors do not perform well and overfit fairly quickly. In stark contrast, Lcn outperforms the competitors by a large margin and becomes even more accurate as the depth increases. This is expected given the strong regularization of Lcn, which uses a linear number of effective weights to construct an exponential number of decision nodes, as discussed in §3.5. Additional analysis and a visualization of the tree converted from Lcn are included in Appendix C.

5 Discussion and conclusion

We create a novel neural architecture by casting the derivatives of deep networks as the representation, which realizes a new class of neural models that is equivalent to oblique decision trees. The induced oblique decision trees embed rich structures and are compatible with deep learning methods. This work can be used to interpret methods that utilize derivatives of a network, such as training a generator through the gradient of a discriminator (Goodfellow et al., 2014). The work opens up many avenues for future work, from building representations from the derivatives of neural models to the incorporation of more structures, such as the inner randomization of random forest.

References

  • K. P. Bennett (1994) Global tree optimization: a non-greedy decision tree algorithm. Computing Science and Statistics, pp. 156–156. Cited by: §2.
  • D. Bertsimas and J. Dunn (2017) Optimal classification trees. Machine Learning 106 (7), pp. 1039–1082. Cited by: §1, §2.
  • L. Breiman, J. Friedman, R. Olshen, and C. Stone (1984) Classification and Regression Trees. Wadsworth and Brooks, Monterey, CA. Cited by: §1, §2, 1st item.
  • L. Breiman (2001) Random forests. Machine learning 45 (1), pp. 5–32. Cited by: §1, 2nd item.
  • R. P. Brent (1991) Fast training algorithms for multilayer neural nets. IEEE Transactions on Neural Networks 2 (3), pp. 346–354. Cited by: §2.
  • M. A. Carreira-Perpinán and P. Tavallali (2018) Alternating optimization of decision trees, with application to learning sparse oblique trees. In Advances in Neural Information Processing Systems, pp. 1211–1221. Cited by: §2, 1st item.
  • K. J. Cios and N. Liu (1992) A machine learning method for generation of a neural network architecture: a continuous id3 algorithm. IEEE Transactions on Neural Networks 3 (2), pp. 280–291. Cited by: §2.
  • D. K. Duvenaud, D. Maclaurin, J. Iparraguirre, R. Bombarell, T. Hirzel, A. Aspuru-Guzik, and R. P. Adams (2015) Convolutional networks on graphs for learning molecular fingerprints. In Advances in neural information processing systems, pp. 2224–2232. Cited by: 3rd item.
  • Y. Freund and R. E. Schapire (1997) A decision-theoretic generalization of on-line learning and an application to boosting. Journal of computer and system sciences 55 (1), pp. 119–139. Cited by: §3.7.
  • J. H. Friedman (2001) Greedy function approximation: a gradient boosting machine. Annals of statistics, pp. 1189–1232. Cited by: §1, §3.7, 2nd item.
  • I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680. Cited by: §5.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §3.1.
  • G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger (2017) Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4700–4708. Cited by: §3.1.
  • G. Lee, D. Alvarez-Melis, and T. S. Jaakkola (2019) Towards robust, locally linear deep networks. In International conference on learning representations, External Links: Link Cited by: §2, §3.2.
  • A. L. Maas, A. Y. Hannun, and A. Y. Ng (2013) Rectifier nonlinearities improve neural network acoustic models. In Proceedings of the 30th international Conference on Machine Learning, Vol. 30, pp. 3. Cited by: §3.1.
  • B. H. Menze, B. M. Kelm, D. N. Splitthoff, U. Koethe, and F. A. Hamprecht (2011) On oblique random forests. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 453–469. Cited by: §2.
  • S. K. Murthy, S. Kasif, S. Salzberg, and R. Beigel (1993) OC1: a randomized algorithm for building oblique decision trees. In Proceedings of AAAI, Vol. 93, pp. 322–327. Cited by: §1, §2.
  • S. K. Murthy, S. Kasif, and S. Salzberg (1994) A system for induction of oblique decision trees. Journal of artificial intelligence research 2, pp. 1–32. Cited by: §2.
  • V. Nair and G. E. Hinton (2010) Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th international conference on machine learning, pp. 807–814. Cited by: §2, §3.1.
  • M. Norouzi, M. Collins, M. A. Johnson, D. J. Fleet, and P. Kohli (2015) Efficient non-greedy optimization of decision trees. In Advances in neural information processing systems, pp. 1729–1737. Cited by: §2.
  • F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, et al. (2011) Scikit-learn: machine learning in python. Journal of machine learning research 12 (Oct), pp. 2825–2830. Cited by: 2nd item, 3rd item.
  • M. Raghu, B. Poole, J. Kleinberg, S. Ganguli, and J. S. Dickstein (2017) On the expressive power of deep neural networks. In Proceedings of the 34th international conference on machine learning, pp. 2847–2854. Cited by: §2.
  • S. J. Reddi, S. Kale, and S. Kumar (2018) On the convergence of adam and beyond. In International conference on learning representations, External Links: Link Cited by: Appendix B.
  • D. Rogers and M. Hahn (2010) Extended-connectivity fingerprints. Journal of chemical information and modeling 50 (5), pp. 742–754. Cited by: §C.2, §4.
  • V. Sandulescu and M. Chiru (2016) Predicting the future relevance of research institutions-the winning solution of the kdd cup 2016. arXiv preprint arXiv:1609.02728. Cited by: §1.
  • I. K. Sethi (1990) Entropy nets: from decision trees to neural networks. Proceedings of the IEEE 78 (10), pp. 1605–1613. Cited by: §2.
  • N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov (2014) Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15 (1), pp. 1929–1958. Cited by: §3.7.
  • L. Wan, M. Zeiler, S. Zhang, Y. Le Cun, and R. Fergus (2013) Regularization of neural networks using dropconnect. In Proceedings of the 30th international conference on machine learning, pp. 1058–1066. Cited by: Locally Constant Networks, §1, §3.7.
  • T. Weng, H. Zhang, H. Chen, Z. Song, C. Hsieh, D. Boning, I. S. Dhillon, and L. Daniel (2018) Towards fast computation of certified robustness for relu networks. Proceedings of the 35th international conference on machine learning. Cited by: §2.
  • D. Wickramarachchi, B. Robertson, M. Reale, C. Price, and J. Brown (2016) HHCART: an oblique decision tree. Computational Statistics & Data Analysis 96, pp. 12–23. Cited by: §2, 1st item.
  • Z. Wu, B. Ramsundar, E. N. Feinberg, J. Gomes, C. Geniesse, A. S. Pappu, K. Leswing, and V. Pande (2018) MoleculeNet: a benchmark for molecular machine learning. Chemical science 9 (2), pp. 513–530. Cited by: §1, §1, Table 2, §4, §4.

Appendix A Proof of Lemma 6

Proof.

We fix the coordinate at which the two decision patterns diverge and perform induction on the depth. Without loss of generality, we assume the two patterns differ only at that coordinate.

For the base case, the decomposition in §3.5, applied one layer after the diverging decision node, directly gives the claimed relation between the two weights.

We then assume the statement holds up to some depth; applying the decomposition once more and substituting the inductive hypothesis yields the claim at the next depth.

The proof follows by induction. ∎

Appendix B Implementation details

Here we provide the full version of the implementation details.

For the baseline methods:


  • Cart, Hhcart, and Tao: we tune the tree depth on the validation set.

  • Rf: we use the scikit-learn (Pedregosa et al., 2011) implementation of random forest with a fixed number of estimators.

  • Gbdt: we use the scikit-learn (Pedregosa et al., 2011) implementation of gradient boosting trees, and tune the number of estimators.

For Lcn, Lln, and Alcn, we run the same training procedure. For all the datasets, we tune the depth and the DropConnect probability on the validation set. The models are optimized with mini-batch stochastic gradient descent, with the learning rate annealed by a constant factor at a fixed epoch interval; the batch size, initial learning rate, and schedule are set separately for the classification tasks and the regression task.

Both Lcn and Alcn have an extra fully-connected network g, which transforms the derivatives into the final outputs. Since regression tasks require precise estimates of the prediction values while classification tasks do not, we tune the number of hidden layers of g for the regression dataset, and simply use a linear model for the classification datasets.

For Elcn, we fix the depth of the base models and tune the number of base models separately for the classification tasks and the regression task. We set a larger DropConnect probability for the classification tasks to encourage strong regularization, and a smaller one for the regression task to impose only mild regularization (because the regression targets are hard to fit). We found that stochastic gradient descent does not suffice to incrementally learn the Elcn, so we use the AMSGrad optimizer (Reddi et al., 2018) instead, training each partial ensemble for a fixed number of epochs with a task-specific learning rate and batch size.

To train our models, we use the cross entropy loss for the classification tasks, and mean squared error for the regression task.

Appendix C Supplementary empirical analysis and visualization

Depth 8 9 10 11 12
# of possible patterns 256 512 1024 2048 4096
# of training patterns 72 58 85 103 86
# of testing patterns 32 31 48 49 40
# of testing patterns - training patterns 5 2 11 8 11
Ratio of testing points w/ unobserved patterns 0.040 0.013 0.072 0.059 0.079
Testing performance - observed patterns 0.8505 0.8184 0.8270 0.8429 0.8390
Testing performance - unobserved patterns 0.8596 0.9145 0.8303 0.7732 0.8894
Table 3: Analysis for “unobserved decision patterns” of Lcn in the Bace dataset.

C.1 Supplementary empirical analysis

In this section, we investigate the learning of “unobserved branching / leaves” discussed in §3.6. The “unobserved branching / leaves” refer to decision and leaf nodes of the oblique decision tree converted from Lcn to which no training data is routed. It is impossible for traditional (oblique) decision tree training algorithms to learn the values of such nodes (e.g., the output value of a leaf node in the traditional framework is based on the training data routed to that leaf). However, the shared parameterization in our oblique decision trees provides a means to update such unobserved nodes during training (see the discussion in §3.6).

Since the above scenario in general happens more frequently in small datasets than in large datasets, we evaluate the scenario on the small Bace dataset (binary classification task). Here we empirically analyze a few things pertaining to the unobserved nodes:


  • # of training patterns: the number of distinct end-to-end activation / decision patterns encountered in the training data.

  • # of testing patterns: the number of distinct end-to-end activation / decision patterns encountered in the testing data.

  • # of testing patterns - training patterns: the number of distinct end-to-end activation / decision patterns that are encountered in the testing data but not in the training data.

  • Ratio of testing points w/ unobserved patterns: the number of testing points that yield unobserved patterns divided by the total number of testing points.

  • Testing performance - observed patterns: we denote the number of testing points as N, and the prediction and label of the i-th point as ŷ_i and y_i, respectively. We collect the subset S of testing indices whose activation / decision patterns are observed in the training data, and then compute the performance of the corresponding predictions. Since the original performance is measured by AUC, we generalize AUC to a measure over a subset of points as:

    (6)

    When S contains all the testing points, the above measure recovers the standard AUC.

  • Testing performance - unobserved patterns: the same as above, but computed over the testing points whose activation / decision patterns are unobserved in the training data.

The results are shown in Table 3. There are some interesting findings. For example, although there is an exponential number of possible patterns, the number of patterns that actually appear in the dataset is quite small. The ratio of testing points with unobserved patterns is also small, yet these unobserved branching / leaves appear to be handled properly: they do not lead to markedly different performance compared to the patterns observed during training.

C.2 Visualization

Here we visualize the locally constant network learned on the HIV dataset as its equivalent oblique decision tree in Fig. 3. Since the dimension of the Morgan fingerprint (Rogers and Hahn, 2010) is quite high (2,048), we only visualize the top-K weights (in terms of absolute value) for each decision node, and we normalize each weight vector to unit norm. Since the task is evaluated by ranking (AUC), we annotate each leaf node with the rank of its output probability among all the leaf nodes (the higher, the more likely).

Note that a complete visualization requires some engineering effort. Our main contribution here is the algorithm that transforms an Lcn into an oblique decision tree, rather than the visualization of oblique decision trees, so we only provide an initial visualization as a proof of concept.

Figure 3: Visualization of a learned locally constant network represented as an oblique decision tree via the proof of Theorem 3. The number in each leaf indicates the rank of its output probability among the leaves (the exact value is not important). See the description in Appendix C.2.