A short note on the decision tree based neural Turing machine

by   Yingshi Chen, et al.

The Turing machine and the decision tree have developed independently for a long time. With the recent development of differentiable models, there is now an intersection between them. The neural Turing machine (NTM) opened the door to memory networks: it uses a differentiable attention mechanism to read from and write to an external memory bank. The differentiable forest brings differentiability to the classical decision tree. In this short note, we show the deep connection between these two models: the differentiable forest is a special case of the NTM. The differentiable forest is actually a decision tree based neural Turing machine. Based on this deep connection, we propose the response augmented differentiable forest (RaDF). The controller of RaDF is a differentiable forest; the external memory of RaDF consists of response vectors which are read and written by the leaf nodes.





1 Introduction

The neural Turing machine (NTM) [1, 2, 3, 4] enhances the classical Turing machine with a differentiable attention mechanism to access data and programs in an external memory bank. The NTM appears under different names in different papers, for example memory augmented neural networks (MANN), reservoir memory machines, and the differentiable neural computer; these names reveal interesting aspects of the NTM model in different ways. With external "artificial" working memory, an NTM has a strong capacity to store and retrieve pieces of information, much like a human being. Experiments in [1] verified its effectiveness over the famous LSTM models. Since the pioneering work of [1], the NTM has been deeply studied and applied to many applications, including machine translation, recommendation systems, SLAM, reinforcement learning, object tracking, video understanding, graph networks, etc. [5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]. The controllers in all these works are neural networks. Our work is the first proposition of a differentiable decision tree based NTM.

Differentiable decision trees [16, 17] bring differentiability to the classical decision tree. Compared with the widely used deep neural networks, tree models still have some advantages. Their structure is very simple, easy to use, and makes the decision process easy to explain. Especially on tabular data, gradient boosted decision tree (GBDT) models usually achieve better accuracy than deep networks, which has been verified in many Kaggle competitions and real applications. But classical tree-based models lack differentiability, which is their key disadvantage compared to deep neural networks. Differentiable trees now have full differentiability, so we can train them with many powerful gradient-based optimization algorithms (SGD, Adam, ...), use batch training to greatly reduce memory usage, and use end-to-end learning to avoid much preprocessing work. The differentiable formulation searches in a larger continuous space, so it can find more accurate solutions. Experiments on some large datasets showed that the differentiable forest model has higher accuracy than the best GBDT libraries (LightGBM, CatBoost, and XGBoost).


In recent years, different research teams [16, 17, 20, 21, 22, 23, 24, 25] have proposed different models and algorithms to implement this differentiability. [16] is a pioneering work on the differentiable decision tree model; they present a stochastic routing algorithm to learn the split parameters via back propagation. Their tree-enhanced network achieves higher accuracy than GoogLeNet [26], which was the best model at the time of its publication. [17] introduces neural oblivious decision ensembles (NODE) in the framework of deep learning. The core unit of NODE is the oblivious decision tree, which uses the same feature and decision threshold in all internal nodes of the same depth. There is no such limitation in our algorithm: as described in section 2.3, each internal node in our model can have an independent feature and decision threshold. [20] presents adaptive neural trees (ANTs) constructed from three differentiable modules (routers, transformers, and solvers); during training, ANTs adaptively grow the tree structure, and their experiments show competitive performance on some test datasets. [21] presents a network-based tree model (DNDT) whose core module is a soft binning function, a differentiable approximation of the classical hard binning function. [22] proposes random hinge forests and random ferns with a differentiable ReLU-like indicator function; the loss function is then optimized end-to-end with a stochastic gradient descent algorithm.

It is clear that these groups did not realize the connection with the NTM. Our work in section 2 reveals that the differentiable forest is a special case of the NTM, which should help improve the differentiable forest model.

2 Differentiable forest based neural Turing machine

In this section, we first give the main structure and algorithm of the differentiable forest and the NTM, then reveal the essential connection between them: the differentiable forest is also a neural Turing machine with a specific controller and attention mechanism. To our knowledge, this is the first proposition of a differentiable forest based NTM.

2.1 Neural Turing Machine

Just like the classic Turing machine, the NTM is a recurrent machine with two main modules: the first is the controller and the second is the external memory bank $M$. The external memory bank is usually defined as an $n \times m$ matrix, which contains $n$ cells (memory locations), and the size of each cell is $m$. The controller reads and writes $M$ to update its state. The main innovation of the NTM is its attention mechanism, which updates the weights of each read/write operation.

At each time step $t$, the NTM updates the weight $w_t(i)$ of each cell. Then the read operation returns

$r_t = \sum_i w_t(i)\, M_t(i)$  (1)

and the write operation updates each cell as

$M_t(i) = M_{t-1}(i)\,[1 - w_t(i)\, e_t] + w_t(i)\, a_t$  (2)

where $e_t$ is an additional erase vector and $a_t$ is an add vector. For more details on a practical, robust implementation of the NTM, please see [3].
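The read and write rules above can be sketched in a few lines of plain Python (a minimal illustration with list-based vectors; the function names are ours, not from [3]):

```python
def ntm_read(M, w):
    """Soft read: blend all n cells with attention weights w, r = sum_i w[i]*M[i]."""
    m = len(M[0])
    return [sum(w[i] * M[i][k] for i in range(len(M))) for k in range(m)]

def ntm_write(M, w, e, a):
    """Soft write: each cell i becomes M[i]*(1 - w[i]*e) + w[i]*a, componentwise,
    where e is the erase vector and a is the add vector."""
    return [[M[i][k] * (1.0 - w[i] * e[k]) + w[i] * a[k]
             for k in range(len(M[i]))]
            for i in range(len(M))]
```

With a one-hot weight vector the soft read/write degenerates to the hard addressing of a classic Turing machine; intermediate weights blend cells, which is what makes the operations differentiable.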

2.2 Response augmented differentiable forest (RaDF)

The response augmented differentiable forest has two main modules. The first is the controller, which holds the differentiable decision trees; the second is the external memory bank $R$. Each cell of $R$ is actually the response corresponding to some leaf node [19]. The controller reads and writes the response bank $R$; each leaf node is just a head, in the NTM sense, that reads/writes the response bank $R$. So the response augmented differentiable forest is just a specific NTM.

Consider a dataset $X = \{x_1, x_2, \dots, x_N\}$ with $N$ samples and its target $Y = \{y_1, y_2, \dots, y_N\}$, where each sample has $M$ attributes. RaDF learns the differentiable decision trees and the response bank $R$ to minimize the loss between the target $y$ and the prediction $\hat{y}$.


Figure 1(a) shows the simplest case of RaDF. The controller is just a simple decision tree (one root node, two leaf nodes). All leaf nodes in the controller read/write the corresponding response in the external memory. For each input $x$, the gating function $g$ at the root node generates the probabilities of both leaf nodes. Formula 3 gives the general definition of the gating function, with learnable parameters $A$ and threshold $b$:

$g(x) = \sigma(Ax - b)$  (3)

where $\sigma$ maps $Ax - b$ to a probability in $[0, 1]$, for example the sigmoid function.

So as shown in Figure 1(b), the sample $x$ is directed to the left child with probability $g(x)$ and to the right child with probability $1 - g(x)$. Finally, the input $x$ reaches all leaves. For a tree with depth $d$, we represent the path to leaf $j$ as $P_j = (n_0, n_1, \dots, n_d)$, where $n_0$ is the root node and $n_d$ is leaf $j$. The probability $p_j$ of reaching leaf $j$ is just the product of the gating probabilities of all internal nodes on this path:

$p_j = \prod_{n \in P_j} p_n(x)$, where $p_n(x) = g_n(x)$ if the path turns left at node $n$, and $1 - g_n(x)$ otherwise  (4)
In the classical decision tree model, the gating function is just the Heaviside step function: a sample goes either left or right. So for each sample $x$, only one leaf node is activated in the classical decision tree, while in the RaDF model, all leaf nodes are activated (each with probability $p_j$) to read/write the response bank $R$.
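The soft routing described above can be sketched as follows, assuming a full binary tree whose internal nodes are stored in heap order (an indexing convention we choose for illustration; the paper does not fix one):

```python
import math

def sigmoid(z):
    """Map a real score to a probability in [0, 1]."""
    return 1.0 / (1.0 + math.exp(-z))

def leaf_probabilities(x, A, b):
    """Soft-route sample x through a full binary tree of depth d.
    A[n-1], b[n-1] hold the learnable weights and threshold of internal
    node n (nodes 1 .. 2^d - 1 in heap order: children of n are 2n, 2n+1).
    Returns the 2^d leaf probabilities, which sum to 1."""
    d = int(math.log2(len(A) + 1))
    probs = []
    for leaf in range(2 ** d):                 # leaf index encodes the path bits
        p, node = 1.0, 1
        for bit in range(d - 1, -1, -1):
            g = sigmoid(sum(a * xi for a, xi in zip(A[node - 1], x)) - b[node - 1])
            go_left = ((leaf >> bit) & 1) == 0
            p *= g if go_left else 1.0 - g     # multiply gate probs along the path
            node = 2 * node + (0 if go_left else 1)
        probs.append(p)
    return probs
```

Because every gate splits probability mass between its two children, the returned leaf probabilities always form a distribution; replacing `sigmoid` with a hard step recovers the classical tree, where exactly one leaf gets probability 1.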

(a) The simplest response augmented differentiable forest, with only three nodes (one root node with two child nodes)
(b) A response augmented differentiable forest and its response bank. The input $x$ reaches each leaf node with its corresponding path probability
Figure 1: Differentiable tree with responses in outer memory

The read operation of leaf node $j$ returns a response vector $r_j$. Then the output of a tree is just the probability-weighted average of these responses:

$\hat{y} = \sum_j p_j r_j$  (5)

A single tree is a very weak learner, so we should merge many trees to get higher accuracy, just like random forests and other ensemble learning methods. The final prediction $\hat{y}$ is the weighted sum over all $K$ trees:

$\hat{y} = \sum_{k=1}^{K} w_k \hat{y}_k$  (6)

In the simplest case, the weight is always $w_k = 1/K$, so $\hat{y}$ is just the average result.
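The two averaging steps, leaf responses within a tree and tree outputs across the forest, can be sketched as follows (a pure-Python illustration; the names are ours):

```python
def tree_output(leaf_probs, responses):
    """Output of one tree: the probability-weighted average of its leaf
    response vectors, which are the cells read from the response bank R."""
    dim = len(responses[0])
    return [sum(p * r[k] for p, r in zip(leaf_probs, responses))
            for k in range(dim)]

def forest_predict(tree_outputs, weights=None):
    """Final prediction: weighted sum over the K tree outputs; uniform
    weights 1/K give the plain average of the ensemble."""
    K = len(tree_outputs)
    if weights is None:
        weights = [1.0 / K] * K
    dim = len(tree_outputs[0])
    return [sum(w * out[k] for w, out in zip(weights, tree_outputs))
            for k in range(dim)]
```

Note that `tree_output` is exactly an NTM soft read over the cells of $R$ with the path probabilities playing the role of attention weights.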
Let $\Theta$ represent all parameters $\{A, b, R\}$; then the final loss is given by the general form of formula 7:

$L(\Theta) = \sum_{i=1}^{N} \ell(\hat{y}_i, y_i)$  (7)

where $\ell$ is a function that maps the prediction and target to an objective value. In the case of a classification problem, the classical choice of $\ell$ is cross-entropy; for a regression problem, $\ell$ may be MSE, MAE, Huber loss, or others. To minimize the loss in formula 7, we use the stochastic gradient descent (SGD) method to train this model.

2.3 Algorithm of response augmented differentiable forest

As a specific case of the NTM, RaDF is also trained by SGD, just like a neural network. The main difference in the implementation is the update process of the controller.

Based on the general loss function defined in formula 7, we use the stochastic gradient descent method [27, 28] to reduce the loss. As formula 8 shows, we update all parameters batch by batch:

$\Theta_{t+1} = \Theta_t - \eta \, \frac{1}{|B|} \sum_{i \in B} \nabla_\Theta\, \ell(\hat{y}_i, y_i)$  (8)

where $B$ is the current batch, $\eta$ is the learning rate, and $i$ ranges over the samples in the current batch.
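The batch update of formula 8 can be sketched generically (parameters flattened into one list; a minimal illustration, not the paper's implementation):

```python
def sgd_step(theta, grads, lr):
    """One SGD update of formula 8: theta <- theta - lr * (mean gradient
    over the batch). `theta` is the flattened parameter vector; `grads`
    holds one per-sample gradient vector for each sample in the batch."""
    batch = len(grads)
    return [theta[k] - lr * sum(g[k] for g in grads) / batch
            for k in range(len(theta))]
```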

The detailed algorithm is as follows:

Input: training, validation and test datasets
Output: learned model: response bank $R$ in external memory, feature weight matrix $A$ and threshold values $b$

1: Init feature weight matrix $A$
2: Init response bank $R$ at external memory. Each leaf node has its response stored in one cell of $R$.
3: Init threshold values $b$
4: Init time step $t = 0$
5: while not converged do
6:     for each batch $B$ do
7:         Calculate gating value $g(x)$ at each internal node
8:         Calculate probability $p_j$ at each leaf node
9:         Read response bank $R$, update prediction $\hat{y}$ of each tree
10:        Calculate the loss $L$
11:        Backpropagate to get the gradient $\nabla_\Theta L$
12:        Update weight of each response cell
13:        Update response bank $R$ with erase vector $e$ and add vector $a$
14:        Update the parameters $A$, $b$
15:    Evaluate the loss on the validation dataset
16: return learned model
Algorithm 1 Implementation of response augmented differentiable forest
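As an end-to-end illustration of Algorithm 1, the following toy sketch trains the simplest RaDF of Figure 1(a) (one sigmoid gate, two scalar leaf responses) on a tiny regression set, with the gradients written out by hand; the function names and the data are ours, chosen only for illustration:

```python
import math

def _sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(params, data):
    """Mean squared error of the one-gate RaDF on (x, y) pairs."""
    a, b, r0, r1 = params
    total = 0.0
    for x, y in data:
        g = _sigmoid(a * x - b)
        total += (g * r0 + (1.0 - g) * r1 - y) ** 2
    return total / len(data)

def train_radf_toy(data, steps=3000, lr=0.3):
    """Full-batch SGD on the simplest RaDF: gate g = sigmoid(a*x - b) at
    the root, two scalar leaf responses r0 (left) and r1 (right), and
    prediction yhat = g*r0 + (1-g)*r1."""
    a, b, r0, r1 = 0.1, 0.0, 0.0, 0.0
    for _ in range(steps):
        ga = gb = gr0 = gr1 = 0.0
        for x, y in data:
            g = _sigmoid(a * x - b)
            yhat = g * r0 + (1.0 - g) * r1
            d = 2.0 * (yhat - y) / len(data)    # dL/dyhat (mean reduction)
            gr0 += d * g                        # gradients of the leaf responses
            gr1 += d * (1.0 - g)
            dg = d * (r0 - r1) * g * (1.0 - g)  # chain rule through the gate
            ga += dg * x
            gb += -dg
        a, b = a - lr * ga, b - lr * gb
        r0, r1 = r0 - lr * gr0, r1 - lr * gr1
    return a, b, r0, r1
```

Training drives the leaf responses toward the targets of the samples routed to them, which is precisely the soft write of the response bank; a full implementation would obtain the same gradients by automatic differentiation.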

3 Conclusion

In this short note, we revealed the deep connection between the differentiable forest and the neural Turing machine. Based on a detailed analysis of both models, the response augmented differentiable forest (RaDF) is actually a special case of the NTM: the controller of RaDF is a differentiable forest, and the external memory cells of RaDF are response vectors which are read and written by the leaf nodes. This discovery deepens the understanding of both models and may inspire new algorithms. We gave a detailed training algorithm for RaDF and will present detailed experiments in later papers.