1 Introduction
The Neural Turing Machine (NTM) [1, 2, 3, 4] enhances the classical Turing machine with a differentiable attention mechanism for accessing data and programs in an external memory bank. NTM appears under different names in different papers, for example memory-augmented neural networks (MANN), reservoir memory machines, and the differentiable neural computer. These names highlight different aspects of the NTM model. With an external "artificial" working memory, NTM has a strong capacity to store and retrieve pieces of information, much like a human being. Some experiments in [1] verified its effectiveness compared with the well-known LSTM models. Since the pioneering work of [1], NTM has been studied in depth and applied to many areas, including machine translation, recommendation systems, SLAM, reinforcement learning, object tracking, video understanding, and graph networks [5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]. The controllers in all of these works are neural networks. Our work is the first to propose an NTM based on differentiable decision trees.

The differentiable decision tree [16, 17]
brings differentiability to the classical decision tree. Compared with the widely used deep neural networks, tree models still have some advantages. Their structure is simple, they are easy to use, and their decision process is easy to explain. Especially for tabular data, gradient boosted decision tree (GBDT) [18] models usually have better accuracy than deep networks, as many Kaggle competitions and real applications have shown. However, classical tree-based models lack differentiability, which is their key disadvantage compared with deep neural networks. Differentiable trees remove this limitation, so we can train them with powerful gradient-based optimization algorithms (SGD, Adam, …). Batch training greatly reduces memory usage, and end-to-end learning removes much of the preprocessing work. Gradient-based optimization also searches a larger continuous space, so it can find more accurate solutions. Experiments on some large datasets showed that the differentiable forest model has higher accuracy than the best GBDT libraries (LightGBM, CatBoost, and XGBoost) [19].

In recent years, different research teams [16, 17, 20, 21, 22, 23, 24, 25] have proposed different models and algorithms to implement differentiability. [16] is a pioneering work on the differentiable decision tree model. It presents a stochastic routing algorithm to learn the split parameters via backpropagation. The tree-enhanced network achieves higher accuracy than GoogLeNet [26], which was the best model at the time of its publication. [17]
introduces neural oblivious decision ensembles (NODE) in the framework of deep learning. The core unit of NODE is the oblivious decision tree, which uses the same feature and decision threshold in all internal nodes of the same depth. Our algorithm has no such limitation: as described in Section 2, each internal node in our model can have its own feature and decision threshold. [20] presents adaptive neural trees (ANTs), which are built from three differentiable modules (routers, transformers, and solvers). During training, ANTs adaptively grow the tree structure. Their experiments showed competitive performance on some test datasets. [21] presents a network-based tree model (DNDT). Its core module is a soft binning function, a differentiable approximation of the classical hard binning function. [22] proposes random hinge forests and random ferns with a differentiable ReLU-like indicator function, so that the loss function can be optimized end-to-end with stochastic gradient descent.
None of these works notes the connection with NTM. Our work in Section 2 shows that the differentiable forest is a special case of NTM, which should help to improve the differentiable forest model.
2 Differentiable forest based neural turing machine
In this section, we first give the main structure and algorithm of the differentiable forest and of NTM, and then reveal the essential connection between them: the differentiable forest is also a neural Turing machine with a specific controller and attention mechanism. To our knowledge, this is the first proposal of a differentiable forest based NTM.
2.1 Neural Turing Machine
Just like the classic Turing machine, NTM is a recurrent machine with two main modules: the controller $\mathcal{C}$ and the external memory bank $\mathcal{M}$. The external memory bank is usually defined as an $R \times W$ matrix, which contains $R$ cells (memory locations), each of size $W$. The controller reads from and writes to $\mathcal{M}$ to update its state. The main innovation of NTM is its attention mechanism, which updates the weights of each read/write operation.

At each time step $t$, NTM updates the weight $w_t(i)$ of each cell $i$. The read operation then returns $r_t$, and the write operation updates each cell to $\mathcal{M}_t(i)$:

$$ r_t = \sum_{i} w_t(i)\,\mathcal{M}_t(i), \qquad \mathcal{M}_t(i) = \mathcal{M}_{t-1}(i)\,\bigl[\mathbf{1} - w_t(i)\,e_t\bigr] + w_t(i)\,a_t \tag{1} $$

where $e_t$ is the additional erase vector, $a_t$ is the add vector, and products between vectors are element-wise. For more details on a practical and robust implementation of NTM, please see [3].
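The read and write operations described above can be sketched in a few lines of plain Python. This is only an illustration of the standard weighted read and erase/add write from [1]; the function names `ntm_read` and `ntm_write` and the toy memory contents are assumptions made for this example, and the attention weights are given by hand rather than produced by a controller.

```python
def ntm_read(memory, weights):
    """Read vector: the attention-weighted sum of all memory cells."""
    width = len(memory[0])
    return [sum(w * cell[k] for w, cell in zip(weights, memory))
            for k in range(width)]


def ntm_write(memory, weights, erase, add):
    """Return updated memory: cell <- cell * (1 - w * erase) + w * add."""
    return [[m * (1.0 - w * e) + w * a for m, e, a in zip(cell, erase, add)]
            for w, cell in zip(weights, memory)]


# Toy memory bank: three cells of width 2 (hypothetical values).
memory = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
# Hand-picked attention weights that sum to 1.
weights = [0.5, 0.25, 0.25]

r = ntm_read(memory, weights)
new_memory = ntm_write(memory, weights, erase=[1.0, 0.0], add=[0.0, 1.0])
print(r)           # [2.5, 3.5]
print(new_memory)  # every cell is touched, in proportion to its weight
```

Because every cell receives a non-zero weight, both operations are smooth functions of the weights, which is what makes the memory access trainable by gradient descent.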
2.2 Response augmented differentiable forest (RaDF)
The response augmented differentiable forest has two main modules. The first is the controller $\mathcal{C}$, which consists of $K$ differentiable decision trees $[T_1, T_2, \dots, T_K]$; the second is the external memory bank $Q$. Each cell of $Q$ is the response corresponding to a leaf node [19]. The controller reads and writes the response bank $Q$. Each leaf node acts as a head in the NTM that reads/writes the response bank $Q$. So the response augmented differentiable forest is a specific NTM.

Consider a dataset $X = \{x_1, \dots, x_N\}$ with $N$ samples and targets $Y = \{y_1, \dots, y_N\}$. Each sample $x$ has $M$ attributes, $x = [x^{(1)}, x^{(2)}, \dots, x^{(M)}]^T$. RaDF learns the differentiable decision trees $[T_1, \dots, T_K]$ and the response bank $Q$ to minimize the loss between the target $y$ and the prediction $\hat{y}$:

$$ (T^*, Q^*) = \arg\min_{T,\,Q} \frac{1}{N}\sum_{i=1}^{N} L(\hat{y}_i, y_i) \tag{2} $$
Figure 1(a) shows the simplest case of RaDF. The controller is a simple decision tree with one root node and two leaf nodes. All leaf nodes in the controller read/write the corresponding responses in the external memory. For each input $x$, the gating function $g$ at the root node generates probabilities for both leaf nodes. Formula 3 gives the general definition of the gating function with learnable parameters $A$ and threshold $b$. The function $\sigma$ maps $A^T x - b$ to a probability in $[0, 1]$; the sigmoid function is one example.

$$ g(A, x, b) = \sigma(A^T x - b) \tag{3} $$
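As a minimal sketch of such a gating function, the snippet below uses the sigmoid as the squashing function applied to a learnable linear combination of the attributes minus a threshold. The names `sigmoid`, `gate`, `A` and `b` and the numeric values are illustrative assumptions, not the author's implementation.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def gate(A, x, b):
    """Soft split: squash (A . x - b) into a probability."""
    z = sum(a * xi for a, xi in zip(A, x)) - b
    return sigmoid(z)


A = [1.0, -2.0]   # hypothetical learnable weights, one per attribute
b = 0.5           # hypothetical learnable threshold

g_boundary = gate(A, [0.5, 0.0], b)   # A . x equals b, so the gate is undecided
g_far = gate(A, [5.0, 0.0], b)        # far on one side of the split
print(g_boundary, g_far)
```

A sample exactly on the decision boundary gets probability 0.5, and the gate approaches 0 or 1 as the sample moves away from the boundary, so a hard split is recovered in the limit.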
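As shown in Figure 1(b), the sample $x$ is directed to each node $n$ with probability $g_n$ (the gate value of its parent, or one minus that value, depending on the branch; the root is always reached), so the input finally reaches all leaves. For a tree of depth $D$, we represent the path to leaf $j$ as $\{n_1, n_2, \dots, n_D\}$, where $n_1$ is the root node and $n_D$ is the leaf node $j$. The probability $p_j$ of reaching leaf $j$ is the product of the probabilities of all nodes on this path:

$$ p_j = \prod_{n \in \{n_1, n_2, \dots, n_D\}} g_n \tag{4} $$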
In the classical decision tree, the gating function is the Heaviside function, so each sample goes either left or right. For each sample $x$, only one leaf node is activated in the classical decision tree, whereas in the RaDF model all leaf nodes are activated (with probability $p_j$) to read/write the response bank $Q$.
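The contrast between soft and hard routing can be illustrated with a small sketch that computes the probability of reaching every leaf of a complete binary tree. It assumes that the internal nodes are stored in breadth-first order, that the right child receives the gate value and the left child receives one minus it, and that the function names and parameter values are hypothetical.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function (soft gate)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def heaviside(z):
    """Hard gate of a classical decision tree."""
    return 1.0 if z >= 0 else 0.0


def leaf_probabilities(x, nodes, squash):
    """Probability of reaching each leaf, left to right.

    `nodes` lists the (A, b) pairs of the internal nodes of a complete
    binary tree in breadth-first order.
    """
    level = [1.0]          # the root is always reached
    start = 0
    while start < len(nodes):
        next_level = []
        for offset, p in enumerate(level):
            A, b = nodes[start + offset]
            g = squash(sum(a * xi for a, xi in zip(A, x)) - b)
            # Path probability = product of the gate values along the path.
            next_level.extend([p * (1.0 - g), p * g])
        start += len(level)
        level = next_level
    return level


# Three internal nodes, four leaves (hypothetical parameters).
nodes = [([1.0, 0.0], 0.0), ([0.0, 1.0], 0.0), ([0.0, 1.0], 1.0)]
x = [0.0, 1.0]

soft = leaf_probabilities(x, nodes, sigmoid)
hard = leaf_probabilities(x, nodes, heaviside)
print(soft)  # every leaf gets a positive share, and the shares sum to 1
print(hard)  # exactly one leaf is active
```

With the sigmoid gate the leaf probabilities form a distribution over all leaves, which plays the role of the attention weights over memory cells; with the Heaviside gate the distribution collapses onto a single leaf.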
The read operation of leaf node $j$ returns a response vector $Q_j$. The output of tree $h$ is then the probability-weighted average of these responses:

$$ \hat{y}^h(x) = \sum_{j} p_j\, Q_j \tag{5} $$
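A short sketch of this read step: the tree output is the sum of the leaf responses weighted by the probability of reaching each leaf, which has the same weighted-sum form as the NTM read operation in Section 2.1. The function name `tree_output` and the values below are assumptions for illustration only.

```python
def tree_output(leaf_probs, responses):
    """Probability-weighted average of the leaf response vectors."""
    width = len(responses[0])
    return [sum(p * q[k] for p, q in zip(leaf_probs, responses))
            for k in range(width)]


# Two leaves (the simplest tree), each holding a 2-dimensional response.
leaf_probs = [0.25, 0.75]                # hypothetical reach probabilities
responses = [[1.0, 0.0], [0.0, 1.0]]     # hypothetical response bank cells

y_hat = tree_output(leaf_probs, responses)
print(y_hat)  # [0.25, 0.75]
```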
A single tree is a very weak learner, so we merge many trees to get higher accuracy, just like random forests and other ensemble learning methods. The final prediction $\hat{y}$ is the weighted sum of the outputs of all $K$ trees. In the simplest case, every weight is $1/K$, so $\hat{y}$ is just the average result:

$$ \hat{y}(x) = \frac{1}{K}\sum_{h=1}^{K} \hat{y}^h(x) \tag{6} $$
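The ensemble step can be sketched as a weighted sum over the per-tree outputs, falling back to the plain average when no weights are given. The name `forest_output` and the four toy tree outputs are assumptions for this example.

```python
def forest_output(tree_outputs, weights=None):
    """Weighted sum of per-tree outputs; defaults to the plain average."""
    n_trees = len(tree_outputs)
    if weights is None:
        weights = [1.0 / n_trees] * n_trees
    width = len(tree_outputs[0])
    return [sum(w * out[k] for w, out in zip(weights, tree_outputs))
            for k in range(width)]


# Outputs of four hypothetical trees for one sample.
outputs = [[1.0, 3.0], [2.0, 5.0], [3.0, 7.0], [2.0, 1.0]]

y_avg = forest_output(outputs)
y_weighted = forest_output(outputs, weights=[0.5, 0.5, 0.0, 0.0])
print(y_avg)       # [2.0, 4.0]
print(y_weighted)  # [1.5, 4.0]
```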
Let $\Theta$ represent all parameters $(A, b, Q)$. The final loss is then given by the general form of formula 7:

$$ \mathcal{L}(\Theta; X, Y) = \frac{1}{N}\sum_{i=1}^{N} L(\hat{y}_i, y_i) \tag{7} $$

where $L$ is a function that maps the prediction vector and the target to a scalar objective value. For classification problems, the classical choice of $L$ is cross-entropy. For regression problems, $L$ may be MSE, MAE, Huber loss, or others. To minimize the loss in formula 7, we train the model with stochastic gradient descent (SGD).
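Two of the per-sample losses mentioned above can be sketched as follows. The cross-entropy version assumes that the forest output is treated as a vector of unnormalized class scores passed through a softmax, which is a common choice but is not stated in the text; all function names are illustrative.

```python
import math


def mse(y_hat, y):
    """Mean squared error between a prediction vector and a target vector."""
    return sum((p - t) ** 2 for p, t in zip(y_hat, y)) / len(y)


def cross_entropy(scores, label):
    """Softmax cross-entropy; `label` is the index of the true class."""
    m = max(scores)  # subtract the maximum for numerical stability
    log_sum = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_sum - scores[label]


def mean_loss(loss_fn, predictions, targets):
    """Average of a per-sample loss over a dataset."""
    total = sum(loss_fn(p, t) for p, t in zip(predictions, targets))
    return total / len(targets)


regression_loss = mean_loss(mse, [[1.0], [3.0]], [[0.0], [1.0]])
classification_loss = mean_loss(cross_entropy, [[0.0, 0.0], [2.0, 2.0]], [0, 1])
print(regression_loss)       # 2.5
print(classification_loss)   # log(2), since both classes get equal scores
```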
2.3 Algorithm of response augmented differentiable forest
As a special case of NTM, RaDF is also trained by SGD, just like a neural network. The main difference in the implementation is the update process of the controller.
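Based on the general loss function defined in formula 7, we use the stochastic gradient descent method [27, 28] to reduce the loss. As formula 8 shows, all parameters are updated batch by batch:

$$ \Theta \leftarrow \Theta - \eta\,\frac{1}{|B|}\sum_{x_i \in B} \frac{\partial L(\hat{y}_i, y_i)}{\partial \Theta} \tag{8} $$

where $B$ is the current batch, $\eta$ is the learning rate, and $x_i$ is a sample in the current batch.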
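To make the batch update concrete, here is a minimal sketch of mini-batch SGD for the simplest tree (one root gate, two leaves, scalar responses) with a squared-error loss and hand-derived gradients. It is an illustration under stated assumptions, not the author's training algorithm: the synthetic data, the initial parameters, the learning rate, the batch size and every function name are hypothetical, and plain SGD is used in place of the optimizers in [27, 28].

```python
import math
import random


def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict(params, x):
    """Return (prediction, gate value) for one sample."""
    A, b, Q = params
    g = sigmoid(sum(a * xi for a, xi in zip(A, x)) - b)
    return (1.0 - g) * Q[0] + g * Q[1], g


def sgd_step(params, batch, lr):
    """One update: parameters minus lr times the batch-averaged gradient."""
    A, b, Q = params
    grad_A, grad_b, grad_Q = [0.0] * len(A), 0.0, [0.0, 0.0]
    for x, y in batch:
        y_hat, g = predict(params, x)
        d_out = 2.0 * (y_hat - y)            # d(squared error) / d(y_hat)
        grad_Q[0] += d_out * (1.0 - g)       # the response bank is updated
        grad_Q[1] += d_out * g               # in proportion to the leaf weights
        d_z = d_out * (Q[1] - Q[0]) * g * (1.0 - g)
        for k, xk in enumerate(x):
            grad_A[k] += d_z * xk
        grad_b -= d_z                        # z = A . x - b
    n = len(batch)
    return ([a - lr * ga / n for a, ga in zip(A, grad_A)],
            b - lr * grad_b / n,
            [q - lr * gq / n for q, gq in zip(Q, grad_Q)])


def mean_loss(params, data):
    return sum((predict(params, x)[0] - y) ** 2 for x, y in data) / len(data)


random.seed(0)
# Synthetic 1-D regression task: the target is the sign of the input.
data = [([v], 1.0 if v > 0 else -1.0)
        for v in (random.uniform(-1, 1) for _ in range(200))]
params = ([0.1], 0.0, [-0.1, 0.1])           # (A, b, Q), hypothetical init

loss_before = mean_loss(params, data)
for epoch in range(50):
    random.shuffle(data)
    for start in range(0, len(data), 20):
        params = sgd_step(params, data[start:start + 20], lr=0.2)
loss_after = mean_loss(params, data)
print(loss_before, loss_after)
```

The gate parameters and the leaf responses are updated in the same step, so the controller and the memory bank are learned jointly from the same gradient signal.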
The detailed algorithm is as follows:
3 Conclusion
In this short note, we revealed the deep connection between the differentiable forest and the neural Turing machine. Based on a detailed analysis of both models, the response augmented differentiable forest (RaDF) is a special case of NTM. The controller of RaDF is a differentiable forest, and the external memory cells of RaDF are response vectors that are read and written by the leaf nodes. This connection should deepen the understanding of both models and inspire new algorithms. We gave a detailed training algorithm for RaDF and will report more detailed experiments in later papers.
References
 [1] Alex Graves, Greg Wayne, and Ivo Danihelka. Neural turing machines. arXiv preprint arXiv:1410.5401, 2014.
 [2] Benjamin Paaßen and Alexander Schulz. Reservoir memory machines. arXiv preprint arXiv:2003.04793, 2020.
 [3] Mark Collier and Joeran Beel. Implementing neural turing machines. In International Conference on Artificial Neural Networks, pages 94–104. Springer, 2018.
 [4] Alex Graves, Greg Wayne, Malcolm Reynolds, Tim Harley, Ivo Danihelka, Agnieszka Grabska-Barwińska, Sergio Gómez Colmenarejo, Edward Grefenstette, Tiago Ramalho, John Agapiou, et al. Hybrid computing using a neural network with dynamic external memory. Nature, 538(7626):471–476, 2016.
 [5] Adam Santoro, Sergey Bartunov, Matthew Botvinick, Daan Wierstra, and Timothy Lillicrap. Meta-learning with memory-augmented neural networks. In International conference on machine learning, pages 1842–1850, 2016.
 [6] Emilio Parisotto and Ruslan Salakhutdinov. Neural map: Structured memory for deep reinforcement learning. arXiv preprint arXiv:1702.08360, 2017.
 [7] Xu Chen, Hongteng Xu, Yongfeng Zhang, Jiaxi Tang, Yixin Cao, Zheng Qin, and Hongyuan Zha. Sequential recommendation with user memory networks. In Proceedings of the eleventh ACM international conference on web search and data mining, pages 108–116, 2018.
 [8] Jack Rae, Jonathan J Hunt, Ivo Danihelka, Timothy Harley, Andrew W Senior, Gregory Wayne, Alex Graves, and Timothy Lillicrap. Scaling memory-augmented neural networks with sparse reads and writes. In Advances in Neural Information Processing Systems, pages 3621–3629, 2016.
 [9] Jingwei Zhang, Lei Tai, Joschka Boedecker, Wolfram Burgard, and Ming Liu. Neural slam: Learning to explore with external memory. arXiv preprint arXiv:1706.09520, 2017.
 [10] Travis Ebesu, Bin Shen, and Yi Fang. Collaborative memory network for recommendation systems. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, pages 515–524, 2018.
 [11] Tianyu Yang and Antoni B Chan. Learning dynamic memory networks for object tracking. In Proceedings of the European Conference on Computer Vision (ECCV), pages 152–167, 2018.
 [12] Seil Na, Sangho Lee, Jisung Kim, and Gunhee Kim. A read-write memory network for movie story understanding. In Proceedings of the IEEE International Conference on Computer Vision, pages 677–685, 2017.
 [13] Caglar Gulcehre, Sarath Chandar, Kyunghyun Cho, and Yoshua Bengio. Dynamic neural turing machine with continuous and discrete addressing schemes. Neural computation, 30(4):857–884, 2018.
 [14] Trang Pham, Truyen Tran, and Svetha Venkatesh. Graph memory networks for molecular activity prediction. In 2018 24th International Conference on Pattern Recognition (ICPR), pages 639–644. IEEE, 2018.
 [15] Tom Kenter and Maarten de Rijke. Attentive memory networks: Efficient machine reading for conversational search. arXiv preprint arXiv:1712.07229, 2017.
 [16] Peter Kontschieder, Madalina Fiterau, Antonio Criminisi, and Samuel Rota Bulo. Deep neural decision forests. In Proceedings of the IEEE international conference on computer vision, pages 1467–1475, 2015.
 [17] Sergei Popov, Stanislav Morozov, and Artem Babenko. Neural oblivious decision ensembles for deep learning on tabular data. arXiv preprint arXiv:1909.06312, 2019.
 [18] Jerome H Friedman. Greedy function approximation: a gradient boosting machine. Annals of statistics, pages 1189–1232, 2001.
 [19] Yingshi Chen. Attention augmented differentiable forest for tabular data. arXiv preprint arXiv:2010.02921, 2020.
 [20] Ryutaro Tanno, Kai Arulkumaran, Daniel C Alexander, Antonio Criminisi, and Aditya Nori. Adaptive neural trees. arXiv preprint arXiv:1807.06699, 2018.
 [21] Yongxin Yang, Irene Garcia Morillo, and Timothy M Hospedales. Deep neural decision trees. arXiv preprint arXiv:1806.06988, 2018.
 [22] Nathan Lay, Adam P Harrison, Sharon Schreiber, Gitesh Dawer, and Adrian Barbu. Random hinge forest for differentiable learning. arXiv preprint arXiv:1802.03882, 2018.
 [23] Ji Feng, Yang Yu, and Zhi-Hua Zhou. Multi-layered gradient boosting decision trees. In Advances in neural information processing systems, pages 3551–3561, 2018.
 [24] Andrew Silva, Taylor Killian, Ivan Dario Jimenez Rodriguez, Sung-Hyun Son, and Matthew Gombolay. Optimization methods for interpretable differentiable decision trees in reinforcement learning. arXiv preprint arXiv:1903.09338, 2019.
 [25] Hussein Hazimeh, Natalia Ponomareva, Petros Mol, Zhenyu Tan, and Rahul Mazumder. The tree ensemble layer: Differentiability meets conditional computation. arXiv preprint arXiv:2002.07772, 2020.
 [26] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1–9, 2015.
 [27] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 [28] Jerry Ma and Denis Yarats. Quasi-hyperbolic momentum and adam for deep learning. arXiv preprint arXiv:1810.06801, 2018.