Auto-encoders (Vincent et al., 2010) are a class of models that aim to map the input to a latent space and map it back to the original space, with low reconstruction error as the objective. Previous approaches to building such models have mainly come from the neural network community. For instance, a neural network based auto-encoder usually consists of an encoder and a decoder. The encoder maps the input to a hidden layer and the decoder maps it back to the input space. By concatenating the two parts and setting the reconstruction error as the learning objective, back-propagation can be used to train such models. Auto-encoders are widely used for dimensionality reduction (Hinton et al., 2006) and representation learning (Bengio et al., 2013a), as well as in more recent work on generative models such as Variational Auto-encoders (Kingma and Welling, 2013).
Ensemble learning (Zhou, 2012) is a powerful learning paradigm that trains multiple learners and combines them to tackle a problem. It is widely used across a broad range of tasks and demonstrates strong performance. Tree ensemble methods, or forests, such as Random Forest (Breiman, 2001), are among the best off-the-shelf methods for supervised learning (Fernández-Delgado et al., 2014). Other successful tree ensembles such as gradient boosting decision trees (GBDTs) (Chen and Guestrin, 2016) have also proven their ability during the past decade. Besides supervised learning, tree ensembles have achieved great success in other tasks; for example, isolation forest (Liu et al., 2008) is an efficient unsupervised method for anomaly detection. Recently, a deep model based on forests has also been proposed (Zhou and Feng, 2017), demonstrating performance competitive with DNNs across a broad range of tasks with far fewer hyper-parameters.
In this paper, we present the EncoderForest (abbrv. eForest), which enables a tree ensemble to perform forward encoding and backward decoding operations and can be trained in either a supervised or an unsupervised fashion. Experiments show that the eForest approach has the following advantages:
Accurate: Its experimental reconstruction error is lower than that of MLP or CNN based auto-encoders.
Efficient: Training eForest on a single KNL (many-core CPU) runs even faster than training a CNN auto-encoder on a Titan-X GPU.
Damage-tolerable: The trained model works well even when it is partially damaged.
Reusable: A model trained on one dataset can be directly applied to another dataset in the same domain.
The rest of the paper is organized as follows: we first introduce related work, then present the proposed eForest model, then report experimental results, and finally discuss conclusions and future work.
2 Related Work
Auto-encoding is an important task for learning associations from data, which is one of the key ingredients of deep learning (Goodfellow et al., 2016). The study of auto-encoding dates back to (Bourlard and Kamp, 1988), whose goal was to learn an auto-association relation that can be used for representation learning (Bengio et al., 2013a). Most previous approaches to auto-encoding are neural network based models. For instance, the under-complete auto-encoder aims to compress data for dimensionality reduction (Hinton and Salakhutdinov, 2006) and efficient coding (Liou et al., 2008); the sparse auto-encoder imposes a sparsity penalty on the activation layer (Hinton and Ranzato, 2010), which is related to sparse coding (Willmore and Tolhurst, 2001); and the denoising auto-encoder (Bengio et al., 2013b) forces the model to learn the mapping from a corrupted input to its noiseless version. Applications range from computer vision (Masci et al., 2011) to semantic hashing (Ruslan et al., 2007), which uses auto-encoders in information retrieval tasks. In fact, the concept of deep learning started with training a stack of auto-encoders in a greedy layer-wise fashion (Hinton et al., 2006). Auto-encoding has also been applied in more recent work such as the variational auto-encoder for generative models (Kingma and Welling, 2013).
Ensembles of decision trees, also called forests, are popularly used in ensemble learning (Zhou, 2012). For example, Bagging (Breiman, 1996) and Boosting (Freund and Schapire, 1999) usually take decision trees as component learners. Other well-known decision tree ensemble methods include Random Forest (Breiman, 2001) and GBDT (Friedman, 2001); the former is a variant of Bagging, whereas the latter is a variant of Boosting. Some efficient implementations of GBDT, e.g. XGBoost (Chen and Guestrin, 2016), have been widely used in industry and in various data analytics competitions. In addition to the above tree ensembles constructed in a supervised setting, unsupervised tree ensembles have also proven useful in various domains. For example, iForest (Liu et al., 2008) is an unsupervised forest designed for anomaly detection, and its ingredient, completely-random decision trees, has also been applied to tasks such as streaming new class learning (Mu et al., in press). Note that both supervised and unsupervised forests, i.e. Random Forest and completely-random tree forest, have been simultaneously exploited in the construction of deep forest (Zhou and Feng, 2017).
3 The Proposed Method
An auto-encoder has two basic functions: encoding and decoding. Encoding poses no difficulty for a forest, because at the very least the leaf node information can be regarded as a kind of encoding; moreover, subsets of nodes or even the branches of paths may offer more information for encoding.
First, we propose the encoding procedure of EncoderForest. Given a trained tree ensemble model of $T$ trees, the forward encoding procedure takes an input instance and sends it to the root node of each tree in the ensemble. Once the instance has traversed down to a leaf node in every tree, the procedure returns a $T$-dimensional vector, where the $t$-th element is the integer index of the leaf node reached in tree $t$.
A more concrete algorithm for forward encoding is shown in Algorithm 1. Notice that this encoding procedure is independent of the particular learning rule used to split the nodes of the trees. For instance, the decision rule can be learned in a supervised setting such as random forest, or in an unsupervised setting such as completely random trees.
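The forward encoding described above can be sketched in a few lines. This is a minimal illustration, not the paper's Algorithm 1: the `Node` layout, the `x[attr] >= threshold` test convention, and the two hand-built trees are assumptions made here for the example, and only numerical attributes are handled.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class Node:
    """Binary decision-tree node (hypothetical layout, numerical attributes only)."""
    leaf_id: Optional[int] = None      # set only on leaf nodes
    attr: Optional[int] = None         # attribute tested at an internal node
    threshold: Optional[float] = None  # the test is x[attr] >= threshold
    left: Optional["Node"] = None      # branch taken when the test is false
    right: Optional["Node"] = None     # branch taken when the test is true


def leaf_index(root: Node, x: Sequence[float]) -> int:
    """Send x down one tree and return the index of the leaf it reaches."""
    node = root
    while node.leaf_id is None:
        node = node.right if x[node.attr] >= node.threshold else node.left
    return node.leaf_id


def encode(forest: List[Node], x: Sequence[float]) -> List[int]:
    """Forward encoding: one leaf index per tree, so the code has len(forest) entries."""
    return [leaf_index(root, x) for root in forest]


# Two small hand-built trees, purely for illustration.
tree_a = Node(attr=0, threshold=1.0,
              left=Node(leaf_id=0),
              right=Node(attr=1, threshold=2.0,
                         left=Node(leaf_id=1), right=Node(leaf_id=2)))
tree_b = Node(attr=1, threshold=0.5, left=Node(leaf_id=0), right=Node(leaf_id=1))

code = encode([tree_a, tree_b], [1.5, 1.0])
print(code)
```

Nothing in `encode` depends on how the split tests were learned, which is the point made in the text: the same traversal works for supervised and for completely random trees.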
On the other hand, however, the decoding function is not that obvious. In fact, forests are generally used for forward prediction, by going from the root of each tree to the leaves, whereas it is unknown how to do backward reconstruction, i.e., inducing the original samples from information obtained at the leaves.
Suppose we are handling a binary classification task with four attributes. The first and second attributes are numerical; the third is a triple-valued attribute with values RED, BLUE, GREEN; the fourth is a boolean attribute with values YES, NO. Given an instance $x$, let $x_i$ denote the value of $x$ on the $i$-th attribute.
Now suppose that in the encoding step we have generated a forest as shown in Fig 1. We only know the leaf nodes into which the instance $x$ falls, shown in Fig 1 as the red nodes, and wish to reconstruct $x$.
Here, we propose an effective yet simple, possibly the simplest, strategy for backward reconstruction in forests. First, each leaf node corresponds to a unique path from the root, so we can identify the path from the leaf node without uncertainty.
For example, in Fig 1 the identified paths are highlighted in red. Second, each path corresponds to a symbolic rule; for example, the highlighted tree paths correspond to the following rule set, where the $i$-th rule corresponds to the path of the $i$-th tree in the forest and $\neg$ denotes the negation of a judgment:
This rule set can be further adjusted into a more succinct form:
Then, we can derive the Maximal-Compatible Rule (MCR). The MCR is a rule in which the coverage of each component cannot be enlarged without introducing an incompatibility. For example, from the above rule set we can get the corresponding MCR:
For each component of this MCR, the coverage cannot be enlarged; if it were, it would conflict with a condition in at least one of the rules. A more detailed description is shown in Algorithm 2.
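One way to read the MCR computation is as a per-attribute intersection of the constraints contributed by each path rule. The sketch below assumes that reading; it is not the paper's Algorithm 2. The rule representation (half-open intervals for numerical attributes, sets of allowed values for categorical ones) and the three rules themselves are hypothetical values chosen for illustration.

```python
import math
from typing import Dict, List, Set, Tuple, Union

Interval = Tuple[float, float]          # numerical constraint: low <= x < high
Constraint = Union[Interval, Set[str]]  # or the set of categories still allowed
Rule = Dict[int, Constraint]            # attribute index -> constraint


def intersect(a: Constraint, b: Constraint) -> Constraint:
    """Tightest constraint satisfied by everything that satisfies both a and b."""
    if isinstance(a, set) and isinstance(b, set):
        return a & b
    if isinstance(a, tuple) and isinstance(b, tuple):
        return (max(a[0], b[0]), min(a[1], b[1]))
    raise TypeError("cannot mix numerical and categorical constraints")


def maximal_compatible_rule(rules: List[Rule]) -> Rule:
    """Intersect, attribute by attribute, the constraints of all path rules."""
    mcr: Rule = {}
    for rule in rules:
        for attr, constraint in rule.items():
            mcr[attr] = intersect(mcr[attr], constraint) if attr in mcr else constraint
    return mcr


# Hypothetical path rules from three trees (made-up values, not the paper's rule set).
rules = [
    {0: (0.0, 2.7), 1: (1.5, math.inf), 2: {"BLUE", "GREEN"}, 3: {"YES"}},
    {0: (0.5, math.inf), 1: (-math.inf, 2.0), 2: {"GREEN"}},
    {0: (-math.inf, 2.0), 1: (-math.inf, 8.0), 2: {"GREEN"}, 3: {"YES"}},
]
mcr = maximal_compatible_rule(rules)
print(mcr)  # {0: (0.5, 2.0), 1: (1.5, 2.0), 2: {'GREEN'}, 3: {'YES'}}
```

Widening any interval in the result, for instance the upper bound 2.0 on attribute 0, would admit points that violate the third rule, which is the sense in which no component can be enlarged.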
It is very easy to prove the following theorem, and thus we omit the proof.
The original sample must reside in the input region defined by the MCR.
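For readers who want the omitted argument spelled out, a short sketch follows. The symbols $R_t$, $R_{\mathrm{MCR}}$, $\mathcal{X}$ and $T$ are notation introduced here for illustration, and the sketch assumes the MCR is the conjunction of all path rules.

```latex
\begin{aligned}
&\text{Let } R_t \subseteq \mathcal{X} \text{ be the input region defined by the rule of the path that } x
  \text{ follows in tree } t,\quad t = 1, \dots, T. \\
&\text{Since } x \text{ passes every test on its own path, } x \in R_t \text{ for every } t. \\
&\text{The MCR describes } R_{\mathrm{MCR}} = \bigcap_{t=1}^{T} R_t,
  \text{ and therefore } x \in R_{\mathrm{MCR}}.
\end{aligned}
```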
Thus, after obtaining the MCR, we can reconstruct the original sample. For categorical attributes such as $x_3$ and $x_4$, the original sample must take the values given in the MCR; for numerical attributes, such as $x_2$, we can take a representative value, such as the mean of the interval (1.5, 2). Thus, the reconstructed sample is [0.55, 1.75, GREEN, YES]. Note that for numerical values there are many alternative ways to do the reconstruction, such as taking the median, max, or min, or even calculating histograms.
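The choice of representative value can be made a parameter. The sketch below is a hedged illustration: the dictionary layout of the MCR, the `bounds` argument (assumed per-attribute ranges used to clip unbounded intervals), and the interval (0.5, 0.6) for the first attribute are assumptions, the last one chosen only so that midpoints give the sample quoted in the text.

```python
from typing import Dict, List, Set, Tuple, Union

Constraint = Union[Tuple[float, float], Set[str]]


def representative(low: float, high: float, strategy: str = "mid") -> float:
    """Pick one value from the interval [low, high]."""
    if strategy == "mid":
        return (low + high) / 2.0
    if strategy == "min":
        return low
    if strategy == "max":
        return high
    raise ValueError(f"unknown strategy: {strategy}")


def reconstruct(mcr: Dict[int, Constraint],
                bounds: Dict[int, Tuple[float, float]],
                strategy: str = "mid") -> List[Union[float, str]]:
    """Turn an MCR into one concrete sample, attribute by attribute."""
    sample: List[Union[float, str]] = []
    for attr in sorted(mcr):
        constraint = mcr[attr]
        if isinstance(constraint, set):
            # Categorical: take the allowed value (first in sorted order if several remain).
            sample.append(sorted(constraint)[0])
        else:
            # Numerical: clip to the assumed attribute range, then pick a representative.
            low = max(constraint[0], bounds[attr][0])
            high = min(constraint[1], bounds[attr][1])
            sample.append(representative(low, high, strategy))
    return sample


# Hypothetical MCR and attribute ranges.
mcr = {0: (0.5, 0.6), 1: (1.5, 2.0), 2: {"GREEN"}, 3: {"YES"}}
bounds = {0: (0.0, 10.0), 1: (0.0, 10.0)}

print(reconstruct(mcr, bounds))          # midpoints of the numerical intervals
print(reconstruct(mcr, bounds, "min"))   # [0.5, 1.5, 'GREEN', 'YES']
```

With midpoints, the first two entries come out as 0.55 (up to floating-point rounding) and 1.75.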
Given the above description, we now summarize the backward decoding of eForest. Concretely, given a trained forest with $T$ trees along with the forward encoding of a particular instance, the backward decoding first locates the individual leaf node via each element of the encoding, and then obtains the decision rules for the corresponding decision paths. Then, by calculating the MCR, we get a reconstruction from the encoding back to a sample in the input region. A concrete algorithm is shown in Algorithm 3.
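The decoding steps summarized above (leaf index, then path, then rule, then MCR, then sample) can be sketched end to end for numerical attributes. This is an illustrative sketch rather than the paper's Algorithm 3: the array-based tree layout in `ArrayTree`, the use of node ids as leaf indices, the `bounds` argument, and the midpoint choice are all assumptions made here.

```python
import math
from typing import Dict, List, Sequence, Tuple


class ArrayTree:
    """Binary tree stored in parallel arrays; left[i] == -1 marks node i as a leaf.

    Internal node i tests x[feature[i]] >= threshold[i]; true goes to right[i].
    """

    def __init__(self, feature: Sequence[int], threshold: Sequence[float],
                 left: Sequence[int], right: Sequence[int]) -> None:
        self.feature = list(feature)
        self.threshold = list(threshold)
        # child node -> (parent node, whether the child is the "true" branch)
        self.parent: Dict[int, Tuple[int, bool]] = {}
        for node, (l, r) in enumerate(zip(left, right)):
            if l != -1:
                self.parent[l] = (node, False)
                self.parent[r] = (node, True)

    def path_rule(self, leaf: int) -> Dict[int, Tuple[float, float]]:
        """Walk from a leaf up to the root, collecting one interval per tested attribute."""
        rule: Dict[int, Tuple[float, float]] = {}
        node = leaf
        while node in self.parent:
            node, went_right = self.parent[node]
            attr, thr = self.feature[node], self.threshold[node]
            low, high = rule.get(attr, (-math.inf, math.inf))
            rule[attr] = (max(low, thr), high) if went_right else (low, min(high, thr))
        return rule


def decode(forest: List[ArrayTree], code: Sequence[int],
           bounds: Sequence[Tuple[float, float]]) -> List[float]:
    """Intersect the path rules of all trees (the MCR) and return interval midpoints."""
    box = [list(b) for b in bounds]
    for tree, leaf in zip(forest, code):
        for attr, (low, high) in tree.path_rule(leaf).items():
            box[attr][0] = max(box[attr][0], low)
            box[attr][1] = min(box[attr][1], high)
    return [(low + high) / 2.0 for low, high in box]


# Two small hand-built trees over two numerical attributes (illustrative values).
t1 = ArrayTree(feature=[0, -1, 1, -1, -1], threshold=[1.0, 0.0, 2.0, 0.0, 0.0],
               left=[1, -1, 3, -1, -1], right=[2, -1, 4, -1, -1])
t2 = ArrayTree(feature=[1, -1, -1], threshold=[0.5, 0.0, 0.0],
               left=[1, -1, -1], right=[2, -1, -1])

print(decode([t1, t2], code=[3, 2], bounds=[(0.0, 4.0), (0.0, 4.0)]))  # [2.5, 1.25]
```

Here leaf 3 of the first tree contributes `x[0] >= 1.0` and `x[1] < 2.0`, leaf 2 of the second contributes `x[1] >= 0.5`, so the MCR is the box [1.0, 4.0] by [0.5, 2.0].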
By enabling the eForest to conduct the forward encoding and backward decoding operations, autoencoding tasks can thus be realized. In addition, although beyond the scope of this paper, the eForest model might give some insight on a theoretical treatment for the representation learning ability for tree ensemble models, as well as helping to design new models for deep forest.
4.1 Image Reconstruction
We evaluate the performance of eForest in both supervised and unsupervised settings. In this implementation, we use Random Forest (Breiman, 2001) to construct the supervised forest and the completely-random forest (Zhou and Feng, 2017) to construct the unsupervised forest. Notice that other decision tree ensemble construction methods can also be used for this purpose. Concretely, for supervised eForest, each non-terminal node randomly selects a subset of the attributes in the input space and picks the best possible split by information gain; for unsupervised eForest, each non-terminal node randomly picks one attribute and makes a random split. In our experiments we simply grow the trees to pure leaves, or terminate when there are only two instances in a node. We evaluate eForests containing 500 trees and 1,000 trees. Note that an eForest with $T$ trees re-represents the input instance as a $T$-dimensional vector.
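The unsupervised construction described above can be sketched as follows. This is one plausible reading, not the authors' implementation: drawing the threshold uniformly between the observed minimum and maximum, skipping attributes that are constant within a node, and the nested-dictionary tree format are assumptions made here.

```python
import random
from typing import List, Sequence


def grow_random_tree(points: Sequence[Sequence[float]], rng: random.Random,
                     min_size: int = 2) -> dict:
    """Grow a completely random tree: random attribute, random split point."""
    if len(points) <= min_size:
        return {"leaf": True, "size": len(points)}
    n_attr = len(points[0])
    # Assumption: only attributes that still vary in this node are candidates.
    splittable = [a for a in range(n_attr)
                  if min(p[a] for p in points) < max(p[a] for p in points)]
    if not splittable:
        return {"leaf": True, "size": len(points)}
    attr = rng.choice(splittable)
    low = min(p[attr] for p in points)
    high = max(p[attr] for p in points)
    threshold = rng.uniform(low, high)
    if threshold <= low:
        threshold = high  # guard so that neither side of the split is empty
    left = [p for p in points if p[attr] < threshold]
    right = [p for p in points if p[attr] >= threshold]
    return {"leaf": False, "attr": attr, "threshold": threshold,
            "left": grow_random_tree(left, rng, min_size),
            "right": grow_random_tree(right, rng, min_size)}


def leaf_sizes(tree: dict) -> List[int]:
    """Collect the number of training points that ended up in each leaf."""
    if tree["leaf"]:
        return [tree["size"]]
    return leaf_sizes(tree["left"]) + leaf_sizes(tree["right"])


rng = random.Random(0)
data = [(rng.random(), rng.random()) for _ in range(50)]
tree = grow_random_tree(data, rng)
sizes = leaf_sizes(tree)
print(len(sizes), max(sizes))
```

No labels are used anywhere, so the same routine applies to any unlabeled data; a supervised variant would replace the random threshold with a search for the best split by information gain.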
Since auto-encoders, especially DNN-based auto-encoders, are mainly designed for image tasks, in this section we first run experiments on image data. We use the MNIST dataset (LeCun et al., 1998), which consists of 60,000 gray scale 28×28 images (a 784-dimensional vector per sample) for training and 10,000 for testing. We also use the CIFAR-10 dataset (Krizhevsky, 2009), a more complex dataset that consists of 50,000 colored 32×32 images (so each image is a 1,024-dimensional vector per channel) for training and 10,000 colored images for testing. For colored images, eForest processes each channel separately to save memory.
MLP based AutoEncoders (MLP-AEs) and a convolutional neural network based auto-encoder (CNN-AE) are used for comparison. For MLP-AEs, we follow the suggestions in (Bengio et al., 2007) and use two architectures, with 500-dimensional and 1000-dimensional inner representation, respectively. Concretely, the MLP-AE for MNIST is and the for MNIST is . Likewise, the MLP-AE for CIFAR-10 is and the for CIFAR-10 is
. For CNN-AE, we follow the implementation in the Keras documentation (https://blog.keras.io/building-autoencoders-in-keras.html) with the following architecture: the encoder consists of a conv-layer with 16 (3×3) kernels followed by 2 conv-layers with 8 (3×3) kernels, and each conv-layer is followed by a 2×2 max-pooling layer. The decoder has the same structure as the encoder, except that it uses up-sampling layers instead of pooling layers (to map the data back to its original input space). ReLUs are used for activations and log loss is used as the training objective. During training, dropout is set to 0.25 per layer.
Experimental results are summarized in Table 1. For DNN auto-encoders, cross-validation is used for hyper-parameter tuning; for eForest, we simply take the min value of the interval defined by the corresponding MCR in the last sampling step of decoding.
It can be seen that eForest achieves the best performance. Some reconstructed samples on the test set are shown in Figure 2. The result is disappointing for the CNN based auto-encoder on the CIFAR-10 dataset, given that we use the architecture recommended for image auto-encoders by the Keras documentation and have carefully tuned the other hyper-parameters via cross-validation. We believe that the DNN auto-encoders could perform better with further tuning; nevertheless, the eForest auto-encoder works well without careful parameter tuning.
It is worth noting that the unsupervised eForest performs better than the supervised eForest given the same number of trees. Note that each decision tree path corresponds to a rule, and a longer rule defines a tighter MCR. We conjecture that a tighter MCR leads to a more accurate reconstruction, so a forest with deeper trees may perform better. To check this, we measured the maximum depth and the average depth over all trees on the MNIST dataset, as summarized in Table 2. The results support the conjecture: the unsupervised eForest indeed has a larger average depth.
Table 2: Max depth and Ave. depth of the trees on the MNIST dataset.
4.2 Text Reconstruction
In addition to image tasks, other tasks may also require auto-encoders. Thus, we study the performance of eForest for text reconstruction. Note that DNN auto-encoders are mainly designed for images; to apply them to text, some additional mechanism such as word2vec embedding (Mikolov et al., 2013) is required for pre-processing. In our experiments, we want to study the performance of auto-encoding directly on text data.
Concretely, we used the IMDB dataset (Maas et al., 2011), which contains 25,000 documents for training and 25,000 documents for testing. Each document was stored as a 5,000-dimensional vector via tf/idf transformation. We used exactly the same configuration of eForests as for image data. Cosine distance is used as the evaluation metric, which is the standard metric for measuring the similarity between documents represented by tf/idf vectors. The lower the cosine distance, the better. The results are summarized in Table 3.
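For reference, cosine distance between an original tf/idf vector and its reconstruction can be computed as below. This is a plain-Python sketch; returning 1.0 when either vector is all zeros is a convention assumed here, since the quantity is otherwise undefined.

```python
import math
from typing import Sequence


def cosine_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """1 - cos(angle between u and v); 0 means identical direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0  # assumed convention for an all-zero vector
    return 1.0 - dot / (norm_u * norm_v)


original = [0.0, 0.3, 0.0, 0.7]        # toy tf/idf vector
reconstructed = [0.0, 0.6, 0.0, 1.4]   # same direction, different scale
print(cosine_distance(original, reconstructed))
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))  # 1.0 for orthogonal vectors
```

Because the measure ignores vector length, a reconstruction that rescales every term weight by the same factor still scores close to zero.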
It should be highlighted that CNN based auto-encoders cannot be applied to this kind of input data at all, and MLP based auto-encoders are barely useful. After extensive cross-validation for parameter search, the best MLP structure we could obtain reaches a cosine distance of 0.512, more than two hundred times worse than eForest.
The above results show that eForest can also be applied to text data with high performance. In addition, notice that with a much smaller representation (an eForest of 500 trees trained in an unsupervised fashion, i.e. a 500-dimensional code for a 5,000-dimensional input), eForest can already reconstruct the original input with high accuracy. This is a promising result that could be further exploited for data compression.
4.3 Computation Efficiency
As is common for tree ensemble models, eForest is inherently suited to parallel implementation. We implemented eForest on a single KNL-7250 (part of the Intel Xeon Phi many-core product family) and achieved a 67.7× speedup for training 1,000 trees in an unsupervised setting, compared with a serial implementation. For comparison, we trained the corresponding MLP-AEs and CNN-AEs with the same configurations as in the previous sections on one Titan-X GPU; the training cost and the per-sample testing cost are summarized in Table 4.
From the above results, eForest is more than 100 times faster in training, but is slower than DNN based auto-encoders at encoding time. We hope that decoding can be sped up by further optimization in the future.
4.4 Damage Tolerable
There are cases in which a model is partially damaged for various reasons, such as memory or disk failure. The ability of a partially damaged model to keep functioning in such cases is one aspect of model robustness. The eForest approach to auto-encoding has this property by nature, since we can still estimate the MCR when only a subset of the trees in the forest is available.
In this section, we test damage tolerance empirically on the CIFAR-10 and MNIST datasets. Concretely, at test time we randomly drop 25%, 50% and 75% of the trees and measure the reconstruction error based on the pattern recovered using only the remaining trees. For comparison, we also randomly turned off 25%, 50% and 75% of the neurons in the MLP-AE, with a structure exactly the same as in the previous section. The performance results are illustrated in Figure 3.
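The reason decoding survives the loss of trees can be seen in a small simulation. The sketch below is a toy illustration, not the experimental setup: each tree's path rule is modeled as a random box around a known point `x`, and the helper names, the 200-tree forest, and the midpoint reconstruction are assumptions made here.

```python
import random
from typing import List, Sequence, Tuple

Box = List[Tuple[float, float]]  # one (low, high) interval per attribute


def intersect_boxes(boxes: Sequence[Box]) -> Box:
    """MCR of a set of per-tree boxes: the tightest interval per attribute."""
    n_attr = len(boxes[0])
    return [(max(b[a][0] for b in boxes), min(b[a][1] for b in boxes))
            for a in range(n_attr)]


def surviving_trees(boxes: Sequence[Box], drop_fraction: float,
                    rng: random.Random) -> List[Box]:
    """Simulate damage by keeping a random subset of the per-tree rules."""
    keep = max(1, round(len(boxes) * (1.0 - drop_fraction)))
    return rng.sample(list(boxes), keep)


def midpoint(box: Box) -> List[float]:
    return [(low + high) / 2.0 for low, high in box]


rng = random.Random(0)
x = [0.3, 0.7]  # the instance every path rule must contain
forest_boxes = [[(v - rng.random(), v + rng.random()) for v in x] for _ in range(200)]

full = intersect_boxes(forest_boxes)
for fraction in (0.25, 0.5, 0.75):
    damaged = intersect_boxes(surviving_trees(forest_boxes, fraction, rng))
    error = max(abs(m - v) for m, v in zip(midpoint(damaged), x))
    print(fraction, error)
```

Dropping trees can only loosen the MCR, never invalidate it: the box from any subset of trees still contains the original instance, so the reconstruction degrades gradually rather than failing outright.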
From the above results, the eForest approach is more damage-tolerable than an MLP-AE, and the unsupervised eForest is the most damage-tolerable of the models compared.
4.5 Model Reuse for eForest
In an open environment, the test data for encoding/decoding may come from a different distribution than the training data. In this section, we test the ability for model reuse: the goal is to train a model on one dataset and reuse it on another dataset without any modification or re-training. The ability for model reuse in this context is an important property for future machine learning developments (Zhou, 2016).
Concretely, we evaluate the ability for model reuse as follows. We trained an unsupervised and a supervised eForest on the CIFAR-10 dataset (converted and rescaled to 28×28 gray scale data), each consisting of 1,000 trees, and then used exactly these models to encode/decode data from the MNIST test dataset. Likewise, we also trained eForests consisting of 1,000 trees on the MNIST dataset and directly tested the encoding/decoding performance on the Omniglot dataset (Lake et al., 2015). For a fair comparison, we trained a CNN-AE and an MLP-AE on the same datasets without fine-tuning. The architectures of the MLP/CNN-AEs and the training procedures are the same as in the previous sections. MSE is used for performance evaluation.
Table 5: Model reuse results. Left: models trained on CIFAR-10 and tested on MNIST. Right: models trained on MNIST and tested on Omniglot.
Some randomly picked reconstructed samples are presented in Fig. 4, and the numerical evaluation on the whole test set is presented in Table 5. It can be seen that eForest outperforms the DNN approach by a factor of more than 100. In particular, an eForest trained on CIFAR-10 performs better at encoding/decoding on the MNIST dataset, even though these two datasets are quite different. This shows the generalization ability of eForest in terms of model reuse.
In this paper, we propose the EncoderForest (abbrv. eForest), the first tree ensemble based auto-encoder model, by devising an effective procedure that enables forests to reconstruct the original pattern using the Maximal-Compatible Rule (MCR) defined by the decision paths of the trees. Experiments demonstrate its good performance in terms of accuracy and speed, as well as its damage tolerance and model reusability. In particular, on text data, the model is still able to reconstruct the original data with high accuracy using a representation much smaller than the input. Another advantage of eForest is that it can be applied to symbolic or mixed attributes directly, without transforming the symbolic attributes into numerical ones; this matters because the transformation generally either loses information or introduces additional bias.
Note that supervised and unsupervised eForests are actually the two ingredients utilized simultaneously in each level of the deep forest constructed by gcForest. This work might offer some additional understanding of gcForest (Zhou and Feng, 2017). Constructing a deep eForest model is also an interesting future issue.
- Bengio et al. (2013a) Bengio, Y., Courville, A., Vincent, P., 2013a. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence 35 (8), 1798–1828.
- Bengio et al. (2007) Bengio, Y., Lamblin, P., Popovici, D., Larochelle, H., 2007. Greedy layer-wise training of deep networks. In: Advances in neural information processing systems 20. pp. 153–160.
- Bengio et al. (2013b) Bengio, Y., Yao, L., Alain, G., Vincent, P., 2013b. Generalized denoising auto-encoders as generative models. In: Advances in Neural Information Processing Systems 26. pp. 899–907.
- Bourlard and Kamp (1988)
- Breiman (1996) Breiman, L., 1996. Bagging predictors. Machine Learning 24 (2), 123–140.
- Breiman (2001) Breiman, L., 2001. Random forests. Machine Learning 45 (1), 5–32.
- Chen and Guestrin (2016) Chen, T.-Q., Guestrin, C., 2016. XGBoost: A scalable tree boosting system. In: Proceedings of the 22nd ACM International Conference on Knowledge Discovery and Data Mining. pp. 785–794.
- Fernández-Delgado et al. (2014) Fernández-Delgado, M., Cernadas, E., Barro, S., Amorim, D., 2014. Do we need hundreds of classifiers to solve real world classification problems? Journal of Machine Learning Research 15, 3133–3181.
- Freund and Schapire (1999) Freund, Y., Schapire, R. E., 1999. A short introduction to boosting. Journal of Japanese Society for Artificial Intelligence 14 (5), 771–780.
- Friedman (2001) Friedman, J. H., 2001. Greedy function approximation: A gradient boosting machine. The Annals of Statistics 29 (5), 1189–1232.
- Goodfellow et al. (2016) Goodfellow, I., Bengio, Y., Courville, A., 2016. Deep Learning. MIT Press, Cambridge, MA.
- Hinton and Ranzato (2010) Hinton, G., Ranzato, M., 2010. Modeling pixel means and covariances using factorized third-order Boltzmann machines. In: Proceedings of the 2010 IEEE Conference on Computer Vision and Pattern Recognition. pp. 2551–2558.
- Hinton and Salakhutdinov (2006) Hinton, G., Salakhutdinov, R., 2006. Reducing the dimensionality of data with neural networks. Science 313 (5786), 504–507.
- Hinton et al. (2006) Hinton, G. E., Osindero, S., Teh, Y.-W., 2006. A fast learning algorithm for deep belief nets. Neural Computation 18 (7), 1527–1554.
- Kingma and Welling (2013) Kingma, D.-P., Welling, M., 2013. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114.
- Krizhevsky (2009) Krizhevsky, A., 2009. Learning multiple layers of features from tiny images. Tech. rep., University of Toronto.
- Lake et al. (2015) Lake, B. M., Salakhutdinov, R., Tenenbaum, J. B., 2015. Human-level concept learning through probabilistic program induction. Science 350 (6266), 1332–1338.
- LeCun et al. (1998) LeCun, Y., Bottou, L., Bengio, Y., Haffner, P., 1998. Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), 2278–2324.
- Liou et al. (2008) Liou, C.-Y., Huang, J.-C., Yang, W.-C., 2008. Modeling word perception using the elman network. Neurocomputing 71 (16), 3150–3157.
- Liu et al. (2008) Liu, F. T., Ting, K. M., Zhou, Z.-H., 2008. Isolation forest. In: Proceedings of the 8th IEEE International Conference on Data Mining. pp. 413–422.
- Maas et al. (2011) Maas, A. L., Daly, R. E., Pham, P. T., Huang, D., Ng, A. Y., Potts, C., 2011. Learning word vectors for sentiment analysis. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics. pp. 142–150.
- Masci et al. (2011) Masci, J., Meier, U., Cireşan, D., Schmidhuber, J., 2011. Stacked convolutional auto-encoders for hierarchical feature extraction. In: Proceedings of International Conference on Artificial Neural Networks. pp. 52–59.
- Mikolov et al. (2013) Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., Dean, J., 2013. Distributed representations of words and phrases and their compositionality. In: Advances in Neural Information Processing Systems 26. pp. 3111–3119.
- Mu et al. (in press) Mu, X., Ting, K. M., Zhou, Z.-H., in press. Classification under streaming emerging new classes: A solution using completely-random trees. IEEE Trans. Knowledge and Data Engineering.
- Ruslan et al. (2007) Ruslan, S., Mnih, A., Hinton, G., 2007. Restricted Boltzmann machines for collaborative filtering. In: Proceedings of the 24th International Conference on Machine Learning. pp. 791–798.
- Vincent et al. (2010) Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., Manzagol, P.-A., 2010. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of Machine Learning Research 11, 3371–3408.
- Willmore and Tolhurst (2001) Willmore, B., Tolhurst, D., 2001. Characterizing the sparseness of neural codes. Network: Computation in Neural Systems 12 (3), 255–270.
- Zhou (2012) Zhou, Z.-H., 2012. Ensemble Methods: Foundations and Algorithms. CRC, Boca Raton, FL.
- Zhou (2016) Zhou, Z.-H., 2016. Learnware: on the future of machine learning. Frontiers of Computer Science 10 (4), 589–590.
- Zhou and Feng (2017) Zhou, Z.-H., Feng, J., 2017. Deep forest: Towards an alternative to deep neural networks. In: Proceedings of the 26th International Joint Conference on Artificial Intelligence. pp. 3553–3559.