Research in deep learning has made significant progress in recent years, especially for convolutional neural networks (CNNs), which have produced many classic models (e.g. AlexNet) [17, 9, 11, 25, 23, 1]. In contrast to traditional machine learning, CNN-based models do not need hand-designed features; instead, they extract features directly from data thanks to their strong feature-learning ability. CNNs have shown great superiority in image classification, object detection and voice recognition in recent years [17, 10, 7].
Since the emergence of residual learning, there has been a consensus that deeper networks perform better. The Residual Network (ResNet) solves the gradient dispersion problem, which has led researchers to focus on deepening networks [15, 16, 19, 27, 24, 33, 31]. Although deeper networks have shown superiority in various tasks [19, 33, 31], a major problem that comes with them is model bloat. The tradeoff between performance and model size remains a question worth considering.
Besides expanding the depth of the network or increasing the number of feature maps [9, 11], another way to give a network better capability is to extend it horizontally (e.g. the Inception Model). Each layer of a Horizontal Expansion Network (HEN) consists of multi-scale convolution filters. HENs are better suited to multi-scale problems such as small object detection, image inpainting and super-resolution reconstruction (SR), because these tasks require the model to capture multi-scale information. Compared to traditional methods, which rescale the data to get multi-scale information, a HEN obtains multi-scale information by using multi-scale convolution operators. However, to the best of our knowledge, research on HENs is still limited.
In this paper, we propose a new HEN named NeuroTreeNet (NTN), inspired by Random Forest (RF) and the Inception Model (a typical HEN). The core of NTN is its tree structure, which means other networks can be converted to an NTN by adding a tree structure to their architecture. Our networks achieve better results in super-resolution reconstruction than other lightweight networks, as illustrated in Fig. 1. The main contributions of our work are summarized as follows:
(1) Tree structure. We propose a new method for exploring horizontal expansion networks that incorporates the idea of a tree. This special structure makes full use of multi-scale information.
(2) Feature fusion and shared features. The special structure of the tree makes feature fusion more flexible than in other HENs. As Fig. 2 (c) and (d) show, each branch of the tree can be thought of as an independent neural network, which means we can adjust the richness of features in the final feature fusion by controlling the number of nodes in each layer. Another advantage is that child nodes share the features of the root node, which is a good strategy for reducing parameters.
(3) Research on the attributes of NTN. In this paper, we explore how the number of nodes and the location of the tree in the network affect performance. We also define a variable that quantitatively describes the contribution of each branch to the tree model. In addition, we discuss two new combinations of the tree model.
2 Related Work
2.1 Horizontal expansion network
In recent years, research on neural networks has mainly focused on expanding the depth of models and the number of features [9, 11, 15, 16, 19, 27, 24, 33, 31], while neglecting their breadth. Up to now, only the Inception model is truly a HEN: each of its layers consists of multiple convolution filters carrying multi-scale information, as shown in Fig. 2 (a). To address the parameter proliferation caused by expansion, several subsequent Inception variants proposed corresponding solutions [13, 26, 24]. Although GoogLeNet, which consists of multiple Inception modules, has achieved good results in fields such as image classification and object recognition, the model still has issues such as scale solidification and inflexibility. In addition, the relationship between network width and model performance of GoogLeNet has not been investigated.
2.2 Feature fusion methods
For some special tasks (e.g. small object detection), the use of multi-scale information is critical to the result. Feature fusion is therefore an unavoidable step when using multi-scale features.
The main feature fusion methods in deep learning fall into two categories: multi-network feature fusion [3, 25, 14] and single-network feature fusion [11, 3]. Multi-network feature fusion is always accompanied by multi-scale features, as in the Inception Model or a set of multiple independent networks. Compared to multi-network feature fusion, single-network feature fusion is mainly used to fuse low-dimensional and high-dimensional features and includes only single-scale information, as in the Dense Network (DenseNet).
As shown in Fig. 2 (b), DenseNet stacks the features of earlier layers onto the following layers along the depth dimension, and the stacked features are then mixed by convolution filters. This architecture takes full advantage of each layer's features (both low-level and high-level) and brings significant performance improvements. However, DenseNet has two major flaws: (1) feature redundancy, because earlier-layer features are repeatedly stacked onto the following layers; (2) it is single-scale, which is not enough to deal with multi-scale tasks.
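To make the redundancy concern concrete, the following minimal sketch counts how many input channels each layer of a densely connected block receives when every earlier output is concatenated along the channel axis. The function name and the example values (16 input channels, growth rate 12, 4 layers) are illustrative assumptions, not settings taken from this paper.

```python
def dense_block_input_channels(in_channels, growth_rate, num_layers):
    """Return the number of input channels seen by each layer of a dense block.

    Layer i (0-based) receives the block input plus the outputs of all i
    earlier layers, each of which contributes `growth_rate` channels.
    """
    return [in_channels + i * growth_rate for i in range(num_layers)]


# Hypothetical example: 16 input channels, growth rate 12, 4 layers.
print(dense_block_input_channels(16, 12, 4))  # [16, 28, 40, 52]
```

The input width grows linearly with depth, so the same early features are re-read by every later layer.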
2.3 Deep learning for image super resolution
Deep learning methods have gradually come to dominate the field of SR due to their powerful feature extraction ability [5, 15, 16, 19, 27, 29, 6].
Since SRCNN outperformed other machine learning methods in SR, more innovative networks have been proposed to improve the results of super-resolution reconstruction. Compared to SRCNN, with only three layers and 64 feature maps in each layer, subsequent networks either extended the depth (e.g. VDSR) or raised the number of feature maps (e.g. DRCN), or even both (e.g. EDSR). The reconstruction results have indeed improved to a certain extent, which also confirms that deeper networks or networks with more feature maps perform better. However, none of these models use multi-scale information or consider horizontal expansion.
We apply our NTN to SR. VDSR, a single-scale network, was chosen as our baseline due to its simple structure and small amount of computation. The use of multi-scale convolution and feature fusion gives our network a stronger feature extraction ability. Compared with VDSR, our model achieves better results at all scales with fewer parameters.
3 Neural Tree Network
We propose the NTN, which introduces the idea of Random Forest: each neural network can be regarded as a tree due to the existence of hidden layer units. Some common problems in HENs, such as scale solidification and inflexibility, are solved by this special structure.
3.1 Tree structure
The tree is a well-known data storage structure, and it is also the core module of RF and Decision Tree (DT). The tree architecture we propose resembles a binary tree, in which each parent node has no more than two child nodes. Given the limit on the number of parameters, other tree structures are not explored in this paper. We define the convolutional size of every node as follows:
where and are left child node convolutional size and right child node convolution size at layer N respectively, is a constant number which we set to 2 in this work. These definitions ensure that each branch of NTN is unique, making NTN more diverse.
Unlike single-scale networks, where each layer performs one convolution operation, our NTN performs a convolution operation at each node. As shown in Fig. 3, an NTN can be regarded as a set of deep neural networks (DNNs):
where means the number of the leaf nodes. An NTN can integrate DNNs of different levels for tasks of different difficulty. Meanwhile, training an NTN is equivalent to training several DNNs at the same time, which can greatly improve efficiency in some tasks.
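The view of an NTN as a set of DNNs can be illustrated by enumerating the root-to-leaf paths of the tree, since each path corresponds to one branch network. The sketch below assumes a complete binary tree whose nodes are numbered in heap order; both the numbering and the function name are assumptions made for illustration.

```python
def root_to_leaf_paths(num_nodes):
    """Enumerate root-to-leaf paths of a complete binary tree.

    Nodes are numbered 1..num_nodes in heap order (an assumption made for
    this sketch): the children of node i are 2*i and 2*i + 1.
    """
    paths = []

    def walk(node, prefix):
        prefix = prefix + [node]
        children = [c for c in (2 * node, 2 * node + 1) if c <= num_nodes]
        if not children:  # a leaf closes one branch network
            paths.append(prefix)
            return
        for child in children:
            walk(child, prefix)

    walk(1, [])
    return paths


# A seven-node tree yields four branches, one per leaf.
for branch in root_to_leaf_paths(7):
    print(branch)
```

Under this assumption the number of branch networks equals the number of leaf nodes, and every branch passes through the root.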
Because there are multiple branches, a final feature fusion across all scales is unavoidable. Unlike feature fusion in a single-scale network, the feature fusion of NTN is based on a multi-scale network, which greatly increases the richness of features. Meanwhile, NTN obtains different degrees of multi-scale features by adjusting the convolutional size of nodes, which is more flexible than the Inception Model. Another advantage is that each branch of NTN learns fewer features, yet the network performs better after the features of all branches are fused.
In addition, NTN has a unique advantage, feature sharing, which addresses parameter explosion and feature redundancy. The feature sharing in NTN is defined as follows:
where and are input features of left nodes and right nodes at layer respectively, means the weight of convolutional filters.
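The saving from feature sharing can be illustrated by simple node counting. The sketch below assumes a complete binary tree and compares the number of nodes stored once and reused by all branches with the number needed if every branch kept its own copy of each node. It counts nodes only; actual parameter counts also depend on each node's convolution size, which is not modelled here.

```python
def shared_vs_independent_nodes(depth):
    """Compare node counts for a complete binary tree of `depth` levels.

    shared:      every node is stored once and reused by all branches below it.
    independent: every root-to-leaf branch keeps its own copy of each node.
    """
    leaves = 2 ** (depth - 1)
    shared = 2 ** depth - 1
    independent = depth * leaves
    return shared, independent


print(shared_vs_independent_nodes(3))  # (7, 12)
```

For a three-level tree, sharing needs 7 nodes where four independent three-layer networks would need 12.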
To verify the performance of our models, we extend the tree structure to NTN and apply it in SR.
3.2 Branch contribution
As mentioned in previous sections, each branch of the NTN can be regarded as an independent network. The branch networks share features through nodes, and each network extracts different information from the input data. Thus, each branch network contributes differently to the entire network.
The contribution of each branch of NTN to feature fusion was measured through three indicators: PSNR value, parameter quantity, and the improvement in receptive field brought by the branch. Furthermore, to capture the quantitative relationship among these indicators, we use the contribution index (CI), defined as follows.
where PVI, RFI, PQI are the variables that describe the PSNR value, receptive field, and parameter quantity of each branch network, respectively. , , are the whole NTN's PSNR value, receptive field, parameter quantity, respectively. , , are each branch of NTN's PSNR value, receptive field, parameter quantity and we define  as a rounding down operation. is a parameter for weighting PVI. We believe that when the performance of the branch network is close to that of the NTN, further improvement is more difficult, so we set parameter as the reward . is a hyper-parameter for weighting RFI; because we consider the PSNR value more important in the evaluation task, we set value to 0.5.
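Since the receptive field of a branch is one of the indicators, a short sketch of how it can be computed may help. It uses the standard rule for a stack of stride-1, undilated convolutions, where a layer with kernel size k widens the field by k - 1 pixels. The kernel sizes in the example are placeholders, not the sizes used in this paper.

```python
def receptive_field(kernel_sizes):
    """Receptive field of a stack of stride-1, undilated convolutions.

    Each layer with kernel size k widens the field by (k - 1) pixels.
    """
    field = 1
    for k in kernel_sizes:
        field += k - 1
    return field


# Hypothetical branches; the kernel sizes are placeholders.
print(receptive_field([3, 3, 3]))  # 7
print(receptive_field([3, 5, 7]))  # 13
```

A branch that passes through larger kernels therefore sees a wider input region for the same depth.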
3.3 Network architecture for SR
As shown in Fig. 4, the NTN for SR has seven nodes in its tree structure and five layers in total. Our model consists of one tree structure and two convolutional layers. Other models can also be converted to an NTN by adding a tree structure. The skip connection between the input image and the output image is preserved, enabling the network to converge faster.
We have also investigated the relationship between the number of nodes and performance, as well as the impact of the position of the tree structure in NTN; see Section 4.3 for details.
3.4 Variant of tree
Trees take many forms, mainly determined by their branches and nodes, such as binary and non-binary trees. When the tree structure is applied in deep learning, the forms can be even more varied (owing to the diversity of convolutional sizes). In this paper, we mainly discuss two variants of the tree: the reverse tree and the encoder-decoder tree.
Reverse Tree Network (RTN). In the ordinary tree, the number of nodes in each layer is greater than or equal to the number of nodes in the previous layer; the reverse tree is just the opposite.
As Fig. 2 (d) shows, the reverse tree is equivalent to rotating the ordinary tree by 180 degrees. Two parent nodes share a child node, which means each child node obtains the sum of the features from two parent nodes. Unlike the ordinary tree, which mixes features only once, the reverse tree mixes features in each layer. Accordingly, the reverse tree has slightly more parameters than the ordinary tree.
We found that the reverse tree did not show superiority over the ordinary tree in SR. Therefore, this paper mainly discusses the ordinary tree. However, the application of the reverse tree in other fields is still worth exploring.
Encoder-decoder Tree Network (EDTN). The encoder-decoder tree combines the idea of the auto-encoder with the tree structure. As Fig. 2 (e) shows, the encoder-decoder tree consists of one ordinary tree and one reverse tree. Specifically, the ordinary tree is the encoder and the reverse tree is the decoder. As with an auto-encoder, the ordinary tree encodes the input into hidden-layer features and the reverse tree decodes these features into the output.
4 Experimental Results
and ImageNet dataset.
To verify the effectiveness of our model quickly, Set91 was used as the training set. Training images were split into 41 by 41 patches with stride 41, and data augmentation (rotation and flip) was used to increase the number of samples. In addition, as in VDSR, we include images of all scales in the training set to accelerate training.
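The patch split described above can be sketched as follows. With a stride equal to the patch size the patches do not overlap, and any remainder at the right or bottom edge is dropped. The function name and the 100 x 130 example image are assumptions for illustration; the handling of image borders in the actual experiments is not stated in the text.

```python
def patch_origins(height, width, patch=41, stride=41):
    """Top-left corners of all full patches that fit inside the image."""
    rows = range(0, height - patch + 1, stride)
    cols = range(0, width - patch + 1, stride)
    return [(r, c) for r in rows for c in cols]


# Hypothetical 100 x 130 training image: 2 rows x 3 columns of patches.
origins = patch_origins(100, 130)
print(len(origins))  # 6
print(origins[:3])   # [(0, 0), (0, 41), (0, 82)]
```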
4.2 Implementation details
In our NTN, the number of filters is set to 32 in the tree structure and 64 in the following convolution layers, and the number of filters in the last layer is 1. All filter sizes are set to 3 except in the tree structure. In RTN, the number of filters in all convolutional layers is 64, except for the first layer (32) and the last layer (1). All convolutional layers are followed by rectified linear units (ReLU). The weight initialization follows He et al.
The hyperparameter settings for training are as follows: the learning rate is initialized to 0.1 for all layers and decreased by a factor of 10 every 20 epochs, for a total of 80 epochs. For optimization, we use SGD with momentum set to 0.9 and weight decay set to 1e-4. All experiments were conducted using PyTorch on NVIDIA TITAN X GPUs.
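The step schedule described above can be written as a one-line function. This is a minimal sketch that assumes epochs are counted from 0; the function and parameter names are illustrative.

```python
def learning_rate(epoch, base_lr=0.1, step=20, gamma=0.1):
    """Step decay: multiply the rate by `gamma` once every `step` epochs."""
    return base_lr * gamma ** (epoch // step)


# Rates at the start and end of each 20-epoch stage of an 80-epoch run.
for epoch in (0, 19, 20, 39, 40, 59, 60, 79):
    print(epoch, learning_rate(epoch))
```

In PyTorch the same behaviour is provided by `torch.optim.lr_scheduler.StepLR` with `step_size=20` and `gamma=0.1`; whether the original experiments used that scheduler is not stated.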
4.3 Model analysis
To demonstrate the effectiveness of our tree models, we constructed a series of lightweight NTNs and derivative networks with five layers. VDSR reduced to five layers (VDSR_5), with no other structural changes, was used as the corresponding baseline. In addition, another network based on the Inception model was constructed for SR, containing two Inception modules and five layers in total. In this experiment, seven models were evaluated: VDSR_5, Inception, NTN_32 (32 feature maps), NTN_32_D (dilated convolution), NTN_16 (16 feature maps), NTN_16_D (dilated convolution), and EDTN.
Compared with the traditional single-scale neural network represented by VDSR_5, the multi-scale information brought by HENs effectively improves network performance (Table 1). The Inception model (a typical HEN) outperforms VDSR_5 at all scales on four test datasets. Benefiting from the richer feature information brought by the special structure, the NTN and its derived networks achieve better performance than Inception. For ×2 enlargement on Set5, NTN_32 achieves 36.97 dB, which is 0.25 dB better than Inception. For ×3 and ×4 enlargement, NTN_32 is also better than Inception, by 0.21 dB and 0.19 dB respectively. Among all models, EDTN achieves the best performance: for ×2, ×3 and ×4 enlargement, it is better than NTN_32 by 0.11 dB, 0.18 dB and 0.14 dB respectively. The results show that the richer multi-scale information brought by our tree model and its more effective form of feature utilization improve the performance of the network.
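For readers unfamiliar with the metric, the dB figures above are PSNR values. A minimal sketch of the standard definition, PSNR = 10 log10(peak^2 / MSE), is given below. It assumes 8-bit images passed as flat sequences of pixel values; details such as the colour channel evaluated or any border cropping are not specified in the text and are not modelled here.

```python
import math


def psnr(reference, estimate, peak=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized images.

    Images are given as flat sequences of pixel values.
    """
    if len(reference) != len(estimate):
        raise ValueError("images must have the same number of pixels")
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(peak ** 2 / mse)


# Toy example with four pixels; real evaluations use full images.
print(psnr([52, 60, 61, 70], [50, 61, 63, 69]))
```

Because the scale is logarithmic, a gain of a few tenths of a dB corresponds to a reduction of several percent in mean squared error.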
Number of parameters. Compared to VDSR_5, NTN_32 has slightly more parameters. To reduce our model's parameters without changing the architecture, dilated convolution was introduced to replace the ordinary convolution in the tree structure (e.g. NTN_32_D). Reducing the feature maps in the tree structure (e.g. NTN_16) was also explored.
As Fig. 5 shows, NTN_32 reaches 33.09 dB, which is 0.04 dB better than NTN_32_D, and NTN_16 is 0.08 dB better than NTN_16_D. It is worth noting that although the dilated convolution does not change the receptive field or structure of NTN, it reduces performance to a certain extent. The results show that the parameter counts of NTN_32_D, NTN_16 and NTN_16_D are only 91%, 70% and 50% of that of VDSR_5, while their PSNR is higher than that of VDSR_5 by 0.32 dB, 0.26 dB and 0.18 dB respectively. Moreover, EDTN achieves the highest PSNR value, yet its parameter count is only 91% of that of NTN_32. These results demonstrate that our tree structure makes full use of multi-scale information to reduce parameters while improving performance.
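The reason dilated convolution can keep the receptive field while reducing parameters can be seen from a small calculation. A kernel of size k with dilation d spans k + (k - 1)(d - 1) pixels but still holds only k x k weights per channel pair. The channel counts in the example are chosen for illustration and the function names are assumptions.

```python
def effective_kernel_size(kernel, dilation):
    """Spatial extent covered by a dilated convolution kernel."""
    return kernel + (kernel - 1) * (dilation - 1)


def conv_weights(in_channels, out_channels, kernel):
    """Number of weights in a 2-D convolution layer, ignoring biases."""
    return in_channels * out_channels * kernel * kernel


# A 3x3 kernel with dilation 2 spans 5x5 pixels but keeps 3x3 weights.
print(effective_kernel_size(3, 2))  # 5
print(conv_weights(32, 32, 3))      # 9216
print(conv_weights(32, 32, 5))      # 25600
```

The dilated kernel samples the same region more sparsely, which is one plausible reason for the small PSNR drop reported for the dilated variants.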
To further compare the seven models at the feature level, the activation maps of NTN after feature fusion and the corresponding activation maps of VDSR_5 and Inception were extracted. Fig. 6 shows that the activation maps of VDSR_5 have distortions and missing details, while the introduction of multi-scale information alleviates the distortion to a certain extent, making the activation maps finer.
Nodes analysis. To further investigate the relationship between the number of nodes and the performance of the tree, thirteen NTNs were designed, with the number of tree nodes ranging from three to fifteen. Fig. 7 shows the architecture of all these models.
As illustrated in Fig. 8, the general trend is that performance improves as the number of nodes and parameters increases. However, we note that the improvement is mainly concentrated in the first five models. Subsequent models do not improve significantly despite the dramatic increase in the number of parameters. Thus, balancing model size and performance, the fifth model, which has seven nodes, is a good choice.
Tree position. An ordinary super-resolution reconstruction network consists of three parts: feature extraction (top layers), feature mapping (middle layers) and reconstruction (last layers). To explore the relationship between the position of the tree in NTN (i.e. in different parts of the network) and model performance, we placed the tree structure at the top, middle and bottom of an NTN with 8 layers in total for comparison.
As Fig. 9 shows, the performance difference between placing the tree structure in the middle and at the bottom of NTN is not obvious; for ×2 enlargement, they are better than placing the tree structure at the top of NTN by 0.13 dB and 0.09 dB respectively. Thus, we recommend placing the tree structure in the middle or at the bottom of the network for higher performance.
Branch contribution analysis. As described in Section 3.2, a complete NTN can be viewed as a collection of multiple branch networks, where different branch networks contribute to overall performance to different degrees. Therefore, we evaluate the contribution of each branch for NTN_32.
As Table 2 shows, the performance of each branch network is lower than that of the complete NTN. Branch 4 has the largest receptive field and obtains the best performance. However, its parameter quantity is not much larger than that of branch 3. At the same time, it achieves a performance improvement at a more difficult level (closer to the NTN). Subjectively, the contribution of branch 4 to the whole network is considered the highest, which is consistent with the CI results, indicating that the evaluation method proposed in this paper is reasonable.
In this work, we have proposed NTN, a new method for exploring horizontal expansion networks, which solves the common problems of HENs through a special feature-sharing strategy. We have explored some properties of NTN and presented a new form of application of the NTN structure, which achieved good performance in SR. In addition, a new metric (CI) was defined to measure the contribution of branch networks to NTN, which is instructive for designing complex tree networks in the future. The results show that the NTN and its derivative networks make full use of multi-scale feature information, which effectively enhances the feature representation ability of the model.
-  P. Arbelaez, M. Maire, C. Fowlkes, and J. Malik. Contour detection and hierarchical image segmentation. IEEE transactions on pattern analysis and machine intelligence, 33(5):898–916, 2011.
-  M. Bevilacqua, A. Roumy, C. Guillemot, and M. L. Alberi-Morel. Low-complexity single-image super-resolution based on nonnegative neighbor embedding. 2012.
-  N. Bodla, J. Zheng, H. Xu, J.-C. Chen, C. Castillo, and R. Chellappa. Deep heterogeneous feature fusion for template-based face recognition. In Applications of Computer Vision (WACV), 2017 IEEE Winter Conference on, pages 586–595. IEEE, 2017.
-  C. E. Brodley and P. E. Utgoff. Multivariate decision trees. Machine learning, 19(1):45–77, 1995.
-  C. Dong, C. C. Loy, K. He, and X. Tang. Learning a deep convolutional network for image super-resolution. In European conference on computer vision, pages 184–199. Springer, 2014.
-  C. Dong, C. C. Loy, and X. Tang. Accelerating the super-resolution convolutional neural network. In European Conference on Computer Vision, pages 391–407. Springer, 2016.
-  R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 580–587, 2014.
-  K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the IEEE international conference on computer vision, pages 1026–1034, 2015.
-  K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
-  G. Hinton, L. Deng, D. Yu, G. E. Dahl, A.-r. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath, et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal processing magazine, 29(6):82–97, 2012.
-  G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger. Densely connected convolutional networks. In CVPR, volume 1, page 3, 2017.
-  J.-B. Huang, A. Singh, and N. Ahuja. Single image super-resolution from transformed self-exemplars. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5197–5206, 2015.
-  S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
-  J. Jeong, H. Park, and N. Kwak. Enhancement of ssd by concatenating feature maps for object detection. arXiv preprint arXiv:1705.09587, 2017.
-  J. Kim, J. Kwon Lee, and K. Mu Lee. Accurate image super-resolution using very deep convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1646–1654, 2016.
-  J. Kim, J. Kwon Lee, and K. Mu Lee. Deeply-recursive convolutional network for image super-resolution. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1637–1645, 2016.
-  A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
-  A. Liaw, M. Wiener, et al. Classification and regression by randomforest. R news, 2(3):18–22, 2002.
-  B. Lim, S. Son, H. Kim, S. Nah, and K. M. Lee. Enhanced deep residual networks for single image super-resolution. In The IEEE conference on computer vision and pattern recognition (CVPR) workshops, volume 1, page 4, 2017.
-  D. Martin, C. Fowlkes, D. Tal, and J. Malik. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In Computer Vision, 2001. ICCV 2001. Proceedings. Eighth IEEE International Conference on, volume 2, pages 416–423. IEEE, 2001.
-  S. Nah, T. H. Kim, and K. M. Lee. Deep multi-scale convolutional neural network for dynamic scene deblurring. In CVPR, volume 1, page 3, 2017.
-  O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
-  K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
-  C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi. Inception-v4, inception-resnet and the impact of residual connections on learning. In AAAI, volume 4, page 12, 2017.
-  C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1–9, 2015.
-  C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2818–2826, 2016.
-  Y. Tai, J. Yang, and X. Liu. Image super-resolution via deep recursive residual network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, volume 1, page 5, 2017.
-  R. Timofte, E. Agustsson, L. Van Gool, M.-H. Yang, L. Zhang, B. Lim, S. Son, H. Kim, S. Nah, K. M. Lee, et al. Ntire 2017 challenge on single image super-resolution: Methods and results. In Computer Vision and Pattern Recognition Workshops (CVPRW), 2017 IEEE Conference on, pages 1110–1121. IEEE, 2017.
-  R. Timofte, V. De Smet, and L. Van Gool. A+: Adjusted anchored neighborhood regression for fast super-resolution. In Asian Conference on Computer Vision, pages 111–126. Springer, 2014.
-  J. Yang, J. Wright, T. S. Huang, and Y. Ma. Image super-resolution via sparse representation. IEEE transactions on image processing, 19(11):2861–2873, 2010.
-  L. Yu, H. Chen, Q. Dou, J. Qin, and P.-A. Heng. Automated melanoma recognition in dermoscopy images via very deep residual networks. IEEE transactions on medical imaging, 36(4):994–1004, 2017.
-  R. Zeyde, M. Elad, and M. Protter. On single image scale-up using sparse-representations. In International conference on curves and surfaces, pages 711–730. Springer, 2010.
-  K. Zhang, W. Zuo, Y. Chen, D. Meng, and L. Zhang. Beyond a gaussian denoiser: Residual learning of deep cnn for image denoising. IEEE Transactions on Image Processing, 26(7):3142–3155, 2017.
-  R. Zhang, P. Isola, and A. A. Efros. In CVPR, volume 1, page 5, 2017.