1 Introduction
Structured data-based learning has become a central topic in machine learning, as such data representations arise in numerous fields. We focus here on tree-based representations, whose typical applications include parse trees in Natural Language Processing [4], XML trees in web mining [8], and hierarchical image representations [2]. Since tree structure plays an important role in tasks like classification or clustering, similarity measures that explicitly take into account the topological characteristics of the tree are sought. Among them, kernel functions are appealing as they allow the use of popular kernel methods [10]. Various kernels have indeed been proposed to cope with a tree as the underlying data structure (see [11] for a review). They mostly rely on a fundamental idea introduced by Haussler with his convolution kernels [7], stating that a kernel defined on a complex structure can be formed from kernels computed on its substructures. Most often, those kernels are defined for ordered trees, that is to say trees for which the left-to-right order among nodes or leaves is fixed (mostly because of the specific nature of the data or due to computational complexity issues). Examples of such kernels include the subset tree kernel [4] and the subtree kernel [13]. Unordered trees have received much less attention, with the subpath kernel [8] being one of the very few existing solutions.
Meanwhile, an emerging paradigm for image classification has advocated the idea of relying on hierarchical representations [2], which are built using series of nested partitions or segmentations, rather than the usual flat representation. This is particularly true in remote sensing, where such representations allow for revealing different objects of interest at various scales through a tree structure. However, the induced trees are unordered and the nodes or regions are associated with numerical features, preventing the use of existing tree kernels.
We propose in this paper a new kernel that arises from the subpath kernel [8]. Based on some existing adaptations to numerical data from the graph kernel literature, the designed kernel is able to cope with unordered trees equipped with numerical data (see Sec. 2). Besides, it considers the complete set of subpaths on tree structures (instead of paths on graphs), leading to an efficient computation scheme. Experimental results are given in Sec. 3. They rely on artificial datasets as well as a real multispectral satellite image. We end the paper with some concluding remarks and directions for future work.
2 Proposed Kernel
We focus here on structured data represented by trees and subpaths. Let us first recall that a tree is a directed and connected acyclic graph with a single root node. A path connecting a node in the tree to one of its descendants is called a subpath. Individual nodes are also included in the set of subpaths.
We build upon the subpath kernel [8] to design a new kernel able to cope with numerical data. Let us recall the principles of the original subpath kernel, which exploits the hierarchical structure by counting all common subpaths embedded in two tree structures equipped with symbolic features. Given two trees $T$ and $T'$, the subpath kernel is defined as
$$K(T, T') = \sum_{s \in T,\; s' \in T'} \mathrm{num}_T(s)\, \mathrm{num}_{T'}(s')\, \delta(s, s') \qquad (1)$$
where kernels are computed between all subpaths $s$, $s'$ of trees $T$, $T'$ respectively. They rely on the number of occurrences of the subpaths in the trees, written $\mathrm{num}_T(s)$ and $\mathrm{num}_{T'}(s')$. The Kronecker delta function $\delta(s, s')$ equals 1 if and only if the two subpaths are identical. Figure 1 illustrates, for a given simple tree, its possible subpaths and their occurrences in the tree.
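The counting scheme described above can be illustrated with a minimal Python sketch. It is not taken from [8]: the encoding of a tree as a nested `(label, children)` tuple and the function names are assumptions made for illustration only. The sketch enumerates every downward label sequence, counts its occurrences, and sums the products of counts over the subpaths common to both trees.

```python
from collections import Counter

# Hypothetical encoding: a tree is a (label, children) pair,
# where children is a list of such pairs.

def subpath_counts(tree):
    """Count the occurrences of every subpath (downward label sequence)."""
    counts = Counter()

    def visit(node):
        label, children = node
        # Subpaths starting at this node: the node alone, plus the node
        # followed by any subpath starting at one of its children.
        starting_here = [(label,)]
        for child in children:
            for tail in visit(child):
                starting_here.append((label,) + tail)
        counts.update(starting_here)
        return starting_here

    visit(tree)
    return counts


def symbolic_subpath_kernel(t1, t2):
    """Sum, over subpaths common to both trees, of the product of their counts."""
    c1, c2 = subpath_counts(t1), subpath_counts(t2)
    return sum(n * c2[s] for s, n in c1.items())


t1 = ("a", [("b", []), ("b", [])])
t2 = ("a", [("b", [])])
print(symbolic_subpath_kernel(t1, t2))  # 5
```

In this toy example the common subpaths are `a` (1 x 1), `b` (2 x 1) and `a-b` (2 x 1), which gives 5.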
2.1 Adaptation to numeric data
The original kernel described above was introduced for data classification in bioinformatics, where nodes take symbolic values. Adapting this kernel to numeric data requires one to change the terms $\mathrm{num}$ and $\delta$, since strict identity between subpaths (and their respective node features) does not generally occur. We follow here the scheme proposed with graph kernels for image classification [1], but consider subpaths instead of random walks as the substructure component. Indeed, the use of trees allows a complete enumeration of the subpaths, which is not achievable with graphs.
We replace the $\mathrm{num}$ and $\delta$ terms in Eq. (1) by a product of atomic kernel functions $k(n_h, n'_h)$ computed between pairs of nodes $n_h$ and $n'_h$. Various atomic kernels are available here, e.g. the Gaussian kernel [10], which has been used successfully in many contexts. As long as these atomic kernels are positive definite, the proposed structured kernel is also positive definite (see [7]). The subpath kernel for numeric data is then formulated as
$$K(T, T') = \sum_{s \in T,\; s' \in T'} \delta(|s|, |s'|) \prod_{h=1}^{|s|} k(n_h, n'_h) \qquad (2)$$
where the sum runs over the subpaths $s$ of $T$ and $s'$ of $T'$, the Kronecker delta function $\delta(|s|, |s'|)$ equals 1 if and only if the two subpaths have the same length $|s| = |s'|$, and the nodes $n_h \in s$, $n'_h \in s'$ are scanned in descending order along the two subpaths (from the root to the leaf). One might notice that if $k$ measures the identity of $n_h$ and $n'_h$, Eq. (2) becomes just another form of Eq. (1). The naive complexity of comparing all pairs of subpaths between $T$ and $T'$ is $O(|T|^2 |T'|^2)$, as indicated in [8] (the size of the subpath set of a given tree $T$ is in general $O(|T|^2)$, where $|T|$ refers to the number of vertices in the tree).
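As an illustration of the naive computation, the following Python sketch enumerates all subpaths of both trees and accumulates the product of atomic kernels for every pair of equal length. The tree encoding (nested `(feature, children)` tuples with scalar features), the function names and the default Gaussian atomic kernel with `gamma=1.0` are assumptions made for this example, not part of the original method description.

```python
import math
from itertools import product

# Hypothetical encoding: a node is (feature, children), with a float feature.

def all_subpaths(tree):
    """Return every downward path of the tree as a tuple of node features."""
    paths = []

    def visit(node):
        feature, children = node
        starting_here = [(feature,)]
        for child in children:
            starting_here += [(feature,) + tail for tail in visit(child)]
        paths.extend(starting_here)
        return starting_here

    visit(tree)
    return paths


def gaussian(x, y, gamma=1.0):
    """Assumed atomic kernel on scalar features."""
    return math.exp(-gamma * (x - y) ** 2)


def naive_subpath_kernel(t1, t2, atomic=gaussian):
    """Compare every pair of subpaths; quadratic in the number of subpaths."""
    total = 0.0
    for s1, s2 in product(all_subpaths(t1), all_subpaths(t2)):
        if len(s1) == len(s2):  # Kronecker delta on the subpath lengths
            total += math.prod(atomic(a, b) for a, b in zip(s1, s2))
    return total


t1 = (1.0, [(2.0, []), (2.0, [])])
t2 = (1.0, [(2.0, [])])
print(naive_subpath_kernel(t1, t2))
```

With an atomic kernel that returns 1 for identical features and 0 otherwise, this function counts the common subpaths, which matches the remark that the numeric kernel then reduces to the symbolic one.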
2.2 Efficient computation
Besides the proposed adaptation to numeric data, applying kernels to tree-based representations of images also raises a computational issue. Indeed, images are most often made of millions or even billions of pixels, which are placed in the tree structure. While such a structure aims to reduce the image content through a hierarchical representation, it may still be characterized by a huge number of nodes and edges. We therefore also need to address this issue to make the proposed kernel relevant when dealing with images.
Let us note that the original subpath kernel paper [8] gave some solutions to lower the computation time. However, they are tied to symbolic data and thus cannot be applied here. Inspired by [6], we instead suggest using dynamic programming to compare all subpaths. This strategy allows us to break down the comparison of all subpaths into smaller subproblems in a recursive way, and to reuse the solution of one subproblem to solve another. Repeated comparisons are thus avoided. More specifically, the comparison is computed recursively between two nodes $n$ and $n'$:
(3) 
with the set of children of in , and the atomic kernel measuring the similarity between and , by convention. We also have reaching 1 when either or has no child. For more details, let us denote two subtrees rooted at , respectively, a pair of subpaths with . Then sums the similarity measures of all pairs of subpaths with same length and starting at the same height in . The similarity is calculated recursively by the first and second term on the right side of equation, together with the third term introduced to prevent false subpaths that compare noncontiguous alignments.
Given a couple of trees and with respective roots and , we finally compute all pairs of subpaths embedded in the trees:
(4) 
Dynamic programming allows us to avoid the explicit computation of all pairs of subpaths between two trees $T$ and $T'$. Instead, only pairs of vertices are considered. The complexity is thus $O(|T|\,|T'|)$, where $|T|$ refers to the number of vertices in the tree $T$.
2.3 Additional improvements
The proposed kernel shares two main issues with existing structured kernels. On the one hand, the value of $K(T, T')$ depends on the size of the trees, while some invariance might be sought. This problem has already been tackled in [4] through a normalized kernel, which is computed here as
$$K^*(T, T') = \frac{K(T, T')}{\sqrt{K(T, T)\, K(T', T')}} \qquad (5)$$
On the other hand, the structured kernel gives the same importance to every node in the tree. Here nodes represent regions of various size, and larger regions might be given more attention than smaller ones. By weighting the atomic kernel by the relative size (i.e., number of pixels) of a node w.r.t. the root of the tree, and given a parameter , the updated kernel becomes
(6) 
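In practice, the normalization of Eq. (5) is often applied to a precomputed Gram matrix before training the classifier. The sketch below assumes the usual cosine-style form, in which each entry is divided by the square root of the product of the two corresponding diagonal entries; the function name and the list-of-lists matrix representation are choices made for this example.

```python
import math

def normalize_gram(gram):
    """Divide each entry K[i][j] by sqrt(K[i][i] * K[j][j])."""
    diag = [math.sqrt(gram[i][i]) for i in range(len(gram))]
    return [
        [gram[i][j] / (diag[i] * diag[j]) for j in range(len(gram))]
        for i in range(len(gram))
    ]


gram = [[4.0, 2.0],
        [2.0, 9.0]]
print(normalize_gram(gram))
```

After normalization, every tree has unit self-similarity, so larger trees no longer dominate simply because they contain more subpaths.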
3 Experiments
We have conducted two experiments: the first uses an artificial dataset to gain a finer understanding of the kernel behavior, and the second validates the kernel in a realistic context. Before giving an in-depth analysis of the results obtained on both datasets, we first present the common experimental setup.
3.1 Experimental Setup
We evaluate kernels in a classification context, considering a one-against-one SVM classifier (using the Java implementation of LibSVM [3]). The proposed subpath kernel is systematically compared with a kernel computed on the root of the tree only (i.e. ignoring all remaining nodes of the tree), called rooted kernel in the sequel. Our goal is to assess the importance of the various levels of information contained in the hierarchical representation w.r.t. a raw analysis of the whole data. Let us note that standard tree kernels based on substructures other than paths could hardly be applied here due to their computational complexity [11]. For the proposed kernel, nodes are compared individually using an atomic RBF kernel: each node being described by a feature or vector of attributes, the kernel is defined for a pair of nodes with respective features as
(7) 
We consider here two types of distances. The first one is an $L_2$ distance
(8) 
leading to a Gaussian kernel, for which the features contain the average and variance computed from all information contained in the nodes (that can be accessed from their leaves). The second one is a $\chi^2$ distance between histograms:
(9) 
where the features are here histograms of bins per dimension stacked together. We call $\chi^2$ kernel the resulting atomic kernel.
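To make the role of the distance inside the atomic RBF kernel concrete, here is a small Python sketch. The exact forms are assumptions: a squared Euclidean distance for the Gaussian case, a chi-squared form for the histogram distance, and an RBF written as the exponential of minus `gamma` times the distance. Constant factors and the function names are illustrative and may differ from Eqs. (7) to (9).

```python
import math

def squared_l2(x, y):
    """Assumed distance for the Gaussian case: squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def chi2(x, y):
    """Assumed histogram distance: sum of (a - b)^2 / (a + b) over bins."""
    # Bins that are empty in both histograms are skipped to avoid 0/0.
    return sum((a - b) ** 2 / (a + b) for a, b in zip(x, y) if a + b > 0)


def rbf(x, y, distance, gamma=1.0):
    """Atomic RBF kernel built on an arbitrary distance function."""
    return math.exp(-gamma * distance(x, y))


mean_var_a, mean_var_b = [0.2, 0.01], [0.5, 0.04]   # (average, variance)
hist_a, hist_b = [0.7, 0.3, 0.0], [0.1, 0.6, 0.3]   # normalized histograms
print(rbf(mean_var_a, mean_var_b, squared_l2, gamma=2.0))
print(rbf(hist_a, hist_b, chi2, gamma=2.0))
```

Passing the distance as a parameter keeps a single kernel definition for both feature types, with `gamma` playing the role of the bandwidth tuned by grid search.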
Three free parameters are determined by a grid search strategy over potential values: the bandwidth of the RBF atomic kernel (Eq. (7)), the SVM regularization parameter $C$, and the size weight (Eq. (6)).
Accuracies (and standard deviations) of each setup are computed after 100 repetitions of each experiment, choosing randomly 20 data samples from each class as training samples, using the remaining samples for testing.
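The sampling protocol can be sketched as follows in Python. The function name, the use of the standard `random` module and the optional seeded generator are assumptions for illustration; the sketch only shows how 20 samples per class are drawn for training while the rest is kept for testing, a draw that would be repeated 100 times.

```python
import random
from collections import defaultdict

def split_per_class(labels, n_train=20, rng=None):
    """Draw n_train random samples per class; the rest is used for testing."""
    if rng is None:
        rng = random.Random()
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    train, test = [], []
    for indices in by_class.values():
        shuffled = indices[:]
        rng.shuffle(shuffled)
        train += shuffled[:n_train]
        test += shuffled[n_train:]
    return sorted(train), sorted(test)


labels = [0] * 30 + [1] * 25
for repetition in range(3):  # the protocol above uses 100 repetitions
    train, test = split_per_class(labels, rng=random.Random(repetition))
    print(len(train), len(test))
```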
3.2 Artificial dataset
In this first experiment, we study the behavior of the proposed tree kernel through three different scenarios using an artificial dataset. Unless stated otherwise, we consider bins by dimension to construct the histogram. We call structure information the way leaves are aggregated and the initial number of leaves.
(a) Only the root is discriminative. We generate two types of leaves, and , that are described by a 1D feature generated according to a uniform distribution, with non-overlapping intervals. Class 1 trees are composed of leaves of type only and Class 2 trees of type only (see Fig. 1(a)). The number of leaves and the node merging parameters are defined randomly to produce various shapes of trees within each class. As shown in Tab. 1a, the subpath kernel behaves similarly to the rooted kernel: when the structure does not provide additional information, exploiting it in the proposed subpath kernel does not degrade performance.
(b) Only the structure is discriminative. We generate only type leaves. The two classes of trees can then be discriminated thanks to their structure, i.e. with different ranges of related parameters (number of leaves and number of fanouts for each node) for each class, see Fig. 1(b). As shown in Tab. 1b, the rooted kernel achieves only about 50% accuracy, i.e. chance level for two classes, while the subpath kernel is able to discriminate the two classes, thanks to the discriminative structure leading to different subpaths between the two classes. Let us note that when and , the subpath kernel turns into a kernel that computes the product of the number of subpaths with common length embedded in the two trees.
(c) Only the features of the nodes are discriminative. We generate both type and leaves, and we force type leaves to merge with type leaves in Class 1, while in Class 2, type (resp. ) leaves always merge with type (resp. ) leaves (see Fig. 1(c)). Similarly to scenario (a), structure parameters are selected randomly. As shown in Tab. 1c, the rooted kernel provides an accuracy of about 50%, due to the non-discriminative root. The discriminative information contained in the nodes benefits the proposed subpath kernel, leading to 100% accuracy. Note that even in the presence of irrelevant features, the subpath kernel still behaves correctly. Indeed, we have experimentally observed that adding 40 non-discriminative features to each node (so only one dimension among the 41 is relevant) leads to an accuracy of 97.13% with the Gaussian atomic kernel, and 99.99% with the $\chi^2$ atomic kernel.
(c1) Robustness to outliers. We modify the scenario (c) to introduce outlier leaves that take values outside the ranges of type and type leaves. The ratio of such leaves varies from 0% to 100%. We construct here (and here only) the histogram for the $\chi^2$ atomic kernel considering bins.
(c2) Robustness to mislabelled leaves. We update the scenario (c1) with outlier leaves changed to mislabelled leaves. To do so, some leaves of type are changed into type , and vice versa. In the binary classification setup considered here, the ratio of mislabelled leaves in each class varies from 0% to 50%, leading to more confusing subpaths between the two classes.
We can derive two main observations from Fig. 3: a) the subpath kernel can maintain good performance up to a certain ratio of structure distortion, and b) both accuracy drops (after 50% of outliers in c1, between 20% and 40% of mislabelled leaves in c2) illustrate that subpath kernel performance is directly related to how well substructures discriminate between two groups of complex structured data. Further, one might notice that in both scenarios, the subpath kernel using the $\chi^2$ atomic kernel always performs better than with the Gaussian atomic kernel. Indeed, histograms provide a rich description of the distribution of leaf attributes.
| Scenario | Rooted kernel (Gaussian) | Rooted kernel ($\chi^2$) | Subpath kernel (Gaussian) | Subpath kernel ($\chi^2$) |
|---|---|---|---|---|
| (a) discriminative root only | 100.0 (0.0) | 100.0 (0.0) | 100.0 (0.0) | 100.0 (0.0) |
| (b) discriminative structure only | 48.9 (4.6) | 49.4 (3.4) | 100.0 (0.0) | 100.0 (0.0) |
| (c) discriminative nodes only | 49.5 (3.5) | 51.2 (4.6) | 100.0 (0.0) | 100.0 (0.0) |
3.3 Satellite Image Dataset
Beyond the evaluation conducted on artificial datasets, we also perform experiments on a real dataset. Since hierarchical representations are common in remote sensing, we explore the relevance of the proposed kernel in this domain. To do so, we consider a QuickBird satellite image with high spatial resolution (i.e., m per pixel) of Strasbourg Illkirch in France, initially proposed and discussed in [9]. We can perform a quantitative evaluation of the kernel-based classification procedure thanks to the availability of a ground truth (a partition of the initial image into 400 regions, each of them being associated with one of the 7 classes of interest, see list and distribution in Fig. 5). Figure 4 shows the satellite image and its associated ground truth.
We compute a tree representation of each single region of the ground truth, using a standard open-source hierarchical image segmentation method called RHSEG [12]. RHSEG allows us to produce a fine segmentation map containing 3180 regions, which are subsequently aggregated to build coarser layers or segmentation maps with fewer regions (each iteration contains 300 fewer regions), as shown in Fig. 4. The coarsest segmentation is nothing but the ground truth. These different layers are stacked within tree structures. The last step consists in deleting the redundant nodes that remain unchanged across scales. Finally, we obtain 400 different trees, where each root represents a ground truth region and the other nodes represent its components (subregions) at different scales.
3.3.1 Results
Both rooted and subpath kernels are involved in a supervised classification process. Several statistics are derived: overall accuracy (ratio of correctly classified regions), average accuracy (average of the accuracy measured on each class), and kappa index (percentage of agreement in the test set, corrected by the agreement that could be expected by chance). Results are reported in Tab. 2. We can see that the proposed subpath kernel outperforms the rooted kernel on all three statistics, for both atomic kernels. It is able to exploit the additional spatial features provided by the hierarchical representation of individual regions. A deeper analysis is provided in Fig. 5, where accuracies are given for each class. The superiority of the subpath kernel is mainly observed on classes such as urban vegetation, industrial urban blocks, and agricultural zones. For some other classes, performance is weaker. Let us note that the reported results are for the trade-off parameter providing the best overall accuracies. It may then lead to inadequate values for some classes, as by definition of the proposed kernel, setting would mimic the rooted kernel.
| Method | OA | AA | $\kappa$ | time |
|---|---|---|---|---|
| Rooted kernel, with Gaussian atomic kernel | 53.1 (3.0) | 56.2 (2.9) | 0.447 (0.03) | 1.4 |
| Subpath kernel, with Gaussian atomic kernel | 58.4 (2.6) | 60.8 (2.9) | 0.498 (0.03) | 19.5 |
| Rooted kernel, with $\chi^2$ atomic kernel | 57.8 (2.2) | 61.3 (2.6) | 0.494 (0.03) | 2.7 |
| Subpath kernel, with $\chi^2$ atomic kernel | 61.4 (2.8) | 64.4 (2.9) | 0.532 (0.03) | 98.8 |
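For reference, the three statistics reported above (overall accuracy, average accuracy and kappa index) can be computed from true and predicted labels as in the following Python sketch. The function name is hypothetical, and the kappa value follows the standard Cohen definition, which is assumed here to be the one used.

```python
from collections import Counter

def classification_scores(y_true, y_pred):
    """Return (overall accuracy, average accuracy, Cohen's kappa)."""
    n = len(y_true)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    # Overall accuracy: ratio of correctly classified samples.
    oa = sum(correct.values()) / n
    # Average accuracy: mean of the per-class accuracies.
    aa = sum(correct[c] / true_counts[c] for c in true_counts) / len(true_counts)
    # Agreement expected by chance, from the marginal label frequencies.
    pe = sum(true_counts[c] * pred_counts[c] for c in true_counts) / n ** 2
    # Degenerate single-class case: kappa is set to 1.0 by convention here.
    kappa = (oa - pe) / (1 - pe) if pe < 1 else 1.0
    return oa, aa, kappa


y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1]
print(classification_scores(y_true, y_pred))
```

Average accuracy differs from overall accuracy when classes are unbalanced, which is the case for the 7 classes considered here.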
4 Conclusion
In this paper, we introduced a new structured kernel that is able to cope with unordered trees equipped with numeric data. By doing so, we were able to apply pattern recognition and machine learning techniques to hierarchical image representations, which are becoming increasingly popular, especially in remote sensing. We built upon a subpath kernel initially designed for bioinformatics data, as well as some graph kernels that rely on random walks. Preliminary experiments show the abilities and robustness of the proposed kernel. The encouraging results call for further investigation.
Among future research directions, a comparison with existing kernels in image analysis is planned. Most of them are based on graph kernels. Since trees are a particular class of graphs, various graph kernels may be considered (e.g. [5]).
Acknowledgements
The authors acknowledge the support of the French Agence Nationale de la Recherche (ANR) under reference ANR13JS02000501 (Asterix project), and the support of Région Bretagne and Conseil Général du Morbihan (ARIA doctoral project). The authors would also like to thank A. Puissant from LIVE UMR CNRS 7362 (University of Strasbourg) for providing the Strasbourg dataset (Quickbird image and ground truth).
References
[1] Aldea, E., Atif, J., Bloch, I.: Image classification using marginalized kernels for graphs. In: Graph-Based Representations in Pattern Recognition, pp. 103–113. Springer (2007)
[2] Blaschke, T., et al.: Geographic object-based image analysis – towards a new paradigm. ISPRS J. of Photogrammetry and Remote Sensing 87, 180–191 (2014)
[3] Chang, C.C., Lin, C.J.: LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology 2(3), 27:1–27:27 (2011)
[4] Collins, M., Duffy, N.: Convolution kernels for natural language. In: Advances in Neural Information Processing Systems, pp. 625–632 (2001)
[5] Dupé, F.X., Brun, L.: Tree covering within a graph kernel framework for shape classification. In: Image Analysis and Processing, pp. 278–287. Springer (2009)
[6] Harchaoui, Z., Bach, F.: Image classification with segmentation graph kernels. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–8 (2007)
[7] Haussler, D.: Convolution kernels on discrete structures. Tech. rep., Department of Computer Science, University of California at Santa Cruz (1999)
[8] Kimura, D., Kuboyama, T., Shibuya, T., Kashima, H.: A subpath kernel for rooted unordered trees. In: Advances in Knowledge Discovery and Data Mining, pp. 62–74. Springer (2011)
[9] Kurtz, C., Passat, N., Gancarski, P., Puissant, A.: Extraction of complex patterns from multiresolution remote sensing images: A hierarchical top-down methodology. Pattern Recognition 45(2), 685–706 (2012)
[10] Shawe-Taylor, J., Cristianini, N.: Kernel Methods for Pattern Analysis. Cambridge University Press, Cambridge, UK (2004)
[11] Shin, K., Kuboyama, T.: A comprehensive study of tree kernels. In: New Frontiers in Artificial Intelligence, pp. 337–351. Springer (2014)
[12] Tilton, J.: RHSEG users manual: Including HSWO, HSEG, HSEGExtract, HSEGReader and HSEGViewer, version 1.50 (2010)
[13] Vishwanathan, S., Smola, A.J.: Fast kernels for string and tree matching. In: Kernel Methods in Computational Biology, pp. 113–130 (2004)