1 Introduction
Flood extent mapping plays a crucial role in addressing grand societal challenges such as disaster management, national water forecasting, and energy and food security. For example, during the Hurricane Harvey floods in 2017, first responders needed to know where flood water was in order to plan rescue efforts. In national water forecasting, detailed flood extent maps can be used to calibrate and validate the NOAA National Water Model [15], which forecasts the flow of over 2.7 million rivers and streams across the entire continental U.S. [4].
In current practice, flood extent maps are mostly generated by flood forecasting models, whose accuracy is often unsatisfactory at high spatial detail [4]. Other ways to generate flood maps involve sending field crews on the ground to record high-water marks, or visually interpreting earth observation imagery [2]. However, both processes are expensive and time consuming. With the large amount of high-resolution earth imagery being collected from satellites (e.g., DigitalGlobe, Planet Labs), aerial planes (e.g., NOAA National Geodetic Survey), and unmanned aerial vehicles, the cost of manually labeling flood extents becomes prohibitive.
The focus of this paper is to develop a classification model that can automatically classify earth observation imagery pixels into flood extent maps. The results can be used by first responders to plan rescue efforts, by hydrologists to calibrate and validate water forecasting models, and by insurance companies to process claims. Specifically, we can utilize a small set of manually collected ground truth (flood and dry locations) in one earth image to learn a classification model. The model can then be used to classify flood pixels in other imagery where ground truth is not available.
However, flood mapping poses several unique challenges that are not well addressed in traditional classification problems. First, the data contains rich noise and obstacles. For example, high-resolution earth imagery often has noise, clouds, and shadows. The spectral features of these pixels cannot be used to distinguish classes. Second, class confusion exists due to heterogeneous features. For instance, pixels of tree canopies overlaying flood water have the same spectral features as trees in dry areas, yet their classes are different. Third, implicit directed spatial dependency exists between flood class locations. Specifically, due to gravity, flood water tends to flow to nearby lower locations following topography. Such dependency is not uniform in all directions (anisotropic). Finally, the data volume is huge in high-resolution imagery (e.g., hundreds of millions of pixels in one city), requiring scalable algorithms.
To address these challenges, we propose a novel spatial classification model called geographical hidden Markov tree (HMT). It is a probabilistic graphical model that generalizes the common hidden Markov model (HMM) from a one-dimensional sequence to a two-dimensional geographical map. Specifically, the hidden class layer contains nodes (pixels) in a reverse tree structure to represent anisotropic spatial dependency with a partial order constraint. Each hidden class node has an associated observed feature node for the same pixel. Such a unique model structure can potentially reduce classification errors due to noise, obstacles, and heterogeneity among spectral features of individual pixels.
We further investigate computational algorithms for reverse tree construction, model parameter learning, and class inference. Specifically, the reverse tree is constructed following topological orders based on elevation. To learn model parameters given a hidden class layer, we utilize the EM algorithm with message propagation along the reverse tree. Finally, for class inference, we design a greedy algorithm that assigns class labels to tree nodes to maximize the overall probability while following the partial order constraint.
In summary, we make the following contributions:

We propose a novel geographical hidden Markov tree (HMT) model that incorporates partial order class dependency in a reverse tree structure in a hidden class layer. Unlike existing hidden Markov trees [5], which model dependency in the two-dimensional time-frequency domain for signal processing, our geographical HMT captures anisotropic (directed) spatial dependency with a partial order constraint.

We design efficient algorithms for reverse tree construction, model parameter learning and class inference.

We conduct theoretical analysis on the correctness and time complexity of HMT algorithms.

We evaluate the proposed model on both synthetic and real-world datasets for flood mapping. Results show that the proposed model outperforms multiple baseline methods in flood mapping, and that our algorithms are scalable to large data sizes.
2 Problem Statement
2.1 Preliminaries
A spatial raster framework is a tessellation of a two-dimensional plane into a regular grid of cells. Spatial neighborhood relationships exist between cells based on cell adjacency. The framework can consist of non-spatial explanatory feature layers (e.g., spectral bands in earth imagery), one spatial contextual layer (e.g., elevation), and one class layer (e.g., flood, dry).
Each cell in a raster framework is a spatial data sample, noted as $\mathbf{s}_n = (\mathbf{x}_n, \phi_n, y_n)$, where $\mathbf{x}_n$ is a vector of non-spatial explanatory feature values, with each element corresponding to one feature layer, $\phi_n$ is the cell’s value in the spatial contextual layer, and $y_n \in \{0, 1\}$ is a binary class label. A raster framework with all $N$ samples is noted as $\mathcal{F} = \{\mathbf{s}_n\}_{n=1}^{N}$, the non-spatial explanatory features of all samples are noted as $\mathbf{X}$, the spatial contextual layer is noted as $\boldsymbol{\Phi}$, and the class layer is noted as $\mathbf{Y}$.
Due to physics, spatial dependency exists between cells based on their values in the spatial contextual layer. Such dependency is often nonuniform in different directions (anisotropic). For example, due to gravity, flood water can only flow to neighboring cells with lower elevation values.
Anisotropic dependency often follows a partial order constraint. Formally, assuming the spatial contextual layer is a potential field (e.g., elevation), a partial order dependency from sample $\mathbf{s}_i$ to sample $\mathbf{s}_j$ exists if and only if there exists a sequence of neighboring (adjacent) cells $\mathbf{s}_i = \mathbf{s}_{k_1}, \mathbf{s}_{k_2}, \ldots, \mathbf{s}_{k_m} = \mathbf{s}_j$ such that $\phi_{k_l} \geq \phi_{k_{l+1}}$ for any $1 \leq l < m$.
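As a concrete illustration of this definition, a minimal sketch (ours, not from the paper) that checks whether a chain of adjacent cells with non-increasing elevation connects two cells on a one-dimensional grid:

```python
from collections import deque

def has_partial_order_dependency(elev, src, dst):
    """Check whether flood at cell `src` implies flood at cell `dst`,
    i.e., whether adjacent cells with non-increasing elevation lead
    from src down to dst (1-D grid for simplicity)."""
    n = len(elev)
    seen = {src}
    frontier = deque([src])
    while frontier:
        c = frontier.popleft()
        if c == dst:
            return True
        for nb in (c - 1, c + 1):          # 1-D adjacency
            if 0 <= nb < n and nb not in seen and elev[nb] <= elev[c]:
                seen.add(nb)
                frontier.append(nb)
    return False

# Water at a local peak reaches the valley below it, but not vice versa.
elev = [3.0, 1.0, 2.0, 5.0, 4.0, 6.0]
```

Note the asymmetry: dependency holds from higher cells down to lower cells but not in the reverse direction, which is exactly what makes the dependency anisotropic.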
Figure 1(a) shows an illustrative example with eight spatially adjacent cell samples in one-dimensional space. Due to gravity, if a cell is flood, its nearby cells with lower elevations should also be flood, even if their feature values indicate otherwise. Thus, we can establish partial order spatial dependency between such cell locations.
Partial order dependency across all pairs of samples in a raster framework can be represented by a reverse tree structure, which is called the spatial dependency (reverse) tree, or dependency tree. We sometimes omit the word “reverse” for simplicity. The tree structure removes some redundant dependency between cell locations. Due to the reverse nature, a tree node can have at most one child but multiple parents and multiple siblings, where a tree edge points from a parent to a child.
Figure 1(b) shows an example of the dependency tree corresponding to the samples in Figure 1(a). Some class dependencies are redundant given transitive dependencies through intermediate nodes. It is worth noting that we assume an arbitrary order when comparing nodes with the same elevation values: if two nodes had the same elevation, either one could appear nearer the top of the tree.
2.2 Formal problem definition
We now formally define the problem.
Input:
Spatial raster framework
Explanatory features of samples
Spatial contextual layer (elevation) of samples:
Training samples
Output: A spatial classification model
Objective: minimize classification errors
Constraint:
Explanatory feature layers contain noise and obstacles
Partial order dependency exists between sample classes based on spatial contextual layer
Sample classes are binary (flood or dry)
3 Proposed Approach
In this section, we start with an overview of our hidden Markov tree model and its probabilistic formulation. We then introduce specific algorithms for dependency tree construction, model parameter learning, and class inference.
3.1 Overview of Hidden Markov Tree
We propose a hidden Markov tree (HMT) model, which generalizes the common hidden Markov model from a total order chain structure to a partial order (reverse) tree structure. As illustrated in Figure 2, an HMT model consists of two layers: a hidden layer of sample classes (e.g., flood, dry), and an observation layer of sample feature vectors (e.g., spectral vectors). Each node corresponds to a spatial data sample (raster cell). Edge directions show the probabilistic conditional dependence structure. Specifically, the model assumes that feature vectors of different samples are conditionally independent of each other given their classes, and that sample classes follow a partial order dependency in a reverse tree structure.
Hidden Markov tree is a probabilistic graphical model. The joint distribution of all samples’ features and classes can be expressed as Equation 1, where $\mathcal{P}_n$ is the set of parent samples of the $n$th sample in the dependency tree, and $\mathbf{Y}_{\mathcal{P}_n}$ is the set of class nodes corresponding to the parents of the $n$th sample. For a leaf node $n$, $\mathcal{P}_n = \emptyset$ and $P(y_n \mid \mathbf{Y}_{\mathcal{P}_n}) = P(y_n)$.

$$P(\mathbf{X}, \mathbf{Y}) = \prod_{n=1}^{N} P(\mathbf{x}_n \mid y_n) \, P(y_n \mid \mathbf{Y}_{\mathcal{P}_n}) \quad (1)$$
The conditional probability of a sample feature vector given its class can be assumed i.i.d. Gaussian for simplicity, as shown in Equation 2, where $\boldsymbol{\mu}_c$ and $\boldsymbol{\Sigma}_c$ are the mean and covariance matrix of the feature vector for class $c$ ($c \in \{0, 1\}$), and $m$ is the number of features. It is worth noting that $P(\mathbf{x}_n \mid y_n)$ could be more general than i.i.d. Gaussian.

$$P(\mathbf{x}_n \mid y_n = c) = (2\pi)^{-\frac{m}{2}} |\boldsymbol{\Sigma}_c|^{-\frac{1}{2}} \exp\left(-\frac{1}{2} (\mathbf{x}_n - \boldsymbol{\mu}_c)^{\mathsf{T}} \boldsymbol{\Sigma}_c^{-1} (\mathbf{x}_n - \boldsymbol{\mu}_c)\right) \quad (2)$$
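As a small numerical sketch of Equation 2 in plain NumPy (the function name is ours):

```python
import numpy as np

def class_conditional(x, mu, Sigma):
    """P(x | y = c): multivariate Gaussian density N(mu_c, Sigma_c)."""
    x, mu = np.asarray(x, float), np.asarray(mu, float)
    m = x.size                                  # number of features
    d = x - mu
    norm = (2 * np.pi) ** (-m / 2) * np.linalg.det(Sigma) ** -0.5
    return norm * np.exp(-0.5 * d @ np.linalg.solve(Sigma, d))
```

In practice one would work with log-densities for numerical stability, but the form above mirrors the equation directly.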
Class transitional probability follows the partial order constraint. For example, due to gravity, if any parent’s class is dry, the child’s class must be dry; if all parents’ classes are flood, then the child has a high probability of being flood. Consider flood as the positive class (class value 1) and dry as the negative class (class value 0); the transitional probability is then actually conditioned on the product of parent classes $\prod_{k \in \mathcal{P}_n} y_k$. The formula is in Equation 3, where $\rho$ is the probability of a child in class 1 given all parents in class 1 (note that we assume the same $\rho$ is shared by all non-leaf nodes). In other words, if any parent is in class 0 (dry), the current node must also be in class 0; if all parents are in class 1 (flood), then the current node has a probability $\rho$ of being in class 1.

$$P(y_n \mid \mathbf{Y}_{\mathcal{P}_n}) = \begin{cases} \rho^{y_n} (1-\rho)^{1-y_n} & \text{if } \prod_{k \in \mathcal{P}_n} y_k = 1 \\ 1 - y_n & \text{if } \prod_{k \in \mathcal{P}_n} y_k = 0 \end{cases} \quad (3)$$
For a leaf node $n$, $\mathcal{P}_n = \emptyset$. The transitional probability degrades into a simple class prior probability $P(y_n) = \pi^{y_n}(1-\pi)^{1-y_n}$, where $\pi$ is the probability of a leaf node being in class 1.
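The class prior and transitional probability of Equation 3 can be sketched in a few lines, using the 1 = flood / 0 = dry coding (function names are ours):

```python
def transition_prob(y_child, parent_classes, rho):
    """P(y_n | parents) under the partial order constraint (Eq. 3).
    If any parent is dry (product of parent classes = 0), the child
    must be dry; if all parents are flood, the child is flood with
    probability rho."""
    z = 1
    for y in parent_classes:
        z *= y                       # product of parent classes
    if z == 0:
        return 1.0 if y_child == 0 else 0.0
    return rho if y_child == 1 else 1.0 - rho

def leaf_prob(y, pi):
    """Class prior for a leaf node: P(y = 1) = pi."""
    return pi if y == 1 else 1.0 - pi
```

The hard zero for a flood child under a dry parent is what encodes the partial order constraint inside the probabilistic model.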
Though we introduce our HMT in the context of flood mapping, the model can potentially be used for a broad class of spatial classification problems in which class labels follow a partial order dependency. Examples include predicting pollutants in river stream networks and traffic congestion in road networks.
3.2 Dependency Tree Construction
Given geopotential field values (e.g., elevation) of all cells in a raster framework, the goal is to produce a partial order class dependency tree, in which each node corresponds to the class label of a cell. The process is computationally challenging due to the large number of cells (tree nodes) in real-world high-resolution earth imagery (e.g., hundreds of millions of pixels).
To address the challenge, we propose an algorithm that constructs the tree by adding nodes in topological order. Details are in Algorithm 1. The algorithm starts with an empty tree and an empty set of visited cells (all cells are unvisited, step 1). It sorts all cells by their geopotential field (elevation) values (step 2). After this, unvisited cells are added into the tree (i.e., become visited) one by one in ascending order of geopotential. Specifically, for each cell, the algorithm first marks it as visited (step 4), creates a tree node for the cell (step 5), and attaches the tree node to the rear of the tree branch following every visited neighbor of the cell (steps 6 to 9). If no neighbor of the cell is visited, the cell is a local minimum of the geopotential field, and the algorithm creates a new tree branch starting from the node of the cell (steps 10 to 11).
We now use the example of Figure 1 to illustrate the algorithm execution trace. The example contains cells in one-dimensional space, but generalization to the two-dimensional case is straightforward. The input contains eight cells. The algorithm first sorts these cells in ascending order of elevation. Then, leaf nodes are created for the first cells processed, since none of their neighbors are visited at that point. Next, when a cell with one visited neighbor is added, the algorithm attaches its node to the rear of the branch following that neighbor. When a cell with two visited neighbors is added, its node is attached to the rear of both branches, merging them. The remaining nodes are added consecutively in the same manner.
Time complexity analysis: Algorithm 1 involves a one-time sorting of cells, which is $O(N \log N)$. Then, for each of the $N$ cells, the main operation is to attach the cell to the rear of the branches of its visited neighbors. A naive implementation costs $O(N)$ per cell, making the total cost $O(N^2)$. A smarter way is to maintain a rear node pointer for each branch when it is created (i.e., when a leaf node is added). Assuming that geopotential field values on neighboring cells are continuous (this is often true since real-world elevations of nearby locations do not change suddenly), finding the rear of a neighboring cell’s branch takes constant cost, making the total time cost $O(N \log N)$ (the cost after sorting is linear).
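The construction above can be sketched compactly as follows (a sketch, ours; for brevity, `find_rear` walks child pointers instead of maintaining the constant-time rear pointers discussed above):

```python
def build_dependency_tree(elev, neighbors):
    """Sketch of Algorithm 1: visit cells in ascending elevation
    order and attach each new node after the rear of every visited
    neighbor's branch (a node attached to two rears merges branches).
    neighbors[i] lists cells adjacent to cell i. Returns, for each
    cell, the list of its tree parents (empty list = leaf node)."""
    n = len(elev)
    order = sorted(range(n), key=lambda i: elev[i])
    child = {}                           # tree edge: parent -> child
    parents = {i: [] for i in range(n)}
    visited = set()

    def find_rear(c):
        while c in child:                # follow child pointers to rear
            c = child[c]
        return c

    for c in order:
        visited.add(c)
        rears = {find_rear(nb) for nb in neighbors[c] if nb in visited}
        rears.discard(c)                 # branch may already end at c
        for r in rears:                  # attach c after each branch rear
            child[r] = c
            parents[c].append(r)
    return parents
```

On a 1-D example with elevations `[1, 3, 2, 4]`, cells 0 and 2 become leaves, cell 1 merges both branches (two parents), and cell 3 is attached after cell 1.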
3.3 Model Parameter Learning
The parameters of the hidden Markov tree include the mean and covariance matrix of sample features in each class, the prior probability of leaf node classes, and the class transition probability for non-leaf nodes. We denote the entire set of parameters as $\Theta = \{\pi, \rho, \boldsymbol{\mu}_c, \boldsymbol{\Sigma}_c \mid c \in \{0, 1\}\}$. Learning this set of parameters poses two major challenges: first, there exist unknown hidden class variables $\mathbf{Y}$, which are non-i.i.d.; second, the number of samples (nodes) is huge (up to hundreds of millions of pixels). To address these challenges, we propose to use the expectation-maximization (EM) algorithm and message (belief) propagation. Our EM-based approach has the following major steps:

(a) Initialize the parameter set $\Theta^{(0)}$.

(b) Compute the posterior distribution of hidden classes: $P(\mathbf{Y} \mid \mathbf{X}, \Theta^{(t)})$.

(c) Compute the posterior expectation of the log likelihood: $Q(\Theta \mid \Theta^{(t)}) = \mathbb{E}_{\mathbf{Y} \mid \mathbf{X}, \Theta^{(t)}} \log P(\mathbf{X}, \mathbf{Y} \mid \Theta)$.

(d) Update parameters: $\Theta^{(t+1)} = \arg\max_{\Theta} Q(\Theta \mid \Theta^{(t)})$. Return $\Theta^{(t+1)}$ if converged; otherwise go to (b).
Among the four steps above, step (b), which computes the joint posterior distribution of all sample classes, is practically infeasible due to the large number of hidden class nodes that are non-i.i.d. Fortunately, it is not necessary to compute the entire joint posterior distribution $P(\mathbf{Y} \mid \mathbf{X}, \Theta^{(t)})$ of all sample classes. In fact, we only need the marginal posterior distribution $P(y_n, \mathbf{Y}_{\mathcal{P}_n} \mid \mathbf{X}, \Theta^{(t)})$ of a node’s and its parents’ classes for non-leaf nodes, as well as the marginal posterior distribution $P(y_n \mid \mathbf{X}, \Theta^{(t)})$ of a node’s class for leaf nodes. The reason can be explained through the expression of the posterior expectation of the log likelihood in Equation 4.
$$Q(\Theta \mid \Theta^{(t)}) = \sum_{n=1}^{N} \sum_{y_n} P(y_n \mid \mathbf{X}, \Theta^{(t)}) \log P(\mathbf{x}_n \mid y_n) + \sum_{n=1}^{N} \sum_{y_n, \mathbf{Y}_{\mathcal{P}_n}} P(y_n, \mathbf{Y}_{\mathcal{P}_n} \mid \mathbf{X}, \Theta^{(t)}) \log P(y_n \mid \mathbf{Y}_{\mathcal{P}_n}) \quad (4)$$
Note that for a leaf node $n$, $\mathcal{P}_n = \emptyset$, and the corresponding term in the last sum degrades to $\sum_{y_n} P(y_n \mid \mathbf{X}, \Theta^{(t)}) \log P(y_n)$.
To compute the marginal posterior distributions $P(y_n \mid \mathbf{X})$ and $P(y_n, \mathbf{Y}_{\mathcal{P}_n} \mid \mathbf{X})$, we propose to use the message propagation method based on the sum-product algorithm [12, 19].
Figure 3 illustrates the recursive message propagation process on our HMT model. Specifically, forward message propagation from leaves to root is based on Equation 5 and Equation 6, where $f^{in}_n(y_n)$ and $f^{out}_n(y_n)$ are the incoming message into and the outgoing message from a hidden class node $y_n$, respectively.
$$f^{in}_n(y_n) = \sum_{\mathbf{Y}_{\mathcal{P}_n}} P(y_n \mid \mathbf{Y}_{\mathcal{P}_n}) \prod_{k \in \mathcal{P}_n} f^{out}_k(y_k) \quad (5)$$

$$f^{out}_n(y_n) = P(\mathbf{x}_n \mid y_n) \, f^{in}_n(y_n) \quad (6)$$

For a leaf node, $f^{in}_n(y_n) = P(y_n)$.
Backward message propagation from root to leaves also follows a recursive process, as shown in Equation 7 and Equation 8, where $b^{in}_n(y_n)$ and $b^{out}_n(y_n)$ are the incoming and outgoing messages for class node $y_n$, respectively. The main difference from forward propagation is that when computing the incoming message $b^{in}_n(y_n)$, we need to multiply not only the outgoing message from the child node and the class transitional probability, but also the outgoing messages from sibling nodes in the forward propagation (also illustrated in Figure 3(b)).
$$b^{in}_n(y_n) = \sum_{y_{c_n}} \sum_{\mathbf{Y}_{\mathcal{P}_{c_n}} \setminus \{y_n\}} P(y_{c_n} \mid \mathbf{Y}_{\mathcal{P}_{c_n}}) \, b^{out}_{c_n}(y_{c_n}) \prod_{k \in \mathcal{P}_{c_n}, \, k \neq n} f^{out}_k(y_k) \quad (7)$$

$$b^{out}_n(y_n) = P(\mathbf{x}_n \mid y_n) \, b^{in}_n(y_n) \quad (8)$$

where $c_n$ is the child of node $n$. For the root node, $b^{in}_n(y_n) = 1$.
After both forward and backward message propagation, we can compute marginal posterior distribution of hidden class variables based on the following theorem.
Theorem 1.
The unnormalized marginal posterior distribution of the class of a leaf node, as well as of the classes of a non-leaf node $n$ with its parents $\mathcal{P}_n$, can be computed by (9) and (10) respectively. Their normalized marginal posterior distributions can be computed by (11) and (12) respectively.
$$P(y_n, \mathbf{X}) = f^{out}_n(y_n) \, b^{in}_n(y_n) \quad (9)$$

$$P(y_n, \mathbf{Y}_{\mathcal{P}_n}, \mathbf{X}) = P(\mathbf{x}_n \mid y_n) \, b^{in}_n(y_n) \, P(y_n \mid \mathbf{Y}_{\mathcal{P}_n}) \prod_{k \in \mathcal{P}_n} f^{out}_k(y_k) \quad (10)$$

$$P(y_n \mid \mathbf{X}) = \frac{f^{out}_n(y_n) \, b^{in}_n(y_n)}{\sum_{y_n} f^{out}_n(y_n) \, b^{in}_n(y_n)} \quad (11)$$

$$P(y_n, \mathbf{Y}_{\mathcal{P}_n} \mid \mathbf{X}) = \frac{P(y_n, \mathbf{Y}_{\mathcal{P}_n}, \mathbf{X})}{\sum_{y_n, \mathbf{Y}_{\mathcal{P}_n}} P(y_n, \mathbf{Y}_{\mathcal{P}_n}, \mathbf{X})} \quad (12)$$
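To sanity-check the message recursions and the theorem, consider the smallest non-trivial reverse tree: two leaves a and b that are both parents of a root c (binary classes; the topology, names, and numbers below are ours). The marginals from message passing can be verified against brute-force enumeration of the joint distribution in Equation 1:

```python
from itertools import product

def trans(yc, z, rho):
    """P(y_child | product of parent classes z), as in Eq. 3."""
    if z == 0:
        return 0.0 if yc == 1 else 1.0
    return rho if yc == 1 else 1.0 - rho

def hmt_marginals(pi, rho, L):
    """Forward/backward message passing on the reverse tree a, b -> c.
    L[n][y] = P(x_n | y_n = y). Returns posterior marginals for leaf a
    and root c, plus the evidence P(X)."""
    # forward pass (Eqs. 5-6): a leaf's incoming message is its prior
    f_a = [(1 - pi) * L['a'][0], pi * L['a'][1]]
    f_b = [(1 - pi) * L['b'][0], pi * L['b'][1]]
    f_in_c = [0.0, 0.0]
    for ya, yb in product((0, 1), repeat=2):
        for yc in (0, 1):
            f_in_c[yc] += trans(yc, ya * yb, rho) * f_a[ya] * f_b[yb]
    f_c = [f_in_c[y] * L['c'][y] for y in (0, 1)]
    Z = sum(f_c)                         # evidence P(X)
    # backward pass (Eqs. 7-8): the root's incoming message is 1, so
    # its outgoing message is just its likelihood; the message to leaf
    # a also multiplies in the forward message from its sibling b
    b_in_a = [0.0, 0.0]
    for ya in (0, 1):
        for yb, yc in product((0, 1), repeat=2):
            b_in_a[ya] += trans(yc, ya * yb, rho) * L['c'][yc] * f_b[yb]
    p_a = [f_a[y] * b_in_a[y] / Z for y in (0, 1)]   # Eq. 11 style
    p_c = [f_c[y] / Z for y in (0, 1)]
    return p_a, p_c, Z
```

On this toy instance the message-passing marginals match the exact marginals obtained by summing the joint over all eight class assignments.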
Proof.
Detailed proof is in the Appendix at the end of this paper. ∎
After computing the marginal posterior distributions, we can update the model parameters by maximizing the posterior expectation of the log likelihood (the maximization or M step in EM). Substituting the marginal posterior distributions in (11) and (12) as well as the parameterized probabilities in (2) and (3) into the posterior expectation of the log likelihood in (4), we can derive the following parameter update formulas.
$$\pi = \frac{\sum_{n \in \mathcal{L}} P(y_n = 1 \mid \mathbf{X})}{|\mathcal{L}|} \quad (13)$$

$$\rho = \frac{\sum_{n \notin \mathcal{L}} P(y_n = 1, \prod_{k \in \mathcal{P}_n} y_k = 1 \mid \mathbf{X})}{\sum_{n \notin \mathcal{L}} P(\prod_{k \in \mathcal{P}_n} y_k = 1 \mid \mathbf{X})} \quad (14)$$

$$\boldsymbol{\mu}_c = \frac{\sum_{n=1}^{N} P(y_n = c \mid \mathbf{X}) \, \mathbf{x}_n}{\sum_{n=1}^{N} P(y_n = c \mid \mathbf{X})} \quad (15)$$

$$\boldsymbol{\Sigma}_c = \frac{\sum_{n=1}^{N} P(y_n = c \mid \mathbf{X}) (\mathbf{x}_n - \boldsymbol{\mu}_c)(\mathbf{x}_n - \boldsymbol{\mu}_c)^{\mathsf{T}}}{\sum_{n=1}^{N} P(y_n = c \mid \mathbf{X})} \quad (16)$$

where $\mathcal{L}$ denotes the set of leaf nodes.
Algorithm 2 summarizes the entire parameter learning process. First, we initialize the set of parameters either with random values within a reasonable range or with initial estimates based on training samples (e.g., the mean and covariance of features in each class). After the parameters are initialized, the algorithm iterates until the parameters converge. In each iteration, it propagates messages first from leaves to root (steps 4-5) and then from root to leaves (steps 6-7). Marginal posterior distributions of node classes are then computed (steps 8-9). Based on these, the algorithm updates the parameters (step 10).
Time complexity: The cost of Algorithm 2 mainly comes from the iterations. In each iteration, message propagation is done through tree traversal, which costs $O(N)$ ($N$ is the total number of samples, i.e., tree nodes). It can also easily be seen that marginal probability computation and parameter updates both cost $O(N)$. Thus, the total cost is $O(IN)$, where $I$ is the number of iterations.
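The Gaussian part of the M step (Equations 15-16) amounts to posterior-weighted means and covariances; a NumPy sketch (ours; the $\pi$ and $\rho$ updates are analogous ratios of posterior counts and are omitted here):

```python
import numpy as np

def m_step_gaussian(X, post1):
    """Update (mu_c, Sigma_c) for c in {0, 1} from the marginal
    posteriors post1[n] = P(y_n = 1 | X), as in Eqs. 15-16."""
    X = np.asarray(X, float)
    out = []
    for w in (1.0 - np.asarray(post1, float), np.asarray(post1, float)):
        mu = (w[:, None] * X).sum(0) / w.sum()        # weighted mean
        d = X - mu
        # weighted sum of outer products (x - mu)(x - mu)^T
        Sigma = (w[:, None, None] * d[:, :, None] * d[:, None, :]).sum(0) / w.sum()
        out.append((mu, Sigma))
    return out  # [(mu_0, Sigma_0), (mu_1, Sigma_1)]
```

With hard (0/1) posteriors these updates reduce to the usual per-class sample mean and covariance, which is a useful correctness check.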
Is the model unsupervised or semi-supervised? From the discussion above, it is possible to learn HMT parameters in an unsupervised manner without training class labels. However, this relies on strong assumptions on data distributions. Particularly, it requires samples in different classes to be somehow distinguishable merely based on their feature distributions $P(\mathbf{x}_n \mid y_n)$, since the class transitional probability in the dependency tree only enforces a partial order constraint between class nodes. This assumption can be violated in many real-world applications where different classes cannot be easily distinguished via unsupervised feature clustering. In such cases, we can utilize training samples with class labels to initialize the parameters of $P(\mathbf{x}_n \mid y_n)$, i.e., $\{\boldsymbol{\mu}_c, \boldsymbol{\Sigma}_c\}$, by maximum likelihood estimation. In this way, the initialized probabilities better estimate the class-conditional distributions before the iterations. This makes the model learning semi-supervised [25].
3.4 Class Inference
After learning all model parameters, we can infer the hidden class variables by maximizing the overall probability. In the traditional hidden Markov model, inference on hidden variables is done through the Viterbi algorithm [18] based on dynamic programming. However, its computational cost is still very high for a large number of nodes (e.g., up to hundreds of millions). To address this challenge, we propose a greedy algorithm that guarantees correctness based on the partial order class constraint. Taking the logarithm of the joint probability in Equation 1, we have the objective function below.
$$\log P(\mathbf{X}, \mathbf{Y}) = \sum_{n=1}^{N} \log P(\mathbf{x}_n \mid y_n) + \sum_{n=1}^{N} \log P(y_n \mid \mathbf{Y}_{\mathcal{P}_n}) \quad (17)$$
The goal of class inference is to assign a class label to each tree node such that the overall sum of log probability terms is maximized. Each term in the summation can be considered a reward. For instance, $\log P(\mathbf{x}_n \mid y_n)$ is the reward for assigning class $y_n$ to node $n$ (a node reward), and $\log P(y_n \mid \mathbf{Y}_{\mathcal{P}_n})$ is the reward for assigning classes $y_n$ and $\mathbf{Y}_{\mathcal{P}_n}$ to node $n$ and its parents respectively (an edge reward). Thus, class inference in HMT becomes a node coloring problem. Our goal is to find a node coloring that maximizes the overall sum of rewards. In addition, the coloring must follow the partial order constraint, e.g., flood nodes cannot follow dry nodes, because otherwise $P(y_n \mid \mathbf{Y}_{\mathcal{P}_n}) = 0$ and the corresponding reward is $-\infty$.
Therefore, we can enumerate all feasible node colorings through one bottom-up tree traversal, as described in Algorithm 3. We initialize all node colors as class 0 (the negative class, e.g., dry), and gradually change node colors from class 0 to class 1 from leaves to the root. When we change the color of a node, only the reward of the node itself, as well as the rewards of the edges between the node and its parents and child, will be updated, as illustrated in Figure 4. Thus, we can easily compute the gain of rewards when updating a node color, and maintain the current cumulative gain as well as the maximum cumulative gain that we have encountered so far. When we reach the root node, the maximum overall gain of rewards has been recorded. We can traverse the tree again to find its corresponding node coloring.
Time complexity analysis: The initialization steps cost $O(N)$, where $N$ is the number of samples (tree nodes). Each iteration of the for loop has a constant cost, making the total cost $O(N)$. Similarly, the breadth-first traversal and recoloring in the last step cost $O(N)$. Thus, the entire algorithm has a cost of $O(N)$.
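The gain-tracking greedy can be sketched on the simplest dependency tree, a chain, where the feasible colorings under the partial order constraint are exactly flood prefixes (the sketch and reward values are ours, for illustration):

```python
def greedy_chain_inference(node_reward, edge_reward):
    """Greedy class inference in the spirit of Algorithm 3, on a
    chain 0 -> 1 -> ... -> N-1 (node 0 is the lowest leaf).
    node_reward[n][c]: reward of class c at node n (e.g. log P(x|y));
    edge_reward[n][(cp, cc)]: reward of the edge into node n when its
    parent has class cp and node n has class cc.
    Start all-dry (class 0), flip nodes to flood (class 1) in order,
    track the cumulative reward gain, and keep the best prefix."""
    N = len(node_reward)
    gain, best_gain, best_k = 0.0, 0.0, 0
    for k in range(N):                   # flip node k from dry to flood
        g = node_reward[k][1] - node_reward[k][0]
        if k > 0:                        # parent k-1 is already flood
            g += edge_reward[k][(1, 1)] - edge_reward[k][(1, 0)]
        if k + 1 < N:                    # child k+1 is still dry
            g += edge_reward[k + 1][(1, 0)] - edge_reward[k + 1][(0, 0)]
        gain += g
        if gain > best_gain:
            best_gain, best_k = gain, k + 1
    return [1] * best_k + [0] * (N - best_k)
```

Because each flip touches only one node reward and its two incident edge rewards, every gain update is constant time, matching the $O(N)$ analysis above.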
4 Experimental Evaluation
In this section, we compared our proposed method with baseline methods on both a synthetic dataset and two real-world datasets in terms of classification performance. We also evaluated the computational scalability of our method on synthetic data with different sizes. Experiments were conducted on a Dell workstation with an Intel(R) Xeon(R) CPU E5-2687W v4 @ 3.00GHz and 64GB main memory. Candidate classification methods for comparison include:

Non-spatial classifiers with raw features: We tested decision tree (DT), random forest (RF), maximum likelihood classifier (MLC), and gradient boosted tree (GBM) in R packages on raw features (red, green, and blue spectral bands).
Non-spatial classifiers with preprocessed features: We tested DT, RF, and MLC with an additional elevation feature (elev.). We do not include GBM due to space limits.

Non-spatial classifiers with post-processing label propagation (LP): We also tested DT, RF, and MLC on raw features but with post-processing on predicted classes via label propagation [26]. We used a 4-neighborhood. We do not include GBM due to space limits.

Transductive SVM: Since our method utilizes features of test samples, we included Transductive SVM (SVM-Light [10]), a semi-supervised transductive method, for fair comparison.

Hidden Markov Tree (HMT): We implemented our HMT method in C++.
Unless specified otherwise, we used default parameters in open source tools for baseline methods.
4.1 Synthetic Data
We first evaluated our proposed approach on synthetic data. Specifically, we generated a regular grid of pixels. Elevations and classes (flood, dry) of pixels are shown in Figure 5(a-b). Feature values of pixels in the two classes follow two one-dimensional Gaussian distributions. To reflect the spatial autocorrelation effect, we generated one common feature value for each group of contiguous pixels at a coarse resolution, as shown in Figure 5(c). Training samples from the two classes were generated based on the two Gaussian distributions of feature values.

Computational scalability: We measured the computational time costs of different components of our HMT algorithms on varying sizes of the study area. We fixed the number of iterations when running the algorithms on different data sizes, and the time costs were measured as the average over repeated runs. Figure 6 shows the time costs of tree construction (Algorithm 1), parameter learning (Algorithm 2), and class inference (Algorithm 3) respectively. We can see that as the number of pixels increases, the time costs of all algorithms increase. The parameter learning part takes the vast majority of the time. Its time cost increases linearly with data size, because the message propagation in each iteration is done through tree traversal operations, which have linear time complexity. Overall, our algorithms cost less than 5 minutes on synthetic data with 20 million samples.
Classification performance: We compared the F-scores of different methods on test pixels under different parameter settings of synthetic data generation. We excluded preprocessing and post-processing methods because our synthetic data generation cannot simulate real feature textures. In the first setting, we conducted a comparison with varying numbers of training pixels. Results in Figure 7(a) showed that the classification performance of different methods was relatively stable (easily reaching a plateau) across training set sizes. The reason is probably that the one-dimensional Gaussian distributions of feature values in the two classes were very easy to learn. In the second setting, we fixed the other parameters and varied the standard deviations of feature values in the two classes. The higher the standard deviations, the more confusion (Bayes error) there was between the two classes. Results in Figure 7(b) showed that as the standard deviations increased, the classification performance of all methods degraded, but our HMT model persistently outperformed the baseline methods, due to incorporating anisotropic spatial dependency across locations.

4.2 Hurricane Matthew Floods 2016
Here we validated our approach on flood inundation extent mapping during Hurricane Matthew, NC, 2016. We used high-resolution aerial imagery from the NOAA National Geodetic Survey [14] as explanatory features (three spectral band features: red, green, and blue), and a digital elevation map from the University of North Carolina Libraries [16]. All imagery data were resampled to a resolution of 2 meters. A test region with 1743 by 1349 pixels was used. A training set with 10000 pixels (5000 dry, 5000 flood) was manually labeled outside the test region, and 94608 test pixels (47092 dry, 47516 flood) were labeled within the test region.
Classifiers  Class  Precision  Recall  F  Avg. F 

DT+Raw  Dry  0.62  0.84  0.71  0.65 
Flood  0.76  0.48  0.59  
RF+Raw  Dry  0.59  0.96  0.73  0.61 
Flood  0.90  0.33  0.49  
GBM+Raw  Dry  0.69  0.76  0.72  0.71 
Flood  0.74  0.67  0.70  
MLC+Raw  Dry  0.64  0.93  0.76  0.69 
Flood  0.88  0.48  0.62  
DT+elev.  Dry  0.99  0.55  0.71  0.76 
Flood  0.69  0.99  0.82  
RF+elev.  Dry  0.99  0.66  0.79  0.82 
Flood  0.74  0.99  0.85  
MLC+elev.  Dry  0.84  0.90  0.87  0.87 
Flood  0.89  0.84  0.86  
DT+LP  Dry  0.61  0.92  0.74  0.65 
Flood  0.85  0.43  0.57  
RF+LP  Dry  0.57  0.99  0.72  0.57 
Flood  0.99  0.26  0.42  
MLC+LP  Dry  0.64  0.97  0.77  0.69 
Flood  0.95  0.46  0.62  
MRF  Dry  0.62  0.99  0.76  0.67 
Flood  0.98  0.41  0.58  
TSVM  Dry  0.62  0.86  0.72  0.66 
Flood  0.78  0.49  0.60  
HMT  Dry  0.93  0.99  0.96  0.96 
Flood  0.99  0.93  0.96 
Classification performance comparison: We compared different methods on their precision, recall, and F-score for both the flood class and the dry class. Results are summarized in Table 1. We can see that decision tree, random forest, and maximum likelihood classifier all performed poorly on raw features, with overall F-scores less than 0.7. Adding post-processing through label propagation slightly impaired performance. For example, after adding post-processing (LP) to decision tree, the recall of the dry class improved but the recall of the flood class worsened, probably due to over-smoothing of correctly classified flood pixels into the dry class. Markov random field and Transductive SVM had results comparable to decision tree. In contrast, adding the elevation feature improved the overall classification performance dramatically for decision tree, random forest, and maximum likelihood classifier. The reason is that most flood pixels have lower elevations than dry pixels. However, the performance on the dry class was still quite inferior (0.86 to 0.9 in F-score) compared with our hidden Markov tree (0.95 in F-score), probably because classification models learned from absolute elevation values cannot perfectly transfer to the test region.
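Each per-class F-score in Table 1 is the harmonic mean of that class's precision and recall; a quick check against the DT+Raw and HMT dry rows:

```python
def f_score(precision, recall):
    """Per-class F-score: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# DT+Raw, dry class: precision 0.62, recall 0.84 -> F = 0.71
# HMT, dry class: precision 0.93, recall 0.99 -> F = 0.96
```

The "Avg. F" column is then the mean of the two per-class F-scores.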
Visualizations of some classification results are shown in Figure 8. The spectral features and elevation values are shown in Figure 8(a-b). Results of decision tree are in Figure 8(c); it only identified flood pixels on open water surfaces, and mistakenly classified the vast majority of flood pixels below tree canopies (the spectral features of trees indicate the dry class if spatial dependency with nearby pixels is not considered). In contrast, our HMT model correctly identified most of the flood pixels, even where the flood water was below tree canopies. The reason is that our HMT incorporates the anisotropic spatial dependency across pixel locations (if a location is flood, its nearby lower locations should also be flood).
Sensitivity of HMT to initial parameters: We conducted a sensitivity analysis of our HMT model with respect to different initial values of the prior class probability and the class transitional probability (the parameters of the class-conditional feature distributions were initialized by maximum likelihood estimation on the training set). First, we fixed the initial transitional probability and varied the initial prior probability over a range of values. Results on the converged parameter value together with the final F-score are shown in Figure 9(a-b). It can be seen that our HMT model was quite stable across different initial values. Similarly, we fixed the initial prior probability and varied the initial transitional probability. Results in Figure 9(c-d) showed the same trend. In practice, we can select a moderate initial prior probability and a relatively high initial transitional probability (because a flood pixel’s neighbors are very likely to be flood due to spatial autocorrelation).
Parameter iterations and convergence in HMT: Here we fixed the initial prior and transitional probabilities and measured the parameter iterations and convergence behavior. Our convergence threshold was set to a fixed small value. The values of the prior and transitional probabilities at each iteration are summarized in Figure 10 (we omit the Gaussian feature parameters because there were too many variables). The parameters converged after 10 iterations.
4.3 Hurricane Harvey Floods 2017
The high-resolution earth imagery we used was from Planet Labs, Inc., with red, green, and blue bands at 3 meter resolution, and the digital elevation data was from the Texas natural resource management department. We manually collected a training set with 5000 flood samples and 5000 dry samples. We selected a test scene with 4174 rows and 4592 columns, within which we manually labeled 74305 flood samples and 52658 dry samples as the test set.
We compared different methods on their precision, recall, and F-score for each class. Results are summarized in Table 2. We can see that decision tree, random forest, and maximum likelihood classifier all performed poorly on raw features, with overall F-scores less than 0.75. Adding the elevation feature improved the classification accuracy slightly, but the overall F-score was still below 0.8. The reason may be that the absolute elevation values in the training area are not as good an indicator of the two classes as in the test area. Similarly, adding label propagation in post-processing and the MRF model barely improved classification performance compared with decision tree and random forest on raw features, because the errors were mostly systematic instead of salt-and-pepper noise. TSVM performed even worse than the supervised methods without unlabeled samples, which was somewhat surprising. In contrast to the baseline methods, our hidden Markov tree achieved superior performance with around 0.95 F-score on both classes. Visualizations of some results are shown in Figure 11. We can see that our HMT model significantly outperformed decision tree on flood locations under tree canopies, similar to the results in Figure 8.
Classifiers  Class  Precision  Recall  F     Avg. F
DT+Raw       Dry    0.58       0.88    0.70  0.69
             Flood  0.87       0.56    0.68
RF+Raw       Dry    0.62       0.96    0.76  0.74
             Flood  0.95       0.59    0.73
GBM+Raw      Dry    0.56       0.80    0.66  0.66
             Flood  0.80       0.56    0.66
MLC+Raw      Dry    0.63       0.93    0.75  0.74
             Flood  0.93       0.61    0.73
DT+elev.     Dry    0.61       0.99    0.76  0.74
             Flood  0.99       0.56    0.72
RF+elev.     Dry    0.66       0.99    0.79  0.79
             Flood  0.99       0.65    0.78
MLC+elev.    Dry    0.65       0.97    0.78  0.77
             Flood  0.97       0.62    0.76
DT+LP        Dry    0.59       0.90    0.71  0.70
             Flood  0.89       0.56    0.69
RF+LP        Dry    0.63       0.97    0.76  0.75
             Flood  0.97       0.59    0.74
MLC+LP       Dry    0.63       0.94    0.76  0.75
             Flood  0.93       0.61    0.74
MRF          Dry    0.63       0.94    0.75  0.74
             Flood  0.94       0.61    0.74
TSVM         Dry    0.55       0.67    0.60  0.63
             Flood  0.72       0.61    0.66
HMT          Dry    0.91       0.98    0.94  0.95
             Flood  0.98       0.93    0.95
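The per-class precision, recall, and F-scores in Table 2, along with the macro-averaged F-score in the last column, can be computed as in the following sketch (function and variable names are our own):

```python
def per_class_scores(y_true, y_pred, classes=("Dry", "Flood")):
    """Per-class precision, recall, and F-score, plus the average
    F-score across classes, as reported in Table 2."""
    scores = {}
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        scores[c] = (prec, rec, f)
    avg_f = sum(f for _, _, f in scores.values()) / len(classes)
    return scores, avg_f

# Toy example: 4 dry and 4 flood test pixels, one error in each class.
truth = ["Dry"] * 4 + ["Flood"] * 4
pred = ["Dry", "Dry", "Dry", "Flood", "Flood", "Flood", "Dry", "Flood"]
scores, avg_f = per_class_scores(truth, pred)
```

Note that precision and recall are computed per class, so a classifier that labels almost everything dry (as several baselines nearly do) shows high dry recall but low flood recall, which is exactly the pattern visible in the table.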
5 Related Work
Over the years, various techniques have been developed to incorporate spatial properties into classification algorithms for earth imagery data. Many methods are based on pre-processing and post-processing, including neighborhood window filters [3, 6], spatial contextual variables and textures [17], spatial autocorrelation statistics [9], morphological profiling [1], spatial-spectral classifiers [21, 23], and object-based image analysis [7]. The Markov random field model explicitly captures spatial dependency, but the dependency is undirected [13]. [11] proposes a spatial classification model that captures directed spatial dependency between classes but assumes the dependency follows a total order. Deep learning methods have recently been applied to earth imagery classification [24], such as land cover mapping [8], target recognition, and scene identification. To the best of our knowledge, none of these existing works focus on incorporating anisotropic spatial dependency with partial order constraints, which is important in hydrological applications such as flood mapping.
Hidden Markov models have been extensively studied in the signal processing literature [18]. Learning and inference of hidden Markov models are often based on EM algorithms and message propagation (the sum-product algorithm) [12]. [19] proposes a message propagation algorithm called "upward-downward" on a dependence tree structure. [5] proposes a wavelet-domain hidden Markov tree to model dependency in the two-dimensional time-frequency plane. That model is used to characterize properties of wavelet coefficients in signal processing such as clustering, persistence, and compression, which is dramatically different from our HMT model, which captures spatial dependency in geographic space based on topography.
6 Conclusions and Future Works
In this paper, we propose the hidden Markov tree (HMT), an anisotropic spatial classification model for flood mapping on earth imagery. Compared with existing methods, our HMT model explicitly captures directed spatial dependency with a partial order constraint, which is reflected in a reverse tree structure in the hidden class layer. We also propose algorithms for reverse tree construction, parameter learning, and class inference. Evaluations on both synthetic and real-world data show that our HMT model is scalable to large data sizes and can utilize the partial order spatial dependency to reduce classification errors due to class confusion.
In future work, we plan to extend our HMT model to spatially non-stationary data. In this case, we need to generalize the tree structure in the hidden class layer into a polytree, with one subtree per spatial zone.
References
 [1] J. A. Benediktsson, J. A. Palmason, and J. R. Sveinsson. Classification of hyperspectral data from urban areas based on extended morphological profiles. IEEE Transactions on Geoscience and Remote Sensing, 43(3):480–491, 2005.
 [2] P. Brivio, R. Colombo, M. Maggi, and R. Tomasoni. Integration of remote sensing data and GIS for accurate mapping of flooded areas. International Journal of Remote Sensing, 23(3):429–441, 2002.
 [3] R. H. Chan, C.-W. Ho, and M. Nikolova. Salt-and-pepper noise removal by median-type noise detectors and detail-preserving regularization. IEEE Transactions on Image Processing, 14(10):1479–1485, 2005.
 [4] D. Cline. Integrated water resources science and services: an integrated and adaptive roadmap for operational implementation. Technical report, National Oceanic and Atmospheric Administration, 2009.
 [5] M. S. Crouse, R. D. Nowak, and R. G. Baraniuk. Wavelet-based statistical signal processing using hidden Markov models. IEEE Transactions on Signal Processing, 46(4):886–902, 1998.
 [6] S. Esakkirajan, T. Veerakumar, A. N. Subramanyam, and C. PremChand. Removal of high density salt and pepper noise through modified decision based unsymmetric trimmed median filter. IEEE Signal Processing Letters, 18(5):287–290, 2011.
 [7] G. Hay and G. Castilla. Geographic object-based image analysis (GEOBIA): A new name for a new discipline. In Object-Based Image Analysis, pages 75–89. Springer, 2008.
 [8] X. Jia, A. Khandelwal, G. Nayak, J. Gerber, K. Carlson, P. West, and V. Kumar. Incremental dual-memory LSTM in land cover prediction. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 867–876. ACM, 2017.
 [9] Z. Jiang, S. Shekhar, X. Zhou, J. Knight, and J. Corcoran. Focal-test-based spatial decision tree learning. IEEE Transactions on Knowledge and Data Engineering, 27(6):1547–1559, 2015.
 [10] T. Joachims. Making large-scale SVM learning practical. In B. Schölkopf, C. Burges, and A. Smola, editors, Advances in Kernel Methods - Support Vector Learning, chapter 11, pages 169–184. MIT Press, Cambridge, MA, 1999.
 [11] A. Khandelwal, V. Mithal, and V. Kumar. Post classification label refinement using implicit ordering constraint among data instances. In 2015 IEEE International Conference on Data Mining, ICDM 2015, Atlantic City, NJ, USA, November 14-17, 2015, pages 799–804, 2015.
 [12] F. R. Kschischang, B. J. Frey, and H.-A. Loeliger. Factor graphs and the sum-product algorithm. IEEE Transactions on Information Theory, 47(2):498–519, 2001.
 [13] S. Z. Li. Markov random field modeling in image analysis. Springer Science & Business Media, 2009.
 [14] National Oceanic and Atmospheric Administration. Data and imagery from NOAA's National Geodetic Survey. https://www.ngs.noaa.gov.
 [15] National Oceanic and Atmospheric Administration. National Water Model: Improving NOAA's Water Prediction Services. http://water.noaa.gov/documents/wrnnationalwatermodel.pdf, 2018.
 [16] NCSU Libraries. LIDAR Based Elevation Data for North Carolina. https://www.lib.ncsu.edu/gis/elevation, 2018.
 [17] A. Puissant, J. Hirsch, and C. Weber. The utility of texture analysis to improve per-pixel classification for high to very high spatial resolution imagery. International Journal of Remote Sensing, 26(4):733–745, 2005.
 [18] L. R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257–286, 1989.
 [19] O. Ronen, J. Rohlicek, and M. Ostendorf. Parameter estimation of dependence tree models using the EM algorithm. IEEE Signal Processing Letters, 2(8):157–159, 1995.
 [20] R. Szeliski, R. Zabih, D. Scharstein, O. Veksler, V. Kolmogorov, A. Agarwala, M. Tappen, and C. Rother. A comparative study of energy minimization methods for Markov random fields. In European Conference on Computer Vision, pages 16–29. Springer, 2006.
 [21] Y. Tarabalka, J. A. Benediktsson, and J. Chanussot. Spectral–spatial classification of hyperspectral imagery based on partitional clustering techniques. IEEE Transactions on Geoscience and Remote Sensing, 47(8):2973–2987, 2009.
 [22] The Middlebury Computer Vision Pages. C++ Source Code of MRF. http://vision.middlebury.edu/MRF/code/, 2018.
 [23] L. Wang, S. Hao, Q. Wang, and Y. Wang. Semi-supervised classification for hyperspectral imagery based on spatial-spectral label propagation. ISPRS Journal of Photogrammetry and Remote Sensing, 97:123–137, 2014.
 [24] L. Zhang, L. Zhang, and B. Du. Deep learning for remote sensing data: A technical tutorial on the state of the art. IEEE Geoscience and Remote Sensing Magazine, 4(2):22–40, 2016.
 [25] X. Zhu. Semi-supervised learning literature survey. 2005.
 [26] X. Zhu and Z. Ghahramani. Learning from labeled and unlabeled data with label propagation. 2002.
Appendix A Proof of Theorems
Theorem 1.
The unnormalized marginal posterior distribution of the class of a leaf node, as well as of the classes of a non-leaf node and its parents, can be computed by (9) and (10), respectively. The normalized marginal posterior distributions can be computed by (11) and (12), respectively.
(18) 
(19) 
(20) 
(21) 
Proof.
The detailed proof is given below; some symbols are defined in Table 3.
Symbol  Description 

All nodes in the subtree rooted at node except for the root node  
All nodes in the subtree rooted at node  
All nodes in the reverse tree excluding  
All nodes in the reverse tree excluding 
Forward message propagation: the statistical meanings of the messages are as below.
(22)  
(23) 
Here, the former denotes all nodes in the subtree rooted at a given node except for that node itself (i.e., all nodes visited before the node in its subtree during bottom-up propagation), and the latter denotes all nodes in the subtree rooted at the node.
Base case: if a node is a leaf node,
(24)  
(25) 
Induction step: consider an internal node and suppose Equations (22) and (23) hold for all of its children. Then, following the topological order, consider each child of the node.
(26) 
(27) 
Thus, Equations (22) and (23) hold for internal nodes, and hence for all nodes.
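For intuition, the bottom-up pass can be sketched as a recursion over the tree. This is a generic illustration in our own notation (`children`, `emis`, `trans`, and `beta` are assumptions, not the paper's symbols), rather than the exact messages of Equations (22) and (23):

```python
import numpy as np

def upward_messages(children, emis, trans):
    """Bottom-up (upward) pass over a tree of class variables.

    beta[n][c] is proportional to the likelihood of all observations in
    the subtree rooted at node n, given that node n is in class c.
    `children[n]` lists the children of n, `emis[n][c]` is the emission
    likelihood P(x_n | y_n = c), and `trans[c_child, c_parent]` is the
    class transition probability P(y_child = c_child | y_parent = c_parent).
    """
    beta = {}

    def visit(n):
        msg = emis[n].copy()
        for ch in children.get(n, []):
            visit(ch)
            # Marginalize the child's class out of its subtree likelihood:
            # for each parent class c, sum_c' trans[c', c] * beta[ch][c'].
            msg *= trans.T @ beta[ch]
        beta[n] = msg

    visit(0)  # node 0 is the root in this toy example
    return beta

# Toy two-node tree (root 0 with a single leaf child 1), binary classes.
children = {0: [1]}
emis = {0: np.array([0.5, 0.5]), 1: np.array([0.8, 0.2])}
trans = np.array([[0.9, 0.2],   # each column sums to 1 over the child class
                  [0.1, 0.8]])
beta = upward_messages(children, emis, trans)
```

The induction in the proof corresponds to the recursion here: a leaf's message is just its emission likelihood (the base case), and an internal node multiplies its own emission likelihood by one marginalized message per child (the induction step).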
Backward message propagation: the statistical meanings of the messages are as below.
(28)  
(29) 
Here, the former denotes all nodes visited before a given node during top-down propagation in the reverse tree, and the latter denotes all nodes visited up to and including the node during top-down propagation.
Base case: if a node is the root node,
(30)  
(31) 
Induction step: consider an internal node and suppose Equations (28) and (29) hold for the nodes already visited. Then, following the topological order, consider one of the parents of the node.
(32) 
(33) 
Thus, Equations (28) and (29) hold for internal nodes, and hence for all nodes.
For the unnormalized marginal posterior distributions, Equations (18) and (21) are proved as below. For simplicity, we omit the parameters on the right side of the following equations.
(34) 
(35) 
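In generic upward-downward notation (our symbols, since the paper's were lost in extraction: $\beta_n(c)$ for the upward message and $\alpha_n(c)$ for the downward message at node $n$ with class $c$), the standard form in which such proofs conclude is that the unnormalized marginal posterior is the product of the two messages, and normalization divides by the sum over classes:

```latex
\hat{P}(y_n = c \mid \mathbf{X}) \;\propto\; \alpha_n(c)\,\beta_n(c),
\qquad
P(y_n = c \mid \mathbf{X}) \;=\;
\frac{\alpha_n(c)\,\beta_n(c)}
     {\sum_{c'} \alpha_n(c')\,\beta_n(c')}.
```

This mirrors the forward-backward decomposition in chain-structured hidden Markov models [18] and the upward-downward algorithm on trees [19]; the paper's Equations (34) and (35) instantiate it for the reverse tree structure.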