Geographical Hidden Markov Tree for Flood Extent Mapping (With Proof Appendix)

05/24/2018 ∙ by Miao Xie, et al. ∙ 0

Flood extent mapping plays a crucial role in disaster management and national water forecasting. Unfortunately, traditional classification methods are often hampered by the existence of noise, obstacles and heterogeneity in spectral features as well as implicit anisotropic spatial dependency across class labels. In this paper, we propose geographical hidden Markov tree, a probabilistic graphical model that generalizes the common hidden Markov model from a one dimensional sequence to a two dimensional map. Anisotropic spatial dependency is incorporated in the hidden class layer with a reverse tree structure. We also investigate computational algorithms for reverse tree construction, model parameter learning and class inference. Extensive evaluations on both synthetic and real world datasets show that proposed model outperforms multiple baselines in flood mapping, and our algorithms are scalable on large data sizes.



There are no comments yet.


page 15

page 18

page 22

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Flood extent mapping plays a crucial role in addressing grand societal challenges such as disaster management, national water forecasting, as well as energy and food security. For example, during Hurricane Harvey floods in 2017, first responders needed to know where flood water was in order to plan rescue efforts. In national water forecasting, detailed flood extent maps can be used to calibrate and validate the NOAA National Water Model [15], which can forecast the flow of over 2.7 million rivers and streams through the entire continental U.S. [4].

In current practice, flood extent maps are mostly generated by flood forecasting models, whose accuracy is often unsatisfactory in high spatial details [4]. Other ways to generate flood maps involve sending field crew on the ground to record high-water marks, or visually interpreting earth observation imagery [2]. However, the process is both expensive and time consuming. With the large amount of high-resolution earth imagery being collected from satellites (e.g., DigitalGlobe, Planet Labs), aerial planes (e.g., NOAA National Geodetic Survey), and unmanned aerial vehicles, the costs of manually labeling flood extents become prohibitive.

The focus of this paper is to develop a classification model that can automatically classify earth observation imagery pixels into flood extent maps. The results can be used by first responders to plan rescue efforts, by hydrologists to calibrate and validate water forecasting models, as well as by insurance companies to process claims. Specifically, we can utilize a small set of manually collected ground truth (flood and dry locations) in one earth imagery to learn a classification model. Then the model can be used to classify flood pixels in other imagery where ground truth is not available.

However, flood mapping poses several unique challenges that are not well addressed in traditional classification problems. First, data contains rich noise and obstacles. For example, high-resolution earth imagery often has noise, clouds and shadows. The spectral features of these pixels cannot be used to distinguish classes. Second, class confusion exists due to heterogeneous features. For instance, pixels of tree canopies overlaying flood water have the same spectral features with those trees in dry areas, yet their classes are different. Third, implicit directed spatial dependency exists between flood class locations. Specifically, due to gravity, flood water tends to flow to nearby lower locations following topography. Such dependency is not uniform in all directions (anisotropic). Finally, the data volume is huge in high-resolution imagery (e.g., hundreds of millions of pixels in one city), requiring scalable algorithms.

To address these challenges, we propose a novel spatial classification model called geographical hidden Markov tree (HMT). It is a probablistic graphical model that generalizes the common hidden Markov model (HMM) from a one-dimensional sequence to a two dimensional geographical map. Specifically, the hidden class layer contains nodes (pixels) in a reverse tree structure to represent anisotropic spatial dependency with a partial order constraint. Each hidden class node has an associated observed feature node for the same pixel. Such a unique model structure can potentially reduce classification errors due to noise, obstacles, and heterogeneity among spectral features of individual pixels.

We further investigate computational algorithms for reverse tree construction, model parameter learning, and class inference. Specifically, reverse tree is constructed following topological orders based on elevations. In order to learn model parameters given a hidden class layer, we utilize the EM algorithm with message propagation along the reverse tree. Finally, for class inference, we design a greedy algorithm that assign class labels for tree nodes to maximize overall probability following the partial order constraint.

In summary, we make the following contributions:

  • We propose a novel geographical hidden Markov tree (HMT) model that incorporates partial order class dependency in a reverse tree structure in a hidden class layer. Unlike existing hidden Markov trees [5] which model dependency in two-dimensional time-frequency domain for signal processing, our geographical HMT captures anisotropic (directed) spatial dependency with a partial order constraint.

  • We design efficient algorithms for reverse tree construction, model parameter learning and class inference.

  • We conduct theoretical analysis on the correctness and time complexity of HMT algorithms.

  • We evaluate proposed model in both synthetic and real world datasets for flood mapping. Results show that proposed model outperforms multiple baseline methods in flood mapping, and our algorithms are scalable for large data sizes.

2 Problem Statement

2.1 Preliminaries

A spatial raster framework is a tessellation of a two dimensional plane into a regular grid of cells. Spatial neighborhood relationship exists between cells based on cell adjacency. The framework can consist of non-spatial explanatory feature layers (e.g., spectral bands in earth imagery), one spatial contextual layer (e.g., elevation), and one class layer (e.g., flood, dry).

Each cell in a raster framework is a spatial data sample, noted as , where ,

is a vector of non-spatial explanatory feature values with each element corresponding to one feature layer,

is a cell’s value in the spatial contextual layer, and is a binary class label.

A raster framework with all samples is noted as , non-spatial explanatory features of all samples are noted as , the spatial contextual layer is noted as , and the class layer is noted as .

Due to physics, spatial dependency exists between cells based on their values in the spatial contextual layer. Such dependency is often non-uniform in different directions (anisotropic). For example, due to gravity, flood water can only flow to neighboring cells with lower elevation values.

Anisotropic dependency often follows a partial order constraint. Formally, assuming the spatial contextual layer is a potential field (e.g., elevation), a partial order dependency exists if and only if there exist a sequence of neighboring (adjacent) cells such that and for any .

Figure 1(a) shows an illustrative example with eight spatially adjacent cell samples in one dimensional space. Due to gravity, if cell is flood, its nearby cells with lower elevations including should also be flood, even if their feature values indicate otherwise. Thus, we can establish partial order spatial dependency between cell locations such as , .

(a) Eight consecutive sample locations in one dimensional space
(b) Partial order constraint in a reverse tree
Figure 1: Illustration of partial order class dependency

Partial order dependency across all pairs of samples in a raster framework can be represented by a reverse tree structure, which is called spatial dependency (reverse) tree or dependency tree. We sometimes omit the word “reverse” for simplicity. The tree structure removed some redundant dependency between cell locations. Due to the reverse nature, a tree node can have at most one child , but multiple parents and multiple siblings , where represents a tree edge from a parent to a child.

Figure 1(b) shows an example of dependency tree corresponding to samples in Figure 1(a). Class dependency is redundant given dependency and . It is worth noting that we assume an arbitrary order when comparing nodes with the same elevation values. For instance, if node and node had the same elevation, the top of the tree could be either or .

2.2 Formal problem definition

We now formally define the problem.

Spatial raster framework
Explanatory features of samples
Spatial contextual layer (elevation) of samples:
Training samples
Output: A spatial classification model
Objective: minimize classification errors
Explanatory feature layers contain noise and obstacles
Partial order dependency exists between sample classes based on spatial contextual layer
Sample class is binary,

3 Proposed Approach

In this section, we start with overview of our hidden Markov tree model and its probabilistic formulation. We then introduce specific algorithms for dependency tree construction, model parameter learning and class inference.

3.1 Overview of Hidden Markov Tree

We propose a hidden Markov tree (HMT) model, which generalizes the common hidden Markov model from a total order chain structure to a partial order (reverse) tree structure. As illustrated in Figure 2, a HMT model consists of two layers: a hidden layer of sample classes (e.g., flood, dry), and an observation layer of sample feature vectors (e.g., spectral vectors). Each node corresponds to a spatial data sample (raster cell). Edge directions show probabilistic conditional dependence structure. Specifically, the model assumes that feature vectors of different samples are conditionally independent with each other given their classes, and sample classes follow a partial order dependency in a reverse tree structure.

Figure 2: Illustration of hidden Markov tree framework

Hidden Markov tree is a probabilistic graphic model. The joint distribution of all samples’ features and classes can be expressed as Equation 

1, where is the set of parent samples of the th sample in the dependency tree, and is the set of class nodes corresponding to parents of the th sample. For a leaf node , , and .


The conditional probability of sample feature vector given its class can be assumed i.i.d. Gaussian for simplicity, as shown in Equation 2, where and are the mean and covariance matrix of feature vector for class (). It is worth noting that could be more general than i.i.d. Gaussian.


Class transitional probability follows the partial order constraint. For example, due to gravity, if any parent’s class is dry, the child’s class must be dry; if all parents’ classes are flood, then the child has a high probability of being flood. Consider flood as the positive class (class value ) and dry as the negative class (class value ), the transitional probability is actually conditioned on the product of parent classes . The formula is in Equation 3, where is the probability of a child in class given all parents in class (note that we assume ). In other words, if any parent is in class (), the current node must also be in class (); if all parents are in class (), then the current node has a probability of being in class .


For a leaf node , . The transitional probability is degraded into simple class probability , where is the probability of being in class .

Though we introduce our HMT in the context of flood mapping, the model can potentially be used for a broad class of spatial classification problems in which class labels follow a partial order dependency. Examples include predicting pollutants in river stream networks and traffic congestion in road networks.

3.2 Dependency Tree Construction

Given geopotential field values (e.g., elevation) of all cells in a raster framework, the goal is to produce a partial order class dependency tree, in which each node corresponds to the class label of a cell. The process is computationally challenging due to the large number of cells (tree nodes) in real world high-resolution earth imagery (e.g., hundreds of millions of pixels).

0:    A raster framework of samples: A spatial contextual layer of samples:
0:    A spatial dependency tree
1:  Initialize all samples as unvisited
2:  Sort all samples by increasing values
3:  for each sample in an ascending order of  do
4:     Mark as visited
5:     Create a new tree node of
6:     if there exists unvisited neighbor of  then
7:        for each unvisited neighbor of  do
8:           Traverse from node to the rear of its tree branch
9:           Attach node to the rear if not have done so
10:     else
11:        Create a tree branch starting from node as a leaf
12:  return the root node of dependency tree
Algorithm 1 Spatial Dependency Tree Construction

To address the challenge, we propose an algorithm that constructs the tree by adding nodes in topological order. Details are in Algorithm 1. The algorithm starts with an empty tree and an empty set of visited cells (all cells are unvisited, step 1). It sorts all cells by their geopotential field (elevation) values (step 2). After this, unvisited cells are added into the tree (i.e., become visited) one by one following an ascending order of geopotential. Specifically, for each cell, the algorithm first marks it as visited (step 4), creates a tree node for the cell (step 5), and attach the tree node to the rear of the tree branch following every visited neighbor of the the cell (steps 6 to 9). If no neighbor of the cell is visited, the cell is a local minimum in geopotential field, and the algorithm creates a new tree branch starting from the node of the cell (steps 10 to 11).

We now use the example of Figure 1 to illustrate the algorithm execution trace. The example contains cells in one dimensional space, but generalization to the case of two dimensional space is trivial. The input contains eight cells from to . The algorithm first sorts these cells by ascending order of elevation, and gets a sequence . Then, leaf nodes are created for and , since none of their neighbors are visited by then. Next, when adding , its neighbor is visited, so the algorithm adds node to the rear of the branch following . Similarly, node and are attached to the two branches respectively. When adding the node for , both of its neighbors are visited, so is attached to the rear of both branches. After this, nodes and are added consecutively.

Time complexity analysis: Algorithm 1 involves a one-time sorting of cells, which is . Then, for each of the cells, the main operation is to attach the cell to the rear of the branches of its visited neighbors. A naive implementation will cost , making the total cost . A smarter way to do this is to maintain a rear node pointer for each branch when it is created (i.e., when a leaf node is added). Assuming that geopotential field values on neighboring cells are contiguous (this is often true since real world elevation of nearby locations do not change suddenly), finding the rear of a neighboring cell’s branch is within a constant cost, making the total time cost (cost after sorting is linear).

(a) From leaves to root
(b) From root to leaves
Figure 3: Illustration of message propagation in a HMT

3.3 Model Parameter Learning

The parameters of hidden Markov tree include the mean and covariance matrix of sample features in each class, prior probability of leaf node classes, and class transition probability for non-leaf nodes. We denote the entire set of parameters as

. Learning the set of parameters poses two major challenges: first, there exist unknown hidden class variables , which are non-i.i.d.; second, the number of samples (nodes) is huge (up to hundreds of millions of pixels).

To address these challenges, we propose to use the expectation-maximization (EM) algorithm and message (belief) propagation. Our EM-based approach has the following major steps:

  1. Initialize parameter set

  2. Compute posterior distribution of hidden classes:

  3. Compute posterior expectation of log likelihood:

  4. Update parameters:

    Return if it’s converged, otherwise goto (b)

Among the four steps above, step (b) that computes the joint posterior distribution of all sample classes is practicallly infeasible due to the large number of hidden class nodes that are non-i.i.d. Fortunately, it is not necessary to compute the entire joint posterior distribution of all sample classes . In fact, we only need the marginal posterior distribution of a node’s and its parents’ classes for non-leaf nodes, as well as the marginal posterior distribution of a node’s class for leaf nodes. The reason can be explained through the expression of the posterior expectation of log likelihood in Equation 4.


Note that for leaf node, , and the last term in the last line of above equation is degraded, .

To compute the marginal posterior distribution and , we propose to use the message propagation method based on the sum and product algorithm [12, 19].

Figure 3 illustrates the recursive message propagation process on our HMT model. Specifically, forward message propagation from leaves to root is based on Equation 5 and Equation 6, where and are the incoming message into and outgoing message from a hidden class node respectively.


Backward message propagation from root to leaves also follows a recursive process, as shown in Equation 7 and Equation 8, where and are the incoming and outgoing messages for class node respectively. The main difference from forward propagation is that when computing incoming message , we need to multiply not only outgoing message from a child node and class transitional probability, but also outgoing messages from sibling nodes in the forward propagation (also illustrated in Figure 3(b)).


After both forward and backward message propagation, we can compute marginal posterior distribution of hidden class variables based on the following theorem.

Theorem 1.

The unnormalized marginal posterior distribution of the class of a leaf node, as well as the classes of a non-leaf node with parents can be computed by (9) and (10) respectively. Their normalized marginal posterior distributions can be computed by (11) and (12) respectively.


Detailed proof is in the Appendix at the end of this paper. ∎

After computation of marginal posterior distribution, we can update model parameters by maximizing the posterior expectation of log likelihood (the maximization or M step in EM). Taking the marginal posterior distributions in (11) and (12) above as well as parameters for probabilities in (2) and (3) into the posterior expectation of log likelihood in (4), we can easily get the following parameter update formulas.


Algorithm 2

summarizes the entire parameter learning process. First, we initialize the set of parameters either with random values within reasonable range or with initial estimates based on training samples (e.g., the mean and covariance of features in each class). After parameters are initialized, the algorithm starts the iteration till parameters converge. In each iteration, it propagates messages first from leaves to root (steps 4-5) and then from root to leaves (steps 6-7). Marginal posterior distribution of node classes are then computed (steps 8-9). Based on this, the algorithm updates parameters (step 10).

0:    : cell sample feature matrix : a reverse tree for spatial dependency : parameter convergence threshold
0:    : set of model parameters
1:  Initialize ,
2:  while  do
4:     for each  from leaf to root do
5:        Compute messages by (5)-(6)
6:     for each  from root to leaf do
7:        Compute messages by (7)-(8)
8:     for each  do
9:        // Compute marginal distributions: by (9)-(12)
10:     Update based on marginal distributions: by (13)-(16)
11:  return  
Algorithm 2 EM Algorithm for Hidden Markov Tree

Time complexity: The cost of Algorithm 2 mainly comes from the iterations. In each iteration, message propagation is done through tree traversal, which costs ( is the total number of samples or tree nodes). It can also be seen easily that marginal probability computation and parameter update both have costs of . Thus, the total cost is , where is the number of iterations.

Is the model unsupervised or semi-supervised? From discussions above, it is possible to learn HMT parameters in an unsupervised manner without training class labels. However, this relies on strong assumptions on data distributions. Particularly, it requires samples in different classes to be somehow distinguishable merely based on their feature distribution (), since class transitional probability in dependency tree only enforces a partial order constraint between class nodes. This assumption can be violated in many real world applications where different classes cannot be easily distinguished via unsupervised feature clustering. In such cases, we can utilize training samples with class labels to initialize parameters of , i.e., , by maximum likelihood estimation. In this way, initialized probability can better estimate class marginal distribution before iterations. This makes the model learning semi-supervised [25].

3.4 Class Inference

After learning all model parameters, we can infer hidden class variables by maximizing the overall probability. In the traditional hidden Markov model, inference on hidden variables are done through the Viterbi algorithm [18] based on dynamic programming. However, its computational cost is still very high for a large number of nodes (e.g., up to hundreds of millions). To address this challenge, we propose a greedy algorithm that guarantees correctness based on the partial order class constraint. Taking the logarithm of joint probability in Equation 1, we have the objective function below.


The goal of class inference is to assign a class label to each tree node such that the overall sum of log probability terms is maximized. Each term in the summation can be considered as a reward. For instance, is the reward for assigning class to node (i.e., node reward), is the reward for assigning class and to node and its parents respectively (i.e., edge reward). Thus, class inference in HMT becomes a node coloring problem. Our goal is to find a node coloring to maximize the overall sum of rewards. In addition, the color must follow a partial order constraint, e.g., dry nodes cannot follow flood nodes, because otherwise, .

Figure 4: Illustration of class inference process

Therefore, we can enumerate all feasible node coloring through one bottom-up tree traversal, as described in Algorithm 3. We can initialize all node color as class (negative class, e.g., dry), and gradually changed node colors from class to class from leaves to the root. When we change the color of a node, only the reward of the node itself, as well as the rewards of edges between the nodes to its parents and child will be updated, as illustrated in Figure 4. Thus, we can easily compute the gain of rewards when updating node colors (), and maintain the current cumulative gain () as well as the maximum cumulative gain () that we’ve come across till the node so far. When we reach the root node, the maximum overall gain of rewards has been recorded. We can traverse the tree again to find its corresponding node coloring.

0:    : reverse tree for spatial dependency : set of model parameters
0:    : inferred classes for all hidden nodes
1:  Initialize for
2:  Initialize for
3:  Initialize for
4:  for each node in topological order from leaf to root do
6:     // for leaf node
9:     if  then
11:  Do breadth first tree traversal to find the frontier of maximum
12:  Set for nodes above the frontier
13:  return  , the class labels of all nodes
Algorithm 3 Class Inference for Hidden Markov Tree

Time complexity analysis: The initialization steps cost , where is the number of samples or tree nodes. Each iteration of the for loop has a constant cost, making the total cost . Similarly, the breadth first traversal and re-coloring in last step cost . Thus, the entire algorithm has a cost of .

4 Experimental Evaluation

In this section, we compared our proposed method with baseline methods on both synthetic dataset and two real world datasets in classification performance. We also evaluated the computational scalability of our method on synthetic data with different sizes. Experiments were conducted on a Dell workstation with Intel(R) Xeon(R) CPU E5-2687w v4 @ 3.00GHz, 64GB main memory. Candidate classification methods for comparison include:

  • Non-spatial classifiers with raw features

    : We tested decision tree (


    ), random forest (

    RF), maximum likelihood classifier (MLC

    ), and gradient boosted tree (

    GBM) in R packages on raw features (including red, green, blue spectral bands respectively).

  • Non-spatial classifiers with preprocessed features: We tested DT, RF and MLC with additional elevation feature (elev.) We do not include GBM due to space limit.

  • Non-spatial classifier with post-processing label propagation (LP): We also tested DT, RF and MLC on raw features but with post-processing on predicted classes via label propagation [26]. We used 4-neighborhood. We do not include GBM due to space limit.

  • Transductive SVM: Since our method utilizes features of test samples, we included Transductive SVM (SVM-Light [10]), a semi-supervised tranductive method for fair comparison.

  • Markov random field (MRF): We used open source implementation [22] based on the graph cut method [20].

  • Hidden Markov Tree (HMT): We implemented our HMT method in C++.

Unless specified otherwise, we used default parameters in open source tools for baseline methods.

4.1 Synthetic Data

We first evaluated our proposed approach on synthetic data. Specifically, we generated a regular grid with by pixels. Elevations and classes (flood, dry) of pixels are shown in Figure 5

(a-b). Feature values of pixels in two classes follow two one-dimensional Gaussian distributions with means

and standard deviations

. To reflect the spatial autocorrelation effect, we generated one common feature value for a group of contiguous pixels in a coarse resolution (), as shown in Figure 5(c). Training samples from two classes were generated based on the two Gaussian distributions of feature values.

(a) Elevation map
(b) Class map (red for flood, blue for dry)
(c) Feature map
Figure 5: Illustration of synthetic data (best viewed in color)

Computational scalability: We measured the computational time costs of different components in our HMT algorithms on varying sizes of study area (from around million pixels to around million pixels). We also fixed the number of iterations as when running algorithms on different data sizes. The time costs were measured in the average of runs. Figure 6 shows the time costs of tree construction (Algorithm 1), parameter learning (Algorithm 2), and class inference (Algorithm 3) respectively. We can see that as the number of pixels increases, time costs of all algorithms are increasing. The parameter learning part takes the vast majority of time costs. Its time costs increase linearly with data sizes, because the message propagation in each iteration is done through tree traversal operations, which has a linear time complexity. Overall, our algorithms cost less than 5 minutes on a synthetic data with 20 million samples.

Figure 6: Computational time costs of HMT on different data sizes

Classification performance:

We compared the F-score of different methods on test pixels with different parameter settings of synthetic data generation. We exclude pre-processing and post-processing methods because our synthetic data generation cannot simulate the real feature textures. In the first setting, we conducted comparison on varying numbers of training pixels from

, , to . Results in Figure 7(a) showed that the classification performance of different methods were relatively stable (easily reaching plateau) for different training set sizes. The reason was probably that one dimensional Gaussian distributions on feature values in two classes were very easy to learn. In the second setting, we fixed other parameters and varied the standard deviations of feature values in two classes. The higher the values were, the more confusion (Bayes error) there were between two classes. Results of different methods in Figure 7(b) showed that as increase, the classification performance of all methods degraded, but our HMT model persistently outperformed other baseline methods, due to incorporating anisotropic spatial dependency across locations.

Figure 7: Classification performance comparison across methods on synthetic data

4.2 Hurricane Mathew Floods 2016

Here we validated our approach in flood inundation extent mapping during Hurricane Mathew, NC, 2016. We used high-resolution aerial imagery from NOAA National Geodetic Survey [14] as explanatory features (three spectral band features including red, green, blue), and digital elevation map from the University of North Carolina Libraries [16]. All imagery data were re-sampled into a resolution of 2 meters. A test region with 1743 by 1349 pixels was used. A training set with 10000 pixels (5000 dry and 5000 flood) were manually labeled outside the test region, and 94608 test pixels (47092 dry, 47516 flood) were labeled within the test region.

Classifiers Class Precision Recall F Avg. F
DT+Raw Dry 0.62 0.84 0.71 0.65
Flood 0.76 0.48 0.59
RF+Raw Dry 0.59 0.96 0.73 0.61
Flood 0.90 0.33 0.49
GBM+Raw Dry 0.69 0.76 0.72 0.71
Flood 0.74 0.67 0.70
MLC+Raw Dry 0.64 0.93 0.76 0.69
Flood 0.88 0.48 0.62
DT+elev. Dry 0.99 0.55 0.71 0.76
Flood 0.69 0.99 0.82
RF+elev. Dry 0.99 0.66 0.79 0.82
Flood 0.74 0.99 0.85
MLC+elev. Dry 0.84 0.90 0.87 0.87
Flood 0.89 0.84 0.86
DT+LP Dry 0.61 0.92 0.74 0.65
Flood 0.85 0.43 0.57
RF+LP Dry 0.57 0.99 0.72 0.57
Flood 0.99 0.26 0.42
MLC+LP Dry 0.64 0.97 0.77 0.69
Flood 0.95 0.46 0.62
MRF Dry 0.62 0.99 0.76 0.67
Flood 0.98 0.41 0.58
TSVM Dry 0.62 0.86 0.72 0.66
Flood 0.78 0.49 0.60
HMT Dry 0.93 0.99 0.96 0.96
Flood 0.99 0.93 0.96
Table 1: Comparison on Hurricane Mathew Flood data

Classification performance comparison: We compared different methods on their precision, recall, and F-score on both the flood class and the dry class. Results were summarized in Table 1. We can see that decision tree, random forest, and maximum likelihood classifier all perform poorly on raw features, with overall F-score less than 0.7. Adding post-processing through label propagation slightly impaired performance. For example, after adding post-processing (LP) into decision tree, the recall of the dry class got better but the recall of flood class got worse, probably due to over-smoothing of correctly classified flood pixels into the dry class. Markov random field and Transductive SVM had comparable results with decision tree. In contrast, adding elevation features improved the overall classification performance dramatically for decision tree, random forest, and maximum likelihood classifier. The reason is because most flood pixels have lower elevation than dry pixels. However, the performance on dry class was still quite inferior (0.86 to 0.9 in F-score) compared with our hidden Markov tree (0.95 in F-score), probably because classification models learned based on absolute elevation values cannot perfectly apply to the test region.

Some visualization of classification results were shown in Figure 8. The spectral features and elevation values were shown in Figure 8(a-b). Results of decision tree were in Figure 8(c), which only identified flood pixels with open surface, and mistakenly classified the vast majority of flood pixels below tree canopies (the spectral features of trees indicated the dry class if not considering spatial dependency with nearby pixels). In contrast, our HMT model correctly identified most of the flood pixels, even if the flood water was below tree canopies. The reason is that our HMT incorporates the anisotropic spatial dependency across pixel locations (if a location is flood, its nearby lower locations should also be flood).

(a) High-resolution satellite imagery in North Carolina
(b) Digital elevation
(c) Decision tree result
(d) HMT result
Figure 8: Results on Mathew flood mapping (flood in brown, dry in green, best viewed in color)

Sensitivity of HMT to initial parameters: We conducted sensitivity of our HMT model to different initial parameter values on prior class probability and class transitional probability (the parameters of were initialized based on maximum likelihood estimation on the training set). First, we fixed initial and varied initial from to . Results of converged value of together with final F-score were shown in Figure 9(a-b). It can be seen that our HMT model was quite stable with different initial values. Similarly, we fixed initial , and varied initial from , , to . Results in Figure 9(c-d) showed the same trend. In practice, we can select an initial value around and a relatively high initial value such as (because flood pixels’ neighbor pixel is very likely to be flood due to spatial autocorrelation).

Figure 9: Sensitivity of HMT to different initial parameters and

Parameter iterations and convergence in HMT: Here we fixed the initial and initial , and measured the parameter iterations and convergence behavior. Our convergence threshold was set . The parameter values of and at each iteration were summarized in Figure 10 (we omitted because there were too many variables). The parameters converged after 10 iterations.

Figure 10: Parameter iterations and convergence in HMT

4.3 Hurricane Harvey Floods 2017

The high-resolution earth imagery we used were from Plant Labs. Inc. with red, green, and blue bands in 3 meter resolution, and the digital elevation data was from Texas natural resource management department. We manually collected a training set with 5000 flood samples and 5000 dry samples. We selected a test scene with 4174 rows and 4592 columns, within which we manually labeled 74305 flood samples, and 52658 dry samples as the test set.

We compared different methods on their precision, recall, and F-score on each class. Results were summarized in Table 2. We can see that decision tree, random forest, and maximum likelihood classifier all perform poorly on raw features, with overall F-score less than 0.75. Adding additional elevation feature improved the classification accuracy slightly, but the overall F-score was still below 0.8. The reason may be that the absolute elevation values from training area are not as good an indicator for two classes as in the test area. Similarly, adding label propagation in post-processing and the MRF model barely improved classification performance compared with decision tree and random forest on raw features, because the errors were mostly systematic instead of salt-and-pepper noise. TSVM performed even worse than supervised methods without unlabeled samples, which was somehow surprising. In contrast to baseline methods, our hidden Markov tree achieved superior performance with around 0.95 F-score on both classes. Visualization of some results were shown in Figure 11. We can see that our HMT model significantly outperformed decision tree on flood locations under tree canopies, similar to the results in Figure 8.

Classifiers Class Precision Recall F Avg. F
DT+Raw Dry 0.58 0.88 0.70 0.69
Flood 0.87 0.56 0.68
RF+Raw Dry 0.62 0.96 0.76 0.74
Flood 0.95 0.59 0.73
GBM+Raw Dry 0.56 0.80 0.66 0.66
Flood 0.80 0.56 0.66
MLC+Raw Dry 0.63 0.93 0.75 0.74
Flood 0.93 0.61 0.73
DT+elev. Dry 0.61 0.99 0.76 0.74
Flood 0.99 0.56 0.72
RF+elev. Dry 0.66 0.99 0.79 0.79
Flood 0.99 0.65 0.78
MLC+elev. Dry 0.65 0.97 0.78 0.77
Flood 0.97 0.62 0.76
DT+LP Dry 0.59 0.90 0.71 0.70
Flood 0.89 0.56 0.69
RF+LP Dry 0.63 0.97 0.76 0.75
Flood 0.97 0.59 0.74
MLC+LP Dry 0.63 0.94 0.76 0.75
Flood 0.93 0.61 0.74
MRF Dry 0.63 0.94 0.75 0.74
Flood 0.94 0.61 0.74
TSVM Dry 0.55 0.67 0.60 0.63
Flood 0.72 0.61 0.66
HMT Dry 0.91 0.98 0.94 0.95
Flood 0.98 0.93 0.95
Table 2: Comparison on Harvey Flood Data
(a) High-resolution satellite imagery in Houston, TX
(b) Digital elevation
(c) Decision tree result
(d) HMT result
Figure 11: Results on Harvey flood mapping (flood in brown, dry in green, best viewed in color)

5 Related Work

Over the years, various techniques have been developed to incorporate spatial properties into classification algorithms for earth imagery data. Many methods are based on preprocessing and post-processing, including neighborhood window filters [3, 6], spatial contextual variables and textures [17], spatial autocorrelation statistics [9], morphological profiling [1], spatial-spectral classifiers [21, 23] and object-based image analysis [7]. Markov random field model explicitly captures spatial dependency, but the dependency is undirected [13]. [11]

proposes a spatial classification model that captures directed spatial dependency on classes but assumes dependency to follow a total order. Deep learning methods have recently been applied to earth imagery classification 

[24] such as land cover mapping [8], target recognition and scene identification. To the best of our knowledge, none of these existing works focus on incorporating anisotropic spatial dependency in partial order constraints, which is important in hydrological applications such as flood mapping.

Hidden Markov models have been extensively studied in the signal processing literature [18]. Learning and inference of hidden Markov models are often based on EM algorithms and message propagation (sum-and-product algorithm) [12]. [19] proposes a message propagation algorithm called “upward-downward” on a dependence tree structure. [5] proposes a wavelet-domain hidden Markov tree to model dependency in the two-dimensional time-frequency plane. The model is used to characterize properties of wavelet coefficients in signal processing such as clustering, persistence, and compression, which is dramatically different from our HMT model which captures spatial dependency in the geographic space based on topography.

6 Conclusions and Future Works

In this paper, we propose hidden Markov tree (HMT), an anisotropic spatial classification model for flood mapping on earth imagery. Compared with existing methods, our HMT model explicitly captures directed spatial dependency with partial order constraint. Partial order constraint is reflected in a reverse tree structure in the hidden class layer. We also proposed algorithms for reverse tree construction, parameter learning and class inference. Evaluations on both synthetic data and real world data shows that our HMT model is scalable to large data sizes, and can utilize the partial order spatial dependency to reduce classification errors due to class confusion.

In future works, we plan to extend our HMT model for spatially non-stationary data. In this case, we need to generalize the tree structure in hidden class layer into poly-tree with each sub-tree for a spatial zone.


  • [1] J. A. Benediktsson, J. A. Palmason, and J. R. Sveinsson. Classification of hyperspectral data from urban areas based on extended morphological profiles. IEEE Transactions on Geoscience and Remote Sensing, 43(3):480–491, 2005.
  • [2] P. Brivio, R. Colombo, M. Maggi, and R. Tomasoni. Integration of remote sensing data and gis for accurate mapping of flooded areas. International Journal of Remote Sensing, 23(3):429–441, 2002.
  • [3] R. H. Chan, C.-W. Ho, and M. Nikolova. Salt-and-pepper noise removal by median-type noise detectors and detail-preserving regularization. Image Processing, IEEE Transactions on, 14(10):1479–1485, 2005.
  • [4] D. Cline. Integrated water resources science and services: an integrated and adaptive roadmap for operational implementation. Technical report, National Oceanic and Atmospheric Administration, 2009.
  • [5] M. S. Crouse, R. D. Nowak, and R. G. Baraniuk. Wavelet-based statistical signal processing using hidden markov models. IEEE Transactions on signal processing, 46(4):886–902, 1998.
  • [6] S. Esakkirajan, T. Veerakumar, A. N. Subramanyam, and C. PremChand. Removal of high density salt and pepper noise through modified decision based unsymmetric trimmed median filter. Signal Processing Letters, IEEE, 18(5):287–290, 2011.
  • [7] G. Hay and G. Castilla. Geographic object-based image analysis (geobia): A new name for a new discipline. In Object-based image analysis, pages 75–89. Springer, 2008.
  • [8] X. Jia, A. Khandelwal, G. Nayak, J. Gerber, K. Carlson, P. West, and V. Kumar. Incremental dual-memory lstm in land cover prediction. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 867–876. ACM, 2017.
  • [9] Z. Jiang, S. Shekhar, X. Zhou, J. Knight, and J. Corcoran. Focal-test-based spatial decision tree learning. IEEE Transactions on Knowledge and Data Engineering, 27(6):1547–1559, 2015.
  • [10] T. Joachims. Making large-scale SVM learning practical. In B. Schölkopf, C. Burges, and A. Smola, editors, Advances in Kernel Methods - Support Vector Learning, chapter 11, pages 169–184. MIT Press, Cambridge, MA, 1999.
  • [11] A. Khandelwal, V. Mithal, and V. Kumar. Post classification label refinement using implicit ordering constraint among data instances. In 2015 IEEE International Conference on Data Mining, ICDM 2015, Atlantic City, NJ, USA, November 14-17, 2015, pages 799–804, 2015.
  • [12] F. R. Kschischang, B. J. Frey, and H.-A. Loeliger. Factor graphs and the sum-product algorithm. IEEE Transactions on information theory, 47(2):498–519, 2001.
  • [13] S. Z. Li. Markov random field modeling in image analysis. Springer Science & Business Media, 2009.
  • [14] National Oceanic and Atmospheric Administration. Data and imagery from noaa’s national geodetic survey.
  • [15] National Oceanic and Atmospheric Administration. National Water Model: Improving NOAA’s Water Prediction Services., 2018.
  • [16] NCSU Libraries. LIDAR Based Elevation Data for North Carolina., 2018.
  • [17] A. Puissant, J. Hirsch, and C. Weber. The utility of texture analysis to improve per-pixel classification for high to very high spatial resolution imagery. International Journal of Remote Sensing, 26(4):733–745, 2005.
  • [18] L. R. Rabiner. A tutorial on hidden markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257–286, 1989.
  • [19] O. Ronen, J. Rohlicek, and M. Ostendorf. Parameter estimation of dependence tree models using the em algorithm. IEEE Signal Processing Letters, 2(8):157–159, 1995.
  • [20] R. Szeliski, R. Zabih, D. Scharstein, O. Veksler, V. Kolmogorov, A. Agarwala, M. Tappen, and C. Rother. A comparative study of energy minimization methods for markov random fields. In

    European conference on computer vision

    , pages 16–29. Springer, 2006.
  • [21] Y. Tarabalka, J. A. Benediktsson, and J. Chanussot. Spectral–spatial classification of hyperspectral imagery based on partitional clustering techniques. IEEE Transactions on Geoscience and Remote Sensing, 47(8):2973–2987, 2009.
  • [22] The Middlebury Computer Vision Pages. C++ Source Code of MRF., 2018.
  • [23] L. Wang, S. Hao, Q. Wang, and Y. Wang. Semi-supervised classification for hyperspectral imagery based on spatial-spectral label propagation. ISPRS Journal of Photogrammetry and Remote Sensing, 97:123–137, 2014.
  • [24] L. Zhang, L. Zhang, and B. Du. Deep learning for remote sensing data: A technical tutorial on the state of the art. IEEE Geoscience and Remote Sensing Magazine, 4(2):22–40, 2016.
  • [25] X. Zhu. Semi-supervised learning literature survey. 2005.
  • [26] X. Zhu and Z. Ghahramani. Learning from labeled and unlabeled data with label propagation. 2002.

Appendix A Proof of Theorems

Theorem 1.

The unnormalized marginal posterior distribution of the class of a leaf node, as well as the classes of a non-leaf node with parents can be computed by (9) and (10) respectively. Their normalized margin posterior distributions can be computed by (11) and (12) respectively.


Detailed proof can be found below. Some symbols are defined in Table 3.

Symbol Description
All nodes in the subtree rooted at node except for the root node
All nodes in the subtree rooted at node
All nodes in the reverse tree excluding
All nodes in the reverse tree excluding
Table 3: List of symbols in the proof

Forward message propagation, the statistical meanings of the message are as below


Here, means all nodes in the subtree rooted at node except for the root node (i.e., all nodes visited before in its subtrees during bottom-up propagation), means all nodes in the subtree rooted at node (i.e., ).

Base case: If is leaf node,


Thus, the statement Equation 22 and 23 are correct.

Induction step: Let be an internal node and and suppose Equation 22 and 23 is true for . Then, based on topological order, let be the child of node .


Thus, Equation 22 and 23 hold for internal nodes. Hence Equation 22 and 23 are correct for all nodes.

Backward message propagation, the statistical meanings of the message are as below


Here, means all nodes visited before during top-down propagation in the reverse tree (i.e., is all nodes in the reverse tree excluding ), means all nodes visited including during top-down propagation in the reverse tree (i.e., includes all nodes in the reverse tree excluding ).

Base case: If is root node,


Thus, the statement Equation 28 and 29 are correct.

Induction step: Let be an internal node and and suppose Equation 28 and 29 is true for . Then, based on topological order, let be one of the parents of node .


Thus, Equation 28 and 29 hold for internal nodes. Hence Equation 28 and 29 are correct for all nodes.

For unnormalized marginal posterior distribution, Equation 18 and 21 are proved as below. For simplicity, we omit on the right side of the following equations.