Biomass is an important plant characteristic that supports crop monitoring and yield estimation and indicates plant growing conditions. It is quantified as the above-ground weight of a plant before dehydration. In the case of sorghum (the main plant in our study), wet biomass determines the amount of ethanol product. To identify superior plant varieties for breeding, biomass can be manually measured at the end of the growing season; however, this traditional method is time-consuming, expensive, and retrospective. Instead, hyperspectral images collected by Unmanned Aerial Vehicles (UAVs) throughout the season can be used to predict the final biomass. Remote sensing experts in our team collected high-resolution hyperspectral images multiple times (from June to Sept.) over 14 acres of experimental sorghum fields with 830 varieties in the 2017 growing season. The ground truth wet biomass was measured once at the end of the growing season (Oct.).
A hyperspectral image captures, for each pixel, a spectrum that covers wavelengths ranging from 400 nm to 1000 nm in 2.2 nm increments (272 bands). The 272 collected bands are continuous narrow bands, each highly correlated with its neighbors. To reduce the dependency among these bands, we adopted hyperspectral indices based on domain practice. Specifically, we utilize the 36 hyperspectral vegetation indices listed in [liang2015estimation]. Each index is derived from the values of several bands and has a distinct biophysical meaning. However, some indices have closely related calculation formulas.
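To make the band-to-index step concrete, the following is a minimal sketch of how one normalized-difference index could be derived from a pixel spectrum. The band grid (272 bands starting at 400 nm in 2.2 nm steps) follows the description above; the function names, the synthetic spectrum, and the choice of 800 nm and 670 nm are illustrative assumptions and are not necessarily one of the 36 indices used in this work.

```python
def band_wavelengths(start_nm=400.0, step_nm=2.2, n_bands=272):
    """Center wavelength of each band, assuming a uniform grid."""
    return [start_nm + i * step_nm for i in range(n_bands)]


def nearest_band(wavelengths, target_nm):
    """Index of the band whose center is closest to target_nm."""
    return min(range(len(wavelengths)), key=lambda i: abs(wavelengths[i] - target_nm))


def normalized_difference(spectrum, wavelengths, nm_a, nm_b):
    """Generic normalized-difference index (a - b) / (a + b) from two bands."""
    a = spectrum[nearest_band(wavelengths, nm_a)]
    b = spectrum[nearest_band(wavelengths, nm_b)]
    return (a - b) / (a + b)


wavelengths = band_wavelengths()

# Synthetic reflectance for one pixel (assumption): low in the visible
# range, high in the near infrared, roughly as for green vegetation.
spectrum = [0.05 if w < 700.0 else 0.45 for w in wavelengths]

# Hypothetical index built from a near-infrared and a red band.
index_value = normalized_difference(spectrum, wavelengths, 800.0, 670.0)
print(round(index_value, 3))
```

Because neighboring bands are highly correlated, an index of this kind summarizes a spectral contrast with one value instead of many redundant band values.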
More information about the sensors, data pre-processing, and feature extraction is available in [masjedi2018sorghum, elbahnasawy2018multi, zhang2017prediction].
2 Related Work
Feature selection methods can be generally divided into four categories: filter methods, wrapper methods, embedded methods, and hybrid methods [Chandrashekar2014]. The filter and wrapper categories are relevant to our work; therefore, we will focus on them here. Pearson’s correlation coefficient is a popular filtering method for narrowing down features to the ones with high (linear) correlation with the dependent variable. However, correlated but redundant features may be selected, and the coefficient is unable to characterize nonlinear relationships. Wrapper methods use regression or classification models to find an optimal feature subset by iteratively adding or removing features. The combination of learning models (e.g. SVR) and wrapper methods (e.g. RFE) has traditionally been used for automatic feature selection [duan2005multiple, Ding:2011:SBF:1943363.1943368].
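The two limitations of correlation-based filtering noted above can be illustrated with a small sketch. The data, feature names, and the 0.9 threshold are made up for illustration; only Pearson's coefficient itself is taken from the text.

```python
import math


def pearson(x, y):
    """Pearson's correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


# Synthetic example: the target is fully determined by `base`, but the
# dependence is quadratic, so it is invisible to a linear measure.
base = [float(v) for v in range(-5, 6)]
target = [b * b for b in base]

features = {
    "linear": [2.0 * y + 1.0 for y in target],    # linear in the target
    "rescaled": [4.0 * y - 3.0 for y in target],  # redundant copy of "linear"
    "nonlinear": base,                            # target = nonlinear ** 2
}

scores = {name: abs(pearson(col, target)) for name, col in features.items()}

# Filter step with a hypothetical threshold of 0.9.
selected = [name for name, score in scores.items() if score >= 0.9]
print(scores, selected)
```

The filter keeps both `linear` and `rescaled` even though they carry the same information, and it discards `nonlinear` even though the target is a deterministic function of it.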
Several visualizations have been proposed for feature selection, including correlation matrices [Friendly2002], feature clustering [Yang2003], feature ranking [JinwookSeo2004, Piringer2008, Johansson2009], scatterplot matrices [Elmqvist2008], and dimensionality reduction [Brushing11]. A few visual analytics systems have leveraged a combination of automatic and visual feature selection techniques. RegressionExplorer [dingen2019regressionexplorer] is one such system for inspecting logistic regression models. Other systems have been proposed to support exploring linear relationships among features [Guo2009, Piringer2010, Barlowe2008]. BEAMES [dasbeames] is another multi-model system that enables users to interactively compare different types of models with various hyper-parameters (e.g., logistic regression vs. Bayesian regression models), while allowing users to interactively weigh data instances and features. INFUSE [Krause2014] enables the ensemble of multiple feature selection methods by visualizing feature importance, as determined by various feature selection methods, in a radial glyph. Our focus, however, is to support domain experts in efficiently reducing a high-dimensional feature space into key feature subsets, and in tracing the features back to the underlying wavelengths to incorporate domain knowledge.
Partition-based visual analytics systems [May2011, Muhlbacher2013] primarily focus on the interactive exploration of local structures and relationships of independent and target variables, appropriate for lower feature space dimensions. They are aimed at closer inspection of limited numbers of selected features for optimal distribution partitioning and model building.
However, our focus is on high dimensions (of both data instances and feature space). Our system’s integrated hierarchical clustering and matrix visualizations facilitate the quick identification of (a) influential feature subsets (either already selected or missing) for model building, (b) the interchangeable features within those subsets, and (c) detailed feature distribution and importance.
3 Design Goals
We collaborated with three remote sensing experts: two Ph.D. students and a senior faculty member with expertise in hyperspectral image analysis for agronomy. Traditionally, they predict biomass using automated feature selection algorithms and regression models. Optimally tuning these algorithms often requires large numbers of data samples, which are expensive to collect. With limited samples, it is challenging to build a model that performs well for all hybrid varieties, for plants in different locations, and at different growing stages or conditions. Therefore, the domain experts needed to identify the key hyperspectral features that yield stable, credible, and accurate predictions, using both automated methods and their domain knowledge to inspect the relationships among features, assess feature importance, and trace the hyperspectral indices back to the biophysical space. Hyperspectral indices indicate meaningful chemical concentrations in plants, which can be used to differentiate plant varieties. The domain experts also expressed the need for feature clustering, dynamic feature selection, and model performance comparisons with and without feature selection. We derived the following design goals to fulfill these requirements:
DG1: Interactive exploration of features, including feature density distributions and relationships among multivariate features.
DG2: Identification of important features, such as influential hyperspectral indices and the underlying wavelengths that contribute to the prediction of wet biomass.
DG3: Direct manipulation and refinement of feature subsets by interactively adding and removing specific features.
DG4: Evaluation of regression results against ground truth for the subset of selected features versus the full set of features.
These requirements were formalized into design mock-ups using visualizations already familiar to domain users based on their request. We then implemented the design, and made minor modifications according to feedback from domain experts, as described below.
In this section, we first explain how our system addresses the design goals, and then elaborate on the frontend user interface and backend analytics components of FeatureExplorer.
Figure 1 presents the system components in FeatureExplorer, and our process. As shown in Figure 1, FeatureExplorer supports the analysis of both linear and non-linear relationships (DG1, DG2). To visualize feature relationships, a correlation matrix serves as an overview that renders the Pearson’s correlation coefficient for all pairs of features. Users can click on any cell for a detailed inspection of any particular pair of features. For non-linear relationship analysis, Support Vector Regression and Recursive Feature Elimination (SVR + RFE) provide a feature importance ranking. Users can compare and analyze the ranking results and use the synthesized information to add or remove features (DG3). R² and Root Mean Square Error (RMSE) are calculated to show the regression models’ performance with the selected subset of features (DG4). After the initial implementation, users requested the capability to adjust the number of folds in cross validation, to compare the performance of regression models with a selected subset of features versus with all features, and to map hyperspectral indices to the original wavelengths. This way, users can use the insights gained from the interactive exploration process to identify the underlying pertinent wavelengths, and strategize ways to collect only pertinent data in the future to save cost and time.
4.2 User Interface
Figure 2 illustrates the user interface that contains three panels: (A) a control panel, (B) a correlation panel, and (C) an evaluation panel. As we described in the previous section, the two latter panels are separated based on the linearity of the relationship between input features and predicted variables. In this section, we describe the views individually, and will showcase the integrated use of these views in a use case in Section 5.
In the correlation panel, a correlation matrix shows the Pearson’s correlation coefficient between any pair of features. The coefficient value is double-encoded using two visual channels (color and radius) for better usability. Hierarchical clustering groups the features based on the similarity of their correlations to other features. This helps users identify representative pairs from each cluster while minimizing the chances of including other similarly correlated pairs. While the matrix provides a good overview, a single correlation value does not provide sufficient information for interpreting the relationship between two features. To address this, users can click on any cell to see the scatterplot of the two selected features. The system uses both histograms and kernel density estimation (KDE) to illustrate the marginal distribution of each feature at the edges of the scatterplot. We also overlaid a 2D KDE on the scatterplot to better visualize the joint distribution of the two features. The marginal distributions and KDE contours are beneficial in understanding general data patterns. The domain users pointed out that exploring the hyperspectral index vs. wet biomass scatterplot could help them investigate whether the index captures the variation across high and low biomass values.
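The grouping step can be sketched as follows. This is a small pure-Python illustration under stated assumptions: the text does not specify the linkage or distance, so average linkage over the Euclidean distance between correlation profiles (rows of the correlation matrix) is assumed here, and the feature names and values are synthetic.

```python
import math


def pearson(x, y):
    """Pearson's correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def correlation_matrix(columns):
    return [[pearson(a, b) for b in columns] for a in columns]


def profile_distance(corr, i, j):
    """Distance between features i and j based on their correlations to all features."""
    return math.sqrt(sum((corr[i][k] - corr[j][k]) ** 2 for k in range(len(corr))))


def agglomerate(corr, n_clusters):
    """Average-linkage agglomerative clustering (assumed linkage) of feature indices."""
    clusters = [[i] for i in range(len(corr))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                pairs = [(i, j) for i in clusters[a] for j in clusters[b]]
                dist = sum(profile_distance(corr, i, j) for i, j in pairs) / len(pairs)
                if best is None or dist < best[0]:
                    best = (dist, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]  # merge the closest pair
        del clusters[b]
    return clusters


# Synthetic features (hypothetical names): a/c move together, b/d move together.
names = ["index_a", "index_b", "index_c", "index_d"]
columns = [
    [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    [3.0, -2.0, 4.0, -3.0, 2.0, -1.0],
    [1.1, 2.0, 2.9, 4.2, 5.1, 5.9],
    [3.2, -1.9, 3.8, -3.1, 2.1, -0.8],
]

corr = correlation_matrix(columns)
clusters = agglomerate(corr, n_clusters=2)
grouped = [[names[i] for i in cluster] for cluster in clusters]
print(grouped)
```

Ordering the matrix rows and columns by cluster membership places similarly correlated features next to each other, which is what makes one representative per cluster easy to pick.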
At the top of the evaluation panel, a scatterplot shows ground truth values against predicted results, along with R² and RMSE values. With this graph, domain users identified that the regression model does not perform well on extremely high or low biomass values. The horizontal bar graphs show the feature importance score for each input feature (using SVR + RFE), and the light blue rectangles indicate selected features. The histogram beside the bar graphs shows how frequently each reflectance wavelength (raw data) is used to derive the indices in the subset of selected features, over the wavelength range of 400 nm to 900 nm. This enables domain experts to trace the selected features back to the wavelengths that are used to derive the indices. Moreover, a table compares performance for the subset of selected features versus all features based on the same data partition (training vs. testing) and regression model.
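The trace-back from selected indices to wavelengths amounts to counting how often each wavelength bin is used. A minimal sketch follows; the index names, their wavelength lists, and the 50 nm bin width are hypothetical placeholders, not the definitions used in the system.

```python
from collections import Counter

# Hypothetical mapping from index name to the wavelengths (nm) in its formula.
INDEX_WAVELENGTHS = {
    "index_a": [550, 670],
    "index_b": [670, 800],
    "index_c": [445, 705, 750],
}


def wavelength_histogram(selected, index_wavelengths, bin_width_nm=50, start_nm=400):
    """Count, per wavelength bin, how many times the selected indices use it."""
    counts = Counter()
    for name in selected:
        for nm in index_wavelengths[name]:
            bin_start = start_nm + ((nm - start_nm) // bin_width_nm) * bin_width_nm
            counts[bin_start] += 1
    return dict(sorted(counts.items()))


histogram = wavelength_histogram(["index_a", "index_b"], INDEX_WAVELENGTHS)
print(histogram)  # keys are the lower edges of the wavelength bins
```

Bins with high counts point to the wavelengths that the current feature subset relies on most.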
As we mentioned before, the correlation matrix and the SVR + RFE bar graphs provide different rankings, the former for linear relationships and the latter for non-linear models. Users can refer to both to adjust the subset of selected features. In the control panel, the leftmost list shows unused features, and the list in the middle shows the selected ones. Users can drag and drop features between these two lists and evaluate the results on the fly. To avoid exhaustive feature searching at the beginning by the users, the system enables an initial automatic feature selection method based on SVR + RFE.
4.3 Regression Models
After testing several regression models including Ridge, Elastic Net, Partial Least Squares, SVR, Random Forest, and AdaBoost, we found that SVR [scholkopf2002learning] outperforms the other models for predicting biomass from hyperspectral indices for most dates. The R² results for these regression models are listed in Table 1. Since R² and RMSE are highly correlated (higher R² means lower RMSE), we only report R². Based on the results, we decided to integrate SVR + RFE (for automatic feature selection) into the system.
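The link between the two metrics can be made explicit. On a single fixed evaluation set of n samples, R² = 1 - n · RMSE² / SS_tot, where SS_tot is the total sum of squares of the ground truth, so a lower RMSE always means a higher R² on that set. The sketch below checks this on made-up numbers; the relation is only approximate once scores are averaged over different folds.

```python
import math


def rmse(y_true, y_pred):
    """Root mean square error."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def r_squared(y_true, y_pred):
    """Coefficient of determination, 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up ground truth and two sets of predictions on the same samples.
y_true = [10.0, 12.0, 15.0, 19.0, 24.0]
pred_a = [11.0, 12.0, 14.0, 20.0, 23.0]  # closer to the ground truth
pred_b = [13.0, 10.0, 17.0, 16.0, 27.0]  # further from the ground truth

n = len(y_true)
mean = sum(y_true) / n
ss_tot = sum((t - mean) ** 2 for t in y_true)

for pred in (pred_a, pred_b):
    # Both routes to R^2 agree up to floating-point error.
    via_rmse = 1.0 - n * rmse(y_true, pred) ** 2 / ss_tot
    print(round(rmse(y_true, pred), 3), round(r_squared(y_true, pred), 3), round(via_rmse, 3))
```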
The system runs k-fold cross validation for model evaluation. For each training of the SVR model, the system first runs a grid search with a Radial Basis Function (RBF) [alpaydin2009introduction] kernel to select the model hyperparameters that maximize R², and then performs initial feature selection on that model [liu2011feature]. RFE ranks the features based on their contributions to the regression model, and the system transforms these ranks into scores in the range [0, 1], with 0 meaning no contribution and 1 meaning the most important feature in the model.
We use Equation 1 to compute the ranking score of a feature, where k is the number of folds, d is the number of dimensions in the feature space, and r denotes the ranking determined by RFE. The RFE method outputs the ranking of features in a sequential order from the most important to least; the most important feature has a ranking of 1 and the least important feature has a ranking of d. The numerator of Equation 1 sums the normalized ranking (mapping values in [1, d] to [0, 1]), which is then divided by k to calculate the average of these scores for multiple runs (in cross-fold validation). We use this RankingScore in feature importance visualization (the horizontal bar graphs).
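The averaging described above can be sketched in a few lines. Equation 1 itself is not reproduced in this text, so the linear normalization (d - r) / (d - 1), which maps rank 1 to a score of 1 and rank d to a score of 0, is an assumption chosen to match the stated endpoints; the function name and the example ranks are illustrative.

```python
def ranking_score(ranks_per_fold, d):
    """Average normalized RFE rank of one feature over k folds.

    ranks_per_fold: RFE rank of the feature in each fold (1 = most important).
    d: number of features. Assumes the normalization (d - r) / (d - 1).
    """
    k = len(ranks_per_fold)
    return sum((d - r) / (d - 1) for r in ranks_per_fold) / k


# A feature ranked near the top in each of five folds, out of 36 features.
score = ranking_score([1, 1, 2, 1, 3], d=36)
print(round(score, 4))
```

A feature that is ranked first in every fold receives a score of 1, and one that is always ranked last receives 0, which matches the range shown in the horizontal bar graphs.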
5 Case Study
A remote sensing expert in our team used FeatureExplorer to investigate hyperspectral indices for biomass prediction. He aimed to determine which indices were the most predictive, and whether he could reduce the set of 36 features to 10 key features while understanding their biophysical meanings in collaboration with a plant scientist. He used 10 hyperspectral images collected from June 21st to Sept. 24th in 2017 to investigate whether the important subset of hyperspectral indices changed in each image set. First, he started with one dataset (July 18th) and applied automatic feature selection for 20 features (out of 36 total), and found that performance using 20 features was slightly better than when using all 36 features. Then, he applied automatic feature selection, limiting it to 3 features. The regression performance (R²) dropped significantly, with a higher RMSE. Based on the ranked feature sets and the correlation matrix, he added 4 features that had high importance scores and low correlation among them. These 4 features were selected from different clusters in the correlation matrix, since he wanted the regression model to learn useful information from diverse features. The performance of the model improved. After adding up to 10 features, the performance of the regression model was almost equivalent to its performance when using 20 features (Figure 3(1)). He then tested whether applying automatic selection limited to 10 features would lead to similar results; it turned out that the manually selected features outperformed the automatic selection (Figure 3(2)).
Next, he applied the same subset of features to another hyperspectral image (July 30th) that was captured 12 days after the first one. He found that wet biomass had stronger correlations with most hyperspectral indices (the correlation matrix shown in Figure 3(5)) compared with the first dataset (the correlation matrix shown in Figure 2(B)). The regression model performed better on the second dataset than on the first one because the plants were at a different growing stage [gerik2003sorghum] and their reflectance had changed [brandao2015spectral]. Tuning the regression model on the second dataset with the 10 features selected while analyzing the first dataset did not improve the prediction results; however, the performance of the regression model did not drop dramatically (Figure 3(3)). By carefully examining the correlation matrix for the second dataset, he found 3 features that did not have high correlations with biomass. After removing these 3 features and adding another feature that had a high importance score and a high correlation with biomass, the model’s performance improved significantly (Figure 3(4)). This indicates that a human in the loop can improve the predictive performance of the regression model.
6 Conclusion and Future Work
We presented a visual analytics system for the exploration, ranking, and selection of features in integrated regression models, supporting the analysis of both linear and non-linear relationships. The system provides initial automated feature selection, and enables users to dynamically change, compare, and evaluate models’ performance based on user-specified subsets of features. We demonstrated the successful use of the system by remote sensing experts to identify important hyperspectral indices at various plant growth stages for predicting the biomass at the end of the growing season, as well as to trace these indices back to the underlying wavelengths for each growing stage. This enables more targeted data collection and analysis in the future. FeatureExplorer can also be applied to other sensor data (e.g., multispectral, LiDAR) that possess properties similar to those of hyperspectral indices (e.g., high dimensions, derived correlated features), to predict variables other than biomass. Our system can also be adjusted to include different regression models, since the underlying model will not intrinsically impact the feature exploration workflow.
Future visual analytics research should investigate the dynamic generation of features based on raw input data, e.g., customized features based on different formulations of hyperspectral indices. Also, one can improve the feature selection workflow by visually highlighting features in clusters that are ranked as having high (or low) importance, for faster subgroup inclusion or exclusion. Feature selection in regression models for spatially and temporally heterogeneous data is also an open area for research. Specifically, the geovisualization of feature importance for spatial regression methods has not been adequately addressed. Finally, time series analysis can be incorporated to model temporally variable feature contributions, e.g., in a sequence of hyperspectral images with temporally variable wavelength reflectances at different plant growing stages.