Mild traumatic brain injury (mTBI) is a significant public health issue with millions of civilian, military, and sport-related injuries occurring every year . Moreover, of patients with mTBI develop persistent symptoms months to years after initial injury . Cognitive complaints are important due to their significant impact on the quality of life. In this study, we examine the specific cognitive subdomain of working memory in relation to the underlying tissue microstructure by accessing diffusion MRI and predict performance on working memory. Defining specific imaging biomarkers related to cognitive dysfunction after mTBI would not only shed light on the underlying pathophysiology of injury leading to cognitive impairments, but also help to triage patients and offer a quantitative means to track recovery in the cognitive domain as well as track efficacy of targeted cognitive therapeutic strategies .
A few previous works apply feature analysis to identify injury in MTBI and to predict clinical status of mTBI patients. Lui et al. used Minimum Redundancy and Maximum Relevance (mRMR) approach to identify the most relevant features for classifying patients between MTBI versus control
. Minaee et al. proposed a combination of linear regression and exhaustive MRI feature selection to predict NP (neuropsychological) test scores. Though they reported achieving reasonable accuracy, these methods were developed using very small datasets (
subjects) and explored only a small set of handcrafted imaging features (10-15 features). Due to limited datasets, it is not feasible to apply deep learning to entire brain volumes obtained with multiple diffusion metrics for either task (MTBI classification or prediction of NP scores). To overcome this challenge, Minaee et al. applied Adversarial Auto-encoder  to extract latent features that could then be used to reconstruct image patches, and adopted a bag of visual words (BoW) representation to describe each metric in each brain region, where the visual words were obtained by clustering the latent features. Despite a high classification accuracy , feature selection after BoW was accomplished by greedy forward search, which may produce suboptimal feature subset that is quite far from the optimal one.
There are several other works that use imaging features to study mTBI, such as dictionary learning  for dimensionality reduction and network-based statistics analysis . However, since feature selection over a large feature space is prohibitively expensive, these works either 1) are limited in the number of initial features considered, which relies on prior knowledge to handcraft features and may potentially miss the most relevant ones or 2) project an originally large feature dimension to a low dimension space; a downside of such approaches is that the transformed features are often hard to interpret.
To overcome these limitations, we leverage a powerful feature selection method known as greedy best first search (Greedy BFS) 
, which has been shown to be more effective than the more typically adopted greedy forward or backward search method or the genetic algorithm. We further propose a novel improvement over the Greedy BFS method. First, sufficiently large (280) number of statistic features are extracted from 7 anatomic white matter brain regions and 8 diffusion MRI metrics. Gradient Boosting Tree (GBT) is selected for accurate regression and repeated stratified cross-validation is used to avoid over-fitting. During the search, each feature subset is evaluated by the cross validationscore by the GBT model. The proposed improvement to the Greedy BFS method, known as BFS with crossover, uses crossovers to jump over the feature subset graph so that a broader feature subset can be visited to produce a more accurate result in the most efficient time frame.
Compared to using greedy forward or backward search or genetic search, Greedy BFS method yielded greater accuracy in the prediction of working memory subtests’scores from difussion MRI features. The BFS with crossover further improved the accuracy over greedy BFS consistently. Interestingly, the features that were chosen frequently by the BFS with crossover method are those diffusion MRI metrics that represent the underlying tissue microstructure.
Ii-a Dataset and Feature Extraction
The dataset contains 154 subjects, among which 70 are normal controls (NC) and 84 are mTBI. Age-appropriate WAIS-IV subtests  were performed to assess performance of working memory, including Digit Span Forward (DSF), Digit Span Backward (DSB), and Letter-Number Sequencing (LNS). For each subset, separate models are developed for the control and MTBI populations, respectively, in order to discover normal and pathologic microstructure features that inform on the working memory. In addition, a combined model is also developed.
Based on previous diffusion studies in mTBI patients, 8 metrics from DTI, DKI and compartment specific white matter modeling  were chosen, summarized in Table I. For compartment specific metrics, voxels with FA were excluded as recommended to interrogate single-fiber orientations  . Instead of considering the entire brain volume, we compute several statistics of each metric over 7 major white matter brain regions: left rostral (LR), right rostral (RR), left middle (LM), right middle (RM), left caudal (LC), right caudal (RC), and corpus collasum (CC) (See Fig. 1
|Diffusion Imaging Metrics||Description|
|Compartment Specific||AWF||Axonal Water Fraction|
|De-par||Extra-axonal axial diffusivity|
|De-perp||Extra-axonal radial diffusivity|
Ii-B Wrapper Feature Selection as Graph Search Problem
There are three main categories of feature selection methods: filter, wrapper and embedded . Filter based feature selection ranks the feature subsets based on some criteria such as the correlation between the individual features and the target outcome and the correlations among the features, independent of the prediction/ classification method. Wrapper based approach would train multiple prediction/ classification models using different feature subsets and use validation scores to select the best feature subset. Embedded approach constrains the model parameters related to the input features to be sparse, and conducts feature selection during model construction. In general, filter approach is computationally fastest but often yields sub-optimal feature subsets; whereas the wrapper method is the most accurate but is computationally costly. In this work we follow the wrapper approach.
The wrapper based feature subset selection can be generalized as a graph search problem . Consider a dataset with N samples , where represents the features for the sample, and the ground truth outcome. If each data sample has features, the number of total possible feature subsets is .
Any path connecting vertex and another vertex has length equals to the sum of the edge weights along this path:
From the definition of Eq. (1), it is easy to show that:
Therefore, the feature selection problem is to find the longest path from vertex to any possible vertex in graph, which is equivalent to searching the vertex with the best score:
Then consider a directed weighted graph
. Each vertex is represented by a binary vector indimensions, , indicating whether each feature is selected. Two vertices and are connected if there is only one bit difference, which means only the state of one feature is different (See Fig. 2 ). contains vertices, with in-degree and out-degree of each vertex both equals to . The weight of an edge is assigned to be the difference between the performance scores of connected vertices, which is usually calculated through cross-validation ,
Ii-C Greedy Best First Search Algorithm
Since the graph has nodes, an exhaustive traverse would be prohibitive if is large. Thus, a heuristic is usually used to avoid exhaustive search without losing accuracy significantly. Some classical heuristic approaches are: sequential feature selection (SFS), Hill Climbing. Meta-heuristic is another family of algorithms that simulates natural phenomena, including simulated annealing (SA), swarm algorithm such as whale optimization (WO) and genetic algorithm (GA).
In this paper, we revisit and improve a heuristic approach: greedy best first search (Greedy BFS). Greedy BFS is initially proposed for robots’ path finding and later applied to feature selection  . However, this method did not get much attraction due to the limited feature subset size and computation power at that time. Recently researchers start to rediscover it and its variations for problems such as sparse representation .
As shown in Algorithm 1, greedy BFS algorithm starts at one node and iteratively selects next node maximizing . Each time the node with the best score “current” node in Algorithm 1) in the priority queue (“open” queue in Algorithm 1) is popped out, its undiscovered children are evaluated and pushed into the priority queue. This process is repeated until the queue is empty or the best accuracy has not been updated for a certain number of iteration (patience).
Greedy BFS is a superset of sequential floating feature selection (SFFS) , which is in turn a superset of sequential feature selection (SFS). SFS includes greedy forward and backward. When the patience is set to infinity it is equivalent to exhaustive search.
Ii-D Best First Search with Crossover Operator
Despite its potential, Greedy BFS is not widely applied to feature selection because it is computationally costly. In each step, it evaluates all children of the current vertex, the number of which equals to , the out-degree of vertex. To solve this problem, we propose a novel algorithm combining Greedy BFS search and crossover operator.
The idea of crossover operator comes from the genetic algorithm , a classical meta-heuristic optimization approach that simulates natural selection process. The core of the genetic algorithm is mutation and crossover operator. Mutation randomly changes one or several bits of population. Crossover takes the two best vertices that share the same parent and generates a new child from 3 possible operations as illustrated in Fig. 3.
After a node with the best score in the current queue is popped out, the BFS with crossover adds all its children to the priority queue. While this step is the same as Greedy BFS, a crossover operation is conducted between the best two children (“first” and “second” in Algorithm 2) of the node to identify a crossover node. This node is also added to the queue. There are three possible conditions depending on the relation of the two children with their parent (See Fig. 3).
|Meta-heuristic||Greedy||Best First Search||BFS with Crossover Results|
|Test/Cohort||Genetic Algorithm||Forward||Backward||Greedy BFS||BFS with Crossover||Pearson Coefficient||p-value|
|Test/Cohort||selected features||selected features + age||selected features + gender||selected features + age + gender|
Compared to Greedy BFS, the crossover node, which is likely a good node with high score, will be evaluated with the same priority as all the children of the current node. With Greedy BFS, the crossover node will have to be evaluated along with all the children of the “first” node. With crossover, if the “cross” node is actually better than “first”, the evaluations of other children of “first” will be skipped. However, there is no guarantee that the “cross” node is better than the children node of “first”, so BFS with crossover may not always yield better results than Greedy BFS.
Ii-E Gradient Boosting Tree
Gradient boosting tree (GBT) is chosen as an estimator for its simplicity and robustness. The idea of boosting is to combine the output of many weak models to produce a powerful ensemble. Gradient boosting adds the idea of steepest descent on top of boosting 
. It iteratively adds new weak model to correct the largest previous error. In addition, decision tree is often chosen as a weak estimator. GBT has strong generalization ability and robustness to errors
. In our preliminary work, we have compared GBT with other regression methods including Support Vector Machine and Neural Network. GBT usually leads to the best performance with limited hyperparameter tuning. Here, we choose to present only the performance of GBT under different feature selection methods.
Ii-F Repeated Stratified K-fold Cross Validation
K-fold Cross-validation (CV) is widely applied method to estimate model performance . Here, we use 5-fold cross validation with stratified CV split, which splits the entire dataset into 5 folds with the same distribution of labels . In our case the labels are continuously distributed in . We quantize this range to 5 bins and each fold would have the same percentage of the samples in each bin, as the whole dataset.
Iii Results and Discussion
Iii-a Prediction results between BFS with Crossover and other heuristic algorithms
For each NP test, we develop three GBT models using control subjects, mTBI subjects, and all subjects, respectively. For each model, we perform feature selection using the proposed BFS with Crossover method as well as several other methods including greedy forward, greedy backward and genetic algorithm.
The average score among all validation samples is chosen to assess of model performance.
is defined as the portion of variance explained:
Table II summarizes the performance of different feature selection methods. Greedy backward search yields poor performance, suggesting that this is not a viable method when the feature space is very large. The greedy forward and genetic algorithm are substantially better than greedy backward, but the score is still mostly below 0.5. The Greedy BFS provides substantial improvement over these two methods in all the models. Finally, BFS with crossover achieves further improvement over greedy BFS in all cases. The relatively high scores and the scatter plot in Fig. 4 indicate a reasonably good fit.
Last two columns of Table II
present the Pearson correlation between the ground truth and the predicted values by BFS with crossover and the corresponding p-value, which indicates the probability that an uncorrelated system produces such computed Pearson correlation. It could be observed that for most of the tests the Pearson correlation is larger than 0.7 with p value less than 0.05, which for biological systems indicates a strong and reliable relationship.
We have also suspected the relevance between np test results and age and gender. Based on the selected features, age or/and gender is added to the feature set. GBT is run again to test the accuracy. Excepet for tests DSB mTBI and DSB combine, features without age and gender generate better accuracy (See Table III). And DSB mTBI and DSB combine only have minor updates in terms of accuracy.
Iii-B Selected Features
|DSF NC||std-LC-AK, skew-LC-AK, std-LM-AWF, std-LR-De-par, mean-RR-De-par, mean-LM-De-par, skew-RM-De-par, mean-LM-De-perp, skew-LM-De-perp, std-RM-De-perp, kurt-RM-De-perp, skew-LR-FA, kurt-LR-FA, mean-LC-FA, kurt-RC-FA, std-CC-FA, etrp-RM-DA, skew-RR-MK, skew-RM-MK|
|DSB NC||mean-CC-AK, std-CC-AK, skew-LC-AWF, kurt-RR-Depar, kurt-CC-Depar, etrp-CC-Depar, etrp-LC-Deperp, skew-CC-Deperp, skew-LM-FA, skew-RM-FA, std-RR-MD, mean-RR-MK, kurt-LM-MK|
|LNS NC||mean-RM-AK, skew-RR-AWF, kurt-RR-AWF, mean-CC-AWF, std-LM-De-par, skew-LR-De-perp, std-LM-De-perp, mean-RR-DA, skew-RM-DA, skew-RC-DA, mean-CC-DA|
|DSF mTBI||skew-RM-AK, std-RR-AWF, etrp-RM-AWF, kurt-CC-AWF, std-LM-De-par, etrp-RM-De-par, std-LC-De-par, kurt-RC-De-par, std-RR-De-perp, mean-LM-De-perp, skew-LM-De-perp, kurt-LM-De-perp, kurt-RM-De-perp, kurt-LM-DA, std-RM-DA, kurt-LR-MK, std-RC-MK|
|DSB mTBI||std-LR-AK, mean-RM-AK, kurt-RM-AK, mean-LC-AK, kurt-LC-AK, mean-LR-AWF, skew-LR-AWF, etrp-LR-AWF, mean-RR-AWF, std-LM-AWF, skew-RM-AWF, mean-LC-AWF, std-LC-AWF, std-RC-AWF, skew-RC-AWF, etrp-LM-De-perp, etrp-LM-FA, etrp-RC-FA, std-CC-FA, std-RM-DA, mean-LC-DA, skew-LM-MD, std-LR-MK, std-RM-MK, mean-LC-MK|
|LNS mTBI||mean-LC-AK, kurt-LC-AK, skew-RC-AK, std-LR-AWF, skew-LM-AWF, std-LC-AWF, mean-RR-De-par, mean-RM-De-par, etrp-LC-De-par, kurt-RR-De-perp, mean-LM-FA, skew-LR-DA, mean-RR-DA, kurt-RR-DA, etrp-LM-DA, skew-LC-DA, mean-CC-DA, mean-LM-MD, mean-LM-MK, kurt-LM-MK|
|DSF combine||etrp-LC-AK, etrp-CC-AK, etrp-RR-AWF, skew-LC-AWF, std-RR-De-par, etrp-RR-De-par, skew-RM-De-par, std-CC-De-par, etrp-RR-FA, kurt-LR-DA, skew-LM-DA, kurt-RC-DA, skew-CC-DA, kurt-LR-MD, std-LM-MD, etrp-LC-MD, std-RC-MD, mean-CC-MD|
|DSB combine||skew-LM-AK, kurt-LC-AK, skew-LR-AWF, kurt-RR-AWF, mean-LC-AWF, etrp-CC-AWF, skew-LC-De-par, kurt-LC-De-par, skew-LM-De-perp, skew-RM-De-perp, etrp-RC-De-perp, mean-LR-FA, kurt-RR-DA, std-RM-DA, std-CC-DA, kurt-LC-MD, std-LM-MK, std-RC-MK|
|LNS combine||std-RR-AK, skew-RM-AK, skew-LC-AK, kurt-LC-AK, kurt-RC-AK, std-LM-AWF, kurt-LM-AWF, mean-LC-AWF, mean-CC-AWF, kurt-RM-De-par, etrp-RC-De-par, mean-CC-De-par, kurt-CC-De-par, skew-LC-De-perp, etrp-RC-De-perp, kurt-RR-FA, std-LM-FA, kurt-RM-FA, skew-RC-FA, etrp-RC-FA, kurt-RR-DA, std-RM-DA, kurt-LC-MD, std-CC-MK|
The features chosen by BFS with crossover are analyzed since they produce best accuracy. The number of times a diffusion MRI metric is chosen is accumulated and summarized in Figure 5.
For predicting the LNS test performance, it is interesting to observe that DA (Intra-axonal diffusivity, See Table I) metric is selected most often, for the modeled developed for the NC and MTBI cohorts, respectively. LNS is the most complex working memory task among these three tests and may have greater dependency on specific microstructural integrity more so than easy tasks. DA reflects axon injury or integrity and has been previously implicated in mTBI .
Additionally, it is noted that when counting the total number of times a metric is chosen over all three working memory tests, we see that for the models developed for the mTBI and control populations, respectively, the most frequently chosen features include De-par, De-perp, AWF, and DA. These compartment-specific metrics have been shown to be more sensitive to the underlying microstructure than others, such as DTI and DKI, which are known to be non-specific and empiric (See Table I).
Comparing the performances of separate models for the different cohorts (See Table II), we see that we are able to predict well with the models for the mTBI and NC cohorts, respectively. Furthermore, we see that the combined models (mTBI and NC together) are not as good with a weaker correlation coefficient for all three prediction tasks. The features chosen among these three populations for predicting the same NP score also differ (Figure 5). This suggests that mTBI and NP are two distinct populations in terms of white matter microstructure, in keeping with what we know about mTBI and white matter injury.
In this work, a new feature selection algorithm for predicting performance on working memory using diffusion MRI features is proposed. The algorithm is able to search over a large feature space effectively and achieved consistently better performance than other popular feature selection methods. This novel feature selection method is applicable to other classification and regression problems with large feature space and limited training data.
The prediction models using the selected features achieved quite high Pearson Correlation ( in all cases) with very low p-value (), demonstrating statistically significant agreement between the predicted scores and the measured working memory test scores. These results suggest that optimizing feature selection for predicting NP test performance has a great potential to reveal the most important imaging features that would be related to cognitive functions or cognitive impairments in mTBI patients.
Research reported in this paper is supported in part by grant funding from the National Institute for Neurological Disorders and Stroke (NINDS), National Institutes of Health (NIH): R21 NS090349, R01 NS039135-11, R01 NS088040 and NIBIB Biomedical Technology Resource Center Grant NIH P41 EB01718. The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIH.
-  M. Faul, M. M. Wald, L. Xu, and V. G. Coronado, “Traumatic brain injury in the united states; emergency department visits, hospitalizations, and deaths, 2002-2006,” 2010.
-  D. C. Voormolen, M. C. Cnossen, S. Polinder, N. Von Steinbuechel, P. E. Vos, and J. A. Haagsma, “Divergent classification methods of post-concussion syndrome after mild traumatic brain injury: prevalence rates, risk factors, and functional outcome,” Journal of neurotrauma, vol. 35, no. 11, pp. 1233–1241, 2018.
-  E. J. Grossman, M. Inglese, and R. Bammer, “Mild traumatic brain injury: is diffusion imaging ready for primetime in forensic medicine?” Topics in magnetic resonance imaging: TMRI, vol. 21, no. 6, p. 379, 2010.
-  M. E. Shenton, H. Hamoda, J. Schneiderman, S. Bouix, O. Pasternak, Y. Rathi, M.-A. Vu, M. P. Purohit, K. Helmer, I. Koerte et al., “A review of magnetic resonance imaging and diffusion tensor imaging findings in mild traumatic brain injury,” Brain imaging and behavior, vol. 6, no. 2, pp. 137–192, 2012.
-  S. Chung, E. Fieremans, X. Wang, N. E. Kucukboyaci, C. J. Morton, J. Babb, P. Amorapanth, F.-Y. A. Foo, D. S. Novikov, S. R. Flanagan et al., “White matter tract integrity: an indicator of axonal pathology after mild traumatic brain injury,” Journal of neurotrauma, vol. 35, no. 8, pp. 1015–1020, 2018.
-  S. Chung, X. Wang, E. Fieremans, R. Joseph, A. Prin, F. Farng-Yang A, C. Morton, N. Dmitry, F. Steven R, and Y. W. Lui, “Altered relationship between working memory and brain microstructure after mild traumatic brain injury,” American Journal of Neuroradiology, in press.
-  L. Miles, R. I. Grossman, G. Johnson, J. S. Babb, L. Diller, and M. Inglese, “Short-term dti predictors of cognitive dysfunction in mild traumatic brain injury,” Brain injury, vol. 22, no. 2, pp. 115–122, 2008.
-  Y. W. Lui, Y. Xue, D. Kenul, Y. Ge, R. I. Grossman, and Y. Wang, “Classification algorithms using multiple mri features in mild traumatic brain injury,” Neurology, vol. 83, no. 14, pp. 1235–1240, 2014.
-  S. Minaee, Y. Wang, and Y. W. Lui, “Prediction of longterm outcome of neuropsychological tests of mtbi patients using imaging features,” in 2013 IEEE Signal Processing in Medicine and Biology Symposium (SPMB). IEEE, Conference Proceedings, pp. 1–6.
-  S. Minaee, Y. Wang, A. Aygar, S. Chung, X. Wang, Y. W. Lui, E. Fieremans, S. Flanagan, and J. Rath, “Mtbi identification from diffusion mr images using bag of adversarial visual features,” IEEE transactions on medical imaging, 2019.
-  A. Makhzani, J. Shlens, N. Jaitly, I. Goodfellow, and B. Frey, “Adversarial autoencoders,” arXiv preprint arXiv:1511.05644, 2015.
-  P.-Y. Kao, E. Rojas, J. W. Chen, A. Zhang, and B. Manjunath, “Unsupervised 3-d feature learning for mild traumatic brain injury,” in International Workshop on Brainlesion: Glioma, Multiple Sclerosis, Stroke and Traumatic Brain Injuries. Springer, 2016, pp. 282–290.
J. Mitra, K.-k. Shen, S. Ghose, P. Bourgeat, J. Fripp, O. Salvado, K. Pannek, D. J. Taylor, J. L. Mathias, and S. Rose, “Statistical machine learning to identify traumatic brain injury (tbi) from structural disconnections of white matter networks,”NeuroImage, vol. 129, pp. 247–259, 2016.
-  R. Kohavi and G. H. John, “Wrappers for feature subset selection,” Artificial intelligence, vol. 97, no. 1-2, pp. 273–324, 1997.
-  J. M. Sattler and J. J. Ryan, Assessment with the WAIS-IV. Jerome M Sattler Publisher, 2009.
-  E. Fieremans, J. H. Jensen, and J. A. Helpern, “White matter characterization with diffusional kurtosis imaging,” Neuroimage, vol. 58, no. 1, pp. 177–188, 2011.
-  J. H. Jensen, E. T. McKinnon, G. R. Glenn, and J. A. Helpern, “Evaluating kurtosis-based diffusion mri tissue models for white matter with fiber ball imaging,” NMR in Biomedicine, vol. 30, no. 5, p. e3689, 2017.
-  G. Chandrashekar and F. Sahin, “A survey on feature selection methods,” Computers & Electrical Engineering, vol. 40, no. 1, pp. 16–28, 2014.
-  D. Rodrigues, L. A. Pereira, R. Y. Nakamura, K. A. Costa, X.-S. Yang, A. N. Souza, and J. P. Papa, “A wrapper approach for feature selection based on bat algorithm and optimum-path forest,” Expert Systems with Applications, vol. 41, no. 5, pp. 2250–2258, 2014.
-  J. E. Doran and D. Michie, “Experiments with the graph traverser program,” Proceedings of the Royal Society of London. Series A. Mathematical and Physical Sciences, vol. 294, no. 1437, pp. 235–259, 1966.
-  H. Arai, C. Maung, and H. Schweitzer, “Optimal column subset selection by a-star search,” in Twenty-ninth AAAI conference on artificial intelligence, 2015.
-  P. Pudil, J. Novovičová, and J. Kittler, “Floating search methods in feature selection,” Pattern recognition letters, vol. 15, no. 11, pp. 1119–1125, 1994.
-  J. Yang and V. Honavar, “Feature subset selection using a genetic algorithm,” in Feature extraction, construction and selection. Springer, 1998, pp. 117–136.
-  T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, ser. Springer Series in Statistics. Springer New York, 2013. [Online]. Available: https://books.google.com/books?id=yPfZBwAAQBAJ
-  J. H. Friedman, “Greedy function approximation: a gradient boosting machine,” Annals of statistics, pp. 1189–1232, 2001.
-  Y.-D. Zhang, Z.-J. Yang, H.-M. Lu, X.-X. Zhou, P. Phillips, Q.-M. Liu, and S.-H. Wang, “Facial emotion recognition based on biorthogonal wavelet entropy, fuzzy support vector machine, and stratified cross validation,” IEEE Access, vol. 4, pp. 8375–8385, 2016.