I Introduction
Optimizing the performance of systems of parts is a central task in an engineering design process. For example, in automotive or aerospace engineering, the shape of individual parts is commonly optimized to improve aerodynamic performance using computer-aided design (CAD) methods. Typically, engineers wish to understand which changes in a shape, carried out during the optimization, lead to the improved behavior. Here, it is often of interest to account for interactions between parameters, for example, to identify parameters that influence a shape's fitness only when considered jointly [1, 2]. We therefore present a novel, information-theoretic approach for identifying the optimization parameters most relevant to changes in a shape's fitness, which accounts for interactions between parameters with respect to the fitness. We further utilize recently introduced information-theoretic measures to quantify interactions between features. We demonstrate the applicability of our approach on a set of realistic turbofan rotor blade optimization runs [3], but strongly believe that it is of interest for a wide range of engineering design optimization scenarios.
Information theory [4] is a powerful tool for the analysis of dependencies between variables. Information-theoretic methods, such as the mutual information (MI), are model-free and able to capture dependencies of arbitrary order, while requiring only minimal assumptions about the data for their estimation when using state-of-the-art estimators [5]. These properties make information-theoretic measures particularly promising tools for the analysis of data in the engineering domain [6], for example, results from optimization runs [7]. Here, the relationship between parameters and the optimization objective is expected to be highly nonlinear, and the number of data samples is typically rather limited because the evaluation of fitness functions is costly. Furthermore, data distributions are typically not known and are expected to be highly biased because the data are generated by an optimization algorithm. As a result, high-quality global surrogate models that cover substantial parts of the search domain are most likely not available to understand optimization runs [3]. Thus, there is a need for methods that allow for a post-hoc analysis of optimization parameters and their influence on the optimization outcome.
We use a recently introduced algorithm for inferring relationships between variables that uses the conditional mutual information (CMI) as a selection criterion [8, 9]. Using the CMI for selecting variables makes it possible to account for interactions between variables, such as redundancies, but also synergistic contributions [10]. Furthermore, we use the recently introduced partial information decomposition (PID) framework to investigate selected variables for interactions with respect to the target variable. We apply our approach to data from realistic turbofan blade aerodynamic optimization runs that use computational fluid dynamics (CFD) to evaluate a shape's fitness [3]. We propose a parametrization of the turbofan blade geometry that allows for application of the proposed algorithm, and we compare our algorithm's performance to related information-theoretic feature selection criteria. To our knowledge, this work is the first to use PID for sensitivity analysis in aerodynamic optimization data.
II Methods
II-A Optimization and Simulation Setup
We use data from realistic optimization runs on turbofan rotor blade geometries that were previously described and published in [3]. For details on the data generation process, refer to the original publication. Fig. 1A shows a schematic of a turbojet engine and the corresponding turbofan rotor blade geometry.
A rotor blade is optimized by starting with a baseline shape that is modified under the objective of minimizing a target function. The shape is modified by deforming three cross-sections of the shape, where each section is a cylindrical cut of the geometry. We consider one section at the hub, one at midspan height, and one at the shroud of the blade, as indicated by the red lines in the inset of Fig. 1A.
Each section is deformed independently by the following manipulations: (i) rotation of the section around the leading edge (LE) point, (ii) movement of the section in the axial-meridional plane, and (iii) deformation of the section profile by adding Hicks-Henne shape functions [11], which is a common approach in 2D airfoil design and is illustrated in Fig. 1C. The Hicks-Henne function is defined as
(1) $b_j(x) = \sin^3\!\big(\pi\, x^{m_j}\big), \qquad m_j = \frac{\ln(0.5)}{\ln(x_{M_j})}, \qquad 0 \le x \le 1,$

where $x$ parametrizes the chord length of each section and $x_{M_j}$ is the location of the maximum of each shape function. We placed the maxima of the $n_b$ shape functions per section at equally spaced locations along the chord length, where $j = 1, \dots, n_b$. Considering all three possible manipulations, section rotation, movement, and deformation with Hicks-Henne functions, the total number of free shape parameters is $3(3 + n_b)$, i.e., one rotation parameter, two movement parameters, and $n_b$ bump amplitudes for each of the three sections.
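The deformation of Eq. (1) can be sketched in a few lines of Python; the function names and the common bump exponent of 3 are illustrative assumptions, not part of the original setup:

```python
import numpy as np

def hicks_henne(x, x_max, t=3.0):
    """Hicks-Henne bump, Eq. (1): maximum of 1 at x = x_max, zero at both
    ends of the chord; x is the chord fraction in [0, 1]."""
    m = np.log(0.5) / np.log(x_max)
    return np.sin(np.pi * np.asarray(x, float) ** m) ** t

def deform_section(x, base_y, amplitudes, maxima):
    """Add a weighted sum of bumps to a baseline section profile."""
    y = np.asarray(base_y, float).copy()
    for a, x_max in zip(amplitudes, maxima):
        y += a * hicks_henne(x, x_max)
    return y
```

The exponent of the power term is chosen such that the bump peaks exactly at the prescribed chord fraction $x_{M_j}$, which is what makes the parametrization intuitive for engineers.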
For the optimization of shape parameters, we used a covariance matrix adaptation evolution strategy (CMA-ES) [12] with a population size of 12, which we ran for 161 generations, amounting to 1932 evaluations, i.e., data samples, per run. The initial step size was chosen in relative units of the maximal allowed variation. We performed four optimization runs: two runs with $n_b = 3$ and two runs with $n_b = 7$ Hicks-Henne functions per section, which leads to 18 and 30 free parameters to be determined by the optimization, respectively. Each run was initialized using a different random seed. These parameter settings are derived from best practices that try to balance the exploration and exploitation capabilities of each optimization run, arrive at manageable optimization runtimes (each CFD simulation of a blade takes on the order of hours on multiple cores), and utilize the HPC infrastructure efficiently.
The optimization target was to maximize the aerodynamic efficiency of the rotor blade at cruising conditions, which is estimated by calculating the isentropic efficiency of the blade,

(2) $\eta_{is} = \dfrac{\left(p_{t,\mathrm{out}}/p_{t,\mathrm{in}}\right)^{(\gamma-1)/\gamma} - 1}{T_{t,\mathrm{out}}/T_{t,\mathrm{in}} - 1},$

where $p_t$ and $T_t$ are the mass-flow averaged total pressure and total temperature at the specified location and $\gamma$ is the heat capacity ratio (see, for example, [13]).
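Eq. (2) translates directly into code; the variable names and the default $\gamma = 1.4$ for air are illustrative assumptions:

```python
def isentropic_efficiency(p_t_in, T_t_in, p_t_out, T_t_out, gamma=1.4):
    """Isentropic efficiency, Eq. (2), from mass-flow averaged total
    pressures and total temperatures up- and downstream of the rotor."""
    ideal = (p_t_out / p_t_in) ** ((gamma - 1.0) / gamma) - 1.0
    actual = T_t_out / T_t_in - 1.0
    return ideal / actual
```

For an ideal (loss-free) compression, where the temperature ratio equals the pressure ratio raised to $(\gamma - 1)/\gamma$, the efficiency evaluates to 1.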
The boundary conditions of the CFD simulation mimic the behavior of a jet engine under cruising conditions.
Each blade was evaluated with a CFD simulation which employed the compressible flow solver steadyCompressibleMRFFoam from the OpenFOAM CFD suite (version foam-extend-3.2), adapted to be more robust for transonic simulations [14]. The fitness of a blade was calculated as
(3) $F = -\bar{\eta}_{is} + P,$

where $\bar{\eta}_{is}$ denotes the isentropic efficiency of the blade, Eq. (2), averaged over the last 100 iterations of the solver, and $P$ represents a penalty term that increases, and thus worsens, the fitness if the CFD simulation does not show good convergence or if the generated blade geometry is not feasible. See the original publication [3] for more details on the simulation setup, the optimization, and the data generation. The fitness values during the optimization runs, as a function of the generations, are shown in Fig. 2.
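As a minimal numerical illustration of Eq. (3) (the actual penalty computation is described in [3]; the names here are ours):

```python
import numpy as np

def blade_fitness(eta_trace, penalty):
    """Fitness per Eq. (3): negative mean isentropic efficiency over the
    last 100 solver iterations, worsened (increased) by a penalty term."""
    return -np.mean(eta_trace[-100:]) + penalty
```

Because the fitness is minimized, a higher averaged efficiency lowers $F$, while any convergence or feasibility penalty raises it.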
II-B Feature Extraction of Turbofan Blade Geometries for Sensitivity Analysis
We ran four optimizations with varying numbers of parameters for shape modification. In a next step, we wished to identify the locations at which modifications were most relevant to a blade's fitness. To apply the proposed information-theoretic approach, we first had to find suitable features that represent the blade geometry's surface and can be used as input features for the algorithm (e.g., [15]). To this end, we considered multiple sectional cuts through the blade geometry. At each sectional cut, we placed points equally spaced along the chord line and recorded the absolute distance from the actual blade surface to the chord line at these locations (Fig. 1B). Furthermore, we considered the $x$- and $y$-coordinates of the leading edge (LE), $x_{LE}$ and $y_{LE}$, as well as the $x$- and $y$-coordinates of the trailing edge (TE), $x_{TE}$ and $y_{TE}$. We varied the number of points and sections used for the features to investigate the stability of the results over various representations of the geometry. We used 3 sectional cuts with 2 points, resulting in 18 features; 3 cuts with 3 points, resulting in 21 features; 5 cuts with 5 points, resulting in 45 features; and 10 cuts with 8 points, resulting in 120 features (Fig. 1D).
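To make the feature definition concrete, a minimal sketch of such an extraction is given below; the interface, the assumption that each section is available as a single surface polyline ordered from LE to TE, and the equally spaced sampling of chord fractions are our illustrative choices:

```python
import numpy as np

def section_features(xs, ys, n_points):
    """Distances from the blade surface to the chord line at n_points
    equally spaced chord fractions, plus the LE and TE coordinates.
    Assumes (xs, ys) trace one surface side from the LE to the TE."""
    xs, ys = np.asarray(xs, float), np.asarray(ys, float)
    le, te = np.array([xs[0], ys[0]]), np.array([xs[-1], ys[-1]])
    chord = te - le
    length = np.linalg.norm(chord)
    t_hat = chord / length
    # position of each surface point along and normal to the chord line
    rel = np.stack([xs - le[0], ys - le[1]], axis=1)
    along = rel @ t_hat / length                                # chord fraction
    dist = np.abs(rel[:, 0] * t_hat[1] - rel[:, 1] * t_hat[0])  # normal distance
    fractions = np.linspace(0.0, 1.0, n_points + 2)[1:-1]
    feats = np.interp(fractions, along, dist)
    return np.concatenate([feats, le, te])
```

Stacking these per-section vectors over all sectional cuts yields the flat feature vector used as input to the selection algorithm.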
II-C Information-Theoretic Preliminaries
Before introducing the algorithm used to identify the most relevant locations of modification, we introduce the necessary information-theoretic preliminaries (for a more detailed introduction, see [16]).
The algorithm uses the conditional mutual information (CMI) to quantify the influence a single feature has on the fitness, in the context of further features. The CMI is defined as

(4) $I(X;Y \mid Z) = \sum_{x,y,z} p(x,y,z)\, \log_2 \frac{p(x,y \mid z)}{p(x \mid z)\, p(y \mid z)},$

where $X$, $Y$, and $Z$ are random variables with realizations $x$, $y$, and $z$, and $p(x)$ is a shorthand for the probability distribution $P(X = x)$. The CMI quantifies the average information that $X$ has about $Y$, given that the outcome of $Z$ is known. The CMI is symmetric in $X$ and $Y$, i.e., $I(X;Y \mid Z) = I(Y;X \mid Z)$. Further, each random variable may also be replaced by a set of variables, e.g., $\mathbf{X} = \{X_1, \dots, X_n\}$, thus quantifying the information a set of variables provides about a second variable, or set of variables.

Note that conditioning the information $X$ provides about $Y$ on a third variable, $Z$, may have two effects: first, information that is redundantly present in both $X$ and $Z$ about $Y$ is removed from the information $X$ alone provides about $Y$ (as measured by the unconditioned MI, $I(X;Y)$). Second, information that is provided synergistically by $X$ and $Z$ together about $Y$ is added to the information $X$ alone provides [17]. Hence, the CMI quantifies the information $X$ provides uniquely about $Y$ and the information $X$ and $Z$ provide jointly about $Y$ in a synergistic fashion; at the same time, redundant contributions in $X$ and $Z$ about $Y$ are excluded. See also [10] for a discussion of the use of the CMI for feature selection.
As an example of a synergistic information contribution, consider a binary XOR gate with i.i.d. inputs, $X_1$ and $X_2$, and output $Y = X_1 \oplus X_2$. Inputs $X_1$ and $X_2$ alone each provide no information about the output $Y$, such that $I(X_1;Y) = I(X_2;Y) = 0$. Only by conditioning on the second input is the information the first input provides "decoded", and $I(X_1;Y \mid X_2) = 1$ bit. Here, the two inputs provide information about the output in an exclusively synergistic fashion.
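This textbook example can be verified numerically from the XOR joint distribution, using the entropy identities $I(X;Y) = H(X) + H(Y) - H(X,Y)$ and $I(X;Y \mid Z) = H(X,Z) + H(Y,Z) - H(Z) - H(X,Y,Z)$:

```python
import itertools, math

# joint distribution of (x1, x2, y) for an XOR gate with iid uniform inputs
joint = {(x1, x2, x1 ^ x2): 0.25 for x1, x2 in itertools.product([0, 1], repeat=2)}

def entropy(dist):
    """Shannon entropy in bits of a distribution given as {outcome: prob}."""
    return -sum(v * math.log2(v) for v in dist.values() if v > 0)

def marginal(idx):
    """Marginal distribution over the components listed in idx."""
    out = {}
    for k, v in joint.items():
        key = tuple(k[i] for i in idx)
        out[key] = out.get(key, 0.0) + v
    return out

# I(X1; Y) = H(X1) + H(Y) - H(X1, Y)
mi = entropy(marginal([0])) + entropy(marginal([2])) - entropy(marginal([0, 2]))
# I(X1; Y | X2) = H(X1, X2) + H(X2, Y) - H(X2) - H(X1, X2, Y)
cmi = (entropy(marginal([0, 1])) + entropy(marginal([1, 2]))
       - entropy(marginal([1])) - entropy(joint))
# mi evaluates to 0 bit, cmi to 1 bit
```

The MI of either input with the output vanishes, while the CMI recovers the full bit, exactly as stated above.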
The framework to decompose the information two variables contribute about a third into unique, redundant, and synergistic contributions has only recently been introduced and is termed Partial Information Decomposition (PID) [17] (Fig. 3A, see also [18, 19]). PID extends classical information theory by providing axioms that allow decomposing the joint information two input variables $X_1$ and $X_2$ provide about a target variable $Y$, $I(X_1, X_2; Y)$, into the information provided uniquely by each of $X_1$ and $X_2$, the information provided redundantly by $X_1$ and $X_2$, and the information provided synergistically when considering $X_1$ and $X_2$ jointly. Note that such a detailed decomposition of the information contributed by two variables about a third was not possible using existing information-theoretic concepts, e.g., the (C)MI or Shannon entropy, as shown by Williams and Beer [17] and illustrated in Fig. 3B.
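For the XOR gate, the full decomposition can be computed by hand. The sketch below implements the original $I_{min}$ redundancy measure of Williams and Beer [17] for two discrete sources; note that the estimator used later in this work [23] is a different PID measure, so this is only meant to illustrate the decomposition:

```python
import itertools, math

# joint distribution p(x1, x2, y) for an XOR gate with iid uniform inputs
p = {(x1, x2, x1 ^ x2): 0.25 for x1, x2 in itertools.product([0, 1], repeat=2)}

def entropy(dist):
    return -sum(v * math.log2(v) for v in dist.values() if v > 0)

def marginal(idx):
    out = {}
    for k, v in p.items():
        key = tuple(k[i] for i in idx)
        out[key] = out.get(key, 0.0) + v
    return out

def specific_info(src, y):
    """Specific information I(X_src; Y = y) a single source carries about y."""
    p_y = marginal([2])[(y,)]
    p_s, p_sy = marginal([src]), marginal([src, 2])
    return sum((v / p_y) * math.log2((v / p_s[(s,)]) / p_y)
               for (s, yy), v in p_sy.items() if yy == y)

# I_min redundancy: expected minimum specific information over the sources
red = sum(pv * min(specific_info(0, y), specific_info(1, y))
          for (y,), pv in marginal([2]).items())

# classical MI terms via I(A;B) = H(A) + H(B) - H(A,B)
joint_mi = entropy(marginal([0, 1])) + entropy(marginal([2])) - entropy(p)
mi_1 = entropy(marginal([0])) + entropy(marginal([2])) - entropy(marginal([0, 2]))
mi_2 = entropy(marginal([1])) + entropy(marginal([2])) - entropy(marginal([1, 2]))

unq_1, unq_2 = mi_1 - red, mi_2 - red
syn = joint_mi - unq_1 - unq_2 - red
# for XOR: red = unq_1 = unq_2 = 0 and syn = 1 bit
```

The four PID atoms always sum to the joint MI, so for XOR the entire bit of joint information is attributed to synergy, consistent with the CMI-based argument above.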
In the present work, we use the PID framework to identify interactions between features with respect to the blade’s fitness. In particular, we estimate the synergistic information contribution of features and sets of features to identify those feature combinations that provide information about the fitness primarily when considered jointly.
II-D Identification of Most Relevant Features Using an Information-Theoretic Feature Selection Algorithm
We used a recently introduced forward-selection algorithm [8, 9, 10] to identify the most relevant blade features with respect to the optimization outcome. The algorithm uses a CMI criterion for iterative feature selection, which measures the MI between a feature to be selected and the fitness, conditional on all already selected features. Thus, the CMI criterion includes features not only based on their individual (unique) information contribution to the fitness, but also accounts for synergistic effects between the currently considered feature and the already selected feature set. Lastly, the inclusion criterion ensures that redundancies between features are avoided. For a detailed discussion of the algorithm and of the CMI as a feature-selection criterion, see [10]. See Algorithm 1.
The algorithm starts with an empty feature set $S_0 = \emptyset$, the set of all input variables, $\mathbf{X}$, and the target variable $Y$. Features are selected iteratively, where in each iteration, $i$, the algorithm selects the feature $X^*$ that maximizes the criterion,

(5) $X^* = \underset{X_j \in \mathbf{X}_i}{\arg\max}\; I(X_j; Y \mid S_i),$

where $\mathbf{X}_i$ denotes the remaining input variables in iteration $i$, and $S_i$ the set of already selected features. The identified maximum contribution is tested for statistical significance using non-parametric permutation testing and a testing scheme that controls the family-wise error rate (see [9] for a detailed description of the test). If the information contributed by $X^*$ as measured by the CMI is statistically significant, $X^*$ is included in the set of selected features and removed from the set of remaining variables,

(6) $S_{i+1} = S_i \cup \{X^*\}, \qquad \mathbf{X}_{i+1} = \mathbf{X}_i \setminus \{X^*\}.$
Note that statistical testing of the CMI estimate is necessary because, while in theory the CMI is zero for (conditionally) independent variables, this may not be the case when estimating the CMI from finite data, due to the known bias of information-theoretic estimators (e.g., [20]). Instead, the test evaluates whether the estimate significantly differs from the distribution of estimates from permuted data, and thus tests the null hypothesis of no dependence between the feature and the target in the context of the already selected feature set. The statistical test not only handles the estimation bias, but also provides an automatic stopping criterion for feature selection, because the algorithm stops if no remaining variable provides significant information about the target, given the already selected feature set. The number of features included in the selected feature set can be influenced indirectly by changing the critical alpha-level, $\alpha_{crit}$, of the statistical test, i.e., the threshold an individual test in iteration $i$ has to pass to allow for inclusion of the candidate feature $X^*$. Lowering $\alpha_{crit}$ leads to a more strict criterion, and thus to the selection of fewer features in general, and vice versa.

For practical estimation, we use an implementation of the algorithm as part of the IDTxl Python toolbox [8, 9, 10], which uses a k-nearest-neighbor-based estimator for MI and CMI estimation from continuous data [5], which, while not being bias-free, has been shown to provide the most favorable bias properties compared to other approaches [5, 21, 22].
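The selection loop of Eqs. (5) and (6) can be sketched as follows. This is a deliberately simplified stand-in for the IDTxl implementation: it uses a crude equal-frequency-binning plug-in estimator instead of the k-nearest-neighbor CMI estimator, and a plain per-step permutation test without the hierarchical family-wise error control of [9]:

```python
import numpy as np

def _H(*cols):
    """Joint Shannon entropy (bits) of discretized columns; H() = 0."""
    if not cols:
        return 0.0
    _, counts = np.unique(np.stack(cols, axis=1), axis=0, return_counts=True)
    prob = counts / counts.sum()
    return -np.sum(prob * np.log2(prob))

def binned_cmi(x, y, cond, bins=4):
    """Plug-in estimate of I(X; Y | cond) from equal-frequency-binned data."""
    def dig(v):
        edges = np.quantile(v, np.linspace(0, 1, bins + 1)[1:-1])
        return np.digitize(v, edges)
    xd, yd, zd = dig(x), dig(y), [dig(c) for c in cond]
    return _H(xd, *zd) + _H(yd, *zd) - _H(*zd) - _H(xd, yd, *zd)

def forward_select(X, y, alpha=0.05, n_perm=200, seed=0):
    """Greedy CMI forward selection, Eqs. (5)-(6), with a per-step
    permutation stopping test."""
    rng = np.random.default_rng(seed)
    remaining, selected = list(range(X.shape[1])), []
    while remaining:
        cond = [X[:, j] for j in selected]
        scores = {j: binned_cmi(X[:, j], y, cond) for j in remaining}
        best = max(scores, key=scores.get)
        # null distribution: permuting the candidate destroys its link to y
        null = np.array([binned_cmi(rng.permutation(X[:, best]), y, cond)
                         for _ in range(n_perm)])
        if np.mean(null >= scores[best]) >= alpha:
            break  # maximum contribution not significant -> stop
        selected.append(best)
        remaining.remove(best)
    return selected
```

Because every candidate is scored conditional on the already selected set, redundant features score near zero and are rejected by the permutation test, while synergistic partners of already selected features can still be picked up.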
II-E Post-Hoc Analysis of Feature Interactions by Estimating Synergistic Information Contribution
After selecting the most relevant geometric features for each optimization run using the presented forwardselection algorithm, we identify interactions between features with respect to the fitness by estimating the synergy between all pairs of selected features and the fitness. We use a PID estimator introduced in [23], also implemented in the IDTxl toolbox [8].
III Results
III-A Identified Features and Interactions Between Features
The locations of features for the four optimization runs and the four extracted feature sets of the blade surface are shown in Fig. 4. Here, the first two markers in each row indicate the $x$- and $y$-coordinates of the leading edge, $x_{LE}$, $y_{LE}$, while the last two markers indicate the coordinates of the trailing edge, $x_{TE}$, $y_{TE}$ (both in blue). The bottom row indicates the section closest to the hub, while the top row indicates the section closest to the shroud. Markers between the first and last two markers in each row indicate the geometric features from left to right, indexed by section number from hub to shroud and by position along the chord. Hence, the total number of input variables per feature set was 18, 21, 45, and 120, respectively. Panels A and B, and panels C and D, each show optimization runs with identical setup but different random initialization, for runs with 18 free parameters (A and B) and 30 free parameters (C and D).
Colored markers indicate relevant features identified by the algorithm. Dashed lines indicate the three pairs of features with highest synergy over all feature pairs.
We first note that the selected features are not completely consistent between runs, which is expected. The data for each case were generated by an optimization run, which is a highly structured process, and therefore the feature space is sampled very inhomogeneously. Additionally, the blade regions with the largest deformations differ between runs [3], leading to variations in the extracted features. However, there are regions which are identified as important in all runs, for example, in the forward part of the chord, close to the LE, in the region from midspan to blade tip (i.e., the upper forward region). This region is expected to have high influence due to the build-up of the shock system [24]. Similarly, the region near the TE, and in particular close to the tip, directly influences the exit-flow angle and thus strongly affects the efficiency. The location at the hub is also consistently identified as important, but the exact location along the chord line varies between the runs.
Comparing the selected features of each run between the different feature sets provides a consistent picture for the three smaller feature sets (18, 21, and 45 features). The apparent differences can be understood by considering the peculiarities of the data and the selection method. First of all, strong correlations and redundancy are expected within each feature set, due to the deformation method used to generate the blades: only three sections (at hub, midspan, and shroud) were allowed to change independently, and the changes were linearly interpolated in between, so that many features are linear combinations of others. In addition, the Hicks-Henne-based deformation of each section induces smooth changes with possibly highly correlated, and thus potentially redundant, neighboring features. Also, the optimization algorithm induces correlated changes of parameters, i.e., blade regions, once it starts to converge to some (local) optimum. Therefore, the features selected on the midspan section in the smaller feature sets are replaced by (Fig. 4A and B), or augmented with (C and D), more informative features on the second section from the tip. Correspondingly, high values of redundancy are observed between the selected features and the non-selected features that lie close to the locations of the features selected from the smaller sets (not shown).

For the largest feature set, with 120 features, the selected features are consistent with the smaller feature sets in the above-described manner for case D, but are only partially consistent, or even seem inconsistent, for the other cases A, B, and C. This is understandable from the insights described above. Extracting 120 features from designs that were created with only 18 (A and B) or 30 (C and D) independent parameters constitutes a vast over-parametrization of the independent influence factors and results in huge redundancy in the feature set. In that case, the selected features are strongly influenced by the statistical variations of the rather few and highly structured 1932 data samples. Multiple sets of features could be selected that would be almost equally informative regarding the fitness, but that distribute the selected features differently over the blade regions. Which set is finally selected is strongly influenced by its ability to describe the statistical fluctuations of the data set. From a theoretical perspective this is correct, as the selected features represent the most informative features with respect to the fitness values for the given data set. However, the value to the engineer might be limited, as the most informative set does not necessarily represent the most important engineering design changes.
III-B Prediction of Optimization Results
To validate the identified set of relevant parameters for each combination of feature-set size and optimization run, we used the selected features to predict the fitness values of each blade across the optimization run. We compared the features selected by our algorithm to features selected by the FEAST toolbox [25] and to features selected by standard machine-learning approaches (linear Pearson correlation, MI, decision trees, extra trees, random forests, LARS).
The FEAST toolbox implements a variety of information-theoretic feature selection criteria based on the MI and applies them to rank features. These criteria do not consider interactions between features, i.e., features are evaluated solely based on their individual contribution to the target. Hence, synergistic effects as well as redundancies are not accounted for (see also [10] for a comparison of these selection criteria to the regular CMI). Moreover, the toolbox provides neither a means to handle estimator bias nor an automatic stopping criterion for feature inclusion. As the toolbox only handles discrete variables, we binned the data prior to feature selection.
We used the following selection criteria implemented in FEAST: Joint MI (JMI) [26], MI Maximization (MIM) [27], Max-Relevance Min-Redundancy (MRMR) [28], Conditional MI Maximization (CMIM) [29], Double Input Symmetrical Relevance (DISR) [30], Conditional Infomax Feature Extraction (CIFE) [31], Interaction Capping (ICAP) [32], Conditional Redundancy [25], Relief [33], and the CMI estimated from binned data. We predicted the fitness from the different selected feature sets using k-nearest-neighbor regression. Since the FEAST toolbox does not provide a stopping criterion, but just ranks the features by importance, we performed predictions from feature sets up to a size of 10 features, which was the maximum feature set size identified by our algorithm through statistical testing.

Prediction results for the various identified feature sets using the FEAST toolbox, a selection of standard feature-selection methods from machine learning, and our proposed algorithm are shown in Fig. 5. The algorithms we compared our approach against do not provide a stopping criterion, but only rank features by their importance. Hence, for each algorithm we predicted the fitness using various sets of the highest-ranked variables to allow for comparison to our solution. The plots show the prediction error for feature sets of sizes up to 10. In many cases, the feature sets from standard machine-learning approaches did not provide accurate predictions. Only in a few cases (runs C and D, run A, and runs B and C with the largest feature set) did one of the standard methods allow for rather accurate predictions compared to selected feature sets of the same size. The feature sets selected with the MRMR method, one of the best-performing methods from the FEAST toolbox, performed quite well, but only when a small number of features was selected; the relative performance dropped for larger sets of selected features. The proposed method based on CMI feature selection performed well in all studied situations, as it consistently gave a good trade-off solution with respect to feature set size and prediction accuracy.
In 14 out of the 16 considered combinations of optimization run and feature-set size, our algorithm selected the best feature set among all feature sets of the same or smaller size. In 6 of these cases, the selected feature set performed best across all feature sets of any size. The other methods did not provide feature sets with such consistently good prediction performance, as can be seen, for example, for the LARS (blue crosses) and MRMR (red crosses) methods in Fig. 5, which performed well for some configurations but did not return good results consistently.
Generally, we observed that many different feature sets led to similar prediction performance, especially for the largest feature set with 120 features, which supports our previous analysis that this parametrization led to highly redundant and correlated features. Nevertheless, the proposed CMI-based feature selection algorithm still identified meaningful feature sets that were not too large and allowed for good prediction performance.
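The validation described above can be sketched with a plain k-nearest-neighbor regressor; the hold-out split, the default $k = 5$, and the helper names are our illustrative choices:

```python
import numpy as np

def knn_predict(X_train, y_train, X_test, k=5):
    """Plain k-nearest-neighbor regression (Euclidean distance)."""
    d = np.linalg.norm(X_test[:, None, :] - X_train[None, :, :], axis=2)
    nn = np.argsort(d, axis=1)[:, :k]
    return y_train[nn].mean(axis=1)

def subset_error(X, y, feature_idx, k=5, train_frac=0.7, seed=0):
    """Hold-out mean squared error of predicting the fitness
    from a given subset of features."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(y))
    n_tr = int(train_frac * len(y))
    tr, te = order[:n_tr], order[n_tr:]
    Xs = np.asarray(X)[:, feature_idx]
    pred = knn_predict(Xs[tr], y[tr], Xs[te], k=k)
    return float(np.mean((pred - y[te]) ** 2))
```

A lower hold-out error for one feature subset than for another of the same size then indicates a more informative selection.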
IV Discussion
We applied a recently introduced information-theoretic approach to feature selection [10] to sensitivity analysis of optimization data. A strong conceptual and practical advantage of the proposed feature-selection approach is its ability to account for interactions between variables when selecting features, such that the selection of redundant features is avoided while features that contribute information synergistically together with other features are included. A further significant advantage of the algorithm for the present application is its ability to automatically determine the number of relevant features by means of statistical testing, whereas for most established methods the number of features has to be fixed in advance. Furthermore, we used the recently introduced partial information decomposition (PID) framework [17] to identify feature interactions.
We successfully applied the approach to four realistic aerodynamic optimization runs and showed that the feature sets identified by the proposed algorithm always provided a good trade-off solution with respect to feature set size and prediction performance. In most of the cases (14 out of 16), the selected feature set could be used to predict the optimization's objective function with a smaller error than feature sets of the same or smaller size identified by existing approaches.
Central to the proposed approach is its ability to identify feature sets while accounting for interactions between features, and to identify synergistic interactions. This property is especially desirable in application domains where optimization parameters are expected to show interactive effects on the target function. Such an analysis was previously not possible using the MI or its extensions, for example, the interaction information [34], which was proposed for the analysis of interactions in design data in earlier studies (e.g., [1, 2]). However, it was shown that these measures are not able to disentangle redundant and synergistic contributions, and that such a decomposition required the axiomatic extension of classical information theory provided by the PID framework [17] (see also [18]). Accordingly, the development of information-theoretic filtering methods accounting for interactions has not advanced in recent years, such that methods which often assume variable independence are still a common approach (e.g., MRMR [35, 36]). We believe that this stagnation is partially due to the inherent inability of classical information theory to describe multivariate information contributions, a capability that has only become available with the introduction of PID [17, 10]. Hence, PID enables the information-theoretic quantification of interactions in design applications as defined in [6]: "a design interaction is defined as a unique dependency between design and objective parameters from which all dependencies of lower ordinality are removed".
The algorithm used for feature selection employs statistical testing to handle the bias in information-theoretic estimates. Statistical testing furthermore provides an automatic stopping criterion, as it can reveal that an estimate is not significantly different from an estimate from data with no relationship. Using statistical testing in feature selection has been proposed before, for example, by [37]. However, the approach used here is the first to rigorously control the family-wise error rate when testing repeatedly during iterative feature selection [9].
The algorithm accounts for redundant and synergistic contributions during the identification of relevant features by conditioning on the set of all already selected features. A limitation is that, due to the iterative inclusion, variables that provide purely synergistic information cannot be detected. To handle this scenario, one may start feature selection with a non-empty set, e.g., some random subset or a subset informed by prior knowledge. Alternatively, one may include variable tuples instead of individual variables [38].
A further limiting factor is the number of features that the algorithm is able to select given a certain amount of data. If the selected feature set becomes too large, CMI estimation suffers from the curse of dimensionality, such that the CMI can no longer be estimated reliably from the available data. As a result, the estimate fails to reach statistical significance and the algorithm terminates. However, in sensitivity analysis it is typically the goal to identify a small set of most relevant features that can still be meaningfully interpreted by a human. As shown here, the algorithm was able to identify up to 10 informative variables from fewer than 2000 highly biased samples.

Regarding the engineering task of identifying the most influential regions of the shape design, the proposed approach gave satisfactory results, as features located at known highly influential regions were successfully identified. Also, the high degree of redundancy and correlation in the feature sets, which is a natural consequence of the smoothness of the shape deformations, is handled well by the approach.
Future work may focus on visualization and interpretation of the results to provide a more intuitive picture to the engineer, who is potentially not well-versed in information theory.
We conclude that the proposed algorithm [8, 9, 10], together with the recently introduced PID framework [17, 18, 19] and suitable estimators [5, 23], provides a valuable tool for the assessment of optimization outcomes in practical applications. In particular, the interaction-aware feature selection, together with the estimation of synergistic effects, makes it possible to identify interactions between optimization parameters, which was previously not possible using information-theoretic methods. Thus, the novel extension to information-theoretic analysis presented here provides powerful tools for quantifying relationships in a wide range of application domains concerned with the analysis of data from nonlinear systems.
References
 [1] L. Graening, M. Olhofer, and B. Sendhoff, “Interaction detection in aerodynamic design data,” in International Conference on Intelligent Data Engineering and Automated Learning (IDEAL 2009). Lecture Notes in Computer Science. Springer, 2009, vol. 5788, pp. 160–167.

[2] M. Rath and L. Graening, "Modeling design and flow feature interactions for automotive synthesis," Lecture Notes in Computer Science, vol. 6936, pp. 262–270, 2011.
 [3] J. Kmec and S. Schmitt, "Exploring the fitness landscape of a realistic turbofan rotor blade optimization," in 6th International Conference on Engineering Optimisation (EngOpt 2018), 2018.
 [4] C. E. Shannon, “A mathematical theory of communication,” The Bell System Technical Journal, vol. 27, pp. 379–423, 1948.
[5] A. Kraskov, H. Stögbauer, and P. Grassberger, "Estimating mutual information," Physical Review E, vol. 69, p. 066138, 2004.
 [6] L. Graening and B. Sendhoff, “Shape mining: A holistic data mining approach for engineering design,” Advanced Engineering Informatics, vol. 28, pp. 166–185, 2014.

[7] L. Graening, S. Menzel, T. Ramsay, and B. Sendhoff, "Application of sensitivity analysis for an improved representation in evolutionary design optimization," in IEEE International Conference on Genetic and Evolutionary Computing, 2012, pp. 1–4.
[8] P. Wollstadt, J. T. Lizier, R. Vicente, C. Finn, M. Martínez-Zarzuela, P. A. M. Mediano, L. Novelli, and M. Wibral, "IDTxl: The Information Dynamics Toolkit xl: a Python package for the efficient analysis of multivariate information dynamics in networks," Journal of Open Source Software, vol. 4, p. 1081, 2019. [Online]. Available: https://github.com/pwollstadt/IDTxl
 [9] L. Novelli, P. Wollstadt, P. A. M. Mediano, M. Wibral, and J. T. Lizier, "Large-scale directed network inference with multivariate transfer entropy and hierarchical statistical testing," Network Neuroscience, vol. 3, pp. 827–847, 2019.
 [10] P. Wollstadt, S. Schmitt, and M. Wibral, “A rigorous information-theoretic definition of redundancy and relevancy in feature selection based on (partial) information decomposition,” arXiv preprint, arXiv:2105.04187 [cs.IT], 2021.
 [11] R. M. Hicks and P. A. Henne, “Wing design by numerical optimization,” Journal of Aircraft, vol. 15, pp. 407–412, 1978.
 [12] N. Hansen, “The CMA evolution strategy: a comparing review,” in Towards a new evolutionary computation. Studies in Fuzziness and Soft Computing, vol 192. Berlin: Springer, 2006, pp. 75–102.
 [13] E. A. Baskharone, Principles of Turbomachinery in Air-Breathing Engines. Cambridge, UK: Cambridge University Press, 2014.
 [14] H. Rusche and S. Schmitt, “Stability improvements of pressure-based compressible solver and validation for industrial turbo machinery applications,” in 4th Annual OpenFOAM User Conference, 2016.
 [15] L. Graening, S. Menzel, M. Hasenjäger, T. Bihrer, M. Olhofer, and B. Sendhoff, “Knowledge extraction from aerodynamic design data and its application to 3D turbine blade geometries,” Journal of Mathematical Modelling and Algorithms, vol. 7, pp. 329–350, 2008.
 [16] T. M. Cover and J. A. Thomas, Elements of Information Theory, 2nd ed. Hoboken, New Jersey: John Wiley & Sons, Inc., 2006.
 [17] P. L. Williams and R. D. Beer, “Nonnegative decomposition of multivariate information,” arXiv Preprint arXiv:1004.2515 [cs.IT], 2010.
 [18] A. J. Gutknecht, M. Wibral, and A. Makkeh, “Bits and pieces: Understanding information decomposition from part-whole relationships and formal logic,” Proceedings of the Royal Society A, vol. 477, p. 20210110, 2021.
 [19] A. Makkeh, A. J. Gutknecht, and M. Wibral, “Introducing a differentiable measure of pointwise shared information,” Physical Review E, vol. 103, p. 032149, 2021.
 [20] L. Paninski, “Estimation of entropy and mutual information,” Neural Computation, vol. 15, pp. 1191–1253, 2003.
 [21] S. Khan, S. Bandyopadhyay, A. R. Ganguly, S. Saigal, D. J. Erickson III, V. Protopopescu, and G. Ostrouchov, “Relative performance of mutual information estimation methods for quantifying the dependence among short and noisy data,” Physical Review E, vol. 76, p. 026209, 2007.
 [22] G. Doquire and M. Verleysen, “A comparison of multivariate mutual information estimators for feature selection,” in Proceedings of the 1st International Conference on Pattern Recognition Applications and Methods, 2012, pp. 176–185.
 [23] A. Makkeh, D. O. Theis, and R. Vicente, “BROJA-2PID: A robust estimator for bivariate partial information decomposition,” Entropy, vol. 20, p. 271, 2018.
 [24] T. Sonoda, R. Schnell, T. Arima, G. Endicott, and E. Nicke, “A study of a modern transonic fan rotor in a low reynolds number regime for a small turbofan engine,” in Turbo Expo: Power for Land, Sea, and Air. Volume 6A: Turbomachinery, vol. 55225, 2013, p. V06AT35A032.
 [25] G. Brown, A. Pocock, M.-J. Zhao, and M. Luján, “Conditional likelihood maximisation: A unifying framework for mutual information feature selection,” Journal of Machine Learning Research, vol. 13, pp. 27–66, 2012.
 [26] H. H. Yang and J. Moody, “Data visualization and feature selection: New algorithms for nongaussian data,” in Advances in Neural Information Processing Systems (NIPS ’99), vol. 12, 1999, pp. 687–693.
 [27] D. D. Lewis, “Feature selection and feature extraction for text categorization,” in Proceedings of the Workshop on Speech and Natural Language. Association for Computational Linguistics, 1992, pp. 212–217.
 [28] H. Peng, F. Long, and C. Ding, “Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy,” IEEE Transactions on Pattern Analysis & Machine Intelligence, vol. 27, pp. 1226–1238, 2005.
 [29] F. Fleuret, “Fast binary feature selection with conditional mutual information,” Journal of Machine Learning Research, vol. 5, pp. 1531–1555, 2004.
 [30] P. E. Meyer and G. Bontempi, “On the use of variable complementarity for feature selection in cancer classification,” in Workshops on Applications of Evolutionary Computation. Springer, 2006, pp. 91–102.
 [31] D. Lin and X. Tang, “Conditional infomax learning: an integrated framework for feature extraction and fusion,” in European Conference on Computer Vision (ECCV 2006). Springer, 2006, pp. 68–82.
 [32] A. Jakulin, “Machine learning based on attribute interactions,” Ph.D. dissertation, Univerza v Ljubljani, 2005.
 [33] K. Kira and L. A. Rendell, “A practical approach to feature selection,” in Machine learning proceedings 1992. Morgan Kaufmann, 1992, pp. 249–256.
 [34] W. J. McGill, “Multivariate information transmission,” Psychometrika, vol. 19, pp. 97–116, 1954.
 [35] N. Al-Nuaimi, M. M. Masud, M. A. Serhani, and N. Zaki, “Streaming feature selection algorithms for big data: A survey,” Applied Computing and Informatics, 2020.
 [36] J. Cai, J. Luo, S. Wang, and S. Yang, “Feature selection in machine learning: A new perspective,” Neurocomputing, vol. 300, pp. 70–79, 2018.
 [37] A. Tsimpiris, I. Vlachos, and D. Kugiumtzis, “Nearest neighbor estimate of conditional mutual information in feature selection,” Expert Systems with Applications, vol. 39, pp. 12697–12708, 2012.
 [38] J. T. Lizier and M. Rubinov, “Multivariate construction of effective computational networks from observational data,” Preprint no. 25/2012, Max Planck Institute for Mathematics in the Sciences, 2012. Available from: https://www.mis.mpg.de/publications/preprints/2012/prepr201225.html (accessed: 2020-12-15).