Mars Express (MEX), a spacecraft operated by the European Space Agency (ESA), is Europe’s first spacecraft to orbit Mars. Since the start of its science operations at the beginning of 2004, it has provided evidence of the presence of water above and below the surface of the planet, a large number of three-dimensional renders of the surface, as well as the most complete map of the chemical composition of Mars’s atmosphere.
MEX is powered by electricity generated by its solar arrays and stored in batteries to be used during the eclipse periods. The scientific payload of the MEX consists of seven instruments that provide global coverage of the planet’s surface, subsurface and atmosphere. The instruments and on-board equipment have to be kept within their operating temperature ranges, spanning from room temperature for some instruments, to temperatures as low as –180°C for others. In order to maintain these predefined operating temperatures, the spacecraft is equipped with an autonomous thermal system composed of heater lines as well as coolers. The thermal system, together with the platform units, consumes a significant amount of the total generated electric power, leaving a fraction to be used for science operations.
Predicting the power consumption of the thermal system is a non-trivial task. However, due to the aging of the spacecraft and the decaying capacity of its batteries, it is crucial for optimal planning and execution of science operations on MEX. The power consumption is a dynamic process that changes through time, depending on various external and internal factors, such as long-term exposure of the spacecraft to the Sun or heat generated by the on-board instruments. For instance, Figure 1 shows the effect of the radio transmitter during a communication pass, with interpolation between different temperature sensors of the face of the spacecraft. Temperatures fluctuate by up to 28°C between the two on/off conditions. Current attempts at modeling and predicting the power consumption involve manually constructed models that are based on simplified first-principle models, expert knowledge and experience. Given MEX’s current condition, this calls for a more accurate predictive model of the thermal power consumption, which would yield a prolonged operating life.
This motivated the organization of ESA’s first data mining competition, the Mars Express Power Challenge. The focus of the challenge was the development of specialized approaches for constructing models that are able to accurately estimate and predict MEX’s thermal power consumption (TPC) given only measured telemetry data. For this task of predictive modeling, machine learning approaches offer a different, yet more accurate solution to modeling the complex relationship between the telemetry and the power consumption than a human expert.
Machine learning is an area of artificial intelligence that studies algorithms with the ability to learn, i.e., algorithms that improve their performance through knowledge gathered from experience (data). Their ability to capture and describe patterns in complex data makes them a valuable asset for studying a variety of phenomena in different domains, such as life sciences, earth sciences, and social and behavioral sciences. In the context of the MEX challenge, machine learning algorithms for predictive modeling aim at constructing a model that can capture complex relationships in the data. In turn, such models can accurately estimate future values of power consumption for each of the heater lines and coolers (target variables/features) given measured telemetry data (descriptive variables/features).
In our previous work, we presented the machine learning pipeline solution that won the Mars Express Power Challenge. The winning solution first transforms the raw telemetry data into carefully constructed features with minute time resolution between values, rendering a massive data set. Next, it uses the method of Random Forest of Predictive Clustering Trees (RF-PCTs) to construct predictive models for each of the target variables. Finally, it outputs a predicted value of each target variable for every hour of one Martian year in the future ( Earth years). The proposed solution performed better than the 40 other competing solutions while being more accurate than the models currently in use at ESA by an order of magnitude.
However, the high predictive accuracy of the winning solution came at the cost of substantial computational overhead. In this paper, we extend our previous work to address this issue. In particular, we propose an update to the winning solution which aims at efficiently constructing predictive models of MEX’s TPC, while still maintaining good predictive performance. More specifically, we consider updates of the pipeline along two dimensions: (1) constructing data with different data granularity in the learning process and (2) using different machine learning methods which can efficiently learn accurate predictive models. The former considers engineering features from the raw telemetry data at different time resolutions coarser than one minute, thus reducing the size of the data set used in the learning phase. The latter considers both local and global methods for multi-target regression. Local methods construct a model for each target variable separately. Here, besides the winning method, RFs of PCTs, we also consider XGBoost, a recent efficient implementation of Stochastic Gradient Boosting. In contrast, global methods produce a predictive model able to predict several target variables simultaneously. To this end, we consider global RFs of PCTs for multi-target regression, an extension of the local version, which can construct a single model for all target variables, therefore substantially reducing the computational time needed for obtaining a solution [5, 9].
In sum, the main task that we address is: Given three Martian years of telemetry data (August 22, 2008 to April 14, 2014), use machine learning to efficiently construct a predictive model that accurately predicts the values of the electric current through the thermal power consumers for the subsequent Martian year (April 14, 2014 to March 1, 2016).
The remainder of the paper is organized as follows. In Section 2, we provide an overview of the work related to machine learning applications for space-exploration research. Section 3 presents and discusses the tasks of data preparation and pre-processing. Section 4 presents the machine learning methods used in this study. In Section 5, we present the experimental setup for evaluating the proposed extensions of the machine learning pipeline. Section 6 presents and discusses the results of the empirical evaluation. Finally, Section 7 concludes the paper and suggests directions for further work.
II Related Work
Machine learning offers a wide range of methods that tackle predictive tasks in real-life domains [10, 11]. These methods have been applied to predicting discrete output values (classification), continuous output values (regression), and even structured outputs, as in gene networks, image classification, text categorization, etc.
The challenges typically addressed in space-exploration research are associated with a high cost of failure. For instance, remote spacecraft are typically equipped with processors and memory lagging decades behind the state of the art. Moreover, the development and launch of a space mission is expensive and there is little or no opportunity for repair. In this context, machine learning approaches have proven to be a valuable asset. In particular, many applications of machine learning address the task of anomaly detection in spacecraft, where a typical task considers monitoring the status of the on-board equipment. Such analyses of telemetry data are performed using neural networks, relevance vector machines, or by applying seasonal decomposition methods (linear regression together with the nearest neighbours).
Machine learning can be used not only for estimating the current state of a spacecraft, but also for predicting its future states, therefore allowing for autonomous decisions. In our previous work, we propose a solution for predicting a spacecraft’s power consumption, which can be used for optimizing its operation. This work is closely related to a study in which the authors propose Random Forests for predicting the temperature of the instruments in order to optimize battery usage during eclipses.
Another important challenge relates to the safe ground movement of autonomous (space) rovers or robots. One approach addresses this issue by using support vector machines to recognize and avoid dangerous objects. In a similar context, another study addresses the task of image analysis for automated navigation systems. Here, with deep neural networks as the underlying machine learning method, autonomous drones are utilized for tracking forest paths.
Finally, machine learning can also be utilized to learn or simplify a physical model of a spacecraft or a model of its environment. One study employs Random Forests to simplify the exact physical model of the complex and dynamic radiation environment in the Van Allen belts. In a similar context, another work outlines several studies which address the challenges of spacecraft operating in high-radiation environments (beyond Earth’s magnetosphere and ionosphere) and the reliability of machine learning algorithms applied in these scenarios. In particular, these studies propose variants of traditional ML approaches (k-means and SVMs) that are robust to potential data corruption on the disks due to the various levels of radiation.
III The Data
In the typical machine learning setting, the input to a learning algorithm is (training) data which embodies the experience. The data consists of training examples (also referred to as instances or measurements) and their properties (also referred to as features or attributes). The features, numerical (i.e., continuous) or nominal (i.e., discrete), can either describe the data or specify the desired output of the algorithm. In the context of predicting MEX’s TPC, an example is a time period, while features are derived from context and observation data.
In this paper, we use data provided by ESA that consist of raw telemetry data (context data) and measurements of the electric current of thermal power lines (observation data) for three Martian years of MEX operations. We refer to these data as the training data. For the fourth Martian year of operation, the context part of the data was used for generating the predicted values, which in turn were evaluated using the real measured observation data. We refer to these data as the test data.
The observation data consists of the electric current measurements of the 33 power consumers, recorded once or twice per minute. The context data consists of five components:
SAA (Solar Aspect Angles) data contain the angles between the Sun–MEX line and the axes of the MEX’s coordinate system.
DMOP (Detailed Mission Operations Plans) data contain the information about the execution of different subsystems’ commands at a specific time.
FTL (Flight dynamics TimeLine events) data contain the pointing and action commands that impact the position of MEX, such as pointing the spacecraft towards Earth or Mars.
EVTF (Miscellaneous Events) data contain time intervals during which MEX was in Mars’s shadow or records of the time points when the MEX is in apsis of its elliptical orbit.
LTDATA (Long Term Data) contain the Sun–Mars distances and the solar constant.
All raw data entries are time-stamped (expressed in milliseconds) indicating when the entry was logged. The time span between the two consecutive entries varies from less than a minute (SAA) to several hours (LTDATA). For a detailed description of the task and the data, we refer the reader to [3, 21].
The raw data is not directly applicable to a machine learning algorithm for two main reasons: (i) incompatible time resolutions of the different components of the raw data, and (ii) the unstructured format of some of the entries, such as text, that are not readily usable by machine learning algorithms. Therefore, to construct an appropriate data set for a learning algorithm, we pre-process the raw data in two phases: (i) choosing the time resolution of the data (the time interval between two consecutive examples) and (ii) engineering new (more informative) features from different parts of the context data that may lead to better predictive performance.
The first phase relates to choosing an appropriate time resolution $\Delta t$ of the data set, and dividing the time span $[t_0, t_N]$ into subintervals $[t_i, t_{i+1})$, where $t_{i+1} - t_i = \Delta t$. Here, $t_0$ is the first time-stamp in the training data, and $t_N$ is the last time-stamp in the test data.
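As an illustration of this first phase, the following minimal sketch assigns time-stamped entries to subintervals of a fixed length. It assumes millisecond time-stamps (as in the raw data) and a resolution given in minutes; the function names `interval_index` and `group_by_interval` and the toy entries are hypothetical and not part of the original pipeline.

```python
def interval_index(timestamp_ms, t0_ms, delta_t_min):
    """Index i of the subinterval [t_i, t_{i+1}) that contains the time-stamp."""
    delta_ms = delta_t_min * 60_000  # resolution converted from minutes to ms
    return (timestamp_ms - t0_ms) // delta_ms


def group_by_interval(entries, t0_ms, delta_t_min):
    """Group (timestamp_ms, value) entries by the index of their subinterval."""
    groups = {}
    for ts, value in entries:
        idx = interval_index(ts, t0_ms, delta_t_min)
        groups.setdefault(idx, []).append(value)
    return groups


# Toy entries (hypothetical values), grouped at a one-minute resolution.
entries = [(0, 1.0), (30_000, 2.0), (60_000, 3.0), (125_000, 4.0)]
groups = group_by_interval(entries, t0_ms=0, delta_t_min=1)
```

Each group can then be aggregated (for example, averaged or integrated) to obtain one feature value per example.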
The second phase considers constructing more informative features. The value of a given feature for a particular time interval is obtained by aggregating measurements from the time interval at hand that correspond to one or more components from the raw telemetry data.
Due to issues with the spacecraft communication in some periods, some measurements are missing from both the context and the observation data. In principle, the machine learning methods employed in this study can handle data with missing values. However, longer periods with contiguous missing values can substantially damage the accuracy of the learned predictive models as well as add computational overhead. For this reason, we remove examples with missing observation data for time periods longer than minutes. On the other hand, in the context data, we interpolate the examples with missing values for time periods shorter than minutes, or leave them intact otherwise.
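The treatment of missing context values can be sketched as follows. This is only an illustration under stated assumptions: values lie on a regular time grid, missing entries are marked with `None`, linear interpolation is used, and the threshold `max_gap` (expressed in number of grid points) is a hypothetical parameter, since the actual thresholds are not reproduced here.

```python
def fill_short_gaps(values, max_gap):
    """Linearly interpolate interior runs of None shorter than max_gap points.

    Longer runs, and runs touching either end of the series, are left intact.
    """
    out = list(values)
    n = len(out)
    i = 0
    while i < n:
        if out[i] is None:
            j = i
            while j < n and out[j] is None:
                j += 1  # j is the first index after the run of missing values
            if i > 0 and j < n and (j - i) < max_gap:
                left, right = out[i - 1], out[j]
                for k in range(i, j):
                    out[k] = left + (right - left) * (k - i + 1) / (j - i + 1)
            i = j
        else:
            i += 1
    return out


filled = fill_short_gaps([1.0, None, 3.0, None, None, None, 7.0], max_gap=2)
```

In this toy series the single missing point is interpolated, while the run of three missing points is kept as missing.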
In the following subsections, we describe the groups of features constructed in the pre-processing step of the pipeline.
III-A Energy influx features
There are seven features in this group: one for the solar panels and one for each of the six sides of the cuboid of MEX. The features describe the amount of solar energy that is collected through a given surface in a given time interval. The solar energy collected by a side of the cuboid directly influences the amount of energy used by the thermal lines that maintain the temperature in that part of the spacecraft. The solar energy collected by the solar panels influences the amount of available energy that can be stored in the spacecraft’s batteries.
The amount of energy collected by a given surface is proportional to the product of the effective area $A_{\mathrm{eff}}$ of the surface and the solar constant $c$. If the area $A$ is given, we compute $A_{\mathrm{eff}}$ as $A_{\mathrm{eff}} = A \max\{\cos\alpha, 0\}$, where $\alpha$ is the angle between the Sun–MEX line and the outer normal to the surface (see Figure 2). Without any loss of generality, we assume $A = 1$ for all surfaces, as the machine learning methods that we use are invariant to monotonic transformations of features. The values of $\alpha$ for each of the seven surfaces were computed directly from the SAA data, while $c$ was given in LTDATA. In addition to the effective area and the solar constant, (pen)umbras have a considerable impact on the energy influx. We define the amount of energy $E_i$ that passes through the surface in the time interval $[t_i, t_{i+1})$ as
$$E_i = \int_{t_i}^{t_{i+1}} \max\{\cos\alpha(t), 0\}\, c(t)\, U(t)\, \mathrm{d}t,$$
where $U$ is the umbra coefficient, an approximation of the proportion of the Sun visible from the spacecraft. $U(t)$ takes the value $0$ if the spacecraft is in an umbra, an intermediate value between $0$ and $1$ if the spacecraft is in a penumbra, and $1$ otherwise. Instead of calculating exact integrals for $E_i$, we approximate the values using the trapezoid rule.
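The trapezoid-rule approximation of the energy influx can be sketched as follows. The sketch assumes unit surface area, angles in radians, and samples given as tuples `(t, alpha, c, U)` sorted by time; the names `influx_rate` and `energy_influx` and the sample values are illustrative assumptions rather than the original implementation.

```python
import math


def influx_rate(alpha_rad, solar_constant, umbra_coeff):
    """Instantaneous influx per unit area: max(cos(alpha), 0) * c * U."""
    return max(math.cos(alpha_rad), 0.0) * solar_constant * umbra_coeff


def energy_influx(samples):
    """Trapezoid-rule integral of the influx rate over time-sorted samples.

    Each sample is a tuple (t, alpha_rad, solar_constant, umbra_coeff).
    """
    total = 0.0
    for (t0, a0, c0, u0), (t1, a1, c1, u1) in zip(samples, samples[1:]):
        r0 = influx_rate(a0, c0, u0)
        r1 = influx_rate(a1, c1, u1)
        total += 0.5 * (r0 + r1) * (t1 - t0)
    return total


# Hypothetical samples: the surface faces the Sun, then turns away from it.
samples = [
    (0.0, 0.0, 500.0, 1.0),
    (30.0, 0.0, 500.0, 1.0),
    (60.0, math.pi, 500.0, 1.0),
]
energy = energy_influx(samples)
```

A surface facing away from the Sun (negative cosine) or a spacecraft in umbra (coefficient 0) contributes no energy in this formulation.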
III-B Historical energy influx features
The thermal state of the spacecraft depends not only on the current energy influx, but also on the energy influx in the past. To capture this, we construct historical energy influx features for each of the seven surfaces. A given historical feature $H^{(h)}_i$ for a surface at time interval $i$ is computed as the sum of the energy influx during a given historical time-frame:
$$H^{(h)}_i = \sum_{j=i-h+1}^{i} E_j,$$
where $E_j$ is the energy influx of the surface in the $j$-th time interval and $h$ is the number of time intervals included in the historical feature. To account for different impacts of the historical energy influx, we construct historical features with different time-frames, for different values of $h$, given in Sec. V-A.
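A historical feature of this kind is a rolling sum, which can be computed in a single pass. The sketch below assumes that the window covers the $h$ most recent intervals including the current one, and that windows at the start of the series are simply truncated; both are assumptions made for illustration.

```python
def historical_influx(energy, h):
    """Rolling sum of the h most recent energy values ending at each index."""
    out = []
    window_sum = 0.0
    for i, e in enumerate(energy):
        window_sum += e
        if i >= h:
            window_sum -= energy[i - h]  # drop the value that left the window
        out.append(window_sum)
    return out


history = historical_influx([1.0, 2.0, 3.0, 4.0], h=2)
```

Maintaining a running sum keeps the cost linear in the number of examples, regardless of the window length.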
III-C DMOP features
The DMOP data contain a log of commands issued to different MEX subsystems. The names of the commands have been obfuscated; however, the available documentation reveals two variants of events: (1) events that contain information about the subsystem and the command that has been executed (e.g., ASXX383C) and (2) events that represent flight dynamics events (e.g., MAPO.000005).
The first four characters of the first variant represent the subsystem, while the rest represent the command and its parameters. In the second variant, the first four characters represent the name of the event, followed by a number that indicates the number of occurrences. Given that these events have a different impact on the temperatures of various subsystems of the spacecraft, it is safe to assume that they impact the thermal subsystems differently.
More specifically, we assume that there is a significant delay between the triggering of a subsystem’s command and its actual effect on the thermal state of the spacecraft. Therefore, from the raw DMOP data, we construct features that encode this information of delayed effect in terms of the “time since last activation” of a specific subsystem command.
The values of the DMOP features are calculated as follows:
$$f_e(t_i) = \min\{t_i - t_e,\ \theta\},$$
where $f_e(t_i)$ denotes the value of the feature corresponding to event $e$ at time $t_i$, and $t_e \le t_i$ is the time of the last activation of $e$. Note that here we also assume that all of the subsystems were deactivated at the first time point (i.e., $f_e(t_0) = \theta$). The threshold $\theta$ regulates the effect of a given event diminishing with time, rendering its influence unimportant at some point. We selected this threshold to be 1 day ($\theta = 1440$, the number of minutes in a day). Table I presents these calculations of features.
Table I: Calculation of DMOP features from the raw data (columns: Raw data, DMOP features).
We construct such features for each flight dynamics event ( features), each subsystem-command pair ( features) and each subsystem in case different commands are issued to it ( features). We also construct binary indicators for each subsystem and flight dynamics event ( features in total), where a feature has a value of 1 if the subsystem was triggered within the time-step, and 0 otherwise.
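The “time since last activation” encoding can be sketched as follows. The sketch assumes that all times are expressed in minutes, that a feature is capped at a threshold of one day (1440 minutes), and that a command that has never been activated takes the capped value; the function name `time_since_last_activation` and the toy times are hypothetical.

```python
def time_since_last_activation(activation_times, grid_times, theta=1440):
    """For each grid time t, return min(t - last activation at or before t, theta).

    If there has been no activation yet, the capped value theta is returned.
    Both inputs are in minutes; grid_times must be sorted in increasing order.
    """
    acts = sorted(activation_times)
    feats = []
    last = None
    k = 0
    for t in grid_times:
        while k < len(acts) and acts[k] <= t:
            last = acts[k]  # most recent activation not later than t
            k += 1
        feats.append(theta if last is None else min(t - last, theta))
    return feats


dmop_feature = time_since_last_activation([10, 100], [0, 10, 50, 100, 2000])
```

The cap expresses the assumption that an activation older than one day no longer affects the thermal state.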
III-D FTL features
The FTL data contain logs of pointing events and their time ranges, where simultaneous events are also possible. For each pointing event and each time interval, a feature takes a value equal to the proportion of that interval during which the event is in progress. Since the duration of events is typically longer than the length of the interval, the values of the features are mostly 1 (event is in progress) or 0 (event not in progress). This approach renders 23 FTL features in total.
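The proportion of an interval covered by an event is a simple overlap computation. The following sketch assumes that an event is described by its start and end times in the same units as the interval bounds; the function name `overlap_fraction` is illustrative.

```python
def overlap_fraction(event_start, event_end, t_start, t_end):
    """Fraction of the interval [t_start, t_end) during which the event runs."""
    overlap = min(event_end, t_end) - max(event_start, t_start)
    return max(overlap, 0.0) / (t_end - t_start)


full = overlap_fraction(0, 100, 10, 20)     # event spans the whole interval
partial = overlap_fraction(15, 100, 10, 20)  # event starts mid-interval
none = overlap_fraction(30, 40, 10, 20)     # event lies outside the interval
```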
III-E The final data sets
Table II presents the important details regarding the final constructed data sets used further in the experiments. (Footnote: The data sets will be publicly available after the completion of the review process.)
Table II: Details of the final data sets (columns: granularity [min], number of examples, number of features, size [MB]).
IV Machine learning methods
In predictive modeling, many machine learning methods struggle with the problem of overfitting. Overfitting occurs when a method learns a model with a very good performance on the provided training data, but has limited generalization power and performs poorly on data unseen during learning. In machine learning, there is a long tradition of developing methods that address this problem by learning multiple (diverse) models and combining their outputs instead of just learning a single model. These methods are referred to as ensemble methods or ensembles. An ensemble is a set of (base) predictive models constructed with a given algorithm, which is expected to lead to a predictive performance gain over an individual model by combining the predictions of its constituents. In this study, we employ two types of ensembles: (i) Random Forest of Predictive Clustering Trees (both local and global versions of the algorithm for multi-target regression) [5, 22] and (ii) Stochastic Gradient Boosted Trees (XGBoost) [6, 7].
IV-A Random Forest of Predictive Clustering Trees (PCTs)
Random Forest (RF) is an ensemble method that learns a set of tree-based predictive models and combines their predictions. The base models are learned from random bootstrap samples of the training set, where for each tree, at each tree node, a random feature subset (with a user-defined size) is considered for selecting the best split. Such an approach allows for constructing a set of diverse predictive models that can differ both in size and performance.
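The two sources of randomness described above can be sketched in a few lines. This is a generic illustration of bootstrap sampling and random feature subsets, not the implementation used in this study; the function names and the sizes in the example are hypothetical.

```python
import random


def bootstrap_indices(n_examples, rng):
    """Draw n_examples indices uniformly with replacement (one bootstrap sample)."""
    return [rng.randrange(n_examples) for _ in range(n_examples)]


def candidate_features(n_features, subset_size, rng):
    """Draw a feature subset (without replacement) to consider at one tree node."""
    return rng.sample(range(n_features), subset_size)


rng = random.Random(0)  # fixed seed so that the sketch is reproducible
sample = bootstrap_indices(8, rng)
features = candidate_features(12, 3, rng)
```

A new bootstrap sample is drawn once per tree, whereas a new feature subset is drawn at every node of every tree.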
In this study, the Random Forest ensembles consist of Predictive Clustering Trees (PCTs). PCTs have a tree structure that includes internal nodes and leaves. The internal nodes contain tests on the descriptive variables (i.e., the different features extracted with pre-processing), while the leaves give predictions for the target variable (i.e., the power consumption of a thermal line). A PCT can be seen as a hierarchy of clusters, with each node corresponding to a cluster. In particular, the top node of a PCT corresponds to one cluster (group) containing all data points. This cluster is then recursively partitioned into smaller clusters while moving down the tree. The leaves represent the clusters at the lowest level of the hierarchy, and each leaf is labeled with its cluster’s centroid/prototype (the average of the target variable is the prediction made by the leaf).
Random Forest of PCTs is a generalization of the traditional Random Forest ensemble of regression trees, in terms of addressing structured output prediction tasks [5, 25]. While the traditional RF is able to predict values of a single numeric target variable at a time (i.e., is a local method for Multi-Target Regression), the RF of PCTs ensemble also allows for predicting several target variables simultaneously (i.e., is a global method for Multi-Target Regression). The algorithm for learning a RF of PCTs is presented in Alg. 1. It takes three inputs: (1) the training data, as well as two hyper-parameters denoting (2) the number of trees in the ensemble and (3) the size of the feature subset considered at each node split. Each PCT in the ensemble is learned with a greedy recursive top-down induction algorithm on a random bootstrap sample of the input data set (line 3 of Alg. 1). (Footnote: The RF of PCT framework is implemented in the CLUS system, available at http://source.ijs.si/ktclus/clus-public.) The best test is greedily chosen by a heuristic function that typically measures the impurity of the target values of the examples in the subsets that this test induces. The goal of the heuristic function is to guide the algorithm towards small trees with good predictive performance. In the global MTR setting, this heuristic function is the average variance of the targets. In our case, with numeric features, every test compares the value of one feature against some threshold and partitions the set of examples into two subsets: the set of test-positive instances, for which the test holds, and the set of test-negative instances, for which it does not (line 9).
The procedure is recursively repeated on the subsets to construct the sub-trees (line 12) until a stopping criterion is satisfied (e.g., the minimal number of examples in a leaf is reached or the heuristic score no longer changes, etc.). In turn, a leaf node is created and its prototype is computed (line 15). In the global MTR setting, the prototype is a vector of average target values of the examples in the leaf. The prototypes are used for prediction.
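To make the split selection concrete, the sketch below searches for the best threshold on a single numeric feature using the reduction of the average variance of the targets. It is a simplified illustration and not the CLUS implementation: it assumes tests of the form `x <= threshold`, does not normalize the targets, and uses hypothetical function names.

```python
def avg_variance(targets):
    """Variance of each target component, averaged over the components."""
    n = len(targets)
    if n == 0:
        return 0.0
    dims = len(targets[0])
    total = 0.0
    for j in range(dims):
        column = [y[j] for y in targets]
        mean = sum(column) / n
        total += sum((v - mean) ** 2 for v in column) / n
    return total / dims


def best_threshold(xs, targets):
    """Return (threshold, variance reduction) of the best test x <= threshold."""
    n = len(xs)
    parent = avg_variance(targets)
    best = (None, 0.0)
    for thr in sorted(set(xs))[:-1]:  # the largest value cannot split the set
        pos = [y for x, y in zip(xs, targets) if x <= thr]
        neg = [y for x, y in zip(xs, targets) if x > thr]
        reduction = (parent
                     - len(pos) / n * avg_variance(pos)
                     - len(neg) / n * avg_variance(neg))
        if reduction > best[1]:
            best = (thr, reduction)
    return best


# Toy data with two targets: the split at x <= 2 separates the two groups.
xs = [1, 2, 3, 4]
targets = [(0.0, 0.0), (0.0, 0.0), (10.0, 20.0), (10.0, 20.0)]
threshold, reduction = best_threshold(xs, targets)
```

In the local setting the same search is run with a single target component, which is why one global tree can replace many local ones.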
Finally, the RF of PCTs algorithm outputs a set of PCTs, whose predictions are combined (averaged per value) to obtain the final ensemble prediction. The reasons for using RF of PCTs are twofold: (i) its state-of-the-art predictive performance, and (ii) its ability to calculate feature importance scores, i.e., a ranking of the features w.r.t. their importance for the target variables.
Namely, random forests can measure how much each feature contributes to the quality of the predictive model. For this purpose, we used the Genie3 algorithm, for which the motivation is the following. If a relevant feature is part of a test, then the heuristic score (reduction of the variance) of this split is high. Additionally, the features that appear in the tests of nodes at lower depths, e.g., in the root, influence more examples than the ones appearing deeper in the tree, so the former are intuitively more important. Therefore, the Genie3 importance score of a feature $x$ is defined as
$$\mathrm{importance}(x) = \frac{1}{|\mathcal{F}|}\sum_{\mathcal{T}\in\mathcal{F}}\ \sum_{\mathcal{N}\in\mathcal{T}(x)} h^*(\mathcal{N})\,|E(\mathcal{N})|,$$
where $\mathcal{F}$ is the forest, $\mathcal{T}(x)$ is the set of nodes in the tree $\mathcal{T}$ in which $x$ is part of the test, $h^*(\mathcal{N})$ is the value of the variance reduction function in the node $\mathcal{N}$, and $|E(\mathcal{N})|$ is the number of examples that come to the node $\mathcal{N}$.
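A Genie3-style importance score can be accumulated with a single pass over the internal nodes of the trees. The sketch below assumes that each tree is given as a flat list of internal nodes of the form `(feature, variance_reduction, n_examples)` and that the scores are averaged over the trees; this representation and the averaging are assumptions made for illustration.

```python
from collections import defaultdict


def genie3_importance(forest):
    """Sum variance reduction times node size per feature, averaged over trees.

    forest: list of trees; each tree is a list of internal nodes given as
    (feature, variance_reduction, n_examples).
    """
    scores = defaultdict(float)
    for tree in forest:
        for feature, reduction, n_examples in tree:
            scores[feature] += reduction * n_examples
    return {feature: s / len(forest) for feature, s in scores.items()}


# Hypothetical two-tree forest over features "a" and "b".
forest = [
    [("a", 2.0, 100), ("b", 1.0, 40)],
    [("a", 1.0, 100)],
]
importance = genie3_importance(forest)
```

Sorting the features by this score yields the kind of ranking used in feature importance diagrams.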
Further in the paper, we denote the local and the global version of Random Forest PCTs for multi-target regression with L-RF and G-RF, respectively.
IV-B Gradient Boosted Trees
Gradient boosting refers to a class of boosting ensemble methods which aim at learning a set of predictive models focusing on difficult observations in the data. In contrast to Random Forest ensembles, which first learn the ensemble constituents from random parts of the data set and in turn combine their outputs, boosting ensembles are constructed iteratively. At each boosting iteration, a new weak base model that corrects the error made by the ensemble thus far is learned and added to the set, creating a stronger model at each step. Typically, such weak models have a simple structure and thus perform only slightly better than random models.
In general, boosting methods differ in the type of base models they employ and in how the learning is performed. The former is task related, where typical base models include logistic regressors (for classification tasks), linear regressors (for regression tasks) or decision trees (for both classification and regression tasks). Regarding how the learning is performed, boosting base models are constructed either to focus on hard examples identified in previous iterations or by minimizing the empirical risk via steepest gradient descent. In this paper, we focus on the latter category, referred to as Gradient Boosting.
An outline of the gradient boosting algorithm for constructing ensembles with decision trees is presented in Alg. 2. The algorithm takes four inputs: (1) a training data set, (2) the number of boosting iterations, (3) a loss function and (4) a learning rate. First, an initial default (weak) model that minimizes the loss function is learned on the whole data set. Given that the gradient boosting method aims at optimizing the loss function, it is important that the loss function be convex and differentiable. A typical choice is the squared-loss function. In turn, at each step gradient boosting learns a new model on the pseudo-residuals, i.e., the discrepancy between the true values and the values predicted by the ensemble from the previous iterations (line 4).
However, such a straightforward approach can very easily overfit to the training data. In order to address the problem of overfitting, more sophisticated gradient boosting methods implement two different mechanisms: a learning rate and a random data sampling procedure. The former regulates the influence of the prediction of each subsequent model added to the ensemble (line 6). The latter, referred to as Stochastic Gradient Boosting, employs an additional random data sampling procedure: each model is learned and evaluated on different random subsamples of the training data (line 3).
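The boosting loop, together with the learning rate and the random subsampling, can be illustrated with a minimal sketch for a single numeric feature and the squared loss. It uses depth-1 trees (stumps) as weak models and is not XGBoost or Alg. 2 itself; all function names, default parameter values and the toy data are assumptions made for illustration.

```python
import random


def fit_stump(xs, residuals):
    """Fit a depth-1 regression tree on one feature by minimizing squared error."""
    best = None
    for thr in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= thr]
        right = [r for x, r in zip(xs, residuals) if x > thr]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, thr, left_mean, right_mean)
    if best is None:  # all feature values equal: fall back to a constant model
        mean = sum(residuals) / len(residuals)
        return (float("inf"), mean, mean)
    return best[1:]


def stump_predict(stump, x):
    thr, left_mean, right_mean = stump
    return left_mean if x <= thr else right_mean


def gradient_boost(xs, ys, n_iter=50, learning_rate=0.1, subsample=0.8, seed=0):
    """Stochastic gradient boosting with squared loss and stumps as weak models."""
    rng = random.Random(seed)
    f0 = sum(ys) / len(ys)  # the constant that minimizes the squared loss
    preds = [f0] * len(ys)
    stumps = []
    n_sub = max(2, int(subsample * len(xs)))
    for _ in range(n_iter):
        idx = rng.sample(range(len(xs)), n_sub)      # random subsample
        residuals = [ys[i] - preds[i] for i in idx]  # pseudo-residuals
        stump = fit_stump([xs[i] for i in idx], residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump_predict(stump, x)
                 for p, x in zip(preds, xs)]
    return f0, learning_rate, stumps


def boosted_predict(model, x):
    f0, learning_rate, stumps = model
    return f0 + learning_rate * sum(stump_predict(s, x) for s in stumps)


def mse(model, xs, ys):
    return sum((boosted_predict(model, x) - y) ** 2 for x, y in zip(xs, ys)) / len(ys)


# Toy step function: boosting should improve on the constant initial model.
xs = list(range(10))
ys = [0.0] * 5 + [10.0] * 5
model = gradient_boost(xs, ys)
initial_only = (sum(ys) / len(ys), 0.1, [])
```

For the squared loss, the pseudo-residuals coincide with the ordinary residuals, which is why each stump is fitted directly to the differences between true and predicted values.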
V Experimental Setup
V-A Parameter Instantiation
Granularity. The data granularity is defined by the length of the time interval that corresponds to one example in the data set. We construct the predictive models using data sets with (measured in minutes) where corresponds to the data set used in .
Historical features. For the data set with finest granularity , we consider the following numbers of historical intervals from (1): . Consequently, the historical time span ranges from minutes to minutes. The choices of the historical intervals in the data sets with coarser granularity are presented in Table III. In total, these values yield features in the data set with and features in the data set with .
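Table III: Historical intervals for the different granularities (columns: values of the number of historical intervals, time spans).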
Random Forest parameters. To constrain the size of the trees in the Random Forests, we specify a minimal number of examples in the leaves for each tree. Since the number of instances in the data sets is inversely proportional to , we set the minimal number of instances for the experiments to , while for the others we set them to . Additionally, we set the total number of trees in the random forests to , where one quarter of the features is considered at every split when growing the trees, i.e., in Alg. 1.
XGBoost parameters. Analogously, to constrain the size of the trees, we set the maximal depth of each tree in the ensemble to . The learning rate parameter is set to . Additionally, to address potential over-fitting issues, for every boosting iteration of the features and of the examples are randomly chosen for training. The maximum number of boosting iterations (ensemble constituents) is set to with an early-stop option, i.e., if the newly added trees do not improve the performance of the ensemble over five consecutive boosting iterations, the algorithm stops.
V-B Evaluation procedure
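The data set $D$ consists of examples $(\mathbf{x}, \mathbf{y})$, where $\mathbf{x}$ is a vector of feature values (features are described in Sec. III) and $\mathbf{y}$ is a vector of target values, i.e., the electrical currents through the heaters and coolers.
The data set is divided into two parts: $D_{\mathrm{TRAIN}}$, which describes the state of the spacecraft throughout the first three Martian years, and $D_{\mathrm{TEST}}$, which describes the state of the spacecraft throughout the fourth Martian year. All predictive models, i.e., the approximations $\hat{f}$ of the true mapping $f: \mathbf{x} \mapsto \mathbf{y}$, were learned on $D_{\mathrm{TRAIN}}$. The $j$-th components of the vectors $\hat{\mathbf{y}} = \hat{f}(\mathbf{x})$ and $\mathbf{y}$, i.e., the predicted and true values of the $j$-th target, are denoted by $\hat{y}_j$ and $y_j$.
The quality of the predictions is evaluated on the separate test set $D_{\mathrm{TEST}}$, not used for learning the models. It is measured in terms of the average root mean squared error $\overline{\mathrm{RMSE}}$, defined via the root mean squared errors $\mathrm{RMSE}_j$ of each target variable, as follows:
$$\overline{\mathrm{RMSE}} = \frac{1}{T}\sum_{j=1}^{T}\mathrm{RMSE}_j, \qquad \mathrm{RMSE}_j = \sqrt{\frac{1}{|D_{\mathrm{TEST}}|}\sum_{(\mathbf{x},\mathbf{y})\in D_{\mathrm{TEST}}} (\hat{y}_j - y_j)^2},$$
where $|D_{\mathrm{TEST}}|$ denotes the size of $D_{\mathrm{TEST}}$ and $T$ is the number of target variables.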
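The evaluation measure can be computed as in the following sketch, which takes the root mean squared error of each target over the test examples and then averages over the targets. The function name `average_rmse` and the toy values are illustrative.

```python
import math


def average_rmse(y_true, y_pred):
    """Mean over targets of the per-target root mean squared error.

    y_true, y_pred: equally long lists of equally long target vectors.
    """
    n_examples = len(y_true)
    n_targets = len(y_true[0])
    total = 0.0
    for j in range(n_targets):
        mse_j = sum((p[j] - t[j]) ** 2
                    for t, p in zip(y_true, y_pred)) / n_examples
        total += math.sqrt(mse_j)
    return total / n_targets


# Toy example with two test examples and two targets.
y_true = [(0.0, 0.0), (0.0, 0.0)]
y_pred = [(3.0, 0.0), (3.0, 4.0)]
score = average_rmse(y_true, y_pred)
```

Because the square root is taken per target before averaging, a target with large errors does not dominate the score as strongly as it would in a single pooled RMSE.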
We also compare the machine learning methods in terms of their time efficiency (for constructing a predictive model). By doing so, we estimate the trade-off between the predictive performance of the models and time needed for constructing them, in turn determining the optimal combination of time resolution in the data and machine learning method. The time efficiency refers to single-threaded runs of the algorithms, computed as follows. First, only an portion of the ensemble is constructed where we measure the learning time . Subsequently, the total learning time is estimated as . Such estimations were necessary, since some methods do not allow for single threaded runs (as reported in the next section).
The goal of the experiments is to determine whether we can improve the efficiency of our approach for predicting thermal power consumption, while retaining good predictive performance. In particular, we evaluate three different algorithms in terms of time efficiency (for learning a model) and predictive performance.
First, we report on the running times of the three algorithms given training data with different granularity as well as the impact on their predictive performance. Next, we discuss different alternatives for improving both the efficiency and the predictive performance of the algorithms. Finally, we discuss the quality of the learned predictive models using feature importance diagrams.
We first focus on the learning times of the different algorithms given different data granularity. Figure 3 presents the time needed for each algorithm to construct a model for each target as well as the total time to complete the task.
As expected, the learning times of all approaches generally decrease as the data granularity becomes coarser. Nevertheless, the L-RF approach, which constructs an ensemble model for each of the target variables, has the longest learning time of approximately hours ( years). In particular, learning an L-RF ensemble from the data set takes approximately -times longer than constructing an XGB ensemble which takes hours.
While the constituents of an XGB ensemble are considerably smaller than the ones of an , constructing an XGB ensemble still takes approximately twice as much time as constructing a G-RF ensemble ( hours). Note that the constituents of both types of ensembles can be constructed in parallel, thus additionally reducing the learning time for L-RF and G-RF by a factor of . In contrast, the most resource-friendly is the data set, where the estimated learning times of L-RF, XGB and G-RF are , and hours, respectively.
Next, Figure 4 presents the impact of the data granularity on the predictive performance of the models constructed with the three different methods. Note that there is a trade-off between learning-time efficiency and predictive performance. In particular, the ability of G-RF to construct predictive models in the shortest amount of time comes at a cost of decreased predictive performance compared to the other two methods. In this context, XGB performs best in two cases () while being substantially more efficient than L-RF which performs best in the remaining four cases ().
The predictive error of both methods increases substantially with . In the case of XGB, however, the best predictive performance is obtained with the medium-grained data sets . Note that the L-RF trained on data set (as used in ) still has some competitive advantage in terms of predictive performance. However, similar (or slightly better) performance can be obtained either by learning from the which is considerably more efficient, or by constructing an XGB ensemble using the data set.
VI-B Further optimization
So far, we reported on ensembles consisting of constituents. In general, the size of the ensemble has different effects on the quality of the predictions in the case of Random Forest ensembles and Boosting ensembles. In the former case, a general rule-of-thumb is that the predictive performance increases with the ensemble size, until it (effectively) saturates at some point. The reason for this is that every tree in RF is grown independently of the others. Thus, an RF ensemble with too many trees is at worst unnecessarily large.
In contrast, the boosting ensemble creation is different. Here, every additional tree focuses on minimizing the errors of the previously grown trees, therefore such ensembles have a tendency to overfit to the training data. As a consequence, the ensemble size can have substantial effect on the predictive performance.
We conjecture that such artefacts are also present in the models evaluated thus far, and therefore we aim to further optimize the learning methods. To estimate the optimal number of trees in the ensembles, we take the data set and perform 5-fold cross validation on . More specifically, we randomly divide into five equally sized parts , . We train ensembles with different sizes (, …, ) on every group of four parts, and estimate their performance on the remaining part not used for training. The average error of the five attempts (so called cross-validated error) is the final score of a given ensemble. We perform the same procedures for the RF (G-RF) and XGB ensembles.
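The cross-validation procedure described above can be sketched in plain Python as follows. The learner `fit_bagged_mean` is a deliberately trivial stand-in (an average of bootstrap-sample means that ignores the features) and is not one of the methods used in the study; the function names, the toy data and the candidate ensemble sizes are likewise hypothetical and serve only to show the mechanics of scoring each ensemble size by its average held-out error.

```python
import math
import random
from statistics import mean


def k_fold_indices(n_rows, k=5, seed=0):
    """Randomly split the row indices 0..n_rows-1 into k nearly equal parts."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validated_errors(X, y, sizes, fit, error, k=5, seed=0):
    """Return {ensemble_size: error averaged over the k held-out parts}."""
    folds = k_fold_indices(len(X), k, seed)
    scores = {}
    for size in sizes:
        fold_errors = []
        for held_out in folds:
            held = set(held_out)
            train = [i for i in range(len(X)) if i not in held]
            # Train on the other k-1 parts, evaluate on the held-out part.
            model = fit([X[i] for i in train], [y[i] for i in train], size)
            predictions = [model(X[i]) for i in held_out]
            fold_errors.append(error([y[i] for i in held_out], predictions))
        scores[size] = mean(fold_errors)
    return scores


def rmse(y_true, y_pred):
    """Root mean squared error between two aligned sequences."""
    return math.sqrt(mean((t - p) ** 2 for t, p in zip(y_true, y_pred)))


def fit_bagged_mean(X_train, y_train, size, seed=0):
    """Toy stand-in learner: the average of `size` bootstrap-sample means."""
    rng = random.Random(seed)
    member_means = [mean(rng.choices(y_train, k=len(y_train))) for _ in range(size)]
    prediction = mean(member_means)
    return lambda x: prediction


# Hypothetical toy data and candidate ensemble sizes.
X = [[i] for i in range(20)]
y = [float(i % 4) for i in range(20)]
scores = cross_validated_errors(X, y, sizes=[1, 5, 25], fit=fit_bagged_mean, error=rmse)
best_size = min(scores, key=scores.get)
print(scores, best_size)
```

Passing `fit` and `error` as arguments keeps the fold logic independent of the learner, so the same routine could score any ensemble method whose size is a parameter.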
The results in Figure 5 show that both types of methods can achieve good performance with considerably smaller ensembles. In particular, the RF method is able to achieve good performance very early (which does not drastically improve over time) due to accurate and deeper trees. On the other hand, as expected, XGB starts with very poor performance which considerably improves after approximately iterations.
Given this evidence, we once again evaluate the predictive performance of the three methods on the test data; however, instead of constructing ensembles with trees we construct them with only .
Table IV confirms our conjecture: the ensemble size in the case of RF methods (L-RF and G-RF) has, in general, an insignificant effect on the predictive performance. Note that the Random Forest method comprises random trees, hence in some cases, like the G-RF constructed on the , small ensemble sizes hurt the predictive performance. Further analysis shows that in this case the performance stabilizes with at least constituents to . In the case of XGB, the effect of the ensemble size is more prominent and consistent. More specifically, the predictive performance of an XGB ensemble learned on a data set with granularity yields the best overall performance we obtained so far. Note that obtaining such performance is also considerably faster than obtaining the performance reported in our previous study, i.e., L-RF learned on .
Nevertheless, in terms of time efficiency, all ensemble methods benefit from the reduction of the ensemble size. In particular, on average the learning time is reduced by a factor of in the case of RF ensembles and by a factor of almost in the case of boosting. Figure 6 presents the results of the overall learning time and the predictive performance of all three methods with reduced ensemble size.
More specifically, here we aim at identifying the optimal choice of algorithm given both criteria of predictive performance and time efficiency. Therefore, the optimal solution should be as close as possible to the lower left corner of the graph. A point in the graph is dominated by another point if the latter is better than the former in both criteria. For example, the performance of L-RF on is dominated by the performance of XGB with both . The non-dominated points form a so-called Pareto front. In our case, the Pareto front consists of the two XGB points () and four G-RF points (). All these points are considered optimal unless we further specify our preferences over the criteria: if one aims at obtaining the fastest (but less accurate) solution, one should consider G-RF (). On the other hand, XGB () yields the most accurate (but less efficient) solution.
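The dominance relation and the resulting Pareto front can be made concrete with a short sketch. Each point is a pair (learning time, predictive error), with lower values being better in both criteria, and dominance is taken as strictly better in both, following the definition in the text. The point names and values below are hypothetical and do not correspond to the results in Figure 6.

```python
def dominates(a, b):
    """True if point a is strictly better (lower) than point b in both criteria."""
    return a[0] < b[0] and a[1] < b[1]


def pareto_front(points):
    """Return the sorted names of all points not dominated by any other point.

    `points` maps a name to a (learning_time, error) pair.
    """
    return sorted(
        name
        for name, p in points.items()
        if not any(dominates(q, p) for q in points.values())
    )


# Hypothetical (learning time in hours, error) pairs.
candidates = {
    "model_a": (1.0, 0.30),  # fastest, least accurate
    "model_b": (5.0, 0.10),  # slowest on the front, most accurate
    "model_c": (6.0, 0.20),  # dominated by model_b
    "model_d": (2.0, 0.25),  # intermediate trade-off
}
print(pareto_front(candidates))  # ['model_a', 'model_b', 'model_d']
```

Without further preferences over the two criteria, none of the three remaining points can be ranked above the others, which mirrors the argument in the text.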
VI-C Ensemble of Ensembles
The performance of an ensemble is a consequence of the performance and diversity of its constituents. Moreover, ensembles usually perform better when compared to each individual constituent. Given that in this study we consider different types of ensembles (RFs and XGB) with good predictive performance, to further improve the overall performance we can also combine their outputs in an Ensemble of Ensembles (EoE). As a proof-of-principle, Figure 7 presents the predictions of XGB and G-RF during a sample time period of 2 days for the 12th power line. We can see that, while XGB overestimates and G-RF underestimates the measured data (red line), their combination performs better.
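The intuition behind combining an overestimating and an underestimating member can be sketched as follows. The sketch assumes that the member outputs are combined by an unweighted element-wise average, which is an assumption made here for illustration; the combination rule used in the study may differ. The function names and the numeric series are hypothetical.

```python
import math


def ensemble_of_ensembles(*member_predictions):
    """Combine aligned prediction series by an unweighted element-wise mean."""
    if len({len(p) for p in member_predictions}) != 1:
        raise ValueError("need at least one member and equal-length series")
    n_members = len(member_predictions)
    return [sum(values) / n_members for values in zip(*member_predictions)]


def rmse(y_true, y_pred):
    """Root mean squared error between two aligned sequences."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical series: one member overestimates, the other underestimates.
measured = [1.0, 2.0, 3.0]
overestimating_member = [1.5, 2.5, 3.5]
underestimating_member = [0.5, 1.5, 2.5]

combined = ensemble_of_ensembles(overestimating_member, underestimating_member)
print(rmse(measured, overestimating_member), rmse(measured, combined))
```

When the members err in opposite directions, as in this toy case, their errors partly cancel in the average; when they err in the same direction, the combination can only land between them.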
Given the results from Figure 6 we construct two EoEs. The member ensembles are trained on data set, since all results so far point to a good trade-off between efficiency and performance in this case. Moreover, both EoEs have one XGB member combined either with G-RF (both methods are efficient) or L-RF (both methods are accurate).
The results are summarized in Table V. While the XGB&G-RF ensemble is very efficient to construct, its performance is only better than one of the members, G-RF. On the other hand, the XGB&L-RF achieves the best performance overall of . However, in practice, obtaining such an ensemble takes 10 times longer than learning only an XGB ensemble that has practically similar performance ().
VI-D Feature Importance
In our last set of experiments, we assess the importance of the features obtained from the three methods. Typically, different features have different influence on the target variables, which in turn affects the predictive performance of the constructed models. Figure 11 illustrates how different groups of features (energy influx, DMOP and FTL) influence the power consumers. The feature importance diagrams were calculated using the data set. More specifically, in the case of L-RF and XGB (panels (b) and (c)), the proportions in the diagrams for each target were computed according to Eq. (2). On the other hand, in the case of G-RF (panel (a)), where the model predicts all targets simultaneously, Eq. (2) leads only to the global feature importance (the "all" diagram). The per-target importances were computed from the local models, considering one target when computing the heuristic function. Note that, for some targets, the feature importance diagrams are blank since all ensemble constituents in these cases are constant models (trees without internal nodes).
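To illustrate how per-feature scores can be turned into the group proportions shown in such diagrams, the following sketch sums importance scores per feature group and normalizes them to shares. It does not reproduce Eq. (2); it only assumes that some non-negative importance score per feature is already available. The feature names, the scores and the function `group_proportions` are hypothetical. A target whose scores are all zero, as happens when every constituent is a constant model, yields no proportions, which corresponds to a blank diagram.

```python
def group_proportions(importances, feature_group):
    """Aggregate per-feature importance scores into per-group shares.

    `importances` maps feature name -> non-negative score for one target.
    `feature_group` maps feature name -> group label.
    Returns None when all scores are zero (nothing to plot).
    """
    totals = {}
    for feature, score in importances.items():
        group = feature_group[feature]
        totals[group] = totals.get(group, 0.0) + score
    overall = sum(totals.values())
    if overall == 0:
        return None
    return {group: total / overall for group, total in totals.items()}


# Hypothetical features and scores for a single target.
feature_group = {
    "dmop_event_1": "DMOP",
    "dmop_event_2": "DMOP",
    "ftl_pointing": "FTL",
    "solar_influx_front": "Energy influx",
}
scores_for_target = {
    "dmop_event_1": 3.0,
    "dmop_event_2": 2.0,
    "ftl_pointing": 1.0,
    "solar_influx_front": 2.0,
}
print(group_proportions(scores_for_target, feature_group))
# {'DMOP': 0.625, 'FTL': 0.125, 'Energy influx': 0.25}
```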
In the case of G-RF, the most important feature group overall is DMOP. In particular, in the majority of the power lines (27) their influence is at least . The Energy influx features have a major influence on five power lines, while the FTL features have a considerable impact only on the 13th power line. Similarly, L-RF finds the DMOP features as most important overall. However, as opposed to G-RF, here we find more power lines which are almost exclusively influenced by energy influx features. These differences in the computed feature importance between G-RF and L-RF are a consequence of how the ensembles are constructed: while the former is able to capture the global phenomena across all power lines, the latter captures more detailed behaviour that relates to each individual power line.
On the other hand, XGB mostly relates to the DMOP features. Compared to the other two methods, in this case their influence is considerably greater. In particular, the Energy Influx features have some significant importance () on only 4 power lines, while the FTL features do not contribute greatly. Additionally, XGB was not able to produce feature importance diagrams for 5 of the target variables.
In this paper, we propose a machine learning pipeline for predicting the power of the thermal subsystem on board the Mars Express spacecraft. More specifically, we propose novel solutions in the machine learning pipeline that focus on efficiently constructing predictive models of MEX’s TPC, while still being able to maintain high predictive performance. In particular, we employ state-of-the-art feature engineering approaches for transforming raw telemetry data, which in turn is used for constructing accurate predictive models with different machine learning methods. These solutions are the main contribution of our paper, since they considerably improve our competition-winning solution in two directions: efficiency and accuracy.
The proposed improvements in the pipeline consider (1) preparing training data with different time granularity, as well as (2) employing different machine learning methods for constructing accurate predictive models. Regarding the former, we carefully transformed the raw telemetry data at different time resolutions ( min) which resulted in significant reductions of the size of the data sets used in the learning phase. Regarding the latter, we considered different state-of-the-art local and global machine learning ensemble methods for multi-target regression. These methods include: Local and Global Random Forests of Predictive Clustering Trees (L-RF and G-RF) as well as Stochastic Gradient Boosted Trees (XGB). We evaluated our proposed solutions on the task of predicting hourly values of the electric current through the thermal power consumers on board MEX for one Martian year, given raw telemetry data of three preceding Martian years.
In terms of time efficiency, our empirical study shows that the time resolution of the data has a significant impact on both the construction time of the predictive models as well as on their accuracy. The former is an expected result, given that coarser granularity yields a reduced data set and therefore shorter learning time. However, learning methods using coarser data usually yield less accurate models. The latter result though, provides a significant insight into this problem: Given a data set with moderate granularity (, all three methods are able to obtain models with comparable (or better) predictive performance, in substantially shorter time as compared to models learned on data set with finer granularity.
In terms of predictive performance, the local ensemble methods perform better than the global method. More specifically, in most cases, L-RF and XGB have comparable performance, with XGB being slightly better. While both methods perform better than G-RF, the difference in performance is neither substantial nor significant. Note that learning a global model also takes considerably less time than learning a local model. In the same context, our results show that, while the size of the ensembles has a significant effect on the learning time, it can also improve the predictive performance. In particular, we showed that, with all methods we can obtain similar or better (XGB) predictive performance with smaller ensembles. Moreover, we also demonstrated that by further combining the predictions of the different ensembles into an ensemble-of-ensembles, we can additionally improve the predictive performance and obtain premium accuracy ().
Finally, our feature importance analysis indicates that, for this particular problem of predicting thermal power consumption, the Detailed Mission Operations Plans (DMOP) have a significant role in the quality of the predictive models. Their importance is more prominent in the XGB ensemble models than in the RF ensembles, which also rely on the Energy Influx features when constructing a model.
There are several directions to extend the work presented in this paper. Considering the data, note that DMOP information is available only after a certain command is executed on the spacecraft and its effect measured subsequently. This means that using these data for predicting longer time horizons is not possible. Moreover, given the findings of our paper, omitting them from the learning process might have a severe consequence on the performance of the predictive models. Therefore, an immediate continuation of the work presented here is to further optimize the constructed features as well as investigate different approaches for engineering new (informative) features. Finally, while the proposed methodology focuses on the thermal subsystem of the MEX spacecraft, it can also be readily applied to the other subsystems. Moreover, it can also be extended to other spacecraft such as XMM-Newton, Integral and ExoMars as well as rovers (such as Curiosity and ExoMars) exploring Mars.
We thank the Slovenian grid initiative & the Academic and Research Network of Slovenia for support with the computer infrastructure. We acknowledge the financial support of the Slovenian Research Agency (via the grants P2-0103, J4-7362, L2-7509 and the young researcher grants to MP and MB), as well as the European Commission via the grants HBP (The Human Brain Project) SGA1 and SGA2.
-  R. Orosei, S. E. Lauro, E. Pettinelli, A. Cicchetti, M. Coradini, B. Cosciotti, F. Di Paolo, E. Flamini, E. Mattei, M. Pajola, F. Soldovieri, M. Cartacci, F. Cassenti, A. Frigeri, S. Giuppi, R. Martufi, A. Masdea, G. Mitri, C. Nenna, R. Noschese, M. Restano, and R. Seu, “Radar evidence of subglacial liquid water on Mars,” Science, 2018.
-  A. Chicarro, P. Martin, and R. Trautner, “The Mars Express mission: An overview,” in Mars Express: The Scientific Payload, vol. 1240, 2004, pp. 3–13.
-  L. Lucas and R. Boumghar, “Machine learning for spacecraft operations support - The Mars Express Power Challenge,” in Sixth International Conference on Space Mission Challenges for Information Technology, SMC-IT 2017, 2017.
-  M. Breskvar, D. Kocev, J. Levatić, A. Osojnik, M. Petković, N. Simidjievski, B. Ženko, R. Boumghar, and L. Lucas, “Predicting thermal power consumption of the Mars Express satellite with machine learning,” in 2017 6th International Conference on Space Mission Challenges for Information Technology (SMC-IT), 2017, pp. 88–93.
-  D. Kocev, C. Vens, J. Struyf, and S. Džeroski, “Tree ensembles for predicting structured outputs,” Pattern Recognition, vol. 46, no. 3, pp. 817–833, 2013.
-  T. Chen and C. Guestrin, “XGBoost: A scalable tree boosting system,” in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ser. KDD ’16, 2016, pp. 785–794.
-  J. H. Friedman, “Stochastic gradient boosting,” Computational Statistics & Data Analysis, vol. 38, no. 4, pp. 367–378, 2002.
-  H. Borchani, G. Varando, C. Bielza, and P. Larrañaga, “A survey on multi-output regression,” Data Mining and Knowledge Discovery, vol. 5, no. 5, pp. 216–233, 2015.
-  E. Spyromitros-Xioufis, G. Tsoumakas, W. Groves, and I. Vlahavas, “Multi-target regression via input space expansion: treating targets as inputs,” Machine Learning, vol. 104, no. 1, pp. 55–98, 2016.
-  T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, ser. Springer Series in Statistics. Springer New York, 2013.
-  I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, 2005.
-  G. H. Bakır, T. Hofmann, B. Schölkopf, A. J. Smola, B. Taskar, and S. V. N. Vishwanathan, Predicting structured data, ser. Neural Information Processing. The MIT Press, 2007.
-  A. McGovern and K. L. Wagstaff, “Machine learning in space: extending our reach,” Machine Learning, vol. 84, no. 3, pp. 335–340, 2011.
-  Z. Li, “Machine learning in spacecraft ground systems,” in 2017 6th International Conference on Space Mission Challenges for Information Technology (SMC-IT), 2017, pp. 76–81.
-  T. Yairi, Y. Kawahara, R. Fujimaki, Y. Sato, and K. Machida, “Telemetry-mining: A machine learning approach to anomaly detection and fault diagnosis for space systems,” pp. 476–483, 07 2006.
-  M. Muñoz, Y. Yue, and R. Weber, “Telemetry anomaly detection system using machine learning to streamline mission operations,” in 2017 6th International Conference on Space Mission Challenges for Information Technology (SMC-IT), 2017, pp. 70–75.
-  G. De Canio, T. Godard, R. Boumghar, and U. Weissmann, “Optimization of the battery usage during eclipses using a machine learning approach,” in 15th International Conference on Space Operations, Marseille, France.
-  A. C. Hernández, C. Gómez, J. Crespo, and R. Barber, “Adding uncertainty to an object detection system for mobile robots,” in 2017 6th International Conference on Space Mission Challenges for Information Technology (SMC-IT), 2017, pp. 7–12.
-  A. Giusti, J. Guzzi, D. C. Cireşan, F.-L. He, J. P. Rodríguez, F. Fontana, M. Faessler, C. Forster, J. Schmidhuber, G. Di Caro, D. Scaramuzza, and L. M. Gambardella, “A machine learning approach to visual perception of forest trails for mobile robots,” IEEE Robotics and Automation Letters, vol. 1, no. 2, pp. 661–667, 2016.
-  T. J. Finn, R. Boumghar, J. Martinez, and A. Georgiadou, “Machine learning modeling methods for radiation belts profile predictions,” in 15th International Conference on Space Operations, Marseille, France.
-  R. Boumghar, L. Lucas, and A. Donati, “Machine learning in operations for the Mars Express orbiter,” in 15th International Conference on Space Operations, Marseille, France.
-  L. Breiman, “Random forests,” Machine Learning, vol. 45, no. 1, pp. 5–32, 2001.
-  H. Blockeel, L. De Raedt, and J. Ramon, “Top-down induction of clustering trees,” in The 15th International Conference on Machine learning. Morgan Kaufmann, San Francisco, CA, 1998, pp. 55–63.
-  L. Breiman, J. Friedman, R. Olshen, and C. J. Stone, Classification and Regression Trees. Chapman & Hall/CRC, 1984.
-  H. Blockeel, “Top-down induction of first order logical decision trees,” Ph.D. dissertation, Katholieke Universiteit Leuven, Leuven, Belgium, 1998.
-  V. A. Huynh-Thu, A. Irrthum, L. Wehenkel, and P. Geurts, “Inferring regulatory networks from expression data using tree-based methods,” PLOS ONE, vol. 5, no. 9, pp. 1–10, 09 2010.
-  J. H. Friedman, “Greedy function approximation: A gradient boosting machine,” The Annals of Statistics, vol. 29, no. 5, pp. 1189–1232, 2001.
-  Y. Freund and R. E. Schapire, “A decision-theoretic generalization of on-line learning and an application to boosting,” Journal of Computer and System Sciences, vol. 55, no. 1, pp. 119–139, 1997.
-  L. K. Hansen and P. Salamon, “Neural network ensembles,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 12, pp. 993–1001, 1990.