Regularized Operating Envelope with Interpretability and Implementability Constraints

12/21/2019
by   Qiyao Wang, et al.
hitachi
1

Operating envelope is an important concept in industrial operations. Accurate identification for operating envelope can be extremely beneficial to stakeholders as it provides a set of operational parameters that optimizes some key performance indicators (KPI) such as product quality, operational safety, equipment efficiency, environmental impact, etc. Given the importance, data-driven approaches for computing the operating envelope are gaining popularity. These approaches typically use classifiers such as support vector machines, to set the operating envelope by learning the boundary in the operational parameter spaces between the manually assigned `large KPI' and `small KPI' groups. One challenge to these approaches is that the assignment to these groups is often ad-hoc and hence arbitrary. However, a bigger challenge with these approaches is that they don't take into account two key features that are needed to operationalize operating envelopes: (i) interpretability of the envelope by the operator and (ii) implementability of the envelope from a practical standpoint. In this work, we propose a new definition for operating envelope which directly targets the expected magnitude of KPI (i.e., no need to arbitrarily bin the data instances into groups) and accounts for the interpretability and the implementability. We then propose a regularized `GA + penalty' algorithm that outputs an envelope where the user can tradeoff between bias and variance. The validity of our proposed algorithm is demonstrated by two sets of simulation studies and an application to a real-world challenge in the mining processes of a flotation plant.

READ FULL TEXT VIEW PDF
POST COMMENT

Comments

There are no comments yet.

Authors

page 1

page 9

06/04/2010

On the definition of a theoretical concept of an operating system

We dwell on how a definition of a theoretical concept of an operating sy...
05/26/2021

Automatic design of quantum feature maps

We propose a new technique for the automatic generation of optimal ad-ho...
12/17/2019

Balancing the Tradeoff Between Clustering Value and Interpretability

Graph clustering groups entities – the vertices of a graph – based on th...
12/10/2020

Data-driven Method for Estimating Aircraft Mass from Quick Access Recorder using Aircraft Dynamics and Multilayer Perceptron Neural Network

Accurate aircraft-mass estimation is critical to airlines from the safet...
01/22/2019

On the validation of complex systems operating in open contexts

In the recent years, there has been a rush towards highly autonomous sys...
06/24/2013

Modeling The Stable Operating Envelope For Partially Stable Combustion Engines Using Class Imbalance Learning

Advanced combustion technologies such as homogeneous charge compression ...
07/28/2021

Q-Learning for Conflict Resolution in B5G Network Automation

Network automation is gaining significant attention in the development o...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

I Introduction

In industrial operations, an important concept is that of the operating envelope. Conceptually, the operating envelope is a set of operational parameters, such that some KPI is optimized. In the industrial context, typical KPIs include product quality, operational safety, equipment efficiency, environmental impact, etc [1, 2, 3, 4]. The operating envelope has wide application since it directly targets the business outcome and yields actionable recommendations in the operations space. It has been widely considered across numerous industrial domains. For instance, an operating envelope for an oil well was identified by [5] to ensure asset integrity and production performance in the oil and gas field. Similarly, an operating range in terms of airflow, fuel cell, and turbine load was determined to ensure the steadiness of a direct fired fuel cell gas turbine [6]. Other published applications of the operating envelope include generating a safer industrial system design (e.g., sub-sea oil production facilities) [7], and identifying more stable operations for partially stable combustion engines [8].

Despite its popularity and significance in the industrial domain, there is no universally agreed upon definition for an operating envelope. Very often, different concepts are ambiguously referred to as an operating envelope by different groups of researchers and engineers. The first contribution of our paper is that we comprehensively identify the required components to define the operating envelope and systematically distinguish among different types of operating envelopes.

Analogous to traditional machine learning, the target value or the KPI of interest is called the

response variable

. Without loss of generality, we assume that there is only one response variable of interest and higher response values indicate better performances. The factors that affect the response variable can be categorized into either

control variables or state variables. The control variables correspond to operations that are fully controllable by operators. From modeling perspective, these variables can be treated as deterministic. For instance, when driving a car, the pressure from feet to the pedal is a control variable. The state variables

quantify factors which affect the response variable but cannot be directly controlled. State variables are considered as random variables in modeling. In the car driving example, the actual pressure transmitted to the accelerator is a state variable which affects the response variable speed, but cannot be directly controlled.

Depending on the availability of data sources and the problem of interest, three types of operating envelopes and the corresponding use cases can be defined, i.e., 1) operating envelopes with respect to the state variables, 2) operating envelopes with respect to control variables, 3) operating envelope with respect to both state and control variables. In this paper, we mainly focus on the first type of operating envelope which corresponds to scenarios where the response variable and state variables are given, while the control variables are not recorded or not easily quantifiable. The impact of control variables on the response variable is through the state variables, and the underlying mechanics are typically not completely understood or too difficult to model. The operating envelope problem is then to identify good sets of values in the state variable space such that the corresponding response variable is higher or highest when compared to the other regions. Random state variables are the more general cases and more challenging to be dealt with than the deterministic control variables. The proposed definition and the corresponding solutions discussed later can easily be extended to handle the other two types of scenarios with appropriate modifications.

In the literature, for the considered type of operating envelope problem, there are two existing types of methods. The first type of methods is based on domain knowledge. Specifically, domain experts build physical models to determine or simulate the specific optimal operating envelope. Previous work falling into this category include [6, 9]. This type of approach is well supported by the domain expert’s opinions. However, this manual operating envelope identification process is time-consuming and even infeasible when the industrial process or equipment is so complicated that there are too many correlated state variables that need to be considered simultaneously. Another drawback of domain knowledge based approaches is that they cannot be easily scaled across different industries or different types of equipment.

The second type of approaches uses data-driven methods. For these, data records are first grouped into a ‘high response’ and a ‘low response’ groups based on the response variable. Then classification algorithms are utilized to learn the boundary between these two groups. The learned boundary is used to indicate the operating envelope, i.e., regions in the state variable space with a higher probability of belonging to the ‘high response’ group are considered acceptable. For example, in

[10] authors used the support vector machine to determine the gasoline direct injected engine admissible operating envelope. Classification based approaches have several drawbacks: (i) When the response variable is a numerical variable, classification based algorithms discard information about the response variable by converting the numerical data to categorical data. The lack of differentiation on the KPI values within the identified region may result in sub-optimal performance. (ii) A bigger challenge with these approaches is that with the goal of maximizing classification accuracy, these approaches may result in very complex or narrow regions, which greatly reduces the interpretability (meaning, whether the outcome is comprehensible by the operator) and implementability (meaning that the operating envelope can be achieved in practice) of the operating envelope. For example, the classification algorithms usually do not put continuity constraints on the targeting operating envelope. As an illustration, suppose we have three state variables denoted by , , , and we apply classification algorithms to identify the operating envelope in the space spanned by these three state variables to achieve ‘high’ performance. The resulting operating envelope with high classification accuracy may indicate that states and are good operations and are inside the envelope, but the another state in-between the line connecting the above two states, , is a bad operation and is outside of the envelope. This type of non-intuitive operating envelope is not comprehensible to operators and is an obstacle in operationalizing it in practice. That is, the operating envelopes of complex shapes that are non-compact are not easily interpretable. Another observation is that tight and narrow operating envelopes are not easily implementable.

To overcome these limitations, we propose a new definition for the operating envelope problem. Aiming at solving the first limitation listed above, we propose to directly set the expected value of the response variable as the optimization target. As for the second set of limitation, to ensure interpretability, we only consider operating envelopes that are in the form of a single parameterizable continuous compact set or union of such disjoint compact sets in the state variable space, and to ensure the implementability of the resulted operating envelope, we put an additional constraint on the probabilistic coverage of our recommendation. Specifically, we require that the probability of state variables falling into the operating envelope is greater than some pre-specified threshold. We call the identification of the region with the largest mean response value subjecting to these interpretability and implementability constraints as the operating envelope with interpretability and implementability constraints problem. (We give precise definitions in the next section). This is a novel way of formulating the operating envelope problem, which is motivated by the fact that real-world practitioners usually not only care about the accuracy of our recommendation but also the interpretability and the implementability of data-driven solutions.

Interpretable machine learning is a rapidly rising field [11, 12]

with the growing implications of deep learning models in real world practices. The classical interpretable models focus on interpreting the impact of certain feature values on the model predictions. We extend this idea further: the interpretability in our work aims to ensure that the recommendation on the identified variables can be easily interpreted by operators. For example, the value of a certain variable should be in a range. Like the traditional understanding of interpretability, we aim at increasing the trust between human and models, and hence their acceptance.

Our proposed operating envelope problem is essentially a region searching task. The identifications of interesting regions has been previously studied by, such as the detection of outlying regions [13]

, the detection of salient regions in computer vision

[14], the detection of clusters, etc. However, none of these region identification problems is formulated to maximize the expected region-wise response variable value. The proposed operating envelope with interpretability and implementability constraints has never been considered before and no existing algorithms can be directly applied to solve it. In this paper, we propose an effective solution which is summarized as follows.

We first apply the penalty method [15] to turn the probabilistic coverage constrained optimization problem into an unconstrained problem. As shown in the next section, no closed form exists for the objective function in general, which eliminates the applicability of methods that require gradient or Hessian information (e.g., Newton’s method). One may then adapt one-dimensional directed search methods such as Golden Section Search to solve this multi-dimensional optimization problem with black-box function via coordinate cycling search. However, the directed search based methods process a single point of the search space in the searching iterations and are not population based, which poses a high risk of being trapped in a local optimum. Hence, we propose to use the genetic algorithm (GA) [16] to search over all possible regions characterized by a finite set of parameters. GAs are more robust than the directed search algorithms and the effectiveness depends less on the initial solution values. Another important property of the GAs is that they maintain a population of potential solutions, which reduces the risk of being trapped in a local minimum. To boost the efficiency of GA, we adopt parallelization for GA [17]

. To enhance the generalization of the calculated operating envelope, we propose to add a regularization term using the sample standard deviation of the estimated expected response value within any region during the learning phase.

The contributions of the paper are summarized as follows:

  • We systematically identify the components in the conventional operating envelope problem and map the space of existing solutions.

  • We propose the interpretability and the implementability constraints, and define a new operating envelope identification problem with these two constraints.

  • For the new operating envelope definition, we design an algorithm by combining the penalty method with the genetic algorithm.

  • We also propose a regularization strategy to improve the generalization of the achieved operating envelope.

The rest of the paper is organized as follows. Section II presents the proposed definition for the operating envelope with interpretability and implementability constraints problem. Section III discusses in detail our proposed solutions, including the estimation module, the search module, and the generalization consideration. Section IV presents numerical experiments including two simulation studies and one real-world data analysis task. Section V concludes the paper.

Ii Definition of the Proposed Operating Envelope with Interpretability and Implementability Constraints Problem

Ii-a Mathematical notations

Let denote the response variable, which represents the KPI of interest in our operating envelope problem. The state variables that affect the value of , i.e., the input in machine learning language, are denoted by a dimensional vector . Denote the underlying density distribution of as , which is unknown. To simplify our notations, we mainly focus on scenarios where the state variables are numerical variables in this paper. The proposed problem formulation and the corresponding solutions can be extended to handle categorical state variables with mild modifications. The space spanned by is denoted as . Suppose that there is an underlying mapping from to the response variable , i.e.,

(1)

where represents the combination of any other factors that is not in and the random disturbance. We assume that the mapping has been normalized such that the expectation of is 0 and the variance is , and is independent of . Neither the structure of nor is known beforehand. Instead, we have access to independent pairs of samples, , with for .

Ii-B Problem formulation

The ultimate goal of our operating envelope problem is to identify an optimal region in the state variable space such that the response variable is maximized, and the detected region is interpretable and implementable by making use of the observed data points , . Before presenting the proposed formulation for our new operating envelope problem, we first describe the concepts of interpretability and implementability of any region in the state variable space.

To enhance the interpretability of the achieved envelope, we propose to define eligible candidates as regions of certain simple geometrical shapes. Specifically, we assume that the qualified regions are -sided hyper-rectangles with sides being parallel to the state variable axes or unions of () disjoint such hyper-rectangles. The mathematical representations of this type of hyper-rectangles is provided in Definition 1. Hyper-rectangles that are parallel to the orthogonal state variable axes are essentially the Cartesian product of intervals on each axis. The identified optimal hyper-rectangle explicitly indicates within which interval each state variable should be, increasing the explainability of data-driven results to operators.

For any region defined above, we propose to use the probability that state variable falling into it to quantify its implementability, i.e. . A larger value of indicates that the corresponding region has a better implementability. To ensure the implementability in the output region, we introduce a lower bound and define implementable regions as all regions that satisfy the coverage constraint

(2)

In Eq. (2), higher values of place higher requirements on the implementability. If any region size is implementable, can be set as 0.

is a hyperparameter that needs to be specified in advance according to the practical needs.

Based on the described interpretability and implementability concepts, the proposed operating envelope problem is formally defined in Definition 1 below.

Definition 1.

For a response variable and state variables , suppose that any targeting operating envelope is an union of () disjoint hyper-rectangles in the state variable space , which is denoted by , where , with and , for and .

The operating envelope is then the solution of:

(3)

subject to the probabilistic coverage constraint

(4)

as well as the mutually disjoint region constraint, i.e., and , there exist such that

(5)

In Definition 1, Eq. (3), the main objective function, indicates that the goal of our proposed operating envelope is to maximizing the region-wise response variable over region-characterizing factors, i.e, , for . To the best of our knowledge, this is the first formal definition for the operating envelope identification problem that directly considers the expected value of response variable within regions as the objective function. In previous literature on data-driven operating envelope problems, machine learning classifiers treat the accuracy measures for predicting the labels ‘ cutoff’ and ‘ cutoff’[18, 19] as the objective function. The proposed definition also originally incorporates customizable interpretability and implementability constraints by Eq. (4) and Eq. (5) to account for important practical requirements. The desired operating envelope that is a union of disjoint rectangles within a space spanned by covariates is given in Figure 1.

Fig. 1: Visualization of the desired operating envelope with and in Definition 1.

Note that our proposed definition can be generalized to handle candidate regions with other shapes that can be characterized by a finite set of parameters, such as hyperspheres. In Definition 1, is a hyperparameter that indicates the number of desired disjoint hyper-rectangles in the output. Specifying multiple disjoint sub-regions in the output is sometimes important. For instance, the operations may be conducted at multiple different operating modes. For each mode, we need to specify an optimal operating region accordingly. This hyperparameter is optional for our constrained operating envelope problem. When it is not specified, we can output the operating envelope and the corresponding optimal mean response variable under different , and the user can choose among the results based on their own needs.

In the next section, we describe step by step how we use the sample data , , to obtain the optimal envelope defined in Definition 1.

Iii A Data-Driven Solution Approach for the New Operating Envelope Problem

To solve the problem defined in the previous section, two essential modules are required. First, an estimation module is needed to estimate the objective function and the probabilistic coverage for any given region . Proposed estimation methods are described in Section III-A. Second, a search module should be constructed to globally search over the constrained state variable space. Relevant proposals are provided in Section III-B and Section III-C. An operating envelope specific generalization problem and the proposed regularization technique are presented in Section III-D.

Iii-a Estimate objective function and probabilistic coverage

A natural way of evaluating the objective function as well as the probabilistic coverage for any region could be that we first estimate by , for example, through regression models, and estimate

, for example, via the kernel density estimation

. Then we estimate the objective function and the probabilistic coverage by plugging in and to the objective function and the probabilistic coverage, respectively. Specifically, the estimates are

(6)

and

(7)

where represents the numerical integration using techniques reviewed by [20]

. However, this function approximation first and then numerical integration method is computationally heavy, especially in the high dimensional data cases.

To overcome the calculation challenge, we proposed an alternative model-free approach. Based on the observed samples , , for any given region , we propose to use the sample average approximation (SAA) to estimate . Let be an indicator function of event , i.e.,

(8)

then the proposed estimator for is

(9)

For any given region , we propose to estimate the probabilistic coverage by the sample proportion estimator. Mathematically,

(10)

The model-free sample average estimator and the sample proportion estimator in Eq. (9) and Eq. (10) are efficient to calculate. Our simulation studies also indicate that they yield accurate estimates for and , respectively, as long as the sample data is representative over the entire space .

When the raw data , is not representative over the entire space. We propose to improve data’s representability by first generating more data from the learned models and before using Eq. (9) and Eq. (10) to estimate the required components. We focus on the model-free estimation approach given in Eq. (9) and Eq. (10) in this paper.

Iii-B Penalty method for the constraints

Based on the estimation module in the previous section, empirically, for any given and , we need to calculate the optima for the following constrained optimization problem,

(11)

subject to

(12)

where , with , and . and , there exist such that

(13)

Note that and are given by Eq. (9), (10).

Note that the above constrained optimization problem involves decision variables (i.e., and ) and the disjoint constraint Eq. (13) actually involves individual conditions. To further boost the computational efficiency, we define a new objective function with the disjoint constraint Eq. (13) considered as . Let the set consist of all regions that satisfy the requirements regarding region in Eq. (13) be . The new objective function is defined as

(14)

with being a numerical number that is smaller than the minimum value of , . The problem in Eq. (11), Eq. (12) and Eq. (13) is simplified to

(15)

subject to

(16)

Lagrangian relaxation is a popular approach to solve the constrained optimization problem like our problem expressed in Eq. (15) and Eq. (16). However, the Lagrangian relaxation approach is known to suffer the problem of the duality gap under the general setting. It is not guaranteed that we always obtain the global or even local optima. In our numerical studies, we experimented the Lagrangian relaxation approach but failed to find the optimal solutions in some situations. In this paper, we propose to use an augmented Lagrangian method, called the penalty method. To be more specific, the original constrained optimization problem in Eq. (15) and Eq. (16) is equivalently transformed into

(17)

where , is the minimal point of function , , defined as

(18)

Iii-C Genetic algorithm for optimization

The next step is to build a module to solve the single objective optimization problem in Eq. (18) for any given . Due to the inaccessibility to the closed form of the objective function and the high dimensional nature (total number of parameters for any given is

), we propose to use the genetic algorithm (GA), one of the evolutionary algorithms, to conduct the optimization task

[21]. GA does not require information or approximation on derivatives, which is not available in our problem setting. Moreover, the GA is more robust than coordinate cycling directed search methods as it maintains a population of potential solutions in each iteration of searching and bears less risk of being trapped in a local minimum.

GA tries to mimic the natural process and consists of three components: selection, based on the fitness score, determining which individual candidates are chosen for the mating process and how many offspring each candidate produces; crossover, an exchange is performed between some variables of the parents, producing two new individuals; and mutation, for a few selected offspring, one variable is altered by a small perturbation[21]. When using GA to optimize over continuous variables, the real coding is used and the fitness of individual candidates is evaluated by the magnitude of the objective function[16].

To solve the optimization problem in Eq. (18) using GA, for any given value of , we proceed as follows:

  • Step 1: Initiate a candidate population of size : for , denote the -th initialization as a -dimensional vector , for . The region spanned by these values is denoted as .

  • Step 2: Evaluate the fitness (i.e., how good it is in terms of maximizing the objective function in Eq. (18) ) of each initialization by evaluating the objective function .

  • Step 3: Check whether certain stopping criteria is met. For instance, the criteria can be the increment in the fitness, compared to the previous population, is smaller than some threshold. If yes, output the best individual as the identified region . Otherwise, move to the next step to produce a new population.

  • Step 4: There are several steps in the new population producing process:

    • Select which individuals among the current population will produce offspring based on their fitness, i.e., the fitted values of the objective function .Individuals with higher fitness values have higher chances to be selected.

    • Some of the variables are exchanged between the selected parents to produce offspring.

    • The produced offspring are mutated by certain perturbation with a certain probability.

    • The produced new population is fed into step 2.

In this paper, we used the R package ‘GA’ [22] to numerically implement the above procedures. The built-in parallel running option is adopted to increase the efficiency.

Iii-D Generalization

In supervised machine learning problems, generalization refers to the ability of an algorithm being effective for unseen data [23]. The generalization of machine learning algorithms is important as it ensures that a similar accuracy level can be achieved when the model is deployed to make predictions for new data sets.

Analogously, in our operating envelope problem, it is also important to ensure that the achieved mean value of the response variable (i.e., the KPI) when applying the identified region to unseen data does not deviate much from the mean value of the response variable achieved in the training phase due to sampling error. To ensure the generalization of the identified operating envelope, we propose the following regularized objective function

(19)

where is the sample standard deviation of the achieved estimate ,

represents the size of regularization on the sample standard deviation part. The intuition is that if the sample standard error (i.e., variation across different data sets) of the achieved optimal average KPI within the detected hyper-rectangle is small, we have better chance to maintain the same optimal KPI level for any new data set.

There is usually no closed form for in Eq. (19). We propose to use the bootstrapping re-sampling technique to non-parametrically estimate this quantity [24]. The detailed estimation procedure is summarized as follows. Let the total number of bootstrapping be ,

  • Step 1: For iteration to :

    • Sample with replacement from the raw index {1,…,n}, and get another index set of length , .

    • The new data set of size is for .

    • Compute

      (20)
  • Step 2: Compute and output

    (21)

Let the estimated sample standard deviation be . The empirical regularized objective function for any given region is then

(22)

The counterparts of the bias and variance concepts in supervised machine learning problems exist in the proposed operating envelope problem in Definition 1. Specifically, we would like to achieve as high KPI value as possible in our setting. Hence, we define the bias as the gap between the expectation of the achieved KPI values for an unseen data set and the true optimal KPI value. The true optimal KPI value refers to the optimal objective value in Definition 1 assuming that we could have had access to the whole data population. In informal notation, . Since we would not have the true optimal KPI value, we propose to use another non-increasing function of the expected KPI value of unseen data to represent bias. In particular,

(23)

The variance concept here is similar to its counterpart in machine learning. In particular, we define variance as the variability of achieved KPI values of an unseen data set when applying the identified regions by solving the problem in Definition 1 using different training data sets. It is formally defined as follows:

(24)

For any given regularization parameter , the bias and variance defined above can be estimated by the cross-validation technique [25]. Similar to supervised machine learning, there exists a trade-off between the bias and variance, which requires judgments of the real needs to decide how to balance between bias and variance in the operating envelope problem. This fact is demonstrated by the simulation studies and the real-world data analysis in Section IV.

All the components discussed in Section III-A to III-D works together and forms the proposed regularized ‘GA + penalty’ algorithm summarized in Algorithm 1.

  1. Achieve the desired regularization size :

    1. Split data into training and testing sets.

    2. Specify a set of candidates for the regularization parameter , denoted as .

    3. For any given ,

      1. Specify a set of candidates for the penalty parameter in Eq. (18), denoted as .

      2. For any given ,

        • Use training data to calculate .

        • Use training data to bootstrap to achieve .

        • Use the genetic algorithm to solve

          (25)
      3. The desired penalty parameter for is set as

        (26)

        The corresponding optimal region is saved.

      4. Use to calculate the bias and variance term corresponding to with testing data.

    4. Based on the bias and variance over different to determine the desired .

  2. Output the region detected corresponding to stored in step c) as the optimal operating envelope.

Algorithm 1 Regularized ‘GA + penalty’ algorithm

Iv Numerical Experiments

Iv-a Simulation study I

In this subsection, we consider a set of simulations with state variable to demonstrate the effectiveness of our proposed regularized ‘GA + penalty’ algorithm in Section III for the operating envelope with interpretability and implementability constraints problem defined in Section II. The first two experiments focus on investigating the impacts of different components in Definition 1 on the region detection results. In these two experiments, the variability is simulated to be the same across different sub-regions of and the ‘GA + penalty’ algorithm without regularization is utilized. In the latter two experiments, we investigate the benefit of adding regularization. For all the simulation studies, we assume that data , for are generated from model:

(27)

where and .

Iv-A1 Simulation I-(a)

In this simulation, we demonstrate the influence of on the operating envelope output. The mapping function is (see Fig. 4(a)). The number of samples is . The standard deviation of errors is

. The probability density function

is a mixture of three Gaussian distributions, i.e., it with probability

of being , for . The mean values are , , and , with parameter quantifying the sparseness of data around the first and the third peaks of . The Gaussian distributions and the corresponding generated data for and are provided in Fig. 2. As shown by Fig. 2, with the increase of , more data are centered closed to the middle peak .

The probabilistic coverage threshold is specified as . The number of disjoint regions is . Three different values for are used, i.e., . For a given , we use the proposed ‘GA + penalty’ solution in Section III to calculate the operating envelope (i.e., interval) that satisfies the probabilistic coverage constraint. The calculation is repeated for Monte Carlo simulations. The frequencies that the calculated interval falling into each of the three sub-regions , , and are visualized in the bar charts in Fig. 3. Our proposed algorithm produces reasonable results. With the increase of , i.e., more data are centered around the middle region, the proportion of the identified operating envelopes falling into the middle sub-region increases.

(a)
(b)
Fig. 2: Components of the mixture Gaussian density and the generated data with in Simulation I-(a).
(a)
(b)
(c)
Fig. 3: Simulation I-(a). Bar charts for the location of the detected region under different probability density functions, i.e. . The three bars represent the number of times that the detected region being within , , and respectively over 100 Monte Carlo repetitions.

Iv-A2 Simulation I-(b)

In this experiment, we investigate the impact of the probabilistic coverage threshold . The state variable is distributed the same as the one in the previous experiment with . is

(28)

The standard deviation of random errors is .

The calculated operating envelope for probabilistic coverage threshold are given in Fig. 4. The output from our proposed algorithm is well aligned with our intuition that the calculated operating envelope is wider when the required probabilistic coverage is larger.

(a)
(b)
(c)
Fig. 4: Simulation I-(b). Calculated operating envelope indicated by the two vertical lines for probabilistic coverage threshold .

Iv-A3 Simulation I-(c)

In this study, we demonstrate the benefit of adding regularization. The state variable is the same as the Simulation I-(b). The relation between and is quantified by , which is the same as the first simulation study (Fig. 4(a)). The standard deviation of random errors takes three values, , for the three sub-regions, , , and respectively (Fig. 4(c)). The generated data are plotted in Fig. 4(d). Based on the simulation setting, it can be seen that sub-region is a better region to construct an operating envelope. This is because the magnitude of are the same among all three sub-regions and the sampling standard deviation of is smaller when , which means that the evaluated value of for any new data set has a larger chance to be close to the current value. The probabilistic coverage threshold considered is .

The three sub-regions have the same and . Without taking the variability of relevant statistics into account, the detected operating envelope has the same chance to be within any of these three sub-regions. This phenomenon is justified by Fig. 5(a). Now we apply the regularized ‘GA + penalty’ technique described in Section III-D. For any given region, the sample standard deviation of is done by the bootstrapping procedure with . For any candidate among , the 4-fold cross-validation is used to estimate the bias and variance in Eq. (23) and Eq. (24). According to Fig. 6(a), with the increase of , the black bias curve (i.e., the evaluated mean response within the detected region) stays relatively stable, while the blue variance curve (.e., the variability of achieved optimal mean KPI) declines. We set , the smallest that makes variance reach the minimum. The corresponding optimal region identified by the proposed algorithm is visualized in Fig. 5(b). Our proposed regularization term effectively accounts for the variability factor.

(a) Assumed .
(b) Components of .
(c) Standard deviation of error at different .
(d) Generated data samples, .
Fig. 5: Settings and the generated data for Simulation I-(c).
(a) Without regularization
(b) With regularization
Fig. 6: Simulation I-(c). Left: Bar chart for the location of the detected region without using regularization. Right: The achieved envelope with regularization ().
(a) Simulation I-(c)
(b) Simulation I-(d)
Fig. 7: Bias (black line) and variance (blue line) as a function of the regularization parameter in Simulation I-(c),(d).

Iv-A4 Simulation I-(d)

In this simulation, we discuss the trade-off between the bias and the variance discussed in Section III-D. The mapping is

(29)

All the other settings are the same as Simulation I-(c). Function and the generated data are provided in Fig. 8.

For any candidate among , the 4-fold cross-validation technique is used to estimate the bias as well as the variance term defined in Section III-D. As shown by Fig. 6(b). With the increase of , the bias term (i.e., the evaluated mean response within the detected region) stays relatively stable at the beginning when the regularization is not large enough to push the region to the latter two sub-regions with lower peaks. When the regularization size reaches a certain level ( in this case), the bias increases to its maximal and then stays at that level, as the detected envelope stays within the third sub-regions with smaller mean and less variability. The variance curve displays an inverse trend with the increase of the regularization size.

(a) Assumed .
(b) Generated data samples.
Fig. 8: Settings and the generated data for Simulation I-(d).

Iv-B Simulation study II

In this section, we consider a simulation study with state variables. We illustrate that it is infeasible to conduct a multidimensional operating envelope identification by separately learning the lower and upper limits on each state variable dimension when the state variables are correlated. We also show the results for different numbers of the disjoint regions.

The data are simulated as follows. The joint probability density function of the state variable vector is , where , the means are , , and . The mapping from to is specified as , with being when and when . The standard deviations of random errors are respectively 0.11, 0.15, 0.05, 0.5 in the four sub-regions (, , , ) in Fig. 8(a). The contour plot of the generated data can be found in Fig. 9.

Under this simulation setting, it can be seen that we can achieve a better expected region-wise response variable by bounding both state variables. Let’s first consider to calculate a single rectangle shaped operating envelope (i.e., ) under the probabilistic coverage constraint . To calculate the 2-dimensional operating envelope, one natural thinking is to bound each state variable separately and then construct a rectangle shaped envelope from the Cartesian product of the detected intervals. However, it is infeasible to appropriately specify the required probabilistic coverage on each state variable dimension, due to the fact that relation in Eq. (30) no longer holds when the state variables are not independent.

(30)

In this study, we use simulated data to justify this argument. Specifically, suppose that the required coverage for the targeting rectangle shaped operating envelope is . We calculate the interval on and separately with probabilistic coverage . The identified intervals are shown in Fig. 8(b) and Fig. 8(c). However, the rectangle determined by the Cartesian product of these two intervals only covers of data, which is smaller than the required coverage threshold . A more reasonable approach is to utilize our proposed ‘GA + penalty’ algorithm in Section III to jointly search over the two-dimensional state variable space. The detected region with a valid empirical coverage () is visualized in Fig. 8(d).

Next, we evaluate the performance of the proposed regularized ‘GA + penalty’ algorithm when target results consist of disjoint regions. According to our simulation setting, sub-region has the largest mean and the smallest variance (best sub-region). Sub-regions and have the second largest mean values, and the variability within sub-region is smaller than that within sub-region . These facts indicate that the detected disjoint envelopes should be within and , such that the achieved high mean response variable is more probably to be reached in unseen data. We first set the required probabilistic coverage as , the bias and variance terms as a function of the regularization parameter are given in Fig. 9(a). Based on these plots, we choose and the resulted envelope is shown at the bottom of Fig. 9(a). The bias and variance curves, and the achieved operating envelope for are given in Fig. 9(b). The regularization term introduced in Section III-D helps to push the second envelope into the sub-region with less variability, instead of , even though the mean response is the same within these two sub-regions. All the simulation results in Section IV-A and Section IV-B demonstrate the effectiveness of our proposed regularized operating envelope identification algorithm.

(a) Scatter plot of and with distribution
(b) Output using with coverage threshold .
(c) Output using with coverage threshold .
(d) Output using ,
Fig. 9: Relevant results in Simulation II when the number of disjoint regions is .
(a)
(b)
Fig. 10: Relevant results in Simulation II when the number of disjoint regions is . In the bias and variance trade-off plots, bias is the black line and variance is the blue line.

Iv-C Operating envelope for mining processes

In this section, we apply the proposed regularized ‘GA + penalty’ operating envelope identification algorithm to real industrial data on Kaggle.com. The data set is uploaded to Kaggle by Eduardo Magalhães Oliveira and can be downloaded through https://www.kaggle.com/edumagalhaes/quality-prediction-in-a-mining-process. The data set includes the quality of produced ore in a flotation plant, variables indicating the quality of the raw ore, as well as state variables that quantify the operations along the mining processes. Among all the state variables, the data disposer mentioned that five of them, including starch flow, amina flow, ore pulp flow, ore pulp pH, and ore pulp density, are highly related to the quality of the produced ore. Our objective is to jointly identify an implementable set of upper and lower limits for the five important state variable such that the resulted ore quality is higher or highest given the quality of raw ore fed into the mining processes. Totally, we used 36872 records, and of them are used for training and the remaining data are used for testing.

First, to move the effect of raw ore’s quality on the quality of the produced ore, we propose to normalize the quality of the produced ore by subtracting its fitted value from the random forest regression given the raw ore’s quality. The random forest regression is learned from the training data. Next, we apply the proposed ‘GA + penalty’ approach both with and without regularization to identify the region spanned by the top five important state variables such that the corresponding mean normalized quality score of the produced ore is high. The results for coverage threshold

, the number of disjoint regions , and the regularization parameter are summarized in Table I. As shown by Table I, the results match with our intuition. First, when learning without regularization, we achieve a higher optimal mean response when compared to . This is because we search over a larger set of candidate solutions when is higher. Second, given the number of disjoint regions , the bias increases with the increase of regularization parameter, while the variance is smaller for larger regularization parameter values. This phenomenon justifies the bias and variance trade-off discussed in Section III-D. In practice, users can choose the combinations , , that best meets their needs.

We notice that the computational cost of our GA based algorithm increases with the increment in dimensions (i.e., the number of state variables considered), due to the nature of GA. For the real data analysis, it took us about 5 hours to get the results. However, as discussed in the previous sections, GA is the most reasonable solution for the considered operating envelope with interpretability and implementability constraints problem. From the application perspective, the operating envelope is not needed to be learned in a streaming mode, hence the computational cost is not a big issue for the proposed ‘GA + penalty’ approach.

L Method Envelope Train Test Bias Var
Without
5.65 0.026
With
()
With
()
Without
5.41 0.022
TABLE I: Identified operating envelope results on mining processes when . The baseline, i.e., mean quality score over all data points, is .

V Conclusion and Discussion

In this paper, we formally categorize the operating envelope problem into three types. For identifying the operating envelope with respect to uncontrollable state variables, we propose a new mathematical definition that not only directly targets the region-wise mean response (the ultimate objective of learning operating envelopes) but also effectively accounts for the interpretability and implementability of the solution. To solve for the new operating envelope with constraints, we propose a regularized ‘GA + penalty’ approach which is capable of calculating the desired interpretable and implementable envelope with the bias and variance terms being well balanced. Our numerical experiments including two sets of simulation studies and the application to a real-world data challenge demonstrate the validity of our proposals.

References

  • [1]

    Q. Wang, S. Zheng, A. Farahat, S. Serita, T. Saeki, and C. Gupta, “Multilayer perceptron for sparse functional data,” in

    2019 International Joint Conference on Neural Networks (IJCNN)

    .   IEEE, 2019, pp. 1–10.
  • [2]

    C. Zhang, C. Gupta, A. Farahat, K. Ristovski, and D. Ghosh, “Equipment health indicator learning using deep reinforcement learning,” in

    Joint European Conference on Machine Learning and Knowledge Discovery in Databases.   Springer, 2018, pp. 488–504.
  • [3] Q. Wang, A. Farahat, K. Ristovski, H.-K. Tang, S. Serita, and C. Gupta, “What maintenance is worth the money? a data-driven answer,” in 2018 IEEE 16th International Conference on Industrial Informatics (INDIN).   IEEE, 2018, pp. 284–291.
  • [4] Q. Wang, S. Zheng, A. Farahat, S. Serita, and C. Gupta, “Remaining useful life estimation using functional data analysis,” arXiv preprint arXiv:1904.06442, 2019.
  • [5] H. Kassim, A. Sonde, O. Ayeni, J. Njoku, A. A. Ajayi, A. Akande et al., “A holistic approach to defining well operating envelopes,” in SPE Nigeria Annual International Conference and Exhibition.   Society of Petroleum Engineers, 2013.
  • [6] D. Tucker, E. Liese, and R. Gemmen, “Determination of the operating envelope for a direct fired fuel cell turbine hybrid using hardware based simulation,” National Energy Technology Lab.(NETL), Pittsburgh, PA, and Morgantown, WV …, Tech. Rep., 2009.
  • [7] M. D. Stuber, “Evaluation of process systems operating envelopes,” Ph.D. dissertation, Massachusetts Institute of Technology, 2013.
  • [8] V. M. Janakiraman, X. Nguyen, J. Sterniak, and D. Assanis, “Modeling the stable operating envelope for partially stable combustion engines using class imbalance learning,” arXiv preprint arXiv:1306.5702, 2013.
  • [9] R. A. Rallo, D. A. Dress, and H. J. Siegle, “Operating envelope charts for the langley 0.3-meter transonic cryogenic wind tunnel,” 1986.
  • [10] I. Kolmanovsky and E. Gilbert, “Support vector machine-based determination of gasoline direct injected engine admissible operating envelope,” SAE Transactions, pp. 2172–2178, 2002.
  • [11] S. M. Lundberg and S.-I. Lee, “A unified approach to interpreting model predictions,” in Advances in Neural Information Processing Systems, 2017, pp. 4765–4774.
  • [12] L. H. Gilpin, D. Bau, B. Z. Yuan, A. Bajwa, M. Specter, and L. Kagal, “Explaining explanations: An overview of interpretability of machine learning,” in

    2018 IEEE 5th International Conference on data science and advanced analytics (DSAA)

    .   IEEE, 2018, pp. 80–89.
  • [13] F. T. Liu, K. M. Ting, and Z.-H. Zhou, “Isolation forest,” in 2008 Eighth IEEE International Conference on Data Mining.   IEEE, 2008, pp. 413–422.
  • [14] M.-M. Cheng, N. J. Mitra, X. Huang, P. H. Torr, and S.-M. Hu, “Global contrast based salient region detection,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 37, no. 3, pp. 569–582, 2014.
  • [15] D. P. Bertsekas, Constrained optimization and Lagrange multiplier methods.   Academic press, 2014.
  • [16] R. Chelouah and P. Siarry, “A continuous genetic algorithm designed for the global optimization of multimodal functions,”

    Journal of Heuristics

    , vol. 6, no. 2, pp. 191–213, 2000.
  • [17] H. Mühlenbein, M. Schomisch, and J. Born, “The parallel genetic algorithm as function optimizer,” Parallel computing, vol. 17, no. 6-7, pp. 619–632, 1991.
  • [18] A. Shrikumar, P. Greenside, and A. Kundaje, “Learning important features through propagating activation differences,” in Proceedings of the 34th International Conference on Machine Learning-Volume 70.   JMLR. org, 2017, pp. 3145–3153.
  • [19]

    D. Gunning, “Explainable artificial intelligence (xai),”

    Defense Advanced Research Projects Agency (DARPA), nd Web, 2017.
  • [20] P. J. Davis and P. Rabinowitz, Methods of numerical integration.   Courier Corporation, 2007.
  • [21] K. Deb, A. Pratap, S. Agarwal, and T. Meyarivan, “A fast and elitist multiobjective genetic algorithm: Nsga-ii,”

    IEEE transactions on evolutionary computation

    , vol. 6, no. 2, pp. 182–197, 2002.
  • [22] L. Scrucca et al., “Ga: a package for genetic algorithms in r,” Journal of Statistical Software, vol. 53, no. 4, pp. 1–37, 2013.
  • [23] D. Michie, D. J. Spiegelhalter, C. Taylor et al., “Machine learning,”

    Neural and Statistical Classification

    , vol. 13, 1994.
  • [24] C. Z. Mooney, R. D. Duval, and R. Duvall, Bootstrapping: A nonparametric approach to statistical inference.   Sage, 1993, no. 95.
  • [25] R. Kohavi et al., “A study of cross-validation and bootstrap for accuracy estimation and model selection,” in Ijcai, vol. 14, no. 2.   Montreal, Canada, 1995, pp. 1137–1145.