CDF-Intervals: A Reliable Framework to Reason about Data with Uncertainty

05/15/2014 ∙ by Aya Saad, et al. ∙ The American University in Cairo

This research introduces a new constraint domain for reasoning about data with uncertainty. It extends convex modeling with the notion of p-box to gain additional quantifiable information on the data whereabouts. Unlike existing approaches, the p-box envelops an unknown probability instead of approximating its representation. The p-box bounds are uniform cumulative distribution functions (cdf) in order to employ linear computations in the probabilistic domain. The reasoning by means of p-box cdf-intervals is an interval computation which is exerted on the real domain then it is projected onto the cdf domain. This operation conveys additional knowledge represented by the obtained probabilistic bounds. Empirical evaluation shows that, with minimal overhead, the output solution set realizes a full enclosure of the data along with tighter bounds on its probabilistic distributions.


1 Introduction

This research proposes a novel constraint domain for reasoning about data with uncertainty. The work was driven by the practical usage of reliable approaches in Constraint Programming (CP). These approaches tackle large scale constraint optimization (LSCO) problems associated with data uncertainty in an intuitive and tractable manner. Yet they lack knowledge about the data whereabouts. These whereabouts often indicate the data likelihood or chance of occurrence, which in turn can be ill-defined or have a fluctuating nature. It is important to know the source and type of the data whereabouts in order to reason about them. The purpose of this novel framework is to intuitively describe data coupled with uncertainty or following unknown distributions without losing any knowledge given in the problem definition. The p-box cdf-intervals extend the cdf-intervals approach [Saad et al. (2010)] with a p-box structure to obtain a safe enclosure. This enclosure envelops the data along with its whereabouts with two distinct quantile values, each issuing a cdf-uniform distribution [Saad et al. (2012)]. This research makes the following contributions: (1) a new uncertain data representation specified by p-box cdf-intervals, (2) a constraint reasoning framework that is used to prune variable domains in a p-box cdf-interval constraint relation to ensure their local consistency, (3) an experimental evaluation, using the inventory management problem, that compares the novel framework with existing approaches in terms of expressiveness and tractability. Expressiveness, in this comparison, measures the ability to model the uncertainty provided in the original problem, and the impact of this representation on the solution set realized. Tractability, on the other hand, measures the system time performance and scalability. The experimental work shows how this novel domain representation yields more informed results, while remaining computationally effective and competitive with previous work.

2 Preliminaries

Models tackling uncertainty are classified under the set of plausibility measures [Halpern (2003)]. They are categorized as possibilistic and probabilistic. Convex models, found in the world of fuzzy and interval/robust programming, are favored when ignorance takes place. They are adopted in the CP paradigm in fuzzy Constraint Satisfaction Problems (CSPs) [Dubois et al. (1996)] and numerical CSPs [Benhamou and de Vinci (1994)]. Probabilistic models are best adopted when the data has a fluctuating nature. They are the heart of the probabilistic CP modeling found in valued CSP [Schiex et al. (1995)], semirings [Bistarelli et al. (1999)], stochastic CSPs [Walsh (2002)], scenario-based CSPs [Tarim et al. (2006)] and mixed CSPs [Fargier et al. (1996)]. Techniques adopting convex modeling are characterized as more conservative. They can often consider many unnecessary outcomes along with important ones. This conservative property makes convex modeling highly tractable and scalable, since operations in these models are exerted on the convex bounds only. Probabilistic approaches, on the other hand, add quantitative information that expresses the likelihood, yet they impose assumptions on the distribution shape in order to deal with it mathematically. Moreover, probabilistic computations are very expensive because they often depend on the non-linear probability shape.

The research objective is to introduce a novel framework: the p-box cdf-intervals. It is based on a probability box (p-box) structure [Ferson et al. (2003)] that envelops a set of cumulative distribution functions (cdf). The p-box concept is adopted in the literature, specifically when the environment is uncertain, to represent an unknown distribution with a safe enclosure rather than depending on statistical approximation. A cdf is a monotone (non-decreasing) function that indicates, for a given uncertain value, the probability that the actual data lies at or before that value. It defines the aggregated probability of a value occurring. The p-box bounding cdf distributions in the proposed framework are uniform; each is represented by a line equation in order to keep the computational complexity low. The key idea behind the construction of the p-box cdf-intervals is to combine techniques from the convex models, to take advantage of their tractability, with approaches revealing quantifiable information from the probabilistic and stochastic world, to take advantage of their expressiveness.

The framework is based on CP concepts because they have proved to be considerably flexible in formulating real-world combinatorial problems. The CP paradigm aims at building descriptive algebraic structures which are easily embedded into declarative programming languages. These structures are heavily used in problem solving environments by specifying conditions that need to be satisfied and allowing the solver to search for feasible solutions. The following section demonstrates how to intuitively represent the uncertainty, already given in the problem definition, in order to reason about it by means of the p-box cdf-intervals. The novel representation of the data uncertainty is also compared with existing possibilistic and probabilistic approaches in order to demonstrate the model expressiveness. This representation is input to the solver with a new domain specification. The reasoning about this new specification is then defined, and it is shown to be tractable. Accordingly, combining reasoning techniques from convex models with quantifiable information from probabilistic models yields a novel model that is both tractable and expressive.

3 Input Data Representation

Quantifiable information is often available during the data collection process, but lost during the reasoning because it is not accounted for in the representation of the uncertain data. This information however is crucial to the reasoning process, and the lack of its interpretation yields erroneous reasoning because of its absence in the produced solution set. It is always necessary to quantify uncertainty that is naturally given in the problem definition in order to obtain robust and reliable solutions.

Example 3.1.

Consider, as a running example, the varying cost observations of a steel stud manufacturing item. Fig. 1(a) illustrates the cost variations along with their corresponding frequencies of occurrence. For instance, the point is the amount of the cost/item, equal to , and observed times during the past years (corresponding to a population ). is the number of distinct measured quantiles. The minimum and the maximum observed values, in this example, are and respectively.

To compute the probabilistic/possibilistic representations, the average and standard deviation of the observed population are derived. In this example, they are equal to and respectively. The nearest Normal probability distribution and the fuzzy membership function are illustrated in Fig. 1 (b) and (c).

Figure 1: Varying cost of the steel stud item and its probability histogram: (a) genuine observations (b) Normal distribution (c) fuzzy distribution

To compare the data representations adopted in various approaches, the observed data is projected onto the cdf-domain. By definition, the cdf is a monotonic distribution that keeps the probabilistic information in an aggregated manner. Information obtained from the measurement process is often discrete and incomplete, hence, its cdf-domain projection forms a staircase shape [Smith and La Poutre (1992)]. The cdf distribution of the genuine observed data whereabouts is depicted in the running example by the dotted staircase shape in Figure 2. Normal and fuzzy cdf distributions are shown by the continuous red curves in Fig. 2 (b) and (c). Each is based on an approximation that lacks precise point fitting of the original data whereabouts. Similarly, the cdf-interval, in Fig. 2 (d), approximates the data whereabouts by means of a line connecting the two bounding data values. The convex model representation however shapes a rectangle, illustrated in Fig. 2 (e). This rectangle includes all values in the cdf range . The convex representation treats data values lying within the interval bounds equally, i.e. it lacks the probabilistic information. The p-box cdf-interval, depicted in Fig. 2 (f) enforces tighter bounds on the probabilities when compared to convex models illustrated in Fig. 2 (e). This envelopment guarantees a safe enclosure on the unknown distribution while preserving tractability due to the fact that its bounds are represented each by a line equation.
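The projection of discrete observations onto the cdf domain can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `empirical_cdf` and the cost/frequency pairs are hypothetical and are not the values of the running example. It assumes each observed value appears once in the input, paired with its frequency.

```python
from itertools import accumulate

def empirical_cdf(observations):
    """Turn (value, frequency) pairs into staircase steps (value, cumulative probability)."""
    ordered = sorted(observations)                       # sort by observed value
    total = sum(freq for _, freq in ordered)             # population size
    running = accumulate(freq for _, freq in ordered)    # cumulative frequencies
    return [(value, count / total) for (value, _), count in zip(ordered, running)]

# Hypothetical observations: (cost per item, number of times observed).
steps = empirical_cdf([(12.0, 3), (10.0, 1), (15.0, 4), (18.0, 2)])
for value, cum_prob in steps:
    print(f"P(cost <= {value}) = {cum_prob:.2f}")
```

Each returned pair is one step of the staircase: the cdf stays constant between two successive observed values and jumps at each of them.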

Figure 2: Varying cost of the steel stud item, its derived probabilities, and its cdf distributions

Interpretation of the p-box cdf confidence interval .

For a given interval of points specified by , and are the extreme points which bound the p-box cdf-interval. One can see that this interval approach does not aim at approximating the curve but rather enclosing it in a reliable manner. The complete envelopment is exerted by means of the uniform cdf-bounds, which are depicted by the red curves in Fig. 2 (f). It is impossible to find a point that exists outside the formed interval bounds. The cdf bounds are chosen to have a uniform distribution. Each is represented by a line with a slope issued from one of the extreme quantiles. Storing the full information of each bound is sufficient to restore the designated interval assignment. Bounds are denoted by triplet points, in the D space, to guarantee the full information on: the extreme quantile value observed; the cdf-line issued from this observed value; and the degree of steepness formed by this line. The slope of the uniform cdf-distribution indicates how the probabilistic values accumulate for successive quantiles on the line. Accordingly, the p-box cdf-interval point representation: and .

Definition 3.1.

is the slope of a given cdf-distribution; it signifies the average step probabilistic value. For a given uniform cdf-distribution

(1)

Plotting a point within the p-box cdf-interval deduces bounds on its possible chances of occurrence.

Definition 3.2.

is the interval of values obtained when is projected onto the p-box cdf bounds. For a point denoted as

(2)

and are the possible maximum and minimum cdf values can take; both are computed by projecting the point onto the cdf distributions passing through real points and respectively. They are derived using the following linear projections, computed in complexity:

The equation above guarantees the probabilistic feature of the cdf-function by restricting its aggregated value from exceeding the value and having negative values below .
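A minimal Python sketch of such a linear projection is given here. It assumes, based only on the description in this section, that each bound is a triplet (quantile, cdf value, slope), that the bounding cdf is the straight line through that point with that slope, and that projected values are clamped to the valid cdf range [0, 1]. The names `PBoxPoint`, `project` and `cdf_bounds`, as well as the numeric bounds, are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PBoxPoint:
    quantile: float  # extreme value observed on the real axis
    cdf: float       # cdf value attached to that quantile
    slope: float     # steepness of the uniform cdf line through the point

def project(x, anchor):
    """Evaluate the cdf line through `anchor` at x, clamped to [0, 1]."""
    value = anchor.cdf + anchor.slope * (x - anchor.quantile)
    return min(1.0, max(0.0, value))

def cdf_bounds(x, first, second):
    """Return (minimum cdf, maximum cdf) of x over the two bounding lines."""
    values = sorted((project(x, first), project(x, second)))
    return values[0], values[1]

# Hypothetical bounds, not the values of the running example.
lower = PBoxPoint(quantile=10.0, cdf=0.0, slope=0.25)
upper = PBoxPoint(quantile=20.0, cdf=1.0, slope=0.125)
print(cdf_bounds(13.0, lower, upper))
```

Each projection is a single multiplication and addition, so the cost per point does not depend on the size of the observed population.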

Example 3.2.

is the p-box cdf-interval of the cost/item in Example 3.1. Suppose that , its cdf-bound values . This means that the possible chance of the value to be at most is between and , with an average step probabilistic value between and . Note that this interval is opposed to only one approximated value in the cdf-intervals representation proposed in [Saad et al. (2010)], the fuzzy cdf value and its Normal cdf value is . Note that convex models do not enforce any probabilistic bounds, accordingly, has a cdf .

4 Constraint reasoning

In the CP paradigm, relations between variables are specified as constraints. A set of rules and algebraic semantics, defined over the list of constraints, formalizes the reasoning about the problem. As a fundamental language component in Constraint Logic Programming (CLP), this set of rules, with a syntax of definite clauses, forms the language scheme [Jaffar and Lassez (1987)]. The constraint solving scheme is intuitively and efficiently utilized in the reasoning over the computation domain. The scheme formally attempts to assign to variables a suitable domain of discourse equipped with an equality theory together with a least and a greatest model of fix-point semantics. Starting from an initial state, the reasoning scheme follows a local consistency technique which attempts to constrain each variable over the p-box cdf-interval domain while excluding values which do not belong to the feasible solution. The constraint system was implemented as a separate module in the ECLiPSe constraint programming environment [ECRC (1994)]. ECLiPSe provides two major components to build the solver: an attributed variable data structure and a suspension handling mechanism. Fundamentally, attributed variables are specific data structures which attach more than one data type. Together they permit a new definition of unification which extends the well-known Prolog unification [Le Huitouze (1990), Holzbaur (1992)]. A p-box cdf-interval point is implemented in an attributed variable data structure with three main components: quantile, cdf value and slope. Constraint suspension handling, in turn, is a highly flexible mechanism that aims at controlling user-defined atomic goals. This is achieved by waiting for user-defined conditions to trigger specific goals.
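The interplay between variables carrying domain information and suspended goals can be illustrated with a small Python sketch. It is not the ECLiPSe implementation: the class `Var`, the helper `leq` and the driver `solve` are hypothetical names, only the real (quantile) bounds are modeled, and a goal is simply re-run whenever one of its variables is narrowed.

```python
from collections import deque

class Var:
    """A variable with real bounds and a list of goals suspended on it."""
    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi
        self.suspended = []

    def suspend(self, goal):
        self.suspended.append(goal)

    def narrow(self, lo, hi, agenda):
        """Intersect the domain with [lo, hi]; wake suspended goals on change."""
        lo, hi = max(self.lo, lo), min(self.hi, hi)
        if lo > hi:
            raise ValueError("empty domain: constraints are inconsistent")
        if (lo, hi) != (self.lo, self.hi):
            self.lo, self.hi = lo, hi
            agenda.extend(self.suspended)

def leq(x, y):
    """Build a goal enforcing x <= y and suspend it on both variables."""
    def goal(agenda):
        x.narrow(x.lo, y.hi, agenda)   # x cannot exceed the upper bound of y
        y.narrow(x.lo, y.hi, agenda)   # y cannot fall below the lower bound of x
    x.suspend(goal)
    y.suspend(goal)
    return goal

def solve(goals):
    """Run goals until no domain changes any more (a fix-point)."""
    agenda = deque(goals)
    while agenda:
        agenda.popleft()(agenda)

x, y, z = Var(3, 10), Var(0, 8), Var(1, 5)
solve([leq(x, y), leq(y, z)])
print((x.lo, x.hi), (y.lo, y.hi), (z.lo, z.hi))
```

In this toy run the two ordering goals wake each other until all three domains stabilize, which mirrors the idea of triggering goals on user-defined conditions.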

Implemented rules in our solver infer the local consistency in the p-box cdf-interval domains of the binary equality and ordering constraints , and that of the ternary arithmetic constraints . Operations, in the solver, are exerted first as real interval computations, and then they are projected onto the cdf domain using a linear computation, as shown in Definition 3.2. This section demonstrates how the ordering and the ternary addition constraints infer the local consistency over the variable domains of , , and assuming that their initial bindings are , and respectively. The ternary multiplication, subtraction and division constraints are implemented in the same way.

Ordering constraint

. To infer the local consistency of the binary ordering constraint, the lower cdf-bound of is extended and the upper cdf-bound of is contracted.

Example 4.1.

Let and be two p-box cdf-interval domains. and . The effect of applying the set of constraints and , prunes the domain of . As a result, the variable is bounded by the lower bound of and by the upper bound of : as shown in Fig. 3 (a). Clearly the obtained domain of , in this example, preserves the convex property of the p-box cdf-intervals. Let be subject to the domain pruning using the set of constraints: and . As a result, should be bounded by the lower bound of and the upper bound of . However, in this case, at lower quantiles , the upper bound distribution of precedes the lower bound of . The fact that conflicts with the stochastic dominance property of a p-box cdf-interval domain. In order to resolve this conflict, the real bounds of are further pruned to the point of the probability intersection .

Figure 3: Ordering constraint execution
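The two ingredients of the ordering inference can be sketched in Python under simplifying assumptions. The first function applies the standard pruning of an ordering constraint to real intervals only. The second computes where two uniform cdf lines, each given as a hypothetical (quantile, cdf, slope) triplet, intersect, which is the kind of point the real bounds are pruned to when the bounds cross. The names `prune_leq` and `line_intersection` and all numbers are illustrative, and the cdf-bound updates of the actual inference rule are not reproduced.

```python
def prune_leq(x, y):
    """Enforce X <= Y on real intervals given as (lo, hi) tuples."""
    (x_lo, x_hi), (y_lo, y_hi) = x, y
    new_x = (x_lo, min(x_hi, y_hi))   # X cannot exceed the largest Y
    new_y = (max(y_lo, x_lo), y_hi)   # Y cannot fall below the smallest X
    if new_x[0] > new_x[1] or new_y[0] > new_y[1]:
        raise ValueError("ordering constraint is inconsistent")
    return new_x, new_y

def line_intersection(p1, p2):
    """Intersect two cdf lines given as (quantile, cdf, slope) triplets.

    Returns (quantile, cdf) of the crossing point, or None for parallel lines.
    """
    q1, f1, s1 = p1
    q2, f2, s2 = p2
    if s1 == s2:
        return None
    # Solve f1 + s1 * (x - q1) == f2 + s2 * (x - q2) for x.
    x = (f2 - f1 + s1 * q1 - s2 * q2) / (s1 - s2)
    return x, f1 + s1 * (x - q1)

print(prune_leq((2, 9), (1, 6)))
print(line_intersection((0.0, 0.0, 0.25), (0.0, 0.5, 0.125)))
```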

Ternary addition constraints

. The addition operation is implemented by summing up pair of points, defined in the D space and located within the p-box cdf-interval bounds which enclose the domain ranges of and . This addition operation is linear. It is convex and can be computed from the end points of the domains involved in the addition. The p-box cdf-domain of is updated to envelop all points defined in that range.

Figure 4: Ternary addition inference rule execution: initial bindings are , and ; final bindings are , and .

Example 4.2.

Fig. 4 depicts the execution steps of the p-box cdf ternary addition inference rule, exerted on the variable domains involved in the relation . Observe that domain pruning is performed in a dimensional manner: quantile and cdf. The addition of the two variables and is performed on the bounds of their predefined domains then it is projected onto the initial bindings. The first row in Fig. 4 shows output domains from the addition , and . Domain operations are exerted on the extreme points. The second row illustrates the intersection of the output domains with the initial bindings, assigned to , and . Obtained domains from the ternary addition operation are , and . Clearly, in this example, pruning real quantile bounds is identical to that of real domains and since output domains preserve the stochastic dominance property no further pruning takes place.

The ternary addition constraint exerted on p-box cdf-interval domains is a simple addition computation since it adopts the real-interval arithmetics which are then projected, linearly, onto the cdf domain. This operation is opposed to the fuzzy extended addition operation adopted in the constraint reasoning utilized in the possibilistic domain [Dutta et al. (2005), Petrović et al. (1996)], and to the Normal probabilistic addition which has a high computation complexity that is due to the Normal distribution shape [Glen et al. (2004)].
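The real-interval part of this inference, computing each domain from the other two and intersecting it with its initial binding, can be sketched as follows. The sketch covers only the quantile dimension and uses the standard interval propagation for a sum; the subsequent linear projection onto the cdf domain is omitted. The function names and the example intervals are hypothetical.

```python
def add(a, b):
    """Interval addition on (lo, hi) tuples."""
    return (a[0] + b[0], a[1] + b[1])

def sub(a, b):
    """Interval subtraction on (lo, hi) tuples."""
    return (a[0] - b[1], a[1] - b[0])

def intersect(a, b):
    """Intersection of two intervals; fails if it is empty."""
    lo, hi = max(a[0], b[0]), min(a[1], b[1])
    if lo > hi:
        raise ValueError("empty intersection: constraint is inconsistent")
    return (lo, hi)

def prune_sum(x, y, z):
    """One propagation pass for Z = X + Y over real intervals."""
    z = intersect(z, add(x, y))   # Z must lie within X + Y
    x = intersect(x, sub(z, y))   # X must lie within Z - Y
    y = intersect(y, sub(z, x))   # Y must lie within Z - X
    return x, y, z

print(prune_sum((1, 5), (2, 8), (0, 6)))
```

Only the end points of the domains are touched, which is why the operation stays linear in the number of bounds rather than in the number of values enclosed.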

5 Empirical evaluation

The inventory management problem model proposed by [Tarim and Kingsman (2004)] is employed, as a case study, to evaluate the proposed framework. The key idea is to schedule replenishment periods ahead of time and find the optimal order sizes which achieve a minimum total manufacturing cost. A reorder point with order size should meet customer demands up to the next point of replenishment.

Definition 5.1.

An inventory management model defined over a time horizon of cycles is

(3)

The constituents of the total cost in the model are: the setup cost, holding cost and purchase cost. The setup cost is defined by the ordering cost multiplied by the number of times a replenishment takes place. The holding cost depends on the depreciation cost and the level of the inventory observed in a given cycle. The purchase cost is the reorder quantity multiplied by the varying cost/item. From this model, one can observe that all cost components are typically fluctuating and unpredictable especially in the real-life version of the problem. This is due to the unpredictability of customer demands and the variability of the cost/item. Accordingly, this model perfectly fits the purpose of the evaluation: comparing the behavior of the models when the environment is uncertain.
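For certain (point-valued) data, the three cost components described above can be computed as in the following Python sketch. It is a simplified illustration and not the model of Definition 5.1: it assumes fixed scalar costs, charges the holding cost on the stock left at the end of each cycle, and requires every demand to be met. The function name `total_cost`, its parameters and the sample numbers are hypothetical.

```python
def total_cost(order_sizes, demands, ordering_cost, holding_cost, unit_cost,
               initial_stock=0):
    """Sum setup, holding and purchase costs over all cycles."""
    if len(order_sizes) != len(demands):
        raise ValueError("one order size is needed per demand cycle")
    stock = initial_stock
    setup = holding = purchase = 0
    for quantity, demand in zip(order_sizes, demands):
        if quantity > 0:
            setup += ordering_cost            # one setup per replenishment
            purchase += quantity * unit_cost  # reorder quantity times cost/item
        stock += quantity - demand
        if stock < 0:
            raise ValueError("demand not met in this cycle")
        holding += holding_cost * stock       # assumed: charged on end-of-cycle stock
    return setup + holding + purchase

# Two replenishments covering three cycles of demand.
print(total_cost([50, 0, 40], [20, 30, 40],
                 ordering_cost=100, holding_cost=2, unit_cost=5))
```

In the uncertain version of the problem, each of these scalar inputs would be replaced by a domain, so that the resulting cost is itself an interval with probabilistic bounds.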

Information realized in the solution set.

The model is tested for randomly distributed monthly demands over a time horizon cycles. The p-box cdf-interval representation is constructed for each demand observation per cycle and for each observed varying cost component (ordering cost , holding cost per item and varying cost per item ) to guarantee a safe enclosure on the data whereabouts. This is opposed to the fuzzy and probabilistic modeling which is based on the average demand values given in the set . The two latter models set assumptions on the shape of the probability distribution adopted, as pointed out in Section 3. The solver executes the set of addition and equality constraints in the p-box cdf-interval domain. Constraints are triggered until they stabilize and consistency is reached by means of the inference rules defined in Section 4. The solver suggests to replenishment periods, with a total holding cost and a total manufacturing cost . This output is opposed to replenishment periods realized by the fuzzy and the probabilistic models with a total holding cost and and a total manufacturing cost and respectively.

Figure 5: Output solutions for holding cost

Fig. 5 illustrates a comparison between the output holding cost obtained from the models under consideration. The p-box cdf-interval graphical representation of the cost is depicted by the shaded region and their bounds in the convex models are illustrated by the dotted rectangles. Clearly, the solution set obtained from the p-box cdf-intervals model, when compared with the outcome of the convex model, realizes additional knowledge (i.e. tighter bounds in the cdf domain). This solution set is opposed to a single value proposed as by the fuzzy and as by the probabilistic models. The output solution point suggested by the latter models can sometimes mislead the decision making. This is because their distributions are built, from the beginning, on approximating the actual observed distribution.

Model tractability.

We generate random distributions for monthly demands scaling up the problem time horizon for cycles. The first three rows in Table 1 show the real time taken by each model in seconds to generate the output solution of the total cost. Two other measurements, the shared heap used and the control stack used, are taken into consideration in order to study the memory consumption of each model. The shared heap used is the memory allocated to store compiled Prolog code and its related variables and necessary buffers. The control stack used holds backtracking information. Table 1 demonstrates that the memory consumption of the stochastic model grows exponentially when scaling up the problem; it reaches of the memory usage for a time horizon . The p-box cdf-intervals behavior is similar to convex models. Probabilistic and fuzzy models have the best shared heap utilization. Clearly the percentage of the control stack employed in the stochastic model is the highest. This is due to the behavior of the stochastic techniques, which exhaustively build the solution scenarios in order to reach a solution. It is worth noting that convex models and p-box cdf-intervals do not need to build this tree, since the output solution set is provided within an interval range that encapsulates all possible output scenarios.

time horizon stochastic probabilistic fuzzy cdf p-box convex real time (sec) shared heap used control stack used
Table 1: Real-time taken and measurement of memory consumption

Evidently, convex models outperform the rest of the models in terms of speed; p-box cdf-intervals come close, followed by the fuzzy models, then by the probabilistic models. In summary, the p-box cdf-intervals performance is close to that of the convex models. This means that the new framework, with minimal overhead, adds quantifiable information by imposing tighter bounds on the probability distribution, in a safe and tractable manner. The applied computations are tractable because they are exerted on the interval bounds, using interval computations, and the results are then projected, linearly, onto the cdf domain. The empirical evaluation shows that p-box cdf-intervals have a scalability measure that is close to that of convex models.

6 Conclusion and future research direction

This research proposes a novel constraint domain to reason about data with uncertainty. The key idea is to extend convex models with the notion of p-boxes in order to realize additional quantifiable information on the data whereabouts. P-boxes have never been implemented in the CP paradigm, yet they are very good candidates to deal with and reason about uncertainty in the probabilistic paradigm, especially when data follows an unknown distribution. The case study of the inventory management problem demonstrates that p-box cdf-intervals can be practically adopted to intuitively envelop the uncertain data found in different modeling aspects with minimum overhead. Evaluation results show that stochastic CPs and probabilistic models have the slowest performance. Fuzzy models proved to have a better performance and their output solutions are characterized as reliable, i.e. they seek the satisfaction of all possible realizations. Convex models and the p-box cdf-intervals encapsulate all possible distributions of the solution set in a convex representation. The p-box cdf-intervals framework provides a range of quantiles along with bounds on their data whereabouts.

The introduction of a novel framework to reason about data coupled with uncertainty, due to ignorance or based on variability, paves the way to many fruitful research directions. These include: studying models having variables following dependent probability distributions, exploring different search techniques, revisiting the framework within a dynamically changing environment, generalizing the framework to deal with all types of uncertainty by considering together vagueness and dynamicity, and last but not least applying the model to a variety of large scale optimization problems which target real-life engineering and management applications.

References

  • Benhamou and de Vinci (1994) Benhamou, F. and de Vinci, R. 1994. Interval constraint logic programming. Constraint programming: basics and trends: Châtillon Spring School, France.
  • Bistarelli et al. (1999) Bistarelli, S., Montanari, U., Rossi, F., Schiex, T., Verfaillie, G., and Fargier, H. 1999. Semiring-based CSPs and valued CSPs: Frameworks, properties, and comparison. Constraints 4, 3, 199–240.
  • Dubois et al. (1996) Dubois, D., Fargier, H., and Prade, H. 1996. Possibility theory in constraint satisfaction problems: Handling priority, preference and uncertainty. Applied Intelligence 6, 4, 287–309.
  • Dutta et al. (2005) Dutta, P., Chakraborty, D., and Roy, A. 2005. A single-period inventory model with fuzzy random variable demand. Mathematical and Computer Modelling 41, 8, 915–922.
  • ECRC (1994) ECRC. 1994. Eclipse (a) user manual, (b) extensions of the user manual. Tech. rep., ECRC.
  • Fargier et al. (1996) Fargier, H., Lang, J., and Schiex, T. 1996. Mixed constraint satisfaction: A framework for decision problems under incomplete knowledge. In Proceedings of the National Conference on Artificial Intelligence. Citeseer, 175–180.
  • Ferson et al. (2003) Ferson, S., Kreinovich, V., Ginzburg, L., Myers, D., and Sentz, K. 2003. Constructing Probability Boxes and Dempster-Shafer structures. Tech. rep. SAND2002-4015, Sandia National Laboratories.
  • Glen et al. (2004) Glen, A., Leemis, L., and Drew, J. 2004. Computing the distribution of the product of two continuous random variables. Computational Statistics & Data Analysis 44, 3, 451–464.
  • Halpern (2003) Halpern, J. Y. 2003. Reasoning about uncertainty.
  • Holzbaur (1992) Holzbaur, C. 1992. Metastructures vs. attributed variables in the context of extensible unification - applied for the implementation of CLP languages. In 1992 International Symposium on Programming Language Implementation and Logic Programming. Springer Verlag, 260–268.
  • Jaffar and Lassez (1987) Jaffar, J. and Lassez, J.-L. 1987. Constraint logic programming. In Proceedings of the 14th ACM SIGACT-SIGPLAN symposium on Principles of programming languages. ACM, 111–119.
  • Le Huitouze (1990) Le Huitouze, S. 1990. A new data structure for implementing extensions to Prolog. In Programming Language Implementation and Logic Programming. Springer, 136–150.
  • Petrović et al. (1996) Petrović, D., Petrović, R., and Vujošević, M. 1996. Fuzzy models for the newsboy problem. International Journal of Production Economics 45, 1, 435–441.
  • Saad et al. (2010) Saad, A., Gervet, C., and Abdennadher, S. 2010. Constraint Reasoning with Uncertain Data Using CDF-Intervals. Integration of AI and OR Techniques in Constraint Programming for Combinatorial Optimization Problems, 292–306.
  • Saad et al. (2012) Saad, A., Gervet, C., and Fruehwirth, T. 2012. CDF-Intervals Revisited. The Eleventh International Workshop on Constraint Modelling and Reformulation - ModRef2012.
  • Schiex et al. (1995) Schiex, T., Fargier, H., and Verfaillie, G. 1995. Valued constraint satisfaction problems: Hard and easy problems. In International Joint Conference on Artificial Intelligence. Vol. 14. Citeseer, 631–639.
  • Smith and La Poutre (1992) Smith, W. and La Poutre, H. 1992. Approximation of staircases by staircases. Tech. rep., Citeseer.
  • Tarim and Kingsman (2004) Tarim, S. and Kingsman, B. 2004. The stochastic dynamic production/inventory lot-sizing problem with service-level constraints. International Journal of Production Economics 88, 105–119.
  • Tarim et al. (2006) Tarim, S., Manandhar, S., and Walsh, T. 2006. Stochastic constraint programming: A scenario-based approach.
  • Walsh (2002) Walsh, T. 2002. Stochastic constraint programming. Proceedings of the 15th European Conference on Artificial Intelligence, 111–115.