1 Introduction
This research proposes a novel constraint domain for reasoning about data with uncertainty. The work is driven by the practical use of reliable approaches in Constraint Programming (CP). These approaches tackle large-scale constraint optimization (LSCO) problems subject to data uncertainty in an intuitive and tractable manner. Yet they lack knowledge of the data whereabouts, which often indicate the data likelihood or chance of occurrence and which, in turn, can be ill-defined or have a fluctuating nature. It is important to know the source and type of the data whereabouts in order to reason about them. The purpose of this novel framework is to intuitively describe data coupled with uncertainty, or following unknown distributions, without losing any knowledge given in the problem definition. The p-box cdf-intervals extend the cdf-intervals approach [Saad et al. (2010)] with a p-box structure to obtain a safe enclosure. This enclosure envelops the data along with its whereabouts between two distinct quantile values, each issuing a cdf [Saad et al. (2012)]. This research makes the following contributions: (1) a new uncertain data representation specified by p-box cdf-intervals; (2) a constraint reasoning framework used to prune variable domains in a p-box cdf-interval constraint relation in order to ensure their local consistency; (3) an experimental evaluation, using the inventory management problem, that compares the novel framework with existing approaches in terms of expressiveness and tractability. Expressiveness, in this comparison, measures the ability to model the uncertainty provided in the original problem and the impact of this representation on the solution set realized. Tractability measures the system's time performance and scalability. The experimental work shows how this novel domain representation yields more informed results while remaining computationally effective and competitive with previous work.
2 Preliminaries
Models tackling uncertainty are classified under the set of plausibility measures [Halpern (2003)]. They are categorized as possibilistic or probabilistic. Convex models, found in the world of fuzzy and interval/robust programming, are favored when ignorance takes place. They are adopted in the CP paradigm in fuzzy Constraint Satisfaction Problems (CSPs) [Dubois et al. (1996)] and numerical CSPs [Benhamou and de Vinci (1994)]. Probabilistic models are best adopted when the data has a fluctuating nature. They are at the heart of probabilistic CP modeling, found in valued CSPs [Schiex et al. (1995)], semirings [Bistarelli et al. (1999)], stochastic CSPs [Walsh (2002)], scenario-based CSPs [Tarim et al. (2006)] and mixed CSPs [Fargier et al. (1996)]. Techniques adopting convex modeling are characterized as more conservative: they often consider many unnecessary outcomes along with the important ones. This conservative property gives convex modeling a highly tractable and scalable behavior, since operations in these models are exerted on the convex bounds only. Probabilistic approaches, on the other hand, add quantitative information that expresses the likelihood, yet they impose assumptions on the distribution shape in order to deal with it mathematically. Moreover, probabilistic computations are very expensive because they often depend on the nonlinear probability shape. The research objective is to introduce a novel framework: the p-box cdf-intervals. It is based on a probability box (p-box) structure [Ferson et al. (2003)] that envelops a set of cumulative distribution functions (cdfs). The p-box concept is adopted in the literature, specifically when the environment is uncertain, to represent an unknown distribution with a safe enclosure rather than depending on a statistical approximation. A cdf is a monotone (non-decreasing) function that indicates, for a given uncertain value, the probability that the actual data lies at or below it. It defines the aggregated probability of a value to occur.
The p-box bounding cdf distributions in the proposed framework are uniform, each represented by a line equation in order to maintain an inexpensive computational complexity. The key idea behind the construction of the p-box cdf-intervals is to combine techniques from convex models, to take advantage of their tractability, with approaches revealing quantifiable information from the probabilistic and stochastic world, to take advantage of their expressiveness. The framework is based on CP concepts because they have proven to offer considerable flexibility in formulating real-world combinatorial problems. The CP paradigm aims at building descriptive algebraic structures which are easily embedded into declarative programming languages. These structures are heavily used in problem-solving environments by specifying conditions that need to be satisfied, allowing the solver to search for feasible solutions. The following section demonstrates how to intuitively represent the uncertainty already given in the problem definition in order to reason about it by means of the p-box cdf-intervals. A comparison between this novel representation of data uncertainty and existing possibilistic and probabilistic approaches demonstrates the model's expressiveness. The representation is input to the solver as a new domain specification, and the reasoning about this new specification is then defined, showing that reasoning by means of p-box cdf-intervals is tractable. Accordingly, combining reasoning techniques from convex models with quantifiable information from probabilistic models yields a model that is at once tractable and expressive.
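Since each bounding cdf is uniform, it can be pictured as a clamped straight line in the (quantile, cdf) plane. The following is a minimal illustrative sketch in Python (not the ECLiPSe implementation described later); all names and the sample slopes are our own assumptions:

```python
def uniform_cdf(quantile, cdf_value, slope):
    """Build a uniform (linear) cdf line passing through the point
    (quantile, cdf_value) with the given slope, clamped to [0, 1]."""
    def F(x):
        return min(1.0, max(0.0, cdf_value + slope * (x - quantile)))
    return F

# A p-box is then a pair of such lines sandwiching every admissible cdf:
# every candidate distribution F satisfies lower(x) <= F(x) <= upper(x).
lower = uniform_cdf(10.0, 0.0, 0.05)
upper = uniform_cdf(4.0, 0.0, 0.1)
```

Because only two line parameters per bound are stored, evaluating or intersecting the bounds stays a constant-time operation, which is the source of the tractability claimed above.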
3 Input Data Representation
Quantifiable information is often available during the data collection process, but it is lost during the reasoning because it is not accounted for in the representation of the uncertain data. This information, however, is crucial to the reasoning process, and the lack of its interpretation yields erroneous conclusions because it is absent from the produced solution set. It is always necessary to quantify the uncertainty naturally given in the problem definition in order to obtain robust and reliable solutions.
Example 3.1.
Consider, as a running example, the varying cost observations of a steel stud manufacturing item. Fig. 1(a) illustrates the cost variations along with their corresponding frequencies of occurrence. For instance, the point is the amount of the cost/item, equal to , observed times during the past years (corresponding to a population ). is the number of distinct measured quantiles. The minimum and the maximum observed values in this example are and respectively.
To compute the probabilistic/possibilistic representations, the average and standard deviation of the observed population are derived. In this example, they are equal to and respectively. The nearest Normal probability distribution and the fuzzy membership function are illustrated in Fig. 1 (b) and (c).
Fig. 1. Varying cost of the steel stud item and its probability histogram: (a) genuine observations; (b) Normal distribution; (c) fuzzy distribution.
To compare the data representations adopted in the various approaches, the observed data is projected onto the cdf-domain. By definition, the cdf is a monotonic distribution that keeps the probabilistic information in an aggregated manner. Information obtained from the measurement process is often discrete and incomplete; hence, its cdf-domain projection forms a staircase shape [Smith and La Poutre (1992)]. The cdf distribution of the genuinely observed data whereabouts is depicted in the running example by the dotted staircase shape in Fig. 2. The Normal and fuzzy cdf distributions are shown by the continuous red curves in Fig. 2 (b) and (c). Each is based on an approximation that lacks precise point fitting of the original data whereabouts. Similarly, the cdf-interval, in Fig. 2 (d), approximates the data whereabouts by means of a line connecting the two bounding data values. The convex model representation, however, shapes a rectangle, illustrated in Fig. 2 (e). This rectangle includes all values in the cdf range . The convex representation treats data values lying within the interval bounds equally, i.e. it lacks the probabilistic information. The p-box cdf-interval, depicted in Fig. 2 (f), enforces tighter bounds on the probabilities than the convex model illustrated in Fig. 2 (e). This envelopment guarantees a safe enclosure on the unknown distribution while preserving tractability, since each of its bounds is represented by a line equation.
Interpretation of the p-box cdf confidence interval.
For a given interval of points specified by , and are the extreme points which bound the p-box cdf-interval. One can see that this interval approach does not aim at approximating the curve but rather at enclosing it in a reliable manner. The complete envelopment is exerted by means of the uniform cdf-bounds, depicted by the red curves in Fig. 2 (f). It is impossible to find a point that exists outside the formed interval bounds. The cdf bounds are chosen to have a uniform distribution; each is represented by a line with a slope issued from one of the extreme quantiles. Storing the full information of each bound is sufficient to restore the designated interval assignment. Bounds are denoted by triplet points in the 2D space to preserve the full information on: the extreme quantile value observed; the cdf-line issued from this observed value; and the degree of steepness formed by this line. The slope of the uniform cdf-distribution indicates how the probabilistic values accumulate for successive quantiles along the line. Accordingly, the p-box cdf-interval point representation is and .
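The triplet encoding of each bound can be sketched as a small record type. This is an illustrative Python rendering, not the solver's actual data structure, and the field names (and which triplet is taken as `lo` versus `hi`) are our own assumptions:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CdfTriplet:
    quantile: float  # extreme quantile value observed
    cdf: float       # aggregated probability reached at that quantile
    slope: float     # steepness of the uniform cdf line issued from it

@dataclass(frozen=True)
class PboxCdfInterval:
    lo: CdfTriplet   # bound issued from one extreme quantile
    hi: CdfTriplet   # bound issued from the other extreme quantile
```

Storing two triplets (six numbers) is enough to restore the whole interval assignment, which is what keeps the domain representation compact.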
Definition 3.1.
is the slope of a given cdf-distribution; it signifies the average step probabilistic value. For a given uniform cdf-distribution
(1) 
Plotting a point within the p-box cdf-interval yields bounds on its possible chances of occurrence.
Definition 3.2.
is the interval of values obtained when is projected onto the p-box cdf bounds. For a point denoted as
(2) 
and are the possible maximum and minimum cdf values can take; both are computed by projecting the point onto the cdf distributions passing through the real points and respectively. They are derived using the following linear projections, computed in constant time:
The equations above guarantee the probabilistic feature of the cdf-function by restricting its aggregated value from exceeding the value 1 and from taking negative values below 0.
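The projection of Definition 3.2 amounts to two O(1) line evaluations, each clamped to the probabilistic range [0, 1]. A sketch in Python, with triplets as (quantile, cdf, slope) tuples and all names our own:

```python
def project(x, triplet):
    """Project quantile x onto one uniform cdf line, clamped to [0, 1]."""
    q, f, s = triplet
    return min(1.0, max(0.0, f + s * (x - q)))

def cdf_bounds(x, bound_a, bound_b):
    """Interval of cdf values the point x can take inside the p-box,
    obtained by projecting x onto both bounding cdf lines."""
    p1, p2 = project(x, bound_a), project(x, bound_b)
    return min(p1, p2), max(p1, p2)
```

The clamping step is what enforces the probabilistic feature mentioned above: no projected value escapes [0, 1] even when the line equation would.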
Example 3.2.
is the p-box cdf-interval of the cost/item in Example 3.1. Suppose that ; its cdf-bound values are . This means that the possible chance of the value being at most is between and , with an average step probabilistic value between and . Note that this interval is opposed to a single approximated value in the cdf-intervals representation proposed in [Saad et al. (2010)]; the fuzzy cdf value is and the Normal cdf value is . Note also that convex models do not enforce any probabilistic bounds; accordingly, has a cdf .
4 Constraint Reasoning
In the CP paradigm, relations between variables are specified as constraints. A set of rules and algebraic semantics, defined over the list of constraints, formalizes the reasoning about the problem. As a fundamental language component in Constraint Logic Programming (CLP), this set of rules, with a syntax of definite clauses, forms the language scheme [Jaffar and Lassez (1987)]. The constraint solving scheme is intuitively and efficiently utilized in the reasoning over the computation domain. The scheme formally attempts to assign to variables a suitable domain of discourse equipped with an equality theory together with least and greatest models of fixpoint semantics. Starting from an initial state, the reasoning scheme follows a local consistency technique which attempts to constrain each variable over the p-box cdf-interval domain while excluding values which do not belong to the feasible solution. An implementation of the constraint system was established as a separate module in the ECLiPSe constraint programming environment [ECRC (1994)]. ECLiPSe provides two major components to build the solver: an attributed variable data structure and a suspension handling mechanism. Fundamentally, attributed variables are specific data structures which attach more than one data type; together they permit a new definition of unification which extends the well-known Prolog unification [Le Huitouze (1990), Holzbaur (1992)]. A p-box cdf-interval point is implemented in an attributed variable data structure with three main components: quantile, cdf value and slope. Constraint suspension handling, in turn, is a highly flexible mechanism that aims at controlling user-defined atomic goals; this is achieved by waiting for user-defined conditions to trigger specific goals. The rules implemented in our solver infer the local consistency in the p-box cdf-interval domains of the binary equality and ordering constraints , and that of the ternary arithmetic constraints . Operations in the solver are exerted first as real interval computations, and then they are projected onto the cdf domain using a linear computation, as shown in Definition 3.2.
This section demonstrates how the ordering and the ternary addition constraints infer the local consistency over the variable domains of , , and assuming that their initial bindings are , and respectively. The ternary multiplication, subtraction and division constraints are implemented in the same way.
Ordering constraint
. To infer the local consistency of the binary ordering constraint, the lower cdf-bound of is extended and the upper cdf-bound of is contracted.
Example 4.1.
Let and be two p-box cdf-interval domains, with and . Applying the set of constraints and prunes the domain of . As a result, the variable is bounded from below by the lower bound of and from above by the upper bound of , as shown in Fig. 3 (a). Clearly the obtained domain of , in this example, preserves the convex property of the p-box cdf-intervals. Now let be subject to domain pruning using the set of constraints and . As a result, should be bounded by the lower bound of and the upper bound of . However, in this case, at lower quantiles , the upper bound distribution of precedes the lower bound of . This conflicts with the stochastic dominance property of a p-box cdf-interval domain. To resolve the conflict, the real bounds of are further pruned to the point of the probability intersection .
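The bound contraction and extension of Example 4.1 can be sketched over (quantile, cdf) bound pairs. This toy Python version covers only the first case and deliberately omits the intersection repair applied when stochastic dominance is violated; domain layout and names are our own assumptions:

```python
def prune_leq(x_dom, y_dom):
    """Enforce X <= Y on domains given as ((q, f) lower, (q, f) upper) pairs."""
    (xql, xfl), (xqh, xfh) = x_dom
    (yql, yfl), (yqh, yfh) = y_dom
    # contract X's upper bound by Y's upper bound, in both dimensions
    x_new = ((xql, xfl), (min(xqh, yqh), min(xfh, yfh)))
    # extend Y's lower bound by X's lower bound, in both dimensions
    y_new = ((max(xql, yql), max(xfl, yfl)), (yqh, yfh))
    return x_new, y_new
```

A full implementation would additionally detect the dominance conflict described above and prune the real bounds back to the probability intersection point.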
Ternary addition constraints
. The addition operation is implemented by summing up pairs of points, defined in the 2D space and located within the p-box cdf-interval bounds which enclose the domain ranges of and . This addition operation is linear; it is convex and can be computed from the end points of the domains involved in the addition. The p-box cdf-domain of is updated to envelop all points defined in that range.
Example 4.2.
Fig. 4 depicts the execution steps of the p-box cdf ternary addition inference rule, exerted on the variable domains involved in the relation . Observe that domain pruning is performed in a two-dimensional manner: quantile and cdf. The addition of the two variables and is performed on the bounds of their predefined domains and then projected onto the initial bindings. The first row in Fig. 4 shows the output domains from the addition , and . Domain operations are exerted on the extreme points. The second row illustrates the intersection of the output domains with the initial bindings assigned to , and . The domains obtained from the ternary addition operation are , and . Clearly, in this example, pruning the real quantile bounds is identical to that of real-interval domains, and since the output domains preserve the stochastic dominance property no further pruning takes place.
The ternary addition constraint exerted on p-box cdf-interval domains is a simple addition computation, since it adopts real-interval arithmetic whose results are then projected, linearly, onto the cdf domain. This operation is opposed to the fuzzy extended addition operation adopted in the constraint reasoning utilized in the possibilistic domain [Dutta et al. (2005), Petrović et al. (1996)], and to the Normal probabilistic addition, which has a high computational complexity due to the Normal distribution shape [Glen et al. (2004)].
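On the quantile dimension, the ternary addition reduces to classical interval arithmetic; a minimal sketch (the cdf dimension would then be recovered by the linear projection of Definition 3.2, and the function name is our own):

```python
def prune_add(x, y, z):
    """Enforce Z = X + Y on real quantile intervals given as (lo, hi)."""
    # forward pruning: Z must lie inside X + Y
    z_new = (max(z[0], x[0] + y[0]), min(z[1], x[1] + y[1]))
    # inverse projections: X inside Z - Y, Y inside Z - X
    x_new = (max(x[0], z_new[0] - y[1]), min(x[1], z_new[1] - y[0]))
    y_new = (max(y[0], z_new[0] - x[1]), min(y[1], z_new[1] - x[0]))
    return x_new, y_new, z_new
```

Because all three projections touch only the interval end points, the cost per propagation step stays constant, which is the tractability argument made in the text.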
5 Empirical Evaluation
The inventory management problem model proposed by [Tarim and Kingsman (2004)] is employed as a case study to evaluate the proposed framework. The key idea is to schedule replenishment periods ahead of time and find the optimal order sizes which achieve a minimum total manufacturing cost. A reorder point with order size should meet customer demands up to the next point of replenishment.
Definition 5.1.
An inventory management model defined over a time horizon of cycles is
(3) 
The constituents of the total cost in the model are the setup cost, the holding cost and the purchase cost. The setup cost is the ordering cost multiplied by the number of times a replenishment takes place. The holding cost depends on the depreciation cost and the level of the inventory observed in a given cycle. The purchase cost is the reorder quantity multiplied by the varying cost/item. From this model, one can observe that all cost components are typically fluctuating and unpredictable, especially in the real-life version of the problem, owing to the unpredictability of customer demands and the variability of the cost/item. Accordingly, this model perfectly fits the purpose of the evaluation: comparing the behavior of the models when the environment is uncertain.
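The cost structure described above suggests the following rendering of model (3). The notation here is our own assumption, not the paper's: $\delta_t$ is the binary replenishment indicator, $X_t$ the order size, $I_t$ the inventory level, $d_t$ the demand in cycle $t$, $a$, $h$, $v$ the ordering, holding and item costs, and $M$ a large constant linking orders to setups:

```latex
\min \;\; \mathbb{E}\!\left[\sum_{t=1}^{N} \bigl( a\,\delta_t + h\,I_t + v\,X_t \bigr)\right]
\quad \text{s.t.} \quad
I_t = I_{t-1} + X_t - d_t, \qquad
X_t \le M\,\delta_t, \qquad
I_t \ge 0, \;\; \delta_t \in \{0,1\}.
```

In the p-box cdf-interval model the uncertain quantities $d_t$, $a$, $h$ and $v$ become interval variables rather than expectations, so the same balance constraints propagate enclosures instead of point values.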
Information realized in the solution set.
The model is tested for randomly distributed monthly demands over a time horizon of cycles. The p-box cdf-interval representation is constructed for each demand observation per cycle and for each observed varying cost component (ordering cost , holding cost per item and varying cost per item ) to guarantee a safe enclosure on the data whereabouts. This is opposed to the fuzzy and probabilistic modeling, which is based on the average demand values given in the set . The two latter models set assumptions on the shape of the probability distribution adopted, as pointed out in Section 3. The solver executes the set of addition and equality constraints in the p-box cdf-interval domain. Constraints are triggered until they stabilize and consistency is reached by means of the inference rules defined in Section 4. The solver suggests replenishment periods, with a total holding cost and a total manufacturing cost . This output is opposed to the replenishment periods realized by the fuzzy and the probabilistic models, with total holding costs and and total manufacturing costs and respectively.
Fig. 5 illustrates a comparison between the output holding costs obtained from the models under consideration. The p-box cdf-interval graphical representation of the cost is depicted by the shaded region, and the bounds in the convex models are illustrated by the dotted rectangles. Clearly, the solution set obtained from the p-box cdf-intervals model, when compared with the outcome of the convex model, realizes additional knowledge (i.e. tighter bounds in the cdf domain). This solution set is opposed to a single value, proposed as by the fuzzy model and as by the probabilistic model. The single output point suggested by the latter models can sometimes mislead or bias the decision making, because their distributions are built, from the beginning, on approximating the actual observed distribution.
Model tractability.
We generate random distributions for the monthly demands, scaling up the problem time horizon to cycles. The first three rows in Table 1 show the real time taken by each model, in seconds, to generate the output solution of the total cost. Two other measurements, the shared heap used and the control stack used, are taken into consideration in order to study the memory consumption of each model. The shared heap is the memory allocated to store compiled Prolog code together with its related variables and necessary buffers. The control stack is utilized to hold backtracking information. Table 1 demonstrates that the stochastic model's memory consumption grows exponentially when scaling up the problem; it reaches of the memory usage for a time horizon . The p-box cdf-intervals' behavior is similar to that of the convex models. The probabilistic and fuzzy models have the best shared heap utilization. Clearly the percentage of the control stack employed in the stochastic model is the highest. This is due to the behavior of the stochastic techniques, which exhaustively build the solution scenarios in order to reach a solution. It is worth noting that convex models and p-box cdf-intervals do not need to build this tree, since the output solution set is provided within an interval range encapsulating all possible output scenarios.
Evidently, convex models outperform the rest of the models in terms of speed; p-box cdf-intervals come close behind, followed by the fuzzy models and then by the probabilistic models. In summary, the p-box cdf-intervals' performance is close to that of the convex models. This means that the new framework, with minimal overhead, adds quantifiable information by imposing tighter bounds on the probability distribution, in a safe and tractable manner. The applied computations are tractable because they are exerted on the interval bounds, using interval computations, and the results are then projected, linearly, onto the cdf domain. The empirical evaluations show that p-box cdf-intervals have a scalability measure close to that of convex models.
6 Conclusion and Future Research Directions
This research proposes a novel constraint domain to reason about data with uncertainty. The key idea is to extend convex models with the notion of p-boxes in order to realize additional quantifiable information on the data whereabouts. P-boxes had never been implemented in the CP paradigm, yet they are very good candidates for dealing with and reasoning about uncertainty in the probabilistic paradigm, especially when the data shapes an unknown distribution. The case study of the inventory management problem demonstrates that p-box cdf-intervals can be practically adopted to intuitively envelop the uncertain data found in different modeling aspects with minimum overhead. Evaluation results show that stochastic CSPs and probabilistic models have the slowest performance. Fuzzy models prove to have a better performance, and their output solutions are characterized as reliable, i.e. they seek the satisfaction of all possible realizations. Convex models and the p-box cdf-intervals encapsulate all possible distributions of the solution set in a convex representation. The p-box cdf-intervals framework additionally provides a range of quantiles along with bounds on their data whereabouts.
The introduction of a novel framework to reason about data coupled with uncertainty, whether due to ignorance or to variability, paves the way to many fruitful research directions: studying models whose variables follow dependent probability distributions, exploring different search techniques, revisiting the framework within a dynamically changing environment, generalizing the framework to deal with all types of uncertainty by considering vagueness and dynamicity together, and, last but not least, applying the model to a variety of large-scale optimization problems targeting real-life engineering and management applications.
References
 Benhamou and de Vinci (1994) Benhamou, F. and de Vinci, R. 1994. Interval constraint logic programming. Constraint programming: basics and trends: Châtillon Spring School, France.
 Bistarelli et al. (1999) Bistarelli, S., Montanari, U., Rossi, F., Schiex, T., Verfaillie, G., and Fargier, H. 1999. Semiring-based CSPs and valued CSPs: Frameworks, properties, and comparison. Constraints 4, 3, 199–240.
 Dubois et al. (1996) Dubois, D., Fargier, H., and Prade, H. 1996. Possibility theory in constraint satisfaction problems: Handling priority, preference and uncertainty. Applied Intelligence 6, 4, 287–309.

 Dutta et al. (2005) Dutta, P., Chakraborty, D., and Roy, A. 2005. A single-period inventory model with fuzzy random variable demand. Mathematical and Computer Modelling 41, 8, 915–922.
 ECRC (1994) ECRC. 1994. ECLiPSe (a) user manual, (b) extensions of the user manual. Tech. rep., ECRC.

 Fargier et al. (1996) Fargier, H., Lang, J., and Schiex, T. 1996. Mixed constraint satisfaction: A framework for decision problems under incomplete knowledge. In Proceedings of the National Conference on Artificial Intelligence. Citeseer, 175–180.
 Ferson et al. (2003) Ferson, S., Kreinovich, V., Ginzburg, L., Myers, D., and Sentz, K. 2003. Constructing Probability Boxes and Dempster-Shafer Structures. Tech. rep. SAND2002-4015, Sandia National Laboratories.

 Glen et al. (2004) Glen, A., Leemis, L., and Drew, J. 2004. Computing the distribution of the product of two continuous random variables. Computational Statistics & Data Analysis 44, 3, 451–464.
 Halpern (2003) Halpern, J. Y. 2003. Reasoning about Uncertainty. MIT Press.
 Holzbaur (1992) Holzbaur, C. 1992. Metastructures vs. attributed variables in the context of extensible unification applied for the implementation of CLP languages. In 1992 International Symposium on Programming Language Implementation and Logic Programming. Springer Verlag, 260–268.
 Jaffar and Lassez (1987) Jaffar, J. and Lassez, J.-L. 1987. Constraint logic programming. In Proceedings of the 14th ACM SIGACT-SIGPLAN Symposium on Principles of Programming Languages. ACM, 111–119.
 Le Huitouze (1990) Le Huitouze, S. 1990. A new data structure for implementing extensions to Prolog. In Programming Language Implementation and Logic Programming. Springer, 136–150.
 Petrović et al. (1996) Petrović, D., Petrović, R., and Vujošević, M. 1996. Fuzzy models for the newsboy problem. International Journal of Production Economics 45, 1, 435–441.

 Saad et al. (2010) Saad, A., Gervet, C., and Abdennadher, S. 2010. Constraint reasoning with uncertain data using CDF-intervals. In Integration of AI and OR Techniques in Constraint Programming for Combinatorial Optimization Problems. 292–306.
 Saad et al. (2012) Saad, A., Gervet, C., and Fruehwirth, T. 2012. CDF-intervals revisited. In The Eleventh International Workshop on Constraint Modelling and Reformulation (ModRef 2012).
 Schiex et al. (1995) Schiex, T., Fargier, H., and Verfaillie, G. 1995. Valued constraint satisfaction problems: Hard and easy problems. In International Joint Conference on Artificial Intelligence. Vol. 14. Citeseer, 631–639.
 Smith and La Poutre (1992) Smith, W. and La Poutre, H. 1992. Approximation of staircases by staircases. Tech. rep., Citeseer.
 Tarim and Kingsman (2004) Tarim, S. and Kingsman, B. 2004. The stochastic dynamic production/inventory lotsizing problem with servicelevel constraints. International Journal of Production Economics 88, 105–119.
 Tarim et al. (2006) Tarim, S., Manandhar, S., and Walsh, T. 2006. Stochastic constraint programming: A scenario-based approach.
 Walsh (2002) Walsh, T. 2002. Stochastic constraint programming. In Proceedings of the 15th European Conference on Artificial Intelligence. 111–115.