 # Stochastic Constraint Programming as Reinforcement Learning

Stochastic Constraint Programming (SCP) is an extension of Constraint Programming (CP) used for modelling and solving problems involving constraints and uncertainty. SCP inherits excellent modelling abilities and filtering algorithms from CP, but so far it has not been applied to large problems. Reinforcement Learning (RL) extends Dynamic Programming to large stochastic problems, but is problem-specific and has no generic solvers. We propose a hybrid combining the scalability of RL with the modelling and constraint filtering methods of CP. We implement a prototype in a CP system and demonstrate its usefulness on SCP problems.

## Authors

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Stochastic Constraint Programming (SCP) is an extension of Constraint Programming (CP) designed to model and solve complex problems involving uncertainty and probability, a direction of research first proposed in

[2, 22]. SCP problems are in a higher complexity class than CP problems and can be much harder to solve, but many real-world problems contain elements of uncertainty so this is an important class of problems. They are traditionally tackled by Stochastic Programming , but a motivation for SCP is that it should be able to exploit the richer choice of variables and constraints used in CP, leading to more compact models and the use of powerful filtering algorithms.

However, so far SCP has not been applied to very large problems. If a problem has many decision variables we can apply metaheuristics [11, 12, 14, 22] but we must still check all scenarios to obtain an exact solution, though in special cases a subset is sufficient . If we are content with an approximate solution we can apply scenario reduction by sampling  or approximation 

, but scenario reduction methods can be nontrivial to analyse and apply. Confidence intervals can be applied to control approximations

 but this does not address the issue of scaling up to a huge number of scenarios. In summary, to solve large real-world problems via SCP one must think carefully about scenario reduction, and to do so can require significant mathematical expertise. Moreover, the number of scenarios required might turn out to be unmanageable.

In contrast, many large stochastic and adversarial problems have been successfully solved by methods from Reinforcement Learning (RL) , which is related to Neuro-Dynamic Programming  and Approximate Dynamic Programming . RL algorithms are designed for problems in which rewards may be delayed, so that the consequences of making a decision are not known until a later time. RL algorithms such as SARSA and Q-Learning can be used to find high-quality solutions to large-scale problems. In RL researchers are less concerned with sample sizes, confidence intervals or other statistical issues. Typically they model their problem, choose an RL algorithm and tune it to their application. These methods have been successfully applied to problems in robotics, control, game playing, trading and human-computer interfaces, for example. Perhaps most famously, RL was used to learn how to play the game of Backgammon  by trial-and-error self-play and without human intervention, leading to a world-class player. Related methods developed in Operations Research to handle exponentially-many actions are able to handle far larger problems, for example the scheduling of tens of thousands of trucks . Such applications show that the solutions found by RL can be good enough for practical purposes.

Such applications are far beyond the scope of current SCP techniques. Our aim is to boost the scalability of SCP so that it can tackle similar problems to RL, while retaining its modelling power and constraint filtering techniques. From the RL point of view, this is of interest because it provides a generic RL solver for a significant class of problems, which uses constraint filtering to reduce the size of the state space. Section 2 provides background on SCP, Section 3 describes our method, Section 4 presents experimental results using an implementation in a CP system, and Section 5 draws conclusions and discusses future work.

## 2 Stochastic Constraint Programming

An -stage SCSP is defined as a tuple where is a set of decision variables, a set of stochastic variables, a function mapping each element of to a domain of values, a function mapping each variable in

a set of constraints on , a function mapping each constraint in to a threshold value , and a list of decision stages such that the partition and the partition . Each constraint must contain at least one variable, a constraint with threshold is a hard constraint, and one with is a chance constraint. To solve an -stage SCSP an assignment to the variables in must be found such that, given random values for , assignments can be found for such that, given random values for assignments can be found for such that, given random values for , the hard constraints are each satisfied and the chance constraints (containing both decision and stochastic variables) are satisfied in the specified fraction of all possible scenarios (set of values for the stochastic variables).

An SCSP solution is a policy tree of decisions, in which each node represents a value chosen for a decision variable, and each arc from a node represents the value assigned to a stochastic variable. Each path in the tree represents a different possible scenario and the values assigned to decision variables in that scenario. A satisfying policy tree is a policy tree in which each chance constraint is satisfied with respect to the tree. A chance constraint is satisfied with respect to a policy tree if it is satisfied under some fraction of all possible paths in the tree.

## 3 SCP as RL

In this section we describe our hybrid approach to solving SCP problems.

### 3.1 Reinforcement Learning

RL is an area of machine learning with roots in dynamic programming, Monte Carlo methods, optimal control and behavioural psychology. It is one of the three main classes of machine learning, the other two being supervised and unsupervised learning. RL involves the interaction between a decision-making

agent and its environment. The agent seeks to optimise an expected total reward under uncertainty about its environment. The agent can take actions which may affect the future state of the environment, which in turn may affect the agent’s later options.

Rewards might be random, which is why the agent maximises their expectation. Rewards may also be delayed in time, so that choosing actions involves taking into account their later consequences. For example when playing a game the only reward might occur at the end of the game: 1 for a win and 0 for a loss. Thus the agent must learn how to react to any possible game state in order to maximise its probability of a win.

Any state might have an associated reward. The agent must learn a policy

(a function from states to actions) that maximises the total expected reward, under the assumption that it follows an optimal path. To do this it estimates the total expected reward starting from each state, typically storing the estimates in a table of

state values (or in some algorithms state-action values). The agent learns these estimates by performing a large number of Monte Carlo-style simulations called episodes, and updating the values at each state encountered.

### 3.2 Modelling

We model an SCP problem as an RL problem as shown in Table 1. In this approach we can benefit from constraint filtering methods: stronger filtering restricts our choice of actions (domain values), so we avoid visiting more states, which may enable a simpler state aggregation method to emulate a more complex one. But it is possible to reach a dead-end state in which no actions remain, because of SCP domain wipe-out

. We need RL to learn to take decisions that will avoid dead-ends, so we reward each (decision or random) variable assignment with a constant

, which will typically be greater than the greatest possible objective value. Instead we could relax sufficient constraints to prevent dead-ends then penalise any violations, but this loses the advantage of constraint filtering.

From the RL point of view, the CP solver is now part of the policy. For example, suppose we have an SCP problem containing an alldifferent global constraint such as that in , and we solve it, obtaining a policy. If we then try to use that policy to choose a sequence of actions, but with a different CP solver that implements alldifferent as a set of pairwise disequality constraints, the different level of filtering leads to a different state space, we will follow a different policy, and the results will be unpredictable. We must therefore use exactly the same CP solver when finding a policy and using it.

This framework can handle a single chance constraint, plus any number of hard constraints, by attempting to maximise the probability that the constraint is satisfied: if this is greater than the threshold then we have a solution. However, it does not handle multiple chance constraints. It might be possible to extend it to chance constraints but we do not see this as vital. Most SP problems do not use chance constraints, instead penalising constraint violations and minimising the total penalty as part of the objective. For a recent discussion on penalty functions versus chance constraints see .

A potential problem with this scheme is that domain values for a random variable might be filtered because of earlier decision variable assignments, or assigning a value to a random variable might fail because of domain wipe-out. This would artificially rule out some scenarios and make the solver incorrect, but it can be avoided by a cheap runtime check: on encountering a random variable, check that it still has its original domain; and after selecting a domain value, check that the assignment is successful. If the check fails, the probe halts at the random variable.

### 3.3 Solving

The RL algorithm we shall use is a form of tabular TD(0)  with a reward computed at the end of each episode. However, many problems have far too many possible states to use RL in tabular form. To extend RL to cope with such problems researchers have applied function approximation, also referred to as state generalisation or state aggregation. This is key to the success of RL on real-world problems and we shall use it below. To apply our algorithm a user must provide an SCP model, including a real-valued function on total assignments defining a reward.

## 4 Experiments

We now perform experiments to evaluate our approach, which we refer to as TDCP, on stochastic problems. It is implemented in the Eclipse constraint logic programming system

 and all experiments are performed on a 2.8 GHz Pentium 4 with 512 MB RAM.

### 4.1 An artificial single-stage problem

As a first experiment we design an artificial single-stage problem with known optimum solution. The problem has decision variables and random variables all with domain . We post an alldifferent constraint on the decision variables: there are variables with values, so the solution must be a permutation of . All random variable domains have the same uniform probability distribution: each value has probability . The objective is to maximise the sum of the probabilities that each decision variable is no greater than each random variable : see Figure 1, where is 1 if condition is true and 0 if it is false. The sum of the reified terms (without expectation) is the TDL reward. The optimal solution is known to be with objective value . There are scenarios so this problem cannot be solved by SCP methods without some form of scenario reduction.

To handle the exponentially large number of states we use a form of state aggregation based on Zobrist hashing  with hash table entries for some integer , which works as follows. To each (decision or random) variable-value pair we assign a random integer which remains fixed. At any point during an episode we have some set of assignments , and we take the exclusive-or of the values associated with these assignments:

 XS=⨁⟨v,i⟩∈Srvi

Finally, we use as an index to an array with entries. So at any point during an episode we are in an RL state with variables assigned, and we use array element as the state value estimate. If is sufficiently large then hash collisions are unlikely, and we will have a unique array element for each state encountered. In practice some hash collisions will occur, leading to multiple states sharing value estimates, and less exact results. Nevertheless, we shall show empirically that good results can be obtained. Our hash-based state aggregation can also be applied to other single-stage SCP problems, or multi-stage problems in which recourse actions are computed by an algorithm other than RL (as in the problem of Section 4.2). However, we do not expect it to be successful on all multi-stage problems.

The scatter plot in Figure 2 shows the results for using . For several numbers of episodes (all far less than the full ten billion) we run TD(0) ten times with different random seeds. The graph shows that as more episodes are used for learning, the estimated objective function value converges to the known optimum value.

### 4.2 Pre-disaster planning

In this section we tackle a pre-disaster planning problem introduced by Peeta et al.  who solved it approximately using a Monte Carlo method with function approximation. The six problem instances were later solved exactly in . A detailed description of the problem can be found in those papers, and here we state only its main features.

This is a two-stage problem in which the recourse action is determined by solving a shortest path problem. The first stage has 30 binary decision variables representing investment in links of a transportation network, and 30 binary random variables representing the survival or failure of those links in a hypothetical earthquake, according to given survival probabilities. The probabilities are assumed to be independent of each other, but they depend on the investment decisions: decision-dependent probabilities are a non-standard feature of SP and SCP called endogenous uncertainty. (This can make problems harder to solve by some methods, but not for a simulation-based approach such as ours.) We can choose to invest in any subset of the links subject to a budget constraint, and three alternative budget levels are chosen. The objective is to minimise the expected total path length between five pairs of nodes in the network, where a penalty is imposed when no path exists between a pair of nodes. Two alternative penalty schemes are used, which we shall refer to as low-M and high-M, giving a total of six problem instances.

An SCP model is shown in Figure 3. For each link (where is the set of links in the network) we define a binary decision variable which is 1 if we invest in that link and 0 otherwise. We define a binary stochastic variable which is 1 if link survives and 0 if it fails. We define a single second-stage decision variable to be computed by a shortest-path algorithm. Following Peeta et al. we denote the survival (non-failure) probability of link by without investment and with, the investment required for link by , the length of link by , the budget by , and the penalty for no path from source to sink by . is a global constraint that constructs a representation of the graph from the values, uses Dijkstra’s algorithm to find a shortest path between source and sink, and computes its length ; if source and sink are unconnected then . We implemented this constraint via an Eclipse suspended goal whose execution is delayed until the second stage. To model failure probabilities we define real auxiliary decision variables . The are constrained to be if link is invested in () and otherwise. Because they are auxiliary variables and functionally dependent on the we do not include them in the stage structure.

The problem is hard to solve exactly by standard SP and SCP methods, partly because of its endogenous uncertainty, but mainly because it has approximately a billion () scenarios. Peeta et al. therefore used function approximation and Monte Carlo simulation to find good solutions. However, a symmetry-based technique called scenario bundling was later applied to find exact solutions .

This is an ideal test problem for our approach: it is an interesting stochastic optimisation problem based on real-world data; it is a large, hard problem (unless we use scenario bundling); unusually for such a problem we know the exact answer (via scenario bundling); again unusually we can exactly evaluate new solutions (via scenario bundling); and we can compare our approximate results with those found by another approximate approach (that of Peeta et al.). We again use our Zobrist hashing technique from Section 4.1. Though this problem is two-stage because it has recourse actions, in a sense it is only a one-stage problem because the recourse actions are computed by a shortest path algorithm: RL need not learn how to react to different scenarios. This makes the problem appropriate for our hashing technique. In a true multi-stage problem TDCP will never learn how to react to any given scenario because it is unlikely ever to encounter the same scenario twice.

In experiments we found quite different solution quality in different runs, so for each problem instance we performed ten runs of TDCP and report the best results in Table 2. We show the optimum objective values from , the exact evaluation of the plans found by Peeta et al., the TDCP plans (a list of the links invested in), their TDCP-estimated objective values and their exact values. Each run of Peeta et al. took approximately 380 seconds on a PC with GHz Xeon processor and 5 GB RAM implemented in Matlab 7.0, while ours took approximately 1000 seconds each on a roughly comparable machine: we are unable to compare execution times directly but ours seem reasonably efficient.

The TDCP objective estimates turn out to be quite accurate, with at most 0.5% deviance from the actual objective value. Our approach required multiple runs of episodes instead of one run, so it appears to be less efficient than that of Peeta et al., but the results are competitive and in two cases were closer to optimal. It is perhaps surprising that TDCP, a general-purpose SCP algorithm with random state aggregation, gives comparable results to the more sophisticated and problem-specific approximation of Peeta et al.

As an illustration of an advantage of using a generic CP-based solver, we experimented further with the model. In principle we can apply many standard CP techniques to improve the SCP model: add implied constraints, change the filtering algorithm for a constraint (for example by using a global constraint), break symmetries, exploit dominances, experiment with different variable orderings, and so on. For this problem we found improved results by making two changes. Firstly, we added constraints to limit the search to maximal solutions:

 Bye+z+ce>B(∀e∈E)

These constraints exclude non-maximal investment plans in which we do not invest in a link despite there being enough unspent money to do so. Secondly we randomly permute the before starting the search: we did not find a good deterministic variable ordering, but by randomising we hope to find a better ordering (if one exists) over multiple runs. We obtained some improved results: for =2 =low we found the plan (1 4 10 15 17 20 21 22 25 23) with estimated objective value 67.271 and actual value 67.334, and for =1 =high plan (10 17 21 22 23 25) with estimated value 211.492 and actual value 212.413 (this is the optimal plan from ), both better than the plans of Peeta et al.

However, the use of a random variable permutation caused a greater variability in plan quality. Clearly the variable ordering has a strong effect on the search, and more research might find a good heuristic. But the main point of this experiment was to show that it is very easy to experiment with alternative SCP models and heuristics to obtain new RL algorithms for SCP.

## 5 Conclusion

We implemented a simple RL algorithm in a CP solver, and obtained a novel algorithm for solving SCP problems. We showed that this RL/CP hybrid can find high-quality solutions to hard problems. We believe that exploiting Machine Learning methods is a good direction for SCP research, to make it a practical tool for real-world problems. In future work we shall show that our approach extends to multistage SCP problems using different state aggregation techniques (we have preliminary results on an inventory control problem).

This work should also be of interest from an RL perspective. Firstly, implementing RL algorithms in a CP solver enables the user to perform rapid prototyping of RL methods for new problems. For example, simply by specifying a different filtering algorithm for a global constraint we obtain a new RL solver. Secondly, we now have an RL solver for an interesting class of problem (SCP problems). There are no general-purpose RL solvers available because, like Dynamic Programming, RL is a problem-specific approach. Thirdly, allowing the use of constraint filtering methods in RL potentially boosts its ability to solve tightly-constrained problems.

#### Acknowledgments

This publication has emanated from research supported in part by a research grant from Science Foundation Ireland (SFI) under Grant Number SFI/12/RC/2289.

## References

•  K. R. Apt, M. Wallace. Constraint Logic Programming Using Eclipse. Cambridge University Press, 2007.
•  T. Benoist, E. Bourreau, Y. Caseau, B. Rottembourg. Towards Stochastic Constraint Programming: A Study of On-Line Multi-Choice Knapsack with Deadlines. 7th International Conference on Principles and Practice of Constraint Programming, Lecture Notes in Computer Science vol. 2239, Springer, 2001, pp. 61–76.
•  D. P. Bertsekas, J. N. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, 1996.
• 

L. Bianchi, M. Dorigo, L. M. Gambardella, W. J. Gutjahr. A Survey on Metaheuristics for Stochastic Combinatorial Optimization.

Natural Computing 8(2):239–287, 2009.
•  J. Birge, F. Louveaux. Introduction to Stochastic Programming. Springer Series in Operations Research, 1997.
•  L. Bordeaux, H. Samulowitz. On the Stochastic Constraint Satisfaction Framework. ACM Symposium on Applied Computing, 2007, pp. 316–320.
•  M. Branda. On Relations Between Chance Constrained and Penalty Function Problems Under Discrete Distributions. Mathematical Methods of Operations Research 77:265–277, 2013.
•  J. Dupǎcová, N. Gröwe-Kuska, W. Römisch. Scenario Reduction in Stochastic Programming: an Approach Using Probability Metrics. Mathematical Programming Series A 95:493–511, 2003.
•  S. Peeta, F. S. Salman, D. Gunnec, K. Viswanath. Pre-Disaster Investment Decisions for Strengthening a Highway Network. Computers & Operations Research 37:1708–1719, 2010.
•  W. B. Powell. Approximate Dynamic Programming. John Wiley & Sons, 2007.
•  S. D. Prestwich, S. A. Tarim, R. Rossi, B. Hnich. Evolving Parameterised Policies for Stochastic Constraint Programming. 15th International Conference on Principles and Practice of Constraint Programming, Lecture Notes in Computer Science vol. 5732, 2009, pp. 684–691.
•  S. D. Prestwich, S. A. Tarim, R. Rossi, B. Hnich. Stochastic Constraint Programming by Neuroevolution With Filtering.

7th International Conference on Integration of Artificial Intelligence and Operations Research Techniques in Constraint Programming, Lecture Notes in Computer Science

vol. 6140, Springer, 2010, pp. 282–286.
•  S. D. Prestwich, M. Laumanns, B. Kawas. Value Interchangeability in Scenario Generation. 19th International Conference on Principles and Practice of Constraint Programming, Lecture Notes in Computer Science Vol. 8124, Springer, 2013, pp. 587–595.
•  S. D. Prestwich, S. A. Tarim, R. Rossi, B. Hnich. Hybrid Metaheuristics for Stochastic Constraint Programming. Constraints, 2014 (to appear).
•  J.-C. Régin. Generalized Arc Consistency for Global Cardinality Constraint. 14th National Conference on Artificial Intelligence pp. 209–215, 1996.
•  R. Rossi, B. Hnich, S. A. Tarim, S. Prestwich. Confidence-Based Reasoning in Stochastic Constraint Programming. Artificial Intelligence, 228(1):129–152, Elsevier, 2015.
•  G. A. Rummery, M. Niranjan, M. On-Line Q-Learning Using Connectionist Systems. Technical Report CUED/F-INFENG/TR 166, Cambridge University Engineering Department, 1994.
•  R. S. Sutton, A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 1998.
•  S. A. Tarim, I. Miguel. A Hybrid Bender’s Decomposition Method for Solving Stochastic Constraint Programs with Linear Recourse. Lecture Notes in Computer Science vol. 3978, Springer, 2006, pp. 133–148.
•  G. Tesauro. Practical Issues in Temporal Difference Learning. Machine Learning 8:257–277, 1992.
•  J. N. Tsitsiklis. Commentary — Perspectives on Stochastic Optimization Over Time. INFORMS Journal on Computing 22(1):18–19, 2010.
•  T. Walsh. Stochastic Constraint Programming. 15th European Conference on Artificial Intelligence, 2002, pp. 111–115.
•  A. L. Zobrist. A New Hashing Method with Application for Game Playing. Technical Report 88, Computer Sciences Department, University of Wisconsin, Madison, Wisconsin, 1969. Also: International Computer Chess Association Journal 13(2):69–73, 1990.