Log In Sign Up

A Machine Learning Approach to Solving Large Bilevel and Stochastic Programs: Application to Cycling Network Design

by   Timothy C. Y. Chan, et al.

We present a novel machine learning-based approach to solving bilevel programs that involve a large number of independent followers, which as a special case include two-stage stochastic programming. We propose an optimization model that explicitly considers a sampled subset of followers and exploits a machine learning model to estimate the objective values of unsampled followers. Unlike existing approaches, we embed machine learning model training into the optimization problem, which allows us to employ general follower features that can not be represented using leader decisions. We prove bounds on the optimality gap of the generated leader decision as measured by the original objective function that considers the full follower set. We then develop follower sampling algorithms to tighten the bounds and a representation learning approach to learn follower features, which can be used as inputs to the embedded machine learning model. Using synthetic instances of a cycling network design problem, we compare the computational performance of our approach versus baseline methods. Our approach provides more accurate predictions for follower objective values, and more importantly, generates leader decisions of higher quality. Finally, we perform a real-world case study on cycling infrastructure planning, where we apply our approach to solve a network design problem with over one million followers. Our approach presents favorable performance compared to the current cycling network expansion practices.


page 27

page 29


Generalization of Machine Learning for Problem Reduction: A Case Study on Travelling Salesman Problems

Combinatorial optimization plays an important role in real-world problem...

Solving Bilevel Knapsack Problem using Graph Neural Networks

The Bilevel Optimization Problem is a hierarchical optimization problem ...

The stochastic bilevel continuous knapsack problem with uncertain follower's objective

We consider a bilevel continuous knapsack problem where the leader contr...

A Pessimistic Bilevel Stochastic Problem for Elastic Shape Optimization

We consider pessimistic bilevel stochastic programs in which the followe...

Stochastic Learning Approach to Binary Optimization for Optimal Design of Experiments

We present a novel stochastic approach to binary optimization for optima...

Learning Parameters for Balanced Index Influence Maximization

Influence maximization is the task of finding the smallest set of nodes ...

Private Machine Learning via Randomised Response

We introduce a general learning framework for private machine learning b...

1 Introduction

This paper is concerned with solving bilevel optimization problems with a large number of followers and where the feasible region of the leader is independent of the followers. A wide range of decision problems can be modeled this way, including active transportation network design (Liu et al. 2019, 2021b, Lim et al. 2021), network pricing (Van Hoesel 2008, Alizadeh et al. 2013), energy pricing (Fampa et al. 2008, Zugno et al. 2013), and portfolio optimization (Carrión et al. 2009, Leal et al. 2020). This model also generalizes two-stage stochastic programming. Indeed, if the objectives of the leader and followers are identical, then this model becomes a two-stage stochastic program where the set of followers represent scenarios. Thus, in this paper, the reader should think of “leader” in a bilevel program and “first-stage decision maker” in a stochastic program as synonymous, and similarly for “follower” and “second-stage decision maker”. As we discuss the bilevel or stochastic programming literature below, we use the corresponding terminology.

The main challenge of solving a bilevel problem with a large set of followers stems from having to solve a large number of follower problems to evaluate the quality of the leader’s decision. For the bilevel problem we consider, given its relationship to stochastic programming, we can draw on approaches to deal with large from both communities. Two predominant strategies are: i) solving the problem with a small sample of , and ii) approximating the followers’ cost without explicitly solving the followers’ problems.

Sampling a smaller follower set can be done via random sampling (Liu et al. 2021b, Lim et al. 2021) or clustering (Dupačová et al. 2003, Hewitt et al. 2021, Bertsimas and Mundru 2022). Given a sample , we can obtain a feasible leader solution by solving the reduced problem, which improves computational tractability and solution interpretability due to the reduced problem size. Furthermore, under suitable regularity assumptions, an optimal solution to the reduced problem provides a bound on the optimal value and optimal leader solution to the original problem (Römisch and Schultz 1991, Römisch and Wets 2007). However, there is no theoretical guarantee on the performance of the optimal solution to the reduced problem as measured by the original problem’s objective.

Regarding the second strategy, many different algorithms have been developed to approximate the followers’ cost. Such approximations can be obtained by relaxing the constraint that the followers’ decisions are optimal for their own objectives and progressively refining the relaxed problem until the generated follower solutions are indeed optimal. Algorithms along this direction include L-shaped methods (Birge and Louveaux 2011), vertex enumeration (Candler and Townsley 1982), the Kuhn-Tucker approach (Bard 2013), and penalty function methods (White and Anandalingam 1993). These algorithms are generally not able to deal with huge because the relaxed problem size still increases drastically as increases. To overcome this issue, machine learning (ML) methods have recently achieved encouraging performance on approximating the second-stage cost as a function of first-stage decisions (Liu et al. 2021a, Dumouchelle et al. 2022) and learning second-stage decision rules (Chen et al. 2008) in two-stage stochastic programming. However, the former requires identifying features that can be compactly represented using the leader’s decision, which is practically challenging, while the latter may produce infeasible decisions when nontrivial second-stage constraints are present. Moreover, neither method has optimality guarantees.

In this paper, we build on the ideas from both strategies. In particular, we consider sampling a subset . However, we also augment the overall objective function with an estimate of the objective value of the unsampled followers, from , using an ML model. Unlike existing methods that use offline ML models to map leader decisions to objective values, our ML model intakes follower features that are independent of leader decisions. We embed the ML model training into the bilevel problem to link the ML model with leader decisions. When optimizing leader decisions, the ML model is trained on a dataset of the sampled followers on-the-fly. Simultaneous optimization and ML model training enables derivation of new theoretical guarantees for the generated leader decisions. Finally, to demonstrate our methodology, we apply it to a cycling infrastructure design problem, completed in collaboration with the City of Toronto using real data. The methods we develop in this paper were driven by the need to solve this large, real-world application.

1.1 Cycling Infrastructure Planning

Cycling has become an increasingly popular transportation mode due to its positive impact on urban mobility, public health, and the environment (Mueller et al. 2018, Kou et al. 2020) In fact, during the COVID-19 pandemic, cycling popularity increased significantly since it represented a low-cost and safe alternative to driving and public transit, facilitated outdoor activities, and improved access to essential services (Kraus and Koch 2021, Buehler and Pucher 2021). However, cycling safety and comfort concerns have been repeatedly identified as major factors that inhibit cycling uptake and overall mode choice (Dill and McNeil 2016, Li et al. 2017). Building high-quality cycling infrastructure is among the most effective ways to alleviate cycling stress and enhance cycling adoption (Buehler and Dill 2016). In this paper, we develop an optimization model to aid cycling infrastructure planning by maximizing “low-stress cycling accessibility” subject to a fixed budget. Low-stress cycling accessibility, defined as the total amount of “opportunities” (i.e., people, jobs, retail stores) reachable by individuals using streets that are perceived to be safe for cycling, has been widely adopted to assess the service provided by transportation infrastructure and the impact of cycling infrastructure (Sisson et al. 2006, Lowry et al. 2012, Furth et al. 2018, Kent and Karner 2019, Imani et al. 2019, Gehrke et al. 2020, Lin et al. 2021). In our problem, a transportation planner designs a cycling network subject to a given budget, considering that cyclists will use the low-stress network to travel to opportunities via shortest paths. The planner’s objective is to maximize the total number of opportunities accessible via low-stress routes. The resulting formulation for the City of Toronto includes over one million origin-destination pairs (i.e., followers) between small geographic units known as census dissemination areas, motivating the development of our methodology.

1.2 Contributions

  1. We develop a novel ML-augmented optimization model for solving bilevel optimization problems with a large number of followers. Our objective function has an exact component for a sampled subset of followers and an approximate component derived from an ML model. Sampling improves computational tractability, while the ML model ensures that the objective function better captures the impact of the leader’s decision on the unsampled followers. The training of the ML model is embedded into the optimization model to enable the usage of predictive features that cannot be compactly represented using leader decisions. Indeed, we develop a theoretical bound on the quality of the generated solution as measured on the original objective function with the full set of followers. Given that our model generalizes two-stage stochastic programming, this ML-augmented approach also represents a new way for solving stochastic programming problems.

  2. Informed by our theoretical insights, we develop practical strategies to enhance the performance of the ML-augmented model, including i) follower sampling algorithms to tighten the theoretical bounds, ii) strategies for hyperparameter tuning, and iii) a novel representation learning framework that automatically learns follower features that are predictive of follower objective values. To the best of our knowledge, this is the first application of representation learning to bilevel optimization.

  3. We demonstrate the effectiveness of our approach via computational studies on a synthetic cycling network design problem. We show that i) our learned features are more predictive of follower objective values compared to baseline features from the literature; ii) our follower sampling algorithms further improve the ML models’ out-of-sample prediction accuracy by a large margin compared to baseline sampling methods; iii) our strong predictive performance translates into high-quality and stable leader decisions from the ML-augmented model. The performance gap between our approach and sampling-based models without the ML component is particularly large when the follower sample is small.

  4. In collaboration with the City of Toronto, we perform a real-world case study on cycling infrastructure planning in Toronto, Canada. We solve a large-scale cycling network design problem, demonstrating that our approach can increase accessibility over a greedy method by 19.2%, averaging over different budgets. If we consider 100 km of cycling infrastructure to be designed using a greedy method, our method can achieve a similar level of accessibility using only 70 km, equivalent to $18M in potential cost savings, or a 11.2% increase in accessibility if all 100 km was designed using our approach.

Proofs are in the Electronic Companion.

2 Literature review

2.1 Integration of Machine Learning and Optimization

“Predict, then optimize” is a common modeling paradigm that uses machine learning models to estimate parameters in an optimization problem, which is then solved with those estimates to obtain decisions (Ferreira et al. 2016, Elmachtoub and Grigas 2022). Recent progress has been made in using ML models to prescribe decisions based on contextual features (Ban and Rudin 2019, Notz and Pibernik 2022, Babier et al. 2022, Bertsimas and Kallus 2020), to improve optimization algorithms (Khalil et al. 2016, Tang et al. 2020, Jia and Shen 2021, Morabit et al. 2021), and to build end-to-end optimization solvers (Vinyals et al. 2015, Bello et al. 2017, Khalil et al. 2017, Nazari et al. 2018, Kool et al. 2019).

Our idea of using an ML model to predict followers’ costs is similar to a stream of literature that integrates offline ML models into the solution of optimization problems to map decision variables to uncertain objective values (Biggs et al. 2017, Mišić 2020, Liu et al. 2021a, Bergman et al. 2022, Dumouchelle et al. 2022). In contrast, we integrate the ML model training into the optimization problem and use predictive features that cannot be compactly represented using our decision variables.

2.2 Scenario Reduction in Stochastic Programming

Scenario reduction has been extensively studied in the stochastic programming literature. One stream of literature quantifies the similarity between individual scenarios and then applies clustering methods to select a subset. Common measures include the cost difference between single-scenario optimization problems (Keutchayan et al. 2021), the opportunity cost of switching between scenarios (Hewitt et al. 2021)

, and the distance between scenario feature vectors

(Crainic et al. 2014, Wu et al. 2022). Another stream of literature selects a scenario subset from a given set to minimize the discrepancy between the distributions described by the two sets. Probabilistic metrics, including Wasserstein distance (Römisch and Schultz 1991, Bertsimas and Mundru 2022) and Fortet-Mourier metrics (Dupačová et al. 2003)

, derived from the stability theory of stochastic programming and heuristic loss functions

(Prochazka and Wallace 2020) have been proposed to measure such discrepancies.

Our proposal of learning follower representations is similar to the work of Wu et al. (2022). However, our approach does not rely on the graph structure of the optimization problem and considers the impact of the leader’s solution on the followers’ costs. Our theoretical results deviate from the second stream of literature in that instead of focusing on the stability of the optimal value and optimal leader’s solution, we provide solution quality guarantees for the leader’s solution obtained from solving our ML-augmented problem.

2.3 Active Learning

Active learning aims to reduce the cost of data labeling by selecting the most “informative” subset of data for machine learning model training. Such a subset is typically constructed through an iterative process where an ML model is trained on the labeled data at each iteration and then queries one or more unlabeled data points based on informativeness measures such as prediction uncertainty (Lewis and Gale 1994), and expected model change (Settles et al. 2007). See Settles (2009) for a comprehensive review of the available measures. Instead of iteratively selecting unlabeled data, a subset of data may be selected once to improve data labeling efficiency (Sener and Savarese 2018). Our approach is most similar to Sener and Savarese (2018) because including the exact objective for a sampled set of followers in the reduced problem is analogous to costly labeling of a small set of data points once. We extend the core-set method developed by Sener and Savarese (2018) to derive a theoretical bound for the quality of the leader decisions from our ML-augmented model.

2.4 Strategic Cycling Infrastructure Planning

Previous studies on strategic cycling infrastructure planning have considered a variety of approaches. Many papers greedily choose road segments to install new cycling infrastructure using expert-defined metrics (Lowry et al. 2012, Kent and Karner 2019, Olmos et al. 2020) or visualization techniques (Larsen et al. 2013, Putta and Furth 2019). Optimization-based approaches typically consider minimizing the travel cost (Sohn 2011, Mesbah et al. 2012, Duthie and Unnikrishnan 2014, Bagloee et al. 2016, Mauttone et al. 2017), maximizing total utility (Bao et al. 2017, Liu et al. 2019, 2021b, Lim et al. 2021, Ospina et al. 2022), or maximizing total ridership (Liu et al. 2022) given a large number of origin-destination (OD) pairs. Due to the large problem size, such models are usually solved with heuristics. To the best of our knowledge, only Liu et al. (2021b), Lim et al. (2021), and Liu et al. (2022) solve the problems to optimality at a city scale by randomly sampling OD pairs or restricting the routes that can be used by each OD pair. Our work adds to the literature by developing a computationally tractable solution method that can solve larger problems and does not require restrictions such as limited routes for each OD pair.

3 Model Preliminaries

In this section, we present the general bilevel problem of interest, a reduced version based on sampling, and our proposed model. We will use the following notational conventions. Vectors and matrices are denoted in bold, and sets are denoted in calligraphic font. We let denote the indicator function, and .

3.1 The Bilevel Model

The following is the bilevel optimization problem of interest:

subject to (1b)

Let denote the leader’s decision with feasible set and cost function . Let be the set of followers, be the decision of follower , measure the cost of a follower’s decision, and be a nonnegative weight. To determine the optimal decision of follower , we assume the follower is optimizing an objective function subject to a non-empty feasible set that depends on the leader’s decision. Note that when and are identical for all , this problem is equivalent to a two-stage stochastic program

subject to

where the decisions of the leader and followers correspond to the first-stage and second-stage decisions, respectively, is the set of second-stage scenarios, and

(suitably normalized) is the probability of realizing scenario


To simplify the notation, we write the bilevel problem as






Let be an optimal solution to (2).

3.2 Reduced Model

Given a sampled follower set , we consider the reduced problem




is as defined in (4), and the weight assigned to scenario , , may be different from , due to re-weighting. Let be an optimal solution to (5).

For the special case of a two-stage stochastic program, stability results have been established for problem (5). Under certain regularity conditions, it is possible to bound (Römisch and Schultz 1991, Römisch and Wets 2007, Bertsimas and Mundru 2022) and (Römisch and Schultz 1991). However, there is no bound in the literature on , which is something we develop using the model from the next section.

3.3 ML-Augmented Model

Given a sampled follower set , we propose the following model

subject to (7b)

This formulation augments the reduced problem by integrating an ML model that predicts the costs of the unsampled followers in . The ML model is specified by , which predicts the cost of follower based on a feature vector . We use to denote the function class of ML models, to be a user-defined upper bound on the training loss, and to be a weight assigned to follower for calculating the training loss of . The training of on dataset is embedded into the problem via the training loss constraint (7b). We choose the L1 loss because it can be easily linearized.

Let denote the feasible region of problem (7). We can write problem (7) as




Problem (8) provides a general structure for our modeling approach. Its effectiveness on a given problem depends on multiple factors: i) function class , ii) weighting scheme and upper bound , iii) sample , and iv) availability of predictive follower features . We address the first three items in Section 4 and the fourth in Section 5.

4 Integrating a Prediction Model

In Section 4.1, we introduce two classes of prediction models – one non-parametric and one parametric – that are compatible with our ML-augmented model. We provide theoretical bounds on performance in Section 4.2. Finally, we present algorithms and discuss practical implementation, based on insights from examining the bounds, in Section 4.3.

4.1 Function Classes

4.1.1 -nearest neighbor regression.

For any fixed , let

where denotes the size of neighborhood considered and represents the -nearest neighbors in the sampled set of follower . Since NN regression is a non-parametric method that does not require training, we do not require constraint (7b). Equivalently, we can simply set or for all . Then, problem (8) becomes


Note that our ML-augmented model is also compatible with other non-parametric prediction models, such as locally weighted regression (Cleveland and Devlin 1988) and kernel regression (Parzen 1962), which assign non-uniform weights to nearest neighbors. However, for those models, our theoretical bound (Section 4.2.3

) becomes a complicated function of the distances from each follower to its nearest neighbors, making it difficult to formulate practical follower selection algorithms to tighten the bound. We thus leave the integration of more sophisticated non-parametric models for future work.

4.1.2 General parametric regression.

Consider a general parametric regression model written as , where represents the parameters of model and is the feasible set of model parameters. Then, problem (8) becomes


where is the number of followers in whose nearest neighbor in is .

For model (11) to be effective, one should choose a function class that can be compactly represented with and

. For example, a linear regression model

can be incorporated using only additional continuous decision variables . An additional set of continuous decision variables and linear constraints are required to linearize the L1 training loss. Such a representation is tractable when and are small. Though our theoretical results are applicable to general regression models, we focus on linear models in this paper for computational efficiency. This is not restrictive since nonlinearity can be incorporated through feature engineering.

4.2 Theoretical Properties

4.2.1 Prediction model setup.

We start by formally defining the prediction problem that is embedded into our ML-augmented model. For any fixed leader decision , we are interested in a regression problem defined in a feature space and a target space . We denote by

the probability density function of the target variable given a feature vector

. We regard

as a random variable because the true mapping from features to this target may not be deterministic. For example, consider a network design problem where the follower’s cost is the length of the shortest path from an origin to a destination using the network designed by the leader. If we use a one-dimensional binary feature that is 1 if both the origin and destination are in downtown and 0 otherwise, then all downtown OD pairs will share the same feature value but with shortest path lengths that could be drastically different.

4.2.2 Assumptions.

Next, we introduce several assumptions that enable the derivation of our theoretical results in Sections 4.2.3 and 4.2.4. [Independence] For any followers and leader decisions , the target (random) variables with distributions and are independent.

[Predictivity] There exists a such that, for any fixed , , and drawn from , there exists an interval such that and almost surely.

Assumption 4.2.2 holds for a wide range of applications where the original follower set is independently sampled. For example, in transportation network design, OD pairs (followers) are usually independently sampled from survey data (Lim et al. 2021) or ridership data (Liu et al. 2021b, 2022). In two-stage stochastic programming, second-stage scenarios are usually from historical observations that can be regarded as independent samples (Shapiro et al. 2009, Birge and Louveaux 2011). Assumption 4.2.2 states that the chosen features should be predictive of the follower’s cost. Given a fixed feature vector , if the associated follower cost could vary wildly, it would be difficult for any prediction model to achieve good performance. In contrast, if Assumption 4.2.2 holds and is small, then achieving a small prediction error is theoretically possible if the ML model is properly chosen and trained. Note that for Theorems 4.2.3 and 4.2.4 to hold, Assumption 4.2.2 can be relaxed if the cost function is bounded. However, using predictive features helps to narrow the interval, thus tightening the bounds. Assumptions 4.2.2 and 4.2.2 enable the application of concentration inequalities to bound the deviation of the total cost of out-of-sample followers from its expected value.

[Continuity of Follower Cost] For any fixed , there exists a such that is -Lipschitz continuous with respect to .

[Continuity of the Prediction Model] For any fixed , there exists a such that is -Lipschitz continuous with respect to .

Assumption 4.2.2 limits the change in the expected follower cost as a function of the change in feature space. Similar assumptions are commonly made to derive stability results for two-stage stochastic programming (Römisch and Schultz 1991, Römisch and Wets 2007, Bertsimas and Mundru 2022). Assumption 4.2.2 limits the expressive power of . Given that is trained on a small dataset associated with , limiting its expressive power is critical to avoid overfitting. Such a condition can be enforced by adding regularization constraints to . For example, for linear regression, we can set . This assumption is needed only for parametric regression models.

Next, we present theoretical bounds for the quality of solutions from the NN-augmented (Section 4.2.3) and parametric regression-augmented models (Section 4.2.4), and then follower selection and model tuning methods that tighten the bounds (Section 4.3).

4.2.3 Bound on NN-augmented model solution.

Given a follower sample and a neighborhood size , let denote an optimal solution to problem (10). If Assumptions hold, with probability at least ,

where is a distance metric in , , and .

Theorem 4.2.3 bounds the optimality gap of the solution generated by the

NN-augmented model on the original problem. The first term on the right-hand side corresponds to the prediction bias and the second term corresponds to the variance and Bayes error, both controlled by

and . The bias is proportional to the sum of the average distance from each to its -nearest neighbors and it increases as increases. When the sample size is fixed, the second term is controlled by . Note that , so the second term is minimized when are identical for all , which follows from the Cauchy-Schwarz inequality. The intuition is that if the followers from are “evenly” assigned to sample followers in , then the overall prediction performance on is less affected by the random deviation of the individual cost of follower , , from its expected value. In the worst case, the second term is bounded by , which decreases as increases, reflecting the bias-variance trade-off. Motivated by Theorem 4.2.3, we introduce how to select and to tighten the bound in Section 4.3.1.

4.2.4 Bound on parametric regression-augmented model solution.

Given a follower sample , let denote the optimal solution to problem (11) and denote the nearest neighbor of in , if Assumptions hold, with probability at least ,

Theorem 4.2.4 bounds the optimality gap of the leader’s decision generated by the parametric regression-augmented model on the original problem. The first term on the right-hand side is controlled by the training loss , whereas the second and third terms are controlled by . To reduce the second and third terms, should be chosen such that followers are not too far away from its nearest neighbor in (second term) and the assignment of followers in to followers in should be “even” (third term). Informed by Theorem 4.2.4, we discuss how to select and in Section 4.3.2.

4.3 Practical Implementation

4.3.1 NN-augmented model.

To tighten the bound in Theorem 4.2.3, we need to determine i) the follower sample , and ii) the value of . Jointly optimizing and is challenging due to the complicated function form. We propose a practical solution method that enumerates different values of and samples one follower set for each by solving


where is an upper bound on the sample size imposed by the available computational resources. We choose to focus on bias minimization because i) the bias-minimization problem corresponds to a vector-assignment -median problem (Weaver and Church 1985), which is well-studied; ii) the variance term is obtained by applying the Hoeffding inequality (Hoeffding 1994)

without considering the moments of

, meaning that the bound might be loose; iii) variance-minimization can be implicitly considered by enforcing capacity constraints or penalizing “uneven” assignment in the objective function in problem (12).

In principle, one should enumerate all and obtain a follower sample and a leader’s decision by solving problems (12) and (10), respectively, for each . The best leader’s solution can then be selected based on the objective function of problem (2). However, since both problems (10) and (12) are challenging to solve, full enumeration of introduces considerable computational burden. We thus narrow the search window for based on the predictive performance of NN regression on the follower costs under some sampled leader decisions. Specifically, we first randomly generate a set of feasible leader solutions and build a dataset of followers’ costs for each one. We then select the best neighborhood size based on the average out-of-sample predictive performance on these datasets. We then vary in a small interval around , and obtain the corresponding follower samples and leader solutions by solving Problems (12) and (10), respectively. The complete solution approach is presented as Algorithm 1.

Input: Width of the search window ; Number of leader decisions to sample ; Follower sample size ; Follower features .
Output: A leader solution .

1:Randomly sample feasible leader solutions .
2:for  to  do
3:     Generate a dataset .
4:     Perform a random train-test split to obtain and such that .
5:     for  to  do
6:         Build NN model using .
7:         Calculate out-of-sample loss .      
8:Select the best .
9:for  do
10:     Obtain leader’s solution by solving problems (12) and (10) with .
11:Select the best solution .
Algorithm 1 A solution method using the NN-augmented model

4.3.2 General parametric regression.

To tighten the bound in Theorem 4.2.4, we need to determine i) the follower sample and ii) the value of . To select , we focus on minimizing the second term of the bound for similar reasons as discussed in the previous section. In this case, we solve the classical -median problem


Regarding , choosing a small value will tighten the bound, but could lead to overfitting or even worse, render problem (11) infeasible. We propose a practical approach to iteratively search for an appropriate . The search starts from a given , which is estimated using data associated with randomly generated leader decisions, and then gradually increases this value until the generated leader decision stops improving. The complete solution approach is presented as Algorithm 2.

Input: Step size ; Number of leader decisions to sample ; Follower sample size ; Follower features .
Output: A leader solution .

1:Randomly sample feasible leader solutions .
2:for  to  do
3:     Generate dataset .
4:     Randomly select a training set such that .
5:     Train a prediction model on .
6:     Calculate the training loss .
7:Select a starting point .
8:Obtain by solving Problem (13).
9:Obtain an initial solution by solving Problem (11) with and .
10:Initialize step counter
12:     Update .
13:     Obtain by solving Problem (11) with and .
14:until .
15:Select the best solution .
Algorithm 2 A solution method using the parametric regression-augmented model

5 Learning Follower Representations

In this section, we present a representation learning framework that maps a follower set to a -dimensional feature space. The learned follower features are used as inputs to the ML model embeded in problems (10) and (11), and to quantify similarity between followers to aid follower sampling in problems (12) and (13). This framework, as presented in Algorithm 3, consists of two steps. In the first step, we construct a relationship graph , where each node represents one follower and each edge is assigned a weight reflecting the similarity between the followers. In the second step, we adopt a graph embedding algorithm to map each node in to a feature space . We emphasize that this framework is compatible with any relationship graph and graph embedding algorithms, providing flexibility to tailor the framework to deal with different applications.

Input: Number of leader decisions ; Embedding size ; Walk per follower ; Walk length ; Window size .
Output: Follower features .

1:Generate leader decisions . Relationship graph construction
2:for  to  do
3:     Calculate for all .
4:Construct a graph , where .
5:Calculate weight for all .
6:Initialize walk container . Graph embedding
7:for  to  do
8:     for  do
9:         Initialize the current node and the walk .
10:         for  to  do
11:              Sample the next node .
12:              Update and .          
13:         Update .      
14:Learn follower features
Algorithm 3 An Algorithm for Follower Embedding

5.1 Relationship Graph Construction

The core of the relationship graph construction is defining a weight for each edge that measures the similarity between followers. Many metrics have been proposed in the stochastic programming literature, with most focusing on the “opportunity cost” of switching between scenarios (i.e., followers). For example, Keutchayan et al. (2021) and Hewitt et al. (2021) define the opportunity cost of applying the first-stage decision that is optimal to scenario in scenario as where and denote the optimal first-stage solutions obtained by solving the single-scenario version of problem (2) with scenarios and , respectively. Building on a similar idea, Bertsimas and Mundru (2022) define a symmetric divergence measure for scenarios and as . While these opportunity cost-based measures have been shown to be effective and are compatible with the proposed framework, they require solving single-scenario versions of problem (2) in advance, which is computationally expensive when is huge.

Motivated by the fact that evaluating followers’ costs given a leader’s solution is computationally cheaper, we propose a data-driven approach that quantifies follower similarity based on their costs under some sampled leader solutions. Specifically, we first randomly sample leader decisions . For each sampled , we calculate the costs for all followers by solving the associated follower problems. We then define the weight of the edge between followers based on the distance between and . Among the distance metrics we experimented with, the following definition of edge weight achieves the best performance in predicting followers’ costs. Specifically, we define the weight of edge in the relationship graph as

where is a user-selected function whose target space is restricted to to avoid being dominated by a single sampled follower’s solution. For instance, letting be a small positive constant, we use in our cycling network design problem where the followers’ costs are non-negative. For applications where the followers’ costs can be negative, functions such as might be considered. The value of can be interpreted as the average cost similarity between followers and under the sampled leader’s decisions. Intuitively, considering more leader decisions provides more accurate estimates of such a relationship.

5.2 Follower Embedding

Once is constructed, we adapt the DeepWalk algorithm proposed by Perozzi et al. (2014) to map nodes in the relationship graph to a -dimensional feature space. The idea is to first generate a set of random walks in the relationship graph, and then apply the Word2Vec algorithm (Mikolov et al. 2013) to learn node features treating each node and each random walk as a word and a sentence, respectively. Unlike Perozzi et al. (2014), who generate random walks by uniformly sampling nodes connected to the current node, we generate random walks according to the weights assigned to edges incident to the current node. So, followers that yield similar results under the sampled leader decisions are likely to appear in a same walk, and thus will be positioned close to each other in the learned feature space.

6 Computational Study: Algorithm Performance on Synthetic Cycling Network Design Problem

In this section, we validate the effectiveness of our ML-augmented model with our representation learning framework for a cycling network design problem. We introduce the problem and its formulation in Section 6.1. We present two experiments to validate the predictive power of the learned follower features and the value of integrating an ML model in the optimization problem in Sections 6.2 and 6.3, respectively. We use a moderate-sized synthetic network (described in 12.1) that can be solved to optimality, which enables us to accurately measure the performance of our approach.

6.1 Maximum Accessibility Network Design Problem

The goal of the maximum accessibility network design problem (MaxANDP) is to design a cycling network subject to a fixed budget such that the total accessibility of a given set of OD pairs, denoted by , is maximized. Such a set may be defined based on geographical units (Imani et al. 2019, Lim et al. 2021) or ridership data (Liu et al. 2021b). Existing studies have proposed various metrics to measure accessibility, mostly focusing on first finding one or more routes between each OD pair using the designed network and then calculating the accessibility based on the selected routes. Such measures have been shown to be correlated with travel behavior data (Saghapour et al. 2017, Imani et al. 2019) and have been widely adopted to assess the performance of cycling networks (Vale et al. 2016).

Let be a directed graph where is the set of edges and is the set of nodes, corresponding to road segments and intersections, respectively. Each edge is assigned a travel time . We denote by and the sets of incoming and outgoing edges of node , respectively. Edges and nodes are partitioned into high-stress and low-stress sets according to a cycling stress assessment based on road geometry, existing infrastructure, and vehicle traffic conditions (Landis et al. 1997, Harkey et al. 1998, Furth et al. 2016). We assume that cyclists prefer cycling on low-stress roads over high-stress roads. Sets with subscripts and indicate the high-stress and low-stress subsets of the original set, respectively. High-stress edges and nodes are assigned costs and , respectively, corresponding to the costs of turning them into low-stress through building new infrastructure such as cycle tracks or traffic lights.

Let and , respectively, denote the edge selection and node selection variables (referred to as network design decisions), whose components are 1 if that edge or node is chosen for the installation of infrastructure that makes it low stress. Edge and node selections are subject to budgets and , respectively. Let denote the routing decision associated with OD pair . The routing problem on a network specified by and is characterized by an objective function and a feasible set . A function is used to calculate the accessibility of each OD pair based on the selected route(s). Each OD pair is weighted by a constant (e.g., population). The MaxANDP is formulated as

subject to (14b)

where and indicate cost vectors for high-stress edges and nodes, respectively. The objective function (14a) maximizes total cycling accessibility. Constraints (14b) ensure that the selected routes are optimal for the OD pairs’ objective functions. Constraints (14c) and (14d) enforce budgets on the network design decisions. Constraints (14e) describe the domain of the decision variables.

To apply problem (14), the accessibility measure (specified by ) and the routing problem (specified by and ) should be carefully chosen based on recent travel behavior data in the studied area (Geurs and Van Wee 2004). To illustrate our methodology, we consider two problems: i) one that uses location-based accessibility measures and shortest-path routing problems, and ii) one proposed by Liu et al. (2021b) that employs a utility-based accessibility measure and discrete route choice models. We refer readers to Liu et al. (2021b) for the latter problem. We briefly describe the former problem next.

Location-based accessibility measures use a decreasing function of the travel time from origin to destination, namely an impedance function, to model the dampening effect of separation (Iacono et al. 2010). We consider a piecewise linear impedance function


where indicates a vector of edge travel times, are breakpoints, and are penalty factors for and , respectively. This function can be used to approximate commonly used impedance functions, including negative exponential, rectangular, and linear functions (visualized in 12.2). While we consider two breakpoints for simplicity, the formulation can be easily generalized to account for more.

We use the level of traffic stress (LTS) metric (Furth et al. 2016) to formulate the routing problems. Let be the node-edge matrix describing the flow-balance constraints on , and be a vector whose and entries are 1 and , respectively, with all other entries 0. Given network design , the routing problem for is formulated as

subject to (16b)

Objective function (16a) minimizes the travel time. Constraints (16b) direct one unit of flow from to . Constraints (16c) ensure that a currently high-stress edge can be used only if it is selected. Constraints (16d) guarantee that a currently high-stress node can be crossed only if either the node is selected or all high-stress edges that are connected to this node are selected. Constraints (16d) are an exact representation of the intersection LTS calculation scheme that assigns the low-stress label to a node if traffic signals are installed or all incident roads are low-stress (Imani et al. 2019). To ensure problem (16) is feasible, we add a dummy low-stress link from to and set its travel time to . In doing so, the travel time is set to when the destination is unreachable using the low-stress network, corresponding to an accessibility of zero, as defined in equation (15). The full formulation is given in 10. We adapt the Benders approach from Magnanti et al. (1986) to solve synthetic instances to optimality. The algorithm and its acceleration strategies is in 11.

6.2 Experiment 1: Predicting OD-Pair Accessibility Using ML Models

In this experiment, we randomly generate network designs and calculate the accessibility for each OD pair under each design. The accessibility associated with each network design constitutes a dataset on which we perform train-test data splits, train ML models to predict OD-pair accessibility, and evaluate their out-of-sample prediction performance. Our goal is to i) compare the predictive power of our learned features and baseline features, ii) validate the effectiveness of our follower sampling method in improving the prediction accuracy, and iii) compare the prediction performance of online and offline ML methods.

Data generation. We consider three edge design budgets of 100, 300, and 500 distance units, corresponding to three levels of ambition, and generate 1,000 network designs for each. We set the node design budget to zero for computational tractability. We consider three location-based accessibility measures using piecewise linear impedance functions that mimic negative exponential (EXP), linear (LIN), and rectangular (REC) functions. We also consider the utility-based accessibility measure (UT) proposed by Liu et al. (2021b). The accessibility calculations are detailed in 12.2. In total, we generate 12,000 datasets (1,000 for each pair of design budget and accessibility measure).

Predictive power of the learned features. Since accessibility is a function of the travel time from origin to destination, which can be regarded as the optimal objective value of a two-stop traveling salesman problem (TSP), we employ the features proposed by Liu et al. (2021a) for predicting TSP objective value as our baseline features. For each dataset, we perform a random train-test split and then train two ML models using the same training sample, but with different OD-pair features. One uses the TSP features, while the other uses the features derived from our representation learning framework (REP). We consider ML models that are compatible with our ML-augmented model, including

NN, linear regression, lasso, and ridge regression. We vary the training sample size between 1%–5% of all OD pairs because practical implementation of our approach would be based on sampling a small number of OD pairs only. For each pair of sample size and ML model, we report the median test MAE over ten random train-test splits. The details on the baseline features and ML hyper-parameters are given in

12.3 and 12.7, respectively.

Figure 1: Mean out-of-sample MAE over 3,000 datasets of each accessibility measure when using different ML models and follower features. ML models are coded by colors and features are coded by line types.

We observe that ML models generally perform better with the REP features than with the TSP features. As presented in Figure 1, when using NN, the REP features outperform the TSP features by a large margin (e.g. over 22.8%, 26.1%, 22.5%, and 25.7% for EXP, LIN, REC, and UT, respectively). However, linear regression performs better with the TSP features. One possible explanation is that linear regression models based on the REP features (16-dimensional) are more complicated than those based on the TSP features (9-dimensional), and thus are more likely to overfit the training data. When using lasso and ridge, the REP features are better than or competitive with the TSP features. Lasso and ridge also achieve lower prediction errors than linear regression, highlighting the importance of enforcing regularization constraints in the ML-augmented model (recall discussion following Assumption 4.2.2).

Effectiveness of the follower selection algorithm. Next, we use the REP features while applying different follower sampling methods, and compare the resulting test MAE achieved by the ML models. We consider i) -median sampling as introduced in Section 4.3 (MED), 2) uniform sampling (UNI), which is commonly used in the transportation literature, and 3) -center sampling (CEN) from Sener and Savarese (2018). Since the -center and -median problems are -hard, we adapt heuristics from Boutilier and Chan (2020) and Gonzalez (1985) to solve them (detailed in 12.4). As a result, all methods involve some randomness. To account for the randomness, we apply each sampling method 10 times with different random seeds and report the median test MAE on each dataset.

From Figure 2, we observe that MED typically achieves the lowest test MAE, regardless of the accessibility measures, ML models, and sample sizes. Especially when the sample size is extremely small, the gap between MED and UNI can be up to 27% of the MAE achieved by UNI. MED outperforms CEN by a large margin when using location-based accessibility measures and is competitive with CEN when using the utility-based measure.

Figure 2: Mean test MAE over 3000 datasets associated with each accessibility measure when using different ML models and follow sampling methods. ML models are coded by colors and sampling methods are coded by markers. Our sampling method (MED) is further highlighted using solid lines, while baseline sampling methods are represented by dashed lines

Online ML model versus offline ML model. We also compare the approximations for the overall leader’s objective achieved by the best online ML approach (train a

NN model using REP features and MED samples for each leader decision) versus an offline ReLU neural network (ReLU)

(Dumouchelle et al. 2022) that predicts the total accessibility of all OD pairs directly based on leader decisions (detailed in 12.6). As presented in Figure 3, our online ML model provides better approximations using only 1–2% of all OD pairs as training data. These results provide an empirical justification for integrating ML model training into the optimization problem.

Figure 3: Mean MAE for predicting the leader’s overall objective over 3000 datasets associated with each accessibility measure. “Leader-ReLU” indicates the performance of an offline ML model, “MED-REP-NN” indicates the performance of the best online ML model.

Summary. i) our learned features are predictive of accessibility and are generally more predictive than baseline features. ii) By applying follower sampling methods derived from solving a -median problem, we can substantially improve the out-of-sample prediction accuracy of the ML models with our learned features. iii) the best online ML model yields a better approximation for the original objective than an offline ML model.

6.3 Experiment 2: Generating Leader Decisions using ML-augmented Models

Next, we investigate the extent to which our learned features and our follower samples can assist the ML-augmented model in generating high-quality leader (i.e., transportation planner) decisions. We consider the reduced model, NN-augmented model, and linear regression-augmented model using MED and UNI samples, totaling six methods for generating leader decisions. Results for CEN samples are in 12.9, as they are similar to UNI samples. We create 12 problem instances on the synthetic network (one for each pair of design budget and accessibility measure). We vary the sample size from 1% to 5%. We apply each model 10 times using 10 samples generated with different random seeds and report the average optimality gap of the leader decisions on the original problem.

Figure 4: Optimality gap of leader decisions from the three sampling-based models on the 12 problem instances. Problem instances follow the naming convention of “accessibility measure”-“budget” and the solution methods follow the naming convention of “model”-“sampling method”.

From Figure 4, our first observation is that using MED samples enhances the performance of both the ML-augmented models and the reduced model. Significant performance gaps are observed for the two ML-augmented models in all problem instances considered. Using the MED samples on average reduces the optimality gap by 5.7% and 5.2% for the NN-augmented and linear regression-augmented models, respectively. Even for the reduced model, our sampling strategy is competitive with or better than uniform sampling, with a difference in optimality gap of up to over 20% (e.g., EXP-100, 3% training sample). These results highlight the importance of sample selection for both models.

Our second observation is that when using the MED samples, the ML-augmented models generally achieve better and more stable performance compared to the reduced model. As shown in Figure 4, the ML-augmented model (NN-MED or REG-MED) generally outperforms Reduced-MED by a large margin, especially when using location-based accessibility measures and when the sample size is extremely small (1%). Figure 5 compares the stability of Reduced-MED and NN-MED.

NN-MED generally achieves a lower standard deviation of optimality gaps as Reduced-MED, even though they use exactly the same samples. The integrated ML component helps to better capture the impact of leader decisions on unsampled followers, leading to solutions of higher quality and stability.

Finally, we observe no dominance of a single ML-augmented model. REG-MED and NN-MED are comparable on mid- and large-budget instances (300 and 500). NN-MED performs better when using location-based accessibility measures and a small budget (100). One possible reason is that when the budget is limited, few OD pairs have positive accessibility while all others have zero accessibility. The feature-accessibility relationship is highly non-linear, making NN a better fit for the accessibility prediction than linear regressions.

Figure 5: The standard deviation of the optimality gaps achieved by Reduced-MED versus NN-MED. Results from instances with different design budgets are coded by colors and markers. The black diagonal lines indicate equal standard deviations.

7 Case Study: Cycling Infrastructure Planning in the City of Toronto

In this section, we present a case study applying our methodology to the City of Toronto, Canada. Toronto has built over 65 km of new cycling infrastructure from 2019–2021, partially in response to the increased cycling demand amid the COVID-19 pandemic. It plans to expand the network by 100 km from 2022–2024. We started a collaboration with the City’s Transportation Services Team in September 2020, focusing on developing quantitative tools to support cycling infrastructure planning in Toronto. As an evaluation metric, low-stress cycling accessibility has been used by the City of Toronto to support project prioritization

(City of Toronto 2021a, b). We introduce Toronto’s cycling network in Section 7.1 and use our methodology to examine actual and future potential decisions regarding network expansion in Section 7.2.

Figure 6: Level of traffic stress of Toronto’s road network (July 2021).

7.1 Cycling Network in Toronto

We construct Toronto’s cycling network based on the centerline network retrieved from the Toronto Open Data Portal (City of Toronto 2020). We pre-process the network by removing roads where cycling is legally prohibited, deleting redundant nodes and edges, and grouping arterial roads into candidate cycling infrastructure projects (detailed in 13.1). The final cycling network has 10,448 nodes, 35,686 edges, and 1,296 candidate projects totaling 1,913 km. We use the methods and data sources summarized in Lin et al. (2021)

to calculate the LTS of each link in the cycling network. LTS1 and LTS2 links are classified as low-stress, while LTS3 and LTS4 links are high-stress since LTS2 corresponds to the cycling stress tolerance for the majority of the adult population

(Furth et al. 2016). Although most local roads are low-stress, high-stress arterials create many disconnected low-stress “islands”, limiting low-stress cycling accessibility in many parts of Toronto (Figure 6).

The city is divided into 3,702 geographical units called dissemination areas (DAs). We define each DA centroid as an origin with every other DA centroid that is reachable within 30 minutes on the overall network being a potential destination, totalling 1,154,663 OD pairs. The low-stress cycling accessibility of a given DA is defined as the sum of the destination opportunities that are reachable within 30 minutes using the low-stress network. We use DA job counts retrieved from the 2016 Canadian census (Statistics Canada 2016) to represent opportunities. The resulting accessibility measure has been shown to be highly correlated with cycling mode choice in Toronto (Imani et al. 2019). We assume a constant cycling speed of 15 km/h for travel time calculation.

7.2 Expanding Toronto’s Cycling Network

As a part of our collaboration, in January 2021 we were asked to evaluate the accessibility impact of three project alternatives for building bike lanes (see Figure 7) to meet the direction of Toronto’s City Council within the City’s adopted Cycling Network Plan, intended to provide a cycling connection between midtown and the downtown core (City of Toronto 2021b). These projects were proposed in 2019 but their evaluation and implementation were accelerated because of increased cycling demand during COVID. We determined that alternative 2 had the largest accessibility impact. It was ultimately implemented due to its accessibility impact and other performance indicators (City of Toronto 2021b).

This decision-making process exemplifies the current practice of cycling infrastructure planning in Toronto: i) manually compile a list of candidate projects, ii) rank the candidate projects based on certain metrics, and iii) design project delivery plans (City of Toronto 2021c). From a computational perspective, steps i) and ii) serve as a heuristic for solving MaxANDP. This heuristic approach was necessary for several reasons, including political buy-in for the three alternatives, and the computational intractability of solving MaxANDP at the city scale prior to developing the methodology in this paper. Now, we can use our ML-augmented model to search for project combinations without pre-specifying candidates.

Figure 7: Project alternatives and the existing cycling infrastructure in the City of Toronto (January 2021).

To this end, we first apply the ML-augmented model with a budget of 6 km (similar to alternative 2). The optimal projects (see Figure 7) improve Toronto’s total low-stress cycling accessibility by approximately 9.5% over alternative 2. Instead of constructing only one corridor as in alternative 2, the ML-augmented model selects six disconnected road segments. Some of them serve as connections between existing cycling infrastructure, others bridge currently disconnected low-stress sub-networks consisting of low-stress local roads.

Next, we increase the road design budget from 10 to 100 km in increments of 10 km. The 100 km budget aligns with Toronto’s cycling network expansion plan for 2022–2024 (City of Toronto 2021a). We compare our approach to a greedy heuristic that iteratively selects the candidate project that leads to the maximum increase in total accessibility until the budget is depleted. The greedy heuristic took over 3 days to expand the network by 100 km as each iteration involves solving millions of shortest path problems. Our approach took around 4 hours to find a leader decision using a sample of 2,000 OD pairs (1.7% of all OD pairs). Given this speedup, we can solve our model multiple times with different samples and report the best solution as measured by the total accessibility of all OD pairs. The computational setups of the greedy heuristic and our approach are detailed in 13.3.

As shown in Figure 8, when holding both methods to the same computational time (meaning that we solve our ML-augmented model with 21 different sets of OD pair samples and taking the best solution), our approach increases accessibility by 19.2% on average across different budgets. For example, with a budget of 70 km, we can improve the total accessibility by a similar margin as achieved by the greedy heuristic using a 100 km budget, corresponding to a savings of 18 million Canadian dollars estimated based on the City’s proposed budget (City of Toronto 2021a). If instead we used the full 100 km budget, we would achieve 11.3% greater accessibility. The improvements mainly come from identifying projects that have little accessibility impact when constructed alone, yet significantly improve the accessibility of their surrounding DAs when combined (visualized in 13.4). These projects are typically not directly connected to existing cycling infrastructure, and thus are difficult to identify through manual analysis. Finally, we note that solution quality was similar between 14 and 21 samples, meaning that with we can achieve the above gains while simultaneously reducing solution time by approximately 33%.

Figure 8: The performance profiles of the greedy and optimal expansions. Note that using 21 different sets of OD pairs samples results in the same solution time as the greedy expansion. Hence, 7 and 14 samples correspond to and of the solution time of greedy.

In summary, our approach is a valuable tool for transportation planners to search for optimal project combinations that maximize the low-stress cycling accessibility. Although this is not the only goal of cycling network design, we believe it can be useful in at least three contexts: i) in the long term, our model can be used to generate a base plan that can later be tuned by transportation planners; ii) in the near term, our approach can efficiently search for project combinations from a large pool that would be very difficult to analyze manually; iii) Given a fixed budget, our model provides a strong benchmark against which to validate the goodness of human-proposed solutions.

8 Conclusion

In this paper, we present a novel ML-based approach to solving bilevel (stochastic) programs with a large number of independent followers (scenarios). We build on two existing strategies—sampling and approximation—to tackle the computational challenges imposed by a large follower set. The model considers a sampled subset of followers while integrating an ML model to estimate the impact of leader decisions on unsampled followers. Unlike existing approaches for integrating optimization and ML models, we embed the ML model training into the optimization model, which allows us to employ general follower features that may not be compactly represented by leader decisions. Under certain assumptions, the generated leader decisions enjoy solution quality guarantees as measured by the original objective function considering the full follower set. We also introduce practical strategies, including follower sampling algorithms and a representation learning framework, to enhance the model performance. Using both synthetic and real-world instances of a cycling network design problem, we demonstrate the strong computational performance of our approach in generating high-quality leader decisions. The performance gap between our approach and baseline approaches are particularly large when the sample size is small.

The authors are grateful to Sheng Liu, Merve Bodur, Elias Khalil, and Rafid Mahmood for helpful comments and discussions. This research is supported by funding from the City of Toronto and NSERC Alliance Grant 561212-20. Resources used in preparing this research were provided, in part, by the Province of Ontario, the Government of Canada through CIFAR, and companies sponsoring the Vector Institute.


  • Alizadeh et al. (2013) Alizadeh S, Marcotte P, Savard G (2013) Two-stage stochastic bilevel programming over a transportation network. Transportation Research Part B: Methodological 58:92–105.
  • Babier et al. (2022) Babier A, Chan TCY, Diamant A, Mahmood R (2022) Learning to optimize contextually constrained problems for real-time decision-generation. arXiv preprint arXiv:1805.09293 .
  • Bagloee et al. (2016) Bagloee SA, Sarvi M, Wallace M (2016) Bicycle lane priority: Promoting bicycle as a green mode even in congested urban area. Transportation Research Part A: Policy and Practice 87:102–121.
  • Ban and Rudin (2019) Ban GY, Rudin C (2019) The big data newsvendor: Practical insights from machine learning. Operations Research 67(1):90–108.
  • Bao et al. (2017) Bao J, He T, Ruan S, Li Y, Zheng Y (2017) Planning bike lanes based on sharing-bikes’ trajectories. Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1377–1386.
  • Bard (2013) Bard JF (2013) Practical bilevel optimization: algorithms and applications, volume 30 (Springer Science & Business Media).
  • Bello et al. (2017)

    Bello I, Pham H, Le QV, Norouzi M, Bengio S (2017) Neural combinatorial optimization with reinforcement learning.

    International Conference on Learning Representations Workshop.
  • Bergman et al. (2022) Bergman D, Huang T, Brooks P, Lodi A, Raghunathan AU (2022) Janos: an integrated predictive and prescriptive modeling framework. INFORMS Journal on Computing 34(2):807–816.
  • Bertsimas and Kallus (2020) Bertsimas D, Kallus N (2020) From predictive to prescriptive analytics. Management Science 66(3):1025–1044.
  • Bertsimas and Mundru (2022) Bertsimas D, Mundru N (2022) Optimization-based scenario reduction for data-driven two-stage stochastic optimization. Operations Research 0(0).
  • Biggs et al. (2017)

    Biggs M, Hariss R, Perakis G (2017) Optimizing objective functions determined from random forests, available at SSRN:
  • Birge and Louveaux (2011) Birge JR, Louveaux F (2011) Introduction to stochastic programming (Springer Science & Business Media).
  • Bodur and Luedtke (2017) Bodur M, Luedtke JR (2017) Mixed-integer rounding enhanced benders decomposition for multiclass service-system staffing and scheduling with arrival rate uncertainty. Management Science 63(7):2073–2091.
  • Boutilier and Chan (2020) Boutilier JJ, Chan TCY (2020) Ambulance emergency response optimization in developing countries. Operations Research 68(5):1315–1334.
  • Buehler and Dill (2016) Buehler R, Dill J (2016) Bikeway networks: A review of effects on cycling. Transport Reviews 36(1):9–27.
  • Buehler and Pucher (2021) Buehler R, Pucher J (2021) COVID-19 impacts on cycling, 2019–2020. Transport Reviews 41(4):393–400.
  • Candler and Townsley (1982) Candler W, Townsley R (1982) A linear two-level programming problem. Computers & Operations Research 9(1):59–76.
  • Carrión et al. (2009) Carrión M, Arroyo JM, Conejo AJ (2009) A bilevel stochastic programming approach for retailer futures market trading. IEEE Transactions on Power Systems 24(3):1446–1456.
  • Chen et al. (2008) Chen X, Sim M, Sun P, Zhang J (2008) A linear decision-based approximation approach to stochastic programming. Operations Research 56(2):344–357.
  • City of Toronto (2020) City of Toronto (2020) City of Toronto open data., accessed: 2020-09-15.
  • City of Toronto (2021a) City of Toronto (2021a) 2021 cycling network plan update. Accessed via on July 8, 2022.
  • City of Toronto (2021b) City of Toronto (2021b) ActiveTO: Lessons learned from 2020 and next steps for 2021. Accessed via on July 21, 2022.
  • City of Toronto (2021c) City of Toronto (2021c) Cycling network plan update — external stakeholders briefing summary. Accessed via on July 21, 2022.
  • Cleveland and Devlin (1988)

    Cleveland WS, Devlin SJ (1988) Locally weighted regression: an approach to regression analysis by local fitting.

    Journal of the American Statistical Association 83(403):596–610.
  • Crainic et al. (2014) Crainic TG, Hewitt M, Rei W (2014) Scenario grouping in a progressive hedging-based meta-heuristic for stochastic network design. Computers & Operations Research 43:90–99.
  • Dill and McNeil (2016) Dill J, McNeil N (2016) Revisiting the four types of cyclists: Findings from a national survey. Transportation Research Record 2587(1):90–99.
  • Dumouchelle et al. (2022) Dumouchelle J, Patel R, Khalil E, Bodur M (2022) Neur2sp: Neural two-stage stochastic programming. arXiv preprint arXiv.2205.12006 .
  • Dupačová et al. (2003) Dupačová J, Gröwe-Kuska N, Römisch W (2003) Scenario reduction in stochastic programming. Mathematical Programming 95(3):493–511.
  • Duthie and Unnikrishnan (2014) Duthie J, Unnikrishnan A (2014) Optimization framework for bicycle network design. Journal of Transportation Engineering 140(7):04014028.
  • Elmachtoub and Grigas (2022) Elmachtoub AN, Grigas P (2022) Smart “predict, then optimize”. Management Science 68(1):9–26.
  • Fampa et al. (2008) Fampa M, Barroso L, Candal D, Simonetti L (2008) Bilevel optimization applied to strategic pricing in competitive electricity markets. Computational Optimization and Applications 39(2):121–142.
  • Ferreira et al. (2016) Ferreira KJ, Lee BHA, Simchi-Levi D (2016) Analytics for an online retailer: Demand forecasting and price optimization. Manufacturing & Service Operations Management 18(1):69–88.
  • Fischetti et al. (2017) Fischetti M, Ljubić I, Sinnl M (2017) Redesigning benders decomposition for large-scale facility location. Management Science 63(7):2146–2162.
  • Furth et al. (2016) Furth PG, Mekuria MC, Nixon H (2016) Network connectivity for low-stress bicycling. Transportation Research Record 2587(1):41–49.
  • Furth et al. (2018) Furth PG, Putta TV, Moser P (2018) Measuring low-stress connectivity in terms of bike-accessible jobs and potential bike-to-work trips. Journal of Transport and Land Use 11(1):815–831.
  • Gehrke et al. (2020) Gehrke SR, Akhavan A, Furth PG, Wang Q, Reardon TG (2020) A cycling-focused accessibility tool to support regional bike network connectivity. Transportation Research Part D: Transport and Environment 85:102388.
  • Geurs and Van Wee (2004) Geurs KT, Van Wee B (2004) Accessibility evaluation of land-use and transport strategies: review and research directions. Journal of Transport Geography 12(2):127–140.
  • Gonzalez (1985) Gonzalez TF (1985) Clustering to minimize the maximum intercluster distance. Theoretical Computer Science 38:293–306.
  • Gurobi Optimization, LLC (2022) Gurobi Optimization, LLC (2022) Gurobi Optimizer Reference Manual. URL
  • Harkey et al. (1998) Harkey DL, Reinfurt DW, Knuiman M (1998) Development of the bicycle compatibility index. Transportation Research Record 1636(1):13–20.
  • Hewitt et al. (2021) Hewitt M, Ortmann J, Rei W (2021) Decision-based scenario clustering for decision-making under uncertainty. Annals of Operations Research 1–25.
  • Hochbaum and Shmoys (1985) Hochbaum DS, Shmoys DB (1985) A best possible heuristic for the k-center problem. Mathematics of Operations Research 10(2):180–184.
  • Hoeffding (1994) Hoeffding W (1994) Probability inequalities for sums of bounded random variables. The Collected Works of Wassily Hoeffding, 409–426 (Springer).
  • Iacono et al. (2010) Iacono M, Krizek KJ, El-Geneidy A (2010) Measuring non-motorized accessibility: issues, alternatives, and execution. Journal of Transport Geography 18(1):133–140.
  • Imani et al. (2019) Imani AF, Miller EJ, Saxe S (2019) Cycle accessibility and level of traffic stress: A case study of Toronto. Journal of Transport Geography 80:102496.
  • Jia and Shen (2021)

    Jia H, Shen S (2021) Benders cut classification via support vector machines for solving two-stage stochastic programs.

    INFORMS Journal on Optimization 3(3):278–297.
  • Kent and Karner (2019) Kent M, Karner A (2019) Prioritizing low-stress and equitable bicycle networks using neighborhood-based accessibility measures. International Journal of Sustainable Transportation 13(2):100–110.
  • Keutchayan et al. (2021) Keutchayan J, Ortmann J, Rei W (2021) Problem-driven scenario clustering in stochastic optimization. arXiv preprint arXiv:2106.11717 .
  • Khalil et al. (2017) Khalil E, Dai H, Zhang Y, Dilkina B, Song L (2017) Learning combinatorial optimization algorithms over graphs. Advances in Neural Information Processing Systems, 6348–6358.
  • Khalil et al. (2016) Khalil E, Le Bodic P, Song L, Nemhauser G, Dilkina B (2016) Learning to branch in mixed integer programming.

    Proceedings of the AAAI Conference on Artificial Intelligence

    , volume 30.
  • Kool et al. (2019) Kool W, van Hoof H, Welling M (2019) Attention, learn to solve routing problems! 7th International Conference on Learning Representations, ICLR, URL
  • Kou et al. (2020) Kou Z, Wang X, Chiu SFA, Cai H (2020) Quantifying greenhouse gas emissions reduction from bike share systems: a model considering real-world trips and transportation mode choice patterns. Resources, Conservation and Recycling 153:104534.
  • Kraus and Koch (2021) Kraus S, Koch N (2021) Provisional COVID-19 infrastructure induces large, rapid increases in cycling. Proceedings of the National Academy of Sciences 118(15).
  • Landis et al. (1997) Landis BW, Vattikuti VR, Brannick MT (1997) Real-time human perceptions: toward a bicycle level of service. Transportation Research Record 1578(1):119–126.
  • Larsen et al. (2013) Larsen J, Patterson Z, El-Geneidy A (2013) Build it. but where? the use of geographic information systems in identifying locations for new cycling infrastructure. International Journal of Sustainable Transportation 7(4):299–317.
  • Leal et al. (2020) Leal M, Ponce D, Puerto J (2020) Portfolio problems with two levels decision-makers: Optimal portfolio selection with pricing decisions on transaction costs. European Journal of Operational Research 284(2):712–727.
  • Lewis and Gale (1994) Lewis DD, Gale WA (1994) A sequential algorithm for training text classifiers. SIGIR’94, 3–12.
  • Li et al. (2017) Li S, Muresan M, Fu L (2017) Cycling in Toronto, Ontario, Canada: Route choice behavior and implications for infrastructure planning. Transportation Research Record 2662(1):41–49.
  • Lim et al. (2021) Lim J, Dalmeijer K, Guhathakurta S, Van Hentenryck P (2021) The bicycle network improvement problem: Optimization algorithms and a case study in Atlanta. arXiv preprint arXiv:2107.04451 .
  • Lin et al. (2021) Lin B, Chan TCY, Saxe S (2021) The impact of COVID-19 cycling infrastructure on low-stress cycling accessibility: A case study in the City of Toronto. Findings 19069.
  • Liu et al. (2019)

    Liu H, Szeto W, Long J (2019) Bike network design problem with a path-size logit-based equilibrium constraint: Formulation, global optimization, and matheuristic.

    Transportation Research Part E: Logistics and Transportation Review 127:284–307.
  • Liu et al. (2021a) Liu S, He L, Shen ZJM (2021a) On-time last-mile delivery: Order assignment with travel-time predictors. Management Science 67(7):4095–4119.
  • Liu et al. (2021b) Liu S, Shen ZJM, Ji X (2021b) Urban bike lane planning with bike trajectories: Models, algorithms, and a real-world case study. Manufacturing & Service Operations Management 0(0).
  • Liu et al. (2022) Liu S, Siddiq A, Zhang J (2022) Planning bike lanes with data: Ridership, congestion, and path selection, available at SSRN:
  • Lowry et al. (2012) Lowry MB, Callister D, Gresham M, Moore B (2012) Assessment of communitywide bikeability with bicycle level of service. Transportation Research Record 2314(1):41–48.
  • Magnanti et al. (1986) Magnanti TL, Mireault P, Wong RT (1986) Tailoring benders decomposition for uncapacitated network design. Netflow at Pisa, 112–154 (Springer).
  • Mauttone et al. (2017) Mauttone A, Mercadante G, Rabaza M, Toledo F (2017) Bicycle network design: model and solution algorithm. Transportation Research Procedia 27:969–976.
  • Mesbah et al. (2012) Mesbah M, Thompson R, Moridpour S (2012) Bilevel optimization approach to design of network of bike lanes. Transportation Research Record 2284(1):21–28.
  • Mikolov et al. (2013)

    Mikolov T, Sutskever I, Chen K, Corrado GS, Dean J (2013) Distributed representations of words and phrases and their compositionality.

    Advances in Neural Information Processing Systems, volume 26.
  • Mišić (2020) Mišić VV (2020) Optimization of tree ensembles. Operations Research 68(5):1605–1624.
  • Morabit et al. (2021) Morabit M, Desaulniers G, Lodi A (2021) Machine-learning–based column selection for column generation. Transportation Science 55(4):815–831.
  • Mueller et al. (2018) Mueller N, Rojas-Rueda D, Salmon M, Martinez D, Ambros A, Brand C, De Nazelle A, Dons E, Gaupp-Berghausen M, Gerike R, et al. (2018) Health impact assessment of cycling network expansions in european cities. Preventive Medicine 109:62–70.
  • Naoum-Sawaya and Elhedhli (2013) Naoum-Sawaya J, Elhedhli S (2013) An interior-point benders based branch-and-cut algorithm for mixed integer programs. Annals of Operations Research 210(1):33–55.
  • Nazari et al. (2018) Nazari M, Oroojlooy A, Snyder L, Takác M (2018) Reinforcement learning for solving the vehicle routing problem. Advances in Neural Information Processing Systems, 9839–9849.
  • Notz and Pibernik (2022) Notz PM, Pibernik R (2022) Prescriptive analytics for flexible capacity management. Management Science 68(3):1756–1775.
  • Olmos et al. (2020)

    Olmos LE, Tadeo MS, Vlachogiannis D, Alhasoun F, Alegre XE, Ochoa C, Targa F, González MC (2020) A data science framework for planning the growth of bicycle infrastructures.

    Transportation Research Part C: Emerging Technologies 115:102640.
  • Ospina et al. (2022) Ospina JP, Duque JC, Botero-Fernández V, Montoya A (2022) The maximal covering bicycle network design problem. Transportation Research part A: Policy and Practice 159:222–236.
  • Papadakos (2008) Papadakos N (2008) Practical enhancements to the Magnanti-Wong method. Operations Research Letters 36(4):444–449.
  • Parzen (1962) Parzen E (1962) On estimation of a probability density function and mode. The Annals of Mathematical Statistics 33(3):1065–1076.
  • Pedregosa et al. (2011) Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, Vanderplas J, Passos A, Cournapeau D, Brucher M, Perrot M, Duchesnay E (2011) Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12:2825–2830.
  • Perozzi et al. (2014) Perozzi B, Al-Rfou R, Skiena S (2014) Deepwalk: Online learning of social representations. Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 701–710.
  • Prochazka and Wallace (2020) Prochazka V, Wallace SW (2020) Scenario tree construction driven by heuristic solutions of the optimization problem. Computational Management Science 17(2):277–307.
  • Putta and Furth (2019) Putta T, Furth PG (2019) Method to identify and visualize barriers in a low-stress bike network. Transportation Research Record 2673(9):452–460.
  • Řehůřek and Sojka (2010) Řehůřek R, Sojka P (2010) Software Framework for Topic Modelling with Large Corpora. Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, 45–50.
  • Römisch and Schultz (1991) Römisch W, Schultz R (1991) Stability analysis for stochastic programs. Annals of Operations Research 30(1):241–266.
  • Römisch and Wets (2007) Römisch W, Wets RB (2007) Stability of -approximate solutions to convex stochastic programs. SIAM Journal on Optimization 18(3):961–979.
  • Saghapour et al. (2017) Saghapour T, Moridpour S, Thompson RG (2017) Measuring cycling accessibility in metropolitan areas. International Journal of Sustainable Transportation 11(5):381–394.
  • Sener and Savarese (2018)

    Sener O, Savarese S (2018) Active learning for convolutional neural networks: A core-set approach.

    International Conference on Learning Representations.
  • Settles (2009) Settles B (2009) Active learning literature survey. Computer Sciences Technical Report 1648, University of Wisconsin–Madison.
  • Settles et al. (2007) Settles B, Craven M, Ray S (2007) Multiple-instance active learning. Proceedings of the Twenty-First Annual Conference on Neural Information Processing Systems, 1289–1296.
  • Shapiro et al. (2009) Shapiro A, Dentcheva D, Ruszczynski A (2009) Lectures on stochastic programming: modeling and theory (SIAM).
  • Sisson et al. (2006) Sisson SB, Lee SM, Burns EK, Tudor-Locke C (2006) Suitability of commuting by bicycle to Arizona elementary schools. American Journal of Health Promotion 20(3):210–213.
  • Sohn (2011) Sohn K (2011) Multi-objective optimization of a road diet network design. Transportation Research Part A: Policy and Practice 45(6):499–511.
  • Statistics Canada (2016) Statistics Canada (2016) Population and dwelling count, 2016 census., accessed: 2020-11-15.
  • Tang et al. (2020) Tang Y, Agrawal S, Faenza Y (2020) Reinforcement learning for integer programming: Learning to cut. International Conference on Machine Learning, 9367–9376 (PMLR).
  • Vale et al. (2016) Vale DS, Saraiva M, Pereira M (2016) Active accessibility: A review of operational measures of walking and cycling accessibility. Journal of Transport and Land Use 9(1):209–235.
  • Van Hoesel (2008) Van Hoesel S (2008) An overview of stackelberg pricing in networks. European Journal of Operational Research 189(3):1393–1402.
  • Vinyals et al. (2015) Vinyals O, Fortunato M, Jaitly N (2015) Pointer networks. Advances in Neural Information Processing Systems, 2692–2700.
  • Weaver and Church (1985) Weaver JR, Church RL (1985) A median location model with nonclosest facility service. Transportation Science 19(1):58–74.
  • White and Anandalingam (1993)

    White DJ, Anandalingam G (1993) A penalty function approach for solving bi-level linear programs.

    Journal of Global Optimization 3(4):397–419.
  • Wu et al. (2022) Wu Y, Song W, Cao Z, Zhang J (2022) Learning scenario representation for solving two-stage stochastic integer programs. International Conference on Learning Representations.
  • Zetina et al. (2019) Zetina CA, Contreras I, Cordeau JF (2019) Exact algorithms based on benders decomposition for multicommodity uncapacitated fixed-charge network design. Computers & Operations Research 111:311–324.
  • Zugno et al. (2013) Zugno M, Morales JM, Pinson P, Madsen H (2013) A bilevel model for electricity retailers’ participation in a demand response market environment. Energy Economics 36:182–197.

9 Omitted Proof of Statements


Proof of Theorem 4.2.3 We start by decomposing the optimality gap as follows.