1 Introduction
This paper is concerned with solving bilevel optimization problems with a large number of followers and where the feasible region of the leader is independent of the followers. A wide range of decision problems can be modeled this way, including active transportation network design (Liu et al. 2019, 2021b, Lim et al. 2021), network pricing (Van Hoesel 2008, Alizadeh et al. 2013), energy pricing (Fampa et al. 2008, Zugno et al. 2013), and portfolio optimization (Carrión et al. 2009, Leal et al. 2020). This model also generalizes twostage stochastic programming. Indeed, if the objectives of the leader and followers are identical, then this model becomes a twostage stochastic program where the set of followers represent scenarios. Thus, in this paper, the reader should think of “leader” in a bilevel program and “firststage decision maker” in a stochastic program as synonymous, and similarly for “follower” and “secondstage decision maker”. As we discuss the bilevel or stochastic programming literature below, we use the corresponding terminology.
The main challenge of solving a bilevel problem with a large set of followers stems from having to solve a large number of follower problems to evaluate the quality of the leader’s decision. For the bilevel problem we consider, given its relationship to stochastic programming, we can draw on approaches to deal with large from both communities. Two predominant strategies are: i) solving the problem with a small sample of , and ii) approximating the followers’ cost without explicitly solving the followers’ problems.
Sampling a smaller follower set can be done via random sampling (Liu et al. 2021b, Lim et al. 2021) or clustering (Dupačová et al. 2003, Hewitt et al. 2021, Bertsimas and Mundru 2022). Given a sample , we can obtain a feasible leader solution by solving the reduced problem, which improves computational tractability and solution interpretability due to the reduced problem size. Furthermore, under suitable regularity assumptions, an optimal solution to the reduced problem provides a bound on the optimal value and optimal leader solution to the original problem (Römisch and Schultz 1991, Römisch and Wets 2007). However, there is no theoretical guarantee on the performance of the optimal solution to the reduced problem as measured by the original problem’s objective.
Regarding the second strategy, many different algorithms have been developed to approximate the followers’ cost. Such approximations can be obtained by relaxing the constraint that the followers’ decisions are optimal for their own objectives and progressively refining the relaxed problem until the generated follower solutions are indeed optimal. Algorithms along this direction include Lshaped methods (Birge and Louveaux 2011), vertex enumeration (Candler and Townsley 1982), the KuhnTucker approach (Bard 2013), and penalty function methods (White and Anandalingam 1993). These algorithms are generally not able to deal with huge because the relaxed problem size still increases drastically as increases. To overcome this issue, machine learning (ML) methods have recently achieved encouraging performance on approximating the secondstage cost as a function of firststage decisions (Liu et al. 2021a, Dumouchelle et al. 2022) and learning secondstage decision rules (Chen et al. 2008) in twostage stochastic programming. However, the former requires identifying features that can be compactly represented using the leader’s decision, which is practically challenging, while the latter may produce infeasible decisions when nontrivial secondstage constraints are present. Moreover, neither method has optimality guarantees.
In this paper, we build on the ideas from both strategies. In particular, we consider sampling a subset . However, we also augment the overall objective function with an estimate of the objective value of the unsampled followers, from , using an ML model. Unlike existing methods that use offline ML models to map leader decisions to objective values, our ML model intakes follower features that are independent of leader decisions. We embed the ML model training into the bilevel problem to link the ML model with leader decisions. When optimizing leader decisions, the ML model is trained on a dataset of the sampled followers onthefly. Simultaneous optimization and ML model training enables derivation of new theoretical guarantees for the generated leader decisions. Finally, to demonstrate our methodology, we apply it to a cycling infrastructure design problem, completed in collaboration with the City of Toronto using real data. The methods we develop in this paper were driven by the need to solve this large, realworld application.
1.1 Cycling Infrastructure Planning
Cycling has become an increasingly popular transportation mode due to its positive impact on urban mobility, public health, and the environment (Mueller et al. 2018, Kou et al. 2020) In fact, during the COVID19 pandemic, cycling popularity increased significantly since it represented a lowcost and safe alternative to driving and public transit, facilitated outdoor activities, and improved access to essential services (Kraus and Koch 2021, Buehler and Pucher 2021). However, cycling safety and comfort concerns have been repeatedly identified as major factors that inhibit cycling uptake and overall mode choice (Dill and McNeil 2016, Li et al. 2017). Building highquality cycling infrastructure is among the most effective ways to alleviate cycling stress and enhance cycling adoption (Buehler and Dill 2016). In this paper, we develop an optimization model to aid cycling infrastructure planning by maximizing “lowstress cycling accessibility” subject to a fixed budget. Lowstress cycling accessibility, defined as the total amount of “opportunities” (i.e., people, jobs, retail stores) reachable by individuals using streets that are perceived to be safe for cycling, has been widely adopted to assess the service provided by transportation infrastructure and the impact of cycling infrastructure (Sisson et al. 2006, Lowry et al. 2012, Furth et al. 2018, Kent and Karner 2019, Imani et al. 2019, Gehrke et al. 2020, Lin et al. 2021). In our problem, a transportation planner designs a cycling network subject to a given budget, considering that cyclists will use the lowstress network to travel to opportunities via shortest paths. The planner’s objective is to maximize the total number of opportunities accessible via lowstress routes. The resulting formulation for the City of Toronto includes over one million origindestination pairs (i.e., followers) between small geographic units known as census dissemination areas, motivating the development of our methodology.
1.2 Contributions

We develop a novel MLaugmented optimization model for solving bilevel optimization problems with a large number of followers. Our objective function has an exact component for a sampled subset of followers and an approximate component derived from an ML model. Sampling improves computational tractability, while the ML model ensures that the objective function better captures the impact of the leader’s decision on the unsampled followers. The training of the ML model is embedded into the optimization model to enable the usage of predictive features that cannot be compactly represented using leader decisions. Indeed, we develop a theoretical bound on the quality of the generated solution as measured on the original objective function with the full set of followers. Given that our model generalizes twostage stochastic programming, this MLaugmented approach also represents a new way for solving stochastic programming problems.

Informed by our theoretical insights, we develop practical strategies to enhance the performance of the MLaugmented model, including i) follower sampling algorithms to tighten the theoretical bounds, ii) strategies for hyperparameter tuning, and iii) a novel representation learning framework that automatically learns follower features that are predictive of follower objective values. To the best of our knowledge, this is the first application of representation learning to bilevel optimization.

We demonstrate the effectiveness of our approach via computational studies on a synthetic cycling network design problem. We show that i) our learned features are more predictive of follower objective values compared to baseline features from the literature; ii) our follower sampling algorithms further improve the ML models’ outofsample prediction accuracy by a large margin compared to baseline sampling methods; iii) our strong predictive performance translates into highquality and stable leader decisions from the MLaugmented model. The performance gap between our approach and samplingbased models without the ML component is particularly large when the follower sample is small.

In collaboration with the City of Toronto, we perform a realworld case study on cycling infrastructure planning in Toronto, Canada. We solve a largescale cycling network design problem, demonstrating that our approach can increase accessibility over a greedy method by 19.2%, averaging over different budgets. If we consider 100 km of cycling infrastructure to be designed using a greedy method, our method can achieve a similar level of accessibility using only 70 km, equivalent to $18M in potential cost savings, or a 11.2% increase in accessibility if all 100 km was designed using our approach.
Proofs are in the Electronic Companion.
2 Literature review
2.1 Integration of Machine Learning and Optimization
“Predict, then optimize” is a common modeling paradigm that uses machine learning models to estimate parameters in an optimization problem, which is then solved with those estimates to obtain decisions (Ferreira et al. 2016, Elmachtoub and Grigas 2022). Recent progress has been made in using ML models to prescribe decisions based on contextual features (Ban and Rudin 2019, Notz and Pibernik 2022, Babier et al. 2022, Bertsimas and Kallus 2020), to improve optimization algorithms (Khalil et al. 2016, Tang et al. 2020, Jia and Shen 2021, Morabit et al. 2021), and to build endtoend optimization solvers (Vinyals et al. 2015, Bello et al. 2017, Khalil et al. 2017, Nazari et al. 2018, Kool et al. 2019).
Our idea of using an ML model to predict followers’ costs is similar to a stream of literature that integrates offline ML models into the solution of optimization problems to map decision variables to uncertain objective values (Biggs et al. 2017, Mišić 2020, Liu et al. 2021a, Bergman et al. 2022, Dumouchelle et al. 2022). In contrast, we integrate the ML model training into the optimization problem and use predictive features that cannot be compactly represented using our decision variables.
2.2 Scenario Reduction in Stochastic Programming
Scenario reduction has been extensively studied in the stochastic programming literature. One stream of literature quantifies the similarity between individual scenarios and then applies clustering methods to select a subset. Common measures include the cost difference between singlescenario optimization problems (Keutchayan et al. 2021), the opportunity cost of switching between scenarios (Hewitt et al. 2021)
, and the distance between scenario feature vectors
(Crainic et al. 2014, Wu et al. 2022). Another stream of literature selects a scenario subset from a given set to minimize the discrepancy between the distributions described by the two sets. Probabilistic metrics, including Wasserstein distance (Römisch and Schultz 1991, Bertsimas and Mundru 2022) and FortetMourier metrics (Dupačová et al. 2003), derived from the stability theory of stochastic programming and heuristic loss functions
(Prochazka and Wallace 2020) have been proposed to measure such discrepancies.Our proposal of learning follower representations is similar to the work of Wu et al. (2022). However, our approach does not rely on the graph structure of the optimization problem and considers the impact of the leader’s solution on the followers’ costs. Our theoretical results deviate from the second stream of literature in that instead of focusing on the stability of the optimal value and optimal leader’s solution, we provide solution quality guarantees for the leader’s solution obtained from solving our MLaugmented problem.
2.3 Active Learning
Active learning aims to reduce the cost of data labeling by selecting the most “informative” subset of data for machine learning model training. Such a subset is typically constructed through an iterative process where an ML model is trained on the labeled data at each iteration and then queries one or more unlabeled data points based on informativeness measures such as prediction uncertainty (Lewis and Gale 1994), and expected model change (Settles et al. 2007). See Settles (2009) for a comprehensive review of the available measures. Instead of iteratively selecting unlabeled data, a subset of data may be selected once to improve data labeling efficiency (Sener and Savarese 2018). Our approach is most similar to Sener and Savarese (2018) because including the exact objective for a sampled set of followers in the reduced problem is analogous to costly labeling of a small set of data points once. We extend the coreset method developed by Sener and Savarese (2018) to derive a theoretical bound for the quality of the leader decisions from our MLaugmented model.
2.4 Strategic Cycling Infrastructure Planning
Previous studies on strategic cycling infrastructure planning have considered a variety of approaches. Many papers greedily choose road segments to install new cycling infrastructure using expertdefined metrics (Lowry et al. 2012, Kent and Karner 2019, Olmos et al. 2020) or visualization techniques (Larsen et al. 2013, Putta and Furth 2019). Optimizationbased approaches typically consider minimizing the travel cost (Sohn 2011, Mesbah et al. 2012, Duthie and Unnikrishnan 2014, Bagloee et al. 2016, Mauttone et al. 2017), maximizing total utility (Bao et al. 2017, Liu et al. 2019, 2021b, Lim et al. 2021, Ospina et al. 2022), or maximizing total ridership (Liu et al. 2022) given a large number of origindestination (OD) pairs. Due to the large problem size, such models are usually solved with heuristics. To the best of our knowledge, only Liu et al. (2021b), Lim et al. (2021), and Liu et al. (2022) solve the problems to optimality at a city scale by randomly sampling OD pairs or restricting the routes that can be used by each OD pair. Our work adds to the literature by developing a computationally tractable solution method that can solve larger problems and does not require restrictions such as limited routes for each OD pair.
3 Model Preliminaries
In this section, we present the general bilevel problem of interest, a reduced version based on sampling, and our proposed model. We will use the following notational conventions. Vectors and matrices are denoted in bold, and sets are denoted in calligraphic font. We let denote the indicator function, and .
3.1 The Bilevel Model
The following is the bilevel optimization problem of interest:
(1a)  
subject to  (1b)  
(1c) 
Let denote the leader’s decision with feasible set and cost function . Let be the set of followers, be the decision of follower , measure the cost of a follower’s decision, and be a nonnegative weight. To determine the optimal decision of follower , we assume the follower is optimizing an objective function subject to a nonempty feasible set that depends on the leader’s decision. Note that when and are identical for all , this problem is equivalent to a twostage stochastic program
subject to  
where the decisions of the leader and followers correspond to the firststage and secondstage decisions, respectively, is the set of secondstage scenarios, and
(suitably normalized) is the probability of realizing scenario
.To simplify the notation, we write the bilevel problem as
(2) 
where
(3) 
and
(4) 
Let be an optimal solution to (2).
3.2 Reduced Model
Given a sampled follower set , we consider the reduced problem
(5) 
where
(6) 
is as defined in (4), and the weight assigned to scenario , , may be different from , due to reweighting. Let be an optimal solution to (5).
For the special case of a twostage stochastic program, stability results have been established for problem (5). Under certain regularity conditions, it is possible to bound (Römisch and Schultz 1991, Römisch and Wets 2007, Bertsimas and Mundru 2022) and (Römisch and Schultz 1991). However, there is no bound in the literature on , which is something we develop using the model from the next section.
3.3 MLAugmented Model
Given a sampled follower set , we propose the following model
(7a)  
subject to  (7b) 
This formulation augments the reduced problem by integrating an ML model that predicts the costs of the unsampled followers in . The ML model is specified by , which predicts the cost of follower based on a feature vector . We use to denote the function class of ML models, to be a userdefined upper bound on the training loss, and to be a weight assigned to follower for calculating the training loss of . The training of on dataset is embedded into the problem via the training loss constraint (7b). We choose the L1 loss because it can be easily linearized.
Problem (8) provides a general structure for our modeling approach. Its effectiveness on a given problem depends on multiple factors: i) function class , ii) weighting scheme and upper bound , iii) sample , and iv) availability of predictive follower features . We address the first three items in Section 4 and the fourth in Section 5.
4 Integrating a Prediction Model
In Section 4.1, we introduce two classes of prediction models – one nonparametric and one parametric – that are compatible with our MLaugmented model. We provide theoretical bounds on performance in Section 4.2. Finally, we present algorithms and discuss practical implementation, based on insights from examining the bounds, in Section 4.3.
4.1 Function Classes
4.1.1 nearest neighbor regression.
For any fixed , let
where denotes the size of neighborhood considered and represents the nearest neighbors in the sampled set of follower . Since NN regression is a nonparametric method that does not require training, we do not require constraint (7b). Equivalently, we can simply set or for all . Then, problem (8) becomes
(10) 
Note that our MLaugmented model is also compatible with other nonparametric prediction models, such as locally weighted regression (Cleveland and Devlin 1988) and kernel regression (Parzen 1962), which assign nonuniform weights to nearest neighbors. However, for those models, our theoretical bound (Section 4.2.3
) becomes a complicated function of the distances from each follower to its nearest neighbors, making it difficult to formulate practical follower selection algorithms to tighten the bound. We thus leave the integration of more sophisticated nonparametric models for future work.
4.1.2 General parametric regression.
Consider a general parametric regression model written as , where represents the parameters of model and is the feasible set of model parameters. Then, problem (8) becomes
(11) 
where is the number of followers in whose nearest neighbor in is .
For model (11) to be effective, one should choose a function class that can be compactly represented with and
. For example, a linear regression model
can be incorporated using only additional continuous decision variables . An additional set of continuous decision variables and linear constraints are required to linearize the L1 training loss. Such a representation is tractable when and are small. Though our theoretical results are applicable to general regression models, we focus on linear models in this paper for computational efficiency. This is not restrictive since nonlinearity can be incorporated through feature engineering.4.2 Theoretical Properties
4.2.1 Prediction model setup.
We start by formally defining the prediction problem that is embedded into our MLaugmented model. For any fixed leader decision , we are interested in a regression problem defined in a feature space and a target space . We denote by
the probability density function of the target variable given a feature vector
. We regardas a random variable because the true mapping from features to this target may not be deterministic. For example, consider a network design problem where the follower’s cost is the length of the shortest path from an origin to a destination using the network designed by the leader. If we use a onedimensional binary feature that is 1 if both the origin and destination are in downtown and 0 otherwise, then all downtown OD pairs will share the same feature value but with shortest path lengths that could be drastically different.
4.2.2 Assumptions.
Next, we introduce several assumptions that enable the derivation of our theoretical results in Sections 4.2.3 and 4.2.4. [Independence] For any followers and leader decisions , the target (random) variables with distributions and are independent.
[Predictivity] There exists a such that, for any fixed , , and drawn from , there exists an interval such that and almost surely.
Assumption 4.2.2 holds for a wide range of applications where the original follower set is independently sampled. For example, in transportation network design, OD pairs (followers) are usually independently sampled from survey data (Lim et al. 2021) or ridership data (Liu et al. 2021b, 2022). In twostage stochastic programming, secondstage scenarios are usually from historical observations that can be regarded as independent samples (Shapiro et al. 2009, Birge and Louveaux 2011). Assumption 4.2.2 states that the chosen features should be predictive of the follower’s cost. Given a fixed feature vector , if the associated follower cost could vary wildly, it would be difficult for any prediction model to achieve good performance. In contrast, if Assumption 4.2.2 holds and is small, then achieving a small prediction error is theoretically possible if the ML model is properly chosen and trained. Note that for Theorems 4.2.3 and 4.2.4 to hold, Assumption 4.2.2 can be relaxed if the cost function is bounded. However, using predictive features helps to narrow the interval, thus tightening the bounds. Assumptions 4.2.2 and 4.2.2 enable the application of concentration inequalities to bound the deviation of the total cost of outofsample followers from its expected value.
[Continuity of Follower Cost] For any fixed , there exists a such that is Lipschitz continuous with respect to .
[Continuity of the Prediction Model] For any fixed , there exists a such that is Lipschitz continuous with respect to .
Assumption 4.2.2 limits the change in the expected follower cost as a function of the change in feature space. Similar assumptions are commonly made to derive stability results for twostage stochastic programming (Römisch and Schultz 1991, Römisch and Wets 2007, Bertsimas and Mundru 2022). Assumption 4.2.2 limits the expressive power of . Given that is trained on a small dataset associated with , limiting its expressive power is critical to avoid overfitting. Such a condition can be enforced by adding regularization constraints to . For example, for linear regression, we can set . This assumption is needed only for parametric regression models.
4.2.3 Bound on NNaugmented model solution.
Given a follower sample and a neighborhood size , let denote an optimal solution to problem (10). If Assumptions 4.2.2–4.2.2 hold, with probability at least ,
where is a distance metric in , , and .
Theorem 4.2.3 bounds the optimality gap of the solution generated by the
NNaugmented model on the original problem. The first term on the righthand side corresponds to the prediction bias and the second term corresponds to the variance and Bayes error, both controlled by
and . The bias is proportional to the sum of the average distance from each to its nearest neighbors and it increases as increases. When the sample size is fixed, the second term is controlled by . Note that , so the second term is minimized when are identical for all , which follows from the CauchySchwarz inequality. The intuition is that if the followers from are “evenly” assigned to sample followers in , then the overall prediction performance on is less affected by the random deviation of the individual cost of follower , , from its expected value. In the worst case, the second term is bounded by , which decreases as increases, reflecting the biasvariance tradeoff. Motivated by Theorem 4.2.3, we introduce how to select and to tighten the bound in Section 4.3.1.4.2.4 Bound on parametric regressionaugmented model solution.
Given a follower sample , let denote the optimal solution to problem (11) and denote the nearest neighbor of in , if Assumptions 4.2.2–4.2.2 hold, with probability at least ,
Theorem 4.2.4 bounds the optimality gap of the leader’s decision generated by the parametric regressionaugmented model on the original problem. The first term on the righthand side is controlled by the training loss , whereas the second and third terms are controlled by . To reduce the second and third terms, should be chosen such that followers are not too far away from its nearest neighbor in (second term) and the assignment of followers in to followers in should be “even” (third term). Informed by Theorem 4.2.4, we discuss how to select and in Section 4.3.2.
4.3 Practical Implementation
4.3.1 NNaugmented model.
To tighten the bound in Theorem 4.2.3, we need to determine i) the follower sample , and ii) the value of . Jointly optimizing and is challenging due to the complicated function form. We propose a practical solution method that enumerates different values of and samples one follower set for each by solving
(12) 
where is an upper bound on the sample size imposed by the available computational resources. We choose to focus on bias minimization because i) the biasminimization problem corresponds to a vectorassignment median problem (Weaver and Church 1985), which is wellstudied; ii) the variance term is obtained by applying the Hoeffding inequality (Hoeffding 1994)
without considering the moments of
, meaning that the bound might be loose; iii) varianceminimization can be implicitly considered by enforcing capacity constraints or penalizing “uneven” assignment in the objective function in problem (12).In principle, one should enumerate all and obtain a follower sample and a leader’s decision by solving problems (12) and (10), respectively, for each . The best leader’s solution can then be selected based on the objective function of problem (2). However, since both problems (10) and (12) are challenging to solve, full enumeration of introduces considerable computational burden. We thus narrow the search window for based on the predictive performance of NN regression on the follower costs under some sampled leader decisions. Specifically, we first randomly generate a set of feasible leader solutions and build a dataset of followers’ costs for each one. We then select the best neighborhood size based on the average outofsample predictive performance on these datasets. We then vary in a small interval around , and obtain the corresponding follower samples and leader solutions by solving Problems (12) and (10), respectively. The complete solution approach is presented as Algorithm 1.
4.3.2 General parametric regression.
To tighten the bound in Theorem 4.2.4, we need to determine i) the follower sample and ii) the value of . To select , we focus on minimizing the second term of the bound for similar reasons as discussed in the previous section. In this case, we solve the classical median problem
(13) 
Regarding , choosing a small value will tighten the bound, but could lead to overfitting or even worse, render problem (11) infeasible. We propose a practical approach to iteratively search for an appropriate . The search starts from a given , which is estimated using data associated with randomly generated leader decisions, and then gradually increases this value until the generated leader decision stops improving. The complete solution approach is presented as Algorithm 2.
5 Learning Follower Representations
In this section, we present a representation learning framework that maps a follower set to a dimensional feature space. The learned follower features are used as inputs to the ML model embeded in problems (10) and (11), and to quantify similarity between followers to aid follower sampling in problems (12) and (13). This framework, as presented in Algorithm 3, consists of two steps. In the first step, we construct a relationship graph , where each node represents one follower and each edge is assigned a weight reflecting the similarity between the followers. In the second step, we adopt a graph embedding algorithm to map each node in to a feature space . We emphasize that this framework is compatible with any relationship graph and graph embedding algorithms, providing flexibility to tailor the framework to deal with different applications.
5.1 Relationship Graph Construction
The core of the relationship graph construction is defining a weight for each edge that measures the similarity between followers. Many metrics have been proposed in the stochastic programming literature, with most focusing on the “opportunity cost” of switching between scenarios (i.e., followers). For example, Keutchayan et al. (2021) and Hewitt et al. (2021) define the opportunity cost of applying the firststage decision that is optimal to scenario in scenario as where and denote the optimal firststage solutions obtained by solving the singlescenario version of problem (2) with scenarios and , respectively. Building on a similar idea, Bertsimas and Mundru (2022) define a symmetric divergence measure for scenarios and as . While these opportunity costbased measures have been shown to be effective and are compatible with the proposed framework, they require solving singlescenario versions of problem (2) in advance, which is computationally expensive when is huge.
Motivated by the fact that evaluating followers’ costs given a leader’s solution is computationally cheaper, we propose a datadriven approach that quantifies follower similarity based on their costs under some sampled leader solutions. Specifically, we first randomly sample leader decisions . For each sampled , we calculate the costs for all followers by solving the associated follower problems. We then define the weight of the edge between followers based on the distance between and . Among the distance metrics we experimented with, the following definition of edge weight achieves the best performance in predicting followers’ costs. Specifically, we define the weight of edge in the relationship graph as
where is a userselected function whose target space is restricted to to avoid being dominated by a single sampled follower’s solution. For instance, letting be a small positive constant, we use in our cycling network design problem where the followers’ costs are nonnegative. For applications where the followers’ costs can be negative, functions such as might be considered. The value of can be interpreted as the average cost similarity between followers and under the sampled leader’s decisions. Intuitively, considering more leader decisions provides more accurate estimates of such a relationship.
5.2 Follower Embedding
Once is constructed, we adapt the DeepWalk algorithm proposed by Perozzi et al. (2014) to map nodes in the relationship graph to a dimensional feature space. The idea is to first generate a set of random walks in the relationship graph, and then apply the Word2Vec algorithm (Mikolov et al. 2013) to learn node features treating each node and each random walk as a word and a sentence, respectively. Unlike Perozzi et al. (2014), who generate random walks by uniformly sampling nodes connected to the current node, we generate random walks according to the weights assigned to edges incident to the current node. So, followers that yield similar results under the sampled leader decisions are likely to appear in a same walk, and thus will be positioned close to each other in the learned feature space.
6 Computational Study: Algorithm Performance on Synthetic Cycling Network Design Problem
In this section, we validate the effectiveness of our MLaugmented model with our representation learning framework for a cycling network design problem. We introduce the problem and its formulation in Section 6.1. We present two experiments to validate the predictive power of the learned follower features and the value of integrating an ML model in the optimization problem in Sections 6.2 and 6.3, respectively. We use a moderatesized synthetic network (described in 12.1) that can be solved to optimality, which enables us to accurately measure the performance of our approach.
6.1 Maximum Accessibility Network Design Problem
The goal of the maximum accessibility network design problem (MaxANDP) is to design a cycling network subject to a fixed budget such that the total accessibility of a given set of OD pairs, denoted by , is maximized. Such a set may be defined based on geographical units (Imani et al. 2019, Lim et al. 2021) or ridership data (Liu et al. 2021b). Existing studies have proposed various metrics to measure accessibility, mostly focusing on first finding one or more routes between each OD pair using the designed network and then calculating the accessibility based on the selected routes. Such measures have been shown to be correlated with travel behavior data (Saghapour et al. 2017, Imani et al. 2019) and have been widely adopted to assess the performance of cycling networks (Vale et al. 2016).
Let be a directed graph where is the set of edges and is the set of nodes, corresponding to road segments and intersections, respectively. Each edge is assigned a travel time . We denote by and the sets of incoming and outgoing edges of node , respectively. Edges and nodes are partitioned into highstress and lowstress sets according to a cycling stress assessment based on road geometry, existing infrastructure, and vehicle traffic conditions (Landis et al. 1997, Harkey et al. 1998, Furth et al. 2016). We assume that cyclists prefer cycling on lowstress roads over highstress roads. Sets with subscripts and indicate the highstress and lowstress subsets of the original set, respectively. Highstress edges and nodes are assigned costs and , respectively, corresponding to the costs of turning them into lowstress through building new infrastructure such as cycle tracks or traffic lights.
Let and , respectively, denote the edge selection and node selection variables (referred to as network design decisions), whose components are 1 if that edge or node is chosen for the installation of infrastructure that makes it low stress. Edge and node selections are subject to budgets and , respectively. Let denote the routing decision associated with OD pair . The routing problem on a network specified by and is characterized by an objective function and a feasible set . A function is used to calculate the accessibility of each OD pair based on the selected route(s). Each OD pair is weighted by a constant (e.g., population). The MaxANDP is formulated as
(14a)  
subject to  (14b)  
(14c)  
(14d)  
(14e) 
where and indicate cost vectors for highstress edges and nodes, respectively. The objective function (14a) maximizes total cycling accessibility. Constraints (14b) ensure that the selected routes are optimal for the OD pairs’ objective functions. Constraints (14c) and (14d) enforce budgets on the network design decisions. Constraints (14e) describe the domain of the decision variables.
To apply problem (14), the accessibility measure (specified by ) and the routing problem (specified by and ) should be carefully chosen based on recent travel behavior data in the studied area (Geurs and Van Wee 2004). To illustrate our methodology, we consider two problems: i) one that uses locationbased accessibility measures and shortestpath routing problems, and ii) one proposed by Liu et al. (2021b) that employs a utilitybased accessibility measure and discrete route choice models. We refer readers to Liu et al. (2021b) for the latter problem. We briefly describe the former problem next.
Locationbased accessibility measures use a decreasing function of the travel time from origin to destination, namely an impedance function, to model the dampening effect of separation (Iacono et al. 2010). We consider a piecewise linear impedance function
(15) 
where indicates a vector of edge travel times, are breakpoints, and are penalty factors for and , respectively. This function can be used to approximate commonly used impedance functions, including negative exponential, rectangular, and linear functions (visualized in 12.2). While we consider two breakpoints for simplicity, the formulation can be easily generalized to account for more.
We use the level of traffic stress (LTS) metric (Furth et al. 2016) to formulate the routing problems. Let be the nodeedge matrix describing the flowbalance constraints on , and be a vector whose and entries are 1 and , respectively, with all other entries 0. Given network design , the routing problem for is formulated as
(16a)  
subject to  (16b)  
(16c)  
(16d) 
Objective function (16a) minimizes the travel time. Constraints (16b) direct one unit of flow from to . Constraints (16c) ensure that a currently highstress edge can be used only if it is selected. Constraints (16d) guarantee that a currently highstress node can be crossed only if either the node is selected or all highstress edges that are connected to this node are selected. Constraints (16d) are an exact representation of the intersection LTS calculation scheme that assigns the lowstress label to a node if traffic signals are installed or all incident roads are lowstress (Imani et al. 2019). To ensure problem (16) is feasible, we add a dummy lowstress link from to and set its travel time to . In doing so, the travel time is set to when the destination is unreachable using the lowstress network, corresponding to an accessibility of zero, as defined in equation (15). The full formulation is given in 10. We adapt the Benders approach from Magnanti et al. (1986) to solve synthetic instances to optimality. The algorithm and its acceleration strategies is in 11.
6.2 Experiment 1: Predicting ODPair Accessibility Using ML Models
In this experiment, we randomly generate network designs and calculate the accessibility for each OD pair under each design. The accessibility associated with each network design constitutes a dataset on which we perform traintest data splits, train ML models to predict ODpair accessibility, and evaluate their outofsample prediction performance. Our goal is to i) compare the predictive power of our learned features and baseline features, ii) validate the effectiveness of our follower sampling method in improving the prediction accuracy, and iii) compare the prediction performance of online and offline ML methods.
Data generation. We consider three edge design budgets of 100, 300, and 500 distance units, corresponding to three levels of ambition, and generate 1,000 network designs for each. We set the node design budget to zero for computational tractability. We consider three locationbased accessibility measures using piecewise linear impedance functions that mimic negative exponential (EXP), linear (LIN), and rectangular (REC) functions. We also consider the utilitybased accessibility measure (UT) proposed by Liu et al. (2021b). The accessibility calculations are detailed in 12.2. In total, we generate 12,000 datasets (1,000 for each pair of design budget and accessibility measure).
Predictive power of the learned features. Since accessibility is a function of the travel time from origin to destination, which can be regarded as the optimal objective value of a twostop traveling salesman problem (TSP), we employ the features proposed by Liu et al. (2021a) for predicting TSP objective value as our baseline features. For each dataset, we perform a random traintest split and then train two ML models using the same training sample, but with different ODpair features. One uses the TSP features, while the other uses the features derived from our representation learning framework (REP). We consider ML models that are compatible with our MLaugmented model, including
NN, linear regression, lasso, and ridge regression. We vary the training sample size between 1%–5% of all OD pairs because practical implementation of our approach would be based on sampling a small number of OD pairs only. For each pair of sample size and ML model, we report the median test MAE over ten random traintest splits. The details on the baseline features and ML hyperparameters are given in
12.3 and 12.7, respectively.We observe that ML models generally perform better with the REP features than with the TSP features. As presented in Figure 1, when using NN, the REP features outperform the TSP features by a large margin (e.g. over 22.8%, 26.1%, 22.5%, and 25.7% for EXP, LIN, REC, and UT, respectively). However, linear regression performs better with the TSP features. One possible explanation is that linear regression models based on the REP features (16dimensional) are more complicated than those based on the TSP features (9dimensional), and thus are more likely to overfit the training data. When using lasso and ridge, the REP features are better than or competitive with the TSP features. Lasso and ridge also achieve lower prediction errors than linear regression, highlighting the importance of enforcing regularization constraints in the MLaugmented model (recall discussion following Assumption 4.2.2).
Effectiveness of the follower selection algorithm. Next, we use the REP features while applying different follower sampling methods, and compare the resulting test MAE achieved by the ML models. We consider i) median sampling as introduced in Section 4.3 (MED), 2) uniform sampling (UNI), which is commonly used in the transportation literature, and 3) center sampling (CEN) from Sener and Savarese (2018). Since the center and median problems are hard, we adapt heuristics from Boutilier and Chan (2020) and Gonzalez (1985) to solve them (detailed in 12.4). As a result, all methods involve some randomness. To account for the randomness, we apply each sampling method 10 times with different random seeds and report the median test MAE on each dataset.
From Figure 2, we observe that MED typically achieves the lowest test MAE, regardless of the accessibility measures, ML models, and sample sizes. Especially when the sample size is extremely small, the gap between MED and UNI can be up to 27% of the MAE achieved by UNI. MED outperforms CEN by a large margin when using locationbased accessibility measures and is competitive with CEN when using the utilitybased measure.
Online ML model versus offline ML model. We also compare the approximations for the overall leader’s objective achieved by the best online ML approach (train a
NN model using REP features and MED samples for each leader decision) versus an offline ReLU neural network (ReLU)
(Dumouchelle et al. 2022) that predicts the total accessibility of all OD pairs directly based on leader decisions (detailed in 12.6). As presented in Figure 3, our online ML model provides better approximations using only 1–2% of all OD pairs as training data. These results provide an empirical justification for integrating ML model training into the optimization problem.Summary. i) our learned features are predictive of accessibility and are generally more predictive than baseline features. ii) By applying follower sampling methods derived from solving a median problem, we can substantially improve the outofsample prediction accuracy of the ML models with our learned features. iii) the best online ML model yields a better approximation for the original objective than an offline ML model.
6.3 Experiment 2: Generating Leader Decisions using MLaugmented Models
Next, we investigate the extent to which our learned features and our follower samples can assist the MLaugmented model in generating highquality leader (i.e., transportation planner) decisions. We consider the reduced model, NNaugmented model, and linear regressionaugmented model using MED and UNI samples, totaling six methods for generating leader decisions. Results for CEN samples are in 12.9, as they are similar to UNI samples. We create 12 problem instances on the synthetic network (one for each pair of design budget and accessibility measure). We vary the sample size from 1% to 5%. We apply each model 10 times using 10 samples generated with different random seeds and report the average optimality gap of the leader decisions on the original problem.
From Figure 4, our first observation is that using MED samples enhances the performance of both the MLaugmented models and the reduced model. Significant performance gaps are observed for the two MLaugmented models in all problem instances considered. Using the MED samples on average reduces the optimality gap by 5.7% and 5.2% for the NNaugmented and linear regressionaugmented models, respectively. Even for the reduced model, our sampling strategy is competitive with or better than uniform sampling, with a difference in optimality gap of up to over 20% (e.g., EXP100, 3% training sample). These results highlight the importance of sample selection for both models.
Our second observation is that when using the MED samples, the MLaugmented models generally achieve better and more stable performance compared to the reduced model. As shown in Figure 4, the MLaugmented model (NNMED or REGMED) generally outperforms ReducedMED by a large margin, especially when using locationbased accessibility measures and when the sample size is extremely small (1%). Figure 5 compares the stability of ReducedMED and NNMED.
NNMED generally achieves a lower standard deviation of optimality gaps as ReducedMED, even though they use exactly the same samples. The integrated ML component helps to better capture the impact of leader decisions on unsampled followers, leading to solutions of higher quality and stability.
Finally, we observe no dominance of a single MLaugmented model. REGMED and NNMED are comparable on mid and largebudget instances (300 and 500). NNMED performs better when using locationbased accessibility measures and a small budget (100). One possible reason is that when the budget is limited, few OD pairs have positive accessibility while all others have zero accessibility. The featureaccessibility relationship is highly nonlinear, making NN a better fit for the accessibility prediction than linear regressions.
7 Case Study: Cycling Infrastructure Planning in the City of Toronto
In this section, we present a case study applying our methodology to the City of Toronto, Canada. Toronto has built over 65 km of new cycling infrastructure from 2019–2021, partially in response to the increased cycling demand amid the COVID19 pandemic. It plans to expand the network by 100 km from 2022–2024. We started a collaboration with the City’s Transportation Services Team in September 2020, focusing on developing quantitative tools to support cycling infrastructure planning in Toronto. As an evaluation metric, lowstress cycling accessibility has been used by the City of Toronto to support project prioritization
(City of Toronto 2021a, b). We introduce Toronto’s cycling network in Section 7.1 and use our methodology to examine actual and future potential decisions regarding network expansion in Section 7.2.7.1 Cycling Network in Toronto
We construct Toronto’s cycling network based on the centerline network retrieved from the Toronto Open Data Portal (City of Toronto 2020). We preprocess the network by removing roads where cycling is legally prohibited, deleting redundant nodes and edges, and grouping arterial roads into candidate cycling infrastructure projects (detailed in 13.1). The final cycling network has 10,448 nodes, 35,686 edges, and 1,296 candidate projects totaling 1,913 km. We use the methods and data sources summarized in Lin et al. (2021)
to calculate the LTS of each link in the cycling network. LTS1 and LTS2 links are classified as lowstress, while LTS3 and LTS4 links are highstress since LTS2 corresponds to the cycling stress tolerance for the majority of the adult population
(Furth et al. 2016). Although most local roads are lowstress, highstress arterials create many disconnected lowstress “islands”, limiting lowstress cycling accessibility in many parts of Toronto (Figure 6).The city is divided into 3,702 geographical units called dissemination areas (DAs). We define each DA centroid as an origin with every other DA centroid that is reachable within 30 minutes on the overall network being a potential destination, totalling 1,154,663 OD pairs. The lowstress cycling accessibility of a given DA is defined as the sum of the destination opportunities that are reachable within 30 minutes using the lowstress network. We use DA job counts retrieved from the 2016 Canadian census (Statistics Canada 2016) to represent opportunities. The resulting accessibility measure has been shown to be highly correlated with cycling mode choice in Toronto (Imani et al. 2019). We assume a constant cycling speed of 15 km/h for travel time calculation.
7.2 Expanding Toronto’s Cycling Network
As a part of our collaboration, in January 2021 we were asked to evaluate the accessibility impact of three project alternatives for building bike lanes (see Figure 7) to meet the direction of Toronto’s City Council within the City’s adopted Cycling Network Plan, intended to provide a cycling connection between midtown and the downtown core (City of Toronto 2021b). These projects were proposed in 2019 but their evaluation and implementation were accelerated because of increased cycling demand during COVID. We determined that alternative 2 had the largest accessibility impact. It was ultimately implemented due to its accessibility impact and other performance indicators (City of Toronto 2021b).
This decisionmaking process exemplifies the current practice of cycling infrastructure planning in Toronto: i) manually compile a list of candidate projects, ii) rank the candidate projects based on certain metrics, and iii) design project delivery plans (City of Toronto 2021c). From a computational perspective, steps i) and ii) serve as a heuristic for solving MaxANDP. This heuristic approach was necessary for several reasons, including political buyin for the three alternatives, and the computational intractability of solving MaxANDP at the city scale prior to developing the methodology in this paper. Now, we can use our MLaugmented model to search for project combinations without prespecifying candidates.
To this end, we first apply the MLaugmented model with a budget of 6 km (similar to alternative 2). The optimal projects (see Figure 7) improve Toronto’s total lowstress cycling accessibility by approximately 9.5% over alternative 2. Instead of constructing only one corridor as in alternative 2, the MLaugmented model selects six disconnected road segments. Some of them serve as connections between existing cycling infrastructure, others bridge currently disconnected lowstress subnetworks consisting of lowstress local roads.
Next, we increase the road design budget from 10 to 100 km in increments of 10 km. The 100 km budget aligns with Toronto’s cycling network expansion plan for 2022–2024 (City of Toronto 2021a). We compare our approach to a greedy heuristic that iteratively selects the candidate project that leads to the maximum increase in total accessibility until the budget is depleted. The greedy heuristic took over 3 days to expand the network by 100 km as each iteration involves solving millions of shortest path problems. Our approach took around 4 hours to find a leader decision using a sample of 2,000 OD pairs (1.7% of all OD pairs). Given this speedup, we can solve our model multiple times with different samples and report the best solution as measured by the total accessibility of all OD pairs. The computational setups of the greedy heuristic and our approach are detailed in 13.3.
As shown in Figure 8, when holding both methods to the same computational time (meaning that we solve our MLaugmented model with 21 different sets of OD pair samples and taking the best solution), our approach increases accessibility by 19.2% on average across different budgets. For example, with a budget of 70 km, we can improve the total accessibility by a similar margin as achieved by the greedy heuristic using a 100 km budget, corresponding to a savings of 18 million Canadian dollars estimated based on the City’s proposed budget (City of Toronto 2021a). If instead we used the full 100 km budget, we would achieve 11.3% greater accessibility. The improvements mainly come from identifying projects that have little accessibility impact when constructed alone, yet significantly improve the accessibility of their surrounding DAs when combined (visualized in 13.4). These projects are typically not directly connected to existing cycling infrastructure, and thus are difficult to identify through manual analysis. Finally, we note that solution quality was similar between 14 and 21 samples, meaning that with we can achieve the above gains while simultaneously reducing solution time by approximately 33%.
In summary, our approach is a valuable tool for transportation planners to search for optimal project combinations that maximize the lowstress cycling accessibility. Although this is not the only goal of cycling network design, we believe it can be useful in at least three contexts: i) in the long term, our model can be used to generate a base plan that can later be tuned by transportation planners; ii) in the near term, our approach can efficiently search for project combinations from a large pool that would be very difficult to analyze manually; iii) Given a fixed budget, our model provides a strong benchmark against which to validate the goodness of humanproposed solutions.
8 Conclusion
In this paper, we present a novel MLbased approach to solving bilevel (stochastic) programs with a large number of independent followers (scenarios). We build on two existing strategies—sampling and approximation—to tackle the computational challenges imposed by a large follower set. The model considers a sampled subset of followers while integrating an ML model to estimate the impact of leader decisions on unsampled followers. Unlike existing approaches for integrating optimization and ML models, we embed the ML model training into the optimization model, which allows us to employ general follower features that may not be compactly represented by leader decisions. Under certain assumptions, the generated leader decisions enjoy solution quality guarantees as measured by the original objective function considering the full follower set. We also introduce practical strategies, including follower sampling algorithms and a representation learning framework, to enhance the model performance. Using both synthetic and realworld instances of a cycling network design problem, we demonstrate the strong computational performance of our approach in generating highquality leader decisions. The performance gap between our approach and baseline approaches are particularly large when the sample size is small.
The authors are grateful to Sheng Liu, Merve Bodur, Elias Khalil, and Rafid Mahmood for helpful comments and discussions. This research is supported by funding from the City of Toronto and NSERC Alliance Grant 56121220. Resources used in preparing this research were provided, in part, by the Province of Ontario, the Government of Canada through CIFAR, and companies sponsoring the Vector Institute.
References
 Alizadeh et al. (2013) Alizadeh S, Marcotte P, Savard G (2013) Twostage stochastic bilevel programming over a transportation network. Transportation Research Part B: Methodological 58:92–105.
 Babier et al. (2022) Babier A, Chan TCY, Diamant A, Mahmood R (2022) Learning to optimize contextually constrained problems for realtime decisiongeneration. arXiv preprint arXiv:1805.09293 .
 Bagloee et al. (2016) Bagloee SA, Sarvi M, Wallace M (2016) Bicycle lane priority: Promoting bicycle as a green mode even in congested urban area. Transportation Research Part A: Policy and Practice 87:102–121.
 Ban and Rudin (2019) Ban GY, Rudin C (2019) The big data newsvendor: Practical insights from machine learning. Operations Research 67(1):90–108.
 Bao et al. (2017) Bao J, He T, Ruan S, Li Y, Zheng Y (2017) Planning bike lanes based on sharingbikes’ trajectories. Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1377–1386.
 Bard (2013) Bard JF (2013) Practical bilevel optimization: algorithms and applications, volume 30 (Springer Science & Business Media).

Bello et al. (2017)
Bello I, Pham H, Le QV, Norouzi M, Bengio S (2017) Neural combinatorial optimization with reinforcement learning.
International Conference on Learning Representations Workshop.  Bergman et al. (2022) Bergman D, Huang T, Brooks P, Lodi A, Raghunathan AU (2022) Janos: an integrated predictive and prescriptive modeling framework. INFORMS Journal on Computing 34(2):807–816.
 Bertsimas and Kallus (2020) Bertsimas D, Kallus N (2020) From predictive to prescriptive analytics. Management Science 66(3):1025–1044.
 Bertsimas and Mundru (2022) Bertsimas D, Mundru N (2022) Optimizationbased scenario reduction for datadriven twostage stochastic optimization. Operations Research 0(0).

Biggs et al. (2017)
Biggs M, Hariss R, Perakis G (2017) Optimizing objective functions determined from random forests, available at SSRN:
https://ssrn.com/abstract=2986630.  Birge and Louveaux (2011) Birge JR, Louveaux F (2011) Introduction to stochastic programming (Springer Science & Business Media).
 Bodur and Luedtke (2017) Bodur M, Luedtke JR (2017) Mixedinteger rounding enhanced benders decomposition for multiclass servicesystem staffing and scheduling with arrival rate uncertainty. Management Science 63(7):2073–2091.
 Boutilier and Chan (2020) Boutilier JJ, Chan TCY (2020) Ambulance emergency response optimization in developing countries. Operations Research 68(5):1315–1334.
 Buehler and Dill (2016) Buehler R, Dill J (2016) Bikeway networks: A review of effects on cycling. Transport Reviews 36(1):9–27.
 Buehler and Pucher (2021) Buehler R, Pucher J (2021) COVID19 impacts on cycling, 2019–2020. Transport Reviews 41(4):393–400.
 Candler and Townsley (1982) Candler W, Townsley R (1982) A linear twolevel programming problem. Computers & Operations Research 9(1):59–76.
 Carrión et al. (2009) Carrión M, Arroyo JM, Conejo AJ (2009) A bilevel stochastic programming approach for retailer futures market trading. IEEE Transactions on Power Systems 24(3):1446–1456.
 Chen et al. (2008) Chen X, Sim M, Sun P, Zhang J (2008) A linear decisionbased approximation approach to stochastic programming. Operations Research 56(2):344–357.
 City of Toronto (2020) City of Toronto (2020) City of Toronto open data. https://www.toronto.ca/citygovernment/dataresearchmaps/opendata/, accessed: 20200915.
 City of Toronto (2021a) City of Toronto (2021a) 2021 cycling network plan update. Accessed via https://www.toronto.ca/legdocs/mmis/2021/ie/bgrd/backgroundfile173663.pdf on July 8, 2022.
 City of Toronto (2021b) City of Toronto (2021b) ActiveTO: Lessons learned from 2020 and next steps for 2021. Accessed via https://www.toronto.ca/legdocs/mmis/2021/ie/bgrd/backgroundfile164864.pdf on July 21, 2022.
 City of Toronto (2021c) City of Toronto (2021c) Cycling network plan update — external stakeholders briefing summary. Accessed via https://www.toronto.ca/wpcontent/uploads/2021/06/8ea2ExternalBriefingMeetingSummaryJune72021.pdf on July 21, 2022.

Cleveland and Devlin (1988)
Cleveland WS, Devlin SJ (1988) Locally weighted regression: an approach to regression analysis by local fitting.
Journal of the American Statistical Association 83(403):596–610.  Crainic et al. (2014) Crainic TG, Hewitt M, Rei W (2014) Scenario grouping in a progressive hedgingbased metaheuristic for stochastic network design. Computers & Operations Research 43:90–99.
 Dill and McNeil (2016) Dill J, McNeil N (2016) Revisiting the four types of cyclists: Findings from a national survey. Transportation Research Record 2587(1):90–99.
 Dumouchelle et al. (2022) Dumouchelle J, Patel R, Khalil E, Bodur M (2022) Neur2sp: Neural twostage stochastic programming. arXiv preprint arXiv.2205.12006 .
 Dupačová et al. (2003) Dupačová J, GröweKuska N, Römisch W (2003) Scenario reduction in stochastic programming. Mathematical Programming 95(3):493–511.
 Duthie and Unnikrishnan (2014) Duthie J, Unnikrishnan A (2014) Optimization framework for bicycle network design. Journal of Transportation Engineering 140(7):04014028.
 Elmachtoub and Grigas (2022) Elmachtoub AN, Grigas P (2022) Smart “predict, then optimize”. Management Science 68(1):9–26.
 Fampa et al. (2008) Fampa M, Barroso L, Candal D, Simonetti L (2008) Bilevel optimization applied to strategic pricing in competitive electricity markets. Computational Optimization and Applications 39(2):121–142.
 Ferreira et al. (2016) Ferreira KJ, Lee BHA, SimchiLevi D (2016) Analytics for an online retailer: Demand forecasting and price optimization. Manufacturing & Service Operations Management 18(1):69–88.
 Fischetti et al. (2017) Fischetti M, Ljubić I, Sinnl M (2017) Redesigning benders decomposition for largescale facility location. Management Science 63(7):2146–2162.
 Furth et al. (2016) Furth PG, Mekuria MC, Nixon H (2016) Network connectivity for lowstress bicycling. Transportation Research Record 2587(1):41–49.
 Furth et al. (2018) Furth PG, Putta TV, Moser P (2018) Measuring lowstress connectivity in terms of bikeaccessible jobs and potential biketowork trips. Journal of Transport and Land Use 11(1):815–831.
 Gehrke et al. (2020) Gehrke SR, Akhavan A, Furth PG, Wang Q, Reardon TG (2020) A cyclingfocused accessibility tool to support regional bike network connectivity. Transportation Research Part D: Transport and Environment 85:102388.
 Geurs and Van Wee (2004) Geurs KT, Van Wee B (2004) Accessibility evaluation of landuse and transport strategies: review and research directions. Journal of Transport Geography 12(2):127–140.
 Gonzalez (1985) Gonzalez TF (1985) Clustering to minimize the maximum intercluster distance. Theoretical Computer Science 38:293–306.
 Gurobi Optimization, LLC (2022) Gurobi Optimization, LLC (2022) Gurobi Optimizer Reference Manual. URL https://www.gurobi.com.
 Harkey et al. (1998) Harkey DL, Reinfurt DW, Knuiman M (1998) Development of the bicycle compatibility index. Transportation Research Record 1636(1):13–20.
 Hewitt et al. (2021) Hewitt M, Ortmann J, Rei W (2021) Decisionbased scenario clustering for decisionmaking under uncertainty. Annals of Operations Research 1–25.
 Hochbaum and Shmoys (1985) Hochbaum DS, Shmoys DB (1985) A best possible heuristic for the kcenter problem. Mathematics of Operations Research 10(2):180–184.
 Hoeffding (1994) Hoeffding W (1994) Probability inequalities for sums of bounded random variables. The Collected Works of Wassily Hoeffding, 409–426 (Springer).
 Iacono et al. (2010) Iacono M, Krizek KJ, ElGeneidy A (2010) Measuring nonmotorized accessibility: issues, alternatives, and execution. Journal of Transport Geography 18(1):133–140.
 Imani et al. (2019) Imani AF, Miller EJ, Saxe S (2019) Cycle accessibility and level of traffic stress: A case study of Toronto. Journal of Transport Geography 80:102496.

Jia and Shen (2021)
Jia H, Shen S (2021) Benders cut classification via support vector machines for solving twostage stochastic programs.
INFORMS Journal on Optimization 3(3):278–297.  Kent and Karner (2019) Kent M, Karner A (2019) Prioritizing lowstress and equitable bicycle networks using neighborhoodbased accessibility measures. International Journal of Sustainable Transportation 13(2):100–110.
 Keutchayan et al. (2021) Keutchayan J, Ortmann J, Rei W (2021) Problemdriven scenario clustering in stochastic optimization. arXiv preprint arXiv:2106.11717 .
 Khalil et al. (2017) Khalil E, Dai H, Zhang Y, Dilkina B, Song L (2017) Learning combinatorial optimization algorithms over graphs. Advances in Neural Information Processing Systems, 6348–6358.

Khalil et al. (2016)
Khalil E, Le Bodic P, Song L, Nemhauser G, Dilkina B (2016) Learning to branch
in mixed integer programming.
Proceedings of the AAAI Conference on Artificial Intelligence
, volume 30.  Kool et al. (2019) Kool W, van Hoof H, Welling M (2019) Attention, learn to solve routing problems! 7th International Conference on Learning Representations, ICLR, URL https://openreview.net/forum?id=ByxBFsRqYm.
 Kou et al. (2020) Kou Z, Wang X, Chiu SFA, Cai H (2020) Quantifying greenhouse gas emissions reduction from bike share systems: a model considering realworld trips and transportation mode choice patterns. Resources, Conservation and Recycling 153:104534.
 Kraus and Koch (2021) Kraus S, Koch N (2021) Provisional COVID19 infrastructure induces large, rapid increases in cycling. Proceedings of the National Academy of Sciences 118(15).
 Landis et al. (1997) Landis BW, Vattikuti VR, Brannick MT (1997) Realtime human perceptions: toward a bicycle level of service. Transportation Research Record 1578(1):119–126.
 Larsen et al. (2013) Larsen J, Patterson Z, ElGeneidy A (2013) Build it. but where? the use of geographic information systems in identifying locations for new cycling infrastructure. International Journal of Sustainable Transportation 7(4):299–317.
 Leal et al. (2020) Leal M, Ponce D, Puerto J (2020) Portfolio problems with two levels decisionmakers: Optimal portfolio selection with pricing decisions on transaction costs. European Journal of Operational Research 284(2):712–727.
 Lewis and Gale (1994) Lewis DD, Gale WA (1994) A sequential algorithm for training text classifiers. SIGIR’94, 3–12.
 Li et al. (2017) Li S, Muresan M, Fu L (2017) Cycling in Toronto, Ontario, Canada: Route choice behavior and implications for infrastructure planning. Transportation Research Record 2662(1):41–49.
 Lim et al. (2021) Lim J, Dalmeijer K, Guhathakurta S, Van Hentenryck P (2021) The bicycle network improvement problem: Optimization algorithms and a case study in Atlanta. arXiv preprint arXiv:2107.04451 .
 Lin et al. (2021) Lin B, Chan TCY, Saxe S (2021) The impact of COVID19 cycling infrastructure on lowstress cycling accessibility: A case study in the City of Toronto. Findings 19069.

Liu et al. (2019)
Liu H, Szeto W, Long J (2019) Bike network design problem with a pathsize logitbased equilibrium constraint: Formulation, global optimization, and matheuristic.
Transportation Research Part E: Logistics and Transportation Review 127:284–307.  Liu et al. (2021a) Liu S, He L, Shen ZJM (2021a) Ontime lastmile delivery: Order assignment with traveltime predictors. Management Science 67(7):4095–4119.
 Liu et al. (2021b) Liu S, Shen ZJM, Ji X (2021b) Urban bike lane planning with bike trajectories: Models, algorithms, and a realworld case study. Manufacturing & Service Operations Management 0(0).
 Liu et al. (2022) Liu S, Siddiq A, Zhang J (2022) Planning bike lanes with data: Ridership, congestion, and path selection, available at SSRN: https://ssrn.com/abstract=4055703.
 Lowry et al. (2012) Lowry MB, Callister D, Gresham M, Moore B (2012) Assessment of communitywide bikeability with bicycle level of service. Transportation Research Record 2314(1):41–48.
 Magnanti et al. (1986) Magnanti TL, Mireault P, Wong RT (1986) Tailoring benders decomposition for uncapacitated network design. Netflow at Pisa, 112–154 (Springer).
 Mauttone et al. (2017) Mauttone A, Mercadante G, Rabaza M, Toledo F (2017) Bicycle network design: model and solution algorithm. Transportation Research Procedia 27:969–976.
 Mesbah et al. (2012) Mesbah M, Thompson R, Moridpour S (2012) Bilevel optimization approach to design of network of bike lanes. Transportation Research Record 2284(1):21–28.

Mikolov et al. (2013)
Mikolov T, Sutskever I, Chen K, Corrado GS, Dean J (2013) Distributed representations of words and phrases and their compositionality.
Advances in Neural Information Processing Systems, volume 26.  Mišić (2020) Mišić VV (2020) Optimization of tree ensembles. Operations Research 68(5):1605–1624.
 Morabit et al. (2021) Morabit M, Desaulniers G, Lodi A (2021) Machinelearning–based column selection for column generation. Transportation Science 55(4):815–831.
 Mueller et al. (2018) Mueller N, RojasRueda D, Salmon M, Martinez D, Ambros A, Brand C, De Nazelle A, Dons E, GauppBerghausen M, Gerike R, et al. (2018) Health impact assessment of cycling network expansions in european cities. Preventive Medicine 109:62–70.
 NaoumSawaya and Elhedhli (2013) NaoumSawaya J, Elhedhli S (2013) An interiorpoint benders based branchandcut algorithm for mixed integer programs. Annals of Operations Research 210(1):33–55.
 Nazari et al. (2018) Nazari M, Oroojlooy A, Snyder L, Takác M (2018) Reinforcement learning for solving the vehicle routing problem. Advances in Neural Information Processing Systems, 9839–9849.
 Notz and Pibernik (2022) Notz PM, Pibernik R (2022) Prescriptive analytics for flexible capacity management. Management Science 68(3):1756–1775.

Olmos et al. (2020)
Olmos LE, Tadeo MS, Vlachogiannis D, Alhasoun F, Alegre XE, Ochoa C, Targa F, González MC (2020) A data science framework for planning the growth of bicycle infrastructures.
Transportation Research Part C: Emerging Technologies 115:102640.  Ospina et al. (2022) Ospina JP, Duque JC, BoteroFernández V, Montoya A (2022) The maximal covering bicycle network design problem. Transportation Research part A: Policy and Practice 159:222–236.
 Papadakos (2008) Papadakos N (2008) Practical enhancements to the MagnantiWong method. Operations Research Letters 36(4):444–449.
 Parzen (1962) Parzen E (1962) On estimation of a probability density function and mode. The Annals of Mathematical Statistics 33(3):1065–1076.
 Pedregosa et al. (2011) Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, Vanderplas J, Passos A, Cournapeau D, Brucher M, Perrot M, Duchesnay E (2011) Scikitlearn: Machine learning in Python. Journal of Machine Learning Research 12:2825–2830.
 Perozzi et al. (2014) Perozzi B, AlRfou R, Skiena S (2014) Deepwalk: Online learning of social representations. Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 701–710.
 Prochazka and Wallace (2020) Prochazka V, Wallace SW (2020) Scenario tree construction driven by heuristic solutions of the optimization problem. Computational Management Science 17(2):277–307.
 Putta and Furth (2019) Putta T, Furth PG (2019) Method to identify and visualize barriers in a lowstress bike network. Transportation Research Record 2673(9):452–460.
 Řehůřek and Sojka (2010) Řehůřek R, Sojka P (2010) Software Framework for Topic Modelling with Large Corpora. Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, 45–50.
 Römisch and Schultz (1991) Römisch W, Schultz R (1991) Stability analysis for stochastic programs. Annals of Operations Research 30(1):241–266.
 Römisch and Wets (2007) Römisch W, Wets RB (2007) Stability of approximate solutions to convex stochastic programs. SIAM Journal on Optimization 18(3):961–979.
 Saghapour et al. (2017) Saghapour T, Moridpour S, Thompson RG (2017) Measuring cycling accessibility in metropolitan areas. International Journal of Sustainable Transportation 11(5):381–394.

Sener and Savarese (2018)
Sener O, Savarese S (2018) Active learning for convolutional neural networks: A coreset approach.
International Conference on Learning Representations.  Settles (2009) Settles B (2009) Active learning literature survey. Computer Sciences Technical Report 1648, University of Wisconsin–Madison.
 Settles et al. (2007) Settles B, Craven M, Ray S (2007) Multipleinstance active learning. Proceedings of the TwentyFirst Annual Conference on Neural Information Processing Systems, 1289–1296.
 Shapiro et al. (2009) Shapiro A, Dentcheva D, Ruszczynski A (2009) Lectures on stochastic programming: modeling and theory (SIAM).
 Sisson et al. (2006) Sisson SB, Lee SM, Burns EK, TudorLocke C (2006) Suitability of commuting by bicycle to Arizona elementary schools. American Journal of Health Promotion 20(3):210–213.
 Sohn (2011) Sohn K (2011) Multiobjective optimization of a road diet network design. Transportation Research Part A: Policy and Practice 45(6):499–511.
 Statistics Canada (2016) Statistics Canada (2016) Population and dwelling count, 2016 census. https://www12.statcan.gc.ca/censusrecensement/2016/dppd/hltfst/pdpl/Table.cfm?Lang=Eng&T=1902&PR=35&S=3&O=D&RPP=50, accessed: 20201115.
 Tang et al. (2020) Tang Y, Agrawal S, Faenza Y (2020) Reinforcement learning for integer programming: Learning to cut. International Conference on Machine Learning, 9367–9376 (PMLR).
 Vale et al. (2016) Vale DS, Saraiva M, Pereira M (2016) Active accessibility: A review of operational measures of walking and cycling accessibility. Journal of Transport and Land Use 9(1):209–235.
 Van Hoesel (2008) Van Hoesel S (2008) An overview of stackelberg pricing in networks. European Journal of Operational Research 189(3):1393–1402.
 Vinyals et al. (2015) Vinyals O, Fortunato M, Jaitly N (2015) Pointer networks. Advances in Neural Information Processing Systems, 2692–2700.
 Weaver and Church (1985) Weaver JR, Church RL (1985) A median location model with nonclosest facility service. Transportation Science 19(1):58–74.

White and Anandalingam (1993)
White DJ, Anandalingam G (1993) A penalty function approach for solving bilevel linear programs.
Journal of Global Optimization 3(4):397–419.  Wu et al. (2022) Wu Y, Song W, Cao Z, Zhang J (2022) Learning scenario representation for solving twostage stochastic integer programs. International Conference on Learning Representations.
 Zetina et al. (2019) Zetina CA, Contreras I, Cordeau JF (2019) Exact algorithms based on benders decomposition for multicommodity uncapacitated fixedcharge network design. Computers & Operations Research 111:311–324.
 Zugno et al. (2013) Zugno M, Morales JM, Pinson P, Madsen H (2013) A bilevel model for electricity retailers’ participation in a demand response market environment. Energy Economics 36:182–197.
9 Omitted Proof of Statements
Proof.
Proof of Theorem 4.2.3 We start by decomposing the optimality gap as follows.