
Smart Containers With Bidding Capacity: A Policy Gradient Algorithm for Semi-Cooperative Learning

by   Wouter van Heeswijk, et al.

Smart modular freight containers – as propagated in the Physical Internet paradigm – are equipped with sensors, data storage capability and intelligence that enable them to route themselves from origin to destination without manual intervention or central governance. In this self-organizing setting, containers can autonomously place bids on transport services in a spot market setting. However, for individual containers it may be difficult to learn good bidding policies due to limited observations. By sharing information and costs between one another, smart containers can jointly learn bidding policies, even though simultaneously competing for the same transport capacity. We replicate this behavior by learning stochastic bidding policies in a semi-cooperative multi agent setting. To this end, we develop a reinforcement learning algorithm based on the policy gradient framework. Numerical experiments show that sharing solely bids and acceptance decisions leads to stable bidding policies. Additional system information only marginally improves performance; individual job properties suffice to place appropriate bids. Furthermore, we find that carriers may have incentives not to share information with the smart containers. The experiments give rise to several directions for follow-up research, in particular the interaction between smart containers and transport services in self-organizing logistics.



1 Introduction

The logistics domain is increasingly moving towards self-organization, meaning that freight transport is planned without direct human intervention. The Physical Internet is often considered as the ultimate form of self-organizing logistics, having smart modular containers equipped with sensors and intelligence able to interact with their surroundings and to route themselves Van Heeswijk et al. (2019a). Due to the standardized shapes of the containers, they can easily be combined into full truckloads and be decomposed with equal ease. The concept also suggests that the system should be able to function without a high degree of central governance, rather converging to an organically functioning system by itself. Moreover, it is more efficient than traditional logistics systems, being able to dynamically respond to disruptions and opportunities in the logistics system utilizing intelligent decision-making policies. It is this notion of autonomy and self-organizing systems that inspired the present study.

We model smart containers as independent job agents that – on behalf of their shippers – are able to place a bid on the transport service that they wish to use. In a dynamic setting, the bid price should depend on the state of the system. Rather than having the fixed contract prices that prevail in contemporary transport markets, dynamic bidding mimics financial spot markets that constantly balance demand and supply. For instance, if a warehouse holds relatively few containers waiting for transport, a low bid may suffice to get accepted for the transport service, whereas higher bids might be required during busy times. There is also an anticipatory element involved in the bidding decision. Assuming each container has a given due date, the bidding strategy should also take into account the probability of future bids getting accepted.

The optimal bidding strategy may be influenced by many factors. We want to have a policy that provides us with the bid that minimizes expected transport costs given the current state of the system. Such an optimal policy is very difficult to derive analytically, but may be approximated by means of reinforcement learning. However, as each job terminates upon delivery of the smart container, lifespans of individual jobs are very limited. As the quality of a learned bidding policy depends to a large extent on the number of observations made, it is difficult to learn policies individually. Semi-cooperative learning (Boukhtouta et al., 2011) might alleviate this problem; even though each container endeavors to minimize its own costs rather than striving towards a common goal, smart containers can share observations to jointly learn better bidding policies that benefit the individual agents. On the other hand, if competing containers are aware of the exact bidding strategies of other containers, these strategies may easily be countered. A fully deterministic policy might therefore not be realistic; instead, we explore whether a stochastic policy yields sensible bidding decisions. Another question that we seek to answer is whether sharing additional information (other than bid prices and acceptance) helps in improving bidding policies. System information (e.g., total container volume in warehouse) may allow for better bids, but the competing containers also utilize this information for the same purpose.

The contribution of this paper is as follows. First, we explore a setting in which smart containers may place bids on transport capacity; to the best of our knowledge this topic has not been studied before from an operations research perspective. In particular, we aim to provide insights into the drivers that determine the bid price and the effects of information sharing on policy quality. Second, we present a policy gradient reinforcement learning algorithm to learn stochastic bidding policies, aiming to mimic a reality in which competing smart containers may deviate from jointly learned policies. Due to the explorative nature of the paper, we present a simplified problem setting involving a single transport service with fixed capacity that operates on the real line. The focus is on the basic mechanisms that govern bidding dynamics absent regulation and centralized control.

2 Literature

This literature overview is structured as follows. First, we discuss the role of smart containers in the Physical Internet. Second, we highlight several studies on reinforcement learning in the Delivery Dispatching Problem, which links to our problem from a carrier’s perspective. Third, we discuss studies that address the topic of bidding in freight transport.

The inspiration for this paper stems from the Physical Internet paradigm. We refer to the seminal works of Montreuil (2011) and Montreuil et al. (2013) for a conceptual outline of the Physical Internet and a thorough treatment of its foundations. It envisions an open market at which logistics services are offered, stating that (potentially automated) interactions between smart containers and other constituents of the Physical Internet determine routing decisions. Sallez et al. (2016) stress the active role that smart containers have, being able to communicate, memorize, negotiate, and learn both individually and collectively. Ambra et al. (2019) present a recent literature review of work performed in the domain of the Physical Internet. Interestingly, their overview does not mention any paper that defines the smart container itself as the targeted actor. Instead, existing works seem to focus on traditional actors such as carriers, shippers and logistics service providers, even though smart containers supposedly route themselves in the Physical Internet.

The problem studied in this paper is related to the Delivery Dispatching Problem (Minkoff, 1993), which entails dispatching decisions from a carrier’s perspective. In this problem setting, transport jobs arrive at a hub according to some external stochastic process. The carrier subsequently decides which subset of jobs to take, also considering future jobs that arrive according to the stochastic process. The most basic instances may be solved with queuing models, but more complicated variants quickly become computationally intractable, such that researchers often resort to reinforcement learning to learn high-quality policies. We highlight some recent works in this domain. Klapp et al. (2018) develop an algorithm that solves the dispatching problem for a transport service operating on the real line. Van Heeswijk and La Poutré (2018) compare centralized and decentralized transport for networks with fixed line transport services, concluding that decentralized planning yields considerable computational benefits. Van Heeswijk et al. (2015, 2019b) study a variant that includes a routing component, using value function approximation to find policies. Voccia et al. (2019) solve a variant that includes both pickups and deliveries.

We highlight some works on optimal bidding in freight transport; most of these studies seem to adopt a viewpoint in which competing carriers bid on transport jobs. For instance, Yan et al. (2018) propose a particle swarm optimization algorithm used by carriers to place bids on jobs. Miller and Nie (2019) present a solution that emphasizes the importance of integration between carrier competition, routing and bidding. Wang et al. (2019) design a reinforcement learning algorithm based on knowledge gradients to solve for a bidding structure with a broker intermediating between carriers and shippers. The broker aims to propose a price that satisfies both carrier and shipper, taking a percentage of accepted bids as its reward. In a Physical Internet context, Qiao et al. (2019) model hubs as spot freight markets where carriers can place bids on transport bids. To this end, they propose a dynamic pricing model based on an auction mechanism.

3 Problem definition

This section formally defines our problem in the form of a Markov Decision Process (MDP) model. The model is designed from the perspective of a modular container, denoted as a job, that aims to minimize its expected shipping costs over a finite discretized time horizon, with each decision epoch representing a day on which a bid for a capacitated transport service (the carrier) may be placed. In addition to this job-dependent time horizon, we define a system horizon with its own decision epochs. We thus distinguish between time indices at the individual job level and at the system level.

The cost function and job selection decision of the transport service are defined as well, yet the transport service agent has no learning capacity. As past bids and transport movements do not affect current decisions, the Markovian property is satisfied for this problem. Figure 1 illustrates the bidding problem.

Figure 1: Visual representation of the bidding problem. Modular smart containers (jobs) simultaneously place bids on a transport service with finite capacity; the bids are accepted or rejected based on their marginal contributions.

We now define the jobs, with each job representing a modular container that needs to be transported. A job is represented by an attribute vector with three elements: the time till due date, the destination position and the job volume.

The time-till-due-date attribute indicates how many decision epochs remain until the latest possible shipment date. When a new job enters the system, its job horizon is set according to this attribute; note that this horizon may differ among jobs. The attribute is decremented with each time step. When it equals zero and the job has not been shipped, the job is considered failed, incurs a penalty, and is removed from the system. The destination attribute indicates the position of the destination on the real line; the further away, the higher the transport costs. The job volume indicates how much transport capacity the job requires. The problem state is defined as the set of jobs present in the system at a given decision epoch, and the set of feasible states is defined per decision epoch accordingly.

At each decision epoch, a transport service with fixed capacity departs along the real line. To decide which jobs to take, the transport service solves a 0-1 knapsack problem using dynamic programming (Kellerer et al., 2004). The value of each job is its bid price minus its transport costs. Jobs with negative values are always rejected. Note that when the transport capacity exceeds the cumulative job volume, the transport service accepts all jobs with a positive value. The carrier’s decision vector contains one binary element per job, indicating whether the job is selected, and the set of feasible selections contains all such vectors whose total volume does not exceed the transport capacity. The transport service’s cost function for shipping a single job is a function of distance and volume. The carrier maximizes its direct reward by choosing the feasible selection with the largest total value, i.e., the sum of bid prices minus transport costs over the selected jobs.
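The selection step can be illustrated with a textbook 0-1 knapsack dynamic program. The sketch below is a minimal example under stated assumptions rather than the implementation used in the experiments: job volumes and capacity are taken to be integers, the carrier's cost is assumed linear in distance and volume with a hypothetical parameter `cost_rate`, and the function name `select_jobs` is ours.

```python
from typing import List, Sequence


def select_jobs(bids: Sequence[float], distances: Sequence[float],
                volumes: Sequence[int], capacity: int,
                cost_rate: float = 0.1) -> List[int]:
    """Return a 0/1 selection vector maximizing total (bid - transport cost).

    Assumptions (not taken from the paper): integer volumes and capacity,
    and transport cost = cost_rate * distance * volume.
    """
    n = len(bids)
    values = [b - cost_rate * d * v for b, d, v in zip(bids, distances, volumes)]

    # best[i][c] = best total value using the first i jobs with capacity c
    best = [[0.0] * (capacity + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        value, weight = values[i - 1], volumes[i - 1]
        for c in range(capacity + 1):
            best[i][c] = best[i - 1][c]  # default: reject job i
            # Only jobs with a strictly positive value are ever accepted
            if value > 0 and weight <= c:
                candidate = best[i - 1][c - weight] + value
                if candidate > best[i][c]:
                    best[i][c] = candidate

    # Backtrack to recover which jobs were selected
    selection = [0] * n
    c = capacity
    for i in range(n, 0, -1):
        if best[i][c] != best[i - 1][c]:
            selection[i - 1] = 1
            c -= volumes[i - 1]
    return selection
```

With scarce capacity the carrier keeps only the most valuable job; with ample capacity it accepts every job whose bid exceeds its transport cost.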



From the perspective of jobs, actions (i.e., bids) are defined at the level of individual containers. All bids are placed in parallel and independently of one another, yielding a vector of bids. Unshipped jobs with time remaining till their due date incur holding costs, and unshipped jobs that have reached their due date incur a failed job penalty; both are proportional to the job volume. At any given decision epoch, the direct reward for an individual job is therefore the negative of the cost it incurs: its bid price if the bid is accepted, and otherwise the holding cost or the failed job penalty.
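A compact way to read the reward structure is as a three-case function. The sketch below assumes that an accepted job pays exactly its bid, and uses hypothetical default cost parameters (`holding_cost`, `penalty`) per volume unit; the function name is illustrative only.

```python
def direct_reward(accepted: bool, bid: float, volume: float,
                  time_till_due: int, holding_cost: float = 1.0,
                  penalty: float = 10.0) -> float:
    """Direct reward (negative cost) of one job at one decision epoch.

    Assumption: an accepted job pays its bid; a rejected job pays either
    holding costs (time remaining) or the failed job penalty (due date reached).
    """
    if accepted:
        return -bid
    if time_till_due > 0:
        return -holding_cost * volume
    return -penalty * volume
```

For a job of volume 4, a rejected bid costs 4 while time remains and 40 once the due date is reached, which indicates why bids may be expected to rise as the due date approaches.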

To obtain the bid at the current decision epoch, we seek the bid that maximizes the expected cumulative reward (i.e., minimizes the expected costs) over the horizon of the job. Note that, as a container cannot observe the bids of other jobs, the cost function of the transport service, or future states, we can only make decisions based on expected rewards in the stochastic environment. The solution method presented in Section 4 further addresses this problem.

Finally, we define the transition function for the system state that applies in the time step from one decision epoch to the next. During this step two changes occur: we (i) decrease the time till due date of jobs that are not shipped or otherwise removed from the system and (ii) add newly arrived jobs to the state. The newly arriving jobs are collected in a separate set of job arrivals. The transition function is defined by the following sequential procedure:

Algorithm 1

Transition function

0: Input: Current state, job arrivals, shipping selection
1: Initialize next state
2: Copy state (post-decision state)
3: Loop over all jobs
4:    Remove shipped job
5:    Remove unshipped job with due date 0
6:    Decrement time till due date
7: Merge existing and new job sets
8: Output: New state
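The steps of Algorithm 1 might be sketched in Python as follows. The `Job` dataclass and the function name `transition` are hypothetical names introduced for illustration; the logic follows the listed steps (remove shipped jobs, remove unshipped jobs whose due date is reached, decrement the remaining jobs, merge arrivals).

```python
from dataclasses import dataclass, replace
from typing import List, Sequence


@dataclass(frozen=True)
class Job:
    time_till_due: int   # decision epochs left until the latest shipment date
    distance: float      # position of the destination on the real line
    volume: int          # transport capacity required


def transition(state: Sequence[Job], selection: Sequence[int],
               arrivals: Sequence[Job]) -> List[Job]:
    """Compute the next state from the current state, the carrier's
    0/1 shipping selection and the newly arrived jobs."""
    next_state: List[Job] = []
    for job, shipped in zip(state, selection):
        if shipped:
            continue                      # shipped jobs leave the system
        if job.time_till_due == 0:
            continue                      # failed jobs are removed
        # Remaining jobs move one epoch closer to their due date
        next_state.append(replace(job, time_till_due=job.time_till_due - 1))
    next_state.extend(arrivals)           # merge existing and new jobs
    return next_state
```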

4 Solution method

To learn the bidding strategy of the containers, we draw from the widely used policy gradient framework. For a detailed description and theoretical background, we refer to the REINFORCE algorithm by Williams (1992). As noted earlier, the policy gradient algorithm returns a stochastic bidding policy, reflecting the deviations in bid prices adopted by individual containers. As bids can take any real value, we must adopt a policy suitable for continuous action spaces. In this paper we opt for a Gaussian policy, drawing bids from a normal distribution. The mean and standard deviation of this distribution are learned using reinforcement learning.
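Drawing a bid from a Gaussian policy can be illustrated with inverse transform sampling from the Python standard library. This is a minimal sketch: `sample_bid` is a hypothetical helper, `statistics.NormalDist` requires Python 3.8 or later, and the clamping of the uniform draw is our own safeguard because the inverse CDF is undefined at 0 and 1.

```python
import random
from statistics import NormalDist
from typing import Optional


def sample_bid(mean: float, sigma: float,
               rng: Optional[random.Random] = None) -> float:
    """Draw one bid from N(mean, sigma^2) via the inverse normal CDF."""
    rng = rng or random.Random()
    u = rng.random()                       # uniform draw in [0, 1)
    u = min(max(u, 1e-12), 1.0 - 1e-12)    # keep strictly inside (0, 1)
    return NormalDist(mean, sigma).inv_cdf(u)
```

A small standard deviation makes the sampled bids nearly deterministic, whereas a large one spreads bids widely around the mean and thus explores more of the action space.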

In policy-based reinforcement learning, the policy maps the state directly to an action, after which we observe the corresponding rewards. Each simulated episode yields a batch of selected actions and related rewards according to the prevailing stochastic policy. Under our Gaussian policy, each container draws its bid independently of the other containers. The randomness in action selection allows the policy to keep exploring the action space and to escape from local optima. From the actions and rewards observed during each episode, we deduce which actions result in above-average rewards and compute gradients that update the policy in that direction, yielding a sequence of updated policies until the final episode is reached. For consistent policy updates, we only use observations for jobs that are either shipped or removed, for which we need some additional notation. We keep a vector that counts the number of bid observations for such completed jobs. For example, if a job with an original due date of 4 is shipped at its third bid, we increment the three corresponding counters by 1. Finally, we store all completed jobs (i.e., either shipped or failed) in a set. For each episode, the cumulative reward that a job associates with a given bid is the sum of the direct rewards it receives from that decision epoch until its completion.
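The cumulative reward attached to each bid of a completed job can be computed as a reward-to-go sum. The sketch below assumes undiscounted rewards over the job's short horizon; the function name is illustrative.

```python
from typing import List, Sequence


def cumulative_rewards(rewards: Sequence[float]) -> List[float]:
    """Return, for each decision epoch of a completed job, the sum of its
    direct rewards from that epoch until completion (no discounting assumed)."""
    out: List[float] = []
    running = 0.0
    for r in reversed(rewards):   # accumulate from the last epoch backwards
        running += r
        out.append(running)
    out.reverse()
    return out
```

For instance, a job that pays holding costs of 3 twice and is then shipped for a bid of 20 has direct rewards `[-3.0, -3.0, -20.0]`, so its first bid is associated with a cumulative reward of -26.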

At the end of each episode, the observed cumulative rewards are collected in a vector and stored, together with the corresponding states and actions, in an information vector; this vector contains the states, actions and rewards necessary for the policy updates (i.e., a history similar to the well-known SARSA trajectory). The decision-making policy is updated according to the policy gradient theorem (Sutton and Barto, 2018).

Essentially, the theorem states that the gradient of the objective function is proportional to the value functions multiplied by the policy gradients for all actions in each state, weighted by the probability measure over states implied by the prevailing decision-making policy.
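For reference, the theorem can be written in the generic notation of Sutton and Barto (2018); the symbols below are theirs and are not necessarily those used elsewhere in this paper. Here $\mu(s)$ is the on-policy state distribution, $q_\pi(s,a)$ the action value, and $G_t$ the observed cumulative reward used by REINFORCE as a sample of that value.

```latex
\nabla_\theta J(\theta) \;\propto\; \sum_{s} \mu(s) \sum_{a} q_\pi(s,a)\, \nabla_\theta \pi(a \mid s, \theta)
\;=\; \mathbb{E}_\pi\!\left[ G_t \, \nabla_\theta \ln \pi(A_t \mid S_t, \theta) \right]
```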

We proceed to formalize the policy gradient theorem for our problem setting. The decision-making policy is defined by a vector of weight parameters. Furthermore, a feature vector distills the most salient attributes from the problem state, e.g., the number of jobs waiting to be transported or the average time till due date. We will further discuss the features in Section 5. For the Gaussian case, the policy is a normal density over the bid price, with the Gaussian mean given by the linear combination of weights and features and with a parametrized standard deviation. The action may be obtained from the inverse normal distribution. The corresponding gradients are the derivatives of the log-density with respect to the weight vector and the standard deviation.
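In generic notation, a Gaussian policy with a linear mean and its log-density gradients take the following standard form. The symbols are assumptions chosen for illustration: $a$ is the bid price, $\phi(s)$ the feature vector, $\theta$ the weight vector, $\mu_\theta(s)$ the Gaussian mean and $\sigma$ the standard deviation.

```latex
\pi_\theta(a \mid s) = \frac{1}{\sigma\sqrt{2\pi}}
  \exp\!\left( -\frac{\bigl(a - \mu_\theta(s)\bigr)^2}{2\sigma^2} \right),
\qquad \mu_\theta(s) = \theta^\top \phi(s)

\nabla_\theta \ln \pi_\theta(a \mid s) = \frac{\bigl(a - \mu_\theta(s)\bigr)\,\phi(s)}{\sigma^2},
\qquad
\frac{\partial}{\partial \sigma} \ln \pi_\theta(a \mid s) = \frac{\bigl(a - \mu_\theta(s)\bigr)^2 - \sigma^2}{\sigma^3}
```

The second gradient follows from $\ln \pi_\theta = -\ln\sigma - \tfrac{1}{2}\ln(2\pi) - (a-\mu_\theta(s))^2 / (2\sigma^2)$, whose derivative with respect to $\sigma$ is $-1/\sigma + (a-\mu_\theta(s))^2/\sigma^3$.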


The gradients are used to update the policy parameters. As the observations may exhibit large variance, we add a non-biased baseline value (i.e., not directly depending on the policy), namely the average observed value during the episode (Sutton and Barto, 2018).

For the Gaussian policy, the weight update adds to the current weight vector the learning rate for the mean times the baseline-corrected cumulative reward times the gradient with respect to the weights. The standard deviation is updated analogously, using its own learning rate and the gradient with respect to the standard deviation.
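One possible realization of such an update step is sketched below. It is an illustration under stated assumptions, not the exact update rule of this paper: the baseline is the plain average of the cumulative rewards in the batch, gradients are averaged over the batch, and a hypothetical floor `min_sigma` keeps the standard deviation positive.

```python
from typing import List, Sequence, Tuple


def update_policy(theta: Sequence[float], sigma: float,
                  features: Sequence[Sequence[float]],
                  bids: Sequence[float], returns: Sequence[float],
                  alpha_mu: float, alpha_sigma: float,
                  min_sigma: float = 1e-3) -> Tuple[List[float], float]:
    """One REINFORCE-style update of a linear-Gaussian bidding policy."""
    n = len(bids)
    if n == 0:
        return list(theta), sigma
    baseline = sum(returns) / n           # average observed value (assumed form)
    grad_theta = [0.0] * len(theta)
    grad_sigma = 0.0
    for phi, bid, ret in zip(features, bids, returns):
        mu = sum(w * f for w, f in zip(theta, phi))   # state-dependent mean bid
        advantage = ret - baseline
        for k, f in enumerate(phi):
            grad_theta[k] += advantage * (bid - mu) * f / sigma ** 2
        grad_sigma += advantage * ((bid - mu) ** 2 - sigma ** 2) / sigma ** 3
    new_theta = [w + alpha_mu * g / n for w, g in zip(theta, grad_theta)]
    new_sigma = max(min_sigma, sigma + alpha_sigma * grad_sigma / n)
    return new_theta, new_sigma
```

In a batch where the higher of two bids earned the better cumulative reward, the mean shifts upward; if bids far from the mean performed better than average, the standard deviation grows.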


Intuitively, this means that after each episode we update the feature weights – which in turn provide the state-dependent mean bidding price – and the standard deviation of the bids. The mean bidding price – taking into account both individual job properties and the state of the system – represents the bid that is expected to minimize overall costs. If effective bids are very close to the mean, the standard deviation will decrease and the bidding policy will converge to an almost deterministic policy. However, if there is an expected benefit in varying bids, the standard deviation may grow larger. The algorithmic outline to update the parametrized policy is as follows:

Algorithm 2

Outline of the policy gradient bidding algorithm (based on Williams (1992))

0: Input: Differentiable parametrized policy
1: Set step sizes
2: Initialize standard deviation
3: Initialize policy parameters
4: Loop over episodes
5:    Initialize completed job set
6:    Initialize information set
7:    Generate initial state
8:    Loop over finite time horizon
9:      Bid placement jobs
10:      Job selection carrier, Eq. (1)
11:      Compute cumulative rewards
12:      Loop over completed jobs
13:        Update set of completed jobs
14:        Update number of completed jobs
15:      Store information
16:      Generate job arrivals
17:      Transition function, Algorithm 1
18:    Loop till maximum due date
19:      Loop over completed jobs
20:        Update Gaussian mean, Eq. (4)
21:        Update standard deviation, Eq. (7)
22: Output: Return tuned policy

5 Numerical experiments

This section describes the numerical experiments and the results. Section 5.1 explores the parameter space and aids in tuning the hyperparameters. The algorithm is written in Python 3.7 and available online.


5.1 Exploration of parameter space

The purpose of this section is twofold: we explore the effects of various parameter settings on the performance of the algorithm and select a suitable set of parameters for the remainder of the experiments. We make use of the instance settings summarized in Table 1. Note that the penalty for failed jobs is the main driver in determining bid prices; together with holding costs, it intuitively represents the maximum price the smart container is willing to bid to be transported.

Max. # job arrivals [0-10]
Due date [1-5]
Job transport distance [10-100]
Job volume [1-10]
Holding cost (per volume unit) 1
Penalty failed job (per volume unit) 10
Transport costs per mile (per volume unit) 0.1
Transport capacity 80
Table 1: Instance settings

To parametrize the policy we use several features. First, we use a scalar that serves as the bias. Second, we use the individual job properties of the job placing the bid, i.e., the time till due date, the job’s transport distance and the container volume. Third, in case the job shares its own properties with the system, it also uses the generic system features: the total number of jobs, average distance, total volume, and average due date. Recall that these system features only include the data of other smart containers that share their information. All weight parameters are initialized at 0, yielding initial bid prices of 0.
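The feature construction described above might look as follows. This is a hypothetical sketch: the tuple layout `JobProps`, the function names, and the choice to set the system features to zero for non-sharing jobs are our assumptions, and feature scaling is omitted for brevity.

```python
from typing import List, Sequence, Tuple

# (time till due date, transport distance, volume); layout assumed for illustration
JobProps = Tuple[int, float, float]


def build_features(job: JobProps, shares: bool,
                   sharing_jobs: Sequence[JobProps]) -> List[float]:
    """Bias, own job properties and (if the job shares) system features."""
    time_till_due, distance, volume = job
    features = [1.0, float(time_till_due), float(distance), float(volume)]
    if shares and sharing_jobs:
        n = len(sharing_jobs)
        features += [
            float(n),                                  # total number of jobs
            sum(j[1] for j in sharing_jobs) / n,       # average distance
            float(sum(j[2] for j in sharing_jobs)),    # total volume
            sum(j[0] for j in sharing_jobs) / n,       # average due date
        ]
    else:
        features += [0.0, 0.0, 0.0, 0.0]   # assumption: no system view
    return features


def mean_bid(theta: Sequence[float], features: Sequence[float]) -> float:
    """Linear mean bid: inner product of weights and features."""
    return sum(w * f for w, f in zip(theta, features))
```

With all weights initialized at zero, `mean_bid` returns 0 for any feature vector, matching the initial bid price stated above.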

We perform a sequential search to set the simulation parameters. First, we tune the learning rate for the mean and the learning rate for the standard deviation, starting with a standard normal distribution. We test a range of learning rates for both parameters and select values that are stable (i.e., no exploding gradients) and converge reasonably fast. Taking smaller learning rates yields no evident advantages in terms of stability or eventual policy quality. Figure 2 shows two examples of parameter convergence under various learning rates.

Figure 2: Convergence of the Gaussian mean and standard deviation (normalized) for various learning rates; panels (a) and (b) show two learning rate settings. Higher learning rates achieve both faster convergence and lower average bid prices.

Next, we tune the initial bias weight and the initial standard deviation, testing several values for each. Anticipating non-zero bids, we test several initializations with nonzero bias weights. Large standard deviations allow for much exploration early on, but may also result in unstable policies or slow convergence. From the experiments, we observe that the bias weight converges to a small or negative weight and that there is no benefit in different initializations. For the standard deviation, we find that a large initial value yields the best results; the exploration triggered by large initial standard deviations helps to avoid local optima early on. In terms of performance, the average transport costs are 7.3% lower than under the standard normal initialization. Standard deviations ultimately converge to similarly small values regardless of the initialization.

Finally, we determine the number of episodes and the length of the time horizon of each episode. Longer horizons lead to larger and therefore more reliable batches of completed jobs per episode, but naturally require more computational effort. Thus, we compare settings for which the total number of time steps is equivalent. Each alternative simulates 1,000,000 time steps, with the number of episodes adjusted to the horizon length. To test convergence, after each 10% of completed training episodes we perform 10 validation episodes, always with the same horizon length for fair comparisons, to evaluate policy quality. We find that having large batches provides notable advantages. Furthermore, in all cases 400,000 time steps appear sufficient to converge to stable policies. To illustrate the findings, Figure 3(a) shows the average transport costs measured after each 100,000 time steps (using the then-prevailing policy); Figure 3(b) shows the quality of the eventual policies for each time horizon.

(a) Comparison offline quality.
(b) Comparison final policy quality.
Figure 3: Policy performance for various time horizons. The horizon yields the best overall performance; batches too small diminish performance.

The final parameters to be used for the remainder of the numerical experiments are summarized as follows: , , , and .

5.2 Analysis of policy performance

Having determined the parameter settings, we proceed to analyze the performance of the jointly learned policies. This section addresses the effects of information sharing, the relevance of the used features in determining the bid, and the behavior of the bidding policy and its impact on carrier profits. All results in this section correspond to the performance of policies after training is completed. To obtain additional insights we sometimes define alternative instances; for conciseness we do not explicitly define the settings of these alternatives.

We first evaluate the effects of information sharing. According to preset ratios, we randomly assign whether or not a container shares its information with the system. Only containers that share information are able to see aggregate system information. Clearly, the more containers participate, the more accurate the system information will be. We test sharing rates ranging from 0% to 100% with increments of 10%; Table 2 shows the results. We observe that performance under full information sharing and no information sharing is almost equivalent, with partial information sharing always performing worse. The latter observation may be explained by the distorted view presented by only measuring part of the system state.

Feature 0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
Scalar -10.10 -10.28 -8.03 -11.83 -10.98 -4.81 -7.17 -10.46 -8.71 -9.46 -12.12
Total job volume n/a 0.06 0.06 -0.12 0.44 0.03 0.29 0.42 1.18 0.64 0.63
Average due date n/a -2.37 -2.73 -3.28 -2.57 -2.73 -2.69 -2.49 -1.79 -2.02 -1.56
Average distance n/a 1.83 3.07 2.59 2.36 3.18 3.58 3.16 3.44 2.26 2.92
# jobs n/a -0.03 -0.31 -0.69 -0.21 -0.86 -0.77 -1.39 -0.02 -1.25 -1.70
Job volume 59.20 57.14 56.83 58.46 58.70 55.27 56.13 56.55 56.69 57.97 60.10
Job due date -22.45 -23.25 -24.02 -21.16 -21.95 -23.89 -25.96 -22.89 -22.26 -22.82 -21.27
Job distance 49.49 48.37 45.88 49.38 48.81 45.98 45.27 46.85 48.00 49.15 49.54
Average reward -46.87 -51.78 -55.96 -49.12 -49.54 -57.75 -56.75 -55.71 -53.12 -49.43 -46.32
Table 2: Feature weights and average rewards for various information sharing rates. System features are not used when no container shares information (0%).

In the policy parametrization all features are scaled to the same domain, such that the magnitudes of the weights are comparable. We see that the generic system features have relatively little impact on the overall bid price, which is consistent with the limited difference observed between full information sharing and no information sharing. Job volume and job distance are by far the most significant drivers of a job’s bid price. This is in line with expectations: the costs incurred by the carrier depend on these two factors, so larger or more distant jobs require higher compensation. Furthermore, holding and penalty costs are proportional to the job volume. The relationship with the time till due date is negative; if more time remains, it might be prudent to place lower bids without an imminent risk of penalties. On average, each job places 1.36 bids and 99.14% of jobs are ultimately shipped; the capacity of the transport service is rarely restrictive. Figure 4 illustrates bidding behavior with respect to transport distance and time till due date, respectively.

(a) Bids relative to transport distance.
(b) Sample paths of bids over time.
Figure 4: Visualizations of bidding policies with respect to transport distance and due date. Bids tend to increase when (a) transport distance is larger and (b) the job is closer to its due date.
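To make the relative magnitudes of the weights concrete, the following worked example evaluates the linear mean bid using the weights reported in Table 2 for full information sharing (the rightmost column). The scaled feature values of 0.5 are purely hypothetical, and a scaling to the interval [0, 1] is assumed.

```python
# Weights taken from the 100% column of Table 2
weights = {
    "scalar": -12.12,
    "total_job_volume": 0.63,
    "average_due_date": -1.56,
    "average_distance": 2.92,
    "num_jobs": -1.70,
    "job_volume": 60.10,
    "job_due_date": -21.27,
    "job_distance": 49.54,
}


def mean_bid(scaled_features: dict) -> float:
    """Bias plus the weighted sum of (assumed) scaled feature values."""
    return weights["scalar"] + sum(
        weights[name] * value for name, value in scaled_features.items()
    )


# Hypothetical mid-range job in a mid-range system (all features at 0.5)
mid_job = {name: 0.5 for name in weights if name != "scalar"}
# Same job, but with no time remaining till its due date
urgent_job = dict(mid_job, job_due_date=0.0)

print(round(mean_bid(mid_job), 3))      # 32.21
print(round(mean_bid(urgent_job), 3))   # 42.845
```

In this illustration the four system features together contribute only about 0.15 to the mean bid, whereas job volume and job distance contribute roughly 30 and 25, respectively.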

Next, we discuss some behavior of the bidding policy and its effect on carrier profits. As our carrier is a passive agent, we omit detailed quantitative assessments here. We simulate various toy-sized instances, adopting simplified deterministic settings with a single container type (time till due date is zero, identical volume and distance). If transport capacity is guaranteed to suffice, the learned bid prices converge to almost zero. If two jobs always compete for transport capacity that accommodates only one of them, so that the other incurs a penalty, the bid will be slightly below the penalty cost. Several other experiments with scarce capacity also show bid prices slightly below the expected penalty costs that would otherwise be incurred. For our base scenario, the profit margin for the carrier is 20.2%. This positive margin indicates that the features do not encapsulate all information required to learn the minimum bidding price. For comparison, we run another scenario in which the carrier’s transport costs – which are unknown to the smart containers – are the sole feature; in that case all jobs trivially learn the minimum bidding price. This result implies that carriers may have financial incentives not to divulge too much information to the smart containers.

For the carrier, the bidding policies deployed by the smart containers greatly influence its profit. Scarcity of transport capacity drives up bid prices, yet also increases the likelihood of missed revenue. To gain more insight into this trade-off, we simulate various levels of transport capacity, from scarce to abundant. These experiments indeed confirm that a (non-trivial) capacity level exists that maximizes profit. In addition, a carrier need not accept all jobs whose bids exceed their marginal transport costs, as we presumed in this paper. Modeling the carrier as an active agent is beyond the scope of this paper.

We summarize the main findings, reiterating that the setting of this paper is a highly stylized environment. The key insights are as follows:

  • Utilizing global system information only marginally reduces jobs’ transport costs compared to sharing no system information;

  • Jointly learned policies converge to stable bidding policies with small standard deviations;

  • Jobs with more time remaining till their due date are prone to place lower bids;

  • Carriers have an incentive not to disclose true cost information when transport capacity is abundant.

6 Conclusions

Traditional transport markets rely on (long-term) contracts between shippers and carriers in which price agreements are made. In contrast, self-organizing systems are expected to evolve into some form of spot market where demand and supply are dynamically matched based on the current state of the system. This paper explores the concept of smart containers placing bids on restricted transport capacity. We design a policy gradient algorithm to mimic joint learning of a bidding policy, sharing observations between autonomous smart modular containers. The stochastic policy reflects deviations made by individual containers, given that deterministic policies are easy to counter in a competitive environment. This stochastic approach appears effective. Standard deviations converge to small values, implying stable bidding policies. The performance of the policy is consistent with the effects of job volume, transport distance and time till due date, which are used to parametrize the bidding policy.

Numerical experiments show that sharing system information only marginally reduces bidding costs; individual job properties are the main driver in setting bids. The limited difference in policy quality with and without sharing system information is an interesting observation. This result implies that smart containers would only need to (anonymously) share their key properties, bid prices and incurred costs. There is no apparent need to share information on a system-wide level, which would greatly ease the system design.

The profitability of the transport service – which is modeled as a passive agent in this paper – strongly depends on the bidding policy of the smart containers. Experiments with varying transport capacities show that, in turn, the carrier can also influence bidding policies by optimizing the transport capacity that is offered. Without central governance, imbalances between smart containers and transport services may degrade performance. Based on the findings presented in this paper, the dynamic interplay between carriers and smart containers merits closer attention.

We reiterate that this study is of an explorative nature; there are many avenues for follow-up research. In terms of algorithmic improvements, actor-critic methods (learning functions for expected downstream values rather than merely observing them) would be a logical continuation. Furthermore, the linear expression to compute the bidding price could be replaced by neural networks that capture potential non-linear structures. In addition, the basic problem presented in this paper lends itself to various extensions. So far the carrier has been assumed to be a passive agent, offering fixed transport capacity and services regardless of (anticipated) income. In reality the carrier will also make intelligent decisions based on the bidding policies of smart containers. A brokerage structure might also emerge in the Physical Internet context. Finally, we considered only a single transport service operating on the real line. Using the same algorithmic setup, this setting could be extended to more realistic networks with multiple carriers, routes and destination nodes.


References

  • T. Ambra, A. Caris, and C. Macharis (2019) Towards freight transport system unification: reviewing and combining the advancements in the physical internet and synchromodal transport research. International Journal of Production Research 57 (6), pp. 1606–1623. Cited by: §2.
  • A. Boukhtouta, J. Berger, W. B. Powell, and A. George (2011) An adaptive-learning framework for semi-cooperative multi-agent coordination. In 2011 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL), pp. 324–331. Cited by: §1.
  • H. Kellerer, U. Pferschy, and D. Pisinger (2004) Multidimensional knapsack problems. In Knapsack problems, pp. 235–283. Cited by: §3.
  • M. A. Klapp, A. L. Erera, and A. Toriello (2018) The one-dimensional dynamic dispatch waves problem. Transportation Science 52 (2), pp. 402–415. Cited by: §2.
  • J. Miller and Y. M. Nie (2019) Dynamic trucking equilibrium through a freight exchange. Transportation Research Part C: Emerging Technologies. Cited by: §2.
  • A. S. Minkoff (1993) A Markov decision model and decomposition heuristic for dynamic vehicle dispatching. Operations Research 41 (1), pp. 77–90. Cited by: §2.
  • B. Montreuil, R. D. Meller, and E. Ballot (2013) Physical Internet foundations. In Service orientation in holonic and multi agent manufacturing and robotics, pp. 151–166. Cited by: §2.
  • B. Montreuil (2011) Toward a Physical Internet: meeting the global logistics sustainability grand challenge. Logistics Research 3 (2-3), pp. 71–87. Cited by: §2.
  • B. Qiao, S. Pan, and E. Ballot (2019) Dynamic pricing model for less-than-truckload carriers in the physical internet. Journal of Intelligent Manufacturing 30 (7), pp. 2631–2643. Cited by: §2.
  • Y. Sallez, S. Pan, B. Montreuil, T. Berger, and E. Ballot (2016) On the activeness of intelligent physical internet containers. Computers in Industry 81, pp. 96–104. Cited by: §2.
  • R. S. Sutton and A. G. Barto (2018) Reinforcement learning: an introduction. MIT press. Cited by: §4, §4.
  • W. J. A. Van Heeswijk and H. La Poutré (2018) Scalability and performance of decentralized planning in flexible transport networks. In 2018 IEEE International Conference on Systems, Man, and Cybernetics (SMC), pp. 292–297. Cited by: §2.
  • W. J. A. Van Heeswijk, M. R. K. Mes, and J. M. J. Schutten (2019a) Transportation management. In Operations, Logistics and Supply Chain Management, pp. 469–491. Cited by: §1.
  • W. J. A. Van Heeswijk, M. R. K. Mes, and J. M. J. Schutten (2015) An approximate dynamic programming approach to urban freight distribution with batch arrivals. In International Conference on Computational Logistics, pp. 61–75. Cited by: §2.
  • W. J. A. Van Heeswijk, M. R. K. Mes, and J. M. J. Schutten (2019b) The delivery dispatching problem with time windows for urban consolidation centers. Transportation Science 53 (1), pp. 203–221. Cited by: §2.
  • S. A. Voccia, A. M. Campbell, and B. W. Thomas (2019) The same-day delivery problem for online purchases. Transportation Science 53 (1), pp. 167–184. Cited by: §2.
  • Y. Wang, J. M. Do Nascimento, and W. Powell (2019) Reinforcement learning for dynamic bidding in truckload markets: an application to large-scale fleet management with advance commitments. stat 1050, pp. 4. Cited by: §2.
  • R. J. Williams (1992) Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8 (3-4), pp. 229–256. Cited by: §4, Algorithm 2.
  • F. Yan, Y. Ma, M. Xu, and X. Ge (2018) Transportation service procurement bid construction problem from less than truckload perspective. Mathematical Problems in Engineering 2018. Cited by: §2.