
Automatic Curriculum Generation for Learning Adaptation in Networking

by   Zhengxu Xia, et al.

As deep reinforcement learning (RL) showcases its strengths in networking and systems, its pitfalls also come to the public's attention: when trained to handle a wide range of network workloads and previously unseen deployment environments, RL policies often manifest suboptimal performance and poor generalizability. To tackle these problems, we present Genet, a new training framework for learning better RL-based network adaptation algorithms. Genet is built on the concept of curriculum learning, which has proved effective against similar issues in other domains where RL is extensively employed. At a high level, curriculum learning gradually presents more difficult environments to the training, rather than choosing them randomly, so that the current RL model can make meaningful progress in training. However, applying curriculum learning in networking is challenging because it remains unknown how to measure the "difficulty" of a network environment. Instead of relying on handcrafted heuristics to determine the environment's difficulty level, our insight is to utilize traditional rule-based (non-RL) baselines: if the current RL model performs significantly worse in a network environment than the baselines, then the model's potential to improve when further trained in this environment is substantial. Therefore, Genet automatically searches for the environments where the current model falls significantly behind a traditional baseline scheme and iteratively promotes these environments as the training progresses. Evaluating Genet on three use cases (adaptive video streaming, congestion control, and load balancing), we show that Genet produces RL policies that outperform both regularly trained RL policies and traditional baselines in each context, not only under synthetic workloads but also in real environments.





1. Introduction

Many recent techniques based on deep reinforcement learning (RL) are now among the state of the art for various networking and systems adaptation problems, including congestion control (CC) (aurora), adaptive-bitrate streaming (ABR) (pensieve), load balancing (LB) (park), wireless resource scheduling (chinchali2018cellular), and cloud scheduling (decima). For a given distribution of training network environments (e.g., network connections with certain bandwidth patterns, delays, and queue lengths), RL trains a policy to optimize performance over these environments.

However, these RL-based techniques face two challenges that can ultimately impede their wide use in practice:

  • Training in a wide range of environments: When the training distribution spans a wide variety of network environments (e.g., a large range of possible bandwidth), an RL policy may perform poorly even if tested in the environments drawn from the same distribution as training.

  • Generalization: RL policies trained on one distribution of synthetic or trace-driven environments may have poor performance and even erroneous behavior when tested in a new distribution of environments.

Our analysis in §2 will reveal that, across three RL use cases in networking, these challenges can cause well-trained RL policies to perform much worse than traditional rule-based schemes in a range of settings.

These problems are not unique to networking. In other domains (e.g., robotics, gaming) where RL is widely used, there have been many efforts to address these issues, by enhancing offline RL training or re-training a deployed RL policy online. Since updating a deployed model is not always possible or easy (e.g., loading a new kernel module for congestion control or integrating an ABR logic into a video player), we focus on improving RL training offline.

A well-studied paradigm that underpins many recent techniques to improve RL training is curriculum learning (narvekar2020curriculum). Unlike traditional RL training which samples training environments in a random order, curriculum learning generates a training curriculum that gradually increases the difficulty level of training environments, resembling how humans are guided to comprehend more complex concepts. Curriculum learning has been shown to improve generalization (adr; adr2; paired) as well as asymptotic performance (weinshall2018curriculum; justesen2018illuminating), namely the final performance of a model after training runs to convergence. Following an easy-to-difficult routine allows the RL model to make steady progress and reach good performance.

In this work, we present Genet, the first training framework that systematically introduces curriculum learning to RL-based networking algorithms. Genet automatically generates training curricula for network adaptation policies. The challenge of curriculum learning in networking is how to sequence network environments in an order that prioritizes highly rewarding environments where the current RL policy’s reward can be considerably improved. Unfortunately, as we show in §3, several seemingly natural heuristics to identify rewarding environments suffer from limitations.

  • First, they use innate properties of each environment (e.g., shorter network or workload traces (decima) and smoother network conditions (robustifying) are supposedly easier), but these innate properties fail to indicate whether the current RL model can be improved in an environment.

  • Second, they use handcrafted heuristics which may not capture all aspects of an environment that affect RL training (e.g., bandwidth smoothness does not capture the impact of router queue length on congestion control, or buffer length on adaptive video streaming). Each new application (e.g., load balancing) also requires a new heuristic.

The idea behind Genet is simple: An environment is considered rewarding if the current RL model has a large gap-to-baseline, i.e., how much the RL policy’s performance falls behind a traditional rule-based baseline (e.g., Cubic or BBR for congestion control, MPC or BBA for adaptive bitrate streaming) in the environment. We show in §4.1 that the gap-to-baseline of an environment is highly indicative of an RL model’s potential improvement in the environment. Intuitively, since the baseline already shows how to perform better in the environment, the RL model may learn to “imitate” the baseline’s known rules while training in the same environment, bringing it on par with—if not better than—the baseline. On the flip side, if an environment has a small or even negative gap-to-baseline, chances are that the environment is intrinsically hard (a possible reason why the rule-based baseline performs badly), or the current RL policy already performs well and thus training on it is unlikely to improve performance by a large margin.
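The gap-to-baseline signal described above can be sketched in a few lines. This is a hedged illustration only; all names (the `evaluate` simulator and the policy stubs) are illustrative stand-ins, not Genet's actual API.

```python
# Hedged sketch of Genet's gap-to-baseline signal; the helper names are
# illustrative stand-ins, not the paper's actual code.

def gap_to_baseline(env, rl_policy, baseline_policy, evaluate, n_runs=10):
    """Mean baseline reward minus mean RL reward over n_runs episodes.
    A large positive value marks `env` as a rewarding training environment."""
    rl = sum(evaluate(env, rl_policy) for _ in range(n_runs)) / n_runs
    base = sum(evaluate(env, baseline_policy) for _ in range(n_runs)) / n_runs
    return base - rl

# Toy usage with stubbed policies and a stand-in simulator:
evaluate = lambda env, policy: policy(env)
weak_rl = lambda env: 0.4        # RL struggles in this environment
bbr = lambda env: 0.9            # rule-based baseline does well here
gap = gap_to_baseline("high-variance-link", weak_rl, bbr, evaluate)  # ~0.5
```

A small or negative return value corresponds to the "intrinsically hard, or already solved" cases discussed above, where further training yields little.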

Table 1. RL use cases in networked systems.

  Use case                         | Observed state (policy input)                                               | Action (policy output)
  Adaptive Bitrate (ABR) Streaming | future chunk size, history throughput, current buffer length                | bitrate selected for the next video chunk
  Congestion Control (CC)          | RTT inflation, sending/receiving rate, avg RTT in a time window, min RTT    | change of sending rate in the next time window
  Load Balancing (LB)              | past throughput, current request size, number of queued requests per server | server selection for the current request

The reward (performance) for each use case is a weighted combination of per-use-case terms (rebuffering in seconds, bitrate in Mbps, bitrate change in Mbps, throughput in Kbps, latency in seconds); the default reward parameters are listed in §A.5.
Figure 1. Genet creates training curricula by iteratively finding rewarding environments where the current RL policy has high gap-to-baseline.

Inspired by this insight, Genet generates RL training curricula by iteratively identifying rewarding environments where the current RL model has a large gap-to-baseline and then adding them to RL training (Figure 1). For each RL use case, Genet parameterizes the network environment space, allowing us to search for rewarding environments in both synthetically instantiated environments and trace-driven environments. Genet also uses Bayesian Optimization to facilitate the search in a large space. Genet is generic, since it does not use handcrafted heuristics to measure the difficulty of a network environment; instead, it uses rule-based algorithms, which are abundant in the literature on many networking and systems problems, to generate training curricula. Moreover, by focusing training on places where RL falls behind rule-based baselines, Genet directly minimizes the chance of performance regressions relative to the baselines. This is important, because system operators are more willing to deploy an RL policy if it outperforms the incumbent rule-based algorithm in production without noticeable performance regressions. (An example of this mindset is that a new algorithm must compete with the incumbent algorithm in A/B testing before being rolled out to production.)

We have implemented Genet as a separate module with a unifying abstraction that interacts with the existing codebases of RL training to iteratively select rewarding environments and promote them in the course of training. We have integrated Genet with three existing deep RL codebases in the networking area—adaptive video streaming (ABR) (pensieve-code), congestion control (CC) (aurora-code), and load balancing (LB) (park-code).

Genet is not without limitations. For instance, Genet-trained RL policies might not outperform all rule-based baselines (§5.5 shows that when a naive baseline guides Genet, the resulting RL policy can still be inferior to stronger baselines). Genet-trained RL policies may also perform poorly in environments beyond the training ranges (e.g., if we train a congestion-control algorithm on links with bandwidth between 0 and 100 Mbps, Genet will not optimize for a bandwidth of 1 Gbps). Moreover, Genet does not guarantee adversarial robustness, which sometimes conflicts with the goal of generalization (raghunathan2019adversarial).

Using a combination of trace-driven simulation and real-world tests across three use cases (ABR, CC, LB), we show that Genet improves asymptotic performance by 8–25% for ABR, 14–24% for CC, and 15% for LB, compared with traditional RL training methods. We also show that Genet-trained RL policies generalize well to new distributions of network or workload characteristics (different distributions of bandwidth, delay, queue length, etc.).

2. Motivation

Deep reinforcement learning (RL) trains a deep neural net (DNN) as the decision-making logic (policy) and is well-suited to many sequential decision-making problems in networking (park; haj2019view). (There are rule-based alternatives to DNN-based policies, but they are not as expressive and flexible as DNNs, which limits their performance. Oboe (oboe), for instance, sets optimal hyperparameters for RobustMPC based on the mean and variance of network bandwidth and, as shown in §5.4, is a very competitive baseline, but it still performs worse than the best RL strategy.) We use three use cases (summarized in Table 1) to make our discussion concrete:

  • An adaptive bitrate (ABR) algorithm adapts the chunk-level video bitrate to the dynamics of throughput and playback buffer (input state) over the course of a video session. ABR policies, including RL-based ones (Pensieve (pensieve)), choose the next chunk’s bitrate (output decision) at the chunk boundary to maximize session-wide average bitrate, while minimizing rebuffering and bitrate fluctuation.

  • A congestion control (CC) algorithm at the transport layer adapts the sending rate based on the sender’s observations of the network conditions on a path (input state). An example of RL-based CC policy (Aurora (aurora)) makes sending rate decisions at the beginning of each interval (of length proportional to RTT), to maximize the reward (a combination of throughput, latency, and packet loss rate).

  • A load balancing (LB) algorithm in a key-replicated distributed database reroutes each request to one of the servers (whose real-time resource utilization is unknown), based on the request arrival intervals, resource demand of past requests, and the number of outstanding requests currently assigned to each server.
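For the ABR case above, the per-chunk objective (reward higher bitrate, penalize rebuffering and bitrate switches) can be written as a small formula. This is a hedged sketch: the weights below are placeholders for illustration, not the paper's defaults, which are listed in §A.5.

```python
# Hedged sketch of the per-chunk ABR objective described above. The
# weights w_rebuf and w_smooth are placeholders, not the paper's values.

def abr_chunk_reward(bitrate_mbps, prev_bitrate_mbps, rebuffer_s,
                     w_rebuf=4.3, w_smooth=1.0):
    """Reward for one chunk: bitrate minus weighted rebuffering and
    bitrate-change penalties."""
    return (bitrate_mbps
            - w_rebuf * rebuffer_s
            - w_smooth * abs(bitrate_mbps - prev_bitrate_mbps))
```

Summing this over all chunks of a session gives the session-wide objective that RL-based ABR policies such as Pensieve maximize.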

We choose these use cases because they have open-source implementations (Pensieve (pensieve-code) for ABR, Aurora (aurora-code) for CC, and Park (park-code) for LB). Our goal is to improve existing RL training in networking; revising the RL algorithm per se (input, output, or DNN model) is beyond our scope.

Network environments:  We generate simulated training environments with a range of parameters, following prior work (pensieve; aurora; park). An environment can be synthetically generated using a list of parameters as configuration, e.g., in the context of ABR, a configuration encompasses bandwidth range, frequency of bandwidth change, chunk length, etc. Meanwhile, when recorded bandwidth traces are available (for CC and ABR experiments), we can also create trace-driven environments where the recorded bandwidth is replayed. Note that bandwidth is only one dimension of an environment and must be complemented with other synthetic parameters in order to create a simulated environment. (Our environment generator and a full list of parameters are documented in §A.2.) In recent papers, both trace-driven (e.g., (pensieve; robustifying)) and synthetic environments (e.g., (aurora; park)) are used to train RL-based network algorithms. We will explain in §4.2 how our technique applies to both types of environments.

Traditional RL training:  Given a user-specified distribution of (trace-driven or synthetic) training environments, the traditional RL training method works in iterations. Each iteration randomly samples a subset of environments from the provided distribution and then updates the DNN-based RL policy (via forward and backward passes). For instance, Aurora (aurora) uses a batch size of 7200 steps (i.e., 30–50 network environments of 30 seconds each) and applies the PPO1 algorithm to update the policy network by simulating the network environments in each batch.
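The loop just described can be sketched compactly. This is an illustration only: `sample_env` and `ppo_update` are hypothetical stand-ins for the environment generator and the policy-gradient update of an existing codebase such as Aurora's.

```python
# Illustrative sketch of the "traditional" training loop: each iteration
# draws a fresh random batch of environments from a fixed distribution
# and runs one policy update. Helper names are hypothetical.
import random

def traditional_training(policy, env_configs, iters=100, batch_size=40,
                         sample_env=None, ppo_update=None):
    for _ in range(iters):
        batch = [sample_env(random.choice(env_configs))
                 for _ in range(batch_size)]          # uniform, no curriculum
        policy = ppo_update(policy, batch)            # e.g., PPO1 in Aurora
    return policy
```

The key property, for contrast with Genet later, is that the sampling distribution never changes across iterations.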

Several previous efforts have demonstrated the promise of the traditional RL training—given a distribution of target environments, an RL policy can be trained to perform well on these environments (e.g., (pensieve; aurora)). Unfortunately, this approach falls short on two fronts.

Figure 2. Challenge of RL training over a wider range of environments, from small (RL1) to medium (RL2) to large (RL3). (a) Performance gains of RL schemes over the baselines diminish as the target distribution spans a wider range of environments. (b) Even if RL schemes perform better on average, they are worse than the baselines on a substantial fraction of test environments.

Figure 3. Generalization issues of RL-based schemes, using CC as an example. (a) RL-based CC trained in synthetic network environments performs worse on real network traces than the rule-based baseline. (b) RL-based CC trained on one real trace set performs worse than the rule-based baseline on another real trace set.

Challenge 1: Training over wide environment distributions.  When the training distribution of network environments has a wide spread (e.g., a large range of possible bandwidth values), RL training tends to result in poor asymptotic performance (model performance after reaching convergence) even when the test environments are drawn from the same distribution as training.

In Figure 2, for each use case, we choose three target distributions with increasing parameter ranges, labeled RL1/RL2/RL3; their ranges of synthetic environment parameters are given in Tables 3, 4, and 5. Figure 2(a) compares the asymptotic performance of three RL policies (with different random seeds) against rule-based baselines (MPC (mpc) for ABR, BBR (bbr) for CC, and the least-load-first (LLF) policy for LB) in test environments randomly sampled from the same ranges. It shows that RL's performance advantage over the baselines diminishes rapidly as the range of target environments expands. Even though RL-based policies still outperform the baselines on average, Figure 2(b) reveals a more striking reality: their performance falls behind the baselines in a substantial fraction of test environments.

An intuitive explanation is that in each RL training iteration, only a batch of randomly sampled environments (typically 20–50) is used to update the model; when the entire training set spans a wide range of environments, the batches in consecutive iterations may have dramatically different distributions that push the RL model in different directions. This causes the training to converge slowly and makes it difficult to obtain a good policy (narvekar2020curriculum).

Challenge 2: Low generalizability.  Another practical challenge arises when the training process does not have access to the target environment distribution. This calls for models with good generalization, i.e., the RL policies trained on one distribution also perform well on a different environment distribution during testing. Unfortunately, existing RL training methods often fall short of this ideal. Figure 3 evaluates the generalizability of RL-based CC schemes in two ways.

  • First, we train an RL-based CC algorithm on the same range of synthetic environments as specified in its original paper (aurora). We first validate the model by confirming its performance against a rule-based baseline, BBR, in environments independently generated from the same range as training (Figure 3(a), left). However, when tested on real-world recorded network traces in the "Cellular" and "Ethernet" categories from Pantheon (pantheon) (Table 2), the RL-based policy yields much worse performance than the rule-based baseline.

  • Second, we train the RL-based CC algorithm on the “Cellular” trace set and test it on the “Ethernet” trace set (Figure 3(b); left), or vice versa (Figure 3(b); right). Similarly, its performance degrades significantly when tested on a different trace set.

The observations in Figure 3 are not unique to CC. Prior work (robustifying) also shows a lack of generalization of RL-based ABR algorithms.

Summary:  In short, we observe two challenges faced by the traditional RL training mechanism:

  • The asymptotic performance of the learned policies can be suboptimal, especially when they are trained over a wide range of environments.

  • The trained RL policies may generalize poorly to unseen network environments.

3. Curriculum learning for networking

Given these observations regarding the limitations of RL training in networking, a natural question to ask is how to improve RL training such that the learned adaptation policies achieve good asymptotic performance across a broad range of target network environments. (An alternative is to retrain the deployed RL policy whenever it meets a new domain, e.g., a new network connection with unseen characteristics, but this does not apply when the RL policy cannot be updated frequently. Besides, it is also challenging to precisely detect the drift in network conditions that necessitates retraining the RL policy.)

Curriculum learning:  We cast the training of RL-based network adaptation into the well-studied framework of curriculum learning. Unlike traditional RL training, which samples training environments from a fixed distribution in each iteration, curriculum learning gradually increases the difficulty of training environments, so that training always focuses on the environments that are easiest to improve on, i.e., the most rewarding environments. (In the deep learning literature, finding the optimal training environments, hypothesized in the seminal paper (bengio2009curriculum) to be "not too hard or too easy", still remains an open question in the general setting.) Prior work has demonstrated the benefits of curriculum learning in other applications of RL, including faster convergence, higher asymptotic performance, and better generalization.

However, the challenge of employing curriculum learning lies in determining which environments are rewarding. The answer naturally varies across applications, but three general approaches exist: (1) training the current model on a set of environments individually to determine in which environment the training progresses fastest; (2) using heuristics to quantify how easily the model can improve in an environment; and (3) jointly training another model (typically a DNN) to select rewarding environments. Among them, the first option is prohibitively expensive and thus not widely used, whereas the third introduces the extra complexity of training a second DNN. Therefore, we take a pragmatic stance and explore the second approach, while leaving the other two for future work.

Why sequencing training environments is difficult:  To motivate our design choices, we first introduce three strawman approaches, each with different strengths and weaknesses. A common strategy in curriculum learning for RL is to measure environment difficulty and gradually introduce more difficult environments to training.

Strawman 1: inherent properties. The first idea is to quantify the difficulty level of an environment using some of its inherent properties. In congestion control, for instance, network traces with higher bandwidth variance are intuitively more difficult. This approach, however, only distinguishes environments that differ in the hand-picked properties and may not suffice under complex environments (e.g., adding bandwidth traces with similar variance to training can have different effects).

Strawman 2: performance of rule-based baselines. Alternatively, one can use the test performance of a traditional algorithm to indicate the difficulty of an environment. Lower performance may suggest a more difficult environment (weinshall2018curriculum). While this method can distinguish any two environments, it offers no hint about how to improve the current RL model during training.

Strawman 3: performance of the current RL model. To fix the second strawman solution, one can use the performance of the current RL policy, instead of a traditional algorithm. If the current RL model performs poorly in an environment, it can potentially improve a lot when trained in this environment (or similar ones). However, this approach may fail since some environments are inherently hard for a model to improve on. In CC, examples of such environments include links with frequently varying bandwidth.

Figure 4. A simple example where adding trace set Y to training has a different effect than adding Z. Adding Y to training improves performance on Y only marginally but hurts both X and Z, whereas adding Z improves the performance on both Y and Z without negative impact on X.
Figure 5. Contrasting (a) an inherently hard (possibly unsolvable) trace from Y with (b) an improvable trace from Z. The difference is that the rule-based policy's reward is higher than the RL policy's in (b), whereas their rewards are similar in (a).

Example:  Figure 4 shows a concrete real example in ABR where "Strawman 3" fails. (In §5.5, we empirically test these three curriculum-learning strategies.) We generate three sets of bandwidth traces X, Y, and Z using three configurations (details in §A.3). We first train an RL-based ABR policy on trace set X until it performs well in place (on X) but poorly on Y and Z. Since the performance of the current RL model is lower on Y than on Z, Strawman 3 opts for adding Y to the training in the next step. However, Figure 4 shows that training further on Y worsens the model performance on X and Z, although the in-place performance (on Y) is indeed improved. In fact, adding Z to training is better at this point: the performance on Y and Z is improved without any negative impact on X.

To take a closer look, we plot two example traces from Y and Z in Figure 5: the trace from Y fluctuates with a smaller magnitude but more frequently, whereas the trace from Z fluctuates with a greater magnitude but much less frequently. However, such observations cannot generalize to an arbitrary pair of environments or a different application.

4. Design and implementation of Genet

4.1. Curriculum generation

To identify rewarding environments, the idea of Genet is to find environments with a large gap-to-baseline, i.e., those where the RL policy is worse than a given rule-based baseline by a large margin. At a high level, adding such environments to training has two practical benefits.

First, when a rule-based baseline performs much better than the RL policy in an environment, the RL model may learn to "imitate" the baseline's known rules while training in the environment, bringing it on par with, if not better than, the baseline. (This may not hold when the behavior of the rule-based algorithm cannot be approximated by RL's policy DNN; we discuss this issue in §4.2.) Therefore, a large gap-to-baseline indicates plausible room for the current RL model to improve. Figure 6 empirically confirms this with one example ABR policy and one example CC policy (both intermediate models during Genet-based training). For example, among 73 randomly chosen synthetic environment configurations in CC, a configuration with a larger gap-to-baseline is likely to yield more improvement when its environments are added to RL training. Moreover, this correlation is stronger than when using the current model's performance ("Strawman 3" in §3) to decide which environments are rewarding.

Second, although not all rule-based algorithms are easily interpretable or completely fail-proof, many of them have been used in networked systems long before RL-based approaches and are considered more trustworthy than black-box RL algorithms. Operators therefore tend to scrutinize any performance disadvantage of an RL policy relative to the rule-based baselines currently deployed in the system. By promoting environments with a large gap-to-baseline, Genet directly reduces the possibility that the RL policy causes performance regressions.

In short, gap-to-baseline builds on the insight that rule-based baselines are complementary to RL policies—they are less susceptible to any discrepancies between training and test environments, whereas the performance of an RL policy is potentially sensitive to the environments seen during training. In §5.5, we will discuss the impact of different choices of rule-based baselines and why gap-to-baseline is a better way of using the rule-based baseline than alternatives.

It is worth noting that the rewarding environments (those with a large gap-to-baseline) have no particular meaning outside the context of a given pair of RL model and baseline. For instance, when an RL-based CC model has a greater gap-to-baseline in some network environments, it only means that it is easier to improve the RL model by training it in these environments; it does not indicate whether these environments are easy or challenging for any traditional CC algorithm.

Figure 6. Compared to the current model's performance (left), its gap-to-baseline (right) in an environment is more indicative of the potential training improvement in that environment. (a) ABR; (b) CC.

4.2. Training framework

Figure 7 depicts Genet’s high-level iterative workflow to realize curriculum learning. Each iteration consists of three steps (which will be detailed shortly):

  • First, we update the current RL model for a fixed number of epochs over the current training environment distribution;

  • Second, we select the environments where the current RL model has a large gap-to-baseline; and

  • Third, we promote these selected environments in the training environment distribution used by the RL training process in the next iteration.
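The three steps above can be condensed into a short sketch. This is an illustration of the iteration structure only; `train`, `find_max_gap_config`, and `promote` are hypothetical helpers standing in for the existing RL codebase, the BO-based sequencing module, and the distribution update, respectively.

```python
# Compact sketch of Genet's three-step iteration (Figure 7); helper
# names are hypothetical, not the paper's actual module names.
def genet_train(policy, distribution, baseline, n_iterations=9,
                train_epochs=10, train=None, find_max_gap_config=None,
                promote=None):
    for _ in range(n_iterations):
        policy = train(policy, distribution, epochs=train_epochs)   # step 1
        hard_cfg = find_max_gap_config(policy, baseline)            # step 2
        distribution = promote(distribution, hard_cfg, weight=0.3)  # step 3
    return policy
```

Note that the rewarding configuration is re-searched from scratch in every iteration, since the gap-to-baseline landscape changes as the policy improves.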

Training environment distribution:  We define a distribution of training environments as a probability distribution over the space of configurations, each being a vector of 5–7 parameters (summarized in Tables 3, 4, and 5) used to generate network environments. An example configuration is: [BW: 2–3 Mbps, BW changing frequency: 0–20 s, Buffer length: 5–10 s]. Genet sets the initial training environment distribution to be uniform along each parameter and automatically updates the distribution used in each iteration, effectively generating a training curriculum.
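To make the abstraction concrete, here is a minimal sketch of a configuration and how an environment could be instantiated from it. The parameter names and ranges come from the ABR example above; the helper itself is illustrative, not Genet's API.

```python
# Hedged sketch: a configuration is a small vector of parameter ranges
# (the ABR example from the text); an environment is instantiated by
# sampling uniformly inside each range.
import random

EXAMPLE_CONFIG = {
    "bw_mbps": (2.0, 3.0),                # bandwidth range
    "bw_change_interval_s": (0.0, 20.0),  # BW changing frequency
    "buffer_len_s": (5.0, 10.0),          # playback buffer length
}

def instantiate(config, rng=random):
    """Draw one concrete environment from a configuration."""
    return {k: rng.uniform(lo, hi) for k, (lo, hi) in config.items()}
```

The initial uniform distribution corresponds to sampling a configuration uniformly from the full parameter space before calling a helper like `instantiate`.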

Figure 7. Overview of Genet’s training process.

When recorded traces are available, Genet can augment the training with trace-driven environments as follows; we use bandwidth traces as an example. The first step is to categorize each bandwidth trace along the bandwidth-related parameters (i.e., bandwidth range and variance in our case). Each time a configuration is selected by RL training to create new environments, with some probability (10% by default) Genet samples a bandwidth trace whose bandwidth-related parameters fall into the range of the selected configuration.
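The trace-augmentation step can be sketched as follows. This is a hedged illustration: the configuration keys (`bw_min`, `bw_max`), the trace metadata field, and the helper name are all hypothetical stand-ins for however the implementation categorizes traces.

```python
# Sketch of trace augmentation: with probability p_trace (the text's 10%
# default), replay a recorded trace whose bandwidth statistics fall into
# the selected configuration's range; otherwise synthesize. All field
# names are illustrative.
import random

def make_environment(config, traces, p_trace=0.10, rng=random):
    candidates = [t for t in traces
                  if config["bw_min"] <= t["mean_bw"] <= config["bw_max"]]
    if candidates and rng.random() < p_trace:
        return {"kind": "trace-driven", "trace": rng.choice(candidates)}
    return {"kind": "synthetic", "config": config}
```

When no recorded trace matches the configuration, training simply falls back to a purely synthetic environment.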

In §5.2, we will show that adding trace-driven environments to training improves the performance of RL policies, especially when tested on unseen real traces from the same distribution. That said, even without trace-driven environments in RL training, our trained RL policies still outperform those produced by the traditional method of training over real traces or over synthetic traces.

Key components:  Each iteration of Genet starts with training the current model for a fixed number of epochs (10 by default). Here, Genet reuses the traditional training method from prior work (i.e., uniform sampling of training environments per epoch), which makes it possible to apply Genet incrementally to existing codebases (see our implementation in §4.3). Recent work on domain randomization (sadeghi16; tobin17; peng18) also shows that a similar training process can benefit the generalization of RL policies. The details of the training process are described in Algorithm 1.

After a certain number of epochs, the current RL model and a pre-determined rule-based baseline are given to a sequencing module that searches for the environments where the current RL model has a large gap-to-baseline. Ideally, we would test the current RL model on all possible environments and identify the ones with the largest gap-to-baseline, but this is prohibitively expensive. Instead, we use Bayesian Optimization (frazier2018tutorial) (BO) as follows. We view the expected gap-to-baseline over the environments created by configuration c as a function of c: f(c) = Reward(baseline, c) − Reward(RL model, c), where Reward(π, c) is the average reward of a policy π (either the rule-based baseline or the RL model) over 10 environments randomly generated by configuration c. BO then searches the environment space for the configuration that maximizes f(c).
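The search interface can be sketched as below. Note the hedge: the paper uses Bayesian Optimization over the configuration space, whereas this sketch substitutes plain random search for the same objective, purely to show the shape of the sequencing module; the helper names are hypothetical.

```python
# Sketch of the sequencing module's search. The real system uses
# Bayesian Optimization; random search stands in here only to show the
# interface (maximize the estimated gap-to-baseline f(c) over n_steps
# candidate configurations). Helper names are hypothetical.
import random

def search_rewarding_config(sample_config, gap, n_steps=15, rng=random):
    """Return the candidate configuration with the largest estimated
    gap-to-baseline, where gap(c) averages over ~10 environments."""
    best_cfg, best_gap = None, float("-inf")
    for _ in range(n_steps):
        cfg = sample_config(rng)
        g = gap(cfg)
        if g > best_gap:
            best_cfg, best_gap = cfg, g
    return best_cfg
```

A real BO backend would replace the independent draws with a surrogate-model-guided proposal of the next configuration, which is what makes 15 steps sufficient in practice.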

Once a new configuration is selected, the environments generated by this configuration are added to the training distribution as follows. When the RL training process samples a new training environment, it chooses the new configuration with probability p (30% by default) or uniformly samples a configuration from the old distribution with probability 1 − p (70% by default), and then creates an environment based on the selected configuration. Next, training is resumed over the new environment distribution.
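The promotion step amounts to a two-way weighted choice, sketched below with the text's 30% default; the function name is illustrative.

```python
# Sketch of configuration promotion: each new training environment uses
# the promoted configuration with probability p (30% by default),
# otherwise a uniform draw from the old distribution.
import random

def sample_training_config(old_configs, promoted_config, p=0.30, rng=random):
    if rng.random() < p:
        return promoted_config
    return rng.choice(old_configs)
```

Applied recursively across iterations, this is also why the original uniform distribution is gradually diluted, which motivates the forgetting discussion below.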

It is important to note that the BO-based search does not carry over its state when searching for rewarding environments for a new RL model. Instead, Genet restarts the BO search every time the RL model is updated, because the rewarding environments can change once the RL model changes.

Design rationale:  The process described above embeds several design decisions that make it efficient.

How to choose rule-based baselines? For Genet to be effective, the baselines should not fail in simple environments; otherwise Genet would ignore those environments, since the RL policy could easily beat the baselines there. For instance, when using Cubic as the baseline in training RL-based CC policies, we observe that the RL policy is rarely worse than Cubic along the dimension of random loss rate, because Cubic’s performance is susceptible to random packet losses. That said, we find that the choice of baselines does not have a significant impact on the effectiveness of Genet, although a better choice tends to yield more improvement (as shown in §5.5). (One possible refinement in this regard is to use an “ensemble” of rule-based heuristics and let the training scheduler focus on environments where the RL policy falls short of any one of them.)

Why is BO-based exploration effective? Admittedly, it can be challenging for BO to search for rewarding environments in a high-dimensional space. In practice, however, we observe that BO is highly efficient at identifying a good configuration within a relatively small number of steps (15 by default). We validate this empirically in §5.5.

Impact of forgetting? It is important that we train models over the full range of environments. Genet does begin training over the whole space of environments in the first iteration, but each subsequent iteration introduces a new configuration, thus diluting the share of random environments in training. This might lead to the classic problem of forgetting: the trained model may forget how to handle environments seen before. While we do not address this problem directly, we find that Genet is only mildly affected by it, because Genet stops training after changing the training distribution nine times, and by then the original environment distribution still accounts for about 10%. (When we impose a minimum fraction of “exploration”, i.e., uniformly picking an environment from the original training distribution, which is a typical strategy to prevent forgetting (zaremba2014learning), Genet’s performance actually becomes worse.)

4.3. Implementation

Genet is fully implemented in Python and Bash, and has been integrated with three existing RL training codebases. Next, we describe Genet’s interface and implementation, as well as optimizations that eliminate its performance bottlenecks.

Figure 8. Components and interfaces needed to integrate Genet with an existing RL training code.

API:  Genet interacts with an existing RL training code with two APIs (Figure 8): Train signals the RL to continue the training using the given distribution of environment configurations and returns a snapshot of model after a specified number of training epochs; Test calculates the average reward of a given algorithm (RL model or a baseline) over a specified number of environments drawn from the given distribution of configurations.
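The two calls can be sketched as an abstract Python interface; the class and method names below are assumptions for illustration, not the actual Genet API.

```python
class TrainingBackend:
    """Minimal sketch of the two-call interface Genet expects from an
    existing RL training codebase (names are hypothetical)."""

    def train(self, model, config_distribution, num_epochs):
        """Resume training `model` with environments sampled from
        `config_distribution`; return a model snapshot after `num_epochs`."""
        raise NotImplementedError

    def test(self, policy, config_distribution, num_envs):
        """Return the average reward of `policy` (RL model or rule-based
        baseline) over `num_envs` environments drawn from the distribution."""
        raise NotImplementedError
```

A concrete backend subclasses this and wires each method to its own simulator and training loop.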

Integration with RL training:  We have integrated Genet with Pensieve ABR (pensieve-code), Aurora CC (aurora-code), and Park LB (park-code), which use different RL algorithms (e.g., A3C, PPO) and network simulators (e.g., packet level, chunk level). We implement the two APIs above using functionalities provided in the existing codebase.

Rule-based baselines:  Genet takes advantage of the fact that many RL training codebases (including our three use cases) already implement at least one rule-based baseline (e.g., MPC in ABR, Cubic in CC) that runs in their simulators. In addition, we implemented a few baselines ourselves, including shortest-job-first in LB and BBR in CC. The implementation is generally straightforward, but sometimes the simulator (though sufficient for the RL policy) lacks features crucial for a faithful implementation of the rule-based logic. Fortunately, Genet-based RL training uses the baseline only to select training environments, so the consequence of having a suboptimal baseline is limited.
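As an illustration of how simple such a baseline can be, here is a sketch of a least-load-style dispatcher (the paper's shortest-job-first baseline may differ in detail); the server representation is hypothetical.

```python
def least_loaded_dispatch(servers, job_size):
    """Toy LB baseline: send the job to the server whose queue holds the
    least total outstanding work. `servers` maps a server id to the list
    of its queued job sizes (an assumed representation)."""
    target = min(servers, key=lambda s: sum(servers[s]))
    servers[target].append(job_size)  # enqueue the job at the chosen server
    return target
```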

Trace set   Use case   Training (# traces, total length (s))   Testing (# traces, total length (s))
FCC         ABR        85, 105.8k                              290, 89.9k
Norway      ABR        115, 30.5k                              310, 96.1k
Ethernet    CC         64, 1.92k                               112, 3.35k
Cellular    CC         136, 4.08k                              121, 3.64k
Table 2. Network traces used in ABR and CC tests.

5. Evaluation

The key takeaways of our evaluation are:

  • Across three RL use cases in networking, Genet improves the performance of RL algorithms when tested in new environments drawn from the training distributions that include wide ranges of environments (§5.2).

  • Genet improves the generalization of RL performance, allowing models trained over synthetic environments to perform well even in various trace-driven environments as well as on real-world network connections (§5.3).

  • Genet-trained RL policies have a much higher chance to outperform various rule-based baselines specified during Genet-based RL training (§5.4).

  • Finally, Genet’s design choices, such as its curriculum-learning strategy and BO-based search, are shown to be effective compared with seemingly natural alternatives (§5.5).

Given the success of curriculum learning in other RL domains, these improvements are not particularly surprising; but by demonstrating them for the first time for RL training in networking, we hope to inspire more follow-up research in this direction.

5.1. Setup

We train Genet for three RL use cases in networking, using their original simulators: congestion control (CC) (aurora-code), adaptive-bitrate streaming (ABR) (pensieve-code), and load balancing (LB) (park-code). As discussed in §4.1, we train and test RL policies over two types of environments.

Synthetic environments:  We generate synthetic environments using the parameters described in detail in §A.2 and Table 3,4,5. We choose these environment parameters to cover a range of factors that affect RL performance. For instance, in CC tests, our environment parameters specify bandwidth (e.g., the range, variance, and how often it changes), delay, and queue length, etc.
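As a concrete illustration, a synthetic CC environment configuration might be sampled as below; the parameter names and ranges are illustrative placeholders, not the exact values from Tables 3, 4, and 5.

```python
import random

def sample_cc_environment(rng=random):
    """Sketch of sampling one synthetic CC environment configuration.
    Names and ranges are assumptions for illustration only."""
    return {
        "max_bandwidth_mbps": rng.uniform(1, 100),
        "bw_change_interval_s": rng.uniform(1, 30),
        "one_way_delay_ms": rng.uniform(5, 200),
        "queue_size_pkts": rng.randint(10, 1000),
        "random_loss_rate": rng.uniform(0, 0.05),
    }
```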

Trace-driven environments:  We also use real traces for CC and ABR (summarized in Table 2) to create trace-driven environments (in both training and testing), where the bandwidth timeseries are set by the real traces but the remaining environment parameters (e.g., queue length or target video buffer length) are set as in the synthetic environments. We test ABR policies by streaming a pre-recorded video over 290 traces from FCC broadband measurements (fccdata) (labeled “FCC”) and 310 cellular traces (norway-data) (labeled “Norway”). We test CC policies on 121 cellular traces (labeled “Cellular”) and 112 Ethernet traces (labeled “Ethernet”) collected by the Pantheon platform (pantheon).

All used real traces are released in

Baselines:  We compare Genet-trained RL policies with several baselines.

First, traditional RL trains RL policies by uniformly sampling environments from the target distribution per epoch. We train three types of RL policies (RL1, RL2, RL3) over fixed-width uniform distributions of synthetic environments, specified in Tables 3, 4, and 5. From RL1 to RL3, the sizes of their training environment ranges are in ascending order.

We also train RL policies over trace-driven environments, i.e., by randomly picking bandwidth traces from one of the recorded sets. This is the same as prior work, except that we also vary non-bandwidth-related parameters (e.g., queue length, buffer length, video length) to increase robustness. In addition, we test an early attempt to improve RL, Robustifying (robustifying), which generates new training bandwidth traces that maximize the gap between the RL policy and optimal adaptation with an unsmoothness penalty (§5.5).

Second, traditional rule-based algorithms include BBA (bba) and RobustMPC (mpc) for ABR, PCC-Vivace (vivace), BBR (bbr), and Cubic for CC, and least-load-first (LLF) for LB. (By default, we use RobustMPC as MPC and PCC Vivace-latency as Vivace, since they appear to perform better than their respective variants.) They can be viewed as reference points for traditional non-ML solutions.

5.2. Asymptotic performance

We first compare Genet-trained RL policies and traditional RL-trained policies, in terms of their asymptotic performance (i.e., test performance over new test environments drawn independently from the training distribution). In other words, we train RL policies over environments from the target distribution and test them in new environments from the same distribution.

Synthetic environments:  We first test Genet-trained CC, ABR, and LB policies under their respective RL3 synthetic ranges (where all parameters are set to their full ranges) as the target distribution. As shown in Figure 2, in these training ranges, traditional RL training yields little performance improvement over the rule-based baselines. Figure 9 compares Genet-trained CC, ABR, and LB policies with their respective baselines over 200 new synthetic environments randomly drawn from the target distribution.

Across the three use cases, we see that Genet consistently improves over traditional RL-trained policies: by 8–25% for ABR, 14–24% for CC, and 15% for LB. We notice that there is no clear ranking among the three traditional RL-trained policies. This is because RL1 helps training converge better but only sees a small slice of the target distribution, whereas RL3 sees the whole distribution but cannot train a good model. In contrast, Genet outperforms them all, as curriculum learning allows it to learn more efficiently from the large target distribution.

Figure 9. Comparing performance of Genet-trained RL policies for CC, ABR, and LB, with baselines in unseen synthetic environments drawn from the training distribution, which sets all environment parameters at their full ranges.
Figure 10. Test of ABR policies along individual environment parameters.
Figure 11. Test of LB policies along individual environment parameters.

To show the performance more thoroughly, Figure 10 picks ABR as an example and shows the performance across different values along six environment parameters. We vary one parameter at a time while fixing the other parameters at the same default values (see Tables 3, 4, and 5). We see that Genet-trained RL policies enjoy consistent performance advantages (in reward) over traditionally trained RL policies. This suggests that the improvement of Genet shown in Figure 9 is not a result of improving rewards in some environments at the cost of degrading rewards in others; instead, Genet improves rewards in most cases.

We also run CC emulation on a Dell Inspiron 5521 machine with a Mahimahi-emulated link with controlled bandwidth, delay, and queue length. We run LB emulation on a local Cassandra testbed with three machines hosted in the Chameleon cluster (Chameleon-cloud), fed with key-value requests at Poisson arrival intervals. Figure 11 shows that, in synthetic environments, the Genet-trained LB policy outperforms its baselines by 15%.

Trace-driven environments:  Next, we set the target environment distributions of ABR and CC to be the environments generated from multiple real-world trace sets (FCC and Norway for ABR, Ethernet and Cellular for CC). We partition each trace set as listed in Table 2. Genet trains ABR and CC policies by combining the trace-driven environments and the synthetic environments (described in §4.2). For a thorough comparison, both Genet and the traditional RL training have access to the training portion of the real traces as well as the synthetic environments. We vary the ratio of real traces to synthetic environments fed to the traditional RL training method; e.g., if the ratio of real traces is 20%, then the traditional RL training draws a trace-driven environment with 20% probability and a synthetic environment with 80% probability. That is, we test different ways for the traditional RL training to combine the training traces and synthetic environments. Figure 12 compares Genet-trained ABR and CC policies with their respective traditional RL-trained baselines over new environments generated from the traces in the testing set. It shows that Genet-trained policies outperform traditional RL training by 17–18%, regardless of the ratio of real traces, including when the model is trained entirely on real traces.

(a) Congestion control (CC)
(b) Adaptive bitrate (ABR)
Figure 12. Asymptotic performance of Genet-trained CC policies (a) and ABR policies (b) and baselines, when the real network traces are randomly split to training set and test set.
(a) CC test in trace-driven environments (Cellular)
(b) CC test in trace-driven environments (Ethernet)
(c) ABR test in trace-driven environments (FCC)
(d) ABR test in trace-driven environments (Norway)
Figure 13. Generalization test: Training of various methods is done entirely in synthetic environments, but the testing is over various real network trace sets.

5.3. Generalization

Next, we take the RL policies of ABR and CC trained (by Genet and other baselines) entirely over synthetic environments (the RL3 synthetic environment range) and test their generalization in trace-driven environments generated by the ABR (and CC) testing traces in Table 2.

Figure 13 shows that they perform better than traditional RL baselines trained over the same synthetic environment distribution. Though Figure 13 uses the same testing environments as Figure 12 and shows a similar relative ranking between Genet and traditional RL training, the implications are different: Figure 13 shows that when the real traces are not accessible during training, Genet can produce models that generalize better to real-trace-driven environments than the baselines, whereas Figure 12 shows their performance when the training portion of the real traces is actively used in training by both Genet and the baselines.

5.4. Comparison with rule-based baselines

Impact of the choice of rule-based baselines:  Figure 14 shows the performance of Genet-trained policies when using different rule-based baselines: MPC and BBA in the ABR experiments, and BBR and Cubic in the CC experiments. We observe that in all cases, Genet-trained policies outperform their respective rule-based baselines.

(a) ABR
(b) CC
Figure 14. Genet can outperform the rule-based baselines used in its training.
(a) ABR
(b) CC
Figure 15. Fraction of real traces where Genet trained policies (and traditional RL) are better than the rule-based baseline.
(a) ABR
(b) CC
Figure 16. Testing ABR and CC policies in real-world environments.

What if Genet uses naive rule-based baselines?  As explained in §4.2, the rule-based baseline should have reasonable (though not necessarily optimal) performance; otherwise, it would be unable to indicate where the RL policy can be improved. To verify this empirically, we use two unreasonable baselines: choosing the highest bitrate when rebuffering in ABR, and choosing the most heavily loaded server in LB. In both cases, the BO-based search fails to find useful training environments, because the RL policy very quickly outperforms the naive baseline everywhere. That said, the negative impact of using a naive baseline is restricted to the selection of training environments, rather than the RL training itself (a benefit of decoupling baseline-driven environment selection from RL training), so in the worst case, Genet would be roughly as good as traditional RL training.

How likely is Genet to outperform rule-based baselines? 
One of Genet’s benefits is to increase how often the RL policy outperforms the rule-based baseline used in its training. In Figure 15, we create various versions of Genet-trained RL policies by setting the rule-based baselines to Cubic and BBR (for CC) and MPC and BBA (for ABR). Compared with RL1, RL2, and RL3 (which are unaware of rule-based baselines), Genet-trained policies markedly increase the fraction of (emulated) real-world traces where the RL policy outperforms the baseline used to train it. This suggests that operators can specify a rule-based baseline, and Genet will train an RL policy that outperforms it with high probability.

(a) CC (Cellular traces)
(b) CC (Ethernet traces)
(c) ABR (FCC traces)
(d) ABR (Norway traces)
Figure 17. RL-based ABR and CC vs. rule-based baselines.

Breakdown of performance:  Figure 17 takes one Genet-trained ABR policy (with MPC as the rule-based baseline) and one Genet-trained CC policy (with BBR as the rule-based baseline) and compares their performance with a range of rule-based baselines along individual performance metrics. We see that the Genet-trained ABR and CC policies stay on the frontier and outperform other baselines.

Real-world tests:  We also test the Genet-trained ABR and CC policies on five real wide-area network paths (without emulated delay/loss) between four nodes reserved from (onl), one laptop at home, and two cloud servers (§A.4), allowing us to observe their interactions with real network traffic. For statistical confidence, we run the Genet-trained policies and their baselines back-to-back, each at least five times, and show their performance in Figure 16. In all but two cases, Genet outperforms the baselines. On Path-2, Genet-trained ABR shows little improvement because the bandwidth is always much higher than the highest bitrate, so the baselines simply use the highest bitrate, leaving no room for improvement. On Path-3, Genet-trained CC shows negative improvement because the network has a deeper queue than those used in training, which RL cannot handle well. This is an example where Genet can fail when tested outside the range of training environments. These results do not prove that the policies generalize to all environments; instead, they show Genet’s performance across a range of network settings.

5.5. Understanding Genet’s design choices

Alternative curriculum-learning schemes:  Figure 18 compares Genet’s training curve with that of traditional RL training and three alternative ways of selecting training environments described in §3. CL1 uses hand-picked heuristics (gradually increasing the bandwidth fluctuation frequency in the training environments), CL2 uses the performance of a rule-based baseline (gradually adding environments where BBR for CC and MPC for ABR perform badly), and CL3 adds traces where the current RL model performs badly (whereas Genet picks the traces where the current RL model is much worse than a rule-based baseline). Figure 18 shows that, compared to these baselines, Genet’s training curves ramp up faster, suggesting that with the same number of training epochs, Genet arrives at a much better policy, which corroborates the reasoning in §3.

Figure 18. Genet’s training ramps up faster than alternative curriculum learning strategies.
Figure 19. Genet outperforms Robustifying (robustifying), which improves RL performance by generating adversarial bandwidth traces, and variants of Genet that use Robustifying’s criteria in BO-based environment selection.

In addition, “Robustifying” (robustifying) (which learns an adversarial bandwidth generator) also tries to improve ABR logic by adding more challenging environments to training. (In the absence of a public implementation, we follow the description in (robustifying), e.g., the unsmoothness weight, and apply it to Pensieve; the only difference is that, for fair comparison with other baselines, we apply it to Pensieve trained on our synthetic training environments. We have verified that our implementation of Robustifying achieves similar improvements in the setting of the original paper. More details are in Appendix A.6.) For a more direct comparison with Genet, we implement a variant of Genet where BO picks configurations that maximize the gap between RL and the optimal reward, penalized by bandwidth unsmoothness with different weights. Figure 19 compares the resulting RL policies with the Genet-trained RL policy and MPC as a baseline on the synthetic traces of Figure 10. We see that they perform worse than Genet-trained ones, i.e., by changing BO’s environment-selection criteria, Genet becomes less effective. Genet outperforms Robustifying because the unsmoothness metric used in (robustifying) may not completely capture the inherent difficulty of bandwidth traces (Figure 5 shows a concrete example).

BO-based search efficiency:  Genet uses BO to explore the multi-dimensional environment space for the environment configuration with a high gap-to-baseline. While BO may not identify the single optimal point in arbitrarily complex relationships between environment parameters and gap-to-baseline, we found it to be a highly pragmatic solution: within a small number of steps (15 by default), it can identify a configuration almost as good as the best found by randomly searching many more points. To show this, we randomly chose intermediate RL models during the Genet training of ABR and CC. Figure 20 shows the gap-to-baseline of the configuration selected by BO for each model within 15 search steps, and compares it with the maximum gap-to-baseline of 100 randomly selected points (a much more expensive alternative).

(a) ABR
(b) CC
Figure 20. BO-based search is more efficient at finding environments with high gap-to-baseline than random exploration in the environment configuration space.

6. Related work

Improving RL for networking:  Some of our findings regarding the lack of generalization corroborate those in previous work (remy; pensieve; aurora; robustifying; rotman2020online; dethise2019cracking). To improve RL for networking use cases, prior work has attempted to apply and customize techniques from the ML literature. For instance, (robustifying) applies adversarial learning by generating relatively smooth bandwidth traces that maximize the RL regret w.r.t. optimal outcomes, (verifying; kazak2019verifying) show that the generalization of RL can be improved by incorporating training environments where a given RL policy violates pre-defined safety conditions, (wield; schaarschmidt2020end) incorporate randomization in the evaluation of RL-based systems, and Fugu (puffer) achieves a similar goal through learning a transmission time predictor in situ. Other proposals seek to safely deploy a given RL policy in new environments (trainingwheel; rotman2020online; shi2021adapting). In many ways, Genet follows this line of work, but it differs in that it systematically introduces curriculum learning, which has underpinned many recent enhancements of RL, and demonstrates its benefits across multiple applications.

Curriculum learning for RL:  There is a substantial literature on improving deep RL with curricula ((narvekar2020curriculum; hacohen2019power; portelas2020automatic) give more comprehensive surveys of this subject). Each component of curriculum learning has been extensively studied, including how to generate tasks (environments) with varying difficulties (silva2018object; schmidhuber2013powerplay), how to sequence tasks (ren2018self; sukhbaatar2017intrinsic), and how to add a new task to training (transfer learning). In this work, we focus on sequencing tasks to facilitate RL training. Note that for general tasks without a clear definition of difficulty (like networking tasks), optimal task sequencing is still an open question. Some approaches, such as self-paced learning (kumar2010self), advocate using easier training examples first, while others prefer harder examples first (chang2017active). Recent work tries to bridge the gap by suggesting that an ideal next training task should be difficult with respect to the current model’s hypothesis, while it is also beneficial to prefer easier points with respect to the target hypothesis (hacohen2019power). In other words, we should prefer an easy environment that the current RL model cannot handle well, which confirms the intuition elaborated in Bengio’s seminal paper (bengio2009curriculum), which hypothesizes that “it would be beneficial to make learning focus on ‘interesting’ examples that are neither too hard or too easy.” Genet is an instantiation of this idea in the context of network adaptation, and it identifies the rewarding (or “interesting”) environments by using domain-specific rule-based schemes to find where the current RL policy has large room for improvement.

Automatic generation of curricula also benefits generalization, particularly when used together with domain randomization (peng18). Several schemes boost RL’s training efficiency by iteratively creating a curriculum of challenging training environments (e.g., (paired; adr)) where the RL performance is much worse than the optimal outcome (i.e., maximal regret). When the optimal policy is unavailable, they learn a competitive baseline (paired) to approximate the optimal policy or a metric (adr) to approximate the regret. Genet falls in this category, but proposes a domain-specific way of identifying rewarding environments using rule-based algorithms.

Some proposals in safe policy improvement (SPI) for RL also use rule-based schemes (ghavamzadeh2016safe; laroche2019safe), though for different purposes than Genet. While Genet uses the performance of rule-based schemes to identify where the RL policy can be maximally improved, SPI uses the decisions of rule-based algorithms to avoid failures during training.

7. Conclusion

We have presented Genet, a new framework to improve the training of deep RL-based network adaptation algorithms. For the first time, we introduce curriculum learning to the networking domain as the key to better RL performance and generalization. To make curriculum learning efficient in networking, the main challenge is automatically identifying the “rewarding” environments that can maximally benefit from further training. Genet addresses this challenge with a simple yet effective idea: the most rewarding network environments are those where the current RL performance falls significantly behind that of a rule-based baseline scheme. Our evaluation on three RL use cases shows that Genet improves RL policies (in both performance and generalization) across a range of environments and workloads.

Ethics:  This work does not raise any ethical issues.


Appendix A Appendix

a.1. Details of RL implementation

The input to the RL algorithm consists of a space of configurations, initial policy parameters, and a predefined total number of training epochs. The space of configurations is constructed from ranges of environment configuration parameters, each marked by its min and max values. Within a training epoch, the space of configurations is uniformly sampled to create K configurations, and for each configuration, N random environments are created. Rollouts are thus collected by running the policy on K×N environments in total to update the policy. Once the policy has been updated for the predefined number of epochs, the RL algorithm stops training and outputs the trained policy.

1: Input: Λ: space of configurations, θ: initial policy parameters, E: # of epochs
2: Output: θ: returned policy parameters
3: for e from 1 to E do
4:     D ← ∅                                  ▷ rollout buffer
5:     for k from 1 to K do                   ▷ K: # configs per epoch
6:         c ← uniformly sampled config in Λ
7:         for n from 1 to N do               ▷ N: # random envs per config
8:             env ← create a simulated env by c
9:             rollout ← rollout policy θ on env
10:            D ← D ∪ {rollout}
11:        end for
12:    end for
13:    θ ← θ updated with:
14:        gradient update on D with rate α
15: end for
Algorithm 1 Traditional Reinforcement Learning (RL)
1: Input: P₀: uniform configuration distribution (equal probability on each configuration), π: rule-based policy
2: Output: θ: final RL policy parameters
3: function Genet(P₀, π)
4:     θ ← random initial policy parameters
5:     P ← P₀                                 ▷ P will be updated and used for training
6:     for i from 1 to I do                   ▷ I: # of exploration iterations
7:         BO.initialize(Λ)                   ▷ initialize with full config space
8:         for j from 1 to J do               ▷ J: # of trial configs by BO
9:             c ← BO.getNextChoice()
10:            g ← CalcBaselineGap(c, θ, π)
11:            BO.update(c, g)
12:        end for
13:        c* ← BO.getDecision()
14:        P ← w · δ(c*) + (1 − w) · P        ▷ weight new config by w and old configs by 1 − w
15:        θ ← UniformDomainRand(P, θ)        ▷ resume training (Algorithm 1) over P
16:    end for
17:    return θ
18: end function
19: function CalcBaselineGap(c, θ, π)
20:    Initialize: G ← ∅
21:    for m from 1 to M do                   ▷ M: # of reward comparisons
22:        env ← create a simulated env by c
23:        r_RL ← rollout RL policy θ on env
24:        r_rule ← rollout rule-based π on env
25:        add r_rule − r_RL to G
26:    end for
27:    return mean(G)
28: end function
Algorithm 2 Genet training framework

a.2. Trace generator logic

ABR:  For the ABR simulation, the link bandwidth trace has the format [timestamp (s), throughput (Mbps)]. Our synthetic trace generator takes four parameters: minimum BW (Mbps), maximum BW (Mbps), BW changing interval (s), and trace duration (s). Each timestamp represents one second, with uniform [-0.5, 0.5] noise. Each throughput value follows a uniform distribution over [min BW, max BW]. The BW changing interval controls how often the throughput changes over time, with uniform [1, 3] noise. The trace duration is the total time length of the trace.
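The generator described above can be sketched as follows; the exact noise handling in the paper's implementation may differ.

```python
import random

def generate_abr_trace(min_bw, max_bw, change_interval, duration, seed=0):
    """Sketch of the ABR trace generator described above; returns a list of
    (timestamp_s, throughput_mbps) pairs. Timestamps advance one second with
    uniform [-0.5, 0.5] jitter; the change interval gets uniform [1, 3] noise."""
    rng = random.Random(seed)
    trace, t = [], 0.0
    bw = rng.uniform(min_bw, max_bw)
    next_change = change_interval + rng.uniform(1, 3)
    while t <= duration:
        if t >= next_change:                          # re-draw the bandwidth
            bw = rng.uniform(min_bw, max_bw)
            next_change = t + change_interval + rng.uniform(1, 3)
        trace.append((t + rng.uniform(-0.5, 0.5), bw))  # jittered timestamp
        t += 1.0
    return trace
```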

CC:  The trace generator in the CC simulation takes seven inputs: maximum BW (Mbps), BW changing interval (s), link one-way latency (ms), queue size (packets), link random loss rate, delay noise (ms), and duration (s). It outputs a series of timestamps with 0.1 s step length and a dynamic bandwidth series. Each bandwidth value is drawn from a uniform distribution over [1, max BW] Mbps. The BW changing interval allows the bandwidth to change at fixed intervals. The link one-way latency is used to simulate packet RTT. The queue size simulates a single queue in a sender-receiver network. The link random loss rate determines the chance of random packet loss. The delay noise determines the magnitude of the Gaussian noise added to each packet’s delay. The trace duration is determined by the duration input.
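A minimal sketch of the bandwidth-series portion of this generator (latency, queue, loss, and delay noise are omitted for brevity):

```python
import random

def generate_cc_bandwidth(max_bw, change_interval, duration, seed=0):
    """Sketch of the CC bandwidth series described above: 0.1 s timestamps,
    with bandwidth re-drawn from U[1, max_bw] every change_interval seconds."""
    rng = random.Random(seed)
    steps = round(duration / 0.1)
    period = max(1, round(change_interval / 0.1))
    series, bw = [], rng.uniform(1, max_bw)
    for i in range(steps):
        if i > 0 and i % period == 0:
            bw = rng.uniform(1, max_bw)      # bandwidth change event
        series.append((round(i * 0.1, 1), bw))
    return series
```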

LB:  We use a workload trace generator similar to that of the Park [park-code] project, where jobs arrive according to a Poisson process and job sizes follow a Pareto distribution with parameters [shape, scale]. In the simulation, all servers process jobs from their queues at identical rates.
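This workload can be sketched with Python's standard library; the function name and parameter values are illustrative.

```python
import random

def generate_lb_workload(num_jobs, arrival_rate, shape, scale, seed=0):
    """Sketch of the LB workload described above: Poisson arrivals
    (exponential inter-arrival times) and Pareto-distributed job sizes."""
    rng = random.Random(seed)
    t, jobs = 0.0, []
    for _ in range(num_jobs):
        t += rng.expovariate(arrival_rate)       # Poisson arrival process
        size = scale * rng.paretovariate(shape)  # Pareto(shape), scaled
        jobs.append((t, size))
    return jobs
```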

a.3. Sequencing training trace adding example

The trace sets in Figure 4 were generated from three configurations. For trace set X, we used BW range 0–4 Mbps and BW changing frequency 4–10 s. For trace set Y, we used BW range 0–1 Mbps and BW changing frequency 0–2 s. For trace set Z, we used BW range 0–3 Mbps and BW changing frequency 2–15 s. As a motivating example, each trace set contains 20 traces to show the testing reward trend.

a.4. Testbed setup

ABR:  To test our model in a client-side system, we first leverage the testbed from Pensieve [pensieve-code], which modifies dash.js (version 2.4) [2] to support MPC, BBA, and RL-based ABR algorithms. We use the “EnvivioDash3” video, whose format follows the Pensieve settings. In this emulation setup, the client video player is a Google Chrome browser (version 85), and the video server (Apache version 2.4.7) runs on the same machine as the client. We use Mahimahi [netravali2015mahimahi] to emulate the network conditions from our pre-recorded FCC [hongzi-fccdata], cellular [norway-data], and Puffer [puffer] network traces, along with an 80 ms RTT, between the client and server. All of the above experiments are performed on UChicago servers.

To compare with Fugu, we then modify the interface to connect the Genet-trained model with Puffer's system [puffer]. This experiment was performed on Azure Virtual Machines.

CC:  We build our CC testbed on the Pantheon [pantheon] platform. Pantheon uses the Mahimahi [netravali2015mahimahi] network emulator together with a network tunnel that records packet status inside the network link. We run customized local network emulation in Mahimahi by providing a bandwidth trace and network configurations, and we run remote network experiments by deploying the Pantheon platform on the nodes shown in Figure 21. Among all the CC algorithms tested, BBR [bbr] and TCP Cubic [cubic] are provided by the Linux kernel and are invoked via iperf3, while PCC-Aurora [aurora] and PCC-Vivace [vivace] are implemented on top of UDP. We train our models in Python with the TensorFlow framework and port the trained models into the Aurora C++ code.

LB:  We implement the testbed within Cassandra (a distributed database) and use different scheduling policies (Genet-trained, LLF) to select the replica. We modify Cassandra's internal read-request routing mechanism (originally, every Cassandra node is both a client and a server) so that one Cassandra node acts as the client and three others as servers. The dataset consists of generated keys of size 100 B and values of size 1 KB. In the Cassandra testbed, each request's job size follows the same Pareto distribution as in the simulator, and each key is replicated to three replicas (three Emulab nodes), so each replica has a copy of each key. Each experiment involves 10000 operations of the workload and is repeated 10 times. All experiments were performed on Chameleon servers [Chameleon-cloud].

Real network testbed:  We also test the Genet-trained ABR and CC policies in real wide-area network paths (depicted in Figure 21), including four nodes reserved from [onl], one laptop at home, and two cloud servers.

Figure 21. Real-world network paths used to test ABR and CC policies.

a.5. Details on reward definition

ABR:  The reward function of ABR is a linear combination of bitrate, rebuffering time, and bitrate change. The bitrate is measured in Kbps, the rebuffering time in seconds, and the bitrate change is the difference between the bitrate of the current video chunk and that of the previous chunk. A reward value is thus computed for each video chunk, and the total reward of a video is the sum of the rewards of all its chunks.
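The per-chunk combination can be sketched as below; the penalty coefficients are illustrative (Pensieve-style linear QoE), not necessarily the exact values used here:

```python
def abr_reward(bitrates_kbps, rebuf_secs, rebuf_penalty=4.3, smooth_penalty=1.0):
    """Sketch of the ABR reward: each chunk contributes its bitrate (scaled
    from Kbps to Mbps) minus a rebuffering penalty minus a smoothness
    penalty on the bitrate change; the video's reward is the sum over chunks."""
    total, last = 0.0, bitrates_kbps[0]
    for br, rebuf in zip(bitrates_kbps, rebuf_secs):
        total += (br / 1000.0
                  - rebuf_penalty * rebuf
                  - smooth_penalty * abs(br - last) / 1000.0)
        last = br
    return total
```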

CC:  The reward function of CC is a linear combination of the throughput (packets per second), the average latency (s), and the packet loss rate over a network connection. During training, a reward value is computed from these metrics observed within each monitor interval, and the total reward is the sum of the rewards of all monitor intervals in a connection.
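A sketch of this per-interval combination follows; the weights are Aurora-style values used for illustration and are an assumption, not the paper's exact coefficients:

```python
def cc_reward(intervals, a=10.0, b=1000.0, c=2000.0):
    """Sketch of the CC reward: each monitor interval is a
    (throughput_pkts_per_s, avg_latency_s, loss_rate) tuple, and the
    connection's reward is the sum of the linear combinations."""
    return sum(a * thr - b * lat - c * loss for thr, lat, loss in intervals)
```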

LB:  The reward function of LB is the average runtime delay of a job set, measured in milliseconds. For each server, we observe the total work waiting in its queue and the remaining work currently being processed. After an incoming job is assigned, the server updates the delays of all active jobs.
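The delay bookkeeping can be sketched as follows; the function and its signature are hypothetical, and the episode reward would be the negative mean of the returned delays:

```python
def job_delay(queues, server_id, job_size, service_rate=1.0):
    """Sketch of LB delay accounting: a job assigned to a server waits for
    all work already queued there, then takes its own processing time."""
    wait = sum(queues[server_id])            # total work ahead of this job
    queues[server_id].append(job_size)       # enqueue the new job
    return (wait + job_size) / service_rate  # this job's delay
```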

| ABR Parameter | RL1 | RL2 | RL3 | Default | Original |
|---|---|---|---|---|---|
| Max playback buffer (s) | [5, 10] | [5, 105] | [5, 500] | 60 | 60 |
| Total video length (s) | [40, 45] | [40, 240] | [40, 800] | 200 | 196 |
| Video chunk length (s) | [1, 6] | [1, 11] | [1, 100] | 4 | 4 |
| Min link RTT (ms) | [20, 30] | [20, 220] | [20, 1000] | 80 | 80 |
| Bandwidth change frequency (s) | [0, 2] | [0, 20] | [0, 100] | 5 | |
| Max link bandwidth (Mbps) | [0, 5] | [0, 20] | [0, 100] | 5 | |
| Min link bandwidth (Mbps) | [0, 0.7max] | [0, 0.4max] | [0, 0.1max] | 0.1 | |
Table 3. Parameters in ABR simulation. Colored rows show the configurations (and their ranges) used in the simulator in the original paper. The synthetic trace generator is described in §A.2.
| CC Parameter | RL1 | RL2 | RL3 | Default | Original |
|---|---|---|---|---|---|
| Maximum link bandwidth (Mbps) | [0.5, 7] | [0.4, 14] | [0.1, 100] | 3.16 | [1.2, 6] |
| Minimum link RTT (ms) | [205, 250] | [156, 288] | [10, 400] | 100 | [100, 500] |
| Bandwidth change interval (s) | [11, 13] | [8, 3] | [0, 30] | 7.5 | |
| Random loss rate | [0.01, 0.014] | [0.007, 0.02] | [0, 0.05] | 0 | [0, 0.05] |
| Queue (packets) | [2, 6] | [2, 11] | [2, 200] | 10 | [2, 2981] |
Table 4. Parameters in CC simulation. Colored rows show the configurations (and their ranges) used in the simulator in the original paper. The synthetic trace generator is described in §A.2. The range of RL1 is defined as 1/9 of the range of RL3 and the range of RL2 is defined as 1/3 of RL3. The CC parameters shown here for RL1 and RL2 are example sets.
| LB Parameter | RL1 | RL2 | RL3 | Default | Original |
|---|---|---|---|---|---|
| Service rate | [0.1, 2] | [0.1, 10] | [0.1, 100] | 1.5 | [2, 4] |
| Job size (byte) | [1, 100] | [1, ] | [1, ] | | [100, 1000] |
| Job interval (ms) | [0.1, 10] | [0.1, 100] | [0.1, 1000] | 100 | 100 |
| Number of jobs | [1, 100] | [1, ] | [1, ] | 1000 | 1000 |
| Queue shuffle frequency (episodes) | [1, 10] | [1, 100] | [1, 1000] | 100 | |
| Queue shuffle probability | [0.1, 0.2] | [0.1, 0.5] | [0.1, 1] | 0.5 | |
Table 5. Parameters in LB simulation. Colored rows show the configurations (and their ranges) used in the simulator in the original paper. The synthetic trace generator is described in §A.2.

a.6. Baseline implementation

Following [robustifying], we train an additional RL model for Robustify that improves the main RL-policy model by generating adversarial network traces for ABR. The state of the adversary model contains the bitrate chosen by the protocol for the previous chunk, the client buffer occupancy, the possible sizes of the next chunk, the number of remaining chunks, and the throughput and download time of the last downloaded video chunk. Its action is to generate the next bandwidth value in the network trace, so as to maximize the gap between the ABR optimal policy and the RL policy minus an unsmoothness penalty, where unsmoothness is the absolute difference between the last two chosen bandwidths. The unsmoothness penalty coefficient is set to 1, as in the original paper.

We use PPO as the training algorithm and train the Robustify adversary model together with an RL model until both converge. Afterward, we add the traces generated by the Robustify model into the RL training process to retrain the RL policy. The PPO parameter settings follow the original paper.

As an alternative implementation, we also use the reward defined in Robustify as the training signal for BO to search for and update environments. For the unsmoothness penalty coefficient, we empirically tried three values: 0.1, 0.5, and 1; in our results, a penalty of 0.5 works best.

a.7. BO search behavior

Figure 22 projects an example trajectory of configurations chosen by BO on a 2-D configuration space (“max link bandwidth” and “bandwidth change interval” in Table 3). It starts with an easy configuration (a large maximum bandwidth value and low bandwidth change frequency). After that, BO gradually lowers the bandwidth value while increasing the bandwidth change frequency, effectively raising the difficulty of the chosen configuration each time. In other words, Genet automatically chooses a sequence of environments with “emergent complexity,” a desirable behavior of RL training [paired].

Figure 22. Exploration by Genet’s Bayesian Optimization in a 2-D configuration space.
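The environment-selection step behind this trajectory can be sketched as below. This is a simplified stand-in: real Genet uses Bayesian optimization rather than this random-candidate search, and the `eval_gap` callback (baseline reward minus the current RL policy's reward on a configuration) is hypothetical:

```python
import random

def select_next_env(eval_gap, bw_range=(0.1, 100.0),
                    interval_range=(0.0, 30.0), n_candidates=16):
    """Pick the next training environment over a 2-D configuration space
    (max link bandwidth, bandwidth change interval): sample candidate
    configurations and promote the one where the RL policy falls furthest
    behind the baseline, i.e., where eval_gap(config) is largest."""
    candidates = [(random.uniform(*bw_range), random.uniform(*interval_range))
                  for _ in range(n_candidates)]
    return max(candidates, key=eval_gap)
```

Iterating this selection as training progresses is what produces the "emergent complexity" trajectory of Figure 22.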