Comyco: Quality-Aware Adaptive Video Streaming via Imitation Learning

08/06/2019 · by Tianchi Huang, et al. · Beijing Kuaishou Technology Co., Ltd.

Learning-based Adaptive Bit Rate (ABR) methods, which aim to learn outstanding strategies without any presumptions, have become one of the research hotspots for adaptive streaming. However, they still suffer from several issues, i.e., low sample efficiency and lack of awareness of video quality information. In this paper, we propose Comyco, a video quality-aware ABR approach that enormously improves learning-based methods by tackling the above issues. Comyco trains the policy by imitating expert trajectories given by an instant solver, which not only avoids redundant exploration but also makes better use of the collected samples. Meanwhile, Comyco attempts to pick the chunk with higher perceptual video quality rather than higher video bitrate. To achieve this, we construct Comyco's neural network architecture, video datasets, and QoE metrics with video quality features. Using trace-driven and real-world experiments, we demonstrate significant improvements in Comyco's sample efficiency in comparison to prior work, with 1700x improvements in the number of samples required and 16x improvements in training time required. Moreover, results illustrate that Comyco outperforms previously proposed methods, with improvements on average QoE of 7.5%-16.79%. Comyco also surpasses the state-of-the-art approach Pensieve by 7.37% on average video quality under the same rebuffering time.




1. Introduction

Recent years have seen a tremendous increase in the demand for watching online videos (Cisco, 2017). Adaptive bitrate (ABR) streaming, which dynamically switches the bitrate of downloaded chunks to restrain rebuffering events while obtaining higher video quality, has become the popular scheme for delivering videos with high quality of experience (QoE) to users (Bentaleb et al., 2018). Recent model-based ABR approaches (§7) pick the next chunk's video bitrate via only the current network status (Jiang et al., 2014), buffer occupancy (Spiteri et al., 2016), or a joint consideration of both factors (Yin et al., 2015). However, such heuristic methods are usually built on presumptions that fail to hold under unexpected network conditions (Mao et al., 2017). Thus, learning-based ABR methods adopt reinforcement learning (RL) to learn strategies without any presumptions, and outperform traditional model-based approaches.

Nevertheless, learning-based ABR methods suffer from two key issues. While recent work (Mao et al., 2017; Gadaleta et al., 2017) often adopts RL methods to train the neural network, such methods use both collected and exploited expert samples inefficiently, which leads to inefficient training (Mendonca et al., 2019). Besides, the majority of existing ABR approaches (Yin et al., 2015; Mao et al., 2017; Akhtar and et al., 2018) neglect video quality information, even though perceptual video quality is a non-trivial feature for evaluating QoE (§5.1, (Huang et al., 2018b)). Thus, despite their abilities to achieve higher QoE objectives, such schemes may generate a strategy that diverges from the actual demand (§2.2).

In this paper, we propose Comyco, a novel video quality-aware learning-based ABR system, aiming to remarkably improve the overall performance of ABR algorithms by tackling the above challenges. Unlike previous RL-based schemes (Mao et al., 2017), Comyco leverages imitation learning (Osa et al., 2018) to train the neural network (NN). This is because, in the ABR scenario, the near-optimal policy can be precisely and instantly estimated from the current state, and the collected expert policies enable the NN to learn quickly. Following this thought (§3.1), the agent is allowed to explore the environment and learn the policy via the expert policies given by the solver (§4.5). Specifically, we propose an instant solver (§4.2) to estimate the expert action with a faithful virtual player (§6.1). Furthermore, we utilize an experience replay buffer (§4.4) to store expert policies and train the NN via a specific loss function (§4.3).


Besides, Comyco aims to select bitrates with high perceptual video quality rather than high video bitrate. To achieve this, we first integrate the information of video contents, network status, and video playback states into Comyco's NN for bitrate selection (§4.1). Next, we consider using VMAF (Rassool, 2017), an objective full-reference perceptual video quality metric, to measure video quality. Concurrently, we also propose a linear video quality-based QoE metric that achieves state-of-the-art performance on the Waterloo Streaming SQoE-III (Duanmu et al., 2018) dataset (§5.1). Finally, we collect a DASH-video dataset with various types of videos, including movies, sports, TV shows, games, news, and music videos (MV) (§5.2).

Using trace-driven emulation (§6.1), we find that Comyco significantly accelerates the training process, with 1700x improvements in the number of samples required compared to recent work (§6.2). Comparing Comyco with existing schemes under various network conditions (§6.1) and videos (§5.2), we show that Comyco outperforms previously proposed methods, with improvements on average QoE of 7.5% - 16.79%. In particular, Comyco performs better than the state-of-the-art learning-based approach Pensieve, with an improvement on average video quality of 7.37% under the same rebuffering time. Further, we present results which highlight Comyco's performance with different hyperparameters and settings (§6.4). Finally, we validate Comyco in real-world network scenarios (§6.5). Extensive results indicate the superiority of Comyco over existing state-of-the-art approaches.

In general, we summarize the contributions as follows:

  1. We propose Comyco, a video quality-aware learning-based ABR system, that significantly ameliorates the weaknesses of learning-based ABR schemes from two perspectives.

  2. To the best of our knowledge, we are the first to leverage imitation learning to accelerate the training process for ABR tasks. Results indicate that utilizing imitation learning can not only achieve fast convergence rates but also improve performance.

  3. Unlike prior work, Comyco picks the video chunk with high perceptual video quality instead of high video bitrate. Results also demonstrate the superiority of the proposed algorithm.

2. Background and Challenges

2.1. ABR Overview

Due to the rapid development of network services, watching videos online has become a common trend. Today, the predominant form of video delivery is adaptive video streaming, such as HLS (HTTP Live Streaming) (17) and DASH (10), which dynamically selects video bitrates according to network conditions and the client's buffer occupancy. A traditional video streaming framework consists of a video player client with a constrained buffer length and an HTTP server or Content Delivery Network (CDN). The video player client decodes and renders video frames from the playback buffer. Once the streaming service starts, the client fetches video chunks in order from the HTTP server or CDN as directed by an ABR algorithm. The algorithm, deployed on the client side, determines the next chunk's video quality via throughput estimation and current buffer utilization. The goal of the ABR algorithm is to provide video chunks with high quality while avoiding stalling or rebuffering (Bentaleb et al., 2018).

2.2. Challenges for learning-based ABRs

Most traditional ABR algorithms (Jiang et al., 2014; Yin et al., 2015; Spiteri et al., 2016) leverage time-series prediction or automatic control methods to make decisions for the next chunk. Nevertheless, such methods are built on presumptions, making it hard to maintain performance across all considered network scenarios (Mao et al., 2017). To this end, learning-based ABR algorithms (Mao et al., 2017; Gadaleta et al., 2017; Huang et al., 2018a) are proposed to solve the problem from another perspective: they adopt deep reinforcement learning (DRL) to train a neural network (NN) from scratch toward a better QoE objective. Despite the outstanding results that recent work has obtained, learning-based ABR methods suffer from several key issues:

The weaknesses of RL-based ABR algorithms. Recent learning-based ABR schemes often adopt RL methods to maximize the average QoE objectives. During training, the agent rolls out a trajectory and updates the NN with policy gradients. However, the effect of the calculated gradients heavily depends on the amount and quality of collected experiences. In most cases, the collected samples seldom represent the optimal policy of the corresponding states, which leads to long convergence times toward a sub-optimal policy (Osa et al., 2018; Mao et al., 2019). Thus, we are facing the first challenge: Considering the characteristics of ABR tasks, can we precisely estimate the optimal direction of gradients to guide the model toward better updates?

Figure 1. We evaluate quality-aware ABR algorithm and bitrate-aware ABR algorithm with the same video on Norway network traces respectively. Results are plotted as the curves of selected bitrate, buffer occupancy and the selected chunk’s VMAF (§5.1,(Rassool, 2017)) for entire sessions.

The unique video quality. What's more, previous learning-based ABR schemes (Yin et al., 2015; Mao et al., 2017) are evaluated by typical QoE objectives that use a combination of video bitrates, rebuffering times, and video smoothness. However, such QoE metrics are limited because these parameters neglect the quality of the video presentation (Wang, 2017). Meanwhile, recent work (Qin et al., 2018; Duanmu et al., 2017) has found that perceptual video quality features play a vital part in evaluating the performance of VBR-encoded ABR streaming services. To demonstrate this, we plot the trajectories generated by a quality-aware ABR algorithm and a bitrate-aware algorithm in Figure 1. As shown, the bitrate-aware algorithm selects the video chunk with higher bitrate but neglects the corresponding video quality, resulting in large fluctuations in perceptual video quality. What's more, the bitrate-aware algorithm often wastes buffer on achieving a slight increase in video quality, which may cause unnecessary stalling events. By contrast, the quality-aware algorithm picks chunks with high and stable perceptual video quality and preserves the buffer occupancy within an allowable range. To this end, one of the better solutions is to describe QoE with perceptual video quality rather than video bitrate alone. We, therefore, encounter the second challenge of our work: How to construct a video quality-aware ABR system?

3. Methods

Motivated by the key challenges (§2.2), we propose Comyco, a video quality-aware learning-based ABR scheme. In this section, we introduce two main ideas of Comyco: training NN via imitation learning (§3.1) and a complete video quality-based ABR system (§3.2).

(a) Supervised learning
(b) Imitation learning
Figure 2. The real trajectory on the ABR task given by imitation learning and supervised learning, where the red background indicates that the player experiences a rebuffering event.

3.1. Training ABRs via Imitation Learning

Recall that the key principle of RL-based methods is to maximize the reward of each action taken by the agent in a given state per step, since the agent doesn't really know the optimal strategy (Sutton and Barto, 2018). However, recent work (Yin et al., 2015; Mao et al., 2017; Spiteri et al., 2018; Akhtar and et al., 2018; Pereira et al., 2018; Huang et al., 2018a) has demonstrated that the ABR process can be precisely emulated by an offline virtual player (§6.1) with complete future network information. What's more, by taking several steps ahead, we can accurately estimate the near-optimal expert policy of any ABR state within an acceptable time (§4.2). To this end, the intuitive idea is to leverage supervised learning methods to minimize the loss between the predicted and the expert policy. Nevertheless, this is impractical because the off-policy method (Sutton and Barto, 2018) suffers from compounding error when the algorithm executes its policy, leading it to drift to new and unexpected states (Laskey et al., 2017). For example, as shown in Figure 2[a], in the beginning the supervised learning-based ABR algorithm fetches bitrates consistent with the expert policy, but once it selects a bitrate with a minor error (after the black line), the state may transition to a situation not included in the dataset, so the algorithm selects another wrong bitrate. Such compounding errors eventually lead to a continuous rebuffering event. As a result, supervised learning methods cannot learn to recover from failures.

In this paper, we aim to leverage imitation learning, a method closely related to both RL and supervised learning, to learn the strategy from expert policy samples. Imitation learning reproduces desired behavior according to expert demonstrations (Osa et al., 2018). The key idea of imitation learning is to allow the NN to explore environments and collect samples (just like RL) while learning the policy from the expert policy (just as in supervised learning). In detail, at step t, the algorithm infers a policy π_t at ABR state s_t. It then computes a loss ℓ_t(π_t(s_t), π*(s_t)) w.r.t. the expert policy π*. After observing the next state s_{t+1}, the algorithm provides a different policy for the next step that will incur another loss ℓ_{t+1}. Thus, over the class of policies Π, we can find the policy π̂ through any supervised learning algorithm (Eq. 1):

    π̂ = argmin_{π ∈ Π} Σ_t ℓ_t(π(s_t), π*(s_t)).    (1)

Figure 2[b] elaborates the principle of imitation learning-based ABR schemes: the algorithm attempts to explore the strategy in a range near the expert trajectory to avoid compounding errors.
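As a toy illustration of this update rule (not Comyco's actual NN), the sketch below uses a linear softmax policy that acts with its own rollout policy but is trained toward the expert's label on each visited state; the class name, feature size, and learning rate are all illustrative assumptions.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

class LinearPolicy:
    """Toy linear softmax policy over a feature vector; a stand-in for the NN."""
    def __init__(self, n_features, n_bitrates, lr=0.1, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(0, 0.01, (n_bitrates, n_features))
        self.lr = lr

    def probs(self, state):
        return softmax(self.W @ state)

    def update(self, state, expert_action):
        """One cross-entropy gradient step toward the expert label."""
        p = self.probs(state)
        grad = p.copy()
        grad[expert_action] -= 1.0          # d(cross-entropy)/d(logits)
        self.W -= self.lr * np.outer(grad, state)

def imitation_step(policy, state, expert):
    """Act with the *current* policy, but learn from the expert's label on
    the state actually visited (the DAgger-flavored loop of Eq. 1)."""
    action = int(np.argmax(policy.probs(state)))  # on-policy action
    policy.update(state, expert(state))           # supervised toward expert
    return action
```

Because the labels come from the expert queried on states the learner itself visits, the learner sees (and learns to recover from) its own mistakes, which plain supervised learning on fixed expert trajectories does not.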

3.2. Video Quality-aware ABR System Setup

Our next challenge is to set up a video quality-aware ABR system. The work is generally composed of three tasks: 1) We construct Comyco's NN architecture by jointly considering several underlying metrics, i.e., past network features, video content features, and video playback features (§4.1). 2) We propose a quality-based QoE metric (§5.1). 3) We collect a video quality DASH dataset which includes various types of videos (§5.2).

4. System Overview

In this section, we describe the proposed system in detail. Comyco's basic system workflow is illustrated in Figure 3. The system is mainly composed of an NN, an ABR virtual player, an instant solver, and an experience replay buffer. We start by introducing Comyco's modules. Then we explain the basic training methodology. Finally, we further illustrate Comyco with a multi-agent framework.

Figure 3. Comyco’s Basic System Work-flow Overview. Training methodologies are available in §4.5.

4.1. NN Architecture Overview

Motivated by the recent success of on-policy RL-based methods, Comyco's learning agent is allowed to explore the environment via traditional rollout methods. In each epoch, the agent selects the next bitrate via a neural network (NN). We now explain the details of the agent's NN, including its inputs, outputs, network architecture, and implementation.

Inputs. We categorize the NN's inputs into three parts: network features, video content features, and video playback features. Details are described as follows.

  • Past network features. The agent feeds the past chunks' network status vector into the NN, where each entry represents the throughput measured for a past video chunk. Specifically, the throughput is computed as the downloaded video size of the chunk, under its selected bitrate, divided by the chunk's download time.

  • Video content features. Besides that, we also consider adding video content features into the NN's inputs to improve its ability to detect the diversity of video contents. In detail, the learning agent uses two vectors: one reflects the video size for each bitrate of the next chunk, and the other stands for the perceptual video quality metric for each bitrate of the next chunk.

  • Video playback features. The last essential feature for describing the ABR state is the current video playback status. The status includes the perceptual video quality metric of the most recently selected video chunk, vectors that stand for the past chunks' buffer occupancy and download time, and the normalized number of video chunks remaining.


Outputs. Same as previous work, we use a discrete action space to describe the output. Note that the output is an n-dim vector indicating the probability of each bitrate being selected under the current ABR state.
Implementation. As shown in Figure 4, for each input type, we use a proper and specific method to extract the underlying features. In detail, we first leverage a single 1D-CNN layer with kernel=4, channels=128, stride=1 to extract the network features into a 128-dim layer. We then use two 1D-CNN layers with kernel=1x4, channels=128 to fetch the hidden features from the future chunk's video content matrix. Meanwhile, we utilize a 1D-CNN or fully connected layer to extract useful characteristics from each of the video playback inputs. The extracted features are passed into a GRU layer, which outputs a 128-dim vector. Finally, the output of the NN is a 6-dim vector representing the selection probability of each bitrate. We use ReLU as the activation function for each feature extraction layer and softmax for the last layer.

Figure 4. Comyco’s NN architecture Overview.

4.2. Instant Solver

Once the sampling module rolls out an action, we aim to design an algorithm to fetch the optimal actions with respect to the current state. Following this thought, we propose the Instant Solver. The key idea is to choose future chunks' bitrates by taking several steps ahead via an offline virtual player, solving a specific QoE maximization problem with the measured future network throughput, since the real future throughput can be collected under both offline environments and real-world network scenarios. Inspired by recent model-based ABR work (Yin et al., 2015), we formulate the problem as demonstrated in Eq. 4.2. In detail, the virtual player consists of a virtual clock, a real-world network trace, and a video description. At each virtual time step, we first calculate the download time of the chunk as its video size, under the selected bitrate, divided by the average measured throughput. We then update the buffer occupancy, accounting for waiting time such as Round-Trip-Time (RTT) and video render time, and capping the buffer at its maximum size. Finally, we refresh the virtual time for the next computation. Note that the problem can be solved with any optimization algorithm, such as memoization, dynamic programming, or Hindsight (Huang et al., 2019). In practice, there exists a trade-off between computation overhead and performance; we list the performance comparison of the instant solver under different horizon settings in §6.4.
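The virtual player's per-chunk dynamics described above can be sketched as follows; the 4-second chunk length, the RTT, and the buffer cap are illustrative defaults, not the paper's exact settings.

```python
def play_chunk(buffer_s, chunk_bytes, throughput_Bps,
               chunk_len_s=4.0, rtt_s=0.08, max_buffer_s=60.0):
    """One step of an offline ABR virtual player (a sketch, not Comyco's exact
    simulator). Returns (new_buffer, rebuffer_time, download_time), in seconds."""
    download_s = chunk_bytes / throughput_Bps + rtt_s   # transfer + waiting time
    rebuffer_s = max(download_s - buffer_s, 0.0)        # stall if buffer drains
    buffer_s = max(buffer_s - download_s, 0.0) + chunk_len_s
    # The client pauses downloading when the buffer would exceed its cap.
    buffer_s = min(buffer_s, max_buffer_s)
    return buffer_s, rebuffer_s, download_s
```

Running this step over a network trace for every candidate bitrate sequence within the lookahead horizon is what lets the instant solver score each sequence by its QoE.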


4.3. Choice of Loss Functions for Comyco

In this section, we start by designing the loss function from fundamental RL training methodologies. The goal of an RL-based method is to maximize the Bellman equation, which is equivalent to maximizing the value function (Sutton and Barto, 2018); the equation is listed in Eq. 3. Thus, given an expert action, we can update the model by minimizing the gap between the rollout policy's action probabilities and the expert action, represented as a one-hot encoding. In this paper, we use the cross-entropy error as the loss function. Note that the loss could also be represented by any traditional behavioral cloning loss (Osa et al., 2018), such as the quadratic, L1, or hinge loss function. In addition, we find that while this loss maximizes the probability of the selected action, it also significantly reduces the aggressiveness of exploration, eventually resulting in sub-optimal performance. Thus, motivated by recent work on RL (Mnih et al., 2016), we add the entropy of the policy to the loss function, which encourages the algorithm to increase its exploration rate in the early stage and discourages it in the later stage. The loss function for Comyco is described in Eq. 4.

Here the rollout policy is selected by the NN, the target is the action probability vector generated by the expert actor, the entropy term measures the entropy of the policy, and a hyperparameter controls the encouragement of exploration. We discuss different settings of this hyperparameter in §6.4.
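A minimal sketch of this loss: cross-entropy toward the expert action minus a weighted entropy bonus. The weight `beta=0.1` here is an illustrative placeholder, not the paper's setting.

```python
import numpy as np

def comyco_loss(pi, expert_onehot, beta=0.1):
    """Imitation loss (a sketch): cross-entropy to the expert's one-hot action,
    minus beta times the policy entropy so that high-entropy (exploratory)
    policies are penalized less early in training."""
    eps = 1e-12
    ce = -np.sum(expert_onehot * np.log(pi + eps))   # imitate the expert
    entropy = -np.sum(pi * np.log(pi + eps))         # exploration bonus
    return ce - beta * entropy                       # minimized during training
```

A confident, correct policy still attains a lower loss than a uniform one, so the entropy term shapes exploration without changing the eventual optimum.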

4.4. Training Comyco with Experience Replay

Recent off-policy RL-based methods (Mnih et al., 2013) leverage an experience replay buffer to achieve better convergence behavior when training a function approximator. Inspired by the success of these approaches, we also create a sample buffer that stores past expert strategies and allows the algorithm to randomly pick samples from the buffer during the training process. We will discuss the effect of utilizing experience replay on Comyco in §6.4.
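A minimal sketch of such a buffer; the capacity and uniform sampling scheme are illustrative choices, not necessarily Comyco's.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity store of (state, expert_action) pairs; old samples are
    evicted automatically once capacity is reached."""
    def __init__(self, capacity=10000, seed=0):
        self.buf = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def push(self, state, expert_action):
        self.buf.append((state, expert_action))

    def sample(self, batch_size):
        """Uniformly sample a training batch (without replacement)."""
        return self.rng.sample(list(self.buf), min(batch_size, len(self.buf)))
```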

4.5. Methodology

We summarize the Comyco’s training methodology in Alg. 1.

1: Training model, Instant Solver (§4.2).
2: Sample training batch.
3: procedure Training
4:     Initialize the model.
5:     Get the initial ABR state.
6:     repeat
7:         Pick an action according to the current policy.
8:         Query the expert action via the Instant Solver.
9:         Store the state and expert action in the replay buffer.
10:        Sample a batch from the buffer.
11:        Update the network using Eq. 4.
12:        Produce the next ABR state according to the action taken.
13:    until Converged
14: end procedure
Algorithm 1 Overall Training Procedure
Figure 5. Comyco’s Multi-Agent Framework Overview.

4.6. Parallel Training

It is notable that the training process can be designed asynchronously, which is quite suitable for a multi-agent parallel training framework. Inspired by the multi-agent training method (Mnih et al., 2016; Huang et al., 2018b), we modify Comyco's framework from single-agent to multi-agent training. As illustrated in Figure 5, Comyco's multi-agent training consists of three parts: a central agent with an NN, an experience replay buffer, and a group of agents, each with a virtual player and an instant solver. For any ABR state, the agents use the virtual player to emulate the ABR process w.r.t. the current states and the actions given by the NN, which is placed on the central agent, and collect the expert action through the instant solver; they then submit the collected information to the experience replay buffer. The central agent trains the NN by picking sample batches from the buffer. Note that this can happen asynchronously among all agents. By default, Comyco uses 12 agents, matching the number of CPU cores of our PC, to accelerate the training process.
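The asynchronous produce/consume pattern can be sketched with threads and a shared queue; `make_sample` stands in for the virtual player plus instant solver, and all names here are illustrative rather than Comyco's actual implementation.

```python
import queue
import threading

def run_agents(n_agents, episodes_per_agent, make_sample, replay_q):
    """Each agent emulates playback and queries the expert (stubbed here by
    `make_sample`), submitting samples to a shared buffer asynchronously."""
    def agent(agent_id):
        for i in range(episodes_per_agent):
            replay_q.put(make_sample(agent_id, i))  # e.g. (state, expert_action)
    threads = [threading.Thread(target=agent, args=(k,)) for k in range(n_agents)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

def central_drain(replay_q):
    """Central agent picks accumulated samples to update the NN."""
    batch = []
    while not replay_q.empty():
        batch.append(replay_q.get())
    return batch
```

In a real setup the central agent would consume continuously while the agents keep producing; here the join-then-drain ordering just keeps the sketch deterministic.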

4.7. Implementation

We now explain how to implement Comyco. We use TensorFlow (Abadi et al., 2016) to implement the training workflow and utilize TFLearn (Tang, 2016) to construct the NN architecture. Besides, we use C++ to implement the instant solver and the virtual player, and then leverage SWIG (Beazley and others, 1996) to compile them as a Python class. In more detail, Comyco takes a sequence of past chunks (length as suggested by (Mao et al., 2017)) and future video chunk features (as suggested by (Yin et al., 2015)) into the NN. We use the Adam optimizer (Kingma and Ba, 2014) with a fixed learning rate to optimize the model. For more details, please refer to our repository.

5. QoE Metrics & Video Datasets

Having constructed Comyco's NN architecture with video content features in mind, we have not yet discussed how to train the NN. Indeed, we lack a video quality-aware QoE model and an ABR video dataset with video quality assessments. In this section, we use VMAF to describe the perceptual video quality of our work. We then propose a video quality-aware QoE metric under the guidance of a real-world ABR QoE dataset (Duanmu et al., 2018). Finally, we collect and publish a DASH video dataset with different VMAF assessments.

(a) Video Bitrate: 0.480
(b) SSIM: 0.592
(c) VMAF: 0.689
Figure 6. Correlation comparison of video presentation quality metrics on the SQoE-III dataset (Duanmu et al., 2018). Results are summarized by Pearson correlation coefficient (Benesty et al., 2009).

5.1. QoE Model Setup

Motivated by the linear QoE metrics widely used to evaluate ABR schemes (Pereira et al., 2018; Yin et al., 2015; Akhtar and et al., 2018; Mao et al., 2017; Bentaleb et al., 2016; Qin et al., 2018), we formulate our QoE metric as:

    QoE = α Σ_{n=1}^{N} q(R_n) − β Σ_{n=1}^{N} T_n + γ Σ_{n=1}^{N−1} [q(R_{n+1}) − q(R_n)]_+ − δ Σ_{n=1}^{N−1} [q(R_n) − q(R_{n+1})]_+

where N is the total number of chunks during the session, R_n represents each chunk's video bitrate, T_n reflects the rebuffering time incurred by chunk n, q(·) is a function that maps the bitrate to the video quality perceived by the user, the third term denotes positive smoothness (switching a video chunk from low to high quality), the fourth term is negative smoothness, and [x]_+ = max(x, 0). The parameters α, β, γ, δ describe the aggressiveness of each term.
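The linear QoE model described above can be sketched as a function of per-chunk quality and rebuffering; the unit default weights below are placeholders, not the paper's fitted parameters.

```python
def linear_qoe(quality, rebuffer, alpha=1.0, beta=1.0, gamma=1.0, delta=1.0):
    """Session-level linear QoE (a sketch).
    quality[n]  -- perceptual quality (e.g. VMAF) of chunk n
    rebuffer[n] -- rebuffering time incurred while fetching chunk n."""
    qoe = 0.0
    for n in range(len(quality)):
        qoe += alpha * quality[n] - beta * rebuffer[n]
        if n > 0:
            diff = quality[n] - quality[n - 1]
            if diff > 0:
                qoe += gamma * diff      # reward upward quality switches
            else:
                qoe -= delta * (-diff)   # penalize downward quality switches
    return qoe
```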

Choice of video quality metric. To better understand the correlation between video presentation quality and the QoE metric, we test the correlation between mean opinion score (MOS) and video quality assessment (VQA) metrics, including video bitrate, SSIM (Hore and Ziou, 2010), and Video Multimethod Assessment Fusion (VMAF) (Rassool, 2017), on the Waterloo Streaming QoE Database III (SQoE-III) (Duanmu et al., 2018), the largest and most realistic dataset for dynamic adaptive streaming over HTTP, which consists of a total of 450 streaming videos created from diverse source content and diverse distortion patterns. Here, SSIM is an image quality metric used by D-DASH (Gadaleta et al., 2017), and VMAF is an objective full-reference video quality metric formulated by Netflix to estimate subjective video quality. Results are collected with the Pearson correlation coefficient (Benesty et al., 2009), as suggested by (Abar et al., 2017). Experimental results (Fig. 6) show that VMAF achieves the highest correlation among all candidates, with improvements in the coefficient of 16.39%-43.54%. Besides, VMAF is also a popular scheme with great potential in both academia and industry (Aaron et al., 2015). We, therefore, choose VMAF as the video quality metric q(·).

QoE model                          | Type   | VQA                            | SRCC
Pensieve's (Mao et al., 2017)      | linear | -                              | 0.6256
MPC's (Yin et al., 2015)           | linear | -                              | 0.7143
Bentaleb's (Bentaleb et al., 2016) | linear | SSIMplus (Rehman et al., 2015) | 0.6322
Duanmu's (Duanmu et al., 2018)     | linear | -                              | 0.7743
Comyco's                           | linear | VMAF (Rassool, 2017)           | 0.7870
Table 1. Performance Comparison of QoE Models on Waterloo Streaming SQoE-III (Duanmu et al., 2018)

QoE Parameters Setup. Recall that the main goal of our paper is to propose a feasible ABR system rather than a definitive QoE metric. In this work, we leverage linear regression to find the proper parameters. Specifically, we randomly divide the SQoE-III database into two parts, 80% for training and 20% for testing. We follow the idea of (Duanmu et al., 2018) and run the training process 1,000 times to mitigate any bias caused by the division of data, setting the four parameters to the resulting fitted values. We use the Spearman rank correlation coefficient (SRCC), as suggested by (Duanmu et al., 2018), to compare the performance of our QoE model against existing models; the median correlation of each regression model is demonstrated in Table 1. As shown, our model outperforms recent work. In conclusion, the proposed QoE model is adequate for evaluating ABR schemes.
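On synthetic data, the weight-fitting step can be sketched with ordinary least squares; the feature layout and any numbers used below are illustrative, not the paper's actual regression setup or fitted values.

```python
import numpy as np

def fit_qoe_weights(features, mos):
    """Least-squares fit of the four linear QoE weights to MOS labels.
    features: (n_sessions, 4) array of per-session sums in the order
              [quality, rebuffering, positive smoothness, negative smoothness].
    mos:      (n_sessions,) mean opinion scores."""
    w, *_ = np.linalg.lstsq(features, mos, rcond=None)
    return w
```

On noiseless synthetic labels the fit recovers the generating weights exactly, which is a quick sanity check before applying the same procedure to real MOS data.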

5.2. Video Datasets

To improve Comyco's generalization ability, we propose a video quality DASH dataset that includes movies, sports, TV shows, games, news, and MVs. Specifically, we first collect video clips at the highest available resolution from YouTube, then leverage FFmpeg (FFmpeg, ) to encode the videos with the H.264 codec and MP4Box (GPAC, ) to dashify the videos according to the encoding ladder of video sequences (Z. Duanmu, A. Rehman, and Z. Wang (2018); 10). Each chunk is encoded with a duration of 4 seconds. During the transcoding process, for each video, we measure the VMAF, VMAF-4K, and VMAF-phone metrics with their respective reference resolutions. In general, the dataset contains 86 complete videos, with 394,551 video chunks and 1,578,204 video quality assessments.

6. Evaluation

6.1. Methodology

Virtual Player. We design a faithful ABR offline virtual player to train Comyco via network traces and video descriptions. The player is written in C++ and Python 3.6 and closely follows several state-of-the-art open-sourced ABR simulators, including Pensieve, Oboe, and Sabre (Spiteri et al., 2018).

Testbed. Our work involves two testbeds. Both server and client run on a 12-core, Intel i7 3.7 GHz CPU with 32GB RAM running Windows 10. Comyco can be trained efficiently on both GPU and CPU. The testbeds are:


  • Trace-driven emulation.  Following the instructions of recent work (Mao et al., 2017; Akhtar and et al., 2018), we utilize Mahimahi (Netravali et al., 2015) to emulate the network conditions between the client (ChromeV73) and ABR server (SimpleHTTPServer by Python2.7) via collected network traces.

  • Real world Deployment. Details are illustrated in §6.5.

Network Trace Datasets. We collect about 3,000 network traces, totaling 47 hours, from public datasets for training and testing:


  • Chunk-level network traces: including HSDPA (Riiser et al., 2013), a well-known 3G/HSDPA network trace dataset, where we use a sliding window to upsample the traces as mentioned by Pensieve (1,000 traces, 1s granularity); FCC (Report, 2016), a broadband dataset (1,000 traces, 1s granularity); and Oboe (Usc-Nsl, 2018) (428 traces, 1-5s granularity), a trace dataset collected from wired, WiFi, and cellular network connections (only for validation).

  • Synthetic network traces: generated with a Markovian model where each state represents an average throughput in the aforementioned range (Mao et al., 2017). We create over 1,000 traces with 1s granularity.

ABR Baselines. In this paper, we select several representative ABR algorithms covering various types of fundamental principles:


  • Rate-based Approach (RB) (Jiang et al., 2014): uses the harmonic mean of the past five throughput measurements as the future bandwidth estimate.

  • BOLA (Spiteri et al., 2016): turns the ABR problem into a utility maximization problem and solves it using a Lyapunov function. It is a buffer-based approach. We use the BOLA implementation provided by the authors (Spiteri et al., 2018).

  • Robust MPC (Yin et al., 2015): takes the buffer occupancy and throughput predictions as inputs and maximizes the QoE by solving an optimization problem. We use C++ to implement RobustMPC and leverage the proposed QoE metric (§5.1) to optimize the strategy.

  • Pensieve (Mao et al., 2017): the state-of-the-art ABR scheme, which utilizes Deep Reinforcement Learning (DRL) to pick bitrates for future video chunks. We use the implementation released by the authors (Mao, 2017) but retrain the model for our work (§6.2).
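The rate-based baseline's throughput predictor above can be sketched as follows (a sketch of the general technique, not the authors' exact code):

```python
def harmonic_mean_predictor(past_throughputs, window=5):
    """Harmonic mean of the last `window` throughput samples, as used by the
    rate-based (RB) baseline. The harmonic mean is dominated by the slowest
    samples, making the estimate robust to isolated throughput spikes."""
    recent = past_throughputs[-window:]
    return len(recent) / sum(1.0 / x for x in recent)
```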

Figure 7. Comparing Comyco with existing ABR approaches under the HSDPA and FCC network traces. Results are illustrated with CDF distributions, QoE improvement curves, and the comparison of several underlying metrics (§5.1).
(a) Epochs
(b) Training Time
Figure 8. Comparing the performance of Comyco with Pensieve and Supervised learning-based method under the HSDPA dataset. Comyco is able to achieve the highest performance with significant gains in sample efficiency.

6.2. Comyco vs. ABR schemes

In this part, we compare the performance of Comyco with recent ABR schemes under several network traces via the trace-driven virtual player. The selected ABR baselines are described in §6.1. We use EnvivoDash3, a widely used reference video clip (Mao et al., 2017; Yin et al., 2015; Pereira et al., 2018; Akhtar and et al., 2018) (10), to measure ABR performance.

Pensieve Re-training. We retrain Pensieve on our datasets (§6.1) with our NN architecture (§4.1) and QoE metric (§5.1). Following recent work (Akhtar and et al., 2018), our experiments use entropy weights within a given range and dynamically decrease the weight at regular intervals. Training takes about 8 hours, and the retrained Pensieve outperforms RobustMPC with an overall average QoE improvement of 3.5% across all sessions. Note that the same experiments improve the linear QoE metric of (Yin et al., 2015) by 10.5%; this indicates that our metric cannot be easily improved, because it reflects real-world MOS scores.
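The entropy-weight decay used in the retraining could be sketched as follows; the text elides the actual range and decay interval, so every constant here (start/end weights, interval, decay factor) is an assumption, not the setting actually used.

```python
def entropy_weight(step, w_start=1.0, w_end=0.1,
                   decay_every=10_000, factor=0.8):
    """Illustrative entropy-weight schedule: start high to encourage
    exploration, multiply by `factor` every `decay_every` training
    iterations, and never drop below `w_end`.
    All constants are assumptions."""
    w = w_start * (factor ** (step // decay_every))
    return max(w, w_end)
```

A higher weight early in training keeps the policy exploratory; decaying it lets the policy sharpen toward its learned strategy later.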

Comparison of Learning-based ABR schemes. Figure 8 illustrates the average QoE of learning-based ABR schemes on the HSDPA dataset, validated for each scheme during training. Results are shown from two perspectives, epoch vs. average QoE and training time vs. average QoE: Comyco requires about 1700x fewer samples and about 16x less training time than Pensieve. As expected (§3.1), the supervised learning-based method fails to find a good strategy and thus performs poorly.

Comyco vs. Existing ABRs. Figure 7 compares the QoE of existing ABR schemes (§6.1). Comyco outperforms recent ABRs, improving average QoE by 7.5%-17.99% across the HSDPA dataset and 4.85%-16.79% across the FCC dataset. We also show the CDF of the per-session QoE improvement of Comyco over existing schemes: Comyco surpasses the state-of-the-art ABR approach Pensieve in 91% of the sessions on the HSDPA dataset and 78% of the sessions on the FCC dataset. We further report underlying metrics including average video quality (VMAF), rebuffering time, positive and negative smoothness, and QoE. Comyco performs well on average video quality, improving on the other ABRs by 6.84%-15.64%, while avoiding rebuffering and bitrate changes as effectively as state-of-the-art schemes.

Figure 9. Comparing Comyco with existing ABR approaches under the Oboe network traces and various types of videos.

6.3. Comyco with Multiple Videos

To better understand how Comyco performs on various videos, we randomly pick videos of different types and use the Oboe network traces to evaluate the proposed methods. The Oboe traces cover diverse network conditions, which makes improving performance more challenging. Figure 9 compares the QoE of state-of-the-art ABR schemes across video types. We find that Comyco generalizes well in all considered scenarios, improving average QoE by 2.7%-23.3% over model-based ABR schemes and 2.8%-13.85% over Pensieve. In particular, Comyco provides high-quality ABR service for movies, news, and sports, all scenarios with frequent scene switches. We also find that Comyco fails to show an overwhelming advantage on music videos; we leave a deeper investigation of this to future work.

6.4. Ablation Study

In this section, we set up several experiments that provide a thorough understanding of Comyco, including its hyperparameters and overhead. Note that we compute the offline-optimal results via dynamic programming with complete knowledge of the network (Mao et al., 2017) and treat them as a baseline.

N                          5      6      7      8       9
Replay Off                 0.883  0.893  0.917  0.932   0.942
Replay On                  0.911  0.921  0.937  0.946   0.960
Time span (Opt. Off) (ms)  1.56   8.74   58.44  389.68  2604.46
Table 2. Comyco with different future steps N and replay strategies.
Entropy weight  0.1    0.01   0.001  0.0001  0
k=4             0.883  0.895  0.904  0.881   0.867
Table 3. Comyco with different entropy weights.

Comparison of different future steps N. Table 2 reports the normalized QoE and raw time span of Comyco for different N and experience replay strategies, collected on the Oboe dataset. As shown, experience replay helps Comyco learn better. Despite the outstanding performance of Comyco with N=9, that setting lacks algorithmic efficiency and can hardly be deployed in practice. Thus, we choose N=8 to balance performance and cost.
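The roughly geometric growth of the time span in Table 2 is what one would expect if the instant solver scores every bitrate sequence over the lookahead horizon; assuming 6 bitrate levels (matching the EnvivoDash3 ladder, an assumption here), each extra step multiplies the work by 6.

```python
def candidate_sequences(num_bitrates, horizon):
    """Number of bitrate sequences an exhaustive lookahead solver
    must score: one branch per bitrate level at every future step."""
    return num_bitrates ** horizon

# Search-space size for horizons 5..9, growing 6x per extra step.
for n in range(5, 10):
    print(n, candidate_sequences(6, n))
```

This exponential blow-up is why N=9, while best-performing, is impractical, and why the paper settles on N=8.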

Comyco with different entropy weights. Further, we compare the normalized QoE of Comyco under different entropy weights on the Oboe dataset. As listed in Table 3, 0.001 is the best setting for our work. The results also demonstrate the effectiveness of the entropy loss (§4.3).

Comyco Overhead. Following (Molchanov et al., 2016), we calculate the number of floating-point operations (FLOPs) of Comyco: it requires only 229 KFLOPs, about 0.15% of the lightweight ShuffleNet V2 (Ma et al., 2018) (146 MFLOPs). In short, we believe Comyco can readily be deployed on PCs and laptops, or even on mobile devices.
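As a quick sanity check on the quoted overhead (both FLOP counts are taken directly from the text):

```python
comyco_flops = 229e3          # 229 KFLOPs, reported above
shufflenet_v2_flops = 146e6   # 146 MFLOPs (Ma et al., 2018)

ratio = comyco_flops / shufflenet_v2_flops
print(f"{ratio:.2%}")         # roughly the ~0.15% quoted
```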

Network  RTT (ms)  Avg. throughput (KB/s)  Std. dev. (KB/s)
4G       65.91     325.23                  53.72
WiFi     15.58     292.98                  27.65
Inter.   193.3     420.15                  266.9
Figure 10. Comparing Comyco with Pensieve and RobustMPC, taken as baselines, under real-world network conditions.

6.5. Comyco in the Real World

We establish a full-system implementation to evaluate Comyco in the wild. The system consists of a video player, an ABR server and an HTTP content server. On the server side, we deploy an HTTP video content server. On the client side, we modify Dash.js (10) to implement the video player and use Chrome to watch the video. Comyco runs as a service on the ABR server. We evaluate the proposed schemes under various network conditions, including a 4G/LTE network, a WiFi network and an international link (from Singapore to Beijing). Figure 10 lists the network status, i.e., the measured average throughput and its standard deviation. In each round, we randomly pick a scheme from the candidates and record the selected bitrate and rebuffering time for each chunk; each experiment takes about 2 hours. Figure 10 shows the average QoE of each scheme under the different network conditions. Comyco again outperforms previous state-of-the-art ABR schemes, improving average QoE by 4.57%-9.93% over Pensieve and by 6.43%-9.46% over RobustMPC.

7. Related Work

ABR schemes. Client-based ABR algorithms (Bentaleb et al., 2018) are mainly organized into two types: model-based and learning-based.

Model-based. The development of ABR algorithms began with the idea of predicting throughput. FESTIVE (Jiang et al., 2014) estimates future throughput via the harmonic mean of the throughput measured over past chunk downloads. Other approaches select an appropriately high bitrate for the next video chunk while avoiding rebuffering based on the observed playback buffer occupancy; BBA (Huang et al., 2015) proposes a linear threshold criterion on the playback buffer size. Mixed approaches, e.g., MPC (Yin et al., 2015), select the next chunk's bitrate by jointly considering the predicted throughput, discounted by a factor based on past prediction errors, and the estimated playback buffer size. In addition, Akhtar et al. (Akhtar and et al., 2018) propose an auto-tuning method to improve the performance of model-based ABRs.
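The throughput-discounting idea in the mixed approaches can be sketched as follows; this is a minimal illustration of the rule described above, not the exact RobustMPC formulation, and the window sizes and error definition are assumptions.

```python
def robust_throughput(history, past_errors):
    """Discount the harmonic-mean throughput prediction by the
    largest recent relative prediction error, so the controller
    plans against a pessimistic bandwidth estimate."""
    harmonic = len(history) / sum(1.0 / c for c in history)
    max_err = max(past_errors) if past_errors else 0.0
    return harmonic / (1.0 + max_err)
```

When recent predictions have been off by up to 25%, for example, the estimate shrinks by a factor of 1.25, trading some video quality for a lower rebuffering risk.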

Learning-based: Several attempts optimize ABR algorithms with reinforcement learning (RL), motivated by the difficulty of tuning mixed approaches to handle diverse network conditions. Pensieve (Mao et al., 2017) uses DRL to select bitrates for future video chunks. D-DASH (Gadaleta et al., 2017) applies deep Q-learning to ABR and provides a comprehensive evaluation against state-of-the-art algorithms. Tiyuntsong (Huang et al., 2018a) optimizes itself toward a rule or a specific reward via self-play between two agents under the same network condition.

Imitation Learning meets Networking. Imitation learning (Hussein et al., 2017) has been widely used in various fields, including networking. Tang et al. (Tang et al., 2018) propose a real-time, deep-learning-based intelligent traffic control method for the Wireless Mesh Network (WMN) backbone via imitation learning. Indigo (Yan et al., 2018) uses DAgger (Ross et al., 2011) to train a congestion-control NN in an offline network emulator.

8. Conclusion

In this work, we propose Comyco, a learning-based ABR system that aims to substantially improve the performance of learning-based algorithms. To overcome the sample inefficiency problem, we leverage imitation learning to guide the algorithm toward better policies rather than relying on stochastic sampling. Moreover, we build the ABR system around video quality, including its NN architecture, datasets and QoE metric. With trace-driven emulation and real-world deployment, we show that Comyco significantly improves performance and effectively accelerates the training process.

Acknowledgement. We thank the anonymous reviewers for the valuable feedback. Special thanks to Huang's wife Yuyan Chen, also named Comyco, for her great support, and happy Chinese Valentine's Day. This work was supported by the National Key R&D Program of China (No. 2018YFB1003703), NSFC under Grant 61521002, Beijing Key Lab of Networked Multimedia, and the Kuaishou-Tsinghua Joint Project (No. 20192000456).


  • A. Aaron, Z. Li, M. Manohara, J. Y. Lin, E. C. Wu, and C. J. Kuo (2015) Challenges in cloud based ingest and encoding for high quality streaming media. In 2015 IEEE International Conference on Image Processing (ICIP), pp. 1732–1736. Cited by: §5.1.
  • M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al. (2016) TensorFlow: a system for large-scale machine learning. In OSDI, Vol. 16, pp. 265–283. Cited by: §4.7.
  • T. Abar, A. B. Letaifa, and S. El Asmi (2017) Machine learning based qoe prediction in sdn networks. In 2017 13th International Wireless Communications and Mobile Computing Conference (IWCMC), pp. 1395–1400. Cited by: §5.1.
  • Z. Akhtar and et al. (2018) Oboe: auto-tuning video abr algorithms to network conditions. In SIGCOMM 2018, pp. 44–58. Cited by: §1, §3.1, §5.1, item , §6.2, §6.2, §7.
  • D. M. Beazley et al. (1996) SWIG: an easy to use tool for integrating scripting languages with c and c++.. In Tcl/Tk Workshop, pp. 43. Cited by: §4.7.
  • J. Benesty, J. Chen, Y. Huang, and I. Cohen (2009) Pearson correlation coefficient. In Noise reduction in speech processing, pp. 1–4. Cited by: Figure 6, §5.1.
  • A. Bentaleb, A. C. Begen, and R. Zimmermann (2016) SDNDASH: improving qoe of http adaptive streaming using software defined networking. In Proceedings of ACM MultiMedia 2016, pp. 1296–1305. Cited by: §5.1, Table 1.
  • A. Bentaleb, B. Taani, A. C. Begen, C. Timmerer, and R. Zimmermann (2018) A survey on bitrate adaptation schemes for streaming media over http. IEEE Communications Surveys & Tutorials. Cited by: §1, §2.1, §7.
  • Cisco (2017) Cisco visual networking index: forecast and methodology, 2016-2021. External Links: Link Cited by: §1.
  • [10] (2019) DASH industry forum — catalyzing the adoption of mpeg-dash. External Links: Link Cited by: §2.1, §5.2, §6.2, §6.5.
  • Z. Duanmu, A. Rehman, and Z. Wang (2018) A quality-of-experience database for adaptive video streaming. IEEE Transactions on Broadcasting 64 (2), pp. 474–487. Cited by: §1, Figure 6, §5.1, §5.1, §5.2, Table 1, §5, footnote 2.
  • Z. Duanmu, K. Zeng, K. Ma, A. Rehman, and Z. Wang (2017) A quality-of-experience index for streaming video. IEEE Journal of Selected Topics in Signal Processing 11 (1), pp. 154–166. Cited by: §2.2.
  • [13] FFmpeg FFmpeg. External Links: Link Cited by: §5.2.
  • M. Gadaleta, F. Chiariotti, M. Rossi, and A. Zanella (2017) D-dash: a deep q-learning framework for dash video streaming. IEEE Transactions on Cognitive Communications and Networking 3 (4), pp. 703–718. External Links: Document, ISSN Cited by: §1, §2.2, §5.1, §7.
  • [15] GPAC MP4BOX. External Links: Link Cited by: §5.2.
  • A. Hore and D. Ziou (2010) Image quality metrics: psnr vs. ssim. pp. 2366–2369. Cited by: §5.1.
  • [17] (2019) HTTP live streaming. Cited by: §2.1.
  • T. Huang, C. Ekanadham, A. J. Berglund, and Z. Li (2019) Hindsight: evaluate video bitrate adaptation at scale. In Proceedings of the 10th ACM Multimedia Systems Conference, MMSys ’19, New York, NY, USA, pp. 86–97. External Links: ISBN 978-1-4503-6297-9, Link, Document Cited by: §4.2.
  • T. Huang, R. Johari, N. McKeown, M. Trunnell, and M. Watson (2015) A buffer-based approach to rate adaptation: evidence from a large video streaming service. ACM SIGCOMM Computer Communication Review 44 (4), pp. 187–198. Cited by: §7.
  • T. Huang, X. Yao, C. Wu, R. Zhang, and L. Sun (2018a) Tiyuntsong: a self-play reinforcement learning approach for abr video streaming. arXiv preprint arXiv:1811.06166. Cited by: §2.2, §3.1, §7.
  • T. Huang, R. Zhang, C. Zhou, and L. Sun (2018b) QARC: video quality aware rate control for real-time video streaming based on deep reinforcement learning. In 2018 ACM Multimedia Conference on Multimedia Conference, pp. 1208–1216. Cited by: §1, §4.6.
  • A. Hussein, M. M. Gaber, E. Elyan, and C. Jayne (2017) Imitation learning: a survey of learning methods. ACM Computing Surveys (CSUR) 50 (2), pp. 21. Cited by: §7.
  • J. Jiang, V. Sekar, and H. Zhang (2014) Improving fairness, efficiency, and stability in http-based adaptive video streaming with festive. TON 22 (1), pp. 326–340. Cited by: §1, §2.2, item , §7.
  • D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §4.7.
  • M. Laskey, J. Lee, R. Fox, A. Dragan, and K. Goldberg (2017) Dart: noise injection for robust imitation learning. arXiv preprint arXiv:1703.09327. Cited by: §3.1.
  • N. Ma, X. Zhang, H. Zheng, and J. Sun (2018) Shufflenet v2: practical guidelines for efficient cnn architecture design. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 116–131. Cited by: §6.4.
  • H. Mao, R. Netravali, and M. Alizadeh (2017) Neural adaptive video streaming with pensieve. In Proceedings of the 2017 ACM SIGCOMM Conference, pp. 197–210. Cited by: §1, §1, §1, §2.2, §2.2, §3.1, §4.7, §5.1, Table 1, item , item , item , §6.2, §6.4, §7.
  • H. Mao, S. B. Venkatakrishnan, M. Schwarzkopf, and M. Alizadeh (2019) Variance reduction for reinforcement learning in input-driven environments. international conference on learning representations. Cited by: §2.2.
  • Mao (2017) Hongzimao/pensieve. External Links: Link Cited by: item .
  • R. Mendonca, A. Gupta, R. Kralev, P. Abbeel, S. Levine, and C. Finn (2019) Guided meta-policy search. arXiv preprint arXiv:1904.00956. Cited by: §1.
  • V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu (2016) Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pp. 1928–1937. Cited by: §4.3, §4.6.
  • V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller (2013) Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602. Cited by: §4.4.
  • P. Molchanov, S. Tyree, T. Karras, T. Aila, and J. Kautz (2016) Pruning convolutional neural networks for resource efficient inference. arXiv preprint arXiv:1611.06440. Cited by: §6.4.
  • R. Netravali, A. Sivaraman, S. Das, A. Goyal, K. Winstein, J. Mickens, and H. Balakrishnan (2015) Mahimahi: accurate record-and-replay for http. pp. 417–429. Cited by: item .
  • T. Osa, J. Pajarinen, G. Neumann, J. A. Bagnell, P. Abbeel, J. Peters, et al. (2018) An algorithmic perspective on imitation learning. Foundations and Trends® in Robotics 7 (1-2), pp. 1–179. Cited by: §1, §2.2, §3.1, §4.3.
  • P. G. Pereira, A. Schmidt, and T. Herfet (2018) Cross-layer effects on training neural algorithms for video streaming. In Proceedings of the 28th ACM SIGMM Workshop on Network and Operating Systems Support for Digital Audio and Video, pp. 43–48. Cited by: §3.1, §5.1, §6.2.
  • Y. Qin, S. Hao, K. R. Pattipati, F. Qian, S. Sen, B. Wang, and C. Yue (2018) ABR streaming of vbr-encoded videos: characterization, challenges, and solutions. In Proceedings of CoNeXT 2018, pp. 366–378. Cited by: §2.2, §5.1.
  • R. Rassool (2017) VMAF reproducibility: validating a perceptual practical video quality metric. In Broadband Multimedia Systems and Broadcasting (BMSB), 2017 IEEE International Symposium on, pp. 1–2. Cited by: §1, Figure 1, §5.1, Table 1.
  • A. Rehman, K. Zeng, and Z. Wang (2015) Display device-adapted video quality-of-experience assessment. In Human Vision and Electronic Imaging XX, Vol. 9394, pp. 939406. Cited by: Table 1.
  • M. F. B. Report (2016) Raw data measuring broadband america 2016. Note: [Online; accessed 19-July-2016]. Cited by: item .
  • H. Riiser, P. Vigmostad, C. Griwodz, and P. Halvorsen (2013) Commute path bandwidth traces from 3g networks: analysis and applications. In Proceedings of the 4th ACM Multimedia Systems Conference, pp. 114–118. Cited by: item .
  • S. Ross, G. Gordon, and D. Bagnell (2011) A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pp. 627–635. Cited by: §7.
  • K. Spiteri, R. Sitaraman, and D. Sparacio (2018) From theory to practice: improving bitrate adaptation in the dash reference player. In Proceedings of the 9th MMSys, pp. 123–137. Cited by: §3.1, item , §6.1.
  • K. Spiteri, R. Urgaonkar, and R. K. Sitaraman (2016) BOLA: near-optimal bitrate adaptation for online videos. In INFOCOM 2016, IEEE, pp. 1–9. Cited by: §1, §2.2, item .
  • R. S. Sutton and A. G. Barto (2018) Reinforcement learning: an introduction. MIT press. Cited by: §3.1, §4.3.
  • F. Tang, B. Mao, Z. M. Fadlullah, N. Kato, O. Akashi, T. Inoue, and K. Mizutani (2018) On removing routing protocol from future wireless networks: a real-time deep learning approach for intelligent traffic control. IEEE Wireless Communications 25 (1), pp. 154–160. External Links: Document, ISSN 1536-1284 Cited by: §7.
  • Y. Tang (2016) TF. learn: tensorflow’s high-level module for distributed machine learning. arXiv preprint arXiv:1612.04251. Cited by: §4.7.
  • Usc-Nsl (2018) USC-nsl/oboe. External Links: Link Cited by: item .
  • Z. Wang (2017) Video qoe: presentation quality vs. playback smoothness. External Links: Link Cited by: §2.2.
  • F. Y. Yan, J. Ma, G. D. Hill, D. Raghavan, R. S. Wahby, P. Levis, and K. Winstein (2018) Pantheon: the training ground for internet congestion-control research. In 2018 USENIX Annual Technical Conference (USENIXATC 18), pp. 731–743. Cited by: §7.
  • X. Yin, A. Jindal, V. Sekar, and B. Sinopoli (2015) A control-theoretic approach for dynamic adaptive video streaming over http. In ACM SIGCOMM Computer Communication Review, pp. 325–338. Cited by: §1, §1, §2.2, §2.2, §3.1, §4.2, §4.7, §5.1, Table 1, item , §6.2, §6.2, §7.