Nebula: Reliable Low-latency Video Transmission for Mobile Cloud Gaming

Mobile cloud gaming enables high-end games on constrained devices by streaming the game content from powerful servers through mobile networks. Mobile networks suffer from highly variable bandwidth, latency, and losses that affect the gaming experience. This paper introduces Nebula, an end-to-end cloud gaming framework to minimize the impact of network conditions on the user experience. Nebula relies on an end-to-end distortion model adapting the video source rate and the amount of frame-level redundancy based on the measured network conditions. As a result, it minimizes the motion-to-photon (MTP) latency while protecting the frames from losses. We fully implement Nebula and evaluate its performance against the state of the art techniques and latest research in real-time mobile cloud gaming transmission on a physical testbed over emulated and real wireless networks. Nebula consistently balances MTP latency (<140 ms) and visual quality (>31 dB) even in highly variable environments. A user experiment confirms that Nebula maximizes the user experience with high perceived video quality, playability, and low user load.








1. Introduction

Figure 1. Nebula minimizes latency while maximizing mobile cloud gaming video quality by performing joint source rate and FEC redundancy control over unpredictable mobile wireless networks.

Cloud gaming enables high-quality video gaming on lightweight clients, supported by powerful servers in the cloud (wu2015enabling). As a highly interactive multimedia application, cloud gaming requires reliable and low-latency network communication (chen2011latencyCG). Mobile cloud gaming (MCG) leverages the pervasiveness of mobile networks to serve video game content on constrained mobile devices. Such networks exhibit unpredictable variations of bandwidth, latency, jitter, and packet losses, significantly impacting the transmission of game content.

Currently, most commercial cloud gaming platforms target sedentary gaming at home. These platforms often prioritize low latency over video quality through protocols such as WebRTC (carrascosa2020cloud; domenico2020network). Although WebRTC minimizes transmission latency, the resulting low video quality can have a detrimental effect on the users' quality of experience (QoE). Hence, improving the visual quality while preserving low latency remains a core challenge. Mobile networks' inherent unpredictability further complicates the transmission of real-time MCG content. Sudden bandwidth drops reduce the amount of data that can be transmitted, while packet losses cause distortions in the decoded video flow. Cloud providers over public WANs present higher latency variation than providers with private network infrastructure (corneo2021surrounded). Adaptive bit rate (ABR) and forward error correction (FEC) have been widely used to minimize the influence of bandwidth variations and packet loss on video transmission (perez2011effect). However, most ABR solutions suffer from inaccurate estimations of the available bandwidth (8424813), while FEC comes at the cost of high overhead. Besides, these two schemes have rarely been combined to optimize network usage and improve the QoE.

In this paper, we introduce Nebula, the first end-to-end framework combining video distortion propagation modelling with adaptive frame-level FEC and joint control of the video encoding bitrate and FEC redundancy for MCG over mobile networks (see Figure 1). Nebula relies on a mathematical model of video distortion that considers, for the first time, the propagation of errors among multiple game frames. As such, contrary to existing solutions that apply FEC to the entire group of pictures (GoP) (wu2017streaming; wu2015leveraging), Nebula applies frame-level FEC and unequally protects GoP frames prioritized by their proximity to the GoP's intra-frame. Nebula thus minimizes the end-to-end latency while significantly attenuating the effect of packet loss and the propagation of distortion in the GoP. Nebula also integrates source rate control to mitigate the effect of bandwidth variation.

We evaluate Nebula against standard TCP Cubic- and WebRTC-based video streaming solutions, and recent research on video streaming (Buffer-Occupancy, BO) (10.1145/2619239.2626296; huang2013bo) and mobile cloud gaming (ESCOT) (wu2017streaming). In both emulated environments and in the wild, Nebula balances motion-to-photon (MTP) latency and visual quality. TCP Cubic may present a higher visual quality, but its MTP latency shoots up under variable bandwidth. WebRTC keeps the MTP latency low at the cost of significant degradation of the visual quality, especially on networks with high jitter. BO and ESCOT critically affect video quality and latency, respectively. A user study performed with 15 participants over a real-life WiFi network confirms the benefits of Nebula on the QoE. Nebula results in higher perceived video quality, playability, and user performance while minimising the task load compared to all other streaming solutions.

Our contribution is fourfold:

  • We propose the first model of end-to-end distortion that accounts for the error propagation over a GoP.

  • We introduce Nebula, an end-to-end framework for joint source/FEC rate control in MCG video streaming.

  • We implement Nebula and the typical transmission protocols within a functional cloud gaming system.

  • We evaluate Nebula over a physical testbed through extensive experiments and a user study. Nebula balances low latency with resilience to distortion, leading to a good QoE.

2. Background and Related Works

Although few works target MCG specifically, there is significant literature on real-time mobile video streaming. This section summarizes the major works on both video streaming and MCG systems.

2.1. Real-time Mobile Video Streaming

FEC leads to a high decoding delay when applied to full GoPs (wu2017streaming; wu2015leveraging). Xiao et al. (xiao2013real) use randomized expanding Reed-Solomon (RS) codes to recover frames using the parity packets of received frames. They propose an FEC coding scheme that groups GoPs as coding blocks, reducing the delay compared to GoP-based FEC (xiao2012supgop). However, they do not address burst packet loss. Yu et al. (yu2014can) study the effects of burst loss and long delay on three popular mobile video call applications. The study highlights the importance of conservative video rate selection and FEC redundancy schemes for better video quality. Frossard et al. (frossard2001jointmpeg2) adapt the source rate and FEC redundancy according to the network performance. Wu et al. (wu2013joint) propose flow-rate-allocation-based joint source-channel coding to cope with burst and sporadic packet loss over heterogeneous wireless networks. Although these works employ FEC for video transmission, they do not address the low-latency requirements of MCG.

Congestion and overshooting are key sources of loss and distortion. WebRTC controls congestion using the Google Congestion Control (GCC) algorithm. GCC adapts the sending bitrate based on the delay measured at the receiver and the loss measured at the sender (jansen2018performance; carlucci2016analysis). As a result, WebRTC favors real-time transmission over video quality. On the other hand, TCP Cubic (tcpcubic2008), one of the most widely deployed TCP variants in mobile networks, does not use latency as a congestion signal. TCP Cubic is sensitive to the large buffers in cellular networks, leading to bufferbloat (gettys2011bufferbloat). The Buffer-Occupancy (BO) algorithm (10.1145/2619239.2626296; huang2013bo) selects the video rate based on the playback buffer occupancy. BO keeps a reservoir of frames at minimum quality and requests a lower rate upon low buffer occupancy signals. Such buffer-based techniques suffer from either low visual quality or lengthy latency, especially in extremely variable settings (10.1145/2619239.2626296).

2.2. Mobile Cloud Gaming

A few works specifically target MCG due to its novelty. Provisioning MCG is challenging owing to the high transmission rate and the limited resources. Chen et al. (chen2019framework) use collaborative rendering, progressive meshes, and 3D image warping to cope with provisioning issues and improve the visual quality. Guo et al. (guo2019qoe) propose a model to meet players' quality requirements by optimizing the cloud resource utilization with reasonable complexity. Outatime (lee2015outatime) renders speculative frames that are delivered one RTT ahead of time, offsetting up to 120 ms of network latency. Wu et al. (wu2015hetnet) distribute the game video data based on estimated path quality and end-to-end distortion. In another work, they adjust the frame selection dynamically and protect GoP frames unequally using forward error correction (FEC) coding (wu2015enabling). They refine the distortion model in (wu2017streaming) to support both source and channel distortion. Fu et al. (fu2016distortion) investigate the source and channel distortion over LTE networks. All the above works apply GoP-level FEC, allowing them to cancel out error propagation at the cost of a high motion-to-photon latency. In this work, we consider for the first time the distortion propagation across the GoP and adapt the frame-level FEC accordingly. As such, Nebula improves the quality of interaction and preserves the video quality beyond the performance of conventional (WebRTC and TCP Cubic) and research (ESCOT, BO) cloud gaming systems.

3. Modelling the Joint source/FEC rate problem

MCG relies on lossy video compression and protocols, leading to distortion in the decoded video. After justifying the need for joint source/FEC rate adaptation, we refine existing distortion models to account for error propagation within a GoP, before formulating the joint source/FEC rate optimization problem.

3.1. Joint Source/FEC Rate Problem

MCG is a highly interactive application that promises high-quality graphics on mobile devices. As such, a key metric of user experience is the MTP latency, defined as the delay between a user action and its consequences on the display. In mobile networks, ensuring a low MTP latency with a high visual quality is the primary challenge.

FEC minimizes the impact of packet losses and the resulting video distortion. However, the redundancy introduces significant overhead that increases latency and reduces the available bandwidth. Moreover, network congestion often leads to bursts of losses (wu2017streaming), against which FEC is often ineffective or even detrimental (yu2008model). Although FEC minimizes retransmission latency, most current works (wu2017streaming; wu2015enabling; wu2015leveraging) on FEC for cloud gaming apply FEC to the entire GoP, which either significantly increases the motion-to-photon delay (all GoP frames must be decoded before display), demands higher bandwidth, or decreases the visual quality. Besides, any large loss burst can prevent the recovery of the entire GoP.

We combine source rate control with FEC redundancy control to mitigate congestion and residual packet losses, and adapt the encoding bitrate of the video accordingly. To minimize the MTP latency, we perform FEC at the frame level instead of the GoP level and vary the amount of redundancy based on the position of the frame in the GoP.

3.2. Distortion Propagation Model

End-to-end distortion represents the difference between the original frame at the sender and the decoded frame at the receiver, measured as the Mean Square Error (MSE):

D = (1 / (K·M·N)) · Σ_{k=1}^{K} Σ_{i=1}^{M} Σ_{j=1}^{N} (f_k(i,j) − f̂_k(i,j))²

where K is the number of frames with resolution M×N, f_k(i,j) is the original pixel value, and f̂_k(i,j) is the reconstructed pixel value at location (i,j) of frame k.

We consider three distortion components:

Encoder distortion: the information lost during lossy compression at the encoder. Encoder distortion decreases with the encoder rate according to empirical parameters fitted to the encoder's configuration (wu2015hetnet).

Transmission distortion: caused by packet losses during network transmission. Depending on the compression scheme, a packet loss may result in a partial or complete frame loss. Transmission distortion is proportional to the packet loss rate, scaled by an empirical parameter (wu2015hetnet).

Decoder distortion: the additional distortion that propagates over multiple frames after a network packet loss. Existing works modelling distortion for FEC encoding apply FEC to the entire GoP and therefore do not need to consider error propagation when modelling video quality. However, with frame-level FEC encoding, losing any packet corrupts the corresponding frame, directly impacting the subsequent frames in the same group of pictures (GoP). The model should thus account for such distortion propagation.

Figure 2. Source, transmission, decoder, and end-to-end distortion vs. source video bitrate, for a fixed packet loss rate and UDP packet size. When considering the inter-frame error propagation (decoder distortion), the end-to-end distortion increases with the source rate.

Let the I-frame rate denote the number of intra-frames per second. We define the decoder distortion as inversely related to the I-frame rate, scaled by an empirical parameter. The end-to-end distortion is the sum of the three components:

D_e2e = D_enc + D_trans + D_dec
Consequently, the PSNR of the pipeline is:

PSNR = 10 · log10(255² / D_e2e)
Figure 2 displays the source, transmission, and decoder distortion, and their combined effect on the end-to-end distortion. We also display the end-to-end distortion as represented in previous models (wu2017streaming). With our model, the end-to-end distortion decreases at first, until the decoder distortion takes over; the end-to-end distortion then increases linearly with the bitrate. By considering inter-frame error propagation for the first time, our model reveals a significant source of distortion at higher source rates.
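For concreteness, this three-component model can be sketched numerically. The functional forms below (inverse-rate encoder term, linear loss term, rate-over-I-frame-rate decoder term) and all parameter values are illustrative assumptions, not the fitted values from our regression:

```python
import math

def end_to_end_distortion(rate_mbps, loss_rate, iframe_rate,
                          theta=2.0, r0=0.1, kappa=400.0, gamma=1.5):
    """Illustrative three-component distortion model (MSE).

    rate_mbps   : source encoding rate (must exceed r0)
    loss_rate   : packet loss probability
    iframe_rate : number of intra-frames per second
    theta, r0, kappa, gamma are placeholder empirical parameters.
    """
    d_enc = theta / (rate_mbps - r0)          # encoder distortion falls with rate
    d_trans = kappa * loss_rate               # transmission distortion grows with loss
    d_dec = gamma * rate_mbps / iframe_rate   # propagation grows with rate, shrinks with I-frame rate
    return d_enc + d_trans + d_dec

def psnr(mse):
    """PSNR in dB for 8-bit video given an MSE distortion."""
    return 10.0 * math.log10(255.0 ** 2 / mse)
```

With these placeholder parameters the end-to-end distortion reproduces the U-shape of Figure 2: it first decreases with the source rate, then increases once the decoder term dominates.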

3.3. Joint source/FEC Coding Model

Figure 3. Nebula's detailed architecture. On the video data line, the server captures frames, video-encodes them into multiple spatiotemporal resolutions, FEC-encodes the output, and sends them to the clients. The client FEC-decodes and video-decodes the incoming stream into frames, and then displays them. On the feedback line, the client measures the network performance (ClientMonitor) and sends the network parameters to the server so as to control the rate and redundancy (ServerController).

Distortion is a function of the available bandwidth and packet loss rate. The system objective is to minimize distortion while meeting the MCG constraints of real-time video transmission, limited and variable available bandwidth, and variable packet loss rate. We represent this objective as an optimization problem.

The available bandwidth between the MCG server and client determines the rate constraint, the upper bound on the final sending rate. The goal of the adaptation scheme is to find the optimal solution that minimizes the end-to-end video distortion, given the network parameters measured at the client side (i.e., RTT, packet loss rate, and MTP latency), the video encoding rate, and the delay constraint. In MCG, interactive response is critical to ensure a high quality of experience, with the MTP latency as the primary metric. The MTP latency is defined as the time between a user action in the game and its effect on the display. It comprises user input delivery, game execution, frame rendering and capturing, video encoding, channel encoding, transmission, video decoding, and playback delay. MTP latency is inversely proportional to both the sending rate and the available bandwidth, and proportional to the queuing delay. The joint source and FEC coding adaptation problem for each frame can be formulated as follows.

minimize D_e2e(R_s, R_f) over the source rate R_s and FEC redundancy rate R_f
subject to:  MTP latency ≤ latency upper bound   (latency constraint)
             R_s + R_f ≤ available bandwidth     (throughput constraint)

In this formulation, a hyper-parameter weights the objective, the upper bound of the MTP latency is set to 130 ms, three empirical parameters depend on the encoder's configuration, and four further parameters are derived from a multivariate regression. In the regression, we use a collective dataset of 12 experiments over both wired and WiFi networks to fit the linear model. Given the number of source packets and the redundancy rate, we derive the total number of packets (the sum of source and redundant packets) as follows:

As the I-frame rate decreases, the decoding distortion, and thus the end-to-end distortion, increase rapidly, as shown in Figure 2. The longer the error propagates, the higher the decoding distortion at the client side. To overcome the inter-frame error propagation that causes the decoding distortion, we protect the frames unevenly based on the frame index within the GoP. The redundant packets ensure higher protection as the frame gets closer to the I-frame (smaller frame index), with an empirical weight that increases or decreases with the packet loss rate. When this weight is zero, all frames are protected equally, and there is no extra protection against error propagation.
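The unequal protection rule can be sketched as follows. The linear decay of the extra redundancy with the frame index is an illustrative assumption, as are the function and parameter names:

```python
def redundancy_for_frame(frame_idx, gop_size, num_source_packets,
                         base_ratio, weight):
    """Number of redundant FEC packets for frame `frame_idx` (0 = I-frame)
    in a GoP of `gop_size` frames.

    base_ratio : baseline redundancy ratio derived from the measured loss rate
    weight     : empirical weight (grows with the loss rate); 0 means equal protection
    """
    # Frames nearer the I-frame cause longer error propagation, so they
    # receive extra redundancy that decays linearly with the frame index.
    position_boost = weight * (gop_size - frame_idx) / gop_size
    ratio = base_ratio * (1.0 + position_boost)
    return max(1, round(num_source_packets * ratio))
```

With a weight of zero every frame receives the same redundancy; with a positive weight the I-frame and its immediate successors receive the most protection.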

3.4. Heuristic Model

Minimizing the encoding distortion requires maximizing the encoding bitrate. We do so heuristically by taking the maximum feasible source bitrate and the minimum redundancy rate at each time step, as follows:

To satisfy the latency and throughput constraints of the optimization problem, the sending bitrate automatically decreases upon experiencing a high round-trip time, while never exceeding the throughput measured at the client. At a time t, the sending rate is upper-bounded by a function of the queuing delay, the difference between the recent and the minimum round-trip time. Experimentally, satisfying the latency and throughput constraints prevents overshooting, thus satisfying the loss constraint implicitly. The formula that determines the sending bitrate is therefore defined as follows:

where t is the current time and T is the total streaming duration; t takes discrete values. In our evaluation, we choose a time interval of 1 second for a 1-minute-long video game session.
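A sketch of this sending-rate bound follows, assuming a simple proportional back-off once the queuing delay (recent minus minimum RTT) exceeds a target; the scaling form and the `target_queue_delay` parameter are illustrative, not the paper's exact formula:

```python
def sending_bitrate(measured_throughput, rtt, min_rtt, target_queue_delay=0.02):
    """Illustrative sending-rate bound: back off from the measured
    throughput as the queuing delay (rtt - min_rtt) grows.

    Rates in Mb/s, delays in seconds.
    """
    queue_delay = max(0.0, rtt - min_rtt)
    # Shrink the bitrate budget proportionally once the queuing
    # delay exceeds the target; otherwise use the full throughput.
    scale = target_queue_delay / max(queue_delay, target_queue_delay)
    return measured_throughput * scale
```

When the RTT sits at its minimum the sender uses the full measured throughput; as queues build up, the budget shrinks in proportion, relieving the congestion before losses occur.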

The system periodically reacts to the loss rate, available bandwidth, and MTP latency by adapting the source and redundancy rates, avoiding the overshooting and distortions that result from sporadic packet loss.

4. Distortion Minimization Framework

Figure 3 illustrates our proposed end-to-end framework. This framework aims to provide MCG systems with error resiliency and achieve optimal video quality by combining source rate control and FEC coding adaptation. After providing a general description of the framework architecture, we will focus on how the adaptive FEC coding and the adaptive source coding models operate.

4.1. Framework Architecture

The framework includes two components, a controller and a parameter tuner. The parameter tuner receives feedback from the client at runtime, tunes the parameters related to the source and FEC encoding rates, and passes them to the controller. The controller combines the video rate controller (which drives the video encoder according to the tuned source coding rate) and the redundancy rate controller (which drives the RLNC coder according to the tuned redundancy rate and packet size). Each frame captured at the server side is compressed using the VP8 codec and split into packets encoded for network transmission using RLNC (heide2018random). The RLNC packets are transmitted over a UDP socket to the client's mobile device. The client decodes the RLNC packets to recover the frame data, which is then video-decoded to recover and display the game frame.

The framework also includes a network performance monitor at the client side that monitors the bandwidth, loss rate, network round-trip time, and total MTP latency. These parameters are sent periodically to the parameter tuner at the server side.

4.2. Adaptive FEC Coder Model

Figure 4. Illustration of VPX and FEC coding. The VPX encoder uses multiple encoders (Enc1-9) for multiple bitrates (resolutions). The parameter tuner determines the encoding and redundancy rates. The FEC encoder applies unequal redundancy to the GoP frames (Section 3.3). The FEC decoder starts decoding as soon as enough packets have arrived.

Our system uses Random Linear Network Coding (RLNC) to protect the game frames against channel loss. The frame data is divided into packets of a given size and linearly combined. The RLNC encoder generates redundant packets at the tuned redundancy rate over an FEC block of data packets (heide2018random). The decoder recovers the frame if enough linearly independent packets in the block are received, as shown in Figure 4. The parameter tuner extracts the packet loss information from the client's report and tunes the redundancy based on the frame index within the GoP (see Section 3.3). The variable FEC coding of the GoP frames limits error propagation, as the frames whose loss propagates the furthest receive the highest protection.
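Under an independent-loss assumption, the benefit of adding redundant RLNC packets can be quantified by the probability that a frame is recoverable, i.e., that at least the block size's worth of the transmitted packets arrive (ignoring the small chance of linearly dependent coding coefficients):

```python
from math import comb

def recovery_probability(k, n, loss_rate):
    """Probability that a frame is recoverable when any k of the n
    transmitted RLNC packets suffice, assuming independent packet losses.
    """
    p_arrive = 1.0 - loss_rate
    # Sum the binomial probabilities of receiving r >= k packets.
    return sum(comb(n, r) * p_arrive ** r * loss_rate ** (n - r)
               for r in range(k, n + 1))
```

For example, at a 5% loss rate, adding two redundant packets to a 10-packet block raises the recovery probability substantially compared to sending the block unprotected. Note that real losses are often bursty, which is why the framework also relies on source rate control rather than redundancy alone.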

4.3. Adaptive Source Coding Model

The bandwidth often shrinks when congestion occurs, leading to bursty traffic loss (wu2017streaming). Injecting more FEC redundant packets is ineffective against congestion losses in the presence of burstiness (yu2008model). As such, we introduce the source rate controller to control both the frequency at which rendered frames are captured and the video encoding rate. The parameter tuner extracts the available bandwidth and latency from the client's report and adjusts the source rate by scaling down the game frames. It adapts to the client's bandwidth while minimizing both the distortion (see Section 3.3) and the latency, and relieving the congestion. Figure 4 illustrates the operation of the source rate control and redundancy control. By encoding the video at several resolutions, the video encoder can dynamically adapt the bitrate of the video to the channel conditions. The video is encoded into nine resolutions: HD (1080p, 720p), High (540p, 480p), Med (376p, 360p), and Low (288p, 270p, and 144p), corresponding to bitrates of (6.5 Mb/s, 4.5 Mb/s), (3 Mb/s, 2 Mb/s), (1.8 Mb/s, 1.2 Mb/s), and (1 Mb/s, 0.6 Mb/s, and 0.2 Mb/s), respectively. The source rate controller selects the bitrate closest to the measured throughput. The resulting video dynamically changes resolution with minimal distortion, as few packets are lost due to bandwidth variations.
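The ladder selection can be sketched directly from the nine bitrates listed above; `select_bitrate` is a hypothetical helper name:

```python
def select_bitrate(measured_throughput_mbps,
                   ladder=(6.5, 4.5, 3.0, 2.0, 1.8, 1.2, 1.0, 0.6, 0.2)):
    """Select the encoding bitrate (Mb/s) closest to the measured
    throughput, following the nine-rung ladder from the text."""
    return min(ladder, key=lambda rate: abs(rate - measured_throughput_mbps))
```

A measured throughput of 5 Mb/s, for instance, maps to the 4.5 Mb/s rung (720p), and any throughput below the lowest rung falls back to 0.2 Mb/s (144p).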

5. Implementation

We implement the distortion minimization framework into a functional prototype system. The framework leaves multiple aspects to the discretion of the developer, which we discuss in this section.

5.1. Video and Network codecs

The framework is designed to be video and FEC codec-agnostic. However, we chose to focus on VP8 for video encoding/decoding and RLNC for FEC in our implementation.

Video codec: We use VP8 (bankoski2011vp8), an open-source codec developed by Google. Like H.264, VP8 compresses the game's frames spatially (intra-frame coding) and temporally (inter-frame coding). VP8 presents similar performance to H.264 (encoding time, video quality) while being open-source. VP8 also combines high compression efficiency with low decoding complexity (bankoski2011vp8tech). We rely on libvpx, the reference software implementation of VP8 developed by Google and the Alliance for Open Media (AOMedia).

FEC codec: We use Random Linear Network Coding (RLNC) for channel coding. RLNC relies on simple linear algebraic operations with random generation of the linear coefficients. As such, RLNC operates without a complex structure. RLNC can generate coded data units from any two or more coded or uncoded data units, locally and on the fly (fouli2018rlnc). The RLNC decoder can reconstruct the source data packets upon receiving a sufficient number of linearly independent packets (see Figure 4). To use the RLNC encoding and decoding functionalities, we use Kodo (pedersen2011kodo), a C++ library for network/channel coding.

5.2. Client

Network Performance Monitor (NPM): This component measures the experienced packet loss rate, bandwidth, round-trip time, and MTP latency. The link throughput is computed every second as the net sum of the received frames' sizes over the net sum of the frames' receiving times:

B_w = (Σ_{i=1}^{F} S_i) / (Σ_{i=1}^{F} t_i)

where F is the number of frames received in the current second, S_i is the size of the i-th received frame, and t_i is the elapsed time to receive it. We start a frame's timer upon the reception of the frame's first packet, and disregard the size of this packet in S_i. The packet loss rate is the ratio of the number of missed packets, counted as missing sequence numbers in the received packets until the successful recovery of the frame, to the total number of packets. The MTP latency is computed as the difference between the time of a user input and the time at which the corresponding frame is displayed on the client screen. The NPM smooths these values by applying a moving average over the latest five measurements.
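The NPM's throughput computation and smoothing can be sketched as follows; the class and method names are hypothetical, not the prototype's actual API:

```python
from collections import deque

class NetworkMonitor:
    """Client-side sketch of the Network Performance Monitor: per-second
    throughput from frame sizes and receive times, smoothed by a moving
    average over the latest five measurements."""

    def __init__(self, window=5):
        self.samples = deque(maxlen=window)

    def record_second(self, frame_sizes, frame_times):
        # Throughput = total received bytes / total receive time for the
        # frames completed in the current second.
        throughput = sum(frame_sizes) / sum(frame_times)
        self.samples.append(throughput)
        return self.smoothed()

    def smoothed(self):
        # Moving average over the retained window of measurements.
        return sum(self.samples) / len(self.samples)
```

The bounded deque discards measurements older than the five-sample window, so a single noisy second only briefly perturbs the reported throughput.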

5.3. Server

The server measures the round-trip time through periodic packet probing (luckie2001towards) at regular intervals.

Parameter Tuner: tunes the source rate and the redundancy rate according to the packet loss rate and network throughput received from the clients. It adjusts the number of redundant packets according to the number of source packets, based on Section 3.3. It also determines the suitable spatial resolution of the game frames and the frame capturing frequency to maximize quality with a source rate lower than the measured throughput.

Redundancy Controller: reads the redundancy rate from the Parameter Tuner and updates the number of redundant packets for the encoded video frame. Likewise, the Rate Controller reads the source rate to update the frame resolution.

5.4. Prototype System Implementation

We implement our system as a Python program with C and C++ bindings. We adopt a multi-process approach to improve performance and prevent a delayed operation from interrupting the entire pipeline. We separate the pipeline into three processes at the server (frame capture, VP8 encoding, and RLNC encoding and frame transmission) and three processes at the client (frame reception and RLNC decoding, VP8 decoding, and display). These processes intercommunicate using managed queues, where Queue and Process are classes of the Python multiprocessing module. We develop a protocol on top of UDP to carry server-to-client video data and client-to-server information. Referring to Figure 3, the first server process captures the frames using the mss library. The video encoding process VP8-encodes them using a custom-written wrapper around libvpx. As libvpx does not currently provide a GPU implementation, encoding and decoding are performed on the machines' CPU. We use the Python bindings of the Kodo library to handle the RLNC operations. The server transmits packets carrying the encoded data to the client using FEC Forward Recovery Realtime Transport Protocol (FRTP) packets. The client recovers the encoded frame using the Kodo RLNC decoder. The VP8 video frames are decoded by libvpx and displayed. The client reports the experienced network conditions periodically using Network Performance/Parameters Report (NPR) packets. The client also listens to the user's input and sends the event information to the server. The server returns the event sequence number in the FRTP packet carrying the corresponding encoded frame, allowing the client to measure the MTP latency. The server also probes the round-trip time periodically using RTTP packets (see Supplementary Materials, Figure 10).
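The multi-process pipeline can be sketched with the multiprocessing Queue and Process classes; the stage functions below are placeholders for the actual capture, VP8, and RLNC steps, not the prototype's code:

```python
from multiprocessing import Process, Queue

def stage(transform, q_in, q_out):
    """Generic pipeline stage: read from q_in, transform, write to q_out.
    A None sentinel shuts the stage down and is propagated downstream."""
    while True:
        item = q_in.get()
        if item is None:
            q_out.put(None)
            break
        q_out.put(transform(item))

def encode(frame):   # placeholder for VP8 encoding
    return f"vp8({frame})"

def fec(payload):    # placeholder for RLNC packetization
    return f"rlnc({payload})"

def run_pipeline(frames):
    """Push captured frames through encode -> fec stages and collect output."""
    q1, q2, q3 = Queue(), Queue(), Queue()
    workers = [Process(target=stage, args=(encode, q1, q2)),
               Process(target=stage, args=(fec, q2, q3))]
    for w in workers:
        w.start()
    for f in frames:
        q1.put(f)
    q1.put(None)  # signal end of stream
    out = []
    while (item := q3.get()) is not None:
        out.append(item)
    for w in workers:
        w.join()
    return out
```

Because each stage runs in its own process, a slow encode of one frame only delays that queue, rather than stalling capture or transmission, which is the motivation for the multi-process design above.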

5.5. Feasibility as a Web Service

Many implementation choices in this section are consistent with current practice in WebRTC. Although most current web browsers provide VP8 encoding and decoding, they do not expose these functions to web applications. Instead, they provide an interface to the main functions of libwebrtc. Similarly, we implement the core functions of Nebula as a C++ library to easily interface with web browsers, with minimal logic in Python. Nebula is implemented primarily in an asynchronous fashion, exposing callbacks to the applications that may use it. It can thus easily be integrated into current web browsers, and its API endpoints can be called through JavaScript with minimal modifications.

6. Evaluation

In this section, we evaluate Nebula through both system and user experiments. We first evaluate the system in terms of objective latency (MTP) and visual quality (PSNR) measures, before inviting users to rate their experience based on subjective metrics. All experiments are performed over a controlled testbed, with both controlled and uncontrolled network conditions, to ensure reproducibility.

6.1. Evaluation Setup

Figure 5. Experimental testbed. It emulates a mobile network's link bandwidth, delay, and packet loss using tc, NetEm, and HTB. A reference link transmits the original frames to compute the distortion.

We characterize Nebula over the physical testbed represented in Figure 5. This testbed is composed of two computers: PC1 as a server (Ubuntu 18.04 LTS, Intel i7-5820K CPU @ 3.30GHz) and PC2 as a client (Ubuntu 18.04 LTS, Intel Core i5-4590 CPU @ 3.30GHz). Using a computer as a client simplifies the implementation of the prototype system and the system measurements. The two computers are first connected through an Ethernet link to ensure control and reproducibility. On PC1, we emulate a wireless link with controlled bandwidth, latency, and packet loss rate using tc, NetEm, and Hierarchy Token Bucket (HTB). We then connect the computers through the eduroam WiFi network to perform in-the-wild experiments. On both machines, we set the kernel HZ parameter to 1,000 to avoid burstiness in the emulated link up to 12 Mb/s.

On this testbed, we compare Nebula to the following baselines (see Table 1): TCP Cubic, WebRTC, Buffer Occupancy (BO) (10.1145/2619239.2626296; huang2013bo), and ESCOT (wu2017streaming). These baselines are standard in network transmission (TCP Cubic) and video streaming (WebRTC), or correspond to the latest related research works (BO, ESCOT). We implement all baselines in the pipeline described in Section 5.4 as follows. TCP Cubic is provided by the Linux kernel. We integrate WebRTC using AioRTC, which we modify to support customized source streams at the server and the remote stream track of WebRTC at the client (see Supplementary Materials, Figure 12). We implement BO as a queue-based playback buffer in the pipeline. Finally, we modify Nebula to perform FEC at the GoP level to implement ESCOT. We implement a user-event listener to send the user's input to the server. The server executes the game logic and renders the frames upon event arrival. The architecture of the WebRTC implementation requires us to measure the visual quality in real time. As such, we use the PSNR, as it is the least computationally intensive measure. We monitor the network conditions and all performance metrics with the same method for all experiments to ensure the consistency of the results.

6.2. Pipeline Characterization

Figure 6. Latency Decomposition of the MCG Pipeline (without the additional network latency).

We first evaluate the prototype system's pipeline presented in Section 5.4. We set up the testbed with a fixed bandwidth of 20 Mb/s, a latency of 50 ms, and a PLR of 1%. These network conditions correspond to a typical client-to-cloud link (ARVE2018) and the minimal requirements of Google Stadia. Over this testbed, we generate, encode, transmit, decode, and display a 1920×1080 video, encoded at a bitrate of 6.5 Mb/s and a frame rate of 30 FPS, with GoPs of size 10.

(a) Emulated Network
(b) Eduroam WiFi Network
Figure 7. MTP latency, network RTT, and PSNR on the emulated network (a) and the eduroam WiFi network (b). On the emulated network (a), WebRTC and Nebula balance low MTP latency and high visual quality. BO achieves lower MTP latency with considerable visual quality degradation, while TCP Cubic and ESCOT suffer from massive MTP latency. On the eduroam WiFi (b), all solutions achieve an MTP latency lower than 130 ms except ESCOT. WebRTC's visual quality collapses due to the high jitter, while TCP Cubic remains stable thanks to the higher bandwidth. In both scenarios, Nebula presents the second best visual quality and the second lowest MTP latency.

Figure 6 presents the latency decomposition of the pipeline without the network latency. Introducing RLNC adds a marginal delay before (0.8 ms) and after (1.1 ms) the transmission. Video encoding and decoding take the most time, 20.8 and 9.1 ms respectively, as libvpx operates solely on the CPU. By leveraging hardware-level video encoding and decoding, as would be the case on mobile devices, we expect to reduce these values to 5-10 ms and achieve a pipeline latency below 30 ms. Nebula thus does not introduce significant latency compared to typical cloud gaming systems, while bringing loss recovery capabilities in a lossy transmission environment.
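The latency budget above can be checked with simple arithmetic. The software figures are the measurements from Figure 6; the hardware-codec figures are the 5-10 ms range projected in the text, not measurements:

```python
# Per-stage pipeline latencies in ms (network latency excluded).
# Software values are from Figure 6; hardware values are the projected range.
stages_sw = {"rlnc_encode": 0.8, "vp8_encode": 20.8,
             "vp8_decode": 9.1, "rlnc_decode": 1.1}

def pipeline_latency(stages):
    """Total pipeline latency as the sum of the stage latencies."""
    return sum(stages.values())

software_total = pipeline_latency(stages_sw)  # ≈ 31.8 ms with libvpx on CPU
# Substitute the projected hardware codec latencies (assumed 10 / 5 ms).
stages_hw = dict(stages_sw, vp8_encode=10.0, vp8_decode=5.0)
hardware_total = pipeline_latency(stages_hw)  # ≈ 16.9 ms, below the 30 ms target
```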

6.3. Emulated Network

To showcase the loss recovery and rate adaptation properties of the considered solutions, we emulate a wireless network with varying bandwidth over our physical testbed. On this link, we set a bandwidth ranging between 2 Mb/s and 10 Mb/s, changing every 5 s. Such values allow us to stress the transmission solutions over variable network conditions. To ensure the reproducibility of the experiments, we generate a sequence of bandwidth allocations over time, represented by the gray filled shape in Figure 8, and average the results over five runs for each solution. The link round-trip delay and loss probability remain constant over time, standing at 20 ms and 1% with a probability of successive losses of 25%, respectively. Over this link, we transmit 60 seconds of a gameplay video, recorded at a resolution of 1080p and encoded in VP8 at various resolutions by the transmission scheme.
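A reproducible bandwidth sequence of this kind can be generated with a seeded random generator; the paper does not specify how its sequence was produced, so the uniform draw and seed below are assumptions:

```python
import random

def bandwidth_trace(duration_s=60, step_s=5, lo_mbps=2, hi_mbps=10, seed=42):
    """Reproducible bandwidth allocation: one value drawn uniformly in
    [lo_mbps, hi_mbps] for every step_s-second interval. The seed fixes
    the sequence so all solutions face identical conditions."""
    rng = random.Random(seed)
    return [round(rng.uniform(lo_mbps, hi_mbps), 1)
            for _ in range(duration_s // step_s)]
```

Replaying the same trace for every solution (and every run) is what makes the five-run averages in Figure 8 comparable across baselines.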

Figure 7(a) represents the MTP latency, network RTT, and average PSNR for all solutions. Only BO satisfies the requirement of MTP latency below 130 ms in such a constrained environment, while Nebula and WebRTC remain close (138.6 ms and 144.8 ms respectively). TCP Cubic collapses under the abrupt bandwidth changes and presents the highest motion-to-photon latency (807.6 ms), with a high variance caused by loss retransmission. Many frames do not arrive on time (330 ms with our pipeline) and are thus dropped by the pipeline. ESCOT also shows high MTP latency (489.8 ms) due to its GoP-level FEC encoding: it must receive the entire GoP before decoding, adding 330 ms of latency at 30 FPS with a GoP of 10. In terms of PSNR, BO maintains its low latency only by significantly decreasing the video quality. TCP Cubic presents the highest PSNR as it does not integrate a mechanism to vary the video encoding quality. Only Nebula and WebRTC balance latency and video quality, with Nebula presenting slightly higher PSNR and lower MTP latency.

Figure 8. Throughput of each connection with variation of the link bandwidth (grey). Nebula uses the bandwidth most efficiently. ESCOT and TCP Cubic tend to overshoot while WebRTC significantly undershoots. BO barely utilizes the available bandwidth.

To better understand this phenomenon, we record the received rate at the client's physical interface and present the throughput of each solution in Figure 8. BO strives to eliminate video playback stalls by keeping a reservoir of frames at a minimum rate. It thus dramatically underestimates the available bandwidth, between 191 and 542 Kb/s. WebRTC consistently sends data at a rate much lower than the available bandwidth, while TCP Cubic tends to overshoot due to its aggressive congestion window increase. Although Nebula and ESCOT rely on the same bandwidth estimation mechanism, ESCOT unsurprisingly tends to slightly overshoot by transmitting the entire GoP at once. By optimizing bandwidth usage, Nebula maximizes the video quality while minimizing latency and losses.

6.4. Wireless Network

Following the evaluation on emulated links, we proceed to an experiment on a real-life uncontrolled WiFi network. We perform the experiment on eduroam, a network commonly used by all students of the university. To ensure the statistical significance of the results, we transmit the video from the previous experiment five times per solution and average the results. We first characterize the network by sending a continuous flow using iPerf over 10 minutes. The network features an average bandwidth of 8.1 Mb/s (std=5.1 Mb/s) and a latency of 22.7 ms (std=38.8 ms). These average values are fairly similar to our emulated network. However, the variability of the available bandwidth is lower, while the latency jitter skyrockets, with up to 70 ms difference between measurement points.

Figure 7(b) represents the MTP latency, network RTT, and average PSNR for all solutions. With the more stable bandwidth, all solutions except ESCOT achieve an MTP latency below 130 ms. Similar to the previous experiment, BO achieves minimal MTP latency, at the cost of a PSNR half that of the other solutions. WebRTC collapses under the high variability of the latency, resulting in a low PSNR. ESCOT still presents high MTP latency by transmitting frames in GoP-sized batches. Finally, Nebula and TCP Cubic perform similarly, with low MTP latency and high PSNR.

From the emulated network and uncontrolled WiFi experiments, only Nebula performs best on average, balancing latency and visual quality consistently. Although TCP's behavior with varying bandwidth and a moderate loss rate is known and expected, we were surprised to notice the collapse of the visual quality of WebRTC in environments with high latency jitter, as is the case in most public wireless and mobile networks. Overall, ESCOT performs poorly due to the GoP size used in the experiments. However, a smaller GoP would lead to larger videos for similar quality, and we expect the PSNR to drop with the MTP latency. Finally, BO often underestimates the bandwidth and consistently chooses the lowest quality video flow, leading to a dramatically low PSNR.

6.5. User Study

Figure 9. Users' perception of the gaming experience under traditional cloud gaming measures (left) and task load (right). Nebula presents the highest visual quality and playability and the lowest mental demand and frustration of all solutions. Score and perceived success are also the highest with Nebula. WebRTC collapses under the high jitter of the WiFi network, resulting in low playability.

Participants and Apparatus: We perform a user study with 15 participants, aged 20 to 40. The participants are recruited on campus and are primarily students and staff. Most participants play video games at least a few times a month and have some familiarity with First Person Shooter games. The participants play the game openarena in a cloud gaming setting using the prototype system described in Section 5 and the transmission techniques defined in Section 6.1 over the eduroam WiFi network.

Protocol: Each participant starts with 5 minutes of free play of the game openarena executed on-device. This phase allows the participants to get familiar with the game and establish a reference for playability and visual quality. During this phase, an operator guides the participants and answers their questions. The participants then play the game for two minutes with each streaming method. The order in which the participants experience each streaming solution follows the balanced Latin square design (williams1949experimental) to avoid learning and order effects. After each run, the participants fill in a short questionnaire on the perceived visual quality and playability on a 5-point Likert scale (1 - bad, 2 - poor, 3 - fair, 4 - good, 5 - excellent), and a simplified version of the NASA TLX (hart2006nasa) survey considering the perceived mental demand, frustration, and success on a [0-20] scale. We do not disclose the streaming methods used in order not to affect the participants' ratings.
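The balanced Latin square ordering can be constructed with the standard Williams procedure. The sketch below handles an even number of conditions; for an odd count (as with the five streaming methods here), the usual extension appends each row's mirror image, which we omit for brevity:

```python
def balanced_latin_square(n):
    """Williams-design balanced Latin square for an even number n of
    conditions: each condition appears once per row and per column, and
    each ordered pair of adjacent conditions occurs exactly once,
    neutralizing learning and order effects."""
    # Build the first-row pattern 0, 1, n-1, 2, n-2, ...
    pattern, lo, hi = [0], 1, n - 1
    while len(pattern) < n:
        pattern.append(lo)
        lo += 1
        if len(pattern) < n:
            pattern.append(hi)
            hi -= 1
    # Each participant p gets the pattern cyclically shifted by p.
    return [[(p + x) % n for x in pattern] for p in range(n)]
```

For example, `balanced_latin_square(4)` yields four orderings in which every condition precedes every other condition exactly once.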

Results: Figure 9 presents the results of the user study, with the 95% confidence intervals as error bars. Nebula results in the highest visual quality and playability, as well as the lowest mental demand and frustration. Nebula also leads to the highest objective (score) and subjective (success) performance indicators. TCP also performs well. However, its inability to reduce the video quality in case of congestion causes transient episodes of high latency that lead the server to drop frames, resulting in lower visual quality and playability. Similar to Section 6.4, BO leads to low latency, explaining the high playability score. However, the video quality drops so low that players have trouble distinguishing what happens on screen, resulting in high mental demand and frustration. WebRTC collapses due to the high jitter, leading to an almost-unplayable experience. Finally, the high latency of ESCOT's GoP-level encoding proves to be highly incompatible with a fast-paced game such as openarena. Overall, Nebula maintains latency consistently low while adapting the visual quality to the best achievable given the network conditions, leading to the highest user satisfaction.

7. Conclusion

This paper introduced Nebula, an end-to-end framework combining per-frame adaptive FEC with source rate control to provide MCG with low-latency yet error-resilient video streaming capabilities. By combining adaptive FEC and source rate adaptation, Nebula minimizes distortion propagation while maximizing bandwidth usage, leading to a high end-to-end PSNR without sacrificing latency. Our system evaluation shows that Nebula balances low latency and high PSNR under all scenarios, even in highly variable conditions where other solutions (TCP Cubic and WebRTC) collapse. Under high latency jitter, it significantly outperforms WebRTC, the current industry standard for video transmission in cloud gaming. Our user study over a real-life public WiFi network confirms these findings, with Nebula consistently presenting higher visual quality, playability, and user performance while requiring a lower workload than the other solutions.

We aim to focus on multiplayer streaming in our future work, where several devices share a given scene. By generating the video in multiple resolutions, our system allows distributing content to clients connected through variable network conditions. We will also continue the study of Nebula over mobile networks to evaluate the effect of mobility on the frame transmission of video games and the resulting user experience.


8. Supplementary Materials

8.1. System Protocol Packets

Figure 10. Illustration of media delivery and RTT probing on the forward channel and feedback reporting and user input on the reverse channel.

In this section, we illustrate Nebula's transmission protocol packets as an integral component of the prototype system implementation (see Section 5.4). Figure 10 illustrates the packet type used to deliver the game frames (i.e., FRTP packets), the one used to collect reports about the network conditions (i.e., NPR packets), the packet type used to probe for the presence of congestion (i.e., RTTP packets), and the one used to send the user input to the server (i.e., Event packets). The server sends packets carrying the encoded game frame's data to the client using FEC Forward Recovery Realtime Transport Protocol (FRTP) packets. FRTP packets contain the event's sequence number if the frame is rendered in response to a user's event. The client periodically reports the experienced network conditions and MTP using Network Performance/Parameters Report (NPR) packets. It also sends user input as an event identifier and number using Event packets. Moreover, the server periodically probes the round-trip time using RTT Probing (RTTP) packets. Eventually, the client can compute the MTP latency by subtracting the timestamp of the sent event from the timestamp of a received frame carrying the corresponding event sequence number.
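The FRTP header and the MTP computation can be sketched as below. The paper does not specify the wire format, so the field widths and layout are hypothetical; only the field names follow the text:

```python
import struct

# Hypothetical FRTP header: packet type (1 B), frame sequence number (2 B),
# event sequence number (4 B), timestamp in microseconds (8 B), all in
# network byte order. The real layout is not given in the paper.
FRTP_HDR = struct.Struct("!BHIQ")

PKT_FRTP, PKT_NPR, PKT_RTTP, PKT_EVENT = range(4)

def pack_frtp(frame_seq, event_seq, ts_us):
    """Serialize an FRTP header carrying the triggering event's number."""
    return FRTP_HDR.pack(PKT_FRTP, frame_seq, event_seq, ts_us)

def mtp_latency_us(event_sent_ts_us, frame_recv_ts_us):
    """MTP latency: time from sending the user's event to receiving the
    frame rendered in response to it (matching event sequence numbers)."""
    return frame_recv_ts_us - event_sent_ts_us
```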

8.2. Heuristic Algorithm Detail

Nebula's rate and FEC adaptation is based on a discrete optimization following the heuristic model (see Section 3.4). Algorithm 1 presents the pseudo-code of the heuristic approach. It takes as input the ascending ordered list of encoding rates, the previous encoding rate, the size of the GoP frame, and the recent network measurements, as well as constants, namely the GoP length and the end-to-end delay upper-bound. It returns the redundancy rate and the video encoding rate. After determining the number of packets and computing the number of redundant packets, it computes the redundancy rate. Afterwards, it determines the video encoding rate heuristically (lines 8-10) based on network throughput and queuing delay, according to Equation 3.4. Finally, it refines the video encoding rate based on the experienced motion-to-photon latency, which ensures the latency constraint is not exceeded: lines 16 and 19 determine the resulting quality level (resolution) and thus the suitable video encoding rate from the list. Given the heuristic algorithm, Figure 11 illustrates how adaptive the redundancy and the total sending rate are in response to packet loss rate and throughput changes. When considering all the frames, however, the redundancy rate shown in Figure 11 increases, reaching 9% on average, as described in Section 8.4.

1: Input: ascending ordered list of encoding rates, previous encoding rate, GoP frame size, measured throughput and queuing delay, MTP, GoP length, end-to-end delay upper-bound
2: Compute the number of source packets of the frame
3: Compute the number of redundant packets
4: Compute the redundancy rate using Equation 3.3
5: Estimate the available throughput
6: Initialize the quality level to the highest rate in the list
7: if 1 - redundancy rate > 0 then
8:     for each quality level in the list do
9:         if the level's rate plus the queuing delay exceeds the available throughput then
10:             decrease the quality level by 1
11:             break
12:         end if
13:     end for
14: end if
15: if a feasible rate was selected and MTP > the delay upper-bound then
16:     decrease the quality level by 1
17: end if
18: if 1 - redundancy rate ≤ 0 or no feasible rate remains then
19:     select the lowest quality level
20: end if
21: Return the redundancy rate and the encoding rate of the selected level
Algorithm 1 Pseudo-code of the heuristic algorithm
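The heuristic described in Section 8.2 can be sketched in Python as follows. The redundancy formula merely stands in for the paper's Equation 3.3, and the queuing-delay threshold and capacity budget are illustrative assumptions rather than Nebula's actual parameters:

```python
import math

def adapt(rates, frame_packets, plr, throughput_mbps,
          queue_delay_ms, mtp_ms, mtp_bound_ms=130, queue_thresh_ms=10):
    """Sketch of the per-frame heuristic: derive the redundancy rate from
    the measured packet loss rate, pick the highest rate in the ascending
    ladder `rates` that fits the estimated capacity, then back off one
    level on latency signals (queuing delay, MTP)."""
    k = frame_packets
    # At least one repair packet per frame (source of the small-frame overhead).
    n_red = max(1, math.ceil(k * plr / (1 - plr)))
    redundancy = n_red / (k + n_red)
    # Highest ladder rate that, including redundancy, fits the capacity.
    budget = throughput_mbps * (1 - redundancy)
    level = 0
    for i, r in enumerate(rates):
        if r <= budget:
            level = i
    # Refine: back off one quality level when latency constraints are violated.
    if (queue_delay_ms > queue_thresh_ms or mtp_ms > mtp_bound_ms) and level > 0:
        level -= 1
    return redundancy, rates[level]
```

With a rate ladder of [1.0, 2.5, 4.0, 6.5] Mb/s, 8 Mb/s of throughput, and 1% loss, the sketch selects 6.5 Mb/s, and drops to 4.0 Mb/s once queuing delay builds up.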

(a) FEC adaptability
(b) Rate adaptability

Figure 11. Illustration of redundancy and sending rate during the streaming of a 1-minute gameplay video over the emulated network (see Section 6.3). (a) FEC redundant information percentage (solid line) vs packet loss rate (dashed line) for frames over 40 packets, and (b) sending rate vs network throughput.
Protocol                                  | Loss Recovery   | Rate Control
TCP Cubic                                 | NACK            | Congestion window
WebRTC                                    | Hybrid FEC/NACK | Congestion-based (GCC)
BO (10.1145/2619239.2626296; huang2013bo) | N/A             | Buffer occupancy
ESCOT (wu2017streaming)                   | GoP-level FEC   | Throughput-latency based
Nebula                                    | Frame-level FEC | Throughput-latency based
Table 1. Baselines considered in the system evaluation

8.3. Baselines

Table 1 lists the baselines, which are: the network transmission standard (TCP Cubic), the video streaming standard (WebRTC), Buffer Occupancy (BO)-based video streaming (10.1145/2619239.2626296; huang2013bo), and GoP-based MCG (ESCOT) (wu2017streaming). We integrate both BO and WebRTC to operate as streaming platforms for mobile cloud gaming (see Figure 12). In both, we integrate the captured frames as a source stream at the server and a remote stream track at the client. We enable RTT probing in WebRTC using its data channel, and we implement a queue-based playback buffer in BO, based on which the Rate Selection component at the client requests the next rate from the server. Such integration of the baselines allows us not only to measure both visual quality and MTP on a per-frame basis but also to create reproducible experiments and findings.

8.4. Overhead

Overall, FEC-based approaches tend to have a higher reliability overhead than ACK-based techniques, which require an overhead equal to the packet loss rate. Figure 13 illustrates the reliability overhead of Nebula and the baselines when streaming over the emulated network described in Section 6.3.

Nebula's satisfactory visual quality is obtained at the expense of redundancy overhead. Nebula presents a redundancy rate of 9% on average, due to the per-frame encoding and the minimum redundancy. Each frame requires a minimum of one redundant packet, and P-frames tend to be small (a couple of packets). Besides, unequally protecting the GoP frames leads to extra redundancy. As such, the overall overhead becomes relatively high compared to GoP-level FEC coding. However, this overhead is compensated by better throughput utilization, minimizing its impact on the transmission.
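The one-repair-packet floor explains why small P-frames dominate the overhead. A minimal sketch, assuming proportional protection on top of the floor (the paper's exact per-frame formula is Equation 3.3):

```python
import math

def frame_overhead(source_packets, plr=0.01):
    """Redundant packets for one frame: protection proportional to the
    packet loss rate, with a floor of one repair packet per frame."""
    redundant = max(1, math.ceil(source_packets * plr))
    return redundant, redundant / source_packets

# A 2-packet P-frame still needs 1 repair packet: 50% overhead,
# while a 40-packet I-frame needs 1 repair packet at 1% loss: 2.5%.
```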

Figure 12. WebRTC’s and BO’s experimental frameworks.
Figure 13. Redundant packets as reliability overhead for each solution. Nebula has the highest, followed by GoP-level FEC (ESCOT), then hybrid FEC/NACK (WebRTC), and finally NACK only (TCP Cubic), while BO has none.