Data-Driven Path Selection for Real-Time Video Streaming at the Network Edge

by   Sabur Baidya, et al.
University of California, Irvine

In this paper, we present a framework for the dynamic selection of the wireless channels used to deliver information-rich data streams to edge servers. The approach we propose is data-driven, where a predictor, whose output informs the decision making of the channel selector, is built from available data on the transformation imposed by the network on previously transmitted packets. The proposed technique is contextualized to real-time video streaming for immediate processing. The core of our framework is the notion of probes, that is, short bursts of packets transmitted over unused channels to acquire information while generating a controlled impact on other active links. Results indicate a high accuracy of the prediction output and a significant improvement in terms of received video quality when the prediction output is used to dynamically select the used channel for transmission.



I Introduction

Recently introduced paradigms where machines collaborate to accomplish a common task are imposing increasingly stringent constraints on the Quality of Service (QoS) achieved by the underlying communication layers. A prominent example of such a model, which has been receiving considerable attention, is edge computing [bonomi2012fog, shi2016promise, baidya2017netselect], where a compute-capable machine positioned at the network edge provides data processing services to mobile devices interconnected through a low-latency, one-hop wireless link.

However, the volatile nature of wireless links may induce periods of time where the ability of mobile devices to deliver data to edge servers degrades. For instance, bursts of data transmissions from other wireless terminals active on the same frequency channel, or a temporary loss of line-of-sight connectivity to the access point or base station in urban environments, may drastically reduce the achievable data transfer rate and result in large delays or packet loss. Intuitively, these issues are particularly relevant when edge computing is used to support real-time, mission-critical applications.

The overall objective of this paper is to maximize the integrity of data streams emitted from mobile devices and directed to the network edge under the stringent capture-to-delivery delay constraints imposed by real-time applications. We specifically focus on one of the most challenging and relevant applications: real-time video streaming for immediate edge-assisted analysis. We consider a scenario where a mobile device and an edge server can communicate over multiple wireless channels – for instance, multiple Wi-Fi channels associated with one or more access points – and make the following assumptions: (i) due to mobility and activity from other active devices, the characteristics of each channel – e.g., short-term throughput and packet-wise delay – present complex dynamics at different time scales; and (ii) devices can acquire information on the channel state – abstracted here as a transformation imposed on the packet stream – only by transmitting packets over available channels and observing their outcome.

In this challenging environment, we formulate a channel selection problem whose goal is twofold: (i) maximizing the perceived quality of the video stream at the edge server, while (ii) minimizing the impact of the information flow on the QoS of other active data streams. At the core of the problem is reliable channel prediction, that is, the ability of the channel selection logic to forecast the future performance of the video stream to inform selection decisions. Critically, as only the channel currently being used is observed, the selection logic has no information about the current state of the other channels. In order to acquire complete state information, all the channels would need to be used simultaneously, thus possibly degrading the QoS of other data streams. To solve this impasse, we propose a probe-based approach, where prediction over unused channels is informed by the transformation applied by the channel to a train of short packets periodically transmitted.

In summary, the paper makes the following contributions:

(a) By detailed simulation, we create a comprehensive dataset based on real-world application layer traffic traces.

(b) We characterize the information on the future Peak Signal to Noise Ratio (PSNR) of the video stream contained in features of the packet streams – both video and probes.

(c) We develop a prediction framework based on state-of-the-art classification algorithms, including Convolutional Neural Networks (CNNs).

(d) Based on the predictor’s output, we implement an algorithm dynamically selecting the channel which will most likely provide the highest PSNRs in a predefined temporal window.

The rest of the paper is organized as follows. Section II discusses related work. Section III presents the considered scenario and channel selection problem, and introduces the prediction strategy. In Section IV, we describe the network environment and the dataset. Section V presents the classification and selection approach. In Section VI, we provide the performance evaluation, and Section VII concludes the paper.

II Related Work

Real-time video streaming over wireless networks often needs to compromise its QoS to adapt to the inconsistencies in the communication channels. Dynamic Adaptive Streaming over HTTP (DASH) [stockhammer2011dynamic] and similar protocols [kua2017survey, zambelli2009iis] were developed to adapt the quality of the transmitted video – that is, its resolution, frame rate, etc. – to the capacity of the channel. However, leveraging other available channels can improve the QoS of the video stream. Protocols such as Multipath TCP (MPTCP) [barre2011multipath] were developed to harness the capacity of multiple communication paths, but they introduce challenges in terms of packet reordering and head-of-line blocking when the paths are highly diverse. Multi-Path DASH (MPDASH) [han2016mp] has been proposed to extend DASH to multi-path scenarios. However, MPDASH is based on MPTCP, and thus comes with all the limitations of TCP in real-time streaming scenarios. Additionally, TCP-based techniques are not a good match for wireless scenarios, whose conditions inherently vary quickly. Finally, these techniques presume the use of all the channels simultaneously, with a possibly long adaptation time when a path is added or removed, an event which may frequently occur in mobile wireless networks. For all these reasons, for real-time video streaming we choose to use the best of the available communication paths and to stream over UDP.

However, proactively selecting the best wireless path while ensuring stability of the video over a time window requires accurate prediction of the wireless network conditions. A number of recent contributions use machine learning techniques to predict the state of wireless channels. For instance, a decision tree based classification algorithm is employed in [abbas2013learning] to choose the best available network between a 3G network and a WLAN using features such as RSSI, location and type of application. In [kajita2015throughput], Support Vector Machines (SVM) are used to estimate the throughput and delay in a WiFi network, given the traffic volume and signal strength of active interferers as input features. A recent work presents “Pensieve”, an Adaptive Bit Rate (ABR) video transmission framework based on reinforcement learning and neural networks. The primary goal is to learn from past decisions on the output bit rate with respect to the channel state. However, these prediction techniques are focused on the physical layer, with no consideration of channel use patterns. Here we address a problem where, due to the presence of other data streams, purely physical-layer frameworks, e.g., SNR-based prediction, fail to capture the upper-layer interactions that arise from coexistence.

III Preliminaries

The development of a predictive framework to dynamically select channels providing the desired quality to a data stream in presence of complex underlying traffic and channel gain presents two main technical challenges:

(i) Develop an effective methodology to acquire information on the state of all available channels; and

(ii) Build effective strategies mapping the acquired state information onto future QoS or application layer performance of the data stream.

Figure 1: High level schematics of the channel selection framework based on probe and data classifiers.

Fig. 1 shows the general diagram of the proposed framework: features extracted from the video and probe streams are used to predict the future quality of the video if any of the available channels were to be used for transmission. The individual components of the framework and the rationale behind its design are explained in the following.

III-A Scenario

We consider a scenario where a video stream captured in real-time by a mobile user needs to be transported to an edge server. The user and the edge server can communicate through $N$ independent wireless channels – for instance corresponding to different access points in the vicinity – indexed with $i = 1, \ldots, N$. The channels are shared with other users and data streams, which all contend for available frequency resources.

The video stream is encoded, packetized using UDP, and transmitted over one of the wireless links to the edge server. The frame rate of the video is $F$ frames/s, and we assume that all the packets referring to frame $f$ are injected into the transport layer socket at time $f/F$.

III-B Selection Problem

Due to practical considerations, we define the selection process at the temporal granularity of Groups of Pictures (GoPs). Let’s index GoPs with $g$. We then denote the average Peak Signal-to-Noise Ratio (PSNR) of GoP $g$ with $\bar{P}_g$, computed as the average PSNR of all the component frames, and define the control variable $c_g \in \{1, \ldots, N\}$ corresponding to the channel, among the $N$ available ones, selected for the transmission of GoP $g$.

In order to maximize the quality of the video, we simply aim at the selection of the channel providing the maximum average PSNR, that is,

$$c_g^{*} = \arg\max_{c_g \in \{1, \ldots, N\}} \bar{P}_g(c_g),$$

where $\bar{P}_g(c_g)$ denotes the average PSNR of GoP $g$ when it is transmitted over channel $c_g$.

Rather clearly, the selector needs to make the decision before the GoP is transmitted, and selection is necessarily based on prediction.

III-C Prediction

We define prediction as a classification problem, where classes are defined based on functionals of the GoPs’ PSNR within a window of $W$ future GoPs. This is motivated by the inherent difficulty in the short-term prediction of individual GoPs’ PSNR values in network environments characterized by mobility and complex traffic patterns. Let’s set the current GoP index to $0$. We then define classes as

$$z = \phi(\bar{P}_1, \ldots, \bar{P}_W),$$

where $\phi$ is a functional mapping of the vector $(\bar{P}_1, \ldots, \bar{P}_W)$ of average GoP PSNRs to the class $z$.

Several choices of the classification function are possible, including a simple discretized average over the PSNRs within the temporal window. Herein, we choose a function which aims at preserving a generally good quality of most GoPs in the window. Specifically, we define

$$z = \mathbb{1}\left( \sum_{g=1}^{W} \mathbb{1}\left(\bar{P}_g > \theta\right) \geq k \right),$$

where $\theta$ is a predefined PSNR threshold, and $\mathbb{1}(\cdot)$ is the indicator function. Thus, we say that the channel in the window is “good” ($z = 1$) if at least $k$ GoPs have a PSNR larger than $\theta$. The selector, then, will attempt to assign the stream to a good channel.
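The labelling rule above can be illustrated with a minimal Python sketch. The function name `window_class` and the example PSNR values are hypothetical; the default threshold (40 dB) and count (k = 6) are the values the paper reports in Section V.

```python
from typing import Sequence


def window_class(gop_psnr_db: Sequence[float],
                 threshold_db: float = 40.0,
                 k: int = 6) -> int:
    """Return 1 ("good") if at least k GoPs in the window have a PSNR
    strictly larger than the threshold, and 0 otherwise."""
    above = sum(1 for psnr in gop_psnr_db if psnr > threshold_db)
    return 1 if above >= k else 0


# Hypothetical window of 10 future GoPs (average PSNR per GoP, in dB).
window = [42.1, 38.0, 45.3, 41.0, 39.5, 44.2, 47.8, 36.9, 43.0, 40.5]
label = window_class(window)  # 7 GoPs exceed 40 dB, so the label is 1
```

Because the label depends only on how many GoPs clear the threshold, a single very poor GoP does not flip a window that is otherwise good.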

We now define the input of the predictor. As mentioned earlier, due to both technological and environmental parameters and their variations, the channel applies a transformation to the packets transmitted by the mobile user. Define $\mathbf{X}_g$ as the matrix composed of the row vectors $\mathbf{t}_g$ and $\mathbf{s}_g$, whose elements are the injection time and size of all the packets transmitted during GoP $g$. We characterize such transformation using the function $\mathcal{T}$ as follows

$$(\mathbf{d}_g, \mathbf{a}_g) = \mathcal{T}(\mathbf{X}_g, \boldsymbol{\lambda}),$$

which takes as input $\mathbf{X}_g$ and a generic vector of latent variables $\boldsymbol{\lambda}$ to return the vectors $\mathbf{d}_g$ and $\mathbf{a}_g$, which contain the delivery time and a delivery flag (equal to $1$ if the packet is delivered by the deadline and $0$ otherwise), respectively.

We then define the predictor $\hat{z} = \pi(\mathbf{F})$, where $\hat{z}$ is the predicted class and $\mathbf{F}$ is a matrix

$$\mathbf{F} = \begin{bmatrix} f_{1,-H} & \cdots & f_{1,0} \\ \vdots & \ddots & \vdots \\ f_{M,-H} & \cdots & f_{M,0} \end{bmatrix},$$

whose rows correspond to different features and columns to different GoP slots. The features are extracted from the matrix

$$\mathbf{Y}_g = \begin{bmatrix} \mathbf{t}_g \\ \mathbf{s}_g \\ \mathbf{d}_g \\ \mathbf{a}_g \end{bmatrix}.$$

We note that the features need to be calculated at the access point or edge server, which feeds them back to the user by means of a short packet. In order to ensure a reliable delivery of this packet, features associated with the last frame of the last GoP may be omitted.
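As a rough illustration of how such features could be derived from per-packet outcomes, the sketch below computes the three feature types named in the caption of Fig. 4 (average delay, delay variance and packet loss) for each GoP slot and stacks them into a matrix with one row per feature and one column per slot. The function names, the choice to compute delay statistics over delivered packets only, and the zero fallback when nothing is delivered are assumptions made for this example, not details taken from the paper.

```python
from statistics import mean, pvariance
from typing import List, Sequence, Tuple

# One GoP slot: (injection times [s], delivery times [s], delivery flags 0/1).
Slot = Tuple[Sequence[float], Sequence[float], Sequence[int]]


def gop_features(inject_s: Sequence[float],
                 deliver_s: Sequence[float],
                 delivered: Sequence[int]) -> Tuple[float, float, float]:
    """Return (average delay, delay variance, packet loss ratio) for one slot."""
    if len(delivered) == 0:
        raise ValueError("a GoP slot needs at least one packet")
    # Assumption: delay statistics use only packets delivered by the deadline.
    delays = [d - t for t, d, ok in zip(inject_s, deliver_s, delivered) if ok]
    loss = 1.0 - sum(delivered) / len(delivered)
    if not delays:
        # Assumption: with no delivered packet, the delay features default to 0.
        return 0.0, 0.0, loss
    return mean(delays), pvariance(delays), loss


def feature_matrix(history: Sequence[Slot]) -> List[List[float]]:
    """Stack per-slot features: rows are features, columns are GoP slots."""
    columns = [gop_features(*slot) for slot in history]
    return [list(row) for row in zip(*columns)]


# Hypothetical two-slot history.
slot_a = ([0.0, 0.1, 0.2, 0.3], [0.05, 0.2, 0.0, 0.45], [1, 1, 0, 1])
slot_b = ([0.0, 0.1], [0.02, 0.12], [1, 1])
matrix = feature_matrix([slot_a, slot_b])
```

Here `matrix[0]`, `matrix[1]` and `matrix[2]` hold the average delay, delay variance and loss ratio across the two slots, respectively.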

III-D Probing

The inputs of the predictor are features computed from the outcome of packets transmitted over the channel in the preceding temporal slots (corresponding to GoPs). Clearly, computing such features is possible only for the channel currently used to transmit the video. To overcome this issue, we introduce and study in this scenario the notion of probes. The core idea is to acquire information on the unused channels by periodically transmitting short trains of packets over them. This reduces the impact on other active streams compared to transmitting sections of the video.

IV Network Environment and Dataset

Here we focus on the Wi-Fi communication technology, which has a high degree of unpredictability due to the contention mechanism and the low amount of control imposed by the infrastructure, and associate each of the channels with an access point. The investigation of a multi-technology environment, for instance including connections to LTE eNodeBs, is left to future studies. We consider the following coexisting applications: (1) YouTube video streaming, (2) Skype voice call, (3) File Transfer Protocol (FTP), (4) Skype video call and (5) a collection of background web traffic sources that persistently inject traffic into the network.

Figure 2: Envelope of the traffic classes at the network level. The set includes Youtube, Skype voice and video, and FTP applications as well as persistent background web traffic.

Figure 2 shows the traces collected for the mentioned classes of applications, together with the complex envelope of the overall traffic when all of them are active. We inject these applications with stochastic activation/deactivation patterns.

The video stream used has a resolution of 1920x1080 and is encoded in the H.264/AVC [wiegand2003overview] format at a given average encoding bitrate using ffmpeg [tomar2006converting]. We generate a series of Transport Stream (TS) packets and transmit them over UDP. All the other coexisting data streams are transported over TCP.

To extract the dataset, we create non-overlapping WiFi channels in the NS-3 simulator, and run simulations over a sequence of GoPs. Each GoP is transmitted over 1 sec, and each video frame of the GoP is transmitted as a burst of packets. Since the video is temporally encoded, the first frame of the GoP is a reference frame carried by a larger burst of UDP packets, whereas all following differential frames are smaller and are contained in shorter bursts of UDP packets. Within any GoP slot, we transmit the video data on one of the channels and a burst of probe packets at the beginning of the GoP on the other available channels (see Fig. 1).
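The per-slot transmission pattern described above can be sketched as a small planning function: the video goes on the selected channel and every other channel receives one probe burst at the start of the GoP. The function name, the zero-based channel indices and the tuple layout are hypothetical; the default probe configuration (50 packets of 100 bytes) is the one used later in the evaluation.

```python
from typing import List, Tuple


def slot_plan(num_channels: int,
              video_channel: int,
              probe_packets: int = 50,
              probe_bytes: int = 100) -> List[Tuple[int, str, int]]:
    """Return (channel, stream type, probe bytes sent at GoP start) per channel.

    The probe byte count is 0 on the channel carrying the video, since no
    probe burst is needed where video packets already sample the channel.
    """
    if not 0 <= video_channel < num_channels:
        raise ValueError("video_channel must index an existing channel")
    plan = []
    for channel in range(num_channels):
        if channel == video_channel:
            plan.append((channel, "video", 0))
        else:
            plan.append((channel, "probe", probe_packets * probe_bytes))
    return plan


# Hypothetical example: three channels, video currently on channel 1.
plan = slot_plan(num_channels=3, video_channel=1)
```

With these defaults, each unused channel carries 5000 bytes of probe traffic per GoP slot.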

Fig. 3 depicts an example of the temporal trace of the PSNR and average packet delay at the GoP granularity. Periods of high channel load with different characteristics can be observed, interleaved with periods of milder traffic conditions. Note the non-trivial relationship between packet delay and PSNR due to the complex characteristics of video encoding.

Figure 3: Temporal pattern of PSNR and average packet delay induced by the traffic traces and their activation/deactivation.

V Prediction Framework

The dataset described in the previous section is used to build the predictor. Note that we need two distinct classifiers, translating video stream and probe stream features into the binary class we use to represent the quality of future video GoPs. We compute the features over a temporal window $-H, \ldots, 0$, which we refer to as the history. To define the classes $z = 1$ and $z = 0$ – high and low video quality, respectively – we set the PSNR threshold to 40 dB, and the number of above-threshold GoPs to $k = 6$. As in the dataset collection, we set the prediction and history window size to 10.

In order to fully exploit the temporal dependencies which exist between the features and the future class, we use a deep Convolutional Neural Network (CNN) [lecun1998gradient]. This choice also allows the use of the CNN’s soft label output to improve selection accuracy, as explained later. The architecture of CNNs is designed to take advantage of the multi-dimensional structure of the input. This is achieved using local connections and shared weights followed by pooling, which results in translation-invariant features. In the considered case, we use the feature matrix as a 2-D input. Then, different 1-D kernels (filters) are applied to each row of the input matrix. We assume that all kernels have the same size $L$, with $L \leq T$, where $T$ is the number of history slots. Kernel $j$ can be described as a vector:

$$\mathbf{w}_j = [w_{j,1}, w_{j,2}, \ldots, w_{j,L}].$$

The output of the convolution between $\mathbf{w}_j$ and an input feature vector $\mathbf{x} = [x_1, \ldots, x_T]$ is a vector $\mathbf{y}_j$ with length $T - L + 1$, in which the $n$-th element is obtained as follows:

$$y_{j,n} = \sum_{l=1}^{L} w_{j,l} \, x_{n+l-1}, \quad n = 1, \ldots, T - L + 1.$$

In the equation above, $\mathbf{x}$ can be any of the three feature vectors (average delay, delay variance, or packet loss). After convolving with the features, each kernel roughly summarizes key trends in the feature progression. For example, a kernel can detect whether there is a decreasing or increasing trend in recent slots, or capture a peak or valley within the window. A larger number of kernels allows the model to capture a broader range of fluctuations and trends in the features’ history. After that, all the outputs $\mathbf{y}_j$ are unrolled and passed through a non-linear activation function, and the result is then fed to a deep fully connected neural network, as shown in Fig. 4, to output the class labels.
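To make the row-wise filtering concrete, the following sketch applies one 1-D kernel to one feature history. It assumes the sliding dot product without padding and with unit stride that CNN libraries commonly call convolution, so an input of length T and a kernel of length L yield T - L + 1 outputs; the function name, the kernel weights and the delay values are invented for the example.

```python
from typing import List, Sequence


def conv1d_valid(x: Sequence[float], w: Sequence[float]) -> List[float]:
    """Slide kernel w over x (no padding, stride 1) and return the dot products."""
    kernel_len = len(w)
    if kernel_len == 0 or kernel_len > len(x):
        raise ValueError("kernel must be non-empty and no longer than the input")
    return [
        sum(w[l] * x[n + l] for l in range(kernel_len))
        for n in range(len(x) - kernel_len + 1)
    ]


# Hypothetical average-delay history (ms) over six GoP slots.
delay_history = [10.0, 12.0, 15.0, 19.0, 24.0, 30.0]
# Hypothetical kernel responding to an increasing trend across three slots.
trend_kernel = [-1.0, 0.0, 1.0]

response = conv1d_valid(delay_history, trend_kernel)
print(response)  # [5.0, 7.0, 9.0, 11.0]
```

The growing response values reflect a delay that is not only increasing but accelerating, which is the kind of trend a learned kernel could pick up.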

Figure 4: CNN architecture used to build the predictor. The model takes as input a matrix with one row per feature type (average delay, delay variance and packet loss, respectively) and one column per history slot. Then a 1-D CNN is applied to each row to analyze the temporal fluctuations and trends of each feature type, and the filtered data is unrolled and fed to a fully connected neural network to extract the output class label.

V-A Channel Selection

The goal of selection decisions is to maximize the short-term future PSNR of the video given the current state of all the channels. Due to prediction errors, a channel whose predicted label is $z = 1$ may still present a larger than expected number of low-PSNR GoPs. To mitigate the impact of prediction errors, we rank the channels using the soft output of the corresponding predictors and select the channel with the highest rank. Therefore, the policy for the channel selector is to choose a channel $i^{*}$ such that

$$i^{*} = \arg\max_{i} \; p_i,$$

where $p_i$ is the soft output of the classifier in channel $i$.
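The ranking policy above reduces to an argmax over the per-channel soft outputs, as in the minimal sketch below. The function name, the zero-based indexing and the tie-breaking rule (lowest index wins) are assumptions of this example.

```python
from typing import Sequence


def select_channel(soft_outputs: Sequence[float]) -> int:
    """Return the index of the channel with the largest soft output.

    soft_outputs[i] is the classifier's soft score that channel i will be
    "good" over the prediction window. Ties go to the lowest index.
    """
    if len(soft_outputs) == 0:
        raise ValueError("at least one channel is required")
    return max(range(len(soft_outputs)), key=lambda i: soft_outputs[i])


# Hypothetical scores: all three channels would be labelled "good" with a
# 0.5 cut-off, but the soft outputs still separate them.
chosen = select_channel([0.62, 0.91, 0.88])  # channel 1
```

Using the soft score rather than the hard label lets the selector discriminate between channels that receive the same predicted class.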

VI Performance Evaluation

We provide here an extensive evaluation of the offline and online performance of classification and channel selection. We used PyTorch [ketkar2017introduction] to build the CNN model, and the trained model is integrated into the NS-3 network environment for online performance tests. The CNN uses multiple kernels of equal length, set to half of the history size. We use the Adam optimizer [kingma2014adam] to train the model. We train and test the classifier on the probe and video datasets; in each dataset, part of the datapoints is used for training and the remainder to assess offline classification performance. Our model takes about 5 microseconds for inference and is pre-loaded for parallel execution on a separate processing unit, e.g., a GPU at the edge. The inference time (5 microseconds) is negligible compared to the GoP duration (1 sec), and thus does not affect the feature values, which we compute based on packet transmission in one GoP. To maintain real-time prediction and channel selection, we reduce the packet arrival deadlines from 1 sec to 1 - 0.000005 = 0.999995 sec.

Figure 5: Performance Comparison of different classifiers for different batch sizes of 100 bytes probe and the video stream-based classifier.
Figure 6: Comparison of the effect of the history length on classification accuracy with CNN based classifier.

VI-A Offline Classifier Performance

We first report micro-benchmarks on the performance of the classifier model in terms of prediction accuracy. Figure 5 compares the classification accuracy achieved by probes composed of a different number of packets (packet size equal to 100 bytes) and that achieved based on the video stream. Interestingly, the large amount of information carried by past video packets – which are larger in size and number compared to probes – allows accurate classification even when using simple classifiers. Conversely, accurate classification based on lightweight probes necessitates complex classifiers, capable of separating the two classes using more convoluted surfaces. This motivated our choice of a CNN as the core of the proposed channel selection framework.

We analyze the effect of the history size on the classification performance in Fig. 6. Interestingly, while the classifier based on the video packet stream suffers a relatively small accuracy loss when the history size is reduced, the probe-based classifier necessitates a larger amount of GoP slots to provide comparable performance. This is connected to the need of a larger set of packets spread over time to adequately sample the network state and its dynamics.

VI-B Online Channel Selection

First, in order to fully motivate the use of probes, we measure the degradation – in percentage with respect to the case with no video or probe stream – to other data streams’ throughput caused by different probe models and the video stream itself (Fig. 7). We can observe that an already effective set of probes (burst size equal to ) causes a % throughput loss, compared to the % caused by the video stream.

Figure 7: Throughput degradation of other applications imposed by probes or video streams.

We now analyze the performance of the data-driven channel selection framework in terms of output video quality. Based on our offline evaluation of the classifiers in Figs. 5 and 6, here we use 50 probe packets for the probe classifier and a history of 10 GoP slots for the features, to achieve an accuracy comparable to that of the video classifier.

Avg. Encoding Bitrate    File Size (Bytes)    Avg. PSNR (dB)
800 Kbps                 104977               36.95
600 Kbps                 78772                35.82
400 Kbps                 53241                35.01
200 Kbps                 27486                30.03
100 Kbps                 16056                26.65
50 Kbps                  13912                24.47
25 Kbps                  10865                18.63
TABLE I: Mapping of bitrate to file size and achievable PSNR for ABR based streaming of the selected video.

We compare the data-driven channel selector with the following cases:

(i) Fixed Channel: The entire video is sent over one fixed channel which gives the lower bound of the performance.

(ii) Adaptive Bit Rate (ABR) with full knowledge: Assuming full knowledge of the channel throughput, we implement an ABR framework matching the available throughput to a coding rate. In order to compute the PSNR, we do not adapt the resolution of the video as in DASH [stockhammer2011dynamic] but adapt the encoding bitrate. Table I shows how different encoding bitrates map to the size of the video segment and the corresponding achievable PSNR. In realistic conditions, the adaptation of the encoding rate in real-time streams is rather difficult due to the high computation demand.

(iii) Delay-Based Simple Channel Selector: We implement a simple delay-based selector, which compares the average packet delay in the last GoP slot for all the channels and selects the channel with the smallest delay.

(iv) Oracle Channel Selector: As an absolute upper bound, we consider an oracle channel selector that has a priori knowledge of the PSNR in all available channels.
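Two of the baselines above lend themselves to a short sketch. The ABR helper uses the rows of Table I and, as an assumption of this example, picks the highest encoding bitrate that does not exceed the known throughput, falling back to the lowest rate otherwise; the paper only states that throughput is matched to a coding rate. The delay-based helper picks the channel with the smallest average packet delay in the last GoP slot. Function names and zero-based channel indices are hypothetical.

```python
from typing import List, Sequence, Tuple

# Rows of Table I: (bitrate in Kbps, file size in bytes, average PSNR in dB),
# ordered from the highest to the lowest bitrate.
TABLE_I: List[Tuple[int, int, float]] = [
    (800, 104977, 36.95),
    (600, 78772, 35.82),
    (400, 53241, 35.01),
    (200, 27486, 30.03),
    (100, 16056, 26.65),
    (50, 13912, 24.47),
    (25, 10865, 18.63),
]


def abr_pick(throughput_kbps: float) -> Tuple[int, int, float]:
    """Assumed rule: highest bitrate not exceeding the available throughput."""
    for row in TABLE_I:
        if row[0] <= throughput_kbps:
            return row
    return TABLE_I[-1]  # throughput below every rate: use the lowest one


def delay_based_select(avg_delay_last_gop: Sequence[float]) -> int:
    """Return the channel with the smallest average delay in the last GoP slot."""
    if len(avg_delay_last_gop) == 0:
        raise ValueError("at least one channel is required")
    return min(range(len(avg_delay_last_gop)),
               key=lambda i: avg_delay_last_gop[i])


rate_row = abr_pick(450.0)                          # (400, 53241, 35.01)
best = delay_based_select([0.12, 0.03, 0.08])       # channel 1
```

The delay-based selector reacts only to the most recent slot, whereas the predictive selector uses the whole history window.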

The average PSNR achieved by the different selectors in scenarios where nodes are static or move according to a random walk with speed up to 5 m/s is shown in Table II. In the static scenario, the proposed data-driven approach (51.85 dB) performs close to the Oracle channel selector (54.31 dB), with a loss of less than 2.5 dB, and an improvement of 14.33 dB with respect to fixed channel transmission. ABR transmission and delay-based selection, at 45.27 and 47.90 dB, achieve intermediate performance. When considering mobility, we note a reduction in the achievable average PSNR. However, also in this case the proposed predictive channel selector outperforms the other options and performs close to the oracle selector.

Average PSNR (dB)                        No Mobility    With Mobility
Fixed Channel (lower bound)              37.52          35.24
Oracle Channel Selector (upper bound)    54.31          49.13
ABR with full knowledge                  45.27          42.54
Delay-based simple selector              47.90          45.31
Probe-based predictive selector          51.85          48.33
TABLE II: Average PSNR comparisons in different real-time video streaming approaches.
Figure 8: CDF of PSNR for different transmission strategies.

The CDF of the PSNR is shown in Figure 8. Transmission over a fixed channel results in a widely spread distribution of PSNR, which may impair the ability of the edge server to perform analysis. The probe-based classifier is capable of eliminating most low and mid-range PSNRs, although prediction errors result in a larger probability of mid-range PSNR than that achieved by the oracle. ABR transmission efficiently removes most low-range PSNRs, but the CDF has a sharp increase at

 dBs, emphasizing the limits of adaptation against bad channel conditions, where the compression necessary to effectively deliver the packets produces a perceivable degradation.

VII Conclusions

This paper presented a data-driven form of prediction and channel selection for wireless channels subject to interference and congestion. The core idea is to build classifiers capable of transforming features collected in real-time by wireless nodes into predicted quality of the data transmitted to edge servers. We specifically consider a scenario where a real-time video stream is acquired and transmitted over highly dynamical channels. An extensive discussion on the predictive power of features extracted from the data and probe stream was provided. We measured the performance of classification and channel selection by means of extensive simulations. Results show prediction accuracy of about % and average PSNR close to that of an oracle predictor with a limited disruption of other data streams’ performance.