Low-latency Cloud-based Volumetric Video Streaming Using Head Motion Prediction

01/17/2020 ∙ by Serhan Gül, et al. ∙ Telekom Deutschland GmbH ∙ Fraunhofer

Volumetric video is an emerging key technology for immersive representation of 3D spaces and objects. Rendering volumetric video requires substantial computational power, which is especially challenging for mobile devices. To mitigate this, we developed a streaming system that renders a 2D view from the volumetric video at a cloud server and streams a 2D video stream to the client. However, such network-based processing increases the motion-to-photon (M2P) latency due to the additional network and processing delays. To compensate for the added latency, prediction of the future user pose is necessary. We developed a head motion prediction model and investigated its potential to reduce the M2P latency for different look-ahead times. Our results show that the presented model reduces the rendering errors caused by the M2P latency compared to a baseline system in which no prediction is performed.




1. Introduction

Recent advances in hardware for displaying immersive media have aroused huge market interest in virtual reality (VR) and augmented reality (AR) applications. Although the initial interest was focused on omnidirectional (360°) video applications, improvements in capture and processing technologies have recently made volumetric video the center of attention (schreer2019). Volumetric videos capture the 3D space and objects and enable services with six degrees of freedom (6DoF), allowing a viewer to freely change both the position in space and the orientation.

Although the computing power of mobile end devices has dramatically increased in recent years, rendering rich volumetric objects is still a very demanding task for such devices. Moreover, there are as yet no efficient hardware decoders for volumetric content (e.g., point clouds or meshes), and software decoding can be prohibitively expensive in terms of battery usage and real-time rendering requirements. One way of decreasing the processing load on the client is to avoid sending the volumetric content and instead send a 2D rendered view corresponding to the position and orientation of the user. To achieve this, the expensive rendering process needs to be offloaded to a server infrastructure. Rendering 3D graphics on a powerful device and displaying the results on a thin client connected through a network is known as remote (or interactive) rendering (shi2015). Such rendering servers can be deployed on a cloud computing platform so that resources can be flexibly allocated and scaled up when the processing load increases.

In a cloud-based rendering system, the server renders the 3D graphics based on the user input (e.g., head pose) and encodes the rendering result into a 2D video stream. Depending on the user interaction, the camera pose of the rendered video is dynamically adapted by the cloud server. After a matching view has been rendered and encoded, the obtained video stream is transmitted to the client. The client can efficiently decode the video using its hardware video decoders and display the video stream. Moreover, network bandwidth requirements are reduced by avoiding the transmission of the volumetric content.

Despite these advantages, one major drawback of cloud-based rendering is an increase in the end-to-end latency of the system, typically known as M2P latency. Due to the added network latency and processing delays (rendering and encoding), the amount of time until an updated image is presented to the user is greater than in a local rendering system. It is well known that an increase in M2P latency may cause an unpleasant user experience and motion sickness (adelstein2003; allison2001). One way to reduce the network latency is to move the volumetric content to an edge server geographically closer to the user. Deployment of real-time communication protocols such as WebRTC is also necessary for ultra-low latency video streaming applications (holmberg2015). The processing latency at the rendering server is another significant latency component. Therefore, using fast hardware-based video encoders is critical for reducing the encoding latency.

Another way of reducing the M2P latency is to predict the future user pose at the remote server and send the corresponding rendered view to the client. Thus, it is possible to reduce or even completely eliminate the M2P latency if the user pose is predicted for a look-ahead time (LAT) equal to or larger than the M2P latency of the system (bao2016). However, mispredictions of head motion may degrade the user’s quality of experience (QoE). Thus, the design of accurate prediction algorithms has been a popular research area, especially for viewport prediction for videos (see Section 2.3). However, the application of such algorithms to 6DoF movement (i.e., translational and rotational) has not yet been investigated.

In this paper, we describe our cloud-based volumetric streaming system, present a prediction model to forecast the 6DoF position of the user, and investigate the rendering accuracy achieved using the developed prediction model. Additionally, we present an analysis of the latency contributors in our system and a simple latency measurement technique that we used to characterize the M2P latency of our system.

2. Background

Figure 1. Overview of the system components and interfaces.

2.1. Volumetric video streaming

Some recent works present initial frameworks for streaming of volumetric videos. Qian et al. (qian2019) developed a proof-of-concept point cloud streaming system and introduced optimizations to reduce the M2P latency. Van der Hooft et al. (van2019) proposed an adaptive streaming framework compliant with the recent point cloud compression standard MPEG V-PCC (schwarz2018). They used their framework for HTTP adaptive streaming of scenes with multiple dynamic point cloud objects and presented a rate adaptation algorithm that considers the user’s position and focus. Petrangeli et al. (petrangeli2019) proposed a streaming framework for AR applications that dynamically decides which virtual objects should be fetched from the server as well as their level of detail (LoD), depending on the proximity of the user and the likelihood that the user views the object.

2.2. Cloud rendering systems

The concept of remote rendering was first put forward to facilitate 3D graphics rendering when PCs did not have sufficient computational power for intensive graphics tasks. A detailed survey of interactive remote rendering systems in the literature is presented in (shi2015). Shi et al. (shi2019) proposed an MEC system to stream AR scenes containing only the user’s field of view (FoV) and a latency-adaptive margin around the FoV. They evaluated the performance of their prototype on an MEC node connected to a 4G (LTE) testbed. Mangiante et al. (mangiante2017) proposed an edge computing framework that performs FoV rendering of videos. Their system aims to optimize the required bandwidth as well as reduce the processing requirements and battery utilization.

Cloud rendering has also started to receive increasing interest from the industry, especially for cloud gaming services. Nvidia CloudXR (nvidia2019) provides an SDK to run computationally intensive XR applications on Nvidia cloud servers to deliver advanced graphics performance to thin clients.

2.3. Head motion prediction techniques

Several sensor-based methods have been proposed in the literature that attempt to predict the user’s future viewport for optimized streaming of 360° videos. These can be divided into two categories. Works such as (bao2016; bao2017; sanchez2019; aykut2018; oculus16) were specifically designed for VR applications and use the sensor data from head-mounted displays (HMDs), whereas the works in (barniv2005; koirala2015; bai2011) attempt to infer user motion from physiological data such as electroencephalography (EEG) and electromyography (EMG) signals. Bao et al. (bao2016) collected head orientation data and exploited the correlations in different dimensions to predict the head motion using regression techniques. Their findings indicate that a look-ahead time of - is a feasible range for sufficient prediction accuracy. Sanchez et al. (sanchez2019) analyzed the effect of M2P latency on a tile-based streaming system and proposed an angular acceleration-based prediction method to mitigate the impact on the observed fidelity. Barniv et al. (barniv2005) used the myoelectric signals obtained from EMG devices to predict the impending head motion. They trained a neural network to map EMG signals to trajectory outputs and experimented with combining EMG output with inertial data. Their findings indicate that look-ahead times of - are achievable with low error rates.

Most of the previous works target 3DoF VR applications and thus focus on predicting only the head orientation in order to optimize video streaming. However, little work has been done so far on prediction of 6DoF movement for advanced AR and VR applications. In Sec. 5, we present an initial statistical model for 6DoF prediction and discuss our findings.

3. System Architecture

This section presents the system architecture of our cloud rendering-based volumetric video streaming system and describes its different components. A simplified version of this architecture is shown in Fig. 1.

3.1. Server architecture

The server-side implementation is composed of two main parts: a volumetric video player and a cross-platform cloud rendering library, each described in more detail below.

Volumetric video player

The volumetric video player is implemented in Unity and plays a single MP4 file which has one video track containing the compressed texture data and one mesh track containing the compressed mesh data of a volumetric object. Before the playback of a volumetric video starts, the player registers all the required objects. For example, the virtual camera of the rendered view and the volumetric object are registered so that they can later be controlled by the client. After initialization, the volumetric video player can start playing the MP4 file. During playout, both tracks are demultiplexed and fed into the corresponding decoders: a video decoder for the texture track and a mesh decoder for the mesh track. After decoding, each mesh is synchronized with the corresponding texture and rendered to a scene. The rendered view of the scene is represented by a Unity RenderTexture that is passed to our cloud rendering library for further processing. While rendering the scene, the player concurrently asks the cloud rendering library for the latest positions of the relevant objects that were registered in the initialization phase.

Cloud rendering library

We created a cross-platform cloud rendering library written in C++ that can be integrated into a variety of applications. In the case of the Unity application, the library is integrated into the player as a native plugin. The library uses the GStreamer WebRTC plugin, integrated into a media pipeline as described in Sec. 3.3, for low-latency video streaming between the server and client. In addition, the library provides interfaces for registering the objects of the rendered scene and retrieving the latest client-controlled transformations of those objects while rendering the scene. In the following, we describe the modules of our library, each of which runs asynchronously in its own thread to achieve high performance.

The WebSocket Server is used for exchanging signaling data between the client and the server. Such signaling data includes Session Description Protocol (SDP) and Interactive Connectivity Establishment (ICE) messages as well as application-specific metadata for scene description. In addition, the WebSocket connection can also be used for sending control data, e.g., changing the position and orientation of any registered game object or camera. Both plain WebSockets and Secure WebSockets are supported, which is important for practical operation of the system.

The GStreamer module contains the media processing pipeline, which takes the rendered texture and compresses it into a video stream that is sent to the client using the WebRTC plugin. The most important components of the media pipeline are described in Sec. 3.3.

The Controller module represents the application logic and controls the other modules depending on the application state. For example, it closes the media pipeline if the client disconnects, re-initializes the media pipeline when a new client has connected, and updates the controllable objects based on the output of the Prediction Engine.

The Prediction Engine implements a regression-based prediction method (see Sec. 5) and provides interfaces for plugging in other methods. Based on the previously received input from the client and the implemented algorithm, the module updates the position of the registered objects so that the rendered scene corresponds to the predicted positions of the objects after a given look-ahead time.

3.2. Client architecture

The client-side architecture is depicted on the left side of Fig. 1. Before the streaming session starts, the client establishes a WebSocket connection to the server and asks the server to send a description of the rendered scene. The server responds with a list of objects and parameters which the client is later allowed to update. After receiving the scene description, the client replicates the scene and initiates a peer-to-peer (P2P) WebRTC connection to the server. The server and client begin the WebRTC negotiation process by exchanging SDP and ICE data over the established WebSocket connection. Finally, the P2P connection is established, and the client starts receiving a video stream corresponding to the current view of the volumetric video. At the same time, the client can use the WebSocket connection as well as the RTCPeerConnection to send control data to the server in order to modify the properties of the scene. For example, the client may change its 6DoF position, or it may rotate, move and scale any volumetric object in the scene.

We have implemented both a web player in JavaScript and a native application for the HoloLens, the untethered AR headset from Microsoft. While our web application targets VR, our HoloLens application is implemented for AR use cases. In the HoloLens application, we perform further processing to remove the background of the video texture before rendering the texture onto the AR display. In general, the client-side architecture remains the same for both VR and AR use cases, and the most complex client-side module is the video decoder. Thus, the complexity of our system is concentrated largely in our cloud-based rendering server.

3.3. Media pipeline

The simplified structure of the media pipeline is shown in the bottom part of Fig. 1. A rendered texture is given to the media pipeline as input using the AppSource element of GStreamer. Since the rendered texture is originally in RGB format but the video encoder requires YUV input, we use the VideoConvert element to convert the RGB texture to I420 format (https://www.fourcc.org/pixel-format/yuv-i420). After conversion, the texture is passed to the encoder element, which can be set to any supported encoder on the system. Since encoder latency is a significant contributor to the overall M2P latency, we evaluated the encoding performance of different encoders to make a careful selection. For detailed results on the encoder performance, please refer to Sec. 4.1.

After the texture is encoded, the resulting video bitstream is packaged into RTP packets, encrypted and sent to the client using WebRTC. WebRTC was chosen as the delivery method since it allows us to achieve ultra-low latency over the P2P connection between the client and server. In addition, WebRTC is already widely adopted by web browsers, allowing our system to support several different platforms.
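As a rough illustration of the element chain described in this section, the following Python helper assembles a GStreamer pipeline description string. It is only a sketch and not the authors' pipeline: the element names (appsrc, videoconvert, x264enc, rtph264pay, webrtcbin) are standard GStreamer elements, but the chosen properties, the default encoder, and the element names texture_src and sender are assumptions made for illustration.

```python
def build_pipeline_description(
    encoder: str = "x264enc tune=zerolatency speed-preset=ultrafast",
) -> str:
    """Build a gst-launch style description of the media pipeline.

    The encoder element is a parameter because the text states that any
    supported encoder can be plugged in (e.g. a hardware H.264 encoder).
    """
    elements = [
        # Rendered textures are pushed into the pipeline by the application.
        "appsrc name=texture_src is-live=true format=time",
        # RGB -> I420 conversion, since the encoder expects YUV input.
        "videoconvert",
        "video/x-raw,format=I420",
        encoder,
        # Package the H.264 bitstream into RTP packets.
        "rtph264pay config-interval=-1",
        # WebRTC endpoint; encryption is handled inside this element.
        "webrtcbin name=sender",
    ]
    return " ! ".join(elements)


if __name__ == "__main__":
    print(build_pipeline_description())
```

A string like this could be handed to a GStreamer parser in an application that has GStreamer installed; the helper itself only builds text and has no GStreamer dependency.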

4. Motion-to-Photon Latency

The different components of the M2P latency are illustrated in Fig. 2 and related by


where , and consist of the following component latencies:

Figure 2. Components of the M2P latency for a remote rendering system.

In our analysis, we neglect the time for the HMD to compute the user pose using its tracker module. This computation is typically based on a fusion of the sensor data from the inertial measurement units (IMUs) and visual data from the cameras. Although AR device cameras typically operate at - , IMUs are much faster, and a reliable estimation of the user pose can be performed with a frequency of multiple (lincoln2016; daqri2018). Thus, the expected tracker latency is on the order of microseconds.

The encoding latency is the time to compress a frame and depends on the encoder type (hardware or software) and the picture resolution. We present a detailed latency analysis of the different encoders that we tested for our system in Section 4.1.

The network latency is the network round-trip time (RTT). It consists of the propagation delay and the transmission delay, and covers both the time for the server to retrieve sensor data from the client and the time for the server to transmit a compressed frame to the client.

The rendering latency is the time for the server to generate a new frame by rendering a view from the volumetric data based on the actual user pose. In general, it can be set to match the frame rate of the encoder for a given rendered texture resolution.

The decoding latency is the time to decode a compressed frame on the client device and is typically much smaller than the encoding latency, since video decoding is inherently a faster operation than video encoding. Also, the end devices typically have hardware-accelerated video decoders that further reduce the decoding latency.

The display latency mainly depends on the refresh rate of the display. For a typical refresh rate of , the average value of the display latency is , and the worst-case value is , which occurs when the decoded frame misses the current VSync signal and has to wait in the frame buffer for the next VSync signal.
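The component breakdown above can be sketched as a simple additive budget. The following Python example is an illustration only: it assumes that the components add up without overlap, that a frame arriving at a random point in the refresh cycle waits half a frame period on average and one full period in the worst case, and every number used (including the 60 Hz refresh rate) is a hypothetical placeholder rather than a measurement from this system.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class LatencyBudget:
    """Latency components in milliseconds (hypothetical values only)."""
    render_ms: float
    encode_ms: float
    network_rtt_ms: float
    decode_ms: float
    display_ms: float

    def total_ms(self) -> float:
        # Assumption: components are sequential and simply add up.
        return (self.render_ms + self.encode_ms + self.network_rtt_ms
                + self.decode_ms + self.display_ms)


def display_latency_ms(refresh_hz: float) -> Tuple[float, float]:
    """Return (average, worst-case) display latency for a refresh rate.

    Worst case: the frame just misses VSync and waits one full period.
    Average: half a period, assuming uniformly distributed arrival times.
    """
    frame_period_ms = 1000.0 / refresh_hz
    return frame_period_ms / 2.0, frame_period_ms


if __name__ == "__main__":
    avg_display, worst_display = display_latency_ms(60.0)  # assumed 60 Hz
    budget = LatencyBudget(render_ms=5.0, encode_ms=3.0, network_rtt_ms=20.0,
                           decode_ms=2.0, display_ms=avg_display)
    print(f"display: avg {avg_display:.2f} ms, worst {worst_display:.2f} ms")
    print(f"total M2P estimate: {budget.total_ms():.2f} ms")
```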

4.1. Encoder latency

We characterized the encoding speeds of different encoders using the test dataset provided by the ISO/ITU Joint Video Exploration Team (JVET) for the next-generation video coding standard Versatile Video Coding (VVC) (jvet_VVC_CfP). In our measurements, we used the FFmpeg libraries of the encoders NVENC, x264, x265, and Intel SVT-HEVC, enabled their low-latency presets and measured the encoding speed in frames per second (fps) using the FFmpeg -benchmark option. The measurements were performed on an Ubuntu 18.04 machine with 16 Intel Xeon Gold 6130 (2.10 GHz) CPUs using the default threading options for the tested software-based encoders. For x264 and x265, we used the ultrafast preset and zerolatency tuning (ffmpeg_x264). For NVENC, we evaluated the presets default, high-performance (HP), low-latency (LL) and low-latency high-performance (LLHP). A brief description of the NVENC presets can be found in (twitch2018).

Table 1 shows the mean fps over all tested sequences for different encoders. We observed that both the H.264 and HEVC encoders of NVENC are significantly faster than x264 and x265 (both using ultrafast preset and zerolatency tuning) as well as SVT-HEVC (Low delay P). NVENC is able to encode 1080p and 4K videos in our test dataset with encoding speeds up to and , respectively. We also observed that for some sequences, HEVC encoding turned out to be faster than H.264 encoding. We believe that this difference is caused by a more efficient GPU implementation for HEVC.

All the low-latency presets tested in our experiments turn off B-frames to reduce latency. Despite that, we observed that the picture quality obtained by NVENC in terms of PSNR is comparable to the other tested encoders (using low-latency presets). As a result of our analysis, we decided to use NVENC H.264 (HP preset) in our system.

Standard  Encoder   Preset       Mean fps
H.264     x264      Ultrafast    81
H.264     NVENC     Default      353
HEVC      x265      Ultrafast    33
HEVC      SVT-HEVC  Low delay P  74
HEVC      NVENC     Default      212

Table 1. Mean encoding performances over all tested sequences for different encoders and presets.

4.2. Latency measurements

We developed a framework to measure the M2P latency of our system. In our setup, the server application runs on an Amazon EC2 instance in Frankfurt, and the client application runs in a web browser in Berlin that is connected to the Internet over WiFi.

We implemented a server-side console application that uses the same cloud rendering library as described in Sec. 3.1. Instead of sending the rendered textures from the volumetric video player, however, the application sends predefined textures (known by the client) depending on the control data received from the client. These textures consist of simple vertical bars with different colors. For example, if the client instructs the server application to move the main camera to position , the server pushes the texture into the media pipeline. Similarly, another camera position results in the texture .

On the client side, we implemented a web-based application that connects to the server application and renders the received video stream to a canvas. Since the client knows exactly what those textures look like, it can evaluate the incoming video stream and determine when the requested texture was rendered on the screen. As soon as the client application sends to the server, it starts the timer and checks the canvas for at every web browser window repaint event. According to the W3C recommendation (w3c2015), the repaint event matches the refresh rate of the display. As soon as the texture is detected, the client stops the timer and computes the M2P latency .

Once the connection is established, the user can start the session by defining the number of independent measurements. Since we are using the second smallest instance type of Amazon EC2 (t2.micro), we set the size of each video frame to pixels. We encode the stream using x264 configured with ultrafast preset and zerolatency tuning with an encoding speed of . As an example, we set the client to perform 100 latency measurements and calculated the average, minimum and maximum M2P latency. Our results show that the M2P latency fluctuates between and , and the measured average M2P latency is .
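The measurement procedure described above can be sketched as a small timing loop. The Python example below is a simplified stand-in for the browser client: it polls at a fixed interval instead of reacting to repaint events, and the callables send_request and frame_shows_texture, as well as all names and default values, are hypothetical placeholders for the real control message and canvas check.

```python
import time
from typing import Callable, Dict, List


def measure_once(send_request: Callable[[], None],
                 frame_shows_texture: Callable[[], bool],
                 poll_interval_s: float = 0.001,
                 timeout_s: float = 5.0) -> float:
    """Return the elapsed time in ms between request and texture detection."""
    start = time.monotonic()          # start the timer when the request is sent
    send_request()
    while time.monotonic() - start < timeout_s:
        if frame_shows_texture():     # requested texture is visible: stop timer
            return (time.monotonic() - start) * 1000.0
        time.sleep(poll_interval_s)
    raise TimeoutError("requested texture was never detected")


def summarize(samples_ms: List[float]) -> Dict[str, float]:
    """Average, minimum and maximum over independent measurements."""
    return {
        "mean": sum(samples_ms) / len(samples_ms),
        "min": min(samples_ms),
        "max": max(samples_ms),
    }


if __name__ == "__main__":
    # Fake server that "displays" the texture 20 ms after the request.
    state = {"ready_at": 0.0}

    def fake_send() -> None:
        state["ready_at"] = time.monotonic() + 0.02

    def fake_check() -> bool:
        return time.monotonic() >= state["ready_at"]

    samples = [measure_once(fake_send, fake_check) for _ in range(5)]
    print(summarize(samples))
```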

5. Head motion prediction

Figure 3. Mean absolute error (MAE) for different translational and rotational components averaged over five users. Results are given for the look-ahead times in the range -.

One important technique to mitigate the increased M2P latency in a cloud-based rendering system is the prediction of the user’s future pose. In this section, we describe our statistical prediction model for 6DoF head motion prediction and evaluate its performance using real user traces.

5.1. Data collection

We collected motion traces from five users while they were freely interacting with a static virtual object using Microsoft HoloLens. We recorded the users’ movements in 6DoF space; i.e., we collected position samples (, , ) and rotation samples represented as quaternions (, , , ). Since the raw sensor data we obtained from HoloLens was unevenly sampled (i.e., different temporal distances between consecutive samples) at , we interpolated the data to obtain temporally equidistant samples. We upsampled the position data using linear interpolation and the rotation data (quaternions) using spherical linear interpolation (SLERP) (shoemake1985). Thus, we obtained an evenly sampled dataset with a sampling rate of (one sample at each ).

Our 6DoF head movement dataset is freely available on GitHub for further use by the research community: https://github.com/serhan-gul/dataset_6DoF
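The resampling step described above can be sketched in plain Python as follows. This is an illustrative sketch, not the authors' code: the function names, the (w, x, y, z) quaternion ordering and the step size used in the example are assumptions.

```python
import math
from bisect import bisect_right


def lerp(a, b, u):
    """Linear interpolation between scalars a and b for u in [0, 1]."""
    return a + (b - a) * u


def slerp(q0, q1, u):
    """Spherical linear interpolation between unit quaternions (w, x, y, z)."""
    dot = sum(a * b for a, b in zip(q0, q1))
    if dot < 0.0:                       # q and -q encode the same rotation;
        q1 = tuple(-c for c in q1)      # flip to follow the shorter arc
        dot = -dot
    if dot > 0.9995:                    # nearly identical: plain lerp is stable
        out = tuple(lerp(a, b, u) for a, b in zip(q0, q1))
    else:
        theta = math.acos(dot)
        s0 = math.sin((1.0 - u) * theta) / math.sin(theta)
        s1 = math.sin(u * theta) / math.sin(theta)
        out = tuple(s0 * a + s1 * b for a, b in zip(q0, q1))
    norm = math.sqrt(sum(c * c for c in out))
    return tuple(c / norm for c in out)


def resample(times, positions, rotations, step):
    """Resample unevenly timed pose samples onto a uniform time grid.

    Positions use linear interpolation, rotations use SLERP.
    Returns a list of (time, position, rotation) tuples.
    """
    out = []
    n_steps = int(math.floor((times[-1] - times[0]) / step + 1e-9))
    for k in range(n_steps + 1):
        t = times[0] + k * step
        # Index of the raw sample at or before t, clamped to a valid segment.
        i = min(max(bisect_right(times, t) - 1, 0), len(times) - 2)
        u = (t - times[i]) / (times[i + 1] - times[i])
        u = min(max(u, 0.0), 1.0)
        pos = tuple(lerp(a, b, u)
                    for a, b in zip(positions[i], positions[i + 1]))
        rot = slerp(rotations[i], rotations[i + 1], u)
        out.append((t, pos, rot))
    return out
```

For example, `resample(times, positions, rotations, step=0.01)` would return one pose every 10 ms (a hypothetical step chosen for illustration) over the time span covered by the raw samples.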

5.2. Prediction method

We use a simple autoregressive (AR) model to predict the future user pose based on a time series of its past values. Autoregressive models use a linear combination of the past values of a variable to forecast its future values (hyndman2018).

An autoregressive model of lag order can be written as


where is the true value of the time series at time , is the white noise, and are the coefficients of the model. Such a model with lagged values is referred to as an model. Some statistics libraries can determine the lag order automatically using statistical criteria such as the Akaike information criterion (AIC) (akaike1973).
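For reference, the general textbook form of an autoregressive model that matches the description above can be written as follows. The symbols are assumptions chosen here for illustration and may differ from the original notation: $y_t$ is the value of the series at time $t$, $c$ a constant, $\varphi_1, \dots, \varphi_p$ the model coefficients, $\varepsilon_t$ the white noise, and $p$ the lag order.

```latex
y_t = c + \sum_{i=1}^{p} \varphi_i \, y_{t-i} + \varepsilon_t
```

With this notation, a model using $p$ lagged values is commonly called an AR($p$) model.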

We used the and values from one of the collected traces as training data and created two autoregressive models using the Python library statsmodels (statsmodels), for translational and rotational components, respectively. Our model has a lag order of 32 samples, i.e., it considers a history window of the past  ms and predicts the next sample using (5). Typically, we need to predict not only the next sample but multiple samples in the future to achieve a given look-ahead time; therefore, we repeat the prediction step by adding the just-predicted sample to the history window and iterating (5) until we obtain the future sample corresponding to the desired look-ahead time. The process is then repeated for each frame, e.g., each for an assumed display refresh rate.
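The iterative multi-step prediction described above can be sketched as follows. The paper fits its models with statsmodels; this dependency-free sketch instead takes the constant and coefficients as given inputs, and the function names and the toy coefficients in the example are hypothetical.

```python
from typing import List, Sequence


def predict_next(window: Sequence[float], const: float,
                 coeffs: Sequence[float]) -> float:
    """One autoregressive step.

    coeffs[0] multiplies the most recent sample, coeffs[1] the one before,
    and so on; the window must hold at least len(coeffs) samples.
    """
    recent = window[-len(coeffs):]
    return const + sum(c * y for c, y in zip(coeffs, reversed(recent)))


def predict_ahead(history: Sequence[float], const: float,
                  coeffs: Sequence[float], steps: int) -> float:
    """Predict `steps` samples into the future by iterating predict_next.

    Each predicted sample is appended to the history window and the oldest
    sample is dropped, so later steps build on earlier predictions.
    """
    window: List[float] = list(history[-len(coeffs):])
    for _ in range(steps):
        y_hat = predict_next(window, const, coeffs)
        window = window[1:] + [y_hat]
    return window[-1]


if __name__ == "__main__":
    # Toy AR(2) coefficients that reproduce linear extrapolation:
    # y_t = 2 * y_{t-1} - y_{t-2}
    history = [0.0, 1.0, 2.0, 3.0]
    print(predict_ahead(history, const=0.0, coeffs=[2.0, -1.0], steps=3))
```

The number of steps corresponds to the look-ahead time divided by the sampling interval of the trace, so a longer look-ahead time means more iterations and more accumulated prediction error.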

We used the trained model to predict the users’ translational (, , ) and rotational motion (, , , ). We perform the prediction of rotations in the quaternion domain since we readily obtain quaternions from the sensors and they allow smooth interpolation using techniques like SLERP. After prediction, we convert the predicted quaternions to Euler angles (yaw, pitch, roll) and evaluate the prediction accuracy in the domain of Euler angles since they are better suited for understanding the rendering offsets in terms of angular distances.
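The conversion from quaternions to Euler angles can be sketched as below. The axis convention is an assumption: this sketch uses the common Z-Y-X (yaw about z, pitch about y, roll about x) convention with (w, x, y, z) ordering, which may differ from the convention used by the HoloLens or by the authors.

```python
import math
from typing import Tuple


def quaternion_to_euler_deg(w: float, x: float, y: float,
                            z: float) -> Tuple[float, float, float]:
    """Convert a quaternion to (yaw, pitch, roll) in degrees.

    Assumed convention: Z-Y-X intrinsic rotations, i.e. yaw about z,
    pitch about y, roll about x. The input is normalized first.
    """
    norm = math.sqrt(w * w + x * x + y * y + z * z)
    w, x, y, z = w / norm, x / norm, y / norm, z / norm

    roll = math.atan2(2.0 * (w * x + y * z), 1.0 - 2.0 * (x * x + y * y))
    # Clamp to [-1, 1] to guard against rounding errors near +/-90 degrees.
    sin_pitch = max(-1.0, min(1.0, 2.0 * (w * y - z * x)))
    pitch = math.asin(sin_pitch)
    yaw = math.atan2(2.0 * (w * z + x * y), 1.0 - 2.0 * (y * y + z * z))

    return math.degrees(yaw), math.degrees(pitch), math.degrees(roll)


if __name__ == "__main__":
    # A rotation of 90 degrees about the z axis.
    half = math.pi / 4
    print(quaternion_to_euler_deg(math.cos(half), 0.0, 0.0, math.sin(half)))
```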

5.3. Evaluation

In our evaluation, we investigated the effect of prediction on the accuracy of the rendered image displayed to the user, i.e., the rendering offset. Specifically, we compared the predicted user pose (given a certain look-ahead time ) to the real user pose as obtained from the sensors. As a benchmark, we evaluated a baseline case in which the rendered pose lags behind the actual user pose by a delay corresponding to the M2P latency (), i.e., no prediction is performed.

For each user trace, we evaluated the prediction algorithm for a look-ahead time ranging between -. In each experiment, we assume that the M2P latency is equal to the prediction time () such that the prediction model attempts to predict the pose that the user will attain at the time the rendered image is displayed to the user. We evaluated our results by computing the mean absolute error (MAE) between the true and predicted values for the different components.
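The comparison between prediction and the no-prediction baseline can be sketched for a single pose component as follows. The linear-extrapolation predictor used here is only a stand-in for the autoregressive model, and all names, the toy trace and the delay value are hypothetical.

```python
from typing import Callable, Sequence, Tuple


def mean_absolute_error(truth: Sequence[float],
                        estimate: Sequence[float]) -> float:
    return sum(abs(t - e) for t, e in zip(truth, estimate)) / len(truth)


def linear_predictor(window: Sequence[float], steps: int) -> float:
    """Stand-in predictor: extrapolate the last observed slope."""
    return window[-1] + steps * (window[-1] - window[-2])


def evaluate(trace: Sequence[float],
             predict: Callable[[Sequence[float], int], float],
             delay_samples: int,
             history_len: int) -> Tuple[float, float]:
    """Return (MAE with prediction, MAE of the lagged baseline).

    For each display time t, the server only knows samples up to
    t - delay_samples. The baseline shows that stale sample as is;
    the predictor tries to forecast the sample at time t from it.
    """
    truth, predicted, lagged = [], [], []
    for t in range(history_len - 1 + delay_samples, len(trace)):
        now = t - delay_samples
        window = trace[now - history_len + 1: now + 1]
        truth.append(trace[t])
        predicted.append(predict(window, delay_samples))
        lagged.append(trace[now])
    return (mean_absolute_error(truth, predicted),
            mean_absolute_error(truth, lagged))


if __name__ == "__main__":
    ramp = [0.1 * k for k in range(50)]  # user moving at constant speed
    mae_pred, mae_base = evaluate(ramp, linear_predictor,
                                  delay_samples=4, history_len=2)
    print(f"prediction MAE: {mae_pred:.3f}, baseline MAE: {mae_base:.3f}")
```

On this constant-velocity toy trace the extrapolating predictor is essentially exact while the baseline lags by the distance travelled during the delay, which mirrors the observation that smooth, linear motion is the favorable case for prediction.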

Figure 4. Comparison of the prediction (blue) and baseline (red) results for the x and roll components of one of the traces (sample time ; showing the time range -) for  ms. The dashed gray line shows the recorded sensor data.

Fig. 3 compares the average rendering errors over five traces obtained using our prediction method to the baseline. We observe that for all considered , prediction reduces the average rendering error for both positional and rotational components.

Fig. 4 shows, for one of the traces, the predicted and baseline (lagged by the M2P latency) values for the x and roll components. At the beginning of the session, the prediction module collects the required number of samples for a history window of and makes the first prediction, i.e., the pose that the user is predicted to attain after a time of  ms (green shaded). We observe that the accuracy of the prediction depends on the frequency of abrupt, short-term changes of the user pose. If a component of the user pose changes linearly over a history window (without changing direction), the resulting predictions for that component are fairly accurate. Otherwise, if short-term changes are present within a history window, the prediction tends to perform worse than the baseline.

6. Conclusion

We presented a cloud-based volumetric streaming system that offloads the rendering to a powerful server and thus reduces the rendering load on the client side. To compensate for the added network and processing latency, we developed a method to predict the user’s head motion in 6DoF. Our results show that the developed prediction model reduces the rendering errors caused by the added latency due to the cloud-based rendering. In our future work, we will analyze the effect of motion-to-photon latency on the user experience through subjective tests and develop more advanced prediction techniques, e.g., based on Kalman filtering.