The remainder of the paper is structured as follows. Section 2 presents the overall architecture of our implementation and describes each component of the system. The proof of concept is presented in Section 3, explaining the most important challenges and workarounds. Finally, Section 4 concludes our paper.
A general overview of the main components involved in our implementation is depicted in Figure 1. It consists of six main modules which interact with each other and together provide a fully functional player for OMAF 360-degree video. Those modules are: Player, Downloader (DL), MPD Parser (MP), Scheduler (SE), Media Engine (ME) and finally the Renderer (RE).
The Player module represents the core of the entire application. It connects all modules with each other and controls them. The DL module deals with all HTTP requests to the server. The MP module implements the parsing of the DASH manifest file (MPD) together with additional metadata defined in OMAF. The SE module controls the DL module and decides, based on the current status of the player, when requests for the next segments should be executed. The task of the ME module is to parse OMAF-related metadata on the File Format level and to re-package the downloaded OMAF content in such a way that the Media Source Extensions (MSE) API of the web browser can process the data. Finally, the RE module uses the OMAF metadata in order to correctly render the video texture on the canvas using the WebGL API. The following subsections describe each of these six modules in more detail.
As already mentioned in the previous section, the Player module can be seen as the core of the entire application. Its main goal is to connect all modules with each other and to control them. It also connects HTML elements such as video and canvas from the main page with the ME and RE modules. In addition, it provides basic functionality to the user, such as loading a source, play, pause, loop, entering and exiting full-screen mode, and retrieving certain metrics of the application in order to plot the data on the screen.
The main task of this module is to manage all HTTP traffic between our application and the server. It receives a list of URLs from the Player module and downloads them using the Fetch API. After all required HTTP requests have been processed and all requested data has been successfully downloaded, it forwards the data to the ME module for further processing. Since the current version of the player issues many simultaneous HTTP requests, it is desirable to host the media data on an HTTP/2-enabled server in order to improve streaming performance.
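The DL module's behavior can be sketched as follows. This is an illustrative reconstruction, not the actual code: the function and parameter names are assumptions, and the fetch function is passed in as a parameter so the concurrent-download logic can be exercised outside the browser.

```javascript
// Sketch of the DL module's request logic (names are illustrative).
// fetchFn is injected; in the browser it would be the native fetch.
async function downloadSegments(urls, fetchFn) {
  // Fire all requests concurrently; an HTTP/2 server multiplexes them
  // over a single connection, which is why HTTP/2 hosting is desirable.
  const responses = await Promise.all(urls.map((url) => fetchFn(url)));
  // Collect the raw segment bytes in request order.
  const buffers = await Promise.all(responses.map((r) => r.arrayBuffer()));
  return buffers; // forwarded to the ME module for repackaging
}
```

Only after every request in the list has resolved is the batch handed to the ME module, matching the behavior described above.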
2.3. MPD Parser
OMAF uses Dynamic Adaptive Streaming over HTTP (DASH) as a primary delivery mechanism for VR media. It also specifies additional metadata for 360-degree video streaming such as:
Projection type: only Equirectangular (ERP) or Cubemap (CMP) projections are allowed.
Content coverage: indicates which region each DASH Adaptation Set covers. Since each HEVC tile is stored in a separate Adaptation Set, this information is required to select the set of Adaptation Sets that together cover the entire 360-degree space.
Spherical region-wise quality ranking (SRQR): is used to determine where the region with the highest quality is located within an Adaptation Set. This metadata allows us to select an Adaptation Set based on the current orientation of the viewport.
In addition to OMAF metadata, another notable feature is the DASH Preselection Descriptor which indicates the dependencies between different DASH Adaptation Sets. The MP module parses all required DASH manifest metadata and implements several helper functions which are used by the Player module in order to make appropriate HTTP requests.
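The viewport-dependent selection described above can be sketched as a ranking over Adaptation Sets by how close their highest-quality region lies to the current viewport center. This is a hypothetical illustration: the field names (`centerAzimuth`, `centerElevation`, `bestRegion`) are assumptions modeled loosely on the OMAF sphere region structure, not the actual output of the MP module.

```javascript
// Hypothetical sketch of SRQR-based Adaptation Set selection.
// viewport and bestRegion are {centerAzimuth, centerElevation} in degrees.
function selectAdaptationSet(adaptationSets, viewport) {
  // Angular distance between the viewport center and the center of the
  // highest-quality region advertised by an Adaptation Set.
  const dist = (a, b) => {
    const dAz = Math.abs(a.centerAzimuth - b.centerAzimuth) % 360;
    const az = dAz > 180 ? 360 - dAz : dAz; // wrap around the sphere
    const el = Math.abs(a.centerElevation - b.centerElevation);
    return az + el; // simple L1 metric, sufficient for ranking
  };
  return adaptationSets.reduce((best, as) =>
    dist(as.bestRegion, viewport) < dist(best.bestRegion, viewport) ? as : best);
}
```

A real implementation would use a proper great-circle distance; the L1 metric here only serves to illustrate the ranking idea.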
One of the key aspects of any streaming service is maintaining a sufficient buffer in order to facilitate smooth media playback. In our implementation, the buffer is maintained using a parameter named 'buffer limit', which can be set prior to or during a streaming session. This parameter indicates the maximum buffer fullness level in milliseconds, and depending on its value the SE module schedules the next request. If the buffer can accommodate another segment, the SE module initiates the request for the next segment; if the buffer is full, the request is delayed until the buffer fullness level and the current playback time indicate otherwise.
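The scheduling decision above reduces to a simple predicate, sketched below. The function name and signature are illustrative, not taken from the implementation; all times are in milliseconds, matching the 'buffer limit' parameter.

```javascript
// Minimal sketch of the SE module's scheduling decision (names assumed).
function shouldRequestNextSegment(bufferedEndTime, currentTime, bufferLimit, segmentDuration) {
  // Amount of media currently buffered ahead of the playback position.
  const bufferFullness = bufferedEndTime - currentTime;
  // Request the next segment only if it still fits under the buffer limit;
  // otherwise the request is delayed until playback drains the buffer.
  return bufferFullness + segmentDuration <= bufferLimit;
}
```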
An important point for any viewport-dependent streaming implementation is that the user can change the viewport orientation at any time during playback. Therefore, the system should adapt to the changed viewport orientation as quickly as possible, which implies that the buffer fullness limit must be kept relatively small, preferably in the range of a few seconds.
2.5. Media Engine
Before the ME module processes the media segments, each segment must be completely downloaded; it consists of multiple hvc1 video tracks (one track for each HEVC tile) and one additional hvc2 video track with Extractor NAL Units for HEVC video (ISO/IEC, 2017). The extractor track is required to create a single HEVC bitstream from the individually delivered HEVC tiles so that a single decoder instance can be used. An extractor is an in-stream data structure using a NAL unit header for extraction of data from other tracks; it can logically be seen as a pointer to data located in a different File Format track. Unfortunately, currently available web browsers do not natively support File Format extractor tracks, and thus a repackaging workaround, as performed by the ME module, is necessary: the ME module resolves all extractors within an extractor track and packages the resolved bitstream into a new track with a single, fixed track ID, even when the extractor track ID changes. Hence, from the perspective of the MSE SourceBuffer, the segments appear to come from the same DASH Adaptation Set even if the player switches between different tiling configurations.
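The pointer-resolution step can be illustrated with a deliberately simplified sketch. Real extractors are NAL units carrying (track ID, offset, length) references as defined in ISO/IEC 14496-15; here each extractor is modeled as a plain object and the tile tracks as byte arrays, which abstracts away the actual File Format parsing.

```javascript
// Conceptual sketch of extractor resolution in the ME module (simplified).
// extractorSample: array of NAL-unit-like objects; tileTracks: trackId -> bytes.
function resolveExtractors(extractorSample, tileTracks) {
  const out = [];
  for (const nal of extractorSample) {
    if (nal.type === 'extractor') {
      // Replace the extractor with the referenced bytes from the tile track.
      const data = tileTracks[nal.trackId];
      out.push(...data.slice(nal.offset, nal.offset + nal.length));
    } else {
      // In-line NAL units are copied into the output as-is.
      out.push(...nal.bytes);
    }
  }
  return out; // the resolved bitstream, to be repackaged into a new track
}
```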
After the repackaged segment is processed by the MSE SourceBuffer, the browser decodes the video and the video texture is finally rendered by the RE module using OMAF metadata. The RE module is implemented using a custom shader written in the OpenGL Shading Language (GLSL) together with the three.js library (three.js, 2019), a WebGL library used to implement three-dimensional graphics on the Web. Our rendering implementation is based on triangular polygon mesh objects and supports both equirectangular and cubemap projections. In the case of a cubemap projection, each cube face is divided into two triangular surfaces, while in the case of an equirectangular projection, a helper class from the three.js library is used to create a sphere geometry.
Figure 3 shows the three main processes used for rendering, from the decoded picture to the final result rendered on the cube. It shows an example in which each face of the cube is divided into four tiles, while the decoded picture is composed of 12 high-resolution and 12 low-resolution tiles. The 12 triangular surfaces of the cube as depicted in Figure 3 (c) can be represented as a 2D plane as in Figure 3 (b). The fragment shader of the RE module uses OMAF metadata to render the decoded picture correctly onto the cube faces, as shown in Figure 3 (b). The Region-wise Packing (RWPK) metadata of OMAF specifies, for all tracks, the top-left position, width, and height of each tile in both packed and unpacked coordinates, as well as its rotation.
Since the shader reconstructs the position of pixels based on OMAF metadata, Figure 3 (b) can be regarded as a region-wise unpacked image. The shader therefore sets the rendering range within Figure 3 (a) using the RWPK metadata and renders the tiles of the decoded picture onto the cube faces of Figure 3 (b). However, when the viewport position changes, the RE module has to be given the correct metadata for the current track. In the implementation, when the manifest file is loaded, the RE module is initialized with all RWPK metadata so that it can correctly render all tracks. The synchronization of the track switching is covered in the following section.
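The mapping from RWPK metadata to the shader's rendering range can be sketched as a conversion of a packed region into normalized texture coordinates on the decoded picture. The field names (`packedLeft`, `packedTop`, `packedWidth`, `packedHeight`) are assumptions modeled on the OMAF rectangular region packing semantics, not the actual parser output.

```javascript
// Illustrative sketch: map a packed RWPK region to normalized UV coordinates
// on the decoded picture, i.e. the rendering range used by the fragment shader.
function packedRegionToUV(region, pictureWidth, pictureHeight) {
  return {
    u0: region.packedLeft / pictureWidth,
    v0: region.packedTop / pictureHeight,
    u1: (region.packedLeft + region.packedWidth) / pictureWidth,
    v1: (region.packedTop + region.packedHeight) / pictureHeight,
  };
}
```

Rotation handling, which RWPK also carries, is omitted here for brevity.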
3. Proof of concept
In this section, we first give a brief overview of the implementation. We then discuss the most important challenges we encountered during implementation and describe their workarounds.
3.1. Implementation overview
After the MPD is successfully loaded and parsed, the player downloads all initialization segments (one for each emphasized viewport, which, in relation to the example in Figure 3, corresponds to 24 extractor tracks), parses OMAF-related metadata and initializes the RE module with extracted region-wise packing information. In addition, the ME module creates a pseudo-initialization segment for the MSE SourceBuffer to initialize it in a way such that following repackaged media segments can be successfully decoded by the web browser.
The streaming session starts when the user presses the play button. The player then continuously downloads media segments of a certain extractor track, depending on the current orientation of the viewport. In addition, all dependent media segments (hvc1 tracks) are also downloaded. All downloaded media segments are then immediately repackaged, and the corresponding RWPK metadata is used to correctly render the video on the canvas. We tested our implementation on Safari 12.0.2, since only web browsers from Apple and Microsoft natively support it (the Microsoft Edge web browser supports the implementation on devices with a hardware HEVC decoder; for devices without hardware HEVC support, the HEVC Video Extensions have to be enabled in the Microsoft Store). Raw video sequences were provided by Ericsson and prepared for streaming using the OMAF file creation tools from Fraunhofer HHI (HHI, 2019a; HHI, 2019b). Finally, the content was hosted on the Amazon CloudFront CDN, which supports HTTP/2 in conjunction with HTTPS.
The following section covers some of the issues that we faced during the implementation.
3.2. Implementation challenges
An important prerequisite for the correct functioning of the implementation is the synchronization of extractor track switching with the corresponding RWPK metadata. When the user changes the viewport, the extractor track changes, and the high- and low-quality tile positions of the decoded picture are derived using the corresponding region-wise packing metadata. The RE module has to reset the texture mapping of the decoded picture according to the changed track, and this must happen at exactly the moment the video texture changes from one track to the other. The simplest way to detect the exact frame number would be to check the current time of the video. Unfortunately, the precise accuracy of the currentTime attribute of the video element is still under discussion at the W3C (Media and Group, 2018), and it is currently not possible to reliably detect the exact frame number of the video using currentTime.

Therefore, the ME module uses two video elements together with two MSE SourceBuffers and alternately switches between them whenever the extractor track (and thus the RWPK metadata) changes. The ME module saves the bitstream of the changed track in the SourceBuffer of the other video element. When the Player reaches the end of the active buffer, it switches to the other video element and, at the same time, informs the RE module about the change through an event. The RE module declares two scene objects and associates one with each video element. During the initialization phase, it also calculates and stores the RWPK metadata of the decoded picture for all tracks. Upon receiving the video-element-change event from the Player, the RE module replaces the scene and maps the video texture of the decoded picture based on the corresponding scene object, so that track switching is synchronized without error.
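The alternation between the two video elements can be modeled as a tiny state machine, sketched below. The class and method names are illustrative; the actual implementation drives two video elements with their own MSE SourceBuffers, and only the switching logic is shown here.

```javascript
// Sketch of the two-video-element workaround (names are assumptions).
class VideoSwitcher {
  constructor() {
    this.active = 0; // index of the video element currently on screen
  }
  // Segments of a newly selected extractor track go to the *inactive* buffer.
  targetBufferForNewTrack() {
    return 1 - this.active;
  }
  // Called when playback reaches the end of the active buffer: swap elements
  // and notify the RE module so it replaces the scene at the same instant.
  switchTo(onSwitch) {
    this.active = 1 - this.active;
    onSwitch(this.active); // RE module swaps scene object / RWPK mapping here
  }
}
```

Binding one scene object per video element, as the RE module does, ensures that the texture mapping and the displayed track always change atomically.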
While this solution works well on Safari, we discovered an open issue in the Microsoft Edge browser (V., 2017) that interferes with the two-buffer workaround: the Edge web browser requires a few seconds of data in each buffer before it starts decoding, and therefore the new segment cannot be rendered instantly at every track switch.
Furthermore, due to the track synchronization solution using two video elements, we need to operate two MSE SourceBuffer objects, which makes the buffering logic somewhat more complex: the SE module has to monitor the buffer fullness level of both SourceBuffer objects. The durations of the media segments present in the two media sources are combined to determine the available buffer time at a given moment, so that the SE module can schedule requests for future media segments accordingly.
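The combined accounting can be sketched as follows. The buffered ranges of each SourceBuffer are modeled here as plain `[start, end]` pairs in seconds, which is a simplification of the TimeRanges object the MSE API actually exposes; the function name is an assumption.

```javascript
// Sketch: combined buffer-ahead time across the two MSE SourceBuffers.
// bufferedA / bufferedB: arrays of [start, end] pairs in seconds (assumption).
function combinedBufferAhead(bufferedA, bufferedB, currentTime) {
  // Media buffered ahead of the playback position in one set of ranges.
  const ahead = (ranges) =>
    ranges.reduce((sum, [start, end]) =>
      sum + Math.max(0, end - Math.max(start, currentTime)), 0);
  // The SE module schedules future requests against the combined duration.
  return ahead(bufferedA) + ahead(bufferedB);
}
```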
For future work we plan to further optimize the streaming performance of the player by reducing the number of HTTP requests, and to implement suitable rate-adaptation algorithms for tile-based streaming.
- 3GPP (2019) 3GPP. 2019. 5G; 3GPP Virtual reality profiles for streaming applications. Technical Specification (TS) 26.118. 3rd Generation Partnership Project. Version 15.1.0.
- HHI (2019a) Fraunhofer HHI. 2019a. Better quality for 360-degree video. Retrieved April 19, 2019 from http://hhi.fraunhofer.de/omaf
- HHI (2019b) Fraunhofer HHI. 2019b. HTML5 MSE Playback of MPEG 360 VR Tiled Streaming. Retrieved April 19, 2019 from https://github.com/fraunhoferhhi/omaf.js
- ISO/IEC (2017) ISO/IEC 2017. 14496-15, Information technology - Coding of audio-visual objects - Part 15: Carriage of network abstraction layer (NAL) unit structured video in the ISO base media file format. ISO/IEC.
- ISO/IEC (2019) ISO/IEC 2019. 23090-2, Information technology - coded representation of immersive media (MPEG-I) - Part 2: Omnidirectional media format. ISO/IEC.
- Media and Group (2018) W3C Media and Entertainment Interest Group. 2018. Frame accurate seeking of HTML5 MediaElement. Retrieved April 19, 2019 from https://github.com/w3c/media-and-entertainment/issues/4
- V. (2017) David V. 2017. Video MSE issues as of March 01 2017. Retrieved April 19, 2019 from https://developer.microsoft.com/en-us/microsoft-edge/platform/issues/11147314