Applying VertexShuffle Toward 360-Degree Video Super-Resolution on Focused Icosahedral Mesh

06/21/2021 ∙ by Na Li, et al. ∙ 0

With the emergence of 360-degree image/video, augmented reality (AR), and virtual reality (VR), the demand for analyzing and processing spherical signals has increased tremendously. However, most effort has been paid to planar signals projected from spherical signals, which leads to problems such as wasted pixels and distortion. Recent advances in spherical CNNs have opened up the possibility of analyzing spherical signals directly. However, they operate on the full mesh, which makes them infeasible for real-world applications due to the extremely large bandwidth requirement. To address the bandwidth waste problem associated with 360-degree video streaming and to save computation, we exploit a Focused Icosahedral Mesh to represent a small area of the sphere and construct matrices to rotate spherical content to the focused mesh area. We also propose a novel VertexShuffle operation that significantly improves both performance and efficiency compared to the original MeshConv Transpose operation introduced in UGSCNN. We further apply our proposed methods to a super-resolution model, which is the first spherical super-resolution model that directly operates on a mesh representation of spherical pixels of 360-degree data. To evaluate our model, we also collect a set of high-resolution 360-degree videos to generate a spherical image dataset. Our experiments indicate that our proposed spherical super-resolution model achieves significant benefits in terms of both performance and inference time compared to the baseline spherical super-resolution model that uses the simple MeshConv Transpose operation. In summary, our model achieves great super-resolution performance on 360-degree inputs, achieving 32.79 dB PSNR on average when super-resolving 16x vertices on the mesh.


1 Introduction

360-degree image/video, also known as spherical image/video, is an emerging format of media that captures views from all directions surrounding the camera. Unlike traditional 2D image/video that limits the user’s view to wherever the camera is facing during capturing, a 360-degree image/video allows the viewer to freely navigate a full omnidirectional scene around the camera position.

Despite its substantial promise of immersiveness, the utility of streaming 360-degree video is limited by the huge bandwidths required by most streaming implementations. When watching a 360-degree video, users can only watch a small portion of the full omnidirectional view. That is, while the 360-degree video encodes frames that cover the full field-of-view (FoV), the user may only observe a “view” of the omnidirectional frame at a time. If the omnidirectional frame is projected to the 2D frame using the equirectangular projection [14], then only roughly 15% of the pixels of the frame are viewed. The remaining 85% of pixels are not viewed, and are thus wasted.

To allow the users to observe “views” in high enough quality, full omnidirectional frames must be transmitted at 4K or 8K resolution. Streaming videos at 4K or 8K resolution requires a significant amount of network bandwidth (e.g., 100 Mbps for 8K video streaming) that may not be supported by most users’ network connections.

To address the bandwidth-waste problem, we propose an efficient mesh representation, the Focused Icosahedral Mesh, which allows our model to focus on a more interesting portion of a sphere instead of the full mesh, making it more flexible and efficient.

Figure 1: The left shows the same 360-degree content as pixels on a sphere, which is the most natural way of representing 360-degree data. The right shows a 360-degree image encoded in the equirectangular projection, a widely used spherical projection for representing 360-degree images. However, projecting spherical signals to the 2D plane introduces distortion, e.g., in the north and south pole areas. (The original image is from the following video in the 360-degree video head movement dataset [7]: https://www.youtube.com/watch?v=8lsB-P8nGSM)

Another problem is that the omnidirectional views captured by 360-degree cameras are most naturally represented as uniformly dense pixels over the surface of a sphere (as shown in Figure 1 (left)). When spherical pixels are projected to planar surfaces, distortions are introduced. For example, the equirectangular projection [14] is a widely used spherical projection for representing 360-degree data. However, significant distortions can be observed around the north and south pole areas, as shown in Figure 1 (right).

Such distortions can reduce the efficiency of CNN operations by adding “over-represented” pixels, e.g., the regions near the north and south poles in the equirectangular projection. Further, training a CNN directly on the distorted representation could cause CNN models to learn characteristics of the planar distortion rather than relevant details of the high resolution representation.

Recent works [22, 5] perform convolution operations directly on spherical signals to avoid the distortion problem. These works show that it is possible to analyze spherical signals directly without 2D projections. Furthermore, extensive experiments were conducted in these works to show the efficiency of their proposed spherical CNNs.

Motivated by the 2D PixelShuffle [32], we also propose VertexShuffle, which achieves great performance and parameter efficiency on the mesh representation, improving substantially on the MeshConv Transpose operation proposed in UGSCNN [22].

To illustrate the efficiency of our proposed Focused Icosahedral Mesh representation and VertexShuffle, we apply our methods to a popular problem in computer vision, Super-Resolution [21], which aims at recovering high-resolution images and videos from low-resolution inputs.

In this paper, inspired by recent advances in spherical CNNs [22, 5] and state-of-the-art 2D super-resolution methods [10, 11, 12, 40, 34], we propose an efficient Focused Icosahedral Mesh representation to better utilize computational resources, along with a novel VertexShuffle operation that significantly improves both performance and efficiency compared to the original MeshConv Transpose operation introduced in UGSCNN [22]. For evaluation, due to the lack of an existing spherical super-resolution dataset, we also created a spherical super-resolution dataset from ten high-resolution 360-degree videos.

In summary, our paper makes the following main contributions:

  • We create a Focused Icosahedral Mesh representation of the sphere to efficiently represent spherical data, which not only saves computational resources but also improves memory efficiency.

  • We create a novel VertexShuffle operation, inspired by the 2D PixelShuffle [32] operation. The VertexShuffle operation significantly improves both the visual quality metric (peak signal-to-noise ratio (PSNR)) and the inference time over comparable transposed convolution operations.

  • We are the first to propose a super-resolution model that directly operates on a mesh representation of spherical pixels of 360-degree data.

  • We create a 360-degree super-resolution dataset from a set of high resolution 360-degree videos for evaluation.

  • Results show that our proposed SSR model achieves great super-resolution performance on 360-degree inputs, achieving 32.79 dB PSNR on average when super-resolving 16x vertices on the mesh.

2 Related Work

2.1 360-degree video

Despite its potential for delivering more-immersive viewing experiences than standard video streams, current 360-degree video implementations require bandwidths that are too high to deliver adequate experiences for many users.

Numerous approaches have been proposed for improving 360-degree bandwidth efficiency. These approaches have attempted both to improve the efficiency of how the 360-degree view is represented during transmission [30, 36, 29, 38, 8, 17, 33, 28, 31, 18] and to improve a system’s ability to avoid delivering unviewed pixels [41]. Only recently have super-resolution (SR) approaches been proposed in conjunction with 360-degree video delivery [9, 4].

To avoid the distortion problem in projecting 360-degree video to 2D planes, Xiong et al. [37] developed a reinforcement learning approach to select a sequence of rotation angles that minimizes the interest area near or on the cube boundaries.

Eder et al. [13] proposed a spherical image representation that mitigates spherical distortion by rendering a set of oriented, low-distortion images tangent to icosahedron faces. They also demonstrated the utility of applying standard CNNs to spherical data.

2.2 Spherical convolutional neural networks

Spherical CNNs have been studied by the computer vision community recently, as a number of real-world applications require processing signals in the spherical domain, including self-driving cars, panoramic videos, omnidirectional RGBD images, and climate science.

Recent works such as Cohen et al. [5] gave theoretical support for spherical CNNs on rotation-invariant learning problems, which is important for problems where orientation is crucial to model performance. They first introduced the concepts of S^2 and SO(3): S^2 can be defined as the set of points on a unit sphere, and SO(3) is the rotation group in three-dimensional Euclidean space.

They replaced planar correlation with spherical correlation, which can be understood as follows: the value of the output feature map evaluated at a rotation R in SO(3) is computed as an inner product between the input feature map and a filter rotated by R. Furthermore, they implemented the generalized Fourier transform for S^2 and SO(3).

Later, Cohen et al. [6] introduced a theory of equivariance to symmetry transformations on manifolds. They further proposed a gauge-equivariant CNN for signals on the icosahedron, which implements gauge-equivariant convolution using a single conv2d call, making it a highly scalable and practical alternative to spherical CNNs.

UGSCNN [22] is another recent work on spherical CNNs. It presents a novel CNN approach on unstructured grids using parameterized differential operators for spherical signals. The authors introduce a basic convolution operation, called MeshConv, that can be applied on meshes rather than planar images. It achieves significantly better performance and parameter efficiency compared to state-of-the-art network architectures for 3D classification tasks, since it does not require large amounts of geodesic computations and interpolations.

Zhang et al. [39] proposed to perform semantic segmentation on omnidirectional images by designing an orientation-aware CNN framework for the icosahedron mesh. They introduced fast interpolation of kernel convolutions and presented weight transfer from models learned through classical CNNs on perspective data. Recently, Eder et al. [13] proposed a spherical image representation that mitigates spherical distortion by rendering a set of oriented, low-distortion images tangent to icosahedron faces. They also demonstrated the utility of their approach by applying standard CNNs to spherical data.

While these existing works demonstrate their effectiveness in classification and segmentation tasks, the super-resolution task was not considered. In this work, we show that it is possible to apply their ideas to the super-resolution task. Our work is based on the proposed MeshConv operation, since it achieves better performance and parameter efficiency than other spherical convolutional networks. We also conduct experiments to show significant improvements over the baseline spherical super-resolution model that uses the simple MeshConv Transpose operation.

2.3 Super resolution

The super-resolution field has advanced rapidly since the start of the deep learning age. The SRCNN [10, 11] model was the first to apply CNNs to SR. FSRCNN [12] was an evolution of SRCNN: it operated directly on a low-resolution input image and applied a deconvolution layer to generate the high-resolution output. VDSR [23] was the first to apply residual layers [19] to the SR task, allowing for deeper SR networks. DRCN [24] introduced recursive learning in a very deep network for parameter sharing. Shi et al. [32] proposed “PixelShuffle”, a method for mapping values at low-resolution positions directly to positions in a higher-resolution image more efficiently than the deconvolution operation. SRResNet [26] introduced a modified residual layer tailored for the SR application. EDSR [27] further modified the SR-specific residual layer from SRResNet and introduced a multi-task objective in MDSR. SRGAN [26] applied a Generative Adversarial Network (GAN) [16] to SR, allowing better resolution of high-frequency details. These works focus on 2D planar data, which may not be ideal for 360-degree image super-resolution due to the distortions introduced in the projected representation. Our proposed model, in contrast, operates directly on spherical signals so that we can avoid the distortion problem.
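Since our VertexShuffle builds on PixelShuffle, a minimal NumPy sketch of the 2D operation may be helpful; the function name and toy shapes below are our own illustration, not code from [32]:

```python
import numpy as np

def pixel_shuffle(x, r):
    """Rearrange a (C*r*r, H, W) array into (C, H*r, W*r).

    Each group of r*r low-resolution channels is mapped onto an
    r x r block of high-resolution pixels, upscaling without a
    learned deconvolution.
    """
    c_r2, h, w = x.shape
    c = c_r2 // (r * r)
    x = x.reshape(c, r, r, h, w)    # split channels into (c, i, j)
    x = x.transpose(0, 3, 1, 4, 2)  # interleave: (c, h, i, w, j)
    return x.reshape(c, h * r, w * r)

# Four 2x2 channels become one 4x4 channel (r = 2).
lr = np.arange(16, dtype=np.float32).reshape(4, 2, 2)
hr = pixel_shuffle(lr, 2)
print(hr.shape)  # (1, 4, 4)
```

This channel-to-space rearrangement is the idea VertexShuffle transfers from the pixel grid to mesh vertices.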

Focusing on optimizing 360-degree video streaming, Chen et al. [3] proposed to apply super-resolution to 360-degree video tiles. Their work mainly focused on the overall video streaming system rather than the super-resolution model implementation. This differs from our work, which mainly focuses on the implementation of a novel spherical super-resolution model for 360-degree videos.

3 Methodology

3.1 Focused icosahedral mesh

In this section, we first introduce our proposed focused icosahedral mesh. The icosahedral spherical mesh [2] is a common discretization of the spherical surface. The mesh can be obtained by progressively sub-dividing each face of the unit icosahedron into four equal triangles.

Operations on a full spherical mesh, refined to a granularity that can include all pixels from a planar representation of a 360-degree video frame, however, require a significant amount of computation. In addition, operations on the full mesh cannot easily support operations on sub-areas of the spherical surface.

Performing super-resolution on “sub-areas” of the spherical surface can be beneficial for real-world 360-degree applications. This is because human eyes, as well as viewing devices (e.g., head-mounted displays), have limited fields-of-view (FoV), usually represented as the angular extent of the field that can be observed. For example, Figure 2 shows an 80-degree by 80-degree FoV. To render the view shown in this figure, only part of the sphere is required.

Such “sub-areas” would be useful in “tiling” schemes that can be used to support spatial-adaptive super-resolution over the 360-degree view. That is, if only a small area on the sphere will be viewed by the user, we may only need to apply super-resolution to a sub-portion of the sphere instead of the full sphere. As a result, performing super-resolution on the full icosahedral mesh may no longer be necessary as it requires more computation resources.

To support both faster operation and super-resolution on a sub-portion of the sphere, we propose a partial refinement scheme to generate “Focused Icosahedral Mesh”.

To generate a focused icosahedral mesh, we first create a Level-1 icosahedral mesh by refining each face on a unit icosahedron into 4 faces. In this way, the 20-face icosahedron is refined into a Level-1 icosahedral mesh with 80 faces. An example full Level-1 mesh with 80 faces is shown in Figure 3(a).
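The face and vertex counts under this progressive subdivision can be checked with a short helper (our own sketch, not the authors' code): each refinement splits every triangle into four, and vertex counts follow from Euler's formula.

```python
def full_mesh_counts(level):
    """Counts for the full icosahedral mesh at a refinement level:
    every refinement splits each triangle into four, so faces and
    edges quadruple, and vertices follow Euler's formula V - E + F = 2.
    """
    faces = 20 * 4 ** level
    edges = 30 * 4 ** level
    vertices = 10 * 4 ** level + 2
    return faces, edges, vertices

print(full_mesh_counts(1))     # (80, 120, 42): the full Level-1 mesh
print(full_mesh_counts(9)[2])  # 2621442 vertices, as in Table 1
```

The Level-1 result matches the 80 faces and 42 vertices discussed below, and the Level-6 through Level-9 vertex counts match Table 1.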

Figure 2: The user can only observe a sub-portion of the sphere at a time. For example, this figure shows an 80-degree by 80-degree field-of-view.

We then select one face out of the 80 faces of the Level-1 icosahedral mesh and only refine triangles located inside the selected Level-1 face.

Specifically, in our focused mesh representation, we select the face of the Level-1 mesh that covers the position of latitude=0, longitude=0 on the sphere since very little distortion is introduced when pixels near this area are projected to the 2D plane. Figure 3(b) shows the Focused Level-2 mesh where the selected Level-1 face is refined into 4 smaller faces. Figures 3(c) and 3 (d) show the Focused Level-3 and Focused Level-5 meshes, respectively.

(a) Full Level-1 Mesh
(b) Focused Level-2 Mesh
(c) Focused Level-3 Mesh
(d) Focused Level-5 Mesh
Figure 3: Example of meshes in Level-1, Level-2, Level-3 and Level-5. To create “Focused icosahedral meshes”, we select one face in the full Level-1 mesh and repeatedly refine triangles in this Level-1 face to obtain Focused Level-X meshes.

3.1.1 Rotating content to the Focused Level-1 origin

Our model operates on a single focused icosahedral mesh, instead of operating on separate meshes for different Level-1 refined icosahedral faces. To allow our model to perform super-resolution for any area on the sphere, we need to map spherical pixel content that belongs to any arbitrary full Level-1 mesh face to the face that is selected to be refined. To do so, we pre-compute a set of F rotation matrices, one per face, where F = 80 is the total number of faces in a full Level-1 mesh (each face is a triangle with V = 3 vertices, as shown in Figure 3), and each rotation matrix is D x D with D = 3, the number of dimensions of the Euclidean coordinates on the sphere, namely xyz.

We denote the Level-1 face selected to be refined as face b. To rotate an arbitrary face a_i on the Level-1 mesh to the refined face b, we need to find a rotation matrix R_i for face a_i such that R_i A_i = B, where A_i and B are 3x3 matrices that represent the xyz coordinates of the three vertices of faces a_i and b, respectively.

Therefore, we can obtain R_i as R_i = B A_i^{-1}. We first rotate the vertices in the Focused Level-X Mesh with the rotation matrix R_i, and then compute a mapping from each pixel in the input planar representation (e.g., equirectangular image) to the rotated Focused Level-X vertices. In this way, we can represent all 80 different faces of the full Level-1 mesh through a single Focused Mesh file, which has the potential to save a significant amount of computation and storage resources and achieves better parameter efficiency.
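This rotation can be sketched in a few lines of NumPy. The helper name and the toy face below are our own, and we assume each face's vertex coordinates are stacked as the columns of a 3x3 matrix:

```python
import numpy as np

def face_rotation(A_i, B):
    """Solve R_i @ A_i = B for the rotation taking face a_i to the
    refined face b. A_i and B hold the xyz coordinates of the three
    face vertices as columns; for congruent icosahedral faces the
    result is (numerically) a rotation matrix.
    """
    return B @ np.linalg.inv(A_i)

# Toy check: rotate a hypothetical face by 90 degrees about z,
# then recover that rotation from the two vertex matrices.
A = np.eye(3)                       # columns = three unit vertices
Rz = np.array([[0.0, -1.0, 0.0],
               [1.0,  0.0, 0.0],
               [0.0,  0.0, 1.0]])
R = face_rotation(A, Rz @ A)
print(np.allclose(R, Rz))  # True
```

In practice the 80 matrices would be computed once and stored, since the mesh geometry is fixed.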

Figure 4 visualizes how one focused icosahedral mesh can be used to represent all 80 different Level-1 faces. Figure 4(a) shows an original equirectangular-projected 360-degree image. In this image, we highlight two areas marked by magenta circles. In Figure 4(b), the left-hand-side image shows the Focused Level-9 mesh visualized on an equirectangular image. Magenta points in this figure represent vertices in the full Level-1 mesh; there are 42 vertices in the full Level-1 mesh. The right-hand-side image in Figure 4(b) magnifies the refined face in the Focused icosahedral mesh to show details. We can see that the content in this face is in the same position as in the original equirectangular-projected image.

Figure 4(c) shows the resulting visualization when we rotate a different Level-1 face to the refined face. The image on the right magnifies the refined face to show details.

(a) This figure shows an equirectangular-projected 360-degree image. Magenta circles, b and c, in this figure mark areas corresponding to two different refined faces. (Original photo by Timothy Oldfield on Unsplash: https://unsplash.com/photo/luufnHoChRU)
(b) The left-hand-side image displays the Focused Level-9 mesh visualized on an equirectangular image. The right-hand-side image displays a magnified view of the refined face. In both images, magenta points represent vertices in the full Level-1 mesh.
(c) This figure displays a different Level-1 icosahedral face rotated to the face refined in the Focused Mesh. Pixel values from the original image are attached to rotated vertices by inverting the rotation for positions of the mesh vertices then finding the nearest neighbor pixel of this rotated position.
Figure 4: Visualizing the Focused Icosahedral mesh.

3.1.2 Mesh sizes

Table 1 shows the number of vertices in both Full and Focused icosahedral meshes in different levels of refinement. A Full Level-9 mesh has more than 2.6 million vertices and requires more than 1.9 GB space for storage. On the other hand, a Focused Level-9 mesh has only about 33K vertices, requiring only about 31 MB storage space.

Level Level-6 Level-7 Level-8 Level-9
Full 40,962 163,842 655,362 2,621,442
Focused 600 2,184 8,424 33,192
2D planar 360x180 720x360 1440x720 2880x1440
Table 1: Number of vertices in Full icosahedral mesh, Focused icosahedral mesh, and their roughly-equivalent 2D planar resolution in the equirectangular projection.

We know that the area of a unit spherical surface is 4π, while a frame generated through the equirectangular projection covers a corresponding parameter area of 2π x π = 2π². Suppose there are N vertices in the Full Level-X mesh. Given that vertices on the icosahedral mesh are roughly uniformly distributed on the sphere, we can estimate the equivalent 2D equirectangular-projected frame resolution as follows: W x H = (2π²/4π) x N = (π/2) N, with W = 2H, where W and H are the width and height of the equirectangular projection, respectively. The results are listed in Table 1. We find that the Level-6 mesh is roughly equivalent to the 2D equirectangular projection at 360x180 resolution, and that the Level-9 mesh is roughly equivalent to the 2D equirectangular projection at 2880x1440 resolution.
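The estimate can be evaluated directly (a small helper of ours, not from the paper):

```python
import math

def equirect_equivalent(n_vertices):
    """Roughly-equivalent equirectangular resolution (W, H) for a
    full mesh with n_vertices roughly uniform vertices, using
    W * H = (pi / 2) * N and W = 2 * H, i.e. W = sqrt(pi * N).
    """
    w = math.sqrt(math.pi * n_vertices)
    return round(w), round(w / 2)

print(equirect_equivalent(40962))    # (359, 179), i.e. about 360x180
print(equirect_equivalent(2621442))  # (2870, 1435), i.e. about 2880x1440
```

The computed values land within a percent or two of the 2D resolutions reported in Table 1.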

3.2 MeshConv Transpose

UGSCNN [22] also proposed a MeshConv Transpose operation in its UNet architecture. MeshConv Transpose takes a level-l mesh as input and outputs a level-(l+1) mesh, which can be described as follows:

MeshConvT(x_l) = MeshConv(ZeroPad(x_l)),

where ZeroPad represents zero padding, and x_l and x_{l+1} are the level-l mesh and level-(l+1) mesh, respectively. In general, MeshConv Transpose simply pads zeros onto the new vertices in the level-(l+1) mesh, then applies MeshConv on the zero-padded level-(l+1) mesh. It is easy to implement but inefficient.
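A minimal sketch of this zero-padding scheme (our own illustration; we assume, as in UGSCNN's mesh files, that the level-l vertices appear first in the level-(l+1) vertex list):

```python
import numpy as np

def mesh_conv_transpose(x_l, n_next, mesh_conv):
    """Pad zeros at the new level-(l+1) vertices, then apply MeshConv.

    x_l: (N_l, C) vertex features; n_next: vertex count N_{l+1};
    mesh_conv: any callable standing in for the MeshConv layer.
    """
    n_l, channels = x_l.shape
    x_next = np.zeros((n_next, channels), dtype=x_l.dtype)
    x_next[:n_l] = x_l  # old vertices keep features, new ones get zeros
    return mesh_conv(x_next)

# Level-1 -> Level-2 on the full mesh: 42 -> 162 vertices.
x = np.ones((42, 8), dtype=np.float32)
y = mesh_conv_transpose(x, 162, lambda v: v)  # identity "MeshConv"
print(y.shape)  # (162, 8)
```

Note that the convolution must run over the full level-(l+1) mesh even though most of its input values are zeros, which is part of why this operation is inefficient.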

3.3 VertexShuffle

Motivated by the PixelShuffle [32] operation commonly used in 2D super-resolution models, we propose VertexShuffle for our spherical super-resolution model, which can be described as follows.

The input of our basic VertexShuffle operation can be represented as F_l with 4C channels over N_l vertices, where N_l is the number of vertices of the Level-l mesh. The output is F_{l+1} with C channels over N_{l+1} vertices, where C, the feature dimension at Level-(l+1), is a quarter of the input feature dimension in our work, and N_{l+1} is the number of vertices of the Level-(l+1) mesh.

We first split F_l into four parts along the feature map dimension, F_l^0, F_l^1, F_l^2, and F_l^3, each with C channels. We keep F_l^0 as our Level-l mesh features, which will be used later. F_l^1, F_l^2, and F_l^3 are used to refine vertices in the Level-(l+1) mesh.

As we introduced before, a spherical mesh can be obtained by progressively sub-dividing each face of the unit icosahedron into four equal triangles. Here, we treat a single triangle face as a sequence of vertices (v_0, v_1, v_2) and a sequence of edges (v_0 v_1, v_1 v_2, v_2 v_0). The refinement process can be regarded as progressively constructing a midpoint vertex on each associated edge; new edges in Level-(l+1) are created between each pair of midpoint vertices, so a single face in Level-l is refined into four new faces in Level-(l+1).

To make full use of the feature maps at Level-l, we use F_l^1, F_l^2, and F_l^3 to refine vertices in the Level-(l+1) mesh. Specifically, we use F_l^1 to calculate the midpoint between (v_0, v_1), F_l^2 to calculate the midpoint between (v_1, v_2), and F_l^3 to calculate the midpoint between (v_2, v_0). Midpoint vertex values are constructed by averaging the values associated with the original two vertices of an edge, i.e., the midpoint of edge (v_a, v_b) takes the value (F_l^i(v_a) + F_l^i(v_b)) / 2.

Thus, we can get a set of midpoint vertices, which are the new vertices generated in the Level-(l+1) mesh. However, there exist redundant midpoints, because shared edges may be calculated twice. We therefore perform deduplication on the set of midpoint vertices. There are plenty of ways to merge the two calculated midpoints, such as max, min, average, or weighted average. In our paper, we simply select the first instance of a midpoint, which achieves the best results in our experiments.

Then, we have a set of unique midpoint vertices that is used to refine the next-level mesh; its size equals N_{l+1} - N_l.

Finally, we concatenate the kept part of the Level-l feature map, F_l^0, with the newly calculated midpoint vertices to form our Level-(l+1) mesh.

Compared to MeshConv Transpose, we do not have extra learnable parameters. In other words, the implementation of VertexShuffle is not only more parameter-efficient, but also achieves significantly better performance.
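The steps above can be sketched as follows. This is our own simplification: the edge list, its per-edge channel slots, and the toy sizes are hypothetical, and deduplication is assumed to have already produced the unique edge list:

```python
import numpy as np

def vertex_shuffle(feats, edges):
    """feats: (4*C, N_l) level-l features, split channel-wise into
    F0..F3. edges: one (a, b, slot) tuple per unique level-(l+1)
    midpoint, where slot in {1, 2, 3} selects which of F1/F2/F3
    averages that edge. Returns (C, N_l + len(edges)) features:
    F0 for the old vertices, edge averages for the new midpoints.
    """
    c = feats.shape[0] // 4
    f = feats.reshape(4, c, -1)  # f[0] is kept; f[1:] build midpoints
    mids = np.stack([(f[s, :, a] + f[s, :, b]) / 2.0 for a, b, s in edges],
                    axis=1)
    return np.concatenate([f[0], mids], axis=1)

# Toy mesh: one triangle (3 vertices), 3 midpoint edges, C = 2.
feats = np.arange(24, dtype=np.float32).reshape(8, 3)
edges = [(0, 1, 1), (1, 2, 2), (2, 0, 3)]
out = vertex_shuffle(feats, edges)
print(out.shape)  # (2, 6): 3 old vertices + 3 new midpoints
```

As in the text, the only operations are a channel split, edge averages, and a concatenation, so no learnable parameters are introduced.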

Figure 5: This figure shows the architecture of our proposed Spherical Super-Resolution (SSR) model that uses MeshConv and VertexShuffle operations. Here, L7 represents the input Level-7 mesh, and L9 represents the output Level-9 mesh. Our model starts with a MeshConv layer followed by 2 ResBlocks and 2 VertexShuffle layers, it then ends with a final MeshConv layer.
Figure 6: The adapted ResBlock used in our model.

3.4 Model architecture

We apply our Focused icosahedral mesh and VertexShuffle to super-resolution. The architecture of our model is shown in Figure 5. In this figure, the input of our model is a Level-7 Focused icosahedral mesh. It first goes through a MeshConv layer with Batch Normalization [20] followed by a ReLU [1] activation function. Then, we use two adapted Residual Blocks [19] to further extract features, which we explain in depth later. We further combine the output of the first MeshConv and the output from the two ResBlocks by element-wise addition. After that, we apply two VertexShuffle operations to upscale the features. Finally, our model ends with a MeshConv layer, generating a Level-9 Focused icosahedral mesh.

Adapted Residual Block. We adapt a regular residual block by adding two MeshConv layers in the residual block. Other settings are similar to the regular residual block [19].

MeshConv. The MeshConv operation introduced by Jiang et al. [22] is performed by taking a linear combination of linear operator values computed on a set of input mesh vertex values. MeshConv can be formulated as follows:

MeshConv(x) = w_0 I x + w_1 ∇_x x + w_2 ∇_y x + w_3 ∇² x,

where I represents the identity, which can be regarded as the 0th-order differential; ∇_x and ∇_y are derivatives in two orthogonal spatial dimensions, which can be viewed as 1st-order differentials; and ∇² stands for the Laplacian operator, which can be regarded as the 2nd-order differential.

At a high level, these linear operators can be viewed as computing a set of local information near each vertex of the mesh. The standard 3x3 cross-correlation operation can be viewed as a set of nine linear operators, each returning the value of either the pixel itself or an adjacent pixel. Compared to the 3x3 convolution, it is clear that the set of four linear operators used by MeshConv is less expressive. Not only do they extract less information per vertex, but this information can also drop details about a vertex’s surroundings. For example, the gradient operation on the mesh computes a 3-dimensional average of either six or seven values, and another degree of freedom is dropped from the gradient when taking only the east-west and north-south components. We hypothesize that some of the information excluded from the linear operator computations could be useful for the super-resolution task. To mitigate this information loss, rather than including single MeshConv ops in our network architecture, we include pairs of composed MeshConv ops. These paired operations aggregate more local information around a vertex before the non-linearity is applied, allowing the network to capture more useful characteristics for the super-resolution task.
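As a concrete, simplified picture of the operation: one output channel of MeshConv is just a weighted sum of the four operator responses. The dense operator matrices below are stand-ins for the sparse, mesh-derived ones in UGSCNN:

```python
import numpy as np

def mesh_conv(x, grad_ew, grad_ns, laplacian, w):
    """One MeshConv channel: a linear combination of identity,
    east-west gradient, north-south gradient, and Laplacian
    responses at each vertex. w holds the four learned weights.
    """
    return (w[0] * x                 # 0th order: the vertex value itself
            + w[1] * grad_ew @ x     # 1st order: east-west derivative
            + w[2] * grad_ns @ x     # 1st order: north-south derivative
            + w[3] * laplacian @ x)  # 2nd order: Laplacian

# With zero operators, only the identity term survives: w0 * x.
x = np.arange(4.0)
zero = np.zeros((4, 4))
print(mesh_conv(x, zero, zero, zero, [2.0, 1.0, 1.0, 1.0]))  # [0. 2. 4. 6.]
```

Composing two such layers before the non-linearity, as described above, lets second-hop neighborhood information enter each vertex's response.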

3.5 Loss function

Similar to general super-resolution tasks, our goal is to minimize the loss between the reconstructed images I^SR and the corresponding ground-truth high-resolution images I^HR. Given a set of high-resolution images {I_i^HR} and their corresponding low-resolution images {I_i^LR}, we represent the loss as follows:

L = (1/N) Σ_{i=1..N} || I_i^HR − I_i^SR ||²,

where N is the number of training samples. Here, we adopt the mean squared error as our loss function. For a fixed peak value, minimizing it is equivalent to maximizing the peak signal-to-noise ratio (PSNR), which is more straightforward in our task.
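The loss and its relation to PSNR can be made concrete with a small sketch (ours, assuming signals normalized to a peak value of 1.0):

```python
import numpy as np

def mse_loss(hr, sr):
    """Mean squared error between ground truth and reconstruction."""
    return float(np.mean((hr - sr) ** 2))

def psnr(hr, sr, peak=1.0):
    """PSNR in dB. For a fixed peak, PSNR = -10 * log10(MSE / peak^2),
    so minimizing the MSE loss maximizes PSNR."""
    return 10.0 * np.log10(peak ** 2 / mse_loss(hr, sr))

hr = np.array([0.0, 0.5, 1.0])
sr = np.array([0.1, 0.5, 0.9])
print(round(psnr(hr, sr), 2))  # 21.76
```

The monotone relationship is why reporting PSNR while training on MSE is consistent.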

Model Small Large Average
Spherical: MeshConv with transposed MeshConv 18.52 16.57 17.54
Spherical: MeshConv with VertexShuffle (SSR) 31.44 34.13 32.79
Table 2: PSNR (dB) results for small and large dataset.
Model Total # of Parameters Per-image/frame Inference Time
Spherical: MeshConv with transposed MeshConv 1001225 5883 ms
Spherical: MeshConv with VertexShuffle (SSR) 734905 578 ms
Table 3: Comparison of total number of model parameters, and per-frame inference time.

4 Experiments

4.1 Dataset

Due to the lack of official spherical super-resolution datasets, we collect two publicly-available 360-degree video datasets, the 360-Degree Video Head Movement Dataset [7] and the VR User Behavior Dataset [35], to generate a high-quality spherical super-resolution dataset. The 360-Degree Video Head Movement Dataset [7] contains 5 videos in 4K quality, and the VR User Behavior Dataset contains 5 videos in high resolution. We use FFmpeg [15] to extract the key frames of each video in the dataset.

We first construct a small dataset from the 360-Degree Video Head Movement Dataset [7]. This small dataset contains 345 high-resolution images. We randomly split the dataset into training and test sets. The training set provides roughly 21,440 training items, and the test set provides 6,160 testing items. We evaluate our model with an upscaling factor of 4, that is, 16x vertex super-resolution.

We also generate a larger dataset from both the 360-Degree Video Head Movement Dataset [7] and the VR User Behavior Dataset [35]. The large dataset contains 1,532 high-resolution images in total. As with the small dataset, we split the dataset into training and test sets. The training set provides roughly 95,360 training items, and the test set provides 27,200 testing items. We evaluate our model with the same upscaling factor of 4.

4.2 Implementation details

With our generated Focused Icosahedral mesh, we map the 2D equirectangular-projected frame to the partial sphere mesh. In both the small and large datasets, the input data is at Level-7 and the output data is at Level-9 (see Table 1 for the roughly-equivalent 2D equirectangular resolutions). The upscaling factor in our experiment setup is 4 per dimension; that is, the number of output vertices is 16x the number of input vertices.

In our experiments, we train our model for 50 epochs with a batch size of 64. We set the learning rate to 0.01 and use Adam [25] as our optimizer. We use PSNR as the performance metric to evaluate our models.

4.3 Comparison with MeshConv Transpose

We compare our spherical super-resolution (SSR) model that uses the VertexShuffle operation with a baseline model that uses the MeshConv Transpose operation proposed in the original UGSCNN paper [22]. We conduct experiments on both the small and large datasets with the same configuration, i.e., using the Focused Icosahedral mesh with an upscaling factor of 4.

Table 2 shows the performance of the two models on both datasets. As we can see, our model achieves significantly higher performance than the baseline MeshConv Transpose model. We also found that performance improves as the dataset grows. Our model achieves a PSNR of 34.13 dB on the large dataset, 31.44 dB on the small one, and 32.79 dB on average, while the baseline method performs poorly on the spherical super-resolution task, achieving only 17.54 dB on average.

Figure 7: Visualization of results of our proposed SSR model (represented as “Ours”) and the MeshConv transpose model (represented as “Transpose”). “HR” represents the high resolution version of the data, i.e., the groundtruth.

4.4 Inference time and parameters

For a fair comparison of model inference time, we fix the batch size of both models to 16. We first compute the average inference time for each batch, then divide it by the batch size, and finally multiply by the number of data items in a full frame, which is 80 here. This gives us a rough per-frame inference time. We collected our results via experiments on a Tesla P100 GPU.
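The per-frame estimate described above can be sketched as follows. This is a hypothetical helper, not code from the paper; the default of 80 reflects the focused-mesh patches that make up one full frame:

```python
def per_frame_inference_time(avg_batch_time_s: float,
                             batch_size: int,
                             patches_per_frame: int = 80) -> float:
    """Estimate per-frame inference time from per-batch timings.

    Divides the average batch time by the batch size to get a per-patch
    time, then scales up by the number of patches in a full frame.
    """
    per_patch = avg_batch_time_s / batch_size
    return per_patch * patches_per_frame

# Example with made-up numbers: 0.2 s per batch of 16 patches
# -> 0.0125 s per patch -> 1.0 s per full 80-patch frame.
print(per_frame_inference_time(0.2, 16))
```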

Table 3 compares the inference time and total number of parameters of our proposed model with the baseline model that uses MeshConv transpose.

Our model uses roughly 20% fewer parameters than the baseline model, which improves efficiency significantly. In addition, the baseline with transposed MeshConv requires nearly 6 seconds to process a full image/frame. With our proposed VertexShuffle operation, our SSR model achieves more than 10x acceleration in processing a full image/frame, which is significantly faster than the spherical baseline model with transposed MeshConv operations.

In addition, given that a user can only watch a sub-portion of the spherical image/frame at a time, there is no need to perform super-resolution for all 80 faces of a frame at the same time. This indicates that our SSR model can be used in real-world applications with even faster per-image/frame processing time.

4.5 Quantitative results

Overall, our model outperforms the baseline MeshConv-Transpose-based model in PSNR, inference time, and total number of parameters. Since we are the first to directly apply a spherical convolutional neural network to the super-resolution task, we have no existing benchmark to compare with. Simply comparing PSNR results with the 2D super-resolution task would be unfair due to the different data formats and convolution methods. Hence, we only compare our method with the MeshConv Transpose model, which is a fair comparison that demonstrates our contributions.

4.6 Qualitative results

Figure 7 visualizes the results of our proposed SSR model and the MeshConv Transpose model. We use two images with significantly different PSNR results as examples. These images are from the following two videos in the 360-degree video head movement dataset [7]: https://www.youtube.com/watch?v=2OzlksZBTiA and https://www.youtube.com/watch?v=sJxiPiAaB4k. Here, we first show the full image frame in two formats, in both the spherical domain and the 2D planar domain. Moreover, in figures (a) to (d), we select two Focused Icosahedral meshes from different locations on the sphere to demonstrate the effectiveness of directly applying super-resolution to spherical data.

4.7 Discussion

The computer vision community has made tremendous progress in 2D super-resolution, while there is little precedent for directly applying super-resolution to spherical signals. Doing so poses significant challenges, e.g., how to perform convolution and deconvolution operations in 3D space, and how to perform PixelShuffle in 3D space with other spherical convolution methods. In this paper, we provide a straightforward approach that directly applies 3D convolution to spherical signals and achieves good results, showing great potential in the 3D super-resolution area. In addition, we show that it is feasible to perform super-resolution directly on spherical signals, which avoids issues that arise when applying 2D super-resolution in 3D space, such as distortion and oversampled pixels. We believe there are many interesting directions ahead for directly applying super-resolution to spherical signals.

5 Conclusion

In this paper, we proposed the Focused Icosahedral Mesh, a memory- and bandwidth-efficient representation of the spherical mesh that is more flexible than full meshes and saves a significant amount of computation, together with a novel VertexShuffle operation that further improves performance compared to a baseline model using the MeshConv Transpose operation. To illustrate our ideas, we presented a spherical super-resolution model that shows great capacity and potential for applying traditional 2D computer vision tasks to spherical signals. For evaluation, we created a new high-resolution spherical super-resolution dataset by extracting key frames from a set of collected 360-degree videos. Experiments on this dataset show that our proposed model is superior to the baseline model on spherical super-resolution tasks with remarkable efficiency.

References

  • [1] A. F. Agarap (2018) Deep learning using rectified linear units (ReLU). arXiv preprint arXiv:1803.08375. Cited by: §3.4.
  • [2] J. R. Baumgardner and P. O. Frederickson (1985) Icosahedral discretization of the two-sphere. SIAM Journal on Numerical Analysis 22 (6), pp. 1107–1115. Cited by: §3.1.
  • [3] J. Chen, M. Hu, Z. Luo, Z. Wang, and D. Wu (2020) SR360: boosting 360-degree video streaming with super-resolution. In Proceedings of the 30th ACM Workshop on Network and Operating Systems Support for Digital Audio and Video, NOSSDAV '20, New York, NY, USA. Cited by: §2.3.
  • [4] J. Chen, M. Hu, Z. Luo, Z. Wang, and D. Wu (2020) SR360: boosting 360-degree video streaming with super-resolution. In Proceedings of the 30th ACM Workshop on Network and Operating Systems Support for Digital Audio and Video, pp. 1–6. Cited by: §2.1.
  • [5] T. Cohen, M. Geiger, J. Köhler, and M. Welling (2017) Convolutional networks for spherical signals. arXiv preprint arXiv:1709.04893. Cited by: §1, §1, §2.2.
  • [6] T. S. Cohen, M. Weiler, B. Kicanaoglu, and M. Welling (2019) Gauge equivariant convolutional networks and the icosahedral cnn. arXiv preprint arXiv:1902.04615. Cited by: §2.2.
  • [7] X. Corbillon, F. De Simone, and G. Simon (2017) 360-degree video head movement dataset. In Proceedings of the 8th ACM on Multimedia Systems Conference, pp. 199–204. Cited by: §4.1, §4.1, §4.1, footnote 2, footnote 3.
  • [8] X. Corbillon, G. Simon, A. Devlic, and J. Chakareski (2017) Viewport-adaptive navigable 360-degree video delivery. In Communications (ICC), 2017 IEEE International Conference on, pp. 1–7. Cited by: §2.1.
  • [9] M. Dasari, A. Bhattacharya, S. Vargas, P. Sahu, A. Balasubramanian, and S. R. Das Streaming 360-degree videos using super-resolution. Cited by: §2.1.
  • [10] C. Dong, C. C. Loy, K. He, and X. Tang (2014) Learning a deep convolutional network for image super-resolution. In European conference on computer vision, pp. 184–199. Cited by: §1, §2.3.
  • [11] C. Dong, C. C. Loy, K. He, and X. Tang (2015) Image super-resolution using deep convolutional networks. IEEE transactions on pattern analysis and machine intelligence 38 (2), pp. 295–307. Cited by: §1, §2.3.
  • [12] C. Dong, C. C. Loy, and X. Tang (2016) Accelerating the super-resolution convolutional neural network. In European conference on computer vision, pp. 391–407. Cited by: §1, §2.3.
  • [13] M. Eder, M. Shvets, J. Lim, and J. Frahm (2020) Tangent images for mitigating spherical distortion. pp. 12426–12434. Cited by: §2.1, §2.2.
  • [14] Equirectangular Projection. Note: http://mathworld.wolfram.com/EquirectangularProjection.html Cited by: §1, §1.
  • [15] FFmpeg. Note: http://www.ffmpeg.org/ Cited by: §4.1.
  • [16] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680. Cited by: §2.3.
  • [17] M. Graf, C. Timmerer, and C. Mueller (2017) Towards bandwidth efficient adaptive streaming of omnidirectional video over http: design, implementation, and evaluation. In Proceedings of the 8th ACM on Multimedia Systems Conference, pp. 261–271. Cited by: §2.1.
  • [18] Y. Guan, C. Zheng, X. Zhang, Z. Guo, and J. Jiang (2019) Pano: optimizing 360 video streaming with a better understanding of quality perception. In Proceedings of the ACM Special Interest Group on Data Communication, pp. 394–407. Cited by: §2.1.
  • [19] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §2.3, §3.4, §3.4.
  • [20] S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. Cited by: §3.4.
  • [21] M. Irani and S. Peleg (1991) Improving resolution by image registration. CVGIP: Graphical models and image processing 53 (3), pp. 231–239. Cited by: §1.
  • [22] C. Jiang, J. Huang, K. Kashinath, P. Marcus, M. Niessner, et al. (2019) Spherical cnns on unstructured grids. arXiv preprint arXiv:1901.02039. Cited by: Applying VertexShuffle Toward 360-Degree Video Super-Resolution on Focused-Icosahedral-Mesh, §1, §1, §1, §2.2, §3.2, §3.4, §4.3.
  • [23] J. Kim, J. Kwon Lee, and K. Mu Lee (2016) Accurate image super-resolution using very deep convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1646–1654. Cited by: §2.3.
  • [24] J. Kim, J. Kwon Lee, and K. Mu Lee (2016) Deeply-recursive convolutional network for image super-resolution. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1637–1645. Cited by: §2.3.
  • [25] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §4.2.
  • [26] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, et al. (2017) Photo-realistic single image super-resolution using a generative adversarial network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4681–4690. Cited by: §2.3.
  • [27] B. Lim, S. Son, H. Kim, S. Nah, and K. Mu Lee (2017) Enhanced deep residual networks for single image super-resolution. In Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pp. 136–144. Cited by: §2.3.
  • [28] A. Mahzari, A. Taghavi Nasrabadi, A. Samiei, and R. Prakash (2018) FoV-aware edge caching for adaptive 360° video streaming. In 2018 ACM Multimedia Conference on Multimedia Conference, pp. 173–181. Cited by: §2.1.
  • [29] A. T. Nasrabadi, A. Mahzari, J. D. Beshay, and R. Prakash (2017) Adaptive 360-degree video streaming using scalable video coding. In Proceedings of the 2017 ACM on Multimedia Conference, pp. 1689–1697. Cited by: §2.1.
  • [30] S. Petrangeli, V. Swaminathan, M. Hosseini, and F. De Turck (2017) An http/2-based adaptive streaming framework for 360 virtual reality videos. In Proceedings of the 2017 ACM on Multimedia Conference, pp. 306–314. Cited by: §2.1.
  • [31] F. Qian, B. Han, Q. Xiao, and V. Gopalakrishnan (2018) Flare: practical viewport-adaptive 360-degree video streaming for mobile devices. In Proceedings of the 24th Annual International Conference on Mobile Computing and Networking, pp. 99–114. Cited by: §2.1.
  • [32] W. Shi, J. Caballero, F. Huszár, J. Totz, A. P. Aitken, R. Bishop, D. Rueckert, and Z. Wang (2016) Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1874–1883. Cited by: 2nd item, §1, §2.3, §3.3.
  • [33] L. Sun, F. Duanmu, Y. Liu, Y. Wang, Y. Ye, H. Shi, and D. Dai (2018) Multi-path multi-tier 360-degree video streaming in 5g networks. In Proceedings of the 9th ACM Multimedia Systems Conference, pp. 162–173. Cited by: §2.1.
  • [34] Y. Tai, J. Yang, and X. Liu (2017) Image super-resolution via deep recursive residual network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3147–3155. Cited by: §1.
  • [35] C. Wu, Z. Tan, Z. Wang, and S. Yang (2017) A dataset for exploring user behaviors in vr spherical video streaming. Taipei, Taiwan. Cited by: §4.1, §4.1.
  • [36] L. Xie, Z. Xu, Y. Ban, X. Zhang, and Z. Guo (2017) 360ProbDASH: improving qoe of 360 video streaming using tile-based http adaptive streaming. In Proceedings of the 2017 ACM on Multimedia Conference, pp. 315–323. Cited by: §2.1.
  • [37] B. Xiong and K. Grauman (2018) Snap angle prediction for 360 panoramas. pp. 3–18. Cited by: §2.1.
  • [38] A. Zare, A. Aminlou, M. M. Hannuksela, and M. Gabbouj (2016) HEVC-compliant tile-based streaming of panoramic video for virtual reality applications. In Proceedings of the 2016 ACM on Multimedia Conference, pp. 601–605. Cited by: §2.1.
  • [39] C. Zhang, S. Liwicki, W. Smith, and R. Cipolla (2019) Orientation-aware semantic segmentation on icosahedron spheres. pp. 3533–3541. Cited by: §2.2.
  • [40] Y. Zhang, Y. Tian, Y. Kong, B. Zhong, and Y. Fu (2018) Residual dense network for image super-resolution. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2472–2481. Cited by: §1.
  • [41] C. Zhou, Z. Li, and Y. Liu (2017) A measurement study of oculus 360 degree video streaming. In Proceedings of the 8th ACM on Multimedia Systems Conference, pp. 27–37. Cited by: §2.1.