Flat2Sphere: Learning Spherical Convolution for Fast Features from 360° Imagery

08/02/2017 · by Yu-Chuan Su, et al.

While 360° cameras offer tremendous new possibilities in vision, graphics, and augmented reality, the spherical images they produce make core feature extraction non-trivial. Convolutional neural networks (CNNs) trained on images from perspective cameras yield "flat" filters, yet 360° images cannot be projected to a single plane without significant distortion. A naive solution that repeatedly projects the viewing sphere to all tangent planes is accurate, but much too computationally intensive for real problems. We propose to learn a spherical convolutional network that translates a planar CNN to process 360° imagery directly in its equirectangular projection. Our approach learns to reproduce the flat filter outputs on 360° data, sensitive to the varying distortion effects across the viewing sphere. The key benefits are 1) efficient feature extraction for 360° images and video, and 2) the ability to leverage powerful pre-trained networks researchers have carefully honed (together with massive labeled image training sets) for perspective images. We validate our approach against several alternative methods in terms of both raw CNN output accuracy as well as applying a state-of-the-art "flat" object detector to 360° data. Our method yields the most accurate results while saving orders of magnitude in computation versus the existing exact reprojection solution.


1 Introduction

Unlike a traditional perspective camera, which samples a limited field of view of the 3D scene projected onto a 2D plane, a 360° camera captures the entire viewing sphere surrounding its optical center, providing a complete picture of the visual world: an omnidirectional field of view. As such, viewing 360° imagery provides a more immersive experience of the visual content compared to traditional media.

360° cameras are gaining popularity as part of the rising trend of virtual reality (VR) and augmented reality (AR) technologies, and will also be increasingly influential for wearable cameras, autonomous mobile robots, and video-based security applications. Consumer-level 360° cameras are now common on the market, and media sharing sites such as Facebook and YouTube have enabled support for 360° content. For consumers and artists, 360° cameras free the photographer from making real-time composition decisions. For VR/AR, 360° data is essential to content creation. As a result of this great potential, computer vision problems targeting 360° content are capturing the attention of both the research community and application developers.

Immediately, this raises the question: how to compute features from 360° images and videos? Arguably the most powerful tools in computer vision today are convolutional neural networks (CNNs). CNNs are responsible for state-of-the-art results across a wide range of vision problems, including image recognition zhou2014scenerecog ; he2016resnet , object detection girshick2014rcnn ; ren2015fasterRCNN , image and video segmentation long2015fcn ; he2017mask ; fusionseg2017 , and action detection twostream-actions ; feichtenhofer2016convaction . Furthermore, significant research effort over the last five years (and really decades lecun ) has led to well-honed CNN architectures that, when trained with massive labeled image datasets imagenet , produce "pre-trained" networks broadly useful as feature extractors for new problems. Indeed, such networks are widely adopted as off-the-shelf feature extractors for other algorithms and applications (c.f., VGG simonyan2014vgg , ResNet he2016resnet , and AlexNet alexnet for images; C3D c3d for video).

However, thus far, powerful CNN features are awkward if not off limits in practice for 360° imagery. The problem is that the underlying projection models of current CNNs and 360° data are different. Both the existing CNN filters and the expensive training data that produced them are "flat", i.e., the product of perspective projection to a plane. In contrast, a 360° image is projected onto the unit sphere surrounding the camera's optical center.

To address this discrepancy, there are two common, though flawed, approaches. In the first, the spherical image is projected to a planar one (e.g., with equirectangular projection, where latitudes are mapped to horizontal lines of uniform spacing), then the CNN is applied to the resulting 2D image lai2017semantic ; hu2017deeppilot (see Fig. 1, top). However, any sphere-to-plane projection introduces distortion, making the resulting convolutions inaccurate. In the second existing strategy, the 360° image is repeatedly projected to tangent planes around the sphere, each of which is then fed to the CNN xiao2012sun360 ; zhang2014panocontext ; su2016accv ; su2017cvpr (Fig. 1, bottom). In the extreme of sampling every tangent plane, this solution is exact and therefore accurate. However, it suffers from very high computational cost. Not only does it incur the cost of rendering each planar view, but it also prevents amortization of convolutions: the intermediate representations cannot be shared across perspective images because they are projected to different planes.

Figure 1: Two existing strategies for applying CNNs to 360° images. Top: The first strategy unwraps the 360° input into a single planar image using a global projection (most commonly equirectangular projection), then applies the CNN on the distorted planar image. Bottom: The second strategy samples multiple tangent planar projections to obtain multiple perspective images, to which the CNN is applied independently to obtain local results for the original 360° image. Strategy I is fast but inaccurate; Strategy II is accurate but slow. The proposed approach learns to replicate flat filters on spherical imagery, offering both speed and accuracy.

We propose a learning-based solution that, unlike the existing strategies, sacrifices neither accuracy nor efficiency. The main idea is to learn a CNN that processes a 360° image in its equirectangular projection (fast) but mimics the "flat" filter responses that an existing network would produce on all tangent plane projections for the original spherical image (accurate). Because the learned convolutions are indexed by spherical coordinates, we refer to our method as spherical convolution (SphConv). We develop a systematic procedure to adjust the network structure in order to account for distortions. Furthermore, we propose a kernel-wise pre-training procedure which significantly accelerates the training process.

In addition to providing fast general feature extraction for 360° imagery, our approach provides a bridge from 360° content to existing heavily supervised datasets dedicated to perspective images. In particular, training requires no new annotations: only the target CNN model (e.g., VGG simonyan2014vgg pre-trained on millions of labeled images) and an arbitrary collection of unlabeled 360° images.

We evaluate SphConv on the Pano2Vid su2016accv and PASCAL VOC pascal datasets, both for raw convolution accuracy and for impact on an object detection task. We show that it produces more precise outputs than baseline methods requiring similar computational cost, and outputs similarly precise as the exact solution while using orders of magnitude less computation. Furthermore, we demonstrate that SphConv can successfully replicate the widely used Faster-RCNN ren2015fasterRCNN detector on 360° data when trained with only 1,000 unlabeled 360° images containing unrelated objects. For a cost similar to that of the baselines, SphConv generates better object proposals and recognition rates.

2 Related Work

360° vision

Vision for 360° data is quickly gaining interest. The SUN360 project samples multiple perspective images to perform scene viewpoint recognition xiao2012sun360 . PanoContext zhang2014panocontext parses 360° images using 3D bounding boxes, applying algorithms like line detection on perspective images and then backprojecting the results to the sphere. Motivated by the limitations of existing interfaces for viewing 360° video, several methods study how to automate field-of-view (FOV) control for display su2016accv ; su2017cvpr ; lai2017semantic ; hu2017deeppilot , adopting one of the two existing strategies for convolutions (Fig. 1). In these methods, a noted bottleneck is the feature extraction cost, which is hampered by repeated sampling of perspective images/frames, e.g., to represent the space-time "glimpses" of su2017cvpr ; su2016accv . This is exactly where our work can have positive impact. Prior work studies the impact of panoramic or wide-angle images on hand-crafted features like SIFT furnari2017affine ; hansen2007scale ; hansen2007wide . While not applicable to CNNs, such work supports the need for features specific to 360° imagery, and thus motivates SphConv.

Knowledge distillation

Our approach relates to knowledge distillation ba2014distilling ; hinton2015distilling ; romero2014fitnets ; parisotto2015actormimic ; gupta2016suptransfer ; wang2016modelregression ; bucilua2006compression , though we explore it in an entirely novel setting. Distillation aims to learn a new model given existing model(s). Rather than optimize an objective function on annotated data, it learns the new model so that it reproduces the behavior of the existing model, by minimizing the difference between their outputs. Most prior work explores distillation for model compression bucilua2006compression ; ba2014distilling ; romero2014fitnets ; hinton2015distilling . For example, a deep network can be distilled into a shallower ba2014distilling or thinner romero2014fitnets one, or an ensemble can be compressed into a single model hinton2015distilling . Rather than compress a model within the same domain, our goal is to learn across domains, namely to link networks on images with different projection models. Limited work considers distillation for transfer parisotto2015actormimic ; gupta2016suptransfer . In particular, unlabeled target-source paired data can help learn a CNN for a domain lacking labeled instances (e.g., RGB vs. depth images) gupta2016suptransfer , and multi-task policies can be learned to simulate the action value distributions of expert policies parisotto2015actormimic . Our problem can also be seen as a form of transfer, though for a novel task motivated strongly by image processing complexity as well as supervision costs. Different from any of the above, we show how to adapt the network structure to account for geometric transformations caused by different projections. Also, whereas most prior work uses only the final output for supervision, we use the intermediate representations of the target network as both input and target output to enable kernel-wise pre-training.

Spherical image projection

Projecting a spherical image onto a planar image is a long-studied problem, and a large number of projection approaches exist (e.g., equirectangular, Mercator, etc.) barre1987curvilinear . None is perfect; every projection must introduce some form of distortion. The properties of different projections are analyzed in the context of displaying panoramic images lihi-squaring . In this work, we unwrap spherical images using equirectangular projection because 1) it is a very common format used by 360° camera vendors and researchers xiao2012sun360 ; su2016accv ; fb2015meta , and 2) it is equidistant along each row and column, so the convolution kernel does not depend on the azimuthal angle. Our method could in principle be applied to other projections; their effect on the convolution operation remains to be studied.

CNNs with geometric transformations

There is increasing interest in generalizing convolution in CNNs to handle geometric transformations or deformations. Spatial transformer networks (STNs) jaderberg2015spatial represent a geometric transformation as a sampling layer and predict the transformation parameters from the input data. STNs assume the transformation is invertible, such that the subsequent convolution can be performed on data free of the transformation. This is not possible for spherical images, because it would require a projection that introduces no distortion. Active convolution jeon2017active learns the kernel shape together with the weights for a more general receptive field, and deformable convolution dai2017deformable goes one step further by predicting the receptive field location. These methods are too restrictive for spherical convolution, because they require a fixed kernel size and weight across the image. In contrast, our method adapts the kernel size and weights based on the transformation to achieve better accuracy. Furthermore, our method exploits problem-specific geometric information for efficient training and testing. Some recent work studies convolution on a sphere cohen2017convolutional ; khasanova2017graph using spectral analysis, but those methods require manually annotated spherical images as training data, whereas our method can exploit existing models trained on perspective images as supervision. Also, it is unclear whether CNNs in the spectral domain can reach the same accuracy and efficiency as CNNs on a regular grid.

3 Approach

We describe how to learn spherical convolutions in equirectangular projection given a target network trained on perspective images. Sec. 3.1 defines the objective. Sec. 3.2 describes how we adapt the structure of the target network. Finally, Sec. 3.3 presents our training process.

3.1 Problem Definition

Let I_s be the input spherical image defined on spherical coordinates (θ, φ), and let I_e be the corresponding flat RGB image in equirectangular projection. I_e is defined by pixels on the image coordinates (x, y), where each (x, y) is linearly mapped to a unique (θ, φ). We define the perspective projection operator P, which projects an α-degree field of view (FOV) from I_s to W × W pixels on the tangent plane n̂ = (θ, φ); that is, I_p = P(I_s, n̂) denotes the resulting perspective image. The projection operator is characterized by the pixel size in I_p, i.e., the angle subtended by each pixel. Note that we assume the pixel size is the same along both image dimensions, following common digital imagery.
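For concreteness, the sketch below shows one way the projection operator P could be implemented as a gnomonic (tangent-plane) projection with nearest-neighbor sampling; the function name, the default FOV, and the output resolution are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def gnomonic_view(equirect, lat0, lon0, fov_deg=65.5, W=224):
    """Sample a W x W perspective view tangent to the sphere at (lat0, lon0)
    (radians) from an equirectangular image, via inverse gnomonic projection
    with nearest-neighbor lookup. fov_deg and W are illustrative defaults."""
    H_e, W_e = equirect.shape[:2]
    half = np.tan(np.radians(fov_deg) / 2.0)
    xs = np.linspace(-half, half, W)
    x, y = np.meshgrid(xs, -xs)                      # tangent-plane coordinates
    rho = np.sqrt(x ** 2 + y ** 2) + 1e-12
    c = np.arctan(rho)
    sin_c, cos_c = np.sin(c), np.cos(c)
    # Inverse gnomonic projection: plane point -> latitude / longitude.
    lat = np.arcsin(np.clip(cos_c * np.sin(lat0)
                            + y * sin_c * np.cos(lat0) / rho, -1.0, 1.0))
    lon = lon0 + np.arctan2(x * sin_c,
                            rho * np.cos(lat0) * cos_c - y * np.sin(lat0) * sin_c)
    # Latitude / longitude -> equirectangular pixel coordinates.
    u = ((lon + np.pi) % (2 * np.pi)) / (2 * np.pi) * (W_e - 1)
    v = (np.pi / 2 - lat) / np.pi * (H_e - 1)
    return equirect[np.round(v).astype(int).clip(0, H_e - 1),
                    np.round(u).astype(int).clip(0, W_e - 1)]
```

In this reading, P(I_s, n̂) corresponds to one call of gnomonic_view at the tangent point n̂, and the effective pixel size is roughly fov_deg / W degrees.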

Given a target network N_p (e.g., AlexNet alexnet or VGG simonyan2014vgg pre-trained for a large-scale recognition task) trained on perspective images I_p with receptive field (Rf) R × R, we define its output on the spherical image I_s at n̂ = (θ, φ) as

    N_p(I_s)[θ, φ] = N_p(P(I_s, (θ, φ))),    (1)

where w.l.o.g. we assume W = R for simplicity. Our goal is to learn a spherical convolution network N_e that takes the equirectangular map I_e as input and, for every image position (x, y), produces as output the result of applying the perspective projection network to the corresponding tangent plane of the spherical image I_s:

    N_e(I_e)[x, y] ≈ N_p(I_s)[θ, φ].    (2)

This can be seen as a domain adaptation problem where we want to transfer the model from the domain of I_p to that of I_e. However, unlike typical domain adaptation problems, the difference between I_p and I_e is characterized by a geometric projection transformation rather than a shift in data distribution. Note that the training data to learn N_e requires no manual annotations: it consists of arbitrary 360° images coupled with the "true" N_p outputs computed by exhaustive planar reprojections, i.e., evaluating the rhs of Eq. 1 for every (θ, φ). Furthermore, at test time, only a single equirectangular projection of the entire 360° input needs to be processed by N_e to obtain the dense (inferred) N_p outputs, which would otherwise require multiple projections and evaluations of N_p.
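A minimal sketch of how the training targets (the rhs of Eq. 1) could be generated offline, assuming a perspective feature extractor `target_cnn` and the hypothetical `gnomonic_view` helper sketched above; names and the sampling grid are placeholders rather than the authors' pipeline.

```python
import numpy as np

def make_training_targets(equirect, target_cnn, thetas, phis, fov_deg=65.5, W=224):
    """Evaluate the rhs of Eq. 1 on a grid of tangent points: render each
    tangent-plane view and record the target CNN's output there. target_cnn is
    any perspective feature extractor (e.g., VGG conv5_3 features)."""
    targets = {}
    for theta in thetas:                      # polar angles (output rows)
        for phi in phis:                      # azimuthal angles (output columns)
            lat = np.pi / 2 - theta           # polar angle -> latitude
            view = gnomonic_view(equirect, lat, phi, fov_deg, W)
            targets[(theta, phi)] = target_cnn(view)  # "ground truth" for N_e
    return targets
```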

3.2 Network Structure

Figure 2: Inverse perspective projections to equirectangular projection at different polar angles θ. The same square image distorts to different sizes and shapes depending on θ. Because equirectangular projection unwraps the full longitude of the sphere, a line will be split in two if it passes through the longitude at which the sphere is cut, which causes the double curve visible in the figure.

The main challenge in transferring N_p to N_e is the distortion introduced by equirectangular projection. The distortion is location dependent: a square in perspective projection will not be a square in the equirectangular projection, and its shape and size will depend on the polar angle θ (see Fig. 2). The convolution kernel should transform accordingly. Our approach 1) adjusts the shape of the convolution kernel to account for the distortion, in particular the content expansion, and 2) reduces the number of max-pooling layers to match the pixel sizes in N_e and N_p, as we detail next.

We adapt the architecture of N_e from N_p using the following heuristic. The goal is to ensure that each kernel receives enough information from the input in order to compute the target output. First, we untie the weights of the convolution kernels at different θ by learning one kernel K_y for each output row y. Next, we adjust the shape of K_y such that it covers the Rf of the original kernel K_p. We consider K_y to cover K_p if a sufficiently large fraction of the pixels in the Rf of K_p also lie in the Rf of K_y on I_s. The Rf of K_y on I_s is obtained by backprojecting its grid to n̂ = (θ, φ) using the inverse projection, where the center of the grid aligns on n̂. K_y should be large enough to cover K_p, but it should also be as small as possible to avoid overfitting. Therefore, we optimize the shape of K_y for each layer as follows. The kernel shape is initialized to that of the original kernel. We first adjust the kernel height k_h, increasing it by 2 until the height of the Rf is larger than that of K_p on I_s. We then adjust the kernel width k_w in the same manner. Furthermore, we restrict the kernel size to be smaller than an upper bound U_k (see Fig. 4). Because the Rf of K_y depends on the layers below, we search for the kernel sizes starting from the bottom layer.

It is important to relax the kernel from being square to being rectangular, because equirectangular projection expands content horizontally near the poles of the sphere (see Fig. 2). If we restrict the kernel to be square, the Rf of K_y can easily become taller but narrower than that of K_p, which leads to overfitting. It is also important to restrict the kernel size; otherwise the kernel can grow wide rapidly near the poles and eventually cover an entire row. Although cutting off the kernel size may lead to information loss, the loss is not significant in practice, because pixels in equirectangular projection are not distributed uniformly on the unit sphere: they are denser near the poles, and the pixels are by nature redundant in the regions where the kernel size would expand dramatically.

Besides adjusting the kernel sizes, we also adjust the number of pooling layers to match the pixel sizes Δ_e and Δ_p in N_e and N_p. Because max-pooling introduces shift invariance up to k pixels in the image, which corresponds to kΔ degrees on the unit sphere, the physical meaning of max-pooling depends on the pixel size. Since the pixel size is usually larger in I_e, and max-pooling increases the pixel size by the pooling stride, we remove a pooling layer in N_e whenever its pixel size would otherwise exceed that in N_p (i.e., whenever Δ_e ≥ Δ_p).
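As a rough illustration of this rule (keep a pooling layer only while N_e's pixel size remains finer than N_p's), assuming pixel sizes are tracked in degrees; all names and the factor-of-2 stride are assumptions.

```python
def keep_pooling_layers(delta_e, delta_p_at_pool, pool_stride=2):
    """Decide, pooling layer by pooling layer, whether N_e keeps it.
    delta_e is the pixel size (degrees) of the equirectangular input;
    delta_p_at_pool lists N_p's pixel size at each of its pooling layers."""
    keep = []
    for delta_p in delta_p_at_pool:
        if delta_e < delta_p:          # N_e's pixels are still finer: pooling is safe
            keep.append(True)
            delta_e *= pool_stride     # pooling enlarges N_e's effective pixel size
        else:                          # pooling would make N_e coarser than N_p: drop it
            keep.append(False)
    return keep
```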

Fig. 3 illustrates how spherical convolution differs from an ordinary CNN. Note that we approximate one layer in N_p by one layer in N_e, so the number of layers and the number of output channels in each layer are exactly the same as in the target network. However, this does not have to be the case; for example, we could use two or more layers in N_e to approximate each layer in N_p. Although doing so may improve accuracy, it would also introduce significant overhead, so we stick with the one-to-one mapping.

Figure 3: Spherical convolution. The kernel weight in spherical convolution is tied only along each row of the equirectangular image (i.e., for each fixed y), and each kernel convolves along its row to generate a 1D output. Note that the kernel size differs across rows and layers, and it expands near the top and bottom of the image.

3.3 Training Process

Given the goal in Eq. 2 and the architecture described in Sec. 3.2, we would like to learn the network N_e by minimizing the L2 loss between N_e(I_e) and N_p(I_s). However, the network converges slowly, possibly due to the large number of parameters. Instead, we propose a kernel-wise pre-training process that disassembles the network and initially learns each kernel independently.

To perform kernel-wise pre-training, we further require N_e to generate the same intermediate representation as N_p in every layer l:

    N_e^l(I_e)[x, y] ≈ N_p^l(I_s)[θ, φ]   for all layers l.    (3)

Given Eq. 3, every layer in N_e becomes independent of the others. In fact, every kernel is independent and can be learned separately. We learn each kernel by taking the "ground truth" value of the previous layer, N_p^{l−1}(I_s), as input and minimizing the L2 loss to N_p^l(I_s), except for the first layer. Note that N_p^l refers to the convolution output of layer l before applying any non-linear operation, e.g., ReLU, max-pooling, etc. It is important to learn the target value before applying ReLU because it provides more information. We apply the non-linear operations together with the kernel during kernel-wise pre-training, and we use dilated convolution yu2015dilated to increase the Rf size instead of performing max-pooling on the input feature map.
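The sketch below illustrates kernel-wise pre-training for a single row kernel in PyTorch; the tensor shapes, dilation handling, and optimizer settings are our assumptions, not the authors' training code.

```python
import torch
import torch.nn as nn

def pretrain_row_kernel(prev_feat, target_feat, k_h, k_w, dilation=1,
                        steps=1000, lr=1e-2):
    """Fit one SphConv kernel for a single output row.
    prev_feat:   (1, C_in, H, W) exact layer l-1 features around the row,
                 precomputed from the target network.
    target_feat: (1, C_out, 1, W) pre-ReLU exact features of layer l at the row."""
    c_in, c_out = prev_feat.shape[1], target_feat.shape[1]
    conv = nn.Conv2d(c_in, c_out, kernel_size=(k_h, k_w), dilation=dilation,
                     padding=((k_h - 1) // 2 * dilation, (k_w - 1) // 2 * dilation))
    opt = torch.optim.Adam(conv.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        pred = conv(prev_feat)                        # (1, C_out, H, W)
        mid = pred.shape[2] // 2
        row = pred[:, :, mid:mid + 1, :]              # keep only the row of interest
        loss = nn.functional.mse_loss(row, target_feat)
        loss.backward()
        opt.step()
    return conv
```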

For the first convolution layer, we derive the analytic solution directly. The projection operator is linear in the pixels of the equirectangular projection: P(I_s, n̂)[x, y] = Σ_j c_j I_e[x_j, y_j], for interpolation coefficients c_j from, e.g., bilinear interpolation. Because convolution is likewise a weighted sum of input pixels, N_p^1(I_p)[x, y] = Σ_k w_k I_p[x_k, y_k], we can combine the kernel weights and interpolation coefficients into a single convolution operator:

    N_e^1(I_e)[x, y] = Σ_k w_k Σ_j c_kj I_e[x_j, y_j] = Σ_j ( Σ_k w_k c_kj ) I_e[x_j, y_j].    (4)

The output value of the first layer is therefore exact and requires no learning. Of course, the same is not possible for the deeper layers because of the non-linear operations between layers.
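Since both the interpolation in P and the first convolution are linear, the first-layer kernel can be assembled directly from Eq. 4; below is a small sketch using an illustrative sparse (dictionary) layout for the weights and coefficients, for one output unit and one input channel.

```python
def analytic_first_layer(w, coeff):
    """Combine first-layer perspective weights with projection interpolation
    coefficients as in Eq. 4.
    w:     {perspective tap k: weight w_k}
    coeff: {perspective tap k: {equirectangular pixel j: c_kj}}
    Returns {equirectangular pixel j: sum_k w_k * c_kj}."""
    combined = {}
    for k, w_k in w.items():
        for j, c_kj in coeff[k].items():
            combined[j] = combined.get(j, 0.0) + w_k * c_kj
    return combined
```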

After kernel-wise pre-training, we can further fine-tune the network jointly across layers and kernels by minimizing the loss on the final output. Because the pre-trained kernels cannot fully recover the intermediate representations, fine-tuning helps adjust the weights to account for residual errors. We ignore the constraint introduced in Eq. 3 when performing fine-tuning: although Eq. 3 is necessary for kernel-wise pre-training, it restricts the expressive power of N_e and degrades the performance if we only care about the final output. Nevertheless, the weights learned by kernel-wise pre-training are a very good initialization in practice, and we typically only need to fine-tune the network for a few epochs.

One limitation of SphConv is that it cannot handle very close objects that span a large FOV. Because the goal of SphConv is to reproduce the behavior of models trained on perspective images, the capability and performance of the model are bounded by the target model N_p. However, perspective cameras can capture only a small portion of a very close object, and such objects are usually not present in the training data of the target model. Therefore, even though 360° images offer a much wider FOV, SphConv inherits the limitations of N_p and may not recognize very close, large objects. Another limitation of SphConv is the resulting model size: because it unties the kernel weights along θ, the model size grows linearly with the equirectangular image height and can easily reach tens of gigabytes as the image resolution increases.

Figure 4: Method to select the kernel height k_h. We project the receptive field of the target kernel to equirectangular projection and increase k_h until the kernel's Rf is taller than the projected target Rf in I_e. The kernel width k_w is determined by the same procedure after k_h is set. We restrict the kernel size by an upper bound U_k.

4 Experiments

To evaluate our approach, we consider both the accuracy of its convolutions and its applicability to object detection in 360° data. We use the VGG architecture and the Faster-RCNN ren2015fasterRCNN model (https://github.com/rbgirshick/py-faster-rcnn) as our target network N_p. We learn a network N_e to produce the topmost (conv5_3) convolution output.

Datasets

We use two datasets: Pano2Vid for training, and Pano2Vid and PASCAL for testing.

Pano2Vid: We sample frames from the 360° videos in the Pano2Vid dataset su2016accv for both training and testing. The dataset consists of 86 360° videos crawled from YouTube using four keywords: "Hiking," "Mountain Climbing," "Parade," and "Soccer." We sample frames at 0.05 fps to obtain 1,056 frames for training and 168 frames for testing. We use "Mountain Climbing" for testing and the other categories for training, so the training and testing frames come from disjoint videos (see appendix for the sampling process). Because the supervision is on a per-pixel basis, each frame provides a large number of (non-i.i.d.) training samples. Note that most object categories targeted by the Faster-RCNN detector do not appear in Pano2Vid, meaning that our experiments test the content independence of our approach.

PASCAL VOC: Because the target model was originally trained and evaluated on PASCAL VOC 2007, we "360-ify" it to evaluate the object detector application. We test with the 4,952 PASCAL test images, which contain 12,032 bounding boxes. We transform them to equirectangular images as if they originated from a 360° camera. In particular, each object bounding box is backprojected to 3 different scales and 5 different polar angles on the image sphere using the inverse perspective projection, where the scales are defined relative to R, the resolution of the target network's Rf. Regions outside the bounding box are zero-padded (see appendix for details). Backprojection allows us to evaluate the performance at different levels of distortion in the equirectangular projection.

Metrics

We generate the conv5_3 output, which is widely used in the literature as a feature representation, and evaluate it with the following metrics.

Network output error measures the difference between N_e(I_e) and the exact N_p(I_s). In particular, we report the root-mean-square error (RMSE) over all pixels and channels. For PASCAL, we measure the error over the Rf of the detector network.
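A small sketch of this metric as we read it, with the error normalized by a mean-predictor baseline as in Fig. 5(a); the normalization detail is our interpretation of the text.

```python
import numpy as np

def normalized_rmse(pred, target):
    """RMSE between predicted and exact feature maps (C x H x W arrays),
    normalized by the RMSE of predicting the per-channel mean of the target."""
    rmse = np.sqrt(np.mean((pred - target) ** 2))
    mean_pred = target.mean(axis=(1, 2), keepdims=True)
    baseline = np.sqrt(np.mean((mean_pred - target) ** 2))
    return rmse / (baseline + 1e-12)
```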

Detector network performance measures the multi-class classification accuracy of the detector network in Faster-RCNN. We replace the ROI-pooling in Faster-RCNN by pooling over the bounding box in I_e. Note that the bounding box is backprojected to equirectangular projection and is no longer a square region.

Proposal network performance evaluates the proposal network in Faster-RCNN using average Intersection-over-Union (IoU). For each bounding box centered at n̂, we project the conv5_3 output to the tangent plane using P and apply the proposal network at the center of the bounding box on that plane. Given the predicted proposals, we compute the IoU between each foreground proposal and the bounding box and take the maximum; the IoU is set to 0 if there is no foreground proposal. Finally, we average the IoU over bounding boxes.

We stress that our goal is not to build a new object detector; rather, we aim to reproduce the behavior of existing 2D models on 360° data at lower computational cost. Thus, the metrics capture how accurately and how quickly we can replicate the exact solution.

Baselines

We compare our method with the following baselines.


  • Exact — Compute the true target value N_p(P(I_s, (θ, φ))) for every pixel. This serves as an upper bound on performance and does not consider the computational cost.

  • Direct — Apply N_p on I_e directly. We replace max-pooling with dilated convolution to produce a full-resolution output. This is Strategy I in Fig. 1 and is used in 360° video analysis lai2017semantic ; hu2017deeppilot .

  • Interp — Compute the exact output only every s pixels and interpolate the values for the others. We set s such that the computational cost is roughly the same as our SphConv. This is a more efficient variant of Strategy II in Fig. 1.

  • Perspective — Project I_s onto a cube map fb2015cubemap and then apply N_p on each face of the cube, which is a perspective image with a 90° FOV. The result is backprojected to I_e to obtain the features on I_e. The cube map resolution is chosen so that its pixel size is roughly the same as that of I_e. This is a second variant of Strategy II in Fig. 1, used in PanoContext zhang2014panocontext .

SphConv variants

We evaluate three variants of our approach:


  • OptSphConv — To compute the output for each layer l, OptSphConv computes the exact output for layer l−1 using the exact projection solution and then applies spherical convolution for layer l. OptSphConv serves as an upper bound for our approach, since it avoids accumulating error across layers.

  • SphConv-Pre — Uses the weights from kernel-wise pre-training directly without fine-tuning.

  • SphConv — The full spherical convolution with joint fine-tuning of all layers.

Implementation details

We set the resolution of I_e and the projection operator P following SUN360 xiao2012sun360 ; the resulting pixel size is larger in I_e than in I_p. Accordingly, we remove the first three max-pooling layers so that N_e has only one max-pooling layer, following conv4_3. The kernel size upper bound U_k is set following the maximum kernel size in VGG. We insert batch normalization for conv4_1 to conv5_3 (see appendix for details).

4.1 Network output accuracy and computational cost

Figure 5: (a) Network output error vs. polar angle on Pano2Vid; lower is better. Note that the error of Exact is 0 by definition. Our method's convolutions are much closer to the exact solution than the baselines'. (b) Computational cost vs. accuracy on PASCAL. Our approach yields accuracy closest to the exact solution while requiring orders of magnitude less computation time (left plot); our cost is similar to the other approximations tested (right plot). Plot titles indicate the y-labels, and error is measured by root-mean-square error (RMSE).

Fig. 5(a) shows the output error of layers conv3_3 and conv5_3 on the Pano2Vid su2016accv dataset (see appendix for similar results on other layers). The error is normalized by that of the mean predictor. We evaluate the error at 5 polar angles θ uniformly sampled from the northern hemisphere, since the error is roughly symmetric about the equator.

First we discuss the three variants of our method. OptSphConv performs the best in all layers and at all polar angles, validating our main idea of spherical convolution. It performs particularly well in the lower layers, because the Rf is larger in higher layers and the distortion becomes more significant there. Overall, SphConv-Pre performs second best, but as expected, the gap with OptSphConv becomes larger in higher layers because of error propagation. SphConv outperforms SphConv-Pre in conv5_3 at the cost of larger error in lower layers (as seen here for conv3_3). It also has larger error near the pole, for two possible reasons. First, the learning curve indicates that the network learns more slowly near the pole, possibly because the Rf is larger and the pixels degenerate there. Second, we optimize the joint loss, which may trade error near the pole for error at the center.

Comparing to the baselines, we see that ours achieves the lowest errors. Direct performs the worst among all methods, underscoring that convolutions on the flattened sphere, though fast, are inadequate. Interp performs better than Direct, and its error decreases in higher layers: because the Rf is larger in the higher layers, the s-pixel shift in I_e causes relatively smaller changes in the Rf and therefore in the network output. Perspective performs similarly across layers and outperforms Interp in the lower layers. The error of Perspective is particularly large at polar angles that fall close to the boundary of the sampled perspective images, where the perspective distortion is larger.

Fig. 5(b) shows the accuracy vs. cost tradeoff. We measure computational cost by the number of Multiply-Accumulate (MAC) operations. The leftmost plot shows cost on a log scale. Here we see that Exact, whose outputs we wish to replicate, is about 400 times slower than SphConv, and SphConv approaches Exact's detector accuracy much better than all baselines. The second plot shows that SphConv is faster than Interp while performing better on all metrics. Perspective is the fastest among all methods, followed by Direct; both are faster than SphConv. However, both baselines are noticeably inferior in accuracy compared to SphConv.

Figure 6: Three AlexNet conv1 kernels (left squares) and their corresponding four SphConv-Pre kernels at increasing polar angles θ (left to right).

To visualize what our approach has learned, we learn the first layer of the AlexNet alexnet model provided by the Caffe package jia2014caffe and examine the resulting kernels. Fig. 6 shows the original kernels and the corresponding kernels learned at different polar angles θ. The learned kernel is usually a re-scaled version of the original, but the weights are often amplified because multiple pixels of the original kernel fall onto the same pixel in the equirectangular projection, as in the second example. We also observe cases where the high-frequency signal in the kernel is reduced, as in the third example, possibly because the learned kernel is smaller. Note that we learn the first convolution layer for visualization purposes only, since it (alone) has an analytic solution (cf. Sec. 3.3). See appendix for the complete set of kernels.

4.2 Object detection and proposal accuracy

Figure 7: Faster-RCNN object detection accuracy on a 360° version of PASCAL across polar angles θ, for both (a) the detector network and (b) the proposal network (IoU). R refers to the Rf of N_p. Best viewed in color.

Having established that our approach provides accurate and efficient convolutions, we now examine how important that accuracy is to object detection on 360° inputs. Fig. 7(a) shows the result of the Faster-RCNN detector network on PASCAL in 360° format. OptSphConv performs almost as well as Exact. The performance degrades with SphConv-Pre because of error accumulation, but it still significantly outperforms Direct and is better than Interp and Perspective in most regions. Although joint training (SphConv) improves the output error near the equator, the error is larger near the pole, which degrades the detector performance. Note that the Rf of the detector network spans multiple rows, so the error is a weighted sum of the errors at different rows. This result, together with Fig. 5(a), suggests that SphConv reduces the conv5_3 error in parts of the Rf but increases it in others; the detector network needs accurate conv5_3 features throughout the Rf in order to generate good predictions.

Direct again performs the worst. In particular, its performance drops significantly near the pole, showing that it is sensitive to the distortion. In contrast, Interp performs better near the pole because the samples are denser on the unit sphere there; in fact, Interp should converge to Exact at the pole. Perspective outperforms Interp near the equator but is worse in other regions. Note that, depending on the polar angle, an object may fall on the top face of the cube map or near the border of a face. The results suggest that Perspective is still sensitive to the polar angle, and it performs best when the object is near the center of a face, where the perspective distortion is small.

Fig. 7(b) shows the performance of the object proposal network for two scales (see appendix for more). Interestingly, the result differs from the detector network. OptSphConv still performs almost the same as Exact, and SphConv-Pre performs better than the baselines. However, Direct now outperforms the other baselines, suggesting that the proposal network is not as sensitive as the detector network to the distortion introduced by equirectangular projection. The performance of the methods is similar when the object is larger (right plot), even though their output errors differ significantly. The only exception is Perspective, which performs poorly at some polar angles regardless of the object scale, again suggesting that objectness is sensitive to which perspective image is sampled.


Figure 8: Object detection examples on 360° PASCAL test images. Images show the top 40% of the equirectangular projection; black regions are undefined pixels. Text gives the predicted label, multi-class probability, and IoU, respectively. Our method successfully detects objects undergoing severe distortion, some of which are barely recognizable even for a human viewer.

Fig. 8 shows examples of objects successfully detected by our approach in spite of severe distortions. See appendix for more examples.

5 Conclusion

We propose to learn spherical convolutions for 360° images. Our solution entails a new form of distillation across camera projection models. Compared to current practices for feature extraction on 360° images and video, spherical convolution benefits efficiency by avoiding multiple perspective projections, and it benefits accuracy by adapting the kernels to the distortions of equirectangular projection. Results on two datasets demonstrate how it successfully transfers state-of-the-art vision models from the realm of limited-FOV 2D imagery to the realm of omnidirectional data.

Future work will explore SphConv in the context of other dense prediction problems like segmentation, as well as the impact of different projection models within our basic framework.

References

  • (1) https://facebook360.fb.com/editing-360-photos-injecting-metadata/.
  • (2) https://code.facebook.com/posts/1638767863078802/under-the-hood-building-360-video/.
  • (3) J. Ba and R. Caruana. Do deep nets really need to be deep? In NIPS, 2014.
  • (4) A. Barre, A. Flocon, and R. Hansen. Curvilinear perspective, 1987.
  • (5) C. Buciluǎ, R. Caruana, and A. Niculescu-Mizil. Model compression. In ACM SIGKDD, 2006.
  • (6) T. Cohen, M. Geiger, J. Köhler, and M. Welling. Convolutional networks for spherical signals. arXiv preprint arXiv:1709.04893, 2017.
  • (7) J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu, and Y. Wei. Deformable convolutional networks. In ICCV, 2017.
  • (8) J. Deng, W. Dong, R. Socher, L. Li, and L. Fei-Fei. Imagenet: a large-scale hierarchical image database. In CVPR, 2009.
  • (9) M. Everingham, S. M. A. Eslami, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The pascal visual object classes challenge: A retrospective. International Journal of Computer Vision, 111(1):98–136, Jan. 2015.
  • (10) C. Feichtenhofer, A. Pinz, and A. Zisserman. Convolutional two-stream network fusion for video action recognition. In CVPR, 2016.
  • (11) A. Furnari, G. M. Farinella, A. R. Bruna, and S. Battiato. Affine covariant features for fisheye distortion local modeling. IEEE Transactions on Image Processing, 26(2):696–710, 2017.
  • (12) R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.
  • (13) S. Gupta, J. Hoffman, and J. Malik. Cross modal distillation for supervision transfer. In CVPR, 2016.
  • (14) P. Hansen, P. Corke, W. Boles, and K. Daniilidis. Scale-invariant features on the sphere. In ICCV, 2007.
  • (15) P. Hansen, P. Corket, W. Boles, and K. Daniilidis. Scale invariant feature matching with wide angle images. In IROS, 2007.
  • (16) K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask r-cnn. In ICCV, 2017.
  • (17) K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
  • (18) G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
  • (19) H.-N. Hu, Y.-C. Lin, M.-Y. Liu, H.-T. Cheng, Y.-J. Chang, and M. Sun. Deep 360 pilot: Learning a deep agent for piloting through sports video. In CVPR, 2017.
  • (20) M. Jaderberg, K. Simonyan, A. Zisserman, et al. Spatial transformer networks. In NIPS, 2015.
  • (21) S. Jain, B. Xiong, and K. Grauman. Fusionseg: Learning to combine motion and appearance for fully automatic segmentation of generic objects in video. In CVPR, 2017.
  • (22) Y. Jeon and J. Kim. Active convolution: Learning the shape of convolution for image classification. In CVPR, 2017.
  • (23) Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. In ACM MM, 2014.
  • (24) R. Khasanova and P. Frossard. Graph-based classification of omnidirectional images. arXiv preprint arXiv:1707.08301, 2017.
  • (25) D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • (26) A. Krizhevsky, I. Sutskever, and G. Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012.
  • (27) W.-S. Lai, Y. Huang, N. Joshi, C. Buehler, M.-H. Yang, and S. B. Kang. Semantic-driven generation of hyperlapse from 360° video. IEEE Transactions on Visualization and Computer Graphics, PP(99):1–1, 2017.
  • (28) Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. In Proc. of the IEEE, 1998.
  • (29) J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.
  • (30) E. Parisotto, J. Ba, and R. Salakhutdinov. Actor-mimic: Deep multitask and transfer reinforcement learning. In ICLR, 2016.
  • (31) S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In NIPS, 2015.
  • (32) A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and Y. Bengio. Fitnets: Hints for thin deep nets. In ICLR, 2015.
  • (33) K. Simonyan and A. Zisserman. Two-stream convolutional networks for action recognition in videos. In NIPS, 2014.
  • (34) K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
  • (35) Y.-C. Su and K. Grauman. Making 360° video watchable in 2d: Learning videography for click free viewing. In CVPR, 2017.
  • (36) Y.-C. Su, D. Jayaraman, and K. Grauman. Pano2vid: Automatic cinematography for watching 360° videos. In ACCV, 2016.
  • (37) D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning spatiotemporal features with 3d convolutional networks. In ICCV, 2015.
  • (38) Y.-X. Wang and M. Hebert. Learning to learn: Model regression networks for easy small sample learning. In ECCV, 2016.
  • (39) J. Xiao, K. A. Ehinger, A. Oliva, and A. Torralba. Recognizing scene viewpoint using panoramic place representation. In CVPR, 2012.
  • (40) F. Yu and V. Koltun. Multi-scale context aggregation by dilated convolutions. In ICLR, 2016.
  • (41) L. Zelnik-Manor, G. Peters, and P. Perona. Squaring the circle in panoramas. In ICCV, 2005.
  • (42) Y. Zhang, S. Song, P. Tan, and J. Xiao. Panocontext: A whole-room 3d context model for panoramic scene understanding. In ECCV, 2014.
  • (43) B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva. Learning deep features for scene recognition using places database. In NIPS, 2014.

Appendix A Spherical Convolution Network Structure

Fig. 9 shows how the proposed spherical convolutional network differs from an ordinary convolutional neural network (CNN). In a CNN, each kernel convolves over the entire 2D map to generate a 2D output. Alternatively, it can be considered as a neural network with a tied weight constraint, where the weights are shared across all rows and columns. In contrast, spherical convolution only ties the weights along each row. It learns a kernel for each row, and the kernel only convolves along the row to generate 1D output. Also, the kernel size may differ at different rows and layers, and it expands near the top and bottom of the image.

Figure 9: Spherical convolution illustration. The kernel weights at different rows of the image are untied, and each kernel convolves over one row to generate 1D output. The kernel size also differs at different rows and layers.

Appendix B Additional Implementation Details

We train the network using ADAM kingma2014adam . For pre-training, we use a batch size of 256 and initialize the learning rate to 0.01. For layers without batch normalization, we train each kernel for 16,000 iterations and decrease the learning rate by a factor of 10 every 4,000 iterations. For layers with batch normalization, we train for 4,000 iterations and decrease the learning rate every 1,000 iterations. For fine-tuning, we first fine-tune the network on conv3_3 for 12,000 iterations with a batch size of 1; the learning rate is set to 1e-5 and is divided by 10 after 6,000 iterations. We then fine-tune the network on conv5_3 for 2,048 iterations; the learning rate is initialized to 1e-4 and is divided by 10 after 1,024 iterations. We do not insert batch normalization in conv1_2 to conv3_3 because we empirically find that it increases the training error.
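A PyTorch-style reading of the pre-training schedule above (the step counts and rates come from the text; the optimizer wiring itself is an assumption):

```python
import torch

def make_pretrain_optimizer(params, has_batchnorm):
    """Adam with the stepped decay described above: lr starts at 0.01 and is
    divided by 10 every 4,000 iterations (16,000 total) without batch
    normalization, or every 1,000 iterations (4,000 total) with it."""
    opt = torch.optim.Adam(params, lr=0.01)
    step = 1000 if has_batchnorm else 4000
    sched = torch.optim.lr_scheduler.StepLR(opt, step_size=step, gamma=0.1)
    total_iters = 4000 if has_batchnorm else 16000
    return opt, sched, total_iters
```

The returned scheduler would then be stepped once per iteration inside the training loop.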

Appendix C Data Preparation

This section provides more details about the dataset splits and sampling procedures.

Pano2Vid

For the Pano2Vid dataset, we discard videos whose resolution is too low and sample frames at 0.05 fps. We use "Mountain Climbing" for testing because it contains the smallest number of frames. Note that the training data contains no instances of "Mountain Climbing," so our network is forced to generalize across semantic content. We sample at a low frame rate in order to reduce temporal redundancy in both the training and testing splits. For kernel-wise pre-training and testing, we sample the output on 40 pixels per row, uniformly, to reduce spatial redundancy. Our preliminary experiments show that a denser sample for training does not improve the performance.

PASCAL VOC 2007

As discussed in the main paper, we transform the 2D PASCAL images into equirectangular projected 360° data in order to test object detection in omnidirectional data while still relying on an existing ground-truthed dataset. For each bounding box, we resize the image so that the short side of the bounding box matches the target scale. The image is backprojected to the unit sphere using the inverse perspective projection, where the center of the bounding box lies on n̂ = (θ, φ). The unit sphere is then unwrapped into equirectangular projection as the test data. We resize each bounding box to three target scales defined relative to R, the Rf of N_p. Each bounding box is projected to 5 tangent planes at polar angles θ uniformly sampled from the northern hemisphere, with a fixed azimuthal angle φ. By sampling the boxes across a range of scales and tangent plane angles, we systematically test the approach under these varying conditions.

Appendix D Complete Experimental Results

This section contains additional experimental results that do not fit in the main paper.

Figure 10: Network output error.

Fig. 10 shows the error of each meta layer in the VGG architecture. This is the complete version of Fig. 5(a) in the main paper. It shows more clearly to what extent the error of SphConv increases as we go deeper in the network, as well as how the error of Interp decreases.

Figure 11: Proposal network accuracy (IoU).

Fig. 11 shows the proposal network accuracy for all three object scales. This is the complete version of Fig. 7(b) in the main paper. The performance of all methods improves at larger object scales, but Perspective still performs poorly near the equator.

Appendix E Additional Object Detection Examples

Figures 12, 13 and 14 show example detection results for SphConv-Pre on the 360° version of PASCAL VOC 2007. Note that the large black areas are undefined pixels; they exist because the original PASCAL test images are not 360° data, and their content occupies only a portion of the viewing sphere.

Figure 12: Object detection results on PASCAL VOC 2007 test images transformed to equirectangular projected inputs at different polar angles θ. Black areas indicate regions outside the narrow field of view (FOV) of the original PASCAL images, i.e., undefined pixels. The polar angle increases from top to bottom. Our approach successfully learns to translate a 2D object detector trained on perspective images to 360° inputs.
Figure 13: Object detection results on PASCAL VOC 2007 test images transformed to equirectangular projected inputs at another polar angle θ.
Figure 14: Object detection results on PASCAL VOC 2007 test images transformed to equirectangular projected inputs at a third polar angle θ.

Fig. 15 shows examples where the proposal network generates a tight bounding box while the detector network fails to predict the correct object category. While the distortion is not as severe as in some of the success cases, it makes these already confusing cases more difficult. Fig. 16 shows examples where the proposal network fails to generate a tight bounding box. The bounding box shown is the one with the best intersection-over-union (IoU), which is below 0.5 in both examples.

Figure 15: Failure cases of the detector network.
Figure 16: Failure cases of the proposal network.

Appendix F Visualizing Kernels in Spherical Convolution

Fig. 17 shows the target conv1 kernels of the AlexNet alexnet model and the corresponding kernels learned by our approach at different polar angles θ. This is the complete list for Fig. 6 in the main paper. Here we see how each kernel stretches according to the polar angle, and it is clear that some of the kernels in spherical convolution have larger weights than the original kernels. As discussed in the main paper, these examples are for visualization only: the first layer is amenable to an analytic solution, and only the deeper layers are learned by our method.

Figure 17: Learned conv1 kernels in AlexNet (full). Each square patch is an AlexNet kernel in perspective projection. The four rectangular kernels beside it are the kernels learned in our network to achieve the same features when applied to an equirectangular projection of the viewing sphere.