NeurVPS: Neural Vanishing Point Scanning via Conic Convolution

October 14, 2019 · Yichao Zhou et al.

We present a simple yet effective end-to-end trainable deep network with geometry-inspired convolutional operators for detecting vanishing points in images. Traditional convolutional neural networks rely on aggregating edge features and do not have mechanisms to directly exploit the geometric properties of vanishing points as the intersections of parallel lines. In this work, we identify a canonical conic space in which the neural network can effectively compute the global geometric information of vanishing points locally, and we propose a novel operator named conic convolution that can be implemented as regular convolutions in this space. This new operator explicitly enforces feature extractions and aggregations along the structural lines and yet has the same number of parameters as the regular 2D convolution. Our extensive experiments on both synthetic and real-world datasets show that the proposed operator significantly improves the performance of vanishing point detection over traditional methods. The code and dataset have been made publicly available at https://github.com/zhou13/neurvps.


1 Introduction

Vanishing point detection is a classic and important problem in 3D vision. Given the camera calibration, vanishing points give us the direction of 3D lines, and thus let us infer 3D information of the scene from a single 2D image. A robust and accurate vanishing point detection algorithm enables and enhances applications such as camera calibration Cipolla et al. (1999), 3D reconstruction Guillou et al. (2000), photo forensics O’Brien and Farid (2012), object detection Hoiem et al. (2008), wireframe parsing Zhou et al. (2019a, b), and autonomous driving Lee et al. (2017).

Although there has been a lot of work on this seemingly basic vision problem, no solution is quite satisfactory yet. Traditional methods (see Zhang and Kosecka (2002); Kosecka and Zhang (2002); Tardif (2009) and references therein) usually first use edge/line detectors to extract straight lines and then cluster them into multiple groups. Many recent methods propose to improve the detection by training deep neural networks with labeled data. However, such neural networks often offer only a coarse estimate of the position of vanishing points Kluger et al. (2017) or horizon lines Zhai et al. (2016). The output is usually a component of a multi-stage system and serves as an initialization to remove outliers for line clustering. The main reason for neural networks' poor precision in vanishing point detection (compared to line clustering-based methods) is arguably that existing network architectures are not designed to represent or learn the special geometric properties of vanishing points and their relations to structural lines.

To address this issue, we propose a new convolutional neural network, called Neural Vanishing Point Scanner (NeurVPS), that explicitly encodes and hence exploits the global geometric information of vanishing points and can be trained end-to-end to predict vanishing points both robustly and accurately. Our method samples a sufficient number of candidate points, and the network then determines which of them are valid. A common criterion for a valid vanishing point is whether it lies at the intersection of a sufficient number of structural lines. Therefore, the role of our network is to measure the strength of the structural-line signals passing through the candidate point. Although this notion is simple and clear, learning such a geometric concept is challenging for neural networks, because the relationship between a candidate point and the structural lines depends not only on the global line orientations but also on their pixel locations. In this work, we identify a canonical conic space in which this relationship depends only on local line orientations. For each pixel, we define this space as a local coordinate system whose x-axis is the direction from the pixel to the candidate point, so the associated structural lines in this space are always horizontal.

We propose a conic convolution operator, which applies a regular convolution to each pixel in this conic space. This is similar to applying regular convolutions on a rectified image in which the related structural lines are transformed into horizontal lines, so the network can determine how to use the signals based on local orientations. In addition, feature aggregation in this rectified image becomes geometrically meaningful, since horizontal aggregation in the rectified image is identical to feature aggregation along the structural lines.

Based on the canonical space and the conic convolution operator, we design a convolutional neural network that accurately predicts vanishing points. We conduct extensive experiments and show improvements by a significant margin on both synthetic and real-world datasets. Ablation studies verify the importance of the proposed conic convolution operator.

2 Related Work

Vanishing Point Detection.

Vanishing point detection is a fundamental and yet surprisingly challenging problem in computer vision. Since it was first posed by Barnard (1983), researchers have tackled this problem from different perspectives. Early work estimates vanishing points using sphere geometry Barnard (1983); Magee and Aggarwal (1984); Straforini et al. (1993), hierarchical Hough transforms Quan and Mohr (1989), or EM algorithms Zhang and Kosecka (2002); Kosecka and Zhang (2002). Works such as Wildenauer and Hanbury (2012); Mirzaei and Roumeliotis (2011); Bazin et al. (2012); Antunes and Barreto (2013) use the Manhattan world assumption Coughlan and Yuille (1999) to improve the accuracy and reliability of the detection. Barinova et al. (2010) extends the mutual orthogonality assumption to a set of mutually orthogonal vanishing points (the Atlanta world assumption Schindler and Dellaert (2004)).

The dominant approach is line-based vanishing point detection, which is often divided into several stages. First, a set of lines is detected Canny (1987); Von Gioi et al. (2008). Then a line clustering algorithm McLean and Kotturi (1995) is used to propose several guesses of the target vanishing point position based on geometric cues. The clustering methods include RANSAC Bolles and Fischler (1981), J-linkage Tardif (2009), Hough transforms Hough (1959), and EM Zhang and Kosecka (2002); Kosecka and Zhang (2002). Zhou et al. (2017) uses contour detection and J-linkage for natural scenes, but it can detect only one dominant vanishing point. Our method does not rely on existing line detectors, and it automatically learns line features in the conic space to predict any number of vanishing points from the image.

Recently, with the help of convolutional neural networks, the vision community has tried to tackle the problem with data-driven, supervised learning approaches. Chang et al. (2018); Borji (2016); Zhang et al. (2018) formulate vanishing point detection as a patch classification problem and can only detect vanishing points within the image frame; our method does not have such a limitation. Zhai et al. (2016) detects vanishing points by first estimating horizontal vanishing line candidates and scoring them by the vanishing points they pass through, using an ImageNet pre-trained neural network fine-tuned on Google street images. Kluger et al. (2017) uses an inverse gnomonic image and regresses the spherical image representation of the vanishing point. Both works rely on traditional line detection algorithms, whereas our method learns line features implicitly in the conic space.

Structured Convolution Operators.

Recently, more and more operators have been proposed to model spatial and geometric properties in images. For instance, wavelet- and x-let-based scattering networks (ScatNet) Bruna and Mallat (2013); Sifre and Mallat (2013) were introduced to ensure certain transformation (say, translational) invariances of the network. Jaderberg et al. (2015) first explores geometric deformation with modern neural networks. Dai et al. (2017b); Jeon and Kim (2017) turn the parameterization of a global deformable transformation into local convolution operators to improve performance on image classification, object detection, and semantic segmentation. More recently, structured and free-form filters have been composed Shelhamer et al. (2019). While these methods let the network learn the space in which the convolution operates, we explicitly define this space from first principles and exploit its geometric structure. Our method is similar to Jaderberg et al. (2015) in that both rectify the input into a canonical space; the difference is that they learn a single global rectification transformation, whereas our transformation adapts to each pixel location. Different from Dai et al. (2017b); Jeon and Kim (2017), our convolution kernel shape is not learned but designed according to the desired geometric property.

Guided design of convolution kernels in a canonical space is well practiced for irregular data. For spherical images, Cohen et al. (2018) design operators for rotation-invariant features, while Jiang et al. (2019) operate in the space defined by longitude and latitude, which is more meaningful for climate data. In 3D vision, geodesic CNN Masci et al. (2015) performs mesh convolution in spherical coordinates, while TextureNet Huang et al. (2019) operates in a canonical space defined by globally smoothed principal directions. Although we deal with regular images, we observe a strong relation between vanishing points and the conic space, in which the conic operator is more effective than the regular 2D convolution.

3 Methods

3.1 Overview

Figure 1: Illustration of the sampled locations of conic convolutions. The bright yellow region is the output pixel, and the marked point stands for the vanishing point. The upper and lower figures illustrate the cases in which the vanishing point is outside and inside the image, respectively.

Figure 2: Illustration of the overall network structure. The numbers in each convolutional block are the kernel size and the output dimension, respectively. The number in each fully connected block is the output dimension. The max-pooling layers have kernel size 3 and stride 2. Batch normalization and ReLU activation are appended after each conv/fc layer, except the last one, which uses a sigmoid activation.

Figure 2 illustrates the overall structure of our NeurVPS network. Taking an image and a candidate vanishing point as input, our network predicts the probability of the candidate being near a ground-truth vanishing point. Our network has two parts: a backbone feature extraction network and a conic convolution sub-network. The backbone is a conventional CNN that extracts semantic features from images. We use a single-stack hourglass network Newell et al. (2016) for its ability to maintain a large receptive field while preserving fine spatial details. The conic convolutional network (Section 3.4) takes the feature maps from the backbone as input and determines the existence of vanishing points around candidate positions (as a classification problem). The conic convolution operators (Section 3.3) exploit the geometric priors of vanishing points and thus allow our algorithm to achieve superior performance without resorting to line detectors. Our system is end-to-end trainable.

Due to the classification nature of our model, we need to sample a sufficient number of candidate points during inference. It is computationally infeasible to directly sample candidates densely enough over the whole sphere. Therefore, we use a coarse-to-fine approach (Section 3.5). We first sample points on the unit sphere and calculate their likelihoods of being the line direction (Section 3.2) of a vanishing point using the trained neural network classifier. We then pick the top candidates and sample new points in their neighborhoods. This step is repeated until we reach the desired resolution.

3.2 Basic Geometry and Representations of Vanishing Points

The position of a vanishing point encodes the 3D direction of lines. For a 3D ray $\{\mathbf{p}_0 + t\,\mathbf{d} \mid t \ge 0\}$, where $\mathbf{p}_0$ is its origin and $\mathbf{d}$ is its direction vector, its 2D projection on the image is

$$z(t)\begin{bmatrix} x(t) \\ y(t) \\ 1 \end{bmatrix} = \mathbf{K}\,\bigl(\mathbf{p}_0 + t\,\mathbf{d}\bigr), \qquad \mathbf{K} = \begin{bmatrix} f & 0 & c_x \\ 0 & f & c_y \\ 0 & 0 & 1 \end{bmatrix}, \tag{1}$$

where $x(t)$ and $y(t)$ are the coordinates in the image space, $z(t)$ is the depth in the camera space, $\mathbf{K}$ is the calibration matrix, $f$ is the focal length, and $(c_x, c_y)$ is the optical center of the camera. The vanishing point $\tilde{\mathbf{v}} = (\tilde v_x, \tilde v_y)$ is the point with $t \to \infty$, whose image coordinate is $\lim_{t \to \infty} (x(t), y(t))$. We can then derive the 3D direction of a line in terms of its vanishing point:

$$\mathbf{d} = \frac{\mathbf{K}^{-1}\,[\tilde v_x,\ \tilde v_y,\ 1]^{T}}{\bigl\|\mathbf{K}^{-1}\,[\tilde v_x,\ \tilde v_y,\ 1]^{T}\bigr\|} \propto \bigl[\tilde v_x - c_x,\ \tilde v_y - c_y,\ f\bigr]^{T}. \tag{2}$$

In the literature, the normalized line direction vector $\mathbf{d}$ is also called the Gaussian sphere representation Barnard (1983) of the vanishing point $\tilde{\mathbf{v}}$. Using $\mathbf{d}$ instead of $\tilde{\mathbf{v}}$ avoids the degenerate cases in which $\mathbf{d}$ is parallel to the image plane. It also gives a natural metric for the distance between two vanishing points, namely the angle between their normalized line direction vectors: $\angle(\mathbf{d}_1, \mathbf{d}_2) = \arccos\,|\mathbf{d}_1^{T}\mathbf{d}_2|$ for two unit line directions $\mathbf{d}_1$ and $\mathbf{d}_2$. Finally, sampling vanishing points with the Gaussian sphere representation is easy, as it is equivalent to sampling on a unit sphere, while it remains ambiguous how to sample vanishing points directly in the image plane.
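To make the representation concrete, here is a minimal NumPy sketch (our own illustration, not the released NeurVPS code; function names such as vp_to_direction are ours) that converts between an image-space vanishing point and its Gaussian sphere line direction under a pinhole camera, and evaluates the angular metric between two vanishing points:

```python
import numpy as np

def vp_to_direction(vp_xy, f, c):
    """Map an image-space vanishing point to a unit 3D line direction
    (Gaussian sphere representation), assuming a pinhole camera with
    focal length f and optical center c = (cx, cy)."""
    d = np.array([vp_xy[0] - c[0], vp_xy[1] - c[1], f], dtype=np.float64)
    return d / np.linalg.norm(d)

def direction_to_vp(d, f, c):
    """Inverse map; d[2] == 0 corresponds to a vanishing point at infinity."""
    return np.array([c[0] + f * d[0] / d[2], c[1] + f * d[1] / d[2]])

def angle_between(d1, d2):
    """Angular distance between two vanishing points, measured between
    their unit line directions (the sign of the direction is irrelevant)."""
    return np.degrees(np.arccos(np.clip(abs(np.dot(d1, d2)), -1.0, 1.0)))

# Example: a vanishing point far to the right of a 512x512 image.
d = vp_to_direction((5000.0, 256.0), f=256.0, c=(256.0, 256.0))
print(d, angle_between(d, np.array([1.0, 0.0, 0.0])))
```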

3.3 Conic Convolution Operators in Conic Space

In order for the network to effectively learn vanishing-point-related line features, we want to apply convolutions in a space where the related lines can be determined locally. We define the conic space for each pixel in the image domain as a rotated local coordinate system whose x-axis is the direction from the pixel to the vanishing point. In this space, related lines can be identified locally by whether their orientations are horizontal. Accordingly, we propose a novel convolution operator, named conic convolution, which applies the regular convolution in this conic space. This operator effectively encodes global geometric cues for classifying whether a candidate point (Section 3.6) is a valid vanishing point. Figure 1 illustrates how this operator works.

A conic convolution takes an input feature map $\mathbf{F}$ and the coordinate $\mathbf{v}$ of the convolution center (the position candidate of the vanishing point) and outputs a feature map $\mathbf{F}'$ of the same resolution. For a $3 \times 3$ kernel, the output feature map can be computed as

$$\mathbf{F}'(\mathbf{p}) = \sum_{i=-1}^{1}\sum_{j=-1}^{1} \mathbf{w}[i, j]\;\mathbf{F}\bigl(\mathbf{p} + R_{\theta_{\mathbf{p}}}\,[i,\ j]^{T}\bigr), \qquad [\cos\theta_{\mathbf{p}},\ \sin\theta_{\mathbf{p}}]^{T} = \frac{\mathbf{v} - \mathbf{p}}{\|\mathbf{v} - \mathbf{p}\|}. \tag{3}$$

Here $\mathbf{p}$ is the coordinate of the output pixel, $\mathbf{w}$ is a trainable convolution filter, $R_{\theta_{\mathbf{p}}}$ is the rotation matrix that rotates a 2D vector by $\theta_{\mathbf{p}}$ counterclockwise, and $(\cos\theta_{\mathbf{p}}, \sin\theta_{\mathbf{p}})$ is the normalized direction vector that points from the output pixel $\mathbf{p}$ to the convolution center $\mathbf{v}$. We use bilinear interpolation to access values of $\mathbf{F}$ at non-integer coordinates.

Intuitively, conic convolution makes edge detection easier and more accurate. An ordinary convolution may need hundreds of filters to recognize edges with different orientations, whereas conic convolution requires far fewer filters to recognize edges aligned with the candidate vanishing point, because the filters are first rotated toward the vanishing point. The strong or weak responses (depending on whether the candidate is positive or negative) are then aggregated by the subsequent fully connected layers.
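To illustrate the operator, the following PyTorch sketch implements a 3x3 conic convolution by bilinearly sampling the input feature map at kernel offsets rotated toward the candidate point. It mirrors Equation 3 in spirit but is a simplified re-implementation of ours built on torch.nn.functional.grid_sample rather than the modified im2col routine of the released code; all names are ours:

```python
import torch
import torch.nn.functional as F

def conic_conv2d(x, weight, bias, vp):
    """3x3 conic convolution (stride 1, zero padding outside the image).
    x:      (B, C_in, H, W) input feature map
    weight: (C_out, C_in, 3, 3) trainable filter
    bias:   (C_out,) or None
    vp:     (B, 2) candidate vanishing point in (x, y) pixel coordinates
    For every output pixel, the 3x3 sampling grid is rotated so that its
    x-axis points toward vp; values are read with bilinear interpolation."""
    B, C, H, W = x.shape
    device, dtype = x.device, x.dtype
    ys, xs = torch.meshgrid(torch.arange(H, device=device, dtype=dtype),
                            torch.arange(W, device=device, dtype=dtype),
                            indexing="ij")
    pix = torch.stack([xs, ys], dim=-1)                         # (H, W, 2)
    d = vp.view(B, 1, 1, 2) - pix                               # (B, H, W, 2)
    d = d / d.norm(dim=-1, keepdim=True).clamp(min=1e-6)        # unit direction to vp
    cos, sin = d[..., 0], d[..., 1]

    # 3x3 kernel offsets (dx, dy) in the conic (rotated) frame, row-major.
    off = torch.tensor([[i, j] for j in (-1, 0, 1) for i in (-1, 0, 1)],
                       dtype=dtype, device=device)              # (9, 2)
    cols = []
    for k in range(9):
        dx, dy = off[k]
        # Rotate the offset by the per-pixel angle (counterclockwise).
        ox = cos * dx - sin * dy
        oy = sin * dx + cos * dy
        sx = (xs + ox) / (W - 1) * 2 - 1                        # normalize to [-1, 1]
        sy = (ys + oy) / (H - 1) * 2 - 1
        grid = torch.stack([sx, sy], dim=-1)                    # (B, H, W, 2)
        cols.append(F.grid_sample(x, grid, align_corners=True,
                                  padding_mode="zeros"))        # (B, C, H, W)
    cols = torch.stack(cols, dim=2)                             # (B, C, 9, H, W)
    w = weight.view(weight.shape[0], C * 9)                     # (C_out, C*9)
    out = torch.einsum("oc,bchw->bohw", w, cols.view(B, C * 9, H, W))
    if bias is not None:
        out = out + bias.view(1, -1, 1, 1)
    return out
```

In the actual implementation the same sampling pattern is folded into a modified "im2col + GEMM" routine (see Section 4.2); the grid_sample version above trades speed for brevity.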

Figure 3: Illustration of vanishing points’ Gaussian sphere representation of an image from the SU3 wireframe dataset Zhou et al. (2019b) and our multi-resolution sampling procedure in the coarse-to-fine inference. In the right three figures, the red triangles represent the ground truth vanishing points and the dots represent the sampled locations.

3.4 Conic Convolutional Network

The conic convolutional network is a classifier that takes the image feature map and a candidate vanishing point position $\mathbf{v}$ as input. For each angle threshold $\gamma$ in a set $\Gamma$, the network predicts whether there exists a true vanishing point in the image such that the angle between its 3D line direction and that of $\mathbf{v}$ is less than $\gamma$. The choice of $\Gamma$ will be discussed in Section 3.5.

Figure 2 shows the structure diagram of the proposed conic convolutional network. We first reduce the dimension of the feature map from the backbone with a $1 \times 1$ convolution layer to save GPU memory. Then 4 consecutive conic convolution (with ReLU activation) and max-pooling layers are applied to capture the geometric information at different spatial resolutions. The channel dimension is doubled in each layer to compensate for the reduced spatial resolution. After that, we flatten the feature map and use two fully connected layers to aggregate the features. Finally, a sigmoid classifier with a binary cross-entropy loss is applied on top of the features to discriminate positive and negative samples with respect to each threshold in $\Gamma$.
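As a rough sketch of how such a classification head could be assembled on top of the conic_conv2d routine above, the module below follows the described layout (1x1 reduction, four conic convolution + max-pooling stages with doubling channels, two fully connected layers, per-threshold sigmoid outputs). The channel widths, input resolution, and the omission of batch normalization are simplifications of ours, not the paper's exact configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConicHead(nn.Module):
    """Vanishing-point classifier head (illustrative sketch)."""
    def __init__(self, in_ch=256, base_ch=32, feat_hw=64, num_thresholds=3):
        super().__init__()
        self.reduce = nn.Conv2d(in_ch, base_ch, kernel_size=1)   # dimension reduction
        self.weights, self.biases = nn.ParameterList(), nn.ParameterList()
        ch = base_ch
        for _ in range(4):                                       # four conic conv stages
            self.weights.append(nn.Parameter(torch.randn(ch * 2, ch, 3, 3) * 0.01))
            self.biases.append(nn.Parameter(torch.zeros(ch * 2)))
            ch *= 2
        flat = ch * (feat_hw // 2 ** 4) ** 2
        self.fc = nn.Sequential(nn.Linear(flat, 256), nn.ReLU(inplace=True),
                                nn.Linear(256, num_thresholds))

    def forward(self, feat, vp):
        x = self.reduce(feat)
        scale = 1.0
        for w, b in zip(self.weights, self.biases):
            # The candidate point must be rescaled to the current feature resolution.
            x = F.relu(conic_conv2d(x, w, b, vp * scale))
            x = F.max_pool2d(x, kernel_size=3, stride=2, padding=1)
            scale *= 0.5
        logits = self.fc(x.flatten(1))
        return torch.sigmoid(logits)   # one probability per angle threshold
```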

3.5 Coarse-to-fine Inference

Figure 4: Illustration of the variables used in uniform spherical cap sampling.

With the backbone and the conic convolutional network, we can compute the probability that any unit line direction vector $\mathbf{d}$ on the hemisphere corresponds to a vanishing point, as shown in Figure 3. We utilize a multi-resolution strategy to quickly pinpoint the locations of the vanishing points. We use $R$ rounds to search for the vanishing points. In the $r$-th round, we uniformly sample $N$ line direction vectors on the surface of the unit spherical cap with direction $\mathbf{d}_r$ and polar angle $\gamma_r$ using the Fibonacci lattice González (2010). Mathematically, the $i$-th sampled line direction vector can be written as

$$\mathbf{d}_r^{(i)} = \cos\alpha_i\,\mathbf{d}_r + \sin\alpha_i\bigl(\cos\beta_i\,\mathbf{e}_1 + \sin\beta_i\,\mathbf{e}_2\bigr), \qquad \alpha_i = \arccos\Bigl(1 - \tfrac{i}{N}(1 - \cos\gamma_r)\Bigr), \quad \beta_i = \pi\,(1 + \sqrt{5})\,i,$$

in which $\mathbf{e}_1$ and $\mathbf{e}_2$ are two arbitrary orthogonal unit vectors that are perpendicular to $\mathbf{d}_r$, as shown in Figure 4. We initialize $\mathbf{d}_1$ to the optical axis and $\gamma_1 = \pi/2$ so that the first round covers the whole hemisphere. For round $r+1$, we set the threshold

$$\gamma_{r+1} = \lambda\,\gamma_r / \sqrt{N} \tag{4}$$

and $\mathbf{d}_{r+1}$ to the sampled direction $\mathbf{d}_r^{(i)}$ whose vanishing point obtains the best score from the conic convolutional network classifier with angle threshold $\gamma_{r+1}$. Here, $\lambda$ is a hyperparameter controlling the distance between two nearby spherical caps. Accordingly, we set the threshold set $\Gamma$ used in Section 3.4 to $\{\gamma_2, \gamma_3, \ldots, \gamma_{R+1}\}$.
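A minimal NumPy sketch of the spherical-cap sampling used in each round, assuming a golden-angle (Fibonacci) lattice; the exact lattice parameterization and the scoring function (score_fn below) are placeholders of ours:

```python
import numpy as np

GOLDEN_ANGLE = np.pi * (3.0 - np.sqrt(5.0))

def sample_spherical_cap(center, polar_angle, n):
    """Return n unit vectors roughly uniformly covering the spherical cap
    of half-angle `polar_angle` around the unit vector `center`."""
    center = center / np.linalg.norm(center)
    # Two arbitrary unit vectors orthogonal to `center`.
    helper = np.array([1.0, 0, 0]) if abs(center[0]) < 0.9 else np.array([0, 1.0, 0])
    e1 = np.cross(center, helper); e1 /= np.linalg.norm(e1)
    e2 = np.cross(center, e1)
    i = np.arange(n) + 0.5
    # Uniform in cos(theta) over [cos(polar_angle), 1] -> uniform area on the cap.
    cos_t = 1.0 - (i / n) * (1.0 - np.cos(polar_angle))
    sin_t = np.sqrt(1.0 - cos_t ** 2)
    phi = GOLDEN_ANGLE * np.arange(n)
    return (cos_t[:, None] * center
            + sin_t[:, None] * (np.cos(phi)[:, None] * e1 + np.sin(phi)[:, None] * e2))

# Coarse-to-fine loop (schematic): shrink the cap around the best candidate.
# dirs = sample_spherical_cap(np.array([0, 0, 1.0]), np.pi / 2, 64)
# best = dirs[np.argmax(score_fn(dirs))]   # then repeat with a smaller polar angle
```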

The above process detects a single dominant vanishing point in a given image. To search for more than one vanishing point, one can modify the first round to keep the several best-scoring line directions and run the remaining rounds for each of them.

3.6 Vanishing Point Sampling for Training

During training, we need to generate positive and negative samples. For each ground-truth vanishing point with line direction $\mathbf{d}^*$ and each threshold $\gamma \in \Gamma$, we sample positive and negative candidate vanishing points. The positive candidates are uniformly sampled from the spherical cap $\{\mathbf{d} : \angle(\mathbf{d}, \mathbf{d}^*) < \gamma\}$ and the negative candidates are uniformly sampled from its complement $\{\mathbf{d} : \angle(\mathbf{d}, \mathbf{d}^*) \ge \gamma\}$. In addition, we sample random vanishing points for each image to reduce the sampling bias; their line directions are uniformly sampled from the unit hemisphere.
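A hedged sketch of how positive and negative training directions could be drawn for one ground-truth direction and one threshold; the sample counts and the angular range used for negatives are placeholders of ours, since the exact values are not given here:

```python
import numpy as np

def sample_around(d_gt, gamma, n, inside=True, rng=np.random):
    """Sample n unit line directions whose angle to d_gt is below gamma
    (positives) or between gamma and 90 degrees (negatives)."""
    d_gt = d_gt / np.linalg.norm(d_gt)
    helper = np.array([1.0, 0, 0]) if abs(d_gt[0]) < 0.9 else np.array([0, 1.0, 0])
    e1 = np.cross(d_gt, helper); e1 /= np.linalg.norm(e1)
    e2 = np.cross(d_gt, e1)
    if inside:                      # uniform over the cap [0, gamma)
        cos_t = rng.uniform(np.cos(gamma), 1.0, n)
    else:                           # uniform over the band [gamma, pi/2]
        cos_t = rng.uniform(0.0, np.cos(gamma), n)
    sin_t = np.sqrt(1.0 - cos_t ** 2)
    phi = rng.uniform(0.0, 2 * np.pi, n)
    return (cos_t[:, None] * d_gt
            + sin_t[:, None] * (np.cos(phi)[:, None] * e1 + np.sin(phi)[:, None] * e2))

positives = sample_around(np.array([0.2, 0.1, 1.0]), np.radians(2.0), 8, inside=True)
negatives = sample_around(np.array([0.2, 0.1, 1.0]), np.radians(2.0), 8, inside=False)
```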

4 Experiments

4.1 Datasets and Metric

We conduct experiments on both synthetic Zhou et al. (2019b) and real-world Zhou et al. (2017); Dai et al. (2017a) datasets.

Natural Scene Zhou et al. (2017). This dataset contains images of natural scenes from AVA and Flickr. The authors pick images that contain only one dominant vanishing point and label its location. There are 2,275 images in the dataset, which we randomly divide into 2,000 training images and 275 test images. Because this dataset does not contain camera calibration information, we set the focal length to half of the sensor width for vanishing point sampling and evaluation. Such a focal length simulates the wide-angle lenses used in landscape photography.

ScanNet Dai et al. (2017a). ScanNet is a 3D indoor environment dataset with reconstructed meshes and RGB images captured by mobile devices. For each scene, we find the three orthogonal principal directions that align with most of the surface normals and use them to compute the vanishing points for each RGB image. We split the dataset as suggested by the ScanNet v2 tasks and train the network to predict the three vanishing points given an RGB image. There are 266,844 training images. We randomly sample images from the validation set as our test set.

SU3 Wireframe Zhou et al. (2019b). The “ground-truth” vanishing point positions in real world datasets are often inaccurate. To systematically evaluate the performance of our algorithm, we test our method on the recent synthetic SceneCity Urban 3D (SU3) wireframe dataset Zhou et al. (2019b). This dataset is created with a procedural building generator, in which the vanishing points are directly computed from the CAD models of the buildings. It contains 22,500 training images and 500 validation images.

Evaluation Metrics. Previous methods usually use horizon detection accuracy Barinova et al. (2010); Lezama et al. (2014); Zhai et al. (2016) or pixel consistency Zhou et al. (2017) to evaluate their methods, but these metrics are indirect for our task. To better understand the performance of our algorithm, we propose a new metric, called angle accuracy (AA). For each predicted vanishing point, we calculate the angle between the ground truth and the prediction, and we count the percentage of predictions whose angle difference is within a pre-defined threshold. Varying the threshold yields an angle accuracy curve. AA for a given upper bound is defined as the area under this curve between zero and the upper bound, divided by the upper bound. In our experiments, we report AA at three upper bounds (tighter ones for the synthetic dataset and looser ones for the real-world datasets) and plot two angle accuracy curves (coarse and fine level) for each dataset. This metric reflects the algorithm's performance under different precision requirements. For a fair comparison, we also report the performance metrics used by the dataset paper Zhou et al. (2017) in the supplementary materials.
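The following sketch (our own, not the paper's evaluation script) computes AA for a list of per-prediction angle differences by integrating the angle accuracy curve up to a chosen upper bound:

```python
import numpy as np

def angle_accuracy(angle_diffs_deg, upper_bound_deg, num_bins=1000):
    """Area under the cumulative angle-accuracy curve on [0, upper_bound],
    normalized by upper_bound so the result lies in [0, 1]."""
    diffs = np.asarray(angle_diffs_deg, dtype=np.float64)
    thresholds = np.linspace(0.0, upper_bound_deg, num_bins + 1)
    # Fraction of predictions whose angle difference is within each threshold.
    curve = (diffs[None, :] <= thresholds[:, None]).mean(axis=1)
    return np.trapz(curve, thresholds) / upper_bound_deg

print(angle_accuracy([0.1, 0.4, 2.0, 30.0], upper_bound_deg=1.0))  # ~0.375
```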

4.2 Implementation Detail

We implement the conic convolution operator in PyTorch by modifying the "im2col + GEMM" routine that is commonly used to implement ordinary convolutions: we change the sampling locations of the im2col function according to Equation 3, similar to the method used in Dai et al. (2017b). Input images are resized to a fixed resolution. During training, the Adam optimizer Kingma and Ba (2014) is used with fixed learning rate and weight decay. All experiments are conducted on two NVIDIA RTX 2080Ti GPUs, with each GPU holding 6 mini-batches. For the synthetic data Zhou et al. (2019b), we train for a fixed number of epochs and reduce the learning rate once late in training. The hyperparameters of the coarse-to-fine inference are chosen separately for the SU3, Natural Scene, and ScanNet datasets. For the Natural Scene dataset, due to the small amount of training data, we fine-tune the model trained on the SU3 dataset with a reduced learning rate. For ScanNet Dai et al. (2017a), we train the model for a fixed number of epochs. We augment the data with horizontal flips. During inference, the results from the backbone network are shared, so only the conic convolution layers need to be forwarded multiple times. Using the Natural Scene dataset as an example, we conduct 4 rounds of coarse-to-fine inference, in each of which we sample 64 vanishing points, so the conic convolution part is forwarded 256 times per image during testing. The evaluation speed is about 1.5 vanishing points per second on a single GPU.

Figure 5: Angle accuracy curves for different methods on the SU3 wireframe dataset Zhou et al. (2019b). Panels (a) and (b) show the fine and coarse ranges of angle difference, respectively.
Figure 6: Angle accuracy curves for different methods on the Natural Scene dataset Zhou et al. (2017). Panels (a) and (b) show the fine and coarse ranges of angle difference, respectively.
Figure 7: Angle accuracy curves for different methods on the ScanNet dataset Dai et al. (2017a). Panels (a) and (b) show the fine and coarse ranges of angle difference, respectively.

4.3 Ablation Studies on the Synthetic Dataset

                          AA (fine)   AA (mid)   AA (coarse)   mean    median
LSD Feng et al. (2010)       27.9        47.9        61.5        3.89     0.21
REG                           2.2         6.5        15.0        2.07     1.48
CLS                           2.2         9.1        23.7        1.77     0.99
Conic2                       10.5        28.9        50.3        0.78     0.43
Conic4                       47.5        74.2        86.3        0.15     0.09
Conic6                       49.1        74.0        86.2        0.14     0.09

Table 1: Ablation study of our method. The three AA columns give angle accuracy (in percent) under increasingly loose angle-difference upper bounds; mean and median are the mean and median angle differences. "REG" denotes the baseline that directly regresses the line direction in the camera space. "CLS" denotes the baseline that classifies vanishing points using the image feature and the candidate coordinate. ConicN denotes our method with N conic convolution layers.

Comparison with Baseline Methods. We compare our method with both traditional line detection based methods and neural network based methods. Sample images and results can be found in Figure 3 and the supplementary materials. Among line-based methods, LSD line detection with J-linkage clustering Von Gioi et al. (2008); Tardif (2009) is probably the most widely used approach for vanishing point detection; we test the implementation from Feng et al. (2010). Note that LSD is a strong competitor on the SU3 dataset, as the images contain many sharp edges and long straight lines.

We aim to compare with pure neural network methods that rely only on raw pixels as input. Existing methods such as Chang et al. (2018); Denis et al. (2008); Borji (2016) can only detect vanishing points inside the image, and Zhai et al. (2016); Kluger et al. (2017) rely on an external line map as initial input. To the best of our knowledge, there are no existing pure neural network methods general enough to handle our setting. Therefore, we propose two intuitive and reasonable baselines.

The first baseline, called REG, is a neural network that directly regresses the line direction in the camera space with a chamfer loss, similar to the network in Zhou et al. (2019b). We change all the conic convolutions to traditional 2D convolutions so that the number of parameters stays the same. The second baseline, called CLS, uses our coarse-to-fine classification approach. We change all the conic convolutions to their traditional counterparts and concatenate the candidate vanishing point to the feature map right before feeding it to the NeurVPS head, so that the network is aware of the position of the candidate.

The results are shown in Table 1 and Figure 5. Our method (Conic4) can effectively utilize the geometric priors and large-scale training data, and it significantly outperforms the other baselines across all metrics. We note that, compared to LSD, the neural network baselines perform better in terms of mean angle difference but much worse in AA. This is because about 20% of LSD's predictions fail with large angle differences, while the predictions of the neural networks are relatively stable (but not accurate enough). This phenomenon is also observed in Figure 5(b), where the neural network baselines achieve higher percentages when the angle difference is larger than 4.5°.

Effect of Conic Convolution. We now examine the effect of the number of conic convolution layers. We test networks with two, four, and six conic convolution layers, denoted Conic2, Conic4, and Conic6, respectively. For Conic2, we keep only the last two conic convolutions and replace the others with their plain counterparts. For Conic6, we add two more conic convolution layers at the finest level, without max pooling appended. The results are shown in Table 1 and Figure 5. We observe that the performance keeps increasing as more conic convolutions are added. We hypothesize that stacking multiple conic convolutions enables our model to capture higher-order edge information and thus significantly increases the performance. The improvement saturates around four conic convolution layers.

4.4 NeurVPS on the Real World Datasets

                              AA (fine)   AA (mid)   AA (coarse)   mean    median
Contour Zhou et al. (2017)       18.5        33.0        60.0       12.6     1.56
CLS                               0.0         0.0         3.2       29.3    26.0
REG                               0.0         1.5        26.6       14.4     8.37
Ours (no pre-training)           19.6        35.9        65.9       11.0     1.32
Ours (SU3 pre-trained)           23.4        40.5        69.2        7.88    1.10

Table 2: Performance of algorithms on the Natural Scene dataset Zhou et al. (2017). The two "Ours" rows show our NeurVPS without and with pre-training on the SU3 dataset Zhou et al. (2019b), respectively.
                          AA (fine)   AA (mid)   AA (coarse)   mean   median
LSD Feng et al. (2010)        1.7         5.4        24.8        12.6   11.8
REG                           1.5         5.1        45.1         6.9    5.0
CLS                           2.0         8.1        55.9         5.3    3.6
Ours                          3.4        11.5        61.7         4.5    3.0

Table 3: Performance of algorithms on the ScanNet dataset Dai et al. (2017a).

Natural Scene Zhou et al. (2017). We finally validate our method on real-world datasets to demonstrate its effectiveness and generalizability. The results on the Natural Scene dataset Zhou et al. (2017) are shown in Table 2 and Figure 6; we also report the performance in the metric used by the dataset paper Zhou et al. (2017) in the supplementary materials. Because this dataset contains only 2,000 training images, we pre-train our model on the SU3 dataset and then fine-tune it on Zhou et al. (2017). Our result outperforms the strong baseline method (labeled Contour) proposed by the dataset paper Zhou et al. (2017) by a large margin at both coarse and fine levels, even without the pre-training. The success of the pre-training shows that our model has a certain ability to transfer low-level geometric priors; otherwise the pre-training would not help, since the image styles of the two datasets are different. It is worth noting that previous deep learning methods Zhai et al. (2016) cannot outperform the Contour baseline Zhou et al. (2017) in such natural scene settings, as shown in the reference Zhou et al. (2017). We also find that the classification baseline barely converges. We suspect that this is because the neural network tries to use the concatenated vanishing point feature in the fashion of a nearest-neighbour classifier, which does not generalize well on such a small dataset. The regression baseline works a little better, but it still under-performs the Contour baseline by a fairly large margin until the angle difference tolerance becomes large.

ScanNet Dai et al. (2017a). The results on the ScanNet dataset Dai et al. (2017a) are shown in Table 3 and Figure 7. Among traditional methods, we compare only with LSD + J-linkage, because methods such as Zhou et al. (2017) are not directly applicable when there are three vanishing points in a scene. Our results reduce the mean and median errors by 6 and 4 times, respectively, and the angle accuracy also improves by a large margin. ScanNet Dai et al. (2017a) is a large dataset, so both CLS and REG work reasonably well. However, because traditional convolution cannot fully exploit the geometric structure of vanishing points, these baselines perform worse than our conic convolutional neural network. It is also worth mentioning that the errors of the ground-truth vanishing points in the ScanNet dataset are quite large due to inaccurate 3D reconstruction and budget capture devices, which is probably why the performance gap between conic convolutional networks and traditional 2D convolutional networks is not as significant here.

One drawback of our data-driven method is the need for a large amount of training data. We do not evaluate our method on datasets such as YUD Denis et al. (2008), ECD Barinova et al. (2010), and HLW Workman et al. (2016) because there is no suitable public dataset for training. In the future, we will study how to exploit geometric information under unsupervised or semi-supervised settings to alleviate the data scarcity problem.

References

  • [1] M. Antunes and J. P. Barreto (2013) A global approach for the detection of vanishing points and mutually orthogonal vanishing directions. In CVPR, Cited by: §2.
  • [2] O. Barinova, V. Lempitsky, E. Tretiak, and P. Kohli (2010) Geometric image parsing in man-made environments. In ECCV, Cited by: §2, §4.1, §4.4.
  • [3] S. T. Barnard (1983) Interpreting perspective images. Artificial intelligence. Cited by: §2, §3.2.
  • [4] J. Bazin, Y. Seo, C. Demonceaux, P. Vasseur, K. Ikeuchi, I. Kweon, and M. Pollefeys (2012) Globally optimal line clustering and vanishing point estimation in Manhattan world. In CVPR, Cited by: §2.
  • [5] R. C. Bolles and M. A. Fischler (1981) A RANSAC-based approach to model fitting and its application to finding cylinders in range data. In IJCAI, Cited by: §2.
  • [6] A. Borji (2016) Vanishing point detection with convolutional neural networks. arXiv preprint. Cited by: §2, §4.3.
  • [7] J. Bruna and S. Mallat (2013) Invariant scattering convolution networks. IEEE TPAMI 35 (8), pp. 1872–1886. Cited by: §2.
  • [8] J. Canny (1987) A computational approach to edge detection. Morgan Kaufmann Publishers Inc.. Cited by: §2.
  • [9] C. Chang, J. Zhao, and L. Itti (2018) DeepVP: deep learning for vanishing point detection on 1 million street view images. In ICRA, Cited by: §2, §4.3.
  • [10] R. Cipolla, T. Drummond, and D. P. Robertson (1999) Camera calibration from vanishing points in image of architectural scenes.. In BMVC, Cited by: §1.
  • [11] T. S. Cohen, M. Geiger, J. Köhler, and M. Welling (2018) Spherical CNNs. In ICLR 2018, Cited by: §2.
  • [12] J. M. Coughlan and A. L. Yuille (1999)

    Manhattan world: compass direction from a single image by Bayesian inference

    .
    In ICCV, Cited by: §2.
  • [13] A. Dai, A. X. Chang, M. Savva, M. Halber, T. Funkhouser, and M. Nießner (2017) ScanNet: richly-annotated 3D reconstructions of indoor scenes. In CVPR, Cited by: Figure 7, §4.1, §4.1, §4.2, §4.4, Table 3.
  • [14] J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu, and Y. Wei (2017) Deformable convolutional networks. In ICCV, Cited by: §2, §4.2.
  • [15] P. Denis, J. H. Elder, and F. J. Estrada (2008) Efficient edge-based methods for estimating Manhattan frames in urban imagery. In ECCV, Cited by: §4.3, §4.4.
  • [16] C. Feng, F. Deng, and V. R. Kamat (2010) Semi-automatic 3D reconstruction of piecewise planar building models from single image. CONVR. Cited by: §4.3, Table 1, Table 3.
  • [17] Á. González (2010) Measurement of areas on a sphere using fibonacci and latitude–longitude lattices. Mathematical Geosciences. Cited by: §3.5.
  • [18] E. Guillou, D. Meneveaux, E. Maisel, and K. Bouatouch (2000) Using vanishing points for camera calibration and coarse 3D reconstruction from a single image. The Visual Computer. Cited by: §1.
  • [19] D. Hoiem, A. A. Efros, and M. Hebert (2008) Putting objects in perspective. IJCV. Cited by: §1.
  • [20] P. V. Hough (1959) Machine analysis of bubble chamber pictures. In International Conference on High Energy Accelerators and Instrumentation, Cited by: §2.
  • [21] J. Huang, H. Zhang, L. Yi, T. Funkhouser, M. Nießner, and L. Guibas (2019) TextureNet: consistent local parametrizations for learning from high-resolution signals on meshes. In CVPR, Cited by: §2.
  • [22] M. Jaderberg, K. Simonyan, A. Zisserman, et al. (2015) Spatial transformer networks. In NIPS, Cited by: §2.
  • [23] Y. Jeon and J. Kim (2017) Active convolution: learning the shape of convolution for image classification. In CVPR, Cited by: §2.
  • [24] C. Jiang, J. Huang, K. Kashinath, P. Marcus, M. Niessner, et al. (2019) Spherical CNNs on unstructured grids. In ICLR 2019, Cited by: §2.
  • [25] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint. Cited by: §4.2.
  • [26] F. Kluger, H. Ackermann, M. Y. Yang, and B. Rosenhahn (2017) Deep learning for vanishing point detection using an inverse gnomonic projection. In GCPR, Cited by: §1, §2, §4.3.
  • [27] J. Kosecka and W. Zhang (2002) Video compass. In ECCV, Cited by: §1, §2, §2.
  • [28] S. Lee, J. Kim, J. Shin Yoon, S. Shin, O. Bailo, N. Kim, T. Lee, H. Seok Hong, S. Han, and I. So Kweon (2017) VPGNet: vanishing point guided network for lane and road marking detection and recognition. In ICCV, Cited by: §1.
  • [29] J. Lezama, R. Grompone von Gioi, G. Randall, and J. Morel (2014) Finding vanishing points via point alignments in image primal and dual domains. In CVPR, Cited by: §4.1.
  • [30] M. J. Magee and J. K. Aggarwal (1984) Determining vanishing points from perspective images. Computer Vision, Graphics, and Image Processing. Cited by: §2.
  • [31] J. Masci, D. Boscaini, M. Bronstein, and P. Vandergheynst (2015) Geodesic convolutional neural networks on riemannian manifolds. In ICCV Workshop, Cited by: §2.
  • [32] G. McLean and D. Kotturi (1995) Vanishing point detection by line clustering. PAMI. Cited by: §2.
  • [33] F. M. Mirzaei and S. I. Roumeliotis (2011) Optimal estimation of vanishing points in a Manhattan world. In ICCV, Cited by: §2.
  • [34] A. Newell, K. Yang, and J. Deng (2016) Stacked hourglass networks for human pose estimation. In ECCV, Cited by: §3.1.
  • [35] J. F. O’Brien and H. Farid (2012) Exposing photo manipulation with inconsistent reflections.. ToG. Cited by: §1.
  • [36] L. Quan and R. Mohr (1989) Determining perspective structures using hierarchical Hough transform. Pattern Recognition Letters, pp. 279–286. Cited by: §2.
  • [37] G. Schindler and F. Dellaert (2004) Atlanta world: an expectation maximization framework for simultaneous low-level edge grouping and camera calibration in complex man-made environments. In CVPR, Cited by: §2.
  • [38] E. Shelhamer, D. Wang, and T. Darrell (2019) Blurring the line between structure and learning to optimize and adapt receptive fields. arXiv preprint. Cited by: §2.
  • [39] L. Sifre and S. Mallat (2013) Rotation, scaling and deformation invariant scattering for texture discrimination. In CVPR, Cited by: §2.
  • [40] M. Straforini, C. Coelho, and M. Campani (1993) Extraction of vanishing points from images of indoor and outdoor scenes. Image and Vision Computing. Cited by: §2.
  • [41] J. Tardif (2009) Non-iterative approach for fast and accurate vanishing point detection. In ICCV, Cited by: §1, §2, §4.3.
  • [42] R. G. Von Gioi, J. Jakubowicz, J. Morel, and G. Randall (2008) LSD: a fast line segment detector with a false detection control. PAMI. Cited by: §2, §4.3.
  • [43] H. Wildenauer and A. Hanbury (2012) Robust camera self-calibration from monocular images of Manhattan worlds. In CVPR, Cited by: §2.
  • [44] S. Workman, M. Zhai, and N. Jacobs (2016) Horizon lines in the wild. In BMVC, Cited by: §4.4.
  • [45] M. Zhai, S. Workman, and N. Jacobs (2016) Detecting vanishing points using global image context in a non-manhattan world. In CVPR, Cited by: §1, §2, §4.1, §4.3, §4.4.
  • [46] W. Zhang and J. Kosecka (2002) Efficient detection of vanishing points. In ICRA, Cited by: §1, §2, §2.
  • [47] X. Zhang, X. Gao, W. Lu, L. He, and Q. Liu (2018) Dominant vanishing point detection in the wild with application in composition analysis. Neurocomputing. Cited by: §2.
  • [48] Y. Zhou, H. Qi, and Y. Ma (2019) End-to-end wireframe parsing. In ICCV, Cited by: §1.
  • [49] Y. Zhou, H. Qi, Y. Zhai, Q. Sun, Z. Chen, L. Wei, and Y. Ma (2019) Learning to reconstruct 3D Manhattan wireframes from a single image. In ICCV, Cited by: Figure 9, §A.2, §1, Figure 3, Figure 5, §4.1, §4.1, §4.2, §4.3, Table 3.
  • [50] Z. Zhou, F. Farhat, and J. Z. Wang (2017) Detecting dominant vanishing points in natural scenes with application to composition-sensitive image retrieval. IEEE Transactions on Multimedia. Cited by: Figure 10, §A.2, §2, Figure 6, §4.1, §4.1, §4.1, §4.4, §4.4, Table 3.

Appendix A Supplementary Materials

A.1 Consistency Measure on the Natural Scene Dataset

Figure 8: Consistency measure on the Natural Scene dataset.

For the fairness and completeness of our experiments, we show the consistency measure curves in Figure 8, generated by the code provided by the authors of the Natural Scene dataset. The blue curve is computed from the predictions of our SU3-pretrained conic convolutional network. The consistency measure uses an energy function similar to the one used by the baseline method, so it might favour the baseline slightly. Nevertheless, the result shows the same trend as Figure 6 of the paper.

A.2 Visualization

Figure 9 and Figure 10 show the visual quality of the ground-truth vanishing points and our predicted vanishing points on test images of the SU3 wireframe dataset [49] and the Natural Scene dataset [50]. We display the images and the three vanishing points of the SU3 wireframe dataset on Gaussian spheres, because most of them lie outside the images, and we display the single dominating vanishing point of the Natural Scene dataset directly on the image.

Figure 9: Visualization on the SU3 wireframe dataset [49]. The lines on the sphere show the ground-truth lines, and the colored dots show the predicted vanishing points.
Figure 10: Visualization on the Natural Scene dataset [50]. The red dots represent the ground-truth vanishing points and the blue dots represent our predicted vanishing points. The last row shows failure cases of our method. Images are cropped for typesetting purposes.