1 Introduction
Humans can naturally sense the geometric structures of a scene at a single glance, while developing such a system remains quite challenging in several intelligent applications such as robotics [Kanji2015] and automatic navigation [Nieuwenhuisen et al.2010]. In this work, we investigate a novel learning-based approach for geometric scene parsing, which is capable of simultaneously labeling geometric surfaces (e.g. sky, ground and vertical) and determining the interaction relations (e.g. layering, supporting, siding and affinity [Liu et al.2014]) between main regions, and further demonstrate its effectiveness in 3D reconstruction from a single scene image. An example generated by our approach is presented in Figure 1. In the literature of scene understanding, most efforts are dedicated to pixelwise semantic labeling / segmentation [Long et al.2015][Pinheiro and Collobert2015]. Although impressive progress has been made, especially by deep neural networks, these methods may have limitations in handling geometric scene parsing due to the following challenges.
The geometric regions in a scene often have diverse appearances and spatial configurations, e.g. the vertical plane may include trees and buildings of different looks. Labeling these regions generally requires fully exploiting image cues from different aspects ranging from local to global.

In addition to region labeling, discovering the interaction relations between the main regions is crucial for recovering the scene structure in depth. The main difficulties for the relation prediction lie in the ambiguity of multiscale region grouping and the fusion of hierarchical contextual information.
To address these issues, we develop a novel Hierarchical LSTM (HLSTM) recurrent network that simultaneously parses a still image into a series of geometric regions and predicts the interaction relations among these regions. The parsing results can be directly used to reconstruct the 3D structure from a single image. As shown in Figure 2, the proposed model integrates the Pixel LSTM (PLSTM) [Liang et al.2015] and Multiscale Superpixel LSTM (MSLSTM) subnetworks into a unified framework. First, the PLSTM subnetwork produces the geometric surface regions, where local contextual information from neighboring positions is imposed on each pixel to better exploit the spatial dependencies. Second, the MSLSTM subnetwork generates the interaction relations for all adjacent surface regions based on the multiscale superpixel representations. Benefiting from the diverse levels of information captured by the hierarchical representations (i.e. pixels and multiscale superpixels), the proposed HLSTM jointly optimizes the two tasks, with different levels of context captured for better reasoning in local areas. Because the two subnetworks share the basic convolutional layers, the parameters of PLSTM and MSLSTM are jointly updated during backpropagation. Therefore, the pixelwise geometric surface prediction and the superpixelwise relation categorization can mutually benefit from each other.
The proposed HLSTM is primarily inspired by the success of Long Short-Term Memory (LSTM) networks [Graves et al.2007] [Kalchbrenner et al.2015] [Liang et al.2015] in effectively incorporating long- and short-range dependencies from the whole image. Different from previous LSTM structures [Byeon et al.2014] [Byeon et al.2015] [Liang et al.2015] that simply operate on each pixel, our HLSTM exploits hierarchical information dependencies from different levels of units such as pixels and multiscale superpixels. The hidden cells are treated as the enhanced features, and the memory cells can recurrently remember all previous contextual interactions for different levels of representations from different layers.
Since geometric surface labeling needs fine prediction results while relation prediction cares more about the coarse semantic layout, we resort to the specialized PLSTM and MSLSTM to address these two tasks separately. In terms of geometric surface labeling, the PLSTM is used to incorporate the information from neighboring pixels to guide the local prediction of each pixel, where the local contextual information can be selectively remembered and then guide the feature extraction in later layers. In terms of interaction relation prediction, the MSLSTM reduces the information redundancy by operating on naturally smoothed regions, and different levels of information can be hierarchically used to extract interaction relations in different layers. In particular, in each MSLSTM layer, the superpixel map with a specific scale is used to extract the smoothed feature representation. Then, the features of adjacent superpixels are fed into the LSTM units to exploit the spatial dependencies. The superpixel map with a larger scale is used in the deeper layers to extract higher-level contextual dependencies. After passing through all of the hierarchical MSLSTM layers, the final interaction relation prediction is obtained by the final relation classifier based on the enhanced features produced by the hierarchical LSTM units.
This paper makes the following three contributions. (1) A novel recurrent neural network model is proposed for geometric scene parsing, which jointly optimizes geometric surface labeling and relation prediction. (2) Hierarchically modeling image contexts with LSTM units over superpixels is original to the literature, and can be extended to similar tasks such as human parsing. (3) Extensive experiments on three public benchmarks demonstrate the superiority of our HLSTM model over other state-of-the-art geometric surface labeling approaches. Moreover, we show promising 3D reconstruction results from still images based on the geometric parsing.
2 Related Work
Semantic Scene Labeling. Most existing works focus on the semantic region labeling problem [Krähenbühl and Koltun2011] [Socher et al.2011] [Long et al.2015], while the critical interaction relation prediction is often overlooked. Based on handcrafted features and models, CRF inference [Ladicky et al.2009] [Krähenbühl and Koltun2011] refines the labeling results by considering the label agreement between similar pixels. The fully convolutional network (FCN) [Long et al.2015] and its extension [Chen et al.2015] have achieved great success on semantic labeling. [Liu et al.2015] incorporates the Markov random field (MRF) into deep networks for pixel-level labeling. Most recently, the multidimensional LSTM [Byeon et al.2015] has also been employed to capture the local spatial dependencies. Our HLSTM differs from these works in that we train a unified network to collaboratively address geometric region labeling and relation prediction. The PLSTM and MSLSTM can effectively capture long-range spatial dependencies by benefiting from the hierarchical feature representation on pixels and multiscale superpixels.
Single View 3D Reconstruction. 3D reconstruction from a single-view image is an under-explored task, and only a few works have addressed it. Mobahi et al. [Mobahi et al.2011] reconstructed urban structures from a single view by transforming invariant low-rank textures. Without explicit assumptions about the structure of the scene, Saxena et al. [Saxena et al.2009] trained an MRF model to discover the depth cues as well as the relationships between different parts of the image in a fully supervised manner. An attribute grammar model [Liu et al.2014] regarded superpixels as its terminal nodes and applied five production rules to parse the scene into a hierarchical parse graph. Different from the previous methods, the proposed HLSTM predicts the layout segmentation and the spatial arrangement with a unified network architecture, and thus can reconstruct the 3D scene from a still image directly.
3 Hierarchical LSTM
Overview. Geometric scene parsing aims to generate the pixelwise geometric surface labeling and the relation prediction for each image. As illustrated in Figure 2, the input image is first passed through a stack of convolutional and pooling layers to generate a set of convolutional feature maps. Then the PLSTM and MSLSTM take these shared feature maps as inputs, and their outputs are the pixelwise geometric surface labeling and the interaction relations between adjacent regions, respectively.
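Notations. Each LSTM [Hochreiter and Schmidhuber1997] unit in the $i$-th layer receives the input $x_i$ from the previous state and determines the current state, which comprises the hidden cells $h_{i+1} \in \mathbb{R}^d$ and the memory cells $m_{i+1} \in \mathbb{R}^d$, where $d$ is the dimension of the network output. Similar to the work in [Graves et al.2013], we use $g^u$, $g^f$, $g^v$, $g^o$ to indicate the input, forget, memory and output gates respectively. Define $W^u$, $W^f$, $W^v$, $W^o$ as the corresponding recurrent gate weights. The hidden and memory cells for the current state can then be calculated by
$$
\begin{aligned}
g^u &= \phi(W^u H_i), \\
g^f &= \phi(W^f H_i), \\
g^o &= \phi(W^o H_i), \\
g^v &= \tanh(W^v H_i), \\
m_{i+1} &= g^f \odot m_i + g^u \odot g^v, \\
h_{i+1} &= \tanh(g^o \odot m_{i+1}),
\end{aligned}
\tag{1}
$$
where $H_i$ denotes the concatenation of the input $x_i$ and the previous state $h_i$, $\phi$ is the sigmoid function with the form $\phi(t) = 1/(1+e^{-t})$, and $\odot$ indicates the elementwise product. Following [Kalchbrenner et al.2015], we can simplify Eqn.(1) as
$$
(m_{i+1}, h_{i+1}) = \mathrm{LSTM}(H_i, m_i, W) \tag{2}
$$
where $W$ is the concatenation of the four different kinds of recurrent gate weights.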
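As a rough illustration of the gated update in Eqn.(1), the following dependency-free Python sketch computes one LSTM step from a concatenated input state. It assumes the Grid LSTM form of [Kalchbrenner et al.2015], in which the output gate is applied inside the tanh; the function names (`lstm_step`, `matvec`) and the toy sizes are hypothetical and not taken from the paper.

```python
import math


def sigmoid(t):
    """Logistic sigmoid used for the input, forget and output gates."""
    return 1.0 / (1.0 + math.exp(-t))


def matvec(weights, vector):
    """Plain matrix-vector product on nested lists."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]


def lstm_step(H, m, W_u, W_f, W_o, W_v):
    """One gated update.

    H   : concatenation of the input and the previous hidden state (length k)
    m   : previous memory cells (length d)
    W_* : gate weights, each a d x k nested list
    """
    g_u = [sigmoid(a) for a in matvec(W_u, H)]    # input gate
    g_f = [sigmoid(a) for a in matvec(W_f, H)]    # forget gate
    g_o = [sigmoid(a) for a in matvec(W_o, H)]    # output gate
    g_v = [math.tanh(a) for a in matvec(W_v, H)]  # memory (candidate) gate
    m_next = [f * old + u * v for f, old, u, v in zip(g_f, m, g_u, g_v)]
    # Assumed form: the output gate sits inside the tanh.
    h_next = [math.tanh(o * new) for o, new in zip(g_o, m_next)]
    return m_next, h_next


# Toy check with all-zero weights: every sigmoid gate equals 0.5 and the
# candidate is 0, so the memory is simply halved.
d, k = 2, 3
zeros = [[0.0] * k for _ in range(d)]
m_next, h_next = lstm_step([0.5, -1.0, 2.0], [1.0, 1.0], zeros, zeros, zeros, zeros)
```

Stacking the four weight matrices into a single matrix would give the concatenated weights of Eqn.(2); they are kept separate here only for readability.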
3.1 PLSTM for Geometric Surface Labeling
Following [Liang et al.2015], we use the PLSTM to propagate local information to each position and further discover the short-distance contextual interactions at the pixel level. For the feature representation of each position $j$, we extract $N$ spatial hidden cells from $N$ local neighbor pixels and one set of depth hidden cells from the previous layer. Note that the “depth” at a specific position indicates the features produced by the hidden cells at that position in the previous layer. Let $\{h^s_{j,i,n}\}_{n=1}^{N}$ indicate the set of hidden cells from neighboring positions to pixel $j$, which are calculated by the $N$ spatial LSTMs updated in the $i$-th layer, and let $h^t_{j,i}$ denote the hidden cells computed by the $i$-th layer depth LSTM on pixel $j$. Then the input states of pixel $j$ for the $(i+1)$-th layer LSTM can be expressed by
$$
H_{j,i} = [\, h^s_{j,i,1} \;\; h^s_{j,i,2} \;\; \dots \;\; h^s_{j,i,N} \;\; h^t_{j,i} \,]^T \tag{3}
$$
where $H_{j,i} \in \mathbb{R}^{(N+1) \times d}$. By the same token, let $\{m^s_{j,i,n}\}_{n=1}^{N}$ be the memory cells for all $N$ spatial dimensions of pixel $j$ in the $i$-th layer and $m^t_{j,i}$ be the memory cells for the depth dimension. Then the hidden cells and memory cells of each position $j$ in the $(i+1)$-th layer for all $N+1$ dimensions are calculated as
$$
\begin{aligned}
(\hat{m}^s_{j,i+1,n}, \hat{h}^s_{j,i+1,n}) &= \mathrm{LSTM}(H_{j,i}, m^s_{j,i,n}, W^s_i), \quad n \in \{1, \dots, N\}, \\
(m^t_{j,i+1}, h^t_{j,i+1}) &= \mathrm{LSTM}(H_{j,i}, m^t_{j,i}, W^t_i),
\end{aligned}
\tag{4}
$$
where $W^s_i$ and $W^t_i$ indicate the weights for the spatial and depth dimensions in the $i$-th layer, respectively. Note that $\hat{h}^s_{j,i+1,n}$ should be distinguished from $h^s_{j,i+1,n}$ by the direction of information propagation. $\hat{h}^s_{j,i+1,n}$ represents the hidden cells passed from position $j$ to its $n$-th neighbor, which are used to generate the input hidden cells of the $n$-th neighbor position for the next layer. In contrast, $h^s_{j,i+1,n}$ denotes the neighbor hidden cells fed into Eqn.(3) to calculate the input state of pixel $j$.
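The distinction between emitted and received hidden cells can be made concrete with a small sketch. It assembles the input state of Eqn.(3) for one pixel under several assumptions that the text does not spell out: an 8-neighborhood (suggested by the nine LSTMs per PLSTM layer mentioned in Sec. 5.3), zero vectors outside the image border, and the convention that what a pixel receives from its $n$-th neighbor is what that neighbor emitted in the opposite direction. The names `NEIGHBOR_OFFSETS` and `gather_input_state` are hypothetical.

```python
# Offsets (row, col) of the eight spatial neighbours. The list is ordered so
# that the opposite direction of index n is index 7 - n.
NEIGHBOR_OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
                    (0, 1), (1, -1), (1, 0), (1, 1)]


def gather_input_state(sent, depth, row, col):
    """Build the (N + 1) x d input state of one pixel.

    sent[r][c][n] : hidden vector that pixel (r, c) emits toward its n-th neighbour
    depth[r][c]   : depth hidden vector of pixel (r, c) from the previous layer
    """
    height, width = len(sent), len(sent[0])
    d = len(depth[row][col])
    state = []
    for n, (dr, dc) in enumerate(NEIGHBOR_OFFSETS):
        r, c = row + dr, col + dc
        if 0 <= r < height and 0 <= c < width:
            # The neighbour at offset n reaches this pixel through its opposite slot.
            state.append(sent[r][c][len(NEIGHBOR_OFFSETS) - 1 - n])
        else:
            state.append([0.0] * d)  # assumed zero padding at the image border
    state.append(depth[row][col])    # the single depth dimension comes last
    return state


# Toy 3 x 3 grid with d = 1; each emitted value encodes (row, col, direction).
height, width = 3, 3
sent = [[[[float(100 * r + 10 * c + n)] for n in range(8)]
         for c in range(width)] for r in range(height)]
depth = [[[-1.0] for _ in range(width)] for _ in range(height)]
center = gather_input_state(sent, depth, 1, 1)
corner = gather_input_state(sent, depth, 0, 0)
```

Each returned state has nine rows, one per spatial neighbor plus one for the depth dimension, which matches the $(N+1) \times d$ shape when $N = 8$.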
In particular, the PLSTM subnetwork is built upon a modified VGG16 model [Simonyan and Zisserman2015]. We remove the last two fully-connected layers in VGG16 and replace them with two fully-convolutional layers to obtain the convolutional feature maps for the input image. The convolutional feature maps are then fed into the transition layer [Liang et al.2015] to produce the hidden cells and memory cells of each position in advance, which ensures that the number of input states for the first PLSTM layer is equal to that of the following PLSTM layers. The hidden cells and memory cells are then passed through five stacked PLSTM layers. In this way, the receptive field of each position is considerably increased to sense a much larger contextual region. Note that the intermediate hidden cells generated by each PLSTM layer are also taken as the input to the corresponding Superpixel LSTM layer for relation prediction. More details are given in Sec. 3.2. Finally, several feedforward convolutional filters are applied to generate confidence maps for each geometric surface. The final label of each pixel is returned by a softmax classifier with the form
$$
y_j = \mathrm{softmax}(F(h_j; W_{label})) \tag{5}
$$
where $y_j$ is the predicted geometric surface probability of the $j$-th pixel, $W_{label}$ denotes the network parameters, and $F(\cdot)$ is a transformation function.
3.2 MSLSTM for Interaction Relation Prediction
The Multiscale Superpixel LSTM (MSLSTM) is used to explore the high-level interaction relations between pairwise superpixels and to predict the functional boundaries between geometric surfaces. The hidden cells of the $j$-th position in the $i$-th MSLSTM layer are the concatenation of the hidden cells from the previous layer (same as the depth dimension in PLSTM) and those from the corresponding PLSTM layer. For simplicity, we denote the enhanced hidden cells by $\bar{h}_{j,i}$. In each MSLSTM layer, an over-segmentation algorithm [Liu et al.2011b] is employed to produce the superpixel map with a specific scale $c_i$. To obtain a compact feature representation for each superpixel, we use Log-Sum-Exp (LSE) [Boyd and Vandenberghe2004], a convex approximation of the max function, to fuse the hidden cells of the pixels in the same superpixel,
$$
\bar{h}_{\nu,i} = \frac{1}{\pi} \log \Big[ \frac{1}{Q_\nu} \sum_{j \in \Lambda_\nu} \exp(\pi \bar{h}_{j,i}) \Big] \tag{6}
$$
where $\bar{h}_{\nu,i}$ denotes the hidden cells of the superpixel $\Lambda_\nu$ in the $i$-th superpixel layer, $\bar{h}_{j,i}$ denotes the enhanced hidden cells of the $j$-th position, $Q_\nu$ is the total number of pixels in $\Lambda_\nu$, and $\pi$ is a hyperparameter that controls smoothness. With a higher value of $\pi$, the function tends to preserve the max value for each dimension of the hidden cells, while with a lower value the function behaves like an averaging function.
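The effect of the smoothness hyperparameter is easy to check numerically. The sketch below assumes the usual normalized Log-Sum-Exp form, (1/π) · log((1/Q) · Σ exp(π · h)), applied independently to each dimension; the function name `lse_pool` and the toy values are hypothetical, and the peak is subtracted before exponentiation purely for numerical stability.

```python
import math


def lse_pool(pixel_hidden, pi):
    """Fuse the hidden vectors of all pixels in one superpixel.

    pixel_hidden : list of Q hidden vectors, each of length d
    pi           : positive smoothness hyperparameter
    """
    q = len(pixel_hidden)
    d = len(pixel_hidden[0])
    pooled = []
    for k in range(d):
        values = [h[k] for h in pixel_hidden]
        peak = max(values)
        # Subtracting the peak keeps exp() from overflowing for large pi.
        total = sum(math.exp(pi * (v - peak)) for v in values)
        pooled.append(peak + math.log(total / q) / pi)
    return pooled


# Two pixels, two dimensions: the first dimension holds 1 and 3.
pixels = [[1.0, 0.0], [3.0, 0.0]]
sharp = lse_pool(pixels, 100.0)   # close to the per-dimension max (3)
smooth = lse_pool(pixels, 0.01)   # close to the per-dimension mean (2)
```

With a large value of π the pooled feature approaches the maximum over the superpixel, and with a small value it approaches the average, which is the behavior described above.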
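Similar to Eqn.(3), let $\{\bar{h}_{\nu,i,k}\}_{k=1}^{K_\nu}$ indicate the set of hidden cells from the $K_\nu$ adjacent superpixels of $\Lambda_\nu$. Then the input states of superpixel $\Lambda_\nu$ for the $(i+1)$-th MSLSTM layer can be computed by
$$
H_{\nu,i} = \Big[\, \frac{1}{K_\nu} \sum_{k=1}^{K_\nu} \bar{h}_{\nu,i,k} \;\; \bar{h}_{\nu,i} \,\Big]^T \tag{7}
$$
The hidden cells and memory cells of superpixel $\Lambda_\nu$ in the $(i+1)$-th layer can be calculated by
$$
(m_{\nu,i+1}, h_{\nu,i+1}) = \mathrm{LSTM}(H_{\nu,i}, m_{\nu,i}, W'_i) \tag{8}
$$
where $W'_i$ denotes the concatenated gate weights of the $i$-th MSLSTM layer, and $m_{\nu,i}$ is the average value of the memory cells of each position in superpixel $\Lambda_\nu$. Note that the dimension of $h_{\nu,i+1}$ in Eqn.(8) is $d$, which is equal to that of the output hidden cells from the PLSTM. In the $(i+1)$-th layer, the values of $h_{\nu,i+1}$ and $m_{\nu,i+1}$ can be directly assigned to the hidden cells and memory cells of each position in superpixel $\Lambda_\nu$. The new hidden states can then be learned by applying the next MSLSTM layer on the superpixel map with a larger scale.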
In particular, the MSLSTM layers share the convolutional feature maps with the PLSTM. In total, five stacked MSLSTM layers are applied to extract hierarchical feature representations with different scales of contextual dependencies. Accordingly, five superpixel maps with different scales (i.e. 16, 32, 48, 64 and 128) are extracted by the over-segmentation algorithm [Liu et al.2011b]. Note that the scale here refers to the average number of pixels in each superpixel. These multiscale superpixel maps are employed by different MSLSTM layers, and the hidden cells for each layer are enhanced by the output of the corresponding PLSTM layer. After passing through these hierarchical MSLSTM layers, the local inference of each superpixel is influenced by different degrees of context, while the model still takes the local semantic information into account. Finally, the interaction relation prediction of adjacent superpixels is optimized as
$$
z_{\nu\nu'} = \mathrm{softmax}(F([\, h_\nu \;\; h_{\nu'} \,]; W_{relation})) \tag{9}
$$
where $z_{\nu\nu'}$ is the predicted relation probability vector between superpixels $\Lambda_\nu$ and $\Lambda_{\nu'}$, $W_{relation}$ denotes the network parameters, and $F(\cdot)$ is a transformation function.
3.3 Model Optimization
The total loss of HLSTM is the sum of the losses of the two tasks: the geometric surface labeling loss of PLSTM and the relation prediction loss of MSLSTM. We are given $T$ training images with $\{(I_1, Y_1, Z_1), \dots, (I_T, Y_T, Z_T)\}$, where $Y_i$ indicates the groundtruth geometric surfaces for all pixels of image $I_i$, and $Z_i$ denotes the groundtruth relation labels for all adjacent superpixel pairs at the different scales. The overall loss function is as follows,
$$
\mathcal{L}(W) = \frac{1}{T} \sum_{i=1}^{T} \big( \mathcal{J}_C(W_P; I_i, Y_i) + \mathcal{J}_R(W_S; I_i, Z_i) \big) \tag{10}
$$
where $W_P$ and $W_S$ indicate the parameters of PLSTM and MSLSTM, respectively, and $W$ denotes all of the parameters with the form $W = \{W_P, W_S, W_{CNN}\}$, in which $W_{CNN}$ denotes the parameters of the convolutional neural network. We apply the backpropagation algorithm to update all the parameters. $\mathcal{J}_C(\cdot)$ is the standard pixelwise cross-entropy loss. $\mathcal{J}_R(\cdot)$ is the cross-entropy loss for all superpixels under all scales. Each MSLSTM layer with a specific scale of the superpixel map can output an interaction relation prediction. Note that $\mathcal{J}_R(\cdot)$ is the sum of the losses after all MSLSTM layers.
4 Application to 3D Reconstruction
In this work, we apply our geometric scene parsing results to single-view 3D reconstruction. The predicted geometric surfaces and their relations are used to “cut and fold” the image into a pop-up model [Hoiem et al.2005]. This process contains two main steps: (1) restoring the 3D spatial structure based on the interaction relations between adjacent superpixels, and (2) constructing the positions of the specific planes using projective geometry and texture mapping from the labelled image onto the planes. In practice, we first find the ground-vertical boundary according to the predicted supporting relations and estimate the horizon position as the reference of the 3D structure. Then the algorithm uses the different kinds of predicted relations to generate the polylines and folds the space along these polylines. The algorithm also cuts the ground-sky and vertical-sky boundaries according to the layering relations. Finally, the geometric surface is projected onto the above 3D structures to reconstruct the 3D model.
5 Experiment
5.1 Experiment Settings
Datasets. We validate the effectiveness of the proposed HLSTM on three public datasets: the SIFTFlow dataset [Liu et al.2011a], the LM+SUN dataset [Tighe and Lazebnik2013] and the Geometric Context dataset [Hoiem et al.2007]. SIFTFlow consists of 2,488 training images and 200 testing images. LM+SUN contains 45,676 images (21,182 indoor images and 24,494 outdoor images), and is derived by mixing part of the SUN dataset [Xiao et al.2010] and the LabelMe dataset [Russell et al.2008]. Following [Tighe and Lazebnik2013], we use 45,176 images as training data and 500 images as test data. For these two datasets, three geometric surface classes (i.e. sky, ground and vertical) are considered for the evaluation. The Geometric Context dataset includes 300 outdoor images, where 50 images are used for training and the rest for testing, as in [Liu et al.2014]. In addition to the three main geometric surface classes used in the previous two datasets, the Geometric Context dataset also labels five subclasses of the vertical class: left, center, right, porous, and solid. For all three datasets, four interaction relation labels (i.e. layering, supporting, siding and affinity) are defined and evaluated in our experiments.
Evaluation Metrics. Following [Long et al.2015], we use the pixel accuracy and mean accuracy metrics as the standard evaluation criteria for geometric surface labeling. The pixel accuracy is the fraction of correctly classified pixels over the entire dataset, while the mean accuracy averages the per-class accuracies over all categories. To evaluate the performance of relation prediction, the average precision metric is adopted.
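The two labeling metrics can be computed from a confusion matrix. The sketch below is a minimal illustration, assuming class labels are integers in `[0, num_classes)` and that classes absent from the ground truth are skipped when averaging; the function names and the toy labels are hypothetical.

```python
def confusion_matrix(truth, pred, num_classes):
    """counts[t][p] = number of pixels with ground truth t predicted as p."""
    counts = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(truth, pred):
        counts[t][p] += 1
    return counts


def pixel_accuracy(counts):
    """Correctly labelled pixels divided by all pixels."""
    correct = sum(counts[c][c] for c in range(len(counts)))
    total = sum(sum(row) for row in counts)
    return correct / total


def mean_accuracy(counts):
    """Average of per-class accuracies over classes present in the ground truth."""
    per_class = [counts[c][c] / sum(counts[c])
                 for c in range(len(counts)) if sum(counts[c]) > 0]
    return sum(per_class) / len(per_class)


# Toy labels: 0 = sky, 1 = ground, 2 = vertical.
truth = [0, 0, 0, 0, 1, 1, 2, 2, 2, 2]
pred = [0, 0, 0, 1, 1, 1, 2, 2, 0, 0]
counts = confusion_matrix(truth, pred, 3)
pix_acc = pixel_accuracy(counts)    # 7 of 10 pixels are correct
mean_acc = mean_accuracy(counts)    # mean of 3/4, 2/2 and 2/4
```

In this toy case the pixel accuracy is 0.7 while the mean accuracy is 0.75, which shows how the two metrics diverge when classes have different error rates.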
Implementation Details. In our experiments, we keep the original size of the input image for the SIFTFlow dataset. The scale of input image is fixed as for LM+SUN and Geometric Context datasets. All the experiments are carried out on a PC with an NVIDIA Tesla K40 GPU, an Intel Core i7-3960X 3.30 GHz CPU and 12 GB of memory. During the training phase, the learning rates of transition layer, PLSTM layers and MSLSTM layers are initialized as and that of pretraining CNN model is initialized as . The dimension of the hidden cells and memory cells, which corresponds to the symbol in Sec. 3, is set to 64 in both PLSTM and MSLSTM.
5.2 Performance Comparisons
Geometric Surface Labeling. We compare the proposed HLSTM with three recent state-of-the-art approaches, including Superparsing [Tighe and Lazebnik2013], FCN [Long et al.2015] and DeepLab [Chen et al.2015], on the SIFTFlow and LM+SUN datasets. Figure 4 gives the comparison results on pixel accuracy. Table 1 and Table 2 show the per-class accuracy of our HLSTM and of the three state-of-the-art methods. It can be observed that the proposed HLSTM outperforms the three baselines in terms of both metrics. For the Geometric Context dataset, the model is fine-tuned from the model trained on LM+SUN due to the small size of the training data. We compare our results with those reported in [Hoiem et al.2008], [Tighe and Lazebnik2013] and [Liu et al.2014]. Table 3 reports the pixel accuracy on the three main classes and the five subclasses. Our HLSTM outperforms the best baseline by 2.8% and 3.8% when evaluating on the three main classes and the five subclasses, respectively. The performance achieved by HLSTM on the three public datasets demonstrates that incorporating the coupled PLSTM and MSLSTM in a unified network is effective in capturing the complex contextual patterns within images that are critical for exploiting the diverse surface structures.
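Table 1: Per-class and mean accuracy (%) of geometric surface labeling on the SIFTFlow dataset.
Method        Sky   Ground  Vertical  Mean Acc.
Superparsing  -     -       -         89.2
FCN           96.4  93.1    91.8      93.8
DeepLab       96.1  93.8    93.4      94.4
Ours          96.4  95.1    93.1      94.9

Table 2: Per-class and mean accuracy (%) of geometric surface labeling on the LM+SUN dataset.
Method        Sky   Ground  Vertical  Mean Acc.
Superparsing  -     -       -         86.8
FCN           81.8  83.5    94.1      86.4
DeepLab       76.2  72.8    94.6      81.2
Ours          83.9  83.6    94.1      87.2

Table 3: Pixel accuracy (%) on the Geometric Context dataset for the five subclasses and the three main classes.
Method        Subclasses  Main classes
Hoiem et al.  68.8        89.0
Superparsing  73.7        88.2
Liu et al.    76.3        -
Ours          80.1        91.8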
Interaction Relation Prediction. The MSLSTM subnetwork predicts the interaction relation for each pair of adjacent superpixels. Note that we use five MSLSTM layers, and five scales of superpixel maps are sequentially employed, including 128, 64, 48, 32, 16 superpixels in the five layers. The HLSTM outputs the interaction relation prediction results after each MSLSTM layer to enable deep supervision for better feature learning. Table 4 shows the average precision after passing through different numbers of MSLSTM layers. Improvements can be observed on most datasets when gradually using more MSLSTM layers, which verifies the effectiveness of learning more discriminative feature representations with the hierarchical multiscale superpixel LSTM. The hierarchical MSLSTM enables the model to capture the global geometric structure information by sensing increasingly larger contextual regions, while also keeping track of local fine details by remembering the local interactions of small superpixels.

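Table 4: Average precision (%) of interaction relation prediction after different numbers of MSLSTM layers.
Setting  SIFTFlow  LM+SUN  Geometric Context
HLSTM_1  85.8      89.1    87.8
HLSTM_2  89.8      94.7    90.6
HLSTM_3  90.3      95.6    89.8
HLSTM_4  90.4      96.7    90.7
HLSTM    91.2      95.8    90.8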
5.3 Ablative Study
We further evaluate different architecture variants to verify the effectiveness of the important components in our HLSTM, presented in Table 5.
Comparison with convolutional layers. To strictly evaluate the effectiveness of the proposed PLSTM layer, we report the performance of purely using convolutional layers, i.e. “convolution”. To make a fair comparison with the PLSTM layer, we utilize five convolutional layers, each of which contains convolutional filters with size , because nine LSTMs are used in a PLSTM layer and each of them has 64 hidden cell outputs. Compared with “HLSTM (ours)”, “convolution” decreases the pixel accuracy (94.66 vs. 95.41 on SIFTFlow and 89.92 vs. 91.34 on LM+SUN). This demonstrates the advantage of using PLSTM layers over convolutional layers to harness complex long-distance dependencies.
Multitask learning. Note that we jointly optimize the geometric surface labeling and relation prediction tasks within a unified network. We demonstrate the effectiveness of multitask learning by comparing our HLSTM with the version that only predicts the geometric surface labeling, i.e. “PLSTM”. The supervision information for interaction relations and the MSLSTM subnetwork are discarded in “PLSTM”. The resulting performance decrease (from 95.41 to 94.68 on SIFTFlow and from 91.34 to 90.13 on LM+SUN) indicates that the two tasks mutually benefit from each other and help learn more meaningful and discriminative features.
Comparison with single scale of superpixel map. We also validate the advantage of using the multiscale superpixel representation in the MSLSTM subnetwork for interaction relation prediction. “SLSTM” shows the results of using the same scale of superpixels (i.e. 48 superpixels) in each SLSTM layer. The improvement of “HLSTM” over “PLSTM+SLSTM” (95.41 vs. 95.24 on SIFTFlow and 91.34 vs. 91.06 on LM+SUN) demonstrates that richer contextual dependencies can be captured by hierarchical multiscale feature learning.
Table 5: Pixel accuracy (%) of different architecture variants.
Model settings  SIFTFlow  LM+SUN
Convolution     94.66     89.92
PLSTM           94.68     90.13
PLSTM + SLSTM   95.24     91.06
HLSTM (ours)    95.41     91.34
5.4 Application to 3D Reconstruction
Our main geometric class labels and interaction relation predictions over regions are sufficient to reconstruct scaled 3D models of many scenes. Figure 5 shows some scene images and the reconstructed 3D scenes generated based on our geometric parsing results. Besides the obvious graphics applications, e.g. creating virtual walkthroughs, we believe that such models could provide valuable extra information to other artificial intelligence applications.
6 Conclusion
In this paper, we have proposed a multiscale and context-aware scene parsing model based on recurrent Long Short-Term Memory neural networks. Our approach has demonstrated a new state of the art on the problem of geometric scene parsing, as well as promising results on 3D reconstruction from still images.
References
[Boyd and Vandenberghe2004] Stephen Boyd and Lieven Vandenberghe. Convex optimization. Cambridge University Press, 2004.
[Byeon et al.2014] Wonmin Byeon, Marcus Liwicki, and Thomas M. Breuel. Texture classification using 2D LSTM networks. In ICPR, 2014.
[Byeon et al.2015] Wonmin Byeon, Thomas M. Breuel, Federico Raue, and Marcus Liwicki. Scene labeling with LSTM recurrent neural networks. In CVPR, 2015.
[Chen et al.2015] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L. Yuille. Semantic image segmentation with deep convolutional nets and fully connected CRFs. In ICLR, 2015.
[Graves et al.2007] A. Graves, S. Fernandez, and J. Schmidhuber. Multidimensional recurrent neural networks. In ICANN, 2007.
[Graves et al.2013] A. Graves, A. Mohamed, and G. Hinton. Speech recognition with deep recurrent neural networks. In ICASSP, 2013.
[Hochreiter and Schmidhuber1997] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
[Hoiem et al.2005] Derek Hoiem, Alexei A. Efros, and Martial Hebert. Automatic photo pop-up. ACM Trans. Graph., 24(3):577–584, 2005.
[Hoiem et al.2007] Derek Hoiem, Alexei A. Efros, and Martial Hebert. Recovering surface layout from an image. International Journal of Computer Vision, 75(1):151–172, 2007.
[Hoiem et al.2008] Derek Hoiem, Alexei Efros, Martial Hebert, et al. Closing the loop in scene interpretation. In CVPR, pages 1–8, 2008.
[Kalchbrenner et al.2015] Nal Kalchbrenner, Ivo Danihelka, and Alex Graves. Grid long short-term memory. arXiv preprint arXiv:1507.01526, 2015.
[Kanji2015] Tanaka Kanji. Unsupervised part-based scene modeling for visual robot localization. In ICRA, 2015.
[Krähenbühl and Koltun2011] Philipp Krähenbühl and Vladlen Koltun. Efficient inference in fully connected CRFs with Gaussian edge potentials. In NIPS, 2011.
[Ladicky et al.2009] Lubor Ladicky, Christopher Russell, Pushmeet Kohli, and Philip H. S. Torr. Associative hierarchical CRFs for object class image segmentation. In ICCV, 2009.
[Liang et al.2015] Xiaodan Liang, Xiaohui Shen, Donglai Xiang, Jiashi Feng, Liang Lin, and Shuicheng Yan. Semantic object parsing with local-global long short-term memory. arXiv preprint arXiv:1511.04510, 2015.
[Liu et al.2011a] Ce Liu, Jenny Yuen, and Antonio Torralba. Nonparametric scene parsing via label transfer. IEEE Trans. Pattern Anal. Mach. Intell., 33(12):2368–2382, 2011.
[Liu et al.2011b] Ming-Yu Liu, Oncel Tuzel, Srikumar Ramalingam, and Rama Chellappa. Entropy rate superpixel segmentation. In CVPR, 2011.
[Liu et al.2014] Xiaobai Liu, Yibiao Zhao, and Song-Chun Zhu. Single-view 3D scene parsing by attributed grammar. In CVPR, 2014.
[Liu et al.2015] Ziwei Liu, Xiaoxiao Li, Ping Luo, Chen Change Loy, and Xiaoou Tang. Semantic image segmentation via deep parsing network. In ICCV, 2015.
[Long et al.2015] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.
[Mobahi et al.2011] Hossein Mobahi, Zihan Zhou, Allen Y. Yang, and Yi Ma. Holistic 3D reconstruction of urban structures from low-rank textures. In ICCV Workshops, 2011.
[Nieuwenhuisen et al.2010] Matthias Nieuwenhuisen, Jörg Stückler, and Sven Behnke. Improving indoor navigation of autonomous robots by an explicit representation of doors. In ICRA, 2010.
[Pinheiro and Collobert2015] Pedro H. O. Pinheiro and Ronan Collobert. From image-level to pixel-level labeling with convolutional networks. In CVPR, 2015.
[Russell et al.2008] Bryan C. Russell, Antonio Torralba, Kevin P. Murphy, and William T. Freeman. LabelMe: A database and web-based tool for image annotation. International Journal of Computer Vision, 77(1-3):157–173, 2008.
[Saxena et al.2009] Ashutosh Saxena, Min Sun, and Andrew Y. Ng. Make3D: Learning 3D scene structure from a single still image. IEEE Trans. Pattern Anal. Mach. Intell., 31(5):824–840, 2009.
[Simonyan and Zisserman2015] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
[Socher et al.2011] Richard Socher, Cliff Chiung-Yu Lin, Andrew Y. Ng, and Christopher D. Manning. Parsing natural scenes and natural language with recursive neural networks. In ICML, 2011.
[Tighe and Lazebnik2013] Joseph Tighe and Svetlana Lazebnik. Superparsing: scalable nonparametric image parsing with superpixels. International Journal of Computer Vision, 101(2):329–349, 2013.
[Xiao et al.2010] Jianxiong Xiao, James Hays, Krista A. Ehinger, Aude Oliva, and Antonio Torralba. SUN database: Large-scale scene recognition from abbey to zoo. In CVPR, 2010.