1 Introduction
Image segmentation is a fundamental problem in computer vision. Most standard image segmentation techniques rely on exploiting differences between pixel regions such as color and texture. Hence, segmenting subparts of an object which have similar characteristics can be a daunting task. We propose a method that performs such subsegmentation and does not require user interaction or prior training. A result from our method is shown in Figure 1 with the car subsegmented into a collection of parts. This includes the hood of the car, windshield, fender, front and back doors/windows.
Many industry applications require an image of a known object to be subsegmented and separated into its parts. Examples include identification of individual parts of a car given a photograph for automatic damage identification or the identification of subparts of a component in a manufacturing plant for process control work. Subsegmenting parts of an object which share the same color and texture is very hard, if not impossible, with conventional segmentation methods. However, prior knowledge of the shape of the known object and its components can be exploited to make this task easier. Based on this rationale we propose a novel Model Assisted Segmentation method for image segmentation.
We propose to register a 3D model of the known object over a given photograph/image in order to initialise the segmentation process. The segmentation is performed over each part of the object in order to obtain subsegments from the image. A major contribution of this work is a novel gradient based loss function, which is used to estimate the full 3D pose of the object in the given image. The projected parts of the 3D model may not perfectly match the corresponding parts in the photo due to dents in a damaged vehicle or inaccuracies in the 3D model. Therefore, a level set [11] based segmentation method is initialised using initial contour information obtained by projecting parts of the 3D model at this 3D pose. We focus our work on subsegmentation of known car images. Cars pose a difficult segmentation task due to highly reflective surfaces in the car body. The method can be adapted to work for any object.
The remainder of this paper is organised as follows. Previous work related to our paper is described in Section 2. We describe the method used to estimate the 3D pose of the object in Section 3. The contour based image segmentation approach is described next in Section 4. This is followed by results on real photos which are benchmarked against state of the art methods in Section 5.
2 Related Work
Model based object recognition has received considerable attention in computer vision. A survey by Chin and Dyer [5] shows that model based object recognition algorithms generally fall into three categories, based on the type of object representation used, namely 2D representations, 2.5D representations and 3D representations.
2D representations [18, 28] aim to identify the presence and orientation of a specific face of 3D objects, for example parts on a conveyor belt. These approaches require prior training to determine which face to match to, and are unable to generalise to other faces of the same object.
2.5D approaches [19, 8, 7] are also viewer centred, where the object is known to occur in a particular view. They differ from the 2D approach as the model stores additional information such as intrinsic image parameters and surface-orientation maps.
3D approaches are utilised in situations where the object of interest can appear in a scene from multiple viewing angles. Common 3D representation approaches can be either an ‘exact representation’ or a ‘multiview feature representation’. The latter method uses a composite model consisting of 2D/2.5D models for a limited set of views. Multiview feature representation is used along with the concept of generalised cylinders by Brooks and Binford [3] to detect different types of industrial motors in the so-called ACRONYM system. The models used in the exact representation method, on the contrary, contain an exact representation of the complete 3D object. Hence a 2D projection of the object can be created for any desired view. Unfortunately, this method is often considered too costly in terms of processing time.
The 2D and 2.5D representations are insufficient for general purpose applications. For example, a vehicle may be photographed from an arbitrary view in order to indicate the damaged parts. Similarly, the 3D multiview feature representation is also not suitable, as we are not able to limit the pose of the vehicle to a small finite set of views. Therefore, pose identification has to be done using an exact 3D model. Little work has been done to date on identifying the pose of an exact 3D model from a single 2D image.
Image gradients. Gray scale image gradients have been used to estimate the 3D pose in traffic video footage from a stationary camera by Kollnig and Nagel [10]. The method compares image gradients instead of simple edge segments, for better performance. Image gradients from projected polyhedral models are compared against image gradients in video images. The pose is formulated using three degrees of freedom; two for position and one for angular orientation. Tan and Baker [27] use image gradients and a Hough transform based algorithm for estimating vehicle pose in traffic scenes, once more describing the pose via three degrees of freedom. Pose estimation using three degrees of freedom is adequate for traffic image sequences, where the camera position remains fixed with respect to the ground plane. Unlike our method, this approach does not recover the full 3D pose.
Feature based methods. Feature based methods [6, 15] attempt to simultaneously solve the pose and point correspondence problems. The success of these methods is affected by the quality of the features extracted from the object, which is non-trivial with objects like cars. Features depend on the object geometry and can cause problems when recovering a full 3D pose. Different image modalities also cause problems with feature based methods. For example, reflections, which may appear as image features, do not occur in the 3D model projection. Our method, on the contrary, does not depend on feature extraction.
Segmentation. The use of shape priors for segmentation and pose estimation has been investigated in [22, 21, 23, 25]. These methods focus on segmenting foreground from background using 3D freeform contours. Our method, on the contrary, does intra-object segmentation (into subsegments) by initialising the segmentation using projections of 3D CAD model parts at an estimated pose. In addition, our method works on more complex objects like real cars.
3 3D Model Registration
We describe the use of a featureless gradient based loss function which is used to register the 3D model over the 2D photo. Our method works on triangulated 3D CAD models with a large number of polygons (including 3D models obtained from laser scans) and utilises image gradients of the 3D model surface normals rather than considering simple edge segments.
Gradient based loss function. We define a gradient based loss function that has a minimum at the correct 3D pose $\theta_0$, where the projected 3D model matches the object in the given photo/image. The image gradients of the 3D model surface normal components and the image gradients of the 2D photo are used to define a loss function at a given pose $\theta$.
We use $(u,v)$ to denote 2D pixel coordinates in the photo/image and $(x,y,z)$ to denote 3D coordinates of the 3D model. Let $I$ be a $W \times H \times C$ dimensional matrix (for example $C=3$ if $I$ is an RGB image) with elements $I(u,v,c)$. We define the $k$-norm ‘gradient magnitude’ matrix of $I$ as
$$\|\nabla I\|_k(u,v) := \left( \sum_{c=1}^{C} \left[ \left|\frac{\partial I(u,v,c)}{\partial u}\right|^k + \left|\frac{\partial I(u,v,c)}{\partial v}\right|^k \right] \right)^{1/k} \qquad (1)$$
Based on this we have the gradient magnitude matrix $G_I$ for a 2D photo/image $I$ as
$$G_I := \|\nabla I\|_k \qquad (2)$$
Let $\mathbf{n}_\theta(x,y,z) = (n_x, n_y, n_z)$ be the unit surface normal at the 3D point $(x,y,z)$ for the 3D model at pose $\theta$. The model is rendered with the surface normal component values $n_x$, $n_y$ and $n_z$ used as RGB colour values in the OpenGL renderer to obtain the projected surface normal component matrix $N_\theta$, such that $N_\theta(u,v)$ has the surface normal component values at the 2D point $(u,v)$ in the projected image. Based on this we have the gradient magnitude matrix $G_N(\theta)$ for the surface normal components as
$$G_N(\theta) := \|\nabla N_\theta\|_k \qquad (3)$$
The loss function $L_g(\theta)$ for a given pose $\theta$ is defined as
$$L_g(\theta) := 1 - \big(\mathrm{corr}(G_N(\theta), G_I)\big)^2 \qquad (4)$$
where $\mathrm{corr}(G_N(\theta), G_I)$ is Pearson’s product-moment correlation coefficient [20] between the matrix elements of $G_N(\theta)$ and $G_I$. This loss has the convenient property of ranging between $0$ and $1$. Lower loss values imply a better 3D pose.
Visualisation. We illustrate intermediate steps of the loss calculation for a 3D model of a Mazda 3 car. The surface normal components $n_x$, $n_y$ and $n_z$ are shown in Figure 2(a-c). Their image gradients are shown in Figure 2(d-i) and the resulting $G_N(\theta)$ matrix image is shown in Figure 2(j). Similarly, intermediate steps in the calculation of $G_I$ are shown in Figure 3 for a real photo and a synthetic photo. We show overlaid images of $G_N(\theta)$ and $G_I$ at the known matching pose $\theta_0$ in Figure 4. We show how the overlap changes by applying different levels of Gaussian smoothing (described below) in Figure 4 for the real and synthetic photo. The synthetic photos were made by projecting the 3D model at a known pose $\theta_0$.
Figure 2: The x, y and z component matrices of the surface normal vector are shown in (a)-(c). Their image gradients are shown in (d)-(i). The resulting gradient magnitude matrix is shown in (j). No Gaussian smoothing has been applied. Colour representation: green=positive, black=zero and red=negative. We use a horizontal x axis pointing left to right, a vertical y axis pointing top to bottom and a z axis which points out of the page.
The correlation will be highest in Equation 4 when the 3D model is projected with pose parameters $\theta$ that match the object in the photo $I$, as this has the best overlap. Therefore the loss will be lowest at the correct pose parameters $\theta_0$, for values of $\theta$ reasonably close to $\theta_0$. We see this in the loss landscapes in Figure 6.
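The behaviour of the loss in Equation 4 can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes forward differences for the image gradients, images stored as nested H x W x C lists, and a loss of the form one minus the squared Pearson correlation. The names `gradient_magnitude`, `pearson`, `gradient_loss` and the toy helper `square_image` are hypothetical.

```python
import math

def gradient_magnitude(img, k=1):
    """k-norm gradient magnitude of an H x W x C image (nested lists).

    Forward differences are used; at the last row/column the difference is zero.
    """
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for v in range(h):
        for u in range(w):
            total = 0.0
            for c in range(len(img[v][u])):
                du = img[v][min(u + 1, w - 1)][c] - img[v][u][c]
                dv = img[min(v + 1, h - 1)][u][c] - img[v][u][c]
                total += abs(du) ** k + abs(dv) ** k
            out[v][u] = total ** (1.0 / k)
    return out

def pearson(a, b):
    """Pearson correlation between the elements of two equally sized matrices."""
    xs = [x for row in a for x in row]
    ys = [y for row in b for y in row]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0.0 or syy == 0.0:
        return 0.0  # assumption: a constant matrix is treated as uncorrelated
    return sxy / math.sqrt(sxx * syy)

def gradient_loss(normals_img, photo, k=1):
    """Assumed loss form: 1 - corr^2 between the two gradient magnitude matrices."""
    r = pearson(gradient_magnitude(normals_img, k), gradient_magnitude(photo, k))
    return 1.0 - r * r

def square_image(size, left, top, side, colour):
    """Toy image: a filled square of the given colour on a black background."""
    black = [0.0] * len(colour)
    return [[list(colour) if left <= u < left + side and top <= v < top + side
             else list(black) for u in range(size)] for v in range(size)]

photo = square_image(8, 2, 2, 4, (0.8,))              # single-channel "photo"
aligned = square_image(8, 2, 2, 4, (0.2, 0.5, 0.9))   # "normals" render, same pose
shifted = square_image(8, 3, 2, 4, (0.2, 0.5, 0.9))   # render shifted by one pixel
loss_aligned = gradient_loss(aligned, photo)
loss_shifted = gradient_loss(shifted, photo)
```

In this toy setting the aligned render gives a loss of essentially zero even though its colours differ from the photo, while the shifted render gives a clearly larger loss, which is the property the pose optimisation relies on.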
Gaussian smoothing. We do Gaussian smoothing on the photo and rendered surface normal component images before calculating the gradient magnitude matrices in Equation 2 and Equation 3. This is done by convolving with a 2D Gaussian kernel followed by downsampling [7]. This makes the loss function landscape less steep and noisy, thus making it easier to optimise. However, the global optimum tends to deviate slightly from the correct pose at high levels of Gaussian smoothing. Compare the 1D loss landscapes shown in Figure 6 for different levels of Gaussian smoothing $s$. Therefore, we do a series of optimisations starting from the highest level of smoothing, using the optimum found at level $s$ as the initialisation for level $s-1$, recursively.
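One level of smoothing followed by downsampling can be sketched as a step of a Gaussian pyramid. This is a minimal sketch under stated assumptions: a separable [1, 2, 1]/4 binomial kernel stands in for the Gaussian, borders are handled by edge replication, and the image is downsampled by a factor of two. The kernel actually used is not stated in the text, and the function names are hypothetical.

```python
def smooth_1d(values):
    """Apply a [1, 2, 1] / 4 binomial filter with edge replication."""
    n = len(values)
    return [(values[max(i - 1, 0)] + 2.0 * values[i] + values[min(i + 1, n - 1)]) / 4.0
            for i in range(n)]

def pyramid_down(img):
    """Smooth a 2D grayscale image (nested lists), then keep every second pixel."""
    rows = [smooth_1d(row) for row in img]               # smooth along u
    cols = [smooth_1d(list(col)) for col in zip(*rows)]  # smooth along v
    smoothed = [list(row) for row in zip(*cols)]         # back to row-major layout
    return [row[::2] for row in smoothed[::2]]

def gaussian_pyramid(img, levels):
    """Return [level 0 (original), level 1, ..., level `levels`]."""
    pyramid = [img]
    for _ in range(levels):
        pyramid.append(pyramid_down(pyramid[-1]))
    return pyramid

flat = [[5.0] * 8 for _ in range(8)]
flat_pyramid = gaussian_pyramid(flat, 2)

impulse = [[0.0] * 8 for _ in range(8)]
impulse[4][4] = 1.0
impulse_pyramid = gaussian_pyramid(impulse, 1)
```

A flat image stays flat at every level, while a single bright pixel is spread out and attenuated, which is what flattens the loss landscape at higher smoothing levels.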
Choosing the norm . We have a choice when selecting the norm for Equations 2 and 3. Having tested both norm and norm cases we have found the norm to be less noisy (as shown in Figure 6) and hence easier to optimise.
Initialisation. We use a rough pose estimate to seed the optimisation. An object specific method can be used to obtain the rough pose. Possible methods for obtaining a coarse initial pose include the work done by [17], [26] and [1]. We have used the wheel match method developed by Hutter and Brewer [9] to obtain an initial pose for vehicle photos where the wheels are visible. The wheels need not be visible with the other methods mentioned above.
We use the following to represent the rough pose $\theta'$ of cars as prescribed in [9], which neglects the effects of perspective projection.
$$\theta' := (\mu_x, \mu_y, \delta_x, \delta_y, \psi_x, \psi_y) \qquad (5)$$
$\mu = (\mu_x, \mu_y)$ is the visible rear wheel centre of the car in the 2D image. $\delta = (\delta_x, \delta_y)$ is the vector between corresponding rear and front wheel centres of the car in the 2D image. The 2D image is a projection of the 3D model on to the XY plane. $\psi = (\psi_x, \psi_y, \psi_z)$ is a unit vector in the direction of the rear wheel axle of the 3D car model. Therefore, $\psi_z$ is determined by $\psi_x$ and $\psi_y$ through $\psi_x^2 + \psi_y^2 + \psi_z^2 = 1$ and need not be explicitly included in the pose representation $\theta'$. This representation is illustrated in Figure 5.
We include an additional perspective parameter $f$ (the distance to the camera from the projection plane in the OpenGL 3D frustum) when optimising the loss function to obtain the fine 3D pose. Hence we define the full 3D pose $\theta$ as follows.
$$\theta := (\mu_x, \mu_y, \delta_x, \delta_y, \psi_x, \psi_y, f) \qquad (6)$$
$\theta'$ is converted to translation, scale and rotation as per [9] to transform the 3D model and, along with $f$, is used to render the 3D model with perspective projection in OpenGL using pose $\theta$. Thereby, we estimate the full 3D pose by minimising Equation 4 with respect to $\theta$. Intrinsic camera parameters need not be known explicitly. Note that any other choice of pose parameters would do. We use the above as it is convenient with cars.
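The seven-parameter pose can be held in a small data structure, sketched below. The class and field names are hypothetical and chosen only for illustration: a rear wheel centre, a rear-to-front wheel vector, two components of the unit axle direction, and the perspective parameter. The sign chosen for the third axle component is an assumption of this sketch, since the unit-length constraint alone fixes it only up to sign.

```python
import math
from dataclasses import dataclass

@dataclass
class CarPose:
    """Seven-parameter car pose; all field names are illustrative."""
    mu_x: float      # visible rear wheel centre in the image
    mu_y: float
    delta_x: float   # vector from rear to front wheel centre in the image
    delta_y: float
    psi_x: float     # x and y components of the unit rear-axle direction
    psi_y: float
    f: float         # perspective parameter (camera to projection plane distance)

    def to_vector(self):
        """Flatten to the parameter vector handed to an optimiser."""
        return [self.mu_x, self.mu_y, self.delta_x, self.delta_y,
                self.psi_x, self.psi_y, self.f]

    @classmethod
    def from_vector(cls, values):
        """Rebuild a pose from a seven-element parameter vector."""
        return cls(*values)

    def axle_direction(self, sign=1.0):
        """Recover the full unit axle vector from its two stored components.

        The sign of the third component is a convention assumed here.
        """
        rest = 1.0 - self.psi_x ** 2 - self.psi_y ** 2
        if rest < 0.0:
            raise ValueError("psi_x and psi_y do not describe a unit vector")
        return (self.psi_x, self.psi_y, sign * math.sqrt(rest))

pose = CarPose(120.0, 310.0, 300.0, 40.0, 0.6, 0.0, 1500.0)
axle = pose.axle_direction()
round_trip = CarPose.from_vector(pose.to_vector())
```

Keeping the third axle component implicit means the optimiser searches over seven numbers without having to enforce the unit-length constraint separately.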
Background removal. As background clutter in the photo adds considerable noise to the loss function landscape, we use an adaptation of the Grabcut [24] method to remove a considerable amount of the background pixels from the photo. Although this does not result in a perfect removal of the background, it significantly improves the pose estimation results. The initial rough pose estimate is used as a prior to generate the background and foreground Grabcut masks (we use the cv::grabCut() method provided in OpenCV [2] version 2.1). Figure 7(b) shows results of the background removal.
Optimisation. We use the downhill simplex optimiser [16] to find the pose parameters which give the lowest loss value for Equation 4. This optimiser is very robust and is capable of moving out of local optima by reinitialising the simplex. Downhill simplex does not require gradient calculations. Gradient based optimisers would be problematic given the loss landscapes in Figure 6. We use the fine pose obtained thus to register the 3D model on the 2D photo. This is used to initialise contour detection based image segmentation.
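The coarse-to-fine use of a derivative-free simplex search can be sketched as follows. This is a simplified downhill simplex written for illustration (reflection, expansion, inside contraction and shrink only), with a fresh simplex started at each smoothing level; it is not the authors' optimiser. The function `toy_loss` is a hypothetical stand-in for Equation 4 evaluated at smoothing level `s`, built so that its optimum drifts away from the true pose as `s` grows.

```python
def nelder_mead(f, x0, step=0.5, iters=200):
    """Simplified downhill simplex minimiser for a function of a list of floats."""
    n = len(x0)
    # Initial simplex: the start point plus one point offset along each axis.
    simplex = [list(x0)] + [[x + (step if i == j else 0.0) for j, x in enumerate(x0)]
                            for i in range(n)]
    for _ in range(iters):
        simplex.sort(key=f)
        best, worst = simplex[0], simplex[-1]
        centroid = [sum(p[j] for p in simplex[:-1]) / n for j in range(n)]
        reflected = [c + (c - w) for c, w in zip(centroid, worst)]
        if f(reflected) < f(best):
            expanded = [c + 2.0 * (c - w) for c, w in zip(centroid, worst)]
            simplex[-1] = expanded if f(expanded) < f(reflected) else reflected
        elif f(reflected) < f(simplex[-2]):
            simplex[-1] = reflected
        else:
            contracted = [c + 0.5 * (w - c) for c, w in zip(centroid, worst)]
            if f(contracted) < f(worst):
                simplex[-1] = contracted
            else:
                # Shrink every vertex halfway towards the best one.
                simplex = [best] + [[b + 0.5 * (x - b) for b, x in zip(best, p)]
                                    for p in simplex[1:]]
    return min(simplex, key=f)

def coarse_to_fine(loss_at_level, theta0, levels):
    """Optimise from the highest smoothing level down to level 0."""
    theta = list(theta0)
    for s in range(levels, -1, -1):
        # The optimum found at level s seeds a new simplex at level s - 1.
        theta = nelder_mead(lambda t: loss_at_level(t, s), theta)
    return theta

def toy_loss(theta, s):
    """Hypothetical loss whose optimum is displaced by 0.1 per smoothing level."""
    return (theta[0] - (1.0 + 0.1 * s)) ** 2 + (theta[1] + 2.0) ** 2

theta_hat = coarse_to_fine(toy_loss, [0.0, 0.0], levels=2)
```

Restarting the simplex at each level mirrors the idea of reinitialising it to escape poor optima, while the final pass at level 0 removes the bias introduced by smoothing.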
4 Contour Detection
In this section, we discuss the procedure of contour detection used to segment the known object in the image. We use a variation of the level set method which does not require reinitialisation [11] to find boundaries of relevant object parts.
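Most active contour models implement an edge function to find boundaries. The edge function is a gradient dependent, positive, decreasing function. A common formulation is as follows
$$g(I) = \frac{1}{1 + |\nabla (G_\sigma * I)|^2} \qquad (7)$$
where $G_\sigma * I$ denotes a smoother version of the 2D image $I$, $G_\sigma$ is an isotropic Gaussian kernel with standard deviation $\sigma$, and $*$ is the convolution operator. Therefore $g(I)$ will be $0$ as $|\nabla (G_\sigma * I)|$ approaches infinity, i.e.
$$\lim_{|\nabla (G_\sigma * I)| \to \infty} g(I) = 0 \qquad (8)$$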
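The edge function of Equation 7 can be illustrated with a short Python sketch. It assumes the common squared-gradient form, a Gaussian kernel truncated at three standard deviations, edge replication at the image border and central differences for the gradient; all function names are hypothetical.

```python
import math

def gaussian_kernel(sigma):
    """Normalised 1D Gaussian kernel truncated at 3 sigma."""
    radius = max(1, int(math.ceil(3 * sigma)))
    weights = [math.exp(-(i * i) / (2.0 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def convolve_rows(img, kernel):
    """Convolve every row with the kernel, replicating edge pixels."""
    r = len(kernel) // 2
    w = len(img[0])
    return [[sum(kernel[j + r] * row[min(max(u + j, 0), w - 1)]
                 for j in range(-r, r + 1)) for u in range(w)] for row in img]

def transpose(img):
    return [list(col) for col in zip(*img)]

def smooth(img, sigma):
    """Separable isotropic Gaussian smoothing of a 2D grayscale image."""
    kernel = gaussian_kernel(sigma)
    return transpose(convolve_rows(transpose(convolve_rows(img, kernel)), kernel))

def edge_function(img, sigma=1.0):
    """g = 1 / (1 + |grad(G_sigma * I)|^2), evaluated per pixel."""
    s = smooth(img, sigma)
    h, w = len(s), len(s[0])
    g = [[0.0] * w for _ in range(h)]
    for v in range(h):
        for u in range(w):
            du = (s[v][min(u + 1, w - 1)] - s[v][max(u - 1, 0)]) / 2.0
            dv = (s[min(v + 1, h - 1)][u] - s[max(v - 1, 0)][u]) / 2.0
            g[v][u] = 1.0 / (1.0 + du * du + dv * dv)
    return g

flat_image = [[3.0] * 12 for _ in range(12)]
step_image = [[0.0] * 6 + [10.0] * 6 for _ in range(12)]  # vertical edge at u = 6
g_flat = edge_function(flat_image)
g_step = edge_function(step_image)
```

On the flat image the edge function equals one everywhere, while on the step image it drops towards zero next to the edge, which is where the evolving curve is meant to slow down.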
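As per [11], a Lipschitz function $\phi$ is used to represent the curve $C$ such that,
$$C = \{(u,v) \mid \phi(u,v) = 0\} \qquad (9)$$
As with other level set formulations like [4] and [13], the curve is evolved using the mean curvature $\mathrm{div}(\nabla\phi / |\nabla\phi|)$ in the normal direction. Therefore the curve evolution is represented by $\partial\phi / \partial t$ as
$$\frac{\partial \phi}{\partial t} = g(I)\,|\nabla\phi| \left( \mathrm{div}\!\left( \frac{\nabla\phi}{|\nabla\phi|} \right) + \nu \right) \qquad (10)$$
where $g(I)$ is the edge function of Equation 7 and the evolution of the curve is given by the zero-level curve at time $t$ of the function $\phi(t,u,v)$. $\nu$ is a constant to ensure that the curve evolves in the normal direction, even if the mean curvature is zero.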
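A single explicit time step of a curvature-driven evolution of this kind can be sketched as below. This is a minimal sketch under stated assumptions: the update is edge function times gradient magnitude times (curvature plus a constant), discretised with central differences on interior pixels only, with a small `eps` to avoid division by zero. It omits the extra terms that a practical implementation such as [11] adds, and the names `evolve_step`, `nu` and `dt` are hypothetical.

```python
import math

def evolve_step(phi, g, nu=1.0, dt=0.1, eps=1e-8):
    """One explicit Euler step of d(phi)/dt = g * |grad phi| * (curvature + nu).

    phi and g are 2D nested lists of equal size; border pixels are left unchanged.
    """
    h, w = len(phi), len(phi[0])
    new_phi = [row[:] for row in phi]
    for v in range(1, h - 1):
        for u in range(1, w - 1):
            pu = (phi[v][u + 1] - phi[v][u - 1]) / 2.0
            pv = (phi[v + 1][u] - phi[v - 1][u]) / 2.0
            puu = phi[v][u + 1] - 2.0 * phi[v][u] + phi[v][u - 1]
            pvv = phi[v + 1][u] - 2.0 * phi[v][u] + phi[v - 1][u]
            puv = (phi[v + 1][u + 1] - phi[v + 1][u - 1]
                   - phi[v - 1][u + 1] + phi[v - 1][u - 1]) / 4.0
            grad_sq = pu * pu + pv * pv
            # Mean curvature: div(grad phi / |grad phi|) written out in 2D.
            curvature = ((puu * pv * pv - 2.0 * pu * pv * puv + pvv * pu * pu)
                         / (grad_sq + eps) ** 1.5)
            new_phi[v][u] = phi[v][u] + dt * g[v][u] * math.sqrt(grad_sq) * (curvature + nu)
    return new_phi

plane = [[float(u) for u in range(6)] for _ in range(6)]  # zero curvature everywhere
ones = [[1.0] * 6 for _ in range(6)]    # edge function far from any boundary
zeros = [[0.0] * 6 for _ in range(6)]   # edge function on an ideal boundary
moved = evolve_step(plane, ones)
stopped = evolve_step(plane, zeros)
```

The planar example has zero curvature, so the front still advances by `dt * nu` where the edge function is one, and it does not move at all where the edge function is zero.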
Theoretically, as the image gradient on an edge/boundary of an image segment tends to infinity, the edge function (Equation 7) is zero on the boundary. This causes the curve to stop evolving at the boundary (Equation 10). However, in practice the edge function may not always be zero at image boundaries of complex images and the performance of the level set method is severely affected by noise. Isotropic Gaussian smoothing can be applied to reduce image noise but over smoothing will also smooth the edges, in which case, the level set curve may miss the boundary altogether. This is a common problem not only for the level set method in [11] but also for other active contour models [4, 14, 12, 13]. Additionally, the efficiency and effectiveness of level set in boundary detection depends a lot on the initialisation of the curve. Without appropriate initialisation, the curve is frequently trapped into local minima.
A very close initialisation curve can eliminate this problem. In our approach, the initialisation curve is obtained by registering a 3D model over the photo as described in Section 3. Since the parts in the 3D model are already known, they can be projected at the known 3D pose to obtain a selected part outline $C_p$ in 2D. An ‘erosion’ morphological operator is applied on $C_p$ to obtain the initial curve $C_p^0$, which is inside the real boundary.
The green curves (initialisation images in Figures 9, 10 and 11) are used to denote the 2D outlines of projected parts in the 3D model, while the red curves are the initialisation curves obtained by eroding these green curves. The level set starts with the initial curve $C_p^0$ to find the actual boundary in the 2D image of the vehicle, for each part $p$. The yellow curves (result images in Figures 9, 10 and 11) indicate the actual boundaries detected.
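The erosion step can be sketched on a binary mask of a projected part. This sketch assumes the projected outline is available as a filled 0/1 mask, uses a 3 x 3 square structuring element, and treats pixels outside the image as background; the structuring element and number of iterations used for the reported results are not stated, and the function name is hypothetical.

```python
def erode(mask, iterations=1):
    """Binary erosion of a 0/1 mask (nested lists) with a 3 x 3 square element."""
    h, w = len(mask), len(mask[0])
    for _ in range(iterations):
        # A pixel survives only if its whole 3 x 3 neighbourhood is foreground.
        mask = [[1 if all(0 <= v + dv < h and 0 <= u + du < w and mask[v + dv][u + du]
                          for dv in (-1, 0, 1) for du in (-1, 0, 1)) else 0
                 for u in range(w)] for v in range(h)]
    return mask

# Toy projected part: a 4 x 4 filled square inside an 8 x 8 image.
part_mask = [[1 if 2 <= u <= 5 and 2 <= v <= 5 else 0 for u in range(8)]
             for v in range(8)]
initial_mask = erode(part_mask)
```

The eroded mask lies strictly inside the projected part, so a curve initialised on its boundary starts inside the real part boundary even when the projection is slightly misaligned.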
The entire process of ‘Model Assisted Segmentation’ is given in pseudocode in Algorithm 1.
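The steps described in Sections 3 and 4 can be strung together as in the following sketch. It only fixes the order of operations; every step is passed in as a callable, and all the keys (`rough_pose`, `remove_background`, `optimise_pose`, `project_part`, `erode`, `level_set`) are hypothetical names. Running the level set on the original photo rather than on the background-removed image is an assumption of this sketch.

```python
def model_assisted_segmentation(photo, model_parts, steps):
    """Run the pipeline; `steps` maps step names to user-supplied callables."""
    rough_pose = steps["rough_pose"](photo)                    # e.g. wheel matching
    foreground = steps["remove_background"](photo, rough_pose)
    fine_pose = steps["optimise_pose"](foreground, rough_pose)  # coarse-to-fine search
    segments = {}
    for name, part in model_parts.items():
        outline = steps["project_part"](part, fine_pose)  # projected part outline
        initial = steps["erode"](outline)                 # initial curve inside it
        segments[name] = steps["level_set"](photo, initial)
    return fine_pose, segments

# Stub steps that only record the order in which they are called.
calls = []

def record(name, result):
    def step(*args):
        calls.append(name)
        return result
    return step

stub_steps = {
    "rough_pose": record("rough_pose", "rough"),
    "remove_background": record("remove_background", "foreground"),
    "optimise_pose": record("optimise_pose", "fine"),
    "project_part": record("project_part", "outline"),
    "erode": record("erode", "initial"),
    "level_set": record("level_set", "boundary"),
}
parts = {"fender": "mesh_a", "front door": "mesh_b"}
final_pose, part_segments = model_assisted_segmentation("photo", parts, stub_steps)
```

Passing the steps as callables keeps the sketch independent of any particular renderer, optimiser or level set implementation.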
5 Results
We apply our method to segment components of a real car from a photograph as follows.
Pose estimation. The results of registering the 3D model over the photograph (pose estimation) are shown in Figure 7. A gradient sketch of the 3D model is drawn over the photograph in yellow to indicate the pose of the 3D model at each step in Figure 7. The wheels of the 3D model do not match the wheels in the photo due to the effects of wheel suspension. Since we are interested in segmenting parts of the car body, the wheels have been removed from the 3D model for the fine pose estimation. The original photograph in Figure 7(a) shows the side view of a Mazda Astina car. We register a triangulated 3D model of the car obtained by a 3D laser scan. The rough 3D pose obtained using the wheel locations [9] is shown in Figure 7(c). The result of the approximate background removal is shown in Figure 7(b). We optimise the gradient based loss function (Equation 4) for the image in Figure 7(b) with respect to the seven pose parameters (Section 3) to obtain the fine 3D pose. The optimisation is done sequentially, moving from the highest level of Gaussian smoothing to the lowest. We start from the rough pose with two levels of Gaussian smoothing and obtain the pose in Figure 7(d). Next we use this pose to initialise an optimisation of the loss function with one level of Gaussian smoothing and obtain the pose in Figure 7(e). Finally, we use this pose to perform one more optimisation with no Gaussian smoothing and obtain the final fine 3D pose shown in Figure 7(f). We note that the visual improvement in the image overlays gets smaller as we go up the Gaussian pyramid. However, the improvement in the 3D pose becomes more apparent when we compare the close ups in Figures 8(a), 8(b) and 8(c).
Segmentation. Segmentation results based on contour detection for the photograph in Figure 7(a) using the fine 3D pose (Figure 7(f)) are shown in Figures 9 and 10. The segmentation results for a selection of car parts (front and back doors, front and back windows, fender, mud guard and front buffer) are shown in Figure 9(b) by the yellow curves. The part boundaries obtained by projecting the 3D model are shown in green and the initialisation curves are shown in red in Figure 9(a). For the sake of clarity we also include close ups of a few parts. The initialisation curves and the segmentation results for the back door and window are shown in Figures 10(a) and 10(b), using the same colour code. Close ups for the front parts are shown in Figures 10(e) and 10(f). In the latter case, the high amount of reflection in the car body degrades the segmentation results, especially around the hood of the car and the windshield. In contrast, the mud guard, lower parts of the buffer and fender are segmented out quite well in Figure 10(f) as there is less reflection noise in that region. Results for a semi-profile view of the car are shown in Figures 1 and 11 using the same convention.
Accuracy. The accuracy of the results has been compared against a ground truth obtained from the photos by hand annotation in Table 1. We calculate the accuracy as
(11) 
Here the subsegmentation result and the ground truth are both represented as binary images. We note that the accuracy is high, at 93.6% or above for every part in both views. Also, the side view has a higher accuracy in general because the pose estimation gave a better result and hence the segmentation was better initialised.
Table 1: Segmentation accuracy for each part.
Part          Side View   Semi Profile   Avg.
Fender        97.7%       97.6%          97.7%
Front door    98.1%       95.3%          96.7%
Back door     96.8%       93.6%          95.2%
Mud flap      97.3%       95.1%          96.2%
Front window  97.8%       97.5%          97.7%
Back window   99.5%       93.9%          96.7%
Benchmark tests. Our results from Model Assisted Segmentation were compared with state of the art image segmentation methods ‘Grabcut (GC)’ [24] and ‘Level set (LS)’ [11] which do not use any Model Assistance. A bounding box has been used to initialise the benchmark methods. We compare our results (Figures 10(b) and 10(f)) with the benchmark tests in Figure 10. The segmentations using our method are more accurate in general. In addition to this, our method has the added advantage of subsegmenting parts of the same object. This is a non-trivial task for conventional segmentation methods when the subsegments of the object share the same colour and texture. In terms of overall performance, we observe that in our method the segmentation results ‘bleed’ a lot less into adjacent areas, unlike with the benchmark results. In terms of subsegmenting parts of the same object, we see in Figure 10(f) that our method is capable of successfully segmenting out the fender, mud guard and the buffer from the front door, unlike the benchmark methods. In fact it would be extremely difficult (if not impossible) to subsegment parts of the front of the car which are painted the same colour with conventional methods. Similarly, the back door, back window and the smaller glass panel have been segmented out in Figure 10(b), whereas the benchmark methods group them together. Results for a semi-profile view of the car are shown in Figure 1 with close ups and benchmark comparisons in Figure 11. Again, our results are more accurate and separate the object into meaningful parts.
6 Discussion
The Model Assisted Segmentation method described in this paper can segment parts of a known 3D object from a given image. It performs better than the Grabcut and level set benchmarks tested in Section 5 and can segment (and separate) parts that have similar pixel characteristics. We present our results on images of cars. The highly reflective surfaces of cars make the pose estimation as well as the segmentation tasks more difficult than with non-reflective objects.
We note that a close initialisation curve obtained from the 3D pose estimation significantly improves the performance of contour detection, and hence the image segmentation. However, the presence of reflections can deteriorate the quality of the results. We intend to explore avenues to make the process more robust in the presence of reflections.
Acknowledgment. The authors wish to thank Stephen Gould and Hongdong Li for the valuable feedback and advice. This work was supported by Control€xpert.
References
 [1] M. Arie-Nachimson and R. Basri. Constructing implicit 3D shape models for pose estimation. In ICCV, 2009.
 [2] G. Bradski. The OpenCV Library. Dr. Dobb’s Journal of Software Tools, 2000.
 [3] R. A. Brooks and T. O. Binford. Geometric modelling in vision for manufacturing. In Proceedings of the Society of Photo-Optical Instrumentation Engineers Conference on Robot Vision, volume 281, pages 141–159, Washington, DC, USA, April 1981.
 [4] V. Caselles, F. Catté, T. Coll, and F. Dibos. A geometric model for active contours in image processing. Numerische Mathematik, 66(1):1–31, 1993.
 [5] Roland T. Chin and Charles R. Dyer. Model-based recognition in robot vision. ACM Comput. Surv., 18(1):67–108, 1986.
 [6] P. David, D. DeMenthon, R. Duraiswami, and H. Samet. SoftPOSIT: Simultaneous pose and correspondence determination. International Journal of Computer Vision, 59(3):259–284, 2004.
 [7] David A. Forsyth and Jean Ponce. Computer Vision: A Modern Approach. Prentice Hall Professional Technical Reference, 2002.
 [8] B.K.P. Horn. Obtaining shape from shading information. In PsychCV75, pages 115–155, 1975.
 [9] M. Hutter and N. Brewer. Matching 2D Ellipses to 3D Circles with Application to Vehicle Pose Identification. In Image and Vision Computing New Zealand, 2009. IVCNZ’09. 24th International Conference, pages 153–158, 2009.
 [10] Henner Kollnig and Hans-Hellmut Nagel. 3D pose estimation by directly matching polyhedral models to gray value gradients. Int. J. Comput. Vision, 23(3):283–302, 1997.
 [11] Chunming Li, Chenyang Xu, Changfeng Gui, and M.D. Fox. Level set evolution without reinitialization: a new variational formulation. In Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on, volume 1, pages 430–436, June 2005.
 [12] R. Malladi, J. Sethian, and B. Vemuri. Evolutionary fronts for topology-independent shape modeling and recovery. Computer Vision–ECCV’94, pages 1–13, 1994.
 [13] R. Malladi, J.A. Sethian, and B.C. Vemuri. Shape modeling with front propagation: A level set approach. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 17(2):158–175, 1995.
 [14] Ravikanth Malladi. A topology-independent shape modeling scheme. PhD thesis, University of Florida, Gainesville, FL, USA, 1993. AAI9505796.
 [15] F. Moreno-Noguer, V. Lepetit, and P. Fua. Pose priors for simultaneously solving alignment and correspondence. Computer Vision–ECCV 2008, pages 405–418, 2008.
 [16] JA Nelder and R. Mead. A simplex method for function minimization. The computer journal, 7(4):308, 1965.
 [17] M. Ozuysal, V. Lepetit, and P. Fua. Pose estimation for category specific multiview object localization. In Conference on Computer Vision and Pattern Recognition, Miami, FL, June 2009.
 [18] W. A. Perkins. A model-based vision system for industrial parts. IEEE Trans. Comput., 27(2):126–143, 1978.
 [19] J. F. Poje and E. J. Delp. A review of techniques for obtaining depth information with applications to machine vision. Technical report, Center for Robotics and Integrated Manufacturing, Univ. of Michigan, Ann Arbor, 1982.
 [20] Joseph Lee Rodgers and W. Alan Nicewander. Thirteen ways to look at the correlation coefficient. The American Statistician, 42(1):59–66, 1988.
 [21] B. Rosenhahn, T. Brox, D. Cremers, and H.P. Seidel. A comparison of shape matching methods for contour based pose estimation. Combinatorial Image Analysis, pages 263–276, 2006.
 [22] B. Rosenhahn, T. Brox, and J. Weickert. Three-dimensional shape knowledge for joint image segmentation and pose tracking. International Journal of Computer Vision, 73(3):243–262, 2007.
 [23] B. Rosenhahn, C. Perwass, and G. Sommer. Pose estimation of 3D freeform contours. International Journal of Computer Vision, 62(3):267–289, 2005.
 [24] C. Rother, V. Kolmogorov, and A. Blake. Grabcut: Interactive foreground extraction using iterated graph cuts. ACM Transactions on Graphics (TOG), 23(3):309–314, 2004.
 [25] M. Rousson and N. Paragios. Shape priors for level set representations. Computer Vision–ECCV 2002, pages 416–418, 2002.
 [26] Min Sun, Bing-Xin Xu, Gary Bradski, and Silvio Savarese. Depth-encoded Hough voting for joint object detection and shape recovery. In ECCV, Crete, Greece, 2010.
 [27] T.N. Tan and K.D. Baker. Efficient image gradient based vehicle localization. IEEE Transactions on Image Processing, 9(8):1343–1356, 2000.
 [28] M. Yachida and S. Tsuji. A versatile machine vision system for complex industrial parts. IEEE Trans. Comput., 26(9):882–894, 1977.