One important application of computer vision is analyzing the shape and structure of natural objects, e.g., quantifying the size, number, and shape of plant leaves. In such applications, instance segmentation plays an important role because individual object instances must be determined accurately. Single-view instance segmentation has gained attention, and a series of previous works [maskrcnn, ren2017end, xiong2019upsnet] have addressed the problem. However, for scenes suffering from severe occlusions, instance segmentation needs to be performed in a multi-view setting, as shown in Fig. 1. We call this the multi-view instance segmentation (MVIS) problem.
Extending single-view instance segmentation to the multi-view setting is challenging, especially when the individual instances look alike, as plant leaves do. Although external markers or background objects, e.g., plant pots, can guide the estimation of sparse correspondences for determining camera poses, we cannot expect dense point correspondences on similar instances, e.g., leaves, across views. This makes it difficult to directly use existing approaches, such as 3D instance segmentation [hou20193d] and multi-view semantic segmentation [kowdle2012multiple, djelouah2013multi, mustafa2017semantically], which heavily rely on 3D shape information.
To overcome this problem, we propose an MVIS method based on region matching that does not rely on dense correspondences and can therefore match objects across views that lack distinctive textures or shapes and have similar appearances. The key idea of our method is to use the epipolar constraint to determine the multi-view region correspondences of segmented instances that are computed at each view. By treating each instance in each view as a node in a graph, and by assigning edge weights based on the degree of intersection of epipolar lines and instance regions, we cast the problem of multi-view instance segmentation as a graph clustering problem. We call this approach epipolar region matching. We show that it reliably establishes correspondences of view-wise segmented instances across multiple views. Further, our epipolar region matching can be easily integrated into modern instance segmentation methods [maskrcnn, xiong2019upsnet, voigtlaender2019mots]. Our method can be combined with any region segmentation method; given a set of regions, our region matching does not rely on any descriptors. The estimated instances with multi-view correspondences can be used for instance-wise 3D reconstruction via existing 3D reconstruction methods, yielding superior reconstruction results.
Experimental results show that the proposed method yields accurate multi-view instance correspondence compared to an instance matching method based on a traditional multi-view stereo (MVS) and a point-based matching. We also demonstrate the effectiveness of MVIS for instance-wise 3D reconstruction.
We propose a multi-view region matching method that uses neither texture nor shape descriptors, as a key technical component of MVIS for texture/shape-repetitive scenes. The matching method can be combined with instance segmentation and 3D reconstruction methods for instance-wise 3D reconstruction.
2 Related Works
Our work is closely related to multi-view correspondence matching, as well as the multi-view or 3D extensions of semantic/instance segmentation. This section discusses the related works in these subject areas.
2.0.1 Multi-view correspondence matching.
Multi-view correspondence matching is a fundamental problem in multi-view image analysis, such as structure from motion (SfM) [schoenberger2016sfm] and multi-view stereo (MVS) [schoenberger2016mvs]. Keypoint detection and matching (e.g., [lowe2004distinctive, alcantarilla2011fast, rublee2011orb]), a long-established but still frequently used approach, is used to obtain sparse correspondences, while dense correspondence matching often involves patch-based [bleyer2011patchmatch] and plane-based [gallup2007real] matching.
For target scenes containing partly texture-less objects, various approaches have been studied, such as belief propagation [sun2003stereo, furukawa2004structure] for MVS. More recently, deep-learning-based approaches to correspondence matching for partially texture-less scenes, which often assume the smoothness of the target scene, have been studied [zhang2019learning, romanoni2019tapa]. Correspondence matching for objects whose surfaces are fully texture-less or have repetitive textures is notably challenging because it becomes difficult to find dense correspondences. For string-like objects (e.g., hairs), which have matching ambiguity along the strings or lines, an MVS method using line-shaped patches has been proposed [nam2019strand]. Regarding sparse point matching, Dellaert et al. [dellaert2000structure] presented a method for the SfM problem without using texture features, which calculates the camera pose and sparse point correspondences based only on the geometric relationship of keypoints.
Shape-based matching [belongie2002shape, berg2005shape, mikolajczyk2005comparison], such as matching using Fourier shape descriptors [bartolini2005warp] or shape contexts [belongie2002shape], is another line of region matching. These approaches encode the region shape information, e.g., instance shape, into descriptors. In our case, since the target scene is composed of objects with similar 3D shapes (e.g., plant leaves), shape-based matching is not realistic. Although it lies outside the context of instance segmentation, region-based information [matas2004robust], such as texture statistics, is a useful cue for correspondence matching. For example, superpixel stereo [li2016pmsc, mivcuvsik2010multi] and segment-based stereo [klaus2006segment] were developed to increase the robustness of correspondence matching. Segmentation of the same object among multiple views can be categorized as multi-view co-segmentation [kowdle2012multiple, djelouah2013multi, mustafa2017semantically], which also uses texture features.
Unlike the previous methods, we achieve the descriptor-free multi-view region matching using the epipolar constraint. The epipolar matching is a traditional yet important problem. As far as we are aware, searching regions without (texture or shape) descriptor matching remains unsolved.
2.0.2 Instance segmentation and its multi-view extension.
While various implementations of instance segmentation, including one-stage methods [kulikov2020instance], have been developed, two-stage frameworks that combine object detection and mask generation are often used owing to their performance and simplicity. A major implementation of two-stage instance segmentation is Mask R-CNN [maskrcnn], based on Faster R-CNN [ren2015faster], which computes region proposals for the target objects and then generates masks for the selected proposals.
Video instance segmentation is a multi-image extension of instance segmentation. These approaches perform instance tracking as well as segmentation [milan2015joint, jun2017cdts, ovsep2018track, sharma2018beyond]. An early attempt at this approach [seguin2016instance] used superpixels for instance tracking. A recent study formulates the problem as multi-object tracking and segmentation (MOTS) and provides a Mask-R-CNN-based implementation [voigtlaender2019mots]. PanopticFusion [narita2019panopticfusion] aggregates multi-view instances in a 3D volumetric space, but it takes a video sequence as input, which enables instance tracking. Video instance tracking has been applied to plant image analysis, such as the 3D reconstruction of grape clusters [scholer2015automated]. This group of approaches assumes time-series image sequences, which allow object tracking with small displacements between consecutive frames. We instead focus on the multi-view setting, where the images are captured with wider baselines between the viewpoints.
2.0.3 3D semantic/instance segmentation.
Somewhat related to MVIS, 3D semantic/instance segmentation has also been studied. Approaches for 3D segmentation often take 3D shapes of target objects as input, such as point clouds [qi2017pointnet, wang2019multi], volumetric spaces [lahoud20193d], and RGB-D images [hou20193d, dai20183dmv]. Semantic segmentation on multi-view RGB images can be used for 3D reconstruction via aggregation in a 3D space, which is called semantic 3D reconstruction [hane2013joint, savinov2015discrete, savinov2016semantic]. In principle, these methods do not distinguish objects in the same category.
Nassar et al. [nassar19] propose instance warping for multi-view instance correspondence matching using the 3D shape of target scenes. Instance segmentation on 3D point clouds has also been studied [wang2018sgpn, yi2019gspn, engelmann20203d], which requires high-quality 3D point clouds as input. Unlike these approaches, we focus on the instance matching of objects for which 3D reconstruction is challenging, e.g., plants, owing to thin shapes, repetitive textures, and heavy occlusions. As an application-specific study, the plant modeling method proposed by Quan et al. [quan06] combines 2D segmentation and 3D point cloud clustering for 3D leaf modeling. Because their method assumes a good-quality 3D point cloud as input, it is difficult to apply to the reconstruction of plants with texture-less leaves.
3 Multi-view instance segmentation (MVIS)
Our goal is to find multi-view instance correspondences from the multi-view images of the target. We assume that the camera poses and intrinsics are known, e.g., by an SfM [schoenberger2016sfm] using the sparse correspondences obtained from the scene background. We here assume that instance segmentation in each view is available for now, while the instance correspondence across views is unknown.
3.1 Epipolar region matching
Given a set of instances in each image and camera pose information, we establish instance correspondences across multi-view images. We approach this problem using epipolar geometry, namely, by finding the correspondences via epipolar lines drawn on other views. Since multiple instances can appear on the same epipolar line (see Fig. 2), we cast the matching problem as a graph clustering problem.
We create an undirected edge-weighted graph , where is the node set that consists of the instances appearing in all views, is the edge set, and is the weight function defining the edge weights. The number of nodes corresponds to the total number of instances that appear in all the images.
To define the edge weights, we compute a set of epipolar lines from densely-sampled points on each instance. Here we define the -th view image as , -th instance segment in the -th image as and its region (set of pixels) in the image coordinates as . Let be a fundamental matrix between -th and -th views. Then an image point in instance segment , i.e., , forms an epipolar line in the -th view image as
where is the homogeneous representation of . From all points in region , we have a set of epipolar lines forming a pencil of epipolar lines passing through the epipole. In what follows, we call the pencil an epipolar band in this paper. The epipolar band of region on the -th image is thus defined as
We define the edge weights in the graph as the degree of intersection of an epipolar band and instance regions, analogously to the intersection-over-union (IoU) computation. While the original IoU is defined between two areas, its extension to a similarity measure between an epipolar band and a region is not straightforward. To evaluate the degree of intersection between the epipolar band and an instance region, we use two measures: the area of intersection and the number of epipolar lines in the epipolar band passing through the instance region, as illustrated in the right side of Fig. 2. In this manner, the edge weight between nodes and can be obtained as
where the function counts the number of pixels belonging to the region, while counts the number of epipolar lines passing through the area. During the computation of , we draw epipolar lines with the same thickness (two pixels in our experiment). As with many other similarity measures, takes the range , where it becomes one if all epipolar lines in the epipolar band pass through the instance area and the whole area is filled with epipolar lines (see Fig. 2).
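As a rough illustration of the edge weight described above, the following Python sketch draws an epipolar line for every sampled source point and scores a target region. It assumes the convention that the fundamental matrix maps a homogeneous point in the source view to a line in the target view, represents regions as sets of integer pixel coordinates, and combines the two measures as a product of ratios (covered area over region area, and lines hitting the region over all lines). That combination is only one reading consistent with the stated range and is an assumption, as are all function names.

```python
def epipolar_line(F, point):
    """Map a pixel (x, y) in the source view to line coefficients (a, b, c)
    in the target view, i.e. l = F @ [x, y, 1]^T (assumed convention)."""
    xh = (point[0], point[1], 1.0)  # homogeneous representation
    return tuple(sum(F[r][c] * xh[c] for c in range(3)) for r in range(3))


def distance_to_line(line, pixel):
    """Perpendicular distance from a pixel (u, v) to the line a*u + b*v + c = 0."""
    a, b, c = line
    return abs(a * pixel[0] + b * pixel[1] + c) / (a * a + b * b) ** 0.5


def edge_weight(F, source_points, target_region, thickness=2.0):
    """Assumed weight: (covered area / region area) * (lines hitting / all lines)."""
    lines = [epipolar_line(F, p) for p in source_points]
    # Drop degenerate lines (a = b = 0), which have no direction.
    lines = [l for l in lines if l[0] != 0 or l[1] != 0]
    if not lines or not target_region:
        return 0.0
    half = thickness / 2.0
    covered = set()   # region pixels touched by at least one thick line
    hits = 0          # number of lines touching the region
    for line in lines:
        touched = {p for p in target_region if distance_to_line(line, p) <= half}
        if touched:
            hits += 1
            covered |= touched
    area_ratio = len(covered) / len(target_region)
    line_ratio = hits / len(lines)
    return area_ratio * line_ratio


# Toy setup: a purely horizontal camera shift, for which the epipolar line
# of (x, y) is the row v = y in the other view.
F_SHIFT = [[0, 0, 0], [0, 0, -1], [0, 1, 0]]
source = [(2, y) for y in range(4)]                        # rows 0..3
aligned = {(u, v) for u in range(5) for v in range(4)}     # rows 0..3
offset = {(u, v) for u in range(5) for v in range(2, 6)}   # rows 2..5
```

In this toy setup, the aligned region is fully covered and hit by every line, whereas the offset region is only partly covered and missed by one of the four lines, so it receives a lower weight.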
Once the edge weights are defined, we form an adjacency matrix of elements from the graph and perform graph clustering using symmetric non-negative matrix factorization (SymNMF) [kuang2012symmetric], as . After the factorization, the largest element in each row of indicates the cluster ID, where each cluster forms a set of corresponding instances across multiple views. Hereafter, we call such a cluster an instance cluster.
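The clustering step can be pictured with a small Python sketch. It assumes the standard SymNMF objective, in which a non-negative factor matrix H is sought so that the adjacency matrix A is approximated by H Hᵀ, and it takes H as given by some SymNMF solver rather than implementing one. The helper names and the toy values are hypothetical.

```python
def build_adjacency(num_nodes, weights):
    """Symmetric adjacency matrix from {(p, q): weight} for an undirected graph."""
    A = [[0.0] * num_nodes for _ in range(num_nodes)]
    for (p, q), w in weights.items():
        A[p][q] = w
        A[q][p] = w
    return A


def residual(A, H):
    """Squared Frobenius norm of A - H H^T, the quantity SymNMF minimizes."""
    n = len(A)
    total = 0.0
    for p in range(n):
        for q in range(n):
            approx = sum(hp * hq for hp, hq in zip(H[p], H[q]))
            total += (A[p][q] - approx) ** 2
    return total


def cluster_ids(H):
    """Cluster ID of each node: index of the largest element in its row of H."""
    return [max(range(len(row)), key=lambda k: row[k]) for row in H]


# Toy graph: nodes 0, 1, 2 are mutually linked, as are nodes 3 and 4.
weights = {(0, 1): 1.0, (0, 2): 1.0, (1, 2): 1.0, (3, 4): 1.0}
A = build_adjacency(5, weights)
# A factor that a solver might return for this graph (hypothetical values).
H = [[0.9, 0.1], [1.0, 0.0], [0.8, 0.2], [0.1, 0.9], [0.0, 1.0]]
labels = cluster_ids(H)
```

The `residual` helper is included only to make the objective explicit; a solver would iterate on H to reduce it before the row-wise maximum is taken.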
3.2 Application: Epipolar region matching for region proposals
As an application of the proposed method, we here describe the integration of the multi-view region matching method with modern instance segmentation. While our region matching method can be used with any instance segmentation method, we can optionally utilize the correspondences across multi-view region proposals to retain partly occluded instances. Most instance segmentation methods [maskrcnn, xiong2019upsnet, voigtlaender2019mots] rely on region proposals [ren2015faster], which are unified by non-maximum suppression (NMS) [girshick2015deformable] to merge overlapping region proposals. As a result, region proposals for partially occluded instances are often suppressed. To address this, we implemented an NMS process that considers the multi-view region correspondences. We found that our implementation recovered partially occluded instances, while it did not significantly affect the overall instance segmentation accuracy (see the supplementary material for detailed discussions).
Our method starts from an initial segmentation result, e.g., by Mask R-CNN, and the tentative multi-view correspondences computed by the epipolar region matching. The NMS and region matching processes can be performed alternately to update the set of detected instances and their multi-view correspondences, as illustrated in Fig. 3. We assign an instance cluster ID to each region proposal by the region matching method and prevent the NMS process from unifying proposals with different cluster IDs.
To update the cluster ID of each region proposal, the sets of corresponding instances (i.e., instance clusters estimated in the previous iteration) are projected to the view in the same manner as the epipolar band projection described in Sec. 3.1. Let the -th instance cluster in the previous iteration be , which is denoted as a set of instances. The projection onto the -th image, , is calculated based on the sum of epipolar bands by the instances (which we call an epipolar map hereafter).
where counts the number of instances in the cluster . The right side of Fig. 3 shows an example of the epipolar map.
An updated cluster ID for each region proposal is calculated based on the highest degree of intersection between the epipolar map and the region proposal. Letting the instance mask of the -th region proposal in the -th image be , the similarity of the two maps, and , is obtained in a manner similar to the IoU computation. Since the epipolar map does not take the value of , we use an extension of the IoU, which is called Ruzicka similarity [deza2009encyclopedia].
where denotes the pixel location. In the equation, we treat each instance mask as a map taking values, where the value is one if the pixel is inside the mask. The updated cluster ID for the region proposal is selected as with the largest similarity . The update of cluster IDs is followed by the NMS, which does not unify region proposals with different cluster IDs so as to retain partly occluded instances. Our implementation allows the region matching and NMS processes to be iterated while updating the set of instances and multi-view correspondences.
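To make the cluster-ID update concrete, here is a small Python sketch. The Ruzicka similarity is the sum of element-wise minima divided by the sum of element-wise maxima, which reduces to IoU when both maps are binary. The sketch assumes that the epipolar map is the per-pixel average of the binary epipolar bands of the instances in a cluster; maps are stored as nested lists, and all names are illustrative.

```python
def epipolar_map(bands):
    """Per-pixel average of binary epipolar-band masks (values in [0, 1])."""
    count = len(bands)
    rows, cols = len(bands[0]), len(bands[0][0])
    return [[sum(b[r][c] for b in bands) / count for c in range(cols)]
            for r in range(rows)]


def ruzicka(map_a, map_b):
    """Ruzicka similarity: sum of minima over sum of maxima."""
    num = den = 0.0
    for row_a, row_b in zip(map_a, map_b):
        for a, b in zip(row_a, row_b):
            num += min(a, b)
            den += max(a, b)
    return num / den if den > 0 else 0.0


def best_cluster(cluster_maps, proposal_mask):
    """ID of the cluster whose epipolar map is most similar to the mask."""
    return max(cluster_maps,
               key=lambda cid: ruzicka(cluster_maps[cid], proposal_mask))


band_1 = [[0, 1], [1, 1]]
band_2 = [[0, 0], [1, 0]]
map_0 = epipolar_map([band_1, band_2])      # [[0.0, 0.5], [1.0, 0.5]]
map_1 = [[1.0, 0.0], [0.0, 0.0]]
mask = [[0, 1], [1, 1]]
new_id = best_cluster({0: map_0, 1: map_1}, mask)
```

Here the proposal mask overlaps the first epipolar map far more than the second, so the proposal is assigned cluster 0.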
3.3 Application: Instance-wise 3D reconstruction
Multi-view correspondences of instance segments can be used for instance-wise 3D reconstruction by independently applying a 3D reconstruction method to each instance cluster. We implemented a simple volumetric reconstruction method based on the back-projection used in traditional computed tomography [brooks75], which is analogous to the visual hull method [laurentini1994visual].
Let a set of -th instance cluster and the set of projection functions from 3D to 2D image coordinates, . The aggregated value at voxel can be computed as:
in which represents a projection from the voxel to the image coordinates corresponding to the instance . Here, denotes a mask representation of , which returns if the pixel is inside the instance region . The resultant voxel space represents the ratio of voted instances; for the evaluation of the reconstruction accuracy, we simply used the binarized version of the voxel space, using the threshold of.
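The voting scheme can be sketched as follows in Python. For each voxel, the sketch computes the fraction of instances in a cluster whose mask contains the voxel's projection and then binarizes the result with a threshold. The projection functions, the set-based mask representation, and the threshold value used in the example (0.75) are placeholders chosen for illustration, not values from the experiments.

```python
def vote_ratio(voxel, views):
    """Fraction of (project, mask) pairs whose mask contains the projected voxel."""
    votes = sum(1 for project, mask in views if project(voxel) in mask)
    return votes / len(views)


def reconstruct(voxels, views, threshold):
    """Keep the voxels whose vote ratio reaches the threshold."""
    return [v for v in voxels if vote_ratio(v, views) >= threshold]


# Two toy orthographic cameras standing in for calibrated projections.
def front(v):
    """Camera looking along the z axis."""
    return (v[0], v[1])


def side(v):
    """Camera looking along the x axis."""
    return (v[2], v[1])


views = [
    (front, {(0, 0), (1, 0)}),   # instance mask seen in the front view
    (side, {(0, 0)}),            # mask of the same instance in the side view
]
voxels = [(0, 0, 0), (1, 0, 0), (0, 0, 1), (2, 0, 0), (5, 5, 5)]
occupied = reconstruct(voxels, views, threshold=0.75)
```

Running one such reconstruction per instance cluster yields a separate voxel set for each object, which is what makes the result instance-wise.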
3.4 Implementation details
3.4.1 Region matching.
During the epipolar band projection, we randomly sampled points in each instance to draw the epipolar lines. The graph clustering is based on a Python implementation of SymNMF [kuang2012symmetric]. Since the clustering algorithm requires the number of instance clusters as an input, we implemented a framework to search for the optimal number of clusters. Assuming the instances are evenly occluded and are projected onto a similar number of views, we selected the optimal with the minimum standard deviation of the number of instances contained in the clusters.
where denotes the mean number of instances in instance clusters. In the supplementary material, we provide a detailed analysis when the number of instance clusters is given (i.e., using the ground-truth number of objects).
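The search for the number of clusters can be sketched as below. The sketch assumes a population standard deviation over cluster sizes, counts empty clusters as size zero, and takes a hypothetical `cluster_fn(k)` callback that returns one cluster label per instance (for example, a wrapper around a SymNMF routine).

```python
from collections import Counter
from statistics import pstdev


def size_spread(labels, k):
    """Standard deviation of the number of instances per cluster."""
    counts = Counter(labels)
    return pstdev([counts.get(c, 0) for c in range(k)])


def select_num_clusters(cluster_fn, candidates):
    """Return the candidate k whose clustering has the most even cluster sizes."""
    return min(candidates, key=lambda k: size_spread(cluster_fn(k), k))


# Stand-in for a real clustering routine: precomputed labels for six instances.
FAKE_RESULTS = {
    2: [0, 0, 0, 0, 1, 1],   # sizes 4, 2
    3: [0, 0, 1, 1, 2, 2],   # sizes 2, 2, 2
    4: [0, 0, 1, 1, 2, 3],   # sizes 2, 2, 1, 1
}
best_k = select_num_clusters(FAKE_RESULTS.__getitem__, [2, 3, 4])
```

With these stand-in labels, three clusters give perfectly even sizes and are therefore selected.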
3.4.2 MVIS application.
For the integration of the region matching for region proposals, we used a Keras implementation of Mask R-CNN (https://github.com/matterport/Mask_RCNN). We implemented an NMS using numpy and the nms package outside the computation graph of Mask R-CNN. To obtain the initial instances, we used the original Mask R-CNN with NMS by a large RoI threshold ( in our experiment) and obtained excessive numbers of object RoIs with their instance masks. During the iterations, our NMS is performed for each instance ID independently with a smaller threshold ( was used in the experiment). With our unoptimized implementation, the whole process took up to several hours on a CPU (2.1 GHz, 8 threads); the projection of epipolar bands accounted for most of the time, which could be greatly reduced through a better implementation.
We conducted experiments to assess the quality of the multi-view matching and instance-wise 3D reconstruction, which are the core parts of our framework. The supplementary material provides a detailed analysis and discussion, including the effect of our implementation on instance detection.
We used the following datasets for the experiment (samples are shown in Fig. 5).
4.1.1 Simulated plant dataset.
We used simulated plant models, which were modified from a dataset used in a plant modeling study [CVPR18_plant]. This dataset contains the ground truth instance masks and their multi-view correspondences, as well as the 3D shape information. For training of Mask R-CNN, we prepared simulated plant models rendered from viewpoints ( images in total). For evaluation, we used four plant models with different numbers of leaves (, , , and ), not used for the training. To assess the effect of varying the number of views, we rendered the plants from different numbers of viewpoints (, , , and ) illustrated in Fig. 5, where we used the ground truth camera parameters.
4.1.2 Real-world dataset.
We prepared real-world scenes for the experiments, in which the ground truth instance masks and multi-view correspondences were manually created. The camera poses for these scenes were estimated via SfM [schoenberger2016sfm].
Real plant (soybeans) scene is a set of multi-view images of a soybean plant captured by a multi-view capturing system [tanabata2018development], which was originally created for plant science studies. We used views for instance matching. The number of distinctive leaves in the plant was . A COCO-trained Mask R-CNN was fine-tuned using the images of eight soybean plants captured by the same system, which were not used during the evaluation.
Multi-view balloon scene contains a scene with balloons captured from views. Mask R-CNN, pre-trained with the COCO dataset, was fine-tuned using balloon images in the balloon dataset (https://github.com/matterport/Mask_RCNN/tree/master/samples/balloon).
Tree ornament scene contains an artificial tree with decorations captured from views. Because the ornaments have spherical shapes, we fine-tuned the balloon-trained Mask R-CNN using a small number of (i.e., ) training images with tree ornaments, which were not used in the evaluation.
4.2 Multi-view matching results
We evaluated the accuracy of multi-view region matching using these datasets.
Because there are no established baseline methods for descriptor-free region matching, we compared the proposed approach with straightforward implementations of instance matching 1) using an MVS-based point cloud as guidance, 2) using the center points of regions, and 3) using sparse correspondences.
A 3D model reconstructed by traditional MVS can serve as a guide for multi-view matching. We implemented a baseline matching method using dense 3D point clouds, called MVS-based matching hereafter. In this implementation, we used dense 3D point clouds reconstructed by COLMAP [schoenberger2016mvs]. For , the -th instance in the -th view, we selected a 3D position that represents the region of . A straightforward way is to select that is closest to the camera from among the points projected in the region of . To increase the robustness to outliers, we designated as the centroid of the set of top closest points among those projected in the region.
We then project to the other views; for example, if is projected on the -th instance in the -th view, , we deem and to be corresponding instances. In the same manner as our proposed method, we constructed a matching graph with nodes, where the edge weights record the correspondences; e.g., the edge between and is weighted by one. We solved the graph clustering using SymNMF [kuang2012symmetric]. This baseline method expects that a good-quality 3D model is given, which is the same assumption as that of methods using point clouds as input (e.g., [quan06]).
Another baseline for the proposed approach matches the centroids of instance regions instead of using region matching. This method is analogous to traditional point-based epipolar matching, i.e., methods using epipolar lines instead of bands. In our implementation, an epipolar line is drawn for the centroid of each instance region. The correspondences between instances in different views are given by the nearest epipolar lines in 3D space. As with the MVS-based method, we constructed a matching graph in which the edges between corresponding instances are weighted by one, and solved the graph clustering by SymNMF.
Matching based on sparse feature correspondences.
We also assessed a multi-view region matching method using sparse correspondences, which is a typical example of using feature descriptors. We used AKAZE keypoints and descriptors [alcantarilla2011fast]. For each instance region, a corresponding instance in each of the other views is searched for based on the number of corresponding points within the regions. We solved the graph matching as in the other baseline methods. This method is expected to work well when the scene is observed from similar viewing angles, which is an assumption similar to that of video instance tracking.
4.2.2 Evaluation metric.
The matching accuracy was calculated as the number of correctly classified instances over the number of all instances. Because we solve the multi-view matching as a clustering problem, we need to associate estimated instance clusters with the ground-truth clusters. We determined the ground-truth cluster ID corresponding to the ID of an estimated cluster , by searching the mode of among instances belonging to .
where denotes the estimated ID of the instance , while is the set of ground-truth instance cluster IDs. Therefore, returns if the two instance IDs are the same. The number of correct matches is divided by , the total number of instances among multi-view images. As an evaluation of clustering problems, is equivalent to the purity metric, which is a common measure for evaluating the success of clustering.
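The accuracy computation corresponds to the standard purity measure and can be sketched in a few lines of Python. Each estimated cluster is assigned the most frequent ground-truth ID among its members, and the members carrying that ID are counted as correct. The function name and the toy labels are illustrative.

```python
from collections import Counter, defaultdict


def matching_accuracy(estimated_ids, ground_truth_ids):
    """Purity: fraction of instances whose ground-truth ID equals the mode
    of the ground-truth IDs within their estimated cluster."""
    members = defaultdict(list)
    for est, gt in zip(estimated_ids, ground_truth_ids):
        members[est].append(gt)
    # For each estimated cluster, count the members carrying its most common ID.
    correct = sum(Counter(gts).most_common(1)[0][1] for gts in members.values())
    return correct / len(estimated_ids)


estimated = [0, 0, 0, 1, 1, 2]
ground_truth = ["a", "a", "b", "b", "b", "c"]
accuracy = matching_accuracy(estimated, ground_truth)
```

In the toy labels, only the third instance disagrees with the mode of its cluster, so five of the six instances are counted as correct.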
We also evaluated the accuracy of the estimated number of clusters by computing the mean absolute error (MAE) of the estimated number of clusters via Eq. (7).
| Method | # cameras | # leaves | Average |
Table 2 shows the matching accuracy and the MAE of cluster numbers for the simulated plant dataset. The table compares the accuracy and error averaged over different numbers of cameras and leaves. The proposed MVIS implementation yielded better accuracy in most cases and achieved an average matching accuracy of , which outperforms the baseline methods. The result also shows that MVIS accurately estimates the number of clusters (i.e., the number of objects in the scene). MVS-based matching yielded better accuracy when using a larger number of cameras (i.e., ) because the smaller differences in the viewing angles enable it to find dense texture-based correspondences. With a smaller number of views, the performance of the MVS-based method drops notably, whereas the proposed approach still achieved a matching accuracy of over when the number of views is .
Table 2 shows the matching performance for the real-world scenes. The proposed approach also achieved reasonable accuracy for both multi-view matching and cluster number estimation. For the tree ornament scene, the matching accuracy was comparable among the methods; the ornaments were small spheres and were well approximated as points. The proposed MVIS has an advantage for scenes in which dense 3D reconstruction is difficult (e.g., the real plant scene) or which contain relatively large objects (e.g., the balloon scene).
4.3 Instance-wise 3D reconstruction results
We here describe the quantitative results of 3D reconstruction using the simulated plant models, for which we have the ground-truth shapes of the leaves.
We used two baselines for the evaluation of 3D reconstruction accuracy, although these methods do not provide instance-wise 3D reconstruction. Since we used the back-projection-based 3D reconstruction described in Sec. 3.3, we implemented a simple baseline that also uses back projection. Without relying on the instance correspondences, we unified the silhouettes of all leaves and used them as input to the back projection. This mimics the use of semantic segmentation of leaves as the silhouette source, instead of multi-view instances. As another baseline, we simply used the MVS-based point clouds reconstructed by COLMAP [schoenberger2016mvs].
4.3.2 Evaluation metric.
We evaluated the geometric error between the dense point clouds of the reconstructed and the ground truth leaf shapes. For the evaluation, we first unify all 3D shapes of reconstructed leaf instances and convert the 3D voxel representation to a dense point cloud, where a 3D point is placed at each voxel that is inside the leaves. Because the ground-truth leaf shapes were originally modeled using polygons, we oversampled the vertices by Catmull–Clark subdivision [catmull1978recursively] to yield dense point clouds.
Let and be estimated and the ground-truth 3D points, respectively. The geometric error is defined as a bidirectional Euclidean distance [zhu12] between the two point sets written as
where and are functions to acquire the nearest neighbor point to from point sets and , respectively, and and denote the numbers of points in and .
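A brute-force Python sketch of a bidirectional nearest-neighbor distance is given below. It assumes that the error is the sum of the two directed mean distances (estimated to ground truth, and ground truth to estimated); the exact weighting in the original definition may differ, and the function names are illustrative.

```python
from math import dist


def directed_mean_distance(source, target):
    """Mean distance from each source point to its nearest target point."""
    return sum(min(dist(p, q) for q in target) for p in source) / len(source)


def bidirectional_distance(estimated, ground_truth):
    """Sum of the two directed mean nearest-neighbor distances (assumed form)."""
    return (directed_mean_distance(estimated, ground_truth)
            + directed_mean_distance(ground_truth, estimated))


estimated = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
ground_truth = [(0.0, 0.0, 0.0)]
error = bidirectional_distance(estimated, ground_truth)
```

Using both directions penalizes spurious estimated points as well as missing parts of the ground-truth shape. For dense point clouds, the quadratic nearest-neighbor search would normally be replaced by a spatial index such as a k-d tree.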
| Method | # cameras (averaged) | # leaves (10 cameras) |
Table 3 shows the 3D reconstruction error . Since the geometric error is defined only up to scale, as in most multi-view 3D reconstruction methods, the errors were normalized by the average leaf length. For the comparison across different numbers of cameras (left side of the table), we averaged the error over the different numbers of leaves. For the right half of the table, results using cameras are listed because MVS often failed in the dense reconstruction for smaller numbers of views. The proposed method (MVIS) achieved better accuracy in most cases. Although the traditional MVS yielded an accurate reconstruction when using a larger number of (i.e., ) cameras or a smaller number of leaves, its reconstruction was otherwise inaccurate or failed due to the difficulty of finding dense correspondences. The reconstruction accuracy of the proposed MVIS did not notably degrade when decreasing the number of views; it still achieved an error of of the average leaf length in the reconstruction using cameras.
4.3.4 Visual examples.
Figure 6 shows example results of MVIS and instance-wise 3D reconstruction for real-world datasets. In the MVIS result, corresponding instances are visualized with the same color. The proposed method yields the multi-view correspondences and instance-wise 3D shapes convincingly, although we do not have access to the ground-truth 3D shapes for real-world datasets.
4.3.5 Failure cases.
Because the proposed method relies on region segmentation, failures in segmentation due to, e.g., contact or occlusion between objects affect the matching accuracy. Our experiment includes such cases when increasing the number of objects (e.g., Fig. 7). The low quality of instance masks is the dominant cause of mismatching, which is a limitation of the proposed method.
We introduced a multi-view matching method for object instances that does not rely on texture or shape descriptors and instead uses a geometric (i.e., epipolar) constraint and graph clustering. Experiments with simulated plant models demonstrated that the proposed method yielded an average multi-view instance matching accuracy of over , which outperforms baseline methods based on descriptor-based approaches such as MVS. Our method also showed the potential to be used for instance-wise 3D reconstruction via integration with 3D reconstruction methods such as back projection.
Beyond computer vision, potential applications of the proposed method include the growth analysis of plants, as our experiments used a dataset from the plant science and agricultural research fields.
Acknowledgements. This work was supported in part by JST PRESTO Grant Number JPMJPR17O3.