FLOT: Scene Flow on Point Clouds Guided by Optimal Transport

07/22/2020 · by Gilles Puy et al.

We propose and study a method called FLOT that estimates scene flow on point clouds. We start the design of FLOT by noticing that scene flow estimation on point clouds reduces to estimating a permutation matrix in a perfect world. Inspired by recent works on graph matching, we build a method to find these correspondences by borrowing tools from optimal transport. Then, we relax the transport constraints to take real-world imperfections into account. The transport cost between two points is given by the pairwise similarity between deep features extracted by a neural network trained under full supervision on synthetic datasets. Our main finding is that FLOT can perform as well as the best existing methods on synthetic and real-world datasets while requiring far fewer parameters and without using multiscale analysis. Our second finding is that, on the training datasets considered, most of the performance can be explained by the learned transport cost. This yields a simpler method, FLOT_0, which is obtained using a particular choice of optimal transport parameters and performs nearly as well as FLOT.


1 Introduction

Scene flow [vedula_scene_flow] is the 3D motion of points at the surface of objects in a scene. It is one of the low-level cues for scene understanding and can be useful, e.g., in autonomous driving. Its estimation has been studied for several years using different input modalities, such as colour images, with, e.g., variational approaches [basha13], [wedel08] or methods using piecewise-constant priors [ma_2019_CVPR], [menze_scene_flow_15], [vogel13], or using both colour and depth [battrawy_stereo_lidar_flow], [hadfield_kinect], [shao_2018].

In this work, we are interested in scene flow estimation on point clouds using only 3D point coordinates as input (the code is available at https://github.com/valeoai/FLOT). In this setting, [dewan_2016] proposed a technique based on the minimisation of an objective function that favours closeness of matching points, for an accurate scene flow estimate, and local smoothness of this estimate. In [ushani17], 3D occupancy grids are constructed from the point clouds and given as input features to a learned background removal filter and a learned classifier that find matching grid cells. A minimisation problem using these grid matches is then solved to compute a raw scene flow before a final refinement step. In [ushani18a], a similar strategy is proposed but the match between grid cells is done using deep features. In [baur19], [zou_2019], the point clouds are projected onto 2D cylindrical maps and fed to a traditional CNN trained for scene flow estimation. In contrast, FLOT directly consumes point clouds by using convolutions defined on them. The closest related works are discussed in Section 2.

We split scene flow estimation into two successive steps. First, we find soft-correspondences between points of the input point clouds. Second, we exploit these correspondences to estimate the flow. Taking inspiration from recent works on graph matching that use optimal transport to match nodes/vertices in two different graphs [maretic_got], [peyre_gw], [vayer_fgw], we study the use of such tools for finding soft-correspondences between points.

Our network takes as input two point clouds captured in the same scene at two consecutive instants $t$ and $t+1$. We extract deep features at each point using point cloud convolutions and use these features to compute a transport cost between the points at time $t$ and those at time $t+1$. A small cost between two points indicates a likely correspondence between them. In the second step of the method, we exploit these soft-correspondences to obtain a first scene flow estimate by linear interpolation. This estimate is then refined using a residual network. The optimal transport parameters and the networks' parameters are learned by gradient descent under full supervision on synthetic datasets.

Our main contributions are: (a) an optimal transport module for scene flow estimation and the study of its performance; (b) a lightweight architecture that can perform as well as the best existing methods on synthetic and real-world datasets with far fewer parameters and without using multiscale analyses; (c) a simpler method, FLOT_0, obtained for a particular choice of the OT parameters, which achieves competitive results with respect to the state-of-the-art methods. We arrive at this simplified version by noticing that most of the performance of FLOT is explained by the learned transport cost. We also notice that the main module of FLOT can be seen as an attention mechanism. Finally, we discuss, in the conclusion, some limitations of FLOT concerning the absence of explicit treatment of occlusions in the scene.

2 Related Works

Deep Scene Flow Estimation on Point Clouds. In [behl_pointflownet], a deep network is trained end-to-end to estimate rigid motion of objects in LIDAR scans. The closest related works where no assumption of rigidity is made are [gu_hplflownet], [liu_flownet3d], [wang_pccn], [wu_pointpwcnet]. In [wang_pccn], a parametric continuous convolution that operates on data lying on irregular structures is proposed and its efficiency is demonstrated on segmentation tasks and scene flow estimation. The method of [liu_flownet3d] relies on PointNet++ [qi_pointnet] and uses a new flow embedding layer that learns to mix the information of both point clouds to yield accurate flow estimates. In [gu_hplflownet], a technique to perform sparse convolutions on a permutohedral lattice is proposed. This method allows the processing of large point clouds. Furthermore, it is proposed to fuse the information of both point clouds at several scales, unlike in [liu_flownet3d] where the information is fused once at a coarse scale. In contrast, our method fuses the information once, at the finest scale. Let us highlight that our optimal transport module is independent of the type of point cloud convolution. We choose PointNet++ but other convolutions could be used. In [wu_pointpwcnet], PWC-Net [sun_pwcnet] is adapted to work on point clouds. The flow is estimated in a coarse-to-fine fashion, showing improvement over the previous methods. Finally, let us mention that recent works [mittal_selfflow], [wu_pointpwcnet] address this topic using self-supervision. We however restrict ourselves to full supervision in this work.

Graph Matching by Optimal Transport. Our method is inspired by recent works on graph comparison using optimal transport. In [maretic_got], the graph Laplacian is used to map a graph to a multidimensional Gaussian distribution that represents the graph structure. The Wasserstein distance between these distributions is then used as a measure of graph similarity and permits one to match nodes between graphs. In [nikolentzos_graph], each graph is represented as a bag-of-vectors (one vector per node) and the measure of similarity is the Wasserstein distance between these sets. In [peyre_gw], a method building upon the Gromov-Wasserstein distance between metric-measure spaces [memoli_gw] is proposed to compare similarity matrices. This method can be used to compare two graphs by, e.g., representing each of them with a matrix containing the geodesic distances between all pairs of nodes. In [vayer_fgw], it is proposed to compare graphs by fusing the Gromov-Wasserstein distance with the Wasserstein distance. The former is used to compare the graph structures while the latter takes node features into account. In our work, we use the latter distance. A graph is constructed for each point cloud by connecting each point to its nearest neighbours. We then propose a method to train a network that extracts deep features for each point, and use these features to match points between point clouds in our optimal transport module.

Algorithm Unrolling. Our method is based on the algorithm unrolling technique, which consists in taking an iterative algorithm, unrolling a fixed number of its iterations, and replacing part of the matrix multiplications/convolutions in these unrolled iterations by new ones trained specifically for the task at hand. Several works build on this technique, such as [gregor_lista], [mardani_prox], [metzler_amp], [mousavi_invert] to solve linear inverse problems, or [chen_diffusion], [liu_rare], [meinhart_prox], [wang_2016] for image denoising (where the denoiser is sometimes used to solve yet another inverse problem). In this work, we unroll a few iterations of the Sinkhorn algorithm and train the cost matrix involved in it. This matrix is trained so that the resulting transport plan provides a good scene flow estimate. Let us mention that this algorithm is also unrolled, e.g., in [genevay_generative] to train a deep generative network, and in [sarlin_feat_matching] for image feature assignment.

3 Method

3.1 Step 1: Finding Soft-Correspondences between Points

Let $p \in \mathbb{R}^{n \times 3}$ and $q \in \mathbb{R}^{m \times 3}$ be two point clouds of the same scene at two consecutive instants $t$ and $t+1$. The vectors $p_i, q_j \in \mathbb{R}^3$ are the coordinates of the $i$-th and $j$-th points of $p$ and $q$, respectively. The scene flow estimation problem on point clouds consists in estimating the scene flow $f \in \mathbb{R}^{n \times 3}$, where $f_i \in \mathbb{R}^3$ is the translation of $p_i$ from $t$ to $t+1$.

3.1.1 Perfect World.

Figure 1: The point clouds $p$ and $q$ go through the feature extractor $g$, which outputs a feature for each input point. These features (black arrows) go into our proposed OT module, where they are used to compute the pairwise similarities between each pair of points $(p_i, q_j)$. The output of the OT module is a transport plan $T$ which informs us on the correspondences between the points of $p$ and $q$. This information permits us to compute a first scene flow estimate $\tilde{f}$, which is refined by the residual network $h$ to obtain $f_{\mathrm{est}}$. The convolution layers (conv) are based on PointNet++ [qi_pointnet] but the OT module could accept the output of any other point cloud convolution. The dashed blue arrows indicate that the point coordinates are passed to each layer to be able to compute convolutions on points.

We construct FLOT starting in the perfect world where $m = n$ and where there exists a permutation matrix $P \in \{0,1\}^{n \times n}$ such that $q = P(p + f)$. The role of FLOT is to estimate this permutation matrix without the knowledge of $f$. In order to do so, we use tools from optimal transport. We interpret the motion of the points as a displacement of mass between time $t$ and $t+1$. Each point $p_i$ in the first point cloud is attributed a mass which we fix to $n^{-1}$. Each point $q_j$ then receives the mass $n^{-1}$ from $p_i$ if $q_j = p_i + f_i$, or, equivalently, if $P_{ji} = 1$. We propose to estimate the permutation matrix $P$ by computing a transport plan $T$ from $p$ to $q$ which satisfies

$T^\ast \in \operatorname{arg\,min}_{T \in \mathbb{R}_+^{n \times n}} \sum_{i,j=1}^{n} C_{ij} T_{ij} \quad \text{subject to} \quad T \mathbf{1} = n^{-1} \mathbf{1}, \quad T^\top \mathbf{1} = n^{-1} \mathbf{1}, \qquad (1)$

where $\mathbf{1}$ is the vector with all entries equal to $1$, and $C_{ij} \geq 0$ is the displacement cost from point $p_i$ to point $q_j$ [peyre_cot]. Each scalar entry $T_{ij}$ of the transport plan represents the mass that is transported from $p_i$ to $q_j$.

The first constraint in (1) imposes that the mass $n^{-1}$ of each point $p_i$ is entirely distributed over some of the points in $q$. The second constraint imposes that each point $q_j$ receives exactly a mass $n^{-1}$ from some of the points in $p$. No mass is lost during the transfer. Note that in the hypothetical case where the cost matrix $C$ would contain one zero entry per line and per column, the transport plan is null everywhere except on these entries and the mass constraints are immediately satisfied via a simple scaling of the transport plan. In this hypothetical situation, the mass constraints would be redundant for our application, as it would have been enough to find the zero entries of $C$ to estimate $P$. It is important to note that the mass constraints play a role in the more realistic situation where “ambiguities” are present in $C$, by ensuring that each point gives/receives a mass $n^{-1}$, and hence that each point in $p$ has at least one corresponding point in $q$ and vice-versa.
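To make this remark concrete, the following small numerical check (a sketch, not part of the method: SciPy's assignment solver stands in for an exact solver of (1), and the toy cost is random) shows that a cost with one zero per line and per column lets (1) recover the permutation:

import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(0)
n = 6
perm = rng.permutation(n)              # ground truth: q_{perm[i]} = p_i + f_i
P = np.zeros((n, n))
P[perm, np.arange(n)] = 1.0            # P[j, i] = 1 iff q_j = p_i + f_i

# A cost with exactly one zero per line and per column, at the matching entries.
C = rng.uniform(0.1, 1.0, size=(n, n))
C[np.arange(n), perm] = 0.0

# With uniform masses 1/n, (1) is a scaled assignment problem (Birkhoff's
# theorem), solved exactly by the Hungarian algorithm.
rows, cols = linear_sum_assignment(C)
T = np.zeros((n, n))
T[rows, cols] = 1.0 / n

assert np.allclose(T, P.T / n)         # recovers T* = n^{-1} P^T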

We note that $n^{-1} P^\top$ satisfies the optimal transport constraints. We now need to construct $C$ so that $T^\ast = n^{-1} P^\top$.

3.1.2 Real World and Fast Estimation of $T$.

Input: cost matrix $C$; parameters $\varepsilon, \lambda$.
Output: transport plan $T$.
$U \leftarrow \exp(-C/\varepsilon)$;
$b \leftarrow \mathbf{1}$;
for $k = 1, \ldots, K$ do
       $a \leftarrow \big( n^{-1}\mathbf{1} \oslash (U b) \big)^{\lambda/(\lambda+\varepsilon)}$;
       $b \leftarrow \big( n^{-1}\mathbf{1} \oslash (U^\top a) \big)^{\lambda/(\lambda+\varepsilon)}$;
end for
$T \leftarrow a \odot U \odot b^\top$;
Algorithm 1 Optimal transport module. The symbols $\oslash$ and $\odot$ denote the element-wise division and multiplication, respectively.
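For concreteness, the following PyTorch sketch implements Alg. 1 as a differentiable module (variable names are ours; the update with the power $\lambda/(\lambda+\varepsilon)$ follows the unbalanced Sinkhorn iterations of [chizat_unbalancedot]):

import torch

def ot_module(C, eps, lam, K=3):
    # C: (n, m) cost matrix; eps, lam: positive scalars given as tensors so
    # that gradients can also flow to them; K: number of unrolled iterations.
    n, m = C.shape
    U = torch.exp(-C / eps)                  # element-wise kernel
    power = lam / (lam + eps)                # -> 1 for strict mass preservation
    a = C.new_full((n, 1), 1.0 / n)          # uniform input masses
    b = C.new_full((m, 1), 1.0 / m)          # uniform output masses
    v = C.new_ones((m, 1))
    for _ in range(K):                       # unrolled, hence differentiable
        u = (a / (U @ v)) ** power           # element-wise division and power
        v = (b / (U.T @ u)) ** power
    return u * U * v.T                       # T = diag(u) U diag(v)

C = torch.rand(256, 256, requires_grad=True)
T = ot_module(C, eps=torch.tensor(0.03), lam=torch.tensor(10.0))
T.sum().backward()                           # gradients reach C (and eps, lam)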

In the real world, the equality $q = P(p + f)$ does not hold because the surfaces are not sampled at the same physical locations at $t$ and $t+1$, and because objects can (dis)appear due to occlusions. A consequence of these imperfections is that the mass preservation in (1) does not hold exactly: mass can (dis)appear. One solution to circumvent this issue is to relax the constraints in (1). Instead of solving (1), we propose to solve

$T^\ast \in \operatorname{arg\,min}_{T \in \mathbb{R}_+^{n \times n}} \sum_{i,j=1}^{n} C_{ij} T_{ij} \; + \; \varepsilon \sum_{i,j=1}^{n} T_{ij} \left( \log T_{ij} - 1 \right) \; + \; \lambda \, \mathrm{KL}\!\left( T \mathbf{1}, \, n^{-1}\mathbf{1} \right) \; + \; \lambda \, \mathrm{KL}\!\left( T^\top \mathbf{1}, \, n^{-1}\mathbf{1} \right), \qquad (2)$

where $\varepsilon, \lambda \geq 0$ and $\mathrm{KL}$ denotes the KL-divergence. The second term in (2) is an entropic regularisation on the transport plan. Its main purpose, in our case, is to allow the use of an efficient algorithm to estimate the transport plan: the Sinkhorn algorithm [cuturi_sinkhorn]. The version of this algorithm for the optimal transport problem (2) is derived in [chizat_unbalancedot] and is presented in Alg. 1. The parameter $\varepsilon$ controls the amount of entropic regularisation. The smaller $\varepsilon$ is, the sparser the transport plan is, hence finding sparse correspondences between $p$ and $q$. The regularisation parameter $\lambda$ adjusts how much the transported mass deviates from the uniform distribution, allowing mass variation. One could let $\lambda \to +\infty$ to impose strict mass preservation.

Note that the mass regularisation is controlled by the power $\lambda/(\lambda+\varepsilon)$ in Alg. 1. This power tends to $1$ when $\lambda \to +\infty$, imposing strict mass preservation, and reaches $0$ in the absence of any regularisation, i.e., when $\lambda = 0$. Instead of fixing the parameters $\varepsilon$ and $\lambda$ in advance, we let these parameters free and learn them by gradient descent along with the other networks' parameters.

We would like to recall that, in the perfect world, it is not necessary for the power $\lambda/(\lambda+\varepsilon)$ to reach $1$ to yield accurate results, as the final quality is also driven by the quality of $C$. In a perfect situation where the cost would be perfectly trained, with a bijective mapping already encoded in $C$ by its zero entries, any amount of mass regularisation is sufficient to reach accurate results. This follows from our remark at the end of the previous subsection, but also from the discussion in the subsection below on the role of $C$ and the mass regularisation. In a real situation, the cost $C$ is not perfectly trained and we expect the power to vary in the range $[0, 1]$ after training, reaching values closer to $1$ when trained in a perfect-world setting and closer to $0$ in the presence of occlusions.

3.1.3 Learning the Transport Cost.

An essential ingredient in (2) is the cost matrix $C$, where each entry $C_{ij}$ encodes the similarity between the points $p_i$ and $q_j$. An obvious choice could be to take the Euclidean distance between each pair of points, i.e., $C_{ij} = \|q_j - p_i\|_2$, but this choice does not yield accurate results. In this work, we propose to learn the displacement costs by training a deep neural network $g$ that takes a point cloud as input and outputs a feature of size $c$ for each input point. Let $g_i$ and $g'_j$ denote the features extracted at points $p_i$ and $q_j$, respectively. The entries of the cost matrix are then defined using the cosine distance between these features:

$C_{ij} \;=\; 1 - \frac{g_i^\top g'_j}{\|g_i\|_2 \, \|g'_j\|_2}. \qquad (3)$

The more similar the features $g_i$ and $g'_j$ are, the lower the cost of transporting a unit mass from $p_i$ to $q_j$. The indicator function

$\mathbb{1}_{\{\|q_j - p_i\|_2 \,\leq\, r\}} \;=\; \begin{cases} 1 & \text{if } \|q_j - p_i\|_2 \leq r, \\ 0 & \text{otherwise}, \end{cases} \qquad (4)$

is used to prevent the algorithm from finding correspondences between points too far away from each other, by restricting the admissible correspondences to pairs of points less than $r$ apart. We set $r = 10$ m.
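As an illustration, the cost of (3)-(4) can be computed as follows (a sketch: how the radius gate of (4) enters the OT module is an implementation choice, realised here by assigning a prohibitively large cost to pairs farther than r apart):

import torch
import torch.nn.functional as F

def transport_cost(feat_p, feat_q, p, q, r=10.0, big=1e8):
    # feat_p: (n, c), feat_q: (m, c) learned features; p: (n, 3), q: (m, 3).
    fp = F.normalize(feat_p, dim=1)          # unit-norm features
    fq = F.normalize(feat_q, dim=1)
    C = 1.0 - fp @ fq.T                      # cosine distance, eq. (3)
    too_far = torch.cdist(p, q) > r          # complement of the indicator (4)
    return C.masked_fill(too_far, big)       # forbid distant correspondences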

In order to train the network $g$, we adopt the same strategy as, e.g., in [genevay_generative] to train generative models, or in [sarlin_feat_matching] to match image features. The strategy consists in unrolling $K$ iterations of Alg. 1. These unrolled iterations constitute our OT module in Fig. 1. One can remark that the gradients can backpropagate through each step of this module, which allows us to train $g$.

3.1.4 On the Role of $C$ and of the Mass Regularisation.

We gather in this paragraph the earlier discussions on the role of $C$ and of the mass regularisation. For the sake of the explanation, we come back to the perfect-world setting and consider (1). In this ideal situation, one could further dream that it is possible to train $g$ perfectly, such that $C_{ij}$ is null for matching points, i.e., when $q_j = p_i + f_i$, and strictly positive otherwise. The transport plan $n^{-1} P^\top$ would then satisfy (1) with a null transport cost. However, one should note that the solution would be entirely encoded in $C$, up to a global scaling factor: the non-zero entries of $T^\ast$ are at the zero entries of $C$. In that case, the mass transport constraints only adjust the scale of the entries in $T^\ast$. Such a perfect scenario is unlikely to occur, but these considerations highlight that the cost matrix could be exploited alone and could maybe be sufficient to find the appropriate correspondences between $p$ and $q$ for scene flow estimation. The mass transport regularisation plays a role in the more realistic case where ambiguities appear in $C$. The regularisation enforces, whatever the quality of $C$ and with a “strength” controlled by $\lambda$, that the mass is distributed as uniformly as possible over all points. This avoids that some points in $p$ are left with no matching point in $q$, and vice-versa.

3.1.5 FLOT_0.

FLOT_0 is a version of FLOT where only the cost matrix $C$ is exploited to find the correspondences between $p$ and $q$. This method is obtained by removing the mass transport regularisation in (2), i.e., by setting $\lambda = 0$. In this limit, the “transport plan” satisfies

$T^\ast \;=\; \exp\left( -C/\varepsilon \right). \qquad (5)$

$T^\ast$ is then used in the rest of the method as if it were the output of Alg. 1.

3.2 Step 2: Flow Estimation from Soft-Correspondences

We obtained, in the previous step, a transport plan $T$ that gives correspondences between the points of $p$ and $q$. Our goal now is to exploit these correspondences to estimate the flow. As before, it is convenient to start in the perfect world and consider (1). In this setting, we have seen that $n^{-1} P^\top$ satisfies the transport constraints and that, if $C$ is well trained, we expect $T^\ast = n^{-1} P^\top$. Therefore, an obvious estimate of the flow is

$\tilde{f}_i \;=\; \left( P^\top q \right)_i - \, p_i \;=\; \frac{\sum_{j=1}^{n} T^\ast_{ij} \, q_j}{\sum_{j=1}^{n} T^\ast_{ij}} - p_i, \qquad (6)$

where we exploited the fact that $\sum_{j=1}^{n} T^\ast_{ij} = n^{-1}$ in the last equality.

In the real world, the first equality in (6) does not hold. Yet, the last expression in (6) remains a sensible first estimate of the flow. Indeed, this computation is equivalent to computing, for each point $p_i$, a corresponding virtual point that is a barycentre of some points in $q$. The larger the mass transported from $p_i$ to $q_j$ is, the larger the contribution of $q_j$ to this virtual point is. The difference between this virtual point and $p_i$ gives an estimate of the flow $f_i$. This virtual point is a “guess” on the location of $p_i + f_i$, made knowing where the mass from $p_i$ is transported in $q$.

However, we remark that the flow estimated in (6) is necessarily still imperfect, as it is highly likely that some points in $p + f$ cannot be expressed as barycentres of the found corresponding points in $q$. Indeed, some portions of objects visible in $p$ might not be visible any more in $q$ due to the finite resolution of the point cloud sampling. The flow in these missing regions cannot be reconstructed from $q$ but has to be reconstructed using structural information available in $p$, relying on neighbouring information from the well-sampled regions. Therefore, we refine the flow using a residual network:

$f_{\mathrm{est}} \;=\; \tilde{f} + h\big( \tilde{f} \big), \qquad (7)$

where $h$ takes as input the estimated flow $\tilde{f}$ and uses convolutions defined on the point cloud $p$.
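In code, the interpolation (6) and the refinement (7) amount to a few lines (a sketch: the stand-in h below is a pointwise MLP, whereas the actual h uses point cloud convolutions on p):

import torch

def flow_from_plan(T, p, q):
    # Eq. (6): barycentre of the q_j weighted by the transported mass, minus p.
    virtual = (T @ q) / T.sum(dim=1, keepdim=True)   # guessed location of p_i at t+1
    return virtual - p

p, q = torch.rand(128, 3), torch.rand(128, 3)
T = torch.softmax(-torch.cdist(p, q) / 0.03, dim=1)  # any transport plan works here
f_tilde = flow_from_plan(T, p, q)

h = torch.nn.Sequential(torch.nn.Linear(3, 32), torch.nn.ReLU(), torch.nn.Linear(32, 3))
f_est = f_tilde + h(f_tilde)                         # residual refinement, eq. (7)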

Let us finally conclude this section by highlighting the fact that, in the case of FLOT_0, (6) simplifies to

$\tilde{f}_i \;=\; \frac{\sum_{j=1}^{n} \exp\left( -C_{ij}/\varepsilon \right) q_j}{\sum_{j=1}^{n} \exp\left( -C_{ij}/\varepsilon \right)} - p_i. \qquad (8)$

One can remark that the OT module essentially reduces to an attention mechanism [attention_2017] in that case. The attention mechanism is thus a particular case of FLOT where the entropic regularisation $\varepsilon$ plays the role of the softmax temperature. Let us mention that similar attention layers have been shown effective in related problems such as rigid registration [wang_cycle], [dcp_19], [prnet_19].
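The equivalence with attention can be checked numerically (a sketch with a random cost standing in for the learned one):

import torch

C = torch.rand(64, 64)                               # stand-in for the learned cost
q = torch.rand(64, 3)
eps = 0.03

T = torch.exp(-C / eps)                              # FLOT_0 plan, eq. (5)
barycentres = (T @ q) / T.sum(dim=1, keepdim=True)   # barycentre term of (8)
attention = torch.softmax(-C / eps, dim=1) @ q       # softmax attention over q
assert torch.allclose(barycentres, attention, atol=1e-5)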

3.3 Training

The networks' parameters, denoted by $\theta$, and the OT parameters $\varepsilon$ and $\lambda$ are trained jointly under full supervision on annotated synthetic datasets. Note that, to enforce the positivity of $\varepsilon$ and $\lambda$, we learn their logarithms. A small constant offset is applied to $\varepsilon$ to avoid numerical instabilities in the exponential function during training.
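A minimal sketch of this parameterisation (the offset value 0.03 below is an assumed placeholder):

import torch

log_eps = torch.nn.Parameter(torch.zeros(()))    # learn log values so that
log_lam = torch.nn.Parameter(torch.zeros(()))    # eps and lam stay positive

def ot_params(offset=0.03):                      # assumed offset; keeps exp(-C/eps) finite
    return torch.exp(log_eps) + offset, torch.exp(log_lam)

eps, lam = ot_params()                           # feed these to the OT module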

The sole training loss is the $\ell_1$-norm between the ground truth flow $f$ and the estimated flow $f_{\mathrm{est}}$:

$\mathcal{L}(\theta, \varepsilon, \lambda) \;=\; \big\| M \left( f_{\mathrm{est}} - f \right) \big\|_1, \qquad (9)$

where $M$ is a diagonal matrix encoding an annotated mask used to remove the points where the flow is occluded.
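A sketch of the loss, with the mask stored as a 0/1 vector (the diagonal of M) and normalisation by the number of valid points, a choice we assume here:

import torch

def training_loss(f_est, f_gt, mask):
    # f_est, f_gt: (n, 3) flows; mask: (n,) with 1 on valid points and 0 on
    # occluded ones (the diagonal of M in eq. (9)).
    per_point = torch.abs(f_est - f_gt).sum(dim=1)        # l1 norm per point
    return (mask * per_point).sum() / mask.sum().clamp(min=1)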

The batch size is adapted to the number of points $n$ per example (one value at $n = 2048$, another at $n = 8192$). We use Adam [kingma_adam]; the starting learning rate is kept constant unless specified otherwise in Section 4.

3.4 Similarities and Differences with Existing Techniques

A first main difference between FLOT and [gu_hplflownet], [liu_flownet3d], [wu_pointpwcnet] is the number of parameters, which is much smaller for FLOT (see Table 2). Another difference is that we do not use any downsampling or upsampling layers. Unlike [gu_hplflownet], [wu_pointpwcnet], we do not use any multiscale analysis to find the correspondences between points. The information between point clouds is mixed only once, as in [liu_flownet3d], but at the finest sampling resolution and without using skip connections between $p$ and $q$.

We also note that [gu_hplflownet], [liu_flownet3d], [wu_pointpwcnet] rely on an MLP or a convnet applied to concatenated input features to mix the information between both point clouds. The mixing function is learned and thus not explicit, which makes it harder to find how the correspondences are effectively computed, i.e., to identify what input information is kept or discarded. In contrast, the mixing function in FLOT is explicit, with only two scalars $\varepsilon$ and $\lambda$ adjusted to the training data, and whose roles are clearly identified in the OT problem (2). The core of the OT module is a simple cross-correlation between input features, which is a module easy to interpret, study, and visualise. Finally, among all the functions that the convnets/MLPs in [gu_hplflownet], [liu_flownet3d], [wu_pointpwcnet] can approximate, it is unlikely that the learned mixing function actually approximates the Sinkhorn algorithm, or an attention layer, without further guidance than that of the training data.

4 Experiments

4.1 Datasets

Table 1: Performance of FLOT on the validation sets of FT3D_e, FT3D_s, and FT3D_o for K = 1, 3, 5, with flow refinement (top), and performance measured at the output of the OT module, i.e., before refinement by $h$, on FT3D_e and FT3D_s (bottom). The corresponding performance on FT3D_o is in the supplementary material. Columns report EPE, AS, AR, and Out.; we report average scores and, between parentheses, their standard deviations. Please refer to Section 4.3 for more details.

As in related works, we train our network under full supervision using FlyingThings3D [mayer_ft3d] and test it on FlyingThings3D and KITTI Scene Flow [menze_scene_flow_15], [menze_scene_flow_18]. However, neither dataset provides point clouds directly; this information needs to be extracted from the original data. There are at least two slightly different ways of extracting these 3D data, and we report results for both versions for a better assessment of the performance. The first version of the datasets is prepared as in [gu_hplflownet] (code and pretrained model available at https://github.com/laoreja/HPLFlowNet). No occluded point remains in the processed point clouds. We denote these datasets FT3D_s and KITTI_s. The second version of the datasets is the one prepared by [liu_flownet3d] (code and datasets available at https://github.com/xingyul/flownet3d), denoted FT3D_o and KITTI_o. These datasets contain points where the flow is occluded. These points are present at the input and output of the networks but are not taken into account to compute the training loss (9) nor the performance metrics, as in [liu_flownet3d]. Further information about the datasets is given in the supplementary material. Note that we keep aside examples from the original training sets of FT3D_s and FT3D_o as validation sets, which are used in Section 4.3.

4.2 Performance Metrics

We use the four metrics adopted in [gu_hplflownet], [liu_flownet3d], [wu_pointpwcnet]: the end point error EPE; two measures of accuracy, denoted by AS and AR, computed with different thresholds on the EPE; and a percentage of outliers, also computed using a threshold on the EPE. The definition of these metrics is recalled in the supplementary material.

Let us highlight that the performance reported on KITTI_s and KITTI_o is obtained by using the models trained on FT3D_s and FT3D_o, respectively, without fine-tuning. We do not adapt the models for any of the methods. We nevertheless make sure that the axes are in correspondence for all datasets.

4.3 Study of FLOT

We use FT3D_e, FT3D_s and FT3D_o to check what values the OT parameters reach after training, to study the effect of $K$ on FLOT's performance, and to compare FLOT with FLOT_0. FT3D_e is exactly the same dataset as FT3D_s except that we enforce $q = P(p + f)$ when sampling the points, to simulate the perfect-world setting. The sole role of this ideal dataset is to confirm that the OT model holds in the perfect world, the starting point of our design.

For these experiments, training is done at $n = 2048$ points for a fixed number of epochs and takes about 9 hours. Each model is trained several times, starting from different random draws of $\theta$, to take into account variations due to the initialisation. Evaluation is performed at $n = 2048$ on the validation sets. Note that the points are drawn at random also at validation time. To take this variability into account, validation is performed several times with different draws of the points for each trained model. For each score and model, we thus have access to a collection of values, whose mean and standard deviation are reported in Table 1. We present the scores obtained before and after refinement by $h$.

First, we notice that $\varepsilon$ reaches its smallest admissible value for all models after training. We recall that a constant offset is applied to $\varepsilon$ to prevent numerical errors occurring in the exponential function of Alg. 1 when $\varepsilon$ reaches too small a value. Hence, the entropic regularisation, or, equivalently, the temperature in FLOT_0, reaches its smallest possible value. Such small values favour sparse transport plans $T$, yielding sparse correspondences between $p$ and $q$. An illustration of these sparse correspondences is provided in Fig. 2. We observe that the correspondences are accurate and that the mass is well concentrated around the target points, especially when these points are near corners of the object.

Figure 2: Illustration of the correspondences found by FLOT, trained on FT3D_s (see Section 4.4), between $p$ and $q$ in two different scenes of KITTI_s. We isolated one car in each of the scenes for better visualisation. The point cloud captured at time $t+1$ is represented in orange. The lines show the correspondence between a query point $p_i$ and the point $q_{j^\ast}$ in $q$ on which most of the mass is transported: $j^\ast = \arg\max_j T_{ij}$. The colourmap on $q$ represents the values in the corresponding row of $T$, where yellow corresponds to $0$ and blue indicates the maximum entry, and shows how the mass is concentrated around $q_{j^\ast}$.
Table 2: Performance on FT3D_s and KITTI_s for FlowNet3D [liu_flownet3d], HPLFlowNet [gu_hplflownet], FLOT (K = 1), and PointPWC-Net [wu_pointpwcnet] (columns: EPE, AS, AR, Out., and model size in MB). The scores of FlowNet3D and HPLFlowNet are obtained from [gu_hplflownet]. We also report the scores of PointPWC-Net available in [wu_pointpwcnet], as well as those obtained using the official implementation. Italic entries are for methods publicly available but not yet published at submission time.

Second, the power $\lambda/(\lambda+\varepsilon)$, which controls the mass regularisation, reaches higher values on FT3D_e than on FT3D_o. This is the expected behaviour, as FT3D_e contains no imperfection while FT3D_o contains occlusions. The values reached on FT3D_s are in between those reached on FT3D_e and FT3D_o. This is also the expected behaviour, as FT3D_s is free of occlusions and the only imperfections are the different samplings of the scene at $t$ and $t+1$.

Third, on FT3D_e, FLOT reduces the EPE compared to FLOT_0, which nevertheless already yields good results. Increasing $K$ further reduces the error, which then stabilises. This validates the OT model in the perfect-world setting: the OT optimum and the perfect-world optimum coincide.

Fourth, on FT3D_s and FT3D_o, the average scores are better for FLOT than for FLOT_0, except for two metrics on FT3D_o. The nevertheless good performance of FLOT_0 indicates that most of the performance is due to the trained transport cost $C$. On FT3D_s and FT3D_o, increasing $K$ has less impact on the EPE than on FT3D_e, and we even detect a slight decrease of performance at the largest $K$. The OT model (2) can only be an approximate model of the (simulated) real world: the real-world optimum and the OT optimum do not coincide. Increasing $K$ brings us closer to the OT optimum, but not necessarily closer to the real-world optimum. $K$ thus becomes a hyper-parameter that should be adjusted. In the following experiments, we use $K = 1$ or $K = 3$.

Finally, the absence of $h$ has no effect on the performance on FT3D_e, with FLOT still performing better than FLOT_0. This shows that the OT module is able to estimate the ideal permutation matrix accurately on its own and that the residual network $h$ is not needed in this ideal setting. However, $h$ plays an important role on the more realistic datasets FT3D_s and FT3D_o, where the EPE is clearly reduced when it is present.

4.4 Performance on FT3D_s and KITTI_s

We compare the performance achieved by FLOT and the alternative methods on FT3D_s and KITTI_s in Table 2. We train FLOT using $n = 8192$ points, as in [gu_hplflownet], [wu_pointpwcnet]. The learning rate is kept at its starting value for the first epochs before being divided by 10, and training then continues for a few more epochs.

The scores of FlowNet3D and HPLFlowNet are obtained directly from [gu_hplflownet]. We report the scores of PointPWC-Net available in [wu_pointpwcnet], as well as the better scores we obtained using the associated code and pretrained model (available at https://github.com/DylanWusee/PointPWC). The model sizes are obtained from the supplementary material of [liu_flownet3d] for FlowNet3D, and from the pretrained models provided by [gu_hplflownet] and [wu_pointpwcnet]. HPLFlowNet, PointPWC-Net and FLOT contain 19 M, 7.7 M, and 0.11 M parameters, respectively.

FLOT performs better than FlowNet3D and HPLFlowNet on both FT3D_s and KITTI_s. FLOT achieves an EPE slightly better than or similar to that of PointPWC-Net on these datasets. However, PointPWC-Net achieves better accuracy and has fewer outliers. FLOT is the method that uses the fewest trainable parameters (about 70 times fewer than PointPWC-Net).

We illustrate in Fig. 3 the quality of the scene flow estimation on two scenes of KITTI_s. We notice that FLOT aligns all the objects correctly. We also remark that the flow estimated at the output of the OT module is already of good quality, even though the performance scores are improved after refinement.

4.5 Performance on FT3D_o and KITTI_o

Table 3: Performance on FT3D_o and KITTI_o for FlowNet3D [liu_flownet3d], FLOT_0, and FLOT at two values of K (columns: EPE, AS, AR, Out.).

We present another comparison between FlowNet3D and FLOT using FT3D_o and KITTI_o, originally used in [liu_flownet3d]. We train FlowNet3D using the associated official implementation. We train FLOT_0 and FLOT on $n = 2048$ points, keeping the learning rate at its starting value for the first epochs before dividing it by 10 and continuing training for a few more epochs.

The performance of both methods is reported in Table 3. We notice that FLOT_0 and FLOT achieve a better accuracy than FlowNet3D, with a clear improvement of AS on FT3D_o and on KITTI_o. The number of outliers is reduced by a similar amount. On FT3D_o, FLOT at $K = 3$ performs the best, with FLOT_0 close behind. On KITTI_o, the best performing models are those of FLOT_0 and FLOT at $K = 1$.

The reader can remark that the results of FlowNet3D are similar to those reported in [liu_flownet3d] on FT3D_o but worse on KITTI_o. The evaluation on KITTI_o is done differently in [liu_flownet3d]: the scene is divided into chunks and the scene flow is estimated within each chunk before a global aggregation. In the present work, we keep the evaluation method consistent with that of Section 4.4 by following the same procedure as in [gu_hplflownet], [wu_pointpwcnet]: the trained model is evaluated by processing the full scene in one pass, using $n$ random points from the scene.

Figure 3: Two scenes from KITTI_s. For each scene, the panels overlay, in orange, the input $p$, the ground truth $p + f$, the estimate $p + \tilde{f}$, and the refined $p + f_{\mathrm{est}}$ on the input $q$ shown in blue, using FLOT.

5 Conclusion

We proposed and studied a method for scene flow estimation built using optimal transport tools. It achieves performance similar to that of the best performing methods while requiring far fewer parameters. We also showed that the learned transport cost is responsible for most of the performance. This yields a simpler method, FLOT_0, which performs nearly as well as FLOT.

We also noticed that the presence of occlusions affects the performance of FLOT negatively. The proposed relaxation of the mass constraints in (2) permits us to limit the impact of these occlusions on the performance but does not handle them explicitly. There is thus room for improvement by detecting occlusions, e.g., by analysing the effective transported mass, and treating them explicitly.

Appendix 0.A Network architectures

Table 4: Architecture of $g$ and $h$: a sequence of convolution layers, each defined by its MLP sizes, followed by a final linear layer that is used in $h$ only.

The convolutions used in $g$ and $h$ are based on PointNet++ [qi_pointnet] in our implementation. Each convolution layer $\ell$ takes as inputs the point cloud $x$ on which the convolutions are performed and the features $u_i^{(\ell-1)}$, $i = 1, \ldots, n$, coming from the previous layer $\ell - 1$. Note that these features are simply the point coordinates at the input of $g$ and the estimated flow $\tilde{f}$ at the input of $h$. For each point $x_i$, the indices $\mathcal{N}(i)$ of the $k$ nearest neighbours of $x_i$ in $x$ are computed, to obtain $k$ features at point $x_i$, each one satisfying

$v_{ij} \;=\; \left[\, u_j^{(\ell-1)} ;\; x_j - x_i \,\right], \quad j \in \mathcal{N}(i). \qquad (10)$

These features are passed through an MLP consisting of a series of a fully connected layer, an instance normalisation layer with affine correction [ulyanov_instance], and a leaky ReLU with a small negative slope, repeated in the same order. Finally, the new feature at point $x_i$ is obtained after passing through a final max-pooling layer:

$u_i^{(\ell)} \;=\; \max_{j \in \mathcal{N}(i)} \, \mathrm{MLP}\left( v_{ij} \right), \qquad (11)$

where the $\max$ is computed independently for each of the feature channels. These computations are repeated for each point of the point cloud using the same MLP. The networks $g$ and $h$ share the same architecture, which is given in Table 4. Note nevertheless that the weights are not shared between $g$ and $h$.
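The following sketch summarises one such layer of (10)-(11) (instance normalisation is omitted for brevity, and the negative slope and layer sizes are placeholders):

import torch

class SetConv(torch.nn.Module):
    # PointNet++-style layer: gather the k nearest neighbours, apply a shared
    # MLP to relative features, max-pool per point and per channel.
    def __init__(self, in_dim, out_dim, k=32):
        super().__init__()
        self.k = k
        self.mlp = torch.nn.Sequential(
            torch.nn.Linear(in_dim + 3, out_dim), torch.nn.LeakyReLU(0.1),
            torch.nn.Linear(out_dim, out_dim), torch.nn.LeakyReLU(0.1),
        )

    def forward(self, x, feats):
        # x: (n, 3) points; feats: (n, c) features from the previous layer.
        idx = torch.cdist(x, x).topk(self.k, largest=False).indices  # (n, k)
        rel = x[idx] - x[:, None, :]               # relative coordinates
        v = torch.cat([feats[idx], rel], dim=-1)   # neighbour features, eq. (10)
        return self.mlp(v).max(dim=1).values       # channel-wise max pool, eq. (11)

layer = SetConv(in_dim=3, out_dim=64)
features = layer(torch.rand(256, 3), torch.rand(256, 3))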

Appendix 0.B Datasets

The datasets FT3D_s and KITTI_s are prepared as in [gu_hplflownet] (code available at https://github.com/laoreja/HPLFlowNet). No occluded point remains in the processed point clouds: at the full sampling rate, one can always find a point in $q$ such that $q_j = p_i + f_i$. However, in practice, most of the points do not have a direct match in $q$, as both point clouds are randomly and independently sub-sampled to keep only $n$ points. This simulates different samplings of the scene. Nevertheless, no object appears or disappears because of occlusions between $t$ and $t+1$. FT3D_s contains 19,640 training examples, from which we keep a subset aside for validation, and 3,824 test examples. KITTI_s contains 200 examples, of which 142 are used for test, as in [gu_hplflownet]; we do not use the remaining KITTI examples. The ground points in KITTI_s are removed using a threshold on the height. All points whose depth is larger than 35 m are removed in both datasets.

The datasets FT3D_o and KITTI_o are the ones prepared by [liu_flownet3d] (datasets available at https://github.com/xingyul/flownet3d). In FT3D_o, masks indicating where the flow is non-valid, e.g., due to occlusions, are provided and used in the training loss, as in [liu_flownet3d]. These masks are also used at test time to compute the scores on valid points only, for all methods. However, the points where the corresponding flow is non-valid are present at the input of all networks. No mask is provided for KITTI_o. FT3D_o contains about 20,000 training examples, from which we keep a subset aside for validation, and test examples from the original test split (we removed the few examples where all points are marked as occluded, as well as one training example containing a non-valid value). KITTI_o contains 150 test examples. The ground points in KITTI_o are removed by [liu_flownet3d]. All points whose depth is larger than 35 m are removed in both datasets.

Appendix 0.C Performance metrics

We use the following four metrics, adopted in [gu_hplflownet], [liu_flownet3d], [wu_pointpwcnet], where $e_i = \| f_{\mathrm{est},i} - f_i \|_2$ denotes the end point error at point $p_i$ and $e_i / \|f_i\|_2$ the corresponding relative error:

  • EPE: the end point error $e_i$ averaged over all points;

  • AS: the percentage of points such that $e_i < 0.05$ m or $e_i / \|f_i\|_2 < 5\%$;

  • AR: the percentage of points such that $e_i < 0.1$ m or $e_i / \|f_i\|_2 < 10\%$;

  • Out.: the percentage of points such that $e_i > 0.3$ m or $e_i / \|f_i\|_2 > 10\%$.

The above metrics are computed as follows. The point clouds are obtained by selecting $n$ random points out of the points provided in the datasets. The flow is estimated and compared to the ground truth flow on these selected points. The scores are averaged over the whole validation/test set.
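A sketch of the metric computation, following the definitions above:

import numpy as np

def scene_flow_metrics(f_est, f_gt):
    # f_est, f_gt: (n, 3) estimated and ground truth flows on the selected points.
    err = np.linalg.norm(f_est - f_gt, axis=1)           # end point error per point
    rel = err / np.maximum(np.linalg.norm(f_gt, axis=1), 1e-12)
    return {
        "EPE": err.mean(),
        "AS": np.mean((err < 0.05) | (rel < 0.05)),      # strict accuracy
        "AR": np.mean((err < 0.1) | (rel < 0.1)),        # relaxed accuracy
        "Out.": np.mean((err > 0.3) | (rel > 0.1)),      # outliers
    }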

Appendix 0.D Additional experimental results

0.d.1 Study of FLOT

We report in Table 5 the performance of FLOT obtained at the output of the OT module on FT3D_o. The corresponding performance with refinement is available in the core of the paper. As on FT3D_s, we remark that the refinement clearly improves the EPE, confirming its utility in the presence of occlusions.

Table 5: Performance of FLOT measured at the output of the OT module, i.e., before refinement by $h$, on FT3D_o for K = 1, 3, 5 (columns: EPE, AS, AR, Out.). We report the average scores and their standard deviations between parentheses.

0.d.2 Computation time in the OT module

At $n = 2048$, the computation time in the OT module (measured on an Nvidia GeForce RTX 2080 Ti) is of the order of a few milliseconds for FLOT_0, FLOT at $K = 1$, and FLOT at $K = 3$, and the same holds at $n = 8192$. This represents a small fraction of the total computation time of a forward pass, most of which is spent in the feature extractor $g$. This shows that the OT module is responsible for just a small fraction of the total computation time.

Note that the time spent in the OT module is independent of the type of convolution used. Replacing our implementation of PointNet++ with a faster one, or choosing a faster convolution, will directly improve the computation time spent in $g$ and $h$. Our implementation of the OT module can also be made faster by avoiding the dense computation of the cost matrix $C$, restricting it to pairs of points that are less than $r$ metres apart, as pairs farther apart never contribute to the transport plan $T$.

References