1 Introduction
Trajectory triangulation aims to estimate multi-view sparse dynamic 3D geometry in the absence of concurrent observations. Recent advances in modeling and estimating the spatiotemporal relationships among 2D observations have yielded solutions with increasing generality and effectiveness. However, such research efforts have focused on developing and exploiting geometric insights and formulations, relegating the analysis of higher-order semantic relationships among the geometric entities being estimated. This work addresses the data-driven explicit characterization and modeling of these properties within the context of generalized trajectory triangulation.
Learning to encode generic spatiotemporal relationships hinges on the geometric reference being used and the scope of the analysis. The choice of geometric reference typically poses a dichotomy between Eulerian (e.g. field-based) and Lagrangian (e.g. particle-based) representations: the former defines interactions among rigidly structured adjacency-based neighborhoods (e.g. voxel lattices), while the latter defines interactions based on generic notions of proximity (e.g. nearest-neighbor graphs). Although scope is tightly coupled to these interaction mechanisms, the efficiency vs. comprehensiveness trade-offs between local and global analysis determine the efficacy of the learned models and representations. We target a discrete-continuous, local-global middle ground by 1) learning to approximate pairwise affinities over all estimated geometric elements, through 2) the use of sparse continuous convolutions.
Figure 1: GTT-Net workflow. Input camera poses and 2D features are mapped to a latent space encoding a pairwise affinity matrix leveraged to estimate 3D geometry.
Along these lines, the recent framework for generalized trajectory triangulation (GTT) described in [40] poses the estimation of such relationships in terms of the iterative continuous optimization of a graph-theoretic representation. However, said optimization offers relatively slow convergence and provides no straightforward mechanisms for codifying internal shape constraints or sequence-level motion priors. This work focuses on learning to synthesize a global shape affinity matrix directly from input 3D geometry, integrating with and leveraging the representation and formulation used in [40]; see Fig. 1. Our contributions are:


A learning-based solution to the joint reconstruction and sequencing problems from multi-view image capture.

A generalizable learning and representation framework applicable across diverse input shape domains.

An efficient and flexible cascaded training framework applicable across diverse types of supervisory information.
2 Related work
2.1 Trajectory Triangulation
Trajectory triangulation operates on the premise of known cameras. However, the lack of concurrency requires enforcing estimation constraints to discriminate among the space of solutions compliant with the input observations.
Motion priors. Avidan and Shashua [6] enforced analytical linear and conical motion constraints upon the estimated 3D point trajectories from monocular capture. Extensions to these motion priors include [5, 6, 13, 31, 30, 22]. Vo et al. [36] used physics-based motion priors, such as least kinetic energy, to formulate a bundle adjustment framework for jointly optimizing static and dynamic 3D structure, camera poses, and cross-capture temporal offsets.
Spatiotemporal smoothness. Enforcing spatiotemporal smoothness on the geometric estimation process [23, 24, 44, 45, 35, 42, 43, 36, 33, 34] has been shown to be an effective approach to leveraging temporally dense capture, such as that obtained by multiple video observers. Park et al. [23] parameterize a 3D trajectory as a linear combination of Discrete Cosine Transform (DCT) trajectory bases and optimize each coefficient weight. In [24], Park et al. improve their method by selecting a small number of DCT bases according to an N-fold cross-validation method to avoid low-reconstructability cases. Zhu et al. [44] improve this result by adding a set of manual keyframes and adding norm regularization to their optimization to enforce sparsity on the DCT basis, instead of N-fold cross-validation. Valmadre et al. [35] modify the reconstructability analysis for trajectory basis solutions and propose two solutions: reducing the trajectory bases by setting a gain threshold, and applying a high-pass filter. Zheng et al. [43, 42] reconstructed dynamic 3D structure observed by multiple unsynchronized cameras with partial sequencing information by assuming a self-expressive motion prior and formulating a biconvex optimization problem. Recent works explicitly model and solve for relationships among dynamic 3D estimates and their spatiotemporal data associations [3, 1, 2, 4]. Along these lines, Xu et al. [40] used a graph-based formulation jointly estimating dynamic 3D structure and its corresponding discrete Laplace operator to reduce reliance on the temporal density and uniformity of the input data.
2.2 Learning for Sparse Dynamic 3D Geometry
Structured 3D Data Representations. Relevant to our problem, some early CNN-based approaches to 3D processing [12, 18] map 3D representations onto a 2D space, where traditional CNN machinery is deployed. Such representations forgo accurate modeling of the geometric relationships lost or warped during projection. Performing 3D convolutions on volumetric representations [11, 19, 25, 29, 39] encodes 3D positional information and adjacency relations, but may quantize the representation space, leading to either data merging or sparsity. Riegler et al. [29] addressed this limitation by implementing 3D convolutions on data organized in an octree data structure.
Unstructured 3D Data Representations. Qi et al. [26] worked on unstructured data, enforcing network invariance to permutations of the input features by aggregating global information through max pooling. PointNet++ [27] improved performance by capturing local structure information. Wang et al. [38] propose a continuous convolutional neural network which, similarly to 2D convolutions, computes feature maps as weighted sums of the input features. The use of a multi-layer perceptron (MLP) enables adaptive weight determination based on geometric similarity. Boulch [10] computes a denser weighting function which takes the entire kernel into account.
Deep Learning for Dynamic 3D Reconstruction. Recently, network architectures have been proposed for the NRSfM problem. Kong et al. [16, 17] propose an unsupervised auto-encoder neural network to solve the NRSfM problem under an orthographic camera model by relying on a multi-layer sparse coding framework assumption. Wang et al. [37] developed a similar multi-layer sparse coding framework with improved generalization to weak and strong perspective camera models, along with increased robustness to missing data. Novotny et al. [21] learned a deep network to unambiguously factorize 3D structure and viewpoints by enforcing consistency via canonicalization. Bai et al. proposed an end-to-end deep network [8] targeting multi-view 3D facial reconstruction. Another unsupervised end-to-end deep network [32] was introduced by Sidhu et al., proposing the first dense neural NRSfM approach.
3 Generalized Trajectory Triangulation
The goal of generalized trajectory triangulation (GTT) is to recover time-varying 3D structure from a set of 2D observations with known imaging geometry, but lacking global sequencing relations among input capture frames. Accordingly, GTT may be deemed a structure-only variation of the general non-rigid structure from motion (NRSfM) problem.
A graph-theoretic formulation. A structure-motion graph representation was recently presented in [40], where nodes are mapped from input images and have 3D geometry as attributes, while edges have pairwise affinities as weights. Based on this representation, the GTT problem can be formulated as jointly estimating dynamic 3D geometry with the graph's Laplacian matrix, given by
$$L = \mathrm{diag}(W\mathbf{1}) - W \qquad (1)$$
where $W$ is the graph's affinity matrix, whose values $w_{ij}$ correspond to the edge weights characterizing the spatiotemporal relationships among the 3D estimates $\mathbf{X}$. This generalization of the self-expressive motion prior [43] yields a non-convex optimization problem of the form
$$\min_{\mathbf{X},\, L}\; E_{data}(\mathbf{X}; \mathcal{O}) + \lambda_s E_s(\mathbf{X}, L) + \lambda_c E_c(L) + \lambda_r E_r(\mathbf{X}) \qquad (2)$$
where $\mathcal{O}$ denotes the aggregation of all input 2D observations and their camera parameters, $E_{data}$ is a data term based on reprojection error, while $E_s$, $E_c$, and $E_r$ are regularizers controlling, respectively, anisotropic smoothness, topological compactness, and multi-view reconstructability. Variables $\mathbf{X}$ and $L$ are solved for in alternation: for fixed $L$, the 3D structure $\mathbf{X}$ is estimated by unconstrained quadratic programming, while for fixed $\mathbf{X}$, $L$ is estimated by a linearly constrained quadratic program. We refer readers to the original publication [40] for further details. While the above formulation achieved state-of-the-art accuracy and robustness, its explicit full-graph analysis limits its computational scalability. GTT-Net aims to alleviate this limitation by developing an encoder-decoder framework directly mapping the input 3D geometry to the discrete Laplace operator $L$.
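To make the graph-theoretic machinery concrete, the following minimal numpy sketch builds the Laplacian of Eq. (1) from an affinity matrix; the function name and the toy chain graph are illustrative, not part of the original formulation.

```python
import numpy as np

def laplacian_from_affinity(W: np.ndarray) -> np.ndarray:
    """Graph Laplacian L = diag(W @ 1) - W, as in Eq. (1)."""
    degrees = W.sum(axis=1)        # per-node degree d_i = sum_j w_ij
    return np.diag(degrees) - W

# Toy example: a 3-frame chain graph (frame i linked to frame i+1).
W = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])
L = laplacian_from_affinity(W)
assert np.allclose(L.sum(axis=1), 0.0)   # Laplacian rows sum to zero
```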
4 GTT-Net
As presented in [40], the global dependencies required for affinity matrix optimization impose a computational bottleneck. GTT-Net learns to directly estimate these affinity values from input data. From an initial geometry $\mathbf{X}^0$, we learn a latent space encoding the affinity among input 3D shapes. A sparse affinity matrix decoded from this latent space is fed to a differentiable quadratic optimization module to determine a refined dynamic geometry estimate $\mathbf{X}^*$. We use data augmentation to explicitly target equivariance w.r.t. relevant input capture variants and perturbations. To accelerate training, we utilize cascaded training leveraging supervisory loss functions of increasing complexity.
4.1 Network Architecture
Parameterizing Input Geometry. A time-varying set of $P$ 3D points is observed in $N$ images captured by unsynchronized perspective cameras with known intrinsic and extrinsic matrices $K_i$ and $[R_i \,|\, \mathbf{t}_i]$. 3D points are denoted as $\mathbf{x}_i^j$, while their image projections are $\mathbf{u}_i^j$. The set of all 3D points to estimate is represented by a matrix

$$\mathbf{X} = \begin{bmatrix} \mathbf{x}_1^1 & \cdots & \mathbf{x}_1^P \\ \vdots & \ddots & \vdots \\ \mathbf{x}_N^1 & \cdots & \mathbf{x}_N^P \end{bmatrix} \qquad (3)$$

where $\mathbf{x}_i^j$ represents a 3D point's coordinates. Each row $\mathbf{X}_i$ of $\mathbf{X}$ aggregates the 3D points captured in frame $i$ and constitutes a per-frame shape descriptor from which to estimate affinities. The input matrix $\mathbf{X}^0$ is estimated through pseudo-triangulation of the viewing rays associated with the 2D observations $\mathbf{u}_i^j$.
Parametric Continuous Convolution Layers. Based on [38] and [10], we perform approximated continuous convolution operations on a given feature descriptor as

$$\mathbf{h}_i = \sum_{j \in \mathcal{N}_K(i)} g(\mathbf{y}_j - \mathbf{y}_i)\, f(\mathbf{y}_j) \qquad (4)$$

where $\mathbf{y}_j$ is one of the $K$ nearest neighbors of $\mathbf{y}_i$, $f$ is the feature map value function, and $g$ is a convolution kernel function approximated by a multi-layer perceptron (MLP)

$$g(\mathbf{y}_j - \mathbf{y}_i) = \mathrm{MLP}(\mathbf{y}_j - \mathbf{y}_i;\, \theta). \qquad (5)$$

This yields continuous output values using a finite set of learned weight parameters $\theta$. We learn two types of filters for each layer; see Fig. 3. The first operates on the $N$ single-frame descriptors and their $K$ nearest neighbors, defining the support neighborhood w.r.t. spatiotemporal proximity among their shapes. Filter values are determined by the geometric difference between shapes, according to Eq. (5). The second operates on single-coordinate whole-trajectory descriptors and defines the support neighborhood domain w.r.t. intra-shape geometry (i.e. per-component proximity to their barycenter). Filter values are determined by the geometric difference between joints.
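The following PyTorch sketch illustrates one plausible realization of Eqs. (4)-(5): an MLP maps each neighbor offset to a kernel matrix, and output features are the kernel-weighted sum over the K-nearest-neighbor support. Layer widths and the module interface are assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class ContinuousConv(nn.Module):
    """Parametric continuous convolution: kernel weights come from an
    MLP g(.; theta) evaluated on point offsets (Eqs. 4-5)."""
    def __init__(self, in_ch: int, out_ch: int, coord_dim: int = 3):
        super().__init__()
        self.kernel_mlp = nn.Sequential(
            nn.Linear(coord_dim, 32), nn.ReLU(),
            nn.Linear(32, in_ch * out_ch))
        self.in_ch, self.out_ch = in_ch, out_ch

    def forward(self, coords, feats, knn_idx):
        # coords: (N, coord_dim), feats: (N, in_ch), knn_idx: (N, K) long
        N, K = knn_idx.shape
        offsets = coords[knn_idx] - coords[:, None, :]     # (N, K, coord_dim)
        g = self.kernel_mlp(offsets).view(N, K, self.out_ch, self.in_ch)
        f = feats[knn_idx]                                 # (N, K, in_ch)
        # h_i = sum_j g(y_j - y_i) f(y_j), Eq. (4)
        return torch.einsum('nkoi,nki->no', g, f)
```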
U-Net Auto-Encoder Network Stream. We learn a latent space using a U-Net encoder-decoder to perform dimensionality reduction through continuous parametric convolutions; see Fig. 1(a). For translation and scale invariance across different input data, we apply layer normalization [7] to the input and hidden layers by subtracting the mean $\mu$ and dividing by the standard deviation $\sigma$ for each feature channel, while scaling and shifting by learnable parameters $\gamma$ and $\beta$:

$$\hat{x} = \gamma \odot \frac{x - \mu}{\sigma} + \beta \qquad (6)$$
An affinity matrix $W'$ is computed in closed form as the pairwise similarity between latent space features $\mathbf{z}_i$:

$$w'_{ij} = \exp\!\left(-\|\mathbf{z}_i - \mathbf{z}_j\|_2^2\right) \qquad (7)$$
Unlike a regular graph affinity matrix, $W'$ does not encode a graph's local connectivity. $W'$ is sparsified into $W$ through a layer retaining the $k$ highest affinity values within each feature's convolution support domain. Empirically, we found $k=2$ yielded the best performance (see Fig. 8(b)) and enforce this selection criterion deterministically. Finally, $W$ is fed into a differentiable instance of the Discrete Laplace Operator Estimator framework [40], denoted as a DLOE-layer, to estimate the output 3D geometry $\mathbf{X}^*$.
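A minimal sketch of the affinity computation and its sparsification follows; the Gaussian kernel is an assumed instance of the pairwise similarity in Eq. (7), and the symmetrization step is likewise an assumption.

```python
import torch

def sparse_affinity(z: torch.Tensor, support: torch.Tensor, k: int = 2):
    # z: (N, d) latent per-frame features; support: (N, K) long indices
    # of each frame's convolution support domain.
    d2 = torch.cdist(z, z).pow(2)           # squared pairwise distances
    W_dense = torch.exp(-d2)                # assumed similarity kernel, Eq. (7)
    # Retain the k highest affinities inside each frame's support set.
    w_sup = torch.gather(W_dense, 1, support)      # (N, K)
    top = w_sup.topk(k, dim=1).indices             # indices into support list
    keep = torch.gather(support, 1, top)           # frame indices to keep
    W = torch.zeros_like(W_dense).scatter(1, keep,
                                          torch.gather(W_dense, 1, keep))
    return 0.5 * (W + W.T)                  # symmetrize (assumption)
```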
PointNet Network Stream. To allow for input shapes having different numbers of 3D points, we integrate a PointNet network [26] to provide a fixed-sized input to our U-Net; see Fig. 1(b). We normalize each shape by subtracting its barycenter before PointNet maps it to a 30-dimensional feature. To retain spatial separation among shapes, we interpret PointNet's output as 10 virtual 3D points, add back the original barycenter, and feed them to the U-Net.
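A sketch of this stream is given below; the PointNet backbone is reduced to its essential shared-MLP-plus-max-pool form, and layer widths are illustrative assumptions.

```python
import torch
import torch.nn as nn

class VirtualPointEncoder(nn.Module):
    """Map a variable-size joint set to a 30-D code, reinterpreted as
    10 virtual 3D points anchored at the shape's barycenter."""
    def __init__(self):
        super().__init__()
        self.point_mlp = nn.Sequential(nn.Linear(3, 64), nn.ReLU(),
                                       nn.Linear(64, 128), nn.ReLU())
        self.head = nn.Linear(128, 30)      # fixed-size 30-D shape code

    def forward(self, joints: torch.Tensor) -> torch.Tensor:
        # joints: (P, 3), with P varying across shape domains
        barycenter = joints.mean(dim=0, keepdim=True)   # (1, 3)
        feats = self.point_mlp(joints - barycenter)     # translation-normalized
        code = self.head(feats.max(dim=0).values)       # permutation-invariant
        # Reinterpret as 10 virtual 3D points and restore spatial position.
        return code.view(10, 3) + barycenter
```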
4.2 Supervisory Data
Depending on the capture scenario, complete or partial sequencing priors (e.g. sequencing among frames belonging to the same camera or video stream) may be available. As GTT-Net encodes these priors in terms of the support domain used for continuous convolution, we only need to train a single network instance that is inclusive of all such variations. We explicitly instantiate such input prior variations within our training data and, to account for capture variability, perform data augmentation tailored to our formulation, as in Fig. 4. We inject Gaussian noise into the 2D features to account for feature localization ambiguity, and apply geometric transformations to the ground truth data to account for capture variability.
Capture Scenario 1: Independent Images. Independent imagery provides no sequencing information. The convolution support domain for shape descriptors is determined by the spatial distribution of our initial 3D geometry $\mathbf{X}^0$, which is computed by exhaustive pseudo-triangulation of sparse 2D features. Once a rough 3D geometry is estimated for each frame, we compute the per-frame K nearest neighbors by combining triangulation error and viewing-ray convergence analysis to eliminate frames with reduced camera baseline and unreliable triangulation. This input feature variant is denoted as .
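As a concrete (assumed) instance of the ray analysis above, the sketch below pseudo-triangulates a feature from two viewing rays via their closest-approach midpoint, using the inter-ray angle as a convergence test and the segment length as a triangulation error proxy.

```python
import numpy as np

def pseudo_triangulate(c1, d1, c2, d2, min_angle_deg=2.0):
    # c1, c2: camera centers; d1, d2: unit viewing-ray directions.
    b = c2 - c1
    d1d2 = float(d1 @ d2)
    angle = np.degrees(np.arccos(np.clip(abs(d1d2), 0.0, 1.0)))
    if angle < min_angle_deg:        # near-parallel rays: small baseline,
        return None, np.inf, False   # unreliable triangulation -> reject
    denom = 1.0 - d1d2 ** 2
    # Closest points c1 + t1*d1 and c2 + t2*d2 on the two rays.
    t1 = (b @ d1 - (b @ d2) * d1d2) / denom
    t2 = ((b @ d1) * d1d2 - b @ d2) / denom
    p1, p2 = c1 + t1 * d1, c2 + t2 * d2
    midpoint = 0.5 * (p1 + p2)
    error = np.linalg.norm(p1 - p2)  # triangulation error proxy
    return midpoint, error, True
```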
Capture Scenario 2: Unsynchronized Videos. For unsynchronized videos, sequencing priors are available for each independent video stream, allowing us to summarily eliminate from the support domain any frames from the same stream, as well as frames within another stream that are not mutually adjacent. These constraints mitigate repetitive and/or self-intersecting 3D motions. The initialized input feature is defined as .
Capture Scenario 3: Synchronized Videos. For synchronized videos (synchronization denotes temporal alignment, not capture concurrency), global sequencing is known, and we can readily determine the K nearest neighbors as those elements temporally adjacent to a given reference video frame. Also, pseudo-triangulation efficiency and reliability can benefit from guidance by the known sequencing information. This input feature variant is denoted as .
Data Augmentation 1: Global Structure Rotation. Feature normalization in our encoder layers mitigates global scale and displacement variations. To promote rotational invariance, we generate augmented input instances by randomly rotating the initial 3D structure and camera poses jointly. While this transformation does not change input 2D feature locations, it targets the generalization of 3D and sequencing estimates. This input feature variant is denoted as .
Data Augmentation 2: Camera Perturbations. We inject structured perturbations into our input by randomly rotating and translating the camera pose of each frame. Since this transformation changes the imaging geometry, it alters the input 2D features used to initialize both the 3D structure and the K nearest neighbors associated with each frame. This input feature variant is denoted as .
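Both augmentations can be sketched as below, assuming SciPy's rotation utilities and world-to-camera extrinsics [R|t]; the noise magnitudes are illustrative defaults rather than the paper's settings.

```python
import numpy as np
from scipy.spatial.transform import Rotation

def augment_capture(X, Rs, ts, cam_sigma_deg=1.0, cam_sigma_t=0.01):
    # X: (..., 3) 3D structure; Rs, ts: per-frame extrinsics.
    Rg = Rotation.random().as_matrix()       # global structure rotation
    X_aug = X @ Rg.T                         # rotate all world points
    Rs_aug, ts_aug = [], []
    for R, t in zip(Rs, ts):
        R_joint = R @ Rg.T                   # joint camera rotation keeps
                                             # 2D projections unchanged
        dR = Rotation.from_rotvec(           # small per-camera perturbation
            np.radians(cam_sigma_deg) * np.random.randn(3)).as_matrix()
        Rs_aug.append(dR @ R_joint)
        ts_aug.append(t + cam_sigma_t * np.random.randn(3))
    return X_aug, Rs_aug, ts_aug
```

Since R' = R Rgᵀ exactly compensates the world rotation (R'(Rg x) + t = R x + t), only the subsequent per-camera perturbation alters the 2D observations, matching the two augmentation behaviors described above.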
4.3 Loss Functions
U-Net Reconstruction Loss. To train our U-Net auto-encoder, we penalize the difference between the input and the reconstructed output maps, which correspond, respectively, to the initialized 3D structure $\mathbf{X}^0$ and a decoded 3D structure $\hat{\mathbf{X}}$. We also penalize the differences between each hidden feature map inside the encoder and the symmetrically corresponding hidden feature map in the decoder, as in Fig. 1(a). The loss function is written as

$$\mathcal{L}_{AE} = \|\mathbf{X}^0 - \hat{\mathbf{X}}\|_F^2 + \sum_{l=1}^{D} \|F_l^{enc} - F_l^{dec}\|_F^2 \qquad (8)$$

where $D$ is the depth of the encoder and decoder layers.
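A direct PyTorch transcription of Eq. (8) might look as follows; pairing encoder levels with their mirrored decoder levels via list reversal, and the mean-squared scaling, are implementation assumptions.

```python
import torch

def unet_recon_loss(x_in, x_dec, enc_feats, dec_feats):
    """Eq. (8): input-vs-decoded geometry error plus per-level
    encoder/decoder feature map discrepancies."""
    loss = torch.mean((x_in - x_dec) ** 2)
    for fe, fd in zip(enc_feats, reversed(dec_feats)):  # mirrored levels
        loss = loss + torch.mean((fe - fd) ** 2)
    return loss
```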
(Pseudo) Ground Truth Affinity Loss. Optimizing for a ground truth affinity matrix is computationally intractable (NP-hard). Hence, we use ground truth sequencing to generate a proxy (pseudo) ground truth affinity matrix $W^{GT}$ having binary affinity values linking temporally consecutive frames and zero otherwise. If ground truth structure is available, we instead estimate real-valued affinities through optimization as in [40]. Reconstruction accuracy when training with these two kinds of (pseudo) ground truth affinity matrices is compared in Fig. 7(a). We penalize the difference between $W$ and $W^{GT}$:

$$\mathcal{L}_W = \|W - W^{GT}\|_F^2 \qquad (9)$$
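Constructing the binary pseudo-ground-truth matrix from a known frame ordering is straightforward; a small sketch (with an illustrative ordering) follows.

```python
import numpy as np

def pseudo_gt_affinity(order: np.ndarray) -> np.ndarray:
    """Unit affinity between temporally consecutive frames, zero elsewhere."""
    N = len(order)
    W = np.zeros((N, N))
    for a, b in zip(order[:-1], order[1:]):   # consecutive frame pairs
        W[a, b] = W[b, a] = 1.0
    return W

# e.g. frames captured in the temporal order 2 -> 0 -> 1 -> 3
W_gt = pseudo_gt_affinity(np.array([2, 0, 1, 3]))
```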
3D Reconstruction Loss. Given the affinity matrix $W$ estimated by GTT-Net, we generate the corresponding Laplacian matrix $L$ as in Eq. (1) and estimate the 3D geometry $\mathbf{X}^*$ by solving a quadratic programming problem. We penalize the 3D structure estimation error w.r.t. the ground truth $\mathbf{X}^{GT}$ as

$$\mathcal{L}_{3D} = \|\mathbf{X}^* - \mathbf{X}^{GT}\|_F^2 \qquad (10)$$
Smoothness Loss. In the absence of ground truth 3D structure, we penalize the first and second regularization terms in Eq. (2) to foster local smoothness and linear topological structure:

$$\mathcal{L}_S = \lambda_s E_s(\mathbf{X}^*, L) + \lambda_c E_c(L) \qquad (11)$$
PointNet Autoencoder Reconstruction Loss. If the PointNet stream is considered, we penalize the difference between its input shape $\mathbf{S}$ and the output map reconstructed by a domain-specific decoder $D_{PN}$. In this scenario, the input to our U-Net is PointNet's fixed-dimension latent space.

$$\mathcal{L}_{PN} = \|\mathbf{S} - D_{PN}(E_{PN}(\mathbf{S}))\|_F^2 \qquad (12)$$
4.4 A Cascaded Supervision Strategy.
The loss functions just described address a diversity of performance aspects we aim to control through supervision. However, they impose different levels of supervisory specificity as well as computational burden. In order to streamline the training process, we partition it into sequential stages, each considering supervisory loss functions of increasing specificity and complexity. We aim to bootstrap the training process using efficient weak supervision, and later improve upon the quality of the results by incorporating more targeted and computationally burdensome loss functions. We observed that strong supervision based on the output of the DLOE layers, while being the most effective, significantly slowed convergence and increased the processing time for each epoch. Accordingly, DLOE-based supervision is used to fine-tune affinity estimation and omitted during initial training epochs. We now describe our 2-stage cascaded approach, shown in Fig. 5.
Stage 1: Bootstrap Affinity Supervision. Stage 1 only enforces sequencing constraints and relies on the $\mathcal{L}_{AE}$ and $\mathcal{L}_W$ loss functions. The goal is to accurately learn to auto-encode the U-Net's input signal, while effectively learning pairwise affinity. For sequencing-only supervision, the binary version of $W^{GT}$ is used to target the identification of temporal neighborhoods. Conversely, if ground truth 3D geometry is available, the continuous version of $W^{GT}$ is used to target fine-grain affinity estimation.
Stage 2: DLOE-based Supervision. Stage 2 leverages the DLOE model to enforce geometric regularization on the output 3D structure. For sequencing-only supervision, we enforce the smoothness loss function $\mathcal{L}_S$ to learn affinity values in $W$ yielding smooth 3D trajectories. For training instances where $\mathbf{X}^{GT}$ is available, we replace $\mathcal{L}_S$ with the 3D reconstruction loss $\mathcal{L}_{3D}$ for fully supervised learning.
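The stage-dependent loss selection can be summarized by the dispatch sketch below; the loss terms are injected as callables named after Eqs. (8)-(11), and whether stage 2 retains the stage-1 terms is our assumption.

```python
def cascaded_loss(stage, outputs, gt, l_ae, l_aff, l_3d, l_smooth,
                  has_gt_3d):
    if stage == 1:                    # bootstrap: cheap supervision only,
        return l_ae(outputs) + l_aff(outputs, gt)   # no DLOE layer in loop
    # Stage 2: supervise through the differentiable DLOE layer.
    geo = l_3d(outputs, gt) if has_gt_3d else l_smooth(outputs)
    return l_ae(outputs) + l_aff(outputs, gt) + geo
```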
5 Experiments
5.1 Motion Capture Datasets
We use motion capture data [20] comprising 130 human 3D motions of 31 joints, captured at a frame rate of 120 Hz. We choose 10 of the 130 motions, each averaging 300 frames, for testing. We generate training datasets by randomly choosing from the remaining 120 motions with varying levels of 2D noise, frame rates, and percentages of missing joints. We simulate four virtual cameras with fixed resolution and a focal length of 1000, placed at a distance of 3m, onto which dynamic 3D joint positions are projected as 2D observations. For each 3D motion in both training and testing datasets, temporal sampling is performed at 30 Hz, and concurrent observations are systematically avoided to ensure all cameras are unsynchronized. We show 3D reconstruction accuracy comparisons in Fig. 7. GTT-Net is compared against discrete Laplace operator estimation (DLOE) [40], self-expressive dictionary learning (SEDL) [43], trajectory basis (TB) [24], high-pass filter (HPF) [35], and the pseudo-triangulation approach of Sec. 4.2. SEDL requires partial sequencing information. TB and HPF require complete ground truth sequencing.
Varying 2D noise. We randomly add 2D Gaussian noise with standard deviations from 1 to 5 pixels to our observations. Fig. 6(a) shows GTT-Net is competitive with other methods across all sequencing information conditions. When full sequencing information is available, GTT-Net outperforms geometry-only methods (e.g. DLOE), indicating we learn improved affinity relations to triangulate 3D trajectories. Even without any sequencing information, GTT-Net outperforms methods leveraging global sequencing.
Varying frame rates. We simulate lower frame rate conditions by downsampling the 2D capture to 7.5 Hz, 15 Hz, and 30 Hz. As shown in Fig. 6(b), our method outperforms DLOE under both partial and absent sequencing information. With full sequencing information, our method remains competitive.
Missing data. We randomly decimate 3D joints at rates varying from 10% to 50% and compare GTT-Net's robustness against missing and/or occluded input features; see Fig. 6(c). Only DLOE, SEDL, and TB can operate with missing joints. GTT-Net's robustness is competitive across all sequencing information conditions.
Ablation of cascaded training. Fig. 7(a) compares the reconstruction error distribution among the different stages in our cascaded training strategy. We include a self-supervised version using only the $\mathcal{L}_{AE}$ and $\mathcal{L}_S$ loss functions, without external data. Surprisingly, self-supervised training is strongly competitive with the full training cascade results, although subject to greater variability.
PointNet network validation. The PointNet-enabled variant of GTT-Net is trained on different datasets of articulated 3D motion, such as monkeys [9], hands [41], and humans (CMU Mocap, http://mocap.cs.cmu.edu/); see Fig. 6. All have a different joint topology compared to the testing data. In Fig. 7(b), we compare the reconstruction error distribution of three GTT-Net variants: 1) a Multi-Domain PointNet-enabled GTT-Net, 2) a Single-Domain 3D-Supervised GTT-Net, and 3) a Single-Domain 3D-Supervised GTT-Net where random rigid motions are applied to individual joint 3D trajectories to decorrelate their motion from the original motion semantics. Our PointNet variant outperforms the variant trained on decorrelated input 3D motion and is competitive with the Single-Domain 3D-Supervised variant, even though our PointNet variant was not exposed to the test domain during training. The fact that training on decorrelated motion yields inferior performance indicates our GTT-Net framework effectively learns to enforce general 3D motion semantics when estimating inter-shape affinities.
Computational advantages over [40]: GTT-Net is over an order of magnitude faster (in average speedup) than the open-source version of [40] when estimating a single full-graph affinity matrix across different sequence lengths, while consistently being more accurate, as shown in Fig. 8(a).
5.2 Cross-Domain Multi-view Video Datasets
Experiments on different 3D shape classes illustrate the generality of our PointNet-enabled GTT-Net variant. The multi-view Human Ski [28] and Dog [15] datasets were unsynchronized, and their provided 2D features were input to GTT-Net. Fig. 11 illustrates our qualitative results. GTT-Net was not exposed to either test domain during training.
5.3 Panoptic Studio Dataset
The CMU Panoptic Studio dataset [14] contains synchronized multi-view videos, 2D human joint estimates, and camera poses. We sample video frames to generate multi-view unsynchronized image streams. Again, as the dataset-provided sparse shape feature inputs (i.e. skeleton joints) differ from the 31-joint sparse shape features used for training, we use the PointNet variant of GTT-Net.
Application to Event Segmentation. For multi-view videos capturing temporally separated events, our goal is to jointly reconstruct the dynamic 3D structure and segment all events based on the estimated affinity matrix $W$, which describes a graph with multiple connected components, each corresponding to a separate event. For each segmented event, the sequencing of its constituent images is directly extracted from the affinity matrix. At the top right of Fig. 9(a), the chain-like structure of each event can be observed by performing spectral analysis on the affinity matrix.
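Under the chain-structure interpretation above, event segmentation reduces to finding connected components of the sparsified affinity graph; a sketch using SciPy follows (connected components stand in for the spectral analysis mentioned above).

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def segment_events(W: np.ndarray):
    """Each connected component of the affinity graph is one event."""
    n_events, labels = connected_components(csr_matrix(W > 0),
                                            directed=False)
    return n_events, labels

# Two temporally separated 3-frame events -> two components.
W = np.zeros((6, 6))
for a, b in [(0, 1), (1, 2), (3, 4), (4, 5)]:
    W[a, b] = W[b, a] = 1.0
print(segment_events(W))   # (2, [0, 0, 0, 1, 1, 1])
```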
Application to Multi-Target Scenarios. We consider the case where multiple shapes are captured by multiple cameras, but shape correspondence among the images is not available. Given images each capturing up to a maximal number of shapes, the goal is to reconstruct the aggregated dynamic 3D structure. We propose a GTT-Net-based solution: 1) separately create virtual frames (each observing the 3D points of a single subject) for each of the subjects captured in the original images; 2) execute GTT-Net on the new virtual images (up to one per subject per original image) to reconstruct the 3D structure and generate the corresponding affinity matrix, as in the single-shape case; 3) cluster individual objects based on the affinity matrix using any standard clustering method; 4) merge estimated 3D shapes originating from the same image; and 5) run GTT-Net on the original input images with aggregated shape features to refine the decoupled event reconstructions from step 2. Fig. 9(b) shows our results for a two-target scenario.
6 Conclusion
GTT-Net uses supervised learning to estimate pairwise spatiotemporal affinities and compute dynamic 3D geometry from image observations. Our framework allows for a diverse set of training scenarios and leverages them through a cascaded supervision strategy to both improve training efficiency and adapt to the available supervisory information. Moreover, the proposed system is robustly applicable across different shape-trajectory domains, while outperforming the current state of the art.
References
 [1] (2018) A scalable, efficient, and accurate solution to nonrigid structure from motion. Computer Vision and Image Understanding 167, pp. 121–133. Cited by: §2.1.
 [2] (2019) Robust spatiotemporal clustering and reconstruction of multiple deformable bodies. IEEE Transactions on Pattern Analysis and Machine Intelligence 41 (4), pp. 971–984. Cited by: §2.1.
 [3] (2018) Deformable motion 3d reconstruction by union of regularized subspaces. In 2018 25th IEEE International Conference on Image Processing (ICIP), pp. 2930–2934. Cited by: §2.1.
 [4] (2020) Segmentation and 3d reconstruction of nonrigid shape from rgb video. In 2020 IEEE International Conference on Image Processing (ICIP), pp. 2845–2849. Cited by: §2.1.

 [5] (1999) Trajectory triangulation of lines: reconstruction of a 3d point moving along a line from a monocular image sequence. In Proceedings of the 1999 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Vol. 2, pp. 62–66. Cited by: §2.1.
 [6] (2000) Trajectory triangulation: 3d reconstruction of moving points from a monocular image sequence. IEEE Transactions on Pattern Analysis & Machine Intelligence (4), pp. 348–357. Cited by: §2.1.
 [7] (2016) Layer normalization. arXiv preprint arXiv:1607.06450. Cited by: §4.1.
 [8] (2020) Deep facial non-rigid multi-view stereo. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: §2.2.

 [9] (2020) OpenMonkeyStudio: automated markerless pose estimation in freely moving macaques. bioRxiv. Cited by: §5.1.
 [10] (2019) Generalizing discrete convolutions for unstructured point clouds. In 3DOR, pp. 71–78. Cited by: §2.2, §4.1.
 [11] (2017) Submanifold sparse convolutional networks. arXiv preprint arXiv:1706.01307. Cited by: §2.2.
 [12] (2014) Learning rich features from rgbd images for object detection and segmentation. In European conference on computer vision, pp. 345–360. Cited by: §2.2.
 [13] (2004) Reconstruction of a scene with multiple linearly moving objects. International Journal of Computer Vision 59 (3), pp. 285–300. Cited by: §2.1.
 [14] (2017) Panoptic studio: a massively multiview system for social interaction capture. IEEE Transactions on Pattern Analysis and Machine Intelligence 41 (1), pp. 190–204. Cited by: §5.3.
 [15] (2020) RGBD-dog: predicting canine pose from rgbd sensors. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: §5.2.
 [16] (2019) Deep nonrigid structure from motion. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1558–1567. Cited by: §2.2.
 [17] (2020) Deep nonrigid structure from motion with missing data. IEEE Transactions on Pattern Analysis and Machine Intelligence. Cited by: §2.2.
 [18] (2015) Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3431–3440. Cited by: §2.2.
 [19] (2015) Voxnet: a 3d convolutional neural network for realtime object recognition. In 2015 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 922–928. Cited by: §2.2.
 [20] (2007) Documentation mocap database HDM05. Technical Report CG-2007-2, Universität Bonn. ISSN 1610-8892. Cited by: §5.1.
 [21] (2019) C3dpo: canonical 3d pose networks for nonrigid structure from motion. In Proceedings of the IEEE International Conference on Computer Vision, pp. 7688–7697. Cited by: §2.2.
 [22] (2011) 3D reconstruction of a smooth articulated trajectory from a monocular image sequence. In 2011 International Conference on Computer Vision, pp. 201–208. Cited by: §2.1.
 [23] (2010) 3D reconstruction of a moving point from a series of 2d projections. In European Conference on Computer Vision, pp. 158–171. Cited by: §2.1.
 [24] (2015) 3D trajectory reconstruction under perspective projection. International Journal of Computer Vision 115 (2), pp. 115–135. Cited by: §2.1, §5.1.
 [25] (2016) Volumetric and multiview cnns for object classification on 3d data. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 5648–5656. Cited by: §2.2.
 [26] (2017) PointNet: deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Cited by: §2.2, §4.1.
 [27] (2017) Pointnet++: deep hierarchical feature learning on point sets in a metric space. In Advances in neural information processing systems, pp. 5099–5108. Cited by: §2.2.
 [28] (2018) Learning monocular 3d human pose estimation from multiview images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8437–8446. Cited by: §5.2.
 [29] (2017) Octnet: learning deep 3d representations at high resolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3577–3586. Cited by: §2.2.
 [30] (2000) 3d reconstruction from tangentofsight measurements of a moving object seen from a moving camera. In European Conference on Computer Vision, pp. 621–631. Cited by: §2.1.
 [31] (1999) Trajectory triangulation over conic sections. In Proceedings of the Seventh IEEE International Conference on Computer Vision, Vol. 1, pp. 330–336. Cited by: §2.1.
 [32] (2020) Neural dense nonrigid structure from motion with latent space constraints. In European Conference on Computer Vision (ECCV), Cited by: §2.2.
 [33] (2014) Separable spatiotemporal priors for convex reconstruction of timevarying 3d point clouds. In European Conference on Computer Vision, pp. 204–219. Cited by: §2.1.
 [34] (2017) Kroneckermarkov prior for dynamic 3d reconstruction. IEEE Transactions on Pattern Analysis and Machine Intelligence 39 (11), pp. 2201–2214. Cited by: §2.1.
 [35] (2012) General trajectory prior for nonrigid reconstruction. In Proceedings of 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1394–1401. Cited by: §2.1, §5.1.
 [36] (2016) Spatiotemporal bundle adjustment for dynamic 3d reconstruction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1710–1718. Cited by: §2.1, §2.1.
 [37] (2020) Deep nrsfm++: towards 3d reconstruction in the wild. arXiv preprint arXiv:2001.10090. Cited by: §2.2.
 [38] (2018) Deep parametric continuous convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2589–2597. Cited by: §2.2, §4.1.
 [39] (2015) 3d shapenets: a deep representation for volumetric shapes. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1912–1920. Cited by: §2.2.
 [40] (2019) Discrete Laplace operator estimation for dynamic 3d reconstruction. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). Cited by: §1, §2.1, §3, §4.1, §4.3, §4, Figure 9, §5.1, §5.1.
 [41] (2017) BigHand2.2M benchmark: hand pose dataset and state of the art analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4866–4874. Cited by: §5.1.
 [42] (2015) Sparse dynamic 3d reconstruction from unsynchronized videos. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4435–4443. Cited by: §2.1.
 [43] (2017) Selfexpressive dictionary learning for dynamic 3d reconstruction. IEEE Transactions on Pattern Analysis and Machine Intelligence 40 (9), pp. 2223–2237. Cited by: §2.1, §3, §5.1.
 [44] (2011) 3D motion reconstruction for realworld camera motion. In Proceedings of the 2011 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–8. Cited by: §2.1.
 [45] (2015) Convolutional sparse coding for trajectory reconstruction. IEEE Transactions on Pattern Analysis and Machine Intelligence 37 (3), pp. 529–540. Cited by: §2.1.