1 Introduction
Captured visual data is often a projection of a higher-dimensional signal “collapsed” along some dimension. For example, long-exposure, motion-blurred photographs are produced by projecting motion trajectories along the time dimension [11, 25]. Recent “corner cameras” leverage the fact that a corner-like edge occluder vertically projects light rays of hidden scenes to produce a 1D video [4]. Medical x-ray machines use spatial projectional radiography, where x-rays are distributed by a generator, and the imaged anatomy affects the signal captured by the detector [26]. Given projected data, is it possible to synthesize the original signal? In this work, we present an algorithm that enables this synthesis. We focus on recovering images and video from spatial projections, and recovering a video from a long-exposure image obtained via temporal projection.
The task of inverting projected, high-dimensional signals is ill-posed, making it infeasible without some priors or constraints on the true signal. This ambiguity includes object orientations and poses in spatial projections, and the “arrow of time” [43] in temporal projections (Fig. 1). We leverage the fact that the effective dimension of most natural images is often much lower than the pixel representation, because of the shared structure in a given domain. We handle this ambiguity by formulating a probabilistic model for the generation of signals given a projection. The model consists of parametric functions that we implement with convolutional neural networks (CNNs). Using variational inference, we derive an intuitive objective function. Sampling from this deprojection network at test time produces plausible examples of signals that are consistent with an input projection.
We evaluate our work both quantitatively and qualitatively. We demonstrate that our method can recover the distribution of human gait videos from 2D spacetime images, and face images from their 1D spatial projections. We also show that our method can model distributions of videos conditioned on motion-blurred images using the Moving MNIST dataset [37].
2 Related Work
Projections play a central role in computer vision, starting from the initial stages of image formation, where light from the 3D world is projected onto a 2D plane. We focus on a particular class of projections where higher-dimensional signals of interest are collapsed along one dimension to produce observed data.
2.1 Corner Cameras
Corner cameras exploit reflected light from a hidden scene occluded by obstructions with edges to “see around the corner” [4]. Reflected light rays from the scene from the same angular position relative to the corner are vertically integrated to produce a 1D video (one spatial dimension + time). That study used the temporal gradient of the 1D video to coarsely indicate angular positions of the human with respect to the corner, but did not reconstruct the hidden scene. As an initial step towards this difficult reconstruction task, we show that videos and images can be recovered after collapsing one spatial dimension.
2.2 Compressed Sensing
Compressed sensing techniques efficiently reconstruct a signal from limited observations by finding solutions to underdetermined linear systems [8, 12]. This is possible because of the redundancy of natural signals in an appropriate basis. Several methods show that it is possible to accurately reconstruct a signal from a small number (s) of bases through convex optimization, even when the bases are chosen randomly [6, 7, 16]. We tackle an extreme variant where one dimension of a signal is completely lost. We also take a learning-based approach to the problem that yields a distribution of potential signals instead of one estimate.
2.3 Conditional Image/Video Synthesis and Future Frame Prediction
Neural network-based image and video synthesis has received significant attention. In conditional image synthesis, an image is synthesized conditioned on some other information, such as a class label or another image of the same dimension (image-to-image translation) [5, 17, 29, 38, 42, 47]. In contrast to our work, most of these studies condition on data of the same dimensionality as the output.
Video synthesis algorithms mainly focus on unconditional generation [33, 39, 40] or video-to-video translation [9, 34, 41]. In future video frame prediction, frames are synthesized conditioned on one or more past images. Several of these algorithms treat video generation as a stochastic problem [2, 24, 44], using a variational autoencoder (VAE) style framework. The inputs and outputs in these problems take a similar form to ours, but the information in the input is different. We draw insights from the stochastic formulation in these studies for our task.
2.4 Inverting a Motion-blurred Image to Video
One application we explore is the formation of videos from dramatically motion-blurred images, created by temporally aggregating photons from a scene over an extended period of time. Two recent studies present the deterministic recovery of a video sequence from a single motion-blurred image [18, 30]. We propose a general deprojection framework for dimensions including, but not limited to, time. In addition, our framework is probabilistic, capturing the distribution of signal variability instead of a single deterministic output (see Fig. 1).
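3 Methods
We assume a dataset of pairs of original signals $y \in \mathbb{R}^{d_1 \times \cdots \times d_N}$ and projections $x \in \mathbb{R}^{d_1 \times \cdots \times d_{j-1} \times d_{j+1} \times \cdots \times d_N}$, where $N$ is the number of dimensions of $y$ and $j$ is the projected dimension. We assume a projection function $f_\phi$ with parameters $\phi$, so that $x = f_\phi(y)$. In our experiments, we focus on a case often observed in practice, where $f_\phi$ is a linear operation in $y$ along $j$, such as averaging: $x = \frac{1}{d_j} \sum_{k=1}^{d_j} y[k]$, where $y[k]$ is the $k$-th slice of $y$ along dimension $j$. For example, a grayscale video $y \in \mathbb{R}^{H \times W \times T}$ might get projected to an image $x \in \mathbb{R}^{H \times W}$ by averaging pixels across time. Deprojection is a highly underconstrained problem. Even if the values of $\phi$ are known, there are $d_j$ times as many variables (size of $y$) as constraints (size of $x$).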
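The averaging projection described above can be illustrated with a small NumPy sketch. The array sizes and the helper name `project` are illustrative assumptions, not values or code from our experiments.

```python
import numpy as np

def project(y, axis):
    """Collapse one dimension of a signal by averaging along `axis`."""
    return y.mean(axis=axis)

rng = np.random.default_rng(0)
video = rng.random((64, 64, 10))   # hypothetical grayscale clip: height x width x time

image = project(video, axis=2)     # temporal projection: a long-exposure-like image
trace = project(video, axis=0)     # spatial projection: rows collapsed, width x time remains

print(image.shape, trace.shape)    # (64, 64) (64, 10)
print(video.size // image.size)    # 10 unknown values per observed value
```

The last line makes the underconstrained nature of the task concrete: for a temporal projection of a 10-frame clip, each observed pixel must explain 10 unknowns.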
We aim to capture the distribution $p(y \mid x)$ for a particular scenario with data. We first present a probabilistic model for the deprojection task which builds on the conditional VAE (CVAE) [36] (Fig. 2). We let $z$ be a multivariate normal latent variable which captures variability of the signal $y$ unexplainable from the projection $x$ alone. Intuitively, $z$ encodes information orthogonal to the unprojected dimensions. For example, it could capture the temporal variation of the various scenes that may have led to a long-exposure image.
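We model the conditional prior $p_\omega(z \mid x)$ as a Gaussian whose mean and covariance are parametric functions of $x$ with parameters $\omega$, and the likelihood of the signal as a Gaussian distribution:

$$p_\theta(y \mid x, z) = \mathcal{N}\big(y;\, g_\theta(x, z),\, \sigma^2 I\big), \tag{1}$$

where $\sigma^2$ is a per-pixel noise variance and $g_\theta(x, z)$ is a deprojection function, parameterized by $\theta$ and responsible for producing a noiseless estimate of $y$ given $x$ and $z$.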
3.1 Variational Inference and Loss Function
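Our goal is to estimate $p(y \mid x)$:

$$p(y \mid x) = \int p_\theta(y \mid x, z)\, p_\omega(z \mid x)\, dz, \tag{2}$$

with Gaussian likelihood $p_\theta(y \mid x, z)$ and conditional prior $p_\omega(z \mid x)$. Evaluating this integral directly is intractable because of its reliance on potentially complex parametric functions and the intractability of estimating the posterior $p(z \mid x, y)$. We instead use variational inference to obtain a lower bound of the likelihood, and use stochastic gradient descent to optimize it [20, 23]. We introduce an approximate posterior distribution $q_\psi(z \mid x, y)$:

$$q_\psi(z \mid x, y) = \mathcal{N}\big(z;\, \mu_\psi(x, y),\, \Sigma_\psi(x, y)\big). \tag{3}$$

Using Jensen’s inequality, we achieve the following evidence lower bound (ELBO) for $\log p(y \mid x)$:

$$\log p(y \mid x) \ge \mathbb{E}_{z \sim q_\psi}\big[\log p_\theta(y \mid x, z)\big] - \mathrm{KL}\big[q_\psi(z \mid x, y)\,\|\,p_\omega(z \mid x)\big], \tag{4}$$

where $\mathrm{KL}[\cdot\,\|\,\cdot]$ is the Kullback-Leibler divergence encouraging the variational distribution to approximate the conditional prior, resulting in a regularized embedding. We estimate the expectation term by drawing one $z$ from $q_\psi(z \mid x, y)$ within the network using the reparametrization trick [23] and evaluating the expression:

$$\log p_\theta(y \mid x, z) = -\frac{1}{2\sigma^2}\,\|y - g_\theta(x, z)\|^2 + \text{const}, \tag{5}$$

where $g_\theta(x, z)$ is the deprojection estimate and $\sigma^2$ is the per-pixel noise variance.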
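For completeness, the Jensen step behind the lower bound can be written out as follows. This is a standard derivation sketched under assumed notation: $p_\theta(y \mid x, z)$ for the likelihood, $p_\omega(z \mid x)$ for the conditional prior, and $q_\psi(z \mid x, y)$ for the approximate posterior.

```latex
\begin{align*}
\log p(y \mid x)
  &= \log \int p_\theta(y \mid x, z)\, p_\omega(z \mid x)\, dz \\
  &= \log \mathbb{E}_{z \sim q_\psi(z \mid x, y)}\!\left[
       \frac{p_\theta(y \mid x, z)\, p_\omega(z \mid x)}{q_\psi(z \mid x, y)} \right] \\
  &\ge \mathbb{E}_{z \sim q_\psi}\!\left[ \log p_\theta(y \mid x, z) \right]
     + \mathbb{E}_{z \sim q_\psi}\!\left[ \log \frac{p_\omega(z \mid x)}{q_\psi(z \mid x, y)} \right] \\
  &= \mathbb{E}_{z \sim q_\psi}\!\left[ \log p_\theta(y \mid x, z) \right]
     - \mathrm{KL}\!\left[ q_\psi(z \mid x, y) \,\|\, p_\omega(z \mid x) \right].
\end{align*}
```

The inequality holds because the logarithm is concave, and it is tight when $q_\psi(z \mid x, y)$ equals the true posterior $p(z \mid x, y)$.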
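This leads to the training loss function to be minimized:

$$\mathcal{L}(\theta, \psi, \omega;\, x, y) = \|y - g_\theta(x, z)\|^2 + \lambda\, \mathrm{KL}\big[q_\psi(z \mid x, y)\,\|\,p_\omega(z \mid x)\big], \tag{6}$$

where $g_\theta(x, z)$ is the deprojection estimate, $q_\psi$ and $p_\omega$ are the variational posterior and conditional prior, and $\lambda$ (which absorbs the noise variance $\sigma^2$) is a tradeoff parameter capturing the relative importance of the regularization term. The per-pixel reconstruction term in Eq. (6) can result in blurry outputs. For datasets with subtle details such as face images, we also add a perceptual error, computed over a learned feature space [13, 19, 45]. We use a distance function $d_\eta(y, g_\theta(x, z))$, computed over high-dimensional features learned by the VGG16 network [35] with parameters $\eta$, trained to perform classification on ImageNet.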
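A minimal NumPy sketch of a loss of this form is given below. It assumes diagonal Gaussian covariances parameterized by log-variances, which the text does not state explicitly; the function names and the value of `lam` are hypothetical.

```python
import numpy as np

def kl_diag_gaussians(mu_q, logvar_q, mu_p, logvar_p):
    """KL[q || p] for two Gaussians with diagonal covariances (closed form)."""
    var_q, var_p = np.exp(logvar_q), np.exp(logvar_p)
    return 0.5 * np.sum(logvar_p - logvar_q + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)

def deprojection_loss(y, y_hat, mu_q, logvar_q, mu_p, logvar_p, lam):
    """Squared reconstruction error plus lam times the KL regularizer."""
    recon = np.sum((y - y_hat) ** 2)
    return recon + lam * kl_diag_gaussians(mu_q, logvar_q, mu_p, logvar_p)

mu, logvar = np.zeros(10), np.zeros(10)              # 10-dimensional latent variable
print(kl_diag_gaussians(mu, logvar, mu, logvar))     # 0.0 for identical distributions
print(kl_diag_gaussians(mu + 1.0, logvar, mu, logvar))  # 5.0: each unit mean shift adds 0.5
```

The perceptual term is omitted here because it depends on pretrained network features.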
3.2 Network Architectures
We implement the deprojection function $g_\theta$ and the Gaussian parameters of the variational posterior $q_\psi(z \mid x, y)$ and the conditional prior $p_\omega(z \mid x)$ with neural networks. Fig. 3 depicts the architecture for the 2D-to-3D temporal deprojection task. Our 2D-to-3D spatial deprojection architecture is nearly identical, differing only in the dimensions of $x$ and the reshaping operator’s dimension ordering. We handle 1D-to-2D deprojections by using the lower-dimensional versions of the convolution and reshaping operators. The number of convolutional layers and the number of parameters vary by dataset based on their complexities.
3.2.1 Posterior and Prior Encoders
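The encoder for the distribution parameters of the posterior $q_\psi(z \mid x, y)$ applies a series of strided 3D convolutions until a low-resolution volume is reached. We flatten this volume and use two fully connected layers to obtain $\mu_\psi$ and $\Sigma_\psi$, the distribution parameters. The encoder for the conditional prior $p_\omega(z \mid x)$ is implemented in a similar way, with 2D strided convolutions. One $z$ is drawn from $q_\psi(z \mid x, y)$ and fed to the deprojection function. At test time, $z$ is drawn from $p_\omega(z \mid x)$ to visualize results.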
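The two sampling regimes described above can be sketched with the reparametrization trick in NumPy. The mean and log-variance vectors below are hypothetical stand-ins for encoder outputs, and the diagonal covariance is an assumption of this sketch.

```python
import numpy as np

def sample_latent(mu, logvar, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, I) (reparametrization trick)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

rng = np.random.default_rng(0)
latent_dim = 10

# Stand-ins for the posterior encoder outputs (training) and prior encoder outputs (test).
mu_q, logvar_q = rng.standard_normal(latent_dim), np.full(latent_dim, -2.0)
mu_p, logvar_p = np.zeros(latent_dim), np.zeros(latent_dim)

z_train = sample_latent(mu_q, logvar_q, rng)   # training: z drawn from the posterior
z_test = sample_latent(mu_p, logvar_p, rng)    # test time: z drawn from the conditional prior
print(z_train.shape, z_test.shape)             # (10,) (10,)
```

Because the noise `eps` is sampled independently of the distribution parameters, gradients can flow through `mu` and `logvar` when the same computation is expressed in a differentiable framework.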
3.2.2 Deprojection Function
The function $g_\theta(x, z)$ deprojects $x$ into an estimate $\hat{y}$. We first use a UNet-style architecture [32] to compute per-pixel features of $x$. The UNet consists of two stages. In the first stage, we apply a series of strided 2D convolutional operators to extract multiscale features. We apply a fully connected layer to $z$, reshape these activations into an image, and concatenate this image to the coarsest features. The second stage applies a series of 2D convolutions and upsampling operations to synthesize an image of the same dimensions as $x$ and many more data channels. Activations from the first stage are concatenated to the second stage activations with skip connections to propagate learned features.
We expand the resulting image along the collapsed dimension to produce a 3D volume. To do this, we apply a 2D convolution to produce $T \cdot F$ data channels, where $T$ is the size of the collapsed dimension (time in this case), and $F$ is some number of features. Finally, we reshape this image into a 3D volume, and apply a few 3D convolutions to refine and produce a signal estimate $\hat{y}$.
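The channel-to-volume reshaping step can be illustrated in NumPy as follows. The sizes are arbitrary and the channel ordering (collapsed dimension first, then features) is an assumption of this sketch rather than a detail given in the text.

```python
import numpy as np

H, W, T, F = 4, 5, 3, 2   # hypothetical sizes: image height/width, collapsed size, features

# Stand-in for the output of the final 2D convolution: an image with T * F channels.
features = np.arange(H * W * T * F, dtype=float).reshape(H, W, T * F)

# Split the channel axis into (collapsed dimension, features) to obtain a 3D volume.
volume = features.reshape(H, W, T, F)
print(volume.shape)       # (4, 5, 3, 2)

# For a spatial deprojection, the recovered axis would be moved to the position of the
# collapsed dimension instead, e.g. to the front when rows were collapsed.
volume_rows = np.moveaxis(volume, 2, 0)
print(volume_rows.shape)  # (3, 4, 5, 2)
```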
4 Experiments and Results
We first evaluate our method on 1D-to-2D spatial deprojections of human faces using FacePlace [31]. We then show results for 2D-to-3D spatial deprojections using an in-house dataset of human gait videos collected by the authors. Finally, we demonstrate 2D-to-3D temporal deprojections using the Moving MNIST [37] dataset. We focus on projections where pixels are averaged along a dimension for all experiments. For all experiments we split the data into non-overlapping train/test/validation groups.
4.1 Implementation
We implement our models in Keras [10] with a Tensorflow [1] backend. We use the ADAM optimizer [22]. We trained separate models for each experiment. We select the regularization hyperparameter $\lambda$ separately for each dataset such that the KL term falls within a target range on our validation data, to obtain adequate data reconstruction while avoiding mode collapse. We set the dimension of $z$ to 10 for all experiments.
4.2 Spatial Deprojections with FacePlace
FacePlace consists of over 5,000 images of 236 different people. There are many sources of variability, including different ethnicities, multiple views, facial expressions, and props. We randomly held out all images for 30 individuals to form a test set. We scaled images to a fixed resolution and performed data augmentation with translation, scaling and saturation variations. We compare our method against the following baselines:
Nearest neighbor selector ($k$-NN): Selects the $k$ images from the training dataset with projections closest to the test projection using mean squared error distance.
A deterministic model (DET) identical to the deprojection network of our method, without the incorporation of a latent variable $z$.
A linear minimum mean squared error (LMMSE) estimator which assumes that $x$ and $y$ are drawn from distributions such that the estimate $\hat{y}$ is linear in $x$: $\hat{y} = Ax + b$ for some parameters $A$ and $b$. Minimizing the expected MSE of $\hat{y}$ yields a closed form expression for $\hat{y}$:

$$\hat{y} = \Sigma_{yx} \Sigma_{xx}^{-1} (x - \bar{x}) + \bar{y},$$

where $\bar{x}$ and $\bar{y}$ are the means of $x$ and $y$, $\Sigma_{xx}$ is the covariance matrix of $x$, and $\Sigma_{yx}$ is the cross-covariance matrix of $y$ and $x$.
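The LMMSE baseline can be sketched in NumPy with sample means and covariances. The toy data, function names, and the use of a pseudo-inverse are assumptions made for illustration; this is not the code used in our experiments.

```python
import numpy as np

def fit_lmmse(X, Y):
    """Fit y_hat = A (x - x_mean) + y_mean from rows of X (projections) and Y (signals)."""
    x_mean, y_mean = X.mean(axis=0), Y.mean(axis=0)
    Xc, Yc = X - x_mean, Y - y_mean
    cov_xx = Xc.T @ Xc / len(X)            # sample covariance of the projections
    cov_yx = Yc.T @ Xc / len(X)            # sample cross-covariance of signals and projections
    A = cov_yx @ np.linalg.pinv(cov_xx)    # pseudo-inverse guards against a singular covariance
    return A, x_mean, y_mean

def predict_lmmse(x, A, x_mean, y_mean):
    return A @ (x - x_mean) + y_mean

# Toy data: 4x3 "images" projected to length-3 vectors by averaging over rows.
rng = np.random.default_rng(0)
signals = rng.random((200, 4, 3))
projections = signals.mean(axis=1)
A, x_mean, y_mean = fit_lmmse(projections, signals.reshape(200, -1))

y_new = rng.random((4, 3))
x_new = y_new.mean(axis=0)
y_hat = predict_lmmse(x_new, A, x_mean, y_mean).reshape(4, 3)
print(np.allclose(y_hat.mean(axis=0), x_new))   # True: reprojection reproduces the input
```

The final check shows why a linear estimator fit under an averaging projection reprojects exactly onto its input, even though the estimate itself may be far from the true signal.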
For both our method and DET, we used the perceptual loss metric. Fig. 4 presents visual results, with a few randomly chosen samples from our method. $k$-NN varies in performance depending on the test example, and can sometimes produce faces from the wrong person. LMMSE produces very blurry outputs, indicating the highly nonlinear nature of this task. DET produces less blurry outputs, but still often merges different plausible faces together. Our method captures uncertainty of head orientations as well as appearance variations, such as hair color and facial structure. Ambiguity in head orientation is more apparent with the horizontal projections, since pose changes affect that dimension the most. The outputs of our method are also sharper than LMMSE and DET, and are more consistent with ground truth than $k$-NN.
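The nearest neighbor baseline listed above amounts to a lookup over training projections. A minimal NumPy sketch follows, with random arrays standing in for the training set and a hypothetical function name.

```python
import numpy as np

def knn_deproject(x_test, X_train, Y_train, k):
    """Return the k training signals whose projections are closest (in MSE) to x_test."""
    flat = X_train.reshape(len(X_train), -1)
    errors = np.mean((flat - x_test.ravel()) ** 2, axis=1)
    nearest = np.argsort(errors)[:k]       # indices sorted from best to worst match
    return Y_train[nearest]

rng = np.random.default_rng(0)
Y_train = rng.random((50, 8, 8))           # stand-in training images
X_train = Y_train.mean(axis=1)             # their 1D projections (rows averaged)

candidates = knn_deproject(X_train[7], X_train, Y_train, k=3)
print(candidates.shape)                    # (3, 8, 8)
```

Since the selector can only return training examples, its outputs are always sharp but may belong to a different identity than the test subject.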
We also quantitatively evaluate the models. We use PSNR (peak signal-to-noise ratio, higher is better) to measure reconstruction quality between images. For each test projection, we sample $k$ deprojection estimates from each model (DET always returns the same estimate) and record the highest PSNR between any estimate and the ground truth image. For each deprojection estimate, we reproject and record the average PSNR of the output projections with respect to the test (initial) projection.
Fig. 5 illustrates the results with a varying number of samples $k$ for 100 test projections. As the number of samples increases, our method’s signal (deprojection) PSNR improves, highlighting the advantage of our probabilistic approach. Best estimates from $k$-NN approach the best estimates of our method in signal reconstruction with increasing $k$, but many poor estimates are also retrieved by $k$-NN as evidenced by its decreasing projection PSNR curve. LMMSE has perfect projection PSNR because it captures the exact linear relationship between the signal and projection by construction. DET has higher signal PSNR when drawing one sample, because it averages over plausible images, while our method does not. Our proposed method surpasses DET after 1 sample.
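The two metrics used in this evaluation can be sketched as follows. The sketch assumes intensities in [0, 1] (hence `peak=1.0`) and uses noisy copies of the ground truth as hypothetical stand-ins for model samples.

```python
import numpy as np

def psnr(a, b, peak=1.0):
    """Peak signal-to-noise ratio in dB, assuming intensities in [0, peak]."""
    mse = np.mean((a - b) ** 2)
    return np.inf if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)

def evaluate(samples, y_true, x_test, axis):
    """Best signal PSNR over the samples, and mean PSNR of their reprojections."""
    signal_psnr = max(psnr(s, y_true) for s in samples)
    projection_psnr = float(np.mean([psnr(s.mean(axis=axis), x_test) for s in samples]))
    return signal_psnr, projection_psnr

rng = np.random.default_rng(0)
y_true = rng.random((8, 8))
x_test = y_true.mean(axis=0)
samples = [np.clip(y_true + 0.1 * rng.standard_normal((8, 8)), 0, 1) for _ in range(5)]

for k in (1, 3, 5):
    print(k, evaluate(samples[:k], y_true, x_test, axis=0))
```

Taking the maximum over samples means the signal PSNR can only stay the same or improve as $k$ grows, whereas the projection PSNR is an average and can move in either direction.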
4.3 Spatial Deprojections with Walking Videos
We qualitatively evaluate our method on reconstructing human gait videos from vertical spatial projections. This scenario is of practical relevance for corner cameras, described in Sec. 2.1. We collected 35 videos of 30 subjects walking in a specified area for one minute each. Subjects had varying attire, heights (5’2” to 6’5”), ages (18-60), and sexes (18m/12f). Subjects were not instructed to walk in any particular way, and many walked in odd patterns. The background is identical for all videos. We downsampled the videos to 5 frames per second, reduced the resolution of each frame, and applied data augmentation of horizontal translations to each video. We held out 6 subjects to produce a test set. We predict sequences of 24 frames (roughly 5 seconds in real time).
Fig. 6 presents several reconstruction examples, obtained by setting $z$ to the mean of the prior distribution $p(z \mid x)$. Our method recovers many details from the vertical projections alone. The background is easily synthesized because it is consistent among all videos in the dataset. Remarkably, many appearance and pose details of the subjects are also recovered. Subtle fluctuations in pixel intensity and the shape of the projected foreground trace contain clues about the foreground signal along the collapsed dimension. For example, the method seems to learn that a trace that gets darker and wider with time likely corresponds to a person walking closer to the camera.
The third subject is an illustrative result for which our method separates the white shirt from black pants despite their aspects not being obvious in the projection. Projected details, along with a learned pattern that shirts are often lighter colors than pants, likely enable this recovery. Finally, the method may struggle with patterns rarely seen in the training data, such as the large step by the fourth subject in the fifth frame.
4.4 Temporal Deprojections with Moving MNIST
The Moving MNIST dataset consists of video sequences of two moving handwritten digits. The digits can occlude one another, and bounce off the edges of the frames. Given a dataset of fixed-size video subclips, we generate each projection by averaging the frames in time, similar to other studies that generate motion-blurred images at a large scale [18, 21, 27, 28]. Despite the simple appearance and dynamics of this dataset, synthesizing digit appearances and capturing the plausible directions of each trajectory is challenging.
Sample outputs of our method for three test examples are visualized in Fig. 9. To illustrate the temporal aspects learned by our method, we sample 10 sequences from our method for each projection, and present the sequences with the lowest mean squared error with respect to the ground truth clip run forwards and backwards. Our method is able to infer the shape of the characters from a dramatically motion-blurred input image, difficult to interpret even by human standards. Furthermore, our method captures the multimodal dynamics of the dataset, which we illustrate by presenting the two motion sequences: the first sequence matches the temporal direction of the ground truth, and the second matches the reverse temporal progression.
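The forward/backward ambiguity discussed above follows directly from the averaging projection, which discards frame order. The toy clip below (a square moving one pixel per frame) is a hypothetical example, not data from Moving MNIST.

```python
import numpy as np

T, H, W = 6, 8, 12
clip = np.zeros((T, H, W))
for t in range(T):
    clip[t, 2:5, t:t + 3] = 1.0            # a 3x3 square moving one pixel right per frame

blur_forward = clip.mean(axis=0)           # temporal projection of the clip
blur_backward = clip[::-1].mean(axis=0)    # temporal projection of the time-reversed clip

print(np.array_equal(clip, clip[::-1]))          # False: the two videos differ
print(np.allclose(blur_forward, blur_backward))  # True: their projections are identical
```

Any model conditioned only on the motion-blurred image therefore has to represent both temporal directions, which is what sampling from a distribution allows.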
We quantify our accuracy using PSNR curves, similar to the first experiment, displayed in Fig. 8. Because of the prohibitive computational costs of generating the full joint covariance matrix, we do not evaluate LMMSE in this experiment. DET produces blurry sequences by merging different plausible temporal orderings. Similar to the first experiment, this results in DET outputs having the best expected signal (deprojection) PSNR only for $k = 1$. Our method clearly outperforms DET in signal PSNR for $k > 1$. DET performs better in projection PSNR, since in this experiment an average of all plausible sequences yields a very accurate projection. $k$-NN performs relatively worse in this experiment compared to the FacePlace experiments, because of the difficulty in finding nearest neighbors in higher dimensions.
5 Conclusion
In this work, we introduced the novel problem of visual deprojection: synthesizing an image or video that has been collapsed along a dimension into a lower-dimensional observation. We presented a first general method that handles both images and video, and projections along any dimension of these data. We addressed the uncertainty of the task by first introducing a probabilistic model that captures the distribution of original signals conditioned on a projection. We implemented the parameterized functions of this model with CNNs to learn shared image structures in each domain and enable accurate signal synthesis.
Though information from a collapsed dimension is often seemingly unrecoverable from a projection to the naked eye, our results demonstrate that much of the “lost” information is recoverable. We demonstrated this by reconstructing subtle details of faces in images and accurate motion in videos from spatial projections alone. Finally, we illustrated that videos can be reconstructed from dramatically motion-blurred images, even with multimodal trajectories, using the Moving MNIST dataset. This work illustrates promising results in a new, ambitious imaging task and opens exciting possibilities in future applications of revealing the invisible.
Acknowledgments
This work was funded by DARPA REVEAL Program under Contract No. HR0011-16-C-0030, NIH 1R21AG050122 and Wistron Corp.
References
[1] Tensorflow: a system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pp. 265–283.
[2] (2017) Stochastic variational video prediction. arXiv preprint arXiv:1710.11252.
[3] (2012) Depth information in human gait analysis: an experimental study on gender recognition. In International Conference Image Analysis and Recognition, pp. 98–105.
[4] (2017) Turning corners into cameras: principles and methods. In International Conference on Computer Vision, Vol. 1, pp. 8.
[5] Unsupervised pixel-level domain adaptation with generative adversarial networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Vol. 1, pp. 7.
[6] (2005) Signal recovery from random projections. In Computational Imaging III, Vol. 5674, pp. 76–87.
[7] (2006) Near-optimal signal recovery from random projections: universal encoding strategies?. IEEE Transactions on Information Theory 52 (12), pp. 5406–5425.
[8] (2008) An introduction to compressive sampling. IEEE Signal Processing Magazine 25 (2), pp. 21–30.
[9] (2017) Coherent online video style transfer. In Proc. Intl. Conf. Computer Vision (ICCV).
[10] (2015) Keras. https://keras.io
[11] (2008) Motion from blur.
[12] (2006) Compressed sensing. IEEE Transactions on Information Theory 52 (4), pp. 1289–1306.
[13] (2016) Generating images with perceptual similarity metrics based on deep networks. In Advances in Neural Information Processing Systems, pp. 658–666.
[14] (2006) Removing camera shake from a single photograph. In ACM Transactions on Graphics (TOG), Vol. 25, pp. 787–794.
[15] (2002) Example-based super-resolution. IEEE Computer Graphics and Applications 22 (2), pp. 56–65.
[16] (2006) Signal reconstruction from noisy random projections. IEEE Transactions on Information Theory 52 (9), pp. 4036–4048.
[17] Image-to-image translation with conditional adversarial networks. arXiv preprint.
[18] (2018) Learning to extract a video sequence from a single motion-blurred image. arXiv preprint arXiv:1804.04065.
[19] (2016) Perceptual losses for real-time style transfer and super-resolution. In European Conference on Computer Vision, pp. 694–711.
[20] (1999) An introduction to variational methods for graphical models. Machine Learning 37 (2), pp. 183–233.
[21] (2017) Online video deblurring via dynamic temporal blending network. In Computer Vision (ICCV), 2017 IEEE International Conference on, pp. 4058–4067.
[22] (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
[23] (2013) Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114.
[24] (2018) Stochastic adversarial video prediction. arXiv preprint arXiv:1804.01523.
[25] Motion-invariant photography. ACM Transactions on Graphics (SIGGRAPH 2008).
[26] (2016) X-ray imaging: fundamentals, industrial techniques and applications. CRC Press.
[27] (2017) Deep multi-scale convolutional neural network for dynamic scene deblurring. In CVPR, Vol. 1, pp. 3.
[28] (2017) Motion deblurring in the wild. In German Conference on Pattern Recognition, pp. 65–77.
[29] Conditional image synthesis with auxiliary classifier gans. arXiv preprint arXiv:1610.09585.
[30] Bringing alive blurred moments. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6830–6839.
[31] (2012) Recognizing disguised faces. Visual Cognition 20 (2), pp. 143–169.
[32] (2015) U-net: convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 234–241.
[33] Temporal generative adversarial nets with singular value clipping. In IEEE International Conference on Computer Vision (ICCV), Vol. 2, pp. 5.
[34] (2005) Space-time super-resolution. IEEE Transactions on Pattern Analysis and Machine Intelligence 27 (4), pp. 531–545.
[35] (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
[36] (2015) Learning structured output representation using deep conditional generative models. In Advances in Neural Information Processing Systems, pp. 3483–3491.
[37] (2015) Unsupervised learning of video representations using lstms. In International Conference on Machine Learning, pp. 843–852.
[38] (2016) Unsupervised cross-domain image generation. arXiv preprint arXiv:1611.02200.
[39] (2017) Mocogan: decomposing motion and content for video generation. arXiv preprint arXiv:1707.04993.
[40] (2016) Generating videos with scene dynamics. In Advances in Neural Information Processing Systems, pp. 613–621.
[41] (2018) Video-to-video synthesis. In Advances in Neural Information Processing Systems (NIPS).
[42] (2017) High-resolution image synthesis and semantic manipulation with conditional gans. arXiv preprint arXiv:1711.11585.
[43] (2018) Learning and using the arrow of time. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8052–8060.
[44] (2018) Visual dynamics: stochastic future generation via layered cross convolutional networks. IEEE Transactions on Pattern Analysis and Machine Intelligence.
[45] The unreasonable effectiveness of deep features as a perceptual metric. arXiv preprint.
[46] (2015) Image demosaicing. In Color Image and Video Enhancement, pp. 13–54.
[47] (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. arXiv preprint.