1 Introduction
Captured visual data is often a projection of a higher-dimensional signal “collapsed” along some dimension. For example, long-exposure, motion-blurred photographs are produced by projecting motion trajectories along the time dimension [11, 25]. Recent “corner cameras” leverage the fact that a corner-like edge occluder vertically projects light rays of hidden scenes to produce a 1D video [4]. Medical x-ray machines use spatial projectional radiography, where x-rays are distributed by a generator, and the imaged anatomy affects the signal captured by the detector [26]. Given projected data, is it possible to synthesize the original signal? In this work, we present an algorithm that enables this synthesis. We focus on recovering images and video from spatial projections, and recovering a video from a long-exposure image obtained via temporal projection.
Inverting projected, high-dimensional signals is ill-posed, and infeasible without priors or constraints on the true signal. This ambiguity includes object orientations and poses in spatial projections, and the “arrow of time” [43] in temporal projections (Fig. 1). We leverage the fact that the effective dimension of most natural images is often much lower than that of the pixel representation, because of the shared structure in a given domain. We handle this ambiguity by formulating a probabilistic model for the generation of signals given a projection. The model consists of parametric functions that we implement with convolutional neural networks (CNNs). Using variational inference, we derive an intuitive objective function. Sampling from this deprojection network at test time produces plausible examples of signals that are consistent with an input projection.
There is a rich computer vision literature on recovering high-dimensional data from partial observations. Single-image super-resolution [15], image demosaicing [46], and motion blur removal [14] are all special cases. Here, we focus on projections where a spatial or temporal dimension is entirely removed, resulting in dramatic loss of information. To the best of our knowledge, ours is the first general recovery method in the presence of a collapsed dimension. We build on insights from related problems to develop a first solution for extrapolating appearance and motion cues (in the case of videos) to unseen dimensions. In particular, we leverage recent advances in neural network-based synthesis and stochastic prediction tasks [2, 17, 44]. We evaluate our work both quantitatively and qualitatively. We demonstrate that our method can recover the distribution of human gait videos from 2D space-time images, and face images from their 1D spatial projections. We also show that our method can model distributions of videos conditioned on motion-blurred images using the Moving MNIST dataset [37].
2 Related Work
Projections play a central role in computer vision, starting from the initial stages of image formation, where light from the 3D world is projected onto a 2D plane. We focus on a particular class of projections where higher-dimensional signals of interest are collapsed along one dimension to produce observed data.
2.1 Corner Cameras
Corner cameras exploit reflected light from a hidden scene occluded by obstructions with edges to “see around the corner” [4]. Light rays reflected from the scene at the same angular position relative to the corner are vertically integrated to produce a 1D video (one spatial dimension + time). That study used the temporal gradient of the 1D video to coarsely indicate angular positions of the human with respect to the corner, but did not reconstruct the hidden scene. As an initial step towards this difficult reconstruction task, we show that videos and images can be recovered after collapsing one spatial dimension.
2.2 Compressed Sensing
Compressed sensing techniques efficiently reconstruct a signal from limited observations by finding solutions to underdetermined linear systems [8, 12]. This is possible because of the redundancy of natural signals in an appropriate basis. Several methods show that it is possible to accurately reconstruct a signal from a small number ($s$) of bases through convex optimization, even when the bases are chosen randomly [6, 7, 16]. We tackle an extreme variant where one dimension of a signal is completely lost. We also take a learning-based approach to the problem that yields a distribution of potential signals instead of a single estimate.
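As a generic illustration of the compressed sensing principle discussed above (not part of our method), a sparse signal can often be recovered from random linear projections via iterative soft-thresholding (ISTA), a simple scheme for the lasso problem; the shapes, sparsity level, and constants below are arbitrary:

```python
import numpy as np

def ista(A, y, lam=0.01, n_iter=2000):
    """Recover a sparse x from y = A @ x by iterative soft-thresholding (ISTA)."""
    step = 1.0 / np.linalg.norm(A, 2) ** 2      # 1/L, L = spectral norm squared
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        grad = A.T @ (A @ x - y)                # gradient of 0.5*||Ax - y||^2
        v = x - step * grad
        x = np.sign(v) * np.maximum(np.abs(v) - step * lam, 0.0)  # soft-threshold
    return x

rng = np.random.default_rng(0)
n, m, s = 200, 80, 5                            # signal length, measurements, sparsity
A = rng.standard_normal((m, n)) / np.sqrt(m)    # random measurement matrix
x_true = np.zeros(n)
x_true[rng.choice(n, s, replace=False)] = rng.standard_normal(s)
y = A @ x_true                                  # underdetermined: m < n
x_hat = ista(A, y)
```

With enough measurements relative to the sparsity, the estimate closely matches the true signal even though the linear system has more unknowns than equations.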
2.3 Conditional Image/Video Synthesis and Future Frame Prediction
Neural network-based image and video synthesis has received significant attention. In conditional image synthesis, an image is synthesized conditioned on some other information, such as a class label or another image of the same dimension (image-to-image translation) [5, 17, 29, 38, 42, 47]. In contrast to our work, most of these studies condition on data of the same dimensionality as the output. Video synthesis algorithms mainly focus on unconditional generation [33, 39, 40] or video-to-video translation [9, 34, 41]. In future video frame prediction, frames are synthesized conditioned on one or more past images. Several of these algorithms treat video generation as a stochastic problem [2, 24, 44], using a variational autoencoder (VAE) style framework [23]. The inputs and outputs in these problems take a similar form to ours, but the information in the input is different. We draw insights from the stochastic formulation in these studies for our task.

2.4 Inverting a Motion-blurred Image to Video
One application we explore is the formation of videos from dramatically motion-blurred images, created by temporally aggregating photons from a scene over an extended period of time. Two recent studies present the deterministic recovery of a video sequence from a single motion-blurred image [18, 30]. We propose a general deprojection framework for dimensions including, but not limited to, time. In addition, our framework is probabilistic, capturing the distribution of signal variability instead of a single deterministic output (see Fig. 1).
3 Methods
We assume a dataset of pairs of original signals and projections $\{(y_i, x_i)\}$, where $d$ is the number of dimensions of $y$ and $k$ is the projected dimension. We assume a projection function $x = P(y; w)$ with parameters $w$. In our experiments, we focus on a case often observed in practice, where $P$ is a linear operation in $y$ along $k$, such as averaging: $x = \frac{1}{n_k} \sum_{j=1}^{n_k} y^{(k)}_j$, where $y^{(k)}_j$ is the $j$th slice of $y$ along dimension $k$ and $n_k$ is the size of that dimension. For example, a grayscale video might get projected to an image by averaging pixels across time. Deprojection is a highly underconstrained problem: even if the values of $w$ are known, there are $n_k$ times as many variables (the size of $y$) as constraints (the size of $x$).
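The averaging projection can be written in a few lines of NumPy; the toy shapes below are illustrative, not the paper's:

```python
import numpy as np

# A toy grayscale "video": 24 frames of 64x64 pixels, shape (t, h, w).
video = np.random.default_rng(0).random((24, 64, 64))

# Temporal projection: average along the time axis, as for a long-exposure image.
projection = video.mean(axis=0)          # shape (64, 64)

# The inverse problem is underconstrained: one constraint per projected pixel,
# but 24 unknowns per pixel along the collapsed dimension.
ratio = video.size // projection.size    # 24 unknowns per constraint
```

A 1D spatial projection of an image (as for corner cameras) is the same operation with a spatial axis in place of the time axis.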
We aim to capture the distribution $p(y \mid x)$ for a particular scenario with data. We first present a probabilistic model for the deprojection task which builds on the conditional VAE (CVAE) [36] (Fig. 2). We let $z$ be a multivariate normal latent variable which captures variability of $y$ unexplainable from $x$ alone. Intuitively, $z$ encodes information orthogonal to the unprojected dimensions. For example, it could capture the temporal variation of the various scenes that may have led to a long-exposure image.
We define $p(y \mid x, z)$ as a Gaussian distribution:

$$p(y \mid x, z) = \mathcal{N}\!\left(y;\, f_\theta(x, z),\, \sigma^2 I\right), \qquad (1)$$

where $\sigma^2$ is a per-pixel noise variance and $f_\theta(\cdot)$ is a deprojection function, parameterized by $\theta$ and responsible for producing a noiseless estimate of $y$ given $x$ and $z$.

3.1 Variational Inference and Loss Function
Our goal is to estimate $p(y \mid x)$:

$$p(y \mid x) = \int_z p(y \mid x, z)\, p(z \mid x)\, dz. \qquad (2)$$

Evaluating this integral directly is intractable because of its reliance on potentially complex parametric functions and the intractability of estimating the posterior $p(z \mid x, y)$. We instead use variational inference to obtain a lower bound of the likelihood, and use stochastic gradient descent to optimize it [20, 23]. We introduce an approximate posterior distribution $q(z \mid x, y)$:

$$\log p(y \mid x) = \log \mathbb{E}_{q(z \mid x, y)}\!\left[\frac{p(y \mid x, z)\, p(z \mid x)}{q(z \mid x, y)}\right]. \qquad (3)$$
Using Jensen’s inequality, we obtain the following evidence lower bound (ELBO) for $\log p(y \mid x)$:

$$\log p(y \mid x) \geq \mathbb{E}_{q(z \mid x, y)}\!\left[\log p(y \mid x, z)\right] - \mathrm{KL}\!\left(q(z \mid x, y)\,\|\,p(z \mid x)\right), \qquad (4)$$

where $\mathrm{KL}(\cdot \,\|\, \cdot)$ is the Kullback-Leibler divergence encouraging the variational distribution to approximate the conditional prior, resulting in a regularized embedding. We estimate the expectation term by drawing one sample $\hat{z} \sim q(z \mid x, y)$ from within the network using the reparametrization trick [23] and evaluating the expression:

$$\log p(y \mid x, \hat{z}) = -\frac{1}{2\sigma^2}\,\|y - f_\theta(x, \hat{z})\|_2^2 + \mathrm{const}. \qquad (5)$$
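The reparametrized draw of $\hat{z}$ can be sketched in NumPy (the deep learning framework's automatic differentiation would supply gradients in practice; the function name is ours):

```python
import numpy as np

def sample_z(mu, logvar, rng):
    """Reparametrization trick: z = mu + sigma * eps, with eps ~ N(0, I).

    The randomness is isolated in eps, so gradients with respect to the
    encoder outputs mu and logvar can flow through this sampling step.
    """
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

rng = np.random.default_rng(0)
mu, logvar = np.zeros(10), np.zeros(10)   # latent dimension 10, as in Sec. 4.1
z = sample_z(mu, logvar, rng)
```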
This leads to the training loss function to be minimized:

$$\mathcal{L}(\theta; x, y) = \|y - f_\theta(x, \hat{z})\|_2^2 + \lambda\, \mathrm{KL}\!\left(q(z \mid x, y)\,\|\,p(z \mid x)\right), \qquad (6)$$

where $\lambda$ is a tradeoff parameter capturing the relative importance of the regularization term. The per-pixel reconstruction term in Eq. (6) can result in blurry outputs. For datasets with subtle details such as face images, we also add a perceptual error, computed over a learned feature space [13, 19, 45]. We use a distance function [45], computed over high-dimensional features learned by the VGG16 network [35], trained to perform classification on ImageNet.
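A minimal NumPy sketch of the loss in Eq. (6), using the closed-form KL divergence between the two diagonal Gaussians (function names and the default $\lambda$ are ours, not the paper's):

```python
import numpy as np

def kl_diag_gaussians(mu_q, logvar_q, mu_p, logvar_p):
    """KL( N(mu_q, diag(exp(logvar_q))) || N(mu_p, diag(exp(logvar_p))) )."""
    return 0.5 * np.sum(
        logvar_p - logvar_q
        + (np.exp(logvar_q) + (mu_q - mu_p) ** 2) / np.exp(logvar_p)
        - 1.0
    )

def deprojection_loss(y, y_hat, mu_q, logvar_q, mu_p, logvar_p, lam=0.1):
    """Per-pixel reconstruction error plus a lambda-weighted KL term (Eq. 6)."""
    recon = np.mean((y - y_hat) ** 2)
    return recon + lam * kl_diag_gaussians(mu_q, logvar_q, mu_p, logvar_p)
```

The KL term vanishes when the posterior matches the conditional prior, and a mean shift of 1 in every one of the 10 latent dimensions contributes 0.5 per dimension.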
3.2 Network Architectures
We implement $f_\theta$ and the Gaussian parameters of $q(z \mid x, y)$ and $p(z \mid x)$ with neural networks. Fig. 3 depicts the architecture for the 2D-to-3D temporal deprojection task. Our 2D-to-3D spatial deprojection architecture is nearly identical, differing only in the dimensions of the data and the reshaping operator's dimension ordering. We handle 1D-to-2D deprojections by using the lower-dimensional versions of the convolution and reshaping operators. The number of convolutional layers and the number of parameters vary by dataset based on its complexity.
3.2.1 Posterior and Prior Encoders
The encoder for the distribution parameters of the posterior $q(z \mid x, y)$ is implemented using a series of strided 3D convolutional operators and Leaky ReLU activations until a sufficiently low-resolution volume is reached. We flatten this volume and use two fully connected layers to obtain the distribution parameters $\mu$ and $\sigma$. The encoder for the conditional prior $p(z \mid x)$ is implemented in a similar way, with 2D strided convolutions. One sample $\hat{z}$ is drawn from $q(z \mid x, y)$ and fed to the deprojection function. At test time, $z$ is drawn from $p(z \mid x)$ to visualize results.

3.2.2 Deprojection Function
The function $f_\theta$ deprojects $x$ into an estimate $\hat{y}$. We first use a U-Net-style architecture [32] to compute per-pixel features of $x$. The U-Net consists of two stages. In the first stage, we apply a series of strided 2D convolutional operators to extract multi-scale features. We apply a fully connected layer to $\hat{z}$, reshape these activations into an image, and concatenate this image to the coarsest features. The second stage applies a series of 2D convolutions and upsampling operations to synthesize an image of the same spatial dimensions as $x$ and many more data channels. Activations from the first stage are concatenated to the second-stage activations with skip connections to propagate learned features.
We expand the resulting image along the collapsed dimension to produce a 3D volume. To do this, we apply a 2D convolution to produce $n_k \times c$ data channels, where $n_k$ is the size of the collapsed dimension (time in this case), and $c$ is some number of features. Finally, we reshape this image into a 3D volume and apply a few 3D convolutions to refine it and produce a signal estimate $\hat{y}$.
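The channels-to-time reshape can be sketched in NumPy. The exact memory layout (which channels map to which time step) is our assumption; the paper does not specify it:

```python
import numpy as np

h, w, t, c = 64, 64, 24, 8
# Output of the final 2D convolution: an image with t*c channels.
features_2d = np.random.default_rng(0).random((h, w, t * c))

# Group the channels into (t, c) and move time to the front, yielding a
# (t, h, w, c) volume that subsequent 3D convolutions can refine.
volume_3d = features_2d.reshape(h, w, t, c).transpose(2, 0, 1, 3)
```

Each time step of the volume is thus synthesized from its own group of $c$ channels at every spatial location.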
4 Experiments and Results
We first evaluate our method on 1D-to-2D spatial deprojections of human faces using FacePlace [31]. We then show results for 2D-to-3D spatial deprojections using an in-house dataset of human gait videos collected by the authors. Finally, we demonstrate 2D-to-3D temporal deprojections using the Moving MNIST [37] dataset. We focus on projections where pixels are averaged along a dimension for all experiments. For all experiments, we split the data into non-overlapping train/validation/test groups.
4.1 Implementation
We implement our models in Keras [10] with a TensorFlow [1] backend. We use the ADAM optimizer [22] with a fixed learning rate, and trained separate models for each experiment. We select the regularization hyperparameter $\lambda$ separately for each dataset such that the KL term falls within a target range on our validation data, to obtain adequate data reconstruction while avoiding mode collapse. We set the dimension of $z$ to 10 for all experiments.

4.2 Spatial Deprojections with FacePlace
FacePlace consists of over 5,000 images of 236 different people. There are many sources of variability, including different ethnicities, multiple views, facial expressions, and props. We randomly held out all images for 30 individuals to form a test set. We scaled images to a fixed resolution and performed data augmentation with translation, scaling, and saturation variations. We compare our method against the following baselines:

- Nearest neighbor selector (NN): selects the images from the training dataset whose projections are closest to the test projection in mean squared error.

- A deterministic model (DET), identical to the deprojection network of our method but without the latent variable $z$.

- A linear minimum mean squared error (LMMSE) estimator, which assumes that $x$ and $y$ are drawn from distributions such that the estimate $\hat{y}$ is linear in $x$: $\hat{y} = Ax + b$ for some parameters $A$ and $b$. Minimizing the expected MSE of $\hat{y}$ yields the closed-form expression

  $$A = \Sigma_{yx} \Sigma_{xx}^{-1}, \qquad b = \mu_y - A \mu_x, \qquad (7)$$

  where $\Sigma_{xx}$ is the covariance matrix of $x$, $\Sigma_{yx}$ is the cross-covariance matrix of $y$ and $x$, and $\mu_x$ and $\mu_y$ are the means of $x$ and $y$.
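The LMMSE baseline can be fitted from paired samples in a few lines of NumPy; the function name and toy data are ours:

```python
import numpy as np

def fit_lmmse(X, Y):
    """Fit the LMMSE estimator y_hat = A x + b from paired samples.

    X: (n_samples, dx) projections; Y: (n_samples, dy) signals.
    Implements Eq. (7): A = Sigma_yx Sigma_xx^{-1}, b = mu_y - A mu_x.
    """
    mu_x, mu_y = X.mean(0), Y.mean(0)
    Xc, Yc = X - mu_x, Y - mu_y
    sigma_xx = Xc.T @ Xc / len(X)
    sigma_yx = Yc.T @ Xc / len(X)
    A = sigma_yx @ np.linalg.pinv(sigma_xx)   # pseudo-inverse for numerical safety
    b = mu_y - A @ mu_x
    return A, b

# Sanity check on truly linear data: the estimator recovers the linear map.
rng = np.random.default_rng(0)
A_true = rng.standard_normal((6, 4))
X = rng.standard_normal((5000, 4))
Y = X @ A_true.T + 0.01 * rng.standard_normal((5000, 6))
A_hat, b_hat = fit_lmmse(X, Y)
```

On real image data $d_x$ and $d_y$ are the flattened pixel counts, which is why the covariance matrices become prohibitively large for video (Sec. 4.4).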
For both our method and DET, we used the perceptual loss metric. Fig. 4 presents visual results, with a few randomly chosen samples from our method. NN varies in performance depending on the test example, and can sometimes produce faces from the wrong person. LMMSE produces very blurry outputs, indicating the highly nonlinear nature of this task. DET produces less blurry outputs, but still often merges different plausible faces together. Our method captures uncertainty of head orientations as well as appearance variations, such as hair color and facial structure. Ambiguity in head orientation is more apparent with the horizontal projections, since pose changes affect that dimension the most. The outputs of our methods are also sharper than LMMSE and DET, and are more consistent with ground truth than NN.
We also quantitatively evaluate the models. We use PSNR (peak signal-to-noise ratio; higher is better) to measure reconstruction quality between images. For each test projection, we sample a number of deprojection estimates from each model (DET always returns the same estimate) and record the highest PSNR between any estimate and the ground truth image. For each deprojection estimate, we also reproject and record the average PSNR of the output projections with respect to the test (initial) projection.
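The best-of-$s$ evaluation protocol can be sketched as follows (function names and toy data are ours; images are assumed to lie in $[0, 1]$):

```python
import numpy as np

def psnr(a, b, max_val=1.0):
    """Peak signal-to-noise ratio between two images with values in [0, max_val]."""
    mse = np.mean((a - b) ** 2)
    return 10.0 * np.log10(max_val ** 2 / mse)

def best_of_s_psnr(samples, ground_truth):
    """Highest PSNR between any sampled deprojection estimate and the ground truth."""
    return max(psnr(s, ground_truth) for s in samples)

rng = np.random.default_rng(0)
gt = rng.random((64, 64))
# Stand-ins for s sampled deprojection estimates from a stochastic model.
samples = [np.clip(gt + 0.1 * rng.standard_normal(gt.shape), 0, 1) for _ in range(5)]
score = best_of_s_psnr(samples, gt)
```

Taking the maximum over samples is what lets a stochastic model's curve improve with $s$, while a deterministic model's curve stays flat.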
Fig. 5 illustrates the results with a varying number of samples for 100 test projections. As the number of samples increases, our method's signal (deprojection) PSNR improves, highlighting the advantage of our probabilistic approach. Best estimates from NN approach the best estimates of our method in signal reconstruction as the sample count grows, but NN also retrieves many poor estimates, as evidenced by its decreasing projection PSNR curve. LMMSE has perfect projection PSNR because it captures the exact linear relationship between the signal and projection by construction. DET has higher signal PSNR when drawing one sample, because it averages over plausible images while our method does not; our method surpasses DET after one sample.
4.3 Spatial Deprojections with Walking Videos
We qualitatively evaluate our method on reconstructing human gait videos from vertical spatial projections. This scenario is of practical relevance for corner cameras, described in Sec. 2.1. We collected 35 videos of 30 subjects walking in a specified area for one minute each. Subjects had varying attire, heights (5'2''–6'5''), ages (18–60), and sexes (18 m / 12 f). Subjects were not instructed to walk in any particular way, and many walked in odd patterns. The background is identical for all videos. We downsampled the videos to 5 frames per second and each frame to a fixed resolution, and applied data augmentation of horizontal translations to each video. We held out 6 subjects to produce a test set. We predict sequences of 24 frames (roughly 5 seconds in real time).

Fig. 6 presents several reconstruction examples, obtained by setting $z$ to the mean of the prior distribution. Our method recovers many details from the vertical projections alone. The background is easily synthesized because it is consistent among all videos in the dataset. Remarkably, many appearance and pose details of the subjects are also recovered. Subtle fluctuations in pixel intensity and the shape of the projected foreground trace contain clues about the foreground signal along the collapsed dimension. For example, the method seems to learn that a trace that gets darker and wider with time likely corresponds to a person walking closer to the camera.
The third subject is an illustrative result for which our method separates the white shirt from the black pants, even though neither garment is clearly visible in the projection. Projected details, along with a learned pattern that shirts are often lighter in color than pants, likely enable this recovery. Finally, the method may struggle with patterns rarely seen in the training data, such as the large step by the fourth subject in the fifth frame.
4.4 Temporal Deprojections with Moving MNIST
The Moving MNIST dataset consists of video sequences of two moving handwritten digits. The digits can occlude one another, and bounce off the edges of the frames. Given a dataset of fixed-length video subclips, we generate each projection by averaging the frames in time, similar to other studies that generate motion-blurred images at a large scale [18, 21, 27, 28]. Despite the simple appearance and dynamics of this dataset, synthesizing digit appearances and capturing the plausible directions of each trajectory is challenging.
Sample outputs of our method for three test examples are visualized in Fig. 9. To illustrate the temporal aspects learned by our method, we sample 10 sequences from our method for each projection, and present the sequences with the lowest mean squared error with respect to the ground truth clip run forwards and backwards. Our method is able to infer the shape of the characters from a dramatically motionblurred input image, difficult to interpret even by human standards. Furthermore, our method captures the multimodal dynamics of the dataset, which we illustrate by presenting the two motion sequences: the first sequence matches the temporal direction of the ground truth, and the second matches the reverse temporal progression.
We quantify our accuracy using PSNR curves, similar to the first experiment, displayed in Fig. 8. Because of the prohibitive computational cost of generating the full joint covariance matrix, we do not evaluate LMMSE in this experiment. DET produces blurry sequences by merging different plausible temporal orderings. Similar to the first experiment, DET outputs therefore have the best expected signal (deprojection) PSNR only when one sample is drawn; our method clearly outperforms DET in signal PSNR with more samples. DET performs better in projection PSNR, since in this experiment an average of all plausible sequences yields a very accurate projection. NN performs relatively worse in this experiment compared to the FacePlace experiments, because of the difficulty of finding nearest neighbors in higher dimensions.
5 Conclusion
In this work, we introduced the novel problem of visual deprojection: synthesizing an image or video that has been collapsed along a dimension into a lowerdimensional observation. We presented a first general method that handles both images and video, and projections along any dimension of these data. We addressed the uncertainty of the task by first introducing a probabilistic model that captures the distribution of original signals conditioned on a projection. We implemented the parameterized functions of this model with CNNs to learn shared image structures in each domain and enable accurate signal synthesis.
Though information along a collapsed dimension often appears unrecoverable from a projection, our results demonstrate that much of the “lost” information is in fact recoverable. We demonstrated this by reconstructing subtle details of faces in images and accurate motion in videos from spatial projections alone. Finally, we illustrated that videos can be reconstructed from dramatically motion-blurred images, even with multimodal trajectories, using the Moving MNIST dataset. This work illustrates promising results in a new, ambitious imaging task and opens exciting possibilities in future applications of revealing the invisible.
Acknowledgments
This work was funded by the DARPA REVEAL Program under Contract No. HR0011-16-C-0030, NIH grant 1R21AG050122, and Wistron Corp.
References

[1] (2016) TensorFlow: a system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pp. 265–283.
[2] (2017) Stochastic variational video prediction. arXiv preprint arXiv:1710.11252.
[3] (2012) Depth information in human gait analysis: an experimental study on gender recognition. In International Conference Image Analysis and Recognition, pp. 98–105.
[4] (2017) Turning corners into cameras: principles and methods. In International Conference on Computer Vision, Vol. 1, pp. 8.
[5] (2017) Unsupervised pixel-level domain adaptation with generative adversarial networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Vol. 1, pp. 7.
[6] (2005) Signal recovery from random projections. In Computational Imaging III, Vol. 5674, pp. 76–87.
[7] (2006) Near-optimal signal recovery from random projections: universal encoding strategies? IEEE Transactions on Information Theory 52 (12), pp. 5406–5425.
[8] (2008) An introduction to compressive sampling. IEEE Signal Processing Magazine 25 (2), pp. 21–30.
[9] (2017) Coherent online video style transfer. In Proc. Intl. Conf. Computer Vision (ICCV).
[10] (2015) Keras. https://keras.io
[11] (2008) Motion from blur.
[12] (2006) Compressed sensing. IEEE Transactions on Information Theory 52 (4), pp. 1289–1306.
[13] (2016) Generating images with perceptual similarity metrics based on deep networks. In Advances in Neural Information Processing Systems, pp. 658–666.
[14] (2006) Removing camera shake from a single photograph. In ACM Transactions on Graphics (TOG), Vol. 25, pp. 787–794.
[15] (2002) Example-based super-resolution. IEEE Computer Graphics and Applications 22 (2), pp. 56–65.
[16] (2006) Signal reconstruction from noisy random projections. IEEE Transactions on Information Theory 52 (9), pp. 4036–4048.
[17] (2017) Image-to-image translation with conditional adversarial networks. arXiv preprint.
[18] (2018) Learning to extract a video sequence from a single motion-blurred image. arXiv preprint arXiv:1804.04065.
[19] (2016) Perceptual losses for real-time style transfer and super-resolution. In European Conference on Computer Vision, pp. 694–711.
[20] (1999) An introduction to variational methods for graphical models. Machine Learning 37 (2), pp. 183–233.
[21] (2017) Online video deblurring via dynamic temporal blending network. In Computer Vision (ICCV), 2017 IEEE International Conference on, pp. 4058–4067.
[22] (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
[23] (2013) Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.
[24] (2018) Stochastic adversarial video prediction. arXiv preprint arXiv:1804.01523.
[25] (2008) Motion-invariant photography. ACM Transactions on Graphics (SIGGRAPH 2008).
[26] (2016) X-ray imaging: fundamentals, industrial techniques and applications. CRC Press.
[27] (2017) Deep multi-scale convolutional neural network for dynamic scene deblurring. In CVPR, Vol. 1, pp. 3.
[28] (2017) Motion deblurring in the wild. In German Conference on Pattern Recognition, pp. 65–77.
[29] (2016) Conditional image synthesis with auxiliary classifier GANs. arXiv preprint arXiv:1610.09585.
[30] (2019) Bringing alive blurred moments. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6830–6839.
[31] (2012) Recognizing disguised faces. Visual Cognition 20 (2), pp. 143–169.
[32] (2015) U-Net: convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 234–241.
[33] (2017) Temporal generative adversarial nets with singular value clipping. In IEEE International Conference on Computer Vision (ICCV), Vol. 2, pp. 5.
[34] (2005) Space-time super-resolution. IEEE Transactions on Pattern Analysis and Machine Intelligence 27 (4), pp. 531–545.
[35] (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
[36] (2015) Learning structured output representation using deep conditional generative models. In Advances in Neural Information Processing Systems, pp. 3483–3491.
[37] (2015) Unsupervised learning of video representations using LSTMs. In International Conference on Machine Learning, pp. 843–852.
[38] (2016) Unsupervised cross-domain image generation. arXiv preprint arXiv:1611.02200.
[39] (2017) MoCoGAN: decomposing motion and content for video generation. arXiv preprint arXiv:1707.04993.
[40] (2016) Generating videos with scene dynamics. In Advances in Neural Information Processing Systems, pp. 613–621.
[41] (2018) Video-to-video synthesis. In Advances in Neural Information Processing Systems (NIPS).
[42] (2017) High-resolution image synthesis and semantic manipulation with conditional GANs. arXiv preprint arXiv:1711.11585.
[43] (2018) Learning and using the arrow of time. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8052–8060.
[44] (2018) Visual dynamics: stochastic future generation via layered cross convolutional networks. IEEE Transactions on Pattern Analysis and Machine Intelligence.
[45] (2018) The unreasonable effectiveness of deep features as a perceptual metric. arXiv preprint.
[46] (2015) Image demosaicing. In Color Image and Video Enhancement, pp. 13–54.
[47] (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. arXiv preprint.