I Introduction
Noncontact 3D shape reconstruction using the structured-light technique is commonly used in a wide range of applications including machine vision, reverse engineering, quality assurance, 3D printing, and entertainment. The technique typically retrieves the depth or height information with an algorithm based on geometric triangulation, where the structured light facilitates the required image matching or decoding process. According to the number of images required for each 3D reconstruction, structured-light techniques can be classified into two categories: multi-shot and single-shot Li et al. (2016, 2005); Zuo et al. (2018); Zhu et al. (2016); Cai et al. (2016). The multi-shot techniques capture high-resolution 3D images at a limited speed, whereas the single-shot techniques acquire 3D images fast enough to deal with dynamic scenes. Consequently, the multi-shot techniques are widely used in industrial metrology for accurate shape reconstruction, whereas the single-shot ones receive tremendous attention in the fields of entertainment and robotics. As technologies evolve at an ever-increasing pace, applying the concept of deep machine learning to the highly demanded single-shot 3D shape reconstruction has become feasible. This is the motivation of this letter.
In the machine learning field, deep convolutional neural networks (CNNs) have found numerous applications in object detection, image segmentation, image classification, scene understanding, medical image analysis, natural language processing, etc. Recent advances in using deep CNNs for image segmentation aim to make the network architecture an end-to-end learning process. For example, Long et al. (2015) restored the downsampled feature map to the original size of the input using backwards convolution. An impressive network architecture, named U-Net and proposed by Ronneberger et al. (2015), extended the decoding path of Long's framework to yield a precise output with a relatively small number of training images. Similarly, Badrinarayanan et al. (2017) used the idea of upsampling the lowest-resolution encoder output to improve the resolution of the output with less computational resources.

In CNN-based 3D reconstruction and depth detection applications, Eigen et al. (2014) and Liu et al. (2015) each proposed a scheme to estimate depth from a single view using CNNs. They used a third-party training data set produced by Kinect RGB-D sensors, which has low accuracy and is insufficient for good learning. Inspired by these two methods, Choy et al. (2016) proposed a novel architecture which employs recurrent neural networks (RNNs) among the autoencoder CNNs for single- and multi-view 3D reconstruction. Over the past year, the use of CNN frameworks for fringe pattern analysis, such as pattern denoising and phase distribution determination, has been explored. For instance, Feng et al. (2019) integrated deep CNNs with three high-frequency patterns for 3D reconstruction and reliable phase unwrapping. Jeught and Dirckx (2019) proposed a neural network trained on a large simulated data set to acquire the height information of an object from a single-shot fringe pattern. A number of investigations Yan et al. (2019); Hao et al. (2019) have shown promising results on using deep neural network (DNN) models to improve the estimation and detection of phase distributions.

Based on the previous successful applications of CNNs to image segmentation and 3D scene reconstruction, utilizing deep CNNs to accurately reconstruct 3D shapes from a single structured-light image should be quite viable. With numerous parameters, a deep CNN model can be trained to approximate a very complex nonlinear regressor that is capable of mapping a conventional structured-light image to its corresponding 3D depth or height map. At present, the robustness and importance of integrating deep CNNs with one of the most widely used structured-light methods, the fringe projection profilometry (FPP) technique, have not been fully recognized. In this letter, such an integration for accurate 3D shape reconstruction is investigated. The main idea is to transform a single-shot image, which has a high-frequency fringe pattern projected on the target, into a 3D image using a deep CNN that has a contracting encoder path and an expansive decoder path. Fig. 1 demonstrates an example of the autoencoder-shaped model used in the proposed approach. Compared with the conventional 3D shape measurement techniques, the proposed technique is considerably simpler, as it does not use any geometric information or any complicated stereo-vision or triangulation-based computation.
Using real and accurate training data is essential for a reliable machine learning model. Because the FPP technique is one of the most accurate 3D shape measurement techniques and is capable of performing 3D imaging with accuracy better than 0.1 mm, it is employed in this work to generate the required training and validation data for learning and the test data for evaluation. The proposed approach is elaborated below, starting from the training and validation data generation, and followed by the description of three deep CNNs.
II Methodology
II.1 Fringe projection profilometry technique for training data generation
The most reliable FPP technique involves projecting a set of phase-shifted sinusoidal fringe patterns from a projector onto the objects, where the surface depth or height information is naturally encoded into the camera-captured fringe patterns for the subsequent 3D reconstruction process. Technically, the fringe patterns help establish the correspondences between the captured image and the original reference image projected by the projector. In practice, the FPP technique determines the height or depth map from the phase distributions of the captured fringe patterns. The phase extraction process normally uses phase-shifted fringe patterns to calculate the fringe phase.
In general, the original reference fringes are straight, evenly spaced, and vertically (or horizontally) oriented. They are generated numerically with the following function:
(1) $I_i(u, v) = I_0\left[1 + \cos\left(\phi + \delta_i\right)\right] = I_0\left[1 + \cos\left(2\pi f \frac{u}{W} + \delta_i\right)\right]$
where $I_i(u, v)$ is the intensity value of the pattern at pixel coordinate $(u, v)$; the subscript $i$ denotes the $i$th phase-shifted image with $i = 1, 2, \ldots, m$, and $m$ is the number of phase-shift steps (e.g., $m = 4$); $I_0$ is a constant coefficient indicating the value of intensity modulation; $f$ is the number of fringes in the pattern; $W$ is the width of the digital image; $\delta_i$ is the phase-shift amount; and $\phi$ is the fringe phase.
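As an illustration, a minimal NumPy sketch of generating such reference patterns is given here. It assumes the common form $I_i = I_0[1 + \cos(2\pi f u / W + \delta_i)]$ with equally spaced shifts $\delta_i = 2\pi(i-1)/m$ and an 8-bit intensity range ($I_0 = 127.5$); the function name, the image size, and these defaults are assumptions made for the example and are not specified in the text.

```python
import numpy as np

def make_fringe_patterns(width, height, num_fringes, steps=4, i0=127.5):
    """Return phase-shifted vertical fringes as an array of shape (steps, height, width).

    Assumptions (not from the text): equal phase shifts 2*pi*k/steps and
    an 8-bit intensity range, so intensities lie in [0, 255].
    """
    u = np.arange(width)                               # horizontal pixel coordinate
    phase = 2.0 * np.pi * num_fringes * u / width      # fringe phase, constant along v
    deltas = 2.0 * np.pi * np.arange(steps) / steps    # assumed phase-shift amounts
    rows = i0 * (1.0 + np.cos(phase[None, :] + deltas[:, None]))  # (steps, width)
    return np.repeat(rows[:, None, :], height, axis=1)            # copy each row along v

# Hypothetical 640 x 480 projector image with 100 fringes and four phase steps.
patterns = make_fringe_patterns(width=640, height=480, num_fringes=100)
```

With equally spaced shifts, the cosine terms cancel when the $m$ patterns are summed, which gives a quick sanity check on the generated images.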
The fringe phase can be calculated from a standard four-step phase-shifting algorithm; however, such a phase value is wrapped in a range of 0 to $2\pi$ and must be unwrapped to obtain the true phase. To cope with the phase-unwrapping difficulty encountered in cases of complex shapes and geometric discontinuities, a scheme using multi-frequency fringe patterns is often employed by the FPP. The unwrapped phase distributions can be accordingly calculated from:
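(2) $\phi_i = \phi_i^w + \mathrm{INT}\left(\frac{\phi_{i-1}\,\frac{f_i}{f_{i-1}} - \phi_i^w}{2\pi}\right) 2\pi$
In the equation, the subscript $i$ indicates the $i$th fringe-frequency pattern with $i = 2, 3, \ldots, n$, and $n$ is the number of fringe frequencies; the superscript $w$ denotes the wrapped phase; $\mathrm{INT}$ represents the function of rounding to the nearest integer; $f_i$ is again the number of fringes in the projection pattern, and $f_n > f_{n-1} > \cdots > f_1 = 1$, so that $\phi_1 = \phi_1^w$. A practical example, used in the experiment of Sec. III.1, is $n = 4$ with fringe numbers 1, 4, 20, and 100.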
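The two steps described above, four-step phase extraction followed by multi-frequency unwrapping, can be sketched as follows. The sketch assumes phase shifts of 0, $\pi/2$, $\pi$, and $3\pi/2$, fringe counts ordered from lowest to highest with the lowest equal to 1, and the unwrapping rule $\phi_i = \phi_i^w + 2\pi\,\mathrm{INT}[(\phi_{i-1} f_i / f_{i-1} - \phi_i^w)/(2\pi)]$; the function names are hypothetical.

```python
import numpy as np

def wrapped_phase_four_step(images):
    """Wrapped phase in [0, 2*pi) from four images shifted by 0, pi/2, pi, 3*pi/2.

    images: array of shape (4, H, W). With I_k = I0 * (1 + cos(phi + delta_k)),
    I4 - I2 is proportional to sin(phi) and I1 - I3 to cos(phi).
    """
    i1, i2, i3, i4 = images
    return np.mod(np.arctan2(i4 - i2, i1 - i3), 2.0 * np.pi)

def unwrap_multifrequency(wrapped, fringes):
    """Temporal unwrapping from the lowest to the highest fringe frequency.

    wrapped: list of wrapped phase maps, one per frequency.
    fringes: matching fringe counts, increasing, with fringes[0] == 1 so that
             the first wrapped phase is already the true phase.
    """
    phase = wrapped[0]
    for k in range(1, len(wrapped)):
        scaled = phase * fringes[k] / fringes[k - 1]              # predicted phase
        order = np.rint((scaled - wrapped[k]) / (2.0 * np.pi))    # fringe order
        phase = wrapped[k] + 2.0 * np.pi * order
    return phase
```

Because each step only needs the fringe order to be rounded correctly, the ratio between adjacent fringe counts limits how much phase noise the scheme tolerates.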
The essential task of the FPP technique is to retrieve the out-of-plane depth or height map from the afore-calculated phase distributions of the highest-frequency fringes. The governing equation for a generalized setup where the system components can be arbitrarily positioned Vo et al. (2012) is:
(3) $z = \frac{c_0 + c_1\phi + (c_2 + c_3\phi)\,u + (c_4 + c_5\phi)\,v}{d_0 + d_1\phi + (d_2 + d_3\phi)\,u + (d_4 + d_5\phi)\,v}$
where $z$ is the out-of-reference-plane height or depth at the point corresponding to the pixel $(u, v)$ in the captured images; $\phi$ is the unwrapped phase of the highest-frequency fringe pattern at the same pixel; and the coefficients $c_0$–$c_5$ and $d_0$–$d_5$ can be determined by a calibration process using a flexible calibration board or a few gage blocks. Details can be found in Vo et al. (2012).
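For illustration only, the phase-to-height mapping can be sketched as below. The sketch assumes a first-order rational model in the pixel coordinates and the unwrapped phase, with six numerator and six denominator coefficients; the function name and the coefficient layout are hypothetical, and real coefficient values would come from the calibration described in Vo et al. (2012).

```python
import numpy as np

def height_from_phase(phase, c, d):
    """Map an unwrapped phase map to a height map with an assumed rational model.

    phase: (H, W) unwrapped phase of the highest-frequency pattern.
    c, d:  length-6 coefficient sequences (hypothetical calibration output).
    """
    h, w = phase.shape
    v, u = np.mgrid[0:h, 0:w]   # v: row index, u: column index
    num = c[0] + c[1] * phase + (c[2] + c[3] * phase) * u + (c[4] + c[5] * phase) * v
    den = d[0] + d[1] * phase + (d[2] + d[3] * phase) * u + (d[4] + d[5] * phase) * v
    return num / den
```

The mapping is evaluated independently at every pixel, so once the coefficients are known the conversion is a cheap array operation.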
As mentioned earlier, the FPP technique is employed not only to generate the training and validation data sets, but also to obtain the ground-truth results of the test data set for evaluation purposes. Once the training and validation data sets are available, they can be fed into a deep CNN model for the subsequent learning process.
II.2 Network architecture
The adopted deep network is mainly made up of two components: the encoder path and the decoder path. Given a single fringe image of an object or objects, the network first encodes the input image into low-resolution feature maps. The network then decodes the feature maps while upsampling them back to the original resolution, with the final output being a 3D depth or height map. In the proposed approach, three deep CNNs are adopted, and they are described as follows:

Fully convolutional networks
The fully convolutional network (FCN) is a well-known network that has been successfully applied to semantic segmentation. The FCN adopts the encoder path from contemporary classification networks (such as AlexNet, VGGNet, and GoogLeNet) and transforms the fully connected layers into convolution layers before upsampling the coarse output map to the same size as the input.

Autoencoder networks
The autoencoder network (AEN) has an encoder path and a symmetric decoder path. There are 33 layers in total, including 22 standard convolution layers, 5 max pooling layers, 5 transpose convolution layers, and a $1 \times 1$ convolution layer. The AEN architecture is illustrated in Fig. 1.

U-Net
The U-Net is also a well-known network, and it has a similar architecture to the AEN. The key difference is that in the U-Net the local context information from the encoder path is concatenated with the upsampled output, which can help increase the resolution of the final output.
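To make the difference between the AEN and the U-Net concrete, here is a weight-free NumPy sketch that only traces how feature-map shapes evolve through an encoder-decoder. It is not one of the trained networks: the learned convolutions are omitted, nearest-neighbour upsampling stands in for the transpose convolutions, and the 480 × 640 input size, the $2 \times 2$ pooling window, and all function names are assumptions made for the example.

```python
import numpy as np

def relu(x):
    """Rectified linear unit, max(0, x)."""
    return np.maximum(0.0, x)

def max_pool_2x2(x):
    """2x2 max pooling with stride 2 on an (H, W, C) array with even H and W."""
    h, w, c = x.shape
    return x.reshape(h // 2, 2, w // 2, 2, c).max(axis=(1, 3))

def upsample_2x(x):
    """Nearest-neighbour 2x upsampling; a stand-in for a learned transpose convolution."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

def conv_1x1(x, weights):
    """1x1 convolution: a per-pixel linear map over channels, weights of shape (C_in, C_out)."""
    return x @ weights

def toy_forward(image, levels=5, skip=True):
    """Trace an encoder-decoder pass; returns (output map, bottleneck shape)."""
    x, skips = image, []
    for _ in range(levels):                 # encoder: activate, remember, downsample
        x = relu(x)
        skips.append(x)
        x = max_pool_2x2(x)
    bottleneck_shape = x.shape
    for s in reversed(skips):               # decoder: upsample back level by level
        x = upsample_2x(x)
        if skip:                            # U-Net style: concatenate encoder features
            x = np.concatenate([s, x], axis=-1)
    weights = np.full((x.shape[-1], 1), 1.0 / x.shape[-1])  # average channels into one map
    return conv_1x1(x, weights), bottleneck_shape
```

Calling `toy_forward` on a 480 × 640 × 1 array with five levels gives a 15 × 20 bottleneck and an output of the original spatial size; with `skip=True` the channel count grows at every decoder level because encoder features are appended, which is the extra local information the U-Net passes to its decoder.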
In the learning process of the three CNNs, the training or validation data set is a four-dimensional array of size $s \times h \times w \times c$, where $s$ is the number of training or validation samples; $h$ and $w$ are the spatial (image) dimensions; and $c$ is the channel dimension, with $c = 1$ for grayscale images and $c = 3$ for color images. The networks contain convolution layers, pooling layers, transpose convolution layers, and unpooling layers; they do not contain any fully connected layers. Each convolution layer learns local features from its input and produces output features whose spatial axes remain the same but whose depth axis changes according to the number of convolution filters. A nonlinear activation function named rectified linear unit (ReLU), expressed as $\max(0, x)$, is employed in each convolution layer. The max pooling layers with a $2 \times 2$ window and a stride of 2 are applied to downsample the feature maps by extracting only the maximum value in each window. In the AEN and U-Net, 2D transpose convolution layers are applied in the decoder path to transform the lower-resolution feature input back to a higher resolution. Finally, a $1 \times 1$ convolution layer with one filter is attached to the final layer to transform the feature maps to the desired depth or height map. Unlike the conventional 3D shape reconstruction schemes that often require complex algorithms based on a profound understanding of techniques, the proposed approach depends on the numerous parameters in the networks, which are automatically trained, to play a vital role in the single-shot 3D reconstruction.
III Experiments and Results
An experiment has been conducted to validate and demonstrate the capability of the proposed approach. The experiment uses a desktop computer with an Intel Core i3-8100 3.6 GHz processor, 16 GB RAM, and an Nvidia GeForce GTX 1070 graphics card. Keras, a popular Python deep learning library, is used to train the different CNN models. Nvidia's cuDNN deep neural network library is adopted to speed up the training process. The field of view of the experiment is about 155 mm, and the distance from the camera to the scene of interest is around 1.2 m. A number of small plaster sculptures serve as the objects; their sizes and surface properties help produce reliable and high-accuracy 3D data sets.
III.1 Training and test data acquisition
As described previously, the multi-frequency FPP technique is employed to prepare the training, validation, and test data. The experiment uses four fringe frequencies (1, 4, 20, and 100) and the four-step phase-shifting scheme, which usually yields a good balance among accuracy, reliability, and capability. The first image of the last frequency (i.e., the pattern with 100 fringes) is chosen as the input image, and all other images are captured solely for the purpose of generating the ground-truth 3D height labels. In total, there are 540, 60, and 72 samples in the training, validation, and test data sets, respectively, and each sample contains a single fringe image of the object(s) and a corresponding height map. The data split ratio is roughly 80%-10%-10%, which is appropriate for such small data sets. The data sets have been made publicly available for download Nguyen and Wang (2019). Visualization 1 shows the training and test data, where the left side displays the input images and the right side illustrates the corresponding ground truth. The background helps cover the field of view, but it is neither mandatory nor required to be flat. Note that the background is hidden in the visualization for better demonstration, and the original shadow areas are excluded from the learning process.
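The split sizes quoted above can be checked with a short sketch. Whether the samples were assigned at random is not stated in the text; the shuffled assignment, the seed, and the function name below are assumptions made for illustration.

```python
import random

def split_indices(n_train, n_val, n_test, seed=0):
    """Shuffle sample indices and cut them into training, validation, and test lists."""
    indices = list(range(n_train + n_val + n_test))
    random.Random(seed).shuffle(indices)   # assumed random assignment
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]
    return train, val, test

train, val, test = split_indices(540, 60, 72)
total = len(train) + len(val) + len(test)
fractions = [round(100 * len(part) / total, 1) for part in (train, val, test)]
# fractions -> [80.4, 8.9, 10.7], i.e. roughly 80%-10%-10% of 672 samples
```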
III.2 Training, analysis, and evaluation
The training and validation data are applied to the learning process of the aforementioned three CNN models: FCN, AEN, and U-Net. The optimization runs for a total of 300 epochs with a mini-batch size of 2 images. The learning rate is reduced by half whenever the validation loss does not improve within 20 consecutive epochs. A number of regularization approaches, such as data augmentation, weight regularization, and dropout, have been implemented to tackle the overfitting problem and to yield better performance. In addition, a grid search has been conducted to obtain the best hyperparameters for the training model. To monitor the network performance as the training process iterates, one of the test samples is randomly selected in advance. At the end of each epoch, a callback is performed on the selected test image to predict and save the intermediate result using the newly updated model (see Visualization 2). The learning can adopt either the binary cross-entropy or the mean squared error as the loss function. If the binary cross-entropy is selected, the ground-truth data are normalized to a range between 0 and 1.
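The learning-rate rule described above can be sketched in plain Python as follows. This is an illustrative re-implementation of the halving logic, not the actual training callback; the initial learning rate is not given in the text, so the value in the usage comment is hypothetical.

```python
def halve_on_plateau(val_losses, initial_lr, patience=20, factor=0.5):
    """Return the learning rate in effect at each epoch.

    The rate is multiplied by `factor` once the validation loss has failed to
    improve on its best value for `patience` consecutive epochs.
    """
    lr = initial_lr
    best = float("inf")
    wait = 0
    history = []
    for loss in val_losses:
        history.append(lr)          # rate used during this epoch
        if loss < best:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:    # plateau detected: halve and restart the count
                lr *= factor
                wait = 0
    return history

# Example with a hypothetical initial rate and a validation loss that never improves:
# rates = halve_on_plateau([1.0] * 41, initial_lr=1e-3)
```

In this example the rate stays at its initial value for the first 21 epochs (one improving epoch followed by 20 stagnant ones) and is halved from the 22nd epoch onward.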
The evaluation is carried out by calculating the errors of the reconstructed 3D shapes. Two errors, the mean relative error (MRE) and the root mean squared error (RMSE), are used in the analysis. Table 1 shows the performance errors of the three CNN models for single-shot 3D shape reconstruction. It can be seen that the FCN model yields the largest error among the three CNNs, and its learning time is also the longest because of the involved element-wise summation. The AEN model requires the least learning time, but its accuracy is slightly inferior to that of the U-Net.
Table 1. Performance errors of the three CNN models.
Quantity  FCN  AEN  U-Net
Training time  7 hrs  5 hrs  6 hrs
Training MRE  1.28e-3  8.10e-4  7.01e-4
Training RMSE (mm)  1.47  0.80  0.71
Validation MRE  1.78e-3  1.65e-3  1.47e-3
Validation RMSE (mm)  1.73  1.43  1.27
Test MRE  2.49e-3  2.32e-3  2.08e-3
Test RMSE (mm)  2.03  1.85  1.62
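The two error measures reported in Table 1 can be computed along the following lines. The text does not spell out how the MRE is normalized, so this sketch assumes the per-pixel relative error $|z_{pred} - z_{true}| / |z_{true}|$ averaged over valid pixels with nonzero ground truth; the function names and the optional mask argument are hypothetical.

```python
import numpy as np

def rmse(pred, truth, mask=None):
    """Root mean squared error, optionally restricted to the pixels where mask is True."""
    diff = pred - truth
    if mask is not None:
        diff = diff[mask]
    return float(np.sqrt(np.mean(diff ** 2)))

def mean_relative_error(pred, truth, mask=None):
    """Mean of |pred - truth| / |truth| over valid pixels (assumed MRE definition).

    If no mask is given, pixels with zero ground truth are skipped to avoid
    division by zero; a supplied mask should exclude them as well.
    """
    if mask is None:
        mask = truth != 0
    return float(np.mean(np.abs(pred[mask] - truth[mask]) / np.abs(truth[mask])))
```

A mask is useful here because shadow areas, which have no reliable ground truth, should not contribute to the reported errors.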
Figure 2 demonstrates a visual comparison of the ground-truth 3D data and the reconstructed 3D data acquired with the three networks. The first image in each row is a representative input, which shows an object with projected fringe patterns, and following the input image is the corresponding 3D ground-truth image. The next three images in each row are the results reconstructed by the FCN, AEN, and U-Net, respectively. In the reconstructed 3D figures, the background has been removed for better visualization. Figure 3 shows the height distributions relative to the background plane along an arbitrary line highlighted in each of the initial input images. Again, it is evident from Figs. 2 and 3 that the AEN and U-Net models perform better than the FCN model. The main reason is that the FCN abruptly restores the high-resolution feature map from the lower-resolution one; therefore, many details are lacking in the final reconstructed 3D results. The AEN and U-Net each have a decoder path that is symmetric to the encoder path, which helps steadily propagate the context information between layers to produce features depicting detailed information. Unlike the AEN, the U-Net contains concatenation operations that send extra local features to the decoder path. This handling helps the U-Net perform the best among the three networks, as can be seen from the figures.
Based on the successful reconstruction of an isolated object from the first data set, a second data set Nguyen and Wang (2019) with multiple complex objects is captured and fed into the CNN framework to check whether the proposed technique can resolve the phase ambiguity and surface discontinuity issues and determine the 3D depth information as well as the traditional technique does. In the second data set, there are 630, 70, and 40 samples in the training, validation, and test data sets, respectively. The U-Net model with the best hyperparameters has been chosen for the training process of the second data set. Figure 4 displays the test image with multiple objects, the ground-truth depth reconstruction from the FPP method, and the representative 3D reconstruction from the U-Net model. A visual comparison of the ground-truth shape and the predicted output shape shows that the U-Net model successfully produces an accurate output shape and resolves the phase ambiguity arising from multiple discontinuous objects, even though only a single fringe image is used.
It is noteworthy that the 3D reconstruction time for a new single image is less than 50 ms on the aforementioned computer, which indicates that real-time 3D shape reconstruction is practicable. Technically, the performance can be further improved with much larger training data sets as well as deeper networks. In practice, however, preparing a considerably large number of high-accuracy ground-truth data is very time-consuming and challenging; furthermore, a deeper network will require a large amount of computer memory and computational time for the learning process. Future work can include improving the network model, preparing a larger data set, and using less memory-consuming algorithms.
IV Conclusion
In summary, a novel single-shot 3D shape reconstruction approach is presented. The approach employs three deep CNN models, namely FCN, AEN, and U-Net, to quickly reconstruct 3D shapes from a single image of the target with projected fringe patterns. The learning process is carried out using the training and validation data acquired by a high-accuracy FPP technique. Experiments show that the U-Net performs the best among the three networks. The validity of the approach gives great promise for future research and development, which will include, but is not limited to, using much larger data sets and large numbers of various objects as well as conducting a rigorous in-depth investigation on the CNN models.
V Acknowledgment
The authors thank Dr. Thanh Nguyen at The Catholic University of America for helpful discussion on the CNN models.
References
Badrinarayanan et al. (2017) SegNet: a deep convolutional encoder-decoder architecture for image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 39, pp. 2481–2495.
Cai et al. (2016) Structured light field 3D imaging. Opt. Express 24, pp. 20324–20334.
Choy et al. (2016) 3D-R2N2: a unified approach for single and multi-view 3D object reconstruction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
Eigen et al. (2014) Depth map prediction from a single image using a multi-scale deep network. In Proceedings of the International Conference on Neural Information Processing Systems (NIPS), pp. 2366–2374.
Feng et al. (2019) Micro deep learning profilometry for high-speed 3D surface imaging. Opt. Lasers Eng. 121, pp. 416–427.
Hao et al. (2019) Batch denoising of ESPI fringe patterns based on convolutional neural network. Appl. Opt. 58, pp. 3338–3346.
Jeught and Dirckx (2019) Deep neural networks for single shot structured light profilometry. Opt. Express 27, pp. 17091–17101.
Li et al. (2016) Single-shot absolute 3D shape measurement with Fourier transform profilometry. Appl. Opt. 55, pp. 5219–5225.
Li et al. (2005) Multi-frequency and multiple phase-shift sinusoidal fringe projection for 3D profilometry. Opt. Express 13, pp. 1561–1569.
Liu et al. (2015) Deep convolutional neural fields for depth estimation from a single image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5162–5170.
Long et al. (2015) Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3431–3440.
Nguyen and Wang (2019) Single-shot 3D shape reconstruction dataset. figshare, https://figshare.com/s/fe74b1fad5093d3846fb.
Ronneberger et al. (2015) U-Net: convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 234–241.
Vo et al. (2012) Hyper-accurate flexible calibration technique for fringe-projection-based three-dimensional imaging. Opt. Express 20, pp. 16926–16941.
Yan et al. (2019) Fringe pattern denoising based on deep learning. Opt. Commun. 437, pp. 148–152.
Zhu et al. (2016) Accurate and fast 3D surface measurement with temporal-spatial binary encoding structured illumination. Opt. Express 25, pp. 28549–28560.
Zuo et al. (2018) Micro Fourier transform profilometry (FTP): 3D shape measurement at 10,000 frames per second. Opt. Lasers Eng. 102, pp. 70–91.