Deep neural network for fringe pattern filtering and normalisation

06/14/2019 ∙ by Alan Reyes-Figueroa, et al. ∙ 7

We propose a new framework for processing Fringe Patterns (FP). Our novel approach builds upon the hypothesis that the denoising and normalisation of FPs can be learned by a deep neural network if enough pairs of corrupted and cleaned FPs are provided. Although similar proposals have been reported in the literature, we propose an improvement of a well-known deep neural network architecture, which produces high-quality results in terms of stability and repeatability. We test the performance of our method in various scenarios: FPs corrupted with different degrees of noise, and corrupted with different noise distributions. We compare our methodology with other state-of-the-art methods. The experimental results (on both synthetic and real data) demonstrate the capabilities and potential of this new paradigm for processing interferograms. We expect our work to motivate more sophisticated developments in this direction.




1 Introduction

Fringe Pattern (FP) denoising–normalisation consists of removing background illumination variations, normalising contrast and filtering noise, which means transforming an FP corresponding to the mathematical model

$$x(p) = a(p) + b(p)\cos(\phi(p)) + \eta(p) \qquad (1)$$

into the normalised FP modelled by

$$\hat{x}(p) = 1 + \cos(\phi(p)). \qquad (2)$$

Here, $p$ is the pixel position, $a$ is the background illumination, $b$ is the fringe contrast, $\phi$ is the phase map and $\eta$ is an additive or correlated noise. Such a normalisation can be represented by the transformation

$$\hat{x} = \mathcal{H}\{x\}. \qquad (3)$$

Fringe pattern normalisation is a step of the fringe analysis processing. Refs. [1, 2] present useful reviews of challenges related to FP analysis and a good list of proposed solutions. There is a consensus that FP analysis can be seen as a pipeline that involves the following steps: denoising, normalisation, phase extraction and phase unwrapping, even if some of these steps are merged by some methods. For example, the denoising, the normalisation and the phase extraction can be accomplished using the two-dimensional Windowed Fourier Transform (WFT) [3], the wavelet transform (WT) [4, 5] or a Gabor Filter Bank (GFB) based method [6]. As noted in Refs. [4, 5], WFT and FT have limitations in dealing with FPs when phase discontinuities are present or the field of view is not the full image. As we will show in this work, the same limitation applies to the GFB approach. These techniques estimate a transformation of the form (3) for a central pixel in an image neighbourhood (image patch or weighted image window).

Figure 1: Normalisation of an FP with incomplete field of view: (a) data, (b) GFB (note the artefacts in regions with low–frequency fringes and near the border of the region of interest) and (c) proposal.

WFT, WT and GFB methods rely upon the assumption that the neighbourhood of a pixel of interest (image patch) has an almost constant frequency and phase; i.e., locally, the phase map is close to being a plane. The limitations of these methods appear at patches where the main assumption is violated; i.e., at phase discontinuities. Figure 1 shows a denoised-filtered FP computed with a GFB based method and with our proposal. Further information on alternative strategies for FP normalisation can be found in Ref. [2]. However, methods based on local spectral analysis (e.g., WFT, WT and GFB) have proven to be very robust general methods for dealing with high noise levels [4, 5, 6, 7, 8]. In this work, we propose to implement such a transformation (3) with a Deep Neural Network (DNN).

Neural Networks (NNs) are known for being universal approximators [9]. Early attempts to use a NN for FP analysis are of limited application since they are based on simple multilayer schemes. In particular, in Ref. [10] a multilayer NN is trained for computing the phase and its gradient at the central pixel of an image patch. Instead, our proposal computes the restoration for the entire image patch. In addition, our work is based on a deep auto–encoder NN that allows us to deal with noise and large illumination changes.

2 Brief review of the auto–encoder

The auto–encoders were originally devised to compress (codify) and decompress (decodify) data vectors [11]. Fig. 2 shows a scheme of a basic auto–encoder, where one can observe its two main components:

  1. The encoder takes the data $x$ in its original “spatial” dimension and produces a compressed vector $y$. Mathematically, the encoding can be expressed by

    $$y = f_1(W_1 x + b_1), \qquad (4)$$

    where $x$ is the original data, $W_1$ is the weights matrix, $b_1$ is the bias, $y$ is the encoded data and $f_1$ is an activation function (e.g., ReLU, sigmoid or softmax).

  2. The decoder takes the compressed data $y$ and computes a reconstruction $\hat{x}$ of the original data $x$. This is expressed by

    $$\hat{x} = f_2(W_2 y + b_2), \qquad (5)$$

    where $W_2$ is a weights matrix, $b_2$ is a bias and $f_2$ is the activation function.

    Figure 2: Scheme of a basic auto–encoder.

In this case, given a training dataset $\{x_i\}_{i=1}^{N}$, the auto–encoder (coder and decoder) is trained by solving an optimisation problem of the form:

$$\min_{W_1, b_1, W_2, b_2} \sum_{i=1}^{N} M(\hat{x}_i, x_i),$$

where $M$ represents a metric or divergence measure and $\hat{x}_i$ is the reconstruction of $x_i$. The auto–encoder illustrated in Fig. 2 has a hidden layer associated with the coded variable $y$ and produces an output $\hat{x}$ of the same dimension as the original data $x$.
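The encoder, decoder and training objective of this section can be sketched in a few lines of pure Python. This is only an illustration: the vector sizes, the random initial weights, the choice of ReLU for the encoder and sigmoid for the decoder, and the squared error used as the metric are all assumptions made for the example, not values taken from the text.

```python
import math
import random


def affine(weights, bias, vector):
    """Return W @ vector + bias for a matrix stored as a list of rows."""
    return [sum(w * v for w, v in zip(row, vector)) + b
            for row, b in zip(weights, bias)]


def relu(vector):
    return [max(0.0, v) for v in vector]


def sigmoid(vector):
    return [1.0 / (1.0 + math.exp(-v)) for v in vector]


def encode(x, w1, b1):
    """Encoder: compress x into a shorter code vector."""
    return relu(affine(w1, b1, x))


def decode(y, w2, b2):
    """Decoder: reconstruct a vector of the original size from the code."""
    return sigmoid(affine(w2, b2, y))


def reconstruction_loss(dataset, w1, b1, w2, b2):
    """Sum of squared errors between each sample and its reconstruction."""
    total = 0.0
    for x in dataset:
        x_hat = decode(encode(x, w1, b1), w2, b2)
        total += sum((a - b) ** 2 for a, b in zip(x_hat, x))
    return total


rng = random.Random(0)
n_in, n_code = 8, 3  # hypothetical sizes: 8-dimensional data, 3-dimensional code
w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_code)]
b1 = [0.0] * n_code
w2 = [[rng.uniform(-0.5, 0.5) for _ in range(n_code)] for _ in range(n_in)]
b2 = [0.0] * n_in
data = [[rng.random() for _ in range(n_in)] for _ in range(5)]

code = encode(data[0], w1, b1)
x_hat = decode(code, w2, b2)
loss = reconstruction_loss(data, w1, b1, w2, b2)
```

Training would consist of adjusting the weights and biases to reduce `loss`; the optimiser is omitted here to keep the sketch short.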

3 Method

Image analysis models based on DNNs have demonstrated their ability to represent diverse transformations to map an input image onto an output image. Auto–encoders are one kind of NN that can map a tensor into another tensor (a tensor is a multidimensional array). In this work, we deal with monochromatic images that are represented as tensors of order two. RGB-codified images can be represented as tensors of order three.

In this work we propose to use a deep auto–encoder for performing image restoration.

3.1 U–net model for image segmentation

The auto–encoders motivated the development of the fully convolutional U–net model for image classification (segmentation) [12]. The loss function for the U–net is of the form


where is a segmentation of , and represent the encoder and decoder stages, respectively; finally, is the vector of the auto–encoder parameters. For each input image, fully convolutional models produce an image of labels of the same size as the input [13], unlike standard convolutional networks, whose output is a single value. One can note important differences between the classical auto–encoder and the U–net:

  1. U–net is a deep model; i.e., the number of layers in U–net is substantially larger than the number of layers in the classic auto–encoder.

  2. U–net implements convolutional 2D filters, so the weights of each layer are codified into an array of matrices (a 3D tensor) and each layer produces a vector of processed images (a 3D tensor). On the other hand, classical auto–encoders vectorise the input image and therefore the pixels’ spatial relationships are lost. Figure 3 illustrates the U–net architecture: connections, number of layers and dimension of the input and output of each layer. The graphical representation used in Refs. [12, 14] allows one to visualise the NN–convolutions (filters) and the dimensional changes of the processed data. Herein, we use such a representation for illustrating our deep architecture.

  3. The input of a decoder layer in U–net is the concatenation of the output tensor of the previous layer and the output tensor of the symmetric encoder layer (so–called “skip links”); see Fig. 3. The purpose of skip links can be understood in the context of residual–nets [15]: they allow the solution to be constructed using both coarse and processed data.

    The residual–net combines and by means of an addition, while U–net learns a more general form to combine the data. For a better understanding, the residual–net version of the auto–encoder in equation (5) is . In contrast, the U–net version is written as , where is the function learned by the model to combine the data. In any case, skip–links tackle the well–known problem of “vanishing–gradient” on deep networks and improve the training process [16].

Figure 3: The U–net and V–net architectures look similar at block level. However, the number of filters per block is inversely distributed: U–net is for image segmentation (classification) while V–net is designed for image reconstruction (regression).

As in standard convolutional DNNs, U–net follows the rule of thumb for processing input tensors: the encoding layers increase the number of channels (filters) of the input tensor and reduce its spatial dimension, while the decoding layers shrink the number of channels and extend the spatial dimensions of the processed tensors. Therefore, one improves the computational efficiency by applying a reduced number of filters on tensors with larger spatial dimension and a larger number of filters on spatially small tensors. In the training stage, the filters are optimised for detecting the useful features that allow the U–net to correctly classify the pixels.

3.2 V–net model: An improved U–net for image restoration

Since classification (image segmentation) and regression (image restoration) may require different features, in this work we propose an improvement of the U–net designed to achieve image restoration instead of image segmentation. Despite the computational cost that it implies, our network applies a larger number of filters on the input tensor (original image) in order to capture local details and achieve a precise restoration. Also, as opposed to the standard U–net, we reduce the number of filters as the layers are deeper on the encoder. Thus, the deepest layer in the encoder stage (bottom layer on Figure 3) produces a tensor with the smallest dimension (spatial size and number of channels). These characteristics distinguish our architecture and provide our DNN with an advantage for the regression task. We call our improved model “V–net” because it uses tensors with few channels on deeper layers.

The V–net encoder is composed of a sequence of encoding blocks (Down–Blocks), followed by the decoder, which consists of a sequence of decoding blocks (Up–Blocks), and a Tail (composed of two final convolutional layers); see Fig. 3. The th Down–Block, , starts by applying two convolutional layers with channels of size . This number determines the number of output channels at each stage. The architecture of each encoding and decoding block is illustrated in Fig. 4.

In our implementation, we set (number of spatial-size levels); and the number of channels , , is determined by the number of channels in the previous Down–Block following the rule . The same occurs with the number of channels for each Up–Block. In our model, the respective number of channels is set as , for . Although more general configurations for the number of channels can be implemented, we have chosen as global parameter the number of channels in the last Down–Block (bottom level), and the other ’s are determined by as is indicated in Fig. 3.

In the following, we denote by the output tensor of the th Down–Block, , as well as the second input of the ()th Up–Block, . Each tensor has dimensions (number of channels, number of rows, number of columns), . In particular, is the input tensor for the model, and equals the dimension of each input patch. Similarly, denotes the first input tensor of the th Up–Block, , as well as the output for the ()th Up–Block, . Each tensor has the same dimensions as has . In particular, , and corresponds to the output tensor of the last Up–Block. We will denote by the final output of the model.

Now, we shall describe the mathematics of all components in our proposed model. We start with equations defining the operations in each Down–Block:

  1. Apply a 2D convolution with filters of :


    where is the th kernel (of size ), , of the first convolution in the th Down–Block; note that the 2D convolution is implemented by sliding a kernel with the same number of channels as the convolved data along the rows and columns of the input tensor. The output of this convolution is the tensor of dimension . Since we use padding, the number of rows and columns for the input tensor and the convolved one are equal: we extend the borders of the input tensor with zeros, adding one row of zeros above and one below, and one column of zeros to the left and another to the right.

  2. Apply the tensorial activation function :


    where is a tensorial activation function applied to each entry of the tensor . Our implementation uses the Rectified Linear Unit (ReLU), as in the standard implementation of the U–net: $\mathrm{ReLU}(z) = \max(0, z)$.

  3. Apply a second 2D convolution with filters of :


    where is the th kernel (of size ), , of the second convolution.

  4. Apply the activation function :


    Note that we use the same activation after the convolutions 1 and 2 on the Down–Block.

  5. Apply a dropout of in the training stage. Generate a binary random mask with of zeros (the remaining entries are set to one) with a size equal to and compute


    where is the element–wise product. The zero entries of turn off the corresponding outputs in the responses and such a dropout mask is kept constant for each training data batch. This mask acts as a regulariser that helps the training avoid being trapped in a bad local minimum in the early training epochs.

  6. Apply a MaxPooling of $2 \times 2$ with stride equal to 2 to the tensor . Mathematically, this subsampling is written as


    for and . Then, the output of the th Down–Block is the tensor of dimension equal to (, with and .
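The six Down–Block steps above can be sketched on a single-channel image in pure Python. This is a simplified illustration under stated assumptions: the kernels are taken to be 3×3 (consistent with the one-pixel zero padding described in step 1), the pooling is taken to be 2×2 with stride 2, biases are omitted, only one channel is processed, and the dropout zeroes each entry independently instead of drawing a mask with an exact proportion of zeros.

```python
import random


def conv3x3_same(image, kernel):
    """Slide a 3x3 kernel over a zero-padded image; output has the input size."""
    rows, cols = len(image), len(image[0])
    # One row/column of zeros on every side keeps the spatial size unchanged.
    padded = [[0.0] * (cols + 2)]
    padded += [[0.0] + list(row) + [0.0] for row in image]
    padded += [[0.0] * (cols + 2)]
    out = []
    for r in range(rows):
        out_row = []
        for c in range(cols):
            acc = 0.0
            for i in range(3):
                for j in range(3):
                    acc += kernel[i][j] * padded[r + i][c + j]
            out_row.append(acc)
        out.append(out_row)
    return out


def relu(image):
    return [[max(0.0, v) for v in row] for row in image]


def dropout(image, rate, rng):
    """Zero each entry with probability `rate` (training stage only)."""
    return [[0.0 if rng.random() < rate else v for v in row] for row in image]


def max_pool_2x2(image):
    """Keep the maximum of each non-overlapping 2x2 cell (stride 2)."""
    return [[max(image[r][c], image[r][c + 1],
                 image[r + 1][c], image[r + 1][c + 1])
             for c in range(0, len(image[0]), 2)]
            for r in range(0, len(image), 2)]


def down_block(image, kernel_a, kernel_b, rate=0.0, rng=None):
    """conv -> ReLU -> conv -> ReLU -> (dropout) -> 2x2 max pooling."""
    h = relu(conv3x3_same(image, kernel_a))
    h = relu(conv3x3_same(h, kernel_b))
    if rate > 0.0:
        h = dropout(h, rate, rng or random.Random(0))
    return max_pool_2x2(h)


# An identity kernel makes the result easy to check by hand.
identity = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
patch = [[float(4 * r + c) for c in range(4)] for r in range(4)]
pooled = down_block(patch, identity, identity)
```

With the identity kernels the convolutions leave the 4×4 patch unchanged, so `pooled` is simply the 2×2 max pooling of the input. The input must have an even number of rows and columns for the pooling to work.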

Figure 4: V–net components: (a) Down–Block and (b) Up–Block.

The above steps (1 to 6) can be represented by the operator , where the subindex reflects the dependency on the parameter ; i.e., the number of filters of the convolutional layers. Then, the calculation of given is expressed by


Since the entire V–net’s encoder consists of Down–Blocks, the operator that defines such an encoder is given by the composition


On the other hand, the V–net’s decoder, represented with the operator , consists of the composition of Up–Blocks:


where for , represents the th Up–Block operator given by:


where . The calculation of uses the output of the previous Up–Block, , and the output of the mirror Down–Block represented with a skip–link, see Fig. 3. The details of the Up–Block calculations are as follows:

  1. Upsample the previous output with a nearest neighbour interpolation:


    for and . Then, the output has dimensions (. Note that these are twice the dimensions of .

  2. Apply a 2D convolution with kernel size of and filters:


    where is the th kernel (of size ), , of the first convolution and is the ReLU activation function. The purpose of this convolution is to smooth the nearest neighbour interpolation of the previous step.

  3. Concatenate the previous output, , with the output of the mirror Down–Block in the encoder, :


    This results in the tensor of dimensions .

  4. Apply a 2D convolution with kernel size of and filters:


    where is the th kernel (of size ) of the second convolution and is the ReLU activation function.

  5. Apply a third 2D convolution of and filters:


    where is the th kernel (of size ) of the convolution.

The architectures of the Down–Block and Up–Block are summarised in Panels (a) and (b) of Fig. 4, respectively.
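The upsampling and concatenation steps of the Up–Block can be sketched as follows. The example is a simplification: tensors are stored as lists of 2D channels, the upsampling factor is assumed to be two, and the three convolutions of the block are left out.

```python
def upsample_nearest(channel):
    """Double rows and columns by repeating every pixel in a 2x2 cell."""
    out = []
    for row in channel:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def concatenate(channels_a, channels_b):
    """Stack two lists of equally sized 2D channels along the channel axis."""
    size_a = (len(channels_a[0]), len(channels_a[0][0]))
    size_b = (len(channels_b[0]), len(channels_b[0][0]))
    if size_a != size_b:
        raise ValueError("spatial sizes must match before concatenation")
    return channels_a + channels_b


previous = [[[1.0, 2.0], [3.0, 4.0]]]   # one 2x2 channel from the previous Up-Block
skip = [[[0.0] * 4 for _ in range(4)]]   # one 4x4 channel from the mirror Down-Block

upsampled = [upsample_nearest(ch) for ch in previous]
merged = concatenate(upsampled, skip)
```

After upsampling, the previous output has the same spatial size as the skip-link tensor, so the two can be stacked; `merged` then holds the channels of both.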

Note. Since the spatial down-sampling in equation (6) is performed at each Down–Block, we require the spatial dimensions of each input tensor , , to be even. Moreover, since the encoder consists of Down–Blocks, the input patches must have dimensions that are a multiple of . This requirement is not essential, but it helps to simplify the internal arithmetic of the up-sampling step.

A final Tail stage, consisting of the application of two final convolutional layers, is then performed. Their purpose is to smooth and simplify the output , in order to produce a final output with fewer channels (just 1 channel, to be precise). This 1–channel output is the estimated FP corresponding to the corrupted input . The steps are


where is the th kernel (of size ), and is the ReLU activation function; and


where is the unique kernel (of size ), and is the ReLU activation function.

Finally, the training of the V–net can be written as


where is the input tensor (stack of all input patches ), is the desired output (stack of all normalised FP patches ), and is the output tensor (stack of all patch estimations ) of the V–net. The operator represents the Tail stages given by (24) and (25). Here, is the set of all model parameters (filter weights and bias). We use the norm as loss function, because it induces lower reconstruction errors.

4 Implementation details

4.1 Simulated data

To quantitatively evaluate the performance of the proposed V–net based normalisation, we randomly generated 46 pairs of FPs (with size equal to pixels): the corrupted FPs were generated according to the model in (1) and the normalised FPs (ground–truth) according to the model in (2). The normally distributed random noise was generated with the Python NumPy package. On the other hand, the random smooth functions (illumination components and phase) were constructed using random numbers with uniform distribution generated with our implementation of a Linear Congruential Generator with POSIX parameters


in order to guarantee that the FP generation is reproducible. In the following, we explain the smooth random surface generation procedure. We generated the pseudo–random phase with a radial basis function with Gaussian kernel



where we define


the vector of the random kernel centres, which are uniformly distributed over the FP lattice (i.e., ), and the vector of random uniformly–distributed Gaussian heights, with . Similarly, we generated the illumination term with and . In our data, we selected and uniformly distributed over the image domain (using our implementation of the POSIX algorithm) and we set , and . Figure 5 depicts an example of the synthetic data used for training: Panels (a) and (b) show the ideal and corrupted FPs, respectively. Panels (c) and (d) show a selected region where the noise level can be better appreciated. As noted above, we trained our V–net model with patches that are small relative to the original FP size. An example of a patch-pair used for training is depicted in panels (e) and (f).
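The generation of a reproducible smooth random phase can be sketched as below. Several choices here are assumptions made for illustration: the generator uses the constants of the POSIX `drand48` family (multiplier 0x5DEECE66D, increment 0xB, modulus 2^48), which may or may not be the "POSIX parameters" meant in the text; the seed is used directly as the initial state, which is simpler than the `srand48` seeding; and the image size, number of kernels, kernel width and height range are hypothetical values.

```python
import math


class Lcg48:
    """Linear congruential generator with the drand48 constants."""

    A = 0x5DEECE66D
    C = 0xB
    M = 1 << 48

    def __init__(self, seed):
        self.state = seed % self.M

    def uniform(self):
        """Return the next pseudo-random number in [0, 1)."""
        self.state = (self.A * self.state + self.C) % self.M
        return self.state / self.M


def random_surface(size, n_kernels, sigma, height, rng):
    """Sum of Gaussian bumps with random centres and random heights."""
    centres = [(rng.uniform() * size, rng.uniform() * size)
               for _ in range(n_kernels)]
    heights = [(2.0 * rng.uniform() - 1.0) * height for _ in range(n_kernels)]
    surface = [[0.0] * size for _ in range(size)]
    for (cr, cc), h in zip(centres, heights):
        for r in range(size):
            for c in range(size):
                d2 = (r - cr) ** 2 + (c - cc) ** 2
                surface[r][c] += h * math.exp(-d2 / (2.0 * sigma ** 2))
    return surface


rng = Lcg48(seed=2019)
phase = random_surface(size=32, n_kernels=5, sigma=8.0, height=10.0, rng=rng)
# Noise-free fringe term obtained from the random phase.
fringes = [[math.cos(v) for v in row] for row in phase]
```

Because the generator is fully determined by its seed, running the script twice yields the same phase map, which is the replicability property the text asks for. The illumination terms could be produced with the same `random_surface` function and different parameters.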

Figure 5: Example of training data. (a) Synthetic normalised FP of pixels with a selected region of interest of pixels (yellow square), small random patches of in blue; (b) the same FP corrupted with Gaussian noise, patches in red; (c) and (d) regions of interest in (a) and (b), respectively. A random patch–pair used for training: (e) Ground–truth and (f) corrupted input.

Figure 6: FP normalisation (inference). A set of overlapped patches that cover the entire FP to normalise is computed and used to feed our trained V–net model; the predicted patches are then assembled to reconstruct the full FP. Pixels with multiple predictions (because of the patches’ overlapping) are averaged for computing the normalised (reconstructed) FP.

In the Experiments section, we evaluate the performance of our method for different noise types, such as Gaussian, salt–pepper, speckle, and combinations of them. In the case of FPs with speckle noise, we used the model


instead of (1); where and are spatially independent and identically distributed noise: has uniform distribution (with values into ) and has Gaussian distribution (with zero mean and standard deviation ).


In the case of salt–pepper noise, we randomly select the of the pixels and saturate them to values and in equal proportion.
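A salt–pepper corruption of this kind can be sketched as follows. The fraction of corrupted pixels and the two saturation values used in the example (10%, 0.0 and 1.0) are hypothetical, since the values used in the experiments are not reproduced here.

```python
import random


def add_salt_pepper(image, fraction, low, high, rng):
    """Saturate a random `fraction` of the pixels, half to `low` and half to `high`."""
    rows, cols = len(image), len(image[0])
    n_corrupt = int(round(fraction * rows * cols))
    # Sampling without replacement guarantees distinct pixel positions.
    positions = rng.sample(range(rows * cols), n_corrupt)
    noisy = [list(row) for row in image]
    for k, pos in enumerate(positions):
        r, c = divmod(pos, cols)
        # Alternate between the two saturation values for equal proportions.
        noisy[r][c] = low if k % 2 == 0 else high
    return noisy


rng = random.Random(1)
clean = [[0.5] * 20 for _ in range(20)]
noisy = add_salt_pepper(clean, fraction=0.10, low=0.0, high=1.0, rng=rng)
changed = sum(v != 0.5 for row in noisy for v in row)
```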

4.2 Training data set

The training data set consists of random patches of pixels sampled from 30 generated FPs (the set of training images); of those patches were used for validation. In addition, the remaining FPs were used as the test data set. We stacked the corrupted patches in the tensor and the corresponding normalised patches form the desired output .

The patch–size is a user-defined parameter. We chose the size as by considering a maximum frequency close to 1.5 fringes per patch. Moreover, the V–net also requires a patch–size divisible by ; where is the number of levels.

4.3 Prediction of a full FP from reconstructed patches

Recall that our V–net is designed to reconstruct small FP patches of pixels. To reconstruct an entire FP, we generated a set of patches using a sliding window scheme, with a stride (pixel shift) of four pixels in both horizontal and vertical directions. Those patches were fed to the V–net to compute their normalisations; see Fig. 6. Each pixel in the entire reconstructed FP was computed as the average of the values in the same pixel position obtained from the overlapped normalised patches. We preferred the mean because it is computed more efficiently than the median, and we did not observe a significant difference when the median was used.
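The sliding-window extraction and the averaging of overlapping predictions can be sketched as below. The image size (16), patch size (8) and the identity function standing in for the trained network are assumptions chosen so that the result is easy to verify; only the stride of four pixels comes from the text.

```python
def extract_patches(image, patch, stride):
    """Return (row, col, patch) triples from a sliding window."""
    rows, cols = len(image), len(image[0])
    out = []
    for r in range(0, rows - patch + 1, stride):
        for c in range(0, cols - patch + 1, stride):
            out.append((r, c, [row[c:c + patch] for row in image[r:r + patch]]))
    return out


def assemble(patches, rows, cols):
    """Average all predictions that fall on the same pixel."""
    total = [[0.0] * cols for _ in range(rows)]
    count = [[0] * cols for _ in range(rows)]
    for r0, c0, p in patches:
        for i, prow in enumerate(p):
            for j, v in enumerate(prow):
                total[r0 + i][c0 + j] += v
                count[r0 + i][c0 + j] += 1
    # Pixels never covered by a patch are left at zero.
    return [[t / n if n else 0.0 for t, n in zip(trow, nrow)]
            for trow, nrow in zip(total, count)]


image = [[float(r * 16 + c) for c in range(16)] for r in range(16)]
patches = extract_patches(image, patch=8, stride=4)
# An identity "model" stands in for the trained network here.
restored = assemble(patches, 16, 16)
```

Since the stand-in model returns each patch unchanged, averaging the overlaps reproduces the input image exactly; with a real network the average smooths out the disagreement between overlapping predictions.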

For a 2-dimensional FP image, let be the number of patches computed over the dimensions (rows and columns). Then, is given by




is the image size, is the patch size, is the step and


In the case of our experiments, we set , , for . Then, the number of patches required to reconstruct a single FP is . This quantity is substantially larger than the number of patches in the training set, patches. The expected number of patches per training FP was 833. We then ran Monte Carlo simulations to estimate the area covered by the selected patches: on average, it was of each entire FP. If we increased the number of training patches to , the average covered area would be .
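The patch count and the Monte Carlo coverage estimate can be sketched as follows. The counting formula assumes that the window never leaves the image, which may differ from the exact expression used above (for instance, in how a partial last window is handled), and the sizes in the example are hypothetical.

```python
import random


def patches_per_axis(image_size, patch_size, step):
    """Number of sliding-window positions that fit along one axis."""
    return (image_size - patch_size) // step + 1


def covered_fraction(image_size, patch_size, n_patches, rng):
    """Fraction of pixels touched by `n_patches` uniformly placed patches."""
    covered = [[False] * image_size for _ in range(image_size)]
    for _ in range(n_patches):
        r0 = rng.randrange(image_size - patch_size + 1)
        c0 = rng.randrange(image_size - patch_size + 1)
        for r in range(r0, r0 + patch_size):
            for c in range(c0, c0 + patch_size):
                covered[r][c] = True
    return sum(map(sum, covered)) / image_size ** 2


n_axis = patches_per_axis(image_size=64, patch_size=16, step=4)
n_total = n_axis ** 2
coverage = covered_fraction(64, 16, n_patches=10, rng=random.Random(0))
```

Averaging `covered_fraction` over many random seeds gives a Monte Carlo estimate of the expected covered area for a given number of training patches per FP.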

5 Experiments

In order to evaluate the performance of the U–net and the proposed improvement V–net for the FP normalisation task, we conducted three experiments. In the first one, we evaluated the U–net and V–net with respect to the noise level (assuming Gaussian noise). In the second experiment, we evaluated such models under different noise distributions: Gaussian, salt–pepper, speckle, combination of noise and the effect of incomplete field of view (named Pupil in this work). Finally, the third experiment compares our proposals with state-of-the-art methods.

For all the evaluated networks, we equally set parameters for the training process. We used the ADAM algorithm [20] as optimiser with a learning rate , a decay rate , a batch size equal to 32 and we select the best trained model over 150 epochs.

5.1 Performance comparison of U–net and V–net for different noise levels

In this experiment, we simulated noise levels as in the acquisition of typical interferometric FPs. We investigated the performance of the U–net and V–net models for seven levels of Gaussian noise; i.e., seven standard deviations for the noise in (1). Such values are indicated in the first column of Table 1. For each trained model, we used a randomly generated training set and a randomly generated starting point for the models’ parameters (weights). Table 1 reports the averaged Mean–Absolute–Error (MAE) of the reconstructions over ten different trained models.

Standard deviation    U–net MAE    V–net MAE
0.00                  2.142        2.266
0.05                  2.016        2.383
0.10                  2.416        2.457
0.15                  2.552        2.539
0.20                  2.762        2.620
0.25                  2.769        2.726
0.30                  2.807        2.702

Table 1: Summary of the synthetic experiments (full-images). FPs were generated using (1).

Table 1 shows that, in general, V–net performs better than U–net for denoising FPs corrupted with Gaussian noise. According to Fig. 7, the V–net model has a superior performance for higher standard deviation values ( with a signal’s dynamic range into the interval ). Both models have a similar performance for close to . On the other hand, U–net produces better reconstructions for low noise levels (). Fig. 8 shows examples of FPs normalised with our method (noise with ). In general, V–net presents a lower error variance, which can be understood as a better precision of the results.

Figure 7: Summary of experiments for normalizing 46 FPs corrupted with Gaussian noise and different levels of .
Figure 8: Denoised–normalised FPs with V–net: a) Ground–truth; b) corrupted FPs, ; and c) reconstructions, .

5.2 Performance comparison of U–net and V–net for different noise types

The following experiment reports the U–net and V–net model’s performance for the normalisation of FPs under different noise distributions; in all cases, the illumination components and the phase were generated according to the method presented in subsection 4.1. The pupil was defined with a centered circular region of diameter equal to of the image size.

Table 2 and Fig. 9 report results for corrupted FPs under different scenarios: Salt–pepper noise, Speckle noise, Gaussian–and–speckle noise and an incomplete field of view (pupil). Note that V–net produces better results for Speckle noise, Gaussian+Speckle noise and pupil. In contrast, U–net has a better performance when the task requires processing data with few intensity levels: as in the reconstruction of FP corrupted with salt–pepper noise (to remove a few data and to interpolate such pixels) or if only low–frequency illumination changes are present and the noise is not a problem.

Figure 9: Summary of experiments for normalizing 46 FPs corrupted with different noise distributions.

Noise                     U–net MAE    V–net MAE
(no noise)                2.142        2.266
Salt-pepper               2.416        2.455
Speckle                   2.552        2.539
Gaussian-speckle          2.762        2.620
Gaussian-speckle-pupil    2.769        2.726

Table 2: Summary of the synthetic experiments (patches). Speckle FPs were generated using (29).

5.3 Comparison versus state of the art methods

In this subsection we evaluate the performance of the proposed models versus other state-of-the-art methods based on deep neural networks. In addition, we introduce a second variant of the V–net and include it in the comparison. This variant is built upon the residual network paradigm, which assumes that the input can be decomposed as the sum of two components: the FP and the corrupting elements. We call this variant Res V–net and it is implemented by changing the concatenation (step 3) in each Up–Block of the V–net architecture. We replaced the merging procedure given in (21) by an element–wise subtraction (denoted by ):


In the Res V–net one expects that the terms capture the corrupted elements in the input signal.

Figure 10: Normalised FPs. (a) Data generated with Gaussian and speckle noise. (b) GFB based normalisation. (c) Our results.

We compared our proposed networks with recently reported Deep Neural Networks: the optical Fringe Patterns Denoising (FPD) convolutional neural network proposed in Ref. [21], the Deep Convolutional Neural Network (DCNN) [22] and the application reported in Ref. [23] of the general purpose image denoising deep neural network (FFD) [24].

Refs. [21, 22] present favourable comparisons of their networks with respect to a filtering based on the Windowed Fourier Transform (WFT) [4, 25]. The authors argue that they chose WFT since it is one of the classical procedures with better performance for fringe denoising. We have compared our proposals with a particular case of WFT: the Gabor Filter Bank (the relationship between GFB and WFT is reported in Ref. [8]). The results of our comparison are consistent with those reported in Ref. [22]: the GFB method fails to reconstruct the FP at regions with phase discontinuities and low–frequency. Figure 10 depicts evidence that supports this claim.

The computational time for the GFB was around 348 s (using a Python convolutional CPU implementation on a 3.7 GHz Intel i5, executing the process on one core) for a single FP of pixels. In contrast, the trained V–net normalises the same image in 18.5 s (using a Python TensorFlow–Keras implementation with an NVIDIA Titan Xp GPU).

In the following experiment we evaluated the performance of our models (U–net, V–net and Res V–net) and the recent methods reported in Refs. [24, 21, 23]. We evaluated all the methods in the task of normalising FPs corrupted with Gaussian noise, speckle noise, illumination component variations and incomplete field of view. The training was conducted using patches for our models (U–net, V–net and Res V–net) and patches for the compared methods. The training set (patches) was randomly sampled over 30 of the total of 46 FPs. The remaining 16 FPs constitute the test set. For all the methods, we used the same training parameters, as described at the beginning of this section. Figure 11 shows examples of reconstructed patches. Figure 12 shows examples of reconstructed full FPs. The procedure for reconstructing complete FPs was described in subsection 4.3. From a visual inspection of Figures 11 and 12, one can note that the proposed networks produce the best results.

Tables 3–5 summarise the experimental results. Table 3 shows the averaged Mean Square Error (MSE), averaged Mean Absolute Error (MAE) and averaged Peak Signal to Noise Ratio (PSNR) over the full reconstructions of the 30 FPs used to generate the training set. Since we used patches for training, the full FPs were never seen by the networks. Table 4 shows the averaged errors for the reconstructed 16 FPs used to generate the test set.
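The three error measures can be computed as in the following sketch. The images are stored as lists of rows, and the peak value used for the PSNR is passed explicitly because the convention used for the reported figures is not stated here; the small test images are made up for the example.

```python
import math


def _pairs(estimate, reference):
    """Yield matching pixel pairs from two equally sized 2D lists."""
    for est_row, ref_row in zip(estimate, reference):
        yield from zip(est_row, ref_row)


def mse(estimate, reference):
    """Mean Square Error."""
    errors = [(e - r) ** 2 for e, r in _pairs(estimate, reference)]
    return sum(errors) / len(errors)


def mae(estimate, reference):
    """Mean Absolute Error."""
    errors = [abs(e - r) for e, r in _pairs(estimate, reference)]
    return sum(errors) / len(errors)


def psnr(estimate, reference, peak):
    """Peak Signal to Noise Ratio in decibels for a given peak value."""
    error = mse(estimate, reference)
    if error == 0.0:
        return math.inf
    return 10.0 * math.log10(peak ** 2 / error)


reference = [[0.0, 1.0], [1.0, 0.0]]
estimate = [[0.5, 1.0], [1.0, 0.5]]
scores = (mse(estimate, reference), mae(estimate, reference),
          psnr(estimate, reference, peak=1.0))
```

Lower MSE and MAE and higher PSNR indicate a better reconstruction.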

Method       MSE      MAE      PSNR
x (input)    6.644    1.410    1.182
FFD          5.008    1.938    1.302
DCNN         4.048    1.047    1.416
FPD          4.166    1.475    1.422
U–net        0.846    0.587    2.127
V–net        0.764    0.655    2.147
Res V–net    0.824    0.788    2.087

Table 3: Averaged errors over the training FPs (accuracy).

Method       MSE      MAE      PSNR
x (input)    6.501    1.393    1.194
FFD          5.251    2.012    1.283
DCNN         3.797    1.004    1.434
FPD          3.983    1.506    1.426
U–net        0.807    0.585    2.138
V–net        0.765    0.681    2.138
Res V–net    0.812    0.787    2.090

Table 4: Averaged errors over the test FPs (accuracy).

Finally, Table 5 shows the averaged variance of the computed MSE and MAE of the errors in Table 4. The third column shows the square of the Coefficient of Variation (), where the relative standard deviation is defined as . The is a measure of the precision and repeatability of the results.
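The squared coefficient of variation can be computed as in the sketch below. It uses the usual definition of the coefficient of variation, the standard deviation divided by the mean, and the population variance; whether the reported figures use the population or the sample variance is an assumption. The sample values are made up for the example.

```python
import statistics


def squared_cv(values):
    """Squared coefficient of variation: variance divided by squared mean."""
    mean = statistics.fmean(values)
    variance = statistics.pvariance(values)
    return variance / mean ** 2


errors = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
cv2 = squared_cv(errors)
```

Because the variance is divided by the squared mean, the measure is dimensionless, which makes it comparable across methods whose errors differ in magnitude.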

Figure 11: Results of the compared deep neural networks models: normalised patches.

Figure 12: Results of the compared deep neural networks models: normalised complete FPs.

Figure 13: Results of the compared deep neural networks models with real interferometric FP.

Method       Var(MSE)    Var(MAE)    Squared CV
x (input)    12.021      1.637       1.849
FFD          3.714       3.181       0.978
DCNN         7.959       1.861       1.998
FPD          18.296      3.623       3.484
U–net        1.737       0.664       2.151
V–net        0.702       0.476       0.918
Res V–net    0.313       0.348       0.385

Table 5: Averaged variations of the errors over the test FPs (precision).

Figure 13 shows a real interferometric FP and the results of the normalisation obtained with the evaluated methods. We use the same models that were trained with simulated data in this experiment. The performance of all the evaluated methods is consistent with the results obtained when processing synthetic FPs.

6 Discussion and conclusion

6.1 Method limitations

One can note that all the deep neural networks fail to reconstruct the high–frequencies in Figure 13. This limitation is explained by the lack of enough patches with similar high–frequencies in the training set. This is equivalent to processing with a GFB that lacks filters tuned to such high–frequencies. In the case of neural networks, this problem is solved by including examples with such frequencies in the training set.

U–net, V–net and Res V–net models base the FP reconstruction on processing image patches with local information. Although this is an advantage in terms of computational efficiency, there are some limitations for solving FP analysis problems associated with global information. To illustrate this point, we trained a V–net to estimate the quadrature normalised FP. Mathematically, we trained a V–net to estimate an operator , that given observations modelled by (1), produces normalised FPs according to


Figure 14 shows a reconstructed quadrature FP where the “global sign” problem is evident. The problem actually occurs at patch level. There are patches for which the network cannot infer the correct sign of the sine function; see the reconstructed patch in the last row of Figure 15. However, this sign change does not appear in arbitrary orientations; it seems that there exists a principal axis in the orientation domain where the sign changes systematically appear.

We also observed that U–net, V–net and Res V–net models have limitations for filtering out noise at regions with very low frequencies and low contrast; those are regions where a visual inspection does not suggest a clear local dominant frequency. Moreover, we observed that Res V–net cannot completely remove the noise in the experiment of subsection 5.3. An explanation for this behaviour is that the Res V–net model assumes additive noise while our experimental data contains correlated noise (speckle).

Figure 14: Computation of the normalised signal in quadrature. Note the global sign problem.
Figure 15: Computation of the normalised signal in quadrature at patch level. Note the sign problem in the patch at last row.

6.2 Conclusions

The proposed normalisation–denoising method for FPs is based on the deep learning paradigm. Under this paradigm, one trains a neural network (universal approximator) that estimates an appropriate transformation between observations (corrupted inputs) and outputs. In particular, our solution builds upon a deep auto–encoder that produces results with very small errors. Our results show that the proposed U–net and V–net schemes can be applied to real FP images, in which the image’s corruption irregularities correspond to high noise levels, illumination component variations or an incomplete field of view; see Figures 10 and 12. Our models can process real FPs, even if we train the networks with simulated data.

We observed that DNNs can compute reconstructions with lower error than GFBs near the boundary of the field of view; see Figure 10. We found that V–net produces higher quality reconstruction for FPs for higher levels of noise and pupils than U–net and the compared methods. In our opinion, the reason is that the V–net filter distribution across the layers is designed to retain more details, as opposed to U–net which is designed to segment images.

We believe that the evaluated methods open a new research branch for developing more sophisticated FP analysis methods that, based on deep neural networks, can compute solutions with an end–to–end strategy. It would be interesting to design and implement specific deep network architectures for solving challenging problems in FP analysis; e.g., the phase recovery from a single interferogram with closed fringes and the analytic (quadrature) FP computation.


The authors thank Adonai González for providing the data for Figure 13. This research was supported in part by Conacyt, Mexico (A1-S-43858 research grant and A.R.F. PhD. studies scholarship) and the NVIDIA Academic program.