Manifold Modeling in Embedded Space: A Perspective for Interpreting "Deep Image Prior"

by Tatsuya Yokota, et al.

Deep image prior (DIP), which utilizes a deep convolutional network (ConvNet) structure itself as an image prior, has attracted attention in the computer vision community. It empirically demonstrated the effectiveness of the ConvNet structure in various image restoration applications. However, why DIP works so well remains a black box, and why ConvNet is essential for images is not very clear. In this study, we tackle this question by considering convolution as divided into "embedding" and "transformation", and by proposing a simple but essential modeling approach for images/tensors related to dynamical systems and self-similarity. The proposed approach, named manifold modeling in embedded space (MMES), can be implemented by using a denoising auto-encoder in combination with a multiway delay-embedding transform. In spite of its simplicity, the image/tensor completion and super-resolution results of MMES were very similar to, and even competitive with, those of DIP in our experiments, and these results help us reinterpret/characterize DIP from the perspective of a "smooth patch-manifold prior".








Code Repositories


Manifold Modeling in Embedded Space [IEEE-TNNLS, 2020]


I Introduction

The most important ingredient in image/tensor restoration is arguably the "prior", which turns ill-posed optimization problems into well-posed ones, or provides robustness against specific noises and outliers. Many priors have been studied, such as low-rank [28, 15, 14, 36], smoothness [10, 30, 20], sparseness [35], non-negativity [19, 4], and independency [16]. In today's computer vision problems especially, total variation (TV) [12, 40], low-rank [22, 17, 50, 41], and non-local similarity [3, 5] priors are often used for image modeling. These priors can be obtained by analyzing basic properties of natural images, and are categorized as "unsupervised image modeling".

By contrast, the deep image prior (DIP) [37] comes from the "supervised" or "data-driven" image modeling framework (i.e. deep learning), although DIP itself is one of the state-of-the-art unsupervised image restoration methods. The DIP method can be simply described as optimizing an untrained fully convolutional generator network (ConvNet) to minimize the squared loss between its generated image and an observed image (e.g. a noisy image), stopping the optimization before overfitting. In [37], the authors explain why a high-capacity ConvNet can be used as a prior with the following statement: the network resists "bad" solutions and descends much more quickly towards naturally-looking images. This phenomenon of "impedance of ConvNet" was confirmed by a toy experiment. However, most readers would not be convinced by this explanation alone, because it is only part of the whole picture. One of the essential questions is: why ConvNet? From a more practical perspective, it is very important to explain the "priors in DIP" in simple and clear words (like smoothness, sparseness, low-rank, etc.).

Fig. 1: Manifold modeling in embedded space. A case for image inpainting task.

In this study, we tackle the question of why ConvNet is essential as an image prior, and try to translate the "deep image prior" into words. First, we consider the convolution operation as divided into "embedding" and "transformation" (see Fig. 2). Here, "embedding" stands for delay/shift embedding (i.e. Hankelization), a copy/duplication operation performed by sliding a window of some kernel size. The "transformation" is basically a linear transformation in a simple convolution operation, and it can also be a non-linear transformation when combined with a non-linear activation.
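This decomposition can be checked numerically: a 1D valid convolution equals Hankelization followed by a single linear transformation. The following is a minimal numpy sketch with an arbitrary illustrative kernel.

```python
import numpy as np

def hankelize(signal, tau):
    """Delay-embed a 1D signal: stack every sliding window of length tau
    as a column of a (tau, T - tau + 1) Hankel matrix."""
    T = len(signal)
    return np.stack([signal[i:i + tau] for i in range(T - tau + 1)], axis=1)

signal = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
w = np.array([0.5, 0.25, 0.25])          # illustrative kernel
H = hankelize(signal, tau=3)             # "embedding": shape (3, 3)
# "transformation": one linear map applied to each column
# (kernel flipped because np.convolve flips its kernel)
via_embedding = w[::-1] @ H
via_convolve = np.convolve(signal, w, mode='valid')
assert np.allclose(via_embedding, via_convolve)
```

The same decomposition carries over to 2D convolutions by embedding patches instead of windows.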

To simplify the complicated "encoder-decoder" structure of the ConvNet used in DIP, we consider the following network structure: embedding (linear), encoding (non-linear), decoding (non-linear), and backward embedding (linear). Fig. 3 shows a simplified illustration of the ConvNet and the proposed network. Each fiber of the hidden tensor, which is a vectorization of a patch of the input image, is encoded into a lower-dimensional space. Note that the volume of the hidden tensor appears larger than that of the input/output image, but its representation ability is much lower than the input/output image space, since the first/last tensors must have Hankel structure (i.e. their representation ability is equivalent to that of the image) and the hidden tensor is reduced to lower dimensions.

Here, we assume a low latent dimension; this low-dimensionality indicates the existence of similar patches (i.e. self-similarity) in the image, and it provides a kind of "impedance" which passes self-similar patches and resists/ignores others. Each fiber of the hidden tensor represents a coordinate on the patch-manifold of the image, and thus the proposed network structure can be interpreted as manifold modeling in embedded space (MMES), as shown in Fig. 1 (for details, see Section II). Hence, we refer to it as the MMES network. It should be noted that the MMES network is a special case of deep neural networks: in fact, the proposed MMES can be considered as a new kind of auto-encoder in which convolution operations have been replaced by Hankelization in pre-processing and post-processing. As in a ConvNet, the forward and backward embedding operations can be implemented by convolution and transposed convolution with one-hot filters (see Fig. 6 for details). Note that the encoder-decoder part can be implemented by multiple convolution layers with kernel size (1,1) and activations. In our model, we do not use convolution explicitly, but simply apply linear transformations and non-linear activations in the "filter domain" (i.e. the horizontal axis of the tensors in Fig. 3).

Fig. 2: Decomposition of 1D and 2D convolutions: Valid convolution can be divided into delay-embedding/Hankelization and linear transformation.

In computational experiments, we apply the proposed MMES network to unsupervised signal, image, and tensor restoration problems, and achieve results similar to those of DIP.

The contributions of this study can be summarized as follows: (1) an interpretable approach to image/tensor modeling is proposed which translates the ConvNet; (2) the effectiveness of the proposed method and its similarity to DIP are demonstrated experimentally; and (3) most importantly, it offers a prospect for reinterpreting/characterizing DIP as a "smooth patch-manifold prior".

Note that the idea of a low-dimensional patch manifold itself has been proposed in [29, 26]. Peyré first formulated the patch-manifold model of natural images and solved it by dictionary learning and manifold pursuit (sparse modeling) [29]. Osher et al. formulated a regularization function that minimizes the dimension of the patch manifold, and solved the Laplace-Beltrami equation by the point integral method [26]. In comparison with these studies, we minimize the dimension of the patch-manifold by utilizing the novel auto-encoder shown in Fig. 3.

A related technique, low-rank tensor modeling in embedded space, has been studied in [44]. However, the modeling approaches are different: multi-linear vs. manifold. Thus, from the perspective of tensor completion, our study can be interpreted as a manifold version of [44].

Another related work is group sparse representation (GSR) [48]. GSR can be roughly characterized as a combination of similar-patch grouping and sparse-land modeling, which resembles our combination of embedding and manifold modeling. However, the computational cost of similar-patch grouping is clearly higher than that of embedding, and in our method this role is subsumed by the manifold learning.

The main difference between the above studies and ours is the motivation: essential, interpretable, and simple image modeling that can translate the ConvNet/DIP. The proposed MMES has many connections with ConvNet/DIP, such as embedding, non-linear mapping, and training with noise. We believe that the simplicity and interpretability of a method are often more important than obtaining the best performance against other state-of-the-art methods. Nevertheless, the proposed method often delivers competitive performance.

Fig. 3: Comparison of typical auto-encoder ConvNet and the proposed MMES network.

II Manifold Modeling in Embedded Space

Here, in contrast to Section I, we start by explaining the proposed method from its concept, and systematically derive the MMES structure from it. Conceptually, the proposed tensor reconstruction method can be formulated as

s.t. (1)

where the quantities are, respectively, an observed corrupted tensor and the estimated tensor; a linear operator represents the observation system, and a padding-and-Hankelization operator embeds the tensor with a sliding window of a given size. We impose that each column of the embedded matrix is sampled from a low-dimensional manifold in Euclidean space. For the tensor completion task, the observation operator is a projection onto the support set, so that the missing elements become zero. For the super-resolution task, it is a down-sampling operator for images/tensors. Fig. 1 shows the concept of the proposed manifold modeling in the case of image inpainting. We minimize the distance between the observation and the reconstruction on the support, and all patches of the estimate should lie on some restricted manifold. In other words, the image is represented by the patch-manifold, and the properties of the patch-manifold act as image priors.

II-A Multiway-delay embedding for tensors

In [44], multiway-delay embedding for tensors is defined by using the multi-linear tensor product with multiple duplication matrices and tensor reshaping. Basically, we use the same operation, but a padding operation is added. Thus, the multiway-delay embedding used in this study is defined by


where the first operator is a reflection padding operator for tensors, the second is an unfolding operator which outputs a matrix from an input higher-order tensor, and the third is a duplication matrix. Fig. 4 illustrates the duplication matrix. By using the reflection padding, all elements of the tensor can be equally duplicated. Fig. 5 shows an example of our multiway-delay embedding in the case of second-order tensors. The overlapped patch grid is constructed by the multi-linear tensor product with the duplication matrices. Finally, all patches are split, lined up, and vectorized.

Fig. 4: Duplication matrix: it consists of stacked identity matrices.
Fig. 5: Flow of multiway-delay-embedding operation ().

The Moore-Penrose pseudo inverse of is given by


where the components are, respectively, a pseudo-inverse of the duplication matrix and a trimming operator that removes the padded elements at the start and end of each mode.
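As a sketch of the embedding and its pseudo-inverse described above (a simplified numpy version for a second-order tensor; the exact padding and duplication scheme of [44] may differ in detail), reflection padding followed by patch extraction gives the embedding, and folding the patches back with averaging followed by trimming gives the pseudo-inverse:

```python
import numpy as np

def mdt(X, tau):
    """Multiway delay embedding of a 2D array: reflection padding, then
    every overlapping (tau, tau) patch vectorized as one column."""
    Xp = np.pad(X, tau - 1, mode='reflect')
    n, m = Xp.shape[0] - tau + 1, Xp.shape[1] - tau + 1
    cols = [Xp[i:i + tau, j:j + tau].ravel()
            for i in range(n) for j in range(m)]
    return np.stack(cols, axis=1)              # (tau*tau, n*m)

def mdt_pinv(Hmat, shape, tau):
    """Pseudo-inverse: fold the patches back, average the overlapping
    contributions, then trim the padded border."""
    pad = tau - 1
    out = np.zeros((shape[0] + 2 * pad, shape[1] + 2 * pad))
    cnt = np.zeros_like(out)
    n, m = out.shape[0] - tau + 1, out.shape[1] - tau + 1
    for k, (i, j) in enumerate((i, j) for i in range(n) for j in range(m)):
        out[i:i + tau, j:j + tau] += Hmat[:, k].reshape(tau, tau)
        cnt[i:i + tau, j:j + tau] += 1
    return (out / cnt)[pad:pad + shape[0], pad:pad + shape[1]]

X = np.arange(16, dtype=float).reshape(4, 4)
H = mdt(X, tau=2)
assert H.shape == (4, 25)
assert np.allclose(mdt_pinv(H, X.shape, tau=2), X)  # exact round trip
```

The round trip is exact because every slot of a true Hankel matrix holds a copy of the corresponding padded element, so averaging the overlaps recovers each element exactly.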

II-A1 Delay embedding using convolution

Delay embedding and its pseudo-inverse can be implemented by convolution with one-hot-tensor windows of the kernel size. The one-hot-tensor windows are given by folding an identity matrix into a stack of windows. Fig. 6 shows the calculation flow of multiway delay embedding using convolution: the multi-linear tensor product is replaced with a convolution with the one-hot-tensor windows.

The pseudo-inverse of the convolution with padding is given by its adjoint operation, called "transposed convolution" in some neural network libraries, followed by trimming and a simple scaling.

Fig. 6: Multiway-delay-embedding using convolution ().
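The equivalence between delay embedding and convolution with one-hot filters can be checked in one dimension (a minimal numpy sketch; np.convolve flips its kernel, so each one-hot filter is flipped first to obtain pure shifts):

```python
import numpy as np

x = np.arange(6, dtype=float)
tau = 3
# the one-hot filters are the rows of a tau x tau identity matrix;
# the k-th output channel of the valid convolution is x shifted by k
channels = [np.convolve(x, np.eye(tau)[k][::-1], mode='valid')
            for k in range(tau)]
H = np.stack(channels, axis=0)                 # (tau, len(x) - tau + 1)
expected = np.stack([x[k:k + len(x) - tau + 1] for k in range(tau)])
assert np.allclose(H, expected)                # H is the Hankel matrix of x
```

Stacking the channels yields exactly the Hankel matrix, which is why deep learning libraries can realize the embedding with ordinary (transposed) convolution layers.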

II-B Definition of low-dimensional manifold

We consider an auto-encoder to define the low-dimensional manifold in Euclidean space as follows:


where the manifold is defined as the set of points reproduced by the composition of an encoder and a decoder, i.e. the auto-encoder constructed from them. Note that, in general, the use of auto-encoder models is a widely accepted approach to manifold learning [13].

II-C Problem formulation

In this section, we combine the conceptual formulation (1) and the auto-encoder guided manifold constraint to derive an equivalent, more practical optimization problem. First, we redefine the estimated tensor as the output of a generator:



At this moment, the embedded matrix is a function of the generator input; however, the Hankel structure of the matrix cannot always be guaranteed under an unconstrained condition. To guarantee the Hankel structure, we further transform the formulation as follows:


where we introduce an operator which auto-encodes each column of an input matrix, and a matrix which has Hankel structure, obtained by Hankelization of some input tensor. Obviously, that input tensor is the most compact representation of the Hankel matrix. The flow of Eq. (7) is equivalent to the MMES network shown in Fig. 3: its stages correspond, respectively, to forward embedding, encoding, decoding, and backward embedding, where the encoder and decoder are defined by multi-layer perceptrons (i.e. repetitions of linear transformation and non-linear activation).

From this formulation, Problem (1) is transformed as


where the auto-encoder defines the manifold. In this study, the auto-encoder/manifold is learned from the observed tensor itself; thus, the optimization problem is given by


where we refer to the first and second terms as the reconstruction loss and the auto-encoding loss, respectively, and a trade-off parameter balances the two losses.

II-D Design of auto-encoder

In this section, we discuss how to design the neural network architecture of the auto-encoder that restricts the manifold. The simplest way is to control the dimension of the latent space directly. There are many other possibilities: Tikhonov regularization [9], drop-out [8], denoising auto-encoder [39], variational auto-encoder [6], adversarial auto-encoder [24], alpha-GAN [31], and so on. Each of these methods has its own perspective and promise, but their costs are not low. In this study, we select an attractive and fundamental one: the "denoising auto-encoder" (DAE) [39]. The DAE is attractive because it has a strong relationship with Tikhonov regularization [2] and decreases the entropy of the data [33].

Finally, we designed an auto-encoder that controls the latent dimension and the standard deviation of the additive zero-mean Gaussian noise. Fig. 7 illustrates an example architecture of the auto-encoder used in this study. In this case, it consists of five hidden layers with leaky ReLU activations.

Fig. 7: An example of architecture of auto-encoder.
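A forward pass of such a denoising auto-encoder can be sketched as follows (an illustrative numpy version with hypothetical layer widths and random untrained weights; the actual model is trained, and its exact sizes follow Fig. 7):

```python
import numpy as np

rng = np.random.default_rng(0)

def leaky_relu(z, alpha=0.2):
    return np.where(z > 0, z, alpha * z)

def make_layers(dims, rng):
    """Random weights for an MLP auto-encoder; `dims` lists the layer
    widths, e.g. [D, h, r, h, D] with a bottleneck r < D."""
    return [rng.standard_normal((dims[i + 1], dims[i])) / np.sqrt(dims[i])
            for i in range(len(dims) - 1)]

def dae_forward(Ws, H, sigma, rng):
    """Denoising auto-encoder forward pass: corrupt the input columns
    with zero-mean Gaussian noise, then apply the layers with leaky-ReLU
    activations (linear last layer)."""
    Z = H + sigma * rng.standard_normal(H.shape)
    for W in Ws[:-1]:
        Z = leaky_relu(W @ Z)
    return Ws[-1] @ Z

D, r = 64, 4                        # patch dim and bottleneck (hypothetical)
Ws = make_layers([D, 32, r, 32, D], rng)
H = rng.standard_normal((D, 100))   # columns = vectorized patches
out = dae_forward(Ws, H, sigma=0.05, rng=rng)
assert out.shape == H.shape
```

The bottleneck width plays the role of the manifold dimension, and the noise level sigma controls the smoothness enforced by the DAE.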

II-E Optimization

Optimization problem (9) consists of two terms: a reconstruction loss and an auto-encoding loss, balanced by a hyper-parameter. Basically, this trade-off parameter should be large, because the auto-encoding loss should be (close to) zero. However, a very large value prevents minimization of the reconstruction loss and may lead to local optima. Therefore, we adjust its value during the optimization process.

Algorithm 1 shows the optimization algorithm used in this study. The adaptation rule for the trade-off parameter is just an example, and it can be modified appropriately for the data. By exploiting the convolutional structure of the embedding operators (see Section II-A1), the calculation flow can be easily implemented using neural network libraries such as TensorFlow. We employed the Adam [18] optimizer for the updates. The trade-off parameter is adjusted so that no large gap arises between the two losses.

  input: observed tensor and hyper-parameters;
  initialize: estimated tensor and auto-encoder;
  repeat
     generate additive zero-mean Gaussian noise;
     update the parameters by Adam;
     if the reconstruction and auto-encoding losses are unbalanced then
        adjust the trade-off parameter
     end if
  until convergence
  output: recovered tensor;
Algorithm 1 Optimization algorithm
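The trade-off adaptation step can be sketched as follows (an illustrative stand-in; the actual rule in Algorithm 1 is data-dependent, and `rate` and `tol` here are hypothetical):

```python
def adapt_tradeoff(lam, rec_loss, ae_loss, rate=1.1, tol=2.0):
    """Rescale the trade-off parameter so that neither loss term
    dominates the other by more than a factor `tol`."""
    if lam * ae_loss > tol * rec_loss:
        lam /= rate   # auto-encoding term dominates: relax it
    elif tol * lam * ae_loss < rec_loss:
        lam *= rate   # reconstruction term dominates: tighten it
    return lam

lam = adapt_tradeoff(1.0, rec_loss=10.0, ae_loss=1.0)
assert lam > 1.0      # tightened toward the auto-encoding loss
```

Gentle multiplicative updates of this kind keep the two losses comparable, which is the behavior observed around the intersection point in Fig. 11.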

II-F A special setting for color-image recovery

In the case of multi-channel or color-image recovery, we use a special setting of the generator network, because the spatial patterns of the individual channels are similar and the patch-manifold can be shared. Fig. 8 illustrates the auto-encoder shared version of MMES for color image recovery. In this case, the input has three channels, and each channel is embedded independently. Then, the three Hankel matrices are concatenated and auto-encoded simultaneously. The three inverted images are stacked as a color image (third-order tensor), and finally color-transformed. The last color transform can be implemented by a convolution layer with kernel size (1,1), whose weights are also optimized. It should be noted that the three input channels do not necessarily correspond to RGB; rather, they are optimized as some compact color representation.

Fig. 8: Generator network in a case of color-image recovery.
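The shared-embedding idea can be sketched as follows (a simplified numpy version without padding; the 3×3 color matrix stands in for the kernel-size (1,1) convolution and would be learned in practice):

```python
import numpy as np

rng = np.random.default_rng(1)

def embed_channel(X, tau):
    """Per-channel delay embedding: every overlapping (tau, tau) patch of
    a 2D channel becomes one column (padding omitted for brevity)."""
    n, m = X.shape[0] - tau + 1, X.shape[1] - tau + 1
    return np.stack([X[i:i + tau, j:j + tau].ravel()
                     for i in range(n) for j in range(m)], axis=1)

tau, img = 4, rng.random((16, 16, 3))
# embed each channel independently, then concatenate the Hankel matrices
# column-wise so a single shared auto-encoder sees patches of all channels
H = np.concatenate([embed_channel(img[:, :, c], tau) for c in range(3)],
                   axis=1)
assert H.shape == (tau * tau, 3 * 13 * 13)
# the final color transform is a learnable 3x3 matrix, i.e. a kernel-size
# (1,1) convolution mixing the three recovered channels at each pixel
C = rng.standard_normal((3, 3))
mixed = img @ C.T                              # pixel-wise color mixing
assert mixed.shape == img.shape
```

Sharing one auto-encoder across channels triples the number of training patches without increasing the model size.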

III Experiments

Fig. 9: Time series signal recovery of subspace and manifold models in embedded space.
Fig. 10: Two-dimensional (8,8)-patch manifold learned from a 50% missing gray-scale image of ‘Lena’.

III-A Toy examples

In this section, we apply the proposed method to a toy example of signal recovery. Fig. 9 shows the result of this toy experiment. A one-dimensional time-series signal was obtained from the Lorenz system and corrupted by additive Gaussian noise, random missing entries, and three block occlusions. The corrupted signal was recovered by subspace modeling in embedded space [44] and by the proposed manifold modeling in embedded space; the window size of the delay embedding, the lowest dimension of the auto-encoder, and the standard deviation of the additive noise were fixed. Manifold modeling captured the structure of the Lorenz attractor better than subspace modeling.
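A corrupted test signal of this kind can be generated as follows (an illustrative forward-Euler integration of the Lorenz system; the paper's exact integration settings and corruption parameters are not specified here):

```python
import numpy as np

def lorenz(n_steps, dt=0.01, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    """Forward-Euler integration of the Lorenz system; returns the x
    coordinate as a 1D time series."""
    x, y, z = 1.0, 1.0, 1.0
    out = np.empty(n_steps)
    for t in range(n_steps):
        dx = sigma * (y - x)
        dy = x * (rho - z) - y
        dz = x * y - beta * z
        x, y, z = x + dt * dx, y + dt * dy, z + dt * dz
        out[t] = x
    return out

rng = np.random.default_rng(0)
signal = lorenz(2000)
corrupted = signal + 0.5 * rng.standard_normal(2000)  # additive noise
observed = rng.random(2000) > 0.3                     # random missing mask
corrupted = np.where(observed, corrupted, 0.0)
assert signal.shape == (2000,) and np.all(np.isfinite(signal))
```

Delay embedding such a chaotic signal unfolds the attractor, which is why the patch-manifold view transfers naturally from dynamical systems to images.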

Fig. 10 visualizes a two-dimensional (8,8)-patch manifold learned by the proposed method from a gray-scale image of 'Lena' with 50% of its pixels missing. Similar patches are located near each other, and a smooth change of patterns can be observed. This implies a relationship with non-local similarity based methods [3, 5, 11, 48]: the manifold modeling (i.e. the DAE) plays a kind of "patch-grouping" role in the proposed method. The difference from the non-local similarity based approach, which finds patches similar to the target patch within its neighborhood area, is that the manifold modeling is "global" rather than "non-local".

III-A1 Optimization behavior

For this experiment, we recovered a gray-scale image of 'Lena' with 50% of its pixels missing. We stopped the optimization algorithm after 20,000 iterations. The learning rate was set to 0.01 and decayed by a factor of 0.98 every 100 iterations. The trade-off parameter was adapted by Algorithm 1 every 10 iterations. Fig. 11 shows the optimization behavior of the reconstructed image, the reconstruction loss, the auto-encoding loss, and the trade-off coefficient. With the trade-off adjustment, the reconstruction loss and the auto-encoding loss intersected at around 1,500 iterations, and both losses decreased jointly after the intersection point.

Fig. 11: Optimization behavior.

III-B Hyper-parameter sensitivity

We evaluate the sensitivity of MMES to three hyper-parameters: the patch size, the latent dimension, and the noise standard deviation. First, we fixed the patch size and varied the latent dimension and the noise standard deviation. Fig. 13 shows the reconstruction results for a 99% missing image of 'Lena' by the proposed method under different settings. With a very low latent dimension, the proposed method produced blurred results, and with a very high latent dimension, it produced results with many peak artifacts. Furthermore, an appropriate noise level yields sharp and clean results. For reference, Fig. 14 shows the difference between DIP optimized with and without noise. Both results confirm the effect of learning with noise.

Next, we fixed the noise level and varied the patch size for several values of the latent dimension. Fig. 13 shows the results with various patch-size settings for recovering a 99% missing image. Patch sizes of (8,8) or (10,10) were appropriate in this case. The patch size is very important because the variety of patch patterns depends on it. If the patch size is too large, the patch variations expand and the structure of the patch-manifold becomes complicated. By contrast, if the patch size is too small, the information obtained from the embedded matrix is limited and reconstruction becomes difficult in highly missing cases. The same problem may occur in all patch-based image reconstruction methods [3, 5, 11, 48]. However, good patch sizes differ across images and across types/levels of corruption, and the estimation of a good patch size is an open problem. A multi-scale approach [43] may mitigate part of this issue, but the patch size is still fixed or tuned as a hyper-parameter.

Fig. 12: Performance of reconstruction of color image of ‘Lena’ with 99% pixels missing for various parameter setting.
Fig. 13: Reconstruction of 'Lena' image for various patch sizes.
Fig. 14: Reconstruction of ‘home’ image by training with/without noise in deep image prior.
Fig. 15: Completion results from images with 99% missing pixels by HaLTRC [22], TMac [42], tSVD [49], Tucker inc. [44], LRTV [45], SPC [47], GSR [48], MDT-Tucker [44], DIP [37] and the proposed MMES.
Fig. 16: Comparison of averages of PSNR and SSIM for completion of eight color images with various missing rates (from 50% to 99% missing pixels).

III-C Comparisons

III-C1 Color image completion

In this section, we compare the performance of the proposed method with several selected unsupervised image inpainting methods: low-rank tensor completion (HaLRTC) [22], parallel low-rank matrix factorization (TMac) [42], tubal nuclear norm regularization (tSVD) [49], Tucker decomposition with rank increment (Tucker inc.) [44], low-rank and total-variation (LRTV) regularization [45, 46], smooth PARAFAC tensor completion (SPC) [47], GSR [48], multiway delay-embedding based Tucker modeling (MDT-Tucker) [44], and DIP [37]. For LRTV, SPC, and MDT-Tucker, we used the publicly available MATLAB software; for GSR, each color channel was recovered independently using the publicly available MATLAB software; for DIP, we implemented it ourselves in Python with TensorFlow.

For these experiments, the hyper-parameters of all methods were tuned manually to achieve the best peak signal-to-noise ratio (PSNR) and structural similarity (SSIM), although the tuning may not be perfect. For DIP, we did not try all network structures with various kernel sizes, filter sizes, and depths. We simply employed the "default architecture", the details of which are available in the supplementary material of [37], and took the best result at an appropriate intermediate iteration of the optimization, based on the PSNR value. For the proposed MMES method, we adaptively selected the patch size and latent dimension; Table I shows the parameter settings for MMES. The noise level of the denoising auto-encoder was fixed for all images. For the auto-encoder, the same architecture shown in Fig. 7 was employed. The initial learning rate of the Adam optimizer was 0.01, and we decayed the learning rate by a factor of 0.98 every 100 iterations. The optimization was stopped after 20,000 iterations for each image.
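For reference, PSNR as used in this comparison can be computed as follows (standard definition; the peak value is 1.0 for images scaled to [0,1]):

```python
import numpy as np

def psnr(ref, est, peak=1.0):
    """Peak signal-to-noise ratio in dB (standard definition)."""
    mse = np.mean((ref - est) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)

# a uniform error of 0.1 on a [0,1]-scaled image gives MSE = 0.01,
# hence PSNR = 10*log10(1/0.01) = 20 dB
assert abs(psnr(np.zeros((8, 8)), np.full((8, 8), 0.1)) - 20.0) < 1e-9
```

SSIM is a separate perceptual metric; implementations are available in standard image-processing libraries.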

Fig. 16 shows the eight test images and the averages of PSNR and SSIM for various missing ratios {50%, 70%, 90%, 95%, 99%}. The proposed method is quite competitive with DIP. Fig. 15 illustrates the results of color image completion: 99% of randomly selected pixels were removed from the (256,256,3) tensors, and the tensors were recovered by each method. The images reconstructed by DIP and MMES were very similar to each other and much better than those of the other methods.

airplane baboon barbara facade house lena peppers sailboat
50 % (16,4) (10,4) (6,4) (10,4) (16,4) (6,4) (6,4) (6,4)
70 % (16,4) (10,4) (6,4) (16,4) (16,4) (6,4) (16,4) (6,4)
90 % (16,4) (4,8) (6,4) (16,4) (16,4) (8,4) (16,4) (4,4)
95 % (16,4) (4,6) (6,4) (16,4) (16,4) (6,8) (16,4) (6,8)
99 % (8,32) (4,4) (6,4) (4,1) (8,16) (10,32) (8,8) (6,4)
TABLE I: Parameter settings for MMES in image completion experiments
Fig. 17: Results of MRI completion: Optimization behaviors of PSNR with final values of PSNR/SSIM by DIP and proposed MMES.
Fig. 18: Illustration of MRI reconstructed from 90% missing tensor.

III-C2 Volumetric/3D image/tensor completion

In this section, we show the results of an MR-image/3D-tensor completion problem. The size of the MR image is (109,91,91). We randomly removed 50%, 70%, and 90% of the voxels of the original MR image and recovered the missing volumes by the proposed method and DIP. For DIP, we implemented a 3D version of the default architecture in TensorFlow, but the number of filters in the shallow layers was slightly reduced because of GPU memory constraints. For the proposed method, the 3D patch size, the lowest dimension, and the noise level were fixed, and the same architecture shown in Fig. 7 was employed.

Fig. 17 shows the optimization behavior of PSNR, with the final values of PSNR/SSIM, in this experiment. In terms of PSNR and SSIM, the proposed MMES outperformed DIP in the low-missing-rate cases and was quite competitive in the highly missing cases. The slight degradation of DIP might be caused by an insufficient number of filters, since a 3D ConvNet would require many more filters than a 2D ConvNet. Moreover, the computational time required by MMES was significantly shorter than that of DIP in this tensor completion problem. Fig. 18 shows the reconstructed MR images in the case of 90% missing voxels.

Fig. 19: Super-resolution results: the first and second rows, 'Starfish' and 'Leaves', were up-scaled from (64,64,3) to (256,256,3); the third row, 'Monarch', was up-scaled from (128,128,3) to (512,512,3); and the fourth row, 'Airplane', was up-scaled from (64,64,3) to (512,512,3).
PSNR / SSIM Bicubic GSR DIP MMES (proposed)
Starfish (64 to 256) 23.98 / .7124 25.73 / .7922 25.79 / .7930 26.18 / .8099
House (64 to 256) 26.21 / .7839 28.05 / .8394 28.33 / .8420 28.79 / .8448
Leaves (64 to 256) 19.10 / .6673 22.60 / .8511 22.54 / .8535 23.96 / .8935
Airplane (128 to 512) 26.30 / .9176 27.74 / .9487 27.49 / .9375 28.40 / .9503
Airplane (64 to 512) 22.93 / .7545 23.79 / .8061 23.83 / .8155 24.10 / .8207
Baboon (128 to 512) 20.61 / .6904 20.93 / .7542 20.52 / .7260 20.92 / .7486
Baboon (64 to 512) 19.38 / .4505 19.61 / .5039 19.64 / .5085 19.64 / .5024
Lena (128 to 512) 28.64 / .9172 30.36 / .9481 29.91 / .9406 29.76 / .9406
Lena (64 to 512) 25.23 / .7710 26.47 / .8271 26.71 / .8340 26.68 / .8327
Monarch (128 to 512) 24.88 / .9322 27.67 / .9679 27.90 / .9576 28.81 / .9686
Monarch (64 to 512) 20.65 / .7697 22.13 / .8393 22.65 / .8594 23.01 / .8627
Peppers (128 to 512) 27.27 / .9392 29.19 / .9642 28.78 / .9578 28.85 / .9584
Peppers (64 to 512) 24.15 / .8173 25.52 / .8753 26.07 / .8904 25.75 / .8794
Sailboat (128 to 512) 24.38 / .8885 25.43 / .9262 25.13 / .9130 25.72 / .9273
Sailboat (64 to 512) 21.22 / .6898 21.94 / .7463 22.32 / .7664 23.37 / .7705
Average 23.66 / .7801 25.14 / .8393 25.19 / .8401 25.53 / .8474
TABLE II: Values of PSNR and SSIM in super-resolution task

III-C3 Color image super-resolution

In this section, we compare the performance of the proposed method with several selected unsupervised image super-resolution methods: bicubic interpolation, GSR [48] (each color channel was recovered independently using the publicly available MATLAB software, which we slightly modified to apply it to the super-resolution task), and DIP [37].

In these experiments, DIP was run with the best number of iterations chosen from {1000, 2000, 3000, ..., 9000}. For MMES, the hyper-parameters were set separately for four-times (x4) and eight-times (x8) up-scaling. For all images in MMES, the auto-encoder architecture consists of three hidden layers. We assumed the same Lanczos2 kernel as the down-sampling system for all super-resolution methods.

Table II shows the PSNR and SSIM values of the results. We used three (256,256,3) color images and six (512,512,3) color images; the super-resolution methods scaled them up from their four- or eight-times down-scaled versions. According to this quantitative evaluation, bicubic interpolation was clearly worse than the others, while GSR, DIP, and MMES were very competitive. In detail, DIP was slightly better than GSR, and the proposed MMES was slightly better than DIP.

Fig. 19 shows selected high-resolution images reconstructed by the four super-resolution methods. In general, the bicubic method reconstructed blurred images, which were visually worse than the others. GSR results had smooth outlines in all images, but were slightly blurred. DIP reconstructed visually sharp images, but with jagged artifacts along diagonal lines. The proposed MMES reconstructed sharp and smooth outlines for all images. Focusing on 'Starfish', the high-resolution image reconstructed by MMES had sharper texture outlines than the others. Focusing on 'Leaves', MMES reproduced clearer leaf tips than the others. For 'Monarch' and 'Airplane', the three methods were very competitive.

IV Discussions and Conclusions

A beautiful manifold representation of complicated signals in embedded space was originally discovered in the study of dynamical system analysis (i.e. chaos analysis) for time-series signals [27]. Since then, many signal processing and computer vision applications have been studied, but most methods have relied on linear approximation because of the difficulty of non-linear modeling [38, 34, 21, 7, 25]. Nowadays, however, the study of non-linear/manifold modeling has progressed greatly with deep learning, and it was successfully applied in this study. Interestingly, we could apply this non-linear system analysis not only to time-series signals but also to natural color images and tensors (e.g. videos). To the best of our knowledge, this is the first study to apply Hankelization with an auto-encoder to general tensor data.

The interpretability of MMES is obviously higher than that of recent sophisticated deep learning models, and it keeps its relationship with ConvNet through the view of convolution as embedding plus transformation. Thus, MMES helps us understand how ConvNet works. Moreover, our experiments gave an important indication of patch-manifold reconstruction (see Fig. 10) in ConvNet.

Furthermore, we pointed out the effect of "learning with noise" in DIP, and applied the denoising auto-encoder in the proposed method. In fact, learning with noise plays an essential role, as illustrated in Fig. 14 and Fig. 13. This indicates that learning with noise helps to reconstruct a smooth manifold even when the capacity of the ConvNet structure is very high and the data are highly corrupted.

The main proposition of the DIP study [37] was that some image priors reside in the ConvNet structure itself; however, those priors were not explicitly explained in words. In this study, we claim that one of the priors in the ConvNet structure exploited by DIP is a "smooth patch-manifold prior". This gives a deeper interpretation of DIP from the perspective of smooth patch-manifold reconstruction, and makes it easier to use DIP in more general applications such as tensors. Furthermore, it bridges slightly different research areas: dynamical system analysis, deep learning, and tensor modeling.

A limitation of this study is that the proposed model does not incorporate the "multi-layered convolution" and "multi-resolution" structures that DIP implicitly obtains through multiple upsampling/downsampling operations with skip connections. In other words, MMES is still developing and can be improved in many aspects. The difficulty is how to retain the interpretability of the model while doing so; this remains an open problem and is left for future work. Furthermore, we considered only image/tensor completion and super-resolution in this study; other tasks such as denoising, compressed sensing, and image dehazing are also left for future work.

From the perspective of manifold modeling, we employed a denoising auto-encoder in this study. However, there are other manifold modeling methods, such as locally linear embedding [32], Laplacian eigenmaps [1], and t-distributed stochastic neighbor embedding [23]. Replacing the manifold modeling component with such methods is a promising direction.

The MMES network architecture is basically designed for self patch-manifold learning applied to image/tensor restoration problems. It is also possible to apply the MMES network to supervised learning, where it might become one approach to interpretable deep learning.


This work was supported by JST ACT-I: Grant Number JPMJPR18UU, and THE HORI SCIENCES AND ARTS FOUNDATION.


  • [1] M. Belkin and P. Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15(6):1373–1396, 2003.
  • [2] C. M. Bishop. Training with noise is equivalent to Tikhonov regularization. Neural Computation, 7(1):108–116, 1995.
  • [3] A. Buades, B. Coll, and J.-M. Morel. A non-local algorithm for image denoising. In Proceedings of CVPR, volume 2, pages 60–65. IEEE, 2005.
  • [4] A. Cichocki, R. Zdunek, A. H. Phan, and S.-i. Amari. Nonnegative matrix and tensor factorizations: applications to exploratory multi-way data analysis and blind source separation. John Wiley & Sons, 2009.
  • [5] K. Dabov, A. Foi, V. Katkovnik, and K. Egiazarian. Image denoising by sparse 3-d transform-domain collaborative filtering. IEEE Transactions on Image Processing, 16(8), 2007.
  • [6] M. W. Diederik P Kingma. Auto-encoding variational bayes. In Proceedings of ICLR, 2014.
  • [7] T. Ding, M. Sznaier, and O. I. Camps. A rank minimization approach to video inpainting. In Proceedings of ICCV, pages 1–8. IEEE, 2007.
  • [8] Y. Gal and Z. Ghahramani. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In Proceedings of ICML, pages 1050–1059, 2016.
  • [9] I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT press, 2016.
  • [10] W. E. L. Grimson. From images to surfaces: A computational study of the human early visual system. MIT Press, 1981.
  • [11] S. Gu, L. Zhang, W. Zuo, and X. Feng. Weighted nuclear norm minimization with application to image denoising. In Proceedings of CVPR, pages 2862–2869, 2014.
  • [12] F. Guichard and F. Malgouyres. Total variation based interpolation. In Proceedings of EUSIPCO, pages 1–4. IEEE, 1998.
  • [13] G. E. Hinton and R. R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.
  • [14] F. L. Hitchcock. The expression of a tensor or a polyadic as a sum of products. Journal of Mathematics and Physics, 6(1):164–189, 1927.
  • [15] H. Hotelling. Analysis of a complex of statistical variables into principal components. Journal of Educational Psychology, 24(6):417, 1933.
  • [16] A. Hyvarinen, J. Karhunen, and E. Oja. Independent component analysis, volume 46. John Wiley & Sons, 2004.
  • [17] H. Ji, C. Liu, Z. Shen, and Y. Xu. Robust video denoising using low rank matrix completion. In Proceedings of CVPR, pages 1791–1798. IEEE, 2010.
  • [18] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • [19] D. D. Lee and H. S. Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401(6755):788, 1999.
  • [20] S. Z. Li. Markov random field models in computer vision. In Proceedings of ECCV, pages 361–370. Springer, 1994.
  • [21] Y. Li, K. R. Liu, and J. Razavilar. A parameter estimation scheme for damped sinusoidal signals based on low-rank Hankel approximation. IEEE Transactions on Signal Processing, 45(2):481–486, 1997.
  • [22] J. Liu, P. Musialski, P. Wonka, and J. Ye. Tensor completion for estimating missing values in visual data. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(1):208–220, 2013.
  • [23] L. v. d. Maaten and G. Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579–2605, 2008.
  • [24] A. Makhzani, J. Shlens, N. Jaitly, I. Goodfellow, and B. Frey. Adversarial autoencoders. arXiv preprint arXiv:1511.05644, 2015.
  • [25] I. Markovsky. Structured low-rank approximation and its applications. Automatica, 44(4):891–909, 2008.
  • [26] S. Osher, Z. Shi, and W. Zhu. Low dimensional manifold model for image processing. SIAM Journal on Imaging Sciences, 10(4):1669–1690, 2017.
  • [27] N. H. Packard, J. P. Crutchfield, J. D. Farmer, and R. S. Shaw. Geometry from a time series. Physical Review Letters, 45(9):712, 1980.
  • [28] K. Pearson. LIII. On lines and planes of closest fit to systems of points in space. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 2(11):559–572, 1901.
  • [29] G. Peyre. Manifold models for signals and images. Computer Vision and Image Understanding, 113(2):249–260, 2009.
  • [30] T. Poggio, V. Torre, and C. Koch. Computational vision and regularization theory. Nature, 317:26, 1985.
  • [31] M. Rosca, B. Lakshminarayanan, D. Warde-Farley, and S. Mohamed. Variational approaches for auto-encoding generative adversarial networks. arXiv preprint arXiv:1706.04987, 2017.
  • [32] S. T. Roweis and L. K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323–2326, 2000.
  • [33] S. Sonoda and N. Murata. Transportation analysis of denoising autoencoders: a novel method for analyzing deep neural networks. arXiv preprint arXiv:1712.04145, 2017.
  • [34] M. Szummer and R. W. Picard. Temporal texture modeling. In Proceedings of ICIP, volume 3, pages 823–826. IEEE, 1996.
  • [35] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1):267–288, 1996.
  • [36] L. R. Tucker. Some mathematical notes on three-mode factor analysis. Psychometrika, 31(3):279–311, 1966.
  • [37] D. Ulyanov, A. Vedaldi, and V. Lempitsky. Deep image prior. In Proceedings of CVPR, pages 9446–9454, 2018.
  • [38] P. Van Overschee and B. De Moor. Subspace algorithms for the stochastic identification problem. In Proceedings of IEEE Conference on Decision and Control, pages 1321–1326. IEEE, 1991.
  • [39] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol. Extracting and composing robust features with denoising autoencoders. In Proceedings of ICML, pages 1096–1103, 2008.
  • [40] C. R. Vogel and M. E. Oman. Fast, robust total variation-based reconstruction of noisy, blurred images. IEEE Transactions on Image Processing, 7(6):813–824, 1998.
  • [41] W. Wang, V. Aggarwal, and S. Aeron. Efficient low rank tensor ring completion. In Proceedings of ICCV, pages 5697–5705, 2017.
  • [42] Y. Xu, R. Hao, W. Yin, and Z. Su. Parallel matrix factorization for low-rank tensor completion. Inverse Problems & Imaging, 9(2), 2015.
  • [43] N. Yair and T. Michaeli. Multi-scale weighted nuclear norm image restoration. In Proceedings of CVPR, pages 3165–3174, 2018.
  • [44] T. Yokota, B. Erem, S. Guler, S. K. Warfield, and H. Hontani. Missing slice recovery for tensors using a low-rank model in embedded space. In Proceedings of CVPR, pages 8251–8259, 2018.
  • [45] T. Yokota and H. Hontani. Simultaneous visual data completion and denoising based on tensor rank and total variation minimization and its primal-dual splitting algorithm. In Proceedings of CVPR, pages 3732–3740, 2017.
  • [46] T. Yokota and H. Hontani. Simultaneous tensor completion and denoising by noise inequality constrained convex optimization. IEEE Access, 2019.
  • [47] T. Yokota, Q. Zhao, and A. Cichocki. Smooth PARAFAC decomposition for tensor completion. IEEE Transactions on Signal Processing, 64(20):5423–5436, 2016.
  • [48] J. Zhang, D. Zhao, and W. Gao. Group-based sparse representation for image restoration. IEEE Transactions on Image Processing, 23(8):3336–3351, 2014.
  • [49] Z. Zhang, G. Ely, S. Aeron, N. Hao, and M. Kilmer. Novel methods for multilinear data completion and de-noising based on tensor-SVD. In Proceedings of CVPR, pages 3842–3849, 2014.
  • [50] Q. Zhao, L. Zhang, and A. Cichocki. Bayesian CP factorization of incomplete tensors with automatic rank determination. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(9):1751–1763, 2015.