MMES
Manifold Modeling in Embedded Space [IEEE-TNNLS, 2020]
Deep image prior (DIP), which uses a deep convolutional network (ConvNet) structure itself as an image prior, has attracted attention in the computer vision community. It empirically demonstrated the effectiveness of the ConvNet structure in various image restoration applications. However, why DIP works so well is still a black box, and why a ConvNet is essential for images is not very clear. In this study, we tackle this question by dividing the convolution into "embedding" and "transformation", and by proposing a simple but essential modeling approach for images/tensors related to dynamical systems and self-similarity. The proposed approach, named manifold modeling in embedded space (MMES), can be implemented with a denoising auto-encoder combined with a multiway delay-embedding transform. In spite of its simplicity, the image/tensor completion and super-resolution results of MMES were very similar to, and even competitive with, those of DIP in our experiments, and these results help reinterpret/characterize DIP from the perspective of a "smooth patch-manifold prior".
The most important ingredient in image/tensor restoration is the "prior", which usually turns an ill-posed optimization problem into a well-posed one, or provides robustness against specific noises and outliers. Many priors have been studied in computational problems, such as low-rank [28, 15, 14, 36], smoothness [10, 30, 20], sparseness [35], non-negativity [19, 4], independence [16], and so on. In today's computer vision problems in particular, total variation (TV) [12, 40], low-rank [22, 17, 50, 41], and non-local similarity [3, 5] priors are often used for image modeling. These priors are obtained by analyzing basic properties of natural images, and can be categorized as "unsupervised image modeling".

By contrast, the deep image prior (DIP) [37] comes from the "supervised" or "data-driven" image modeling framework (i.e., deep learning), although DIP itself is one of the state-of-the-art unsupervised image restoration methods. DIP can be summarized simply: optimize an untrained fully convolutional generator network (ConvNet) to minimize the squared loss between its generated image and an observed image (e.g., a noisy image), and stop the optimization before overfitting. In [37], the authors explain why a high-capacity ConvNet can serve as a prior with the following statement: the network resists "bad" solutions and descends much more quickly towards naturally-looking images; this phenomenon of "impedance of the ConvNet" was confirmed by a toy experiment. However, most readers would not be convinced by this explanation alone, because it is only part of the whole picture. One essential question is: why a ConvNet? From a more practical perspective, it is very important to explain the "prior in DIP" in simple and clear words (like smoothness, sparseness, or low-rank).
In this study, we tackle the question of why a ConvNet is essential as an image prior, and try to put the "deep image prior" into words. First, we divide the convolution operation into "embedding" and "transformation" (see Fig. 2). Here, "embedding" stands for delay/shift embedding (i.e., Hankelization), a copy/duplication operation realized by a sliding window of some kernel size. The "transformation" is basically a linear transformation in a simple convolution operation, and it also covers non-linear transformations when a non-linear activation follows.
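To make the embedding/transformation split concrete, here is a minimal NumPy sketch for a 1-D "valid" convolution; the helper name `delay_embed` and the toy signal are our own illustration, not from the paper:

```python
import numpy as np

def delay_embed(x, k):
    """'Embedding': stack sliding windows of length k (Hankelization)."""
    return np.stack([x[i:i + k] for i in range(len(x) - k + 1)])  # (T-k+1, k)

# A 1-D convolution is the embedding followed by a linear map (the filter).
x = np.arange(8.0)
w = np.array([1.0, -2.0, 1.0])           # kernel of size k=3
H = delay_embed(x, 3)                    # copy/duplicate via sliding window
y_embed_then_transform = H @ w[::-1]     # linear "transformation"
y_direct = np.convolve(x, w, mode="valid")
assert np.allclose(y_embed_then_transform, y_direct)
```

Replacing the single linear map `w` with a non-linear function of each row of `H` gives exactly the "embedding + non-linear transformation" view used below.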
To simplify the complicated "encoder-decoder" structure of the ConvNet used in DIP, we consider the following network structure: embedding (linear), encoding (non-linear), decoding (non-linear), and backward embedding (linear). Fig. 3 shows a simplified illustration of the ConvNet and the proposed network. Each fiber of the first tensor, which is a vectorization of a patch of the input image, is encoded into a lower-dimensional space. Note that although the volume of the hidden tensor appears larger than that of the input/output image, its representation ability is much lower than the input/output image space, since the first/last tensor must have Hankel structure (i.e., its representation ability is equivalent to the image) and the hidden tensor is reduced to lower dimensions. Here, we assume the hidden dimension is small; this low-dimensionality indicates the existence of similar patches (i.e., self-similarity) in the image, and it provides a kind of "impedance" that passes self-similar patches and resists/ignores others. Each fiber of the hidden tensor represents a coordinate on the patch manifold of the image, and the proposed network structure can thus be interpreted as manifold modeling in embedded space (MMES), as in Fig. 1 (for details, see Section II). Hence, we refer to it as the MMES network. It should be noted that the MMES network is a special case of deep neural networks. In fact, MMES can be considered a new kind of auto-encoder in which the convolution operations are replaced by Hankelization in pre-processing and post-processing. Compared with a ConvNet, the forward and backward embedding operations can be implemented by convolution and transposed convolution with one-hot filters (see Fig. 6 for details). Note that the encoder-decoder part can be implemented by multiple convolution layers with kernel size (1,1) and activations. In our model, we do not use convolution explicitly, but only apply linear transformations and non-linear activations in the "filter domain" (i.e., the horizontal axis of the tensors in Fig. 3).
In computational experiments, we apply the proposed MMES network to unsupervised signal, image, and tensor restoration problems, and achieve results similar to DIP.

The contributions of this study can be summarized as follows: (1) an interpretable approach to image/tensor modeling is proposed which translates the ConvNet; (2) the effectiveness of the proposed method and its similarity to DIP are demonstrated in experiments; and (3) most importantly, there is a prospect for reinterpreting/characterizing DIP as a "smooth patch-manifold prior".
Note that the idea of a low-dimensional patch manifold itself has been proposed in [29, 26]. Peyré first formulated the patch-manifold model of natural images and solved it by dictionary learning and manifold pursuit (sparse modeling) [29]. Osher et al. formulated a regularization function that minimizes the dimension of the patch manifold, and solved the Laplace-Beltrami equation by a point integral method [26]. In comparison with these studies, we minimize the dimension of the patch manifold by using the novel auto-encoder shown in Fig. 3.

A related technique, low-rank tensor modeling in embedded space, has been studied in [44]. However, the modeling approaches differ: multi-linear versus manifold. Thus, from the perspective of tensor completion, our study can be interpreted as a manifold version of [44].

Another related work is group sparse representation (GSR) [48]. GSR is roughly characterized as a combination of similar-patch grouping and sparse-land modeling, which resembles our combination of embedding and manifold modeling. However, the computational cost of similar-patch grouping is clearly higher than that of embedding, and in our method this role is absorbed into the manifold learning.

The main difference between the above studies and ours is the motivation: essential, interpretable, and simple image modeling which can translate the ConvNet/DIP. The proposed MMES has many connections with ConvNet/DIP, such as embedding, non-linear mapping, and training with noise. We believe that the simplicity and interpretability of a method is often more important than obtaining the best performance compared with other state-of-the-art methods. Nevertheless, the proposed method often achieves competitive performance.
Here, in contrast to Section I, we explain the proposed method starting from its concept, and systematically derive the MMES structure from it. Conceptually, the proposed tensor reconstruction method can be formulated as

$$\min_{\mathcal{X}} \ \| \mathcal{Y} - \mathcal{A}(\mathcal{X}) \|_F^2 \quad \text{s.t.} \quad \boldsymbol{h}_t \in \mathcal{M}_r \ (\forall t), \quad [\boldsymbol{h}_1, \dots, \boldsymbol{h}_T] = \mathcal{H}_{\boldsymbol{\tau}}(\mathcal{X}), \tag{1}$$

where $\mathcal{Y}$ is an observed corrupted tensor, $\mathcal{X}$ is the estimated tensor, $\mathcal{A}$ is a linear operator representing the observation system, $\mathcal{H}_{\boldsymbol{\tau}}$ is the padding and Hankelization operator with a sliding window of size $\boldsymbol{\tau}$, and each column $\boldsymbol{h}_t$ of the matrix $\mathcal{H}_{\boldsymbol{\tau}}(\mathcal{X})$ is constrained to lie on an $r$-dimensional manifold $\mathcal{M}_r$ in $D$-dimensional Euclidean space, where $D$ is the number of entries in a window. For the tensor completion task, $\mathcal{A}$ is the projection onto the support set, setting the missing elements to zero. For the super-resolution task, $\mathcal{A}$ is a down-sampling operator on images/tensors. Fig. 1 shows the concept of the proposed manifold modeling in the case of image inpainting. We minimize the distance between the observation and the reconstruction on the support, and all patches of $\mathcal{X}$ must lie on the restricted manifold $\mathcal{M}_r$. In other words, $\mathcal{X}$ is represented by the patch manifold, and the properties of the patch manifold act as image priors.

In [44], multiway delay embedding for tensors is defined using the multi-linear tensor product with multiple duplication matrices and tensor reshaping. We use essentially the same operation, but add a padding step. The multiway delay embedding used in this study is thus defined by
$$\mathcal{H}_{\boldsymbol{\tau}}(\mathcal{X}) = \mathrm{unfold}\bigl( \mathrm{pad}(\mathcal{X}) \times_1 \mathbf{S}_1 \times_2 \cdots \times_N \mathbf{S}_N \bigr), \tag{2}$$

where $\mathrm{pad}$ is a multi-dimensional reflection padding operator for tensors, $\mathrm{unfold}$ is an unfolding operator which outputs a matrix from an input $N$-th order tensor, and each $\mathbf{S}_n$ is a duplication matrix. Fig. 4 shows an example of a duplication matrix. By using reflection padding, all elements of $\mathcal{X}$ can be duplicated equally often. Fig. 5 shows an example of our multiway delay embedding for second-order tensors. The overlapping patch grid is constructed by the multi-linear tensor product with the duplication matrices. Finally, all patches are split, lined up, and vectorized.
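A 1-D sketch of Eq. (2); the helper names (`duplication_matrix`, `hankelize`) are ours, and the reflection-padding width of tau-1 is our illustrative choice:

```python
import numpy as np

def duplication_matrix(L, tau):
    """S in {0,1}^{(L-tau+1)*tau x L}: rows extract overlapping windows."""
    T = L - tau + 1
    S = np.zeros((T * tau, L))
    for t in range(T):
        for j in range(tau):
            S[t * tau + j, t + j] = 1.0
    return S

def hankelize(x, tau):
    """Reflection-pad, then embed: columns of the result are tau-patches."""
    xp = np.pad(x, tau - 1, mode="reflect")     # reflection padding
    S = duplication_matrix(len(xp), tau)
    return (S @ xp).reshape(-1, tau).T          # D x T Hankel-structured matrix

x = np.array([1.0, 2.0, 3.0, 4.0])
H = hankelize(x, 2)
assert H.shape == (2, 5)                        # tau x (num. windows)
assert np.allclose(H[:, 2], [2.0, 3.0])         # column t is the t-th window
```

In practice the duplication matrices are never formed explicitly; the equivalent one-hot-filter convolution discussed below is used instead.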
The Moore-Penrose pseudo-inverse of $\mathcal{H}_{\boldsymbol{\tau}}$ is given by

$$\mathcal{H}_{\boldsymbol{\tau}}^\dagger(\mathbf{H}) = \mathrm{trim}\bigl( \mathrm{fold}(\mathbf{H}) \times_1 \mathbf{S}_1^\dagger \times_2 \cdots \times_N \mathbf{S}_N^\dagger \bigr), \tag{3}$$

where $\mathbf{S}_n^\dagger$ is the pseudo-inverse of the duplication matrix $\mathbf{S}_n$, $\mathrm{fold}$ is the inverse of the unfolding operator, and $\mathrm{trim}$ is a trimming operator which removes the padded elements at the start and end of each mode.
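In the 1-D, unpadded case, the pseudo-inverse of Eq. (3) reduces to the classic anti-diagonal averaging of a Hankel matrix. This sketch (our own helpers; padding and trimming omitted for brevity) verifies that the pseudo-inverse undoes the embedding:

```python
import numpy as np

def hankelize(x, tau):
    """Unpadded 1-D Hankelization: columns are the sliding tau-windows."""
    return np.stack([x[t:t + tau] for t in range(len(x) - tau + 1)]).T

def hankelize_pinv(H):
    """Pseudo-inverse of 1-D Hankelization: average all matrix entries
    that were copied from the same sample (anti-diagonal averaging)."""
    tau, T = H.shape
    L = T + tau - 1
    out = np.zeros(L)
    cnt = np.zeros(L)
    for t in range(T):
        out[t:t + tau] += H[:, t]
        cnt[t:t + tau] += 1.0
    return out / cnt

x = np.linspace(0.0, 1.0, 7)
H = hankelize(x, 3)
assert np.allclose(hankelize_pinv(H), x)   # pseudo-inverse recovers x exactly
```

For a matrix that is *not* Hankel-structured, the same averaging projects it onto the nearest signal, which is exactly the role the backward embedding plays in the MMES network.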
Delay embedding and its pseudo-inverse can be implemented by convolution with all one-hot-tensor windows of size $\boldsymbol{\tau}$. The one-hot-tensor windows are given by folding a $D$-dimensional identity matrix into window-shaped tensors. Fig. 6 shows the calculation flow of multiway delay embedding using convolution; the multi-linear tensor product is replaced by convolution with the one-hot-tensor windows. The pseudo-inverse of the convolution with padding is given by its adjoint operation, called the "transposed convolution" in some neural network libraries, followed by trimming and simple scaling (averaging over the duplicated entries).
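The equivalence between delay embedding and convolution with one-hot filters can be checked directly in NumPy (1-D case; helper names are ours):

```python
import numpy as np

def embed_via_onehot_conv(x, tau):
    """Delay embedding as tau 'valid' convolutions with one-hot filters."""
    filters = np.eye(tau)                      # each row is a one-hot window
    # np.convolve flips its kernel, so pre-flip to get correlation behavior
    return np.stack([np.convolve(x, f[::-1], mode="valid") for f in filters])

x = np.arange(6.0)
H = embed_via_onehot_conv(x, 3)
H_direct = np.stack([x[t:t + 3] for t in range(4)]).T   # plain sliding windows
assert np.allclose(H, H_direct)
```

This is why the embedding step fits naturally into neural network libraries: it is just a convolution layer whose (frozen) weights form an identity matrix.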
We use an auto-encoder to define the $r$-dimensional manifold $\mathcal{M}_r$ in $D$-dimensional Euclidean space as follows:

$$\mathcal{M}_r = \{\, \boldsymbol{h} \in \mathbb{R}^D \mid \boldsymbol{h} = g_r(\boldsymbol{h}) \,\}, \tag{4}$$

$$g_r = D_r \circ E_r, \tag{5}$$

where $E_r : \mathbb{R}^D \rightarrow \mathbb{R}^r$ is an encoder, $D_r : \mathbb{R}^r \rightarrow \mathbb{R}^D$ is a decoder, and $g_r$ is the auto-encoder constructed from them. Note that, in general, the use of auto-encoder models is a widely accepted approach to manifold learning [13].
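As a sanity check of Eqs. (4)-(5), consider a *linear* auto-encoder with an orthonormal decoder (a deliberate simplification; the paper's encoder/decoder are non-linear MLPs). Its fixed points are exactly the decoded points, i.e. the modeled manifold:

```python
import numpy as np

rng = np.random.default_rng(0)

# Linear auto-encoder: encoder h -> U^T h, decoder z -> U z, with U orthonormal.
D_dim, r = 8, 2
U, _ = np.linalg.qr(rng.standard_normal((D_dim, r)))
encode = lambda h: U.T @ h            # R^D -> R^r
decode = lambda z: U @ z              # R^r -> R^D
g = lambda h: decode(encode(h))       # auto-encoder g_r = D_r . E_r

# Decoded points are fixed points of g: they lie on the manifold of Eq. (4).
z = rng.standard_normal(r)
h = decode(z)
assert np.allclose(g(h), h)           # h = g_r(h): h is on the modeled manifold

# A generic point is not fixed, so g acts as a projector onto the manifold.
h_off = rng.standard_normal(D_dim)
assert not np.allclose(g(h_off), h_off)
```

In this linear case the "manifold" is an $r$-dimensional subspace; the non-linear MLP used in MMES bends it into a curved patch manifold.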
In this section, we combine the conceptual formulation (1) with the auto-encoder-guided manifold constraint to derive an equivalent, more practical optimization problem. First, we redefine $\mathcal{X}$ as the output of a generator:

$$\mathcal{X} = \mathcal{H}_{\boldsymbol{\tau}}^\dagger(\mathbf{H}), \tag{6}$$

where $\mathbf{H} = [\boldsymbol{h}_1, \dots, \boldsymbol{h}_T]$. At this moment, $\mathcal{X}$ is a function of $\mathbf{H}$; however, the Hankel structure of $\mathbf{H}$ cannot be guaranteed when $\mathbf{H}$ is unconstrained. To guarantee the Hankel structure of $\mathbf{H}$, we further transform it as follows:

$$\mathcal{X} = \mathcal{H}_{\boldsymbol{\tau}}^\dagger\bigl( g_r(\mathcal{H}_{\boldsymbol{\tau}}(\mathcal{Z})) \bigr), \tag{7}$$

where $g_r(\cdot)$ auto-encodes each column of an input matrix, and $\mathcal{H}_{\boldsymbol{\tau}}(\mathcal{Z})$ is a matrix which has Hankel structure by construction, obtained by Hankelization of some input tensor $\mathcal{Z}$. Obviously, $\mathcal{Z}$ is the most compact representation of the Hankel matrix $\mathcal{H}_{\boldsymbol{\tau}}(\mathcal{Z})$. The flow of Eq. (7) is equivalent to the MMES network shown in Fig. 3: $\mathcal{H}_{\boldsymbol{\tau}}$, $E_r$, $D_r$, and $\mathcal{H}_{\boldsymbol{\tau}}^\dagger$ correspond to forward embedding, encoding, decoding, and backward embedding, respectively, where the encoder and decoder are multi-layer perceptrons (i.e., repeated linear transformations and non-linear activations). With this formulation, Problem (1) becomes
$$\min_{\mathcal{Z}} \ \bigl\| \mathcal{Y} - \mathcal{A}\bigl( \mathcal{H}_{\boldsymbol{\tau}}^\dagger( g_r(\mathcal{H}_{\boldsymbol{\tau}}(\mathcal{Z})) ) \bigr) \bigr\|_F^2, \tag{8}$$

where $g_r$ is an auto-encoder which defines the manifold $\mathcal{M}_r$. In this study, the auto-encoder/manifold is learned from the observed tensor itself, so the optimization problem is given by

$$\min_{\mathcal{Z},\, g_r} \ \bigl\| \mathcal{Y} - \mathcal{A}\bigl( \mathcal{H}_{\boldsymbol{\tau}}^\dagger( g_r(\mathcal{H}_{\boldsymbol{\tau}}(\mathcal{Z})) ) \bigr) \bigr\|_F^2 + \lambda \, \bigl\| \mathcal{H}_{\boldsymbol{\tau}}(\mathcal{Z}) - g_r(\mathcal{H}_{\boldsymbol{\tau}}(\mathcal{Z})) \bigr\|_F^2, \tag{9}$$

where the first and second terms are referred to as the reconstruction loss and the auto-encoding loss, respectively, and $\lambda$ is a trade-off parameter balancing the two losses.
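A minimal 1-D inpainting sketch of the objective in Eq. (9), assuming our own helpers, an elementwise mask as the observation operator, and an identity auto-encoder as a stand-in for the learned $g_r$:

```python
import numpy as np

def hankelize(x, tau):
    return np.stack([x[t:t + tau] for t in range(len(x) - tau + 1)]).T

def hankelize_pinv(H):
    tau, T = H.shape
    out, cnt = np.zeros(T + tau - 1), np.zeros(T + tau - 1)
    for t in range(T):
        out[t:t + tau] += H[:, t]
        cnt[t:t + tau] += 1.0
    return out / cnt

def objective(y, mask, z, g, tau, lam):
    """Reconstruction loss + lambda * auto-encoding loss (1-D inpainting,
    where the observation operator A is elementwise masking)."""
    H = hankelize(z, tau)
    x = hankelize_pinv(g(H))                   # generated signal, Eq. (7)
    rec = np.sum((mask * (y - x)) ** 2)        # reconstruction loss
    ae = np.sum((H - g(H)) ** 2)               # auto-encoding loss
    return rec + lam * ae

y = np.sin(np.linspace(0, 3, 16))
mask = (np.arange(16) % 2 == 0).astype(float)  # half the samples observed
g_id = lambda H: H                             # placeholder "perfect" AE
assert objective(y, mask, y, g_id, tau=3, lam=0.1) < 1e-12
```

With the ground-truth signal as `z` and a perfect auto-encoder, both terms vanish; training drives `z` and `g` jointly toward this state from the observed data only.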
In this section, we discuss how to design the neural network architecture of the auto-encoder to restrict the manifold $\mathcal{M}_r$. The simplest way is to control the value of $r$, which directly restricts the dimensionality of the latent space. There are many other possibilities: Tikhonov regularization [9], drop-out [8], denoising auto-encoders [39], variational auto-encoders [6], adversarial auto-encoders [24], alpha-GAN [31], and so on. All of these have their own perspectives and promise, but their costs are not low. In this study, we select an attractive and fundamental one: the "denoising auto-encoder" (DAE) [39]. The DAE is attractive because it has a strong relationship with Tikhonov regularization [2] and decreases the entropy of the data [33].
Finally, we designed an auto-encoder by controlling the latent dimension $r$ and the standard deviation $\sigma$ of additive zero-mean Gaussian noise. Fig. 7 illustrates an example of the auto-encoder architecture used in this study; it consists of five hidden variables with leaky ReLU activations.
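The denoising criterion can be sketched as follows; the identity "auto-encoder" here is only a placeholder to show that noise injection penalizes pure memorization (the names and shapes are our own illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

def dae_loss(H, g, sigma):
    """Denoising auto-encoder loss: corrupt each column with zero-mean
    Gaussian noise, then ask g to reproduce the *clean* column."""
    noise = sigma * rng.standard_normal(H.shape)
    return np.mean((g(H + noise) - H) ** 2)

H = rng.standard_normal((9, 50))    # D x T matrix of patch vectors
g_id = lambda H: H                  # identity "auto-encoder"

# With sigma=0 the identity is perfect; with noise it is penalized, which is
# what pushes a trained g toward a smoothing projector onto the manifold.
assert dae_loss(H, g_id, sigma=0.0) == 0.0
assert dae_loss(H, g_id, sigma=0.5) > 0.0
```

A trained DAE therefore cannot simply copy its input: to reduce this loss it must map noisy patches back toward a lower-dimensional, smooth patch manifold.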
Optimization problem (9) consists of two terms: a reconstruction loss and an auto-encoding loss. The hyperparameter $\lambda$ balances the two losses. Basically, $\lambda$ should be large, because the auto-encoding loss should reach zero. However, a very large $\lambda$ prevents minimizing the reconstruction loss and may lead to local optima. Therefore, we adjust the value of $\lambda$ during the optimization process. Algorithm 1 shows the optimization algorithm used in this study. The adaptation rule for $\lambda$ is just one example, and it can be modified appropriately for the data. By exploiting the convolutional structure of $\mathcal{H}_{\boldsymbol{\tau}}$ and $\mathcal{H}_{\boldsymbol{\tau}}^\dagger$ (see Section II-A1), the calculation flow of Eqs. (7) and (9) can easily be implemented in neural network libraries such as TensorFlow. We employed the Adam [18] optimizer for the updates. The trade-off parameter $\lambda$ is adjusted so that the reconstruction loss keeps decreasing while no large gap opens between the two losses.
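The adaptation below is a hypothetical rule in the spirit of Algorithm 1 (the text notes its rule is just one example); the function name and the factors 1.1 and 0.9 are our assumptions, not the paper's values:

```python
def adapt_lambda(lam, rec_loss, ae_loss, up=1.1, down=0.9):
    """One possible trade-off adaptation: grow lambda while the
    auto-encoding loss dominates, shrink it when the reconstruction
    loss dominates, keeping the two losses close to each other."""
    return lam * up if ae_loss > rec_loss else lam * down

# Toy trace: the AE loss dominates twice, then the reconstruction loss does.
lam = 1.0
for rec, ae in [(1.0, 2.0), (1.0, 2.0), (2.0, 1.0)]:
    lam = adapt_lambda(lam, rec, ae)
assert abs(lam - 1.0 * 1.1 * 1.1 * 0.9) < 1e-12
```

Any rule with this push-pull behavior yields the loss-intersection dynamics observed in Fig. 11.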
For multi-channel or color image recovery, we use a special generator network setting, because the spatial patterns of the individual channels are similar and the patch manifold can therefore be shared. Fig. 8 illustrates the auto-encoder-shared version of MMES for color image recovery. In this case, the network takes three input channels, and each channel is embedded independently. The three Hankel matrices are then concatenated and auto-encoded simultaneously. The three inverted images are stacked as a color image (a third-order tensor) and finally color-transformed. The last color transform can be implemented by a convolution layer with kernel size (1,1), and its parameters are also optimized. Note that the three input channels do not need to correspond to RGB; they are optimized as some compact color representation.
In this section, we apply the proposed method to a toy example of signal recovery. Fig. 9 shows the result of this toy experiment. A one-dimensional time-series signal generated from the Lorenz system was corrupted by additive Gaussian noise, random missing values, and three block occlusions. The corrupted signal was recovered by subspace modeling [44] and by the proposed manifold modeling in embedded space. Manifold modeling captured the structure of the Lorenz attractor better than subspace modeling.

Fig. 10 visualizes a two-dimensional patch manifold learned by the proposed method from a 50%-missing gray-scale image of 'Lena'. Similar patches are located near each other, and a smooth change of patterns can be observed. This implies a relationship with non-local similarity based methods [3, 5, 11, 48]: the manifold modeling (i.e., the DAE) plays a kind of "patch-grouping" role in the proposed method. The difference is that the manifold modeling is "global", rather than "non-local", which finds patches similar to the target patch only within its neighborhood.
For this experiment, we recovered a 50%-missing gray-scale image of 'Lena'. We stopped the optimization algorithm after 20,000 iterations. The learning rate was set to 0.01 and decayed by a factor of 0.98 every 100 iterations. $\lambda$ was adapted by Algorithm 1 every 10 iterations. Fig. 11 shows the optimization behavior of the reconstructed image, the reconstruction loss, the auto-encoding loss, and the trade-off coefficient $\lambda$. With the trade-off adjustment, the reconstruction loss and the auto-encoding loss intersected at around 1,500 iterations, and both losses decreased jointly after the intersection point.
We evaluate the sensitivity of MMES to three hyper-parameters: the patch size $\boldsymbol{\tau}$, the latent dimension $r$, and the noise level $\sigma$. First, we fixed the patch size and varied the dimension $r$ and the noise standard deviation $\sigma$. Fig. 13 shows the reconstruction results of a 99%-missing image of 'Lena' by the proposed method under different settings of $(r, \sigma)$. A very low dimension produced blurred results, while a very high dimension produced results with many peaks. Furthermore, an appropriate noise level produced sharp and clean results. For reference, Fig. 14 shows the difference between DIP optimized with and without noise. Both results confirm the effect of learning with noise.

Next, we fixed the noise level and varied the patch size for several values of $r$. Fig. 13 shows the results with various patch-size settings for recovering a 99%-missing image. Patch sizes of (8,8) or (10,10) were appropriate in this case. The patch size is very important, because it governs the variety of patch patterns. If the patch size is too large, the patch variations expand and the structure of the patch manifold becomes complicated. By contrast, if the patch size is too small, the information obtained from the embedded matrix is limited, and reconstruction becomes difficult in highly-missing cases. The same problem may occur in all patch-based image reconstruction methods [3, 5, 11, 48]. Good patch sizes differ across images and across types/levels of corruption, and estimating a good patch size remains an open problem. A multi-scale approach [43] may mitigate part of this issue, but the patch size is still fixed or tuned as a hyper-parameter.
In this section, we compare the performance of the proposed method with several selected unsupervised image inpainting methods: low-rank tensor completion (HaLRTC) [22], parallel low-rank matrix factorization (TMac) [42], tubal nuclear norm regularization (tSVD) [49], Tucker decomposition with rank increment (Tucker inc.) [44], low-rank and total-variation (LRTV) regularization [45, 46] (MATLAB software from https://sites.google.com/site/yokotatsuya/home/software/lrtv_pds), smooth PARAFAC tensor completion (SPC) [47] (MATLAB software from https://sites.google.com/site/yokotatsuya/home/software/smooth-parafac-decomposition-for-tensor-completion), GSR [48] (each color channel recovered independently, using the MATLAB software from https://github.com/jianzhangcs/GSR), multiway delay embedding based Tucker modeling (MDT-Tucker) [44] (MATLAB software from https://sites.google.com/site/yokotatsuya/home/software/mdt-tucker-decomposition-for-tensor-completion), and DIP [37] (our own Python/TensorFlow implementation).
In these experiments, the hyper-parameters of all methods were tuned manually to achieve the best peak signal-to-noise ratio (PSNR) and structural similarity (SSIM), although the tuning may not be perfect. For DIP, we did not try all network structures with various kernel sizes, filter sizes, and depths; we simply employed the "default architecture", whose details are available in the supplemental material of [37] (https://dmitryulyanov.github.io/deep_image_prior), and took the best result at an appropriate intermediate iteration of the optimization, based on the PSNR value. For the proposed MMES method, we adaptively selected the patch size $\boldsymbol{\tau}$ and the dimension $r$. Table I shows the settings of $(\tau, r)$ for MMES. The noise level of the denoising auto-encoder was fixed across all images. For the auto-encoder, the same architecture shown in Fig. 7 was employed. The initial learning rate of the Adam optimizer was 0.01, decayed by a factor of 0.98 every 100 iterations. The optimization was stopped after 20,000 iterations for each image.
Fig. 16 shows the eight test images and the average PSNR and SSIM for various missing ratios {50%, 70%, 90%, 95%, 99%}. The proposed method is quite competitive with DIP. Fig. 15 illustrates the color image completion results: 99% of randomly selected voxels were removed from (256,256,3) tensors, and the tensors were recovered by DIP and the proposed MMES. The images reconstructed by DIP and MMES were very similar to each other and much better than the others.
missing rate | airplane | baboon | barbara | facade | house | lena | peppers | sailboat
---|---|---|---|---|---|---|---|---
50 % | (16,4) | (10,4) | (6,4) | (10,4) | (16,4) | (6,4) | (6,4) | (6,4)
70 % | (16,4) | (10,4) | (6,4) | (16,4) | (16,4) | (6,4) | (16,4) | (6,4)
90 % | (16,4) | (4,8) | (6,4) | (16,4) | (16,4) | (8,4) | (16,4) | (4,4)
95 % | (16,4) | (4,6) | (6,4) | (16,4) | (16,4) | (6,8) | (16,4) | (6,8)
99 % | (8,32) | (4,4) | (6,4) | (4,1) | (8,16) | (10,32) | (8,8) | (6,4)
In this section, we show the results of an MR-image/3D-tensor completion problem. The size of the MR image is (109,91,91). We randomly removed 50%, 70%, and 90% of the voxels of the original MR image and recovered the missing MR images with the proposed method and DIP. For DIP, we implemented a 3D version of the default architecture in TensorFlow, but slightly reduced the number of filters in the shallow layers because of GPU memory constraints. For the proposed method, the same auto-encoder architecture shown in Fig. 7 was employed.

Fig. 17 shows the PSNR behavior during reconstruction, together with the final PSNR/SSIM values, for this experiment. From the PSNR and SSIM values, the proposed MMES outperformed DIP in the low-missing-rate cases, and was quite competitive in the highly-missing cases. The degradation of DIP may be caused by an insufficient number of filters, since a 3D ConvNet requires many more filters than a 2D ConvNet. Moreover, the computational time required by MMES was significantly shorter than that of DIP in this tensor completion problem. Fig. 18 shows the reconstructed MR images in the 90%-missing case.
PSNR / SSIM | Bicubic | GSR | DIP | MMES (proposed) |
---|---|---|---|---|
Starfish (64 to 256) | 23.98 / .7124 | 25.73 / .7922 | 25.79 / .7930 | 26.18 / .8099 |
House (64 to 256) | 26.21 / .7839 | 28.05 / .8394 | 28.33 / .8420 | 28.79 / .8448 |
Leaves (64 to 256) | 19.10 / .6673 | 22.60 / .8511 | 22.54 / .8535 | 23.96 / .8935 |
Airplane (128 to 512) | 26.30 / .9176 | 27.74 / .9487 | 27.49 / .9375 | 28.40 / .9503 |
Airplane (64 to 512) | 22.93 / .7545 | 23.79 / .8061 | 23.83 / .8155 | 24.10 / .8207 |
Baboon (128 to 512) | 20.61 / .6904 | 20.93 / .7542 | 20.52 / .7260 | 20.92 / .7486 |
Baboon (64 to 512) | 19.38 / .4505 | 19.61 / .5039 | 19.64 / .5085 | 19.64 / .5024 |
Lena (128 to 512) | 28.64 / .9172 | 30.36 / .9481 | 29.91 / .9406 | 29.76 / .9406 |
Lena (64 to 512) | 25.23 / .7710 | 26.47 / .8271 | 26.71 / .8340 | 26.68 / .8327 |
Monarch (128 to 512) | 24.88 / .9322 | 27.67 / .9679 | 27.90 / .9576 | 28.81 / .9686 |
Monarch (64 to 512) | 20.65 / .7697 | 22.13 / .8393 | 22.65 / .8594 | 23.01 / .8627 |
Peppers (128 to 512) | 27.27 / .9392 | 29.19 / .9642 | 28.78 / .9578 | 28.85 / .9584 |
Peppers (64 to 512) | 24.15 / .8173 | 25.52 / .8753 | 26.07 / .8904 | 25.75 / .8794 |
Sailboat (128 to 512) | 24.38 / .8885 | 25.43 / .9262 | 25.13 / .9130 | 25.72 / .9273 |
Sailboat (64 to 512) | 21.22 / .6898 | 21.94 / .7463 | 22.32 / .7664 | 23.37 / .7705 |
Average | 23.66 / .7801 | 25.14 / .8393 | 25.19 / .8401 | 25.53 / .8474 |
In this section, we compare the performance of the proposed method with several selected unsupervised image super-resolution methods: bicubic interpolation, GSR [48] (each color channel recovered independently, using the MATLAB software from https://github.com/jianzhangcs/GSR, slightly modified for the super-resolution task), and DIP [37].
In these experiments, DIP was run with the best number of iterations from {1000, 2000, 3000, ..., 9000}. For four-times (x4) and eight-times (x8) up-scaling in MMES, the patch size, dimension, and noise level were set per scale. For all images in MMES, the auto-encoder architecture consists of three hidden layers. We assumed the same Lanczos2 kernel as the down-sampling system for all super-resolution methods.
Table II shows the PSNR and SSIM values of the results. We used three (256,256,3) color images and six (512,512,3) color images, and the super-resolution methods scaled them up from four- or eight-times down-scaled versions. According to this quantitative evaluation, bicubic interpolation was clearly worse than the others. Overall, GSR, DIP, and MMES were very competitive; in detail, DIP was slightly better than GSR, and the proposed MMES was slightly better than DIP.
Fig. 19 shows selected high-resolution images reconstructed by the four super-resolution methods. In general, the bicubic method produced blurred images that were visually worse than the others. The GSR results had smooth outlines in all images, but were slightly blurred. DIP reconstructed visually sharp images, but with jagged artifacts along diagonal lines. The proposed MMES reconstructed sharp and smooth outlines for all images. In 'Starfish', the high-resolution image reconstructed by MMES had sharper texture outlines than the others; in 'Leaves', it showed clearer leaf tips; in 'Monarch' and 'Airplane', the three methods were very competitive.
A beautiful manifold representation of complicated signals in embedded space was originally discovered in the study of dynamical system analysis (i.e., chaos analysis) of time-series signals [27]. Since then, many signal processing and computer vision applications have been studied, but most methods consider linear approximations because of the difficulty of non-linear modeling [38, 34, 21, 7, 25]. Nowadays, however, the study of non-linear/manifold modeling has progressed considerably with deep learning, and it was successfully applied in this study. Interestingly, we could apply this non-linear system analysis not only to time-series signals but also to natural color images and tensors (e.g., videos). To the best of our knowledge, this is the first study to apply Hankelization with an auto-encoder to general tensor data.
The interpretability of MMES is obviously higher than that of recent sophisticated deep learning models, and it keeps its relationship with ConvNets through the division of convolution into embedding and transformation. Thus, MMES helps us understand how ConvNets work. Moreover, our experiments gave an important indication of patch-manifold reconstruction (see Fig. 10) in ConvNets.
Furthermore, we pointed out the effect of "learning with noise" in DIP, and applied the denoising auto-encoder in the proposed method. In fact, learning with noise plays an essential role, as illustrated in Fig. 14 and Fig. 13. This indicates that learning with noise helps to reconstruct a smooth manifold, even when the capacity of the ConvNet structure is very high and the data is highly corrupted.
The main proposition of the DIP study [37] was that there are some image priors in the ConvNet structure itself; however, these priors could not be explicitly explained in words. In this study, we claim that one of the priors in the ConvNet structure which is exploited by DIP is a "smooth patch-manifold prior". This provides a deeper interpretation of DIP from the perspective of smooth patch-manifold reconstruction, and makes it easier to use DIP in more general applications such as tensors. Furthermore, it bridges the slightly different research areas of dynamical system analysis, deep learning, and tensor modeling.
A limitation of this study is that the proposed model does not incorporate the "multi-layered convolution" and "multi-resolution" structures which are implicitly incorporated in DIP through multiple up-sampling/down-sampling layers with skip connections. In other words, MMES is still developing and can be improved in many respects. The difficulty is how to keep the model interpretable; this is an open problem and is left for future work. Furthermore, we only considered image/tensor completion and super-resolution tasks in this study; other tasks such as denoising, compressed sensing, and image dehazing are also left for future work.
From the perspective of manifold modeling, we employed the denoising auto-encoder in this study. However, there are other manifold modeling methods, such as locally linear embedding [32], Laplacian eigenmaps [1], and t-distributed stochastic neighbor embedding [23]. Replacing the manifold modeling with such methods is a promising direction.
The MMES network architecture is basically designed for learning a self patch-manifold in image/tensor restoration problems. It is also possible to apply the MMES network to supervised learning, which might become one approach to interpretable deep learning.
This work was supported by JST ACT-I: Grant Number JPMJPR18UU, and THE HORI SCIENCES AND ARTS FOUNDATION.