OneNet
One Network to Solve Them All
While deep learning methods have achieved state-of-the-art performance in many challenging inverse problems like image inpainting and super-resolution, they invariably involve problem-specific training of the networks. Under this approach, different problems require different networks. In scenarios where we need to solve a wide variety of problems, e.g., on a mobile camera, it is inefficient and costly to use these specially-trained networks. On the other hand, traditional methods using signal priors can be used in all linear inverse problems but often have worse performance on challenging tasks. In this work, we provide a middle ground between the two kinds of methods: we propose a general framework to train a single deep neural network that solves arbitrary linear inverse problems. The proposed network acts as a proximal operator for an optimization algorithm and projects non-image signals onto the set of natural images defined by the decision boundary of a classifier. In our experiments, the proposed framework demonstrates superior performance over traditional methods using a wavelet sparsity prior and achieves performance comparable to that of specially-trained networks on tasks including compressive sensing and pixelwise inpainting.
At the heart of many image processing tasks is a linear inverse problem, where the goal is to reconstruct an image x from a set of measurements y of the form y = Ax + n, where A is the measurement operator and n is the noise. For example, in image inpainting, y is an image with masked regions and A is the linear operation applying a pixelwise mask to the original image x; in super-resolution, y is a low-resolution image and A downsamples high-resolution images; in compressive sensing, y denotes compressive measurements and A is the measurement matrix, e.g., a random Gaussian matrix. Linear inverse problems, like those described above, are often underdetermined, i.e., they involve fewer measurements than unknowns. Such underdetermined systems are extremely difficult to solve since the operator A has a nontrivial null space: there are an infinite number of feasible solutions, but only a few of them are natural images.
Solving linear inverse problems. There are two broad classes of methods for solving linear underdetermined problems. At one end, we have techniques that use signal priors to regularize the inverse problems. Signal priors enable identification of the true solution from the infinite set of feasible solutions by enforcing image-specific features on the solution. Thus, designing a signal prior plays a key role in solving linear inverse problems. Traditionally, signal priors are hand-designed based on empirical observations of images. For example, since natural images are usually sparse after wavelet transformation and are generally piecewise smooth, signal priors that constrain the sparsity of wavelet coefficients or spatial gradients are widely used [1, 2, 3, 4, 5]. Even though these signal priors can be used in any linear inverse problem related to images and usually have efficient solvers, they are often too generic, in that many non-image signals can also satisfy the constraints. As a result, these hand-designed signal priors cannot easily deal with challenging problems like image inpainting or super-resolution.
Instead of using a universal signal prior, a second class of methods learns a mapping from the linear measurement domain of y to the image space of x, with the help of large datasets and deep neural nets [6, 7, 8]. For example, to solve image super-resolution, low-resolution images are generated from a high-resolution image dataset, and the mapping between the corresponding image pairs is learned with a neural net [6]. Similarly, a network can be trained to solve compressive sensing problems [9, 10, 11] or image deblurring [12], etc. These methods have achieved state-of-the-art performance in many challenging problems.
Despite their superior performance, these specially-trained solvers are designed for specific problems and usually cannot solve other problems without retraining the mapping function, even when the problems are similar. For example, a super-resolution network trained for one upsampling factor cannot be easily readapted to solve super-resolution problems at other factors; a compressive sensing network for Gaussian random measurements is not applicable to subsampled Hadamard measurements. Training a new network for every single instance of an inverse problem is a wasteful proposition. In comparison, traditional methods using hand-designed signal priors can solve any linear inverse problem but have poorer performance on an individual problem. Clearly, a middle ground between these two classes of methods is needed.
One network to solve them all. In this paper, we ask the following question: if we have a large image dataset, can we learn from the dataset a signal prior that can deal with any linear inverse problem involving images? Such a signal prior can significantly lower the cost of incorporating inverse algorithms into consumer products, for example, in the form of specialized hardware designs. To answer this question, we observe that in optimization algorithms for solving linear inverse problems, signal priors usually appear in the form of proximal operators. Geometrically, the proximal operator projects the current estimate closer to the feasible set (natural images) constrained by the signal prior. Thus, we propose to learn the proximal operator with a deep projection model, which can be integrated into many standard optimization frameworks for solving arbitrary linear inverse problems.

Contributions. We make the following contributions.
We propose a general framework that implicitly learns a signal prior and a projection operator from large image datasets. When integrated into an alternating direction method of multipliers (ADMM) algorithm, the same proposed projection operator can solve challenging linear inverse problems related to images.
We identify sufficient conditions for the convergence of the nonconvex ADMM with the proposed projection operator, and we use these conditions as guidelines to design the proposed projection network.
We show that it is inefficient to solve generic linear inverse problems with state-of-the-art methods using specially-trained networks. Our experimental results also show that such networks are sensitive to changes in the linear operators and to noise in the linear measurements. In contrast, the proposed method is more robust to these factors.
Given noisy linear measurements y and the corresponding linear operator A, which is usually underdetermined, the goal of a linear inverse problem is to find a solution x such that y ≈ Ax and x is a signal of interest, in our case, an image. Based on their strategies for dealing with the underdetermined nature of the problem, algorithms for linear inverse problems can be roughly categorized into those using hand-designed signal priors and those learning from datasets. We briefly review some of these methods.
Hand-designed signal priors. Linear inverse problems are usually regularized by signal priors in a penalty form:

x̂ = argmin_x ½‖y − Ax‖² + λ φ(x),   (1)
where φ is the signal prior and λ is a nonnegative weighting term. Signal priors constraining the sparsity of x in some transformation domain have been widely used in the literature. For example, since images are usually sparse after wavelet transformation or after taking gradient operations, a signal prior can be formulated as φ(x) = ‖Wx‖₁, where W is an operator representing either a wavelet transformation, an image gradient, or another hand-designed linear operation that produces sparse features from images [13]. Using signal priors based on ℓ₁ norms enjoys two advantages. First, it forms a convex optimization problem and provides global optimality. The optimization problem can be solved efficiently with a variety of algorithms for convex optimization. Second, ℓ₁ priors enjoy many theoretical guarantees, thanks to results in compressive sensing [14]. For example, if the linear operator A satisfies conditions like the restricted isometry property and Wx is sufficiently sparse, the optimization problem (1) recovers the sparsest solution.
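As a concrete illustration, the proximal operator of the ℓ₁ prior reduces to entrywise soft-thresholding. A minimal sketch, assuming an orthonormal transform W (all names here are illustrative, not from the paper's code):

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1: shrinks each entry toward zero."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def prox_l1_analysis(v, W, tau):
    """Prox of tau * ||W x||_1 for an orthonormal W (W.T @ W = I):
    transform, shrink entrywise, transform back."""
    return W.T @ soft_threshold(W @ v, tau)

# Toy example with W = identity, i.e., a plain sparsity prior.
v = np.array([3.0, -0.5, 0.2, -2.0])
print(soft_threshold(v, 1.0))
```

Entries smaller than the threshold are zeroed out, which is exactly why such priors favor sparse solutions.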
Despite their algorithmic and theoretical benefits, hand-designed priors are often too generic to constrain the solution set of the inverse problem (1) to images alone: we can easily generate noise signals that have sparse wavelet coefficients or gradients.
Learning-based methods. The ever-growing number of images on the Internet enables state-of-the-art algorithms to deal with challenging problems that traditional methods are incapable of solving. For example, image inpainting and restoration can be performed by pasting image patches or transferring statistics of pixel values from similar images in a large dataset [15, 16]. Image denoising and super-resolution can be performed with dictionary learning methods that reconstruct image patches with sparse linear combinations of dictionary entries learned from datasets [17, 18]. Large datasets can also help learn end-to-end mappings from the linear measurement domain to the image domain. Given a linear operator A and a dataset {x_i}, the pairs {(Ax_i, x_i)} can be used to learn an inverse mapping f by minimizing the distance between f(Ax_i) and x_i, even when A is underdetermined. State-of-the-art methods usually parametrize the mapping functions with deep neural nets. For example, stacked autoencoders and convolutional neural nets have been used to solve compressive sensing and image deblurring problems [11, 10, 12, 9]. Recently, adversarial learning [19] has demonstrated its ability to solve many challenging image problems, such as image inpainting [8] and super-resolution [7, 20].
Despite their ability to solve challenging problems, solving linear inverse problems with end-to-end mappings has a major disadvantage: the number of mapping functions scales linearly with the number of problems. Since the datasets are generated based on specific operators A, these end-to-end mappings can only solve the corresponding problems. Even if the problems change slightly, the mapping functions (neural nets) need to be retrained. For example, a mapping trained for one super-resolution factor cannot be used directly to solve super-resolution at another factor with satisfactory performance; it is even more difficult to repurpose a mapping for image inpainting to solve image super-resolution. This specificity of end-to-end mappings makes it costly to incorporate them into consumer products that need to deal with a variety of image processing applications.
Deep generative models. Another thread of research learns generative models from image datasets. Suppose we have a dataset containing samples of a distribution p(x). We can estimate p(x) and sample from the model [21, 22, 23], or directly generate new samples from p(x) without explicitly estimating the distribution [19, 24]. Dave et al. [25] use a spatial long short-term memory network to learn the distribution p(x); to solve linear inverse problems, they perform maximum a posteriori estimation, maximizing p(x) over x such that y = Ax. Nguyen et al. [26] use a discriminative network and denoising autoencoders to implicitly learn the joint distribution between the image and its label, and they generate new samples by sampling the joint distribution, i.e., the network, with an approximated Metropolis-adjusted Langevin algorithm. To solve image inpainting, they replace the values of known pixels in sampled images and repeat the sampling process. Similar to the proposed framework, these methods can be used to solve a wide variety of inverse problems. They use a probabilistic framework and thus can be considered orthogonal to the proposed framework, which is motivated by a geometric perspective.
Signal priors play an important role in regularizing underdetermined inverse problems. As mentioned in the introduction, traditional priors constraining the sparsity of signals in gradient or wavelet bases are often too generic, in that we can easily create non-image signals satisfying these priors. Instead of using traditional signal priors, we propose to learn a prior from a large image dataset. Since the prior is learned directly from the dataset, it is tailored to the statistics of images in the dataset and, in principle, provides stronger regularization for the inverse problem. In addition, similar to traditional signal priors, the learned signal prior can be used to solve any linear inverse problem pertaining to images.
The proposed framework is motivated by the alternating direction method of multipliers (ADMM) [27], an optimization technique that is widely used to solve linear inverse problems as defined in (1). A typical first step in ADMM is to separate a complicated objective into several simpler ones by variable splitting, i.e., introducing an additional variable z that is constrained to be equal to x. This gives us the following optimization problem:
min_{x,z} ½‖y − Az‖² + λ φ(x)   (2)
subject to x = z,   (3)
which is equivalent to the original problem (1). The scaled form of the augmented Lagrangian of (2)-(3) can be written as

L_ρ(x, z, u) = ½‖y − Az‖² + λ φ(x) + (ρ/2)‖x − z + u‖² − (ρ/2)‖u‖²,   (4)
where ρ is the penalty parameter of the constraint x = z, and u is the dual variable divided by ρ. By alternately optimizing over x, z, and u, ADMM is composed of the following procedures:
x^(k+1) = argmin_x λ φ(x) + (ρ/2)‖x − (z^(k) − u^(k))‖²   (5)
z^(k+1) = argmin_z ½‖y − Az‖² + (ρ/2)‖x^(k+1) − z + u^(k)‖²   (6)
u^(k+1) = u^(k) + x^(k+1) − z^(k+1)   (7)
The update of (6) is a least-squares problem and can be solved efficiently via algorithms like the conjugate gradient method. The update of (5) is the proximal operator of the signal prior φ with penalty ρ, denoted as prox_{λφ/ρ}(v), where v = z^(k) − u^(k). When the signal prior uses the ℓ₁ norm, the proximal operator is simply soft-thresholding on v. Notice that the ADMM algorithm separates the signal prior φ from the linear operator A. This enables us to learn a signal prior that can be used with any linear operator.
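As a concrete sketch, the updates (5)-(7) can be implemented for the classical ℓ₁ sparsity prior (taking W = I for brevity); the parameter values and names below are illustrative, not taken from the paper:

```python
import numpy as np

def admm_l1(y, A, lam=0.1, rho=1.0, iters=100):
    """ADMM for min 0.5*||y - A z||^2 + lam*||x||_1  s.t. x = z,
    following the x/z/u updates (5)-(7). A plain sparsity prior
    stands in here for the learned proximal operator."""
    n = A.shape[1]
    x = np.zeros(n); z = np.zeros(n); u = np.zeros(n)
    # Precompute the z-update system: (A^T A + rho I) z = A^T y + rho (x + u)
    G = A.T @ A + rho * np.eye(n)
    Aty = A.T @ y
    for _ in range(iters):
        v = z - u                                                # prox input
        x = np.sign(v) * np.maximum(np.abs(v) - lam / rho, 0.0)  # (5)
        z = np.linalg.solve(G, Aty + rho * (x + u))              # (6)
        u = u + x - z                                            # (7)
    return x

# Recover a sparse vector from underdetermined Gaussian measurements.
rng = np.random.default_rng(0)
n, m = 50, 30
x_true = np.zeros(n); x_true[[3, 17, 40]] = [1.5, -2.0, 1.0]
A = rng.standard_normal((m, n)) / np.sqrt(m)
x_hat = admm_l1(A @ x_true, A, lam=0.01, rho=1.0, iters=500)
print(np.linalg.norm(x_hat - x_true))  # small reconstruction error
```

Note how the prior enters only through the x-update; the z-update depends only on A, which is what lets a single learned prior serve arbitrary linear operators.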
Since the signal prior only appears in the form of a proximal operator in ADMM, instead of explicitly learning a signal prior and solving the proximal operator in each step of ADMM, we propose to directly learn the proximal operator.
Let X represent the set of all natural images. The best signal prior is the indicator function of X, denoted as ι_X, and its corresponding proximal operator is, from the geometric perspective, a projection operator P that projects v onto X, or equivalently, finds the x̂ in X such that ‖x̂ − v‖ is minimized. However, we do not have the oracle indicator function ι_X in practice, so we cannot evaluate it to solve the projection operation. Instead, we propose to train a classifier D, with a large dataset, whose classification cost function f approximates ι_X. Based on the learned classifier, we can learn a projection function P that maps a signal v to the set defined by the classifier. The learned projection function can then replace the proximal operator (5), and we simply update x via
x^(k+1) = P(z^(k) − u^(k)).   (8)
An illustration of the idea is shown in Figure 2.
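The resulting procedure swaps the hand-designed proximal step for a learned projector. A minimal sketch, where a simple box projection stands in for the trained network P (the projector, ρ value, sizes, and initialization below are illustrative):

```python
import numpy as np

def admm_projection(y, A, project, rho=0.3, iters=200):
    """ADMM where the x-update (8) is a projection P instead of a
    hand-designed proximal operator. `project` is any callable mapping
    a signal toward (an approximation of) the natural-image set."""
    z = np.linalg.pinv(A) @ y        # initialize z with a pseudo-inverse solution
    u = np.zeros_like(z)             # scaled dual variable starts at zero
    G = A.T @ A + rho * np.eye(A.shape[1])
    Aty = A.T @ y
    for _ in range(iters):
        x = project(z - u)                           # x-update (8)
        z = np.linalg.solve(G, Aty + rho * (x + u))  # z-update (6)
        u = u + x - z                                # dual update (7)
    return x

# Stand-in projector: clip to [0, 1], mimicking projection onto a feasible
# set. A trained projection network would be used here instead.
box_project = lambda v: np.clip(v, 0.0, 1.0)

rng = np.random.default_rng(1)
x_true = rng.uniform(0.2, 0.8, size=16)            # "image" with entries in [0, 1]
A = rng.standard_normal((12, 16)) / np.sqrt(12)    # underdetermined operator
x_hat = admm_projection(A @ x_true, A, box_project)
```

Because `project` is just a callable, the same loop runs unchanged for inpainting masks, downsampling operators, or Gaussian measurement matrices.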
There are some caveats to this approach. First, when the classification cost function of the classifier is nonconvex, the overall optimization becomes nonconvex. For general nonconvex optimization problems, convergence is not guaranteed. Based on the theorems for the convergence of nonconvex ADMM [28], we provide the following theorem for the proposed ADMM framework.
Assume the function P solves the proximal operator (5). If the gradient of f is Lipschitz continuous and ρ is sufficiently large, the ADMM algorithm is guaranteed to attain a stationary point.
The proof follows directly from [28], and we omit the details here. Although Theorem 1 only guarantees convergence to stationary points rather than to the optimal solution, as with other nonconvex formulations, it ensures that the algorithm will not diverge after several iterations. Second, we initialize the scaled dual variable u with zeros and z with the pseudo-inverse solution of the least-squares term. Since we initialize u = 0, the first input to the proximal operator, v = z − u, resembles an image. Thus, even though it is in general difficult to fit a projection function from an arbitrary signal to the natural image set, we expect that the projection function only needs to deal with inputs that are close to images, and we train the projection function with slightly perturbed images from the dataset. Third, techniques like denoising autoencoders learn projection-like operators and, in principle, can be used in place of a proximal operator; however, our empirical findings suggest that ignoring the projection cost and simply minimizing the reconstruction loss ‖x − P(v)‖², where v is a perturbed version of x, leads to instability in the ADMM iterations.
We use two deep neural nets as the classifier D and the projection operator P, respectively. Based on Theorem 1, we require the gradient of f to be Lipschitz continuous. Since we choose the cross-entropy loss, we have f(x) = −log D(x), and in order to satisfy Theorem 1, we need D to be differentiable. Thus, we use the smooth exponential linear unit [29] as the activation function, instead of rectified linear units. To bound the gradients of D w.r.t. x, we truncate the weights of the network after each iteration. We show an overview of the proposed method in Figure 3 and leave the details to Appendix A. The projector P shares the same architecture as a typical convolutional autoencoder, and the classifier D is a residual net [30]. One way to train the classifier is to feed it natural images from a dataset and their perturbed counterparts. Nevertheless, we expect the projected images produced by the projector to be closer to the dataset (natural images) than those perturbed images. Therefore, we jointly train the two networks using adversarial learning: the projector P is trained to minimize (5), that is, to confuse the classifier by projecting v to the natural image set defined by the decision boundary of D. When the projector improves and generates outputs that are within or closer to the boundary, the classifier can be updated to tighten its decision boundary. Although we start from a different perspective than [19], the above joint training procedure can also be understood as a two-player game in adversarial learning, where the projector tries to confuse the classifier.
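The weight truncation mentioned above can be implemented as a simple clipping step after each parameter update. A minimal sketch; the clipping threshold is an illustrative value, not one taken from the paper:

```python
import numpy as np

def clip_weights(params, c=0.05):
    """Truncate every weight matrix into [-c, c] after a training step,
    a simple way to keep the classifier's gradients w.r.t. its input
    bounded (and hence Lipschitz-friendly)."""
    return [np.clip(W, -c, c) for W in params]

# After an SGD step, truncate all weight matrices of the network.
rng = np.random.default_rng(0)
params = [rng.standard_normal((4, 4)), rng.standard_normal((4, 1))]
params = clip_weights(params, c=0.05)
```

In a deep learning framework, the same effect is obtained by clamping each parameter tensor in place after the optimizer step.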
Specifically, we optimize the projection network with the following objective function:
min_P  λ₁ ‖x − P(v)‖² + λ₂ ‖ℓ(x) − ℓ(v)‖²   (9)
       + λ₃ ‖v − P(v)‖² − λ₄ (log D(P(v)) + log D_ℓ(ℓ(v))),   (10)
where v denotes an image perturbed from x by the procedure described below, and the first two terms in (9) are similar to a (denoising) autoencoder and are added to help the training procedure. The remaining terms in (10) form the projection loss we need in (5). We use two classifiers, D and D_ℓ, for the output (image) space and the latent space of the projector (ℓ in Figure 3), respectively. The latent classifier D_ℓ is added to further help the training procedure [31]. We find that adding D_ℓ also helps the projector avoid overfitting. In all of our experiments, the weights λ₁ through λ₄ are fixed. The diagram of the training objective is shown in Figure 3.
We briefly discuss the architecture of the networks; more details are given in Appendix A. As discussed above, to improve the convergence of ADMM, we use exponential linear units and truncate the weights of D and D_ℓ after each training iteration. Following the guidelines in [24, 32], we use a convolutional neural net that uses strides to downsample/upsample the images, and we use virtual batch normalization. However, we do not truncate the outputs of P with a tanh or a sigmoid function, in order to authentically compute the projection loss. We find that using linear outputs helps the convergence of the ADMM procedure. We also use residual nets with six residual blocks as D and D_ℓ instead of typical convolutional neural nets. We found that the stronger gradients provided by the shortcuts usually help speed up the training process. Besides, we add a channel-wise fully connected layer followed by a convolution to enable the projector to learn the context in the image, as in [8]. The classifiers D and D_ℓ are trained to minimize the negative cross-entropy loss. We use early stopping in order to alleviate overfitting. The complete architecture information is shown in the supplemental material.

Image perturbation. While adding Gaussian noise may be the simplest method to perturb an image, we found that the projection network easily overfits to the Gaussian noise and becomes a dedicated Gaussian denoiser. Since the inputs to the projection network during the ADMM process do not usually follow a Gaussian distribution, the overfitted projection network may fail to project the general signals produced by the ADMM process. To avoid overfitting, we generate perturbed images with two methods: adding Gaussian noise with spatially varying standard deviations, and smoothing the input images. We generate the noise by multiplying randomly sampled standard Gaussian noise with a weight mask upsampled from a low-dimensional mask with the bicubic algorithm; the weight mask is randomly sampled from a uniform distribution whose range is chosen relative to the image intensity range. To smooth the input images, we first downsample the input images and then use the nearest-neighbor method to upsample the results; the downsampling ratio is sampled uniformly. After smoothing the images, we add the noise described above. We only use the smoothed images on the ImageNet and MS-Celeb-1M datasets.
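The perturbation scheme can be sketched as follows. To keep the sketch dependency-free, the mask is upsampled with nearest-neighbor interpolation instead of the bicubic algorithm, and all sizes and noise ranges are illustrative:

```python
import numpy as np

def upsample_nearest(img, factor):
    """Nearest-neighbor upsampling by an integer factor."""
    return np.repeat(np.repeat(img, factor, axis=0), factor, axis=1)

def perturb(img, rng, mask_size=4, smooth_factor=2):
    """Perturb a square image: smooth it by box-downsampling followed by
    nearest-neighbor upsampling, then add Gaussian noise whose standard
    deviation varies spatially via an upsampled low-resolution mask."""
    h, w = img.shape
    # Smooth: box-downsample then nearest-neighbor upsample.
    small = img.reshape(h // smooth_factor, smooth_factor,
                        w // smooth_factor, smooth_factor).mean(axis=(1, 3))
    smooth = upsample_nearest(small, smooth_factor)
    # Spatially varying noise: low-res mask of std-devs, then upsample.
    mask = rng.uniform(0.0, 0.3, size=(mask_size, mask_size))
    sigma = upsample_nearest(mask, h // mask_size)
    return smooth + sigma * rng.standard_normal((h, w))

rng = np.random.default_rng(0)
img = rng.uniform(0.0, 1.0, size=(16, 16))
v = perturb(img, rng)
```

Because the noise level differs from region to region, the projector cannot collapse into a fixed-variance Gaussian denoiser.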


The proposed framework, which is composed of a classifier network and a projection network, is very similar to works involving both adversarial learning and denoising autoencoders, e.g., the context encoder [8] and mode-regularized GANs [33]. Compared to the probabilistic perspective typically used in adversarial learning [19], which matches the distributions of the dataset and the generated images, the proposed framework is based on the geometric perspective and the ADMM framework. We approximate the oracle indicator function ι_X by the classifier cost f and its proximal operator by the projection network P. Our use of adversarial training is simply for learning a tighter decision boundary, based on the hypothesis that images generated by P should be closer in distance to X than the arbitrarily perturbed images.
The projection network can be thought of as a denoising autoencoder with a regularization term. Compared to a typical denoising autoencoder, which always maps a perturbed image v back to the original image x, the proposed P may project v to the closest point in X. In our empirical experience, the additional projection loss helps stabilize the ADMM process.
A recent work by Dave et al. can also be used to solve generic linear inverse problems. They use a spatial recurrent generative network to model the distribution of natural images and solve linear inverse problems by performing maximum a posteriori inference (maximizing p(x) given y). During the optimization, their method needs to compute the gradient of the network w.r.t. the input in each iteration, which can be computationally expensive when the network is very deep and complex. In contrast, the proposed method directly provides the solution to the x-update (8) and is thus computationally efficient.
The proposed method is also very similar to the denoising-based approximate message passing algorithm (D-AMP) [34], which has achieved state-of-the-art performance in solving compressive sensing problems. D-AMP is also motivated by a geometric perspective on linear inverse problems and solves the compressive sensing problem with a variant of the proximal gradient algorithm. However, instead of learning the proximal operator as the proposed method does, D-AMP uses existing Gaussian denoising algorithms and relies on an Onsager correction term to ensure the noise resembles Gaussian noise. Consequently, D-AMP can only deal with linear operators formed by random Gaussian matrices. In contrast, the proposed method directly learns a proximal operator that projects a signal onto the image domain and therefore makes fewer assumptions about the linear operator A.
Unlike traditional signal priors, whose weight λ can be adjusted at the time of solving the optimization problem (1), the prior weight of the proposed framework is fixed once the projection network is trained. While an ideal projection operator should not be affected by the value of the prior weight, it may sometimes be preferable to control the effect of the signal prior on the solution. In our experiments, we find that adjusting ρ sometimes has effects similar to adjusting λ.
The convergence analysis of ADMM in Theorem 1 is based on the assumption that the projection network provides the global optimum of (5). However, in practice this optimality is not guaranteed. While there are convergence analyses for inexact proximal operators, the general properties are too complicated to analyze for deep neural nets. In practice, we find that for problems like pixelwise inpainting, compressive sensing, super-resolution, and scattered inpainting, the proposed framework converges gracefully, as shown in Figure 3(a), but for more challenging problems like image inpainting with large blocks and super-resolution on the ImageNet dataset, we sometimes need to stop the ADMM procedure early.







We evaluate the proposed framework on three datasets:
MNIST: contains randomly deformed images of the MNIST dataset. The images are grayscale. We train the projector and the classifier networks on the training set and test the results on the test set. Since the dataset is relatively simple, we remove the upper three layers from both networks. We train the networks with mini-batches for a fixed number of iterations.
MS-Celeb-1M: contains a total of 8 million aligned and cropped face images of 100 thousand people from different viewing angles. We randomly select the images of 73,678 people as the training set and those of 25,923 people as the test set. We resize the images and train the network with mini-batches for a fixed number of iterations.
ImageNet: contains 1.2 million training images and 100 thousand test images collected from the Internet. We resize the images and train the network with mini-batches for a fixed number of iterations.
For each of the datasets, we perform the following tasks:
Compressive sensing: We use random Gaussian matrices of different compression ratios as the linear operator A. The images are vectorized and multiplied with the random Gaussian matrices to form y.

Pixelwise inpainting and denoising: We randomly drop pixel values (independent of channels) by filling in zeros and add Gaussian noise with different standard deviations.
Scattered inpainting: We randomly drop small blocks by filling in zeros. Each block covers a small fraction of the input width and height.
Blockwise inpainting: We fill the center region of the input images with zeros.
Super-resolution: We downsample the images to fractions of the original width and height using the box-averaging algorithm.
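The linear operators for these tasks can all be written as explicit matrices acting on a vectorized image. A small sketch with illustrative sizes, ratios, and names:

```python
import numpy as np

n = 16 * 16  # a vectorized 16x16 grayscale "image"
rng = np.random.default_rng(0)

# Compressive sensing: a random Gaussian measurement matrix.
m = n // 4
A_cs = rng.standard_normal((m, n)) / np.sqrt(m)

# Pixelwise inpainting: a diagonal 0/1 mask dropping random pixels.
keep = rng.random(n) > 0.5
A_inpaint = np.diag(keep.astype(float))

def box_downsample_matrix(h, w, f):
    """Matrix that box-averages each f x f block of an h x w image."""
    A = np.zeros((h // f * (w // f), h * w))
    for i in range(h // f):
        for j in range(w // f):
            for di in range(f):
                for dj in range(f):
                    A[i * (w // f) + j, (i * f + di) * w + (j * f + dj)] = 1.0 / f**2
    return A

A_sr = box_downsample_matrix(16, 16, 2)  # 2x downsampling (illustrative factor)
x = rng.uniform(size=n)
y_cs, y_inpaint, y_sr = A_cs @ x, A_inpaint @ x, A_sr @ x
```

Since each task is just a different A, the same ADMM loop with the same projection network handles all of them.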
Comparison to specially-trained networks. For each of the tasks, we train a specially-trained neural network using the context encoder [8] with adversarial training. For compressive sensing, we design the network based on the work of [9], which applies the transpose of the measurement matrix to the linear measurements and resizes the result to the image size in order to operate in image space. The measurement matrix is a random Gaussian matrix and is fixed. For pixelwise inpainting and denoising, we randomly drop a fixed portion of the pixels and add Gaussian noise with a fixed standard deviation for each training instance. For blockwise inpainting, we drop a block at a random location in each image. For super-resolution, we follow the work of Dong et al. [6], which first upsamples the low-resolution images to the target resolution using the bicubic algorithm, and we train a network for a single upsampling factor. Note that we do not train networks for the other super-resolution factor or for scattered inpainting, in order to demonstrate that specially-trained networks do not generalize well to similar tasks. Since the inputs to the super-resolution network are bicubic-upsampled images, we also apply the same upsampling to the lower-resolution images and feed them to the same network. We also feed the scattered-inpainting inputs to the blockwise inpainting network.
Comparison to hand-designed signal priors. We compare the proposed framework with the traditional signal prior using the ℓ₁ norm of wavelet coefficients. We tune the weight of the prior, λ, based on the dataset. For image denoising, we compare with the state-of-the-art algorithm BM3D [38]. We add Gaussian random noise with different standard deviations to the test images, which were taken by the author with a cell-phone camera. The noise standard deviation of each image is provided to BM3D. For the proposed method, we let A be the identity matrix and fix ρ. We use the same projection network learned from the ImageNet dataset and apply it to patches. As shown in Figure 9, when the noise standard deviation is large, the proposed method consistently outperforms BM3D. For image super-resolution, we compare with the work of Freedman and Fattal [39]. We perform super-resolution at two upsampling factors on images from the results website of [39]. The super-resolution results are shown in Figure 10.
















For each of the experiments, we use a fixed ρ unless mentioned otherwise. The results on the MNIST, MS-Celeb-1M, and ImageNet datasets are shown in Figure 5, Figure 6, and Figure 7, respectively. In addition, we apply the projection network trained on the ImageNet dataset to an image from the Internet [40]. To deal with the large image, when solving the projection operation (5), we apply the projection network on patches and stitch the results directly. The reconstruction outputs are shown in Figure 1, and their statistics for each iteration of ADMM are shown in Figure 3(a).
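The patch-wise application and direct stitching can be sketched as follows, using non-overlapping patches; the `project` callable stands in for the trained projection network and the sizes are illustrative:

```python
import numpy as np

def project_patchwise(img, project, patch=8):
    """Apply a projector patch-by-patch and stitch the results directly.
    Assumes the image dimensions are multiples of the patch size."""
    h, w = img.shape
    out = np.empty_like(img)
    for i in range(0, h, patch):
        for j in range(0, w, patch):
            out[i:i + patch, j:j + patch] = project(img[i:i + patch, j:j + patch])
    return out

# With an identity projector, stitching reproduces the input exactly.
rng = np.random.default_rng(0)
img = rng.uniform(size=(32, 32))
stitched = project_patchwise(img, lambda p: p)
```

Direct stitching keeps memory usage bounded by the patch size, at the cost of possible seams at patch boundaries.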
As can be seen from the results, using the proposed projection operator/network learned from datasets enables us to solve more challenging problems than the traditional wavelet sparsity prior can. In Figure 5 and Figure 6, while the traditional ℓ₁ prior on wavelet coefficients is able to reconstruct images from mildly compressive measurements, it fails to handle larger compression ratios. Similar observations can be made on pixelwise inpainting with different dropping probabilities and on scattered and blockwise inpainting. In contrast, since the proposed projection network is tailored to the datasets, it enables the ADMM algorithm to solve challenging problems like compressive sensing with very few measurements and blockwise inpainting on the MS-Celeb-1M dataset.
Robustness to changes in the linear operator and to noise. Even though the specially-trained networks are able to generate state-of-the-art results on the tasks they were designed for, they are unable to deal with similar problems, even under a slight change of the linear operator A. For example, as shown in Figure 6, the blockwise inpainting network is able to deal with large vacant regions; however, it overfits the problem and fails to fill in content for the smaller blocks in scattered inpainting problems. The super-resolution network also fails to reconstruct higher-resolution images at the factor it was not trained on, despite both inputs being upsampled with the bicubic algorithm beforehand. We extend this argument with a compressive sensing example. We start from the random Gaussian matrix used to train the compressive sensing network, and we progressively resample its elements from the same distribution to construct new matrices. As shown in Figure 8, once the portion of resampled elements increases, the specially-trained network fails to reconstruct the inputs, even though the new matrices are still Gaussian. The network also shows lower tolerance to Gaussian noise added to the clean linear measurements y. In comparison, the proposed projection network is robust to changes in the linear operator and to noise.
Convergence of ADMM. Theorem 1 provides a sufficient condition for the nonconvex ADMM to converge. As discussed in Section 3.3, based on Theorem 1, we use exponential linear units as the activation functions in D and D_ℓ and truncate their weights after each training iteration, in order for the gradient of f to be Lipschitz continuous. Even though Theorem 1 gives only a sufficient condition, in practice we observe an improvement in convergence. We conduct experiments on scattered inpainting on the ImageNet dataset using two projection networks: one trained with the smooth exponential linear units and the other trained with the non-smooth leaky rectified linear units. Note that leaky rectified linear units are not differentiable everywhere and thus violate the sufficient condition provided by Theorem 1. Figure 3(b) shows the root-mean-square error of x − z, which is a good indicator of the convergence of ADMM, for the two networks. As can be seen, using leaky rectified linear units results in a higher and spikier root-mean-square error of x − z than using exponential linear units, which indicates a less stable ADMM process. These results show that following Theorem 1 benefits the convergence of ADMM.
Failure cases. The proposed projection network can fail on very challenging problems like blockwise inpainting on the ImageNet dataset, which has much greater variety in image content than the other two datasets we test on. As shown in Figure 7, the proposed projection network tries to fill random edges into the missing regions. In these cases, the projection network fails to project its inputs onto the natural-image set, thereby violating our assumption in Theorem 1 and degrading the overall ADMM framework. Even though increasing the penalty parameter can improve convergence, it may produce low-quality, overly smoothed outputs.
In this paper, we propose a general framework to implicitly learn a signal prior, in the form of a projection operator, for solving generic linear inverse problems. The learned projection operator enjoys the high flexibility of deep neural nets and the wide applicability of traditional signal priors. With the ability to solve generic linear inverse problems like denoising, inpainting, super-resolution, and compressive sensing, the proposed framework resolves the scalability issue of specially-trained networks. This characteristic significantly lowers the cost of designing specialized hardware (e.g., ASICs) for image processing tasks. We thus envision the projection network being embedded into consumer devices like smartphones and autonomous vehicles to solve a variety of image processing problems.
We now describe the architecture of the networks used in the paper. We use the exponential linear unit (elu) [29] as the activation function. We also use virtual batch normalization [32], where the reference batch size is equal to the batch size used for stochastic gradient descent. We weight the reference batch with . We define some shorthands for the basic components used in the networks.
conv(w, c, s): a convolution with window size w, c output channels, and stride s.
dconv(w, c, s): a deconvolution (the transpose of the convolution operation) with window size w, c output channels, and stride s.
vbn: virtual batch normalization.
cfc: a channel-wise fully-connected layer, whose output dimension is the same as the input dimension.
fc(s): a fully-connected layer with output size s.
To simplify the notation, we use the subscript ve on a component to indicate that it is followed by vbn and elu.
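The shorthand above can be made concrete with small spec objects that track how each component transforms a (height, width, channels) shape. The 'same'-style padding convention (output spatial size = ceil(input / stride) for conv, input * stride for dconv) is an assumption for illustration; the classes are not trainable layers.

```python
from dataclasses import dataclass
from math import ceil

@dataclass
class Conv:            # conv(w, c, s)
    w: int; c: int; s: int
    def out_shape(self, h, wd, c):
        # 'same' padding assumed: spatial dims shrink by the stride.
        return ceil(h / self.s), ceil(wd / self.s), self.c

@dataclass
class Dconv:           # dconv(w, c, s): transpose of the convolution
    w: int; c: int; s: int
    def out_shape(self, h, wd, c):
        # Spatial dims grow by the stride.
        return h * self.s, wd * self.s, self.c

@dataclass
class Cfc:             # channel-wise fully connected: shape-preserving
    def out_shape(self, h, wd, c):
        return h, wd, c

def trace(layers, shape):
    """Apply each component's shape rule in sequence."""
    for layer in layers:
        shape = layer.out_shape(*shape)
    return shape

# Example: two stride-2 convolutions halve the spatial size twice.
print(trace([Conv(4, 64, 2), Conv(4, 128, 2)], (64, 64, 3)))  # (16, 16, 128)
```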
The projection network is composed of an encoder network and a decoder network, like a typical autoencoder. The encoder projects an input to a lower-dimensional latent space, and the decoder maps the latent representation back to the image space. The architecture of the encoder is as follows.
(11) 
The decoder is a symmetric counterpart of the encoder:
(12) 
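The symmetric encoder/decoder structure can be illustrated with a shape trace. The layer counts and channel widths below are assumptions for illustration only, not the paper's exact specification in equations (11) and (12); the sketch only demonstrates that the decoder's deconvolutions mirror the encoder's strided convolutions back to the input shape.

```python
from math import ceil

# 'same'-padding shape rules, assumed for illustration.
def conv_shape(h, w, c_out, stride):      # strided convolution
    return ceil(h / stride), ceil(w / stride), c_out

def dconv_shape(h, w, c_out, stride):     # transpose convolution
    return h * stride, w * stride, c_out

encoder = [(64, 2), (128, 2), (256, 2)]   # (channels, stride) per block (illustrative)
decoder = [(128, 2), (64, 2), (3, 2)]     # symmetric counterpart

shape = (64, 64, 3)
for c, s in encoder:
    shape = conv_shape(shape[0], shape[1], c, s)
latent = shape                             # latent volume after encoding
for c, s in decoder:
    shape = dconv_shape(shape[0], shape[1], c, s)
print(latent, "->", shape)                 # back to the input shape (64, 64, 3)
```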
As shown in the figure of the paper, we use two classifiers: one operates in the image space and discriminates natural images from the projection outputs; the other operates in the latent space, based on the hypothesis that, after being encoded by the encoder, a perturbed image and a natural image should already lie in the same set.
The latent-space classifier operates on the output of the encoder. Since its input dimension is smaller than that of the image-space classifier, we use fewer blocks than we did there.
(14) 
L. Xu, J. S. Ren, C. Liu, and J. Jia, "Deep convolutional neural network for image deconvolution," in NIPS, 2014.
R. Salakhutdinov and G. Hinton, "Deep Boltzmann machines," in AISTATS, 2009.
S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, "Distributed optimization and statistical learning via the alternating direction method of multipliers," Foundations and Trends® in Machine Learning, vol. 3, no. 1, pp. 1–122, 2011.
G. Loosli, S. Canu, and L. Bottou, "Training invariant support vector machines using selective sampling," in Large Scale Kernel Machines, L. Bottou, O. Chapelle, D. DeCoste, and J. Weston, Eds. MIT Press, 2007, pp. 301–320.
Y. Guo, L. Zhang, Y. Hu, X. He, and J. Gao, "MS-Celeb-1M: A dataset and benchmark for large-scale face recognition," in ECCV, 2016.
K. Dabov, A. Foi, V. Katkovnik, and K. Egiazarian, "BM3D image denoising with shape-adaptive principal component analysis," in Signal Processing with Adaptive Sparse Structured Representations, 2009.