1 Introduction
Image denoising is a well-studied problem in computer vision and image processing, where the task is to take noisy images and restore the original images. Thanks to the power of GPU computation, convolutional neural networks now perform very well at image denoising and recognition tasks [17, 7, 5]. In these models the noisy images are filtered by a deep convolutional encoder-decoder with convolution and deconvolution steps. All of these models address only the denoising part; when a part of the image is missing, they are not good at generating it, because convolutional neural networks are not well suited to sequence processing. To generate a missing part of an image, a model is needed that can understand and predict the upcoming portion of the image by reading the current portion. Such a model needs memory to capture the sequence, and the Long Short-Term Memory (LSTM) [14] is a well-performing recurrent neural network that is good at sequence learning. LSTMs can also solve general sequence-to-sequence problems [15], which has helped improve machine translation [2] and speech recognition [3, 4]. In natural language processing, sequence-to-sequence mapping with an LSTM encoder and decoder performs excellently, and it also performs well in machine vision, for example in image description tasks [8, 9].
Inspired by these works, we propose a CNN encoder and LSTM decoder technique with direct attention. In our model, a convolutional neural network reads an image and obtains a fixed-size vector representation of it. A multilayer LSTM then takes that vector and the corrupted image row by row, along with the output from the previous time step, and produces the desired image. The decoder has access to the raw image, so if the encoder fails to capture any information, the decoder can recover it from the raw distorted image. The main advantage of our model is that it can clean an image as well as reproduce lost information. We trained our model on the MNIST handwritten digit database [19]. For training, we blacked out the lower half of every image and then added noise on top, so the image became highly distorted; even for a human it is hard to tell which digit the image contained. Trained on these highly distorted images, our model was able to remove the noise and retrieve the lost information, whereas a convolutional encoder-decoder removed the noise but could not recover the lost information accurately. Since convolutional neural networks are good at cleaning images and LSTMs are good at sequence generation, we combine their strengths to remove noise from an image and generate its missing part.
2 Related Work
There has been a great deal of work on image denoising, and with the success of deep learning, many deep learning models have outperformed the older ones. For image denoising, deep convolutional encoder-decoders perform very well [17, 7, 5]. A sample convolutional encoder-decoder is shown in Figure 1.
These models use a convolutional neural network as the encoder, which encodes the image, and symmetric deconvolutional layers as the decoder, which reconstruct the clean image. On the denoising task these models have outperformed many older models [17]. No image preprocessing is required, as the whole pipeline is an end-to-end learning system. Using a convolutional network and an LSTM together for image denoising is uncommon; ConvNet-LSTM encoder-decoders are, however, well-established models for dense captioning and image description tasks [8, 9].
3 Background Knowledge
For sequence processing, the Recurrent Neural Network (RNN) [12, 1] is a generalization of the feed-forward neural network. An RNN produces an output sequence (y_1, y_2, ..., y_T) from an input sequence (x_1, x_2, ..., x_T) by iterating over equations 1 and 2:
h_t = φ(W_hx x_t + W_hh h_{t-1} + b_h) (1)
y_t = σ(W_yh h_t + b_y) (2)
Where,
x_t = input at time t
h_{t-1} = hidden state at time t-1
h_t = hidden state at time t
y_t = output at time t
W = weights
b = biases
φ = any activation function
σ = sigmoid activation function
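As a minimal sketch of one iteration of equations 1 and 2, the step below assumes tanh for the hidden activation φ; the weight names (W_hx, W_hh, W_yh) are illustrative, not taken from the paper.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_hx, W_hh, b_h, W_yh, b_y):
    """One RNN iteration: eq. (1) with phi = tanh, eq. (2) with sigma = sigmoid."""
    h_t = np.tanh(W_hx @ x_t + W_hh @ h_prev + b_h)    # eq. (1): new hidden state
    y_t = 1.0 / (1.0 + np.exp(-(W_yh @ h_t + b_y)))    # eq. (2): sigmoid output
    return h_t, y_t
```

Iterating `rnn_step` over a sequence, feeding each h_t back in as h_prev, yields the output sequence (y_1, ..., y_T).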
In figure 2 we can see the structure of RNN cells stacked together to process a sequence, where there is an output at every time step. Due to the vanishing gradient problem, it is hard to train a recurrent neural network over long sequences with these settings [13, 18]. The Long Short-Term Memory is known to solve these long-term dependencies, as it has a memory state that can carry information longer with the help of its input gate, output gate and forget gate [14].
4 Model
4.1 Overview of the CNN-LSTM encoder-decoder with direct attention
Our model consists of an encoder and a decoder. The encoder takes the corrupted image and the decoder outputs the cleaned and reconstructed image. A simplified version of the whole system can be expressed through equations 3 and 4.
v = Encoder(x̃) (3)
ŷ = Decoder(v, x̃) (4)
where x̃ is the corrupted image, v is the thought vector and ŷ is the reconstructed image.
The intuition behind this model is that the encoder reads the corrupted image and creates a thought vector that is supposed to hold the full information of the image. The encoder has full freedom to encode anything that helps the model reduce the loss, so it can encode a clean version of the corrupted image in the vector if it needs to.
The decoder then takes that vector and tries to produce a clean image, with access to the corresponding row of the corrupted image. Overall, while the decoder reads an image it has information about the current row of that image, the thought vector, and the decoder's last output, so when some part of an image is missing it can reproduce that part from its memory; for example, if the lower half of a digit "2" is missing, the decoder can draw that part because it knows the shape of "2". Figure 3 shows the architecture of the whole model. The encoder part of the figure shows the convolutional neural network: a 32-filter convolutional layer, max-pooling, a 64-filter convolutional layer, max-pooling, and a fully connected layer. The decoder part shows the five-layer LSTM that produces the final image row by row.
4.2 Detail of the Architecture
4.2.1 Encoder
The encoder of the model consists of a convolutional neural network whose output is a fully connected layer. If the image is represented as x, then a filter k of size n × n convolves over the different channels of the image with a stride of size s. The weights of the filter k are shared spatially and differ for every channel of the feature map. Subsampling/pooling layers shrink the width and height of the feature map with a filter of size p and a stride of size s. The equation for the convolutional layer [6] is
y_j = φ( Σ_i x_i * k_ij + b_j ) (5)
where x_i is the i-th channel of the input, k_ij is the convolution kernel, b_j is the bias, y_j is the hidden layer and φ is the activation function.
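Equation 5 can be sketched for a single input channel and a single output map. This naive NumPy version assumes "valid" boundaries and, as in most deep learning frameworks, computes cross-correlation rather than a flipped-kernel convolution.

```python
import numpy as np

def conv2d_single(x, k, b=0.0, phi=np.tanh):
    """Eq. (5) for one channel/one map: slide the n x n kernel k over x,
    add the bias b, and apply the activation phi."""
    H, W = x.shape
    n = k.shape[0]
    out = np.empty((H - n + 1, W - n + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + n, j:j + n] * k) + b
    return phi(out)
```

Summing such maps over the input channels i, and repeating for every kernel j, gives the full convolutional layer of equation 5.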
The equation for the pooling layer is
y^i_{m,n} = max_{0 ≤ u,v < p} x^i_{m·s+u, n·s+v} (6)
where x^i is the value of the i-th feature map at position (m, n), u is the vertical index in the local neighborhood, v is the horizontal index in the local neighborhood, and y^i is the pooled and subsampled layer.
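A minimal sketch of the max-pooling of equation 6 for one feature map, with neighborhood size p and stride s:

```python
import numpy as np

def max_pool(x, p=2, s=2):
    """Eq. (6): the maximum over each p x p neighborhood,
    moved with stride s, shrinks the feature map."""
    H, W = x.shape
    out = np.empty((1 + (H - p) // s, 1 + (W - p) // s))
    for m in range(out.shape[0]):
        for n in range(out.shape[1]):
            out[m, n] = x[m * s:m * s + p, n * s:n * s + p].max()
    return out
```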
The fully connected layer is obtained through a dot product between the flattened final layer x and a weight matrix W, followed by the addition of a bias vector b. The output is then passed through an activation function φ:
y = φ(W x + b) (7)
The final fully connected layer is the thought vector, which is passed to the decoder to draw the final image.
4.2.2 Decoder
The decoder of our model consists of a multilayer LSTM. An LSTM cell consists of an input gate, an output gate, a forget gate and a new-memory cell. The input gate, output gate, forget gate and new-memory cell/current state are generated from the raw input, the previous hidden state, the thought vector from the encoder, and the previous output of the decoder.
The input gate decides how much of the new memory passes through, while the forget gate decides whether to keep the previous memory. The output gate decides the outflow of the current hidden state.
In our decoder the initial hidden state and memory state are zero. The bottom LSTM cells of the multilayer LSTM take the thought vector produced by the encoder, the current row of the corrupted image, and the output image row from the previous unit.
W denotes the weights and b the biases. Equations 8 to 13 for the LSTM cell of the decoder are stated below (biases are not shown in the equations):
i_t = σ(W_i [x̃_t, h_{t-1}, v, ŷ_{t-1}]) (8)
f_t = σ(W_f [x̃_t, h_{t-1}, v, ŷ_{t-1}]) (9)
o_t = σ(W_o [x̃_t, h_{t-1}, v, ŷ_{t-1}]) (10)
c̃_t = tanh(W_c [x̃_t, h_{t-1}, v, ŷ_{t-1}]) (11)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t (12)
h_t = o_t ⊙ tanh(c_t) (13)
Here, x̃_t is the current row of the corrupted image, h_{t-1} is the previous hidden state, v is the thought vector and ŷ_{t-1} is the last output of the decoder. The output for every row is derived through the equation below:
ŷ_t = σ(W_y h_t + b_y) (14)
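A sketch of one bottom-layer decoder step following equations 8 to 13, assuming all four gates read one concatenated input [x̃_t, h_{t-1}, v, ŷ_{t-1}] through a single stacked weight matrix W of shape (4·hidden, input); the stacking is an implementation convenience, not something the paper specifies.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def decoder_lstm_step(x_row, h_prev, c_prev, v, y_prev, W, b):
    """One decoder LSTM step: gates over the corrupted image row,
    previous hidden state, thought vector and previous output."""
    z = np.concatenate([x_row, h_prev, v, y_prev])
    a = W @ z + b
    H = h_prev.size
    i = sigmoid(a[0:H])            # input gate,    eq. (8)
    f = sigmoid(a[H:2 * H])        # forget gate,   eq. (9)
    o = sigmoid(a[2 * H:3 * H])    # output gate,   eq. (10)
    g = np.tanh(a[3 * H:4 * H])    # new memory,    eq. (11)
    c = f * c_prev + i * g         # memory update, eq. (12)
    h = o * np.tanh(c)             # hidden state,  eq. (13)
    return h, c
```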
4.2.3 Loss function
The loss function is the mean squared loss, given by
L = (1/N) Σ_{i=1}^{N} (ŷ_i − y_i)^2 (15)
where ŷ is the predicted output and y is the original output.
4.2.4 Activation Functions
σ(x) = 1 / (1 + e^{−x}) (16)
An activation function is applied to the final fully connected layer of the decoder. For the final output layer of the decoder the sigmoid activation function (equation 16) is used, so that the output stays close to the original image, since the images were normalized to pixel values between 0 and 1. Figure 4 shows different activation functions.
4.2.5 Regularizations
For regularization, an L2 penalty is used, shown in equation 17:
L_2 = λ Σ w^2 (17)
where w represents the weights and λ is a hyperparameter that determines how much of the L2 penalty is added to the final loss of the model.
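Equations 15 and 17 combine into the training objective, sketched below; the default value of the hyperparameter lam (λ) is illustrative, not from the paper.

```python
import numpy as np

def total_loss(y_pred, y_true, weights, lam=1e-4):
    """Mean squared loss (eq. 15) plus the lambda-weighted L2
    penalty of eq. 17 summed over all weight arrays."""
    mse = np.mean((y_pred - y_true) ** 2)
    l2 = sum(np.sum(w ** 2) for w in weights)
    return mse + lam * l2
```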
4.2.6 Optimization
The weights of the model were trained with both the Adadelta optimizer [20] and the Adam optimizer [21]. It was observed that Adam performed better than Adadelta.
5 Training Details
5.1 Dataset Preparation
For training, the MNIST handwritten digit dataset was used. We blacked out the lower half of each image and then added salt-and-pepper (white) noise on top, which made the image highly distorted. Figure 5 demonstrates the training data generation process. We divided the data into a training set containing 75% of the data and a test set containing the remaining 25%.
5.2 Configuration
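The corruption step above can be sketched as follows; the fraction of pixels flipped by the salt-and-pepper noise is an assumption, since the paper does not state one.

```python
import numpy as np

def corrupt(img, noise_frac=0.2, rng=None):
    """Blank the lower half of a normalized [0, 1] image, then set a
    random fraction of pixels to 0 or 1 (salt-and-pepper noise)."""
    if rng is None:
        rng = np.random.default_rng()
    out = img.copy()
    out[out.shape[0] // 2:, :] = 0.0             # black out the lower half
    mask = rng.random(out.shape) < noise_frac    # pixels to corrupt
    salt = rng.integers(0, 2, out.shape).astype(float)
    out[mask] = salt[mask]                       # flip to pure black or white
    return out
```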
For the proposed CNN-LSTM model, after trying different combinations, the selected encoder architecture is shown in Table 1.
Layer Name  Input  Filter / Weights  Output
Conv 1  1@28*28  32@5*5  32@28*28
MaxPool 1  32@28*28  2*2  32@14*14
Conv 2  32@14*14  64@5*5  64@14*14
MaxPool 2  64@14*14  2*2  64@7*7
Fully Connected  64*7*7  64*7*7*100  100
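The sizes in Table 1 can be verified with a small shape walk-through; it assumes "same" padding for the 5*5 convolutions (so width and height are preserved) and 2*2 max-pooling with stride 2.

```python
def conv_same(h, w, c_out):
    # 5x5 convolution with "same" padding keeps height and width
    return h, w, c_out

def pool2(h, w, c):
    # 2x2 max-pooling with stride 2 halves height and width
    return h // 2, w // 2, c

shape = (28, 28, 1)                    # MNIST input, 1@28x28
shape = conv_same(*shape[:2], 32)      # Conv 1    -> 32@28x28
shape = pool2(*shape)                  # MaxPool 1 -> 32@14x14
shape = conv_same(*shape[:2], 64)      # Conv 2    -> 64@14x14
shape = pool2(*shape)                  # MaxPool 2 -> 64@7x7
flat = shape[0] * shape[1] * shape[2]  # units feeding the fully connected layer
```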
The decoder was a five-layer LSTM. For comparison, we also trained a convolutional encoder-decoder (CNN-CNN) with 32- and 64-filter convolutional layers and symmetric deconvolutional layers. Figure 1 shows the architecture of this model.
5.3 Other Details
We wrote the code with the deep learning framework "TensorFlow", provided by Google [10]. The images were edited with the Python programming language. We ran both models on the same dataset for 500k iterations with a batch size of 100. Dropout [16] of 25% was used to deter the models from overfitting.
6 Results
The proposed CNN-LSTM model and the CNN-CNN model were trained on an Nvidia GTX 1080 GPU for 500k iterations with a batch size of 100. The time required for both models is shown in Table 2.
Model Name  Required Time

CNN-CNN  4 hours, 19 minutes, 52 seconds
CNN-LSTM  6 hours, 58 minutes, 24 seconds
The loss for both models is depicted in Figure 6. The graph shows the loss curves for the CNN-CNN and CNN-LSTM models. Both losses jump throughout the training period, which is expected from a stochastic training process. We fitted a logarithmic trend line to see the smoothed movement of the loss. The loss of the CNN-CNN model stopped improving after a certain number of iterations, whereas the CNN-LSTM loss kept decreasing over the iterations. From the graph it can be inferred that with more iterations the loss might go down further.
Our model was capable of removing noise from images as well as retrieving the lost parts of the images with minimal error. The difficult task was to draw the lost shape of the digits, and the CNN-LSTM performed outstandingly: in most cases it generated the lost shape with great accuracy, even where it was too difficult for a human eye to understand the content of the distorted image. Figure 7 shows the performance of the proposed model and its comparison with the CNN-CNN model. Interestingly, our CNN-LSTM model with direct attention produced clean, refined images: the edges of the produced images are not as rough as in the originals, i.e., our model produced smoothed images. This happened because of the mean squared loss, under which the model tries to predict pixel values that are not far from the original values. The stroke widths of the digits vary with no learnable pattern, so the model learned a smoothed shape that minimizes the loss across all stroke types.
The convolutional encoder-decoder performs very well at image denoising, but it was not as good at generating the missing parts. Our model therefore outperformed the convolutional encoder-decoder at generating the missing parts of images. Since convolutional neural networks are very good at denoising and LSTMs are very good at sequence generation, we used a CNN encoder with an LSTM decoder.
7 Conclusion
Image denoising is an active field of research in image processing. If some portion of an image is missing, almost all algorithms try to produce the missing part from the surrounding pixels. But if a major portion of an image is missing, the missing part cannot be produced by observing neighboring pixels alone; producing it requires knowledge about the object. If half of a digit is missing, an algorithm needs to understand what that digit looks like before it can produce the missing part consistently with the visible portion of the image; it must learn the generic shape of the digit to do this job. The CNN-CNN encoder-decoder has outperformed many models at image denoising, but it is not good at producing the missing part of an image because it has no memory. Among deep learning models, recurrent neural networks are good at sequence processing because they have a memory module. To combine the power of CNN and RNN models we proposed the CNN-LSTM encoder-decoder with direct attention, which is very good at denoising images as well as producing the lost parts of images. The model was trained on the MNIST handwritten digit dataset after massive distortion. The goal of our model was to learn to remove noise and to learn the shape of the digits, and it did so with minimal errors, which can be further reduced through longer training.
References
 [1] D. Rumelhart, G. E. Hinton, and R. J. Williams. "Learning representations by back-propagating errors." Nature, 323(6088):533–536, 1986.
 [2] D. Bahdanau, K. Cho, and Y. Bengio. "Neural machine translation by jointly learning to align and translate." arXiv preprint arXiv:1409.0473, 2014.
 [3] Dario Amodei, Rishita Anubhai, Eric Battenberg, Carl Case, Jared Casper, Bryan Catanzaro, et al. "Deep Speech 2: End-to-end speech recognition in English and Mandarin." arXiv preprint arXiv:1512.02595, 2015.
 [4] Hannun, A., Case, C., Casper, J., Catanzaro, B., Diamos, G., Elsen, E., Prenger, R., Satheesh, S., Sengupta, S., Coates, A., and Ng, A. Y. "Deep Speech: Scaling up end-to-end speech recognition." ArXiv e-prints, December 2014.
 [5] V. Jain and S. H. Seung. "Natural image denoising with convolutional networks." In Daphne Koller, Dale Schuurmans, Yoshua Bengio, and Leon Bottou, editors, Advances in Neural Information Processing Systems 21 (NIPS'08), 2008.
 [6] Jarrett, K., Kavukcuoglu, K., Ranzato, M., and LeCun, Y. (2009). "What is the best multi-stage architecture for object recognition?" In ICCV'09.
 [7] J. Xie, L. Xu, and E. Chen. "Image denoising and inpainting with deep neural networks." Advances in Neural Information Processing Systems, 26:1–8, 2012.
 [8] J. Johnson, A. Karpathy, and L. Fei-Fei. "DenseCap: Fully convolutional localization networks for dense captioning." arXiv preprint arXiv:1511.07571, 2015.
 [9] Karpathy, A. and Fei-Fei, L. "Deep visual-semantic alignments for generating image descriptions." CoRR, abs/1412.2306, 2014.
 [10] Martin Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, et al. "TensorFlow: Large-scale machine learning on heterogeneous distributed systems." arXiv preprint arXiv:1603.04467, 2016.
 [11] Nair, V. and Hinton, G. E. (2010). "Rectified linear units improve restricted Boltzmann machines." In Proc. 27th International Conference on Machine Learning.
 [12] P. Werbos. "Backpropagation through time: what it does and how to do it." Proceedings of the IEEE, 1990.
 [13] S. Hochreiter. “Untersuchungen zu dynamischen neuronalen netzen.” Master’s thesis, Institut fur Informatik, Technische Universitat, Munchen, 1991.
 [14] S. Hochreiter and J. Schmidhuber. "Long short-term memory." Neural Computation, 1997.
 [15] Sutskever, Ilya, Vinyals, Oriol, and Le, Quoc V. "Sequence to sequence learning with neural networks." In NIPS, pp. 3104–3112, 2014.
 [16] Srivastava, Nitish, Hinton, Geoffrey, Krizhevsky, Alex, Sutskever, Ilya, and Salakhutdinov, Ruslan. "Dropout: A simple way to prevent neural networks from overfitting." JMLR, 15:1929–1958, 2014.
 [17] Xiao-Jiao Mao, Chunhua Shen, and Yu-Bin Yang. "Image denoising using very deep fully convolutional encoder-decoder networks with symmetric skip connections." arXiv preprint arXiv:1603.09056, 2016.
 [18] Y. Bengio, P. Simard, and P. Frasconi. "Learning long-term dependencies with gradient descent is difficult." IEEE Transactions on Neural Networks, 5(2):157–166, 1994.
 [19] Y. LeCun. "The MNIST handwritten digit database." 1998.
 [20] Zeiler, M. D. (2012). "ADADELTA: An adaptive learning rate method." CoRR, abs/1212.5701.
 [21] Kingma, Diederik P. and Ba, Jimmy Lei. "Adam: A method for stochastic optimization." arXiv preprint arXiv:1412.6980, 2014.
 [22] Goodfellow, Ian J., Pouget-Abadie, Jean, Mirza, Mehdi, Xu, Bing, Warde-Farley, David, Ozair, Sherjil, Courville, Aaron C., and Bengio, Yoshua. "Generative adversarial nets." NIPS, 2014.
 [23] X. Huang, Y. Li, O. Poursaeed, J. Hopcroft, and S. Belongie. "Stacked generative adversarial networks." In CVPR, 2017.
 [24] T. Kim, M. Cha, H. Kim, J. Lee, and J. Kim. "Learning to discover cross-domain relations with generative adversarial networks." arXiv preprint arXiv:1703.05192, 2017.