1 Introduction
Recently, Zhou et al. (2018) proposed a method for computer-aided design (CAD) that can find efficient and innovative CAD models automatically. The method first describes the target product by a shape illustration image (SII), as in Figure 1; second, it compresses the SII to obtain a low-dimensional latent code z; third, it modifies z with a random search algorithm to obtain a new latent code z'; fourth, it reconstructs a new SII from z' and tests its performance with computer-aided engineering (CAE) software. A highly efficient SII can be found automatically by replacing z with z' and repeating the third and fourth steps. The image encoding and decoding techniques are the bottleneck of this CAD method: a high compression rate yields a low-dimensional z and thus a shorter optimization period, but current encoding techniques all sacrifice detail to achieve a high compression rate, which is unacceptable for SIIs whose details are highly correlated with their performance.
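The four-step search loop described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the authors' implementation: `encode`, `decode` and `evaluate` are hypothetical callables standing in for the SII compressor, the reconstructor and the CAE performance test.

```python
import random

def optimize_sii(encode, decode, evaluate, x0, n_iters=100, sigma=0.1):
    """Random-search optimization in the latent space of an SII codec.

    encode/decode/evaluate are placeholders for the compressor,
    the reconstructor and the CAE performance test.
    """
    z = encode(x0)                      # step 2: compress the SII
    best_score = evaluate(decode(z))    # step 4: reconstruct and test
    for _ in range(n_iters):
        # step 3: perturb the latent code at random
        z_new = [zi + random.gauss(0.0, sigma) for zi in z]
        score = evaluate(decode(z_new))
        if score > best_score:          # keep the better latent code
            z, best_score = z_new, score
    return decode(z), best_score
```

Because the loop only ever keeps improvements, the returned score never falls below that of the initial design.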
Because SIIs are usually generated by following predefined principles, they are similar to each other even when they contain sharp edges and/or many small shapes. It is therefore possible to learn these principles with deep-learning-based autoencoders and express them with low-dimensional features. Most autoencoders (Chen et al., 2016; Nalisnick and Smyth, 2016; Xu et al., 2019; Qi et al., 2014; Dong et al., 2018; Bojanowski et al., 2017; Sønderby et al., 2016; Wang et al., 2012; Creswell and Bharath, 2018; Kiasari et al., 2018; Wang et al., 2016) use the traditional structure in which images are encoded into low-dimensional features and then decoded into reconstructed images. When dealing with SIIs, traditional autoencoders fall short because they lack a mechanism to emphasize hard patterns. Hard patterns are the details that distinguish one SII from the others and are difficult for autoencoders to capture. Zhao and Li (2018) proposed to learn features from image pyramids generated by smoothing and downsampling operations. Although image pyramids can highlight details, those details are found by non-trainable operations that are not necessarily capable of identifying the hard patterns.
In this work, a framework named Residual-Recursion Autoencoder (RRAE) is proposed to encode SIIs into a low-dimensional latent code recursively. RRAE tries to reconstruct the original image T times. The input of the autoencoder has T channels, the first of which is the original image. The residual between the reconstructed image and the original image is filled into its reserved channel of the input. The updated input is then used to encode and reconstruct the original image again. At the T-th forward propagation of the autoencoder, the output of the encoder is kept as the latent code and the decoder output is the final reconstructed image. Through this residual-recursion mechanism, the hard patterns are detected by a trainable operator: the autoencoder itself. The hard patterns are highlighted in every channel of the input except the first. RRAE can wrap different autoencoders and increase their performance. In our experiments, the reconstruction loss is decreased by 86.47% for a convolutional autoencoder on high-resolution SIIs, by 10.77% for a variational autoencoder and by 8.06% for a conditional variational autoencoder on MNIST. Because high resolution means more hard patterns, autoencoders improve by large margins on high-resolution SIIs.
2 Methodology
Algorithm 1 shows the Residual-Recursion Autoencoder (RRAE). The autoencoder network A can be any structure that takes a tensor as input and outputs a reconstructed one. The loss function L consists of the loss functions required by A. For example, L can be ||x - x_hat||_2^2 + ||z||_1, where x is the input image, x_hat is the reconstructed image and z is the latent code; minimizing L will minimize the reconstruction error and impose sparsity on the latent code. There is no limitation on the optimization function as long as it can work with L and optimize the weights of A. The residual function r is used to obtain the residual; for example, r(x, x_hat) = x - x_hat. Supposing T = 3, Figure 2 shows an example.

3 Experiments
All experiments were run on a 1080 Ti GPU with the PyTorch (Paszke et al., 2019) framework. Unless stated otherwise, the optimization function is Adam (Kingma and Ba, 2014) with its default configuration, the learning rate is 1e-4, the weight decay is 1e-5, the residual function is r(x, x_hat) = x - x_hat, and the total number of epochs is 300.
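As a concrete illustration of the residual-recursion forward pass of Section 2, here is a minimal NumPy sketch. The tiny linear `ToyAutoencoder` is a hypothetical stand-in for any network A that maps a T-channel input tensor to a single-image reconstruction; it only exists to make the recursion runnable.

```python
import numpy as np

class ToyAutoencoder:
    """Hypothetical stand-in for the autoencoder A: a fixed random
    linear encode/decode pair, just to make the recursion runnable."""
    def __init__(self, n_channels, side, latent_dim, seed=0):
        rng = np.random.default_rng(seed)
        d = n_channels * side * side
        self.enc = rng.standard_normal((latent_dim, d)) / np.sqrt(d)
        self.dec = rng.standard_normal((side * side, latent_dim))

    def __call__(self, inp):
        z = self.enc @ inp.ravel()                      # encode T-channel input
        x_hat = (self.dec @ z).reshape(inp.shape[1:])   # decode one image
        return x_hat, z

def rrae_forward(autoencoder, x, T=3):
    """One RRAE forward pass: reconstruct x T times, filling each
    residual r = x - x_hat into the next reserved input channel."""
    inp = np.zeros((T,) + x.shape)
    inp[0] = x                           # channel 0: the original image
    for t in range(T):
        x_hat, z = autoencoder(inp)
        if t + 1 < T:
            inp[t + 1] = x - x_hat       # fill the next reserved channel
    return x_hat, z                      # final reconstruction and latent code
```

During training, the loss is applied to the final x_hat, so gradient descent drives the network to exploit the residual channels where the hard patterns accumulate.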
3.1 MNIST
MNIST (LeCun et al., 1998) is a handwritten digit dataset with 60000 images for training and 10000 for testing. Images of MNIST are similar to SIIs, except that their resolution is much lower. In this experiment, code from GitHub [1] has been modified to run RRAE on MNIST with a Variational Autoencoder (VAE) (Kingma and Welling, 2013) and a Conditional Variational Autoencoder (CVAE) (Sohn et al., 2015) as the autoencoder, respectively, whose networks consist of Linear and ReLU layers. The encoding and decoding parts are wrapped by RRAE, where the autoencoder tries to reconstruct the image T times and returns the reconstructed image x_hat, the latent code mean mu, the natural logarithm of the latent code variance log(sigma^2) and the latent code z of the last trial. Before every trial, the latest residual r is filled into the reserved input channel, where r(x, x_hat) = x - x_hat, which is also the default residual function in all experiments in this work [2]. According to the code, the loss function is BCE(x, x_hat) + KL(mu, log(sigma^2)), where BCE and KL are the Binary Cross Entropy (equation (1)) and the KL divergence (equation (2)); these and all other operations are pixel- or element-wise. The results are listed in Table 1, where the total number of epochs is 300, the learning rate is 0.001 without weight decay, the optimization function is Adam (Kingma and Ba, 2014), the first column D is the dimension of the latent code z, the second column T is the total number of trials, the MSE column is the best testing mean square error between the original and reconstructed images during training, and the DR column is the decrease rate of MSE with respect to the baseline (the T = 1 result). From Table 1, it is obvious that RRAE helps significantly in decreasing the reconstruction error without increasing the dimension of the latent code. Usually, a bigger T leads to better performance, but too many trials consume too much computation for little improvement. So, in the following experiments, the upper limit of T is 3.

(1)  BCE(x, x_hat) = -sum_i [ x_i log(x_hat_i) + (1 - x_i) log(1 - x_hat_i) ]

(2)  KL(mu, log(sigma^2)) = -(1/2) sum_j ( 1 + log(sigma_j^2) - mu_j^2 - sigma_j^2 )

[1] https://github.com/timbmg/VAE-CVAE-MNIST
[2] Please refer to the uploaded code in the VAE-CVAE folder for implementation details.
Table 1: Best testing MSE on MNIST for VAE and CVAE.

D | T | VAE MSE  | VAE DR/% | CVAE MSE | CVAE DR/%
--|---|----------|----------|----------|----------
2 | 1 | 0.039362 | 0        | 0.033084 | 0
2 | 2 | 0.037122 | 5.69     | 0.031937 | 3.47
2 | 3 | 0.036552 | 7.14     | 0.031546 | 4.65
2 | 4 | 0.037878 | 3.77     | 0.031456 | 4.92
5 | 1 | 0.025071 | 0        | 0.021242 | 0
5 | 2 | 0.024199 | 3.48     | 0.020755 | 2.29
5 | 3 | 0.023470 | 10.77    | 0.020561 | 3.21
5 | 4 | 0.023040 | 8.1      | 0.020649 | 2.79
10 | 1 | 0.015637 | 0       | 0.013968 | 0
10 | 2 | 0.014536 | 7.04    | 0.013009 | 6.87
10 | 3 | 0.014621 | 6.50    | 0.012842 | 8.06
10 | 4 | 0.014175 | 9.35    | 0.012852 | 7.99
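The two loss terms in equations (1) and (2) are element-wise and can be sketched in a few lines of NumPy (a minimal illustration, not the paper's PyTorch implementation):

```python
import numpy as np

def bce(x, x_hat, eps=1e-7):
    """Binary Cross Entropy summed over pixels (equation (1))."""
    x_hat = np.clip(x_hat, eps, 1.0 - eps)   # avoid log(0)
    return -np.sum(x * np.log(x_hat) + (1.0 - x) * np.log(1.0 - x_hat))

def kl(mu, log_var):
    """KL divergence between N(mu, sigma^2) and N(0, 1) (equation (2))."""
    return -0.5 * np.sum(1.0 + log_var - mu**2 - np.exp(log_var))
```

The training loss is then `bce(x, x_hat) + kl(mu, log_var)`; KL vanishes exactly when the latent distribution matches the standard normal prior.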
Convolutional autoencoders have been tested on MNIST and its high-resolution version [3]. The autoencoders are piled up with layers of 2D convolution, Group Normalization (Wu and He, 2018) and ReLU, without skip connections. The high-resolution MNIST is a 512x512 binary image dataset generated by bilinear interpolation, with 60000 images for training and 10000 for testing. The images are binarized with threshold 127.5. All images are mean-std normalized. The loss function for the original MNIST is L1. For the high-resolution version, the loss function is 1 - MS-SSIM(x, x_hat) (Mentzer et al., 2018), where MS-SSIM is the multi-scale structural similarity (Wang et al., 2003). Table 2 shows the best testing results, from which we can conclude that RRAE is much more efficient for high-resolution images than for low-resolution images.

[3] Please refer to the code in folder CNN for implementation details.

Table 2: Best testing results for convolutional autoencoders on MNIST (28x28, L1 loss) and high-resolution MNIST (512x512, MS-SSIM-based loss).

MNIST28:
D | T | L1      | DR/%
--|---|---------|------
1 | 1 | 0.07645 | 0
1 | 2 | 0.07895 | -3.27
2 | 1 | 0.06544 | 0
2 | 2 | 0.06358 | 2.84
2 | 3 | 0.06425 | 1.82

MNIST512:
D   | T | Loss   | DR/%
----|---|--------|------
50  | 1 | 2.851  | 0
50  | 2 | 1.800  | 36.86
50  | 3 | 1.534  | 46.19
100 | 1 | 1.399  | 0
100 | 2 | 0.4281 | 69.40
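The high-resolution dataset generation described above (bilinear upsampling to 512x512, binarization at 127.5, mean-std normalization) can be sketched as follows. The paper does not specify the interpolation kernel beyond "bilinear", so this sketch assumes a standard align-corners bilinear map:

```python
import numpy as np

def bilinear_upsample(img, size=512):
    """Bilinear upsampling of a 2-D grayscale image to size x size."""
    h, w = img.shape
    # source-image sample coordinates for each target pixel
    ys = np.linspace(0, h - 1, size)
    xs = np.linspace(0, w - 1, size)
    y0 = np.floor(ys).astype(int); y1 = np.minimum(y0 + 1, h - 1)
    x0 = np.floor(xs).astype(int); x1 = np.minimum(x0 + 1, w - 1)
    wy = (ys - y0)[:, None]
    wx = (xs - x0)[None, :]
    top = img[np.ix_(y0, x0)] * (1 - wx) + img[np.ix_(y0, x1)] * wx
    bot = img[np.ix_(y1, x0)] * (1 - wx) + img[np.ix_(y1, x1)] * wx
    return top * (1 - wy) + bot * wy

def make_highres(img, size=512, threshold=127.5):
    """Upsample, binarize at 127.5 and mean-std normalize, as in the text."""
    big = bilinear_upsample(img.astype(float), size)
    binary = (big > threshold).astype(float)
    return (binary - binary.mean()) / (binary.std() + 1e-8)
```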
3.2 SIIs of Savonius Rotors
An SII dataset has been constructed according to Zhou et al. (2018), consisting of cross-sectional images of Savonius rotors. Since the shape of a Savonius rotor is controlled by a parabolic curve specified by four enumerable variables, 26973 SIIs have been generated by enumerating the height, the length and the position of the lower-left point of the curve [4]. Figure 3 shows some random samples of the SII dataset. The resolution is 512x512. One out of every 7 images is selected randomly for testing; the others are kept for training.

[4] Please refer to the code /CNN/parabolarBlade.py for detailed implementation.
A convolutional autoencoder wrapped by RRAE is used to encode the cross-sectional images [5]. The autoencoder is piled up with layers of 2D convolution, batch normalization (Ioffe and Szegedy, 2015) and ReLU, without skip connections. The input cross-sectional image is mean-std normalized. The output of the last transposed convolution layer is denormalized to obtain the reconstructed image. The reconstructed image and the original cross-sectional image are used to calculate the residual image and the loss. Table 3 lists the best testing results of the cross-sectional autoencoding experiment. From the results, RRAE improves the reconstruction accuracy significantly. When the latent code dimension is low (e.g., 2 in this experiment), too many trials (e.g., T = 3) harm the performance of RRAE; this performance decrease can also be observed in Table 2. Figure 4 shows the results of each trial from one test run of the D = 5, T = 3 experiment. The reconstructed images are clamped to the range [0, 1]. To illustrate the residual images, a bias of 0.5 is added to keep their pixel values in [0, 1]. From Figure 4, it is obvious that the residual gets smaller after each trial. The final reconstructed image (t = 3) is very close to the original image, which has been encoded into just 5 float variables.

[5] Please refer to the code in folder CNN for implementation details.

Table 3: Best testing results on the Savonius-rotor SII dataset.

D | T | Loss    | DR/%
--|---|---------|------
2 | 1 | 5.069   | 0
2 | 2 | 3.665   | 27.70
2 | 3 | 4.910   | 3.14
3 | 1 | 0.7347  | 0
3 | 2 | 0.2983  | 59.40
3 | 3 | 0.2591  | 65.73
4 | 1 | 0.2821  | 0
4 | 2 | 0.05777 | 79.52
4 | 3 | 0.04142 | 85.32
5 | 1 | 0.2081  | 0
5 | 2 | 0.04240 | 79.63
5 | 3 | 0.02815 | 86.47
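The clamping and 0.5 bias used to render Figure 4 amount to two one-liners; the sketch below just makes the display convention explicit (the bias recenters a signed residual so that zero error maps to mid-gray):

```python
import numpy as np

def display_reconstruction(x_hat):
    """Clamp a reconstructed image to [0, 1] for display."""
    return np.clip(x_hat, 0.0, 1.0)

def display_residual(residual):
    """Shift a signed residual by +0.5 and clamp to keep pixels in [0, 1]."""
    return np.clip(residual + 0.5, 0.0, 1.0)
```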
The L1 loss of the final reconstructed image in Figure 4 is 0.0020. Figure 5 shows the results of different image compression algorithms on the same original image. The DCT method is the same as the one introduced in Zhou et al. (2018), where an image is encoded into latent codes by a 2D discrete cosine transformation and Zigzag reordering [6]. JPEG and JPEG 2000 are provided by MATLAB 2014a. The DCT method needs 119157 double variables to reconstruct the image with an L1 loss of 0.0020. JPEG reaches a loss of 0.0024 with a file size of 5177 bytes; JPEG 2000 reaches 0.0021 with 3333 bytes. Although the L1 losses are close to each other, the code lengths of these image compression methods are much longer than RRAE's. The residual images in Figure 5 show obvious noise in the reconstructed images. Compared with Figure 5, the final reconstructed image in Figure 4 has smoother and cleaner edges, which are important for describing CAD shapes.

[6] Please refer to the code in folder DCT for implementation details.
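The 2D DCT plus Zigzag reordering can be sketched in NumPy as below. This is an illustrative sketch, not the code of Zhou et al. (2018): it assumes an orthonormal DCT-II and the common JPEG-style Zigzag convention, which may differ in detail from the folder-DCT implementation.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II transform matrix of size n x n."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    m = np.cos(np.pi * k * (2 * i + 1) / (2 * n))
    m[0] *= 1 / np.sqrt(2)          # DC row scaling for orthonormality
    return m * np.sqrt(2 / n)

def zigzag_indices(n):
    """(row, col) pairs of an n x n grid in JPEG-style Zigzag order."""
    return sorted(((r, c) for r in range(n) for c in range(n)),
                  key=lambda rc: (rc[0] + rc[1],
                                  rc[0] if (rc[0] + rc[1]) % 2 else rc[1]))

def dct_encode(img, n_coeffs):
    """2-D DCT followed by Zigzag reordering; keep the first n_coeffs."""
    n = img.shape[0]
    m = dct_matrix(n)
    coeffs = m @ img @ m.T          # separable 2-D DCT-II
    flat = np.array([coeffs[r, c] for r, c in zigzag_indices(n)])
    return flat[:n_coeffs]
```

Truncating the Zigzag sequence keeps the low-frequency coefficients, which is exactly why such a code cannot preserve the sharp edges of an SII without a long latent vector.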
4 Conclusion
An autoencoder framework, the Residual-Recursion Autoencoder (RRAE), has been proposed to boost the performance of any autoencoder that encodes a target image into a latent code and reconstructs the image from that code. RRAE endows autoencoders with the ability to learn, capture and highlight the hard patterns of the target image. When wrapped by RRAE, an autoencoder tries to reconstruct the target image several times. After each trial, the residual between the reconstructed image and the target image is filled into the reserved channel of the input tensor. Recursively, the input tensor fills up with residual images in which hard patterns may be repeated several times. With the fully filled input tensor, the autoencoder can reconstruct the target image accurately from a low-dimensional latent code. The significant improvements over the baseline autoencoders verify the performance of RRAE.
The target image should contain many hard patterns, for example, shape illustration images that consist of binary or gray shapes with sharp edges and large blank areas; otherwise, RRAE will not bring any significant improvement. This conclusion is supported by the comparative experiments on MNIST and its high-resolution version: RRAE yielded a much more significant improvement on the high-resolution MNIST, which contains many more hard patterns, than on the original MNIST.
Supposing the computational cost of an autoencoder is C, the cost becomes T x C after wrapping with RRAE. According to the experimental results, a practical upper limit for T is 3. In our experiments, RRAE at T = 3 increased the computation cost by 2 times and decreased the reconstruction error by up to 86.47%.
References
Bojanowski et al. (2017). Optimizing the latent space of generative networks. arXiv preprint arXiv:1707.05776, pp. 1-10.
Chen et al. (2016). Variational lossy autoencoder. arXiv preprint arXiv:1611.02731, pp. 1-17.
Creswell and Bharath (2018). Denoising adversarial autoencoders. IEEE Transactions on Neural Networks and Learning Systems 30 (4), pp. 968-984.
Dong et al. (2018). A review of the autoencoder and its variants: a comparative perspective from target recognition in synthetic-aperture radar images. IEEE Geoscience and Remote Sensing Magazine 6 (3), pp. 44-68.
Ioffe and Szegedy (2015). Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, pp. 1-11.
Kiasari et al. (2018). Coupled generative adversarial stacked autoencoder: CoGASA. Neural Networks 100, pp. 1-9.
Kingma and Ba (2014). Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980, pp. 1-15.
Kingma and Welling (2013). Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, pp. 1-14.
LeCun et al. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278-2324.
Mentzer et al. (2018). Conditional probability models for deep image compression. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4394-4402.
Nalisnick and Smyth (2016). Stick-breaking variational autoencoders. arXiv preprint arXiv:1605.06197, pp. 1-12.
Paszke et al. (2019). PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, pp. 8024-8035.
Qi et al. (2014). Robust feature learning by stacked autoencoder with maximum correntropy criterion. In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6716-6720.
Sohn et al. (2015). Learning structured output representation using deep conditional generative models. In Advances in Neural Information Processing Systems, pp. 3483-3491.
Sønderby et al. (2016). Ladder variational autoencoders. In Advances in Neural Information Processing Systems, pp. 3738-3746.
Wang et al. (2012). A folded neural network autoencoder for dimensionality reduction. Procedia Computer Science 13, pp. 120-127.
Wang et al. (2016). Auto-encoder based dimensionality reduction. Neurocomputing 184, pp. 232-242.
Wang et al. (2003). Multiscale structural similarity for image quality assessment. In The Thirty-Seventh Asilomar Conference on Signals, Systems & Computers, Vol. 2, pp. 1398-1402.
Wu and He (2018). Group normalization. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 3-19.
Xu et al. (2019). Stacked Wasserstein autoencoder. Neurocomputing 363, pp. 195-204.
Zhao and Li (2018). Unsupervised representation learning with Laplacian pyramid autoencoders. arXiv preprint arXiv:1801.05278, pp. 1-6.
Zhou et al. (2018). Innovative Savonius rotors evolved by genetic algorithm based on 2D-DCT encoding. Soft Computing 22 (23), pp. 8001-8010.