1 Introduction
In many real-world applications, the raw data measurements (e.g., audio/speech, images, video, biological signals) often have very high dimensionality. Adequately handling high dimensionality often requires the application of dimensionality reduction techniques Maaten:2009 that transform the original data into meaningful feature representations of reduced dimensionality. Such feature representations should reduce the dimensionality to the minimum number required to capture the salient properties of the data. Dimensionality reduction is vital in many machine learning applications, since one needs to mitigate the so-called “curse of dimensionality” Jimnez:1997. In the past few decades, latent representation learning based on autoencoders (AEs) Hinton:2006; Schloz:2008; Kramer:1991; DeMers:1993; Ng:2011; Vincent:2010; Doersch:2016; Sonderby:2016 has been widely used for dimensionality reduction, since this nonlinear technique has shown superior real-world performance compared to classical linear counterparts, such as principal component analysis (PCA).
One of the challenges in dimensionality reduction is to determine the optimal latent dimensionality that can sufficiently capture the data features required for particular applications. Although some regularization techniques, such as sparse AE (SAE) Ng:2011 and rate-distortion AE Giraldo:2013, may be useful to self-adjust the effective dimensionality, there are no existing methods that provide a rateless property MacKay:2005 that allows for seamless adjustment of the latent dimensionality depending on varying distortion requirements for different downstream applications, without modification of the trained AE model. However, realizing a rateless AE is not straightforward, since traditional AEs typically learn nonlinear manifolds where the latent variables are equally important, unlike the linear manifold models used for PCA.
In this paper, we introduce a novel AE framework which can universally achieve flexible dimensionality reduction while maintaining high performance. Motivated by the fact that the traditional PCA is readily adaptable to any dimension by just appending or dropping sorted principal components, we propose a stochastic bottleneck architecture to associate upper latent variables with higher-principal nonlinear features so that the user can freely discard the least-principal latent variables if desired. Our contributions are summarized below:

We introduce a new concept of rateless AEs designed for flexible dimensionality reduction.

A stochastic bottleneck framework is proposed to prioritize the latent space nonuniformly.

An extended regularization technique called TailDrop is considered to realize rateless AEs.

We discuss dropout distribution optimization under the principle of multi-objective learning.

We demonstrate that the proposed AEs achieve excellent distortion performance over a variable range of dimensionality on the standard MNIST and CIFAR-10 image datasets.

We evaluate AE models trained for a perceptual distortion measure based on structural similarity (SSIM) Wang:2004 as well as the traditional mean-square error (MSE) metric.
2 Rateless autoencoder (RLAE)
2.1 Dimensionality reduction
Due to the curse of dimensionality, representation learning to reduce the dimensionality is often of great importance to handle high-dimensional datasets in machine learning. To date, many algorithms have been proposed for dimensionality reduction Maaten:2009, e.g., PCA, kernel PCA, Isomap, maximum variance unfolding, diffusion maps, locally linear embedding, Laplacian eigenmaps, local tangent space analysis, Sammon mapping, locally linear coordination and manifold charting along with AE. Among these, AE Hinton:2006; Schloz:2008; Kramer:1991; DeMers:1993; Ng:2011; Vincent:2010; Doersch:2016; Sonderby:2016 has shown its high potential to learn lower-dimensional latent variables required in the nonlinear manifold underlying the datasets. AE is an artificial neural network having a bottleneck architecture as illustrated in Fig. 1(a), where dimensional data is transformed to dimensional latent representation (for ) via an encoder network. The latent variables should contain sufficient features capable of reconstructing the original data through a decoder network.
From the original data , the corresponding latent representation , with a reduced dimensionality is generated by the encoder network as , where denotes the encoder network parameters. The latent variables should adequately capture the statistical geometry of the data manifold, such that the decoder network can reconstruct the data as , where denotes the decoder network parameters and . The encoder and decoder pair are jointly trained to minimize the reconstruction loss (i.e., distortion), as given by:
(1)
where the loss function is chosen to quantify the distortion (e.g., MSE) between and .
2.2 Motivation: rateless property
By analogy, AEs are also known as nonlinear PCA (NLPCA) Schloz:2008; Kramer:1991; DeMers:1993. If we consider a simplified case where there is no nonlinear activation in the AE model, then the encoder and decoder functions will reduce to simple affine transformations. Specifically, the encoder becomes where the trainable parameters are the linear weight and the bias . Likewise, the decoder becomes with . If the distortion measure is MSE, then the optimal linear AE coincides with the classical PCA when the data follows the multivariate Gaussian distribution according to the Karhunen–Loève theorem.
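For concreteness, the AE objective of Section 2.1 and the affine special case described above can be written as follows. This is only a sketch: the symbols (x, z, x', f_θ, g_φ, W, b, W', b', N, M) are notation assumed here for illustration and may differ from the notation used elsewhere in the paper.

```latex
% Assumed notation: data x \in \mathbb{R}^N, latent z \in \mathbb{R}^M with M \le N.
% General AE: encoder f_\theta, decoder g_\phi, distortion measure \mathcal{L}.
z = f_\theta(x), \qquad x' = g_\phi(z), \qquad
\min_{\theta,\phi}\; \mathbb{E}_{x}\bigl[\,\mathcal{L}(x, x')\,\bigr]

% Affine special case (no nonlinear activation):
z = W x + b, \quad \theta = \{W, b\}, \qquad
x' = W' z + b', \quad \phi = \{W', b'\}
```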
To illustrate, assume Gaussian data with mean and covariance , which has the eigendecomposition: , where is the unitary eigenvector matrix and is the diagonal matrix of ordered eigenvalues . For PCA, the encoder uses principal eigenvectors to project the data onto an dimensional latent subspace with and , where denotes the incomplete identity matrix with diagonal elements equal to one and zero elsewhere. The decoder uses the transposed projection with and . The MSE distortion is given by
(2)
Since the eigenvalues are sorted, the distortion gracefully degrades as principal components are removed in the corresponding order. Of course, the MSE would be considerably worse if an improper ordering (e.g., reversed) is used.
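The graceful degradation can be checked numerically. The following NumPy sketch uses synthetic Gaussian data, and all names and sizes are illustrative assumptions. It fits PCA once, nullifies tail variables, and compares the resulting squared error per sample (summed over coordinates) with the sum of the discarded eigenvalues, for both the sorted and the reversed ordering.

```python
import numpy as np

def pca_fit(x):
    """Return the data mean, eigenvectors (as columns), and eigenvalues in descending order."""
    mu = x.mean(axis=0)
    cov = np.cov(x - mu, rowvar=False, bias=True)  # population covariance (divide by n)
    eigvals, eigvecs = np.linalg.eigh(cov)         # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]
    return mu, eigvecs[:, order], eigvals[order]

def truncated_mse(x, mu, basis, keep):
    """Mean squared error per sample when only the first `keep` latent variables survive."""
    z = (x - mu) @ basis        # full latent representation
    z[:, keep:] = 0.0           # nullify the tail variables
    x_hat = z @ basis.T + mu    # decoder: transposed projection
    return np.mean(np.sum((x - x_hat) ** 2, axis=1))

# Synthetic Gaussian data with a known, decaying spectrum (illustrative values).
rng = np.random.default_rng(0)
n, dim = 2000, 8
scales = np.array([5.0, 4.0, 3.0, 2.0, 1.0, 0.5, 0.2, 0.1])
mixing = np.linalg.qr(rng.normal(size=(dim, dim)))[0]  # random orthogonal matrix
x = (rng.normal(size=(n, dim)) * scales) @ mixing.T + 1.0

mu, basis, eigvals = pca_fit(x)
for keep in range(dim + 1):
    sorted_mse = truncated_mse(x, mu, basis, keep)
    reversed_mse = truncated_mse(x, mu, basis[:, ::-1], keep)  # improper ordering
    tail_sum = eigvals[keep:].sum()
    print(keep, round(sorted_mse, 4), round(tail_sum, 4), round(reversed_mse, 4))
```

With the sorted ordering, the error matches the sum of the discarded eigenvalues up to floating-point precision, whereas the reversed ordering is never better at any truncation length.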
One of the benefits of classical PCA is its graceful rateless property due to the ordering of principal components. Similar to rateless channel coding such as fountain codes MacKay:2005, PCA does not require a predetermined compression ratio for dimensionality reduction (instead it can be calculated with ), and the latent dimensionality can be later freely adjusted depending on the downstream application. More specifically, the PCA encoder and decoder learned for a dimensionality of can be universally used for any lower-dimensional PCA of latent size without any modification of the PCA model, simply by dropping the least-principal components () in , i.e., nullifying the tail variables as for all .
The rateless property is greatly beneficial in practical applications since the optimal latent dimensionality is often not known beforehand. Instead of training multiple encoder and decoder pairs for different compression rates, one common PCA model can cover all rates for by simply dropping trailing components, while still attaining good performance as given by . For example, a medical institute could release a massively high-dimensional magnetic resonance imaging (MRI) dataset alongside a trained PCA model with a reduced dimensionality of targeted for a specific diagnostic application. However, for various other applications (e.g., different analysis or diagnostic contexts), an even further reduced dimensionality may suffice and/or improve learning performance for the ultimate task. Even for end-users that require fewer latent variables in various applications, the excellent rate-distortion tradeoff (under Gaussian data assumptions) is still achieved, without updating the PCA model, by simply discarding the least-principal components.
Nevertheless, traditional PCA often underperforms in comparison to nonlinear dimensionality reduction techniques on real-world datasets. Exploiting nonlinear activation functions such as rectified linear unit (ReLU), AEs can better learn inherent nonlinearities of the latent representations underlying the data. However, existing AEs do not readily achieve the rateless property, because the latent variables are generally learned to be equally important. Hence, multiple AEs would need to be trained and deployed for different target dimensionalities. This drawback still holds for the progressive dimensionality reduction approaches employed by stacked AEs Hinton:2006 and hierarchical AEs Schloz:2008, which require multiple rounds of training and retuning for different dimensionalities. In this paper, we propose a simple and effective technique of employing a stochastic bottleneck to realize rateless AEs that are adaptable to any compression rate.
2.3 StochasticWidth bottleneck
Several variants of AE have been proposed, e.g., sparse AE (SAE) Ng:2011, variational AE (VAE) Vincent:2010; Doersch:2016; Sonderby:2016, rate-distortion AE Giraldo:2013, and compressive AE Theis:2017. We introduce a new AE family which has no fixed bottleneck architecture to realize the rateless property for seamless dimensionality reduction. Our method can be viewed as an extended version of SAE, similar in its overcomplete architecture, but also employing a varying dropout distribution across the width of the network. This aspect of our approach is key for achieving good reconstruction performance while allowing a flexibly varying compression rate for the dimensionality reduction.
Unlike a conventional AE with a deterministic bottleneck architecture, as shown in Fig. 1(a), the SAE employs a probabilistic bottleneck with an effective dimensionality that is stochastically reduced by dropout, as depicted in Fig. 1(b). For example, the SAE encoder generates dimensional variables which are randomly dropped out at a probability of , resulting in an effective latent dimensionality of . Although the SAE has better adaptability than deterministic AE to further dimensionality reduction by dropping latent variables, the latent variables are still trained to be equally important for reconstruction of the data, and thus it is limited in achieving flexible ratelessness.
Our AE employs a stochastic bottleneck that imposes a specific dropout rate distribution that varies across both the width and depth of the network, as shown in Fig. 1(c). In particular, our StochasticWidth technique employs a monotonically increasing dropout rate from the head (upper) latent variable nodes to the tail (lower) nodes in order to encourage the latent variables to be ordered by importance, in a manner analogous to PCA. By concentrating more important features in the head nodes, we hope to enable adequate data reconstruction even when some of the least important dimensions (analogous to least-principal components) are later discarded.
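A minimal sketch of such a width-wise dropout mask is given below. The linear schedule from 0 at the head to 0.9 at the tail, the latent size, and the function name are hypothetical choices made only for illustration; the mask is applied without any rescaling of the surviving activations.

```python
import numpy as np

def stochastic_width_mask(keep_prob, rng):
    """Sample an independent Bernoulli keep/drop mask, one entry per latent node."""
    return (rng.random(keep_prob.shape) < keep_prob).astype(float)

latent_dim = 16
# Hypothetical schedule: dropout rate grows linearly from 0 (head) to 0.9 (tail).
drop_rate = np.linspace(0.0, 0.9, latent_dim)
keep_prob = 1.0 - drop_rate

rng = np.random.default_rng(0)
masks = np.stack([stochastic_width_mask(keep_prob, rng) for _ in range(20000)])
empirical_keep = masks.mean(axis=0)   # should approach keep_prob

z = rng.normal(size=latent_dim)       # stand-in for an encoder output
z_dropped = z * masks[0]              # what the decoder would see in one training step
print(np.round(empirical_keep, 2))
```

Head nodes survive almost every training step while tail nodes are frequently zeroed, so the decoder cannot rely on the tail for essential features.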
This nonuniform dropout rate may also offer another benefit for gradient optimization. For existing AEs, the distortion is invariant against node permutations with permuted weights and bias in neural networks, which implies that there are a large number of global solutions minimizing the loss function. A plurality of solutions may distract the stochastic gradient, while nonuniform dropout rates can give a particular priority at every node that prevents permutation ambiguity.
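The permutation ambiguity can be illustrated with a toy one-hidden-layer AE. The network sizes, the random weights, and the fixed keep mask below are assumptions of this sketch. Reordering the latent nodes consistently leaves the output unchanged, whereas the same reordering under a nonuniform (here deterministic, tail-dropping) mask generally changes it.

```python
import numpy as np

def relu(a):
    return np.maximum(a, 0.0)

def autoencoder(x, w_enc, b_enc, w_dec, b_dec, keep=None):
    """One-hidden-layer AE: ReLU encoder, optional latent mask, affine decoder."""
    z = relu(x @ w_enc + b_enc)
    if keep is not None:
        z = z * keep
    return z @ w_dec + b_dec

rng = np.random.default_rng(0)
data_dim, latent_dim = 10, 4
w_enc = rng.normal(size=(data_dim, latent_dim))
b_enc = rng.normal(size=latent_dim)
w_dec = rng.normal(size=(latent_dim, data_dim))
b_dec = rng.normal(size=data_dim)
x = rng.normal(size=(5, data_dim))

# Reverse the order of the latent nodes: permute encoder columns/bias and decoder rows.
perm = np.array([3, 2, 1, 0])
permuted = (w_enc[:, perm], b_enc[perm], w_dec[perm, :], b_dec)

# Without a mask, the permuted network computes the same function.
plain = autoencoder(x, w_enc, b_enc, w_dec, b_dec)
plain_perm = autoencoder(x, *permuted)

# With the two tail nodes dropped, node position matters.
keep = np.array([1.0, 1.0, 0.0, 0.0])
masked = autoencoder(x, w_enc, b_enc, w_dec, b_dec, keep=keep)
masked_perm = autoencoder(x, *permuted, keep=keep)

print(np.allclose(plain, plain_perm), np.allclose(masked, masked_perm))
```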
2.4 TailDrop regularization
Dropout Hinton:2012; Srivastava:2014 has been widely used to regularize over-parameterized deep neural networks. The role of dropout is to improve generalization performance by preventing activations from becoming strongly correlated, since such correlation leads to overtraining. In the standard dropout implementation, network activations are discarded (by zeroing the activation for that neuron node) during training (and testing for some cases) with independent probability . A recent theory Gal:2016 provides a viable interpretation of dropout as a Bayesian inference approximation.
There are many related regularization methods proposed in the literature, e.g., DropConnect Wan:2013, BlockDrop Wu:2018, StochasticDepth Huang:2016, DropPath Larsson:2016, ShakeDrop Yamada:2018, SpatialDrop Tompson:2015, ZoneOut Krueger:2016, Shake-Shake regularization Gastaldi:2017, and data-driven drop Huang:2017. In order to facilitate the rateless property for stochastic bottleneck AE architectures, we introduce an additional regularization mechanism referred to as TailDrop, as one realization of StochasticWidth.
The stochastic bottleneck uses nonuniform dropout to adjust the importance of each neuron as explained in Fig. 1(c). This regularization technique is related to StochasticDepth Huang:2016 used in deep residual networks. As illustrated in Fig. 2(a), StochasticDepth drops out entire layers with a higher probability for deeper layers, so that the effective network depth is constrained and shallower layers are dominantly trained. Analogously, nonuniform dropouts are carried out across the width direction for StochasticWidth as shown in Fig. 2(b), where independent dropouts at increasing rates are used for each neuron. The monotonically increasing dropout rates can also be realized by dropping consecutive nodes at the tail as shown in Fig. 2(c), which we call TailDrop. For TailDrop, the desired dropout rates can be achieved by adjusting the probability distribution of the tail drop length as depicted in Fig. 2(d). Considering the scenario in which the user would later discard the least-principal latent variables to adjust the dimensionality, we focus on the use of this TailDrop regularization for rateless AE in this paper.
2.5 Multi-objective learning
Finding an appropriate dropout probability distribution is a key consideration in the design of high-performance rateless AEs. We now offer some insights on how to do so; however, a rigorous theoretical development remains an open problem for future study. The objective function in (1) should be reformulated to realize the rateless property. Our ultimate goal is to find AE model parameters and that simultaneously minimize distortion across multiple rates. Specifically, this problem is an -ary multi-objective optimization as follows:
(3) 
where denotes the expected distortion for the candidate AE model parameterized by and , given that the dimensional latent variables are further reduced to dimensional variables by dropping the last variables. In this multi-objective problem, optimizing an AE to minimize one component of the loss objective, i.e., for a particular dimensionality , generally does not yield the optimal model for other dimensionalities . Hence, a rateless AE model must account for the best balance across multiple dimensionalities in order to approach the Pareto-front solutions.
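One way to write this multi-objective problem explicitly is sketched below. The notation is assumed for illustration only: f_θ and g_φ are the encoder and decoder, M is the full latent dimensionality, D is the number of surviving latent variables, and I_{M,D} is the incomplete identity matrix whose first D diagonal elements are one and all other elements are zero.

```latex
% Vector-valued objective over all survivor dimensionalities D = 1, ..., M
\min_{\theta,\phi}\;
\bigl[\, \mathcal{L}_1(\theta,\phi),\; \mathcal{L}_2(\theta,\phi),\; \ldots,\; \mathcal{L}_M(\theta,\phi) \,\bigr]

% Expected distortion when only the first D latent variables survive
\mathcal{L}_D(\theta,\phi) =
\mathbb{E}_{x}\Bigl[\, \mathcal{L}\bigl(x,\; g_\phi\bigl(I_{M,D}\, f_\theta(x)\bigr)\bigr) \Bigr]
```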
One commonly used naïve method in multi-objective optimization is a weighted sum optimization to reduce the problem to a single objective function as follows:
(4) 
with some weights . One may choose the weights to scale the distortion to a similar amplitude as for positive distortions where denotes the ground solution. As the expected distortion may depend on the eigenvalues as shown in (2), understanding the nonlinear eigenspectrum can facilitate optimizing the weight distributions. The stochastic TailDrop regularization during the training phase can be interpreted as a weight since the conventional single-objective optimization in (1) will effectively become the weighted sum optimization in (4). Accordingly, the weights will be the survivor length probability, i.e., the TailDrop distribution is .
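The equivalence between TailDrop training and the weighted sum can be checked with a small Monte Carlo sketch. Everything below is an illustrative assumption: the power-law form Pr(D <= d) = (d / M) ** beta for the survivor length D, the value beta = 1, and the made-up per-dimensionality distortions. The sketch samples TailDrop masks, averages the distortion at the sampled survivor lengths, and compares the result with the probability-weighted sum.

```python
import numpy as np

def survivor_length_pmf(latent_dim, beta):
    """Hypothetical power-law survivor-length law: Pr(D <= d) = (d / M) ** beta."""
    d = np.arange(latent_dim + 1)
    cdf = (d / latent_dim) ** beta
    return np.diff(cdf)                      # pmf over d = 1, ..., M

def tail_drop_mask(latent_dim, pmf, rng):
    """Keep the first D latent variables and zero the rest, with D drawn from `pmf`."""
    d = rng.choice(np.arange(1, latent_dim + 1), p=pmf)
    return (np.arange(latent_dim) < d).astype(float)

latent_dim, beta = 8, 1.0
pmf = survivor_length_pmf(latent_dim, beta)

# Stand-in distortions L_d for d = 1..M (tail sums of a made-up spectrum).
spectrum = np.array([5.0, 4.0, 3.0, 2.0, 1.0, 0.5, 0.2, 0.1])
distortion = np.array([spectrum[d:].sum() for d in range(1, latent_dim + 1)])

# Weighted-sum objective with weights equal to the survivor-length probabilities.
weighted_sum = float(pmf @ distortion)

# Monte Carlo average over randomly sampled TailDrop masks.
rng = np.random.default_rng(0)
masks = np.stack([tail_drop_mask(latent_dim, pmf, rng) for _ in range(20000)])
survivors = masks.sum(axis=1).astype(int)    # sampled survivor lengths D
monte_carlo = float(distortion[survivors - 1].mean())

# Per-node keep probability Pr(D >= i): non-increasing from head to tail.
keep_prob = 1.0 - np.cumsum(pmf) + pmf

print(round(weighted_sum, 3), round(monte_carlo, 3))
print(np.round(keep_prob, 3))
```

The per-node keep probability decreases from head to tail, so dropping consecutive tail nodes induces the monotonically increasing dropout rates of StochasticWidth without sampling each node independently.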
Besides the weighted sum approach, there are several improved methods in multi-objective optimization such as the weighted metric method. We leave such an optimization framework for future work. In this paper, we consider parametric eigenspectrum assumptions for simplicity. Under a model-based approach of nonlinear eigenspectrum assumptions, we evaluated several parametric distributions for TailDrop probability, e.g., Poisson, Laplacian, exponential, sigmoid, Lorentzian, polynomial, and Wigner distributions, some of which are depicted in Fig. 2(d). Through a preliminary experiment, it was found that the power cumulative distribution function for an order of ( denotes a compression rate) performed well for most cases. Accordingly, we focus on the use of the power distribution for TailDrop in the experiments below.
3 Experiments
To demonstrate the proof-of-concept benefits of our rateless AEs, we use the standard image datasets MNIST and CIFAR-10 Krizhevsky:2009. MNIST contains handwritten 10-class grayscale images of size 28 by 28, and thus the raw data dimensionality is 784. The dataset has 60,000 and 10,000 images for training and testing, respectively. CIFAR-10 is a dataset of 32 by 32 color images, representing 10 classes of natural scene objects. The raw data dimensionality is thus 3,072. The training set and test set contain 50,000 and 10,000 images, respectively.
The AE models were implemented using the Chainer framework Tokui:2015. For simplicity, we use fully-connected three-layer perceptrons with ReLU activation functions for both encoder and decoder networks. Note that the concept of StochasticWidth regularization to realize ordered-principal features can be applied to recurrent and convolutional networks in a straightforward manner. The number of nodes in the hidden layers is for MNIST and for CIFAR-10. For conventional SAE, we used sparsity as a baseline to evaluate the robustness of flexible latent dimensionality. Model training was performed using the adaptive moment estimation (Adam) stochastic gradient descent method Kingma:2014 with a learning rate of , and a minibatch size of . The maximum number of epochs is while early stopping with a patience of was applied.
3.1 MSE measure
Figs. 3(a) and (b) show the MSE performance of the conventional SAE and proposed RLAE for MNIST and CIFAR-10 datasets, respectively. For conventional SAE, multiple AE models are trained at the intended latent dimensionality of for . The rateless AE is optimized at the dimensionality of using TailDrop with a power distribution (with for MNIST, and for CIFAR-10). The parameter was chosen from a finite set between and to achieve a good rate-distortion tradeoff. The latent dimensionality used for image reconstruction is varied during testing evaluation by deterministically dropping tail variables.
As shown in Fig. 3(a), the conventional AE does not adapt well to variable dimensionality, with the MSE performance drastically degrading when the testing dimensionality is reduced from the intended dimensionality . For the SAE model trained for , dropping of the latent variables to yield a reduced dimensionality of , the MSE degrades to dB from the dB obtained at , which is significantly worse than an SAE model trained for that obtains an MSE of dB. This shows that the existing SAEs cannot be universally reused for flexible dimensionality reduction, and hence adaptive switching between multiple trained SAE models would be required depending on the desired dimensionality. However, our proposed RLAE, which is trained once for dimensionality , flexibly operates over the wide range of further reduced dimensionalities , while achieving low MSE distortion close to the ideal MSEs obtained by SAE models trained for the specific dimensionality .
Similar observations can be made in the results for the CIFAR-10 dataset, as shown in Fig. 3(b). It confirms that high performance can be achieved by a single AE model for different compression rates by using the stochastic bottleneck regularization. This benefit comes from nonuniform dropout rates across neurons to concentrate the most-principal features in upper nodes. Conventional uniform-rate dropout, as used in existing SAEs, still requires the target dimensionality to be known during training.
It should be noted that the linear PCA dimensionality reduction performs surprisingly well, competitive with the proposed nonlinear AE for the CIFAR-10 dataset in Fig. 3(b). Because MNIST images are nearly binary bitmaps whose statistics are far from the Gaussian distribution, PCA did not work well as shown in Fig. 3(a). However, most natural images such as those in CIFAR-10 are often well modeled by the Gauss–Markov process. This may be the primary reason why PCA works sufficiently well, in particular for the MSE metric. Although it was unexpected that the nonlinear AEs could not improve the MSE performance over the linear PCA for the CIFAR-10 dataset, the MSE curve of our AE perfectly agreed with that of PCA for , which implies that our stochastic bottleneck approach could learn the ordered-principal components as intended.
3.2 SSIM measure
Here, we verify that the advantage of our rateless AEs extends beyond the MSE distortion criterion. Since the classical MSE metric is known to be inconsistent with perceptual image quality, the structural similarity (SSIM) index Wang:2004 has been recently used as an alternative measure of perceptual distortion. The SSIM index ranges from to , indicating perceptual similarity between the original and distorted images, from the worst to best quality, respectively. We use a negative SSIM index as a new loss function to fine-tune the AE models, which were pre-trained for the MSE metric, so as to improve the perceptual image quality.
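A minimal sketch of a negative-SSIM loss is shown below, assuming images scaled to [0, 1]. For brevity it evaluates the SSIM formula of Wang:2004 once over global image statistics rather than averaging over local sliding windows, so it is a simplification and not necessarily the implementation used in these experiments. The function names are hypothetical, and the constants 0.01 and 0.03 are the default values suggested in Wang:2004.

```python
import numpy as np

def global_ssim(x, y, data_range=1.0):
    """Single-window SSIM from global image statistics (no sliding window)."""
    c1 = (0.01 * data_range) ** 2     # stabilizer for the luminance term
    c2 = (0.03 * data_range) ** 2     # stabilizer for the contrast/structure term
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = np.mean((x - mu_x) * (y - mu_y))
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator

def negative_ssim_loss(x, x_hat):
    """Loss to minimize: lower values mean higher structural similarity."""
    return -global_ssim(x, x_hat)

rng = np.random.default_rng(0)
image = rng.random((28, 28))                                      # stand-in for an original image
noisy = np.clip(image + 0.1 * rng.normal(size=image.shape), 0.0, 1.0)

print(negative_ssim_loss(image, image))   # identical images give the minimum loss
print(negative_ssim_loss(image, noisy))   # distortion raises the loss
```

In an actual training loop the same expression would be written with differentiable tensor operations so that gradients can flow to the encoder and decoder parameters.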
Figs. 4(a) and (b) plot the negative SSIM index of the images reconstructed by the conventional SAE and proposed RLAE for MNIST and CIFAR-10 datasets, respectively. These figures confirm that the conventional SAE cannot be universally used for flexibly varying dimensionality under the SSIM distortion metric. Although the proposed RLAE may perform worse than the conventional SAEs at some dimensionalities, for which the SAE models were specifically optimized, our RLAE flexibly achieves SSIM performance closely comparable to the best SSIMs obtained by the ensemble of SAEs over the wide range of dimensionalities .
We can also see that the traditional PCA has a higher loss in the perceptual SSIM metric compared to the MSE metric. In particular for MNIST in Fig. 4(a), the SSIM degradation of the PCA over our RLAE is noticeable over the whole range of dimensionalities, while the PCA worked well for lower dimensionality for the MSE metric, as seen in Fig. 3(a). More importantly, our AE can offer a perceptual performance benefit in the SSIM metric over PCA even for the CIFAR-10 dataset, for which the AEs could not outperform the PCA in the MSE metric as discussed in Fig. 3(b). This makes sense because the PCA does not consider any perceptual quality but only the signal energy relevant for the MSE measure.
3.3 Reconstructed images
Figs. 5(a) and (b) show visual samples randomly chosen from the MNIST test dataset, respectively for SAE and RLAE reconstructions. The top row displays the original MNIST images, and the subsequent rows are reconstructed images for a reduced dimensionality of . Both types of models are trained at a latent dimensionality of under the MSE measure. Our proposed RLAE clearly exhibits improved visual quality for flexible dimensionality reduction versus the conventional SAE, without requiring retraining for each reduced dimensionality. Similar results can be seen for CIFAR-10 in Figs. 6(a) and (b).
Tables 1 and 2 show the corresponding averaged MSE and SSIM index performance at for MNIST and CIFAR-10, respectively. Here, we also present the label classification accuracy when a classical support vector machine (SVM) is applied to the reduced-dimension latent variables. Besides the higher image quality, we also observe higher classification accuracy achieved by the proposed rateless AE across the variable dimensionality.
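Table 1: Averaged MSE (dB), SSIM index, and SVM accuracy versus dimensionality for the conventional AE (Conv. AE) and proposed AE (Prop. AE) on MNIST.
Table 2: Averaged MSE (dB), SSIM index, and SVM accuracy versus dimensionality for the conventional AE (Conv. AE) and proposed AE (Prop. AE) on CIFAR-10.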
3.4 Latent representation
Finally, we show the latent space geometry in Figs. 7(a) and (b), where the first two latent variables of all MNIST test images are plotted for the traditional SAE and proposed RLAE, respectively. One can see that the label-dependent distribution is more clearly observable in our RLAE than in the conventional AE, since the most-principal latent components are properly associated with the upper latent variables via the proposed stochastic bottleneck technique. This observation is consistent with the higher SVM accuracy in Table 1.
4 Conclusions
We proposed a new type of autoencoder employing a form of stochastic bottlenecking with nonuniform dropout rates for flexible dimensionality reduction. The proposed autoencoders are rateless, i.e., the compression rate in dimensionality reduction is not predetermined at the training phase and the user can freely change the dimensionality at the testing phase without severely degrading quality. To realize rateless AEs, a simple regularization method called TailDrop was introduced to impose higher priority on upper neurons for learning the most-principal nonlinear features. This paper showed proof-of-concept results based on the standard MNIST and CIFAR-10 image datasets. Universally good distortion performance was obtained with a single AE model irrespective of the flexible dimensionality reduction rate, which was obtained by simply dropping the least-principal latent dimensions. More rigorous analysis and theoretical optimization of dropout rate distributions for real-world data are left for future work. Multi-objective learning to account for various downstream applications is also an important open question to pursue.
References
 (1) Van Der Maaten, L., Postma, E. & Van den Herik, J. (2009). Dimensionality reduction: A comparative review, J Mach Learn Res. 10(6671):13.
 (2) Jimenez, L.O. & Landgrebe, D.A. (1998). Supervised classification in high-dimensional space: Geometrical, statistical, and asymptotical properties of multivariate data. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews 28(1):39–54.
 (3) Hinton, G.E. & Salakhutdinov, R.R. (2006). Reducing the dimensionality of data with neural networks. Science 313(5786):504–507.
 (4) Scholz, M., Fraunholz, M. & Selbig, J. (2008). Nonlinear principal component analysis: Neural network models and applications. In Principal manifolds for data visualization and dimension reduction (pp. 44–67). Springer, Berlin, Heidelberg.
 (5) Kramer, M.A. (1991). Nonlinear principal component analysis using autoassociative neural networks. AIChE Journal 37(2):233–243.
 (6) DeMers, D. & Cottrell, G.W. (1993). Nonlinear dimensionality reduction. Advances in Neural Information Processing Systems 5, San Mateo, CA, Morgan Kaufmann, 580–587.
 (7) Ng, A. (2011). Sparse autoencoder. CS294A Lecture notes 72:1–19.
 (8) Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y. & Manzagol, P. A. (2010). Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of Machine Learning Research 11:3371–3408.
 (9) Doersch, C. (2016). Tutorial on variational autoencoders. arXiv preprint arXiv:1606.05908.
 (10) Sønderby, C.K., Raiko, T., Maaløe, L., Sønderby, S.K. & Winther, O. (2016). Ladder variational autoencoders. In Advances in Neural Information Processing Systems (pp. 3738–3746).
 (11) Giraldo, L.G.S. & Principe, J.C. (2013). Rate-distortion autoencoders. arXiv preprint arXiv:1312.7381.
 (12) Theis, L., Shi, W., Cunningham, A. & Huszár, F. (2017). Lossy image compression with compressive autoencoders. arXiv preprint arXiv:1703.00395.
 (13) Hinton, G.E., Srivastava, N., Krizhevsky, A., Sutskever, I. & Salakhutdinov, R.R. (2012). Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580.
 (14) Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I. & Salakhutdinov, R. (2014). Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 15(1):1929–1958.
 (15) Wan, L., Zeiler, M., Zhang, S., Le Cun, Y. & Fergus, R. (2013). Regularization of neural networks using dropconnect. In International Conference on Machine Learning (pp. 1058–1066).
 (16) Wu, Z., Nagarajan, T., Kumar, A., Rennie, S., Davis, L.S., Grauman, K. & Feris, R. (2018). Blockdrop: Dynamic inference paths in residual networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 8817–8826).
 (17) Huang, G., Sun, Y., Liu, Z., Sedra, D. & Weinberger, K.Q. (2016). Deep networks with stochastic depth. In European Conference on Computer Vision (pp. 646–661). Springer, Cham.
 (18) Larsson, G., Maire, M. & Shakhnarovich, G. (2016). Fractalnet: Ultra-deep neural networks without residuals. arXiv preprint arXiv:1605.07648.
 (19) Yamada, Y., Iwamura, M., Akiba, T. & Kise, K. (2018). ShakeDrop Regularization for Deep Residual Learning. arXiv preprint arXiv:1802.02375.
 (20) Tompson, J., Goroshin, R., Jain, A., LeCun, Y. & Bregler, C. (2015). Efficient object localization using convolutional networks. IEEE Conference on Computer Vision and Pattern Recognition (pp. 648–656).
 (21) Krueger, D., Maharaj, T., Kramár, J., Pezeshki, M., Ballas, N., Ke, N.R. & Pal, C. (2016). Zoneout: Regularizing RNNs by randomly preserving hidden activations. arXiv preprint arXiv:1606.01305.
 (22) Gastaldi, X. (2017). Shake-shake regularization. arXiv preprint arXiv:1705.07485.
 (23) Huang, Z. & Wang, N. (2017). Data-driven sparse structure selection for deep neural networks. arXiv preprint arXiv:1707.01213.
 (24) Gal, Y. & Ghahramani, Z. (2016). Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In International Conference on Machine Learning (pp. 1050–1059).
 (25) Wang, Z., Bovik, A.C., Sheikh, H.R. & Simoncelli, E.P. (2004). Image quality assessment: From error visibility to structural similarity. IEEE Transactions on Image Processing 13(4):600–612.
 (26) MacKay, D.J. (2005). Fountain codes. IEE Proceedings-Communications 152(6):1062–1068.
 (27) Krizhevsky, A. & Hinton, G. (2009). Learning multiple layers of features from tiny images. Technical report, University of Toronto 1(4):7.
 (28) Tokui, S., Oono, K., Hido, S. & Clayton, J. (2015). Chainer: a next-generation open source framework for deep learning. In Proceedings of workshop on machine learning systems (LearningSys) in the twenty-ninth annual conference on neural information processing systems (NIPS) 5:1–6.
 (29) Kingma, D.P. & Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
 (30) Xiao, H., Rasul, K. & Vollgraf, R. (2017). Fashion-MNIST: A novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747.
 (31) Clanuwat, T., Bober-Irizar, M., Kitamoto, A., Lamb, A., Yamamoto, K. & Ha, D. (2018). Deep learning for classical Japanese literature. arXiv preprint arXiv:1812.01718.
 (32) Netzer, Y., Wang, T., Coates, A., Bissacco, A., Wu, B. & Ng, A. Y. (2011). Reading digits in natural images with unsupervised feature learning. NIPS Workshop on Deep Learning and Unsupervised Feature Learning.
5 Supplementary Experiments
We show the MSE performance of the proposed RLAE for different datasets as follows:

Fashion-MNIST (FMNIST) Xiao:2017 is a set of fashion articles represented by grayscale 28 by 28 images, associated with a label from 10 classes, consisting of a training set of 60,000 examples and a test set of 10,000 examples. FMNIST was intended to serve as a direct replacement for the MNIST dataset for benchmarking.

Kuzushiji-MNIST (KMNIST) Clanuwat:2018 is another set of handwritten Japanese characters represented by 10-class grayscale 28 by 28 images with the same data sizes as MNIST and FMNIST.

The street view house numbers (SVHN) dataset Netzer:2011 is similar to MNIST but composed of cropped 32 by 32 color images of house numbers. It contains 73,257 digits for training and 26,032 digits for testing.

CIFAR-100 Krizhevsky:2009 is a set of small natural images, just like CIFAR-10, except that it has 100 classes containing 600 images each. There are 500 training images and 100 testing images per class. The 100 classes in CIFAR-100 are grouped into 20 superclasses.
Figs. 8(a) through (d) show the MSE performance as a function of survivor latent dimensionality for FMNIST, KMNIST, SVHN, and CIFAR-100, respectively. We can confirm that the proposed AE achieves graceful performance over the wide range of dimensionality, competitive with the best performance which the conventional AEs can offer at a predetermined dimensionality. Although the linear PCA also achieves rateless performance, a significant MSE loss is seen for the grayscale datasets FMNIST and KMNIST, similar to MNIST in Fig. 3(a). However, for color datasets, PCA performed well, just as for CIFAR-10 in Fig. 3(b). Nonetheless, our AE achieves nearly the best performance, outperforming the conventional AE. In addition, our AE may achieve better perceptual quality and classification accuracy as discussed for CIFAR-10. The experimental results verified that a simple mechanism with nonuniform dropout regularization can enable a reasonable rateless property.
Figs. 9, 10, 11, and 12 show visual snapshots of randomly chosen images reconstructed by the conventional AE and proposed AE for FMNIST, KMNIST, SVHN, and CIFAR-100, respectively. One can observe a clear advantage of the RLAE over the SAE in maintaining higher quality across variable dimensionality.